@commoncrawl

Common Crawl Foundation

Common Crawl provides an archive of webpages going back to 2007.

Pinned

  1. cc-pyspark Public

    Process Common Crawl data with Python and Spark

    Python · 453 stars · 93 forks

  2. cc-crawl-statistics Public

    Statistics of Common Crawl monthly archives mined from URL index files

    Python · 212 stars · 16 forks

  3. cc-index-table Public

    Index Common Crawl archives in tabular format

    Java · 126 stars · 15 forks

  4. cc-warc-examples Public

    Forked from Smerity/cc-warc-examples

    Common Crawl WARC/WET/WAT examples and processing code for Java + Hadoop

    Java · 38 stars · 18 forks

  5. cc-citations Public

    Scientific articles using or citing Common Crawl data

    Jupyter Notebook · 28 stars · 4 forks

  6. cc-notebooks Public

    Various Jupyter notebooks about Common Crawl data

    Jupyter Notebook · 64 stars · 11 forks

Repositories

Showing 10 of 83 repositories
  • cc-index-annotations Public

    Example code to join an annotation to a host or URL index

    Python · 1 star · 0 forks · 0 issues · 0 pull requests · Updated Mar 22, 2026
  • whirlwind-java Public

    A whirlwind tour of Common Crawl's data using Java

    Java · 3 stars · Apache-2.0 · 0 forks · 2 issues · 1 pull request · Updated Mar 21, 2026
  • cc-index-table Public

    Index Common Crawl archives in tabular format

    Java · 126 stars · Apache-2.0 · 15 forks · 6 issues · 1 pull request · Updated Mar 20, 2026
  • cc-webgraph Public

    Tools to construct and process Common Crawl webgraphs

    Java · 105 stars · Apache-2.0 · 6 forks · 7 issues (1 issue needs help) · 0 pull requests · Updated Mar 20, 2026
  • ipv6-analysis Public

    A survey of IPv6 support among the 100k top-ranked hosts on the web

    Python · 1 star · Apache-2.0 · 0 forks · 0 issues · 0 pull requests · Updated Mar 19, 2026
  • cc-crawl-statistics Public

    Statistics of Common Crawl monthly archives mined from URL index files

    Python · 212 stars · Apache-2.0 · 16 forks · 3 issues · 3 pull requests · Updated Mar 19, 2026
  • warcio-s3 Public Forked from webrecorder/warcio

    Streaming WARC/ARC library for fast web archive IO

    Python · 0 stars · Apache-2.0 · 68 forks · 0 issues · 0 pull requests · Updated Mar 19, 2026
  • cc-citations Public

    Scientific articles using or citing Common Crawl data

    Jupyter Notebook · 28 stars · 4 forks · 0 issues · 0 pull requests · Updated Mar 19, 2026
  • cc-nutch-example Public

    Apache Nutch example project to archive content in WARC files

    Shell · 3 stars · Apache-2.0 · 2 forks · 0 issues · 0 pull requests · Updated Mar 17, 2026
  • awesome-low-resource-languages Public

    A list of resources for documenting, conserving, developing, preserving, or working with endangered and low-resource languages.

    0 stars · 0 forks · 0 issues · 0 pull requests · Updated Mar 16, 2026