Introduction to Common Crawl Dave Lester March 21, 2013 Monday, April 1, 13


March 2013 slides about Common Crawl. Original presentation was to the Working with Open Data class at the UC Berkeley School of Information http://www.ischool.berkeley.edu/courses/i290t-wod

Page 1: Introduction to Common Crawl

Introduction to Common Crawl

Dave Lester, March 21, 2013

Monday, April 1, 13

Page 2: Introduction to Common Crawl

video intro: https://www.youtube.com/watch?v=ozX4GvUWDm4

Page 3: Introduction to Common Crawl

What is Common Crawl?

• non-profit org providing an open repository of web crawl data to be accessed and analyzed by anyone

• data is currently shared as a public dataset on Amazon S3

Page 4: Introduction to Common Crawl

Why Open Data?

• It’s difficult to crawl the web at scale

• Provides a shared resource for researchers to compare results and recreate experiments

Page 5: Introduction to Common Crawl

2012 Corpus Stats

• Total # of Web Documents: 3.8 billion

• Total Uncompressed Content Size: 100 TB+

• # of Domains: 61 million

• # of PDFs: 92.2 million

• # of Word Docs: 6.6 million

• # of Excel Docs: 1.3 million

Page 6: Introduction to Common Crawl

Other Data Sources

• Blekko - “spam-free search engine”

• their metadata includes:

• rank on a linear scale, and 0-10 web rank

• true/false for Blekko’s webspam algorithm thinking this domain or page is spam

• true/false for Blekko’s pr0n detection algorithm

Page 7: Introduction to Common Crawl

What is Crawled?

• Check out the new URL search tool: http://commoncrawl.org/url-search-tool/

• (try entering ischool.berkeley.edu)

• First five people to share open source code on GitHub that incorporates a JSON file from URL Search will each get $100 in AWS Credit!

Page 8: Introduction to Common Crawl

How is Data Crawled?

• Customized crawler (it’s open source!)

• Some basic page rank included. Lots of time spent optimizing this and filtering spam

• See Apache Nutch as an alternative web-scale crawler

• Future datasets may include other crawl sources

Page 9: Introduction to Common Crawl

Common Crawl Uses

Page 10: Introduction to Common Crawl

Analyze References to Facebook

• Of ~1.3 billion URLs:

• 22% of Web pages contain Facebook URLs

• 8% of Web pages implement Open Graph tags

• Among ~500 million hardcoded links to Facebook, only 3.5 million are unique

• These are primarily for simple social integrations

Page 11: Introduction to Common Crawl

References to FB Pages

• /merriamwebster 676071 (0.14%)

• /kevjumba 651389 (0.14%)

• /placeformusic 618963 (0.13%)

• /lyricskeeper 517999 (0.11%)

• /kayak 465179 (0.10%)

• /twitter 281882 (0.06%)
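A tally like the one above can be sketched in a few lines of Python. This is only an illustration, not the method used in the original analysis: the function name `facebook_page_counts` and the sample links are made up, and it assumes the page name is the first path segment of a facebook.com URL.

```python
from collections import Counter
from urllib.parse import urlparse


def facebook_page_counts(links):
    """Count links per first path segment on facebook.com hosts."""
    counts = Counter()
    for link in links:
        parsed = urlparse(link)
        host = (parsed.hostname or "").lower()
        # Accept facebook.com and its subdomains (www., m., ...) only.
        if host == "facebook.com" or host.endswith(".facebook.com"):
            segments = [s for s in parsed.path.split("/") if s]
            if segments:
                counts["/" + segments[0].lower()] += 1
    return counts


# Hypothetical sample input; the real analysis ran over crawl metadata.
sample_links = [
    "http://www.facebook.com/kayak",
    "https://facebook.com/kayak/photos",
    "http://www.facebook.com/twitter",
    "http://example.com/facebook.com/kayak",  # not a Facebook host, ignored
]

counts = facebook_page_counts(sample_links)
total = sum(counts.values())
for page, n in counts.most_common():
    print(f"{page} {n} ({n / total:.2%})")
```

At crawl scale the same counting would normally run as a MapReduce job rather than in a single process.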

Page 12: Introduction to Common Crawl

Analyze JavaScript Libraries on the Web

1. jQuery (82.64%)

2. Prototype (6.06%)

3. Mootools (4.83%)

4. Ext (3.47%)

5. YUI (1.78%)

6. Modernizr (0.59%)

7. Dojo (0.21%)

8. Ember (0.14%)

9. Underscore (0.11%)

10. Backbone (0.09%)

Page 13: Introduction to Common Crawl

Library Co-occurrence

Page 14: Introduction to Common Crawl

Web Data Commons

• sub-corpus of Common Crawl data

• includes RDFa, hCalendar, hCard, Geo Microdata, hResume, XFN

• built using 2009/2010 corpus

Page 15: Introduction to Common Crawl

Page 16: Introduction to Common Crawl

Traitor: Associating Concepts

http://www.youtube.com/watch?v=c7Y149RnQjw

Page 17: Introduction to Common Crawl

Associated Costs?

• Complete data set, ~$1300.00

• Facebook Link Analysis, $434.61

• Searchable Index of Data Set, $100

• “average per-hour cost for a High-CPU Medium Instance (c1.medium) was about $.018, just under one tenth of the on-demand rate”

Page 18: Introduction to Common Crawl

Give it a Try

Page 19: Introduction to Common Crawl

ARC Files

• Files contain the full HTTP response and payload for all pages crawled.

• Format designed by the Internet Archive

• ARC files are a series of concatenated GZIP documents
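Because an ARC file is a run of concatenated GZIP documents, each record can be decompressed on its own. The sketch below shows one way to walk the members with Python's standard `zlib` module. It does not parse ARC record headers, and the helper name `iter_gzip_members` and the sample bytes are made up for illustration.

```python
import gzip
import zlib


def iter_gzip_members(data):
    """Yield the decompressed bytes of each gzip member in `data`."""
    while data:
        # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
        yield decomp.decompress(data) + decomp.flush()
        # Whatever follows the finished member is the start of the next one.
        data = decomp.unused_data


# Two small stand-in "records", compressed separately and then concatenated.
blob = gzip.compress(b"record one\n") + gzip.compress(b"record two\n")
members = list(iter_gzip_members(blob))
print(members)
```

For real ARC files, which are large, you would read in chunks from a file object instead of holding the whole file in memory.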

Page 20: Introduction to Common Crawl

Text-Only Files

• Saved as sequence files, consisting of binary key/value pairs (used extensively in MapReduce as input/output formats)

• On average 20% of the size of the raw content

• Located in the segment directories, with a file name of "textData-nnnnn". For example:

• s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-00112
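The naming scheme can be expressed as a small helper. This is a sketch: the function name `text_data_uri` is hypothetical, and it assumes "nnnnn" is always a zero-padded five-digit file number, which is inferred from the single example above.

```python
def text_data_uri(segment_id, file_number,
                  base="s3://aws-publicdatasets/common-crawl/parse-output/segment"):
    """Build the S3 URI of one text-only file inside a segment directory."""
    # Assumption: file numbers are zero-padded to five digits.
    return f"{base}/{segment_id}/textData-{file_number:05d}"


uri = text_data_uri(1341690169105, 112)
print(uri)
```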

Page 21: Introduction to Common Crawl

Metadata Files

• For each URL, metadata files contain status information, the HTTP response code, and file names and offsets of ARC files where the raw content can be found.

• Also contain the HTML title, HTML meta tags, RSS/Atom information, and all anchors/hyperlinks from HTML documents (including all fields on the link tags).

• Records in the Metadata files are in the same order and have the same file numbers as the Text Only content

• Saved as sequence files
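The stored offsets make random access possible: seek into the ARC file and decompress only the record you need. The sketch below assumes the offset is the byte position where a record's GZIP member starts. The metadata field names are not shown in these slides, so the offset is passed in directly, and `read_record_at` is a made-up helper name.

```python
import gzip
import io
import zlib


def read_record_at(arc_file, offset, chunk_size=64 * 1024):
    """Decompress the single gzip member that starts at byte `offset`."""
    arc_file.seek(offset)
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)  # expect a gzip header
    chunks = []
    # Stop as soon as this member ends; later records are left untouched.
    while not decomp.eof:
        chunk = arc_file.read(chunk_size)
        if not chunk:
            break  # file ended before the member was complete
        chunks.append(decomp.decompress(chunk))
    return b"".join(chunks)


# Stand-in for an ARC file: two records compressed separately, then joined.
first = gzip.compress(b"first page")
second = gzip.compress(b"second page")
arc_like = io.BytesIO(first + second)

record = read_record_at(arc_like, len(first))
print(record)
```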

Page 22: Introduction to Common Crawl

Browsing Data

• You can use s3cmd on your local machine

• Install using pip, ‘pip install s3cmd’

• Configure, ‘s3cmd --configure’

• Requires AWS keys

• Demo: s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output/

Page 23: Introduction to Common Crawl

Common Crawl AMI

• Amazon Machine Image loaded with Common Crawl example programs, a development Hadoop instance, and scripts to submit jobs to Amazon Elastic MapReduce

• Amazon AMI ID: "ami-07339a6e"

Page 24: Introduction to Common Crawl

Running Example MR Jobs Using the AMI

• ccRunExample [ LocalHadoop | AmazonEMR ] [ ExampleName ] ( S3Bucket )

• bin/ccRunExample LocalHadoop ExampleMetadataDomainPageCount aws-publicdatasets/common-crawl/parse-output/segment/1341690167474/

• look at the code: nano src/java/org/commoncrawl/examples/ExampleMetadataDomainPageCount.java
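As the name suggests, ExampleMetadataDomainPageCount counts pages per domain. The real example is Java code running on Hadoop. The Python sketch below only mirrors the map and reduce steps in a single process, and it is not a translation of that code. The function names and sample URLs are made up, and the URL's hostname stands in for the domain.

```python
from collections import Counter
from urllib.parse import urlparse


def map_url(url):
    """Map step: emit (hostname, 1) for each URL that has a hostname."""
    host = urlparse(url).hostname
    if host:
        yield host.lower(), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values per key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


# Hypothetical input; the real job reads URLs from the metadata files.
urls = [
    "http://www.ischool.berkeley.edu/courses/i290t-wod",
    "http://www.ischool.berkeley.edu/",
    "http://commoncrawl.org/url-search-tool/",
    "not a url",
]

pairs = (pair for url in urls for pair in map_url(url))
domain_counts = reduce_counts(pairs)
for domain, n in domain_counts.most_common():
    print(domain, n)
```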

Page 26: Introduction to Common Crawl

Helpful Resources

• Developer Documentation:

• https://commoncrawl.atlassian.net/

• Developer Discussion List:

• https://groups.google.com/group/common-crawl

Page 27: Introduction to Common Crawl

Questions?

• @davelester

[email protected]

• www.davelester.org
