March 2013 slides about Common Crawl. Original presentation was to the Working with Open Data class at the UC Berkeley School of Information http://www.ischool.berkeley.edu/courses/i290t-wod
Introduction to Common Crawl
Dave Lester
March 21, 2013
Monday, April 1, 13
video intro: https://www.youtube.com/watch?v=ozX4GvUWDm4
What is Common Crawl?
• non-profit org providing an open repository of web crawl data to be accessed and analyzed by anyone
• data is currently shared as a public dataset on Amazon S3
Why Open Data?
• It’s difficult to crawl the web at scale
• Provides a shared resource for researchers to compare results and recreate experiments
2012 Corpus Stats
• Total # of Web Documents: 3.8 billion
• Total Uncompressed Content Size: 100 TB+
• # of Domains: 61 million
• # of PDFs: 92.2 million
• # of Word Docs: 6.6 million
• # of Excel Docs: 1.3 million
Other Data Sources
• Blekko - “spam-free search engine”
• their metadata includes:
• rank on a linear scale, and 0-10 web rank
• true/false for Blekko’s webspam algorithm thinking this domain or page is spam
• true/false for Blekko’s pr0n detection algorithm
What is Crawled?
• Check out the new URL search tool: http://commoncrawl.org/url-search-tool/
• (try entering ischool.berkeley.edu)
• First five people to share open source code on GitHub that incorporates a JSON file from URL Search will each get $100 in AWS Credit!
How is Data Crawled?
• Customized crawler (it’s open source!)
• Includes some basic page ranking; a lot of time has gone into optimizing it and filtering spam
• See Apache Nutch as alternative web-scale crawler
• Future datasets may include other crawl sources
Common Crawl Uses
Analyze References to Facebook
• Of ~1.3 billion URLs:
• 22% of Web pages contain Facebook URLs
• 8% of Web pages implement Open Graph tags
• Among ~500 million hardcoded links to Facebook, only 3.5 million are unique
• These are primarily for simple social integrations
References to FB Pages
• /merriamwebster 676071 (0.14%)
• /kevjumba 651389 (0.14%)
• /placeformusic 618963 (0.13%)
• /lyricskeeper 517999 (0.11%)
• /kayak 465179 (0.10%)
• /twitter 281882 (0.06%)
Analyze JavaScript Libraries on the Web
1. jQuery (82.64%)
2. Prototype (6.06%)
3. MooTools (4.83%)
4. Ext (3.47%)
5. YUI (1.78%)
6. Modernizr (0.59%)
7. Dojo (0.21%)
8. Ember (0.14%)
9. Underscore (0.11%)
10. Backbone (0.09%)
Library Co-occurrence
Web Data Commons
• sub-corpus of Common Crawl data
• includes RDFa, hCalendar, hCard, Geo Microdata, hResume, XFN
• built using 2009/2010 corpus
Traitor: Associating Concepts
http://www.youtube.com/watch?v=c7Y149RnQjw
Associated Costs?
• Complete data set, ~$1300.00
• Facebook Link Analysis, $434.61
• Searchable Index of Data Set, $100
• “average per-hour cost for a High-CPU Medium Instance (c1.medium) was about $.018, just under one tenth of the on-demand rate”
Give it a Try
ARC Files
• Files contain the full HTTP response and payload for all pages crawled.
• Format designed by the Internet Archive
• ARC files are a series of concatenated GZIP documents
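Because each record is its own GZIP member, a reader can walk an ARC file one member at a time. The sketch below shows the idea with Python's standard zlib module on a small in-memory stand-in. The function name iter_gzip_members and the sample records are made up for illustration, and the header line that real ARC records carry is not parsed here.

```python
import gzip
import zlib


def iter_gzip_members(data: bytes):
    """Yield the decompressed payload of each gzip member found in `data`."""
    while data:
        # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
        payload = decompressor.decompress(data) + decompressor.flush()
        yield payload
        # Whatever follows the end of this member belongs to the next one.
        data = decompressor.unused_data


# Stand-in for an ARC file: two records, each compressed separately,
# then concatenated (hypothetical sample data).
records = [b"record one\n", b"record two\n"]
blob = b"".join(gzip.compress(record) for record in records)

members = list(iter_gzip_members(blob))
print(members)
```

Python's gzip.decompress would also accept the concatenated blob, but it returns all members joined together, which loses the record boundaries.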
Text-Only Files
• Saved as sequence files -- consisting of binary key/value pairs. (Used extensively in MapReduce as input/output formats)
• On average, about 20% of the size of the raw content
• Located in the segment directories, with a file name of "textData-nnnnn". For example:
• s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-00112
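To make "sequence file" a little more concrete, here is a rough sketch that reads only the start of a Hadoop SequenceFile header: the "SEQ" magic bytes, a version byte, and the key and value class names. It assumes each class name is shorter than 128 bytes, so its length fits in one byte. The sample header is synthetic, and the use of org.apache.hadoop.io.Text for both key and value is an assumption rather than something stated in these slides.

```python
import io


def read_short_string(stream) -> str:
    """Read a length-prefixed UTF-8 string whose length fits in one byte."""
    length = stream.read(1)[0]
    if length > 127:
        raise ValueError("multi-byte lengths are not handled in this sketch")
    return stream.read(length).decode("utf-8")


def read_sequence_file_header(stream):
    """Return (version, key_class, value_class) from the start of a SequenceFile."""
    if stream.read(3) != b"SEQ":
        raise ValueError("not a SequenceFile: missing SEQ magic bytes")
    version = stream.read(1)[0]
    key_class = read_short_string(stream)
    value_class = read_short_string(stream)
    return version, key_class, value_class


# Synthetic header for illustration only (assumed Text/Text key and value).
TEXT_CLASS = b"org.apache.hadoop.io.Text"
sample = b"SEQ" + bytes([6]) + (bytes([len(TEXT_CLASS)]) + TEXT_CLASS) * 2

header = read_sequence_file_header(io.BytesIO(sample))
print(header)
```

For real work, Hadoop's own SequenceFile reader is the safer choice, since the rest of the header (compression flags, codec, metadata, sync marker) is not handled here.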
Metadata Files
• For each URL, metadata files contain status information, the HTTP response code, and file names and offsets of ARC files where the raw content can be found.
• Also contain the HTML title, HTML meta tags, RSS/Atom information, and all anchors/hyperlinks from HTML documents (including all fields on the link tags).
• Records in the Metadata files are in the same order and have the same file numbers as the Text Only content
• Saved as sequence files
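The ARC file name and offset stored in the metadata are what make random access possible: a reader can jump straight to one record instead of scanning the whole file. The sketch below assumes the offset is the byte position where a record's GZIP member starts. The function name read_record_at and the in-memory sample are hypothetical, and the actual metadata field names are not shown in these slides.

```python
import gzip
import zlib


def read_record_at(arc_bytes: bytes, offset: int) -> bytes:
    """Decompress the single gzip member that starts at `offset`."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # Decompression stops at the end of this member; later bytes are ignored.
    return decompressor.decompress(arc_bytes[offset:])


# Stand-in ARC content: three separately compressed records, concatenated.
records = [b"first page\n", b"second page\n", b"third page\n"]
compressed = [gzip.compress(record) for record in records]
arc_bytes = b"".join(compressed)

# The second record starts right after the first compressed member.
second_offset = len(compressed[0])
second = read_record_at(arc_bytes, second_offset)
print(second)
```

Against S3, the same idea would typically be combined with a ranged read so that only the bytes of the wanted record are downloaded.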
Browsing Data
• You can use s3cmd on your local machine
• Install with pip: ‘pip install s3cmd’
• Configure: ‘s3cmd --configure’
• Requires AWS keys
• Demo: s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output/
Common Crawl AMI
• Amazon Machine Image loaded with Common Crawl example programs, a development Hadoop instance, and scripts to submit jobs to Amazon Elastic MapReduce
• Amazon AMI ID: "ami-07339a6e"
Running Example MR Jobs Using the AMI
• ccRunExample [ LocalHadoop | AmazonEMR ] [ ExampleName ] ( S3Bucket )
• bin/ccRunExample LocalHadoop ExampleMetadataDomainPageCount aws-publicdatasets/common-crawl/parse-output/segment/1341690167474/
• look at the code: nano src/java/org/commoncrawl/examples/ExampleMetadataDomainPageCount.java
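For a feel of what a job like ExampleMetadataDomainPageCount computes, here is a small Python sketch of the map and reduce steps behind a per-domain page count. It is not the Java example shipped on the AMI: the function names and sample URLs are made up, and "domain" is taken to mean the URL's host name, which is an assumption.

```python
from collections import Counter
from urllib.parse import urlsplit


def map_url(url: str):
    """Map step: emit (host, 1) for each URL that has a host name."""
    host = urlsplit(url).hostname
    if host:
        yield host, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per host."""
    totals = Counter()
    for host, count in pairs:
        totals[host] += count
    return totals


# Hypothetical input; in the real job the URLs are the metadata file keys.
urls = [
    "http://www.ischool.berkeley.edu/courses/i290t-wod",
    "http://www.ischool.berkeley.edu/about",
    "http://commoncrawl.org/url-search-tool/",
    "not a url",
]

pairs = (pair for url in urls for pair in map_url(url))
counts = reduce_counts(pairs)
print(counts.most_common())
```

On Hadoop the map and reduce steps run on different machines, with the framework grouping pairs by key in between; here a single Counter stands in for that shuffle.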
Code Samples to Try
• http://github.com/commoncrawl/
• Pete Warden’s Ruby example
• http://petewarden.typepad.com/searchbrowser/2012/03/twelve-steps-to-running-your-ruby-code-across-five-billion-web-pages.html
Helpful Resources
• Developer Documentation:
• https://commoncrawl.atlassian.net/
• Developer Discussion List:
• https://groups.google.com/group/common-crawl
Questions?
• @davelester
• www.davelester.org