Crawlers and Crawling Strategies
CSCI 572: Information Retrieval and Search Engines
Summer 2010
Outline
• Crawlers
  – Web
  – File-based
• Characteristics
• Challenges
Why Crawling?
• Origins were in the web
  – The web is a big “spiderweb”, so like a “spider”, crawl it
• Focused approach to navigating the web
  – It’s not just visiting all pages at once
  – …or randomly
  – There needs to be a sense of purpose
    • Some pages are more important or different than others
• Content-driven
  – Different crawlers for different purposes
Different classifications of Crawlers
• Whole-web crawlers
  – Must deal with different concerns than more focused vertical crawlers or content-based crawlers
  – Politeness, and the ability to negotiate any and all protocols defined in the URL space
  – Deal with URL filtering, freshness, and recrawling strategies
  – Examples: Heritrix, Nutch, Bixo, crawler-commons, clever uses of wget and curl, etc.
Different classifications of Crawlers
• File-based crawlers
  – Don’t require an understanding of protocol negotiation – a hard problem in its own right!
  – Assume that the content is already local
  – Uniqueness is in the methodology for
    • File identification and selection
    • Ingestion methodology
  – Examples: OODT CAS, scripting (ls/grep/UNIX), internal appliances (Google), Spotlight
Web-scale Crawling
• What do you have to deal with?
  – Protocol negotiation
    • How do you get data from FTP, HTTP, SMTP, HDFS, RMI, CORBA, SOAP, BitTorrent, or ed2k URLs?
    • Build a flexible protocol layer, like Nutch did?
  – Determining which URLs are important or not (a filtering sketch follows below)
    • Whitelists
    • Blacklists
    • Regular expressions
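A minimal sketch of whitelist/blacklist URL filtering with regular expressions (the patterns are made-up examples; production crawlers such as Nutch keep similar rules in configuration files rather than code):

```python
import re

# Hypothetical include/exclude rules; real deployments load these from config.
WHITELIST = [re.compile(r"^https?://"), re.compile(r"\.edu/")]
BLACKLIST = [re.compile(r"\.(jpg|gif|zip|exe)$"), re.compile(r"[?&]sessionid=")]

def accept_url(url: str) -> bool:
    """Keep a URL only if some whitelist rule matches and no blacklist rule does."""
    if not any(p.search(url) for p in WHITELIST):
        return False
    return not any(p.search(url) for p in BLACKLIST)
```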
Politeness
• How do you take into account that web servers and Internet providers can and will
  – Block you after a certain number of concurrent attempts
  – Block you if you ignore their crawling preferences, codified in, e.g., a robots.txt file (a check is sketched below)
  – Block you if you don’t specify a User-Agent
  – Identify you based on
    • Your IP
    • Your User-Agent
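A minimal sketch of honoring robots.txt and declaring a User-Agent, using Python’s standard-library parser (the crawler name and URLs are hypothetical):

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "cs572-demo-crawler/0.1"  # hypothetical name; always declare one

rp = RobotFileParser("http://example.com/robots.txt")
rp.read()  # fetch and parse the site's crawling rules

url = "http://example.com/private/page.html"
if rp.can_fetch(USER_AGENT, url):
    print("allowed to fetch", url)
else:
    print("robots.txt disallows", url, "for our agent")
```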
Politeness
• Queuing is very important (a per-host queue is sketched below)
• Maintain host-specific crawl patterns and policies
  – Sub-collection based, using regexes
• Threading and brute force is your enemy
• Respect robots.txt
• Declare who you are
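One common approach is a queue per host plus a minimum delay between requests to the same host. A minimal sketch (the one-second default delay is an assumption; real crawlers take it from robots.txt or per-host policy):

```python
import time
from collections import defaultdict, deque
from typing import Optional
from urllib.parse import urlparse

class PoliteQueue:
    """Per-host FIFO queues with a minimum delay between hits on one host."""
    def __init__(self, default_delay: float = 1.0):  # assumed 1s delay
        self.delay = default_delay
        self.queues = defaultdict(deque)  # host -> pending URLs
        self.last_fetch = {}              # host -> time of last fetch

    def push(self, url: str) -> None:
        self.queues[urlparse(url).netloc].append(url)

    def pop(self) -> Optional[str]:
        """Return a URL whose host has waited long enough, else None."""
        now = time.time()
        for host, q in self.queues.items():
            if q and now - self.last_fetch.get(host, 0.0) >= self.delay:
                self.last_fetch[host] = now
                return q.popleft()
        return None
```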
Crawl Scheduling
• When and where should you crawl?
  – Based on URL freshness within some N-day cycle? (a freshness check is sketched below)
    • Relies on unique identification of URLs, and approaches for that
  – Based on per-site policies?
    • Some sites are less busy at certain times of the day
    • Some sites are on higher-bandwidth connections than others
    • Profile this?
• Adaptive fetching/scheduling
  – Deciding the above on the fly, while crawling
• Regular fetching/scheduling
  – Profiling the above and storing it away in policy/config
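A minimal sketch of freshness-based scheduling: a URL is due when it has never been fetched, or when its last fetch is older than the cycle (the 7-day cycle is an assumed value of N):

```python
from datetime import datetime, timedelta
from typing import Dict, Optional

RECRAWL_CYCLE = timedelta(days=7)  # assumed N-day freshness window

def due_for_recrawl(last_fetched: Dict[str, datetime], url: str) -> bool:
    """True if we have never fetched the URL or it has gone stale."""
    last: Optional[datetime] = last_fetched.get(url)
    return last is None or datetime.utcnow() - last > RECRAWL_CYCLE
```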
Data Transfer
• Download in parallel? (a sketch follows below)
• Download sequentially?
• What do you do with the data once you’ve crawled it? Is it cached temporarily, or persisted somewhere?
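A minimal sketch of parallel downloading with a bounded thread pool, standard library only (the worker count and timeout are assumptions):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List
from urllib.request import urlopen

def fetch(url: str) -> bytes:
    with urlopen(url, timeout=10) as resp:  # assumed 10-second timeout
        return resp.read()

def fetch_all(urls: List[str], workers: int = 8) -> Dict[str, bytes]:
    """Download in parallel; the sequential variant just loops over fetch()."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(fetch, urls)))
```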
Identification of Crawl Path
• Uniform Resource Locators
• Inlinks
• Outlinks
• Parsed data
  – The source of inlinks and outlinks
• Identification of URL protocol/scheme/path
  – Deduplication (a normalization sketch follows below)
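Deduplication typically starts by normalizing URLs, so that trivially different spellings of the same resource map to one canonical key. A minimal sketch (the specific normalizations chosen are assumptions):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Lower-case scheme/host, drop default ports and fragments."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port and parts.port not in (80, 443):
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme.lower(), host, parts.path or "/", parts.query, ""))

seen = set()

def is_duplicate(url: str) -> bool:
    key = normalize(url)
    if key in seen:
        return True
    seen.add(key)
    return False
```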
File-based Crawlers
• Crawling remote content, getting politeness down, dealing with protocols, and scheduling is hard!
• Let some other component do that for you
  – CAS PushPull is a great example
  – Staging areas, delivery protocols
• Once you have the content, there is still an interesting crawling strategy to work out
What’s hard? The file is already here
• Identification of which files are important, and which aren’t
  – Content detection and analysis
    • MIME type, URL/filename regexes, MAGIC detection, XML root-characters detection, and combinations of them
    • Apache Tika
• Mapping of identified file types to mechanisms for extracting out their content and ingesting it (a dispatch sketch follows below)
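A minimal sketch of mapping detected types to extraction mechanisms via a dispatch table (the extractor functions are hypothetical stubs; a real system would call Tika or format-specific parsers):

```python
# Hypothetical extractor stubs standing in for real parsers.
def extract_text(path: str) -> str: ...
def extract_pdf(path: str) -> str: ...
def extract_xml(path: str) -> str: ...

EXTRACTORS = {
    "text/plain": extract_text,
    "application/pdf": extract_pdf,
    "application/xml": extract_xml,
}

def ingest(path: str, mime_type: str) -> str:
    """Dispatch an identified file type to its extraction mechanism."""
    handler = EXTRACTORS.get(mime_type)
    if handler is None:
        raise ValueError(f"no extractor registered for {mime_type}")
    return handler(path)
```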
Quick intro to content detection
• By URL or file name (see the sketch below)
  – People codified classification into URLs and file names
  – Think file extensions
• By MIME magic
  – Think digital signatures
• By XML schemas and classifications
  – Not all XML is created equal
• By combinations of the above
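A minimal sketch combining file-name and magic-byte detection, a tiny subset of what Apache Tika does (the magic table covers only a few well-known formats):

```python
import mimetypes

# Leading byte signatures ("magic numbers") for a few common formats.
MAGIC = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"PK\x03\x04", "application/zip"),
]

def detect(path: str) -> str:
    """Try magic bytes first, then fall back to the file extension."""
    with open(path, "rb") as f:
        head = f.read(16)
    for sig, mime in MAGIC:
        if head.startswith(sig):
            return mime
    guess, _ = mimetypes.guess_type(path)
    return guess or "application/octet-stream"
```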
Case Study: OODT CAS
• A set of components for science data processing
• Deals with file-based crawling
File-based Crawler Types
• Auto-detect
• Met Extractor
• Std Product Crawler
Other Examples of File Crawlers
• Spotlight
  – Indexes your hard drive on a Mac and makes it readily available for fast free-text search
  – Involves CAS/Tika-like interactions
• Scripting with ls and grep
  – You may find yourself doing this to run processing in batch, rapidly
  – Don’t encode the data transfer into the script! (a separation sketch follows below)
    • Mixing concerns
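A minimal sketch of keeping those concerns separate: the crawl step only selects files and writes a manifest, and a different component moves the bytes (the paths and the .log suffix are hypothetical):

```python
import os

def select_files(root: str, suffix: str = ".log"):
    """Walk a local tree and yield matching files; no data transfer here."""
    for dirpath, _, names in os.walk(root):
        for name in names:
            if name.endswith(suffix):
                yield os.path.join(dirpath, name)

# Emit a manifest for a separate transfer/ingestion component to consume.
with open("/tmp/crawl-manifest.txt", "w") as manifest:   # hypothetical path
    for path in select_files("/data/staging"):           # hypothetical dir
        manifest.write(path + "\n")
```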
Challenges
• Reliability
  – If a web-scale crawl fails partway through, how do you mitigate that?
• Scalability
  – Web-based vs. file-based
• Commodity versus appliance
  – Google, or build your own
• Separation of concerns
  – Separate processing from ingestion from acquisition
Wrapup
• Crawling is a canonical piece of a search engine
• Its utility is seen in data systems across the board
• Determine what your strategy for acquisition is, vis-à-vis your processing and ingestion strategy
• Separate and insulate
• Identify content flexibly