10 Crawling


    A. Haeberlen, Z. Ives

    CIS 455/555: Internet and Web Systems

University of Pennsylvania

    Crawling and Publish/Subscribe

    February 15, 2012


    Plan for today

    Basic crawling

    Mercator

    Publish/subscribe

    XFilter



    Motivation

    Suppose you want to build a search engine

    Need a large corpus of web pages

    How can we find these pages?

    Idea: crawl the web

    What else can you crawl? For example, a social network


    Crawling: The basic process

    What state do we need?

    Q := Queue of URLs to visit

    P := Set of pages already crawled

    Basic process:

    1. Initialize Q with a set of seed URLs

    2. Pick the first URL from Q and download the corresponding page

    3. Extract all URLs from the page (&lt;a&gt; anchor links, CSS, DTDs, scripts, optionally image links)

    4. Append to Q any URLs that a) meet our criteria, and b) are not already in P

    5. If Q is not empty, repeat from step 2

    Can one machine crawl the entire web? Of course not! Need to distribute crawling across many machines.
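    A minimal sketch of steps 1-5 in Java (single-threaded; download, extractLinks, and meetsCriteria are hypothetical stubs, and P here tracks URLs rather than page contents):

        import java.util.*;

        public class BasicCrawler {
            static void crawl(List<String> seeds) {
                Deque<String> q = new ArrayDeque<>(seeds); // Q := queue of URLs to visit (step 1)
                Set<String> p = new HashSet<>(seeds);      // P := URLs already enqueued/crawled
                while (!q.isEmpty()) {
                    String url = q.removeFirst();          // step 2: pick the first URL...
                    String page = download(url);           // ...and download the page
                    for (String link : extractLinks(page)) {      // step 3: extract URLs
                        if (meetsCriteria(link) && p.add(link)) { // step 4: filter + dedup
                            q.addLast(link);
                        }
                    }
                }                                          // step 5: repeat until Q is empty
            }
            // Hypothetical helpers; a real crawler fetches over HTTP and parses HTML
            static String download(String url) { return ""; }
            static List<String> extractLinks(String page) { return List.of(); }
            static boolean meetsCriteria(String url) { return true; }
        }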


    Crawling visualized

    [Figure: the Web drawn as nested regions: the seeds; URLs already crawled; pages currently being crawled; the "frontier" of URLs that will eventually be crawled; URLs that do not fit our criteria (too deep, etc.); and the unseen web.]


    Crawling complications

    What order to traverse in? Polite to do BFS - why?

    Malicious pages

    Spam pages / SEO

    Spider traps (incl. dynamically generated ones)

    General messiness

    Cycles

    Varying latency and bandwidth to remote servers

    Site mirrors, duplicate pages, aliases

    Web masters' stipulations

    How deep to crawl? How often to crawl?

    Continuous crawling; freshness


    Need to be robust!


    Normalization; eliminating duplicates

    Some of the extracted URLs are relative URLs

    Example: /~ahae/papers/ from www.cis.upenn.edu

    Normalize it: http://www.cis.upenn.edu/~ahae/papers

    Duplication is widespread on the web

    If the fetched page is already in the index, do not process it

    Can verify using document fingerprint (hash) or shingles
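    A small Java illustration of the normalization step, using java.net.URI (the base page path is an assumption):

        import java.net.URI;

        public class NormalizeUrl {
            public static void main(String[] args) {
                // Resolve the relative URL from the example against its base page
                URI base = URI.create("http://www.cis.upenn.edu/index.html");
                System.out.println(base.resolve("/~ahae/papers/"));
                // prints: http://www.cis.upenn.edu/~ahae/papers/
            }
        }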


    Crawler etiquette

    Explicit politeness

    Look for meta tags; for example, ignore pages that have a robots &lt;meta&gt; tag such as &lt;meta name="robots" content="noindex"&gt;


    Robots.txt

    What should be in robots.txt? See http://www.robotstxt.org/wc/robots.html

    To exclude all robots from a server:

        User-agent: *
        Disallow: /

    To exclude one robot from two directories:

        User-agent: BobsCrawler
        Disallow: /news/
        Disallow: /tmp/
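    A simplified sketch of how a crawler might enforce such rules once parsed (real robots.txt handling also deals with Allow lines, wildcards, and per-host caching; the names here are mine):

        import java.util.List;

        public class RobotsCheck {
            // Disallow rules are path prefixes that apply to our user-agent
            static boolean allowed(List<String> disallowedPrefixes, String path) {
                for (String prefix : disallowedPrefixes) {
                    if (!prefix.isEmpty() && path.startsWith(prefix)) return false;
                }
                return true;
            }

            public static void main(String[] args) {
                List<String> rulesForBobsCrawler = List.of("/news/", "/tmp/");
                System.out.println(allowed(rulesForBobsCrawler, "/news/today.html")); // false
                System.out.println(allowed(rulesForBobsCrawler, "/papers/x.pdf"));    // true
            }
        }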


    Recap: Crawling

    How does the basic process work?

    What are some of the main challenges?

    Duplicate elimination

    Politeness

    Malicious pages / spider traps

    Normalization

    Scalability


    Mercator: A scalable web crawler

    Written entirely in Java

    Expands a URL frontier

    Avoids re-crawling same URLs

    Also considers whether a document has been seen before

    Same content, different URL [when might this occur?]

    Every document has signature/checksum info computed as it's crawled

    Despite the name, it does not actually scale to a large number of nodes

    But it would not be too difficult to parallelize

    Heydon and Najork: Mercator, a scalable, extensible web crawler (WWW'99)


    Mercator architecture

    1. Dequeue frontier URL

    2. Fetch document

    3. Record into RewindInputStream (RIS)

    4. Check against fingerprints to verify it's new

    5. Extract hyperlinks

    6. Filter unwanted links

    7. Check if URL repeated (compare its hash)

    8. Enqueue URL

    Source: Mercator paper


    Mercator's polite frontier queues

    Tries to go beyond breadth-first approach

    Goal is to have only one crawler thread per server

    What does this mean for the load caused by Mercator?

    Distributed URL frontier queue:

    One subqueue per worker thread

    The worker thread is determined by hashing the hostname of the URL

    Thus, only one outstanding request per web server
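    A sketch of that assignment rule, assuming n worker threads (method and class names are mine):

        import java.net.URI;

        public class FrontierAssignment {
            // All URLs with the same host hash to the same subqueue, so only one
            // worker thread (one outstanding request) ever talks to a given server
            static int subqueueFor(String url, int numThreads) {
                String host = URI.create(url).getHost();
                return Math.floorMod(host.hashCode(), numThreads);
            }

            public static void main(String[] args) {
                System.out.println(subqueueFor("http://www.cis.upenn.edu/~ahae/", 8));
                System.out.println(subqueueFor("http://www.cis.upenn.edu/other.html", 8)); // same subqueue
            }
        }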


    Mercator's HTTP fetcher

    First, needs to ensure robots.txt is followed

    Caches the contents of robots.txt for various web sites as it crawls them

    Designed to be extensible to other protocols

    Had to write own HTTP requestor in Java: their Java version didn't have timeouts

    Today, can use setSoTimeout()

    Could use Java non-blocking I/O: http://www.owlmountain.com/tutorials/NonBlockingIo.htm

    But they use multiple threads and synchronous I/O
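    For illustration, a bare-bones fetch over a plain socket with the timeouts the slide alludes to (host and timeout values are arbitrary):

        import java.io.*;
        import java.net.*;

        public class TimeoutFetcher {
            public static void main(String[] args) throws IOException {
                Socket s = new Socket();
                s.connect(new InetSocketAddress("www.cis.upenn.edu", 80), 5000); // connect timeout (ms)
                s.setSoTimeout(5000); // read timeout: reads now throw SocketTimeoutException
                PrintWriter out = new PrintWriter(s.getOutputStream());
                out.print("GET / HTTP/1.0\r\nHost: www.cis.upenn.edu\r\n\r\n");
                out.flush();
                BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
                System.out.println(in.readLine()); // status line, e.g. "HTTP/1.1 200 OK"
                s.close();
            }
        }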


    Other caveats

    Infinitely long URL names (good way to get a buffer overflow!)

    Aliased host names

    Alternative paths to the same host

    Can catch most of these with signatures of document data (e.g., MD5)

    Comparison to Bloom filters
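    A sketch of the signature idea using MD5 via java.security.MessageDigest (a Bloom filter would use less memory, at the cost of occasional false positives, i.e., new pages mistaken for duplicates):

        import java.security.MessageDigest;
        import java.util.HashSet;
        import java.util.Set;

        public class ContentSeen {
            static Set<String> fingerprints = new HashSet<>();

            // Returns true the first time this exact content is seen
            static boolean isNew(byte[] content) throws Exception {
                byte[] digest = MessageDigest.getInstance("MD5").digest(content);
                StringBuilder hex = new StringBuilder();
                for (byte b : digest) hex.append(String.format("%02x", b));
                return fingerprints.add(hex.toString());
            }

            public static void main(String[] args) throws Exception {
                System.out.println(isNew("<html>same page</html>".getBytes())); // true
                System.out.println(isNew("<html>same page</html>".getBytes())); // false: duplicate
            }
        }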

    Crawler traps (e.g., CGI scripts that link to themselves using a different name)

    May need to have a way for a human to override certain URL paths (see Section 5 of the paper)


    Further considerations

    May want to prioritize certain pages as being most worth crawling

    Focused crawling tries to prioritize based on relevance

    May need to refresh certain pages more often


    Example: RSS

    Web server publishes XML file with events

    Clients periodically request the file to see if there are new events

    Is this a good solution?


    Interest-based crawling

    Suppose we want to crawl XML documents based on user interests

    We need several parts:

    A list of interests expressed in an executable form, perhaps XPath queries

    A crawler that goes out and fetches XML content

    A filter / routing engine that matches XML content against users' interests, sends them the content if it matches


    XML-Based information dissemination

    Basic model (XFilter, YFilter, Xyleme):

    Users are interested in data relating to a particular topic, and know the schema

    /politics/usa//body (an XPath, here used to match documents, not nodes)

    A crawler-aggregator reads XML files from the web (or gets them from data sources) and feeds them to interested parties


    Engine for XFilter [Altinel & Franklin 00]


    How does it work?

    Each XPath segment is basically a subset of regular expressions over element tags

    Convert into finite state automata

    Parse data as it comes in: use SAX API

    Match against finite state machines

    Most of these systems use modified FSMs because they want to match many patterns at the same time
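    A toy sketch of this idea for a single query, /politics/usa//body, driven by SAX events (class names are mine; the depth checks and end-element rollback that XFilter performs, described on the next slides, are omitted for brevity):

        import javax.xml.parsers.SAXParserFactory;
        import org.xml.sax.Attributes;
        import org.xml.sax.InputSource;
        import org.xml.sax.helpers.DefaultHandler;
        import java.io.StringReader;

        public class PathMatcher extends DefaultHandler {
            // States: 0 = start, 1 = saw politics, 2 = saw usa, 3 = saw body (accept);
            // the // step lets body occur at any depth below usa
            private int state = 0;
            boolean matched = false;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (state == 0 && qName.equals("politics")) state = 1;
                else if (state == 1 && qName.equals("usa")) state = 2;
                else if (state == 2 && qName.equals("body")) { state = 3; matched = true; }
            }

            public static void main(String[] args) throws Exception {
                String xml = "<politics><usa><section><body>hi</body></section></usa></politics>";
                PathMatcher m = new PathMatcher();
                SAXParserFactory.newInstance().newSAXParser()
                        .parse(new InputSource(new StringReader(xml)), m);
                System.out.println(m.matched); // true
            }
        }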


    Path nodes and FSMs

    XPath parser decomposes XPath expressions into a set of path nodes

    These nodes act as the states of corresponding FSM

    A node in the Candidate List denotes the current state

    The rest of the states are in corresponding Wait Lists

    Simple FSM for /politics[@topic="president"]/usa//body:

    (start) -politics-> Q1_1 -usa-> Q1_2 -body-> Q1_3 (accept)


    Decomposing into path nodes

    Each path node stores:

    Query ID

    Position in state machine

    Relative Position (RP) in tree:

    0 for root node if it's not preceded by //

    -1 for any node preceded by //

    Else = 1 + (number of * nodes from predecessor node)

    Level: if current node has fixed distance from root, then 1 + distance

    Else if RP = -1, then -1, else 0

    Finally, NextPathNodeSet points to next node

    Q1 = /politics[@topic="president"]/usa//body

                 Q1-1   Q1-2   Q1-3
    Position        1      2      3
    RP              0      1     -1
    Level           1      2     -1

    Q2 = //usa/*/body/p

                 Q2-1   Q2-2   Q2-3
    Position        1      2      3
    RP             -1      2      1
    Level          -1      0      0
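    In Java, this bookkeeping might be captured by a small record (my own sketch, not XFilter's actual code):

        public record PathNode(String queryId, int position, int relativePos, int level) {
            public static void main(String[] args) {
                // Q1 = /politics[@topic="president"]/usa//body decomposes into:
                System.out.println(new PathNode("Q1", 1, 0, 1));   // politics
                System.out.println(new PathNode("Q1", 2, 1, 2));   // usa
                System.out.println(new PathNode("Q1", 3, -1, -1)); // body (preceded by //)
            }
        }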


    Query index

    Query index entry for each XML tag

    Two lists: Candidate List (CL) and Wait List (WL), divided across the nodes

    Live queries' states are in CL; pending queries' states are in WL

    Events that cause state transitions are generated by the XML parser

    [Query index for Q1 and Q2:]

        politics:  CL = {Q1-1}   WL = {}
        usa:       CL = {Q2-1}   WL = {Q1-2}
        body:      CL = {}       WL = {Q1-3, Q2-2}
        p:         CL = {}       WL = {Q2-3}


    Encountering an element

    Look up the element name in the Query Index and check all nodes in the associated CL

    Validate that we actually have a match

    Example: startElement(politics). The Query Index entry for "politics" has Q1-1 on its CL; that path node carries Query ID Q1, Position 1, Rel. Position 0, Level 1, and a NextPathNodeSet pointer.


    Validating a match

    We first check that the current XML depth matches the level in the user query:

    If level in CL node is less than 1, then ignore height

    else level in CL node must = height

    This ensures we're matching at the right point in the tree!

    Finally, we validate any predicates against attributes (e.g., [@topic="president"])
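    The depth test as a one-method sketch (names are mine):

        public class LevelCheck {
            // level < 1: the node's distance from the root is not fixed, so any depth matches;
            // otherwise the element must occur at exactly depth == level
            static boolean levelOk(int level, int depth) {
                return level < 1 || level == depth;
            }

            public static void main(String[] args) {
                System.out.println(levelOk(1, 1));  // true:  politics at depth 1
                System.out.println(levelOk(-1, 4)); // true:  //body matches at any depth
                System.out.println(levelOk(2, 3));  // false: usa must be at depth 2
            }
        }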


    Processing further elements

    Queries that don't meet validation are removed from the Candidate Lists

    For other queries, we advance to the next state

    We copy the next node of the query from the WL to the CL, and update the RP and level

    When we reach a final state (e.g., Q1-3), we can output the document to the subscriber

    When we encounter an end element, we must remove that element from the CL


    A simpler approach

    Instantiate a DOM tree for each document

    Traverse and recursively match XPaths
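    This is essentially what the standard javax.xml.xpath API provides; a small example matching one of the earlier queries against a parsed document:

        import javax.xml.parsers.DocumentBuilderFactory;
        import javax.xml.xpath.XPath;
        import javax.xml.xpath.XPathConstants;
        import javax.xml.xpath.XPathFactory;
        import org.w3c.dom.Document;
        import org.xml.sax.InputSource;
        import java.io.StringReader;

        public class DomMatcher {
            public static void main(String[] args) throws Exception {
                String xml = "<politics topic=\"president\"><usa><body>text</body></usa></politics>";
                Document doc = DocumentBuilderFactory.newInstance()
                        .newDocumentBuilder()
                        .parse(new InputSource(new StringReader(xml)));
                XPath xpath = XPathFactory.newInstance().newXPath();
                // The document "matches" a subscription if the XPath selects at least one node
                boolean matches = (Boolean) xpath.evaluate(
                        "/politics[@topic='president']/usa//body", doc, XPathConstants.BOOLEAN);
                System.out.println(matches); // true
            }
        }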

    Pros and cons?


    Recap: Publish-subscribe model

    Publish-subscribe model

    Publishers produce events

    Each subscriber is interested in a subset of the events

    Challenge: Efficient implementation

    Comparison: XFilter vs. RSS

    XFilter

    Interests are specified with XPaths (very powerful!)

    Sophisticated technique for efficiently matching documents against many XPaths in parallel