10 Crawling


    A. Haeberlen, Z. Ives

    CIS 455/555: Internet and Web Systems

University of Pennsylvania

    Crawling and Publish/Subscribe

    February 15, 2012


    Plan for today

    Basic crawling

    Mercator

    Publish/subscribe

    XFilter



    Motivation

    Suppose you want to build a search engine

    Need a large corpus of web pages

    How can we find these pages?

    Idea: crawl the web

    What else can you crawl? For example, a social network


    Crawling: The basic process

    What state do we need?

    Q := Queue of URLs to visit

    P := Set of pages already crawled

    Basic process:

    1. Initialize Q with a set of seed URLs

    2. Pick the first URL from Q and download the corresponding page

    3. Extract all URLs from the page (&lt;a&gt; anchor links, CSS, DTDs, scripts, optionally image links)

    4. Append to Q any URLs that a) meet our criteria, and b) are not already in P

    5. If Q is not empty, repeat from step 2

    Can one machine crawl the entire web? Of course not! Need to distribute crawling across many machines.
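    A minimal sketch of steps 1-5 in Java (single-threaded; download, extractLinks, and meetsCriteria are hypothetical stubs, and P here tracks URLs rather than page contents):

        import java.util.*;

        public class BasicCrawler {
            static void crawl(List<String> seeds) {
                Deque<String> q = new ArrayDeque<>(seeds); // Q := queue of URLs to visit (step 1)
                Set<String> p = new HashSet<>(seeds);      // P := URLs already enqueued/crawled
                while (!q.isEmpty()) {
                    String url = q.removeFirst();          // step 2: pick the first URL...
                    String page = download(url);           // ...and download the page
                    for (String link : extractLinks(page)) {      // step 3: extract URLs
                        if (meetsCriteria(link) && p.add(link)) { // step 4: filter + dedup
                            q.addLast(link);
                        }
                    }
                }                                          // step 5: repeat until Q is empty
            }
            // Hypothetical helpers; a real crawler fetches over HTTP and parses HTML
            static String download(String url) { return ""; }
            static List<String> extractLinks(String page) { return List.of(); }
            static boolean meetsCriteria(String url) { return true; }
        }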


    Crawling visualized

    [Figure: the Web drawn as nested regions: the seeds; URLs already crawled; pages currently being crawled; the "frontier" of URLs that will eventually be crawled; URLs that do not fit our criteria (too deep, etc.); and the unseen web.]


    Crawling complications

    What order to traverse in? Polite to do BFS - why?

    Malicious pages

    Spam pages / SEO

    Spider traps (incl. dynamically generated ones)

    General messiness

    Cycles

    Varying latency and bandwidth to remote servers

    Site mirrors, duplicate pages, aliases

    Web masters' stipulations

    How deep to crawl? How often to crawl?

    Continuous crawling; freshness


    Need to be robust!


    Normalization; eliminating duplicates

    Some of the extracted URLs are relative URLs

    Example: /~ahae/papers/ from www.cis.upenn.edu

    Normalize it: http://www.cis.upenn.edu/~ahae/papers

    Duplication is widespread on the web

    If the fetched page is already in the index, do not process it

    Can verify using document fingerprint (hash) or shingles
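    A small Java illustration of the normalization step, using java.net.URI (the base page path is an assumption):

        import java.net.URI;

        public class NormalizeUrl {
            public static void main(String[] args) {
                // Resolve the relative URL from the example against its base page
                URI base = URI.create("http://www.cis.upenn.edu/index.html");
                System.out.println(base.resolve("/~ahae/papers/"));
                // prints: http://www.cis.upenn.edu/~ahae/papers/
            }
        }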


    Crawler etiquette

    Explicit politeness

    Look for meta tags; for example, ignore pages that have a robots &lt;meta&gt; tag such as &lt;meta name="robots" content="noindex"&gt;


    Robots.txt

    What should be in robots.txt? See http://www.robotstxt.org/wc/robots.html

    To exclude all robots from a server:

        User-agent: *
        Disallow: /

    To exclude one robot from two directories:

        User-agent: BobsCrawler
        Disallow: /news/
        Disallow: /tmp/
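    A simplified sketch of how a crawler might enforce such rules once parsed (real robots.txt handling also deals with Allow lines, wildcards, and per-host caching; the names here are mine):

        import java.util.List;

        public class RobotsCheck {
            // Disallow rules are path prefixes that apply to our user-agent
            static boolean allowed(List<String> disallowedPrefixes, String path) {
                for (String prefix : disallowedPrefixes) {
                    if (!prefix.isEmpty() && path.startsWith(prefix)) return false;
                }
                return true;
            }

            public static void main(String[] args) {
                List<String> rulesForBobsCrawler = List.of("/news/", "/tmp/");
                System.out.println(allowed(rulesForBobsCrawler, "/news/today.html")); // false
                System.out.println(allowed(rulesForBobsCrawler, "/papers/x.pdf"));    // true
            }
        }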


    Recap: Crawling

    How does the basic process work?

    What are some of the main challenges?

    Duplicate elimination

    Politeness

    Malicious pages / spider traps

    Normalization

    Scalability


    Mercator: A scalable web crawler

    Written entirely in Java

    Expands a URL frontier

    Avoids re-crawling same URLs

    Also considers whether a document has been seen before

    Same content, different URL [when might this occur?]

    Every document has signature/checksum info computed as it's crawled

    Despite the name, it does not actually scale to a large number of nodes

    But it would not be too difficult to parallelize

    Heydon and Najork: Mercator, a scalable, extensible web crawler (WWW'99)


    Mercator architecture

    1. Dequeue frontier URL

    2. Fetch document

    3. Record into RewindInputStream (RIS)

    4. Check against fingerprints to verify it's new

    5. Extract hyperlinks

    6. Filter unwanted links

    7. Check if URL repeated (compare its hash)

    8. Enqueue URL

    Source: Mercator paper


    Mercator's polite frontier queues

    Tries to go beyond breadth-first approach

    Goal is to have only one crawler thread per server

    What does this mean for the load caused by Mercator?

    Distributed URL frontier queue:

    One subqueue per worker thread

    The worker thread is determined by hashing the hostname of the URL

    Thus, only one outstanding request per web server
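    A sketch of that assignment rule, assuming n worker threads (method and class names are mine):

        import java.net.URI;

        public class FrontierAssignment {
            // All URLs with the same host hash to the same subqueue, so only one
            // worker thread (one outstanding request) ever talks to a given server
            static int subqueueFor(String url, int numThreads) {
                String host = URI.create(url).getHost();
                return Math.floorMod(host.hashCode(), numThreads);
            }

            public static void main(String[] args) {
                System.out.println(subqueueFor("http://www.cis.upenn.edu/~ahae/", 8));
                System.out.println(subqueueFor("http://www.cis.upenn.edu/other.html", 8)); // same subqueue
            }
        }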


    Mercator's HTTP fetcher

    First, needs to ensure robots.txt is followed

    Caches the contents of robots.txt for various web sites as it crawls them

    Designed to be extensible to other protocols

    Had to write own HTTP requestor in Java: their Java version didn't have timeouts

    Today, can use setSoTimeout()

    Could use Java non-blocking I/O: http://www.owlmountain.com/tutorials/NonBlockingIo.htm

    But they use multiple threads and synchronous I/O
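    For illustration, a bare-bones fetch over a plain socket with the timeouts the slide alludes to (host and timeout values are arbitrary):

        import java.io.*;
        import java.net.*;

        public class TimeoutFetcher {
            public static void main(String[] args) throws IOException {
                Socket s = new Socket();
                s.connect(new InetSocketAddress("www.cis.upenn.edu", 80), 5000); // connect timeout (ms)
                s.setSoTimeout(5000); // read timeout: reads now throw SocketTimeoutException
                PrintWriter out = new PrintWriter(s.getOutputStream());
                out.print("GET / HTTP/1.0\r\nHost: www.cis.upenn.edu\r\n\r\n");
                out.flush();
                BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
                System.out.println(in.readLine()); // status line, e.g. "HTTP/1.1 200 OK"
                s.close();
            }
        }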


    Other caveats

    Infinitely long URL names (good way to get a buffer overflow!)

    Aliased host names

    Alternative paths to the same host

    Can catch most of these with signatures of document data (e.g., MD5)

    Comparison to Bloom filters
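    A sketch of the signature idea using MD5 via java.security.MessageDigest (a Bloom filter would use less memory, at the cost of occasional false positives, i.e., new pages mistaken for duplicates):

        import java.security.MessageDigest;
        import java.util.HashSet;
        import java.util.Set;

        public class ContentSeen {
            static Set<String> fingerprints = new HashSet<>();

            // Returns true the first time this exact content is seen
            static boolean isNew(byte[] content) throws Exception {
                byte[] digest = MessageDigest.getInstance("MD5").digest(content);
                StringBuilder hex = new StringBuilder();
                for (byte b : digest) hex.append(String.format("%02x", b));
                return fingerprints.add(hex.toString());
            }

            public static void main(String[] args) throws Exception {
                System.out.println(isNew("<html>same page</html>".getBytes())); // true
                System.out.println(isNew("<html>same page</html>".getBytes())); // false: duplicate
            }
        }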

    Crawler traps (e.g., CGI scripts that link to themselves using a different name)

    May need to have a way for a human to override certain URL paths (see Section 5 of the paper)


    Further considerations

    May want to prioritize certain pages as being most worth crawling

    Focused crawling tries to prioritize based on relevance

    May need to refresh certain pages more often


    Example: RSS

    Web server publishes XML file with events

    Clients periodically request the file to see if there are new events

    Is this a good solution?


    Interest-based crawling

    Suppose we want to crawl XML documents based on user interests

    We need several parts:

    A list of interests expressed in an executable form, perhaps XPath queries

    A crawler that goes out and fetches XML content

    A filter / routing engine that matches XML content against users' interests, sends them the content if it matches


    XML-Based information dissemination

    Basic model (XFilter, YFilter, Xyleme):

    Users are interested in data relating to a particular topic, and know the schema

    /politics/usa//body (an XPath, here used to match documents, not nodes)

    A crawler-aggregator reads XML files from the web (or gets them from data sources) and feeds them to interested parties


    Engine for XFilter [Altinel & Franklin 00]


    How does it work?

    Each XPath segment is basically a subset of regular expressions over element tags

    Convert into finite state automata

    Parse data as it comes in: use SAX API

    Match against finite state machines

    Most of these systems use modified FSMs because they want to match many patterns at the same time
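    A toy sketch of this idea for a single query, /politics/usa//body, driven by SAX events (class names are mine; the depth checks and end-element rollback that XFilter performs, described on the next slides, are omitted for brevity):

        import javax.xml.parsers.SAXParserFactory;
        import org.xml.sax.Attributes;
        import org.xml.sax.InputSource;
        import org.xml.sax.helpers.DefaultHandler;
        import java.io.StringReader;

        public class PathMatcher extends DefaultHandler {
            // States: 0 = start, 1 = saw politics, 2 = saw usa, 3 = saw body (accept);
            // the // step lets body occur at any depth below usa
            private int state = 0;
            boolean matched = false;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (state == 0 && qName.equals("politics")) state = 1;
                else if (state == 1 && qName.equals("usa")) state = 2;
                else if (state == 2 && qName.equals("body")) { state = 3; matched = true; }
            }

            public static void main(String[] args) throws Exception {
                String xml = "<politics><usa><section><body>hi</body></section></usa></politics>";
                PathMatcher m = new PathMatcher();
                SAXParserFactory.newInstance().newSAXParser()
                        .parse(new InputSource(new StringReader(xml)), m);
                System.out.println(m.matched); // true
            }
        }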


    Path nodes and FSMs

    XPath parser decomposes XPath expressions into a set of path nodes

    These nodes act as the states of corresponding FSM

    A node in the Candidate List denotes the current state

    The rest of the states are in corresponding Wait Lists

    Simple FSM for /politics[@topic="president"]/usa//body:

    (start) -politics-> Q1_1 -usa-> Q1_2 -body-> Q1_3 (accept)


    Decomposing into path nodes

    Each path node stores:

    Query ID

    Position in state machine

    Relative Position (RP) in tree:

    0 for root node if it's not preceded by //

    -1 for any node preceded by //

    Else = 1 + (number of * nodes from predecessor node)

    Level: if current node has fixed distance from root, then 1 + distance

    Else if RP = -1, then -1, else 0

    Finally, NextPathNodeSet points to next node

    Q1 = /politics[@topic="president"]/usa//body

                 Q1-1   Q1-2   Q1-3
    Position        1      2      3
    RP              0      1     -1
    Level           1      2     -1

    Q2 = //usa/*/body/p

                 Q2-1   Q2-2   Q2-3
    Position        1      2      3
    RP             -1      2      1
    Level          -1      0      0
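    In Java, this bookkeeping might be captured by a small record (my own sketch, not XFilter's actual code):

        public record PathNode(String queryId, int position, int relativePos, int level) {
            public static void main(String[] args) {
                // Q1 = /politics[@topic="president"]/usa//body decomposes into:
                System.out.println(new PathNode("Q1", 1, 0, 1));   // politics
                System.out.println(new PathNode("Q1", 2, 1, 2));   // usa
                System.out.println(new PathNode("Q1", 3, -1, -1)); // body (preceded by //)
            }
        }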


    Query index

    Query index entry for each XML tag

    Two lists: Candidate List (CL) and Wait List (WL), divided across the nodes

    Live queries' states are in CL; pending queries' states are in WL

    Events that cause state transitions are generated by the XML parser

    [Query index for Q1 and Q2:]

        politics:  CL = {Q1-1}   WL = {}
        usa:       CL = {Q2-1}   WL = {Q1-2}
        body:      CL = {}       WL = {Q1-3, Q2-2}
        p:         CL = {}       WL = {Q2-3}


    Encountering an element

    Look up the element name in the Query Index and check all nodes in the associated CL

    Validate that we actually have a match

    Example: startElement(politics). The Query Index entry for "politics" has Q1-1 on its CL; that path node carries Query ID Q1, Position 1, Rel. Position 0, Level 1, and a NextPathNodeSet pointer.


    Validating a match

    We first check that the current XML depth matches the level in the user query:

    If level in CL node is less than 1, then ignore height

    else level in CL node must = height

    This ensures we're matching at the right point in the tree!

    Finally, we validate any predicates against attributes (e.g., [@topic="president"])
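    The depth test as a one-method sketch (names are mine):

        public class LevelCheck {
            // level < 1: the node's distance from the root is not fixed, so any depth matches;
            // otherwise the element must occur at exactly depth == level
            static boolean levelOk(int level, int depth) {
                return level < 1 || level == depth;
            }

            public static void main(String[] args) {
                System.out.println(levelOk(1, 1));  // true:  politics at depth 1
                System.out.println(levelOk(-1, 4)); // true:  //body matches at any depth
                System.out.println(levelOk(2, 3));  // false: usa must be at depth 2
            }
        }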


    Processing further elements

    Queries that don't meet validation are removed from the Candidate Lists

    For other queries, we advance to the next state

    We copy the next node of the query from the WL to the CL, and update the RP and level

    When we reach a final state (e.g., Q1-3), we can output the document to the subscriber

    When we encounter an end element, we must remove that element from the CL


    A simpler approach

    Instantiate a DOM tree for each document

    Traverse and recursively match XPaths
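    This is essentially what the standard javax.xml.xpath API provides; a small example matching one of the earlier queries against a parsed document:

        import javax.xml.parsers.DocumentBuilderFactory;
        import javax.xml.xpath.XPath;
        import javax.xml.xpath.XPathConstants;
        import javax.xml.xpath.XPathFactory;
        import org.w3c.dom.Document;
        import org.xml.sax.InputSource;
        import java.io.StringReader;

        public class DomMatcher {
            public static void main(String[] args) throws Exception {
                String xml = "<politics topic=\"president\"><usa><body>text</body></usa></politics>";
                Document doc = DocumentBuilderFactory.newInstance()
                        .newDocumentBuilder()
                        .parse(new InputSource(new StringReader(xml)));
                XPath xpath = XPathFactory.newInstance().newXPath();
                // The document "matches" a subscription if the XPath selects at least one node
                boolean matches = (Boolean) xpath.evaluate(
                        "/politics[@topic='president']/usa//body", doc, XPathConstants.BOOLEAN);
                System.out.println(matches); // true
            }
        }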

    Pros and cons?


    Recap: Publish-subscribe model

    Publish-subscribe model

    Publishers produce events

    Each subscriber is interested in a subset of the events

    Challenge: Efficient implementation

    Comparison: XFilter vs. RSS

    XFilter

    Interests are specified with XPaths (very powerful!)

    Sophisticated technique for efficiently matching documents against many XPaths in parallel