
Page 1: Lecture 3: Crawling the Web (Chap 2, Chakrabarti)

Wen-Hsiang Lu (盧文祥)
Department of Computer Science and Information Engineering, National Cheng Kung University
2007/10/02

Page 2: Crawling the Web

• Web pages
  — a few thousand characters long
  — served over the Internet using the HyperText Transfer Protocol (HTTP)
  — viewed at the client end using browsers
• Crawler
  — fetches the pages to a computer
  — at the computer, automatic programs can analyze the hypertext documents

Page 3: HTML

• HyperText Markup Language
• Lets the author
  — specify layout and typeface
  — embed diagrams
  — create hyperlinks
• A hyperlink is expressed as an anchor tag with an HREF attribute
• HREF names another page using a Uniform Resource Locator (URL):
  — URL = protocol field ("http") + server hostname ("www.cse.iitb.ac.in") + file path (/, the `root' of the published file system)
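
For illustration, Python's standard urllib.parse splits a URL into exactly these three fields (a minimal sketch; the example URL is the one from the slide):

```python
from urllib.parse import urlparse

# Split a URL into the three fields named above:
# protocol (scheme), server hostname, and file path.
url = "http://www.cse.iitb.ac.in/"
parts = urlparse(url)
print(parts.scheme)    # 'http'               -> protocol field
print(parts.hostname)  # 'www.cse.iitb.ac.in' -> server hostname
print(parts.path)      # '/'                  -> file path (the `root')
```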

Page 4: HTTP (HyperText Transfer Protocol)

• Built on top of the Transmission Control Protocol (TCP)
• Steps (from the client end); see the sketch below
  — Resolve the server host name to an Internet address (IP)
    • uses the Domain Name Service (DNS)
    • DNS is a distributed database of name-to-IP mappings maintained at a set of known servers
  — Contact the server using TCP
    • connect to the default HTTP port (80) on the server
    • send the HTTP request header (e.g., GET)
    • fetch the response header
      — MIME (Multipurpose Internet Mail Extensions): a meta-data standard for email and Web content transfer
    • fetch the HTML page
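
A minimal sketch of these client-side steps using raw sockets in Python (the host name is a placeholder, and a real client would add timeouts and error handling):

```python
import socket

host = "www.example.com"          # placeholder host, not from the slides
ip = socket.gethostbyname(host)   # step 1: DNS resolution (name -> IP)

s = socket.create_connection((ip, 80))  # step 2: TCP connect to port 80
# step 3: send the HTTP request header (GET)
s.sendall(f"GET / HTTP/1.0\r\nHost: {host}\r\n\r\n".encode())

# step 4: fetch the response header and the HTML page
response = b""
while chunk := s.recv(4096):
    response += chunk
s.close()
header, _, body = response.partition(b"\r\n\r\n")
print(header.decode())  # status line + MIME headers
```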

Page 5: Crawl "all" Web pages?

• Problem: there is no catalog of all accessible URLs on the Web
• Solution (a breadth-first sketch follows below):
  — start from a given set of URLs
  — progressively fetch and scan them for new outlinking URLs
  — fetch these pages in turn
  — submit the text in each page to a text indexing system
  — and so on...
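
This procedure is essentially a breadth-first traversal of the Web graph. A toy sketch, where fetch_page, extract_links, and index_text are hypothetical placeholders for the components named above:

```python
from collections import deque

def crawl(seed_urls, fetch_page, extract_links, index_text, limit=1000):
    """Toy breadth-first crawl loop; fetch_page, extract_links and
    index_text are placeholder callables supplied by the caller."""
    frontier = deque(seed_urls)   # URLs waiting to be fetched
    visited = set(seed_urls)
    while frontier and len(visited) <= limit:
        url = frontier.popleft()
        page = fetch_page(url)            # fetch the page
        index_text(url, page)             # submit its text to the indexer
        for link in extract_links(page):  # scan for new outlinking URLs
            if link not in visited:       # fetch these pages in turn
                visited.add(link)
                frontier.append(link)
```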

Page 6: Crawling procedure

• Simple in principle
  — a great deal of engineering goes into industry-strength crawlers
  — industry crawlers crawl a substantial fraction of the Web
  — e.g., AltaVista, Northern Light, Inktomi
• No guarantee that all accessible Web pages will be located in this fashion
• The crawler may never halt
  — pages will be added continually even as it is running

Page 7: Crawling overheads

• Delays involved in
  — resolving the host name in the URL to an IP address using DNS
  — connecting a socket to the server and sending the request
  — receiving the requested page in response
• Solution: overlap these delays by fetching many pages at the same time

Page 8: Anatomy of a crawler

• Page-fetching threads
  — start with DNS resolution
  — finish when the entire page has been fetched
• Each page is
  — stored in compressed form to disk/tape
  — scanned for outlinks
• Work pool of outlinks
  — maintain network utilization without overloading it
  — dealt with by the load manager
• Continue until the crawler has collected a sufficient number of pages

Page 9: Typical anatomy of a large-scale crawler

(figure)

Page 10: Large-scale crawlers: performance and reliability considerations

• Need to fetch many pages at the same time
  — to utilize the network bandwidth
  — a single page fetch may involve several seconds of network latency
• Highly concurrent and parallelized DNS lookups
• Use of asynchronous sockets
  — explicit encoding of the state of a fetch context in a data structure
  — polling sockets to check for completion of network transfers
  — multi-processing or multi-threading: impractical at this scale
• Care in URL extraction
  — eliminating duplicates to reduce redundant fetches
  — avoiding "spider traps" (e.g., fake URLs)

Page 11: DNS caching, prefetching and resolution

• A customized DNS component with:
  1. a custom client for address resolution
  2. a caching server
  3. a prefetching client

Page 12: Custom client for address resolution

• Tailored for concurrent handling of multiple outstanding requests
• Allows issuing many resolution requests together
  — polling at a later time for completion of individual requests
• Facilitates load distribution among many DNS servers

Page 13: Caching server

• With a large cache, persistent across DNS restarts
• Residing largely in memory if possible
• DNS resolution (a sketch follows below)
  — the standard UNIX gethostbyname cannot serve concurrent requests
  — Mercator: caching reduced DNS's share of crawl time from 87% to 25%
  — ADNS: asynchronous DNS client library
    (www.chiark.greenend.org.uk/~ian/adns)
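
ADNS is a C library; the following Python sketch only imitates the idea of combining an in-memory cache with many concurrent resolution requests (a real resolver would also lock the cache and persist it across restarts):

```python
import socket
from concurrent.futures import ThreadPoolExecutor

_cache = {}  # in-memory name -> IP cache for this process

def resolve(host):
    if host not in _cache:                     # only hit DNS on a cache miss
        _cache[host] = socket.gethostbyname(host)
    return _cache[host]

# Issue many resolution requests together instead of one blocking
# gethostbyname call at a time.
hosts = ["www.w3.org", "www.iitb.ac.in", "www.ncku.edu.tw"]
with ThreadPoolExecutor(max_workers=10) as pool:
    ips = list(pool.map(resolve, hosts))
```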

Page 14: Prefetching client

• Steps
  1. parse a page that has just been fetched
  2. extract host names from HREF targets
  3. make DNS resolution requests to the caching server
• Usually implemented using UDP
  — User Datagram Protocol
  — connectionless, packet-based communication protocol
  — does not guarantee packet delivery
• Does not wait for resolution to be completed

Page 15: Multiple concurrent fetches

• Managing multiple concurrent connections
  — a single download may take several seconds
  — open many socket connections to different HTTP servers simultaneously
• Multi-CPU machines not very useful
  — crawling performance is limited by network and disk
• Two approaches
  1. multi-threading
  2. non-blocking sockets with event handlers

Page 16: Multi-threading

• Logical threads
  — physical threads of control provided by the operating system (e.g., pthreads), OR
  — concurrent processes
• A fixed number of threads is allocated in advance
• Programming paradigm (see the sketch below)
  — create a client socket
  — connect the socket to the HTTP service on a server
  — send the HTTP request header
  — read the socket (recv) until no more characters are available
  — close the socket
• Uses blocking system calls
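
A minimal sketch of this blocking, fixed-thread-pool paradigm in Python (the queued URL is a placeholder):

```python
import socket, threading, queue

work = queue.Queue()   # shared work pool of (host, path) pairs

def fetch_worker():
    while True:
        host, path = work.get()                    # blocking call
        s = socket.create_connection((host, 80))   # create + connect socket
        s.sendall(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode())
        page = b""
        while chunk := s.recv(4096):   # read until no more characters
            page += chunk
        s.close()                      # close the socket
        work.task_done()

# Fixed number of threads allocated in advance.
for _ in range(4):
    threading.Thread(target=fetch_worker, daemon=True).start()

work.put(("www.example.com", "/"))     # placeholder entry
work.join()
```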

Page 17: Multi-threading: Problems

• Performance penalty
  — mutual exclusion
  — concurrent access to data structures
• Slow disk seeks
  — a great deal of interleaved, random input-output on disk
  — due to concurrent modification of the document repository by multiple threads

Page 18: Non-blocking sockets and event handlers

• Non-blocking sockets
  — connect, send, or recv calls return immediately, without waiting for the network operation to complete
  — the status of the network operation is polled separately
• "select" system call (see the sketch below)
  — lets the application suspend until more data can be read from or written to a socket
  — times out after a pre-specified deadline
  — one monitor polls several sockets at the same time
• More efficient memory management
  — code that completes processing is not interrupted by other completions
  — no need for locks and semaphores on the pool
  — only complete pages are appended to the log
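
A sketch of the same fetch logic with non-blocking sockets and select, where each fetch context's state lives in an explicit dictionary rather than a thread (error handling omitted; a real crawler would resolve DNS separately, since connect_ex with a host name still blocks on resolution):

```python
import select, socket

def fetch_all(jobs, timeout=5.0):
    """Non-blocking fetch of several (host, path) pairs at once.
    Each fetch context's state lives in the `pending` dict."""
    pending = {}
    for host, path in jobs:
        s = socket.socket()
        s.setblocking(False)
        s.connect_ex((host, 80))   # returns immediately; connect in progress
        pending[s] = {"host": host, "path": path, "sent": False, "data": b""}
    results = []
    while pending:
        readers = [s for s, st in pending.items() if st["sent"]]
        writers = [s for s, st in pending.items() if not st["sent"]]
        r, w, _ = select.select(readers, writers, [], timeout)
        if not r and not w:        # timed out after the pre-specified deadline
            break
        for s in w:                # connected: send the HTTP request
            st = pending[s]
            req = f"GET {st['path']} HTTP/1.0\r\nHost: {st['host']}\r\n\r\n"
            s.sendall(req.encode())
            st["sent"] = True
        for s in r:                # poll sockets with completed transfers
            st = pending[s]
            chunk = s.recv(4096)
            if chunk:
                st["data"] += chunk
            else:                  # server closed: page is complete
                s.close()
                results.append((st["host"], st["data"]))
                del pending[s]
    return results
```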

Page 19: Link extraction and normalization

• Goal: obtain a canonical form of each URL
• URL processing and filtering
  — avoid multiple fetches of pages known by different URLs
  — many IP addresses may serve one host name
    • for load balancing on large sites
    • mirrored contents / contents on the same file system
  — "proxy pass": different host names mapped to a single IP address
    • needed to publish many logical sites
  — relative URLs
    • need to be interpreted w.r.t. a base URL

Page 20: Canonical URL

• Formed by (see the sketch below)
  — using a standard string for the protocol
  — canonicalizing the host name
  — adding an explicit port number
  — normalizing and cleaning up the path
    • /books/../papers/index.html => /papers/index.html
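
A sketch of these four steps using Python's standard library (real crawlers handle many more corner cases, e.g., percent-encoding and default index files):

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def canonicalize(url):
    """Minimal canonicalizer following the four steps above."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()                  # standard protocol string
    host = (parts.hostname or "").lower()          # canonical host name
    port = parts.port or DEFAULT_PORTS.get(scheme, 80)  # explicit port
    path = posixpath.normpath(parts.path or "/")   # clean up '.' and '..'
    return urlunsplit((scheme, f"{host}:{port}", path, parts.query, ""))

print(canonicalize("HTTP://Www.CSE.IITB.AC.IN/books/../papers/index.html"))
# -> http://www.cse.iitb.ac.in:80/papers/index.html
```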

Page 21: Robot exclusion

• Check whether the server prohibits crawling a normalized URL
  — in the robots.txt file in the HTTP root directory of the server
    • specifies a list of path prefixes which crawlers should not attempt to fetch
    • e.g., http://www.iitb.ac.in/robots.txt
• Meant for crawlers only
• Per user-agent specification
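
Python's standard urllib.robotparser implements exactly this check; the robots.txt URL below is the slide's example, and the user-agent string is hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Check robots.txt before fetching a normalized URL.
rp = RobotFileParser("http://www.iitb.ac.in/robots.txt")
rp.read()   # fetches and parses the disallowed path prefixes

if rp.can_fetch("MyCrawler/1.0", "http://www.iitb.ac.in/some/page.html"):
    pass    # not prohibited: safe to fetch
```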

Page 22: Eliminating already-visited URLs

• Check whether a URL has already been fetched
  — before adding a new URL to the work pool
  — needs to be very quick
  — achieved by computing an MD5 (Message-Digest algorithm 5) hash
    • (www.rsasecurity.com/rsalabs/faq/3-6-6.html)
    • e.g., MD5("vote1234") = 8339e38c61175dbd07846ad70dc226b2
• Exploit spatio-temporal locality of access
• Two-level hash function (see the sketch below)
  — most significant bits (say, 24) derived by hashing the host name plus port
  — lower-order bits (say, 40) derived by hashing the path
  — the concatenated bits are used as a key in a B-tree
• Qualifying URLs are added to the frontier of the crawl
• Their hash values are added to the B-tree
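
A sketch of the two-level hash (24 host bits + 40 path bits) using MD5 from Python's hashlib, with an in-memory set standing in for the on-disk B-tree; because the host hash forms the high bits, keys for the same server cluster together, which is what exploits the locality of access:

```python
import hashlib
from urllib.parse import urlsplit

def url_key(url):
    """Two-level hash: 24 bits from host+port, 40 bits from the path,
    concatenated into one 64-bit key."""
    parts = urlsplit(url)
    hostport = f"{parts.hostname}:{parts.port or 80}".encode()
    hi = int.from_bytes(hashlib.md5(hostport).digest()[:3], "big")            # 24 bits
    lo = int.from_bytes(hashlib.md5(parts.path.encode()).digest()[:5], "big") # 40 bits
    return (hi << 40) | lo

seen = set()   # stands in for the on-disk B-tree

def already_visited(url):
    key = url_key(url)
    if key in seen:
        return True
    seen.add(key)      # qualifying URL: record its hash value
    return False
```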

Page 23: Spider traps

• Protect the crawler from crashing on
  — ill-formed HTML
    • e.g., a page with 68 kB of null characters
  — misleading sites
    • an indefinite number of pages dynamically generated by CGI scripts
    • paths of arbitrary depth created using soft directory links and path-remapping features in the HTTP server

Page 24: Spider traps: Solutions

• No automatic technique can be foolproof
• Check for URL length
• Guards
  — prepare regular crawl statistics
  — add dominating sites to a guard module
  — disable crawling of active content such as CGI form queries
  — eliminate URLs with non-textual data types

Page 25: Avoiding repeated expansion of links on duplicate pages

• Reduce redundancy in crawls
• Duplicate detection
  — mirrored Web pages and sites
• Detecting exact duplicates
  — check against MD5 digests of stored pages
  — represent a relative link v (relative to aliases u1 and u2) as the tuples (h(u1), v) and (h(u2), v)
• Detecting near-duplicates
  — even a single altered character will completely change the digest!
    • e.g., date of update, or name and email of the site administrator
  — solution: shingling (see the sketch below)
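
Shingling compares sets of overlapping word windows rather than a single digest, so a small edit changes only a few shingles. A minimal sketch (production systems sample shingle fingerprints instead of keeping full sets):

```python
def shingles(text, w=4):
    """Set of w-word shingles (contiguous word windows) of a page."""
    words = text.split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b, w=4):
    """Jaccard overlap of shingle sets; near 1.0 means near-duplicate,
    even if a date stamp or administrator email differs."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0
```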

Page 26: Load monitor

• Keeps track of various system statistics
  — recent performance of the wide-area network (WAN) connection
    • e.g., latency and bandwidth estimates
  — operator-provided/estimated upper bound on the number of open sockets for the crawler
  — current number of active sockets

Page 27: Thread manager

• Responsible for
  — choosing units of work from the frontier
  — scheduling the issue of network resources
  — distributing these requests over multiple ISPs if appropriate
• Uses statistics from the load monitor

Page 28: Per-server work queues

• Denial of service (DoS) attacks
  — servers limit the speed or frequency of responses to any fixed client IP address
• Avoiding the appearance of a DoS attack
  — limit the number of active requests to a given server IP address at any time
  — maintain a queue of requests for each server (see the sketch below)
  — use the HTTP/1.1 persistent socket capability
  — distribute attention relatively evenly between a large number of sites
• Access locality vs. politeness dilemma
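
A sketch of per-server queues with a politeness gap; the two-second minimum gap is an assumed figure, not from the slides:

```python
import time
from collections import defaultdict, deque

class PerServerQueues:
    """One FIFO of pending requests per server IP, with a cap on how
    often any single server is contacted (politeness delay)."""
    def __init__(self, min_gap=2.0):
        self.queues = defaultdict(deque)   # server IP -> queue of URLs
        self.last_hit = {}                 # server IP -> last request time
        self.min_gap = min_gap             # assumed seconds between hits

    def add(self, ip, url):
        self.queues[ip].append(url)

    def next_url(self):
        """Pick a URL from a server not contacted recently, spreading
        attention relatively evenly between sites."""
        now = time.time()
        for ip, q in self.queues.items():
            if q and now - self.last_hit.get(ip, 0) >= self.min_gap:
                self.last_hit[ip] = now
                return q.popleft()
        return None   # every eligible server was hit too recently
```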

Page 29: Text repository

• The crawler's last task: dumping fetched pages into a repository
• Decoupling the crawler from other functions is preferred, for efficiency and reliability
• Page-related information is stored in two parts
  — meta-data
  — page contents

Page 30: Storage of page-related information

• Meta-data
  — relational in nature
  — usually managed by custom software, to avoid relational database system overheads
    • text indexing involves bulk updates
  — includes fields like content-type, last-modified date, content-length, HTTP status code, etc.

Page 31: Page contents storage

• A typical HTML Web page compresses to 2-4 kB (using zlib)
• File systems have a 4-8 kB file block size
  — too large!
• Page storage is managed by a custom storage manager (see the sketch below)
  — simple access methods for the crawler to add pages
  — subsequent programs (the indexer, etc.) retrieve documents
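
A sketch of the custom-storage idea: zlib-compress each page and append it to one growing log file, so the file-system block size is not paid once per page (the record format here is invented for illustration):

```python
import zlib

def append_page(log, url, html):
    """Compress a fetched page and append it as one record to the log."""
    blob = zlib.compress(html.encode())
    header = f"{url}\t{len(blob)}\n".encode()   # illustrative record header
    log.write(header + blob)

with open("pages.log", "ab") as log:
    append_page(log, "http://www.example.com/", "<html>...</html>")
```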

Page 32: Page storage (small-scale systems)

• Repository fits within the disks of a single machine
• Use a storage manager (e.g., Berkeley DB: www.sleepycat.com)
  — manages disk-based databases within a single file
  — configured as a hash table or B-tree if pages must be accessed by URL key (ordered access)
  — configured as a sequential log of page records if the indexer can handle pages in any order

Page 33: Page storage (large-scale systems)

• Repository distributed over a number of storage servers
• Storage servers
  — connected to the crawler through a fast local network (e.g., gigabit Ethernet)
  — pages assigned to servers by hashing their URLs
• With `T3'-grade leased lines, can handle 10 million pages (40 GB) per hour

Page 34: (figure)

Large-scale crawlers often use multiple ISPs and a bank of local storage servers to store the pages crawled.

Page 35: Refreshing crawled pages

• A search engine's index should be fresh
• A Web-scale crawler never `completes' its job
• High variance in the rate of page changes
• The HTTP "If-Modified-Since" request header
  — impractical for a crawler, which must still issue one request per page
• Solution
  — at the commencement of a new crawling round, estimate which pages have changed

Page 36: Determining page changes

• "Expires" HTTP response header
  — for pages that come with an expiry date
• Otherwise, need to guess whether revisiting the page will yield a modified version
  — maintain a score reflecting the probability that the page has been modified
  — the crawler fetches URLs in decreasing order of score
  — assumption: the recent past predicts the future
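
A sketch of a conditional revisit combining both headers: send If-Modified-Since and inspect Expires on the response (the URL and date are placeholders):

```python
import urllib.request, urllib.error

# Ask the server to send the body only if the page changed since the
# last crawl of this (placeholder) URL.
req = urllib.request.Request(
    "http://www.example.com/",
    headers={"If-Modified-Since": "Tue, 02 Oct 2007 00:00:00 GMT"})
try:
    resp = urllib.request.urlopen(req)
    print(resp.headers.get("Expires"))  # expiry date, if the server sends one
    body = resp.read()                  # page changed: re-index it
except urllib.error.HTTPError as err:
    if err.code == 304:                 # 304 Not Modified: skip this page
        pass
    else:
        raise
```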

Page 37: Estimating page change rates

• Brewington and Cybenko; Cho et al.
  — algorithms for maintaining a crawl in which most pages are fresher than a specified epoch
• Prerequisite
  — the average interval at which the crawler checks for changes is smaller than the inter-modification time of a page
• Small-scale intermediate crawler
  — runs to monitor fast-changing sites (e.g., current news, weather)
  — intermediate indices are patched into the master index

Page 38: Putting together a crawler

• Reference implementation of the HTTP client protocol
  — World Wide Web Consortium (http://www.w3c.org/)
  — w3c-libwww package

Page 39: Design of the core components: the Crawler class

• Purpose: to copy bytes from network sockets to storage media
• Three methods express the Crawler's contract with the user (see the sketch below)
  — pushing a URL to be fetched to the Crawler (fetchPush)
  — a termination callback handler (fetchDone), called with the same URL
  — a method (start) which starts the Crawler's event loop
• Implementation of the Crawler class
  — needs two helper classes, called DNS and Fetch
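
The slide names the methods but not their bodies; below is a hypothetical Python skeleton of that contract (the book's reference implementation sits on w3c-libwww in C/C++, and the DNS/Fetch helpers are stubbed out here):

```python
class Crawler:
    """Hypothetical skeleton of the contract described above; the method
    names (fetchPush, fetchDone, start) follow the slide."""
    def __init__(self):
        self.frontier = []   # URLs pushed by the user

    def fetchPush(self, url):
        """The user pushes a URL to be fetched."""
        self.frontier.append(url)

    def fetchDone(self, url, page, status):
        """Termination callback, called with the same URL once its page
        has been fetched; users override this to store/scan the page."""
        raise NotImplementedError

    def start(self):
        """Start the crawler's event loop: resolve, fetch, call back."""
        while self.frontier:
            url = self.frontier.pop(0)
            page, status = self._fetch(url)   # would use DNS + Fetch helpers
            self.fetchDone(url, page, status)

    def _fetch(self, url):
        # stub standing in for the DNS and Fetch helper classes
        return b"", 200
```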

Pages 40-43: (no textual content)

Page 44: Crawler at the WKD Lab, Academia Sinica

Page 45: Parameters

Page 46: Initial URLs

Page 47: Crawling

Page 48: Storing pages

Page 49: Criteria for the Project Crawler

• Performance
  — speed
  — scalability
• Interesting/useful design strategies
  — prefetching
  — threading
  — compressing
  — data management
  — others
• Algorithms: MD5, depth-first search / breadth-first search
• Configuration: flexible parameter settings