Crawling HTML

Page 1: Crawling HTML

Crawling HTML

Page 2: Crawling HTML

Class Overview

Network Layer

Document Layer

Crawling

Indexing

Content Analysis

Query processing

Other Cool Stuff

Page 3: Crawling HTML

Today

• Crawlers
• Server Architecture

Page 4: Crawling HTML

Graphic by Stephen Combs (HowStuffWorks.com) & Kari Meoller (Turner Broadcasting)

Page 5: Crawling HTML

Standard Web Search Engine Architecture

[Diagram: crawl the web → store documents, check for duplicates, extract links → create an inverted index (inverted index, DocIDs); search-engine servers take the user query and show results to the user]

Slide adapted from Marti Hearst / UC Berkeley

Page 6: Crawling HTML

CRAWLERS…

Page 7: Crawling HTML

Danger Will Robinson!!

• Consequences of a bug

Max 6 hits/server/minute, plus…

http://www.cs.washington.edu/lab/policies/crawlers.html

Page 8: Crawling HTML

Open-Source Crawlers

• GNU Wget
  – Utility for downloading files from the Web.
  – Fine if you just need to fetch files from 2-3 sites.
• Heritrix
  – Open-source, extensible, Web-scale crawler
  – Easy to get running.
  – Web-based UI
• Nutch
  – Featureful, industrial-strength Web search package.
  – Includes the Lucene information-retrieval part:
    • TF/IDF and other document ranking
    • Optimized, inverted-index data store
  – You get complete control through easy programming.

Page 9: Crawling HTML

Search Engine Architecture

• Crawler (Spider)
  – Searches the web to find pages. Follows hyperlinks. Never stops.
• Indexer
  – Produces data structures for fast searching of all words in the pages
• Retriever
  – Query interface
  – Database lookup to find hits
    • 300 million documents
    • 300 GB RAM, terabytes of disk
  – Ranking, summaries
• Front End

Page 10: Crawling HTML

Spiders = Crawlers

• 1000s of spiders
• Various purposes:
  – Search engines
  – Digital rights management
  – Advertising
  – Spam
  – Link checking – site validation

Page 11: Crawling HTML

Spiders (Crawlers, Bots)

• Queue := initial page URL0
• Do forever (a minimal sketch of this loop follows below)
  – Dequeue URL
  – Fetch P
  – Parse P for more URLs; add them to the queue
  – Pass P to (specialized?) indexing program
• Issues…
  – Which page to look at next?
    • keywords, recency, focus, ???
  – Avoid overloading a site
  – How deep within a site to go?
  – How frequently to visit pages?
  – Traps!
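
A minimal sketch of the loop above in Python. The regex link extraction, the print-only index() stub, the page cap, and the 10-second delay are illustrative assumptions, not the slides' reference implementation.

from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
import re, time

def index(url, html):
    # Stand-in for the (specialized?) indexing program.
    print(url, len(html))

def crawl(seed, max_pages=100, delay=10.0):
    queue = deque([seed])               # Queue := initial page URL0
    seen = {seed}
    while queue and max_pages > 0:      # "Do forever" (bounded here)
        url = queue.popleft()           # Dequeue URL
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")  # Fetch P
        except Exception:
            continue
        index(url, html)                # Pass P to the indexing program
        # Parse P for more URLs; add them to the queue
        for href in re.findall(r'href=["\'](.*?)["\']', html, re.I):
            link = urljoin(url, href)
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                queue.append(link)
        max_pages -= 1
        time.sleep(delay)               # crude politeness: don't hammer the site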

Page 12: Crawling HTML

Crawling Issues

• Storage efficiency
• Search strategy
  – Where to start
  – Link ordering
  – Circularities
  – Duplicates
  – Checking for changes
• Politeness
  – Forbidden zones: robots.txt
  – CGI & scripts
  – Load on remote servers
  – Bandwidth (download only what you need)
• Parsing pages for links
• Scalability
• Malicious servers: SEOs

Page 13: Crawling HTML

Robot Exclusion

• A person may not want certain pages indexed.
• Crawlers should obey the Robot Exclusion Protocol.
  – But some don't
• Look for the file robots.txt at the highest directory level
  – If the domain is www.ecom.cmu.edu, robots.txt goes in www.ecom.cmu.edu/robots.txt
• A specific document can be shielded from a crawler by adding the line:
  <META NAME="ROBOTS" CONTENT="NOINDEX">

Page 14: Crawling HTML

Robots Exclusion Protocol

• Format of robots.txt (a sample check follows below)
  – Two fields: User-agent to specify a robot
  – Disallow to tell the agent what to ignore
• To exclude all robots from a server:
  User-agent: *
  Disallow: /
• To exclude one robot from two directories:
  User-agent: WebCrawler
  Disallow: /news/
  Disallow: /tmp/
• View the robots.txt specification at http://info.webcrawler.com/mak/projects/robots/norobots.html
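
One hedged way to honor these rules from Python's standard library; the agent name and example URLs are placeholders taken from the slide, not a real policy.

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.ecom.cmu.edu/robots.txt")    # robots.txt at the top level
rp.read()                                           # fetch and parse the rules

# can_fetch() applies the User-agent / Disallow rules described above.
print(rp.can_fetch("WebCrawler", "http://www.ecom.cmu.edu/news/story.html"))
print(rp.can_fetch("*", "http://www.ecom.cmu.edu/index.html"))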

Page 15: Crawling HTML

Danger, Danger

• Ensure that your crawler obeys robots.txt
• Don't make any of the typical mistakes; in particular:
  – Provide contact info in the user-agent field
  – Monitor that email address
  – Notify the CS Lab Staff
  – Honor all "Do Not Scan" requests
  – Post any "stop-scanning" requests
  – "The scanee is always right."
  – Max 6 hits/server/minute

Page 16: Crawling HTML

Outgoing Links?

• Parse HTML…

• Looking for…what?

[Illustration: a jumble of page text and markup ("anns html foos Bar baz … A href = www.cs … Frame font zzz …") with the link buried inside — which part is the URL?]

Page 17: Crawling HTML

Which tags / attributes hold URLs?

Anchor tag: <a href="URL" …> … </a>

Option tag: <option value="URL" …> … </option>

Map: <area href="URL" …>

Frame: <frame src="URL" …>

Link to an image: <img src="URL" …>

Relative path vs. absolute path: <base href=…>

Bonus problem: Javascript

In our favor: Search Engine Optimization
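
A hedged sketch of extracting URLs from exactly these tags with Python's standard html.parser. The class name is made up, <base> handling is simplified, and JavaScript-generated links are ignored.

from html.parser import HTMLParser
from urllib.parse import urljoin

# Which attribute carries the URL for each tag listed above.
URL_ATTRS = {"a": "href", "area": "href", "frame": "src",
             "img": "src", "option": "value"}

class LinkExtractor(HTMLParser):
    def __init__(self, page_url):
        super().__init__()
        self.base = page_url            # may be replaced by <base href=...>
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and "href" in attrs:
            self.base = attrs["href"]   # relative vs. absolute paths
        elif tag in URL_ATTRS and URL_ATTRS[tag] in attrs:
            self.links.append(urljoin(self.base, attrs[URL_ATTRS[tag]]))

# Usage: p = LinkExtractor("http://www.cs.washington.edu/"); p.feed(html); p.links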

Page 18: Crawling HTML

Web Crawling Strategy

• Starting location(s)
• Traversal order
  – Depth first (LIFO)
  – Breadth first (FIFO)
  – Or ???
• Politeness
• Cycles?
• Coverage?

Page 19: Crawling HTML

Structure of Mercator Spider

1. Remove URL from queue
2. Simulate network protocols & REP
3. Read with RewindInputStream (RIS)
4. Has document been seen before? (checksums and document fingerprints)
5. Extract links
6. Download new URL?
7. Has URL been seen before?
8. Add URL to frontier

Page 20: Crawling HTML

URL Frontier (priority queue)

• Most crawlers do breadth-first search from seeds.
• Politeness constraint: don't hammer servers!
  – Obvious implementation: "live host table"
  – Will it fit in memory?
  – Is this efficient?
• Mercator's politeness (sketched below):
  – One FIFO subqueue per thread.
  – Choose subqueue by hashing the host's name.
  – Dequeue the first URL whose host has NO outstanding requests.
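
A hedged sketch of this Mercator-style politeness scheme; the subqueue count, the hash choice, and the busy-host rotation are illustrative assumptions, not Mercator's code.

from collections import deque
from urllib.parse import urlparse
import hashlib

NUM_SUBQUEUES = 8                               # roughly one per crawl thread
subqueues = [deque() for _ in range(NUM_SUBQUEUES)]

def enqueue(url):
    host = urlparse(url).netloc
    i = int(hashlib.md5(host.encode()).hexdigest(), 16) % NUM_SUBQUEUES
    subqueues[i].append(url)                    # choose subqueue by hashing the host

def dequeue(i, busy_hosts):
    # Return the first URL whose host has NO outstanding requests.
    q = subqueues[i]
    for _ in range(len(q)):
        url = q.popleft()
        if urlparse(url).netloc not in busy_hosts:
            return url
        q.append(url)                           # host busy: rotate it to the back
    return None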

Page 21: Crawling HTML

Fetching Pages

• Need to support http, ftp, gopher, …
  – Extensible!
• Need to fetch multiple pages at once.
• Need to cache as much as possible
  – DNS
  – robots.txt
  – Documents themselves (for later processing)
• Need to be defensive! (see the sketch below)
  – Need to time out http connections.
  – Watch for "crawler traps" (e.g., infinite URL names).
  – See section 5 of the Mercator paper.
  – Use a URL filter module
  – Checkpointing!
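
A hedged sketch of these defensive measures: a URL filter against crawler traps, a connection timeout, and a bounded read. All thresholds are illustrative assumptions.

from urllib.parse import urlparse
from urllib.request import urlopen

def url_filter(url, max_len=200, max_depth=15):
    # Reject suspiciously long or deep URLs ("infinite" URL-name traps).
    return len(url) <= max_len and urlparse(url).path.count("/") <= max_depth

def fetch(url, timeout=10, max_bytes=1_000_000):
    if not url_filter(url):
        return None
    try:
        return urlopen(url, timeout=timeout).read(max_bytes)   # time out + cap size
    except Exception:
        return None                     # a bad server shouldn't kill the crawler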

Page 22: Crawling HTML

Duplicate Detection

• URL-seen test: has this URL been seen before?
  – To save space, store a hash
• Content-seen test: different URL, same doc.
  – Suppress link extraction from mirrored pages.
• What to save for each doc?
  – 64-bit "document fingerprint" (sketched below)
  – Minimize the number of disk reads upon retrieval.
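
A hedged sketch of the two tests above; the 64-bit fingerprints here are truncated SHA-1 digests, an illustrative stand-in for Mercator's actual fingerprint function.

import hashlib

seen_url_hashes = set()         # URL-seen test: store hashes, not full URLs
seen_fingerprints = set()       # content-seen test

def url_seen(url):
    h = hashlib.sha1(url.encode()).digest()[:8]
    if h in seen_url_hashes:
        return True
    seen_url_hashes.add(h)
    return False

def content_seen(document_text):
    fp = hashlib.sha1(document_text.encode()).digest()[:8]   # 64-bit "document fingerprint"
    if fp in seen_fingerprints:
        return True             # same doc under a different URL: skip link extraction
    seen_fingerprints.add(fp)
    return False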

Page 23: Crawling HTML

Nutch: A simple architecture

• Seed set
• Crawl
• Remove duplicates
• Extract URLs (minus those we've been to)
  – new frontier
• Crawl again (the cycle is sketched below)
• Can do this with a Map/Reduce architecture
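
A hedged sketch of this crawl cycle; fetch_and_extract() is a hypothetical helper that fetches one page and returns its outgoing links, and the round count is arbitrary. Nutch itself runs this over Map/Reduce.

def crawl_cycles(seeds, fetch_and_extract, rounds=3):
    visited = set()
    frontier = set(seeds)                       # seed set
    for _ in range(rounds):                     # crawl, then crawl again
        extracted = set()
        for url in frontier:
            if url in visited:                  # remove duplicates
                continue
            extracted |= set(fetch_and_extract(url))
            visited.add(url)
        frontier = extracted - visited          # new frontier: URLs we haven't been to
    return visited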

Page 24: Crawling HTML

Mercator Statistics

PAGE TYPE    PERCENT
text/html      69.2%
image/gif      17.9%
image/jpeg      8.1%
text/plain      1.5%
pdf             0.9%
audio           0.4%
zip             0.4%
postscript      0.3%
other           1.4%

Exponentially increasing size

Outdated

Page 25: Crawling HTML

Advanced Crawling Issues

• Limited resources
  – Fetch the most important pages first
• Topic-specific search engines
  – Only care about pages which are relevant to the topic: "focused crawling"
• Minimize stale pages
  – Efficient re-fetch to keep the index timely
  – How to track the rate of change for pages?

Page 26: Crawling HTML

Focused Crawling

• Priority queue instead of FIFO
• How to determine priority?
  – Similarity of page to driving query
    • Use traditional IR measures
    • Exploration / exploitation problem
  – Backlinks
    • How many links point to this page?
  – PageRank (Google)
    • Some links to this page count more than others
  – Forward links of a page
  – Location heuristics
    • E.g., is the site in .edu?
    • E.g., does the URL contain 'home' in it?
  – Linear combination of the above (see the sketch below)
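
A hedged sketch of such a frontier: a heap ordered by a linear combination of the heuristics above. The weights and the scoring inputs are illustrative assumptions.

import heapq
from urllib.parse import urlparse

frontier = []                                   # min-heap of (-priority, url)

def priority(url, query_similarity, backlinks, pagerank):
    score = 0.5 * query_similarity + 0.2 * backlinks + 0.3 * pagerank
    host = urlparse(url).netloc
    if host.endswith(".edu"):
        score += 0.05                           # location heuristic: .edu site
    if "home" in url.lower():
        score += 0.05                           # location heuristic: 'home' in URL
    return score

def push(url, query_similarity, backlinks, pagerank):
    heapq.heappush(frontier, (-priority(url, query_similarity, backlinks, pagerank), url))

def pop():
    return heapq.heappop(frontier)[1]           # highest-priority URL next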

Page 27: Crawling HTML

Outline

• Search Engine Overview
• HTTP
• Crawlers
• Server Architecture

Page 28: Crawling HTML

Server Architecture

Page 29: Crawling HTML

Connecting on the WWW

[Diagram: clients connecting through the Internet to web servers]

Page 30: Crawling HTML

Client-Side View

[Diagram: the browser reaching web sites through the Internet]

• Content rendering engine: tags, positioning, movement
• Scripting language interpreter: document object model, events, the programming language itself
• Link to custom Java VM
• Security access mechanisms
• Plugin architecture + plugins

Page 31: Crawling HTML

Server-Side View

• Database-driven content
• Lots of users
• Scalability
• Load balancing
  – Often implemented with a cluster of PCs
• 24x7 reliability
• Transparent upgrades

[Diagram: clients connecting through the Internet to the server site]

Page 32: Crawling HTML

Trade-offs in Client/Server Arch.

• Compute on clients?
  – Complexity: many different browsers
    • {Firefox, IE, Safari, …} × version × OS
• Compute on servers?
  – Peak load, reliability, capital investment.
  + Access anywhere, anytime, any device
  + Groupware support (shared calendar, …)
  + Lower overall cost (utilization & debugging)
  + Simpler to update the service

Page 33: Crawling HTML

Dynamic Content

• We want to do more via an http request
  – E.g., we'd like to invoke code to run on the server.

• Initial solution: Common Gateway Interface (CGI) programs.

• Example: web page contains form that needs to be processed on server.

Page 34: Crawling HTML

CGI Code

• CGI scripts can be in any language.

• A new process is started (and terminated) with each script invocation (overhead!).

• Improvement I:
  – Run some code on the client's machine
  – E.g., catch missing fields in the form.
• Improvement II:
  – Server APIs (but these are server-specific).

Page 35: Crawling HTML

Java Servlets

• Servlets: applets that run on the server.
  – The Java VM stays resident; servlets run as threads.
• Accept data from the client + perform computation
• Platform-independent alternative to CGI.
• Can handle multiple requests concurrently
  – Synchronize requests – use for online conferencing
• Can forward requests to other servers
  – Use for load balancing

Page 36: Crawling HTML

Java Server Pages (JSP) / Active Server Pages (ASP)

• Allows mixing static HTML with dynamically generated content
• JSP is more convenient than servlets for the above purpose
• More recently: PHP & Ruby on Rails

<html>
  <head><title>Example #3</title></head>
  <body><?php print(date("m/j/y")); ?></body>
</html>

Page 37: Crawling HTML

AJAX

• Getting the browser to behave like your applications (caveat: asynchronous)
• Client rendering library (Javascript)
  – Widgets
• Talks to the server (XML)
• How do we keep state?
• Over-the-wire protocol: SOAP / XML-RPC / etc.

Page 38: Crawling HTML

Interlude: HTML 5

Page 39: Crawling HTML

Why HTML 5?

‘The websites of today are built with languages largely conceived during the mid to late 1990s, when the web was still in its infancy.’*

* Work on HTML 4 started in early 1997; CSS 2 was published in 1998.

Slide from David Penny, EMCDDA 11/09

Page 40: Crawling HTML

The website circa 1998

• Simple layout

• No frills design

• Text, text, text

Slide from David Penny, EMCDDA 11/09

Page 41: Crawling HTML

The website circa 2009

• Complex layout

• Fancy designs

• User-interactivity

The modern web page is sometimes like a book, sometimes like an application, sometimes like an extension of your TV. Current web languages were never designed to do this.

Slide from David Penny, EMCDDA 11/09

Page 42: Crawling HTML

HTML 5 & CSS 3

HTML 5
• Specifically designed for web applications
• Nice to search engines and screen readers
• HTML 5 will update HTML 4.01, DOM Level 2

CSS level 3
• Will make it easier to do complex designs
• Will look the same across all browsers
• CSS 3 will update CSS level 2 (CSS 2.1)

Slide from David Penny, EMCDDA 11/09

Page 43: Crawling HTML

HTML 5: today's markup

• Today, if we wanted to mark up this page we would use a lot of <div> tags and classes.
• Semantic value of <div> and 'class' = 0
• Can lead to 'divitis' and 'classitis'.

Slide from David Penny, EMCDDA 11/09

Page 44: Crawling HTML

HTML 5: new tags to the rescue

• Hello <header>, <nav>, <article>, <section>, and other new tags.

• It’s good for search engines, screen readers, information architects, and the web in general.

Slide from David Penny, EMCDDA 11/09

Page 45: Crawling HTML

HTML 5: at last, video + audio

• Currently, video and audio are handled by plugins (Flash, RealTime, etc.)

• The new <video> and <audio> tags, and their associated APIs, will be used the way the <img> tag is today

• Browsers will need to define how video and audio should be played (controls, interface, etc.)

Slide from David Penny, EMCDDA 11/09

Page 46: Crawling HTML

HTML 5: Web applications 1.0

• Web applications are a huge part of HTML 5.

• Some APIs include:
  – drag and drop,
  – canvas (drawing),
  – offline storage,
  – geo-location

Slide from David Penny, EMCDDA 11/09

Page 47: Crawling HTML

HTML 5: Form handling

• required attribute:
  – the browser checks for you that the data has been entered
• email input type:
  – a valid email must be entered
• url input type:
  – requires a valid web address

Slide from David Penny, EMCDDA 11/09

Page 48: Crawling HTML

Roadmap

• First W3C Working Draft in October 2007.
• Last Call Working Draft in October 2009.
• Candidate Recommendation in 2012.
• First and second drafts of the test suite in 2012 and 2015.
• Reissued Last Call Working Draft in 2020.
• Proposed Recommendation in 2022 (!)
• Current browsers have already started implementing HTML 5.

Note: today’s candidate recommendation status = yesterday’s recommendation status

Slide from David Penny, EMCDDA 11/09

Page 49: Crawling HTML

Server Architecture

Page 50: Crawling HTML

Connecting on the WWW

[Diagram: clients connecting through the Internet to web servers]

Page 51: Crawling HTML

Tiered Architectures

1-tier = dumb terminal + smart server.

2-tier = client/server.

3-tier = client / application server / database.

Why decompose the server?

Page 52: Crawling HTML

Two-Tier Architecture

Tier 1: Client

Tier 2: Server
  – The server performs all processing: web server, application server, and database server.

The server does too much work. Weak modularity.

Page 53: Crawling HTML

Three-Tier Architecture

Tier 1: Client

Tier 2: Server
  – Web server + application server

Tier 3: Backend
  – The application server offloads processing (e.g., the database) to tier 3.

Using 2 computers instead of 1 can result in a huge increase in simultaneous clients; it depends on the % of CPU time spent on database access. While the DB server waits on the DB, the Web server is busy!

Page 54: Crawling HTML

Getting to ‘Giant Scale’

• Only real option is cluster computing
• Optional backplane:
  – System-wide network for intra-server traffic: query redirect, coherence traffic for the store, updates, …

From: Brewer, "Lessons from Giant-Scale Services"

Page 55: Crawling HTML

Microsoft Server Farm, Quincy, WA

9th largest in the US (as of May 2010)

Page 56: Crawling HTML

Containerized Data Centers

• Factory-built in a shipping container
• Trucked to location; a forklift stacks them in the warehouse
• Connected to:
  – chilled water supply,
  – fiber-optic connection,
  – electrical plugs
• Self-provisioning + self-managed.

Page 57: Crawling HTML

Inside the Container

• Extreme symmetry
• Internal disks
• No monitors
• No visible cables
• No people!
• Offsite management
• Contracts limit power and temperature

From: Brewer, "Lessons from Giant-Scale Services"
Image: Microsoft Chicago data center

Page 58: Crawling HTML

High Availability

• Essential Objective

• Phone network, railways, water system

• Challenges
  – Component failures
  – Constantly evolving features
  – Unpredictable growth

From: Brewer, "Lessons from Giant-Scale Services"

Page 59: Crawling HTML

Architecture

• What do faults impact? Yield? Harvest?
• Replicated systems
  – Faults → reduced capacity (hence, reduced yield at high utilization)
• Partitioned systems
  – Faults → reduced harvest
  – Capacity (queries/sec) unchanged
• DQ Principle: the physical bottleneck
  – Data/Query × Queries/Sec = Constant (numeric illustration below)

From: Brewer, "Lessons from Giant-Scale Services"
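
A hedged numeric illustration of the DQ principle (the constant is made up): at the bottleneck, data-per-query times queries-per-second stays roughly fixed, so trading harvest buys capacity.

DQ = 1000.0                       # illustrative constant: (data/query) * (queries/sec)

full_harvest = 1.0                # each query touches all of the data
print(DQ / full_harvest)          # -> 1000 queries/sec

half_harvest = 0.5                # cut harvest by 2 ...
print(DQ / half_harvest)          # -> 2000 queries/sec: capacity doubles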

Page 60: Crawling HTML

Graceful Degradation

• Too expensive to avoid saturation
• Peak-to-average ratio
  – 1.6x to 6x or more
  – Moviefone: 10x capacity for Phantom Menace
• Not enough…
• Dependent faults (temperature, power)
  – Overall DQ drops way down
• Cutting harvest by 2 doubles capacity…

From: Brewer, "Lessons from Giant-Scale Services"

Page 61: Crawling HTML

Admission Control (AC) Techniques

• Cost-Based AC (sketched below)
  – Denying an expensive query allows 2 cheap ones
  – Inktomi
• Priority-Based (Value-Based) AC
  – Stock trades vs. quotes
  – Datek
• Reduced data freshness

From: Brewer, "Lessons from Giant-Scale Services"
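
A hedged sketch of cost-based admission control under saturation; the cost model, capacity figure, and threshold are illustrative assumptions, not Inktomi's policy.

def admit(query_cost, queued_cost, capacity, cheap_cost=1.0):
    # Plenty of headroom: admit everything.
    if queued_cost + query_cost <= capacity:
        return True
    # Saturated: deny an expensive query so that several cheap ones fit instead.
    return query_cost <= 2 * cheap_cost

print(admit(5.0, 99.0, 100.0))   # False: too expensive while saturated
print(admit(2.0, 99.0, 100.0))   # True: cheap enough to let through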