Web Crawlers
Oct 28, 2010

Web Crawlers – University of Washington. Source: courses.washington.edu/ir2010/crawlers.pdf (2010-11-16). Slides adapted from "Introduction to Information Retrieval" by Christopher Manning and Prabhakar Raghavan.


Page 1

Web Crawlers
Oct 28, 2010

Page 2

What’s a website?

Page 3

Introduction to Information Retrieval
From Christopher Manning and Prabhakar Raghavan

Basic crawler operation

• Begin with known "seed" URLs
• Fetch and parse them
• Extract URLs they point to
• Place the extracted URLs on a queue
• Fetch each URL on the queue and repeat

Sec. 20.2
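A minimal sketch of this loop in Python (not from the slides; link extraction is a crude regex, and politeness is ignored here, both addressed later in the deck):

    # Naive breadth-first crawler: seed queue, fetch, extract links, repeat.
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen
    import re

    def crawl(seed_urls, max_pages=100):
        frontier = deque(seed_urls)      # queue of URLs waiting to be fetched
        seen = set(seed_urls)            # URLs already placed on the queue
        while frontier and len(seen) < max_pages:
            url = frontier.popleft()
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except OSError:
                continue                 # skip pages that fail to fetch
            for link in re.findall(r'href="([^"]+)"', html):
                link = urljoin(url, link)             # normalize relative URLs
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    frontier.append(link)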

Page 4

Crawling picture

[Figure: seed pages feed the URL frontier; URLs are crawled and parsed, with the rest of the unseen Web beyond]

Sec. 20.2

Page 5

How do we determine the seed URLs?

Page 6

Simple picture – complications

• Web crawling isn't feasible with one machine
  • All of the above steps must be distributed
• Malicious pages
  • Spam pages
  • Spider traps – incl. dynamically generated ones
• Even non-malicious pages pose challenges
  • Latency/bandwidth to remote servers vary
  • Webmasters' stipulations
  • How "deep" should you crawl a site's URL hierarchy?
  • Site mirrors and duplicate pages
• Politeness – don't hit a server too often

Sec. 20.1.1

Page 7

What any crawler must do

• Be polite: respect implicit and explicit politeness considerations
  • Only crawl allowed pages
  • Respect robots.txt (more on this shortly)
• Be robust: be immune to spider traps and other malicious behavior from web servers

Sec. 20.1.1
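Robustness is partly defensive coding; one illustrative guard against spider traps (the limits are arbitrary assumptions, and the per-host counter would be incremented by the fetch loop, not shown):

    # Reject URLs that look machine-generated and stop crawling hosts
    # that have already yielded an implausible number of pages.
    from collections import Counter
    from urllib.parse import urlparse

    MAX_URL_LENGTH = 256        # endless calendars/query strings grow without bound
    MAX_PAGES_PER_HOST = 5000   # assumed cap per host
    pages_per_host = Counter()

    def looks_safe(url):
        if len(url) > MAX_URL_LENGTH:
            return False
        return pages_per_host[urlparse(url).netloc] < MAX_PAGES_PER_HOST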

Page 8

What any crawler should do

• Be capable of distributed operation: designed to run on multiple distributed machines
• Be scalable: designed to increase the crawl rate by adding more machines
• Performance/efficiency: permit full use of available processing and network resources

Sec. 20.1.1

Page 9

What any crawler should do

• Fetch pages of "higher quality" first
• Continuous operation: continue fetching fresh copies of a previously fetched page
• Extensible: adapt to new data formats, protocols

Sec. 20.1.1

Page 10

Updated crawling picture

[Figure: as before – seed pages, URL frontier, URLs crawled and parsed, unseen Web – but now multiple crawling threads pull from the frontier]

Sec. 20.1.1

Page 11

URL frontier

• Can include multiple pages from the same host
• Must avoid trying to fetch them all at the same time
• Must try to keep all crawling threads busy

Sec. 20.2

Page 12

Explicit and implicit politeness

• Explicit politeness: specifications from webmasters on what portions of a site can be crawled (robots.txt)
• Implicit politeness: even with no specification, avoid hitting any site too often

Sec. 20.2

Page 13

Robots.txt

• Protocol for giving spiders ("robots") limited access to a website, originally from 1994: www.robotstxt.org/wc/norobots.html
• A website announces its requests on what can(not) be crawled
  • For a URL, create a file URL/robots.txt
  • This file specifies access restrictions

Sec. 20.2.1

Page 14

Robots.txt example

No robot should visit any URL starting with "/yoursite/temp/", except the robot called "searchengine":

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Sec. 20.2.1
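Python's standard library can interpret a file like this one; a quick check (the site URL is hypothetical):

    # Ask whether a given agent may fetch a given URL under these rules.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("http://www.example.com/robots.txt")  # hypothetical host
    rp.read()                                                  # fetch and parse the file
    rp.can_fetch("*", "http://www.example.com/yoursite/temp/a.html")             # False
    rp.can_fetch("searchengine", "http://www.example.com/yoursite/temp/a.html")  # True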

Page 15

Processing steps in crawling

• Pick a URL from the frontier (which one?)
• Fetch the document at the URL
• Parse the document
  • Extract links from it to other docs (URLs)
• Check if the URL has content already seen
  • If not, add to indexes
• For each extracted URL
  • Ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.)
  • Check if it is already in the frontier (duplicate URL elimination)

Sec. 20.2.1
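These steps assembled into one sketch (the MD5 fingerprint stands in for the shingle schemes discussed later, the filter is a toy .edu rule, and the index is a plain dict):

    # One iteration of the processing loop: fetch, content-seen test,
    # index, then filter and duplicate-eliminate the extracted URLs.
    import hashlib, re
    from urllib.parse import urljoin
    from urllib.request import urlopen

    def passes_filters(url):
        return ".edu/" in url                 # e.g., only crawl .edu

    def process_one(frontier, index, seen_fps, seen_urls):
        url = frontier.popleft()              # pick a URL (which one? a policy choice)
        html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        fp = hashlib.md5(html.encode()).hexdigest()
        if fp not in seen_fps:                # content already seen?
            seen_fps.add(fp)
            index[url] = html                 # if not, add to indexes
        for link in re.findall(r'href="([^"]+)"', html):
            link = urljoin(url, link)
            if passes_filters(link) and link not in seen_urls:  # dup elimination
                seen_urls.add(link)
                frontier.append(link)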

Page 16

Basic  crawl  architecture

Sec. 20.2.1

Page 17

DNS (Domain Name System)

• A lookup service on the internet: given a URL's hostname, retrieve its IP address
• The service is provided by a distributed set of servers – thus, lookup latencies can be high (even seconds)
• Common OS implementations of DNS lookup are blocking: only one outstanding request at a time
• Solutions
  • DNS caching
  • Batch DNS resolver – collects requests and sends them out together

Sec. 20.2.2
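Both solutions in one sketch (standard library only; a real resolver would also honor DNS TTLs, which this cache ignores, and handle lookup failures):

    # Cache answers and resolve a batch of hosts on worker threads so
    # blocking gethostbyname calls do not stall the crawl threads.
    import socket
    from concurrent.futures import ThreadPoolExecutor
    from functools import lru_cache

    @lru_cache(maxsize=100_000)
    def resolve(host):
        return socket.gethostbyname(host)

    def resolve_batch(hosts):
        with ThreadPoolExecutor(max_workers=32) as pool:   # arbitrary pool size
            return dict(zip(hosts, pool.map(resolve, hosts)))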

Page 18

DNS in action: dig +trace www.djp3.net

[Figure: iterative resolution, walking down from the root name server to the .net name server to the djp3.net name server]

1. Ask a root name server ({A}.ROOT-SERVERS.NET = 198.41.0.4): "Where is www.djp3.net?" Answer: "Ask 192.5.6.30" ({A}.GTLD-SERVERS.net, the .net name server).
2. Ask 192.5.6.30. Answer: "Ask 72.1.140.145" (ns1.speakeasy.net, the djp3.net name server).
3. Ask 72.1.140.145. Answer: "Use 69.17.116.124" (www.djp3.net = 69.17.116.124).
4. Ask 69.17.116.124: "Give me a web page."

Page 19

Page 20

Question

• How do I know if I’ve seen this before?

• Am I stuck in a loop?

Page 21

Parsing: URL normalization

• When a fetched document is parsed, some of the extracted links are relative URLs
• E.g., http://en.wikipedia.org/wiki/Main_Page has a relative link to /wiki/Wikipedia:General_disclaimer, which is the same as the absolute URL http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer
• During parsing, we must normalize (expand) such relative URLs

Sec. 20.2.1
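The standard library does this expansion; the Wikipedia example above:

    # Expand a relative link against the page it appeared on.
    from urllib.parse import urljoin

    base = "http://en.wikipedia.org/wiki/Main_Page"
    print(urljoin(base, "/wiki/Wikipedia:General_disclaimer"))
    # -> http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer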

Page 22

Content seen?

• Duplication is widespread on the web
• If the page just fetched is already in the index, do not process it further
• This is verified using document fingerprints or shingles – a type of hashing scheme

Sec. 20.2.1
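A sketch of the shingle idea (4-word shingles compared by Jaccard overlap; production systems sample or hash the shingle sets rather than keep them all):

    # Two pages are near-duplicates if their k-word shingle sets overlap heavily.
    def shingles(text, k=4):
        words = text.split()
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    def near_duplicate(a, b, threshold=0.9):
        sa, sb = shingles(a), shingles(b)
        return len(sa & sb) / max(1, len(sa | sb)) >= threshold   # Jaccard overlap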

Page 23

Filters and robots.txt

• Filters – regular expressions for URLs to be crawled or not
• Once a robots.txt file is fetched from a site, we need not fetch it repeatedly
  • Doing so burns bandwidth and hits the web server
• Cache robots.txt files (a sketch follows below)

Sec. 20.2.1
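A per-host cache sketch (the 24-hour refresh interval is an assumed policy, not from the slides):

    # Keep one parsed robots.txt per host; refetch only when it goes stale.
    import time
    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    _robots = {}                     # host -> (parser, time fetched)
    MAX_AGE = 24 * 3600              # assumed refresh interval

    def allowed(url, agent="*"):
        host = urlparse(url).netloc
        parser, fetched = _robots.get(host, (None, 0.0))
        if parser is None or time.time() - fetched > MAX_AGE:
            parser = RobotFileParser(f"http://{host}/robots.txt")
            parser.read()
            _robots[host] = (parser, time.time())
        return parser.can_fetch(agent, url)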

Page 24

Duplicate elimination

• One-time crawl:
  • Test whether an extracted, parsed, filtered URL
    • has already been sent to the frontier
    • has already been indexed

Page 25

Duplicate elimination

• Continuous crawl:
  • Update the URL's priority, based on:
    • staleness
    • quality
    • politeness
  • (one way to combine these is sketched below)
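How the three factors combine is a design choice the slides leave open; one hypothetical weighting, just to make the idea concrete:

    # Hypothetical re-crawl priority: stale, high-quality pages on hosts
    # we have not contacted recently float to the top.
    def priority(staleness, quality, seconds_since_host_hit, min_gap=10.0):
        politeness = min(1.0, seconds_since_host_hit / min_gap)   # 0 if just hit
        return staleness * quality * politeness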

Page 26

Distributing the crawler

• Run multiple crawl threads, under different processes – potentially at different nodes
  • Geographically distributed nodes
• Partition the hosts being crawled across the nodes
  • A hash of the hostname is used for the partition (a one-line sketch follows below)
• How do these nodes communicate?

Sec. 20.2.1
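The partition itself is a single hash, sketched here assuming a fixed node count (a stable digest is used because Python's built-in hash() is randomized per process):

    # Route each host to a crawler node; every node owns a disjoint host set.
    import hashlib
    from urllib.parse import urlparse

    def node_for(url, num_nodes):
        host = urlparse(url).netloc
        digest = hashlib.md5(host.encode()).digest()
        return int.from_bytes(digest[:4], "big") % num_nodes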

Page 27

Communication between nodes

Sec. 20.2.1

Page 28

URL frontier: two main considerations

• Politeness: do not hit a web server too frequently
• Freshness: crawl some pages more often than others
  • E.g., pages (such as news sites) whose content changes often

These goals may conflict with each other. (E.g., a simple priority queue fails – many links out of a page go to its own site, creating a burst of accesses to that site.)

Sec. 20.2.3

Page 29

Politeness – challenges

• Even if we restrict each host to a single fetching thread, we can still hit it repeatedly
• Common heuristic: insert a time gap between successive requests to a host that is >> the time taken by the most recent fetch from that host (a sketch follows below)

Sec. 20.2.3
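A sketch of that heuristic (the factor of 10 standing in for ">>" is an assumption):

    # Per-host gate: after each fetch, block further requests to the host
    # for much longer than the fetch itself took.
    import time

    next_allowed = {}                       # host -> earliest next-contact time

    def polite_fetch(host, fetch_fn, gap_factor=10):
        wait = next_allowed.get(host, 0.0) - time.time()
        if wait > 0:
            time.sleep(wait)
        start = time.time()
        result = fetch_fn()                 # the actual HTTP fetch
        next_allowed[host] = time.time() + gap_factor * (time.time() - start)
        return result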

Page 30

Exercise...

Page 31

Crawl a site

Page 32

Ethics

Page 33

What should I crawl?

• robots.txt

• Facebook pages?

• Change in Service

• Terms of Service (TOS?)
