
IRLbot: Scaling to 6 Billion Pages and Beyond

Presented by Rohit Tummalapalli and Sashank Jupudi


Agenda

• Introduction  

• Challenges

• Overcoming the challenges
  o Scalability
  o Reputation and spam
  o Politeness

• Experimental results

• Conclusion


   Introduction

• Search engines consist of two fundamental components
  o web crawlers
  o data miners

• What is a web crawler? An automated program that downloads web pages, extracts their links, and follows them to discover new pages.

• Challenges
  o scalability
  o spam avoidance
  o politeness

     


Challenges - Scalability

• Crawling the Internet forces a trade-off between scalability (number of pages N), performance (crawling speed S), and resource usage (hardware footprint E).

• Previous work?

• Our goal


Challenges - Spam avoidance

• The crawler can get “bogged down” by the synchronized delay attacks of certain spammers.

• Prior research?

• Our goal


Challenges - Politeness

• Web crawlers often get in trouble with webmasters for slowing down their servers.

• Even with spam avoidance, the entire RAM eventually gets filled with URLs from a small set of hosts and the crawler simply chokes on its politeness.

• Previous algorithms do not address this problem.

• Our goal


Scalability

• The bottleneck is the cost of verifying the uniqueness of URLs and their compliance with robots.txt.

• We need an algorithm that can perform very fast checks for large N.

• The authors introduce DRUM (Disk Repository with Update Management).

• DRUM can store large volumes of arbitrary data on disk and implement very fast checks using bucket sort.

• It is more efficient and faster than prior disk-based methods.


Disk Repository with Update Management (DRUM)

• The purpose of DRUM is to allow efficient storage of large collections of <key, value> pairs
  o Key is a unique identifier of some data
  o Value is arbitrary information attached to the key

• Three supported operations: check, update, check+update
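The semantics of the three operations can be illustrated with a small Python model; the class name, the dict standing in for the on-disk cache, and the exact return values are assumptions made for illustration, not the authors' code.

```python
class DrumSemantics:
    """Semantic model of DRUM's three operations, using an in-memory dict in
    place of the on-disk cache Z (illustration only: the real structure batches
    requests through RAM buffers and disk buckets instead of touching the
    cache once per request)."""

    def __init__(self):
        self.cache = {}  # stands in for the disk cache Z: key -> value

    def check(self, key):
        """Classify key as duplicate or unique; return the stored value if duplicate."""
        if key in self.cache:
            return "duplicate", self.cache[key]
        return "unique", None

    def update(self, key, value):
        """Merge <key, value> into the cache (insert new keys, overwrite old values)."""
        self.cache[key] = value

    def check_update(self, key, value):
        """check followed by update, performed as one combined operation."""
        status, old_value = self.check(key)
        self.update(key, value)
        return status, old_value
```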


cont..

• Input: a continuous stream of tuples <key, value, aux>.

• DRUM spreads the <key, value> pairs across K disk buckets according to their key.

• All buckets are pre-allocated on disk before use.


cont..

• Once the largest bucket reaches a size r < R, the following process is repeated for all buckets i = 1..K:

1) Bucket QiH is read into the bucket buffer and sorted by key.

2) The disk cache Z is read sequentially and compared with the bucket buffer to separate duplicate keys from unique ones.

3) The disk cache Z is then updated with the unique keys.
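A compact Python sketch of this bucket-and-merge step, keeping the sorted cache Z and the K buckets in memory for brevity; the bucket count, size limit, and data layout are illustrative assumptions, and the real DRUM streams pre-allocated disk files instead of Python lists.

```python
K = 4                              # number of buckets (illustrative)
BUCKET_LIMIT = 1024                # merge once the largest bucket reaches this size
buckets = [[] for _ in range(K)]   # stand-ins for the K pre-allocated disk buckets
cache_z = []                       # stand-in for disk cache Z: sorted (key, value) list

def merge_bucket(bucket, cache):
    """One merge pass: 1) sort the bucket buffer by key, 2) walk the sorted
    cache and the bucket together (modelling the sequential read of Z) to
    classify keys as duplicate or unique, 3) return the updated cache."""
    bucket_sorted = sorted(bucket, key=lambda t: t[0])
    merged, unique, j = [], [], 0
    for key, value, aux in bucket_sorted:
        while j < len(cache) and cache[j][0] < key:
            merged.append(cache[j])
            j += 1
        if j < len(cache) and cache[j][0] == key:
            continue                              # duplicate of a cached key
        if merged and merged[-1][0] == key:
            continue                              # duplicate within this bucket
        merged.append((key, value))               # unique key: goes into Z
        unique.append((key, value, aux))
    merged.extend(cache[j:])
    return merged, unique

def drum_check_update(key, value, aux):
    """Spread incoming <key, value, aux> tuples across the K buckets by key,
    and trigger a merge pass once any bucket reaches the size limit."""
    global cache_z
    buckets[key % K].append((key, value, aux))
    if max(len(b) for b in buckets) >= BUCKET_LIMIT:
        for i in range(K):                        # repeat for all i = 1..K
            cache_z, unique = merge_bucket(buckets[i], cache_z)
            buckets[i] = []
            # the aux data of unique keys (e.g. URL text) would be handed
            # back to the crawler here
```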


DRUM used for storing crawler data

DRUM objects 

• URLseen
  o check+update operation

• RobotsCache
  o check and update operations
  o caches robots.txt files

• RobotsRequested
  o stores hashes of sites for which robots.txt has been requested
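Continuing the illustrative DrumSemantics model from above, the three objects could be instantiated as follows; the hash choice and value formats are assumptions.

```python
import hashlib

def key_of(text: str) -> int:
    """8-byte hash used as a DRUM key (an illustrative choice of hash)."""
    return int.from_bytes(hashlib.sha1(text.encode()).digest()[:8], "big")

url_seen = DrumSemantics()          # URL hashes; check+update
robots_cache = DrumSemantics()      # host hashes -> cached robots.txt text; check / update
robots_requested = DrumSemantics()  # host hashes whose robots.txt has been requested
```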


Organisation of IRLbot

• URLseen discards duplicate URLs; unique ones are sent for BUDGET and STAR checks.

• URLs pass or fail based on matching against the cached robots.txt rules.

• URLs whose robots.txt is not yet cached are sent back to QR, and their host names are passed through RobotsRequested.

• Site hashes not already present are fed into QD, which performs DNS lookups.
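Putting the pieces together, a simplified version of this flow might look as follows; the budget/STAR step is omitted, and the allowed_by_robots helper is invented for illustration while QR/QD follow the slide's queue names.

```python
from collections import deque

ready_queue = deque()   # URLs cleared for download
queue_r = deque()       # QR: URLs waiting for their host's robots.txt
queue_d = deque()       # QD: hosts waiting for DNS lookup + robots.txt download

def allowed_by_robots(robots_txt, url):
    """Placeholder for a real robots.txt rule check."""
    return True

def process_parsed_url(url: str, host: str):
    """Route one freshly parsed URL through URLseen and the robots caches."""
    status, _ = url_seen.check_update(key_of(url), b"")
    if status == "duplicate":
        return                                   # duplicate URL: discard
    rstatus, robots_txt = robots_cache.check(key_of(host))
    if rstatus == "duplicate":
        # robots.txt is cached: the URL passes or fails based on its rules
        if allowed_by_robots(robots_txt, url):
            ready_queue.append(url)
    else:
        # robots.txt not cached yet: park the URL in QR and remember the
        # request so that each host enters QD (DNS + robots download) only once
        queue_r.append(url)
        req_status, _ = robots_requested.check_update(key_of(host), b"")
        if req_status == "unique":
            queue_d.append(host)
```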

Page 15: IRLbot: Scaling to 6 Billion Pages and Beyond Presented by rohit tummalapalli sashank jupudi

    

• Overhead metric ω: the number of bytes written to/read from disk during the uniqueness checks of the lN parsed URLs
  o ω = α · blN
  o α is the number of times these URLs are written to/read from disk
  o blN is the number of bytes in all parsed URLs (b bytes per URL, l links per page, N crawled pages)

• Maximum download rate (in pages/s) supported by the disk portion of URL uniqueness checks, assuming the disk sustains an average read/write bandwidth of D bytes/s:

  S_disk = N · D / ω = D / (α · b · l)
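A quick back-of-the-envelope check of these formulas in Python; every number below (b, l, α, disk bandwidth) is made up for illustration and is not a measurement from the paper.

```python
N = 6e9        # pages to crawl
l = 50         # average links parsed per page (assumed)
b = 70         # average bytes per URL (assumed)
alpha = 4      # times each URL is written to / read from disk (assumed)
D = 100e6      # sustainable disk read/write bandwidth in bytes/s (assumed)

omega = alpha * b * l * N          # total disk I/O for the uniqueness checks
s_disk = D / (alpha * b * l)       # = N * D / omega, in pages/s
print(f"omega = {omega:.2e} bytes, supported crawl rate = {s_disk:.0f} pages/s")
```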


Spam avoidance

• BFS is a poor crawling order in the presence of spam farms and infinite webs.

• Simply restricting the branching factor or the maximum number of pages/hosts per domain is not a viable solution.

• Computing traditional PageRank for each page could be prohibitively expensive in large crawls.

• Spam can be effectively deterred by budgeting the number of allowed pages per pay-level domain (PLD).
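Budgeting per PLD first requires mapping every host to its pay-level domain (the domain name one actually registers, e.g. example.com or bbc.co.uk). A toy mapper with a tiny hard-coded suffix set is sketched below; a real crawler would use the complete public suffix list.

```python
# Tiny illustrative suffix set; a real crawler would load the full public suffix list.
PUBLIC_SUFFIXES = {"com", "org", "net", "edu", "gov", "co.uk", "ac.uk", "com.au"}

def pld(host: str) -> str:
    """Return the pay-level domain of `host`: one label plus the longest
    matching public suffix (simplified rule)."""
    labels = host.lower().strip(".").split(".")
    for i in range(len(labels) - 1):
        if ".".join(labels[i + 1:]) in PUBLIC_SUFFIXES:
            return ".".join(labels[i:])
    return host   # fallback if no suffix matched

# pld("news.bbc.co.uk") -> "bbc.co.uk";  pld("a.b.example.com") -> "example.com"
```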


cont..

• This solution is called Spam Tracking and Avoidance through Reputation (STAR).

• Each PLD x has a budget Bx that represents the number of pages allowed to pass from x (including all of its subdomains) to the crawling threads every T time units.

• Each PLD x starts with a default budget B0, which is then dynamically adjusted as its in-degree dx changes over time.

• DRUM is used to store PLD budgets and aggregate PLD-PLD link information.
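A sketch of the bookkeeping this implies; the mapping from in-degree dx to budget Bx below is a placeholder (the paper's actual adjustment function and constants are not reproduced here), and in IRLbot the PLD-PLD link aggregation itself runs through DRUM on disk rather than in RAM.

```python
from collections import defaultdict

B0 = 10           # default budget for a newly seen PLD (illustrative)
B_MAX = 10_000    # cap for very highly linked PLDs (illustrative)

pld_in_links = defaultdict(set)         # PLD x -> set of PLDs linking to x
pld_budget = defaultdict(lambda: B0)    # PLD x -> current budget Bx

def record_pld_link(src_pld: str, dst_pld: str):
    """Aggregate PLD-PLD link information as pages are parsed."""
    if src_pld != dst_pld:
        pld_in_links[dst_pld].add(src_pld)

def recompute_budgets():
    """Every T time units, re-derive Bx from the current in-degree dx.
    The formula below is a placeholder, not the paper's function."""
    for x, linkers in pld_in_links.items():
        d_x = len(linkers)
        pld_budget[x] = min(B_MAX, B0 + d_x)
```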


Politeness

• Prior work has only enforced a certain per-host access delay of τh seconds
  o Easy to crash servers that co-locate 1000s of virtual hosts.

• To admit URLs into RAM we have a method called Budget Enforcement with Anti-Spam Tactics (BEAST).

• BEAST does not discard URLs, but rather delays their download until it knows more about the PLD they belong to.


cont..

• A naive implementation maintains two queues
  o Q contains URLs that passed the budget check
  o QF contains those that failed

• After Q is emptied, QF is read and again split into Q and QF; these budget re-checks happen at increasing intervals.

• For N → ∞, the required disk speed λ → 2Sb(2L − 1) = constant

• It is roughly four times the speed needed to write all unique URLs to disk as they are discovered during the crawl
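A sketch of the naive two-queue scheme described above, reusing pld() and pld_budget from the earlier sketches; the re-check policy shown (reset spending and replay QF whenever Q drains) is a simplification of the increasing-interval behaviour.

```python
from collections import defaultdict, deque

q = deque()                 # Q: (url, host) pairs that passed the budget check
qf = deque()                # QF: pairs that failed and wait for the next pass
spent = defaultdict(int)    # pages admitted per PLD in the current period

def admit(url: str, host: str):
    """Budget check: URLs over budget are parked in QF, never discarded."""
    x = pld(host)
    if spent[x] < pld_budget[x]:
        spent[x] += 1
        q.append((url, host))
    else:
        qf.append((url, host))

def next_url():
    """Serve Q; once Q drains, start a new budget period and replay QF
    through the budget check, splitting it again into Q and QF."""
    if not q and qf:
        spent.clear()
        for _ in range(len(qf)):
            admit(*qf.popleft())
    return q.popleft() if q else None
```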

 


Experiments - summary

• Active crawling period of 41 days in summer 2007.

• IRLbot attempted 7.6 billion connections and received 7.4 billion valid HTTP replies.

• Average download rate of 319 Mbit/s (1,789 pages/s).


Conclusion

• This paper tackled the issue of scaling web crawlers to billions and even trillions of pages

  o Single server with constant CPU, disk, and memory speed

• We identified several bottlenecks in building an efficient large-scale crawler and presented our solutions to these problems
  o Low-overhead disk-based data structures
  o Non-BFS crawling order
  o Real-time reputation to guide the crawling rate


Future work

• Refining the reputation algorithms and assessing their performance.

• Mining the collected data.