Hypersearching the Web, Chakrabarti, Soumen Presented By Ray Yamada

Overview

• Why Do We Care?

• Purpose of The Paper?

• Solution by Clever Project

• Pros / Cons of the Paper

• Further Research

Why Do We Care?

• Web Link Analysis is crucial for efficient Crawling and Ranking algorithms

• Crawling: Google Sitemap Submission, Yahoo Directory

• Ranking: Relevant Results

Purpose of The Paper?

• To Overcome These Challenges:

– Its Size & Growth

– Its Content Types

– Language Semantics

– New Language

– Staleness of Results

– SPAM

– And More…

Solution: Hyperlinks, Hyperlinks, Hyperlinks…

• Can Think of the Web as a Directed Graph

• Node = Web page (URL)

• Edge = Hyperlink
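
A minimal sketch of this graph view in Python. The URLs are made up for illustration, and the names `links`, `out_links`, and `in_links` are assumptions, not something taken from the paper.

```python
from collections import defaultdict

# Hypothetical hyperlinks as (source URL, target URL) pairs.
# Each pair is one directed edge of the Web graph.
links = [
    ("a.example/index", "b.example/news"),
    ("a.example/index", "c.example/blog"),
    ("c.example/blog", "b.example/news"),
]

out_links = defaultdict(set)  # page -> pages it links to
in_links = defaultdict(set)   # page -> pages that link to it
for src, dst in links:
    out_links[src].add(dst)
    in_links[dst].add(src)

# Every URL that appears as a source or a target is a node.
nodes = set(out_links) | set(in_links)
```

Keeping both directions makes it cheap to ask "who links to this page?" as well as "what does this page link to?", which link analysis needs.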

Solution: HITS Algorithm

• Hyperlink-Induced Topic Search (HITS)

– A.k.a. Hubs and Authorities

• Hubs: Highly-valued lists for a given query

– Ex. Yahoo Directory, Open Directory Project and bookmarking sites

• Authorities: Highly endorsed answers to the query

– Ex. New York Times, Huffington Post, Twitter

• It is possible for a webpage to be both Hub and Authority

– Ex. Restaurant Review Blogs

Solution: HITS Algorithm Cont…

• For each page p, we assign it two values hub(p) and auth(p)

• Initial Value: For all p, hub(p) = 1, auth(p) = 1 (or any predetermined number)

• Authority Update Rule: For each page p, update auth(p) to be the sum of the hub scores of all pages that point to it.

• Hub Update Rule: For each page p, update hub(p) to be the sum of the authority scores of all pages that it points to.

• Normalize and Repeat
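
The steps on this slide can be sketched in Python as below. This is an illustrative sketch, not the Clever Project's implementation: the function name `hits`, the fixed iteration count, and the choice to normalize so that scores sum to 1 are all assumptions (Kleinberg's paper normalizes by the sum of squares instead).

```python
def hits(out_links, iterations=20):
    """Run the hub/authority updates on a small link graph.

    out_links maps each page to the list of pages it links to.
    Returns (hub, auth), two dicts of scores that each sum to 1.
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)

    # Initial value: every page starts with hub = auth = 1.
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum the hub scores of pages pointing to p.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in out_links.items():
            for dst in targets:
                new_auth[dst] += hub[src]

        # Hub update: sum the authority scores of pages p points to.
        new_hub = {
            p: sum(new_auth[dst] for dst in out_links.get(p, ()))
            for p in pages
        }

        # Normalize (assumption: divide by the total so scores sum to 1).
        a_total = sum(new_auth.values()) or 1.0
        h_total = sum(new_hub.values()) or 1.0
        auth = {p: v / a_total for p, v in new_auth.items()}
        hub = {p: v / h_total for p, v in new_hub.items()}

    return hub, auth


# Hypothetical graph: two list-like pages pointing at two candidate answers.
example = {"h1": ["a1"], "h2": ["a1", "a2"]}
hub, auth = hits(example)
```

In this toy graph `a1` ends up with the highest authority score because both hubs point to it, and `h2` ends up as the stronger hub because it points to both authorities.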

Solution: HITS Algorithm Cont…

Calculation

Hub(p)      Num of Links   Raw Score
0.249       3              0.747
0.321       4              1.284
0.181       2              0.362
0.123       2              0.246
0.088       2              0.176
0.015       1              0.015
0.018       2              0.036
0.003       1              0.003
0.003       1              0.003
Sum: 1.00                  2.872

Authority Pages (q)   Raw Score   Auth(q)
SJ Merc News          0.57        0.198
Wall St. Journal      0.57        0.198
New York Times        0.874       0.304
USA Today             0.59        0.205
Facebook              0.123       0.043
Yahoo!                0.121       0.042
Amazon                0.024       0.008
Sum:                              1.000
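
The normalization step behind the Auth(q) column can be checked with a few lines of Python. The raw scores are the ones listed on this slide; the variable names are illustrative, and dividing by the column total is inferred from the fact that the Auth(q) values sum to 1.

```python
# Raw authority scores as listed on the slide.
raw_auth = {
    "SJ Merc News": 0.57,
    "Wall St. Journal": 0.57,
    "New York Times": 0.874,
    "USA Today": 0.59,
    "Facebook": 0.123,
    "Yahoo!": 0.121,
    "Amazon": 0.024,
}

# Normalize: divide each raw score by the total of all raw scores.
total = sum(raw_auth.values())
auth = {page: round(score / total, 3) for page, score in raw_auth.items()}
```

The total of the raw authority scores comes to 2.872, the same as the total of the Raw Score column in the hub table, and dividing 0.874 by it gives the 0.304 shown for the New York Times.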

Pros:

– Accurately addresses concerns and challenges we currently deal with

– Great introduction to search engine algorithms

– Briefly covered many topics (Breadth)

Cons:

– Some materials are out of date (1999)

• Ex. Google vs. Clever Project

– Lack of Depth

• Ex. Normalization of Hub and Auth values

Further Research: HITS Algorithm – Extreme Cases

• Large-in-small-out sites

– High Auth(p)

– No Problem

• Small-in-large-out sites

– High Hub(p)

– Problem

Further Research: HITS + Relevance Scoring Method

• Vector Space Model (VSM)

– Documents and queries are represented by vectors

– Term Frequency

• Okapi Measurement

– Term Frequency + Document Length

• Cover Density Ranking (CDR)

– Phrase Similarity (How close terms appear)

Further Research: HITS + Relevance Scoring Method

• Use Cosine Relevance Test

(Figure: term axes labeled "Price" and "Car")
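
A minimal sketch of a cosine relevance test over the two terms shown on the slide. The term counts, the document names, and the helper `cosine` are made up for illustration; real systems usually weight terms rather than using raw counts.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse term vectors (dicts)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as irrelevant
    return dot / (norm_u * norm_v)


# Hypothetical term counts for a query and two documents.
query = {"car": 1, "price": 1}
doc_a = {"car": 3, "price": 2}  # mentions both terms
doc_b = {"car": 0, "price": 4}  # mentions only "price"

score_a = cosine(query, doc_a)
score_b = cosine(query, doc_b)
```

Here `doc_a` scores higher than `doc_b` because its vector points in nearly the same direction as the query, regardless of how long either document is.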

Further Research: HITS + Relevance Scoring Method

• Three-Level Scoring Method (TLS)

– Manual Evaluation of Relevance

• Relevant Links = 2 points

• Slightly Relevant Links = 1 point

• Inactive Links + Error Links (404, 603) = 0 points

• Irrelevant Links = 0 points

– Order of query terms matters
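
The point values above can be written down as a small lookup. This sketch only illustrates the per-link points from the slide; summing them into one total is an assumption made here for illustration, not the aggregation defined in the cited paper, and the label names are hypothetical.

```python
# Points per manually assigned judgment, as listed on the slide.
POINTS = {
    "relevant": 2,
    "slightly_relevant": 1,
    "inactive": 0,
    "error": 0,
    "irrelevant": 0,
}


def tls_points(judgments):
    """Add up the points for a list of manual judgments (assumed aggregation)."""
    return sum(POINTS[label] for label in judgments)


# Hypothetical judgments for the first five results of one query.
judgments = ["relevant", "slightly_relevant", "error", "irrelevant", "relevant"]
total = tls_points(judgments)
```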

Further Research: Co-citation Graph

• Regular Link Graph:

• Co-citation Graph:
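
A minimal sketch of how a co-citation graph can be derived from a regular link graph, using the standard definition that two pages are co-cited when some third page links to both. The page names and the function `cocitation` are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def cocitation(out_links):
    """Count, for each pair of pages, how many pages link to both."""
    counts = Counter()
    for targets in out_links.values():
        # Every pair of distinct targets of one page is co-cited once.
        for a, b in combinations(sorted(set(targets)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical regular link graph: page -> pages it links to.
out_links = {
    "p1": ["x", "y"],
    "p2": ["x", "y", "z"],
    "p3": ["z"],
}
cc = cocitation(out_links)
```

In this example `x` and `y` are co-cited twice (by `p1` and `p2`), so the edge between them in the co-citation graph carries weight 2, while `p3` contributes nothing because it links to only one page.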

What’s Next?

• Google’s New Search Index: Caffeine

– Announced June 8th, 2010

– Up to 50% fresher results

– Twice as fast

• Real Time Search

– Twitter / Facebook

http://googleblog.blogspot.com/2010/06/our-new-search-index-caffeine.html

References

• Chakrabarti, Soumen; Dom, Byron; Kumar, S. Ravi; Raghavan, Prabhakar; Rajagopalan, Sridhar & Tomkins, Andrew (1999). "Hypersearching the Web". Scientific American, June 1999.

• Li, Longzhuang; Shang, Yi & Zhang, Wei (2002). "Improvement of HITS-based algorithms on web documents". Proceedings of the 11th International Conference on World Wide Web, May 7-11, 2002, Honolulu, Hawaii, USA. doi:10.1145/511446.511514.

• Henzinger, M. (2001). Hyperlink analysis for the Web. IEEE Internet Computing, 5(1), 45-50.

• Kleinberg, Jon (1999). "Authoritative sources in a hyperlinked environment" (PDF). Journal of the ACM 46 (5): 604–632. doi:10.1145/324133.324140.

• von Ahn, Luis (2008-10-19). "Hubs and Authorities" (PDF). 15-396: Science of the Web Course Notes. Carnegie Mellon University. Retrieved 2008-11-09.

Q & A