COMP5331: Knowledge Discovery and Data Mining

Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, Jon M. Kleinberg


PageRank & HITS: Bring Order to the Web

Background and Introduction

Approach – PageRank

Approach – Authorities & Hubs


Motivation and Introduction

Why is Page Importance Rating important?

The World Wide Web poses new challenges for information retrieval.

Huge number of web pages: 150 million by 1998, 1000 billion by 2008

Diversity of web pages: different topics, different quality, etc.

It is hard to imagine a search engine without a ranking algorithm.

November 9, 2015 Data Mining: Concepts and Techniques 3


Modern search engines may return millions of pages for a single query. That is far too many for a human user to preview.

Ranking algorithms process the search results and show only the most useful information to the search engine user.


PageRank: History

PageRank was developed by Larry Page (hence the name Page-Rank) and Sergey Brin.

It was first developed as part of a research project about a new kind of search engine. That project started in 1995 and led to a functional prototype in 1998.

Shortly after, Page and Brin founded Google.


Link Structure of the Web

150 million web pages, 1.7 billion links


Backlinks and forward links: A and B are C’s backlinks; C is A and B’s forward link.

Intuitively, a web page is important if it has a lot of backlinks.

What if a web page has only one backlink, but it comes from www.yahoo.com?
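The backlink relation can be derived from the forward links alone. The sketch below is a minimal illustration, assuming the link structure is held as a dictionary from each page to the pages it links to; the function name `backlinks` and the three-page graph are hypothetical.

```python
from collections import defaultdict


def backlinks(forward_links):
    """Invert a forward-link map {page: [pages it links to]} into a backlink map."""
    back = defaultdict(set)
    for source, targets in forward_links.items():
        back[source]  # touch the key so pages without backlinks still appear
        for target in targets:
            back[target].add(source)
    return {page: sorted(sources) for page, sources in back.items()}


# A and B both link to C, so A and B are C's backlinks.
links = {"A": ["C"], "B": ["C"], "C": []}
result = backlinks(links)
print(result["C"])
```

Counting the entries of `result[page]` gives the naive "many backlinks" importance score. That score cannot distinguish a single backlink from a major site from a single backlink from an obscure page, which is the concern raised by the www.yahoo.com question.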


PageRank: A Simplified Version


u: a web page

$B_u$: the set of u’s backlinks

$N_v$: the number of forward links of page v

c: the normalization factor chosen so that $\|R\|_{L_1} = 1$, where $\|R\|_{L_1} = |R_1 + \dots + R_n|$
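With these symbols, the simplified rank defined in the original PageRank paper is $R(u) = c \sum_{v \in B_u} R(v) / N_v$. A minimal Python sketch of iterating it follows. The uniform starting ranks, the fixed iteration count, the function name and the three-page graph are illustrative assumptions, not taken from the slides, and the sketch assumes every page has at least one forward link.

```python
def simplified_pagerank(forward_links, iterations=50):
    """Iterate R(u) = c * sum(R(v) / N_v for v in B_u), with c chosen so sum(R) == 1."""
    pages = list(forward_links)
    rank = {u: 1.0 / len(pages) for u in pages}  # uniform start (an assumption)
    for _ in range(iterations):
        new_rank = {u: 0.0 for u in pages}
        for v, targets in forward_links.items():
            for u in targets:
                # v passes R(v) / N_v to each of its forward links
                new_rank[u] += rank[v] / len(targets)
        total = sum(new_rank.values())
        rank = {u: r / total for u, r in new_rank.items()}  # c = 1 / total
    return rank


# Hypothetical graph: A -> B, A -> C, B -> C, C -> A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = simplified_pagerank(graph)
print(ranks)
```

On this small graph the ranks settle near 0.4 for A and C and 0.2 for B: C collects rank from both A and B, and passes all of it back to A.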

An example of Simplified PageRank

[Figures: PageRank calculation, first iteration; PageRank calculation, second iteration; convergence after some iterations]

A Problem with Simplified PageRank


A loop (a set of pages that link only to each other):

During each iteration, the loop accumulates rank but never distributes rank to other pages!
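The effect can be reproduced on a toy graph. This is a sketch under stated assumptions: the four-page graph, the helper `step` and the choice of 40 iterations are hypothetical. Every page here has forward links, so total rank is conserved and the normalization factor c equals 1.

```python
def step(rank, forward_links):
    """One simplified PageRank update: each page splits its rank among its forward links."""
    new_rank = {u: 0.0 for u in rank}
    for v, targets in forward_links.items():
        for u in targets:
            new_rank[u] += rank[v] / len(targets)
    return new_rank


# Hypothetical graph: A and D link to each other, and A also links into the loop B <-> C.
graph = {"A": ["D", "B"], "D": ["A"], "B": ["C"], "C": ["B"]}
rank = {u: 0.25 for u in graph}

loop_share = []  # fraction of total rank held by the loop after each iteration
for _ in range(40):
    rank = step(rank, graph)
    loop_share.append(rank["B"] + rank["C"])

print(loop_share[0], loop_share[1], loop_share[-1])
```

The share held by B and C never decreases from one iteration to the next, and after enough iterations almost nothing is left for A and D.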

An example of the Problem

[Figures: three slides illustrating the problem]

Random Walks in Graphs

The Random Surfer Model

The simplified model: the standing probability distribution of a random walk on the graph of the web. The “random surfer” simply keeps clicking successive links at random.

The Modified Model

The modified model: the “random surfer” simply keeps clicking successive links at random, but periodically “gets bored” and jumps to a random page chosen according to the distribution E.


Modified Version of PageRank

E(u): a distribution over web pages that “users” jump to when they “get bored” of following successive links at random.
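The original PageRank paper writes the modified rank as $R'(u) = c \sum_{v \in B_u} R'(v) / N_v + c E(u)$ with $\|R'\|_{L_1} = 1$. The sketch below instead uses the widely used damping-factor form, in which the surfer follows a link with probability d and jumps according to E with probability 1 - d. The value d = 0.85, the uniform default for E, the stopping tolerance and the example graph are all assumptions made for illustration.

```python
def pagerank(forward_links, E=None, d=0.85, tol=1e-10, max_iter=1000):
    """Damped PageRank: follow a link with probability d, jump according to E otherwise."""
    pages = list(forward_links)
    if E is None:
        E = {u: 1.0 / len(pages) for u in pages}  # uniform jump distribution (an assumption)
    rank = dict(E)
    for _ in range(max_iter):
        new_rank = {u: (1.0 - d) * E[u] for u in pages}
        for v, targets in forward_links.items():
            for u in targets:
                new_rank[u] += d * rank[v] / len(targets)
        total = sum(new_rank.values())
        new_rank = {u: r / total for u, r in new_rank.items()}  # keep the L1 norm at 1
        delta = sum(abs(new_rank[u] - rank[u]) for u in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical graph containing a loop: A <-> D, A -> B, B <-> C.
graph = {"A": ["D", "B"], "D": ["A"], "B": ["C"], "C": ["B"]}
ranks = pagerank(graph)
print(ranks)
```

Because some rank is redistributed through E on every iteration, the loop B <-> C can no longer absorb everything: A and D keep a non-trivial share.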

An example of Modified PageRank

[Figure: example of Modified PageRank]

Dangling Links

Links that point to any page with no outgoing links

Most are pages that have not been downloaded yet

Affect the model since it is not clear where their weight should be distributed

Do not affect the ranking of any other page directly

Can simply be removed before the PageRank calculation and added back afterwards


PageRank Implementation

Convert each URL into a unique integer and store each hyperlink in a database using the integer IDs to identify pages

Sort the link structure by ID

Remove all the dangling links from the database

Make an initial assignment of ranks and start iteration

Choosing a good initial assignment can speed up PageRank convergence

Add the dangling links back
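The preprocessing steps above can be sketched in a few lines. This is an in-memory illustration only, with no database; the function names `assign_ids` and `remove_dangling`, the `.example` URLs, and the choice to repeat the removal until no dangling link is left are assumptions.

```python
def assign_ids(url_links):
    """Map each URL to a unique integer and return the links as sorted (source_id, target_id) pairs."""
    ids = {}

    def get_id(url):
        return ids.setdefault(url, len(ids))

    edges = sorted({(get_id(s), get_id(t)) for s, t in url_links})
    return ids, edges


def remove_dangling(edges):
    """Repeatedly drop links whose target has no outgoing links; return (kept, removed)."""
    kept, removed = list(edges), []
    while True:
        sources = {s for s, _ in kept}
        dangling = [(s, t) for s, t in kept if t not in sources]
        if not dangling:
            return kept, removed
        removed.extend(dangling)
        kept = [(s, t) for s, t in kept if t in sources]


# Hypothetical crawl: c.example has no outgoing links, so a.example -> c.example is dangling.
url_links = [
    ("a.example", "b.example"),
    ("b.example", "a.example"),
    ("a.example", "c.example"),
]
ids, edges = assign_ids(url_links)
kept, removed = remove_dangling(edges)
print(ids, kept, removed)
```

The iteration then runs on `kept` only, and the links in `removed` are added back afterwards.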


Convergence Property

PR (322 million links): 52 iterations

PR (161 million links): 45 iterations

Scaling factor is roughly linear in log n


The Web is an expander-like graph

Theory of random walk: a random walk on a graph is said to be rapidly-mixing if it quickly converges to a limiting distribution on the set of nodes in the graph. A random walk is rapidly-mixing on a graph if and only if the graph is an expander graph.

Expander graph: every subset of nodes S has a neighborhood (the set of vertices accessible via outedges emanating from nodes in S) that is larger than some factor α times |S|. A graph has a good expansion factor if and only if the largest eigenvalue is sufficiently larger than the second-largest eigenvalue.


PageRank vs. Web Traffic

Some highly accessed web pages have low PageRank, possibly because:

People do not want to link to these pages from their own web pages (the example in their paper is pornographic sites…)

Some important backlinks are omitted

Usage data could be used as a start vector for PageRank.


Hypertext-Induced Topic Search (HITS)

Goal: to find a small set of the most “authoritative” pages relevant to the query.

Authority: the most useful/relevant/helpful results of a query.

“java” – java.com

“harvard” – harvard.edu

“search engine” – powerful search engines.


Also known as Authorities & Hubs; developed by Jon Kleinberg while visiting IBM Almaden

IBM expanded HITS into Clever.

Authorities – pages that are relevant and are linked to by many other pages

Hubs – pages that link to many related authorities


Authorities & Hubs

Intuitive idea to find authoritative results using link analysis:

Not all hyperlinks are related to the conferral of authority.

Find the pattern that authoritative pages have:

Authoritative pages share considerable overlap in the sets of pages that point to them.


First step:

Construct a focused subgraph of the WWW based on the query

Second step:

Iteratively calculate the authority weight and hub weight for each page in the subgraph

Constructing a focused subgraph

[Figures: four slides illustrating the construction of the focused subgraph]
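One way to build the focused subgraph, following the root-set and base-set construction described in Kleinberg's HITS paper, is sketched below. It assumes the root set (the top text-search results for the query) is already available. The function name, the parameter default d=50, the choice of taking the first d backlinks in sorted order, and the toy link data are all illustrative assumptions.

```python
def focused_subgraph(root_set, forward_links, backlinks, d=50):
    """Grow a root set into a base set and return the links among base-set pages."""
    base = set(root_set)
    for page in root_set:
        # add every page the root page points to
        base.update(forward_links.get(page, []))
        # add at most d pages that point to the root page (first d in sorted order here)
        base.update(sorted(backlinks.get(page, []))[:d])
    # keep only links whose endpoints are both in the base set
    return {p: [q for q in forward_links.get(p, []) if q in base] for p in sorted(base)}


# Hypothetical link data; "x" and "y" are unrelated to the query and stay outside.
forward_links = {"r1": ["a"], "r2": ["a", "b"], "h": ["r1", "r2"], "x": ["y"]}
backlinks = {"a": ["r1", "r2"], "b": ["r2"], "r1": ["h"], "r2": ["h"], "y": ["x"]}
subgraph = focused_subgraph(["r1", "r2"], forward_links, backlinks)
print(subgraph)
```

Capping the number of backlinks per root page keeps a very popular page from pulling an unbounded number of pages into the subgraph.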

Computing Hubs and Authorities

Rules:

A good hub points to many good authorities.

A good authority is pointed to by many good hubs.

Authorities and hubs have a mutual reinforcement relationship.
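The two rules translate directly into a pair of updates: a page's authority weight is the sum of the hub weights of the pages pointing to it, and its hub weight is the sum of the authority weights of the pages it points to. The sketch below is a minimal version that assumes every link target also appears as a key of the graph. Normalizing so that the squared weights sum to 1, the fixed iteration count and the four-page graph are illustrative choices.

```python
import math


def hits(forward_links, iterations=50):
    """Iterate the mutually reinforcing hub and authority updates with L2 normalization."""
    pages = list(forward_links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority update: sum of hub weights of the pages pointing to p
        auth = {p: 0.0 for p in pages}
        for q, targets in forward_links.items():
            for p in targets:
                auth[p] += hub[q]
        # hub update: sum of authority weights of the pages q points to
        hub = {q: sum(auth[p] for p in forward_links[q]) for q in pages}
        # normalize so that the squared weights sum to 1
        a_norm = math.sqrt(sum(a * a for a in auth.values()))
        h_norm = math.sqrt(sum(h * h for h in hub.values()))
        auth = {p: a / a_norm for p, a in auth.items()}
        hub = {q: h / h_norm for q, h in hub.items()}
    return hub, auth


# Hypothetical subgraph: H1 points to X and Y, H2 points to X only.
graph = {"H1": ["X", "Y"], "H2": ["X"], "X": [], "Y": []}
hub, auth = hits(graph)
print(hub, auth)
```

X ends up as the strongest authority because both hubs point to it, and H1 ends up as the strongest hub because it points to both authorities.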


Example (no normalization)

[Figures: five slides stepping through the example without normalization]

The Iterative Algorithm

[Figures: four slides presenting the iterative algorithm]

A Statistical View of HITS


The authority weight of a page equals its contribution when the dataset is transformed onto the first principal component.

This reflects the importance of that vector for the distribution of the whole dataset.

From the statistical view:

HITS can be implemented by PCA

HITS is different from clustering using dimensionality reduction.

The number of samples of PCA is limited.
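The link to PCA can be made concrete: with adjacency matrix A (A[i][j] = 1 when page i links to page j), the normalized authority weights converge to the principal eigenvector of $A^T A$, which plays the role of an uncentered first principal component. The sketch below finds that eigenvector by power iteration in plain Python. The matrix, the page order (H1, H2, X, Y) and the helper name are hypothetical, and a real PCA would also center the data.

```python
import math


def principal_eigenvector(matrix, iterations=100):
    """Power iteration: repeatedly multiply by the matrix and rescale to unit length."""
    n = len(matrix)
    v = [1.0] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical pages in the order H1, H2, X, Y: H1 -> X, H1 -> Y, H2 -> X.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
n = len(A)
# (A^T A)[i][j] counts the pages that link to both page i and page j.
AtA = [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
authority = principal_eigenvector(AtA)
print(authority)
```

The entries for X and Y are the only nonzero ones, with X larger, which matches what the iterative hub and authority updates produce on the same graph.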


PageRank vs. HITS


Summary

PageRank is a global ranking of all web pages based on their locations in the web graph structure

PageRank uses information which is external to the web pages – backlinks

Backlinks from important pages are more significant than backlinks from average pages

The structure of the web graph is very useful for information retrieval tasks.

HITS – Find authoritative pages; Construct subgraph; Mutually reinforcing relationship; Iterative algorithm