Page Quality: In Search of an Unbiased Web Ranking Seminar on databases and the internet. Hebrew University of Jerusalem Winter 2008 Ofir Cooper ([email protected])

Page Quality: In Search of an Unbiased Web Ranking

Seminar on databases and the internet.

Hebrew University of Jerusalem

Winter 2008

Ofir Cooper ([email protected])

References

Page Quality: In Search of an Unbiased Web Ranking
Junghoo Cho, Sourashis Roy, Robert E. Adams. UCLA Computer Science Department (June 2005)

Impact of Search Engines on Page Popularity
Junghoo Cho, Sourashis Roy. UCLA Computer Science Department (May 2004)

Overview

What is the current algorithm search engines use?

Motivation to improve.

The proposed method.

Implementation.

Experimental results.

Conclusions (problems & future work).

Search engines today

Search engines today use a variant of the PageRank rating system to sort relevant results.

PageRank (PR) tries to measure the “importance” of a page, by measuring its popularity.

What is PageRank?

Based on the random-surfer web-user model:

A person starts surfing the web at a random page.

The person advances by clicking on links in the page (selected at random).

At each step, there is a small chance the person will jump to a new, random page.

This model does not take into account search engines.

What is PageRank ?

PageRank (PR) measures the probability that the random-surfer is at page p, at any given time.

Computed by this formula:

$$PR(p_i) = d + (1-d)\left(\frac{PR(p_1)}{c_1} + \cdots + \frac{PR(p_n)}{c_n}\right)$$

$p_1, \ldots, p_n$ are the pages that link to page $p_i$.

$c_j$ is the number of outgoing links from $p_j$.

d is a constant, called the damping factor.
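As a rough illustration of the formula PR(p_i) = d + (1 - d)(PR(p_1)/c_1 + ... + PR(p_n)/c_n), the following Python sketch iterates it on a tiny three-page graph. The graph, the value d = 0.15, the input format, and the fixed number of iterations are assumptions made for this example; none of them come from the slides.

```python
def pagerank(links, d=0.15, iterations=100):
    """Iterate PR(p_i) = d + (1 - d) * sum(PR(p_j) / c_j) over pages p_j linking to p_i.

    links maps each page to the list of pages it links to (hypothetical input format).
    d is the small probability of jumping to a random page (assumed value).
    """
    pages = list(links)
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # Sum the contributions of every page q that links to p.
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new_pr[p] = d + (1 - d) * incoming
        pr = new_pr
    return pr


# Toy graph: every page has at least one outgoing link.
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
ranks = pagerank(toy_web)
```

In this toy graph page "c" receives links from both other pages, so it ends up with the highest value, while "b" (linked only from "a") ends up with the lowest.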

The problem with PR

PageRank rating creates a “rich-get-richer” phenomenon.

Popular pages will become even more popular over time.

Unpopular pages are hardly ever visited, because they remain unknown to users. They are doomed to obscurity.

The problem with PR

This was observed in an experiment:

The Open Directory (http://dmoz.org) was sampled twice within a seven-month period.

The change in the number of incoming links to each page was recorded.

Pages were divided into popularity groups, and the results…

The bias against low-PageRank pages

In their study, Cho and Roy show that in a search-dominant world, the discovery time of new pages rises by a factor of 66!

What can be done to remedy the situation ?

Ranking should reflect quality of pages.

Popularity is not a good-enough measure for quality, because there are many good, yet unknown, pages.

We want to give new pages an equal opportunity (if they are of high quality).

How to define “quality” ?

Quality is a very subjective notion.

Let’s try to define it anyway…

Page Quality – the probability that an average user will like the page when he/she visits it for the first time.

How to estimate quality ?

Quality can be measured exactly if we show all users the page, and ask their opinion.

It’s impractical to ask users their opinion on every page they visit.

(PageRank would be a good measure of quality if all pages had been given an equal opportunity to be discovered. That is no longer the case.)

How to estimate quality ?

We want to estimate quality, but from measurable quantities.

We can talk about these quantities:

Q(p) – page quality. The probability that a user will like page p when exposed to it for the first time.

P(p,t) – page popularity. The fraction of users who like p at time t.

V(p,t) – visit popularity. The number of “visits” page p receives in a unit time interval at time t.

A(p,t) – page awareness. The fraction of web users who are aware of page p at time t.

Estimating quality

Lemma 1

$$P(p,t) = A(p,t)\,Q(p) \quad\Longrightarrow\quad Q(p) = \frac{P(p,t)}{A(p,t)}$$

Proof: follows from definitions.

This is not sufficient – we can’t measure awareness.

We can measure page popularity, P(p,t).

How do we estimate Q(p) only from P(p,t)?

Estimating quality

Observation 1 – popularity (as measured in incoming links) measures quality well for pages with the same age.

Observation 2 – the popularity of new, high-quality pages will increase faster than the popularity of new, low-quality pages.

In other words, the time-derivative of popularity is also a measure of quality.

Estimating quality

We need a web-user model to link popularity and quality.

We start with these two assumptions:

1. Visit popularity is proportional to popularity: V(p,t) = rP(p,t)

2. Random visit hypothesis: a visit to page p can be from any user with equal probability.

Estimating quality

Lemma 2

A(p,t) can be computed from past popularity:

$$A(p,t) = 1 - \exp\left\{-\frac{r}{n}\int_0^t P(p,t)\,dt\right\}$$

* n is the number of web users.
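To make Lemma 2 concrete, here is a minimal Python sketch that evaluates A(p,t) = 1 - exp{-(r/n) ∫ P(p,t) dt} from a few popularity samples. The trapezoid rule, the function name, and the sample values are assumptions of this sketch, not part of the original material.

```python
import math


def awareness(times, popularity, r, n):
    """Approximate A(p,t) = 1 - exp(-(r/n) * integral_0^t P dt) at the last sample time.

    times and popularity are equally indexed samples of P(p,t), with times[0] = 0.
    The integral is approximated with the trapezoid rule (an assumption of this sketch).
    """
    integral = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        integral += 0.5 * (popularity[i] + popularity[i - 1]) * dt
    return 1.0 - math.exp(-(r / n) * integral)


# Hypothetical numbers: constant popularity 0.2 for 10 time units, r/n = 0.5.
a = awareness([0, 5, 10], [0.2, 0.2, 0.2], r=0.5, n=1.0)
```

With these made-up inputs the integral equals 2, so the result is 1 - exp(-1), about 0.63.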

Estimating quality

Proof:

By time t, page p was visited k times, where

$$k = \int_0^t V(p,t)\,dt = r\int_0^t P(p,t)\,dt$$

We compute the probability that some user, u, is not aware of p, after p was visited k times.

Estimating quality

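Pr(i’th visitor to p is not u) = 1 - 1/n

Pr(u didn’t visit p | p was visited k times) =

$$\left(1 - \frac{1}{n}\right)^{k} = \left(1 - \frac{1}{n}\right)^{r\int_0^t P(p,t)\,dt} = \left[\left(1 - \frac{1}{n}\right)^{n}\right]^{\frac{r}{n}\int_0^t P(p,t)\,dt} \approx e^{-\frac{r}{n}\int_0^t P(p,t)\,dt}$$

(the last step uses $(1 - 1/n)^n \approx e^{-1}$ for large n)

$$1 - A(p,t) = e^{-\frac{r}{n}\int_0^t P(p,t)\,dt}$$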
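The counting argument in this proof can be checked with a small simulation: pick k visitors uniformly at random among n users and compare the fraction who visited at least once with 1 - (1 - 1/n)^k and with the approximation 1 - exp(-k/n). The values of n and k and the fixed random seed are arbitrary choices for this sketch.

```python
import math
import random


def simulate_awareness(n, k, seed=0):
    """Fraction of n users who made at least one of k uniformly random visits."""
    rng = random.Random(seed)
    visited = set()
    for _ in range(k):
        visited.add(rng.randrange(n))  # random visit hypothesis: any user, equal probability
    return len(visited) / n


n, k = 10_000, 15_000              # hypothetical number of users and visits
simulated = simulate_awareness(n, k)
exact = 1 - (1 - 1 / n) ** k       # probability from the proof
approx = 1 - math.exp(-k / n)      # large-n approximation
```

For n this large, the exact expression and the exponential approximation differ only far beyond the second decimal place, and the simulated fraction should land close to both.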

Estimating quality

We can combine Lemmas 1 and 2 to get popularity as a function of time.

Theorem:

$$P(p,t) = \frac{Q(p)}{1 + \left(\frac{Q(p)}{P(p,0)} - 1\right)\exp\left\{-\frac{r}{n}\,Q(p)\,t\right\}}$$

The proof is a bit long, so we won’t go into it (available in hard copy to those interested).
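The popularity formula P(p,t) = Q(p) / (1 + (Q(p)/P(p,0) - 1) exp{-(r/n) Q(p) t}) is a logistic curve. The following Python sketch evaluates it at a few time points; the parameter values (Q = 0.5, P(p,0) = 0.001, r/n = 1) are hypothetical and chosen only to show the shape.

```python
import math


def popularity(t, q, p0, r_over_n):
    """P(p,t) = Q / (1 + (Q / P(p,0) - 1) * exp(-(r/n) * Q * t))."""
    return q / (1 + (q / p0 - 1) * math.exp(-r_over_n * q * t))


# Hypothetical parameters: Q = 0.5, P(p,0) = 0.001, r/n = 1.
curve = [popularity(t, q=0.5, p0=0.001, r_over_n=1.0) for t in range(0, 61, 10)]
```

The curve starts at P(p,0), grows slowly at first, then rapidly, and finally levels off just below Q.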

Estimating quality

This is popularity vs. time, as predicted by our formula:

(This trend was seen in practice, by companies such as NetRatings)

Estimating quality

Important fact: Popularity converges to quality over a long period of time.

$$P(p,t) = \frac{Q(p)}{1 + \left(\frac{Q(p)}{P(p,0)} - 1\right)\exp\left\{-\frac{r}{n}\,Q(p)\,t\right\}} \;\xrightarrow{\;t\to\infty\;}\; Q(p)$$

We will use this fact to check estimates about quality later.

Estimating quality

Lemma 3

$$Q(p) = \frac{n}{r}\cdot\frac{dP(p,t)/dt}{P(p,t)\,(1 - A(p,t))}$$

Proof:

We differentiate the equation P(p,t)=A(p,t)Q(p) by time,

plug in the expression we found for A(p,t) in Lemma 2,

and that’s it.

Estimating quality

We define the “relative popularity increase function”:

$$I(p,t) = \frac{n}{r}\cdot\frac{dP(p,t)/dt}{P(p,t)}$$

Estimating quality

Theorem: Q(p) = I(p,t) + P(p,t) at all times.

Proof:

1. $\frac{dA(p,t)}{dt} = \frac{r}{n}\,(1 - A(p,t))\,P(p,t)$ (from Lemma 2)

2. $Q(p)\,\frac{dA(p,t)}{dt} = \frac{r}{n}\,(Q(p) - Q(p)A(p,t))\,P(p,t)$ (multiply by Q(p))

3. $\frac{dP(p,t)}{dt} = \frac{r}{n}\,(Q(p) - P(p,t))\,P(p,t)$

$$Q(p) = \frac{dP(p,t)/dt}{\frac{r}{n}\,P(p,t)} + P(p,t)$$
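The identity Q(p) = I(p,t) + P(p,t) can be checked numerically against the logistic popularity curve from the earlier theorem. In this Python sketch the derivative dP/dt is taken by a central difference; the parameter values and the step size h are assumptions chosen for illustration.

```python
import math


def popularity(t, q, p0, r_over_n):
    """Logistic popularity curve P(p,t) from the earlier theorem."""
    return q / (1 + (q / p0 - 1) * math.exp(-r_over_n * q * t))


def estimate_quality(t, q, p0, r_over_n, h=1e-4):
    """Return I(p,t) + P(p,t), with dP/dt taken by a central difference."""
    p = popularity(t, q, p0, r_over_n)
    dp_dt = (popularity(t + h, q, p0, r_over_n)
             - popularity(t - h, q, p0, r_over_n)) / (2 * h)
    increase = dp_dt / (r_over_n * p)  # I(p,t) = (n/r) * (dP/dt) / P
    return increase + p


# Hypothetical parameters: Q = 0.5, P(p,0) = 0.001, r/n = 1.
estimates = [estimate_quality(t, q=0.5, p0=0.001, r_over_n=1.0) for t in (1, 10, 20, 40)]
```

Every entry of `estimates` should be very close to 0.5, whether the page is still almost unknown (t = 1) or already near its final popularity (t = 40).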

Estimating quality

We can now estimate the quality of a page by measuring only its popularity.

What happens if quality changes in time?

Is our estimate still good ?

Quality change over time

In reality, the quality of pages changes:

Web pages change.

Expectations of users rise as better pages appear all the time.

Will the model handle changing quality well ?

Quality change over time

Theorem

If quality changed at time T (from Q1 to Q2), then for t > T, the estimate for quality is still:

$$Q_2 = \frac{n}{r}\cdot\frac{dP(p,t)/dt}{P(p,t)} + P(p,t)$$

Quality change over time

Proof:

After time T, we put users into three groups:

(1) Users who visited the page before T. (group u1)

(2) Users who visited the page after T. (group u2)

(3) Users who never visited the page.

Quality change over time

Fraction of users who like the page at time t>T:

$$P(p,t) = Q_1\,|u_1 - u_2(t)| + Q_2\,|u_2(t)|$$

* After time T, group u2 expands, while u1 remains the same.

We will have to compute u2(t).

Quality change over time

From the proof of Lemma 2 (calculation of awareness at time t) it is easy to see that:

$$|u_2(t)| = 1 - \exp\left\{-\frac{r}{n}\int_T^t P(p,t)\,dt\right\}$$

Quality change over time

Size of |u1-u2(t)|:

$$|u_1 - u_2(t)| = |u_1| - |u_1 \cap u_2(t)| = |u_1| - |u_1|\,|u_2(t)|$$

* The size of the intersection of u1 and u2 is the product of their sizes, because they are independent. (According to the random-visit hypothesis, the probability that a user visits page p at time t is independent of his or her past visit history.)

Quality change over time

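$$\begin{aligned}
P(p,t) &= Q_1\,|u_1 - u_2(t)| + Q_2\,|u_2(t)| \\
&= Q_1|u_1| - Q_1|u_1|\,|u_2(t)| + Q_2\,|u_2(t)| \\
&= Q_1|u_1| + (Q_2 - Q_1|u_1|)\,|u_2(t)|
\end{aligned}$$

$$\begin{aligned}
\frac{dP(p,t)}{dt} &= (Q_2 - Q_1|u_1|)\,\frac{d|u_2(t)|}{dt} \\
&= (Q_2 - Q_1|u_1|)\,\frac{r}{n}\,P(p,t)\,(1 - |u_2(t)|) \\
&\;\;\vdots \\
&= \frac{r}{n}\,P(p,t)\,(Q_2 - P(p,t))
\end{aligned}$$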
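One way to fill in the steps omitted in this derivation (a reconstruction consistent with the surrounding lines, not taken from the original slides) is to start from P(p,t) = Q_1|u_1| + (Q_2 - Q_1|u_1|)|u_2(t)| and compute Q_2 - P(p,t) directly:

```latex
\begin{aligned}
Q_2 - P(p,t) &= Q_2 - Q_1|u_1| - (Q_2 - Q_1|u_1|)\,|u_2(t)| \\
             &= (Q_2 - Q_1|u_1|)\,(1 - |u_2(t)|)
\end{aligned}
```

Substituting this for the factor (Q_2 - Q_1|u_1|)(1 - |u_2(t)|) in the expression for dP(p,t)/dt gives dP(p,t)/dt = (r/n) P(p,t) (Q_2 - P(p,t)), which is the last line of the derivation.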

Quality change over time

$$Q_2 = \frac{n}{r}\cdot\frac{dP(p,t)/dt}{P(p,t)} + P(p,t)$$

Q.E.D

Implementation

The implementation of a quality-estimator system is very simple:

1. “Sample” the web at different times.

2. Compute popularity (PageRank) for each page, and popularity change.

3. Estimate quality of each page.
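The three steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the snapshot format, the page names, the popularity values, and the constant `n_over_r` are all hypothetical, and the derivative dP/dt is replaced by the slope between the first and last snapshot.

```python
def estimate_quality(snapshots, n_over_r):
    """Estimate Q(p) = I(p,t) + P(p,t) for every page from popularity snapshots.

    snapshots is a list of (time, {page: popularity}) pairs, oldest first
    (a hypothetical input format). n_over_r is an assumed constant standing in for n/r.
    """
    (t_first, first), (t_last, last) = snapshots[0], snapshots[-1]
    estimates = {}
    for page, p_now in last.items():
        p_then = first.get(page)
        if p_then is None or p_now <= 0:
            continue  # page missing from the first snapshot, or no popularity yet
        slope = (p_now - p_then) / (t_last - t_first)  # discrete stand-in for dP/dt
        estimates[page] = n_over_r * slope / p_now + p_now
    return estimates


# Hypothetical popularity values for two pages at three sampling times.
snapshots = [
    (0.0, {"old.example": 0.40, "new.example": 0.10}),
    (1.0, {"old.example": 0.40, "new.example": 0.15}),
    (2.0, {"old.example": 0.40, "new.example": 0.20}),
]
quality = estimate_quality(snapshots, n_over_r=2.0)
```

With these made-up numbers the new page, whose popularity is still low but growing, receives a higher quality estimate than the old page, whose popularity is higher but flat.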

Implementation

But there are problems with this implementation.

1. Approximation error – we sample at discrete time points, not a continuous sample.

2. Quality change between samples makes estimate inaccurate.

3. We will have a time lag. Quality estimate will never be up-to-date.

Implementation

Examining approximation error

Q = 0.5, ∆t = 1 (units not specified!)
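The effect of sampling at discrete time points can be explored with a small Python sketch. It generates popularity from the logistic formula with Q = 0.5 and compares the discrete estimate against the true Q for several sampling intervals. The values of P(p,0), r/n, the evaluation time t = 30, and the use of a backward difference are assumptions of this sketch, not settings from the slides.

```python
import math


def popularity(t, q=0.5, p0=0.001, r_over_n=0.2):
    """Logistic popularity curve with Q = 0.5 (other parameters are assumed)."""
    return q / (1 + (q / p0 - 1) * math.exp(-r_over_n * q * t))


def discrete_estimate(t, dt, r_over_n=0.2):
    """Q estimate at time t using the slope between samples at t - dt and t."""
    p_now, p_prev = popularity(t), popularity(t - dt)
    slope = (p_now - p_prev) / dt
    return slope / (r_over_n * p_now) + p_now


# Absolute error of the estimate for three sampling intervals.
errors = {dt: abs(discrete_estimate(30.0, dt) - 0.5) for dt in (0.01, 1.0, 5.0)}
```

Under these assumptions the error grows with the sampling interval, because the slope between two distant samples underestimates the derivative while popularity is still accelerating.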

Implementation

Examining slow change in quality

Q(p,t) = 0.4 + 0.0006t

Q(p,t) = 0.5 + ct

Implementation

Examining rapid change in quality

The Experiment

Evaluating a web metric such as quality is difficult.

Quality is subjective.

There is no standard corpus.

Doing a user survey is not practical.

The Experiment

The experiment is based on the observation that popularity converges to quality (assuming quality is constant).

If we estimate quality of pages, and wait some time, we can check our estimates against the eventual popularity.

The Experiment

The test was done on 154 web sites, obtained from the Open Directory (http://dmoz.org).

All pages of these web sites were downloaded (~5 million).

4 snapshots were taken at these times:

[Table: snapshot times]

The first three snapshots were used to estimate quality, and the fourth snapshot was used to check the prediction.

The Experiment

Quality is taken to be PR(t=4).

The quality estimator is measured against PR(t=3).

The Experiment

The results:

The quality estimator metric seems better than PageRank. Its average error is smaller:

Average error of Q3 estimator = 45%

Average error of P3 estimator = 74%

The distribution of error is also better.
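The slides do not spell out how the average error is computed. One plausible reading, assumed here, is the mean relative error of each predictor against the later PageRank value. The Python sketch below shows that calculation on made-up numbers; none of the values are data from the experiment.

```python
def average_relative_error(predicted, actual):
    """Mean of |predicted - actual| / actual over pages present in both dicts."""
    pages = [p for p in actual if p in predicted and actual[p] > 0]
    return sum(abs(predicted[p] - actual[p]) / actual[p] for p in pages) / len(pages)


# Made-up values purely for illustration; not data from the experiment.
future_pr = {"x": 1.0, "y": 2.0}         # stands in for PR at the fourth snapshot
quality_estimate = {"x": 0.9, "y": 2.2}  # stands in for the quality estimator
current_pr = {"x": 0.5, "y": 1.0}        # stands in for PR at the third snapshot

err_q = average_relative_error(quality_estimate, future_pr)
err_p = average_relative_error(current_pr, future_pr)
```

In this toy example the quality estimate is off by 10% on average and the current PageRank by 50%, which mirrors the kind of comparison reported above without reproducing its numbers.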

Summary & Conclusions

Summary

We saw the bias created by search engines.

A more desirable ranking will rank pages by quality, not popularity.

Summary

We can estimate quality from the link structure of the web (popularity and popularity evolution).

Implementation is feasible, and only slightly different from the current PageRank system.

Summary

Experimental results show that the quality estimator predicts future PageRank better than current PageRank does (average error of 45% vs. 74%).

Conclusions

Problems & Future work:

Statistical noise is not negligible for pages with low popularity.

Experiment was done on small scale. Should try it on large scale.

Conclusions

Problems & Future work:

Can we use the number of “visits” to pages to estimate popularity increase, instead of the number of incoming links?

Theory is based on a web-user model that doesn’t take into account search engines. That is unrealistic in this day and age.

Follow up suggestions

Many more interesting publications in Junghoo Cho’s website:

http://oak.cs.ucla.edu/~cho/

Such as:

Estimating Frequency of Change

Shuffling the deck: Randomizing search results

Automatic Identification of User Interest for Personalized Search

Other algorithms for ranking

Extra material can be found at:

http://www.seoresearcher.com/category/link-popularity-algorithms/

Algorithms such as:

Hub and Authority

HITS

HUBAVG
