Information Retrieval Lecture 8 Introduction to Information Retrieval (Manning et al. 2007) Chapter 19 For the MSc Computer Science Programme Dell Zhang Birkbeck, University of London The slides are adapted from Prof. Mark Levene’s at http://www.dcs.bbk.ac.uk/~mark/download/lec2_the_structure_of_the_web.ppt



The Size of the Web

Lawrence and Giles 1999 – 800 million

Over 11.5 billion in 2005 (Google indexes over 8 billion)

Coverage – about 40% in 1999

Overlap – low

The deep (or hidden or invisible) web contains 400-550 times more information.

Capture Recapture

SE1: the reported size of search engine 1.

QSE1 and QSE2: the pages returned for a set of queries Q from two engines.

OVR: the overlap of QSE1 and QSE2

Estimate of Web size: (QSE2 x SE1) / OVR

a.k.a. Mark and Recapture

OVR / QSE2 = SE1 / Web
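The estimate above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `estimate_web_size`, the toy URL sets, and the reported index size of 1000 are all made up, and the result sets are treated as plain sets of page identifiers.

```python
def estimate_web_size(se1_size, qse1_urls, qse2_urls):
    """Capture-recapture estimate: Web ~ (|QSE2| x SE1) / OVR."""
    qse1 = set(qse1_urls)
    qse2 = set(qse2_urls)
    overlap = len(qse1 & qse2)  # OVR: pages returned by both engines
    if overlap == 0:
        raise ValueError("no overlap between the result sets; estimate undefined")
    return len(qse2) * se1_size / overlap


# Toy data, invented for illustration only.
qse1 = {"a", "b", "c", "d"}
qse2 = {"c", "d", "e", "f", "g", "h"}
print(estimate_web_size(1000, qse1, qse2))
```

With these toy numbers OVR = 2 and QSE2 = 6, so the estimate is (6 x 1000) / 2 = 3000 pages.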

Diameter of the Web

Compute the average shortest path between pairs of pages that have a path from one to the other.

Broder 99 – directed 16.2, undirected 6.8

Barabasi 99 – directed for nd.edu 19

Small diameter is a characteristic of a small world network

Choose random source and destination – 75% of the time no directed path between them.
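As a rough sketch of the measurement described above, the following Python runs a breadth-first search from every page of a tiny directed graph, averages the shortest-path lengths over the ordered pairs that are connected, and reports the fraction of pairs with no directed path. The graph `toy_web` and the function names are hypothetical, and the all-pairs approach is only practical for small graphs.

```python
from collections import deque


def bfs_distances(graph, source):
    """Shortest directed path lengths (in links) from source to every reachable page."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_shortest_path(graph):
    """Return (average path length over connected pairs, fraction of unconnected pairs)."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    total = 0
    connected_pairs = 0
    for s in nodes:
        for t, d in bfs_distances(graph, s).items():
            if t != s:
                total += d
                connected_pairs += 1
    all_pairs = len(nodes) * (len(nodes) - 1)  # ordered (source, destination) pairs
    avg = total / connected_pairs if connected_pairs else float("nan")
    unreachable = 1 - connected_pairs / all_pairs if all_pairs else 0.0
    return avg, unreachable


# Toy graph, invented for illustration: a 3-cycle plus one page D linking into it.
toy_web = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A"]}
avg, unreachable = average_shortest_path(toy_web)
print(avg, unreachable)
```

Here 9 of the 12 ordered pairs are connected, so a quarter of random source and destination choices have no directed path, which is the same kind of quantity as the 75% figure quoted for the real Web.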

Bowtie Model of the Web

Broder et al. 1999 – crawl of over 200 million pages and 1.5 billion links.

SCC – 27.5%

IN and OUT – 21.5% each

Tendrils and tubes – 21.5%

Disconnected – 8%
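To make the bowtie components concrete, here is a minimal Python sketch that splits a small directed graph into the largest strongly connected component (SCC), IN (pages that can reach the SCC), OUT (pages reachable from the SCC), and everything else (tendrils, tubes, and disconnected pages lumped together). The function names and the toy graph are assumptions for illustration, and the repeated reachability searches are a simple swap-in for the linear-time SCC algorithms a real crawl would need.

```python
def reachable(graph, source):
    """Set of nodes reachable from source by following directed links."""
    seen = {source}
    stack = [source]
    while stack:
        u = stack.pop()
        for v in graph.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def bowtie(graph):
    """Split nodes into SCC (largest), IN, OUT and OTHER."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    if not nodes:
        return {"SCC": set(), "IN": set(), "OUT": set(), "OTHER": set()}
    reverse = {n: [] for n in nodes}
    for u, targets in graph.items():
        for v in targets:
            reverse[v].append(u)
    best = None
    assigned = set()
    for n in sorted(nodes):
        if n in assigned:
            continue
        fwd = reachable(graph, n)    # pages n can reach
        bwd = reachable(reverse, n)  # pages that can reach n
        scc = fwd & bwd              # strongly connected component containing n
        assigned |= scc
        if best is None or len(scc) > len(best[0]):
            best = (scc, fwd, bwd)
    scc, fwd, bwd = best
    out_part = fwd - scc
    in_part = bwd - scc
    other = nodes - scc - in_part - out_part
    return {"SCC": scc, "IN": in_part, "OUT": out_part, "OTHER": other}


# Toy graph, invented for illustration.
toy_web = {
    "in1": ["a", "t"],    # in1 links into the core and to a tendril page t
    "a": ["b"],
    "b": ["c"],
    "c": ["a", "out1"],   # the core a -> b -> c -> a links out to out1
    "x": ["y"],           # a disconnected pair
}
parts = bowtie(toy_web)
for name, members in parts.items():
    print(name, sorted(members))
```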

Link Degree Distributions

How many pages have n = 1, 2, … links:

indegree: proportional to $1/n^{2.1}$

outdegree: proportional to $1/n^{2.72}$

The log-log plots are linear!
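The counts behind such a plot can be gathered as in the sketch below, which tallies how many pages have n inlinks and n outlinks from a list of (source, destination) links. The function name `degree_distributions` and the edge list are hypothetical; pages with zero links of a given kind are not counted, matching n = 1, 2, … above.

```python
from collections import Counter


def degree_distributions(edges):
    """Return ({n: pages with n inlinks}, {n: pages with n outlinks})."""
    indegree = Counter()
    outdegree = Counter()
    for src, dst in edges:
        outdegree[src] += 1
        indegree[dst] += 1
    in_hist = Counter(indegree.values())
    out_hist = Counter(outdegree.values())
    return dict(in_hist), dict(out_hist)


# Toy link list, invented for illustration.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "C"), ("C", "A")]
in_hist, out_hist = degree_distributions(edges)
print(in_hist)
print(out_hist)
```

Plotting these counts against n on logarithmic axes is what produces the straight lines mentioned above when the graph is large enough.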

What is a Power Law

f(i) is the proportion of objects having property i

E.g. f(i) = # pages, i = # inlinks

E.g. f(i) = # sites, i = # pages

E.g. f(i) = # sites, i = # users

E.g. f(i) = frequency of word, i = rank of word, from most frequent to least frequent

Power law: $f(i) = C / i^{\alpha}$, with constant $C > 0$ and exponent $\alpha > 0$

The log-log plot: linear relationship (straight line), since $\log f(i) = \log C - \alpha \log i$
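Because a power law is a straight line on a log-log plot, its exponent can be read off as the negated slope of that line. The sketch below fits the line by ordinary least squares on (log i, log f(i)). The function name `fit_power_law`, the symbols C and alpha, and the synthetic data are assumptions for illustration; on real, noisy data this simple fit is best treated as a quick diagnostic rather than a rigorous estimator.

```python
import math


def fit_power_law(points):
    """Fit f(i) = C / i**alpha by least squares in log-log space; return (C, alpha)."""
    xs = [math.log(i) for i, _ in points]
    ys = [math.log(f) for _, f in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # slope of the log-log line is -alpha
    intercept = mean_y - slope * mean_x    # intercept of the log-log line is log C
    return math.exp(intercept), -slope


# Synthetic, noise-free data generated from f(i) = 5 / i**2.1 (illustration only).
synthetic = [(i, 5.0 / i ** 2.1) for i in range(1, 51)]
C, alpha = fit_power_law(synthetic)
print(round(C, 6), round(alpha, 6))
```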

Power Laws on the Web

Exponents of power laws observed on the Web:

inlinks (2.1)

outlinks (2.72)

Strongly connected components (2.54)

No. of web pages in a site (2.2)

No. of visitors to a site during a day (2.07)

No. of links clicked by web surfers (1.5)

PageRank (2.1)

How Power Laws Arise

Preferential Attachment or The Rich Get Richer
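A minimal simulation can illustrate the rich-get-richer idea: each new page adds one link to an existing page chosen with probability proportional to that page's current number of links. This is a simplified, undirected sketch under stated assumptions (one link per new page, a fixed random seed, the hypothetical function name `preferential_attachment`), not the exact model behind the Web figures.

```python
import random
from collections import Counter


def preferential_attachment(num_nodes, seed=0):
    """Grow a graph one node at a time; return a Counter mapping node -> degree."""
    rng = random.Random(seed)
    # Every node appears in `endpoints` once per link it takes part in,
    # so a uniform draw from this list picks nodes in proportion to their degree.
    endpoints = [0, 1]                 # start from a single link between nodes 0 and 1
    degree = Counter({0: 1, 1: 1})
    for new in range(2, num_nodes):
        target = rng.choice(endpoints)
        degree[new] += 1
        degree[target] += 1
        endpoints.extend([new, target])
    return degree


degree = preferential_attachment(1000)
histogram = Counter(degree.values())   # how many nodes have each degree
print(sorted(histogram.items())[:5])
print("largest degree:", max(degree.values()))
```

In runs like this, most nodes tend to keep only a few links while a handful of early nodes accumulate many, which is the heavy-tailed pattern that a power law describes.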

Scale-Free Networks vs. Classic Random Graphs

Take Home Messages

The Web Graph

Large and Sparse: Capture Recapture

Small World Network: 19 Degrees of Separation

Scale Free Network: The Power Law, Rich Get Richer