Information Retrieval
Lecture 8
Introduction to Information Retrieval (Manning et al. 2007)
Chapter 19
For the MSc Computer Science Programme
Dell Zhang
Birkbeck, University of London
The slides are adapted from Prof. Mark Levene’s at http://www.dcs.bbk.ac.uk/~mark/download/lec2_the_structure_of_the_web.ppt
The Size of the Web
Lawrence and Giles 1999 – 800 million
Over 11.5 billion in 2005 (Google indexes over 8 billion)
Coverage – about 40% in 1999
Overlap – low
The deep (or hidden or invisible) web contains 400-550 times more information.
Capture Recapture
SE1: the reported size of search engine 1.
QSE1 and QSE2: the pages returned for a set of queries Q from two engines.
OVR: the overlap of QSE1 and QSE2
Estimate of Web size: (QSE2 x SE1) / OVR
a.k.a. Mark and Recapture
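OVR / QSE2 = SE1 / Web (the fraction of engine 2's results that engine 1 also returns approximates the fraction of the Web that engine 1 indexes)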
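The estimate above can be sketched in a few lines. This is a minimal illustration, assuming each engine's results for the query set are available as collections of URLs. The function name `estimate_web_size` and the toy values are hypothetical, not real measurements.

```python
def estimate_web_size(se1_size, qse1, qse2):
    """Capture-recapture estimate: (QSE2 x SE1) / OVR.

    se1_size: reported index size of search engine 1 (SE1).
    qse1, qse2: pages returned for the same query set by engines 1 and 2.
    """
    qse1, qse2 = set(qse1), set(qse2)
    overlap = len(qse1 & qse2)  # OVR
    if overlap == 0:
        raise ValueError("no overlap between the result sets: estimate undefined")
    return len(qse2) * se1_size / overlap


# Toy data (hypothetical URLs and index size).
qse1 = {"a", "b", "c", "d"}
qse2 = {"c", "d", "e", "f", "g", "h"}
estimate = estimate_web_size(1000, qse1, qse2)
print(estimate)
```

Here OVR = 2 and QSE2 = 6, so one third of engine 2's sample is covered by engine 1, and the estimated Web is three times the size of engine 1's index.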
Diameter of the Web
Compute the average shortest path between pairs of pages that have a path from one to the other.
Broder 99 – directed 16.2, undirected 6.8
Barabasi 99 – directed for nd.edu 19
Small diameter is a characteristic of a small-world network
Choose a random source and destination – 75% of the time there is no directed path between them.
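As a rough sketch of how such a figure could be computed on a small graph, the code below runs a breadth-first search from every page and averages the distances over ordered pairs that are actually connected. It assumes the graph fits in memory as an adjacency dictionary. The names `bfs_distances`, `average_shortest_path` and `toy_web` are hypothetical, and crawls at the scale of the studies above would need sampling rather than an all-pairs search.

```python
from collections import deque


def bfs_distances(graph, source):
    """Return {page: number of links} for every page reachable from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_shortest_path(graph):
    """Average directed distance over ordered pairs (u, v), u != v, that have a path.

    Also returns the fraction of ordered pairs with no directed path.
    """
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    total = 0
    connected = 0
    for u in nodes:
        for v, d in bfs_distances(graph, u).items():
            if v != u:
                total += d
                connected += 1
    all_pairs = len(nodes) * (len(nodes) - 1)
    if connected == 0:
        raise ValueError("graph has no connected pairs")
    return total / connected, 1 - connected / all_pairs


# Toy graph (hypothetical): a cycle a -> b -> c -> a plus a page d linking in.
toy_web = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
avg, unreachable = average_shortest_path(toy_web)
print(round(avg, 3), unreachable)
```

In the toy graph nothing links to d, so the three pairs ending in d have no directed path and are left out of the average, just as the definition above excludes unconnected pairs.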
Bowtie Model of the Web
Broder et al. 1999 – crawl of over 200 million pages and 1.5 billion links.
SCC – 27.5%
IN and OUT – 21.5% each
Tendrils and tubes – 21.5%
Disconnected – 8%
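The regions of the bowtie can be recovered from reachability alone. The sketch below assumes a seed page known to lie in the central SCC: pages that can both reach and be reached from the seed form the SCC, pages only reachable from it form OUT, and pages that can only reach it form IN. The names `reachable`, `bowtie` and `toy_web` are hypothetical, and tendrils, tubes and disconnected pages are lumped together as OTHER for brevity.

```python
from collections import deque


def reachable(graph, start):
    """Set of pages reachable from start by following links in graph."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def bowtie(graph, seed):
    """Split pages into SCC, IN, OUT and OTHER relative to the SCC containing seed."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    # Reverse every link so that a forward search finds pages that can reach the seed.
    reverse = {}
    for u, targets in graph.items():
        for v in targets:
            reverse.setdefault(v, []).append(u)
    forward = reachable(graph, seed)     # SCC plus OUT
    backward = reachable(reverse, seed)  # SCC plus IN
    scc = forward & backward
    return {
        "SCC": scc,
        "IN": backward - scc,
        "OUT": forward - scc,
        "OTHER": nodes - forward - backward,
    }


# Toy graph (hypothetical): core cycle a, b, c; i links in; o is linked out;
# t hangs off i as a tendril; x -> y is disconnected from the rest.
toy_web = {"a": ["b"], "b": ["c"], "c": ["a", "o"], "i": ["a", "t"], "x": ["y"]}
parts = bowtie(toy_web, "a")
print(parts)
```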
Link Degree Distributions
How many pages have n = 1, 2, … links:
indegree: proportional to $1/n^{2.1}$
outdegree: proportional to $1/n^{2.72}$
The log-log plots are linear!
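To make the counting concrete, here is a small sketch that tallies how many pages have exactly n inlinks or n outlinks, assuming the link graph is given as a list of (source, destination) pairs. The function name `degree_distributions` and the toy edge list are hypothetical.

```python
from collections import Counter


def degree_distributions(edges):
    """Return (indegree histogram, outdegree histogram) for (src, dst) links.

    Each histogram maps n >= 1 to the number of pages with exactly n links.
    """
    indegree = Counter(dst for _, dst in edges)
    outdegree = Counter(src for src, _ in edges)
    return Counter(indegree.values()), Counter(outdegree.values())


# Toy edge list (hypothetical).
edges = [("a", "c"), ("b", "c"), ("d", "c"), ("a", "b"), ("c", "d")]
in_hist, out_hist = degree_distributions(edges)
print(dict(in_hist), dict(out_hist))
```

On a real crawl, plotting these counts against n on log-log axes would give the straight lines referred to above.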
What is a Power Law
f(i) is the proportion of objects having property i
E.g. f(i) = # pages, i = # inlinks
E.g. f(i) = # sites, i = # pages
E.g. f(i) = # sites, i = # users
E.g. f(i) = frequency of word, i = rank of word, from most frequent to least frequent
The log-log plot: linear relationship (straight line)
$f(i) = C / i^{\tau}$, for a constant $C$ and exponent $\tau > 0$
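Because a power law is a straight line on log-log axes, its exponent can be read off as the negated slope of that line. The sketch below fits the line by ordinary least squares on synthetic data. The function name `fit_power_law` and the data are hypothetical, and least squares on log-log values is used here only as a simple illustration, not as a robust estimator for noisy real data.

```python
import math


def fit_power_law(points):
    """Fit f(i) = C / i**exponent by least squares on (log i, log f(i)).

    points: iterable of (i, f_i) pairs with i > 0 and f_i > 0.
    Returns (C, exponent).
    """
    xs = [math.log(i) for i, _ in points]
    ys = [math.log(f) for _, f in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic, noise-free data with a known constant and exponent.
synthetic = [(i, 5.0 / i ** 2.1) for i in range(1, 51)]
C, exponent = fit_power_law(synthetic)
print(round(C, 3), round(exponent, 3))
```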
Power Laws on the Web
inlinks (2.1)
outlinks (2.72)
Strongly connected components (2.54)
No. of web pages in a site (2.2)
No. of visitors to a site during a day (2.07)
No. of links clicked by web surfers (1.5)
PageRank (2.1)