Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
L23: PageRank
Je↵ M. Phillips
April 13, 2020
Final Report
At most 4 pages/student. Don’t cram in too much!
I Succinct title (and names)
I Problem definition and motivation.
I Explain your Data.
I key idea
I What did you do (which techniques, an implementation, acomparison, an extension)
I What did you learn? Artifacts (charts, plots, examples, math)and Intuition (in words, did it work?)
I ← Same ten Posters
o
webpage-simia.it( Search ). InvertedIndex
year ;i¥¥ . .
bout→ page 't
, P ?,"
car"
→¥pQgirl)
-
Define most relevantwebpages-
D- d-'
ne'EE¥÷÷÷÷:÷:i:smarm :÷÷÷i÷"¥xE÷÷g÷÷:X
Crawlers : program ; that walksaround web : ① read page
update fentwiuecfor
② follow random
inverting hyperlinks
use hyperlink infoL a hired "
www.pie.com" )pie
- 3
Spammersbuild fleet pages i link to your
Page w/ hyperlink tag .
studies : Alternative to search Engine
Yahoo ! and Look Smart
Built an organized , curatedcollection of websites
pageR-ankfslpjiturml-fktytg.fi?ggignhspiddeieafe
•
1ttqitlinkedto
balanceboth pages .
•
page is important ifa
random MCMC' ' random surfer "
were
to fond it.
-
Web is a big graph G- L V,
E)
V = I set do all pages }
F- = l Ei ; = link pi → B- 3
Petone ML → g* ⇐ converged to vectordistribution
q*Lj ) says how important page ; is .
Compute q* of lwepgraph-
• Keep truck of crawlers : how frequentreturn .
• Buy bag computer : Compute eigcp )← probtra.ca)
• Precompute P#
=P - P - p .
. . ..
. p
← too big°
q*= go ←last night
a :÷÷÷o÷÷÷t÷÷¥power
method
Anatomy of Web
Strongly ConnectedComponent
IN
OUTTubes
tendrilsOUT
tendrilsIN
disconnected
ANATOMY of WEB
is this G ergodic
÷ . . Is•
i'④ yo
Anatomy of Web
0
O
canwemaheGergodi.ci#• Teleportation ,/ taxation
→ about once every I steps
→ jumpto randomnode
.
P prob trans (G)13--0-15 p
R= a - is ) Pepe Q¥fIf¥¥Z
↳ dense
Rq . - Ul - B) Ptp'
G)ginLt B) Pain + qt nxt vector- -
→
Spam Farms
÷ .
Google counter
^seafhs¥at
Trust ( noes ? )
only teleport to trustedpages .
r ← qx , pagerank
t ← get trusted teleport
rts ) - ft ; ) if l arse → spam
↳ truthfulness of webpage
Word CountConsider as input all of English Wikipedia stored in DFS. Goal is tocount how many times each word is used.
Inverted IndexConsider as input all of English Wikipedia stored in DFS. Goal is tobuild an index, so each word has a list of pages it is in.
PhrasesConsider as input all of English Wikipedia stored in DFS. Goal is tobuild an index, on 3-grams (sequence of 3 words) that appears onexactly one page, with link to page.
Label Propagation (Graph)Consider a large graph G = (V ,E ) (e.g., a social network), with asubset of notes V 0 ⇢ V with labels (e.g., {pos, neg}). Each nodestores its label (if any) and edges.Assign a vertex a label if (a) unlabled, (b) has � 5 labeledneighbors, (c) based on majority vote.
Label Propagation (Embedding)Consider a data set X ⇢ Rd , with a subset of points X 0 ⇢ X withlabels (e.g., {pos, neg}). Implicitly defines graph with V = X andE using k = 20 nearest neighbors.Assign a vertex a label if (a) unlabled, (b) has � 5 labeledneighbors, (c) based on majority vote.
Example PageRank
M =
2
664
0 1/2 0 01/3 0 1 1/21/3 0 0 1/21/3 1/2 0 0
3
775
Stripes:
M1 =
2
664
01/31/31/3
3
775 M2 =
2
664
1/2001/2
3
775 M3 =
2
664
0100
3
775 M4 =
2
664
01/21/20
3
775
These are stored as�1 : (1/3, 2), (1/3, 3), (1/3, 4)
�,�
2 : (1/2, 1)(1/2, 4)�,�3 : (1, 3)
�, and
�4 : (1/3, 1), (1/2, 2)
�.
Example PageRank
M =
2
664
0 1/2 0 01/3 0 1 1/21/3 0 0 1/21/3 1/2 0 0
3
775
Stripes:
M1 =
2
664
01/31/31/3
3
775 M2 =
2
664
1/2001/2
3
775 M3 =
2
664
0100
3
775 M4 =
2
664
01/21/20
3
775
These are stored as�1 : (1/3, 2), (1/3, 3), (1/3, 4)
�,�
2 : (1/2, 1)(1/2, 4)�,�3 : (1, 3)
�, and
�4 : (1/3, 1), (1/2, 2)
�.
Example PageRank
M =
2
664
0 1/2 0 01/3 0 1 1/21/3 0 0 1/21/3 1/2 0 0
3
775
Blocks:
M1,1 =
0 1/2
1/3 0
�M1,2 =
0 01 1/2
�M2,1 =
1/3 01/3 1/2
�M2,2 =
0 1/20 0
�
These are stored as�1 : (1/2, 2)
�,�2 : (1/3, 1)
�, as�
2 : (1, 3), (1/2, 4)�, as
�3 : (1/3, 1)
�,�4 : (1/3, 1), (1/2, 2)
�, and
as�3 : (1/2, 4)
�.