C20.0046: Database Management Systems
Lecture #27
M.P. Johnson
Stern School of Business, NYU
Spring 2005


Agenda
  Last time: Data Mining
  RAID
  Websearch
  Etc.


Goals after today:
  1. Understand what RAID is
  2. Be able to perform RAID 4
  3. Understand some issues in websearch
  4. Be able to perform PageRank


New topic: Recovery

Type of crash                     Prevention
Wrong data entry                  Constraints and data cleaning
Disk crashes                      Redundancy: e.g. RAID, archive
Fire, theft, bankruptcy…          Buy insurance, change jobs…
System failures: e.g. blackout    DATABASE RECOVERY


System Failures (skip?)
  Each transaction has internal state
  When system crashes, internal state is lost
    Don’t know which parts executed and which didn’t
  Remedy: use a log
    A file that records each action of each xact
    Trail of breadcrumbs
  See text for details…


Media Failures
  Rule of thumb: Pr(hard drive has head crash within 10 years) = 50%
  Simpler rule of thumb: Pr(hard drive has head crash within 1 year) = (say) 10%
  If you have many drives, a crash becomes a regular occurrence
  Soln: different RAID strategies
  RAID: Redundant Arrays of Independent Disks
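To see why many drives make crashes routine, here is a minimal sketch. It assumes the slide's (say) 10% yearly figure and, as an added assumption not stated on the slide, that drives fail independently. The function name is hypothetical.

```python
def prob_at_least_one_failure(n_drives: int, p_fail: float = 0.10) -> float:
    """Chance that at least one of n drives crashes within a year.

    Assumes each drive fails independently with probability p_fail.
    """
    return 1.0 - (1.0 - p_fail) ** n_drives


for n in (1, 10, 50):
    print(n, round(prob_at_least_one_failure(n), 3))
```

Under these assumptions the chance climbs quickly with the number of drives, which is the motivation for building in redundancy.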


RAID levels
  RAID level 1: each disk gets a mirror
  RAID level 4: one disk is xor of all others
    Each bit is sum mod 2 of corresponding bits
  E.g.:
    Disk 1: 11110000
    Disk 2: 10101010
    Disk 3: 00111000
    Disk 4:
  How to recover?
  Various other RAID levels in text…


RAID levels
  RAID level 1: each disk gets a mirror
  RAID level 4: one disk is xor of all others
    Each bit is sum mod 2 of corresponding bits
  E.g.:
    Disk 1:
    Disk 2: 10101010
    Disk 3: 00111000
    Disk 4:
  How to recover?
  Various other RAID levels in text…
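The RAID 4 exercise can be worked as a short sketch. It uses the three data disks from the example, computes the parity disk as their bitwise xor, and then rebuilds disk 1 from the survivors. The helper name `parity` is hypothetical, and real RAID works on disk blocks rather than 8-bit strings.

```python
from functools import reduce


def parity(blocks):
    """XOR equal-length bit strings together, bit by bit (sum mod 2)."""
    width = len(blocks[0])
    value = reduce(lambda a, b: a ^ b, (int(block, 2) for block in blocks))
    return format(value, f"0{width}b")


disk1, disk2, disk3 = "11110000", "10101010", "00111000"

# Disk 4 is the parity disk: the xor of all the data disks.
disk4 = parity([disk1, disk2, disk3])

# If disk 1 is lost, xor-ing the surviving disks reproduces it.
recovered = parity([disk2, disk3, disk4])
print(disk4, recovered)
```

Recovery works because xor-ing a value in twice cancels it out, so any single lost disk equals the xor of the others.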


Next topic: Websearch
  Create a search engine for searching the web
  DBMS queries use tables and (optionally) indices
  First thing to understand about websearch: we never run queries on the web
    Way too expensive, for several reasons
  Instead:
    Build an index of the web
    Search the index
    Return the results


Crawling
  To obtain the data for the index, we crawl the web
    Automated web-surfing
    Conceptually very simple
    But difficult to do robustly
  First, must get pages
  Prof. Davis (NYU/CS)’s example:
    http://www.cs.nyu.edu/courses/fall02/G22.3033-008/WebCrawler.java
    http://pages.stern.nyu.edu/~mjohnson/dbms/eg/WebCrawler.java
  Rule of thumb: 1 page per minute
  Run program:
    sales% cd ~mjohnson/public_html/dbms/eg
    sales% java WebCrawler http://pages.stern.nyu.edu/~mjohnson/dbms 200


Crawling issues in practice
  DNS bottleneck
    to view page by text link, must get address
    BP claim: 87% crawling time ~ DNS look-up
  Search strategy?
  Refresh strategy?
  Primary key for webpages
    Use artificial IDs, not URLs
    more popular pages get shorter DocIDs (why?)


Crawling issues in practice
  Content-seen test
    compute fingerprint/hash (again!) of page content
  robots.txt
    http://www.robotstxt.org/wc/robots.html
  Bad HTML
    Tolerant parsing
  Non-responsive servers
  Spurious text
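The content-seen test can be sketched as follows. The choice of SHA-256 and the names `seen_fingerprints` and `content_seen` are assumptions for illustration; the slides only say that a fingerprint/hash of the page content is computed.

```python
import hashlib

# Fingerprints of every page body the crawler has stored so far.
seen_fingerprints = set()


def content_seen(page_text: str) -> bool:
    """Return True if an identical page body was already recorded."""
    fingerprint = hashlib.sha256(page_text.encode("utf-8")).hexdigest()
    if fingerprint in seen_fingerprints:
        return True
    seen_fingerprints.add(fingerprint)
    return False
```

Comparing short fingerprints instead of whole pages lets the crawler skip mirrored content reached through different URLs. This sketch only catches exact duplicates.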


Inverted indices
  Basic idea of finding pages:
    Create inverted index mapping words to pages
  First, think of each webpage as a tuple
    One column for each possible word
    True means the word appears on the page
    Index on all columns
  Now can search: john bolton
    select * from T where john=T and bolton=T


Inverted indices
  Can simplify somewhat:
    1. For each field index, delete False entries
    2. True entries for each index become a bucket
  Create an inverted index:
    One entry for each search word
      the lexicon
    Search word entry points to corresponding bucket
    Bucket points to pages with its word
      the postings file
  Final intuition: the inverted index doesn’t map URLs to words
    It maps words to URLs
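A minimal sketch of the idea, assuming whitespace tokenization and integer doc IDs (both simplifications). The dictionary keys play the role of the lexicon and each sorted list of doc IDs plays the role of a bucket in the postings file. The function names and the three toy pages are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each word to the sorted list of doc IDs whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search_all(index, query):
    """Return the doc IDs that contain every word of the query (an AND query)."""
    buckets = [set(index.get(word, [])) for word in query.lower().split()]
    return sorted(set.intersection(*buckets)) if buckets else []


pages = {
    1: "john bolton speaks",
    2: "john smith speaks",
    3: "bolton wanderers win",
}
index = build_inverted_index(pages)
print(search_all(index, "john bolton"))
```

The query never touches the pages themselves: it looks up each word's bucket and intersects the buckets.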


Inverted Indices
  What’s stored? For each word W, for each doc D:
    relevance of D to W
    #/% occurs. of W in D
    meta-data/context: bold, font size, title, etc.
  In addition to page importance, keep in mind:
    this info is used to determine relevance of particular words appearing on the page


Search engine infrastructure

Image from here: http://www.cs.wisc.edu/~dbbook/openAccess/thirdEdition/slides/slides3ed-english/Ch27c_ir3-websearch-95.pdf


Google-like infrastructure
  Very large distributed system
    File sizes routinely in GBs
    Google File System
      Block size = 64MB (not kb)!
    100k+ low-quality Linux boxes
      system failures are the rule, not exception
  Divide index up by words into many barrels
    lexicon maps word ids to word’s barrel
    also, do RAID-like strategy
      two-D matrix of servers
      many commodity machines, frequent crashes
    Draw picture
    May have more duplication for popular pages…


Google-like infrastructure
  To respond to single-word query Q(w):
    send to the barrel column for word w
      pick random server in that column
    return (some) sorted results
  To respond to multi-word query Q(w1…wn):
    for each word wi, send to the barrel column for wi
      pick random server in that column
    for all words in parallel, merge and prune
      step through until find doc containing all words, add to results
      index ordered on word;docID, so linear time
    return (some) sorted results
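The merge-and-prune step for a multi-word query can be sketched as a linear pass over postings lists that are already sorted by docID. The function names are hypothetical, and the distribution across barrels and servers is left out.

```python
def intersect_postings(a, b):
    """Intersect two ascending docID lists in a single linear pass."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])   # doc contains both words
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1                # advance whichever list is behind
        else:
            j += 1
    return result


def docs_with_all_words(postings_lists):
    """Fold the pairwise merge over every query word's postings list."""
    if not postings_lists:
        return []
    result = postings_lists[0]
    for postings in postings_lists[1:]:
        result = intersect_postings(result, postings)
    return result
```

Because each list is ordered on docID, no step ever moves backward, which is why the merge takes time linear in the total length of the lists.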


Websearch v. DBMS

DBMS                      IR/websearch
precise query language    KISS
SQL selects               Keyword search
Relational schemas        Loosely structured
Generate full answer      Display first/next k results
Read/write                Read-mostly
Commits immediately       Commits eventually


New topic: Sorting Results
  How to respond to Q(w1,w2,…,wn)?
    Search index for pages with w1,w2,…,wn
    Return in sorted order (how?)
  Soln 1: current order
    Return 100,000 (mostly) useless results
    Sturgeon's Law: “Ninety percent of everything is crud.”
  Soln 2: ways from Information Retrieval Theory
    library science + CS = IR


Simple IR-style approach
  for each word W in a doc D, compute
    # occurs of W in D / total # word occurs in D
  each document becomes a point in a space
    one dimension for every possible word
      Like k-NN and k-means
    value in that dim is ratio from above (maybe weighted, etc.)
  Choose pages with high values for query words
  A little more precisely: each doc becomes a vector in space
    Values same as above
    But: think of the query itself as a document vector
    Similarity between query and doc = dot product / cos
    Draw picture
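A minimal sketch of this vector-space scoring, assuming whitespace tokenization and no extra weighting. Each text becomes a sparse vector of word-frequency ratios, and a query is scored against a document by the cosine of the angle between their vectors. The function names and the two toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Map each word to (# occurrences in text) / (total # words in text)."""
    words = text.lower().split()
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = {
    "d1": "bill clinton gave a speech",
    "d2": "the stock market fell",
}
query = tf_vector("bill clinton")
ranked = sorted(docs, key=lambda d: cosine(query, tf_vector(docs[d])), reverse=True)
print(ranked)
```

A page consisting of nothing but the query words has cosine 1 with the query, so very short pages can outrank longer and better ones.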


Information Retrieval Theory
  With some extensions, this works well for relatively small sets of quality documents
  But the web has 8 billion documents
  Problem: if based just on percentages, very short pages containing query words score very high
    BP: query a “major search engine” for “bill clinton”
      “Bill Clinton Sucks” page


Soln 3: sort by “quality”
  What do you mean by quality?
  Hire readers to rate my webpage (early Yahoo)
  Problem: doesn’t scale well
    more webpages than Yahoo employees…


Soln 4: count # citations (links)
  Idea: you don’t have to hire webpage raters
  The rest of the web has already voted on the quality of my webpage
  1 link to my page = 1 vote
  Similar to counting academic citations
    Peer review


Soln 5: Google’s PageRank
  Count citations, but not equally – weighted sum
  Motiv: we said we believe that some pages are better than others
    those pages’ votes should count for more
  A page can get a high PageRank many ways
  Two cases at ends of a continuum:
    many pages link to you
    yahoo.com links to you
  PageRank, not PigeonRank
    Search for “PigeonRank”…


PageRank
  More precisely, let P be a page; for each page Li that links to P, let C(Li) be the number of pages Li links to.
  Then PR0(P) = SUM(PR0(Li)/C(Li))
  Motiv: each page votes with its quality; its quality is divided among the pages it votes for
  Extensions: bold/large type/etc. links may get larger proportions…


Understanding PageRank (skip?)
  Analogy 1: Friendster/Orkut
    someone “good” invites you in
    someone else “good” invited that person in, etc.
  Analogy 2: PKE certificates
    my cert authenticated by your cert
    your cert endorsed by someone else's…
  Both cases here: eventually reach a foundation
  Analogy 3: job/school recommendations
    three people recommend you
    why should anyone believe them?
      three other people rec-ed them, etc.
    eventually, we take a leap of faith


Understanding PageRank
  Analogy 4: Random Surfer Model
  Idealized web surfer:
    First, start at some page
    Then, at each page, pick a random link…
  Turns out: after long time surfing,
    Pr(we’re at some page P right now) = PR0(P)
    PRs are normalized


Computing PageRank
  For each page P, we want:
    PR(P) = SUM(PR(Li)/C(Li))
  But it’s circular – how to compute?
  Meth 1: for n pages, we've got n linear eqs and n unknowns
    can solve for all PR(P)s, but too hard
    see your linear algebra course…
  Meth 2: iteratively
    start with PR0(P) set to E for each P
    iterate until no more significant change
    BP report O(50) iterations for O(30M) pages/O(300M) links
    #iters req. grows only with log of web size
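Method 2 can be sketched in a few lines. The three-page graph `toy_web`, the function name, and the tolerance are assumptions for illustration. This is not the course's PageRank program, whose source is not shown in the slides.

```python
def pagerank_iterate(links, start=1.0, tol=1e-10, max_iters=1000):
    """Repeat PR(P) = SUM(PR(Li)/C(Li)) until no rank moves by more than tol."""
    ranks = {page: start for page in links}
    for _ in range(max_iters):
        new_ranks = {page: 0.0 for page in links}
        for page, out_links in links.items():
            for target in out_links:
                # Each page splits its current rank evenly among its out-links.
                new_ranks[target] += ranks[page] / len(out_links)
        change = max(abs(new_ranks[p] - ranks[p]) for p in links)
        ranks = new_ranks
        if change < tol:
            break
    return ranks


# Hypothetical web: A links to B and C, B links to C, C links to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank_iterate(toy_web)
print({page: round(rank, 3) for page, rank in ranks.items()})
```

On this graph the ranks settle because every page has an out-link and the graph is strongly connected and aperiodic. Graphs without those properties can behave differently.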


Problems with PageRank
  Example (from Ullman):
    A points to Y, M; Y points to self, A; M points nowhere
    draw picture
  Start A,Y,M at 1:
    (1,1,1) → … → (0,0,0)
    The rank dissipates
  Soln: add (implicit) self link to any dead-end

  sales% cd ~mjohnson/public_html/dbms/eg
  stern% java PageRank


Problems with PageRank
  Example (from Ullman):
    A points to Y, M; Y points to self, A; M points to self
  Start A,Y,M at 1:
    (1,1,1) → … → (0,0,3)
    Now M becomes a rank sink
    RSM interp: we eventually end up at M and then get stuck
  Soln: add “inherent quality” E to each page

  stern% java PageRank2
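Both Ullman examples can be reproduced with a short sketch of the unmodified iteration. The function name and the fixed count of 200 iterations are assumptions. The course's PageRank and PageRank2 programs are not shown in the slides and may differ.

```python
def iterate_ranks(links, iterations=200):
    """Apply PR(P) = SUM(PR(Li)/C(Li)) repeatedly, starting every page at 1."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_ranks = {page: 0.0 for page in links}
        for page, out_links in links.items():
            for target in out_links:
                new_ranks[target] += ranks[page] / len(out_links)
        ranks = new_ranks
    return ranks


dead_end = {"A": ["Y", "M"], "Y": ["Y", "A"], "M": []}       # M points nowhere
rank_sink = {"A": ["Y", "M"], "Y": ["Y", "A"], "M": ["M"]}   # M points to itself

dissipated = iterate_ranks(dead_end)   # rank leaks out through M
sunk = iterate_ranks(rank_sink)        # rank piles up in M
print(dissipated, sunk)
```

With the dead end, whatever reaches M is never passed on, so the total shrinks toward zero. With the self link the total stays at 3 but all of it drains into M.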


Modified PageRank
  Apart from inherited quality, each page also has inherent quality E:
    PR(P) = E + SUM(PR(Li)/C(Li))
  More precisely, have weighted sum of the two terms:
    PR(P) = .15*E + .85*SUM(PR(Li)/C(Li))
  Leads to a modified random surfer model

  stern% java PageRank3
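The weighted formula can be tried on the rank-sink graph (A points to Y and M, Y points to itself and A, M points to itself). This is a sketch assuming E = 1 for every page and a fixed number of iterations. The function and parameter names are hypothetical, and the course's PageRank3 program is not shown in the slides.

```python
def modified_pagerank(links, e=1.0, damping=0.85, iterations=200):
    """Iterate PR(P) = (1 - damping)*E + damping*SUM(PR(Li)/C(Li))."""
    ranks = {page: e for page in links}
    for _ in range(iterations):
        # Every page starts each round with its inherent share of quality.
        new_ranks = {page: (1 - damping) * e for page in links}
        for page, out_links in links.items():
            for target in out_links:
                new_ranks[target] += damping * ranks[page] / len(out_links)
        ranks = new_ranks
    return ranks


rank_sink = {"A": ["Y", "M"], "Y": ["Y", "A"], "M": ["M"]}
ranks = modified_pagerank(rank_sink)
print({page: round(rank, 3) for page, rank in ranks.items()})
```

M still ranks highest, but A and Y now keep a positive rank because the .15*E term is handed out regardless of links.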


Random Surfer Model’
  Motiv: if we (qua random surfer) end up at page M, we don’t really stay there forever
    We type in a new URL
  Idealized web surfer:
    First, start at some page
    Then, at each page, pick a random link
    But occasionally, we get bored and jump to a random new page
  Turns out: after long time surfing,
    Pr(we’re at some page P right now) = PR(P)
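The claim can be checked with a small simulation. This is a sketch under stated assumptions: the surfer gets bored with probability 0.15 (matching the .15 weight on E), a bored surfer jumps to a page chosen uniformly from all pages, and the graph is the rank-sink example (A points to Y and M, Y points to itself and A, M points to itself). The function name, step count, and seed are hypothetical.

```python
import random


def surf(links, steps=200_000, boredom=0.15, seed=0):
    """Simulate the bored random surfer; return the fraction of time on each page."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = {page: 0 for page in pages}
    current = pages[0]
    for _ in range(steps):
        visits[current] += 1
        if rng.random() < boredom or not links[current]:
            current = rng.choice(pages)            # bored (or stuck): jump anywhere
        else:
            current = rng.choice(links[current])   # follow a random out-link
    return {page: count / steps for page, count in visits.items()}


rank_sink = {"A": ["Y", "M"], "Y": ["Y", "A"], "M": ["M"]}
time_share = surf(rank_sink)
print({page: round(share, 3) for page, share in time_share.items()})
```

The fractions approximate PR(P) divided by the number of pages, since with E = 1 the ranks sum to the page count rather than to 1. The surfer spends most of its time at M but no longer gets stuck there.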


Understanding PageRank
  One more interp: hydraulic model
    picture the web graph again
    imagine each link as a tube bet. two nodes
    imagine quality as fluid
    each node is a reservoir initialized with amount E of fluid
  Now let flow…
  Steady state is: each node P w/PR(P) amount of fluid
    PR(P) of fluid eventually settles in node P
    equilibrium


Somewhat analogous systems (skip?)
  Sornette: “Why Stock Markets Crash”
    Si(t+1) = sign(ei + SUM(Sj(t)))
  trader buys/sells based on
    1. his inclination and
    2. what his associates are saying
  direction of magnet det-ed by
    1. old direction and
    2. dirs. of neighbors
  activation of neuron det-ed by
    1. its props and
    2. activation of neighbors connected by synapses
  PR of P based on
    1. its inherent value and
    2. PR of in-links


Non-uniform Es (skip?)
  So far, assumed E was const for all pages
  But can make E a function E(P)
    vary by page
  How do we choose E(P)?
    Idea 1: set high for pages with high PR from earlier iterations
    Idea 2: set high for pages I like
      BP paper gave high E to John McCarthy’s homepage
        pages he links to get high PR, etc.
      Result: his own personalized search engine
      Q: How would google.com get your prefs?


Tricking search engines
  “Search Engine Optimization”
  Challenge: include on your page lots of words you think people will query on
    maybe hidden with same color as background
  Response: popularity ranking
    the pages doing this probably aren't linked to that much
    but…


Tricking search engines
  I can try to make my page look popular to the search engine
  Challenge: create a page with 1000 links to my page
    does this work?
  Challenge: Create 1000 other pages linking to it
  Response: limit the weight a single domain can give to itself
  Challenge: buy a second domain and put the 1000 pages there
  Response: limit the weight from any single domain…


Using anchor text
  Another good idea: use anchor text
  Motiv: pages may not give best descrips. of themselves
    most search engines don’t contain "search engine"
    BP claim: only 1 of 4 “top search engines” could find themselves on query "search engine"
  Anchor text also describes page:
    many pages link to google.com
    many of them likely say "search engine" in/near the link
    Treat anchor text words as part of page
  Search for “US West” or for “g++”


Tricking search engines
  This provides a new way to trick the search engine
  Use of anchor text is a big part of result quality
    but has potential for abuse
    Lets you influence the appearance of other people’s pages
  Google Bombs
    put up lots of pages linking to my page, using some particular phrase in the anchor text
    result: search for words you chose produces my page
    Examples: "talentless hack", "miserable failure", "waffles", the last name of a prominent US senator…


Bidding for ads
  Google had two really great ideas:
    1. PageRank
    2. AdWords/AdSense
  Fundamental difficulty with mass-advertising:
    Most of the audience doesn’t want it
    Most people don’t want what you’re selling
    Think of car commercials on TV
  But some of them do!


Bidding for ads
  If you’re selling widgets, how do you know who wants them?
    Hard question, so answer its inversion
  If someone is searching for widgets, what should you try to sell them?
    Easy – widgets!
  Whatever the user searches for, display ads relevant to that query


Bidding for ads
  Q: How to divvy correspondences up?
  A: Create a market, and let the divvying take care of itself
  Each company places the bid it’s willing to pay for an ad responding to a particular query
  Ad auction “takes place” at query-time
    Relevant ads displayed in descending bid order
    Company pays only if user clicks
  AdSense: place ads on external webpages, auction based on page content instead of query
  Huge huge huge business
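The query-time auction can be sketched as follows. The company names, bid amounts, and number of ad slots are made up, and charging the clicked company exactly its own bid is an assumption, since the slides do not state the pricing rule.

```python
def run_auction(bids, query, slots=3):
    """Return the (company, bid) pairs for this query, highest bid first."""
    matching = [(company, amount) for company, q, amount in bids if q == query]
    matching.sort(key=lambda pair: pair[1], reverse=True)
    return matching[:slots]


def charge(displayed, clicked_company):
    """Only the company whose ad was clicked pays (assumed: it pays its own bid)."""
    return {company: (amount if company == clicked_company else 0.0)
            for company, amount in displayed}


# Hypothetical standing bids: (company, query, bid in dollars per click).
bids = [
    ("Acme", "widgets", 0.50),
    ("Bolt", "widgets", 0.75),
    ("Cog", "gears", 1.00),
]
displayed = run_auction(bids, "widgets")
print(displayed, charge(displayed, "Acme"))
```

Bids are placed ahead of time, so the auction at query time amounts to a filter and a sort over the standing bids.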


For more info
  See sources drawn upon here:
  Prof. Davis (NYU/CS) search engines course
    http://www.cs.nyu.edu/courses/fall02/G22.3033-008/
  Original research papers by Page & Brin:
    The PageRank Citation Ranking: Bringing Order to the Web
    The Anatomy of a Large-Scale Hypertextual Web Search Engine
    Links on class page
    Interesting and very accessible
  Google Labs: http://labs.google.com


You mean that’s it?
  Final Exam: next Thursday, 5/5, 10-11:50am
    Final exam info is up
  Course grades are curved
  Interest in a review session?
  Please fill out course evals!
    https://ais.stern.nyu.edu/
  Comments by email, etc., are welcome
  Thanks!