Digital Libraries Initiatives: What I learned (and didn't) in 10 years Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh, Zoltan Gyongyi, Andy Kacsmar, Sep Kamvar, Wang Lam, Mor Naaman, Larry Page, Andreas Paepcke, Sriram Raghavan, Gary Wesley, Rebecca Wesley, and others...


Digital Libraries Initiatives: What I learned (and didn't) in 10 years

Hector Garcia-Molina

Stanford University

Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh, Zoltan Gyongyi, Andy Kacsmar, Sep Kamvar,

Wang Lam, Mor Naaman, Larry Page, Andreas Paepcke, Sriram Raghavan, Gary Wesley, Rebecca Wesley, and others...


Outline

• DLI I & II Experience
  – (with special help from Andreas and Rebecca)

• Stanford Research

• “Controversial” Questions for the Future


Disclaimer

• Stanford Perspective


DLI Experience

• Lots of great research!

• Lots of great content!

• Main Event was....


Main DL Event

IEEE and ACM DL Conferences Merge into JCDL!!


The WWW Tsunami

• Before the Web:
  – Publishers, catalogs,...
  – Librarians: see the need for technology
  – CS Types: want to have social impact


The WWW Tsunami

• The Web Arrives:
  – few coherent collections
  – producers = consumers
  – everything free
  – heterogeneous
  – merge:
    • shopping,
    • entertainment
    • library services ...


CS-Library “Tensions”

• Web generated a lot of excitement, but...

• “Friendly tensions” as everyone adjusted:
  – Techies take all the funding!
  – Librarians don’t get it!


Example: CS-TR Experience

• History

• Copyright issues

• Pubs servers everywhere

• Citeseer,...

• Organization vs chaos
• Chaos wins! (this round)

DLI I & II

NCSTRL


Bright Future

• DLI I & II made important contributions (more later...)
• Huge volume of information available
• Direct communication between authors and librarians
• Core library functions needed, more than ever:
  – organization
  – curation
  – trusted information
  – ...

DLI I & II

NCSTRL

today


Stanford DLI Project


Stanford Theme - Phase I

• “GLUE” for accessing diverse libraries and services

[Diagram components: Internet; Libraries; Payment Institutions; Search Agents; User Interfaces and Annotations; Commercial Information Brokers & Providers; Copyright Services; Query/Data Conversion; protocols HTTP, Z39.50, Telnet]


InfoBus Example

[Diagram components: sources Folio, Dialog, DigiCash, F.V.; proxies Folio Proxy, Dialog Proxy, DigiCash Proxy, F.V. Proxy; services DLite, Gloss, Query Trans, MetaData, U-Pai, Contracts]

Q: Find Ti distributed (W) systems

• Suggested: Folio, Dialog
• Query Translation
  – Q’: Find Ti distributed AND systems
• Pay per View


InfoBus Details

[Diagram components: DLITE client; Z39.50 client (Z-cl); SenseMaker client; four LSPs; Z-sr; Z39.50 Library; Libraries L1 ... Ln; Services S1 ... Sn; Payment, Translation, MetaData, … Services]

• ILU Objects
• Information Models
• DLI Protocol


Querying Sources

• Differences: Language, Operators, Attributes,...

Q1: title contains large AND distributed (W) system

Q2: FIND heading large AND distributed NEAR system


Query Translation

[Diagram components: User query; Query Translator (uses target syntax, capabilities, schemas); Target IR Systems; Filter Queries; Post-Filter; Final results]


Stop Word Examples

• User Query Q1:
  – title contains gone AND with AND the AND wind
• Subsuming Query QS: (for Dialog)
  – title contains gone AND wind
• Filter Query QF:
  – title contains with AND the

[Diagram labels: Q1, query trans, QS, source, AS, QF, post-filter, A1]
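The stop-word example above can be sketched in a few lines of Python. This is an illustration only: the stop-word list, the list-of-terms query form, the in-memory list of titles standing in for a source, and every function name are assumptions, not part of the system described in the talk.

```python
# Sketch only: STOP_WORDS, the list-of-terms query form, and all function
# names are illustrative assumptions, not part of the original system.
STOP_WORDS = {"with", "the"}


def translate(user_terms):
    """Split an AND query into a subsuming query QS and a filter query QF."""
    qs = [t for t in user_terms if t not in STOP_WORDS]  # sent to the source
    qf = [t for t in user_terms if t in STOP_WORDS]      # applied locally
    return qs, qf


def matches(title, terms):
    """True when every term occurs as a word in the title."""
    words = title.lower().split()
    return all(t in words for t in terms)


def search(source_titles, user_terms):
    qs, qf = translate(user_terms)
    # AS: what a source that cannot search on stop words returns for QS.
    subsuming_answer = [t for t in source_titles if matches(t, qs)]
    # A1: the post-filter enforces the stop-word terms the source dropped.
    return [t for t in subsuming_answer if matches(t, qf)]


titles = ["Gone with the Wind", "Wind Gone", "The Wind Has Gone with Us"]
q1 = ["gone", "with", "the", "wind"]
print(translate(q1))
print(search(titles, q1))
```

The source answers a broader query than the user asked, so the post-filter is what restores the original meaning.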


Stop Word Examples

• User Query Q1:
  – title contains gone (W) with (W) the (W) wind
• Subsuming Query QS: (for Dialog)
  – title contains gone (2W) wind
• Filter Query QF:
  – title contains gone (W) with (W) the (W) wind

[Diagram labels: Q1, query trans, QS, source, AS, QF, post-filter, A1]


Translation Overhead: Stop Words

[Chart labels: size of user query with (W) operator; size of subsuming query without stop words; Text field on Dialog]


Summary

• Option 1: Avoid Translation
  – Need: common language and operators
  – Need: common attributes
• Option 2: Translate
  – Need: source meta-data
  – Need: user involvement in translation


Stanford DLI II: Technical Barriers

Economic Weaknesses

Information Loss

Information Overload

Service Heterogeneity

Physical Barriers


WebBase Architecture

[Diagram components: WWW; Web Crawlers; Repository; Multicast Engine; Feature Repository; Retrieval Indexes; Webbase API; Clients]


PowerBrowser - Start Screen


PowerBrowser - Hypertext View


Copy Detection

Copy Detection System


Replicated Collections on the Web


Archival Repository

[Diagram components: server (Stanford TRs); server (Illinois TRs); Stanford archival repository; Illinois archival repository]


Archival Repository Design

• If I have $100K/yr
• Want 99.999% “reliability”
  – how many copies
  – how much preventive maintenance
  – ???

[Chart: Preventive Maintenance and Aging. x-axis: Start of Aging (years): 1, 3, 5, 10, Never. y-axis: MTTF (years), 0 to 80. Series: Preventive Maintenance Period (years): 1, 3, 5, 10, Never]


Crawler Friendly Web Servers

• Year 2000 Paper:
  – Onn Brandman, Junghoo Cho, Narayanan Shivakumar
  – Help crawlers identify pages of interest

[Diagram components: web server; crawler; labels: pull, digest]

Other options:
• Push
• Filter service
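One way to picture the pull-a-digest idea is the sketch below. It assumes, purely for illustration, that the digest maps each URL to a last-modified timestamp and that the crawler remembers the timestamp it last saw; the slide does not specify the digest format, and all names here are hypothetical.

```python
# Sketch only: the digest format (URL -> last-modified timestamp) and every
# name below are assumptions for illustration, not the paper's protocol.
def pages_to_fetch(server_digest, crawler_state):
    """Return URLs that are new or modified since the crawler's last visit."""
    return sorted(
        url
        for url, modified in server_digest.items()
        if crawler_state.get(url, float("-inf")) < modified
    )


server_digest = {"/index.html": 300, "/papers.html": 120, "/new.html": 310}
crawler_state = {"/index.html": 200, "/papers.html": 120}  # last seen versions

changed = pages_to_fetch(server_digest, crawler_state)
print(changed)  # only these pages need to be pulled
```

Under this assumption the crawler skips pages whose timestamp has not advanced.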


Needless Requests


Improved Freshness


DLI Technology Transfer

• Research Product: Students

• Transfer Takes Time!


Economic Weaknesses

Information Loss

Information Overload

Service Heterogeneity

Physical Barriers

• Interoperability

• Value Filtering

• Mobile Access

• IP Infrastructure

• Archival Repository

Technologies for Digital Libraries


“Controversial” Questions


Is Metadata Dead?

document

metadata


Will the Semantic Web Make It?

• Will tags be generated?
• By whom?
• Agreement

[Diagram components: web; semantic tags; ?; Search Engine]


Is Google the Future Digital Library?


Not Online, Not Worth Having?

• Bill Arms Quote


Are Publishers Still Needed?


Here Today, Gone Tomorrow?

• Will we find today’s materials in 50 years?


Will Lawyers Win?


Summary

• We learned a lot from DLI I & II

• Trained students who are changing the world

• Many challenges ahead...


Extra Slides


Outline

• dli
  – 94-98; 00-05
  – lots of great research; wonderful sites (cervantes)
  – the web; like doing research on tidal pools when tsunami hits
  – before the web:
    • librarians: catalogs, publishers in control, research funding low
    • com sci: chance to have impact; do good for society
  – the web
    • blurred distinction between producers and consumers
    • no coherent collections (with curator who controlled, organized...)
    • everything free (expectation that...)
    • heterogeneity (beyond html...)
    • merged shopping, work, library, entertainment... blurred distinctions...
  – tensions cs-librarians
    • cs folks taking all the funding to work on technology
    • librarians “don’t get it”; times are changing
    • CS-TR experience...copyrights, servers, search, etc...


Outline

• dli (continued)
  – Bright Future
    • direct communication between librarians and authors (camera ready...)
    • huge volume of information available
    • core function of librarianship remains (organize, categorize,....); now more than ever: need to filter out junk, need to organize, synthesize....
    • more on this future later on in talk...


Outline

• summary of stanford work


Outline

• dli

• summary of stanford work

• future issues
  – will semantic web ever make it?
  – Is metadata really dead?
  – Are publishers still needed?
  – Is Google the digital library of the future?
    • google scans books
  – Is paper relevant?
    • bill arms: “If it is not online, it is not worth having”
    • my students do not cite anything not online (Michigan story)
  – Will we be able to find today’s digital materials in 50 years?
  – How will DLs be funded? DL Research funded?


Research Areas

• Interface: our window to a digital library
• Interoperation: accessing heterogeneous services
• Discovery: finding desired resources
• Translation: speaking the right language
• Payment: multiple policies & currencies
• Interpretation: understanding results
• Creation: generating new information


Outline

• Overview of Digital Library Initiative
• The Stanford Digital Library Project
  – Overview
  – The InfoBus
  – Internet Meta-Searching
    • Discovery
    • Querying
    • Merging and ranking
  – STARTS Protocol


Discovery: Exhaustive Searching

Source

Source

Queries

Answer

Answer


Discovery: Full Index

[Diagram components: Sources, each with an Extractor; INDEX; Query; Document Identifiers; Requests for Specific Documents; Full Text]


Discovery: GLOSS

[Diagram components: Sources, each with a Collector; GLOSS; Statistics; Query; Hints; Query to source]


Example:

• query: find author Knuth and title computers

• statistics GLOSS keeps on databases:

DB     #docs   #docs with author Knuth   #docs with title computers
db1    100     0                         3
db2    200     10                        200
db3    1000    100                       100
db4    1000    1                         1

Which database(s) should the user search?


Example (cont.):

• q = find author Knuth and title computers

DB     #docs   #docs with author Knuth   #docs with title computers
db1    100     0                         3
db2    200     10                        200
db3    1000    100                       100
db4    1000    1                         1

• Use IND predictor (others available).
• Resulting rank:
  ESize(q, db2) = (10/200)*(200/200)*200 = 10 docs
  ESize(q, db3) = (100/1000)*(100/1000)*1000 = 10 docs
  ESize(q, db4) = (1/1000)*(1/1000)*1000 = 0.001 docs
  ESize(q, db1) = (0/100)*(3/100)*100 = 0 docs
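The ESize arithmetic above can be reproduced with a short Python sketch. It treats the query terms as independent, as the worked numbers do, and multiplies the per-term fractions by the database size. The dictionary layout and the function name `esize` are illustrative assumptions.

```python
# Sketch of the independence-style estimate worked through above. The
# statistics come from the slide; the data layout and names are assumptions.
stats = {
    "db1": {"size": 100, "author Knuth": 0, "title computers": 3},
    "db2": {"size": 200, "author Knuth": 10, "title computers": 200},
    "db3": {"size": 1000, "author Knuth": 100, "title computers": 100},
    "db4": {"size": 1000, "author Knuth": 1, "title computers": 1},
}


def esize(db, terms):
    """Estimated number of matching documents, assuming independent terms."""
    size = stats[db]["size"]
    estimate = float(size)
    for term in terms:
        estimate *= stats[db][term] / size  # fraction of docs with this term
    return estimate


query = ["author Knuth", "title computers"]
# sorted() is stable, so databases with equal estimates keep their input order.
ranking = sorted(stats, key=lambda db: esize(db, query), reverse=True)
for db in ranking:
    print(db, esize(db, query))
```

Note that db2 and db3 tie on the estimate, so the hint cannot separate them without further information.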


GLOSS Results

• Experimental Evaluation
• GLOSS hints “very good” 85% to 90% of the time
• GLOSS index is 2% of the size of full index


Summary

• GLOSS and other resource discovery tools work…
• BUT require meta-data collection facilities.

[Diagram components: Source; Collector; Queries]


Translation Overhead: Stop Words

[Chart labels: size of subsuming query without stop words; size of user query with AND operator; Text field on Dialog]


Translation Overhead: Stop Words

[Chart labels: size of user query with (W) operator; size of subsuming query without stop words; Text field on Dialog; remaining length greater than 1]


Ranking & Interpreting Results

• How do we merge ranked results?
  – Example: Query: “distributed databases”
  – Source1: (d1, 0.7), (d2, 0.3)
  – Source2: (d3, 100), (d4, 82), (d5, 71)


Ranking & Interpreting Results

• Need additional information from sources
  – Example: Query: “distributed databases”
  – Source1:
      ( doc = d1, rank = 0.7, frequency[“distributed”] = 100, frequency[“databases”] = 1000, totalDocuments = 5000 ),
      ( doc = d2, rank = 0.3, frequency[“distributed”] = 10, frequency[“databases”] = 300, totalDocuments = 5000 )


Target Ranking

• Compute target ranking:
  – Source1: (d1, T100), (d2, T50)
  – Source2: (d3, T150), (d4, T80), (d5, T25)
• Merge:
  – Combined: (d3, T150), (d1, T100), (d4, T80), (d2, T50), (d5, T25)
• Question: Are we positive (d3, T150) is best?
  – Maybe (dx, 0.25) at Source1 (ranked below d2 there) has target rank of (dx, T200)??
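The merge step can be sketched as below, assuming the target scores have already been computed for each source's results. How those scores are derived from the exported statistics is not shown on the slides, so the sketch does not attempt it. The function name is hypothetical; `heapq.merge` is from the Python standard library.

```python
import heapq

# Target scores taken from the slide; each list is already sorted from the
# highest to the lowest target score, which heapq.merge requires here.
source1 = [("d1", 100), ("d2", 50)]
source2 = [("d3", 150), ("d4", 80), ("d5", 25)]


def merge_by_target(*ranked_lists):
    """Merge per-source result lists into one list ordered by target score."""
    return list(
        heapq.merge(*ranked_lists, key=lambda pair: pair[1], reverse=True)
    )


combined = merge_by_target(source1, source2)
print(combined)
```

The merge only sees documents the sources returned, which is exactly the concern raised in the question above: a document such as dx that Source1 ranked low and never returned cannot appear in the combined list, whatever its target score.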


Summary

• Sources need to export auxiliary ranking information
• We need some “knowledge” of source ranking function


STARTS

• Stanford Protocol for Internet Search and Retrieval

• Participants:
  – Fulcrum, Infoseek, Microsoft Network, Verity, WAIS
  – GILS, Harvest, Netscape, PLS, HP, others
• Goal: Simplify the Job of Meta-searchers.
• Goal: Simplicity
• Can be used by different transport protocols.
• Visit:
  – http://www-db.stanford.edu/~gravano/starts_home.html


STARTS Components

(1) Common scheme for collecting meta-data
(2) Common query language
(3) Common result ranking information

[Diagram components: Source; Collector; Queries; Answers; annotated with (1), (2), (3)]


Metadata Example (SOIF)

@SMetaAttributes{
Version{10}: STARTS 1.0
SourceID{8}: Source-1
FieldsSupported{17}: [basic-1 author]
ModifiersSupported{19}: {basic-1 phonetics}
FieldModifierCombinations{39}: ([basic-1 author] {basic-1 phonetics})
QueryPartsSupported{2}: RF
ScoreRange{7}: 0.0 1.0
RankingAlgorithmID{6}: Acme-1
...


Sample Query (SOIF)

@SQuery{
Version{10}: STARTS 1.0
FilterExpression{48}: ((author ``Ullman'') and (title stem ``databases''))
RankingExpression{61}: list( (body-of-text ``distributed'') (body-of-text ``databases''))
DropStopWords{1}: T
DefaultAttributeSet{7}: basic-1
DefaultLanguage{5}: en-US
AnswerFields{12}: title author
MinDocumentScore{3}: 0.5
MaxNumberDocuments{2}: 10
}


Meta-Searching Conclusion

• Need extra information from sources:
  – Meta-data
  – Ranking information
• For querying multiple sources:
  – Need standard query language; or
  – Need query translation machinery


Meta-Searching Conclusion

• Other issues:
  – Payment
  – Preserving advertisements
  – Improved “value” filtering


The Stanford Digital Library Project

[Diagram components: Internet; Libraries; Payment Institutions; Search Agents; User Interfaces and Annotations; Commercial Information Brokers & Providers; Copyright Services; Query/Data Conversion; protocols HTTP, Z39.50, Telnet]


Interoperability Challenges

• Growing number of players, formats, countries,...
• Repositories Services
• Dynamic artifacts

Digital Libraries


Standards

• Too Many
  – e.g., Z39.50, HTTP, SDLIP, CORBA, DASL, ...
• Narrow
  – e.g., XML not a silver bullet
• Nevertheless Important...translation


Query Translation Example

Q: Find Title contains(“cats” near “dogs”)

Q’: Find Title contains(“cats”) AND contains(“dogs”)

doc 1: blah, blah, cats and dogs blah, blah
doc 2: blah, cats, blah, blah, blah, blah, dogs

[Diagram labels: translate; target system; {doc1, doc2}; filter; {doc1}]
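The cats-and-dogs example can be traced with the sketch below. The target system is assumed to support only AND, and "near" is taken to mean "within three words", which is an assumed threshold since the slide does not define it. The two titles are the ones shown above; all function names are hypothetical.

```python
# Sketch only: MAX_GAP and all function names are illustrative assumptions.
MAX_GAP = 3  # assumed meaning of "near": at most this many word positions apart

docs = {
    "doc1": "blah, blah, cats and dogs blah, blah",
    "doc2": "blah, cats, blah, blah, blah, blah, dogs",
}


def words(title):
    return title.lower().replace(",", " ").split()


def target_search(terms):
    """Q': the target system only understands AND over title words."""
    return {d for d, title in docs.items() if all(t in words(title) for t in terms)}


def near(title, a, b):
    """Post-filter: is some occurrence of a within MAX_GAP words of b?"""
    w = words(title)
    pos_a = [i for i, x in enumerate(w) if x == a]
    pos_b = [i for i, x in enumerate(w) if x == b]
    return any(abs(i - j) <= MAX_GAP for i in pos_a for j in pos_b)


candidates = target_search(["cats", "dogs"])                       # translate
answer = {d for d in candidates if near(docs[d], "cats", "dogs")}  # filter
print(sorted(candidates), sorted(answer))
```

The translated query is deliberately broader than the original, so the filter step is required for correctness rather than being an optimization.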


Another Query Translation Example

Q: Find [grade > 8] AND [name =“elton john”]

Q’: Find [score = A] AND [last-name = “john”] AND [first-name = “elton”]

targetsystem

translate
• basic rules
• translation algorithm
• error estimation


Filtering Challenges

• Too much information

• Not controlled


Current Filtering

textual similarity


Page Rank Filtering

textual similarity

page rank (Google)


Initial Page Rank

[Diagram: link graph with pages labeled 4 and 1]


Recursive Page Rank

[Diagram: link graph with pages labeled 4, 1, 6, 1, 2, 2; annotation: 1+2+1+2 = 6]
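The recursive idea, where a page's rank is built from the ranks of the pages linking to it, can be sketched with the widely used damped power iteration. This is a generic formulation and not the exact computation drawn on the slide; the damping value, the iteration count, the three-page graph, and the function name are all assumptions for illustration.

```python
# Sketch only: a generic damped power iteration; graph and parameters are
# illustrative assumptions, not values from the talk.
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal shares to its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no out-links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


links = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, page c collects rank from both a and b, so it ends up highest, and b, which nobody links to, ends up lowest.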


Value Filtering

textual similarity

page rank

geography

context

opinions

access


Mobile Access Challenges

• Limited Screen Size

• Limited Bandwidth

• Disconnected Operation

• Limited Power


Power Browsing

Techniques
• Show only text headers
• Show URLs, anchors, titles
• Order URLs by page rank
• Summarize text
• Summarize set of pages
• Low-resolution pictures
• Site search, word completion
• ...


PowerBrowser - Text View


PowerBrowser - History


Economic Challenges

• Piracy

• Payment

• Heterogeneity

• Security/Privacy


Piracy on the Internet


Approaches

• Copy Prevention
  – isolation
  – cryptography
  – secure viewer
• Copy Detection
  – watermarking
  – content based


Copy Detection

• Content– text

– audio

– video

• Challenges– crawling the web, mailing lists,...

– large scale comparison

– false negatives, positives

– different formats, sampling rates, frame rates,...

– adversary tries to fool system


Example: Text Copy Detection

[Diagram: two pipelines around a database (hash table) of chunk signatures]
• get document, break into chunks, compute signature, store in database
• get document, break into chunks, compute signature, probe database, above threshold?
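The two pipelines above can be sketched as follows. The chunking rule (overlapping three-word chunks), the SHA-1 signature, and the 0.5 threshold are assumptions chosen for illustration; the slides leave all three open. The class and function names are hypothetical.

```python
import hashlib

# Sketch only: chunk size, hash function, threshold, and all names are
# illustrative assumptions; the talk leaves these choices open.


def chunks(text, size=3):
    """Break a document into overlapping chunks of `size` words."""
    words = text.lower().split()
    return [
        " ".join(words[i:i + size])
        for i in range(max(len(words) - size + 1, 1))
    ]


def signature(chunk):
    return hashlib.sha1(chunk.encode("utf-8")).hexdigest()


class CopyDetector:
    def __init__(self, threshold=0.5):
        self.table = {}  # signature -> ids of registered documents
        self.threshold = threshold

    def register(self, doc_id, text):
        """Registration pipeline: chunk, sign, store in the hash table."""
        for chunk in chunks(text):
            self.table.setdefault(signature(chunk), set()).add(doc_id)

    def probe(self, text):
        """Detection pipeline: chunk, sign, probe, compare to threshold."""
        sigs = {signature(chunk) for chunk in chunks(text)}
        hits = {}
        for sig in sigs:
            for doc_id in self.table.get(sig, ()):
                hits[doc_id] = hits.get(doc_id, 0) + 1
        return sorted(
            d for d, n in hits.items() if n / len(sigs) >= self.threshold
        )


detector = CopyDetector()
detector.register("orig", "the quick brown fox jumps over the lazy dog")
print(detector.probe("a quick brown fox jumps over the lazy cat"))
print(detector.probe("completely unrelated text about digital libraries"))
```

Smaller chunks catch more partial copies but raise false positives, which is one face of the "What are chunks?" question.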


Text Detection Issues

• What are chunks?
• What is threshold?
• How to foil adversary?
• How to compare hypertext documents?


Information Preservation Challenges

• Preserving the Bits
  – Evolving hardware
  – Evolving software
  – Evolving organizations

• Preserving the Meaning


Archiving the Web

[Diagram components: server (documents); web server (web pages); Stanford archival repository; web users]


InfoMonitor History View


InfoMonitor Snapshot View


Archival Repository

• Object Identifier Signature

• No Deletions (never ever!)

[Diagram labels: handle; set; set; new version?]
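One reading of the two bullets above is a store that names each object by a signature of its content and offers no delete operation. The sketch below follows that reading. The use of SHA-256, the handle-to-versions mapping, and every name are assumptions, not the design of the Stanford archival repository.

```python
import hashlib

# Sketch only: SHA-256 as the signature, the handle -> versions mapping, and
# all names are illustrative assumptions.


class ArchivalRepository:
    def __init__(self):
        self._objects = {}  # signature -> content; entries are never removed
        self._handles = {}  # handle -> signatures of its versions, oldest first

    def put(self, handle, content):
        """Store content under its signature; record a new version if changed."""
        oid = hashlib.sha256(content).hexdigest()
        self._objects.setdefault(oid, content)
        versions = self._handles.setdefault(handle, [])
        if not versions or versions[-1] != oid:
            versions.append(oid)
        return oid

    def get(self, oid):
        return self._objects[oid]

    def versions(self, handle):
        return list(self._handles.get(handle, []))

    # Deliberately no delete method: old versions stay retrievable.


repo = ArchivalRepository()
first = repo.put("tr-1", b"draft")
repo.put("tr-1", b"draft")           # same content, no new version
second = repo.put("tr-1", b"final")  # changed content, new version
print(repo.versions("tr-1") == [first, second], repo.get(first))
```

Because the identifier is derived from the content, a retrieved object can be checked against its own name to detect corruption.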