Gerhard Weikum, Max Planck Institute for Informatics, http://www.mpi-inf.mpg.de/~weikum/
Knowledge Harvesting from Text and Web Sources. Part 2: Search and Ranking of Knowledge


Page 1: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Gerhard Weikum, Max Planck Institute for Informatics
http://www.mpi-inf.mpg.de/~weikum/

Knowledge Harvesting from Text and Web Sources
Part 2: Search and Ranking of Knowledge

Page 2: Gerhard  Weikum Max Planck Institute  for Informatics weikum

The Web Speaks to Us

• Web 2012 contains more DB-style data than ever
• getting better at making structured content explicit: entities, classes (types), relationships
• but no (hope for) schema here!

Source: DB & IR methods for knowledge discovery. Communications of the ACM 52(4), 2009

2-2

Page 3: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Structure Now!

[Figure: Web pages annotated with embedded entities: Bob Dylan, Carla Bruni, Nicolas Sarkozy, Joan Baez, France, Champs Elysees, Grammy, Paris]

Actor              Award
Christoph Waltz    Oscar
Sandra Bullock     Oscar
Sandra Bullock     Golden Raspberry
…

Movie              ReportedRevenue
Avatar             $ 2,718,444,933
The Reader         $ 108,709,522
…

Company            CEO
Google             Eric Schmidt
…

Yahoo              Overture
Facebook           FriendFeed
Software AG        IDS Scheer
…

Knowledge bases with facts from Web and IE witnesses

IE-enriched Web pages with embedded entities and facts

2-3

Page 4: Gerhard  Weikum Max Planck Institute  for Informatics weikum

http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png

Distributed Structure: Linking Open Data
30 billion triples, 500 million links

2-4

Page 5: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Distributed Structure: Linking Open Data

[Figure: excerpt of the Linked Open Data graph around dbpedia.org/resource/Ennio_Morricone and dbpedia.org/resource/Rome. Node labels: rdf.freebase.com/ns/en.rome, data.nytimes.com/51688803696189142301, geonames.org/3169070/roma (Coord: N 41° 54' 10'' E 12° 29' 2''), yago/wordnet:Actor109765278, yago/wordnet:Artist109812338, yago/wikicategory:ItalianComposer, imdb.com/name/nm0910607/, imdb.com/title/tt0361748/. Edge labels: owl:sameAs, dbpprop:citizenOf, rdf:type, rdfs:subclassOf, prop:actedIn, prop:composedMusicFor.]

2-5

Page 6: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Entity Search

http://entitycube.research.microsoft.com/

2-6

Page 7: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Semantic Search with Entities, Classes, Relationships

Instances of classes:
• Politicians who are also scientists?
• European composers who won the Oscar?
• Chinese female astronauts?

Properties of entity:
• US president when Barack Obama was born?
• Nobel laureate who outlived two world wars and all his children?

Relationships:
• Enzymes that inhibit HIV?
• Antidepressants that interfere with blood-pressure drugs?
• German philosophers influenced by William of Ockham?

Multiple entities:
• Commonalities & relationships among: Li Lianjie, Steve Jobs, Carla Bruni, Plato, Andres Iniesta?

Applications:
• FIFA 2010 finalists who played in a Champions League final?
• German football clubs that won against Real Madrid?

2-7

Page 8: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Quiz Time

2-8

Rank the Following Numbers

• # ants on the Earth
• # RDF triples in LOD
• # Wikipedia links
• # followers of Lady Gaga
• # Google queries per day
• # Yuan by Mark Zuckerberg

Values: 12·10^6, 3·10^9, 80·10^9, 10^15, 25·10^9, 600·10^6

Ranks: 1, 2, 4, 6, 5, 3

2-8

Page 9: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Outline

...

Searching for Entities & Relations

Efficient Query Processing

Motivation

Wrap-up

Informative Ranking

User Interface

2-9

Page 10: Gerhard  Weikum Max Planck Institute  for Informatics weikum

RDF: Structure, Diversity, No Schema

• SPO triples: (Subject, Property/Predicate, Object/Value)
• pay-as-you-go: schema-agnostic or schema later
• RDF triples form a fine-grained ER graph
• popular for Linked Data and computational biology (UniProt, KEGG, etc.)
• open-source engines: Jena, Sesame, RDF-3X, etc.

Graph edges: EnnioMorricone --bornIn--> Rome, Rome --locatedIn--> Italy, Rome --instanceOf--> City

SPO triples (statements, facts):
(EnnioMorricone, bornIn, Rome)
(Rome, locatedIn, Italy)
(JavierNavarrete, birthPlace, Teruel)
(Teruel, locatedIn, Spain)
(EnnioMorricone, composed, l'Arena)
(JavierNavarrete, composerOf, aTale)

With URIs as identifiers:
(uri1, hasName, EnnioMorricone)
(uri1, bornIn, uri2)
(uri2, hasName, Rome)
(uri2, locatedIn, uri3)
…

Predicate notation: bornIn (EnnioMorricone, Rome), locatedIn (Rome, Italy)

Page 11: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Facts about Facts

• temporal annotations, witnesses/sources, confidence, etc. can refer to reified facts via fact identifiers (approx. equiv. to RDF quadruples: Col Sub Prop Obj)

facts:
1: (EnnioMorricone, composed, l'Arena)
2: (JavierNavarrete, composerOf, aTale)
3: (Berlin, capitalOf, Germany)
4: (Madonna, marriedTo, GuyRitchie)
5: (NicolasSarkozy, marriedTo, CarlaBruni)

temporal facts:
6: (1, inYear, 1968)
7: (2, inYear, 2006)
8: (3, validFrom, 1990)
9: (4, validFrom, 22-Dec-2000)
10: (4, validUntil, Nov-2008)
11: (5, validFrom, 2-Feb-2008)

provenance:
12: (1, witness, http://www.last.fm/music/Ennio+Morricone/)
13: (1, confidence, 0.9)
14: (4, witness, http://en.wikipedia.org/wiki/Guy_Ritchie)
15: (4, witness, http://en.wikipedia.org/wiki/Madonna_(entertainer))
16: (10, witness, http://www.intouchweekly.com/2007/12/post_1.php)
17: (10, confidence, 0.1)
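
The reification scheme on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Fact` class and the `meta_facts` helper are hypothetical names, and treating an integer subject as "refers to the fact with this id" is a simplification chosen for the sketch, not something the slides prescribe.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Fact:
    fid: int                        # fact identifier
    s: Union[str, int]              # entity name, or id of a reified fact (assumption)
    p: str                          # property / predicate
    o: Union[str, int, float]       # object / value

# A few facts copied from the slide (ids as on the slide).
FACTS = [
    Fact(4, "Madonna", "marriedTo", "GuyRitchie"),
    Fact(9, 4, "validFrom", "22-Dec-2000"),
    Fact(10, 4, "validUntil", "Nov-2008"),
    Fact(14, 4, "witness", "http://en.wikipedia.org/wiki/Guy_Ritchie"),
    Fact(17, 10, "confidence", 0.1),
]

def meta_facts(fid, facts):
    """Return all facts whose subject is the reified fact `fid`."""
    return [f for f in facts if f.s == fid]

# Temporal scope and provenance of fact 4 (Madonna marriedTo GuyRitchie).
for f in meta_facts(4, FACTS):
    print(f.fid, f.p, f.o)
```

Meta-facts can themselves be reified: fact 17 attaches a confidence to fact 10, which in turn annotates fact 4.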


Page 13: Gerhard  Weikum Max Planck Institute  for Informatics weikum

SPARQL Query Language

SPJ combinations of triple patterns (triples with S, P, O replaced by variable(s)):

Select ?p, ?c Where {
  ?p instanceOf Composer .
  ?p bornIn ?t .
  ?t inCountry ?c .
  ?c locatedIn Europe .
  ?p hasWon ?a .
  ?a Name AcademyAward . }

+ filter predicates, duplicate handling, RDFS types, etc.:

Select Distinct ?c Where {
  ?p instanceOf Composer .
  ?p bornIn ?t .
  ?t inCountry ?c .
  ?c locatedIn Europe .
  ?p hasWon ?a .
  ?a Name ?n .
  ?p bornOn ?b .
  Filter (?b > 1945) .
  Filter (regex(?n, "Academy")) . }

Semantics: return all bindings to variables that match all triple patterns (subgraphs in RDF graph that are isomorphic to query graph)
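
The "all bindings that match all triple patterns" semantics can be illustrated with a naive nested-loop evaluator. This is a minimal sketch, not how a SPARQL engine is implemented: `match_pattern`, `evaluate`, and the small `TRIPLES` list are hypothetical and only reuse entity names from the earlier slides.

```python
def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`; return None on conflict."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Variable: bind it, or check consistency with an earlier binding.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            # Constant: must be identical.
            return None
    return new

def evaluate(patterns, triples):
    """Return all variable bindings that satisfy every triple pattern."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            b2
            for b in bindings
            for t in triples
            if (b2 := match_pattern(pattern, t, b)) is not None
        ]
    return bindings

TRIPLES = [
    ("EnnioMorricone", "instanceOf", "Composer"),
    ("EnnioMorricone", "bornIn", "Rome"),
    ("Rome", "inCountry", "Italy"),
    ("Italy", "locatedIn", "Europe"),
    ("JavierNavarrete", "instanceOf", "Composer"),
    ("JavierNavarrete", "bornIn", "Teruel"),
    ("Teruel", "inCountry", "Spain"),   # no (Spain, locatedIn, Europe) triple here
]

QUERY = [
    ("?p", "instanceOf", "Composer"),
    ("?p", "bornIn", "?t"),
    ("?t", "inCountry", "?c"),
    ("?c", "locatedIn", "Europe"),
]

results = evaluate(QUERY, TRIPLES)
print([(b["?p"], b["?c"]) for b in results])
```

Because the toy data lacks a triple placing Spain in Europe, only the Morricone binding survives, which shows how one missing fact removes an answer under strict matching.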

Page 14: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Querying the Structured Web

Structure but no schema: SPARQL well suited for flexible subgraph matching

wildcards for properties (relaxed joins):
Select ?p, ?c Where {
  ?p instanceOf Composer .
  ?p ?r1 ?t .
  ?t ?r2 ?c .
  ?c isa Country .
  ?c locatedIn Europe . }

Extension: transitive paths [K. Anyanwu et al.: WWW'07]
Select ?p, ?c Where {
  ?p instanceOf Composer .
  ?p ??r ?c .
  ?c isa Country .
  ?c locatedIn Europe .
  PathFilter (cost(??r) < 5) .
  PathFilter (containsAny(??r, ?t)) .
  ?t isa City . }

Extension: regular expressions [G. Kasneci et al.: ICDE'08]
Select ?p, ?c Where {
  ?p instanceOf Composer .
  ?p (bornIn | livesIn | citizenOf) locatedIn* Europe . }

Page 15: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Querying Facts & Text

Problem: not everything is triplified

• Consider witnesses/sources (provenance meta-facts)
• Allow text predicates with each triple pattern (à la XQ-FT)

European composers who have won the Oscar, whose music appeared in dramatic western scenes, and who also wrote classical pieces?

Select ?p Where {
  ?p instanceOf Composer .
  ?p bornIn ?t .
  ?t inCountry ?c .
  ?c locatedIn Europe .
  ?p hasWon ?a .
  ?a Name AcademyAward .
  ?p contributedTo ?movie [western, gunfight, duel, sunset] .
  ?p composed ?music [classical, orchestra, cantata, opera] . }

Semantics: triples match struct. pred., witnesses match text pred.

Page 16: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Querying Facts & Text

Problem: not everything is triplified

• Consider witnesses/sources (provenance meta-facts)
• Allow text predicates with each triple pattern (à la XQ-FT)

French politicians married to Italian singers?
Select ?p1, ?p2 Where {
  ?p1 instanceOf politician [France] .
  ?p2 instanceOf singer [Italy] .
  ?p1 marriedTo ?p2 . }

Select ?p1, ?p2 Where {
  ?p1 instanceOf ?c1 [France, politics] .
  ?p2 instanceOf ?c2 [Italy, singer] .
  ?p1 marriedTo ?p2 . }

Grouping of keywords or phrases boosts expressiveness

CS researchers whose advisors worked on the Manhattan project?
Select ?r, ?a Where {
  ?r instOf researcher [computer science] .
  ?a workedOn ?x [Manhattan project] .
  ?r hasAdvisor ?a . }

Select ?r, ?a Where {
  ?r ?p1 ?o1 [computer science] .
  ?a ?p2 ?o2 [Manhattan project] .
  ?r ?p3 ?a . }

Page 17: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Relatedness Queries

Schema-agnostic keyword search (on RDF, ER graph, relational DB) becomes a special case

Relationship between Jet Li, Steve Jobs, Carla Bruni?
Select ??p1, ??p2, ??p3 Where {
  LiLianjie ??p1 SteveJobs .
  SteveJobs ??p2 CarlaBruni .
  CarlaBruni ??p3 LiLianjie . }

Select ??p1, ??p2, ??p3 Where {
  ?e1 ?r1 ?c1 ["Li Lianjie"] .
  ?e2 ?r2 ?c2 ["Steve Jobs"] .
  ?e3 ?r3 ?c3 ["Carla Bruni"] .
  ?e1 ??p1 ?e2 .
  ?e2 ??p2 ?e3 .
  ?e3 ??p3 ?e1 . }

Page 18: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Querying Temporal Facts

Problem: not all facts hold forever (e.g. CEOs, spouses, …)

• Consider temporal scopes of reified facts
• Extend SPARQL with temporal predicates

When did a German soccer club win the Champions League?
Select ?c, ?t Where {
  ?c isa soccerClub . ?c inCountry Germany .
  ?id1: ?c hasWon ChampionsLeague . ?id1 validOn ?t . }

Managers of German clubs who won the Champions League?
Select ?m Where {
  ?c isa soccerClub . ?c inCountry Germany .
  ?id1: ?c hasWon ChampionsLeague . ?id1 validOn ?t .
  ?id2: ?m manages ?c . ?id2 validSince ?s . ?id2 validUntil ?u .
  [?s,?u] overlaps [?t,?t] . }

Example facts:
1: (BayernMunich, hasWon, ChampionsLeague)
2: (BorussiaDortmund, hasWon, ChampionsLeague)
3: (1, validOn, 23May2001)
4: (1, validOn, 15May1974)
5: (2, validOn, 28May1997)
6: (OttmarHitzfeld, manages, BayernMunich)
7: (6, validSince, 1Jul1998)
8: (6, validUntil, 30Jun2004)

[Y. Wang et al.: EDBT'10] [O. Udrea et al.: TOCL'10]
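
The temporal join in the "managers" query can be sketched directly over the example facts. The flattened tuples, `overlaps`, and `managers_of_winners` are hypothetical simplifications for illustration; a real engine would evaluate this over reified triples rather than pre-joined Python tuples.

```python
from datetime import date

# (club, date of win): facts 1-5 from the slide, flattened for brevity.
WINS = [
    ("BayernMunich", date(2001, 5, 23)),
    ("BayernMunich", date(1974, 5, 15)),
    ("BorussiaDortmund", date(1997, 5, 28)),
]

# (manager, club, validSince, validUntil): facts 6-8 from the slide.
MANAGES = [
    ("OttmarHitzfeld", "BayernMunich", date(1998, 7, 1), date(2004, 6, 30)),
]

def overlaps(s, u, b, e):
    """True if the closed intervals [s, u] and [b, e] share at least one day."""
    return s <= e and b <= u

def managers_of_winners(wins, manages):
    """Managers whose tenure [s, u] overlaps the point interval [t, t] of a win."""
    return sorted(
        (m, c, t)
        for (c, t) in wins
        for (m, c2, s, u) in manages
        if c == c2 and overlaps(s, u, t, t)
    )

print(managers_of_winners(WINS, MANAGES))
```

Only the 2001 win falls inside the 1998 to 2004 tenure, so the 1974 win does not produce a result for this manager.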

Page 19: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Querying with Vague Temporal Scope

Problem: user's temporal interest is often imprecise

[Figure: timeline from 1998 to 2010]

• Consider temporal phrases as text conditions
• Allow approximate matching and rank results wisely

German Champions League winners in the nineties?
Select ?c Where {
  ?c isa soccerClub . ?c inCountry Germany .
  ?c hasWon ChampionsLeague [nineties] . }

Soccer final winners in summer 2001?
Select ?c Where {
  ?c isa soccerClub .
  ?id: ?c matchAgainst ?o [final] .
  ?id winner ?c ["summer 2001"] . }

Page 20: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Keyword Search on Graphs

Schema-agnostic keyword search over multiple tables: graph of tuples with foreign-key relationships as edges
[BANKS, Discover, DBExplorer, KUPS, SphereSearch, BLINKS, NAGA, …]

Example:
Conferences (CId, Title, Location, Year)
Journals (JId, Title)
CPublications (PId, Title, CId)
JPublications (PId, Title, Vol, No, Year)
Authors (PId, Person)
Editors (CId, Person)

Select * From * Where * Contains "Gray, DeWitt, XML, Performance" And Year > 95

Result is connected tree with nodes that contain as many query keywords as possible

Ranking:

$$s(tree, q) = \alpha \cdot \sum_{nodes\ n} nodeScore(n, q) + (1-\alpha) \cdot \left(1 + \sum_{edges\ e} edgeScore(e)\right)^{-1}$$

with nodeScore based on tf*idf or prob. IR and edgeScore reflecting importance of relationships (or confidence, authority, etc.)

Top-k querying: compute best trees, e.g. Steiner trees (NP-hard)

2-20

Page 21: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Best Result: Group Steiner Tree

Result is connected tree with nodes that contain as many query keywords as possible

Group Steiner tree:
• match individual keywords to terminal nodes, grouped by keyword
• compute tree that connects at least one terminal node per keyword and has best total edge weight

[Figure: graph with terminal nodes labeled w, x, y, z and the resulting tree for query: x w y z]

2-21

Page 22: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Outline

...

Searching for Entities & Relations

Efficient Query Processing

Motivation

Wrap-up

Informative Ranking

User Interface

2-22

Page 23: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Ranking Criteria

Confidence: prefer results that are likely correct
• accuracy of info extraction
• trust in sources (authenticity, authority)
Example:
bornIn (Jim Gray, San Francisco) from "Jim Gray was born in San Francisco" (en.wikipedia.org)
livesIn (Michael Jackson, Tibet) from "Fans believe Jacko hides in Tibet" (www.michaeljacksonsightings.com)

Informativeness: prefer results with salient facts
Statistical estimation from:
• frequency in answer
• frequency on Web
• frequency in query log
Example:
q: Einstein isa ? with answers: Einstein isa scientist, Einstein isa vegetarian
q: ?x isa vegetarian with answers: Einstein isa vegetarian, Whocares isa vegetarian

Diversity: prefer variety of facts
Example: E won … E discovered … E played … versus E won … E won … E won … E won …

Conciseness: prefer results that are tightly connected
• size of answer graph
• cost of Steiner tree
Example: Einstein won NobelPrize, Bohr won NobelPrize; Einstein isa vegetarian, Cruise isa vegetarian; Cruise born 1962, Bohr died 1962

Page 24: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Ranking Approaches

Confidence: prefer results that are likely correct
• accuracy of info extraction
• trust in sources (authenticity, authority)
Approach: empirical accuracy of IE and PR/HITS-style estimate of trust, combined into:
max { accuracy(f,s) * trust(s) | s ∈ witnesses(f) }

Informativeness: prefer results with salient facts
Statistical LM with estimations from:
• frequency in answer
• frequency in corpus (e.g. Web)
• frequency in query log
Approach: PR/HITS-style entity/fact ranking [V. Hristidis et al., S. Chakrabarti, …] or IR models: tf*idf … [K. Chang et al., …], Statistical Language Models

Diversity: prefer variety of facts
Approach: Statistical Language Models

Conciseness: prefer results that are tightly connected
• size of answer graph
• cost of Steiner tree
Approach: graph algorithms (BANKS, STAR, …) [J.X. Yu et al., S. Chakrabarti et al., B. Kimelfeld et al., G. Kasneci et al., …]
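
The confidence combination max { accuracy(f,s) * trust(s) | s ∈ witnesses(f) } is simple enough to write out. The sketch below is illustrative only: `fact_confidence` is a hypothetical helper, and all accuracy and trust numbers (as well as the second source name) are made up for the example, not taken from the slides.

```python
def fact_confidence(fact, witnesses, accuracy, trust):
    """Best (extraction accuracy x source trust) over all witnesses of a fact."""
    scores = [accuracy[(fact, s)] * trust[s] for s in witnesses.get(fact, [])]
    return max(scores, default=0.0)

FACT = ("JimGray", "bornIn", "SanFrancisco")

# Hypothetical witnesses and scores, chosen only to exercise the formula.
WITNESSES = {FACT: ["en.wikipedia.org", "some-fan-site.example"]}
ACCURACY = {
    (FACT, "en.wikipedia.org"): 0.9,
    (FACT, "some-fan-site.example"): 0.5,
}
TRUST = {"en.wikipedia.org": 0.8, "some-fan-site.example": 0.2}

print(fact_confidence(FACT, WITNESSES, ACCURACY, TRUST))
```

Taking the maximum means that one accurate extraction from a trusted source is enough; additional weak witnesses neither raise nor lower the score.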

Page 25: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Example: RDF-Express [S. Elbassuoni: SIGIR'12, ESWC'11, CIKM'09]

?a1 isMarriedTo ?a2 .
?a1 actedIn ?m .
?a2 actedIn ?m .

http://www.mpi-inf.mpg.de/yago-naga/rdf-express/

Page 26: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Example: RDF-Express

?a1 isMarriedTo ?a2 {Bollywood} .
?a1 actedIn ?m .
?a2 actedIn ?m .

Page 27: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Example: RDF-Express

?a1 isMarriedTo ?a2 .
?a1 actedIn ?m {thriller} .
?a2 actedIn ?m {love} .

Page 28: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Example: RDF-Express [S. Elbassuoni: SIGIR'12, ESWC'11, CIKM'09]

?a1 directed ?m .
?a2 actedIn ?m .
?a1 hasWonPrize Academy_Award .
?a2 type wordnet_musician .

Page 29: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Information Retrieval Basics
[Textbooks by R. Baeza-Yates, B. Croft, Manning/Raghavan/Schuetze, …]

Pipeline: crawl → extract & clean → index → search → rank → present

• crawl: strategies for crawl schedule and priority queue for crawl frontier
• extract & clean: handle dynamic pages, detect duplicates, detect spam
• index: build and analyze Web graph, index all tokens or word stems
• search: fast top-k queries, query logging, auto-completion
• rank: scoring function over many data and context criteria
• present: GUI, user guidance, personalization

Infrastructure: server farm with 10 000's of computers, distributed/replicated data in high-performance file system, massive parallelism for query processing

Page 30: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Vector Space Model for Content Relevance Ranking

Documents are feature vectors (bags of words): $d_i \in [0,1]^{|F|}$

Query (set of weighted features): $q \in [0,1]^{|F|}$

The search engine returns a ranking by descending relevance.

Similarity metric:

$$sim(d_i, q) := \frac{\sum_{j=1}^{|F|} d_{ij}\, q_j}{\sqrt{\sum_{j=1}^{|F|} d_{ij}^2}\ \sqrt{\sum_{j=1}^{|F|} q_j^2}}$$

2-30

Page 31: Gerhard  Weikum Max Planck Institute  for Informatics weikum

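Vector Space Model for Content Relevance Ranking

Documents are feature vectors (bags of words): $d_i \in [0,1]^{|F|}$

Query (set of weighted features): $q \in [0,1]^{|F|}$

The search engine returns a ranking by descending relevance.

Similarity metric:

$$sim(d_i, q) := \frac{\sum_{j=1}^{|F|} d_{ij}\, q_j}{\sqrt{\sum_{j=1}^{|F|} d_{ij}^2}\ \sqrt{\sum_{j=1}^{|F|} q_j^2}}$$

Feature weights, normalized to unit length:

$$d_{ij} := \frac{w_{ij}}{\sqrt{\sum_k w_{ik}^2}}$$

$$w_{ij} := \log\left(1 + \frac{freq(f_j, d_i)}{\max_k freq(f_k, d_i)}\right) \cdot \log\frac{\#docs}{\#docs\ with\ f_j}$$

tf*idf formula: term frequency * inverse document frequency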
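
A minimal sketch of tf*idf weighting and cosine ranking follows, assuming the weight definition log(1 + freq / max freq) · log(#docs / #docs with feature). The three toy documents, the query, and the helper names `tfidf_vector` and `cosine` are hypothetical and exist only to make the example runnable.

```python
import math
from collections import Counter

def tfidf_vector(tokens, doc_freq, num_docs):
    """Unit-length tf*idf vector for a bag of words (empty if all weights are 0)."""
    counts = Counter(tokens)
    max_freq = max(counts.values())
    weights = {
        t: math.log(1 + c / max_freq) * math.log(num_docs / doc_freq[t])
        for t, c in counts.items()
        if doc_freq.get(t, 0) > 0
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm > 0 else {}

def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (plain dot product)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Hypothetical toy corpus.
DOCS = {
    "d1": "rdf triple store index".split(),
    "d2": "language model smoothing".split(),
    "d3": "rdf query language".split(),
}

doc_freq = Counter(t for tokens in DOCS.values() for t in set(tokens))
vectors = {d: tfidf_vector(toks, doc_freq, len(DOCS)) for d, toks in DOCS.items()}
query_vec = tfidf_vector(["rdf", "index"], doc_freq, len(DOCS))

# Rank documents by descending similarity to the query.
ranking = sorted(DOCS, key=lambda d: cosine(vectors[d], query_vec), reverse=True)
print(ranking)
```

Since both vectors are normalized up front, the similarity reduces to a dot product, which is what makes precomputed per-term scores in an inverted index practical.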

2-31

Page 32: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Statistical Language Models (LM‘s)

[Figure: documents d1 and d2 with language models LM(θ1) and LM(θ2); which of the models generated query q?]

• each doc d_i has LM: generative prob. distr. with params θ_i
• query q viewed as sample from LM(θ1), LM(θ2), …
• estimate likelihood P[ q | LM(θ_i) ] that q is sample of LM of doc d_i (q is "generated by" d_i)
• rank by descending likelihoods (best "explanation" of q)

[Maron/Kuhns 1960, Ponte/Croft 1998, Hiemstra 1998, Lafferty/Zhai 2001]

„God does not play dice“ (Albert Einstein)

„IR does“ (anonymous)

„God rolls dice in places where you can‘t see them“ (Stephen Hawking)

Page 33: Gerhard  Weikum Max Planck Institute  for Informatics weikum

4-33

LM: Doc as Model, Query as Sample

[Figure: model M generates document d, a bag of tokens (A A C A D E E E E C C B A E B); d is a sample of M and is used for parameter estimation. Estimate the likelihood of observing the query: P[ A A B C E E | M ]]

$$P[q \mid d] = \binom{|q|}{f(j_1)\ f(j_2)\ \dots\ f(j_{|q|})} \prod_{j \in q} p_j(d)^{f_j(q)}$$

multinomial prob. distribution
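
The multinomial query likelihood can be computed for the slide's toy document with unsmoothed maximum-likelihood estimates p_j(d) = freq(j, d) / |d|. This is a sketch under that assumption; `multinomial_likelihood` is a hypothetical helper name.

```python
from collections import Counter
from math import factorial, prod

def multinomial_likelihood(query, doc):
    """P[q | d] under a multinomial LM with MLE parameters p_j = freq(j, d) / |d|."""
    q_counts = Counter(query)
    d_counts = Counter(doc)
    # Multinomial coefficient |q|! / (f_1! * f_2! * ...)
    coeff = factorial(len(query))
    for f in q_counts.values():
        coeff //= factorial(f)
    return coeff * prod((d_counts[t] / len(doc)) ** f for t, f in q_counts.items())

DOC = list("AACADEEEECCBAEB")   # document tokens from the slide

print(multinomial_likelihood(list("AABCEE"), DOC))   # query from this slide
print(multinomial_likelihood(list("ABCEF"), DOC))    # F never occurs in d
```

The second query contains a token that never occurs in the document, so the unsmoothed estimate assigns it probability zero; this is the overfitting problem that smoothing addresses.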

Page 34: Gerhard  Weikum Max Planck Institute  for Informatics weikum

LM: Need for Smoothing

[Figure: model M estimated from document d (A A C A D E E E E C C B A E B), used for parameter estimation, plus a background corpus (C A D A B E F) and/or smoothing. Estimate the likelihood of observing the query: P[ A B C E F | M ]]

Smoothing methods: Laplace smoothing, Jelinek-Mercer, Dirichlet smoothing, …

2-34

Page 35: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Some LM Basics

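Query likelihood with independence assumption:

$$s(d,q) = P[q \mid d] = \prod_{i \in q} P[i \mid d] \;\sim\; \sum_{i \in q} \log \frac{tf(i,d)}{\sum_k tf(k,d)}$$

simple MLE: overfitting

Mixture model for smoothing (P[q] est. from log or corpus):

$$s(d,q) = \lambda\, P[q \mid d] + (1-\lambda)\, P[q]$$

$$\sim \sum_{i \in q} \log\left(\lambda\, \frac{tf(i,d)}{\sum_k tf(k,d)} + (1-\lambda)\, \frac{df(i)}{\sum_k df(k)}\right)$$

$$\sim \sum_{i \in q} \log\left(1 + \frac{\lambda}{1-\lambda} \cdot \frac{tf(i,d)}{\sum_k tf(k,d)} \cdot \frac{\sum_k df(k)}{df(i)}\right)$$

tf*idf family

KL divergence (Kullback-Leibler div.), aka relative entropy:

$$s(d,q) \sim -KL(q \mid d), \qquad KL(q \mid d) = \sum_i P[i \mid q] \log \frac{P[i \mid q]}{P[i \mid d]}$$

Efficient implementation:
• Precompute per-keyword scores
• Store in postings of inverted index
• Score aggregation for (top-k) multi-keyword query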

Page 36: Gerhard  Weikum Max Planck Institute  for Informatics weikum

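IR as LM Estimation

P[R|d,q]: user likes doc (R) given that it has features d and user poses query q

prob. IR:

$$P[R \mid d,q] \sim \frac{P[d \mid R,q]}{P[d \mid \neg R,q]}$$

statist. LM:

$$P[R \mid d,q] \sim P[q,d \mid R]\, P[R] = P[q \mid d,R]\, P[d \mid R]\, P[R] \sim P[q \mid d]$$

query likelihood:

$$s(q,d) = \log P[q \mid d] = \sum_{j \in q} \log P[j \mid d]$$

top-k query result:

$$k\text{-}\arg\max_d\ \log P[q \mid d]$$

MLE would be tf

2-36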

Page 37: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Multi-Bernoulli vs. Multinomial LM

Multi-Bernoulli:

$$P[q \mid d] = \prod_j p_j(d)^{X_j(q)}\, (1 - p_j(d))^{1 - X_j(q)}$$

with X_j(q) = 1 if j ∈ q, 0 otherwise

Multinomial:

$$P[q \mid d] = \binom{|q|}{f(j_1)\ f(j_2)\ \dots\ f(j_{|q|})} \prod_{j \in q} p_j(d)^{f_j(q)}$$

with f_j(q) = f(j) = frequency of j in q

multinomial LM more expressive and usually preferred

2-37

Page 38: Gerhard  Weikum Max Planck Institute  for Informatics weikum

LM Scoring by Kullback-Leibler Divergence

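$$\log_2 P[q \mid d] = \log_2 \left[ \binom{|q|}{f(j_1)\ f(j_2)\ \dots\ f(j_{|q|})} \prod_{j \in q} p_j(d)^{f_j(q)} \right]$$

$$\sim \sum_{j \in q} f_j(q)\, \log_2 p_j(d) = -H(f(q), p(d)) \qquad \text{(neg. cross-entropy)}$$

$$\sim -H(f(q), p(d)) + H(f(q)) = -D(f(q)\ \|\ p(d))$$

$$= -\sum_j f_j(q)\, \log_2 \frac{f_j(q)}{p_j(d)} \qquad \text{(neg. KL divergence of q and d)}$$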

2-38

Page 39: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Jelinek-Mercer Smoothing

Idea: use linear combination of doc LM with background LM (corpus, common language); could also consider query log as background LM for query

$$\hat{p}_j(d) = \lambda\, \frac{freq(j,d)}{|d|} + (1-\lambda)\, \frac{freq(j,C)}{|C|}$$

parameter tuning of λ by cross-validation with held-out data:
• divide set of relevant (d,q) pairs into n partitions
• build LM on the pairs from n-1 partitions
• choose λ to maximize precision (or recall or F1) on nth partition
• iterate with different choice of nth partition and average
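
Jelinek-Mercer smoothing is a one-line mixture of the document estimate and the background estimate. The sketch below reuses the toy tokens from the smoothing slide; the way `CORPUS` is assembled (document plus the extra background tokens) and the value λ = 0.5 are assumptions made for illustration, not tuned or prescribed values.

```python
import math

def jm_prob(term, doc, corpus, lam):
    """Jelinek-Mercer estimate: lam * freq(j,d)/|d| + (1 - lam) * freq(j,C)/|C|."""
    return lam * doc.count(term) / len(doc) + (1 - lam) * corpus.count(term) / len(corpus)

def query_log_likelihood(query, doc, corpus, lam):
    """Sum of log smoothed term probabilities (query likelihood score)."""
    return sum(math.log(jm_prob(t, doc, corpus, lam)) for t in query)

DOC = list("AACADEEEECCBAEB")        # document tokens from the slide
CORPUS = DOC + list("CADABEF")       # assumed background corpus: doc + extra tokens
LAM = 0.5                            # assumed mixing weight

# F does not occur in DOC, but now receives a small non-zero probability.
print(jm_prob("F", DOC, CORPUS, LAM))
print(query_log_likelihood(list("ABCEF"), DOC, CORPUS, LAM))
```

With smoothing, the query containing the unseen token F gets a finite log-likelihood instead of minus infinity, so the document can still be ranked.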

2-39

Page 40: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Jelinek-Mercer Smoothing:Relationship to tf*idf

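$$P[q \mid \theta] = \lambda\, P[q \mid d] + (1-\lambda)\, P[q]$$

$$\sim \sum_{i \in q} \log\left(\lambda\, \frac{tf(i,d)}{\sum_k tf(k,d)} + (1-\lambda)\, \frac{df(i)}{\sum_k df(k)}\right)$$

$$\sim \sum_{i \in q} \log\left(1 + \frac{\lambda}{1-\lambda} \cdot \frac{tf(i,d)}{\sum_k tf(k,d)} \cdot \frac{\sum_k df(k)}{df(i)}\right)$$

with absolute frequencies tf, df

relative tf ~ relative idf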

2-40

Page 41: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Dirichlet-Prior Smoothing

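MAP (Maximum Posterior) for θ:

$$\hat{p}_j(d) = \arg\max_{\theta_j} M(\theta) = \frac{f_j + \alpha_j - 1}{n + \sum_i \alpha_i - m} = \frac{|d|\, P[j \mid d] + \mu\, P[j \mid C]}{|d| + \mu}$$

with α_i set to μ P[i|C] + 1 for the Dirichlet hypergenerator and μ > 1 set to multiple of average document length

Dirichlet (α):

$$f(\theta_1, \dots, \theta_m) = \frac{\Gamma\left(\sum_{j=1..m} \alpha_j\right)}{\prod_{j=1..m} \Gamma(\alpha_j)} \prod_{j=1..m} \theta_j^{\alpha_j - 1} \qquad \text{with } \sum_{j=1..m} \theta_j = 1$$

(Dirichlet is conjugate prior for parameters of multinomial distribution: Dirichlet prior implies Dirichlet posterior, only with different parameters)

Posterior distr. with Dirichlet distribution as prior:

$$M(\theta) := P[\theta \mid f] = \frac{P[f \mid \theta]\, P[\theta]}{\int P[f \mid \theta]\, P[\theta]\, d\theta}$$

The posterior is Dirichlet (f + α) with term frequencies f in document d.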

2-41

Page 42: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Dirichlet-Prior Smoothing:Relationship to Jelinek-Mercer Smoothing

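$$\hat{p}_j(d) = \frac{|d|}{|d| + \mu}\, P[j \mid d] + \frac{\mu}{|d| + \mu}\, P[j \mid C] = \lambda\, P[j \mid d] + (1-\lambda)\, P[j \mid C] \qquad \text{with } \lambda = \frac{|d|}{|d| + \mu}$$

with MLEs P[j|d], P[j|C] (tf_j from corpus)

where α_1 = μ P[1|C], ..., α_m = μ P[m|C] are the parameters of the underlying Dirichlet distribution, with constant μ > 1 typically set to multiple of average document length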
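
The stated relationship (Dirichlet-prior smoothing equals Jelinek-Mercer smoothing with a document-length-dependent λ = |d| / (|d| + μ)) can be checked numerically. This is a sketch with hypothetical helper names, the toy tokens from the smoothing slide, and an arbitrary μ chosen only for the check.

```python
def dirichlet_prob(term, doc, corpus, mu):
    """Dirichlet-prior estimate: (freq(j,d) + mu * P[j|C]) / (|d| + mu)."""
    p_corpus = corpus.count(term) / len(corpus)
    return (doc.count(term) + mu * p_corpus) / (len(doc) + mu)

def jm_prob(term, doc, corpus, lam):
    """Jelinek-Mercer estimate: lam * P[j|d] + (1 - lam) * P[j|C]."""
    return lam * doc.count(term) / len(doc) + (1 - lam) * corpus.count(term) / len(corpus)

DOC = list("AACADEEEECCBAEB")    # document tokens from the smoothing slide
CORPUS = DOC + list("CADABEF")   # assumed background corpus
MU = 30.0                        # assumed: a multiple of the document length

lam = len(DOC) / (len(DOC) + MU)  # document-specific mixing weight
for term in sorted(set(CORPUS)):
    print(term, dirichlet_prob(term, DOC, CORPUS, MU), jm_prob(term, DOC, CORPUS, lam))
```

The practical difference is that λ now shrinks for short documents, so sparse documents lean more heavily on the background model.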

2-42

Page 43: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Entity Search with LM Ranking [Z. Nie et al.: WWW’07, H. Fang et al.: ECIR‘07, P. Serdyukov et al.: ECIR‘08, …]

query: keywords, answer: entities

LM (entity e) = prob. distr. of words seen in context of e

$$s(e,q) = \lambda\, P[q \mid e] + (1-\lambda)\, P[q] \;\sim\; \prod_i \frac{P[q_i \mid e]}{P[q_i]} \;\sim\; -KL(LM(q) \mid LM(e))$$

query q: "French player who won world championship"

candidate entities:
e1: David Beckham
e2: Ruud van Nistelroy
e3: Ronaldinho
e4: Zinedine Zidane
e5: FC Barcelona

Context of e1 (weighted by conf.):
played for ManU, Real, LA Galaxy
David Beckham champions league
England lost match against France
married to spice girl …

Context of e4 (weighted by conf.):
Zizou champions league 2002
Real Madrid won final ...
Zinedine Zidane best player
France world cup 1998 ...

Page 44: Gerhard  Weikum Max Planck Institute  for Informatics weikum

LM's: from Entities to Facts

Document / Entity LM's:
• LM for doc/entity: prob. distr. of words
• LM for query: (prob. distr. of) words
• LM's: rich for docs/entities, super-sparse for queries
• richer query LM with query expansion, etc.

Triple LM's:
• LM for facts: (degen. prob. distr. of) triple
• LM for queries: (degen. prob. distr. of) triple pattern
• LM's: apples and oranges
• expand query variables by S,P,O values from DB/KB
• enhance with witness statistics
• query LM then is prob. distr. of triples!

Page 45: Gerhard  Weikum Max Planck Institute  for Informatics weikum

LM‘s for Triples and Triple Patterns

triples (facts f) with witness statistics (#witnesses):

f1:  Beckham p ManchesterU    200
f2:  Beckham p RealMadrid     300
f3:  Beckham p LAGalaxy        20
f4:  Beckham p ACMilan         30
f5:  Kaka p ACMilan           300
f6:  Kaka p RealMadrid        150
f7:  Zidane p ASCannes         20
f8:  Zidane p Juventus        200
f9:  Zidane p RealMadrid      350
f10: Tidjani p ASCannes        10
f11: Messi p FCBarcelona      400
f12: Henry p Arsenal          200
f13: Henry p FCBarcelona      150
f14: Ribery p BayernMunich    100
f15: Drogba p Chelsea         150
f16: Casillas p RealMadrid     20
Σ: 2600

triple patterns (queries q) with LM(q) + smoothing:

q: Beckham p ?y
  Beckham p ManU      200/550
  Beckham p Real      300/550
  Beckham p Galaxy     20/550
  Beckham p Milan      30/550

q: ?x p ASCannes
  Zidane p ASCannes    20/30
  Tidjani p ASCannes   10/30

q: ?x p ?y
  Messi p FCBarcelona   400/2600
  Zidane p RealMadrid   350/2600
  Kaka p ACMilan        300/2600
  …

q: Cruyff ?r FCBarcelona
  Cruyff playedFor FCBarca       200/500
  Cruyff playedAgainst FCBarca    50/500
  Cruyff coached FCBarca         250/500

LM(q): { t → P[t | t matches q] ~ #witnesses(t) }
LM(answer f): { t → P[t | t matches f] ~ 1 for f }
smooth all LM's
rank results by ascending KL(LM(q) | LM(f))

[G. Kasneci et al.: ICDE'08; S. Elbassuoni et al.: CIKM'09, ESWC'11]
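
The unsmoothed query LM on this slide is just the witness counts of the matching triples, normalized. The sketch below reproduces the fractions shown above; `matches` and `query_lm` are hypothetical helper names, and smoothing is deliberately left out.

```python
from fractions import Fraction

# Witness counts for facts f1..f16, copied from the slide.
WITNESSES = {
    ("Beckham", "p", "ManchesterU"): 200, ("Beckham", "p", "RealMadrid"): 300,
    ("Beckham", "p", "LAGalaxy"): 20,     ("Beckham", "p", "ACMilan"): 30,
    ("Kaka", "p", "ACMilan"): 300,        ("Kaka", "p", "RealMadrid"): 150,
    ("Zidane", "p", "ASCannes"): 20,      ("Zidane", "p", "Juventus"): 200,
    ("Zidane", "p", "RealMadrid"): 350,   ("Tidjani", "p", "ASCannes"): 10,
    ("Messi", "p", "FCBarcelona"): 400,   ("Henry", "p", "Arsenal"): 200,
    ("Henry", "p", "FCBarcelona"): 150,   ("Ribery", "p", "BayernMunich"): 100,
    ("Drogba", "p", "Chelsea"): 150,      ("Casillas", "p", "RealMadrid"): 20,
}

def matches(pattern, triple):
    """True if every constant in the pattern equals the triple's value.
    Simplification: repeated variables are not checked for consistency."""
    return all(p.startswith("?") or p == t for p, t in zip(pattern, triple))

def query_lm(pattern, witnesses):
    """Unsmoothed LM(q): P[t | t matches q] proportional to #witnesses(t)."""
    matching = {t: n for t, n in witnesses.items() if matches(pattern, t)}
    total = sum(matching.values())
    return {t: Fraction(n, total) for t, n in matching.items()}

lm = query_lm(("Beckham", "p", "?y"), WITNESSES)
for triple, prob in sorted(lm.items(), key=lambda kv: -kv[1]):
    print(triple, prob)
```

Exact fractions are used so the output can be compared with the slide's values without rounding noise.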

Page 46: Gerhard  Weikum Max Planck Institute  for Informatics weikum

LM's for Composite Queries

q: Select ?x, ?c Where { France ml ?x . ?x p ?c . ?c in UK . }

f1:  Beckham p ManU       200
f7:  Zidane p ASCannes     20
f8:  Zidane p Juventus    200
f9:  Zidane p RealMadrid  300
f10: Tidjani p ASCannes    10
f12: Henry p Arsenal      200
f13: Henry p FCBarca      150
f14: Ribery p Bayern      100
f15: Drogba p Chelsea     150

f21: F ml Zidane     200
f22: F ml Tidjani     20
f23: F ml Henry      200
f24: F ml Ribery     200
f25: F ml Drogba      30
f26: IC ml Drogba    100
f27: ALG ml Zidane    50

f31: ManU in UK      200
f32: Arsenal in UK   160
f33: Chelsea in UK   140

queries q with subqueries q1 … qn
results are n-tuples of triples t1 … tn

$$LM(q):\ P[q_1 \dots q_n] = \prod_i P[q_i]$$

$$LM(answer):\ P[t_1 \dots t_n] = \prod_i P[t_i]$$

$$KL(LM(q) \mid LM(answer)) = \sum_i KL(LM(q_i) \mid LM(t_i))$$

$$P[\text{F ml Henry, Henry p Arsenal, Arsenal in UK}] \sim \frac{200}{650} \cdot \frac{200}{2600} \cdot \frac{160}{500}$$

$$P[\text{F ml Drogba, Drogba p Chelsea, Chelsea in UK}] \sim \frac{30}{650} \cdot \frac{150}{2600} \cdot \frac{140}{500}$$
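
The two example products can be evaluated exactly to see how far apart the answers end up. This sketch only multiplies the fractions given on the slide; `answer_probability` is a hypothetical helper name and no smoothing is applied.

```python
from fractions import Fraction
from math import prod

def answer_probability(factors):
    """Product of per-subquery probabilities, each given as (witnesses, total)."""
    return prod(Fraction(n, total) for n, total in factors)

# (witness count, normalizer) per subquery, as on the slide.
henry = answer_probability([(200, 650), (200, 2600), (160, 500)])
drogba = answer_probability([(30, 650), (150, 2600), (140, 500)])

print(henry, float(henry))
print(drogba, float(drogba))
print(float(henry / drogba))
```

The Henry answer scores roughly ten times higher, mostly because "F ml Drogba" has few witnesses compared to "F ml Henry".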

Page 47: Gerhard  Weikum Max Planck Institute  for Informatics weikum

LM's for Keyword-Augmented Queries

q: Select ?x, ?c Where {
  France ml ?x [goalgetter, "top scorer"] .
  ?x p ?c .
  ?c in UK [champion, "cup winner", double] . }

subqueries qi with keywords w1 … wm
results are still n-tuples of triples ti

$$LM(q_i):\ P[\text{triple } t_i \mid w_1 \dots w_m] = \lambda \prod_k P[t_i \mid w_k] + (1-\lambda)\, P[t_i]$$

LM(answer fi) analogous

$$KL(LM(q) \mid LM(answer\ f_i)) = \sum_i KL(LM(q_i) \mid LM(f_i))$$

result ranking prefers (n-tuples of) triples whose witnesses score high on the subquery keywords

Page 48: Gerhard  Weikum Max Planck Institute  for Informatics weikum

LM‘s for Temporal Phrases

q: Select ?c Where { ?c instOf nationalTeam . ?c hasWon WorldCup [nineties] . }

Problem: user's temporal interest is imprecise

[Figure: timeline from 1998 to 2010 with example temporal expressions: July 1994, mid 90s, last century, summer 1990, 1998]

[K. Berberich et al.: ECIR'10]

• extract temp expr's x_j from witnesses & normalize: "mid 90s" → x_j = [1Jan93, 31Dec97]
• normalize temp expr v in query: "nineties" → v = [1Jan90, 31Dec99]

$$P[q \mid t] \sim \dots \sum_j P[v \mid x_j] \sim \dots \sum_j \sum \{\, P[[b,e] \mid x_j] \ \mid\ \text{interval } [b,e] \in v \,\} \sim \dots \sum_j \frac{overlap(v, x_j)}{union(v, x_j)}$$

P[v = "nineties" | x_j = "mid 90s"] = 1/2
P["nineties" | "summer 1990"] = 1/30
P["nineties" | "last century"] = 1/10

• enhanced ranking
• efficiently computable
• plug into doc/entity/triples LM's
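
The overlap / union ratio can be sketched at year granularity, which is enough to reproduce the "mid 90s" and "last century" values. The helper name `overlap_over_union` and the coarse year-level intervals are assumptions for this illustration; finer-grained expressions such as "summer 1990" would need day-level intervals.

```python
from fractions import Fraction

def overlap_over_union(v, x):
    """overlap(v, x) / union(v, x) for closed year intervals (begin, end)."""
    (vb, ve), (xb, xe) = v, x
    overlap = max(0, min(ve, xe) - max(vb, xb) + 1)
    union = (ve - vb + 1) + (xe - xb + 1) - overlap
    return Fraction(overlap, union)

NINETIES = (1990, 1999)       # query expression "nineties"
MID_90S = (1993, 1997)        # "mid 90s", as normalized on the slide
LAST_CENTURY = (1900, 1999)   # "last century"

print(overlap_over_union(NINETIES, MID_90S))
print(overlap_over_union(NINETIES, LAST_CENTURY))
```

Narrow expressions fully inside the query interval and very broad expressions that merely contain it are both penalized, which matches the intuition that "mid 90s" is a better match for "nineties" than "last century".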

Page 49: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Query Relaxation

q: … Where { France ml ?x . ?x p ?c . ?c in UK . }

f1:  Beckham p ManU       200
f7:  Zidane p ASCannes     20
f9:  Zidane p Real        300
f10: Tidjani p ASCannes    10
f12: Henry p Arsenal      200
f15: Drogba p Chelsea     150

f21: F ml Zidane     200
f22: F ml Tidjani     20
f23: F ml Henry      200
f24: F ml Ribery     200
f26: IC ml Drogba    100
f27: ALG ml Zidane    50

f31: ManU in UK      200
f32: Arsenal in UK   160
f33: Chelsea in UK   140

Relaxed queries with example results:
q(1): … Where { France ml ?x . ?x p ?c . ?c in ?y . }   [ F ml Zidane, Zidane p Real, Real in ESP ]
q(2): … Where { ?y ml ?x . ?x p ?c . ?c in UK . }   [ IC ml Drogba, Drogba p Chelsea, Chelsea in UK ]
q(3): … Where { France ?r ?x . ?x p ?c . ?c in UK . }   [ F resOf Drogba, Drogba p Chelsea, Chelsea in UK ]
q(4): … Where { IC ml ?x . ?x p ?c . ?c in UK . }   [ IC ml Drogba, Drogba p Chelsea, Chelsea in UK ]

$$LM(q^*) = \lambda\, LM(q) + \lambda_1\, LM(q^{(1)}) + \lambda_2\, LM(q^{(2)}) + \dots$$

• replace e in q by e(i) in q(i): precompute P := LM (e ?p ?o) and Q := LM (e(i) ?p ?o), set λ_i ~ 1/2 (KL (P|Q) + KL (Q|P))
• replace r in q by r(i) in q(i): LM (?s r(i) ?o)
• replace e in q by ?x in q(i): LM (?x r ?o)
• …

LM's of e, r, ... are prob. distr.'s of triples!

Page 50: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Result Personalization

same answer for everyone?

[Figure: per-user histories of queries q1 … q6 and clicked facts f1 … f6]

Personal histories of queries & clicked facts → LM(user u): prob. distr. of triples!

[S. Elbassuoni et al.: PersDB'08]

u1 [classical music] q: ?p from Europe . ?p hasWon AcademyAward
u2 [romantic comedy] q: ?p from Europe . ?p hasWon AcademyAward
u3 [from Africa] q: ?p isa SoccerPlayer . ?p hasWon ?a

LM(q|u) = λ LM(q) + (1−λ) LM(u), then business as usual

Open issue: "insightful" results (new to the user)

Page 51: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Result Diversification

q: Select ?p, ?c Where { ?p isa SoccerPlayer . ?p playedFor ?c . }

Ranking without diversification:
1 Beckham, ManchesterU
2 Beckham, RealMadrid
3 Beckham, LAGalaxy
4 Beckham, ACMilan
5 Zidane, RealMadrid
6 Kaka, RealMadrid
7 Cristiano Ronaldo, RealMadrid
8 Raul, RealMadrid
9 van Nistelrooy, RealMadrid
10 Casillas, RealMadrid

Diversified ranking:
1 Beckham, ManchesterU
2 Beckham, RealMadrid
3 Zidane, RealMadrid
4 Kaka, ACMilan
5 Cristiano Ronaldo, ManchesterU
6 Messi, FCBarcelona
7 Henry, Arsenal
8 Ribery, BayernMunich
9 Drogba, Chelsea
10 Luis Figo, Sporting Lissabon

rank results f1 ... fk by ascending
λ KL(LM(q) | LM(fi)) − (1−λ) KL(LM(fi) | LM({f1..fk}\{fi}))
implemented by greedy re-ranking of fi's in candidate pool

[J. Carbonell, J. Goldstein: SIGIR'98]
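
Greedy re-ranking of a candidate pool can be sketched in the style of maximal marginal relevance. Note the substitutions: instead of the KL-based terms on the slide, this sketch uses a plain relevance score and a simple overlap similarity; the scores, the `similarity` function, and λ = 0.5 are all made up for illustration.

```python
def similarity(a, b):
    """Hypothetical similarity: 0.5 for same player plus 0.5 for same club."""
    return 0.5 * (a[0] == b[0]) + 0.5 * (a[1] == b[1])

def greedy_diversify(candidates, k, lam=0.5):
    """Repeatedly pick the candidate maximizing
    lam * relevance - (1 - lam) * max similarity to already selected results."""
    pool = dict(candidates)          # (player, club) -> relevance
    selected = []
    while pool and len(selected) < k:
        def marginal(item):
            redundancy = max((similarity(item, s) for s in selected), default=0.0)
            return lam * pool[item] - (1 - lam) * redundancy
        best = max(pool, key=marginal)
        selected.append(best)
        del pool[best]
    return selected

# Hypothetical relevance scores for a few of the slide's results.
CANDIDATES = [
    (("Beckham", "ManchesterU"), 1.00),
    (("Beckham", "RealMadrid"), 0.95),
    (("Beckham", "LAGalaxy"), 0.90),
    (("Zidane", "RealMadrid"), 0.85),
    (("Messi", "FCBarcelona"), 0.80),
]

print(greedy_diversify(CANDIDATES, k=3))
```

After the first Beckham result is chosen, the remaining Beckham rows are penalized for redundancy, so less relevant but novel results move up.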

Page 52: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Entity-Search Ranking by Link Analysis
[A. Balmin et al. 2004, Nie et al. 2005, Chakrabarti 2007, J. Stoyanovich 2007]

EntityAuthority (ObjectRank, PopRank, HubRank, EVA, etc.):
• define authority transfer graph among entities and pages with edges:
  • entity → page if entity appears in page
  • page → entity if entity is extracted from page
  • page1 → page2 if there is hyperlink or implicit link between pages
  • entity1 → entity2 if there is a semantic relation between entities
• edges can be typed and (degree- or weight-) normalized and are weighted by confidence and type-importance
• also applicable to graph of DB records with foreign-key relations (e.g. bibliography with different weights of publisher vs. location for conference record)
• compared to standard Web graph, ER graphs of this kind have higher variation of edge weights

2-52

Page 53: Gerhard  Weikum Max Planck Institute  for Informatics weikum

Outline

...

Searching for Entities & Relations

Efficient Query Processing

Motivation

Wrap-up

Informative Ranking

User Interface

2-53

Page 54: Gerhard  Weikum Max Planck Institute  for Informatics weikum

54/34

Scalable Semantic Web: Pattern Queries on Large RDF Graphs

schema-free RDF triples: subject-property-object (SPO)
example: Einstein hasWon NobelPrize

SPARQL triple patterns:
Select ?p, ?c Where {
  ?p isa scientist . ?p hasWon NobelPrize .
  ?p bornIn ?t . ?t inCountry ?c . ?c partOf Europe }

large join queries, unpredictable workload, difficult physical design, difficult query optimization

AllTriples (S, P, O):
Einstein  hasWon  Nobel
Einstein  bornIn  Ulm
Ronaldo   hasWon  FIFA
Spain     partOf  Europe
France    partOf  Europe
…

Per-property tables:
hasWon (S, O): (Einstein, Nobel), (Ronaldo, FIFA), …
bornIn (S, O): …

Per-class tables:
Person (S, hasWon, bornIn, …): (Einstein, Nobel, Ulm, …), (Ronaldo, FIFA, Rio, …), …
Country (S, partOf, capital, …): …

Semantic-Web engines (Sesame, Jena, etc.) did not provide scalable query performance

?p ?b ?t

Scalable Semantic Web: RDF-3X Engine
[T. Neumann et al.: VLDB 2008, SIGMOD 2009, VLDBJ'10]

• RISC-style, tuning-free system architecture
• map literals into ids (dictionary) and precompute exhaustive indexing for SPO triples: SPO, SOP, PSO, POS, OSP, OPS, SP*, PS*, SO*, OS*, PO*, OP*, S*, P*, O*, with very high compression
• efficient merge joins with order-preservation
• join-order optimization by dynamic programming over subplan result-order
• statistical synopses for accurate result-size estimation

http://code.google.com/p/rdf3x/
http://www.mpi-inf.mpg.de/~neumann/rdf3x/
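The combination of dictionary encoding, exhaustive permutation indexes and merge joins can be illustrated with sorted Python lists. This is a toy sketch under stated assumptions, not RDF-3X code: it builds only the six full-triple permutations (no aggregated SP*, S* indexes, no compression), and all function names and sample triples are invented for the example.

```python
from bisect import bisect_left
from itertools import permutations


def encode(triples):
    """Map every literal to an integer id (dictionary encoding)."""
    ids = {}
    enc = [tuple(ids.setdefault(x, len(ids)) for x in t) for t in triples]
    return ids, enc


def build_indexes(enc):
    """One sorted list per permutation of S, P, O."""
    idx = {}
    for order in permutations("SPO"):
        pos = ["SPO".index(c) for c in order]
        idx["".join(order)] = sorted(tuple(t[i] for i in pos) for t in enc)
    return idx


def scan(index, prefix):
    """Range scan: all entries of a sorted index that start with prefix."""
    i = bisect_left(index, prefix)
    out = []
    while i < len(index) and index[i][:len(prefix)] == prefix:
        out.append(index[i])
        i += 1
    return out


def merge_join(a, b):
    """Intersect two ascending id lists in one linear pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


triples = [  # hypothetical sample data
    ("Einstein", "isa", "scientist"), ("Curie", "isa", "scientist"),
    ("Ronaldo", "isa", "footballer"), ("Einstein", "hasWon", "NobelPrize"),
    ("Curie", "hasWon", "NobelPrize"), ("Ronaldo", "hasWon", "FIFA"),
]
ids, enc = encode(triples)
idx = build_indexes(enc)
names = {v: k for k, v in ids.items()}

# ?p isa scientist . ?p hasWon NobelPrize
scientists = [t[2] for t in scan(idx["OPS"], (ids["scientist"], ids["isa"]))]
winners = [t[2] for t in scan(idx["OPS"], (ids["NobelPrize"], ids["hasWon"]))]
result = [names[s] for s in merge_join(scientists, winners)]
```

Both OPS scans deliver their subjects in ascending id order, so the join needs no sorting or hashing; that is the order-preservation the slide refers to.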

Scalable Indexing & Merge Joins in RDF-3X

Query: ?p isa scientist . ?p hasWon NobelPrize . ?p bornIn ?t . ?t inCountry ?c . ?c partOf Europe

Figure (poor join order): index scans OPS[O=scientist, P=isa], OPS[O=NobelPrize, P=hasWon], PSO[P=bornIn] and PSO[P=inCountry] over dictionary-encoded id triples (e.g. 3007 2015 1011, 3111 2003 1019, 2100 1003 5648, 2077 1005 8311), combined by merge joins on [S=S], [S=S] and [O=S].

Scalable Indexing & Merge Joins in RDF-3X

Query: ?p isa scientist . ?p hasWon NobelPrize . ?p bornIn ?t . ?t inCountry ?c . ?c partOf Europe

Figure (good join order): the same index scans OPS[O=scientist, P=isa], OPS[O=NobelPrize, P=hasWon], PSO[P=bornIn] and PSO[P=inCountry], with the merge joins on [S=S], [S=S] and [O=S] arranged in a better order.

Scalable Indexing & Merge Joins in RDF-3X

Query: ?p isa scientist . ?p hasWon NobelPrize . ?p bornIn ?t . ?t inCountry ?c . ?c partOf Europe

Figure: the same plan of index scans and merge joins, extended with sideways information passing (run-time filters).

Join-Order Optimization for SPARQL
[T. Neumann: VLDBJ'10]

Fine-grained RDF (e.g. over social-tagging data) often entails complex queries with 10, 20 or more joins

Example query (shown as a query graph on the slide):
?f gender female .
?f hasFriend me .
?f livesIn Berlin .
?f likes ?s .
?s taggedProtestBy ?x .
?s taggedAntiwarBy ?y .
?s composedBy BobDylan .
?s performedIn ?p .
?p inCity Berlin .
?s inYear 2009 .
?p singer ?a .

Join-order optimization operates on join graph: nodes are triple patterns, edges denote shared variables
• need selectivity (result cardinality) estimator
• join-order optimization is O(n^3) for chains, O(2^n) for stars
• exact optimization uses dynamic programming over sub-graphs and the sort order of intermediate results
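As a rough illustration of dynamic programming over the join graph, the sketch below enumerates left-deep join orders only and charges each plan the sum of its estimated intermediate result sizes. It is a simplification and not the optimizer of the cited paper: it ignores sort orders and bushy plans, assumes independent selectivities, and the cardinalities and selectivities are invented numbers.

```python
from itertools import combinations


def best_left_deep_order(card, sel):
    """card: {pattern: estimated cardinality}.
    sel: {frozenset({a, b}): join selectivity} for patterns sharing a variable.
    Returns (cost, result size, join order) of the cheapest left-deep plan."""
    names = sorted(card)
    best = {frozenset([n]): (0.0, float(card[n]), (n,)) for n in names}
    for k in range(2, len(names) + 1):
        for subset in combinations(names, k):
            s = frozenset(subset)
            for p in subset:              # p is the pattern joined last
                rest = s - {p}
                if rest not in best:
                    continue              # rest is not connected
                factors = [sel[frozenset((p, q))] for q in rest
                           if frozenset((p, q)) in sel]
                if not factors:
                    continue              # would be a cross product
                cost, size, plan = best[rest]
                new_size = size * card[p]
                for f in factors:
                    new_size *= f
                cand = (cost + new_size, new_size, plan + (p,))
                if s not in best or cand[0] < best[s][0]:
                    best[s] = cand
    return best[frozenset(names)]


A, B, C = "?f hasFriend me", "?f likes ?s", "?s composedBy BobDylan"
card = {A: 50, B: 100000, C: 500}                # hypothetical estimates
sel = {frozenset((A, B)): 0.0001,                # join on ?f
       frozenset((B, C)): 0.00002}               # join on ?s
cost, size, plan = best_left_deep_order(card, sel)
```

With these numbers, joining the two `?f` patterns first yields the smaller intermediate result, so the `composedBy` pattern is joined last.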

Experimental Evaluation: Setup
[T. Neumann: SIGMOD'09]

Setup and competitors: 2GHz dual core, 2 GB RAM, 30MB/s disk, Linux
• column-store property tables by MIT folks, using MonetDB
• triples store with SPO, POS, PSO indexes, using PostgreSQL

Datasets:
1) Barton library catalog: 51 Mio. triples (4.1 GB)
2) YAGO knowledge base: 40 Mio. triples (3.1 GB)
3) Librarything social-tagging excerpt: 30 Mio. triples (1.8 GB)

Benchmark queries such as:
1) counts of French library items (books, music, etc.), with creator, publisher, language, etc.
2) scientist from Poland with French advisor who both won awards
3) books tagged with romance, love, mystery, suspense by users who like crime novels and have friends who ...

Select ?t Where {
  ?b hasTitle ?t .
  ?u romance ?b .
  ?u love ?b .
  ?u mystery ?b .
  ?u suspense ?b .
  ?u crimeNovel ?c .
  ?u hasFriend ?f .
  ?f ... }

Experimental Evaluation: Results
[T. Neumann: SIGMOD'09]

YAGO dataset (40 Mio. triples, 3.1 GB): 2.7 GB in RDF-3X, loaded in 25 min

Librarything dataset (36 Mio. triples, 1.8 GB): 1.6 GB in RDF-3X, loaded in 20 min

Figure: two charts of execution time [s], one per dataset.

measurements on Uniprot (845 Mio. triples) and Billion-Triples (562 Mio. triples) confirm these experimental results

Distributed RDF Stores & Sparql at Scale
[D. Abadi et al.: VLDB'11]

• Partition data graph by hashing triples on subject
• Enhance locality by replicating N-hop neighbors with each triple
• Can automatically determine when query is parallelizable without communication
• For remaining cases: distributed joins, semi-joins, etc.

Figure: example data graph with players (Messi, Xavi, Iniesta, Casillas, Özil), teams (FC Barca, Real) and cities (Barcelona, Madrid, Rosario, Gelsenkirchen), partitioned across nodes, and a query graph with variables ?player, ?pos, ?team, ?city, ?pop, edges playsPos, playsFor, bornIn, locatedIn, hasPop, the constant midfielder and the condition > 1 000 000.
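The first two bullets can be sketched as follows: triples are assigned to a partition by a hash of their subject, and each partition is then extended with the triples of its 1-hop neighbors. The function names, the choice of CRC32 as hash, the number of partitions and the sample facts are assumptions for this illustration, not details of the cited system.

```python
import zlib
from collections import defaultdict


def partition_by_subject(triples, n):
    """Assign every triple to partition hash(subject) mod n."""
    parts = [[] for _ in range(n)]
    for s, p, o in triples:
        parts[zlib.crc32(s.encode("utf-8")) % n].append((s, p, o))
    return parts


def replicate_one_hop(parts, triples):
    """Add to each partition the triples whose subject is an object in it."""
    by_subject = defaultdict(list)
    for t in triples:
        by_subject[t[0]].append(t)
    extended = []
    for part in parts:
        ext = set(part)
        for _, _, o in part:
            ext.update(by_subject.get(o, []))
        extended.append(ext)
    return extended


triples = [  # illustrative facts using names from the slide's figure
    ("Messi", "playsFor", "FC_Barca"), ("Messi", "bornIn", "Rosario"),
    ("Xavi", "playsFor", "FC_Barca"), ("Özil", "playsFor", "Real"),
    ("Özil", "bornIn", "Gelsenkirchen"),
    ("FC_Barca", "locatedIn", "Barcelona"), ("Real", "locatedIn", "Madrid"),
]
parts = partition_by_subject(triples, 3)
extended = replicate_one_hop(parts, triples)
```

After replication, a two-step path such as `?player playsFor ?team . ?team locatedIn ?city` can be answered inside the partition that owns the player, without shipping data between nodes.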

Top-k Query Processing for Ranked Results

Data items: d1, …, dn with scores s(t1,d1) = 0.7, …, s(tm,d1) = 0.2
Query: q = (t1, t2, t3), aggr: summation

Index lists (sorted by descending score):
t1: d78 0.9, d23 0.8, d10 0.8, d1 0.7, d88 0.2, …
t2: d64 0.8, d23 0.6, d10 0.6, d12 0.2, d78 0.1, …
t3: d10 0.7, d78 0.5, d64 0.4, d99 0.2, d34 0.1, …

Baseline: fetch lists, join & sort

Threshold-style algorithms: simple & DB-style; need only O(k) memory; for monotone score aggr.

Threshold Algorithm (TA) [Fagin 01, Güntzer 00, Nepal 99, Buckley 85]

Data items: d1, …, dn with scores s(t1,d1) = 0.7, …, s(tm,d1) = 0.2
Query: q = (t1, t2, t3), aggr: summation, k = 2

Index lists (sorted by descending score):
t1: d78 0.9, d23 0.8, d10 0.8, d1 0.7, d88 0.2, …
t2: d64 0.9, d23 0.6, d10 0.6, d12 0.2, d78 0.1, …
t3: d10 0.7, d78 0.5, d64 0.3, d99 0.2, d34 0.1, …

Threshold algorithm (TA):
scan index lists; consider d at pos_i in L_i;
high_i := s(t_i,d);
if d ∉ top-k then {
  look up s_ν(d) in all lists L_ν with ν ≠ i;
  score(d) := aggr {s_ν(d) | ν = 1..m};
  if score(d) > min-k then
    add d to top-k and remove min-score d';
  min-k := min{score(d') | d' ∈ top-k}; }
threshold := aggr {high_ν | ν = 1..m};
if threshold ≤ min-k then exit;

Scan depth 1: d78 is scored 1.5 and d64 1.2, then d10 (2.1) replaces d64
Scan depths 2 to 4: top-k unchanged
Rank Doc Score
1    d10 2.1
2    d78 1.5
STOP! after scan depth 4 (threshold ≤ min-k)

simple & DB-style; needs only O(k) memory; for monotone score aggr.
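The TA pseudocode can be turned into a small runnable sketch. The index lists below follow the slide's example as far as it can be read from the figure (their exact row order is an assumption), random access is simulated with one dictionary per list, and the function name is invented.

```python
def threshold_algorithm(lists, k):
    """lists: per query term, a score-descending list of (doc, score).
    Returns (top-k as [(doc, score)], scan depth at termination)."""
    lookup = [dict(l) for l in lists]      # simulated random access
    topk = {}                              # at most k docs with full scores
    depth = 0
    max_depth = max(len(l) for l in lists)
    while depth < max_depth:
        high = []
        for l in lists:
            if depth >= len(l):
                high.append(0.0)           # list exhausted
                continue
            doc, s = l[depth]
            high.append(s)
            if doc not in topk:
                # random accesses: fetch the doc's score from every list
                score = sum(lk.get(doc, 0.0) for lk in lookup)
                if len(topk) < k:
                    topk[doc] = score
                else:
                    worst = min(topk, key=topk.get)
                    if score > topk[worst]:
                        del topk[worst]
                        topk[doc] = score
        depth += 1
        min_k = min(topk.values()) if len(topk) == k else 0.0
        if sum(high) <= min_k:             # threshold test (summation)
            break
    return sorted(topk.items(), key=lambda kv: -kv[1]), depth


lists = [
    [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.7), ("d88", 0.2)],
    [("d64", 0.9), ("d23", 0.6), ("d10", 0.6), ("d12", 0.2), ("d78", 0.1)],
    [("d10", 0.7), ("d78", 0.5), ("d64", 0.3), ("d99", 0.2), ("d34", 0.1)],
]
result, depth = threshold_algorithm(lists, k=2)
```

Only the current top-k entries are kept, so a document evicted earlier may be re-scored when it shows up again; that costs extra random accesses but keeps memory at O(k).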

TA with Sorted Access Only (NRA) [Fagin 01, Güntzer et al. 01]

Data items: d1, …, dn with scores s(t1,d1) = 0.7, …, s(tm,d1) = 0.2
Query: q = (t1, t2, t3), aggr: summation, k = 1

Index lists (sorted by descending score):
t1: d78 0.9, d23 0.8, d10 0.8, d1 0.7, d88 0.2, …
t2: d64 0.8, d23 0.6, d10 0.6, d12 0.2, d78 0.1, …
t3: d10 0.7, d78 0.5, d64 0.4, d99 0.2, d34 0.1, …

No-random-access algorithm (NRA):
scan index lists; consider d at pos_i in L_i;
E(d) := E(d) ∪ {i}; high_i := s(t_i,d);
worstscore(d) := aggr{s(t_ν,d) | ν ∈ E(d)};
bestscore(d) := aggr{worstscore(d), aggr{high_ν | ν ∉ E(d)}};
if worstscore(d) > min-k then
  add d to top-k;
  min-k := min{worstscore(d') | d' ∈ top-k};
else if bestscore(d) > min-k then
  cand := cand ∪ {d};
threshold := max {bestscore(d') | d' ∈ cand};
if threshold ≤ min-k then exit;

Scan depth 1:
Rank Doc Worst-score Best-score
1    d78 0.9         2.4
2    d64 0.8         2.4
3    d10 0.7         2.4

Scan depth 2:
Rank Doc Worst-score Best-score
1    d78 1.4         2.0
2    d23 1.4         1.9
3    d64 0.8         2.1
4    d10 0.7         2.1

Scan depth 3:
Rank Doc Worst-score Best-score
1    d10 2.1         2.1
2    d78 1.4         2.0
3    d23 1.4         1.8
4    d64 1.2         2.0
STOP!

sequential access (SA) faster than random access (RA) by factor of 20-1000
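A compact version of NRA that reproduces the worst-score and best-score bookkeeping is sketched below. It recomputes all bounds after every scan step for clarity rather than efficiency, adds the bound for completely unseen documents to the stopping test, and uses the slide's example lists as read from the figure (their exact layout is an assumption). The function name is invented.

```python
def nra(lists, k):
    """Top-k with sorted access only.
    lists: per query term, a score-descending list of (doc, score).
    Returns (top-k as [(doc, worstscore)], scan depth at termination)."""
    m = len(lists)
    seen = {}                      # doc -> {list index: score seen so far}
    high = [0.0] * m               # score at the current scan position
    max_depth = max(len(l) for l in lists)
    top = []
    for depth in range(max_depth):
        for i, l in enumerate(lists):
            if depth < len(l):
                doc, s = l[depth]
                seen.setdefault(doc, {})[i] = s
                high[i] = s
            else:
                high[i] = 0.0      # list exhausted
        worst = {d: sum(e.values()) for d, e in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in seen[d])
                for d in seen}
        top = sorted(worst, key=worst.get, reverse=True)[:k]
        if len(top) == k:
            min_k = worst[top[-1]]
            rivals = [best[d] for d in seen if d not in top]
            rivals.append(sum(high))   # bound for documents not seen at all
            if max(rivals) <= min_k:
                return [(d, worst[d]) for d in top], depth + 1
    return [(d, sum(seen[d].values())) for d in top], max_depth


lists = [
    [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.7), ("d88", 0.2)],
    [("d64", 0.8), ("d23", 0.6), ("d10", 0.6), ("d12", 0.2), ("d78", 0.1)],
    [("d10", 0.7), ("d78", 0.5), ("d64", 0.4), ("d99", 0.2), ("d34", 0.1)],
]
result, depth = nra(lists, k=1)
```

Unlike TA, this variant has to remember every candidate whose best-score could still beat the current top-k, so its memory use is not bounded by k.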

Query Expansion (Synonyms, Related Terms, …)

Thesaurus/Ontology: concepts, relationships, glosses from WordNet, Wikipedia, KB, Web, etc.
relationships quantified by statistical correlation measures, e.g. Jaccard coefficients: |X∩Y| / |X∪Y|

User query: ~t1 ... ~tm
Example: ~professor Germany ~course ~IR

Term2Concept with WSD

Query expansion: exp(ti) = {w | sim(ti,w) ≥ θ}

Weighted expanded query
Example: (professor or lecturer (0.749) or scholar (0.71) or ...) and Germany and ((course or class (1.0) or seminar (0.84) or ...) and („IR“ or „Web search“ (0.653) or ...))

Efficient top-k search with dynamic expansion
better recall, better mean precision for hard queries

Problems:
• efficiency
• tuning the threshold

Figure: ontology graph around professor with the terms lecturer, teacher, educator, scholar, academic / academician / faculty member, scientist, researcher, investigator, mentor, intellectual, artist, director, primadonna, magician, wizard, alchemist; edge label HYPONYM (0.749).
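A small sketch of threshold-based expansion with Jaccard coefficients, where X and Y are taken to be the sets of documents containing each term. The posting sets and the threshold value are made up for the example, and the function names are not from the slides.

```python
def jaccard(x, y):
    """|X ∩ Y| / |X ∪ Y| for two collections of document ids."""
    x, y = set(x), set(y)
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def expand(term, postings, theta):
    """exp(t) = {w | sim(t, w) >= theta}, returned with the similarities."""
    base = postings[term]
    sims = {w: jaccard(base, docs) for w, docs in postings.items() if w != term}
    return {w: s for w, s in sims.items() if s >= theta}


postings = {  # hypothetical term -> document ids
    "professor": {1, 2, 3, 4},
    "lecturer": {2, 3, 4, 5},
    "scholar": {1, 2, 9},
    "magician": {7, 8},
}
expansions = expand("professor", postings, theta=0.5)
```

Lowering `theta` to 0.4 would also admit scholar, which illustrates why tuning the threshold is listed among the problems.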

Query Expansion with Incremental Merging
[M. Theobald et al.: SIGIR 2005]

relaxable query q: ~professor research
with expansions exp(i) = {w | sim(i,w) ≥ θ} based on ontology relatedness modulating monotonic score aggregation

TA scans of index lists for ∪ i∈q exp(i)

Better: dynamic query expansion with incremental merging of additional index lists
efficient, robust, self-tuning

ontology / meta-index entry for professor: lecturer: 0.7, scholar: 0.6, academic: 0.53, scientist: 0.5, ...

Figure: B+ tree index on terms with score-sorted index lists (doc: score) for research, professor, lecturer and scholar, next to the ontology graph around professor with edge labels such as Hyponym (0.749) and Related (0.48).

Query Expansion with Incremental Merging
[M. Theobald et al.: SIGIR 2005]

Principles of the algorithm:
• organize possible expansions Exp(t) = {w | sim(t,w) ≥ θ, t ∈ q} in priority queues based on sim(t,w) for each t
for given q = {t1, ..., tm}:
• keep track of active expansions for each ti: ActExp(ti) = {w1(ti), w2(ti), ..., wj(i)(ti)} ~ position j(i) in the priority queue for Exp(ti)
• scan index lists for t1, w1(t1), ..., wj(1)(t1), ..., tm, w1(tm), ..., wj(m)(tm), maintaining cursors pos(·) and score bounds high(·) for each list
• in each scan step do for each ti:
  advance cursor of list for w(ti) ∈ ActExp(ti) for which best := high(w)*sim(w,ti) is largest among all lists of ti,
  or add next largest w(ti) to ActExp(ti) if sim(w,ti) > best and proceed in this new list
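The scan step for a single expandable term can be sketched as a generator that merges the index lists of the term and its expansions in descending order of sim × score, opening a further list only when its similarity exceeds the best score currently on offer. The sketch assumes scores in [0, 1] (so that sim bounds every entry of an unopened list), invents all index contents, and does not combine scores of a document that occurs in several lists.

```python
def incremental_merge(term, index, expansions):
    """Yield (doc, word, sim * score) in non-increasing order of sim * score.

    index: word -> score-descending list of (doc, score), scores in [0, 1].
    expansions: word -> sim(term, word); the term itself has sim 1.0.
    """
    pending = sorted(expansions.items(), key=lambda kv: kv[1], reverse=True)
    active = [[1.0, term, 0]]                 # [sim, word, cursor]
    while True:
        live = [a for a in active if a[2] < len(index[a[1]])]
        best = max(live, key=lambda a: a[0] * index[a[1]][a[2]][1], default=None)
        best_score = best[0] * index[best[1]][best[2]][1] if best else 0.0
        if pending and pending[0][1] > best_score:
            # the next expansion may hold better entries: activate its list
            word, sim = pending.pop(0)
            active.append([sim, word, 0])
            continue
        if best is None:
            return
        doc, s = index[best[1]][best[2]]
        best[2] += 1                          # advance the cursor of that list
        yield doc, best[1], best[0] * s


index = {  # hypothetical index lists (doc id, score)
    "professor": [(12, 0.9), (14, 0.8), (28, 0.6)],
    "lecturer": [(37, 0.9), (44, 0.8)],
    "scholar": [(92, 0.9), (67, 0.9)],
}
merged = list(incremental_merge("professor", index,
                                {"lecturer": 0.7, "scholar": 0.6}))
```

In this run the lecturer list is opened only after the professor list has dropped to 0.6, which shows how the merge avoids touching expansions that cannot yet contribute.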

Nested Top-k [Theobald et al.: SIGIR'05]

Query: research professor XML „information retrieval“ „query result ranking“
With expansion: ~research ~professor XML „information retrieval“ „query result ranking“

Figure: operator tree with a top-k operator at the root over the leaves research, prof, XML, information retrieval, and query result ranking; the two phrases each have their own nested top-k operator. For the expanded query, two further nested top-k operators merge the expansions science, scholar, lecturer.

multithreaded coroutine-style evaluation with synchronization of PQs

Top-k Rank Joins on Structured Data
[Ilyas et al. 2008]

extend TA/NRA/etc. to ranked query results from structured data (improve over baseline: evaluate query, then sort)

Select R.Name, C.Theater, C.Movie
From RestaurantsGuide R, CinemasProgram C
Where R.City = C.City
Order By R.Quality/R.Price + C.Rating Desc

RestaurantsGuide
Name        Type      Quality  Price  City
BlueDragon  Chinese   -        €15    SB
Haiku       Japanese  -        €30    SB
Mahatma     Indian    -        €20    IGB
Mescal      Mexican   -        €10    IGB
BigSchwenk  German    -        €25    SLS
...

CinemasProgram
Theater    Movie      Rating  City
BlueSmoke  Tombstone  7.5     SB
Oscar's    Hero       8.2     SB
Holly's    Die Hard   6.6     SB
GoodNight  Seven      7.7     IGB
BigHits    Godfather  9.1     IGB
...

process index lists on quality desc, price asc, rating desc

Top-k Sparql Queries over RDF
[S. Elbassuoni 2012]

Query: ?w wrote ?b . ?b hasTag Memoir .

Score-sorted index list for ?b hasTag Memoir:
An_Hour_before_Daylight hasTag Memoir 0.880
The_Grand_Alliance hasTag Memoir 0.873
The_Night_Trilogy hasTag Memoir 0.800
Roughing_it hasTag Memoir 0.796
…

Score-sorted index list for ?w wrote ?b:
Mark_Twain wrote Roughing_it 0.999
J_K_Rowling wrote Harry_Potter 0.982
Stephanie_Meyer wrote Twilight 0.966
Cormac_McCarthy wrote The_Road 0.864
Jimmy_Carter wrote Palestine 0.732
…

• Precompute score-sorted index lists for each S value, P, O, SP value pair, PO, SO
• Perform rank-join over index lists

Top-k Sparql Queries over RDF + Text [S. Elbassuoni 2012]

Query: ?w wrote ?b [nobel prize] ; ?b hasTag Memoir

Score-sorted index list for ?b hasTag Memoir:
An_Hour_before_Daylight hasTag Memoir 0.880
The_Grand_Alliance hasTag Memoir 0.873
The_Night_Trilogy hasTag Memoir 0.800
Roughing_it hasTag Memoir 0.796
…

Score-sorted index list for ?w wrote ?b with keyword nobel:
Jimmy_Carter wrote Palestine 0.712
Albert_Camus wrote The_Rebel 0.667
Doris_Lessing wrote Alfred_and_Emily 0.660
…

Score-sorted index list for ?w wrote ?b with keyword prize:
Ian_McEwan wrote On_Chesil_Beach 0.882
Jimmy_Carter wrote Palestine 0.774
Cormac_McCarthy wrote The_Road 0.623
Jhumpa_Lahiri wrote The_Namesake 0.600
…

• Precompute score-sorted index lists for each S+keyword, P+keyword, O+…, SP+keyword, PO+…, SO+…
• Perform rank-join over index lists

Outline
• Motivation
• Searching for Entities & Relations
• Informative Ranking
• Efficient Query Processing
• User Interface
• Wrap-up
• ...

UI: Table Search by Example & Description
[S. Sarawagi et al.: VLDB'09]

Expect result table → specify query in tabular form (QBE, QueryByDescription)

QBE: match against table rows
q1:
David Beckham    Real Madrid
Zinedine Zidane  Juventus
Lionel Messi     FC Barcelona
q2:
Bayern Munich        Real Madrid  2:1
1.FC Kaiserslautern  Real Madrid  5:0
Borussia Dortmund    Real Madrid  5:1

QBD: infer query from table headers ???
q1: player  club  ???  ???
q2: output class description: soccer club from Germany and winner against Real Madrid

UI: Structured Keyword Search
[Ilyas et al. Sigmod'10]

Need to map (groups of) keywords onto entities & relationships based on name-entity similarities/probabilities

q: Champions League finals with Real Madrid
Candidate mappings:
• Champions League: UEFA Champions League, League of Wrestling Champions
• finals: final match, final exam
• Real Madrid: Real Madrid C.F.

q: German football clubs that won (a match) against Real
Candidate mappings:
• football: soccer, American football
• clubs: social club, night club, sports team
• Real: Real Madrid C.F., Real Zaragoza, Brazilian Real, Familia Real, Real Number

more ambiguity, more sophisticated relation → more candidates → combinatorial complexity

UI: Natural Language Questions

Which German soccer clubs have won against Real Madrid ?

translate question into Sparql query:
• dependency parsing to decompose question
• mapping of question units onto entities, classes, relations

map results into tabular or visual presentation or speech

UI: Natural Language Questions

Decomposed question:
Which ?c
?c: German soccer clubs
?r: have won against ?o
?o: Real Madrid ?

Candidate Sparql queries:
Select ?c Where { ?c isa GermanSoccerClub . ?c wonAgainst RealMadrid . }
Select ?c Where { ?c isa SoccerClub . ?c in Germany . ?c wonAgainst RealMadrid . }
Select ?c Where { ?c isa SoccerClub . ?c in Germany . ?id: ?c playedAgainst RealMadrid . ?c isWinnerOf ?id . }

SPARQL or Keywords or … ?

Who composed scores for westerns and is from Rome?

SPARQL query:
Select ?p Where {
  ?p created ?s . ?s type music .
  ?s contributesTo ?m . ?m type movie .
  ?p bornIn Rome . }

Keywords query: composer score western Rome
or: {composer Rome} score western
or: …

Faceted search with interactive navigation: composer, Rome, score, music, movie, western

Natural Language Question Answering
[IBM Journal of R&D 56(3-4), 2012]

IBM Watson in Jeopardy show: Which city has two airports named after a war hero and a battlefield?
used Yago & Dbpedia for type checking

Figure: overall architecture of Watson (simplified), leading from question to answer, with the components Question Analysis (Classification, Decomposition); Hypotheses Generation (Search): Answer Candidates; Candidate Filtering & Ranking; Hypotheses & Evidence Scoring.

Semantic Technologies in IBM Watson
[A. Kalyanpur et al.: ISWC 2011]

IBM Watson in Jeopardy show: Which city has two airports named after a war hero and a battlefield?
used Yago & Dbpedia for type checking

Semantic checking of answer candidates

Figure: inputs are the question, the candidate string and the lexical answer type; components are Relation Detection, Entity Disambiguation & Matching, Predicate Disambiguation & Matching, Type Checker and Constraint Checker; they use the KB (KB instances, semantic types, spatial & temporal relations) and produce a candidate score.

Scoring of Semantic Answer Types
[A. Kalyanpur et al.: ISWC 2011]

Check for 1) Yago classes, 2) Dbpedia classes, 3) Wikipedia lists

Match lexical answer type against class candidates based on string similarity and class sizes (popularity)
Examples: Scottish inventor → inventor, star → movie star

Compute scores for semantic types, considering: class match, subclass match, superclass match, sibling class match, lowest common ancestor, class disjointness, …

                      no types  Yago   Dbpedia  Wikipedia  all 3
Standard QA accuracy  50.1%     54.4%  54.7%    53.8%      56.5%
Watson accuracy       65.6%     68.6%  67.1%    67.4%      69.0%

see also: j-archive.com

From Questions to Queries

NL question: Who composed scores for westerns and is from Rome?

Dependency parsing exposes structure of question → „triploids“ (sub-cues):
• Who composed scores
• scores for westerns
• is from Rome

From Triploids to Triples

Who composed scores for westerns and is from Rome?

Who composed scores → ?x composed scores → ?x type composer . ?x composed ?s . ?s type music
scores for westerns → scores contributesTo ?y . ?y type westernMovie → ?s contributesTo ?y
Who is from Rome → ?x bornIn Rome

Disambiguation Mapping for Triploids

Who composed scores for westerns and is from Rome?

Figure: mapping graph with weighted edges (coherence, similarity, etc.) between the triploids q1 to q4, their phrases (Who, composed, composed scores, scores for, westerns, is from, Rome) and candidate semantic items:
• person, musician, The Who
• created, wroteComposition, wroteSoftware
• soundtrack, soundtrackFor, shootsGoalFor
• bornIn, actedIn
• western movie, Western Digital
• Rome (Italy), Lazio Roma

Combinatorial Optimization by ILP (with type constraints etc.)

Relaxing Overconstrained Queries
(S. Elbassuoni et al.: CIKM'10, ESWC'11)

Select ?p Where {
  ?p composed ?s . ?s type music .
  ?s for ?m . ?m type movie .
  ?p bornIn Rome . }

Select ?p Where {
  ?p composed ?s . ?s type music .
  ?s for ?m . ?m type movie [western] .
  ?p bornIn Rome . }

Select ?p Where {
  ?p ?rel1 ?s [composed] . ?s type music .
  ?s ?rel2 ?m . ?m type movie [western] .
  ?p bornIn Rome . }

with extended SPARQL-FullText: SPOX quad patterns

Preliminary Results
http://www.mpi-inf.mpg.de/yago-naga/deanna/
(M. Yahya et al.: WWW'12)


Outline
• Motivation
• Searching for Entities & Relations
• Informative Ranking
• Efficient Query Processing
• User Interface
• Wrap-up
• ...

Take-Home Lessons

Semantic Search over Entities and Relationships: Sparql is natural choice, but text extension is crucial

For ranking, LMs are effective, elegant, versatile: statistically principled & compositional, applicable to entities, RDF, triples+text, temporal expr, …

Efficient implementation techniques: triple indexing, join order optimization, …, top-k query processing

keywords++, tables by example, natural language, speech, …: APIs becoming clear, UIs widely open

Open Problems and Grand Challenges

Reconcile expressiveness of Sparql + extensions with ease-of-use of natural-language QA

Efficient and effective (LM-based) Top-k Ranking for entire Web of Linked Data (& Text)

Extended LM-based Ranking for Entities in Temporal & Spatial (& Social) Context

Extend Sparql with Text, Time, Space, … Influence W3C standard …

Distributed / federated query processing for entire Web of Linked Data

Recommended Readings: Search & Ranking
• G. Kasneci, F. Suchanek, G. Ifrim: NAGA: Searching and Ranking Knowledge. ICDE 2008
• S. Elbassuoni, M. Ramanath, G. Weikum: Query Relaxation for Entity-Relationship Search. ESWC 2011
• S. Elbassuoni, M. Ramanath, et al.: Language-model-based ranking for queries on RDF-graphs. CIKM 2009
• Z. Nie, Y. Ma, S. Shi, J.-R. Wen, W.-Y. Ma: Web Object Retrieval. WWW 2007
• H. Bast, A. Chitea, F. Suchanek, I. Weber: ESTER: efficient search on text, entities, and relations. SIGIR 2007
• J. Pound, P. Mika, H. Zaragoza: Ad-hoc object retrieval in the web of data. WWW 2010
• O. Udrea, D. Recupero, V.S. Subrahmanian: Annotated RDF. ACM Trans. Comput. Log. 11(2), 2010
• V. Hristidis, H. Hwang, Y. Papakonstantinou: Authority-based keyword search in databases. TODS 33(1), 2008
• T. Cheng, X. Yan, K. Chang: EntityRank: Searching Entities Directly and Holistically. VLDB 2007
• D. Vallet, H. Zaragoza: Inferring the Most Important Types of a Query: a Semantic Approach. SIGIR 2008
• R. Blanco, H. Zaragoza: Finding support sentences for entities. SIGIR 2010
• P. Serdyukov, D. Hiemstra: Modeling Documents as Mixtures of Persons for Expert Finding. ECIR 2008
• R. Kaptein, P. Serdyukov, A.P. de Vries, J. Kamps: Entity ranking using Wikipedia as a pivot. CIKM 2010
• H. Fang, C. Zhai: Probabilistic Models for Expert Finding. ECIR 2007
• D. Petkova, W.B. Croft: Hierarchical Language Models for Expert Finding in Enterprise Corpora. ICTAI 2006
• M. Pasca: Towards Temporal Web Search. SAC 2008
• K. Berberich, S.J. Bedathur, O. Alonso, G. Weikum: A Language Modeling Approach for Temporal Information Needs. ECIR 2010: 13-25
• J. Tappolet et al.: Applied temporal RDF: efficient temporal querying of RDF data with SPARQL. ESWC 2009
• C. Zhai: Statistical Language Models for Information Retrieval. Morgan & Claypool, 2008
• B.C. Ooi (Ed.): Special Issue on Keyword Search, IEEE Data Eng. Bull. 33(1), 2010
• J.X. Yu, L. Qin, L. Chang: Keyword Search in Databases. Morgan & Claypool, 2010
• S. Ceri, M. Brambilla (Eds.): Search Computing: Challenges and Directions, Springer, 2010
• M. Arenas, S. Conca, J. Perez: Counting beyond a Yottabyte, or how SPARQL 1.1 property paths will prevent adoption of the standard. WWW 2012

Recommended Readings: Efficiency & UI
• T. Neumann, G. Weikum: The RDF-3X engine for scalable management of RDF data. VLDB J. 19(1), 2010
• J. Huang, D.J. Abadi, K. Ren: Scalable SPARQL Querying of Large RDF Graphs. PVLDB 4(11), 2011
• L. Sidirourgos, R. Goncalves, M.L. Kersten, N. Nes, S. Manegold: Column-store support for RDF data management: not all swans are white. PVLDB 1(2), 2008
• R. Delbru, S. Campinas, G. Tummarello: Searching web data: An entity retrieval and high-performance indexing model. J. Web Sem. 10, 2012
• G. Tummarello et al.: Sig.ma: live views on the web of data. WWW 2010
• M. Schmidt, T. Hornung, G. Lausen, C. Pinkel: SP^2Bench: A SPARQL Performance Benchmark. ICDE 2009
• M. Morsey, J. Lehmann, S. Auer, A.N. Ngomo: DBpedia SPARQL Benchmark - Performance Assessment with Real Queries on Real Data. ISWC 2011
• K. Hose, R. Schenkel, M. Theobald, G. Weikum: Database Foundations for Scalable RDF Processing. In: A. Polleres et al. (Eds.): Reasoning Web - Semantic Technologies for the Web of Data, Springer 2011
• A. Schwarte, P. Haase, K. Hose, R. Schenkel, M. Schmidt: FedX: Optimization Techniques for Federated Query Processing on Linked Data. ISWC 2011
• J. Pound, I.F. Ilyas, G.E. Weddell: Expressive and flexible access to web-extracted data: a keyword-based structured query language. SIGMOD 2010
• R. Pimplikar, S. Sarawagi: Answering Table Queries on the Web using Column Keywords. PVLDB 5(10), 2012
• A. Kalyanpur, J.W. Murdock, J. Fan, C.A. Welty: Leveraging Community-Built Knowledge for Type Coercion in Question Answering. ISWC 2011
• C. Unger, L. Bühmann, J. Lehmann, A.N. Ngomo, D. Gerber, P. Cimiano: Template-based question answering over RDF data. WWW 2012
• M. Yahya, K. Berberich, S. Elbassuoni, M. Ramanath, V. Tresp, G. Weikum: Natural Language Questions for the Web of Data. EMNLP 2012
• D.A. Ferrucci et al.: Building Watson: An Overview of the DeepQA Project. AI Magazine 31(3), 2010
• IBM Journal of Research and Development 56(3-4), Special Issue on “This is Watson”, 2012

Thank You!