Upload
kieu
View
41
Download
0
Tags:
Embed Size (px)
DESCRIPTION
Knowledge Harvesting f rom Text and Web Sources. Part 2: Search and Ranking of Knowledge. Gerhard Weikum Max Planck Institute for Informatics http://www.mpi-inf.mpg.de/~weikum/. The Web Speaks to Us. Source: DB & IR methods for knowledge discovery. Communications of - PowerPoint PPT Presentation
Citation preview
Gerhard Weikum Max Planck Institute for Informaticshttp://www.mpi-inf.mpg.de/~weikum/
Knowledge Harvesting from Text and Web Sources
Part 2: Search and Ranking of Knowledge
The Web Speaks to Us
• Web 2012 contains more DB-style data than ever• getting better at making structured content explicit: entities, classes (types), relationships• but no (hope for) schema here!
Source: DB & IR methods for knowledge discovery.Communications ofthe ACM 52(4), 2009
2-2
Structure Now!
Bob Dylan
CarlaBruni
Nicolas Sarkozy
JoanBaez
Bob Dylan
Bob Dylan
France
ChampsElysees
Grammy
Nicolas Sarkozy
Paris
Actor AwardChristoph Waltz OscarSandra Bullock OscarSandra Bullock Golden Raspberry…
Movie ReportedRevenueAvatar $ 2,718,444,933The Reader $ 108,709,522 Facebook FriendFeedSoftware AG IDS Scheer…
Company CEOGoogle Eric SchmidtYahoo OvertureFacebook FriendFeedSoftware AG IDS Scheer…
Knowledge bases with factsfrom Web and IE witnesses
IE-enriched Web pages with embedded entities and facts
2-3
http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png
Distributed Structure: Linking Open Data30 Bio. triples500 Mio. links
2-4
owl:s
ameAs
rdf.freebase.com/ns/en.rome
owl:sameAs
o
wl:sameAs
data.nytimes.com/51688803696189142301
Coord
geonames.org/3169070/roma
N 41° 54' 10'' E 12° 29' 2''
dbpprop:citizenOf
dbpedia.org/resource/Rome
rd
f:type
rdfs:subclassOf
yago/wordnet:Actor109765278
rd
f:type
rdfs:subclassOfyago/wikicategory:ItalianComposer
yago/wordnet: Artist109812338
prop:a
ctedIn
imdb.com/name/nm0910607/
Distributed Structure: Linking Open Data
prop: composedMusicFor
imdb.com/title/tt0361748/
dbpedia.org/resource/Ennio_Morricone
2-5
Entity Search
http://entitycube.research.microsoft.com/
2-6
Semantic Search with Entities, Classes, Relationships
Politicians who are also scientists?European composers who won the Oscar?Chinese female astronauts?
Enzymes that inhibit HIV? Antidepressants that interfere with blood-pressure drugs?German philosophers influenced by William of Ockham?
US president when Barack Obama was born?Nobel laureate who outlived two world wars and all his children?
Commonalities & relationships among:Li Lianjie, Steve Jobs, Carla Bruni, Plato, Andres Iniesta?
FIFA 2010 finalists who played in a Champions League final?German football clubs that won against Real Madrid?
instances of classes
properties of entity
relationships
multiple entities
applications
2-7
Quiz Time
2-8
Rank the Folllowing Numbers
# ants on the Earth
# RDF triples in LOD
# Wikipedia links
# followers of Lady Gaga
# Google queries per day
# Yuan by Mark Zuckerberg
12*106
3*109
80*109
1015
25*109
600*106
1
2
4
6
5
3
2-8
Outline
...
Searching for Entities & Relations
Efficient Query Processing
Motivation
Wrap-up
Informative Ranking
User Interface
2-9
RDF: Structure, Diversity, No Schema
• SPO triples: Subject – Property/Predicate – Object/Value)• pay-as-you-go: schema-agnostic or schema later• RDF triples form fine-grained ER graph• popular for Linked Data, comp. biology (UniProt, KEGG, etc.)• open-source engines: Jena, Sesame, RDF-3X, etc.
EnnioMorricone RomebornIn
Rome ItalylocatedIn
SPO triples (statements, facts):(EnnioMorricone, bornIn, Rome)(Rome, locatedIn, Italy)(JavierNavarrete, birthPlace, Teruel)(Teruel, locatedIn, Spain)(EnnioMorricone, composed, l‘Arena)(JavierNavarrete, composerOf, aTale)
CityRomeinstanceOf
(uri1, hasName, EnnioMorricone)(uri1, bornIn, uri2)(uri2, hasName, Rome)(uri2, locatedIn, uri3)…
bornIn (EnnioMorricone, Rome) locatedIn(Rome, Italy)
Facts about Facts
• temporal annotations, witnesses/sources, confidence, etc. can refer to reified facts via fact identifiers (approx. equiv. to RDF quadruples: Col Sub Prop Obj)
facts: (EnnioMorricone, composed, l‘Arena)
(JavierNavarrete, composerOf, aTale)
(Berlin, capitalOf, Germany)
(Madonna, marriedTo, GuyRitchie)
(NicolasSarkozy, marriedTo, CarlaBruni)
facts:1:
2:
3:
4:
5:
temporal facts:6: (1, inYear, 1968)
7: (2, inYear, 2006)
8: (3, validFrom, 1990)
9: (4, validFrom, 22-Dec-2000)
10: (4, validUntil, Nov-2008)
11: (5, validFrom, 2-Feb-2008)provenance:12: (1, witness, http://www.last.fm/music/Ennio+Morricone/)
13: (1, confidence, 0.9)
14: (4, witness, http://en.wikipedia.org/wiki/Guy_Ritchie)
15: (4, witness, http://en.wikipedia.org/wiki/Madonna_(entertainer))
16: (10, witness, http://www.intouchweekly.com/2007/12/post_1.php)
17: (10, confidence, 0.1)
owl:s
ameAs
rdf.freebase.com/ns/en.rome
owl:sameAs
o
wl:sameAs
data.nytimes.com/51688803696189142301
Coord
geonames.org/3169070/roma
N 41° 54' 10'' E 12° 29' 2''
dbpprop:citizenOf
dbpedia.org/resource/Rome
rd
f:type
rdfs:subclassOf
yago/wordnet:Actor109765278
rd
f:type
rdfs:subclassOfyago/wikicategory:ItalianComposer
yago/wordnet: Artist109812338
prop:a
ctedIn
imdb.com/name/nm0910607/
Distributed Structure: Linking Open Data
prop: composedMusicFor
imdb.com/title/tt0361748/
dbpedia.org/resource/Ennio_Morricone
2-12
SPARQL Query LanguageSPJ combinations of triple patterns(triples with S,P,O replaced by variable(s)) Select ?p, ?c Where { ?p instanceOf Composer . ?p bornIn ?t . ?t inCountry ?c . ?c locatedIn Europe . ?p hasWon ?a .?a Name AcademyAward . }
+ filter predicates, duplicate handling, RDFS types, etc.
Select Distinct ?c Where { ?p instanceOf Composer . ?p bornIn ?t . ?t inCountry ?c . ?c locatedIn Europe . ?p hasWon ?a .?a Name ?n . ?p bornOn ?b . Filter (?b > 1945) . Filter(regex(?n, “Academy“) . }
Semantics:return all bindings to variables that match all triple patterns(subgraphs in RDF graph that are isomorphic to query graph)
Querying the Structured WebStructure but no schema: SPARQL well suited
wildcards for properties (relaxed joins): Select ?p, ?c Where { ?p instanceOf Composer . ?p ?r1 ?t . ?t ?r2 ?c . ?c isa Country . ?c locatedIn Europe . }
Extension: transitive paths [K. Anyanwu et al.: WWW‘07] Select ?p, ?c Where { ?p instanceOf Composer . ?p ??r ?c . ?c isa Country . ?c locatedIn Europe . PathFilter(cost(??r) < 5) . PathFilter (containsAny(??r,?t ) . ?t isa City . }
Extension: regular expressions [G. Kasneci et al.: ICDE‘08] Select ?p, ?c Where { ?p instanceOf Composer . ?p (bornIn | livesIn | citizenOf) locatedIn* Europe . }
flexiblesubgraphmatching
Querying Facts & Text
• Consider witnesses/sources (provenance meta-facts)• Allow text predicates with each triple pattern (à la XQ-FT)
Problem: not everything is triplified
European composers who have won the Oscar,whose music appeared in dramatic western scenes,and who also wrote classical pieces ?
Select ?p Where { ?p instanceOf Composer . ?p bornIn ?t . ?t inCountry ?c . ?c locatedIn Europe . ?p hasWon ?a .?a Name AcademyAward . ?p contributedTo ?movie [western, gunfight, duel, sunset] . ?p composed ?music [classical, orchestra, cantata, opera] . }
Semantics: triples match struct. pred.witnesses match text pred.
Querying Facts & Text
• Consider witnesses/sources (provenance meta-facts)• Allow text predicates with each triple pattern (à la XQ-FT)
Problem: not everything is triplified
French politicians married to Italian singers?
Select ?p1, ?p2 Where { ?p1 instanceOf politician [France] . ?p2 instanceOf singer [Italy] . ?p1 marriedTo ?p2 . }
Grouping ofkeywords or phrasesboosts expressiveness
Select ?p1, ?p2 Where { ?p1 instanceOf ?c1 [France, politics] . ?p2 instanceOf ?c2 [Italy, singer] . ?p1 marriedTo ?p2 . }
CS researchers whose advisors worked on the Manhattan project?Select ?r, ?a Where {?r instOf researcher [computer science] . ?a workedOn ?x [Manhattan project] .?r hasAdvisor ?a . }
Select ?r, ?a Where {?r ?p1 ?o1 [computer science] . ?a ?p2 ?o2 [Manhattan project] .?r ?p3 ?a . }
Relatedness Queries
Schema-agnostic keyword search(on RDF, ER graph, relational DB)becomes a special case
Relationship between Jet Li, Steve Jobs, Carla Bruni?
Select ??p1, ??p2, ??p3 Where {LiLianjie ??p1 SteveJobs. SteveJobs ??p2 CarlaBruni .CarlaBruni ??p3 LiLianjie . }
Select ??p1, ??p2, ??p3 Where {?e1 ?r1 ?c1 [“Li Lianjie“] . ?e2 ?r2 ?c2 [“Steve Jobs“] . ?e3 ?r3 ?c3 [“Carla Bruni“] . ?e1 ??p1 ?e2 .?e2 ??p2 ?e3 .?e3 ??p3 ?e1 . }
Querying Temporal Facts
• Consider temporal scopes of reified facts• Extend Sparql with temporal predicates
Managers of German clubs who won the Champions League?Select ?m Where { isa soccerClub . ?c inCountry Germany .
?id1: ?c hasWon ChampionsLeague . ?id1 validOn ?t . ?id2: ?m manages ?c . ?id2 validSince ?s . ?id2 validUntil ?u .
[?s,?u] overlaps [?t,?t] . }
Problem: not all facts hold forever (e.g. CEOs, spouses, …)
When did a German soccer club win the Champions League?Select ?c, ?t Where { ?c isa soccerClub . ?c inCountry Germany .?id1: ?c hasWon ChampionsLeague . ?id1 validOn ?t . }
1: (BayernMunich, hasWon, ChampionsLeague)2: (BorussiaDortmund, hasWon, ChampionsLeague)3: (1, validOn, 23May2001) 4: (1, validOn, 15May1974) 5: (2, validOn, 28May1997)6: (OttmarHitzfeld, manages, BayernMunich)7: (6, validSince, 1Jul1998) 8:(6, validUntil, 30Jun2004)
[Y. Wang et al.: EDBT’10][O. Udrea et al.: TOCL‘10
Querying with Vague Temporal Scope
• Consider temporal phrases as text conditions• Allow approximate matching and rank results wisely
Problem: user‘s temporal interest is imprecise1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
Problem: user‘s temporal interest is often imprecise
German Champion League winners in the nineties?Select ?c Where {
?c isa soccerClub . ?c inCountry Germany .?c hasWon ChampionsLeague [nineties] . }
Soccer final winners in summer 2001?Select ?c Where {
?c isa soccerClub . ?id: ?c matchAgainst ?o [final] .?id winner ?c [“summer 2001“] . }
Keyword Search on Graphs
Example: Conferences (CId, Title, Location, Year) Journals (JId, Title)CPublications (PId, Title, CId) JPublications (PId, Title, Vol, No, Year) Authors (PId, Person) Editors (CId, Person)Select * From * Where * Contains ”Gray, DeWitt, XML, Performance“ And Year > 95
Schema-agnostic keyword search over multiple tables:graph of tuples with foreign-key relationships as edges
[BANKS, Discover, DBExplorer, KUPS, SphereSearch, BLINKS, NAGA, …]
Result is connected tree with nodes that contain as many query keywords as possible
Ranking: 1)(1)1(),(),(
eedgesnnodes
eedgeScoreqnnodeScoreqtrees
with nodeScore based on tf*idf or prob. IRand edgeScore reflecting importance of relationships (or confidence, authority, etc.)
Top-k querying: compute best trees, e.g. Steiner trees (NP-hard)
2-20
Best Result: Group Steiner Tree
Result is connected tree with nodes that contain as many query keywords as possible
Group Steiner tree: • match individual keywords terminal nodes, grouped by keyword• compute tree that connects at least one terminal node per keyword and has best total edge weight
y
x
x
y
zw
w w
x
y
zw
for query: x w y z2-21
Outline
...
Searching for Entities & Relations
Efficient Query Processing
Motivation
Wrap-up
Informative Ranking
User Interface
2-22
Ranking CriteriaConfidence:Prefer results that are likely correct
accuracy of info extraction trust in sources
(authenticity, authority)
Informativeness:Prefer results with salient factsStatistical estimation from:
frequency in answer frequency on Web frequency in query log
Conciseness:Prefer results that are tightly connected
size of answer graph cost of Steiner tree
bornIn (Jim Gray, San Francisco) from„Jim Gray was born in San Francisco“(en.wikipedia.org)
livesIn (Michael Jackson, Tibet) from„Fans believe Jacko hides in Tibet“(www.michaeljacksonsightings.com)
q: Einstein isa ?Einstein isa scientistEinstein isa vegetarian
q: ?x isa vegetarianEinstein isa vegetarianWhocares isa vegetarian
Diversity:Prefer variety of facts
Einstein won NobelPrizeBohr won NobelPrize
Einstein isa vegetarianCruise isa vegetarianCruise born 1962 Bohr died 1962
E won … E discovered … E played … E won … E won … E won … E won …
Ranking ApproachesConfidence:Prefer results that are likely correct
accuracy of info extraction trust in sources
(authenticity, authority)
Informativeness:Prefer results with salient factsStatistical LM with estimations from:
frequency in answer frequency in corpus (e.g. Web) frequency in query log
Conciseness:Prefer results that are tightly connected
size of answer graph cost of Steiner tree
PR/HITS-style entity/fact ranking[V. Hristidis et al., S.Chakrabarti, …]
IR models: tf*idf … [K.Chang et al., …]Statistical Language Models
Diversity:Prefer variety of facts
empirical accuracy of IEPR/HITS-style estimate of trustcombine into: max { accuracy (f,s) * trust(s) | s witnesses(f) }
Statistical Language Models
graph algorithms (BANKS, STAR, …) [J.X. Yu et al., S.Chakrabarti et al.,B. Kimelfeld et al., G.Kasneci et al., …]
or
Example: RDF-Express[S. Elbassuoni: SIGIR‘12, ESWC‘11, CIKM‘09]
?a1 isMarriedTo ?a2 .?a1 actedIn ?m .?a2 actedIn ?m .
http://www.mpi-inf.mpg.de/yago-naga/rdf-express/
Example: RDF-Express
?a1 isMarriedTo ?a2 {Bollywood} .?a1 actedIn ?m .?a2 actedIn ?m .
Example: RDF-Express
?a1 isMarriedTo ?a2 .?a1 actedIn ?m {thriller} .?a2 actedIn ?m {love} .
Example: RDF-Express[S. Elbassuoni: SIGIR‘12, ESWC‘11, CIKM‘09]
?a1 directed ?m .?a2 actedIn ?m .?a1 hasWonPrize Academy_Award .?a2 type wordnet_musician .
Information Retrieval Basics[Textbooks by R. Baza-Yates, B. Croft, Manning/Raghavan/Schuetze, …]
......
.....
......
.....
crawlextract& clean
index search rank present
strategies forcrawl schedule andpriority queue for crawl frontier
handle dynamic pages,detect duplicates,detect spam
build and analyzeWeb graph,index all tokensor word stems
server farm with 10 000‘s of computers,distributed/replicated data in high-performance file system,massive parallelism for query processing
fast top-k queries,query logging,auto-completion
scoring functionover many dataand context criteria
GUI, user guidance,personalization
Ranking bydescendingrelevance
Vector Space Model for Content Relevance Ranking
Search engine
Query (set of weightedfeatures)
||]1,0[ Fid Documents are feature vectors
(bags of words)
||]1,0[ Fq
||
1
2||
1
2
||
1:),(F
jj
F
jij
F
jjij
i
qd
qd
qdsim
Similarity metric:
2-30
Vector Space Model for Content Relevance Ranking
Search engine
Query (Set of weightedfeatures)
||]1,0[ Fid Documents are feature vectors
(bags of words)
||]1,0[ Fq
||
1
2||
1
2
||
1:),(F
jj
F
jij
F
jjij
i
qd
qd
qdsim
Similarity metric:Ranking bydescendingrelevance
k ikijij wwd 2/:
iikk
ijij fwithdocs
docs
dffreq
dffreqw
#
#log
),(max
),(1log:
tf*idfformula:term frequency *document frequency
2-31
Statistical Language Models (LM‘s)
qLM(1)
d1
d2
LM(2)
?
?
• each doc di has LM: generative prob. distr. with params i
• query q viewed as sample from LM(1), LM(2), …
• estimate likelihood P[ q | LM(i) ]
that q is sample of LM of doc di (q is „generated by“ di)• rank by descending likelihoods (best „explanation“ of q)
[Maron/Kuhns 1960, Ponte/Croft 1998, Hiemstra 1998, Lafferty/Zhai 2001]
„God does not play dice“ (Albert Einstein)
„IR does“ (anonymous)
„God rolls dice in places where you can‘t see them“ (Stephen Hawking)
4-33
LM: Doc as Model, Query as Sample
A A
C
A
D
E E E E
C C
B
A
E
B
model M
document d: sample of Mused for parameter estimation
P [ | M]A A B C E E
estimate likelihoodof observing query
query
)(
||21
)()(...)()(
||]|[
qfjqj
q
jdpjfjfjf
qdqP
multinomial prob. distribution
LM: Need for Smoothing
A A
C
A
D
E E E E
C C
B
A
E
B
model M
document d
P [ | M]A B C E F
estimate likelihoodof observing query
query
+ background corpus and/or smoothing
used for parameter estimation
C
AD
A
BE
F
+
Laplace smoothingJelinek-MercerDirichlet smoothing…
2-34
Some LM Basics
i i dqPdqPqds ]|[]|[),(simple MLE: overfitting
][)1(]|[),( qPdqPqds
ikk
kdf
idf
dktf
ditf
)(
)()1(
),(
),(log~
i
k
kidf
kdf
dktf
ditf
)(
)(
1),(
),(1log~
mixture modelfor smoothing
i diP
qiPqiPdqKL
]|[
]|[log]|[)|(~
KL divergence(Kullback-Leibler div.)aka. relative entropy
tf*idffamily
ik
dktf
ditf
),(
),(log~
independ. assumpt.
efficientimplementation
• Precompute per-keyword scores • Store in postings of inverted index• Score aggregration for (top-k) multi-keyword query
P[q] est. fromlog or corpus
IR as LM Estimation
P[R|d,q]user likes doc (R)given that it has features dand user poses query q
],|[
],|[~
qRdP
qRdP
prob. IR
][]|,[~ RPRdqP
][]|[],|[ RPRdPRdqP
]|[~ dqP statist. LM
qj d ]|j[Plog]d|q[Plog)d,q(s
query likelihood:
]d|q[Plogargmax-k d
top-k query result:
MLE would be tf2-36
Multi-Bernoulli vs. Multinomial LM
Multi-Bernoulli:)q(X1
j)q(X
jjj ))d(p1()d(p]d|q[P
with Xj(q)=1 if jq, 0 otherwise
Multinomial:
)(
||21
)()(...)()(
||]|[ qf
jqjq
jdpjfjfjf
qdqP
with fj(q) = f(j) = frequency of j in q
multinomial LM more expressive and usually preferred2-37
LM Scoring by Kullback-Leibler Divergence
)(
||2122 )(
)(...)()(
||log]|[log qf
jqjq
jdpjfjfjf
qdqP
)(log)(~ 2 dpqf jjqj
))(),(( dpqfH neg. cross-entropy
))(())(),((~ qfHdpqfH ))(||)(( dpqfD
)(
)(log)( 2 dp
qfqf
j
j
j j neg. KL divergenceof q and d
2-38
Jelinek-Mercer SmoothingIdea:use linear combination of doc LM withbackground LM (corpus, common language);
could also consider query log as background LMfor query||
),()1(
||
),()(ˆ
C
Cjfreq
d
djfreqdp j
parameter tuning of by cross-validation with held-out data:• divide set of relevant (d,q) pairs into n partitions• build LM on the pairs from n-1 partitions• choose to maximize precision (or recall or F1) on nth
partition• iterate with different choice of nth partition and average
2-39
Jelinek-Mercer Smoothing:Relationship to tf*idf
]q[P)1(]d|q[P]|q[P
qikk
)k(df
)i(df)1(
)d,k(tf
)d,i(tflog~
qi
k
k)i(df
)k(df
1)d,k(tf
)d,i(tf1log~
with absolutefrequencies tf, df
relative tf ~ relative idf
2-40
Dirichlet-Prior Smoothing
|d|
]C|j[P
|d|
]d|j[P|d|
mn
1f
i
ii)(Mmaxargˆ)d(p̂ jj
with i set to P[i|C]+1 for the Dirichlet hypergeneratorand > 1 set to multiple of average document length
Dirichlet (): 1jm..1j
jm..1j
jm..1jm1
j
)(
)(),...,(f
with
m..1j j 1
(Dirichlet is conjugate prior for parameters of multinomial distribution: Dirichlet prior implies Dirichlet posterior, only with different parameters)
d]][|f[P
][P]|f[P]f|[P:)(M
Posterior distr. with Dirichlet distribution as prior
)f(Dirichlet with term frequencies fin document d
MAP (Maximum Posterior) for
2-41
Dirichlet-Prior Smoothing:Relationship to Jelinek-Mercer Smoothing
|d|
]C|j[P
|d|
]d|j[P|d| ]|[)1(]|[)(ˆ CjPdjPdp j
with
|d|
|d|
where 1= P[1|C], ..., m= P[m|C] are the parameters
of the underlying Dirichlet distribution, with constant > 1typically set to multiple of average document length
with MLEsP[j|d], P[j|C]
tf
j
fromcorpus
2-42
Entity Search with LM Ranking [Z. Nie et al.: WWW’07, H. Fang et al.: ECIR‘07, P. Serdyukov et al.: ECIR‘08, …]
LM (entity e) = prob. distr. of words seen in context of e
][)1(]|[),( qPeqPqes ]q[P
]e|q[P~
i
ii
query q: „French player who won world championship“
candidate entities:
e1: David Beckham
e2: Ruud van Nistelroy
e3: Ronaldinho
e4: Zinedine Zidane
e5: FC Barcelona
played for ManU, Real, LA GalaxyDavid Beckham champions leagueEngland lost match against Francemarried to spice girl …
weightedby conf.
Zizou champions league 2002Real Madrid won final ...Zinedine Zidane best playerFrance world cup 1998 ...
))e(|)q((KL~ LMLM
query: keywords answer: entities
LM‘s: from Entities to Facts
Document / Entity LM‘s
Triple LM‘s
LM for doc/entity: prob. distr. of words
LM for query: (prob. distr. of) words
LM‘s: rich for docs/entities, super-sparse for queries
richer query LM with query expansion, etc.
LM for facts: (degen. prob. distr. of) triple
LM for queries: (degen. prob. distr. of) triple pattern
LM‘s: apples and oranges• expand query variables by S,P,O values from DB/KB• enhance with witness statistics• query LM then is prob. distr. of triples !
LM‘s for Triples and Triple Patterns
f1: Beckham p ManchesterUf2: Beckham p RealMadridf3: Beckham p LAGalaxyf4: Beckham p ACMilanF5: Kaka p ACMilanF6: Kaka p RealMadridf7: Zidane p ASCannesf8: Zidane p Juventusf9: Zidane p RealMadridf10: Tidjani p ASCannesf11: Messi p FCBarcelonaf12: Henry p Arsenalf13: Henry p FCBarcelonaf14: Ribery p BayernMunichf15: Drogba p Chelseaf16: Casillas p RealMadrid
triples (facts f):triple patterns (queries q):q: Beckham p ?y
200 300 20 30300150 20200350 10400200150100150 20
: 2600
q: Beckham p ManUq: Beckham p Realq: Beckham p Galaxyq: Beckham p Milan
200/550300/550 20/550 30/550
witness statistics
q: Cruyff ?r FCBarcelonaCruyff playedFor FCBarca 200/500 Cruyff playedAgainst FCBarca 50/500Cruyff coached FCBarca 250/500
q: ?x p ASCannesZidane p ASCannes 20/30Tidjani p ASCannes 10/30
LM(q) + smoothing
q: ?x p ?yMessi p FCBarcelona 400/2600Zidane p RealMadrid 350/2600Kaka p ACMilan 300/2600…
LM(q): {t P [t | t matches q] ~ #witnesses(t)}LM(answer f): {t P [t | t matches f] ~ 1 for f}smooth all LM‘srank results by ascending KL(LM(q)|LM(f))
[G. Kasneci et al.: ICDE’08; S. Elbassuoni et al.: CIKM’09, ESWC‘11]
LM‘s for Composite Queriesq: Select ?x,?c Where { France ml ?x . ?x p ?c . ?c in UK . }
f1: Beckham p ManU 200f7: Zidane p ASCannes 20f8: Zidane p Juventus 200f9: Zidane p RealMadrid 300f10: Tidjani p ASCannes 10f12: Henry p Arsenal 200f13: Henry p FCBarca 150f14: Ribery p Bayern 100f15: Drogba p Chelsea 150
f31: ManU in UK 200f32: Arsenal in UK 160f33: Chelsea in UK 140
f21: F ml Zidane 200f22: F ml Tidjani 20f23: F ml Henry 200f24: F ml Ribery 200f25: F ml Drogba 30f26: IC ml Drogba 100f27 ALG ml Zidane 50
queries q with subqueries q1 … qn
results are n-tuples of triples t1 … tn
LM(q): P[q1…qn] = i P[qi]
LM(answer): P[t1…tn] = i P[ti]
KL(LM(q)|LM(answer)) = i KL(LM(qi)|LM(ti))
P [ F ml Henry, Henry p Arsenal, Arsenal in UK ]
500
160
2600
200
650
200~
P [ F ml Drogba, Drogba p Chelsea, Chelsea in UK ]
500
140
2600
150
650
30~
LM‘s for Keyword-Augmented Queries
q: Select ?x, ?c Where { France ml ?x [goalgetter, “top scorer“] . ?x p ?c . ?c in UK [champion, “cup winner“, double] . }
subqueries qi with keywords w1 … wm
results are still n-tuples of triples ti
LM(qi): P[triple ti | w1 … wm] = k P[ti | wk] + (1) P[ti]
LM(answer fi) analogous
KL(LM(q)|LM(answer fi)) = i KL (LM(qi) | LM(fi))
result ranking prefers (n-tuples of) tripleswhose witnesses score high on the subquery keywords
LM‘s for Temporal Phrases
q: Select ?c Where { ?c instOf nationalTeam . ?c hasWon WorldCup [nineties] . }
Problem: user‘s temporal interest is imprecise1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
July 1994mid 90s
extract temp expr‘s xj
from witnesses& normalize
[K.Berberich et al.: ECIR‘10]
“mid 90s“ xj = [1Jan93,31Dec97] normalize
temp expr vin query
“nineties“ v = [1Jan90,31Dec99]
P[q | t] ~ … j P[v | xj]
~ … j { P[ [b,e] | xj] | interval [b,e]v}
~ … j overlap(v,xj) / union(v,xj)
• enhanced ranking• efficiently computable• plug into doc/entity/triples LM‘s
lastcentury
summer 1990
1998
P[v=„nineties“ | xj = „mid 90s“] = 1/2P[„nineties“ | „summer 1990“] = 1/30P[„nineties“ | „last century“] = 1/10
Query Relaxationq: … Where { France ml ?x . ?x p ?c . ?c in UK . }
f1: Beckham p ManU 200f7: Zidane p ASCannes 20f9: Zidane p Real 300f10: Tidjani p ASCannes 10f12: Henry p Arsenal 200f15: Drogba p Chelsea 150
f31: ManU in UK 200f32: Arsenal in UK 160f33: Chelsea in UK 140
f21: F ml Zidane 200f22: F ml Tidjani 20F23: F ml Henry 200F24: F ml Ribery 200F26: IC ml Drogba 100F27 ALG ml Zidane 50
[ F ml Zidane, Zidane p Real, Real in ESP ]
q(1): … Where { France ml ?x . ?x p ?c . ?c in ?y . }
[ IC ml Drogba, Drogba p Chelsea, Chelsea in UK] [ F resOf Drogba,
Drogba p Chelsea, Chelsea in UK] [ IC ml Drogba,
Drogba p Chelsea, Chelsea in UK]
q(2): … Where { ?x ml ?x . ?x p ?c . ?c in UK . }q(3): … Where { France ?r ?x . ?x p ?c . ?c in UK . }q(4): … Where { IC ml ?x . ?x p ?c . ?c in UK . }
LM(q*) = LM(q) + 1 LM(q(1)) + 2 LM(q(2)) + …
replace e in q by e(i) in q(i):precompute P:=LM (e ?p ?o) and Q:=LM (e(i) ?p ?o)set i ~ 1/2 (KL (P|Q) + KL (Q|P))
replace r in q by r(i) in q(i) LM (?s r(i) ?o)replace e in q by ?x in q(i) LM (?x r ?o)… LM‘s of e, r, ...
are prob. distr.‘s of triples !
Result Personalization
Open issue: „insightful“ results (new to the user)
q
q
q1
q2
q3f3
f4
f1f2f5
q3
q4 q5f3
q6
f6
same answer for everyone?
Personal histories ofqueries & clicked facts LM(user u): prob. distr. of triples !
[S. Elbassuoni et al.: PersDB‘08]
u1 [classical music] q: ?p from Europe . ?p hasWon AcademyAward
u2 [romantic comedy] q: ?p from Europe . ?p hasWon AcademyAward
u3 [from Africa] q: ?p isa SoccerPlayer . ?p hasWon ?a
LM(q|u] = LM(q) + (1) LM(u)then business as usual
Result Diversification
q: Select ?p, ?c Where { ?p isa SoccerPlayer . ?p playedFor ?c . }
1 Beckham, ManchesterU2 Beckham, RealMadrid3 Beckham, LAGalaxy4 Beckham, ACMilan5 Zidane, RealMadrid6 Kaka, RealMadrid7 Cristiano Ronaldo, RealMadrid8 Raul, RealMadrid9 van Nistelrooy, RealMadrid10 Casillas, RealMadrid
1 Beckham, ManchesterU2 Beckham, RealMadrid3 Zidane, RealMadrid4 Kaka, ACMilan5 Cristiano Ronaldo, ManchesterU6 Messi, FCBarcelona7 Henry, Arsenal8 Ribery, BayernMunich9 Drogba, Chelsea10 Luis Figo, Sporting Lissabon
rank results f1 ... fk by ascending
KL(LM(q) | LM(fi)) (1) KL( LM(fi) | LM({f1..fk}\{fi}))implemented by greedy re-ranking of fi‘s in candidate pool
[J. Carbonell, J. Goldstein: SIGIR‘98]
Entity-Search Ranking by Link Analysis[A. Balmin et al. 2004, Nie et al. 2005, Chakrabarti 2007, J. Stoyanovich 2007]
EntityAuthority (ObjectRank, PopRank, HubRank, EVA, etc.):
• define authority transfer graph
among entities and pages with edges:• entity page if entity appears in page• page entity if entity is extracted from page• page1 page2 if there is hyperlink or implicit link between pages• entity1 entity2 if there is a semantic relation between entities
• edges can be typed and (degree- or weight-) normalized and
are weighted by confidence and type-importance• also applicable to graph of DB records with foreign-key relations
(e.g. bibliography with different weights of publisher vs. location for conference record)
• compared to standard Web graph, ER graphs of this kind
have higher variation of edge weights
2-52
Outline
...
Searching for Entities & Relations
Efficient Query Processing
Motivation
Wrap-up
Informative Ranking
User Interface
2-53
54/34
Scalable Semantic Web: Pattern Queries on Large RDF Graphs
schema-free RDF triples: subject-property-object (SPO) example: Einstein hasWon NobelPrizeSPARQL triple patterns: Select ?p,?c Where { ?p isa scientist . ?p hasWon NobelPrize . ?p bornIn ?t . ?t inCountry ?c . ?c partOf Europe}large join queries, unpredictable workload,difficult physical design, difficult query optimization
Einstein hasWon NobelEinstein bornIn UlmRonaldo hasWon FIFASpain partOf EuropeFrance partOf Europe… … .,.
S P O S O Einstein NobelRonaldo FIFA… …
hasWon
S hasWon bornIn .,.
Person
Einstein Nobel Ulm .,.Ronaldo FIFA Rio .,..,. .,. .,. .,.
Semantic-Web engines (Sesame, Jena, etc.)did not provide scalable query performance
CountryS partOf capital .,..,. .,. .,. .,.
S O … …
bornIn
AllTriples
?p ?b ?t
Scalable Semantic Web: RDF-3X Engine[T. Neumann et al.: VLDB 2008, SIGMOD 2009, VLDBJ’10]
• RISC-style, tuning-free system architecture
• map literals into ids (dictionary) and precompute
exhaustive indexing for SPO triples:
SPO, SOP, PSO, POS, OSP, OPS,
SP*, PS*, SO*, OS*, PO*, OP*, S*, P*, O*
very high compression
• efficient merge joins with order-preservation
• join-order optimization
by dynamic programming over subplan result-order
• statistical synopses for accurate result-size estimation
http://code.google.com/p/rdf3x/
http://www.mpi-inf.mpg.de/~neumann/rdf3x/
56/34
Scalable Indexing & Merge Joins in RDF-3X
scan OPS[O=scientist
P=isa]
scan OPS[O=NobelPrize,
P=hasWon]
scan PSO[P=bornIn]
|| [S=S]
...
…3007 2015 10113007 2015 10553007 2015 11193007 2015 11273007 2015 11353007 2015 13393007 2015 13483007 2015 1418
…
O P S…
3111 2003 10193111 2003 10553111 2003 11113111 2003 11273111 2003 11353111 2003 11993111 2003 12273111 2003 1243
…
O P S…
2100 1003 56482100 1119 55552100 1127 57122100 1255 55152100 1266 57122100 1418 55102100 1429 55552100 1683 5843
…
P S O
scan PSO[P=inCountry]
…
…
P S O
2077 1005 8311
2077 1431 8200
2077 1333 8751
2077 1148 8244
2077 1099 80182077 1127 8123
2077 1266 8510
2077 1429 8135
?p isa scientist . ?p hasWon NobelPrize . ?p bornIn ?t . ?t inCountry ?c . ?c partOf Europe
|| [S=S]
|| [O=S]
poorjoin order
57/34
Scalable Indexing & Merge Joins in RDF-3X
scan OPS[O=scientist
P=isa]
scan OPS[O=NobelPrize,
P=hasWon]
scan PSO[P=bornIn]
|| [S=S]
|| [S=S]
...
…3007 2015 10113007 2015 10553007 2015 11193007 2015 11273007 2015 11353007 2015 13393007 2015 13483007 2015 1418
…
O P S…
3111 2003 10193111 2003 10553111 2003 11113111 2003 11273111 2003 11353111 2003 11993111 2003 12273111 2003 1243
…
O P S…
2100 1003 56482100 1119 55552100 1127 57122100 1255 55152100 1266 57122100 1418 55102100 1429 55552100 1683 5843
…
P S O
scan PSO[P=inCountry]
…
…
P S O
|| [O=S]
2077 1005 8311
2077 1431 8200
2077 1333 8751
2077 1148 8244
2077 1099 80182077 1127 8123
2077 1266 8510
2077 1429 8135
?p isa scientist . ?p hasWon NobelPrize . ?p bornIn ?t . ?t inCountry ?c . ?c partOf Europe good
join order
58/34
Scalable Indexing & Merge Joins in RDF-3X
scan OPS[O=scientist
P=isa]
scan OPS[O=NobelPrize,
P=hasWon]
scan PSO[P=bornIn]
|| [S=S]
|| [S=S]
...
…3007 2015 10113007 2015 10553007 2015 11193007 2015 11273007 2015 11353007 2015 13393007 2015 13483007 2015 1418
…
O P S…
3111 2003 10193111 2003 10553111 2003 11113111 2003 11273111 2003 11353111 2003 11993111 2003 12273111 2003 1243
…
O P S…
2100 1003 56482100 1119 55552100 1127 57122100 1255 55152100 1266 57122100 1418 55102100 1429 55552100 1683 5843
…
P S O
scan PSO[P=inCountry]
…
…
P S O
|| [O=S]
2077 1005 8311
2077 1431 8200
2077 1333 8751
2077 1148 8244
2077 1099 80182077 1127 8123
2077 1266 8510
2077 1429 8135
?p isa scientist . ?p hasWon NobelPrize . ?p bornIn ?t . ?t inCountry ?c . ?c partOf Europe
sidewaysinformationpassing
run-time filters
Join-Order Optimization for SPARQL
mehasFriend
gender
livesIn
Berlin
?f ?s ?p ?a
?x ?y
singer
Berlin
performedIn
inC
ity
2009
inYe
ar
likes
BobDylan
composedBy
prot
est antiwar
female
Fine-grained RDF (e.g. over social-tagging data)often entails complex queries with 10, 20 or more joins
Join-order optimization operates on join graph:nodes are triple patterns, edges denote shared variables• need selectivity (result cardinality) estimator • join-order optimization is O(n3) for chains, O(2n) for stars• exact optimization uses dynamic programming over sub-graphs and the sort order of intermediate results
[T. Neumann: VLDBJ‘10]
?f gender female .?f hasFriend me .?f livesIn Berlin .?f likes ?s .?s taggedProtestBy ?x .?s taggedAntiwarBy ?y .?s composedBy BobDylan .?s performedIn ?p .?p inCity Berlin .?s inYear 2009 .?p singer ?a .
60/34
Experimental Evaluation: Setup
Setup and competitors:2GHz dual core, 2 GB RAM, 30MB/s disk, Linux• column-store property tables by MIT folks, using MonetDB• triples store with SPO, POS, PSO indexes, using PostgreSQL
Datasets:1) Barton library catalog: 51 Mio. triples (4.1 GB)2) YAGO knowledge base: 40 Mio. triples (3.1 GB)3) Librarything social-tagging excerpt: 30 Mio. triples (1.8 GB)
Benchmark queries such as:1) counts of French library items (books, music, etc.), with creator, publisher, language, etc.2) scientist from Poland with French advisor who both won awards3) books tagged with romance, love, mystery, suspense by users who like crime novels and have friends who ...
Select ?t Where {?b hasTitle ?t . ?u romance ?b .?u love ?b .?u mystery ?b .?u suspense ?b .?u crimeNovel ?c .?u hasFriend ?f .?f ... }
[T. Neumann: SIGMOD‘09]
61/34
Experimental Evaluation: Results
YAGO dataset(40 Mio. triples, 3.1 GB):2.7 GB in RDF-3X, loaded in 25 min
Librarything dataset(36 Mio. triples, 1.8 GB):1.6 GB in RDF-3X, loaded in 20 min
measurements on Uniprot (845 Mio. triples) and Billion-Triples (562 Mio. triples)confirm these experimental results
exe
cuti
on
tim
e [s
]
exe
cuti
on
tim
e [s
]
[T. Neumann: SIGMOD‘09]
Real
Madrid
Özil
Distributed RDF Stores & Sparql at Scale[D. Abadi et al.: VLDB‘11]
• Partition data graph by hashing triples on subject• Enhance locality by replicating N-hop neighbors with each triple• Can automatically determine when query is parallelizable without communication• For remaining cases: distributed joins, semi-joins, etc.
Iniesta
FC Barca
Xavi
Barcelona
?player
?pos
?pop
?city?team
> 1 000 000
locatedIn
bornIn
playsPos
playsForhasPop
midfielderMadrid
Rosario
Real
Gelsenkirchen
Casillas
Özil
Messi
Casillas
Iniesta
Messi
Xavi
Özil
FC Barca
Real
Barcelona
Madrid
Rosario
Gelsenkirchen
Top-k Query Processing for Ranked Results
Index lists
s(t1,d1) = 0.7…s(tm,d1) = 0.2
s(t1,d1) = 0.7…s(tm,d1) = 0.2
…
Data items: d1, …, dn
Query: q = (t1, t2, t3)aggr: summation
…
…
t1d780.9
d10.7
d880.2
d100.2
d780.1
d990.2
d340.1
d230.8
d100.8
d1d1
t2d640.8
d230.6
d100.6
t3d100.7
d780.5
d640.4
fetch listsjoin&sort
simple & DB-style;needs only O(k) memory;for monotone score aggr.
2-63
Index lists
s(t1,d1) = 0.7…s(tm,d1) = 0.2
s(t1,d1) = 0.7…s(tm,d1) = 0.2
…
Data items: d1, …, dn
Query: q = (t1, t2, t3)aggr: summation
…
…
t1d780.9
d880.2d780.1d340.1
d230.8
d100.8
d1d1
t2d640.9
d230.6
d100.6
t3d100.7
d780.5
d640.3
Threshold algorithm (TA):scan index lists; consider d at posi in Li;highi := s(ti,d);if d top-k then { look up s(d) in all lists L with i; score(d) := aggr {s(d) | =1..m};if score(d) > min-k then add d to top-k and remove min-score d’; min-k := min{score(d’) | d’ top-k};threshold := aggr {high | =1..m};if threshold min-k then exit;
Scan depth 1Scan
depth 1Scan
depth 2Scan
depth 2Scan
depth 3Scan
depth 3
k = 2
simple & DB-style;needs only O(k) memory;for monotone score aggr.
Scan depth 4Scan
depth 4
d10.7
d990.2
d120.2 2 d64 0.9
Rank Doc Score
2 d64 1.2
1 d78 0.91 d78 1.51 d78 1.5
2 d64 0.9
Rank Doc Score
2 d78 1.5
1 d10 2.1Rank Doc Score
1 d10 2.1
2 d78 1.5
Rank Doc Score
1 d10 2.1
2 d78 1.5
Rank Doc Score
1 d10 2.1
2 d78 1.5
STOP!STOP!
Threshold Algorithm (TA) [Fagin 01, Güntzer 00, Nepal 99, Buckley85]
2-64
TA with Sorted Access Only (NRA) [Fagin 01, Güntzer et al. 01]
Index lists
s(t1,d1) = 0.7…s(tm,d1) = 0.2
s(t1,d1) = 0.7…s(tm,d1) = 0.2
…
Data items: d1, …, dn
Query: q = (t1, t2, t3)aggr: summation
Rank Doc Worst-score
Best-score
1 d78 0.9 2.4
2 d64 0.8 2.4
3 d10 0.7 2.4
Rank Doc Worst-score
Best-score
1 d78 1.4 2.0
2 d23 1.4 1.9
3 d64 0.8 2.1
4 d10 0.7 2.1
Rank Doc Worst-score
Best-score
1 d10 2.1 2.1
2 d78 1.4 2.0
3 d23 1.4 1.8
4 d64 1.2 2.0
…
…
t1d780.9
d10.7
d880.2
d120.2
d780.1
d990.2
d340.1
d230.8
d100.8
d1d1
t2d640.8
d230.6
d100.6
t3d100.7
d780.5
d640.4
STOP!STOP!
No-random-access algorithm (NRA):scan index lists; consider d at posi in Li;E(d) := E(d) {i}; highi := s(ti,d);worstscore(d) := aggr{s(t,d) | E(d)};bestscore(d) := aggr{worstscore(d), aggr{high | E(d)}};if worstscore(d) > min-k then add d to top-k min-k := min{worstscore(d’) | d’ top-k};else if bestscore(d) > min-k then cand := cand {d};threshold := max {bestscore(d’) | d’ cand};if threshold min-k then exit;
Scan depth 1Scan
depth 1Scan
depth 2Scan
depth 2Scan
depth 3Scan
depth 3
k = 1
sequential access (SA) faster than random access (RA)by factor of 20-1000
2-65
Query Expansion (Synonyms, Related Terms, …)
magician
wizard
intellectual
artist
alchemist
directorprimadonna
lecturer
professor
teacher
educator
scholar
academic, academician, faculty member
scientist
researcher
HYPONYM (0.749)HYPONYM (0.749)
Thesaurus/Ontology:concepts, relationships, glosses from WordNet, Wikipedia, KB, Web, etc.
relationships quantified bystatistical correlation measurese.g. Jaccard coefficients: |XY| / |XY|
Query expansion
Weighted expanded queryExample:(professor lecturer (0.749) scholar (0.71) ...)and Germanyand ( (course class (1.0) seminar (0.84) ... ) and („IR“ „Web search“ (0.653) ... ) )
Term2Concept with WSD
exp(ti)={w | sim(ti,w) }
Efficient top-k searchwith dynamic expansion
better recall, better meanprecision for hard queries
investigator
mentor
Problems:• efficiency• tuning the threshold
User query: ~t1 ... ~tmExample:~professor Germany ~course ~IR
2-66
Query Expansion with Incremental Merging
relaxable query q: ~professor researchwith expansions based on ontology relatedness modulatingmonotonic score aggregation
Better: dynamic query expansion with incremental merging of additional index lists
efficient, robust, self-tuning
lecturer: 0.7
37: 0.944: 0.8
...
22: 0.723: 0.651: 0.652: 0.6
scholar: 0.692: 0.967: 0.9
...
52: 0.944: 0.855: 0.8
research
B+ tree index on terms
57: 0.644: 0.4
...
professor
52: 0.433: 0.375: 0.3
12: 0.914: 0.8
...
28: 0.617: 0.5561: 0.544: 0.5
44: 0.4
ontology / meta-index
professorlecturer: 0.7scholar: 0.6academic: 0.53scientist: 0.5...
exp(i)={w | sim(i,w) }
primadonna
teacher
investigator
magician
wizard
intellectual
artist
alchemist
director
professor
scholar
academic,academician,faculty member
scientist
researcher
Hyponym (0.749)Hyponym (0.749)
mentor
Related (0.48)Related (0.48)
lecturerTA scans of index lists for iq exp(i)
[M. Theobald et al.: SIGIR 2005]
2-67
Query Expansion with Incremental Merging[M. Theobald et al.: SIGIR 2005]
Principles of the algorithm:• organize possible expansions Exp(t)={w | sim(t,w) , tq} in priority queues based on sim(t,w) for each t
for given q = {t1, tm}: • keep track of active expansions for each ti: ActExp(ti) = {w1(ti), w2(ti), ...w j(i)(ti)}
~ position j(i) in the priority queue for Exp(ti)
• scan index lists for t1, w1(t1), ..., wj(1)(t1), ..., tm, w1(tm), ..., wj(m)(tm)
maintaining cursors pos() and score bounds high() for each list • in each scan step do for each ti:
advance cursor of list for w(ti) ActExp(ti)
for which best := high()*sim(w,ti) is largest among all lists of ti
or add next largest w(ti) to ActExp(ti) if sim(w,ti) > best
and proceed in this new list
2-68
Nested Top-k [Theobald et al.: SIGIR’05]
top-k
research professor XML „information retrieval“ „query result ranking“
research prof XML informationretrieval
query result ranking
top-k top-k
multithreaded coroutine-like evaluation with synchronization of PQs
2-69
Nested Top-k [Theobald et al.: SIGIR’05]
top-k
~research ~professor XML „information retrieval“ „query result ranking“
top-k top-k
research prof XML informationretrieval
query result ranking
top-k top-k
science scholar lecturer
multithreaded coroutine-style evaluation with synchronization of PQs
2-70
Top-k Rank Joins on Structured Data[Ilyas et al. 2008]
extend TA/NRA/etc. to ranked query results from structured data(improve over baseline: evaluate query, then sort)
Select R.Name, C.Theater, C.MovieFrom RestaurantsGuide R, CinemasProgram CWhere R.City = C.CityOrder By R.Quality/R.Price + C.Rating Desc
BlueDragon Chinese €15 SBHaiku Japanese €30 SBMahatma Indian €20 IGBMescal Mexican €10 IGBBigSchwenk German €25 SLS...
Name Type Quality Price City
RestaurantsGuide
BlueSmoke Tombstone 7.5 SBOscar‘s Hero 8.2 SBHolly‘s Die Hard 6.6 SBGoodNight Seven 7.7 IGBBigHits Godfather 9.1 IGB...
Theater Movie Rating City
CinemasProgram
process index lists on quality desc, price asc, rating desc2-71
Top-k Sparql Queries over RDF
?w wrote ?b . ?b hasTag Memoir .
An_Hour_before_Daylight hasTag Memoir 0.880The_Grand_Alliance hasTag Memoir 0.873The_Night_Trilogy hasTag Memoir 0.800Roughing_it hasTag Memoir 0.796................................................ …….
Mark_Twain wrote Roughing_it 0.999J_K_Rowling wrote Harry_Potter 0.982Stephanie_Meyer wrote Twilight 0.966Cormac_McCarthy wrote The_Road 0.864Jimmy_Carter wrote Palestine 0.732………………………………… …….
[S. Elbassuoni 2012]• Precompute score-sorted index lists for each S value, P, O, SP value pair, PO, SO• Perform rank-join over index lists
2-72
Top-k Sparql Queries over RDF + Text [S. Elbassuoni 2012]
?w wrote ?b[nobel prize]; ?b hasTag Memoir
An_Hour_before_Daylight hasTag Memoir 0.880The_Grand_Alliance hasTag Memoir 0.873The_Night_Trilogy hasTag Memoir 0.800Roughing_it hasTag Memoir 0.796................................................ …….
Jimmy_Carter wrote Palestine 0.712Albert_Camus wrote The_Rebel 0.667 Doris_Lessing wrote Alfred_and_Emily 0.660 ………………………………… …….
nobel
Ian_McEwan wrote On_Chesil_Beach 0.882Jimmy_Carter wrote Palestine 0.774 Cormac_McCarthy wrote The_Road 0.623 Jhumpa_Lahiri wrote The_Namesake 0.600………………………………… …….
prize
• Precompute score-sorted index lists for each S+keyword, P+keyword, O+…, SP+keyword, PO+…, SO+…• Perform rank-join over index lists
2-73
Outline
...
Searching for Entities & Relations
Efficient Query Processing
Motivation
Wrap-up
Informative Ranking
User Interface
2-74
UI: Table Search by Example & Description Expect result table specify query in tabular form (QBE, QueryByDescription)
[S.Sarawagi et al.: VLDB‘09]
q1: David Beckham Real MadridZinedine Zidane JuventusLionel Messi FC Barcelona
Bayern Munich Real Madrid1.FC Kaiserslautern Real MadridBorussia Dortmund Real Madrid
q2: 2:15:05:1
match against table rows
infer query fromtable headers ???
QBE
QBE
soccer club from Germany and winner against Real Madridq2:
output class descriptionQBD
player club ???
???q1:QBD
UI: Structured Keyword Search Need to map (groups of) keywords onto entities & relationshipsbased on name-entity similarities/probabilities
q: Champions League finals with Real Madrid
[Ilyas et al. Sigmod‘10]
UEFA Champions League
Real Madrid C.F.League ofWrestlingChampions
final match
final exam
q: German football clubs that won (a match) against Real
more ambiguity, more sophisticated relation more candidates combinatorial complexity
Real Madrid C.F.
Real Zaragoza
Brazilian Real
Familia Real
Real Number
soccer
American football
social club
night club
sports team
UI: Natural Language Questions
map resultsinto tabular or visual presentationor speech
Which German soccer clubs have won against Real Madrid ?
translate question into Sparql query:• dependency parsing to decompose question• mapping of question units onto entities, classes, relations
Which ?c
?c German soccer clubs ?r ?r have won against ?o
?o Real Madrid ?
UI: Natural Language Questions
map resultsinto tabular or visual presentationor speech
Select ?c Where { ?c isa SoccerClub . c? in Germany . }
?c wonAgainst RealMadrid . }?id: ?c playedAgainst RealMadrid . ?c isWinnerOf ?id .}
translate question into Sparql query:• dependency parsing to decompose question• mapping of question units onto entities, classes, relations
Select ?c Where { ?c isa GermanSoccerClub . c? wonAgainst RealMadrid . }
SPARQL or Keywords or … ?
SPARQL query: Select ?p Where {?p created ?s . ?s type music . ?s contributesTo ?m . ?m type movie .?p bornIn Rome . }
Who composed scores for westerns and is from Rome?
Keywords query:composer score western Rome
or: {composer Rome} score westernor: …
Faceted search with interactive navigation:composer
Romescore music movie
western2-79
Natural Language Question AnsweringIBM Watson in Jeopardy show: Which city has two airports named after a war hero and a battlefield?
used Yago & Dbpedia for type checking
[IBM Journal of R&D 56(3-4), 2012]
question
QuestionAnalysis:ClassificationDecomposition
HypothesesGeneration(Search):AnswerCandidates
CandidateFiltering &Ranking
Hypotheses& EvidenceScoring
answer
Overall architecture of Watson (simplified)
2-80
Semantic Technologies in IBM WatsonIBM Watson in Jeopardy show: Which city has two airports named after a war hero and a battlefield?
used Yago & Dbpedia for type checking
[A. Kalyanpur et al.: ISWC 2011]
Semantic checking of answer candidatesquestion
candidatestring
ConstraintChecker
TypeChecker
lexicalanswer type
KB
RelationDetection
EntityDisambiguation& Matching
PredicateDisambiguation& Matching
candidatescore
KB instances
semantic types
spatial & temporalrelations
2-81
Scoring of Semantic Answer Types
[A. Kalyanpur et al.: ISWC 2011]
Check for 1) Yago classes, 2) Dbpedia classes, 3) Wikipedia lists
Match lexical answer type against class candidatesbased on string similarity and class sizes (popularity)Examples: Scottish inventor inventor, star movie star
Compute scores for semantic types, considering:class match, subclass match, superclass match,sibling class match, lowest common ancestor, class disjointness, …
no types Yago Dbpedia Wikipedia all 3Standard QAaccuracy 50.1% 54.4% 54.7% 53.8% 56.5%Watsonaccuracy 65.6% 68.6% 67.1% 67.4% 69.0%
see also: j-archive.com 2-82
From Questions to Queries
NL question:
Who composed scores for westerns and is from Rome?
scores for westerns
is from Rome Who composed scores
Dependency parsing exposes structure of question „triploids“ (sub-cues)
2-83
From Triploids to TriplesWho composed scores for westerns and is from Rome?
Who is from Rome
Who composed scores
scores for westerns
?x composed scores
?x bornIn Rome
scores contributesTo ?y
?y type westernMovie
?x type composer
?x composed ?s
?s contributesTo ?y
?s type music
2-84
Disambiguation Mapping for TriploidsWho composed scores for westerns and is from Rome?
composed
composedscores
scores for
westerns
is from
Rome
Who
q1
q2
q3
q4
Combinatorial Optimization by ILP (with type constraints etc.)
Rome (Italy)
Lazio Roma
person
musicianThe Who
createdwroteCompositionwroteSoftware
soundtracksoundtrackFor
shootsGoalFor
bornIn
actedIn
western movie
Western Digital
we
igh
ted
ed
ge
s (c
oh
ere
nc
e, s
imila
rity
, etc
.)
2-85
Relaxing Overconstrained QueriesSelect ?p Where {
?p composed ?s . ?s type music . ?s for ?m . ?m type movie .?p bornIn Rome . }
Select ?p Where {?p composed ?s . ?s type music . ?s for ?m . ?m type movie [western] .?p bornIn Rome . }
Select ?p Where {?p ?rel1 ?s [composed] . ?s type music . ?s ?rel2 ?m . ?m type movie [western] .?p bornIn Rome . }
with extended SPARQL-FullText: SPOX quad patterns
(S. Elbassuoni et al.: CIKM‘10, ESWC’11) 2-86
Preliminary Resultshttp://www.mpi-inf.mpg.de/yago-naga/deanna/
(M. Yahya et al.: WWW’12)
2-87
Preliminary Results
2-88
Preliminary Results
2-89
Preliminary Results
2-90
Outline
...
Searching for Entities & Relations
Efficient Query Processing
Motivation
Wrap-up
Informative Ranking
User Interface
2-91
Take-Home LessonsSemantic Search over Entities and Relationships:Sparql is natural choice, but text extension is crucial
For ranking, LM‘s are effective, elegant, versatilestatistically principled & compositionalapplicable to entities, RDF, triples+text, temporal expr, …
keywords++, tables by example, natural language, speech, …
APIs becoming clear, UIs widely open
triple indexing, join order optimization, …top-k query processing
Efficient implementation techniques
2-92
Open Problems and Grand Challenges
Reconcile expressiveness of Sparql + extensionswith ease-of-use of natural-language QA
Efficient and effective (LM-based) Top-k Rankingfor entire Web of Linked Data (& Text)
Extended LM-based Ranking for Entitiesin Temporal & Spatial (& Social) Context
Extend Sparql with Text, Time, Space, … Influence W3C standard …
Distributed / federated query processingfor entire Web of Linked Data
2-93
Recommended Readings: Search & Ranking• G. Kasneci, F. Suchanek, G. Ifrim: NAGA: Searching and Ranking Knowledge. ICDE 2008• S. Elbassuoni, M. Ramanath, G. Weikum: Query Relaxation for Entity-Relationship Search. ESWC 2011• S. Elbassuoni, M. Ramanath, et al.: Language-model-based ranking for queries on RDF-graphs. CIKM 2009• Z. Nie, Y. Ma, S. Shi, J.-R. Wen, W.-Y. Ma: Web Object Retrieval. WWW 2007• H. Bast, A. Chitea, F. Suchanek, I. Weber: ESTER: efficient search on text, entities, and relations. SIGIR 2007• J. Pound, P. Mika, H. Zaragoza: Ad-hoc object retrieval in the web of data. WWW 2010• O. Udrea, D. Recupero, V.S. Subrahmanian: Annotated RDF. ACM Trans. Comput. Log. 11(2), 2010• V. Hristidis, H.Hwang, Y.Papakonstantinou: Authority-based keyword search in databases. TODS 33(1), 2008• T. Cheng, X. Yan, K. Chang: EntityRank: Searching Entities Directly and Holistically. VLDB 2007• D. Vallet, H. Zaragoza: Inferring the Most Important Types of a Query: a Semantic Approach. SIGIR 2008• R. Blanco, H. Zaragoza: Finding support sentences for entities. SIGIR 2010• P. Serdyukov, D. Hiemstra: Modeling Documents as Mixtures of Persons for Expert Finding. ECIR 2008• R. Kaptein, P. Serdyukov, A.P. de Vries, J. Kamps: Entity ranking using Wikipedia as a pivot. CIKM 2010• H. Fang, C. Zhai: Probabilistic Models for Expert Finding. ECIR 2007• D. Petkova, W.B. Croft: Hierarchical Language Models for Expert Finding in Enterprise Corpora. ICTAI 2006• M. Pasca: Towards Temporal Web Search. SAC 2008• K. Berberich, S.J. Bedathur, O. Alonso, G. Weikum: A Language Modeling Approach for Temporal Information Needs. ECIR 2010: 13-25• J. Tappolet et al.: Applied temporal RDF: efficient temporal querying of RDF data with SPARQL. ESWC 2009• C. Zhai: Statistical Language Models for Information Retrieval. Morgan&Claypool, 2008• B.C. Ooi (Ed.): Special Issue on Keyword Search, IEEE Data Eng. Bull. 33(1), 2010• J.X. Yu, L. Qin, L. Chang: Keyword Search in Databases Morgan & Claypool, 2010• S. Ceri, M. Brambilla (Eds.): Search Computing: Challenges and Directions, Springer, 2010• M. Arenas, S. Conca, J. Perez: Counting beyond a Yottabyte, or how SPARQL 1.1 property paths will prevent adoption of the standard. WWW 2012
2-94
Recommended Readings: Efficiency & UI• T. Neumann, G. Weikum: The RDF-3X engine for scalable management of RDF data. VLDB J. 19(1), 2010• J. Huang, D.J. Abadi, K. Ren: Scalable SPARQL Querying of Large RDF Graphs. PVLDB 4(11), 2011• L. Sidirourgos, R. Goncalves, M.L. Kersten, N. Nes, S. Manegold: Column-store support for RDF data management: not all swans are white. PVLDB 1(2), 2008• R. Delbru, S. Campinas, G. Tummarello: Searching web data: An entity retrieval and high-performance indexing model. J. Web Sem. 10, 2012• G. Tummarello et al.: Sig.ma: live views on the web of data. WWW 2010• M. Schmidt, T. Hornung, G. Lausen, C. Pinkel: SP^2Bench: A SPARQL Performance Benchmark. ICDE 2009• M. Morsey, J. Lehmann, S. Auer, A.N. Ngomo: DBpedia SPARQL Benchmark - Performance Assessment with Real Queries on Real Data. ISWC 2011• K. Hose, R. Schenkel, M. Theobald, G. Weikum: Database Foundations for Scalable RDF Processing. in: A. Polleres et al. (Eds.): Reasoning Web - Semantic Technologies for the Web of Data, Springer 2011• A. Schwarte, P. Haase, K. Hose, R. Schenkel, M. Schmidt: FedX: Optimization Techniques for Federated Query Processing on Linked Data. ISWC 2011• J. Pound, I.F. Ilyas, G.E. Weddell: Expressive and flexible access to web-extracted data: a keyword-based structured query language. SIGMOD 2010 • R. Pimplikar, S. Sarawagi: Answering Table Queries on the Web using Column Keywords. PVLDB 5(10), 2012• A. Kalyanpur, J.W. Murdock, J. Fan, C.A. Welty: Leveraging Community-Built Knowledge for Type Coercion in Question Answering. ISWC 2011• C. Unger, L. Bühmann, J. Lehmann, A.N. Ngomo, D. Gerber, P. Cimiano: Template-based question answering over RDF data. WWW 2012• M. Yahya. K. Berberich, S. Elbassuoni, M. Ramanath, V. Tresp, G. Weikum: Natural Language Questions for the Web of Data. EMNLP 2012• D.A. Ferrucci et al.: Building Watson: An Overview of the DeepQA Project. AI Magazine 31(3), 2010• IBM Journal of Research and Development 56(3-4), Special Issue on “This is Watson”, 2012
2-95
Thank You!
2-96