Gerhard Weikum, Max Planck Institute for Informatics, http://www.mpi-inf.mpg.de/~weikum/
From Information to Knowledge: Harvesting Entities, Relationships, and Temporal Facts from Web Sources




Acknowledgements


Goal: Turn Web into Knowledge Base

comprehensive DB of human knowledge
• everything that Wikipedia knows
• everything machine-readable
• capturing entities, classes, relationships

Source: DB & IR methods for knowledge discovery. Communications of the ACM 52(4), 2009


Approach: Harvesting Facts from Web

Politician | Political Party
Angela Merkel | CDU
Karl-Theodor zu Guttenberg | CDU
Christoph Hartmann | FDP

Company | CEO
Google | Eric Schmidt

Movie | ReportedRevenue
Avatar | $ 2,718,444,933
The Reader | $ 108,709,522

PoliticalParty | Spokesperson
CDU | Philipp Wachholz
Die Grünen | Claudia Roth

Actor | Award
Christoph Waltz | Oscar
Sandra Bullock | Oscar
Sandra Bullock | Golden Raspberry

Politician | Position
Angela Merkel | Chancellor Germany
Karl-Theodor zu Guttenberg | Minister of Defense Germany
Christoph Hartmann | Minister of Economy Saarland

Company | AcquiredCompany
Google | YouTube
Yahoo | Overture
Facebook | FriendFeed
Software AG | IDS Scheer

YAGO-NAGA, IWP, Cyc, TextRunner, ReadTheWeb, WikiTax2WordNet, SUMO


Knowledge for Intelligence
• entity recognition & disambiguation
• understanding natural language & speech
• knowledge services & reasoning for semantic apps (e.g. deep QA)

• semantic search: precise answers to advanced queries (by scientists, students, journalists, analysts, etc.)

FIFA 2010 finalists who played in a Champions League final?

Politicians who are also scientists?

Enzymes that inhibit HIV? Influenza drugs for teens with high blood pressure?...

German football coach when Bastian Schweinsteiger was born?

Relationships between Manfred Pinkal, Edsger Dijkstra, Michael Dell, and Renee Zellweger?


Outline

...

Automatic KB Construction

Growing & Maintaining the KB

Temporal Knowledge

What and Why

Wrap-up


What is Knowledge (in a KB)?

...

• facts / assertions: bornIn (BastianSchweinsteiger, Kolbermoor), hasWon (BastianSchweinsteiger, BronzeFIFAWorldCup2010), playedInFinal (BastianSchweinsteiger, ChampionsLeague2010), …
• taxonomic: instanceOf (BastianSchweinsteiger, footballPlayer), subclassOf (footballPlayer, athlete), …
• lexical / terminology: means (“Big Apple”, NewYorkCity), means (“Apple”, AppleComputerCorporation), means (“MS”, Microsoft), means (“MS”, MultipleSclerosis), …
• common-sense properties: apples are green, red, juicy, sweet, sour … - but not fast, smart …; balls are round, smooth, slippery … - but not square, funny …
• common-sense axioms:
  ∀x: human(x) ⇒ male(x) ∨ female(x)
  ∀x: (male(x) ⇒ ¬female(x)) ∧ (female(x) ⇒ ¬male(x))
  ∀x: animal(x) ⇒ (hasLegs(x) ⇒ isEven(numberOfLegs(x)))
  …
• procedural: how to fix/install/prepare/remove …
• epistemic / beliefs: believes (Ptolemy, shape(Earth, disc)), believes (Copernicus, shape(Earth, sphere)), …


Tapping on Wikipedia Categories


http://www.mpi-inf.mpg.de/yago-naga/

KB‘s: Example YAGO (Suchanek et al.: WWW‘07)

[Figure: excerpt of the YAGO graph. Entity nodes include Max_Planck, Erwin_Planck, Angela Merkel, Max_Planck Society, Kiel, Schleswig-Holstein, and Germany. Class nodes include Entity, Person, Scientist, Physicist, Biologist, Politician, Organization, Location, City, State, and Country. Name nodes include “Max Planck”, “Max Karl Ernst Ludwig Planck”, “Angela Merkel”, and “Angela Dorothea Merkel”. Edge labels: instanceOf, subclass, means (with weights such as 0.9 and 0.1), bornOn, bornIn, diedOn, hasWon, FatherOf, citizenOf, locatedIn. Literals shown: Apr 23, 1858; Oct 4, 1947; Oct 23, 1944; Nobel Prize.]

Accuracy 95%

2 Mio. entities, 200 000 classes, 40 Mio. RDF triples (facts) (entity1-relation-entity2, subject-predicate-object)
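
The subject-predicate-object view can be made concrete with a tiny in-memory triple store. This is only a sketch: the triples below are an assumed reading of the figure (edge directions are not taken from YAGO itself), and the helper names `superclasses` and `classes_of` are hypothetical.

```python
# Toy triple store: each fact is a (subject, predicate, object) tuple.
# The concrete triples are an assumed reading of the slide's figure.
TRIPLES = {
    ("Max_Planck", "instanceOf", "Physicist"),
    ("Max_Planck", "bornIn", "Kiel"),
    ("Max_Planck", "bornOn", "1858-04-23"),
    ("Max_Planck", "diedOn", "1947-10-04"),
    ("Physicist", "subclass", "Scientist"),
    ("Scientist", "subclass", "Person"),
    ("Kiel", "instanceOf", "City"),
}


def superclasses(cls):
    """All classes reachable from cls via subclass edges (transitive closure)."""
    found, frontier = set(), [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in TRIPLES:
            if s == current and p == "subclass" and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def classes_of(entity):
    """Direct classes of an entity plus every class they inherit from."""
    direct = {o for s, p, o in TRIPLES if s == entity and p == "instanceOf"}
    result = set(direct)
    for cls in direct:
        result |= superclasses(cls)
    return result
```

Combining instanceOf with the subclass hierarchy is what lets a query for persons also return physicists, even though no triple states that directly.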


KB‘s: Example YAGO (F. Suchanek et al.: WWW‘07)

http://www.mpi-inf.mpg.de/yago-naga/


KB‘s: Example DBpedia (Auer, Bizer, et al.: ISWC‘07)

• 3 Mio. entities
• 1 Bio. facts (RDF triples)
• 1.5 Mio. entities mapped to hand-crafted taxonomy of 259 classes with 1200 properties

http://www.dbpedia.org


Outline

...

Automatic KB Construction

Growing & Maintaining the KB

Temporal Knowledge

What and Why

Wrap-up


French Marriage Problem

facts in KB:
married (Hillary, Bill)
married (Carla, Nicolas)
married (Angelina, Brad)

new facts or fact candidates:
married (Cecilia, Nicolas)
married (Carla, Benjamin)
married (Carla, Mick)
married (Michelle, Barack)
married (Yoko, John)
married (Kate, Leonardo)
married (Carla, Sofie)
married (Larry, Google)

1) for recall: pattern-based harvesting
2) for precision: consistency reasoning


Pattern-Based Harvesting

Facts (seeds):
(Hillary, Bill)
(Carla, Nicolas)

Patterns:
X and her husband Y
X and Y on their honeymoon
X and Y and their children
X has been dating with Y
X loves Y
…

Facts & Fact Candidates:
(Hillary, Bill)
(Carla, Nicolas)
(Angelina, Brad)
(Victoria, David)
(Yoko, John)
(Carla, Benjamin)
(Larry, Google)
(Kate, Pete)

• good for recall
• noisy, drifting
• not robust enough for high precision

(Hearst 92, Brin 98, Agichtein 00, Etzioni 04, …)
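
The pattern-matching half of this loop can be sketched in a few lines. The sketch assumes single-word capitalized names and plain regular expressions; the example sentences and the helpers `pattern_to_regex` and `harvest` are made up for illustration and are not taken from any of the cited systems.

```python
import re

# Assumption: an entity mention is a single capitalized word.
NAME = r"([A-Z][a-z]+)"


def pattern_to_regex(pattern):
    """Turn a surface pattern such as 'X and her husband Y' into a regex.

    Assumes X occurs before Y, so group 1 is X and group 2 is Y.
    """
    parts = []
    for token in pattern.split():
        parts.append(NAME if token in ("X", "Y") else re.escape(token))
    return re.compile(r"\s+".join(parts))


def harvest(sentences, patterns):
    """Apply every pattern to every sentence and collect (X, Y) candidates."""
    candidates = set()
    for sentence in sentences:
        for pattern in patterns:
            for match in pattern_to_regex(pattern).finditer(sentence):
                candidates.add((match.group(1), match.group(2)))
    return candidates


PATTERNS = ["X and her husband Y", "X loves Y"]
SENTENCES = [  # invented example sentences
    "Hillary and her husband Bill attended the dinner.",
    "Everybody knows that Larry loves Google.",
]
```

On these inputs the weak pattern "X loves Y" yields the spurious candidate (Larry, Google), which illustrates why pattern matching alone is good for recall but too noisy for high precision.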


Reasoning about Fact Candidates

Use consistency constraints to prune false candidates

FOL rules (restricted):
spouse(x,y) ∧ diff(y,z) ⇒ ¬spouse(x,z)
spouse(x,y) ∧ diff(w,x) ⇒ ¬spouse(w,y)
spouse(x,y) ⇒ f(x)
spouse(x,y) ⇒ m(y)
spouse(x,y) ⇒ (f(x) ∧ m(y)) ∨ (m(x) ∧ f(y))

ground atoms:
spouse(Hillary,Bill), spouse(Carla,Nicolas), spouse(Cecilia,Nicolas), spouse(Carla,Ben), spouse(Carla,Mick), spouse(Carla,Sofie)
f(Hillary), f(Carla), f(Cecilia), f(Sofie)
m(Bill), m(Nicolas), m(Ben), m(Mick)

Rules reveal inconsistencies
Find consistent subset(s) of atoms (“possible world(s)”, “the truth”)

Rules can be weighted (e.g. by fraction of ground atoms that satisfy a rule)
→ uncertain / probabilistic data
→ compute prob. distr. of subset of atoms being the truth
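
To see how the rules expose inconsistencies, one can ground them against the atoms above and list every violated instance. This is a minimal sketch that treats all rules as hard constraints; the function name `violations` and the rule labels are illustrative choices.

```python
from itertools import combinations

# Ground atoms from the slide.
CANDIDATES = [
    ("Hillary", "Bill"), ("Carla", "Nicolas"), ("Cecilia", "Nicolas"),
    ("Carla", "Ben"), ("Carla", "Mick"), ("Carla", "Sofie"),
]
FEMALE = {"Hillary", "Carla", "Cecilia", "Sofie"}
MALE = {"Bill", "Nicolas", "Ben", "Mick"}


def violations(candidates):
    """Return (rule, atoms) pairs for every grounded rule that is violated."""
    found = []
    # Type rules: spouse(x,y) => f(x) and spouse(x,y) => m(y).
    for x, y in candidates:
        if x not in FEMALE:
            found.append(("spouse(x,y) => f(x)", ((x, y),)))
        if y not in MALE:
            found.append(("spouse(x,y) => m(y)", ((x, y),)))
    # Functional rules: at most one spouse per x and per y.
    for (x1, y1), (x2, y2) in combinations(candidates, 2):
        if x1 == x2 and y1 != y2:
            found.append(("one spouse per x", ((x1, y1), (x2, y2))))
        if y1 == y2 and x1 != x2:
            found.append(("one spouse per y", ((x1, y1), (x2, y2))))
    return found
```

Most conflicts involve Carla, who appears with four different partners. Choosing which atoms to drop so that no violation remains is the "consistent subset" problem named on the slide.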


Markov Logic Networks (MLN‘s) (M. Richardson / P. Domingos 2006)

Map logical constraints & fact candidates into probabilistic graph model: Markov Random Field (MRF)

Rules:
s(x,y) ∧ diff(y,z) ⇒ ¬s(x,z)
s(x,y) ∧ diff(w,x) ⇒ ¬s(w,y)
s(x,y) ⇒ f(x)
s(x,y) ⇒ m(y)
f(x) ⇒ ¬m(x)
m(x) ⇒ ¬f(x)

Fact candidates:
s(Carla,Nicolas), s(Cecilia,Nicolas), s(Carla,Ben), s(Carla,Sofie), …

Grounding:
¬s(Ca,Nic) ∨ ¬s(Ce,Nic)
¬s(Ca,Nic) ∨ ¬s(Ca,Ben)
¬s(Ca,Nic) ∨ ¬s(Ca,So)
¬s(Ca,Ben) ∨ ¬s(Ca,So)
¬s(Ca,Nic) ∨ m(Nic)
¬s(Ce,Nic) ∨ m(Nic)
¬s(Ca,Ben) ∨ m(Ben)
¬s(Ca,So) ∨ m(So)

Literal → Boolean variable
Literal → binary RV


Markov Logic Networks (MLN‘s) (M. Richardson / P. Domingos 2006)

[Figure: MRF over the ground literals m(Ben), m(Nic), m(So), s(Ca,Nic), s(Ce,Nic), s(Ca,Ben), s(Ca,So)]

RVs coupled by MRF edge if they appear in same clause

MRF assumption: P[Xi | X1..Xn] = P[Xi | N(Xi)]

joint distribution has product form over all cliques

Variety of algorithms for joint inference: Gibbs sampling, other MCMC, belief propagation, randomized MaxSat, …
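
For reference, the product form can be written out in the standard log-linear notation of Markov Logic Networks. The symbols are not defined on the slide and are introduced here as assumptions: w_i is the weight of rule i, n_i(x) is the number of true groundings of rule i in the possible world x, and Z is the normalization constant.

```latex
P(X = x) \;=\; \frac{1}{Z} \exp\Big( \sum_{i} w_i \, n_i(x) \Big),
\qquad
Z \;=\; \sum_{x'} \exp\Big( \sum_{i} w_i \, n_i(x') \Big)
```

A world that violates a heavily weighted rule has fewer true groundings of that rule and therefore receives exponentially less probability, rather than being ruled out entirely.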


Related Alternative Probabilistic Models

Constrained Conditional Models [D. Roth et al. 2007]
• log-linear classifiers with constraint-violation penalty
• mapped into Integer Linear Programs

Factor Graphs with Imperative Variable Coordination [A. McCallum et al. 2008]
• RV‘s share “factors” (joint feature functions)
• generalizes MRF, BN, CRF, …
• inference via advanced MCMC
• flexible coupling & constraining of RV‘s

software tools:
alchemy.cs.washington.edu
code.google.com/p/factorie/
research.microsoft.com/en-us/um/cambridge/projects/infernet/


Reasoning for KB Growth: Direct Route

facts in KB:
married (Hillary, Bill)
married (Carla, Nicolas)
married (Angelina, Brad)

+ new fact candidates:
married (Cecilia, Nicolas)
married (Carla, Benjamin)
married (Carla, Mick)
married (Carla, Sofie)
married (Larry, Google)

+ patterns:
X and her husband Y
X and Y and their children
X has been dating with Y
X loves Y

Direct approach:
1. facts are true; fact candidates & patterns → hypotheses; grounded constraints → clauses with hypotheses as vars
2. type signatures of relations greatly reduce #clauses
3. cast into Weighted Max-Sat with weights from pattern stats

customized approximation algorithm
unifies: fact cand consistency, pattern goodness, entity disambig.

(F. Suchanek et al.: WWW‘09)

www.mpi-inf.mpg.de/yago-naga/sofie/
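
The Weighted Max-Sat formulation can be illustrated with a brute-force solver over three hypotheses. This is a toy sketch, not the customized approximation algorithm mentioned on the slide: all weights are invented, exhaustive enumeration only works for a handful of variables, and s(Ca,Nic) is treated as a hypothesis here purely for illustration.

```python
from itertools import product

HYPOTHESES = ["s(Ca,Nic)", "s(Ce,Nic)", "s(Ca,Ben)"]

# Each clause is (weight, [(variable, wanted_truth_value), ...]) and counts as
# satisfied when at least one literal has its wanted value.
# Weights are invented stand-ins for pattern statistics and constraint strength.
CLAUSES = [
    (4.0, [("s(Ca,Nic)", True)]),   # evidence for each hypothesis
    (2.0, [("s(Ce,Nic)", True)]),
    (1.0, [("s(Ca,Ben)", True)]),
    (10.0, [("s(Ca,Nic)", False), ("s(Ce,Nic)", False)]),  # one spouse per y
    (10.0, [("s(Ca,Nic)", False), ("s(Ca,Ben)", False)]),  # one spouse per x
]


def satisfied_weight(assignment):
    """Total weight of the clauses satisfied by a truth assignment."""
    return sum(
        weight
        for weight, literals in CLAUSES
        if any(assignment[var] == wanted for var, wanted in literals)
    )


def best_assignment():
    """Enumerate all truth assignments and keep the heaviest one."""
    best, best_weight = None, -1.0
    for values in product([True, False], repeat=len(HYPOTHESES)):
        assignment = dict(zip(HYPOTHESES, values))
        weight = satisfied_weight(assignment)
        if weight > best_weight:
            best, best_weight = assignment, weight
    return best, best_weight
```

With these weights the best assignment accepts s(Ca,Nic) and rejects the two conflicting candidates, because giving up their small evidence weights costs less than violating a constraint clause.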


Facts & Patterns Consistency with SOFIE

constraints to connect facts, fact candidates, patterns
(F. Suchanek et al.: WWW’09, N. Nakashole et al.: WebDB‘10)

functional dependencies:
spouse(X,Y): X → Y, Y → X

relation properties:
asymmetry, transitivity, acyclicity, …

type constraints, inclusion dependencies:
spouse ⊆ Person × Person
capitalOfCountry ⊆ cityOfCountry

domain-specific constraints:
bornInYear(x) + 10years ≤ graduatedInYear(x)
hasAdvisor(x,y) ∧ graduatedInYear(x,t) ∧ graduatedInYear(y,s) ⇒ s < t

pattern-fact duality:
occurs(p,x,y) ∧ expresses(p,R) ∧ type(x)=dom(R) ∧ type(y)=rng(R) ⇒ R(x,y)
occurs(p,x,y) ∧ R(x,y) ∧ type(x)=dom(R) ∧ type(y)=rng(R) ⇒ expresses(p,R)

name(-in-context)-to-entity mapping:
¬means(n,e1) ∨ ¬means(n,e2) ∨ …

www.mpi-inf.mpg.de/yago-naga/sofie/


Entity Disambiguation Revisited

occurs (“divorced from”, Madonna, Guy Ritchie) ∧ expresses (“divorced from”, wasMarriedTo) ⇒ wasMarriedTo (Madonna, Guy Ritchie)

actually is:
occurs (“divorced from”, “Madonna”, “Guy Ritchie”) ∧ means (“Madonna”, Madonna Louise Ciccone) ∧ expresses (“divorced from”, wasMarriedTo) ⇒ wasMarriedTo (Madonna Louise Ciccone, Guy Ritchie) [0.7]

occurs (“divorced from”, “Madonna”, “Guy Ritchie”) ∧ means (“Madonna”, Madonna (Edvard Munch)) ∧ expresses (“divorced from”, wasMarriedTo) ⇒ wasMarriedTo (Madonna (Edvard Munch), Guy Ritchie) [0.3]

• use context-similarity as disambiguation prior
• set clause weights accordingly
→ reduced to normal case

[Figure labels: entity level, word/phrase level]


Experimental Results

SOFIE (F. Suchanek et al.: WWW’09)
• input: biographies of 400 US senators, 3500 HTML files
• output: birth/death date&place, politicianOf (state)
• run-time: 7 h parsing, 6 h hypotheses, 2 h Max-Sat
• precision: 90-95 % (except for death place)
• recall: ca. 750 extracted facts (300 politicianOf facts)

PROSPERA (N. Nakashole et al.: WebDB‘10):
• input: 87 000 Wikipedia articles and Web homepages of scientists
• output: hasAdvisor, graduatedAt, hasCollaborator, facultyAt, wonAward
• run-time: 1 h total (largely parallelized)
• precision: 85-95 %
• recall: ca. 4000 extracted facts (400 hasAdvisor facts)

Now running experiments on ClueWeb‘09 corpus (500 Mio. English Web pages) with Hadoop cluster of 10x16 cores and 10x48 GB


Outline

...

Automatic KB Construction

Growing & Maintaining the KB

Temporal Knowledge

What and Why

Wrap-up


Temporal Knowledge

Which facts for given relations hold at what time point or during which time intervals?

marriedTo (Madonna, Guy) [ 22Dec2000, Dec2008 ]
capitalOf (Berlin, Germany) [ 1990, now ]
capitalOf (Bonn, Germany) [ 1949, 1989 ]
hasWonPrize (JimGray, TuringAward) [ 1998 ]
graduatedAt (HectorGarcia-Molina, Stanford) [ 1979 ]
graduatedAt (SusanDavidson, Princeton) [ Oct 1982 ]
hasAdvisor (SusanDavidson, HectorGarcia-Molina) [ Oct 1982, forever ]

How can we query & reason on entity-relationship facts in a “time-travel” manner - with uncertain/incomplete KB?

US president when Barack Obama was born?
students of Hector Garcia-Molina while he was at Princeton?
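
A "time-travel" lookup over such interval-annotated facts can be sketched as a filter on validity intervals. The sketch uses the two capitalOf facts from this slide, expands year bounds to full calendar days (an assumption about resolution), and uses `date.max` as a stand-in for "now"; `holds_at` is a hypothetical helper name.

```python
from datetime import date

NOW = date.max  # stand-in for an open-ended "now"

# (relation, subject, object, valid_from, valid_until); year-level bounds from
# the slide are expanded to whole years, which is an assumption.
T_FACTS = [
    ("capitalOf", "Berlin", "Germany", date(1990, 1, 1), NOW),
    ("capitalOf", "Bonn", "Germany", date(1949, 1, 1), date(1989, 12, 31)),
]


def holds_at(relation, obj, when):
    """Subjects s such that relation(s, obj) is valid on the given day."""
    return [
        subject
        for rel, subject, target, start, end in T_FACTS
        if rel == relation and target == obj and start <= when <= end
    ]
```

For example, `holds_at("capitalOf", "Germany", date(1975, 6, 1))` selects Bonn, while a date in 2005 selects Berlin. Handling uncertain or incomplete intervals requires a richer representation than exact bounds.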


French Marriage Problem

facts in KB:
1: married (Hillary, Bill)
2: married (Carla, Nicolas)
3: married (Angelina, Brad)

new fact candidates:
4: married (Cecilia, Nicolas)
5: married (Carla, Benjamin)
6: married (Carla, Mick)
7: divorced (Madonna, Guy)
8: domPartner (Angelina, Brad)

validFrom (2, 2008)
validFrom (4, 1996), validUntil (4, 2007)
validFrom (5, 2010)
validFrom (6, 2006)
validFrom (7, 2008)


Challenge: Temporal Knowledge

for all people in Wikipedia (300 000) gather all spouses, incl. divorced & widowed, and corresponding time periods!
>95% accuracy, >95% coverage, in one night

consistency constraints are potentially helpful:
• functional dependencies: husband, time → wife
• inclusion dependencies: marriedPerson ⊆ adultPerson
• age/time/gender restrictions: birthdate + Δ < marriage < divorce

1) recall: gather temporal scopes for base facts
2) precision: reason on mutual consistency


Difficult Dating


(Even More Difficult) Implicit Dating

explicit dates vs. implicit dates relative to other dates


(Even More Difficult) Relative Dating

vague dates, relative dates

narrative text, relative order


Framework for T-Fact Extraction
(Theobald et al.: MUD’10, Wang et al.: EDBT’10; Zhang et al.: WebDB‘08)

1) represent temporal scopes of facts in the presence of incompleteness and uncertainty

2) gather & filter candidates for t-facts: extract base facts R(e1, e2) first; then focus on sentences with e1, e2 and date or temporal phrase

3) aggregate & reconcile evidence from observations

4) reason on joint constraints about facts and time scopes


1) Representing T-Fact Evidence

event-style and state-style facts, meta-facts to capture temporal scopes:

1: married(Madonna, Sean), 2: married(Madonna, Guy),
validSince (1, 16-Aug-1985), validUntil (1, 14-Sep-1989),
validSince (2, 22-Dec-2000), validUntil (2, 15-Dec-2008)
3: wonAward(Sean, AcademyAwardForBestActor), validOn (3, 29-Feb-2004)

different resolutions, later refinement:

“After 4 years of happy marriage, Madonna and Sean got divorced in September 1989.”

1: married(Madonna, Sean),
earliestSince (1, 1-Jan-1985), latestSince (1, 31-Dec-1985),
earliestUntil (1, 1-Sep-1989), latestUntil (1, 30-Sep-1989)

uncertain & inconsistent evidence → confidence distribution

[Figure: confidence over a time axis from 1984 to 1990, shown as a histogram with values 0.7, 0.4, 0.1 and as a Gaussian with µ=1987, σ²=1]
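
The four bounds earliestSince, latestSince, earliestUntil, and latestUntil map naturally onto a small record type. The sketch below is one possible encoding under the assumption of day-level resolution; the class name `TemporalScope` and its methods are hypothetical, and `refine` models "later refinement" as a plain intersection of bounds.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class TemporalScope:
    earliest_since: date
    latest_since: date
    earliest_until: date
    latest_until: date

    def certainly_valid(self, day):
        """True if the fact holds on `day` under every reading of the bounds."""
        return self.latest_since <= day <= self.earliest_until

    def possibly_valid(self, day):
        """True if the fact holds on `day` under at least one reading."""
        return self.earliest_since <= day <= self.latest_until

    def refine(self, other):
        """Tighten the bounds by intersecting them with a second scope."""
        return TemporalScope(
            max(self.earliest_since, other.earliest_since),
            min(self.latest_since, other.latest_since),
            max(self.earliest_until, other.earliest_until),
            min(self.latest_until, other.latest_until),
        )


# Coarse scope derived from the example sentence on the slide.
coarse = TemporalScope(
    date(1985, 1, 1), date(1985, 12, 31), date(1989, 9, 1), date(1989, 9, 30)
)
# Exact scope from the validSince / validUntil meta-facts on the slide.
exact = TemporalScope(
    date(1985, 8, 16), date(1985, 8, 16), date(1989, 9, 14), date(1989, 9, 14)
)
```

Refining the coarse scope with the exact one yields the exact scope, while a day such as 1 June 1985 is only possibly valid under the coarse bounds. This encoding does not capture the confidence distribution; it only covers the interval bounds.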


2) Gather & Filter T-Fact Candidates

Choice of sources:

news-style | biography-style
date in header | many dates in text
relative temp expr‘s | explicit dates, narrative
simple language | elaborated language
many pronouns | pronouns for main entity

Naive approach: use deep NLP (dependency parser) on every sentence, then use classifier (or structured-output learner) to detect t-facts → too expensive

Bruni met recently divorced president Sarkozy in November 2007 at a dinner party.

She has said she is easily “bored with monogamy” …

A romance is said to have started a few weeks ago between her and Biolay.


2) Gather & Filter: Multi-Stage Approach

stage 1: sentences with e1 and e2 from R
• match noun phrases against YAGO means relation
• use disambiguation prior for entity mentions

stage 2: sentences that contain a temporal expression
• use TARSQI tool to extract relative t-expressions and map them to absolute dates or durations

stage 3: sentences where the t-expression refers to R(e1,e2)
• run dependency parser: check shortest path connecting e1, e2, verb, t-expr
• alternatively, consider only sentences with two noun groups & short surface distances of e1, e2, t-expr

Example: Jim married Sue, but later left her and began an affair with Jane in 2005.


3) Aggregate & Reconcile T-Fact Evidence

Ideal input:
Madonna and Sean were married from 16-Aug-85 until 12-Sep-89.
Madonna and Sean married on August 16, 1985.
Madonna and Sean got divorced in September 1989.

Imprecise input:
Madonna and Sean were married from 1985 through 1989.
Madonna and Sean were married four years in the late nineties.
Madonna and Sean got divorced in fall 1989.

Noisy input:
Madonna and Sean plan their wedding in summer 1985.
Madonna and Sean just returned from their honeymoon (in Jan 1986).
Madonna and Sean will be divorced by the end of the year (1989).
The marriage of Madonna and Sean will not survive this year (1987).

[Figure: evidence plotted over time]


3) Aggregate & Reconcile T-Fact Evidence

Real input:
…
Madonna and Sean were chased during their honeymoon … (Jan 19, 1986)
Madonna and her husband Sean opened the exhibition … (March 7, 1986)
Madonna and her husband Sean were seen at … (April 1, 1986)
Madonna and Sean met other couples at … (June 22, 1986)
Madonna and Sean plan to have children … (July 4, 1986)
Madonna and Sean would consider adopting a child … (July 14, 1986)
Sean and his wife Madonna purchase another castle in … (November 5, 1986)
...
Madonna and Sean think about getting divorced … (April 21, 1989)
The marriage of Madonna and Sean is in deep crisis … (May 11, 1989)
…

[Figure: evidence plotted over time]




3) Aggregate & Reconcile: Solution

[Figure: three histograms of evidence over time: event histogram (begin), state histogram (during), event histogram (end)]

• Classifier for t-fact observations: begin vs. during vs. end
• Build separate histogram for each class (and each t-fact)
• Combine histograms & derive high-confidence time scope
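
The histogram step can be sketched as follows, assuming the classifier has already labeled each observation as begin, during, or end. The observations below are invented, the resolution is one year, and the combination rule (mode of the begin and end histograms, falling back to the range of the during histogram) is a simple heuristic chosen for illustration, not the method of the cited papers.

```python
from collections import Counter


def build_histograms(observations):
    """One year histogram per class for a single t-fact."""
    histograms = {"begin": Counter(), "during": Counter(), "end": Counter()}
    for label, year in observations:
        histograms[label][year] += 1
    return histograms


def derive_scope(histograms):
    """Pick the most frequent begin and end years.

    If a class has no observations, fall back to the first or last year
    seen in the 'during' histogram.
    """
    def mode(counter):
        return counter.most_common(1)[0][0]

    during_years = sorted(histograms["during"])
    begin = mode(histograms["begin"]) if histograms["begin"] else during_years[0]
    end = mode(histograms["end"]) if histograms["end"] else during_years[-1]
    return begin, end


# Invented, already-classified observations for married(Madonna, Sean).
OBSERVATIONS = [
    ("begin", 1985), ("begin", 1985), ("begin", 1986),
    ("during", 1986), ("during", 1986), ("during", 1987), ("during", 1988),
    ("end", 1989), ("end", 1989),
]
```

Here `derive_scope(build_histograms(OBSERVATIONS))` selects 1985 and 1989: the single stray "begin" observation in 1986 is outvoted by the two in 1985.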


4) Joint Reasoning on Facts and T-Facts

constraint: marriedTo (m) is an injective function at any given point

∀ X, Y, Z, T1, T2: m(X,Y) ∧ m(X,Z) ∧ validTime(m(X,Y),T1) ∧ validTime(m(X,Z),T2) ⇒ ¬overlaps(T1, T2)

Combine & reconcile t-scopes across different facts

after grounding:
m(Carla, Nicolas) ∧ m(Cecilia, Nicolas) ⇒ ¬overlaps ([2008,2010], [1996,2007])
m(Carla, Nicolas) ∧ m(Carla, Benjamin) ⇒ ¬overlaps ([2008,2010], [2009,2011])

m(Ca,Nic) ∧ m(Ce,Nic) ⇒ ¬false
m(Ca,Nic) ∧ m(Ca,Ben) ⇒ ¬true


4) Joint Reasoning on Facts and T-Facts

[Figure: timeline showing the validity intervals of m(Ca, Ben), m(Ca, Nic), m(Ce, Nic), m(Ca, Mi), m(Ce, Mi)]

Conflict graph:
m(Ca, Ben) [2009,2011]
m(Ca, Nic) [2008,2010]
m(Ce, Nic) [1996,2007]
m(Ca, Mi) [2004,2008]
m(Ce, Mi) [1998,2005]

Find maximal independent set: subset of nodes w/o adjacent pairs, with (evidence-)weighted nodes


4) Joint Reasoning on Facts and T-Facts

[Figure: the same timeline and conflict graph, now with evidence weights 100, 20, 80, 30, 10 attached to the nodes]
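
The conflict graph and the weighted independent-set search can be sketched end to end for the five t-facts above. Two assumptions are made: intervals are closed at year resolution (so [2004,2008] and [2008,2010] overlap), and the pairing of the five weights with the five nodes is hypothetical, since only the weight values are known. The search is brute force, which is only feasible for such a tiny graph.

```python
from itertools import combinations

# (person1, person2) -> ((from_year, until_year), evidence_weight)
# Intervals are from the slide; the weight-to-node pairing is hypothetical.
T_FACTS = {
    ("Ca", "Ben"): ((2009, 2011), 20),
    ("Ca", "Nic"): ((2008, 2010), 100),
    ("Ce", "Nic"): ((1996, 2007), 80),
    ("Ca", "Mi"): ((2004, 2008), 30),
    ("Ce", "Mi"): ((1998, 2005), 10),
}


def overlaps(a, b):
    """Closed-interval overlap at year resolution."""
    return a[0] <= b[1] and b[0] <= a[1]


def conflict(f, g):
    """Two marriages conflict if they share a person and overlap in time."""
    shares_person = f[0] == g[0] or f[1] == g[1]
    return shares_person and overlaps(T_FACTS[f][0], T_FACTS[g][0])


def best_consistent_subset():
    """Brute-force maximum-weight independent set of the conflict graph."""
    facts = list(T_FACTS)
    best, best_weight = (), 0
    for size in range(1, len(facts) + 1):
        for subset in combinations(facts, size):
            if any(conflict(f, g) for f, g in combinations(subset, 2)):
                continue
            weight = sum(T_FACTS[f][1] for f in subset)
            if weight > best_weight:
                best, best_weight = subset, weight
    return set(best), best_weight
```

Under these assumptions the conflict graph is a path (Ca,Ben) - (Ca,Nic) - (Ca,Mi) - (Ce,Mi) - (Ce,Nic), and the heaviest conflict-free subset keeps m(Ca,Nic) and m(Ce,Nic) with total weight 180.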


4) Joint Reasoning on Facts and T-Facts

[Figure: timeline showing the validity intervals of m(Ca, Ben), m(Ca, Nic), m(Ce, Nic), m(Ca, Mi), m(Ce, Mi)]

alternative approach: split t-scopes and reason on consistency of t-fact partitions


Preliminary Results

• automatic extraction of t-facts about football/soccer from Wikipedia and news articles
• query answering by reasoning on t-facts

playsForTeam(X,Z)@T1 ∧ playsForTeam(Y,Z)@T2 ∧ overlaps (T1,T2) ⇒ teammates(X,Y)
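
The teammates rule can be evaluated directly over interval-annotated playsForTeam facts. The sketch below uses placeholder players and teams rather than real extraction results, and assumes closed year intervals; `teammates` is a hypothetical helper name.

```python
from itertools import combinations

# (player, team, (from_year, until_year)); placeholder data, not real t-facts.
PLAYS_FOR = [
    ("PlayerA", "TeamX", (2002, 2008)),
    ("PlayerB", "TeamX", (2007, 2010)),
    ("PlayerC", "TeamX", (2011, 2013)),
    ("PlayerA", "TeamY", (2009, 2012)),
]


def teammates(facts):
    """Unordered pairs of players with overlapping stints at the same team."""
    pairs = set()
    for (p1, team1, t1), (p2, team2, t2) in combinations(facts, 2):
        same_team = team1 == team2 and p1 != p2
        if same_team and t1[0] <= t2[1] and t2[0] <= t1[1]:
            pairs.add(frozenset((p1, p2)))
    return pairs
```

Only PlayerA and PlayerB qualify here: PlayerC joined TeamX after both had left, so sharing a team without overlapping time scopes does not make them teammates.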


Outline

...

Automatic KB Construction

Growing & Maintaining the KB

Temporal Knowledge

What and Why

Wrap-up


KB Building: Where Do We Stand?

Knowledge Bases on Entities & Classes
strong success story, some problems left:
• large taxonomies of classes with individual entities
• long tail calls for new methods
• entity disambiguation remains grand challenge

Relationships
good progress, but many challenges left:
• recall & precision by patterns & reasoning
• efficiency & scalability
• soft rules, hard constraints, richer logics, …
• open-domain discovery of new relation types

Temporal Knowledge
widely open (fertile) research ground:
• uncertain / incomplete temporal scopes of facts
• joint reasoning on base-facts and time-scopes


Overall Take-Home

...

Historic opportunity: revive Cyc vision, make it real & large-scale!
KB as enabler of macroscopic “machine reading”
challenging & risky, but high pay-off

Explore & exploit synergies between semantic, statistical, & social Web methods:
statistical evidence + logical consistency!

Many interesting research topics for CS (+ CoLi):
• efficiency & scalability
• constraints & reasoning on uncertain data
• NLP for temporal statements
• statistical ranking for semantic search
• knowledge-base life-cycle: growth & maintenance


Thank You !