Natural Language Semantics using Probabilistic Logic
Islam Beltagy
Doctoral Dissertation Proposal
Supervising Professors: Raymond J. Mooney, Katrin Erk
• Who is the second president of the US?
– A: “John Adams”
• Who is the president that came after the first US president?
– A: …
1. Semantic Representation: how the meaning of natural text is represented
2. Inference: how to draw conclusions from that semantic representation
2
Objective
• Find a “semantic representation” that is
– Expressive
– Supports automated inference
• Why? To support more NLP applications more effectively
– Question Answering, Automated Grading, Machine Translation, Summarization …
3
Outline
• Introduction
– Semantic representations
– Probabilistic logic
– Evaluation tasks
• Completed research
– Parsing and task representation
– Knowledge base construction
– Inference
– Evaluation
• Future work
4
Outline
• Introduction
– Semantic representations
– Probabilistic logic
– Evaluation tasks
• Completed research
– Parsing and task representation
– Knowledge base construction
– Inference
– Evaluation
• Future work
5
Semantic Representations - Formal Semantics
• Mapping natural language to some formal language (e.g. first-order logic) [Montague, 1970]
• “John is driving a car”
∃x,y,z. john(x) ∧ agent(y, x) ∧ drive(y) ∧ patient(y, z) ∧ car(z)
• Pros
– Deep representation: Relations, Negations, Disjunctions, Quantifiers ...
– Supports automated inference
• Cons: unable to handle uncertain knowledge. Why does this matter? Graded similarities like (pickle, cucumber) or (cut, slice) cannot be expressed
6
Semantic Representations - Distributional Semantics
• Similar words and phrases occur in similar contexts
• Use context to represent meaning
• Meanings are vectors in high-dimensional spaces
• Similarity measure for words and phrases
– e.g. similarity(“water”, “bathtub”) = cosine(water, bathtub)
• Pros: robust probabilistic model that captures a graded notion of similarity
• Cons: shallow representation of the semantics
7
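As a concrete illustration of the similarity measure above, cosine similarity can be computed directly from context vectors. The 3-dimensional vectors here are made up for illustration; real distributional models use vectors with thousands of dimensions built from corpus co-occurrence counts.

```python
import math

def cosine(u, v):
    # Cosine of the angle between two context vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical context-count vectors (illustrative values only).
water = [4.0, 1.0, 0.0]
bathtub = [2.0, 1.0, 1.0]
print(round(cosine(water, bathtub), 2))  # 0.89
```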
8
Proposed Semantic Representation
• Proposed semantic representation: Probabilistic Logic
– Combines advantages of:
– Formal Semantics (expressive + automated inference)
– Distributional Semantics (gradedness)
Outline
• Introduction
– Semantic representations
– Probabilistic logic
– Evaluation tasks
• Completed research
– Parsing and task representation
– Knowledge base construction
– Inference
– Evaluation
• Future work
9
10
Probabilistic Logic
• Statistical Relational Learning [Getoor and Taskar, 2007]
• Combine logical and statistical knowledge
• Provide a mechanism for probabilistic inference
• Use weighted first-order logic rules
– Weighted rules are soft rules (compared to hard logical constraints)
– Compactly encode complex probabilistic graphical models
• Inference: P(Q|E, KB)
• Markov Logic Networks (MLNs) [Richardson and Domingos, 2006]
• Probabilistic Soft Logic (PSL) [Kimmig et al., NIPS 2012]
Markov Logic Networks [Richardson and Domingos, 2006]
∀x. smoke(x) → cancer(x) | 1.5
∀x,y. friend(x,y) → (smoke(x) ↔ smoke(y)) | 1.1
• Two constants: Anna (A) and Bob (B)
• P(Cancer(Anna) | Friends(Anna,Bob), Smokes(Bob))
• Ground network nodes: Smokes(A), Smokes(B), Cancer(A), Cancer(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B)
11
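The grounding step that produces this network can be sketched by instantiating a rule's variables with every constant. This is a toy illustration of what an MLN engine such as Alchemy does internally, not its actual API:

```python
from itertools import product

constants = ["Anna", "Bob"]

# Ground the rule  friend(x,y) → (smoke(x) ↔ smoke(y))  over all
# pairs of constants: c^v groundings for v variables, c constants.
groundings = [
    f"friend({x},{y}) → (smoke({x}) ↔ smoke({y}))"
    for x, y in product(constants, repeat=2)
]
print(len(groundings))  # 4 ground clauses: 2 constants, 2 variables
```

With 2 constants and 2 variables there are 2² = 4 ground clauses; the same exponential c^v growth is what later motivates the modified closed-world assumption.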
PSL: Probabilistic Soft Logic [Kimmig et al., NIPS 2012]
• Probabilistic logic framework designed with efficient inference in mind
• Atoms have continuous truth values in the interval [0,1] (vs. Boolean atoms in MLNs)
• Łukasiewicz relaxation of AND, OR, NOT
– I(ℓ1 ∧ ℓ2) = max{0, I(ℓ1) + I(ℓ2) − 1}
– I(ℓ1 ∨ ℓ2) = min{1, I(ℓ1) + I(ℓ2)}
– I(¬ℓ1) = 1 − I(ℓ1)
• Inference: a linear program (vs. a combinatorial counting problem in MLNs)
13
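A minimal sketch of the three Łukasiewicz operators on [0,1] truth values:

```python
def l_and(a, b):
    # Łukasiewicz conjunction: max{0, a + b - 1}
    return max(0.0, a + b - 1.0)

def l_or(a, b):
    # Łukasiewicz disjunction: min{1, a + b}
    return min(1.0, a + b)

def l_not(a):
    # Łukasiewicz negation: 1 - a
    return 1.0 - a

print(l_and(0.7, 0.8))  # ≈ 0.5
print(l_or(0.7, 0.8))   # 1.0
print(l_not(0.7))       # ≈ 0.3
```

Because each operator is piecewise linear, a most-probable-interpretation query becomes a linear program over the atoms' truth values.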
Outline
• Introduction
– Semantic representations
– Probabilistic logic
– Evaluation tasks
• Completed research
– Parsing and task representation
– Knowledge base construction
– Inference
– Evaluation
• Future work
15
Evaluation Tasks
• Two tasks that require deep semantic understanding
1) Recognizing Textual Entailment (RTE) [Dagan et al., 2013]
– Given two sentences T and H, determine whether T Entails, Contradicts, or is unrelated (Neutral) to H
– Entailment: T: “A man is walking through the woods.”
H: “A man is walking through a wooded area.”
– Contradiction: T: “A man is jumping into an empty pool.”
H: “A man is jumping into a full pool.”
– Neutral: T: “A young girl is dancing.”
H: “A young girl is standing on one leg.”
16
Evaluation Tasks
• Two tasks that require deep semantic understanding
2) Semantic Textual Similarity (STS) [Agirre et al., 2012]
– Given two sentences S1, S2, judge their semantic similarity on a scale from 0 to 5
– S1: “A man is playing a guitar.”
S2: “A woman is playing the guitar.” (score: 2.75)
– S1: “A car is parking.”
S2: “A cat is playing.” (score: 0.00)
17
Outline
• Introduction
– Semantic representations
– Probabilistic logic
– Evaluation tasks
• Completed research
– Parsing and task representation
– Knowledge base construction
– Inference
– Evaluation
• Future work
18
System Architecture [Beltagy et al., *SEM 2013]
19
[Figure: system pipeline. (1) Parsing maps T/S1 and H/S2 to logical forms LF1 and LF2; (2) Knowledge Base Construction builds the KB; (3) Inference computes P(Q|E,KB) using MLN/PSL; (4) Task Representation maps the result to the RTE/STS output]
One advantage of using logic: modularity
Outline
• Introduction
– Semantic representations
– Probabilistic logic
– Evaluation tasks
• Completed research
– Parsing and task representation
– Knowledge base construction
– Inference
– Evaluation
• Future work
20
Parsing
• Mapping input sentences to logic form
• Using Boxer, a rule-based system on top of a CCG parser [Bos, 2008]
• “John is driving a car”
∃x,y,z. john(x) ∧ agent(y, x) ∧ drive(y) ∧ patient(y, z) ∧ car(z)
21
Task Representation [Beltagy et al., SemEval 2014]
• Represent all tasks as inferences of the form: P(Q|E, KB)
• RTE
– Two inferences: P(H|T, KB), P(¬H|T, KB)
– Use a classifier to map the probabilities to an RTE class
• STS
– Two inferences: P(S1|S2, KB), P(S2|S1, KB)
– Use regression to map the probabilities to an overall similarity score
22
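As a sketch of how the two RTE inference probabilities could be turned into a label: the system actually trains a classifier on the probabilities, so the fixed threshold `tau` below is a made-up illustration, not the trained decision rule.

```python
def rte_class(p_h, p_not_h, tau=0.9):
    # Illustrative decision rule over P(H|T,KB) and P(~H|T,KB).
    # The real system learns this mapping; tau is hypothetical.
    if p_h > tau:
        return "Entailment"
    if p_not_h > tau:
        return "Contradiction"
    return "Neutral"

print(rte_class(0.95, 0.02))  # Entailment
print(rte_class(0.03, 0.97))  # Contradiction
print(rte_class(0.40, 0.35))  # Neutral
```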
Domain Closure Assumption (DCA)
• There are no objects in the universe other than the named constants
– Constants need to be explicitly added
– Universal quantifiers do not behave as expected because of finite domain
– e.g. Tweety is a bird and it flies ⇏ All birds fly
• Quantifier cases for P(Q|E,KB):
– ∃ in E: Skolemization
– ∃ in Q: “All birds fly ⇏ Some birds fly”
– ∀ in Q: “Tweety is a bird. It flies ⇏ All birds fly”
– ¬∃: none needed in E; in Q: future work
23
• E: ∃x,y. john(x) ∧ agent(y, x) ∧ eat(y)
• Skolemized E: john(J) ∧ agent(T, J) ∧ eat(T)
• Embedded existentials
– E: ∀x. bird(x) → ∃y. agent(y, x) ∧ fly(y)
– Skolemized E: ∀x. bird(x) → agent(f(x), x) ∧ fly(f(x))
– Simulate skolem functions: ∀x,y. bird(x) ∧ skolemf(x,y) → agent(y, x) ∧ fly(y)
– skolemf(B1, C1), skolemf(B2, C2) …
24
• E: ∀x. bird(x) → ∃y. agent(y, x) ∧ fly(y)
• Q: ∃x,y. bird(x) ∧ agent(y, x) ∧ fly(y) (false)
• Solution: introduce additional evidence bird(B)
– Pragmatically, birds exist (Existence)
• Negated existential
– E: ¬∃x,y. bird(x) ∧ agent(y, x) ∧ fly(y)
– No additional constants needed
25
• E: bird(T) ∧ agent(F, T) ∧ fly(F)
• Q: ∀x. bird(x) → ∃y. agent(y, x) ∧ fly(y) (true)
• Universal quantifiers work only on the constants of the given finite domain
• Solution: add an extra bird constant to the domain
– If the new bird can be shown to fly, then there is an explicit universal quantification in E
26
Outline
• Introduction
– Semantic representations
– Probabilistic logic
– Evaluation tasks
• Completed research
– Parsing and task representation
– Knowledge base construction
– Inference
– Evaluation
• Future work
27
Knowledge Base Construction
• Represent background knowledge as weighted inference rules
1) WordNet rules
– WordNet: lexical database of words and their semantic relations
– Synonyms: ∀x. man(x) ↔ guy(x) | w = ∞
– Hyponyms: ∀x. car(x) → vehicle(x) | w = ∞
– Antonyms: ∀x. tall(x) → ¬short(x) | w = ∞
28
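Generating the WordNet rules above can be sketched with simple per-relation templates. The rule-string syntax and the assumption that synonymy yields a bidirectional rule follow the slide; both are illustrative rather than the system's exact implementation.

```python
# One rule template per WordNet relation; all are hard rules (w = ∞).
RELATION_TEMPLATES = {
    "synonym": "∀x. {a}(x) ↔ {b}(x) | w = ∞",
    "hyponym": "∀x. {a}(x) → {b}(x) | w = ∞",
    "antonym": "∀x. {a}(x) → ¬{b}(x) | w = ∞",
}

def wordnet_rule(relation, a, b):
    # Instantiate the template for one related word pair.
    return RELATION_TEMPLATES[relation].format(a=a, b=b)

print(wordnet_rule("hyponym", "car", "vehicle"))
print(wordnet_rule("antonym", "tall", "short"))
```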
29
Knowledge Base Construction
• Represent background knowledge as weighted inference rules
2) Distributional rules (on-the-fly rules)
– For all pairs of words (a, b) where a ∈ T/S1 and b ∈ H/S2, generate the rule
∀x. a(x) → b(x) | f(w)
– w = cosine(a, b)
– f(w) = log(w / (1 − w))
– e.g. ∀x. hamster(x) → gerbil(x) | f(w)
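The mapping f(w) = log(w / (1 − w)) from cosine similarity to rule weight is the logit function; a small sketch:

```python
import math

def rule_weight(cosine_sim):
    # Map a cosine similarity in (0, 1) to a rule weight via the
    # logit function f(w) = log(w / (1 - w)).
    return math.log(cosine_sim / (1.0 - cosine_sim))

print(rule_weight(0.5))  # 0.0: similarity 0.5 gives a neutral weight
print(rule_weight(0.9))  # positive: high similarity, strong rule
print(rule_weight(0.1))  # negative: low similarity, penalized rule
```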
Outline
• Introduction
• Completed research
– Parsing and task representation
– Knowledge base construction
– Inference
• RTE using MLNs
• STS using MLNs
• STS using PSL
– Evaluation
• Future work
30
Inference
• Inference problem: P(Q|E, KB)
• Solve it using MLNs and PSL for RTE and STS
– RTE using MLNs
– STS using MLNs
– STS using PSL
31
Outline
• Introduction
• Completed research
– Parsing and task representation
– Knowledge base construction
– Inference
• RTE using MLNs
• STS using MLNs
• STS using PSL
– Evaluation
• Future work
32
MLNs for RTE - Query Formula (QF) [Beltagy and Mooney, StarAI 2014]
• Alchemy (the standard MLN implementation) calculates only probabilities of ground atoms
• Our inference algorithm supports query formulas
– P(Q|R) = Z(Q ∪ R) / Z(R) [Gogate and Domingos, 2011]
– Z: normalization constant of the probability distribution
– Estimate the partition function Z using SampleSearch [Gogate and Dechter, 2011]
– SampleSearch estimates the partition function Z of mixed graphical models (probabilistic and deterministic)
33
MLNs for RTE - Modified Closed-world (MCW) [Beltagy and Mooney, StarAI 2014]
• MLN’s grounding generates very large graphical models
• Q has O(c^v) ground clauses
– v: number of variables in Q
– c: number of constants in the domain
34
MLNs for RTE - Modified Closed-world (MCW) [Beltagy and Mooney, StarAI 2014]
• Low priors: by default, ground atoms have very low probabilities, unless shown otherwise through inference
• Example
– E: man(M) ∧ agent(D, M) ∧ drive(D)
– Priors: ∀x. man(x) | −2, ∀x. guy(x) | −2, ∀x. drive(x) | −2
– KB: ∀x. man(x) → guy(x) | 1.8
– Q: ∃x,y. guy(x) ∧ agent(y, x) ∧ drive(y)
– Ground atoms: man(M), man(D), guy(M), guy(D), drive(M), drive(D)
• Solution: an MCW assumption to eliminate unimportant ground atoms
– Eliminate atoms not reachable from the evidence (evidence propagation)
– A strict version of low priors
– Dramatically reduces the size of the problem
35
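A toy sketch of the evidence-propagation idea, deliberately simplified to unary atoms (the real MCW operates on the ground network): on the slide's example it keeps man(M), guy(M), drive(D) and drops the other three ground atoms.

```python
# Keep a ground atom p(c) only if predicate p is reachable through
# KB rules from an evidence atom over the same constant c.
rules = {"man": {"guy"}}                    # KB: man(x) → guy(x)
evidence = [("man", "M"), ("drive", "D")]   # E: man(M), drive(D)

def reachable_preds(start, rules):
    # Transitive closure of predicates reachable via KB rules.
    seen, stack = {start}, [start]
    while stack:
        p = stack.pop()
        for q in rules.get(p, ()):
            if q not in seen:
                seen.add(q)
                stack.append(q)
    return seen

kept = {(p, const) for pred, const in evidence
        for p in reachable_preds(pred, rules)}
print(sorted(kept))  # [('drive', 'D'), ('guy', 'M'), ('man', 'M')]
```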
Outline
• Introduction
• Completed research
– Parsing and task representation
– Knowledge base construction
– Inference
• RTE using MLNs
• STS using MLNs
• STS using PSL
– Evaluation
• Future work
36
MLNs for STS [Beltagy et al., *SEM 2013]
• Strict conjunction in Q does not fit STS
– E: “A man is driving”: ∃x,y. man(x) ∧ drive(y) ∧ agent(y, x)
– Q: “A man is driving a bus”: ∃x,y,z. man(x) ∧ drive(y) ∧ agent(y, x) ∧ bus(z) ∧ patient(y, z)
• Break Q into mini-clauses, then combine their evidence using an averaging combiner [Natarajan et al., 2010]
∀x,y,z. man(x) ∧ agent(y, x) → result(x,y,z) | w
∀x,y,z. drive(y) ∧ agent(y, x) → result(x,y,z) | w
∀x,y,z. drive(y) ∧ patient(y, z) → result(x,y,z) | w
∀x,y,z. bus(z) ∧ patient(y, z) → result(x,y,z) | w
37
Outline
• Introduction
• Completed research
– Parsing and task representation
– Knowledge base construction
– Inference
• RTE using MLNs
• STS using MLNs
• STS using PSL
– Evaluation
• Future work
38
PSL for STS [Beltagy, Erk, and Mooney, ACL 2014]
• As with MLNs, strict conjunction in Q does not fit STS
• Replace conjunctions in Q with an average
– I(ℓ1 ∧ … ∧ ℓn) = avg(I(ℓ1), …, I(ℓn))
• Inference
– “average” is a linear function
– No changes to the optimization problem
– Heuristic grounding (details omitted)
39
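The effect of replacing strict conjunction with an average can be seen on a toy query where one literal has truth value 0 (e.g. the unsupported "bus" in the earlier driving example):

```python
def lukasiewicz_and(values):
    # Strict Łukasiewicz conjunction over a list of truth values:
    # max{0, sum - (n - 1)}.
    return max(0.0, sum(values) - (len(values) - 1))

def avg_combiner(values):
    # STS replacement: averaging, so one low-truth literal no
    # longer zeroes out the whole query.
    return sum(values) / len(values)

truths = [1.0, 1.0, 1.0, 0.0]  # three supported literals, one not
print(lukasiewicz_and(truths))  # 0.0: strict conjunction gives no credit
print(avg_combiner(truths))     # 0.75: partial match still scores
```

Because the average is linear in its arguments, swapping it in leaves the PSL inference problem a linear program.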
Outline
• Introduction– Semantic representations
– Probabilistic logic
– Evaluation tasks
• Completed research– Parsing and task representation
– Knowledge base construction
– Inference
– Evaluation
• Knowledge Base
• Inference
• Future work
40
41
Evaluation - Datasets
• SICK (RTE and STS) [SemEval 2014]
– “Sentences Involving Compositional Knowledge”
– 10,000 pairs of sentences
• msr-vid (STS) [SemEval 2012]
– Microsoft video description corpus
– 1,500 pairs of short video descriptions
• msr-par (STS) [SemEval 2012]
– Microsoft paraphrase corpus
– 1,500 pairs of long news sentences
Outline
• Introduction– Semantic representations
– Probabilistic logic
– Evaluation tasks
• Completed research– Parsing and task representation
– Knowledge base construction
– Inference
– Evaluation
• Knowledge Base
• Inference
• Future work
42
Evaluation – Knowledge Base
• logic+kb outperforms logic alone, which outperforms dist
• PSL performs much better than MLNs on the STS task
43
Evaluation – Error analysis of RTE
• Our system’s accuracy: 77.72%
• The remaining 22.28% are
– Entailment pairs classified as Neutral: 15.32%
– Contradiction pairs classified as Neutral: 6.12%
– Other: 0.84%
• System precision: 98.9%, recall: 78.56%
– High precision with low recall is the typical behavior of logic-based systems
• Fixes (future work)
– Larger knowledge base
– Fix some limitations in the detection of contradictions
44
Outline
• Introduction– Semantic representations
– Probabilistic logic
– Evaluation tasks
• Completed research– Parsing and task representation
– Knowledge base construction
– Inference
– Evaluation
• Knowledge Base
• Inference
• Future work
45
Evaluation – Inference (RTE) [Beltagy and Mooney, StarAI 2014]
• Dataset: SICK (from SemEval 2014)
• Systems compared
– mln: using Alchemy out-of-the-box
– mln+qf: our algorithm to calculate probability of query formula
– mln+mcw: mln with our modified-closed-world
– mln+qf+mcw: both components
System | Accuracy | CPU Time | Timeouts (30 min)
mln | 57% | 2min 27sec | 96%
mln+qf | 69% | 1min 51sec | 30%
mln+mcw | 66% | 10sec | 2.5%
mln+qf+mcw | 72% | 7sec | 2.1%
46
Evaluation – Inference (STS) [Beltagy, Erk, and Mooney, ACL 2014]
• Compare MLN with PSL on the STS task
Dataset | PSL time | MLN time | MLN timeouts (10 min)
msr-vid | 8s | 1m 31s | 9%
msr-par | 30s | 11m 49s | 97%
SICK | 10s | 4m 24s | 36%
• Apply MCW to MLNs for a fairer comparison, since PSL already uses lazy grounding
47
Outline
• Introduction
– Semantic representations
– Probabilistic logic
– Evaluation tasks
• Completed research
– Parsing and task representation
– Knowledge base construction
– Inference
– Evaluation
• Future work
– Short Term
– Long Term
48
49
Future Work – RTE Task formulation
1) Better detection of contradictions
– Example where the current P(H|T) fails
• T: “No man is playing a flute”
• H: “A man is playing a large flute”
– Detection of contradiction:
• T ∧ H ⇒ false (from the logic point of view)
• T ⇒ ¬H, probabilistic: P(¬H|T) = 1 − P(H|T) (useless)
• H ⇒ ¬T, probabilistic: P(¬T|H)
50
Future Work – RTE Task formulation
2) Using ratios
– P(H|T) / P(H)
– P(T|H) / P(T)
• Q: ¬∃x,y. bird(x) ∧ agent(y, x) ∧ fly(y)
• Because of the closed-world assumption, Q comes out true regardless of E
• Need Q to be true only when Q is explicitly stated in E
• Solution
– Add the negation of Q to the MLN with a high, but not infinite, weight
– R: bird(B) ∧ agent(F, B) ∧ fly(F) | w = 5
– P(Q|R) = 0
– P(Q|E, R) = 0 unless E negates R
51
• Quantifier cases for P(Q|E,KB):
– ∃ in E: Skolemization
– ∃ in Q: “All birds fly ⇏ Some birds fly”
– ∀ in Q: “Tweety is a bird. It flies ⇏ All birds fly”
– ¬∃ in Q: Future Work: DCA
52
Future Work – Knowledge Base Construction
1) Precompiled rules from paraphrase collections like PPDB [Ganitkevitch et al., NAACL 2013]
– e.g. “solves ⇒ finds a solution to”: ∀e,x. solve(e) ∧ patient(e,x) → ∃s. find(e) ∧ patient(e,s) ∧ solution(s) ∧ to(s,x) | w
– Variable binding is not trivial
• Templates
• Difference between logical expressions of the sentence with and without the rule applied
53
Future Work – Knowledge Base Construction
2) Phrasal distributional rules
– Use linguistically motivated templates like
• Noun-phrase ⇒ noun-phrase
∀x. little(x) ∧ kid(x) → smart(x) ∧ boy(x) | w
• Subject-verb-object ⇒ subject-verb-object
∀x,y,z. man(x) ∧ agent(y,x) ∧ drive(y) ∧ patient(y,z) ∧ car(z) → guy(x) ∧ agent(y,x) ∧ ride(y) ∧ patient(y,z) ∧ bike(z) | w
54
Future Work – Inference
1) Better MLN inference with query formula
– Currently, we estimate Z(Q ∪ R) and Z(R) in two separate runs
– Combine both runs into one that exploits:
• Similarities between Q ∪ R and R
• The fact that we only need the ratio, not the absolute values
55
Future Work – Inference
2) Generalized modified closed-world assumption
– It is not clear how to propagate evidence through a rule like ∀x. dead(x) → ¬live(x)
– MCW needs to be generalized to arbitrary MLNs
– Find and eliminate ground atoms whose marginal probability equals their prior probability
56
Future Work – Weight Learning
• One of the following
– Weight learning for inference rules
• Learn a better mapping from the weights we have on resources to MLN weights
• Learn how to weight rules from different resources differently
– Weight learning for the STS task
• Weight different parts of the sentence differently
• e.g. “black dog” is more similar to “white dog” than to “black cat”
Outline
• Introduction
• Completed research
– Parsing and task representation
– Knowledge base construction
– Inference
– Evaluation
• Future work
– Short Term
– Long Term
57
Long-term Future Work
1) Question Answering
– Our semantic representation is general and expressive, so apply it to more tasks
– Given a query, find an answer for it from large corpus of unstructured text
– Inference finds the best filler for the existentially quantified query
– Efficient inference is the bottleneck
2) Generalized Quantifiers
– Few, most, many, … are not natively supported in first-order logic
– Add support for them using
• Checking monotonicity
• Representing Few and Most as weighted universally quantified rules
58
Long-term Future Work
3) Contextualize WordNet Rules
– Use Word Sense Disambiguation, then generate weighted inference rules from WordNet
4) Other Languages
– Theoretically, this is a language independent semantic representation
– Practically, the required resources are not available, especially CCGbanks for training parsers and tools like Boxer
5) Inference Inspector
– Visualize the inference process and highlight most effective rules
– Not trivial in MLN because all rules affect the final result to some extent
59
60
Conclusion
• Probabilistic logic for semantic representation
– Expressivity, automated inference, and gradedness
– Evaluation on RTE and STS
– Formulating tasks as probabilistic logic inferences
– Building a knowledge base
– Performing inference efficiently based on the task
• Short-term future work: enhance the formulation of the RTE task, build a bigger knowledge base from more resources, generalize the modified closed-world assumption, enhance our MLN inference algorithm, and use weight learning
• Long-term future work: apply our semantic representation to the question answering task, support generalized quantifiers, contextualize WordNet rules, apply the representation to other languages, and implement a probabilistic logic inference inspector
Thank You