
Joint work with Sameer Singh, Michael Wick, Karl Schultz, Sebastian Riedel, Limin Yao, Aron Culotta.

Andrew McCallum

Department of Computer Science, University of Massachusetts Amherst

Probabilistic Programming and Probabilistic Databases with Imperatively-defined Factor Graphs

Goal

Build models that mine actionable knowledge from unstructured text.

Entities and Relations

[Diagram: entity types in the research literature — Research Paper, Person, Grant, Venue, University, Groups — linked by relations such as Cites and Expertise.]

Extracted Database

• 8 million research papers

• 2 million authors

• 400k grants, 90k institutions, 10k venues

[Pipeline diagram: gather raw data → Text → Extraction → Mentions → Resolution → Entities → query / answer.]

Combining ambiguous evidence. Applying probabilistic modeling to large data.

[Word-cloud slide: neighboring concerns — scalability, parallelism, algorithms, data structures, software engineering, bio/medical informatics, information extraction, computer vision, scientific data modeling, computational statistics, ...]

Rich model structure: spatio-temporal, hierarchical, relational, infinite.

Implementing the new model is a significant task.

Bayesian Network: Directed, Generative Graphical Models

[Graph over variables a, b, c, d, e]

p(a, b, c, d, e) = p(a) p(b|a) p(c|a) p(d|a, c) p(e|b, c)

Markov Random Field: Undirected Graphical Models, aka Markov Network

[Graph over variables a, b, c, d, e]

With a clique factor over {a, c, d}:

  p(a, b, c, d, e) = (1/Z) φ(a, c, d) φ(a, b) φ(b, e) φ(d, e)

With pairwise factors only:

  p(a, b, c, d, e) = (1/Z) φ(a, c) φ(a, d) φ(c, d) φ(a, b) φ(b, e) φ(d, e)

The same undirected graph is consistent with both factorizations; the graph alone does not distinguish them.

Factor Graph: Can represent both directed and undirected graphical models

[Factor graph over variables a, b, c, d, e, with square factor nodes making the factorization explicit]

  p(a, b, c, d, e) = (1/Z) φ(a, c, d) φ(a, b) φ(b, e) φ(d, e)

  p(a, b, c, d, e) = (1/Z) φ(a, c) φ(a, d) φ(c, d) φ(a, b) φ(b, e) φ(d, e)

Unlike the undirected graph, the factor graph distinguishes these two factorizations explicitly.

Conditional Random Field (CRF): Undirected graphical model, conditioned on some data variables  [Lafferty, McCallum, Pereira 2001]

[Graph: input observed variables x (e.g., c, d, e); output predicted variables y (e.g., a, b)]

  p(y|x) = (1/Z_x) ∏_f φ(x_f, y_f),   where   φ(x_t, y_t) = exp( Σ_k λ_k f_k(x_t, y_t) )

+ Tremendous freedom to use arbitrary features of the input.
+ Predict multiple dependent variables ("structured output").

Relational Graphical Model: Relational = repeated structure of data & factors

[Graph over variables a, b, d, e, f, g, h, i with repeated factor structure: Factor Template #1 instantiates a factor for each (x1, y1) pair; Factor Template #2 for each adjacent (y1, y2) pair.]

Information Extraction with Linear-chain CRFs

Finite state model: states s1 ... s8 over "Today Morgan Stanley Inc announced Mr. Friday's appointment.", with states for person name, organization name, and background.

Graphical model: a state (label) sequence over an observation (token) sequence.

State-of-the-art predictive accuracy on many tasks. The logistic-regression analogue of a hidden Markov model.
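For reference, the linear-chain case of the general CRF equation above can be written out in the standard form of Lafferty et al. (2001), restated here in LaTeX:

p(\mathbf{y} \mid \mathbf{x}) \;=\; \frac{1}{Z_{\mathbf{x}}} \prod_{t=1}^{T} \exp\!\Big(\sum_{k} \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t)\Big),
\qquad
Z_{\mathbf{x}} \;=\; \sum_{\mathbf{y}'} \prod_{t=1}^{T} \exp\!\Big(\sum_{k} \lambda_k f_k(y'_{t-1}, y'_t, \mathbf{x}, t)\Big)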

Outline
• Motivate software engineering for statistics
• Graphical models for Extraction & Integration
  - Extraction (linear-chain CRFs)
  - Information Integration (really hairy CRFs, MCMC, SampleRank)
• Probabilistic Programming: FACTORIE
• Example
• Relation Extraction (cross-document, w/out labeled data)
• Probabilistic Programming inside a DB
• Ongoing Work

Information Integration

Database A (Schema A):
  First Name | Last Name | Contact
  J.         | Smith     | 222-444-1337
  J.         | Smith     | 444 1337
  John       | Smith     | (1) 4321115555

Database B (Schema B):
  Name          | Phone
  John Smith    | U.S. 222-444-1337
  John D. Smith | 444 1337
  J Smiht       | 432-111-5555

Schema Matching:
  Schema A               → Schema B
  First Name, Last Name  → Name
  Contact                → Phone

Coreference (mention clusters):
  John #1         | John #2
  J. Smith        | John Smith
  J. Smith        | J Smiht
  John Smith      |
  John D. Smith   |

Canonicalization: Normalized DB
  Entity# | Name          | Phone
  523     | John Smith    | 222-444-1337
  524     | John D. Smith | 432-111-5555
  …       | …             | …

Schema Matching, Coreference and Canonicalization: one joint factor graph

[Factor graph: mention variables x1 ... x5 with pairwise match variables y_ij (factors f_ij) and cluster-wise variables y_i (factors f_i) for coreference; attribute variables x6 ... x8 built the same way for schema matching.]

  P(Y|X) = (1/Z_X) ∏_{y_i ∈ Y} ψ_w(y_i, x_i) ∏_{y_i, y_j ∈ Y} ψ_b(y_ij, x_ij)

  ψ(y_i, x_i) = exp( Σ_k λ_k f_k(y_i, x_i) )

Coreference side of the figure:
• x1 is a set of mentions {J. Smith, John, John Smith}
• x2 is a set of mentions {Amanda, A. Jones}
• f12 is a factor between x1/x2
• y12 is a binary variable indicating a match (no)
• f1 is a factor over cluster x1
• y1 is a binary variable indicating a match (yes)
• Entity/attribute factors omitted for clarity
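To make ψ concrete, here is a minimal, self-contained Scala sketch of scoring a candidate match with a log-linear factor ψ(y, x) = exp(Σ_k λ_k f_k(y, x)). This is not FACTORIE's API; the feature functions, weights, and the match/non-match parameterization are illustrative assumptions:

object PairwiseFactorSketch {
  // Illustrative feature functions over a pair of mention strings;
  // each returns 1.0 when the feature fires, else 0.0.
  val features: List[(String, String) => Double] = List(
    (a, b) => if (a == b) 1.0 else 0.0,                                  // exact match
    (a, b) => if (a.split(" ").last == b.split(" ").last) 1.0 else 0.0,  // same last token
    (a, b) => if (a.head == b.head) 1.0 else 0.0                         // same first initial
  )
  val lambdas = Array(2.0, 1.0, 0.5) // hypothetical learned weights

  // psi(y, x) = exp(sum_k lambda_k f_k(y, x)); y = false negates the score,
  // one common parameterization, assumed here for the sketch.
  def psi(isMatch: Boolean, a: String, b: String): Double = {
    val dot = features.zip(lambdas).map { case (f, l) => l * f(a, b) }.sum
    math.exp(if (isMatch) dot else -dot)
  }

  def main(args: Array[String]): Unit = {
    println(psi(isMatch = true, "J. Smith", "John Smith")) // high: last token + initial fire
    println(psi(isMatch = true, "J. Smith", "A. Jones"))   // low: no features fire
  }
}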

Schema-matching side of the same figure:
• x6 is a set of attributes {phone, contact, telephone}
• x7 is a set of attributes {last name, last name}
• f67 is a factor between x6/x7
• y67 is a binary variable indicating a match (no)
• f7 is a factor over cluster x7
• y7 is a binary variable indicating a match (yes)

Really Hairy! How to do:
• parameter estimation
• inference

• Most methods require calculating the gradient of the log-likelihood P(y1, y2, y3, ... | x1, x2, x3, ...) ...
• ... which in turn requires "expectations of marginals," P(y1 | x1, x2, x3, ...).
• But getting marginal distributions can be difficult.
• Alternative: Perceptron. Approximate the gradient from the difference between the true output and the model's predicted best output.
• But even finding the model's predicted best output is expensive.
• We propose: SampleRank [Culotta, Wick, Hall, McCallum, HLT 2007]. Learn to rank intermediate solutions: P(y1=1, y2=0, y3=1, ... | ...) > P(y1=0, y2=0, y3=1, ... | ...)

Parameter Estimation in Large State Spaces: Metropolis-Hastings + SampleRank

SampleRank: Metropolis-Hastings for MAP

Given a factor graph with target variables y and observed variables x.

Maximum a Posteriori (MAP) inference: argmax_{y ∈ F} P(Y = y | x)¹

... over a model

  P(y|x) = (1/Z_x) ∏_i ψ(x, y_i)

... using a proposal distribution q(y′|y) : F × F → [0, 1].

MH for MAP:
1. Begin with some initial configuration y0 ∈ F.
2. For i = 1, 2, 3, ..., draw a local modification y′ ∈ F from q.
3. Probabilistically accept the jump as a Bernoulli draw with parameter α:

  α = min( 1, [p(y′)/p(y)] · [q(y|y′)/q(y′|y)] )

¹ F is the feasible region defined by deterministic constraints, e.g., clustering consistency, parse-tree projectivity.

Can do MAP inference (rather than sampling) by annealing: run the same loop while decreasing a temperature on the ratio of p(y)'s, so that over time only improving jumps are accepted.
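The MH-for-MAP loop is short enough to sketch directly. A minimal, generic Scala version, assuming a log-space unnormalized model score, a symmetric proposal (so the q-ratio is 1), a toy objective, and a simple geometric cooling schedule — none of this is FACTORIE's actual API:

import scala.util.Random

object MhForMapSketch {
  val rng = new Random(42)

  // log of an unnormalized model score, i.e. log p(y) up to a constant;
  // toy objective over a binary vector: prefer alternating values.
  def score(y: Vector[Int]): Double =
    y.sliding(2).count { case Vector(a, b) => a != b; case _ => false }.toDouble

  // Symmetric proposal: flip one coordinate, so q(y|y')/q(y'|y) = 1.
  def propose(y: Vector[Int]): Vector[Int] = {
    val i = rng.nextInt(y.length)
    y.updated(i, 1 - y(i))
  }

  def main(args: Array[String]): Unit = {
    var y = Vector.fill(10)(0)
    var temperature = 1.0
    for (_ <- 1 to 5000) {
      val yPrime = propose(y)
      // alpha = min(1, (p(y')/p(y))^(1/T)); lowering T anneals toward MAP.
      val alpha = math.min(1.0, math.exp((score(yPrime) - score(y)) / temperature))
      if (rng.nextDouble() < alpha) y = yPrime
      temperature = math.max(0.01, temperature * 0.999) // cooling schedule
    }
    println(s"MAP-ish configuration: $y  score=${score(y)}")
  }
}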

M-H Natural Efficiencies

1. The partition function cancels:

  p(y′)/p(y) = [ (1/Z_x) ∏_{y′_i ∈ y′} ψ(x, y′_i) ] / [ (1/Z_x) ∏_{y_i ∈ y} ψ(x, y_i) ]
             = ∏_{y′_i ∈ y′} ψ(x, y′_i) / ∏_{y_i ∈ y} ψ(x, y_i)

2. Unchanged factors cancel:

             = ∏_{y′_i ∈ δy′} ψ(x, y′_i) / ∏_{y_i ∈ δy} ψ(x, y_i)

where δy is the "diff", i.e., the variables in y that have changed.

How to learn parameters for p(y|x)? SampleRank: Ranking Intermediate Solutions

Example: a chain of proposed configurations; an UPDATE occurs whenever the model's ranking of consecutive configurations disagrees with the truth's (the signs of ∆ Model and ∆ Truth differ):

  Step | ∆ Model | ∆ Truth |
  2    |   -23   |  -0.2   | agree
  3    |   +10   |  -0.1   | → UPDATE
  4    |   -10   |  -0.1   | agree
  5    |    -3   |  +0.3   | → UPDATE

• Like Perceptron: proof of convergence under marginal separability.
• More constrained than Maximum Likelihood: parameters must correctly rank incorrect solutions!
• Very fast to train.
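A minimal sketch of the SampleRank update rule in generic Scala: a perceptron-style update on the two configurations' feature difference, applied only when the model's ranking disagrees with the truth's. The feature vector and truth-distance function are illustrative assumptions, not the paper's code:

object SampleRankSketch {
  type Config = Vector[Int]

  // Illustrative feature vector of a configuration.
  def phi(y: Config): Array[Double] =
    Array(y.sum.toDouble,
          y.sliding(2).count { case Vector(a, b) => a == b; case _ => false }.toDouble)

  def modelScore(w: Array[Double], y: Config): Double =
    w.zip(phi(y)).map { case (wi, fi) => wi * fi }.sum

  // Hypothetical ground-truth quality: negative Hamming distance to the labeled answer.
  def truthScore(y: Config, gold: Config): Double =
    -y.zip(gold).count { case (a, b) => a != b }.toDouble

  // One SampleRank step: if model and truth rank (y, yPrime) differently,
  // do a perceptron update toward the truth-preferred configuration.
  def update(w: Array[Double], y: Config, yPrime: Config, gold: Config, lr: Double = 1.0): Unit = {
    val dModel = modelScore(w, yPrime) - modelScore(w, y)
    val dTruth = truthScore(yPrime, gold) - truthScore(y, gold)
    if (dModel * dTruth < 0) {   // rankings disagree -> UPDATE
      val (better, worse) = if (dTruth > 0) (yPrime, y) else (y, yPrime)
      val grad = phi(better).zip(phi(worse)).map { case (b, c) => b - c }
      for (k <- w.indices) w(k) += lr * grad(k)
    }
  }
}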

Comparison to Contrastive Divergence

[Diagram: Contrastive Divergence, n=2 [Hinton 2002], Persistent Contrastive Divergence [Tieleman 2008], and SampleRank each gather the sufficient statistics for their update from different points along the chain of proposals, relative to the truth.]

Comparison to LASO [Daumé, Langford, Marcu '05]

SampleRank scores possible worlds. LASO scores "actions" (transitions).

• No concern about the generation ordering of output variables.
• Defines a standard factor-graph score on a possible world.
• Can get marginal probability distributions.

SampleRank on Coreference

• ACE 2004. All nouns: 28,122 mentions, 14,047 entities (e.g., he, the President, Clinton, Mrs. Clinton, Washington).

B³ accuracy:
  2005 Ng                                  69.5%
  2007 Culotta, Wick, Hall, McCallum       79.3%
  2008 Bengtson, Roth                      80.8%
  2009 Wick, McCallum (MCMC + SampleRank)  81.5%
  Contrastive Divergence                   75.1%
  Persistent Contrastive Divergence        74.9%
  Perceptron                               76.3%


Dataset

• Faculty and alumni listings from university websites, plus an IE system
• 9 different database schemas
• ~1400 mentions, 294 coreferent

Example schemas:
  DEX IE       | Northwestern Fac   | UPenn Fac
  First Name   | Name               | Name
  Middle Name  | Title              | First Name
  Last Name    | PhD Alma Mater     | Last Name
  Title        | Research Interests | Job+Department
  Department   | Office Address     |
  Company Name | E-mail             |
  Home Phone   |                    |
  Office Phone |                    |
  Fax Number   |                    |
  E-mail       |                    |

Coreference Results

                |  Pair                 |  MUC
                |  F1    Prec   Recall  |  F1    Prec   Recall
  No Canon ISO  |  72.7  88.9   61.5    |  75.0  88.9   64.9
  No Canon CASC |  64.0  66.7   61.5    |  65.7  66.7   64.9
  No Canon JOINT|  76.5  89.7   66.7    |  78.8  89.7   70.3
  Canon ISO     |  78.3  90.0   69.2    |  80.6  90.0   73.0
  Canon CASC    |  65.8  67.6   64.1    |  67.6  67.6   67.6
  Canon JOINT   |  81.7  90.6   74.4    |  84.1  90.6   74.4

~15% error reduction from the joint model.
ISO = isolated, CASC = cascade, JOINT = joint inference; Canon = with canonicalization.

Schema Matching Results

                |  Pair                 |  MUC
                |  F1    Prec   Recall  |  F1    Prec   Recall
  No Canon ISO  |  50.9  40.9   67.5    |  69.2  81.8   60.0
  No Canon CASC |  50.9  40.9   67.5    |  69.2  81.8   60.0
  No Canon JOINT|  68.9  100    52.5    |  69.6  100    53.3
  Canon ISO     |  50.9  40.9   67.5    |  69.2  81.8   60.0
  Canon CASC    |  52.3  41.8   70.0    |  74.1  83.3   66.7
  Canon JOINT   |  71.0  100    55.0    |  75.0  100    60.0

~40% error reduction from the joint model.
ISO = isolated, CASC = cascade, JOINT = joint inference; Canon = with canonicalization.

Really Hairy! How to do:
• parameter estimation
• inference
• software engineering

Outline
• Motivate software engineering for statistics
• Graphical models for Extraction & Integration
  - Extraction (linear-chain CRFs)
  - Information Integration (really hairy CRFs, MCMC, SampleRank)
• Probabilistic Programming: FACTORIE
• Example
• Relation Extraction (cross-document, w/out labeled data)
• Probabilistic Programming inside a DB
• Ongoing Work

Probabilistic Programming Languages

• Make it easy to specify rich, complex models, using the full power of programming languages:
  - data structures
  - control mechanisms
  - abstraction
• Inference implementation comes for free

Provide a language to easily create new models.

Small Sampling of Probabilistic Programming Languages

• Explicit directed graph: BUGS
• Functional: IBAL, Church
• Object-oriented: Figaro, Infer.NET
• Logic-based: Markov logic, BLOG, PRISM

Declarative Model Specification

• One of the biggest advances in the Artificial Intelligence community.
• Gone too far? Much domain knowledge is also procedural.
• Logic + Probability → Imperative + Probability
  - Rising interest: Church, Infer.NET, ...

Imperative tools for creating a Declarative Model Specification — our approach:

• Preserve the declarative statistical semantics of factor graphs.
• Provide imperative hooks to define structure, parameterization, inference, estimation.

"Imperatively-Defined Factor Graphs" [McCallum, Rohanimanesh, Wick, Schultz, Singh, NIPS 2008]

Our Design Goals

• Represent factor graphs, with emphasis on conditional random fields
• Scalability
  - input data, output configuration, factors, tree-width
  - observed data that cannot fit in memory
  - super-exponential number of factors
• Leverage object-oriented benefits: modularity, encapsulation, inheritance, ...
• Integrate declarative & procedural knowledge
  - natural, easy to use
  - upcoming slides: 2 examples of injecting imperativ-ism into factor graphs

FACTORIE

• "Factor Graphs, Imperative, Extensible"
• Implemented as a library in Scala [Martin Odersky]
  - object-oriented & functional
  - type inference
  - runs in the JVM (complete interoperation with Java)
  - fast, JIT-compiled, but also a command-line interpreter
• Library, not a new "little language"
  - integrate data pre-processing & evaluation with model specification
  - leverage OO design: modularity, encapsulation, inheritance
• Scalable
  - billions of variables, super-exponential #factors, DB back-end
  - fast parameter estimation through SampleRank [2009]

http://code.google.com/p/factorie

Stages of FACTORIE programming

1. Define "templates for data" (i.e., classes)
   - Use data structures just like in deterministic programming.
   - Only special requirement: "undo" capability for changes.
   - (A variable holds a single possible value, not a distribution.)
2. Define "templates for factors"
   - Distinct from the data representation above; makes it easy to modify model scoring independently.
   - Leverage the data's natural relations to define the factors' relations.
3. Select inference (MCMC, variational)
   - Optionally, define MCMC proposal functions that leverage domain knowledge.
4. Read the data, creating variables. Then inference / parameter estimation is often a one-liner, as shown below.
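For instance, the skip-chain example later in this deck ends with exactly such a one-liner; its final lines read:

val labels : Collection[Label] = readData()
val inferencer = new GibbsSampler(model)
for (i <- 1 to numIterations) inferencer.process(labels)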

Outline
• Motivate software engineering for statistics
• Graphical models for Extraction & Integration
  - Extraction (linear-chain CRFs)
  - Information Integration (really hairy CRFs, MCMC, SampleRank)
• Probabilistic Programming: FACTORIE
• Example
• Relation Extraction (cross-document, w/out labeled data)
• Probabilistic Programming inside a DB
• Ongoing Work

Scala

• New variable:
    var myHometown : String
• New constant:
    val myName = "Andrew"
• New method:
    def climb(increment:Double) = myAltitude += increment
• New class:
    class Skier extends Person
• New trait (like a Java interface with implementations):
    trait FirstAid { def applyBandage = ... }
• New class with trait:
    class BackcountrySkier extends Skier with FirstAid
• Generics in square brackets:
    new ArrayList[Skier]

Example: Linear-Chain CRF for Segmentation

Words:   Bill  loves  skiing  Tom  loves  snowshoeing
Labels:  T     F      F       T    F      F

Step 1: define the data (built up over several slides; final version shown). VarInSeq provides sequence position, giving label.prev and label.next. Avoid representing relations by indices; do it directly with member pointers... any data structure.

class Label(isBeg:Boolean) extends BooleanVariable(isBeg) with VarInSeq {
  val token : Token
}
class Token(word:String) extends CategoricalVariable(word) with VarInSeq {
  val label : Label
  def longerThanSix = word.length > 6
}

Step 2: define the model, one template per factor family (a bias on each Label, Label-Token emission, Label-Label transition):

val model = new Model(
  new TemplateWithStatistics1[Label],
  new TemplateWithStatistics2[Label,Token],
  new TemplateWithStatistics2[Label,Label])

Key Operation: Scoring a Proposal

• Acceptance probability ~ ratio of model scores. Scores of factors that didn't change cancel.
• To efficiently score:
  - Proposal method runs.
  - Automatically build a list of variables that changed.
  - Find factors that touch changed variables.
  - Find other (unchanged) variables needed to calculate those factors' scores.
• How to find factors from variables & vice versa?
  - In BLOG, a rich, highly-indexed data structure stores the mapping variables ←→ factors.
  - But it is complex to maintain as structure changes.
  - Factors consume memory.
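A minimal, generic Scala sketch of the diff-based scoring idea. This is not FACTORIE's API; DiffList, IntVar, and the template query are simplified stand-ins, and factors touching two changed variables would be double-counted in this sketch:

object ProposalScoringSketch {
  import scala.collection.mutable

  class IntVar(var value: Int)

  // Records each change so it can be undone and redone (FACTORIE's variables
  // have this "undo" capability; here it is a simplified stand-in).
  class DiffList {
    private val entries = mutable.ListBuffer[(IntVar, Int, Int)]() // (var, old, new)
    def change(v: IntVar, newValue: Int): Unit = { entries += ((v, v.value, newValue)); v.value = newValue }
    def variables: Seq[IntVar] = entries.map(_._1).distinct.toSeq
    def undo(): Unit = entries.reverseIterator.foreach { case (v, old, _) => v.value = old }
    def redo(): Unit = entries.foreach { case (v, _, nw) => v.value = nw }
  }

  // A template answers "which factors touch this variable?" -- the job of
  // FACTORIE's unroll methods; each factor is just a score closure here.
  trait Template { def factorsTouching(v: IntVar): Seq[() => Double] }

  def touchedScore(ts: Seq[Template], d: DiffList): Double =
    (for (t <- ts; v <- d.variables; f <- t.factorsTouching(v)) yield f()).sum

  // Log of the model-score ratio: only factors touching changed variables are
  // evaluated, once with new values and once (after undo) with old values.
  def proposalDelta(ts: Seq[Template], proposal: DiffList => Unit): Double = {
    val d = new DiffList
    proposal(d)
    val after = touchedScore(ts, d)
    d.undo()
    val before = touchedScore(ts, d)
    d.redo()
    after - before
  }
}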

Imperativ-ism #1: Model Structure

• Maintain no map structure between factors and variables.
• Finding factors is easy. Usually # templates < 50.
  - Given a changed variable, query each template.
• What's hard: given a factor template and one changed variable, find the other variables.
• Primitive operation: in the factor Template object, define imperative methods that do this.
  - unroll1(v1) returns (v1, v2, v3)
  - unroll2(v2) returns (v1, v2, v3)
  - unroll3(v3) returns (v1, v2, v3)
  - i.e., use a Turing-complete language to determine structure on the fly.
• Other nice attributes:
  - Easy to do value-conditioned structure: Case Factor Diagrams, etc.
  - Not only avoid the super-exponential blowup; don't even allocate all factors for the current configuration.
  - FACTORIE provides several simpler mechanisms that build on this primitive.

Example: Linear-Chain CRF for Segmentation, with unroll methods

The classes are as before; the templates now define the factor structure via unroll:

val model = new Model(
  new TemplateWithStatistics1[Label],
  new TemplateWithStatistics2[Label,Token] {
    def unroll1(label:Label) = Factor(label, label.token)
    def unroll2(token:Token) = throw new Error // Tokens shouldn't change
  },
  new TemplateWithStatistics2[Label,Label] {
    def unroll1(label:Label) = Factor(label, label.next)
    def unroll2(label:Label) = Factor(label.prev, label)
  })

Imperativ-ism #2: Neighbor-Sufficient Map

• "Neighbor variables" of a factor: the values of the variables touching the factor.
• "Sufficient statistics" of a factor: a vector whose dot product with the weights of a log-linear factor gives the factor's score.
• Usually conflated. Separate them with a user-defined function!
• Skip-chain NER [Sutton & McCallum 2006]: instead of 5×5 parameters, just 2.
    (label1, label2) → label1 == label2

Words:   Bill  loves  Paris  Bill  the  painter  ...
Labels:  PER   O      LOC    PER   O    O

Example: Skip-Chain CRF for Segmentation

Classes as before; a fourth template adds skip edges between identical tokens, with the 2-parameter neighbor-sufficient map:

val model = new Model(
  new TemplateWithStatistics1[Label],
  new TemplateWithStatistics2[Label,Token] {
    def unroll1(label:Label) = Factor(label, label.token)
    def unroll2(token:Token) = throw new Error // Tokens shouldn't change
  },
  new TemplateWithStatistics2[Label,Label] {
    def unroll1(label:Label) = Factor(label, label.next)
    def unroll2(label:Label) = Factor(label.prev, label)
  },
  new Template2[Label,Label] with Statistics1[BooleanVariable] {
    def unroll1(label:Label) =
      for (other <- label.seq; if (label.token == other.token))
        yield Factor(label, other)
    def statistics(label1:Label, label2:Label) = Stat(label1 == label2)
  })

val labels:Collection[Label] = readData()
val inferencer = new GibbsSampler(model)
for (i <- 1 to numIterations) inferencer.process(labels)

Example Run: CoNLL 2003 NER

[Demo output not preserved in this extraction.]

Example: Dependency Parsing

class Word(str:String) extends CategoricalVariable(str)
class Node(word:Word, parent:Node) extends RefVariable(parent)

object ChildParentTemplate extends Template1[Node] with Statistics2[Word,Word] {
  def statistics(n:Node) = Stat(n.word, n.parent.word)
}

object NearestVerbTemplate extends Template1[Node] with Statistics2[Word,Word] {
  def statistics(n:Node) = Stat(n.word, closestVerb(n).word)
  def closestVerb(n:Node) = if (isVerb(n.word)) n else closestVerb(n.parent)
  def unroll1(n:Node) = n.selfAndDescendants
}

Example: Alternative Template Specification

Instead of the previous:

val model = new Model(
  new TemplateWithStatistics1[Label],
  new TemplateWithStatistics2[Label,Token] {
    def unroll1(label:Label) = Factor(label, label.token)
    def unroll2(token:Token) = throw new Error
  },
  new TemplateWithStatistics2[Label,Label] {
    def unroll1(label:Label) = Factor(label, label.next)
    def unroll2(label:Label) = Factor(label.prev, label)
  })

Higher-level "Entity-Relationship" specification:

val model = new Model(
  Foreach[Label] { label => Score(label) },
  Foreach[Label] { label => Score(label, label.token) },
  Foreach[Label] { label => Score(label.prev, label) })

Example: First-order Logic Templates

val model = new Model(
  Forany[Person] { p => p.cancer } * 0.1,
  Forany[Person] { p => p.smokes ==> p.cancer } * 2.0,
  Forany[Person] { p => p.friends.smokes <==> p.smokes } * 1.5)

Example: Latent Dirichlet Allocation

class Z(p:Proportions, value:Int) extends MixtureChoice(p, value)
class Word(ps:FiniteMixture[Proportions], z:MixtureChoiceVariable, value:String)
  extends CategoricalMixture[String](ps, z, value)
class Document(val file:String) extends ArrayBuffer[Word] {
  var theta:DirichletMultinomial = null
}

val phis = FiniteMixture(numTopics)(new GrowableDenseDirichletMultinomial(0.01))
val documents = new ArrayBuffer[Document]
for (directory <- directories) {
  for (file <- new File(directory).listFiles; if (file.isFile)) {
    val doc = new Document(file.toString)
    doc.theta = new DenseDirichletMultinomial(numTopics, 0.01)
    for (word <- lexer.findAllIn(file.mkString).map(_ toLowerCase)) {
      val z = new Z(doc.theta, random.nextInt(numTopics))
      doc += new Word(phis, z, word)
    }
    documents += doc
  }
}

val sampler = new CollapsedGibbsSampler(phis ++ documents.map(_.theta))
val zs = documents.flatMap(document => document.map(word => word.choice))
sampler.processAll(zs)

Experimental Comparison

• Joint segmentation & coreference of research-paper citations (Cora data): 1295 mentions, 134 entities, 36,487 tokens.
• Compare with Markov Logic Networks (Alchemy), using the same observable features.
• FACTORIE results:
  - ~25% reduction in error (segmentation & coreference)
  - 3-20x faster
  - [coreference results table not preserved in this extraction]

Outline
• Motivate software engineering for statistics
• Graphical models for Extraction & Integration
  - Extraction (linear-chain CRFs)
  - Information Integration (really hairy CRFs, MCMC, SampleRank)
• Probabilistic Programming: FACTORIE
• Example
• Relation Extraction (cross-document, w/out labeled data)
• Probabilistic Programming inside a DB
• Ongoing Work

Knowledge Base Augmentation [Yao, Riedel, McCallum 2010]

Joint model of entities & relations, with cross-document joint inference.

• Existing knowledge base: 334K entities, 10 types; 488K relation instances, 54 types. E.g., founded(Bill Gates, Microsoft), nationality(Steve Jobs, USA), ...
• Text: 2 years, 216K articles. E.g., "Microsoft was founded by Bill Gates"; "With Microsoft chairman Bill Gates soon relinquishing ..."; "Paul Porter, a founder of Industry Ears".
• Goal: propose new facts, e.g., founded(Paul Porter, Industry Ears); founded(D. L. Sifry, Technorati)?

[Figure 1: Factor graph for joint relation mention prediction and relation type identification. Relation variables Y_rel for candidate tuples (e.g., (Roger McNamee, USA), (Roger McNamee, Microsoft)) connect to entity-type variables Y_type (e.g., person: R. McNamee, country: USA, company: Microsoft) via ner-relation factors, to mention variables Z over sentences (e.g., "Elevation Partners, was founded by Roger McNamee ...") via relation-mention factors and mention factors, and to a functionality factor.]

From the accompanying paper [Yao, Riedel, McCallum 2010]:

2.6 Global Consistency of Facts

As discussed above, distant supervision can lead to noisy extractions. However, such noise can often be identified by testing how compatible the extracted facts are with each other. In this work we are concerned with a particular type of compatibility: selectional preferences. Relations require, or prefer, their arguments to be of certain types. For example, the nationality relation requires the first argument to be a person and the second to be a country. On inspection, we find that these preferences are often not satisfied in a baseline distant-supervision system akin to Mintz et al. (2009). This often results from patterns such as "<Entity1> in <Entity2>" that fire in many cases where <Entity2> is a location, but not a country: at test time, a sentence such as "Arrest Warrant Issued for Richard Gere in India" leads us to extract that RICHARD GERE is a citizen of INDIA.

3 Model

These observations suggest that we should (a) explicitly model compatibility between extracted facts, and (b) integrate evidence from several documents to exploit redundancy. We choose a Conditional Random Field (CRF) to achieve this. CRFs are a natural fit for this task: they allow us to capture correlations in an explicit fashion, and to incorporate overlapping input features from multiple documents.

The hidden output variables of the model are Y = (Y_c)_{c ∈ C}: one variable Y_c for each candidate tuple c ∈ C, which can take as value any relation of the same arity as c (see the example relation variables in Figure 1). The observed input variables X consist of a family of variables X_c = (X^1_c, ..., X^m_c) for each candidate tuple c, where X^i_c stores the relevant observations for the i-th candidate mention tuple of c in the corpus. For example, X^1_{BILL GATES, MICROSOFT} in Figure 1 would contain, among others, the pattern "[M2] was founded by [M1]".

3.1 Factor Templates

The conditional probability distribution over variables X and Y is defined using a set T of factor templates. Each template T_j ∈ T defines a set of factors {(y_i, x_i)}, a set K_j of feature indices, parameters (θ^j_k)_{k ∈ K_j}, and feature functions (f^j_k)_{k ∈ K_j}. Together they define the following conditional distribution:

  p(y|x) = (1/Z_x) ∏_{T_j ∈ T} ∏_{(y_i, x_i) ∈ T_j} exp( Σ_{k ∈ K_j} θ^j_k f^j_k(y_i, x_i) )

In our case the set T consists of four templates, described below. The graphical model is constructed using FACTORIE (McCallum et al., 2009), a probabilistic programming language that simplifies the construction process, as well as inference and learning.

3.1.1 Bias Template

We use a bias template T^Bias that prefers certain relations a priori over others. When the template is unrolled, it creates one factor per variable Y_c for candidate tuple c ∈ C, with one weight θ^Bias_r and feature function f^Bias_r for each possible relation r; f^Bias_r fires if the relation associated with tuple c is r.

3.1.2 Mention Template

To extract relations from text, we need to model the correlation between relation instances and their mentions in text. For this purpose we define the mention template T^Men that connects each relation instance variable Y_c with its observed mention variables X^M_c. The feature functions of this template are taken from Mintz et al. (2009b), with minor modifications; they include features that inspect the lexical context between entity mentions in the same sentence, and the syntactic path between them. One example is

  f^Men_101(y_c, x^M_c) = 1 if y_c = founder ∧ m1 ", director of " m2 ∈ x^M_c; 0 otherwise.

It tests whether, for any of the mentions of the candidate tuple, the sequence ", director of " appears between the mentions of the argument entities. Crucially, these templates function on a cross-document level: they gather all mentions of the candidate tuple c and extract features from all of them.

3.1.3 Selectional Preference Templates

To capture the correlations between entity types and the relations the entities participate in, we introduce the joint template T^Joint. It connects a relation instance variable Y_{e1,...,ea} to the entity type variables Y_{e1}, ..., Y_{ea}. To measure the compatibility between relation and entity variables, we use one feature f^Joint_{r,t1...ta} (and weight θ^Joint_{r,t1...ta}) for each combination of relation and entity types r, t1 ... ta; the feature fires when the variables are in the state r, t1 ... ta. After training we would expect the weight θ^Joint_{founder,person,company} to be larger than θ^Joint_{founder,person,country}. We also add a template T^Pair that measures the compatibility between Y_{e1,...,ea} and each Y_{ei} in isolation, with features f^Pair_{i,r,t} that fire if e_i is ...
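To make the f^Men_101-style features concrete, here is a hedged Scala sketch; the MentionPair type and the helper names are illustrative assumptions, not the paper's code. The feature fires when any mention pair of the candidate tuple has the given lexical pattern between its argument mentions:

object MentionFeatureSketch {
  // One observed mention pair of a candidate tuple: the text between
  // the two argument mentions, already ordered as (m1 ... m2).
  case class MentionPair(between: String)

  // f_101(y_c, x_c) = 1 iff y_c = founder and some mention pair of the
  // candidate contains the pattern ", director of " between m1 and m2.
  def fMen101(relation: String, mentionPairs: Seq[MentionPair]): Double =
    if (relation == "founder" && mentionPairs.exists(_.between.contains(", director of ")))
      1.0
    else 0.0

  def main(args: Array[String]): Unit = {
    val pairs = Seq(MentionPair(", director of "), MentionPair(" was founded by "))
    println(fMen101("founder", pairs)) // 1.0
  }
}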



founder

Microsoft was founded by Bill Gates...

personcompany

nationality

country

With Microsoft chairman Bill Gates soon relinquishing...

Bill Gates was born in the USA in 1955

1

nationof

Elevation Partners, was founded by Roger McNamee ...

Roger McNamee,USAYrel

Z1

personR McNamee

countryUSA

worksfor

comp. Microsoft

1Z1

1Z1

Roger McNamee,Microsoft

functionality-factor

ner-relation-factors

relation-mention factors

Ytypel

mention factors

Elevation Partners, was founded by Roger McNamee ...

Elevation Partners, was founded by Roger McNamee ...

Figure 1: Factor Graph for joint relation mention prediction and relation type identification.

fine the following conditional distribution:

p (y|x) =1

Zx

Tj∈T

(yi,xi)∈Tj

ePKj

k=1 θjkfj

k(yi,xi) (3)

In our case the set T consist of four templateswe will describe below. Note that to construct thisgraphical model we use FACTORIE (McCallum etal., 2009), a probabilistic programming languagethat simplifies the construction process, as well asinference and learning.

3.1.1 Bias TemplateWe use a bias template TBias that prefers certain

relations a priori over others. When the template isunrolled, it creates one factor per variable Ycfor can-didate tuple c and one weight θBias

r and feature func-tion fBias

r for each possible relation r. fBiasr fires if

the relation associated with tuple c is r.

3.1.2 Mention TemplateIn order to extract relations from text, we need to

model the correlation between relation instances andtheir mentions in text. For this purpose we definethe mention template TMen that connects each rela-tion instance variable Yc with its observed variablesmention variables XMc .

The feature functions of this template are takenfrom (Mintz et al., 2009b) (with minor modifica-tions). This includes features that inspect the lexical

context between entity mentions in the same sen-tence, and the syntactic path between these. Oneexample is

fMen101 (yc,xMc)

def=

1 yc = founder∧m1", director of "m2 ∈ xMc

0 otherwise.

It tests whether for any of the mentions of the can-didate tuple the sequence ", director of " appears be-tween the mentions of the argument entites.

Crucially, these templates function on a cross-document level. They gather all mentions of the can-didate tuple c and extract features from all of these.

3.1.3 Selectional Preference TemplatesTo capture the correlations between entity types

and the relations the entities participate in, we in-troduce the joint template TJoint. It connects a re-lation instance variable Ye1,...,ea to the entity typevariables Ye1 , . . . , Yen . To measure the compabil-ity between relation and entity variables, we useone feature f Joint

r,t1...ta (and weight θJointr,t1...ta) for each

combination of relation and entity types r, t1 . . . ta.The feature fires when the variables are in thestate r, t1 . . . ta. After training we would expecta weight θJoint

founder,person,company to be larger thanθJointfounder,person,country.

We also add a template TPair that measures thecompability between Ye1,...,ea and each Yei in iso-lation. Here we use features fPair

i,r,t that fire if ei is

1

nationof

Elevation Partners, was founded by Roger McNamee ...

Roger McNamee,USAYrel

Z1

personR McNamee

countryUSA

worksfor

comp. Microsoft

1Z1

1Z1

Roger McNamee,Microsoft

functionality-factor

ner-relation-factors

relation-mention factors

Ytypel

mention factors

Elevation Partners, was founded by Roger McNamee ...

Elevation Partners, was founded by Roger McNamee ...

Figure 1: Factor Graph for joint relation mention prediction and relation type identification.

fine the following conditional distribution:

p (y|x) =1

Zx

Tj∈T

(yi,xi)∈Tj

ePKj

k=1 θjkfj

k(yi,xi) (3)

In our case the set T consist of four templateswe will describe below. Note that to construct thisgraphical model we use FACTORIE (McCallum etal., 2009), a probabilistic programming languagethat simplifies the construction process, as well asinference and learning.

3.1.1 Bias TemplateWe use a bias template TBias that prefers certain

relations a priori over others. When the template isunrolled, it creates one factor per variable Ycfor can-didate tuple c and one weight θBias

r and feature func-tion fBias

r for each possible relation r. fBiasr fires if

the relation associated with tuple c is r.

3.1.2 Mention TemplateIn order to extract relations from text, we need to

model the correlation between relation instances andtheir mentions in text. For this purpose we definethe mention template TMen that connects each rela-tion instance variable Yc with its observed variablesmention variables XMc .

The feature functions of this template are takenfrom (Mintz et al., 2009b) (with minor modifica-tions). This includes features that inspect the lexical

context between entity mentions in the same sen-tence, and the syntactic path between these. Oneexample is

fMen101 (yc,xMc)

def=

1 yc = founder∧m1", director of "m2 ∈ xMc

0 otherwise.

It tests whether for any of the mentions of the can-didate tuple the sequence ", director of " appears be-tween the mentions of the argument entites.

Crucially, these templates function on a cross-document level. They gather all mentions of the can-didate tuple c and extract features from all of these.

3.1.3 Selectional Preference TemplatesTo capture the correlations between entity types

and the relations the entities participate in, we in-troduce the joint template TJoint. It connects a re-lation instance variable Ye1,...,ea to the entity typevariables Ye1 , . . . , Yen . To measure the compabil-ity between relation and entity variables, we useone feature f Joint

r,t1...ta (and weight θJointr,t1...ta) for each

combination of relation and entity types r, t1 . . . ta.The feature fires when the variables are in thestate r, t1 . . . ta. After training we would expecta weight θJoint

founder,person,company to be larger thanθJointfounder,person,country.

We also add a template TPair that measures thecompability between Ye1,...,ea and each Yei in iso-lation. Here we use features fPair

i,r,t that fire if ei is

1

nationof

Elevation Partners, was founded by Roger McNamee ...

Roger McNamee,USAYrel

Z1

personR McNamee

countryUSA

worksfor

comp. Microsoft

1Z1

1Z1

Roger McNamee,Microsoft

functionality-factor

ner-relation-factors

relation-mention factors

Ytypel

mention factors

Elevation Partners, was founded by Roger McNamee ...

Elevation Partners, was founded by Roger McNamee ...

Figure 1: Factor Graph for joint relation mention prediction and relation type identification.


[Slide equations for the factor graph over variables Y, X1, X2: factors Φ(yi,j; x) = exp(Σk wk fk(yi,j; x)); model p(y; x) = (1/Zx) Ψ1(y; x) · ... · Ψn(y; x) with Ψi(y; x) = exp(θi φi(y; x)); target expectations μi = E[φi]; a per-factor objective g involving log(1 − μi + μi e^θi); and a bound of the form g(·, ·) ≥ DKL(· ‖ ·).]

Figure 1: Factor graph of our model that captures selectional preferences and functionality constraints. For readability we label only a subset of equivalent variables and factors. Note that the graph shows an example assignment to variables.



3.2 Inference
There are two types of inference we have to perform: sampling from the posterior during training (see Section 3.3), and finding the most likely configuration (i.e., MAP inference). In both settings we employ a Gibbs sampler (Geman and Geman, 1984) that randomly picks a variable Yc and samples its relation value conditioned on its Markov blanket. At test time we decrease the temperature of our sampler in order to find an approximation of the MAP solution.
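A library-free sketch of one such tempered Gibbs step; `score` stands in for the sum of log-factor scores in the variable's Markov blanket, and all names are illustrative:

    import scala.util.Random

    // Illustrative sketch only. Sample a new value for one relation variable from
    // its tempered conditional (temperature > 0); as temperature -> 0 this
    // approaches the greedy choice, which is how annealing approximates the MAP.
    def gibbsStep(values: IndexedSeq[String],
                  score: String => Double,
                  temperature: Double,
                  rng: Random): String = {
      val logits = values.map(v => score(v) / temperature)
      val maxLogit = logits.max                        // subtract max for numerical stability
      val weights = logits.map(l => math.exp(l - maxLogit))
      var r = rng.nextDouble() * weights.sum
      val idx = weights.indexWhere { w => r -= w; r <= 0 }
      values(if (idx >= 0) idx else values.length - 1) // guard against floating-point slack
    }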

3.3 Training
Most learning methods need to calculate the model expectations (Lafferty et al., 2001) or the MAP configuration (Collins, 2002) before making an update to the parameters. This inference step is usually the bottleneck for learning, even when performed approximately.

SampleRank (Wick et al., 2009) is a rank-based learning framework that alleviates this problem by performing parameter updates within MCMC inference. Every pair of consecutive samples in the MCMC chain is ranked according to the model and according to the ground truth, and the parameters are updated when the two rankings disagree. The update can follow different schemes; here we use MIRA (Crammer and Singer, 2003). This allows the learner to acquire more supervision per instance, and has led to efficient training for models in which inference …
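A sketch of the core SampleRank update, with a plain perceptron step standing in for the MIRA update used here (feature vectors and truth scores are assumed given; illustrative, not the authors' code):

    // Illustrative sketch only. Compare two consecutive MCMC samples A and B under
    // the model (w . f) and under the ground-truth objective; update the weights
    // only when the two rankings disagree.
    def sampleRankUpdate(w: Array[Double],
                         fA: Array[Double], fB: Array[Double],
                         truthA: Double, truthB: Double,
                         learningRate: Double): Unit = {
      def modelScore(f: Array[Double]): Double =
        w.zip(f).map { case (wi, fi) => wi * fi }.sum
      val modelPrefersA = modelScore(fA) > modelScore(fB)
      val truthPrefersA = truthA > truthB
      if (modelPrefersA != truthPrefersA) {
        val (better, worse) = if (truthPrefersA) (fA, fB) else (fB, fA)
        for (k <- w.indices) w(k) += learningRate * (better(k) - worse(k))
      }
    }

Because an update can fire at every MCMC step rather than once per full inference run, the learner extracts far more supervision from each training instance.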

Entities and Relations [Yao, Riedel, McCallum 2010]

Relation Extraction Experiments: Manual Evaluation

Precision@50:  Isolated 0.78 | Pipeline 0.82 | Joint 0.94
Training set: 2 years / 1 year

Outline
• Motivate software engineering for statistics
• Graphical models for Extraction & Integration
  - Extraction (linear-chain CRFs)
  - Information Integration (really hairy CRFs, MCMC, SampleRank)
• Probabilistic Programming: FACTORIE
• Example
• Relation Extraction (cross-document, w/out labeled data)
• Probabilistic Programming inside a DB
• Ongoing Work

Information Extraction into DB

Documents → Extraction & Matching → DB
query → query proc → answer

Information Extraction into Pr DB

Documents → Extraction & Matching → probabilistic DB
query → query proc → answer

Information Extraction into Pr DB: The MCMC Alternative

Documents → Extraction & Matching → DB (the DB contains only one possible world at a time)
MH inference proposes changes to that world
query → SQL → answer
[Wick, McCallum, Miklau 2010]
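A sketch of the Metropolis-Hastings loop the slides imply: the database stores one possible world; a proposal edits it, and the edit is kept or rolled back by the usual MH test (assuming a symmetric proposal for simplicity; all names illustrative):

    import scala.util.Random

    // Illustrative sketch only. One MH step over database "worlds";
    // `score` returns an unnormalized log-probability of a world.
    def mhStep[World](current: World,
                      propose: World => World,   // assumed symmetric proposal
                      score: World => Double,
                      rng: Random): World = {
      val proposal = propose(current)
      val logAcceptRatio = score(proposal) - score(current)
      if (math.log(rng.nextDouble()) < logAcceptRatio) proposal else current
    }

The appeal of this design is that the DB never has to materialize the exponentially many possible worlds: SQL queries run against the single current world, and their answers are aggregated across MH samples.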

Particle Filtering

Documents → Extraction & Matching → particles (possible worlds) with compact representation
query → answer
[Schultz, McCallum, Miklau 2010]
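For query answering over particles, the expected answer is a weight-normalized average over the sampled worlds; a minimal sketch (illustrative names):

    // Illustrative sketch only. Expected query answer under weighted particles.
    def queryExpectation[World](particles: Seq[(World, Double)],
                                query: World => Double): Double = {
      val totalWeight = particles.map(_._2).sum
      particles.map { case (world, weight) => weight * query(world) }.sum / totalWeight
    }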

Ongoing Work

• Factored particle filtering
• Query-specific MCMC
• Inference caching w.r.t. E[query]
• Learned proposal distributions
• Bayesian inference in distributed systems
• Probabilistic database of all of Wikipedia, automatically growing by reading

Thank you!

FACTORIE version 0.9 is available at http://code.google.com/p/factorie
