Toward Unified Models of Information Extraction and Data Mining
Andrew McCallum
Information Extraction and Synthesis Laboratory
Computer Science Department
University of Massachusetts Amherst
Joint work with
Aron Culotta, Wei Li, Khashayar Rohanimanesh, Charles Sutton, Ben Wellner
Goal:
Improving our ability to mine actionable knowledge from unstructured text.
Larger Context

[Pipeline diagram: Spider → Document collection → IE (Segment, Classify, Associate, Cluster, Filter) → Database → Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support)]
Problem:

Combined in serial juxtaposition, IE and KD are unaware of each other's weaknesses and opportunities.

1) KD begins from a populated DB, unaware of where the data came from, or its inherent uncertainties.

2) IE is unaware of emerging patterns and regularities in the DB.

The accuracy of both suffers, and significant mining of complex text sources is beyond reach.
[Diagram: the serial pipeline: Document collection → IE (Segment, Classify, Associate, Cluster) → Database → Knowledge Discovery (Discover patterns: entity types, links / relations, events) → Actionable knowledge]
[Diagram: the same pipeline (Spider → Document collection → IE (Segment, Classify, Associate, Cluster, Filter) → Database → Data Mining → Actionable knowledge: Prediction, Outlier detection, Decision support), now annotated with "Uncertainty Info" flowing forward from IE to Data Mining and "Emerging Patterns" flowing back from Data Mining to IE]
Solution:

[Diagram: the pipeline with the Database replaced by a single Probabilistic Model shared by IE (Segment, Classify, Associate, Cluster, Filter) and Data Mining (Discover patterns: entity types, links / relations, events), from Spider and Document collection through to Actionable knowledge (Prediction, Outlier detection, Decision support)]
Solution: Unified Model

• Conditional Random Fields [Lafferty, McCallum, Pereira]
• Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…]
• Discriminatively-trained undirected graphical models
• Complex inference and learning: just what we researchers like to sink our teeth into!
Outline
• The need for unified IE and DM.
• Review of Conditional Random Fields for IE.
• Preliminary steps toward unification:
– Joint Co-reference Resolution (Graph Partitioning)
– Joint Labeling of Cascaded Sequences (Belief Propagation)
– Joint Segmentation and Co-ref (Iterated Conditional Sampling)
• Conclusions
Hidden Markov Models
[Figure: finite state model and graphical model views: states S_{t-1}, S_t, S_{t+1}, …, each emitting an observation O_{t-1}, O_t, O_{t+1}, …]
Parameters: for all states S = {s_1, s_2, …}
  Start state probabilities: P(s_t)
  Transition probabilities: P(s_t | s_{t-1})
  Observation (emission) probabilities: P(o_t | s_t)
Training: maximize probability of training observations (w/ prior)

$P(\vec{s}, \vec{o}) \propto \prod_{t=1}^{|\vec{o}|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)$
HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …
[Figure: state transitions generating observations o_1 o_2 o_3 o_4 o_5 o_6 o_7 o_8]

Generates: state sequence $\vec{s} = s_1, s_2, \ldots, s_n$ and observation sequence $\vec{o} = o_1, o_2, \ldots, o_n$. Each emission is usually a multinomial over an atomic, fixed alphabet.
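To make the factorization concrete, here is a minimal runnable sketch (mine, not the talk's) that computes the HMM joint probability above; all parameter values are hypothetical toys:

```python
import numpy as np

# Toy HMM: 2 states, 3 observation symbols (hypothetical numbers).
pi = np.array([0.6, 0.4])                      # start state probabilities P(s_1)
A  = np.array([[0.7, 0.3],                     # transition probs P(s_t | s_{t-1})
               [0.2, 0.8]])
B  = np.array([[0.5, 0.4, 0.1],                # emission probs P(o_t | s_t)
               [0.1, 0.3, 0.6]])

def joint_prob(states, obs):
    """P(s, o) = P(s_1) P(o_1|s_1) * prod_t P(s_t|s_{t-1}) P(o_t|s_t)."""
    p = pi[states[0]] * B[states[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[states[t - 1], states[t]] * B[states[t], obs[t]]
    return p

print(joint_prob([0, 0, 1], [0, 1, 2]))  # probability of one (state, obs) path
```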
From HMMs to Conditional Random Fields [Lafferty, McCallum, Pereira 2001]

Joint:

$P(\vec{s}, \vec{o}) = \prod_{t=1}^{|\vec{o}|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)$

Conditional:

$P(\vec{s} \mid \vec{o}) = \frac{1}{P(\vec{o})} \prod_{t=1}^{|\vec{o}|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t) = \frac{1}{Z(\vec{o})} \prod_{t=1}^{|\vec{o}|} \Phi_s(s_t, s_{t-1}) \, \Phi_o(o_t, s_t)$

where

$\Phi_o(o_t, s_t) = \exp\left( \sum_k \lambda_k f_k(s_t, o_t) \right)$

(A super-special case of Conditional Random Fields.)
Set parameters by maximum likelihood, using an optimization method on the likelihood L.
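A minimal sketch of the conditional computation above (my illustration, not MALLET's implementation): the potential tables stand in for $\exp(\sum_k \lambda_k f_k(\cdot))$, and $Z(\vec{o})$ comes from the forward recursion. All values are hypothetical:

```python
import numpy as np

def crf_prob(obs, states, Phi_s, Phi_o):
    """p(s|o) = (1/Z(o)) * prod_t Phi_s[s_{t-1}, s_t] * Phi_o[o_t, s_t].

    Phi_s: [S, S] transition potentials; Phi_o: [O, S] observation potentials.
    In a trained CRF these would be exp(sum_k lambda_k f_k(...)); here they
    are given directly as toy nonnegative tables (start potential folded
    into the first observation factor).
    """
    # Unnormalized score of the proposed state sequence.
    score = Phi_o[obs[0], states[0]]
    for t in range(1, len(obs)):
        score *= Phi_s[states[t - 1], states[t]] * Phi_o[obs[t], states[t]]

    # Z(o): sum over all state sequences, via the forward recursion.
    alpha = Phi_o[obs[0], :].copy()                # alpha_1(s) = Phi_o(o_1, s)
    for t in range(1, len(obs)):
        # alpha_t(s') = sum_s alpha_{t-1}(s) Phi_s(s, s') Phi_o(o_t, s')
        alpha = (alpha @ Phi_s) * Phi_o[obs[t], :]
    return score / alpha.sum()

Phi_s = np.array([[2.0, 1.0], [0.5, 2.0]])         # hypothetical potentials
Phi_o = np.array([[1.5, 0.5], [0.5, 1.5], [1.0, 1.0]])
print(crf_prob([0, 1, 2], [0, 1, 1], Phi_s, Phi_o))
```

Maximum-likelihood training would then adjust the weights λ so that labeled training sequences receive high probability under exactly this computation.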
Table Extraction from Government Reports
Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was
slightly below 1994. Producer returns averaged $12.93 per hundredweight,
$0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds,
1 percent above 1994. Marketings include whole milk sold to plants and dealers
as well as milk sold directly to consumers.
An estimated 1.56 billion pounds of milk were used on farms where produced,
8 percent less than 1994. Calves were fed 78 percent of this milk with the
remainder consumed in producer households.
Milk Cows and Production of Milk and Milkfat:
United States, 1993-95
--------------------------------------------------------------------------------
: : Production of Milk and Milkfat 2/
: Number :-------------------------------------------------------
Year : of : Per Milk Cow : Percentage : Total
:Milk Cows 1/:-------------------: of Fat in All :------------------
: : Milk : Milkfat : Milk Produced : Milk : Milkfat
--------------------------------------------------------------------------------
: 1,000 Head --- Pounds --- Percent Million Pounds
:
1993 : 9,589 15,704 575 3.66 150,582 5,514.4
1994 : 9,500 16,175 592 3.66 153,664 5,623.7
1995 : 9,461 16,451 602 3.66 155,644 5,694.3
--------------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.
CRF Labels:
• Non-Table
• Table Title
• Table Header
• Table Data Row
• Table Section Data Row
• Table Footnote
• ... (12 in all)
[Pinto, McCallum, Wei, Croft, 2003 SIGIR]
Features:
• Percentage of digit chars
• Percentage of alpha chars
• Indented
• Contains 5+ consecutive spaces
• Whitespace in this line aligns with prev.
• ...
• Conjunctions of all previous features, time offset: {0,0}, {-1,0}, {0,1}, {1,2}.
100+ documents from www.fedstats.gov
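As an illustration of the feature list above (my sketch; the feature names and thresholds are hypothetical, not the paper's exact feature set):

```python
import re

def line_features(line, prev_line=""):
    """Per-line features for one line of a plain-text government report."""
    n = max(len(line), 1)
    digits = sum(c.isdigit() for c in line)
    alphas = sum(c.isalpha() for c in line)
    feats = {
        "pct_digit": digits / n,
        "pct_alpha": alphas / n,
        "indented": line.startswith(" "),
        "big_gap": "     " in line,                  # 5+ consecutive spaces
        "dashed_rule": bool(re.match(r"^\s*-{10,}\s*$", line)),
    }
    # Whitespace alignment with the previous line (crude version).
    gaps = {m.start() for m in re.finditer(r"\s{2,}", line)}
    prev_gaps = {m.start() for m in re.finditer(r"\s{2,}", prev_line)}
    feats["aligns_with_prev"] = bool(gaps & prev_gaps)
    return feats

print(line_features("1993  :   9,589     15,704  575",
                    "Year  :   of       Per Milk  Cow"))
```

A CRF would use these (and their conjunctions across neighboring lines) as the $f_k$ in the formula above.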
Table Extraction Experimental Results [Pinto, McCallum, Wei, Croft, 2003 SIGIR]

                          Line labels,      Table segments,
                          percent correct   F1
HMM                       65 %              64 %
Stateless MaxEnt          85 %              -
CRF w/out conjunctions    52 %              68 %
CRF                       95 %              92 %

CRF vs. HMM error reduction: 85% on line labels, 77% on table-segment F1.
IE from Research Papers [McCallum et al '99]

Field-level F1:
  Hidden Markov Models (HMMs)         75.6   [Seymore, McCallum, Rosenfeld, 1999]
  Support Vector Machines (SVMs)      89.7   [Han, Giles, et al, 2003]
  Conditional Random Fields (CRFs)    93.9   [Peng, McCallum, 2004]

Error reduction: 40%.
Main Point #2
Conditional Random Fields were more accurate in practice than a generative model
... on a research paper extraction task,
... and others, including:
  - a table extraction task
  - noun phrase segmentation
  - named entity extraction
  - …
Outline
• The need for unified IE and DM.
• Review of Conditional Random Fields for IE.
• Preliminary steps toward unification:
1. Joint Labeling of Cascaded Sequences (Belief Propagation), with Charles Sutton
2. Joint Co-reference Resolution (Graph Partitioning), with Aron Culotta
3. Joint Labeling for Semi-Supervision (Graph Partitioning), with Wei Li
4. Joint Segmentation and Co-ref (Iterated Conditional Sampling), with Andrew McCallum
1. Jointly labeling cascaded sequences: Factorial CRFs [Sutton, Rohanimanesh, McCallum, ICML 2004]

[Figure: stacked label chains over the English words: part-of-speech, noun-phrase boundaries, named-entity tag]

Joint prediction of part-of-speech and noun-phrase boundaries in newswire: equivalent accuracy with only 50% of the training data.

Inference: tree reparameterization [Wainwright et al, 2002]
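To see what "joint" buys over a cascade, here is a toy sketch (mine; all potentials are random stand-ins) that scores both label chains together, including a per-position coupling potential between them, and decodes them jointly:

```python
import itertools
import numpy as np

T = 4                                   # sentence length (toy)
L1, L2 = 3, 2                           # label set sizes for the two chains
rng = np.random.default_rng(0)
trans1 = rng.random((L1, L1))           # chain-1 transition potentials
trans2 = rng.random((L2, L2))           # chain-2 transition potentials
couple = rng.random((L1, L2))           # per-position coupling between chains
emit1 = rng.random((T, L1))             # per-position observation potentials
emit2 = rng.random((T, L2))

def joint_score(y1, y2):
    """Log-potential of both chains plus the between-chain coupling."""
    s = sum(emit1[t, y1[t]] + emit2[t, y2[t]] + couple[y1[t], y2[t]]
            for t in range(T))
    s += sum(trans1[y1[t - 1], y1[t]] + trans2[y2[t - 1], y2[t]]
             for t in range(1, T))
    return s

# Exhaustive joint MAP (fine for a toy; real FCRFs use approximate inference).
best = max(itertools.product(itertools.product(range(L1), repeat=T),
                             itertools.product(range(L2), repeat=T)),
           key=lambda ys: joint_score(*ys))
print(best)
```

A cascade would fix the first chain before decoding the second; the coupling term is what lets errors in one layer be corrected by evidence in the other.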
1b. Jointly labeling distant mentions: Skip-chain CRFs [Sutton, McCallum, 2004]

[Figure: a linear chain over "Mr. Ted Green said today … … Mary saw Green at …", with a skip edge connecting the two "Green" mentions]

14% reduction in error on the most repeated field in email seminar announcements.

Inference: tree reparameterization [Wainwright et al, 2002]
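A sketch of how the skip edges could be selected (my illustration of the heuristic of connecting identical capitalized words; the function itself is hypothetical):

```python
from collections import defaultdict

def skip_edges(tokens):
    """Return (i, j) position pairs to connect with skip edges:
    occurrences of the same capitalized word, e.g. both 'Green' mentions."""
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        if tok[:1].isupper():
            positions[tok].append(i)
    edges = []
    for tok, idxs in positions.items():
        for a in range(len(idxs)):
            for b in range(a + 1, len(idxs)):
                edges.append((idxs[a], idxs[b]))
    return edges

tokens = "Mr. Ted Green said today ... Mary saw Green at ...".split()
print(skip_edges(tokens))  # includes the pair linking the two 'Green' mentions
```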
2. Joint co-reference among all pairs: Affinity Matrix CRF [McCallum, Wellner, IJCAI WS 2003]

[Figure: mentions ". . . Mr Powell . . .", ". . . Powell . . .", ". . . she . . ." connected pairwise by Y/N coreference variables with affinity scores 45, 99, 11]

25% reduction in error on co-reference of proper nouns in newswire.

Inference: correlational clustering graph partitioning [Bansal, Blum, Chawla, 2002]
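A greedy sketch of partitioning mentions by pairwise affinities (my simplification; [Bansal, Blum, Chawla, 2002] analyze the correlational-clustering objective, whose exact optimum is NP-hard). The negative affinities below are made up:

```python
def greedy_correlation_clustering(n, affinity):
    """Assign each node to the existing cluster with the highest total
    affinity, or start a new cluster if every total is negative.
    affinity[i][j]: positive = 'same entity', negative = 'different'."""
    clusters = []
    for i in range(n):
        scores = [sum(affinity[i][j] for j in c) for c in clusters]
        if scores and max(scores) > 0:
            clusters[scores.index(max(scores))].append(i)
        else:
            clusters.append([i])
    return clusters

# Mentions: 0 = "Mr Powell", 1 = "Powell", 2 = "she" (affinities hypothetical).
aff = [[0, 99, -30],
       [99, 0, -11],
       [-30, -11, 0]]
print(greedy_correlation_clustering(3, aff))  # -> [[0, 1], [2]]
```

The point of partitioning jointly, rather than thresholding each pair independently, is that transitivity is enforced: "Mr Powell" = "Powell" and "Powell" ≠ "she" together rule out "Mr Powell" = "she".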
3. Joint Labeling for Semi-Supervision: Affinity Matrix CRF with prototypes [Li, McCallum, 2003]

[Figure: instances x_1, x_2, x_3 and labels y_1, y_2 connected by affinity edges (scores 45, 99, 11) to labeled prototypes p]

50% reduction in error on document classification with labeled and unlabeled data.

Inference: correlational clustering graph partitioning [Bansal, Blum, Chawla, 2002]
4. Joint segmentation and co-reference [Wellner, McCallum, Peng, Hay, UAI 2004]

[Figure: for each observed citation o, a CRF segmentation s and citation attributes c (database field values); pairwise co-reference decisions y connect the citations; world knowledge informs the model]

Example citations:
  Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.
  Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.

Inference: variant of Iterated Conditional Modes [Besag, 1986]; see also [Marthi, Milch, Russell, 2003]

Extraction from and matching of research paper citations:
• 35% reduction in co-reference error by using segmentation uncertainty.
• 6-14% reduction in segmentation error by using co-reference.
Citation Segmentation and Coreference

Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.

Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.

• Segment citation fields
• Resolve coreferent papers: do these two citations refer to the same paper? (Y/N)
Segmentation Quality     Citation Co-reference (F1)
No Segmentation          .787
CRF Segmentation         .913
True Segmentation        .932
Incorrect Segmentation Hurts Coreference

[The two Laurel citations above, compared pairwise: a wrong field boundary can flip the coreference decision.]

Solution: Perform segmentation and coreference jointly. Use segmentation uncertainty to improve coreference, and use coreference to improve segmentation.
Segmentation + Coreference Model

[Figure, built up in stages: each observed citation o has a CRF segmentation s and citation attributes c; pairwise coreference variables y connect the attributes of every pair of citations]
Such a highly connected graph makes exact inference intractable, so…
Approximate Inference 1

• Loopy Belief Propagation: messages passed between nodes
  [Figure: nodes v_1 … v_6 exchanging messages, e.g. m_1(v_2), m_2(v_1), m_2(v_3), m_3(v_2)]
• Generalized Belief Propagation: messages passed between regions
  [Figure: nodes v_1 … v_9 grouped into regions]

Here, a message is a conditional probability table passed among nodes. But message size grows exponentially with region size!
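For intuition, a compact sum-product loopy BP on a small pairwise model (my sketch; all potentials are hypothetical):

```python
import numpy as np

def loopy_bp(unary, pairwise, edges, iters=50):
    """Sum-product loopy BP for a pairwise MRF with K-state variables.
    unary[i]: (K,) potentials; pairwise: (K, K) shared edge potential;
    edges: undirected (i, j) pairs. Returns approximate marginals."""
    K = len(unary[0])
    msgs = {(i, j): np.ones(K) for a, b in edges for i, j in [(a, b), (b, a)]}
    for _ in range(iters):
        new = {}
        for (i, j) in msgs:
            # Product of unary at i and all incoming messages except from j.
            belief = unary[i].copy()
            for (k, l) in msgs:
                if l == i and k != j:
                    belief *= msgs[(k, l)]
            m = pairwise.T @ belief        # sum over x_i of phi(x_i, x_j) * belief(x_i)
            new[(i, j)] = m / m.sum()      # normalize for numerical stability
        msgs = new
    marginals = []
    for i in range(len(unary)):
        b = unary[i].copy()
        for (k, l) in msgs:
            if l == i:
                b *= msgs[(k, l)]
        marginals.append(b / b.sum())
    return marginals

unary = [np.array([1.0, 2.0]), np.array([2.0, 1.0]), np.array([1.0, 1.0])]
pairwise = np.array([[2.0, 1.0], [1.0, 2.0]])          # prefers agreement
print(loopy_bp(unary, pairwise, edges=[(0, 1), (1, 2), (0, 2)]))  # loopy triangle
```

Each message here is a length-K table; generalized BP would pass tables over whole regions, which is where the exponential blow-up comes from.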
Approximate Inference 2

• Iterated Conditional Modes (ICM) [Besag 1986]

Update one variable at a time to its most probable value given all the others, e.g.

$v_6^{(i+1)} = \operatorname{argmax}_{v_6} P(v_6 \mid v \setminus v_6)$, all other nodes held constant,

then likewise $v_5^{(j+1)}$, $v_4^{(k+1)}$, and so on.

[Figure: nodes v_1 … v_6; at each step the highlighted node is re-estimated while the rest are held constant]
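A sketch of the ICM loop on a toy pairwise model (mine, not the citation model):

```python
import numpy as np

def icm(unary, pairwise, edges, n_states, iters=10):
    """Iterated Conditional Modes [Besag 1986]: repeatedly set each variable
    to its best value given all the others, until nothing changes (greedy)."""
    rng = np.random.default_rng(0)
    x = rng.integers(n_states, size=len(unary))     # arbitrary initialization
    neighbors = {i: [] for i in range(len(unary))}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    for _ in range(iters):
        changed = False
        for i in range(len(unary)):
            # Score of each candidate value of v_i with all others fixed.
            scores = unary[i] + sum(pairwise[:, x[j]] for j in neighbors[i])
            best = int(np.argmax(scores))           # argmax P(v_i | v \ v_i)
            changed |= best != x[i]
            x[i] = best
        if not changed:
            break                                   # local optimum reached
    return x

unary = np.array([[0.0, 1.0], [1.0, 0.0], [0.2, 0.0], [0.0, 0.2]])
pairwise = np.array([[1.0, 0.0], [0.0, 1.0]])       # rewards agreement
print(icm(unary, pairwise, edges=[(0, 1), (1, 2), (2, 3), (3, 0)], n_states=2))
```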
But greedy, and easily falls into local minima.
• Iterated Conditional Sampling (ICS) (our proposal; related work?)

Instead of passing only the argmax, pass a sample of the highest-scoring values of $P(v_4 \mid v \setminus v_4)$, i.e. an N-best list (the top N values).

[Figure: as in ICM, but each update passes the top N values of the conditioned node onward]
Can use a "generalized version" of this, doing exact inference on a region of several nodes at once. Here, a "message" grows only linearly with region size and N!
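A sketch of the ICS "message" for one region (mine): an exact N-best list over a small linear-chain region, computed by exhaustive scoring for clarity where a real implementation would use N-best Viterbi:

```python
import itertools

def n_best_sequences(score_fn, n_states, length, n):
    """Exact N-best list over one linear-chain region by exhaustive scoring
    (fine for a toy; a real implementation would use N-best Viterbi)."""
    seqs = itertools.product(range(n_states), repeat=length)
    return sorted(seqs, key=score_fn, reverse=True)[:n]

# Hypothetical chain score: a per-position bias plus a reward for agreement.
bias = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
def score(seq):
    s = sum(bias[t][seq[t]] for t in range(len(seq)))
    s += sum(0.8 for t in range(1, len(seq)) if seq[t] == seq[t - 1])
    return s

for seq in n_best_sequences(score, n_states=2, length=3, n=3):
    print(seq, round(score(seq), 2))   # the top-N "message" passed onward
```

With N = 1 this degenerates to ICM; larger N keeps alternative hypotheses alive instead of committing greedily.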
[Figure: the joint model again: linear-chain segmentation regions (o, s), citation attributes c, pairwise coreference variables y, and prototype variables p]

Sample = N-best list from CRF segmentation: do exact inference over these linear-chain regions, then pass the N-best list to coreference.
[Figure: pairwise coreference variables y parameterized by the N-best segmentation lists of each pair of citations]

Sample = N-best list from Viterbi.
N-best segmentations of one citation (Name | Title | …):
  1. Laurel, B | Interface Agents: Metaphors with Character The | …
  2. Laurel, B. | Interface Agents: Metaphors with Character | …
  3. Laurel, B. | Interface Agents | Metaphors with Character | …
When calculating similarity with another citation, we have more opportunity to find correct, matching fields.
Sample = N-best List from Viterbi

N-best segmentations (Name | Title | Book Title | Year):
  1. Laurel, B. | Interface Agents: Metaphors with Character | The Art of Human Computer Interface Design | 1990
  2. Laurel, B. | Interface Agents: Metaphors with Character The Art | of Human Computer Interface Design | 1990
  3. Laurel, B. | Interface Agents: Metaphors with Character | The Art of Human Computer Interface Design | 1990
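A sketch (mine) of why the N-best lists help coreference: compare every pair of candidate segmentations and keep the best field match, so one correctly segmented candidate on each side is enough:

```python
def field_similarity(seg_a, seg_b):
    """Fraction of shared fields with exactly equal values (toy measure)."""
    keys = set(seg_a) & set(seg_b)
    if not keys:
        return 0.0
    return sum(seg_a[k] == seg_b[k] for k in keys) / len(keys)

def nbest_similarity(nbest_a, nbest_b):
    """Best pairwise similarity over the two N-best lists: more chances to
    find the correct, matching segmentation of each citation."""
    return max(field_similarity(a, b) for a in nbest_a for b in nbest_b)

nbest_1 = [{"name": "Laurel, B",
            "title": "Interface Agents: Metaphors with Character The"},
           {"name": "Laurel, B.",
            "title": "Interface Agents: Metaphors with Character"}]
nbest_2 = [{"name": "Laurel, B.",
            "title": "Interface Agents: Metaphors with Character"}]
print(nbest_similarity(nbest_1, nbest_2))  # 1.0: the 2nd candidate matches exactly
```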
Results on 4 Sections of CiteSeer Citations

Coreference F1 performance:

N         Reinforce   Face    Reason   Constraint
1         0.946       0.967   0.945    0.961
3         0.950       0.979   0.961    0.960
7         0.948       0.979   0.951    0.971
9         0.982       0.967   0.960    0.971
Optimal   0.995       0.992   0.994    0.988

• Average error reduction is 35%.
• "Optimal" makes the best use of the N-best list by using true labels.
• Indicates that even more improvement can be obtained.
Conclusions

• Conditional Random Fields combine the benefits of
  – conditional probability models (arbitrary features)
  – Markov models (for sequences or other relations)
• Success in
  – factorial finite state models
  – coreference analysis
  – semi-supervised learning
  – segmentation uncertainty aiding coreference
• Future work:
  – structure learning
  – further tight integration of IE and data mining
  – application to social network analysis
End of Talk
Application Project:

[Entity-relationship diagram, built up in stages: Research Paper and Cites, then Person, University, Conference, Grant, Groups, Expertise]
Software Infrastructure

MALLET: Machine Learning for Language Toolkit

• ~60k lines of Java
• Document classification, information extraction, clustering, co-reference, POS tagging, shallow parsing, relational classification, …
• Many ML basics in a common, convenient framework:
  – naïve Bayes, MaxEnt, Boosting, SVMs, Dirichlets, Conjugate Gradient
• Advanced ML algorithms:
  – Conditional Random Fields, Maximum Margin Markov Networks, BFGS, Expectation Propagation, Tree-Reparameterization, …
• Unlike other toolkits (e.g. Weka), MALLET scales to millions of features and 100k's of training examples, as needed for NLP.

Released as Open Source Software. http://mallet.cs.umass.edu

In use at UMass, MIT, CMU, UPenn, …
End of Talk