Finite State Parsing & Information Extraction CMSC 35100 Intro to NLP January 10, 2006

Finite State Parsing & Information Extraction

CMSC 35100

Intro to NLP

January 10, 2006

Roadmap

• Motivation
  – Limitations & Advantages

• Example: Fastus
  – Finite state cascades

• Other applications

Why NOT Finite State?

• Fundamental representational limitations
  – Finite state systems can’t handle unbounded recursion
  – Unsupported phenomena: center embedding, etc.
  – Regular languages are a strict subset of context-free languages

Why Finite State?

• Significant computational advantages
  – FAST!!!!
    • 10 mins vs 36 hours for 100 messages
  – Can compile rules, even CFGs, to transducers
    • Approximate CFGs, overgenerate in specific ways
  – Toolkits

• Minimal representational limitations
  – Most recursion is actually bounded
    • Human memory practically limits depth of recursion
    • Unroll finite number of recursions

• Sufficient, simple representation for many tasks
  – Information extraction, speech recognition
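The idea of unrolling a finite number of recursions can be sketched in a few lines. This is only an illustration: the function name `unrolled_pattern` and the symbols `a` and `b` (standing in for matched openers and closers of a center-embedded construction) are assumptions, not part of the lecture.

```python
import re

def unrolled_pattern(open_sym, close_sym, max_depth):
    """Build a regular expression for open^n close^n with 1 <= n <= max_depth.

    The unbounded language a^n b^n is not regular, but once the depth is
    bounded the recursion can be unrolled into a finite list of alternatives.
    """
    alternatives = [
        re.escape(open_sym) * n + re.escape(close_sym) * n
        for n in range(1, max_depth + 1)
    ]
    return re.compile("(?:" + "|".join(alternatives) + ")")

pattern = unrolled_pattern("a", "b", 3)
print(bool(pattern.fullmatch("aabb")))      # depth 2, within the bound
print(bool(pattern.fullmatch("aaaabbbb")))  # depth 4, beyond the bound
```

The price of the approximation is visible in the second call: anything deeper than the chosen bound is simply rejected.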

Fastus & MUC

• MUC: Message Understanding Conference
  – DARPA shared-task evaluation
  – Task: Information extraction
    • Essentially form-filling
  – Only 10% info relevant, no nuance
  – Joint ventures, terrorist incidents
  – Original system: Deep syntax, KR, Semantics
    • High precision, best in task
    • SLOW!!!! 36 hours for 100 messages

MUC Example

Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month.

TIE-UP-1:
  Relationship: TIE-UP
  Entities: “Bridgestone Sports Co.”, “a local concern”, “a Japanese trading house”
  Joint Venture Company: “Bridgestone Sports Taiwan Co.”
  Activity: ACTIVITY-1
  Amount: NT$20000000

A-1:
  Activity: PRODUCTION
  Company: “Bridgestone Sports Taiwan Co.”
  Product: “iron and ‘metal wood’ clubs”
  Start Date: DURING: January 1990

Finite-State Cascade

• Cascade of FSTs
  – Separates stages of processing
    • Initially: smaller units, linguistically based
    • Later: larger units, domain-specific information
  – Complex words: multi-words, proper names
  – Basic phrases: noun groups, verb groups, part
  – Complex phrases: Complex NG, VG
  – Domain events: Application info
  – Merging structures: co-ref, related info

Complex Words

• Identifies “multiwords”
  – E.g. set up, trading house, joint venture
  – Company names, people, locations, etc.

• Fixed expressions recognized with microgrammars

• Subsequent stages can also distinguish
  – E.g. preceding appositive
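A minimal sketch of the multiword step, assuming a small hand-written lexicon and greedy longest-match lookup. The lexicon contents, the function name `merge_multiwords`, and the underscore-joining convention are all illustrative choices rather than details of Fastus itself.

```python
# Hypothetical multiword lexicon; entries are lowercase token tuples.
MULTIWORDS = {("set", "up"), ("trading", "house"), ("joint", "venture")}
MAX_LEN = max(len(entry) for entry in MULTIWORDS)

def merge_multiwords(tokens):
    """Replace each known multiword with one token, preferring the longest match."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to length 2.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in MULTIWORDS:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multiword starts here: copy the token through unchanged.
            out.append(tokens[i])
            i += 1
    return out

print(merge_multiwords("it has set up a joint venture".split()))
# ['it', 'has', 'set_up', 'a', 'joint_venture']
```

A fixed lexicon only covers the fixed expressions; names of companies, people, and locations would need pattern-based recognizers of the kind the slide calls microgrammars.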

MUC Example: Basic Phrases

Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan.

Company name: Bridgestone Sports Co.
Verb Group: said
Noun Group: Friday
Noun Group: it
Verb Group: has set up
Noun Group: a joint venture
Preposition: in
Location: Taiwan
Preposition: with
Noun Group: a local concern
Conjunction: and
Noun Group: a Japanese trading house
Verb Group: to produce
Noun Group: golf clubs
Verb Group: to be shipped
Preposition: to
Location: Japan

Noun Group Extraction

• Noun Group: Head noun + premodifiers

NG -> Pronoun | Time-NP | Date-NP | (DETP) (Adjs) HdNns | DETP Ving HdNns | DETP-CP (and HdNns)

DETP -> DETP-CP | DETP-INCP

DETP-CP -> { {Adv-pre-num | “another” | {Det | Pro-Poss} ({Adv-pre-num | “only” (“other”)})} Number | Q | Q-er | (“the”) Q-est | “another” | Det-cp | DetQ | Pro-Poss-cp }

DETP-INCP -> { { {Det | Pro-Poss} “only” | “a” | “an” | Det-incomp | Pro-Poss-incomp } (“other”) | (DETP-CP) “other” }

Noun Group Extraction

• Adjs -> AdjP ({“,” | (“,”) Conj} {AdjP | Vparticiple})*

• AdjP -> Ordinal | ({Q-er | Q-est}) {Adj | Vparticiple}+ | {N[sing,!Time-NP] (“-”) {Vparticiple} | Number (“-”) {“month” | “day” | “year”} (“-”) “old”}

• HdNns -> HdNn (“and” HdNn)

• HdNn -> PropN | {PreNs | PropN PreNs} N[!Time-NP] | {PropN CommonN[!Time-NP]}

• PreNs -> PreN (“and” PreN2)

• PreN -> (Adj “-”) Common-Sing-N

• PreN2 -> PreN | Ordinal | Adj-noun-like
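To give a feel for how such rules turn into a non-recursive chunker, here is a deliberately reduced sketch. It assumes a stand-in rule, NG -> (Det) Adj* N+, over a toy tag set (`Det`, `Adj`, `N`, `Conj`); the real grammar above is far richer, and the function name `noun_groups` is hypothetical.

```python
def noun_groups(tagged):
    """Scan (word, tag) pairs left to right and collect maximal noun groups.

    Simplified rule assumed here: optional Det, any number of Adj, one or more N.
    """
    groups = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if j < n and tagged[j][1] == "Det":      # optional determiner
            j += 1
        while j < n and tagged[j][1] == "Adj":   # zero or more adjectives
            j += 1
        k = j
        while k < n and tagged[k][1] == "N":     # one or more head nouns
            k += 1
        if k > j:                                 # at least one noun was found
            groups.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return groups

tagged = [
    ("a", "Det"), ("local", "Adj"), ("concern", "N"), ("and", "Conj"),
    ("a", "Det"), ("Japanese", "Adj"), ("trading", "N"), ("house", "N"),
]
print(noun_groups(tagged))
# ['a local concern', 'a Japanese trading house']
```

Because the scan never calls itself, it corresponds to a finite-state recognizer: the conjunction is left for a later stage to attach.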

Noun Group Extraction: AdjP FSA

[Figure: finite-state automaton with states 0–3; arcs labeled AdjP, Vparticiple, “,”, “and”, and ε]
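The rule Adjs -> AdjP ({“,” | (“,”) Conj} {AdjP | Vparticiple})* can be written out as an explicit transition table. The state numbering below is chosen for this sketch and is not guaranteed to match the numbering in the slide's diagram; the input is assumed to be a sequence of already-recognized symbols.

```python
# (state, input symbol) -> next state
TRANSITIONS = {
    (0, "AdjP"): 1,
    (1, ","): 2,             # comma, optionally followed by a conjunction
    (1, "Conj"): 3,          # conjunction without a comma
    (2, "Conj"): 3,
    (2, "AdjP"): 1,
    (2, "Vparticiple"): 1,
    (3, "AdjP"): 1,
    (3, "Vparticiple"): 1,
}
FINAL_STATES = {1}

def accepts(symbols):
    """Return True if the symbol sequence is a well-formed Adjs string."""
    state = 0
    for symbol in symbols:
        state = TRANSITIONS.get((state, symbol))
        if state is None:    # no arc for this symbol: reject
            return False
    return state in FINAL_STATES

print(accepts(["AdjP", ",", "Conj", "Vparticiple"]))  # True
print(accepts(["AdjP", "Conj"]))                      # False: dangling conjunction
```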

Noun Group Extraction: Adj FSA

[Figure: finite-state automaton for AdjP with states 0–9; arcs labeled Ordinal, Q-er, Q-est, Adj, Vparticiple, Nsing[!TimeNP], “-”, “month”, “day”, “year”, “old”, and ε]

Complex Phrases

• Build up from basic noun and verb groups
  – Attach appositives
  – Construct measure phrases
  – Attach prepositional phrases
  – Conjoin noun phrases

• Combine syntactic variants, modalities with common meaning

• Identify domain entities and events

Domain Events

• Ordered list of complex phrases
  – Drops out all other elements -> robustness

• Transitions driven by headword + phrase type
  – E.g. “company-NounGroup”, “Formed-PassiveVerbGroup”
    • <Company> <Set-up> <Joint-Venture> with <Company>
    • <Produce> <Product>

• Map to particular extracted units
  – E.g. Entities in set-up, Production + Product Type
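The domain-event stage can be sketched as pattern matching over a filtered phrase list. The phrase symbols, the `RELEVANT` set, and the output dictionary format below are assumptions made for illustration; they only mimic the <Company> <Set-up> <Joint-Venture> with <Company> pattern from the slide.

```python
# Hypothetical output of the earlier stages: (symbol, text) pairs in sentence order.
phrases = [
    ("Company", "Bridgestone Sports Co."),
    ("Other", "said"),
    ("Other", "Friday"),
    ("Other", "it"),
    ("Set-up", "has set up"),
    ("Joint-Venture", "a joint venture"),
    ("Other", "in"),
    ("Other", "Taiwan"),
    ("with", "with"),
    ("Company", "a local concern"),
]

RELEVANT = {"Company", "Set-up", "Joint-Venture", "with"}
TIE_UP_PATTERN = ["Company", "Set-up", "Joint-Venture", "with", "Company"]

def match_tie_up(phrases):
    """Drop irrelevant phrases, then look for the tie-up pattern."""
    kept = [(sym, text) for sym, text in phrases if sym in RELEVANT]
    symbols = [sym for sym, _ in kept]
    width = len(TIE_UP_PATTERN)
    for start in range(len(symbols) - width + 1):
        if symbols[start:start + width] == TIE_UP_PATTERN:
            window = kept[start:start + width]
            entities = [text for sym, text in window if sym == "Company"]
            return {"relationship": "TIE-UP", "entities": entities}
    return None

print(match_tie_up(phrases))
# {'relationship': 'TIE-UP', 'entities': ['Bridgestone Sports Co.', 'a local concern']}
```

Skipping everything outside `RELEVANT` is what gives the robustness mentioned on the slide: intervening material such as "said Friday it" cannot block the match.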

Multi-layer Cascades

• Finesse the recursion problem
  – Automata construction expands rules -> automata
  – AdjPs are duplicated, but no self-reference
  – AdjPs and NPs in conjunction independent

• One level identifies base, non-recursive NGs

• Next levels combine with
  – Measure phrases, prepositional phrases, conjunction

• Limits depth of possible “recursive” constructs

More Complete FST Parsing

• Roche 1996, 97, etc.

• Construct syntactic dictionary
  – S | N thinks that S; S | N kept N
  – N | John; N | Peter; N | the book

• Convert entries to finite-state transducers
  – [S a thinks that b S] -> (S [N a N] <V thinks V> that [S b S] S)
  – [N John N] -> (N John N)

Transducer Dictionary

Full Transducer Dictionary

Transducers -> Parser

• Transducer dictionary = Union of transducers
  – T_dic = U T_i

• Parser = Repeated application of transducers
  – Repeat until output = input
    • Transduction causes no change
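The repeat-until-unchanged loop can be sketched with regular-expression rewrites standing in for the transducers. This is an approximation under stated assumptions: the three rules below are hand-written for the dictionary entries mentioned in the lecture, they are applied one after another rather than as a true union, and the names `RULES`, `apply_dictionary`, and `parse` are invented for this example.

```python
import re

# Each pair is (pattern, replacement), loosely mirroring one dictionary entry.
RULES = [
    (re.compile(r"\[S (.+?) thinks that (.+) S\]"),
     r"(S [N \1 N] <V thinks V> that [S \2 S] S)"),
    (re.compile(r"\[S (.+?) kept (.+) S\]"),
     r"(S [N \1 N] <V kept V> [N \2 N] S)"),
    (re.compile(r"\[N (John|Peter|the book) N\]"),
     r"(N \1 N)"),
]

def apply_dictionary(text):
    """One pass of the (approximated) dictionary transducer."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

def parse(sentence):
    """Apply the dictionary repeatedly until the output equals the input."""
    current = f"[S {sentence} S]"
    while True:
        result = apply_dictionary(current)
        if result == current:   # fixed point: transduction causes no change
            return current
        current = result

print(parse("John thinks that Peter kept the book"))
# (S (N John N) <V thinks V> that (S (N Peter N) <V kept V> (N the book N) S) S)
```

Square brackets mark material still to be analyzed and round brackets mark finished constituents, so the loop stops once no square-bracketed span matches any entry.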

Finite-State Extensions

• Finite-State Approaches to
  – Tree Adjoining Grammars
  – Machine translation
  – Multimodal analysis and interpretation

Probabilistic CFGs

Handling Syntactic Ambiguity

• Natural language syntax
  – Varied, has DEGREES of acceptability
  – Ambiguous

• Probability: framework for preferences
  – Augment original context-free rules: PCFG
  – Add probabilities to transitions

NP -> N           0.20
NP -> Det N       0.65
NP -> Det Adj N   0.10
NP -> NP PP       0.05

VP -> V           0.45
VP -> V NP        0.45
VP -> V NP PP     0.10

S -> NP VP        0.85
S -> S conj S     0.15

PP -> P NP        1.0
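In a PCFG the probabilities of all rules sharing a left-hand side must sum to 1. The sketch below stores the rule probabilities listed above in a dictionary and checks that property; the dictionary layout and the function name `lhs_totals` are choices made for this example.

```python
import math
from collections import defaultdict

# (left-hand side, right-hand side) -> rule probability
PCFG = {
    ("NP", ("N",)): 0.20,
    ("NP", ("Det", "N")): 0.65,
    ("NP", ("Det", "Adj", "N")): 0.10,
    ("NP", ("NP", "PP")): 0.05,
    ("VP", ("V",)): 0.45,
    ("VP", ("V", "NP")): 0.45,
    ("VP", ("V", "NP", "PP")): 0.10,
    ("S", ("NP", "VP")): 0.85,
    ("S", ("S", "conj", "S")): 0.15,
    ("PP", ("P", "NP")): 1.0,
}

def lhs_totals(grammar):
    """Sum the rule probabilities for each left-hand-side symbol."""
    totals = defaultdict(float)
    for (lhs, _rhs), prob in grammar.items():
        totals[lhs] += prob
    return dict(totals)

for lhs, total in lhs_totals(PCFG).items():
    # isclose guards against floating-point rounding in the sums
    print(lhs, "ok" if math.isclose(total, 1.0) else f"sums to {total}")
```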

PCFGs

• Learning probabilities
  – Strategy 1: Write (manual) CFG,
    • Use treebank (collection of parse trees) to find probabilities

• Parsing with PCFGs
  – Rank parse trees based on probability
  – Provides graceful degradation
    • Can get some parse even for unusual constructions - low value
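One common way to find probabilities from a treebank is relative-frequency estimation: P(A -> β) = count(A -> β) / count(A). The sketch below assumes that estimator (the slides do not name one) and a toy two-tree treebank encoded as nested tuples; the data and the function names are hypothetical.

```python
from collections import Counter

def rules(tree):
    """Yield (lhs, rhs) for every phrasal node of a nested-tuple tree."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return  # preterminal over a word: lexical rules are skipped in this sketch
    yield label, tuple(child[0] for child in children)
    for child in children:
        yield from rules(child)

def estimate(treebank):
    """Relative-frequency estimate of each rule's probability."""
    rule_counts = Counter(rule for tree in treebank for rule in rules(tree))
    lhs_counts = Counter()
    for (lhs, _rhs), count in rule_counts.items():
        lhs_counts[lhs] += count
    return {rule: count / lhs_counts[rule[0]] for rule, count in rule_counts.items()}

# Toy treebank, invented for illustration.
treebank = [
    ("S", ("NP", ("N", "I")),
          ("VP", ("V", "saw"), ("NP", ("Det", "the"), ("N", "man")))),
    ("S", ("NP", ("Det", "the"), ("N", "man")),
          ("VP", ("V", "slept"))),
]

for (lhs, rhs), prob in sorted(estimate(treebank).items()):
    print(f"{lhs} -> {' '.join(rhs)}: {prob:.2f}")
```

With only two trees the estimates are crude (NP -> Det N gets 2/3, NP -> N gets 1/3), which is why a sizable treebank is needed in practice.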

Parse Ambiguity

• Two parse trees

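T1 (PP attached to the VP):
[S [NP [N I]] [VP [V saw] [NP [Det the] [N man]] [PP [P with] [NP [Det the] [N duck]]]]]

T2 (PP attached to the object NP):
[S [NP [N I]] [VP [V saw] [NP [NP [Det the] [N man]] [PP [P with] [NP [Det the] [N duck]]]]]]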

Parse Probabilities

– T(ree), S(entence), n(ode), r(ule)

$P(T, S) = \prod_{n \in T} p(r(n))$

– T1 = 0.85*0.2*0.1*0.65*1*0.65 ≈ 0.007
– T2 = 0.85*0.2*0.45*0.05*0.65*1*0.65 ≈ 0.0016

• Select T1

• Best systems achieve 92-93% accuracy
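The two products can be checked mechanically by multiplying the rule probabilities over every phrasal node of each tree. As in the hand calculation, lexical probabilities are ignored; the nested-tuple tree encoding and the function name `tree_probability` are assumptions of this sketch.

```python
RULE_PROB = {
    ("S", ("NP", "VP")): 0.85,
    ("NP", ("N",)): 0.20,
    ("NP", ("Det", "N")): 0.65,
    ("NP", ("NP", "PP")): 0.05,
    ("VP", ("V", "NP")): 0.45,
    ("VP", ("V", "NP", "PP")): 0.10,
    ("PP", ("P", "NP")): 1.0,
}

def tree_probability(tree):
    """Product of p(r(n)) over the phrasal nodes n of the tree."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return 1.0  # preterminal over a word: no lexical probability in this sketch
    prob = RULE_PROB[(label, tuple(child[0] for child in children))]
    for child in children:
        prob *= tree_probability(child)
    return prob

THE_MAN = ("NP", ("Det", "the"), ("N", "man"))
WITH_THE_DUCK = ("PP", ("P", "with"), ("NP", ("Det", "the"), ("N", "duck")))
SUBJECT = ("NP", ("N", "I"))

# T1 attaches the PP to the VP; T2 attaches it to the object NP.
T1 = ("S", SUBJECT, ("VP", ("V", "saw"), THE_MAN, WITH_THE_DUCK))
T2 = ("S", SUBJECT, ("VP", ("V", "saw"), ("NP", THE_MAN, WITH_THE_DUCK)))

print(f"T1 = {tree_probability(T1):.4f}")  # T1 = 0.0072
print(f"T2 = {tree_probability(T2):.4f}")  # T2 = 0.0016
```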