Natural Language Processing


4

Why “natural language”?

Natural vs. artificial: not precise, ambiguous, wide range of expression

Language vs. English: English, French, Japanese, Spanish

Natural language processing = programs and theories aimed at understanding a problem or question posed in natural language and answering it

5

Approaches

System building: interactive, understanding only, generation only

Theoretical: draws on linguistics, psychology, philosophy

6

Building an NL system is hard

Unlikely to be possible without solid theoretical underpinnings

7

Natural language is useful

Question-answering systems: http://tangra.si.umich.edu/clair/NSIR/NSIR.cgi

Mixed initiative systems: http://www.cs.columbia.edu/~noemie/match.mpg

Information extraction: http://nlp.cs.nyu.edu/info-extr/biomedical-snapshot.jpg

Systems that write/speak: http://www-2.cs.cmu.edu/~awb/synthesizers.html, MAGIC

Machine translation: http://world.altavista.com/babelfish

8

Topics

Syntax

Semantics

Pragmatics

Statistical NLP: combining learning and NL processing

9

Goal of Interpretation

Identify sentence meaning

Do something with meaning

Need some representation of action/meaning

10

Analysis of form: Syntax

Which parts were damaged by larger machines?

Which parts damaged larger machines?

Which larger machines damaged parts?

Approaches:
Statistical part of speech tagging
Parsing using a grammar
Shallow parsing: identify meaningful chunks
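As an illustration of shallow parsing, the sketch below groups hand-tagged tokens into noun-phrase chunks. The tag names, the hand-assigned tags, and the `np_chunks` helper are assumptions made for this example; a real system would take its tags from a statistical part-of-speech tagger.

```python
from typing import List, Tuple

# Hand-assigned (word, tag) pairs; the tag set is an assumption for this sketch.
TAGGED = [
    ("which", "DET"), ("parts", "N"), ("were", "AUX"), ("damaged", "V"),
    ("by", "P"), ("larger", "ADJ"), ("machines", "N"),
]

def np_chunks(tagged: List[Tuple[str, str]]) -> List[List[str]]:
    """Group each run of optional DET, any ADJ, and one or more N into a chunk."""
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DET":          # optional determiner
            j += 1
        while j < n and tagged[j][1] == "ADJ":   # any number of adjectives
            j += 1
        k = j
        while k < n and tagged[k][1] == "N":     # one or more nouns
            k += 1
        if k > j:                          # at least one noun: accept the chunk
            chunks.append([word for word, _ in tagged[i:k]])
            i = k
        else:                              # no noun here: move on one token
            i += 1
    return chunks

print(np_chunks(TAGGED))
```

The chunker finds "which parts" and "larger machines" without building a full parse tree, which is the point of shallow parsing.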

11

Which parts were damaged by larger machines?

S (Q)
    NP
        ADJ: larger
        N: machines
    VP
        V (past): damage
        NP (Q)
            Det (Q): which
            N: parts

12

Which parts were damaged by machines? – with functional roles

S (Q)
    NP (SUBJ)
        ADJ: larger
        N: machines
    VP
        V (past): damage
        NP (Q) (OBJ)
            Det (Q): which
            N: parts

13

Which parts damaged machines? – with functional roles

S (Q)
    NP (Q) (SUBJ)
        Det (Q): which
        N: parts
    VP
        V (past): damage
        NP (OBJ)
            ADJ: larger
            N: machines

14

Parsers

Grammar
S -> NP VP
NP -> DET {ADJ*} N

Different types of grammars
Context Free vs. Context Sensitive
Lexical Functional Grammar vs. Tree Adjoining Grammars

Different ways of acquiring grammars
Hand-encoded vs. machine learned
Domain independent (TreeBank, Wall Street Journal)
Domain dependent (Medical texts)
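The two grammar rules on this slide can be turned into a small recursive-descent recognizer over part-of-speech tags. This is a minimal sketch: the slide gives no rule for VP, so `VP -> V NP` is an assumption, and the function names are made up for the example.

```python
from typing import List, Optional

def parse_np(tags: List[str], i: int) -> Optional[int]:
    """NP -> DET {ADJ*} N. Return the index just past the NP, or None."""
    if i < len(tags) and tags[i] == "DET":
        i += 1
        while i < len(tags) and tags[i] == "ADJ":
            i += 1
        if i < len(tags) and tags[i] == "N":
            return i + 1
    return None

def parse_vp(tags: List[str], i: int) -> Optional[int]:
    """VP -> V NP (assumed rule; not given on the slide)."""
    if i < len(tags) and tags[i] == "V":
        return parse_np(tags, i + 1)
    return None

def is_sentence(tags: List[str]) -> bool:
    """S -> NP VP, and the whole tag sequence must be consumed."""
    i = parse_np(tags, 0)
    if i is None:
        return False
    return parse_vp(tags, i) == len(tags)

# "the larger machines damaged the parts"
print(is_sentence(["DET", "ADJ", "N", "V", "DET", "N"]))
```

A hand-encoded grammar like this is easy to inspect but covers very little; that trade-off is what motivates machine-learned grammars.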

15

Semantics: analysis of meaning

Word meaning
John picked up a bad cold.
John picked up a large rock.
John picked up Radio Netherlands on his radio.
John picked up a hitchhiker on Highway 66.

Phrasal meaning
Baby bonuses -> allocations
Senior citizens -> personnes âgées
Causing havoc -> semer le désarroi

Approaches
Representing meaning
Statistical word disambiguation
Symbolic rule-based vs. shallow statistical semantics

16

Representing Meaning - WordNet

17

18

OMEGA

http://omega.isi.edu:8007/index

http://omega.is.edu/doc/browsers.html

19

20

Statistical Word Sense Disambiguation

Context within the sentence determines which sense is correct

The candidate picked up [sense6] thousands of additional votes.

He picked up [sense2] the book and started to read.

Her performance in school picked up [sense13].

The swimmers got out of the river and climbed the bank [sloping land] to retrieve their towels.

The investors took their money out of the bank [financial institution] and moved it into stocks and bonds.

21

Goal

A program which can predict which sense is the correct sense given a new sentence containing “pick up” or “bank”

Avoid manually itemizing all words which can occur in sentences with different meanings

Can we use machine learning?

22

What do we need?

Data

Features

Machine Learning algorithm
Decision tree vs. SVM/Naïve Bayes
Inspecting the output

Accuracy of these methods

23

Using Categories from Roget’s Thesaurus (e.g., machine vs. animal) for training

24

Training data for “machines”

25

26

Predicting the correct sense in unseen text

Use presence of the salient words in context

50-word window

Use Bayes’ rule to compute probabilities for different categories
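A minimal sketch of this prediction step follows: a naive Bayes classifier with add-one smoothing that scores each category from the words in a context window. The toy training contexts, the smoothing choice, and the reading of the window as 50 words on each side of the target are assumptions for illustration, not the method's actual data or settings.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (category, context_words). Collect the counts Bayes' rule needs."""
    cat_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for cat, words in examples:
        cat_counts[cat] += 1
        word_counts[cat].update(words)
        vocab.update(words)
    return cat_counts, word_counts, vocab

def window(tokens, target_index, size=50):
    """Words within `size` positions on either side of the target word."""
    lo = max(0, target_index - size)
    return tokens[lo:target_index] + tokens[target_index + 1:target_index + 1 + size]

def predict(model, context):
    """Return the category maximizing log P(cat) + sum of log P(word | cat)."""
    cat_counts, word_counts, vocab = model
    total = sum(cat_counts.values())
    best_cat, best_score = None, float("-inf")
    for cat in cat_counts:
        score = math.log(cat_counts[cat] / total)
        denom = sum(word_counts[cat].values()) + len(vocab)  # add-one smoothing
        for w in context:
            if w in vocab:  # words never seen in training carry no evidence
                score += math.log((word_counts[cat][w] + 1) / denom)
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat

# Toy training contexts (made up for this sketch).
model = train([
    ("animal", ["bird", "nest", "wings", "feathers", "marsh"]),
    ("animal", ["bird", "migrate", "wings", "flock"]),
    ("machine", ["lift", "heavy", "construction", "steel", "operator"]),
    ("machine", ["lift", "load", "hoist", "heavy", "tower"]),
])
tokens = "treadmills attached to cranes were used to lift heavy objects from roman times".split()
print(predict(model, window(tokens, tokens.index("cranes"))))
```

Here "lift" and "heavy" are the salient context words, so the machine category scores higher.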

27

“Crane”

Occurred 74 times in Grolier’s, 36 as animal, 38 as machine

Predictions in new sentences were 99% correct

Example: lift water and to grind grain .PP Treadmills attached to cranes were used to lift heavy objects from Roman times.

28

29

30

Going Home – A play in one act

Scene 1: Pennsylvania Station, NYC
Bonnie: Long Beach?
Passerby: Downstairs, LIRR Station

Scene 2: ticket counter: LIRR
Bonnie: Long Beach?
Clerk: $4.50

Scene 3: Information Booth, LIRR
Bonnie: Long Beach?
Clerk: 4:19, Track 17

Scene 4: On the train, vicinity of Forest Hills
Bonnie: Long Beach?
Conductor: Change at Jamaica

Scene 5: On the next train, vicinity of Lynbrook
Bonnie: Long Beach?
Conductor: Right after Island Park.

31

Question Answering on the web

Input: English question

Data: documents retrieved by a search engine from the web

Output: The phrase(s) within the documents that answer the question

32

Examples

When was X born?
When was Mozart born? Mozart was born in 1756.
When was Gandhi born? Gandhi (1869-1948)

Where are the Rocky Mountains located?

What is nepotism?
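The two "When was X born?" answers above suggest two surface patterns. The sketch below encodes them as regular expressions; the `birth_year` helper and the exact expressions are assumptions written for this example.

```python
import re

def birth_year(name, text):
    """Return the birth year found for `name` in `text`, or None."""
    n = re.escape(name)
    patterns = [
        rf"{n} was born in (\d{{4}})",            # "Mozart was born in 1756."
        rf"{n} \((\d{{4}})\s*[-–]\s*\d{{4}}\)",   # "Gandhi (1869-1948)"
    ]
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(1)
    return None

print(birth_year("Mozart", "Mozart was born in 1756."))
print(birth_year("Gandhi", "Gandhi (1869-1948)"))
```

Questions such as "What is nepotism?" need different patterns, since the answer is a definition rather than a date.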

33

Common Approach

Create a query from the question
When was Mozart born -> Mozart born

Use WordNet to expand terms and increase recall:
Which high school was ranked highest in the US in 1998?
“high school” -> (high&school)|(senior&high&school)|(senior&high)|high|highschool

Use search engine to find relevant documents

Pinpoint passage within document that has answer using patterns

From IR to NLP
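The first two steps of this approach, query creation and term expansion, can be sketched as below. The stopword list and the `EXPANSIONS` table are hypothetical stand-ins: a real system would look the alternatives up in WordNet rather than in a hand-written dictionary.

```python
# Words dropped when turning a question into a query (assumed list).
STOPWORDS = {"when", "was", "which", "what", "where", "is", "are", "in", "the"}

# Hypothetical stand-in for a WordNet lookup.
EXPANSIONS = {
    "high school": ["high&school", "senior&high&school", "senior&high", "high", "highschool"],
}

def make_query(question):
    """Strip punctuation and stopwords: 'When was Mozart born?' -> 'Mozart born'."""
    words = [w.strip("?.,!") for w in question.split()]
    return " ".join(w for w in words if w and w.lower() not in STOPWORDS)

def expand(term, table=EXPANSIONS):
    """Join the alternatives for a term into one boolean OR expression."""
    alternatives = table.get(term, [term.replace(" ", "&")])
    return "|".join(f"({a})" if "&" in a else a for a in alternatives)

print(make_query("When was Mozart born?"))
print(expand("high school"))
```

Expansion raises recall at the cost of precision, which is why the later step pinpoints the answer passage with patterns.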

34

PRODUCE A BIOGRAPHY OF [PERON].
Only these fields are Relevant:

1. Name(s), aliases:
2. *Date of Birth or Current Age:
3. *Date of Death:
4. *Place of Birth:
5. *Place of Death:
6. Cause of Death:
7. Religion (Affiliations):
8. Known locations and dates:
9. Last known address:
10. Previous domiciles:
11. Ethnic or tribal affiliations:
12. Immediate family members
13. Native Language spoken:
14. Secondary Languages spoken:
15. Physical Characteristics
16. Passport number and country of issue:
17. Professional positions:
18. Education
19. Party or other organization affiliations:
20. Publications (titles and dates):

35

Biography of Han Ming

Han Ming, born 1944 March in Pyongyan, South Korean Lei Fa Women’s University in French law, literature, a former female South Korean people, chairman of South Korea women’s groups,…Han, 62, has championed women’s rights and liberal political ideas. Han was imprisoned from 1979 to 1981 on charges of teaching pro-Communist ideas to workers, farmers and low-income women. She became the first minister of gender equality in 2001 and later served as an environment minister.

36

Biography – two approaches

To obtain high precision, we handle each slot independently using bootstrapping to learn IE patterns.

To improve the recall, we utilize a biography Language Model.

37

Approach

Characteristics of the IE approach

Training resource: Wikipedia and its manual annotations

Bootstrapping interleaves two corpora to improve precision
Wikipedia: reliable but small
Web: noisy but many relevant documents

No manual annotation or automatic tagging of corpus

Use seed tuples (person, date-of-birth) to find patterns

This approach is scalable for any corpus
Irrespective of size
Irrespective of whether it is static or dynamic

The IE system is augmented with language models to increase recall

38

Biography as an IE task

We need patterns to extract information from a sentence

Creating patterns manually is time-consuming and not scalable

We want to find these patterns automatically

39

Biography patterns from Wikipedia

40

• Martin Luther King, Jr., (January 15, 1929 – April 4, 1968) was the most …

• Martin Luther King, Jr., was born on January 15, 1929, in Atlanta, Georgia.


41

Run IdFinder on these sentences

<Person> Martin Luther King, Jr. </Person>, (<Date>January 15, 1929</Date> – <Date> April 4, 1968</Date>) was the most…

<Person> Martin Luther King, Jr. </Person>, was born on <Date> January 15, 1929 </Date>, in <GPE> Atlanta, Georgia </GPE>.

Take the token sequence that includes the tags of interest + some context (2 tokens before and 2 tokens after)
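The context-window step can be sketched as follows. Each tagged entity is collapsed to a single placeholder token, and the window runs from 2 tokens before the first tag of interest to 2 tokens after the last. The tokenizer and both helper names are assumptions for this example; IdFinder's real output format may differ.

```python
import re

ENTITY = re.compile(r"<(\w+)>\s*(.*?)\s*</\1>")
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]")

def tokenize(tagged):
    """Collapse each tagged entity to one placeholder token; split the rest into words and punctuation."""
    tokens, pos = [], 0
    for m in ENTITY.finditer(tagged):
        tokens += WORD_OR_PUNCT.findall(tagged[pos:m.start()])
        tokens.append(f"<{m.group(1)}>")
        pos = m.end()
    tokens += WORD_OR_PUNCT.findall(tagged[pos:])
    return tokens

def context_window(tokens, tags_of_interest, k=2):
    """Span from k tokens before the first tag of interest to k tokens after the last."""
    hits = [i for i, t in enumerate(tokens) if t in tags_of_interest]
    if not hits:
        return []
    return tokens[max(0, hits[0] - k): hits[-1] + k + 1]

sentence = ("<Person> Martin Luther King, Jr. </Person>, was born on "
            "<Date> January 15, 1929 </Date>, in <GPE> Atlanta, Georgia </GPE>.")
print(" ".join(context_window(tokenize(sentence), {"<Person>", "<Date>"})))
```

For the date-of-birth slot this yields the person tag, the words up to the date tag, and two tokens of trailing context.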

42

Convert to Patterns:

<My_Person> (<My_Date> – <Date>) was the

<My_Person> , was born on <My_Date>, in

Remove more specific patterns – if one pattern contains another, keep the shortest one that is longer than k tokens.

<MY_Person> , was born on <My_Date>

<My_Person> (<My_Date> – <Date>)

Finally, verify the patterns manually to remove irrelevant patterns.
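One possible reading of the "remove more specific patterns" step is sketched below: patterns are token tuples, a pattern is dropped when it contains another candidate as a contiguous subsequence, and candidates of k tokens or fewer are ignored. The value of k and this interpretation of containment are assumptions; the manual verification step is not modeled.

```python
def contains(longer, shorter):
    """True if `shorter` occurs in `longer` as a contiguous token subsequence."""
    n, m = len(longer), len(shorter)
    return any(longer[i:i + m] == shorter for i in range(n - m + 1))

def drop_specific(patterns, k=2):
    """Keep a pattern only if it has more than k tokens and contains no other candidate."""
    candidates = [p for p in set(patterns) if len(p) > k]
    kept = [p for p in candidates
            if not any(q != p and contains(p, q) for q in candidates)]
    return sorted(kept)

long_pattern = ("<MY_Person>", ",", "was", "born", "on", "<MY_Date>", ",", "in")
short_pattern = ("<MY_Person>", ",", "was", "born", "on", "<MY_Date>")
too_short = ("born", "on")

print(drop_specific([long_pattern, short_pattern, too_short]))
```

Only the six-token pattern survives: the longer one contains it, and the two-token one is not longer than k.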

43

Examples of Patterns:

502 distinct place-of-birth patterns:
600 <MY_Person> was born in <MY_GPE>
169 <MY_Person> ( born <Date> in <MY_GPE> )
44 Born in <MY_GPE> <MY_Person>
10 <MY_Person> was a native <MY_GPE>
10 <MY_Person> 's hometown of <MY_GPE>
1 <MY_Person> was baptized in <MY_GPE>
…

291 distinct date-of-death patterns:
770 <MY_Person> ( <Date> - <MY_Date> )
92 <MY_Person> died on <MY_Date>
19 <MY_Person> <Date> - <MY_Date>
16 <MY_Person> died in <GPE> on <MY_Date>
3 <MY_Person> passed away on <MY_Date>
1 <MY_Person> committed suicide on <MY_Date>
…
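To show how patterns like these might be applied, the sketch below compiles one into a regular expression and runs it over NE-tagged text. Slots prefixed with `MY_` are captured and other tags are matched but not captured. The compilation scheme and helper names are assumptions for illustration, not the system's actual matcher.

```python
import re

SLOT = re.compile(r"<(MY_)?(\w+)>")

def pattern_to_regex(pattern):
    """Compile a whitespace-separated pattern into a regex over NE-tagged text."""
    parts = []
    for token in pattern.split():
        m = SLOT.fullmatch(token)
        if m:
            tag = m.group(2)
            body = "(.*?)" if m.group(1) else ".*?"   # capture only MY_ slots
            parts.append(rf"<{tag}>\s*{body}\s*</{tag}>")
        else:
            parts.append(re.escape(token))
    return re.compile(r"\s*".join(parts))

def extract(pattern, tagged_sentence):
    """Return the captured slot values, or None if the pattern does not match."""
    m = pattern_to_regex(pattern).search(tagged_sentence)
    return m.groups() if m else None

sentence = ("<Person> Martin Luther King, Jr. </Person>, was born on "
            "<Date> January 15, 1929 </Date>, in <GPE> Atlanta, Georgia </GPE>.")
print(extract("<MY_Person> , was born on <MY_Date>", sentence))
```

The same function handles any pattern in the lists above, as long as the text has been tagged with matching entity types.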

44

Biography as an IE task

This approach is good for the consistently annotated fields in Wikipedia: place of birth, date of birth, place of death, date of death

Not all fields of interest are annotated; a different approach is needed to cover the rest of the slots

45

Bouncing between Wikipedia and Google

Use one seed only: <my person> and <target field>

Google: “Arafat” “civil engineering”, we get:

46

48

Bouncing between Wikipedia and Google

Use one seed tuple only: <my person> and <target field>

Google: “Arafat” “civil engineering”, we get:
Arafat graduated with a bachelor’s degree in civil engineering
Arafat studied civil engineering
Arafat, a civil engineering student
…

Using these snippets, corresponding patterns are created, then filtered manually

To get more seed pairs, go to Wikipedia biography pages only and search for:
“graduated with a bachelor’s degree in”
We get:

49

50

Bouncing between Wikipedia and Google

New seed tuples:
“Burnie Thompson” “political science”
“Henrey Luke” “Environment Studies”
“Erin Crocker” “industrial and management engineering”
“Denise Bode” “political science”
…

Go back to Google and repeat the process to get more seed patterns!
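The loop between the two sources can be sketched with in-memory lists standing in for Google snippets and Wikipedia biography sentences. Everything here is an assumption for illustration: the toy sentences, the helper names, the exact-string matching, and the omission of the manual filtering step.

```python
import re

PERSON, FIELD = "<my person>", "<target field>"

def find_patterns(seeds, snippets):
    """Turn each snippet mentioning a seed pair into a pattern with placeholders."""
    patterns = set()
    for person, field in seeds:
        for s in snippets:
            if person in s and field in s:
                patterns.add(s.replace(person, PERSON, 1).replace(field, FIELD, 1))
    return patterns

def find_seeds(patterns, sentences):
    """Match patterns against whole sentences to harvest new (person, field) pairs."""
    seeds = set()
    for p in patterns:
        regex = (re.escape(p)
                 .replace(re.escape(PERSON), "(?P<person>.+?)")
                 .replace(re.escape(FIELD), "(?P<field>.+?)"))
        for s in sentences:
            m = re.fullmatch(regex, s)
            if m:
                seeds.add((m.group("person"), m.group("field")))
    return seeds

def bootstrap(seeds, web_snippets, wiki_sentences, rounds=2):
    """Alternate: seeds -> patterns on the web, patterns -> seeds on Wikipedia."""
    seeds, patterns = set(seeds), set()
    for _ in range(rounds):
        patterns |= find_patterns(seeds, web_snippets)
        seeds |= find_seeds(patterns, wiki_sentences)
    return seeds, patterns

web = [
    "Arafat studied civil engineering",
    "Arafat graduated with a bachelor's degree in civil engineering",
    "Denise Bode majored in political science",
]
wiki = [
    "Denise Bode graduated with a bachelor's degree in political science",
    "Erin Crocker studied industrial and management engineering",
]
seeds, patterns = bootstrap({("Arafat", "civil engineering")}, web, wiki)
print(sorted(seeds))
print(sorted(patterns))
```

Starting from one seed, the first round learns two patterns and two new seeds; the second round uses a new seed to learn a third pattern.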

51

Bouncing between Wikipedia and Google

This approach worked well for a few fields, such as Education, Publications, Immediate family members, and Party or other organization affiliations

It did not provide good patterns for every field: for some (such as Religion, Ethnic or tribal affiliations, and Previous domiciles), we got a lot of noise

For some slots, we created some patterns manually

52

Biography as Sentence Selection and Ranking

To obtain high recall, we also want to include sentences that IE may miss, perhaps due to ill-formed sentences (ASR and MT)

Get the top 100 documents from Indri

Extract all sentences that contain the person or reference to him/her

Use a variety of features to rank these sentence…
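The selection and ranking steps can be sketched as below. The slide does not list the ranking features, so the two used here (a count of biography cue words and whether the person is named explicitly) are purely hypothetical, as are the word lists and weights. Document retrieval with Indri is not modeled.

```python
import re

PRONOUNS = {"he", "she", "him", "her", "his"}
# Hypothetical cue words; the real feature set is not given in the source.
CUE_WORDS = {"born", "died", "graduated", "married", "served", "minister"}

def names_person(sentence, name):
    """True if the name appears as a whole word (case-insensitive)."""
    return re.search(rf"\b{re.escape(name)}\b", sentence, re.IGNORECASE) is not None

def mentions(sentence, name):
    """Keep sentences that name the person or contain a pronoun that may refer to them."""
    words = set(re.findall(r"\w+", sentence.lower()))
    return names_person(sentence, name) or bool(words & PRONOUNS)

def score(sentence, name):
    """Illustrative score: 2 points per cue word, 1 point for naming the person."""
    words = re.findall(r"\w+", sentence.lower())
    return 2 * sum(w in CUE_WORDS for w in words) + int(names_person(sentence, name))

def rank(sentences, name):
    selected = [s for s in sentences if mentions(s, name)]
    return sorted(selected, key=lambda s: score(s, name), reverse=True)

sentences = [
    "Han was imprisoned from 1979 to 1981.",
    "She became the first minister of gender equality in 2001.",
    "The weather was mild.",
]
for s in rank(sentences, "Han"):
    print(s)
```

Because the filter accepts any pronoun, it favors recall over precision, in line with the goal stated on this slide.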