
Special Topics in Computer Science
Advanced Topics in Information Retrieval

Lecture 10: Natural Language Processing and IR.
Syntax and structural disambiguation

Alexander Gelbukh

www.Gelbukh.com

Previous Chapter: Conclusions

Tagging, word sense disambiguation, and anaphora resolution are cases of disambiguation of meaning

Useful in translation, information retrieval, and text understanding

Dictionary-based methods are good but expensive

Statistical methods are cheap and sometimes imperfect... but not always (if very large corpora are available)

Previous Chapter: Research topics

Too many to list
New methods
Lexical resources (dictionaries)
= Computational linguistics

Contents

Language levels
Syntax
  Dependency approach
  Constituency-based approach
  Head-driven approach
Grammars and parsing
Ambiguity and disambiguation

Language levels

Letters are built up into words
Words into sentences
Sentences into <...> text

Each level has its own representation
This allows for modular processing
A module describes one level or transforms from one level to another

Source of language complexity: 1-D

[Figure: meaning (knowledge) held in Brain 1 is expressed through language as a one-dimensional text (speech); Brain 2 converts the text back through language into meaning.]

Linguistic processor translates between representations

[Figure: texts pass through a linguistic module to and from meanings, which are exchanged with an applied system.]

General scheme of text processing

[Figure: input and output texts pass through the linguistic processor, which exchanges a (semantic) representation with the applied system (e.g., an expert system).]

Linguistic processor uses linguistic knowledge
Applied system uses other types of knowledge (e.g., Artificial Intelligence)

Language levels

Morphological: words
Syntactic: sentences
Semantic: meaning
Pragmatic: intention
...?

Fine structure of linguistic processor

[Figure: from the text (surface representation), a morphological transformer produces the morphological representation; a syntactic transformer produces the syntactic representation; a semantic transformer produces the semantic representation, i.e. the meaning.]

Example of text

“Science is important for our country. The Government pays it much attention.”

Textual representation

Text is a sequence of letters.

S c i e n c e   i s   i m p o r t a n t   f o r   o u r   c o u n t r y .   T h e   G o v e r n m e n t   p a y s   i t   m u c h   a t t e n t i o n .

Morphological analysis

[Figure: linguistic processor pipeline (morphological analyzer → syntactic parser → semantic analyzer), with the morphological analyzer highlighted.]

Morphological representation

A sequence of words:

The        THE        article      definite, plural/singular
science    SCIENCE    noun         singular
is         BE         verb         present, 3rd person, sing.
important  IMPORTANT  adjective
for        FOR        preposition
our        WE         pronoun      possessive
country    COUNTRY    noun         singular
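As a rough illustration (not the lecture's analyzer; the lexicon below is a toy, invented for this example), a dictionary-based morphological analyzer can be sketched as a lookup from word forms to lemma, part of speech, and features:

# A toy dictionary-based morphological analyzer (illustrative sketch only).
# Real analyzers use large lexicons plus rules for unseen word forms.
LEXICON = {
    "the":       ("THE", "article", {"definiteness": "definite"}),
    "science":   ("SCIENCE", "noun", {"number": "singular"}),
    "is":        ("BE", "verb", {"tense": "present", "person": 3, "number": "singular"}),
    "important": ("IMPORTANT", "adjective", {}),
    "for":       ("FOR", "preposition", {}),
    "our":       ("WE", "pronoun", {"case": "possessive"}),
    "country":   ("COUNTRY", "noun", {"number": "singular"}),
}

def analyze(sentence):
    """Return (word, lemma, POS, features) for each word form; unknown forms are marked."""
    rows = []
    for word in sentence.lower().rstrip(".").split():
        lemma, pos, feats = LEXICON.get(word, (word.upper(), "unknown", {}))
        rows.append((word, lemma, pos, feats))
    return rows

for row in analyze("Science is important for our country."):
    print(row)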

Syntactic parsing

[Figure: linguistic processor pipeline (morphological analyzer → syntactic parser → semantic analyzer), with the syntactic parser highlighted.]

Syntactic representation

A sequence of syntactic trees.

[Figure: one tree per sentence. First sentence: BE heads SCIENCE and IMPORTANT, with COUNTRY (modified by WE, "of" relation) attached. Second sentence: PAY heads GOVERNMENT, ATTENTION (modified by MUCH), and IT.]

Syntactic representation

What happened?
With whom did it happen?
... their details

[Figure: the tree for the second sentence: PAY with dependents GOVERNMENT, ATTENTION (modified by MUCH), and IT.]

Semantic analysis

[Figure: linguistic processor pipeline (morphological analyzer → syntactic parser → semantic analyzer), with the semantic analyzer highlighted.]

Next lecture...

Syntax

The structure describing the relationships between words in a sentence

Describes the relationships implied by grammatical characteristics, not by meaning

Often allows for simple paraphrasing:
  John reads the book
  The book is read by John

Early approach: Dependency syntax

Tree. Nodes: words. Arcs: "modified by"
Modifies means: adds details, clarifies, chooses one of many... makes more specific
Arcs are typed. Types are: subject, object, attribute, ...

[Figure: dependency tree for "The Government pays it much attention": PAY -Subject-> GOVERNMENT, PAY -Object-> ATTENTION, PAY -Recipient-> IT, ATTENTION -Attribute-> MUCH.]

... Dependency syntax

General situation: pay
More specifically: the one where:
  who pays is the government
  what is paid is attention
  to whom it is paid is it
More specifically: attention that is much

[Figure: the same labeled dependency tree: PAY -Subject-> GOVERNMENT, PAY -Object-> ATTENTION, PAY -Recipient-> IT, ATTENTION -Attribute-> MUCH.]
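As a rough illustration (not part of the lecture), such a typed dependency tree can be stored simply as a set of (head, relation, dependent) triples:

# One possible encoding of the labeled dependency tree above (illustrative sketch).
# Each arc is (head, relation, dependent); the root (PAY) has no incoming arc.
arcs = [
    ("PAY", "subject",   "GOVERNMENT"),
    ("PAY", "object",    "ATTENTION"),
    ("PAY", "recipient", "IT"),
    ("ATTENTION", "attribute", "MUCH"),
]

def dependents(head):
    """All (relation, dependent) pairs governed by the given head word."""
    return [(rel, dep) for (h, rel, dep) in arcs if h == head]

print(dependents("PAY"))        # [('subject', 'GOVERNMENT'), ('object', 'ATTENTION'), ('recipient', 'IT')]
print(dependents("ATTENTION"))  # [('attribute', 'MUCH')]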

Advantages/disadvantages of Dependency Syntax

Advantages
  Solid linguistic base
  Rather direct translation into semantics
  Easily applicable to languages with free word order
    Korean? Russian, Latin
    This is why it has a solid linguistic base: good for classical languages!

Disadvantages
  No nice mathematical base
  No simple algorithms

Most popular approach: Constituency (Phrase Structure grammars)

Tree. Nodes: nested segments of the phrase
  Segments cannot intersect, only nest
  Usually labeled with part-of-speech names
Arcs: nesting. In the classical approach, arcs are not labeled

[[Our Government ] [pays [ much attention] [to it ] ] ]

Constituency

[[Our Government ] [pays [ much attention] [to it ] ] ]

  Our Government
  pays
  much attention
  to it

Constituency

[[OurR GovernmentN ]NP [paysV [ muchA attentionN ]NP [toP itR ]PP ]VP ]S

R: pronoun       NP: noun phrase
N: noun          VP: verb phrase
V: verb          PP: prepositional phrase
A: adjective     S: sentence

Constituency: graphical representation

[[Our Government ]NP [pays [ much attention ]NP [to it ]PP ]VP ]S

[Figure: the same structure drawn as a tree: S at the root, NP and VP below it, further NP and PP constituents inside the VP, and the pre-terminals R N V A N P R over the words Our Government pays much attention to it.]
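A minimal sketch (not from the lecture) of how the bracketing on the slide could be held in memory: each node is a nested tuple of a label and its children, and leaves are the words.

# A nested-tuple encoding of the labeled constituent tree (illustrative sketch,
# following the bracketing shown on the slide).
tree = ("S",
        ("NP", ("R", "Our"), ("N", "Government")),
        ("VP",
         ("V", "pays"),
         ("NP", ("A", "much"), ("N", "attention")),
         ("PP", ("P", "to"), ("R", "it"))))

def leaves(node):
    """Read the sentence back off the tree by collecting its leaf words."""
    if isinstance(node, str):
        return [node]
    label, *children = node
    return [word for child in children for word in leaves(child)]

print(" ".join(leaves(tree)))  # Our Government pays much attention to it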

Phrase structure grammar

Enumerates possible configurations at nodes
Usually recursive

S  → NP VP
NP → A NP
NP → R NP
PP → P NP
NP → N
VP → VP NP PP
VP → V

[Figure: the tree for Our Government pays much attention to it again, showing how the rules license each node.]
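As a side illustration (not from the lecture), the productions can be stored directly as data; a configuration is then licensed exactly when it matches the right-hand side of some rule:

# The productions above as data (illustrative sketch; PP -> P NP as in the rule list).
RULES = {
    "S":  [["NP", "VP"]],
    "NP": [["A", "NP"], ["R", "NP"], ["N"]],
    "PP": [["P", "NP"]],
    "VP": [["VP", "NP", "PP"], ["V"]],
}

def licensed(label, children):
    """True if 'label -> children' is one of the grammar's configurations."""
    return list(children) in RULES.get(label, [])

print(licensed("S", ["NP", "VP"]))         # True
print(licensed("VP", ["VP", "NP", "PP"]))  # True
print(licensed("NP", ["P", "NP"]))         # False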

Context-independency hypothesis

A configuration is possible or not, regardless of where it is used
  Wherever you find VP NP PP, it can be a VP
  Wherever you find NP VP, it can be an S
If you can put together an S that covers the whole sentence, it is a grammatically correct description
With this, given a suitable grammar, you can
  List all sentences of a language
  List only correct sentences of that language
  List all and only correct structures
Correctness means matching a native speaker's intuition

Generative idea

Find a grammar to list all and only correct sentences (with their structures) of a language

This is a complete description of that language!

How can it be useful in analysis? Reverse the grammar
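A rough sketch of the generative idea (not from the lecture): expand the start symbol with the rules above, using a hypothetical mini-lexicon for the terminal words and a depth cutoff so the recursion always terminates.

import random

# The same rules as in the previous sketch, plus an invented mini-lexicon.
RULES = {
    "S":  [["NP", "VP"]],
    "NP": [["A", "NP"], ["R", "NP"], ["N"]],
    "PP": [["P", "NP"]],
    "VP": [["VP", "NP", "PP"], ["V"]],
}
LEXICON = {
    "R": ["our", "it"], "N": ["government", "attention", "science"],
    "V": ["pays"], "A": ["much"], "P": ["to"],
}

def generate(symbol, depth=0):
    """Randomly expand a symbol into a list of words."""
    if symbol in LEXICON:
        return [random.choice(LEXICON[symbol])]
    options = RULES[symbol]
    # Past a small depth, take the shortest expansion to guarantee termination.
    rhs = random.choice(options) if depth < 4 else min(options, key=len)
    return [w for s in rhs for w in generate(s, depth + 1)]

for _ in range(3):
    print(" ".join(generate("S")))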

Parsing

Given a grammar and a sentence, find all possible structures that describe this sentence with this grammar

Many methods, not discussed today. A lot of research; very fast algorithms

Complexity: cubic in the number of words in the sentence (there are better methods, with exponent down to about 2.8)

Problem: combinatorics of variants
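A minimal sketch of one classic cubic-time method, the CKY recognizer (not presented in the lecture). The grammar below is a hypothetical Chomsky-normal-form version of the example grammar; VP1 is an auxiliary symbol introduced only to make the rules binary.

from collections import defaultdict

# Hypothetical CNF grammar for the example sentence (illustrative only).
LEX = {
    "our": {"R"}, "it": {"R", "NP"}, "government": {"N"}, "attention": {"N"},
    "pays": {"V"}, "much": {"A"}, "to": {"P"},
}
BINARY = [
    ("NP", "R", "N"), ("NP", "A", "N"), ("PP", "P", "NP"),
    ("VP1", "V", "NP"), ("VP", "VP1", "PP"), ("S", "NP", "VP"),
]

def cky_recognize(words):
    """CKY: chart[(i, j)] holds the labels that can span words[i:j]; O(n^3) overall."""
    n = len(words)
    chart = defaultdict(set)
    for i, w in enumerate(words):
        chart[(i, i + 1)] = set(LEX.get(w, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):            # split point
                for lhs, b, c in BINARY:
                    if b in chart[(i, k)] and c in chart[(k, j)]:
                        chart[(i, j)].add(lhs)
    return "S" in chart[(0, n)]

print(cky_recognize("our government pays much attention to it".split()))  # True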

Advantages and disadvantages of constituency approach

Advantages
  Nice mathematics, very well understood
  Efficient analysis algorithms, very well elaborated
  Good for languages with fixed word order
    English. Chinese?

Disadvantages
  Difficult translation into semantics
  Bad when it comes to freer word order
    Even in English! Worse in other languages

Head-driven approaches

Combine some advantages of dependency-based and constituency-based approaches
  Syntax is still fixed-order, but word dependency information is added
  Easier translation into semantics
  More linguistically based

How? In each constituent, the main word (head) is marked
It modifies the head of the larger constituent

[[Our Government ] [pays [ much attention] [to it ] ] ]
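As a rough illustration (a hypothetical encoding, not the lecture's formalism), each constituent can record which of its children is the head; following the heads down gives the head word of every constituent, and the non-head children then yield dependency-style arcs.

# Each node: (label, index_of_head_child, children); leaves are (POS, word).
# Head marking below is assumed for illustration.
tree = ("S", 1, [
    ("NP", 1, [("R", "Our"), ("N", "Government")]),
    ("VP", 0, [
        ("V", "pays"),
        ("NP", 1, [("A", "much"), ("N", "attention")]),
        ("PP", 0, [("P", "to"), ("R", "it")]),
    ]),
])

def lexical_head(node):
    """Follow head children down to the head word of a constituent."""
    if isinstance(node[1], str):               # leaf: (POS, word)
        return node[1]
    label, head_idx, children = node
    return lexical_head(children[head_idx])

def dependencies(node, arcs=None):
    """The head of each non-head child depends on the head of its constituent."""
    if arcs is None:
        arcs = []
    if isinstance(node[1], str):
        return arcs
    label, head_idx, children = node
    head = lexical_head(children[head_idx])
    for i, child in enumerate(children):
        if i != head_idx:
            arcs.append((head, lexical_head(child)))
        dependencies(child, arcs)
    return arcs

print(dependencies(tree))
# [('pays', 'Government'), ('Government', 'Our'), ('pays', 'attention'),
#  ('attention', 'much'), ('pays', 'to'), ('to', 'it')]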

Syntactic ambiguity

I see a cat with a telescope
  I see [a cat] [with a telescope]
    I use a telescope to see a cat
  I see [a cat [with a telescope]]
    I see a cat that has a telescope

Nearly any preposition causes ambiguity
Dozens, thousands, millions of variants for a sentence, because their numbers multiply:
  I see a cat with a telescope in a garden at the shore of a river
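A classic way to see how the variants multiply (a side note, not from the slides): for a verb, its object, and a chain of n prepositional phrases, the number of non-crossing attachment analyses grows as the Catalan numbers.

from math import comb

def catalan(n):
    """n-th Catalan number."""
    return comb(2 * n, n) // (n + 1)

# Number of projective attachment analyses for a chain of n PPs after "I see a cat ...":
# it equals the Catalan number C(n+1).
for n in range(1, 7):
    print(f"{n} PPs -> {catalan(n + 1)} analyses")
# 1 PP -> 2, 2 PPs -> 5, 3 PPs -> 14, 4 PPs -> 42, 5 PPs -> 132, 6 PPs -> 429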

Ambiguity resolution

Syntactic means are not enough: is telescope more related to see or to cat?
Statistical methods: is it used more often with see or with cat?
Dictionary-based methods: does it share more meaning with see or with cat?
  Path length in a dictionary of semantic relationships
Ideally, context should be analyzed and reasoning applied:
  I see a cat with a telescope. It keeps the telescope in its left paw.
  There are no good methods for this yet.
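A minimal sketch of the statistical idea (the co-occurrence counts below are invented for illustration): attach the prepositional phrase to whichever candidate head its noun co-occurs with more often in a corpus.

# Hypothetical corpus co-occurrence counts (invented numbers).
cooccurrence = {
    ("see", "telescope"): 120,
    ("cat", "telescope"): 7,
}

def attach(pp_noun, verb, noun):
    """Attach the PP to whichever head co-occurs with its noun more often."""
    with_verb = cooccurrence.get((verb, pp_noun), 0)
    with_noun = cooccurrence.get((noun, pp_noun), 0)
    return verb if with_verb >= with_noun else noun

print(attach("telescope", verb="see", noun="cat"))  # 'see' -> I use a telescope to see a cat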

Shallow parsing

Due to the HUGE problems in resolving ambiguity: do not resolve it!
Do what you can do well:
  I see [a cat] [with a telescope] [in a garden] [at the shore] [of a river]
Better than nothing
Can be done well
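A toy chunker sketch (the tag set and tagging below are assumed, for illustration): group each preposition/determiner/noun run into one flat bracket and leave the attachment between brackets undecided.

# A toy chunker over POS-tagged input (illustrative sketch).
# It groups determiner/adjective/noun runs, optionally opened by a preposition,
# into flat brackets without deciding where each bracket attaches.
def chunk(tagged):
    chunks, current = [], []
    for word, tag in tagged:
        if tag in {"P", "D", "A", "N"}:
            if tag == "P" and current:      # a preposition starts a new chunk
                chunks.append(current)
                current = []
            current.append(word)
            if tag == "N":                  # a noun closes the current chunk
                chunks.append(current)
                current = []
        else:
            if current:
                chunks.append(current)
                current = []
            chunks.append([word])
    if current:
        chunks.append(current)
    return chunks

sentence = [("I", "R"), ("see", "V"), ("a", "D"), ("cat", "N"),
            ("with", "P"), ("a", "D"), ("telescope", "N"),
            ("in", "P"), ("a", "D"), ("garden", "N")]
print(chunk(sentence))
# [['I'], ['see'], ['a', 'cat'], ['with', 'a', 'telescope'], ['in', 'a', 'garden']]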

Evaluation

PARSEVAL international contests
A practical parser usually gives only one variant
  Implies disambiguation!
Manually built corpora (treebanks)
Compare what the program did with what humans did
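A simplified sketch of PARSEVAL-style bracket scoring (the spans below are invented for illustration): compare the parser's labeled constituents against the treebank's and report precision, recall, and F1.

def parseval(gold, predicted):
    """Labeled bracket precision, recall, F1 between two sets of (label, start, end) spans."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical spans over "Our Government pays much attention to it" (word positions 0-7).
gold = [("NP", 0, 2), ("NP", 3, 5), ("PP", 5, 7), ("VP", 2, 7), ("S", 0, 7)]
pred = [("NP", 0, 2), ("NP", 3, 5), ("PP", 4, 7), ("VP", 2, 7), ("S", 0, 7)]
print(parseval(gold, pred))  # precision = 0.8, recall = 0.8, F1 ≈ 0.8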

One of the uses in IR: Lexical ambiguity resolution

Syntactic analysis helps in POS disambiguation:
  Oil is used well in Mexico.
  Oil well is used in Mexico.
  Well = ?

But it does not help in WSD:
  I deposited my money in an international bank.
  I live on a beautiful bank of the Han river.

Research topics

Faster algorithms, e.g. parallel
Handling linguistic phenomena not handled by current approaches
Ambiguity resolution!
  Statistical methods: a lot can be done

Conclusions

Syntactic structure is one of the intermediate representations of a text for its processing
Helps text understanding, thus reasoning, question answering, ...
Directly helps POS tagging
  Resolves lexical ambiguity of part of speech
  But not WSD-type ambiguities
A big science in itself, with 50 (2000?) years of history

Thank you!

Till June 8? 6 pm

Semantics
