October 2004 CSA3050 NL Algorithms 1

CSA3050: Natural Language Algorithms

Words, Strings and Regular Expressions

Finite State Automata

This lecture

• Outline
  – Words
  – The language of words
  – FSAs in Prolog

• Acknowledgement
  – Jurafsky and Martin, Speech and Language Processing, Prentice Hall 2000
  – Blackburn and Striegnitz, NLP Techniques in Prolog: http://www.coli.uni-sb.de/~kris/nlp-with-prolog/html/

What is a Word?

• A series of speech sounds that symbolizes meaning without being divisible into smaller units

• Any segment of written or printed discourse ordinarily appearing between spaces or between a space and a punctuation mark

• A set of linguistic forms produced by combining a single base with various inflectional elements without change in the part-of-speech elements

• A number of bytes processed as a unit.

Information Associated with Words

• Spelling
  – orthographic
  – phonological

• Syntax
  – POS
  – Valency

• Semantics
  – Meaning
  – Relationship to other words

Properties of Words

• Sequence
  – characters (e.g. pollution)
  – phonemes

• Delimitation
  – whitespace
  – other?

• Structure
  – simple ("atomic") words
  – complex ("molecular") words

Complex Words

• enlargement
  – en + large + ment
  – (en + large) + ment
  – en + (large + ment)

• affixation
  – prefix
  – suffix
  – infix

Sets Underlie the Formation of Complex Words

prefixes + roots + suffixes

• prefixes: dis, re, un, en
• roots: large, charge, infect, code, decide
• suffixes: ed, ing, ee, er, ly

Structure of Complex Words

• Complex words are made by concatenating elements chosen from
  – a set of prefixes
  – a set of roots
  – a set of suffixes

• The set of valid words for a given human language (e.g. English, Maltese) can be regarded as a formal language.
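
As a rough illustration of word formation by concatenation, the sketch below builds candidate words from the prefixes, roots and suffixes shown on the "Sets" slide. The predicate names prefix/1, root/1, suffix/1 and complex_word/1 are hypothetical, and atomic_list_concat/2 is assumed to be available (as in SWI-Prolog). Plain concatenation overgenerates and ignores spelling changes, so this is a sketch of the idea and not a model of English morphology.

```prolog
% Hypothetical encoding of the three sets as Prolog facts.
prefix(dis). prefix(re). prefix(un). prefix(en).
root(large). root(charge). root(infect). root(code). root(decide).
suffix(ed). suffix(ing). suffix(ee). suffix(er). suffix(ly).

% complex_word(?Word): Word is one prefix, one root and one suffix
% written one after the other, with no spelling adjustment.
complex_word(Word) :-
    prefix(P),
    root(R),
    suffix(S),
    atomic_list_concat([P, R, S], Word).
```

The query complex_word(disinfected) succeeds, but so does complex_word(unlargeee), which shows why the set of valid words needs a tighter description than free concatenation.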

The Language of Words

• What kind of formal language is the language of words?

• One which can be constructed out of
  – A characteristic set of basic symbols (alphabet)
  – A characteristic set of combining operations
    • Union (disjunction)
    • Concatenation
    • Closure (iteration)

• Regular Language; Regular Sets

Characterising Classes of Set

[Diagram: a CLASS OF SETS or LANGUAGES, linked to a NOTATION and a MACHINE]

Regular Expressions

• Notation for describing regular sets

• Used extensively in the Unix operating system (grep, sed, etc.) and also in some Microsoft products (Word)

• The Xerox finite-state tools use a somewhat different notation with a similar function.

Regular Expressions

a        a simple symbol
A B      concatenation
A | B    alternation operator
A & B    intersection operator
A*       Kleene star
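
To make the five operators concrete, here is a minimal matcher sketch in Prolog. The term encoding (sym/1, seq/2, alt/2, and/2, star/1) and the predicates matches/3 and accepts/2 are assumptions made for this illustration; they are not the notation of any particular tool. Strings are lists of symbols, as in the rest of the lecture.

```prolog
% matches(+RE, +String, -Rest): RE matches a prefix of String,
% leaving Rest unconsumed.
matches(sym(C), [C|Rest], Rest).                 % a simple symbol
matches(seq(A, B), S0, S) :-                     % A B
    matches(A, S0, S1),
    matches(B, S1, S).
matches(alt(A, _), S0, S) :- matches(A, S0, S).  % A | B, left branch
matches(alt(_, B), S0, S) :- matches(B, S0, S).  % A | B, right branch
matches(and(A, B), S0, S) :-                     % A & B: both match the same prefix
    matches(A, S0, S),
    matches(B, S0, S).
matches(star(_), S, S).                          % A*: zero repetitions
matches(star(A), S0, S) :-                       % A*: one more repetition
    matches(A, S0, S1),
    S1 \== S0,                                   % insist on progress to avoid looping
    matches(star(A), S1, S).

% accepts(+RE, +String): RE matches the whole of String.
accepts(RE, String) :- matches(RE, String, []).
```

For instance, with HA = seq(sym(h), sym(a)), the expression seq(HA, seq(star(HA), sym(!))) accepts [h,a,!] and [h,a,h,a,!] but rejects [h,a].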

Characterising Classes of Set

[Diagram: a CLASS OF SETS or LANGUAGES, linked to a NOTATION and a MACHINE]

Finite Automaton

• A finite automaton comprises
  – A finite set of states Q
  – An alphabet of symbols I
  – A start state q0 ∈ Q
  – A set of final states F ⊆ Q
  – A transition function δ(q,i) which maps a state q ∈ Q and a symbol i ∈ I to a new state q' ∈ Q

Encoding FSAs in Prolog

• Three predicates
  – initial/1
    initial(s): s is an initial state
  – final/1
    final(f): f is a final state
  – arc/3
    arc(s,t,c): there is an arc from s to t labelled c

Example 1: FSA

initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(3,2,h).

[Diagram: states 1 (initial), 2, 3 and 4 (final); arcs 1→2 on h, 2→3 on a, 3→2 on h, 3→4 on !]

Example 2: FSA with jump arc

initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(3,1,#).

[Diagram: states 1 (initial), 2, 3 and 4 (final); arcs 1→2 on h, 2→3 on a, 3→1 on # (jump arc), 3→4 on !]

Example 3: NDA

initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(2,1,a).

[Diagram: states 1 (initial), 2, 3 and 4 (final); arcs 1→2 on h, 2→3 on a, 2→1 on a, 3→4 on !]

A Recogniser

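% Base case: the input is exhausted and Node is a final state.
recognize1(Node,[]) :-
    final(Node).

% Recursive case: follow an arc whose label is the next input symbol.
recognize1(Node1,String) :-
    arc(Node1,Node2,Label),
    traverse1(Label,String,NewString),
    recognize1(Node2,NewString).

% Consume Label from the front of the input, leaving the remaining symbols.
traverse1(Label,[Label|Symbols],Symbols).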
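
The trace and generation slides call test1/1, which the slides do not define. The sketch below puts the recogniser together with the Example 1 automaton and adds a definition of test1/1 inferred from the trace (it calls initial/1 and then recognize1/2). That definition is an assumption and may differ from the one used in the lecture.

```prolog
% Example 1 automaton, as given on the slides.
initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(3,2,h).

% Recogniser, as given on the slides.
recognize1(Node,[]) :-
    final(Node).
recognize1(Node1,String) :-
    arc(Node1,Node2,Label),
    traverse1(Label,String,NewString),
    recognize1(Node2,NewString).

traverse1(Label,[Label|Symbols],Symbols).

% test1/1: ASSUMED definition, reconstructed from the trace.
% Start in an initial state and try to recognise the whole list.
test1(Symbols) :-
    initial(Node),
    recognize1(Node,Symbols).
```

With these clauses loaded, test1([h,a,!]) and test1([h,a,h,a,!]) succeed, while test1([h,a]) fails because state 3 is not final.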

Trace

Call: (7) test1([h, a, !])
Call: (8) initial(_L181)
Exit: (8) initial(1)
Call: (8) recognize1(1, [h, a, !])
Call: (9) arc(1, _L199, _L200)
Exit: (9) arc(1, 2, h)
Call: (9) traverse1(h, [h, a, !], _L201)
Exit: (9) traverse1(h, [h, a, !], [a, !])
Call: (9) recognize1(2, [a, !])
Call: (10) recognize1(3, [!])
Call: (11) recognize1(4, [])
Call: (12) final(4)
Exit: (12) final(4)
Exit: (11) recognize1(4, [])
Exit: (10) recognize1(3, [!])
Exit: (9) recognize1(2, [a, !])
Exit: (8) recognize1(1, [h, a, !])
Exit: (7) test1([h, a, !])

Generation

?- test1(X).
X = [h, a, !] ;
X = [h, a, h, a, !] ;
X = [h, a, h, a, h, a, !] ;
X = [h, a, h, a, h, a, h, a, !] ;
etc.

3 Related Frameworks

[Diagram: REGULAR EXPRESSIONS describe REGULAR LANGS/SETS; FINITE STATE NETWORKS recognise REGULAR LANGS/SETS]

Regular Operations

• Operations
  – Concatenation
  – Union
  – Closure

• Over What
  – Language
  – Expressions
  – FS Automata

Concatenation over Reg. Expression and Language

Regular Expression
E1 = [a|b]
E2 = [c|d]
E1 E2 = [a|b] [c|d]

Language
L1 = {"a", "b"}
L2 = {"c", "d"}
L1 L2 = {"ac", "ad", "bc", "bd"}
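
The language side of this slide can be checked with a few lines of Prolog. This is a minimal sketch for finite languages only, with each language written as a list of atoms; the predicate name concat_langs/3 is hypothetical.

```prolog
:- use_module(library(lists)).   % member/2

% concat_langs(+L1, +L2, -L): L contains every string X of L1
% followed by every string Y of L2, in that order.
concat_langs(L1, L2, L) :-
    findall(XY,
            ( member(X, L1),
              member(Y, L2),
              atom_concat(X, Y, XY) ),
            L).
```

The query concat_langs([a,b], [c,d], L) gives L = [ac,ad,bc,bd], matching L1 L2 above.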

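Concatenation over FS Automata

[Diagram: two automata, one with arcs labelled a and b and one with arcs labelled c and d, and their concatenation]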
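
One way to picture concatenation at the machine level is to run the first automaton until it reaches a final state and then hand the remaining input to the second. The sketch below does this at recognition time instead of building a new network. It assumes a variant of the lecture's encoding in which every fact carries an extra machine name (m1, m2), and all predicate names are hypothetical.

```prolog
% Two small automata, tagged with hypothetical names m1 and m2.
initial(m1, 1).
initial(m2, 1).
final(m1, 2).
final(m2, 2).
arc(m1, 1, 2, a).
arc(m1, 1, 2, b).
arc(m2, 1, 2, c).
arc(m2, 1, 2, d).

% path(+M, +State, ?String, ?Rest): machine M can go from State to a
% final state by consuming String up to Rest.
path(M, State, String, String) :-
    final(M, State).
path(M, State, [Label|String0], Rest) :-
    arc(M, State, Next, Label),
    path(M, Next, String0, Rest).

% accepts_concat(+M1, +M2, ?String): String is a string accepted by M1
% followed by a string accepted by M2.
accepts_concat(M1, M2, String) :-
    initial(M1, S1),
    path(M1, S1, String, Rest),
    initial(M2, S2),
    path(M2, S2, Rest, []).
```

Enumerating accepts_concat(m1, m2, X) yields [a,c], [a,d], [b,c] and [b,d], the same four strings as the language-level concatenation.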

Issues

• Handling jump arcs.

• Handling non-determinism

• Computing operations over networks.

• Maintaining multiple states in DB

• Representation.
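
As a sketch of the first issue, one possible way to handle jump arcs is to let an arc labelled # be traversed without consuming any input. The predicates recognize2/2, traverse2/3 and test2/1 are hypothetical names chosen for this illustration; the automaton is Example 2 from the slides.

```prolog
% Example 2 automaton, as given on the slides; # labels a jump arc.
initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(3,1,#).

recognize2(Node,[]) :-
    final(Node).
recognize2(Node1,String) :-
    arc(Node1,Node2,Label),
    traverse2(Label,String,NewString),
    recognize2(Node2,NewString).

% A jump arc leaves the input untouched.
traverse2('#',String,String).
% Any other arc consumes one matching symbol.
traverse2(Label,[Label|Symbols],Symbols) :-
    Label \== '#'.

% test2/1: hypothetical driver, mirroring test1/1.
test2(Symbols) :-
    initial(Node),
    recognize2(Node,Symbols).
```

Here test2([h,a,h,a,!]) succeeds by taking the jump arc from state 3 back to state 1, and test2([h,a,h,!]) fails. The sketch terminates on this automaton because every cycle consumes input; a network with a cycle made only of jump arcs would loop and needs extra bookkeeping.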