
Page 1

Sentence Processing using a Simple Recurrent Network

EE 645 Final Project

Spring 2003

Dong-Wan Kang

5/14/2003

Page 2

Contents

1. Introduction - Motivations

2. Previous & Related Works
   a) McClelland & Kawamoto (1986)
   b) Elman (1990, 1993, & 1999)
   c) Miikkulainen (1996)

3. Algorithms (Williams and Zipser, 1989)

- Real Time Recurrent Learning

4. Simulations

5. Data & Encoding schemes

6. Results

7. Discussion & Future work

Page 3

Motivations

• Can a neural network recognize the lexical classes in a sentence and learn the various types of sentences?

• From a cognitive science perspective: compare human language learning with neural network learning patterns, e.g., learning the English past tense (Rumelhart & McClelland, 1986), grammaticality judgment (Allen & Seidenberg, 1999), embedded sentences (Elman, 1993; Miikkulainen, 1996, etc.)

Page 4

Related Works

• McClelland & Kawamoto (1986)

- sentences with case-role assignments and semantic features, trained with the backpropagation algorithm

- output: 2500 case role units for each sentence

- (e.g.) input: the boy hit the wall with the ball.
  output: [ Agent Verb Patient Instrument ] + [other features]

- Limitation: imposes a hard limit on the input size.

- Alternative: instead of detecting input patterns displaced in space, detect patterns displaced in time (sequential inputs).

Page 5

Related Works (continued)

• Elman (1990, 1993, & 1999) - Simple Recurrent Network: a partially recurrent network using context units

- Network with a dynamic memory

- The context units at time t hold a copy of the hidden-unit activations from the previous time step, t-1.

- Network can recognize sequences.

input:  Many years ago boy and girl …
output: years ago boy and girl …
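The context-copy mechanism is easy to state in code. Below is a minimal NumPy sketch of one SRN time step (an illustration only, not Elman's tlearn code; the function and weight names are made up, and the sizes follow the architecture used later in this project):

    import numpy as np

    def srn_step(x, context, W_xh, W_ch, W_hy):
        # The hidden layer sees the current input plus the context units,
        # which hold a copy of the previous hidden activations.
        hidden = 1.0 / (1.0 + np.exp(-(W_xh @ x + W_ch @ context)))  # logistic units
        output = 1.0 / (1.0 + np.exp(-(W_hy @ hidden)))
        new_context = hidden.copy()    # context <- copy of hidden activations
        return output, new_context

    # illustrative sizes: 31 inputs/outputs, 150 hidden/context units
    rng = np.random.default_rng(0)
    W_xh = rng.normal(scale=0.1, size=(150, 31))
    W_ch = rng.normal(scale=0.1, size=(150, 150))
    W_hy = rng.normal(scale=0.1, size=(31, 150))

    context = np.zeros(150)
    for x in np.eye(31)[:5]:           # feed a few one-hot "words" in sequence
        y, context = srn_step(x, context, W_xh, W_ch, W_hy)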

Page 6

Related Works (continued)

• Miikkulainen (1996) - SPEC architecture (Subsymbolic Parser for Embedded Clauses), a recurrent network

- Parser, Segmenter, and Stack: process center- and tail-embedded sentences; 98,100 sentences over 49 different sentence structures, using case-role assignments

- (e.g.) sequential input: …, the girl, who, liked, the dog, saw, the boy, …
  output:    …, [the girl, saw, the boy] [the girl, liked, the dog]
  case role:     (agent, act, patient)   (agent, act, patient)

Page 7

Algorithms

• Recurrent Networks

- Unlike feedforward networks, they allow connections both ways between a pair of units and even from a unit to itself.

- Backpropagation through time (BPTT) – unfolds the temporal operation of the network into a layered feedforward network, one layer per time step (Rumelhart et al., 1986).

- Real Time Recurrent Learning (RTRL) – two versions (Williams and Zipser, 1989):
  1) update the weights after the processing of a sequence is completed;
  2) on-line: update the weights while the sequence is being presented.

- Simple Recurrent Network (SRN) – a partially recurrent network in time and space, with context units that store the outputs of the hidden units (Elman, 1990). (It can be derived as a modification of the RTRL algorithm.)

Page 8

Real Time Recurrent Learning

• Williams and Zipser (1989)

- This algorithm computes the derivatives of states and outputs with respect to all weights as the network processes the sequence.

• Summary of Algorithm:

In a recurrent network, any unit may be connected to any other. Let $V_i$ denote the activation of unit $i$ and $\xi_i(t)$ the input at node $i$ at time $t$; the dynamic update rule is:

$$V_i(t) = g(h_i(t-1)) = g\Big(\sum_j w_{ij} V_j(t-1) + \xi_i(t-1)\Big)$$
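This update rule maps directly onto code. A minimal sketch, assuming a logistic $g$ and a single weight matrix W over all units (names illustrative):

    import numpy as np

    def g(h):
        return 1.0 / (1.0 + np.exp(-h))   # logistic activation

    def step(V_prev, xi_prev, W):
        # V_i(t) = g( sum_j w_ij V_j(t-1) + xi_i(t-1) )
        return g(W @ V_prev + xi_prev)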

Page 9

RTRL (continued)

• Error measure, with target outputs $\zeta_k(t)$ defined for some $k$'s and $t$'s:

$$E_k(t) = \begin{cases} \zeta_k(t) - V_k(t) & \text{if } \zeta_k(t) \text{ is defined at time } t \\ 0 & \text{otherwise} \end{cases}$$

• Total cost function: $E = \sum_{t=0}^{T} E(t)$, $t = 0, 1, \ldots, T$, where $E(t) = \tfrac{1}{2} \sum_k [E_k(t)]^2$

Page 10

RTRL (continued)

• The gradient of E separates in time; to do gradient descent, we define:

$$\Delta w_{pq}(t) = -\eta \frac{\partial E(t)}{\partial w_{pq}} = \eta \sum_k E_k(t) \frac{\partial V_k(t)}{\partial w_{pq}}$$

• Differentiating the update rule gives:

$$\frac{\partial V_i(t)}{\partial w_{pq}} = g'(h_i(t-1)) \Big[ \delta_{ip} V_q(t-1) + \sum_j w_{ij} \frac{\partial V_j(t-1)}{\partial w_{pq}} \Big]$$

with initial condition at $t = 0$: $\dfrac{\partial V_i(0)}{\partial w_{pq}} = 0$.
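Putting the error measure, the gradient, and the derivative recursion together, the following is a minimal NumPy sketch of the on-line RTRL variant (an illustration of the equations above, not the tlearn implementation; it assumes logistic units and the additive external input of the update rule, and all names are made up):

    import numpy as np

    def g(h):
        return 1.0 / (1.0 + np.exp(-h))

    def rtrl_online(W, xs, zetas, eta=0.1):
        # W     : (N, N) recurrent weight matrix
        # xs    : external inputs xi(t), each of shape (N,)
        # zetas : targets zeta(t), np.nan where no target is defined
        # P[k, p, q] carries dV_k/dw_pq forward through time
        N = W.shape[0]
        V = np.zeros(N)                  # V(0)
        P = np.zeros((N, N, N))          # initial condition dV_i(0)/dw_pq = 0
        for xi, zeta in zip(xs, zetas):
            h = W @ V + xi               # h(t-1)
            V_new = g(h)                 # V(t) = g(h(t-1))
            gp = V_new * (1.0 - V_new)   # g'(h) for the logistic function
            # dV_i(t)/dw_pq = g'(h_i) [ delta_ip V_q(t-1) + sum_j w_ij dV_j(t-1)/dw_pq ]
            delta_term = np.zeros((N, N, N))
            delta_term[np.arange(N), np.arange(N)] = V   # delta_ip * V_q(t-1)
            P = gp[:, None, None] * (delta_term + np.tensordot(W, P, axes=1))
            V = V_new
            E = np.where(np.isnan(zeta), 0.0, zeta - V)  # E_k(t)
            # on-line update: changing W mid-sequence makes P slightly
            # approximate, which is the accepted trade-off of this variant
            W += eta * np.einsum('k,kpq->pq', E, P)
        return W

Note the (N, N, N) sensitivity array P: this cubic storage (and quartic per-step time) is what makes RTRL computationally expensive, as the conclusions later point out.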

Page 11

RTRL (continued)

• Depending on when the weights are updated, there are two versions of RTRL:

1) update the weights after the sequence is completed (at t = T);

2) update the weights after each time step (on-line).

• Elman’s “tlearn” simulator for the Simple Recurrent Network (the tool used in this project) is implemented from the classical backpropagation algorithm together with a modification of this RTRL algorithm.

Page 12

Simulation

• Following Elman’s data and Simple Recurrent Network (1990, 1993, & 1999), simple sentences and embedded sentences are simulated using the “tlearn” neural network program (BP + a modified version of the RTRL algorithm), available at http://crl.ucsd.edu/innate/index.shtml.

• Questions:

1. Can the network discover the lexical classes from word order?

2. Can the network recognize the relative pronouns and predict them?

Page 13

Network Architecture

• 31 input nodes
• 31 output nodes
• 150 hidden nodes
• 150 context nodes

* black arrow: distributed and learnable connections
* dotted blue arrow: linear, one-to-one connections with the hidden nodes

Page 14

Training Data

• Lexicon (31 words)

NOUN-HUM      man woman boy girl
NOUN-ANIM     cat mouse dog lion
NOUN-INANIM   book rock car
NOUN-AGRESS   dragon monster
NOUN-FRAG     glass plate
NOUN-FOOD     cookie bread sandwich

VERB-INTRAN   think sleep exist
VERB-TRAN     see chase like
VERB-AGPAT    move break
VERB-PERCEPT  smell see
VERB-DESTROY  break smash
VERB-EAT      eat
-----------------------------
RELAT-HUM     who
RELAT-INHUM   which

• Grammar (16 templates)

NOUN-HUM VERB-EAT NOUN-FOOD
NOUN-HUM VERB-PERCEPT NOUN-INANIM
NOUN-HUM VERB-DESTROY NOUN-FRAG
NOUN-HUM VERB-INTRAN
NOUN-HUM VERB-TRAN NOUN-HUM
NOUN-HUM VERB-AGPAT NOUN-INANIM
NOUN-HUM VERB-AGPAT
NOUN-ANIM VERB-EAT NOUN-FOOD
NOUN-ANIM VERB-TRAN NOUN-ANIM
NOUN-ANIM VERB-AGPAT NOUN-INANIM
NOUN-ANIM VERB-AGPAT
NOUN-INANIM VERB-AGPAT
NOUN-AGRESS VERB-DESTROY NOUN-FRAG
NOUN-AGRESS VERB-EAT NOUN-HUM
NOUN-AGRESS VERB-EAT NOUN-ANIM
NOUN-AGRESS VERB-EAT NOUN-FOOD
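To make the data generation concrete, here is a small Python sketch that samples sentences from templates like these (only a subset of the lexicon and templates above is included; the generator itself is my illustration, not Elman's original script):

    import random

    # subset of the 31-word lexicon, keyed by lexical class (full tables above)
    LEXICON = {
        "NOUN-HUM":     ["man", "woman", "boy", "girl"],
        "NOUN-FOOD":    ["cookie", "bread", "sandwich"],
        "NOUN-FRAG":    ["glass", "plate"],
        "VERB-EAT":     ["eat"],
        "VERB-DESTROY": ["break", "smash"],
        "VERB-INTRAN":  ["think", "sleep", "exist"],
    }

    # subset of the 16 grammar templates
    TEMPLATES = [
        ("NOUN-HUM", "VERB-EAT", "NOUN-FOOD"),
        ("NOUN-HUM", "VERB-DESTROY", "NOUN-FRAG"),
        ("NOUN-HUM", "VERB-INTRAN"),
    ]

    def generate_sentence():
        # pick a template, then fill each slot with a random word of that class
        template = random.choice(TEMPLATES)
        return [random.choice(LEXICON[cat]) for cat in template]

    print(generate_sentence())   # e.g. ['woman', 'break', 'plate']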

Page 15

Sample Sentences & Mapping

• Simple sentences – 2 types
  - man think (2 words)
  - girl see dog (3 words)
  - man break glass (3 words)

• Embedded sentences – 3 types (*RP – relative pronoun)

1. monster eat man who sleep (RP-sub, VERB-INTRAN)
2. dog see man who eat sandwich (RP-sub, VERB-TRAN)
3. woman eat cookie which cat chase (RP-obj, VERB-TRAN)

• Input-output mapping (predict the next input – sequential input):

INPUT:  girl see dog man break glass cat …
OUTPUT: see dog man break glass cat …

Page 16

Encoding scheme

• Random word representation

- a 31-bit vector for each lexical item; each item is represented by a different, randomly assigned bit

- not a semantic-feature encoding

sleep 0000000000000000000000000000001
dog   0000100000000000000000000000000
woman 0000000000000000000000000010000
…
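A minimal sketch of this one-bit-per-word encoding, together with the shifted input/output pairing from the previous slide (for simplicity the bit positions here follow list order rather than the random assignment used in the experiment):

    import numpy as np

    VOCAB = ["sleep", "dog", "woman"]   # in the real data: all 31 lexical items

    def one_hot(word, vocab=VOCAB):
        # one bit per lexical item; vector length = vocabulary size (31 bits here)
        v = np.zeros(len(vocab))
        v[vocab.index(word)] = 1.0
        return v

    # next-word prediction: the target stream is the input stream shifted by one
    stream  = ["woman", "sleep", "dog", "sleep"]
    inputs  = [one_hot(w) for w in stream[:-1]]
    targets = [one_hot(w) for w in stream[1:]]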

Page 17

Training a network

• Incremental input (Elman, 1993): the “starting small” strategy

• Phase I: simple sentences (Elman, 1990, used 10,000 sentences)

- 1,564 sentences generated (4,636 31-bit vectors)
- train on all patterns: learning rate = 0.1, 23 epochs

• Phase II: embedded sentences (Elman, 1993, 7,500 sentences)

- 5,976 sentences generated (35,688 31-bit vectors)
- loaded with the weights from Phase I
- train on all (1,564 + 5,976) sentences together: learning rate = 0.1, 4 epochs

Page 18

Performance

• Network performance was measured by the root mean squared (RMS) error over the $P$ input patterns, where $t_k$ is the target output vector and $y_k$ the actual output vector:

$$\text{RMS error} = \sqrt{\frac{1}{P} \sum_k (t_k - y_k)^2}$$

• Phase I: after 23 epochs, RMS ≈ 0.91
• Phase II: after 4 epochs, RMS ≈ 0.84

• Why can the RMS not be lowered further? The prediction task is nondeterministic, so the network cannot produce a unique output for a given input. For this simulation, RMS is NOT the best measure of performance.

• Elman’s simulations: RMS = 0.88 (1990), mean cosine = 0.852 (1993)
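For reference, a small sketch of this measure in NumPy (assuming, as in the reconstruction above, that squared errors are averaged over the P patterns before taking the square root):

    import numpy as np

    def rms_error(targets, outputs):
        # sqrt( (1/P) * sum over patterns of ||t_k - y_k||^2 )
        t = np.asarray(targets)   # shape (P, 31): target output vectors
        y = np.asarray(outputs)   # shape (P, 31): actual output vectors
        return np.sqrt(np.mean(np.sum((t - y) ** 2, axis=1)))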

Page 19

Phase I: RMS ≈ 0.91 after 23 epochs

Page 20

Phase II: RMS ≈ 0.84 after 4 epochs

Page 21

Results and Analysis

<Test results after Phase II>

Output   Target
…
which    (target: which)
?        (target: lion)
?        (target: see)
?        (target: boy)
?        (target: move)
?        (target: sandwich)
which    (target: which)
?        (target: cat)
?        (target: see)
…
?        (target: book)
which    (target: which)
?        (target: man)
see      (target: see)
…
?        (target: dog)
?        (target: chase)
?        (target: man)
?        (target: who)
?        (target: smash)
?        (target: glass)
...

• An arrow (→) indicates the start of a sentence.

• In all positions, the word “which” is predicted correctly! But most words are not predicted, including “who”. Why? → Look at the training data.

• Since the prediction task is nondeterministic, predicting the exact next word cannot be the best performance measure.

• We need to look at the hidden-unit activations for each input, since they reflect what the network has learned about classes of inputs with regard to what they predict. → Cluster analysis, PCA (see the sketch below)
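One way to produce such a cluster analysis with standard tools is sketched below (SciPy here, rather than tlearn's own cluster utility; the activation vectors are random stand-ins for the real per-word averages of hidden-unit activations):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram

    words = ["man", "woman", "dog", "cat", "eat", "see", "who", "which"]
    rng = np.random.default_rng(0)
    hidden = rng.random((len(words), 150))   # stand-in per-word hidden activations

    Z = linkage(hidden, method="average")    # hierarchical cluster tree
    dendrogram(Z, labels=words, no_plot=True)  # render with matplotlib to inspect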

Page 22

• Cluster Analysis

• The network successfully recognizes VERB, NOUN, and some of their subcategories.

• WHO and WHICH sit at different distances in the tree.

• VERB-INTRAN fails to fall within the VERB cluster.

<Hierarchical cluster diagram of hidden unit activation vectors>

Page 23

Page 24

Discussion & Conclusion

1. The network can discover the lexical classes from word order. NOUN and VERB form distinct classes, except for VERB-INTRAN. The subclasses of NOUN are classified correctly, but some subclasses of VERB are mixed. This reflects the input examples.

2. The network can recognize and predict the relative pronoun “which”, but not “who”. Why? Because the “who” sentences are never of the RP-obj type, “who” is treated as just another subject of a simple sentence.

3. The organization of the input data is important, and the recurrent network is sensitive to it, since the network processes the input sequentially and on-line.

4. In general, recurrent networks using RTRL recognized the sequential input, but RTRL requires more training time and computational resources.

Page 25

Future Studies

• Recurrent Least Squares Support Vector Machines (Suykens & Vandewalle, 2000)

- provide new perspectives for time-series prediction and nonlinear modeling

- seem more efficient than BPTT, RTRL & SRN

Page 26

References

Allen, J., & Seidenberg, M. (1999). The emergence of grammaticality in connectionist networks. In B. MacWhinney (Ed.), The emergence of language (pp. 115-151). Hillsdale, NJ: Lawrence Erlbaum.

Elman, J.L. (1990). Finding structure in time. Cognitive Science, 14, 179-211.

Elman, J.L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48, 71-99.

Elman, J.L. (1999). The emergence of language: A conspiracy theory. In B. MacWhinney (Ed.), The emergence of language. Hillsdale, NJ: Lawrence Erlbaum.

Hertz, J., Krogh, A., & Palmer, R.G. (1991). Introduction to the Theory of Neural Computation. Redwood City, CA: Addison-Wesley.

McClelland, J.L., & Kawamoto, A.H. (1986). Mechanisms of sentence processing: Assigning roles to constituents of sentences. In J.L. McClelland & D.E. Rumelhart (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (pp. 273-325). Cambridge, MA: MIT Press.

Miikkulainen, R. (1996). Subsymbolic case-role analysis of sentences with embedded clauses. Cognitive Science, 20, 47-73.

Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning internal representations by error propagation. In D.E. Rumelhart & J.L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 318-362). Cambridge, MA: MIT Press.

Rumelhart, D.E., & McClelland, J.L. (1986). On learning the past tense of English verbs. In J.L. McClelland & D.E. Rumelhart (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Cambridge, MA: MIT Press.

Williams, R.J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270-280.