Natural language processing: syntactic and semantic tagging IFT 725 - Réseaux neuronaux

Source: info.usherbrooke.ca/hlarochelle/ift725/nlp-tagging.pdf


Page 1

Natural language processing: syntactic and semantic tagging
IFT 725 - Réseaux neuronaux

Page 2

WORD TAGGING
Topics: word tagging
• In many NLP applications, it is useful to augment text data with syntactic and semantic information
  ‣ we would like to add syntactic/semantic labels to each word
• This problem can be tackled using a conditional random field with neural network unary potentials
  ‣ we will describe the model developed by Ronan Collobert and Jason Weston in:
    A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning, Collobert and Weston, 2008
    (see Natural Language Processing (Almost) from Scratch for the journal version)

Page 3

WORD TAGGING
Topics: part-of-speech tagging
• Tag each word with its part-of-speech category
  ‣ noun, verb, adverb, etc.
  ‣ might want to distinguish between singular/plural, present tense/past tense, etc.
  ‣ see the Penn Treebank POS tag set for an example
• Example:

POS (Part-Of-Speech)
  ‣ Tag the words of a sentence according to their syntactic roles.
  ‣ Between 50 and 150 classes for English. Ex.: noun, adverb.

  The/DT  little/JJ  yellow/JJ  dog/NN  barked/VBD  at/IN  the/DT  cat/NN

(from Stanislas Lauly)

Page 4

WORD TAGGING
Topics: chunking
• Segment sentences into syntactic phrases
  ‣ noun phrase, verb phrase, etc.
• Segments are identified with the IOBES encoding
  ‣ single-word phrase (S- prefix). Ex.: S-NP
  ‣ multiword phrase (B-, I-, E- prefixes). Ex.: B-VP I-VP I-VP E-VP
  ‣ words outside of syntactic phrases: O

Chunking
  ‣ Tag segments of a sentence with their syntactic labels.
  ‣ Noun phrase (NP), verb phrase (VP).
  ‣ IOBES encoding: Ex.: S-NP for a single word, B-NP/I-NP/E-NP for several words.

  He/S-NP  reckons/S-VP  the/B-NP  current/I-NP  account/I-NP  deficit/E-NP

(from Stanislas Lauly)
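To make the IOBES encoding concrete, here is a minimal Python sketch (not from the slides; the span format and function name are illustrative) that converts labeled phrase spans into per-word IOBES tags:

```python
def iobes_tags(n_words, spans):
    """Convert labeled spans [(start, end, label), ...] (end exclusive)
    into one IOBES tag per word; words outside any span get 'O'."""
    tags = ["O"] * n_words
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = "S-" + label                 # single-word phrase
        else:
            tags[start] = "B-" + label                 # beginning of a multiword phrase
            for i in range(start + 1, end - 1):
                tags[i] = "I-" + label                 # inside
            tags[end - 1] = "E-" + label               # end
    return tags

# The example above: "He reckons the current account deficit"
words = ["He", "reckons", "the", "current", "account", "deficit"]
spans = [(0, 1, "NP"), (1, 2, "VP"), (2, 6, "NP")]
print(list(zip(words, iobes_tags(len(words), spans))))
# [('He', 'S-NP'), ('reckons', 'S-VP'), ('the', 'B-NP'),
#  ('current', 'I-NP'), ('account', 'I-NP'), ('deficit', 'E-NP')]
```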


Page 8

WORD TAGGING
Topics: named entity recognition (NER)
• Identify phrases referring to a named entity
  ‣ person
  ‣ location
  ‣ organization
• Example:

NER (Named-Entity Recognition)
  ‣ Identify the named entities in a sentence.
  ‣ Ex.: location, person, place, etc.

  U.N./S-ORG  official/O  Ekeus/S-PER  heads/O  for/O  Baghdad/S-LOC

(from Stanislas Lauly)

Page 9

WORD TAGGING
Topics: semantic role labeling (SRL)
• For each verb, identify the role of the other words with respect to that verb
• Example of labels:
  ‣ V: verb
  ‣ A0: acceptor
  ‣ A1: thing accepted
  ‣ A2: accepted from
  ‣ A3: attribute
  ‣ AM-MOD: modal
  ‣ AM-NEG: negation

SRL (Semantic Role Labeling)
  ‣ Semantic tagging is important for answering questions such as "who", "when", "where", etc.

  He/S-A0  would/S-AM-MOD  n't/S-AM-NEG  accept/V  anything/B-A1  of/I-A1  value/E-A1

(from Stanislas Lauly)

Page 10

WORD TAGGING
Topics: labeled corpus
• The raw data looks like this (columns: word, POS tag, chunk tag, NER tag, and one SRL tag column per verb):

  The         DT    B-NP  O      B-A0  B-A0
  $           $     I-NP  O      I-A0  I-A0
  1.4         CD    I-NP  O      I-A0  I-A0
  billion     CD    I-NP  O      I-A0  I-A0
  robot       NN    I-NP  O      I-A0  I-A0
  spacecraft  NN    E-NP  O      E-A0  E-A0
  faces       VBZ   S-VP  O      S-V   O
  a           DT    B-NP  O      B-A1  O
  six-year    JJ    I-NP  O      I-A1  O
  journey     NN    E-NP  O      I-A1  O
  to          TO    B-VP  O      I-A1  O
  explore     VB    E-VP  O      I-A1  S-V
  Jupiter     NNP   S-NP  S-ORG  I-A1  B-A1
  and         CC    O     O      I-A1  I-A1
  its         PRP$  B-NP  O      I-A1  I-A1
  16          CD    I-NP  O      I-A1  I-A1
  known       JJ    I-NP  O      I-A1  I-A1
  moons       NNS   E-NP  O      E-A1  E-A1
  .           .     O     O      O     O
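A minimal sketch of reading such column-formatted data; the column names and the blank-line sentence separator are assumptions about the format, not taken from the slides:

```python
def read_conll(lines, columns=("word", "pos", "chunk", "ner", "srl_v1", "srl_v2")):
    """Parse whitespace-separated rows (one word per line) into sentences,
    each sentence being a list of {column: value} dicts.
    Sentences are assumed to be separated by blank lines."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:                       # blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(dict(zip(columns, line.split())))
    if current:
        sentences.append(current)
    return sentences

rows = ["The DT B-NP O B-A0 B-A0",
        "faces VBZ S-VP O S-V O",
        "Jupiter NNP S-NP S-ORG I-A1 B-A1",
        ""]
print(read_conll(rows)[0][1]["srl_v1"])    # 'S-V'
```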

Page 11

SENTENCE NEURAL NETWORK
Topics: sentence convolutional network
• How to model each label sequence?
  ‣ could use a CRF with neural network unary potentials, based on a window (context) of words
    - not appropriate for semantic role labeling, because the relevant context might be very far away
  ‣ Collobert and Weston suggest a convolutional network over the whole sentence
    - the prediction at a given position can exploit information from any word in the sentence

[Figure 2 (Collobert et al.): Sentence approach network. Input sentence (''The cat sat on the mat'', with features 1..K per word and padding at both ends) → lookup tables LT_{W^1}, ..., LT_{W^K} → convolution (M^1) → max over time → linear (M^2) → HardTanh → linear (M^3), with output size = #tags.]

In the following, we will describe each layer we use in our networks shown in Figure 1 and Figure 2. We adopt a few notations. Given a matrix $A$, we denote $[A]_{i,j}$ the coefficient at row $i$ and column $j$ of the matrix. We also denote $\langle A \rangle_i^{d_{win}}$ the vector obtained by concatenating the $d_{win}$ column vectors around the $i$-th column vector of the matrix $A \in \mathbb{R}^{d_1 \times d_2}$:

$$\left( \langle A \rangle_i^{d_{win}} \right)^{\!T} = \left( [A]_{1,\, i - d_{win}/2} \;\dots\; [A]_{d_1,\, i - d_{win}/2} \,,\; \dots \,,\; [A]_{1,\, i + d_{win}/2} \;\dots\; [A]_{d_1,\, i + d_{win}/2} \right)$$
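As a concrete illustration of the $\langle A \rangle_i^{d_{win}}$ notation, here is a minimal NumPy sketch (not from the paper) that stacks the $d_{win}$ columns centered at position $i$, padding at the sentence boundaries:

```python
import numpy as np

def window_concat(A, i, d_win, pad_col=None):
    """Concatenate the d_win columns of A (shape d1 x d2) centered at column i.
    Out-of-range columns are replaced by pad_col (zeros by default),
    mimicking the padding inputs of Figure 2. d_win is assumed odd."""
    d1, d2 = A.shape
    if pad_col is None:
        pad_col = np.zeros(d1)
    half = d_win // 2
    cols = [A[:, j] if 0 <= j < d2 else pad_col for j in range(i - half, i + half + 1)]
    return np.concatenate(cols)            # vector of length d1 * d_win

A = np.arange(12, dtype=float).reshape(3, 4)   # d1 = 3 features, d2 = 4 positions
print(window_concat(A, 0, 3))                  # left neighbor is padding (zeros)
```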

Page 12

SENTENCE NEURAL NETWORK
Topics: sentence convolutional network
• Each word can be represented by more than one feature
  ‣ feature of the word itself
  ‣ substring features
    - prefix: ''eating'' → ''eat''
    - suffix: ''eating'' → ''ing''
  ‣ gazetteer features
    - whether the word belongs to a list of known locations, persons, etc.
• These features are treated like word features, with their own lookup tables


Page 13

SENTENCE NEURAL NETWORK
Topics: sentence convolutional network
• Features must encode for which word we are making a prediction
  ‣ done by adding the relative position i - pos_w, where pos_w is the position of the current word
  ‣ this feature also has its own lookup table
• For SRL, we must know for which verb we are predicting the roles
  ‣ also add the relative position of that verb, i - pos_v


Page 14

SENTENCE NEURAL NETWORK
Topics: sentence convolutional network
• Lookup table:
  ‣ for each word, concatenate the representations of its features
• Convolution:
  ‣ at every position, compute linear activations from a window of representations
  ‣ this is a 1D convolution
• Max pooling:
  ‣ obtain a fixed-size hidden layer with a max across positions
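Putting the pieces together, here is a minimal NumPy sketch of a forward pass through the sentence approach network (lookup table → 1D convolution → max over time → linear → HardTanh → linear). Sizes, random initialization, and the single word lookup table are simplifications; the relative-position and other feature lookup tables are omitted, and one forward pass scores the tag of the word of interest only:

```python
import numpy as np
rng = np.random.default_rng(0)

V, d_emb, d_win = 10_000, 50, 5          # vocabulary size, embedding size, convolution window
n1, n2, n_tags = 100, 100, 20            # hidden layer sizes and number of output tags

W_lookup = rng.normal(scale=0.1, size=(V, d_emb))      # word lookup table
M1 = rng.normal(scale=0.1, size=(n1, d_emb * d_win))   # convolution weights
M2 = rng.normal(scale=0.1, size=(n2, n1))              # linear layer
M3 = rng.normal(scale=0.1, size=(n_tags, n2))          # output layer (tag scores)

def hard_tanh(x):
    return np.clip(x, -1.0, 1.0)

def sentence_network(word_ids):
    X = W_lookup[word_ids].T                           # (d_emb, n_words)
    pad = np.zeros((d_emb, d_win // 2))
    Xp = np.hstack([pad, X, pad])                      # padding at both ends
    # 1D convolution: one linear activation per window of concatenated embeddings
    conv = np.stack([M1 @ Xp[:, i:i + d_win].reshape(-1)
                     for i in range(len(word_ids))], axis=1)    # (n1, n_words)
    h = conv.max(axis=1)                               # max over time -> fixed-size vector
    return M3 @ hard_tanh(M2 @ h)                      # scores over the n_tags labels

scores = sentence_network(rng.integers(0, V, size=7))  # a random 7-word "sentence"
print(scores.shape)                                    # (20,)
```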


Page 15

SENTENCE NEURAL NETWORK
Topics: sentence convolutional network
• Regular neural network:
  ‣ the pooled representation serves as the input of a regular neural network
  ‣ they proposed using a ''hard'' version of the tanh activation function
• The outputs are used as the unary potentials of a chain CRF over the labels
  ‣ no connections between the CRFs of the different tasks (one CRF per task)
  ‣ a separate neural network is used for each task
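To make the CRF decoding step concrete, here is a minimal sketch (not the authors' code) of Viterbi decoding from per-word unary scores and a transition score matrix:

```python
import numpy as np

def viterbi(unary, transition):
    """unary: (n_words, n_tags) network scores for each word;
    transition: (n_tags, n_tags) score of moving from tag j to tag k.
    Returns the highest-scoring tag sequence of the chain CRF."""
    n_words, n_tags = unary.shape
    score = unary[0].copy()                        # best score ending in each tag
    back = np.zeros((n_words, n_tags), dtype=int)  # backpointers
    for t in range(1, n_words):
        cand = score[:, None] + transition + unary[t][None, :]   # (from_tag, to_tag)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    tags = [int(score.argmax())]
    for t in range(n_words - 1, 0, -1):            # follow the backpointers
        tags.append(int(back[t, tags[-1]]))
    return tags[::-1]

unary = np.random.default_rng(1).normal(size=(6, 4))   # 6 words, 4 possible tags
transition = np.zeros((4, 4))                          # transition scores (learned in practice)
print(viterbi(unary, transition))                      # a list of 6 tag indices
```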


Page 16

SENTENCE NEURAL NETWORK
Topics: multitask learning
• Could share the vector representations of the features across tasks
  ‣ simply use the same lookup tables across tasks
  ‣ the other parameters of the neural networks are not tied
• This is referred to as multitask learning
  ‣ the idea is to transfer knowledge learned within the word representations across the different tasks

[Figure 5 (Collobert et al.): Example of multitasking with NN. Task 1 and Task 2 are two tasks trained with the window approach architecture presented in Figure 1. Lookup tables as well as the first hidden layer are shared. The last layer is task specific. The principle is the same with more than two tasks.]

From the paper: ''It is worth mentioning that MTL can produce a single unified network that performs well for all these tasks using the sentence approach. However this unified network only leads to marginal improvements over using a separate network for each task: the most important MTL task appears to be the unsupervised learning of the word embeddings.''
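A minimal sketch of the parameter sharing idea in the window approach: the lookup table (and here the first hidden layer, as in Figure 5) is shared, while each task keeps its own output layer. Sizes and tag counts are illustrative only:

```python
import numpy as np
rng = np.random.default_rng(0)

V, d_emb, d_win, n_hidden = 10_000, 50, 5, 100

# Shared parameters
W_lookup = rng.normal(scale=0.1, size=(V, d_emb))            # shared word lookup table
M1 = rng.normal(scale=0.1, size=(n_hidden, d_emb * d_win))   # shared first hidden layer

# Task-specific output layers (tag counts are illustrative)
heads = {"pos": rng.normal(scale=0.1, size=(45, n_hidden)),
         "chunk": rng.normal(scale=0.1, size=(23, n_hidden))}

def window_scores(window_ids, task):
    """Window-approach forward pass: shared lookup + shared hidden layer,
    followed by the output layer of the requested task."""
    x = W_lookup[window_ids].reshape(-1)        # concatenated window embeddings
    h = np.clip(M1 @ x, -1.0, 1.0)              # shared HardTanh hidden layer
    return heads[task] @ h                      # task-specific tag scores

ids = rng.integers(0, V, size=d_win)            # a window of 5 word ids
print(window_scores(ids, "pos").shape, window_scores(ids, "chunk").shape)   # (45,) (23,)
```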

Page 17

SENTENCE NEURAL NETWORK
Topics: language model
• We can design other tasks without any labeled data
  ‣ identify whether the middle word of a window of text is an ''impostor''
  ‣ can generate impostor examples from unlabeled text (Wikipedia)
    - pick a window of words from the unlabeled corpus
    - replace the middle word with a different, randomly chosen word
  ‣ train a neural network (with word representations) to assign a higher score to the original window
  ‣ similar to language modeling, except we predict the word in the middle

  ''cat sat on the mat'' (original window)  vs  ''cat sat think the mat'' (impostor window with word w)

From the paper, the ranking criterion: ''It is therefore desirable to define alternative training criteria. We propose here to use a pairwise ranking approach (Cohen et al., 1998). We seek a network that computes a higher score when given a legal phrase than when given an incorrect phrase. Because we do not want to emphasize the most common phrases over the rare but legal phrases, we use a simple pairwise criterion.

We consider a window approach network, as described in Section 3.3.1 and Figure 1, with parameters $\theta$ which outputs a score $f_\theta(x)$ given a window of text $x = [w]_1^{d_{win}}$. We minimize the ranking criterion with respect to $\theta$:

$$\theta \mapsto \sum_{x \in \mathcal{X}} \;\sum_{w \in \mathcal{D}} \max\{\, 0,\; 1 - f_\theta(x) + f_\theta(x^{(w)}) \,\}, \qquad (17)$$

where $\mathcal{X}$ is the set of all possible text windows with $d_{win}$ words coming from our training corpus, $\mathcal{D}$ is the dictionary of words, and $x^{(w)}$ denotes the text window obtained by replacing the central word of the text window $[w]_1^{d_{win}}$ by the word $w$.''

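A minimal sketch of criterion (17) for a single window, with a one-hidden-layer scorer; the architecture details, sizes, and the absence of a training loop are all simplifications:

```python
import numpy as np
rng = np.random.default_rng(0)

V, d_emb, d_win, n_hidden = 10_000, 50, 5, 100
W_lookup = rng.normal(scale=0.1, size=(V, d_emb))
M1 = rng.normal(scale=0.1, size=(n_hidden, d_emb * d_win))
m2 = rng.normal(scale=0.1, size=n_hidden)

def score(window_ids):
    """f_theta(x): scalar score of a window of d_win word ids."""
    x = W_lookup[window_ids].reshape(-1)
    return m2 @ np.clip(M1 @ x, -1.0, 1.0)

def ranking_loss(window_ids, impostor_word):
    """Pairwise hinge loss of criterion (17): the original window should score
    at least 1 higher than the window with its middle word replaced."""
    corrupted = np.array(window_ids)
    corrupted[d_win // 2] = impostor_word       # replace the middle word
    return max(0.0, 1.0 - score(window_ids) + score(corrupted))

window = rng.integers(0, V, size=d_win)         # e.g. "cat sat on the mat"
print(ranking_loss(window, impostor_word=int(rng.integers(0, V))))   # e.g. "think" as impostor
```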

Page 18

SENTENCE NEURAL NETWORK
Topics: experimental comparison
• From Natural Language Processing (Almost) from Scratch by Collobert et al.

Approach                POS (PWA)  CHUNK (F1)  NER (F1)  SRL (F1)
Benchmark Systems       97.24      94.29       89.31     77.92
Window Approach
  NN+SLL+LM2            97.20      93.63       88.67     –
  NN+SLL+LM2+MTL        97.22      94.10       88.62     –
Sentence Approach
  NN+SLL+LM2            97.12      93.37       88.78     74.15
  NN+SLL+LM2+MTL        97.22      93.75       88.27     74.29

Table 9: Effect of multi-tasking on our neural architectures. We trained POS, CHUNK and NER in a MTL way, both for the window and sentence network approaches. SRL was only included in the sentence approach joint training. As a baseline, we show previous results of our window approach system, as well as additional results for our sentence approach system, when trained separately on each task. Benchmark system performance is also given for comparison.

Approach                POS (PWA)  CHUNK (F1)  NER (F1)  SRL (F1)
Benchmark Systems       97.24      94.29       89.31     77.92
NN+SLL+LM2              97.20      93.63       88.67     74.15
NN+SLL+LM2+Suffix2      97.29      –           –         –
NN+SLL+LM2+Gazetteer    –          –           89.59     –
NN+SLL+LM2+POS          –          94.32       88.67     –
NN+SLL+LM2+CHUNK        –          –           –         74.72

Table 10: Comparison in generalization performance of benchmark NLP systems with our neural networks (NNs) using increasing task-specific engineering. We report results obtained with a network trained without the extra task-specific features (Section 5) and with the extra task-specific features described in Section 6. The POS network was trained with two-character word suffixes; the NER network was trained using the small CoNLL 2003 gazetteer; the CHUNK and NER networks were trained with additional POS features; and finally, the SRL network was trained with additional CHUNK features.

Approach                POS (PWA)  CHUNK (F1)  NER (F1)  SRL (F1)
Benchmark Systems       97.24      94.29       89.31     77.92
NN+WLL                  96.31      89.13       79.53     55.40
NN+SLL                  96.37      90.33       81.47     70.99
NN+WLL+LM1              97.05      91.91       85.68     58.18
NN+SLL+LM1              97.10      93.65       87.58     73.84
NN+WLL+LM2              97.14      92.04       86.96     58.34
NN+SLL+LM2              97.20      93.63       88.67     74.15

Table 8: Comparison in generalization performance of benchmark NLP systems with our (NN) approach on POS, chunking, NER and SRL tasks. We report results with both the word-level log-likelihood (WLL) and the sentence-level log-likelihood (SLL). We report with (LMn) the performance of the networks trained from the language model embeddings (Table 7). Generalization performance is reported in per-word accuracy (PWA) for POS and F1 score for the other tasks.

From the paper: ''Table 8 clearly shows that this simple initialization significantly boosts the generalization performance of the supervised networks for each task. It is worth mentioning the larger language model led to even better performance. This suggests that we could still take advantage of even bigger unlabeled data sets.''


Page 21

SENTENCE NEURAL NETWORK
Topics: experimental comparison
• Nearest neighbors in word representation space:
• For a 2D visualization: http://www.cs.toronto.edu/~hinton/turian.png

FRANCE       JESUS    XBOX         REDDISH    SCRATCHED  MEGABITS
454          1973     6909         11724      29869      87025
AUSTRIA      GOD      AMIGA        GREENISH   NAILED     OCTETS
BELGIUM      SATI     PLAYSTATION  BLUISH     SMASHED    MB/S
GERMANY      CHRIST   MSX          PINKISH    PUNCHED    BIT/S
ITALY        SATAN    IPOD         PURPLISH   POPPED     BAUD
GREECE       KALI     SEGA         BROWNISH   CRIMPED    CARATS
SWEDEN       INDRA    PSNUMBER     GREYISH    SCRAPED    KBIT/S
NORWAY       VISHNU   HD           GRAYISH    SCREWED    MEGAHERTZ
EUROPE       ANANDA   DREAMCAST    WHITISH    SECTIONED  MEGAPIXELS
HUNGARY      PARVATI  GEFORCE      SILVERY    SLASHED    GBIT/S
SWITZERLAND  GRACE    CAPCOM       YELLOWISH  RIPPED     AMPERES

Table 7: Word embeddings in the word lookup table of the language model neural network LM1 trained with a dictionary of size 100,000. For each column the queried word is followed by its index in the dictionary (higher means more rare) and its 10 nearest neighbors (using the Euclidean metric, which was chosen arbitrarily).
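A minimal sketch of the nearest-neighbor lookup behind Table 7 (Euclidean distance in the embedding table); the vocabulary and random embeddings here are stand-ins for the trained LM1 lookup table:

```python
import numpy as np
rng = np.random.default_rng(0)

vocab = ["france", "austria", "belgium", "xbox", "amiga", "reddish", "greenish"]
word_to_id = {w: i for i, w in enumerate(vocab)}
W_lookup = rng.normal(size=(len(vocab), 50))    # stand-in for trained embeddings

def nearest_neighbors(query, k=3):
    """Return the k words whose embeddings are closest to the query word's
    embedding under the Euclidean metric (as in Table 7)."""
    q = W_lookup[word_to_id[query]]
    dist = np.linalg.norm(W_lookup - q, axis=1)
    dist[word_to_id[query]] = np.inf            # exclude the query itself
    return [vocab[i] for i in np.argsort(dist)[:k]]

print(nearest_neighbors("france"))              # 3 nearest words (meaningless with random vectors)
```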

Page 22

CONCLUSION
• We saw a particular architecture for tagging words with syntactic and semantic information
  ‣ it exploits the idea of learning vector representations of words
  ‣ it uses a convolutional architecture, to use the whole sentence as context
  ‣ it demonstrates that unsupervised learning can help a lot in learning good representations
  ‣ it can incorporate additional features that are known to work well in certain NLP problems
  ‣ even without them, it almost reaches state-of-the-art performance