ACTA UNIVERSITATIS UPSALIENSIS
Studia Linguistica Upsaliensia 24

Linguistically Informed Neural Dependency Parsing for Typologically Diverse Languages

Miryam de Lhoneux
A dependency graph G = (V, A) for a sentence w_1:n of n words consists of a set of vertices V and a set of arcs A, where each arc is a triple (h, r, d) connecting a head h to a dependent d with relation label r. For the sentence "I love syntax", V = {I, love, syntax} and A = {(love, nsubj, I), (love, obj, syntax)}. We write x_1:n = x_1, ..., x_n for the sequence of input vectors. An arc (w_i, r, w_j) ∈ A in a graph G = (V, A) is projective if w_i →* w_k for every word w_k between the two (i < k < j when i < j, and j < k < i when j < i), where →* denotes the reflexive transitive closure of the arc relation in A. An alternative analysis of the example sentence would be {(love, nsubj, I), (love, nmod, syntax)}.
The transition system has the initial configuration c_0(x = (w_1, ..., w_n)) = ([ ], [1, ..., n, 0], ∅), the set of terminal configurations C_t = {c ∈ C | c = ([ ], [0], A)}, and the following transitions:

LEFT-ARC_r:  (σ|i, j|β, A) ⇒ (σ, j|β, A ∪ {(j, r, i)})
RIGHT-ARC_r: (σ|i|j, β, A) ⇒ (σ|i, β, A ∪ {(i, r, j)})
SHIFT:       (σ, i|β, A) ⇒ (σ|i, β, A), if i ≠ 0
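Read operationally, these transitions (the arc-hybrid style of attachment, where LEFT-ARC links the top of the stack to the front of the buffer) can be sketched in Python as pure functions over (stack, buffer, arcs) configurations. This is an illustrative sketch under assumptions of this section, not the parser's actual implementation; the function names and the set-based arc store are choices made here.

import copy  # not strictly needed: the functions below build new tuples

def initial(n):
    # words 1..n followed by the artificial root 0 at the end of the buffer
    return ([], list(range(1, n + 1)) + [0], set())

def shift(config):
    stack, buffer, arcs = config
    i = buffer[0]
    assert i != 0                      # the artificial root is never shifted
    return (stack + [i], buffer[1:], arcs)

def left_arc(config, r):
    stack, buffer, arcs = config       # attach stack top i to buffer front j
    i, j = stack[-1], buffer[0]
    return (stack[:-1], buffer, arcs | {(j, r, i)})

def right_arc(config, r):
    stack, buffer, arcs = config       # attach stack top j to the item i below it
    i, j = stack[-2], stack[-1]
    return (stack[:-1], buffer, arcs | {(i, r, j)})

def is_terminal(config):
    stack, buffer, _ = config
    return stack == [] and buffer == [0]

config = shift(initial(4))             # e.g. shift the first word onto the stack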
[Figure: a transition sequence (SHIFT, LEFT-ARC, RIGHT-ARC) for the sentence "the brown fox jumped" with the artificial root, showing the configuration as STACK (Σ) and BUFFER (B) at each step.]
The feature templates are defined over the top items of the stack (s_1, s_2) and the buffer (b_1), their word forms (w) and POS tags (t), and the leftmost and rightmost children of stack items (lc_1(s_1), rc_1(s_1), ...), with ∘ denoting feature conjunction:

single items: s_1.w; s_1.t; s_1.wt; s_2.w; s_2.t; s_2.wt; b_1.w; b_1.t; b_1.wt
pairs: s_1.wt ∘ s_2.wt; s_1.wt ∘ s_2.w; s_1.wt ∘ s_2.t; s_1.w ∘ s_2.wt; s_1.t ∘ s_2.wt; s_1.w ∘ s_2.w; s_1.t ∘ s_2.t; s_1.t ∘ b_1.t
triples: s_2.t ∘ s_1.t ∘ b_1.t; s_2.t ∘ s_1.t ∘ lc_1(s_1).t; s_2.t ∘ s_1.t ∘ rc_1(s_1).t; s_2.t ∘ s_1.t ∘ lc_1(s_2).t; s_2.t ∘ s_1.t ∘ rc_1(s_2).t; s_2.t ∘ s_1.w ∘ rc_1(s_2).t; s_2.t ∘ s_1.w ∘ lc_1(s_1).t; s_2.t ∘ s_1.w ∘ b_1.t
LEFT-ARC_r adds the arc (j, r, i), making the item i on top of the stack a dependent of the word j at the front of the buffer; RIGHT-ARC_r adds the arc (i, r, j), making the top item j a dependent of the item i below it on the stack. Each configuration c is represented by a feature function φ(·).
Algorithm: online training with a static oracle.

for i in 1, ..., EPOCHS
    for sentence s in training set T
        c ← c_s(s)
        while c is not terminal
            t_p ← argmax_t w · φ(c, t)
            t_o ← o(c, T)
            if t_p ≠ t_o then UPDATE(φ(c, t_o), φ(c, t_p))
            c ← t_o(c)

Here c_s(s) is the initial configuration for s, t_p the predicted transition, t_o the transition returned by the static oracle o(c, T), φ(c, t) the feature representation of transition t in configuration c, and UPDATE(φ(c, t_o), φ(c, t_p)) the parameter update performed when the prediction disagrees with the oracle.
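A minimal sketch of this loop, assuming hypothetical callables (initial_config, is_terminal, score, oracle, apply, update) that stand in for the components described above rather than any actual parser code:

def train_static(sentences, epochs, transitions,
                 initial_config, is_terminal, score, oracle, apply, update):
    # All model-specific components are passed in as callables so the sketch
    # stays agnostic about how scoring and updating are implemented.
    for epoch in range(epochs):
        for s in sentences:
            c = initial_config(s)                                   # c_s(s)
            while not is_terminal(c):
                t_p = max(transitions, key=lambda t: score(c, t))   # model prediction
                t_o = oracle(c, s)                                  # static oracle transition
                if t_p != t_o:
                    update(c, t_o, t_p)     # e.g. a perceptron-style weight update
                c = apply(t_o, c)           # always continue along the oracle path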
A feedforward network maps an input vector x to an output y through a hidden layer h, applying a non-linear activation function f to an affine transformation W x + b of the input:

h = f(W_1 x + b_1)
y = softmax(W_2 h)

[Figure: a feedforward network with input x, hidden layer h and output p.]
In the parser of Chen and Manning, the input consists of word, tag and label embeddings x^w, x^t and x^l extracted from the configuration, the hidden layer h uses a cube activation function, and the output p is a softmax over transitions:

h = (W^w_1 x^w + W^t_1 x^t + W^l_1 x^l + b_1)^3
p = softmax(W_2 h)
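As a small illustrative sketch of this cube-activation scorer (the dimensions, random initialisation and three-way transition output are arbitrary choices for the example, not values from the thesis):

import numpy as np

d_emb, d_hidden, n_transitions = 50, 200, 3
W1_w = np.random.randn(d_hidden, d_emb) * 0.01
W1_t = np.random.randn(d_hidden, d_emb) * 0.01
W1_l = np.random.randn(d_hidden, d_emb) * 0.01
b1 = np.zeros(d_hidden)
W2 = np.random.randn(n_transitions, d_hidden) * 0.01

def score(x_w, x_t, x_l):
    h = (W1_w @ x_w + W1_t @ x_t + W1_l @ x_l + b1) ** 3   # cube activation
    logits = W2 @ h
    p = np.exp(logits - logits.max())
    return p / p.sum()                                     # softmax over transitions

p = score(np.random.randn(d_emb), np.random.randn(d_emb), np.random.randn(d_emb))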
[Figure: an RNN unrolled over an input sequence x_1, ..., x_t with hidden states h_0, h_1, ..., h_t.]

An RNN encodes an input sequence x_1:n into a sequence of hidden states h_1:n = RNN(x_1:n), where each state is computed from the previous state and the current input:

h_t = f(h_{t−1}, x_t)

The state h_t thus summarises the prefix x_1:t.
In an LSTM, the state h_t is computed using gates and a memory cell c_t:

f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
i_t = σ(W_i x_t + U_i h_{t−1} + b_i)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c x_t + U_c h_{t−1} + b_c)
h_t = o_t ⊙ tanh(c_t)

At each time step t, σ is the logistic sigmoid and f_t, i_t and o_t are the forget, input and output gates: f_t controls how much of the previous cell state c_{t−1} is kept, i_t controls how much of the candidate tanh(W_c x_t + U_c h_{t−1} + b_c) enters the new cell state c_t, and o_t controls how much of the cell state is exposed through the tanh in h_t.
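One LSTM step following these equations can be sketched directly in NumPy; the parameter dictionary W below is a stand-in for learned weights (an assumption of this sketch, not the thesis code).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(W, x_t, h_prev, c_prev):
    # Gates computed from the current input and the previous hidden state.
    f_t = sigmoid(W["Wf"] @ x_t + W["Uf"] @ h_prev + W["bf"])
    i_t = sigmoid(W["Wi"] @ x_t + W["Ui"] @ h_prev + W["bi"])
    o_t = sigmoid(W["Wo"] @ x_t + W["Uo"] @ h_prev + W["bo"])
    # New cell state: keep part of the old cell, add part of the candidate.
    c_t = f_t * c_prev + i_t * np.tanh(W["Wc"] @ x_t + W["Uc"] @ h_prev + W["bc"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t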
A BiLSTM runs a forward LSTM f over x_1:n and a backward LSTM b over the reversed sequence x_n:1. Writing [v_1; v_2] for the concatenation of two vectors, the BiLSTM encoding of position i is

BiLSTM(x_1:n, i) = [f(x_1:i); b(x_n:i)]
Each configuration c is represented by a feature function φ(·). For a sentence of n words w_1:n, the input vector x_i of word w_i concatenates a word embedding and an embedding of its POS tag t_i:

x_i = [e(w_i); e(t_i)]

The vector v_i of word i is its BiLSTM encoding, and φ(c) concatenates the vectors of the top three stack items s_2, s_1, s_0 and the first buffer item b_0:

v_i = BiLSTM(x_1:n, i)
φ(c) = [v_s2; v_s1; v_s0; v_b0]

Transition scores are then computed by an MLP:

MLP(φ(c)) = W_2 · tanh(W_1 · φ(c) + b_1) + b_2
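A sketch of the feature function and MLP scorer, assuming the BiLSTM vectors have already been computed; how missing stack or buffer items are padded here is a choice made for this example, not necessarily what the thesis parser does.

import numpy as np

def phi(config, v, pad):
    # v maps word positions to their BiLSTM vectors; pad is used for empty slots.
    stack, buffer, _ = config
    items = [stack[-3] if len(stack) > 2 else None,
             stack[-2] if len(stack) > 1 else None,
             stack[-1] if len(stack) > 0 else None,
             buffer[0] if buffer else None]          # s2, s1, s0, b0
    return np.concatenate([v[i] if i is not None else pad for i in items])

def mlp_scores(phi_c, W1, b1, W2, b2):
    return W2 @ np.tanh(W1 @ phi_c + b1) + b2        # one score per transition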
[Figure: the BiLSTM-based parser architecture over "the brown fox jumped" with root. The word vectors x_the, x_brown, x_fox, x_jumped, x_root are encoded by forward and backward LSTMs (LSTM_f, LSTM_b) whose states are concatenated into v_the, v_brown, v_fox, v_jumped, v_root; the vectors selected by φ(c) are passed to an MLP whose output is (score(LEFT-ARC), score(RIGHT-ARC), score(SHIFT)).]
Multilingual parsing covers several settings, depending on how many models are trained and on the status of the languages involved:

- polymonolingual: multiple models for multiple languages, one per language
- polyglot: one model for multiple languages, where the languages have equal status
- cross-lingual: a model transfers from source language(s) to a target language, in a multi-source or single-source setting
[Figure: a multi-task learning architecture with shared layers and task-specific layers for Task A, Task B and Task C.]

[Figure: a character-level BiLSTM (C_f, C_b) over the characters of a word; the final forward and backward states are concatenated into a character-based word vector.]
The character-based vector ce(w_i) of a word w_i is obtained by running the character BiLSTM over its character embeddings ch_j, 1 ≤ j ≤ m:

ce(w_i) = BiLSTM(ch_1:m)

The input vector x_i of word w_i then concatenates the word embedding e(w_i), the character vector ce(w_i) and, when POS tags are used, the tag embedding e(t_i):

x_i = [e(w_i); ce(w_i); e(t_i)]

Without tag embeddings (and without pretrained embeddings pe(w_i)), the input reduces to

x_i = [e(w_i); ce(w_i)]
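A sketch of building such an input vector; a plain recurrent cell stands in for the character BiLSTM purely to keep the example short, and the embedding tables and dimensions are illustrative assumptions, not thesis settings.

import numpy as np

d = 16
word_emb = {"fox": np.random.randn(d)}
tag_emb = {"NOUN": np.random.randn(d)}
char_emb = {c: np.random.randn(d) for c in "abcdefghijklmnopqrstuvwxyz"}
Wc, Uc = np.random.randn(d, d) * 0.1, np.random.randn(d, d) * 0.1

def char_encode(word, reverse=False):
    # Simple recurrent pass over the characters (forward or backward).
    h = np.zeros(d)
    chars = reversed(word) if reverse else word
    for c in chars:
        h = np.tanh(Wc @ char_emb[c] + Uc @ h)
    return h

def input_vector(word, tag):
    ce = np.concatenate([char_encode(word), char_encode(word, reverse=True)])
    return np.concatenate([word_emb[word], ce, tag_emb[tag]])

x_fox = input_vector("fox", "NOUN")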
To handle non-projective trees, the transition system is extended with a SWAP transition. The initial and terminal configurations are as before:

c_0(x = (w_1, ..., w_n)) = ([ ], [1, ..., n, 0], ∅)
C_t = {c ∈ C | c = ([ ], [0], A)}

LEFT-ARC_r:  (σ|i, j|β, A) ⇒ (σ, j|β, A ∪ {(j, r, i)}), if j ≠ 0 ∨ σ = [ ]
RIGHT-ARC_r: (σ|i|j, β, A) ⇒ (σ|i, β, A ∪ {(i, r, j)})
SHIFT:       (σ, i|β, A) ⇒ (σ|i, β, A), if i ≠ 0
SWAP:        (σ|s_0, b|β, A) ⇒ (σ, b|s_0|β, A), if β ≠ [ ] ∧ s_0 < b

where Σ (σ) denotes the stack and B (β) the buffer.
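The effect of SWAP can be sketched on the same (stack, buffer, arcs) configurations used above; this is illustrative only, not the parser's implementation.

def swap(config):
    # Move the top of the stack s0 back into the buffer, behind the first buffer
    # item b, so that the two are later processed in the opposite order.
    stack, buffer, arcs = config
    s0, b = stack[-1], buffer[0]
    assert len(buffer) > 1 and s0 < b      # preconditions: β ≠ [ ] and s0 < b
    return (stack[:-1], [b, s0] + buffer[1:], arcs)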
[Figure: a transition sequence using SHIFT, LEFT-ARC, RIGHT-ARC and SWAP for a sentence with a non-projective arc, with the configuration shown as STACK (Σ) and BUFFER (B) at each step. The derivation passes through the configurations

[ ]Σ [1 2 4 3]B → [1]Σ [2 4 3]B → [1 2]Σ [4 3]B → [1 2 4]Σ [3]B → [1 2]Σ [3 4]B → [1 2 3]Σ [4]B → [1 2]Σ [4]B → [1]Σ [4]B → [1 4]Σ [ ]B → [1]Σ [ ]B

where the SWAP step moves word 4 back into the buffer so that words 3 and 4 are processed in swapped order.]
SWAP applies to the top of the stack s_0 and the first item of the buffer b. With SWAP, parsing time remains O(n) in the expected case but is O(n²) in the worst case for sentences of length n.
The static oracle o(c, T) used in the training algorithm above returns a single correct transition for each configuration c along the gold derivation. A dynamic oracle o(t; c, T) instead indicates, for any transition t in any configuration c, whether t is optimal with respect to the gold tree T, so that training can continue meaningfully even from configurations reached after a prediction error.
Algorithm: online training with a dynamic oracle and error exploration.

for i in 1, ..., EPOCHS
    for sentence s in training set T
        c ← c_s(s)
        while c is not terminal
            ZERO_COST ← {t | o(t; c, T) = true}
            t_p ← argmax_t w · φ(c, t)
            t_o ← argmax_{t ∈ ZERO_COST} w · φ(c, t)
            if t_p ∉ ZERO_COST then UPDATE(φ(c, t_o), φ(c, t_p))
            c ← CHOOSE_NEXT_{k,p}(c, t_o, t_p, i)

CHOOSE_NEXT_{k,p}(c, t_o, t_p, i) returns t_p(c) if i > k and RAND() < p (error exploration), and t_o(c) otherwise.
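The exploration policy itself is simple to sketch; k and p are the exploration hyperparameters named above, and returning the chosen transition (rather than the next configuration) is a simplification of this sketch.

import random

def choose_next(t_o, t_p, epoch, k=1, p=0.9):
    # After the first k epochs, follow the model's own (possibly wrong)
    # prediction with probability p; otherwise stay on a zero-cost path.
    if epoch > k and random.random() < p:
        return t_p        # explore: continue from the predicted transition
    return t_o            # follow an optimal (zero-cost) transition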
The cost of a transition is the number of gold arcs that become unreachable by applying it:

- C(LEFT-ARC; c, T): adding the arc (b, s_0) and popping s_0 means that s_0 can no longer acquire a head in H = {s_1} ∪ β or dependents in D = {b} ∪ β; the cost is the number of arcs (s_0, d) and (h, s_0) in T with h ∈ H and d ∈ D.
- C(RIGHT-ARC; c, T): adding the arc (s_1, s_0) and popping s_0 means that s_0 can no longer acquire a head or dependents in B = {b} ∪ β; the cost is the number of arcs (s_0, d) and (h, s_0) in T with h, d ∈ B.
- C(SHIFT; c, T): pushing b onto the stack means that b can no longer acquire a head in H = {s_1} ∪ σ or dependents in D = {s_0, s_1} ∪ σ; the cost is the number of arcs (b, d) and (h, b) in T with h ∈ H and d ∈ D.

A transition is optimal (zero-cost) in a configuration c if every gold arc that is reachable from c remains reachable after applying it.
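For concreteness, the LEFT-ARC cost above can be sketched as follows; the configuration is a (stack, buffer, arcs) triple, the gold tree is given as head[d] and deps[h] dictionaries, and these variable names are choices made for this example.

def left_arc_cost(stack, buffer, head, deps):
    s0 = stack[-1]
    s1 = stack[-2] if len(stack) > 1 else None
    b, beta = buffer[0], buffer[1:]
    lost_heads = set(beta) | ({s1} if s1 is not None else set())   # H = {s1} ∪ β
    lost_deps = set(buffer)                                        # D = {b} ∪ β
    cost = sum(1 for d in deps.get(s0, []) if d in lost_deps)
    cost += 1 if head.get(s0) in lost_heads else 0
    return cost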
For the extended system, the oracle is static with respect to SWAP and dynamic with respect to the other transitions. PROJ(i) denotes the position of word i in the projective order of the sentence, h(i) its head in the gold tree T, and RDEPS(i) the set of its gold dependents that are still reachable; [[Φ]] is 1 if Φ is true and 0 otherwise. s_0 and s_1 are the two top items of the stack Σ, Σ−s_0 the stack excluding s_0, b the first item of the buffer B, and B−b the buffer excluding b. SWAP is used if and only if PROJ(s_0) > PROJ(b). The costs of the remaining transitions are:

- C(LEFT-ARC) = |RDEPS(s_0)| + [[h(s_0) ≠ b ∧ s_0 ∈ RDEPS(h(s_0))]]
  LEFT-ARC adds the arc (b, s_0) and pops s_0; we then set RDEPS(s_0) = [ ] and remove s_0 from RDEPS(h(s_0)).
- C(RIGHT-ARC) = |RDEPS(s_0)| + [[h(s_0) ≠ s_1 ∧ s_0 ∈ RDEPS(h(s_0))]]
  RIGHT-ARC adds the arc (s_1, s_0) and pops s_0; we then set RDEPS(s_0) = [ ] and remove s_0 from RDEPS(h(s_0)).
- C(SHIFT) = 0 if there is some i ∈ B−b such that b < i and PROJ(b) > PROJ(i);
  otherwise C(SHIFT) = |{d ∈ RDEPS(b) | d ∈ Σ}| + [[h(b) ∈ Σ−s_0 ∧ b ∈ RDEPS(h(b))]],
  and we then remove b from RDEPS(h(b)) if h(b) ∈ Σ−s_0 and remove every d ∈ Σ from RDEPS(b).
The input vector of each word concatenates the word embedding e(w_i), a pretrained embedding pe(w_i) and the character vector ce(w_i):

x_i = [e(w_i); pe(w_i); ce(w_i)]
[Figure: recursive (tree-structured) versus recurrent (sequential) composition of the phrase "The largest city in Minnesota".]
Each word w_i is represented by a learned embedding e(w_i), a pretrained embedding pe(w_i) and a tag embedding e(t_i), combined through a rectified linear layer:

x_i = max{0, W[e(w_i); pe(w_i); e(t_i)] + b}
[Figure: the stack LSTM parser. Separate LSTMs encode the Stack (here containing a partial analysis of "an overhasty decision was made" with ROOT and an nmod arc), the Buffer, and the history of Actions (e.g. SHIFT, REDUCE-LEFT(amod)); the TOP states of the three LSTMs feed a softmax over the possible next actions.]
When a head h and a dependent d are connected with relation r, a composed representation c is computed with a tanh layer over their vectors and the relation embedding:

c = tanh(W[h; d; r] + b)

For example, when "the" is attached to "fox" with a left det arc:

C_fox = tanh(W[C_fox; C_the; left-det] + b)
[Figure: the BiLSTM parser with recursive composition over "the brown fox jumped" with root. The encoded vectors C_the, C_brown, C_fox, C_jumped, C_root are updated by composition as dependents are attached, e.g. C_fox is updated when "the" is attached with a det arc.]
The vector v_i of word i concatenates its BiLSTM encoding with a composed vector c^t_i representing the subtree rooted in i after t attachments, initialised as the BiLSTM encoding itself:

v_i = [BiLSTM(x_1:n, i); c^t_i]
c^0_i = BiLSTM(x_1:n, i)

When a dependent d is attached to a head h (e.g. "the" to "fox" with a det arc) via relation r, the composed vector of the head is updated from its previous value, the vector of the dependent and the relation embedding. Two composition functions are used, a simple recurrent cell (+rc) and an LSTM cell (+lc):

+rc: c^t_i = tanh(W[c^{t−1}_i; d; r] + b)
+lc: c^t_i = LSTM_i([c^{t−1}_i; d; r])

The word input vectors concatenate word, tag and character-based embeddings:

x_i = [e(w_i); e(t_i); ce(w_i)]
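A sketch of the +rc update; the relation embedding table and dimensions are illustrative assumptions, and in the parser the composed vector of a head is updated every time one of its dependents is attached.

import numpy as np

d_vec, d_rel = 8, 4
W = np.random.randn(d_vec, 2 * d_vec + d_rel) * 0.1
b = np.zeros(d_vec)
rel_emb = {"det": np.random.randn(d_rel)}

def compose(c_head_prev, v_dep, relation):
    # New composed vector of the head after attaching one dependent.
    z = np.concatenate([c_head_prev, v_dep, rel_emb[relation]])
    return np.tanh(W @ z + b)

c_fox = np.random.randn(d_vec)            # c^{t-1}_fox
v_the = np.random.randn(d_vec)
c_fox = compose(c_fox, v_the, "det")      # update after attaching "the" to "fox"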
More generally, the vector v_i of word w_i is obtained by applying a feature extractor f(x, i) to the sequence of input vectors x. Three extractors are compared: a backward LSTM (bw), a forward LSTM (fw) and a BiLSTM (bi):

v_i = f(x, i)
bw(x, i) = LSTM(x_n:1, i)
fw(x, i) = LSTM(x_1:n, i)
bi(x, i) = [bw(x, i); fw(x, i)]

These are combined with the recursive composition functions +rc and +lc.
[Results omitted: tables comparing the bw, fw and bi feature extractors with and without recursive composition (+rc, +lc), with significant improvements and degradations marked (+/−, ∗ = p < .05, ∗∗ = p < .01); one reported correlation is R = −0.838, p < .01, while another does not reach significance (p > .05).]
An AVC consists of a main verb mv_i and a set of auxiliaries AUX_i. In UD, each auxiliary is attached to the main verb by an aux arc w_aux ←aux− w_mv, i.e. the main verb w_mv is the head and the auxiliary w_aux the dependent. More generally, an arc w_d ←aux− w_h relates an auxiliary dependent w_d to its head w_h; AVCs can therefore be identified by collecting, for each verb v_i in the set of verbs V, the auxiliaries attached to it by aux arcs.
[Figure 5.5. Finite main verb in a UD tree: "I did this", where the finite main verb (FMV) carries VerbForm=Fin and has nsubj and obj dependents.]

[Figure 5.6. Example sentence with an AVC ("I could easily have done this") annotated in UD (top), where the non-finite main verb (NFMV) heads the construction and the auxiliaries attach to it via aux, and in MS (bottom), where the outermost auxiliary (MAUX) is the head. AVC subtree in thick blue.]
therefore represent the word in the context of the sentence) to the corresponding vectors for verb types (the vectors at the input of the BiLSTM) in order to better understand what part of the representation is context-dependent. We finally compare the verb type vectors learned by the parser to verb type vectors learned with a language modelling objective. We expect the vectors learned by the parser to encode information about agreement and transitivity to a greater extent than the vectors learned using a language modelling objective. I explain this in more detail in Section 5.3.3.
Collecting FMVs such as in Figure 5.5 in UD treebanks is straightforward: verbs are annotated with a feature called VerbForm, which has the value Fin if the verb is finite. We find candidates using the feature VerbForm=Fin and only keep those that are not involved in a copula or auxiliary dependency relation, to make sure they are not part of a larger verbal construction.
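As an illustration (not the thesis code), this filtering could be sketched over CoNLL-U files with the third-party conllu library; excluding tokens involved in aux or cop relations in either direction is this sketch's reading of the description above, and the exact filtering used in the thesis may differ.

from conllu import parse  # third-party: pip install conllu

def collect_fmvs(conllu_text):
    fmvs = []
    for sentence in parse(conllu_text):
        for token in sentence:
            feats = token["feats"] or {}
            if feats.get("VerbForm") != "Fin":
                continue
            if (token["deprel"] or "").split(":")[0] in ("aux", "cop"):
                continue                  # the candidate is itself an auxiliary or copula
            has_verbal_dependent = any(
                t["head"] == token["id"] and (t["deprel"] or "").split(":")[0] in ("aux", "cop")
                for t in sentence)
            if not has_verbal_dependent:  # not part of a larger verbal construction
                fmvs.append((sentence.metadata.get("sent_id"), token["form"]))
    return fmvs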
We can collect AVCs in the UD and the MS training data using the first part of the transformation and back-transformation algorithms, respectively, which I defined in Section 5.2.1. We scan the sentence left to right, looking for auxiliary dependency relations and collecting information about which word is the outermost auxiliary and which is the main verb. Once we have our sets of FMVs and AVCs, we can create our task data sets.
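For the UD case, the left-to-right scan might look like the following sketch (again assuming the conllu library); taking the first auxiliary found as the outermost one is a simplification of this sketch, not necessarily the thesis procedure.

from conllu import parse  # third-party: pip install conllu

def collect_avcs_ud(conllu_text):
    avcs = []
    for sentence in parse(conllu_text):
        by_main_verb = {}
        for token in sentence:                          # left-to-right scan
            if (token["deprel"] or "").split(":")[0] == "aux":
                by_main_verb.setdefault(token["head"], []).append(token["id"])
        for mv, auxes in by_main_verb.items():
            avcs.append({"main_verb": mv,
                         "auxiliaries": auxes,
                         "outermost": auxes[0]})        # first aux found left to right
    return avcs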
The token vector v_i of word i is its BiLSTM encoding, with an input vector that concatenates the word embedding and the character-based vector:

v_i = BiLSTM(x_1:n, i)
x_i = [e(w_i); ce(w_i)]
It seems interesting to find out whether or not a composition function can be useful to model the specific case of transfer relations such as those found in AVCs. For this reason, we also investigate the composed representation of AVCs.
We are therefore interested in finding out whether or not an LSTM trained with a parsing objective can learn the notion of dissociated nucleus, as well as whether or not a recursive composition function can help to learn this notion. As just mentioned, the head of an AVC in UD is a non-finite main verb, which we will refer to as NFMV, as depicted in Figure 5.6. The head of an AVC in MS is the outermost auxiliary, which we will refer to as the main auxiliary MAUX, as also depicted in that figure. We therefore look at NFMV and MAUX token vectors for the respective representation schemes and consider two definitions of these: one where we use the BiLSTM encoding of the main verb token v_i, and one where we construct a subtree vector c^t_i by recursively composing the representation of AVCs as auxiliaries get attached to their main verb, as in Equation 4.5, repeated in Equation 5.3. When training the parser, we concatenate this composed vector to a vector of the head of the subtree to form v_i. In Chapter 4, we used two different composition functions, one using a simple recurrent cell and one using an LSTM cell. We saw that the one using an LSTM cell performed better. However, in this set of experiments, we only do recursive composition over a limited part of the subtree: only between auxiliaries and NFMVs. This means that the LSTM would only pass through two states in most cases, and never more than 4.[8] This does not allow us to learn proper weights for the input, output and forget gates. An RNN cell seems more appropriate here, and we only use that.
c^t_i = tanh(W[c^{t−1}_i; d; r] + b)    (5.3)
[Figure 5.7. Example AVC ("I have done that") with vectors of interest: character-based vectors (char), type vectors at the input of the BiLSTM (type), and token vectors at the output of the BiLSTM (tok) for the NFMV ("done") and the MAUX ("have").]
[8] The maximum number of auxiliaries in one AVC in our dataset is 3.
[Figure: example FMV ("I did that !") with vectors of interest: character-based (char), type and token (tok) vectors for the finite main verb "did", which also has a punct dependent.]
[Results omitted: comparisons of the baseline (bas), recursive composition (rc) and +c models on FMV, NFMV and MAUX vectors (examples "did", "have done"), with differences (δ) marked for significance (∗ = p < .05, ∗∗ = p < .01).]
The embedding le is obtained by passing the vector l through a tanh layer:

le = tanh(W l + b)
A treebank embedding te(w_i) is concatenated to the input vector of each word w_i:

x_i = [pe(w_i); ce(w_i); e(t_i); te(w_i)]
[Table legend: values are binned relative to the mean μ (below μ−σ, below μ−SE, below μ, below μ+SE, below μ+σ), where σ is the standard deviation and SE = σ/√N the standard error over N observations; results are reported as μ ± SE or μ ± σ.]
[Figure: the BiLSTM parser architecture over "the brown fox jumped" with root, extended with the SWAP transition: the MLP output is (score(LEFT-ARC), score(RIGHT-ARC), score(SHIFT), score(SWAP)).]
[Figure: the three levels at which a treebank embedding can be added: the character level (the character BiLSTM C_f, C_b over, e.g., "the"), the word level, and the state (configuration) level.]
The treebank embedding te(w_i) of word w_i can be added at each of these levels:

word level:      x_i = [e(w_i); ce(w_i); te(w_i)]
character level: ch_j = [e(ch_j); te(w_i)]
state level:     φ(c) = [v_s1; v_s0; v_b0; te(s1, s0, b0)]
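A sketch of the word-level variant: every word drawn from a given treebank gets the same learned vector appended to its input representation. The table of treebank IDs and the dimensions below are illustrative assumptions, not thesis settings.

import numpy as np

d_word, d_char, d_tb = 16, 8, 4
treebank_emb = {"en_ewt": np.random.randn(d_tb), "en_gum": np.random.randn(d_tb)}

def word_input(e_w, ce_w, treebank_id):
    # Concatenate word embedding, character vector and the treebank embedding.
    return np.concatenate([e_w, ce_w, treebank_emb[treebank_id]])

x_i = word_input(np.random.randn(d_word), np.random.randn(d_char), "en_ewt")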
Without and with the treebank embedding tbe(w_i), the input vectors are:

x_i = [pe(w_i); ce(w_i); e(t_i)]
x_i = [pe(w_i); ce(w_i); e(t_i); tbe(w_i)]
ACTA UNIVERSITATIS UPSALIENSIS
Studia Linguistica Upsaliensia
Editors: Michael Dunn and Joakim Nivre

1. Jörg Tiedemann, Recycling translations. Extraction of lexical data from parallel corpora and their application in natural language processing. 2003.
2. Agnes Edling, Abstraction and authority in textbooks. The textual paths towards specialized language. 2006.
3. Åsa af Geijerstam, Att skriva i naturorienterande ämnen i skolan. 2006.
4. Gustav Öquist, Evaluating Readability on Mobile Devices. 2006.
5. Jenny Wiksten Folkeryd, Writing with an Attitude. Appraisal and student texts in the school subject of Swedish. 2006.
6. Ingrid Björk, Relativizing linguistic relativity. Investigating underlying assumptions about language in the neo-Whorfian literature. 2008.
7. Joakim Nivre, Mats Dahllöf and Beáta Megyesi, Resourceful Language Technology. Festschrift in Honor of Anna Sågvall Hein. 2008.
8. Anju Saxena & Åke Viberg, Multilingualism. Proceedings of the 23rd Scandinavian Conference of Linguistics. 2009.
9. Markus Saers, Translation as Linear Transduction. Models and Algorithms for Efficient Learning in Statistical Machine Translation. 2011.
10. Ulrika Serrander, Bilingual lexical processing in single word production. Swedish learners of Spanish and the effects of L2 immersion. 2011.
11. Mattias Nilsson, Computational Models of Eye Movements in Reading: A Data-Driven Approach to the Eye-Mind Link. 2012.
12. Luying Wang, Second Language Acquisition of Mandarin Aspect Markers by Native Swedish Adults. 2012.
13. Farideh Okati, The Vowel Systems of Five Iranian Balochi Dialects. 2012.
14. Oscar Täckström, Predicting Linguistic Structure with Incomplete and Cross-Lingual Supervision. 2013.
15. Christian Hardmeier, Discourse in Statistical Machine Translation. 2014.
16. Mojgan Seraji, Morphosyntactic Corpora and Tools for Persian. 2015.
17. Eva Pettersson, Spelling Normalisation and Linguistic Analysis of Historical Text for Information Extraction. 2016.
18. Marie Dubremetz, Detecting Rhetorical Figures Based on Repetition of Words: Chiasmus, Epanaphora, Epiphora. 2017.
19. Josefin Lindgren, Developing narrative competence: Swedish, Swedish-German and Swedish-Turkish children aged 4–6. 2018.
20. Vera Wilhelmsen, A Linguistic Description of Mbugwe with Focus on Tone and Verbal Morphology. 2018.
21. Yan Shao, Segmenting and Tagging Text with Neural Networks. 2018.
22. Ali Basirat, Principal Word Vectors. 2018.
23. Marc Tang, A typology of classifiers and gender. 2018.
24. Miryam de Lhoneux, Linguistically Informed Neural Dependency Parsing for Typologically Diverse Languages. 2019.