ACTA UNIVERSITATIS UPSALIENSIS
Studia Linguistica Upsaliensia 24

Linguistically Informed Neural Dependency Parsing for Typologically Diverse Languages

Miryam de Lhoneux
A dependency graph G = (V, A) for a sentence w_1:n of n words consists of a set of vertices V and a set of arcs A, where each arc is a triple (h, r, d) connecting a head h to a dependent d with relation label r. For the sentence "I love syntax", V = {I, love, syntax} and A = {(love, nsubj, I), (love, obj, syntax)}. We write x_1:n = x_1, ..., x_n for the sequence of input vectors. An arc (w_i, r, w_j) ∈ A in a graph G = (V, A) is projective if w_i →* w_k for every word w_k between the two (i < k < j when i < j, and j < k < i when j < i), where →* denotes the reflexive transitive closure of the arc relation in A. An alternative analysis of the example sentence would be {(love, nsubj, I), (love, nmod, syntax)}.
The transition system has the initial configuration c_0(x = (w_1, ..., w_n)) = ([ ], [1, ..., n, 0], ∅), the set of terminal configurations C_t = {c ∈ C | c = ([ ], [0], A)}, and the following transitions:

LEFT-ARC_r:  (σ|i, j|β, A) ⇒ (σ, j|β, A ∪ {(j, r, i)})
RIGHT-ARC_r: (σ|i|j, β, A) ⇒ (σ|i, β, A ∪ {(i, r, j)})
SHIFT:       (σ, i|β, A) ⇒ (σ|i, β, A), if i ≠ 0
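Read operationally, these transitions (the arc-hybrid style of attachment, where LEFT-ARC links the top of the stack to the front of the buffer) can be sketched in Python as pure functions over (stack, buffer, arcs) configurations. This is an illustrative sketch under assumptions of this section, not the parser's actual implementation; the function names and the set-based arc store are choices made here.

import copy  # not strictly needed: the functions below build new tuples

def initial(n):
    # words 1..n followed by the artificial root 0 at the end of the buffer
    return ([], list(range(1, n + 1)) + [0], set())

def shift(config):
    stack, buffer, arcs = config
    i = buffer[0]
    assert i != 0                      # the artificial root is never shifted
    return (stack + [i], buffer[1:], arcs)

def left_arc(config, r):
    stack, buffer, arcs = config       # attach stack top i to buffer front j
    i, j = stack[-1], buffer[0]
    return (stack[:-1], buffer, arcs | {(j, r, i)})

def right_arc(config, r):
    stack, buffer, arcs = config       # attach stack top j to the item i below it
    i, j = stack[-2], stack[-1]
    return (stack[:-1], buffer, arcs | {(i, r, j)})

def is_terminal(config):
    stack, buffer, _ = config
    return stack == [] and buffer == [0]

config = shift(initial(4))             # e.g. shift the first word onto the stack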
[Figure: a transition sequence (SHIFT, LEFT-ARC, RIGHT-ARC) for the sentence "the brown fox jumped" with the artificial root, showing the configuration as STACK (Σ) and BUFFER (B) at each step.]
The feature templates are defined over the top items of the stack (s_1, s_2) and the buffer (b_1), their word forms (w) and POS tags (t), and the leftmost and rightmost children of stack items (lc_1(s_1), rc_1(s_1), ...), with ∘ denoting feature conjunction:

single items: s_1.w; s_1.t; s_1.wt; s_2.w; s_2.t; s_2.wt; b_1.w; b_1.t; b_1.wt
pairs: s_1.wt ∘ s_2.wt; s_1.wt ∘ s_2.w; s_1.wt ∘ s_2.t; s_1.w ∘ s_2.wt; s_1.t ∘ s_2.wt; s_1.w ∘ s_2.w; s_1.t ∘ s_2.t; s_1.t ∘ b_1.t
triples: s_2.t ∘ s_1.t ∘ b_1.t; s_2.t ∘ s_1.t ∘ lc_1(s_1).t; s_2.t ∘ s_1.t ∘ rc_1(s_1).t; s_2.t ∘ s_1.t ∘ lc_1(s_2).t; s_2.t ∘ s_1.t ∘ rc_1(s_2).t; s_2.t ∘ s_1.w ∘ rc_1(s_2).t; s_2.t ∘ s_1.w ∘ lc_1(s_1).t; s_2.t ∘ s_1.w ∘ b_1.t
LEFT-ARC_r adds the arc (j, r, i), making the item i on top of the stack a dependent of the word j at the front of the buffer; RIGHT-ARC_r adds the arc (i, r, j), making the top item j a dependent of the item i below it on the stack. Each configuration c is represented by a feature function φ(·).
Algorithm: online training with a static oracle.

for i in 1, ..., EPOCHS
    for sentence s in training set T
        c ← c_s(s)
        while c is not terminal
            t_p ← argmax_t w · φ(c, t)
            t_o ← o(c, T)
            if t_p ≠ t_o then UPDATE(φ(c, t_o), φ(c, t_p))
            c ← t_o(c)

Here c_s(s) is the initial configuration for s, t_p the predicted transition, t_o the transition returned by the static oracle o(c, T), φ(c, t) the feature representation of transition t in configuration c, and UPDATE(φ(c, t_o), φ(c, t_p)) the parameter update performed when the prediction disagrees with the oracle.
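A minimal sketch of this loop, assuming hypothetical callables (initial_config, is_terminal, score, oracle, apply, update) that stand in for the components described above rather than any actual parser code:

def train_static(sentences, epochs, transitions,
                 initial_config, is_terminal, score, oracle, apply, update):
    # All model-specific components are passed in as callables so the sketch
    # stays agnostic about how scoring and updating are implemented.
    for epoch in range(epochs):
        for s in sentences:
            c = initial_config(s)                                   # c_s(s)
            while not is_terminal(c):
                t_p = max(transitions, key=lambda t: score(c, t))   # model prediction
                t_o = oracle(c, s)                                  # static oracle transition
                if t_p != t_o:
                    update(c, t_o, t_p)     # e.g. a perceptron-style weight update
                c = apply(t_o, c)           # always continue along the oracle path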
A feedforward network maps an input vector x to an output y through a hidden layer h, applying a non-linear activation function f to an affine transformation W x + b of the input:

h = f(W_1 x + b_1)
y = softmax(W_2 h)

[Figure: a feedforward network with input x, hidden layer h and output p.]
In the parser of Chen and Manning, the input consists of word, tag and label embeddings x^w, x^t and x^l extracted from the configuration, the hidden layer h uses a cube activation function, and the output p is a softmax over transitions:

h = (W^w_1 x^w + W^t_1 x^t + W^l_1 x^l + b_1)^3
p = softmax(W_2 h)
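As a small illustrative sketch of this cube-activation scorer (the dimensions, random initialisation and three-way transition output are arbitrary choices for the example, not values from the thesis):

import numpy as np

d_emb, d_hidden, n_transitions = 50, 200, 3
W1_w = np.random.randn(d_hidden, d_emb) * 0.01
W1_t = np.random.randn(d_hidden, d_emb) * 0.01
W1_l = np.random.randn(d_hidden, d_emb) * 0.01
b1 = np.zeros(d_hidden)
W2 = np.random.randn(n_transitions, d_hidden) * 0.01

def score(x_w, x_t, x_l):
    h = (W1_w @ x_w + W1_t @ x_t + W1_l @ x_l + b1) ** 3   # cube activation
    logits = W2 @ h
    p = np.exp(logits - logits.max())
    return p / p.sum()                                     # softmax over transitions

p = score(np.random.randn(d_emb), np.random.randn(d_emb), np.random.randn(d_emb))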
[Figure: an RNN unrolled over an input sequence x_1, ..., x_t with hidden states h_0, h_1, ..., h_t.]

An RNN encodes an input sequence x_1:n into a sequence of hidden states h_1:n = RNN(x_1:n), where each state is computed from the previous state and the current input:

h_t = f(h_{t−1}, x_t)

The state h_t thus summarises the prefix x_1:t.
In an LSTM, the state h_t is computed using gates and a memory cell c_t:

f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
i_t = σ(W_i x_t + U_i h_{t−1} + b_i)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c x_t + U_c h_{t−1} + b_c)
h_t = o_t ⊙ tanh(c_t)

At each time step t, σ is the logistic sigmoid and f_t, i_t and o_t are the forget, input and output gates: f_t controls how much of the previous cell state c_{t−1} is kept, i_t controls how much of the candidate tanh(W_c x_t + U_c h_{t−1} + b_c) enters the new cell state c_t, and o_t controls how much of the cell state is exposed through the tanh in h_t.
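One LSTM step following these equations can be sketched directly in NumPy; the parameter dictionary W below is a stand-in for learned weights (an assumption of this sketch, not the thesis code).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(W, x_t, h_prev, c_prev):
    # Gates computed from the current input and the previous hidden state.
    f_t = sigmoid(W["Wf"] @ x_t + W["Uf"] @ h_prev + W["bf"])
    i_t = sigmoid(W["Wi"] @ x_t + W["Ui"] @ h_prev + W["bi"])
    o_t = sigmoid(W["Wo"] @ x_t + W["Uo"] @ h_prev + W["bo"])
    # New cell state: keep part of the old cell, add part of the candidate.
    c_t = f_t * c_prev + i_t * np.tanh(W["Wc"] @ x_t + W["Uc"] @ h_prev + W["bc"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t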
A BiLSTM runs a forward LSTM f over x_1:n and a backward LSTM b over the reversed sequence x_n:1. Writing [v_1; v_2] for the concatenation of two vectors, the BiLSTM encoding of position i is

BiLSTM(x_1:n, i) = [f(x_1:i); b(x_n:i)]
Each configuration c is represented by a feature function φ(·). For a sentence of n words w_1:n, the input vector x_i of word w_i concatenates a word embedding and an embedding of its POS tag t_i:

x_i = [e(w_i); e(t_i)]

The vector v_i of word i is its BiLSTM encoding, and φ(c) concatenates the vectors of the top three stack items s_2, s_1, s_0 and the first buffer item b_0:

v_i = BiLSTM(x_1:n, i)
φ(c) = [v_s2; v_s1; v_s0; v_b0]

Transition scores are then computed by an MLP:

MLP(φ(c)) = W_2 · tanh(W_1 · φ(c) + b_1) + b_2
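A sketch of the feature function and MLP scorer, assuming the BiLSTM vectors have already been computed; how missing stack or buffer items are padded here is a choice made for this example, not necessarily what the thesis parser does.

import numpy as np

def phi(config, v, pad):
    # v maps word positions to their BiLSTM vectors; pad is used for empty slots.
    stack, buffer, _ = config
    items = [stack[-3] if len(stack) > 2 else None,
             stack[-2] if len(stack) > 1 else None,
             stack[-1] if len(stack) > 0 else None,
             buffer[0] if buffer else None]          # s2, s1, s0, b0
    return np.concatenate([v[i] if i is not None else pad for i in items])

def mlp_scores(phi_c, W1, b1, W2, b2):
    return W2 @ np.tanh(W1 @ phi_c + b1) + b2        # one score per transition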
[Figure: the BiLSTM-based parser architecture over "the brown fox jumped" with root. The word vectors x_the, x_brown, x_fox, x_jumped, x_root are encoded by forward and backward LSTMs (LSTM_f, LSTM_b) whose states are concatenated into v_the, v_brown, v_fox, v_jumped, v_root; the vectors selected by φ(c) are passed to an MLP whose output is (score(LEFT-ARC), score(RIGHT-ARC), score(SHIFT)).]
Multilingual parsing covers several settings, depending on how many models are trained and on the status of the languages involved:

- polymonolingual: multiple models for multiple languages, one per language
- polyglot: one model for multiple languages, where the languages have equal status
- cross-lingual: a model transfers from source language(s) to a target language, in a multi-source or single-source setting
[Figure: a multi-task learning architecture with shared layers and task-specific layers for Task A, Task B and Task C.]

[Figure: a character-level BiLSTM (C_f, C_b) over the characters of a word; the final forward and backward states are concatenated into a character-based word vector.]
The character-based vector ce(w_i) of a word w_i is obtained by running the character BiLSTM over its character embeddings ch_j, 1 ≤ j ≤ m:

ce(w_i) = BiLSTM(ch_1:m)

The input vector x_i of word w_i then concatenates the word embedding e(w_i), the character vector ce(w_i) and, when POS tags are used, the tag embedding e(t_i):

x_i = [e(w_i); ce(w_i); e(t_i)]

Without tag embeddings (and without pretrained embeddings pe(w_i)), the input reduces to

x_i = [e(w_i); ce(w_i)]
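A sketch of building such an input vector; a plain recurrent cell stands in for the character BiLSTM purely to keep the example short, and the embedding tables and dimensions are illustrative assumptions, not thesis settings.

import numpy as np

d = 16
word_emb = {"fox": np.random.randn(d)}
tag_emb = {"NOUN": np.random.randn(d)}
char_emb = {c: np.random.randn(d) for c in "abcdefghijklmnopqrstuvwxyz"}
Wc, Uc = np.random.randn(d, d) * 0.1, np.random.randn(d, d) * 0.1

def char_encode(word, reverse=False):
    # Simple recurrent pass over the characters (forward or backward).
    h = np.zeros(d)
    chars = reversed(word) if reverse else word
    for c in chars:
        h = np.tanh(Wc @ char_emb[c] + Uc @ h)
    return h

def input_vector(word, tag):
    ce = np.concatenate([char_encode(word), char_encode(word, reverse=True)])
    return np.concatenate([word_emb[word], ce, tag_emb[tag]])

x_fox = input_vector("fox", "NOUN")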
To handle non-projective trees, the transition system is extended with a SWAP transition. The initial and terminal configurations are as before:

c_0(x = (w_1, ..., w_n)) = ([ ], [1, ..., n, 0], ∅)
C_t = {c ∈ C | c = ([ ], [0], A)}

LEFT-ARC_r:  (σ|i, j|β, A) ⇒ (σ, j|β, A ∪ {(j, r, i)}), if j ≠ 0 ∨ σ = [ ]
RIGHT-ARC_r: (σ|i|j, β, A) ⇒ (σ|i, β, A ∪ {(i, r, j)})
SHIFT:       (σ, i|β, A) ⇒ (σ|i, β, A), if i ≠ 0
SWAP:        (σ|s_0, b|β, A) ⇒ (σ, b|s_0|β, A), if β ≠ [ ] ∧ s_0 < b

where Σ (σ) denotes the stack and B (β) the buffer.
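The effect of SWAP can be sketched on the same (stack, buffer, arcs) configurations used above; this is illustrative only, not the parser's implementation.

def swap(config):
    # Move the top of the stack s0 back into the buffer, behind the first buffer
    # item b, so that the two are later processed in the opposite order.
    stack, buffer, arcs = config
    s0, b = stack[-1], buffer[0]
    assert len(buffer) > 1 and s0 < b      # preconditions: β ≠ [ ] and s0 < b
    return (stack[:-1], [b, s0] + buffer[1:], arcs)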
[Figure: a transition sequence using SHIFT, LEFT-ARC, RIGHT-ARC and SWAP for a sentence with a non-projective arc, with the configuration shown as STACK (Σ) and BUFFER (B) at each step. The derivation passes through the configurations

[ ]Σ [1 2 4 3]B → [1]Σ [2 4 3]B → [1 2]Σ [4 3]B → [1 2 4]Σ [3]B → [1 2]Σ [3 4]B → [1 2 3]Σ [4]B → [1 2]Σ [4]B → [1]Σ [4]B → [1 4]Σ [ ]B → [1]Σ [ ]B

where the SWAP step moves word 4 back into the buffer so that words 3 and 4 are processed in swapped order.]
SWAP applies to the top of the stack s_0 and the first item of the buffer b. With SWAP, parsing time remains O(n) in the expected case but is O(n²) in the worst case for sentences of length n.
The static oracle o(c, T) used in the training algorithm above returns a single correct transition for each configuration c along the gold derivation. A dynamic oracle o(t; c, T) instead indicates, for any transition t in any configuration c, whether t is optimal with respect to the gold tree T, so that training can continue meaningfully even from configurations reached after a prediction error.
Algorithm: online training with a dynamic oracle and error exploration.

for i in 1, ..., EPOCHS
    for sentence s in training set T
        c ← c_s(s)
        while c is not terminal
            ZERO_COST ← {t | o(t; c, T) = true}
            t_p ← argmax_t w · φ(c, t)
            t_o ← argmax_{t ∈ ZERO_COST} w · φ(c, t)
            if t_p ∉ ZERO_COST then UPDATE(φ(c, t_o), φ(c, t_p))
            c ← CHOOSE_NEXT_{k,p}(c, t_o, t_p, i)

CHOOSE_NEXT_{k,p}(c, t_o, t_p, i) returns t_p(c) if i > k and RAND() < p (error exploration), and t_o(c) otherwise.
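The exploration policy itself is simple to sketch; k and p are the exploration hyperparameters named above, and returning the chosen transition (rather than the next configuration) is a simplification of this sketch.

import random

def choose_next(t_o, t_p, epoch, k=1, p=0.9):
    # After the first k epochs, follow the model's own (possibly wrong)
    # prediction with probability p; otherwise stay on a zero-cost path.
    if epoch > k and random.random() < p:
        return t_p        # explore: continue from the predicted transition
    return t_o            # follow an optimal (zero-cost) transition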
The cost of a transition is the number of gold arcs that become unreachable by applying it:

- C(LEFT-ARC; c, T): adding the arc (b, s_0) and popping s_0 means that s_0 can no longer acquire a head in H = {s_1} ∪ β or dependents in D = {b} ∪ β; the cost is the number of arcs (s_0, d) and (h, s_0) in T with h ∈ H and d ∈ D.
- C(RIGHT-ARC; c, T): adding the arc (s_1, s_0) and popping s_0 means that s_0 can no longer acquire a head or dependents in B = {b} ∪ β; the cost is the number of arcs (s_0, d) and (h, s_0) in T with h, d ∈ B.
- C(SHIFT; c, T): pushing b onto the stack means that b can no longer acquire a head in H = {s_1} ∪ σ or dependents in D = {s_0, s_1} ∪ σ; the cost is the number of arcs (b, d) and (h, b) in T with h ∈ H and d ∈ D.

A transition is optimal (zero-cost) in a configuration c if every gold arc that is reachable from c remains reachable after applying it.
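For concreteness, the LEFT-ARC cost above can be sketched as follows; the configuration is a (stack, buffer, arcs) triple, the gold tree is given as head[d] and deps[h] dictionaries, and these variable names are choices made for this example.

def left_arc_cost(stack, buffer, head, deps):
    s0 = stack[-1]
    s1 = stack[-2] if len(stack) > 1 else None
    b, beta = buffer[0], buffer[1:]
    lost_heads = set(beta) | ({s1} if s1 is not None else set())   # H = {s1} ∪ β
    lost_deps = set(buffer)                                        # D = {b} ∪ β
    cost = sum(1 for d in deps.get(s0, []) if d in lost_deps)
    cost += 1 if head.get(s0) in lost_heads else 0
    return cost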
For the extended system, the oracle is static with respect to SWAP and dynamic with respect to the other transitions. PROJ(i) denotes the position of word i in the projective order of the sentence, h(i) its head in the gold tree T, and RDEPS(i) the set of its gold dependents that are still reachable; [[Φ]] is 1 if Φ is true and 0 otherwise. s_0 and s_1 are the two top items of the stack Σ, Σ−s_0 the stack excluding s_0, b the first item of the buffer B, and B−b the buffer excluding b. SWAP is used if and only if PROJ(s_0) > PROJ(b). The costs of the remaining transitions are:

- C(LEFT-ARC) = |RDEPS(s_0)| + [[h(s_0) ≠ b ∧ s_0 ∈ RDEPS(h(s_0))]]
  LEFT-ARC adds the arc (b, s_0) and pops s_0; we then set RDEPS(s_0) = [ ] and remove s_0 from RDEPS(h(s_0)).
- C(RIGHT-ARC) = |RDEPS(s_0)| + [[h(s_0) ≠ s_1 ∧ s_0 ∈ RDEPS(h(s_0))]]
  RIGHT-ARC adds the arc (s_1, s_0) and pops s_0; we then set RDEPS(s_0) = [ ] and remove s_0 from RDEPS(h(s_0)).
- C(SHIFT) = 0 if there is some i ∈ B−b such that b < i and PROJ(b) > PROJ(i);
  otherwise C(SHIFT) = |{d ∈ RDEPS(b) | d ∈ Σ}| + [[h(b) ∈ Σ−s_0 ∧ b ∈ RDEPS(h(b))]],
  and we then remove b from RDEPS(h(b)) if h(b) ∈ Σ−s_0 and remove every d ∈ Σ from RDEPS(b).
The input vector of each word concatenates the word embedding e(w_i), a pretrained embedding pe(w_i) and the character vector ce(w_i):

x_i = [e(w_i); pe(w_i); ce(w_i)]
[Figure: recursive (tree-structured) versus recurrent (sequential) composition of the phrase "The largest city in Minnesota".]
Each word w_i is represented by a learned embedding e(w_i), a pretrained embedding pe(w_i) and a tag embedding e(t_i), combined through a rectified linear layer:

x_i = max{0, W[e(w_i); pe(w_i); e(t_i)] + b}
[Figure: the stack LSTM parser. Separate LSTMs encode the Stack (here containing a partial analysis of "an overhasty decision was made" with ROOT and an nmod arc), the Buffer, and the history of Actions (e.g. SHIFT, REDUCE-LEFT(amod)); the TOP states of the three LSTMs feed a softmax over the possible next actions.]
When a head h and a dependent d are connected with relation r, a composed representation c is computed with a tanh layer over their vectors and the relation embedding:

c = tanh(W[h; d; r] + b)

For example, when "the" is attached to "fox" with a left det arc:

C_fox = tanh(W[C_fox; C_the; left-det] + b)
[Figure: the BiLSTM parser with recursive composition over "the brown fox jumped" with root. The encoded vectors C_the, C_brown, C_fox, C_jumped, C_root are updated by composition as dependents are attached, e.g. C_fox is updated when "the" is attached with a det arc.]
The vector v_i of word i concatenates its BiLSTM encoding with a composed vector c^t_i representing the subtree rooted in i after t attachments, initialised as the BiLSTM encoding itself:

v_i = [BiLSTM(x_1:n, i); c^t_i]
c^0_i = BiLSTM(x_1:n, i)

When a dependent d is attached to a head h (e.g. "the" to "fox" with a det arc) via relation r, the composed vector of the head is updated from its previous value, the vector of the dependent and the relation embedding. Two composition functions are used, a simple recurrent cell (+rc) and an LSTM cell (+lc):

+rc: c^t_i = tanh(W[c^{t−1}_i; d; r] + b)
+lc: c^t_i = LSTM_i([c^{t−1}_i; d; r])

The word input vectors concatenate word, tag and character-based embeddings:

x_i = [e(w_i); e(t_i); ce(w_i)]
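A sketch of the +rc update; the relation embedding table and dimensions are illustrative assumptions, and in the parser the composed vector of a head is updated every time one of its dependents is attached.

import numpy as np

d_vec, d_rel = 8, 4
W = np.random.randn(d_vec, 2 * d_vec + d_rel) * 0.1
b = np.zeros(d_vec)
rel_emb = {"det": np.random.randn(d_rel)}

def compose(c_head_prev, v_dep, relation):
    # New composed vector of the head after attaching one dependent.
    z = np.concatenate([c_head_prev, v_dep, rel_emb[relation]])
    return np.tanh(W @ z + b)

c_fox = np.random.randn(d_vec)            # c^{t-1}_fox
v_the = np.random.randn(d_vec)
c_fox = compose(c_fox, v_the, "det")      # update after attaching "the" to "fox"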
More generally, the vector v_i of word w_i is obtained by applying a feature extractor f(x, i) to the sequence of input vectors x. Three extractors are compared: a backward LSTM (bw), a forward LSTM (fw) and a BiLSTM (bi):

v_i = f(x, i)
bw(x, i) = LSTM(x_n:1, i)
fw(x, i) = LSTM(x_1:n, i)
bi(x, i) = [bw(x, i); fw(x, i)]

These are combined with the recursive composition functions +rc and +lc.
[Results omitted: tables comparing the bw, fw and bi feature extractors with and without recursive composition (+rc, +lc), with significant improvements and degradations marked (+/−, ∗ = p < .05, ∗∗ = p < .01); one reported correlation is R = −0.838, p < .01, while another does not reach significance (p > .05).]
An AVC consists of a main verb mv_i and a set of auxiliaries AUX_i. In UD, each auxiliary is attached to the main verb by an aux arc w_aux ←aux− w_mv, i.e. the main verb w_mv is the head and the auxiliary w_aux the dependent. More generally, an arc w_d ←aux− w_h relates an auxiliary dependent w_d to its head w_h; AVCs can therefore be identified by collecting, for each verb v_i in the set of verbs V, the auxiliaries attached to it by aux arcs.
[Figure 5.5. Finite main verb in a UD tree: "I did this", where the finite main verb (FMV) carries VerbForm=Fin and has nsubj and obj dependents.]

[Figure 5.6. Example sentence with an AVC ("I could easily have done this") annotated in UD (top), where the non-finite main verb (NFMV) heads the construction and the auxiliaries attach to it via aux, and in MS (bottom), where the outermost auxiliary (MAUX) is the head. AVC subtree in thick blue.]
therefore represent the word in the context of the sentence) to the corresponding vectors for verb types (the vectors at the input of the BiLSTM) in order to better understand what part of the representation is context-dependent. We finally compare the verb type vectors learned by the parser to verb type vectors learned with a language modelling objective. We expect the vectors learned by the parser to encode information about agreement and transitivity to a greater extent than the vectors learned using a language modelling objective. I explain this in more detail in Section 5.3.3.
Collecting FMVs such as in Figure 5.5 in UD treebanks is straightforward: verbs are annotated with a feature called VerbForm, which has the value Fin if the verb is finite. We find candidates using the feature VerbForm=Fin and only keep those that are not involved in a copula or auxiliary dependency relation, to make sure they are not part of a larger verbal construction.
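As an illustration (not the thesis code), this filtering could be sketched over CoNLL-U files with the third-party conllu library; excluding tokens involved in aux or cop relations in either direction is this sketch's reading of the description above, and the exact filtering used in the thesis may differ.

from conllu import parse  # third-party: pip install conllu

def collect_fmvs(conllu_text):
    fmvs = []
    for sentence in parse(conllu_text):
        for token in sentence:
            feats = token["feats"] or {}
            if feats.get("VerbForm") != "Fin":
                continue
            if (token["deprel"] or "").split(":")[0] in ("aux", "cop"):
                continue                  # the candidate is itself an auxiliary or copula
            has_verbal_dependent = any(
                t["head"] == token["id"] and (t["deprel"] or "").split(":")[0] in ("aux", "cop")
                for t in sentence)
            if not has_verbal_dependent:  # not part of a larger verbal construction
                fmvs.append((sentence.metadata.get("sent_id"), token["form"]))
    return fmvs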
We can collect AVCs in the UD and the MS training data using the first part of the transformation and back-transformation algorithms, respectively, which I defined in Section 5.2.1. We scan the sentence left to right, looking for auxiliary dependency relations and collecting information about which word is the outermost auxiliary and which is the main verb. Once we have our sets of FMVs and AVCs, we can create our task data sets.
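For the UD case, the left-to-right scan might look like the following sketch (again assuming the conllu library); taking the first auxiliary found as the outermost one is a simplification of this sketch, not necessarily the thesis procedure.

from conllu import parse  # third-party: pip install conllu

def collect_avcs_ud(conllu_text):
    avcs = []
    for sentence in parse(conllu_text):
        by_main_verb = {}
        for token in sentence:                          # left-to-right scan
            if (token["deprel"] or "").split(":")[0] == "aux":
                by_main_verb.setdefault(token["head"], []).append(token["id"])
        for mv, auxes in by_main_verb.items():
            avcs.append({"main_verb": mv,
                         "auxiliaries": auxes,
                         "outermost": auxes[0]})        # first aux found left to right
    return avcs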
The token vector v_i of word i is its BiLSTM encoding, with an input vector that concatenates the word embedding and the character-based vector:

v_i = BiLSTM(x_1:n, i)
x_i = [e(w_i); ce(w_i)]
It seems interesting to find out whether or not a composition function can be useful to model the specific case of transfer relations such as those found in AVCs. For this reason, we also investigate the composed representation of AVCs.
We are therefore interested in finding out whether or not an LSTM trained with a parsing objective can learn the notion of dissociated nucleus, as well as whether or not a recursive composition function can help to learn this notion. As just mentioned, the head of an AVC in UD is a non-finite main verb, which we will refer to as NFMV, as depicted in Figure 5.6. The head of an AVC in MS is the outermost auxiliary, which we will refer to as the main auxiliary MAUX, as also depicted in that figure. We therefore look at NFMV and MAUX token vectors for the respective representation schemes and consider two definitions of these: one where we use the BiLSTM encoding of the main verb token v_i, and one where we construct a subtree vector c^t_i by recursively composing the representation of AVCs as auxiliaries get attached to their main verb, as in Equation 4.5, repeated in Equation 5.3. When training the parser, we concatenate this composed vector to a vector of the head of the subtree to form v_i. In Chapter 4, we used two different composition functions, one using a simple recurrent cell and one using an LSTM cell. We saw that the one using an LSTM cell performed better. However, in this set of experiments, we only do recursive composition over a limited part of the subtree: only between auxiliaries and NFMVs. This means that the LSTM would only pass through two states in most cases, and never more than 4.[8] This does not allow us to learn proper weights for the input, output and forget gates. An RNN cell seems more appropriate here, and we only use that.
c^t_i = tanh(W[c^{t−1}_i; d; r] + b)    (5.3)
[Figure 5.7. Example AVC ("I have done that") with vectors of interest: character-based vectors (char), type vectors at the input of the BiLSTM (type), and token vectors at the output of the BiLSTM (tok) for the NFMV ("done") and the MAUX ("have").]
[8] The maximum number of auxiliaries in one AVC in our dataset is 3.
[Figure: example FMV ("I did that !") with vectors of interest: character-based (char), type and token (tok) vectors for the finite main verb "did", which also has a punct dependent.]
[Results omitted: comparisons of the baseline (bas), recursive composition (rc) and +c models on FMV, NFMV and MAUX vectors (examples "did", "have done"), with differences (δ) marked for significance (∗ = p < .05, ∗∗ = p < .01).]
The embedding le is obtained by passing the vector l through a tanh layer:

le = tanh(W l + b)
A treebank embedding te(w_i) is concatenated to the input vector of each word w_i:

x_i = [pe(w_i); ce(w_i); e(t_i); te(w_i)]
[Table legend: values are binned relative to the mean μ (below μ−σ, below μ−SE, below μ, below μ+SE, below μ+σ), where σ is the standard deviation and SE = σ/√N the standard error over N observations; results are reported as μ ± SE or μ ± σ.]
[Figure: the BiLSTM parser architecture over "the brown fox jumped" with root, extended with the SWAP transition: the MLP output is (score(LEFT-ARC), score(RIGHT-ARC), score(SHIFT), score(SWAP)).]
[Figure: the three levels at which a treebank embedding can be added: the character level (the character BiLSTM C_f, C_b over, e.g., "the"), the word level, and the state (configuration) level.]
The treebank embedding te(w_i) of word w_i can be added at each of these levels:

word level:      x_i = [e(w_i); ce(w_i); te(w_i)]
character level: ch_j = [e(ch_j); te(w_i)]
state level:     φ(c) = [v_s1; v_s0; v_b0; te(s1, s0, b0)]
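A sketch of the word-level variant: every word drawn from a given treebank gets the same learned vector appended to its input representation. The table of treebank IDs and the dimensions below are illustrative assumptions, not thesis settings.

import numpy as np

d_word, d_char, d_tb = 16, 8, 4
treebank_emb = {"en_ewt": np.random.randn(d_tb), "en_gum": np.random.randn(d_tb)}

def word_input(e_w, ce_w, treebank_id):
    # Concatenate word embedding, character vector and the treebank embedding.
    return np.concatenate([e_w, ce_w, treebank_emb[treebank_id]])

x_i = word_input(np.random.randn(d_word), np.random.randn(d_char), "en_ewt")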
Without and with the treebank embedding tbe(w_i), the input vectors are:

x_i = [pe(w_i); ce(w_i); e(t_i)]
x_i = [pe(w_i); ce(w_i); e(t_i); tbe(w_i)]
ACTA UNIVERSITATIS UPSALIENSIS
Studia Linguistica Upsaliensia
Editors: Michael Dunn and Joakim Nivre

1. Jörg Tiedemann, Recycling translations. Extraction of lexical data from parallel corpora and their application in natural language processing. 2003.
2. Agnes Edling, Abstraction and authority in textbooks. The textual paths towards specialized language. 2006.
3. Åsa af Geijerstam, Att skriva i naturorienterande ämnen i skolan. 2006.
4. Gustav Öquist, Evaluating Readability on Mobile Devices. 2006.
5. Jenny Wiksten Folkeryd, Writing with an Attitude. Appraisal and student texts in the school subject of Swedish. 2006.
6. Ingrid Björk, Relativizing linguistic relativity. Investigating underlying assumptions about language in the neo-Whorfian literature. 2008.
7. Joakim Nivre, Mats Dahllöf and Beáta Megyesi, Resourceful Language Technology. Festschrift in Honor of Anna Sågvall Hein. 2008.
8. Anju Saxena & Åke Viberg, Multilingualism. Proceedings of the 23rd Scandinavian Conference of Linguistics. 2009.
9. Markus Saers, Translation as Linear Transduction. Models and Algorithms for Efficient Learning in Statistical Machine Translation. 2011.
10. Ulrika Serrander, Bilingual lexical processing in single word production. Swedish learners of Spanish and the effects of L2 immersion. 2011.
11. Mattias Nilsson, Computational Models of Eye Movements in Reading: A Data-Driven Approach to the Eye-Mind Link. 2012.
12. Luying Wang, Second Language Acquisition of Mandarin Aspect Markers by Native Swedish Adults. 2012.
13. Farideh Okati, The Vowel Systems of Five Iranian Balochi Dialects. 2012.
14. Oscar Täckström, Predicting Linguistic Structure with Incomplete and Cross-Lingual Supervision. 2013.
15. Christian Hardmeier, Discourse in Statistical Machine Translation. 2014.
16. Mojgan Seraji, Morphosyntactic Corpora and Tools for Persian. 2015.
17. Eva Pettersson, Spelling Normalisation and Linguistic Analysis of Historical Text for Information Extraction. 2016.
18. Marie Dubremetz, Detecting Rhetorical Figures Based on Repetition of Words: Chiasmus, Epanaphora, Epiphora. 2017.
19. Josefin Lindgren, Developing narrative competence: Swedish, Swedish-German and Swedish-Turkish children aged 4–6. 2018.
20. Vera Wilhelmsen, A Linguistic Description of Mbugwe with Focus on Tone and Verbal Morphology. 2018.
21. Yan Shao, Segmenting and Tagging Text with Neural Networks. 2018.
22. Ali Basirat, Principal Word Vectors. 2018.
23. Marc Tang, A typology of classifiers and gender. 2018.
24. Miryam de Lhoneux, Linguistically Informed Neural Dependency Parsing for Typologically Diverse Languages. 2019.