
Parameter Space Factorization for Zero-Shot Learning across Tasks and Languages

Edoardo M. Ponti1, Ivan Vulić1, Ryan Cotterell2, Marinela Parović1, Roi Reichart3, Anna Korhonen1

1Language Technology Lab, TAL, University of Cambridge
2Computer Laboratory, University of Cambridge

3Faculty of Industrial Engineering and Management, Technion, IIT
1,2{ep490,iv250,mp939,rdc42,alk23}@cam.ac.uk

[email protected]

Abstract

Most combinations of NLP tasks and language varieties lack in-domain examples for supervised training because of the paucity of annotated data. How can neural models make sample-efficient generalizations from task–language combinations with available data to low-resource ones? In this work, we propose a Bayesian generative model for the space of neural parameters. We assume that this space can be factorized into latent variables for each language and each task. We infer the posteriors over such latent variables based on data from seen task–language combinations through variational inference. This enables zero-shot classification on unseen combinations at prediction time. For instance, given training data for named entity recognition (NER) in Vietnamese and for part-of-speech (POS) tagging in Wolof, our model can perform accurate predictions for NER in Wolof. In particular, we experiment with a typologically diverse sample of 33 languages from 4 continents and 11 families, and show that our model yields comparable or better results than state-of-the-art, zero-shot cross-lingual transfer methods; it increases performance by 4.49 points for POS tagging and 7.73 points for NER on average compared to the strongest baseline.

1 Introduction

Transfer learning is a toolbox to extract knowledge from a source domain and perform sample-efficient generalizations in a target domain (Yogatama et al., 2019; Talmor and Berant, 2019). In practice, this approach holds promise to mitigate the data scarcity issue which is inherent to a large spectrum of NLP applications (Täckström et al., 2012; Agić et al., 2016; Ammar et al., 2016; Ponti et al., 2018; Ziser and Reichart, 2018, inter alia). In the most extreme scenario, zero-shot learning, no annotated examples are available for the target domain. For instance, zero-shot transfer across languages leverages information from resource-rich languages to tackle the same task in a previously unseen target language (Lin et al., 2019; Rijhwani et al., 2019; Artetxe and Schwenk, 2019; Ponti et al., 2019a, inter alia). Zero-shot transfer across tasks within the same language, on the other hand, verges on the impossible, because no information is provided for the target classes or scalars (in the case of regression). Hence, cross-task transfer has mostly focused on the few-shot or joint learning settings (Ruder et al., 2019).

In this work, we show how the neural parameters for a particular unseen task–language combination can be approximated by zero-shot transferring knowledge both from related tasks and from related languages, i.e., from seen combinations. For instance, the availability of a model for part-of-speech (POS) tagging in Wolof and of a named-entity recognition (NER) model in Vietnamese supplies plenty of information which should be effectively harnessed to estimate the parameters of a Wolof NER model.

As our main contribution, we introduce a generative model of a neural parameter space, which is factorized into latent variables1 for each language and each task. All possible task–language combinations give rise to a task × language × parameter tensor. While some entries could be populated through supervised learning, completing the empty portion of such a tensor is less straightforward than standard matrix completion for collaborative filtering (Mnih and Salakhutdinov, 2008; Dziugaite and Roy, 2015), as the parameters are never observed. Rather, in our approach the interaction of the latent variables determines the parameters, which in turn determine the data likelihood.

1By latent variable we mean every variable that has to be inferred from observed (directly measurable) variables. To avoid confusion, we use the terms seen and unseen when referring to different task–language combinations.


[Figure 1, plate notation: the task variable ti (plate over n tasks) and the language variable lj (plate over m languages) generate the parameters θij, which together with the observed input xijk generate the label yijk (plate over K examples).]

Figure 1: A graph (plate notation) of the generative model based on parameter space factorization. Shaded circles refer to observed variables.

We adopt a Bayesian perspective towards inference on the proposed generative model. The posterior distribution over latent variables is approximated through stochastic variational inference (VI), based on the mean-field assumption (Hoffman et al., 2013). We choose multivariate Gaussians as a variational family for the latent variables. Given the enormous number of parameters, however, a full co-variance matrix cannot be stored in memory. Hence, we explore a factor co-variance structure that is expressive while remaining tractable.

We evaluate the model on two related sequence tagging tasks: POS and NER, relying on a typologically representative sample of 33 languages from 4 continents and 11 families. The results clearly indicate that the proposed neural parameter space factorization method surpasses standard baselines based on cross-lingual transfer 1) from the (typologically) nearest source language; and 2) from the source language with the most abundant in-domain data (usually English). The average gains over the strongest baseline are 4.49 points for POS tagging and 7.73 points for NER. The code is available at: github.com/cambridgeltl/parameter-factorization.

2 A Bayesian Generative Model of Parameter Space

The annotation efforts in NLP have achieved impressive feats, such as the Universal Dependencies (UD) project (Nivre et al., 2019), which now covers over 70 languages. Yet, these account for only a meagre subset of the world's 8,506 languages according to Glottolog (Hammarström et al., 2016). At the same time, the ACL Wiki2 lists 24 separate language-related tasks. The lack of labelled data, which is costly and labor-intensive to produce, for many languages in such tasks hinders the development of computational models for the majority of the world's languages (Snyder and Barzilay, 2010; Ponti et al., 2019a).

In this work, we propose a Bayesian generative model of multi-task, multi-lingual NLP. We train one Bayesian neural network for several tasks and languages jointly. The core modeling assumption is that the parameter space of the neural network is structured, that is, that certain parameters correspond to certain tasks and others correspond to certain languages. This structure allows us to generalize to unseen task–language pairs. The model, which is reminiscent of matrix factorization for collaborative filtering (Mnih and Salakhutdinov, 2008; Dziugaite and Roy, 2015), is presented in Figure 1.

Formally, we consider a set of n tasks T = {t_1, ..., t_n} and a set of m languages L = {l_1, ..., l_m}. The variational family of each task variable and language variable is a multivariate Gaussian with mean µ ∈ R^h and diagonal co-variance σ²I ∈ R^{h×h}, where h is the dimensionality of the Gaussian. Consequently, t_i ∼ N(µ_{t_i}, σ²_{t_i} I) and l_j ∼ N(µ_{l_j}, σ²_{l_j} I), respectively.

The space of parameters for all tasks and languages forms a tensor Θ ∈ R^{n×m×d}, where d is the number of parameters of the largest model.3

We denote with θ_ij ∈ R^d the parameters of the model for the ith language and the jth task. These parameters are also considered to be sampled from a multivariate Gaussian, whose mean µ ∈ R^d and diagonal co-variance σ²I ∈ R^{d×d} are functions of the corresponding task and language latent variables, i.e. f_ψ(t_i, l_j) and f_φ(t_i, l_j), respectively. Hence θ_ij ∼ N(f_ψ(·), f_φ(·)).

The likelihood of the classes y_k for the kth sentence x_k is equivalent to p(· | x_k, θ_tl). In general, we will only possess data for a subset S of the Cartesian product of all tasks and languages T × L = S ∪ U, and not for the unseen pairs U. However, as we estimate all task–language parameter vectors θ_ij jointly, our model allows us to perform inference over the parameters for combinations in U as well.

2aclweb.org/aclwiki/State_of_the_art

3Different tasks might involve different numbers of classes, and as a consequence the number of parameters varies. The extra dimensions not needed for a task can be considered as padded with zeros.


Intuitively, if data for NER in Basque and POS in Kazakh is provided, our model assigns posterior distributions over 2 language variables (Basque and Kazakh) and 2 task variables (NER and POS) based on such data. Afterwards, they can be recombined to make predictions for NER in Kazakh and POS in Basque. A summary of the generative story of how we hypothesize the data 'came into being' is offered in Algorithm 1.

Algorithm 1 Generative model of the data.
for t_i ∈ T:
    t_i ∼ N(µ_i, Σ_i)
for l_j ∈ L:
    l_j ∼ N(µ_j, Σ_j)
for t_i ∈ T:
    for l_j ∈ L:
        µ_{θ_ij} = f_ψ(t_i, l_j)
        Σ_{θ_ij} = f_φ(t_i, l_j)
        θ_ij ∼ N(µ_{θ_ij}, Σ_{θ_ij})
        for x_k ∈ X_ij:
            y_k ∼ p(· | x_k, θ_ij)
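As a concrete illustration, the generative story can be simulated end-to-end with a short NumPy sketch; the dimensionalities and the stand-in functions f_psi and f_phi below are illustrative placeholders, not the architectures used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
h, d = 100, 6921          # latent dimensionality; classifier size (illustrative)
n_tasks, n_langs = 2, 33

def f_psi(t, l):          # stand-in for the network generating the mean of theta
    return np.zeros(d)

def f_phi(t, l):          # stand-in for the network generating the diagonal variance
    return np.ones(d)

# 1. one latent variable per task and per language
tasks = [rng.normal(size=h) for _ in range(n_tasks)]
langs = [rng.normal(size=h) for _ in range(n_langs)]

# 2. classifier parameters for every task-language combination
theta = {}
for i, t in enumerate(tasks):
    for j, l in enumerate(langs):
        theta[i, j] = rng.normal(f_psi(t, l), np.sqrt(f_phi(t, l)))

# 3. finally, the labels y_k of every sentence x_k in X_ij would be drawn
#    from the likelihood p(y | x_k, theta[i, j])
```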

3 Inference and Prediction

In order to perform inference on the generative model outlined in Section 2, we take a Bayesian perspective. This enables smooth estimates of the posteriors for typically under-specified models such as neural networks (Garipov et al., 2018). In particular, we resort to stochastic variational inference based on batch-level optimization (Hoffman et al., 2013). In Section 3.1 we derive the Evidence Lower Bound (ELBO) for VI on the generative model of Figure 1. Then, in Section 3.2, we detail how we implement such a model through a neural network. Finally, in Section 3.3, we explore several co-variance structures for the Gaussian distributions of the latent variables, including a diagonal matrix and a low-rank factor matrix.

3.1 ELBO Derivation

In order to perform inference through the hierarchical Bayesian model described in Algorithm 1, we need to estimate the joint posterior over the latent variable sets θ, t, and l. The posterior given the observed data x is shown in Equation (1), which factorizes according to the independence assumption Y ⊥⊥ (T, L) | Θ ingrained in the graphical model of Figure 1:

$$p(\theta, t, l \mid x) = \frac{p(x \mid \theta)\, p(\theta \mid t, l)\, p(t)\, p(l)}{p(x)}. \tag{1}$$

Unfortunately, the term p(x) in the denominator of Equation (1) is intractable. Therefore, by integrating out the latent variables, we derive the lower bound for the log-probability of x in Equation (2). A way to interpret the ELBO is that we should minimize the variational gap, which is the KL-divergence between the true joint posterior and the approximate joint posterior. To see this, we define the following:

$$q_\lambda = \mathcal{N}(\mu_t, \sigma_t^2 I), \qquad q_\nu = \mathcal{N}(\mu_l, \sigma_l^2 I), \qquad q_\xi = \mathcal{N}(f_\psi(t, l), f_\phi(t, l))$$

Note that Equation (2) contains a log-likelihood term that needs to be approximated through gradient descent, and three KL-divergence terms that have analytical solutions, provided that the reference distribution is a multivariate Gaussian p(·) = N(0, I):

$$\mathrm{KL}(q \,\|\, p) = \frac{1}{2}\left[\sum_{i=1}^{d}\left(\mu_i^2 + \sigma_i^2\right) - d - \sum_{i=1}^{d}\ln\sigma_i^2\right]$$
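A minimal PyTorch sketch of this closed form, checked against the factorized KL provided by torch.distributions (the tensor shapes here are arbitrary):

```python
import torch
from torch.distributions import Normal, kl_divergence

def kl_diag_gaussian_to_std_normal(mu, sigma):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), used for the task, language,
    # and parameter terms of the ELBO
    return 0.5 * ((mu ** 2 + sigma ** 2).sum() - mu.numel() - (sigma ** 2).log().sum())

mu, sigma = torch.randn(100), torch.rand(100) + 0.1
ours = kl_diag_gaussian_to_std_normal(mu, sigma)
ref = kl_divergence(Normal(mu, sigma), Normal(torch.zeros(100), torch.ones(100))).sum()
print(torch.allclose(ours, ref, atol=1e-4))  # True
```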

3.2 Neural Model

Given the recent success of approaches for zero-shot cross-lingual transfer such as multilingual BERT (i.e., M-BERT; Pires et al., 2019), MULTIFIT (Eisenschlos et al., 2019), and XLM (Lample and Conneau, 2019), we adopt a similar neural network architecture with a classifier stacked on top of an encoder. In particular, the encoder consists of a multi-layer Transformer (Vaswani et al., 2017) whose parameters are initialized with a model pre-trained for masked language modeling and next sentence prediction on multiple languages. In this work, we treat such an encoder as a black-box function xBERT ∈ R^e = BERT(x) that encodes tokens into multi-lingual contextualized representations.4

On the other hand, the classifier is an affine layer. In other words, the probability over classes y is computed as y = softmax(W·xBERT + b). Crucially, parameter factorization takes place only on the space of parameters of the task–language-specific classifiers θ_ij = {W_ij, b_ij}, whereas the parameters of the encoder θBERT are shared across all task–language combinations and fine-tuned through maximum-likelihood optimization during training.

4The full details of its implementation can be found in Devlin et al. (2019).


$$
\begin{aligned}
\log p(x) &= \log \Big( \iiint p(x, \theta, t, l)\, d\theta\, dt\, dl \Big) \\
&= \log \Big( \iiint p(x \mid \theta)\, p(\theta \mid t, l)\, p(t)\, p(l)\, d\theta\, dt\, dl \Big) \\
&= \log \Big( \iiint \frac{q_\lambda(t)\, q_\nu(l)\, q_\xi(\theta)}{q_\lambda(t)\, q_\nu(l)\, q_\xi(\theta)}\, p(x \mid \theta)\, p(\theta \mid t, l)\, p(t)\, p(l)\, d\theta\, dt\, dl \Big) \\
&= \log \Big( \mathbb{E}_{t \sim q_\lambda} \mathbb{E}_{l \sim q_\nu} \mathbb{E}_{\theta \sim q_\xi} \frac{1}{q_\lambda(t)\, q_\nu(l)\, q_\xi(\theta)}\, p(x \mid \theta)\, p(\theta \mid t, l)\, p(t)\, p(l) \Big) \\
&\geq \mathbb{E}_{t \sim q_\lambda} \mathbb{E}_{l \sim q_\nu} \mathbb{E}_{\theta \sim q_\xi} \Big[ \log \frac{p(x \mid \theta)\, p(\theta \mid t, l)\, p(t)\, p(l)}{q_\lambda(t)\, q_\nu(l)\, q_\xi(\theta)} \Big] \\
&= \mathbb{E}_{t \sim q_\lambda} \mathbb{E}_{l \sim q_\nu} \Big[ \mathbb{E}_{\theta \sim q_\xi} \Big[ \log p(x \mid \theta) + \log \frac{p(\theta \mid t, l)}{q_\xi(\theta)} \Big] + \log \frac{p(t)}{q_\lambda(t)} + \log \frac{p(l)}{q_\nu(l)} \Big] \\
&= \underbrace{\mathbb{E}_{\theta \sim q_\xi} \log p(x \mid \theta)}_{\text{requires approximation}} \underbrace{-\, \mathrm{KL}(q_\lambda(t) \,\|\, p(t)) - \mathrm{KL}(q_\nu(l) \,\|\, p(l)) - \mathrm{KL}(q_\xi(\theta) \,\|\, p(\theta \mid t, l))}_{\text{closed form solution}}
\end{aligned}
\tag{2}
$$

Throughout this paper, by parameters we refer to those of the classifier, unless otherwise indicated.

In each training iteration, we randomly sample a task t_i ∈ T and a language l_j ∈ L among seen combinations,5 and randomly select a batch of examples from the dataset of such a combination. Based on the generative model of Figure 1, the variational family of each task and language is a multivariate Gaussian. In order to allow the gradient to flow through, we generate samples for the latent variables through the re-parametrization trick (Blundell et al., 2015): t_i = µ_{t_i} + σ_{t_i} ⊙ ε and l_j = µ_{l_j} + σ_{l_j} ⊙ ε, where ε ∼ N(0, I) and ⊙ denotes element-wise multiplication. The co-variance matrix must be non-negative; hence, we obtain the diagonal standard deviation σ as ln(1 + exp(ρ)). We place a prior N(0, S_t) on each task and a prior N(0, S_l) on each language, where S_t and S_l are hyper-parameters.
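A minimal sketch of this sampling step in PyTorch, assuming a single task variable with variational parameters µ_t and ρ_t (the initialization values here are arbitrary):

```python
import torch
import torch.nn.functional as F

h = 100                                              # latent dimensionality
mu_t = torch.zeros(h, requires_grad=True)            # variational mean
rho_t = torch.full((h,), 0.25, requires_grad=True)   # unconstrained scale parameter

def sample_latent(mu, rho):
    sigma = F.softplus(rho)        # sigma = ln(1 + exp(rho)) is always positive
    eps = torch.randn_like(mu)     # eps ~ N(0, I)
    return mu + sigma * eps        # re-parametrized sample: gradients reach mu and rho

t_i = sample_latent(mu_t, rho_t)
```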

The mean µ_{θ_ij} and (diagonal) log-variance ρ_{θ_ij} for the parameters θ_ij for t_i and l_j are generated through a pair of deep feed-forward neural networks f_ψ : R^h → R^d and f_φ : R^h → R^d parametrized by ψ and φ, respectively, similarly to Kingma and Welling (2014). The networks f_ψ and f_φ take as input features based on the task and language samples, namely {t ⊕ l ⊕ t − l ⊕ t ⊙ l}, where ⊕ stands for concatenation. As mentioned before, θ_ij is a concatenation of a weight W_ij ∈ R^{e×c_i} and a bias b_ij ∈ R^{c_i}. Hence the number of parameters d = e × c_i + c_i depends on the dimensionality of the contextualized token embeddings e and the number of task-specific classes c_i. We place a Gaussian prior on the parameters N(0, S_{θ_ij}). S_θ, as well as the number of hidden layers and the hidden size of f_ψ and f_φ, are hyper-parameters. We tie the parameters ψ and φ for all layers save the last for faster training.

5As an alternative, we experimented with a setup where sampling probabilities are proportional to the number of examples of each task–language combination, but this achieves lower results on the development sets.

Finally, the parameters θ_ij are sampled from N(µ_{θ_ij}, ln(1 + exp(ρ_{θ_ij}))), again through the re-parametrization trick. These are used as the parameters of the affine classifier layer to generate a distribution over classes y for every token xBERT. During training, we optimize the following parameters of the network: {µ_t, ρ_t, µ_l, ρ_l, ψ, φ}. We perform zero-shot predictions on examples from unseen task–language pairs by plugging in the mean of the latent variable estimates.6
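The following PyTorch sketch shows one possible shape for this component, with the parameter tying approximated by a shared trunk and two output heads; the layer sizes mirror the hyper-parameters reported in Section 4, but the module and variable names are our own illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

h, e, c = 100, 768, 9          # latent dim., encoder dim., task-specific classes (e.g. NER)
d = e * c + c                  # flattened classifier parameters: weight plus bias

def pair_features(t, l):       # t ⊕ l ⊕ (t − l) ⊕ (t ⊙ l)
    return torch.cat([t, l, t - l, t * l], dim=-1)

# shared ("tied") trunk with two heads producing mu_theta and rho_theta
trunk = nn.Sequential(nn.Linear(4 * h, 400), nn.ReLU(), nn.Linear(400, 768), nn.ReLU())
head_mu, head_rho = nn.Linear(768, d), nn.Linear(768, d)

def classifier_params(t, l):
    z = trunk(pair_features(t, l))
    mu, rho = head_mu(z), head_rho(z)
    theta = mu + F.softplus(rho) * torch.randn_like(mu)   # re-parametrized sample of theta_ij
    W, b = theta[: e * c].view(c, e), theta[e * c:]
    return W, b

t_i, l_j = torch.randn(h), torch.randn(h)      # samples of the task and language variables
W, b = classifier_params(t_i, l_j)
x_bert = torch.randn(e)                        # a contextualized token representation
probs = F.softmax(W @ x_bert + b, dim=-1)      # class distribution for this token
```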

3.3 Low-rank Co-variance

The co-variance matrix of latent variables is often taken to be diagonal in variational approximations (Kingma and Welling, 2014; Blundell et al., 2015), in order to make computations in the model feasible.

6As an alternative for prediction, we experimented with model averaging over 100 samples from the posteriors, but this approach obtained lower performance on the dev sets.


In fact, a full-rank co-variance matrix Σ would: i) be too massive to store in memory; and ii) require O(h²) time to sample from the distribution it defines, where h is the dimensionality of the Gaussian. In contrast, a diagonal co-variance matrix makes computation feasible with a complexity of O(h); this, however, comes at the cost of not letting parameters influence each other, and thus failing to capture their complex interactions.

To avoid these sub-optimal solutions, we turn to a factored co-variance matrix, in the spirit of Miller et al. (2017) and Ong et al. (2018). Such a factored structure offers a balanced solution, being parsimonious in memory and time complexity while having non-zero off-diagonal entries. In particular, we factorize our co-variance matrices Σ_{t_i} and Σ_{l_j} as:

$$\Sigma = \sigma^2 I + BB^\top \tag{3}$$

which is the product of a matrix B ∈ R^{h×k} of rank k with its own transpose, plus the diagonal entries σ². The complexity of sampling from a multivariate normal distribution with such a co-variance matrix is O(kh), which is tractable for suitably low k as it does not require calculating the full matrix explicitly. In particular, through the re-parametrization trick, a sample takes the form:

$$\mu + \sigma \odot \epsilon + B\zeta \tag{4}$$

where ε ∈ R^h, ζ ∈ R^k, and both are sampled from N(0, I). Moreover, the KL-divergence computation in the ELBO approximation of Equation (2) can be estimated analytically without explicitly calculating the low-rank co-variance matrix, provided that the prior is p(·) = N(0, I):

$$\mathrm{KL}(q \,\|\, p) = \frac{1}{2}\left[\sum_{i=1}^{h}\Big(\mu_i^2 + \sigma_i^2 + \sum_{j=1}^{k} B_{ij}^2\Big) - h - \ln\det(\Sigma)\right] \tag{5}$$

The last term can be estimated without computing the full matrix explicitly thanks to the generalization of the matrix determinant lemma,7 which, applied to the factored co-variance structure, yields:

$$\ln\det(\Sigma) = \ln\left[\det\Big(I_k + B^\top \tfrac{1}{\sigma^2} I\, B\Big)\right] + \sum_{i=1}^{h}\ln(\sigma_i^2) \tag{6}$$

7det(A + UV^⊤) = det(I + V^⊤A^{−1}U) · det(A). Note that the lemma assumes that A is invertible.

where I_k ∈ R^{k×k} is the identity matrix.
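Under the assumption of a standard Gaussian prior, Equations (4)–(6) can be exercised with a few lines of PyTorch; the dimensions below are illustrative, and the explicit covariance matrix is built only to check the low-rank log-determinant:

```python
import torch

h, k = 100, 10
mu, sigma = torch.randn(h), torch.rand(h) + 0.1   # mean and diagonal standard deviations
B = 0.1 * torch.randn(h, k)                       # low-rank factor: Sigma = diag(sigma^2) + B B^T

# Eq. (4): sampling in O(kh) without materializing Sigma
eps, zeta = torch.randn(h), torch.randn(k)
sample = mu + sigma * eps + B @ zeta

# Eq. (6): log-determinant via the matrix determinant lemma
def logdet_lowrank(sigma, B):
    inner = torch.eye(B.shape[1]) + B.t() @ (B / (sigma ** 2).unsqueeze(1))
    return torch.logdet(inner) + (sigma ** 2).log().sum()

# Eq. (5): KL( N(mu, Sigma) || N(0, I) )
kl = 0.5 * ((mu ** 2).sum() + (sigma ** 2).sum() + (B ** 2).sum()
            - h - logdet_lowrank(sigma, B))

# sanity check against the explicitly constructed covariance matrix
Sigma = torch.diag(sigma ** 2) + B @ B.t()
print(torch.allclose(logdet_lowrank(sigma, B), torch.logdet(Sigma), atol=1e-3))  # True
```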

4 Experimental Setup

Data. We select NER and POS tagging as our experimental tasks because their datasets encompass an ample and diverse sample of languages. In particular, we opt for WikiANN (Pan et al., 2017) for the NER task and Universal Dependencies 2.4 (UD; Nivre et al., 2019) for POS tagging. Our sample of languages results from the intersection of those available in WikiANN and UD. However, this sample is heavily biased towards the Indo-European family (Gerz et al., 2018). Instead, the selection should be: i) typologically diverse, to ensure that our model generalizes well; and ii) focused on low-resource languages, to recreate a realistic setting. Hence, we further filter the languages in order to make the sample more balanced. In particular, we exclude all resource-rich Indo-European languages.

Our final sample comprises 33 languages from 4 continents (17 from Asia, 11 from Europe, 4 from Africa, and 1 from South America) and from 11 families (6 Uralic, 5 Afroasiatic, 5 Indo-European, 3 Niger-Congo, 3 Turkic, 2 Austroasiatic, 2 Austronesian, 2 Dravidian, 1 Kra-Dai, 1 Tupian, 1 Sino-Tibetan), as well as 2 isolates. The full list of language ISO 639-2 codes is reported in Figure 2.

In order to simulate a zero-shot setting, we hold out half of all possible task–language combinations in 2 distinct runs and regard them as unseen, while treating the others as seen combinations. The partition is performed in such a way that a held-out combination has data available for the same task in a different language, and for the same language in a different task.8
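As a toy illustration of this controlled partition, the sketch below holds out POS for one half of a small set of language codes and NER for the other half, which guarantees both conditions by construction; the language subset and the random seed are arbitrary stand-ins for the full 33-language sample.

```python
import itertools
import random

tasks = ["POS", "NER"]
langs = ["wo", "vi", "ta", "te", "yo", "kk"]      # illustrative subset of the sample

random.seed(0)
random.shuffle(langs)
half = len(langs) // 2
unseen = {("POS", l) for l in langs[:half]} | {("NER", l) for l in langs[half:]}
seen = set(itertools.product(tasks, langs)) - unseen

# every held-out combination has the same task in another language
# and the same language in the other task among the seen combinations
for task, lang in unseen:
    assert any(t == task for t, _ in seen)
    assert any(l == lang for _, l in seen)
```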

We randomly split the WikiANN datasets into training, development, and test portions with a proportion of 80–10–10. We use the provided splits for Universal Dependencies; if the training set for a language is missing, we treat the test set as such when the language is held out, and as a training set when it is among the seen combinations.9

8We use the controlled partitioning for the following reason: if a language lacks tensor entries both for NER and for POS, the proposed factorization method cannot produce estimates for its posterior. We leave model extensions that can handle such cases for future work.

9Note that, in the second case, no evaluation takes place on such a language.

Hyper-parameters. The multilingual M-BERT encoder is initialized with parameters pre-trained on masked language modeling and next sentence prediction on 104 languages (Devlin et al., 2019).10

We opt for the cased BERT-BASE architecture, which consists of 12 layers with 12 attention heads and a hidden size of 768. As a consequence, this is also the dimension e of each encoded WordPiece unit.11 The dimension h of the multivariate Gaussian for task and language latent variables is set to 100. The deep feed-forward networks f_ψ and f_φ have 6 layers with a hidden size of 400 for the first layer and 768 for the internal layers.

The expectations in Equation (2) are approximated through Monte Carlo estimation with 3 samples per batch during training. The KL terms are weighted with 1/|B| uniformly across training, where |B| is the number of mini-batches.12 All the µ parameters are initialized with a random sample from N(0, 0.1), whereas ρ and B are initialized with U(0, 0.5), similarly to Stolee and Patterson (2019). We place a prior of N(0, I) over t, l, and θ. The factor co-variance matrix has a rank k = 10 to fit in memory.
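A sketch of the resulting per-batch training objective, assuming the log-likelihood samples and the three KL terms have already been computed elsewhere (the stand-in values below are arbitrary):

```python
import torch

def batch_loss(log_likelihoods, kl_terms, num_minibatches):
    # negative ELBO for one mini-batch: the likelihood is a Monte Carlo average over
    # a few latent samples, and each KL term is down-weighted by 1/|B| so that the
    # KL contribution summed over an epoch matches Equation (2)
    nll = -torch.stack(log_likelihoods).mean()
    kl = torch.stack(kl_terms).sum() / num_minibatches
    return nll + kl

ll_samples = [torch.tensor(-25.3), torch.tensor(-24.8), torch.tensor(-26.1)]  # 3 MC samples
kls = [torch.tensor(12.0), torch.tensor(9.5), torch.tensor(30.2)]  # task, language, theta terms
loss = batch_loss(ll_samples, kls, num_minibatches=1000)
```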

The maximum sequence length for inputs is limited to 250. The batch size is set to 8, and the best setting for the Adam optimizer (Kingma and Ba, 2015) was found to be an initial learning rate of 5·10⁻⁶ and ε = 10⁻⁸ based on grid search. In order to avoid over-fitting, we perform early stopping with a patience of 10 and a validation frequency of 2.5K steps.

Baselines. In this work, we propose a Bayesian generative model over a structured parameter space for neural models. In particular, we explore an approximate inference scheme with diagonal co-variance, hereby defined as Parameter Factorization (PF), and another with Factor Co-variance (FC). As baselines, we consider two widespread approaches for cross-lingual transfer. Both of them are implemented by sharing the BERT encoder across all languages while dedicating a private affine classifier to each task–language combination.

10Available at github.com/google-research/bert/blob/master/multilingual.md

11A WordPiece is a sub-word unit obtained through BPE (Wu et al., 2016).

12We found this weighting strategy to work better than annealing as proposed by Blundell et al. (2015).

Transfer from the Nearest Source (NS) language selects the most compatible source for a target language in terms of similarity. In particular, the selection can be based on family membership (Zeman and Resnik, 2008; Cotterell and Heigold, 2017; Kann et al., 2017), typological features (Deri and Knight, 2016), KL-divergence between part-of-speech trigrams (Rosa and Žabokrtský, 2015; Agić, 2017), tree edit distance of delexicalized dependency parses (Ponti et al., 2018), or a combination of the above (Lin et al., 2019). In our work, for prediction on each held-out language, we use the classifier associated with the observed language with the highest cosine similarity in terms of typological features. These features are sourced from URIEL (Littell et al., 2017) and contain information about family, area, syntax, and phonology.
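A small sketch of this selection step; the typological vectors here are random stand-ins for the URIEL features, and the language codes are only examples of seen source languages.

```python
import numpy as np

def nearest_source(target_vec, source_vecs):
    # pick the seen language whose typological vector is most cosine-similar to the target
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return max(source_vecs, key=lambda lang: cos(target_vec, source_vecs[lang]))

rng = np.random.default_rng(0)
sources = {lang: rng.random(100) for lang in ["vi", "tr", "fi", "ar"]}  # stand-in vectors
target = rng.random(100)                                                # a held-out language
print(nearest_source(target, sources))
```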

The second baseline is transfer from the Largest Source (LS) language, i.e., the language with the most training examples, which is usually English. This approach has been adopted by several recent works on cross-lingual transfer (Conneau et al., 2018; Artetxe et al., 2019, inter alia). In our implementation, we always select the English classifier for prediction. In order to make this baseline comparable to our model, we adjust the number of English NER training examples to the sum of the examples available for all seen languages S.13

It must be noted that, for both baselines, the number of parameters of each task–language-specific classifier is lower than in our proposed model. However, increasing the depth of such a network is detrimental if the BERT encoder parameters are kept trainable, which we also verified in our experiments (Peters et al., 2019).

5 Results and Discussion

5.1 Zero-shot Transfer

Firstly, we present the results for zero-shot prediction of the generative model of the parameter space under the two approximate inference schemes (with diagonal co-variance, PF, and with factor co-variance, +FC). Table 1 summarizes the results on the two tasks of POS tagging and NER averaged across all languages. Our model outperforms both baselines by a large margin on both tasks. In particular, our model gains +4.49 in accuracy (+6.93%) for POS tagging and +7.73 in F1 score (+9.80%) for NER on average compared to transfer from the nearest source (NS), the strongest baseline.

More details about the individual results on each task–language combination are depicted in Figure 2, which includes the mean and variance of the results over 3 separate runs.

13The number of NER training examples is 1,093,184 for the first partition and 520,616 for the second partition.


[Figure 2 plots, for each of the four models (Nearest Source, Largest Source, Parameter Factorization, + Factor Co-variance), the NER F1 score per language (aii, am, ar, bm, cy, et, eu, fi, fo, gl, gun, he, hsb, hu, hy, id, kk, kmr, ko, kpv, mt, myv, sme, ta, te, th, tl, tr, ug, vi, wo, yo, yue) in the top panel and the POS tagging accuracy per UD treebank in the bottom panel.]

Figure 2: Results for NER (top) and POS tagging (bottom): Transfer from Nearest Source and from Largest Source compared to Parameter Factorization with diagonal co-variance and low-rank factored co-variance.


Task    NS              LS              PF              +FC
POS     42.84 ± 1.23    60.51 ± 0.43    65.00 ± 0.12    64.71 ± 0.18
NER     74.16 ± 0.56    78.97 ± 0.56    86.26 ± 0.17    86.70 ± 0.10

Table 1: Results per task averaged across all languages.

Overall, we obtain improvements in 31/33 languages for NER and in 36/45 treebanks for POS tagging, which further supports the benefits of transferring both from tasks and from languages. Selecting the best and worst performances of FC compared to NS, the strongest baseline, we report +30.08 in F1 score (+51%) for Kurmanji and −5.72 (−10.37%) for Amharic on NER; +29.91 in accuracy (+119.71%) for Uyghur and −1.72 (−12.07%) for Guaraní on POS tagging.

Considering the baselines, the relative performance of LS versus NS is an interesting finding per se. LS largely outperforms NS on both POS tagging and NER. This shows that having more data is more informative than taking into account similarity based on linguistic properties. This finding contradicts the hypothesis formulated by Rosa and Žabokrtský (2015), Cotterell and Heigold (2017), and Lin et al. (2019), inter alia, that related languages tend to be the most reliable source. We conjecture that this is due to the pre-trained multi-lingual BERT encoder, which effectively bridges the gap between unrelated languages.

Secondly, comparing the two approximate inference schemes, FC obtains a small but statistically significant improvement over PF on NER, whereas they achieve the same performance on POS tagging. This means that the posterior is modeled well enough by a spherical Gaussian, such that a richer co-variance structure is not needed.

Finally, we note that even for the best model (FC) there is wide variation in the scores for the same task across languages. POS tagging accuracy ranges from 12.56 ± 4.07 for Guaraní to 86.71 ± 0.67 for Galician, and NER F1 scores range from 49.44 ± 0.69 for Amharic to 96.20 ± 0.11 for Upper Sorbian. Part of this variation is explained by the fact that the multilingual BERT encoder is not pre-trained on a subset of these languages (which includes Amharic, Guaraní, Uyghur, and Assyrian Neo-Aramaic). Another cause of variance is more straightforward: the scores are expected to be lower for languages with fewer training examples in the seen task–language combinations (e.g., Yoruba, Wolof, Armenian).


Figure 3: Samples from the posteriors of 4 languages, PCA-reduced to 4 dimensions.

5.2 Visualization of the Learned Posteriors

The approximate posteriors of the latent variables can be visualized in order to study the learned representations for languages. Previous works (Johnson et al., 2017; Östling and Tiedemann, 2017; Malaviya et al., 2017; Bjerva and Augenstein, 2018) induced point estimates of language representations from artificial tokens concatenated to every input sentence, or from the aggregated values of the hidden state of a neural encoder. The information contained in such representations depends on the task (Bjerva and Augenstein, 2018), but mainly reflects the structural properties of each language (Bjerva et al., 2019).

In our work, due to the estimation procedure, languages are represented by full distributions rather than point estimates. By inspecting the learned representations, language similarities appear to follow higher-level properties of languages, rather than structural properties. This is most likely due to the fact that parameter factorization takes place after the multi-lingual BERT encoding, which blends the structural differences across languages. A fair comparison with previous works without such an encoder is left for future investigation.



Figure 4: Entropy of the predicted distributions over classes for each test example. The higher the entropy, the more uncertain the prediction.

As an example, consider two pairs of languages from two distinct families: Yoruba and Wolof are Niger-Congo languages from the Atlantic-Congo branch, while Tamil and Telugu are Dravidian. We take 1,000 samples from the approximate posterior over the latent variables for each of these languages. In particular, we focus on the variational scheme with a low-rank co-variance Gaussian. Subsequently, we reduce the dimensionality of each sample to 4 through PCA,14 and we plot the density along each resulting dimension in Figure 3. As is evident, the density areas of each dimension do not necessarily overlap between members of the same family. Hence, the learned representations do not depend on genealogical properties. We leave it to future work to probe which information they contain instead.
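A sketch of this inspection with NumPy and scikit-learn, using a synthetic low-rank posterior in place of the learned one:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
h, k, n = 100, 10, 1000

# stand-in posterior for one language: mean, diagonal scale, and low-rank factor
mu, sigma = rng.normal(size=h), 0.5 * rng.random(h) + 0.1
B = 0.1 * rng.normal(size=(h, k))
samples = mu + sigma * rng.normal(size=(n, h)) + rng.normal(size=(n, k)) @ B.T

reduced = PCA(n_components=4).fit_transform(samples)   # 4 dimensions, as in Figure 3
# each column of `reduced` can then be plotted as a density, one curve per dimension
```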

5.3 Entropy of the Predictions

A notable problem of point estimate methods is their tendency to assign most of the probability mass to a single class even in scenarios with high uncertainty. Zero-shot transfer is one such scenario, because it involves drastic distribution shifts in the data (Rabanser et al., 2019). A key advantage of Bayesian inference, instead, is marginalization over parameters, which yields smoother predictions (Kendall and Gal, 2017; Wilson, 2019).

14Note that the dimensionality-reduced samples are also Gaussian, since PCA is a linear method.

We run an analysis of predictions based on (approximate) Bayesian model averaging. First, we randomly sample 800 examples from each test set of a task–language combination. For each example, we predict a distribution over classes Y through model averaging based on 10 samples from the posteriors. We then measure the prediction entropy of each example, i.e., H(p) = −∑_{y∈Y} p(Y = y) ln p(Y = y), whose plot is shown in Figure 4.

Entropy is a measure of uncertainty. Intuitively, the uniform categorical distribution (maximum uncertainty) has the highest entropy, whereas if the whole probability mass falls into a single class (maximum confidence), then the entropy H = 0.¹⁵
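A sketch of this measurement for one token, assuming 10 posterior samples and the 9 NER classes (the class distributions here are random placeholders):

```python
import torch

def prediction_entropy(p):
    # H(p) = -sum_y p(y) ln p(y); maximal (ln 9 ≈ 2.2 for NER) under a uniform distribution
    p = p.clamp_min(1e-12)
    return -(p * p.log()).sum(dim=-1)

samples = torch.softmax(torch.randn(10, 9), dim=-1)  # class distributions from 10 posterior samples
averaged = samples.mean(dim=0)                       # approximate Bayesian model average
print(prediction_entropy(averaged))
```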

As emerges from Figure 4, predictions for certain languages tend to have higher entropy on average, such as for Amharic, Guaraní, Uyghur, or Assyrian Neo-Aramaic. This aligns well with the performance metrics in Figure 2: in practice, languages with low performance tend to display high entropy in the predictive distribution, as expected.

15The maximum entropy is ≈ 2.2 for the 9 classes in NER and ≈ 2.83 for the 17 classes in POS tagging.

6 Related Work

Data Matrix Factorization. Although we are the first to propose a factorization of the parameter space for unseen combinations of tasks and languages, the factorization of data for collaborative filtering and social recommendation is an established research area. In particular, the missing values in sparse data structures such as user–movie review matrices can be filled via probabilistic matrix factorization (PMF) through a linear combination of user and movie matrices (Mnih and Salakhutdinov, 2008; Ma et al., 2008; Shan and Banerjee, 2010, inter alia) or through neural networks (Dziugaite and Roy, 2015). Inference for PMF can be carried out through MAP inference (Dziugaite and Roy, 2015), Markov Chain Monte Carlo (MCMC; Salakhutdinov and Mnih, 2008), or stochastic VI (Stolee and Patterson, 2019). Contrary to the prior work, we perform factorization on a latent variable (task–language parameters) rather than an observed one (data), which requires a different model.

Contextual Parameter Generation. Our model is reminiscent of the idea that parameters can be conditioned on language representations, as proposed by Platanios et al. (2018). However, since this approach is limited to a single task and a joint learning setting, it is not suitable for cross-task transfer and zero-shot predictions.

Neural Bayesian Methods are especially suited for cross-lingual transfer learning, but so far they have found only limited application in this research area. Firstly, they incorporate priors over parameters: Ponti et al. (2019b) constructed a prior imbued with universal linguistic knowledge for zero- and few-shot character-level language modeling. Secondly, they avoid the risk of over-fitting and take into account parameter uncertainty through model averaging. For instance, Shareghi et al. (2019) and Doitch et al. (2019) use a perturbation model to sample high-quality and diverse solutions for structured prediction in cross-lingual parsing.

7 Conclusion and Future Work

The main contribution of our work is a Bayesian generative model of the space of neural parameters, which can be factorized according to the combination of languages and tasks. We performed inference through stochastic Variational Inference and evaluated our model on zero-shot prediction for unseen task–language combinations in two tasks, named entity recognition (NER) and part-of-speech (POS) tagging, across a typologically diverse set of 33 languages. Based on these results, we conclude that leveraging the information from tasks and languages simultaneously is superior to model transfer from English (relying on more abundant in-task data in the source language) or from the most typologically similar language (relying on prior information on language similarity). On average, we report improvements of 4.49 in POS tagging accuracy and 7.73 in NER F1 score over the strongest baseline. As a consequence, our approach holds promise for alleviating data paucity issues for a wide spectrum of languages and tasks.

In the future, we will port a similar approach to multilingual tasks beyond sequence labelling, such as Natural Language Inference (Conneau et al., 2018) and Question Answering (Artetxe et al., 2019; Lewis et al., 2019). Moreover, one exciting research path is extending the framework of parameter space factorization to also take into account multiple modalities (e.g., speech, text, vision).

Acknowledgements

This work is supported by the ERC Consolidator Grant LEXICAL (no. 648909). RR was partially funded by ISF personal grant No. 1625/18.

References

Željko Agić. 2017. Cross-lingual parser selection for low-resource languages. In Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017), pages 1–10.

Željko Agić, Anders Johannsen, Barbara Plank, Héctor Martínez Alonso, Natalie Schluter, and Anders Søgaard. 2016. Multilingual projection for parsing truly low-resource languages. Transactions of the ACL, 4:301–312.

Waleed Ammar, George Mulcaire, Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2016. Many languages, one parser. Transactions of the ACL, 4:431–444.

Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2019. On the cross-lingual transferability of monolingual representations. arXiv preprint arXiv:1910.11856.

Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610.

Johannes Bjerva and Isabelle Augenstein. 2018. From phonology to syntax: Unsupervised linguistic typology at different levels with language embeddings. In Proceedings of NAACL-HLT, pages 907–916.

Johannes Bjerva, Robert Östling, Maria Han Veiga, Jörg Tiedemann, and Isabelle Augenstein. 2019. What do language representations really represent? Computational Linguistics, 45(2):381–389.

Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. 2015. Weight uncertainty in neural networks. In Proceedings of ICML, pages 1613–1622.

Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of EMNLP, pages 2475–2485.

Ryan Cotterell and Georg Heigold. 2017. Cross-lingual character-level neural morphological tagging. In Proceedings of EMNLP, pages 748–759.

Aliya Deri and Kevin Knight. 2016. Grapheme-to-phoneme models for (almost) any language. In Proceedings of ACL, pages 399–408.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.

Amichay Doitch, Ram Yazdi, Tamir Hazan, and Roi Reichart. 2019. Perturbation based learning for structured NLP tasks with application to dependency parsing. Transactions of the Association for Computational Linguistics, 7:643–659.

Gintare Karolina Dziugaite and Daniel M. Roy. 2015. Neural network matrix factorization. arXiv preprint arXiv:1511.06443.

Julian Eisenschlos, Sebastian Ruder, Piotr Czapla, Marcin Kadras, Sylvain Gugger, and Jeremy Howard. 2019. MultiFiT: Efficient multi-lingual language model fine-tuning. In Proceedings of EMNLP, pages 5706–5711.

Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P. Vetrov, and Andrew G. Wilson. 2018. Loss surfaces, mode connectivity, and fast ensembling of DNNs. In Proceedings of NeurIPS, pages 8789–8798.

Daniela Gerz, Ivan Vulić, Edoardo Maria Ponti, Roi Reichart, and Anna Korhonen. 2018. On the relation between linguistic typology and (limitations of) multilingual language modeling. In Proceedings of EMNLP, pages 316–327.

Harald Hammarström, Robert Forkel, Martin Haspelmath, and Sebastian Bank, editors. 2016. Glottolog 2.7. Max Planck Institute for the Science of Human History, Jena.

Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. 2013. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Katharina Kann, Ryan Cotterell, and Hinrich Schütze. 2017. One-shot neural cross-lingual transfer for paradigm completion. In Proceedings of ACL, pages 1993–2003.

Alex Kendall and Yarin Gal. 2017. What uncertainties do we need in Bayesian deep learning for computer vision? In Proceedings of NeurIPS, pages 5574–5584.

Diederik P. Kingma and Jimmy L. Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of ICLR.

Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In Proceedings of ICLR.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. In Proceedings of NeurIPS, pages 7057–7067.

Patrick S. H. Lewis, Barlas Oguz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2019. MLQA: Evaluating cross-lingual extractive question answering. CoRR, abs/1910.07475.

Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junxian He, Zhisong Zhang, Xuezhe Ma, Antonios Anastasopoulos, Patrick Littell, and Graham Neubig. 2019. Choosing transfer languages for cross-lingual learning. In Proceedings of ACL, pages 3125–3135.

Patrick Littell, David R. Mortensen, Ke Lin, Katherine Kairis, Carlisle Turner, and Lori Levin. 2017. URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors. In Proceedings of EACL, pages 8–14.

Hao Ma, Haixuan Yang, Michael R. Lyu, and Irwin King. 2008. SoRec: Social recommendation using probabilistic matrix factorization. In Proceedings of CIKM, pages 931–940.

Chaitanya Malaviya, Graham Neubig, and Patrick Littell. 2017. Learning language representations for typology prediction. In Proceedings of EMNLP, pages 2529–2535.

Andrew C. Miller, Nicholas J. Foti, and Ryan P. Adams. 2017. Variational boosting: Iteratively refining posterior approximations. In Proceedings of ICML, pages 3732–3747.

Andriy Mnih and Ruslan Salakhutdinov. 2008. Probabilistic matrix factorization. In Proceedings of NeurIPS, pages 1257–1264.

Joakim Nivre, Mitchell Abrams, Željko Agić, Lars Ahrenberg, Gabriele Aleksandraviciute, Lene Antonsen, Katya Aplonova, Maria Jesus Aranzabe, Gashaw Arutie, Masayuki Asahara, et al. 2019. Universal Dependencies 2.4. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Victor M.-H. Ong, David J. Nott, and Michael S. Smith. 2018. Gaussian variational approximation with a factor covariance structure. Journal of Computational and Graphical Statistics, 27(3):465–478.

Robert Östling and Jörg Tiedemann. 2017. Continuous multilinguality with language vectors. In Proceedings of EACL, pages 644–649.

Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. Cross-lingual name tagging and linking for 282 languages. In Proceedings of ACL, volume 1, pages 1946–1958.

Matthew E. Peters, Sebastian Ruder, and Noah A. Smith. 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 7–14.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of ACL, pages 4996–5001.

Emmanouil Antonios Platanios, Mrinmaya Sachan, Graham Neubig, and Tom Mitchell. 2018. Contextual parameter generation for universal neural machine translation. In Proceedings of EMNLP, pages 425–435.

Edoardo Maria Ponti, Helen O'Horan, Yevgeni Berzak, Ivan Vulić, Roi Reichart, Thierry Poibeau, Ekaterina Shutova, and Anna Korhonen. 2019a. Modeling language variation and universals: A survey on typological linguistics for natural language processing. Computational Linguistics, 45(3):559–601.

Edoardo Maria Ponti, Roi Reichart, Anna Korhonen, and Ivan Vulić. 2018. Isomorphic transfer of syntactic structures in cross-lingual NLP. In Proceedings of ACL, pages 1531–1542.

Edoardo Maria Ponti, Ivan Vulić, Ryan Cotterell, Roi Reichart, and Anna Korhonen. 2019b. Towards zero-shot language modeling. In Proceedings of EMNLP, pages 2900–2910.

Stephan Rabanser, Stephan Günnemann, and Zachary Lipton. 2019. Failing loudly: An empirical study of methods for detecting dataset shift. In Proceedings of NeurIPS, pages 1394–1406.

Shruti Rijhwani, Jiateng Xie, Graham Neubig, and Jaime G. Carbonell. 2019. Zero-shot neural transfer for cross-lingual entity linking. In Proceedings of AAAI, pages 6924–6931.

Rudolf Rosa and Zdeněk Žabokrtský. 2015. KLcpos3 - a language similarity measure for delexicalized parser transfer. In Proceedings of ACL, pages 243–249.

Sebastian Ruder, Matthew E. Peters, Swabha Swayamdipta, and Thomas Wolf. 2019. Transfer learning in natural language processing. In Proceedings of NAACL-HLT: Tutorials, pages 15–18.

Ruslan Salakhutdinov and Andriy Mnih. 2008. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of ICML, pages 880–887.

Hanhuai Shan and Arindam Banerjee. 2010. Generalized probabilistic matrix factorizations for collaborative filtering. In Proceedings of ICDM, pages 1025–1030.

Ehsan Shareghi, Yingzhen Li, Yi Zhu, Roi Reichart, and Anna Korhonen. 2019. Bayesian learning for neural dependency parsing. In Proceedings of NAACL-HLT, pages 3509–3519.

Benjamin Snyder and Regina Barzilay. 2010. Climbing the tower of Babel: Unsupervised multilingual learning. In Proceedings of ICML, pages 29–36.

Jake Stolee and Neill Patterson. 2019. Matrix factorization with neural networks and stochastic variational inference. Technical report, University of Toronto.

Oscar Täckström, Ryan McDonald, and Jakob Uszkoreit. 2012. Cross-lingual word clusters for direct transfer of linguistic structure. In Proceedings of NAACL-HLT, pages 477–487.

Alon Talmor and Jonathan Berant. 2019. MultiQA: An empirical investigation of generalization and transfer in reading comprehension. In Proceedings of ACL, pages 4911–4921.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NeurIPS, pages 5998–6008.

Andrew Gordon Wilson. 2019. The case for Bayesian deep learning. NYU Courant Technical Report.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Dani Yogatama, Cyprien de Masson d'Autume, Jerome Connor, Tomas Kocisky, Mike Chrzanowski, Lingpeng Kong, Angeliki Lazaridou, Wang Ling, Lei Yu, Chris Dyer, et al. 2019. Learning and evaluating general linguistic intelligence. arXiv preprint arXiv:1901.11373.

Daniel Zeman and Philip Resnik. 2008. Cross-language parser adaptation between related languages. In Proceedings of IJCNLP, pages 35–42.

Yftah Ziser and Roi Reichart. 2018. Deep pivot-based modeling for cross-language cross-domain transfer with minimal guidance. In Proceedings of EMNLP, pages 238–249.
