
PADA: Example-based Prompt Learning for on-the-fly Adaptation to Unseen Domains

Eyal Ben-David∗  Nadav Oved∗  Roi Reichart
Technion - Israel Institute of Technology
{eyalbd12@campus.|nadavo@campus.|roiri@}technion.ac.il

Abstract

Natural Language Processing algorithms have made incredible progress, but they still struggle when applied to out-of-distribution examples. We address a challenging and underexplored version of this domain adaptation problem, where an algorithm is trained on several source domains, and then applied to examples from unseen domains that are unknown at training time. Particularly, no examples, labeled or unlabeled, or any other knowledge about the target domain are available to the algorithm at training time. We present PADA: An example-based autoregressive Prompt learning algorithm for on-the-fly Any-Domain Adaptation, based on the T5 language model. Given a test example, PADA first generates a unique prompt for it and then, conditioned on this prompt, labels the example with respect to the NLP prediction task. PADA is trained to generate a prompt which is a token sequence of unrestricted length, consisting of Domain Related Features (DRFs) that characterize each of the source domains. Intuitively, the generated prompt is a unique signature that maps the test example to a semantic space spanned by the source domains. In experiments with 3 tasks (text classification and sequence tagging), for a total of 14 multi-source adaptation scenarios, PADA substantially outperforms strong baselines.12

1 Introduction

Natural Language Processing (NLP) algorithms are gradually achieving remarkable milestones (Devlin et al., 2019; Lewis et al., 2020; Brown et al., 2020). However, such algorithms often rely on the seminal assumption that the training set and the test set come from the same underlying distribution. Unfortunately, this assumption often does not hold, since text may emanate from many different sources, each with unique distributional properties. As generalization beyond the training distribution is still a fundamental challenge, NLP algorithms suffer a significant degradation when applied to out-of-distribution examples.

∗ Both authors equally contributed to this work.
1 Our code and data are available at https://github.com/eyalbd2/PADA.
2 An earlier version of this paper was previously uploaded to the arxiv under the name: "PADA: A Prompt-based Autoregressive Approach for Adaptation to Unseen Domains".

Figure 1: Text classification with PADA. Colored texts signify relation to a specific source domain. PADA first generates the domain name, followed by a set of DRFs related to the input example. Then it uses the prompt to predict the task label.

Domain Adaptation (DA) explicitly addresses the above challenge, striving to improve out-of-distribution generalization of NLP algorithms. DA algorithms are trained on annotated data from source domains, to be effectively applied in a variety of target domains. Over the years, considerable efforts have been devoted to the DA challenge, focusing on various scenarios where the target domain is known at training time (e.g. through labeled or unlabeled data) but is yet under-represented (Roark and Bacchiani, 2003; Daumé III and Marcu, 2006; Reichart and Rappoport, 2007; McClosky et al., 2010; Rush et al., 2012; Schnabel and Schütze, 2014). Still, the challenge of adaptation to any possible target domain, that is unknown at training time, is underexplored in DA literature.3

In this work, we focus on adaptation to any target domain, which we consider a "Holy Grail" of DA (§3). Apart from the pronounced intellectual challenge, it also presents unique modeling advantages, as target-aware algorithms typically require training a separate model for each target domain, leading to an inefficient overall solution.

Intuitively, better generalization to unseen domains can be achieved by integrating knowledge from several source domains. We present PADA: An example-based autoregressive Prompt learning algorithm for on-the-fly Any-Domain Adaptation (§4), which utilizes an autoregressive language model (T5, Raffel et al. (2020)), and presents a novel mechanism which learns to generate human-readable prompts that represent multiple source domains. Given a new example, from any unknown domain, the model first generates properties (a sequence of tokens) that belong to familiar (source) domains and relate to the given example. Then, the generated sequence is used as a prompt for the example, while the model performs the downstream task.4 PADA implements a specialized two-stage multi-task protocol which facilitates model parameter sharing between the prompt generation and the downstream tasks. Ultimately, PADA performs its adaptation per example, by leveraging (1) an example-specific prompting mechanism and (2) a two-stage multi-task objective.

In order to generate effective prompts, we draw inspiration from previous work on pivot features (Blitzer et al., 2006; Ziser and Reichart, 2018; Ben-David et al., 2020) to define sets of Domain Related Features (DRFs, §4.2). DRFs are tokens which are strongly associated with one of the source domains, encoding domain-specific semantics. We leverage the DRFs of the various source domains in order to span their shared semantic space. Together, these DRFs reflect the similarities and differences between the source domains, in addition to domain-specific knowledge.

3 The any-domain adaptation setting is addressed in the model robustness literature. In §3, we discuss the differences between these static methods and our dynamic approach.
4 We use a language model, pre-trained on massive unlabeled data, and it is possible that this model was exposed to text from the source or target domains. Yet, the downstream task training is based only on examples from the source domains, without any knowledge of future target domains.

Consider the task of review sentiment classification (Figure 1). The model is familiar with four source domains: restaurants, home-furniture, electronic-devices, and movies. When the model encounters a review, this time from the airlines domain, it uses DRFs to project the example into the shared semantic space, via the prompting mechanism. In the given example, the DRFs marked in blue and green relate to the restaurants and the home-furniture domains, respectively. The DRF-based prompt is then used in classification.

We evaluate PADA in the multi-source DA setting, where the target domain is unknown during training (§5, 6). We consider two text classification tasks (Rumour Detection and Multi-Genre Natural Language Inference (MNLI)), and a sequence tagging task (Aspect Prediction), for a total of 14 DA setups. PADA outperforms strong baselines, yielding substantial error reductions.

2 Related Work

We first describe research in the setting of unsupervised DA, with a focus on pivot-based methods. We then continue with the study of DA methods with multiple sources, focusing on mixture of experts models. Finally, we describe autoregressive language models and prompting mechanisms, and the unique manner in which we employ T5 for DA.

Unsupervised Domain Adaptation (UDA) With the breakthrough of deep neural network (DNN) modeling, attention from the DA community has been directed to representation learning approaches. One line of work employs DNN-based autoencoders to learn latent representations. These models are trained on unlabeled source and target data with an input reconstruction loss (Glorot et al., 2011; Chen et al., 2012; Yang and Eisenstein, 2014; Ganin et al., 2016). Another branch employs pivot features to bridge the gap between a source domain and a target domain (Blitzer et al., 2006, 2007; Pan et al., 2010). Pivot features are prominent to the task of interest and are abundant in the source and target domains. Recently, Ziser and Reichart (2017, 2018, 2019) married the two approaches. Later on, Han and Eisenstein (2019) presented a pre-training method, followed by Ben-David et al. (2020) and Lekhtman et al. (2021), who introduced a pivot-based variant for pre-training contextual word embeddings.

Crucially, UDA models assume access to unlabeled data from the target domain in-hand during training. We see this as a slight relaxation of the goal of generalization beyond the training distribution. Moreover, this definition has engineering disadvantages, as a new model is required for each target domain. To this end, we pursue the any-domain adaptation setting, where unlabeled target data is unavailable at training time.

We draw inspiration from pivot-based modeling. The pivot definition relies on labeled source domain data and unlabeled source and target domain data (which is unavailable in our setup). Particularly, good pivots are ones that are correlated with the task label. Hence, pivot features are typically applied to tasks which offer meaningful correlations between words and the task label, such as sentiment classification. For other types of tasks, pivots may be difficult to apply. Consider the MNLI dataset, where the task is to understand the directional relation between a pair of sentences (entailment, contradiction or neutral). In such a task it is unlikely to find meaningful correlations between single words and the label. Instead, we define task-invariant DRFs, features which are highly correlated with the identity of the domain. Since domains are highly correlated with words, our DRFs are lexical in nature.

Our proposed approach is an important step forward from pivots, as our model generates DRF sequences of unrestricted lengths, instead of focusing on individual words. Moreover, pivots are typically applied in single-source setups, and while our method can operate with a single source domain, we utilize multiple source domains to facilitate generalization to unknown target domains.

Multi-Source Domain Adaptation Most existing multi-source DA methods follow the setup definitions of unsupervised DA, while considering more than one source domain. A prominent approach is to fuse models from several sources. Early work trained a classifier for each domain and assumed all source domains are equally important for a test example (Li and Zong, 2008; Luo et al., 2008). More recently, adversarial-based methods used unlabeled data to align the source domains to the target domains (Zhao et al., 2018; Chen and Cardie, 2018). Meanwhile, Kim et al. (2017) and Guo et al. (2018) explicitly weighted a Mixture of Experts (MoE) model based on the relationship between a target example and each source domain. However, Wright and Augenstein (2020) followed this work, tested a variety of weighting approaches on a Transformers-based MoE, and found a naive weighting approach to be very effective.

We recognize two limitations in the proposed MoE solution. First, MoE requires training a standalone expert model for each source domain. Hence, the total number of parameters increases (typically linearly) with the number of source domains, which harms the solution's scalability. One possible solution could be to train smaller-scale experts (Pfeiffer et al., 2020; Rücklé et al., 2020), but this approach is likely to lead to degradation in performance. Second, domain experts are tuned towards domain-specific knowledge, at times at the expense of cross-domain knowledge which highlights the relationship between different domains. In practice, test examples may arrive from unknown domains, and may reflect a complicated combination of the sources. To cope with this, MoE ensembles the predictions of the experts using heuristic methods, such as a simple average or a weighted average based on the predictions of a domain classifier. Our results indicate that this approach is sub-optimal.

Moreover, we view domain partitioning as often somewhat arbitrary (consider, for example, the differences between the dvd and movie domains). We do not want to strictly confine our model to a specific partitioning, and rather encourage a more lenient approach towards domain boundaries. Hence, in this work, we train only a single model which shares its parameters across all domains. Furthermore, we are interested in adapting to any target domain, such that no information about potential target domains is known at training time. Some of the above works (Wright and Augenstein, 2020) in fact avoid utilizing target data, thus they fit the any-domain setting and form two of our baselines. Yet, in contrast to these works, the any-domain objective is a core principle of this study.

Autoregressive LMs and Prompting Recently, a novel approach to language modeling has been proposed, which casts it as a sequence-to-sequence task, by training a full Transformer (encoder-decoder) model (Vaswani et al., 2017) to autoregressively generate masked, missing or perturbed token spans from the input sequence (Raffel et al., 2020; Lewis et al., 2020). Raffel et al. (2020) present a particularly interesting approach with the T5 model. It treats all tasks as generative (text-to-text), eliminating the need for a task-specific network architecture. This is made possible by prefixing each example with a prompt phrase denoting the specific task being performed.

Recent works have further explored such prompting mechanisms in several avenues: adapting a language model for different purposes (Brown et al., 2020); eliciting sentiment or topic-related information (Jiang et al., 2020; Sun and Lai, 2020; Shin et al., 2020; Haviv et al., 2021); efficient fine-tuning (Li and Liang, 2021; Scao and Rush, 2021); or as a method for few-shot learning (Gao et al., 2021; Schick and Schütze, 2021).5 In this work, we make use of T5's prompting mechanism as a way of priming the model to encode domain-specific characteristics relating to each example from an unknown target domain. Borrowing terminology from Liu et al. (2021a), our approach falls under the "Prompt+LM Tuning" training strategy (Liu et al., 2021b; Han et al., 2021). In this strategy, prompt-relevant parameters are fine-tuned together with some or all of the parameters of the pre-trained model (T5 in our case). However, in contrast to prompt tuning approaches which focus on representation-level tuning (Liu et al., 2021b; Li and Liang, 2021; Lester et al., 2021), we train T5 to generate human-readable prompts consisting of natural language tokens that encode domain-specific information relating to the given example. To the best of our knowledge, this work is the first to learn to generate textual prompts alongside a downstream prediction task. It is also the first to generate a unique prompt per example. Finally, it is the first to design a prompting mechanism for the purpose of DA.

5 For a comprehensive discussion of the research on prompting mechanisms, we refer to Liu et al. (2021a).

3 Any-Domain Adaptation

DA and Transfer Learning A prediction task (e.g., Rumour Detection) is defined as T = {Y}, where Y is the task's label space. We denote X to be a feature space, P(X) to be the marginal distribution over X, and P(Y) the prior distribution over Y. The domain is then defined by D_T = {X, P(X), P(Y), P(Y|X)}. DA is a particular case of transfer learning, namely transductive transfer learning (Ramponi and Plank, 2020), in which T_S and T_T, the source and target tasks, are the same. However, D_{T_S} and D_{T_T}, the source and target domains, differ in at least one of their underlying probability distributions, P(X), P(Y), or P(Y|X).6 The goal in DA is to learn a function f from a set of source domains {D_{S_i}}_{i=1}^{K} that generalizes well to a set of target domains {D_{T_i}}_{i=1}^{M}.

The Any-Domain Setting We focus on building an algorithm for a given task, that is able to adapt to any domain. To this end, we assume zero knowledge about the target domain, D_T, at training time. Hence, we slightly modify the classic setting of unsupervised multi-source domain adaptation, by assuming we have no knowledge of, or access to, labeled or unlabeled data from the target domains. We only assume access to labeled training data from K source domains {D_{S_i}}_{i=1}^{K}, where D_{S_i} = {(x_t^{S_i}, y_t^{S_i})}_{t=1}^{n_i}. The goal is to learn a model, using only the source domains' data, that generalizes well to unknown target domains.

The NLP and ML literature addresses several settings that are similar to any-domain adaptation. However, our on-the-fly example-based approach is novel. Below, we discuss these settings and the differences between their proposed solution approaches and ours.

The goal of any-domain adaptation was previously explored through the notion of domain robustness. Algorithms from this line of work seek generalization to unknown distributions through optimization methods which favor robustness over specification (Hu et al., 2018; Oren et al., 2019; Sagawa et al., 2020; Koh et al., 2020; Wald et al., 2021). This is typically achieved by training the model to focus on domain-invariant features, which are considered fundamental to the task and general across domains (Muandet et al., 2013; Ganin et al., 2016; Arjovsky et al., 2019; Müller et al., 2020). In contrast, this work proposes to achieve this goal through on-the-fly example-based adaptation, utilizing both domain-invariant and domain-specific features, as the latter often prove relevant to the new domain (Blitzer et al., 2006; Ziser and Reichart, 2017). For instance, consider the example presented in Figure 1. The expression "food was cold" would be considered domain-specific, considering the restaurants domain. Despite it not being a domain-invariant feature, it may serve as a valuable feature for the target domain (airlines).

6 In inductive transfer learning, T_S differs from T_T.


Any-domain adaptation also draws some similarities with the continual learning (Ring, 1995) and zero-shot learning (Palatucci et al., 2009) paradigms. Continual learning systems seek to transfer knowledge from a number of known tasks to a new one, while in our proposed setting new domains arrive during inference, and as opposed to continual learning, we do not update the parameters of the model when a new domain is presented (we actually do not even know the domains of the test examples).7 The zero-shot setting also does not update the parameters of the model given a new task, yet its definition is less consistent across different models: GPT-3 (Brown et al., 2020) attempts to transfer knowledge to an unknown target task T_T and unknown domain D_T; Blitzer et al. (2009) assume access to unlabeled data from various domains including the target domain; and Peng et al. (2018) use data of a different task from the target domain. In contrast, our problem setting specifically focuses on domain adaptation, while assuming no prior knowledge of the target domain.

The any-domain adaptation setting naturally calls for an example-level adaptation approach. Since the model does not have any knowledge about the target domain during training, each example it encounters during inference should be aligned with the source domains.

4 Example-based Adaptation through Prompt Learning

In this work we propose a single model that encodes information from multiple domains. Our model is designed such that test examples from new unknown domains can trigger the most relevant parameters in the model. This way we allow our model to share information between domains and use the most relevant information at test time. Our model is inspired by recent research on prompting mechanisms for autoregressive language models. We start (§4.1) by describing the general architecture of our model, and continue (§4.2) with the domain related features that form our prompts.

4.1 The Model

We present our example-based autoregressive Prompt learning algorithm for on-the-fly Any-Domain Adaptation (PADA, Figure 2). PADA employs a pre-trained T5 language model and learns to generate example-specific Domain Related Features (DRFs) in order to facilitate accurate task predictions. This is implemented through a two-step multi-task mechanism, where first a DRF set is generated to form a prompt, and then the task label is predicted.

7 von Oswald et al. (2020) explore the notion of inferring the new example's task out of the training tasks.

Formally, assume an input example (x_i, y_i) ∼ S_i, such that x_i is the input text, y_i is the task label and S_i is the domain of this example. For the input x_i, PADA is trained to first generate N_i, the domain name, followed by R_i, the DRF signature of x_i, and, given this prompt, to predict the label y_i. At test time, when the model encounters an example from an unknown domain, it generates a prompt that may consist of one or more domain names, as well as features from the DRF sets of one or more source domains, and based on this prompt it predicts the task label.

Test-time Inference Consider the example in Figure 1, which describes a sentiment classification model trained on the restaurants, home-furniture, electronic-devices, and movies source domains. The model observes a test example from the airlines domain, a previously unseen domain whose name is not known to the model. The model first generates the name of the domain which is most appropriate for this example, restaurants in this case. Then, it continues to generate the words "food" and "chair", features related to the restaurants and home-furniture domains, respectively. Finally, given this prompt, the model predicts the example's (negative) sentiment.
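To make the two inference steps concrete, the following is a minimal sketch using the HuggingFace transformers API. The stock t5-base checkpoint and the prompt and label formats are illustrative stand-ins: PADA itself is fine-tuned, and for classification tasks it uses a discriminative head (§5.3) rather than generating the label as text.

```python
# A minimal sketch of PADA-style two-step inference. "t5-base" stands in
# for a fine-tuned PADA checkpoint; formats here are illustrative.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def two_step_inference(text: str) -> str:
    # Step 1: prime the model with the special 'Domain:' prefix and let it
    # generate a domain name followed by example-specific DRFs.
    gen_inputs = tokenizer("Domain: " + text, return_tensors="pt")
    prompt_ids = model.generate(**gen_inputs, max_length=20)
    prompt = tokenizer.decode(prompt_ids[0], skip_special_tokens=True)

    # Step 2: condition the task prediction on the generated prompt by
    # prepending it to the input. Here the label is produced generatively,
    # standing in for PADA's discriminative classification head.
    clf_inputs = tokenizer(prompt + " - " + text, return_tensors="pt")
    label_ids = model.generate(**clf_inputs, max_length=5)
    return tokenizer.decode(label_ids[0], skip_special_tokens=True)

print(two_step_inference("At least 5 hostages in kosher supermarket in Paris."))
```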

Training In order to separate the prompt generation task from the discriminative classification task, we train our model within a multi-task framework. PADA is trained to perform two tasks: one for generating a prompt, consisting of features from the DRF set of the example's domain, and another for predicting the example's label. For the first, generative task, the model receives examples with the special prompt 'Domain:', which primes the model to generate N_i and R_i (see examples of prompts generated by PADA in Table 1). Note that R_i is a set of features derived from the DRF set of S_i, and training examples are automatically annotated with their R_i, as described in §4.2. For the second, discriminative task, the model receives a prompt, consisting of N_i and R_i, and its task is to predict y_i.


Figure 2: PADA during test time inference. An autoregressive model with a generative head trained for DRF generation and a discriminative head for sentiment classification. PADA conditions the classification on the generated prompt. Text marked with blue signifies the DRFs and text marked with yellow signifies the domain name. Black arrows (→) mark the first inference step and red arrows (→) mark the second inference step.

Following the multi-task training protocol of T5, we mix examples from each task. To this end, we define a task proportion mixture parameter α. Each example from the training set forms an example for the generative task with probability α, and an example for the discriminative task with probability 1 − α. The greater the value of α, the more the model will train for the generative task.
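The following sketch illustrates this sampling scheme; the field names and the exact prompt format are illustrative assumptions, not the authors' data format.

```python
# A sketch of the task-mixture sampling: each training example becomes a
# prompt-generation instance with probability alpha, and a discriminative
# instance otherwise.
import random

def build_training_instances(examples, alpha=0.75):
    instances = []
    for ex in examples:
        if random.random() < alpha:
            # Generative task: input primed with 'Domain:', target is the
            # domain name followed by the example's annotated DRFs.
            instances.append({
                "input": "Domain: " + ex["text"],
                "target": ex["domain_name"] + " - " + ", ".join(ex["drfs"]),
                "task": "generate",
            })
        else:
            # Discriminative task: the prompt (domain name + DRFs) is
            # prepended to the input, and the target is the task label.
            prompt = ex["domain_name"] + " - " + ", ".join(ex["drfs"])
            instances.append({
                "input": prompt + " " + ex["text"],
                "target": ex["label"],
                "task": "classify",
            })
    return instances
```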

At the heart of our method is the careful selection of the DRF set of each domain, and the prompt annotation process for the training examples. We next discuss these features and their selection process.

4.2 Domain Related Features

For each domain we define the DRF set such that these features provide a semantic signature for the domain. Importantly, if two domains have shared semantics, for example the restaurants and the cooking domains, we expect their DRFs to semantically overlap. Since the prompt of each training example consists of a subset of features from the DRF set of its domain, we should also decide on a prompt generation rule that can annotate these training examples with their relevant features.

Input. A sad day for France, for journalism, for free speech, and those whose beliefs the attackers pretend to represent.
Prompt. Ottawa Shooting - french, attack, shootings

Input. Picture of explosion in Dammartin-en-Goele. all happened suddenly.
Prompt. Germanwings Crash - explosion, goele, german

Input. At least 5 hostages in kosher supermarket in eastern Paris, according to reports.
Prompt. Paris Siege - hostages, reports, taker

Table 1: Examples of DRF-based prompts generated by PADA, from the Charlie-Hebdo (C) target domain, which is unknown to PADA during training (source domains are FR, GW, OS, and S; see Table 4). PADA generates prompts which are semantically related to the input example by combining DRFs from source domains along with non-DRF yet relevant words. Moreover, it can also generate new domain names (Paris Siege).

In order to reflect the semantics of the domain, DRFs should occur frequently in this domain. Moreover, they should be substantially more common in that specific domain relative to all other domains. Despite their prominence in a specific domain, DRFs can also relate to other domains. For instance, consider the top example presented in Table 1. The word "attack" is highly associated with the "Ottawa Shooting" domain and is indeed one of its DRFs. However, this word is also associated with "Sydney Siege", which is another domain in the Rumour Detection dataset (Zubiaga et al., 2016). Moreover, since both domains are related to similar events, it is not surprising that the DRF set of the former contains the feature suspect and the DRF set of the latter contains the feature taker (see Table 3). The similarity of these features facilitates parameter sharing in our model.

Automatically Extracting DRFs There can be several ways of implementing a DRF extraction method that are in line with the above DRF definition. We experimented with several different extraction criteria (Correlation, class-based TF-IDF,8 and Mutual Information), and observed high similarity (82% overlap) between their resulting DRF sets. However, we observed a qualitative advantage for Mutual Information (MI), which successfully extracted DRFs that hold domain-specific semantic meaning.

We present the following MI-based method: Let examples (texts) from the j-th source domain (S_j) be labeled with 1, and examples from all other domains (S\S_j) be labeled with 0. We first calculate the mutual information (MI) between all tokens and this binary variable, and choose the l tokens with the highest MI score. Note that the MI criterion might promote tokens which are highly associated with S\S_j rather than with S_j. Thus, we filter the l tokens according to the following condition:

C_{S\S_j}(n) / C_{S_j}(n) ≤ ρ,   C_{S_j}(n) > 0

where C_{S_j}(n) is the count of the n-gram n in S_j, C_{S\S_j}(n) is the count of this n-gram in all source domains except for S_j, and ρ is an n-gram frequency ratio hyper-parameter.

8 https://github.com/MaartenGr/cTFIDF

Intuitively, the smaller ρ is, the more certain we are that the n-gram is especially associated with S_j, compared to other domains. Since the number of examples in S_j is much smaller than the number of examples in S\S_j, we choose ρ ≥ 1 but do not allow it to be too large. As a result, this criterion allows features which are associated with S_j but also related to other source domains to be part of the DRF set of S_j. This is demonstrated in Table 2, where we present examples of DRFs extracted for the Ferguson domain of the rumour detection task, using different values of ρ. Using ρ = 0, domain-specific DRFs such as "mikebrown" are extracted for the domain's DRF set. By increasing the value of ρ to 1, we add DRFs which are highly associated with the domain, but are also prevalent in other domains (e.g., "killing" is also related to the Ottawa-shooting domain). However, when increasing the value of ρ to 10, we extract DRFs which are less associated with the domain ("know"). This is further exacerbated when increasing ρ to higher values.

ρ = 0       ρ = 1     ρ = 10   ρ = 100
ferguson    police    know     breaking
mikebrown   officer   report
robbery     killing   just

Table 2: A sample of DRFs extracted for the Ferguson domain (rumour detection) with different ρ values. Each column represents DRFs which are filtered out of DRF sets with lower ρ values. DRFs of lower ρ values are more domain specific.
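The extraction procedure above can be sketched with scikit-learn as follows; the function and variable names are illustrative, and the tokenization and counting details are assumptions beyond the stated criterion.

```python
# A sketch of MI-based DRF extraction with the rho frequency-ratio filter.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

def extract_drfs(domain_texts, other_texts, l=1000, rho=1.5):
    texts = domain_texts + other_texts
    labels = [1] * len(domain_texts) + [0] * len(other_texts)

    vectorizer = CountVectorizer(ngram_range=(1, 1))  # unigrams only
    counts = vectorizer.fit_transform(texts)
    vocab = vectorizer.get_feature_names_out()

    # Rank tokens by mutual information with the domain indicator and keep l.
    mi = mutual_info_classif(counts, labels, discrete_features=True)
    top = sorted(range(len(vocab)), key=lambda i: -mi[i])[:l]

    # Keep tokens that occur in S_j and whose frequency ratio
    # C_{S\S_j}(n) / C_{S_j}(n) does not exceed rho.
    in_dom = counts[: len(domain_texts)].sum(axis=0).A1
    out_dom = counts[len(domain_texts):].sum(axis=0).A1
    return [vocab[i] for i in top
            if in_dom[i] > 0 and out_dom[i] / in_dom[i] <= rho]
```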

Annotating DRF-based Prompts for Training We denote the DRF set of the j-th domain with R_j. Given a training example i from domain j, we select the m features from R_j which are most associated with this example to form its prompt. To do that, we compute the Euclidean distance between the T5 embeddings of the DRF features and the T5 embeddings of each of the example's tokens. We then rank this list of pairs by their scores and select the top m features.9 In Table 3 we provide a sample of DRFs from the DRF sets associated with each domain in the rumor detection task (§5), alongside their frequency statistics for being annotated in a training example's prompt.

To conclude, our methods for domain-specific DRF set extraction and for prompt annotation of training examples demonstrate three attractive properties. First, every example has its own unique prompt. Second, our prompts map each training example to the semantic space of its domain. Lastly, the domain-specific DRF sets may overlap in their semantics, either by including the same tokens or by including tokens with similar meanings. This way they provide a more nuanced domain signature compared to the domain name alone. This is later used during the inference phase, when the model can generate an example-specific prompt that consists of features from the DRF sets of the various source domains.

5 Experimental Setup

5.1 Tasks and Datasets

We experiment with three multi-source DA tasks, where a model is trained on several domains and applied to a new one. We consider two text classification tasks, Rumour Detection and Multi-Genre Natural Language Inference (MNLI), and one sequence tagging task – Aspect Prediction. The details of the training, development and test sets of each domain are provided in Table 4. Our experiments are performed in a leave-one-out fashion: We train the model on all domains but one, and keep the held-out domain for testing. Particularly, training is done on the training data of the source domains and development on their development data, while the test data is taken from the target domain, which is unknown at training time. We repeat the experiments in each task such that each domain is used as a target domain.
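The leave-one-out protocol can be summarized in a few lines; the domain list below uses the rumour detection domains as an example.

```python
# A sketch of the leave-one-out protocol: each domain serves once as the
# unseen target, while the remaining domains are the sources.
domains = ["C", "FR", "GW", "OS", "S"]
for target in domains:
    sources = [d for d in domains if d != target]
    print(f"train/dev on {sources} -> test on {target}")
```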

9 In this computation we consider the non-contextual embeddings learned by T5 during its pre-training. In our experiments we consider only unigrams (words) as DRFs.


Rumour Detection The PHEME dataset of rumourous tweets (Zubiaga et al., 2016, 2017) contains 5,802 tweets, which followed 5 different real-world events, and are labelled as rumourous or non-rumourous.10 We treat each event as a separate domain: Charlie-Hebdo (C), Ferguson (FR), Germanwings-crash (GW), Ottawa-shooting (OS), and Sydney-siege (S).

We follow the data processing procedure of Wright and Augenstein (2020) and split each domain (event) corpus by a 4:1 ratio, establishing training and development sets. Since the corpora are relatively small, we want to avoid further shrinking the size of the test set. Hence, we include all examples available from the target domain to form the test set.11

MNLI This corpus (Williams et al., 2018) is an extension of the SNLI dataset (Bowman et al., 2015).12 Each example consists of a pair of sentences, a premise and a hypothesis. The relationship between the two may be entailment, contradiction, or neutral. The corpus includes data from 10 domains: 5 are matched, with training, development and test sets, and 5 are mismatched, without a training set. We experiment only with the five matched domains: Fiction (F), Government (G), Slate (SL), Telephone (TL) and Travel (TR).

Since the test sets of the MNLI dataset are not publicly available, we use the original development sets as our test sets for each target domain, while source domains use these sets for development. We explore a lightly supervised scenario, which emphasizes the need for a DA algorithm. Thus, we randomly downsample each of the training sets by a factor of 30, resulting in 2,000-3,000 examples per set.

Aspect Prediction The Aspect Prediction dataset is based on aspect-based sentiment analysis (ABSA) corpora from four domains: Device (D), Laptops (L), Restaurants (R), and Service (SE). The D data consists of reviews from Toprak et al. (2010), the SE data includes web service reviews (Hu and Liu, 2004), and the L and R domains consist of reviews from the SemEval-2014 ABSA challenge (Pontiki et al., 2014).

10 https://figshare.com/articles/dataset/PHEME_dataset_of_rumours_and_non-rumours/4010619
11 This does not harm the integrity of our experiments, since the training and development sets are sampled from the source domains while the test set is sampled only from the target domain.
12 https://cims.nyu.edu/~sbowman/multinli/

C                 GW                  OS               S
hebdo (88%)       lufthansa (86%)     ottawa (83%)     australians (75%)
ahmed (48%)       germanwings (33%)   cdnpoli (36%)    monis (69%)
terrorists (22%)  crash (25%)         shooting (30%)   isis (21%)
attack (19%)      plane (24%)         soldier (12%)    cafe (18%)
victims (4%)      barcelona (23%)     suspect (5%)     taker (16%)

Table 3: A sample of DRFs from four rumour detection domains, along with their frequency of being annotated in a training example's prompt.

Rumour Detection
Domain                   Training (src)   Dev (src)   Test (trg)
Charlie-Hebdo (C)        1,663            416         2,079
Ferguson (FR)            914              229         1,143
Germanwings-crash (GW)   375              94          469
Ottawa-shooting (OS)     712              178         890
Sydney-siege (S)         976              245         1,221

MNLI
Domain           Training (src)   Dev (src)   Test (trg)
Fiction (F)      2,547            1,972       1,972
Government (G)   2,541            1,944       1,944
Slate (SL)       2,605            1,954       1,954
Telephone (TL)   2,754            1,965       1,965
Travel (TR)      2,541            1,975       1,975

Aspect
Domain            Training (src)   Dev (src)   Test (trg)
Device (D)        2,302            255         1,279
Laptops (L)       2,726            303         800
Restaurants (R)   3,487            388         800
Service (SE)      1,343            149         747

Table 4: The number of examples in each domain of our three tasks. We denote the examples used when a domain is included as a source domain (src), and when it is the target domain (trg).

We follow the training and test splits defined by Gong et al. (2020) for the D and SE domains, while the splits for the L and R domains are taken from the SemEval-2014 ABSA challenge. To establish our development set, we randomly sample 10% of the training data.

5.2 Evaluated Models

Our main model is PADA: The multi-task model that first generates the domain name and domain related features to form a prompt, and then uses this prompt to predict the task label (§4.1, Figure 2). We compare it to two types of models: (a) T5-based baselines corresponding to ideas presented in multi-source DA work, as well as other recent state-of-the-art models (§2); and (b) Ablation models that use specific parts of PADA, to highlight the importance of its components.


5.2.1 Baseline Models

Transformer-based Mixture of Experts (Tr-MoE) For each source domain, a separate transformer-based DistilBERT expert model (Sanh et al., 2019) is trained on the domain's training set, and an additional model is trained on the union of training sets from all source domains. At test time, the average of the class probabilities of these models is calculated and the highest probability class is selected. This model is named MoE-avg by Wright and Augenstein (2020) and has been demonstrated to achieve state-of-the-art performance for Rumour Detection.
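The MoE-avg inference rule reduces to averaging the experts' class probabilities; a minimal sketch (with made-up probabilities) follows.

```python
# A sketch of the MoE-avg inference rule used by Tr-MoE and T5-MoE:
# average the experts' class probabilities and take the argmax.
import numpy as np

def moe_predict(expert_probs):
    # expert_probs: list of arrays, one per expert, each of shape (n_classes,)
    return int(np.argmax(np.mean(expert_probs, axis=0)))

# e.g., three experts over two classes:
print(moe_predict([np.array([0.7, 0.3]),
                   np.array([0.4, 0.6]),
                   np.array([0.55, 0.45])]))  # -> 0
```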

T5-MoE A T5-based MoE ensemble model. For each source domain, a separate pre-trained T5 model is fine-tuned on the domain's training set (i.e. a domain expert model). During inference, the final predictions of the model are decided using the same averaging procedure as in Tr-MoE.

T5-No-Domain-Adaptation (T5-NoDA) A pre-trained T5 model, which feeds the same task classifier used in PADA (see below) to predict the task label. In each DA setting, the model is trained on the training data from all source domains.

We also experiment with an in-domain version of this model, T5-UpperBound (T5-UB), which is tested on the development data of each domain. We treat T5-UB performance as an upper bound for the average target performance across all DA settings, for any T5-based model in our setup.

T5-Domain-Adversarial-Network (T5-DAN) A model that integrates T5-NoDA with an adversarial domain classifier to learn domain-invariant representations.13

13 We also experimented with BERT-NoDA and BERT-DAN models. We do not report their results since they were consistently outperformed by T5-NoDA and T5-DAN.

T5-Invariant-Risk-Minimization (T5-IRM) A T5-based model which penalizes feature distributions that have different optimal linear classifiers for each domain. The model is trained on the training data from all source domains.

IRM (Arjovsky et al., 2019) and DAN (Ganin et al., 2016) are established algorithms in the domain robustness literature, for generalization to unseen distributions (Koh et al., 2020).

5.2.2 Ablation Models

Prompt-DN A simplified version of our PADA model, which assigns only a domain name as a prompt to the input text. Since the domain name is unknown at test time, we create multiple variants of each test example, each with one of the training domain names as a prompt. For the final predictions of the model, we follow the same averaging procedure as in Tr-MoE and T5-MoE.

Prompt-RDW and Prompt-REW Two simplified versions of PADA which form prompts from Random-Domain-Words and Random-Example-Words, respectively. For Prompt-RDW, we sample m = 5 domain words (according to their distribution in the joint vocabulary of all source domains) for each example. For Prompt-REW, we randomly select m = 5 words from the example's text. At both training and test times, we follow the same prompt formation procedures.

PADA-NP (No Prompt) A multi-task model similar to PADA, except that it simultaneously generates the example-specific domain name and DRF-based prompt, and predicts the task label (Figure 3a). Since this model does not condition the task prediction on the generated prompt, it sheds light on the effect of the autoregressive nature of PADA.

PADA-NM (No Multi-task) A pipeline of two independent models which emulates PADA. Given an input example, the first model generates a unique prompt for it. Then, the second model predicts the task label given the input and its generated prompt (Figure 3b). Since the prediction and prompt generation tasks are not performed jointly, nor are the model parameters shared between the tasks, this pipeline sheds light on the effect of the multi-task nature of PADA.

5.3 Implementation Details

For all implemented models we use the HuggingFace Transformers library (Wolf et al., 2020).14

The T5-based text classification models do not follow the same procedure originally described in Raffel et al. (2020). Instead, we add a simple 1D-CNN classifier on top of the T5 encoder to predict the task label (Figure 2). The number of filters in this classifier is 32, with a filter size of 9.15 The generative component of the T5-based models is identical to that of the original T5.
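The following is a sketch of such a classification head with the stated hyper-parameters (32 filters of size 9); the pooling scheme and other details beyond these are assumptions.

```python
# A sketch of a 1D-CNN classification head over T5 encoder states.
import torch
import torch.nn as nn

class CNNClassifier(nn.Module):
    def __init__(self, hidden_size=768, n_classes=2):
        super().__init__()
        self.conv = nn.Conv1d(hidden_size, 32, kernel_size=9)  # 32 filters, size 9
        self.out = nn.Linear(32, n_classes)

    def forward(self, encoder_states):  # (batch, seq_len, hidden_size)
        x = self.conv(encoder_states.transpose(1, 2))   # (batch, 32, L')
        x = torch.relu(x).max(dim=2).values             # global max pooling
        return self.out(x)                              # (batch, n_classes)

logits = CNNClassifier()(torch.randn(4, 128, 768))
print(logits.shape)  # torch.Size([4, 2])
```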

14 https://github.com/huggingface/transformers

15 We experimented with the original T5 classification method as well, but PADA consistently outperformed it.


Figure 3: PADA ablation models: (a) PADA-NP, which follows a multi-task training protocol, but does not condition its prediction on the generated prompt; (b) PADA-NM, which separately trains a prompt generation model (→) and a prompted task prediction model (→).

Our T5-based models for Aspect Prediction cast sequence tagging as a sequence-to-sequence task, employing the text-to-text approach of Raffel et al. (2020) to generate a 'B' (begin), 'I' (in) or 'O' (out) token for each input token. Other than this change, these models are identical to the T5-based models for text classification.
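As a small illustration of this text-to-text casting (the review sentence below is made up):

```python
# A sketch of casting aspect tagging as text-to-text: the target string is
# one B/I/O token per input token, as described above.
tokens = ["The", "battery", "life", "is", "great"]
tags   = ["O",   "B",       "I",    "O",  "O"]
print(" ".join(tokens), "->", " ".join(tags))
```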

We train all text classification models for 5 epochs and all sequence tagging models for 60 epochs, with an early stopping criterion according to performance on the development data. We use the cross-entropy loss function for all models, optimizing their parameters with the ADAM optimizer (Kingma and Ba, 2015). We employ a batch size of 32 for text classification and 24 for sequence tagging, a warmup ratio of 0.1, and a learning rate of 5·10^-5. The maximum input and output lengths of all T5-based models are set to 128 tokens. We pad shorter sequences and truncate longer ones to the maximum input length.

For PADA, we tune the α (example proportion-mixture, see §4.1) parameter considering the value range {0.1, 0.25, 0.5, 0.75, 0.9}. The chosen values are: α_rumour = 0.75, α_mnli = 0.1 and α_absa = 0.1. For each training example, we select the top m = 5 DRFs most associated with it for its prompt. For the generative component of the T5-based models, we perform inference with the Diverse Beam Search algorithm (Vijayakumar et al., 2016), considering the following hyper-parameters: we generate 5 candidates, using a beam size of 10, with 5 beam groups, and a diversity penalty value of 1.5. The l and ρ parameters of the DRF extraction procedure (§4.2) were tuned to 1000 and 1.5, respectively, for all domains.
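A sketch of the generation call with these Diverse Beam Search hyper-parameters, using the HuggingFace generate API; t5-base stands in for a fine-tuned PADA checkpoint, and the input string is illustrative.

```python
# A sketch of prompt generation with Diverse Beam Search and the stated
# hyper-parameters.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

inputs = tokenizer("Domain: A sad day for France, for journalism.",
                   return_tensors="pt")
candidates = model.generate(
    **inputs,
    max_length=128,
    num_beams=10,            # beam size of 10
    num_beam_groups=5,       # 5 beam groups
    diversity_penalty=1.5,   # diversity penalty of 1.5
    num_return_sequences=5,  # generate 5 candidates
)
for seq in candidates:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```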

6 Results

Text Classification Table 5 presents our results. We report the binary-F1 score for Rumour Detection, and the macro-F1 score for MNLI.16 PADA outperforms all baseline models (§5.2.1) in 7 of 10 settings and reaches the highest result in another setting (together with T5-NoDA), exhibiting average performance gains of 3.5% and 1.3% in Rumour Detection and MNLI, respectively, over the best performing baseline model. Interestingly, it is T5-NoDA, which does not perform any DA, that outperforms (on average and in most model-to-model comparisons) all other baseline models, including the MoE models.

While the performance gains differ between the tasks, they partly stem from the different performance gaps between source and target domains in each of these tasks. Recall that we consider the T5-UB performance on its development sets for Rumour Detection (82.8%) and MNLI (80.8%) to be the upper bound for the average target performance across all DA settings, for any T5-based model. When considering the gaps between this upper bound and T5-NoDA (65.8% for Rumour Detection and 78.3% for MNLI), PADA reduces the error rate by 21% for Rumour Detection and 52% for MNLI. The improvements gained by PADA are in fact substantial in both tasks.

The advantage of PADA over MoE goes beyond improved predictions. Particularly, for PADA we train a single model, while for MoE we train a unique model for each source domain; hence the number of parameters in the MoE framework linearly increases with the number of source domains. For example, in our setups, Tr-MoE trains five DistilBERT models (one for each source domain and one for all source domains together), resulting in 5 · 66M = 330M parameters. In contrast, the PADA models keep the 220M parameters of T5, regardless of the number of source domains.

                         Rumour Detection                                  MNLI
            All→C  All→FR  All→GW  All→OS  All→S   AVG    All→F  All→G  All→SL  All→TE  All→TR   AVG
Tr-MoE       68.0    46.1    74.8    58.2   64.9   62.4    64.3   73.9    65.3    62.4    69.8   67.1
T5-MoE       68.1    46.0    73.6    65.3   66.3   63.9    74.0   82.0    73.4    74.6    78.3   76.5
T5-DAN       64.9    52.4    69.1    72.7   64.4   64.7    74.4   76.3    61.0    72.4    77.7   72.4
T5-IRM       63.5    39.4    70.1    44.2   65.7   56.6    72.0   81.5    73.2    69.3    78.9   75.0
T5-NoDA      64.1    46.9    75.1    72.0   71.0   65.8    76.4   83.5    75.5    74.9    81.3   78.3

Prompt-DN    66.4    53.7    72.4    71.4   70.1   66.8    77.0   84.4    75.6    76.3    80.5   78.8
Prompt-RDW   64.1    53.1    71.8    66.0   70.0   65.0    76.0   84.2    76.6    77.0    79.9   78.7
Prompt-REW   64.2    54.3    71.6    70.0   69.1   65.8    75.7   81.4    76.7    78.8    81.2   78.7

PADA-NP      65.8    54.8    71.6    72.2   74.0   67.7    76.2   83.6    75.4    77.2    81.4   78.8
PADA-NM      63.6    54.1    74.3    70.1   70.3   66.5    76.0   83.7    76.5    78.0    81.0   79.0

PADA         68.6    54.4    73.0    75.2   75.1   69.3    76.4   83.4    76.9    78.9    82.5   79.6

Table 5: Binary-F1 scores for the Rumour Detection task and macro-F1 scores for the MNLI task.

16 Binary-F1 measures the F1 score of the positive class. It is useful in cases of unbalanced datasets where the positive class is of interest (34% of the Rumour Detection dataset).

Sequence Tagging In order to demonstrate the wide applicability of our approach, we go beyond text classification (with 2 (Rumour Detection) or 3 (MNLI) classes) and also consider Aspect Prediction: a sequence tagging task. We are particularly curious to see if the aforementioned patterns replicate in this qualitatively different task. Our results are presented in Table 6, where we report the binary-F1 score (the F1 score of the aspect class). Crucially, the patterns we observe for text classification can also be detected for sequence tagging. Particularly, PADA is the best performing model in 4 of 4 settings compared to its baselines. On average, PADA outperforms the second-best model, T5-IRM, by 3.5%. Given the average results of T5-UB (69.4%) and T5-NoDA (38.7%), the error reduction is 24%.

             Aspect Prediction
            All→D  All→L  All→R  All→SE   AVG
T5-MoE       39.5   31.4   31.4    30.9   33.3
T5-DAN       28.4   38.0   49.1    33.4   33.2
T5-IRM       37.1   44.6   47.4    41.5   42.7
T5-NoDA      31.1   45.6   40.2    37.9   38.7

Prompt-DN    41.1   42.6   29.0    30.8   35.9
Prompt-RDW   34.6   46.9   52.9    41.2   43.9
Prompt-REW   38.2   49.5   45.1    39.6   43.1

PADA-NP      41.7   48.2   50.1    40.1   45.0
PADA-NM      40.3   48.8   50.8    40.2   45.0

PADA         43.1   50.9   50.8    45.3   47.5

Table 6: Binary-F1 scores for Aspect Prediction.

PADA Ablation Models As shown in Table 5, PADA outperforms all of its variants (§5.2.2) in 6 out of 10 text classification settings overall. Furthermore, in the sequence tagging task (Table 6), PADA outperforms its simpler variants (Prompt-{DN, REW}, PADA-NP) in all 4 setups, and Prompt-RDW, PADA-NM in 3 out of 4 setups. These results highlight the importance of our design choices: (a) including DRFs in the example-specific prompts, tailoring them to express the relation between the source domains and the test example (PADA vs Prompt-{DN, RDW, REW}); (b) utilizing an autoregressive component, where the generated DRF prompts are used by the task classification component (PADA vs PADA-NP); and (c) leveraging a multi-task training objective (PADA vs PADA-NM). A noticeable difference in the aspect prediction results from the text classification results is the weakness of Prompt-DN, which is outperformed by all baseline models (§5.2.1) in 2 setups, and by 2 of these models in a third setup, as well as on average across all setups. This is yet another indication of the importance of the DRFs in the prompt generated by PADA.

        BERTScore  ROUGE-1  ROUGE-2  ROUGE-L
Dev F1       0.94     0.64     0.30     0.61

Table 7: Average F1 scores for our automatic evaluation metrics, calculated for generated prompts compared to annotated prompts over all development sets in the rumour detection task.

7 Ablation Analysis

In this section, we analyze several unique aspects of PADA. We first evaluate the prompts generated by PADA, to gain further insight into its generative capabilities. We then analyze the impact of the number of source domains on PADA's performance. Finally, we examine performance drops due to domain shifts, in order to evaluate PADA's adaptation stability across domains. For the sake of clarity and concision, analyses will henceforth focus on the rumour detection task.

Generated Prompts Analysis We first present an intrinsic evaluation of PADA's prompt generation task (see §4.1) by examining model-generated prompts for examples from the development set, compared to their annotated prompts.17 We choose automatic metrics widely used for evaluating NLG tasks, focusing on n-gram overlap by calculating ROUGE (Lin, 2004) scores, as well as measuring semantic similarity with BERTScore (Zhang et al., 2020). In Table 7 we present average F1 scores for these metrics, calculated over all DA settings in the rumour detection task. The high average BERTScore (0.94) indicates that the generated prompts share high semantic similarity with their annotated prompts. Yet, the average ROUGE-1 (0.64) and ROUGE-2 (0.3) scores indicate that the generated prompts vary on their unigram and bigram levels (respectively), compared with their annotated prompts. This evidence suggests that PADA learns to leverage the semantic overlaps between DRFs, rather than memorizing specific n-grams (e.g., an annotated DRF may be terrorist while the generated one may be gunman).

17 PADA is not restricted to specific structures or vocabulary when generating prompts, hence our annotated prompts only serve as pseudo gold labels for training purposes.

Figure 4: Average token source ratios in generated prompts, calculated over all development and test sets in the rumour detection task.

We continue our evaluation by analyzing the origins of words in the PADA-generated prompts, specifically, whether they appear in the source domains' DRF sets, the input text, or in neither (Novel). Figure 4 presents the average ratios of different origins for generated prompt tokens, calculated over all DA settings in the rumour detection task. As expected, the overwhelming majority of generated tokens come from the source domains' DRF sets, for both development (92.7%) and test (75.3%) sets. However, when introduced to examples from unknown domains (test sets), we observe a significant increase (compared to the development sets) in novel tokens (18.9% vs 5.4%) and a slight increase in tokens from the example's input text (14.1% vs 11.7%).

Furthermore, Figure 5 demonstrates that PADA is able to exploit information from its multiple source domains. For test examples, PADA generates prompts containing DRFs from several domains (95% of prompts contain DRFs from more than 2 source domains), while for development examples it mostly generates prompts with DRFs only from the correct source domain. Together with the examples presented in Table 1, these observations suggest an encouraging finding: PADA is successful in generating prompts which leverage and integrate both the source domains and the semantics of the input example.

Figure 5: Average ratios of the number of domains in generated prompts, calculated over all development and test sets in the rumour detection task.

Number of Source Domains We next turn to study the impact of the number of source domains on PADA's overall performance. Figure 6 presents F1 scores by the number of source domains for PADA and two of its baselines, namely T5-NoDA and T5-MoE. We provide results on two target domains, as well as an average score across all five target domains from the rumour detection dataset.

As indicated in the figure, PADA's performance improves as the number of source domains increases. These results support our claim that PADA is able to integrate knowledge from multiple source domains by learning a meaningful domain-mixture, and it then leverages this knowledge when introduced to an example from a new, unknown, domain. Interestingly, for the baseline models T5-NoDA and T5-MoE, it seems that including more source domains can sometimes harm their ability to generalize to unknown target domains. One of our main hypotheses states that a DA model stands to benefit from incorporating combined knowledge from multiple source domains (§4). PADA successfully implements this idea, while T5-MoE and T5-NoDA fall short.

Figure 6: Performance on the Rumour Detection task by the number of source domains the model was trained on. Darker hues represent a larger number of source domains.

Performance Drops between Source and Target When a DA method improves model performance on the target domain, this can result in either increasing or decreasing the performance gap between the source and target domains. If a model performs similarly on its source training domains and on unseen target domains, its source domain performance can also provide an important indication of its future performance in such unseen domains. We hence consider such stability in performance as a desired property in our setup, where future target domains are unknown (see the discussion in Ziser and Reichart (2019)).

Figure 7: A heatmap presenting performance drops between source domains and target domains (columns), for the rumour detection task. Darker colors represent smaller performance drops.

Figure 7 presents a heatmap, depicting the performance drop for each model between the source domains and the target domains in rumour detection. We measure each model's in-domain performance by calculating an F1 score across all development examples from its source domains, as well as out-of-domain performance on the target domain test set, as described in §6. We then calculate the difference between the source and the target performance measures, and report results for the best performing models in our experiments (§6). The general trend is clear: PADA not only performs better on the target domain, but it also substantially reduces the source-target performance gap. While T5-NoDA, which is not a DA model, triggers the largest average absolute performance drop, 17%, PADA's average absolute performance drop is 8.7%.

8 Discussion

We addressed the problem of multi-source domain adaptation when the target domain is not known at training time. Effective models for this setup can be applied to any target domain, with no data requirements about the target domains and without an increase in the number of model parameters as a function of the number of source or target domains. PADA, our algorithm, extends the prompting mechanism of the T5 autoregressive language model to generate a unique textual prompt per example. Each generated prompt maps its test example into a semantic space spanned by the source domains.

Our experimental results with three tasks and fourteen multi-source adaptation settings demonstrate the effectiveness of our approach compared to strong alternatives, as well as the importance of the model components and of our design choices. Moreover, as opposed to the MoE paradigm, where a model is trained separately for each source domain, PADA provides a single unified model. Intuitively, this approach also seems more cognitively plausible: a single model attempts to adapt itself to examples from new incoming domains, rather than employing an independent model per domain.

The prompt generation mechanism of PADA is naturally limited by the set of source domains it is trained on. This might yield sub-optimal DRFs in prompts generated for examples stemming from target domains which are semantically unrelated to any of the source domains. To alleviate this issue, we allow PADA to generate non-DRF words. Still, our prompt generation training process does not directly optimize for the downstream prediction task's objective, which might also contribute to sub-optimally generated prompts. In future work, we hope to improve these aspects of our approach and explore natural extensions that accommodate multiple tasks and domains in a single model.

Acknowledgements

We would like to thank the action editor and the reviewers, as well as the members of the IE@Technion NLP group for their valuable feedback and advice. This research was partially funded by an ISF personal grant No. 1625/18.


References

Martín Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. 2019. Invariant risk minimization. CoRR, abs/1907.02893.

Eyal Ben-David, Carmel Rabinovitz, and Roi Reichart. 2020. PERL: Pivot-based domain adaptation for pre-trained deep contextualized embedding models. Transactions of the Association for Computational Linguistics, 8:504–521.

John Blitzer, Mark Dredze, and Fernando Pereira. 2007. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, June 23-30, 2007, Prague, Czech Republic. The Association for Computational Linguistics.

John Blitzer, Dean P. Foster, and Sham Machandranath Kakade. 2009. Zero-shot domain adaptation: A multi-view approach.

John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In EMNLP 2006, Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, 22-23 July 2006, Sydney, Australia, pages 120–128. ACL.

Samuel Bowman, Gabor Angeli, Christopher Potts, and Christopher Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 632–642. The Association for Computational Linguistics.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Minmin Chen, Zhixiang Eddie Xu, Kilian Q. Weinberger, and Fei Sha. 2012. Marginalized denoising autoencoders for domain adaptation. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012. icml.cc / Omnipress.

Xilun Chen and Claire Cardie. 2018. Multinomial adversarial networks for multi-domain text classification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 1226–1240. Association for Computational Linguistics.

Hal Daumé III and Daniel Marcu. 2006. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26:101–126.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor S. Lempitsky. 2016. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17:59:1–59:35.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 3816–3830. Association for Computational Linguistics.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pages 513–520. Omnipress.

Chenggong Gong, Jianfei Yu, and Rui Xia. 2020. Unified feature and instance based domain adaptation for aspect-based sentiment analysis. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 7035–7045. Association for Computational Linguistics.

Jiang Guo, Darsh J. Shah, and Regina Barzilay. 2018. Multi-source domain adaptation with mixture of experts. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 4694–4703. Association for Computational Linguistics.

Xiaochuang Han and Jacob Eisenstein. 2019. Unsupervised domain adaptation of contextualized embeddings for sequence labeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 4237–4247. Association for Computational Linguistics.

Xu Han, Weilin Zhao, Ning Ding, Zhiyuan Liu, and Maosong Sun. 2021. PTR: Prompt tuning with rules for text classification. CoRR, abs/2105.11259.

Adi Haviv, Jonathan Berant, and Amir Globerson. 2021. BERTese: Learning to speak to BERT. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Online, April 19-23, 2021, pages 3618–3623. Association for Computational Linguistics.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, Washington, USA, August 22-25, 2004, pages 168–177. ACM.

Weihua Hu, Gang Niu, Issei Sato, and Masashi Sugiyama. 2018. Does distributionally robust supervised learning give robust classifiers? In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 2034–2042. PMLR.

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know. Transactions of the Association for Computational Linguistics, 8:423–438.

Young-Bum Kim, Karl Stratos, and Dongchan Kim. 2017. Domain attention with an ensemble of experts. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 643–653. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang. 2020. WILDS: A benchmark of in-the-wild distribution shifts. CoRR, abs/2012.07421.

Entony Lekhtman, Yftah Ziser, and Roi Reichart. 2021. DILBERT: Customized pre-training for domain adaptation with category shift, with an application to aspect extraction. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 219–230. Association for Computational Linguistics.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. CoRR, abs/2104.08691.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 7871–7880. Association for Computational Linguistics.

Shoushan Li and Chengqing Zong. 2008. Multi-domain sentiment classification. In ACL 2008, Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, June 15-20, 2008, Columbus, Ohio, USA, Short Papers, pages 257–260. The Association for Computer Linguistics.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 4582–4597. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of Workshop on Text Summarization Branches Out, Post Conference Workshop of ACL 2004.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021a. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. CoRR, abs/2107.13586.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021b. GPT understands, too. CoRR, abs/2103.10385.

Ping Luo, Fuzhen Zhuang, Hui Xiong, Yuhong Xiong, and Qing He. 2008. Transfer learning from multiple source domains via consensus regularization. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM 2008, Napa Valley, California, USA, October 26-30, 2008, pages 103–112. ACM.

David McClosky, Eugene Charniak, and Mark Johnson. 2010. Automatic domain adaptation for parsing. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 2-4, 2010, Los Angeles, California, USA, pages 28–36. The Association for Computational Linguistics.

Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. 2013. Domain generalization via invariant feature representation. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, volume 28 of JMLR Workshop and Conference Proceedings, pages 10–18. JMLR.org.

Mathias Müller, Annette Rios, and Rico Sennrich. 2020. Domain robustness in neural machine translation. In Proceedings of the 14th Conference of the Association for Machine Translation in the Americas, AMTA 2020, Virtual, October 6-9, 2020, pages 151–164. Association for Machine Translation in the Americas.

Yonatan Oren, Shiori Sagawa, Tatsunori B. Hashimoto, and Percy Liang. 2019. Distributionally robust language modeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 4226–4236. Association for Computational Linguistics.

Johannes von Oswald, Christian Henning, João Sacramento, and Benjamin F. Grewe. 2020. Continual learning with hypernetworks. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.


Mark Palatucci, Dean Pomerleau, Geoffrey E. Hinton, and Tom M. Mitchell. 2009. Zero-shot learning with semantic output codes. In Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009. Proceedings of a meeting held 7-10 December 2009, Vancouver, British Columbia, Canada, pages 1410–1418. Curran Associates, Inc.

Sinno Jialin Pan, Xiaochuan Ni, Jian-Tao Sun, Qiang Yang, and Zheng Chen. 2010. Cross-domain sentiment classification via spectral feature alignment. In Proceedings of the 19th International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, USA, April 26-30, 2010, pages 751–760. ACM.

Kuan-Chuan Peng, Ziyan Wu, and Jan Ernst. 2018. Zero-shot deep domain adaptation. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XI, volume 11215 of Lecture Notes in Computer Science, pages 793–810. Springer.

Jonas Pfeiffer, Ivan Vulic, Iryna Gurevych, and Sebastian Ruder. 2020. MAD-X: An adapter-based framework for multi-task cross-lingual transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 7654–7673. Association for Computational Linguistics.

Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. SemEval-2014 task 4: Aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation, SemEval@COLING 2014, Dublin, Ireland, August 23-24, 2014, pages 27–35. The Association for Computer Linguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67.

Alan Ramponi and Barbara Plank. 2020. Neural unsupervised domain adaptation in NLP - A survey. In Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, pages 6838–6855. International Committee on Computational Linguistics.

Roi Reichart and Ari Rappoport. 2007. Self-training for enhancement and domain adaptation of statistical parsers trained on small datasets. In ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, June 23-30, 2007, Prague, Czech Republic. The Association for Computational Linguistics.

Mark B. Ring. 1995. Continual learning in reinforcement environments. Ph.D. thesis, University of Texas at Austin, TX, USA.

Brian Roark and Michiel Bacchiani. 2003. Supervised and unsupervised PCFG adaptation to novel domains. In Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL 2003, Edmonton, Canada, May 27 - June 1, 2003. The Association for Computational Linguistics.

Andreas Rücklé, Jonas Pfeiffer, and Iryna Gurevych. 2020. MultiCQA: Zero-shot transfer of self-supervised text matching models on a massive scale. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 2471–2486. Association for Computational Linguistics.

Alexander M. Rush, Roi Reichart, Michael Collins, and Amir Globerson. 2012. Improved parsing and POS tagging using inter-sentence consistency constraints. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL 2012, July 12-14, 2012, Jeju Island, Korea, pages 1434–1444. ACL.

Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. 2020. Distributionally robust neural networks. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.


Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. CoRR, abs/1910.01108.

Teven Le Scao and Alexander M. Rush. 2021. How many data points is a prompt worth? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 2627–2636. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Online, April 19-23, 2021, pages 255–269. Association for Computational Linguistics.

Tobias Schnabel and Hinrich Schütze. 2014. FLORS: Fast and simple domain adaptation for part-of-speech tagging. Transactions of the Association for Computational Linguistics, 2:15–26.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 4222–4235. Association for Computational Linguistics.

Fan-Keng Sun and Cheng-I Lai. 2020. Conditioned natural language generation using only unconditioned language model: An exploration. CoRR, abs/2011.07347.

Cigdem Toprak, Niklas Jakob, and Iryna Gurevych. 2010. Sentence and expression level annotation of opinions in user-generated discourse. In ACL 2010, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, July 11-16, 2010, Uppsala, Sweden, pages 575–584. The Association for Computer Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.

Ashwin K. Vijayakumar, Michael Cogswell, Ramprasaath R. Selvaraju, Qing Sun, Stefan Lee, David J. Crandall, and Dhruv Batra. 2016. Diverse beam search: Decoding diverse solutions from neural sequence models. CoRR, abs/1610.02424.

Yoav Wald, Amir Feder, Daniel Greenfeld, and Uri Shalit. 2021. On calibration and out-of-domain generalization. In Thirty-Fifth Conference on Neural Information Processing Systems.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Dustin Wright and Isabelle Augenstein. 2020. Transformer based multi-source domain adaptation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7963–7974.

Yi Yang and Jacob Eisenstein. 2014. Fast easy unsupervised domain adaptation with marginalized structured dropout. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 2: Short Papers, pages 538–544. The Association for Computer Linguistics.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.

Han Zhao, Shanghang Zhang, Guanhang Wu, João Paulo Costeira, José M. F. Moura, and Geoffrey J. Gordon. 2018. Multiple source domain adaptation with adversarial learning. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Workshop Track Proceedings. OpenReview.net.

Yftah Ziser and Roi Reichart. 2017. Neural structural correspondence learning for domain adaptation. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), Vancouver, Canada, August 3-4, 2017, pages 400–410. Association for Computational Linguistics.

Yftah Ziser and Roi Reichart. 2018. Pivot based language modeling for improved neural domain adaptation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 1241–1251. Association for Computational Linguistics.

Yftah Ziser and Roi Reichart. 2019. Task refinement learning for improved accuracy and stability of unsupervised domain adaptation. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 5895–5906. Association for Computational Linguistics.

Arkaitz Zubiaga, Maria Liakata, and Rob Procter. 2017. Exploiting context for rumour detection in social media. In Social Informatics - 9th International Conference, SocInfo 2017, Oxford, UK, September 13-15, 2017, Proceedings, Part I, volume 10539 of Lecture Notes in Computer Science, pages 109–123. Springer.

Arkaitz Zubiaga, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, and Peter Tolmie. 2016. Analysing how people orient to and spread rumours in social media by looking at conversational threads. PLoS ONE, 11(3):e0150989.