Intrinsic Evaluation of Grammatical Information within

Intrinsic Evaluation of Grammatical Information within Word Embeddings

Daniel EdmistonDepartment of Linguistics

University of ChicagoChicago, IL, USA

[email protected]

Taeuk KimDepartment of Computer Science and Engineering

Seoul National UniversitySeoul, Korea

[email protected]

Abstract

This work presents a proof-of-concept studyfor a framework of intrinsic evaluation of con-tinuous embeddings as used in NLP tasks.This evaluation method compares the geome-try of such embeddings with ground-truth em-beddings in a linguistically-inspired, discretefeature space. Using model distillation (Hin-ton et al., 2015) as a means of extracting mor-phological information from models with noexplicit morphological awareness (e.g. word-atomic models), we train multiple learner net-works which do model morpheme composi-tion so as to compare the amount of gram-matical information different models capture.We use Korean affixes as a case-study, as theyencode multiple types of linguistic informa-tion (phonological, syntactic, semantic, andpragmatic), and allow us to investigate specifictypes of linguistic generalizations models mayor may not be sensitive to.

1 Introduction

While NLP systems built with neural network archi-tectures have dominated the field in recent years, itis often lamented that their improved performancehas come at the cost of understanding the models.Furthermore, recent work on natural language in-ference (McCoy et al., 2019) has cast doubt on theability of such models to generalize linguistic pat-terns effectively. Particularly, carefully selected ex-amples are shown to fool these systems, suggest-ing they learn heuristics for performing well on datasets rather than truly capturing linguistic informa-tion. As such, methods of probing the informa-

tion within these models are of growing importance.This work proposes such a method by comparingthe geometry of ground-truth, discrete-space embed-dings against continuous embeddings learnt by neu-ral networks. To do this, we extract morpholog-ical information from different embedding modelsvia transparent model distillation (Tan et al., 2017)and examine the resulting morpheme embeddingsfor grammatical information. By investigating theembeddings directy this falls under the rubric of in-trinsic evaluation. This complements extrinsic eval-uation methods, where embeddings’ ability to serveas input for classifiers on linguistic tasks is seen as aproxy for their linguistic content.

As is, intrinsic evaluation methods in the litera-ture most often test for lexical semantics. For in-stance, the word analogy task tests similarities be-tween pairs of words, e.g. king is to queen as man isto what? Here, since our ground-truth embeddingsreflect grammatical properties, the similarities be-tween them and the continuous representations canbe taken to reflect grammatical, rather than lexicalcontent in the continuous representations.

As an initial investigation, we compare the mor-phological information from multiple methods ofembedding Korean words into continuous space, us-ing three learner networks to distill morpheme repre-sentations for comparison. These are the STAFFNET

architecture of Edmiston and Stratos (2018), themorphological recursive neural network MRNNmodel of Luong et al. (2013), and a TREELSTM(Tai et al., 2015) over morphological parses. By dis-tilling into these networks, we extract explicit mor-pheme representations. We focus on the affix repre-

395 Pacific Asia Conference on Language, Information and Computation (PACLIC 33), pages 395-404, Hakodate, Japan, September 13-15, 2019

Copyright © 2019 Daniel Edmiston and Taeuk Kim

sentations, as they house the grammatical informa-tion we are testing for.

Korean affixes are the choice for this pilot studyfor multiple reasons. (i) Korean is an agglutinativelanguage, meaning there is (largely) a one-to-onecorrespondence between affixes and meanings. (ii)The morphology is highly regular, which facili-tates high-fidelity morphological parsing. (iii) Ko-rean affixes display at least four distinct types oflinguistic information. Phonolgical: Korean ex-hibits phonologically-driven allomorphy. Syntac-tic: Korean affixes contain syntactic informationas they attach to different syntactic units. Seman-tic: affixes perform different semantic functions,e.g. logical operators vs. focus markers. Prag-matic: Certain affixes are indicative of formal lan-guage, and others display honorific features. Thisallows us to run focused experiments which probewhat type of information different models are sen-sitive to.

Having distilled affix embeddings from variousground-truth models, we run our evaluation taskby comparing the distilled representations of dif-ferent models with ground-truth morpheme repre-sentations embedded in a discrete, linguistically-inspired feature space, and show that indeed neu-ral models are picking up on at least some formsof grammatical meaning. Results suggest that se-mantic relationships are more difficult to capturethan syntactic, and models appear insensitive tophonological/pragmatic information.Contributions: (i) We introduce a newlinguistically-inspired intrinsic evaluation taskwhich probes for grammatical meaning, ratherthan lexical meaning. (ii) We show results thatdifferent types of linguistic information arecaptured to differing degrees, and may be of adiffering nature from one another. (iii) To theauthors’ knowledge, this is the first application oftransparent model distillation for interpretation inan NLP setting. (iv) We focus on an under-studiedlanguage in NLP, which also happens to betypologically far removed from those usuallystudied in the literature.

2 Related Work

This work falls in the context of the emerging lit-erature on the interpretability of neural networkmodels using model distillation, and on the inter-pretability of linguistic representations learnt forNLP tasks.

In the broader context of interpreting the behav-ior of neural network models, model distillationhas emerged as a viable method. Tan et al. (2018b)use so-called transparent model distillation to au-dit black-box risk scoring models. By distilling ablack-box model into a transparent learner model,and comparing this learner model with a non-distilled transparent model trained on ground-truthdata, they are able to gain insights into black-box models. Likewise, Zhang et al. (2018) usemodel distillation (which they call knowledge dis-tillation) to extract human-interpretable featuresfrom the middle layers of convolutional networkstrained on computer vision tasks. While similar,our method differs slightly from these approachesin that we use model distillation to induce repre-sentations for linguistic units which would other-wise be unavailable (affixes).

As for the interpretability of linguistic represen-tations learnt for NLP tasks, extrinsic evaluationmethods have focused both on morphological in-formation (e.g. (Belinkov et al., 2017)) and syntac-tic information (e.g. (Adi et al., 2016)). As theseare extrinsic methods, the task consists of learn-ing representations, and then training classifiers oncarefully designed supervised learning tasks us-ing these representations as input. The tasks aremeant to probe for linguistic properties (e.g. nega-tion (Ettinger et al., 2016)), and performance onthese tasks can be interpreted as a proxy for theamount of linguistic information contained in theoriginal representations. Work in the intrinsic eval-uation domain has largely focused on testing lexi-cal semantics, as by comparing relations betweene.g. countries and capitals (Mikolov et al., 2013a).By comparison, here we intrinsically test for whatwe call grammatical information, to be made spe-cific in Section 3.3.

3 Background

3.1 Model Distillation

Model distillation (Hinton et al., 2015) is the tech-nique of training one neural network (the learnernetwork) to approximate the output of another (theground-truth network). While the technique wasoriginally designed to train relatively lightweightnetworks to approximate the outputs of larger,more cumbersome networks or ensembles, modeldistillation has recently been used to facilitate theinterpretation of so-called “black-box” neural net-work models (Tan et al., 2017, 2018a,b; Zhang

396

et al., 2018).The idea behind using model distillation for the

interpretation of word-embeddings in this studyis to train a learner network amenable to mor-phological interpretation to approximate a modelwhich is otherwise not straight-forward to ana-lyze morphologically. In this case, this includesword-atomic word-embedding models (such asWord2Vec), as well as word-embedding modelswhich compose embeddings from sub-word con-stituents (e.g. syllable-embedding models). As-suming successful distillation, the morphologi-cal representations of the learner network can beseen as a proxy for the morphological informa-tion captured by the original ground-truth net-work. Here, model distillation of a ground-truthmodel into a learner network proceeds by iterat-ing over the corpus the ground-truth model wasoriginally trained on. For each word, the ground-truth embedding y is calculated, as is the learnernetwork’s embedding y. Training proceeds by op-timizing the learner network’s parameters to min-imize the squared distance between y and y.

3.2 Learner Networks3.2.1 STAFFNET

The first learner network is the STAFFNET ar-chitecture. Introduced in Edmiston and Stratos(2018), STAFFNET (‘Stem-Affix Net’) is a dy-namic neural network architecture designed tocompose morpheme representations into fullword-embeddings in a linguistically plausibleway. The defining feature of STAFFNET is its dis-tinguishing of morphemes into stems and affixes,treating the former as vectors and the latter asfunctions over vectors. Here, we treat affixes aslinear transformations and represent them as ma-trices in Rd×d.

STAFFNET is a dynamic architecture, whosecomposition of a word-embedding varies with themorphological parse of the word. It composes aword-embedding in a three-step process. First, aword is decomposed into its constituent stems andaffixes.2 Second, the (potentially compound) stemrepresentation is calculated as the convex com-bination of the outputs of a BiLSTM into whichthe stems are fed. Third, any affixes are iterativelyapplied to the compound stem representation—the convex combination from the previous step.

2All morphological parsing done with the KomoranPOS-tagger available with the KoNLPy package. http://konlpy.org/en/latest/

Figure 1 illustrates this for the word cheese-burger.PL.NOM, or cheeseburgers marked withnominative case.

Figure 1: STAFFNET architecture showing the dynamiccomposition of cheese-burger.PL.NOM.

3.2.2 MRNN

The second learner network is the morphologicalRNN model of Luong et al. (2013), which con-stucts a word’s embedding from constituent mor-pheme embeddings by means of a recursive neuralnetwork over the binary tree of the word’s mor-phological parse. Parent nodes in the network arefunctions of their children nodes, and are calcu-lated as p = f(W [c1; c2] + b). That is, they arethe result of a non-linearity (here tanh) applied tothe output of an affine transformation over the con-catenated children embeddings. An example is asin Figure 2.

Figure 2: MORPHOLOGICAL RNN architecture show-ing the composition of un-fortunate-ly

397

http://konlpy.org/en/latest/

http://konlpy.org/en/latest/

3.2.3 TREELSTMThe final learner network is theN -ary Tree-LSTMof Tai et al. (2015). The transition equations followthe original paper, where k indexes the kth child ofparent node p.

ip = σ(W (i)xp +NΣl=1U

(i)l hpl + b(i))

fpk = σ(W (f)xp +NΣl=1U

(f)kl hpl + b(f))

op = σ(W (o)xp +NΣl=1U

(o)l hpl + b(o))

up = tanh(W (u)xp +NΣl=1U

(u)l hpl + b(u))

cp = ip � up +NΣl=1fpl � cpl

hp = op � tanh(cp)

Here, N is restricted to 2, as this model alsooperates over binary morphological parses.

3.3 Korean Affixes as a Case StudyAs mentioned above, Korean affixes exhibitat least four different types of information:phonological, syntactic, semantic, and prag-matic. Phonological information is shown throughphonologically-driven allomorphy. For example,the accusative marker -을 ‘-eul’ attaches to nom-inals ending in a coda, while its allomorph -를 ‘-leul’ attaches to nominals ending in vowels. A fa-miliar analogy from English would be the choicebetween ‘an’ and ‘a’, e.g. ‘an animal’ vs. ‘a dog.’Syntactic information is shown through place ofattachment (Cho and Sells, 1995). There are manysemantic dimensions along which affixes vary, e.g.some contribute focus semantics, others serve aslogical operators. Multiple pragmatic dimensionsof meaning are also evident in Korean affixation.Some affixes are reserved for written usage, oth-ers indicate the relationship between speaker andhearer, such as honorifics.

Having formalized the feature set of 107 Ko-rean affixes, we embed each affix into the binaryfeature space {0, 1}n, the dimensions of which areinterpreted as linguistic features (e.g. [+CODA] vs.[-CODA]) with 1 indicating the presence of thatfeature, 0 otherwise. These embeddings serve as‘ground-truth’ representations of Korean affixes,and geometrically can be interpreted as the cor-ners of an n−dimensional hypercube. We define

distance in this binary feature space with Ham-ming distance, where dH(x, y) is the number ofdimensions along which x and y differ.

4 Methodology

4.1 Ground-truth Models and Distillation

We propose a method of intrinsically evaluat-ing the extent to which different word-embeddingmodels capture the meaning of affixes in Ko-rean, and therefore how well they capture phono-logical, syntactic, semantic, and pragmatic infor-mation. The models we distill and compare areas follows. Character-level: fastText (Bojanowskiet al., 2017),3 Naver’s kor2vec model (based onKim et al. (2016))4, Syllable: The model of Choiet al. (2017), Word: Skip-gram and CBOW mod-els of Mikolov et al. (2013a), and GloVe Penning-ton et al. (2014).

All models were trained on a Korean Wikidumpwith vocabulary size limited to 10,000, and weretrained to produce embeddings in 300 dimensions,as suggested by Choi et al. (2016). Other hyper-parameters followed the suggestions of the orig-inal publications where applicable. Each modelwas then distilled into each of the learner archi-tectures, embedding affixes as vectors in R300 orin STAFFNET’s case, as matrices in R300×300.

4.2 Comparing Affix Representations forIntrinsic Evaluation

One standard method of performing intrinsic eval-uation of high-dimensional word-embeddings isanalogy tests (Mikolov et al., 2013b). In such atest, word-sets are assembled of the form a:b,c:d, where d is withheld. Given an embeddingmodel, an estimation of withheld d is given byd = ~b−~a+~c. An example is counted as correct ifthe vector ~d is the closest—via cosine similarity—in the embedding space to the model’s calculatedd, and incorrect otherwise.

In the case where affixes are represented by vec-tors, here too we make use of cosine similarity.Where affixes are represented as matrices, cosinesimilarity is not applicable, and instead we com-pare them via subspace similarity, as described inAlgorithm 1 (Mu et al., 2017).

Maff1 and Maff2 represent the matrix representa-tions of the affixes to be compared. N is an inte-

3https://fasttext.cc/docs/en/pretrained-vectors.html

4https://github.com/naver/kor2vec

398

https://fasttext.cc/docs/en/pretrained-vectors.html

https://fasttext.cc/docs/en/pretrained-vectors.html

https://github.com/naver/kor2vec

Algorithm 1 Subspace SimilarityInput: Maff1, Maff2, NOutput: score ∈ [0, 1]

X ← [pc(Maff1)1; ...; pc(Maff1)N ]Y ← [pc(Maff2)1; ...; pc(Maff2)N ]Z ← X>Y

score←√∑N

t=1 σ2t /N

return(score)

ger signifying the number of principal directionsto use. σt represents the tth singular value of Z.The algorithm proceeds by performing PCA onthe matrix inputs, and stacking the first N princi-pal directions into matrices in Rd×N . The sum ofsquares of the N singular values of the product ofthese matrices, divided by N , results in a similar-ity score in [0, 1]. The geometric intuition behindthis metric is that similar affixes should map stemsto similar subspaces.

To evaluate the distilled representations, wecompare the continuous embeddings against thediscrete embeddings in the following way. Giventhe set of all affixes A, consider the subset YAff =argminx∈A,x6=Aff

dH(x,Aff), the set of closest affixes to

any given affix in the ground truth discrete fea-ture space, where |YAff| = k. In the continuousspace, we define the k−closest affixes to any givenaffix with YAff = k-argmax

x∈A,x6=Affsim(x,Aff), where

sim(x, y) is cosine similarity or subspace similar-ity, depending on architecture. By design YAff andYAff are of the same cardinality.

Given YAff and YAff, we define two scores.The first is the percentage of overlap—the per-centage of the k-closest affixes in the continu-ous representation which are k-closest with re-gard to the ground-truth embeddings.5 The secondmeasures error; avg({dH(x,Aff) | x ∈ YAff}) −min

x∈A,x6=AffdH(x,Aff); that is, the true error with re-

gard to an affix, as calculated by the average ham-ming distance between Aff and x for x ∈ YAff,minus the minimum possible error. We label thispenalty the Hamming offset. Note that percentageof overlap and the Hamming offset provide twograded measures of success with regard to an affix,unlike all-or-nothing diagnostics like e.g. analogytests.

5Since it is always the case that |YAff| = |YAff| thisamounts to the F1 measure on cluster analysis.

4.3 Data Sets

This study makes use of two data sets. The first is aKorean WikiDump used to train the original mod-els, and also used as the corpus for model distil-lation. The second data set was hand-constructedfor this study, and constitutes the ground-truth rep-resentations used in the experiments below. It isthe embedding of 107 Korean affixes into a dis-crete, binary feature space, consisting of 62 di-mensions. These 62 dimensions correspond to lin-guistic features along which Korean affixes vary,and include features such as the aforementioned[+CODA], and [+HONORIFIC]. We divided the lin-guistic features into four distinct feature subsets,one for each of phonological, syntactic, semantic,and pragmatic features, so as to run tests on fea-ture subsets. Over the affixes, there are five distinctphonological configurations, eight distinct syntac-tic configurations, 55 distinct semantic configura-tions, and five distinct pragmatic configurations.

5 Experiments

5.1 Verifying Distillations

As we are testing models based on their distilledrepresentations rather than their original represen-tations, the first question to ask is whether the dis-tilled embeddings are a faithful recreation of themodels they are meant to emulate.

MODEL STAFFNET MRNN TREELSTM

KOR2VEC 0.971 0.840 0.952FASTTEXT 0.949 0.824 0.946SYLLABLE 0.967 0.869 0.979

W2V-SG 0.953 0.755 0.920W2V-CBOW 0.941 0.786 0.878

GLOVE 0.935 0.690 0.875

Table 1: Average cosine similarity between ground-truth embeddings and distilled embeddings over 10kvocabulary.

The results in Table 1 show that each of themodels was able to reproduce the original embed-dings to a very high degree of accuracy. This isespecially true given that the overwhelming ma-jority of volume in R300 is orthogonal to any givenpoint. We take this to mean that the models havebeen successfully distilled, and our distilled repre-sentations can serve as faithful representatives oftheir ground-truth models.

399

5.2 Closest Affix

This section tabulates the scores for each of themodels—as well as a random baseline—on the in-trinsic task described in Section 4.2. We test eachmodel distilled into each architecture, and for theSTAFFNET architecture we also test for differentvalues of subspace rank N . We chose the rank ofthe subspace to be the minimum number of prin-cipal components which accounted for 25%, 50%,and 75% of the average variance of the affixes foreach model. As mentioned above, we calculate theaverage percent of overlap between the k-closestin the continuous and discrete spaces, and also cal-culate the average Hamming offset. The results arein Table 2, where the figures represent the scoresfor each model averaged over performance on all107 affixes.

As can be seen in the results, no model was ableto achieve a high level of accuracy on the task,but all models significantly outperform the ran-dom baseline, and the MRNN and TREELSTMmodels fare better than the STAFFNET distilla-tions with regard to both average percentage ofoverlap and average hamming offset.

5.3 Feature Subsets

Given that Korean affixes contain linguistic infor-mation of different sorts, we can perform a vari-ant of our experiment from above using only sub-sets of features. For example, given the feature-makeup described above, affixes fall into one offive phonological configurations, which we candescribe as +CODA, -CODA, +LOW, -LOW, orNONE. For these experiments, similar to before de-fine YAff = argmin

x∈A,x6=AffdH(phon(x), phon(Aff)),

or the set of affixes which are closest to a certainaffix in the discrete space considering only phono-logical features. We then define YAff as before,YAff = argmax

x∈A,x6=Affsim(x,Aff). Percentage of over-

lap and Hamming offset are as before. Results canbe interpreted as: for the k-closest affixes in thecontinuous space, how many behave the same withregard to, e.g. phonological features? This shouldhelp identify what type of linguistic informationmodels are sensitive to.

The results for the phonological, syntactic, se-mantic, and pragmatic subset tests are in Tables3-6.

6 Discussion

Before discussing individual test results, it is note-worthy that the scores of the tables vary signifi-cantly from one another, and the random baselineshows particularly strong results for certain sub-sets, particularly the pragmatic subset. This is theresult of fluctuations in the average k (i.e. clustersize) and average Hamming offset for each sub-set. Table 7 lists these figuers. Dividing the aver-age k by the total number of affixes |A| gives theexpected random score for percentage of overlap.For each subset, the random baselines roughly re-flect these figures. For interpreting the results, per-formance relative to the random baseline is what isimportant, not the percentage figure or Hammingoffset figures themselves.

Examining Table 2, the MRNN and TREEL-STM models outperformed the STAFFNET mod-els by a large margin. A plausible hypothesiswould be to attribute this difference to the lackof non-linearity in the STAFFNET-derived affixes.While STAFFNET does derive stem representa-tions via back-propagation through a BiLSTM,the affix representations always apply post-non-linearity during forward propagation, and as suchare unable to learn any potentially non-linear rela-tionships.

Regarding Table 3, while nearly all models out-performed the random baseline, none did so sig-nificantly. It is furthermore surprising that charac-ter and syllable-level models did not significantlyoutperform word-atomic models, to which phono-logical information of the kind driving allomorphyin Korean should be unavailable. This may suggestthat the models examined here are not sensitive tophonological information in any significant way.

The syntactic results in Table 4 show all modelsoutperforming the random baseline, suggesting itis possible for models to capture syntactic infor-mation. Furthermore, the distilling architecturesperformed relatively similarly, with a STAFFNET

distillation achieving the highest score. This sug-gests that non-linearity may not be necessary tocapture syntactic relations.

For the semantic subset results in Table 5, whileSTAFFNET distillations performed similar to therandom baseline, the models deriving affix rep-resentations via non-linearity showed relativelystrong results. This suggests two things. (i) It ispossible for modern neural network models to cap-ture grammatical meaning of a semantic nature,

400

Architecture STAFFNET-25% variance

STAFFNET-50% variance


MRNN TREELSTM

Model %Overlap OffsetdH %Overlap OffsetdH %Overlap OffsetdH %Overlap OffsetdH %Overlap OffsetdH

KOR2VEC 12% 3.35 11% 3.49 9% 3.39 16% 2.82 19% 2.62FASTTEXT 13% 3.23 12% 3.22 9% 3.31 18% 2.59 16% 2.63SYLLABLE 13% 3.18 11% 3.42 8% 3.36 17% 2.88 17% 3.29W2V-SG 12% 3.17 8% 3.57 7% 3.40 18% 2.59 15% 3.01W2V-CBOW 13% 3.19 9% 3.46 7% 3.56 18% 2.63 16% 2.86GLOVE 10% 3.30 10% 3.34 7% 3.47 16% 2.80 11% 3.13RANDOM 2% 4.23 3% 4.38 3% 4.47 2% 4.32 1% 4.35

Table 2: Percent correct and average Hamming Distance for models mapping to subspaces of different sizes. Bestscores in bold, worst scores in red.




MRNN TREELSTM



Table 3: Percent correct and Average Hamming offset: Restricting to Phonological features.




MRNN TREELSTM



Table 4: Percent correct and Average Hamming offset: Restricting to Syntacitc features.




MRNN TREELSTM



Table 5: Percent correct and Average Hamming offset: Restricting to Semantic features.

401




MRNN TREELSTM



Table 6: Percent correct and Average Hamming offset: Restricting to Pragmatic features.

SUBSET AVG. k AVG. k / |A| AVG. dHALL 2.21 0.02 5.56

PHON 29.93 0.28 0.96SYN 30.92 0.29 1.42SEM 3.25 0.03 3.00

PRAG 73.91 0.69 0.32

Table 7: Average size of nearest k affixes for each fea-ture subset.

and (ii) non-linearity is required to capture thesemeanings.

For the pragmatic results in Table 6, all mod-els perform virtually indistinguishably from therandom baseline. This suggests that the word-embedding models examined here are not sensi-tive to pragmatic features, at least not when dis-tilled into another model.6

As this is a proof-of-concept study, the princi-pal takeaway is that this method of intrinsic eval-uation is possible, and reveals interesting charac-teristics of neural representations. Specifically, theunderlying geometry of ground-truth discrete em-beddings are at least to some extent being cap-tured in the geometry of the continuous represen-tations learnt by different neural-network models.Furthermore, the results in this study show thatnot only are neural network models sensitive togrammatical information—as distinct from lexicalinformation—they are sensitive to different typesof grammatical information to differing degrees.Syntactic information can apparently be capturedby linear relations, while semantic information re-quires non-linearities. This result is perhaps not

6Though it is of note that all models were trained ontext which is ostensibly academic in character (a WikiDumpfile), and which therefore is almost devoid of pragmatically-marked language like honorifics.

surprising, as even in the theoretical linguistics lit-erature semantic analyses often require more com-plex algebraic structures built on top of syntacticparses. Finally, neural models seem insensitive tophonological and pragmatic information, with allmodels here performing virtually the same as therandom baseline on these sub-tests.

Finally, in addition to the test results them-selves, the fact that any models significantly out-performed the random baseline shows that trans-parent model distillation can serve as a viablemeans of extracting sub-atomic information fromotherwise atomic representations.

7 Conclusion

This study has been a proof-of-concept for anintrinsic evaluation method which probes gram-matical, rather than lexical information in word-embeddings. To do this, we studied the case of Ko-rean affixes, which display multiple types of gram-matical information. In order to derive affix rep-resentations, we used Transparent Model Distil-lation to extract morpheme representations wherethey otherwise did not exist. Our study has shownthat neural network models are sensitive to gram-matical information, and that the geometry of thecontinuous representations learnt by neural net-work models reflects to some degree the geometryof ground-truth discrete embeddings.

We put forward such a process as a frameworkfor the fine-grained intrinsic analysis of the high-dimensional continuous embeddings which are of-ten used to help solve natural language processingtasks.

402

ReferencesYossi Adi, Einat Kermany, Yonatan Belinkov, Ofer

Lavi, and Yoav Goldberg. 2016. Fine-Grained Anal-ysis of Sentence Embeddings using Auxiliary Pre-diction Tasts. ArXiv preprint arXiv:1608.04207.

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, HassanSajjad, and James Glass. 2017. What do Neural Ma-chine Translation Models Learn about Morphology.ArXiv preprint arXiv:1704.03471.

Piotr Bojanowski, Edouard Grave, Armand Joulin, andTomas Mikolov. 2017. Enriching Word Vectors withSubword Information. Transactions of the Associa-tion for Computational Linguistics, 5:135–146.

Young-Mee Yu Cho and Peter Sells. 1995. A LexicalAccount of Inflectional Suffixes in Korean. Journalof East Asian Linguistics, 4(2):119–174.

Sanghyuk Choi, Taeuk Kim, Jinseok Seol, and Sang-goo Lee. 2017. A Syllable-based Technique forWord Embeddings of Korean Words. In Proceed-ings of the First Workshop on Subword and Charac-ter Level Models in NLP, pages 36–40.

Sanghyuk Choi, Jinseok Seol, and Sang-goo Lee. 2016.On Word Embedding Models and Parameters Opti-mized for Korean. In Proceedings of the 28th An-nual Conference on Human and Cognitive LanguageTechnology (In Korean).

Daniel Edmiston and Karl Stratos. 2018. Composi-tional Morpheme Embeddings with Affixes as Func-tions and Stems as Arguments. In Proceedings ofthe Workshop on the Relevance of Linguistic Struc-ture in Neural Architectures for NLP, pages 1–5.

Allyson Ettinger, Ahmed Elgohary, and Philip Resnik.2016. Probing for Semantic Evidence of Com-position by Means of Simple Classification Tasks.In Proceedings of the 1st Workshop on EvaluatingVector-Space Representations for NLP, pages 134–139.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015.Distilling the Knowledge in a Neural Network. InNIPS Deep Learning and Representation LearningWorkshop.

Yoon Kim, Yacine Jernite, David Sontag, and Alexan-der Rush. 2016. Character-Aware Neural LanguageModels. In Proceedings of AAAI, pages 2741–2749.

Thang Luong, Richard Socher, and Christopher Man-ning. 2013. Better Word Representations with Re-cursive Neural Networks for Morphology. In Pro-ceedings of the Seventeenth Conference on Com-putational Natural Language Learning, pages 104–113.

R. Thomas McCoy, Ellie Pavlick, and Tal Linzen.2019. Right for the Wrong Reasons: DiagnosingSyntactic Heuristics in Natural Language Inference.ArXiv preprint arXiv:1902.01007.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Cor-rado, and Jeff Dean. 2013a. Distributed Representa-tions of Words and Phrases and their Composition-ality. In Advances in Neural Information ProcessingSystems.

Tomas Mikolov, Wen-Tau Yih, and Geoffrey Zweig.2013b. Linguistic Regularities in Continuous SpaceWord Representations. In Proceedings of the 2013Conference of the North American Chapter of theAssociation for Computational Linguistics: HumanLanguage Technologies, pages 764–751.

Jiaqi Mu, Suma Bhat, and Pramod Viswanath. 2017.Representing Sentences as Low-rank Subspaces.ArXiv preprint arXiv:1704.05358.

Jeffrey Pennington, Richard Socher, and ChristopherManning. 2014. GloVe: Global Vectors for WordRepresentation. In Proceedings of the 2014 Con-ference on Empirical Methods in Natural LanguageProcessing.

Kai Sheng Tai, Richard Socher, and Christopher Man-ning. 2015. Improved Semantic Representationsfrom Tree-Structured Long Short-Term MemoryNetworks. ArXiv preprint arXiv:1503.00075.

Sarah Tan, Rich Caruana, Giles Hooker, Paul Koch, andAlbert Gordo. 2018a. Learning Global Additive Ex-planations for Neural Nets Using Model Distillation.ArXiv preprint arXiv:1801.08640.

Sarah Tan, Rich Caruana, Giles Hooker, and Yin Lou.2017. Detecting Bias in Black-box Models Us-ing Transparent Model Distillation. ArXiv preprintarXiv:1710.06169.

Sarah Tan, Rich Caruana, Giles Hooker, and Yin Lou.2018b. Distill-and-Compare: Auditing Black-BoxModels Using Transparent Model Distillation. InProceedings of the 2018 AAAI/ACM Conference onAI, Ethics, and Society, pages 303–310.

Quanshi Zhang, Yu Yang, Yuchen Liu, Ying-Nian Wu,and Zhu Song-Chun. 2018. Unsupervised Learn-ing of Neural Networks to Explain Neural Networks.ArXiv preprint arXiv:1805.07468.

Appendix A: Feature subsets

Table 8 displays the feature values which make upthe dimensions of the discrete space. The combi-nation of all feature values from the subsets com-prise the full discrete space, upon which the exper-iment in Table 2 was run.

403

SUBSET FEATURE VALUES

PHONOLOGICAL FEATURES [+CODA], [-CODA], [+LOW], [-LOW]SYNTACTIC FEATURES [+POSTPOSITION], [+CONJUNCTIVE], [+X-

LIMITER],[+Z-LIMITER], [+V1], [+V2], [+V3], [+V4]

SEMANTIC FEATURES [+DECLARATIVE], [+NOMINATIVE], [+TOPIC],[+LOCATIVE], [+DIRECTIVE], [+GENATIVE],[+ACCUSATIVE], [+PAST], [+CONJUNCTION],[+ADVERBIAL], [+ALSO], [+TAG], [+NOM-INALIZER], [+FROM], [+CONDITIONAL],[+LIKE], [+ESSIVE], [+COMPARATOR],[+COMPLEMENTIZER], [+RELATIVIZER],[+PAST], [+RETROSPECTIVE], [+FUTURE],[+PLURAL], [+PRESENT], [+BECAUSE],[+COPULA], [+QUOTATIVE], [+ABLA-TIVE], [+INTENT], [+MUST], [+RESULT], [-ANIMATE], [+ANIMATE], [+INSTRUMENTAL],[+INTERROGATIVE], [+DATIVE], [+GOAL],[+EVEN], [+COHORTATIVE], [+ONLY], [+DIS-JUNCTIVE], [+EACH], [+DURATION]

PRAGMATIC FEATURES [+FORMAL], [-FORMAL], [+HONORIFIC],[+FAMILIAR]

Table 8: Description of feature subsets.

404

Documents

Intrinsic Evaluation of Grammatical Information within