
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 950–962, Vancouver, Canada, July 30 – August 4, 2017. © 2017 Association for Computational Linguistics. https://doi.org/10.18653/v1/P17-1088

An Interpretable Knowledge Transfer Model for Knowledge Base Completion

Qizhe Xie, Xuezhe Ma, Zihang Dai, Eduard Hovy
Language Technologies Institute

Carnegie Mellon University
Pittsburgh, PA 15213, USA

{qzxie, xuezhem, dzihang, hovy}@cs.cmu.edu

Abstract

Knowledge bases are important resources for a variety of natural language processing tasks but suffer from incompleteness. We propose a novel embedding model, ITransF, to perform knowledge base completion. Equipped with a sparse attention mechanism, ITransF discovers hidden concepts of relations and transfers statistical strength through the sharing of concepts. Moreover, the learned associations between relations and concepts, which are represented by sparse attention vectors, can be interpreted easily. We evaluate ITransF on two benchmark datasets, WN18 and FB15k, for knowledge base completion and obtain improvements on both the mean rank and Hits@10 metrics over all baselines that do not use additional information.

1 Introduction

Knowledge bases (KB), such as WordNet (Fellbaum, 1998), Freebase (Bollacker et al., 2008), YAGO (Suchanek et al., 2007) and DBpedia (Lehmann et al., 2015), are useful resources for many applications such as question answering (Berant et al., 2013; Yih et al., 2015; Dai et al., 2016) and information extraction (Mintz et al., 2009). However, knowledge bases suffer from incompleteness despite their formidable sizes (Socher et al., 2013; West et al., 2014), leading to a number of studies on automatic knowledge base completion (KBC) (Nickel et al., 2015), or link prediction.

The fundamental motivation behind these studies is that there exist statistical regularities underlying the intertwined facts stored in the multi-relational knowledge base. By discovering generalizable regularities in known facts, missing ones may be recovered in a faithful way. Due to their excellent generalization capability, distributed representations, a.k.a. embeddings, have been popularized to address the KBC task (Nickel et al., 2011; Bordes et al., 2011, 2013, 2014; Socher et al., 2013; Wang et al., 2014; Guu et al., 2015; Nguyen et al., 2016b).

As a seminal work, Bordes et al. (2013) propose TransE, which models the statistical regularities with linear translations between entity embeddings operated by a relation embedding. Implicitly, TransE assumes that both entity embeddings and relation embeddings dwell in the same vector space, posing an unnecessarily strong prior. To relax this requirement, a variety of models first project the entity embeddings to a relation-dependent space (Bordes et al., 2014; Ji et al., 2015; Lin et al., 2015b; Nguyen et al., 2016b), and then model the translation property in the projected space. Typically, these relation-dependent spaces are characterized by projection matrices unique to each relation. As a benefit, different aspects of the same entity can be temporarily emphasized or suppressed as an effect of the projection. For instance, STransE (Nguyen et al., 2016b) utilizes two projection matrices per relation, one for the head entity and the other for the tail entity.

Despite the superior performance of STransE compared to TransE, it is more prone to the data sparsity problem. Concretely, since the projection spaces are unique to each relation, projection matrices associated with rare relations can only be exposed to very few facts during training, resulting in poor generalization. A similar issue exists for common relations: without any restrictions on the number of projection matrices, logically related or conceptually similar relations may have distinct projection spaces, hindering the discovery, sharing, and generalization of statistical regularities.



Previously, a line of research makes use of external information, such as textual relations from web-scale corpora or node features (Toutanova et al., 2015; Toutanova and Chen, 2015; Nguyen et al., 2016a), to alleviate the sparsity problem. In parallel, recent work has proposed to model regularities beyond local facts by considering multi-relation paths (García-Durán et al., 2015; Lin et al., 2015a; Shen et al., 2016). Since the number of paths grows exponentially with their length, path-based models, as a side effect, enjoy many more training cases and hence suffer less from the sparsity problem.

In this paper, we propose an interpretable knowledge transfer model (ITransF), which encourages the sharing of statistical regularities between the projection matrices of relations and alleviates the data sparsity problem. At the core of ITransF is a sparse attention mechanism, which learns to compose shared concept matrices into relation-specific projection matrices, leading to better generalization. Without any external resources, ITransF improves mean rank and Hits@10 on two benchmark datasets over all previous approaches of the same kind. In addition, the parameter sharing is clearly indicated by the learned sparse attention vectors, enabling us to interpret how knowledge transfer is carried out. To induce the desired sparsity during optimization, we further introduce a block iterative optimization algorithm.

In summary, the contributions of this work are: (i) proposing a novel knowledge embedding model which enables knowledge transfer by learning to discover shared regularities; (ii) introducing a learning algorithm to directly optimize a sparse representation from which the knowledge transfer procedure is interpretable; (iii) showing the effectiveness of our model by outperforming baselines on two benchmark datasets for the knowledge base completion task.

2 Notation and Previous Models

Let E denote the set of entities and R denote the set of relations. In knowledge base completion, given a training set P of triples (h, r, t), where h, t ∈ E are the head and tail entities having a relation r ∈ R, e.g., (Steve Jobs, FounderOf, Apple), we want to predict missing facts such as (Steve Jobs, Profession, Businessperson).

Most of the embedding models for knowledge base completion define an energy function f_r(h, t) according to the fact's plausibility (Bordes et al., 2011, 2013, 2014; Socher et al., 2013; Wang et al., 2014; Yang et al., 2015; Guu et al., 2015; Nguyen et al., 2016b). The models are learned to minimize the energy f_r(h, t) of a plausible triple (h, r, t) and to maximize the energy f_r(h′, t′) of an implausible triple (h′, r, t′).

Motivated by the linear translation phenomenon observed in well-trained word embeddings (Mikolov et al., 2013), TransE (Bordes et al., 2013) represents the head entity h, the relation r and the tail entity t with vectors h, r and t ∈ R^n respectively, which are trained so that h + r ≈ t. They define the energy function as

f_r(h, t) = \| h + r - t \|_\ell

where ℓ = 1 or 2, which means either the ℓ1 or the ℓ2 norm of the vector h + r − t will be used, depending on the performance on the validation set.
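To make the scoring concrete, the following is a minimal NumPy sketch of the TransE energy; the function name and the toy dimensionality are illustrative, not the original implementation.

```python
import numpy as np

def transe_energy(h, r, t, ell=1):
    """TransE energy ||h + r - t||_ell for a single triple.

    h, r, t: 1-D embedding vectors of equal dimension n.
    ell: 1 or 2, chosen on the validation set.
    """
    return np.linalg.norm(h + r - t, ord=ell)

# toy usage with random 50-dimensional embeddings
rng = np.random.RandomState(0)
h, r, t = rng.randn(3, 50)
print(transe_energy(h, r, t, ell=1))
```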

To better model relation-specific aspects of the same entity, TransR (Lin et al., 2015b) uses projection matrices and projects the head entity and the tail entity to a relation-dependent space. STransE (Nguyen et al., 2016b) extends TransR by employing different matrices for mapping the head and the tail entity. The energy function is

f_r(h, t) = \| W_{r,1} h + r - W_{r,2} t \|_\ell

However, not all relations have abundant data to estimate the relation-specific matrices, as most of the training samples are associated with only a few relations, leading to the data sparsity problem for rare relations.
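Building on the previous sketch, a minimal NumPy illustration of the STransE energy with per-relation projection matrices follows; the variable names W_r1 and W_r2 are illustrative stand-ins for the matrices W_{r,1} and W_{r,2} above.

```python
import numpy as np

def stranse_energy(h, r, t, W_r1, W_r2, ell=1):
    """STransE energy ||W_{r,1} h + r - W_{r,2} t||_ell for one triple.

    W_r1, W_r2: n x n projection matrices specific to relation r.
    """
    return np.linalg.norm(W_r1 @ h + r - W_r2 @ t, ord=ell)

# toy usage: every relation owns two n x n matrices (2|R| matrices in total)
rng = np.random.RandomState(0)
n = 50
h, r, t = rng.randn(3, n)
W_r1 = np.eye(n) + 0.005 * rng.randn(n, n)
W_r2 = np.eye(n) + 0.005 * rng.randn(n, n)
print(stranse_energy(h, r, t, W_r1, W_r2))
```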

3 Interpretable Knowledge Transfer

3.1 Model

As discussed above, a fundamental weakness of TransR and STransE is that they equip each relation with a set of unique projection matrices, which not only introduces more parameters but also hinders knowledge sharing. Intuitively, many relations share some concepts with each other, although they are stored as independent symbols in the KB. For example, the relations "(somebody) won award for (some work)" and "(somebody) was nominated for (some work)" both describe a person's high-quality work which wins an award or a nomination, respectively. This phenomenon suggests that one relation actually represents a collection of real-world concepts, and one concept can be shared by several relations. Inspired by the existence of such lower-level concepts, instead of defining a unique set of projection matrices for every relation, we can alternatively define a small set of concept projection matrices and then compose them into customized projection matrices. Effectively, the relation-dependent translation space is then reduced to the smaller concept spaces.

However, in general, we do not have prior knowledge about what concepts exist and how they are composed to form relations. Therefore, in ITransF, we propose to learn this information simultaneously from data, together with all knowledge embeddings. Following this idea, we first present the model details, and then discuss the optimization techniques for training.

Energy function Specifically, we stack all the concept projection matrices into a 3-dimensional tensor D ∈ R^{m×n×n}, where m is the pre-specified number of concept projection matrices and n is the dimensionality of entity embeddings and relation embeddings. We let each relation select the most useful projection matrices from the tensor, where the selection is represented by an attention vector. The energy function of ITransF is defined as:

f_r(h, t) = \| \alpha_r^H \cdot D \cdot h + r - \alpha_r^T \cdot D \cdot t \|_\ell        (1)

where α_r^H, α_r^T ∈ [0, 1]^m, satisfying \sum_i \alpha_{r,i}^H = \sum_i \alpha_{r,i}^T = 1, are normalized attention vectors used to compose all concept projection matrices in D by a convex combination. It is obvious that STransE can be expressed as a special case of our model when we use m = 2|R| concept matrices and set the attention vectors to disjoint one-hot vectors. Hence our model space is a generalization of STransE. Note that we can safely use fewer concept matrices in ITransF and obtain better performance (see Section 4.3), though STransE always requires 2|R| projection matrices.
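As a concrete illustration, here is a minimal NumPy sketch of the ITransF energy in Eq. (1): the relation-specific projections are convex combinations of shared concept matrices, weighted by the attention vectors. The shapes, names, and toy values are illustrative assumptions.

```python
import numpy as np

def itransf_energy(h, r, t, D, alpha_H, alpha_T, ell=1):
    """ITransF energy || (alpha_H . D) h + r - (alpha_T . D) t ||_ell.

    D:        (m, n, n) tensor stacking the m concept projection matrices.
    alpha_H:  (m,) attention over concepts for the head side (non-negative, sums to 1).
    alpha_T:  (m,) attention over concepts for the tail side.
    """
    W_H = np.tensordot(alpha_H, D, axes=1)   # (n, n) composed head projection
    W_T = np.tensordot(alpha_T, D, axes=1)   # (n, n) composed tail projection
    return np.linalg.norm(W_H @ h + r - W_T @ t, ord=ell)

# toy usage: m = 4 shared concept matrices, n = 50 dimensions
rng = np.random.RandomState(0)
m, n = 4, 50
D = np.stack([np.eye(n) + 0.005 * rng.randn(n, n) for _ in range(m)])
h, r, t = rng.randn(3, n)
alpha_H = np.array([0.7, 0.3, 0.0, 0.0])   # a sparse convex combination
alpha_T = np.array([0.0, 0.0, 1.0, 0.0])
print(itransf_energy(h, r, t, D, alpha_H, alpha_T))
```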

We follow previous work and minimize the following hinge loss function:

L = \sum_{(h,r,t) \sim P,\ (h',r,t') \sim N} \big[ \gamma + f_r(h, t) - f_r(h', t') \big]_+        (2)

where P is the training set consisting of correct triples, N is the distribution of corrupted triples defined in Section 3.3, and [·]_+ = max(·, 0). Note that we have omitted the dependence of N on (h, r, t) to avoid clutter. We normalize the entity vectors h, t and the projected entity vectors α_r^H · D · h and α_r^T · D · t to have unit length after each update, which is an effective regularization method that benefits all models.
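A minimal sketch of the margin loss in Eq. (2) for a single positive/negative pair, reusing the itransf_energy sketch above; the parameter container and its keys are hypothetical.

```python
def hinge_loss(pos_triple, neg_triple, params, gamma=1.0):
    """Margin loss [gamma + f_r(h, t) - f_r(h', t')]_+ for one positive/negative pair.

    pos_triple, neg_triple: (h, r, t) tuples of embedding vectors sharing relation r.
    params: dict holding the concept tensor D and the relation's attention vectors.
    """
    h, r, t = pos_triple
    h_neg, _, t_neg = neg_triple
    D, a_H, a_T = params["D"], params["alpha_H"], params["alpha_T"]
    pos = itransf_energy(h, r, t, D, a_H, a_T)
    neg = itransf_energy(h_neg, r, t_neg, D, a_H, a_T)
    return max(gamma + pos - neg, 0.0)
```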

Sparse attention vectors In Eq. (1), we have defined α_r^H, α_r^T to be normalized vectors used for composition. With a dense attention vector, it is computationally expensive to perform the convex combination of m matrices in each iteration. Moreover, a relation usually does not consist of all existing concepts in practice. Furthermore, when the attention vectors are sparse, it is often easier to interpret their behaviors and understand how concepts are shared by different relations.

Motivated by these potential benefits, we further hope to learn sparse attention vectors in ITransF. However, directly posing ℓ1 regularization (Tibshirani, 1996) on the attention vectors fails to produce sparse representations in our preliminary experiment, which motivates us to enforce ℓ0 constraints on α_r^H, α_r^T.

In order to satisfy both the normalization condition and the ℓ0 constraints, we reparameterize the attention vectors in the following way:

\alpha_r^H = \mathrm{SparseSoftmax}(v_r^H, I_r^H)
\alpha_r^T = \mathrm{SparseSoftmax}(v_r^T, I_r^T)

where v_r^H, v_r^T ∈ R^m are the pre-softmax scores, I_r^H, I_r^T ∈ {0, 1}^m are the sparse assignment vectors indicating the non-zero entries of the attention vectors, and SparseSoftmax is defined as

\mathrm{SparseSoftmax}(v, I)_i = \frac{\exp(v_i / \tau)\, I_i}{\sum_j \exp(v_j / \tau)\, I_j}

with τ being the temperature of the Softmax.

With this reparameterization, v_r^H, v_r^T and I_r^H, I_r^T replace α_r^H, α_r^T as the real parameters of the model. Also, note that it is equivalent to pose the ℓ0 constraints on I_r^H, I_r^T instead of α_r^H, α_r^T. Putting these modifications together, we can rewrite the optimization problem as

\min L \quad \text{subject to} \quad \| I_r^H \|_0 \le k,\ \| I_r^T \|_0 \le k        (3)

where L is the loss function defined in Eq. (2).
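A minimal NumPy sketch of the SparseSoftmax reparameterization: the assignment mask I zeroes out concepts that a relation is not allowed to use, and the remaining scores are renormalized. The toy inputs are illustrative.

```python
import numpy as np

def sparse_softmax(v, I, tau=0.25):
    """SparseSoftmax(v, I)_i = exp(v_i / tau) * I_i / sum_j exp(v_j / tau) * I_j.

    v: (m,) pre-softmax scores.
    I: (m,) binary assignment vector with at most k ones.
    """
    w = np.exp(v / tau) * I
    return w / w.sum()

# toy usage: m = 6 concepts, only k = 2 of them may receive attention
v = np.array([0.3, -1.2, 2.0, 0.5, 0.0, 1.1])
I = np.array([1, 0, 1, 0, 0, 0])
alpha = sparse_softmax(v, I)
print(alpha)   # non-zero only at positions 0 and 2, sums to 1
```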

3.2 Block Iterative Optimization

Though sparseness is favorable in practice, it is generally NP-hard to find the optimal solution under ℓ0 constraints. Thus, we resort to an approximate algorithm in this work.



For convenience, we refer to the parameters with and without the sparse constraints as the sparse partition and the dense partition, respectively. Based on this notion, the high-level idea of the approximate algorithm is to iteratively optimize one of the two partitions while holding the other one fixed. Since all parameters in the dense partition, including the embeddings, the projection matrices, and the pre-softmax scores, are fully differentiable with the sparse partition fixed, we can simply utilize SGD to optimize the dense partition. Then, the core difficulty lies in the step of optimizing the sparse partition (i.e., the sparse assignment vectors), during which we want the following two properties to hold:

1. the sparsity required by the ℓ0 constraint is maintained, and

2. the cost defined by Eq. (2) is decreased.

Satisfying these two criteria seems to highly resemble the original problem defined in Eq. (3). However, the dramatic difference here is that, with the parameters in the dense partition regarded as constant, the cost function is decoupled w.r.t. each relation r. In other words, the optimal choice of I_r^H, I_r^T is independent of I_{r'}^H, I_{r'}^T for any r' ≠ r. Therefore, we only need to consider the optimization for a single relation r, which is essentially an assignment problem. Note, however, that I_r^H and I_r^T are still coupled; without this coupling we would essentially face a knapsack problem. In principle, one can explore combinatorial optimization techniques to optimize I_r^H, I_r^T jointly, which usually involve some iterative procedure. To avoid adding another inner loop to our algorithm, we turn to a simple but fast approximation method based on the following single-matrix cost.

Specifically, for each relation r, we consider the induced cost L_{r,i}^H where only a single projection matrix i is used for the head entity:

L_{r,i}^H = \sum_{(h,r,t) \sim P_r,\ (h',r,t') \sim N_r} \big[ \gamma + f_{r,i}^H(h, t) - f_{r,i}^H(h', t') \big]_+

where f_{r,i}^H(h, t) = \| D_i \cdot h + r - \alpha_r^T \cdot D \cdot t \| is the corresponding energy function, and the subscript in P_r and N_r denotes the subsets with relation r. Intuitively, L_{r,i}^H measures, given the current tail attention vector α_r^T, how implausible D_i would be if only one projection matrix could be chosen for the head entity. Hence, i* = argmin_i L_{r,i}^H gives us the best single projection matrix on the head side given α_r^T.

Now, in order to choose the best k matrices, we basically ignore the interaction among projection matrices and update I_r^H in the following way:

I_{r,i}^H \leftarrow \begin{cases} 1 & \text{if } i \in \mathrm{argpartition}_i(L_{r,i}^H, k) \\ 0 & \text{otherwise} \end{cases}

where the function argpartition_i(x_i, k) produces the index set of the lowest-k values of x_i.

Analogously, we can define the single-matrix cost L_{r,i}^T and the energy function f_{r,i}^T(h, t) on the tail side in a symmetric way. The update rule for I_r^T then follows the same derivation. Admittedly, the approximation described here is relatively crude. But as we will show in Section 4, the proposed algorithm yields good performance empirically. We leave further improvement of the optimization method as future work.
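A minimal sketch of the head-side assignment update: compute the single-matrix cost for every concept matrix under the current tail attention, then keep the k cheapest via argpartition. The data layout (aligned lists of embedded positive and negative pairs per relation) is an assumption for illustration.

```python
import numpy as np

def update_head_assignment(pos_pairs, neg_pairs, D, r_vec, alpha_T, k, gamma=1.0):
    """Recompute the binary head assignment I_r^H for one relation.

    pos_pairs, neg_pairs: aligned lists of (h, t) embedding pairs for relation r.
    D:       (m, n, n) concept tensor; r_vec: (n,) relation embedding.
    alpha_T: (m,) current tail attention vector.
    Returns an (m,) 0/1 vector with exactly k ones (the k lowest single-matrix costs).
    """
    m = D.shape[0]
    W_T = np.tensordot(alpha_T, D, axes=1)           # tail projection held fixed
    costs = np.zeros(m)
    for i in range(m):                               # single-matrix cost L^H_{r,i}
        for (h, t), (h_neg, t_neg) in zip(pos_pairs, neg_pairs):
            pos = np.linalg.norm(D[i] @ h + r_vec - W_T @ t)
            neg = np.linalg.norm(D[i] @ h_neg + r_vec - W_T @ t_neg)
            costs[i] += max(gamma + pos - neg, 0.0)
    I_H = np.zeros(m, dtype=int)
    I_H[np.argpartition(costs, k)[:k]] = 1           # indices of the k lowest costs
    return I_H
```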

3.3 Corrupted Sample Generating Method

Recall that we need to sample a negative triple (h′, r, t′) to compute the hinge loss in Eq. (2), given a positive triple (h, r, t) ∈ P. The distribution of negative triples is denoted by N(h, r, t). Previous work (Bordes et al., 2013; Lin et al., 2015b; Yang et al., 2015; Nguyen et al., 2016b) generally constructs a set of corrupted triples by replacing the head entity or tail entity with a random entity uniformly sampled from the KB.

However, uniformly sampling corrupted entities may not be optimal. Often, the head and tail entities associated with a relation can only belong to a specific domain. When the corrupted entity comes from another domain, it is very easy for the model to induce a large energy gap between the true triple and the corrupted one. Once the energy gap exceeds γ, there is no training signal from this corrupted triple. In comparison, if the corrupted entity comes from the same domain, the task becomes harder for the model, leading to a more consistent training signal.

Motivated by this observation, we propose to sample the corrupted head or tail from entities in the same domain with probability p_r and from the whole entity set with probability 1 − p_r. The choice of the relation-dependent probability p_r is specified in Appendix A.1. In the rest of the paper, we refer to the proposed sampling method as "domain sampling".
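A minimal sketch of domain sampling for one positive triple; the domain sets, the 50/50 choice of which side to corrupt, and the function name are illustrative assumptions, and p_r is computed as in Appendix A.1.

```python
import random

def corrupt_triple(h, r, t, entities, head_domain, tail_domain, p_r):
    """Corrupt either the head or the tail of (h, r, t).

    With probability p_r the replacement is drawn from the relation's own
    head/tail domain; otherwise it is drawn uniformly from all entities.
    """
    corrupt_head = random.random() < 0.5
    domain = head_domain if corrupt_head else tail_domain
    pool = domain if random.random() < p_r else entities
    e = random.choice(list(pool))
    return (e, r, t) if corrupt_head else (h, r, e)
```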



4 Experiments

4.1 Setup

To evaluate link prediction, we conduct experiments on the WN18 (WordNet) and FB15k (Freebase) datasets introduced by Bordes et al. (2013) and use the same training/validation/test split as in Bordes et al. (2013). The statistics of the two datasets are given in Table 1.

Dataset | #E     | #R    | #Train  | #Valid | #Test
WN18    | 40,943 | 18    | 141,442 | 5,000  | 5,000
FB15k   | 14,951 | 1,345 | 483,142 | 50,000 | 59,071

Table 1: Statistics of FB15k and WN18 used in the experiments. #E and #R denote the number of entities and relation types respectively. #Train, #Valid and #Test are the numbers of triples in the training, validation and test sets respectively.

In the knowledge base completion task, we evaluate a model's performance at predicting the head entity or the tail entity given the relation and the other entity. For example, to predict the head given relation r and tail t in a triple (h, r, t), we compute the energy f_r(h′, t) for each entity h′ in the knowledge base and rank all the entities according to this energy. We follow Bordes et al. (2013) and report filtered results, i.e., all other correct candidates h′ are removed before ranking. The rank of the correct entity is then obtained, and we report the mean rank (mean of the predicted ranks) and Hits@10 (top-10 accuracy). A lower mean rank or a higher Hits@10 means better performance.
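A minimal sketch of the filtered evaluation protocol for head prediction; `energy` stands for any of the scoring functions above, and the set of known correct heads is assumed to be precomputed from the train/valid/test triples.

```python
def filtered_head_rank(energy, candidates, r, t, true_head, known_heads):
    """Filtered rank of the true head for a test triple (true_head, r, t).

    energy:      callable (h, r, t) -> float, lower is more plausible.
    candidates:  dict mapping entity id -> embedding vector.
    known_heads: set of entity ids h' such that (h', r, t) is a known correct triple.
    """
    true_score = energy(candidates[true_head], r, t)
    rank = 1
    for eid, h_vec in candidates.items():
        if eid == true_head or eid in known_heads:   # "filter" other correct answers
            continue
        if energy(h_vec, r, t) < true_score:
            rank += 1
    return rank

# mean rank = average of these ranks; Hits@10 = fraction of test triples with rank <= 10
```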

4.2 Implementation Details

We initialize the projection matrices with identity matrices plus a small noise sampled from the normal distribution N(0, 0.005²). The entity and relation vectors of ITransF are initialized by TransE (Bordes et al., 2013), following Lin et al. (2015b), Ji et al. (2015), García-Durán et al. (2016, 2015), and Lin et al. (2015a). We ran mini-batch SGD until convergence. We employ the "Bernoulli" sampling method to generate incorrect triples, as used in Wang et al. (2014), Lin et al. (2015b), He et al. (2015), Ji et al. (2015) and Lin et al. (2015a).

STransE (Nguyen et al., 2016b) is the knowledge embedding model most similar to ours, except that it uses distinct projection matrices for each relation. We use the same hyperparameters as STransE, and no significant improvement is observed when we alter them. We set the margin γ to 5 and the embedding dimension n to 50 for WN18, and γ = 1, n = 100 for FB15k. We set the batch size to 20 for WN18 and 1000 for FB15k. The learning rate is 0.01 on WN18 and 0.1 on FB15k. We use 30 concept matrices on WN18 and 300 on FB15k. All the models are implemented with Theano (Bergstra et al., 2010). The Softmax temperature is set to 1/4.

4.3 Results & Analysis

The overall link prediction results¹ are reported in Table 2. Our model consistently outperforms previous models without external information on both metrics on WN18 and FB15k. On WN18, we even achieve a much better mean rank, with comparable Hits@10, than the current state-of-the-art model IRN, which employs external information.

We can see that path information is very helpful on FB15k: models taking advantage of path information outperform intrinsic models by a significant margin. Indeed, many facts are easier to recover with the help of multi-step inference. For example, if we know Barack Obama was born in Honolulu, a city in the United States, then we easily know the nationality of Obama is the United States. A straightforward way of extending our proposed model to a k-step path P = {r_i}_{i=1}^{k} is to define a path energy function

\| \alpha_P^H \cdot D \cdot h + \sum_{r_i \in P} r_i - \alpha_P^T \cdot D \cdot t \|_\ell

where α_P^H is a concept association related to the path. We plan to extend our model to multi-step paths in the future.

To provide a detailed understanding of why the proposed model achieves better performance, we present some further analysis in the sequel.

Performance on Rare Relations In the proposed ITransF, we design an attention mechanism to encourage knowledge sharing across different relations. Naturally, facts associated with rare relations should benefit most from such sharing, boosting the overall performance. To verify this hypothesis, we investigate our model's performance on relations with different frequencies.

The overall distribution of relation frequencies resembles that of word frequencies, subject to Zipf's law. Since the frequencies of relations approximately follow a power-law distribution, their log frequencies are linear. The statistics of relations on FB15k and WN18 are shown in Figure 1. We can clearly see that the distributions exhibit long tails, just like Zipf's law for word frequency.

¹ Note that although IRN (Shen et al., 2016) does not explicitly exploit path information, it performs multi-step inference through multiple uses of an external memory. When IRN is allowed to access the memory only once per prediction, its Hits@10 is 80.7, similar to models without path information.



Model                                 | Additional Information | WN18 Mean Rank | WN18 Hits@10 | FB15k Mean Rank | FB15k Hits@10
SE (Bordes et al., 2011)              | No                     | 985            | 80.5         | 162             | 39.8
Unstructured (Bordes et al., 2014)    | No                     | 304            | 38.2         | 979             | 6.3
TransE (Bordes et al., 2013)          | No                     | 251            | 89.2         | 125             | 47.1
TransH (Wang et al., 2014)            | No                     | 303            | 86.7         | 87              | 64.4
TransR (Lin et al., 2015b)            | No                     | 225            | 92.0         | 77              | 68.7
CTransR (Lin et al., 2015b)           | No                     | 218            | 92.3         | 75              | 70.2
KG2E (He et al., 2015)                | No                     | 348            | 93.2         | 59              | 74.0
TransD (Ji et al., 2015)              | No                     | 212            | 92.2         | 91              | 77.3
TATEC (García-Durán et al., 2016)     | No                     | -              | -            | 58              | 76.7
NTN (Socher et al., 2013)             | No                     | -              | 66.1         | -               | 41.4
DISTMULT (Yang et al., 2015)          | No                     | -              | 94.2         | -               | 57.7
STransE (Nguyen et al., 2016b)        | No                     | 206 (244)      | 93.4 (94.7)  | 69              | 79.7
ITransF                               | No                     | 205            | 94.2         | 65              | 81.0
ITransF (domain sampling)             | No                     | 223            | 95.2         | 77              | 81.4
RTransE (García-Durán et al., 2015)   | Path                   | -              | -            | 50              | 76.2
PTransE (Lin et al., 2015a)           | Path                   | -              | -            | 58              | 84.6
NLFeat (Toutanova and Chen, 2015)     | Node + Link Features   | -              | 94.3         | -               | 87.0
Random Walk (Wei et al., 2016)        | Path                   | -              | 94.8         | -               | 74.7
IRN (Shen et al., 2016)               | External Memory        | 249            | 95.3         | 38              | 92.7

Table 2: Link prediction results on the two datasets. A higher Hits@10 or a lower mean rank indicates better performance. Following Nguyen et al. (2016b) and Shen et al. (2016), we divide the models into two groups. The first group contains intrinsic models that do not use extra information; the second group makes use of additional information. Results in brackets are another set of results reported for STransE.


In order to study the performance on relations with different frequencies, we sort all relations by their frequency in the training set and split them evenly into 3 buckets so that each bucket has a similar interval length of log frequency.

Within each bucket, we compare our model with STransE, as shown in Figure 2.² As we can see, on WN18, ITransF outperforms STransE by a significant margin on rare relations. In particular, in the last bin (the rarest relations), the average Hits@10 increases from 55.2 to 93.8, showing the great benefit of transferring statistical strength from common relations to rare ones. The comparison on each relation is shown in Appendix A.2. On FB15k, we can observe a similar pattern, although the degree of improvement is less significant. We conjecture that the difference is rooted in the fact that many rare relations on FB15k have disjoint domains, so knowledge transfer through common concepts is harder.

Interpretability In addition to the quantitative evidence supporting the effectiveness of knowledge sharing, we provide some intuitive examples to show how knowledge is shared in our model. As mentioned earlier, the sparse attention vectors fully capture the association between relations and concepts, and hence the knowledge transfer among relations. Thus, we visualize the attention vectors for several relations on both WN18 and FB15k in Figure 3.

² Domain sampling is not employed.


For WN18, the words "hyponym" and "hypernym" refer to words with more specific or more general meaning, respectively. For example, PhD is a hyponym of student, and student is a hypernym of PhD. As we can see, concepts associated with the head entities in one relation are also associated with the tail entities in its reverse relation. Further, "instance hypernym" is a special hypernym whose head entity is an instance and whose tail entity is an abstract notion. A typical example is (New York, instance hypernym, city). This connection has also been discovered by our model, indicated by the fact that "instance hypernym (T)" and "hypernym (T)" share a common concept matrix. Finally, for symmetric relations like "similar to", we see that the head attention is identical to the tail attention, which matches our intuition well.

On FB15k, we also see sharing between reverse relations, as in "(somebody) won award for (some work)" and "(some work) award winning work (somebody)". What's more, although the relations "won award for" and "was nominated for" share the same concepts, their attention distributions are different, suggesting distinct emphases. Finally, symmetric relations like "spouse" behave similarly, as mentioned before.



[Figure 1 appears here: per-relation bar charts of Frequency with an overlaid Log(Frequency) curve; panels (a) WN18 and (b) FB15k.]

Figure 1: Frequencies and log frequencies of relations on the two datasets. The X-axis shows relations sorted by frequency.

[Figure 2 appears here: Hits@10 bar charts comparing ITransF and STransE across three relation bins; panels (a) WN18 and (b) FB15k.]

Figure 2: Hits@10 on relations with different amounts of data. We give each relation equal weight and report the average Hits@10 over the relations in a bin, rather than the average Hits@10 over the samples in a bin. Bins with a smaller index correspond to higher-frequency relations.


Model Compression A byproduct of the parameter sharing mechanism employed by ITransF is a much more compact model with equal performance. Figure 5 plots the average performance of ITransF against the number of projection matrices m, together with two baseline models. On FB15k, when we reduce the number of matrices from 2200 to 30 (roughly 90× compression), our model's Hits@10 decreases by only 0.09%, still outperforming STransE. Similarly, on WN18, ITransF continues to achieve the best performance when we reduce the number of concept projection matrices to 18.

5 Analysis on Sparseness

Sparseness is desirable since it contributes to the interpretability and computational efficiency of our model. In this section, we investigate whether enforcing sparseness deteriorates model performance, and compare our method with other sparse encoding methods.

Dense Attention w/o ℓ1 regularization Although an ℓ0-constrained model usually enjoys many practical advantages, it may deteriorate model performance when applied improperly. Here, we show that our model with sparse attention achieves results similar to dense attention with significantly less computational burden. We also compare dense attention with ℓ1 regularization. We set the ℓ1 coefficient to 0.001 in our experiments and do not apply the Softmax, since the ℓ1 norm of a vector after Softmax is always 1. We compare the models in a setting where the computation time of the dense attention model is acceptable.³ We use 22 weight matrices on WN18 and 15 weight matrices on FB15k and train both models for 2000 epochs.



[Figure 3 appears here: attention heatmaps with panels (a) WN18 and (b) FB15k.]

Figure 3: Heatmap visualization of attention vectors for ITransF on WN18 and FB15k. Each row is an attention vector α_r^H or α_r^T for a relation's head or tail concepts.

[Figure 4 appears here: attention heatmaps with panels (a) WN18 and (b) FB15k.]

Figure 4: Heatmap visualization of ℓ1-regularized dense attention vectors, which are not sparse. Note that the color scale is not from 0 to 1 since the Softmax is not applied.

[Figure 5 appears here: Hits@10 versus the number of projection matrices for ITransF, STransE and CTransR; panels (a) FB15k and (b) WN18.]

Figure 5: Performance with different numbers of projection matrices. Note that the X-axis denoting the number of matrices is not linearly scaled.


The results are reported in Table 3. Generally, ITransF with sparse attention has slightly better or comparable performance compared to dense attention. Further, we show the attention vectors of the model with ℓ1-regularized dense attention in Figure 4. We see that ℓ1 regularization does not produce sparse attention, especially on FB15k.

³ With 300 projection matrices, it takes 1h 1m to run one epoch for a model with dense attention.


Nonnegative Sparse Encoding In the proposed model, we induce sparsity by a carefully designed iterative optimization procedure. Apart from this approach, one may utilize sparse encoding techniques to obtain sparseness based on the pretrained projection matrices from STransE. Concretely, stacking the 2|R| pretrained projection matrices into a 3-dimensional tensor X ∈ R^{2|R|×n×n}, similar sparsity can be induced by solving an ℓ1-regularized tensor completion problem

\min_{A, D} \| X - DA \|_2^2 + \lambda \| A \|_{\ell_1}

Basically, A plays the same role as the attention vectors in our model. For more details, we refer readers to Faruqui et al. (2015).



Method     | WN18 MR | WN18 H10 | WN18 Time | FB15k MR | FB15k H10 | FB15k Time
Dense      | 199     | 94.0     | 4m34s     | 69       | 79.4      | 4m30s
Dense + ℓ1 | 228     | 94.2     | 4m25s     | 131      | 78.9      | 5m47s
Sparse     | 207     | 94.1     | 2m32s     | 67       | 79.6      | 1m52s

Table 3: Performance of models with dense attention vectors or sparse attention vectors. MR, H10 and Time denote mean rank, Hits@10 and training time per epoch, respectively.


For completeness, we compare our model with the aforementioned approach.⁴ The comparison is summarized in Table 4. On both benchmarks, ITransF achieves a significant improvement over sparse encoding on the pretrained model. This performance gap is expected, since the objective function of sparse encoding methods is to minimize the reconstruction loss rather than to optimize the criterion for link prediction.
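As a rough illustration of this baseline (not the Faruqui et al. (2015) toolkit actually used in the experiments), a minimal ISTA-style sketch of the nonnegative ℓ1-regularized factorization over the flattened pretrained matrices might look as follows; the shapes, step size, and iteration count are illustrative assumptions, and the factorization is written row-wise as X ≈ A D, the transpose of the D A convention in the text.

```python
import numpy as np

def sparse_encode(X, n_concepts, lam=0.1, lr=1e-3, n_iter=500, seed=0):
    """Sketch of min_{A,D} ||X - A D||_2^2 + lam * ||A||_1 with A >= 0.

    X: (2|R|, n*n) matrix whose rows are flattened pretrained projection matrices.
    Returns the sparse codes A (playing the role of attention) and the dictionary D.
    """
    rng = np.random.RandomState(seed)
    n_rel, dim = X.shape
    D = 0.01 * rng.randn(n_concepts, dim)
    A = np.zeros((n_rel, n_concepts))
    for _ in range(n_iter):
        R = A @ D - X                               # reconstruction residual
        A -= lr * (R @ D.T)                         # gradient step on the codes
        A = np.maximum(A - lr * lam, 0.0)           # nonnegative soft-thresholding
        D -= lr * (A.T @ (A @ D - X))               # gradient step on the dictionary
    return A, D
```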

Method          | WN18 MR | WN18 H10 | FB15k MR | FB15k H10
Sparse Encoding | 211     | 86.6     | 66       | 79.1
ITransF         | 205     | 94.2     | 65       | 81.0

Table 4: Different methods to obtain sparse representations.

6 Related Work

In KBC, CTransR (Lin et al., 2015b) enables relation embedding sharing across similar relations, but it clusters relations before training rather than learning the clustering in a principled way. Further, it does not address the data sparsity problem, because there is no sharing of the projection matrices, which have far more parameters. Learning the association between semantic relations has been used in related problems such as relational similarity measurement (Turney, 2012) and relation adaptation (Bollegala et al., 2015).

Data sparsity is a common problem in many fields. Transfer learning (Pan and Yang, 2010) has been shown to be promising for transferring knowledge and statistical strength across similar models or languages. For example, Bharadwaj et al. (2016) transfer models from resource-rich languages to low-resource languages by parameter sharing through common phonological features in named entity recognition. Zoph et al. (2016) initialize from models trained on resource-rich languages to translate low-resource languages.

⁴ We use the toolkit provided by Faruqui et al. (2015).


Several works on obtaining sparse attention (Martins and Astudillo, 2016; Makhzani and Frey, 2014; Shazeer et al., 2017) share the similar idea of sorting the values before the softmax and keeping only the K largest values. However, the sorting operation in these works is not GPU-friendly.

The block iterative optimization algorithm in our work is inspired by LightRNN (Li et al., 2016). They allocate every word in the vocabulary to a position in a table; a word is represented by a row vector and a column vector depending on its position. They iteratively optimize the embeddings and the allocation of words in the table.

7 Conclusion and Future Work

In summary, we propose a knowledge embedding model which can discover shared hidden concepts, and design a learning algorithm to induce an interpretable sparse representation. Empirically, we show that our model improves performance on two benchmark datasets without external resources, over all previous models of the same kind.

In the future, we plan to enable ITransF to perform multi-step inference, and to extend the sharing mechanism to entity and relation embeddings, further enhancing the statistical binding across parameters. In addition, our framework can also be applied to multi-task learning, promoting a finer sharing among different tasks.

Acknowledgments

We thank the anonymous reviewers and Graham Neubig for valuable comments. We thank Yulun Du, Paul Mitchell, Abhilasha Ravichander, Pengcheng Yin and Chunting Zhou for suggestions on the draft. We are also appreciative of the great working environment provided by the staff of the LTI.

This research was supported in part by DARPA grant FA8750-12-2-0342, funded under the DEFT program.



References

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Seattle, Washington, USA, pages 1533–1544.

James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. 2010. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Austin, TX, volume 4, page 3.

Akash Bharadwaj, David Mortensen, Chris Dyer, and Jaime Carbonell. 2016. Phonologically aware neural model for named entity recognition in low resource transfer settings. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 1462–1472.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. pages 1247–1250.

Danushka Bollegala, Takanori Maehara, and Ken-ichi Kawarabayashi. 2015. Embedding semantic relations into word representations. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence.

Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. 2014. A Semantic Matching Energy Function for Learning with Multi-relational Data. Machine Learning 94(2):233–259.

Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Translating Embeddings for Modeling Multi-relational Data. In Advances in Neural Information Processing Systems 26, pages 2787–2795.

Antoine Bordes, Jason Weston, Ronan Collobert, and Yoshua Bengio. 2011. Learning Structured Embeddings of Knowledge Bases. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence. pages 301–306.

Zihang Dai, Lei Li, and Wei Xu. 2016. CFO: Conditional focused neural question answering with large-scale knowledge bases. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 800–810.

Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah A. Smith. 2015. Sparse overcomplete word vector representations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Beijing, China, pages 1491–1500.

Christiane D. Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Alberto García-Durán, Antoine Bordes, and Nicolas Usunier. 2015. Composing Relationships with Translations. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pages 286–290.

Alberto García-Durán, Antoine Bordes, Nicolas Usunier, and Yves Grandvalet. 2016. Combining Two and Three-Way Embedding Models for Link Prediction in Knowledge Bases. Journal of Artificial Intelligence Research 55:715–742.

Kelvin Guu, John Miller, and Percy Liang. 2015. Traversing Knowledge Graphs in Vector Space. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pages 318–327.

Shizhu He, Kang Liu, Guoliang Ji, and Jun Zhao. 2015. Learning to Represent Knowledge Graphs with Gaussian Embedding. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management. pages 623–632.

Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Knowledge Graph Embedding via Dynamic Mapping Matrix. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). pages 687–696.

Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, and Christian Bizer. 2015. DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web 6(2):167–195.

Xiang Li, Tao Qin, Jian Yang, and Tie-Yan Liu. 2016. LightRNN: Memory and Computation-Efficient Recurrent Neural Networks. In Advances in Neural Information Processing Systems 29.

Yankai Lin, Zhiyuan Liu, Huanbo Luan, Maosong Sun, Siwei Rao, and Song Liu. 2015a. Modeling Relation Paths for Representation Learning of Knowledge Bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pages 705–714.



Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. 2015b. Learning Entity and Relation Embeddings for Knowledge Graph Completion. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence. pages 2181–2187.

Alireza Makhzani and Brendan Frey. 2014. K-sparse autoencoders. In Proceedings of the International Conference on Learning Representations.

André F. T. Martins and Ramón Fernandez Astudillo. 2016. From softmax to sparsemax: A sparse model of attention and multi-label classification. In Proceedings of the 33rd International Conference on Machine Learning.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. pages 3111–3119.

Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Association for Computational Linguistics, Suntec, Singapore, pages 1003–1011.

Dat Quoc Nguyen, Kairit Sirts, Lizhen Qu, and Mark Johnson. 2016a. Neighborhood mixture model for knowledge base completion. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL). Association for Computational Linguistics, pages 40–50.

Dat Quoc Nguyen, Kairit Sirts, Lizhen Qu, and Mark Johnson. 2016b. STransE: a novel embedding model of entities and relationships in knowledge bases. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pages 460–466.

Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. 2015. A Review of Relational Machine Learning for Knowledge Graphs. Proceedings of the IEEE, to appear.

Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A Three-Way Model for Collective Learning on Multi-Relational Data. In Proceedings of the 28th International Conference on Machine Learning. pages 809–816.

Sinno Jialin Pan and Qiang Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22(10):1345–1359.

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In Proceedings of the International Conference on Learning Representations.

Yelong Shen, Po-Sen Huang, Ming-Wei Chang, and Jianfeng Gao. 2016. Implicit ReasoNet: Modeling large-scale structured relationships with shared memory. arXiv preprint arXiv:1611.04642.

Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Ng. 2013. Reasoning With Neural Tensor Networks for Knowledge Base Completion. In Advances in Neural Information Processing Systems 26, pages 926–934.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. YAGO: A Core of Semantic Knowledge. In Proceedings of the 16th International Conference on World Wide Web. pages 697–706.

Robert Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288.

Kristina Toutanova and Danqi Chen. 2015. Observed Versus Latent Features for Knowledge Base and Text Inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality. pages 57–66.

Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. 2015. Representing Text for Joint Embedding of Text and Knowledge Bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pages 1499–1509.

Peter D. Turney. 2012. Domain and function: A dual-space model of semantic relations and compositions. Journal of Artificial Intelligence Research 44:533–585.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge Graph Embedding by Translating on Hyperplanes. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, pages 1112–1119.

Zhuoyu Wei, Jun Zhao, and Kang Liu. 2016. Mining inference formulas by goal-directed random walks. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, pages 1379–1388.

Robert West, Evgeniy Gabrilovich, Kevin Murphy, Shaohua Sun, Rahul Gupta, and Dekang Lin. 2014. Knowledge Base Completion via Search-based Question Answering. In Proceedings of the 23rd International Conference on World Wide Web. pages 515–526.

Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2015. Embedding Entities and Relations for Learning and Inference in Knowledge Bases. In Proceedings of the International Conference on Learning Representations.



Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Beijing, China, pages 1321–1331.

Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, pages 1568–1575.



A Appendix

A.1 Domain Sampling Probability

In this section, we define the probability p_r of generating a negative sample from the same domain, as mentioned in Section 3.3. The probability cannot be too high, to avoid generating negative samples that are actually correct, since there are generally many facts missing in KBs.

Specifically, let M_r^H = {h | ∃t, (h, r, t) ∈ P} and M_r^T = {t | ∃h, (h, r, t) ∈ P} denote the head and tail domains of relation r, and let N_r = {(h, r, t) ∈ P} be the induced set of edges with relation r. We define the probability p_r as

p_r = \min\!\left( \frac{\lambda \, |M_r^T| \, |M_r^H|}{|N_r|},\ 0.5 \right)        (4)

Our motivation for this formulation is as follows. Suppose O_r is the set that contains all truthful fact triples on relation r, i.e., all triples in the training set and all other missing correct triples. If we assume all fact triples within the domain have a uniform probability of being true, the probability of a random in-domain triple being correct is

\Pr\big( (h, r, t) \in O_r \mid h \in M_r^H,\ t \in M_r^T \big) = \frac{|O_r|}{|M_r^H| \, |M_r^T|}

Assume that each fact is observed with probability λ; then |N_r| = λ |O_r| and the above probability can be approximated by |N_r| / (λ |M_r^H| |M_r^T|). We want the probability of generating a negative sample from the domain to be inversely proportional to the probability of the sample being true, so we define the probability as in Eq. (4). The results in Section 4 are obtained with λ set to 0.001.
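A minimal sketch of computing p_r from the training triples, following Eq. (4); the function name and data structures are illustrative.

```python
from collections import defaultdict

def domain_sampling_probs(train_triples, lam=0.001):
    """p_r = min(lam * |M_r^T| * |M_r^H| / |N_r|, 0.5) for every relation r.

    train_triples: iterable of (h, r, t) id triples from the training set P.
    """
    heads, tails, counts = defaultdict(set), defaultdict(set), defaultdict(int)
    for h, r, t in train_triples:
        heads[r].add(h)          # M_r^H
        tails[r].add(t)          # M_r^T
        counts[r] += 1           # |N_r|
    return {r: min(lam * len(tails[r]) * len(heads[r]) / counts[r], 0.5)
            for r in counts}
```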

We compare how different values of λ influence our model's performance in Table 5. With a larger λ and a higher domain sampling probability, our model's Hits@10 increases while the mean rank also increases. The rise in mean rank is due to the higher probability of generating a valid triple as a negative sample, which causes the energy of valid triples to increase and leads to a higher overall rank of the correct entity. However, the reasoning capability is boosted, as reflected by the higher Hits@10 shown in the table.

A.2 Performance on individual relations of WN18

We plot the performance of ITransF and STransE on each relation in Figure 6. We see that the improvement is greater on rare relations.

Method     | WN18 MR | WN18 H10 | FB15k MR | FB15k H10
λ = 0.0003 | 217     | 95.0     | 68       | 80.4
λ = 0.001  | 223     | 95.2     | 73       | 80.6
λ = 0.003  | 239     | 95.2     | 82       | 80.9

Table 5: The effect of different values of λ on our model's performance. The compared models are trained for 2000 epochs.

[Figure 6 appears here: per-relation Hits@10 bars for ITransF and STransE on the 18 WN18 relations.]

Figure 6: Hits@10 on each relation in WN18. The relations are sorted according to their frequency.
