
GAIA at SM-KBP 2019 - A Multi-media Multi-lingual Knowledge Extraction and Hypothesis Generation System

Manling Li1, Ying Lin1, Ananya Subburathinam1, Spencer Whitehead1, Xiaoman Pan1, Di Lu1, Qingyun Wang1, Tongtao Zhang1, Lifu Huang1, Heng Ji1

1 University of Illinois at Urbana-Champaign
[email protected]

Alireza Zareian2, Hassan Akbari2, Brian Chen2, Bo Wu2, Emily Allaway2, Shih-Fu Chang2, Kathleen McKeown2

2 Columbia University
[email protected], [email protected]

Yixiang Yao3, Jennifer Chen3, Eric Berquist3, Kexuan Sun3, Xujun Peng3, Ryan Gabbard3

Marjorie Freedman3, Pedro Szekely3, T.K. Satish Kumar3
3 Information Sciences Institute, University of Southern California

[email protected]

Arka Sadhu4, Ram Nevatia4

4 University of Southern California
[email protected]

Miguel Rodriguez5, Yifan Wang5, Yang Bai5, Ali Sadeghian4, Daisy Zhe Wang5

5 University of Florida
[email protected]

1 Introduction

In the past year the GAIA team has improved our end-to-end knowledge extraction, grounding, inference, clustering and hypothesis generation system that covers all languages (English, Russian and Ukrainian), data modalities and knowledge element types defined in the new AIDA ontologies. We participated in the evaluations of all tasks within TA1, TA2 and TA3 and achieved highly competitive performance. Our TA1 system achieves top performance at both intrinsic evaluation and extrinsic evaluation through TA2 and TA3. The system incorporates a number of impactful and fresh research innovations:

• Attentive Fine-Grained Entity Typing Model with Latent Type Representation: We propose a fine-grained entity typing model with a novel attention mechanism and a hybrid type classifier. We advance existing methods in two aspects: feature extraction and type prediction. To capture richer contextual information, we adopt contextualized word representations instead of the fixed word embeddings used in previous work. In addition, we propose a two-step mention-aware attention mechanism to enable the model to focus on important words in mentions and contexts. We also develop a hybrid classification method beyond binary relevance to exploit type interdependency with a latent type representation. Instead of independently predicting each type, we predict a low-dimensional vector that encodes latent type features and reconstruct the type vector from this latent representation.

• Cross-media Structured Common Semantic Space for Multimedia Event Extraction: We propose and develop a new multimedia Event Extraction (M2E2) task that involves jointly extracting events and arguments from text and images. We propose a weakly supervised framework which learns to encode structures extracted from text and images into a common semantic embedding space. This structured common space enables us to share and transfer resources across data modalities for event extraction and argument role labeling.

• Cross-lingual structure transfer for relation and event extraction: We investigate the suitability of cross-lingual structure transfer techniques for these tasks. We exploit relation- and event-relevant language-universal features, leveraging both symbolic (including part-of-speech and dependency path) and distributional (including type representation and contextualized representation) information. By representing all entity mentions, event triggers, and contexts in this complex and structured multilingual common space, using graph convolutional networks, we can train a relation or event extractor from source-language annotations and apply it to the target language.

[Figure 1: The Architecture of GAIA Text Knowledge Extraction]

2 TA1 Text Knowledge Extraction

2.1 Approach Overview

We build an end-to-end knowledge extraction, grounding, inference, clustering and hypothesis generation system that covers all languages and data modalities. It supports the extraction of 187 entity types, 49 relation types, and 139 event types defined in the AIDA ontology. Figure 1 shows the overall framework.

The details of each component will be presented in the following sections.

2.2 Mention Extraction

2.2.1 Coarse-grained Named Mention Extraction

We developed a neural model for name mention extraction. The basic model consists of an embedding layer, a character-level network, a bidirectional long short-term memory (LSTM) layer, a linear layer, and a conditional random fields (CRF) layer. In this architecture, each sentence is represented as a sequence of vectors $X = \{x_1, \dots, x_L\}$, where $x_i$ represents the features of the $i$-th word. We use two types of features in our model: 1. Word embedding that encodes the semantic information of words. 2. Character-level representation that captures subword information.

The LSTM layer then processes the sentence in a sequential manner and encodes both contextual and non-contextual features of each word $x_i$ into a hidden state $h_i$. After that, we decode the hidden state into a score vector $y_i$ with a linear layer. The value of each component of $y_i$ represents the predicted score of a label. However, as the label of each token is predicted separately, the model may produce a path of inconsistent tags such as [BEGIN-GPE, INSIDE-GPE, SINGLETON-GPE]. Therefore, we add a CRF layer on top of the model to capture tag dependencies and predict a globally optimal tag path for each sentence. Given a sentence $X$ and scores predicted by the linear layer $Y = \{y_1, \dots, y_L\}$, the score of a sequence of tags is computed as:

\[
s(X, z) = \sum_{i=1}^{L+1} A_{z_{i-1}, z_i} + \sum_{i=1}^{L} y_{i, z_i},
\]
where each entry $A_{z_{i-1}, z_i}$ is the score of jumping from tag $z_{i-1}$ to tag $z_i$, and $y_{i, z_i}$ is the $z_i$-th dimension of $y_i$ that corresponds to tag $z_i$. We append two special tags <start> ($z_0$) and <end> ($z_{L+1}$) to denote the beginning and end of a sentence. Finally, we maximize the sentence-level log-likelihood of the gold tag path $z$ given the input sentence:
\[
\log p(z|X) = \log \frac{e^{s(X,z)}}{\sum_{\tilde{z} \in Z} e^{s(X,\tilde{z})}} = s(X, z) - \log \sum_{\tilde{z} \in Z} e^{s(X,\tilde{z})},
\]
where $Z$ denotes the set of all possible tag paths.

For English, we improve the model by incorporating ELMo contextualized word representations. We use a pre-trained ELMo encoder to generate a contextualized word embedding $c_i$ for each token and concatenate it with $h_i$.
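For concreteness, both the path score and the partition term can be computed with a short forward-algorithm routine. Below is a minimal NumPy sketch with a toy tag set and random scores (all variable names are ours, not the system's):

```python
import numpy as np

# toy setup: 3 real tags plus the special <start>/<end> tags
N_TAGS, L = 3, 4
START, END = 3, 4

def sequence_score(A, Y, z):
    """s(X, z): transitions A[z_{i-1}, z_i] (incl. <start>/<end>) plus emissions Y[i, z_i]."""
    path = [START] + list(z) + [END]
    trans = sum(A[path[i - 1], path[i]] for i in range(1, len(path)))
    emit = sum(Y[i, zi] for i, zi in enumerate(z))
    return trans + emit

def log_partition(A, Y):
    """log sum over all tag paths of exp(s(X, z')), via the forward algorithm in log space."""
    alpha = A[START, :N_TAGS] + Y[0]                 # scores of all length-1 prefixes
    for i in range(1, Y.shape[0]):
        alpha = np.logaddexp.reduce(alpha[:, None] + A[:N_TAGS, :N_TAGS], axis=0) + Y[i]
    return np.logaddexp.reduce(alpha + A[:N_TAGS, END])

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))        # transition scores; rows/cols 3-4 are <start>/<end>
Y = rng.normal(size=(L, N_TAGS))   # per-token label scores from the linear layer
z = [0, 1, 1, 2]
print(sequence_score(A, Y, z) - log_partition(A, Y))   # log p(z|X)
```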

We train separate models for name, nominal, and pronominal mentions and merge their outputs into the final mention extraction result.

We also explore a reliability-aware dynamic feature composition mechanism (Lin et al., 2019) to obtain better representations for rare and unseen words. We design a set of frequency-based reliability signals to indicate the quality of each word embedding. These signals control mixing gates at different levels in the model. For example, if a word is rare, the model will rely less on its pre-trained word embedding, which is usually not well trained, but assign higher weights to its character and contextual features.

2.2.2 Fine-grained Mention Extraction

We develop an attentive classification model that takes a mention with its context sentence and predicts the most probable fine-grained type. Unlike previous neural models that generally use fixed word embeddings and task-specific networks to encode the sentence, we employ contextualized word representations (Peters et al., 2018) that can capture word semantics in different contexts.

After that, we use a novel two-step attention mechanism to extract crucial information from the mention and its context as follows:
\[
m = \sum_{i \in M} a^m_i r_i, \qquad c = \sum_{i \in C} a^c_i r_i,
\]
where $r_i \in \mathbb{R}^{d_r}$ is the vector of the $i$-th word, $d_r$ is the dimension of $r$, and the attention scores $a^m_i$ and $a^c_i$ are calculated as
\[
a^m_i = \mathrm{Softmax}(v_m^\top \tanh(W^m r_i)),
\]
\[
a^c_i = \mathrm{Softmax}(v_c^\top \tanh(W^c (r_i \oplus m \oplus p_i))),
\]
\[
p_i = \big(1 - \mu (\min(|i - a|, |i - b|) - 1)\big)_+,
\]
where the parameters $W^m \in \mathbb{R}^{d_a \times d_r}$, $v_m \in \mathbb{R}^{d_a}$, $W^c \in \mathbb{R}^{d_a \times (2d_r + 1)}$, and $v_c \in \mathbb{R}^{d_a}$ are learned during training, $a$ and $b$ are the indices of the first and last words of the mention, $d_a$ is set to $d_r$, and $\mu$ is set to 0.1.
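The two attention steps can be traced end to end on toy inputs. The NumPy sketch below uses random vectors in place of the contextualized representations and implements the clipped position feature $p_i$ exactly as defined above (all names are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
d_r = 8; d_a = d_r
R = rng.normal(size=(10, d_r))     # stand-in contextualized vectors, 10-word sentence
a, b = 3, 4                        # first/last word indices of the mention

# Step 1: mention attention over the words inside the span
W_m, v_m = rng.normal(size=(d_a, d_r)), rng.normal(size=d_a)
a_m = softmax(np.array([v_m @ np.tanh(W_m @ R[i]) for i in range(a, b + 1)]))
m = (a_m[:, None] * R[a:b + 1]).sum(axis=0)

# Step 2: context attention, conditioned on m and the clipped position feature p_i
mu = 0.1
W_c, v_c = rng.normal(size=(d_a, 2 * d_r + 1)), rng.normal(size=d_a)
p = np.array([max(0.0, 1 - mu * (min(abs(i - a), abs(i - b)) - 1)) for i in range(len(R))])
a_c = softmax(np.array([v_c @ np.tanh(W_c @ np.concatenate([R[i], m, [p[i]]]))
                        for i in range(len(R))]))
c = (a_c[:, None] * R).sum(axis=0)
print(m.shape, c.shape)   # (8,) (8,)
```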

Next, we adopt a hybrid type classification model consisting of two classifiers. We first learn a matrix $W^b \in \mathbb{R}^{d_t \times 2d_r}$ to predict type scores by
\[
y^b = W^b (m \oplus c),
\]
where $y^b_i$ is the score for the $i$-th type. We also learn to predict the latent type representation from the feature vector using
\[
l = V^l (m \oplus c),
\]
where $V^l \in \mathbb{R}^{2d_r \times d_l}$. We then recover a type vector from this latent representation using
\[
\hat{y} = U \Sigma l,
\]
where $U$ and $\Sigma$ are obtained via Singular Value Decomposition (SVD) as
\[
Y \approx \hat{Y} = U \Sigma L^\top,
\]
where $U \in \mathbb{R}^{d_t \times d_l}$, $\Sigma \in \mathbb{R}^{d_l \times d_t}$, $L \in \mathbb{R}^{N \times d_t}$, and $d_l \ll d_t$. Finally, we combine the scores from both classifiers:
\[
y = \sigma(W^b (m \oplus c) + \gamma W^l l),
\]
where $\gamma$ is set to 0.1. The training objective is to minimize the cross-entropy loss
\[
J(\theta) = -\frac{1}{N} \sum_i^N y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i).
\]
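A minimal sketch of the hybrid classifier, assuming the reconstruction factors are read off a truncated SVD of a toy multi-label matrix $Y$; the exact factor bookkeeping is our illustrative reading of the equations above, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d_t, d_l, d_f = 500, 40, 10, 16               # examples, types, latent dim, feature dim (2*d_r)
Y = (rng.random((N, d_t)) < 0.1).astype(float)   # toy multi-label training matrix

# Truncated SVD of the label matrix: Y ~ U S L^T, keeping d_l components
U, S, Lt = np.linalg.svd(Y, full_matrices=False)
U, S, Lt = U[:, :d_l], np.diag(S[:d_l]), Lt[:d_l]

feat = rng.normal(size=d_f)               # m (+) c for one mention
W_b = rng.normal(size=(d_t, d_f))         # binary-relevance classifier
V_l = rng.normal(size=(d_l, d_f))         # latent-representation predictor
W_l = Lt.T @ S                            # maps a latent vector back to type scores

l = V_l @ feat                            # predicted latent type representation
gamma = 0.1
y_hat = 1 / (1 + np.exp(-(W_b @ feat + gamma * (W_l @ l))))  # sigmoid of combined scores
print(y_hat.shape)   # (40,)
```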

Furthermore, we obtain YAGO fine-grained types by linking entities to Freebase (LDC2015E42) and mapping them to AIDA entity types. In addition, for GPE and LOC entities, we link them to GeoNames1 and determine their fine-grained types using the GeoNames attributes feature class and feature code. Considering that most nominal mentions cannot be linked to Freebase or GeoNames, and given the lack of training data, we develop a nominal keyword list for each type. We compute a weighted score for these typing results and normalize the scores as typing confidence. Moreover, some entity types are related to events (e.g., PER.Protester is usually a person argument playing the protester role in a protest event), and thus we label them based on event extraction results.

1 http://geonames.org/

2.2.3 Entity Filler Mention Extraction

We use Stanford CoreNLP (Manning et al., 2014) to extract values and titles from English raw documents. We also utilize the time normalization function included in CoreNLP, providing the date when the document was posted as the reference time. For time expressions in Russian and Ukrainian, we use a deterministic rule-based system based on SUTime (Finkel et al., 2005).

For color entity types, we simply assemble a list of common colors in English, Russian, and Ukrainian, and then perform string matching to tag these entities.

2.2.4 Informative Justification Extraction

For each named entity, we rank all mentions in terms of length and frequency. For each nominal entity, we select the noun phrase chunk containing the most frequent mentions. For pronoun entities, we select the most frequent mention. The confidence of an informative justification is the weighted sum over all mentions, with higher weights for name mentions. If an entity has only nominal and pronoun mentions, we reduce its confidence so that it is ranked after entities with name mentions.

2.3 Entity Linking

Given a set of name mentions $M = \{m_1, m_2, \dots, m_n\}$, we first generate an initial list of candidate entities $E_m = \{e_1, e_2, \dots, e_n\}$ for each name mention $m$, and then rank them to select the candidate entity with the highest score as the appropriate entity for linking.

We adopt a dictionary-based candidate generation approach (Medelyan and Legg, 2008). For Russian and Ukrainian, we use translation dictionaries collected from Wikipedia (Ji et al., 2009) to translate each mention into English first. If a mention has multiple translations, we merge the linking results of all translations at the end. Then we rank these entity candidates based on three measures: salience, similarity and coherence (Pan et al., 2015).

We utilize Wikipedia anchor links to compute salience based on the entity prior:
\[
p_{prior}(e) = \frac{|A_{*,e}|}{|A_{*,*}|}, \tag{1}
\]
where $A_{*,e}$ is the set of anchor links that point to entity $e$, and $A_{*,*}$ is the set of all anchor links in Wikipedia. We define the mention-to-entity probability as
\[
p_{mention}(e|m) = \frac{|A_{m,e}|}{|A_{m,*}|}, \tag{2}
\]
where $A_{m,*}$ is the set of anchor links with the same anchor text $m$, and $A_{m,e}$ is the subset of $A_{m,*}$ that points to entity $e$.
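Both measures reduce to anchor-count ratios. A sketch over a hypothetical anchor-link table:

```python
from collections import Counter

# (anchor_text, target_entity) pairs harvested from Wikipedia anchor links (toy data)
anchors = [("Paris", "Paris"), ("Paris", "Paris"), ("Paris", "Paris_Hilton"),
           ("City of Light", "Paris"), ("Hilton", "Paris_Hilton")]

pair_counts = Counter(anchors)
entity_counts = Counter(e for _, e in anchors)
mention_counts = Counter(m for m, _ in anchors)
total = len(anchors)

def p_prior(e):
    return entity_counts[e] / total                 # |A_{*,e}| / |A_{*,*}|

def p_mention(e, m):
    return pair_counts[(m, e)] / mention_counts[m]  # |A_{m,e}| / |A_{m,*}|

print(p_prior("Paris"))             # 0.6
print(p_mention("Paris", "Paris"))  # 2/3
```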

Then we compute the similarity between a mention and each candidate entity. We first utilize the entity types of mentions extracted by name tagging. For each entity $e$ in the KB, we assign a coarse-grained entity type $t$ (PER, ORG, GPE, LOC, Miscellaneous (MISC)) using a Maximum Entropy based entity classifier (Pan et al., 2017). We incorporate entity types by combining them with the mention-to-entity probability $p_{mention}(e|m)$ (Ling et al., 2015):
\[
p_{type}(e|m, t) = \frac{p(e|m)}{\sum_{e \mapsto t} p(e|m)}, \tag{3}
\]
where $e \mapsto t$ indicates that $t$ is the entity type of $e$.

Following (Huang et al., 2017), we construct a weighted undirected graph $G = (E, D)$ from DBpedia, where $E$ is the set of all entities in DBpedia and $d_{ij} \in D$ indicates that entities $e_i$ and $e_j$ share some DBpedia properties. The weight $w_{ij}$ of $d_{ij}$ is computed as:
\[
w_{ij} = \frac{|p_i \cap p_j|}{\max(|p_i|, |p_j|)}, \tag{4}
\]

where $p_i$, $p_j$ are the sets of DBpedia properties of $e_i$ and $e_j$ respectively. After constructing the knowledge graph, we apply the graph embedding framework proposed by (Tang et al., 2015) to generate knowledge representations for all entities in the KB. We compute the cosine similarity between the vector representations of two entities to model the coherence $coh(e_i, e_j)$ between them. Given a name mention $m$ and a candidate entity $e$, we define the coherence score as:
\[
p_{coh}(e) = \frac{1}{|C_m|} \sum_{c \in C_m} coh(e, c), \tag{5}
\]


where $C_m$ is the union of entities for coherent mentions of $m$.

Finally, we combine these measures and compute the final score for each candidate entity $e$.
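The combination function is not spelled out here; one plausible reading is a weighted linear combination of the three measures, as in the sketch below (weights and scores are illustrative assumptions):

```python
# Hypothetical per-candidate measure values for the mention "Paris"
candidates = {
    "Paris":        {"prior": 0.60, "type": 0.70, "coh": 0.80},
    "Paris_Hilton": {"prior": 0.40, "type": 0.10, "coh": 0.20},
}

def final_score(scores, w=(0.3, 0.4, 0.3)):
    # illustrative linear combination; the actual weighting is not given in the paper
    return w[0] * scores["prior"] + w[1] * scores["type"] + w[2] * scores["coh"]

best = max(candidates, key=lambda e: final_score(candidates[e]))
print(best)   # Paris
```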

2.4 Within-document Coreference Resolution

For name mentions that cannot be linked to the KB, we apply the heuristic rules described in Table 1 to cluster them within each document.

| Rule | Description |
|---|---|
| Exact match | Create initial clusters based on mention surface form. |
| Normalization | Normalize surface forms (e.g., remove designators and stop words) and group mentions with the same normalized surface form. |
| NYSIIS (Taft, 1970) | Obtain the soundex NYSIIS representation of each mention and group mentions with the same representation longer than 4 letters. |
| Edit distance | Cluster two mentions if the edit distance between their normalized surface forms is equal to or smaller than D, where D = length(mention1)/8 + 1. |
| Translation | Merge two clusters if they include mentions with the same translation. |

Table 1: Heuristic Rules for NIL Clustering.

For each cluster, we assign the most frequent name mention as the document-level canonical mention.
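As an illustration, the edit-distance rule of Table 1, with its length-dependent threshold D, can be implemented as follows (normalization is reduced to lowercasing here; a full implementation would also strip designators and stop words):

```python
def edit_distance(s, t):
    """Standard Levenshtein distance via dynamic programming."""
    dp = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        prev, dp[0] = dp[0], i
        for j, ct in enumerate(t, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (cs != ct))
    return dp[-1]

def same_cluster(m1, m2):
    n1, n2 = m1.lower(), m2.lower()   # stand-in for the full normalization step
    D = len(n1) // 8 + 1              # threshold from Table 1
    return edit_distance(n1, n2) <= D

print(same_cluster("Donetsk", "Donetske"))   # True (distance 1 <= D = 1)
```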

2.4.1 English Coreference Resolution

We take the output of both the entity linking and NIL clustering systems to predict coreference clusters. Our main architecture is based on end-to-end neural coreference resolution (Lee et al., 2017). However, since we already know the entity type of each mention, we further simplify the coreference model by only computing scores between mentions with the same entity type.

Given a document $D = \{y_1, y_2, \dots, y_n\}$, we receive its mention spans $S = \{s_1, s_2, \dots, s_l\}$ with their mention types $M = \{m_1, m_2, \dots, m_l\}$, entity types $E = \{e_1, e_2, \dots, e_l\}$, and clusters $C = \{c_1, c_2, \dots, c_l\}$ from the entity linking and NIL clustering components, where $y_i$ is an individual word, $n$ is the length of the document, $l$ is the number of mentions, $c_i \in \{1, 2, \dots, v\}$, $v$ is the number of clusters, and each span $s_i$ contains two indices START(i) and END(i). To compute a word representation at position $k$, we use a pre-trained ELMo (Peters et al., 2018) embedding layer to obtain its representation $y^*_k$.

Then for each span $i$, we compute soft attention (Bahdanau et al., 2014) over each word in the span:
\[
\alpha_k = \omega_\alpha \cdot \mathrm{FFNN}_\alpha(y^*_k), \tag{6}
\]
\[
\alpha_{i,k} = \frac{\exp(\alpha_k)}{\sum_{k=\mathrm{START}(i)}^{\mathrm{END}(i)} \exp(\alpha_k)}, \tag{7}
\]
\[
\hat{y}_i = \sum_{k=\mathrm{START}(i)}^{\mathrm{END}(i)} \alpha_{i,k} \cdot y^*_k, \tag{8}
\]

where $\hat{y}_i$ is a weighted sum of all word hidden states in span $i$, and FFNN is a feed-forward neural network. Then the representation $g_i$ of a span $i$ is computed as:
\[
g_i = [y^*_{\mathrm{START}(i)}, y^*_{\mathrm{END}(i)}, \hat{y}_i, \Phi_g(i), \Gamma_g(i)], \tag{9}
\]
where the feature vector $\Phi_g(i)$ encodes the size of span $i$, the feature vector $\Gamma_g(i)$ encodes the entity type of span $i$, and $[\,,\,]$ denotes vector concatenation.

Finally, to compute clusters, for each span $g_i$ we check all preceding mentions $g_j$ (with $j < i$) that have the same entity type. For each pair of spans $g_i$ and $g_j$, inspired by dynamic feature composition (Lin et al., 2019), we introduce two dynamic span gates to update each span respectively:
\[
g_i = \sigma(\mathrm{FFNN}_{\sigma 1}([g_i, g_j])) * g_i, \tag{10}
\]
\[
g_j = \sigma(\mathrm{FFNN}_{\sigma 2}([g_i, g_j])) * g_j, \tag{11}
\]
where $\sigma$ is the sigmoid function. Then we compute the pairwise score $x(i, j)$ as:
\[
x(i, j) = w_x \cdot [g_i, g_j, g_i \circ g_j, \Phi_x(i, j)], \tag{12}
\]
where $\circ$ is the element-wise similarity feature vector, and $\Phi_x(i, j)$ encodes the distance between the two spans.
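The gating and scoring of Equations (10)-(12) can be checked on toy vectors; in the sketch below the FFNNs are collapsed to single linear maps for brevity (weights are random, names hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(3)
d = 6
g_i, g_j = rng.normal(size=d), rng.normal(size=d)    # span representations
W1, W2 = rng.normal(size=(d, 2 * d)), rng.normal(size=(d, 2 * d))

pair = np.concatenate([g_i, g_j])
gi = sigmoid(W1 @ pair) * g_i        # eq. (10): gated update of span i
gj = sigmoid(W2 @ pair) * g_j        # eq. (11): gated update of span j

phi = np.array([0.5])                # distance feature between the two spans
w_x = rng.normal(size=3 * d + 1)
x_ij = w_x @ np.concatenate([gi, gj, gi * gj, phi])  # eq. (12): pairwise score
print(x_ij)
```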

For each span $s_i$, following Lee et al. (2017), we learn a conditional probability distribution over all its antecedents $a_i \in A(i) = \{\epsilon, 1, \dots, i-1\}$, where $\epsilon$ is a dummy antecedent:
\[
P(s_1, \dots, s_l | D) = \prod_{i=1}^{l} \frac{\exp(x(i, a_i))}{\sum_{a' \in A(i)} \exp(x(i, a'))}. \tag{13}
\]

Similarly, we then optimize the marginal log-likelihood of all correct antecedents implied by the gold-standard clustering:
\[
\log \prod_{i=1}^{l} \sum_{a' \in A(i) \cap \mathrm{GOLD}(i)} P(a'), \tag{14}
\]


where GOLD(i) is the set of spans in the gold-standard cluster containing span $i$. At test time, we only use the cluster prediction results for system-generated spans $s_i$ whose original cluster $c_i$ either contains only nominal mentions or a single mention.

We train the English system on ACE2005, EDL2016, EDL2017, and CoNLL-2012.

2.4.2 Russian/Ukrainian Name Coreference Resolution

For Russian and Ukrainian, we perform an additional name coreference step as post-processing. We construct a graph by considering each existing AIF cluster as a node, and apply the following rules to add edges between clusters if:

1. There is an exact match in the string or stem between any name mentions in either cluster.

2. There exist two strings from the two clusters whose Levenshtein distance is less than or equal to 2, and the size of the string or stem is at least 6 or 5, respectively.

3. There exists an appos2 relation between two name mentions with the same entity type. We apply the UDPipe (Straka and Strakova, 2017) models, which are pre-trained on the Universal Dependencies (Nivre et al., 2019) dataset, to extract the syntactic relations.

4. For person names, there exists a single-token string in one cluster which appears as the second token of a two-token string in another cluster and does not appear as the second token of any other two-token string (e.g., Smith to John Smith, unless Jane Smith also appears in the document).

5. For person names, there exists a two-token string in one cluster which matches a two-token string in another cluster, accounting for morphological inflection of both strings.

A coreference cluster is constructed from each connected component of the resulting graph. For Russian and Ukrainian stemming, stems are determined in a simple, heuristic way by stripping a list of common Russian and Ukrainian inflectional affixes from the end of the string.
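Merging clusters by connected components can be done with a small union-find; the sketch below uses only the exact-match rule (rule 1) as the edge predicate for illustration:

```python
def find(parent, x):
    """Union-find root lookup with path compression."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def connected_clusters(clusters, edge_rule):
    """clusters: list of sets of name strings; edge_rule: pairwise predicate."""
    parent = list(range(len(clusters)))
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            if edge_rule(clusters[i], clusters[j]):
                parent[find(parent, i)] = find(parent, j)
    merged = {}
    for i, c in enumerate(clusters):
        merged.setdefault(find(parent, i), set()).update(c)
    return list(merged.values())

exact = lambda c1, c2: bool(c1 & c2)    # rule 1: shared surface form
print(connected_clusters([{"Путин"}, {"Путин", "Putin"}, {"Київ"}], exact))
```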

2 appositional modifier

2.5 Relation Extraction

For fine-grained relation extraction, we map the AIDA subtypes to the ACE/ERE ontology and apply a Convolutional Neural Network (CNN) based model to detect these mapped types, and then utilize entity type constraints as well as dependency patterns to re-categorize the detected relations into fine-grained types (Section 2.5.1). For types that cannot be mapped directly to ACE/ERE, such as Genafl.Sponsor or Evaluate.Deliberateness, we design dependency patterns and implement a rule-based system to detect these fine-grained relations directly from the text (Section 2.5.2).

2.5.1 ACE/ERE Relation Extraction and Fine-grained Typing

The relation extraction system predicts relations between each pair of entity mentions within the same sentence. We employ a CNN with piecewise pooling as our underlying model.

Model Given a source sentence $s = [w_1, \dots, w_m]$ along with entity mentions $e_1$ and $e_2$, for each word $w_i$ we generate an embedding $v_i = [w_i, p_i, \bar{p}_i, t_i, \bar{t}_i, c_i, \eta_i]$, where $w_i$ denotes the word embedding of $w_i$ from a set of pre-trained embeddings, $p_i$ and $\bar{p}_i$ are position embeddings indicating the relative distance from $w_i$ to $e_1$ and $e_2$ respectively, and $t_i$ and $\bar{t}_i$ are the entity type embeddings of $e_1$ and $e_2$ respectively. Finally, $c_i$ is the chunking embedding, and $\eta_i$ is a binary digit indicating whether the word is within the shortest dependency path between $e_1$ and $e_2$. All embeddings besides the pre-trained word embedding are randomly initialized and optimized during training. The result of this embedding process is a sequence of word representations $V = \{v_1, \dots, v_m\}$. We then apply a convolution filter $W$, with an added bias $b$, to each sliding $n$-gram phrase $g_j$ (i.e., $g_j = \tanh(W \cdot V) + b$). Since different segments of a sentence do not hold equal weight towards the underlying semantics of the sentence, we split all the $g_j$ into three parts based on the two entity mentions and perform piecewise max pooling. Lastly, we concatenate the representations of these three segments, feed them into a fully connected layer, and apply a linear projection and softmax to classify the relations.
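The distinguishing step is the piecewise max pooling: the per-position convolution outputs are split into three segments at the two entity positions and pooled separately. A NumPy sketch with toy shapes:

```python
import numpy as np

rng = np.random.default_rng(4)
L, f = 12, 8                        # sentence length, number of conv filters
G = rng.normal(size=(L, f))         # per-position convolution outputs g_j
e1, e2 = 3, 8                       # positions of the two entity mentions

# split at the two mentions and max-pool each segment independently
segments = [G[:e1 + 1], G[e1 + 1:e2 + 1], G[e2 + 1:]]
pooled = np.concatenate([seg.max(axis=0) for seg in segments])
print(pooled.shape)   # (24,) = 3 * f, fed to the fully connected layer
```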


Training Details For English training, we manually map DEFT Rich ERE and ACE2005 to the AIDA ontology and combine all the resources. For Ukrainian and Russian, we asked a native speaker to annotate part of the seedling corpus. To further improve system performance, we also adopt a majority-vote strategy to ensemble the models for all three languages. Additionally, we find that our model cannot capture global information well due to instance-level training. Accordingly, we extract some high-confidence, frequent relation patterns (Table 2) from the training data and treat them as hard constraints to tackle this problem.

| Relation type | Relation pattern |
|---|---|
| Employment | Person of Organization |
| Organization origin | Organization in Geography |
| Leadership | Geography prime Person |
| Located near | Person at Facility |
| Part whole | Organization of the Organization |
| Manufacture | Geography Weapon |

Table 2: Examples of frequent relation patterns.

Fine-grained Typing We take the relations extracted by the above component as input to a rule-based component that assigns fine-grained relation types, if possible. This component only matches high-confidence patterns and largely relies on entity type constraints. For example, if a Leadership relation is detected between a PER mention and an ORG mention, and the ORG mention is a law enforcement agency or military organization, then we assign the fine-grained type Leadership.MilitaryPolice.

In conjunction with entity type constraints, we also employ dependency patterns to ensure correctness. For instance, if two entities fit the argument constraints for LocatedNear and the location is an object of a "surround" keyword, then we assign the fine-grained type LocatedNear.Surround. We base these patterns on observations from training data across English, Ukrainian, and Russian.

2.5.2 Unmapped Relation Types

We implement a rule-based relation extraction component for Evaluate.Deliberateness/Legitimacy, Measurement.Size, Genafl.Orgweb, and Information.Make/Color based on the following rules (for English only):

Evaluate.Deliberateness/Legitimacy We extract this relation in a similar fashion to the fine-grained typing. We first utilize argument type constraints, such as the event having a StartPosition type. Then, we employ keywords and dependency patterns to determine who is the holder of the evaluation.

Measurement.Size We first apply Stanford CoreNLP to extract all NUMBER mentions $N$. Then for each $n_i \in N$, we check the word appearing after $n_i$. If it is part of a name mention extracted by our EDL system, we tag the relation between them as Measurement.Count. We then attempt to assign a fine-grained type to these relations based upon whether or not units are mentioned in the sentence.

Genafl.Orgweb Our EDL system can detect URLs as ORG entities and cluster them with other name mentions. We tag those URLs as Genafl.Orgweb of the ORG named entities.

Information.Make/Color Any detected entities that fit the argument constraints for these types and satisfy the dependency pattern criteria are assigned one of these types. Only entities of type Color may be assigned Information.Color.

2.5.3 Sponsorship and AssignBlame Relation Extraction

We implement a rule-based component for GeneralAffiliation.Sponsorship and ResponsibilityBlame.AssignBlame using manually collected trigger words in English and Russian. A seed set of trigger words is collected from the training data and then augmented to include synonyms and related words. A native Russian speaker augmented the set of Russian trigger words. Each trigger is associated with a fine-grained type and accompanied by a confidence score that accounts for whether we have seen the trigger, a translation of the trigger, or a synonym of the trigger in a relation.

For English, we extract relations by finding the shortest dependency path (SDP) between two entity mentions and checking for trigger words along this path. If any triggers occur in the SDP, we extract the relation associated with the trigger word between the two entity mentions. For Russian, we look at the plain text between the two entity mentions instead of the SDP, and for Ukrainian we first translate to Russian and then follow the same procedure as for Russian.


2.5.4 Sentiment Relation Extraction

We train a targeted-sentiment system (Zhong et al., 2019) for the Evaluate.Sentiment relation. The system is developed for English and then adapted to Russian and Ukrainian.

The English system features a technique for training attention to focus on human rationales and can distinguish between pairs of entities that have sentiment relations and pairs that do not. We define human ground-truth attention as uniform over the words in a rationale, selecting rationales from the MPQA dataset (Wilson, 2008). The model's attention is then made to match the human attention through a new KL-divergence loss component, giving significant improvements and outperforming a wide range of competitive baselines (Zhong et al., 2019).
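A sketch of the attention supervision: the human attention is uniform over the rationale tokens, and a KL-divergence term penalizes the model's deviation from it (the rationale mask and attention values below are hypothetical):

```python
import numpy as np

def kl(p, q, eps=1e-9):
    """KL(p || q) for discrete distributions; zero-mass terms contribute nothing."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# uniform "human" attention over the rationale tokens of a 7-word sentence
rationale = np.array([0, 0, 1, 1, 1, 0, 0], dtype=float)
human = rationale / rationale.sum()

model_att = np.array([0.05, 0.05, 0.3, 0.3, 0.2, 0.05, 0.05])
loss_att = kl(human, model_att)    # added to the classification loss during training
print(round(loss_att, 4))
```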

We adapt the English sentiment system to Russian and Ukrainian by training the model using pre-trained multilingual English-Russian-Ukrainian word embeddings (Conneau et al., 2017) as input features. Since the English sentiment system uses positional embeddings derived from the POS, we also use Russian POS tags (Kopotev and Ianda, 2006). Due to the linguistic similarities between Ukrainian and Russian, we assume a one-to-one relationship between a Ukrainian sentence and its Russian translation and use this to project Russian POS tags to Ukrainian for the sentiment system. In this way, we only need labeled training instances for English in order to train a system for all three languages. The model achieves an F1 score comparable to the original system (Zhong et al., 2019) trained using pre-trained English-only word embeddings (Pennington et al., 2014).

2.6 Event Extraction

Event extraction consists of two subtasks: trigger labeling identifies the trigger words of events and determines the event types, e.g., Conflict.Attack; argument role labeling assigns the relation between entities (or arguments) and triggers, e.g., an Attacker in a Conflict.Attack event.

In this section, we introduce our English event extractor in Sections 2.6.1 and 2.6.2, and then present our Russian and Ukrainian event extractor in Section 2.6.3, followed by the fine-grained typing methods in Section 2.6.4.

2.6.1 Event Extraction with Generative Adversarial Imitation Learning

Trigger labeling The trigger labeling module still follows a sequence labeling model, which includes a Bi-LSTM model and a CRF decoder (Feng et al., 2018). This differs from previous work (Zhang et al., 2018), which utilizes the Q-learning algorithm from (Zhang et al., 2019). We notice that under the AIDA schema some triggers are phrases (more than one word), and we find that the performance of the Bi-LSTM-CRF model on such phrases is better and more stable.

The input to the Bi-LSTM framework is a word embedding concatenated from the following categories:

• Token surface embeddings: We initialize a random embedding dictionary for each token; these embeddings are updated during the training phase.

• Character-based embeddings: Similar to the token surface embeddings, each character has a randomly initialized embedding. We feed the embeddings into a token-level Bi-LSTM network and concatenate the final hidden representations from both directions for the token. These embeddings are also updated during the training phase.

• POS embeddings: We apply Part-of-Speech (POS) tagging to the sentences with spaCy. Similarly, each POS label has a trainable embedding which is randomly initialized.

• Pre-trained embeddings: We also utilize pre-trained Word2Vec embeddings from a Wikipedia article dump (the version of January 1st, 2017). These embeddings are fixed and not updated in the training phase.

Argument role labeling Traditionally, argument role labeling is considered a classification problem. In this work, we model it in a reinforcement learning (RL) scenario and consider a label as an action taken by an agent (the model/extractor).

In this scenario, we define a state (feature)
\[
s_{tr,ar} = \langle v_{tr}, v_{ar}, a_{tr}, a_{ar}, f_{ss} \rangle, \tag{15}
\]
where the $v$'s are context embeddings from the Bi-LSTM network, $tr$ denotes triggers, $ar$ denotes the argument candidate (typically, detected entities from Section 2.2), the $a$'s are the actions (labels) of event types (with subscript $tr$) and entity types (with subscript $ar$), and $f_{ss}$ denotes a sub-sentence embedding consisting of the outputs of three Bi-LSTM networks that consume the left, middle and right sub-sentences separated by the trigger and argument candidate.

We feed the state representation into an MLP to obtain the probability distribution $Q_{tr,ar}$ over actions, and we determine the argument role $a_{tr,ar}$ with
\[
a_{tr,ar} = \arg\max_{a_{tr,ar}} Q_{tr,ar}(s_{tr,ar}, a_{tr,ar}). \tag{16}
\]
We also assign a reward $R$ for the action (argument role), and we train the models by minimizing
\[
L_{pg} = -R \log P(a_{tr,ar} | s_{tr,ar}). \tag{17}
\]

The subscript $pg$ denotes the RL algorithm called policy gradient, which focuses on optimizing the actions over the probability distribution.

Dynamic Reward Estimation with GAN From Equation 17 we notice that, if we impose a fixed reward $R$, there is no difference between a classification model with cross-entropy loss and an RL model. Hence, we introduce a reward estimator based on a GAN to issue dynamic rewards with regard to the labels committed by the event extractor. If the extractor repeatedly makes mistakes on an instance, the estimator increases the penalty, and the reward also increases, as the target instance is considered a difficult one. The expanded difference between rewards and penalties steers the extractor away from wrong actions (labels) and pushes it toward correct ones.

We train the reward estimator on the combination of the feature vector $s_{tr,ar}$ from Equation 15 and an action embedding representation for $a_{tr,ar}$. The reward estimator is the discriminator in the GAN, and distinguishes whether the input comes from a ground-truth instance (in GAN terms, the real or expert one) or from the extractor (the generated one).

The output of the discriminator lies in the $(0, 1)$ range, and we use a linear transform to expand it to $(-5, 5)$. We then use this output as the reward $R$ to optimize the extractor, i.e., the generator in the GAN scenario.
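A minimal sketch of the reward path, assuming the linear transform is the straightforward affine map from (0, 1) to (−5, 5):

```python
import math

def reward_from_discriminator(d_out):
    """Affine map (0, 1) -> (-5, 5)."""
    return 10.0 * d_out - 5.0

def policy_gradient_loss(R, p_action):
    """Eq. (17): L_pg = -R * log P(a | s)."""
    return -R * math.log(p_action)

d_out = 0.8                              # discriminator finds the action expert-like
R = reward_from_discriminator(d_out)     # 3.0
print(policy_gradient_loss(R, p_action=0.6))
```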

Training Details Taking the out-of-vocabulary (OOV) problem into consideration, we intentionally set an "UNK" (unknown) entry in the look-up dictionaries of token surface embeddings, character-based embeddings and POS embeddings. During training, we randomly mask known tokens, POS tags and characters in the training sentences with the "UNK" mask; moreover, we also set all-0 vectors for randomly selected tokens in the pre-trained embeddings.

To improve performance, we expand the training dataset. The set includes documents from the ACE and ERE datasets. To bridge the gap among the ACE, ERE and AIDA ontologies, we clean up the dataset. We map the labels in ACE (e.g., Phone-Write, Transport) to ERE ones (correspond, transport-art). We also remove sentences with no trigger annotations from the ACE data to ensure that ERE labels that have no counterparts in ACE (e.g., broadcast) will not be missed due to empty annotation. Finally, we map all ERE labels to AIDA ones.

2.6.2 Event Extraction with Bi-LSTM

Trigger Labeling We develop a language-independent neural network architecture, Bi-LSTM-CRFs, which captures meaningful sequential information and jointly models nugget type decisions for event nugget detection. This architecture is similar to the one in (Yu et al., 2016; Feng et al., 2016; Al-Badrashiny et al., 2017).

Given a sentence $X = (X_1, X_2, \dots, X_n)$ and its corresponding tags $Y = (Y_1, Y_2, \dots, Y_n)$, where $n$ is the number of units in the sequence, we initialize each word with a vector by looking up word embeddings. Specifically, we use the Skip-Gram model to pre-train the word embeddings (Mikolov et al., 2013). Then, the sequence of words in each sentence is taken as input to the Bi-LSTM to obtain meaningful and contextual features. We feed these features into CRFs and maximize the log-probabilities of all tag predictions of the sequence.

Argument Labeling For event argument extraction, given a sentence, we first adopt the event trigger detection system to identify candidate triggers and utilize the EDL system to recognize all candidate arguments, including Person, Location, Organization, Geo-Political Entity, Time expression and Money phrases. For each trigger and each candidate argument, we select two types of sequence information: the surface context between the trigger word and candidate argument, or the shortest dependency path.


The shortest dependency path is obtained with the Breadth-First-Search (BFS) algorithm over the whole dependency parsing output; both sequences serve as input to our neural architecture.

Each word in the sequence is assigned a feature vector concatenated from vectors of word embedding, position and POS tag. In order to better capture the relatedness between words, we also adopt CNNs to generate a vector for each word based on its character sequence. For each dependency relation, we randomly initialize a vector which holds the same dimensionality as each word.

We encode the following two types of vector sequences with CNNs and Bi-LSTMs respectively. For the surface-context based sequence, we utilize a general CNN architecture with max-pooling to obtain a vector representation. For the dependency-path based sequence, we adopt Bi-LSTMs with max-pooling to get an overall vector representation. Finally, we concatenate these two vectors and feed them into two softmax functions: one predicts whether the current entity mention is a candidate argument of the trigger, and the other predicts the final argument role.

Training Details For all the above components, we utilize all the available event annotations from DEFT Rich ERE and ACE2005 for training.

2.6.3 Russian and Ukrainian Event Extraction

Event extraction for Russian and Ukrainian text also follows two steps, as in (Zhang et al., 2018) and the English counterparts in Sections 2.6.1 and 2.6.2. First, we apply a Bi-LSTM over the pre-trained word embeddings for each word in a sentence to generate hidden representations denoting context. These vectors are then fed into a Softmax classifier, where each word is classified as either a trigger of a particular AIDA event type, or not.

In the second stage, we employ a CNN with Softmax classification, similar to (Kim, 2014), to identify and label entities that play a role in each event mention. Here, all entities within the same sentence as an event mention are taken as candidate arguments. For each candidate argument, the portion of the sentence between the argument and its trigger word is used as input to the CNN. Similar to step one, we use pre-trained embeddings to represent each word. We use masks of length 3, 4, and 5 in the convolution layer. Then, after max-pooling, a fully connected layer generates a single vector which is classified by a Softmax classifier as a certain argument type, or not an argument. We use post-processing rules to ensure that all results follow the constraints set by the AIDA ontology.

We also explore a structured transfer learning approach (Subburathinam et al.) using graph convolutional networks. It represents all entity mentions, event triggers, and contexts in a complex and structured multilingual common space, so that we can train a relation and event extractor on English annotations and apply it to Russian and Ukrainian. In this process, we exploit event-relevant language-universal features, leveraging both symbolic (including part-of-speech and dependency path) and distributional (including type representation and contextualized representation) information.

2.6.4 Fine-grained Event Typing

For events whose coarse-grained types are not covered in the English training data, we map FrameNet frames to AIDA event types (Table 3). We tag frame elements as event arguments only if they overlap with named entities. For Russian and Ukrainian, we use GIZA (Och, 1999; Och and Ney, 2003) to obtain the word alignment.

For Government.Spy.Spy, Government.Vote, Government.Agreements, and Disaster.AccidentCrash.AccidentCrash, which cannot be mapped from FrameNet, we implement a dependency-based system. We detect event triggers based on fuzzy string matching with keywords3. We further decide event arguments based on the syntactic relation between the entity and the verb keyword in the dependency tree.

We develop a rule-based fine-grained event typing system, including verb-based rules, context-based rules, and argument-based rules. For each coarse-grained type, we collect a list of verb keywords to distinguish its fine-grained types. For example, Conflict.Attack is Conflict.Attack.FirearmAttack when the trigger is shooting. Contextual keywords are collected similarly. For example, funeral in the context of Contact.Meet can be used to determine Contact.FuneralVigil.Meet.

3 Government.Spy.Spy: spy; Government.Vote: vote, ballot, poll, referendum, election; Disaster.AccidentCrash.AccidentCrash: crash; Government.Agreements: agreement


| FrameNet event type | AIDA event type |
|---|---|
| Conquering | Transaction.Transaction.TransferControl |
| Criminal investigation | Justice.Investigate.InvestigateCrime |
| Sign agreement | Government.Agreements.AcceptAgreementContractCeasefire |
| Inspecting | Inspection |
| Destroying | ArtifactExistence.DamageDestroy.Destroy |
| Damaging | ArtifactExistence.DamageDestroy.Damage |
| Quitting a place | Conflict.Yield.Retreat |
| Surrendering | Conflict.Yield.Surrender |
| Explosion | Disaster.FireExplosion.FireExplosion |
| Fire burning | Disaster.FireExplosion.FireExplosion |
| Catching fire | Disaster.FireExplosion.FireExplosion |
| Amalgamation | Government.Formation.MergeGPE |
| Intentionally create | Government.Formation.StartGPE |
| Come together | Contact.Discussion |
| Discussion | Contact.Discussion |
| Rescuing | Movement.TransportPerson.EvacuationRescue |
| Prevarication | Contact.Prevarication |
| Request | Contact.CommandOrder |
| Speak on topic | Contact.MediaStatement |
| Giving in | Conflict.Yield |

Table 3: Mapping from FrameNet to AIDA event ontology.

Multiple rules might be applied in order to determine one fine-grained type. The arguments can also be used to determine fine-grained types. For example, if there is an instrument for a Life.Die event, then it should be Life.Die.DeathCausedByViolentEvents, not Life.Die.NonviolentDeath. Likewise, the fine-grained types of entities can indicate the fine-grained event types, e.g., Conflict.Attack should be Conflict.Attack.Bombing when the instrument is WEA.Bomb.
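A sketch of how the verb-, context-, and argument-based rules can cascade; the keyword tables contain only the examples named in the text, and the precedence order is our assumption:

```python
VERB_RULES = {("Conflict.Attack", "shooting"): "Conflict.Attack.FirearmAttack"}
CONTEXT_RULES = {("Contact.Meet", "funeral"): "Contact.FuneralVigil.Meet"}

def fine_grained_type(coarse, trigger, context, instrument_type=None):
    # argument-based rule: an instrument of type WEA.Bomb implies a bombing
    if coarse == "Conflict.Attack" and instrument_type == "WEA.Bomb":
        return "Conflict.Attack.Bombing"
    # verb-based rule keyed on the trigger word
    if (coarse, trigger) in VERB_RULES:
        return VERB_RULES[(coarse, trigger)]
    # context-based rule keyed on surrounding words
    for word in context.split():
        if (coarse, word) in CONTEXT_RULES:
            return CONTEXT_RULES[(coarse, word)]
    return coarse   # fall back to the coarse-grained type

print(fine_grained_type("Conflict.Attack", "shooting", "gunmen opened fire"))
print(fine_grained_type("Contact.Meet", "gathered", "a funeral was held"))
```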

For Russian and Ukrainian events, the main problem is the difficulty of keyword matching due to morphology. As a result, we generate multiple levels of translations using Google Translation4: the translation of the trigger, the translation of the sub-sentence containing the trigger, and the translation of the sentence. Then we apply the English system to these translations.

2.7 Event Coreference Resolution

We apply a graph-based algorithm (Al-Badrashiny et al., 2017; Zhang et al., 2018) for language-independent event coreference resolution. For each event type, we view event mentions as nodes in a graph, and undirected weighted edges between nodes represent the coreference confidence between the corresponding events. Then we apply hierarchical clustering to obtain event clusters. We train a Maximum Entropy binary classifier with the features listed in Table 4.

4 https://translate.google.com/

We use the English annotations from ERE (Getman et al., 2018) as training data.
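A sketch of the per-type clustering step, with SciPy's agglomerative clustering standing in for the hierarchical step; the pairwise confidences would come from the MaxEnt classifier, and the cut threshold is an assumption:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# pairwise coreference confidences between 4 event mentions of the same type
conf = np.array([[1.0, 0.9, 0.2, 0.1],
                 [0.9, 1.0, 0.3, 0.2],
                 [0.2, 0.3, 1.0, 0.8],
                 [0.1, 0.2, 0.8, 1.0]])

dist = 1.0 - conf                  # turn confidence into a distance
np.fill_diagonal(dist, 0.0)        # squareform requires a zero diagonal
Z = linkage(squareform(dist), method="average")
labels = fcluster(Z, t=0.5, criterion="distance")   # cut threshold is illustrative
print(labels)   # [1 1 2 2]: two event clusters
```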

2.8 Refinement with Human Hypotheses

A human hypothesis is a small manual knowledge graph, and we use it to refine the knowledge base constructed in TA1.a. There are three kinds of refinements: entity refinement, relation refinement, and event refinement.

The entities in the human hypothesis are linked to the background KB (LDC2019E43), and we propagate them to the TA1.a KB by adding new entities if an entity appears in the hypothesis but not in the TA1.a KB. To extract the mentions of these entities, we use the name of the entity in the background KB and also its translations in other languages, and conduct string matching after stemming. For nominal entities, such as "snipers shooting during Maidan", we extract the head word "snipers" for string matching and use "Maidan" as a constraint: "Maidan" should appear in the same document as the mention "snipers". A mention is extracted only when this constraint is satisfied. If an entity in the hypothesis is already extracted in the TA1.a KB, we compare its fine-grained types and trust the human hypothesis when there are conflicts. Furthermore, we use the human hypothesis to correct the entity linking results: we use the entity name and its translations to match the entities in the TA1.a KB, and trust the linking results in the human hypothesis when encountering conflicts.

Relation refinement has two stages.


| Features | Remarks (EM1: the first event mention, EM2: the second event mention) |
|---|---|
| type subtype match | 1 if the types and subtypes of the event nuggets match |
| trigger pair exact match | 1 if the spellings of triggers in EM1 and EM2 exactly match |
| distance between the word embeddings | quantized semantic similarity score (0-1) using pre-trained word embeddings |
| token dist | how many tokens between triggers of EM1 and EM2 (quantized) |
| argument match | argument roles that are associated with the same entities |
| argument conflict | number of argument roles that are associated with different entities |

Table 4: Event Coreference: Features for the Maximum Entropy classifier (Zhang et al., 2018)

The first stage is to rerun the system using the updated entities, which enriches the relations, especially those related to the augmented entities. The second stage is based on a second-stage binary classifier. For each relation in the human hypothesis, we extract the co-occurrence sentences of the two entities, and the binary classifier determines whether a sentence indicates a relation. If so, we trust the relation type in the human hypothesis, and the sentence serves as the justification.

Similar to relation refinement, event refinement also has two stages, with the rerunning as the first stage. In the second stage, we extract the sentences where the event type and arguments co-occur and develop rules to decide whether a co-occurrence sentence indicates the arguments of the event. The localization of event types in each sentence is based on the triggers extracted in the TA1.a KB. In the co-occurrence sentences of an event type and two arguments, if one argument is extracted by the system and the other one is not, we add the missing one to the system output.

3 TA1 Visual Knowledge Extraction

We first briefly review our previous Visual Knowledge Extraction (VKE) system (Zhang et al., 2018), and then describe the new extensions and improvements that form our new system. We designed the VKE system to complement the text KE system by extracting entities solely by their visual appearance, and then integrating them with the knowledge extracted from text to form a unified knowledge base. The VKE system consists of entity detection, entity recognition and entity coreference resolution, which are discussed in the rest of this section.

3.1 Entity Detection

The entity detection system consists of an ensemble of 5 object detectors, namely 3 Faster R-CNN (Ren et al., 2015) models trained on 3 different datasets, a weakly supervised CAM model (Zhou et al., 2016) for classes for which bounding box annotation is not available, and an MTCNN model (Zhang et al., 2016) specialized for face detection. Since these models are trained on various datasets with distinct but somewhat overlapping ontologies, we developed a post-processing algorithm (Zhang et al., 2018) to combine their results and map them to the unified AIDA ontology. Since TAC 2018, the program ontology has changed, so we update our class mapping, as well as the CAM model, to cover more classes.

The entity detection system is trained using a typical cross-entropy loss to classify each object into a variety of types selected from multiple ontologies. Among these are classes that follow a hierarchy. This causes a serious issue in producing a score for a given bounding box: fine-grained entity types are in conflict with their coarse-grained counterparts, resulting in artificially lower scores in several cases. For example, the score for a bounding box containing a "man" gets distributed across "man", "person", "boy", and "people". To alleviate this issue, we replace the cross-entropy loss, which produces a probability distribution across classes, with a binary cross-entropy loss, which allows independent distributions; hence the scores for the entity types are independent and each can be as high as the query type requires, similar to (Akiba et al., 2018).
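The fix amounts to replacing the softmax, which forces hierarchical labels to compete for probability mass, with independent per-class sigmoids. A small numeric illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# logits for the hierarchy-overlapping classes "man", "person", "boy", and "dog"
logits = np.array([2.0, 2.0, 2.0, -1.0])

print(softmax(logits))   # mass split across the hierarchy: ~0.33 for each of the first three
print(sigmoid(logits))   # independent scores: ~0.88 for every compatible class
```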

3.2 Entity Recognition

The entity recognition (a.k.a. entity linking) module was previously based on a face recognition pipeline that can recognize a predefined set of PER (person) entities and link their face bounding boxes (the visual equivalent of entity mentions) to their KB entries. This year we expanded that system to include other entity types such as GPE (geopolitical entity) and FAC (facility). Our face recognition system (Zhang et al., 2018) is based on FaceNet (Schroff et al., 2015).


Figure 2: Example landmark recognition results. Red circles denote local features that were matched to a reference image.

Figure 3: Example flag recognition results. Green represents correct prediction, blue represents wrong prediction, and grey represents missed detection by Faster R-CNN.

FaceNet trains a metric space that enables comparing the visual similarity of faces and determining whether a pair of faces belongs to the same identity. We use a similar strategy to train other specialized models for other entity types besides PER (person).

For instance, for GPE (geopolitical entity), we use Google image search to collect flag images from different countries and organizations (200 classes, 100 images each). We apply our entity detection system to localize flags in those images, and tightly crop the flags to train a classifier. We train an Inception-v4 model (Szegedy et al., 2017) with 200 classes. At test time, after entity detection we apply the classification model to the cropped region of each detected flag to classify the country and link to the KB. To evaluate, we manually labeled 69 cropped flags from a random portion of last year's evaluation data. Our model achieves a 73% F1 score and 72% accuracy.


Example results are shown in Figure 3.

To recognize FAC (facility), LOC (location), and ORG (organization) entities, we adopt the recently proposed DEep Local Feature (DELF) method (Noh et al., 2017), which is used for large-scale instance-level image matching. The model is pre-trained on the Google Landmarks (kag) dataset, which includes 15,000 landmarks across the world. We used the AIDA training data to select 500 scenario-relevant landmarks and buildings. More specifically, we collected entities that are annotated with one of the aforementioned types in the training data, or detected by the text entity extraction (Section 2.2). Then, we use each retrieved entity mention to find related images using Google image search. We use Faster R-CNN on the retrieved images to detect landmark instances and remove images without any landmark. Then we select entities that have sufficient images (100 per entity).

We use those images as prototypes for each entity, and given a test image, we use RANdom SAmple Consensus (RANSAC) (Fischler and Bolles, 1981) to align its DELF local features with each prototype. We assign the test image to the best-matching entity, i.e., the prototype with the most aligned local features. This system achieves an F1 score of 0.875 on a random portion of last year's evaluation data. Example results are illustrated in Figure 2.

3.3 Cross-Modal Entity Coreference

The entity coreference system is used to create a coherent knowledge graph from individually detected entities. As described in (Zhang et al., 2018), entity coreference has two submodules: intra-modal and cross-modal. The intra-modal coreference module connects visually similar or related entities, such as a person's face that appears in multiple images. The cross-modal coreference module links entities mentioned in text to entities appearing in images and video keyframes. This year, we significantly improved our cross-modal coreference module by incorporating a multi-layer attention mechanism, as described below.

We advanced our previous visual grounding system (Zhang et al., 2018) by introducing a novel multi-level attention mechanism. As illustrated in Figure 4, our system extracts a multi-level visual feature map for each image in a document, that is, a set of grids of feature vectors, where each grid has a certain granularity and each cell of a grid is a feature vector representing an image region. On the other hand, our network takes each sentence of the document and represents each word using a language model. Then for each word (or phrase, entity mention, etc.) we compute an attention map to every level and every location of the feature map. This way, we can choose the level that most strongly matches the semantic query, and within that level, the attention map can be used to localize the query. The multi-level aspect of attention is important, because some queries may convey low-level information such as object types (e.g., person) while other queries may refer to high-level semantics such as protest. Higher-level semantics tend to be reflected in the later layers of convolutional nets, which means the selection of an appropriate layer for query localization is important. Our cross-modal coreference system outperforms the state of the art on various visual grounding benchmarks. More details are discussed in (Akbari et al., 2019).

Furthermore, this year we added a complementary visual grounding method to our pipeline to boost recall. Given the bounding boxes and their predicted classes from the above approach, we perform grounding by matching the text with the names of the predicted classes. We use GloVe (Pennington et al., 2014) word similarities with a threshold of 0.75 to obtain the relevant bounding boxes.
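A sketch of this recall-oriented matcher, assuming pre-loaded GloVe vectors keyed by word; cosine similarity between the query word and a detection's class name decides whether the corresponding box is returned. The 0.75 threshold follows the text; the loader and variable names are illustrative.

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe .txt file into a {word: vector} dict."""
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *vals = line.rstrip().split(" ")
            vecs[word] = np.asarray(vals, dtype=np.float32)
    return vecs

def ground_by_class_name(query, detections, glove, threshold=0.75):
    """Return boxes whose predicted class name is GloVe-similar to the query.
    detections: list of (box, class_name) pairs from the detector."""
    def sim(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    q = glove.get(query.lower())
    if q is None:
        return []
    return [box for box, cls in detections
            if cls.lower() in glove and sim(q, glove[cls.lower()]) >= threshold]
```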

4 TA2

The goal of the TA2 clustering work is to identify entities and events across documents. The primary challenge is that the entities and events extracted by the TA1 modules contain few attributes. For entities, name and type are always present. About 20% of the entities contain a link to an external knowledge base (Freebase), and some entities also contain subtypes (e.g., politician). A secondary challenge is that the names are present in three languages (English, Russian and Ukrainian), and extractions are noisy.

To address the sparsity of information, our approach is to enrich the information present in each entity using information from an external knowledge graph. For this purpose we use Wikidata, as it contains a large number of entities, including entities that do not satisfy the notability criteria of Wikipedia.


Figure 4: An overview of our cross-modal coreference (visual grounding) system.

Wikidata contains over 60 million entities, compared to the roughly 5 million entities present in the English Wikipedia. A secondary benefit of Wikidata is that it is a multilingual knowledge graph, containing labels and aliases for entities in multiple languages, including Russian and Ukrainian.

Our entity enrichment approach leverages the links to the external knowledge graph. Wikidata entities contain Freebase identifiers, so our algorithm first maps the Freebase identifiers to Wikidata identifiers to link TA1 entities to Wikidata entities. Then the algorithm forms entity clusters by collecting into a cluster all entities that link to the same Wikidata entity. This initial set of clusters contains only those TA1 entities for which an external link is present.
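Freebase identifiers are stored in Wikidata under property P646, so one way to implement this mapping is a single SPARQL query against the public Wikidata endpoint. The sketch below is illustrative only; batching, caching, and error handling are omitted.

```python
import requests

WDQS = "https://query.wikidata.org/sparql"

def freebase_to_wikidata(freebase_ids):
    """Map Freebase mids (e.g. '/m/02mjmr') to Wikidata QIDs via P646."""
    values = " ".join(f'"{mid}"' for mid in freebase_ids)
    query = f"""
    SELECT ?mid ?item WHERE {{
      VALUES ?mid {{ {values} }}
      ?item wdt:P646 ?mid .
    }}"""
    resp = requests.get(WDQS, params={"query": query, "format": "json"})
    resp.raise_for_status()
    mapping = {}
    for row in resp.json()["results"]["bindings"]:
        qid = row["item"]["value"].rsplit("/", 1)[-1]  # e.g. 'Q76'
        mapping[row["mid"]["value"]] = qid
    return mapping
```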

The second step of the algorithm uses the enriched set of entity labels to add to clusters those entities whose labels match a label of an entity already present in the cluster. To compare labels we use Jaccard similarity with a 0.4 threshold. Our experiments revealed that a high threshold produces good results, as it reduces false positives. The expectation is that after label enrichment the likelihood of a close match is increased. When a TA1 entity cannot be added to an existing cluster, it forms a new cluster.
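A sketch of this label-matching pass, assuming each cluster carries the (Wikidata-enriched) label set of its members; token-level Jaccard similarity with the 0.4 threshold from above decides membership. The data layout is an illustrative assumption.

```python
def jaccard(a: set, b: set) -> float:
    """Token-set Jaccard similarity."""
    return len(a & b) / len(a | b) if a | b else 0.0

def assign(entity_labels, clusters, threshold=0.4):
    """Attach an entity to the first cluster with a matching label,
    comparing token sets; otherwise start a new cluster."""
    tokens = [set(label.lower().split()) for label in entity_labels]
    for cluster in clusters:
        if any(jaccard(t, set(c.lower().split())) >= threshold
               for t in tokens for c in cluster["labels"]):
            cluster["labels"].update(entity_labels)
            return cluster
    new = {"labels": set(entity_labels)}
    clusters.append(new)
    return new
```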

5 HypoGator: Alternative Hypotheses Generation and Ranking

HypoGator is the University of Florida hypothesis generation system. It uses a search-score-rank approach to find alternative answers to complex queries over the automatically extracted TA2 knowledge graph. In a nutshell, HypoGator decomposes a complex graph query into subqueries of simple subgraph patterns. For each subquery, its entry points are matched into the TA2 knowledge graph, and their local context generates candidate answers. Candidates are scored and ranked using multiple features that are indicative of coherence and relevance. A join algorithm combines the answers from each atomic subquery and re-scores the final set of answers using features that encourage answer cohesion. Figure 5 shows the architecture of the proposed system. In this section we describe the core components of HypoGator.

5.1 Query Processing

A statement of information need is a subgraph pattern with event/relation types and entities as nodes, event/relation argument roles as edges, and a set of grounded entities known as entry points. We classify an information need as simple if each entry point is used as the argument of only one event/relation. In contrast, a complex information need has entry points that are shared by multiple events/relations, forming a star-like structure. HypoGator's query processing module first scans an information need and decomposes a complex information need into multiple simple information needs that we refer to as atomic. The decomposition algorithm first finds all connected components in the information need; for each component, the algorithm visits the neighbors of its entry points and traverses each of them until a different entry point or a terminal node is found. The resulting subgraphs are added to the atomic query list.
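A minimal sketch of this decomposition, under stated assumptions: the information need is a networkx graph and the entry-point nodes are given as a set. The traversal below stops, as described above, when it reaches another entry point or a terminal (degree-1) node.

```python
import networkx as nx

def decompose(sin: nx.Graph, entry_points: set):
    """Split a complex information need into atomic subgraph queries."""
    atomic = []
    for component in nx.connected_components(sin):
        sub = sin.subgraph(component)
        for ep in entry_points & component:
            nodes, frontier = {ep}, list(sub.neighbors(ep))
            while frontier:
                n = frontier.pop()
                if n in nodes:
                    continue
                nodes.add(n)
                # Stop expanding at other entry points and at terminal nodes.
                if n not in entry_points and sub.degree(n) > 1:
                    frontier.extend(sub.neighbors(n))
            atomic.append(sub.subgraph(nodes).copy())
    return atomic
```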


Figure 5: HypoGator System Architecture

Figure 6 shows the statement of information need E102, with entry points Odessa (a.k.a. Odesa) and Trade Unions House (a.k.a. Trade Unions Building) in the center, and the atomic queries derived from it using the decomposition algorithm around it.

After query decomposition, HypoGator matches entry points into the TA2 knowledge graph. First, HypoGator matches entry points that have background KB ids directly to the corresponding entity mention id in the TA2 KG. If the background KB id cannot be matched, or the query does not have one, HypoGator uses the entry point name string to match TA2 entity mention nodes using common string similarity metrics. Finally, HypoGator attempts to match entry points using the provenance offset.

5.2 Candidate Hypothesis Generation

To generate relevant hypotheses for an atomic query, HypoGator explores the neighborhood of the entry point mention in the knowledge graph using the A* algorithm. The result is a set of paths up to a fixed length that serve as the backbone structure for candidate hypotheses. While the resulting paths are rooted at entity nodes, they also visit event and relation nodes; to generate full candidate hypotheses, HypoGator expands each path by adding all arguments of the events and relations visited along it.
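The sketch below approximates this step only loosely: the actual system uses A* search with a path-length bound, whereas here the exploration is simplified to networkx's bounded simple-path enumeration, which does not scale and is purely illustrative. Node attributes ("kind") are an assumed schema.

```python
import networkx as nx

def candidates(kg: nx.Graph, entry, max_len=4):
    """Generate candidate hypothesis subgraphs rooted at an entry point."""
    hypotheses = []
    for target in kg.nodes:
        if target == entry:
            continue
        for path in nx.all_simple_paths(kg, entry, target, cutoff=max_len):
            nodes = set(path)
            # Expansion: pull in all arguments of visited events/relations.
            for n in path:
                if kg.nodes[n].get("kind") in ("event", "relation"):
                    nodes.update(kg.neighbors(n))
            hypotheses.append(kg.subgraph(nodes).copy())
    return hypotheses
```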

5.3 Coherence Feature Extraction

Our candidate generation module ensures that the generated candidate hypotheses include the entry points. However, this does not guarantee that they are fully relevant to the query at hand. Moreover, the candidates need to be pruned if they are not logically or semantically coherent. Another important factor determining the quality of a candidate hypothesis is the validity confidence of each of its knowledge elements, whether they come from the document sources (extraction confidence), from inference (inference confidence), or from TA2 clustering.

We use a variety of features to measure each hypothesis's semantic coherence, logical coherence, and degree of relevance to the query. We use an aggregation method to obtain an overall confidence score from each knowledge element's confidence. For example, we use an ensemble of graph distance functions to measure query relevance, and a set of predefined logical rules to detect logical inconsistencies. The overall score for each hypothesis is computed as a linear combination of the individual feature scores. We use the LDC labeled data to learn appropriate weights for each feature, or use reasonable hand-crafted weights.

5.4 Ranking and Selection

While we have multiple features, each of which scores the hypothesis for some important consistency or coherence property, we need a condensed score that gives full quantitative significance to a hypothesis and can therefore be used to rank candidates. We use a simple approach to aggregate the different scores: a weighted sum of the feature values. We manually set the weights according to what we believe are the more salient features of a hypothesis, and plan to learn these weights in future work.
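The aggregation in Sections 5.3 and 5.4 reduces to a dot product between feature values and hand-set weights. A minimal sketch follows; the feature names and weight values are illustrative, not the deployed ones.

```python
WEIGHTS = {  # hand-set weights; illustrative values only
    "query_relevance": 0.4,
    "semantic_coherence": 0.25,
    "logical_coherence": 0.25,
    "confidence": 0.1,
}

def score(features: dict) -> float:
    """Weighted sum of per-hypothesis feature scores."""
    return sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)

def rank(hypotheses):
    """Order candidate hypotheses by their condensed score, best first."""
    return sorted(hypotheses, key=lambda h: score(h["features"]), reverse=True)
```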


Figure 6: Query Decomposition E102

5.5 Hypotheses Clustering

Due to the nature of AIDA's data (e.g., multiple documents about the same hypothesis), it is possible to have multiple subgraphs representing the same hypothesis. Our system uses subgraph clustering to prune duplicate hypotheses. Because we focus on hypotheses and work with a smaller data size than TA2 does, we can afford more expensive subgraph-level clustering. We compute a similarity score for each pair of generated subgraph hypotheses; entity features, TA2's alignments, and the graph structure are used to compute this score. Finally, we use spectral clustering to cluster the set of hypotheses into K clusters.
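A sketch of the deduplication step, assuming a precomputed pairwise similarity matrix over the candidate subgraphs; scikit-learn's SpectralClustering accepts such a matrix directly. The per-hypothesis "score" field is an assumed attribute.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def dedupe(hypotheses, similarity: np.ndarray, k: int):
    """Cluster hypothesis subgraphs by pairwise similarity; keep one per cluster."""
    labels = SpectralClustering(n_clusters=k,
                                affinity="precomputed").fit_predict(similarity)
    representatives = {}
    for h, lbl in zip(hypotheses, labels):
        # Keep the highest-scoring member of each cluster as its representative.
        if lbl not in representatives or h["score"] > representatives[lbl]["score"]:
            representatives[lbl] = h
    return list(representatives.values())
```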

5.6 Hypotheses Composition

After all candidate hypotheses are scored, deduplicated, and ranked for each atomic part, we use a join algorithm to combine atomic hypotheses. The algorithm generates all possible combinations of candidate atomic answers. We set the value of K in the clustering step such that computing the cross product of atomic hypotheses remains feasible. Every resulting combination is deemed a full hypothesis. Full hypotheses are re-scored using the same set of features; in addition, a new score is given based on how far each connected component is from the others in the same hypothesis. The top resulting hypotheses are selected as final candidates and further deduplicated by clustering. The representatives of each cluster are returned as the final hypotheses.

5.7 Hypothesis Enrichment

Based on observations during development, we found that the hypotheses generated by HypoGator matched, totally or partially, the core elements found in the statement of information need, but lacked a mechanism to add context to the answer. For example, an information need may ask about the death of a politician; HypoGator may be able to find the correct Life.Die event, but the hypothesis would not contain contextual information such as who the killer worked for, or the details of the weapon used in the assault. Therefore, we increase the size of the hypothesis by including entity neighbors from the background knowledge base. Since the random addition of neighbors may introduce noise into the hypothesis, we only add context to certain entities of interest, e.g., entities of type PER. The enrichment method also allows us to select what kinds of contextual relations are relevant, e.g., relations of type AFFILIATION, RESPONSIBILITY, DELIBERATENESS, etc. Context expansion increases the reach of our hypotheses while keeping the core set of matched elements intact.


6 Hypothesis Generation at ISI

TA2 generates a full Knowledge Graph (KG). Given such a KG and a file containing the Statement of Information Need (SIN) that identifies a topic in the KG, the steps required to generate the top K hypotheses in TA3 are as follows.

Retrieve the knowledge elements relevant to the Entry Points. We convert an XML file containing the SIN to SPARQL and perform a query on the full KG generated by TA2.

Convert the full KG to data in an internal format. The data in this internal format only contains information about the knowledge elements that are within a user-specified number of hops from the Entry Points and that will be used for reasoning, such as the arguments of Events and Relations. Each of these arguments has a confidence value.

Generate ontological constraints. Based on the LDCOntology, we generate two types of ontological constraints: (a) domain constraints, and (b) cardinality constraints. For each type of Event or Relation, each argument can accept only specific types of objects. Such constraints qualify as domain constraints. At a practical level, sufficient reasoning done by TA1 and TA2 already minimizes the number of domain constraints to be added by TA3. The cardinality constraints, often implemented as mutual exclusion constraints, can be added for two different reasons. First, for some types of Events or Relations, some of their arguments can only have a limited number of distinct objects. For example, given an Event Life.Born, the argument Life.Born Time can have only one object. Second, each Entity can serve the same role only in a limited number of Events or Relations. For example, each Person Entity can serve as the argument Person in only one Life.Born Event.

Represent the new data as weighted constraints. In our approach, assertions in the KG are not automatically deemed to be true in a hypothesis, because they have confidence values associated with them that have to be reasoned about jointly with the ontological constraints. Instead, a hypothesis fixes the truth value of each assertion, i.e., it is a subset of the full KG. We therefore treat each assertion as a Boolean variable with domain {0, 1}, where '0' and '1' represent 'False' and 'True', respectively. For each Boolean variable v corresponding to an assertion with confidence value p, we add a unary weighted constraint with weights -log(1 - p) and -log(p) against the assignments v = 0 and v = 1, respectively. For each mutual exclusion constraint c between two variables v1 and v2, we add a binary weighted constraint with weights -log(p/2), -log(1 - p/2), -log(1 - p/2), and -log(p/2) against the assignments (v1 = 0, v2 = 0), (v1 = 0, v2 = 1), (v1 = 1, v2 = 0), and (v1 = 1, v2 = 1), respectively.
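A sketch of the weight construction, transcribing the formulas above directly; it assumes confidence values lie strictly inside (0, 1) so that the logarithms stay finite.

```python
import math

def unary_weights(p: float):
    """Weights against v = 0 and v = 1 for an assertion with confidence p."""
    return {0: -math.log(1 - p), 1: -math.log(p)}

def mutex_weights(p: float):
    """Binary weights for a mutual-exclusion constraint with confidence p,
    indexed by the joint assignment (v1, v2)."""
    return {(0, 0): -math.log(p / 2),
            (0, 1): -math.log(1 - p / 2),
            (1, 0): -math.log(1 - p / 2),
            (1, 1): -math.log(p / 2)}
```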

Solve an instance of the Weighted Constraint Satisfaction Problem (WCSP) defined on the weighted constraints. We use the Toulbar2 solver to solve the WCSP instance. This solver finds an optimal assignment of values to all Boolean variables that minimizes the total weight. This optimal assignment is the top hypothesis base of the original problem. To generate the kth hypothesis base, for 1 <= k <= K, we construct a new WCSP instance that includes all the weighted constraints described above, as well as additional constraints that disallow the previously generated k - 1 hypothesis bases. Each hypothesis base is consistent with the ontological constraints.

Filter edges relevant to the SIN and construct hypotheses. We generate the kth final hypothesis from the kth hypothesis base by retaining only those edges that are relevant to the SIN. This relevance filtering relies on the popular Word2Vec model from Natural Language Processing. For each Event or Relation, we calculate similarity scores between its type and the Event types specified in the SIN. We then retain only the M Events and Relations with the highest similarity scores.

Score the hypotheses. The total score of a hypothesis is the average of its relevance score and its consistency score. In turn, the relevance score of a hypothesis is the average similarity score over all Events and Relations in it, as derived from the Word2Vec model in the previous step. The consistency score of a hypothesis is e^(-w), where w is the total weight of the optimal assignment to the WCSP instance that defines its base. This score inverts the transformation rule used to define the weighted constraints from confidence values.
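Putting the two parts together, the final score is a plain average of relevance and consistency; a one-screen sketch:

```python
import math

def hypothesis_score(similarities, wcsp_weight):
    """Average of relevance (mean Word2Vec similarity over the hypothesis's
    Events and Relations) and consistency (exp(-w) for the optimal WCSP
    assignment weight w)."""
    relevance = sum(similarities) / len(similarities)
    consistency = math.exp(-wcsp_weight)
    return (relevance + consistency) / 2
```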

Acknowledgments

This work was supported by the U.S. DARPA AIDA Program No. FA8750-18-2-0014. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

References

Google landmark recognition challenge.

Hassan Akbari, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, and Shih-Fu Chang. 2019. Multi-level multimodal common semantic space for image-phrase grounding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12476-12486.

Takuya Akiba, Tommi Kerola, Yusuke Niitani, Toru Ogawa, Shotaro Sano, and Shuji Suzuki. 2018. PFDet: 2nd place solution to Open Images Challenge 2018 object detection track. arXiv preprint arXiv:1809.00778.

Mohamed Al-Badrashiny, Jason Bolton, Arun Tejasvi Chaganty, Kevin Clark, Craig Harman, Lifu Huang, Matthew Lamm, Jinhao Lei, Di Lu, Xiaoman Pan, et al. 2017. TinkerBell: Cross-lingual cold-start knowledge base construction. In TAC.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Herve Jegou. 2017. Word translation without parallel data. arXiv preprint arXiv:1710.04087.

Xiaocheng Feng, Lifu Huang, Duyu Tang, Bing Qin, Heng Ji, and Ting Liu. 2016. A language-independent neural network for event detection. In The 54th Annual Meeting of the Association for Computational Linguistics, page 66.

Xiaocheng Feng, Bing Qin, and Ting Liu. 2018. A language-independent neural network for event detection. Science China Information Sciences, 61(9):092106.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 363-370. Association for Computational Linguistics.

Martin A. Fischler and Robert C. Bolles. 1981. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381-395.

Jeremy Getman, Joe Ellis, Stephanie Strassel, Zhiyi Song, and Jennifer Tracey. 2018. Laying the groundwork for knowledge base population: Nine years of linguistic resources for TAC KBP. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018).

Lifu Huang, Jonathan May, Xiaoman Pan, Heng Ji, Xiang Ren, Jiawei Han, Lin Zhao, and James A. Hendler. 2017. Liberal entity extraction: Rapid construction of fine-grained entity typing systems. Big Data, 5.

Heng Ji, Ralph Grishman, Dayne Freitag, Matthias Blume, John Wang, Shahram Khadivi, Richard Zens, and Hermann Ney. 2009. Name extraction and translation for distillation. Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746-1751.

M. V. Kopotev and L. A. Ianda. 2006. Natsionalny korpus russkogo iazyka [The Russian National Corpus].

Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end neural coreference resolution. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 188-197. Association for Computational Linguistics.

Ying Lin, Liyuan Liu, Heng Ji, Dong Yu, and Jiawei Han. 2019. Reliability-aware dynamic feature composition for name tagging. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 165-174, Florence, Italy. Association for Computational Linguistics.

Xiao Ling, Sameer Singh, and Daniel Weld. 2015. Design challenges for entity linking. Transactions of the Association for Computational Linguistics, 3.

Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55-60.

O. Medelyan and C. Legg. 2008. Integrating Cyc and Wikipedia: Folksonomy meets rigorously defined common-sense. In Proc. AAAI 2008 Workshop on Wikipedia and Artificial Intelligence.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.

Joakim Nivre et al. 2019. Universal Dependencies 2.4. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (UFAL), Faculty of Mathematics and Physics, Charles University.

Hyeonwoo Noh, Andre Araujo, Jack Sim, Tobias Weyand, and Bohyung Han. 2017. Large-scale image retrieval with attentive deep local features. In Proceedings of the IEEE International Conference on Computer Vision, pages 3456-3465.

Franz Josef Och. 1999. An efficient method for determining bilingual word classes. In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics, pages 71-76. Association for Computational Linguistics.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Xiaoman Pan, Taylor Cassidy, Ulf Hermjakob, Heng Ji, and Kevin Knight. 2015. Unsupervised entity linking with Abstract Meaning Representation. In Proc. the 2015 Conference of the North American Chapter of the Association for Computational Linguistics, Human Language Technologies (NAACL-HLT 2015).

Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. Cross-lingual name tagging and linking for 282 languages. In Proc. the 55th Annual Meeting of the Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99.

Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815-823.

Milan Straka and Jana Strakova. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88-99, Vancouver, Canada. Association for Computational Linguistics.

Ananya Subburathinam, Di Lu, Heng Ji, Jonathan May, Shih-Fu Chang, Avirup Sil, and Clare Voss. Cross-lingual structure transfer for relation and event extraction.

Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. 2017. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence.

Robert L. Taft. 1970. Name Search Techniques. New York State Identification and Intelligence System, Albany, New York, US.

Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. LINE: Large-scale information network embedding. In WWW. ACM.

Theresa Wilson. 2008. Fine-grained subjectivity and sentiment analysis: Recognizing the intensity, polarity, and attitudes of private states.

Dian Yu, Xiaoman Pan, Boliang Zhang, Lifu Huang, Di Lu, Spencer Whitehead, and Heng Ji. 2016. RPI BLENDER TAC-KBP2016 system description.

Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. 2016. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499-1503.

Tongtao Zhang, Heng Ji, and Avirup Sil. 2019. Joint entity and event extraction with generative adversarial imitation learning. Data Intelligence, 1(2):99-120.

Tongtao Zhang, Ananya Subburathinam, Ge Shi, Lifu Huang, Di Lu, Xiaoman Pan, Manling Li, Boliang Zhang, Qingyun Wang, Spencer Whitehead, et al. 2018. GAIA - a multi-media multi-lingual knowledge extraction and hypothesis generation system. In Proceedings of TAC KBP 2018, the 25th International Conference on Computational Linguistics: Technical Papers.

Ruiqi Zhong, Steven Shao, and Kathleen McKeown. 2019. Fine-grained sentiment analysis with faithful attention. arXiv, abs/1908.06870.

Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. 2016. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921-2929.