HITS’ Cross-lingual Entity Linking
System at TAC 2011
One Model for All Languages
Angela Fahrni, Michael Strube, Vivi Nastase
Heidelberg Institute for Theoretical Studies gGmbH, Heidelberg, Germany
Strategies for Cross-lingual Entity Disambiguation
Knowledge base
Mapping between the English and Chinese Wikipedia Versions
Coverage after Lexicon Lookup
Training Set   Coverage   Average Ambiguity
EN             0.928      16.92
ZH             0.867       5.56
ZH_Lex         0.930      22.86
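Both figures can be computed from the candidate lexicon alone; a minimal sketch, assuming a dict `lexicon` mapping surface strings to candidate concept sets and gold (term, entity) pairs (hypothetical names):

```python
def coverage_and_ambiguity(lexicon, gold_pairs):
    """Coverage: fraction of queries whose correct entity is among
    the lexicon candidates for the query term. Average ambiguity:
    mean number of candidates over terms found in the lexicon."""
    covered, counts = 0, []
    for term, gold_entity in gold_pairs:
        candidates = lexicon.get(term, set())
        if gold_entity in candidates:
            covered += 1
        if candidates:
            counts.append(len(candidates))
    return covered / len(gold_pairs), sum(counts) / len(counts)
```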
Different Techniques for Different Languages
Different Models for Different Languages
One Model for All Languages
Context Disambiguation Approach
• Training of a supervised model (SVM) using English training instances
• Training instances are derived from the internal hyperlinks in Wikipedia (see the sketch after this list)
  • Positive instances: linked Wikipedia articles
  • Negative instances: other Wikipedia articles the terms can refer to
• Model is used for English and Chinese
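A minimal sketch of this instance derivation, assuming two hypothetical helpers: `iter_hyperlinks()` yields (anchor text, linked article) pairs from Wikipedia's internal links, and `candidate_articles(text)` returns all articles the text can refer to according to the lexicon:

```python
def derive_training_instances(iter_hyperlinks, candidate_articles):
    """Turn Wikipedia-internal hyperlinks into labeled instances:
    the actually linked article is a positive instance for its
    anchor text; every other candidate article is a negative one."""
    instances = []
    for anchor_text, linked_article in iter_hyperlinks():
        for candidate in candidate_articles(anchor_text):
            label = 1 if candidate == linked_article else 0
            instances.append((anchor_text, candidate, label))
    return instances
```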
Entity Disambiguation
Entity Disambiguation Approach
• Training instances are derived from the training data provided by TAC
  • Positive instances: correct entity for the query terms
  • Negative instances: other candidate entities for the query terms
• Training of a supervised model (SVM) based on
  • English and Chinese training instances
  • Chinese training instances
  • English instances
• Application of the models to both languages (see the sketch after this list)
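A minimal sketch of the three training regimes, assuming feature matrices `X_en`, `X_zh` and label vectors `y_en`, `y_zh` extracted beforehand from the TAC training data (hypothetical names; scikit-learn's SVC stands in for the actual SVM implementation):

```python
import numpy as np
from sklearn.svm import SVC

def train_regimes(X_en, y_en, X_zh, y_zh):
    """Train the three variants compared in the slides; each model
    is later applied to queries in both languages."""
    return {
        "EN_ZH": SVC().fit(np.vstack([X_en, X_zh]),
                           np.concatenate([y_en, y_zh])),
        "ZH": SVC().fit(X_zh, y_zh),
        "EN": SVC().fit(X_en, y_en),
    }
```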
One-Model-for-All-Languages: Features
• Abstraction from particular languages, i.e. no lexical features
• Relatedness measures (see the formula after this list)
• Concept-based features
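The slides do not specify which relatedness measures are used; a common link-based choice for Wikipedia, given here purely as an illustration, is Milne and Witten's link-based measure. With $A$ and $B$ the sets of articles linking to concepts $a$ and $b$, and $W$ the set of all Wikipedia articles:

\[
\mathrm{rel}(a,b) \;=\; 1 \;-\; \frac{\log\bigl(\max(|A|,|B|)\bigr) - \log\bigl(|A \cap B|\bigr)}{\log|W| - \log\bigl(\min(|A|,|B|)\bigr)}
\]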
One-Model-for-All-Languages: Features
• Prior probability
• Distance between term and name of the respective Wikipedia article
• Context fit on conceptual level
  • Vector-based features based on concepts, categories, lists, portals
  • Average, maximum and minimum derived from pairwise calculated relatedness measures
  • Relatedness measures based on outgoing links, incoming links, categories
• Context fit on token level: cosine similarity based on nouns, verbs, adjectives between (see the sketch after this list):
  • Whole text and respective Wikipedia article
  • Surrounding context and local context of hyperlinks pointing to the respective page in Wikipedia
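A minimal sketch of the token-level context fit, assuming the nouns, verbs and adjectives of both texts have already been extracted into token lists (the POS tagging itself is omitted):

```python
from collections import Counter
import math

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bags of content words
    (nouns, verbs, adjectives), using raw term frequencies."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```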
Results: Micro-Average, Different Training Approaches
                     Micro-Average
HITS_Trained_EN_ZH   0.785
HITS_Trained_SEP     0.789
HITS_Trained_ZH      0.786
Language-independent Concept-based Representation
Graph
Clustering
Learning the Edge Weights
• Binary classifier (SVM) with the classes:
  • In same cluster
  • Not in the same cluster
• Confidence values as edge weights (see the sketch after this list)
• Features: Cosine similarities between local contexts and whole texts based on:
  • Identified concepts
  • Identified concepts extended by incoming and outgoing links
  • Categories associated with the concepts
  • Lists associated with the concepts
  • Portals associated with the concepts
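A minimal sketch of turning classifier confidences into edge weights, assuming a hypothetical `pair_features(m1, m2)` helper that computes the cosine-similarity features listed above, and using scikit-learn's SVC with probability estimates as a stand-in for the actual classifier:

```python
import itertools
from sklearn.svm import SVC

def build_weighted_graph(mentions, pair_features, X_train, y_train):
    """Train the same-cluster classifier (label 1 = same cluster),
    then weight each edge between two mentions by the classifier's
    confidence that they belong to the same cluster."""
    clf = SVC(probability=True).fit(X_train, y_train)
    edges = {}
    for m1, m2 in itertools.combinations(mentions, 2):
        prob_same = clf.predict_proba([pair_features(m1, m2)])[0][1]
        edges[(m1, m2)] = prob_same
    return edges
```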
Shortcomings of Graph-based Clustering Approach
• Construction of the graph is inefficient
  • Pairwise comparisons using different similarity measures
• Little data to optimize the parameters
• Temporary solution: string match heuristic (see the sketch after this list)
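A minimal sketch of such a string match heuristic, grouping queries whose normalized surface strings coincide (the normalization shown is an illustrative assumption, not necessarily what the system used):

```python
from collections import defaultdict

def string_match_clusters(queries):
    """Cluster queries by exact match on a normalized surface
    string: lowercased, with whitespace collapsed."""
    clusters = defaultdict(list)
    for query_id, surface_string in queries:
        key = " ".join(surface_string.lower().split())
        clusters[key].append(query_id)
    return list(clusters.values())
```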
Results
              Micro-Average   Precision (B³)   Recall (B³)   F1 (B³)
Best System   0.788           –                –             –
Median        0.675           –                –             –
HITS          0.785           0.700            0.763         0.730
Future Work
• Experiments with other languages
  • We participated in the cross-lingual link discovery task organized by NTCIR: good results for all subtasks (EN-to-Chinese, EN-to-Korean, EN-to-Japanese)
• Efficient way to cluster terms: Circumventing pairwise comparisons