
HITS’ Cross-lingual Entity Linking System at TAC 2011

One Model for All Languages

Angela Fahrni, Michael Strube, Vivi Nastase

Heidelberg Institute for Theoretical Studies gGmbH, Heidelberg, Germany


Strategies for Cross-lingual Entity Disambiguation

Knowledge base

Mapping between the English and Chinese Wikipedia Versions

Coverage after Lexicon Lookup

Training Set   Coverage   Average Ambiguity
EN             0.928      16.92
ZH             0.867       5.56
ZH_Lex         0.930      22.86
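To make the trade-off in the table concrete, here is a minimal Python sketch of lexicon-augmented candidate lookup, under the assumption that the ZH_Lex condition simply unions additional lexicon entries into the candidate set; the dictionaries and the example term are illustrative, not system data.

```python
# Sketch of the ZH_Lex condition: merging an extra lexicon into candidate
# lookup raises coverage (0.867 -> 0.930 on the training set) but also
# raises the average ambiguity (5.56 -> 22.86 candidates per term).
base = {"海德堡": {"Heidelberg"}}
lexicon = {"海德堡": {"Heidelberg", "Heidelberg_University"}}  # toy entries

def candidates(term):
    """Union of base candidates and lexicon-derived candidates."""
    return base.get(term, set()) | lexicon.get(term, set())

print(candidates("海德堡"))  # {'Heidelberg', 'Heidelberg_University'}
```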


Different Techniques for Different Languages


Different Models for Different Languages


One Model for All Languages


Context Disambiguation Approach

• Training of a supervised model (SVM) using English training instances
• Training instances are derived from the internal hyperlinks in Wikipedia
  • Positive instances: linked Wikipedia articles
  • Negative instances: other Wikipedia articles the terms can refer to
• The model is used for both English and Chinese (see the sketch below)
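A minimal, self-contained sketch of this training regime using scikit-learn; `candidates`, `toy_features`, and `make_instances` are hypothetical names, and the feature values are placeholders rather than the system's real features.

```python
# Sketch: one Wikipedia hyperlink ("Java" linked to
# Java_(programming_language)) yields one positive instance; the other
# articles "Java" can refer to yield negative instances.
from sklearn.svm import SVC

candidates = {
    "Java": ["Java_(programming_language)", "Java_(island)", "Java_(coffee)"],
}

def toy_features(term, article):
    # Placeholder for the real language-independent features.
    return [len(article) % 7 / 7.0, hash((term, article)) % 100 / 100.0]

def make_instances(term, gold_article):
    """Positive instance for the linked article, negatives for the rest."""
    X, y = [], []
    for article in candidates[term]:
        X.append(toy_features(term, article))
        y.append(1 if article == gold_article else 0)
    return X, y

X, y = make_instances("Java", "Java_(programming_language)")
model = SVC().fit(X, y)            # trained on English instances only
print(model.decision_function(X))  # the same model scores EN and ZH input
```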


Entity Disambiguation


Entity Disambiguation Approach

• Training instances are derived from the training data provided by TAC
  • Positive instances: correct entity for the query terms
  • Negative instances: other candidate entities for the query terms
• Training of a supervised model (SVM) based on
  • English and Chinese training instances
  • Chinese training instances
  • English training instances
• Application of the models to both languages (see the sketch below)
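A hedged sketch of the pooled EN+ZH variant: one SVM trained on instances from both languages and used to rank a query's candidate entities by margin. All feature values are placeholders.

```python
# Sketch: one SVM trained on pooled English + Chinese TAC instances and
# applied to queries in either language.
from sklearn.svm import LinearSVC

train_X = [[0.9, 0.8], [0.1, 0.3],   # instances from English queries
           [0.7, 0.9], [0.2, 0.1]]   # instances from Chinese queries
train_y = [1, 0, 1, 0]               # 1 = correct entity for the query
model = LinearSVC().fit(train_X, train_y)

def best_candidate(candidate_feats):
    """Pick the candidate entity with the largest SVM margin."""
    scores = model.decision_function(candidate_feats)
    return max(range(len(scores)), key=scores.__getitem__)

print(best_candidate([[0.8, 0.7], [0.3, 0.2]]))  # -> 0
```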


One-Model-for-All-Languages: Features

• Abstraction from particular languages, i.e. no lexical features
• Relatedness measures (one possible instance is sketched below)
• Concept-based features
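The slides do not spell out which relatedness measures are used; as one plausible instance, here is a link-based measure in the Milne & Witten style, computed over incoming-link sets.

```python
# Sketch of a link-based relatedness measure (Milne & Witten style),
# one possible instance of the slide's "relatedness measures".
from math import log

def relatedness(in_links_a, in_links_b, n_articles):
    """1 minus a normalised distance over shared incoming links."""
    a, b = set(in_links_a), set(in_links_b)
    shared = a & b
    if not shared:
        return 0.0
    dist = (log(max(len(a), len(b))) - log(len(shared))) / \
           (log(n_articles) - log(min(len(a), len(b))))
    return max(0.0, 1.0 - dist)

print(relatedness({"A", "B", "C"}, {"B", "C", "D"}, n_articles=4_000_000))
```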


One-Model-for-All-Languages: Features

• Prior probability
• Distance between the term and the name of the respective Wikipedia article
• Context fit on the conceptual level
  • Vector-based features based on concepts, categories, lists, portals
  • Average, maximum and minimum derived from pairwise calculated relatedness measures
  • Relatedness measures based on outgoing links, incoming links, categories
• Context fit on the token level (see the sketch below): cosine similarity based on nouns, verbs, adjectives between
  • the whole text and the respective Wikipedia article
  • the surrounding context and the local contexts of hyperlinks pointing to the respective page in Wikipedia
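A minimal sketch of the token-level context-fit feature, with tokenisation and POS filtering stubbed out; the word lists are toy data.

```python
# Sketch: token-level context fit as cosine similarity between bags of
# content words (restriction to nouns/verbs/adjectives omitted here).
from collections import Counter
from math import sqrt

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a.keys() & b.keys())
    den = sqrt(sum(v * v for v in a.values())) * \
          sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

doc = Counter("bank river water flow bank".split())
article = Counter("bank money account loan bank".split())
print(cosine(doc, article))  # overlap only on "bank"
```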


Results (Micro-Average): Different Training Approaches

                      Micro-Average
HITS_Trained_EN_ZH    0.785
HITS_Trained_SEP      0.789
HITS_Trained_ZH       0.786


Language-independent Concept-based Representation

Graph


Clustering


Learning the Edge Weights

• Binary classifier (SVM) with the classes
  • In the same cluster
  • Not in the same cluster
• Confidence values as edge weights (see the sketch below)
• Features: cosine similarities between local contexts and whole texts based on
  • Identified concepts
  • Identified concepts extended by incoming and outgoing links
  • Categories associated with the concepts
  • Lists associated with the concepts
  • Portals associated with the concepts
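A sketch of how classifier confidences could become edge weights, assuming scikit-learn; the logistic squashing of the margin stands in for a proper confidence estimate (e.g. Platt scaling), and the pairwise features are placeholders.

```python
# Sketch: a binary "same cluster?" SVM over mention pairs; its squashed
# margin becomes the weight of the edge between the two mentions.
from math import exp
from sklearn.svm import SVC

pair_X = [[0.9, 0.8], [0.8, 0.7], [0.2, 0.1], [0.1, 0.3]]  # pairwise cosines
pair_y = [1, 1, 0, 0]                                      # 1 = same cluster
clf = SVC().fit(pair_X, pair_y)

def edge_weight(feats):
    """Squash the SVM margin into (0, 1); stand-in for Platt scaling."""
    margin = clf.decision_function([feats])[0]
    return 1.0 / (1.0 + exp(-margin))

graph_edges = {("mention_1", "mention_2"): edge_weight([0.85, 0.75])}
print(graph_edges)
```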


Shortcomings of the Graph-based Clustering Approach

• Construction of the graph is inefficient
  • Pairwise comparisons using different similarity measures
• Little data to optimize the parameters
• Temporary solution: string match heuristic (see the sketch below)
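A sketch of the string-match fallback: NIL queries with the same normalised surface form end up in one cluster, so no pairwise graph is needed. The exact normalisation is an assumption here.

```python
# Sketch: string-match heuristic for clustering NIL queries. Grouping by
# normalised surface form needs one pass, not pairwise comparisons.
from collections import defaultdict

def string_match_clusters(queries):
    clusters = defaultdict(list)
    for query_id, surface in queries:
        clusters[surface.casefold().strip()].append(query_id)
    return list(clusters.values())

print(string_match_clusters([("Q1", "ABC Corp"),
                             ("Q2", "abc corp"),
                             ("Q3", "XYZ Inc")]))
# -> [['Q1', 'Q2'], ['Q3']]
```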


Results

               Micro-Average   Precision (B3)   Recall (B3)   F1 (B3)
Best System    0.788           –                –             –
Median         0.675           –                –             –
HITS           0.785           0.700            0.763         0.730


Future Work

• Experiments with other languages
  • We participated in the cross-lingual link discovery task organized by NTCIR: good results for all subtasks (EN-to-Chinese, EN-to-Korean, EN-to-Japanese)
• Efficient way to cluster terms: circumventing pairwise comparisons


Thank you!

Angela Fahrni
angela.fahrni@h-its.org
http://www.h-its.org
