Seungwon Hwang: Entity Graph Mining and Matching

DESCRIPTION

This talk introduces the problem of matching web-scale entity graphs, such as multilingual name graphs and social network graphs, to solve difficult problems such as name translation and social ID finding. Existing approaches rely on either textual (or phonetic) similarity or Web co-occurrence; this approach combines the strengths of both and significantly outperforms the state of the art. We present evaluation results on real-life entity graphs.

Transcript

Mining Entity Translation from Comparable Corpora: A Holistic Graph Mapping Approach

Entity Graph Mining and Matching

Seung-won Hwang
Associate Professor
Department of Computer Science and Engineering
POSTECH, Korea

Information & Database Systems Lab

Mining Human Intelligence from the Web: Click Graph

Language-agnostic/data-intensive: e.g., Arabic corpus?

Are q1 and q2 similar?

Are u3 and u4 similar?

Mining at Finer Granularity: Named Entity (NE) Graph

Person name, place name, organization name, product name
Newspapers, Web sites, TV programs

[Figure: example NE graph; labels shown include MS, Jobs, Gates, Apple, Mac, co-founder, tenure, complicated]

Case I: Matching names with Twitter accounts [EDBT11]

Case II: Entity Translation [EMNLP10,CIKM11]

What are the features?
How are the features combined?
(using translation as an application scenario)

[Figure: two NE relationship graphs, Ge = (Ve, Ee) built from the English corpus and Gc = (Vc, Ec) built from the Chinese corpus]

NE Translation

Goal: find, for an NE in the source language, its corresponding NE in the target language
Ex) Obama (English) → 奥巴马 (Chinese)
Resources: comparable corpora

[Figure: English NEs (NEE) and Chinese NEs (NEC), each described by features, are matched across the English and Chinese Xinhua News Agency corpora]

NE Translation Similarity Features

Entity Name Similarity (E): S. Wan [1], L. Haizhou [2], K. Knight [3]
Pronunciation similarity between named entities
Ex) Obama and 奥巴马 (pronounced Aobama)

Entity Context Similarity (EC): M. Diab [4], H. Ji [5], K. Yu [6]
Contextual word similarity between named entities
Ex) The president (总统) Obama (奥巴马)

Relationship Similarity (R): G.-w. You [7]
Co-occurrence similarity between pairs of named entities
Ex) (Jackie Chan, Bill Gates) vs. (成龙, 比尔·盖茨)

As president, Obama signed economic stimulus legislation

Motivation

Taxonomy Table

                         Entity        Relationship
Using Entity Names       E [1,2,3]     R
Using Textual Context    EC [4,5,6]    ?

Shao [8] combines E and EC; You [7] combines E and R.

Research questions:
Why is RC not used?
Can all four categories be combined?

In this paper

We propose a new NE translation similarity feature
Relationship Context similarity (RC)
Contextual word similarity between pairs of named entities
Ex) pair (Barack, Michelle): Spouse

We propose new holistic approaches
Combining all of E, EC, R, and RC

We validate our proposed approach using extensive experiments

Our Framework

We abstract this problem as Graph Matching of two NE relationship graphs extracted from comparable corpora

[Figure: NE relationship graphs Ge = (Ve, Ee) from the English corpus and Gc = (Vc, Ec) from the Chinese corpus]

Populate a decision matrix R, a |Ve|-by-|Vc| matrix

Our Framework

Overview: 3 steps

1. Initialization
Construct NE relationship graphs
Build an initial pairwise similarity matrix R0
Use Entity (E) and Entity Context (EC) similarities

2. Iterative reinforcement
Build a final pairwise similarity matrix R
Use Relationship (R) and Relationship Context (RC) similarities

3. Matching
Find 1:1 matching from R
Build a binary hard decision matrix R*

[Figure: example pairwise similarity scores for the English NEs Obama and Jackie Chan (values shown include .99, .1, .2), at two stages of the framework]

Initialization

Constructing NE relationship graphs G = (N, E)
Extract NEs using an entity tagger for each document in each corpus
Regard NEs that appear more than a threshold number of times as nodes
Connect two nodes when they co-occur more than a threshold number of times
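The construction steps above can be sketched as follows. This is a minimal illustration, assuming NE tagging has already been done so that each document arrives as a list of NE strings. The function name `build_ne_graph` and the parameters `min_freq` and `min_cooc` are hypothetical, the threshold values are not given in the slides, and counting co-occurrence at the document level is also an assumption.

```python
from collections import Counter
from itertools import combinations


def build_ne_graph(docs, min_freq, min_cooc):
    """Build an NE relationship graph from pre-tagged documents.

    docs: list of documents, each a list of NE strings (tagger output).
    min_freq, min_cooc: hypothetical thresholds, not values from the slides.
    Returns (nodes, edges), where edges maps a sorted NE pair to its count.
    """
    # Node selection: keep NEs that occur more than min_freq times overall.
    freq = Counter(ne for doc in docs for ne in doc)
    nodes = {ne for ne, n in freq.items() if n > min_freq}

    # Edge selection (assumption): count documents in which both NEs appear.
    cooc = Counter()
    for doc in docs:
        present = sorted(set(doc) & nodes)
        cooc.update(combinations(present, 2))
    edges = {pair: n for pair, n in cooc.items() if n > min_cooc}
    return nodes, edges
```

The same function would be run once per corpus to obtain the two graphs that are later matched.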

Initializing R0
Computing entity similarity matrix SE
Use Edit-Distance (ED) between ei and the Pinyin representation of cj
Ex) ED(Obama, 奥巴马) = ED(Obama, Aobama)
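A small sketch of the edit-distance step, assuming the Pinyin string for each Chinese NE is produced by a separate romanization step and passed in as plain text. Turning the distance into a similarity by dividing by the longer string's length is an assumption; the slide only states that ED is used. `name_similarity` is a hypothetical name.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def name_similarity(english_name, pinyin_name):
    """Assumed normalization: 1 - ED / length of the longer string."""
    a, b = english_name.lower(), pinyin_name.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

For the slide's example, `edit_distance("obama", "aobama")` is 1, since a single inserted letter separates the two strings.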

Initialization

Initializing R0
Computing entity context similarity matrix SEC

Context word
Ex) As president, Obama signed economic stimulus legislation

Context window

Correlation between an NE and a context word: log-odds ratio
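The slide names the log-odds ratio but its formula is not in the transcript, so the sketch below uses a common 2x2 contingency-table form with additive smoothing; treat that choice as an assumption. The helper names, the window size, and the smoothing constant are all hypothetical.

```python
import math


def context_window(tokens, position, size):
    """Return the tokens within `size` positions on either side of `position`."""
    left = tokens[max(0, position - size):position]
    right = tokens[position + 1:position + 1 + size]
    return left + right


def log_odds_ratio(n_both, n_ne_only, n_word_only, n_neither, smoothing=0.5):
    """Association between an NE and a context word from window counts.

    n_both: windows of the NE that contain the word
    n_ne_only: windows of the NE that do not contain the word
    n_word_only: windows of other NEs that contain the word
    n_neither: windows of other NEs that do not contain the word
    The smoothing constant (assumed) avoids division by zero and log(0).
    """
    a = n_both + smoothing
    b = n_ne_only + smoothing
    c = n_word_only + smoothing
    d = n_neither + smoothing
    return math.log((a * d) / (b * c))
```

A positive score means the word appears around the NE more often than around other NEs; a score near zero means no particular association.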

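Initialization

Initializing R0
Computing entity context similarity matrix SEC
Projected Context Association Vector

[Figure: context association vector for Obama, with context words such as president and USA and scores such as 0.9 and 0.85, projected through a bilingual dictionary entry for "President"]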

Initialization

Initializing R0
Computing entity context similarity matrix SEC
Context similarity between ei and cj
Compute cosine similarity between the two vectors
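A minimal sketch of projecting a context association vector through a bilingual dictionary and comparing it by cosine similarity. Vectors are represented as `{word: score}` dicts. Dropping words that have no dictionary entry and keeping the maximum score when several words map to the same translation are assumptions, and `project` and `cosine` are hypothetical names.

```python
import math


def project(vector, dictionary):
    """Map an English context vector into Chinese via a bilingual dictionary.

    vector: {english_word: association_score}
    dictionary: {english_word: [chinese_translation, ...]}
    """
    projected = {}
    for word, score in vector.items():
        for translation in dictionary.get(word, []):
            # Assumption: on collisions keep the strongest association.
            projected[translation] = max(projected.get(translation, 0.0), score)
    return projected


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(score * v.get(word, 0.0) for word, score in u.items())
    norm_u = math.sqrt(sum(s * s for s in u.values()))
    norm_v = math.sqrt(sum(s * s for s in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Each entry of SEC would then be `cosine(project(vec_e, dictionary), vec_c)` for one English NE and one Chinese NE.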

Merging SE and SEC
Min-Max normalization to the range [0, 1]
Merge
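The merge formula is not in the transcript, so the sketch below assumes a weighted sum of the two normalized matrices; the `weight` parameter and its default are hypothetical. Only the min-max normalization to [0, 1] comes from the slide.

```python
def min_max(matrix):
    """Rescale all entries of a matrix (list of rows) to the range [0, 1]."""
    values = [x for row in matrix for x in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant matrix carries no ranking information.
        return [[0.0 for _ in row] for row in matrix]
    return [[(x - lo) / (hi - lo) for x in row] for row in matrix]


def merge(s_e, s_ec, weight=0.5):
    """Assumed merge: weight * norm(SE) + (1 - weight) * norm(SEC)."""
    a, b = min_max(s_e), min_max(s_ec)
    return [[weight * x + (1 - weight) * y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]
```

Normalizing first keeps either feature from dominating simply because its raw scores live on a larger scale.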

Reinforcement

Intuition
Two NEs with a strong relationship
Co-occur frequently: have an edge
Share similar context: have similar relationship context

Align neighbors using relationship (R) and relationship context (RC) similarity
Update the similarity score

[Figure: NE X and its neighbors in the English NE graph aligned with NE Y and its neighbors in the Chinese NE graph, with context labels on both sides]

Reinforcement

Iterative Approach

The similarity of a pair (i, j) is updated at each iteration from two components:

Entity-based similarity (E & EC)

Relationship-based similarity (R & RC), computed from:

The ordered set of aligned neighbor pairs of (i, j) at iteration t

The Relationship Context (RC) similarity between relation pairs (i, u) and (j, v)

The Relationship (R) similarity of i's neighbor u and j's neighbor v
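The exact update formula is not in the transcript. The sketch below is one possible reading of the components listed above, under loudly stated assumptions: neighbors are aligned greedily, each aligned pair contributes its current similarity multiplied by an RC score, contributions decay by rank as 1/2^k, and the result is mixed with the initial score by a weight `lam`. All names, the decay, and the mixing rule are hypothetical.

```python
def reinforce(r0, nbrs_e, nbrs_c, rc_sim, lam=0.5, iterations=10):
    """Iteratively update pairwise similarities using graph neighbors.

    r0: initial similarity matrix (list of rows), from E and EC.
    nbrs_e, nbrs_c: {node_index: [neighbor_index, ...]} for each graph.
    rc_sim(i, u, j, v): RC similarity between relations (i, u) and (j, v).
    """
    r = [row[:] for row in r0]
    for _ in range(iterations):
        nxt = [row[:] for row in r]
        for i in range(len(r0)):
            for j in range(len(r0[0])):
                # Score every candidate neighbor pair, best first.
                candidates = sorted(
                    ((r[u][v] * rc_sim(i, u, j, v), u, v)
                     for u in nbrs_e.get(i, ())
                     for v in nbrs_c.get(j, ())),
                    reverse=True)
                used_u, used_v, rel, rank = set(), set(), 0.0, 1
                for score, u, v in candidates:
                    if u in used_u or v in used_v:
                        continue  # keep the neighbor alignment 1:1
                    used_u.add(u)
                    used_v.add(v)
                    rel += score / 2 ** rank  # assumed rank-based decay
                    rank += 1
                # Assumed mix of entity-based and relationship-based parts.
                nxt[i][j] = (1 - lam) * r0[i][j] + lam * rel
        r = nxt
    return r
```

In this sketch, a pair whose entity-based score is ambiguous gains support when its neighbors are already confidently matched, which is the intuition stated on the Reinforcement slide.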

Matching

Finding 1:1 matching using greedy algorithm

Steps
1. Find the translation pair with the highest final similarity score
2. Select the pair and remove the corresponding row and column from R
3. Repeat 1. and 2. until the similarity score < threshold
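The greedy steps above can be sketched as follows. The function names are hypothetical, the threshold is a parameter whose value the slide does not give, and tie-breaking between equal scores is an arbitrary choice of this sketch.

```python
def greedy_match(r, threshold):
    """Greedy 1:1 matching over a similarity matrix r (list of rows)."""
    free_rows = set(range(len(r)))
    free_cols = set(range(len(r[0]))) if r else set()
    pairs = []
    while free_rows and free_cols:
        # Step 1: best remaining pair.
        score, i, j = max((r[i][j], i, j) for i in free_rows for j in free_cols)
        if score < threshold:
            break  # Step 3: stop once the best score falls below the threshold.
        # Step 2: select the pair and retire its row and column.
        pairs.append((i, j))
        free_rows.remove(i)
        free_cols.remove(j)
    return pairs


def decision_matrix(pairs, n_rows, n_cols):
    """Binary hard decision matrix R*: 1 for matched pairs, 0 elsewhere."""
    r_star = [[0] * n_cols for _ in range(n_rows)]
    for i, j in pairs:
        r_star[i][j] = 1
    return r_star
```

Greedy selection does not guarantee the assignment with the highest total score, but it is simple and lets low-confidence entities stay unmatched.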

Experiments

Dataset
English Gigaword Corpus: Xinhua News Agency, 2008.01~2008.12, 100,746 news documents
Chinese Gigaword Corpus: Xinhua News Agency, 2008.01~2008.12, 88,029 news documents

Approaches
EC: consider Entity Context similarity feature only
E: consider Entity name similarity feature only
Shao (E+EC): combine Entity name & Entity Context similarities
You (E+R): combine Entity name & Relationship similarities
Ours: E+EC+R (one parameter set to 0)
Ours: E+EC+R+RC

Measure
Precision, Recall, and F1-score
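For reference, the three measures can be computed over sets of translation pairs as in the sketch below. Representing predictions and gold answers as sets of (source, target) pairs is an assumption about the evaluation setup, and the function name is hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall, and F1 over sets of (source, target) pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because the greedy matcher may leave entities unmatched, precision and recall can differ, and the matching threshold trades one against the other.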

Experiments

Effectiveness of overall framework
500 person named entities
One parameter set to 0.15
5-fold cross-validation for threshold parameter learning

Other types of NE (100 location named entities)

Directions

Graph matching
Graph cleansing [VLDB11]
Scalable entity search

[Figure: entities under "US Presidents": Bill Clinton, William J Clinton, George W. Bush, George H.W. Bush, Dubya]

Thanks

Questions?

Visit: www.postech.ac.kr/~swhwang for these papers
