Comparative Study of Name Disambiguation Problem using a Scalable Blocking-based Framework
Byung-Won On, Dongwon Lee, Jaewoo Kang, Prasenjit Mitra
JCDL’05



Abstract

• They consider the problem of ambiguous author names in bibliographic citations.
• Scalable two-step framework
  – Reduce the number of candidates via blocking (four methods)
  – Measure the distance between two names via coauthor information (seven measures)


Introduction

• Citation records are important resources for academic communities.
• Keeping citations correct and up to date has proved to be a challenging task at large scale.
• They focus on the problem of ambiguous author names.
• It is difficult to get the complete list of an author's publications.
  – E.g., “John Doe” published 100 articles, but the digital library (DL) keeps two separate purported author names, “John Doe” and “J. D. Doe”, each containing 50 citations.


Problem

• Problem definition:
• The baseline approach:


Solution

• Rather than comparing every pair of author names to find similar names, they advocate a scalable two-step name disambiguation framework.
  – Step 1: Partition all author-name strings into blocks
  – Step 2: Visit each block and compare all possible pairs of names within the block
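The two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the blocking key `last_token` and the sample names are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def block_by_key(names, key_fn):
    """Step 1: put names that share a blocking key into the same block."""
    blocks = defaultdict(list)
    for name in names:
        blocks[key_fn(name)].append(name)
    return blocks


def candidate_pairs(blocks):
    """Step 2: enumerate pairs only inside each block."""
    for members in blocks.values():
        yield from combinations(members, 2)


def last_token(name):
    # Hypothetical blocking key: the lowercased last token of the name.
    return name.lower().split()[-1]


names = ["Jeffrey Ullman", "J. Ullman", "Dongwon Lee", "D. Lee", "Jaewoo Kang"]
blocks = block_by_key(names, last_token)
pairs = list(candidate_pairs(blocks))

# Number of comparisons the all-pairs baseline would need.
all_pairs = len(names) * (len(names) - 1) // 2
print(len(pairs), "pairs within blocks vs", all_pairs, "pairs overall")
```

With these five names, blocking leaves 2 pairs to compare instead of 10, which is the kind of saving the framework relies on.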


Solution Overview


Blocking (1/3)

• The goal of step 1 is to put similar records into the same group by some criteria.
• They examine four representative blocking methods:
  – Spelling-based heuristics, token-based, n-gram, sampling


Blocking (2/3)

• Spelling-based heuristics
  – Group author names based on name spellings
  – Heuristics: iFfL, iFiL, fL, combination
  – iFfL: e.g., “Jeffrey Ullman” and “J. Ullman”
• Token-based
  – Author names sharing at least one common token are grouped into the same block
  – E.g., “Jeffrey D. Ullman” and “Ullman, Jason”
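The two methods on this slide can be sketched as follows. The slide does not spell out iFfL, so the code assumes it means "initial of the first name plus full last name", which is consistent with the “Jeffrey Ullman” / “J. Ullman” example. The helper names `tokens`, `iffl_key`, and `token_blocks` are hypothetical.

```python
from collections import defaultdict


def tokens(name):
    """Lowercased name tokens with commas and periods stripped."""
    cleaned = name.replace(",", " ").replace(".", " ")
    return cleaned.lower().split()


def iffl_key(name):
    # Assumed reading of iFfL: first-name initial + full last name,
    # for names written in "First ... Last" order.
    parts = tokens(name)
    return parts[0][0] + " " + parts[-1]


def token_blocks(names):
    """Token-based blocking: one block per token; a name joins the block
    of every token it contains, so blocks may overlap."""
    blocks = defaultdict(set)
    for name in names:
        for tok in tokens(name):
            blocks[tok].add(name)
    return blocks


print(iffl_key("Jeffrey Ullman"), "|", iffl_key("J. Ullman"))
print(token_blocks(["Jeffrey D. Ullman", "Ullman, Jason"])["ullman"])
```

Because the token-based method needs only one shared token, “Jeffrey D. Ullman” and “Ullman, Jason” land in the same block even though they are different people; step 2 is what separates them.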


Blocking (3/3)

• N-gram
  – N = 4
  – Puts the largest number of author names into the same block
  – E.g., “David R. Johnson” and “F. Barr-David”
• Sampling
  – Sampling-based join approximation
  – Each token from all author names has a TF-IDF weight.
  – Each author name has its token weight vector.
  – All pairs of names with similarity of at least θ can be put into the same block.
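A rough sketch of both ideas follows. Two assumptions are made loudly here: the 4-grams are taken over the letters of the whole name with spaces and punctuation removed, and an exact all-pairs TF-IDF cosine comparison stands in for the sampling-based approximation, which this sketch does not implement. The smoothed IDF formula and all function names are illustrative choices.

```python
import math
from collections import Counter


def char_ngrams(name, n=4):
    """Character n-grams over the letters of a name (assumption: spaces
    and punctuation are dropped before the n-grams are taken)."""
    s = "".join(ch for ch in name.lower() if ch.isalpha())
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def tfidf_vectors(names):
    """One token -> weight dictionary per name, using a smoothed IDF."""
    docs = [n.lower().replace(",", " ").replace(".", " ").split() for n in names]
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(1 + len(docs) / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def similar_pairs(names, theta):
    """Exact version of the join that the sampling method approximates."""
    vecs = tfidf_vectors(names)
    return [(names[i], names[j])
            for i in range(len(names))
            for j in range(i + 1, len(names))
            if cosine(vecs[i], vecs[j]) >= theta]


shared = char_ngrams("David R. Johnson") & char_ngrams("F. Barr-David")
print(sorted(shared))  # the two names share the 4-grams "avid" and "davi"
print(similar_pairs(["Jeffrey D. Ullman", "Jeffrey Ullman", "Dongwon Lee"], 0.5))
```

The shared 4-grams explain why n-gram blocking produces large blocks: a first name in one record can match a last name in another.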


Measuring Distances

• The goal of step 2 is, for each block, to identify the top-k author names that are the closest.
• Supervised methods
  – Naïve Bayes model, Support Vector Machine
• Unsupervised methods
  – String-based distance, vector-based cosine distance


Supervised Methods (1)

• Naïve Bayes Model
  Training:
  – The collection of coauthors of x is randomly split, and only half is used for training.
  – They estimate each coauthor’s conditional probability P(Aj|x).
  Testing:
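A minimal sketch of how such a model could be trained and applied is shown below. It assumes the standard naïve Bayes scoring rule with add-one smoothing and a uniform prior over candidate authors; these choices, the function names, and the toy data are assumptions and not necessarily the authors' exact formulation.

```python
import math
from collections import Counter


def train(coauthor_lists):
    """coauthor_lists maps an author name x to the coauthor lists of its
    training citations. Returns per-author coauthor counts and the vocabulary."""
    counts = {x: Counter(a for citation in citations for a in citation)
              for x, citations in coauthor_lists.items()}
    vocab = {a for c in counts.values() for a in c}
    return counts, vocab


def log_likelihood(counts, vocab, x, coauthors):
    """Sum of log P(A_j | x) over the observed coauthors, where P(A_j | x)
    is estimated with add-one smoothing (an assumption of this sketch)."""
    total = sum(counts[x].values())
    return sum(math.log((counts[x][a] + 1) / (total + len(vocab)))
               for a in coauthors)


training = {
    "John Doe": [["A. Smith", "B. Jones"], ["A. Smith"]],
    "Jane Roe": [["C. Wu"]],
}
counts, vocab = train(training)

# Score a held-out coauthor list against every candidate author.
observed = ["A. Smith", "B. Jones"]
best = max(counts, key=lambda x: log_likelihood(counts, vocab, x, observed))
print(best)
```

Working in log space avoids numerical underflow when a coauthor list is long.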


Supervised Methods (2)

• Support Vector Machine
  – All coauthor information of an author in a block is transformed into a vector-space representation.
  – Author names in a block are randomly split: 50% are used for training and the other 50% for testing.
  – The SVM creates a maximum-margin hyperplane that splits the YES and NO training examples.
  – In testing, the SVM classifies vectors by mapping them via the kernel trick to a high-dimensional space.
    • Radial Basis Function (RBF) kernel
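The sketch below illustrates only the two ingredients named on the slide, the vector-space transformation and the RBF kernel value K(u, v) = exp(-γ‖u − v‖²); it does not train an SVM. The coauthor vocabulary, the count-based encoding, and γ = 0.5 are hypothetical choices.

```python
import math


def to_vector(coauthors, vocabulary):
    """Count vector of a coauthor list over a fixed coauthor vocabulary."""
    return [coauthors.count(a) for a in vocabulary]


def rbf_kernel(u, v, gamma=0.5):
    """RBF kernel: exp(-gamma * squared Euclidean distance between u and v)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


vocabulary = ["A. Smith", "B. Jones", "C. Wu"]  # hypothetical coauthors
u = to_vector(["A. Smith", "B. Jones"], vocabulary)
v = to_vector(["A. Smith", "C. Wu"], vocabulary)

print(rbf_kernel(u, u))  # identical vectors give the maximum value 1.0
print(rbf_kernel(u, v))  # differing vectors give a value below 1.0
```

The kernel lets the classifier work with similarities between coauthor vectors without ever constructing the high-dimensional feature space explicitly.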


Unsupervised Methods (1)

• String-based Distance
  – The distance between two author names is measured by the “distance” between their coauthor lists.
  – Two token-based string distances
  – Two edit-distance-based string distances
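As an illustration of a token-based distance between coauthor lists, here is a Jaccard distance over name tokens. The slide does not name the two token-based measures, so treating Jaccard as one of them is an assumption, and the helper names are hypothetical.

```python
def coauthor_tokens(coauthors):
    """Set of lowercased name tokens across a whole coauthor list."""
    return {tok
            for name in coauthors
            for tok in name.lower().replace(".", " ").replace(",", " ").split()}


def jaccard_distance(coauthors_a, coauthors_b):
    """1 - |A ∩ B| / |A ∪ B| over the two coauthor token sets."""
    a, b = coauthor_tokens(coauthors_a), coauthor_tokens(coauthors_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


d = jaccard_distance(["Dongwon Lee", "Jaewoo Kang"],
                     ["Dongwon Lee", "Prasenjit Mitra"])
print(round(d, 3))  # 2 shared tokens out of 6 distinct tokens
```

Two author names whose coauthor lists overlap heavily get a distance near 0, which is the signal used to decide that they are variants of one person.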


Unsupervised Methods (2)

• Vector-based Cosine Distance
  – They model the coauthor lists as vectors in the vector space and compute the distances between the vectors.
  – They use the simple cosine distance.
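A minimal sketch of the cosine computation follows, assuming each coauthor list is modeled as a vector of raw coauthor counts; the slide does not state the weighting, so the count-based encoding is an assumption.

```python
import math
from collections import Counter


def cosine_similarity(coauthors_a, coauthors_b):
    """Cosine of the angle between two coauthor count vectors."""
    va, vb = Counter(coauthors_a), Counter(coauthors_b)
    dot = sum(va[k] * vb[k] for k in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def cosine_distance(coauthors_a, coauthors_b):
    return 1.0 - cosine_similarity(coauthors_a, coauthors_b)


a = ["D. Lee", "D. Lee", "J. Kang"]  # "D. Lee" coauthored two citations
b = ["D. Lee", "P. Mitra"]
print(round(cosine_similarity(a, b), 4))
```

Unlike the string-based distances, this needs only one pass over each coauthor list, which helps explain why it scales well.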


Experiment


Data Sets

• They gathered real citation data from four different domains.
  – DBLP, e-Print, BioMed, EconPapers
• Different disciplines appear to have slightly different citation policies, and citation conventions also vary.
  – Number of coauthors per article
  – Use of the first-name initial instead of the full first name


Artificial name variants

• Given the large number of citations, it is neither possible nor practical to find a “real” solution set.
• They pick the top 100 author names from Y according to their number of citations, and generate 100 corresponding new name variants artificially.
• For example, for “Grzegorz Rozenberg”, with 344 citations and 114 coauthors in DBLP, they create a new name such as “G. Rozenberg” or “Grzegorz Rozenbergg”.
• The original 344 citations are split into halves, so each name carries half of the citations (172).
• They test whether the algorithm is able to find the corresponding artificial name variant in Y.
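The citation split can be sketched as below. The slide only says the citations are split into halves; the random shuffle, the fixed seed, and the integer placeholder IDs are assumptions made for illustration.

```python
import random


def split_citations(citations, seed=0):
    """Shuffle the citations and cut them into two halves: one stays with
    the original name, the other is assigned to the artificial variant."""
    shuffled = list(citations)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


citations = list(range(344))  # placeholder IDs for the 344 citations
kept, moved = split_citations(citations)
print(len(kept), len(moved))  # 172 172
```

After the split, the original name and its artificial variant each carry a disjoint set of citations, so the correct answer for the variant is known in advance.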


Artificial name variants

• Error types, e.g., for “Ji-Woo K. Li”:
  – Abbreviation: “J. K. Li”
  – Name alternation: “Li, Ji-Woo K.”
  – Typo: “Ji-Woo K. Lee” or “Jee-Woo K. Li”
  – Contraction: “Jiwoo K. Li”
  – Omission: “Ji-Woo Li”
  – Combinations
• The effect of error types on the accuracy of name disambiguation is quantified.
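Four of the error types are deterministic and can be sketched as small string transformations. The function names are hypothetical, each function assumes a name of at least two tokens in "First ... Last" order, and typos are left out because they need a random character edit.

```python
def abbreviate(name):
    # Reduce every token except the last name to an initial.
    *given, last = name.split()
    return " ".join(g[0] + "." for g in given) + " " + last


def alternate(name):
    # Move the last name to the front, followed by a comma.
    *given, last = name.split()
    return last + ", " + " ".join(given)


def contract(name):
    # Drop the hyphen inside the first name and re-capitalize it.
    first, *rest = name.split()
    return " ".join([first.replace("-", "").capitalize()] + rest)


def omit_middle(name):
    # Keep only the first and last tokens.
    parts = name.split()
    return parts[0] + " " + parts[-1]


name = "Ji-Woo K. Li"
for make_variant in (abbreviate, alternate, contract, omit_middle):
    print(make_variant.__name__, "->", make_variant(name))
```

A combination variant can then be produced by composing two of these functions.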


Artificial name variants

• (1) Mixed error types: abbreviation (30%), alternation (30%), typo (12% each in first and last name), contraction (2%), omission (4%), and combination (10%)
• (2) Abbreviation of the first name (85%) and typo (15%)
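The two distributions can be written down and sampled as below. The dictionary keys, the function name, and the fixed seed are illustrative assumptions; only the percentages come from the slide.

```python
import random

# Percentages from the slide; each distribution sums to 100.
MIX_1 = {
    "abbreviation": 30,
    "alternation": 30,
    "typo_first_name": 12,
    "typo_last_name": 12,
    "contraction": 2,
    "omission": 4,
    "combination": 10,
}
MIX_2 = {"abbreviation_first_name": 85, "typo": 15}


def sample_error_types(mix, n, seed=0):
    """Draw n error types with probability proportional to the weights."""
    rng = random.Random(seed)
    kinds = list(mix)
    return rng.choices(kinds, weights=[mix[k] for k in kinds], k=n)


# One error type per artificial variant, e.g. for 100 author names.
assigned = sample_error_types(MIX_1, 100)
print(sum(MIX_1.values()), sum(MIX_2.values()), len(assigned))
```

Sampling gives only approximately these proportions in any single run; assigning exact counts per type would be an equally reasonable design.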


Evaluation metrics

• Scalability
  – Size of the blocks generated in step 1
  – Time taken to process both steps 1 and 2
• Accuracy
  – They measured the top-k accuracy.
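One plausible reading of top-k accuracy is the fraction of query names whose true variant appears among the k closest candidates. The sketch below assumes that definition; the function name, the data layout, and the example candidates are hypothetical.

```python
def top_k_accuracy(ranked_candidates, truth, k=5):
    """ranked_candidates maps each query name to its candidates ordered
    closest first; truth maps each query name to its true variant."""
    hits = sum(1 for query, ranked in ranked_candidates.items()
               if truth[query] in ranked[:k])
    return hits / len(ranked_candidates)


ranked = {
    "Grzegorz Rozenberg": ["G. Rozenberg", "G. Rosenberg"],
    "Jeffrey Ullman": ["J. Ullmann", "Jason Ullman", "J. Ullman"],
}
truth = {"Grzegorz Rozenberg": "G. Rozenberg", "Jeffrey Ullman": "J. Ullman"}

print(top_k_accuracy(ranked, truth, k=2))  # 0.5
print(top_k_accuracy(ranked, truth, k=3))  # 1.0
```

A larger k is more forgiving, so accuracy figures are only comparable when they use the same k.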


Scalability

• Average number of author names in each block
• Processing time for steps 1 and 2


Accuracy

• Four blocking methods combined with seven distance metrics for all four data sets, with k = 5.
• The EconPapers data set is omitted.


Conclusion

• They compared various configurations (four blocking methods in step 1, seven distance metrics via “coauthor” information in step 2) against four data sets.
• A combination of token-based or n-gram blocking (step 1) and SVM as a supervised method or the cosine metric as an unsupervised method (step 2) gave the best scalability/accuracy trade-off.
• The accuracy of simple name-spelling-based heuristics was shown to be quite sensitive to the error types.
• Edit-distance-based metrics such as Jaro or Jaro-Winkler proved inadequate for the large-scale name disambiguation problem because of their slow processing time.