Comparative Study of Name Disambiguation Problem using a Scalable Blocking-based Framework Byung-Won On, Dongwon Lee, Jaewoo Kang, Prasenjit Mitra JCDL


JCDL’05

Abstract

• They consider the problem of ambiguous author names in bibliographic citations.

• Scalable two-step framework
– Reduce the number of candidates via blocking (four methods)
– Measure the distance between two names via coauthor information (seven measures)

Introduction

• Citation records are important resources for academic communities.

• Keeping citations correct and up to date has proved to be a challenging task at large scale.

• We focus on the problem of ambiguous author names.

• It is difficult to get the complete list of publications of some authors.
– “John Doe” published 100 articles, but the digital library (DL) keeps two separate purported author names, “John Doe” and “J. D. Doe”, each containing 50 citations.

Problem

• Problem definition:

• The baseline approach:

Solution

• Rather than comparing every pair of author names to find similar names, they advocate a scalable two-step name disambiguation framework.
– Partition all author-name strings into blocks
– Visit each block and compare all possible pairs of names within the block
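The two steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function `two_step_pairs`, the blocking key `last_token`, the toy distance `initial_distance`, and the threshold are all hypothetical placeholders (the actual framework measures distance via coauthor information).

```python
from collections import defaultdict
from itertools import combinations


def two_step_pairs(names, block_key, distance, threshold):
    """Step 1: partition names into blocks. Step 2: compare pairs inside each block."""
    blocks = defaultdict(list)
    for name in names:
        blocks[block_key(name)].append(name)

    similar = []
    for members in blocks.values():
        # Only pairs within the same block are compared, never pairs across blocks.
        for a, b in combinations(members, 2):
            if distance(a, b) <= threshold:
                similar.append((a, b))
    return similar


def last_token(name):
    """Hypothetical blocking key: the last token of the name, lower-cased."""
    return name.lower().split()[-1]


def initial_distance(a, b):
    """Toy placeholder distance: 0 if the first letters match, otherwise 1."""
    return 0 if a[0].lower() == b[0].lower() else 1


names = ["John Doe", "J. D. Doe", "Jane Roe", "Alice Smith"]
pairs = two_step_pairs(names, last_token, initial_distance, threshold=0)
```

With four names the baseline would compare six pairs; here only the single pair that shares a block is compared.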

Solution Overview

Blocking (1/3)

• The goal of step 1 is to put similar records into the same group according to some criterion.

• They examine four representative blocking methods
– heuristics, token-based, n-gram, sampling

Blocking (2/3)

• Spelling-based heuristics
– Group author names based on name spellings
– Heuristics: iFfL, iFiL, fL, combination
– iFfL: e.g., “Jeffrey Ullman”, “J. Ullman”

• Token-based
– Author names sharing at least one common token are grouped into the same block
– e.g., “Jeffrey D. Ullman” and “Ullman, Jason”
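A rough sketch of these two blocking methods follows. It assumes that iFfL means "initial of the first name plus full last name", which is how the example pair reads but is not spelled out on the slide. The helper names `tokens`, `iffl_key`, and `token_blocks` are hypothetical, and the tokenizer handles ASCII letters only.

```python
import re
from collections import defaultdict


def tokens(name):
    """Lower-cased alphabetic tokens, e.g. 'Ullman, Jason' -> ['ullman', 'jason']."""
    return re.findall(r"[a-z]+", name.lower())


def iffl_key(name):
    """Assumed iFfL key: first-name initial plus full last name.

    Assumes 'First [Middle] Last' order; 'Last, First' input is not handled.
    """
    toks = tokens(name)
    return toks[0][0] + " " + toks[-1]


def token_blocks(names):
    """Token-based blocking: one block per token, holding every name that contains it."""
    blocks = defaultdict(set)
    for name in names:
        for tok in tokens(name):
            blocks[tok].add(name)
    return blocks


same_key = iffl_key("Jeffrey Ullman") == iffl_key("J. Ullman")
blocks = token_blocks(["Jeffrey D. Ullman", "Ullman, Jason"])
```

Note that in this sketch a name lands in one block per token, so blocks overlap, and single-letter tokens such as initials also form blocks.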

Blocking (3/3)

• N-gram
– N = 4
– This method puts the largest number of author names into the same block.
– e.g., “David R. Johnson”, “F. Barr-David”

• Sampling
– Sampling-based join approximation
– Each token from all author names has a TF-IDF weight.
– Each author name has its token weight vector.
– All pairs of names with similarity of at least θ are put into the same block.
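The N-gram example can be checked with a short sketch. It assumes character 4-grams taken over the lower-cased name with spaces and punctuation removed; the slide does not say how names are normalized, so that preprocessing and the function names `char_ngrams` and `ngram_blocks` are illustrative only.

```python
from collections import defaultdict


def char_ngrams(name, n=4):
    """Set of character n-grams over the letters of the name (assumed normalization)."""
    s = "".join(ch for ch in name.lower() if ch.isalpha())
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_blocks(names, n=4):
    """One block per n-gram, holding every name that contains that n-gram."""
    blocks = defaultdict(set)
    for name in names:
        for gram in char_ngrams(name, n):
            blocks[gram].add(name)
    return blocks


shared = char_ngrams("David R. Johnson") & char_ngrams("F. Barr-David")
blocks = ngram_blocks(["David R. Johnson", "F. Barr-David"])
```

Under these assumptions the two names share the 4-grams "davi" and "avid", so they fall into the same block even though they refer to a first name in one case and a last name in the other. This is why N-gram blocking tends to produce large blocks.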

Measuring Distances

• The goal of step 2 is, for each block, to identify the top-k author names that are the closest.

• Supervised methods
– Naïve Bayes model, Support Vector Machine

• Unsupervised methods
– String-based distance, vector-based cosine distance

Supervised Methods (1)

• Naïve Bayes Model

Training:
– The collection of coauthors of x is randomly split, and only half is used for training.
– They estimate each coauthor’s conditional probability P(Aj|x)

Testing:

Supervised Methods (2)

• Support Vector Machine
– All coauthor information of an author in a block is transformed into a vector-space representation.
– Author names in a block are randomly split: 50% are used for training and the other 50% for testing.
– The SVM creates a maximum-margin hyperplane that separates the YES and NO training examples.
– In testing, the SVM classifies vectors by mapping them via the kernel trick to a high-dimensional space (Radial Basis Function kernel).

Unsupervised Methods(1)

• String-based Distance
– The distance between two author names is measured by the “distance” between their coauthor lists.

– Two token-based string distances

– Two edit-distance-based string distances

Unsupervised Methods(2)

• Vector-based Cosine Distance
– They model the coauthor lists as vectors in the vector space and compute the distances between the vectors.

– They use the simple cosine distance.
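A minimal sketch of the cosine distance between two coauthor lists is given below. It assumes raw co-occurrence counts as vector weights; the slide does not state the weighting scheme, and the function name `cosine_distance` and the coauthor labels are placeholders.

```python
import math
from collections import Counter


def cosine_distance(coauthors_a, coauthors_b):
    """1 - cosine similarity of two coauthor count vectors."""
    va, vb = Counter(coauthors_a), Counter(coauthors_b)
    # Only coauthors present in both vectors contribute to the dot product.
    dot = sum(va[k] * vb[k] for k in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: an empty list is maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# Coauthor "A" appears on two papers of the first name, "B" on two of the second.
d = cosine_distance(["A", "A", "B"], ["A", "B", "B"])
```

For this input the vectors are (2, 1) and (1, 2), the similarity is 4/5, and the distance is about 0.2.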

Experiment

Data Sets

• They gathered real citation data from four different domains.
– DBLP, e-Print, BioMed, EconPapers

• Different disciplines appear to have slightly different citation policies, and citation conventions also vary.
– Number of coauthors per article
– Use of the first-name initial instead of the full name

Artificial name variants

• Given the large number of citations, it is neither possible nor practical to find a “real” solution set.

• They pick the top-100 author names from Y according to their number of citations, and generate 100 corresponding new name variants artificially.

• For example, for “Grzegorz Rozenberg”, with 344 citations and 114 coauthors in DBLP, they create a new name such as “G. Rozenberg” or “Grzegorz Rozenbergg”.

• The original 344 citations are split into halves, so each name carries 172 citations.

• They test whether the algorithm is able to find the corresponding artificial name variant in Y.

• Error types, e.g., for “Ji-Woo K. Li”
– Abbreviation: “J. K. Li”
– Name alternation: “Li, Ji-Woo K.”
– Typo: “Ji-Woo K. Lee” or “Jee-Woo K. Li”
– Contraction: “Jiwoo K. Li”
– Omission: “Ji-Woo Li”
– Combinations

• To quantify the effect of error types on the accuracy of name disambiguation, the following error-type mixes are used:

• (1) mixed error types: abbreviation (30%), alternation (30%), typo (12% each in first/last name), contraction (2%), omission (4%), and combination (10%)

• (2) abbreviation of the first name (85%) and typo (15%)
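The deterministic error types can be sketched as small string transformations, with the error type drawn according to mix (1). This is an illustration under assumptions: all function names and the `MIX_1` table are hypothetical, names are assumed to be in "First [Middle] Last" order, and typo generation is left out because the slide does not say how typos were produced.

```python
import random


def abbreviate(name):
    """First name reduced to its initial: 'Ji-Woo K. Li' -> 'J. K. Li'."""
    parts = name.split()
    return " ".join([parts[0][0] + "."] + parts[1:])


def alternate(name):
    """Last name moved to the front: 'Ji-Woo K. Li' -> 'Li, Ji-Woo K.'."""
    parts = name.split()
    return parts[-1] + ", " + " ".join(parts[:-1])


def contract(name):
    """Hyphen removed from the first name: 'Ji-Woo K. Li' -> 'Jiwoo K. Li'."""
    parts = name.split()
    first = parts[0].replace("-", "")
    return " ".join([first[0].upper() + first[1:].lower()] + parts[1:])


def omit_middle(name):
    """Middle parts dropped: 'Ji-Woo K. Li' -> 'Ji-Woo Li'."""
    parts = name.split()
    return parts[0] + " " + parts[-1]


# Percentages of mix (1) as listed on the slide; the key names are hypothetical.
MIX_1 = {"abbreviation": 30, "alternation": 30, "typo_first": 12,
         "typo_last": 12, "contraction": 2, "omission": 4, "combination": 10}


def pick_error_type(rng):
    """Draw one error type with probability proportional to its share in MIX_1."""
    return rng.choices(list(MIX_1), weights=list(MIX_1.values()), k=1)[0]


chosen = pick_error_type(random.Random(0))
```

The shares in `MIX_1` sum to 100, which matches the percentages listed for mix (1).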

Evaluation metrics

• Scalability
– Size of the blocks generated in step 1
– Time taken to process both steps 1 and 2

• Accuracy
– They measured the top-k accuracy.
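One plausible reading of top-k accuracy, given that the test is whether the artificial variant is found, is the fraction of author names whose variant appears among the k closest candidates. The sketch below follows that reading; the exact definition is an assumption, and `top_k_accuracy`, `rankings`, and `truth` are hypothetical names with placeholder data.

```python
def top_k_accuracy(rankings, truth, k=5):
    """Fraction of names whose true variant is among the k closest candidates.

    rankings: name -> list of candidate names, ordered closest first
    truth:    name -> the artificial variant that should be found
    """
    hits = sum(
        1 for name, variant in truth.items()
        if variant in rankings.get(name, [])[:k]
    )
    return hits / len(truth)


# Placeholder data: the variant of "a" is ranked 2nd, the variant of "b" 3rd.
rankings = {"a": ["x", "a2", "y"], "b": ["z", "w", "b2"]}
truth = {"a": "a2", "b": "b2"}
acc_at_2 = top_k_accuracy(rankings, truth, k=2)
acc_at_3 = top_k_accuracy(rankings, truth, k=3)
```

With k = 2 only one of the two variants is recovered (accuracy 0.5); with k = 3 both are (accuracy 1.0).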

Scalability

• The average number of authors in each block

• Processing time for steps 1 and 2

Accuracy

• Four blocking methods combined with seven distance metrics for all four data sets, with k = 5.

• The EconPapers data set is omitted.

Conclusion

• They compared various configurations (four blocking methods in step 1, seven distance metrics via “coauthor” information in step 2) against four data sets.

• A combination of token-based or N-gram blocking (step 1) and SVM as a supervised method or the cosine metric as an unsupervised method (step 2) gave the best scalability/accuracy trade-off.

• The accuracy of simple name-spelling-based heuristics was shown to be quite sensitive to the error types.

• Edit-distance-based metrics such as Jaro or Jaro-Winkler proved to be inadequate for the large-scale name disambiguation problem because of their slow processing time.
