A quick overview of some common approximate string comparators used in record linkage.
1
Approximate string comparators
TvungenOne, 2012-06-15
Lars Marius Garshol, <[email protected]>
http://twitter.com/larsga
2
Approximate string comparators?
• Basically, measures of the similarity between two strings
• Useful in situations where exact match is insufficient
  – record linkage
  – search
  – ...
• Many of these are slow: O(n²)
3
Levenshtein
• Also known as edit distance
• Measures the number of edit operations necessary to turn s1 into s2
• Edit operations are
  – insert a character
  – remove a character
  – substitute a character
4
Levenshtein example
• Levenshtein -> Löwenstein
  – Levenstein (remove ‘h’)
  – Lövenstein (substitute ‘ö’)
  – Löwenstein (substitute ‘w’)
• Edit distance = 3
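
The edit distance can be computed with the usual dynamic-programming table. The following is a minimal sketch, not taken from the slides; the function name `levenshtein` is chosen here for illustration, and it keeps only two rows of the table at a time.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Count the inserts, removals and substitutions needed to turn s1 into s2."""
    # previous[j] is the distance between the prefix of s1 handled so far
    # and the first j characters of s2.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning i characters into the empty string takes i removals
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # remove c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute (free if the characters are equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("Levenshtein", "Löwenstein"))
```

The nested loops are where the O(n²) cost comes from: every character of s1 is compared with every character of s2.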
5
Weighted Levenshtein
• Not all edit operations are equal
• Substituting “i” for “e” is a smaller edit than substituting “o” for “k”
• Weighted Levenshtein evaluates each edit operation as a number 0.0-1.0
• Difficult to implement
  – weights are also language-dependent
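
One way to sketch the idea is to pass the substitution cost in as a function. Everything below is an illustrative assumption rather than the method behind the slides: the names `weighted_levenshtein` and `example_cost` are hypothetical, inserts and removals are kept at a fixed cost of 1.0, and the vowel-to-vowel weight of 0.5 is made up.

```python
from typing import Callable


def weighted_levenshtein(s1: str, s2: str,
                         substitution_cost: Callable[[str, str], float]) -> float:
    """Edit distance where each substitution costs between 0.0 and 1.0."""
    previous = [float(j) for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        current = [float(i)]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1.0,                            # remove c1
                current[j - 1] + 1.0,                         # insert c2
                previous[j - 1] + substitution_cost(c1, c2),  # weighted substitution
            ))
        previous = current
    return previous[-1]


VOWELS = set("aeiouy")


def example_cost(c1: str, c2: str) -> float:
    """Hypothetical weights: swapping one vowel for another is a smaller edit."""
    if c1 == c2:
        return 0.0
    if c1 in VOWELS and c2 in VOWELS:
        return 0.5  # made-up weight, would need tuning per language
    return 1.0


print(weighted_levenshtein("pin", "pen", example_cost))  # "i" -> "e"
print(weighted_levenshtein("oat", "kat", example_cost))  # "o" -> "k"
```

The hard part in practice is not the algorithm but choosing the weights, which is why they are kept outside the distance function here.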
6
Jaro-Winkler
• Developed at the US Bureau of the Census
• For name comparisons
  – not well suited to long strings
  – best if given name/surname are separated
• Exists in a few variants
  – originally proposed by Jaro
  – then modified by Winkler
  – a few different versions of modifications etc
7
Jaro-Winkler definition
• Formula (Jaro similarity): (m/|s1| + m/|s2| + (m - t)/m) / 3
  – m = number of matching characters
  – t = number of transpositions (half the number of matching characters that are out of order)
• A character from string s1 matches s2 if the same character is found in s2 less than half the length of the string away
• Levenshtein ~ Löwenstein = 0.8
• Axel ~ Aksel = 0.783
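
The following is a minimal sketch of the commonly cited textbook definition, which may differ in detail from the variant used for the numbers above. Assumptions: the matching window is half the longer string's length minus one, t is half the number of matched characters that are out of order, and the Winkler prefix boost uses a scaling factor of 0.1 over at most four characters. The function names are chosen here for illustration.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: (m/|s1| + m/|s2| + (m - t)/m) / 3."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters only match if they are at most `window` positions apart.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):
        start = max(0, i - window)
        end = min(len(s2), i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Compare the matched characters in order; each out-of-order pair is half a transposition.
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Winkler's modification: boost the Jaro score for a shared prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1.0 - base)


print(round(jaro("Axel", "Aksel"), 3))
```

For Axel ~ Aksel the plain Jaro score from this sketch is about 0.783; the prefix boost in `jaro_winkler` raises it slightly because both names start with "A".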
8
Jaro-Winkler variant
9
Soundex
• A coarse scheme for matching names by sound
  – produces a key from the name
  – names match if the key is the same
• In common use in many places
  – Nav’s person register uses it for search
  – built into many databases
  – ...
10
Soundex definition
11
Examples
• soundex(“Axel”) = ‘A240’
• soundex(“Aksel”) = ‘A240’
• soundex(“Levenshtein”) = ‘L152’
• soundex(“Löwenstein”) = ‘L523’
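
A minimal sketch of the widely used American Soundex rules, assuming the standard letter groups: keep the first letter, map the remaining consonants to digits, collapse adjacent letters with the same digit (also across "h" and "w"), and pad or cut the key to four characters. Letters outside a-z, such as "ö", are simply treated as uncoded here, which is an assumption; real implementations differ on this point.

```python
# Letter groups of American Soundex; vowels, h, w and y have no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Return the four-character Soundex key for a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    last = CODES.get(letters[0], "")  # the first letter's code also blocks repeats
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != last:
            key += code
        if c not in "hw":  # h and w do not separate two letters with the same code
            last = code
    return (key + "000")[:4]


print(soundex("Axel"), soundex("Aksel"))
```

Both spellings produce the key A240, so a lookup on the key finds either one.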
12
Metaphone
• Developed by Lawrence Philips
• Similar to Soundex, but much more complex
  – both more accurate and more sensitive
• Developed further into Double Metaphone
• Metaphone 3.0 also exists, but only available commercially
13
Metaphone examples
• metaphone(“Axel”) = ‘AKSL’
• metaphone(“Aksel”) = ‘AKSL’
• metaphone(“Levenshtein”) = ‘LFNX’
• metaphone(“Löwenstein”) = ‘LWNS’
14
Dice coefficient
• A similarity measure for sets
  – the set can be the tokens in a string
  – or the characters in a string
• Formula: 2 |A ∩ B| / (|A| + |B|), for sets A and B
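
A minimal sketch using the standard definition of the Dice coefficient, twice the size of the intersection divided by the sum of the set sizes. The function name `dice` and the choice to score two empty inputs as 1.0 are assumptions made here.

```python
from typing import Iterable


def dice(a: Iterable[str], b: Iterable[str]) -> float:
    """Dice coefficient of two collections, compared as sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # assumption: two empty inputs count as identical
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


# Tokens as the set elements
print(dice("lars marius garshol".split(), "lars garshol".split()))
# Characters as the set elements
print(dice("Axel", "Aksel"))
```

Comparing tokens ignores word order, which helps with names written as "surname, given name"; comparing characters is more forgiving of typos but also much less selective.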
15
TFIDF
• Compares strings as sets of tokens
  – à la Dice coefficient
• However, takes frequency of tokens in corpus into account
  – this matches how we evaluate matches mentally
• Has done well in evaluations
  – however, can be difficult to evaluate
  – results will change as corpus changes
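
A minimal sketch of the idea, assuming whitespace tokenization, a weight of count × log(1 + N/df) per token (one of several common weighting variants, picked here so that no weight is zero) and cosine similarity between the resulting vectors. The function names and the tiny corpus are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(corpus: list[str]) -> list[dict[str, float]]:
    """Turn each string in the corpus into a token -> weight mapping."""
    docs = [s.lower().split() for s in corpus]
    n = len(docs)
    # Document frequency: in how many strings does each token occur?
    df = Counter(token for doc in docs for token in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({tok: count * math.log(1 + n / df[tok])
                        for tok, count in tf.items()})
    return vectors


def cosine(v1: dict[str, float], v2: dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * v2.get(tok, 0.0) for tok, w in v1.items())
    norm = (math.sqrt(sum(w * w for w in v1.values()))
            * math.sqrt(sum(w * w for w in v2.values())))
    return dot / norm if norm else 0.0


corpus = ["Acme Software Inc", "Acme Inc", "Widget Hardware Inc", "Bolt Systems Inc"]
vectors = tfidf_vectors(corpus)
print(cosine(vectors[0], vectors[1]))  # share "acme", a fairly rare token
print(cosine(vectors[1], vectors[2]))  # share only "inc", which is everywhere
```

Because the weights depend on the document frequencies, adding or removing strings from the corpus changes every score, which is the instability mentioned above.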
16
More comparators
• Smith-Waterman
  – originated in DNA sequencing
• Q-gram distance
  – breaks string into sets of pieces of q characters
  – then does set similarity comparison
• Monge-Elkan
  – similar to Smith-Waterman, but with affine gap distances
  – has done very well in evaluations
  – costly to evaluate
• Many, many more
  – ...
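
As an illustration of the q-gram approach, here is a minimal sketch that splits each string into overlapping pieces of q characters and compares the two sets. The padding character "#" and the use of Jaccard overlap (intersection over union) for the set comparison are assumptions made here; other set measures work equally well.

```python
def qgrams(s: str, q: int = 2, pad: str = "#") -> set[str]:
    """Set of overlapping q-character pieces, padded so the string ends count too."""
    padded = pad * (q - 1) + s + pad * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def qgram_similarity(s1: str, s2: str, q: int = 2) -> float:
    """Jaccard overlap of the two q-gram sets."""
    g1, g2 = qgrams(s1, q), qgrams(s2, q)
    if not g1 and not g2:
        return 1.0  # assumption: two empty q-gram sets count as identical
    return len(g1 & g2) / len(g1 | g2)


print(sorted(qgrams("abc")))
print(qgram_similarity("Axel", "Aksel"))
```

Unlike the edit-distance family this needs no O(n²) table, and the q-grams can also be indexed to find candidate matches quickly.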