A quick overview of some common approximate string comparators used in record linkage.
1. Approximate string comparators
   TvungenOne, 2012-06-15
   Lars Marius Garshol, http://twitter.com/larsga
2. Approximate string comparators?
   - Basically, measures of the similarity between two strings
   - Useful in situations where exact match is insufficient
     - record linkage
     - search
     - ...
   - Many of these are slow: O(n^2)
3. Levenshtein
   - Also known as edit distance
   - Measures the number of edit operations necessary to turn s1 into s2
   - Edit operations are
     - insert a character
     - remove a character
     - substitute a character
4. Levenshtein example
   - Levenshtein -> Löwenstein
     - Levenstein (remove h)
     - Lövenstein (substitute ö)
     - Löwenstein (substitute w)
   - Edit distance = 3
5. Weighted Levenshtein
   - Not all edit operations are equal
     - Substituting i for e is a smaller edit than substituting o for k
   - Weighted Levenshtein evaluates each edit operation as a number 0.0-1.0
   - Difficult to implement
     - weights are also language-dependent
6. Jaro-Winkler
   - Developed at the US Bureau of the Census
   - For name comparisons
     - not well suited to long strings
     - best if given name/surname are separated
   - Exists in a few variants
     - originally proposed by Jaro
     - then modified by Winkler
     - a few different versions of modifications etc
7. Jaro-Winkler definition
   - Formula: [shown as an image on the slide; not in this transcript]
     - m = number of matching characters
     - t = number of transposed characters
   - A character from string s1 matches s2 if the same character is found in s2 less than half the length of the string away
   - Levenshtein ~ Löwenstein = 0.8
   - Axel ~ Aksel = 0.783
8. Jaro-Winkler variant
9. Soundex
   - A coarse schema for matching names by sound
     - produces a key from the name
     - names match if key is the same
   - In common use in many places
     - NAV's person register uses it for search
     - built into many databases
     - ...
10. Soundex definition
11. Examples
   - soundex(Axel) = A240
   - soundex(Aksel) = A240
   - soundex(Levenshtein) = L152
   - soundex(Löwenstein) = L523
12. Metaphone
   - Developed by Lawrence Philips
   - Similar to Soundex, but much more complex
     - both more accurate and more sensitive
   - Developed further into Double Metaphone
   - Metaphone 3.0 also exists, but only available commercially
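As a minimal sketch of the unweighted edit distance from slides 3 and 4, the following Python function counts inserts, removals and substitutions with the usual dynamic-programming table, kept to two rows. The function name `levenshtein` and the unit cost for every operation are assumptions made for illustration; the slides do not show an implementation.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Number of single-character inserts, removals and substitutions
    needed to turn s1 into s2, with every operation costing 1."""
    # previous[j] = distance between the first i-1 chars of s1
    # and the first j chars of s2 (row 0: build s2 by inserts only).
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        # Column 0: remove all i characters of s1 seen so far.
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # remove c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("Axel", "Aksel"))  # 2: substitute x -> k, insert s
```

The two nested loops visit every pair of positions in the two strings, which is where the O(n^2) cost mentioned on slide 2 comes from. A weighted variant along the lines of slide 5 would replace the three constant costs with a lookup per character pair.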
13. Metaphone examples
   - metaphone(Axel) = AKSL
   - metaphone(Aksel) = AKSL
   - metaphone(Levenshtein) = LFNX
   - metaphone(Löwenstein) = LWNS
14. Dice coefficient
   - A similarity measure for sets
     - set can be tokens in a string
     - or characters in a string
   - Formula: dice(A, B) = 2 |A ∩ B| / (|A| + |B|)
15. TFIDF
   - Compares strings as sets of tokens
     - a la Dice coefficient
   - However, takes frequency of tokens in corpus into account
     - this matches how we evaluate matches mentally
   - Has done well in evaluations
     - however, can be difficult to evaluate
     - results will change as corpus changes
16. More comparators
   - Smith-Waterman
     - originated in DNA sequencing
   - Q-grams distance
     - breaks string into sets of pieces of q characters
     - then does set similarity comparison
   - Monge-Elkan
     - similar to Smith-Waterman, but with affine gap distances
     - has done very well in evaluations
     - costly to evaluate
   - Many, many more ...
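The set-based comparators on slides 14 and 16 can be sketched together: split each string into q-grams, then compare the resulting sets with the Dice coefficient, 2 |A ∩ B| / (|A| + |B|). The helper names `qgrams` and `dice`, the default q = 2, the absence of padding, and the choice of Dice as the set comparison for q-grams are all assumptions made for illustration, not something the slides specify.

```python
def qgrams(s: str, q: int = 2) -> set:
    """Set of all substrings of length q in s (no padding characters)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def dice(a: set, b: set) -> float:
    """Dice coefficient 2|A & B| / (|A| + |B|).

    Two empty sets are treated as identical (1.0) to avoid dividing by zero;
    that convention is a choice made for this sketch.
    """
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    # Character-level comparison via 2-grams: only "el" is shared.
    print(dice(qgrams("Axel"), qgrams("Aksel")))
    # Token-level comparison: the sets are the words of each string.
    print(dice(set("lars marius garshol".split()), set("lars garshol".split())))
```

The same `dice` function serves both cases named on slide 14, since it only sees sets; whether those sets hold tokens or character pieces is decided by the caller.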