W H AT ’ S I N A N A M E ? P H O N E T I C A L G O R I T H M S F O R S E A R C H A N D S I M I L A R I T Y
Mercedes Coyle @benzobot
Data Infrastructure Engineer
W H AT I ’ M G O I N G T O C O V E R T O D AY
• Search - how does it work?
• Phonetic Algorithms
• Use cases for Phonetic Algorithms
W H E N W E T H I N K O F S E A R C H …
H O W D O E S G O O G L E S E A R C H W O R K ?
• Web crawling on a very large scale!
• Document rank (importance) and similarity
• Text analysis
image credit: flickr.com/photos/rserrano/
• Obligatory hand-wavy “Big Data” comment here
image credit: twitter.com/wtrsld/status/424364245648564226
D ATA B A S E S E A R C H
image credit: Mercedes Coyle
* S Q L
• Comparison search: LIKE operator
• SELECT * FROM table WHERE word LIKE '%and%'
• basically a wildcard character search
• only returns data that contains the search string; does not account for misspelling
• can be expensive on large datasets
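The LIKE behaviour described above can be tried out with a small sketch. It uses Python's built-in sqlite3 module and an in-memory table; the table name `people` and the sample names are made up for illustration.

```python
import sqlite3

# In-memory database with a hypothetical "people" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")
conn.executemany(
    "INSERT INTO people (name) VALUES (?)",
    [("Sean",), ("Shawn",), ("Shaun",), ("Seanna",)],
)

def like_search(fragment):
    """Return every name that contains `fragment` (wildcard search)."""
    rows = conn.execute(
        "SELECT name FROM people WHERE name LIKE ? ORDER BY name",
        (f"%{fragment}%",),
    )
    return [row[0] for row in rows]

# Only names containing the literal string come back;
# the alternate spellings "Shawn" and "Shaun" are missed.
print(like_search("Sean"))
```

A pattern with a leading wildcard generally cannot use an ordinary index on the column, which is one reason this kind of search gets expensive on large tables.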
E L A S T I C S E A R C H - T O K E N I Z AT I O N
• Used in full-text search against a corpus of text
• “The quick brown fox jumped over the lazy dog”
• the, quick, brown, fox, jump, over, lazy, dog
• Wildcard searches return too many results
• Typos or misspelled names don’t return correct results
• e.g. “Shawn” vs “Sean”
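A toy analyzer makes the tokenization example concrete. This is only a sketch of the idea, not how Elasticsearch implements it: the `analyze` function and its crude "strip a trailing -ed" stemmer are assumptions chosen to reproduce the token list on the slide.

```python
import re

def analyze(text):
    """Lowercase, split into words, crudely stem, and drop repeats."""
    tokens = []
    for token in re.findall(r"[a-z]+", text.lower()):
        # Toy stemmer (an assumption): strip a trailing "ed" from longer words.
        if token.endswith("ed") and len(token) > 4:
            token = token[:-2]
        if token not in tokens:
            tokens.append(token)
    return tokens

print(analyze("The quick brown fox jumped over the lazy dog"))

# Exact token matching is why a misspelled name finds nothing:
indexed = analyze("Sean Connery")
print("shawn" in indexed)
```

Because the lookup compares whole tokens, "shawn" and "sean" are simply different terms, which is the problem the rest of the talk addresses.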
P R O B L E M : T E X T- B A S E D S E A R C H E S D O N ’ T A LW AY S W O R K W E L L W I T H N A M E S
W H AT I S A P H O N E M E ?
• In language, the smallest unit of sound that distinguishes one word from another
• Written as single letters or letter combinations; includes both vowel and consonant sounds
E N G L I S H P H O N E M E S
H O W D O W E T R A N S L AT E P H O N E M E S T O C O D E ?
image credit: demoons.com/2010/09/first-animation-test.html
P H O N E T I C A L G O R I T H M S
• A method of hashing words and names based on sounds (phonemes).
P H O N E T I C A L G O R I T H M T Y P E S
• Soundex
• NYSIIS
• Metaphone and Double Metaphone
• Match Rating, Daitch-Mokotoff Soundex, Kölner Phonetik, Caverphone…
S O U N D E X
• Designed in the early 1900s (patented in 1918) and used to encode names for the US Census
• Available in MySQL (built-in SOUNDEX() function) and PostgreSQL (fuzzystrmatch extension)
S O U N D E X A L G O R I T H M
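• Uppercase the name: Mercedes = MERCEDES
• Keep the first letter and replace the remaining letters with digits from the table: MERCEDES = M0620302
{
  0 : ['A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'],
  1 : ['B', 'F', 'P', 'V'],
  2 : ['C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z'],
  3 : ['D', 'T'],
  4 : ['L'],
  5 : ['M', 'N'],
  6 : ['R']
}
• Drop the zeros: M0620302 = M6232
• Truncate (or zero-pad) to four characters: M6232 = M623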
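The steps above can be sketched in a few lines of Python. This is a simplified Soundex that follows the slide's table. It also collapses adjacent repeated digits, a step the slide does not show but standard Soundex includes. It does not implement the special H/W handling of full American Soundex. The function name `simple_soundex` is made up for this example.

```python
# Digit groups from the slide; "0" marks letters that are dropped.
SOUNDEX_GROUPS = {
    "0": "AEIOUHWY",
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}
LETTER_TO_DIGIT = {
    letter: digit
    for digit, letters in SOUNDEX_GROUPS.items()
    for letter in letters
}

def simple_soundex(name):
    """Simplified Soundex: first letter plus three digits."""
    letters = [ch for ch in name.upper() if ch in LETTER_TO_DIGIT]
    if not letters:
        return ""
    digits = [LETTER_TO_DIGIT[ch] for ch in letters]
    # Collapse runs of the same digit (assumption: not shown on the slide).
    collapsed = [digits[0]]
    for digit in digits[1:]:
        if digit != collapsed[-1]:
            collapsed.append(digit)
    # Skip the first letter's own digit, then drop the zeros.
    tail = [digit for digit in collapsed[1:] if digit != "0"]
    # Pad with zeros and truncate to four characters.
    return (letters[0] + "".join(tail) + "000")[:4]

print(simple_soundex("Mercedes"))                      # M623, as on the slide
print(simple_soundex("Shawn"), simple_soundex("Sean"))
print(simple_soundex("Kathy"), simple_soundex("Cathy"))
```

With this sketch "Shawn" and "Sean" hash to the same code, while "Kathy" and "Cathy" do not, because the first letter is kept verbatim.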
S O U N D E X L I M I TAT I O N S
• Most implementations work for English Language only
• The first letter is always retained, so similar names that start with different letters do not match (e.g. “Kathy” vs “Cathy”)
S O U N D E X L I M I TAT I O N S
• Postgres Soundex implementation has limited character encoding support
http://www.postgresql.org/docs/9.4/static/fuzzystrmatch.html
N Y S I I S
• Developed in 1970 as part of the New York State Identification and Intelligence System
• Slightly more accurate than Soundex
N Y S I I S A L G O R I T H M
• MERCEDES
• MARCADAS
• MARCADA
• MARCAD
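The trace above can be reproduced with a deliberately partial sketch. It implements only the three transformations visible on the slide (later vowels become "A", a trailing "S" is dropped, a trailing "A" is dropped). Full NYSIIS has many more rules, such as prefix and suffix rewrites, consonant substitutions and a length limit, all omitted here. The function name `nysiis_slide_steps` is hypothetical.

```python
VOWELS = "AEIOU"

def nysiis_slide_steps(name):
    """Return the intermediate strings for the three steps on the slide.

    This is NOT a complete NYSIIS implementation.
    """
    word = "".join(ch for ch in name.upper() if ch.isalpha())
    if not word:
        return []
    steps = [word]
    # Step 1: keep the first letter, turn every later vowel into "A".
    word = word[0] + "".join("A" if ch in VOWELS else ch for ch in word[1:])
    steps.append(word)
    # Step 2: drop a trailing "S".
    if len(word) > 1 and word.endswith("S"):
        word = word[:-1]
        steps.append(word)
    # Step 3: drop a trailing "A".
    if len(word) > 1 and word.endswith("A"):
        word = word[:-1]
        steps.append(word)
    return steps

print(nysiis_slide_steps("Mercedes"))
```

For real use, a maintained library implementation is a better choice than extending this sketch rule by rule.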
M E TA P H O N E
• Developed in 1990 by Lawrence Philips
• Improved accuracy over Soundex and NYSIIS
• Double Metaphone produces two hashes (a primary and an alternate code) for each name or word
M E TA P H O N E
• Metaphone and Double Metaphone were improved upon in Metaphone 3, which is unfortunately closed source.
P H O N E T I C A L G O R I T H M S I N P R A C T I C E
• Use cases for Phonetic Algorithms
• Example uses in Databases
P H O N E T I C A L G O R I T H M S I N P R A C T I C E
• Phonetic algorithms are useful for searching by name or word, and tolerate some misspelling.
P H O N E T I C A L G O R I T H M S I N P R A C T I C E
• Store the phonetic hash of a name in fields/columns in your db for indexing and querying
{
  "_id" : ObjectId("53e13a73cbcc7a0a6e3078e5"),
  "first_name" : "Arya",
  "last_name" : "Stark",
  "n_first_name" : "AR",
  "n_last_name" : "STARC",
  "report" : "lost_item",
  "item" : "ID Card",
  "timestamp" : 1407269491,
  "report_id" : 50642
}
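The same pattern works in any datastore: compute the hash once on write, index it, and hash the search term on read. The sketch below uses Python's built-in sqlite3 rather than the document store shown above, and a simplified Soundex stands in for the NYSIIS hashes in the example record. The `n_first_name` / `n_last_name` column names come from the slide; the `reports` table, `phonetic_key`, `add_report` and `find_by_name` are made up for illustration.

```python
import sqlite3

# Simplified Soundex used as a stand-in phonetic hash (an assumption).
_CODES = str.maketrans(
    "AEIOUHWY" + "BFPV" + "CGJKQSXZ" + "DT" + "L" + "MN" + "R",
    "0" * 8 + "1" * 4 + "2" * 8 + "33" + "4" + "55" + "6",
)

def phonetic_key(name):
    letters = "".join(ch for ch in name.upper() if "A" <= ch <= "Z")
    if not letters:
        return ""
    digits = letters.translate(_CODES)
    kept, prev = [], digits[0]
    for digit in digits[1:]:
        # Skip zeros and digits that repeat the previous one.
        if digit != prev and digit != "0":
            kept.append(digit)
        prev = digit
    return (letters[0] + "".join(kept) + "000")[:4]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reports (first_name TEXT, last_name TEXT, "
    "n_first_name TEXT, n_last_name TEXT, item TEXT)"
)
# Index the hashed columns so lookups do not scan the whole table.
conn.execute("CREATE INDEX idx_phonetic ON reports (n_last_name, n_first_name)")

def add_report(first, last, item):
    conn.execute(
        "INSERT INTO reports VALUES (?, ?, ?, ?, ?)",
        (first, last, phonetic_key(first), phonetic_key(last), item),
    )

def find_by_name(first, last):
    rows = conn.execute(
        "SELECT first_name, last_name, item FROM reports "
        "WHERE n_last_name = ? AND n_first_name = ?",
        (phonetic_key(last), phonetic_key(first)),
    )
    return rows.fetchall()

add_report("Arya", "Stark", "ID Card")
add_report("Sansa", "Stark", "Needle")
print(find_by_name("Aria", "Starck"))  # misspelled query still finds Arya Stark
```

Storing the original spelling next to the hash keeps results displayable and lets you re-hash later if you switch algorithms.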
P H O N E T I C S E A R C H W I T H E L A S T I C S E A R C H
• Elasticsearch supports phonetic matching (via its phonetic analysis plugin), with encoders for many different languages!
• Store words/names as documents; the analyzer hashes them at index time and hashes the search terms at query time
GET /my_index/_analyze?analyzer=dbl_metaphone
request body: Smith Smythe
returns: the same two tokens (SM0 and XMT) for both Smith and Smythe
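For the `dbl_metaphone` analyzer in the request above to exist, the index has to define it. The sketch below builds index settings in the shape used by the Elasticsearch guide listed under Resources and prints them as JSON. It assumes the phonetic analysis plugin is installed, sends nothing to a server, and takes the names `my_index` and `dbl_metaphone` from the slide.

```python
import json

# Settings for a custom analyzer that runs tokens through Double Metaphone.
# Assumption: the Elasticsearch phonetic analysis plugin is installed.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "dbl_metaphone": {
                    "type": "phonetic",
                    "encoder": "double_metaphone",
                }
            },
            "analyzer": {
                "dbl_metaphone": {
                    "tokenizer": "standard",
                    "filter": ["dbl_metaphone"],
                }
            },
        }
    }
}

# Print the request you would send when creating the index.
print("PUT /my_index")
print(json.dumps(index_settings, indent=2))
```

Option names can differ between Elasticsearch versions, so check the exact request format against the documentation for the version you run.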
P H O N E T I C S E A R C H U S I N G E L A S T I C S E A R C H
• As a Developer, I really like using Elasticsearch!
• But as a System Administrator, I have battle scars.
P H O N E T I C A L G O R I T H M S F O R N O N - E N G L I S H L A N G U A G E S
• Grab a linguist and write one?
image credit: flickr.com/photos/opacity
R E S O U R C E S
• Libraries
• clj-fuzzy: yomguithereal.github.io/clj-fuzzy/
• python soundex: pypi.python.org/pypi/soundex/1.1.3
• python fuzzy: pypi.python.org/pypi/Fuzzy
• elasticsearch phonetic matching: https://www.elastic.co/guide/en/elasticsearch/guide/current/phonetic-matching.html
• http://aspell.net/metaphone/dmetaph.cpp
• Reading:
• http://doughellmann.com/2012/03/03/using-fuzzy-matching-to-search-by-sound-with-python.html
• Fluency, Jennifer Foehner Wells - http://www.jenniferfoehnerwells.com/fluency.html
T H A N K S F O R L I S T E N I N G !
Q U E S T I O N S ?
Mercedes Coyle @benzobot
image credit: Mercedes Coyle