Efficient Fuzzy Search Enabled Hash Map
4th International Workshop On Soft Computing Applications SOFA2010 – Arad, ROMANIA
Vasile Topac, PhD Student
Department of Information Technology and Computer Science, “Politehnica” University of Timisoara
Email: [email protected]
How it all started
SOFA2010 – Arad, ROMANIA - Efficient Fuzzy Search Enabled Hash Map - Vasile Topac
Java HashMap
- widely used Java data structure
- stores (key, value) pairs
- search by key
- very fast
- a hash function generates a hash code used for indexing
- uses the equals method to compare keys
- only values for existing keys can be retrieved
Java HashMap
phone book example
Java HashMap
Collision
Java HashMap
Search for “Lisa Smith”
hashMap.get("Lisa Smith");
Result: "521-8976"
Problem
- only values for existing keys can be retrieved
Search for “Lissa Smith”
hashMap.get("Lissa Smith");
Result: null
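The two lookups above can be reproduced with a plain java.util.HashMap. This is a minimal sketch; the class name PhoneBookLookup and the helper phoneBook() are illustrative, and only the "Lisa Smith" entry comes from the slides.

```java
import java.util.HashMap;
import java.util.Map;

final class PhoneBookLookup {
    // Builds a one-entry phone book using the entry shown on the slide.
    static Map<String, String> phoneBook() {
        Map<String, String> book = new HashMap<>();
        book.put("Lisa Smith", "521-8976");
        return book;
    }

    public static void main(String[] args) {
        Map<String, String> book = phoneBook();
        // Exact key: the stored value is found.
        System.out.println(book.get("Lisa Smith"));
        // Misspelled key: hashCode() and equals() both differ, so get() returns null.
        System.out.println(book.get("Lissa Smith"));
    }
}
```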
Problem
Brute-force solution:
- iterate through the set of entries and search for approximate matches
- works, but is expensive in time
Fuzzy data structures: currently available for databases
- search for “Lissa Smith”
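The brute-force approach could look roughly like the sketch below: every key in the map is compared with the query, so each lookup costs one distance computation per entry. The class and method names are hypothetical, and Levenshtein distance is assumed as the approximate-match criterion.

```java
import java.util.HashMap;
import java.util.Map;

final class BruteForceFuzzyLookup {
    // Levenshtein distance computed with two rolling rows of the DP table.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Scans all entries and returns the value of the nearest key within the
    // threshold, or null if no key is close enough.
    static <V> V bruteForceGet(Map<String, V> map, String query, int threshold) {
        V best = null;
        int bestDistance = threshold + 1;
        for (Map.Entry<String, V> e : map.entrySet()) {
            int d = levenshtein(query, e.getKey());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> book = new HashMap<>();
        book.put("Lisa Smith", "521-8976");
        System.out.println(bruteForceGet(book, "Lissa Smith", 2));
    }
}
```

The full scan is what makes this approach slow on large maps: nothing narrows the set of candidates before the distance is computed.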
Fuzzy Hash Map
“ Soft computing (SC) is a collection of methodologies that are trying to cope with the main disadvantage of the conventional (hard) computing: the poor performances when working in uncertain conditions. ”
Fuzzy Hash Map
UML Class Diagram
Fuzzy Hash Map
FuzzyKey overridden methods
- hashCode(): pre-hashing, creates collisions on purpose to cluster similar keys
  - substring: substring("Fuzzy Search", 0, 4) = "Fuzz"
  - soundex: soundex("Fuzzy Search") = F226
- equals(Object o): string metrics
  - Levenshtein distance: LD(computing, computation) = 4
  - Hamming distance: HD(computing, computers) = 3
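The substring pre-hash and the two string metrics named above can be checked with a short sketch. The class and method names are illustrative and are not taken from the FuzzyHashMap sources.

```java
final class StringMetricsDemo {
    // Prefix pre-hash: keys sharing the first `length` characters collide on purpose.
    static String prefixPreHash(String key, int length) {
        return key.substring(0, Math.min(length, key.length()));
    }

    // Hamming distance: number of differing positions in two equal-length strings.
    static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Hamming distance needs equal-length strings");
        }
        int d = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) d++;
        }
        return d;
    }

    // Levenshtein distance: minimum number of single-character insertions,
    // deletions and substitutions turning a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(prefixPreHash("Fuzzy Search", 4));       // Fuzz
        System.out.println(levenshtein("computing", "computation")); // 4
        System.out.println(hamming("computing", "computers"));       // 3
    }
}
```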
How it works
Fuzzy Hash Map
Example (law terminology dictionary)
- hashCode(): pre-hashing with substring (first 4 characters)
- equals(Object o): Levenshtein distance
[Figure: pre-hashing with SUBSTRING (0, 4). The keys action, adjudication, evidence, violence and violation pass through the pre-hashing function, giving acti, adju, evid and viol; the hash function then maps these to buckets (indices shown: 12, 13, 14, 215), so violence and violation share the "viol" bucket. Each bucket entry stores the key and its definition, e.g. adjudication: "A decision or sentence imposed by a judge...". Other definitions shown: "A civil judicial proceeding ...", "The expression of physical or verbal ...", "An offense for which the only sentence ...", "Testimony, documents or objects ...".]
Fuzzy Hash Map
“the judge has the option of either adjudicating you as guilty or..”
fuzzyHashMap.get("adjudicating") = null
fuzzyHashMap.getFuzzy("adjudicating", 2) = "a decision or sentence imposed by a judge…"
- hashCode(): substring 4 = "adju"
- equals(Object o): LD(adjudicating, adjudication) = 2
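One way to get this behaviour from a standard java.util.HashMap is a key class that overrides hashCode() and equals() as described. The sketch below is an assumption about how such a key might look, not the code of the FuzzyHashMap project: the per-key threshold field is hypothetical, and the exact and fuzzy lookups are modelled as thresholds 0 and 2.

```java
import java.util.HashMap;
import java.util.Map;

final class FuzzyKeyDemo {
    static final class FuzzyKey {
        final String text;
        final int threshold; // hypothetical: max Levenshtein distance accepted by equals()

        FuzzyKey(String text, int threshold) {
            this.text = text;
            this.threshold = threshold;
        }

        @Override
        public int hashCode() {
            // Pre-hash: only the first 4 characters contribute, so similar words collide.
            return text.substring(0, Math.min(4, text.length())).hashCode();
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof FuzzyKey)) return false;
            return levenshtein(text, ((FuzzyKey) o).text) <= threshold;
        }
    }

    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Dictionary with the single entry confirmed by the slide.
    static Map<FuzzyKey, String> dictionary() {
        Map<FuzzyKey, String> map = new HashMap<>();
        map.put(new FuzzyKey("adjudication", 0), "A decision or sentence imposed by a judge...");
        return map;
    }

    public static void main(String[] args) {
        Map<FuzzyKey, String> map = dictionary();
        System.out.println(map.get(new FuzzyKey("adjudicating", 0))); // exact lookup misses
        System.out.println(map.get(new FuzzyKey("adjudicating", 2))); // fuzzy lookup hits
    }
}
```

Note that a distance-based equals() is not transitive, so it departs from the general contract of Object.equals; a key like this is only suitable for lookups of this specific kind.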
Fuzzy Hash Map
fuzzyHashMap.getFuzzy("violent") = "violence"
[Figure: the SUBSTRING (0, 4) pre-hashing diagram for the law terminology dictionary. The query "violent" pre-hashes to "viol", the bucket that holds violence and violation.]
LD(violent, violence) = 2
LD(violent, violation) = 5
“violence” is returned
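Choosing between several candidates in one bucket can be sketched as follows: only keys sharing the query's 4-character prefix are compared, and the one with the smallest Levenshtein distance wins. The class ClosestMatchDemo and its methods are hypothetical and only illustrate the selection step.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ClosestMatchDemo {
    // Bucket index: 4-character prefix -> keys stored under that prefix.
    private final Map<String, List<String>> buckets = new HashMap<>();

    private static String prefix(String key) {
        return key.substring(0, Math.min(4, key.length()));
    }

    void add(String key) {
        buckets.computeIfAbsent(prefix(key), p -> new ArrayList<>()).add(key);
    }

    // Returns the stored key nearest to the query within its bucket,
    // or null when no stored key shares the query's prefix.
    String closest(String query) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        List<String> candidates = buckets.getOrDefault(prefix(query), Collections.emptyList());
        for (String candidate : candidates) {
            int d = levenshtein(query, candidate);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }

    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Index filled with the law terms from the slide.
    static ClosestMatchDemo lawTerms() {
        ClosestMatchDemo index = new ClosestMatchDemo();
        for (String term : new String[] {"action", "adjudication", "evidence", "violence", "violation"}) {
            index.add(term);
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(lawTerms().closest("violent")); // violence
    }
}
```

Because only the "viol" bucket is scanned, two distance computations are needed here instead of five.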
Fuzzy Hash Map
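[Figure: pre-hashing with SOUNDEX. The keys Mary, Paul, Scott, Jhon and John pass through the pre-hashing function, giving M600, P400, S300 and J500; the hash function then maps these to buckets (indices shown: 12, 13, 14, 215), so Jhon and John share the J500 bucket. Each bucket entry stores the name and a phone number (values shown: 312050505, 732124789, 025465892, 361475236, 712696969).]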
Example (phone book)
- hashCode(): pre-hashing with soundex
- equals(Object o): Levenshtein distance
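A Soundex pre-hash can be sketched as below. The slides do not say which Soundex implementation was used; this version assumes the common American Soundex rules (H and W do not separate equal codes, vowels do), and the class name is illustrative.

```java
import java.util.Locale;

final class SoundexPreHash {
    // Digit for each letter A..Z; '0' marks letters that are not coded.
    private static final String CODES = "01230120022455012623010202";

    static String soundex(String s) {
        StringBuilder out = new StringBuilder();
        char last = '0';
        for (char raw : s.toUpperCase(Locale.ROOT).toCharArray()) {
            if (raw < 'A' || raw > 'Z') continue;       // ignore spaces and punctuation
            char code = CODES.charAt(raw - 'A');
            if (out.length() == 0) {                    // first letter is kept as-is
                out.append(raw);
                last = code;
                continue;
            }
            if (raw == 'H' || raw == 'W') continue;     // H and W do not reset the last code
            if (code != '0' && code != last) out.append(code);
            last = code;                                // vowels reset the last code to '0'
            if (out.length() == 4) break;
        }
        while (out.length() < 4) out.append('0');       // pad to letter + three digits
        return out.toString();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"Mary", "Paul", "Scott", "Jhon", "John"}) {
            System.out.println(name + " -> " + soundex(name));
        }
    }
}
```

With this pre-hash the misspelling "Jhon" and the name "John" land in the same bucket (J500), after which the Levenshtein distance decides whether they match.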
Results
Accuracy Test
Test conditions
- Substring(0,4) hashing function
- Levenshtein Distance fuzzy matching algorithm
- distance threshold value 2
- medical terminology dictionary populated with 1030 English medical terms
Test results
- Parse text from American Family Physicians Journal
  - text of 568 words
  - 43 words identified as medical terms
  - 9 were incorrect matches
  - 80% accuracy
- Parse text from eMedicine web site
  - text of 2730 words
  - 260 were recognized
  - 7 were incorrect matches
  - 97% accuracy
Results
Speed Test
- Exact matches only
[Chart: HashMap vs. FuzzyHashMap; horizontal axis 0 to 1000, vertical axis 0 to 6000 (units not given in the source). Data labels recovered from the chart: 4, 5, 5, 6 and 5419, 4013, 2300, 1.]
Results
Speed Test
- Fuzzy matches only
[Chart: HashMap vs. FuzzyHashMap; horizontal axis 0 to 1000, vertical axis 0 to 7000 (units not given in the source). Data labels recovered from the chart: 4, 5, 6, 7 and 5419, 5739, 5711, 5401.]
Results
Speed Test
- Exact & fuzzy matches
[Chart: HashMap vs. FuzzyHashMap; horizontal axis 0 to 1000, vertical axis 0 to 6000 (units not given in the source). Data labels recovered from the chart: 4, 5, 5, 6 and 5419, 4744, 4135, 3143.]
Conclusion
- FuzzyHashMap performed well when working with uncertain data
- Flexible (can choose different pre-hashing functions and string metrics)
- available as open source http://fuzzyhashmap.sourceforge.net/
- community can extend the functionality
- Future work:
  - adding more string metrics
  - improving performance
  - implementing a Fuzzy TreeMap
Thank you!
Sources at: http://fuzzyhashmap.sourceforge.net