Page 1: Fuzzy Hash Map

Efficient Fuzzy Search Enabled Hash Map

4th International Workshop On Soft Computing Applications SOFA2010 – Arad, ROMANIA

Vasile Topac, PhD Student

Department of Information Technology and Computer Science, “Politehnica” University of Timisoara

Email: [email protected]

Page 2: Fuzzy Hash Map

How it all started

SOFA2010 – Arad, ROMANIA - Efficient Fuzzy Search Enabled Hash Map - Vasile Topac

Page 3: Fuzzy Hash Map

Java HashMap

- widely used Java data structure

- stores (key, value) pairs

- search by key

- very fast

- a hash function generates a hash code for indexing

- uses the equals method to compare keys

- only values for existing keys can be retrieved

Page 4: Fuzzy Hash Map

Java HashMap

phone book example

Page 5: Fuzzy Hash Map

Java HashMap

Collision

Page 6: Fuzzy Hash Map

Java HashMap

Search for “Lisa Smith”

hashMap.get(“Lisa Smith”);

Result: “521-8976”

Page 7: Fuzzy Hash Map

Problem

- only values for existing keys can be retrieved

Search for “Lissa Smith”

hashMap.get(“Lissa Smith”);

Result: null
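The two lookups above can be reproduced with a minimal sketch. The class name `PhoneBookDemo` and the helper `phoneBook()` are assumptions made for illustration; only the “Lisa Smith” entry and its number come from the slides.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: PhoneBookDemo and phoneBook() are hypothetical names.
class PhoneBookDemo {
    // Builds a phone book holding the single entry shown on the slides.
    static Map<String, String> phoneBook() {
        Map<String, String> book = new HashMap<>();
        book.put("Lisa Smith", "521-8976");
        return book;
    }

    public static void main(String[] args) {
        Map<String, String> book = phoneBook();
        // Exact key: hashCode() finds the bucket, equals() confirms the key.
        System.out.println(book.get("Lisa Smith"));
        // Misspelled key: different hash code, no equal key, so get() returns null.
        System.out.println(book.get("Lissa Smith"));
    }
}
```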

Page 8: Fuzzy Hash Map

Problem

Brute force solution:

- iterate through the set of entries and search for approximate matches

- e.g. search for “Lissa Smith”

Works, but is time expensive.

Fuzzy data structures: currently available for databases.
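A rough sketch of the brute force approach, assuming Levenshtein distance as the approximate-match test (the slide does not say which measure the brute force variant would use). `BruteForceFuzzySearch` and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative only: class and method names are hypothetical.
class BruteForceFuzzySearch {
    // Levenshtein distance via dynamic programming, keeping two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Visits every entry, so each lookup costs one distance
    // computation per stored key.
    static List<String> search(Map<String, String> map, String query, int maxDistance) {
        List<String> values = new ArrayList<>();
        for (Map.Entry<String, String> entry : map.entrySet()) {
            if (levenshtein(entry.getKey(), query) <= maxDistance) {
                values.add(entry.getValue());
            }
        }
        return values;
    }
}
```

With a map containing “Lisa Smith”, a call such as `search(map, "Lissa Smith", 1)` finds the entry, but only after comparing the query against every key, which is the cost the slide refers to.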

Page 9: Fuzzy Hash Map

Fuzzy Hash Map

“ Soft computing (SC) is a collection of methodologies that are trying to cope with the main disadvantage of the conventional (hard) computing: the poor performances when working in uncertain conditions. ”

Page 10: Fuzzy Hash Map

Fuzzy Hash Map

UML Class Diagram

Page 11: Fuzzy Hash Map

Fuzzy Hash Map

FuzzyKey overridden methods

- hashCode(): pre-hashing, creates collisions to cluster data

  - substring: substring(“Fuzzy Search”, 0, 4) = “Fuzz”

  - soundex: soundex(“Fuzzy Search”) = F226

- equals(Object o): string metrics

  - Levenshtein Distance: LD(computing, computation) = 4

  - Hamming Distance: HD(computing, computers) = 3
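The substring pre-hash and the two string metrics named above can be sketched as plain static helpers. `StringMetrics` is a hypothetical class for illustration and is not taken from the FuzzyHashMap sources.

```java
// Illustrative only: StringMetrics is a hypothetical helper class.
class StringMetrics {
    // Substring pre-hash: keys sharing their first n characters
    // deliberately collide, which clusters similar keys together.
    static String prefix(String s, int n) {
        return s.substring(0, Math.min(n, s.length()));
    }

    // Hamming distance: number of differing positions.
    // Defined only for strings of equal length.
    static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("strings must have equal length");
        }
        int distance = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) distance++;
        }
        return distance;
    }

    // Levenshtein distance: minimum number of single-character
    // insertions, deletions and substitutions.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(prefix("Fuzzy Search", 4));              // Fuzz
        System.out.println(levenshtein("computing", "computation")); // 4
        System.out.println(hamming("computing", "computers"));       // 3
    }
}
```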

How it works

Page 12: Fuzzy Hash Map

Fuzzy Hash Map

Example (law terminology dictionary)

- hashCode(): pre-hashing

  - substring 4

- equals(Object o): Levenshtein Distance

SUBSTRING (0, 4)

[Diagram: the pre-hashing function maps the keys action, adjudication, evidence, violence and violation to the strings acti, adju, evid and viol; “violence” and “violation” share “viol”. The hash function then maps each string to a bucket (12, 13, 14, 215 in the figure), where each key is stored with its definition: “A civil judicial proceeding ...”, “A decision or sentence imposed by a judge...”, “The expression of physical or verbal ...”, “An offense for which the only sentence ...”, “Testimony, documents or objects ...”.]

Page 13: Fuzzy Hash Map

Fuzzy Hash Map

“the judge has the option of either adjudicating you as guilty or..”

fuzzyHashMap.get(“adjudicating”) = null

fuzzyHashMap.getFuzzy(“adjudicating”, 2) = “a decision or sentence imposed by a judge…”

- hashCode(): substring 4 = “adju”

- equals(Object o): LD(adjudicating, adjudication) = 2
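The lookup above can be approximated with a small sketch, assuming a map of buckets keyed by the 4-character prefix. `PrefixFuzzyMap` and its internals are hypothetical and are not the FuzzyHashMap library's classes; only the method names `get` and `getFuzzy(key, threshold)` mirror the calls shown on the slide.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the FuzzyHashMap library's implementation.
class PrefixFuzzyMap {
    private static final int PREFIX_LENGTH = 4;

    // pre-hash (first 4 characters) -> {key, value} pairs sharing it
    private final Map<String, List<String[]>> buckets = new HashMap<>();

    static String preHash(String key) {
        return key.substring(0, Math.min(PREFIX_LENGTH, key.length()));
    }

    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    void put(String key, String value) {
        buckets.computeIfAbsent(preHash(key), k -> new ArrayList<>())
               .add(new String[] {key, value});
    }

    // Exact lookup: the stored key must equal the query.
    String get(String key) {
        for (String[] entry : bucketFor(key)) {
            if (entry[0].equals(key)) return entry[1];
        }
        return null;
    }

    // Fuzzy lookup: only the query's bucket is scanned, and the first
    // key within the distance threshold wins.
    String getFuzzy(String key, int threshold) {
        for (String[] entry : bucketFor(key)) {
            if (levenshtein(entry[0], key) <= threshold) return entry[1];
        }
        return null;
    }

    private List<String[]> bucketFor(String key) {
        return buckets.getOrDefault(preHash(key), Collections.emptyList());
    }
}
```

The sketch keeps the fuzzy comparison in a separate method instead of overriding equals, because a threshold-based comparison is not transitive. A query misspelled within its first four characters lands in a different bucket and is not found by this variant.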

Page 14: Fuzzy Hash Map

Fuzzy Hash Map

fuzzyHashMap.getFuzzy(“violent”) = “violence”

SUBSTRING (0, 4)

[Diagram: the same pre-hashing figure as on page 12. The query “violent” pre-hashes to “viol”, the bucket that holds “violence” and “violation”.]

LD(violent, violence) = 2

LD(violent, violation) = 5

“violence” is returned
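Choosing between the two candidates in the “viol” bucket can be sketched as picking the key with the smallest Levenshtein distance. `ClosestMatch` and `closest` are hypothetical names, and how the library breaks ties is not stated on the slides; this sketch simply keeps the first of equally close candidates.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: ClosestMatch and closest() are hypothetical names.
class ClosestMatch {
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the candidate closest to the query, or null for an empty bucket.
    // On a tie, the candidate seen first is kept (an assumption).
    static String closest(List<String> bucket, String query) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : bucket) {
            int distance = levenshtein(candidate, query);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> violBucket = Arrays.asList("violence", "violation");
        System.out.println(closest(violBucket, "violent")); // violence
    }
}
```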

Page 15: Fuzzy Hash Map

Fuzzy Hash Map

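SOUNDEX

[Diagram: the pre-hashing function maps the keys Mary, Paul, Scott, Jhon and John to the codes M600, P400, S300 and J500; “Jhon” and “John” share J500. The hash function then maps each code to a bucket (12, 13, 14, 215 in the figure), where each key is stored with its phone number: 312050505, 732124789, 025465892, 361475236, 712696969.]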

Example (phone book)

- hashCode(): pre-hashing

  - soundex

- equals(Object o): Levenshtein Distance
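A compact sketch of the Soundex pre-hash, following the common American Soundex rules (first letter kept, consonants mapped to digits, adjacent duplicates collapsed, h and w not breaking a run, result padded to four characters). The class below is hypothetical; the slides do not show which Soundex variant the library uses.

```java
// Illustrative only: a hypothetical Soundex helper, assuming the
// common American Soundex rules.
class Soundex {
    // Digit for each letter A..Z; '0' marks letters without a digit
    // (vowels, H, W, Y).
    private static final String CODES = "01230120022455012623010202";

    static String encode(String name) {
        StringBuilder out = new StringBuilder();
        char last = '0';
        for (char raw : name.toCharArray()) {
            char c = Character.toUpperCase(raw);
            if (c < 'A' || c > 'Z') continue;           // skip non-letters
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);                           // first letter kept as-is
                last = code;
            } else {
                if (code != '0' && code != last) out.append(code);
                if (c != 'H' && c != 'W') last = code;   // H and W do not break a run
            }
            if (out.length() == 4) break;
        }
        while (out.length() < 4) out.append('0');        // pad short codes
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("Jhon")); // J500
        System.out.println(encode("John")); // J500
    }
}
```

Because “Jhon” and “John” receive the same code, they fall into the same bucket, and the Levenshtein comparison then only has to consider the keys in that bucket.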

Page 16: Fuzzy Hash Map

Results

Accuracy Test

Test conditions

- Substring(0,4) hashing function

- Levenshtein Distance fuzzy matching algorithm

- Distance threshold value 2

- medical terminology dictionary populated with 1030 English medical terms

Test results

- Parse text from American Family Physicians Journal

  - text of 568 words

  - 43 words identified as medical terms

  - 9 were incorrect matches

  - 80% accuracy

- Parse text from eMedicine web site

  - text of 2730 words

  - 260 were recognized

  - 7 were incorrect matches

  - 97% accuracy
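The accuracy figures can be checked with a short calculation, assuming accuracy means the share of identified terms that were matched correctly; the slide does not state the formula. `AccuracyCheck` is a hypothetical name.

```java
// Illustrative only: assumes accuracy = (identified - incorrect) / identified.
class AccuracyCheck {
    static double accuracyPercent(int identified, int incorrect) {
        return 100.0 * (identified - incorrect) / identified;
    }

    public static void main(String[] args) {
        // 43 identified, 9 incorrect: about 79.1%, reported as 80% on the slide.
        System.out.printf("%.1f%%%n", accuracyPercent(43, 9));
        // 260 recognized, 7 incorrect: about 97.3%, reported as 97%.
        System.out.printf("%.1f%%%n", accuracyPercent(260, 7));
    }
}
```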

Page 17: Fuzzy Hash Map

Results

Speed Test

- Exact matches only

[Chart: HashMap vs. FuzzyHashMap. Axis ticks run from 0 to 1000 and from 0 to 6000; data labels in the figure: 4, 5, 5, 6 and 5419, 4013, 2300, 1.]

Page 18: Fuzzy Hash Map

Results

Speed Test

- Fuzzy matches only

[Chart: HashMap vs. FuzzyHashMap. Axis ticks run from 0 to 1000 and from 0 to 7000; data labels in the figure: 4, 5, 6, 7 and 5419, 5739, 5711, 5401.]

Page 19: Fuzzy Hash Map

Results

Speed Test

- Exact & fuzzy matches

[Chart: HashMap vs. FuzzyHashMap. Axis ticks run from 0 to 1000 and from 0 to 6000; data labels in the figure: 4, 5, 5, 6 and 5419, 4744, 4135, 3143.]

Page 20: Fuzzy Hash Map

Conclusion

- The FuzzyHashMap data structure proved to perform very well when working with uncertain data

- Flexible (can choose different pre-hashing functions and string metrics)

- available as open source http://fuzzyhashmap.sourceforge.net/

- community can extend the functionality

- Future work:

  - adding more string metrics

  - improving performance

  - implementing a Fuzzy TreeMap

Page 21: Fuzzy Hash Map

Thank you!

Sources at: http://fuzzyhashmap.sourceforge.net

[email protected]

