Text Similarity & Clustering Qinpei Zhao 15.Feb.2011


Page 1: Text Similarity & Clustering

Text Similarity & Clustering

Qinpei Zhao, 15.Feb.2011

Page 2: Text Similarity & Clustering

Outline

• String matching metrics
• Implementation and applications
• Online resources
• Location-based clustering

Page 3: Text Similarity & Clustering

String Matching Metrics

Page 4: Text Similarity & Clustering

Exact String Matching

Given a text string T of length n and a pattern string P of length m, the exact string matching problem is to find all occurrences of P in T.

Example: T = “AGCTTGA”, P = “GCT” (P occurs at position 1 of T)

Applications:
• Searching keywords in a file
• Search engines (like Google)
• Database searching
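As a sketch of the problem statement, the brute-force approach slides a window of length m over T and compares it against P; the function name here is illustrative, not from the slides:

```python
def find_occurrences(text, pattern):
    """Naive exact matching: return every index where pattern occurs in text.

    Runs in O(n*m) time; algorithms such as Knuth-Morris-Pratt or
    Boyer-Moore improve on this worst case.
    """
    n, m = len(text), len(pattern)
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]

print(find_occurrences("AGCTTGA", "GCT"))  # [1]
```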

Page 5: Text Similarity & Clustering

Approximate String Matching

Determine whether a text string T of length n and a pattern string P of length m “partially” match.

Consider the string “approximate”. Which of these are partial matches?
aproximate, approximately, appropriate, proximate, approx, approximat, apropos, approxximate

A partial match can be thought of as one that has k differences from the string, where k is some small integer (for instance 1 or 2). A difference occurs if string1.charAt(j) != string2.charAt(j), or if string1.charAt(j) does not appear in string2 (or vice versa). The former case is known as a revise (substitution) difference; the latter is a delete or insert difference.

What about two characters that appear out of position? For instance, approximate vs. apporximate?

Page 6: Text Similarity & Clustering

Approximate String Matching

Example: misspelled queries for celebrity names (Keanu Reeves, Samuel Jackson, Schwarzenegger, …), e.g. “Schwarrzenger” for “Schwarzenegger”

Query errors:
• Limited knowledge about the data
• Typos
• Limited input device (cell phone)

Data errors:
• Typos
• Web data
• OCR

Applications:
• Spell checking
• Query relaxation
• …

Similarity functions:
• Edit distance
• Q-gram
• Cosine
• …

Page 7: Text Similarity & Clustering

Edit distance (Levenshtein distance)

Given two strings T and P, the edit distance is the minimum number of substitutions, insertions, and deletions needed to transform T into P.

Time complexity by dynamic programming: O(mn)

Page 8: Text Similarity & Clustering

Edit distance (Wagner–Fischer, 1974): T = “temp”, P = “tmp”

        t   m   p
    0   1   2   3
t   1   0   1   2
e   2   1   1   2
m   3   2   1   2
p   4   3   2   1

Dynamic programming:
m[i][j] = min{ m[i-1][j] + 1, m[i][j-1] + 1, m[i-1][j-1] + d(i,j) }
d(i,j) = 0 if T[i] = P[j], 1 otherwise
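The recurrence above translates directly into a table-filling routine; this is a minimal sketch of the standard O(mn) dynamic program:

```python
def edit_distance(s, t):
    """Levenshtein distance via dynamic programming, O(m*n) time and space."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                       # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[m][n]

print(edit_distance("temp", "tmp"))  # 1
```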

Page 9: Text Similarity & Clustering

Q-grams

Example: “bingo” decomposed into 2-grams.

Fixed length q. If ed(T, P) <= k, then:
# of common grams >= # of T grams − k × q

Page 10: Text Similarity & Clustering

Q-grams

T = “bingo”, P = “going”
gram1 = {#b, bi, in, ng, go, o#}
gram2 = {#g, go, oi, in, ng, g#}

Unique(gram1, gram2) = {#b, bi, in, ng, go, o#, #g, oi, g#}
gram1.length = (T.length + (q − 1) × 2 + 1) − q
gram2.length = (P.length + (q − 1) × 2 + 1) − q
L = gram1.length + gram2.length
Similarity = (L − |non-shared grams|) / L
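A set-based sketch of the padded q-gram similarity used in the example (gram multiplicities are ignored for simplicity; the '#' padding character is taken from the slide):

```python
def qgrams(s, q=2, pad="#"):
    """Split s into overlapping q-grams, padded with q-1 pad chars per side."""
    s = pad * (q - 1) + s + pad * (q - 1)
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def qgram_similarity(t, p, q=2):
    """(L - |non-shared grams|) / L, where L is the total gram count."""
    g1, g2 = qgrams(t, q), qgrams(p, q)
    common = len(set(g1) & set(g2))   # grams appearing in both strings
    total = len(g1) + len(g2)         # L
    # non-shared = L - 2*common, so (L - non-shared)/L = 2*common/L
    return 2 * common / total

print(qgram_similarity("bingo", "going"))  # 0.5
```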

Page 11: Text Similarity & Clustering

Cosine similarity

For two vectors A and B, the angle θ is expressed with the dot product and magnitudes:

cos θ = (A · B) / (‖A‖ ‖B‖)

Implementation: Cosine similarity = Common Terms / (sqrt(Number of terms in String1) × sqrt(Number of terms in String2))

Page 12: Text Similarity & Clustering

Cosine similarity

T = “bingo right”, P = “going right”
T1 = {bingo, right}, P1 = {going, right}

L1 = unique(T1).length; L2 = unique(P1).length
Unique(T1 & P1) = {bingo, right, going}
L3 = Unique(T1 & P1).length
Common terms = (L1 + L2) − L3

Similarity = common terms / (sqrt(L1) × sqrt(L2))
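The token-set version from the example can be sketched as follows (simple whitespace tokenization assumed):

```python
import math

def cosine_similarity(t, p):
    """Cosine over binary token vectors: common / (sqrt(L1) * sqrt(L2))."""
    a, b = set(t.split()), set(p.split())
    common = len(a & b)               # (L1 + L2) - L3 from the slide
    return common / (math.sqrt(len(a)) * math.sqrt(len(b)))

print(cosine_similarity("bingo right", "going right"))  # 0.5
```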

Page 13: Text Similarity & Clustering

Dice coefficient

Similar to cosine similarity:

Dice’s coefficient = (2 × Common Terms) / (Number of terms in String1 + Number of terms in String2)
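The same token sets as in the cosine example give a one-line Dice sketch:

```python
def dice_coefficient(t, p):
    """Dice over token sets: 2 * common / (|T| + |P|)."""
    a, b = set(t.split()), set(p.split())
    return 2 * len(a & b) / (len(a) + len(b))

print(dice_coefficient("bingo right", "going right"))  # 0.5
```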

Page 14: Text Similarity & Clustering
Page 15: Text Similarity & Clustering

Implementation & Applications

Page 16: Text Similarity & Clustering

Similarity metrics

• Edit distance
• Q-gram
• Cosine distance
• Dice coefficient
• …

Similarity between two strings: Demo

Page 17: Text Similarity & Clustering

Compared strings                                                    Edit dist.  Q-grams q=2  Q-grams q=3  Q-grams q=4  Cosine
Pizza Express Café / Pizza Express                                  72%         78.79%       74.29%       70.27%       81.65%
Lounasravintola Pinja Ky – Ravintoloita / Lounasravintola Pinja     54%         67.74%       67.19%       65.15%       63.25%
Kioski Piirakkapaja / Kioski Marttakahvio                           47%         45.00%       33.33%       31.82%       50.00%
Kauppa Kulta Keidas / Kauppa Kulta Nalle                            68%         66.67%       63.41%       60.47%       66.67%
Ravintola Beer Stop Pub / Baari, Beer Stop R-kylä                   39%         41.67%       36.00%       30.77%       50.00%
Ravintola Beer Stop Pub / Baari, Wanha Mestari R-kylä               19%         7.69%        0.00%        0.00%        0.00%
Ravintola Foxie s Bar Siirry hakukenttään / Baari, Foxie Karsikko   31%         25.00%       15.15%       11.76%       23.57%
Play baari / Ravintola Bar Play – Ravintoloita                      21%         31.11%       17.02%       8.16%        31.62%

Page 18: Text Similarity & Clustering

Applications in MOPSI

• Duplicate-record cleaning
• Spell checking (e.g. “communication” vs. “comunication”)
• Query relevance/expansion
• Text-level annotation recommendation
• Keyword clustering
• MOPSI search engine

Page 19: Text Similarity & Clustering

Annotation recommendation

500 ms

Page 20: Text Similarity & Clustering

String clustering

• The similarity between every string pair is calculated as a basis for determining the clusters
• Using the vector model for clustering, a similarity measure is required to calculate the similarity between two strings

Page 21: Text Similarity & Clustering

String clustering (Cont.)

The final step in creating clusters is to determine when two objects (words) are in the same cluster.

• Hierarchical agglomerative clustering (HAC): start with un-clustered items and perform pair-wise similarity measures to determine the clusters
• Hierarchical divisive clustering: start with one cluster and break it down into smaller clusters
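A minimal single-linkage HAC sketch over strings: the standard library's SequenceMatcher stands in for any of the similarity functions above, and the merge threshold is an assumed parameter, not from the slides:

```python
from difflib import SequenceMatcher

def hac_strings(strings, threshold=0.6):
    """Single-linkage agglomerative clustering: repeatedly merge any two
    clusters containing a cross-pair at least `threshold` similar."""
    clusters = [[s] for s in strings]   # start with un-clustered items
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(SequenceMatcher(None, a, b).ratio() >= threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break                   # restart scan after each merge
    return clusters
```

With strings like the MOPSI service names above, near-duplicates such as “pizza express” and “pizza express cafe” end up in one cluster while unrelated names stay separate.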

Page 22: Text Similarity & Clustering

Objectives of Hierarchy of Clusters

• Reduce the overhead of search: perform top-down searches of the centroids of the clusters in the hierarchy and trim the branches that are not relevant
• Provide a visual representation of the information space: visual cues on the size of clusters (size of ellipse) and the strength of the linkage between clusters (dashed line, solid line, …)
• Expand the retrieval of relevant items: a user, once having identified an item of interest, can request to see other items in the cluster; the user can increase the specificity of items by going to child clusters, or increase the generality by going to a parent cluster

Page 23: Text Similarity & Clustering

Keyword clustering (semantic)

Thesaurus-based: WordNet, with an advanced web interface to browse the WordNet database.

Thesauri are not available for every language, e.g. Finnish.

Page 24: Text Similarity & Clustering

Resources

Page 25: Text Similarity & Clustering

Useful resources

Similarity metrics (http://staffwww.dcs.shef.ac.uk/people/S.Chapman/stringmetrics.html )

Similarity metrics (javascript) (http://cs.joensuu.fi/~zhao/Link/ )

Flamingo package (http://flamingo.ics.uci.edu/releases/4.0/ )

WordNet (http://wordnet.princeton.edu/wordnet/related-projects/ )

Page 26: Text Similarity & Clustering

Location-based clustering

Page 27: Text Similarity & Clustering

DBSCAN: density-based clustering (KDD’96)

Parameters:
• MinPts
• eps

Time complexity:
• O(log n) per getNeighbours query (with a spatial index)
• O(n log n) total

Advantages:
• No restriction on cluster shape
• Noise is handled explicitly
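A minimal DBSCAN sketch in plain Python, using the slide's eps and MinPts parameters (Euclidean distance, and a linear neighbour scan rather than the O(log n) indexed query the complexity figures assume):

```python
def dbscan(points, eps, min_pts):
    """Label each (x, y) point with a cluster id, or -1 for noise."""
    def neighbours(i):
        px, py = points[i]
        return [j for j, (qx, qy) in enumerate(points)
                if (px - qx) ** 2 + (py - qy) ** 2 <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1            # noise (may become a border point later)
            continue
        cluster += 1                  # i is a core point: start a new cluster
        labels[i] = cluster
        seeds = list(nbrs)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster   # former noise: border point, no expansion
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbours(j)
            if len(jn) >= min_pts:    # j is also core: keep growing the cluster
                seeds.extend(jn)
    return labels
```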

Page 28: Text Similarity & Clustering

DBSCAN result

Joensuu: 29.76, 62.60; Helsinki: 24, 60

Page 29: Text Similarity & Clustering
Page 30: Text Similarity & Clustering

Gaussian Mixture Model

Maximum likelihood estimation (via the Expectation-Maximization algorithm)

Parameters required:
• Number of components
• Iteration number

Advantages:
• Probabilistic (fuzzy) interpretation
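A 1-D EM sketch for a Gaussian mixture, to show the two parameters named above in action; the quantile-based initialization and the variance floor are assumptions for this sketch, not from the slides:

```python
import math

def gmm_em_1d(data, k=2, iters=50):
    """Fit a 1-D Gaussian mixture by EM; returns (weights, means, variances)."""
    srt = sorted(data)
    # Simplistic init: spread the k means over the data's quantiles.
    means = ([srt[i * (len(srt) - 1) // (k - 1)] for i in range(k)]
             if k > 1 else [srt[len(srt) // 2]])
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                 for w, m, v in zip(weights, means, variances)]
            s = sum(p)
            resp.append([pi / s for pi in p])
        # M-step: re-estimate weights, means, variances from responsibilities.
        for c in range(k):
            nc = sum(r[c] for r in resp)
            weights[c] = nc / len(data)
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / nc
            variances[c] = max(sum(r[c] * (x - means[c]) ** 2
                                   for r, x in zip(resp, data)) / nc, 1e-6)
    return weights, means, variances
```

On two well-separated 1-D blobs, the two estimated means converge to the blob centres and the responsibilities give the fuzzy memberships mentioned above.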

Page 31: Text Similarity & Clustering

GMMs — Joensuu: 29.76, 62.60; Helsinki: 24, 60

Page 32: Text Similarity & Clustering

GMMs — Joensuu: 29.76, 62.60; Helsinki: 24, 60

Page 33: Text Similarity & Clustering
Page 35: Text Similarity & Clustering

thanks!