Presented by: Aneeta Kolhe


• Named Entity Recognition finds approximate matches in text.

• Important task for information extraction and integration, text mining and also for web search.


• Problem: approximate dictionary matching.

• Previous solution: token-based similarity constraints.

• Proposed solution: neighborhood generation method.


• The token-based approach uses Jaccard coefficient similarity.

• It may miss some matches.

• It may return too many matches.


For example, given the entity “al-qaida”:

• “al-qaeda” or “al-qa’ida” won’t be matched unless a low Jaccard similarity threshold of 0.33 is used.

• At that threshold, the entity will also match “al gore” as well as “al pacino”.

Hence we use edit distance.
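The 0.33 figure can be checked with a small sketch. It assumes strings are split into word tokens on spaces and hyphens; the slides do not state the tokenization, and `tokens` and `jaccard` are illustrative names.

```python
import re

def tokens(s):
    """Split a string into lowercase word tokens on spaces and hyphens (assumed tokenization)."""
    return {t for t in re.split(r"[\s\-]+", s.lower()) if t}

def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of the two token sets."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

# Each pair shares only the token "al" out of three distinct tokens.
for other in ["al-qaeda", "al gore", "al pacino"]:
    print(other, round(jaccard("al-qaida", other), 2))
```

Under this tokenization the true variant and the unrelated names all score 1/3, so no threshold can separate them.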


Problem Definition:

Given: a document D, a dictionary E of entities, and an edit distance threshold τ.

To find: all substrings in D that are within edit distance τ of one of the entities in E.

Solution:

• Iterate through all the valid substrings of the document D.

• Consider each substring as a query segment.

• Issue a similarity selection query to the dictionary to retrieve the set of entities that satisfy the constraint.
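A minimal sketch of this baseline follows. It assumes the similarity selection query is answered by a linear scan over the dictionary, which is only a stand-in for a real index; `edit_distance` and `naive_extract` are illustrative names.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def naive_extract(doc, entities, tau):
    """Return (start, segment, entity) for every substring within tau edits of an entity."""
    results = []
    min_len = max(1, min(len(e) for e in entities) - tau)
    max_len = max(len(e) for e in entities) + tau
    for start in range(len(doc)):
        for length in range(min_len, max_len + 1):
            if start + length > len(doc):
                break
            seg = doc[start:start + length]   # the query segment
            for e in entities:                # selection query as a linear scan
                if abs(len(e) - length) <= tau and edit_distance(seg, e) <= tau:
                    results.append((start, seg, e))
    return results

print(naive_extract("the al-qaeda group", ["al-qaida"], 1))
```

Every substring is compared against the dictionary, which is the cost the later slides set out to reduce.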


Partitioning: if s and s’ are within edit distance τ and s is split into kτ partitions, there is at least one partition with at most one edit error.

Select kτ = ⌈(τ + 1)/2⌉.

Example: s = [abcdefghijkl], s’ = [axxbcdefghxijkl], τ = 3, kτ = 2

s = [abcdef], [ghijkl]
s’ = [axxbcde], [fghxijkl]
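The partition count and the split in the example can be reproduced with a short sketch. Two points are assumptions not stated on the slide: (τ + 1)/2 is rounded up, so that τ errors spread over the partitions always leave one partition with at most one error, and the string is split into near-equal contiguous pieces.

```python
import math

def num_partitions(tau):
    """k_tau, taken here as ceil((tau + 1) / 2) (rounding up is an assumption)."""
    return math.ceil((tau + 1) / 2)

def partition(s, k):
    """Split s into k contiguous partitions of near-equal length (assumed splitting rule)."""
    base, extra = divmod(len(s), k)
    parts, start = [], 0
    for i in range(k):
        end = start + base + (1 if i < extra else 0)
        parts.append(s[start:end])
        start = end
    return parts

k = num_partitions(3)
print(k)                             # 2
print(partition("abcdefghijkl", k))  # ['abcdef', 'ghijkl']
```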


Example: shifting the first partition of s by 2 => [cdef]; scaling it by −1 => [cdefg].

Transformation rules:

• First partition: we only need to consider scaling within the range [−2, 2].

• Last partition: we only need to consider the combination of the same amount of shifting and scaling within the range [−τ, τ] (so that the last character is always included in the resulting substring).

• For the rest of the partitions, we need to consider shifting within the range [−τ, τ] and scaling within the range [−2, 2].


Number of partition variations:

• 1st partition: 5 variations

• Intermediate partitions: 5·(2τ + 1) variations each

• Last partition: (2τ + 1) variations

Total number of 1-variants generated = O(mτ²).
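As a quick arithmetic check, the per-partition counts can be added up for a given τ. This sketch assumes the number of partitions is ⌈(τ + 1)/2⌉ and that there are at least two partitions; `partition_variation_count` is an illustrative name.

```python
import math

def partition_variation_count(tau):
    """5 for the first partition, 5*(2*tau+1) per intermediate one, 2*tau+1 for the last."""
    k = math.ceil((tau + 1) / 2)   # assumed number of partitions, k_tau
    if k < 2:
        raise ValueError("this count assumes at least two partitions (tau >= 2)")
    shifts = 2 * tau + 1
    return 5 + (k - 2) * 5 * shifts + shifts

print(partition_variation_count(3))  # 12
```

For τ = 3 this gives 5 + 7 = 12, which agrees with the twelve variations listed for s = [abcdefghijkl] in the worked example.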


s = [abcdef], [ghijkl]
s’ = [axxbcde], [fghxijkl]

Partition variations of s, written as <variation, partition id>:

<[abcd], 1> <[abcde], 1> <[abcdef], 1> <[abcdefg], 1> <[abcdefgh], 1>

<[jkl], 2> <[ijkl], 2> <[hijkl], 2> <[ghijkl], 2> <[fghijkl], 2> <[efghijkl], 2> <[defghijkl], 2>

The second partition of the segment s’, [fghxijkl], has a 1-variant match with the partition variation [fghijkl] generated from s’s second partition.
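The list of variations above can be regenerated with a short sketch. The window convention is an assumption inferred from that list, not a definition taken from the slides: the first partition keeps its start and moves its end by −2..2, the last partition keeps its end and moves its start by −τ..τ, and an intermediate partition moves as a whole by −τ..τ while its length changes by −2..2.

```python
def partition_variations(s, k, tau):
    """Return (substring, partition_id) pairs; partition ids start at 1."""
    size = len(s) // k   # assumes len(s) is divisible by k, as in the slide example
    out = []
    for pid in range(1, k + 1):
        i, j = (pid - 1) * size, pid * size
        if pid == 1:      # first partition: start fixed, end moves
            windows = [(i, j + d) for d in range(-2, 3)]
        elif pid == k:    # last partition: end fixed, start moves
            windows = [(i + x, j) for x in range(-tau, tau + 1)]
        else:             # intermediate: whole window moves, length changes
            windows = [(i + x, j + x + d)
                       for x in range(-tau, tau + 1) for d in range(-2, 3)]
        for a, b in windows:
            if 0 <= a < b <= len(s):   # drop windows that fall outside s
                out.append((s[a:b], pid))
    return out

for variation, pid in partition_variations("abcdefghijkl", 2, 3):
    print(variation, pid)
```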


If a partition (variation) is longer than a prefix length lp, we only use its lp-prefix to generate its 1-variants.

Assume lp is set to 3. Then 1-variants are generated from only the following prefixes:

<[abc], 1> <[ghi], 2> <[hij], 2> <[fgh], 2>

By setting lp ≤ m/kτ − 2, the total number of 1-variants generated is further reduced to O(lp τ²).
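A sketch of generating 1-variants from an lp-prefix follows. The slides do not define a k-variant, so this assumes the deletion-neighborhood reading: every string obtained by deleting at most k characters. `gen_variants` and `prefix_variants` are illustrative names.

```python
from itertools import combinations

def gen_variants(s, k):
    """Assumed definition: all strings obtained from s by deleting at most k characters."""
    variants = set()
    for d in range(min(k, len(s)) + 1):
        for drop in combinations(range(len(s)), d):
            variants.add("".join(c for idx, c in enumerate(s) if idx not in drop))
    return variants

def prefix_variants(p, lp):
    """1-variants of the lp-prefix of a partition variation p."""
    return gen_variants(p[:lp], 1)

print(sorted(prefix_variants("fghijkl", 3)))  # ['fg', 'fgh', 'fh', 'gh']
```

Because only the first lp characters are used, the number of variants per partition variation depends on lp rather than on the length of the variation.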


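Indexing: short and long entities in the dictionary are indexed separately and stored in two inverted indexes, Ishort and Ilong.

• For each entity whose length is smaller than kτ·lp + τ, the τ-variant family of its lp-prefix is indexed in Ishort.

• For each entity whose length is at least kτ·lp, the lp-prefix of each partition variation is used to generate its 1-variant family, which will be indexed in Ilong.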


Algorithm: BuildIndex(E, τ, lp)
  for each e ∈ E do
    if |e| < kτ·lp + τ then
      V ← GenVariants(e[1 .. min(lp, |e|)], τ);  /* The GenVariants(s, k) function generates the k-variant family of string s */
      for each v ∈ V do
        Ishort[v] ← Ishort[v] ∪ {e};
    if |e| ≥ kτ·lp then
      P ← the set of kτ partitions of e;
      for each i-th partition p ∈ P do
        PT ← TransformPartition(p);  /* according to the three transformation rules in Section 3.1 */
        for each partition variation pT ∈ PT do
          V ← GenVariants(pT[1 .. lp], 1);
          for each v ∈ V do
            Ilong[v] ← Ilong[v] ∪ {<e, i>};
  return (Ishort, Ilong)
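The index construction can be sketched in Python under several loudly stated assumptions: a k-variant is the string with at most k characters deleted, kτ = ⌈(τ + 1)/2⌉, partitions are equal-sized with the last one absorbing any remainder, and the transformation rules are read as start/end window offsets. None of the helper names come from the slides.

```python
import math
from collections import defaultdict
from itertools import combinations

def gen_variants(s, k):
    """Assumed k-variant family: s with at most k characters deleted."""
    return {"".join(c for i, c in enumerate(s) if i not in drop)
            for d in range(min(k, len(s)) + 1)
            for drop in combinations(range(len(s)), d)}

def transform_windows(n, k, pid, tau):
    """Assumed reading of the transformation rules as (start, end) windows."""
    size = n // k
    i = (pid - 1) * size
    j = n if pid == k else pid * size      # last partition absorbs the remainder
    if pid == 1:                           # first: end moves by -2..2
        cand = [(i, j + d) for d in range(-2, 3)]
    elif pid == k:                         # last: start moves by -tau..tau
        cand = [(i + x, j) for x in range(-tau, tau + 1)]
    else:                                  # intermediate: shift and scale
        cand = [(i + x, j + x + d)
                for x in range(-tau, tau + 1) for d in range(-2, 3)]
    return [(a, b) for a, b in cand if 0 <= a < b <= n]

def build_index(entities, tau, lp):
    """Build the two inverted indexes, keyed by variant string."""
    k = math.ceil((tau + 1) / 2)
    i_short, i_long = defaultdict(set), defaultdict(set)
    for e in entities:
        if len(e) < k * lp + tau:          # short entity: tau-variants of its prefix
            for v in gen_variants(e[:min(lp, len(e))], tau):
                i_short[v].add(e)
        if len(e) >= k * lp:               # long entity: 1-variants per variation
            for pid in range(1, k + 1):
                for a, b in transform_windows(len(e), k, pid, tau):
                    for v in gen_variants(e[a:b][:lp], 1):
                        i_long[v].add((e, pid))
    return i_short, i_long

i_short, i_long = build_index(["abcdefghijkl"], tau=3, lp=3)
print(sorted(i_long["fgh"]))
```

Keying both indexes by the variant string is a design choice of this sketch; it lets a query segment's variants be looked up directly.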


Algorithm: MatchDocument(D, E, τ)
  for each starting position p ∈ [1, |D| − Lmin + τ + 1] do
    SearchLong(D[p .. p + lp − 1], E, τ);   /* matching entities no shorter than kτ·lp */
    SearchShort(D[p .. p + lp − 1], E, τ);  /* matching entities of length in [Lmin, kτ·lp) */


Algorithm: SearchLong(s, E, τ)
  R ← ∅;  /* holds results */
  C ← ∅;  /* holds candidates */
  V ← GenVariants(s, 1);  /* gen 1-variant family */
  for each v ∈ V do
    for each <e, pid> ∈ Ilong[v] do
      C ← C ∪ {<e, pid>};  /* duplicates removed */
  for each <e, pid> ∈ C do
    S ← QuerySegmentInstantiation(e, pid);  /* returns the set of query segment candidates for e */
    for each seg ∈ S do
      if Verify(seg, e) = true then
        R ← R ∪ {<seg, e>};
  return R


SearchShort(s):

• We need to generate the τ-variant families for each possible length l between Lmin − τ and lp.

• If the current query segment is shorter than lp, every candidate pair formed by probing the index needs to be verified.

• Otherwise, we need to perform verification for 2τ + 1 possible query segments.


For example, enumerate the 1-variants of the string [abcdef] from left to right.

• Suppose no variant in the index starts with abc.

• The algorithm still enumerates the other three 1-variants containing abc.

• To avoid this, introduce a parameter lpp, set to lp/2.


Consider 4 possible cases:

Prefix Match   Suffix Match   Action
True           True           enumerate all 1-variants of q[1 .. lp]
False          False          discard q as there is no match
False          True           enumerate all 1-variants of q[1 .. lpp]
True           False          enumerate all 1-variants of q[(lpp + 1) .. lp]
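The four cases can be expressed as a small decision function. This sketch reads the fourth case as prefix match true and suffix match false, assumes lpp = lp/2 rounded down, and takes the two match flags as inputs because the slides do not say how they are computed; `enumeration_target` is an illustrative name.

```python
def enumeration_target(q, lp, prefix_match, suffix_match):
    """Return the part of q whose 1-variants should be enumerated, or None to discard q."""
    lpp = lp // 2                 # assumed: lpp = lp / 2, rounded down
    if prefix_match and suffix_match:
        return q[:lp]             # enumerate all 1-variants of q[1 .. lp]
    if suffix_match:
        return q[:lpp]            # only the prefix half q[1 .. lpp]
    if prefix_match:
        return q[lpp:lp]          # only the suffix half q[(lpp + 1) .. lp]
    return None                   # neither half matches: discard q

print(enumeration_target("abcdef", 6, True, False))   # def
print(enumeration_target("abcdef", 6, False, True))   # abc
```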


Successfully reduced the size of the neighborhood

Proposed an efficient query processing algorithm

Optimized the algorithm to share computation

Avoided unnecessary variant enumeration
