INFORMATION EXTRACTION SNITA SARAWAGI

INFORMATION EXTRACTION SNITA SARAWAGI. Management of Information Extraction System Performance Optimization Handling Change Integration of Extracted Information



Management of Information Extraction System

• Performance Optimization

• Handling Change

• Integration of Extracted Information

• Imprecision of Extraction


Performance Optimization

• Two modes of an extraction system

1. The unstructured source is naturally available

2. The unstructured source is open-ended and large


Document Selection Strategies

• When the source is really large…

  • Manually restricting the set

  • Focused crawling

  • Searching via keywords


Document Selection Strategies

• Index Search Techniques

  1. Standard IR-style keyword queries, e.g. “vaccine” and “cure”

  2. Pattern queries, e.g. “Thomas \w+ Edison”, preferred over “Thomas NEAR Edison”

• Index Design for Efficient Extraction

  • The index should support proximity queries, regular expression patterns, and storage of tags

  • Cafarella and Etzioni [38]
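The difference between the two query styles can be sketched with Python's `re` module. The sample sentences, the `near` helper, and its window of five tokens are assumptions made up for illustration; they are not taken from the slides or from any particular index implementation.

```python
import re

# Hypothetical sentences, used only for illustration.
docs = [
    "Thomas Alva Edison patented the phonograph.",
    "Thomas Jefferson never met Mr. Edison.",
    "Edison hired a clerk named Thomas.",
]

# Pattern query: exactly one word between "Thomas" and "Edison".
pattern = re.compile(r"\bThomas \w+ Edison\b")

def near(text, a, b, window=5):
    """Rough stand-in for a NEAR query: a and b within `window` tokens."""
    tokens = re.findall(r"\w+", text)
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)

pattern_hits = [d for d in docs if pattern.search(d)]
near_hits = [d for d in docs if near(d, "Thomas", "Edison")]
```

On these sentences the pattern query keeps only the first one, while the proximity stand-in accepts all three, which is why a pattern query can pass far fewer documents to the extractor.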


Efficient Querying of Entity Database for Extraction

• Similarity

  • Can be an expensive operation

  • E.g. extracting book titles from blogs

• Batch-Top-K search

  • Goal: to find each possible segment in x whose similarity to an entry in D (the database of entities) is greater than a threshold ε

  • Concentrated on the TF-IDF similarity score

• Chandel et al. [49]
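The goal stated above can be written down as a brute-force baseline. This is a naive sketch, not the batch algorithm of Chandel et al.: it scores every segment against every entry, which is exactly the cost the batched approach is meant to avoid. The tokenizer, the IDF formula, the default weight for unseen tokens, and the names `matching_segments`, `eps`, and `max_len` are all assumptions.

```python
import math
from collections import Counter

def tokenize(s):
    return s.lower().split()

def build_idf(entries):
    # Document frequency over the entity database D.
    n = len(entries)
    df = Counter(tok for e in entries for tok in set(tokenize(e)))
    return {tok: math.log(1 + n / c) for tok, c in df.items()}

def tfidf(tokens, idf, default):
    # Tokens unseen in D get the weight of a token with df = 1 (assumption).
    tf = Counter(tokens)
    return {t: c * idf.get(t, default) for t, c in tf.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def matching_segments(x, entries, eps=0.9, max_len=4):
    """Return (segment, entry, score) for every segment of x scoring above eps."""
    idf = build_idf(entries)
    default = math.log(1 + len(entries))
    vecs = [(e, tfidf(tokenize(e), idf, default)) for e in entries]
    toks = tokenize(x)
    hits = []
    for i in range(len(toks)):
        for j in range(i + 1, min(i + max_len, len(toks)) + 1):
            seg_vec = tfidf(toks[i:j], idf, default)
            for e, e_vec in vecs:
                s = cosine(seg_vec, e_vec)
                if s > eps:
                    hits.append((" ".join(toks[i:j]), e, round(s, 3)))
    return hits
```

For example, `matching_segments("i liked transaction processing a lot", ["transaction processing", "database systems"])` reports only the segment "transaction processing".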


Handling Change

• Incremental Extractions on Changing Sources

  • “An easy optimization, with clear scope for performance boost, is to run the extractor only on the changed portions of a page instead of the entire page.”

  • Detecting the unchanged regions of the page is the key.
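One way to detect the unchanged regions is a line-level diff between the old and new versions of a page. The sketch below uses Python's standard `difflib`; the line granularity and the `extractor` callable are assumptions, and it ignores two things a real system must handle: retracting extractions from deleted lines, and extractors whose output depends on context outside the changed line.

```python
import difflib

def changed_lines(old_page, new_page):
    """Return the lines of new_page that are not matched in old_page."""
    old, new = old_page.splitlines(), new_page.splitlines()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    changed = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            changed.extend(new[j1:j2])
    return changed

def incremental_extract(old_page, new_page, extractor):
    # `extractor` is a hypothetical callable that processes one line of text.
    return [extractor(line) for line in changed_lines(old_page, new_page)]
```

With `old_page = "a\nb\nc"` and `new_page = "a\nB2\nc\nd"`, only the lines `B2` and `d` are passed to the extractor.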


Handling Change

• Detecting When Extractors Fail on Evolving Data

• Defining Characteristic Patterns

• DataProg

  1. When the pattern’s frequency is statistically significant

  2. Avoiding choosing very specific patterns

• E.g.

  4676 Admiralty Way
  10924 Pico Boulevard
  512 Oak Street
  2431 Main Street
  5257 Adams Boulevard

  P1: Number UpperCaseWord Boulevard
  P2: Number UpperCaseWord Street
  P3: Number UpperCaseWord Way
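The address example can be reproduced with a small token-generalization sketch. This is not the DataProg algorithm: it skips the statistical significance test entirely and uses a fixed rule (generalize every token except the last) that is assumed here only so that the output lines up with P1 to P3.

```python
from collections import Counter

def token_class(tok):
    # Coarse token classes (an assumption; DataProg uses a richer hierarchy).
    if tok.isdigit():
        return "Number"
    if tok[:1].isupper():
        return "UpperCaseWord"
    return "Word"

def generalize(example):
    """Replace all but the last token with its class."""
    toks = example.split()
    return " ".join([token_class(t) for t in toks[:-1]] + toks[-1:])

addresses = [
    "4676 Admiralty Way",
    "10924 Pico Boulevard",
    "512 Oak Street",
    "2431 Main Street",
    "5257 Adams Boulevard",
]

pattern_counts = Counter(generalize(a) for a in addresses)
```

`pattern_counts` then holds the three patterns from the slide, with Boulevard and Street each matching two addresses and Way matching one.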


Handling Change

• Detecting When Extractors Fail on Evolving Data

• Defining Significant Change

  • The distribution represented by Fi′ is said to be statistically different from Fi if the expected values ei′ of the counts in (D′, S′), obtained by extrapolating from Fi, differ a lot from Fi′ (using the χ² statistic):

    $$\chi^2 = \sum_i \frac{(F_i' - e_i')^2}{e_i'}$$

  • The expected value ei′ is obtained by extrapolating from Fi
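A minimal sketch of the χ² comparison, assuming the expected count of each pattern is obtained by scaling its old count by the ratio of the new and old sample sizes. That proportional extrapolation is an assumption; the slide does not preserve the exact formula. The names `chi_square`, `n_old`, and `n_new` are hypothetical.

```python
def chi_square(old_counts, n_old, new_counts, n_new):
    """Pearson chi-square between observed new counts and extrapolated old counts."""
    stat = 0.0
    for pattern, f_old in old_counts.items():
        if f_old == 0:
            continue  # no expectation can be formed for an unseen pattern
        expected = f_old * n_new / n_old  # assumed proportional extrapolation
        observed = new_counts.get(pattern, 0)
        stat += (observed - expected) ** 2 / expected
    return stat

# Pattern counts from five old examples (illustrative).
old = {"P1": 2, "P2": 2, "P3": 1}
unchanged = chi_square(old, 5, {"P1": 4, "P2": 4, "P3": 2}, 10)
broken = chi_square(old, 5, {"P1": 0, "P2": 0, "P3": 0}, 5)
```

A statistic near zero suggests the extractor's output still follows the old pattern distribution; a large value, compared against a χ² critical value for the chosen significance level, flags a likely failure.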


Integration of Extracted Information

• Main Challenge in Integration of Extracted Information:

  • Deduplication, coreference resolution, record linkage

• Solution

  • “Ideally, extraction of all repeated mentions should be done simultaneously and jointly with integration with existing sources.”

1. Decoupled Extractions and Integration

2. Decoupled Extraction and Collective Integration

3. Coupled Extraction and Integration


Integration of Extracted Information

• Decoupled Extractions and Integration

  • Extraction and integration happen independently

  • The decision about duplicates is made during integration

  • Binary classification

    • Input: a pair of records; output: a binary decision (duplicate or not)

    • Uses similarity functions (cosine similarity, edit distance, Jaccard similarity, and Soundex)
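The pairwise setup can be sketched with two of the listed similarity functions. The thresholds and the hand-written rule in `is_duplicate` are assumptions standing in for a classifier that would normally be trained on labeled pairs.

```python
def jaccard(a, b):
    """Jaccard similarity over lowercase word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def is_duplicate(r1, r2, jaccard_min=0.5, edit_max=2):
    # Hand-set thresholds (assumption), not learned from data.
    if jaccard(r1, r2) >= jaccard_min:
        return True
    return edit_distance(r1.lower(), r2.lower()) <= edit_max
```

For instance, "Alistair MacLean" and "Alistair Mclean" share only one of three distinct words, so the Jaccard test fails, but their edit distance of 1 still marks them as duplicates.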


Integration of Extracted Information

• Decoupled Extractions and Integration

  • Example of a decision tree built on similarity functions


Integration of Extracted Information

• Decoupled Extractions and Integration

  • Sequential Process

    • For an extracted record r and each entry e in the existing database D, the classifier is applied to the pair (r, e) and returns “yes/no”

    • If the answer is “no” for all entries, r is added as a new entry; otherwise it is integrated with the best matching entry e.

    • The sequential process can be sped up considerably through index lookups for efficiently finding likely matches.
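The sequential process with an index lookup might look like the sketch below. The `EntityDB` class, the token-level inverted index, the Jaccard `score`, and the threshold of 0.5 are all illustrative assumptions; any classifier and any candidate-generation index could take their place.

```python
from collections import defaultdict

class EntityDB:
    """Toy entity database with an inverted index from token to entry ids."""

    def __init__(self):
        self.entries = []
        self.index = defaultdict(set)

    def add(self, entry):
        eid = len(self.entries)
        self.entries.append(entry)
        for tok in entry.lower().split():
            self.index[tok].add(eid)
        return eid

    def candidates(self, record):
        # Only entries sharing at least one token are likely matches.
        ids = set()
        for tok in record.lower().split():
            ids |= self.index.get(tok, set())
        return ids

def score(record, entry):
    a, b = set(record.lower().split()), set(entry.lower().split())
    return len(a & b) / len(a | b)

def integrate(db, record, threshold=0.5):
    """Return the id of the best matching entry, or add record as a new entry."""
    best_id, best = None, 0.0
    for eid in db.candidates(record):
        s = score(record, db.entries[eid])
        if s > best:
            best_id, best = eid, s
    if best_id is not None and best >= threshold:
        return best_id
    return db.add(record)
```

The index means the classifier is applied only to entries that share a token with r, instead of to every entry in D.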


Integration of Extracted Information

• Decoupled Extraction and Collective Integration

  R1. Alistair MacLean

  R2. A McLean

  R3. Alistair Mclean

• Transitivity

  • If A = B and B = C, then A = C

• Cast the collective integration of multiple records as a graph partitioning problem

  • Nodes: records

  • An edge between ei and ej is drawn with weighted score wij

  • The sign: duplicate or nonduplicate

  • Magnitude: confidence in this outcome


Integration of Extracted Information

• Decoupled Extraction and Collective Integration

  • Correlation Clustering (CC)
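Correlation clustering can be illustrated by brute force on the three records R1 to R3. The sketch enumerates every partition and keeps the one that agrees most with the signed edge weights, which is only feasible for a handful of nodes; practical systems rely on approximation algorithms. The weights below are invented for illustration.

```python
def partitions(items):
    """Yield every partition of a list into non-empty clusters."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        yield [[first]] + part  # first in its own cluster
        for i in range(len(part)):  # or first joins an existing cluster
            yield part[:i] + [[first] + part[i]] + part[i + 1:]

def agreement(partition, w):
    """Sum |w| over pairs whose sign agrees with the partition."""
    cluster_of = {n: i for i, c in enumerate(partition) for n in c}
    total = 0.0
    for (a, b), weight in w.items():
        same = cluster_of[a] == cluster_of[b]
        if (weight > 0) == same:
            total += abs(weight)
    return total

def correlation_clustering(nodes, w):
    return max(partitions(nodes), key=lambda p: agreement(p, w))

# Hypothetical scores: positive = duplicate, negative = nonduplicate.
w = {("R1", "R3"): 0.9, ("R2", "R3"): 0.6, ("R1", "R2"): -0.2}
best = correlation_clustering(["R1", "R2", "R3"], w)
```

Here the two strong positive edges outweigh the weak negative one, so all three records land in a single cluster even though R1 and R2 were scored as nonduplicates in isolation.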


Integration of Extracted Information

• Decoupled Extraction and Collective Integration

  • Collective Multi-attribute Integration

    • When the extracted information spans multiple columns, collective integration can have a greater impact


Integration of Extracted Information

• Coupled Extraction and Integration

  • Joint extraction and integration

  • Little to be gained

    • When the database is not guaranteed to be complete

    • When we are extracting single entities at a time

  • Boost accuracy

    • When extracting records or multi-way relationships consisting of multiple entity subtypes


Integration of Extracted Information

• Coupled Extraction and Integration

E.g. “In his foreword to Transaction Processing Concepts and Techniques, Bruce Lindsay”

Existing Books database

• Book names, where one of the entries is “Transaction Processing: Concepts and Techniques.”

• People names consisting of entries like “A. Reuters”, “J. Gray”, “B. Lindsay”, “D Knuth”, and so on.

• Authors table linking the book titles with the people who wrote them.


Imprecision of Extraction

• Confidence Values for Single Extractions

  • Two ways of representing the imprecision of extraction

    1. Associate each extracted item with a probability value

    2. Output multiple possible extractions


Imprecision of Extraction

• Confidence Values for Single Extractions

  • Reliability Plot

    • A useful visual tool to measure the soundness of the probabilities

    • X-axis: binned probabilities output by the classifier

    • Y-axis: fraction of test instances in that probability bin whose predictions are correct
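The points of a reliability plot can be computed as below. This is a minimal sketch assuming equal-width bins; the function name `reliability_bins` and the sample data are hypothetical. Plotting is left out so the block needs only the standard library.

```python
def reliability_bins(probs, correct, n_bins=10):
    """Return (bin_low, bin_high, fraction_correct) for each non-empty bin."""
    totals = [0] * n_bins
    hits = [0] * n_bins
    for p, ok in zip(probs, correct):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        totals[b] += 1
        hits[b] += bool(ok)
    return [(b / n_bins, (b + 1) / n_bins, hits[b] / totals[b])
            for b in range(n_bins) if totals[b]]

# Illustrative classifier outputs and whether each prediction was correct.
points = reliability_bins([0.95, 0.91, 0.15, 0.12], [True, True, False, True])
```

For sound probabilities, the fraction correct in each bin lies close to the bin's probability range, so the plotted points fall near the diagonal.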


Imprecision of Extraction

• Multi-attribute Extractions

  • Extracting multiple attributes of an entity from a single source string

    E.g. “52-A Goregaon West Mumbai 400 076”

  • Representing uncertainty through a probability distribution attached to each column.


Imprecision of Extraction

• Multi-attribute Extractions

  • Extracting multiple attributes of an entity from a single source string

    E.g. “52-A Goregaon West Mumbai 400 076”

  • Hybrid method: row and column level distributions
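The relation between row-level and column-level distributions can be sketched on the address example. The three candidate segmentations, their probabilities, and the column names are invented for illustration; the code only shows how per-column distributions fall out of a set of weighted rows.

```python
from collections import defaultdict

# Row-level view: whole segmentations with hypothetical probabilities.
rows = [
    ({"house_no": "52-A", "area": "Goregaon West",
      "city": "Mumbai", "pincode": "400 076"}, 0.6),
    ({"house_no": "52-A", "area": "Goregaon",
      "city": "West Mumbai", "pincode": "400 076"}, 0.3),
    ({"house_no": "52-A Goregaon", "area": "West",
      "city": "Mumbai", "pincode": "400 076"}, 0.1),
]

def column_marginals(rows):
    """Column-level view: one distribution per column, summed over rows."""
    marg = defaultdict(lambda: defaultdict(float))
    for fields, p in rows:
        for col, val in fields.items():
            marg[col][val] += p
    return {col: dict(vals) for col, vals in marg.items()}

marginals = column_marginals(rows)
```

The column-level view is compact but discards correlations between columns (for example, that "Goregaon" as the area goes with "West Mumbai" as the city), which is the motivation for a hybrid of the two.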


Imprecision of Extraction

• Multiple Redundant Extractions

  1. Assume only extraction uncertainty and ignore co-reference uncertainty, by assuming that an exact method exists for resolving whether two strings are the same.

  2. Assume there is only co-reference uncertainty, and that no uncertainty is attached to each string referring to an entity.


Imprecision of Extraction

• Multiple Redundant Extractions

  • The Noisy-OR Model: converts the confidences p1,…,pn of n extractions of x into a single probability value p of x

    • It assumes that different extractions are independent, which is not practical

    • E.g. 100 extractions, each with confidence 0.1, combine to a probability close to 1

  • The soft-OR function
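The standard noisy-OR combination is p = 1 − ∏(1 − pi): x is wrong only if every one of the independent extractions is wrong. A short sketch (the function name `noisy_or` is ours) reproduces the slide's example of 100 weak extractions:

```python
import math

def noisy_or(probs):
    """Combine independent extraction confidences into one probability."""
    return 1.0 - math.prod(1.0 - p for p in probs)

# 100 extractions, each with confidence 0.1.
combined = noisy_or([0.1] * 100)
```

`combined` equals 1 − 0.9^100, which is above 0.9999. This is the overconfidence that the independence assumption produces when the extractions are in fact correlated.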


Imprecision of Extraction

• Multiple Redundant Extractions

  • Conditional Probability Models from Labeled Data

    • Based on labeled data

    • No assumption of independence

  • Generative Models for Unlabeled Data

    • A single pattern

    • Multiple patterns: Pr(y|n1j…nkj) =