INFORMATION EXTRACTION, SUNITA SARAWAGI. Management of Information Extraction Systems: Performance Optimization, Handling Change, Integration of Extracted Information


Management of Information Extraction System

• Performance Optimization

• Handling Change

• Integration of Extracted Information

• Imprecision of Extraction

Performance Optimization

• Two modes of extraction system

1. The unstructured source is naturally available

2. The unstructured source is open-ended and large

Document Selection Strategies

• When the source is really large…

• Manually restricting the set

• Focused crawling

• Searching via keywords













Document Selection Strategies

• Index Search Techniques

1. Standard IR-style keyword queries, e.g. “vaccine” and “cure”

2. Pattern queries, e.g. “Thomas \w+ Edison” > “Thomas NEAR Edison”

• Index Design for Efficient Extraction

• Index should support: proximity queries, regular expression patterns, and storage of tags

• Cafarella and Etzioni [38]
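A minimal sketch of how an index can answer a proximity query such as “Thomas NEAR Edison”. It is a plain positional inverted index, not the index design of Cafarella and Etzioni; the function names, the whitespace tokenizer, and the `window` parameter are assumptions made for illustration.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each lowercased token to {doc_id: [token positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index

def near(index, a, b, window=2):
    """Return sorted doc ids where a and b occur within `window` positions."""
    postings_a = index.get(a.lower(), {})
    postings_b = index.get(b.lower(), {})
    hits = []
    # Only documents containing both terms can match.
    for doc_id in postings_a.keys() & postings_b.keys():
        if any(abs(pa - pb) <= window
               for pa in postings_a[doc_id]
               for pb in postings_b[doc_id]):
            hits.append(doc_id)
    return sorted(hits)

docs = {
    1: "Thomas Alva Edison invented the phonograph",
    2: "Thomas Jefferson never met Edison",
    3: "Edison founded a company",
}
index = build_positional_index(docs)
print(near(index, "Thomas", "Edison", window=2))
```

Only the documents returned by the proximity query need to be passed to the (more expensive) extractor.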

Efficient Querying of Entity Database for Extraction

• Similarity

• Can be an expensive operation

• E.g. extracting book titles from blogs.

• Batch-Top-K search

• Goal: to find each possible segment in x whose similarity to an entry in D (database of entities) is greater than a threshold ε

• Concentrated on the TF-IDF similarity score

• Chandel et al. [49]
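To make the goal concrete, here is a brute-force baseline that scores every segment of x against every entry of D with TF-IDF cosine similarity. It is not the Batch-Top-K algorithm of Chandel et al., which exists precisely to avoid this exhaustive comparison; the IDF formula, the weight given to unseen tokens, and all names are assumptions.

```python
import math
from collections import Counter

def build_idf(entries):
    """IDF table over the entity database, plus a default for unseen tokens."""
    n = len(entries)
    df = Counter(tok for e in entries for tok in set(e.lower().split()))
    idf = {tok: math.log(1 + n / d) for tok, d in df.items()}
    return idf, math.log(1 + n)  # unseen tokens are treated as very rare

def vectorise(tokens, idf, unseen_idf):
    """L2-normalised TF-IDF vector as a {token: weight} dict."""
    tf = Counter(tokens)
    vec = {t: c * idf.get(t, unseen_idf) for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def cosine(u, v):
    """Dot product of two already-normalised sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def matching_segments(text, entries, threshold=0.9, max_len=5):
    """Return (segment, entry, score) for every segment scoring above threshold."""
    idf, unseen = build_idf(entries)
    entry_vecs = [vectorise(e.lower().split(), idf, unseen) for e in entries]
    tokens = text.lower().split()
    matches = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            seg_vec = vectorise(tokens[start:end], idf, unseen)
            for entry, vec in zip(entries, entry_vecs):
                score = cosine(seg_vec, vec)
                if score > threshold:
                    matches.append((" ".join(tokens[start:end]), entry, round(score, 3)))
    return matches

books = ["War and Peace", "Brave New World"]
post = "I just finished reading war and peace last night"
print(matching_segments(post, books))
```

The nested loops show why the operation is expensive: the work grows with the number of segments times the number of database entries.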

Handling Change

• Incremental Extractions on Changing Sources

• “An easy optimization, with clear scope for performance boost, is to run the extractor only on the changed portions of a page instead of the entire page.”

• Detecting the unchanged regions of the page is the key.
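One simple way to find the changed portions is a line-level diff between the old and new versions of a page. The sketch below uses Python's standard `difflib`; treating a line as the unit of change and the `extractor` callback are assumptions, and a real system might diff at a different granularity.

```python
import difflib

def changed_lines(old_page, new_page):
    """Return (line_number, text) for lines of new_page that are new or modified."""
    old_lines = old_page.splitlines()
    new_lines = new_page.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    changed = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "equal" blocks are unchanged; "delete" blocks have no new text to extract.
        if tag in ("replace", "insert"):
            changed.extend((j, new_lines[j]) for j in range(j1, j2))
    return changed

def incremental_extract(old_page, new_page, extractor):
    """Run the (hypothetical) extractor only on changed lines."""
    return [extractor(text) for _, text in changed_lines(old_page, new_page)]

old = "Price: 10\nStock: 5\nColor: red"
new = "Price: 12\nStock: 5\nColor: red\nSize: M"
print(changed_lines(old, new))
print(incremental_extract(old, new, lambda line: tuple(line.split(": "))))
```

Results extracted earlier from the unchanged lines can be reused as they are.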

Handling Change

• Detecting When Extractors Fail on Evolving Data

• Defining Characteristic Patterns

• DataProg

1. When the pattern’s frequency is statistically significant

2. Avoiding choosing very specific patterns

E.g.

4676 Admiralty Way
10924 Pico Boulevard
512 Oak Street
2431 Main Street
5257 Adams Boulevard

P1: Number UpperCaseWord Boulevard
P2: Number UpperCaseWord Street
P3: Number UpperCaseWord Way
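The step from the five addresses to patterns P1 to P3 can be sketched as below. This is only an illustration of the generalization step, not the DataProg algorithm itself: the token types, the rule of keeping the last token literal, and the function names are assumptions, and the statistical significance test is left out.

```python
from collections import Counter

def token_type(token):
    """Generalise a token to a coarse type (assumed type hierarchy)."""
    if token.isdigit():
        return "Number"
    if token[:1].isupper():
        return "UpperCaseWord"
    return "LowerCaseWord"

def candidate_patterns(examples):
    """Generalise all tokens except the last, then count each pattern."""
    counts = Counter()
    for example in examples:
        tokens = example.split()
        pattern = [token_type(t) for t in tokens[:-1]] + [tokens[-1]]
        counts[" ".join(pattern)] += 1
    return counts

addresses = [
    "4676 Admiralty Way",
    "10924 Pico Boulevard",
    "512 Oak Street",
    "2431 Main Street",
    "5257 Adams Boulevard",
]
for pattern, freq in candidate_patterns(addresses).most_common():
    print(freq, pattern)
```

Keeping "4676 Admiralty Way" fully literal would be a very specific pattern that matches one example only, which is what the second criterion avoids.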




• Defining Significant Change

• The distribution represented by Fi’ is said to be statistically different from Fi if the expected counts ei’ in (D’,S’), obtained by extrapolating from Fi, differ significantly from Fi’ (using the χ² statistic):

$\chi^2 = \sum_i \frac{(F_i' - e_i')^2}{e_i'}$

• The expected value:
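A minimal sketch of the significance check, using Pearson's χ² statistic. It assumes the expected counts are obtained by scaling the old pattern counts Fi by the ratio of the new to the old sample size; that scaling rule and the function names are assumptions, not something stated here.

```python
def expected_counts(old_counts, new_total):
    """Extrapolate old pattern counts to a sample of size new_total (assumed rule)."""
    old_total = sum(old_counts)
    return [f * new_total / old_total for f in old_counts]

def chi_square(old_counts, new_counts):
    """Pearson chi-square between observed new counts and extrapolated old counts."""
    expected = expected_counts(old_counts, sum(new_counts))
    return sum((obs - exp) ** 2 / exp
               for obs, exp in zip(new_counts, expected) if exp > 0)

old = [50, 30, 20]        # counts of three patterns on the old data
same_shape = [25, 15, 10]  # same proportions, half the sample size
shifted = [5, 15, 30]      # first pattern has largely disappeared

print(chi_square(old, same_shape))  # 0.0
print(chi_square(old, shifted))     # 56.0
```

With three patterns there are two degrees of freedom, for which the 5% critical value is about 5.99, so the second case would be flagged as a significant change.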

Integration of Extracted Information

• Main Challenge in Integration of Extracted Information:

• Deduplication, coreference resolution, record linkage

• Solution

• “Ideally, extraction of all repeated mentions should be done simultaneously and jointly with integration with existing sources.”

1. Decoupled Extraction and Integration

2. Decoupled Extraction and Collective Integration

3. Coupled Extraction and Integration

Integration of Extracted Information

• Decoupled Extraction and Integration

• Extraction and integration happen independently

• The decision about redundancy is made during integration

• Binary classification

• Input: a pair of records; output: a binary decision (duplicate or not)

• Uses similarity functions (cosine similarity, edit distance, Jaccard similarity, and Soundex)

• Example of a decision tree created on similarity functions

• Sequential Process

• The classifier is applied to each pair (r, e) of an extracted record r and an entry e in the existing database D, and returns “yes” or “no”

• If the answer is “no” for all entries, r is a new entry; otherwise it is integrated with the best-matching entry e.

• The sequential process can be sped up considerably through index lookups for efficiently finding likely matches.
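The sequential process can be sketched as follows. A thresholded Jaccard similarity stands in for the learned binary classifier (for example a decision tree over several similarity functions); the threshold value and the function names are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between the lowercased token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def integrate(record, database, threshold=0.5):
    """Compare record with every entry; merge with the best match or add it as new."""
    best_entry, best_score = None, 0.0
    for entry in database:
        score = jaccard(record, entry)
        # "yes" from the stand-in classifier means score >= threshold.
        if score >= threshold and score > best_score:
            best_entry, best_score = entry, score
    if best_entry is None:
        database.append(record)  # "no" for all entries: r is a new entry
        return ("new", record)
    return ("merged", best_entry)

db = ["Transaction Processing Concepts and Techniques",
      "The Art of Computer Programming"]
print(integrate("Transaction Processing: Concepts and Techniques", db))
print(integrate("Database Systems", db))
print(len(db))
```

The loop over the whole database is the part that an index lookup would replace, by returning only the entries likely to match.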

Integration of Extracted Information

• Decoupled Extraction and Collective Integration

R1. Alistair MacLean

R2. A McLean

R3. Alistair Mclean

• Transitivity

• If A = B and B = C, then A = C

• Cast the collective integration of multiple records as a graph partitioning problem

• Nodes: records

• An edge between ei and ej is drawn with weighted score wij

• The sign: duplicate or nonduplicate

• Magnitude: confidence in this outcome

• Correlation Clustering (CC)

• Collective Multi-attribute Integration

• When the information extracted spans multiple columns, collective integration can have a greater impact
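A small sketch of the graph partitioning view on records R1 to R3. The edge scores are made up for illustration, and the greedy pivot heuristic is one simple approximation for correlation clustering, not necessarily the method intended here. The `disagreement` function computes the cost that correlation clustering tries to minimize.

```python
from itertools import combinations

def pivot_clustering(nodes, weight):
    """Greedy pivot heuristic: the first unassigned node becomes the pivot and
    takes every unassigned node joined to it by a positive edge."""
    unassigned = list(nodes)
    clusters = []
    while unassigned:
        pivot = unassigned[0]
        cluster = [pivot] + [n for n in unassigned[1:] if weight(pivot, n) > 0]
        clusters.append(cluster)
        unassigned = [n for n in unassigned if n not in cluster]
    return clusters

def disagreement(clusters, weight):
    """Total |w| of positive edges cut plus negative edges kept inside a cluster."""
    label = {n: i for i, c in enumerate(clusters) for n in c}
    cost = 0.0
    for a, b in combinations(label, 2):
        w = weight(a, b)
        same = label[a] == label[b]
        if (w > 0 and not same) or (w < 0 and same):
            cost += abs(w)
    return cost

# Hypothetical pairwise scores: sign = duplicate or not, magnitude = confidence.
scores = {
    frozenset(("R1", "R3")): 0.9,   # Alistair MacLean vs Alistair Mclean
    frozenset(("R2", "R3")): 0.6,   # A McLean vs Alistair Mclean
    frozenset(("R1", "R2")): -0.2,  # Alistair MacLean vs A McLean
}
weight = lambda a, b: scores[frozenset((a, b))]

for order in (["R1", "R2", "R3"], ["R3", "R1", "R2"]):
    clusters = pivot_clustering(order, weight)
    print(clusters, disagreement(clusters, weight))
```

The output depends on the pivot order: starting from R3 puts all three records together at cost 0.2, which is lower than the cost 0.6 of splitting off R2, even though the pairwise decision for R1 and R2 alone was "nonduplicate".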

Integration of Extracted Information

• Coupled Extraction and Integration

• Joint extraction and integration

• Little to be gained

• When the database is not guaranteed to be complete

• When we are extracting single entities at a time

• Boosts accuracy

• When extracting records or multi-way relationships consisting of multiple entity subtypes

Integration of Extracted Information

• Coupled Extraction and Integration

E.g. “In his foreword to Transaction Processing Concepts and Techniques, Bruce Lindsay”

Existing Books database

• Book names where one of the entries is “Transaction Processing: Concepts and Techniques.”

• People names consisting of entries like “A. Reuters”, “J. Gray”, “B. Lindsay”, “D Knuth”, and so on.

• Authors table linking the book titles with the people who wrote them.

Imprecision of Extraction

• Confidence Values for Single Extractions

• Two ways of representing the imprecision of extraction

1. To associate each extracted item with a probability value

2. To output multiple possible extractions

Imprecision of Extraction

• Confidence Values for Single Extractions

• Reliability Plot

• A useful visual tool to measure the soundness of the probabilities

• X-axis: binned probabilities output by classifier

• Y-axis: fraction of test instances in that probability bin whose predictions are correct
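The data behind a reliability plot can be computed with a few lines. This sketch uses equal-width bins; the number of bins, the toy inputs, and the function name are assumptions.

```python
def reliability_bins(probabilities, correct, n_bins=10):
    """Return (bin midpoint, fraction correct, count) for each non-empty bin."""
    totals = [0] * n_bins
    hits = [0] * n_bins
    for p, ok in zip(probabilities, correct):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        totals[b] += 1
        hits[b] += int(ok)
    return [((b + 0.5) / n_bins, hits[b] / totals[b], totals[b])
            for b in range(n_bins) if totals[b]]

probs = [0.95, 0.92, 0.91, 0.55, 0.52, 0.15]
is_correct = [True, True, False, True, False, False]
for midpoint, accuracy, count in reliability_bins(probs, is_correct):
    print(f"bin ~{midpoint:.2f}: {accuracy:.2f} correct over {count} instances")
```

For sound probabilities the points lie close to the diagonal: in the bin around 0.9, about 90% of the predictions should be correct.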

Imprecision of Extraction

• Multi-attribute Extractions

• Extracting multiple attributes of an entity from a single source string

E.g. “52-A Goregaon West Mumbai 400 076”

• Representing uncertainty through a probability distribution attached to each column.

Imprecision of Extraction

• Multi-attribute Extractions

• Hybrid method: row- and column-level distributions

Imprecision of Extraction

• Multiple Redundant Extractions

1. Assume only extraction uncertainty and ignore co-reference uncertainty by assuming that an exact method exists for resolving whether two strings are the same.

2. Assume there is only co-reference uncertainty and each string referring to an entity has no uncertainty attached to it.

• The Noisy-OR Model: converts the confidences p1,…,pn into a single probability value p of x

• It assumes that different extractions are independent, which is not practical

• E.g. 100 extractions with confidence 0.1 each => p close to 1

• The soft-OR function

• Conditional Probability Models from Labeled Data

• Based on labeled data

• No assumption about independence

• Generative Models for Unlabeled Data

• A single pattern

• Multiple patterns: Pr(y|n1j…nkj) =
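For the Noisy-OR model listed earlier on this slide, the standard combination rule is p = 1 − (1 − p1)(1 − p2)…(1 − pn). The short sketch below applies it to the example of 100 extractions with confidence 0.1 each; the function name is an assumption.

```python
import math

def noisy_or(confidences):
    """Combine per-extraction confidences, assuming they are independent."""
    return 1.0 - math.prod(1.0 - p for p in confidences)

print(noisy_or([0.5, 0.5]))    # 0.75
print(noisy_or([0.1] * 100))   # very close to 1
```

One hundred weak and possibly correlated extractions yield near certainty, which shows why the independence assumption is considered impractical.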
