Leveraging Context in User-Centric Entity Detection Systems
Vadim von Brzeski, Utku Irmak, Reiner Kraft
Outline
• Definitions
  – "User-Centric Entity Detection System"?
• Motivation
• Contextual Shortcuts System
  – Pre-processing
  – Entity Detection Chain
  – Post-processing
• Evaluation and Results
• Conclusions
User-Centric Entity Detection Systems
• A system which detects entities or concepts in a web page and transforms them into actionable "intelligent hyperlinks"
  – View a map
  – Perform a web search
  – Compose an email
  – …
• The primary consumer is a person
Definitions
• Entity: Something which is considered to have its own physical existence in the real or virtual world
  – Named entities (persons, organizations, places)
  – Phone numbers, email addresses, URLs, etc.
• Concept: An abstract thought or idea which is not an entity
  – "Car insurance"
  – "Chinese restaurants in San Francisco"
User-Centric vs. Machine-Centric Detection Systems
• Machine-Centric: consumed by algorithms
  – Applications:
    • Question answering
    • Automatic correction of missing case information or misspellings
  – Metrics (quantity evaluation):
    • Precision, Recall
• User-Centric: consumed by humans
  – Applications:
    • Creating intelligent hyperlinks in emails, articles, blogs, etc.
  – Metrics (quality evaluation):
    • Accuracy, Relevance, Interestingness
• Entity detection can be based on:
  – Regular expressions (an illustrative sketch follows this list)
    • URLs, phone numbers, dates and times, emails, etc.
  – Editorially maintained dictionaries
    • Persons, organizations, events, products, places, etc.
  – Anonymous web search engine query logs
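To make the regular-expression layer above concrete, here is a minimal sketch of how a few pattern-based detectors might be wired together. The patterns and function names are assumptions for illustration only, not the production Contextual Shortcuts rules.

# Illustrative regex detectors (assumed patterns, not the production rules).
import re

DETECTORS = {
    "url": re.compile(r"https?://[^\s<>\"]+", re.IGNORECASE),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
}

def detect_regex_entities(text):
    """Return (entity type, start, end, surface form) for each regex match."""
    hits = []
    for etype, pattern in DETECTORS.items():
        for m in pattern.finditer(text):
            hits.append((etype, m.start(), m.end(), m.group()))
    return sorted(hits, key=lambda h: h[1])

print(detect_regex_entities("Call 800-555-0199 or visit http://example.com"))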
User-Centric Entity Detection Systems: Challenges
Definitions
• Under-selection and over-selection problems: occur when entity boundaries are not selected correctly
  – "Betty Crocker" vs. "Betty Crocker's famous cheesecake"
  – "in California" vs. "California"
• Concept extension: a more specific concept derived from a broader entity or concept
  – "Betty Crocker's famous cheesecake" from "Betty Crocker"
Motivation
• Improve the overall entity detection quality within user-centric entity detection systems:
  – Accuracy: Are the boundaries selected correctly?
  – Relevance: Are the entities/concepts relevant to the topic of the text?
  – Interestingness: Are the entities/concepts interesting enough for further action?
Outline
• Definitions
• Motivation
• Contextual Shortcuts System
  – Pre-processing
  – Entity Detection Algorithms
  – Post-processing
• Evaluation and Results
• Conclusions
Contextual Shortcuts Platform
[Architecture diagram: a document flows through Pre-processing (HTML parsing, tokenization, boundary detection), then the Entity Detection Chain (driven by dictionaries and query logs, with concept extension and annotation), then Post-processing (filtering, collision detection), producing the annotated output.]
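The following is a minimal, runnable sketch of the stage ordering recovered from the diagram above; every stage implementation here is a deliberately trivial placeholder (an assumption), shown only to make the data flow concrete.

import re

def pre_process(html):
    # HTML parsing and tokenization, both reduced to trivial placeholders
    text = re.sub(r"<[^>]+>", " ", html)
    return re.findall(r"\w[\w'-]*", text)

def detect_entities(tokens, dictionary):
    # Entity detection chain, reduced here to a unigram dictionary lookup
    return [(i, i + 1, t) for i, t in enumerate(tokens) if t.lower() in dictionary]

def post_process(candidates):
    # Filtering and collision detection: keep the longest non-overlapping spans
    candidates.sort(key=lambda c: (c[0], -(c[1] - c[0])))
    kept, last_end = [], -1
    for start, end, surface in candidates:
        if start >= last_end:
            kept.append((start, end, surface))
            last_end = end
    return kept

doc = "<p>Cheap flights from San Francisco to Seattle</p>"
print(post_process(detect_entities(pre_process(doc), {"seattle", "francisco"})))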
Entity Detection Chain
• Pluggable and configurable architecture
• Regular expression based detectors:
  – Phone numbers, URLs, etc.
• Dictionaries:
  – Editorially reviewed lists
  – Geo-spatial data
• Query logs and concept extension (described next)
Using Query Logs in Entity Detection
• A resource for new entities and concepts
• When used with context, query logs improve the overall performance through two algorithms:
  – Generating a Concept Vector
  – Concept Extension
Algorithm: Generating a Concept Vector
• Given a document, it first creates two vectors:
  – Term Vector (tf*idf based)
  – Unit Vector (generated from all units in the document)
Definition - Unit
• A multi-term entity that refers to a single concept
  – Ex: "new york city cheap hotels"
• Units are constructed from query logs in an iterative, offline process:
  – First iteration: consider all (single) terms as units
  – Following iterations: units that frequently co-occur in queries are combined into larger candidate units
  – Validation of candidate units is based on mutual information (a reconstructed form follows below)
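The mutual-information formula on the original slide did not survive extraction; the LaTeX below is a commonly used pointwise form over query-log statistics, given as an assumed reconstruction rather than a quote from the slide.

% Assumed reconstruction (not copied from the slide): pointwise mutual
% information of two candidate parts x and y, estimated from query logs.
\mathrm{MI}(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)}
% P(x, y): probability that x and y co-occur in a query;
% P(x), P(y): their individual query probabilities.
% Candidate units whose MI falls below a threshold are rejected.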
Generating a Concept Vector
• Term Vector:
  – tf*idf scores are computed and normalized to the range 0 - 1.0
  – Scores that fall under certain thresholds are penalized or removed
• Unit Vector:
  – Generated from all units in the document
  – Unit scores are normalized to the range 0 - 1.0
  – Scores that fall under certain thresholds are penalized or removed (sketched below)
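A small sketch of the normalization and thresholding described above; the specific threshold and penalty values are assumptions, since the slide only says that low-scoring entries are penalized or removed.

def normalize_and_prune(vector, drop_below=0.05, penalize_below=0.2, penalty=0.5):
    # Normalize scores to 0 - 1.0, remove very low scores, penalize borderline ones.
    # The 0.05 / 0.2 / 0.5 values are illustrative assumptions.
    if not vector:
        return {}
    top = max(vector.values())
    out = {}
    for key, score in vector.items():
        s = score / top
        if s < drop_below:
            continue
        if s < penalize_below:
            s *= penalty
        out[key] = s
    return out

print(normalize_and_prune({"iraq war": 12.0, "wednesday": 1.5, "a": 0.1}))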
Generating a Concept Vector
• A term appears in the term vector, but not in the unit vector:
  – add it to the concept vector
  – penalize its (term vector) weight
• A term appears in the unit vector, but not in the term vector:
  – add it to the concept vector
• A term appears in both the term and unit vectors:
  – add it to the concept vector
  – sum the term vector and unit vector weights
• Inspect the merged concept vector and reward the more specific (multi-term) concepts
  – details in the paper (a sketch of these rules follows below)
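Below is a compact sketch of the merge rules listed above; the penalty factor and the multi-term reward are illustrative assumptions, since the slide defers the exact values to the paper.

def build_concept_vector(term_vector, unit_vector, penalty=0.5, multi_term_reward=1.2):
    # Merge per the rules above; 0.5 and 1.2 are assumed values, not the paper's.
    concept_vector = {}
    for term in set(term_vector) | set(unit_vector):
        if term in term_vector and term in unit_vector:
            weight = term_vector[term] + unit_vector[term]   # in both: sum weights
        elif term in term_vector:
            weight = term_vector[term] * penalty             # term vector only: penalize
        else:
            weight = unit_vector[term]                       # unit vector only: keep
        concept_vector[term] = weight
    # Reward the more specific (multi-term) concepts
    for term in concept_vector:
        if " " in term:
            concept_vector[term] *= multi_term_reward
    return concept_vector

tv = {"iraq": 0.8, "war": 0.7, "wednesday": 0.3}
uv = {"iraq war": 0.9, "president bush": 0.6, "war": 0.5}
print(build_concept_vector(tv, uv))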
Generating a Concept Vector
• The concept vector captures:
  – Interestingness: through unit scores derived from query logs
  – Relevance: through the term vector, which is based on all the terms in the document
Example: Concept Vector
• Input text: "By DAVID ESPO, AP Special Correspondent. WASHINGTON: Anti-war Democrats in the Senate failed in an attempt to cut off funds for the Iraq war on Wednesday, a lopsided bipartisan vote that masked growing impatience within both political parties over President Bush's handling of the four-year conflict."
• Resulting concept vector:

  <termvector id="concept">
    <item term="david espo" weight="1.4403">
    <item term="special correspondent" weight="1.2075">
    <item term="iraq war" weight="1.1833">
    <item term="president bush" weight="1.1549">
    <item term="political parties" weight="0.6147">
    ...
  </termvector>
Finding Concept Extensions
• Addresses the under-selection problem (accurate boundary detection)
  – "Betty Crocker's famous cheesecake"
• For each candidate entity detected:
  – Consider the concepts in the surrounding context
  – If the entity is contained in a more specific concept, favor the more specific concept (sketched below)
• Ex: "… for the Iraq war on Wednesday, a …"
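The following is a minimal sketch of the extension step described above: if a detected entity is contained in a more specific concept from the concept vector of the surrounding context, the longer concept is preferred. The weight cutoff and names are assumptions.

def extend_entity(entity, concept_vector, min_weight=1.0):
    # Prefer the longest sufficiently weighted concept containing the entity.
    best = entity
    for concept, weight in concept_vector.items():
        if weight >= min_weight and entity.lower() in concept.lower() and len(concept) > len(best):
            best = concept
    return best

concepts = {"iraq war": 1.18, "president bush": 1.15}
print(extend_entity("Iraq", concepts))   # -> "iraq war"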
Outline
• Definitions
• Motivation
• Contextual Shortcuts System
  – Pre-processing
  – Entity Detection Algorithms
  – Post-processing
• Evaluation and Results
• Conclusions
Evaluation and Results
• Performed by a team of expert judges on three criteria:
  – Accuracy:
    • Yes: entity or concept boundaries are correct
    • No: under-selection, over-selection, or bad term
  – Interestingness:
    • Interesting or useful in general
    • Interesting or useful only in this context
    • Not interesting or useful
  – Relevance:
    • Relevant
    • Somewhat relevant
    • Not relevant
Concept Extension Results (Accuracy)
• Corpus:
  – 1304 documents from Y! Answers, the Enron mail corpus, and Y! News
  – 2305 entities, 376 of which were extensions
• Judgment labels: OK: correct, US: under-selection, OS: over-selection, BT: bad term
• Without concept extensions, accuracy drops from 95.8% to 81.3%
Concept Vector Results
• Corpus:
  – 352 documents from Y! Finance
  – 2099 concepts detected (with a minimum concept weight of 1.0), judged by 2 editors
  – 1039 judgments on which the editors agreed
Overall Results
• Corpus:
  – 1519 random documents from:
    • Y! Answers
    • Enron mail corpus
    • Y! News
    • Newsgroup postings
    • Product reviews
  – 1586 entities generated
Conclusions and Future Work
• Focused on the quality issues in user-centric entity detection systems
• Argued that quality can be measured via the metrics of accuracy, interestingness, and relevance
• Designed and evaluated algorithms that improve the overall quality by:
  – Using query logs
  – Leveraging both local and global context
• Future work:
  – Increasing coverage of interesting concepts while maintaining high relevance
  – Disambiguation of entity types