Leveraging Context in User-Centric Entity Detection Systems
Vadim von Brzeski, Utku Irmak, Reiner Kraft
Outline
• Definitions
  – "User-Centric Entity Detection System"?
• Motivation
• Contextual Shortcuts System
  – Pre-processing
  – Entity Detection Chain
  – Post-processing
• Evaluation and Results
• Conclusions
User-Centric Entity Detection Systems
• A system which detects entities or concepts in a web page and transforms them into actionable "intelligent hyperlinks"
  – View a map
  – Perform a web search
  – Compose an email
  – …
• The primary consumer is a person
Definitions
• Entity: Something which is considered to have its own physical existence in the real or virtual world
  – Named entities (persons, organizations, places)
  – Phone numbers, email addresses, URLs, etc.
• Concept: An abstract thought or idea which is not an entity
  – "Car insurance"
  – "Chinese restaurants in San Francisco"
User-Centric vs. Machine-Centric Detection Systems
• Machine-Centric: consumed by algorithms
  – Applications:
    • Question answering
    • Automatic correction of missing case information or misspellings
  – Metrics (quantity evaluation):
    • Precision, Recall
• User-Centric: consumed by humans
  – Applications:
    • Creating intelligent hyperlinks in emails, articles, blogs, etc.
  – Metrics (quality evaluation):
    • Accuracy, Relevance, Interestingness
• Entity detection can be based on:
  – Regular expressions (an illustrative sketch follows this list)
    • URLs, phone numbers, dates and times, emails, etc.
  – Editorially maintained dictionaries
    • Persons, organizations, events, products, places, etc.
  – Anonymous web search engine query logs
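To make the regular-expression layer above concrete, here is a minimal sketch of how a few pattern-based detectors might be wired together. The patterns and function names are assumptions for illustration only, not the production Contextual Shortcuts rules.

# Illustrative regex detectors (assumed patterns, not the production rules).
import re

DETECTORS = {
    "url": re.compile(r"https?://[^\s<>\"]+", re.IGNORECASE),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
}

def detect_regex_entities(text):
    """Return (entity type, start, end, surface form) for each regex match."""
    hits = []
    for etype, pattern in DETECTORS.items():
        for m in pattern.finditer(text):
            hits.append((etype, m.start(), m.end(), m.group()))
    return sorted(hits, key=lambda h: h[1])

print(detect_regex_entities("Call 800-555-0199 or visit http://example.com"))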
User-Centric Entity Detection Systems: Challenges
Definitions
• Under-selection and over-selection problems: occur when entity boundaries are not selected correctly
  – "Betty Crocker" vs. "Betty Crocker's famous cheesecake"
  – "in California" vs. "California"
• Concept extension: a more specific concept derived from a broader entity or concept
  – "Betty Crocker's famous cheesecake" from "Betty Crocker"
Motivation
• Improve the overall entity detection quality within user-centric entity detection systems:
  – Accuracy: Are the boundaries selected correctly?
  – Relevance: Are the entities/concepts relevant to the topic of the text?
  – Interestingness: Are the entities/concepts interesting enough for further action?
Outline
• Definitions
• Motivation
• Contextual Shortcuts System
  – Pre-processing
  – Entity Detection Algorithms
  – Post-processing
• Evaluation and Results
• Conclusions
Contextual Shortcuts Platform
[Architecture diagram: a document flows through Pre-processing (HTML parsing, tokenization, boundary detection), then the Entity Detection Chain (driven by dictionaries and query logs, with concept extension and annotation), then Post-processing (filtering, collision detection), producing the annotated output.]
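The following is a minimal, runnable sketch of the stage ordering recovered from the diagram above; every stage implementation here is a deliberately trivial placeholder (an assumption), shown only to make the data flow concrete.

import re

def pre_process(html):
    # HTML parsing and tokenization, both reduced to trivial placeholders
    text = re.sub(r"<[^>]+>", " ", html)
    return re.findall(r"\w[\w'-]*", text)

def detect_entities(tokens, dictionary):
    # Entity detection chain, reduced here to a unigram dictionary lookup
    return [(i, i + 1, t) for i, t in enumerate(tokens) if t.lower() in dictionary]

def post_process(candidates):
    # Filtering and collision detection: keep the longest non-overlapping spans
    candidates.sort(key=lambda c: (c[0], -(c[1] - c[0])))
    kept, last_end = [], -1
    for start, end, surface in candidates:
        if start >= last_end:
            kept.append((start, end, surface))
            last_end = end
    return kept

doc = "<p>Cheap flights from San Francisco to Seattle</p>"
print(post_process(detect_entities(pre_process(doc), {"seattle", "francisco"})))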
Entity Detection Chain
• Pluggable and configurable architecture
• Regular expression based detectors:
  – Phone numbers, URLs, etc.
• Dictionaries:
  – Editorially reviewed lists
  – Geo-spatial data
• Query logs and concept extension (described next)
Using Query Logs in Entity Detection
• A resource for new entities and concepts
• When used with context, query logs improve the overall performance through two algorithms:
  – Generating a Concept Vector
  – Concept Extension
Algorithm: Generating a Concept Vector
• Given a document, it first creates two vectors:
  – Term Vector (tf*idf based)
  – Unit Vector (generated from all units in the document)
Definition - Unit
• A multi-term entity that refers to a single concept
  – Ex: "new york city cheap hotels"
• Units are constructed from query logs in an iterative, offline process:
  – First iteration: consider all (single) terms as units
  – Following iterations: units that frequently co-occur in queries are combined into larger candidate units
  – Validation of candidate units is based on mutual information (a reconstructed form follows below)
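The mutual-information formula on the original slide did not survive extraction; the LaTeX below is a commonly used pointwise form over query-log statistics, given as an assumed reconstruction rather than a quote from the slide.

% Assumed reconstruction (not copied from the slide): pointwise mutual
% information of two candidate parts x and y, estimated from query logs.
\mathrm{MI}(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)}
% P(x, y): probability that x and y co-occur in a query;
% P(x), P(y): their individual query probabilities.
% Candidate units whose MI falls below a threshold are rejected.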
Generating a Concept Vector
• Term Vector:
  – tf*idf scores are computed and normalized to the range 0 - 1.0
  – Scores that fall under certain thresholds are penalized or removed
• Unit Vector:
  – Generated from all units in the document
  – Unit scores are normalized to the range 0 - 1.0
  – Scores that fall under certain thresholds are penalized or removed (sketched below)
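A small sketch of the normalization and thresholding described above; the specific threshold and penalty values are assumptions, since the slide only says that low-scoring entries are penalized or removed.

def normalize_and_prune(vector, drop_below=0.05, penalize_below=0.2, penalty=0.5):
    # Normalize scores to 0 - 1.0, remove very low scores, penalize borderline ones.
    # The 0.05 / 0.2 / 0.5 values are illustrative assumptions.
    if not vector:
        return {}
    top = max(vector.values())
    out = {}
    for key, score in vector.items():
        s = score / top
        if s < drop_below:
            continue
        if s < penalize_below:
            s *= penalty
        out[key] = s
    return out

print(normalize_and_prune({"iraq war": 12.0, "wednesday": 1.5, "a": 0.1}))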
Generating a Concept Vector
• A term appears in the term vector, but not in the unit vector:
  – add it to the concept vector
  – penalize its (term vector) weight
• A term appears in the unit vector, but not in the term vector:
  – add it to the concept vector
• A term appears in both the term and unit vectors:
  – add it to the concept vector
  – sum the term vector and unit vector weights
• Inspect the merged concept vector and reward the more specific (multi-term) concepts
  – details in the paper (a sketch of these rules follows below)
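Below is a compact sketch of the merge rules listed above; the penalty factor and the multi-term reward are illustrative assumptions, since the slide defers the exact values to the paper.

def build_concept_vector(term_vector, unit_vector, penalty=0.5, multi_term_reward=1.2):
    # Merge per the rules above; 0.5 and 1.2 are assumed values, not the paper's.
    concept_vector = {}
    for term in set(term_vector) | set(unit_vector):
        if term in term_vector and term in unit_vector:
            weight = term_vector[term] + unit_vector[term]   # in both: sum weights
        elif term in term_vector:
            weight = term_vector[term] * penalty             # term vector only: penalize
        else:
            weight = unit_vector[term]                       # unit vector only: keep
        concept_vector[term] = weight
    # Reward the more specific (multi-term) concepts
    for term in concept_vector:
        if " " in term:
            concept_vector[term] *= multi_term_reward
    return concept_vector

tv = {"iraq": 0.8, "war": 0.7, "wednesday": 0.3}
uv = {"iraq war": 0.9, "president bush": 0.6, "war": 0.5}
print(build_concept_vector(tv, uv))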
Generating a Concept Vector
• The concept vector captures:
  – Interestingness: through unit scores derived from query logs
  – Relevance: through the term vector, which is based on all the terms in the document
Example: Concept Vector
• Input text: "By DAVID ESPO, AP Special Correspondent. WASHINGTON: Anti-war Democrats in the Senate failed in an attempt to cut off funds for the Iraq war on Wednesday, a lopsided bipartisan vote that masked growing impatience within both political parties over President Bush's handling of the four-year conflict."
• Resulting concept vector:

  <termvector id="concept">
    <item term="david espo" weight="1.4403">
    <item term="special correspondent" weight="1.2075">
    <item term="iraq war" weight="1.1833">
    <item term="president bush" weight="1.1549">
    <item term="political parties" weight="0.6147">
    ...
  </termvector>
Finding Concept Extensions
• Addresses the under-selection problem (accurate boundary detection)
  – "Betty Crocker's famous cheesecake"
• For each candidate entity detected:
  – Consider the concepts in the surrounding context
  – If the entity is contained in a more specific concept, favor the more specific concept (sketched below)
• Ex: "… for the Iraq war on Wednesday, a …"
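The following is a minimal sketch of the extension step described above: if a detected entity is contained in a more specific concept from the concept vector of the surrounding context, the longer concept is preferred. The weight cutoff and names are assumptions.

def extend_entity(entity, concept_vector, min_weight=1.0):
    # Prefer the longest sufficiently weighted concept containing the entity.
    best = entity
    for concept, weight in concept_vector.items():
        if weight >= min_weight and entity.lower() in concept.lower() and len(concept) > len(best):
            best = concept
    return best

concepts = {"iraq war": 1.18, "president bush": 1.15}
print(extend_entity("Iraq", concepts))   # -> "iraq war"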
Outline
• Definitions
• Motivation
• Contextual Shortcuts System
  – Pre-processing
  – Entity Detection Algorithms
  – Post-processing
• Evaluation and Results
• Conclusions
Evaluation and Results
• Performed by a team of expert judges on three criteria:
  – Accuracy:
    • Yes: entity or concept boundaries are correct
    • No: under-selection, over-selection, or bad term
  – Interestingness:
    • Interesting or useful in general
    • Interesting or useful only in this context
    • Not interesting or useful
  – Relevance:
    • Relevant
    • Somewhat relevant
    • Not relevant
Concept Extension Results (Accuracy)
• Corpus:
  – 1304 documents from Y! Answers, the Enron mail corpus, and Y! News
  – 2305 entities, 376 of which were extensions
• Judgment labels: OK: correct, US: under-selection, OS: over-selection, BT: bad term
• Without concept extensions, accuracy drops from 95.8% to 81.3%
Concept Vector Results
• Corpus:
  – 352 documents from Y! Finance
  – 2099 concepts detected (with a minimum concept weight of 1.0), judged by 2 editors
  – 1039 judgments on which the editors agreed
Overall Results
• Corpus:
  – 1519 random documents from:
    • Y! Answers
    • Enron mail corpus
    • Y! News
    • Newsgroup postings
    • Product reviews
  – 1586 entities generated
Conclusions and Future Work
• Focused on the quality issues in user-centric entity detection systems
• Argued that quality can be measured via the metrics of accuracy, interestingness, and relevance
• Designed and evaluated algorithms that improve the overall quality by:
  – Using query logs
  – Leveraging both local and global context
• Future work:
  – Increasing coverage of interesting concepts while maintaining high relevance
  – Disambiguation of entity types