Text Classification and Named Entities for New Event Detection Giridhar Kumaran and James Allan University of Massachusetts Amherst SIGIR 2004

Text Classification and Named Entities for New Event Detection

Giridhar Kumaran and James Allan

University of Massachusetts Amherst

SIGIR 2004

Introduction

New Event Detection (NED) is one of the tasks in the Topic Detection and Tracking (TDT) program (http://www.nist.gov/speech/tests/tdt/index.htm).
The vector space model has achieved the best results to date.
Focus: better similarity metrics and document representations.

Previous Research

Increasing the number of features.
Weighting event-level features more heavily than more general topic-level features.
Lexical chains (using WordNet).
NED and tracking system.
Named entities re-weighted and a stop list created for each topic.
Incremental TF-IDF.

NED Evaluation

The NED algorithm assigns each story a confidence score between 0 and 1, either immediately or after a look-ahead.
0 means new, 1 means old.
Choose the threshold that results in the least cost.
A Detection Error Tradeoff (DET) curve is used to represent the miss and false alarm rates.
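The threshold selection described above can be sketched as follows. This is a minimal illustration, not the official evaluation tool: the function names are hypothetical, and the cost constants (C_miss = 1.0, C_fa = 0.1, P_target = 0.02) are assumed here as the values commonly quoted for TDT evaluations; the slide does not state them.

```python
from typing import List, Tuple


def detection_cost(p_miss: float, p_fa: float,
                   c_miss: float = 1.0, c_fa: float = 0.1,
                   p_target: float = 0.02) -> float:
    """Weighted sum of miss and false alarm rates (constants are assumptions)."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)


def best_threshold(scores: List[float], is_new: List[bool]) -> Tuple[float, float]:
    """Sweep candidate thresholds and return (threshold, cost) with the least cost.

    A story is declared new when its score is below the threshold
    (0 means new, 1 means old).
    """
    n_new = sum(is_new)
    n_old = len(is_new) - n_new
    candidates = sorted(set(scores)) + [float("inf")]
    best: Tuple[float, float] = (candidates[0], float("inf"))
    for t in candidates:
        # Miss: a truly new story that was declared old.
        misses = sum(1 for s, new in zip(scores, is_new) if new and s >= t)
        # False alarm: an old story that was declared new.
        false_alarms = sum(1 for s, new in zip(scores, is_new) if not new and s < t)
        p_miss = misses / n_new if n_new else 0.0
        p_fa = false_alarms / n_old if n_old else 0.0
        cost = detection_cost(p_miss, p_fa)
        if cost < best[1]:
            best = (t, cost)
    return best
```

Each (p_miss, p_fa) pair visited in the sweep is one point on a DET curve, so the same loop could also collect the points for plotting.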

Basic Model

Cosine similarity
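As a rough sketch of the basic model, the snippet below computes cosine similarity between sparse term-weight vectors. The function names are hypothetical, and scoring a story by its highest similarity to any earlier story is an assumption about the basic setup; the slide itself names only cosine similarity.

```python
import math
from typing import Dict, List


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def old_story_confidence(story: Dict[str, float],
                         previous: List[Dict[str, float]]) -> float:
    """Highest similarity to any earlier story (0 means new, 1 means old)."""
    return max((cosine(story, p) for p in previous), default=0.0)
```

With no earlier stories the confidence is 0, so the first story in a stream is always treated as new.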

Modified Model

Cosine similarity is good, but it makes mistakes.
The level within the hierarchy of events is of interest.
Look into other parameters such as the category, the overlap of named entities, and the overlap of non-named entities.
Develop simple rules that reflect the questions a human being would ask before deciding whether a story is new or old.

Modification to Document Model

Terms: in health care stories, topic terms (drugs, cost, coverage, plan, prescription, ...) vs. locations and individuals.
Solution: first place stories into broad categories, then compute term weights.
Use the topic types specified by the LDC.
Classify stories according to LDC topics.
Train on TDT2, test on TDT3.
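One way to picture "categories first, then term weights" is to keep document-frequency statistics per category, so that a term common to one category is down-weighted only there. The sketch below is illustrative: the function names and the smoothed IDF formula are assumptions, not the weighting used in the paper.

```python
import math
from collections import Counter, defaultdict
from typing import DefaultDict, Iterable, List, Tuple


def category_doc_freqs(
    stories: Iterable[Tuple[str, List[str]]]
) -> Tuple[DefaultDict[str, Counter], Counter]:
    """Count, per category, how many stories contain each term."""
    doc_freq: DefaultDict[str, Counter] = defaultdict(Counter)
    n_docs: Counter = Counter()
    for category, tokens in stories:
        n_docs[category] += 1
        doc_freq[category].update(set(tokens))  # each term counted once per story
    return doc_freq, n_docs


def category_idf(term: str, category: str,
                 doc_freq: DefaultDict[str, Counter], n_docs: Counter) -> float:
    """Smoothed IDF computed from one category's statistics (assumed formula)."""
    return math.log((n_docs[category] + 1) / (doc_freq[category][term] + 0.5))
```

In a small health care collection where every story mentions "drugs", that term gets a low weight, while a location that appears in a single story keeps a higher one.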

Modification to Similarity Metric

Isolate the named entities and treat them preferentially (nothing new).
Named entities are a double-edged sword: deciding when to use them can be tricky.

Multiple document representations

Alpha: all terms
Beta: only named entities
Gamma: non-named-entity terms
Named entity types: Event, GPE (Geographical and Political Entities), Language, Location, Nationality, Organization, Person, Cardinal, Ordinal, Date, and Time.
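The three representations can be sketched as a simple split over entity-tagged tokens. This assumes an upstream tagger has already labeled each token with one of the types listed above (or some other label for non-entities); the function name and the tag strings are hypothetical.

```python
from typing import List, Tuple

NAMED_ENTITY_TYPES = {
    "Event", "GPE", "Language", "Location", "Nationality", "Organization",
    "Person", "Cardinal", "Ordinal", "Date", "Time",
}


def split_representations(
    tagged_tokens: List[Tuple[str, str]]
) -> Tuple[List[str], List[str], List[str]]:
    """Return (alpha, beta, gamma) token lists for one story."""
    alpha = [tok for tok, _ in tagged_tokens]                                    # all terms
    beta = [tok for tok, tag in tagged_tokens if tag in NAMED_ENTITY_TYPES]      # named entities only
    gamma = [tok for tok, tag in tagged_tokens if tag not in NAMED_ENTITY_TYPES] # everything else
    return alpha, beta, gamma
```

Each list can then be turned into its own term-weight vector, giving three similarity scores per story pair.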

Election News

The gamma score is below 0.2, while the beta score is spread out (2 graphs): use alpha + gamma.

Legal/Criminal Cases

Gamma below 0.4, beta above 0.4: use beta + alpha.

Financial News

Cannot decide between beta and gamma: use alpha only.
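The per-category choices for the three categories above can be written down as a lookup table. This sketch records only which scores are consulted for each category; how they are combined into a final decision is not given on the slides, so no combination rule is shown. The names and the alpha-only fallback for unlisted categories are assumptions.

```python
from typing import Dict, Tuple

# Which representation scores to consult, per category (from the slides).
REPRESENTATIONS_BY_CATEGORY: Dict[str, Tuple[str, ...]] = {
    "Election News": ("alpha", "gamma"),
    "Legal/Criminal Cases": ("alpha", "beta"),
    "Financial News": ("alpha",),
}


def scores_to_consult(category: str, scores: Dict[str, float]) -> Dict[str, float]:
    """Keep only the scores relevant to the story's category.

    Unlisted categories fall back to alpha only (an assumption).
    """
    names = REPRESENTATIONS_BY_CATEGORY.get(category, ("alpha",))
    return {name: scores[name] for name in names}
```

For an election story with scores for all three representations, only the alpha and gamma scores are passed on to the category's rule.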

Term scores and categories

(Table 4)

Experimental Results

The results seem to be worse on TDT4.
TDT4 may contain topics not conducive to named-entity-based modification.

DET Curve of TDT3

Focus on the high-accuracy region.

DET Curve of TDT4

Conclusion and Future Work

Presented a new multi-stage system for NED.
Showed a way to harness the named entities in documents and illustrated their utility in different situations.
Future work: improving the named entity rules.
Future work: different ways to develop stop lists for different categories.
Future work: temporal information.