
April 22, 2004

Text Mining: Finding Nuggets in Mountains of Textual Data

Jochen Doerre, Peter Gerstl, Roland Seiffert

IBM Germany, August 1999

Presenter: Tyler Carr


Outline
  Motivation
  Methodology
  Feature Extraction
  Clustering and Categorizing
  Applications
  Exam Questions


Motivation

90% of a company’s data cannot be analyzed with standard data mining:
  Customer Letters
  E-Mail Correspondence
  Phone Call Recordings
  Contracts
  Technical Documentation
  Patents
  News Articles
  Web Pages


Value of Text Mining
  Rapid digestion of large document collections
  Faster than human knowledge brokers
  Objective and customizable analysis
  Automation of tasks


Typical Applications
  Summarizing documents
  Monitoring relations among people, places, and organizations
  Organizing documents by content
  Organizing indices for search and retrieval (keyword finding)
  Retrieving documents by content

Methodology



Challenges in Text Mining
  Information is in unstructured textual form
  Natural Language (NL) interpretation is years away for computers
  Text Mining deals with huge collections of documents


Two Text Mining Approaches
  Knowledge Discovery
    Extraction of codified information (features)
  Information Distillation
    Analysis of the feature distribution


Comparison with Data Mining
  Data Mining
    Identify data sets
    Select features manually
    Prepare data
    Analyze distribution
  Text Mining
    Identify documents
    Extract features
    Select features by algorithm
    Prepare data
    Analyze distribution
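
The "select features by algorithm" step can be illustrated with a small sketch. It assumes one common heuristic, keeping terms whose document frequency falls inside a band; the slides do not say which selection rule the tools actually use. The function name `select_features` and both thresholds are hypothetical.

```python
from collections import Counter

def select_features(docs, min_df=2, max_df_ratio=0.8):
    """Keep terms found in at least min_df documents but in no more than
    max_df_ratio of all documents (illustrative thresholds, not from the talk)."""
    doc_freq = Counter()
    for text in docs:
        # Count each term once per document.
        doc_freq.update(set(text.lower().split()))
    limit = max_df_ratio * len(docs)
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= limit)

docs = [
    "the printer driver fails",
    "the printer jams often",
    "the invoice was wrong",
    "the invoice arrived late",
]
features = select_features(docs)
print(features)  # ['invoice', 'printer']
```

Terms that occur in every document ("the") or in only one document carry little information for grouping, so no human has to inspect the vocabulary by hand.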

Feature Extraction



Feature Extraction
  “To recognize and classify significant vocabulary items in unrestricted natural language texts.”
  Classes of Vocabulary
    Proper names
    Technical phrases
    Abbreviations and acronyms
    …


Canonical Forms
  Numbers are converted to normal form
    Four ==> 4
  Dates are converted to normal form
  Inflected forms are converted to a common form
    Sings, Sang, Sung ==> Sing
  Alternative names are converted to explicit form
    Mr. Carr, Tyler, Presenter ==> Tyler Carr
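
A minimal sketch of canonicalization, assuming small hand-written lookup tables and assuming ISO 8601 as the normal form for dates (the slides do not name a date format). The tables and the function `canonical_form` are hypothetical; a real extractor would derive these mappings from lexical resources and document context.

```python
from datetime import datetime

# Hypothetical lookup tables, just large enough for the slide's examples.
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4"}
LEMMAS = {"sings": "sing", "sang": "sing", "sung": "sing"}
ALIASES = {"mr. carr": "Tyler Carr", "tyler": "Tyler Carr",
           "presenter": "Tyler Carr"}

def canonical_form(term):
    """Return the canonical form of a vocabulary item, or the term unchanged."""
    key = term.strip().lower()
    for table in (NUMBER_WORDS, LEMMAS, ALIASES):
        if key in table:
            return table[key]
    try:
        # Assumed date normal form: ISO 8601 (YYYY-MM-DD).
        return datetime.strptime(term.strip(), "%B %d, %Y").date().isoformat()
    except ValueError:
        return term.strip()

results = [canonical_form(t) for t in ("Four", "Sang", "Mr. Carr", "April 22, 2004")]
print(results)
```

Mapping every variant to one form means later counting and clustering treat "Sang" and "Sung" as the same feature.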


Feature Extraction Tools
  Linguistically motivated heuristics
  Pattern matching
  Limited amounts of lexical information
    Part-of-speech information (subject, verb)
  Avoid analyzing too deep (for speed)
    Does not use huge amounts of lexical info.
    No in-depth syntactic and semantic analysis


Feature Extraction Example
  Disambiguating Proper Names (Nominator Program)
  Apply heuristics to strings, instead of interpreting semantics.
  The unit of context for extraction is a document.
  The heuristics represent English naming conventions.
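
The idea of applying string heuristics within a single document can be sketched as follows. This is not the Nominator algorithm, whose rules the slides do not describe; it assumes one simple naming convention: a shorter mention refers to the longest mention in the same document that contains all of its words, ignoring titles. All names in the code are hypothetical.

```python
TITLES = {"mr.", "mrs.", "ms.", "dr."}  # illustrative list of English titles

def core_tokens(mention):
    """Words of a mention with titles such as 'Mr.' removed."""
    return [t for t in mention.split() if t.lower() not in TITLES]

def resolve_variants(mentions):
    """Map each name mention found in ONE document to its fullest variant."""
    # Longest names first; ties broken alphabetically for a stable result.
    ordered = sorted(set(mentions), key=lambda m: (-len(core_tokens(m)), m))
    canonical = {}
    for mention in ordered:
        tokens = set(core_tokens(mention))
        # First longer name whose words strictly include this mention's words.
        canonical[mention] = next(
            (c for c in ordered if tokens < set(core_tokens(c))), mention)
    return canonical

mentions = ["Tyler Carr", "Mr. Carr", "Tyler", "Peter Gerstl", "Gerstl"]
resolved = resolve_variants(mentions)
print(resolved["Mr. Carr"])  # Tyler Carr
```

Nothing here interprets meaning, so a role word such as "Presenter" would not be linked to a name by this sketch; that would need additional heuristics.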


Feature Extraction Goals
  Very fast processing to deal with huge amounts of data
  Domain independence for general applicability

Clustering and Categorization



Clustering
  Also called Knowledge Discovery
  Fully automatic process
  Partitions a given collection into groups of documents similar in contents
  Clusters identifiable by feature vectors
    Provides a set of keywords for each cluster
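
To make "clusters identifiable by feature vectors" concrete, here is a minimal sketch. It assumes term-count vectors, cosine similarity, and a greedy single-pass grouping with an arbitrary threshold of 0.3; none of these choices come from the talk, and the function names are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.3):
    """Greedy clustering: join the first similar cluster, else start a new one."""
    clusters = []  # list of (summed feature vector, [document indices])
    for i, text in enumerate(docs):
        vec = Counter(text.lower().split())
        for centroid, members in clusters:
            if cosine(vec, centroid) >= threshold:
                centroid.update(vec)   # fold the document into the cluster
                members.append(i)
                break
        else:
            clusters.append((Counter(vec), [i]))
    return clusters

def keywords(centroid, k=2):
    """The k most frequent terms of a cluster, ties broken alphabetically."""
    ranked = sorted(centroid.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

docs = ["printer driver crash", "printer driver update",
        "invoice payment late", "invoice payment missing"]
clusters = cluster(docs)
for centroid, members in clusters:
    print(members, keywords(centroid))
```

Each cluster is summarized by its summed feature vector, and the highest-weighted terms of that vector serve as the cluster's keywords.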


Two Clustering Engines
  Hierarchical Clustering tool
    Orders the clusters into a tree reflecting various levels of similarity.
  Binary Relational Clustering tool
    Produces a flat clustering together with relationships of different strength between the clusters
    Relationships reflect inter-cluster similarities
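
The tree produced by a hierarchical approach can be sketched with plain agglomerative clustering. The example assumes average-link merging over cosine similarity of term counts, which is a textbook choice and not necessarily what the Hierarchical Clustering tool does; the Binary Relational tool is not modeled. Function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def leaves(node):
    """Document indices under a tree node (a leaf is an int)."""
    return [node] if isinstance(node, int) else leaves(node[0]) + leaves(node[1])

def hierarchical_cluster(docs):
    """Repeatedly merge the two most similar nodes; return a nested-tuple tree."""
    vectors = [Counter(d.lower().split()) for d in docs]

    def similarity(x, y):
        # Average-link: mean similarity over all cross-cluster document pairs.
        pairs = [(i, j) for i in leaves(x) for j in leaves(y)]
        return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

    nodes = list(range(len(docs)))
    while len(nodes) > 1:
        x, y = max(combinations(nodes, 2), key=lambda p: similarity(*p))
        nodes = [n for n in nodes if n != x and n != y] + [(x, y)]
    return nodes[0]

docs = ["printer driver crash", "printer driver update",
        "printer paper jam", "invoice payment late"]
tree = hierarchical_cluster(docs)
print(tree)  # (3, (2, (0, 1)))
```

The nesting depth reflects the level of similarity: the two driver documents merge first, the third printer document joins them next, and the unrelated invoice document joins only at the root.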


Clustering Model


Categorization
  Also called Information Distillation
  Topic Categorization Tool
  Assigns documents to pre-existing categories (“topics” or “themes”)
  Categories are chosen to match the intended use of the collection


Categorization
  Categories are defined by providing a set of sample documents for each category
  The training phase produces a special index, called the categorization schema
  The categorization tool returns a set of category names and confidence levels for each document


Categorization
  If the confidence is below some threshold, the document is set aside for a human categorizer
  Tests have shown that the Topic Categorization Tool agrees with human categorizers to the same degree as human categorizers agree with one another.
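
The train, categorize, and set-aside flow described on the categorization slides can be sketched as below. The sketch assumes the schema is one summed term vector per category, that cosine similarity serves as the confidence level, and that 0.5 is the review threshold. All of these are illustrative assumptions; the slides do not describe the internals of the Topic Categorization Tool.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def train(samples):
    """Build a toy 'categorization schema': one term vector per category."""
    return {category: sum((Counter(d.lower().split()) for d in docs), Counter())
            for category, docs in samples.items()}

def categorize(schema, text, threshold=0.5):
    """Return (category, confidence) pairs, best first, plus a review flag."""
    vec = Counter(text.lower().split())
    ranked = sorted(((cat, cosine(vec, proto)) for cat, proto in schema.items()),
                    key=lambda pair: pair[1], reverse=True)
    needs_review = ranked[0][1] < threshold  # set aside for a human categorizer
    return ranked, needs_review

samples = {
    "billing": ["invoice payment late", "invoice amount wrong"],
    "hardware": ["printer driver crash", "printer paper jam"],
}
schema = train(samples)
ranked, needs_review = categorize(schema, "invoice payment missing")
print(ranked[0][0], needs_review)
unknown_ranked, unknown_review = categorize(schema, "please call me back")
print(unknown_review)
```

A document that shares no vocabulary with any category gets zero confidence everywhere and is flagged for a human instead of being forced into a category.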


Categorization Model

Applications



IBM Intelligent Miner for Text
  Software Development Kit (not full application)
  Contains necessary components for “real text mining”
  Also contains more traditional components:
    IBM Text Search Engine
    IBM Web Crawler
    Drop-in Intranet search solutions


Applications
  Customer Relationship Management application provided by IBM Intelligent Miner for Text, called Customer Relationship Intelligence (CRI)
  “Help companies better understand what their customers want and what they think about the company itself.”


Customer Intelligence Process
  Take the body of communications with the customer as input.
  Cluster the documents to identify issues.
  Characterize the clusters to identify the conditions for problems.
  Assign new messages to the appropriate clusters.


Customer Intelligence Usage
  Knowledge Discovery
    Clustering used to create a structure that can be interpreted
  Information Distillation
    Refinement and extension of clustering results
      Interpreting the results
      Tuning of the clustering process
      Selecting meaningful clusters

Exam Questions



Exam Question #1
  Name an example of each of the two main classes of applications of text mining.
    Knowledge Discovery: Discovering a common customer complaint in a large body of feedback.
    Information Distillation: Filtering future comments into pre-defined categories.


Exam Question #2
  How does the procedure for text mining differ from the procedure for data mining?
    Adds feature extraction function
    Not feasible to have humans select features
    Highly dimensional, sparsely populated feature vectors


Exam Question #3
  In the Nominator program of IBM’s Intelligent Miner for Text, an objective of the design is to enable rapid extraction of names from large amounts of text. How does this decision affect the ability of the program to interpret the semantics of text?
    The program does not perform in-depth syntactic or semantic analysis of texts.


Thank You

Any Questions?
