April 22, 2004 1
Text Mining: Finding Nuggets in Mountains of Textual Data
Jochen Doerre, Peter Gerstl, Roland Seiffert
IBM Germany, August 1999
Presenter: Tyler Carr
Outline
- Motivation
- Methodology
- Feature Extraction
- Clustering and Categorizing
- Applications
- Exam Questions
Motivation
- 90% of a company’s data cannot be examined with standard data mining:
  - Customer letters
  - E-mail correspondence
  - Phone call recordings
  - Contracts
  - Technical documentation
  - Patents
  - News articles
  - Web pages
Value of Text Mining
- Rapid digestion of large document collections
- Faster than human knowledge brokers
- Objective and customizable analysis
- Automation of tasks
Typical Applications
- Summarizing documents
- Monitoring relations among people, places, and organizations
- Organizing documents by content
- Organizing indices for search and retrieval (keyword finding)
- Retrieving documents by content
Challenges in Text Mining
- Information is in unstructured textual form
- Natural Language (NL) interpretation is years away for computers
- Text Mining deals with huge collections of documents
Two Text Mining Approaches
- Knowledge Discovery
  - Extraction of codified information (features)
- Information Distillation
  - Analysis of the feature distribution
Comparison with Data Mining
Data Mining:
- Identify data sets
- Select features manually
- Prepare data
- Analyze distribution
Text Mining:
- Identify documents
- Extract features
- Select features by algorithm
- Prepare data
- Analyze distribution
Feature Extraction
- “To recognize and classify significant vocabulary items in unrestricted natural language texts.”
- Classes of Vocabulary
  - Proper names
  - Technical phrases
  - Abbreviations and acronyms
  - …
Canonical Forms
- Numbers convert to normal form
  - Four ==> 4
- Dates convert to normal form
- Inflected forms convert to common form
  - Sings, Sang, Sung ==> Sing
- Alternative names convert to explicit form
  - Mr. Carr, Tyler, Presenter ==> Tyler Carr
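The canonical-form mappings above can be sketched as simple table lookups. This is only an illustration: the lookup tables, the `canonicalize` and `normalize_date` names, and the choice of ISO dates as the "normal form" are assumptions made for the example, not part of Intelligent Miner for Text.

```python
from datetime import datetime

# Hypothetical lookup tables for illustration; a real system would need
# far larger lexical resources than these.
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5"}
LEMMAS = {"sings": "sing", "sang": "sing", "sung": "sing"}
ALIASES = {"mr. carr": "Tyler Carr", "tyler": "Tyler Carr", "presenter": "Tyler Carr"}

# Assumed input date layouts; ISO 8601 is an assumed choice of normal form.
DATE_FORMATS = ("%B %d, %Y", "%d %B %Y", "%m/%d/%Y")


def normalize_date(text):
    """Return an ISO date string if the text parses as a date, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def canonicalize(term):
    """Map a surface form to its canonical form, or return it unchanged."""
    stripped = term.strip()
    key = stripped.lower()
    for table in (NUMBER_WORDS, LEMMAS, ALIASES):
        if key in table:
            return table[key]
    return normalize_date(stripped) or stripped


if __name__ == "__main__":
    for term in ["Four", "Sang", "Mr. Carr", "April 22, 2004", "Contracts"]:
        print(term, "==>", canonicalize(term))
```

Terms that match no table and do not parse as a date pass through unchanged, so unknown vocabulary is never altered.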
Feature Extraction Tools
- Linguistically motivated heuristics
- Pattern matching
- Limited amounts of lexical information
  - Part-of-speech information (subject, verb)
- Avoid analyzing too deep (for speed)
  - Does not use huge amounts of lexical info.
  - No in-depth syntactic and semantic analysis
Feature Extraction Example
- Disambiguating Proper Names (Nominator Program)
  - Apply heuristics to strings, instead of interpreting semantics.
  - The unit of context for extraction is a document.
  - The heuristics represent English naming conventions.
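A minimal sketch of what string-level name heuristics could look like, with one document as the unit of context. This is not the Nominator algorithm: the regular expression, the title list, and the `resolve_names` function are assumptions made for illustration. A single capitalized word is accepted as a name variant only if it is part of exactly one multi-word name found in the same document.

```python
import re

# Assumed naming conventions: an optional title, then capitalized words.
TITLE = r"(?:Mr|Mrs|Ms|Dr)\."
NAME_PATTERN = re.compile(rf"(?:{TITLE}\s+)?[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*")


def resolve_names(document):
    """Map each name variant in one document to its fullest form.

    Pure string heuristics, no semantic interpretation. Known limitation:
    a capitalized sentence-initial word directly before a name would be
    absorbed into that name.
    """
    candidates = set()
    for match in NAME_PATTERN.finditer(document):
        # Drop a leading title so "Mr. Carr" is compared as "Carr".
        candidates.add(re.sub(rf"^{TITLE}\s+", "", match.group(0)))

    full_names = sorted(c for c in candidates if len(c.split()) > 1)
    mapping = {name: name for name in full_names}
    for cand in candidates:
        if cand in mapping:
            continue
        owners = [name for name in full_names if cand in name.split()]
        if len(owners) == 1:  # unambiguous within this document
            mapping[cand] = owners[0]
    return mapping


if __name__ == "__main__":
    text = ("Tyler Carr presented the paper. Later, Mr. Carr answered "
            "questions. Tyler thanked the audience.")
    print(resolve_names(text))
```

In the sample text, "Later" is capitalized but matches no multi-word name, so it is discarded without any dictionary lookup.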
Feature Extraction Goals
- Very fast processing to deal with huge amounts of data
- Domain independence for general applicability
Clustering
- Also called Knowledge Discovery
- Fully automatic process
- Partitions a given collection into groups of documents similar in contents
  - Clusters identifiable by feature vectors
  - Provides a set of keywords for each cluster
Two Clustering Engines
- Hierarchical Clustering tool
  - Orders the clusters into a tree reflecting various levels of similarity.
- Binary Relational Clustering tool
  - Produces a flat clustering together with relationships of different strength between the clusters
  - Relationships reflect inter-cluster similarities
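To make the idea of a cluster tree built from feature vectors concrete, here is a small sketch of agglomerative clustering over sparse feature counts. It is not the IBM Hierarchical Clustering tool: cosine similarity, merging by summed vectors, and the `agglomerate` and `keywords` names are all assumptions chosen for a compact example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse feature-count vectors."""
    dot = sum(a[f] * b[f] for f in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def agglomerate(vectors):
    """Repeatedly merge the two most similar clusters into a nested tuple tree.

    Each cluster is a (tree, summed feature vector) pair.
    """
    clusters = [(name, vec) for name, vec in vectors.items()]
    while len(clusters) > 1:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = max(pairs, key=lambda p: cosine(clusters[p[0]][1], clusters[p[1]][1]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


def keywords(vector, k=2):
    """The k most frequent features, used as a cluster's keyword label."""
    return [feature for feature, _ in vector.most_common(k)]


if __name__ == "__main__":
    docs = {
        "a": Counter(billing=2, invoice=1),
        "b": Counter(billing=1, invoice=2),
        "c": Counter(crash=3, login=1),
        "d": Counter(crash=1, login=3),
    }
    print(agglomerate(docs))
    print(keywords(docs["a"] + docs["b"]))
```

The nested tuples play the role of the tree: documents joined at a deeper level are more similar than those joined near the root.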
[Figure: Clustering Model]
Categorization
- Also called Information Distillation
- Topic Categorization Tool
  - Assigns documents to pre-existing categories (“topics” or “themes”)
  - Categories are chosen to match the intended use of the collection
Categorization
- Categories defined by providing a set of sample documents for each category
- Training phase produces a special index, called the categorization schema
- Categorization tool returns set of category names and confidence levels for each document
Categorization
- If confidence is below some threshold, document is set aside for human categorizer
- Tests have shown the Topic Categorization Tool agrees with human categorizers to the same degree as human categorizers agree with one another.
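The train, score, and threshold flow described on the categorization slides can be sketched as follows. This is an assumed stand-in, not the Topic Categorization Tool: the "schema" here is just one summed word-count vector per category, confidence is cosine similarity, and the `train`, `categorize`, and `threshold` names and the 0.3 default are hypothetical.

```python
import math
from collections import Counter


def features(text):
    """Very rough feature extraction: lowercase whitespace tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse feature-count vectors."""
    dot = sum(a[f] * b[f] for f in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def train(samples):
    """Build a toy schema: one summed feature vector per category."""
    schema = {}
    for category, docs in samples.items():
        total = Counter()
        for doc in docs:
            total += features(doc)
        schema[category] = total
    return schema


def categorize(text, schema, threshold=0.3):
    """Return (category, confidence) pairs at or above the threshold.

    An empty list means the document should go to a human categorizer.
    """
    vec = features(text)
    scored = sorted(((cosine(vec, centroid), name)
                     for name, centroid in schema.items()), reverse=True)
    return [(name, score) for score, name in scored if score >= threshold]


if __name__ == "__main__":
    schema = train({
        "billing": ["invoice charged twice", "refund my invoice"],
        "login": ["cannot login password", "password reset login"],
    })
    for message in ["please refund the invoice", "what are your opening hours"]:
        result = categorize(message, schema)
        print(message, "->", result if result else "set aside for human review")
```

Returning every category above the threshold, rather than only the best one, mirrors the slide's description of a set of category names with confidence levels.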
[Figure: Categorization Model]
IBM Intelligent Miner for Text
- Software Development Kit (not full application)
- Contains necessary components for “real text mining”
- Also contains more traditional components:
  - IBM Text Search Engine
  - IBM Web Crawler
  - Drop-in Intranet search solutions
Applications
- Customer Relationship Management application provided by IBM Intelligent Miner for Text, called Customer Relationship Intelligence (CRI)
  - “Help companies better understand what their customers want and what they think about the company itself.”
Customer Intelligence Process
- Take body of communications with customer as input.
- Cluster the documents to identify issues.
- Characterize the clusters to identify the conditions for problems.
- Assign new messages to the appropriate clusters.
Customer Intelligence Usage
- Knowledge Discovery
  - Clustering used to create a structure that can be interpreted
- Information Distillation
  - Refinement and extension of clustering results
    - Interpreting the results
    - Tuning of the clustering process
    - Selecting meaningful clusters
Exam Question #1
Name an example of each of the two main classes of applications of text mining.
- Knowledge Discovery: Discovering a common customer complaint among much feedback.
- Information Distillation: Filtering future comments into pre-defined categories.
Exam Question #2
How does the procedure for text mining differ from the procedure for data mining?
- Adds feature extraction function
- Not feasible to have humans select features
- Highly dimensional, sparsely populated feature vectors
Exam Question #3
In the Nominator program of IBM’s Intelligent Miner for Text, an objective of the design is to enable rapid extraction of names from large amounts of text. How does this decision affect the ability of the program to interpret the semantics of text?
- Does not perform in-depth syntactic or semantic analysis of texts
Thank You
Any Questions?