IBM Intelligent Miner for Text

page 1

Copyrighted material John Tullis

IBM Intelligent Miner for Text

John Tullis, DePaul [email protected]

page 2


IBM Intelligent Miner for Text

• A knowledge-discovery software development toolkit to build advanced text-mining and text-search applications
• A NetQuestion Solution to construct Internet/intranet text-search solutions

Components:
• Text Analysis Tools
• Text Search Engine
• Web Crawler Package
• NetQuestion Solution

page 3


Intelligent Miner for Text

For companies of any size and for different industries:
• Media
• Petroleum
• Banking
• Education
• Government
• Insurance

page 4


Potential Applications

Intelligent Miner for Text can be applied to:
• Customer complaints analysis
• Newswire analysis
• Intelligent Website
• Opinion survey classification
• Competitive intelligence
• Corporate image analysis

page 5


Intelligent Miner for Text: Platforms supported

Platform                 | Text Analysis Tools | Text Search Engine Server | Text Search Engine Client | Text Search Engine Java GUI / JavaBeans | Web Crawler Package | NetQ Solution
AIX 4.3                  | Y | Y | Y | Y | Y | Y
Solaris 2.5.1            | Y | Y | Y | Y | Y | Y
Win NT 4.0 SP3           | Y | Y | Y | Y | Y | Y
OS/390 V2R4, V2R5, V2R6  | Y | Y | Y | Y | Y | Y

page 6


Reference customers & Success stories

Reference Customers
• FinanceWise (search engine for financial content on the Internet): www.financewise.com
• IBM web sites (incl. 2000 IBM intranet sites): www.ibm.com
• Sueddeutsche Zeitung (classified ads on the Sueddeutsche Zeitung Web site): www.sueddeutsche.de
• SearchCafe (Business Partner): www.search-cafe.com

Success stories available at www.software.ibm.com/iminer/fortext

page 7


Component: Text Analysis Tools

page 8


Functionality
• Language Identification
• Clustering of document collection
  - hierarchical clustering
  - relational clustering
• Categorization/Classification of document collection
• Feature Extraction
• Summarization

page 9


Text Analysis Tools

To automate tasks previously done manually:
• automatically identifies the language of a document
• automatically groups related documents based on their content, without requiring predefined classes
• automatically assigns documents to one or more user-defined categories
• automatically recognizes significant items in text, such as names, technical terms, and abbreviations
• automatically extracts sentences from a document to create a document summary

page 10


Text Analysis

• Text analysis tools are available in a command line format structured to function like common UNIX or DOS command line formats.

• Text analysis tools can be used individually or in a combined mode depending on the required task.

• Configuration files allow document format flexibility and performance tuning for text searches. Command line switches provide additional flexibility by permitting the user to set runtime parameters.

• The documents need to be provided in plain text format. For other formats, conversion tools can be obtained from third parties such as KEYpak (http://www.keypak.com/).

page 11


[Diagram: the Text Analysis Tools: Feature Extraction, Language Identification, Classification, Clustering (2), Summarization]

page 12


Text Analysis Tools: Feature Extraction

• To recognize significant vocabulary items
• To recognize all names referring to a single entity
• To provide the location of all person names, places and organizations in a text
• To find multi-word terms that have a meaning of their own
• To find abbreviations introduced in a text and link them with their full forms
• To recognize named relationships

page 13


Text Analysis Tools: Feature Extraction

• Produces statistics for each vocabulary item.
• Associates terms with canonical forms (e.g. "related" is associated with the term "relate").
• Feature extraction can be used as a preprocessor for the Clustering utility to bias (or control) clustering activities.
• Feature extraction can be run in two modes:
  1) Lookup mode, which refers to a schema generated from a training set and produces statistics for vocabulary items as they relate to the rest of the schema as well as within the document.
  2) Exploration mode, which requires no training and yields textual data statistics for vocabulary items as they relate within the scope of the document(s) specified.

page 14


• Several classes of significant vocabulary can be recognized
• Names are categorized
• Significant concepts are detected automatically
• Automatic keywording: the most significant terminology in the document

page 15


Feature Extraction - statistics & analysis

• The application shown here illustrates how one can use the statistics and analysis produced by the feature extraction.
• Selected items within a document are highlighted by using the location information in the feature extract (all vocabulary terms carry location information for this purpose).
• Selected categories can be filtered upon.
• Feature extraction produces a significance measure for each vocabulary item, which allows prioritization of keywords within the scope of individual documents or entire collections.
• This is a sample application which is not included in the software installation.

page 16


• "Terms" include multi-word phrases whose meaning is much more than that of the individual words
• Multi-word phrases are the vocabulary in which concepts are expressed

page 17


Feature Extraction - statistics & analysis

• Recognizes multi-word phrases by pattern recognition: if a two-word pattern appears with an acceptable frequency, it is included as an extracted vocabulary item in the output.
• More heuristics are applied than mentioned here, but this is generally the textual processing that occurs.
• Concepts can be FORMULATED from the multi-word terms. The feature extraction utility assists in emphasizing prevalent multi-word terms.
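
The frequency idea described on this slide can be sketched in a few lines of Python. This is only an illustration of counting two-word patterns and keeping the frequent ones, not the product's actual algorithm; the function name, the stop word list, and the `min_count` threshold are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop word list; the product's real list is not described in the slides.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}

def frequent_word_pairs(text, min_count=2):
    """Return adjacent two-word patterns that occur at least min_count times."""
    words = re.findall(r"[a-z]+", text.lower())
    pairs = Counter(
        (first, second)
        for first, second in zip(words, words[1:])
        if first not in STOP_WORDS and second not in STOP_WORDS
    )
    return {" ".join(pair): count for pair, count in pairs.items() if count >= min_count}

sample = (
    "Text mining finds patterns in text. "
    "Text mining tools support search engine builders, "
    "and a search engine can reuse text mining output."
)
print(frequent_word_pairs(sample))
```

A real extractor would apply further heuristics, as the slide notes; raising `min_count` is the simplest way to make the sketch more selective.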

page 18



page 19


Language Identification

• Given a document, discover automatically the language(s) in which the document is written
• It can be used to
  - restrict search results by languages
  - organize the crawls by languages
  - route documents to language translators

page 20


Language Identification

• A 16-language dictionary is shipped with the Intelligent Miner for Text to be used by the Language Identification utility.

• The Language Identification utility also comes with a utility that can be used to add to the shipped dictionary file to extend language identification. (You can even invent your own language and add it to the dictionary!)

• Documents can be analyzed for language content: the output of Language Identification can report multiple degrees of language content in one pass (e.g. Document ABC has 75% English, 20% German, etc.). This is possible using a command line option.

• Allows further document organization by language and adds a degree of internationalization to applications.
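
As a rough illustration of dictionary-based identification with per-language percentages, the following Python sketch counts how many words of a document appear in each language's word list. The tiny word lists, the function name, and the scoring rule are assumptions for the example; the shipped dictionary format and the utility's real method are not described in these slides.

```python
import re
from collections import Counter

# Tiny hypothetical dictionaries; the shipped dictionary file is far larger.
LANGUAGE_WORDS = {
    "English": {"the", "and", "is", "of", "to", "in"},
    "German": {"der", "die", "das", "und", "ist", "nicht"},
}

def language_shares(text):
    """Return the percentage of recognized words attributed to each language."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    hits = Counter()
    for word in words:
        for language, vocabulary in LANGUAGE_WORDS.items():
            if word in vocabulary:
                hits[language] += 1
    total = sum(hits.values())
    if total == 0:
        return {}
    return {language: round(100 * count / total) for language, count in hits.items()}

print(language_shares("The cat is in the house and the dog; der Hund ist da"))
```

Adding a language to this sketch means adding one more entry to `LANGUAGE_WORDS`, which mirrors the idea of extending the dictionary file.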

page 21



page 22


Categorization/Classification

• Given a defined taxonomy, it can assign documents to preexisting categories
• Utilizes feature extraction capacities to do document comparisons efficiently
• Two stages:
  - training using sample documents
  - category assignment

page 23


Categorization/Classification

• Users determine the taxonomy for organizing the documents into topics.

• Users create training sets to define categories and use the supplied training utility.

• Each document is analyzed and a rank value assigned as it relates to each category.

• A command line switch allows the user to display varying numbers of categories with the document's associated rank value.

• REMEMBER: The categories are predefined by the user.
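
The two stages (training, then category assignment with a rank value per category) can be illustrated with a small Python sketch. It assumes a simple word-count profile per category and cosine similarity as the rank value; the product's actual training utility and ranking formula are not given in the slides, and all names below are hypothetical.

```python
import math
import re
from collections import Counter

def word_counts(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

def train(training_sets):
    """Stage 1: build one word-count profile per user-defined category."""
    return {
        category: sum((word_counts(doc) for doc in docs), Counter())
        for category, docs in training_sets.items()
    }

def cosine(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def categorize(document, profiles, top_n=2):
    """Stage 2: rank the document against every category, best first."""
    counts = word_counts(document)
    ranked = sorted(
        ((cosine(counts, profile), category) for category, profile in profiles.items()),
        reverse=True,
    )
    return [(category, round(score, 2)) for score, category in ranked[:top_n]]

profiles = train({
    "banking": ["loan interest rate account", "bank account credit loan"],
    "insurance": ["policy claim premium", "claim damage policy cover"],
})
print(categorize("my loan account has a high interest rate", profiles))
```

The `top_n` argument plays the role of the command line switch that controls how many categories are displayed with their rank values.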

page 24


Categorization: Solution Example

page 25



page 26


Clustering

• Functions to automatically group related documents based on their content, without requiring predefined classes
• Objects within a group are more similar to each other than to members of any other group
• Two approaches: hierarchical clustering and binary relational clustering

page 27


Clustering - Details

Preprocessing steps
• Analyze data input stream and divide it into individual textual components to be used for clustering
• Extract portions of individual textual components to be used for clustering (uses Feature Extraction as a preprocessor)
• Customize stop word list

Hierarchical clustering
• Structure document collection using lexical affinity based on similarity function
• Build clustering tree showing relationships between clusters of documents of varying granularity

page 28


Clustering - Details

Slicing
• Customize tree by applying adjustable thresholds to reduce complexity and zoom in on concepts of interest
• Use default threshold values for specific document collection
• Note: slicing allows merging similar clusters into a single cluster.

Clustering Output Formats
• HTML file viewable by browser
• Textual description to be parsed (in the format of a tree)
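
To make the interplay of a similarity function and a threshold concrete, here is a minimal Python sketch of bottom-up clustering that stops merging once no pair of clusters reaches the threshold. It is a simplification under stated assumptions: Jaccard overlap of term sets stands in for the lexical affinity measure, the 0 to 1000 scale is chosen for illustration, and the threshold is applied while merging rather than to a finished tree. None of the names come from the product.

```python
import re

def terms(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def similarity(a, b):
    """Jaccard overlap of two term sets, scaled to 0..1000 (an assumed measure)."""
    return round(1000 * len(a & b) / len(a | b)) if a | b else 0

def cluster(documents, threshold):
    """Merge the most similar pair of clusters until none reaches the threshold."""
    clusters = [([name], terms(text)) for name, text in documents.items()]
    while len(clusters) > 1:
        score, i, j = max(
            (similarity(clusters[i][1], clusters[j][1]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if score < threshold:
            break
        names = clusters[i][0] + clusters[j][0]
        merged_terms = clusters[i][1] | clusters[j][1]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((names, merged_terms))
    return sorted(sorted(names) for names, _ in clusters)

docs = {
    "a": "interest rate loan",
    "b": "loan interest account",
    "c": "claim policy premium",
    "d": "policy claim damage",
}
print(cluster(docs, threshold=300))
```

Lowering the threshold merges more documents into fewer, coarser clusters; raising it keeps them apart, which is the effect the adjustable thresholds are meant to give.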

page 29


Hierarchical Clustering - Visualization Example

page 30


Clustering - Details

• This is a sample application which shows the use of the clustering results in an HTML format. This application is not shipped with the software.

• The HTML output can be configured to place actual document paths in the display on the browser so users may easily view the documents which were clustered RIGHT FROM THE BROWSER.

• Each cluster has a label generated from three two-word pairings, which are its most common lexical affinities.

• Similarity values in the application are shown as percentages. These are normalized: the underlying similarity values range from 0 to 1000.

page 31


Categorization: Comparison to Clustering

In clustering, document collections are processed and grouped into dynamically generated clusters.

In categorization, document collections are processed and grouped into predetermined groupings based on a taxonomy generated with training sets.

[Diagram labels, clustering side: Document Collection, Clustering Utility, Cluster1, Cluster2, Cluster3, Cluster4]
[Diagram labels, categorization side: Document Collection, Category1 Training Collection, Trainer, Categorizer, Cat1, Cat2, Cat3, Cat4]

page 32



page 33


Summarization

Extracts sentences from a document to create a document summary

Sentence selection is based on document structure and ranking of extracted features
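
A minimal Python sketch of extractive summarization along these lines follows. It assumes word frequency as a stand-in for the ranking of extracted features and a bonus for the first sentence as a stand-in for document structure; both choices, and the function name, are illustrative assumptions rather than the product's method.

```python
import re
from collections import Counter

def summarize(text, max_sentences=2):
    """Pick the highest-scoring sentences and return them in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    frequency = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(index):
        words = re.findall(r"[a-z]+", sentences[index].lower())
        feature_score = sum(frequency[w] for w in words) / max(len(words), 1)
        position_bonus = 1.0 if index == 0 else 0.0  # assumed structural cue
        return feature_score + position_bonus

    best = sorted(range(len(sentences)), key=score, reverse=True)[:max_sentences]
    return " ".join(sentences[i] for i in sorted(best))

article = (
    "Text mining extracts knowledge from text. "
    "The weather was pleasant. "
    "Mining text at scale needs a search engine."
)
print(summarize(article))
```

Because the summary is built only from sentences already in the document, it never introduces wording that the author did not write.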

page 34


Component: Text Search Engine

page 35


Text Search Engine

Fuzzy search

Hybrid queries

Free-text queries

Boolean queries

Synonyms search

page 36


Text Search Engine

Search Engine
• offers multiple search paradigms: Boolean, free text, fuzzy, hybrid, etc.
• supports linguistic analysis for documents in 21 languages including Arabic and Hebrew
• features Boolean queries, precise term search and fuzzy search for 4 DBCS languages

Mining Functions
• to extract key features in text
• to cluster result list
• to refine queries

Integrated in IBM DB2 Digital Library and IBM DB2 UDB Text Extender

page 37


Text Search Engine

• A user can refine searches, meaning that they can reuse previous search result sets to perform additional searches.

• Multilingual linguistic analysis performed:
  - basic text analysis (recognizing terms, normalizing terms, recognizing sentence boundaries)
  - reducing terms to their base form
  - stop word filtering
  - decomposition (splitting compound terms)

page 38


Basic Text Search Engine functions

Included as part of the basic functional set in the Text Search Engine:
• Precise index
• ngram index
• linguistic index
• 21 SBCS languages
• 4 DBCS languages
• relevance ranking
• Boolean queries
• free text queries
• fuzzy and phonetical searches
• thesaurus support

page 39


Text Search Engine: Details

• Document support for single byte character set languages
• Document support for double byte character set languages
• Linguistic search:
  - Dictionaries and synonym lists for SBCS languages
  - Terms are reduced to their base form, terms are decomposed, terms are normalized to standard form
• Boolean query:
  - Operators: AND, NOT, OR
• Natural language query/free text query:
  - To formulate a query in natural language
• Hybrid query:
  - To combine a natural language query with a Boolean search term
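
The three Boolean operators map naturally onto set operations over an inverted index. The Python sketch below illustrates that general idea only; the index layout, function name, and sample documents are assumptions and do not describe the Text Search Engine's internal index types.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc_id)
    return index

documents = {
    1: "DB2 database search",
    2: "database crawler",
    3: "text search engine",
}
index = build_index(documents)

both = index["database"] & index["search"]      # database AND search
either = index["database"] | index["search"]    # database OR search
without = index["search"] - index["crawler"]    # search NOT crawler
print(sorted(both), sorted(either), sorted(without))
```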

page 40


Text Search Engine: Details

• Fuzzy query:
  - To find misspelled words: TOYOTA/TOYOTTA, DATABASE/DATABSAE
• Phonetical query:
  - Technique: remove the vowel(s) from the search term and replace them with masking characters, eliminate duplicate consonants
  - To search for similar-sounding words: COLOR/COLOUR, SMITH/SMYTH, JANET/JEANNETTE ...
• Wildcard support for Boolean queries:
  - Front, middle and end masking for word and character masking
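
The phonetical technique named above (vowels replaced by masking characters, duplicate consonants eliminated) can be sketched as follows. Several details are assumptions made so the example works on the slide's word pairs: Y is treated as a vowel, a run of vowels becomes a single `*` mask that matches any run of characters, and two words count as similar if either one matches the other's masked pattern. The engine's real rules are not spelled out in the slides.

```python
from fnmatch import fnmatchcase

VOWELS = set("AEIOUY")  # assumption: Y is treated as a vowel
MASK = "*"              # assumed masking character: matches any run of characters

def phonetic_pattern(term):
    """Replace each run of vowels with one mask and drop doubled consonants."""
    pattern = []
    for ch in term.upper():
        if ch in VOWELS:
            if not pattern or pattern[-1] != MASK:
                pattern.append(MASK)
        elif not pattern or pattern[-1] != ch:
            pattern.append(ch)
    return "".join(pattern)

def sounds_alike(a, b):
    """True if either word matches the masked pattern of the other."""
    a, b = a.upper(), b.upper()
    return fnmatchcase(b, phonetic_pattern(a)) or fnmatchcase(a, phonetic_pattern(b))

for pair in [("COLOR", "COLOUR"), ("SMITH", "SMYTH"), ("JANET", "JEANNETTE")]:
    print(pair, phonetic_pattern(pair[0]), sounds_alike(*pair))
```

Matching in both directions is a design choice of this sketch: the pattern for JEANNETTE ends in a mask and so accepts JANET, while the pattern for JANET alone would not accept the longer name.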

page 41


Text Search Engine: Even more details!

• Section support
  - Able to define a section of a document
  - Restrict the search to given sections
  - Example: define a section called Summary, then limit the search scope to the Summary section
• Thesaurus support
  - for all index types and many languages
  - ngram index thesaurus (workstation only)
  - Synonyms and broader/narrower terms
  - DBCS language synonym support
  - Not supported for BiDi languages or Russian

page 42


Text Search Engine: Text Mining Functions

• Provides text mining functions for English documents
  - Feature extraction
  - Organize result list
• Supports query refinement method for English documents
  - User assigns value to single documents

page 43


Text Search Engine: Query refinement example

page 44


Query Refinement Example

• This is a snapshot of the Java GUI which is shipped with Intelligent Miner for Text. The source code and instructions are shipped, and the GUI must be compiled by the end user to be operational.

• Interacts with the TextMiner Java server.

• Composed of JavaBeans which are shipped with Intelligent Miner for Text. The Beans can also be built and integrated into other applications to interact with Intelligent Miner for Text.

• The Java GUI provides a "ready-to-go" search GUI to interact with the Advanced Search engine. Users can perform various levels of queries and even browse the documents themselves by double-clicking in the window.

• Users must use a fully Java-enabled browser to run this pure Java applet.

page 45


Where to find the Text Search Engine functions

Basic functions
• S/390 Text Search Download for OS/390 V2.4 - V2.6
• IM4T V2.3 workstations

Extended functions (result list clustering, relevance feedback/query refinement, feature index)
• IM4T V2.3 for OS/390
• IM4T V2.3 for Workstations

page 46


Component: Java & JavaBeans

page 47


Java Components

• Java Search GUI - fully operational, NLS enabled
• JavaBeans for Rapid Application Development
  - Search
  - Administration
• Source is available and intended to be used as a 'starter kit'
• Works with the Text Search Engine

page 48


Java Components - Details

• GUI Enhancements - Enhanced error recovery, help
• Use with Netscape and MS Internet Explorer
  - Internet Explorer 3.02 and 4.0 for NT
  - Internet Explorer 4.0 for Win95/98
  - Netscape Navigator 3.0/4.0 for Win95/98/NT
  - Netscape Navigator 3.0/4.0 Solaris/SPARC
  - Netscape Navigator 3.0 for Solaris/x86
  - Supported via plugin found at http://java.sun.com/products/plugin/1.1.1/index.html
• Sun's HotJava Browser

page 49


Component: WebCrawler

page 50


Web Crawler

• Is a robot used to collect HTML pages for indexing
• Customizable as to which HTML links are to be crawled (include and exclude patterns ...)
• Results are stored
  - Data objects on AIX/NT file systems
  - Metadata in DB2
• Parallel crawling, results combined
• HTML page change frequency used as revisiting factor
• External subsystems can be notified of web changes detected by the crawler
• Create individual crawler using crawler toolkit

page 51


Web Crawler details

• Uses regular expression configuration files to filter or retain crawled URLs.

• The data objects are the actual URLs or documents. The size and type of the URLs to be stored are also configurable using the provided configuration file structure.

• Storage is scalable by mounting disk storage to file system storage locations.

• Multiple crawlers can be run at once. The only known limitation is physical machine processing and storage capacities.
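
Filtering URLs with include and exclude regular expressions can be illustrated with the short Python sketch below. The patterns, the host name, and the function name are hypothetical; the crawler's real configuration file syntax is not shown in these slides.

```python
import re

# Hypothetical patterns; the real configuration file syntax is not shown in the slides.
INCLUDE_PATTERNS = [re.compile(r"^https?://www\.example\.com/")]
EXCLUDE_PATTERNS = [re.compile(r"\.(gif|jpg|zip)$"), re.compile(r"/cgi-bin/")]

def should_crawl(url):
    """Keep a URL if it matches an include pattern and no exclude pattern."""
    included = any(p.search(url) for p in INCLUDE_PATTERNS)
    excluded = any(p.search(url) for p in EXCLUDE_PATTERNS)
    return included and not excluded

for url in [
    "http://www.example.com/docs/index.html",
    "http://www.example.com/logo.gif",
    "http://other.example.org/index.html",
]:
    print(url, should_crawl(url))
```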

page 52


Web Crawler details

• Crawlers will dynamically adjust to increase monitoring for pages which change more frequently and vice-versa. This feature is also user configurable.

• A flexible API toolkit is provided for the web crawler to assist in tasks such as forwarding workflow messages.

• API toolkit can also be used to allow the user to build their own crawler using provided components. Sample code is included to assist in the development.
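
One simple way to revisit frequently changing pages more often is to shorten the revisit interval after a detected change and lengthen it otherwise. The Python sketch below shows that idea only; the halving and doubling factors, the bounds, and the function name are assumptions, since the slides do not state the crawler's actual adjustment rule.

```python
def next_interval(current_hours, page_changed, minimum=1.0, maximum=24.0 * 30):
    """Halve the revisit interval after a change, double it otherwise, within bounds."""
    factor = 0.5 if page_changed else 2.0
    return min(maximum, max(minimum, current_hours * factor))

interval = 24.0
for changed in [True, True, False]:
    interval = next_interval(interval, changed)
    print(changed, interval)
```

The `minimum` and `maximum` arguments stand in for the user-configurable part of the behavior.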

page 53


Web Crawler Package

Consists of 2 components:
• A ready-to-run Web Crawler
• A Web Crawler toolkit to build customized Web crawlers

page 54


The NetQuestion Solution

page 55



• A pre-built, ready-to-use Internet/intranet text-search solution for searching a local Web server

• A multiserver domain solution based on the Text Search Engine and Web Crawler

page 56


NetQuestion Solution - details

NetQuestion - Single WebServer Support
• Workstations
  - SBCS Search Forms and CGI script
• S/390
  - SBCS Search Forms and CGI script
  - English Admin Forms and Script

NetQuestion - Multiple WebServer Support
• Drop-in solution with some assumed defaults
• Fully configurable solution

Spellchecker support

page 57


Natural Language Support

page 58


NLS Support

IBM Text Search Engine
• 18 SBCS Languages: US English, UK English, Catalan, Danish, Dutch, German, Swiss German, Spanish, Finnish, French, Canadian French, Icelandic, Italian, Norwegian Bokmål, Norwegian Nynorsk, Portuguese, Brazilian Portuguese, and Swedish, plus Russian, Hebrew (BiDi), Arabic (BiDi)
• 4 DBCS Languages (Japanese, S Chinese, T Chinese, Korean)

Text Analysis Tools
• Language ID can identify 14 languages
• all other tools are English only

EURO support (new code page 8859-15)
• TATools to recognize Euro Abbr

page 59


NLS Support - Messages and GUI

• Fully enabled messages across all platforms
• Ship translations in all Group I languages (English, French, German, Italian, Spanish, Brazilian Portuguese, Simplified Chinese, Traditional Chinese, Japanese, Korean)
• Java Search GUI sample is enabled, not to be translated
• JavaBeans not enabled
• NetQ Solution on S/390
  - NLS for Search forms and scripts (English, French, German, Italian, Spanish, Brazilian Portuguese, Danish, Swedish, Norwegian, Finnish, Simplified Chinese, Traditional Chinese, Japanese, Korean)
  - No NLS of Admin Search forms and scripts

page 60


Documentation

page 61


Documentation

• On-line Documentation in HTML for workstation product
• S/390 relies upon documentation on workstation CD-ROMs
• PDFs are shipped on workstation CD-ROMs
• Online Documentation Search available for all workstation platforms

page 62


Documentation - Details

Title                                             | BookMaster | HTML | PDF | Hardcopy | Comments
Getting Started                                   | Y | Y | Y | Y | Translated into Group 1
Text Analysis Tools                               | Y | Y | Y | N |
IBM Text Search Engine                            | Y | Y | Y | N |
IBM Text Search Engine Customization and Admin    | Y | Y | Y | N |
WebCrawler                                        | Y | Y | Y | N |
Java GUI, Java Beans Search, Java Beans Admin     | N | Y | N | N |
NetQuestion Solution                              | Y | Y | Y | N |
Welcome HTML page with search                     | N | Y | N | N |
Fact Sheet                                        | N | N | N | Y |
IBM Web Crawler and Toolkit                       | Y | Y | Y | N |
WWW External Pages                                | N | Y | N | N |

page 63


Presentation Summary

page 64


IBM Intelligent Miner for Text

• A knowledge-discovery software development toolkit to build advanced text-mining and text-search applications
• A NetQuestion Solution to construct Internet/intranet text-search solutions

Components:
• Text Analysis Tools
• Text Search Engine
• Web Crawler Package

page 65


Platforms Available

• Platforms: AIX, Sun Solaris, Windows NT, OS/390
• Announcement: December 8, 1998
• General Availability
  - Workstation product: December 29, 1998
  - Mainframe product: January 29, 1999
• Evaluation License
  - 60-day trial version for AIX, Windows NT, Sun Solaris
  - Order Number: GK2T-0167
• Price for workstation product: 30K$ per server

page 66


Intelligent Miner for Text

Web presence
• Product Features, Downloads, News, Library, Business partners, Case studies, Service, Support, Feedback
• www.software.ibm.com/iminer/fortext