IBM Intelligent Miner for Text
John Tullis, DePaul Instructor, [email protected]
Copyrighted material: John Tullis


Page 2: IBM Intelligent Miner for Text

IBM Intelligent Miner for Text

• A knowledge-discovery software development toolkit to build advanced text-mining and text-search applications
• A NetQuestion Solution to construct Internet/intranet text-search solutions

Components: Text Analysis Tools, Text Search Engine, Web Crawler Package, NetQuestion Solution

Page 3: IBM Intelligent Miner for Text

Intelligent Miner for Text

For companies of any size and for different industries:

• Media
• Petroleum
• Banking
• Education
• Government
• Insurance

Page 4: IBM Intelligent Miner for Text

Potential Applications

• Customer complaints analysis
• Newswire analysis
• Intelligent Website
• Opinion survey classification
• Competitive intelligence
• Corporate image analysis

Page 5: IBM Intelligent Miner for Text

Intelligent Miner for Text: Platforms supported

Platform | Text Analysis Tools | Text Search Engine Server | Text Search Engine Client | Text Search Engine Java GUI / JavaBeans | Web Crawler Package | NetQ Solution
AIX 4.3 | Y | Y | Y | Y | Y | Y
Solaris 2.5.1 | Y | Y | Y | Y | Y | Y
Win NT 4.0 SP3 | Y | Y | Y | Y | Y | Y
OS/390 V2R4, V2R5, V2R6 | Y | Y | Y | Y | Y | Y

Page 6: IBM Intelligent Miner for Text

Reference customers & Success stories

Reference customers:
• FinanceWise (search engine for financial content on the Internet): www.financewise.com
• IBM web sites (incl. 2000 IBM intranet sites): www.ibm.com
• Sueddeutsche Zeitung (classified ads on the Sueddeutsche Zeitung Web site): www.sueddeutsche.de
• SearchCafe (Business Partner): www.search-cafe.com

Success stories are available at www.software.ibm.com/iminer/fortext

Page 7: IBM Intelligent Miner for Text


Component: Text Analysis Tools

Page 8: IBM Intelligent Miner for Text

Functionality

• Language Identification
• Clustering of document collection
  - hierarchical clustering
  - relational clustering
• Categorization/Classification of document collection
• Feature Extraction
• Summarization

Page 9: IBM Intelligent Miner for Text

Text Analysis Tools

To automate tasks previously done manually:

• automatically identifies the language of a document
• automatically groups related documents based on their content, without requiring predefined classes
• automatically assigns documents to one or more user-defined categories
• automatically recognizes significant items in text, such as names, technical terms, and abbreviations
• automatically extracts sentences from a document to create a document summary

Page 10: IBM Intelligent Miner for Text

Text Analysis

• Text analysis tools are available in a command line format structured to function like common UNIX or DOS command line formats.
• Text analysis tools can be used individually or in a combined mode, depending on the required task.
• Configuration files allow document format flexibility and performance tuning for text searches. Command line switches provide additional flexibility by permitting the user to set runtime parameters.
• The documents need to be provided in plain text format. For other formats, conversion tools can be obtained from third parties such as KEYpak (http://www.keypak.com/).

Page 11: IBM Intelligent Miner for Text

[Diagram: Text Analysis Tools overview with the components Language Identification, Feature Extraction, Clustering (2), Classification, Summarization]

Page 12: IBM Intelligent Miner for Text

Text Analysis Tools: Feature Extraction

• To recognize significant vocabulary items
• To recognize all names referring to a single entity
• To provide the location of all person names, places and organizations in a text
• To find multi-word terms that have a meaning of their own
• To find abbreviations introduced in a text and link them with their full forms
• To recognize named relationships

Page 13: IBM Intelligent Miner for Text

Text Analysis Tools: Feature Extraction

• Produces statistics for each vocabulary item.
• Associates terms with canonical forms (e.g. "related" is associated with the term "relate").
• Feature extraction can be used as a preprocessor for the Clustering utility to bias (or control) clustering activities.
• Feature extraction can be run in two modes:
  1) Lookup mode, which refers to a schema generated from a training set and produces statistics for vocabulary items as they relate to the rest of the schema as well as within the document
  2) Exploration mode, which requires no training and yields textual data statistics for vocabulary items as they relate within the scope of the document(s) specified

Page 14: IBM Intelligent Miner for Text

• Several classes of significant vocabulary can be recognized
• Names are categorized
• Significant concepts are detected automatically
• Automatic keywording: the most significant terminology in the document

Page 15: IBM Intelligent Miner for Text

Feature Extraction - statistics & analysis

• The application shown here illustrates how one can use the statistics and analysis produced by the feature extraction.
• Selected items can be highlighted within a document by using the location information in the feature extract (all vocabulary terms carry location information for this purpose).
• Selected categories can be filtered upon.
• Feature extraction produces a significance measure for each vocabulary item, which allows prioritization of keywords within the scope of individual documents or entire collections.
• This is a sample application which is not included in the software installation.

Page 16: IBM Intelligent Miner for Text

"Terms" include multi-word phrases whose meaning is much more than that of the individual words.

Multi-word phrases are the vocabulary in which concepts are expressed.

Page 17: IBM Intelligent Miner for Text

Feature Extraction - statistics & analysis

• Recognizes multi-word phrases by pattern recognition: if a two-word pattern appears with an acceptable frequency, it is included as an extracted vocabulary item in the output.
• More heuristics are applied than mentioned here, but this is generally the textual processing which occurs.
• Concepts can be FORMULATED from the multi-word terms. The feature extraction utility assists in emphasizing prevalent multi-word terms.
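The frequency-based recognition of two-word patterns described on this slide can be illustrated with a minimal Python sketch. This is not the Intelligent Miner for Text implementation (which, as noted, applies more heuristics); the function name `frequent_bigrams`, the `min_count` threshold and the sample text are assumptions made for illustration only.

```python
import re
from collections import Counter

def frequent_bigrams(text, min_count=2):
    """Return two-word patterns that occur at least min_count times.

    Simplification: word pairs are counted across sentence boundaries.
    """
    words = re.findall(r"[a-z]+", text.lower())
    pairs = Counter(zip(words, words[1:]))
    return {" ".join(pair): n for pair, n in pairs.items() if n >= min_count}

sample = ("Text mining finds patterns. Text mining needs data. "
          "Data quality matters for text mining.")
print(frequent_bigrams(sample))
```

Raising `min_count` plays the role of the "acceptable frequency" mentioned above: fewer, more prevalent multi-word terms survive.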

Page 18: IBM Intelligent Miner for Text

[Diagram: Text Analysis Tools overview with the components Language Identification, Feature Extraction, Clustering (2), Classification, Summarization]

Page 19: IBM Intelligent Miner for Text

Language Identification

Given a document, automatically discover the language(s) in which the document is written.

It can be used to:
• restrict search results by language
• organize the crawls by language
• route documents to language translators

Page 20: IBM Intelligent Miner for Text

Language Identification

• A 16 language dictionary is shipped with the Intelligent Miner for Text to be used by the Language Identification utility.
• The Language Identification utility also comes with a utility which can be used to add to the shipped dictionary file to extend language identification. (You can even invent your own language and add it to the dictionary!)
• Documents can be analyzed for language content: in one pass, Language Identification can report multiple degrees of language content (e.g. Document ABC has 75% English, 20% German, etc.). This is possible using a command line option.
• Allows further document organization by language and adds a degree of internationalization to applications.
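To make the idea of dictionary-based identification with per-language percentages concrete, here is a small Python sketch. It is only an illustration under stated assumptions: the two tiny word lists stand in for the shipped dictionary, and the function `language_shares` is hypothetical, not part of the product.

```python
from collections import Counter

# Hypothetical, tiny word lists standing in for the shipped dictionary.
DICTIONARY = {
    "English": {"the", "and", "is", "of", "to"},
    "German": {"der", "und", "ist", "das", "nicht"},
}

def language_shares(text, dictionary=DICTIONARY):
    """Return the percentage of dictionary hits per language."""
    hits = Counter()
    for word in text.lower().split():
        for language, vocabulary in dictionary.items():
            if word in vocabulary:
                hits[language] += 1
    total = sum(hits.values())
    if total == 0:
        return {}  # no known word found, language undetermined
    return {language: round(100 * n / total) for language, n in hits.items()}

doc = "the cat is on the mat and das ist gut"
print(language_shares(doc))
```

Extending the dictionary, as the slide describes, corresponds to adding another entry to `DICTIONARY` in this sketch.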

Page 21: IBM Intelligent Miner for Text

[Diagram: Text Analysis Tools overview with the components Language Identification, Feature Extraction, Clustering (2), Classification, Summarization]

Page 22: IBM Intelligent Miner for Text

Categorization/Classification

• given a defined taxonomy, it can assign documents to preexisting categories
• utilizes feature extraction capabilities to do document comparisons efficiently
• two stages:
  - training using sample documents
  - category assignment

Page 23: IBM Intelligent Miner for Text


Categorization/Classification

• Users determine the taxonomy for organizing the documents into topics.

• Users create training sets to define categories and use the supplied training utility.

• Each document is analyzed and a rank value assigned as it relates to each category.

• A command line switch allows the user to display varying numbers of categories with the document's associated rank value.

• REMEMBER: The categories are predefined by the user.
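The two stages (training, then assignment with a rank value per category) can be sketched as follows. This is an assumed, simplified scoring scheme based on word counts, not the algorithm used by the product; `train`, `rank` and `top_n` are illustrative names, with `top_n` mimicking the switch that controls how many categories are displayed.

```python
from collections import Counter

def train(training_sets):
    """Build one word-frequency profile per user-defined category."""
    return {category: Counter(w for doc in docs for w in doc.lower().split())
            for category, docs in training_sets.items()}

def rank(document, profiles, top_n=2):
    """Score the document against every category and return the top_n."""
    words = document.lower().split()
    scores = {category: sum(profile[w] for w in words)
              for category, profile in profiles.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# The categories are predefined by the user through the training sets.
profiles = train({
    "finance": ["bank loan interest", "interest rate bank"],
    "sports": ["football match goal", "goal keeper match"],
})
print(rank("the bank raised the interest rate", profiles, top_n=1))
```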

Page 24: IBM Intelligent Miner for Text


Categorization: Solution Example

Page 25: IBM Intelligent Miner for Text

[Diagram: Text Analysis Tools overview with the components Language Identification, Feature Extraction, Clustering (2), Classification, Summarization]

Page 26: IBM Intelligent Miner for Text

Clustering

• Functions to automatically group related documents based on their content, without requiring predefined classes
• objects within a group are more similar to each other than to members of any other group
• two approaches: hierarchical clustering and binary relational clustering

Page 27: IBM Intelligent Miner for Text

Clustering - Details

Preprocessing steps
• Analyze the data input stream and divide it into individual textual components to be used for clustering
• Extract portions of individual textual components to be used for clustering (uses Feature Extraction as a preprocessor)
• Customize the stop word list

Hierarchical clustering
• Structure the document collection using lexical affinity based on a similarity function
• Build a clustering tree showing relationships between clusters of documents of varying granularity

Page 28: IBM Intelligent Miner for Text

Clustering - Details

Slicing
• Customize the tree by applying adjustable thresholds to reduce complexity and zoom in on concepts of interest
• Use default threshold values for a specific document collection
• Note: slicing allows merging similar clusters into a single cluster.

Clustering Output Formats
• HTML file viewable by browser
• Textual description to be parsed (in the format of a tree)
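One plausible reading of slicing is cutting the clustering tree at a similarity threshold, so that every subtree whose members are at least that similar collapses into a single cluster. The Python sketch below illustrates that reading only; the tree encoding (nested tuples of `(similarity, left, right)` with file names as leaves), the function names and the similarity values are all assumptions, not the product's format.

```python
def leaves(node):
    """Collect all documents below a tree node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)

def slice_tree(node, threshold):
    """Merge every subtree whose similarity reaches the threshold."""
    if isinstance(node, str):
        return [[node]]
    similarity, left, right = node
    if similarity >= threshold:
        return [leaves(node)]
    return slice_tree(left, threshold) + slice_tree(right, threshold)

# Hypothetical tree; inner nodes carry a similarity on a 0..1000 scale.
tree = (100, (800, "a.txt", "b.txt"), (400, "c.txt", (900, "d.txt", "e.txt")))
print(slice_tree(tree, 500))
print(slice_tree(tree, 300))
```

Lowering the threshold merges more clusters and so reduces the complexity of the tree; raising it zooms in on tighter groups.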

Page 29: IBM Intelligent Miner for Text


Hierarchical Clustering - Visualization Example

Page 30: IBM Intelligent Miner for Text

Clustering - Details

• This is a sample application which shows the use of the clustering results in an HTML format. This application is not shipped with the software.
• The HTML output can be configured to place actual document paths in the display on the browser, so users may easily view the clustered documents RIGHT FROM THE BROWSER.
• Each cluster has a label generated from three two-word pairings, which are its most common lexical affinities.
• Similarity values in the application are shown as percentages. These are normalized: the underlying similarity values actually range from 0 to 1000.
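The normalization from the 0 to 1000 similarity range to a percentage is a division by ten. The short Python sketch below shows it together with a stand-in similarity function; the Jaccard word overlap used here is an assumption chosen for brevity and is not the lexical-affinity measure of the clustering utility.

```python
def similarity(doc_a, doc_b):
    """Jaccard overlap of the word sets, scaled to 0..1000 (assumed measure)."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    if not a or not b:
        return 0
    return round(1000 * len(a & b) / len(a | b))

def as_percent(value):
    """Normalize a 0..1000 similarity value to a percentage."""
    return value / 10

s = similarity("web crawler toolkit", "web crawler package")
print(s, as_percent(s))
```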

Page 31: IBM Intelligent Miner for Text

Categorization: Comparison to Clustering

In clustering, document collections are processed and grouped into dynamically generated clusters.

In categorization, document collections are processed and grouped into predetermined groupings based on a taxonomy generated with training sets.

[Diagram labels, clustering: Document Collection, Clustering Utility, Cluster1, Cluster2, Cluster3, Cluster4]
[Diagram labels, categorization: Category1, Category2 and Category3 Training Collections, Trainer, Document Collection, Categorizer, Cat1, Cat2, Cat3, Cat4]

Page 32: IBM Intelligent Miner for Text

[Diagram: Text Analysis Tools overview with the components Language Identification, Feature Extraction, Clustering (2), Classification, Summarization]

Page 33: IBM Intelligent Miner for Text


Summarization

Extracts sentences from a document to create a document summary

Sentence selection is based on document structure and ranking of extracted features
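Extractive summarization of this kind can be sketched in a few lines of Python. The scoring below is an assumption for illustration: word frequency stands in for the ranking of extracted features, and a small bonus for the first sentence stands in for document structure. Summing frequencies favors longer sentences, which a real implementation would likely correct for.

```python
import re
from collections import Counter

def summarize(text, max_sentences=1):
    """Pick the highest-scoring sentences and return them in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(item):
        index, sentence = item
        words = re.findall(r"[a-z]+", sentence.lower())
        position_bonus = 1 if index == 0 else 0  # crude "document structure" signal
        return sum(freq[w] for w in words) + position_bonus

    ranked = sorted(enumerate(sentences), key=score, reverse=True)[:max_sentences]
    return [sentence for _, sentence in sorted(ranked)]

text = ("Crawlers collect pages. Search engines index pages and rank pages. "
        "Users type queries.")
print(summarize(text, max_sentences=1))
```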

Page 34: IBM Intelligent Miner for Text


Component: Text Search Engine

Page 35: IBM Intelligent Miner for Text


Text Search Engine

Fuzzy search

Hybrid queries

Free-text queries

Boolean queries

Synonyms search

Page 36: IBM Intelligent Miner for Text

Text Search Engine

Search Engine
• offers multiple search paradigms: Boolean, free text, fuzzy, hybrid, etc.
• supports linguistic analysis for documents in 21 languages, including Arabic and Hebrew
• features Boolean queries, precise term search and fuzzy search for 4 DBCS languages

Mining Functions
• to extract key features in text
• to cluster the result list
• to refine queries

Integrated in IBM DB2 Digital Library and IBM DB2 UDB Text Extender

Page 37: IBM Intelligent Miner for Text

Text Search Engine

• A user can refine searches, meaning that they can reuse previous search result sets to perform additional searches.
• Multilingual linguistic analysis performed:
  - basic text analysis (recognizing terms, normalizing terms, recognizing sentence boundaries)
  - reducing terms to their base form
  - stop word filtering
  - decomposition (splitting compound terms)

Page 38: IBM Intelligent Miner for Text

Basic Text Search Engine functions

Included as part of the basic functional set in the Text Search Engine:
• Precise index
• ngram index
• linguistic index
• 21 SBCS languages
• 4 DBCS languages
• relevance ranking
• Boolean queries
• free text queries
• fuzzy and phonetical searches
• thesaurus support

Page 39: IBM Intelligent Miner for Text

Text Search Engine: Details

• Document support for single byte character set (SBCS) languages
• Document support for double byte character set (DBCS) languages
• Linguistic search:
  - Dictionaries and synonym lists for SBCS languages
  - Terms are reduced to their base form, terms are decomposed, terms are normalized to standard form
• Boolean query: operators AND, NOT, OR
• Natural language query/free text query: to formulate a query in natural language
• Hybrid query: to combine a natural language query with a Boolean search term
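The Boolean operators map naturally onto set operations over an inverted index. The Python sketch below shows this general technique; it makes no claim about how the Text Search Engine stores its indexes, and the documents and the helper `build_index` are invented for illustration.

```python
def build_index(docs):
    """Map every word to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(doc_id)
    return index

docs = {
    1: "text search engine",
    2: "web crawler engine",
    3: "text analysis tools",
}
index = build_index(docs)

text_and_engine = index["text"] & index["engine"]  # AND: intersection
text_or_web = index["text"] | index["web"]         # OR: union
engine_not_web = index["engine"] - index["web"]    # NOT: difference
print(text_and_engine, text_or_web, engine_not_web)
```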

Page 40: IBM Intelligent Miner for Text

Text Search Engine: Details

• Fuzzy query: to find misspelled words: TOYOTA/TOYOTTA, DATABASE/DATABSAE
• Phonetical query:
  - Technique: remove the vowel(s) from the search term and replace it/them with masking characters, eliminate duplicate consonants
  - To search for similar-sounding words: COLOR/COLOUR, SMITH/SMYTH, JANET/JEANNETTE ...
• Wildcard support for Boolean queries: front, middle and end masking for word and character masking
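The fuzzy and phonetical examples above can be reproduced with a short Python sketch. Two simplifications are assumed here: the phonetic key drops vowels entirely instead of replacing them with masking characters, and Y is treated as a vowel. The fuzzy lookup uses Python's standard `difflib` module as a stand-in, so neither function reflects the engine's actual matching algorithm.

```python
import difflib

VOWELS = set("AEIOUY")  # assumption: Y counts as a vowel here

def phonetic_key(term):
    """Drop vowels, then collapse consonants that repeat back to back."""
    key = []
    for ch in term.upper():
        if ch in VOWELS:
            continue
        if key and key[-1] == ch:
            continue  # duplicate consonant, e.g. the NN and TT in JEANNETTE
        key.append(ch)
    return "".join(key)

def fuzzy_lookup(term, vocabulary):
    """Return indexed terms that resemble the (possibly misspelled) term."""
    return difflib.get_close_matches(term, vocabulary, n=3, cutoff=0.8)

vocabulary = ["TOYOTA", "DATABASE", "COLOUR"]
print(fuzzy_lookup("TOYOTTA", vocabulary))
print(phonetic_key("JANET"), phonetic_key("JEANNETTE"))
```

Two terms are considered similar-sounding in this sketch when their keys are equal, as is the case for each pair listed on the slide.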

Page 41: IBM Intelligent Miner for Text

Text Search Engine: Even more details!

Section support
• Able to define a section of a document
• Restrict the search to given sections
• Example: define a section called Summary, then limit the search scope to the Summary section

Thesaurus support
• for all index types and many languages
• ngram index thesaurus (workstation only)
• Synonyms and broader/narrower terms
• DBCS language synonym support
• Not supported for BiDi languages or Russian

Page 42: IBM Intelligent Miner for Text

Text Search Engine: Text Mining Functions

• Provides text mining functions for English documents
  - Feature extraction
  - Organize result list
• Supports query refinement method for English documents
  - User assigns value to single documents

Page 43: IBM Intelligent Miner for Text


Text Search Engine: Query refinement example

Page 44: IBM Intelligent Miner for Text

Query Refinement Example

• This is a snapshot of the Java GUI which is shipped with Intelligent Miner for Text. The source code and instructions are shipped, and the code must be compiled by the end user to be operational.
• Interacts with the TextMiner Java server.
• Comprised of JavaBeans which are shipped with Intelligent Miner for Text. The Beans can also be built and integrated into other applications to interact with Intelligent Miner for Text.
• The Java GUI provides a "ready-to-go" search GUI to interact with the Advanced Search engine. Users can perform various levels of queries and even browse the documents themselves by double clicking in the window.
• Users must use a fully Java-enabled browser to run this pure Java applet.

Page 45: IBM Intelligent Miner for Text

Where to find the Text Search Engine functions

Basic functions
• S/390 Text Search Download for OS/390 V2.4 - V2.6
• IM4T V2.3 workstations

Extended functions (result list clustering, relevance feedback/query refinement, feature index)
• IM4T V2.3 for OS/390
• IM4T V2.3 for Workstations

Page 46: IBM Intelligent Miner for Text


Component: Java & JavaBeans

Page 47: IBM Intelligent Miner for Text

Java Components

• Java Search GUI: fully operational, NLS enabled
• JavaBeans for Rapid Application Development
  - Search
  - Administration
• Source is available and intended to be used as a 'starter kit'
• Works with the Text Search Engine

Page 48: IBM Intelligent Miner for Text

Java Components - Details

• GUI enhancements: enhanced error recovery, help
• Use with Netscape and MS Internet Explorer
  - Internet Explorer 3.02 and 4.0 for NT
  - Internet Explorer 4.0 for Win95/98
  - Netscape Navigator 3.0/4.0 for Win95/98/NT
  - Netscape Navigator 3.0/4.0 Solaris/SPARC
  - Netscape Navigator 3.0 for Solaris/x86
• Supported via plugin found at http://java.sun.com/products/plugin/1.1.1/index.html
• Sun's HotJava Browser

Page 49: IBM Intelligent Miner for Text


Component: WebCrawler

Page 50: IBM Intelligent Miner for Text

Web Crawler

• Is a robot used to collect HTML pages for indexing
• Customizable as to which HTML links are to be crawled (include and exclude patterns ...)
• Results are stored:
  - Data objects on AIX/NT file systems
  - Metadata in DB2
• Parallel crawling, results combined
• HTML page change frequency used as revisiting factor
• External subsystems can be notified of web changes detected by the crawler
• Create individual crawler using crawler toolkit

Page 51: IBM Intelligent Miner for Text

Web Crawler details

• Uses regular expression configuration files to filter or retain crawled URLs.
• The data objects are the actual URLs or documents. The size and type of the URLs to be stored are also configurable using the provided configuration file structure.
• Storage is scalable by mounting disk storage to file system storage locations.
• Multiple crawlers can be run at once. The only known limitation is physical machine processing and storage capacity.
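Filtering URLs with include and exclude regular expressions can be sketched in Python as follows. The patterns, the host `www.example.com` and the function `should_crawl` are hypothetical; the crawler's real configuration file syntax is not shown in this presentation and is not reproduced here.

```python
import re

# Hypothetical patterns; in the product these would come from configuration files.
INCLUDE = [re.compile(r"^https?://www\.example\.com/")]
EXCLUDE = [re.compile(r"\.(gif|jpg|zip)$"), re.compile(r"/cgi-bin/")]

def should_crawl(url):
    """Retain a URL only if it matches an include pattern and no exclude pattern."""
    if not any(pattern.search(url) for pattern in INCLUDE):
        return False
    return not any(pattern.search(url) for pattern in EXCLUDE)

for url in ("http://www.example.com/docs/index.html",
            "http://www.example.com/logo.gif",
            "http://other.example.org/index.html"):
    print(url, should_crawl(url))
```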

Page 52: IBM Intelligent Miner for Text

Web Crawler details

• Crawlers dynamically adjust to increase monitoring of pages which change more frequently, and vice versa. This feature is also user configurable.
• A flexible API toolkit is provided for the web crawler to assist in tasks such as forwarding workflow messages.
• The API toolkit can also be used to let users build their own crawler from the provided components. Sample code is included to assist in the development.
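The adaptive revisiting behavior described on this slide could work along the following lines. This Python sketch is an assumed policy (halve the interval when a page changed, double it otherwise, within configurable bounds); the presentation does not state the rule the crawler actually applies, and `next_interval` with its parameters is a hypothetical name.

```python
def next_interval(current_hours, changed, min_hours=1, max_hours=168):
    """Revisit changed pages sooner and unchanged pages later."""
    if changed:
        return max(min_hours, current_hours / 2)
    return min(max_hours, current_hours * 2)

interval = 24
for changed in (True, True, False):
    interval = next_interval(interval, changed)
    print(interval)
```

The bounds `min_hours` and `max_hours` stand in for the user-configurable part of the feature.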

Page 53: IBM Intelligent Miner for Text

Web Crawler Package

Consists of 2 components:
• A ready-to-run Web Crawler
• A Web Crawler toolkit to build customized Web crawlers

Page 54: IBM Intelligent Miner for Text


The NetQuestion Solution

Page 55: IBM Intelligent Miner for Text


NetQuestion Solution

A Pre-built ready to use Internet/intranet text-search solution for searching a local Web server

A multiserver domain solution based on the Text Search Engine and Web Crawler

Page 56: IBM Intelligent Miner for Text

NetQuestion Solution - details

NetQuestion - Single WebServer Support
• Workstations
  - SBCS Search Forms and CGI script
• S/390
  - SBCS Search Forms and CGI script
  - English Admin Forms and Script

NetQuestion - Multiple WebServer Support
• Drop-in solution with some assumed defaults
• Fully configurable solution
• Spellchecker support

Page 57: IBM Intelligent Miner for Text


Natural Language Support

Page 58: IBM Intelligent Miner for Text

NLS Support

IBM Text Search Engine
• 18 SBCS languages: US English, UK English, Catalan, Danish, Dutch, German, Swiss German, Spanish, Finnish, French, Canadian French, Icelandic, Italian, Norwegian Bokmål, Norwegian Nynorsk, Portuguese, Brazilian Portuguese, and Swedish; plus Russian, Hebrew (BiDi), Arabic (BiDi)
• 4 DBCS languages (Japanese, Simplified Chinese, Traditional Chinese, Korean)

Text Analysis Tools
• Language ID can identify 14 languages
• all other tools are English only

EURO support (new code page 8859-15)
• Text Analysis Tools to recognize the Euro abbreviation

Page 59: IBM Intelligent Miner for Text

NLS Support - Messages and GUI

• Fully enabled messages across all platforms
• Ship translations in all Group I languages (English, French, German, Italian, Spanish, Brazilian Portuguese, Simplified Chinese, Traditional Chinese, Japanese, Korean)
• Java Search GUI sample is enabled, not to be translated
• JavaBeans not enabled
• NetQ Solution on S/390
  - NLS for Search forms and scripts (English, French, German, Italian, Spanish, Brazilian Portuguese, Danish, Swedish, Norwegian, Finnish, Simplified Chinese, Traditional Chinese, Japanese, Korean)
  - No NLS of Admin Search forms and scripts

Page 60: IBM Intelligent Miner for Text


Documentation

Page 61: IBM Intelligent Miner for Text

Documentation

• On-line documentation in HTML for workstation product
• S/390 relies upon documentation on workstation CD-ROMs
• PDFs are shipped on workstation CD-ROMs
• Online Documentation Search available for all workstation platforms

Page 62: IBM Intelligent Miner for Text

Documentation - Details

Title | BookMaster | HTML | PDF | Hardcopy | Comments
Getting Started | Y | Y | Y | Y | Translated into Group 1
Text Analysis Tools | Y | Y | Y | N |
IBM Text Search Engine | Y | Y | Y | N |
IBM Text Search Engine Customization and Admin | Y | Y | Y | N |
WebCrawler | Y | Y | Y | N |
Java GUI, Java Beans Search, Java Beans Admin | N | Y | N | N |
NetQuestion Solution | Y | Y | Y | N |
Welcome HTML page with search | N | Y | N | N |
Fact Sheet | N | N | N | Y |
IBM Web Crawler and Toolkit | Y | Y | Y | N |
WWW External Pages | N | Y | N | N |

Page 63: IBM Intelligent Miner for Text


Presentation Summary

Page 64: IBM Intelligent Miner for Text

IBM Intelligent Miner for Text

• A knowledge-discovery software development toolkit to build advanced text-mining and text-search applications
• A NetQuestion Solution to construct Internet/intranet text-search solutions

Components: Text Analysis Tools, Text Search Engine, Web Crawler Package, NetQuestion Solution

Page 65: IBM Intelligent Miner for Text

Platforms Available

• Platforms: AIX, Sun Solaris, Windows NT, OS/390
• Announcement: December 8, 1998
• General Availability
  - Workstation product: December 29, 1998
  - Mainframe product: January 29, 1999
• Evaluation License: 60-day trial version for AIX, Windows NT, Sun Solaris
  - Order Number: GK2T-0167
• Price for workstation product: 30K$ per server

Page 66: IBM Intelligent Miner for Text

Intelligent Miner for Text

Web presence: www.software.ibm.com/iminer/fortext

Product Features, Downloads, News, Library, Business partners, Case studies, Service, Support, Feedback