Text Analytics World
Future Directions of Text Analytics:
Smarter, Bigger, and Better
Tom Reamy
Chief Knowledge Architect
KAPS Group
Program Chair – Text Analytics World
Knowledge Architecture Professional Services
http://www.kapsgroup.com
Text Analytics World Highlights
Keynote – Peter Morville, Information Architecture+
Keynote – Future of Text Analytics – Bigger, Better, Smarter
Social Media and Enterprise Text Analytics – new techniques,
new applications, new directions - Integration
Two Panels – leading TA experts: Interactive: What you always
wanted to know about TA, but were afraid to ask.
Great Companies: Visit Sponsors & hear great case studies
Text Analytics Workshop – Thursday
Logistics
Agenda
Introduction:
– Current State of Text Analytics
– Survey / Report
Enterprise Text Analytics - Search – still fundamental
– Shift from information to business
Social Media – Next Generation
– Different World: Content, Structures, Applications
Future of Text Analytics
– Roadblocks, Deep Vision
Questions
Introduction: KAPS Group
Knowledge Architecture Professional Services – Network of Consultants
Applied Theory – Faceted taxonomies, complexity theory, natural categories, emotion taxonomies
Services:
– Strategy – IM & KM - Text Analytics, Social Media, Integration
– Taxonomy/Text Analytics development, consulting, customization
– Text Analytics Quick Start – Audit, Evaluation, Pilot
– Social Media: Text based applications – design & development
Partners – SAS, Smart Logic, Expert System, SAP, IBM, FAST, Concept Searching, Attensity, Clarabridge, Lexalytics
Projects – Portals, taxonomy, Text analytics – news, expertise location, information strategy, text analytics evaluation, Quick Start in Text A.
Clients: Genentech, Novartis, Northwestern Mutual Life, Financial Times, Hyatt, Home Depot, Harvard Business Library, British Parliament, Battelle, Amdocs, FDA, GAO, World Bank, etc.
Presentations, Articles, White Papers – www.kapsgroup.com
Introduction: Coming Soon
New Book: Text Analytics: How to Conquer Information Overload and Get Real Value from Social Media
Due end of May
Free Copy to Workshop Attendees
One randomly selected person at the conference will receive a free copy – stay tuned!
Current State of Text Analytics
History – academic research, focus on NLP
Inxight – out of Xerox PARC
– Moved TA from academic and NLP to auto-categorization, entity extraction, and search metadata
Explosion of companies – many based on Inxight extraction with some analytical-visualization front ends
– Half from 2008 are gone - Lucky ones got bought
Early applications – News aggregation and Enterprise Search –
Second Wave = shift to sentiment analysis
Third Wave = Multiple Enterprise & Social Applications
– Watson = New Levels of Excitement
– Need practical version
Current State of Text Analytics: Vendor Space
Taxonomy Management – SchemaLogic, PoolParty
Taxonomy & Semantic Networks - Text Analytics Solutions
– Access Innovations, Luminoso
Extraction and Analytics
– Linguamatics (Pharma), Temis, whole range of companies
Business Intelligence – ClearForest, Inxight
Sentiment Analysis – Attensity, Lexalytics, Clarabridge
Open Source – GATE
Stand alone text analytics platforms – IBM, SAS, SAP, Smart Logic, Expert System, Basis, Open Text, Megaputer, Temis, Concept Searching
Embedded in Content Management, Search
– Autonomy, FAST, Endeca, Exalead, etc.
Market Mindshare – IBM, SAS, Clarabridge, Lexalytics
Current Market: Text Analytics
Surveys, Seth Grimes Report
Market – 2014 - $2Bil
Enterprise search – 30-50% of market ($1Bil)
Text Analytics is growing 20% a year, 10% of analytics
Fragmented market – no clear leader
Social and Voice of Customer is huge
Money (investor) is still mostly social
Cloud-based Software as Service continues to grow
Growth as a market – slowed, as a technique – expanding
– (Me – time for new direction, characterization of field, etc.)
US market different than Europe/Asia – project oriented
Seth Grimes Report + Interviews Leading Analysts:
Current Trends
From Mundane to Advanced – reducing manual labor to
“Cognitive Computing”
Enterprise – Shift from Information to Business – cost cutting
rather than productivity gains
Embedded solutions – not called TA (but should be because they
suffer from weak TA)
Graph databases (saying since 2010 – he’ll be right one of these
years): Open Knowledge Graphs
Human-Machine – still need human hybrid
Rules – hard to maintain and to apply to new text (wrong kind of rules)
Seth Grimes Report
Current and Future Trends
Top four in Grimes survey:
– Ability to generate taxonomies (64%)
– Ability to use specialized taxonomies, ontologies, etc. (54%)
– Broad information extraction (53%)
– Document Classification (53%)
Top business applications
– Brand/product/reputation management (38%)
– Voice of the Customer (39%)
– Competitive Intelligence (33%)
– Search, Info Access, etc. (29%)
– (Research 38% - not listed as a choice)
Seth Grimes Report
Current and Future Trends
Currently: extracting more, and more diverse, types of info, applying
insights in new ways and for new purposes – yet user
satisfaction is still lagging, especially on accuracy and ease of use
74% satisfied with TA – only 4% disappointed
Most dissatisfaction – ease of use (29%) and availability of
professional services/support (50%)
48% likely to recommend their provider – 36% would
recommend against
Enterprise Text Analytics
Search is still #1 = 30-50% of applications
New Standard Search – facets (more and more metadata), auto-categorization built on taxonomies, clustering
Trend = Text Analytics/Search as Semantic Infrastructure
– Platform for Info Apps (Search-based applications)
SharePoint – Major focus of TA companies – fix problems with taxonomy/folksonomy
– Hybrid workflow – Publish document -> TA analysis -> suggestions for categorization, entities, metadata -> present to author
External information = more automation, extraction – precision more important
Enterprise Text Analytics
Adding Structure to Unstructured Content
Beyond Documents – categorization by corpus, by page, sections
or even sentence or phrase
Documents are not unstructured – variety of structures
– Sections – Specific - “Abstract” to Function “Evidence”
– Corpus – document types/purpose
– Textual complexity, level of generality
Need to develop flexible categorization and taxonomy – from tweets to
200-page PDFs
Applications require sophisticated rules, not just categorization by
similarity
Enterprise Text Analytics
Document Type Rules
(START_2000, (AND, (OR, _/article:"[Abstract]",
_/article:"[Methods]"), (OR,_/article:"clinical trial*",
_/article:"humans",
(NOT, (DIST_5, (OR,_/article:"approved", _/article:"safe",
_/article:"use", _/article:"animals"),
If the article has sections like Abstract or Methods
AND has phrases around “clinical trials / Humans” and not words
like “animals” within 5 words of “clinical trial” words – count it and
add up a relevancy score
Primary issue – major mentions, not every mention
– Combination of noun phrase extraction and categorization
– Results – virtually 100%
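The logic described above can be approximated in a few lines of Python. This is only a sketch under stated assumptions, not the vendor rule language: START_2000 is read as "the first 2000 tokens", DIST_5 as "within 5 tokens", and the section markers are matched as plain words. All function and constant names are hypothetical.

```python
import re

# Word lists copied from the rule on the slide; matching them as plain
# tokens (rather than as tagged concepts) is an assumption of this sketch.
SECTION_MARKERS = ("abstract", "methods")
TRIAL_TERMS = ("clinical trial*", "humans")
EXCLUDE_TERMS = ("approved", "safe", "use", "animals")


def tokenize(text):
    """Lowercase alphabetic tokens; digits and punctuation are dropped."""
    return re.findall(r"[a-z]+", text.lower())


def find_phrase(tokens, phrase):
    """Start indexes of a phrase; a trailing * makes the last word a prefix match."""
    words = phrase.split()
    prefix = words[-1].endswith("*")
    last = words[-1].rstrip("*")
    n = len(words)
    hits = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n - 1] != words[:-1]:
            continue
        tail = tokens[i + n - 1]
        matched = tail.startswith(last) if prefix else tail == last
        if matched:
            hits.append(i)
    return hits


def document_type_score(text, window=2000, dist=5):
    """Count trial/human mentions not near an exclusion word, if a section marker exists."""
    tokens = tokenize(text)[:window]
    if not any(find_phrase(tokens, s) for s in SECTION_MARKERS):
        return 0
    excluded = [i for term in EXCLUDE_TERMS for i in find_phrase(tokens, term)]
    score = 0
    for term in TRIAL_TERMS:
        for hit in find_phrase(tokens, term):
            if not any(abs(hit - e) <= dist for e in excluded):
                score += 1
    return score
```

A document with an Abstract section and clean "clinical trial" and "humans" mentions scores one point per mention, while a mention sitting within five tokens of "animals" adds nothing.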
Enterprise Text Analytics
Building on the Foundation: Applications
Focus on business value, cost cutting
Enhancing information access is means, not an end
– Governance, Records Management, Doc duplication,
Compliance
– Applications – Business Intelligence, CI, Behavior Prediction
– eDiscovery, litigation support
– Risk Management
– Productivity / Portals – spider and categorize, extract – KM
communities & knowledge bases
• New sources – field notes into expertise, knowledge base –
capture real time, own language-concepts
Enterprise Text Analytics: Applications
Pronoun Analysis: Fraud Detection; Enron Emails
Function words = pronouns, articles, prepositions, conjunctions, etc.
– Used at a high rate, short and hard to detect, very social, processed
in the brain differently than content words
Patterns of “Function” words reveal wide range of insights
Areas: sex, age, power-status, personality – individuals and groups
Lying / Fraud detection: Documents with lies have:
– Fewer, shorter words, fewer conjunctions, more positive emotion
words
– More use of “if, any, those, he, she, they, you”, less “I”
Current research – 76% accuracy in some contexts
Text Analytics can improve accuracy and utilize new sources
Combining with data analytics can improve accuracy further
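A rough illustration of how such function-word features might be computed. The marker words come from the slide; the conjunction list, the feature names, and the tokenizer are assumptions made for this sketch, and the output is a set of raw rates, not a fraud score.

```python
import re
from collections import Counter

# Words the slide associates with deceptive text.
MARKER_WORDS = {"if", "any", "those", "he", "she", "they", "you"}
FIRST_PERSON = {"i"}
# Hypothetical short conjunction list, for illustration only.
CONJUNCTIONS = {"and", "but", "or", "because", "so"}


def function_word_profile(text):
    """Return simple per-token rates for the feature groups above."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty input
    counts = Counter(tokens)
    return {
        "tokens": len(tokens),
        "avg_word_length": sum(len(t) for t in tokens) / total,
        "marker_rate": sum(counts[w] for w in MARKER_WORDS) / total,
        "first_person_rate": sum(counts[w] for w in FIRST_PERSON) / total,
        "conjunction_rate": sum(counts[w] for w in CONJUNCTIONS) / total,
    }
```

Note that contractions such as "I'm" stay as one token here and are not counted as "I". Rates like these would normally feed a trained classifier alongside other data rather than be thresholded by hand.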
Social Media: Next Generation
Beyond Simple Sentiment
Beyond Good and Evil (positive and negative)
– Degrees of intensity, complexity of emotions and documents
Importance of Context – around positive and negative words
– Rhetorical reversals – “I was expecting to love it”
– Issues of sarcasm, (“Really Great Product”), slanguage
Essential – need full categorization and concept extraction
Voice of the Customer: Must Have
– Need full Text Analytics to do well
New conceptual models, models of users, communities
New Content Characteristics
It’s a Very Different World
Scale – orders of magnitude – 100’s of millions, Billions
Speed – 20-100 million a day
Size – Twitter, Blogs, forums, email
– 140 characters to a few sentences
Quality – misspellings, lack of structure, incoherence
Conversations – not stand alone docs
– Can’t tell what a “document” is about without reference to previous threads
Purpose – communicate - social grooming, rant
– Not exchange of ideas, policies, etc.
Simple Content Complexity – single thoughts, simplicity of emotion
New Content Characteristics
It’s a Very Different World – Search and Taxonomy
i tried very slow, NO GOOGLE search, some apps not working.. This is not a "with GOOGLE" My friend has incredible, that is much batter.. Anyways i returned samsung, replace incredible. What's great about it: 4" LCD What's not so great: NOT A GOOGLE PHONE
(nt 2.0)willie John ci to/for: wanted to know about charges for pic mail for ;bill date 4/5/2010 | repeat: no | auth: pin | ptns affected: 7777777777 | information/instructions given: sup gave pic mail for free and gave adj for $ 2.40 new bal is $ 147.53 | any mobile, anytime: n | ir: yes | ir-email: n |
New Content Characteristics
It’s a Very Different World – Topical Current Content
Content not archived (for users)
No real need for search (or just very simple search)
Very Poor (if any) metadata – not faceted search
Focus on phrases, sentences – not documents
Little need of a complex subject taxonomy
About emotions, things, products, people
Emotion – simple structures, infinite kinds of expression
It’s a Very Different World
Companies are mining this resource and they need to add structure to get deeper understanding
Varieties of structure:
– Simple topical taxonomies 2-3 levels
– Emotion taxonomies, Ontologies and Semantic Networks
– Dynamic taxonomies – built on public taxonomies, enterprise taxonomy – exposed in hierarchical triples.
Need more automatic / semi-automatic solutions
– Advanced text analytics
New Kinds of Social Taxonomies
New Taxonomies – Appraisal
– Appraisal Groups – Adjective and modifiers – “not very good”
– Four types – Attitude, Orientation, Graduation, Polarity
– Supports more subtle distinctions than positive or negative
Emotion taxonomies
– Joy, Sadness, Fear, Anger, Surprise, Disgust
– New Complex – pride, shame, embarrassment, love, awe
– New situational/transient – confusion, concentration, skepticism
Beyond Keywords – Need Text Analytics
– Analysis of phrases, multiple contexts – conditionals, oblique
– Analysis of conversations – dynamic of exchange, private language
– Enterprise taxonomy rolled into a categorization taxonomy
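As a toy illustration of appraisal groups, the sketch below attaches preceding modifiers to an adjective and adjusts its score. The lexicons, the weights, and the choice to treat negation as "flip and halve" are all assumptions made for this example, not part of any published appraisal taxonomy.

```python
import re

# Hypothetical mini-lexicons; real ones would be far larger.
ADJECTIVES = {"good": 1.0, "great": 1.5, "bad": -1.0, "slow": -0.5}
INTENSIFIERS = {"very": 1.5, "really": 1.5, "slightly": 0.5}
NEGATORS = {"not", "never", "no"}


def appraisal_groups(text):
    """Return (phrase, score) pairs for each adjective plus its preceding modifiers."""
    tokens = re.findall(r"[a-z']+", text.lower())
    groups = []
    for i, tok in enumerate(tokens):
        if tok not in ADJECTIVES:
            continue
        score = ADJECTIVES[tok]
        start = i
        j = i - 1
        # Walk left over the modifiers that belong to this adjective.
        while j >= 0 and (tokens[j] in INTENSIFIERS or tokens[j] in NEGATORS):
            if tokens[j] in INTENSIFIERS:
                score *= INTENSIFIERS[tokens[j]]   # graduation
            else:
                score *= -0.5                      # negation: flip and dampen
            start = j
            j -= 1
        groups.append((" ".join(tokens[start:i + 1]), score))
    return groups
```

With these weights, "not very good" comes out as mildly negative instead of being counted as a positive hit on "good", which is the kind of distinction a plain keyword match would miss.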
Social Media: Next Generation
Variety of New Applications
Crowd Sourcing Technical Support
– User Forums – find problem area, nearby text for solution
– Automatic or Human mediated
Legal Review
– Significant trend – computer-assisted review (manual = too many)
– TA- categorize and filter to smaller, more relevant set
– Payoff is big – One firm with 1.6 M docs – saved $2M
Financial Services
– Trend – using text analytics with predictive analytics – risk and fraud
– Combine unstructured text (why) and transaction data (what)
– Customer Relationship Management, Fraud Detection
– Stock Market Prediction – Twitter, impact articles
Social Media: Next Generation
Variety of New Applications
Voice of the Customer (Employee, Voter)
– Early discovery of issues with product, service, customer issues
– Identify opportunities for new products and service, sales or new
feature improvements
– Enable companies to find and understand correlations between
promotional campaigns and customer reactions
– It can lead to business or competitor intelligence
Current – better at gathering information than analyzing
Possibilities are (almost) endless
And a little bit scary – deep psychology, conservative-liberal
brains
Social Media: Next Generation
Behavior Prediction – Telecom Customer Service
Problem – distinguish customers likely to cancel from mere threats
Basic Rule
– (START_20, (AND, (DIST_7,"[cancel]", "[cancel-what-cust]"),
– (NOT,(DIST_10, "[cancel]", (OR, "[one-line]", "[restore]", "[if]")))))
Examples:
– customer called to say he will cancell his account if the does not stop receiving
a call from the ad agency.
– cci and is upset that he has the asl charge and wants it off or her is going to
cancel his act
More sophisticated analysis of text and context in text
Combine text analytics with Predictive Analytics and traditional behavior
monitoring for new applications
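The basic rule above can be mimicked in Python as follows. This is a sketch under assumptions: START_20 is read as "the cancel word occurs in the first 20 tokens", DIST_n as "within n tokens", and the word lists standing in for the bracketed concepts are hypothetical. It also ties both distance checks to the same cancel mention, which is slightly stricter than the rule as written.

```python
import re

# Hypothetical word lists standing in for the bracketed concepts in the rule.
CONCEPTS = {
    "cancel": {"cancel", "cancell", "cancels", "canceling", "cancelled"},
    "cancel-what-cust": {"account", "acct", "act", "service", "contract"},
    "one-line": {"line"},
    "restore": {"restore", "restored", "reconnect"},
    "if": {"if", "unless"},
}


def positions(tokens, concept):
    """Token indexes where any word of the named concept occurs."""
    return [i for i, t in enumerate(tokens) if t in CONCEPTS[concept]]


def likely_to_cancel(note, start=20, near=7, guard=10):
    """True if an early cancel mention is near a cancel target and not near a blocker."""
    tokens = re.findall(r"[a-z]+", note.lower())
    targets = positions(tokens, "cancel-what-cust")
    blockers = [i for c in ("one-line", "restore", "if") for i in positions(tokens, c)]
    for c in positions(tokens, "cancel"):
        if c >= start:
            continue
        has_target = any(abs(c - t) <= near for t in targets)
        blocked = any(abs(c - b) <= guard for b in blockers)
        if has_target and not blocked:
            return True
    return False
```

Under these assumptions the first example note is not flagged, because "if" appears close to "cancell", while the second is flagged. The boolean result could then be used as one feature in a predictive churn model alongside transaction data.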
Future of Text Analytics
Obstacles - Survey Results
What factors are holding back adoption of TA?
– Lack of clarity about TA and business value - 47%
– Lack of senior management buy-in - 8.5%
Need articulated strategic vision and immediate practical win
Issue – TA is strategic, US wants short term projects
– Sneak Project in, then build infrastructure – difficulty of speaking enterprise
Integration Issue – who owns infrastructure? IT, Library, ?
– IT understands infrastructure, but not text
– Need interdisciplinary collaboration – Stanford is offering English-
Computer Science Degree – close, but really need a library-
computer science degree
Future of Text Analytics
Primary Obstacle: Complexity
Usability of software is one element
More important is difficulty of conceptual-document models
– Language is easy to learn, hard to understand and model
Need to add more intelligence (semantic networks) and ways for
the system to learn – social feedback
Customization – Text Analytics – heavily context dependent
– Content, Questions, Taxonomy-Ontology
– Level of specificity – Telecommunications
– Specialized vocabularies, acronyms
New Directions in Text Analytics
Conclusions
Text Analytics still growing: more mature applications and
technique
Find the right balance of infrastructure and application focus
Essential theme – integration – text and data, enterprise and
social
Big obstacles remain
– Strategic Vision of text analytics in the enterprise
– Concrete and quick application to drive acceptance
Future – Women, Fire, and Dangerous Things
– Text Analytics and Cognitive Science = Metaphor Analysis, deep
language understanding, common sense?
New Directions in Text Analytics
Conclusions
Bigger:
– Big Data gets the press, but Big Text is bigger – and potentially more
valuable – Needs more systemic solutions
– Number and variety of TA Applications still growing
Better:
– Libraries of Modules – Ensemble Methods
– Cognitive Computing – TA Foundation
Smarter:
– Not AI, but smarts without waiting for 50 years
Great Time to get into Text Analytics
Questions?
Tom Reamy
Program Chair – Text Analytics World
KAPS Group
http://www.kapsgroup.com