Extracting Value from Unstructured Data


» insights

Text analytics discover new predictive customer insights in a major source of untapped data

Even the most customer-centric businesses today are making decisions about their customers based

on only about 20% of available data. Much of the untapped 80%—data that is unstructured—could

be yielding not only more insights but different and complementary insights. This new intelligence

improves the ability to predict customer behavior, as well as understand and respond to the needs

and motivations behind it.

Text analytics helps to deliver this fresh intelligence by mining a major category of unstructured

data—one many organizations currently have on hand. Customer-related text may be abundant

and available, but most organizations don’t know how to efficiently extract predictive elements from

it. They don’t know how to reap the value of these insights by using them to boost the performance

of predictive analytics and make better operational customer decisions.

This white paper discusses how text analytics can help bring Big Data benefits to your organization’s

bottom line. It covers:

• Understanding what text analytics is and how it can help you.

• Boosting the performance of existing models with text-derived insights.

• Using text-derived insights to improve segmentation,

decision strategies and customer interactions.

• Exploring how text analytics reveal behavioral context, and

can provide clues to what customers are thinking and feeling.

• Choosing the right text analytic techniques for your objectives.

Number 71—October 2013

Done right, extracting valuable predictive insights from huge quantities of text takes just seconds.

www.fico.com Make every decision count™


As organizations strive to become more customer-centric and deliver ever-better service, they

need additional sources of data-driven insights. Text is a major untapped potential source of

these insights—if structured in ways that make it usable by traditional analytic models. There are

also opportunities to use insights from text independently and with insights from other types of

unstructured data.

The time to unlock this value is now. For many organizations, text currently accounts for

the largest percentage of unstructured data that is on hand or easily accessible. New sources

of text—for instance, from blogs, Twitter feeds and other social media—are proliferating.

And text analytics, which provides the means to do the unlocking, has reached a level of

developmental maturity suitable for widespread use. In its Hype Cycle for Big Data, 2013

(July 2013), Gartner positions text analytics as having high benefit and projects mainstream

adoption within two to five years.

What has brought text analytics to this point of readiness? Recent technological advances are

solving many of the problems that posed obstacles in the past:

• Transforming for machine insights. Text in email messages, call center logs, new business

applications and collections agent notes is readily understood by us, but it’s meaningless

to traditional predictive models. A variety of machine learning methods, however, can

determine what texts are about and classify them for further analysis. They can discover

customer characteristics and transform them into structured numerical inputs that are

comprehensible to predictive models and other types of traditional analytic algorithms.

• Handling complexity. Unstructured and semi-structured text (e.g., XML files, Excel

spreadsheets, weblogs) is inherently complex since it may contain a wide range of content

on a wide range of topics. Moreover, the potential value of text analysis is often increased by

combining text of different types from multiple sources. Such a comprehensive approach

may reveal complex, subtle customer behavior patterns not evident in smaller, more

homogeneous document sets. But the task of collating, regularizing and organizing diverse

data would be impossible without today’s advanced technologies. We now have the analytic

techniques and data infrastructures to essentially merge disparate varieties of text into one

“document” for analysis.

• Facilitating engineering, deployment, management and regulatory compliance.

While text and the process of analyzing it can be quite complex, the results need to be simple

to understand and use. Today we can bring new insights from text analysis into predictive

scorecards, for example, maintaining all the advantages they provide. The resulting scorecard

has higher predictive power, but can still be engineered to meet specific business needs

and regulatory requirements. It’s easily deployed into rules-driven decision processes and

operational workflows. It’s easy to manage, including tracking performance, automating

updates and measuring the impact of change. The contribution of text analysis is transparent,

explainable to regulators, and documentable through automated audit trails and regular

validation and compliance reporting.

» Why Text Analytics Now

http://www.networkworld.com/community/blog/big-data-reaching-peak-its-hype-gartner-says


• Scalable processing. While in many cases text analysis can be performed on a stand-alone PC,

there are situations where the quantity of text and number of potential customer

characteristics it contains are immense. Today’s Big Data infrastructures can perform analysis

at this scale with extreme efficiency and speed. Open source implementations of the

MapReduce programming model and Hadoop distributed file system, for instance, process

very large data sets in parallel across processor grids of enormous size. Storm, another open

source distributed computation system, provides near-real-time distributed processing.
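The MapReduce idea mentioned above can be sketched on a single machine. This is a minimal illustration, not Hadoop code: the function names `map_phase` and `reduce_phase` and the sample notes are assumptions made for the example. Each document is mapped to a partial term count, and the partial counts are then merged, which is the step a cluster would spread across many processors.

```python
from collections import Counter
from functools import reduce


def map_phase(document):
    """Map step: emit a partial term count for one document."""
    return Counter(document.lower().split())


def reduce_phase(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


# Hypothetical call center notes; on a cluster each could sit on a different node.
documents = [
    "late fee waived after call",
    "customer asked about late fee",
    "fee dispute call escalated",
]

# On a real cluster the map calls would run in parallel.
partial_counts = [map_phase(doc) for doc in documents]
term_counts = reduce(reduce_phase, partial_counts, Counter())

print(term_counts.most_common(3))
```

Because the merge is associative, partial counts can be combined in any order, which is what makes this pattern easy to distribute.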

All of these developments have come together to make now the time for getting started with

text analytics. In addition, the full range of capabilities, including advanced algorithms and

Big Data infrastructures, is becoming available through cloud-based services. Without deep-pocket

investment, organizations of all types and sizes can tap into these powerful technologies

to better serve their customers or constituents.

» How to Use Text Analytics

Text analytics is not a single technology. It’s a convergence of many disciplines and technologies

around the problem of extracting meaningful signals from text, as shown in Figure 1. Some of

these contributors have been around for a long time, and some are newer and rapidly advancing.

At the end of this paper, you’ll find guidance for determining the right techniques for your application.

[Figure 1 diagram: Text Analytics and Text Mining shown at the intersection of Web Mining, Information Retrieval, Natural Language Processing, Concept Extraction, Clustering, Information Extraction and Classification, drawing on Statistics, Computer Science, Machine Learning, Artificial Intelligence, Management Science and other disciplines. Tasks labeled in the diagram: inverted index, document matching, search optimization, co-reference, relationship extraction, entity extraction, collocations, word association, sentiment analysis, tokenization, part-of-speech tagging, lemmatization, document similarity, document clustering, document ranking, document categorization, alert detection, web analytics, web content mining and web structure analysis.]

Figure 1: Text analytics comprises many rapidly advancing technologies

Source: Miner, Gary, et al. Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications: 2012. Print.


The best place to start, however, is not with technology but with how you want to use

text-extracted insights:

• Is your objective to improve an analytic model’s ability to predict a particular

customer behavior? Are you looking for additional insights into the likelihood that a

customer will, say, purchase a certain product, become 60-days delinquent or attrite?

• Is your objective to more generally increase analytic insights for better

segmentation, decision strategies and interactions? Are you trying, for example, to

better understand why you’re losing some of your best customers to attrition or how to

improve your collections results with certain types of customers?

As depicted in Figure 2, text-derived customer characteristics can be used for both

types of objectives.

On the left branch, the graphic shows that once transformed into structured numerical

inputs, text-derived characteristics can be used with traditional supervised analytic

techniques. These use historical data tagged with observed outcomes (e.g., fraud/no

fraud, purchase within six weeks, delinquent account paid/rolled over) to train models

to predict a targeted customer behavior.

A text-derived characteristic could potentially

be used in multiple models, where its

information value depends on the degree

to which it correlates with the targeted

customer behavior.

The right branch incorporates text-derived

characteristics into unsupervised analytic

techniques that do not require tagged

historical data. These include analytic

methods for improving segmentation,

deepening understanding of individual

customers, adapting decision strategies to

customer needs and behavior context, and

even triggering timely actions in response to

changing customer behavior and attitudes.

Let’s look in more detail at how text-derived

insights can improve results in both supervised

and unsupervised analytics.

» Boost the Performance of Existing Models

Using text-derived characteristics in supervised analytic techniques

Organizations that already have predictive models built with structured data can boost the performance of these models by analyzing text they already have or routinely generate. Hidden in this text may be numerous customer characteristics that could have predictive value when used with characteristics from structured data.

Text Mining Objective: use text-derived customer characteristics for:

• Improving performance of a predictive model: supervised analytics (e.g., scorecards, attrition models, response models)

• Increasing insights for segmentation & decisions: unsupervised analytics (e.g., classification, segmentation, self-calibrating outlier models, Latent Dirichlet Allocation)

Figure 2: Text-derived customer characteristics can be used in both supervised and unsupervised analytics

Two ways of mining this potential predictive power from raw

text are feature extraction and entity extraction. FICO’s Semantic

Scorecard technology makes use of feature extraction. As depicted

in Figure 3, there are four major steps in this analytic technique:

1. Transform text into a numerical representation in the form of a term frequency matrix. The algorithm breaks text strings into tokens (meaningful pieces for analysis), which in this case are terms made up of words or combinations of words. It then determines the frequency of occurrence for each term; it takes into account the descendants of word roots, synonyms, etc., and filters out frequently occurring terms that convey little meaning, such as conjunctions and relative pronouns. Filters are configurable so users can make adjustments and rapidly see how they change results. The result is a numerical representation of the text in a form that is comprehensible to algorithms and can be used as input into a variety of text analytics techniques.

2. Apply statistical techniques to discover predictors with high information value. Frequently occurring terms are candidates for selection as predictive variables. Statistical analytic algorithms based on information theory identify those with high information value: variables that correlate strongly with the target outcome variable. This process discovers often subtle and complex predictors hidden in the text, invisible to the human eye and mind. And it’s extremely fast. Tens of thousands of records made up of hundreds of words of freeform text can be analyzed in seconds.

3. Create an enhanced data set containing predictive characteristics extracted from both the text and the structured data. Strong predictors extracted from the text and strong predictors extracted from structured data are combined into a single tagged enhanced data set.

4. Perform traditional model training and scoring. The enhanced data set can be analyzed by the existing structured data scorecard technology.
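The first three steps can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the Semantic Scorecard implementation: the stop-word list, the sample notes, the `utilization` characteristic and the use of mutual information as the information-theoretic measure are all assumptions chosen for the example.

```python
import math
from collections import Counter

# Hypothetical filter for terms that convey little meaning.
STOP_WORDS = {"and", "or", "but", "that", "which", "who", "the", "a", "to"}


def tokenize(text):
    """Lowercase, strip punctuation and drop stop words."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def term_frequency_matrix(texts):
    """Return (vocabulary, rows), where rows[i][j] counts term j in text i."""
    counts = [Counter(tokenize(t)) for t in texts]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[term] for term in vocab] for c in counts]


def mutual_information(present, target):
    """Mutual information, in bits, between two binary sequences."""
    n = len(target)
    mi = 0.0
    for x in (0, 1):
        for y in (0, 1):
            joint = sum(1 for p, t in zip(present, target) if p == x and t == y) / n
            px = sum(1 for p in present if p == x) / n
            py = sum(1 for t in target if t == y) / n
            if joint > 0:
                mi += joint * math.log2(joint / (px * py))
    return mi


# Hypothetical tagged records: free text, an outcome tag and one structured field.
notes = [
    "Customer disputes the late fee",
    "Late payment, disputes charge",
    "Asked about travel rewards",
    "Updated address and travel plans",
]
target = [1, 1, 0, 0]
utilization = [0.9, 0.7, 0.2, 0.3]

# Step 1: term frequency matrix.
vocab, tf = term_frequency_matrix(notes)

# Step 2: score each term by how much it tells us about the target.
scores = {}
for j, term in enumerate(vocab):
    present = [1 if row[j] > 0 else 0 for row in tf]
    scores[term] = mutual_information(present, target)
top_terms = sorted(scores, key=scores.get, reverse=True)[:2]

# Step 3: enhanced data set combining structured and text-derived predictors.
enhanced = [
    {"utilization": u, **{t: row[vocab.index(t)] for t in top_terms}, "target": y}
    for u, row, y in zip(utilization, tf, target)
]
```

The `enhanced` rows are ordinary numeric records, so step 4 can hand them to whatever model training process is already in place for structured data.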

Figure 4 shows the improved accuracy seen when the

Semantic Scorecard technology was applied to a bank’s

existing risk scorecard for new account decisions.

In other FICO research, a Semantic Scorecard incorporating

predictors from notes about sales inquiries did a better job of

identifying leads that result in sales than a scorecard using

structured data alone. In the top 5% of scored leads, it found

3% more leads resulting in sales. Increased accuracy can lead

to better prioritization and sales outcomes.

Figure 3: Transforming text into numerical representations that can be analyzed for behavior predictors

[Figure 3 diagram: (1) Freeform Text is transformed into a Term Frequency Matrix; (2) Statistical Analysis relates the terms to the Target; (3) an Enhanced Data Set combines the selected text predictors with Variables Used in Traditional Scorecards; (4) Traditional Model Training & Scoring.]

Figure 4: Analyzing hidden signals in text lifts predictive performances

[Chart: percentage of “bads” (vertical axis, 0% to 100%) plotted against percentage of “goods” (horizontal axis, 0% to 100%) for the Traditional Scorecard and the Semantic Scorecard.]

The chart shows a Semantic Scorecard outperforming a bank’s traditional risk scorecard. At any percentage of “goods” (current and fully paid accounts), the Semantic Scorecard predicts a higher percentage of “bads” (charged-off and defaulted accounts).
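A curve of this kind can be computed from any scored, tagged sample. The sketch below is an assumption-laden illustration: the scores and tags are invented, and a higher score is taken to mean higher risk, which may be the reverse of a given scorecard's convention.

```python
def bads_vs_goods_curve(scores, is_bad):
    """Cumulative share of bads captured vs. share of goods, riskiest first."""
    ranked = sorted(zip(scores, is_bad), key=lambda pair: pair[0], reverse=True)
    total_bad = sum(is_bad)
    total_good = len(is_bad) - total_bad
    bads = goods = 0
    curve = [(0.0, 0.0)]
    for _, bad in ranked:
        if bad:
            bads += 1
        else:
            goods += 1
        curve.append((goods / total_good, bads / total_bad))
    return curve


def area_under(curve):
    """Trapezoidal area under the curve; larger means better separation."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(curve, curve[1:]))


# Hypothetical outcomes and scores from two models on the same six accounts.
is_bad = [1, 1, 0, 1, 0, 0]
traditional_scores = [0.9, 0.4, 0.6, 0.7, 0.5, 0.1]
semantic_scores = [0.9, 0.8, 0.6, 0.7, 0.5, 0.1]

traditional_area = area_under(bads_vs_goods_curve(traditional_scores, is_bad))
semantic_area = area_under(bads_vs_goods_curve(semantic_scores, is_bad))
```

A model whose curve sits higher at every percentage of goods, as described for Figure 4, will also have the larger area.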


For the same project, we also employed named entity extraction (NEE) as a complementary

technique. NEE is based on natural language processing, which draws on the disciplines of

computer science, artificial intelligence and linguistics. By analyzing the structure of the text, NEE

determines which parts of it are likely to represent entities such as people, locations, organizations,

job titles, money, percentages, dates and times.

One reason NEE is compatible with scorecards is that both techniques enable strategy engineering.

For every entity identified, the NEE algorithm generates a score indicating the probability that the

identification is correct. Engineers can set thresholds—accepting only those entities with a score

above 80%, for example.
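Thresholding of this kind is simple to express. In the sketch below, the `Entity` record, the `accept` function and the sample extractor output are hypothetical; a real NEE component would supply its own output format.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    text: str
    label: str
    score: float  # probability that the identification is correct


def accept(entities, threshold=0.80):
    """Keep only entities whose score is above the threshold."""
    return [e for e in entities if e.score > threshold]


# Hypothetical extractor output.
extracted = [
    Entity("Acme Corp", "organization", 0.93),
    Entity("VP of Purchasing", "job title", 0.85),
    Entity("May", "date", 0.55),
]
accepted = accept(extracted)
```

Raising or lowering `threshold` trades the number of entities kept against their reliability.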

In this project, we used extracted entities with a similarity-based matching algorithm to join

records from different kinds of files having no direct links (e.g., a structured file containing customer

information and unstructured texts about interactions with the customer). In addition, by combining

extracted entities, we were able to impute an individual’s authority to make a purchase decision—a

strong predictor for improving intelligent automated lead generation that is not directly available in

the dataset.
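The matching algorithm used in the project is not described here, so the following is only a stand-in that shows the general idea of similarity-based joining. It uses `difflib.SequenceMatcher` from the Python standard library; the customer file, the entity strings and the 0.6 cut-off are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(entity, customers, min_similarity=0.6):
    """Return the id of the most similar customer name, or None if too weak."""
    scored = [(similarity(entity, name), cid) for cid, name in customers.items()]
    score, cid = max(scored)
    return cid if score >= min_similarity else None


# Hypothetical structured customer file and entities extracted from notes.
customers = {101: "Acme Corporation", 102: "Globex Inc"}
note_entities = ["ACME Corp.", "Initech"]

links = {entity: best_match(entity, customers) for entity in note_entities}
print(links)
```

Entities that do not clear the cut-off are left unlinked rather than forced onto the nearest record.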

This customer characteristic and others that prove to have high information value (i.e., to correlate

strongly with the targeted customer behavior) can be incorporated into the Semantic Scorecard or

another type of supervised model. They can also be used in unsupervised analytics, which we’ll look

at next.

Using text-derived characteristics in unsupervised analytic techniques

Text-derived characteristics provide insights for improving population segmentation, as well as

individualized customer decisions and interactions. They can also be used for models that do not

require training with historical data.

Consider the named entity extraction we described in the previous section. Such entities could be

used to refine peer group definitions for self-calibrating outlier models that detect changes in

customer behavior. These models can incorporate new customer characteristics without training

on tagged historical data. In production, they determine on the fly, from ongoing customer

activity, what the current normal ranges of values are for these characteristics. Out-of-range values

can trigger alerts. This patented FICO technology is critical to streaming analytic problems where

the algorithms must continuously update, in real time, estimates of feature distributions so that

detection of outliers is always based on the current distributions.
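The general idea of estimating a normal range on the fly can be illustrated with an exponentially weighted mean and variance. This is a generic sketch and not the patented FICO method; the class name, the decay rate, the three-standard-deviation range and the warm-up length are all assumptions.

```python
class StreamingRange:
    """Tracks an exponentially weighted mean and variance of a stream."""

    def __init__(self, decay=0.1, width=3.0, warmup=5):
        self.decay = decay    # weight given to each new observation
        self.width = width    # half-width of the normal range, in std devs
        self.warmup = warmup  # observations to see before alerting
        self.count = 0
        self.mean = 0.0
        self.var = 0.0

    def update(self, value):
        """Return True if value is out of range, then fold it into the estimates."""
        is_outlier = False
        if self.count >= self.warmup:
            limit = self.width * self.var ** 0.5
            is_outlier = abs(value - self.mean) > limit
        if self.count == 0:
            self.mean = value
        else:
            delta = value - self.mean
            self.mean += self.decay * delta
            self.var = (1 - self.decay) * (self.var + self.decay * delta * delta)
        self.count += 1
        return is_outlier


# Hypothetical stream of one customer characteristic, with a sudden jump.
tracker = StreamingRange()
flags = [tracker.update(v) for v in [50, 52, 48, 51, 49, 50, 400, 51]]
print(flags)
```

No tagged history is needed: the range is learned from the stream itself, and a sustained shift gradually becomes the new normal.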

Another analytic technique effective for segmentation, as well as for detecting changing customer

behavior, is Latent Dirichlet Allocation (LDA) and related methods for finding similarities in data that

enable classification and grouping. LDA is an unsupervised statistical method of extracting topics,

concepts and other types of meaning from unstructured data. It doesn’t understand syntax or any

other aspect of human language. It’s looking for patterns, and it does that equally well no matter

what language text is in, or even if it consists of just symbols rather than characters.

LDA could be used to examine a blog with 100,000 posts, for example, to determine what the

blog is about overall. The algorithm could identify four or five predominant topics or “archetypes” of

content. It could also be applied to a heterogeneous mix of text. In fact, any type of unstructured,

semi-structured and structured data from any number of sources can be combined into a single

“document” for LDA to find patterns across them.
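For readers who want to see the mechanics, LDA can be fitted with a compact collapsed Gibbs sampler. The sketch below is a textbook-style illustration under assumed hyperparameters (`alpha`, `beta`), an assumed iteration count and a tiny invented corpus; production work would normally rely on an established library. Note that the sampler only counts co-occurring tokens, so any symbols would do in place of words.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of token lists.

    Returns per-document topic proportions and per-topic word counts.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token, then resample its topic given all others.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    proportions = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return proportions, topic_word


# Hypothetical call notes, already tokenized.
docs = [
    "travel flight hotel travel abroad".split(),
    "flight hotel abroad travel".split(),
    "fee late payment fee waiver".split(),
    "late payment fee".split(),
]
proportions, topic_word = lda_gibbs(docs, num_topics=2)
```

Each row of `proportions` is one document expressed as a mix of topics, which is the same kind of percentage mapping to archetypes discussed in this paper.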

» Understanding Customers and Their Changing Behavior

http://bankinganalyticsblog.fico.com/2012/08/fraud-analytics-that-adapt-on-the-fly.html


This very flexible technique is commonly used in marketing to generate archetypes for customers

with similar behaviors. FICO is also using this approach to analyze a bank’s call center data. The

goal is to identify meaningful reasons why customers are calling, and use these insights to better

understand and predict attrition risk.

Among the useful discoveries so far: Customers mapping strongly to a frequent traveler archetype

are among the least likely to attrite. Moreover, if these customers interact with a call center

representative regarding a payment missed due to traveling—resulting in waiver of the late fee—

attrition becomes even less likely. This insight could help focus the budget for late fee waivers

where it will have a strong impact on improving attrition rates.

In another area of application, fraud detection, FICO has developed Collaborative Profiles, a patent-pending

streaming version of LDA. As shown in Figure 6, by analyzing card transactions, Collaborative

Profiles represents a real customer as a distribution across multiple archetypes and updates the

distribution continuously, with each transaction.

Figure 5: Using LDA to generate archetypes from text

[Figure 5 panels: call center text (a partially legible sample transcript in which a customer who missed one payment while out of town objects to a late fee and asks that the account not be reported as delinquent); a “bag of words,” the collection of words from call transcripts, with size indicating frequency of occurrence; generation of call center content archetypes, determining customer topics; mapping of customer interactions to archetypes by % of match. Karen’s calls: 20.1%, 11.3%, 22%, 38.5%, 8.1%.]

[Figure 6 panels: real-time updates from transaction streams; dynamic mapping of actual customer Casey’s behavior to multiple archetypes by % of match. Successive distributions: 32.8%, 2.6%, 49.3%, 0.5%, 14.8%; then 35.2%, 2.4%, 50.5%, 0.1%, 11.8%; then 39.2%, 1.3%, 52.1%, 0.3%, 7.1%.]

Figure 6: Using Collaborative Profiles to update customer archetype distributions in real time

http://bankinganalyticsblog.fico.com/2013/07/new-fraud-analytics-learn-from-global-behavior.html


An advantage of this streaming technique is that, based on the delta between the current

distribution and the updated one, Collaborative Profiles can trigger an alert that a significant

change in customer behavior is underway. For example, by analyzing collectors’ notes over the

course of several interactions regarding a delinquent bill, Collaborative Profiles could detect

that the customer is becoming frustrated, angry or losing confidence in his eventual ability to

repay the debt. Perhaps a new factor has entered the equation, such as a family member falling

ill, which may require an adjustment in collections strategy. This type of analysis could even

signal a change in intent—identifying the moment when a customer who originally intended

to pay consciously or unconsciously gives himself permission not to.
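One simple way to measure the delta between two archetype distributions is the total variation distance. The sketch below applies it to the three distributions shown for Casey in Figure 6. The choice of distance measure and the alert threshold of 5 percentage points are assumptions for illustration; the paper does not say how Collaborative Profiles measures the change.

```python
def total_variation(p, q):
    """Half the sum of absolute differences between two distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Casey's archetype distributions from Figure 6, in percent.
casey = [
    [32.8, 2.6, 49.3, 0.5, 14.8],
    [35.2, 2.4, 50.5, 0.1, 11.8],
    [39.2, 1.3, 52.1, 0.3, 7.1],
]

ALERT_THRESHOLD = 5.0  # percentage points; hypothetical

# Distance between each distribution and the one that replaced it.
distances = [total_variation(prev, cur) for prev, cur in zip(casey, casey[1:])]
alerts = [i + 1 for i, d in enumerate(distances) if d > ALERT_THRESHOLD]
print(distances, alerts)
```

With these figures the first update moves about 3.6 points and the second about 5.8, so only the second would raise an alert at this threshold.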

Digging deeper for insights into not only how customers are likely to behave, but also what

they’re thinking and feeling is an area of text analytics generally referred to as sentiment

analysis. The analytic techniques used are often based on natural language processing (NLP),

but they may also be statistical or a hybrid of these.

Traditionally, supervised methods have been

used to analyze the “polarity” of a document

or phrase: Is it positive, negative or neutral? At

the most basic level, this classification is done

by simple statistical methods like key word

indexing. More sophisticated NLP approaches

try to understand the semantic context in

which key words appear to determine degree

of positive or negative sentiment. Some

are able to track and compare changes in

sentiment over time.
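The most basic level described above, key word indexing, can be sketched as follows. The word lists are tiny, invented examples; a usable lexicon would be far larger and tuned to the domain.

```python
# Hypothetical key word lists.
POSITIVE = {"great", "thanks", "helpful", "resolved"}
NEGATIVE = {"late", "angry", "frustrated", "unfair"}


def polarity(text):
    """Classify text as positive, negative or neutral by counting key words."""
    words = [w.strip(".,!?\"").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(polarity("Thanks, that was helpful!"))
print(polarity("I am frustrated by this unfair fee"))
print(polarity("Please update my address"))
```

A counter like this has no sense of context: it will always label “That’s great” as positive, whatever the speaker meant.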

Unsupervised techniques that, like LDA, can

extract topics from text are also being used to

identify a wider range of customer issues and

attitudes. Among them are latent semantic

analysis (LSA) and latent semantic indexing (LSI).

These are attractive not only because they can

be applied to unclassified text, but also because

they can discover unknown and emerging

customer sentiments. They can also be used at

a macro level to identify attitudinal shifts and

developing trends within a market.

Some of the most promising and challenging

areas of development seek to use natural

language processing to understand what

customers really mean when they use a set of words. For example, is the phrase “That’s great”

always positive? What does it mean when the client says “Yeah, sure, sure”? Often we humans

express ourselves in ambiguous ways or employ sarcasm. As more and more customer

interactions take place through email, chat and text messaging rather than phone calls, we lose

the crucial clues to meaning that come from voice tonality and emphasis. The cutting edge of

sentiment analysis seeks to pick up these subtleties through other automated means.

» How Deep Can We Go?

To learn more about text analytics and FICO’s Semantic Scorecard technology,

read the FICO ebook (registration required).

http://subscribe.fico.com/big-data-analytics-for-better-predictions-and-decisions?CID=70180000000dnJzAAI


» Choosing the Right Text Analytic Technique

This overview of several text analytics applications demonstrates the wide range of methods that

can be employed to extract value. Once you have the answer to the fundamental question of how

you want to use text-extracted insights, there are other questions that can help you select the right

techniques for your purpose. The simplified decision tree in Figure 7 is provided as one possible

example of how to think through some of the possibilities.

Figure 7: Example of sorting through text analytics possibilities to select the right approach for your purpose

[Figure 7 decision tree, starting from “Text Mining Application.” First question: sentences, paragraphs, document analysis (Documents) or individual words and phrases (Words)?

Words: Do you want to automatically identify specific facts or gain overall understanding? Specific facts lead to Information Extraction. For understanding: Is your focus on the meaning of the text or on the structure? Meaning leads to Concept Extraction; structure leads to Natural Language Processing.

Documents: Do you want to sort documents into groups or search for specific documents? Search leads to Information Retrieval. For sorting: Do you have categories already? No categories leads to Clustering. With categories: Are your documents independent or connected via hyperlinks? Independent leads to Document Classification; connected leads to Web Mining.]

FICO Text Analytics Solutions

The latest release of FICO® Model Builder, our complete analytic development environment, includes extensive capabilities for Big Data text analytics. These include FICO’s proprietary Semantic Scorecard technology for combining predictors discovered in both text and structured data into a single, easy to deploy and manage scorecard. Model Builder also provides access to scalable data processing via Hadoop and to the huge library of algorithms (including for entity extraction, topic modeling and sentiment analysis) from R, the open source statistical programming language. In addition, Model Builder ships with the Tika and Lucene libraries for text extraction and indexing.

FICO® Decision Management Platform, accessible soon via cloud-based services, provides everything you need to quickly and efficiently bring text-extracted insights to bear on operational decisions. It includes Big Data infrastructure, rapid application development (RAD) tools and decision applications.

FICO® Model Central™ Solution provides the means to deploy, track, validate, monitor and manage a wide range of analytics to improve bottom-line impact and regulatory compliance. This kind of centralized control is increasingly crucial today as analytics become part of almost every customer operational decision, and the number and variety of models in use expands dramatically. Among the specific benefits for text analytics is the ability to assess change by tracking volatility in the prevalence and importance of topics, entities and other text-extracted insights.

http://www.fico.com/en/Products/DMTools/Pages/FICO-Model-Builder.aspx

http://www.fico.com/en/Products/DMTools/Pages/FICO-Decision-Management-Platform.aspx

http://www.fico.com/en/Products/DMTools/Pages/FICO-ModelCentral-Solution.aspx



The Insights white paper series

provides briefings on research

findings and product development

directions from FICO. To subscribe,

go to www.fico.com/insights.

FICO, Model Central and “Make every decision count” are trademarks or registered trademarks of Fair Isaac Corporation in the United States and in other countries. Other product and company names herein may be trademarks of their respective owners. © 2013 Fair Isaac Corporation. All rights reserved.

3017WP 10/13 PDF

For more information: North America toll-free +1 888 342 6336; International +44 (0) 207 940 8718; email [email protected]; web www.fico.com

» Conclusion—Make Decisions Based on All the Data

The obstacles that once forced organizations to make operational decisions based on a partial

representation of customer data are now falling away. Leaders in all industries and sectors are

beginning to make decisions on an increasingly wide range of Big Data—including vast stores

of text they already have. Text analytics are delivering abundant, fresh customer insights to fuel

higher levels of customer service and business performance.

To learn more about advances in Big Data analytics and related technologies, visit the

FICO Labs Blog or read these Insights papers:

• When Is Big Data the Way to Customer Centricity?

• Is It Fraud? Or New Behavior?

• From Big Data to Big Marketing: Seven Essentials

http://ficolabsblog.fico.com/

http://www.fico.com/account/resourcelookup.aspx?theID=870


