
© 1997 William H. Inmon, all rights reserved

T E C H T O P I C 5

DATA MINING: AN ARCHITECTURE, PART 1

by W.H. Inmon

[This Tech Topic is divided into two parts because of the length of the topic. It is recommended that if you read the first part you also read the second part, as the two parts logically form a single Tech Topic.]

One of the most important uses of the data warehouse is that of data mining. Data mining is the process of using raw data to infer important business relationships. Once the business relationships have been discovered, they can then be used for business advantage.

Certainly a data warehouse has other uses than that of data mining. However, the fullest use of a data warehouse must include data mining.

There are many approaches to data mining, just as there are many approaches to the actual mining of minerals. Minerals historically have been mined in many different ways: by panning for gold, digging mine shafts, strip mining, analyzing satellite photos taken from space, and so forth. In much the same fashion, data mining occurs in many different forms and flavors, each with its own overhead, rewards and probability of success.

A General Approach

The general approach to data mining that will be discussed in this series of Tech Topics is described in Figure APP 1.


The diagram in Figure APP 1 is not a methodology, per se. Instead, the diagram represents a broader approach than a single methodology. The steps in the approach shown in Figure APP 1 are as follows:

■ infrastructure preparation,
■ exploration,
■ analysis,
■ interpretation, and
■ exploitation.

Figure App 1. a general approach to data exploration/data mining

■ infrastructure: identify source data, locate data, integrate data, scrub data, establish history, locate metadata, secure hardware platform, secure dbms platform, determine frequency of refreshment, define granularity, set objectives
■ exploration: summary level comparisons, sampling analysis, DSS analyst intuition, random access of data, heuristic search for patterns
■ analysis: business relevance of discovered patterns, statistical strength of discovered patterns, population the pattern is applicable to, conditions predicating the pattern
■ interpretation: business cycles, seasonality, applicable population, strength of prediction, size of population, factor correlation (time, geography, demographics, other)
■ exploitation: sales, packaging, new product introduction, pricing, advertising, strategic alliance, delivery, presentation

The methodology describes a simple approach to data exploration and data mining. The methodology is nonlinear in that there are constant movements up and down the scale, into and out of different activities. The methodology is a heuristic one in that the next step of development depends on the success and results of the current development.


Infrastructure Preparation

The first step in data mining is the identification and preparation of the infrastructure. It is in the infrastructure that the actual activity of data mining will occur. The infrastructure contains (at a minimum):

■ a hardware platform,
■ a dbms platform, and
■ one or more tools for data mining.

In almost every case, the hardware platform is a separate platform from the one that originally contains the data. Said differently, it is very unusual to try to do data mining on the same platform as the operational environment. To be done properly, data needs to be removed from its host environment before it can be properly mined.

Removing data from the host environment is merely the first step in preparing the data mining infrastructure. In order for data mining to be done efficiently, the data itself needs to have undergone a thorough analysis, and in most cases the data needs to have been scrubbed. The scrubbing of the data entails integrating the data, because the operational data will often have come from many sources. In addition, a metadata infrastructure (a "card catalog") that sits above the data to be mined is very useful. Unless there is a small amount of data to be mined (which is almost never the case), the metadata sits above the data to be mined and serves as a roadmap to what is and is not contained in the data. The metadata contains such useful information as:

■ what data is contained in the data to be mined,
■ what the source of the data is,
■ how the data has been integrated,
■ the frequency of refreshment, and so forth.
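The "card catalog" idea can be sketched as a simple lookup structure. The following is a minimal illustration only; the table name, fields and values are hypothetical, and a real metadata store would carry far more:

```python
# A sketch of one entry in the metadata "card catalog" that sits above
# the data to be mined. All names and values here are hypothetical.
catalog = {
    "monthly_sales": {
        "contents": "sales totals by customer and month",
        "sources": ["order_entry", "point_of_sale"],
        "integration": "customer ids unified across both source systems",
        "refresh_frequency": "monthly",
    },
}

def roadmap(catalog, table):
    """Answer the roadmap question: what is, and is not, in the data?"""
    entry = catalog.get(table)
    return entry["contents"] if entry else None
```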

Granularity

One of the biggest issues of creating the data mining infrastructure is that of the granularity of the data. The finer the level of granularity, the greater the chance that unusual and never before noticed relationships of data will be discovered. But the finer the level of granularity, the more units of data there will be. There often are so many units of data that important relationships hide behind the sheer volume of data. Therefore, raising the level of granularity can greatly help in discovering important relationships of data.

There is a very important trade-off to be made here between the very fine level of detail that is possible and the need to manage volumes of data. Making this trade-off properly is one of the reasons why data mining is an art, not a science.
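One way to see the trade-off is to roll detailed data up to a coarser grain and watch the number of units shrink. A minimal sketch, with hypothetical transaction records:

```python
from collections import defaultdict

# Hypothetical detailed records: (customer, date, amount), one row per
# transaction -- the finest level of granularity.
detail = [
    ("cust1", "1997-01-03", 25.0),
    ("cust1", "1997-01-17", 40.0),
    ("cust2", "1997-01-09", 15.0),
    ("cust2", "1997-02-02", 60.0),
]

def raise_granularity(rows):
    """Roll transaction-level detail up to customer-per-month totals,
    trading fine detail for a manageable volume of data."""
    summary = defaultdict(float)
    for customer, date, amount in rows:
        month = date[:7]              # coarser grain: YYYY-MM
        summary[(customer, month)] += amount
    return dict(summary)

summary = raise_granularity(detail)
# Four detailed rows collapse into three coarser units; the two
# January purchases by cust1 are no longer distinguishable.
```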

Exploration

Once the infrastructure has been established, the first steps in exploration can commence. There are as many ways to approach data exploration as there are to approach the actual discovery of minerals. Some of the approaches to the discovery of important relationships include:

■ analyzing summary data and "sniffing out" unusual occurrences and patterns,
■ sampling data and analyzing the samples to discover patterns that have not been detected before,


■ taking advantage of the intuition and experience of the experienced DSS analyst,
■ doing random access of data,
■ heuristically searching for patterns, etc.

Analysis

Once the patterns have been discovered, there needs to be a thorough analysis of the patterns. Some patterns are not statistically very strong, while other patterns are very strong. The stronger the pattern, the better the chance that the pattern will form a basis for exploitation. On the other hand, if a pattern is not strong today, but its strength is increasing over time, then this kind of pattern may be of great interest because it may be a clue as to how to anticipate the marketplace.

Strength of pattern is not the only thing that needs to be considered. A second consideration is whether the pattern is a "false positive". A false positive is a correlation between two or more variables that is valid, but random. Given the large number of occurrences of data and the large number of variables, it is inevitable that there will be some number of false positive correlations of data.

A third consideration is whether a valid correlation of variables has any business significance. It is entirely possible there will be a valid correlation between two variables that is not a false positive, but for which there is no business significance.

These then are some of the analysis activities that follow exploration.

Interpretation

Once the patterns have been discovered and analyzed, the next step is to interpret them. Without interpretation, the patterns that have been discovered are fairly useless. In order to interpret the patterns, it is necessary to combine technical and business expertise. An interpretation without both elements is a fairly pointless exercise.

Some of the considerations in the interpretation of the patterns include:
■ the larger business cycles of the business,
■ the seasonality of the business,
■ the population to which the pattern is applicable,
■ the strength of the pattern and the ability to use the pattern as a basis for future behavior,
■ the size of the population the pattern applies to, and
■ other important external correlations to the pattern, such as:
♦ time (of week, of day, of month, of year, etc.),
♦ geography (where the pattern is applicable), and
♦ demographics (the group of people the pattern is applicable to), and so forth.

Once the pattern has been discovered and the interpretation has been made, the process of data mining is prepared to enter the last phase.


Exploitation

Exploitation of a discovered pattern is a business activity and a technical activity. The easiest way that a discovered pattern can be exploited is to use the pattern as a predictor of behavior. Once the behavior pattern is determined for a segment of the population served by a company, the pattern can be used as a basis for prediction. Once the population is identified and the conditions under which the behavior will predictably occur are defined, the business is now in a position to exploit the information. There are many ways the business can exploit the information:

■ by making sales offers,
■ by packaging products to appeal to the predicted audience,
■ by introducing new products,
■ by pricing products in an unusual way,
■ by advertising to appeal to the predicted audience,
■ by delivering services and/or products creatively,
■ by presenting products and services to cater to the predicted audience, and so forth.

In addition to using patterns to position sales and products in a competitive and novel fashion, the measurement of patterns over time is another useful way that pattern processing can be exploited. Even if a pattern has been detected that does not have a strong correlation, or if there is only a small population showing the characteristics of the pattern, if the pattern is growing stronger over time or if the population exhibiting the pattern is growing, the company can start to plan for the anticipated growth. Measuring the strength and weakness of a pattern over time, or the growth or shrinkage of the population exhibiting the characteristics of the pattern over time, is an excellent way to gauge changes in product line, pricing, etc.

Yet another way that patterns can lead to commercial advantage is in distinguishing the populations that correlate to the pattern. In other words, if it can be determined that some populations do correlate to a pattern and other populations do not, then the business person can position advertising and promotions with bull's-eye accuracy, thereby improving the rate of success and reducing the cost of sales.

The Approach to Data Exploration and Data Mining

The approach that has been outlined in Figure APP 1 is one in which the activities appear to be linear; that is, the activities appear to flow from one activity to another in a prescribed order. While there may actually be some project that flows as described, it is much more normal for the activities to be executed in a heuristic, nonlinear manner. First one activity is accomplished, then another activity is done. The first activity is repeated and another activity commences, and so forth. There is no implied order to the activities shown in Figure APP 1. Instead, upon the completion of an activity, any other activity may commence, and even activities that have been previously done may have to be redone. Such is the nature of heuristic processing.


Data Mining/Data Exploration

What is data mining and data exploration? Figure 1.1 gives a simple definition of data mining and data exploration.

Data mining is the use of historical data to discover patterns of behavior and use the discovery of those patterns for exploitation in the commercial environment. Typically the audience being considered by data miners is that of the consumer. In addition, the sales that have been made are the focus of the mining activity. However, there is no reason data mining cannot be used in the manufacturing, telecommunications, insurance, banking and other environments.

The notion behind data mining is that there are important relationships of transactions and other vestiges of customer activity that are hidden away in the everyday transactions executed by the customer. Older systems have captured a wealth of data as the systems executed the basics of a transaction. It is data exploration and data mining that discover those relationships and enable the relationships to be exploited commercially in novel and unforeseen ways.

In order to understand the techniques for data mining and data exploration, it is necessary to recognize the people who are doing data mining and data exploration. Figure 1.2 shows that there are two primary groups of people who engage in data mining and data exploration.

Figure 1.1. what is data mining/data exploration? data mining/data exploration: the usage of historical data to discover and exploit important business relationships.

Figure 1.2. data mining is done by explorers and farmers.


Figure 1.2 also shows that there are two distinct groups of people that do data mining and data exploration: farmers and explorers. At any one moment in time an individual is one or the other. Over time, an individual may act in both capacities.

Explorers

Explorers are depicted in Figure 1.3.

Explorers are DSS analysts who don't know what they want, but have an idea that there is something in the data worth pursuing. Explorers typically look at large amounts of data. Frequently, explorers find nothing. Explorers look for interesting patterns of data in a non-repetitive fashion. Even though they typically find nothing, explorers will find huge nuggets of information on occasion. Explorers look at data in very unpredictable and very unusual fashions. Many important findings are made by explorers, although the results they achieve are not predictable.

Farmers

Farmers are also DSS analysts, just as explorers are. But farmers have a very different approach to data mining and data exploration. Figure 1.4 depicts farmers.

Figure 1.3. explorers:
■ access data infrequently,
■ don't know what they want,
■ look at things randomly,
■ look at lots of data,
■ often find nothing, and
■ occasionally find huge nuggets.

Figure 1.4. farmers:
■ know what they are looking for,
■ frequently look for things,
■ have a repetitive pattern of access,
■ look for small amounts of data, and
■ frequently find small flakes of gold.


Farmers look for small amounts of data, and they frequently find what they are looking for. Farmers have a predictable pattern of access to data and seldom come up with any major insight. Instead, farmers find small flakes of gold when they do mining. Farmers access data frequently and in a very predictable pattern.

Farmers follow the lead of explorers, as seen by Figure 1.5.

Explorers look over vast seas of data. When explorers find something of value, they turn over their findings to farmers. Farmers, in turn, attempt to repeat the success of the explorer, but in a predictable, repetitive manner, not in the random manner of the explorer. Said differently, the explorer discovers what to look for, and the farmer executes the search once it is established that something valuable exists in the data on which mining and exploration is done.

Macro/Micro Exploration

Data mining and exploration is effectively done at both the micro and the macro level, as seen in Figure 1.6.

Figure 1.5. where the explorer leads, the farmers follow.

Figure 1.6. there are two types of exploration: macro exploration, at the summary level, and micro exploration, at the detailed level.


Different types of analysis and different types of conclusions can be drawn from the summary and detailed data on which mining and exploration can be done. Figure 1.6 shows that macro exploration is done at the summary level, and micro exploration is done at the detailed level. Both types of data are needed for data mining and exploration. Any mining and exploration effort that excludes one or the other type of data greatly reduces the chances of success.

Summary data is good for looking at long-term trends and getting the larger picture. Once the larger picture is understood, the DSS analyst knows where to go to productively mine and explore the detailed data. The summary data serves as a roadmap to where productive data mining and exploration can be done. Without summary data, productive detailed mining and exploration is simply guesswork. The limitation of mining and exploration at the summary level is that no detailed analysis or correlation of data can be done. With summary data, the process of mining and exploration can only go so far before the lack of detail hampers the analysis.

Detailed data is required for the in-depth analysis and correlation of data. There is no question as to the worth of detailed data. However, there usually is so much detailed data that the DSS analyst drowns in detail unless a preliminary analysis has been done at the summary level. The single most difficult issue of analyzing detailed data is that there are massive volumes that must be considered. The cost, overhead and time consumed by dealing with massive volumes of detailed data is such that the most effective way to deal with detailed data is to start with summary data.
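The summary-as-roadmap idea can be sketched as a simple drill-down: scan the summaries for a period that deviates sharply, then pull only that period's detail for close analysis. The figures and the deviation factor below are hypothetical:

```python
# Hypothetical monthly sales summaries (the roadmap) and the much
# larger detailed table that sits behind them.
summary = {"1997-01": 1000.0, "1997-02": 1050.0, "1997-03": 2400.0}
detail = [
    ("1997-03", "storeA", 1900.0),
    ("1997-03", "storeB", 500.0),
    ("1997-01", "storeA", 600.0),
    # ... in practice, thousands more detailed rows
]

def drill_down(summary, detail, factor=1.5):
    """Use summary data to find months that deviate sharply from the
    average, then pull only those months' detailed rows."""
    avg = sum(summary.values()) / len(summary)
    hot_months = {m for m, v in summary.items() if v > factor * avg}
    return [row for row in detail if row[0] in hot_months]

hot_rows = drill_down(summary, detail)
# Only the March detail is retrieved; the rest never has to be scanned.
```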

There is then a symbiotic relationship between the summary data and detailed data that comes tothe attention of the explorer and the farmer.

Exploration of data is an iterative process that involves at least three elements, as described in Figure 1.7.

Figure 1.7. exploration is a constant loop of iterative hypothesis and verification between the business itself (the corporation/business), summary data, and detailed data.


The process of exploration is a constant movement of focus from summary data to detailed data to the business environment. The activities shift from one environment to the next in a random order, based upon the heuristic processing that is occurring. Trying to do exploration with any one of the elements missing will greatly inhibit the explorer and reduce the chances of success.

Correlating Data

The intellectual basis for data mining is the correlation of data. Figure 1.8 graphically depicts the correlation of some data values.

When data is correlated, there is a message. In some cases, the very strongest, a cause and effect relationship can be inferred. In the cause and effect relationship, one data occurrence is said to have caused another occurrence of data. When cause and effect can be inferred, it is easy to predict business phenomena; however, the cause and effect relationship is not very common.

The more normal case is not cause and effect. The more normal case is that of a coincidental relationship. Perhaps two occurrences appear together because of an unknown common heritage. In other words, the variables are not related in a direct cause and effect relationship. Instead, some event has caused both variables to occur. In this case there is a very strong relationship between the two variables even though it is not a cause and effect relationship.

A third important possibility is that of an indirect relationship. In an indirect relationship, two variables are related, but not through a direct causal link. Variable A was generated by event ABC. Event ABC caused event BCD to occur, which in turn caused variable B to come into existence. In this case, there is a relationship between A and B, but the relationship is an indirect one, not a direct one, and certainly not a cause and effect relationship.

Figure 1.8. correlating data is at the basis of exploration. When there is a correlation, there may be the opportunity for exploitation; when there is no correlation of data, the opportunity for business exploitation is diminished or nonexistent.


Another relationship between the existence of two variables is that of a very indirect relationship. In a very indirect relationship there is indeed some relationship between the existence of variable A and variable B, but the relationship is unknown and is not obvious to any reasonable form of analysis.

The next form of relationship between two variables (A and B) is one that is purely random. For the given case of data there happens to be a relationship between A and B, but there is no valid reason for the relationship, and for another set of data the relationship may very well not exist. The existence of the relationship is merely a random artifact of the set of data under analysis. When there is a lot of data and a lot of variables, it is inevitable that there will be a fair number of "false positive" relationships that are discovered and are mathematically valid; however, they are only a function of the data on which the analysis is being done.

Finally, there is the possibility of a valid relationship between two variables that has a mathematical basis, but for which there is no business basis. The variables may in fact have a mathematically sound relationship, but there is no valid business reason why there should be a correlation. This last case is very interesting. Should there in fact be a valid business basis for the relationship, and the DSS analyst were to discover what that basis is, then there may well be a nugget of wisdom waiting. But if in fact there is no valid business basis for the correlation, the correlation may well be a red herring.

The stronger the relationship, the better the chance that there will be the opportunity to exploit the correlation. Conversely, the weaker the relationship, the less chance there will be for exploitation.

However, weak relationships should not be discarded out of hand. When a weak relationship is discovered and that relationship is becoming stronger over time, there will be the opportunity to anticipate a marketplace, and this opportunity is the epitome of the success that can follow data mining and exploration. Therefore, looking at weak relationships is not a waste of time if in fact those relationships are increasing in strength over time.

Another factor to be considered in examining relationships is the population over which the relationship is valid. In some cases, there may be a weak correlation of data over the general population, but a very strong correlation over a selected population. In this case, segmented marketing may present a real and undiscovered opportunity. For these reasons, weak relationships may be as interesting as very strong and very obvious relationships.

Ways to Correlate Data

The simplest way to correlate data is to ask, for a given selection of data, how often variable B exists when variable A exists. This simple way is used quite often and forms the basis of discovering relationships. There are, however, many refinements to the way that data can be correlated. Figure 1.9 shows some of the refinements.
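This simplest form of correlation, how often B appears when A appears, can be sketched directly over a set of transactions. The baskets below are hypothetical:

```python
def cooccurrence(baskets, a, b):
    """For a set of transactions, measure how often item b appears
    given that item a appears -- the simplest correlation question."""
    with_a = [basket for basket in baskets if a in basket]
    if not with_a:
        return 0.0
    return sum(1 for basket in with_a if b in basket) / len(with_a)

# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]
```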


Figure 1.9 shows that data can be correlated from one element to another, from one element to another over different time periods, from one element to a group of elements, from one element over a geographic area, from one element to external data, from one element to a demographically segmented population, and so forth. Furthermore, the same one element can be correlated over multiple variables at the same time: time, geography, demographics, etc.

As an example of correlating an element of data to another element of data, consider the analysis of sales where the amount of sale is correlated to whether the sale is paid for in cash or with a credit card. When a sale is below a certain amount, it is found that the payment is made with cash. When the sale is over a certain amount, the payment is made with a credit card. And when the sale is between certain parameters, the sale is paid for with either credit card or cash.
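A sketch of this kind of analysis: bucket sales into amount bands and tally the payment method within each band. The band boundaries and sales records below are hypothetical:

```python
from collections import Counter

# Hypothetical sales: (amount, payment_method).
sales = [
    (5.0, "cash"), (8.0, "cash"), (30.0, "cash"), (35.0, "credit"),
    (120.0, "credit"), (200.0, "credit"),
]

def payment_by_band(sales, low=20.0, high=100.0):
    """Tally payment method within amount bands to expose the pattern:
    small sales in cash, large sales on credit, mixed in between."""
    bands = {"low": Counter(), "mid": Counter(), "high": Counter()}
    for amount, method in sales:
        if amount < low:
            bands["low"][method] += 1
        elif amount < high:
            bands["mid"][method] += 1
        else:
            bands["high"][method] += 1
    return bands

bands = payment_by_band(sales)
```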

As an example of a variable being correlated to time, consider the analysis of airline flights throughout the year. The length of the flight and the cost of the flight can be correlated against the month of the year that a passenger flies. Do people make more expensive trips in January? In February? As the holidays approach, do people make shorter and less expensive trips? What exactly is the correlation?

As an example of correlating data against groups of other data, consider an analyst who wants to know if the purchase of automobiles correlates to the sale of large-ticket items in general, such as washers and dryers, television sets and refrigerators.

The correlation of units of data to geography is a very common one in which data is matched against local patterns of activity and consumption. The comparison of the beer drinking habits of Southerners versus Southwesterners is an example of correlating units of data to geography.

One of the most useful types of correlations is that of correlating internal data to external data. The value of this kind of correlation is demonstrated by the comparison of internal sales figures to industry-wide sales figures.

Figure 1.9. there are many ways to correlate data:
■ unit of data versus unit of data,
■ unit of data versus time (hour of day, day of week, month of year),
■ unit of data versus groups of data,
■ unit of data versus geography,
■ unit of data versus external data (e.g., the Dow-Jones Avg), and
■ unit of data versus demographics.


And finally, as an example of demographic correlation, the savings rate for college-educated people can be correlated against the savings rate for non-college-educated people.

There are an infinite number of combinations of correlations that can be done. Some correlations are very revealing and some are merely interesting, but have no potential for exploitation.

The reason why correlations are so intriguing is that they hold the key to commercial exploitation. Figure 1.10 illustrates the sequence of activities that starts with correlation and ends with competitive advantage.

Different Kinds of Trends

There are a multitude of trends. Some of them are useful; some are not. In no small part, the usefulness of a trend is dependent on how permanent the trend is. Figure 1.11 suggests the difference in trends.

Figure 1.10. why pattern recognition is so important:
■ detailed pattern analysis and recognition leads to the determination of trends,
■ the determination of trends leads to predictable patterns of consumption or business activity over large audiences,
■ predictable patterns of business activity lead to the ability to anticipate the marketplace, and
■ the ability to anticipate the marketplace leads to competitive advantage.

Figure 1.11. different kinds of trends: long-term trends, short-term trends, and sub-trends. Each is important and each has its own particular place in gaining competitive advantage.


The most interesting trends are the long-term trends. Long-term trends are interesting in that once detected they can be used for market or behavioral anticipation. Once anticipation can be achieved, it is easy to position for market advantage. But long-term trends tend to be obvious, and the competition most likely will have noticed the trend as well.

Short-term trends hold promise in two ways. If they can be detected early enough, they will become useful tools of competition only to the company that has detected them. But there are problems with short-term trends:

■ by definition, they have a short life, and
■ they often compete with many other trends for attention and are hard to detect.

Exploiting short-term trends requires great agility on the part of the corporation. But because no one else is likely to have noticed the trend, the advantage afforded can be large.

Another way of looking at trends is that a large trend is merely a part of a series of much smaller trends. Each of the smaller trends can be exploited on its own.
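One common way to separate a large trend from the smaller trends riding on it is a moving average: the smoothed series shows the long-term direction, and what is left after subtracting it are the short-term sub-trends. A minimal sketch with hypothetical sales figures:

```python
def moving_average(series, window):
    """Smooth a series to expose the long-term trend; the residuals
    (series minus trend) are the short-term sub-trends riding on it."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical sales by period: noisy period-to-period, but the
# smoothed values climb steadily, revealing the long-term trend.
sales = [10, 12, 11, 13, 14, 13, 15, 16]
trend = moving_average(sales, 4)
```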

Data Warehouse and Data Mining/Data Exploration

Data mining and exploration can be done on any body of data. It is not necessary to have a data warehouse for the purpose of data mining. But there are some very good reasons why a data warehouse is, easily, the best foundation for data mining and data exploration. Figure 1.12 shows that the data warehouse sits at the base of data mining.

The data warehouse provides the foundation for heuristic analysis. The results of each heuristic analysis are created and analyzed as the basis for the next iteration of analysis. Figure 1.13 shows that many iterations of heuristic analysis occur.

Figure 1.12. the data warehouse sets the stage for data exploration and mining: the data warehouse feeds heuristic analysis, which yields trends and patterns.


As the DSS analyst goes from one hypothesis to another, the DSS analyst needs to hold the data constant. The DSS analyst changes the hypothesis and reruns the analysis. As long as the data is held constant, the DSS analyst knows that the changes in results from one iteration to another are due to the changes in the hypothesis. In other words, the data needs to be held constant if iterative development is to proceed on a sound basis.
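The point can be sketched by evaluating several hypotheses against one frozen snapshot of data; because the snapshot never changes, any difference in results comes only from the hypothesis. The records and hypotheses below are hypothetical:

```python
# A frozen snapshot of warehouse data: (customer, age, annual_spend).
# It is never updated during analysis, so differing results can only
# come from differing hypotheses. All records are hypothetical.
snapshot = (
    ("cust1", 34, 250.0),
    ("cust2", 61, 900.0),
    ("cust3", 45, 300.0),
    ("cust4", 29, 120.0),
)

def evaluate_hypothesis(snapshot, predicate):
    """Return the fraction of records satisfying the hypothesis,
    evaluated against the same unchanging data every time."""
    hits = sum(1 for row in snapshot if predicate(row))
    return hits / len(snapshot)

# hypothesis #1: the pattern holds for older customers
h1 = lambda row: row[1] > 40
# hypothesis #2: the pattern holds for bigger spenders
h2 = lambda row: row[2] > 200.0
```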

Consider what happens when data is being updated, as seen in Figure 1.14.

In Figure 1.14, updates are regularly being made to the data at the same time that heuristic analysis is being done. The changes in the results the DSS analyst sees are always questionable. The DSS analyst never knows whether the changes in results are a function of the changing of the hypothesis, the changing of the underlying data, or some combination of both. The first reason why the data warehouse provides such a sound basis for data mining and data exploration is that the foundation of data is not constantly changing. But that reason, as important as it is, is not the only reason why the data warehouse provides such a good basis for data mining.

Figure 1.13: because data in the data warehouse is static and is not updated, the DSS analyst can change the hypothesis (hypothesis #1, #2, #3, #4) and determine the results without worrying about the effect of changing data.

Figure 1.14: in the operational environment where update is occurring, the DSS analyst never knows whether the different results that are achieved are because of the changing hypothesis or because of changes that have occurred to the underlying data.


Perhaps the most compelling case that can be made for a data warehouse as a basis for data mining and data exploration is that the data in the warehouse is integrated. The very essence of the data warehouse is that of integration. To properly build a data warehouse requires an arduous amount of work because data is not merely "thrown into" a data warehouse. Instead, data is carefully and willfully integrated as it is placed into the warehouse.

Figure 1.15 shows the integration that occurs as data is placed into the warehouse.

When the DSS analyst wants to operate from a non-data-warehouse foundation, the first task the DSS analyst faces is that of having to "scrub" and integrate the data. This is a daunting task and holds up progress for a lengthy amount of time. But when a DSS analyst operates on a data warehouse, the data has already been scrubbed and integrated. The DSS analyst can get right to work with no major delays.

Stability of data and integration are not the only reasons why the data warehouse forms a sound foundation for data mining and exploration. The rich amount of history is another reason why the data warehouse invites data mining. Figure 1.16 shows that one of the characteristics of the data warehouse is that of a robust amount of data.

Figure 1.15: integration (scrub, reconcile, restructure, convert, change dbms, change operating systems, summarize, merge, supply default values) occurs as data is placed into the warehouse. Without a data warehouse, there is a need for a massive integration effort before the process of data mining can commence.

Figure 1.16: the data warehouse carries years of history (1985, 1986, 1987, 1988 ... 1996). Without a data warehouse, there is a need to go back and find and reconstruct massive amounts of historical data.


If the DSS analyst attempts to go outside of the data warehouse to do data mining, the DSS analyst faces the task of locating historical data. For a variety of reasons, historical data is difficult to gather. Some of those reasons are:

■ historical data gets lost,
■ historical data is placed on a medium that does not age well and the data becomes impossible to physically read,
■ the metadata that describes the content of historical data is lost and the structure of the historical data becomes impossible to read,
■ the context of the historical data is lost,
■ the programs that created and manipulated the historical data become misplaced,
■ the versions of the dbms that the historical data is stored under become out of date and are discarded, and so forth.

There are many challenges facing the DSS analyst in trying to go backward in time and reclaim historical data. The data warehouse, on the other hand, has the historical data neatly and readily available.

Another reason why the data warehouse sets the stage for effective data mining and data exploration is that the data warehouse contains both summary and detail data, as illustrated in Figure 1.17.

The DSS analyst is able to immediately start doing effective macro analysis of data using the summary data found in the data warehouse. If the DSS analyst must start from the foundation of operational or legacy data, there often is very little summary data. Only detailed data resides (to any great extent) in the operational or legacy environment. Therefore, the DSS analyst can do only micro analysis of data when there is no data warehouse, and starting with micro analysis of data in the data mining adventure is risky at best.

Figure 1.17: with a data warehouse there is a rich supply of both detailed and summary data.


Figure 1.18 shows the ability of the data warehouse to support both micro and macro data mining and data exploration activities.

Analyzing Historical Data

There is a notion that history is not a valid source for analysis of trends. The idea is that today's business is different from yesterday's business, so looking at yesterday's data tells only about the past, not the future. Indeed, if a company goes through a radical change in business, then yesterday's data does not point the way to tomorrow's opportunity. However, radical business change is not the way that most businesses conduct affairs. While external aspects of a business do change frequently, the fundamentals of the business remain constant.

History is required to be able to measure and fathom the economic cycles to which all businesses are subject. You cannot understand business cycles by looking at this month's or this quarter's data. Historical data is the underpinning of understanding the long-term trends in which the business is engulfed.

Figure 1.19 shows the basis of history for the understanding of long-term trends.

Figure 1.18: the different levels of summarization and detail (macro analysis and micro analysis) are supported by the existence of both summarized and detailed data in the warehouse.


In addition to history being the key to comprehending long-term business cycles, history is also the key to understanding the seasonality of sales and other business activity. These long-term trends are best understood in terms of summary data, not detailed data.

Historical data can be used at the detailed, micro level as well. Historical data can be used to track the pattern of activity of individual consumers. Figure 1.20 shows the usage of detailed historical data.

Individuals are creatures of habit. The likelihood that buying habits, consumption habits, recreational habits, lifestyle habits, and so forth will be the same tomorrow as they were today is a valid assumption for most people, once the individual's habits are formed in the first place. It is very unusual for an individual to make a great departure from a lifelong habit once that habit has been established. Therefore, on an individual basis, looking at the history of consumption and other activities for an individual is an excellent predictor of what the future habits of the individual will be.

Volumes of Data — The Biggest Challenge

The single largest challenge the DSS analyst has in doing effective data mining and data exploration is that of coming to grips with the volumes of data that accompany the data warehouse, or the population of data that faces the DSS analyst. There inevitably are masses and masses of data. The cost of manipulating, storing, understanding, scanning, loading, etc. is enormous. Even the simplest activity becomes painful in the face of the large volumes that await the DSS analyst.

Figure 1.19: at the macro level of analysis, history (summary data across Jan, Apr, Jul, Oct) is needed to understand the larger business cycles, such as seasonality and industrial cyclicity.

Figure 1.20: at a micro level, detailed history is important because consumers are creatures of habit. It is very unusual for a consumer to undergo a massive change of habits. Therefore, the consumption habits of the past are an excellent indicator of future habits of consumption.


Figure 1.21 shows that piles of data await the DSS analyst as the first challenge in doing data mining.

There are many problems with the volumes of data the DSS analyst must maneuver through. But the most insidious problem is that the volumes of data mask the important relationships and correlations that the DSS analyst is looking for. Figure 1.22 shows that hiding among the occurrences of data are the important patterns that the DSS analyst is seeking.

Figure 1.21: the biggest challenge of data exploration/data mining is drowning in a sea of data.

Figure 1.22: large volumes of data hide important relationships.


There are other problems with trying to ferret out patterns of activity and behavior from volumes of data. There is the problem of "false positives" occurring when there is a massive amount of data. A "false positive" is a relationship that is valid, but is only randomly valid. Figure 1.23 shows the problem of "false positives" arising with a large amount of data.

False positives occur simply because there is so much data. A false positive is a mathematically accurate pattern which has no basis in the business itself. When there is a lot of data, it is impossible for there not to be false positives. Given enough data, correlations will begin to appear simply because there is so much data, and for no other reason. Therefore, the DSS analyst needs always to interject the sensibility of business into the equation. Just because the numbers tell a story is no indication that the story has a basis in the business itself.
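The false-positive effect is easy to demonstrate with synthetic data. In the sketch below, every column is pure noise with no relationship to the target, yet with a thousand columns some correlate strongly with the target by chance alone.

```python
import random

# Synthetic demonstration of false positives: all columns are random
# noise, so any strong correlation with the target is purely accidental.
random.seed(42)

def pearson(xs, ys):
    # Pearson correlation coefficient, computed from first principles
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

target = [random.random() for _ in range(30)]   # a stand-in "business measure"
noise = [[random.random() for _ in range(30)] for _ in range(1000)]  # 1,000 noise columns

# count noise columns whose correlation with the target exceeds 0.4
false_positives = sum(1 for col in noise if abs(pearson(target, col)) > 0.4)
print(false_positives)  # some columns correlate strongly by chance alone
```

The count is nonzero every time there is enough data, which is exactly the author's point: the mathematics alone cannot distinguish a business-driven correlation from an accidental one.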

The task of the DSS analyst is to find the needle in the haystack: the place where there is a mathematical basis for a correlation and where there is also a business basis as well. Figure 1.24 shows an analyst looking for the needle.

Figure 1.23: with enough data, there are bound to be some "false positives" that are detected. Even though the correlation is real and is mathematically valid, the business basis is nonexistent.

Figure 1.24: finding useful business patterns is like finding the needle in the haystack.


Sampling Techniques

One of the most effective ways of dealing with the massive volumes of data found in the data warehouse is to use sampling techniques. By using a sample, the DSS analyst can significantly reduce the overhead and cost of doing analysis against the data. In addition, the DSS analyst can reduce the turnaround time required for any pass against the data. Figure 1.25 suggests the importance of sampling.

While there are many advantages to using a sample for analysis of data, there are some disadvantages. Some of the disadvantages are:

■ Whenever a sample is chosen, there will be bias introduced. In the best of cases the bias will be a known factor and will not unduly influence the analytical work done against the data.
■ The sample needs to be periodically refreshed.
■ The sample cannot be used for general-purpose processing.
■ The sample cannot produce accurate results, only approximations, and so forth.

For all of the factors that limit the usefulness of sampling, the advantages of sampling far outweigh the disadvantages insofar as the work of the DSS analyst is concerned.

The first and most basic question the DSS analyst faces upon beginning to use samples for data mining and data exploration is what kind of sample should be chosen. A simple technique is to choose a random sample, as shown in Figure 1.26.

Figure 1.25: one way to weed through the volume of data is to select a sample of data and do analysis on the sample.


A random sample can be chosen by simply marching through the data warehouse in a sequential manner and selecting every nth record. Inevitably there will be some bias in using this technique, but as long as the bias is not too severe, the technique works just fine.
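In code, the every-nth-record technique amounts to a simple slice of the sequentially ordered records. The warehouse rows below are invented stand-ins for real warehouse records.

```python
# A sketch of systematic (every-nth-record) sampling. The record layout
# is hypothetical; any sequentially readable table works the same way.
def every_nth(records, n):
    """March through the records sequentially, keeping every nth one."""
    return records[::n]

warehouse = [{"id": i, "amount": i * 10} for i in range(1, 101)]

sample = every_nth(warehouse, 10)
print(len(sample))  # 10 of the 100 records are kept
```

Note the bias the text warns about: if the records carry any periodicity that lines up with n (for example, one record per day sampled every 7th record), the sample will systematically over-represent part of the population.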

One of the problems of choosing a random sample of data on which to do data mining and data exploration is that it is still possible to get false positives. It is true that fewer false-positive correlations will appear using a random sample than if the full population were used for analysis. But false positives are not eliminated by the selection of random samples.

Judgment Samples

An appealing alternative to the selection of a random sample is that of choosing a nonrandom sample (or a "judgment sample"). Figure 1.27 shows the selection of a judgment sample of data for the purpose of doing data mining and data exploration.

Figure 1.26: a randomly chosen sample often yields random positive patterns, which may or may not be relevant to the business equation. Choosing a random sample is a good way to start general exploration.

Figure 1.27: a nonrandom sample is a good way to start a refined, directed analysis. More powerful conclusions can be reached more quickly starting with a nonrandom sample.


The choice of the data entering the judgment sample can be a powerful tool for the DSS analyst. The DSS analyst can use the selection of the data that goes into the judgment sample in order to qualify the correlations that will be made. In other words, the DSS analyst can look at a selected subset of data and find the correlations that apply only to that subset. In doing so, the DSS analyst can start to create profiles for the different subpopulations of the corporation.
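A minimal sketch of building a judgment sample: rather than drawing rows at random, the analyst qualifies the data by business criteria. The customer rows and the criteria used here are invented for illustration.

```python
# Sketch of a judgment sample: keep only rows matching chosen business
# criteria. The field names and values are hypothetical.
customers = [
    {"id": 1, "region": "west", "age": 34},
    {"id": 2, "region": "east", "age": 61},
    {"id": 3, "region": "west", "age": 29},
    {"id": 4, "region": "west", "age": 58},
]

def judgment_sample(rows, **criteria):
    """Keep only the rows for which every predicate in criteria holds."""
    return [r for r in rows
            if all(pred(r[field]) for field, pred in criteria.items())]

# one possible qualification: western customers under 40
subset = judgment_sample(customers,
                         region=lambda v: v == "west",
                         age=lambda v: v < 40)
print([r["id"] for r in subset])  # → [1, 3]
```

Any correlation found in `subset` is, as the text stresses, a statement about that subpopulation only, not about the corporation at large.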

The judgment sample can be made along any lines the DSS analyst deems useful. Figure 1.28 shows some of the many different ways the DSS analyst can choose a judgment sample.

On the one hand, the DSS analyst can use the judgment sample in order to qualify the data and, in doing so, start to understand the different subpopulations. On the other hand, the DSS analyst can be limited by the judgment sample as to the conclusions that can be drawn and the patterns that are discovered. Any pattern that is discovered must always be qualified by stating that the pattern applies only to the subpopulation represented by the judgment sample.

One technique to overcome the limitations of selecting a judgment sample is to analyze the judgment sample, then to reanalyze the base population of data once the pattern has been observed. Figure 1.29 shows this technique.

One of the interesting uses of this technique is to measure the difference between the strength of the correlation against the general population versus the strength of the correlation against the judgment sample. In some cases the correlation will be very strong using the judgment sample data, and weak or nonexistent using the general population. Knowing that a correlation applies only to a subset of a population is a fact that can be used quite effectively in achieving a position of marketplace exploitation.
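This measurement can be sketched with synthetic data: in the qualified subset below, two measures move together perfectly, while across the base population the relationship is much weaker. All rows and field meanings are invented.

```python
# Sketch: compare correlation strength on a judgment sample versus the
# base population. The (segment, promotions_seen, purchases) rows are
# synthetic illustrations.
def pearson(xs, ys):
    # Pearson correlation coefficient, computed from first principles
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

population = [
    ("young", 1, 2), ("young", 2, 4), ("young", 3, 6), ("young", 4, 8),
    ("older", 1, 5), ("older", 2, 3), ("older", 3, 6), ("older", 4, 2),
]
sample = [row for row in population if row[0] == "young"]   # judgment sample

r_sample = pearson([r[1] for r in sample], [r[2] for r in sample])
r_population = pearson([r[1] for r in population], [r[2] for r in population])
print(round(r_sample, 2), round(r_population, 2))
```

The gap between the two coefficients is the interesting number: a correlation that is strong only within the subset is precisely the kind of fact the text says can be exploited in the marketplace.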

Figure 1.28: there are many criteria on which to select the data: by time, by geography, by customer demographics, by data qualification, or by multiple criteria.

Figure 1.29: the conclusions that are drawn from the analysis of a sample are applicable ONLY to the sample. In order to generalize the conclusions, the hypothesis needs to be run against the general population.


In the same vein, a correlation may exhibit one degree of strength against one judgment sample and a different degree of strength against another sample, as seen in Figure 1.30.

This difference in degrees of correlation against different sample bases is the basis for doing "bull's eye" marketing, in which one demographic group is singled out for exploitation.

Refreshing Sample Data

While it is necessary to keep the base of data stable in doing heuristic analysis, it is also necessary to periodically refresh the data in order to capture recently emerging trends. Figure 1.31 shows that periodic refreshment must be done.

Fortunately, when looking at long-term trends, it is not necessary to refresh the data on which analysis is being done very frequently. Depending on the data and the analysis being done on the data, it may be necessary to refresh the data no more frequently than every quarter or so.

Figure 1.30: the sample of data that has been chosen (sample A, sample B, sample C) may well exhibit patterns of correlation other samples do not. Stated differently, some patterns of data can be detected when looking at a sample that cannot be detected when looking at the population in general. Asking the question "what is different about this sample versus another sample" can be a very enlightening thing to do.

Figure 1.31: periodic refreshment of the sample is required in order to keep the sample as up to date as possible.


The general approach to the usage of sampling for data mining and data exploration is summarized in Figure 1.32.

Using Summary Data

Undoubtedly, using sampling techniques is the most popular and the most effective approach to handling the volumes of data that arise in the world of data warehousing and data mining. The use of sampling should be in the bag of tricks of every data warehouse administrator. But there are other effective ways to discover patterns of interest to the DSS analyst beyond the usage of sampling techniques. Indeed, the techniques that will be discussed can be used in conjunction with sampling techniques quite handily.

Using summary data is a very useful technique for finding patterns of interest to the DSS analyst. At first the usage of summary data may seem contradictory because, by definition, summary data does not contain detailed data, and where there is no detailed data, it is questionable how interesting patterns are to be located. But summary data can act as a "water witch." In the old West, when it was desired to find water, a pioneer cut a forked stick from a tree and set about "dowsing." The pioneer would walk about holding the forked twig in front of him and would stop when the twig "dipped." It was at this point that the water witch would declare that the place where the twitching of the forked stick occurred was the place to dig for water. In much the same vein, the DSS analyst can use summary data as a forked twig to determine where to look for interesting business patterns. The difference is that there is a rational basis for using summary data as an indicator of where to look for nuggets of information.

Figure 1.32: a typical sequence of events: (1) the base population has a sample created; (2) the sample is iteratively hypothesized and tested; (3) a pattern is discovered; (4) the strength and validity of the pattern is tested against the base population.


As an example of how summary data can be used, consider the diagram shown in Figure 1.33.

Figure 1.33 shows a simple summary chart over a period of time.

There are several obvious places where the DSS analyst might start to look for interesting correlations of data:

■ at the bottom of a trough, where the trend bottomed out,
■ at the top of a peak, where the trend reached a new high,
■ at the tops of several peaks, to see if a consistent trend is forming, and so forth.

In short, the simple summary graph suggests many likely places to start to look for trends. Summary data can then be used to find the needle in the haystack, or at least to indicate where in the haystack is a productive place to look for needles.
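The dowsing idea can be sketched in a few lines: scan a summary series and flag its peaks and troughs as candidate places to dig into the detail. The monthly figures and the simple turning-point rule below are invented for illustration.

```python
# Sketch: use a summary series as a "water witch" by flagging the
# turning points worth investigating in the detailed data. The monthly
# sales figures are synthetic.
monthly_sales = [100, 120, 90, 95, 150, 140, 130, 170, 160, 80, 110, 115]

def turning_points(series):
    """Return (index, kind) for each local peak or trough."""
    points = []
    for i in range(1, len(series) - 1):
        if series[i] > series[i - 1] and series[i] > series[i + 1]:
            points.append((i, "peak"))
        elif series[i] < series[i - 1] and series[i] < series[i + 1]:
            points.append((i, "trough"))
    return points

for month, kind in turning_points(monthly_sales):
    print(f"month {month}: {kind} -- worth examining the detail here")
```

Each flagged month is a place to ask the Figure 1.33 questions (why a peak here? why a valley there?) against the detailed data, rather than scanning the whole detail blindly.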

But a simple graph is only one way that summary data can be used. Figure 1.34 shows that a very useful way to analyze summary data is to look at two summarizations and compare them against each other.

Figure 1.33: looking at summary data can lead to interesting observations, which in turn can lead to the discovery of patterns. Why was there a peak here? Why was there a valley here? Is this a believable long-term trend? Why was there such a falloff?

Figure 1.34: comparisons of different kinds of summary data (total industry sales for the year versus our company's annual sales) can be very elucidating. Patterns may emerge that would otherwise go undetected.


In the case of Figure 1.34, total industry numbers are being compared to numbers generated inside the corporation. By comparing and contrasting these two sets of numbers, many interesting observations can be made that might otherwise escape the attention of the DSS analyst. Figure 1.35 shows some of the things the alert DSS analyst might see.

The alert DSS analyst looks for places where the company's trend runs with or against the industry average. At these points, the DSS analyst is tempted to ask: what are we doing differently than the rest of the industry? When we are succeeding at a faster rate than everyone else, what are we doing right? And when we are failing at a faster rate than the rest of the industry, what are we doing wrong?
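One hedged sketch of this comparison, with invented index numbers: compute period-over-period growth for the company and for the industry, and flag the periods where the two diverge by more than a threshold. Those flagged periods are where the "what are we doing differently?" questions get asked.

```python
# Sketch: flag periods where our growth diverges sharply from the
# industry's. All index values and the threshold are invented.
industry = [100, 105, 110, 112, 115, 118]   # industry sales index
company  = [100, 104, 110, 111, 125, 128]   # our company's sales index

def divergences(ours, theirs, threshold=0.05):
    """Return the periods where our growth rate differs from the
    industry's growth rate by more than the threshold."""
    flagged = []
    for q in range(1, len(ours)):
        our_growth = ours[q] / ours[q - 1] - 1
        ind_growth = theirs[q] / theirs[q - 1] - 1
        if abs(our_growth - ind_growth) > threshold:
            flagged.append(q)
    return flagged

print(divergences(company, industry))  # periods worth asking "why?"
```

With these invented numbers, only one period stands out, which is the point of the technique: the comparison narrows the search to a handful of places worth examining in the detail.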

The contrast with other summary numbers can provide an excellent place to begin closely examining detailed data in search of interesting correlations.

Intuition

But summary data is not the only place to start. A very basic place to start, which is often overlooked, is to trust the intuition of the DSS analyst. In many cases, when a DSS analyst sees data, the analyst will have a feeling that something interesting is going on that the analyst cannot put an immediate finger on. Something speaks out to the DSS analyst to look further. Figure 1.36 shows that in some cases the DSS analyst will know where to look more deeply for correlations even though the DSS analyst cannot tell you exactly why a deeper search is warranted.

Figure 1.35: the kinds of observations that come from comparing industry averages (total industry sales for the year) to internal numbers (our company's annual sales), and the questions that can be raised: (a) at this point we are selling significantly higher than the industry average, why? (b) at this point we are selling significantly lower than the industry average, why? (c) at this point we are selling significantly higher than the industry average, why? The questions lead to the discovery of patterns, or at least to the ability to ask the questions that will lead to the discovery of patterns.


The DSS analyst will not be correct in every case. But in some cases the DSS analyst can cut through masses of data simply by trusting his/her intuition.

Further Subdividing Data

Another technique (which is really a variation of sampling) is to further subdivide data so that only relevant fields appear in the data that is being studied. Figure 1.37 illustrates this technique.

In Figure 1.37 the DSS analyst has started the analysis by stripping out only two variables of data. In doing so, there is much less data to be managed. The stripped-out data can be efficiently stored and analyzed.

As long as the DSS analyst wants to look ONLY at the data that has been stripped out, there is no problem. The analysis can proceed smoothly. But the minute the DSS analyst wishes to correlate the stripped-out data with the other data it was originally stored with, the limitations of this approach become manifest. Once the stripped data is removed from its source, it cannot easily be correlated with any other data that it is related to.
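The stripping technique can be sketched as a simple projection: keep only the chosen fields from each wide row. The row layout and field names below are hypothetical.

```python
# Sketch of stripping out two variables from wide warehouse rows so
# that far less data has to be handled. Field names are invented.
rows = [
    {"cust": 1, "region": "west", "age": 34, "balance": 1200.0, "visits": 9},
    {"cust": 2, "region": "east", "age": 61, "balance": 300.0, "visits": 2},
    {"cust": 3, "region": "west", "age": 29, "balance": 950.0, "visits": 7},
]

def strip_out(rows, fields):
    """Keep only the named fields from each row."""
    return [tuple(r[f] for f in fields) for r in rows]

pairs = strip_out(rows, ("age", "visits"))
print(pairs)  # → [(34, 9), (61, 2), (29, 7)]
```

Note that the projection above deliberately drops the `cust` key, which is exactly the limitation the text describes: once stripped, the pairs cannot easily be joined back to the fields they were originally stored with.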

Figure 1.36: "this just seems like a good place to start digging; I can just feel the gold under here." The intuition of an experienced DSS analyst is always a good place to start looking for nuggets.

Figure 1.37: one way to achieve a different perspective on data is to look at subsets of attributes rather than at the entire row of data. This technique cuts down on the volume of data that must be handled while creating a basis for analysis.


In the same vein, another way to achieve a unique perspective on data is to look at the data by type of data, rather than by actual occurrence. Figure 1.38 shows this approach.

Figure 1.38 shows that data is grouped into classes. Once the data is grouped into classes, the DSS analyst does studies on the classes of data rather than on the raw data itself. In such a fashion, data can be studied and managed efficiently and succinctly.
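A sketch of the classify-and-count technique: assign each occurrence to a class and compare the class counts rather than the raw occurrences. The account balances and class boundaries are invented for illustration.

```python
from collections import Counter

# Sketch: categorize raw occurrences into classes and count the
# occurrences in each class. Balances and boundaries are synthetic.
balances = [120, 480, 950, 1500, 75, 2600, 310, 990, 4000, 220]

def classify(amount):
    if amount < 250:
        return "type A"   # small accounts
    if amount < 1000:
        return "type B"   # mid-sized accounts
    return "type C"       # large accounts

counts = Counter(classify(b) for b in balances)
print(dict(counts))
```

As the figure caption notes, the interesting results come from how the categorization is done: different class boundaries produce different distributions, and comparing the counts across classes can surface patterns the raw data hides.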

[This Tech Topic is Part 1 in the series on data mining and data exploration. In this Tech Topic we have explored the origins of data mining, the considerations of the infrastructure of data mining, and the activities of exploration. Part 2 of this Tech Topic will explore how to exploit the patterns discovered in the exploration.]

Figure 1.38: another good way to achieve an interesting perspective on data is to categorize data into classes (type A, type B, type C) and count the occurrences of data in each class. Depending on how the categorization is done, the comparison of the number of occurrences that reside in each class can produce very interesting results.