12
Trends & Controversies Domain-Driven Data Mining: A Framework Longbing Cao and Chengqi Zhang, University of Technology, Sydney Data mining increasingly faces complex challenges in the real-life world of business problems and needs. 1 The gap be- tween business expectations and R&D results in this area in- volves key aspects of the field, such as methodologies, tar- geted problems, pattern interestingness, and infrastructure support. Both researchers and practitioners are realizing the importance of domain knowledge to close this gap and de- velop actionable knowledge for real user needs. What is domain-driven data mining? Domain-driven data mining generally targets actionable knowledge discovery in complex domain problems. 2 It aims first to utilize and mine many aspects of intelligence— for example, in-depth data, domain expertise, and real-time human involvement as well as process, environment, and social intelligence. It metasynthesizes its intelligence sources for actionable knowledge discovery. To achieve this metasynthesis, domain-driven data mining must develop knowledge actionability, enhance knowledge reliability, and interact with methodologies and systems that support exist- ing business uses. Domain-driven data mining works to expose next- generation methodologies for actionable knowledge discovery, identifying how KDD can better contribute to critical domain problems in theory and practice. It un- covers domain-driven techniques to help KDD strengthen business intelligence in complex enterprise applications. It discloses applications that effectively deploy domain- driven data mining to solve complex practical problems. It also identifies challenges and directions for future R&D in the dialogue between academia and business to achieve seamless migration into business world. Determining what knowledge to pursue requires both technical and business interests to instantiate both objec- tive and subjective factors. 3 Actionable knowledge discov- ery should fit the following framework: x X, P: x.tech_obj(P) x.tech_subj(P) x.biz_obj(P) biz_subj(P) act(P) where P indicates pattern interestingness from not only technological and business viewpoints but also objective and subjective perspectives. 4 Why do we need it? Traditional KDD is a data-driven trial-and-error process that targets automated hidden knowledge discovery. Re- searchers commonly use it to let data create and verify their innovations or to develop and demonstrate the use of novel algorithms and methods in discovering knowledge of re- search interest. In business, however, KDD must support commercial actions. In addition to business rules, policies, and so on, it Domain-Driven, Actionable Knowledge Discovery Longbing Cao, University of Technology, Sydney 78 1541-1672/07/$25.00 © 2007 IEEE IEEE INTELLIGENT SYSTEMS Published by the IEEE Computer Society The complexities of real-world domain problems pose great chal- lenges in the knowledge discovery and data mining (KDD) field. For example, existing technologies seldom deliver results that businesses can act on directly. In this issue of Trends & Controversies, seven short articles report on different aspects of domain-driven KDD, an R&D area that targets the development of effective methodologies and techniques for delivering actionable knowledge in a given domain, especially business. To begin, my colleague Chengqi Zhang and I propose a frame- work that boosts data mining capabilities and dependability by synthesizing domain-related intelligence into the development KDD process. In the second article, Qiang Yang looks at ways of getting data mining output models to correspond to actions. The framework he proposes takes traditional KDD output as input to a process that, in turn, generates actionable output. The third and fourth articles describe different aspects of visuali- zation for clarifying data patterns and meaning. David Bell describes two video and vision system applications that strengthen data mining practice in general. Michail Vlachos, Bahar Taneri, Eamonn Keogh, and Philip S. Yu present a simple, fast method for visualiz- ing DNA transcripts to enhance data mining utility. Human intelligence plays important roles in actionable data min- ing. Ning Zhong proposes a methodology for processing multiple data sources in a systems approach to brain studies. Privacy imposes a strong constraint on data mining. Mafruz Zaman Ashrafi and David Taniar review the issues and techniques for preserving privacy in aggregated data mining. Finally, Eugene Dubossarsky and Warwick Graco describe four aspects of moving data mining from a method-driven approach to a process that focuses on domain knowledge. —Longbing Cao

Domain-Driven, Actionable Knowledge Discoveryqyang/Docs/2007/action.pdf · Trends & Controversies Domain-Driven Data Mining: A Framework Longbing Cao and Chengqi Zhang, University

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Domain-Driven, Actionable Knowledge Discoveryqyang/Docs/2007/action.pdf · Trends & Controversies Domain-Driven Data Mining: A Framework Longbing Cao and Chengqi Zhang, University

T r e n d s & C o n t r o v e r s i e s

Domain-Driven Data Mining: A FrameworkLongbing Cao and Chengqi Zhang, University of Technology, Sydney

Data mining increasingly faces complex challenges in thereal-life world of business problems and needs.1 The gap be-tween business expectations and R&D results in this area in-volves key aspects of the field, such as methodologies, tar-geted problems, pattern interestingness, and infrastructuresupport. Both researchers and practitioners are realizing theimportance of domain knowledge to close this gap and de-velop actionable knowledge for real user needs.

What is domain-driven data mining?Domain-driven data mining generally targets actionable

knowledge discovery in complex domain problems.2 Itaims first to utilize and mine many aspects of intelligence—for example, in-depth data, domain expertise, and real-timehuman involvement as well as process, environment, andsocial intelligence. It metasynthesizes its intelligencesources for actionable knowledge discovery. To achieve thismetasynthesis, domain-driven data mining must developknowledge actionability, enhance knowledge reliability, andinteract with methodologies and systems that support exist-ing business uses.

Domain-driven data mining works to expose next-generation methodologies for actionable knowledgediscovery, identifying how KDD can better contribute tocritical domain problems in theory and practice. It un-covers domain-driven techniques to help KDD strengthenbusiness intelligence in complex enterprise applications.It discloses applications that effectively deploy domain-driven data mining to solve complex practical problems.It also identifies challenges and directions for futureR&D in the dialogue between academia and business toachieve seamless migration into business world.

Determining what knowledge to pursue requires bothtechnical and business interests to instantiate both objec-tive and subjective factors.3 Actionable knowledge discov-ery should fit the following framework:

∀ x � X, � P: x.tech_obj(P) � x.tech_subj(P) �x.biz_obj(P) � biz_subj(P) → act(P)

where P indicates pattern interestingness from not onlytechnological and business viewpoints but also objectiveand subjective perspectives.4

Why do we need it?Traditional KDD is a data-driven trial-and-error process

that targets automated hidden knowledge discovery. Re-searchers commonly use it to let data create and verify theirinnovations or to develop and demonstrate the use of novelalgorithms and methods in discovering knowledge of re-search interest.

In business, however, KDD must support commercialactions. In addition to business rules, policies, and so on, it

Domain-Driven, ActionableKnowledge Discovery

Longbing Cao, University of Technology, Sydney

78 1541-1672/07/$25.00 © 2007 IEEE IEEE INTELLIGENT SYSTEMSPublished by the IEEE Computer Society

The complexities of real-world domain problems pose great chal-lenges in the knowledge discovery and data mining (KDD) field. Forexample, existing technologies seldom deliver results that businessescan act on directly. In this issue of Trends & Controversies, seven shortarticles report on different aspects of domain-driven KDD, an R&Darea that targets the development of effective methodologies andtechniques for delivering actionable knowledge in a given domain,especially business.

To begin, my colleague Chengqi Zhang and I propose a frame-work that boosts data mining capabilities and dependability bysynthesizing domain-related intelligence into the developmentKDD process.

In the second article, Qiang Yang looks at ways of getting datamining output models to correspond to actions. The framework heproposes takes traditional KDD output as input to a process that,in turn, generates actionable output.

The third and fourth articles describe different aspects of visuali-zation for clarifying data patterns and meaning. David Bell describestwo video and vision system applications that strengthen datamining practice in general. Michail Vlachos, Bahar Taneri, EamonnKeogh, and Philip S. Yu present a simple, fast method for visualiz-ing DNA transcripts to enhance data mining utility.

Human intelligence plays important roles in actionable data min-ing. Ning Zhong proposes a methodology for processing multipledata sources in a systems approach to brain studies.

Privacy imposes a strong constraint on data mining. Mafruz Zaman Ashrafi and David Taniar review the issues and techniquesfor preserving privacy in aggregated data mining.

Finally, Eugene Dubossarsky and Warwick Graco describe fouraspects of moving data mining from a method-driven approach toa process that focuses on domain knowledge. —Longbing Cao

Page 2: Domain-Driven, Actionable Knowledge Discoveryqyang/Docs/2007/action.pdf · Trends & Controversies Domain-Driven Data Mining: A Framework Longbing Cao and Chengqi Zhang, University

must take into account such real-world phe-nomena as evolving scenarios, constrainedenvironments, runtime mining, and distrib-uted and heterogeneous data sources. It mustsupport business requirements for trustwor-thy, reliable, and cost-effective performance.It must also find ways to integrate humanintelligence seamlessly in its processes.

Key issuesOperating on top of a data-driven frame-

work, domain-driven data mining aims todevelop specific methodologies and tech-niques for dealing with these business com-plexities. This goal can mean developinggeneric frameworks, domain-specific ap-proaches, or both. Other key issues include

• domain-driven project management;• actionable knowledge discovery

frameworks; • capturing, representing, and using net-

work intelligence; • mining in-depth patterns and deep data

intelligence; and• balancing the conflicts between techni-

cal performance and business interest.

Table 1 summarizes the differences be-tween data-driven and domain-driven datamining.

ApplicationsWe’ve used domain-driven data mining in

real-world trade-support assignments andfor analyzing exceptional behavior in gov-ernment social security overpayments.4,5 Inone case, we integrated domain intelligenceinto the automated construction of activitysequences in government-customer con-tacts. We also discovered high-impact

activity-sequence patterns associated withgovernment-customer debt and identifiedcustomers and events that would likely pre-vent or recover debt. To this end, we devel-oped both technical and business measuresfor patterns relevant to these issues in real,unbalanced social security data. For in-stance, through metasynthesizing intel-ligence sources, we get the following inter-estingness of an activity sequence associatedwith the occurrence of government debt:

• Technical interestingness: support =0.01251, confidence = 0.60935, and lift = 1.2187;

• Business interestingness:, the averaged debt

amount in cents of those debt-relatedactivity sequences supporting the rule;and , the averaged debtduration in days of those debt-relatedactivity sequences supporting the rule.

These measures tell us that the selectedpattern has not only technical interest butalso business impact on debt amount andduration.

ConclusionDomain-driven KDD represents a para-

digm shift from a research-centered disci-pline to a practical tool for actionableknowledge. Despite many open issues, de-ployed systems are already showing waysto transmit reliable research in forms thatsatisfy business needs with direct supportfor decisions.

References

1. U. Fayyad, G. Shapiro, and R. Uthurusamy,

“Summary from the KDD-03 Panel—DataMining: The Next 10 Years,” ACM SIGKDD

Explorations Newsletter, vol. 5, no. 2, 2003,pp. 191–196.

2. L. Cao and C. Zhang, “Domain-Driven DataMining: A Practical Methodology,” Int’l J.Data Warehousing and Mining, vol. 2, no. 4,2006, pp. 49–65.

3. A. Silberschatz and A. Tuzhilin, “What MakesPatterns Interesting in Knowledge DiscoverySystems?” IEEE Trans. Knowledge and DataEng., vol. 8, no. 6, 1996, pp. 970–974.

4. L. Cao and C Zhang, “The Evolution of KDD:Towards Domain-Driven Data Mining,” Int’lJ. Pattern Recognition and Artificial Intelli-gence, vol. 21, no. 4, 2007, pp. 677–692.

5. L. Cao, Y. Zhao, and C. Zhang, “MiningImpact-Targeted Activity Patterns in Imbal-anced Data,” to be published in IEEE Trans.Knowledge and Data Eng.

Learning Actions from Data Mining ModelsQiang Yang, Hong Kong University of Science and Technology

Data mining and machine learning algo-rithms aim mostly at generating statisticalmodels for decision making. There are manytechniques for computing statistical modelsfrom data: Bayesian probabilities, decisiontrees, logistic and linear regression, kerneland support-vector machines, and cluster andassociation rules, among others.1,2 Mosttechniques represent algorithms that summa-rize training-data distributions in one way oranother. Their output models are typicallymathematical formulas or classificationresults describing test data. In other words,they’re data centric.

d dur_ .( ) = 15 5

d amt_ ,( ) = 29 526

JULY/AUGUST 2007 www.computer.org/intelligent 79

Table 1. Data-driven versus domain-driven data mining.

Aspects Traditional data-driven Domain-driven

Object mined Data tells the story Data and domain tell the story

Aim Develop innovative approaches Generate business impacts

Objective Algorithms are the focus Solving business problems is the target

Data set Mining abstract and refined data sets Mining constrained real-life data

Extendability Predefined models and methods Ad hoc, runtime, and personalized model customization

Process Data mining is an automated process Humans are integral to the data mining process

Evaluation Based on technical metrics Based on actionable options

Accuracy Results reflect solid theoretical computation Results reflect complex context in a kind of artwork

Goal Let data create and verify research innovation; Let data and metasynthetic knowledge tell the hidden demonstrate and push novel algorithms business story; discover actionable knowledge to satisfy to discover knowledge of research interest real user needs

Page 3: Domain-Driven, Actionable Knowledge Discoveryqyang/Docs/2007/action.pdf · Trends & Controversies Domain-Driven Data Mining: A Framework Longbing Cao and Chengqi Zhang, University

Despite much industrial success, thesemodels don’t correspond to actions that willbring about desired world states. They max-imize their utility on test data. But data min-ing methods should do more than produce amodel. They should generate actions thatcan be executed either automatically orsemiautomatically. Only in this way can adata mining system be truly consideredactionable.

I’ve developed two techniques that high-light a novel computational framework foractionable data mining.

Extracting actions from decision trees

The first technique uses an algorithm forextracting actions from decision trees suchthat each test instance falls in a desirablestate.3 A customer-relationship-managementproblem illustrates our solution. CRM indus-try competition has heated up in recent years.Customers increasingly switch service pro-viders. To stay profitable, CRM companieswant to convert valuable customers from alikely attrition state to a loyal state.

Our approach to this problem exploits de-cision tree algorithms. These learning algo-rithms, such as ID3 or C4.5,1 are among themost popular predictive data-classificationmethods. In CRM applications, we can builda decision tree from an example customerset described by a feature set. The featurescan include any kind of information: per-sonal, financial, and so on.

Let’s assume an algorithm has alreadygenerated a decision tree. To generate ac-tions from this tree, we must first considerhow to extract actions when no restrictionsexist on their number. In the training data,some values under the class attribute aremore desirable than others. For example, ina banking application, a customer loyaltystatus of “stay” is more desirable than “notstay.” For each test data instance—that is,for each customer under consideration—wewant to decide a sequence of actions thatwould transform a customer from “not stay”to “stay” status. We can extract this set ofactions from the decision trees, using thefollowing algorithm:

1. Import customer data using appropri-ate data collection, cleaning, and pre-processing techniques, and so on.

2. Build customer profiles from the train-ing data, using a decision-tree learningalgorithm, such as C4.5,1 to predict

whether a customer is in the desired sta-tus. Refine the algorithm to use the areaunder the receiver-operating-character-istic curve.4 (The ROC curve lets usevaluate candidate-predication rankinginstead of accuracy.) The Laplace cor-rection is a further refinement to avoidextreme probability values.

3. Search for optimal actions for each cus-tomer. This critical step generates actions,and I describe it in more detail later.

4. Produce reports for domain experts toreview the actions and selectively de-ploy them.

In this algorithm, when a customer fallsinto a particular leaf node with a certainprobability of having the desired status, thealgorithm tries to “move” the customer intoother leaves with higher probabilities of hav-

ing the desired status. A data analyst can thenconvert the probability gain into an expectedgross profit. However, moving a customerfrom one leaf to another means changingsome of the customer’s attribute values. Thedata analyst generates an action to denoteeach such change in which an attribute A’svalue is transformed from v1 to v2. Further-more, these actions can incur costs, which adomain expert defines in a cost matrix.

On the basis of a domain-specific costmatrix for actions, we can define an ac-tion’s net profit:

Pnet = PE � Pgain � �i Costi

where Pnet denotes the net profit, PE de-notes the total profit of having the customerin the desired status, Pgain denotes the prob-ability gain, and Costi denotes each action’scost. The algorithm has taken into account

both continuous and discrete attributeversions.

This solution corresponds to a simplesituation, which limits the number of ac-tions. However, when we limit actions bynumber or total costs, the problem of se-lecting the action subset to execute be-comes NP-hard. In an extension of thiswork, I’ve developed a greedy algorithmthat can give a high-quality approximatesolution.3 My colleagues and I have runmany tests to show that these methods areuseful in action generation and perfor-mance. Furthermore, we’ve also considereda decision-tree ensemble to make the solu-tions more robust.

Learning from frequent-actionsequences

The second technique uses an algorithmthat can learn relational action models fromfrequent item sets. This technique appliesto automatic planning systems, which oftenrequire formally defined action modelswith an initial state and a goal. However,building such models from scratch is diffi-cult, so many researchers have exploredvarious approaches to learning them fromexamples instead.

The Action-Relation Modeling Systemautomatically acquires action models fromrecorded user plans.5 ARMS takes a collec-tion of observed traces as its input. It deter-mines a collection of frequent-action setsby applying a frequent-item-set-miningalgorithm to the traces. It then takes thesesets as the input to another modeling sys-tem, called weighted Max-Sat, which cangenerate relational actions.

To better understand relational-action rep-resentations, consider an example input andoutput in the Depot problem domain from anAI planning competition.6 The training dataconsists of training plan samples that aresimilar to this (abbreviated) sequence ofactions: drive(?x:truck ?y:place ?z:place),lift(?x:hoist ?y:crate ?z:surface ?p:place),drop(?x:hoist ?y:crate ?z:surface ?p:place),load(?x:hoist ?y:crate ?z:truck ?p:place),unload(?x:hoist ?y:crate ?z:truck ?p:place),where “?” indicates variable parameters.

From such input sequences, we want tolearn the preconditions, add lists, anddelete lists for all actions. When ARMScompletes the three lists for an action, itsaction model is complete. We want to learnan action model for every action in a prob-lem domain. In this way, we “explain” all

80 www.computer.org/intelligent IEEE INTELLIGENT SYSTEMS

Despite much industrial

success, data-centric

output models don’t

correspond to actions

that will bring about

desired world states.

Page 4: Domain-Driven, Actionable Knowledge Discoveryqyang/Docs/2007/action.pdf · Trends & Controversies Domain-Driven Data Mining: A Framework Longbing Cao and Chengqi Zhang, University

training examples successfully. For exam-ple, a learned action model for the actionload(?x:hoist ?y:crate ?z:truck ?p:place)might have the preconditions (at ?x ?p), (at?z ?p), (lifting ?x ?y); the delete list (lifting?x ?y); and the add list (at ?y ?p), (in ?y?z), (available ?x), (clear ?y).

ARMS proceeds in two phases. In phaseone, it finds frequent-action sets from planssharing a common parameter set. In addi-tion, with the help of the initial and goalstates, ARMS finds some frequent relation-action pairs and uses them to make an ini-tial guess on the preconditions, add lists,and delete lists of actions in this subset. Ituses these action subsets and pairs to ob-tain a constraint set that must hold for theplans to be correct.

In phase two, ARMS takes the frequentitem sets as input and transforms them intoconstraints in the form of a weighted Max-Sat representation.7 It solves the constraintsusing a weighted Max-Sat solver and pro-duces action models as a result. The pro-cess iterates until it models all actions.

Future workWhile ARMS action models are determin-

istic, in the future we plan to extend theframework to learning probabilistic actionmodels that can handle uncertainty. We canadd other constraints to allow partial obser-vations between actions and prove the sys-tem’s formal properties. We’ve testedARMS successfully on all the StanfordResearch Institute Problem Solver’s plan-ning domains from a recent AI PlanningCompetition based on training action se-quences.5 In related work,8 we consider waysto generate actions to cost-effectively acquiremissing data.

Acknowledgments

I thank the Hong Kong Research GrantsCouncil (grant 621606) for its support.

References

1. J.R. Quinlan, C4.5 Programs for MachineLearning, Morgan Kaufmann, 1993.

2. R. Agrawal and R. Srikant, “Fast Algorithmsfor Mining Association Rules,” Proc. 20thInt’l Conf. Very Large Data Bases (VLDB94), Morgan Kaufmann, 1994, pp. 487–499.

3. Q. Yang et al., “Extracting Actionable Knowl-edge from Decision Trees,” IEEE Trans.Knowledge and Data Eng., vol. 19, no. 1,2007, pp. 43–56.

4. J. Huang and C. X. Ling, “Using AUC andAccuracy in Evaluating Learning Algo-rithms,” IEEE Trans. Knowledge and DataEng., vol. 17, no. 3, 2005, pp. 299–310.

5. Q. Yang, K. Wu, and Y. Jiang, “Learning Ac-tion Models from Plan Examples UsingWeighted Max-Sat,” Artificial Intelligence,vol. 171, nos. 2–3, 2007, pp. 107–143.

6. M. Ghallab et al., “PDDL—The PlanningDomain Definition Language,” tech. manual,AIPS Planning Committee, 1998.

7. H. Kautz and B. Selman, “Pushing the Enve-lope: Planning, Propositional Logic, and Sto-chastic Search,” Proc 13th Nat’l Conf. Artifi-cial Intelligence,AAAI, 1996, pp. 1194–1201.

8. Q. Yang et al., “Test-Cost Sensitive Classifi-cation on Data with Missing Values,” IEEETrans. Knowledge and Data Eng., vol. 18, no.5, 2006, pp. 626–638.

Actionable Data Mining inVideo and Vision SystemsDavid Bell, Queen’s University Belfast

Techniques for actionable data mining areoften application specific: “Horses forcourses” is a commonly used rule. However,even application-specific techniques can gen-eralize to ideas that encourage both formaland pragmatic advances. Here, I describetwo practical scenarios based on behavioraldata mining in video and vision systems. Thescenarios involve traffic surveillance and apredator’s observation of its prey. I will showhow both scenarios exert a strong pull on datamining technology and, at the same time, geta push from current data mining technology.

The applications’ pullBoth applications are oriented toward

profiling particular individuals and eventsfrom a population. In addition to the usualdata mining problems of representation andperformance, the applications share severalcharacteristics: they both involve inductiveand transductive reasoning, adaptive learn-ing over time, and unstructured data.

Inductive reasoning is concerned withbuilding a global model to capture generaldata patterns across a complete data spaceand using this model to subsequently predictoutput values for a particular input. However,such models are difficult to create and up-date, and they’re often not necessary. As the19th century philosopher John Stuart Millnoted: “The child who having burnt his fin-gers, avoids to thrust them again into the fire,has reasoned or inferred, though he neverthought of the general maxim, Fire burns.”

Transductive reasoning, which arguesfrom particulars to particulars, works betterin many situations because it produces a tai-lored local model as a result for each new in-put—for example, an individual’s behaviortraces from sampled database records. Trans-duction is often appropriate in situations thatfocus on behaviors that unfold over time. Apredator animal hunting its prey at Queen’sUniversity Belfast is one example we’veused in our studies. However, inductive rea-soning is also useful—for example, in study-ing species behavior.

For adaptive learning applications, videoand vision systems need techniques for min-ing time-series data. There are several tradi-tional approaches. Some look at similaritiesbetween two sequences, some seek optimalalgorithms for classifying sequences intosimilar subsequences, some search for re-peating cycles, and others try to extract ex-plicit rules over the time series. QUB’sKnowledge and Data Engineering group hasdeveloped some ideas and techniques foraddressing this specific problem environ-ment. They involve capturing coarsenedbehavior components from pixel values andclassifying sequences as such components.

Video and vision systems also pose ex-treme requirements on data mining tech-niques relative to unstructured data. Someestimates say that 85 percent of the data weuse is unstructured—for example, in emails,workbooks, consumer comments, and othertextual documents. Video data is often muchmore difficult to assemble for pattern recog-nition than table data. For example, considera video of a predator animal trying to gain anadvantage on potential prey by watching it

JULY/AUGUST 2007 www.computer.org/intelligent 81

Transductive reasoning, which

argues from particulars

to particulars, works better

in many situations because it

produces a tailored local model

as a result for each new input.

Page 5: Domain-Driven, Actionable Knowledge Discoveryqyang/Docs/2007/action.pdf · Trends & Controversies Domain-Driven Data Mining: A Framework Longbing Cao and Chengqi Zhang, University

adapt to its environment. What will its nextmove be? What is its idiosyncratic behaviorpattern? The irrelevant detail and noise in thevisual data is huge. Furthermore, it varies be-tween individuals and even episodes. Para-doxically, though, video can have other sortsof structure, as I will show in the next section.

The technology’s pushTraffic and security surveillance systems

are potentially important instances of ac-tionable, transductive mining. They usevideo clips and other image data that con-tain multiple, potentially interesting behav-ioral and other sequences that vary withtime. Our work in this domain includes col-laboration with QUB’s Speech, Image, andVision Systems group, which has devel-oped—among other things—a sensor-basedsystem that automatically detects and tracksmoving objects.1 Our work is concernedwith capturing such systems’ output forhigher-level analysis using data mining and,where appropriate, other AI techniques fordynamic scene analysis. The applicationsenvisaged include predicting traffic incidentsby learning activity patterns from stored tra-jectories and spotting abnormal behavior.

Suppose a vehicle performs a U-turn neara roadblock. To initiate tracking this vehicle,the system must learn normal behavior andthen identify unusual patterns. The idea is tocapture coarsened patterns of behavior frag-ments—for example, speed behavior. We canobtain tuples tracing the patterns in variousscenarios. We can have an expert describeabnormal behavior classes or, alternatively,the system can identify them automatically.The system then looks for matches with thepatterns being sought. Such traffic surveil-lance systems have close similarities topredator-prey scenarios.

This outline is greatly simplified. Indi-vidual actors have idiosyncratic behaviorpatterns, and existing representation schemesaren’t always rich enough to capture move-ments and other behaviors in enough detailto help in inductive or transductive rea-soning. Nevertheless, patterns often carryacross several episodes of a behavior oractivity, so video and vision systems offeran analysis medium for classifying thepatterns to gain more general insights intothe behavior or activity.

To demonstrate our techniques and pro-vide insights into animal behavior, we’ve

implemented a system for capturing time-varying behavioral pattern sequences ofrobots, which serve as animal subjects andhave the advantage of being fully under theexperimenter’s control. The system can useeither video clip observations or a visionsystem’s output. The system inputs can beeither real episodes of a robot predator anda robot prey or synthetic episodes—for ex-ample, simulations of a prey robot emulat-ing the vision system of Sony’s Aibo dogrobot. The detailed time slices of action arecoarsened to provide gross, molecular unitsof behavior. For both individuals and popula-tions, or for a particular episode, we can rep-resent combinations of these behavior unitsin table form, to be mined using varioustechniques. For example, the output in table2 shows molecular behavior units such asFwd S or Fwd Q (forward slowly/quickly)and more exotic coarsened units such as Sur-prised (Surp). Figure 1 illustrates the behav-ior of tuple 2. The table shows the outcomeof this behavior—in this case, not to attack.

The table structure is more complex ifthe sequence includes a second actor, ofcourse. The similarities between this sce-nario and other monitoring and adaptationapplications are significant. From the timeseries of behavior chunks, a predator robotmight see, for example, how a prey robotlearns. If it observes more examples of theprey’s behavior, as in table 2’s other tuples,it can learn about behavior patterns. If thepredator observes other members of theprey’s “species,” it might generalize itsknowledge to the species.

Other applicationsWe study a variety of other data mining

techniques and applications that organi-cally interact with these techniques. For ex-ample, we work on more structured data setsfor studies with belief networks, rough sets,associative-mining methods, and a variety oftext-mining algorithms for different applica-tions. The results help in debugging multi-plexer networks in telecommunicationsapplications, facilitating nuclear safety andapplications such as content managementand technology watch, or simply turninginert, unstructured text into actionableknowledge.

References

1. F. Campbell-West and P. Miller, “Evaluationof a Robust Least Squares Motion Detection

82 www.computer.org/intelligent IEEE INTELLIGENT SYSTEMS

Table 2. Illustrative tuple traces for five time slots relative to an attack decision.*

Tuple T0 T1 T2 T3 T4 Attack

1 Fwd S Pause Fwd S Fwd Q Pause No

2 Fwd S Pause Fwd S Pause Bwd Q No

3 Bwd Q Pause Fwd S Fwd Q Fwd S Yes

4 Pause Fwd Q Fwd S Fwd Q Fwd S Yes

5 Surp Bwd Q Pause Fwd S Fwd Q Yes

6 Surp Bwd Q Pause Surp Bwd Q No

*Fwd S/Q: forward movement, slowly or quickly; Bwd: backward movement; Surp: Surprised

Time

Distance

Forward slowly

Forward slowly

Backward quickly

Pause

Pause

Figure 1. An example of five time slots/molecules from an episode (tuple 2 in table 2).

Page 6: Domain-Driven, Actionable Knowledge Discoveryqyang/Docs/2007/action.pdf · Trends & Controversies Domain-Driven Data Mining: A Framework Longbing Cao and Chengqi Zhang, University

Algorithm for Projective Sensor Motions Par-allel to a Plane,” Proc. SPIE Opto-IrelandImage and Vision Conf., Soc. Photo-OpticalInstrumentation Engineers, 2005, pp. 225–236.

Visual Mining of DNA SequencesMichail Vlachos and Philip S. Yu, IBM T.J.Watson Research CenterBahar Taneri, Scripps Genome CenterEamonn Keogh, University of California,Riverside

Now that complete genome mappingsare available for several species, compara-tive DNA analysis is a routine application,providing many insights into the conserva-tion, regulation, and function of genes andproteins. Even though many consider thehuman eye to be the ultimate data miningtool, visual comparison between DNAnucleotides can be difficult for humans,given that typical DNA data sets containthousands of nucleotides (typical genelength is 2,000 bases, and the whole humangenome consists of 3 billion base pairs).

Humans are better at conceptualizingand comparing shapes than text. Here, wepresent a method for visualizing DNAnucleotide sequences into numerical trajec-tories in space. The trajectories captureeach sequence’s nucleotide content, allow-ing for quick, easy comparison of differentDNA sequences. Additionally, the resultingtrajectories are organized in 2D space in away that preserves their relative distancesas accurately as possible. We illustrate thisapproach in one of many biological appli-cations, visually determining the molecularphylogenetic relationship of species.

Comparative mitochondrial DNA(mtDNA) analyses have proved useful inestablishing phylogeny among a widerange of species.1 mtDNA is passed ononly from the mother during sexual repro-duction, making the mitochondria clones.This means that mtDNA changes are minorfrom generation to generation. Unlike nu-clear DNA, which changes by 50 percenteach generation, mtDNA mutations arerare—that is, mtDNA has a long memory.In this study, we use mtDNA to identify theevolutionary distances among species.

From DNA sequences to trajectories

DNA sequences are combinatorial se-quences of four nucleotides; adenine, cyto-

sine, thymine, and guanine, which are de-noted by A, C, T, and G. By scanning eachsymbol in a DNA sequence, we can con-struct a trajectory that starts from a fixedpoint on the Cartesian coordinate systemand moves in four directions according tothe given nucleotide.

To quantify the similarity between theresulting trajectories, we utilize a warpingdistance,2 which allows for elastic matchingbetween the DNA trajectories, supportinglocal compressions and decompressions.The warping distance has an additionaldesirable property: it supports matchingsbetween trajectories of different lengths.Figure 2 illustrates the flexible matchingbetween trajectories that warping distanceachieves. Figure 2a shows the mapping be-tween the points in the resulting human andchimpanzee mtDNA trajectories, and figure2b shows the mapping between the humanand cat mtDNA trajectories.

Spanning-tree visualizationNow we need a fast visualization tech-

nique that also accurately highlights the pair-wise relationships between objects in twodimensions.3 We can easily retain the dis-tances between any three objects A, B, and Cin two dimensions by placing the objects onthe vertices of a triangle constructed as fol-lows: if D(A, B) is the distance between Aand B, we can map the third point at the in-tersection of circles transcribed with centersA and B and radii of, respectively, D(A, C)—that is, the distance between points A andC—and D(B, C). Given the triangle inequal-ity, the circles either intersect at twopositions or are tangent. Any position on thecircles’ intersection will retain the originaldistance. So, when mapping n objects, wecan retain three distances for the first three

objects and two distances (with respect totwo reference points) for the remaining n – 3objects, preserving a total of 3 + 2(n – 3) dis-tances. Using the minimum-spanning-tree(MST) with this triangulation method, wecan preserve two distances per object on the2D space: the distance to each object’s near-est neighbor and the distance with respect toa pivot point.

However, this mapping is valid only formetric distances because only then are thetranscribed circles guaranteed to intersectwith respect to the two reference points.The warping distance we use in assessingDNA trajectory similarities is a nonmetricdistance, which means the circles might notintersect. Hence, we must extend the tech-nique to properly identify the third point’sposition so that it’s as close as possible tothe circumference of the (possibly) nonin-tersecting reference circles.

Once we’ve incorporated these additionson the mapping method, we have a power-ful visualization technique for nonmetricdistances. It preserves as well as possiblenot only nearest-neighbor distances (localstructure) but also global distances withrespect to a single reference (pivot) point.The latter supports global data viewingusing the point as a pivot. Finally, themethod is computationally spartan becausewe can construct the MST in O(VlogE)time for a graph of V vertices and E edges.

Applications to evolutionary biology

Several examples demonstrate the useful-ness of the proposed trajectory transforma-tion and 2D mapping technique. First, weuse mtDNA from Homo sapiens and sevenrelated species to construct the spanning-treevisualization. Figure 3 depicts the results,

JULY/AUGUST 2007 www.computer.org/intelligent 83

(b)(a)

Human

Chimpanzee

Human

Cat

Figure 2. Matching DNA trajectories using warping distance: (a) human versuschimpanzee and (b) human versus cat.

Page 7: Domain-Driven, Actionable Knowledge Discoveryqyang/Docs/2007/action.pdf · Trends & Controversies Domain-Driven Data Mining: A Framework Longbing Cao and Chengqi Zhang, University

which agree with current evolutionary views.The mapping not only is accurate with re-gard to the evolutionary distance of the spe-cies but also preserves the clustering betweenthe original groups that the various speciesbelong to. Specifically, human, pygmy chim-panzee, chimpanzee, and orangutan belongto the hominidae group, gibbon to the hylo-batae group, and baboon and macaque to thecercopithicae group. Our results corroborateearlier findings on evolutionary distances ofHomo sapiens to other mammalian species.2

Human and orangutan divergence took placeapproximately 11 million years ago, whereasgibbon and human divergence occurred ap-proximately 15 million years ago.5 Accord-ing to the same source, gorilla divergenceoccurred about 6.5 million years ago andchimpanzee divergence took place about 5.5million years ago.

Figure 4 illustrates our second example,which involves a larger mammalian data setand again takes the human as the referentialpoint. On this plot, we use the formal speciesnames and overlay the DNA trajectory of therespective mtDNA sequence. At first glance,the closeness of the hippopotamus with thewhales might seem like a misplacement. In-tuitively, a hippopotamus has greater affinitywith an elephant. In fact, however, the hippo-potami are more closely related to whalesthan to any other mammals. Whales and hip-popotami diverged 54 million years ago,whereas the whale-hippopotamus groupparted from the elephants about 105 mil-lion years ago. The group that includes hip-popotami and whales-dolphins, but ex-cludes all other mammals in figure 4, isCetartiodactyla.6 In general, the figureillustrates the strong visualization capacityof the spanning-tree technique, particularlyin unveiling the similarities and connec-tions between the different species.

Future workHere, we’ve used only small data in-

stances to present this novel DNA repre-sentation and visualization technique.However, we expect to find many applica-tions in mining large sequence collections,especially in conjunction with advancedcompression and indexing techniques. Weintend to apply our techniques to additionalbiomedical applications, including screen-ing and diagnostic techniques for cancerdata, where they could distinguish cancertranscripts from unaffected ones and iden-tify different cancer stages.

84 www.computer.org/intelligent IEEE INTELLIGENT SYSTEMS

HumanPygmy chimpanzee

Orangutan

Chimpanzee

Gibbon

Gorilla

Baboon

Macaque

HominidaeHylobatae

Cercopithicae

Figure 3. A visualization of humans and related species using the spanning-tree visualization method.

Hippopotamusamphibius

Elephasmaximusindicus

Loxodontaafricana

Ursusamericanus

Ursusmaritimus

Homo sapiens

Panpaniscus

Pantroglodytes

Balaenopteramusculus

Balaenopteraphysalus

Canisfamiliaris

Figure 4. An evolutionary visualization of various species with respect to humans.

Page 8: Domain-Driven, Actionable Knowledge Discoveryqyang/Docs/2007/action.pdf · Trends & Controversies Domain-Driven Data Mining: A Framework Longbing Cao and Chengqi Zhang, University

References

1. A.K. Royyuru et al., “Inferring Common Ori-gins from mtDNA,” Research in Computa-tional Molecular Biology, LNCS 3909,Springer, 2006, pp. 246–247.

2. M. Vlachos et al., “Indexing Multi-DimensionalTime-Series with Support for Multiple DistanceMeasures,” Proc. 9th ACM SIGKDD Int’l Conf.Knowledge Discovery and Data Mining (KDD03), ACM Press, 2003, pp. 216–225.

3. R. Lee, J. Slagle, and H. Blum, “A Triangu-lation Method for the Sequential Mapping ofPoints from N-Space to Two-Space,” IEEETrans. Computers, vol. C-26, no. 3, 1977, pp.288–292.

4. Z. Cheng et al., “A Genome-Wide Compari-son of Recent Chimpanzee and Human Seg-mental Duplications,” Nature, vol. 437, no.7055, 2005, pp. 88–93.

5. R. Stauffer et al., “Human and Ape Molecu-lar Clocks and Constraints on Paleontologi-cal Hypotheses,” J. Heredity, vol. 92, no. 6,2001, pp. 469–474.

6. B.M. Ursing and U. Arnason, “Analyses ofMitochondrial Genomes Strongly Support aHippopotamus-Whale Clade,” Proc. RoyalSoc. London, series B, vol. 265, 1998, pp.2251–2255.

Actionable KnowledgeDiscovery: A Brain InformaticsPerspectiveNing Zhong, Maebashi Institute of Technology

Brain informatics pursues a holistic un-derstanding of human intelligence througha systematic approach to brain research.BI regards the brain as an information-processing system and emphasizes cogni-tive experiments to understand its mecha-nisms for analyzing and managing data.Multiaspect data analysis is an importantBI methodology because the brain is toocomplex for a single data mining algorithmto analyze all the available cognitive ex-perimental data. MDA supports an agent-based approach that has two main benefitsfor addressing the complexity and diversityof human brain data and applications:

• its agents can cooperate in a multiphaseprocess and support multilevel concep-tual abstraction and learning, and

• its agent-based approach supports taskdecomposition for distributing data min-ing subtasks over the Grid.

MDA requires a Web-based BI portal thatcan support a multiphase mining processbased on a conceptual data model of the hu-man brain. Generally speaking, MDA canmine several kinds of rules and hypothesesfrom different data sources, but brain re-searchers can’t use MDA results directly. In-stead, an explanation-based reasoning pro-cess must combine and refine them into moregeneral results to form actionable knowledge.From an application’s viewpoint, the BI pro-vides the knowledge-flow management fordistributed Web inference engines that em-ploy actionable knowledge and related datasources to implement knowledge services.1

A BI portal for MDABuilding a BI portal requires the develop-

ment of a multilayer, data mining grid sys-tem to support MDA. At the Maebashi Insti-

tute of Technology’s Department of LifeScience and Informatics, we’ve been devel-oping a systematic approach to support thisgoal. We use powerful instruments, such asfunctional magnetic resonance imaging(fMRI) and electroencephalography (EEG),to measure, collect, model, transform, man-age, and mine human brain data obtainedfrom various cognitive experiments.1

fMRI provides images of functionalbrain activity that show dynamic patternsthroughout the brain for a given task; fMRIimage resolution is excellent, but the pro-cess is relatively slow. EEG provides infor-mation about the electrical fluctuations be-tween neurons that also characterize brainactivity, and it measures brain activity in nearreal time. Discovering new knowledge andmodels of human information-processingactivities requires not only individual dataobtained from a single measuring method

but also multiple data sources measuringmethods.

Our work focuses on human informationprocessing activities on two levels:

• spatiotemporal features and flow based onfunctional relationships between activatedbrain areas for each given task, and

• neural structures and neurobiological pro-cesses related to the activated areas.

More specifically, at the current stage,we want to understand how neurologicalprocesses support a cognitive process.We’re investigating how a specific part ofthe brain operates in a specific time, howthe operations change over time, and howthe activated areas work cooperatively toimplement a whole information-processingsystem. We’re also looking at individualdifferences in performance.

BI’s future will be affected by the abilityto mine fMRI and EEG brain activations ona large scale. Key issues are how to designthe psychological and physiological experi-ments for obtaining data from the brain’sinformation-processing mechanisms andhow to analyze and manage such data frommultiple aspects for discovering new humaninformation-processing models. Researchersare also studying how to use data mining andstatistical learning techniques to automatefMRI image analysis and understanding.2,3

Web intelligence4 and Grid computing5

provide ideal infrastructures, platforms,and technologies for building a BI portal todeal with MDA’s huge, distributed datasources. They can support a data mininggrid composed of many agent components.Each agent can do some simple task, butwhen the Grid integrates all these agents,they can carry out more complex BI tasks.

Using data mining agents entails both pre-processing and postprocessing steps. Knowl-edge discovery generally involves back-ground knowledge from experts, such asbrain scientists, about a domain, such as cog-nitive neuroscience, to guide a spiral, multi-phase process to find interesting and novelrules and features hidden in data. On thebasis of such data, BI generates hypothesesthat it deploys on the Grid for use by variousknowledge-based inference and reasoningtechnologies.1 From a top-down perspective,the knowledge level is also the applicationlevel. Both the mining and data levels sup-port brain scientists in their work and theportal in updating its resources. From a

JULY/AUGUST 2007 www.computer.org/intelligent 85

Multiaspect data analysis is

important because the brain

is too complex for a single

data mining algorithm

to analyze available cognitive

experimental data.

Page 9: Domain-Driven, Actionable Knowledge Discoveryqyang/Docs/2007/action.pdf · Trends & Controversies Domain-Driven Data Mining: A Framework Longbing Cao and Chengqi Zhang, University

bottom-up perspective, the data level sup-plies the data services for the mining level,and the mining level produces new rules andhypotheses for the knowledge level to gener-ate actionable knowledge.

Peculiarity-oriented miningfMRI and EEG data are peculiar with

respect to a specific state or the related partof a stimulus. Peculiarity-oriented mining isa proposed knowledge discovery methodol-ogy that automates fMRI and EEG dataanalysis and understanding. POM doesn’tuse conventional fMRI image processing orEEG frequency analysis, and it doesn’trequire human-expert-centric visualization.

Figure 5 illustrates the methodology ap-plied to interpreting the spatiotemporal fea-tures and flow of a human information pro-cessing system. In the cognitive processfrom perception (in this case, a cognitivetask stimulated by vision) to thinking (rea-soning), the system collects data from sev-eral event-related points in time and trans-forms them into various forms for POM-centric MDA. Finally, the system explainsthe results of the separate analyses and syn-thesizes them into a whole flow.

The proposed POM/MDA methodologyshifts the focus of cognitive science from asingle type of experimental data analysistoward a deep, holistic understanding ofhuman information-processing principles,models, and mechanisms.

References

1. N. Zhong et al., “Building a Data Mining Grid

for Multiple Human Brain Data Analysis,”Computational Intelligence, vol. 21, no. 2,2005, pp. 177–196.

2. F.T. Sommer and A. Wichert, eds., Ex-ploratory Analysis and Data Modeling inFunctional Neuroimaging, MIT Press, 2003.

3. N. Zhong et al., “Peculiarity Oriented fMRIBrain Data Analysis for Studying HumanMulti-Perception Mechanism,” CognitiveSystems Research, vol. 5, no. 3, 2004, pp.241–256.

4. J. Liu et al., “The Wisdom Web: New Chal-lenges for Web Intelligence (WI),” J. Intel-ligent Information Systems, vol. 20, no. 1,2003, pp. 5–9.

5. I. Foster and C. Kesselman, eds., The Grid:Blueprint for a New Computing Infrastruc-ture, Morgan Kaufmann, 1999.

Privacy-Preserving DataMining: Past, Present, and FutureMafruz Zaman Ashrafi, Institute for Infocomm ResearchDavid Taniar, Monash University

Privacy issues grow in importance asdata mining increases its power to dis-cover knowledge and trends from ever-larger data stores. The reason is simple andstraightforward: once information is re-leased, it’s difficult to control and so im-possible to protect from misuse.

Aggregating data from various sourcesincreases the risk of privacy breaches, butmany applications require data sharing toensure their resulting model accuracy. Un-

less they aggregate data from differentsources, the mining models they generate cancontain many false positives and useless pat-terns. More importantly, the models mightexclude knowledge that’s critical in decisionmaking. For example, homeland-defenseapplications profile individuals by combin-ing privacy-sensitive data sources from vari-ous domains. Without combining thesesources, the data mining models might tag aninnocent person as a criminal or vice versa.

Because the consequences of inaccuracyare so serious, privacy has emerged as a topdata mining research priority. Mitigatingprivacy risk in distributed data mining mod-els involves two broad issues: privacy-pre-serving data aggregation and data miningmodel accuracy.

Privacy in frequent-item-setmining

The distributed data mining process dis-closes each participating source site’sitem-set frequency. Privacy-preservingmethods in this distributed process mainlyrely on the way each site shares its localsupport of its item sets without exposingthe exact frequency. Three well-knownapproaches to achieve this goal are ran-domization, secure multiparty compu-tation (SMC), and cryptography.

Randomization approaches are based onrandomized data sets from various sites.1

Participating sites preserve privacy by dis-carding some data set items and insertingnew ones. They send the results to a cen-tralized third-party site and ensure accuracyby including statistical estimates of the origi-nal item frequency and randomization vari-ance. Applications can use these estimates,together with the Apriori algorithm,2 tomine the nonrandomized transaction fre-quencies while looking only at the random-ized frequencies.

Randomization techniques in frequency-item-set mining are either transaction in-variant or item invariant. The transaction-invariant technique will breach privacy if thegiven data set’s individual transaction size islarge. For example, a transaction size |t| �10 is doomed to fail at protecting privacy.On the other hand, the item-invariant tech-nique includes all items in a perturbed trans-action |t�| = R(t). This technique assumes theprobability of these items to be the same, andit completely ignores the correlation betweenthem. The resulting frequent-item sets mightbe unable to accurately reflect many of the

86 www.computer.org/intelligent IEEE INTELLIGENT SYSTEMS

Reasoning – Language – Memory – Attention – Vision

Thinking

tkTransformation

POM/MDA

tjTransformation

POM/MDA

Explanation/synthesis

tiTransformation

POM/MDA

t1

POM/MDA

Perception

Figure 5. Investigating the spatiotemporal features and flow of a human informationprocessing system.

Page 10: Domain-Driven, Actionable Knowledge Discoveryqyang/Docs/2007/action.pdf · Trends & Controversies Domain-Driven Data Mining: A Framework Longbing Cao and Chengqi Zhang, University

original items. The item-invariant techniquealso incurs additional computational cost.Furthermore, if each participating site has alarge data set, distorting the data set can in-volve huge computation.

SMC-based techniques discover an itemset’s global frequency without involving atrusted third party.3,4 Unlike randomizationapproaches, SMC approaches neither per-turb the original data set nor send it to acentralized site. Instead, they perform asecure computation in two cycles.

To discover a global frequency, in thefirst cycle, known as obfuscation, everyparticipating site generates its local fre-quency x and obfuscates it by performing afunction fi(x ri), where r is a randomnoise. The site then sends the obfuscatedfrequency to an adjacent site. The adjacentsite j obfuscates its local frequency in thesame manner fj(x rj), combines the ob-fuscated frequencies, and sends the resultto the next adjacent site. The process con-tinues until no participating site remains.At this point, the obfuscated frequency in-cludes each item set’s local frequency andnoise. In the second cycle, de-obfuscation,each site repeats this process with one ex-ception: instead of adding noise, it removesnoise from the obfuscated frequency. At theend of this round, each site can discover theexact global frequency of a given item set.

Although generating global frequency ofany item set is quite straightforward withSMC, privacy is still vulnerable if the par-ticipating sites collude with each other. Forexample, sites i and i + 2 in the chain cancollude to find the exact support of site i +1. SMC also incurs high communicationcosts because each site must communicatewith others many times to find each itemset’s exact global frequency.

SMC’s communication costs increaseexponentially as the number of partici-pating sites increases. In fact, these ap-proaches aren’t feasible when the numberof participants is large—for example, in anonline survey. Crypto-based systems canovercome this limitation by using public-key cryptography to generate global fre-quency.5 Similar to randomization, cryptosystems use a centralized site to aggregateall participating sites’ frequencies withoutlosing accuracy as the cost of privacy.However, if each data set has many localfrequent-item sets, this approach not onlyincurs high computation costs but also in-creases communication costs.

Future trendsA privacy-preserving min-

ing model isn’t easy to achieve, and it’s perhaps im-possible to achieve privacywithout trade-offs. A datamining application deter-mines the cost it must pay toreach the required privacylevel. Figure 6 summarizesthese trade-offs for the ap-proaches we’ve described.

Apart from these trade-offs, other privacy questionsalso need serious and imme-diate attention. For example,does the data mining patternitself breach privacy? Canthe data mining model’s pri-vacy be exposed by associ-ating the model with publicdata sources? Such problems might not bevisible immediately, but the threats theypose are real.

References

1. A.V. Evfimievski et al., “Privacy PreservingMining of Association Rules,” InformationSystems, vol. 29, no. 4, 2004, pp. 343–364.

2. R. Agrawal and R. Srikant, “Fast Algorithmsfor Mining Association Rules in Large Data-base,” Proc. 20th Int’l Conf. Very Large Data-bases, IEEE CS Press, 1994, pp. 407–419.

3. M. Kantarcioglu and C. Clifton, “Privacy-Pre-serving Distributed Mining of AssociationRules on Horizontally Partitioned Data,”IEEE Trans. Knowledge and Data Eng., vol.16, no. 9, 2004, pp. 1026–1037.

4. M.Z. Ashrafi, D. Taniar and K. Smith,“Reducing Communication Cost in a PrivacyPreserving Distributed Association Rule Min-ing,” Proc. Database Systems for AdvancedApplications, LNCS 2973, Springer, 2004,pp. 381–392.

5. Z. Yang, S. Zhong and R.N. Wright, “Privacy-Preserving Classification of Customer Datawithout Loss of Accuracy,” Proc. 2005 SIAMInt’l Conf. Data Mining (SDM 05), Soc.Industrial and Applied Math., 2005; www.siam.org/meetings/sdm05/downloads.htm.

Toward Knowledge-DrivenData MiningEugene Dubossarsky, Ernst & YoungWarwick Graco, Australian Taxation Office

Data mining faces many challenges, but

the primary one is to move from a method-driven approach to a process driven by do-main problems and knowledge. Here, wedescribe four key aspects of successfullymeeting this challenge: moving to intelli-gent algorithms that utilize expert knowl-edge, converting dumb data points to smartones, combining business knowledge withtechnical knowledge to optimize miningand modeling results, and using discoveryand detection results to improve intelli-gence analysis.

Intelligent algorithmsMany data mining algorithms lack

smarts and are biased in their capabilities.For example, K-means clustering algo-rithms are biased toward recovering clus-ters that have hyperspherical shapes.1 Theydon’t perform well in recovering other nat-urally occurring forms, such as banana-shaped clusters. We need intelligent algo-rithms that can identify clusters in dataregardless of shapes, sizes, and spatial ori-entations.

Such algorithms require two key compo-nents. First is a range of tools and techni-ques for working out the optimum fit be-tween data and classification and predictionmodels and for discovering data configura-tions and correlations such as associations,clusters, and classes. Second, intelligentalgorithms require expert knowledge. Thisknowledge can take the form of heuristicsand routines that manage data nuances andtailor the data mining to specific tasks. Onesolution is supervised clustering—for

JULY/AUGUST 2007 www.computer.org/intelligent 87

Randomization

SMC

Crypto

Privacy

Com

mun

icat

ion

cost

s

Privac

y = 0

Compu

tation

costs

Privacy

Privacy

Inaccuracies

Figure 6. Privacy-preserving frequent-item-set miningapproaches and their trade-offs.

Page 11: Domain-Driven, Actionable Knowledge Discoveryqyang/Docs/2007/action.pdf · Trends & Controversies Domain-Driven Data Mining: A Framework Longbing Cao and Chengqi Zhang, University

88 www.computer.org/intelligent IEEE INTELLIGENT SYSTEMS

example, user-driven selection of clusterseeds and distance metrics.2 Clusteringtechniques can also employ active learningtechniques to obtain expert feedback foroptimal results.3 Other data mining tech-niques should also incorporate expertknowledge.

Intelligent dataCurrent data mining techniques are ap-

plied to dumb data points. Intelligent datapoints could aid both data mining and datamodeling. Each data point should havemetadata to explain when it was created,who created it, and what it represents. Inaddition, metaknowledge in the form ofexpert explanation and interpretation couldclarify each data point’s significance.

Dumb data points don’t help an algo-rithm discriminate whether the data is rele-vant to solving specific mining and model-ing problems. Efficient algorithms need todistinguish noise from signal—a distinc-tion that depends on the problem context.Data that’s noise in one context might besignal in another. For example, prevailingweather conditions might not be relevantto explaining certain crime patterns, butthey’re likely to be important in explainingpatterns of disease.

We need algorithms that can interrogatethe metadata and metaknowledge attachedto data points in much the same way theSemantic Web proposes to do for Web con-tent.4 The Semantic Web expresses Webcontent in a form that software agents canread. This allows them to find, share, andintegrate information more easily. In a simi-lar way, data mining algorithms could usemetadata and metaknowledge to eliminatedata points that have little bearing on theproblem being solved. These same sourceswould also help users understand data min-ing results.

Business knowledgeData mining must marry business knowl-

edge and technical knowledge. Data minersoften lack business domain knowledgewhen they undertake mining and model-ing tasks. This essential knowledge comesfrom business experts, who can helpthroughout the data mining life cycle.These experts must guide the explorationprocess, acting as navigators while dataminers do the driving.

Method-driven data mining can pro-duce many pages of results with little orno significance. Knowledge-driven datamining needs business experts to identifythe important results and interpret them inthe form of metaknowledge. Metaknowl-edge puts results in perspective and helpsusers understand their significance forpurposes such as increasing productivity,lowering costs, and improving outcomes.For example, Tatiana Semenova high-lighted how to use metaknowledge de-rived from interpreting data mining re-sults to develop best-practice medicalguidelines to improve patient outcomes.5

Another perspective on this issue is min-ing the tacit knowledge of experiencedpeople to predict outcomes. Research hasshown that such knowledge can producesuperior results in prediction markets rang-ing from political forecasting to commer-cial sales forecasts.6

IntelligenceConfusion abounds around what “intel-

ligence” means. To clarify it here, we usethe WIKID (pronounced “Why-kid”) hi-erarchy that’s emerged from the knowl-edge management community (see table3). In this hierarchy, intelligence refers toinformation that’s been analyzed andevaluated—part of a process of makinginformed assessments of what will hap-pen, when, and how. It relies on the ca-pacity to acquire and apply knowledge.Analysts who perform this function areresponsible for marshalling the facts,drawing deductions from what is known,

and providing options to decision makers.Intelligence is the lifeblood of organiza-tions and helps to determine where re-sources should be invested to exploit op-portunities and to counter threats.

Intelligence and data mining have a sym-biotic relationship. Intelligence helps to re-solve where to focus data mining—for ex-ample, to detect fraudulent insurance claims.Equally, data mining research and practicediscover new knowledge to evaluate.

ConclusionKnowledge is the fuel that drives data

mining. The work we’ve described here canmove the field toward integrating knowledgefully into data mining practice and so en-hance the intelligence of its results.

References

1. M. Mu-Chun Su and C. Chou, “A ModifiedVersion of the K-Means Algorithm with aDistance Based on Cluster Symmetry,” IEEETrans. Pattern Analysis and Machine Intelli-gence, vol. 23, no. 6, 2001, pp. 674–680.

2. S. Basu, A. Banerjee, and R.J. Mooney,“Semi-supervised Clustering by Seeding,”Proc. 19th Int’l Conf. Machine Learning(ICML 02), Morgan Kaufmann, 2002, pp.19–26.

3. S. Tong, “Active Learning: Theory and Appli-cations,” doctoral dissertation, Dept. of Com-puter Science, Stanford Univ., 2001.

4. T. Berners-Lee, J. Hendler, and O. Lassila,“The Semantic Web,” Scientific American,May 2001, pp. 34–43.

5. T. Semenova, “Identification of InterestingPatterns in Large Data Health Data Bases,”doctoral dissertation, Research School ofInformation Sciences and Eng., AustralianNat’l Univ., 2006.

6. A. Leigh and J. Wolfers, “Wisdom of theMasses,” Australian Financial Rev., 8 June2007, pp. 3–4.

Table 3. WIKID hierarchy.

Concept Description

Wisdom Knowledge rightly applied to solve difficult problems and issues

Intelligence Evaluated knowledge from which relevant insights and understanding have been extracted

Knowledge Deep and detailed information on a topic

Information Data on issues

Data Basic facts and figures

Page 12: Domain-Driven, Actionable Knowledge Discoveryqyang/Docs/2007/action.pdf · Trends & Controversies Domain-Driven Data Mining: A Framework Longbing Cao and Chengqi Zhang, University

Longbing Cao is asenior lecturer ofthe Faculty of In-formation Technol-ogy at the Univer-sity of Technology,Sydney. Contacthim at [email protected].

Chengqi Zhang isa professor of theFaculty of Informa-tion Technology atthe University ofTechnology, Syd-ney. Contact him [email protected].

Michail Vlachosis a research staffmember at the IBMT.J. Watson Re-search Center, NewYork. Contact himat [email protected].

Eamonn Keogh isan associate profes-sor of computerscience at the Uni-versity of Califor-nia, Riverside.Contact him [email protected].

David Bell is a pro-fessor in the Schoolof Electronics, Elec-trical Engineeringand Computer Sci-ence at Queen’sUniversity, Belfast.Contact him at [email protected].

Bahar Taneri is anassistant professorof psychology atEastern Mediter-ranean University,Cypress, and amember of theScripps GenomeCenter at the Uni-

versity of California, San Diego. Contacther at [email protected].

Philip S. Yu is themanager of theSoftware Tools andTechniques groupat the IBM T.J.Watson ResearchCenter, New York.Contact him [email protected].

Ning Zhong is aprofessor in theDepartment of LifeScience and Infor-matics at the Mae-bashi Institute ofTechnology, Japan.He’s also an ad-junct professor in

the International Web Intelligence Consor-tium Institute at the Beijing University ofTechnology. Contact him at [email protected].

David Taniar is asenior lecturer incomputing atMonash University.Contact him [email protected].

Mafruz ZamanAshrafi is a re-search fellow at theInstitute for Info-comm Research.Contact him [email protected].

Eugene Dubossarsky is the director ofPredictive Business Intelligence at Ernst &Young in Sydney and a Senior Visiting Fel-low at the University of New South Wales.Contact him at [email protected].

Warwick Graco isa senior data minerin the Office of theChief KnowledgeOfficer of the Aus-tralian TaxationOffice. Contacthim at [email protected].

Qiang Yang is aprofessor at theHong Kong Univer-sity of Science andTechnology. Con-tact him at [email protected].

EXECUTIVE COMMITTEE

President: Michael R. Williams*President-Elect: Rangachar Kasturi;* Past President:

Deborah M. Cooper;* VP, Conferences and Tutorials:Susan K. (Kathy) Land (1ST VP);* VP, Electronic Prod-ucts and Services: Sorel Reisman (2ND VP);* VP,Chapters Activities: Antonio Doria;* VP, EducationalActivities: Stephen B. Seidman;† VP, Publications: JonG. Rokne;† VP, Standards Activities: John Walz;† VP,Technical Activities: Stephanie M. White;* Secretary:Christina M. Schober;* Treasurer: Michel Israel;†2006–2007 IEEE Division V Director: Oscar N.Garcia;† 2007–2008 IEEE Division VIII Director:Thomas W. Williams;† 2007 IEEE Division V Director-Elect: Deborah M. Cooper;* Computer Editor in Chief:Carl K. Chang;† Executive Director: Angela R. Burgess†

* voting member of the Board of Governors† nonvoting member of the Board of Governors

BOARD OF GOVERNORS

Term Expiring 2007: Jean M. Bacon, George V. Cybenko,Antonio Doria, Richard A. Kemmerer, Itaru Mimura,Brian M. O’Connell, Christina M. Schober

Term Expiring 2008: Richard H. Eckhouse, James D. Isaak,James W. Moore, Gary McGraw, Robert H. Sloan,Makoto Takizawa, Stephanie M. White

Term Expiring 2009: Van L. Eden, Robert Dupuis, Frank E.Ferrante, Roger U. Fujii, Ann Q. Gates, Juan E. Gilbert,Don F. Shafer

Next Board Meeting: 9 Nov. 2007, Cancún, Mexico

EXECUTIVE STAFF

Executive Director: Angela R. Burgess; Associate Execu-tive Director: Anne Marie Kelly; Associate Publisher:Dick Price; Director, Administration: Violet S. Doan;Director, Finance and Accounting: John Miller

COMPUTER SOCIETY OFFICESWashington Office. 1730 Massachusetts Ave. NW,

Washington, DC 20036-1992Phone: +1 202 371 0101 • Fax: +1 202 728 9614Email: [email protected]

Los Alamitos Office. 10662 Los Vaqueros Circle, LosAlamitos, CA 90720-1314Phone: +1 714 821 8380 • Email: [email protected] and Publication Orders: Phone: +1 800 272 6657 • Fax: +1 714 821 4641Email: [email protected]

Asia/Pacific Office. Watanabe Building, 1-4-2 Minami-Aoyama, Minato-ku, Tokyo 107-0062, JapanPhone: +81 3 3408 3118 • Fax: +81 3 3408 3553Email: [email protected]

IEEE OFFICERS

President: Leah H. Jamieson; President-Elect: LewisTerman; Past President: Michael R. Lightner; ExecutiveDirector & COO: Jeffry W. Raynes; Secretary: CeliaDesmond; Treasurer: David Green; VP, EducationalActivities: Moshe Kam; VP, Publication Services andProducts: John Baillieul; VP, Regional Activities: PedroRay; President, Standards Association: George W.Arnold; VP, Technical Activities: Peter Staecker; IEEEDivision V Director: Oscar N. Garcia; IEEE Division VIIIDirector: Thomas W. Williams; President, IEEE-USA:John W. Meredith, P.E.

PURPOSE: The IEEE Computer Society is theworld’s largest association of computingprofessionals and is the leading provider oftechnical information in the field. Visit ourWeb site at www.computer.org.

revised 25 June 2007