Dwdm Intro

8/2/2019 Dwdm Intro

1/103

Outline

Background
  Content of human mind, sample data mining problems, why data mining?
Definition, KDD process, system architecture
Data visualization
Data warehousing
Course outline
Overview of data mining tasks
Top 10 data mining algorithms
Issues in data mining
Data mining applications
Summary


Data, Information, Knowledge, and Wisdom
by Gene Bellinger, Durval Castro, Anthony Mills

According to Russell Ackoff, the content of the human mind can be classified into five categories: data, information, knowledge, understanding, and wisdom.

Data: symbols
Data represents a fact or statement of an event without relation to other things.
Data is raw. It simply exists and has no significance beyond its existence (in and of itself). It can exist in any form, usable or not. It does not have meaning of itself. In computer parlance, a spreadsheet generally starts out by holding data.
Ex: It is raining.

Content of Human Mind

Information: data that are processed to be useful; provides answers to who, what, where, and when questions.
Information is data that has been given meaning by way of relational connection. This "meaning" can be useful, but does not have to be.
In computer parlance, a relational database makes information from the data stored within it.
Information embodies the understanding of a relationship of some sort, possibly cause and effect.
Ex: The temperature dropped 15 degrees and then it started raining.


Knowledge: application of data and information; answers how questions.
Knowledge is the appropriate collection of information, such that its intent is to be useful. Knowledge is a deterministic process. When someone "memorizes" information (as less-aspiring test-bound students often do), then they have amassed knowledge.
Ex: If the humidity is very high and the temperature drops suddenly, the atmosphere is often unlikely to be able to hold the moisture, so it rains.



Understanding: appreciation of why
It is the process by which one can take knowledge and synthesize new knowledge from the previously held knowledge.
The difference between understanding and knowledge is the difference between "learning" and "memorizing".
People who have understanding can undertake useful actions because they can synthesize new knowledge, or in some cases at least new information, from what is previously known (and understood).
That is, understanding can build upon currently held information, knowledge, and understanding itself.
In computer parlance, AI systems possess understanding in the sense that they are able to synthesize new knowledge from previously stored information and knowledge.



Content of Human Mind

Wisdom: evaluated understanding
It is the process by which we also discern, or judge, between right and wrong, good and bad. I personally believe that computers do not have, and will never have, the ability to possess wisdom.
Ex: It rains because it rains. And this encompasses an understanding of all the interactions that happen between raining, evaporation, air currents, temperature gradients, changes, and raining.


Sample data mining problem #1

I manage a supermarket (restaurant, video store, book store) and my cash register (or web site) pumps transactions into my DB.
Can you help me visualize my sales?
Can you profile my customers?
Tell me something interesting.
I do not know statistics, and I do not want to hire statisticians.


Sample data mining problem #2

I am an astronomer and I have sky survey data: 3 terabytes, 2 billion objects.
Can you help me recognize the objects?
Most of my data is beyond my reach.
Can you find new/unusual items in my data?
Can you help me with basic manipulation, so I can focus on basic science?
I know my data and statistics, but that is not enough.


About Data Mining

Look up a few records: SQL
Populate a standard report: SQL
Create a new report: OLAP/mining
Data mining:
  Optimize a business process
  Locate a new problem
  Understand something new
  Answer a tough question


Evolution of Database Technology

Before 1960s: Primitive file processing

1960s: Data collection, database creation, IMS and network DBMS

1970s: Relational data model, relational DBMS implementation

1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
  Application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s: Data mining, data warehousing, multimedia databases, and Web databases

2000s:
  Stream data management and mining
  Data mining and its applications
  Web technology (XML, data integration) and global information systems


Why Data Mining?

The explosive growth of data: from terabytes to petabytes
  Data collection and data availability
    Automated data collection tools, database systems, Web, computerized society
  Major sources of abundant data
    Business: Web, e-commerce, transactions, stocks
    Science: remote sensing, bioinformatics, scientific simulation
    Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
"Necessity is the mother of invention": data mining, the automated analysis of massive data sets


Why Mine Data? Commercial Viewpoint

Lots of data is being collected and warehoused
  Web data, e-commerce
  Purchases at department/grocery stores
  Bank/credit card transactions
Computers have become cheaper and more powerful
Competitive pressure is strong
  Provide better, customized services for an edge (e.g., in Customer Relationship Management)


Why Mine Data? Scientific Viewpoint

Data collected and stored at enormous speeds (GB/hour)
  Remote sensors on a satellite
  Telescopes scanning the skies
  Microarrays generating gene expression data
  Scientific simulations generating terabytes of data
Traditional techniques infeasible for raw data
Data mining may help scientists
  in classifying and segmenting data
  in hypothesis formation


Mining Large Data Sets - Motivation

There is often information hidden in the data that is not readily evident
Human analysts may take weeks to discover useful information
Much of the data is never analyzed at all

[Figure: The Data Gap. Total new disk (TB) since 1995 versus number of analysts, 1995 to 1999; vertical axis from 0 to 4,000,000.]

From: R. Grossman, C. Kamath, V. Kumar, Data Mining for Scientific and Engineering Applications


Evolution of Sciences

Before 1600: empirical science

1600-1950s: theoretical science
  Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.

1950s-1990s: computational science
  Over the last 50 years, most disciplines have grown a third, computational branch (e.g., empirical, theoretical, and computational ecology, or physics, or linguistics).
  Computational science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.

1990-now: data science
  The flood of data from new scientific instruments and simulations
  The ability to economically store and manage petabytes of data online
  The Internet and computing Grid that make all these archives universally accessible
  Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!

Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002


What Is Data Mining?

Data mining (knowledge discovery in databases):
  Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases
Alternative names and their "inside stories":
  Data mining: a misnomer?
  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
What is not data mining?
  (Deductive) query processing
  Expert systems or small ML/statistical programs


What Is (Not) Data Mining?

What is data mining?
  Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly in the Boston area)
  Group together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com)
What is not data mining?
  Look up a phone number in a phone directory
  Query a Web search engine for information about "Amazon"


Data Mining: A KDD Process

Data mining: the core of the knowledge discovery process.

[Figure: KDD process flow. Databases --> Data Cleaning and Data Integration --> Data Warehouse --> Selection --> Task-relevant Data --> Data Mining --> Pattern Evaluation]


Steps of a KDD Process

Learning the application domain: relevant prior knowledge and goals of the application
Data cleaning: to remove noise and inconsistent data
Data integration: multiple data sources can be combined
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of the effort!)
Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
Choosing functions of data mining: summarization, association, classification, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
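The KDD steps listed above can be read as a chain of small functions, each feeding the next. The sketch below is only an illustration under stated assumptions: the two toy record sources, the field names, and the "average spend per age group" mining step are hypothetical and are not taken from the course material.

```python
# Hypothetical toy records from two sources; None marks a missing value.
store_a = [{"customer": "c1", "age": 34, "spend": 120.0},
           {"customer": "c2", "age": None, "spend": 80.0}]
store_b = [{"customer": "c3", "age": 51, "spend": 300.0},
           {"customer": "c4", "age": 29, "spend": 40.0}]

def integrate(*sources):
    """Data integration: combine records from multiple sources."""
    return [row for source in sources for row in source]

def clean(rows):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def select(rows, fields):
    """Selection: keep only the task-relevant attributes."""
    return [{f: r[f] for f in fields} for r in rows]

def mine(rows):
    """Data mining (toy summarization): mean spend per age group."""
    groups = {"under_40": [], "40_plus": []}
    for r in rows:
        key = "under_40" if r["age"] < 40 else "40_plus"
        groups[key].append(r["spend"])
    return {k: sum(v) / len(v) for k, v in groups.items() if v}

def evaluate(patterns, min_mean):
    """Pattern evaluation: keep only groups above an interestingness threshold."""
    return {k: v for k, v in patterns.items() if v >= min_mean}

target = select(clean(integrate(store_a, store_b)), ["age", "spend"])
patterns = mine(target)
interesting = evaluate(patterns, min_mean=100.0)
print(patterns)      # mean spend for each age group
print(interesting)   # groups that pass the threshold
```

In practice the cleaning and preprocessing stages would be far larger than the mining step itself, which is the point of the "60% of the effort" remark.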


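Architecture: Typical Data Mining System

[Figure: layered architecture, listed here from top to bottom]
Graphical User Interface
Pattern Evaluation
Data Mining Engine
Database or Data Warehouse Server
Data cleaning, integration, and selection
Database, Data Warehouse, World-Wide Web, Other Info Repositories

Knowledge-Base (shown beside the Data Mining Engine and Pattern Evaluation layers)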


Components of a Data Mining System

Database, data warehouse, World Wide Web, or other information repository: data cleaning and data integration techniques are performed on this data.
Database and data warehouse server: responsible for fetching the relevant data, based on the user's data mining request.
Knowledge base: domain knowledge that is used to guide the data mining process.
  Attribute levels, semantics, user beliefs, pattern interestingness, thresholds, metadata
Data mining engine: set of functional modules for tasks such as characterization, summarization, association, classification, clustering, and outlier extraction.
Pattern evaluation: employs interestingness measures.
  Push pattern evaluation as deep into the mining process as possible so that the search can be optimized.
User interface: communication between users and the data mining system.


Data Visualization

One picture may be worth 1000 words!
Visual data mining
  Visualization of data
  Visualization of data mining results
  Visualization of data mining processes
  Interactive data mining: visual classification
One melody may be worth 1000 words too!
  Audio data mining: turn data into music and melody!
  Uses audio signals to indicate the patterns of data or the features of data mining results


Visualization of data mining results in SAS Enterprise Miner: scatter plots


Visualization of association rules in MineSet 3.0


Visualization of a decision tree in MineSet 3.0


Visualization of Data Mining Processes by Clementine


Interactive Visual Mining by Perception-Based Classification (PBC)


Visualization on NTT i-Townpage


Traversal Diagram


Visitor Success Path


Day/Night Success Path


Data Mining and Business Intelligence

[Figure: pyramid of layers; the potential to support business decisions increases toward the top]
Layers, top to bottom:
  Making Decisions
  Data Presentation: Visualization Techniques
  Data Mining: Information Discovery
  Data Exploration: Statistical Analysis, Querying and Reporting
  Data Warehouses / Data Marts: OLAP, MDA
  Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA


Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]

Other disciplines: pattern recognition, image processing, signal processing, spatial or temporal data analysis.


Regarding This Course

Emphasis is on efficient and scalable data mining techniques.
  Algorithms must be highly scalable to handle data on the order of terabytes.
  Scalability: running time should grow approximately linearly with the size of the data, given the available resources such as main memory and disk space.
  Efficiency: without compromising quality.
Using the proposed techniques, interesting knowledge, regularities, or high-level information can be extracted from the databases and viewed or browsed from different angles.


Why Not Traditional Data Analysis (Statistics, ...)?

Tremendous amount of data
  Algorithms must be highly scalable to handle data on the order of terabytes
  Scalability: running time should grow approximately linearly with the size of the data
High dimensionality of data
  A microarray may have tens of thousands of dimensions
High complexity of data
  Data streams and sensor data
  Time-series data, temporal data, sequence data
  Structured data, graphs, social networks, and multi-linked data
  Heterogeneous databases and legacy databases
  Spatial, spatiotemporal, multimedia, text, and Web data
  Software programs, scientific simulations
New and sophisticated applications


Multi-Dimensional View of Data Mining

Data to be mined
  Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
Knowledge to be mined
  Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  Multiple/integrated functions and mining at multiple levels
Techniques utilized
  Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
Applications adapted
  Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.


Data Mining: On What Kinds of Data?

Database-oriented data sets and applications
  Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
  Data streams and sensor data
  Time-series data, temporal data, sequence data (incl. bio-sequences)
  Structured data, graphs, social networks, and multi-linked data
  Object-relational databases
  Heterogeneous databases and legacy databases
  Spatial data and spatiotemporal data
  Multimedia database
  Text databases
  The World-Wide Web


Data Mining Functionalities

Multidimensional concept description: characterization and discrimination
  Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
Frequent patterns, association, correlation vs. causality
  Diaper --> Beer [0.5%, 75%], i.e., support 0.5% and confidence 75% (correlation or causality?)
Classification and prediction
  Construct models (functions) that describe and distinguish classes or concepts for future prediction
  E.g., classify countries based on climate, or classify cars based on gas mileage
  Predict some unknown or missing numerical values


Data Mining Functionalities (2)

Cluster analysis
  Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  Maximizing intra-class similarity and minimizing inter-class similarity
Outlier analysis
  Outlier: data object that does not comply with the general behavior of the data
  Noise or exception? Useful in fraud detection, rare events analysis
Trend and evolution analysis
  Trend and deviation: e.g., regression analysis
  Sequential pattern mining: e.g., digital camera --> large SD memory
  Periodicity analysis
  Similarity-based analysis
Other pattern-directed or statistical analyses
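As a rough illustration of the outlier analysis item above, the sketch below flags values that lie far from the mean in units of standard deviation (a z-score test). The slides do not prescribe a particular outlier method; the z-score rule, the threshold of 2.0, and the transaction amounts are assumptions made for this example only.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts; one is far from the rest.
amounts = [20, 22, 19, 21, 23, 20, 18, 500]
outliers = zscore_outliers(amounts)
print(outliers)
```

Whether such a value is noise to be discarded or an exception worth investigating (for example possible fraud) is a domain decision the code cannot make.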


What Is a Data Warehouse?

Defined in many different ways, but not rigorously.
  A decision support database that is maintained separately from the organization's operational database
  Supports information processing by providing a solid platform of consolidated, historical data for analysis
"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
Data warehousing:
  The process of constructing and using data warehouses


Data Warehouse: Subject-Oriented

Organized around major subjects, such as customer, product, sales
Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process


Data Warehouse: Time Variant

The time horizon for the data warehouse is significantly longer than that of operational systems
  Operational database: current value data
  Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
  Contains an element of time, explicitly or implicitly
  But the key of operational data may or may not contain a time element


Data Warehouse: Nonvolatile

A physically separate store of data transformed from the operational environment
Operational update of data does not occur in the data warehouse environment
  Does not require transaction processing, recovery, and concurrency control mechanisms
  Requires only two operations in data accessing: initial loading of data and access of data


Data Warehouse vs. Heterogeneous DBMS

Traditional heterogeneous DB integration: a query-driven approach
  Build wrappers/mediators on top of heterogeneous databases
  When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
  Complex information filtering, competition for resources
Data warehouse: update-driven, high performance
  Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis


Data Warehouse vs. Operational DBMS

OLTP (on-line transaction processing)
  Major task of traditional relational DBMS
  Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
  Major task of data warehouse system
  Data analysis and decision making
Distinct features (OLTP vs. OLAP):
  User and system orientation: customer vs. market
  Data contents: current, detailed vs. historical, consolidated
  Database design: ER + application vs. star + subject
  View: current, local vs. evolutionary, integrated
  Access patterns: update vs. read-only but complex queries


OLTP vs. OLAP

                    OLTP                             OLAP
users               clerk, IT professional           knowledge worker
function            day-to-day operations            decision support
DB design           application-oriented             subject-oriented
data                current, up-to-date, detailed,   historical, summarized, multidimensional,
                    flat relational, isolated        integrated, consolidated
usage               repetitive                       ad hoc
access              read/write, index/hash on        lots of scans
                    primary key
unit of work        short, simple transaction        complex query
# records accessed  tens                             millions
# users             thousands                        hundreds
DB size             100 MB-GB                        100 GB-TB
metric              transaction throughput           query throughput, response
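The contrast in the table between a "short, simple transaction" and a "complex query" can be made concrete with a tiny in-memory database. This is a minimal sketch using Python's standard `sqlite3` module; the `sales` table, its columns, and its four rows are hypothetical, and a single SQLite file is of course neither a production OLTP system nor a data warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [(1, "north", 2005, 100.0), (2, "north", 2006, 150.0),
     (3, "south", 2005, 200.0), (4, "south", 2006, 50.0)],
)

# OLTP-style access: a short read/write transaction that touches
# one record, located through the primary key.
conn.execute("UPDATE sales SET amount = amount + 10 WHERE id = ?", (4,))
conn.commit()
one_row = conn.execute("SELECT amount FROM sales WHERE id = ?", (4,)).fetchone()

# OLAP-style access: a read-only query that scans the whole table
# and consolidates it along one dimension (region).
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

print(one_row)
print(totals)
conn.close()
```

The first statement touches one record; the second reads every record, which is why the two workloads are tuned and measured differently.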


Why Separate Data Warehouse?

High performance for both systems
  DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
  Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
Different functions and different data:
  Missing data: decision support requires historical data, which operational DBs do not typically maintain
  Data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
  Data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled
Note: there are more and more systems which perform OLAP analysis directly on relational databases


Objectives

Data mining is the process of extracting interesting and useful information/knowledge from large databases or data warehouses.
The course covers
  the concepts and techniques of data mining, such as association rules, clustering, and classification
  the basic concepts, architecture, and general implementations of data warehousing technology


Course Topics

Introduction (3 hrs): definition, KDD framework, issues in data mining.
Association rules (9 hrs): problem definition, frequent item-set generation, Apriori and FP-growth algorithms, evaluation of association patterns.
Clustering (9 hrs): overview, types of data, K-means, agglomerative clustering, clustering algorithms (DBSCAN, BIRCH, CURE, ROCK, CHAMELEON).
Classification (9 hrs): overview, decision tree induction, over-fitting and under-fitting, scalable decision tree algorithms, Bayesian classification, regression-based prediction methods.
Data preprocessing (6 hrs): data summarization, data cleaning, data integration and transformation, data reduction, data discretization and concept hierarchy.
Data warehousing (9 hrs): multidimensional data model, data warehousing architecture, data cube computation and OLAP technology.


Text Books

Research papers:
  In this course, about 25 research papers will be covered. Students can refer to the following books for the details of some research papers and other background information.
Text books:
  Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second edition, 2006, Elsevier Inc.
  Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, 2006, Pearson Education.
Reference books:
  Papers from the proceedings of the conferences and journals related to data mining and data warehousing.


LAB WORK

Several data mining tasks related to data preprocessing, association rules, clustering, and classification will be given.


Outcome

After completing the course, the students will
  be able to appreciate the importance of extracting useful knowledge from large amounts of data to improve the performance of a business/organization
  get enough exposure to investigate new/improved data mining methods
  understand the basics of data warehousing technology and its links to data mining
  be able to play the role of a data miner in an organization


GRADING

MidSem1: 15 %; MidSemII: 15 %; EndSem: 30%;

Research Paper Quiz: 10 % Project/Lab: 30 %


A Brief History of Data Mining Society

1989: IJCAI Workshop on Knowledge Discovery in Databases
  Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
1991-1994: Workshops on Knowledge Discovery in Databases
  Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
1995-1998: International Conferences on Knowledge Discovery in Databases and Data Mining (KDD'95-98)
  Journal of Data Mining and Knowledge Discovery (1997)
ACM SIGKDD conferences since 1998 and SIGKDD Explorations
More conferences on data mining
  PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
ACM Transactions on KDD starting in 2007


Conferences and Journals on Data Mining

KDD conferences
  ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
  SIAM Data Mining Conf. (SDM)
  (IEEE) Int. Conf. on Data Mining (ICDM)
  Conf. on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)
  Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
Other related conferences
  ACM SIGMOD
  VLDB
  (IEEE) ICDE
  WWW, SIGIR
  ICML, CVPR, NIPS
Journals
  Data Mining and Knowledge Discovery (DAMI or DMKD)
  IEEE Trans. on Knowledge and Data Eng. (TKDE)
  KDD Explorations
  ACM Trans. on KDD


Where to Find References? DBLP, CiteSeer, Google

Data mining and KDD (SIGKDD: CDROM)
  Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
  Journals: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
Database systems (SIGMOD: ACM SIGMOD Anthology CD ROM)
  Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
  Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
AI & Machine Learning
  Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
  Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
Web and IR
  Conferences: SIGIR, WWW, CIKM, etc.
  Journals: WWW: Internet and Web Information Systems
Statistics
  Conferences: Joint Stat. Meeting, etc.
  Journals: Annals of Statistics, etc.
Visualization
  Conference proceedings: CHI, ACM-SIGGraph, etc.
  Journals: IEEE Trans. Visualization and Computer Graphics, etc.


Reference Books

S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd ed. Wiley-Interscience, 2000
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001
J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2nd ed. Morgan Kaufmann, 2006
D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001
B. Liu. Web Data Mining. Springer, 2006
T. M. Mitchell. Machine Learning. McGraw Hill, 1997
G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Wiley, 2005
S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1998
I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2nd ed. Morgan Kaufmann, 2005


Data Mining Tasks

Prediction methods
  Use some variables to predict unknown or future values of other variables.
Description methods
  Find human-interpretable patterns that describe the data.

From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996


Data Mining Tasks...

Association Rule Discovery [Descriptive]
Clustering [Descriptive]
Classification [Predictive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]


Association Rule Discovery: Definition

Given a set of records, each of which contains some number of items from a given collection:
  Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:
  {Milk} --> {Coke}
  {Diaper, Milk} --> {Beer}
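The two rules above can be checked against the five transactions by computing support (the fraction of transactions containing all items of the rule) and confidence (the fraction of transactions containing the antecedent that also contain the consequent). The sketch below uses the table as given; the function names `support` and `confidence` are chosen for this example, and it evaluates given rules rather than discovering them, so it is not an implementation of Apriori or FP-growth.

```python
# The five transactions from the table above, one set of items per TID.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def support(antecedent, consequent):
    """Fraction of all transactions containing antecedent and consequent."""
    return count(antecedent | consequent) / len(transactions)

def confidence(antecedent, consequent):
    """Fraction of antecedent transactions that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          "support", support(lhs, rhs),
          "confidence", round(confidence(lhs, rhs), 2))
```

On this table, {Milk} --> {Coke} holds in 3 of the 4 transactions containing Milk, and {Diaper, Milk} --> {Beer} in 2 of the 3 transactions containing both Diaper and Milk.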


Association Rule Discovery: Application 1

Marketing and sales promotion:
  Let the rule discovered be {Bagels, ...} --> {Potato Chips}
  Potato Chips as consequent => can be used to determine what should be done to boost its sales.
  Bagels in the antecedent => can be used to see which products would be affected if the store discontinues selling bagels.
  Bagels in antecedent and Potato Chips in consequent => can be used to see what products should be sold with Bagels to promote sale of Potato Chips!



Association Rule Discovery: Application 2

Supermarket shelf management.
  Goal: to identify items that are bought together by sufficiently many customers.
  Approach: process the point-of-sale data collected with barcode scanners to find dependencies among items.
  A classic rule:
    If a customer buys diaper and milk, then he is very likely to buy beer.
    So, don't be surprised if you find six-packs stacked next to diapers!




Association Rule Discovery: Application 3

Inventory management:
  Goal: a consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households.
  Approach: process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.


Sequential Pattern Discovery: Definition

Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.

Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.

(A B) (C) (D E)


Sequential Pattern Discovery: Examples

In telecommunications alarm logs:
  (Inverter_Problem Excessive_Line_Current) (Rectifier_Alarm) --> (Fire_Alarm)
In point-of-sale transaction sequences:
  Computer bookstore: (Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies, Tcl_Tk)
  Athletic apparel store: (Shoes) (Racket, Racketball) --> (Sports_Jacket)
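To make the athletic apparel rule concrete, the sketch below checks whether a customer's purchase history contains a sequential pattern and then estimates the rule's confidence. The three customer histories are hypothetical, `contains` is a name chosen for this example, and the timing constraints mentioned in the definition are deliberately ignored, so this is a simplified containment test rather than a full sequential pattern miner.

```python
def contains(sequence, pattern):
    """Return True if the pattern's elements occur in order within the sequence.

    Both arguments are lists of sets. A pattern element matches an event
    when it is a subset of that event. Timing constraints are ignored.
    """
    pos = 0
    for element in pattern:
        # Advance to the earliest remaining event that covers this element.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next element must match a strictly later event
    return True

# Hypothetical purchase histories, one list of events per customer.
histories = [
    [{"Shoes"}, {"Racket", "Racketball"}, {"Sports_Jacket"}],
    [{"Shoes"}, {"Racket", "Racketball", "Socks"}, {"Towel"}],
    [{"Racket", "Racketball"}, {"Shoes"}],
]

antecedent = [{"Shoes"}, {"Racket", "Racketball"}]
rule = antecedent + [{"Sports_Jacket"}]

matched = [h for h in histories if contains(h, antecedent)]
followed = [h for h in matched if contains(h, rule)]
confidence = len(followed) / len(matched)
print(len(matched), len(followed), confidence)
```

The third history buys the same items in the wrong order and therefore does not match the antecedent, which is what distinguishes a sequential pattern from a plain association rule.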


Clustering Definition

Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
  Data points in one cluster are more similar to one another.
  Data points in separate clusters are less similar to one another.
Similarity measures:
  Euclidean distance if attributes are continuous.
  Other problem-specific measures.
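As a small illustration of Euclidean distance as a similarity measure, the sketch below assigns each point to its nearest cluster center. The 3-D points and the two centers are made up for the example, and assigning to fixed centers is only the assignment half of an algorithm such as K-means, not a complete clustering procedure.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical 3-D data points and two assumed cluster centers.
points = [(1.0, 1.0, 1.0), (1.5, 1.0, 0.5), (8.0, 8.0, 9.0), (9.0, 8.5, 8.0)]
centers = [(1.0, 1.0, 1.0), (9.0, 9.0, 9.0)]

# Assign each point to the index of its closest center.
labels = [
    min(range(len(centers)), key=lambda c: euclidean(p, centers[c]))
    for p in points
]
print(labels)
```

Points that end up with the same label are close to one another (small intracluster distance) and far from the other center (large intercluster distance).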


Illustrating Clustering

Euclidean distance based clustering in 3-D space.
  Intracluster distances are minimized.
  Intercluster distances are maximized.
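
The two criteria can be checked numerically for a given grouping. The sketch below is only an illustration on made-up 3-D points: it compares the mean Euclidean distance within a cluster to the mean distance between clusters. The helper names `mean_intra` and `mean_inter` are hypothetical.

```python
import math
from itertools import combinations, product

# Two hypothetical clusters of 3-D points (illustration only).
cluster_a = [(1.0, 1.0, 1.0), (1.0, 2.0, 1.0), (2.0, 1.0, 1.0)]
cluster_b = [(8.0, 8.0, 8.0), (9.0, 8.0, 8.0), (8.0, 9.0, 8.0)]


def mean_intra(cluster):
    """Mean Euclidean distance over all pairs of points inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


def mean_inter(first, second):
    """Mean Euclidean distance over all pairs drawn from two clusters."""
    pairs = list(product(first, second))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


intra_a = mean_intra(cluster_a)
intra_b = mean_intra(cluster_b)
inter_ab = mean_inter(cluster_a, cluster_b)
print(f"intra A={intra_a:.2f} intra B={intra_b:.2f} inter={inter_ab:.2f}")
```

A good clustering shows small intracluster values relative to the intercluster value, as it does for these well-separated toy points.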


Clustering: Application 1

Market Segmentation:
  Goal: Subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
  Approach:
    Collect different attributes of customers based on their geographical and lifestyle related information.
    Find clusters of similar customers.
    Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.


Clustering: Application 2

Document Clustering:
  Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
  Approach: Identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
  Gain: Information retrieval can utilize the clusters to relate a new document or search term to clustered documents.
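
One possible similarity measure based on term frequencies is cosine similarity. The slides do not name a specific measure, so this choice is an assumption, and the three short documents below are invented for illustration.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each lowercase whitespace-separated term occurs."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    shared = tf_a.keys() & tf_b.keys()
    dot = sum(tf_a[term] * tf_b[term] for term in shared)
    norm_a = math.sqrt(sum(count * count for count in tf_a.values()))
    norm_b = math.sqrt(sum(count * count for count in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents (illustration only).
doc1 = term_frequencies("stocks fell as markets closed")
doc2 = term_frequencies("markets rallied as stocks rose")
doc3 = term_frequencies("the team won the final game")

sim_12 = cosine_similarity(doc1, doc2)
sim_13 = cosine_similarity(doc1, doc3)
print(f"doc1 vs doc2: {sim_12:.2f}, doc1 vs doc3: {sim_13:.2f}")
```

The two finance-like documents share three of their five terms, while the sports document shares none, so a clustering algorithm fed these similarities would group the first two together. In practice, common words such as "as" and "the" would be filtered out first.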


Illustrating Document Clustering

Clustering Points: 3204 articles of the Los Angeles Times.
Similarity Measure: How many words are common in these documents (after some word filtering).

Category        Total Articles   Correctly Placed
Financial       555              364
Foreign         341              260
National        273              36
Metro           943              746
Sports          738              573
Entertainment   354              278
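
As a worked check of the table, the fraction of correctly placed articles can be computed per category and overall. The numbers are copied from the table; only the variable names are introduced here.

```python
# (category, total articles, correctly placed), copied from the table.
results = [
    ("Financial", 555, 364),
    ("Foreign", 341, 260),
    ("National", 273, 36),
    ("Metro", 943, 746),
    ("Sports", 738, 573),
    ("Entertainment", 354, 278),
]

total_articles = sum(total for _, total, _ in results)
total_correct = sum(correct for _, _, correct in results)
overall = total_correct / total_articles

for name, total, correct in results:
    print(f"{name:<14}{correct / total:7.1%}")
print(f"{'Overall':<14}{overall:7.1%}")
```

The category totals add up to the 3204 articles stated above, and 2257 of them are correctly placed. The National category stands out with a much lower rate than the others.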


Clustering of S&P 500 Stock Data

Observe stock movements every day.
Clustering points: Stock-{UP/DOWN}
Similarity Measure: Two points are more similar if the events described by them frequently happen together on the same day. We used association rules to quantify a similarity measure.

Discovered Clusters and Industry Groups

Cluster 1 (Industry Group: Technology1-DOWN)
  Applied-Matl-DOWN, Bay-Network-DOWN, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-DOWN, Tellabs-Inc-DOWN, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN

Cluster 2 (Industry Group: Technology2-DOWN)
  Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN

Cluster 3 (Industry Group: Financial-DOWN)
  Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN

Cluster 4 (Industry Group: Oil-UP)
  Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP


Classification: Definition

Given a collection of records (training set):
  Each record contains a set of attributes; one of the attributes is the class.

Find a model for the class attribute as a function of the values of the other attributes.

Goal: previously unseen records should be assigned a class as accurately as possible.
  A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.


Classification Example

Training Set

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Test Set

Refund   Marital Status   Taxable Income   Cheat
No       Single           75K              ?
Yes      Married          50K              ?
No       Married          150K             ?
Yes      Divorced         90K              ?
No       Single           40K              ?
No       Married          80K              ?

The training set is used to learn a classifier (the model), which is then applied to the test set.
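
The workflow in this example can be sketched with a tiny hand-written model. The rules in `predict` and the 80K threshold are an assumption chosen to fit the ten training records; they are not the output of a learning algorithm and are not given in the slides.

```python
# Training records from the example:
# (refund, marital status, taxable income in K, cheat).
training_set = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Unlabeled records from the example: (refund, marital status, income in K).
test_set = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]


def predict(refund, marital_status, income_k, threshold_k=80):
    """Hand-written decision rules (an assumption, not a learned model)."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    return "Yes" if income_k >= threshold_k else "No"


hits = sum(predict(r, m, inc) == label for r, m, inc, label in training_set)
train_accuracy = hits / len(training_set)
predictions = [predict(r, m, inc) for r, m, inc in test_set]
print(f"training accuracy: {train_accuracy:.0%}")
print(predictions)
```

Accuracy measured on the records used to build the rules overstates real performance, which is why a separate labeled test set is needed. The test records here have unknown labels ("?"), so the sketch can only produce predictions for them, not an accuracy figure.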


Classification: Application 1

Direct Marketing
  Goal: Reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
  Approach:
    Use the data for a similar product introduced before.
    We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
    Collect various demographic, lifestyle, and company-interaction related information about all such customers: type of business, where they stay, how much they earn, etc.
    Use this information as input attributes to learn a classifier model.

From [Berry & Linoff] Data Mining Techniques, 1997



Classification: Application 2

Fraud Detection
  Goal: Predict fraudulent cases in credit card transactions.
  Approach:
    Use credit card transactions and the information on the account holder as attributes: when a customer buys, what he buys, how often he pays on time, etc.
    Label past transactions as fraud or fair transactions. This forms the class attribute.
    Learn a model for the class of the transactions.
    Use this model to detect fraud by observing credit card transactions on an account.



Classification: Application 3

Customer Attrition/Churn:
  Goal: To predict whether a customer is likely to be lost to a competitor.
  Approach:
    Use detailed records of transactions with each of the past and present customers to find attributes: how often the customer calls, where he calls, what time of the day he calls most, his financial status, marital status, etc.
    Label the customers as loyal or disloyal.
    Find a model for loyalty.

From [Berry & Linoff] Data Mining Techniques, 1997



Classification: Application 4

Sky Survey Cataloging
  Goal: To predict the class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
    3000 images with 23,040 x 23,040 pixels per image.
  Approach:
    Segment the image.
    Measure image attributes (features), 40 of them per object.
    Model the class based on these features.
  Success Story: Could find 16 new high red-shift quasars, some of the farthest objects, which are difficult to find!

From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996


Classifying Galaxies

Class: Stages of Formation (Early, Intermediate, Late)

Attributes: Image features, characteristics of light waves received, etc.

Data Size:
  72 million stars, 20 million galaxies
  Object Catalog: 9 GB
  Image Database: 150 GB

Courtesy: http://aps.umn.edu


Regression

Predict a value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.

Greatly studied in the statistics and neural network fields.

Examples:
  Predicting sales amounts of a new product based on advertising expenditure.
  Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  Time series prediction of stock market indices.
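
For the linear case with a single input variable, the model can be fitted by ordinary least squares. The sketch below assumes made-up advertising and sales figures that are exactly linear so the result is easy to verify; the function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure vs. sales (arbitrary units).
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(ad_spend, sales)
predicted = slope * 6.0 + intercept  # forecast for an unseen expenditure
print(f"sales = {slope:.1f} * spend + {intercept:.1f}; at 6.0 -> {predicted:.1f}")
```

With real data the points would scatter around the line, and the same formulas would give the line that minimizes the sum of squared errors.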

Deviation/Anomaly Detection


Detect significant deviations from normal behavior.

Applications:
  Credit Card Fraud Detection
  Network Intrusion Detection

Typical network traffic at university level may reach over 100 million connections per day.
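
A very simple way to flag "significant deviations" is a z-score test against the mean. This is one basic technique among many and is an assumption here, since the slides do not prescribe a method. The daily connection counts are invented, and the threshold of 2 standard deviations is an arbitrary choice.

```python
import statistics


def find_anomalies(values, threshold=2.0):
    """Return values lying more than threshold standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical daily connection counts, in millions (illustration only).
daily_connections = [100, 102, 98, 101, 99, 100, 103, 97, 100, 250]

anomalies = find_anomalies(daily_connections)
print(anomalies)
```

Note that the outlier itself inflates the mean and standard deviation, so robust alternatives (for example, median-based measures) are often preferred on real traffic.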

First Assignment


Assignment 1: Identify a problem from your own experience that you think would be amenable to data mining. Describe:
  (i) What the data is.
  (ii) What type of benefit you might hope to get from data mining.
  (iii) What type of data mining (classification, clustering, etc.) you think would be relevant.

For each, illustrate with an example, e.g., if you think clustering is relevant, describe what you think a likely cluster might contain and what the real-world meaning would be.

Submit two pages of 11 point single-spaced typeset text (leave 0.5 inch margins). Write your roll number and name.

Last Date: 14-08-08 (5PM)

References: Introductory chapters of any data mining book or any data mining paper, and the PPTs of the first two classes.

Outline


Background
  Content of human mind, Sample data mining problems, Why data mining?
Data warehousing
Course outline
Overview of Data mining tasks
Top 10 data mining algorithms
Issues in data mining
Data mining applications
Summary

Top-10 Most Popular DM Algorithms: 18 Identified Candidates (I)

Classification
  #1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
  #2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, 1984.
  #3. K Nearest Neighbours (kNN): Hastie, T. and Tibshirani, R. 1996. Discriminant Adaptive Nearest Neighbor Classification. TPAMI. 18(6)
  #4. Naive Bayes: Hand, D.J., Yu, K., 2001. Idiot's Bayes: Not So Stupid After All? Internat. Statist. Rev. 69, 385-398.

Statistical Learning
  #5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Springer-Verlag.
  #6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J. Wiley, New York.

Association Analysis
  #7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94.
  #8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In SIGMOD '00.


The 18 Identified Candidates (II)

Link Mining
  #9. PageRank: Brin, S. and Page, L. 1998. The anatomy of a large-scale hypertextual Web search engine. In WWW-7, 1998.
  #10. HITS: Kleinberg, J. M. 1998. Authoritative sources in a hyperlinked environment. SODA, 1998.

Clustering
  #11. K-Means: MacQueen, J. B., Some methods for classification and analysis of multivariate observations, in Proc. 5th Berkeley Symp. Mathematical Statistics and Probability, 1967.
  #12. BIRCH: Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: an efficient data clustering method for very large databases. In SIGMOD '96.

Bagging and Boosting
  #13. AdaBoost: Freund, Y. and Schapire, R. E. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 1 (Aug. 1997), 119-139.


The 18 Identified Candidates (III)

Sequential Patterns
  #14. GSP: Srikant, R. and Agrawal, R. 1996. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proceedings of the 5th International Conference on Extending Database Technology, 1996.
  #15. PrefixSpan: J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In ICDE '01.

Integrated Mining
  #16. CBA: Liu, B., Hsu, W. and Ma, Y. M. Integrating classification and association rule mining. KDD-98.

Rough Sets
  #17. Finding reduct: Zdzislaw Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, Norwell, MA, 1992.

Graph Mining
  #18. gSpan: Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In ICDM '02.

Top-10 Algorithms Finally Selected at ICDM'06


#1: C4.5 (61 votes)
#2: K-Means (60 votes)
#3: SVM (58 votes)
#4: Apriori (52 votes)
#5: EM (48 votes)
#6: PageRank (46 votes)
#7: AdaBoost (45 votes)
#7: kNN (45 votes)
#7: Naive Bayes (45 votes)
#10: CART (34 votes)


Challenges of Data Mining
  Scalability
  Dimensionality
  Complex and Heterogeneous Data
  Data Quality
  Data Ownership and Distribution
  Privacy Preservation
  Streaming Data

Major Issues in Data Mining


Mining methodology
  Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  Performance: efficiency, effectiveness, and scalability
  Pattern evaluation: the interestingness problem
  Incorporation of background knowledge
  Handling noise and incomplete data
  Parallel, distributed and incremental mining methods
  Integration of the discovered knowledge with existing knowledge: knowledge fusion

User interaction
  Data mining query languages and ad-hoc mining
  Expression and visualization of data mining results
  Interactive mining of knowledge at multiple levels of abstraction

Applications and social impacts
  Domain-specific data mining & invisible data mining
  Protection of data security, integrity, and privacy


DM applications: Market Analysis and Management

Where are the data sources for analysis?
  Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies

Target marketing
  Find clusters of model customers who share the same characteristics: interest, income level, spending habits, etc.
  Determine customer purchasing patterns over time
    Conversion of a single to a joint bank account: marriage, etc.

Cross-market analysis
  Associations/co-relations between product sales
  Prediction based on the association information

DM applications: Market Analysis and Management.


Customer profiling
  Data mining can tell you what types of customers buy what products (clustering or classification)

Identifying customer requirements
  Identifying the best products for different customers
  Use prediction to find what factors will attract new customers

Provides summary information
  Various multidimensional summary reports
  Statistical summary information (data central tendency and variation)

DM applications: Corporate Analysis and Risk Management

Finance planning and asset evaluation
  Cash flow analysis and prediction
  Contingent claim analysis to evaluate assets
  Cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)

Resource planning
  Summarize and compare the resources and spending

Competition
  Monitor competitors and market directions
  Group customers into classes and a class-based pricing procedure
  Set pricing strategy in a highly competitive market

DM applications: Fraud Detection and Management



Applications
  Widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.

Approach
  Use historical data to build models of fraudulent behavior and use data mining to help identify similar instances

Examples
  Auto insurance: detect a group of people who stage accidents to collect on insurance
  Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
  Medical insurance: detect professional patients and rings of doctors and rings of references



Detecting inappropriate medical treatment
  Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested (save Australian $1m/yr).

Detecting telephone fraud
  Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
  British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.

Retail
  Analysts estimate that 38% of retail shrink is due to dishonest employees.


Other Applications of data mining

Sports
  IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat

Astronomy
  JPL and the Palomar Observatory discovered 22 quasars with the help of data mining

Internet Web Surf-Aid
  IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.

Summary


Data mining: Discovering interesting patterns from large amounts of data

A natural evolution of database technology, in great demand, with wide applications

A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation

Mining can be performed in a variety of information repositories

Data mining systems and architectures

Data warehousing

Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.

Major issues in data mining
