Data Mining Survey of applications and methodologies - Akshat Singhal, Oberlin College, 2007


Presentation Summary

• What is Data mining?
• Evolution of Data mining
• Applications
• Process
• Models: Predictive vs Descriptive
• Decision Tree (Classification Rules) Example
• Association Rules Example
• Text Mining Example
• Software used

What Is Data Mining?

• Also called Knowledge Discovery in Databases (KDD)

• “The extraction of hidden predictive information from large databases”, or the process of automatically searching large volumes of data for patterns

• Answering questions such as:

“What products are candy buyers most likely to buy this month?”

“What kind of credit card transaction is a likely fraud?”

“What colour of automobile is the most associated with accidents?”

Evolution of Data Mining

Evolutionary Step | Business Question | Enabling Technology
Data Collection (1960s) | “How many widgets were sold this year?” | Computers, tapes, disks
Data Access (1980s) | “How many widgets were sold and for what cost this year?” | Relational databases (RDBMS)
Data Warehousing and Decision Support | “How many widgets were sold without discount in the recently acquired Puerto Rico store of Giant Corp, Inc.?” | On-line analytical processing (OLAP), multidimensional databases, data warehouses
Data Mining | “How many widgets will be sold in Cleveland next year?” | Machine learning; technologies for handling mass storage and computation, such as RAID and SMP

Progression: Files, RDBMS, OLAP, Data Mining

What Data Mining Is NOT

• Data entry, storage, access, or connectivity among diverse data sources (Data Warehousing)

• Presenting data in a better format (Data Presentation / Interfacing)

• Brute-force application of algorithms to generate data about data (Statistics)

• Finding relations that don’t manifest themselves in the given data (Business Strategy)

Types of Data Mining:

1. Forecasting what may happen in the future

2. Classifying and Clustering data items into groups by recognizing patterns

3. Associating events (attribute values) that are likely to occur together

4. Sequencing events that are likely to lead to later events
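
The first type, forecasting, can be illustrated with a simple trend line. The following is a minimal sketch, assuming hypothetical yearly widget sales; the figures and the helper name `fit_line` are invented for illustration and do not come from the presentation. It fits a least-squares line and extrapolates one year ahead.

```python
# Hypothetical yearly widget sales; the figures are made up for illustration.
sales_by_year = {2003: 100, 2004: 120, 2005: 140, 2006: 160}


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(sorted(sales_by_year.items()))

# Extrapolate the fitted trend one year past the last observation.
forecast_2007 = slope * 2007 + intercept
print(round(forecast_2007))
```

Real forecasting models would usually account for seasonality and uncertainty; a straight line is only the simplest possible stand-in.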

Example Applications

• Fraud/Non-Compliance Anomaly detection (government)

• Credit/Risk Scoring

• Intrusion detection

• Parts failure prediction

• Market Basket Analysis

• “Fun” statistics

• Product Recommendations

• Customer Profiling

• Maximizing profitability (cross selling, identifying profitable customers)

• Web Mining

• Weather Prediction

• Using patterns in medical test results for diagnosis

Success Stories

• HSBC - used data mining to target mailings to customers better (i.e. not sending car loan brochures to millionaires).

• DEA - analyzed suspect calls to catch drug peddlers (i.e. don’t say LSD on the phone).

• IRS - better scheduling, catching tax fraud.

• DaimlerBenz - used data mining to analyze testing data for F-Cell fuelled vehicles.

• Walmart - analyzing 7.5 TB of customer and supplier data.

Privacy Concerns

• Data mining extracts new insights from old data.

• This data may have been collected with a stated purpose of record-keeping only.

• Results of data mining can classify people as high risk/potentially criminal and hence hurt them.

• Many believe data mining is the same as The Man simply stealing information (the mining metaphor is ambiguous).

Issues of Scale

• Common data sets are non-trivial in size, usually on the order of terabytes.

• Data is almost never consistent in quality.

• A top-down approach is needed for solving data mining problems.

• The answer: a standard process for data mining, CRISP-DM (CRoss Industry Standard Process for Data Mining)

CRISP-DM

• Proposed by SPSS, Daimler-Benz, and OHRA in 1996

• Follows uniform and well-documented guidelines.

• Flexible on the type of:

– Business/agency problems
– Data
– Application software (i.e. software tools used for analysis)

• Very similar to the standard Software Development Process (top-down model)

Phases of CRISP-DM

Each phase is listed with its tasks; the outputs of each task follow the colon.

Business Understanding
• Determine Business Objectives: Background; Business Objectives; Business Success Criteria
• Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
• Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
• Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques

Data Understanding
• Collect Initial Data: Initial Data Collection Report
• Describe Data: Data Description Report
• Explore Data: Data Exploration Report
• Verify Data Quality: Data Quality Report

Data Preparation
• Data Set: Data Set Description
• Select Data: Rationale for Inclusion / Exclusion
• Clean Data: Data Cleaning Report
• Construct Data: Derived Attributes; Generated Records
• Integrate Data: Merged Data
• Format Data: Reformatted Data

Modeling
• Select Modeling Technique: Modeling Technique; Modeling Assumptions
• Generate Test Design: Test Design
• Build Model: Parameter Settings; Models; Model Description
• Assess Model: Model Assessment; Revised Parameter Settings

Evaluation
• Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
• Review Process: Review of Process
• Determine Next Steps: List of Possible Actions; Decision

Deployment
• Plan Deployment: Deployment Plan
• Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
• Produce Final Report: Final Report; Final Presentation
• Review Project: Experience Documentation

CRISP-DM: Stage 1

• Define business objective.

• Define data mining objective.

• Define set of data to be used, and identify outliers in the data.

• Gauge reliability of analysis.

• Reasons:

– Business objectives are often unclear (e.g. cutting mailing costs vs. finding new areas to campaign in).

– Data quality varies widely, even in large well-structured organizations.

Stage 2-3: Data Preparation

• Evaluating quality of data

• Statistical outliers, incomplete data, and sparse data must be accounted for.

• Data may need to be transformed (for instance, by a logarithm function) to yield useful statistics.

• Bad quality data:

– Sparse data: e.g. in Market Basket analysis, one customer never buys the whole store, so the resulting matrix is very sparse.

– Incomplete data: e.g.
  • People do not answer every question in surveys.
  • Data from a 10-year-old IBM mainframe needs conversion and standardization.
  • Non-entries can manifest themselves as 0 or some default value.
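
The preparation steps above can be sketched in a few lines. This is a minimal illustration, assuming hypothetical survey records in which `None` marks an unanswered question and an income of 0 is a default value and not a real answer; the field names and the helpers `clean` and `log_transform` are invented for this example.

```python
import math

# Hypothetical survey records; None marks an unanswered question, and an
# income of 0 is assumed here to be a default value, not a real answer.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 51, "income": 0},
    {"age": 29, "income": 1250000},
]


def clean(rows, default_values=(0,)):
    """Keep only rows where every field is present and not a known default."""
    return [
        row for row in rows
        if all(v is not None and v not in default_values for v in row.values())
    ]


def log_transform(rows, field):
    """Return copies of the rows with `field` replaced by its natural log."""
    return [{**row, field: math.log(row[field])} for row in rows]


usable = clean(records)
transformed = log_transform(usable, "income")
print(len(records), "records,", len(usable), "usable")
```

Dropping incomplete rows is only one option; filling in (imputing) missing values is another, and which is appropriate depends on the data and the business objective.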

Stage 4: Modelling

• Predictive models:

– Output is a function or distribution that predicts values for individual objects.

– e.g. whether to play or not, given that it's sunny outside and humidity is high.

– Use Classification Rules

– Classification looks for associations to one target attribute (say, Class = Ham or Spam)

• Descriptive models:

– Output is a set of interesting (local, marginal) properties of the distribution.

– e.g. if it's sunny and we decide to play, the temperature must be cool.

– Use Association Rules

– Associations are more numerous because they can be between any number of attributes.

Algorithms

Predictive:
• Regression algorithms: neural networks, Rule Induction
• Classification algorithms: CHAID, C5.0, Naïve Bayesian Classifier

Descriptive:
• Clustering/Grouping algorithms: K-means, Kohonen maps
• Association algorithms: GRI
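
To make one of the descriptive algorithms concrete, here is a minimal sketch of K-means on one-dimensional data. The spend values, the starting centroids, and the function name `kmeans_1d` are assumptions made for illustration; production tools handle many dimensions and choose starting points more carefully.

```python
def kmeans_1d(values, centroids, iterations=10):
    """K-means (Lloyd's algorithm) on 1-D data with given starting centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # Converged: no centroid moved.
        centroids = new_centroids
    return centroids, clusters


# Hypothetical customer spend values, invented for illustration.
spend = [10, 12, 11, 50, 52, 49]
centroids, clusters = kmeans_1d(spend, centroids=[0.0, 60.0])
print(centroids, clusters)
```

The two resulting groups separate the low spenders from the high spenders without any target attribute being supplied, which is what distinguishes clustering from classification.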

Decision Tree Induction Example (C4.5)

• The C4.5 algorithm infers from this data Classification Rules like:

• If Outlook = sunny and Humidity <= 75, Play = yes

• If Outlook = rainy and Windy = true, Play = no

• Rules can be represented as a decision tree. In this example, the rules can help predict if a game will be played, based on weather data.
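
Decision tree induction chooses, at each node, the attribute that best separates the classes. The sketch below computes information gain, the quantity that C4.5's gain ratio builds on, for each attribute. It assumes the classic 14-row nominal "weather" data set distributed with Weka, since the slide's own table is not reproduced in this text; it is not a full C4.5 implementation.

```python
import math
from collections import Counter

# Assumption: the classic nominal "weather" data set shipped with Weka.
COLUMNS = ("outlook", "temperature", "humidity", "windy", "play")
ROWS = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute_index, target_index=-1):
    """Entropy reduction obtained by splitting rows on one attribute."""
    base = entropy([r[target_index] for r in rows])
    remainder = 0.0
    for value in {r[attribute_index] for r in rows}:
        subset = [r[target_index] for r in rows if r[attribute_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(COLUMNS[:-1])}
best = max(gains, key=gains.get)
print(best, round(gains[best], 3))
```

On this data the attribute with the highest gain is outlook, which is why a tree built this way tests Outlook first and only then looks at Humidity or Windy.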

Association Rules Example

• Given data about contact lens use and eye characteristics for a number of people,

• Find associations such as these in the data:

– If tear production rate = reduced (low), then contact-lenses = none (i.e. finding the association that people with dry eyes are not prescribed contact lenses)

– If contact-lenses = hard, then astigmatism = true (i.e. finding the association that people with astigmatism are prescribed hard lenses)
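
Association rules are usually judged by their support (how often the whole rule holds in the data) and confidence (how often the conclusion holds when the condition does). The following is a minimal sketch of both measures, assuming a handful of hypothetical patient records; the records and the helper names `matches` and `rule_stats` are invented and are not the data set used in the presentation.

```python
# Hypothetical patient records, invented for illustration.
patients = [
    {"tear_rate": "reduced", "astigmatism": "no", "lenses": "none"},
    {"tear_rate": "reduced", "astigmatism": "yes", "lenses": "none"},
    {"tear_rate": "normal", "astigmatism": "yes", "lenses": "hard"},
    {"tear_rate": "normal", "astigmatism": "no", "lenses": "soft"},
    {"tear_rate": "normal", "astigmatism": "yes", "lenses": "hard"},
    {"tear_rate": "normal", "astigmatism": "yes", "lenses": "none"},
]


def matches(record, conditions):
    """True if the record has every attribute value listed in conditions."""
    return all(record.get(k) == v for k, v in conditions.items())


def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    n_antecedent = sum(matches(r, antecedent) for r in records)
    n_both = sum(matches(r, antecedent) and matches(r, consequent)
                 for r in records)
    support = n_both / len(records)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


support, confidence = rule_stats(
    patients, {"tear_rate": "reduced"}, {"lenses": "none"}
)
print(support, confidence)
```

An association miner evaluates many candidate rules this way and keeps those above chosen support and confidence thresholds.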

Text Mining Example

• Oberlinconfessional.com is a website, restricted to Oberlin, for anonymous confessions.

• “Automatically Categorizing Written Texts by Author Gender” by Moshe Koppel describes an algorithm for predicting the gender of a text’s writer based on word occurrences.

Results:

• Posts are more male than female at 6:00 AM, 7:00 AM, and 5:00 PM (possible reason: women don’t stay up that late).

• Posts are more female than male throughout the rest of the day (possible reason: there are more women than men in the community).

[Figure: Conditional Distribution Host vs. Gender Grade Sums (2). X-axis: Hour (1 to 24). Y-axis: Percentage of Total Gender Score for Gender (0.00% to 14.00%). Series: Male, Female.]
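
The kind of analysis described in this example can be sketched as scoring each post by the words it contains and then tallying the predicted labels by hour. This is only an illustration: the word weights below are invented and are not the features or weights from the cited paper, and the posts and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-word weights: positive leans "male", negative "female".
# These values are invented; they are not taken from the cited paper.
WORD_WEIGHTS = {"the": 0.2, "a": 0.1, "she": -0.3, "with": -0.1}


def gender_score(text):
    """Sum the weights of the known words in a text."""
    return sum(WORD_WEIGHTS.get(word, 0.0) for word in text.lower().split())


def label_counts_by_hour(posts):
    """Count posts labelled male or female for each hour of the day."""
    counts = defaultdict(lambda: {"male": 0, "female": 0})
    for hour, text in posts:
        label = "male" if gender_score(text) > 0 else "female"
        counts[hour][label] += 1
    return dict(counts)


# Hypothetical (hour, text) pairs standing in for scraped posts.
posts = [
    (6, "the game was a mess"),
    (14, "she went with me"),
    (14, "she said a word with me"),
]
by_hour = label_counts_by_hour(posts)
print(by_hour)
```

Plotting each label's per-hour share would give a chart of the same general shape as the one summarized above.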

Software

• Weka toolkit - Java-based open source data mining workbench (with reusable code). http://www.cs.waikato.ac.nz/ml/weka/

• Pentaho - Open Source Business Intelligence suite. http://www.pentaho.com/

• IBM DB2 Data Warehouse Edition - complete data warehouse suite with mining and visualizing capabilities (easily googleable).

• SPSS - Back-end software as well as a range of industry-specific data mining solutions. http://www.spss.com/

• SAS - Commercial text mining tools and Business Intelligence server. http://www.sas.com/


Presentation Summary

• What is Data mining?
• Evolution of Data mining
• Applications
• Process
• Models: Predictive vs Descriptive
• Decision Tree (Classification Rules) Example
• Association Rules Example
• Text Mining Example
• Software used

Slide was repeated because YOU are a hetero-associative learner.

Questions
