An Evaluation of A Commercial Data Mining Suite
Oracle Data Mining
Presented by Emily Davis
Supervisor: John Ebden

Oracle Data Mining: An Investigation
Emily Davis
Investigating the data mining tools and software available with Oracle9i.
Use Oracle Data Mining and JDeveloper (Java API) to run the algorithms in the data mining suite on sample data.
An evaluation of results using confusion matrices, lift charts and error rates. A comparison of the effectiveness of different algorithms.
Supervisor: John Ebden
Contact: [email protected]
Webpage: http://www.cs.ru.ac.za/research/students/g01D1801/
Model A

                 Model Accept   Model Reject
Actual Accept        600             25
Actual Reject         75            300
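The accuracy and error rate used throughout the evaluation can be read straight off such a matrix. A minimal sketch in plain Python, using the Model A counts above:

```python
# Accuracy and error rate from the Model A confusion matrix above.
# Rows are the actual classes, columns are the model's predictions.
accept_accept, accept_reject = 600, 25   # actual Accept
reject_accept, reject_reject = 75, 300   # actual Reject

total = accept_accept + accept_reject + reject_accept + reject_reject
accuracy = (accept_accept + reject_reject) / total    # correct predictions
error_rate = (accept_reject + reject_accept) / total  # misclassifications
print(accuracy, error_rate)  # 0.9 0.1
```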
Oracle Data Mining, DM4J and
JDeveloper
Adaptive Bayes Network, Naïve Bayes
Problem Statement
To determine how Oracle provides data mining functionality:
- Ease of use
- Data preparation
- Model building
- Model testing
- Applying models to new data
Problem Statement
To determine whether the algorithms used would find a pattern in a data set
To determine what happened when the models were applied to a new data set
To determine which algorithm built the most effective model, and under what circumstances
Problem Statement
To determine how models are tested and if this indicates how they will perform when applied to new data
To determine how the data affected the model building and how the test data affected the model testing
Methodology
Two classification algorithms selected:
- Naïve Bayes
- Adaptive Bayes Network
Both produce predictions, which could then be compared
Methodology
Data from http://www.ru.ac.za/weather/ (weather data). Data recorded includes:
- Temperature (degrees F)
- Humidity (percent)
- Barometer (inches of mercury)
- Wind Direction (degrees; 360 = North, 90 = East)
- Wind Speed (MPH)
- High Wind Speed (MPH)
- Solar Radiation (Watts/m^2)
- Rainfall (inches)
- Wind Chill (computed from high wind speed and temperature)
Data
Rainfall reading removed and replaced with a yes or no depending on whether rainfall was recorded
This variable, RAIN, was chosen as the target variable
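The binarisation of the rainfall reading can be sketched as follows; `to_rain_target` is a hypothetical helper, as the slides do not show how the transformation was actually applied:

```python
def to_rain_target(rainfall_inches):
    """Replace the numeric rainfall reading with the binary target RAIN."""
    return "yes" if rainfall_inches > 0 else "no"

print(to_rain_target(0.12))  # yes
print(to_rain_target(0.0))   # no
```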
Two data sets put into tables in the database:
- WEATHER_BUILD: 2601 records, used to create build and test data with the Transformation Split wizard
- WEATHER_APPLY: 290 records, used to validate the models
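The wizard's build/test split can be approximated in plain Python. This is a sketch only: the per-class grouping (a stratified split, as discussed later under Priors) and the `test_frac` parameter are assumptions, since the slides do not state the wizard's exact settings:

```python
import random

def stratified_split(records, target, test_frac=0.4, seed=1):
    """Split records into build and test sets while preserving the
    proportions of each target class in both halves."""
    by_class = {}
    for rec in records:
        by_class.setdefault(rec[target], []).append(rec)
    build, test = [], []
    rng = random.Random(seed)
    for recs in by_class.values():
        rng.shuffle(recs)
        cut = int(len(recs) * test_frac)  # this class's share of the test set
        test.extend(recs[:cut])
        build.extend(recs[cut:])
    return build, test
```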
Building and Testing the Models
- The Priors technique
- Training and tuning the models
- The models built
- Testing
- Results
Data Preparation Techniques - Priors
[Figure: Histogram of the target RAIN (bin counts for yes vs no, scale 0-1400) in the original build data, and the same histogram after stratified sampling.]
Training and Tuning the Models
             Predicted No   Predicted Yes
Actual No        384              34
Actual Yes       141              74
Training and Tuning the Models
Viable to introduce a weighting of 3 against false negatives
Makes a false negative prediction 3 times as costly as a false positive
Algorithm attempts to minimise costs
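The effect of the weighting is to lower the probability threshold for predicting rain. A sketch of the cost-minimising decision rule, with the weighting of 3 from the slides (the function name and the probability inputs are illustrative):

```python
def predict_with_costs(p_yes, fn_cost=3.0, fp_cost=1.0):
    """Pick the class with the lower expected misclassification cost.
    Predicting 'no' risks a false negative (expected cost fn_cost * p_yes);
    predicting 'yes' risks a false positive (fp_cost * (1 - p_yes))."""
    cost_of_no = fn_cost * p_yes
    cost_of_yes = fp_cost * (1.0 - p_yes)
    return "yes" if cost_of_yes < cost_of_no else "no"

print(predict_with_costs(0.30))  # yes
print(predict_with_costs(0.20))  # no
```

With equal costs the model predicts "yes" only when p_yes > 0.5; a false-negative weight of 3 drops that threshold to 0.25, so more borderline cases are classified as rain.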
The Models
8 models in total, 4 using each algorithm:
- One using default settings
- One using the Priors technique
- One using weighting
- One using Priors and weighting
Testing the Models
Tested on test data set created from WEATHER_BUILD data set
Confusion matrices indicating accuracy of models
Testing Results
[Figure: Testing accuracy (0-90%) of the Naïve Bayes and Adaptive Bayes Network models under each of the four settings: no weighting/no Priors, no weighting/Priors, weighting/no Priors, weighting/Priors.]
Applying the Models to New Data
Models were applied to the new data in WEATHER_APPLY
Prediction   Probability   THE_TIME
no           0.9999        1
yes          0.6711        138

Prediction   Cost of incorrect prediction   THE_TIME
no           0                              1
yes          0.3288                         138

Extracts showing two predictions from the actual results
Attribute Influence on Predictions
Adaptive Bayes Network provides rules along with its predictions
Rules are in if...then format
Rules showed the attributes with most influence were:
- Wind Chill
- Wind Direction
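Such if...then rules can be applied directly, which is what makes the Adaptive Bayes Network models interpretable. A sketch in Python; the attribute bins and probabilities below are hypothetical, not taken from the actual models:

```python
# Hypothetical rules in the if...then form that ABN reports.
# Each rule: (condition, predicted class, confidence).
rules = [
    (lambda r: r["wind_chill"] < 40 and r["wind_dir"] > 180, "yes", 0.72),
    (lambda r: r["wind_chill"] >= 40, "no", 0.81),
]

def apply_rules(record, default=("no", 0.5)):
    """Return the prediction of the first rule that fires."""
    for condition, prediction, confidence in rules:
        if condition(record):
            return prediction, confidence
    return default

print(apply_rules({"wind_chill": 35, "wind_dir": 270}))  # ('yes', 0.72)
```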
Results of Applying Models to New Data
[Figure: Accuracy (0-80%) of the Naïve Bayes and Adaptive Bayes Network models on the new WEATHER_APPLY data under each of the four settings.]
Comparing Accuracy
[Figures: the Model Results and Testing Results charts above, repeated side by side for comparison.]
Observations
Algorithms found a pattern in the weather data
Most effective model: Adaptive Bayes Network algorithm using weighting
Accuracy of Naïve Bayes models improves dramatically if weighting and Priors are used
Significant difference between accuracy during testing of models and accuracy when applied to new data
Conclusions
Oracle Data Mining provides easy-to-use wizards that support all aspects of the data mining process
Algorithms found a pattern in the weather data
Best case: the Adaptive Bayes Network model predicted 73.1% of RAIN outcomes correctly
Conclusions
Adaptive Bayes Network algorithm produced the most effective model: accuracy of 73.1% when applied to new data; tuned using a weighting of 3 against false negatives
Most effective model using Naïve Bayes: accuracy of 63.79%; uses a weighting of 3 against false negatives and the Priors technique
Conclusions
Accuracy during testing does not always indicate how a model will perform on new data
Test accuracy is inflated when the target attribute's distribution is similar in the build and test data sets
Shows the need for testing of a model on a variety of data sets
Questions