CSc288 Term Project: Data Mining to Predict the Voice-over-IP Phone Market
Huaqin Xu


Agenda

Abstract
Introduction
Methodology
Result
Conclusion
Learning Experience
References

Abstract

This project is based on VoIP survey data sets. Classifiers from the Weka Explorer are used as the data mining tool to build models that predict potential customers of VoIP phones and identify the most important features and services of two VoIP phone models.

Introduction

Background
VoIP phones have a potential market opportunity given the wide use of internet service.
Two VoIP phone models: Basic and Deluxe

Data mining scope
Customer
Product features and services

Methodology

Data Mining Tools
C4.5/C5.0, Cubist
Weka
Microsoft SQL Server
SPSS

Chosen: Weka Explorer

Why? Free, easy to use, good interface, more choices……

Methodology

Explorer vs. KnowledgeFlow

Methodology

Datasets: 94 instances in total

Methodology: Preprocessing

Split table
Customer: 17 attributes
Basic-model: 14 attributes
Deluxe-model: 10 attributes

Processing missing data
Delete
Replace with “?”

Convert data format
SPSS to Excel to Weka
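
The missing-data step above can be sketched in a few lines. This is only an illustration of writing nominal survey rows into Weka's ARFF text format with "?" for missing cells; it is not the conversion actually used in the project. The attribute names, the class label Would_Not_Buy, and the helper name to_arff are assumptions made for the example.

```python
def to_arff(relation, attributes, rows):
    """Render nominal data as ARFF text, writing missing cells as '?'.

    attributes: list of (name, [nominal values]) pairs (hypothetical names).
    rows: list of lists; None or '' marks a missing survey answer.
    """
    lines = [f"@relation {relation}", ""]
    for name, values in attributes:
        lines.append(f"@attribute {name} {{{','.join(values)}}}")
    lines += ["", "@data"]
    for row in rows:
        # ARFF uses a single question mark for a missing value
        cells = ["?" if cell in (None, "") else str(cell) for cell in row]
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"


# Hypothetical three-attribute excerpt of the customer table
attributes = [
    ("gender", ["Male", "Female"]),
    ("know_voip", ["yes", "no"]),
    ("class", ["Would_Buy", "Would_Not_Buy"]),
]
rows = [
    ["Male", "yes", "Would_Buy"],
    ["Female", "", "Would_Not_Buy"],  # unanswered question becomes '?'
]
arff_text = to_arff("customer", attributes, rows)
print(arff_text)
```

The other option on the slide, deleting incomplete rows, would amount to filtering out any row that contains an empty cell before calling the helper.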

Methodology

Algorithm selection
Classification
Clustering
Association

Chosen: NNge

Why?
High accuracy rate
Simple, clear rules

Algorithm         Correct Instances (%)
NaiveBayes        63.82
DecisionStump     65.95
Id3               84.04
J48               75.53
NBTree            79.78
ConjunctiveRule   69.14
DecisionTable     80.85
NNge              87.23
OneR              71.27
PART              72.34
Prism             88.29
Ridor             71.27
JRip              74.46
ZeroR             63.83
AdaBoostM1        65.95
BayesNet          60.63
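
One way to read the comparison is to rank the classifiers and measure each against the ZeroR baseline, which always predicts the majority class. The sketch below does that with the percentages from the table. It assumes every percentage refers to the full set of 94 instances, which the slides do not state explicitly, so the instance counts are approximate.

```python
# Percent of correctly classified instances, copied from the comparison table
accuracy = {
    "NaiveBayes": 63.82, "DecisionStump": 65.95, "Id3": 84.04, "J48": 75.53,
    "NBTree": 79.78, "ConjunctiveRule": 69.14, "DecisionTable": 80.85,
    "NNge": 87.23, "OneR": 71.27, "PART": 72.34, "Prism": 88.29,
    "Ridor": 71.27, "JRip": 74.46, "ZeroR": 63.83, "AdaBoostM1": 65.95,
    "BayesNet": 60.63,
}
TOTAL_INSTANCES = 94          # assumption: all runs used the 94-instance set
baseline = accuracy["ZeroR"]  # majority-class baseline

# Highest accuracy first
ranked = sorted(accuracy.items(), key=lambda item: item[1], reverse=True)
for name, pct in ranked:
    correct = round(pct * TOTAL_INSTANCES / 100)  # approximate instance count
    gain = pct - baseline
    print(f"{name:<16}{pct:6.2f}%  ~{correct:2d}/{TOTAL_INSTANCES}  {gain:+6.2f} vs ZeroR")
```

In this table Prism ranks first and NNge second, about 23 percentage points above the baseline.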

NNge classifier
Nearest-neighbor-like algorithm using non-nested generalized exemplars.
A rule-based classifier that builds a sort of “hypergeometric” model.
Shows promise as an ML method that performs well on a wide range of datasets.


Result


Result Rules:

One of the customer rules:

class Would_Buy IF:
cost in {10-20}
^ phone in {yes}
^ email in {yes}
^ fax in {no}
^ chat in {yes,no}
^ other in {no}
^ service type in {Phone_cards_only}
^ price in {Somewhat_Dissatisfied, Somewhat_Satisfied}
^ voice_quality in {Somewhat_Dissatisfied, Somewhat_Satisfied}
^ service in {Somewhat_Dissatisfied}
^ convenience in {Somewhat_Satisfied}
^ promotion in {Somewhat_Dissatisfied}
^ Know VoIP in {yes,no}
^ marital status in {Single}
^ gender in {Male}
(11)
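
A rule of this form is a conjunction of set-membership tests: an instance is covered only if every attribute value falls inside the allowed set for that attribute. The sketch below encodes the rule above as a dictionary of sets and checks one respondent against it. The respondent is invented for illustration, and the matching function is a simplified reading of the rule, not Weka's NNge implementation.

```python
# The Would_Buy rule from the slide, one allowed-value set per attribute
rule = {
    "cost": {"10-20"},
    "phone": {"yes"},
    "email": {"yes"},
    "fax": {"no"},
    "chat": {"yes", "no"},
    "other": {"no"},
    "service type": {"Phone_cards_only"},
    "price": {"Somewhat_Dissatisfied", "Somewhat_Satisfied"},
    "voice_quality": {"Somewhat_Dissatisfied", "Somewhat_Satisfied"},
    "service": {"Somewhat_Dissatisfied"},
    "convenience": {"Somewhat_Satisfied"},
    "promotion": {"Somewhat_Dissatisfied"},
    "Know VoIP": {"yes", "no"},
    "marital status": {"Single"},
    "gender": {"Male"},
}


def matches(rule, instance):
    """True if every attribute of the instance lies in the rule's allowed set."""
    return all(instance.get(attr) in allowed for attr, allowed in rule.items())


# Hypothetical respondent, not taken from the survey data
respondent = {
    "cost": "10-20", "phone": "yes", "email": "yes", "fax": "no",
    "chat": "no", "other": "no", "service type": "Phone_cards_only",
    "price": "Somewhat_Satisfied", "voice_quality": "Somewhat_Dissatisfied",
    "service": "Somewhat_Dissatisfied", "convenience": "Somewhat_Satisfied",
    "promotion": "Somewhat_Dissatisfied", "Know VoIP": "yes",
    "marital status": "Single", "gender": "Male",
}
other_respondent = dict(respondent, gender="Female")

print(matches(rule, respondent))        # covered by the rule
print(matches(rule, other_respondent))  # fails the gender test
```

Attributes such as chat and Know VoIP admit every value, so they place no real constraint on the instance.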

Result Statistics:

Class allocation
Feature weights

Result Basic-model & Deluxe-model

Schema: meta.AttributeSelectedClassifier

Subschema: rules.NNge

Selected attributes: 3,6,8,10,11,12 : 6

Why? To avoid overfitting.
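
Attribute selection keeps only a subset of columns before the classifier sees the data. The sketch below shows the effect of the selection "3,6,8,10,11,12" on a single row. It assumes the indices are 1-based, as in Weka's printed output, and uses placeholder values a1 to a14 in place of the real Basic-model attributes, which the slides do not list.

```python
def parse_selected(spec):
    """Parse a 1-based index list such as '3,6,8' into zero-based positions."""
    return [int(token) - 1 for token in spec.split(",")]


def project(row, positions):
    """Keep only the values at the selected positions, in order."""
    return [row[i] for i in positions]


selected = parse_selected("3,6,8,10,11,12")

# Placeholder 14-column row standing in for one Basic-model instance
row = [f"a{i}" for i in range(1, 15)]
reduced = project(row, selected)
print(reduced)  # ['a3', 'a6', 'a8', 'a10', 'a11', 'a12']
```

With fewer attributes the learned exemplars have fewer conditions to fit, which is the intuition behind using selection to limit overfitting on a small data set.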

Result

Evaluation

Ten-fold cross-validation

Summary: correctly classified instances > 85%

Detailed Accuracy By Class: TP, FP, Precision, Recall, F-measure

Confusion Matrix: 12 of 94 instances misclassified
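
The per-class figures listed above all derive from the confusion matrix. The sketch below computes them for a two-class matrix. The cell counts and the label Would_Not_Buy are invented; they were chosen only so that the totals agree with the slide (94 instances, 12 misclassified), since the actual matrix is not reproduced here.

```python
def evaluate(matrix, labels):
    """Compute accuracy and per-class metrics from a confusion matrix.

    matrix[i][j] counts instances of actual class i predicted as class j.
    """
    n = len(labels)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    report = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                       # missed members of class k
        fp = sum(matrix[i][k] for i in range(n)) - tp  # wrongly assigned to class k
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0    # same as the TP rate
        fp_rate = fp / (fp + tn) if fp + tn else 0.0
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        report[label] = {"tp_rate": recall, "fp_rate": fp_rate,
                         "precision": precision, "recall": recall,
                         "f_measure": f_measure}
    return correct / total, report


labels = ["Would_Buy", "Would_Not_Buy"]  # second label is hypothetical
matrix = [[52, 6],                       # invented counts: 82 correct, 12 wrong
          [6, 30]]
accuracy, report = evaluate(matrix, labels)
misclassified = sum(sum(row) for row in matrix) - sum(matrix[i][i] for i in range(2))

print(f"accuracy = {accuracy:.4f}, misclassified = {misclassified}")
for label, metrics in report.items():
    print(label, {name: round(value, 3) for name, value in metrics.items()})
```

With 12 of 94 instances misclassified, accuracy is 82/94, or about 87.23%, which matches the NNge figure in the algorithm comparison.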


Conclusion

Limitations
Small datasets
Incomplete data source

Models
High accuracy rate
Help further market analysis
Help product design

Learning Experience

Process a real data mining problem

Know classification algorithms better: numeric vs. nominal attributes, missing data, overfitting

Know evaluation methods better: how to compare algorithms, evaluation factors

Learning Experience

Learn how to use Weka

Future work: learn how to modify the source to perform better data mining

Learn from classmates

References

Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2001.

Ian H. Witten and Eibe Frank, "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations", Morgan Kaufmann, 2000.

Machine Learning: Weka Home Page, http://www.cs.waikato.ac.nz/~ml/index.html

David A. Aaker, V. Kumar and George S. Day, "Marketing Research", eighth edition, Wiley, 2004.

Thank you
