Classification and Prediction Final



Classification and Prediction


    Agenda

Introduction
Decision Tree Induction
Statistical based algorithms
Distance based algorithms
Rule based algorithms



Classification vs. Prediction

Classification
  predicts categorical class labels
  classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data

Prediction
  models continuous-valued functions, i.e., predicts unknown or missing values

Typical applications
  Loan disbursement: risky or safe
  Medical diagnosis: predicting the treatment from among Treatment A, Treatment B, Treatment C


Classification: A Two-Step Process

Model construction: describing a set of predetermined classes
  Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  The set of tuples used for model construction is the training set
  The model is represented as classification rules, decision trees, or mathematical formulae

Model usage: classifying future or unknown objects
  Estimate the accuracy of the model
    The known label of each test sample is compared with the classified result from the model
    Accuracy rate is the percentage of test set samples that are correctly classified by the model
    The test set is independent of the training set
  If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known (a toy example of both steps follows below)
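Below is a minimal sketch (not from the slides) of the two-step process: a hand-written rule plays the role of the constructed model, and its accuracy rate is estimated on a made-up, held-out test set.

    # Step 1 (model construction): a toy, hand-written rule-based model.
    def classify(sample):
        age, income = sample
        return "risky" if age == "young" else "safe"

    # Step 2 (model usage): estimate the accuracy rate on an independent test set.
    test_set = [                        # (attributes, known label)
        (("young", "low"), "risky"),
        (("senior", "medium"), "safe"),
        (("middle_aged", "high"), "safe"),
    ]
    correct = sum(classify(x) == label for x, label in test_set)
    print(f"accuracy rate = {correct / len(test_set):.0%}")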


Process (1): Model Construction

Training Data:

  NAME  age          income  loan_decision
  Mike  young        low     risky
  Mary  young        low     risky
  Bill  middle_aged  high    safe
  Jim   middle_aged  low     risky
  Dave  senior       low     safe
  Anne  senior       medium  safe

Classification algorithms learn a classifier (model) from the training data, e.g.:

  IF age = youth THEN loan_decision = risky ...


Process (2): Using the Model in Prediction

The classifier is applied to Testing Data:

  NAME   age          income  loan_decision
  Tom    senior       low     safe
  Crest  middle_aged  low     risky
  yee    middle_aged  high    safe

and then to Unseen Data: (Henry, middle_aged, low)

  Loan_decision? -> risky


Supervised vs. Unsupervised Learning

Supervised learning (classification)
  Supervision: the training data are accompanied by labels indicating the class of the observations, e.g. risky, safe
  New data is classified based on the training set

Unsupervised learning (clustering)
  The class labels of the training data are unknown
  Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data


    Issues: Data Preparation

Data cleaning
  Preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection)
  Remove irrelevant or redundant attributes
Data transformation
  Generalize and/or normalize data


    Issues: Evaluating Classification Methods

Accuracy
  classifier accuracy: predicting the class label
  predictor accuracy: guessing the value of predicted attributes
Speed
  time to construct the model (training time)
  time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency on disk-resident databases
Interpretability: understanding and insight provided by the model


    Classification by Decision Tree Induction

A decision tree is a flowchart-like tree structure
  Each non-leaf node denotes a test on an attribute
  Each branch represents an outcome of the test
  Each leaf node holds a class label
  The topmost node is the root



    Decision Tree Induction: Training Dataset

  age     income  student  credit_rating  buys_computer
  <=30    high    no       fair           no
  <=30    high    no       excellent      no
  31..40  high    no       fair           yes
  >40     medium  no       fair           yes
  >40     low     yes      fair           yes
  >40     low     yes      excellent      no
  31..40  low     yes      excellent      yes
  <=30    medium  no       fair           no
  <=30    low     yes      fair           yes
  >40     medium  yes      fair           yes
  <=30    medium  yes      excellent      yes
  31..40  medium  no       excellent      yes
  31..40  high    yes      fair           yes
  >40     medium  no       excellent      no


    Output: A Decision Tree for buys_computer

  age?
  ├─ youth (<=30)         -> student?
  │    ├─ no              -> no
  │    └─ yes             -> yes
  ├─ middle_aged (31..40) -> yes
  └─ senior (>40)         -> credit_rating?
       ├─ excellent       -> no
       └─ fair            -> yes


    Algorithm for Decision Tree Induction

Basic algorithm
  The tree is constructed in a top-down, recursive, divide-and-conquer manner
  At the start, all the training examples are at the root
  Examples are partitioned based on selected attributes
  Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain), as in the sketch below
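A minimal Python sketch of this basic algorithm, assuming categorical attributes and information gain as the selection measure (the dict-based tree format is an illustration, not the textbook's notation):

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(rows, labels, attr):
        # expected reduction in entropy from partitioning on attr
        gain, n = entropy(labels), len(rows)
        for value in {r[attr] for r in rows}:
            part = [lab for r, lab in zip(rows, labels) if r[attr] == value]
            gain -= len(part) / n * entropy(part)
        return gain

    def build_tree(rows, labels, attrs):
        if len(set(labels)) == 1:              # pure partition -> leaf
            return labels[0]
        if not attrs:                          # no attributes left -> majority leaf
            return Counter(labels).most_common(1)[0][0]
        best = max(attrs, key=lambda a: info_gain(rows, labels, a))
        tree = {}
        for value in {r[best] for r in rows}:  # one branch per outcome of the test
            idx = [i for i, r in enumerate(rows) if r[best] == value]
            tree[(best, value)] = build_tree(
                [rows[i] for i in idx], [labels[i] for i in idx],
                [a for a in attrs if a != best])
        return tree

Here `rows` is a list of attribute dicts, `labels` the matching class labels, and `attrs` the attribute names still available for splitting.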


    DT Algorithm



Issues faced by DT algorithms

  Choosing splitting attributes: which attributes are used for splitting impacts performance, e.g. age or credit_rating are useful, student-name is not.
  Ordering of splitting attributes: the order in which attributes are chosen is important. In the example, age is first, then student and credit_rating are chosen.
  Splits: the number of splits to take place.
  Tree structure: a balanced tree with the fewest levels is desirable.
  Training data: the structure of the DT depends on the training data. If the training data is very small, the generated tree might not be specific enough to work properly; if it is too large, the created tree may overfit (and might not work for future states).
  Pruning: once a tree is constructed, some modifications to the tree might be needed to improve its performance. The pruning phase removes redundant comparisons or whole subtrees to achieve better performance.


    Presentation of Classification Results


    Visualization of a Decision Tree in SGI/MineSet 3.0


Interactive Visual Mining by Perception-Based Classification (PBC)


    Statistical based algorithms


    Regression

    Bayesian classification


    Regression


Regression deals with the estimation of an output value based on input values.
In classification, the input values are values from the database D and the output values represent the classes.
If we know the input parameters x1, x2, ..., xn, then the relation between the output parameter y and the input parameters can be written as

  y = c0 + c1*x1 + c2*x2 + ... + cn*xn

where c0, c1, ..., cn are the regression coefficients.
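A small illustration (with made-up data) of estimating the regression coefficients by ordinary least squares using NumPy:

    import numpy as np

    # Made-up observations of y = c0 + c1*x1 + c2*x2 plus noise.
    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
    y = np.array([5.1, 4.9, 11.2, 10.8])

    A = np.column_stack([np.ones(len(X)), X])   # prepend 1s so c0 is estimated too
    (c0, c1, c2), *_ = np.linalg.lstsq(A, y, rcond=None)
    print(f"y = {c0:.2f} + {c1:.2f}*x1 + {c2:.2f}*x2")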


    Linear Regression Poor Fit


    Classification Using Regression

Division: use the regression function to divide the area into regions
  The data is plotted in an n-dimensional space without any explicit class
  Through regression, the space is divided into regions, one per class

Prediction: formulas are generated to predict the output class value (see the sketch below)
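A toy sketch of the prediction variant (all names and data made up): the classes are coded 0 and 1, a least-squares line is fitted to the codes, and a new item is assigned the class whose code is closer to the predicted value.

    import numpy as np

    x   = np.array([0.8, 1.0, 1.2, 3.8, 4.0, 4.2])   # input attribute
    cls = np.array([0, 0, 0, 1, 1, 1])               # class codes

    c1, c0 = np.polyfit(x, cls, 1)                   # fit y = c1*x + c0
    predict = lambda v: int(c1 * v + c0 > 0.5)       # nearer class code wins
    print(predict(1.1), predict(3.9))                # -> 0 1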


    Division


    Prediction


    Bayesian Classification: Why?

A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities

Foundation: based on Bayes' theorem

Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers

Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data


    Bayesian Theorem

Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows Bayes' theorem:

  P(H|X) = P(X|H) P(H) / P(X)

Informally, this can be written as

  posteriori = likelihood x prior / evidence

Practical difficulty: requires initial knowledge of many probabilities, and significant computational cost.
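For instance, with made-up numbers P(H) = 0.3, P(X|H) = 0.8 and P(X) = 0.5:

    # posteriori = likelihood x prior / evidence
    p_h, p_x_given_h, p_x = 0.3, 0.8, 0.5
    print(p_x_given_h * p_h / p_x)    # P(H|X) = 0.48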


Naïve Bayesian Classifier: Training Dataset

Classes:
  C1: buys_computer = yes
  C2: buys_computer = no

Data sample to classify:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)


Naïve Bayesian Classifier: An Example

P(Ci):
  P(buys_computer = yes) = 9/14 = 0.643
  P(buys_computer = no)  = 5/14 = 0.357

Compute P(X|Ci) for each class:
  P(age <= 30 | buys_computer = yes) = 2/9 = 0.222
  P(age <= 30 | buys_computer = no)  = 3/5 = 0.600
  P(income = medium | buys_computer = yes) = 4/9 = 0.444
  P(income = medium | buys_computer = no)  = 2/5 = 0.400
  P(student = yes | buys_computer = yes) = 6/9 = 0.667
  P(student = yes | buys_computer = no)  = 1/5 = 0.200
  P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
  P(credit_rating = fair | buys_computer = no)  = 2/5 = 0.400

  P(X | buys_computer = yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
  P(X | buys_computer = no)  = 0.600 x 0.400 x 0.200 x 0.400 = 0.019

  P(X | buys_computer = yes) P(buys_computer = yes) = 0.044 x 0.643 = 0.028
  P(X | buys_computer = no)  P(buys_computer = no)  = 0.019 x 0.357 = 0.007

Therefore, X belongs to class buys_computer = yes.
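The same counting scheme is easy to sketch in Python. The snippet below is an illustration, not the textbook's code; the tuples restate the training dataset from the earlier slide, and zero counts are not smoothed.

    from collections import Counter

    data = [  # (age, income, student, credit_rating) -> buys_computer
        (("<=30", "high", "no", "fair"), "no"),
        (("<=30", "high", "no", "excellent"), "no"),
        (("31..40", "high", "no", "fair"), "yes"),
        ((">40", "medium", "no", "fair"), "yes"),
        ((">40", "low", "yes", "fair"), "yes"),
        ((">40", "low", "yes", "excellent"), "no"),
        (("31..40", "low", "yes", "excellent"), "yes"),
        (("<=30", "medium", "no", "fair"), "no"),
        (("<=30", "low", "yes", "fair"), "yes"),
        ((">40", "medium", "yes", "fair"), "yes"),
        (("<=30", "medium", "yes", "excellent"), "yes"),
        (("31..40", "medium", "no", "excellent"), "yes"),
        (("31..40", "high", "yes", "fair"), "yes"),
        ((">40", "medium", "no", "excellent"), "no"),
    ]

    def naive_bayes(x):
        scores = {}
        for c, n_c in Counter(lab for _, lab in data).items():
            p = n_c / len(data)                   # prior P(Ci)
            rows = [r for r, lab in data if lab == c]
            for j, value in enumerate(x):         # independence assumption:
                p *= sum(r[j] == value for r in rows) / n_c   # product of P(xj|Ci)
            scores[c] = p
        return max(scores, key=scores.get)

    print(naive_bayes(("<=30", "medium", "yes", "fair")))  # -> yes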


Naïve Bayesian Classifier: Comments

Advantages
  Easy to implement
  Good results obtained in most of the cases

Disadvantages
  Assumption of class conditional independence, which causes loss of accuracy
  Practically, dependencies exist among variables


    Rule based Classification



    Using IF-THEN Rules for Classification

Represent the knowledge in the form of IF-THEN rules:
  r = <a, c>, where a is the antecedent and c is the consequent

Example:
  R: IF age = youth AND student = yes THEN buys_computer = yes

Rules are generated by many techniques, such as decision trees and neural networks.
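One possible representation of r = <a, c> in code (names are illustrative): the antecedent as a dict of attribute tests, the consequent as an attribute/value pair.

    # R: IF age = youth AND student = yes THEN buys_computer = yes
    rule = ({"age": "youth", "student": "yes"}, ("buys_computer", "yes"))

    def fires(rule, tup):
        antecedent, _ = rule            # rule fires if every test in a holds
        return all(tup.get(attr) == val for attr, val in antecedent.items())

    print(fires(rule, {"age": "youth", "student": "yes", "income": "low"}))  # True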


    Generating Rules from DTs

The algorithm generates one rule for each leaf node in the DT (see the sketch below).
All rules with the same consequent can be combined by OR-ing the antecedents of the simpler rules.
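A sketch of this idea, reusing the nested-dict tree format from the earlier decision tree sketch (the tree below is hand-written to match the buys_computer example): every root-to-leaf path becomes one IF-THEN rule.

    tree = {("age", "youth"): {("student", "no"): "no", ("student", "yes"): "yes"},
            ("age", "middle_aged"): "yes",
            ("age", "senior"): {("credit_rating", "fair"): "yes",
                                ("credit_rating", "excellent"): "no"}}

    def rules(node, conds=()):
        if not isinstance(node, dict):              # leaf -> emit one rule
            tests = " AND ".join(f"{a} = {v}" for a, v in conds)
            return [f"IF {tests} THEN buys_computer = {node}"]
        out = []
        for (attr, value), child in node.items():   # one path per branch
            out.extend(rules(child, conds + ((attr, value),)))
        return out

    for r in rules(tree):
        print(r)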


    Algorithm for Generating Rules from DTs



    Generating Rules Example


    Distance based algorithms



    Classification Using Distance

Items that are mapped to the same class are more similar to the other items in that class than to the items in other classes.
So distance (or similarity) measures may be used to identify the alikeness of different items in the database.
Place items in the class to which they are closest.
This requires determining the distance between an item and a class.


Classification using a simple distance based algorithm

Distance Based
  Classes are represented by a centroid: a central value / representative vector, e.g. Class A, Class B and Class C are each represented by their own centroid vector.


    Simple distance based Algo

Input:  c1, ..., cm   // centers for each class
        t             // input tuple to classify
Output: c             // class to which t is assigned

dist = ∞;
for i = 1 to m do
    if dis(ci, t) < dist then
        c = i;
        dist = dis(ci, t);

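A runnable Python version of the pseudocode (the centroid coordinates are made up), assuming Euclidean distance:

    import math

    centers = {"A": (1.0, 1.0), "B": (5.0, 1.0), "C": (3.0, 6.0)}

    def classify(t):
        c, dist = None, math.inf              # dist starts at infinity
        for i, center in centers.items():
            if math.dist(center, t) < dist:   # dis(ci, t) < dist
                c, dist = i, math.dist(center, t)
        return c

    print(classify((4.5, 1.5)))  # -> B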


K Nearest Neighbors (KNN)

  A common classification scheme
  A lazy learner: no model is built in advance; work is deferred until a new item must be classified
  The training set includes the data along with their classes
  Examine the K closest items near the item to be classified
  The new item is placed in the class where most of the K closest items are


    KNN

K = 3: the three closest items in the training set are shown; T will be placed in the class where most of these are.


    KNN Algorithm
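The KNN algorithm itself was shown as a figure on this slide; below is a minimal sketch of the same idea (training items and coordinates are made up), assuming Euclidean distance and a majority vote among the K nearest items.

    import math
    from collections import Counter

    train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
             ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]

    def knn(t, k=3):
        # examine the K items closest to t ...
        nearest = sorted(train, key=lambda item: math.dist(item[0], t))[:k]
        # ... and place t in the class most of them belong to
        return Counter(cls for _, cls in nearest).most_common(1)[0][0]

    print(knn((2, 2)))  # -> A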