Data Mining. Classification

AACIMP 2011 Summer School. Operational Research Stream. Lecture by Erik Kropat.

Classification

Summer School
“Achievements and Applications of Contemporary Informatics, Mathematics and Physics” (AACIMP 2011)
August 8-20, 2011, Kiev, Ukraine

Erik Kropat
University of the Bundeswehr Munich
Institute for Theoretical Computer Science, Mathematics and Operations Research
Neubiberg, Germany

Examples

Clinical trials

In a clinical trial, 20 laboratory values are collected from 10,000 patients, together with the diagnosis (ill / not ill).

We measure the values of a new patient.

Is he / she ill or not?

Credit ratings

An online shop collects data from its customers together with some information about the credit rating ( good customer / bad customer ).

We get the data of a new customer.

Is he / she a good customer or not?

Machine Learning / Classification

Labeled training examples → machine learning algorithm → classification rule

New example + classification rule → predicted classification

k Nearest Neighbor Classification

kNN

k Nearest Neighbor Classification

Idea: Classify a new object with regard to a set of training examples. Compare the new object with its k “nearest” objects (“nearest neighbors”).

[Figure: objects in class 1 and class 2, a new object, and its 4 nearest neighbors (4-nearest neighbor classification)]

k Nearest Neighbor Classification

• Required

− Training set, i.e. objects and their class labels

− Distance measure

− The number k of nearest neighbors

[Figure: a new object and its 5 nearest neighbors among objects of class 1 and class 2 (5-nearest neighbor classification)]

• Classification of a new object

− Calculate the distances between the new object and the objects of the training set.

− Identify the k nearest neighbors.

− Use the class labels of the k nearest neighbors to determine the class of the new object, e.g. by majority vote (see the sketch below).

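The procedure above can be written down directly. The following is a minimal sketch in Python/NumPy; the function name and the toy data are illustrative and not part of the lecture.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Classify x_new by a majority vote among its k nearest training objects
    # (Euclidean distance), following the procedure described above.
    dists = np.linalg.norm(X_train - x_new, axis=1)   # distances to all training objects
    nearest = np.argsort(dists)[:k]                   # indices of the k nearest neighbors
    votes = Counter(y_train[nearest])                 # majority vote over their class labels
    return votes.most_common(1)[0][0]

# Toy data: two classes in the plane
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [3.0, 3.0], [3.2, 2.9], [2.8, 3.1]])
y_train = np.array([1, 1, 1, 2, 2, 2])

print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))   # -> 1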

k Nearest Neighbor Classification

[Figure: classification of a new object with the 1-, 2- and 3-nearest neighbors; the predicted class label can change with k, and a tie (e.g. for k = 2) can be decided by distance. The decision boundary of the 1-nearest neighbor classifier corresponds to a Voronoi diagram.]

k Nearest Neighbor Classification: Distance

• The distance between the new object and the objects in the set of training samples is usually measured by the Euclidean metric or the squared Euclidean metric.

• In text mining the Hamming distance is often used (see the sketches below).
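As a small illustration, the three distance measures mentioned above can be written as follows; a sketch in Python/NumPy with illustrative function names.

import numpy as np

def euclidean(x, y):
    # Euclidean metric: square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((x - y) ** 2))

def squared_euclidean(x, y):
    # Squared Euclidean metric: gives the same neighbor ranking, cheaper to compute
    return np.sum((x - y) ** 2)

def hamming(x, y):
    # Hamming distance: number of positions in which two vectors differ
    return np.sum(x != y)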

k Nearest Neighbor Classification: Class Label of the New Object

• The class label of the new object is determined from the list of its k nearest neighbors. This can be done by

− a majority vote over the class labels of the k nearest neighbors, or

− a vote weighted by the distances of the k nearest neighbors (see the sketch below).
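Both variants are available, for example, in scikit-learn (not part of the lecture); a minimal sketch with illustrative toy data:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [3.0, 3.0], [3.2, 2.9], [2.8, 3.1]])
y_train = np.array([1, 1, 1, 2, 2, 2])

# Majority vote over the 3 nearest neighbors
knn_vote = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_train, y_train)
# Vote weighted by the (inverse) distances of the 3 nearest neighbors
knn_dist = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_train, y_train)

print(knn_vote.predict([[1.1, 0.9]]), knn_dist.predict([[1.1, 0.9]]))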

k Nearest Neighbor Classification: Choice of k

• The value of k has a strong influence on the classification result (see the sketch below).

− k too small: noise can have a strong influence.

− k too large: the neighborhood can contain objects from different classes (ambiguity / false classification).

[Figure: the same new object classified with a small and with a large neighborhood]
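One common way to choose k, not discussed on the slide, is cross-validation over a range of candidate values. A sketch using scikit-learn and synthetic data, both of which are assumptions for illustration only:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data (illustrative only)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Estimate the accuracy of kNN for several values of k by 5-fold cross-validation
for k in (1, 3, 5, 11, 21, 51):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k = {k:2d}: mean accuracy = {scores.mean():.3f}")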

Support Vector Machines

Support Vector Machines

A set of training samples with objects in Rn is divided into two categories: positive objects and negative objects.

Goal: “Learn” a decision rule from the training samples and assign a new example to the “positive” or the “negative” category.

Idea: Determine a separating hyperplane. New objects are classified as

− positive, if they lie in the half space of the positive examples,

− negative, if they lie in the half space of the negative examples.

Support Vector Machines

INPUT: Sample of training data T = { (x1, y1),...,(xk, yk) }, with data xi ∈ Rn and class labels yi ∈ {-1, +1}.

Example: data from patients with a confirmed diagnosis (xi: laboratory values, yi: disease yes / no).

Decision rule: f : Rn → {-1, +1}

INPUT: laboratory values of a new patient. DECISION: disease yes / no.

Support Vector Machines

Separating Hyperplane

A separating hyperplane is determined by

− a normal vector w and

− a parameter b

Idea: Choose w and b, such that the hyperplane separates the set of training samples in an optimal way.

H = { x ∈ Rn | ⟨ w, x ⟩ + b = 0 }

where ⟨ ·, · ⟩ denotes the scalar product and w is the normal vector.

Offset of the hyperplane from the origin along w: b / ‖w‖

[Figure: hyperplane H with normal vector w]

What is a good separating hyperplane?

There exist many separating hyperplanes

Will this new object be in the “red” class?

Question: What is the best separating hyperplane?

Answer: Choose the separating hyperplane so that the distance from it

to the nearest data point on each side is maximized.

[Figure: maximum-margin hyperplane H with its margin; the training points lying on the margin are the support vectors]

Scaling of Hyperplanes

• A hyperplane can be defined in many ways:

For c ≠ 0: { x ∈ Rn | ⟨ w, x ⟩ + b = 0 } = { x ∈ Rn | ⟨ cw, x ⟩ + cb = 0 }

• Use the training samples to choose (w, b), such that

min_{xi} | ⟨ w, xi ⟩ + b | = 1   (canonical hyperplane)

Definition

A training sample T = { (x1, y1),...,(xk, yk) } with xi ∈ Rn and yi ∈ {-1, +1} is separable by the hyperplane H = { x ∈ Rn | ⟨ w, x ⟩ + b = 0 }, if there exist a vector w ∈ Rn and a parameter b ∈ R, such that

⟨ w, xi ⟩ + b ≥ +1 , if yi = +1

⟨ w, xi ⟩ + b ≤ −1 , if yi = −1

for all i ∈ {1,...,k}.

[Figure: hyperplane H with normal vector w and the margin hyperplanes ⟨ w, x ⟩ + b = −1 and ⟨ w, x ⟩ + b = +1]

Maximal Margin

• The above conditions can be rewritten:

yi · ( ⟨ w, xi ⟩ + b ) ≥ 1 for all i ∈ {1,...,k}

• Distance between the two margin hyperplanes: 2 / ‖w‖

⇒ In order to maximize the margin we must minimize ‖w‖.

[Figure: hyperplane H with normal vector w and the margin hyperplanes ⟨ w, x ⟩ + b = −1 and ⟨ w, x ⟩ + b = +1]

Optimization problem

Find a normal vector w and a parameter b, such that the distance between the training samples and the hyperplane defined by w and b is maximized:

Minimize (1/2) ‖w‖²

s.t. yi · ( ⟨ w, xi ⟩ + b ) ≥ 1 for all i ∈ {1,...,k}

⇒ quadratic programming problem

[Figure: maximum-margin hyperplane H with normal vector w]
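The hard margin problem can be approximated, for example, with scikit-learn's linear SVC and a very large penalty parameter. This is an illustrative sketch with invented toy data, not the lecture's own implementation.

import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data
X = np.array([[1.0, 1.0], [1.5, 0.5], [2.0, 1.0],    # class -1
              [4.0, 4.0], [4.5, 3.5], [5.0, 4.0]])   # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin problem above
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]        # normal vector w
b = clf.intercept_[0]   # parameter b
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)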

Dual Form

Find parameters α1,...,αk, such that

Σ_{i=1}^{k} αi − (1/2) Σ_{i,j=1}^{k} αi αj yi yj ⟨ xi, xj ⟩ → Max

subject to αi ≥ 0 for all i = 1,...,k and Σ_{i=1}^{k} αi yi = 0.

⇒ The maximal margin hyperplane (= the classification problem) is only a function of the support vectors.

The training data enter the dual problem only through the scalar products ⟨ xi, xj ⟩. This motivates the kernel function

k( xi, xj ) := ⟨ xi, xj ⟩

Dual Form

• When the optimal parameters α1*,...,αk* are known, the normal vector w* of the separating hyperplane is given by

w* = Σ_{i=1}^{k} αi* yi xi

• The parameter b* is given by

b* = − (1/2) · ( max { ⟨ w*, xi ⟩ | yi = −1 } + min { ⟨ w*, xi ⟩ | yi = +1 } )

where the maximum and the minimum are taken over the training data.

Classifier

• A decision function f maps a new object x ∈ Rn to a category f(x) ∈ {-1, +1} :

f(x) = +1 , if ⟨ w*, x ⟩ + b* ≥ +1

f(x) = −1 , if ⟨ w*, x ⟩ + b* ≤ −1

[Figure: hyperplane H with normal vector w; the half space with label −1 and the half space with label +1]
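The formulas for w* and the decision rule can be checked on a fitted model: in scikit-learn, dual_coef_ stores the products αi* · yi for the support vectors. A sketch reusing the toy data from the hard-margin example above; not part of the lecture.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [1.5, 0.5], [2.0, 1.0],
              [4.0, 4.0], [4.5, 3.5], [5.0, 4.0]])
y = np.array([-1, -1, -1, 1, 1, 1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_[0] holds alpha_i* * y_i for each support vector, so
# w* = sum_i alpha_i* y_i x_i is a product with the support vectors:
w_star = clf.dual_coef_[0] @ clf.support_vectors_
b_star = clf.intercept_[0]

x_new = np.array([3.0, 2.0])
print(np.allclose(w_star, clf.coef_[0]))       # True: matches the fitted normal vector
print(int(np.sign(w_star @ x_new + b_star)))   # predicted class label for x_new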

Support Vector Machines

Soft Margins

Soft Margin Support Vector Machines

• Until now: Hard margin SVMs

The set of training samples can be separated by a hyperplane.

• Problem: Some elements of the training samples can have a wrong label.

Then the set of training samples cannot be separated by a hyperplane and the hard margin SVM is not applicable.

• Idea: Soft margin SVMs

Modified maximum margin method for mislabeled examples.

• Choose a hyperplane that splits the training set as cleanly as possible, while still maximizing the distance to the nearest cleanly split examples.

• Introduce slack variables ξ1,...,ξn which measure the degree of misclassification.

Soft Margin Support Vector Machines

• Interpretation

The slack variables measure the degree of misclassification of the training examples with regard to a given hyperplane H.

[Figure: hyperplane H with slack variables ξi and ξj for two training examples]

Soft Margin Support Vector Machines

• Replace the constraints

yi · ( ⟨ w, xi ⟩ + b ) ≥ 1 for all i ∈ {1,...,n}

by

yi · ( ⟨ w, xi ⟩ + b ) ≥ 1 − ξi for all i ∈ {1,...,n}

Soft Margin Support Vector Machines

[Figure: hyperplane H with a slack variable ξi]

• Idea

If the slack variables ξi are chosen as small as possible under the constraint yi · ( ⟨ w, xi ⟩ + b ) ≥ 1 − ξi for all i ∈ {1,...,n}, then:

ξi = 0 ⇔ xi is correctly classified (on the correct side of its margin hyperplane)

0 < ξi < 1 ⇔ xi lies between the margin hyperplanes

ξi ≥ 1 ⇔ xi is misclassified [ yi · ( ⟨ w, xi ⟩ + b ) ≤ 0 ]

Soft Margin Support Vector Machines

• The sum of all slack variables,

Σ_{i=1}^{n} ξi ,

is an upper bound for the total training error: every misclassified training example has ξi ≥ 1.

[Figure: hyperplane H with slack variables ξi and ξj]

Soft Margin Support Vector Machines

Find a hyperplane with maximal margin and minimal training error.

Soft Margin Support Vector Machines

Minimize (1/2) ‖w‖² + C · Σ_{i=1}^{n} ξi

s.t. yi · ( ⟨ w, xi ⟩ + b ) ≥ 1 − ξi for all i ∈ {1,...,n}

ξi ≥ 0 for all i ∈ {1,...,n}

The constant C > 0 is the regularisation parameter: it controls the trade-off between a large margin and a small sum of slack variables.
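The effect of the regularisation parameter C can be seen by fitting soft margin SVMs on overlapping classes. An illustrative sketch with scikit-learn and synthetic data, both of which are assumptions:

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Two overlapping classes (illustrative only)
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

# Small C: wide margin, many slack variables allowed.
# Large C: training errors are penalised heavily.
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C = {C:7.2f}: training accuracy = {clf.score(X, y):.3f}, "
          f"support vectors = {len(clf.support_vectors_)}")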

Support Vector Machines

Nonlinear Classifiers

Support Vector Machines Nonlinear Separation

Question: Is it possible to create nonlinear classifiers?

Idea: Map data points into a higher dimensional feature space where a linear separation is possible.

Ф : Rn → Rm

Support Vector Machines Nonlinear Separation

Nonlinear Transformation

Ф : Rn → Rm   (original feature space → high-dimensional feature space)

Kernel Functions

Assume: For a given set X of training examples we know a function Ф,

such that a linear separation in the high-dimensional space is possible.

Decision: When we have solved the corresponding optimization problem,

we only need to evaluate a scalar product

to decide about the class label of a new data object.

f(xnew) = sign ( Σ_{i=1}^{n} αi* yi ⟨ Ф(xi), Ф(xnew) ⟩ + b* ) ∈ {-1, +1}

Kernel functions

Introduce a kernel function

K(xi, xj) = ⟨ Ф(xi), Ф(xj) ⟩

The kernel function defines a similarity measure between the objects xi and xj. It is not necessary to know the function Ф or the dimension of the feature space!

Kernel Trick

Example: Transformation into a higher dimensional feature space

Ф : R² → R³ ,   Ф(x1, x2) = ( x1² , √2 · x1 x2 , x2² )

Input: an element x of the training sample and a new object x̂.

The scalar product in the higher dimensional space (here: R³) can be evaluated in the low dimensional original space (here: R²):

⟨ Ф(x), Ф(x̂) ⟩ = ⟨ ( x1², √2 x1 x2, x2² ) , ( x̂1², √2 x̂1 x̂2, x̂2² ) ⟩

= x1² x̂1² + 2 x1 x̂1 x2 x̂2 + x2² x̂2²

= ( x1 x̂1 + x2 x̂2 )²

= ⟨ x, x̂ ⟩² = K( x, x̂ )

Kernel Trick

It is not necessary to apply the nonlinear function Ф to transform the set of training examples into the higher dimensional feature space. Use the kernel function

K(xi, xj) = ⟨ Ф(xi), Ф(xj) ⟩

instead of the scalar product in the original optimization problem and in the decision function.
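The identity derived above can be checked numerically; a small sketch in Python/NumPy (the helper names phi and K are illustrative):

import numpy as np

def phi(x):
    # Explicit feature map Ф : R^2 -> R^3 from the example above
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def K(x, x_hat):
    # Kernel: squared scalar product, evaluated in the original space R^2
    return np.dot(x, x_hat) ** 2

x, x_hat = np.array([1.0, 2.0]), np.array([3.0, -1.0])

print(np.dot(phi(x), phi(x_hat)))   # scalar product in the feature space
print(K(x, x_hat))                  # same value, computed in the original space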

Kernel Functions

Linear kernel: K(xi, xj) = ⟨ xi, xj ⟩

Radial basis function kernel: K(xi, xj) = exp( − ‖xi − xj‖² / (2 σ0²) ) , with σ0² = mean value of ‖xi − xj‖²

Polynomial kernel: K(xi, xj) = ( s ⟨ xi, xj ⟩ + c )^d

Sigmoid kernel: K(xi, xj) = tanh ( s ⟨ xi, xj ⟩ + c )

Convex combinations of kernels: K(xi, xj) = c1 K1(xi, xj) + c2 K2(xi, xj)

Normalization kernel: K(xi, xj) = K'(xi, xj) / √( K'(xi, xi) · K'(xj, xj) ) , for a given kernel K'
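As a small illustration, three of the kernels listed above can be written directly; a sketch in Python/NumPy (the parameter names follow the notation of the list, everything else is an assumption):

import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def rbf_kernel(x, z, sigma2=1.0):
    # exp( -||x - z||^2 / (2 * sigma^2) )
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma2))

def polynomial_kernel(x, z, s=1.0, c=1.0, d=3):
    # ( s <x, z> + c )^d
    return (s * np.dot(x, z) + c) ** d

In practice, libraries such as scikit-learn provide these kernels directly, e.g. SVC(kernel="linear"), SVC(kernel="rbf") or SVC(kernel="poly").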

Summary

• Support vector machines can be used for binary classification.

• We can handle misclassified data if we introduce slack variables.

• If the sets to discriminate are not linearly separable we can use kernel functions.

• Applications → binary decisions

− Spam filter (spam / no spam)

− Face recognition (access / no access)

− Credit rating (good customer / bad customer)

Literature

• N. Cristianini, J. Shawe-Taylor: An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge, 2004.

• T. Hastie, R. Tibshirani, J. Friedman: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York, 2011.

Thank you very much!
