
K-Means Algorithm Implementation

Muhammad Waqas Moin Sheikh (15129145, BJTU)

Practice Course: Machine Learning and Cognitive Computing

    I.  Abstract

In this work, the K-means clustering algorithm is applied to Fisher's Iris plant dataset. The dataset is known to contain three classes of Iris plants: Setosa, Virginica, and Versicolor, one of which is linearly separable from the other two. To assess the capabilities of the clustering algorithm, it is applied to the dataset with a varied number of initial centers and stopping thresholds. It will be shown that the K-means clustering algorithm is capable of perfectly separating the Setosa data from the other two classes, as expected, and achieves acceptable recognition of the other two plant species.

II.  Introduction

K-means clustering is an unsupervised learning algorithm that tries to cluster data based on similarity. Unsupervised learning means that there is no outcome to be predicted; the algorithm just tries to find patterns in the data. In K-means clustering, we have to specify the number of clusters that we want the data to be grouped into. The algorithm randomly assigns each observation to a cluster and finds the centroid of each cluster. Then, the algorithm repeats two steps: first, reassign each data point to the cluster whose centroid is closest; second, calculate a new centroid for each cluster.

    Figure 1: k-means clustering result for the Iris flower data set and actual species visualized.


    III.  Methodology

The K-means algorithm is an unsupervised algorithm that attempts to cluster data into groups based on a chosen similarity measure. In this work, the similarity measure of choice is Euclidean distance. To create the clusters, the K-means algorithm iteratively implements the following steps:

I.  Initialize – Set the initial center of each cluster.

II.  Distribute data points – Assign each data point to the cluster whose center is the smallest distance from the data point.

III.  Compute new cluster centers – Set the position of each cluster center to the mean of all data points belonging to that cluster.

IV.  Compare new centers to old centers – If the new centers are the same as the old centers, the algorithm has converged; the clusters and centers computed in step III are the final clusters and centers. If they are not the same, return to step II.
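The following is a minimal MATLAB sketch of these four steps, using Euclidean distance. It is a hypothetical helper for illustration only (the actual implementation used in this work is the Kmeans.m function of Appendix A), and it assumes MATLAB R2016b or later for implicit expansion in X - Z(j,:):

%Minimal k-means sketch (illustrative only; see Appendix A for the
%function actually used in this work)
function [Z, labels] = kmeansSketch(X, k, maxIter)
n = size(X,1);
Z = X(randperm(n,k),:); %step I: k distinct sample points as initial centers
for iter = 1:maxIter
    D = zeros(n,k);
    for j = 1:k %step II: Euclidean distance to each center
        D(:,j) = sqrt(sum((X - Z(j,:)).^2, 2));
    end
    [~, labels] = min(D, [], 2); %assign each point to its nearest center
    Znew = Z;
    for j = 1:k %step III: new center = mean of assigned points
        members = X(labels == j,:);
        if ~isempty(members)
            Znew(j,:) = mean(members,1);
        end
    end
    if isequal(Znew, Z) %step IV: centers unchanged, converged
        break;
    end
    Z = Znew;
end
end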

In reality, it may not always be possible to find centers that do not change from iteration to iteration; in other words, the algorithm may not always settle on such a fixed solution. Some datasets may lead to centers that oscillate between two values, for example. So, to avoid an infinite loop when iterating through the algorithm, a threshold is used as an additional stopping condition. When the new cluster centers are identified at the end of each iteration, the amount of change in the clusters is also taken into consideration, by measuring the distance between the new centers and the old centers. If this distance is less than the threshold, the algorithm is considered to have converged.
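As a sketch, if Z and Znew hold the old and new center matrices, this stopping test can be written as follows (a simplified version; the Kmeans.m function of Appendix A instead compares the norms of the distance matrices of successive iterations):

%Sketch of the threshold stopping condition: stop once the centers
%move less than the chosen threshold between iterations
if norm(Znew - Z, 'fro') < threshold %total center movement
    NotEq = 0; %treat as converged
end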

IV.  Experimental Setup

This project was broken up into two tasks. First, the K-means algorithm was coded into a general function so that the number of centers and the threshold value could be varied easily. Next, a shell was created to call the function iteratively for each of the three k values and two threshold values. The results of each run were saved to a cell array. The function can be reviewed in Appendix A. In addition to making the function more general, the convergence test step was modified to include the threshold as a stopping condition for the algorithm. The function takes the following inputs: data, number of centers, and threshold value. Given these parameters, the Kmeans.m function returns two cell arrays: the centers per iteration and the assigned classes per iteration. More in-depth consideration of the pertinent steps is presented below.
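For example, a single run with three centers and a stopping threshold of 0.01 is invoked as follows (the same call pattern used by the shell in Appendix B):

[centers, clusters] = Kmeans(x, 3, 0.01); %centers{m}, clusters{m}: state at iteration m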

    Initializing the centers

An important consideration in the K-means algorithm is the choice of initial centers. Since this is an unsupervised algorithm, I chose to use sample values as the initial centers. However, instead of using the first k sample values, I decided to choose the centers at random for each run. This was achieved using the code snippet below.

    Z = x(randi([1,numSamples],centers,1),:); %k initial centers
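Note that randi samples with replacement, so two initial centers can occasionally coincide. A variant using randperm (an alternative shown here for comparison, not part of the original function) draws k distinct rows:

Z = x(randperm(numSamples,centers),:); %k distinct initial centers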


    Implementing the threshold as a stopping condition

To avoid an infinite loop when the K-means function is called, step IV of the algorithm was modified to include the threshold as a stopping condition. First, the new centers are compared to the old centers. If the new centers do not match the old centers, then the distance between them is computed using the norm function. This distance is compared to the threshold value; if it is less than the threshold, the algorithm converges. This logic is summarized in the code snippet below.

case 4 %Step 4: compare new centers to old centers
    if Z_iter{m+1} == Z_iter{m} %if new center = current center
        NotEq = 0; %algorithm converges
        break;
    elseif m == 1 %if 1st iter, no prev distance, just proceed
        m = m+1;
        step = 2;
    else %if not 1st iter and Z ~= Znew
        %check stopping conditions
        if abs(Dist_iter{m} - Dist_iter{m-1}) < threshold
            NotEq = 0;
            break;
        else
            m = m+1; %new iteration
            step = 2; %go back to step 2
        end
    end
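Note that comparing the center matrices with == inside the if relies on MATLAB treating the condition as true only when all elements are equal; isequal(Z_iter{m+1}, Z_iter{m}) would express the same test more explicitly.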

    Confusion Matrix

After calling the Kmeans function, a confusion matrix was generated for each simulation using the MATLAB confusionmat function. The results can be seen in the subsequent section.
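For reference, a minimal call looks as follows (assuming, as in Appendix B, that y holds the true species codes and clusters{end} the final cluster assignments):

C = confusionmat(y, clusters{end}); %rows: true species, columns: assigned clusters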

Results

    As previously stated, K-means clustering was applied six times to the dataset. The

    results are broken up into two groups and presented based on the chosen threshold

    value.

  • 8/17/2019 K-Mean Algo. on Iris Data set_15129145.pdf

    4/7

    Threshold = 0.01

Below, the confusion matrices for the three different choices of the number of initial centers are shown. In each simulation, the stopping threshold was set to 0.01.

Table 1. K = 2 Confusion Matrix

             Cluster 1   Cluster 2
Setosa           50           0
Versicolor        3          47
Virginica         0          50

Table 2. K = 3 Confusion Matrix

             Cluster 1   Cluster 2   Cluster 3
Setosa            0          50           0
Versicolor        2           0          48
Virginica        36           0          14

Table 3. K = 4 Confusion Matrix

             Cluster 1   Cluster 2   Cluster 3   Cluster 4
Setosa           50           0           0           0
Versicolor        0           0          21          29
Virginica         0          21          26           1

It can be seen that, in each case, the Setosa plant species was easily separated from the others. However, the Versicolor and Virginica datasets were not as easily distinguished from each other as they were from the Setosa. It is interesting to see that when there were just 2 centers, the Virginica dataset was perfectly separated from the others. The Versicolor, however, still straddles the two clusters: it is mostly clustered with the Virginica dataset, but several samples were clustered with the Setosa plants.

    Threshold = 0.1

Below, the confusion matrices for the three different choices of the number of initial centers are shown. In each simulation, the stopping threshold was set to 0.1.

Table 4. K = 2 Confusion Matrix

             Cluster 1   Cluster 2
Setosa            0          50
Versicolor       47           3
Virginica        50           0

Table 5. K = 3 Confusion Matrix

             Cluster 1   Cluster 2   Cluster 3
Setosa            0          50           0
Versicolor       47           0           3
Virginica        14           0          36

Table 6. K = 4 Confusion Matrix

             Cluster 1   Cluster 2   Cluster 3   Cluster 4
Setosa           50           0           0           0
Versicolor        0          25          25           0
Virginica         0          17           1          32

Increasing the threshold did not have much of an impact on the final confusion matrices. Although there is some shifting of the data points, as evidenced by the values shown in the tables, the overall clustering results are quite similar. In all three runs, the Setosa species was perfectly separated from the other two species. The other two species, on average, could not be perfectly separated. However, when there are two centers, the



    Setosa and Virginica sets are again easily separated from one another while the

    Versicolor is split (unevenly) between the two clusters.

    V.  Conclusions 

Using MATLAB and a personal computer, the K-means algorithm was applied to the Iris plant dataset. It was shown that the Setosa dataset could be perfectly classified in each case. The other two species, Versicolor and Virginica, were not as easily separated from each other as they were from the Setosa plant. These results held after randomly selecting the initial centers, varying the number of centers, and manipulating the stopping threshold. Since these results are typical for the Iris plant dataset and the K-means clustering algorithm was able to reach them, the K-means algorithm was shown to be a reliable method of clustering.

VI.  Appendix A – Kmeans.m

%This function takes in a dataset and number of clusters and returns
%the clustered data after applying the K-means algorithm
%inputs: data, number of clusters, threshold
%output: clustered data
function [clusCenters,clusData] = Kmeans(x,cen,t)

centers = cen; %number of centers
threshold = t;
numSamples = size(x,1); %number of samples
sampleLength = size(x,2); %dimension of samples
Dist = zeros(numSamples,centers); %array to hold distances
Class = zeros(numSamples,1); %array to hold classes
Znew = zeros(centers,sampleLength); %array to hold new centers

%step 1
Z = x(randi([1,numSamples],centers,1),:); %k initial centers
m = 1;
Z_iter{m} = Z; %save Z values
step = 2;
NotEq = true;

while NotEq
    switch step
        case 2 %distribute samples to clusters
            Z = Z_iter{m}; %grab current centers
            for k = 1:centers %for each center
                for N = 1:numSamples %for each sample
                    %compute distance between sample and center
                    Dist(N,k) = norm(x(N,:)-Z(k,:));
                end
            end
            Dist_iter{m} = norm(Dist); %save distance-matrix norm for threshold test
            for N = 1:numSamples %for each sample
                minDist = min(Dist(N,:)); %get min dist for sample
                [i,j] = find(Dist(N,:) == minDist); %index of min
                Class(N) = j(1); %index=class, save index/class
            end
            Class_Iter{m} = Class; %save class assignments for m
            step = 3;
        case 3 %compute new centers
            for k = 1:centers %for each center
                C = find(Class == k); %find all samples in class
                zt = zeros(1,sampleLength); %running sum over class samples
                for i = 1:size(C,1) %for every sample in class
                    zt = zt + x(C(i),:); %add sample to sum
                end
                Znew(k,:) = zt/size(C,1); %center = sample mean
            end
            Z_iter{m+1} = Znew; %save next centers
            step = 4;
        case 4 %compare new centers to old centers
            if Z_iter{m+1} == Z_iter{m} %if new = current
                NotEq = 0; %algorithm converges
                break;
            elseif m == 1 %if 1st iter, no prev distance, proceed
                m = m+1;
                step = 2;
            else %if not 1st iter and Z ~= Znew
                %check stopping conditions
                if abs(Dist_iter{m} - Dist_iter{m-1}) < threshold
                    NotEq = 0;
                    break;
                else
                    m = m+1; %new iteration
                    step = 2; %go back to step 2
                end
            end
    end
end

clusCenters = Z_iter; %return cluster centers per iteration
clusData = Class_Iter; %return clusters per iteration
end


VII.  Appendix B – project1.m

%This program loads the iris data set, then iteratively calls the k-means
%function to implement the k-means clustering algorithm for k values of 2,
%3, and 4 cluster centers using different threshold values.
clear;
iris = csvread('iris.csv'); %load dataset
x = iris(:,1:end-1); %get all but the class value
y = iris(:,end); %class labels

%set simulation parameters
k = [2 3 4]; %number of centers
t = [0.01 0.1]; %stopping thresholds

for i = 1:size(k,2)
    for j = 1:size(t,2)
        [centers, clusters] = Kmeans(x,k(i),t(j));
        CONF{i,j} = confusionmat(y,clusters{end});
    end
end
