RadViz Extensions with Applications Dissertation Defense John Sharko October 26, 2009

Page 1: RadViz Extensions with Applications

RadViz Extensions with Applications

Dissertation Defense

John Sharko

October 26, 2009

Page 2: RadViz Extensions with Applications

Committee

• Prof. Georges Grinstein (Advisor)
• Prof. Kenneth Marx
• Prof. Haim Levkowitz
• Dr. Patrick Hoffman
• Dr. Alex Gee

Page 3: RadViz Extensions with Applications

Outline

• Introduction
  – RadViz
  – Cluster Ensembles
  – Fuzzy Clusters
• Methodology
• Contributions
• Recommendations

Page 4: RadViz Extensions with Applications

RadViz Example

Page 5: RadViz Extensions with Applications

Description of Traditional RadViz

Each dimension in a dataset is represented by a point, called an anchor point, on the circumference of a circle.

Each record in the dataset is positioned as if it were being pulled by a spring attached to each anchor point where the strength of the spring is proportional to that record’s coordinate or value for the dimension related to that anchor point.

Page 6: RadViz Extensions with Applications

RadViz Example: All Coordinate Values Equal

Page 7: RadViz Extensions with Applications

RadViz Example: Two Coordinate Values Equal

Page 8: RadViz Extensions with Applications

RadViz Example: Range of Coordinate Values

Page 9: RadViz Extensions with Applications

RadViz Example: Range of Coordinate Values

Page 10: RadViz Extensions with Applications

Terminology

• Dimensional Anchor (Anchor Point) – point on the circle representing a dimension

• Point – representation of record(s) within the circle

Page 11: RadViz Extensions with Applications

RadViz Mathematical Formulation

\[
x_i = \frac{\sum_{j=1}^{d} a_{i,j} \cos\theta_j}{\sum_{j=1}^{d} a_{i,j}}, \qquad
y_i = \frac{\sum_{j=1}^{d} a_{i,j} \sin\theta_j}{\sum_{j=1}^{d} a_{i,j}}, \qquad
i = 1, \ldots, n
\]

where:
• x_i and y_i are the resulting transformed coordinates for record i
• θ_j is the angular position on the circle corresponding to dimension j
• a_{i,j} is the value for dimension j for record i
• d is the number of dimensions and n the number of records
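The formulation above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the dissertation. The function name `radviz`, the evenly spaced default anchor angles, and the choice to place an all-zero record at the center are assumptions made for the example.

```python
import math

def radviz(records, angles=None):
    """Project records to 2-D using the RadViz weighted-average formula.

    records: list of equal-length lists of non-negative values a[i][j].
    angles:  anchor angles theta_j in radians. If omitted, anchors are
             spaced evenly around the circle (an assumption of this sketch).
    """
    d = len(records[0])
    if angles is None:
        angles = [2 * math.pi * j / d for j in range(d)]
    points = []
    for row in records:
        total = sum(row)
        if total == 0:
            # The formula is undefined when every value is zero;
            # this sketch places such a record at the center.
            points.append((0.0, 0.0))
            continue
        x = sum(a * math.cos(t) for a, t in zip(row, angles)) / total
        y = sum(a * math.sin(t) for a, t in zip(row, angles)) / total
        points.append((x, y))
    return points

# A record with all values equal is pulled equally by every anchor,
# so it lands at (numerically very close to) the center of the circle.
print(radviz([[1, 1, 1, 1]]))
```

With evenly spaced anchors, a record whose only non-zero value is in the first dimension lands on that dimension's anchor at (1, 0).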

Page 12: RadViz Extensions with Applications

Impact of Exchanging Dimensional Anchors

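[Figure: the record (1, 0, 1, 0) plotted in two RadViz layouts whose dimensional anchors A, B, C, D are arranged in different orders around the circle.]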
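A small numeric sketch of why anchor order matters. It assumes that the record (1, 0, 1, 0) lists its values in the order A, B, C, D and that the anchors are evenly spaced; the two anchor orders below are illustrative and may not match the layouts drawn on the slide. The helper name `radviz_point` is hypothetical.

```python
import math

def radviz_point(values_by_dim, anchor_order):
    """Return the RadViz position of one record.

    values_by_dim: dict mapping dimension name to its value.
    anchor_order:  dimension names in the order they appear around the
                   circle, evenly spaced (an assumption of this sketch).
    """
    d = len(anchor_order)
    total = sum(values_by_dim.values())
    x = sum(values_by_dim[name] * math.cos(2 * math.pi * k / d)
            for k, name in enumerate(anchor_order)) / total
    y = sum(values_by_dim[name] * math.sin(2 * math.pi * k / d)
            for k, name in enumerate(anchor_order)) / total
    return x, y

record = {"A": 1, "B": 0, "C": 1, "D": 0}

# A and C opposite each other: their pulls cancel, point sits at the center.
print(radviz_point(record, ["A", "B", "C", "D"]))

# A and C adjacent: the point moves halfway between the two anchors.
print(radviz_point(record, ["A", "C", "B", "D"]))
```

The same record therefore appears at the center in one layout and well away from it in the other, which is the motivation for repositioning anchors.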

Page 13: RadViz Extensions with Applications

Example of Repositioning Anchor Points Using Layout Algorithm

[Figure: two panels, before repositioning and after repositioning.]

Page 14: RadViz Extensions with Applications
Page 15: RadViz Extensions with Applications

Multiple Clustered Datasets

• Clustering algorithms are heuristic, not optimal

• Different clustering algorithms tend to generate different clusters

Page 16: RadViz Extensions with Applications

Sample Multiple Clustered Dataset

Record   Algorithm A   Algorithm B   Algorithm C
1        1             2             3
2        1             2             3
3        1             2             1
4        2             1             1
5        2             3             2
6        2             3             4

Page 17: RadViz Extensions with Applications

Stable Group of Records

Record   Algorithm A   Algorithm B   Algorithm C
1        1             2             3
2        1             2             3
3        1             2             1
4        2             1             1
5        2             3             2
6        2             3             4
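One way to find such a group programmatically is sketched below. The slide's highlighting did not survive extraction, so this example assumes that a "stable group" means records that every algorithm places in the same cluster as each other. The function name `stable_groups` is hypothetical.

```python
from collections import defaultdict

# Cluster assignments from the sample table:
# record -> (Algorithm A, Algorithm B, Algorithm C)
assignments = {
    1: (1, 2, 3),
    2: (1, 2, 3),
    3: (1, 2, 1),
    4: (2, 1, 1),
    5: (2, 3, 2),
    6: (2, 3, 4),
}

def stable_groups(assignments):
    """Group records that share identical labels across all algorithms.

    Assumption of this sketch: records are 'stable' together when no
    algorithm separates them.
    """
    groups = defaultdict(list)
    for record, labels in assignments.items():
        groups[labels].append(record)
    # Keep only groups with at least two records.
    return [sorted(members) for members in groups.values() if len(members) > 1]

print(stable_groups(assignments))  # [[1, 2]]
```

Records 1 and 2 are the only pair that all three algorithms keep together; record 3 agrees with them under Algorithms A and B but is split off by Algorithm C.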

Page 18: RadViz Extensions with Applications

Uniquely Clustered Record

Record   Algorithm A   Algorithm B   Algorithm C
1        1             2             3
2        1             2             3
3        1             2             1
4        2             1             1
5        2             3             2
6        2             3             4

Page 19: RadViz Extensions with Applications

Fuzzy Clusters

• A record belongs to multiple clusters
• Varying strengths of association

Record   Cluster 1   Cluster 2   Cluster 3   Cluster 4
1        .8          .1          .1          0
2        .5          .4          0           .1
3        .3          .2          .3          .2
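Because each row of membership strengths is already a vector of non-negative numbers, it can be fed to the RadViz formula with one anchor per cluster. The sketch below assumes four evenly spaced anchors in the order Cluster 1 to Cluster 4; the function name `fuzzy_point` and the anchor placement are illustrative assumptions, not the layout used later in the deck.

```python
import math

# Membership strengths from the table above: record -> (C1, C2, C3, C4)
memberships = {
    1: (0.8, 0.1, 0.1, 0.0),
    2: (0.5, 0.4, 0.0, 0.1),
    3: (0.3, 0.2, 0.3, 0.2),
}

def fuzzy_point(weights):
    """RadViz position for one membership vector, anchors evenly spaced."""
    d = len(weights)
    total = sum(weights)
    x = sum(w * math.cos(2 * math.pi * j / d) for j, w in enumerate(weights)) / total
    y = sum(w * math.sin(2 * math.pi * j / d) for j, w in enumerate(weights)) / total
    return x, y

for record, weights in memberships.items():
    x, y = fuzzy_point(weights)
    print(record, round(x, 3), round(y, 3))
```

Under these assumptions record 1 lands near the Cluster 1 anchor at roughly (0.7, 0.1), while record 3, whose memberships balance across opposite anchors, lands at the center.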

Page 20: RadViz Extensions with Applications

Cluster Ensemble vs. Fuzzy Clustering

Cluster ensemble                                           Fuzzy cluster
Multiple instances of hard clustering                      A record belongs to multiple clusters
Each record is assigned to one cluster in each instance    Varying strengths of association

Page 21: RadViz Extensions with Applications

Using RadViz to Analyze Multiple Clustered Datasets

• RadViz typically deals with real numbers

• Cluster numbers are arbitrary labels, so plotting them directly as numeric values does not work

• How do you produce a meaningful RadViz visualization?

Page 22: RadViz Extensions with Applications

Flattening of Categorical Data

• Break up each original dimension into multiple dimensions

• Each new dimension represents a value of the original dimension

Page 23: RadViz Extensions with Applications

Flattening a Dimension

Original dimensions: Manufacturer, Model, Type, Price

Flattened dimensions: Manufacturer, Model, Small, Large, Sporty, Van, Price

Original Record: (Cadillac, Deville, Large, 33)

Flattened Record: (Cadillac, Deville, 0, 1, 0, 0, 33)
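The flattening step can be sketched as follows. The function name `flatten_dimension` and its arguments are hypothetical, chosen for illustration; the category order (Small, Large, Sporty, Van) is taken from the slide.

```python
def flatten_dimension(record, index, categories):
    """Replace the categorical value at `index` with 0/1 indicator columns.

    record:     tuple of values for one record.
    index:      position of the categorical dimension to flatten.
    categories: ordered values of that dimension; one new dimension each.
    """
    value = record[index]
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    indicators = tuple(1 if value == c else 0 for c in categories)
    return record[:index] + indicators + record[index + 1:]

TYPE_VALUES = ("Small", "Large", "Sporty", "Van")
original = ("Cadillac", "Deville", "Large", 33)
print(flatten_dimension(original, 2, TYPE_VALUES))
# ('Cadillac', 'Deville', 0, 1, 0, 0, 33)
```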

Page 24: RadViz Extensions with Applications

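Flattening Multi Cluster Dataset

Original dimensions: Algorithm A, Algorithm B, Algorithm C

Flattened dimensions:
• Algorithm A: clusters 1, 2
• Algorithm B: clusters 1, 2, 3
• Algorithm C: clusters 1, 2, 3, 4

Sample Record: (2, 1, 4) becomes (0, 1, 1, 0, 0, 0, 0, 0, 1), where the first two values belong to Algorithm A, the next three to Algorithm B, and the last four to Algorithm C.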
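The same flattening applied to a cluster ensemble record can be sketched like this. The function name `flatten_ensemble` is hypothetical; the cluster counts per algorithm (2, 3, 4) are read from the slide.

```python
def flatten_ensemble(labels, clusters_per_algorithm):
    """Turn one record's cluster labels into a single 0/1 vector.

    labels:                 cluster number assigned by each algorithm (1-based).
    clusters_per_algorithm: how many clusters each algorithm produced.
    Each algorithm contributes one indicator per cluster.
    """
    flat = []
    for label, k in zip(labels, clusters_per_algorithm):
        if not 1 <= label <= k:
            raise ValueError(f"label {label} outside 1..{k}")
        flat.extend(1 if c == label else 0 for c in range(1, k + 1))
    return tuple(flat)

# Algorithm A has 2 clusters, B has 3, C has 4.
print(flatten_ensemble((2, 1, 4), (2, 3, 4)))
# (0, 1, 1, 0, 0, 0, 0, 0, 1)
```

Every flattened record contains exactly one 1 per algorithm, so all records have the same coordinate sum in the RadViz formula.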

Page 25: RadViz Extensions with Applications

Simple Example

• Iris dataset
• Three cluster sets
  – KM1: K-means clustering with 1000 iterations
  – KM2: K-means clustering with 100,000 iterations
  – HC: hierarchical clustering
• Ten clusters per cluster set

Page 26: RadViz Extensions with Applications

Flattened Multi-cluster Iris Dataset

[Figure: points colored by KM1 cluster (KM1 Color Scale, 1 to 10); labeled anchor: HC-6.]

Page 27: RadViz Extensions with Applications

Flattened Multi-cluster Iris Dataset - Jittered

[Figure: points colored by KM1 cluster (KM1 Color Scale, 1 to 10); labeled anchor: HC-6.]

Page 28: RadViz Extensions with Applications

Flattened Multi-cluster Iris Dataset

[Figure: points colored by KM1 cluster (KM1 Color Scale, 1 to 10); labeled anchor: HC-6.]

Page 29: RadViz Extensions with Applications

Repositioning Dimensional Anchors

• Move points away from the center

• Separate points

• Increase displayed information content

Page 30: RadViz Extensions with Applications

Class Discrimination Layout Algorithm

• Select a dimension that classifies the records
• Assign each dimension to the class with the highest values with respect to the other classes
• Move the dimensional anchors assigned to the same class next to each other to form a classification sector
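The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the dissertation's algorithm: "highest values" is interpreted here as the highest per-class mean, ties go to the class that sorts first, and the name `class_discrimination_order` is hypothetical.

```python
from collections import defaultdict

def class_discrimination_order(rows, classes, dim_names):
    """Group dimensions into classification sectors.

    rows:      list of numeric records (one value per dimension).
    classes:   class label of each record (the classifying dimension).
    dim_names: names of the dimensions, in column order.
    Returns (sectors, anchor_order).
    """
    sums = defaultdict(lambda: [0.0] * len(dim_names))
    counts = defaultdict(int)
    for row, label in zip(rows, classes):
        counts[label] += 1
        for j, value in enumerate(row):
            sums[label][j] += value

    # Assign each dimension to the class with the highest mean value
    # (assumed interpretation of "highest values").
    sectors = defaultdict(list)
    for j, name in enumerate(dim_names):
        best = max(sorted(counts), key=lambda c: sums[c][j] / counts[c])
        sectors[best].append(name)

    # Place anchors of the same class next to each other around the circle.
    anchor_order = [name for label in sorted(sectors) for name in sectors[label]]
    return dict(sectors), anchor_order

rows = [[5, 1, 4, 0], [6, 0, 5, 1], [1, 7, 0, 6], [0, 8, 1, 5]]
classes = ["x", "x", "y", "y"]
print(class_discrimination_order(rows, classes, ["d1", "d2", "d3", "d4"]))
# ({'x': ['d1', 'd3'], 'y': ['d2', 'd4']}, ['d1', 'd3', 'd2', 'd4'])
```

In the toy data, d1 and d3 are high for class x and d2 and d4 are high for class y, so the anchors are regrouped into two adjacent sectors.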

Page 31: RadViz Extensions with Applications

Example of Class Discrimination Layout Algorithm

[Figure: two panels, before and after; the after panel shows Classification Sector 1 and Classification Sector 2, with a legend for classes 1 and 2.]

Page 32: RadViz Extensions with Applications

After Repositioning Dimensional Anchors

[Figure: legend KM1 Cluster Size (30, 20, 10, 5 records).]

Page 33: RadViz Extensions with Applications

After Repositioning Dimensional Anchors

[Figure: legend KM1 Cluster Size (30, 20, 10, 5 records); labeled anchor: KM1-2.]

Page 34: RadViz Extensions with Applications

After Repositioning Dimensional Anchors

[Figure: legend KM1 Cluster Size (30, 20, 10, 5 records).]

Page 35: RadViz Extensions with Applications

After Repositioning Dimensional Anchors

[Figure: legend KM1 Cluster Size (30, 20, 10, 5 records).]

Page 36: RadViz Extensions with Applications

After Repositioning Dimensional Anchors

[Figure: legend KM1 Cluster Size (30, 20, 10, 5 records).]

Page 37: RadViz Extensions with Applications

After Repositioning Dimensional Anchors

[Figure: legend KM1 Cluster Size (30, 20, 10, 5 records).]

Page 38: RadViz Extensions with Applications

After Repositioning Dimensional Anchors

[Figure: legend KM1 Cluster Size (30, 20, 10, 5 records).]

Page 39: RadViz Extensions with Applications

Moving Similar Classification Sectors Close to Each Other

- Dimensions have been grouped together into classification sectors

- Determine which record classes are most similar to each other using Euclidean distances

- Move those classification sectors closer to each other using a greedy algorithm

- Records will tend to be moved away from the center
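The slide does not say which greedy strategy is used, so the sketch below substitutes a simple nearest-neighbor ordering: start from one class, then repeatedly place next the unplaced class whose centroid is closest in Euclidean distance to the class just placed. The function name `greedy_sector_order`, the use of class centroids, and the starting class are all assumptions.

```python
import math

def greedy_sector_order(centroids, start=None):
    """Order classification sectors so similar classes sit next to each other.

    centroids: dict mapping class label to its mean record (list of numbers).
    start:     class to place first; defaults to the smallest label.
    Nearest-neighbor greedy ordering (an assumed strategy, see lead-in).
    """
    remaining = sorted(centroids)
    current = start if start is not None else remaining[0]
    remaining.remove(current)
    order = [current]
    while remaining:
        # Pick the unplaced class closest to the one placed last.
        nearest = min(
            remaining,
            key=lambda c: math.dist(centroids[current], centroids[c]),
        )
        remaining.remove(nearest)
        order.append(nearest)
        current = nearest
    return order

centroids = {1: [0, 0], 2: [10, 0], 3: [1, 0], 4: [9, 1]}
print(greedy_sector_order(centroids))  # [1, 3, 4, 2]
```

A greedy ordering like this is cheap but not guaranteed to be the best arrangement, which fits the later recommendation to investigate the optimum position of the classification sectors.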

Page 40: RadViz Extensions with Applications

Repositioning Classification Sectors

[Figure: points colored by KM1 cluster (KM1 Color Scale, 1 to 10); classification sectors numbered 1 to 10; dimensional anchors labeled by cluster set and cluster number (KM1-1 to KM1-10, KM2-1 to KM2-10, HC-1 to HC-10).]

Page 41: RadViz Extensions with Applications

Interpreting Vectorized RadViz (VRV)

[Figure: labels Sepal length, Petal length; classes Setosa, Versicolor, Virginica.]

Page 42: RadViz Extensions with Applications

Interpreting VRV

[Figure: labels Sepal length, Petal length; classes Setosa, Versicolor, Virginica.]

Page 43: RadViz Extensions with Applications

Interpreting VRV

[Figure: labels Sepal length, Petal length; classes Setosa, Versicolor, Virginica.]

Page 44: RadViz Extensions with Applications

Interpreting VRV

[Figure: labels Sepal length, Petal length; classes Setosa, Versicolor, Virginica.]

Page 45: RadViz Extensions with Applications

Interpreting VRV

[Figure: labels Sepal length, Petal length; classes Setosa, Versicolor, Virginica.]

Page 46: RadViz Extensions with Applications

Salamander Gene Expression Levels

[Figure: axes Time and Expression Levels.]

Page 47: RadViz Extensions with Applications

Salamander Class 9 Genes

[Figure: labeled genes Nvg00226, Nvg00155, Nvg00111, Nvg00091.]

Page 48: RadViz Extensions with Applications

Salamander Class 9 Genes

• Nvg00111
  – “Key” gene
  – CXC chemokine, ligand 10
• Nvg00226
  – No homology
• Nvg00155
  – Keratin type II cytoskeletal
• Nvg00091
  – Annexin

Page 49: RadViz Extensions with Applications

Fuzzy Clusters

Page 50: RadViz Extensions with Applications

Description of Fuzzy Clusters

• K-means clustering algorithm used
• Four clusters
• Applied to Iris dataset

Page 51: RadViz Extensions with Applications

RadViz Visualization of Fuzzy Clusters

[Figure: anchors Cluster 1, Cluster 2, Cluster 3, Cluster 4; classes Setosa, Versicolor, Virginica; annotations: outlier, area of Versicolor and Virginica overlap.]

Page 52: RadViz Extensions with Applications

Scatterplot Visualization of Iris Dataset

[Figure: axes Sepal Length and Petal Length; classes Setosa, Versicolor, Virginica; annotation: outlier.]

Page 53: RadViz Extensions with Applications

Comparing Visualizations of Fuzzy Clusters

[Figure: labels Sepal Length, Petal Length, Cluster 1, Cluster 2, Cluster 3, Cluster 4.]

Page 54: RadViz Extensions with Applications

RadViz Visualization of Fuzzy Clusters

[Figure: anchors Cluster 1, Cluster 2, Cluster 3, Cluster 4; classes Setosa, Versicolor, Virginica; annotations: Virginica outlier, overlap, central.]

Page 55: RadViz Extensions with Applications

Vectorized RadViz Visualization of Iris Cluster Ensemble

Key to dimension labeling: Cluster Set-Cluster Number, e.g. KM1-3 is K-means set 1, cluster number 3.

[Figure: classes Setosa, Versicolor, Virginica; annotations: Virginica outlier, overlap, central.]

Page 56: RadViz Extensions with Applications

Comparison of RadViz Visualizations

[Figure: two panels, Fuzzy Clusters (anchors Cluster 1, Cluster 2, Cluster 3, Cluster 4) and Cluster Ensemble - VRV; annotation: Virginica outlier.]

Page 57: RadViz Extensions with Applications

Comparison of RadViz Visualizations

[Figure: two panels, Fuzzy Clusters (anchors Cluster 1, Cluster 2, Cluster 3, Cluster 4) and Cluster Ensemble - VRV; annotation: central.]

Page 58: RadViz Extensions with Applications

RV Visualization of Fuzzy Clusters: Newt Microarray Dataset

[Figure: labeled groups Group A, Group B, Group C.]

Page 59: RadViz Extensions with Applications

VRV Visualization of Cluster Ensemble: Newt Microarray Dataset

[Figure: labeled groups Group A, Group B, Group C.]

Page 60: RadViz Extensions with Applications

Decision Trees

Page 61: RadViz Extensions with Applications

Decision to Play Tennis

Day   Outlook    Temperature   Humidity   Wind     Play Tennis
1     Sunny      Hot           High       Weak     No
2     Sunny      Hot           High       Strong   No
3     Overcast   Hot           High       Weak     Yes
4     Rain       Mild          High       Weak     Yes
5     Rain       Cool          Normal     Weak     Yes
6     Rain       Cool          Normal     Strong   No
7     Overcast   Cool          Normal     Strong   Yes
8     Sunny      Mild          High       Weak     No
9     Sunny      Cool          Normal     Weak     Yes
10    Rain       Mild          Normal     Weak     Yes
11    Sunny      Mild          Normal     Strong   Yes
12    Overcast   Mild          High       Strong   Yes
13    Overcast   Hot           Normal     Weak     Yes
14    Rain       Mild          High       Strong   No

Page 62: RadViz Extensions with Applications
Page 63: RadViz Extensions with Applications
Page 64: RadViz Extensions with Applications
Page 65: RadViz Extensions with Applications

VRV Applied to an Ordered Numerical Dataset

Page 66: RadViz Extensions with Applications

Adult Income Dataset

Income category (<$50,000, >$50,000) as a function of:

• Age
• Work class
• Education
• Marital status
• Occupation
• Relationship
• Race
• Gender
• Capital gain
• Capital loss
• Hours per week
• Native country

Page 67: RadViz Extensions with Applications

VRV Applied to the Adult Dataset

[Figure: legend < $50,000 and > $50,000.]

Page 68: RadViz Extensions with Applications

VRV as a Classifier

• < $50,000: 48% correct
• > $50,000: 89% correct

Page 69: RadViz Extensions with Applications

Records Predicted as High Income: Moderate Case

Page 70: RadViz Extensions with Applications

Records Predicted as High Income: Extreme Case

Page 71: RadViz Extensions with Applications

Summary of Results of VRV Classification of Adult Dataset

[Figure: chart of Percent Correct (0 to 100) by Income Category (<=50K, >50K, Total) for three cases: split in half, increased low income, extreme low income.]

Page 72: RadViz Extensions with Applications

Summary of Results of VRV Classification of Adult Dataset Compared to J48 Algorithm

[Figure: chart of Percent Correct (0 to 100) by Income Category (<=50K, >50K, Total) for three VRV cases (split in half, increased low income, extreme low income) and the J48 classification algorithm.]

Page 73: RadViz Extensions with Applications

Problems Binning Quantitative Data

Source: Iris dataset

Page 74: RadViz Extensions with Applications

Contributions

1. Vectorized RadViz
   1. Application to cluster ensembles provides the capability to visually simulate the identification of stable and unstable clusters.
   2. Identified several methods to evaluate the stability of clusters using characteristics of a VRV visualization.
   3. Improved the dimensional anchor layout algorithm by moving classification sectors.
   4. Used RV to visualize decision trees.
   5. Identified problems when applied to ordered numerical data.
   6. Successfully applied to microarray data.

Page 75: RadViz Extensions with Applications

Contributions (cont’d)

2. Fuzzy Clusters
   1. Developed a method to visualize fuzzy clusters using RV.
   2. Developed a method to visually compare results of fuzzy clusters and cluster ensembles applied to the same dataset.

Page 76: RadViz Extensions with Applications

Recommendations

• Adding information to plotted points
• Ordering of dimensions within classification sectors
• Selection of base classifier
• Investigate visualization of complex decision trees
• Investigate the optimum position of the classification sectors

Page 77: RadViz Extensions with Applications

Thank you