University of Konstanz
Department of Computer and Information Science
Chair of Information Processing

Master Thesis

A Visual Analytics Framework for Feature and Classifier Engineering

to obtain the academic degree
Master of Science (M.Sc.)

Henrico Dolfing

January 22, 2007

First examiner: Prof. Dr. D.A. Keim, University of Konstanz, Chair of Information Processing
Second examiner: Prof. Dr. M.R. Berthold, University of Konstanz, Chair for Applied Computer Science




Dolfing, Henrico: A Visual Analytics Framework for Feature and Classifier Engineering. Master Thesis, University of Konstanz, 2007.


Abstract

This thesis describes results of research on visualization techniques, visual clustering, feature engineering and classifier analysis. Most of the research and development was done at the Product and Application Development department of Océ Document Technologies GmbH (ODT) in Konstanz, Germany.

The goal of the research project was to combine scientific research with a practical solution to support feature and classifier engineering for large high dimensional data sets, such as those of ODT. This resulted in the development of a visual analytics framework that supports these tasks. The framework combines existing methods with newly developed visualization and clustering techniques.

This thesis contains three contributions to computer science.

• GridView, a collection of different views that makes it possible to view and explore a very large data set.

• A new visual clustering algorithm based on the O-Cluster algorithm [MC02], called Visual Supported O-Clustering.

• A visual analytics framework that combines existing techniques and algorithms with the newly developed techniques.

Keywords: Visual Analytics, Classifier Analysis, Feature Engineering


Preface

As the final part of my Master's study in Information Engineering at the University of Konstanz, I have written this thesis with the title 'A Visual Analytics Framework for Feature and Classifier Engineering'.

My thesis supervisors at the University of Konstanz were Prof. Dr. Daniel A. Keim, holder of the Chair of Information Processing, and Prof. Dr. Michael R. Berthold, holder of the Chair for Applied Computer Science. I would like to thank both for their willingness to supervise this thesis and for making it possible to write my thesis in an industrial environment at Océ Document Technologies GmbH.

I would also like to thank Dr. Tobias Schreck, my thesis advisor at the University of Konstanz, for his great support and our numerous discussions that provided a lot of new ideas and insight.

Most of the research for this thesis was done within the Product and Application Development (PAD) department of Océ Document Technologies GmbH in Konstanz, Germany. For that opportunity and his feedback during the research I would like to thank Dr. Wolfgang Lellmann, head of one of the Product and Application Development departments.

My thesis advisor at Océ Document Technologies was Dr. Knut R. Meier, and I would like to thank him for helping me understand the whole classification process in detail and for our very interesting discussions on this theme.

Furthermore, I would like to thank Fabian Dill and Christoph Sieb for proofreading my thesis and providing helpful comments.

Henrico Dolfing, January 22, 2007


Contents

List of Figures

1 Introduction
  1.1 Visual Analytics
  1.2 Feature and Classifier Engineering
  1.3 Contributions and Overview

2 Theoretical Framework
  2.1 Binning
  2.2 Projections
    2.2.1 Principal Component Analysis
    2.2.2 Normalized Principal Component Analysis
    2.2.3 Supervised Principal Component Analysis
    2.2.4 Multidimensional Scaling
    2.2.5 Self Organizing Maps
  2.3 Clustering
    2.3.1 Partitioning algorithms
    2.3.2 Hierarchical algorithms
    2.3.3 Locality-based algorithms
    2.3.4 Grid-based algorithms
  2.4 Summary

3 GridView Concept
  3.1 Basic Notions
  3.2 Views
    3.2.1 Scatterplot View
    3.2.2 Class View
    3.2.3 Mixed View
    3.2.4 Density View
    3.2.5 Purity View
  3.3 Summary

4 Visual Supported O-Clustering
  4.1 Projections and separators
  4.2 Multidimensional Grid
  4.3 O-Cluster Algorithm
    4.3.1 Sensitivity parameter
    4.3.2 Complexity
  4.4 Visual Supported O-Cluster Algorithm
    4.4.1 Visualization of cluster tree
    4.4.2 Visual finding of separators
    4.4.3 Manual selection of projections
  4.5 Summary

5 Visual Analytics Framework
  5.1 Architecture
  5.2 User Interface
  5.3 Input / Output
  5.4 Supported Tasks
  5.5 Summary

6 Application Examples
  6.1 Feature Engineering
    6.1.1 Visual Feature Space Analysis
    6.1.2 Regions of Interest
    6.1.3 Visual Projection Analysis
    6.1.4 Feature Selection for Multimedia Objects
  6.2 Classifier Engineering
  6.3 Summary

7 Conclusion and Discussion
  7.1 GridView
  7.2 Visual Supported O-Clustering
  7.3 Visual Analytics Framework

References

Statement of Authorship

List of Figures

1.1 Visual Analytics Process, taken from [KMSZ06]
2.1 Histogram of the British incomes data based on (a) the bin width h2, (b) the bin width h0 and (c) the S-PLUS default bin width. Taken from [Wan97].
2.2 Two 1-D projections of an originally 2-D data set that contains two outliers. The PCA projection is fooled by the outliers, unlike the normalized PCA projection that maintains much of the structure of the data. Taken from [KC03].
2.3 Two 1-D projections of 2-D data that contain two clusters. The PCA projection merges the clusters, while the weighted PCA projection keeps them much apart. Taken from [KC03].
3.1 Example of a regular equally spaced two-dimensional grid.
3.2 Examples of a scatterplot view.
3.3 Example of a class view.
3.4 Example of a class histogram view.
3.5 Example of a class view with α-blending to indicate the point density.
3.6 Example of a mixed view.
3.7 Example of a density view.
3.8 Example of a purity view.
4.1 General (left) and contracting projections (right)
4.2 Example of a regular multidimensional grid.
4.3 An overview of the O-Cluster algorithm, taken from [MC02]
4.4 An overview of the Visual Supported O-Cluster algorithm
4.5 A visual representation of the cluster tree.
4.6 Visual selection of separators by moving the separator to the left or right, with unlabeled and labeled data.
5.1 Example of the Data View
5.2 Example of the Classifier view
6.1 Feature extraction of a single character.
6.2 Scatter View of the numerical data set (see Table 6.1) after projection on two dimensions by a Principal Component Analysis.
6.3 Density View of the numerical data set (see Table 6.1) after projection on two dimensions by a Principal Component Analysis.
6.4 Purity View of the numerical data set (see Table 6.1) after projection on two dimensions by a Principal Component Analysis.
6.5 Scatter View of the classes "1" and "9" of the numerical data set (see Table 6.1) after projection on two dimensions by a Principal Component Analysis.
6.6 Class View of the classes "1" and "9" of the numerical data set (see Table 6.1) after projection on two dimensions by a Principal Component Analysis.
6.7 Class View with alpha-blending of the classes "1" and "9" of the numerical data set (see Table 6.1) after projection on two dimensions by a Principal Component Analysis.
6.8 Mixed View of the classes "1" and "9" of the numerical data set (see Table 6.1) after projection on two dimensions by a Principal Component Analysis.
6.9 Mixed View with alpha-blending of the classes "1" and "9" of the numerical data set (see Table 6.1) after projection on two dimensions by a Principal Component Analysis.
6.10 Density View of the numerical data set (see Table 6.1) after projection on two dimensions by a Principal Component Analysis (left) and a Self Organizing Map (right).
6.11 Purity View of the numerical data set (see Table 6.1) after projection on two dimensions by a Principal Component Analysis (left) and a Self Organizing Map (right).
6.12 Purity View of the feature vectors of the Princeton Shape Benchmark obtained with the Voxel extractor after projection on two dimensions by a 12 x 9 Self Organizing Map.
6.13 Regression analysis between purity score of SOM (unsupervised information) and a supervised discrimination precision metric for eleven extractors.
6.14 Example of the Classifier view


Chapter 1

Introduction

This thesis describes results of research on visualization techniques, visual clustering, feature engineering and classifier analysis. Most of the research and development was done at the Product and Application Development department of Océ Document Technologies GmbH (ODT) in Konstanz, Germany.

ODT is a worldwide leader in optical character recognition (OCR), information processing and document management. Founded in 1974 as Computer Gesellschaft Konstanz GmbH (CGK), the company has been part of the Océ Group since April 2000. Within this group, ODT is the international center of expertise for document management systems. ODT currently has a staff of approximately 180 employees.

The Océ Group is one of the world's leading suppliers of products and services for professional printing and document management, from document creation and design to distribution and archiving. The Océ Group is active in approximately 80 countries, with direct sales and service organizations in more than 30 countries, and has approximately 26,000 employees worldwide.

One of the applications that ODT sells is the OCR software RecoStar ProfessionalPlus. OCR is the whole process of transforming an image of a document that contains text (machine printed or handwritten) into a corresponding ASCII text. RecoStar ProfessionalPlus combines the strengths of two commercial OCR engines, the RecoStar and the AEG Recognition software. These engines have been optimized and fine-tuned to read in parallel. Next, an internal voting system combines the results of both engines in order to improve the results. Since OCR is mainly a task of pattern classification, RecoStar ProfessionalPlus can be described as a multiple classifier system (MCS).

Both engines consist of a large collection of classifiers that work on high dimensional feature vectors extracted from the images to be read. The goal of the research project was to combine scientific research with a practical solution to support feature and classifier


engineering for large high dimensional data sets, such as those of ODT. This resulted in the development of a visual analytics framework that supports these tasks. The framework combines existing methods with newly developed visualization and clustering techniques.

In the following two sections, definitions of the terms "visual analytics" and "feature and classifier engineering" will be given. Section 1.3 presents the contributions made to the computer science community and provides an overview of the thesis.

1.1 Visual Analytics

Visual analytics is the science of analytical reasoning facilitated by interactive visual interfaces [TC05]. It is a multidisciplinary field that includes the following focus areas:

• Analytical reasoning techniques that enable users to obtain deep insights which directly support assessment, planning and decision making.

• Visual representation and interaction techniques that take advantage of the human eye's broad bandwidth pathway into the mind to allow users to see, explore and understand large amounts of information at once.

• Data representation and transformation techniques that convert all types of conflicting and dynamic data in ways that support visualization and analysis.

• Techniques to support production, presentation, and dissemination of the results of an analysis to communicate information in the appropriate context to a variety of audiences.

Following Keim et al. [KMSZ06], a formal description of the visual analytics process is provided in the following. Input for the data sets used in the visual analytics process are heterogeneous data sources (i.e. the internet, newspapers, scientific experiments, expert systems). From these sources, the data sets S = {S1, ..., Sn} are chosen, where each Si, i ∈ {1, ..., n}, consists of a set of attributes A = {Ai1, ..., Aik}. The goal or output of the process is insight I. Insight is either directly obtained from the set of created visualizations V or through confirmation of hypotheses H as the result of automated analysis methods. Figure 1.1 illustrates this formalization.

Arrows represent the transitions from one set to another. More formally, the visual analytics process is a transformation F: S ⇒ I, where F is a concatenation of functions f ∈ {DW, VX, HY, UZ} defined as follows: DW describes the basic data pre-processing functionality with DW: S ⇒ S and W ∈ {T, C, SL, I}, including data transformation


Figure 1.1: Visual Analytics Process, taken from [KMSZ06]

(i.e., projections and clustering) functions DT, data cleaning functions DC, data selection functions DSL and data integration functions DI that are needed to make analysis functions applicable to the data set.

VW, W ∈ {S, H}, symbolizes the visualization functions, which are either functions visualizing data, VS: S ⇒ V, or functions visualizing hypotheses, VH: H ⇒ V.

User interactions UZ, Z ∈ {V, H, CV, CH}, are an integral part of the visual analytics process. User interactions can either affect only visualizations, UV: V ⇒ V (i.e., selecting, zooming or filtering), or affect only hypotheses, UH: H ⇒ H, by generating new hypotheses from given ones. Furthermore, insight can be concluded from visualizations, UCV: V ⇒ I, or from hypotheses, UCH: H ⇒ I.

F(S) is an iterative process rather than a single application of each provided function, as indicated by the feedback loop in Figure 1.1.
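To make the formalization tangible, the following Python sketch renders one pass of F as a concatenation of functions. Only the shapes of the mappings (S ⇒ S, S ⇒ V, V ⇒ I) follow the formalization; the stage bodies are illustrative placeholders of our own, not part of [KMSZ06].

```python
# Schematic rendering of F: S => I as a concatenation of functions.
# The concrete stage bodies are placeholder stand-ins.

def d_c(s):
    """Data cleaning D_C: S => S (here: drop missing values)."""
    return [x for x in s if x is not None]

def v_s(s):
    """Visualization V_S: S => V (here: a textual stand-in for a plot)."""
    return f"scatterplot of {len(s)} points"

def u_cv(v):
    """Insight concluded from a visualization, U_CV: V => I."""
    return f"insight drawn from {v}"

def F(s):
    """One pass of F = U_CV after V_S after D_C."""
    return u_cv(v_s(d_c(s)))
```

In practice F is applied iteratively: user interactions UV and UH feed back into new visualizations and hypotheses, as the feedback loop in Figure 1.1 indicates.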

1.2 Feature and Classifier Engineering

One way to represent objects is the so-called feature vector approach. This approach represents objects o ∈ O given in an object space O by points po ∈ Rd in a d-dimensional vector space. Feature vector extractors fvx are functions fvx: O → Rd mapping objects to vectors that numerically describe object properties. Suitable extractors can be calculated efficiently and capture object similarities effectively via appropriate distance functions d: (pi, pj) → R0+ defined in feature space.

The effectiveness of a given extractor used to represent objects is critical for any feature vector based application. We understand the effectiveness of an extractor as the degree to which distances d in feature space resemble object similarities in object space. The identification of the most suitable extractor for a given set of objects is a difficult


task and is what we call feature engineering.

Classification is the problem of assigning class labels to unlabeled data items, given a collection of labels. Creating efficient and effective classifiers is what we call classifier engineering. Classifiers that work on feature vectors of course also depend on the quality of those feature vectors. Therefore, feature and classifier engineering are strongly connected.
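As a toy illustration of the feature vector approach, the sketch below maps tiny binary character images to Rd with a hypothetical extractor and compares the resulting vectors with the Euclidean distance. The chosen features (row and column ink densities) are our own illustrative assumption, not ODT's extractors.

```python
import numpy as np

def fvx(image):
    """Hypothetical extractor fvx: O -> R^d; the features (row and
    column ink densities of a binary image) are illustrative only."""
    image = np.asarray(image, dtype=float)
    rows = image.mean(axis=1)   # ink density per row
    cols = image.mean(axis=0)   # ink density per column
    return np.concatenate([rows, cols])

def dist(p_i, p_j):
    """Euclidean distance d: (p_i, p_j) -> R0+ in feature space."""
    return float(np.linalg.norm(np.asarray(p_i) - np.asarray(p_j)))
```

An effective extractor, in the sense defined above, makes dist small exactly for those images that are similar in object space.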

Our goal in visualizing feature space and classifiers is to quickly absorb inter-class and intra-class relationships, i.e. to understand the spatial relationships between various classes in order to answer questions such as:

1. how well-separated are different classes?

2. what classes are similar or dissimilar to each other?

3. what kind of surface separates various classes? (classification boundaries)

4. how coherent or well-formed is a given class?

Answers to these questions can enable the data analyst to infer inter-class relationships that may not be part of the given classification. Additionally, visualization can help in assessing the quality of the classification and the feature space. Discovery of interesting class relationships in such a visual examination can help in designing better classifiers and in more effective feature extraction.

In order to achieve the above objectives for large high dimensional data sets, we believe the user must have access to a real-time, interactive visual analytics framework. The framework should allow the user to compute different local and global views of the data on the fly, and allow seamless transitions between the various views.

1.3 Contributions and Overview

Chapter 2 contains the theoretical background of this thesis. It consists of three parts in which binning, projections and clustering are explained and different algorithms concerning these topics are presented.

The work presented in this thesis contains three contributions to the field of computer science.

1. A new visualization technique, called GridView, is presented in Chapter 3. The concept comprises a collection of different views that are all based on a grid, hence the name GridView.


2. A new visual clustering algorithm based on the O-Cluster algorithm [MC02], called Visual Supported O-Clustering, is described in Chapter 4.

3. A visual analytics framework that combines existing techniques and algorithms with the newly developed techniques. The framework that was developed and implemented during this research is presented in Chapter 5.

Application examples from different fields are presented in Chapter 6. They show the usefulness of the visual analytics framework, including the newly developed techniques, and its value for feature and classifier engineering. Chapter 7 concludes the thesis with some final remarks on the topic and suggestions for future work.


Chapter 2

Theoretical Framework

For a better understanding of the work presented in this thesis, some theory about binning, projections and clustering is necessary. This chapter provides a quick overview of this theory and references to more comprehensive literature. Assumed throughout this chapter are n data elements of a data set D in an m-dimensional feature space, arranged row-wise in the n × m feature matrix D, with Dij being feature j of element i.

2.1 Binning

Discretization of data, often called binning, is a widely used technique in statistics and data mining. It reduces the number of values n for a given continuous attribute by dividing the range of the attribute into k intervals (bins), with typically k << n. Bin labels can then be used to replace actual data values. Among many other discretization techniques, equal width binning and equal depth binning are very popular.

Let hi be the width of bin Bi, and xi the number of data elements in Bi. Equal width binning then means determining the width h of a single bin, i.e. hi = h with i = {1, ..., k}. In most cases not the bin width but the number of bins k is determined, with the bin width being computed as (max value − min value)/k. This results in a range for each bin. Each data element is then placed in the bin whose range contains its value.

Equal depth binning works the other way around. The number of data elements x in a single bin is determined, i.e. xi = x with i = {1, ..., k}. All data elements are then sorted in ascending order according to their values. The first x values are placed in the first bin, the next x values in the second bin, and so on.
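Both schemes can be sketched in a few lines of Python. This is our own illustration of the two definitions, not code from the framework:

```python
import numpy as np

def equal_width_bins(values, k):
    """Assign each element to one of k bins of width (max - min) / k."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    h = (hi - lo) / k
    # floor division by the bin width; clip so the maximum falls in bin k-1
    return np.clip(np.floor((values - lo) / h).astype(int), 0, k - 1)

def equal_depth_bins(values, k):
    """Sort the elements and place roughly n/k consecutive values per bin."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    idx = np.empty(len(values), dtype=int)
    # array_split spreads any remainder so bin sizes differ by at most one
    for b, chunk in enumerate(np.array_split(order, k)):
        idx[chunk] = b
    return idx
```

The two schemes agree on well-separated data but diverge on skewed data: equal width binning isolates outliers in sparse bins, while equal depth binning keeps the element counts per bin balanced.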

Histograms use equal width binning, with the bin width h being the histogram's most important parameter, since it controls the trade-off between presenting a picture with


too much detail (undersmoothing) or too little detail (oversmoothing) with respect to the true distribution. The same goes for the cell width and cell height of a two-dimensional grid. Because we use both histograms and two-dimensional grids (see Chapters 3 and 4) in our framework, we sought an algorithm that automatically estimates the 'optimal' bin width for a given data distribution. Despite the importance of the problem, there has been surprisingly little research into estimation of the 'optimal' bin width.

In this section we present the bin width determination rule of Scott [Sco97], and its theoretical verification by Wand [Wan97], which we found suitable to implement in our framework. Besides the zero-stage rule (the verification of Scott's rule), Wand [Wan97] suggests two other rules: the one-stage and the two-stage rule. Both rules achieve good smoothing results on histograms, but the computation of the estimators they use is very expensive. We therefore decided to use the zero-stage rule, which has slightly lower smoothing quality but is very fast to compute. All three rules are proven to have good theoretical properties, but these proofs are out of the scope of this thesis. Interested readers should consult the original publications.

The zero-stage rule h_0 returns the bin width h for a given data distribution. The basis of the zero-stage rule is a normal scale estimator \psi_r^{NS}, defined as

\psi_r^{NS} = \frac{(-1)^{r/2}\, r!}{(2\sigma)^{r+1}\, (r/2)!\, \pi^{1/2}}    (2.1)

where r stands for the L_r metric used; in our case r = 2. An often used value for \sigma is

\hat{\sigma} = \min\{s,\ \mathrm{IQR}/1.349\}    (2.2)

where s is the sample standard deviation and IQR is the inter-quartile range. The factor 1.349 ensures that \hat{\sigma} is consistent for \sigma when the data are normally distributed. Using this estimator, the zero-stage rule is defined as

h_0 = \left(\frac{6}{-\psi_2^{NS}\, n}\right)^{1/3} = (24\pi^{1/2}/n)^{1/3}\,\hat{\sigma} \approx 3.49\,\hat{\sigma}\, n^{-1/3}    (2.3)

As stated before, the zero-stage rule is actually another definition of the normal scale bin width selection rule presented in [Sco97].
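Equations (2.2) and (2.3) translate directly into code. The sketch below assumes NumPy; the function name is our own:

```python
import numpy as np

def zero_stage_bin_width(data):
    """Zero-stage rule: h0 = (24 * pi^(1/2) / n)^(1/3) * sigma_hat,
    roughly 3.49 * sigma_hat * n^(-1/3), with sigma_hat as in Eq. (2.2)."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    s = data.std(ddof=1)                       # sample standard deviation
    q75, q25 = np.percentile(data, [75, 25])
    sigma = min(s, (q75 - q25) / 1.349)        # robust scale estimate, Eq. (2.2)
    return (24.0 * np.sqrt(np.pi) / n) ** (1.0 / 3.0) * sigma
```

The number of histogram bins then follows as the value range divided by h0, rounded up; note that h0 shrinks like n^(-1/3), so larger samples get finer histograms.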

Figure 2.1 shows the results of applying (a) h2, (b) Scott's normal reference rule h0 and (c) the S-PLUS default bin width to a data set that contains 7,201 British incomes for the year 1975, which have been divided by their sample average.

The bin width choice h2 leads to a histogram which clearly shows the bimodal structure in the data. The selector h0 also shows the bimodal structure, but not quite as sharply.


Figure 2.1: Histogram of the British incomes data based on (a) the bin width h2, (b) the bin width h0 and (c) the S-PLUS default bin width. Taken from [Wan97].

The S-PLUS default bin width leads to an oversmoothed histogram that does not show the bimodal structure of the data at all. This example clearly shows that a wrong choice of bin width can lead to missing important features in the data.

2.2 Projections

One of the most important aspects of exploratory data analysis is data visualization, which aims at revealing structure and unexpected relationships in large data sets by presenting them as human-accessible visualizations. An important family of visualization tools comprises methods for achieving a low dimensional embedding of multivariate data. In other words, the data is mapped to points in a low dimensional space in a way that captures the essence of the underlying data set. There are numerous techniques to do this, including well-studied methods such as Principal Component Analysis (PCA), Multidimensional Scaling (MDS) and Self Organizing Maps (SOM). We implemented these methods in our framework (see Chapter 5), and in addition two PCA variants recently proposed by Koren and Carmel [KC03].

Projection methods can be divided into two groups: linear and non-linear methods. MDS and SOM are both non-linear methods. Non-linear methods take as input pairwise relationships between data elements (i.e. similarities or distances) and then optimize a cost function that preserves those similarities. In other words, the resulting mapping tries to place similar data elements next to each other. This results in good topological projections, but the mappings are hard to interpret because the axes of the mapping are meaningless and the orientation of the map is arbitrary.


PCA and its variants are so-called linear transformations, i.e., each low-dimensional axis is some linear combination of the original axes. Linear mappings are certainly more limited than their non-linear counterparts, but on the other hand they possess several significant advantages [KC03]:

• The low-dimensional embedding is reliable in the sense that it is guaranteed to show genuine properties of the data. In contrast, the relation to the original data is less clear for non-linear embeddings.

• The embedding axes are meaningful, as they are linear combinations of the original axes. These combinations can sometimes even induce domain-specific interpretations.

• Using the already computed linear transformation, new data elements can easily be added to the embedding without having to recalculate it.

• In general, the computational complexity of linear transformation methods is low, both in time and space, when compared to non-linear transformations.

The following sections describe the methods implemented in our framework in more detail.

2.2.1 Principal Component Analysis

Principal Component Analysis (PCA), also known as the Karhunen-Loève transform or Hotelling transform, is a linear transformation that transforms the data into a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate, i.e., the first principal component. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. By using only the first few principal components, PCA makes it possible to reduce the number of significant dimensions of the data while maintaining the maximum possible variance thereof. Depending on the application, the first principal components often contain the most important aspects of the data, but this is not always the case. See [ED91] for a comprehensive discussion of PCA.
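As a concrete sketch, PCA can be computed by an eigendecomposition of the covariance matrix of the centered data. The following minimal Python/NumPy illustration is an assumption of the authors' intent, not the framework's actual implementation:

```python
import numpy as np

def pca(X, k=2):
    """Project data X (n samples x m features) onto its first k principal components."""
    Xc = X - X.mean(axis=0)                 # center the data
    cov = np.cov(Xc, rowvar=False)          # m x m sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: ascending eigenvalues of a symmetric matrix
    order = np.argsort(eigvals)[::-1]       # sort components by explained variance, descending
    components = eigvecs[:, order[:k]]      # first k principal axes (linear combinations of the original axes)
    return Xc @ components, components

# 100 points in 3-D whose variance is concentrated along one direction
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1)) @ np.array([[3.0, 2.0, 0.5]]) + 0.1 * rng.normal(size=(100, 3))
Y, comps = pca(X, k=2)
```

The first projected coordinate captures the largest share of the variance, as the text describes.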

Koren and Carmel [KC03] proved that PCA maximizes the sum of the projected pairwise squared distances:

∑_{i<j} (dist^p_{ij})²

where dist^p_{ij} is the Euclidean distance between the projections of elements i and j. This derivation makes it possible to define and extend PCA in a whole different way. The two following PCA extensions are presented in [KC03]. The claims taken from the paper are not proved here; interested readers can find full proofs in the original paper.


2.2.2 Normalized Principal Component Analysis

PCA maximizes the sum of squared distances, which emphasizes the contribution of the points with large pairwise distances. When outliers, i.e. noise, are present, this behavior will disturb the results of PCA. This can be solved by introducing nonnegative pairwise weights. The goal is then to maximize the sum of the weighted distances:

∑_{i<j} w_{ij} (dist^p_{ij})²   (2.4)

In the case of normalized PCA the weights are defined as follows:

w_{ij} = 1 / dist_{ij}   (2.5)

This results in large distances contributing less to the summation. An example of the difference between PCA and normalized PCA can be seen in Figure 2.2.

Figure 2.2: Two 1-D projections of an originally 2-D data set that contains two outliers. The PCA projection is fooled by the outliers, unlike the normalized PCA projection, which maintains much of the structure of the data. Taken from [KC03].

The figure shows a synthetic two-dimensional data set containing 50 normally distributed data points and two outliers. As can be seen, the one-dimensional PCA projection projects the data in a direction that emphasizes the outliers while hiding almost all information contained in the normally distributed area. The normalized PCA, on the other hand, maintains much of the structure present in the whole data set.
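The weighted objective of Equation 2.4 can be evaluated directly for a candidate projection direction. The sketch below, a simplified illustration rather than the optimization procedure of [KC03], computes the plain and the normalized (w_ij = 1/dist_ij) objectives for a fixed direction:

```python
import numpy as np

def projected_objective(X, v, weights=None):
    """Sum over all pairs i < j of w_ij * (dist^p_ij)^2 for the 1-D projection of X onto v."""
    p = X @ (v / np.linalg.norm(v))        # projected coordinates
    sq = (p[:, None] - p[None, :]) ** 2    # squared pairwise projected distances
    w = np.ones_like(sq) if weights is None else weights
    iu = np.triu_indices(len(X), k=1)      # index pairs with i < j
    return float((w[iu] * sq[iu]).sum())

def normalized_weights(X, eps=1e-12):
    """Normalized-PCA weights w_ij = 1 / dist_ij in the original space (eq. 2.5)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return 1.0 / np.maximum(d, eps)        # eps guards the (unused) diagonal

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
plain = projected_objective(X, np.array([1.0, 0.0]))
weighted = projected_objective(X, np.array([1.0, 0.0]), normalized_weights(X))
```

Because the weights down-scale distant pairs, the weighted objective is smaller than the plain one for the same direction.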


2.2.3 Supervised Principal Component Analysis

PCA does not take cluster or class labels into consideration. Artificially underweighting the dissimilarities between intra-cluster pairs of data elements can be helpful to obtain an embedding that separates clusters or classes. This can be done by multiplying the intra-cluster dissimilarities by some decay factor t, 0 ≤ t ≤ 1. The weights are then defined as

w_{ij} = { t · w_{ij}   if i and j have the same label
         { w_{ij}       otherwise                       (2.6)

Figure 2.3 shows an example of a PCA projection compared to the projection that supervised PCA produces on the same data.

Figure 2.3: Two 1-D projections of 2-D data that contain two clusters. The PCA projection merges the clusters, while the weighted PCA projection keeps them well apart. Taken from [KC03].

The figure shows a synthetic two-dimensional data set containing 2 normally distributed clusters with 200 points each. As can be seen, the one-dimensional PCA projection completely merges the two clusters, whereas by setting all the intra-cluster dissimilarities to 0, the supervised PCA projection obtains a one-dimensional projection that captures the clustering decomposition of the data set.
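The weight update of Equation 2.6 is straightforward to express in code. This small sketch (an illustration, with the function name chosen here) scales the weights of same-label pairs by the decay factor t:

```python
import numpy as np

def supervised_weights(base_w, labels, t=0.0):
    """Apply eq. 2.6: multiply intra-cluster weights by the decay factor t (0 <= t <= 1)."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]   # True where i and j share a label
    return np.where(same, t * base_w, base_w)

w = np.ones((4, 4))
out = supervised_weights(w, ["a", "a", "b", "b"], t=0.5)
```

With t = 0, as in the figure's example, intra-cluster pairs contribute nothing to the objective, so only inter-cluster separation is maximized.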

2.2.4 Multidimensional Scaling

Multidimensional scaling (MDS) is a collection of methods which allow one to gain insight into the underlying structure of relations between data elements by providing a geometrical representation of these relations. An MDS algorithm starts with an input matrix X


that contains pairwise similarities, and then assigns a location to each element in a low-dimensional space.

The degree of correspondence between the similarity among data elements implied by the MDS projection and the input matrix is measured by a stress function. The general form of this function is as follows:

√( ∑_{i=0}^{n} ∑_{j=0}^{n} (f(X_{ij}) − d_{ij})² / scale )   (2.7)

where d_{ij} is the distance between two data elements in the MDS projection and f(X_{ij}) is some function of the input data. scale refers to a constant scaling factor, used to keep the stress values between 0 and 1. When the MDS projection perfectly reproduces the input data, the stress is 0. Thus, the smaller the stress, the better the representation.

The transformation of the input values f(X_{ij}) depends on whether metric or non-metric scaling is used. In metric scaling, f(X_{ij}) = X_{ij}. In other words, the raw input data is compared directly to the mapped distances. In non-metric scaling, f(X_{ij}) is a weakly monotonic transformation of the input data that minimizes the stress function. The monotonic transformation is computed via 'monotonic regression', also known as 'isotonic regression'.
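The stress of Equation 2.7 under metric scaling can be sketched as follows. The choice of `scale` as the sum of squared input dissimilarities is one common convention, assumed here since the text leaves the constant open:

```python
import numpy as np

def metric_stress(X_in, coords):
    """Stress (eq. 2.7) with metric scaling, i.e. f(X_ij) = X_ij.
    X_in: n x n input dissimilarity matrix; coords: n x 2 projected positions."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)  # mapped distances d_ij
    residual = ((X_in - d) ** 2).sum()
    scale = (X_in ** 2).sum()             # one common choice of the scaling constant
    return float(np.sqrt(residual / scale))

# a layout that perfectly reproduces the input distances has stress 0
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
```

Any distortion of the layout, e.g. stretching it, raises the stress above 0.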

As stated before, MDS is not just one method, but a collection of methods. Different algorithms can be used to obtain the geometrical representation of the similarities, and this goes together with the existence of a number of multidimensional scaling models. Numerous different stress functions exist as well. For a comprehensive overview of these models and techniques see [ED91, SRY81] and [Web02].

2.2.5 Self Organizing Maps

The Self-Organizing Map (SOM), or Kohonen Map [Koh95], is a neural network algorithm based on unsupervised learning. The SOM consists of neurons located on a regular low-dimensional grid. Usually a 1- or 2-dimensional grid is used, because this is easier to visualize, and therefore easier to interpret. The lattice of the grid can be either hexagonal or rectangular. The latter is used throughout this thesis.

Each neuron k is represented by an m-dimensional prototype vector n_k = [n_k1, ..., n_km], where m is the dimension of the input feature space. In each training step a data element D_ij is selected and the nearest unit n_c (the best matching unit, BMU) is found on the map. The prototype vectors of the BMU and its neighbors on the grid are moved towards the sample vector:

n_k := n_k + α(t) h_ck(t) (D_ij − n_k)   (2.8)


where α(t) is the learning rate at training step t and h_ck(t) is a neighborhood kernel centered on the winner unit c. Both the learning rate and the neighborhood kernel decrease monotonically with time.

The SOM is similar to the k-means clustering algorithm, extending it by providing a topological structure and placing similar objects in neighboring clusters.
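One training step of Equation 2.8 can be sketched as below. The Gaussian kernel and the exponential decay schedules are assumptions: the text only requires that α(t) and h_ck(t) decrease monotonically with time.

```python
import numpy as np

def som_step(prototypes, grid_pos, x, t, alpha0=0.5, sigma0=2.0, tau=100.0):
    """One SOM training step (eq. 2.8) on a rectangular grid.
    prototypes: (units, m) prototype vectors; grid_pos: (units, 2) grid coordinates."""
    alpha = alpha0 * np.exp(-t / tau)                       # learning rate alpha(t), decays with time
    sigma = sigma0 * np.exp(-t / tau)                       # neighborhood radius, decays with time
    c = np.argmin(np.linalg.norm(prototypes - x, axis=1))   # best matching unit (BMU)
    grid_d2 = ((grid_pos - grid_pos[c]) ** 2).sum(axis=1)   # squared grid distance to the BMU
    h = np.exp(-grid_d2 / (2.0 * sigma ** 2))               # Gaussian neighborhood kernel h_ck(t)
    return prototypes + alpha * h[:, None] * (x - prototypes)

# 3x3 rectangular grid of 2-D prototypes, all initialized at the origin
grid = np.array([[i, j] for i in range(3) for j in range(3)], dtype=float)
protos = som_step(np.zeros((9, 2)), grid, np.array([1.0, 1.0]), t=0)
```

The BMU moves furthest towards the sample, and units farther away on the grid move progressively less, which is what produces the topological ordering of the map.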

2.3 Clustering

Determining clusters essentially involves arranging data points into groups and separating them from other groups of data points. Clustering of large high-dimensional data sets is an important problem in current research. Applications range from clustering multimedia and text data to segmentation in CAD, product and customer databases. There are a number of different clustering algorithms that are applicable to very large data sets, and a few that address high-dimensional data.

In high-dimensional data sets, it is very unlikely that data points are nearer to each other than the average distance between data points, because the space is only sparsely filled. As a result, as the dimensionality of the space increases, the difference between the distance to the nearest and the farthest neighbors of a data object goes to zero [BGRS99, HAK00, AHK01]. This phenomenon, also known as the 'curse of dimensionality', is a great problem for most clustering algorithms, because they rely on distance measures to determine similarity between data points. Basically, clustering algorithms can be divided into four different types:

1. Partitioning algorithms

2. Hierarchical algorithms

3. Locality-based algorithms

4. Grid-based algorithms

Each type and a few of its representatives will be discussed in the following sections.

2.3.1 Partitioning algorithms

Given a data set D with n elements, and k being the number of desired clusters, where k ≤ n, a partitioning algorithm divides the elements into k clusters. The single elements are assigned to the different clusters by optimizing an objective criterion, for example a distance function where each element is assigned to the closest cluster.


Clusters are typically represented by either the mean of the elements assigned to the cluster, or by one representative element of the cluster. Examples of such algorithms are k-means [Mac67] and k-medoid [KR90]. CLARANS (Clustering Large Applications based upon RANdomized Search) [NH94] is a partitioning algorithm developed for large data sets. It uses a randomized and bounded search strategy to improve the scalability of the k-medoid algorithm.

2.3.2 Hierarchical algorithms

Hierarchical clustering algorithms work by grouping data objects into a hierarchy of clusters. The hierarchy can be formed top-down (divisive hierarchical methods) or bottom-up (agglomerative hierarchical methods). Hierarchical algorithms rely on a distance function to measure the similarity between clusters. One well-known example is the linkage algorithm in all its variations (single, average, complete, etc.). The scalability of these algorithms with respect to the number of elements in a data set is poor because of their computational complexity (O(n²)).

CURE (Clustering Using REpresentatives) [GRS98] shares this disadvantage but partly avoids it through a combination of random sampling and partitioning. It employs an approach that uses a fixed number of representative points to define a cluster instead of one single data element or centroid. CURE produces high-quality clusters in the presence of outliers and can identify clusters of different sizes and complex shapes.

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) [ZRL96], on the other hand, has a computational complexity of O(n) because of its use of a special data structure called the Cluster Feature Tree (CF-tree). The CF-tree stores summary information about subclusters of data elements. BIRCH only performs well on data sets containing spherical clusters because of the similarity measures it uses to determine which data elements to summarize.

2.3.3 Locality-based algorithms

Locality-based clustering algorithms group neighboring data elements into clusters based on local conditions. One representative of this type of algorithm is DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [EKSX96]. Clusters are seen as dense regions of data elements that are separated by regions of low density in the input space. The basic idea is that the density of data elements in a radius around each data element in a cluster has to be above a certain threshold. The cluster is enlarged as long as, for each data element within the cluster, a neighborhood of a given radius contains at least a minimum number of data elements. In its original form DBSCAN has


a computational complexity of O(n²); when a spatial index is applied, this can be reduced to O(n log n). The algorithm is very sensitive to its parameter choice, and cannot handle data sets that contain clusters with different densities. Another representative of this type is OPTICS (Ordering Points To Identify the Clustering Structure) [ABKS99].

2.3.4 Grid-based algorithms

Grid-based clustering algorithms do not suffer from the nearest-neighbor problem in high-dimensional spaces. Examples include STING (STatistical INformation Grid) [WYM97], CLIQUE [AGGR98], DENCLUE [HK98], WaveCluster [SCZ98], and MAFIA (Merging Adaptive Finite Intervals And is more than a clique) [HNC99]. These methods divide the input space into hyper-rectangular cells, discard the low-density cells, and then combine adjacent high-density cells to form clusters. Grid-based methods are capable of discovering clusters of any shape and are also reasonably fast. However, none of these methods addresses how to efficiently cluster very large data sets that do not fit in memory. Furthermore, these methods only work well with input spaces of low to moderate dimensionality. As the dimensionality of the space increases, grid-based methods face serious problems: the number of cells grows exponentially and finding adjacent high-density cells to form clusters becomes prohibitively expensive [HK99].

In order to address the 'curse of dimensionality', a couple of algorithms have focused on data projections in subspaces. Examples include PROCLUS [APW+99], OptiGrid [HK99] and ORCLUS [AY00]. For the visual analytics framework (see Chapter 5), O-Cluster (Orthogonal partitioning CLUSTERing) [MC02] with additional visual support (see Chapter 4) is used.

2.4 Summary

In this chapter an overview of binning, projections and clustering was presented. In addition to the theoretical work, some algorithms that are implemented in the visual analytics framework (see Chapter 5) were described.


Chapter 3

GridView Concept

Visual analytics and visualization techniques have proven to be of great value in analyzing and exploring large data sets, since presenting data in an interactive, graphical form often fosters new insights, encouraging the formation and validation of new hypotheses to the end of better problem solving and gaining deeper domain knowledge [TC05, vW05]. But the increasing size and dimensionality of today's data sets poses a challenge for developers of visualization models and methods.

Regarding the high dimensionality of the data sets, numerous visualization techniques are available [GTC01]. Although many sophisticated techniques exist, many applications use simple charts, such as bar charts, scatterplots and pie charts. These techniques, however, can only deal with 2- or 3-dimensional data. One way to obtain such input from a high-dimensional data set is to compose a low-dimensional embedding of the data. That is, the data points are mapped into a low-dimensional space (i.e., 2-D or 3-D) in a way that captures certain structured components without losing the essence of the data. There are numerous techniques to do this, including principal component analysis [ED91, Web02], multidimensional scaling [SRY81] and self-organizing maps [Koh95] (cf. Section 2.2).

Regarding the size of the data set, Eick and Karr [EK02] proposed a scalability analysis and came to the conclusion that many visualization metaphors do not scale effectively, even for moderately sized data sets. Scatterplots, for example, one of the most useful graphical techniques for understanding relationships between two variables, can be overwhelmed by a few thousand points [EK02].

Additionally, there are two limiting factors for all visualization techniques [KS05]: human perception and display area. Human perception, i.e. the precision of the eye and the ability of the human mind to process visual patterns, limits the number of perceptible pixels and therefore affects visual scalability directly. Monitor resolution affects visual scalability through both the physical size of displays and the pixel resolution. In typical application scenarios, monitor resolution rather than human vision is the limiting factor.

Based on these observations, and the assumption that we have a good 2-dimensional embedding of the data, the analysis of large data sets reveals two major tasks. The first is the question of how visualizations for massive data sets can be constructed without losing important information, even if the number of data points is too large to visualize each single data point at full detail. The second important task is to find techniques to efficiently navigate and query such large data sets.

Our solution for these tasks is GridView, a technique to visualize and explore 2-dimensional embeddings of very large high-dimensional data sets. GridView is actually not one view, but a collection of different views that operate on the same data set. Users can switch between the different views to get different information and more insight about the data set at hand.

We define some basic notions about grids in Section 3.1. Then the different views that form GridView are presented in Sections 3.2.1 to 3.2.5. A summary of the chapter is given in Section 3.3.

3.1 Basic Notions

As the name GridView implies, the technique is based upon the grid metaphor. The (projected) 2-dimensional data space is divided into a collection of grid cells.

Definition 1 Data Set: A data set D in this context is a two-dimensional data set, by origin or projection, with n data points. Each data point p is defined by an x-coordinate and a y-coordinate, denoted as p_x and p_y. Each point p has a label, for example a class label or a cluster label, denoted by p_label. The data set is normalized linearly, so each x- and y-coordinate of a point has a value between 0 and 1.

Based upon a data set D, a grid is created.

Definition 2 Grid: A grid G for the two-dimensional data set D is defined by two sets of separators Hx = {Hx1, ..., Hxk} and Hy = {Hy1, ..., Hyl}. This results in a grid with (k + 1) · (l + 1) grid cells.

An example of a regular, equally spaced 2-dimensional grid can be seen in Figure 3.1. The separators that define the grid are determined automatically by applying the so-called zero-stage rule [Sco92, Wan97]. This is a rule that determines the optimal bin width for histograms given a certain data distribution (see Section 2.1). We first apply this rule to the x-values of the data set, resulting in the separator set Hx. The same we


Figure 3.1: Example of a regular equally spaced two-dimensional grid.

do for the y-values of the data set. This results in a cell collection in which each cell has an 'optimal' width and height. A cell collection is defined as follows:

Definition 3 Cell Collection: A cell collection C is defined by a number of single cells. Here C(x) denotes the number of cells on the x-axis and C(y) the number of cells on the y-axis. C(w) denotes the width of a single cell and C(h) the height of a single cell. C(n) denotes the number of data points in the collection.

A cell in such a collection is defined as follows.

Definition 4 Cell: A cell c is defined by an x-coordinate and a y-coordinate. c_ij denotes the cell that is the i-th cell counted from the left and the j-th cell counted from above. c_ij(n) denotes the number of points in cell c_ij.

With the help of a simple mapping function, each single data point is assigned to one single cell. This procedure is not always necessary; for example, the input of a SOM does not have to contain single-point information.

Definition 5 Mapping Function: A data point p is mapped onto a grid cell c_ij by the following function:

i = ⌊p_x · C(x)⌋ + 1   (3.1)

j = ⌊p_y · C(y)⌋ + 1   (3.2)

After the mapping is done each single cell has a point collection.

Definition 6 Point Collection: A point collection P is defined by a set of points. Here P(n) denotes the number of points in the collection. A point collection can belong to the grid, i.e. the whole data set, or to a single cell, i.e. a subset of the data.
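The mapping of Equations 3.1 and 3.2 can be sketched as follows. The clamping of coordinates equal to 1.0 into the last row and column is an assumption added here, since the formulas would otherwise place such boundary points outside the grid:

```python
import math
from collections import defaultdict

def assign_points(points, cx, cy):
    """Assign normalized 2-D points to grid cells using eqs. 3.1 and 3.2.
    points: iterable of (px, py, label) with coordinates in [0, 1];
    cx, cy: number of cells per axis. Returns a point collection per cell."""
    cells = defaultdict(list)
    for px, py, label in points:
        i = min(math.floor(px * cx) + 1, cx)    # clamp px = 1.0 into the last column (assumption)
        j = min(math.floor(py * cy) + 1, cy)    # clamp py = 1.0 into the last row (assumption)
        cells[(i, j)].append((px, py, label))
    return cells

cells = assign_points([(0.05, 0.05, "a"), (0.95, 0.95, "b"), (1.0, 0.0, "a")], cx=4, cy=4)
```

Each cell of the resulting dictionary holds exactly the point collection of Definition 6.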


3.2 Views

As stated before, GridView is not one single view, but a collection of views. To switch between the different views, a view manager is contained in GridView. All the views are supported by a class manager. This control makes it possible to select which classes of the data set are visualized and which are not. Besides the possibility to (de)select single classes, 'select all' and 'deselect all' functionality is offered. In addition to the class manager, a color manager is available. This control offers the possibility to determine the color of each single class, the color of the grid and the grid background color. These three controls, together with the views presented in the following sections, form GridView.

3.2.1 Scatterplot View

A scatterplot or scatter graph is a graph used to visually display data by displaying only finitely many points, each having a coordinate on a horizontal (x) and a vertical (y) axis. Figure 3.2 shows examples of a scatterplot view. The left figure shows one

Figure 3.2: Examples of a scatterplot view.

class, the center figure shows a different class, and the right figure shows the view that results when both classes are rendered. By selecting one or more cells, a scatterplot of only that region is shown at full screen. Because of the extra monitor space relative to the selected region, more details of the data can be seen this way.

The scatterplot is used in many applications with great success, but as stated before, a scatterplot is overwhelmed by a few thousand points. Let us look at the example in Figure 3.2. It is easy to see in the right figure that the green class is rendered last. This causes major overlap with the red points, which cannot be seen anymore. This inter-class overlap can give a false idea about the data. The same happens with intra-class overlap. Looking at the green class, we see in the middle a large, completely


green area. We cannot see anymore how many green points are mapped to one single pixel; this can be one, but also 1000. So information about the density of the data points is lost in a scatterplot with many points. That is why we decided to develop a view that overcomes these scalability problems. This so-called class view is described in the following section.

3.2.2 Class View

The class view summarizes the information contained in each single cell and represents this information by coloring the cell, where the percentage of a color in the cell is equal to the percentage of points of the corresponding class in that cell.

Definition 7 Coloring Function: For each class in a grid cell c_ij, the number of points belonging to that class is divided by the total number of points c_ij(n) in that cell. The resulting values, which are between 0 and 1 and add up to 1, determine the area of the cell that is colored with the corresponding class colors. The order of the colors depends on the alphabetical order of the class labels.
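The per-class fractions of Definition 7 can be sketched directly; the function name and the (x, y, label) point tuples are illustrative choices, not the framework's API:

```python
from collections import Counter

def class_fractions(cell_points):
    """Fractions of each class in a cell (Definition 7), in alphabetical label order."""
    counts = Counter(label for _, _, label in cell_points)
    total = sum(counts.values())
    return [(label, counts[label] / total) for label in sorted(counts)]

fractions = class_fractions([(0.1, 0.2, "red"), (0.3, 0.1, "green"), (0.2, 0.2, "green")])
# the fractions lie between 0 and 1 and add up to 1
```

Each fraction determines the share of the cell's area painted in the corresponding class color.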

Figure 3.3 shows an example of the class view on the same data as the scatterplot view in Figure 3.2.

Figure 3.3: Example of a class view.


By selecting one or more cells, a class histogram of the selected cell(s) is shown. The point collections of the selected cells are merged and the representatives of the different classes are counted. This information is then presented as a histogram, as shown in Figure 3.4. By mousing over a histogram bar, more detailed information, such as exact count,

Figure 3.4: Example of a class histogram view.

relative frequency and class label will be shown.

A problem with the class view is the lack of information regarding the point density in a cell. For example, a cell with one green and one red point will look almost the same as a cell with 900 green and 1000 red points. That is why we decided to use alpha blending (α-blending) to visualize the point density. Each color in our framework is defined by an RGB value and an α-value that defines the transparency of the color. The α-parameter can have a value between 0 and 255. We decided to use a minimum value of 100 for the most sparse cell(s) of the view, and the maximum value of 255 for the most dense cell(s) of the view. All other cells have an α-value depending on the number of points in the cell. The value is computed with the help of a linear scale between the minimum and the maximum value. Figure 3.5 shows an example of the class view with α-blending. Although in many situations the class view with α-blending gives a good overview of the data set with respect to density and distribution, we were not satisfied. Especially in very sparse areas, the scatterplot view gives more information than the class view. That is why we decided to combine the strengths of the class view with the strengths of the scatterplot view in the form of a mixed view. This view is described in detail in the following section.
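The linear α-scale described above can be sketched as follows; the handling of the degenerate case where all cells are equally dense is an assumption added here:

```python
def cell_alpha(n_points, n_min, n_max, a_min=100, a_max=255):
    """Linearly scale a cell's point count to an alpha value between a_min and a_max."""
    if n_max == n_min:
        return a_max                                 # all cells equally dense (assumed fallback)
    frac = (n_points - n_min) / (n_max - n_min)      # 0 for the sparsest cell, 1 for the densest
    return round(a_min + frac * (a_max - a_min))

# the sparsest cell gets alpha 100, the densest gets 255
sparse, dense = cell_alpha(1, 1, 1000), cell_alpha(1000, 1, 1000)
```

Intermediate cells are interpolated between the two endpoints, so denser cells are rendered more opaquely.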


Figure 3.5: Example of a class view with α-blending to indicate the point density.

3.2.3 Mixed View

This view is a mix of the scatterplot view and the class view. By setting an overplotting factor, the view determines for each single cell whether it should be visualized as a scatterplot view or as a class view. The overplotting factor is defined as the percentage of the points in the cell that overlap each other. Both inter- and intra-class overlap are taken into account. Figure 3.6 shows an example of a mixed view. The applied overplotting factor can be adjusted by moving a slider; the mixed view is updated instantly.

3.2.4 Density View

The density view gives an overview of the point distribution of a data set. For each cell c of the view the density is computed.

Definition 8 Density: The density d of a cell c is defined by the following function:

d(c) = c(n) / max{ c′(n) | c′ ∈ C }   (3.3)

In other words, the cell that contains the largest number of points is determined. Let us call this number of points p_max; then for each cell the number of points contained in that


Figure 3.6: Example of a mixed view.

cell is divided by p_max, resulting in the cell's density. All the computed densities have a value between 0 and 1. These values are mapped onto a colormap, and the cell is colored accordingly. Figure 6.3 shows an example of a density view. By mousing over a cell, the exact density value appears in the cell. We see from the density view that in the upper left of the projected data space there is a very dense area, surrounded by a larger dense area. As we go further towards the borders of the projected space, the density decreases. Black cells indicate that no data points are mapped onto that cell.
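Equation 3.3 reduces to a single division per cell. A minimal sketch, with the dictionary of per-cell counts as an assumed input shape:

```python
def cell_densities(counts):
    """Density per cell (eq. 3.3): each cell's point count divided by the maximum count p_max."""
    p_max = max(counts.values())
    return {cell: n / p_max for cell, n in counts.items()}

d = cell_densities({(1, 1): 10, (1, 2): 5, (2, 1): 0})
# densities lie between 0 and 1; the densest cell has density 1, empty cells have density 0
```

The resulting values in [0, 1] are then looked up in the colormap to color each cell.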

3.2.5 Purity View

The purity view indicates how pure the cells are. Pure in this context refers to how many points of how many different classes are mapped into one cell. Therefore we defined a simple but effective purity measure.

Definition 9 Purity: The purity of a cell c is defined by the following function:

p(c) = (number of points of the dominating class in c) / c(n)   (3.4)

The dominating class in a cell is defined as the class with the most representative points


Figure 3.7: Example of a density view.

in that cell. The purity of the whole projection, i.e. the cell collection, is defined by:

p(C) = ∑_{i=1}^{C(x)} ∑_{j=1}^{C(y)} p(c_ij) · c_ij(n) / C(n)   (3.5)

All the computed single-cell purities have a value between 0 and 1, and are mapped onto a colormap. The cells are then colored accordingly. The computed overall purity measure is shown in a text box. Figure 6.4 shows an example of a purity view. By mousing over a cell, the exact purity value appears in the cell. We see from the purity view that at the borders of the projected data space the purity is very high; as we move towards the center, the purity decreases.
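Equations 3.4 and 3.5 can be sketched together; the per-cell label lists and function names are illustrative assumptions:

```python
from collections import Counter

def cell_purity(labels):
    """Purity of one cell (eq. 3.4): share of the points belonging to the dominating class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

def overall_purity(cells, total):
    """Overall purity of the projection (eq. 3.5): cell purities weighted by relative cell size."""
    return sum(cell_purity(labels) * len(labels) / total for labels in cells.values())

cells = {(1, 1): ["a", "a", "b"], (1, 2): ["b"]}
p = overall_purity(cells, total=4)
```

Because each cell's purity is weighted by its share of the points, large impure cells lower the overall measure more than small ones.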

3.3 Summary

In this chapter we presented GridView, a collection of different views that makes it possible to view and explore very large data sets. The strengths of the concept are the intuitive user interface, multiple views of the same data with a single click, and easy selection of classes, regions and subsets. A disadvantage of the concept is that it can


Figure 3.8: Example of a purity view.

only handle 2-dimensional data sets, so the quality of the view depends on the quality of the embedding. The use of GridView only makes sense when the data has class labels. Future work will be the automatic determination of a good order for the colors in the class view, other cell forms, and other representations of the information in a cell. One very interesting research direction is the use of pie charts in the cells, positioned with the colors directed to their class centers (cf. [BST00]).

GridView will be available as a visualization node within KNIME (www.knime.org). This is a modular data exploration platform that enables the user to visually create data flows (often referred to as pipelines), selectively execute some or all analysis steps, and later investigate the results through interactive views on data and models. The platform is developed at the Chair for Applied Computer Science, led by Professor Michael R. Berthold, at the University of Konstanz, Germany.


Chapter 4

Visual Supported O-Clustering

Clustering of very large high dimensional data sets is an important problem. Applications range from clustering multimedia and text data to segmentation in CAD, product and customer databases. There are a number of different clustering algorithms that are applicable to very large data sets, and a few that address high dimensional data (see Section 2.3). Many researchers have focused on automatic clustering algorithms, but very few have addressed the human factor in the clustering process. Clustering requires the specification of particular problem formulations, objective functions, and parameters. We think the human cannot be ignored in cluster analysis, and therefore developed a visual supported clustering algorithm that gives the user visual feedback and the possibility to interact with the algorithm. Our motivations for developing such an algorithm are the following:

• Through visual feedback, users gain a deeper understanding of the data set at hand.

• Through visual feedback, users gain a deeper understanding of the used clustering algorithm and can therefore better interpret the results of the algorithm.

• When the data set to cluster is already labeled, visual supported clustering can be very helpful to select regions of interest that deserve further attention and analysis.

• Through user interaction, users can use their domain knowledge to improve the results of the clustering algorithm.

The idea to combine an advanced clustering algorithm with visualization and interaction techniques for an effective interactive clustering of the data is not new. Aggarwal developed the IPCLUS algorithm [Agg01], Hinneburg et al. [HKW99] built the HD-Eye


system based on the OptiGrid algorithm [HK99], Tejada and Mingjim [TM05] improved the IPCLUS algorithm by developing the HC-Enhanced algorithm, and Keke and Ling [CL03] developed a visual framework to steer, monitor and refine the clustering process with domain knowledge. But as far as we know, the idea has never been applied to the clustering algorithm of our choice.

The clustering algorithm we decided to build upon is the so-called O-Cluster algorithm [MC02]. O-Cluster stands for Orthogonal partitioning clustering and was developed at Oracle Data Mining Technologies. The reasons why we chose this algorithm are the following:

• High quality of the obtained clustering,

• Robustness to noise, and

• Linear scalability.

Before presenting the algorithms we explain some basic notions, following [HKW99], on projections, separators and multidimensional grids in Sections 4.1 and 4.2. Section 4.3 presents the original O-Cluster algorithm, and Section 4.4 introduces our visual supported O-Cluster algorithm. A summary of the chapter can be found in Section 4.5.

4.1 Projections and separators

Determining clusters essentially involves arranging data points into groups and separating them from other groups of data points. Using the point density for clustering, clusters become separated by the valleys between two maxima. O-Cluster and its visual supported version use lower dimensional projections of the high dimensional data to effectively determine the separations. Useful projections must be contracting, since only contracting projections provide the upper bound property (see Lemma 1) necessary for detecting the separating valleys correctly.

Definition 10 Contracting projection: A contracting projection for a given m-dimensional data space S and an appropriate metric ‖·‖ is a linear transformation P defined for all points x ∈ S:

P(x) = A · x, with ‖A‖ = max_{y∈S} (‖Ay‖ / ‖y‖) ≤ 1    (4.1)

Figure 4.1 shows an example of the difference between general and contracting projections.
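A simple concrete case of Definition 10: an axis-parallel projection onto a single dimension is contracting, since |x_i| ≤ ‖x‖ for the Euclidean norm, i.e. ‖A‖ ≤ 1 in Eq. (4.1). A minimal numeric sketch (illustrative names, not the thesis implementation):

```python
# An axis-parallel projection P(x) = x_i never increases the Euclidean
# length of a point, so it satisfies the contracting property.
import math

def project(x, i):
    """Axis-parallel contracting projection P(x) = x_i."""
    return x[i]

def norm(x):
    return math.sqrt(sum(v * v for v in x))

x = [3.0, 4.0]
assert abs(project(x, 0)) <= norm(x)  # |3| <= 5: the projection contracts
```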


Figure 4.1: General (left) and contracting projections (right)

Lemma 1 states that the density at a point x′ in a projected space of the data is an upper bound for the density on the plane orthogonal to the projection plane in the original feature space. The lemma shows a way of determining separators that partition the data without dividing the clusters.

Lemma 1 Upper bound property of contracting density projections: Let P(x) = A · x be a contracting projection as defined in Definition 10, P(D) the projection of the data set D, and f^{P(D)}(x′) the density for a point x′ ∈ P(S). Then, using the same kernel for f^D and f^{P(D)}:

∀x ∈ S with P(x) = x′ : f^{P(D)}(x′) ≥ f^D(x)    (4.2)

In other words, Lemma 1 states that a set of points that can be separated in a contracting projection P with a small error is also separated in the original data space S without a large error.
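The upper bound property can be checked numerically. The following sketch (our illustrative code, assuming a Gaussian kernel applied to distances, the same kernel in both spaces as the lemma requires) compares the kernel density of a small 2D data set with the density of its projection onto the first axis:

```python
# Numeric illustration of Lemma 1: the density of the projected data at
# x' = P(x) upper-bounds the density of the original data at x, because
# distances can only shrink under a contracting projection.
import math

def gauss(u):  # 1D Gaussian kernel
    return math.exp(-u * u / 2.0) / math.sqrt(2.0 * math.pi)

D = [(0.0, 0.0), (0.1, 3.0), (-0.2, -2.5)]  # toy 2D data set

def f_D(x):    # kernel density estimate in the original 2D space
    return sum(gauss(math.dist(x, d)) for d in D) / len(D)

def f_PD(x1):  # kernel density of the projection onto the first axis
    return sum(gauss(x1 - d[0]) for d in D) / len(D)

x = (0.0, 1.0)
assert f_PD(x[0]) >= f_D(x)  # upper bound property at x' = P(x)
```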

Separation is done by a separator, which is defined as follows:

Definition 11 Separator: A separator is a geometric object that partitions R^d into two half spaces h0, h1. The decision function H(x) determines the half space where a point x ∈ R^d is located:

H(x) = { 1  if x ∈ h1
       { 0  else    (4.3)

In the remainder of this chapter, separators are hyperplanes defined by a split value xs in a 1-dimensional projection. The decision function H(x) is then defined as follows:

P: R^d → R^1, H(x) = { 1  if P(x) ≥ xs
                     { 0  else    (4.4)


Figure 4.2: Example of a regular multidimensional grid.

4.2 Multidimensional Grid

The combination of several separators results in a multidimensional grid. Since we cannot store the grid explicitly in high-dimensional space, we need a coding function c that assigns a label to all points belonging to the same grid cell. We define a general notion of our regular grid as follows:

Definition 12 Multidimensional grid: A multidimensional grid G for the data space S is defined by a set of separators H = {H1, ..., Hk}. The coding function c_G: S → N is defined as follows:

x ∈ S, c(x) = Σ_{i=1}^{k} 2^i · H_i(x)    (4.5)

where N stands for the space of natural numbers.

The grid notation lets us determine the relevant subsets efficiently. An example of a regular 2-dimensional grid can be seen in Figure 4.2.
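The decision function of Eq. (4.4) and the coding function of Eq. (4.5) can be sketched together (illustrative code; separators are represented here as axis-parallel (dimension, split value) pairs, an assumption matching the axis-parallel setting used later):

```python
# Every point falling into the same grid cell receives the same label,
# computed from the binary decisions of the individual separators.

def H(x, dim, x_s):
    """Decision function of Eq. (4.4): 1 if the point lies in half space h1."""
    return 1 if x[dim] >= x_s else 0

def cell_label(x, separators):
    """Coding function c(x) = sum_{i=1..k} 2^i * H_i(x) of Eq. (4.5)."""
    return sum(2 ** i * H(x, dim, x_s)
               for i, (dim, x_s) in enumerate(separators, start=1))

separators = [(0, 0.5), (1, 0.5)]  # two separators -> up to 4 grid cells
assert cell_label([0.2, 0.9], separators) == 4  # H1=0, H2=1 -> 2^2
assert cell_label([0.7, 0.9], separators) == 6  # H1=1, H2=1 -> 2 + 4
```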

4.3 O-Cluster Algorithm

This section describes the O-Cluster algorithm as proposed by [MC02] in detail. The algorithm is based on the OptiGrid clustering algorithm [HK99]. OptiGrid constructs a grid partitioning of the data by calculating the partitioning hyperplanes using contracting projections of the data. Although OptiGrid has an excellent ability to find clusters in high dimensional spaces in the presence of noise, it is sensitive to parameter choice, and


Figure 4.3: An overview of the O-Cluster algorithm, taken from [MC02]

it does not prescribe a strategy to efficiently handle data sets that do not fit in memory. The O-Cluster algorithm combines active sampling with an axis-parallel partitioning strategy in order to overcome these problems.

The algorithm operates recursively. It evaluates possible separators H = {H1, ..., Hn} for all projections in a partition, selects the 'best' one, and splits the data into two new partitions. The algorithm proceeds by searching for possible separators inside the newly created partitions. Thus the algorithm creates a multidimensional grid. Figure 4.3 presents an overview of the algorithm. The main processing steps of the algorithm are as follows:

1. Load data buffer: If the entire data set does not fit in the buffer, a random sample is used. The algorithm assigns all points from the initial buffer to a root partition that is set active.


2. Compute histograms for active partitions: The goal is to determine a set of projections for the active partitions and compute histograms along these projections. Any partition that represents a leaf in the clustering hierarchy and is not explicitly marked ambiguous or frozen is considered active. The process whereby an active partition becomes ambiguous or 'frozen' is explained in Step 4. It is essential to compute histograms that provide good resolution but also have data artifacts smoothed out, so an effective binning strategy (see Section 2.1) should be implemented. The algorithm is robust with respect to different binning strategies as long as the histograms do not significantly undersmooth or oversmooth the distribution density.

3. Find best splitting points for active partitions: For each histogram, the algorithm attempts to find the best valid cutting plane, if one exists. A valid cutting plane passes through a point of low density (a valley) in the histogram. Additionally, the point of low density should be surrounded on both sides by points of high density (peaks). The algorithm attempts to find a pair of peaks with a valley between them where the difference between the peak and the valley histogram counts is statistically significant. Statistical significance is tested using a standard χ² test:

   χ² = 2(o − e)² / e ≥ χ²_{α,1},    (4.6)

   where o, the observed value, is equal to the histogram count of the valley and e, the expected value, is the average of the histogram counts of the valley and the lower peak. Our implementation, as suggested by the authors, uses a 95% confidence level (χ²_{0.05,1} = 3.843).

Since multiple splitting points per partition can be found to be valid separators according to this test, the algorithm chooses the one where the valley has the lowest histogram count o as the best splitting point. Thus the cutting plane goes through the area with the lowest density.

4. Flag ambiguous and frozen partitions: If no valid splitting points are found, the algorithm checks whether the χ² test would have found a valid splitting point at a lower confidence level (for example 90%). If that is the case, the current partition can be considered ambiguous. More data points are needed to establish the quality of the splitting point. If no splitting points were found and there is no ambiguity, the partition can be marked as frozen and the records associated with it marked for deletion from the active buffer.

5. Split active partitions: If a valid separator H exists, the data points are split along the cutting plane and two new active partitions h0 and h1 are created from


the original partition. For each new partition the algorithm proceeds recursively from Step 2.

6. Reload buffer: This step takes place after all recursive partitioning on the current buffer has completed. If all existing partitions are marked as frozen and/or there are no more data points available, the algorithm exits. Otherwise, if some partitions are marked as ambiguous and additional unseen data records exist, the algorithm proceeds with reloading the data buffer. The new data replace records belonging to frozen partitions. When new records are read in, only data points that fall inside ambiguous partitions are placed in the active buffer. New records falling within a frozen partition are not loaded into the buffer. If it is desirable to maintain statistics of the data points falling inside partitions (including the frozen partitions), such statistics can be continuously updated with the reading of each new record. Loading of new records continues until either: 1) the active buffer is filled again; 2) the end of the data set is reached; or 3) a reasonable number of records have been read, even if the active buffer is not full and there are more data. The reason for the last condition is that if the buffer is relatively large and there are many points marked for deletion, it may take a long time to fill the entire buffer with data from the ambiguous regions. To avoid excessive reloading under these circumstances, the buffer reloading process is terminated after reading through a number of records equal to the data buffer size. Once the buffer reload is completed, the algorithm proceeds from Step 2. The algorithm requires, at most, a single pass through the entire data set.
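The statistical test of Step 3 can be sketched as follows (a minimal sketch; the helper names are ours, not from [MC02]): for a candidate valley, o is the valley count, e the average of the valley and the lower neighbouring peak, and the split is valid when the χ² statistic exceeds the 95% threshold of 3.843.

```python
# Valley validity test of Step 3: chi2 = 2*(o - e)^2 / e >= chi2_{0.05,1}.

CHI2_95 = 3.843  # 95% confidence threshold used by the algorithm

def split_is_valid(lower_peak, valley, threshold=CHI2_95):
    """True if the peak/valley count difference is statistically significant."""
    e = (valley + lower_peak) / 2.0  # expected count
    if e == 0:
        return False
    chi2 = 2.0 * (valley - e) ** 2 / e
    return chi2 >= threshold

# A pronounced valley of 10 between peaks, lower peak 40:
assert split_is_valid(lower_peak=40, valley=10)      # chi2 = 2*15^2/25 = 18
assert not split_is_valid(lower_peak=12, valley=10)  # chi2 ~ 0.18: no split
```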

4.3.1 Sensitivity parameter

The effect of creating spurious clusters due to splitting artifacts can be alleviated by using the algorithm's sensitivity (ρ) parameter. ρ is a parameter in the [0, 1] range that is inversely proportional to the minimum count required to find a histogram peak. A value of 0 requires the histogram peaks to surpass the count corresponding to a global uniform level per dimension. The global uniform level is defined as the average histogram count that would have been observed if the data points in the buffer were drawn from a uniform distribution. A value of 0.5 sets the minimum histogram count for a peak to 50% of the global uniform level. A value of 1 removes the restrictions on peak histogram counts, and the splitting point identification relies solely on the χ² test.
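The relation between ρ and the minimum peak count described above can be sketched directly (illustrative code; the function name is ours): ρ = 0 requires the full global uniform level, ρ = 0.5 half of it, and ρ = 1 no restriction.

```python
# Minimum histogram count a peak must surpass, derived from the
# sensitivity rho: (1 - rho) times the global uniform level, i.e. the
# average bin count under a uniform distribution of the buffer.

def min_peak_count(n_points, n_bins, rho):
    uniform_level = n_points / n_bins  # global uniform level per dimension
    return (1.0 - rho) * uniform_level

assert min_peak_count(1000, 20, 0.0) == 50.0  # full uniform level
assert min_peak_count(1000, 20, 0.5) == 25.0  # 50% of the uniform level
assert min_peak_count(1000, 20, 1.0) == 0.0   # no restriction
```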


4.3.2 Complexity

The algorithm can use an arbitrary set of projections. The current implementation is restricted to projections that are axis-parallel. The histogram computation step is of complexity O(n · m), where n is the number of data points in the buffer and m is the number of dimensions. The selection of the best splitting point for a single dimension is O(b), where b is the average number of histogram bins in a partition. Choosing the best splitting point over all dimensions is O(m · b). The assignment of data points to newly created partitions requires a comparison of an attribute value to the splitting point, and the complexity has an upper bound of O(n). Loading new records into the data buffer requires their insertion into the relevant partitions. The complexity associated with scoring a record depends on the depth of the binary clustering tree T. The upper limit for filling the whole active buffer is O(n · T). The depth of the tree depends on the data set. In general, the total complexity can be approximated as O(n · m). It is shown in [MC02] that the algorithm scales linearly with the number of records and the number of dimensions.

4.4 Visual Supported O-Cluster Algorithm

This section presents our proposed visual supported O-Cluster algorithm in detail. The main difference between the O-Cluster algorithm and its visual supported version lies in the visual determination of the separators, and the possibility to use the user's domain knowledge by manually selecting certain projections. Through visual feedback, users gain a deeper understanding of the data set. Therefore, they can identify additional projections and separators that the original O-Cluster algorithm cannot find. Figure 4.4 presents an overview of the algorithm.

When the algorithm is started, a setup screen appears. Here the user has to supply the following parameters:

Splitting confidence: This parameter is used in Step 4 of the algorithm. By default this parameter is set to 95%, as suggested by the authors.

Sensitivity: This is the parameter discussed in Section 4.3.1. The default value of this parameter is 0.

Minimal cluster size: We found this a useful parameter to add to the algorithm. The original O-Cluster algorithm can lead to a large number of clusters that are very small compared to the data set size. Therefore we decided to give the user the possibility to supply a minimum cluster size, i.e. the minimum number of data points in a cluster.


Figure 4.4: An overview of the Visual Supported O-Cluster algorithm


Visual support: When visual support is not selected, the algorithm runs as described in Section 4.3; otherwise the algorithm runs as described in the following.

The main processing steps of the algorithm are as follows:

1. Load data buffer: as in the O-Cluster algorithm. Although we only implemented random sampling, other sampling strategies can be used here.

2. Compute histograms for active partitions: as in the O-Cluster algorithm. As binning strategy for the projections we used the zero-stage rule as suggested by [Sco92] and [Wan97].

3. Find best splitting points for active partitions: as in the O-Cluster algorithm.

4. Flag ambiguous and frozen partitions: as in the O-Cluster algorithm.

5. (a) Visualize active splitting point: The 'best' of the found splitting points will be set to active and visualized. After the visualization is rendered, the algorithm proceeds with the next step.

(b) User interaction: The user has different interaction possibilities such as accepting or rejecting the proposed splitting point, changing the splitting point (cf. Section 4.4.2), accepting the partition (i.e. the partition state is set to 'frozen'), or manually selecting projections (cf. Section 4.4.3). Depending on the actions taken by the user, the next step of the algorithm follows.

(c) Split active partition: If the user has chosen to accept a certain splitting point as a separator, the data points are split along the cutting plane and two new active partitions are created from the original partition. For each new partition the processing proceeds recursively from Step 2.

6. Reload buffer: as in the O-Cluster algorithm.

7. Exit: When the algorithm cannot automatically find any more splitting points with the current parameter settings, the user has the option to select projections (see Section 4.4.3) and to add splitting points manually.

4.4.1 Visualization of cluster tree

The visual supported O-Cluster algorithm splits the data set recursively. The recursion tree forms a cluster hierarchy that represents the cluster structure intuitively. The


Figure 4.5: A visual representation of the cluster tree.

Figure 4.6: Visual selection of separators by moving the separator to the left or right,with unlabeled and labeled data.

separator tree (or simply cluster tree) appears in the main overview window of the system (see Figure 4.5).

By clicking on a node in the cluster tree, detailed information about that cluster will appear in a property grid.

4.4.2 Visual finding of separators

When the algorithm or the user finds a good projection that separates data points into clusters, the visualization can be used to directly specify one or multiple separators. In our system users visually specify the separator by adding or moving a split line in the projected space (see Figure 4.6).


By left-clicking on a separator it can be dragged elsewhere in the projection. By right-clicking within the visualization a separator can be added, and by right-clicking on a separator, the separator can be removed. A property grid containing detailed information about the separator, and the clusters resulting from this separator, will be displayed.

4.4.3 Manual selection of projections

When users have specific domain knowledge, they can define their own projections by simply selecting a dimension. The projected space will appear in the main window of the system (see Figure 4.6). Within this dimension a separator can be placed as described in the previous section.

4.5 Summary

In this chapter we proposed a new visual clustering algorithm based on the O-Cluster algorithm. Implemented in the classifier and feature engineering system (see Chapter 5), our approach combines the strengths of an advanced automatic clustering algorithm with visualization techniques and user interaction possibilities that effectively support the clustering process by presenting the important information visually. The visualization technique uses a combination of histograms and tree maps of the data and allows users to directly specify cluster separators in the visualizations. Future work will be the improvement of the visual presentations.


Chapter 5

Visual Analytics Framework

There exist other visualization frameworks, each with their own advantages and disadvantages. Although we found none of the visualization frameworks we analyzed a hundred percent suitable for the tasks at hand, we found a lot of good ideas in them. Therefore we decided to develop our own visual analytics framework that combines the many good things we found in other visualization frameworks with some new ideas we developed. In the following we give a quick overview of the visualization frameworks we analyzed.

ORCA (ORCA: A Visualization Toolkit for High Dimensional Data [SRL+00]) is a flexible and extensible toolkit for constructing interactive and dynamic linked data viewers. The sophisticated linking and dynamic interaction, the so-called brushing, across all Orca view types is very helpful functionality. It allows many different view combinations on the same (focused) data, which can lead to interesting new insights. We applied something similar to our visualization component GridView (see Chapter 3).

CViz (CViz: Class Visualization of High Dimensional Data [DMS99]) is a visualization tool for clustering and analysis of high dimensional data sets. CViz utilizes a k-means clustering algorithm to find interesting concepts in the data. It then draws two-dimensional scatter plots by selecting pairs or triples of concepts and relating these concepts to the data examples and the other concepts. By using an animation technique called touring, CViz allows the analyst to quickly cycle through all the different pairs or triples of concepts and see how the data changes from one perspective to another. This can provide new insight into the underlying structure of the data. We found this idea very good; unfortunately we have not implemented it in our framework yet, but this is certainly something to think about for future work.

HDDVis (HDDVis: An Interactive Tool for High Dimensional Data Visualization [Min04]) is a very accessible high dimensional data visualization tool that allows users


to interactively explore data sets through both low-dimensional projections and parallel coordinates. It uses visualization techniques that are very basic, but we found the idea of implementing different dimensionality reduction techniques very suitable for our own framework.

HD-Eye (HD-Eye: Visual Mining of High Dimensional Data [HKW99]) is a framework that supports clustering visually. The basic idea of the HD-Eye system is to improve the clustering process by allowing the user to directly interact in the crucial steps of the clustering process. We found this idea very promising and applied something similar to the O-Clustering algorithm (see Chapter 4).

Visumap (Visumap: The High Dimensional Data Visualizer [Li04]) offers a comprehensive implementation of the Relational Perspective Map (RPM) technology for dimensionality reduction, along with traditional methods such as principal component analysis, multidimensional scaling, self-organizing maps, and k-means clustering. The RPM technology had some promising results on our data sets, but because Visumap is not open source, and the scripting engine of the framework is rather limiting, we were not able to add new complex functionality to the framework.

The architecture of the developed visual analytics framework, the input/output it uses and its functionality will be discussed in the following sections.

5.1 Architecture

The framework implementation is based on the Microsoft .NET Framework. It makes extensive use of the Microsoft Graphics Device Interface (GDI+) and the two free external libraries Mapack (www.aisto.com/roeder/dotnet/) and Magic (www.crownwood.net).

Mapack is a .NET class library, developed by Lutz Roeder, for basic linear algebra computations. It supports the following matrix operations and properties: Multiplication, Addition, Subtraction, Determinant, Norm1, Norm2, Frobenius Norm, Infinity Norm, Rank, Condition, Trace, Cholesky, LU, QR, Singular Value decomposition, Least Squares solver, Eigenproblem solver, and Equation System solver. The algorithms were adapted from Mapack for the Component Object Model (COM), Lapack and the Java Matrix Package.

Magic is a library developed by Crownwood Consulting Limited that provides a Microsoft look and feel for the Graphical User Interface (GUI) of the framework.

Most of the framework code is written in C#, but parts of it are in C and C++, which are wrapped in managed classes in order to use them within the .NET Framework. Currently the framework is only available for the Windows platform.

The framework consists of one single executable and a collection of accompanying DLLs (Dynamic Link Libraries). The executable contains only the GUI and some basic functionality. All the other functionality of the framework is packed into the DLLs. This way it is easy to add and update functionality.

5.2 User Interface

The user interface of the framework has the Microsoft Visual Studio .NET look and feel. This is because it uses the Magic library, which provides this look and feel along with a lot of functionality that Visual Studio also has. This includes the possibility of docking each panel of the user interface at any place the user wants. The hiding of panels is also supported. Besides this look and feel, a tracer is added that provides information about everything the user has done. Information, warnings and errors, accompanied by a time stamp, will appear in this view.

5.3 Input / Output

Currently the framework takes two kinds of files as input. The first are delimiter-separated files, where the delimiter of the files can be freely defined by the user. The files may contain column headers. Optionally, each example may be provided with a class label, indicating which class it belongs to. The second are images in the common formats (tif, jpg, png and bmp). From the images that are loaded, the feature vectors, as defined by ODT methods, will be extracted and saved into a data set. The data set that is loaded will appear in the so-called Data View. Figure 5.1 shows this view. Within this Data View each single value can be altered and formatted in different ways. As output the framework writes delimiter-separated files. All the visualizations that the framework produces can be saved in one of the common image formats. Future work will be to add database connections, so the framework can load data from a database and also write data into a database.
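The delimiter-separated input format described above can be sketched as follows (a Python sketch for illustration, not the framework's C# code; parameter names are ours): a free choice of delimiter, an optional header row, and an optional class label in the last column.

```python
# Minimal loader for the delimiter-separated format: numeric feature
# columns, optionally followed by a class-label column.
import csv
import io

def load_data(text, delimiter=";", has_header=True, has_labels=True):
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header = rows.pop(0) if has_header else None
    if has_labels:
        features = [[float(v) for v in r[:-1]] for r in rows]
        labels = [r[-1] for r in rows]
    else:
        features = [[float(v) for v in r] for r in rows]
        labels = None
    return header, features, labels

sample = "f1;f2;class\n0.1;0.2;a\n0.4;0.5;b\n"
header, X, y = load_data(sample)
assert y == ["a", "b"] and X[0] == [0.1, 0.2]
```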

5.4 Supported Tasks

The framework supports several tasks for feature and classifier engineering. The following sections provide an overview of these tasks.

Projections: Currently the framework supports the following projections: Principal Component Analysis (PCA), normalized PCA, supervised PCA, a multidimensional scaling algorithm, and a self-organizing map algorithm. For a detailed description of these algorithms see Section 2.2.


Figure 5.1: Example of the Data View

Clustering: Clustering is the classification of similar objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset share some common features. The framework supports this task by implementing the O-Clustering algorithm ([MC02]) and its visual supported version as described in Chapter 4. The results of the clustering can be saved in a delimiter-separated file, where for each data point the cluster it belongs to is denoted in the last column.

Classification: Classification is the problem of assigning class labels to unlabeled data items given a collection of labeled data items. Currently only the classifier format that is used by ODT can be used. The classifiers of ODT consist of a collection of so-called base classifiers. Each base classifier decides between two different classes. To do this, the original feature space is mapped onto one single value between -1 and 1. The value 0 indicates the separation plane between the two classes. All negative values correspond to the first class, all positive values to the second class. The further the classification value is away from 0, the more certain the classifier is. A classifier can be loaded into the framework, and single base classifiers can be selected. Using such a base classifier on a data set will result in a ClassiView as shown in Figure 6.14.

Figure 5.2: Example of the Classifier view

By clicking on one of the histogram bars, all the images contained in that bar (grouped by their classification value) are loaded into the image viewer. Multiple selection of bars is possible. This way the user can easily identify which shapes of characters are mapped onto which value within the classifier space [−1, +1]. In this way classifiers can be analyzed and problematic shapes identified.

Visualization: As the visualization component we implemented GridView, as described in Chapter 3. Besides GridView we implemented additional views to support classification and clustering, and an image viewer was added to the framework.

5.5 Summary

This chapter briefly described the visual analytics framework that was developed to support feature and classifier analysis for different problems. An evaluation version of the framework (excluding ODT-specific methods such as classification and feature extraction) can be obtained by sending an email to the author ([email protected]). Some application examples can be found in the following chapter.

Future work will include support for different classifiers, additional projection methods and clustering algorithms, and database functionality.


Chapter 6

Application Examples

This chapter gives some application examples of the Visual Analytics Framework and the different techniques it incorporates.

6.1 Feature Engineering

One of the applications that Oce Document Technologies GmbH (ODT) develops and sells is the Optical Character Recognition software Recostar ProfessionalPlus. Optical character recognition (OCR) is the whole process of transforming an image of a document that contains text (machine printed or handwritten) into a corresponding ASCII text. Recostar ProfessionalPlus combines the strengths of two commercial OCR engines, the RecoStar and the AEG Recognition software. These engines have been optimized and fine-tuned to read in parallel; an internal voting system then combines the results of both engines to improve overall accuracy.

The OCR process, as done with the engines from ODT, can be broken down into five sub-processes: image preprocessing, layout analysis, character segmentation, classification, and contextual postprocessing. The focus of this use case lies on the optimization of the classification process. The result of the processes preceding classification is an image that contains one single character. Each of these image snippets is classified, and the results are combined to obtain the read result for the whole document.

The first step in classifying a character is feature extraction. There are different methods to compute the features of a character. Features are used as classification input instead of the raw pixel image because suitable transformations can work out more characteristic properties; in addition, the input data can be reduced, making the classification more efficient. One of the methods used by Recostar ProfessionalPlus is described briefly below.


The RecoStar engine slices each segmented character every 15 degrees, resulting in 24 slices (360 degrees / 15 degrees). These slices are analyzed and statistics about the number of intersecting lines are compiled (see Figure 6.1).

Figure 6.1: Feature extraction of a single character.

The statistics about the number of pixels and intersecting lines are then the features of the character. This method is known as Winkelschnittanalyse (WSA) and is similar to methods used in computed tomography. The feature vector obtained this way contains about 1100 features. Because this is too large for the actual classification, given the high performance that must be guaranteed, the original feature vector is reduced to 115 features with a two-dimensional discrete cosine transform. The actual classification takes place on this reduced feature vector. For ODT it is of course very interesting to see what the feature space resulting from this kind of feature vector looks like.
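The reduction step can be illustrated by keeping only the lowest-frequency coefficients of a discrete cosine transform; the engine uses a two-dimensional transform, while this simplified sketch applies a one-dimensional DCT-II to show the idea of energy compaction:

```python
import numpy as np

def dct_reduce(features, keep=115):
    """Reduce a raw feature vector by keeping only the lowest-frequency
    DCT-II coefficients. Simplified one-dimensional sketch of the
    two-dimensional transform used by the engine."""
    n = len(features)
    k = np.arange(n)[:, None]  # output frequency index
    i = np.arange(n)[None, :]  # input sample index
    # DCT-II basis: cos(pi * (2i + 1) * k / (2n))
    basis = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    coeffs = basis @ features
    return coeffs[:keep]       # low frequencies carry most of the energy

raw = np.random.rand(1100)     # WSA yields roughly 1100 raw features
reduced = dct_reduce(raw)      # 115 features remain for classification
```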

6.1.1 Visual Feature Space Analysis

We wanted to analyze the feature space mentioned above with the help of our Visual Analytics Framework. One of the data sets we used for analysis is a numerical data set. It contains 173,622 images of different numerical characters, as listed in Table 6.1.

Our first task was displaying the feature space with the help of the GridView technique. To do this we had to reduce the number of dimensions from 115 to 2. In the following example we used a Principal Component Analysis, resulting in the Scatter View displayed in Figure 6.2.


Class  Count     Class  Count
0      37567     8      10434
1      18049     9       9669
2      17518     +       1763
3      14983     -       1639
4      11654     /       3119
5      17815     =        407
6      11270     EU      7076
7      10659
Total  173622

Table 6.1: Numeric Dataset

Figure 6.2: Scatter View of the numerical data set (see Table 6.1) after projection onto two dimensions by a Principal Component Analysis.


To get an idea of how the data points are distributed over the feature space, we first switched the Scatter View to the Density View, resulting in Figure 6.3.

Figure 6.3: Density View of the numerical data set (see Table 6.1) after projection onto two dimensions by a Principal Component Analysis.

The density view shows a very dense area at the bottom of the projected data space, surrounded by a larger dense region. As we move toward the borders of the projected space, the density decreases. A dense area can also be seen in the middle of the feature space. Switching to the Purity View leads to Figure 6.4.

The purity view shows that the purity is very high at the borders of the projected data space; as we move toward the center, the purity decreases.
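Purity can be sketched per grid cell as the fraction of points belonging to the cell's majority class (a common definition; the exact measure of Section 3.2.5 may differ in detail):

```python
from collections import Counter

def cell_purity(labels):
    """Fraction of a cell's points that belong to its majority class.
    1.0 means the cell contains only one class; values near the
    reciprocal of the class count mean the cell is very impure."""
    if not labels:
        return 0.0
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# A cell with three '1's and one '9' has purity 0.75
print(cell_purity(["1", "1", "1", "9"]))
```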

To show how additional information about the data can be found with GridView, we take two classes, namely "1" and "9", from the data set and display only those in a Scatter View, resulting in Figure 6.5.

We can see in Figure 6.5 that the pink points, corresponding to class "9", are rendered last. This causes major overlap with the green points (class "1"). This inter-class overlap gives a false idea about the data. The same happens with intra-class overlap: looking at the green class, we see a large, completely green area in the middle of the class. We cannot see anymore how many green points are mapped onto one single pixel; this can be one, but also 1,000. So information about the density of the data points is lost in this Scatter View. To get more information about the data we switch to a Class View of the same data, resulting in Figure 6.6.

We now see an almost smooth transition from pink to green in the area where the green and the pink classes overlap. This information is almost completely lost when we only look at the data with a Scatter View.

Figure 6.4: Purity View of the numerical data set (see Table 6.1) after projection onto two dimensions by a Principal Component Analysis.

Figure 6.5: Scatter View of the classes "1" and "9" of the numerical data set (see Table 6.1) after projection onto two dimensions by a Principal Component Analysis.

Figure 6.6: Class View of the classes "1" and "9" of the numerical data set (see Table 6.1) after projection onto two dimensions by a Principal Component Analysis.

However, as we can see when comparing Figures 6.5 and 6.6, the latter gives a wrong idea about the density of the data. All grid cells, even those that contain only one point, are colored the same way as a cell that contains 10,000 points. So although this view solves most problems with inter-class overlap, intra-class overlap remains a problem. To display the density information in the different cells we use alpha-blending, resulting in Figure 6.7.

We can see that there now are brighter and darker regions within the Class View. Brighter regions correspond to very dense cells, and darker regions to very sparse cells. This gives some more information about the density in the different cells, but still withholds information: because the whole cell is always colored, even when it contains only one single point, this view still gives a wrong idea about the number of points in that cell. That is why we switch to the Mixed View, resulting in Figure 6.8.

Now each cell shows, depending on its overplotting factor, either a Scatter View or a Class View. This way sparse regions are displayed as a Scatter View, so the detailed information is directly visible, while dense regions are summarized with the Class View, solving the overplotting problem. However, within the Class View we still lack density information. That is why we use alpha-blending again to make this information accessible in the same view, resulting in Figure 6.9.
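The per-cell decision of the Mixed View with alpha-blending can be sketched as follows; the threshold and the alpha scaling are illustrative assumptions, not the thesis's exact parameters:

```python
def render_cell(points_in_cell, pixel_capacity):
    """Choose a representation for one GridView cell from its
    overplotting factor (points per available pixel position).
    Threshold and alpha scaling are illustrative assumptions."""
    factor = len(points_in_cell) / pixel_capacity
    if factor <= 1.0:
        return "scatter"              # sparse cell: draw every point
    alpha = min(1.0, factor / 10.0)   # denser cells are drawn more opaque
    return ("class_color", alpha)     # dense cell: class color plus alpha

# 5 points in a cell with room for 10 pixels remain a scatter plot
print(render_cell(range(5), 10))
```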


Figure 6.7: Class View with alpha-blending of the classes "1" and "9" of the numerical data set (see Table 6.1) after projection onto two dimensions by a Principal Component Analysis.

Figure 6.8: Mixed View of the classes "1" and "9" of the numerical data set (see Table 6.1) after projection onto two dimensions by a Principal Component Analysis.


Figure 6.9: Mixed View with alpha-blending of the classes "1" and "9" of the numerical data set (see Table 6.1) after projection onto two dimensions by a Principal Component Analysis.

6.1.2 Regions of Interest

Instead of using Visual Supported O-Clustering as a regular clustering method, we found it also very suitable for finding regions of interest in very large high-dimensional data sets. To explain this we first have to define our regions of interest: we were searching for single characters that are misclassified by our classifiers, and we want to know why. Unfortunately, when we filter out all the characters that are classified correctly, we lose important context information, and the misclassified characters are spread all over the data space. That is why we used Visual Supported O-Clustering for finding regions of interest. We made the assumption that when we make a split on one dimension during the clustering process, the same or a similar split will be reproduced by our classifiers. So we loaded a data set with a number of different classes and started the clustering algorithm. Based on what we saw, we partitioned the data set into a number of clusters. Based on the number of different classes and their distribution within a cluster, we decided whether this cluster was interesting for further analysis or not. This way we obtained a very fast data reduction, a dimension reduction (the dimensions that were split are not interesting for further analysis), and interesting subspaces within the data set.
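The cluster-selection step can be sketched as follows; the criterion (flagging clusters that mix several classes) and all names are illustrative assumptions:

```python
from collections import Counter

def interesting_clusters(cluster_ids, class_labels, min_classes=2):
    """Flag clusters whose class distribution suggests further analysis:
    clusters mixing several classes are candidate regions of interest.
    Criterion and threshold are illustrative assumptions."""
    by_cluster = {}
    for cid, cls in zip(cluster_ids, class_labels):
        by_cluster.setdefault(cid, Counter())[cls] += 1
    return [cid for cid, counts in by_cluster.items()
            if len(counts) >= min_classes]

# Cluster 0 mixes classes "1" and "9"; cluster 1 contains only "2"
print(interesting_clusters([0, 0, 1, 1, 1], ["1", "9", "2", "2", "2"]))
```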

6.1.3 Visual Projection Analysis

We compared the Principal Component Analysis (PCA) with the Self-Organizing Map (SOM) on the same numeric data set used in the previous example in order to compare their results. For the difference in approach between these two projection techniques we refer to Sections 2.2.1 and 2.2.5. Although a SOM does not work with single data points but with a number of neurons, while the PCA works with single data points, we can visualize both with GridView because the information is summarized into cells. Figure 6.10 shows a Density View of the numerical data set (see Table 6.1) after projection onto two dimensions by a PCA (left) and a SOM (right).

Figure 6.10: Density View of the numerical data set (see Table 6.1) after projection onto two dimensions by a Principal Component Analysis (left) and a Self-Organizing Map (right).

We can see that the SOM places the data more or less equally over the whole feature space. The PCA, on the other hand, concentrates a lot of data points at the bottom and in the center of the feature space (dark regions). The very sparse regions (colored yellow) in the SOM form a sort of border separating the dense regions (colored blue) from each other.

Figure 6.11 shows a Purity View of the same data set (see Table 6.1) after projection onto two dimensions by a PCA (left) and a SOM (right).

We can see that the dense regions in the SOM are also very pure (colored brown). The sparse regions in the PCA, on the contrary, are very impure (colored blue). This indicates that the SOM has found some classes and placed each class in a separate region on the map. Because all these separated regions are very pure, the SOM indicates that very good results can be achieved with a classification strategy similar to the SOM method. The PCA shows pure regions only at the borders of the projected feature space; on the other hand, the density view showed us that not many points are mapped to those borders. The dense region in the center of the map is very impure. The very dense region at the bottom of the projected feature space, however, is also very pure.

Figure 6.11: Purity View of the numerical data set (see Table 6.1) after projection onto two dimensions by a Principal Component Analysis (left) and a Self-Organizing Map (right).

6.1.4 Feature Selection for Multimedia Objects

One way to represent multimedia objects is the so-called Feature Vector (FV) approach [Fal96]. This approach represents multimedia objects o ∈ O, given in an object space O, by points p⃗o ∈ R^d in a d-dimensional vector space. Feature Vector Extractors fvx are functions fvx : O → R^d mapping objects to vectors that numerically describe object properties. Suitable extractors can be calculated efficiently and allow object similarities to be captured effectively by appropriate distance functions d : (p⃗i, p⃗j) → R⁺₀ defined in feature space.

The effectiveness of a given extractor used to represent multimedia objects is critical for any FV-based application. We understand the effectiveness of an extractor as the degree to which distances d in feature space resemble object similarities in object space. For many multimedia data types a large collection of possible extractors is available, but identifying the most suitable extractor for a given data set is difficult. In this section we address this problem by using GridView for the comparative evaluation of feature spaces, and we demonstrate how the framework can support the selection and engineering of promising extractors from a set of available extractors.

Following [SKP06] we used the Princeton Shape Benchmark [SMKF04], which provides a repository of 3D models and software tools for evaluating shape-based retrieval and analysis algorithms. Figure 6.12 shows an example of the Purity View of GridView on a 12 x 9 SOM based on the data obtained by applying the Voxel extractor to the Princeton Shape Benchmark data set.

Figure 6.12: Purity View of the feature vectors of the Princeton Shape Benchmark obtained with the Voxel extractor, after projection onto two dimensions by a 12 x 9 Self-Organizing Map.

We verified these visually obtained effectiveness estimations by comparing the purity measures generated by GridView (see Section 3.2.5) with the benchmarked effectiveness scores from [BKS+06] and [SKP06].

Table 6.2 shows an overview of the tested extractors with their original dimensionality and their measured R-Precision value, which is defined as the precision when retrieving exactly the number of objects relevant to the query. The R-Precision thus gives a single number to rate the effectiveness of a retrieval system. Behind the extractor names, a literature reference indicates where detailed information about the descriptor can be found.
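The R-Precision definition above can be written directly as a small function (an illustrative sketch, not the benchmark's evaluation code):

```python
def r_precision(ranked_results, relevant):
    """Precision after retrieving exactly R = |relevant| results, i.e.
    the fraction of the first R ranked items that are relevant."""
    r = len(relevant)
    if r == 0:
        return 0.0
    hits = sum(1 for item in ranked_results[:r] if item in relevant)
    return hits / r

# Two relevant objects; one of them appears within the first two ranks
print(r_precision(["a", "x", "b", "y"], {"a", "b"}))
```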

Figure 6.13 shows the regression analysis between the purity measure of the different SOMs (unsupervised information) and a supervised discrimination precision metric for eleven extractors. As we can see, there is a very strong linear relationship between the two measures. This indicates that computing a SOM on a certain feature vector and looking at its purity measure gives a good indication of the effectiveness of an extractor on this data.
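Such a regression analysis can be sketched with a least-squares fit and the Pearson correlation; the data below are toy values, not the eleven extractors' actual scores:

```python
import numpy as np

def linear_fit(purity, precision):
    """Least-squares line and Pearson correlation between an unsupervised
    purity score and a supervised precision metric."""
    slope, intercept = np.polyfit(purity, precision, 1)
    r = np.corrcoef(purity, precision)[0, 1]
    return slope, intercept, r

# Toy data lying exactly on the line precision = 0.5 * purity + 0.1
purity = np.array([0.2, 0.4, 0.6, 0.8])
precision = 0.5 * purity + 0.1
slope, intercept, r = linear_fit(purity, precision)
```

A correlation r close to 1 is what "very strong linear relationship" means quantitatively.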


Extractor                                   Dimensions  R-Precision
Rotation Invariant [KSO00]                  155         0.225228
Parallel Method of Moments [MKA97]          52          0.148161
Silhouette [HKSV02]                         375         0.281479
Harmonics 3D [FMK+03]                       128         0.202036
Cords based [PMN+00]                        120         0.22539
DSR                                         472         0.426087
Depth Buffer [HKSV02]                       510         0.311618
Complex Shading [VS02]                      169         0.270753
COR                                         30          0.157498
3D Discrete Fourier Transformation [VS01]   173         0.250815
Voxel [HKSV02]                              343         0.31133
SD2                                         130         0.182645

Table 6.2: Overview of the descriptors tested

Figure 6.13: Regression analysis between the purity score of the SOM (unsupervised information) and a supervised discrimination precision metric for eleven extractors.


6.2 Classifier Engineering

Besides analysis of the feature space, the classifiers used on this feature space can also be analyzed with the help of the Visual Analytics framework. The classifiers of ODT consist of a collection of so-called base classifiers. Each base classifier decides between two classes. To do this, the original feature space is mapped onto a single value between -1 and 1. The value 0 indicates the separation plane between the two classes. All negative values correspond to the first class, all positive values to the second class. The farther the classification value is from 0, the more certain the classifier is. A classifier can be loaded into the framework, and single base classifiers can be selected. Using such a base classifier on a data set results in a ClassiView as shown in Figure 6.14.

Figure 6.14: Example of the Classifier view

In this example a data set that contains only characters labeled as "1" or "2" is used. Then the base classifier that decides between those two classes is selected and the classification process is started. In the figure, the characters labeled as "1" are red, the characters labeled as "2" are blue. By clicking on a bar, all the characters placed by the classifier in this bin appear in the image viewer. Multiple selection of bars is possible. This way the user can easily identify which shapes of characters are placed at which position in classifier space, and problematic shapes can easily be identified. In this way classifiers can be judged on their suitability for certain problems.


6.3 Summary

This chapter presented some application examples of the Visual Analytics Framework. The emphasis of the examples is not on the results obtained with the techniques incorporated in the framework, but on the way these techniques can support such problems in a more general way.


Chapter 7

Conclusion and Discussion

This thesis contains three contributions to computer science: GridView, Visual Supported O-Clustering, and a Visual Analytics Framework. For each of these, conclusions are drawn and discussed, and future work is suggested in the following sections.

7.1 GridView

We presented GridView (see Chapter 3), a collection of different views that makes it possible to view and explore very large data sets. The strengths of the concept are the intuitive user interface, multiple views of the same data with a single click, and easy selection of classes, regions, and subsets. A disadvantage of the concept is that it can only handle 2-dimensional data sets, so the quality of the view depends on the quality of the embedding. Furthermore, the use of GridView only makes sense when the data has class labels.

We have shown different examples where GridView can be used to visualize and analyze very large data sets, overcoming the overplotting problems that, for example, scatterplots have.

Future work includes the automatic determination of a good order for the colors in the class view, other cell shapes, and other representations of the information in a cell. One very interesting research direction is the use of pie charts in the cells, positioned with the colors directed toward their class centers (cf. [BST00]).

7.2 Visual Supported O-Clustering

We proposed a new visual clustering algorithm based on the O-Cluster algorithm (see Chapter 4). Implemented in the visual analytics framework (see Chapter 5), our approach combines the strengths of an advanced automatic clustering algorithm with visualization techniques and user interaction possibilities that effectively support the clustering process by representing the important information visually. The visualization technique uses a combination of histograms and tree maps of the data and allows users to directly specify cluster separators in the visualizations.

Using the visually supported version of O-Clustering can not only improve the results of the algorithm significantly, because knowledge about the data can be added to the algorithm; it also improves the understanding of the algorithm and of the data set at hand. Besides using the algorithm solely for clustering, it can be very useful for discovering regions of interest and interesting dimensions within large high-dimensional data sets. Future work will focus on improving the different visualizations used.

7.3 Visual Analytics Framework

A visual analytics framework for classifier and feature engineering was developed and implemented (see Chapter 5). The framework supports different tasks such as clustering, classification, projection, and visualization.

Chapter 6 shows a number of applications where the Visual Analytics Framework is used for different problems within the feature and classifier engineering process. Future work will include support for different classifiers, additional projection methods and clustering algorithms, and database functionality.

An evaluation version of the framework (excluding ODT-specific methods such as classification and feature extraction) can be obtained by sending an email to the author ([email protected]).


Bibliography

[ABKS99] Ankerst, M.; Breunig, M. M.; Kriegel, H.-P.; Sander, J.: OPTICS: Ordering points to identify the clustering structure. In Proc. 1999 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'99), 1999.

[Agg01] Aggarwal, C. C.: A human-computer cooperative system for effective high dimensional clustering. In Proc. of the ACM Int. Conf. on Knowledge Discovery and Data Mining (KDD), CA, USA, pp. 221-226, 2001.

[AGGR98] Agrawal, R.; Gehrke, J.; Gunopulos, D.; Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD98), pp. 94-105, 1998.

[AHK01] Aggarwal, C. C.; Hinneburg, A.; Keim, D. A.: On the surprising behavior of distance metrics in high-dimensional space. In Proc. 8th Int. Conf. on Database Theory, London, pp. 420-434, 2001.

[APW+99] Aggarwal, C. C.; Procopiuc, C.; Wolf, J. L.; Yu, P. S.; Park, J. S.: Fast algorithms for projected clustering. In Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD99), pp. 61-72, 1999.

[AY00] Aggarwal, C. C.; Yu, P. S.: Finding generalized projected clusters in high dimensional spaces. In Proc. 2000 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD00), pp. 70-81, 2000.

[BGRS99] Beyer, K.; Goldstein, J.; Ramakrishnan, R.; Shaft, U.: When is "nearest neighbor" meaningful? In Proc. 7th Int. Conf. on Database Theory (ICDT99), pp. 217-235, 1999.

[BKS+06] Bustos, B.; Keim, D.; Saupe, D.; Schreck, T.; Vranic, D.: An experimental effectiveness comparison of methods for 3D similarity search. International Journal on Digital Libraries, Special Issue, Vol. 6, No. 1, pp. 39-54, 2006.


[BST00] Brandes, U.; Shubina, G.; Tamassia, R.: Improving angular resolution in visualizations of geographic networks. In Data Visualization: Proc. 2nd Joint EUROGRAPHICS and IEEE TCVG Symp. on Visualization (VisSym), pp. 23-33, 2000.

[CL03] Chen, K.; Liu, L.: A visual framework invites human into the clustering process. In Proc. of the 15th International Conference on Scientific and Statistical Database Management, p. 97, 2003.

[DMS99] Dhillon, I.; Modha, D.; Spangler, W.: Class visualization of high-dimensional data with applications, 1999.

[ED91] Everitt, B.; Dunn, G.: Applied Multivariate Data Analysis. Arnold, 1991.

[EK02] Eick, S.; Karr, A.: Visual scalability. Journal of Computational and Graphical Statistics, Vol. 11, No. 1, pp. 22-43, 2002.

[EKSX96] Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. of the 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD), Portland, OR, USA, pp. 226-231. AAAI Press, August 1996.

[Fal96] Faloutsos, C.: Searching Multimedia Databases by Content. Kluwer Academic Publishers, Norwell, MA, USA, 1996.

[FMK+03] Funkhouser, T.; Min, P.; Kazhdan, M.; Chen, J.; Halderman, A.; Dobkin, D.; Jacobs, D.: A search engine for 3D models. ACM Trans. Graph., Vol. 22, No. 1, pp. 83-105, 2003.

[GRS98] Guha, S.; Rastogi, R.; Shim, K.: CURE: An efficient clustering algorithm for large databases. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD98), pp. 73-84, 1998.

[GTC01] Grinstein, G.; Trutschl, M.; Cvek, U.: High-dimensional visualizations. In Proc. of the 7th Data Mining Conference (KDD 2001), San Francisco, California, 2001.

[HAK00] Hinneburg, A.; Aggarwal, C. C.; Keim, D. A.: What is the nearest neighbor in high dimensional spaces? In Proc. 26th Int. Conf. on Very Large Databases, Cairo, Egypt, September 2000.


[HK98] Hinneburg, A.; Keim, D. A.: An efficient approach to clustering in large multimedia databases with noise. In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD98), pp. 58-65, 1998.

[HK99] Hinneburg, A.; Keim, D. A.: Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering. In Proc. 25th Int. Conf. on Very Large Data Bases (VLDB), Edinburgh, Scotland, pp. 506-517, 1999.

[HKSV02] Heczko, M.; Keim, D. A.; Saupe, D.; Vranic, D.: Verfahren zur Ähnlichkeitssuche auf 3D-Objekten. Proc. Datenbanksysteme in Büro, Technik und Wissenschaft (BTW'01), Oldenburg, 2001; extended version in Datenbank-Spektrum, Vol. 2, No. 1, pp. 54-63, January 2002.

[HKW99] Hinneburg, A.; Keim, D. A.; Wawryniuk, M.: HD-Eye: Visual mining of high-dimensional data. IEEE Computer Graphics and Applications, Vol. 19, No. 5, pp. 22-31, 1999.

[HNC99] Nagesh, H.; Goil, S.; Choudhary, A.: MAFIA: Efficient and scalable subspace clustering for very large data sets. Technical Report No. 9906-010, Northwestern University, June 1999.

[KC03] Koren, Y.; Carmel, L.: Visualization of labeled data using linear transformations. In Proc. of IEEE Information Visualization 2003 (InfoVis'03), pp. 121-128. IEEE, 2003.

[KMSZ06] Keim, D. A.; Mansmann, F.; Schneidewind, J.; Ziegler, H.: Challenges in visual data analysis. In Information Visualization (IV 2006), Invited Paper, July 5-7, London, United Kingdom. IEEE Press, 2006.

[Koh95] Kohonen, T.: Self-Organizing Maps. Springer, Berlin, 1995.

[KR90] Kaufman, L.; Rousseeuw, P. J.: Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York, 1990.

[KS05] Keim, D. A.; Schneidewind, J.: Scalable visual data exploration of large data sets via multiresolution. Journal of Universal Computer Science, Special Issue on Visual Data Mining, Vol. 11, No. 11, pp. 1766-1779, November 2005.

[KSO00] Kato, T.; Suzuki, M.; Otsu, N.: A similarity retrieval of 3D polygonal models using rotation invariant shape descriptors. In Proc. of IEEE Int. Conf. on Systems, Man, and Cybernetics, pp. 2946-2952, 2000.


[Li04] Li, J. X.: Visualization of high dimensional data with relational perspectivemap. In Proc. of the IEEE Information Visualization 2004 (INFOVIS’04),S. 45–59, 2004.

[Mac67] MacQueen, J.: Some methods for classification and analysis of multivariateobservations. In Proc. 5th Berkeley Symp. Math. Statist, Nr. 1, S. 281–297,1967.

[MC02] Milenova, B. L.; Campos, M. M.: O-Cluster: Scalable clustering of large high dimensional data sets. In Proc. 2002 IEEE Int. Conf. on Data Mining (ICDM'02), pp. 290–305, 2002.

[Min04] Mingyue, T.: HDDVis: An interactive tool for high dimensional data visualization. Technical Report, University of British Columbia, Computer Science Department, 2004.

[MKA97] Marsh, A.; Kaklamani, D. I.; Adam, K.: Using parallel method of moments (PMoM) to solve multi-plate scattering problems. In HPCN Europe, pp. 1038–1039, 1997.

[NH94] Ng, R.; Han, J.: Efficient and effective clustering methods for spatial data mining. In Proc. 1994 Int. Conf. on Very Large Data Bases (VLDB'94), pp. 144–155, 1994.

[PMN+00] Paquet, E.; Murching, M.; Naveen, T.; Tabatabai, A.; Rioux, M.: Description of shape information for 2-D and 3-D objects. Signal Processing: Image Communication, Vol. 16, pp. 103–122, 2000.

[Sco92] Scott, D.: Multivariate Density Estimation. John Wiley and Sons, New York, 1992.

[Sco97] Scott, D.: On optimal and data-based histograms. Biometrika, Vol. 66, pp. 605–610, 1997.

[SCZ98] Sheikholeslami, G.; Chatterjee, S.; Zhang, A.: WaveCluster: A multiresolution clustering approach for very large spatial databases. In Proc. 1998 Int. Conf. on Very Large Data Bases (VLDB'98), pp. 428–439, 1998.

[SKP06] Schreck, T.; Keim, D.; Panse, C.: Visual feature space analysis for unsupervised effectiveness estimation and feature engineering. In IEEE International Conference on Multimedia and Expo (ICME'2006), Toronto, Canada, July 9–12, 2006.



[SMKF04] Shilane, P.; Min, P.; Kazhdan, M.; Funkhouser, T.: The Princeton Shape Benchmark. In Shape Modeling International, Genova, Italy, 2004.

[SRL+00] Sutherland, P.; Rossini, A.; Lumley, T.; Lewin-Koh, N.; Cook, D.; Cox, Z.: Orca: A visualization toolkit for high-dimensional data. Technical Report No. 046, NRCSE, May 2000.

[SRY81] Schiffman, S.; Reynolds, M.; Young, F.: Introduction to Multidimensional Scaling: Theory, Methods and Applications. Academic Press, 1981.

[TC05] Thomas, J.; Cook, K.: Illuminating the Path: The Research and Development Agenda for Visual Analytics. IEEE Press, 2005.

[TM05] Tejada, E.; Minghim, R.: Improved visual clustering of large multi-dimensional datasets. In Proc. of the IEEE Information Visualization 2005 (INFOVIS'05), pp. 818–825, 2005.

[VS01] Vranic, D.; Saupe, D.: 3D shape descriptor based on 3D Fourier transform. In Proc. of EURASIP Conf. on Digital Signal Processing for Multimedia Communications and Services (ECMCS'01), pp. 271–274, 2001.

[VS02] Vranic, D.; Saupe, D.: Description of 3D-shape using a complex function on the sphere. In Proc. of IEEE Int. Conf. on Multimedia and Expo (ICME'02), pp. 177–180, 2002.

[vW05] Wijk, J. v.: The value of visualization. In Silva, C.; Rushmeier, H.; Gröller, E. (Eds.): Proc. IEEE Visualization 2005, pp. 79–86, 2005.

[Wan97] Wand, M. P.: Data-based choice of histogram bin width. The American Statistician, Vol. 51, No. 1, 1997.

[Web02] Webb, A.: Statistical Pattern Recognition. John Wiley and Sons, 2002.

[WYM97] Wang, W.; Yang, J.; Muntz, R.: STING: A statistical information grid approach to spatial data mining. In Proc. 1997 Int. Conf. on Very Large Data Bases (VLDB'97), pp. 186–195, 1997.

[ZRL96] Zhang, T.; Ramakrishnan, R.; Livny, M.: BIRCH: An efficient data clustering method for very large databases. In Proc. 1996 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'96), pp. 103–114, 1996.



Statement of Authorship

I declare that this thesis and the accompanying code have been composed by myself and describe my own work, unless otherwise acknowledged in the text. They have not been accepted in any previous application for a degree. All verbatim extracts have been distinguished by quotation marks, and all sources of information have been specifically acknowledged.

Henrico Dolfing, January 22, 2007