Vivarana : Interactive Data Visualization Tool for Complex Event Processor
Rule Generation
Sajith Edirisinghe (100112V)
Vimuth Fernando (100132G)
Tharindu Ranasinghe (100440A)
Mihil Ranathunge (100444N)
Department of Computer Science and Engineering
Faculty of Engineering
University of Moratuwa
Supervised by:
Prof. Gihan Dias
Eng. Charith Chitranjan
2015
Abstract
In Complex Event Processor (CEP) systems, processing takes place according to user-defined rules, each of which defines an action for a particular set of data. Writing such rules is generally a challenging, time-consuming task even for domain experts. It is a two-part process in which the user has to first identify which events of the event stream to act on, and then write CEP queries to filter out the types of events identified earlier. We propose a solution that unifies this whole process by providing the users of CEP systems with a single tool that can be used to easily identify patterns of interest in large data sets through a data visualization technique, and then automatically generate CEP queries to filter out the events of interest identified by the user.
Vivarana is an interactive data visualization tool that can be used to generate CEP queries. It provides users with the ability to interactively analyze a large data set and to generate CEP queries that filter out events of interest. In this report we describe the current research in the areas of visualization and CEP rule generation; the implementation details of our tool; the issues and challenges encountered during the project; and some paths that can be explored in the future to improve the effectiveness of our tool: the visualization method, the interactions the user can perform on the visualization, and the rule generation technique implemented in Vivarana.
Contents
Contents  i
List of Figures  ii
1 Introduction  1
2 Literature Review  2
2.1 Introduction  2
2.2 Multidimensional data visualization  2
2.3 Visualization techniques  5
2.4 CEP Rule generation  41
3 Solution  51
3.1 Overview  51
3.2 Visualization - Parallel Coordinates  53
3.3 Other functionalities  61
3.4 Rule Generation  64
3.5 Other Approaches Attempted  69
4 Discussion  84
5 Conclusion and Future Work  87
Bibliography  89
List of Figures
2.1 A scatterplot of the distribution of drivers' visibility range against their age  6
2.2 A scatterplot matrix display of data with three variates X, Y, and Z  7
2.3 Rank-by-feature framework interface for scatterplots (2D)  7
2.4 Rank-by-feature visualization for a data set of demographic and health related statistics for 3138 U.S. counties  9
2.5 Scatterplot matrix navigation for a digital camera dataset  10
2.6 Stage-by-stage overview of the scatterplot animated transition  11
2.7 Scatterplot matrix for the Nuts-and-bolts dataset  12
2.8 Generalized Plot Matrix for the Nuts-and-bolts dataset  13
2.9 Parallel coordinate plot with 8 variables for 250 cars  14
2.10 Parallel Coordinate plot for a point  15
2.11 Parallel Coordinate plot for points on a line with m < 0  15
2.12 Parallel Coordinate plot for points on a line with 0 < m < 1  16
2.13 Negative correlation between Car Weight and the Year  17
2.14 Using brushing to filter Cars with 6 cylinders  17
2.15 Using composite brushing to filter Cars with 6 cylinders made in '76  18
2.16 An example of Smooth brushing  19
2.17 Angular Brushing  20
2.18 Multiple ways of ordering N axes in parallel coordinates  21
2.19 Two clusters represented in parallel coordinates  22
2.20 Multiple clusters visualized in parallel coordinates in different colors  22
2.21 Variable length Opacity Bands representing a cluster in parallel coordinates  22
2.22 Parallel-coordinates plot using polylines and using bundled curves  23
2.23 Statistically colored Parallel Coordinates plot on weight of cars  24
2.24 Three scaling options for visualizing the stage times in the Tour de France  25
2.25 Parallel Coordinates plot for a data set with 8000 rows  26
2.26 Parallel Coordinates for the Olive Oils data, showing how alpha blending can improve dense visualizations  28
2.27 Parallel Coordinates visualization with Z Score coloring  29
2.28 Parallel Coordinates drawn on the same data set using data selection  30
2.29 Radviz Visualization for multi dimensional data  31
2.30 Mosaic plot for the Titanic data showing the distribution of passengers' survival based on their class and sex  32
2.31 Double Decker plot for the Titanic data  33
2.32 Training a self organizing map  35
2.33 A self organizing map trained on the poverty levels of countries  35
2.34 A sunburst visualization summarizing user paths through a fictional e-commerce site  37
2.35 Trellis Chart for a data set on sales  38
2.36 Trellis Display of Scatter Plots (Relationship of Gifts Given/Received on Revenue)  39
2.37 A snapshot of the grand tour; a projection of the data to a single plane is illustrated in (B)  40
2.38 Grand tour path in 3D space  41
2.39 Structure of the iCEP framework  45
2.40 Prediction Correction Paradigm  47
2.41 An overview of the rules tuning method  50
3.1 Architecture of the implementation of Vivarana  52
3.2 Basic Implementation of Parallel Coordinates  54
3.3 Example 1D Brushing  56
3.4 Example Composite Brushing  56
3.5 Example Composite Brushing  57
3.6 SlickGrid along with the Parallel Coordinates  58
3.7 Cluster Coloring  59
3.8 Cluster Bundling  59
3.9 Parallel Coordinates without Alpha Blending  60
3.10 Parallel Coordinates with Alpha Blending  60
3.11 Parallel Coordinates with Statistical Coloring  61
3.12 Specifying the size of either Time or Event window  62
3.13 State of visualization after performing an aggregation operation  63
3.14 Performing clustering within clusters in a web server log data set  64
3.15 A decision tree to classify the Iris data set  65
3.16 A decision tree to classify the Iris data set, with paths to a Virginica flower highlighted  66
3.17 Rule generation process  68
3.18 Sunburst Visualization which we used as the foundation of our project  72
3.19 Select grouping and grouped columns  73
3.20 Data file representation in a Python DataFrame  74
3.21 Python DataFrame after Group By operation  75
3.22 Sequence Database Final Representation  76
3.23 Sequence Database after being stripped of unnecessary event attributes  77
3.24 Counting the number of unique sequences using value_counts()  77
3.25 Improved sunburst visualization  82
3.26 Sunburst with drill down capability  83
3.27 Sunburst zoom and pan capability  83
4.1 Challenges faced in the visualization  86
Acknowledgments
We would like to acknowledge with much gratitude and thank every person who provided assistance and supervision throughout this project in order to make it successful. We would like to express our sincere gratitude especially towards Prof. Gihan Dias and Eng. Charith Chitraranjan, who supervised and mentored us from the beginning to the end of the project, providing us valuable insights, feedback, immense support, and guidance to make this project a success.
Further, we would like to thank Dr. Malaka Walpola, our final year project coordinator, for his continuous support and guidance, which helped us boost our performance and motivated us to do our best.
We are also grateful to all the members of the academic and non-academic staff of the Department of Computer Science and Engineering who helped us in various ways to finish our project.
Last but not least, we are highly grateful to all our colleagues of the CSE '10 batch who helped us in various ways by providing valuable feedback and helping us through technical difficulties. We consider it a privilege to have worked with all these amazing people throughout this project.
Chapter 1
Introduction
Nowadays, every action or event occurring in the real world, whether it be a change of temperature detected by a sensor, a change in stock market prices, or the movement of objects tracked through GPS coordinates, is digitally collected and stored for further exploration and analysis, and sometimes a pre-specified action is triggered in real time when a particular event occurs. Complex Event Processing (CEP) engines are used to analyze these events on the fly and to execute appropriate pre-specified actions.
However, one downside of this real-time event monitoring and processing using a CEP is that a domain expert must write the necessary CEP rules in order to detect interesting events and to trigger an appropriate response. Sometimes the domain expert might lack the knowledge to write efficient CEP rules for a particular CEP engine using its query language, or might need to explore, understand and analyze the incoming event stream prior to writing any rules.
By providing an interactive visualization of data to domain experts, we can help them in their process of generating CEP rules. Chapter 2 contains the literature review we conducted in order to familiarize ourselves with the existing research on interactive multi-dimensional data visualization and automatic Complex Event Processor rule generation. Chapter 3 contains the implementation details of our solution to the aforementioned problem. In Chapter 4 we discuss the challenges we faced during this project and how we overcame them. Finally, Chapter 5 contains the conclusion and future work regarding Vivarana.
Chapter 2
Literature Review
2.1 Introduction
This literature survey mainly contains two parts. Section 2.3 presents our findings on interactive visualization techniques; in that section we describe scatter plots and parallel coordinates in detail and briefly introduce other promising visualization techniques. Section 2.4 contains our findings on two methods of CEP rule generation, namely iCEP and rule parameter tuning. Further, section 2.2 contains an overview of multidimensional visualization (principles, techniques, problems) for the sake of completeness.
2.2 Multidimensional data visualization
Recent advances in technology have enabled the generation of vast amounts of data in a wide range of fields, and these data keep getting more complex. Data analysts look for patterns, anomalies and structures in the data, and analyzing it can lead to important knowledge discoveries that are valuable to users. The benefits of such understanding are reflected in better business decision making, more accurate medical diagnosis, finer engineering and, in a general sense, more refined conclusions.
Visualizing these complex data can provide an overview and a summary of the data, and can help in identifying areas of interest within it. Good data visualization techniques that allow users to explore and manipulate the data can empower them in analyzing it and in identifying important patterns and trends that may otherwise have remained hidden.
Multi-dimensional data visualization is a very active research area that goes back many years [68]. In this survey we have focused on 2D multi-dimensional data visualization techniques, because 2D visualizations make it easier for users to analyze and interact with the data: a 2D surface is more familiar to users and easier to navigate. There are multiple challenges that need to be overcome in multidimensional data visualization, and finding a good visualization involves finding a good compromise among them:
• Mapping - Finding a good mapping from a multi-dimensional space to a two di-
mensional space is not a simple task. The final representation of the data should be
intuitive and interpretable. Users should be able to identify patterns and trends in the
multi-dimensional data using the two dimensional representation.
• Large amounts of data - Modern datasets contain very large amounts of data that can lead to very dense visualizations. This causes a loss of information in the visualization, because users lose the ability to distinguish between small differences in the data.
• Dimensionality - Displaying the information of multiple dimensions in two dimen-
sional space can also lead to very dense and cluttered visualizations. Techniques need
to be developed to allow users to reduce the clutter and identify important informa-
tion in the data. Techniques such as principal component analysis [29] can help in
identifying important dimensions in the data.
• Assessing effectiveness - Information needs vary widely from one data set to another, so no single visualization technique can solve every problem. Different datasets and requirements call for different visualization methods. There is no general method to assess the effectiveness of one visualization method over another, and no fixed process that can be followed to arrive at a visualization method that works for any dataset.
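To make the principal component idea above concrete: in the two-dimensional case the covariance matrix is 2-by-2 and its eigenvalues have a closed form, so the fraction of variance captured by the first principal component can be computed directly. The following is an illustrative sketch in plain Python, not part of the tool, and the data is made up:

```python
import math

def top_variance_fraction(xs, ys):
    """Fraction of total variance captured by the first principal
    component of a 2-D dataset (closed-form 2x2 eigendecomposition)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) ** 2 for x in xs) / n                     # var(x)
    c = sum((y - my) ** 2 for y in ys) / n                     # var(y)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n   # cov(x, y)
    half = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1 = (a + c) / 2 + half                                  # largest eigenvalue
    return lam1 / (a + c)

# Strongly correlated pair: almost all variance lies along one direction,
# so the pair could be summarized by a single derived dimension.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.0, 8.1, 9.9]   # roughly y = 2x
print(top_variance_fraction(xs, ys))
```

A fraction close to 1 indicates that dropping the second derived dimension would lose very little information, which is exactly what makes PCA useful for reducing clutter.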
Further, according to E. R. Tufte [62], a good visualization comprises the following qualities:
• Show data variations instead of design variations. This quality encourages the viewer to think about the substance rather than about methodology, graphic design, the technology of graphic production, etc. One way to achieve this quality in a visualization is to have a high data-to-ink ratio [10] and a high data density.
• Clear, detailed and thorough labeling, and appropriate scales. A visualization can use layering and separation techniques to show the labels of the data items.
• The size of a graphic effect should be directly proportional to the numeric quantity it represents. This can be achieved by avoiding chart junk such as unnecessary 3D and shadowing effects, and by reducing the lie factor [37].
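The lie factor mentioned above is simple arithmetic: the size of the effect shown in the graphic divided by the size of the effect present in the data, with values near 1 indicating an honest graphic. A small illustrative computation (the numbers are hypothetical):

```python
def lie_factor(graphic_change, data_change):
    """Tufte's lie factor: relative effect size shown in the graphic
    divided by the relative effect size present in the data.
    Values far from 1 indicate a misleading graphic."""
    return graphic_change / data_change

# Hypothetical example: the data grows by 50%, but the bar's area grows
# by 125% (both width and height scaled by 1.5), exaggerating the trend.
data_change = (15 - 10) / 10        # 0.5
graphic_change = (1.5 * 1.5) - 1    # 1.25
print(lie_factor(graphic_change, data_change))  # 2.5
```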
To make a visualization more user-friendly, a number of interaction techniques have been proposed [33]. It should be noted that the behavior of these interaction techniques differs from one visualization technique to another. In general, interaction techniques allow the user to directly interact with the visualization and to change it according to the exploration objective. The list below contains the major interaction techniques we have identified.
• Dynamic Projections
Dynamic projection means dynamically changing the projection in order to explore a multidimensional data set. A classic example is the Grand Tour [3], which tries to show all interesting pairs of dimensions of a multidimensional dataset as a series of scatter plots. The sequence of projections can be random, manual, pre-computed, or even data-driven, depending on the visualization technique.
• Interactive Filtering
When exploring a large dataset interactively, partitioning it and focusing on interesting subsets is a must. This can be achieved through direct selection of the desired subset (browsing) or through specifying the properties of the desired subset (querying). However, browsing becomes difficult and querying becomes inaccurate as the dataset grows larger. As a solution to this problem, techniques such as Magic Lens [5] and InfoCrystal [54] have been developed to improve interactive filtering in data exploration.
• Interactive Zooming
Zooming is used in almost all interactive visualizations. When dealing with large amounts of data, the data is sometimes shown highly compressed in order to provide an overview of it. In such cases, zooming does not only mean displaying the data objects larger; the data representation should also change automatically to present more details at higher zoom levels (decompressing). The initial (compressed) view allows the user to identify patterns, correlations and outliers, and by zooming in to an area of interest the user can study the data objects within that region in more detail.
• Interactive Distortion
Interactive distortion techniques help the data exploration process by providing a way to focus on details while preserving an overview of the data. The basic idea of distortion is to show a portion of the data at a high level of detail while the rest is shown at a lower level of detail.
• Interactive Linking and Brushing
The idea of linking and brushing is to combine different visualization methods to
overcome the shortcomings of single techniques. As an example one could visualize a
scatterplot matrix (section 3.1) for a data set and when some points in a particular
scatterplot is brushed those points will get highlighted in all other scatterplots. Hence
interactive changes made in one visualization are automatically reflected in the other
visualizations.
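Conceptually, linking and brushing reduces to sharing one selection mask across all views: brushing in one plot recomputes the mask, and every linked view highlights the same rows. A minimal sketch of this idea in Python follows; the record layout and column names are hypothetical:

```python
def brush(rows, column, low, high):
    """Return a boolean mask selecting the rows whose `column` value
    falls inside the brushed interval [low, high]."""
    return [low <= row[column] <= high for row in rows]

def highlighted(rows, mask):
    """Rows that every linked view should draw as highlighted."""
    return [row for row, keep in zip(rows, mask) if keep]

# Hypothetical car records shared by a scatterplot and a bar chart.
cars = [
    {"mpg": 18, "cylinders": 6},
    {"mpg": 30, "cylinders": 4},
    {"mpg": 16, "cylinders": 8},
]
# Brushing the mpg axis in one view produces a mask that all linked
# views consult when rendering, so the same rows light up everywhere.
mask = brush(cars, "mpg", 15, 20)
```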
2.3 Visualization techniques
Scatter Plots
Scatterplots are a commonly used visualization technique for multivariate data sets. There are mainly 2D and 3D scatterplot visualizations. In a 2D scatterplot, data points from two dimensions of a dataset are plotted in a Cartesian coordinate system whose two axes represent the selected dimensions, resulting in a scattering of points. An example of a scatterplot showing the distribution of drivers' visibility range against their age is shown in Figure 2.1.
The positions of the data points represent the corresponding dimension values. Scatterplots are useful for visually identifying correlations between two selected variables of a multidimensional data set, or for finding clusters and outliers in the dataset.

Figure 2.1: A scatterplot of the distribution of drivers' visibility range against their age

A single scatterplot can only depict the correlation between two dimensions; a limited number of additional dimensions can be mapped to the color, size or shape of the plotted points.
Advocates of 3D scatterplots argue that since the natural world is three dimensional, users
can readily grasp 3D representations. However, there is substantial empirical evidence that
for multidimensional ordinal data (rather than 3D real objects such as chairs or skeletons),
users struggle with occlusion and the cognitive burden of navigation as they try to find desired
viewpoints [51]. Advocates of higher dimensional displays have demonstrated attractive
possibilities, but their strategies are still difficult to grasp for most users.
Since two-dimensional scatterplot presentations offer ample power while maintaining comprehensibility, many variations have been proposed. One of the methods used to visualize multivariate data using 2D scatterplots is the scatterplot matrix (SPLOM) [68].
Each individual plot in the SPLOM is identified by its row and column number in the
matrix [68]. For example, the identity of the upper left plot of the matrix in Figure 2.2 is
(1, 3) and the lower right plot is (3, 1). The empty diagonal cells display the variable names. Plot (2, 1) is the scatter plot of parameter X against Y, while plot (1, 2) is the reverse, i.e. Y versus X.
One of the major disadvantages of a SPLOM is that as the number of dimensions of the data set grows, the n-by-n SPLOM grows with it and each individual scatterplot gets less space. The following frameworks address this problem by incorporating interaction techniques into the traditional SPLOM.
Figure 2.2: A scatterplot matrix displays of data with three variates X, Y , and Z.
Figure 2.3: Rank-by-feature framework interface for scatterplots (2D).
Rank-by-feature framework
Many variations have been proposed to the initial SPLOM to enhance its interactivity and
interpretability. One such enhancement is presented with the rank-by-feature framework [51].
Instead of directly visualizing the data points against all pairs of dimensions, this framework allows the user to select an interesting ranking criterion, as described later in this section.
Figure 2.3 shows a dataset of demographic and health-related statistics for 3138 U.S. counties with 17 attributes, visualized through the rank-by-feature framework. Its interface consists of four coordinated components: the control panel (Figure 2.3A), score overview (Figure 2.3B), ordered list (Figure 2.3C), and scatterplot browser (Figure 2.3D).
Users can select an ordering criterion in the control panel (Figure 2.3A), and the ordered list (Figure 2.3C) shows the pairs of dimensions (scatterplots) sorted according to the score of that criterion, with the scores color-coded in the background. However, users cannot see an overview of all the relationships between variables at a glance in the ordered list. Hence the score overview (Figure 2.3B), an m-by-m grid view in which all dimensions are aligned along the rows and columns, has been implemented. Each cell of the score overview represents a scatterplot whose horizontal and vertical axes are the dimensions at the corresponding column and row, respectively.
Since this matrix is symmetric, only the lower-triangular part is shown. Each cell is color-coded by its score value using the same mapping scheme as in the ordered list. The scatterplot corresponding to the selected cell is simultaneously shown in the scatterplot browser (Figure 2.3D), and the corresponding item is highlighted in the ordered list (Figure 2.3C). In the scatterplot browser, users can quickly look through scatterplots by using the item sliders attached to the scatterplot view. Simply by dragging the vertical or horizontal item slider bar, users can change the dimension for the horizontal or vertical axis, respectively, while preserving the other axis.
The list below contains the ranking criteria suggested by this framework.
• Correlation coefficient (-1 to 1)

The Pearson correlation coefficient r for a scatterplot S with n points [46] is defined in Equation 1:

    r = Σ (xi - x̄)(yi - ȳ) / sqrt( Σ (xi - x̄)² · Σ (yi - ȳ)² )    (1)

Pearson's r is a number between -1 and 1; its sign and magnitude tell the direction and the strength of the relationship, respectively. Although correlation does not necessarily imply causality, it can provide a good clue to the true cause, which could be another variable. Linear relationships are the most common and the simplest to understand. As a visual representation of the linear relationship between two variables, the line of best fit (the regression line) is drawn over the scatterplots.
• Least square error for curvilinear regression (0 to 1)

This criterion sorts scatterplots in terms of the least-square error from the optimal quadratic curve fit, so that the user can isolate the scatterplots where all points are closely/loosely arranged along a quadratic curve. In some scenarios it might be interesting to find non-linear relationships in the data set in addition to linear ones.
• Quadracity (0 to infinity)

Figure 2.4: Rank-by-feature visualization for a data set of demographic and health related statistics for 3138 U.S. counties

The "Quadracity" criterion is added to emphasize truly quadratic relationships. It ranks scatterplots according to the coefficient of the highest-degree term, so that users can easily identify the ones that are more quadratic than others.
• The number of potential outliers (0 to n)

Distance-based outlier detection methods such as DB-out [36], or density-based outlier detection methods such as the Local Outlier Factor (LOF) based method [6], can be used to detect outliers in a scatterplot. The rank-by-feature framework uses the LOF-based method (Figure 2.4), since it is more flexible and dynamic in terms of outlier definition and detection. The outliers are highlighted with yellow triangles in the scatterplot browser view.
• The number of items in the region of interest (0 to n)
This criterion allows the user to draw a free-formed polygon region of interest on the
scatterplot. Then the framework will use the number of data points in the region to
order all scatterplots, so that the user can easily find the ones with the most/least items in the specified region.
• Uniformity of scatterplots (0 to infinity)

To calculate this criterion, the two-dimensional space is divided into regular grid cells, and each cell is used as a bin. For example, if a k-by-k grid has been generated, the entropy of a scatterplot S would be

    H(S) = - Σi Σj pij log2 pij

where pij is the probability that an item belongs to the cell at (i, j) of the grid.
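Each ranking criterion above reduces to a score function computed over the points of a single scatterplot; the scatterplots are then sorted by the chosen score. The sketch below illustrates three of the criteria in plain Python. The function names and the DB-out threshold parameters are our own illustrative choices, not the framework's:

```python
import math
from collections import Counter

def pearson_r(xs, ys):
    """Correlation coefficient criterion (-1 to 1), as in Equation 1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def db_outliers(points, dist, frac):
    """Potential-outliers criterion in the DB-out style: a point counts
    as an outlier if at least `frac` of the other points lie farther
    than `dist` away (thresholds are illustrative)."""
    out = 0
    for i, (px, py) in enumerate(points):
        far = sum(1 for j, (qx, qy) in enumerate(points)
                  if i != j and math.hypot(px - qx, py - qy) > dist)
        if far >= frac * (len(points) - 1):
            out += 1
    return out

def grid_entropy(points, k):
    """Uniformity criterion: entropy of a k-by-k binning of the plot."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    def bin_of(v, lo, hi):
        return min(int((v - lo) / (hi - lo + 1e-12) * k), k - 1)
    cells = Counter((bin_of(x, min(xs), max(xs)),
                     bin_of(y, min(ys), max(ys))) for x, y in points)
    n = len(points)
    return -sum((c / n) * math.log2(c / n) for c in cells.values())
```

A rank-by-feature style interface would evaluate the selected score for every pair of dimensions, sort the pairs, and color-code the score in the overview grid.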
Figure 2.5: Scatterplot matrix navigation for a digital camera dataset.
Rolling Dice Framework
Rolling dice is another framework that utilizes a SPLOM to visualize multidimensional data [13]. In this framework, transitions from one scatterplot to another are performed as animated rotations in 3D space, similar to rolling a die. The rolling dice framework also suggests a visual querying technique, so that a user can refine a query by exploring how the same query would look in any other scatterplot.
The interface proposed by the framework mainly consists of three components: the scatterplot component (Figure 2.5B), the scatterplot matrix component (Figure 2.5A) and the query layer component (Figure 2.5C). The scatterplot component shows the currently viewed cell of the scatterplot matrix, with the names and labels of the two displayed axes. The scatterplot matrix component can be used both as an overview and as a navigational tool. Navigation in the scatterplot matrix is restricted to orthogonal movement along the same row or column of the matrix, so that one dimension of the focused scatterplot is always preserved while the other changes. The change is visualized using a 3D rotation animation, which gives a semantic meaning to the movement of the points, allowing the human mind to interpret the motion as shape [64].
The transition between scatterplots is performed as a three-stage animation: extrusion into 3D, rotation, and projection back into 2D. More specifically, given the two currently visualized dimensions x and y and a vertical transition to a new dimension y', the animation follows the steps below (also depicted in Figure 2.6).

Figure 2.6: Stage-by-stage overview of the scatterplot animated transition
• Extrusion: The scatterplot visualizing the x and y axes is extruded into 3D, where the new dimension y' becomes the depth coordinate of each data point. At the end of this step the 2D scatterplot has become 3D (Figures 2.6A and 2.6B).
• Rotation: The scatterplot is rotated 90 degrees up or down, causing the axis previously along the depth dimension to become the new vertical axis (Figure 2.6C).
• Projection: The 3D plot is projected back into 2D with x and y' as the new horizontal and vertical axes (Figures 2.6D and 2.6E).
Further, the rolling dice framework suggests a method called query sculpting, which allows selecting data items in the main scatterplot visualization using 2D bounding shapes (convex hulls) and iteratively refining that selection from other viewpoints while navigating the scatterplot matrix. As shown in Figure 2.5C, the query layer component is used for selecting, naming and clearing color-coded queries during the visual exploration. Clicking and dragging one query onto another performs a union or intersection operation (by dragging with the left or right mouse button, respectively). Each query layer also provides a visual indication of the percentage of items it currently selects.
Figure 2.7: Scatterplot matrix for the Nuts-and-bolts dataset
Shortcomings of Scatterplot Matrix (SPLOM)
In order to discuss the shortcomings of SPLOM, let's consider a fictitious "nuts-and-bolts" dataset. This dataset, shown in Table 1, involves 3 (independent) categorical variables: Region (North, Central, and South), Month (January, February, ...), and Product (Nuts or Bolts). It also contains 3 (dependent) continuous variables: Sales, Equipment costs, and Labor costs.

Figure 2.7 shows the SPLOM for the "nuts-and-bolts" dataset. The top three scatterplots (e.g. Month vs Region) each show a crossing of two categorical variables, resulting in an uninformative grid of points. Further, the scatterplots showing continuous vs categorical variables suffer from overplotting (e.g. Sales vs Product).
To overcome this issue, the Generalized Plot Matrix (GPLOM) [27] has been proposed. The GPLOM uses heatmaps to visualize pairs of categorical variables, bar charts to visualize continuous vs categorical variables, and scatterplots to visualize pairs of continuous variables. It is important to note that in this scenario the scatterplots show individual tuples, whereas the bar charts and heatmaps show aggregated data. Figure 2.8 shows the GPLOM for the nuts-and-bolts dataset. Even though a GPLOM is a better choice than a SPLOM for visualizing a combination of continuous and categorical variables, since it uses 3 types of charts it loses the consistency of the matrix.

Figure 2.8: Generalized Plot Matrix for the Nuts-and-bolts dataset
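The GPLOM cell-selection rule described above can be stated as a small dispatch on the types of the two variables. An illustrative sketch follows; the function and type names are our own, not from the GPLOM paper:

```python
def gplom_cell(type_a, type_b):
    """Chart type a GPLOM-style matrix would use for one cell:
    heatmaps for two categorical variables, bar charts for mixed
    pairs, and scatterplots for two continuous variables."""
    if type_a == "categorical" and type_b == "categorical":
        return "heatmap"        # aggregated counts per category pair
    if type_a == "continuous" and type_b == "continuous":
        return "scatterplot"    # individual tuples
    return "barchart"           # continuous aggregated per category

# Nuts-and-bolts dataset: Region, Month and Product are categorical;
# Sales, Equipment costs and Labor costs are continuous.
print(gplom_cell("categorical", "categorical"))  # heatmap
print(gplom_cell("continuous", "categorical"))   # barchart
```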
Parallel Coordinates
Parallel coordinates, introduced by Inselberg and Dimsdale [28][30], is a popular technique for transforming multidimensional data into a 2D image. The m-dimensional data items are represented as lines crossing m parallel axes, each axis corresponding to one dimension of the original data. Fundamentally, parallel coordinates differ from the other visualization methodologies in that they yield a graphical representation of multidimensional data rather than just visualizing a finite set of points.

Figure 2.9: Parallel coordinate plot with 8 variables for 250 cars

Figure 2.9 displays a parallel coordinate plot with 8 variables, using a dataset that contains information about cars, such as economy (mpg), cylinders and displacement (cc), for a selected sample of cars manufactured between 1970 and 1982.
Definition and Representation
On the plane with xy-Cartesian coordinates, starting on the y-axis, N copies of the real line, labeled x1, x2, ..., xN, are placed equidistant and perpendicular to the x-axis. They are the axes of the parallel coordinate system for the Euclidean N-dimensional space RN, all having the same positive orientation as the y-axis [28].
Figure 2.10 shows how a point C with coordinates (c1, c2, ..., cN) can be represented by a polygonal line. In the same way, m data points can be represented by m polygonal lines.
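The construction of a polygonal line from one data item can be sketched as follows; the equal axis spacing and min-max normalization are assumptions of this illustration:

```python
def to_polyline(point, mins, maxs, width=1.0):
    """Map an m-dimensional point to polyline vertices over m
    equally spaced parallel axes, normalizing each dimension
    to [0, 1] between its min and max."""
    m = len(point)
    xs = [i * width / (m - 1) for i in range(m)]  # axis positions
    ys = [(v - lo) / (hi - lo) for v, lo, hi in zip(point, mins, maxs)]
    return list(zip(xs, ys))

# A 3-dimensional point becomes a polyline with 3 vertices.
print(to_polyline((5, 50, 0.5), mins=(0, 0, 0), maxs=(10, 100, 1)))
# [(0.0, 0.5), (0.5, 0.5), (1.0, 0.5)]
```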
For lines with negative slope (m < 0), the intersection point lies between the axes, as in Figure 2.11. For m > 1 the intersection point lies to the left of the X1 axis, while for lines with 0 < m < 1 it lies to the right of the X2 axis, as in Figure 2.12.
The above point-line duality can be considered one of the main advantages of parallel coordinates: parallel coordinate representations can provide statistical interpretations of the data. In the statistical setting, the following interpretations can be made: for highly negatively correlated pairs, the dual line segments in parallel coordinates tend to cross near a single
Figure 2.10: Parallel Coordinate plot for a point
Figure 2.11: Parallel Coordinate plot for points in a line with m < 0
Figure 2.12: Parallel Coordinate plot for points in a line with 0 < m < 1
point between the two parallel coordinate axes. Parallel or almost parallel lines between axes indicate positive correlation between variables [49] [60]. For example, we can see that there is a strong negative correlation between weight and year in Figure 2.13.
Over the years parallel coordinates have been enhanced by many researchers, who have improved the technique for better data investigation and for easier, user-friendly interaction by adding brushing, data clustering, real-time re-ordering of coordinate axes, etc.
Brushing
Brushing is considered a very effective technique for specifying an explicit focus during information visualization [20]. The user actively marks subsets of the dataset as being especially interesting, and the points contained by the brush are colored differently from the other points to make them stand out [42]. For example, a user interested in cars having 6 cylinders can use brushing as depicted in Figure 2.14.
The introduction of composite brushes [42] allows users to define their focus more specifically. Composite brushes are combinations of single brushes whose result is the conjunction of those single brushes. For example, a user interested in cars having 6 cylinders that were produced in '76 can use composite brushing as depicted in Figure 2.15.
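A composite brush is simply the conjunction of per-axis interval predicates; a minimal sketch with invented car records:

```python
def brush(records, **ranges):
    """Composite brush: keep records whose value on every brushed
    axis falls inside that axis's [low, high] interval."""
    def inside(rec):
        return all(lo <= rec[axis] <= hi for axis, (lo, hi) in ranges.items())
    return [rec for rec in records if inside(rec)]

cars = [
    {"cylinders": 6, "year": 76, "mpg": 21},
    {"cylinders": 6, "year": 72, "mpg": 19},
    {"cylinders": 4, "year": 76, "mpg": 31},
]
# Single brush: 6 cylinders; composite brush: 6 cylinders AND year '76.
print(len(brush(cars, cylinders=(6, 6))))                 # 2
print(len(brush(cars, cylinders=(6, 6), year=(76, 76))))  # 1
```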
The brushing techniques we have seen up to now use a discrete distinction between focus and
Figure 2.13: Negative correlation between Car Weight and the Year
Figure 2.14: Using brushing to filter Cars with 6 cylinders
Figure 2.15: Using composite brushing to Filter Cars with 6 cylinders made in 76
context. With that, we do not understand how similar the other data points are to the focused data points. The solution brought forward for this is called smooth brushing [20], where a multi-valued or even continuous transition is allowed, which inherently supports showing the similarity between data points in focus and their context. This corresponds to a degree-of-interest (DOI) function which maps non-binarily into the [0, 1] range. Often, such a non-binary DOI function is defined by means of spatial distances, i.e., the DOI value reflects the distance of a data point from a so-called center-of-interest.
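A distance-based DOI function of the kind described can be sketched as follows; the linear fall-off and the Euclidean distance are assumptions of this illustration:

```python
def doi(point, center, radius):
    """Non-binary degree of interest in [0, 1]: 1 at the
    center-of-interest, falling off linearly to 0 at `radius`."""
    dist = sum((p - c) ** 2 for p, c in zip(point, center)) ** 0.5
    return max(0.0, 1.0 - dist / radius)

print(doi((0, 0), (0, 0), radius=10))  # 1.0 (fully in focus)
print(doi((6, 8), (0, 0), radius=10))  # 0.0 (at the boundary)
print(doi((3, 4), (0, 0), radius=10))  # 0.5 (context, half interest)
```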
Standard brushing primarily acts along the axes, but the technique called angular brushing enables the space between axes for brushing [20]. The user can interactively specify a subset of slopes, which then marks as part of the current focus all data points that exhibit the matching correlation between the brushed axes. For example, a user interested only in data that has a negative correlation between Horsepower and Acceleration can use angular brushing as shown in Figure 2.17.
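Angular brushing can be sketched as a filter on the slope of each line segment between two adjacent (normalized) axes; the record values below are invented:

```python
def angular_brush(records, axis_a, axis_b, slope_range):
    """Angular brush between two adjacent axes: keep records whose
    line segment between the (normalized) axes has a slope inside
    slope_range. Negative slopes indicate negative correlation."""
    lo, hi = slope_range
    return [r for r in records if lo <= (r[axis_b] - r[axis_a]) <= hi]

cars = [{"hp": 0.9, "accel": 0.2},
        {"hp": 0.3, "accel": 0.8},
        {"hp": 0.5, "accel": 0.4}]
# Keep only segments sloping downwards (negative correlation).
selected = angular_brush(cars, "hp", "accel", slope_range=(-1.0, 0.0))
print(len(selected))  # 2
```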
Axis Reordering
One strength of parallel coordinates, as described in section 3.2.1, is its effectiveness in visualizing relations between coordinate axes. By bringing axes next to each other interactively, the user can investigate how values are related to each other with respect to two of the data dimensions. The order of the axes clearly affects the patterns revealed by parallel coordinate plots. Figure 2.18 shows 3 of the N! (N = 8 in this case) ways of ordering the axes, but only plot C in Figure 2.18 is capable of showing that there is a strong negative correlation between weight and economy.
Many researchers address this problem using some measure to score an ordering of the axes, while others discuss how to visualize multiple orderings in a single display [21] [24]. Several approaches based on the combination of the Nonlinear Correlation Coefficient and the Singular Value Decomposition algorithm [25] have been suggested. Using these approaches, the first remarkable axis can be selected on a mathematical basis, and all axes are re-ordered in line with the degree of similarity among them [39].
Figure 2.16: An example of Smooth brushing
Data Clustering
Parallel coordinates are a good technique for showing clusters in a data set, and researchers have used many techniques to display them.
Coloring is one method that has been used to show clusters in parallel coordinates [17]: different colors are assigned to different clusters. Figure 2.19 shows two explicitly given clusters represented with two different colors. Figure 2.20 shows the same cluster visualization technique for many clusters, for a data set taken from the USDA National Nutrient Database.
Figure 2.17: Angular Brushing
Variable-length opacity bands [17] are another technique for showing clusters in parallel coordinates. Figure 2.21 shows a graduated band, faded from a dense middle to transparent edges, that visually encodes information for a cluster. The mean stretches across the middle of the band and is encoded with the deepest opacity. This allows the user to differentiate sparse, broad clusters from narrow, dense clusters. The top and bottom edges of the band have full transparency, and the opacity across the rest of the band is linearly interpolated. The thickness of the band at each axis represents the extents of the cluster in that dimension.
Curved bundling [40] is also used to visualize clusters in parallel coordinates. Bundled
Figure 2.18: Multiple ways of ordering N axes in parallel coordinates
Figure 2.19: Two clusters represented in parallel coordinates
Figure 2.20: Multiple clusters visualized in parallel coordinates in different colors
Figure 2.21: Variable length Opacity Bands representing a cluster in parallel coordinates
Figure 2.22: Parallel-coordinates plot using polylines and using bundled curves
curve plots extend the traditional polyline plots and are designed to reveal the structure of clusters previously identified in the input data. Given a data point (P1, P2, ..., PN), its corresponding polyline is replaced by a piecewise cubic Bézier curve preserving the following properties. (Denote the main axes by X1, X2, ..., XN to avoid confusion between them and the added axes.)
• The curve interpolates P1, P2, ..., PN at the main axes.
• Curves corresponding to data points that belong to the same cluster are bundled between adjacent main axes. This is accomplished by inserting a virtual axis midway between the main axes and by appropriately positioning the Bézier control points along the virtual axis. To support curve bundling, control points that define curves within the same cluster are attracted toward a cluster centroid along the virtual axis.
Figure 2.22 compares a polyline plot with its counterpart using bundled curves. Polylines require color coding to distinguish clusters, whereas curve bundles rely on geometrical proximity to naturally represent cluster information. The cluttered visualization of color-coded polylines, which is the standard approach to cluster-membership visualization, motivates the new geometry-based method.
Bundling violates the point-line duality discussed in section 3.2.1, but it can be used to visualize clusters using geometry only, leaving the color channel free for other uses such as the statistical coloring described in section 3.2.6. Many algorithms have been proposed for adjusting the shape of the Bézier curves [40], [22], [69].
Figure 2.23: Statistically colored Parallel Coordinates plot on weight of cars
Statistical Coloring
Coloring the polygonal lines can be used to display statistical properties of an axis. A popular color scheme is to color by the z-score for a chosen dimension, so that the data distribution of that dimension can be understood. Figure 2.23 shows how z-score coloring has been applied to the weight dimension of the cars data set.
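The z-score computation behind this coloring can be sketched as follows; in practice the resulting values would feed a diverging color map, and the car weights here are invented:

```python
from statistics import mean, stdev

def z_colors(values):
    """Color polylines by the z-score of one dimension: map each
    value to its signed distance from the mean, in standard deviations."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

weights = [1500, 2000, 2500, 3000, 3500]
print(z_colors(weights))  # symmetric around 0 for this symmetric sample
```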
Scaling
Scaling of the axes is also an interesting property of parallel coordinates. The default is to plot all values over the full range of each axis, between the minimum and the maximum of the variable. Several other scaling methods have been suggested by researchers [60]; a common one is to use a common scale over all axes. Figure 2.24 shows the difference between two scaling methods, using the individual stage times of the 155 cyclists who finished the 2005 Tour de France bicycle race. Figure 2.24A is plotted with the default scaling and Figure 2.24B using a common scale over all axes. Neither Figure 2.24A nor Figure 2.24B is capable of revealing correlations between axes, even though Figure 2.24B shows the outliers clearly; the spread between the first and the last cyclist is almost invisible for most of the stages. In Figure 2.24C, a common scale for all stages is used, but each stage is aligned at the median value of that stage. It is the user's experience, domain knowledge and use case that define the scale and alignment on the parallel coordinates [60].
Figure 2.24: Three scaling options for visualizing the stage times in the Tour de France
Figure 2.25: Parallel Coordinates plot for a data set with 8000 rows
Limitations
Even though parallel coordinates are a great tool for visualizing high-dimensional data, they soon reach their limits. When using a very large dataset there are some identified weaknesses in parallel coordinates, such as:
1. Cross-over problem - The zigzagging polygonal lines used for data representation are not continuous. They generally lose visual continuation across the parallel coordinate axes, making it difficult to follow lines that share a common point along an axis.
2. When two or more data points have the same or similar values for a subset of the attributes, the corresponding polylines may overlap and clutter the visualization.
Figure 2.25 depicts these two problems on a parallel coordinate plot drawn for 8000 data points. Given a very large data set, with these two problems it is not easy to come to a conclusion about the correlation between axes, and brushing also will not give a clear idea about the data.
One solution to the above problems is to use α-blending [60]. When α-blending is used, each polygon is plotted with only α percent opacity. With smaller α values, areas of high line density are more visible and hence are better contrasted against areas with a small density.
The data in Figure 2.26 are real data from Forina et al. [15] on the fatty acid content of Italian olive oil samples from nine regions. Figures 2.26 A, B and C show the same plot of all eight fatty acids with α values of 0.5, 0.1, and 0.01 respectively. Depending on the amount of α-blending applied, the group structure of some of the nine regions is more or less visible [60].
It is hard to come to a conclusion about a value for α; the user must adjust the value until the visualization gives enough insight.
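The contrast gain from α-blending follows from simple compositing arithmetic; a sketch, assuming standard "over" compositing of k overlapping lines:

```python
def stacked_opacity(alpha, k):
    """Resulting opacity where k polylines of opacity `alpha`
    overlap under standard 'over' compositing: 1 - (1 - alpha)^k."""
    return 1.0 - (1.0 - alpha) ** k

# With alpha = 0.1 a single line is faint, but fifty overlapping lines
# are nearly opaque, so dense regions stand out against sparse ones.
print(round(stacked_opacity(0.1, 1), 3))   # 0.1
print(round(stacked_opacity(0.1, 50), 3))  # 0.995
```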
Clustering and statistical coloring, mentioned in sections 3.2.5 and 3.2.6, also reduce the weaknesses of parallel coordinates. As Figure 2.27 shows, the point-line duality is better preserved when statistical coloring is used. Two data preprocessing techniques can also be used to overcome the limitations of parallel coordinates: data selection and data aggregation. Data selection means that a display does not represent a dataset as a whole but only a portion of it, which is selected in a certain way [30]. The display is supplied with interactive controls for changing the current selection, which results in showing another portion of the data [1].
Figure 2.28 shows how displaying a portion of the data can overcome the weaknesses of parallel coordinates. Figure 2.28A displays only the food group of sausages and luncheon meats; Figures 2.28B and 2.28C display the food groups of beef products and of spices and herbs respectively, which yields a better visualization than plotting the whole data set.
Data aggregation reduces the amount of data under visualization by grouping individual items into subsets, often called aggregates, for which some collective characteristics can be computed. The aggregates and their characteristics (jointly called aggregated data) are then explored instead of the original data. For example, in parallel coordinates a whole cluster can be drawn as just one polygonal line, which reduces the limitations mentioned at the beginning of this section.
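Collapsing each cluster to one representative line can be sketched as follows; using the per-dimension mean as the representative is an assumption of this illustration:

```python
from statistics import mean

def aggregate_clusters(records, labels):
    """Collapse each cluster to a single representative polyline
    (here the per-dimension mean), reducing visual clutter."""
    clusters = {}
    for rec, lab in zip(records, labels):
        clusters.setdefault(lab, []).append(rec)
    return {lab: [mean(dim) for dim in zip(*recs)]
            for lab, recs in clusters.items()}

data = [(1, 10), (3, 12), (8, 2), (10, 4)]
print(aggregate_clusters(data, ["a", "a", "b", "b"]))
# {'a': [2, 11], 'b': [9, 3]}
```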
Parallel coordinates might be the plot least affected by the curse of dimensionality, since they can represent as many dimensions as the screen width permits. But a limitation still appears for high-dimensional data, because the distance d between two adjacent axes decreases as the number of dimensions increases. As a result, the correlation between axes might not be clear in the plot. Most applications assume it is up to the user to decide which attributes should be kept in, or removed from, a visualization. This is not a good approach for a user without domain knowledge; parallel coordinates themselves can be used to reduce the dimensionality of the data set [2].
When discussing axis reordering in section 3.2.4, we talked about obtaining a measure of axis similarity. Once the most similar axes are identified through that algorithm, the application can suggest that the user remove them and keep one significant axis out of each group of similar axes [2]. In that way redundant attributes can be removed from the visualization, and the space can be used efficiently to represent the remaining attributes.
Figure 2.26: Parallel Coordinates for the Olive Oils data, showing how α-blending can improve dense visualizations
Figure 2.27: Parallel Coordinates visualization with Z Score coloring
Parallel coordinates are a good technique for visualizing data. They support many user interactions and data analysis techniques, and although they have limits, researchers have found many ways to overcome them. Parallel coordinates remain a hot topic in data visualization research.
Radviz
The Radviz (Radial Visualization) method [23] maps a set of n-dimensional data points onto a two-dimensional space. All dimensions are represented by a set of equally spaced anchor points on the circumference of a circle.
For each data instance, imagine a set of springs connecting the data point to the anchor point of each dimension. The spring constant of the spring connected to the ith anchor corresponds to the value of the ith dimension of the data instance. Each data point is then displayed where the sum of all the spring forces equals zero. All the data point values are usually normalized to values between 0 and 1.
Consider the example in Figure 2.29.A; this data has 8 dimensions d1, d2, ..., d8, and each data point is connected to the anchors as shown in the diagram using springs. Following this procedure for all the records in the dataset leads to the Radviz display. Figure 2.29.B shows a Radviz representation of a dataset on transitional cell carcinoma (TCC) of the bladder generated by the Clifford Lab at LSUHSC-S [58].
One major disadvantage of this method is the overlap of points. Consider the following two points in a 4-dimensional data space: (1, 1, 1, 1) and (10, 10, 10, 10). These two data records will overlap in a Radviz display, even though they are clearly different, because the dimensions pull them both equally.
Figure 2.28: Parallel Coordinates drawn on same data set using data selection
Figure 2.29: Radviz Visualization for multi dimensional data
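The spring equilibrium reduces to a value-weighted mean of the anchor positions, which also makes the overlap problem easy to demonstrate; a minimal sketch:

```python
import math

def radviz(point):
    """Place an n-dimensional point at the equilibrium of springs
    pulling towards n equally spaced anchors on the unit circle;
    the equilibrium is the value-weighted mean of the anchors."""
    n = len(point)
    anchors = [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
               for i in range(n)]
    total = sum(point)
    x = sum(v * ax for v, (ax, _) in zip(point, anchors)) / total
    y = sum(v * ay for v, (_, ay) in zip(point, anchors)) / total
    return (x, y)

# The overlap problem: both points are pulled equally by every anchor,
# so they land on the same spot even though they are clearly different.
print(radviz((1, 1, 1, 1)))      # both approximately (0.0, 0.0)
print(radviz((10, 10, 10, 10)))
```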
Categorical dimensions cannot be visualized directly with Radviz and require additional preprocessing: each categorical dimension first needs to be flattened to create a new dimension for each possible category. This becomes problematic as the number of possible categories increases and may lead to poor visualizations.
Another challenge in generating good visualizations with this method is identifying a good ordering for the anchor points that correspond to the dimensions, one that makes it easy to identify patterns in the data. An interactive approach that allows changing the positions of the anchor points can help users overcome this issue.
Mosaic Plots
Mosaic plots [19], [16] are a popular method of visualizing categorical data. They provide a way of visualizing the counts in a multivariate n-way contingency table: the frequencies in the table are represented by a group of rectangles whose areas are proportional to the frequency of each cell.
Figure 2.30: Mosaic plot for the Titanic data showing the distribution of passengers' survival based on their class and sex
A mosaic plot starts as a rectangle. At each stage of plot creation, the rectangles are split parallel to one of the two axes according to the proportions of data belonging to each category. An example of a mosaic plot is shown in Figure 2.30: a mosaic plot for the Titanic dataset, which describes the attributes of the passengers on the Titanic and details of their survival.
The process of creating a mosaic display can be described as follows [24]. Assume that we want to construct a mosaic plot for p categorical variables X1, ..., Xp, and let ci be the number of categories of variable Xi, i = 1, ..., p.
1. Start with one single rectangle r (of width w and height h), and let i = 1.
2. Cut rectangle r(i−1) into ci pieces: find all observations corresponding to rectangle r(i−1), and find the breakdown for variable Xi (i.e., count the number of observations that fall into each of its categories). Split the width (height) of rectangle r(i−1) into ci pieces whose widths (heights) are proportional to the breakdown, keeping the height (width) of each piece the same as that of r(i−1). Call these new rectangles r(j, i), with j = 1, ..., ci.
3. Increase i by 1.
4. While i ≤ p, repeat steps 2 and 3 for all rectangles r(j, i−1) with j = 1, ..., c(i−1).
Figure 2.31: Double Decker plot for the Titanic data
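One splitting step of the procedure above can be sketched as follows; the (x, y, w, h) rectangle representation is an assumption of this illustration, and the counts are the familiar per-class passenger totals of the Titanic dataset:

```python
def mosaic_split(rect, counts, horizontal=True):
    """One step of mosaic-plot construction: split a rectangle
    (x, y, w, h) into pieces proportional to category counts,
    along the horizontal or vertical direction."""
    x, y, w, h = rect
    total = sum(counts)
    pieces, offset = [], 0.0
    for c in counts:
        frac = c / total
        if horizontal:
            pieces.append((x + offset, y, w * frac, h))
            offset += w * frac
        else:
            pieces.append((x, y + offset, w, h * frac))
            offset += h * frac
    return pieces

# First split of the unit square by passenger class (1st, 2nd, 3rd, crew).
print(mosaic_split((0, 0, 1.0, 1.0), counts=[325, 285, 706, 885]))
```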
In standard mosaic plots the rectangle is divided both horizontally and vertically. A variation that divides the rectangle only horizontally, called the Double Decker plot, has been proposed [19]; these can be used to visualize association rules. An example of a double decker plot is shown in Figure 2.31 for the same data as in Figure 2.30. There are other variations of mosaic plots, such as fluctuation diagrams, that try to increase their usability.
Mosaic plots are an interesting visualization technique for categorical data, but they cannot handle continuous data: to display continuous data in a mosaic plot, the data first needs to be converted to categorical form through a process such as binning. Mosaic plots also require the visual comparison of rectangles and their sizes to understand the data. This becomes complicated as the number of rectangles grows and the distance between them increases, so the plots become harder to interpret and understand. Vastly different aspect ratios of the rectangles further compound the difficulty of comparing their sizes.
Another issue with mosaic plots is that they become more complex as the number of dimensions in the data increases. Each additional dimension requires the rectangles to be split again, which at least doubles the possible number of rectangles, leading to a final visualization that is not very user friendly.
Self Organizing Maps
The self-organizing map (SOM) [58] is a type of neural network that has been used widely in data exploration and visualization, among its many other uses. SOMs use an unsupervised learning algorithm to perform a topology-preserving mapping from a high-dimensional data space to a lower-dimensional map (usually a two-dimensional lattice). The mapping preserves the topology of the high-dimensional data space, such that data points lying near each other in the original multidimensional space map to nearby units in the output space.
Generating a self-organizing map consists of training a set of neurons with the dataset. At each step of the training, an input data item is matched against the neurons, and the closest one is chosen as the winner. Then the weights of the winner and its neighborhood are updated to reinforce this behavior. The final result is a topology-preserving ordering in which similar new data entries match neurons near each other.
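The training loop can be sketched for a tiny one-dimensional map; the neighborhood radius, learning-rate decay, and data values are illustrative assumptions:

```python
import random

def train_som(data, n_units, steps=200, lr=0.5, radius=1, seed=0):
    """Train a tiny 1-D SOM: at each step the closest unit (winner)
    and its neighbors are pulled towards a random input item."""
    rng = random.Random(seed)
    units = [rng.random() for _ in range(n_units)]
    for _ in range(steps):
        x = rng.choice(data)
        winner = min(range(n_units), key=lambda i: abs(units[i] - x))
        for i in range(n_units):
            if abs(i - winner) <= radius:  # neighborhood update
                units[i] += lr * (x - units[i])
        lr *= 0.99                          # decay the learning rate
    return units

# Two well-separated 1-D clusters; the map units settle near the data.
print(train_som([0.1, 0.12, 0.9, 0.88], n_units=4))
```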
An example of a self-organizing map is shown in Figure 2.33: a self-organizing map trained on the poverty levels of countries [31]. As can be seen, countries with similar poverty levels are matched to neurons close to each other. The USA, Canada and other countries with low poverty sit together in the yellow and green areas, while countries such as Afghanistan and Mali, which have high poverty levels, are grouped together in the purple areas. This shows the topology-preserving aspect of SOMs.
There are some challenges with using self-organizing maps for multidimensional data
visualization.
1. SOMs are not unique. The same data can lead to widely different outcomes based on
the initialization of the SOM. So the same data may yield different visualizations and
lead to confusion.
Figure 2.32: Training a self organizing map.
Figure 2.33: A self organizing map trained on the poverty levels of countries
2. While similar data points are grouped together in SOMs, similar groups are not guar-
anteed to be close to each other. Some SOMs may be created that have similar groups
in multiple places in the map.
3. SOMs are not very user friendly compared with other visualization techniques. It is not easy to look at a SOM and interpret the data.
4. The process of creating a SOM is computationally expensive. The computational
requirements grow as the dimensionality of data increases. In modern data sources
that are highly complex and detailed this becomes a major drawback.
Sunburst Visualization
The Sunburst technique, like the Tree Map [65], is a space-filling visualization, but it uses a radial rather than a rectangular layout to visualize hierarchical information [55]. It is comparable to a nested pie chart and can be used to show hierarchical information such as the elements of a decision tree. This compact visualization avoids the problem of decision trees growing too wide to fit the display area; it is akin to visualizing the tree top-down, with the center representing the root of the decision tree and the ring around it its children.
In Sunburst, the top of the hierarchy is at the center and deeper levels lie farther from the center. The angle swept out by an item and its color correspond to some attribute of the data. For instance, in a visualization of a file system, the angle may correspond to the file/directory size and the color to the file type. An example Sunburst display is shown in Figure 2.34. This visualization has been used to summarize user navigation paths through a website [48], and also to visualize frequent item sets [34].
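The proportional-angle layout can be sketched with a short recursion; the (name, size, children) hierarchy structure and the example values are assumptions of this illustration:

```python
def sunburst_angles(node, start=0.0, end=360.0, depth=0, out=None):
    """Assign each node of a hierarchy an angular extent proportional
    to its size, nesting children inside the parent's arc; the depth
    becomes the ring index (root at the center)."""
    if out is None:
        out = []
    name, size, children = node
    out.append((name, depth, start, end))
    total = sum(c[1] for c in children) or 1
    a = start
    for child in children:
        span = (end - start) * child[1] / total
        sunburst_angles(child, a, a + span, depth + 1, out)
        a += span
    return out

# A tiny file-system-like hierarchy (names and sizes are invented).
tree = ("root", 100, [("docs", 75, []), ("src", 25, [])])
print(sunburst_angles(tree))
# [('root', 0, 0.0, 360.0), ('docs', 1, 0.0, 270.0), ('src', 1, 270.0, 360.0)]
```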
Trellis Visualization
A Trellis chart, also known as small multiples [61], a panel chart, a lattice chart, or a grid chart, is a layout of smaller charts in a grid with consistent scales. Each smaller chart represents an item in a category, named a condition [67]; the data displayed in each smaller chart is conditional on items of that category. Trellis charts are useful for finding structure and patterns in complex data. The grid layout looks similar to a garden trellis, hence the name.
Figure 2.34: A sunburst visualization summarizing user paths through a fictional e-commerce site.
The main aspects of trellis displays are columns, rows, panels and pages [46]. Figure 2.35 consists of 4 columns, 1 row, 4 panels and 1 page. Trellised visualizations enable the user to quickly recognize similarities or differences between different categories in the data. Each individual panel in a trellis visualization displays a subset of the original data table, where the subsets are defined by the categories available in a column or hierarchy. To make plots comparable across rows and columns, the same scales are used in all the panel plots [59].
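Partitioning a table into conditioned panels on a grid can be sketched as follows; the column names and the four-column grid are illustrative assumptions:

```python
def trellis_panels(rows, condition_key, n_columns=4):
    """Split a data table into panels conditioned on one categorical
    column, and lay the panels out on a grid with a fixed column count."""
    panels = {}
    for row in rows:
        panels.setdefault(row[condition_key], []).append(row)
    layout = {cat: divmod(i, n_columns)  # (grid row, grid column)
              for i, cat in enumerate(sorted(panels))}
    return panels, layout

sales = [{"region": r, "amount": a}
         for r, a in [("East", 5), ("West", 3), ("East", 7), ("North", 2)]]
panels, layout = trellis_panels(sales, "region")
print(sorted(panels))  # ['East', 'North', 'West']
print(layout["East"])  # (0, 0): first grid row, first column
```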
The benefits of trellis charts are:
• They are easy to understand. A Trellis Chart is a basic chart type repeated many times; if you understand the basic chart type, you can understand the whole Trellis Chart.
• Having many small charts lets you view complex multi-dimensional data in a flat 2D layout, avoiding the need for confusing 3D charts.
• The grid layout combined with consistent scales makes data comparison simple: just look up/down or across the charts.
Figure 2.35: Trellis Chart for a data set on sales
Figure 2.36 contains a trellis chart for the Minnesota barley data from The Design of Experiments [14] by R.A. Fisher. The trial involved planting 10 varieties of barley at 6 different sites over two different years, and the researchers measured the yield in bushels per acre for each of the 120 possibilities.
Grand Tour
The grand tour is one of the tour methods used to find structure in multidimensional data, and it can be applied to show multidimensional data on a 2D computer display. A tour is a subset of all the possible projections of the multidimensional data; the different tour methods combine several static projections, using different interpolation techniques, into a movie, which is called a tour [9].
Figure 2.36: Trellis Display of Scatter Plots (Relationship of Gifts Given/Received on Revenue)
Tour
In a static projection, some of the information in the dataset is lost to the user. But if several projections onto different planes are shown to the user step by step, the user can get an overview of the structure of the multivariate data.
Tours provide a general approach to choosing and viewing data projections, allowing the viewer to mentally connect disparate views and thus supporting the exploration of a high-dimensional space.
Figure 2.37: A snapshot of the grand tour; a projection of the data onto a single plane is illustrated in (B)
Tour methods
• Grand Tour - Shows all projections of the multivariate data by a random walk through the landscape.
• Projection Pursuit (PP) guided tour - The tour concentrates on the more interesting views, based on a PP index.
• Manual Control - The user decides the direction the tour takes.
The grand tour method chooses the target plane by random selection: a frame is randomly selected from the space of all possible projections. A target frame is chosen by standardizing a random vector from a standard multivariate normal distribution: sample p values from a standard univariate normal distribution, giving a sample from a standard multivariate normal. Standardizing this vector to have length one gives a random value on a (p−1)-dimensional sphere, that is, a randomly generated projection vector. Doing this twice gives a 2D projection, where the second vector is orthonormalized against the first. Figure 2.38 illustrates the tour path.
Figure 2.38: Grand tour path in 3D space
The solid circle in Figure 2.38 indicates the first point on the tour path, corresponding to the starting frame. The solid square indicates the last point on the tour path, i.e., the last projection computed. Each point corresponds to a projection from 3 dimensions to one dimension; the projection looks as if the data space were viewed from that direction. In the grand tour this point is chosen randomly.
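The frame-selection procedure described above can be sketched as a plain Gram-Schmidt orthonormalization of two random Gaussian vectors:

```python
import math
import random

def random_frame(p, seed=0):
    """Pick a random 2-D projection frame for the grand tour:
    two p-dimensional unit vectors, the second orthonormalized
    against the first (Gram-Schmidt)."""
    rng = random.Random(seed)
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]
    u = unit([rng.gauss(0, 1) for _ in range(p)])  # random direction
    w = [rng.gauss(0, 1) for _ in range(p)]
    dot = sum(a * b for a, b in zip(u, w))
    v = unit([b - dot * a for a, b in zip(u, w)])  # remove the u component
    return u, v

u, v = random_frame(p=5)
dot = sum(a * b for a, b in zip(u, v))
print(abs(dot) < 1e-9)  # True: the frame is orthonormal
```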
2.4 CEP Rule generation
Recent advances in technology have enabled the generation of vast amounts of data in a wide range of fields. This data is created continuously, in large quantities over time, as data streams. Complex Event Processing (CEP) can be used to analyze and process these large data streams to identify interesting situations and respond to them as quickly as possible.
Complex event processors are used in almost every domain: vehicular traffic analysis, network monitoring, sensor data analysis [7], stock market trend analysis [11], and fraud
detection [50]. Any system that requires real-time monitoring can use a complex event processor.
In CEP, processing takes place according to user-defined rules, which specify the relations between the observed events and the actions required by the user. For example, in a network monitoring system a complex event processor can be used to notify the system admin about excessive internet usage by a user on that network. An example rule would look like the following,
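A rule of this kind might be sketched in a SiddhiQL-like syntax; the stream and attribute names (NetworkUsageStream, bandwidth, userId) and the threshold are illustrative assumptions, not a listing from a real system:

```sql
-- Hypothetical SiddhiQL-style rule (names are illustrative only):
-- when a usage event exceeds the bandwidth limit, emit a notification.
from NetworkUsageStream[bandwidth > 500]
select userId, bandwidth
insert into AdminNotificationStream;
```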
Where if a user’s bandwidth exceeds the limit, the admin will receive a notification. The
value of the ”limit” in this example should be low enough to catch high usage as well as it
should be high enough to ignore normal users.
Any complex event processing rule will have a condition to check, and an action associated
with that condition. So regardless of the domain, any system using a CEP heavily depends
on the rules defined by the user.
In current complex event processing applications, users need to manually specify the rules that are used to identify and act on important patterns in the event streams. This is a complex and arduous task: it is time consuming, involves a lot of trial and error, and typically requires domain-specific knowledge that is hard to identify accurately.
Rule writing is therefore typically done by domain experts who study the parameters available in the event streams, manually or using external data analysis tools, to identify the events that need special handling. Needless to say, incorrect estimation of the relevant parameters in the rules negatively impacts the utility of the systems that depend on accurate processing of these events. Even for domain experts, manually specifying textual rules in a CEP-specific rule language is not a user-friendly experience. Moreover, keeping a rule working as data and behavior change may require periodic updates that demand the same effort as writing the rule in the first place.
Several approaches [41], [63], [44] have been proposed to overcome these difficulties, using data mining and knowledge discovery techniques to generate rules based on available data. These give users the ability to automatically generate rules based on their requirements. Two such approaches can help in generating CEP rules. One is a framework that learns, from historical traces, the hidden causality between the received events and the situations to detect, and uses it to automatically generate CEP rules [41]. The other starts from a skeleton of the rule and uses historical traces to tune the parameters of the final rule [63].
iCEP
iCEP [41] analyzes historical traces and learns from them. It adopts a highly modular design,
with different components considering different aspects of the rule.
The following terminology and definitions are used in the framework. Each event notification is assumed to be characterized by a type and a set of attributes. The event type defines the number, order, names, and types of the attributes that compose the event itself. It is also assumed that events occur instantaneously at some point in time; accordingly, each notification includes a timestamp, which represents the time of occurrence of the event it encodes. The authors use the following example event of type Temp.
Temp@10(room=123, value=24.5)
This event encodes the fact that the air temperature measured inside room 123 at time 10 was 24.5 °C.
Another aspect of the terminology used by the authors is the distinction between primitive and composite events. Simple events like the one given above are considered primitive events. A composite event is defined using a pattern of primitive events; when such a pattern is identified, the CEP engine derives that a composite event has occurred and notifies the interested components. An event trace that ends with the occurrence of the composite event is called a positive event trace.
The iCEP framework uses the following basic building blocks, common to most CEP systems, to generate filters for events.
• Selection: filters relevant event notifications according to the values of their attributes.
• Conjunction: combines event notifications together.
• Parameterization: introduces constraints involving the values carried by different events.
• Sequence: introduces ordering relations among events.
• Window: defines the maximum timeframe of a pattern.
• Aggregation: introduces constraints involving aggregated values.
iCEP uses a set of modules that generate a combination of the above building blocks to form CEP rules. The framework uses a training data set created from historical traces to generate rules via a supervised learning technique.
The learning method rests on the following consideration. Consider the following positive event trace:
1 : A@0, B@2, C@3
This implies the following set of constraints, S1:
- A: an event of type A must occur
- B: an event of type B must occur
- C: an event of type C must occur
- AB: the event of type A must occur before that of type B
- AC: the event of type A must occur before that of type C
- BC: the event of type B must occur before that of type C
We can assert that, for each rule r and event trace τ, r fires on τ if and only if Sr ⊆ S(τ), where Sr is the complete set of constraints that must be satisfied for the rule to fire and S(τ) is the constraint set implied by τ, as above.
Using this observation, the problem of rule generation can be expressed as the problem of identifying Sr. Given a single positive trace τ, S(τ) can be considered an over-constraining approximation of Sr. To produce a better approximation of Sr, we can consider the set of all positive traces collectively and take the conjunction (intersection) of all the constraint sets they generate.
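This intersection intuition can be illustrated with a small sketch (a simplified illustration, not the iCEP implementation; the encoding of a trace as (event type, timestamp) pairs is an assumption):

```python
from itertools import combinations

def constraints(trace):
    """Derive the naive constraint set S(tau) for one positive trace.

    A trace is a list of (event_type, timestamp) pairs; constraints are
    'X' (an event of type X must occur) and 'X<Y' (X occurs before Y).
    """
    first_seen = {}
    for event_type, timestamp in trace:
        first_seen.setdefault(event_type, timestamp)
    s = set(first_seen)  # presence constraints
    for a, b in combinations(sorted(first_seen), 2):
        if first_seen[a] < first_seen[b]:
            s.add(f"{a}<{b}")
        elif first_seen[b] < first_seen[a]:
            s.add(f"{b}<{a}")
    return s

def learn_rule(positive_traces):
    """Approximate Sr as the intersection of S(tau) over all positive traces."""
    return set.intersection(*(constraints(t) for t in positive_traces))
```

Given the trace A@0, B@2, C@3 together with a second trace in which C precedes B, the ordering constraint B<C drops out of the intersection, while A, B, C, A<B, and A<C survive.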
Using these intuitions, the iCEP framework takes the following steps to generate rules.
1. Determine the relevant timeframe to consider (window size)
2. Identify the relevant event types and attributes
3. Determine the selection and parameter constraints
4. Discover ordering constraints (sequences)
5. Identify aggregate and negation constraints.
The final structure of the framework is shown in Figure 2.39. The problem is broken down into subproblems and solved by different modules (described below) that work together.
• Event Learner: The event learner tries to determine which primitive event types are required for the composite event to occur. It takes the window size as an optional input parameter and cuts each positive trace so that it ends with the occurrence of the composite event. For each positive trace, the event learner extracts the set of event types the trace contains; then, following the general intuition described above, it computes and outputs the intersection of all these sets.
Figure 2.39: Structure of the iCEP framework
• Window Learner: The window learner is responsible for learning the size of the window that includes all primitive events required for a composite event. If the required event types are known, the window learner tries to identify a window size that ensures all required primitive events are present in all positive traces. If the required event types are not known, the window learner and event learner use an iterative approach in which increasing window sizes are fed to the event learner until the required rule accuracy is reached.
• Constraint Learner: This module receives the filtered event traces from the above two modules and tries to identify possible constraints on the parameters. For each parameter it first looks for an equality constraint, where all positive traces contain a single value; failing that, it generates an inequality constraint that accepts values between the minimum and maximum values observed across all positive traces.
• Aggregate Learner: As shown in Figure 2.39, the aggregate learner runs in parallel with the constraint learner. Instead of looking at single values, it applies aggregation functions such as sum and average over all events of a given type within the time window, and generates constraints on the aggregated values.
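The constraint learner's equality-or-range strategy, and the aggregate learner's variant of it, can be sketched as follows (a simplified illustration; the function names are ours, not iCEP's):

```python
def learn_attribute_constraint(values):
    """Return an equality constraint when every positive trace carries
    the same value, otherwise a [min, max] range constraint (simplified)."""
    distinct = set(values)
    if len(distinct) == 1:
        return ("==", distinct.pop())
    return ("between", min(values), max(values))

def learn_aggregate_constraint(windows, aggregate=sum):
    """Aggregate-learner analogue: constrain an aggregate (e.g. the sum)
    computed over each window's values instead of over single values."""
    return learn_attribute_constraint([aggregate(w) for w in windows])
```

For example, attribute values (5, 5, 5) across the positive traces yield the equality constraint ('==', 5), while (3, 7, 5) yield the range constraint ('between', 3, 7).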
The other modules in the framework use the same methods to identify the remaining aspects of the rule. The effectiveness of the framework has been assessed using the following steps.
1. Use an existing rule created by a domain expert that identifies a set of composite events in a data stream, and collect the positive traces.
2. Use iCEP with the data collected in the above step to generate a rule.
3. Run the data again through the CEP engine with the generated rule and capture the composite events triggered.
4. Compare the two versions and calculate precision and recall.
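Step 4 amounts to a set comparison between the composite events fired by the expert rule and those fired by the generated rule; a minimal sketch (representing events by hashable identifiers is an assumption):

```python
def precision_recall(expert_events, generated_events):
    """Compare composite events fired by the expert rule (taken as ground
    truth) with those fired by the generated rule, as sets of identifiers."""
    expert, generated = set(expert_events), set(generated_events)
    true_positives = len(expert & generated)
    precision = true_positives / len(generated) if generated else 0.0
    recall = true_positives / len(expert) if expert else 0.0
    return precision, recall
```

For instance, if the expert rule fires on events {1, 2, 3, 4} and the generated rule on {2, 3, 4, 5}, both precision and recall are 0.75.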
The results have been promising, with precision around 94% in some of the tests run by the authors. But the system is far from perfect, and the following are some of the challenges that need to be overcome.
1. A large training data set with many positive traces is required to generate good rules with high precision. The training methodology considers only the conjunction of all the positive traces, so without a large number of positive traces covering the variation in the data, generating accurate rules is difficult.
2. High computational requirements. The iterative approach used by the window learner and event learner translates into a large amount of computation, so without hints from a domain expert on the window size or the required events and parameters, the runtime and computational cost increase rapidly.
3. The generated rules require tuning and cleanup by the user. Because the rules are generated automatically, the constraints may be over-constraining or may fail under previously unseen conditions, so a final cleanup by the user is required.
Tuning rule parameters using the Prediction-Correction Paradigm
A mechanism has been proposed by Yulia Turchin to automate both the initial definition of rules and their update over time [63]. It consists of two main repetitive stages: rule parameter prediction and rule parameter correction. Parameter prediction updates the parameters using available expert knowledge about how the parameters are expected to change. Rule parameter correction uses expert feedback about the actual past occurrence of events, together with the events materialized by the CEP framework, to tune the rule parameters.
For example, in an intrusion detection system [4], a domain expert can specify a rule as follows: if the size of a packet received from a user deviates strongly from the normal packet size, with estimated mean m1 and standard deviation σ1, infer an event E1
Figure 2.40: Prediction Correction Paradigm
representing the anomaly level of the packet size. It is hard to determine the values for m1 and σ1; moreover, the specified values can change over time due to the dynamic nature of network traffic.
Rule parameter determination and tuning can be done as follows: given a set of rules, provide an initial value for each rule parameter and then modify it as required. For example, for a given rule, the rule tuning algorithm might suggest replacing the value m1 with a value m2 such that m2 < m1. The initial prediction of m1 can be treated as a special case of tuning, in which an arbitrary value is corrected to m1 by the tuning algorithm. The tuning algorithm should be tied to the system's ability to correctly predict events, so that it can recognize, for instance, that the parameter m1 is too high and many intrusions went undetected, and that m1 therefore needs to be reduced to m2.
The proposed framework is based on the Kalman estimator, a simple type of supervised, Bayesian, predict-correct estimator [18]. As shown in Figure 2.40, the framework learns and updates the system state in two stages: rule parameter prediction and rule parameter update. Rule parameter prediction is unsupervised: parameters are updated without any user feedback, relying on preexisting knowledge of how the parameters might change over time and on the events created by the inference algorithm. In the rule parameter update stage, the parameters are tuned in a supervised manner using domain experts' feedback and recently generated events. User feedback can take two forms: direct feedback involves changes to the system state, while indirect feedback provides an assessment of the correctness of the estimated event history.
Model
The model of this method consists of events, rules, and the system state. Here an event is a significant (of interest to the system) actual occurrence in the system; examples include notifications of login attempts and failures of IT components. We can therefore define an event history h as the set of all events of interest to the system, together with their associated data. An event notification is an estimate that an event occurred: some events may not be notified, and some non-occurring events may be notified because of faulty equipment. We can therefore define an estimated event history ĥ of notified events of interest to the system. Events can be of two types: explicit events and inferred events.
is an explicit event. Inferred events are the events materialized by the system based on other
events, for example an illegal connection attempt event is an inferred event materialized
by the network security system, based on the explicit event of a new network connection,
and an inferred event of unsuccessful user authorization. Inferred events, just like explicit
events, belong to event histories. Inferred events that actually occurred in the real world belong to the event history h, while those that are only estimated to have occurred belong to the estimated event history ĥ.
Events can be inferred by rules. A rule can be represented by a quadruple r = ⟨sr, pr, ar, mr⟩. sr is a selection function that filters the events relevant to rule r; its input is an event history h, and the events it selects are said to be relevant events. pr is a predicate, defined over the filtered event history, that determines when events become candidates for materialization. ar is an association function that defines how many events should be materialized, as well as which subsets of selectable events are associated with each materialized event. mr is a mapping function that determines the attribute values of the materialized events.
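The quadruple can be sketched directly as data (a toy illustration only; the callable signatures are assumptions, since the paper defines the functions abstractly):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    """r = <sr, pr, ar, mr>; field semantics follow the text, but the
    callable signatures are assumptions made for this illustration."""
    sr: Callable  # selection: filter the events relevant to the rule
    pr: Callable  # predicate: when do events become materialization candidates
    ar: Callable  # association: which event subsets yield a materialized event
    mr: Callable  # mapping: attribute values of each materialized event

    def infer(self, h):
        """Apply the rule to an event history h (a list of attribute dicts)."""
        relevant = self.sr(h)
        if not self.pr(relevant):
            return []
        return [self.mr(group) for group in self.ar(relevant)]

# A toy rule that materializes one 'alert' event per oversized packet:
rule = Rule(
    sr=lambda h: [e for e in h if e["type"] == "packet"],
    pr=lambda h: any(e["size"] > 1000 for e in h),
    ar=lambda h: [[e] for e in h if e["size"] > 1000],
    mr=lambda group: {"type": "alert", "size": group[0]["size"]},
)
```

Running `rule.infer` over a history with packets of size 500 and 1500 materializes a single alert for the 1500-byte packet.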
System State
It is expected that an expert can provide the form of sr, pr, ar, and mr, but providing accurate values is difficult. These values are called rule parameters, and the set of all parameters is called the system state. The system state is updated by the framework as shown in Figure 2.40. In the predict stage, parameters are updated using knowledge of how the rule might change over time and the updated event history ĥ. In the update stage, parameters are updated by direct feedback, in which exact rule parameter values are given, or in an indirect manner, in which events in the estimated event history ĥ are marked according to whether they actually occurred.
Rule Tuning Mechanism
To tune rule parameters, this framework uses the discrete Kalman filter technique. The filter estimates the process state at some time and then obtains feedback in the form of (noisy) measurements.
The rule tuning model consists of two recursive equations: a time equation, which describes how parameters change over time, and a history equation, which describes the outcome of a set of rules and their parameters. The time equation is a function of the previous system state (the set of rule parameters) and the actual event history of that time period; its output is the current system state. The history equation is a function of the current rule parameters, the set of explicit events during that time period, and the actual event history of the previous time period; its output is the actual event history. Since the current system state is not known, an estimated event history equation is used instead; it differs from the original history equation in using the estimated current system state (the estimated current rule parameters), and its output is the estimated current event history. This can be used to evaluate the performance of the inference mechanism: the estimated event history received from the inference mechanism is compared with the actual event history provided by expert feedback at the end of time interval k, from which we can measure precision and recall. Precision is the percentage of correctly inferred events relative to the total number of events inferred in this time interval. Recall is the percentage of correctly inferred events (i.e., true positives) relative to the actual total number of events that occurred in this time interval.
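A scalar predict-correct cycle in the spirit of the discrete Kalman filter can be sketched for a single rule parameter, such as the packet-size mean m1 (the noise variances q and r below are assumed values, not taken from the paper):

```python
def tune_parameter(m_est, p_est, feedback, q=0.01, r=1.0):
    """One predict-correct cycle for a scalar rule parameter.

    m_est, p_est -- current parameter estimate and its variance
    feedback     -- a noisy measurement of the parameter derived from
                    expert-labelled events
    q, r         -- process and measurement noise variances (assumed)
    """
    # Predict: the parameter is assumed locally constant, so only the
    # uncertainty grows by the process noise.
    m_pred, p_pred = m_est, p_est + q
    # Correct: blend the prediction with the feedback using the Kalman gain.
    k = p_pred / (p_pred + r)
    m_new = m_pred + k * (feedback - m_pred)
    p_new = (1.0 - k) * p_pred
    return m_new, p_new

# Repeated feedback around 1200 pulls an initial guess of 1000 toward it:
m, p = 1000.0, 100.0
for _ in range(20):
    m, p = tune_parameter(m, p, 1200.0)
```

Each cycle first inflates the uncertainty (predict) and then blends the prediction with expert-derived feedback in proportion to the Kalman gain (correct), so repeated feedback gradually pulls the parameter toward its true value.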
Figure 2.41: An overview of the rule tuning method
The Rule Tuning Method consists of a repetitive sequence of actions that should be
performed for correct evaluation and dynamic update of rule parameters. The sequence is
illustrated in Figure 2.41.
The above is a generic model for automating rule parameter tuning in CEP systems, and it serves as a proof of concept for automatic rule parameter tuning where doing so manually becomes a cognitive challenge. However, because the model is generic, an actual implementation will require considerable work and tailoring to the specific requirement (such as the intrusion detection example mentioned here). Given the promising results of the empirical study, the model can nevertheless serve as a theoretical basis for any such work.
Chapter 3
Solution
3.1 Overview
We implemented our tool as a web application. As shown in Figure 3.1, the implementation consists of two main components: a client-side component and a server-side component. The server-side component performs most of the computationally intensive work, which is the main reason we chose a client-server web architecture over a standalone application: the server-side component can be deployed on a high-performance server, while the user accesses the tool as a web application through a web browser without requiring a high-end machine.
Our solution consists of a Django web application [12]. We mainly considered the Django and Shiny [53] web frameworks, because we intended to use a Python- or R-based development environment: both languages offer many libraries for data mining and machine learning. One of the main reasons we dropped Shiny was its lack of documentation; Shiny is a relatively new web framework and not as mature as Django. Further, we found a technique for executing R code within a Python environment, so we chose Django, which lets us use both Python and R libraries.
As for the complex event processing engine, we considered both Siddhi and Esper. Initially we planned to implement query generation for both engines, but due to time constraints we implemented it only for Siddhi CEP. We plan to add support for the Esper query language in the future.
Since we are using a web browser as our front-end we had to narrow down our data
Figure 3.1: Architecture of the implementation of Vivarana
visualization library to one that supports JavaScript, CSS, and HTML. From the libraries we considered, such as GGobi, flot.js, plotly, D3, and Tableau, we chose D3 because it is written entirely in JavaScript and has built-in support for all the functionality we planned to implement on top of the basic parallel coordinates implementation we used as a starting point.
We selected parallel coordinates as our main multidimensional data visualization technique. The reasons for selecting parallel coordinates, and how we modified it to enhance interactivity, are described in Section 3.2.
In our implementation we mainly focused on interactively generating CEP rules for web server log data. To support this, we implemented an Apache web log parser module as an extension. This log parser handles all the preprocessing steps for a web log data set specified by the user, and the preprocessed data it returns is used by the other components. To generate rules for a different type of data, one can write an extension that preprocesses that data and implements the API we have defined, so that the preprocessed data returned by the extension is usable by the other components we have implemented. Apart from the Apache web log parser, we have also implemented a parser for comma-separated files.
Furthermore, clustering and anomaly detection components are implemented to help the user identify interesting patterns. The clustering algorithms used and the implementation details are described in Section 3.3.
The aggregation component performs aggregation operations. Complex event processors support sum, average, count, and maximum aggregations, along with group-by and having conditions in queries. Through this component we allow users to apply these operations to the data over a moving window. The implementation details of this component are elaborated in Section 3.3.
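The moving-window behaviour of the aggregation component can be sketched as follows (a simplified illustration with a length-based window; the field name bandwidth is an assumption):

```python
from collections import deque

def windowed_aggregates(events, window, key):
    """Sliding, length-`window` aggregation over an event stream,
    mirroring the sum/average/count/max operations a CEP engine exposes."""
    buffer = deque(maxlen=window)  # oldest value drops out automatically
    results = []
    for event in events:
        buffer.append(event[key])
        results.append({
            "count": len(buffer),
            "sum": sum(buffer),
            "avg": sum(buffer) / len(buffer),
            "max": max(buffer),
        })
    return results

rows = [{"bandwidth": b} for b in (10, 20, 60)]
aggregates = windowed_aggregates(rows, window=2, key="bandwidth")
# The final window holds 20 and 60: sum 80, avg 40.0, max 60.
```

A time-based window would replace the fixed-length deque with eviction by timestamp, but the per-event aggregation step is the same.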
3.2 Visualization - Parallel Coordinates
Data commonly associated with CEP engines is very large and has many dimensions. Displaying this information in a clear, intuitive, and interactive manner is a challenge that has been the focus of a large amount of research. As discussed in the literature review, many visualization techniques, such as scatter plot matrices [Scatterplot], parallel coordinates [Parcords], mosaic plots [MozPlots], and self-organizing maps [SOM], have been proposed over the years to tackle this challenge. After researching these methods we decided to focus on parallel coordinates for our implementation, since it fits most of our requirements as a visualization method for the kind of data we intend to use our tool with. Parallel coordinates, introduced by Inselberg and Dimsdale, is a popular technique for transforming multidimensional data into a 2D visualization. m-dimensional data items are represented as lines crossing m parallel vertical axes, where each axis corresponds to one dimension of the original data. Each element of the data set corresponds to a polyline joining its values on each of the axes; in this way, n data points can be represented by n polygonal lines. Parallel coordinates offers several advantages over other visualization techniques.
1. With parallel coordinates there is no need to project all the dimensions down to two, as with most other visualizations; all the dimensions can be represented in a single 2D image.

2. Thanks to the point-line duality of parallel coordinates, it is easy to observe relationships between dimensions. For example, two dimensions with a highly negative correlation can be identified by their data lines intersecting at a point between the two axes.
3. Parallel coordinates can handle more dimensions than most other visualizations. The number of dimensions is bounded only by the width of the screen, whereas techniques such as the scatter plot matrix become very large and unclear with many dimensions.

4. Parallel coordinates offers several techniques that make it easy for the user to interact with the visualization and identify patterns in the data set. These interaction techniques are discussed in the implementation details.
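The core parallel-coordinates mapping, an m-dimensional item drawn as a polyline over m vertical axes, can be sketched as follows (a minimal illustration of the geometry, not the D3 code we build on):

```python
def to_polyline(record, mins, maxs):
    """Map one m-dimensional record to the vertices of its polyline:
    axis i is drawn at x = i, and the record's value on that axis is
    normalised to [0, 1] using the column's min and max."""
    points = []
    for i, (value, lo, hi) in enumerate(zip(record, mins, maxs)):
        y = (value - lo) / (hi - lo) if hi != lo else 0.5
        points.append((i, y))
    return points

# Two 3-dimensional records become two polylines over three axes:
data = [(1.0, 200.0, 30.0), (3.0, 100.0, 90.0)]
mins = [min(col) for col in zip(*data)]
maxs = [max(col) for col in zip(*data)]
polylines = [to_polyline(rec, mins, maxs) for rec in data]
```

A renderer then only has to connect each record's vertices with straight (or bundled) line segments.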
Given these advantages, we decided to build our implementation on parallel coordinates. There is a basic D3-based implementation of parallel coordinates by Jason Davies that offers only two interaction techniques, 1D brushing and axis reordering.1 Using that implementation as our basis, we added further interaction techniques to give the user a better experience. The following sections describe these techniques, starting with the two the implementation already had, axis reordering and 1D brushing; the rest describe the techniques we added.
Figure 3.2: Basic Implementation of Parallel Coordinates
Axis Reordering
Axis reordering is an important feature that was already present in the parallel coordinates implementation we used, because one strength of parallel coordinates, as described before, is its effectiveness at visualizing relations between coordinate axes. The order of the axes
1Available at http://bl.ocks.org/jasondavies/1341281
clearly affects the patterns revealed by parallel coordinate plots. Many approaches have been suggested for arranging the order of the axes, some using a measure to score an ordering and others visualizing multiple orderings in a single display [21]. Several approaches based on a combination of the nonlinear correlation coefficient and the singular value decomposition algorithm have also been suggested [39]; with these, the first notable axis can be selected on a mathematical basis and all axes re-ordered according to the degree of similarity among them [39]. In our implementation we saw a few disadvantages in using a mathematical model to determine the order of the axes. Axis similarities would have to be computed before visualizing each data set, and since we deal with large data sets, this computation would take considerable time and affect the performance of the tool. More importantly, the aim of our tool is to let the user interact with the visualization to identify patterns in the data set. So rather than presenting a fixed axis order determined by a mathematical model, we allow the user to drag axes next to each other interactively; the user can then investigate how values are related with respect to two particular data dimensions, using his or her domain knowledge to hypothesize which axes are correlated and confirming it later with the help of the visualization.
Brushing
Brushing can be used to distinguish an area the user is interested in from the rest of the data points. The user actively marks subsets of the data set as especially interesting, and the points contained by the brush are colored differently from the other points to make them stand out. For example, if the user is interested in a certain region of a dimension, he can use brushing to highlight it. In Figure 3.3 the user is interested only in the POST method, so he has brushed that axis to distinguish POST data points from the rest of the data.

1D brushing in our implementation is not limited to a single axis. If the user is interested in an area involving two dimensions, he can use composite brushing, a combination of single brushes whose result is the conjunction of those brushes. As in Figure 3.5, if the user is interested in data having the POST method and more than 10,000 MB of bandwidth, he can use composite brushing to specify it.
As described earlier, the 1D brushing technique that was already in the implementation primarily acts along the axes.

Figure 3.3: Example 1D Brushing

Figure 3.4: Example Composite Brushing

With the 2D brushing technique that we added to the visualization, a user can also mark subsets of the data between the axes, marking a region of interest with respect to two axes.
Using brushing, the user can mark a subset of the data set and then generate complex event processing rules that identify it separately from the rest of the data, as discussed in Section 3.4 on rule generation.
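The idea of turning brushes into queries can be sketched as follows (a simplified illustration; the query template and the encoding of brush extents are assumptions, not Vivarana's exact generator):

```python
def brushes_to_siddhi(stream, extents):
    """Turn brush extents into a Siddhi-style filter query.

    `extents` maps a column either to a (low, high) numeric range or to
    a set of selected categorical values.
    """
    conditions = []
    for column, extent in extents.items():
        if isinstance(extent, tuple):  # numeric 1D brush
            low, high = extent
            conditions.append(f"{column} >= {low} and {column} <= {high}")
        else:                          # categorical selection
            alternatives = " or ".join(
                f"{column} == '{v}'" for v in sorted(extent))
            conditions.append(f"({alternatives})")
    condition = " and ".join(conditions)
    return f"from {stream}[{condition}]\nselect *\ninsert into FilteredStream;"

query = brushes_to_siddhi(
    "LogStream", {"method": {"POST"}, "bandwidth": (10000, 50000)})
```

Brushing the POST method and a bandwidth range thus yields a single filter query whose condition is the conjunction of the individual brushes.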
SlickGrid
One major change we made to the user interface of the base parallel coordinates implementation is the introduction of a table showing the details of the data contained in the plot. When the user performs brushing, the table is updated to show the brushed data, and when the user hovers over a record in the table, the corresponding polyline in the plot is highlighted.

Figure 3.5: Example Composite Brushing

The table helps overcome one weakness of parallel coordinates: once the data is displayed, it is difficult to read the exact value of a data point, especially in numeric columns. With the table, the user can easily look up the exact value of any data point. To build the table we used
the SlickGrid library. Whenever the user performs an update operation on the parallel coordinates, such as brushing, keeping, or removing (described later), the grid updates automatically with the relevant data. SlickGrid has its own advantages: it is capable of adaptive virtual scrolling, handling thousands of rows with extreme responsiveness, and it renders extremely fast. SlickGrid uses virtual rendering to work with hundreds of thousands of items without any drop in performance; there is effectively no difference between a grid with 10 rows and one with 100,000 rows, because only what is visible on the screen, plus a small buffer, is actually rendered. As the user scrolls, DOM nodes are continuously created and removed, and these operations are tuned to perform well in all browsers. The grid also adapts to the direction and speed of scrolling to minimize the number of rows that need to be swapped out, and it dynamically switches between synchronous and asynchronous rendering. SlickGrid also takes a flexible approach to updates: in the simplest scenario it accesses data through an array interface (indexing to reach an item at a given position and data.length to determine the number of items), but the API is structured so that it is easy to make the grid react to any change in the underlying data. This is important in our use case, since the user continuously performs brushing, keeping, and removing actions. With a large data set, SlickGrid also allows the user to specify a page size (the number of records per page) and then navigate through the pages
Figure 3.6: Slick Grid along with the Parallel Coordinates
easily.
One disadvantage of this JavaScript library is that it does not support the IE6 browser, but as IE6 is no longer common, we accepted that limitation.
Cluster Visualization
Clustering is a technique commonly used in analyzing large data sets to identify their
underlying structure: it groups together data items that are similar to each other. Our
implementation gives the user the ability to perform clustering on data sets, which is
discussed further in Section 3.3. Since clustering is an important operation, the resulting
clusters need to be visualized properly within the Parallel Coordinates view. There are
two methods to visualize clusters in Parallel Coordinates: cluster coloring and cluster
bundling. After the user has performed clustering on the data set, cluster coloring assigns
a unique color to each cluster so that the clusters can be distinguished clearly.
In cluster bundling, the lines belonging to one cluster are bundled together between axes.
Cluster bundling has two parameters: smoothness and bundling strength. Since suitable
values for these vary from use case to use case, our implementation lets the user adjust
both through sliders while observing the visualization. The advantage of cluster bundling
is that it frees the color channel for another purpose (e.g., statistical coloring).
Figure 3.7: Cluster Coloring
Figure 3.8: Cluster Bundling
Alpha Blending
As discussed in the literature survey, with a big data set the point-line duality in parallel
coordinates is no longer clearly visible, so the relationships between dimensions cannot be
observed. The proposed solution is alpha blending: each polyline is plotted with only alpha
percent opacity. With smaller alpha values, areas of high line density stand out and are
better contrasted against areas of low density. It is hard to settle on a single value for
alpha, so in our implementation the user can adjust the alpha value through a slider until
the visualization provides enough insight.
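The density-enhancing effect can also be reasoned about numerically. Assuming standard "over" compositing of identical strokes (an assumption for this sketch, not a description of the renderer), n overlapping lines drawn at opacity a accumulate to a perceived opacity of 1 - (1 - a)^n, so dense regions saturate toward full opacity while sparse regions stay faint:

```python
# Accumulated opacity of n overlapping strokes, each drawn with alpha `a`,
# under standard "over" compositing (an assumption for this sketch).
def accumulated_opacity(a, n):
    return 1.0 - (1.0 - a) ** n

# With a low alpha, a region crossed by 100 lines renders far more strongly
# than one crossed by 2 lines -- the density contrast alpha blending provides.
dense = accumulated_opacity(0.05, 100)
sparse = accumulated_opacity(0.05, 2)
```

This is why the slider matters: the useful range of a depends on how many lines typically overlap in the data set at hand.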
Figure 3.9: Parallel Coordinates without Alpha Blending
Figure 3.10: Parallel Coordinates with Alpha Blending
For example, in Figure 3.9 it is not easy to tell whether there is a relationship between
the cluster ID and the ID. With the alpha blending applied in Figure 3.10, it is easy to
see that there is no actual relationship between the two.
Other techniques
There are some other techniques used in our implementation to give a better user
experience. We have added axis removal, axis flipping, keep/exclude actions and statistical
coloring of data on top of the basic implementation. If the user feels an axis does not
contribute to the visualization, it can be removed; if flipping an axis would reveal a
pattern more clearly, it can be flipped. Furthermore, if the user wants to examine only a
subset of the data set, that subset can be selected with brushing and kept for closer
examination with the keep action; the exclude action works the same way in reverse.
Figure 3.11: Parallel Coordinates with Statistical Coloring
Using statistical coloring it is easy to pick out the outliers within one dimension. Data
points in the top 2.5% of the z-score distribution are drawn in one distinct color, and
data points in the bottom 2.5% in another. In Figure 3.11 the data has been statistically
colored on the size attribute.
With the above techniques the user can interact with the visualization to identify hidden
patterns in the data set, which makes CEP rule generation easier.
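The statistical-coloring rule described above can be sketched with Pandas; the cutoff z = 1.96 corresponds to the top and bottom 2.5% tails of a normal distribution, and the color-class names here are illustrative, not the tool's actual values:

```python
import pandas as pd

# Sketch of statistical coloring: flag points whose z-score falls in the
# top or bottom 2.5% tail (|z| > 1.96 under a normal assumption).
def statistical_color(series, z=1.96):
    zscores = (series - series.mean()) / series.std()
    colors = pd.Series("default", index=series.index)
    colors[zscores > z] = "high-outlier"
    colors[zscores < -z] = "low-outlier"
    return colors
```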
3.3 Other functionalities
Parsing collected datasets
In addition to supporting simple CSV files, we provide support for visualizing and
analyzing Apache web server log files. An Apache log file consists of one string per log
entry, and each string needs to be broken down into the data fields it contains. For this
we use an open-source log parser2 which outputs a simple dictionary for each entry in the
log file. This output is collected and converted into a Pandas DataFrame object, which we
can manipulate easily. After breaking a log entry into its separate data elements using
the parser, we perform conversions such as converting the timestamp to DateTime format to
ease later manipulation. In addition, the request line in the log file is broken down into
its constituent elements: method, URL and protocol.
2Available at : https://code.google.com/p/apachelog/
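The parsing step can be sketched as follows; the regex is a minimal hand-written stand-in for the Apache common log format, not the apachelog library's actual implementation:

```python
import re
import pandas as pd

# Minimal regex for the Apache common log format (illustrative only; the
# tool uses the open-source apachelog parser instead).
LOG_RE = re.compile(
    r'(?P<remote_host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_line(line):
    entry = LOG_RE.match(line).groupdict()
    # Break the request line into its constituent elements
    entry["method"], entry["url"], entry["protocol"] = entry.pop("request").split()
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

lines = ['127.0.0.1 - - [10/Oct/2015:13:55:36 +0530] "GET /index.html HTTP/1.1" 200 2326']
df = pd.DataFrame([parse_line(l) for l in lines])
# Convert the timestamp to DateTime format to ease later manipulation
df["time"] = pd.to_datetime(df["time"], format="%d/%b/%Y:%H:%M:%S %z")
```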
We have also provided the ability to easily extend the support for additional file formats.
This can be done by defining a file extension for the new file type and providing a module
that can convert the input file in to a Pandas Dataframe.
Aggregation
The CEP query model allows defining a window: a limited subset of events from an event
stream. WSO2 CEP supports seven types of windows; in Vivarana we have implemented support
for two of them, the time window and the length window. It should be noted that these are
moving windows.
Figure 3.12: Specifying the size of either Time or Event window
The user can use one type of window at a time and perform sum, average, maximum, minimum
and count aggregation operations over the specified window. Furthermore, the user can
group by a certain attribute and then perform the aggregation operations per group.
As an example, consider a user analyzing a web server log who wants to know the total
bandwidth usage of each user over the last 15 minutes. He can first specify the 15 minutes
in the aggregation menu, then select group by from the remote host attribute's context
menu and sum from the size attribute's context menu.
We implemented the aggregation operations using the Python Pandas library. Once an
aggregation operation is performed, the result is cached, because aggregation becomes an
expensive operation as the size of the data grows. After calculating the aggregated
values, the parallel coordinates visualization is updated so that the user can interact
with it to identify useful patterns.
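The 15-minute bandwidth example can be sketched with Pandas; the column names follow the Apache log example, and the tiny inline DataFrame stands in for a parsed log with a DateTime index (the tool's actual implementation may differ):

```python
import pandas as pd

# Illustrative parsed-log fragment with a DateTime index.
df = pd.DataFrame(
    {"remote_host": ["a", "a", "b", "a"], "size": [100, 200, 50, 300]},
    index=pd.to_datetime(
        ["2015-01-01 10:00", "2015-01-01 10:10",
         "2015-01-01 10:12", "2015-01-01 10:20"]
    ),
)

# Total bandwidth per remote host over a moving 15-minute time window:
# group by host, then apply a time-offset rolling sum.
windowed = (
    df.groupby("remote_host")["size"]
      .rolling("15min")
      .sum()
)
```

The result is indexed by (remote_host, timestamp); each entry is the sum of that host's sizes within the preceding 15 minutes.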
(a) Before performing any aggregation operation
(b) After performing the minimum aggregation operation with a 15-minute time window on the Size attribute.
Figure 3.13: State of visualization after performing an aggregation operation
Clustering
Clustering is one of the major data-mining techniques we have integrated into Vivarana.
Initially we implemented support for three clustering algorithms. K-modes clustering[26]
was introduced mainly to cluster data with non-numeric content; it uses the
simple-matching distance to find the dissimilarity between two objects and partitions the
given objects into k groups. The second algorithm we implemented was fuzzy analysis
clustering[32], which uses the Euclidean distance between observations; in contrast to
k-modes clustering, fuzzy clustering produces clusters whose boundaries are not crisp. The
third method was hierarchical clustering[43], which uses Gower's coefficient[18] as the
distance measure between observations.
The major disadvantage of the aforementioned algorithms is that they all require the
number of clusters as an initial user input. Since clustering is an unsupervised learning
technique, the user might not know the optimal number of clusters beforehand. Hence, we
decided to present a dendrogram when hierarchical clustering is used. Further, we
implemented clustering in a way that lets the user identify clusters within clusters, as
depicted in Figure 3.14. It should be noted that we used R libraries to implement the
clustering support.
(a) Initial clustering using hierarchical clustering
(b) Selecting cluster number 3 and performing clustering within that cluster
(c) After performing hierarchical clustering within initial cluster number 3
Figure 3.14: Performing clustering within clusters in a web server log data set
Both hierarchical and fuzzy clustering are implemented through the cluster library[8], and
k-modes clustering through the klaR library[35].
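The simple-matching distance at the heart of k-modes is easy to illustrate. The following is a minimal, naively initialised pure-Python sketch of the algorithm; the tool itself delegates to the R klaR implementation, not this code:

```python
from collections import Counter

# Simple-matching distance: the number of attributes on which two
# categorical records disagree.
def matching_distance(a, b):
    return sum(x != y for x, y in zip(a, b))

# Minimal k-modes sketch: assign each record to its nearest mode, then
# recompute each mode as the most frequent value per attribute.
def kmodes(records, k, iterations=10):
    modes = [list(r) for r in records[:k]]   # naive initialisation
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for r in records:
            best = min(range(k), key=lambda i: matching_distance(r, modes[i]))
            clusters[best].append(r)
        for i, cluster in enumerate(clusters):
            if cluster:
                modes[i] = [Counter(col).most_common(1)[0][0]
                            for col in zip(*cluster)]
    return modes, clusters
```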
3.4 Rule Generation
As mentioned earlier, there are existing solutions that automatically generate CEP rules,
but most of them do not follow an interactive process in which the user can adjust
parameters and aid in the rule generation; instead they try to generate every aspect of
the rule automatically. Considering the interactive nature of our solution, we used a
simple method based on recursive partitioning and regression trees (CART)[38] to generate
CEP rules while taking into account the inputs the user has provided through the
interactive visualization.
Classification and Regression Trees(CART)
Classification and Regression Trees, also known as CART trees, are a popular and widely
used classification method. The method consists of training a classification model that
partitions a data set into categories based on conditional decision rules. For example,
consider the CART tree in Figure 3.15. It shows a classification tree built on the Iris
flower data set, which features various attributes of flowers and their species. The
decision tree tries to classify the flowers into their species based on those attributes.
By starting at the root and traversing the tree according to the parameters of a data
entry, we can get a prediction for the class it belongs to.
Figure 3.15: A decision tree to classify the Iris data set.
We used the 'rpart' package available with R to generate decision trees that classify the
set of events selected by the user against the other events in the data set. That is, we
create two categories of events based on whether or not the user marked them as important.
To create a CART tree, the training data is recursively partitioned into subsets based on
a single-parameter condition at each partition. The parameter is selected using an
impurity index; with the 'rpart' package we use the Gini impurity index, which reaches
zero when only one class is present in a partition. At each partition the Gini index is
calculated for all possible splits to identify the split that yields partitions with the
least impurity. This recursive partitioning continues until no new partition of the
specified minimum size can be generated, or all partitions contain events belonging to a
single category.
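The impurity measure can be illustrated with a small generic function (a sketch of the formula, not rpart's internal code): for class proportions p_i, the Gini index of a partition is 1 - Σ p_i².

```python
# Gini impurity: 1 minus the sum of squared class proportions. Zero when
# the partition is pure (one class), maximal for an even class mix.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))
```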
Left unchecked, this leads to trees that overfit the training data set and reduce the
value of the classification. To prevent overfitting and produce better decision trees, the
tree generated in the previous step is pruned to limit the number of splits. The CART
implementation in the rpart library handles this with a complexity parameter used together
with cross-validation to produce the best possible tree: every split that does not improve
the classification by at least the amount specified by the complexity parameter is removed
from the tree. The complexity of the final tree can thus be controlled through this
parameter.
Figure 3.16: A decision tree to classify the Iris data set. Paths to follow to get to a Virginica flower are highlighted.
Generating rules based on the decision tree
After the rpart library generates the tree, we traverse it to generate rules that filter
out the positive events. The path to each relevant leaf node is identified, and the paths
are then merged to produce the final rule: the disjunction of the rules generated by each
path.
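The traversal can be sketched as follows. The nested-dict tree representation here is our own simplification for illustration, not the structure rpart actually returns; for each leaf of the target class we collect the (variable, operator, threshold) conditions along the root-to-leaf path:

```python
# Collect root-to-leaf condition paths ending in the target class; the
# final rule is the disjunction (OR) of these paths.
def leaf_paths(node, target, path=()):
    if "leaf" in node:
        return [list(path)] if node["leaf"] == target else []
    left = leaf_paths(node["left"], target,
                      path + ((node["var"], "<", node["split"]),))
    right = leaf_paths(node["right"], target,
                       path + ((node["var"], ">=", node["split"]),))
    return left + right
```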
For example, consider the classification in Figure 3.16. If we want to select flowers
belonging to 'Virginica', the paths we can follow are highlighted in the figure. By
traversing the tree we can conclude that for a flower to belong to 'Virginica' it either
needs a 'Petal Length' between 2.5 and 4.9, or a 'Petal Length' greater than 4.9 together
with a 'Petal Width' less than 1.6. The rule we can generate to filter out 'Virginica'
flowers from the rest then becomes:
IF('Petal Length' >= 2.5 AND 'Petal Length' < 4.9)
OR
IF('Petal Length' >= 2.5 AND 'Petal Length' > 4.9 AND 'Petal Width' < 1.6)
=> OUTPUT 'Virginica'
So a rule can be generated for each leaf node by traversing the tree from the root to the
leaf and combining the constraints at each split. One issue with the rpart library is that
it only allows binary splits, so the same variable may appear in the splitting criterion
at multiple levels of the classification tree. To generate better and more user-friendly
CEP queries, it is therefore desirable to merge all such partial constraints into a single
constraint.
Consider again the classification in Figure 3.16. The second path to a 'Virginica' leaf
node gives the following list of conditions:
'Petal Length' >= 2.5
'Petal Length' >= 4.9
'Petal Width' < 1.6
The two conditions on Petal Length can be merged into the single condition
'Petal Length' >= 4.9. We perform this type of merge on every set of conditions obtained
by following the paths to the leaf nodes of interest: our system goes through each
constraint list and merges together all the constraints on a single variable. After
merging we obtain a better, more user-friendly rule:
IF(’Petal Length’ >= 2.5 AND ’Petal Length’ < 4.9)
OR
IF(’Petal Length’ > 4.9 AND ’Petal Width’ < 1.6)
=> OUTPUT 'Virginica'
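The merging step can be sketched as keeping, per variable, only the tightest lower and upper bound; the (variable, operator, value) condition format is illustrative:

```python
# Merge constraints on the same variable: keep the largest ">=" bound and
# the smallest "<" bound, so "x >= 2.5 AND x >= 4.9" collapses to "x >= 4.9".
def merge_constraints(conds):
    lower, upper = {}, {}
    for var, op, val in conds:
        if op == ">=":
            lower[var] = max(lower.get(var, float("-inf")), val)
        elif op == "<":
            upper[var] = min(upper.get(var, float("inf")), val)
    merged = []
    for var in sorted(set(lower) | set(upper)):
        if var in lower:
            merged.append((var, ">=", lower[var]))
        if var in upper:
            merged.append((var, "<", upper[var]))
    return merged
```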
In addition to the results derived from the classification tree, we need to consider the
interactions performed by the user. As mentioned in an earlier section of this report, our
tool lets the user apply aggregate functions that introduce new parameters into the data
set, possibly depending on windows the user has defined. These values are not available in
the data stream that the final rule will run on, so they need to be calculated by the CEP
engine, which requires the necessary syntax to be included in the final rule. To support
this functionality we keep track of and store the interactions performed by the user, such
as the time windows applied and the new data attributes created.
After the conditions are derived from the classification tree, we check each parameter to
see whether it was created by the user or was available in the original data stream. If
the user created the parameter, we augment the CEP query with the syntax for the aggregate
function that created it.
The final step in the rule generation process is the translation of these constraints into
the CEP query format. We combine the constraints from the classification tree with the
information needed to generate the user-created functions, along with the time windows, to
produce a CEP query. Our current implementation supports generating queries in the Siddhi
query language. To extend the application to support another query language, two syntax
elements need to be provided.
(a) User selects the type of events he wants to detect
(b) CEP query is generated for his selection and false positives are highlighted.
Figure 3.17: Rule generation process
1. The syntax used to perform aggregate functions. We currently support sum, average and
count as aggregate functions, so there must be a mapping to the rule syntax for these
functions that lets the CEP engine run them in real time to generate the values the rule
needs.
2. A translation from generic constraints to the query syntax required by the user. The
constraints on the parameters are generated in a generic format (Equality, LessThan,
MoreThan, etc.), so a mapping from these generic constraints to a specific CEP query
format yields a legitimate CEP query.
Providing the application with these two elements easily extends the language support.
After a rule is generated, we need to assess its quality. We provide the user with two
ways of examining the quality of a rule.
1. Applying the rule to the data set and comparing the rule's classification with the
user's selection to calculate the accuracy and precision of the rule. We apply the rule to
the Pandas DataFrame and generate a confusion matrix between the events selected by the
user and the events filtered out by the generated rule. The user can use these measures to
check whether the rule has the qualities he/she needs. For example, if the user wants the
matched events to be exactly those selected in the tool, the rule needs a high precision.
2. The rule is also applied in the visualization to highlight the false positives and
false negatives, showing how filtering through the rule differs from the user's intention.
Along with the rule, we highlight the events the rule selected in the visualization, with
false positives colored red for clarity.
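The first quality check can be sketched as deriving precision and recall from the user's selection and the rule's matches (a generic sketch, not the tool's exact code):

```python
# Compare the user's selection with the events the generated rule matches
# and derive precision and recall from the confusion-matrix counts.
def rule_quality(selected, matched):
    tp = sum(s and m for s, m in zip(selected, matched))           # true positives
    fp = sum((not s) and m for s, m in zip(selected, matched))     # false positives
    fn = sum(s and (not m) for s, m in zip(selected, matched))     # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```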
3.5 Other Approaches Attempted
Sunburst Visualization
The primary visualization method described above generates CEP rules that classify and
detect events satisfying specific criteria. However, it does not concern itself with event
patterns that occur in a sequence.
For Example,
from every (a1 = infoStock[action == "buy"]
-> a2 = confirmOrder[command == "OK"] )
-> b1 = StockExchangeStream [price > infoStock.price]
within 3000
select a1.action as action, b1.price as price
insert into StockQuote
This is an example pattern-matching query in the "Siddhi" language specification, which is
used in the "Siddhi" complex event processing engine (WSO2)[45]. To create these kinds of
CEP rules we first had to look into methods of visualizing sequences.
Patterns and Sequences
Patterns In the scenario described above, the specified query is a pattern-matching query.
It matches events that occur in a pattern within a constrained amount of time (3000
milliseconds). This means there may be other, unrelated events between the events
specified in the pattern.
Sequences In a sequence, all events must occur in the specified order with no unrelated
events in between.
For example,
A ⇒ A ⇒ B ⇒ B ⇒ B ⇒ C
This sequence of events follows the A, B, C sequence.
Sunburst Partition
After looking into existing visualizations, we picked the sunburst plot [56, 55] for
visualizing patterns and sequences. A sunburst is a visualization resembling a multilevel
pie chart that can display hierarchical information. It consists of concentric circles of
varying radii: the circle in the center represents the root node, and lower levels of the
hierarchy are represented by circles further from the center. Each circle is segmented by
radial lines; each segment represents a node in the hierarchy, and child nodes are drawn
within the angle occupied by their parent node [66].
Visualizing sequences in Sunburst
For example, take the sequences:
1. A → B → C → D → E
2. B → C → D
3. A → C → B → D
At the first level, sequences 1 and 3 share the same first element A, while sequence 2
begins with element B. This can be shown in a tree structure as in Diagram 3.1.
Root
├── A
└── B

Diagram 3.1: sequence prefixes example

In this manner the above sequences can be shown in a tree structure in which the longest
common prefixes are shared by several sequences. So the above three sequences can be
represented as the tree in Diagram 3.2:

Root
├── A
│   ├── B ── C ── D ── E
│   └── C ── B ── D
└── B ── C ── D

Diagram 3.2: sequence tree
This tree structure can be directly represented in the sunburst visualization: the angle
of a segment represents how many sequences share that common prefix, and the color of the
segment represents the type of the element (A, B, C).
Why use Sunburst
We found several research studies evaluating space-filling visualizations, such as the
sunburst, that display hierarchical information. These studies concluded that the sunburst
is the visualization method that uses space most effectively to display such information
while remaining intuitive to the user [66, 57, 33]. Further research shows the sunburst
being used to visualize frequent patterns using the same approach we described earlier
[34]. An additional motivating factor was existing work that used a sunburst diagram to
visualize summarized user navigation paths through a web site [48, 52], displayed below in
Figure 3.18. We used a modified version of this implementation as groundwork for our
project. The limitations of that visualization and the improvements made in our project
are described below.
Figure 3.18: Sunburst Visualization which we used as the foundation for our project
We used the visualization depicted in Figure 3.18, originally used to display user
navigation paths through a web site, as the basic foundation for our project. This stock
visualization had several limitations:
1. It needed to load data from a CSV in a specific format. Since our project involves
loading various kinds of data files, we needed a method to load data of arbitrary formats
into this visualization.
2. It could only draw sequences of limited length, and the text value of each element
needed to be short to display clearly. For example, we could not show long element names
such as "\ta \online \query \query \string" in the breadcrumb trail; only short names like
"product" could be shown without error.
3. It could only contain a limited number of unique elements (name value types). With many
unique elements (each shown in a different color), many elements would look the same and
confuse the user.
4. The user could not perform many operations on the visualization (no drill-down
operations, no zooming). Simply put, this stock visualization could only display a limited
amount of data on a static web page.
To improve upon this visualization, we first had to develop a method to process the data
taken from the data files into a format usable by the visualization, described in the
following section.
Data Processing
From the preprocessing page of our application, the user first moves to the sunburst tab
and selects a grouping column and a grouped column to be used in the data processing
stage, as shown in Figure 3.19.
Figure 3.19: Select grouping and grouped columns
These attributes are loaded from the column names of the uploaded file. Specifying them in
this manner causes the data file to be grouped by the grouping attribute (like an SQL
GROUP BY operation).
The data elements in the data file are first converted into a Python Pandas DataFrame
object for processing, automatically in the backend of the application. Group-by
operations are then performed on the DataFrame to create a database of sequences.
Why use Python Pandas
Python Pandas is a high-performance data analysis and data preparation toolkit developed
for the Python platform. It provides many easy-to-use, high-performance data structures,
data analysis functions and tools that make data analysis tasks easier, and it saves the
user from moving to a domain-specific language such as R to execute these kinds of tasks.
Python is generally preferred for data analytics because of the wide range of data
analysis tools available in the Python ecosystem; it is therefore the preferred
general-purpose programming language for many scientific and research projects involving
data analytics. These were our main motivations for developing the core application in
Python.
Creating the Sequence Database A DataFrame built from a multicolumn CSV looks as shown in
figure 3.20.
Figure 3.20: Data file representation in Python DataFrame
The Pandas group-by operator splits the DataFrame into multiple groups based on a given
criterion; since we have specified a group-by attribute, it splits the DataFrame based on
the unique values of that column. The DataFrame.groupby command and its inputs are shown
below:
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True,
group_keys=True, squeeze=False)
After the grouping operation, Pandas creates a DataFrameGroupBy object consisting of
multiple DataFrames, each holding the rows of the original DataFrame that share the same
value for the grouped attribute. In this example we have grouped by the "Remote host"
attribute, so each split DataFrame holds all the requests made by a particular remote
host.
Figure 3.21: Python DataFrame after Group By operation
Since the representation shown in figure 3.21 is not suitable for further data analysis
operations, the data is processed further as follows:
• Each row of these grouped DataFrames is converted into a Python "dict", and all the
"dicts" are stored in a Python "list":
[{col_name1: value1, col_name2: value2, col_name3: value3},
{col_name1: value1, col_name2: value2, col_name3: value3}]
• Since all these log events occurred in a sequence, a "sequence number" attribute is
added to each of the "dicts".
Figure 3.22: Sequence Database Final Representation
The final "sequence database" structure looks as shown in figure 3.22.
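The steps above can be sketched with Pandas; the column names follow the web-log example, and the tiny inline DataFrame stands in for a parsed log:

```python
import pandas as pd

# Illustrative parsed-log fragment.
df = pd.DataFrame({
    "remote_host": ["h1", "h2", "h1"],
    "url": ["/a", "/x", "/b"],
})

# Group by remote host, turn each group into a list of dicts, and add a
# running sequence number to each dict.
sequence_db = {}
for host, group in df.groupby("remote_host", sort=False):
    events = group.to_dict("records")
    for seq_no, event in enumerate(events):
        event["seq_no"] = seq_no
    sequence_db[host] = events
```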
To use this data as input to the sunburst diagram, the original sunburst implementation
requires sequences to be in the format:
event1 'separator' event2 'separator' event3 'separator', number_of_times_
this_sequence_occurred
The 'separator' is used to separate the events. After modifying the original sunburst
implementation, we added the capability for it to accept a two-element JSON array
consisting of a sequence and the number of occurrences of that sequence in the database,
and to visualize those sequences.
To convert the data into this format, further operations were needed on the sequence
database. Since the sunburst diagram can show only one attribute of the data, the
sequences in the sequence database need to be stripped of unnecessary attributes. The
attribute to keep is decided by the grouped-column value the user specified in Figure
3.19. In this example the user has specified "URL", so Pandas operations are run on the
sequence database DataFrame to keep only the "URL" attribute of each sequence element. The
sequence elements are then joined into a string using the 'separator' mentioned above.
The stripped sequence database now looks as shown in figure 3.23. Note that in this
representation we used "|-|" as the separator, to prevent confusion with the elements of
the sequences.
Figure 3.23: Sequence Database after being stripped of unnecessary event attributes
Counting the occurrences of unique sequences Each row of the stripped sequence database
shown in figure 3.23 represents one "Remote host" and the sequence of URLs requested by
that host. There is a high probability that the same requests were made by some other user
in the same sequence of steps. The number of users that requested the same sequence of
URLs is the number-of-times-this-sequence-occurred value needed as input to the sunburst
visualization. This is computed with the value_counts() operator in Pandas.
The representation of the sequences now looks as shown in figure 3.24.
Figure 3.24: Counting the number of unique sequences using value_counts()
This data is then given to the sunburst as a two-element array consisting of the URL
sequence and the frequency of that sequence.
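The counting step can be sketched as follows; each host's stripped URL sequence has already been joined with the "|-|" separator, and the example sequences are illustrative:

```python
import pandas as pd

# Each entry is one host's URL sequence joined with the "|-|" separator.
sequences = pd.Series(["/a|-|/b", "/a|-|/b", "/x"])

# value_counts() gives how many hosts produced each identical sequence.
counts = sequences.value_counts()

# Two-element [sequence, frequency] pairs, the input format the modified
# sunburst accepts.
sunburst_input = [[seq, int(n)] for seq, n in counts.items()]
```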
Improvements to the Sunburst Diagram
As specified above, the sunburst display we used as a base had a number of limitations.
This section describes how those limitations were overcome.
1. Load data from CSV -
This limitation was overcome by loading data formatted through the steps described above
via an Ajax call to the back end. The original sunburst implementation was modified to
accept a JSON array object as input to visualize the data.
2. Each name value of an event (sequence element) needed to be of limited length, and each
sequence needed to be of limited length, to display correctly -
To overcome this limitation we added functionality to clip the value of the sequence
element in the breadcrumb trail, together with tooltips that display the full value when
the mouse cursor points at the element. We also added the capability to show sequences of
arbitrary length in the visualization without affecting coherence.
3. Shows only a limited number of unique elements in a limited number of colors -
We improved the visualization to show many unique elements in distinct colors by
developing a color function that assigns an almost unique color to each unique element
(colors are guaranteed to be unique if the number of unique values is below 121). The
original 6 hard-coded color values were extended to 121. These colors were selected so
that most browsers can display them and so that they match the contrast of the overall
visualization; for this purpose, certain light colors were removed from the admissible
range of the 140 HTML5 and CSS3 color names [25].
(Note: in the diagram displayed in figure 3.25, different name value types have been
assigned the same color because of the high number of unique name value types, in this
case 700, while the CSS3 and HTML5 specifications contain only 140 web-safe colors.)
4. Limitation of drill-down and zoom functionality -
When sequences are of arbitrary length and have many unique elements, the visualization
tends to look cluttered and small elements become incoherent. So we added functionality to
drill down into sequences in a separate visualization beside the main one. When the user
clicks on an element, the sequence from that element onward is visualized in the second
display. User operations on the second display (moving the mouse over elements) are also
referenced in the main visualization, so that the user keeps a coherent and intuitive view
of the whole visualization. We also added pan and zoom functionality to the main display
so that the user can inspect hard-to-see elements by zooming in and panning.
Note: in the visualization shown in figure 3.26, the children of the "online\images\f"
element are depicted in the second diagram, demonstrating the drill-down functionality;
moving the mouse pointer over elements in the second diagram highlights the referenced
elements in the main diagram. Figure 3.27 shows how hard-to-see areas can be zoomed and
panned in the main diagram.
Sequential Pattern Mining (Pattern Search)
To use the above visualization for our CEP rule generation use case, patterns need to be
searched for in the sequence database.
Expected use case The user specifies event name values, or a regex pattern that matches
the name values, as a pattern:
A, B, \C[A-Z]
The pattern search engine returns the sequences that contain these events occurring in the
order specified by the user and visualizes them in the display. It also returns the length
of the time window within which this sequence occurs in the sequence database in the
average case. For example:
A, B, ˆ\[C[A-Z]
Average sequence length 19, min = 3, max = 32
Average time window 1300 milliseconds, min = 50 milliseconds, max = 2500 milliseconds
To generate these parameters, the pattern search engine utilizes the timestamp and
sequence-number attributes of each event in the sequence database. These parameters can be
used to develop a CEP rule that matches and detects this sequence in a data stream.
Pattern Search Methodology
There is much existing work on sequential pattern mining, but it involves finding all the
sequential patterns in a sequence database given a support value. In our use case we need
to search for one specific pattern in the database. For that purpose we used a modified
approach based on the prefix-projected pattern search utilized in the PrefixSpan algorithm
[47]. In this approach, each element of the pattern projects the sequences in the
database, discarding unmatched prefixes and creating a new database of sequences. In the
long run this is very efficient, because the size of the sequence database shrinks with
each successive iteration.
An example sequence database of four sequences is shown below:
1. A,B,C,D,A,E
2. B,C,A,E
3. C,A,C,B,D
4. A,B,C,D,C,E
If we search for the pattern A,B,C in the given sequence database, in the first iteration the sequence database is projected using the first element of the pattern, A. This results in the following database of sequences:
1. B,C,D,A,E
2. E
3. C,B,D
4. B,C,D,C,E
Notice that the length of every sequence has been reduced. If these sequences are projected again using the second element of the search pattern (B), we obtain:
1. C,D,A,E
2. D
3. C,D,C,E
Note that the projection has caused one sequence to disappear from the database altogether (sequence 2, whose projected remainder E did not contain a B).
Using this approach we can quickly obtain the sequences that contain a specified pattern, and then, by scanning those sequences, calculate the time-window and length-window parameters of the pattern as it occurs in the sequence database.
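The projection steps above can be sketched in Python as follows (a minimal reconstruction for illustration; the function and variable names are ours, not the tool's):

```python
def find_pattern(database, pattern):
    """Return the indices of sequences that contain `pattern` as an ordered
    subsequence, by repeatedly projecting on each pattern element
    (PrefixSpan-style prefix projection)."""
    # Track (original index, remaining suffix) pairs through each projection.
    live = list(enumerate(database))
    for event in pattern:
        # Keep only sequences containing the event, and discard the prefix
        # up to and including its first occurrence.
        live = [(i, seq[seq.index(event) + 1:]) for i, seq in live if event in seq]
    return [i for i, _ in live]

# The four-sequence example database from the text.
database = [
    ["A", "B", "C", "D", "A", "E"],
    ["B", "C", "A", "E"],
    ["C", "A", "C", "B", "D"],
    ["A", "B", "C", "D", "C", "E"],
]
print(find_pattern(database, ["A", "B", "C"]))  # [0, 3]: sequences 1 and 4 match
```

A real implementation would also carry each event's timestamp and sequence number through the projections, so that the matched sequences can be scanned for the window statistics described earlier.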
Exclusion of this Component from the Project
Although this component adds further functionality to our project, we judged that the impact of the above use case was low compared to the overall usability of the project. Furthermore, developing this functionality would have required building the rule generation and precision calculation methods from scratch, because the components developed for the main Parallel Coordinates visualization are not reusable here.
Developing these methods would have taken the project beyond its scope and schedule. Therefore, given the limited time available, we decided to drop further development of this module in favor of improving the existing main functionality of the project.
Figure 3.25: Improved sunburst visualization
Figure 3.26: Sunburst with drill down capability
Figure 3.27: Sunburst zoom and pan capability
Chapter 4
Discussion
Using our visualization and rule generation methodology we were able to produce good results in simple use cases. This methodology can help users identify trends and information in data streams that would otherwise go unnoticed, and generate CEP queries to identify similar events with the click of a button, which makes the whole process simpler and easier. We tested our tool by generating rules for events occurring in web logs and in other generic data types, and we were able to produce usable CEP queries for most of our needs.
The biggest challenge we faced in the implementation of our project was visualizing, and generating rules for, sequential patterns. While Parallel Coordinates is an excellent visualization method for multidimensional data, it does not translate well to displaying sequences, so the user does not get a chance to identify sequential patterns in the data using our visualization. We tried to address this by introducing the sunburst visualization, which displays sequences based on their frequency of occurrence and allows the user to select a sequence from it. But we felt that this addressed only a single use case of sequence detection and was not useful in a generic environment.
In addition, implementing this would have required us to completely overhaul the rule generation mechanism: the current implementation, which uses decision trees, does not support sequence matching and therefore cannot be used for pattern detection. The methodology presented in iCEP [41] is better suited for generating these types of rules, but given its complexity and its disregard for user interaction, we decided we were better off focusing on the simpler non-sequential rule generation.
Other issues we faced during the project include:
• Displaying labels for String type data
When displaying categorical variables that contain a large number of unique labels, the display tends to become dense and the information hard to distinguish. This issue needs to be handled better for data items such as names, which are unique among data items. Currently, as we display the data table alongside the visualization, users can look these values up in the table, but we need a better method of displaying this type of data.
• Displaying data with long labels
Another problem with the visualization is long labels that cannot be displayed in full. Displaying the entire data label obstructs other aspects of the visualization and causes it to lose its value. While we can easily shorten the labels and display only partial values, this leads to loss of information for data values such as URLs, which are commonly very long.
• Loss of interactivity when displaying large data sets
As the visualization runs in a web browser, its performance depends heavily on the resources available on the machine the user is viewing it on. This results in the visualization losing interactivity and becoming slow when displaying larger data sets; when the visualized data set reaches hundreds of thousands of rows, the whole visualization becomes unusable. One solution to this problem is to avoid visualizing such large data sets by reducing the data set to a much smaller size through methods such as sampling.
• Different data formats
As the tool currently stands, we support the .csv format and Apache web log files. To support other formats and data types, we need to add parsers that convert those data types into Pandas DataFrames.
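As an illustration of what such a parser might look like, the sketch below converts Apache common-log lines into a Pandas DataFrame. The regular expression and column names are hypothetical, chosen for this example rather than taken from Vivarana's actual implementation:

```python
import re
import pandas as pd

# Illustrative pattern for the Apache "common" log format.
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_apache_log(lines):
    """Parse an iterable of log lines into a DataFrame, skipping
    lines that do not match the expected format."""
    rows = [m.groupdict() for m in map(LOG_RE.match, lines) if m]
    df = pd.DataFrame(rows)
    df["status"] = df["status"].astype(int)
    return df

lines = ['127.0.0.1 - - [10/Oct/2015:13:55:36 +0000] '
         '"GET /index.html HTTP/1.1" 200 2326']
print(parse_apache_log(lines))
```

Any new format would need only a similar parser function producing a DataFrame; the rest of the pipeline could then remain unchanged.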
(a) Large numbers of unique labels make them indistinguishable
(b) Long labels block details below them
Figure 4.1: Challenges faced in the visualization
Chapter 5
Conclusion and Future Work
We started this project with the task of creating a better method for analyzing large data streams and acting on them through CEP engines, and we feel that we have taken a successful first step in this direction. With our tool, users can examine a data set and look for patterns and trends without being experts in data mining, and write CEP queries to identify the events they deem important with the click of a button.
To move towards a completely automated process, further functionality is needed. We have identified several improvements that would make the project more usable and provide more functionality to the user:
• Handling large amounts of data
We need to optimize the tool so that it can handle more data in a single visualization without compromising interactivity. This can be done by optimizing the visualization we created. But even if all the data could be displayed without interactivity issues, we would reach a point where the large number of data points makes the display too dense for visual analysis. The method our tool currently provides to avoid this is reducing the size of the data through sampling: the user specifies a sample size, and the sample is filled with randomly selected items from the data set. While this reduces the data set to a manageable size, random sampling may lead to loss of valuable information. A better approach would be to scan the data for anomalies and other important events and display those clearly, but identifying ’important’ data subsets is a complex issue that needs more research.
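The sampling step described above can be as simple as the following sketch (the function name, default sample size, and seed are illustrative, not the tool's actual values):

```python
import pandas as pd

def downsample(df, max_rows=5000, seed=0):
    """Randomly sample a DataFrame down to at most max_rows rows,
    so the parallel-coordinates display stays interactive.
    A fixed seed keeps the sample reproducible across reloads."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)

df = pd.DataFrame({"x": range(100000)})
print(len(downsample(df)))  # 5000
```

An anomaly-aware reducer would replace the uniform `df.sample` call with a scoring pass that keeps outlying rows preferentially, which is the open research direction noted above.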
• Handling sequential patterns
This is one of the most important paths that needs further exploration to make this tool valuable. CEPs are very commonly used to look for distinct patterns in the data, so we need to be able to handle that aspect of rule generation. This is a very complex task, as the sequences required vary widely from use case to use case: for a store application we may need to track purchase sequences, while for a banking application the sequences may be sequences of purchases grouped by their relative size. Writing a visualization that can create and show these types of patterns therefore becomes increasingly complex, and this aspect of our tool needs further improvement.
• Allow the user to change parameters in the generated rule
Allowing the user to change parameters in the generated rule can help the user get exactly what he or she wants: we can let the user adjust the parameters of the rule and show how the changes affect its application and how the result differs from the originally generated rule. This can help produce a better final result.
• Improve the data mining aspect of the application
We need to provide more functionality and support more use cases. We currently provide cluster analysis and anomaly detection, among a few other data mining tools, but the user still has to depend on visual analysis of the data to find its important factors. We can improve this by supporting more manipulation operations, such as search functionality, and additional mining operations.
We believe that with these improvements and further work we can create a tool that makes using CEP engines to process real-life data streams simpler and more enjoyable, and that allows people who are not experts in data mining and Complex Event Processing to use these valuable tools with ease.
Bibliography
[1] Gennady Andrienko and Natalia Andrienko. “Blending aggregation and selection:
Adapting parallel coordinates for the visualization of large datasets”. In: The Car-
tographic Journal 42.1 (2005), pp. 49–60.
[2] Almir Olivette Artero, Maria Cristina Ferreira de Oliveira, and Haim Levkowitz. “En-
hanced high dimensional data visualization through dimension reduction and attribute
arrangement”. In: Information Visualization, 2006. IV 2006. Tenth International Con-
ference on. IEEE. 2006, pp. 707–712.
[3] Daniel Asimov. “The grand tour: a tool for viewing multidimensional data”. In: SIAM
journal on scientific and statistical computing 6.1 (1985), pp. 128–143.
[4] Stefan Axelsson. Intrusion detection systems: A survey and taxonomy. Tech. rep. Tech-
nical report, 2000.
[5] Eric A Bier et al. “Toolglass and magic lenses: the see-through interface”. In: Proceed-
ings of the 20th annual conference on Computer graphics and interactive techniques.
ACM. 1993, pp. 73–80.
[6] Markus M Breunig et al. “LOF: identifying density-based local outliers”. In: ACM
sigmod record. Vol. 29. 2. ACM. 2000, pp. 93–104.
[7] Krysia Broda et al. SAGE: a logical agent-based environment monitoring and control
system. Springer, 2009, pp. 112–117.
[8] cluster: Cluster Analysis Extended Rousseeuw et al. http://cran.r-project.org/
web/packages/cluster/index.html/. [Online; accessed 03-February-2015]. 2015.
[9] Dianne Cook et al. “Grand tours, projection pursuit guided tours, and manual con-
trols”. In: Handbook of data visualization. Springer, 2008, pp. 295–314.
[10] Data-Ink Ratio. http://www.infovis-wiki.net/index.php/Data-Ink_Ratio. [Online; accessed 03-February-2015]. 2015.
[11] Alan Demers et al. “Towards expressive publish/subscribe systems”. In: Advances in
Database Technology-EDBT 2006. Springer, 2006, pp. 627–644.
[12] Django overview, Django. https://www.djangoproject.com/start/overview/.
[Online; accessed 03-February-2015]. 2015.
[13] Niklas Elmqvist, Pierre Dragicevic, and Jean-Daniel Fekete. “Rolling the dice: Multi-
dimensional visual exploration using scatterplot matrix navigation”. In: Visualization
and Computer Graphics, IEEE Transactions on 14.6 (2008), pp. 1539–1148.
[14] Ronald Aylmer Fisher. “The design of experiments.” In: (1935).
[15] M Forina et al. “Classification of olive oils from their fatty acid composition”. In: Food
research and data analysis: proceedings from the IUFoST Symposium, September 20-
23, 1982, Oslo, Norway/edited by H. Martens and H. Russwurm, Jr. London: Applied
Science Publishers, 1983. 1983, pp. 189–214.
[16] Michael Friendly. “A brief history of the mosaic display”. In: Journal of Computational
and Graphical Statistics 11.1 (2002).
[17] Ying-Huey Fua, Matthew O Ward, and Elke A Rundensteiner. “Hierarchical parallel
coordinates for exploration of large datasets”. In: Proceedings of the conference on
Visualization’99: celebrating ten years. IEEE Computer Society Press. 1999, pp. 43–
50.
[18] John C Gower. “A general coefficient of similarity and some of its properties”. In:
Biometrics (1971), pp. 857–871.
[19] John A Hartigan and Beat Kleiner. “Mosaics for contingency tables”. In: Computer
science and statistics: Proceedings of the 13th symposium on the interface. Springer.
1981, pp. 268–273.
[20] Helwig Hauser, Florian Ledermann, and Helmut Doleisch. “Angular brushing of ex-
tended parallel coordinates”. In: Information Visualization, 2002. INFOVIS 2002.
IEEE Symposium on. IEEE. 2002, pp. 127–130.
[21] Julian Heinrich and Daniel Weiskopf. “State of the art of parallel coordinates”. In:
STAR Proceedings of Eurographics 2013 (2013), pp. 95–116.
[22] Julian Heinrich et al. “Evaluation of a bundling technique for parallel coordinates”.
In: arXiv preprint arXiv:1109.6073 (2011).
[23] Patrick Hoffman et al. “DNA visual and analytic data mining”. In: Visualization’97.,
Proceedings. IEEE. 1997, pp. 437–441.
[24] Heike Hofmann. “Mosaic plots and their variants”. In: Handbook of data visualization.
Springer, 2008, pp. 617–642.
[25] HTML Color Names. http://www.w3schools.com/html/html_colornames.asp.
[Online; accessed 03-February-2015]. 2015.
[26] Zhexue Huang. “A Fast Clustering Algorithm to Cluster Very Large Categorical Data
Sets in Data Mining.” In: DMKD. Citeseer. 1997.
[27] J-F Im, Michael J McGuffin, and Rock Leung. “GPLOM: the generalized plot matrix
for visualizing multidimensional multivariate data”. In: Visualization and Computer
Graphics, IEEE Transactions on 19.12 (2013), pp. 2606–2614.
[28] Alfred Inselberg and Bernard Dimsdale. Parallel coordinates for visualizing multi-
dimensional geometry. Springer, 1987.
[29] Ian Jolliffe. Principal component analysis. Wiley Online Library, 2002.
[30] Rudolph Emil Kalman. “A new approach to linear filtering and prediction problems”.
In: Journal of Fluids Engineering 82.1 (1960), pp. 35–45.
[31] Samuel Kaski and Teuvo Kohonen. “Exploratory data analysis by the self-organizing
map: Structures of welfare and poverty in the world”. In: Neural networks in financial
engineering. Proceedings of the third international conference on neural networks in
the capital markets. Citeseer. 1996.
[32] Leonard Kaufman and Peter J Rousseeuw. Finding groups in data: an introduction to
cluster analysis. Vol. 344. John Wiley & Sons, 2009.
[33] Daniel A Keim. “Information visualization and visual data mining”. In: Visualization
and Computer Graphics, IEEE Transactions on 8.1 (2002), pp. 1–8.
[34] Daniel A Keim, Jorn Schneidewind, and Mike Sips. “Fp-viz: Visual frequent pattern
mining”. In: (2005).
[35] klaR: Classification and visualization. http://cran.r-project.org/web/packages/
klaR/index.html. [Online; accessed 03-February-2015]. 2015.
[36] Edwin M Knorr, Raymond T Ng, and Vladimir Tucakov. “Distance-based outliers: algorithms and applications”. In: The VLDB Journal: The International Journal on Very Large Data Bases 8.3-4 (2000), pp. 237–253.
[37] Lie Factor. http://www.infovis-wiki.net/index.php?title=Lie_Factor. [Online;
accessed 03-February-2015]. 2015.
[38] Wei-Yin Loh. “Classification and regression trees”. In: Wiley Interdisciplinary Reviews:
Data Mining and Knowledge Discovery 1.1 (2011), pp. 14–23. issn: 1942-4795. doi:
10.1002/widm.8. url: http://dx.doi.org/10.1002/widm.8.
[39] Liang Fu Lu, Mao Lin Huang, and Tze-Haw Huang. “A new axes re-ordering method in
parallel coordinates visualization”. In: Machine Learning and Applications (ICMLA),
2012 11th International Conference On. Vol. 2. IEEE. 2012, pp. 252–257.
[40] Yuan Luo et al. “Cluster Visualization in Parallel Coordinates Using Curve Bundles”.
In: Visualization and Computer Graphics, IEEE Transactions on 20 (2008).
[41] Alessandro Margara, Gianpaolo Cugola, and Giordano Tamburrelli. “Learning from
the past: automated rule generation for complex event processing”. In: Proceedings
of the 8th ACM International Conference on Distributed Event-Based Systems. ACM.
2014, pp. 47–58.
[42] Allen R Martin and Matthew O Ward. “High dimensional brushing for interactive
exploration of multivariate data”. In: Proceedings of the 6th Conference on Visualiza-
tion’95. IEEE Computer Society. 1995, p. 271.
[43] Fionn Murtagh and A Heck. “Multivariate data analysis with Fortran, C and Java
code”. In: Northern Ireland: Queen University Belfast, Astronomical Observatory Stras-
bourg (2000), p. 272.
[44] Christopher Mutschler and Michael Philippsen. “Learning event detection rules with
noise hidden markov models”. In: Adaptive Hardware and Systems (AHS), 2012 NASA/ESA
Conference on. IEEE. 2012, pp. 159–166.
[45] Patterns. https://docs.wso2.com/display/CEP310/Patterns. [Online; accessed
03-February-2015]. 2015.
[46] Karl Pearson. “Note on regression and inheritance in the case of two parents”. In:
Proceedings of the Royal Society of London 58.347-352 (1895), pp. 240–242.
[47] Jian Pei et al. “PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth”. In: Proceedings of the 17th International Conference on Data Engineering (ICDE). IEEE Computer Society. 2001, pp. 215–224.
[48] Kerry Rodden. “Applying a Sunburst Visualization to Summarize User Navigation
Sequences”. In: (2014).
[49] S Savoska and S Loskovska. “Parallel Coordinates as Tool of Exploratory Data Anal-
ysis”. In: 17th Telecommunications Forum TELFOR, Belgrade, Serbia. 2009, pp. 24–
26.
[50] Nicholas Poul Schultz-Møller, Matteo Migliavacca, and Peter Pietzuch. “Distributed
complex event processing with query rewriting”. In: Proceedings of the Third ACM
International Conference on Distributed Event-Based Systems. ACM. 2009, p. 4.
[51] Jinwook Seo and Ben Shneiderman. “A rank-by-feature framework for interactive ex-
ploration of multidimensional data”. In: Information Visualization 4.2 (2005), pp. 96–
113.
[52] Sequences Sunburst. http://bl.ocks.org/kerryrodden/7090426. [Online; accessed
03-February-2015]. 2015.
[53] Shiny. http://shiny.rstudio.com/. [Online; accessed 03-February-2015]. 2015.
[54] Anselm Spoerri. “InfoCrystal, a visual tool for information retrieval”. PhD thesis.
Massachusetts Institute of Technology, 1995.
[55] John T. Stasko. SunBurst. http://www.cc.gatech.edu/gvu/ii/sunburst/. [Online;
accessed 03-February-2015]. 2015.
[56] John Stasko and Eugene Zhang. “Focus+ context display and navigation techniques for
enhancing radial, space-filling hierarchy visualizations”. In: Information Visualization,
2000. InfoVis 2000. IEEE Symposium on. IEEE. 2000, pp. 57–65.
[57] John Stasko et al. “An evaluation of space-filling information visualizations for depict-
ing hierarchical structures”. In: International Journal of Human-Computer Studies
53.5 (2000), pp. 663–694.
[58] Randolph Stone et al. “Identification of genes correlated with early-stage bladder can-
cer progression”. In: Cancer Prevention Research 3.6 (2010), pp. 776–786.
[59] Martin Theus. “High Dimensional Data Visualizations”. In: Handbook of data visual-
ization. Springer, 2008, pp. 156–163.
[60] Martin Theus. “Parallel Coordinate Plots”. In: Handbook of data visualization. Springer,
2008, pp. 164–174.
[61] Edward R Tufte. “Small Multiples”. In: Envisioning Information. Graphics press Cheshire,
CT, 1990, pp. 67–80.
[62] Edward R Tufte and PR Graves-Morris. The visual display of quantitative information.
Vol. 2. Graphics press Cheshire, CT, 1983.
[63] Yulia Turchin, Avigdor Gal, and Segev Wasserkrug. “Tuning complex event processing
rules using the prediction-correction paradigm”. In: Proceedings of the Third ACM
International Conference on Distributed Event-Based Systems. ACM. 2009, p. 10.
[64] Shimon Ullman. The interpretation of visual motion. Massachusetts Inst of Technology
Pr, 1979.
[65] Roel Vliegen, Jarke J van Wijk, and E-J Van der Linden. “Visualizing business data
with generalized treemaps”. In: Visualization and Computer Graphics, IEEE Transac-
tions on 12.5 (2006), pp. 789–796.
[66] Richard Webbera, Ric D Herbertb, and Wei Jiangbc. “Space-filling Techniques in Vi-
sualizing Output from Computer Based Economic Models”. In: ().
[67] What is a Trellis Chart? http://trellischarts.com/what-is-a-trellis-chart.
[Online; accessed 03-February-2015]. 2015.
[68] Pak Chung Wong and R Daniel Bergeron. “30 Years of Multidimensional Multivariate
Visualization.” In: Scientific Visualization. 1994, pp. 3–33.
[69] Hong Zhou et al. “Visual clustering in parallel coordinates”. In: Computer Graphics
Forum. Vol. 27. 3. Wiley Online Library. 2008, pp. 1047–1054.