Vivarana : Interactive Data Visualization Tool for Complex Event Processor
Rule Generation
Sajith Edirisinghe (100112V)
Vimuth Fernando (100132G)
Tharindu Ranasinghe (100440A)
Mihil Ranathunge (100444N)
Department of Computer Science and Engineering
Faculty of Engineering
University of Moratuwa
Supervised by:
Prof. Gihan Dias
Eng. Charith Chitranjan
2015
Abstract
In Complex Event Processor (CEP) systems, processing takes place according to user-defined rules, each of which defines an action for a particular set of data. Writing such rules is generally a challenging, time-consuming task even for domain experts. It is a two-part process in which the user has to first identify which events of the event stream to act on, and then write CEP queries to filter out the types of events identified earlier. We propose a solution that unifies this whole process by providing the users of CEP systems with a single tool that can be used to easily identify patterns of interest in large data sets through a data visualization technique, and then automatically generate CEP queries to filter out the events of interest identified by the user.
Vivarana is an interactive data visualization tool that can be used to generate CEP queries. It provides users with the ability to interactively analyze a large data set and to generate CEP queries that filter out events of interest. In this report we describe the current research in the areas of visualization and CEP rule generation; the implementation details of our tool; the issues and challenges encountered during the project; and some paths that can be explored in the future to improve the effectiveness of our tool: the visualization method, the interactions the user can perform on the visualization, and the rule generation technique implemented in Vivarana.
Contents
Contents  i
List of Figures  ii
1 Introduction  1
2 Literature Review  2
2.1 Introduction  2
2.2 Multidimensional data visualization  2
2.3 Visualization techniques  5
2.4 CEP Rule generation  41
3 Solution  51
3.1 Overview  51
3.2 Visualization - Parallel Coordinates  53
3.3 Other functionalities  61
3.4 Rule Generation  64
3.5 Other Approaches Attempted  69
4 Discussion  84
5 Conclusion and Future Work  87
Bibliography  89
List of Figures
2.1 A scatterplot of the distribution of drivers' visibility range against their age  6
2.2 A scatterplot matrix display of data with three variates X, Y, and Z  7
2.3 Rank-by-feature framework interface for scatterplots (2D)  7
2.4 Rank-by-feature visualization for a data set of demographic and health related statistics for 3138 U.S. counties  9
2.5 Scatterplot matrix navigation for a digital camera dataset  10
2.6 Stage-by-stage overview of the scatterplot animated transition  11
2.7 Scatterplot matrix for the Nuts-and-bolts dataset  12
2.8 Generalized Plot Matrix for the Nuts-and-bolts dataset  13
2.9 Parallel coordinate plot with 8 variables for 250 cars  14
2.10 Parallel Coordinate plot for a point  15
2.11 Parallel Coordinate plot for points on a line with m < 0  15
2.12 Parallel Coordinate plot for points on a line with 0 < m < 1  16
2.13 Negative correlation between Car Weight and the Year  17
2.14 Using brushing to filter Cars with 6 cylinders  17
2.15 Using composite brushing to filter Cars with 6 cylinders made in '76  18
2.16 An example of Smooth brushing  19
2.17 Angular Brushing  20
2.18 Multiple ways of ordering N axes in parallel coordinates  21
2.19 Two clusters represented in parallel coordinates  22
2.20 Multiple clusters visualized in parallel coordinates in different colors  22
2.21 Variable length Opacity Bands representing a cluster in parallel coordinates  22
2.22 Parallel-coordinates plot using polylines and using bundled curves  23
2.23 Statistically colored Parallel Coordinates plot on weight of cars  24
2.24 Three scaling options for visualizing the stage times in the Tour de France  25
2.25 Parallel Coordinates plot for a data set with 8000 rows  26
2.26 Parallel Coordinates for the Olive Oils data, showing how alpha blending can improve dense visualizations  28
2.27 Parallel Coordinates visualization with Z Score coloring  29
2.28 Parallel Coordinates drawn on the same data set using data selection  30
2.29 Radviz Visualization for multi dimensional data  31
2.30 Mosaic plot for the Titanic data showing the distribution of passengers' survival based on their class and sex  32
2.31 Double Decker plot for the Titanic data  33
2.32 Training a self organizing map  35
2.33 A self organizing map trained on the poverty levels of countries  35
2.34 A sunburst visualization summarizing user paths through a fictional e-commerce site  37
2.35 Trellis Chart for a data set on sales  38
2.36 Trellis Display of Scatter Plots (Relationship of Gifts Given/Received on Revenue)  39
2.37 A snapshot of the grand tour; a projection of the data to a single plane is illustrated in (B)  40
2.38 Grand tour path in 3D space  41
2.39 Structure of the iCEP framework  45
2.40 Prediction Correction Paradigm  47
2.41 An overview of the rules tuning method  50
3.1 Architecture of the implementation of Vivarana  52
3.2 Basic Implementation of Parallel Coordinates  54
3.3 Example 1D Brushing  56
3.4 Example Composite Brushing  56
3.5 Example Composite Brushing  57
3.6 SlickGrid along with the Parallel Coordinates  58
3.7 Cluster Coloring  59
3.8 Cluster Bundling  59
3.9 Parallel Coordinates without Alpha Blending  60
3.10 Parallel Coordinates with Alpha Blending  60
3.11 Parallel Coordinates with Statistical Coloring  61
3.12 Specifying the size of either Time or Event window  62
3.13 State of visualization after performing an aggregation operation  63
3.14 Performing clustering within clusters in a web server log data set  64
3.15 A decision tree to classify the Iris data set  65
3.16 A decision tree to classify the Iris data set, with paths to a Virginica flower highlighted  66
3.17 Rule generation process  68
3.18 Sunburst Visualization which we used as the foundation of our project  72
3.19 Select grouping and grouped columns  73
3.20 Data file representation in a Python DataFrame  74
3.21 Python DataFrame after Group By operation  75
3.22 Sequence Database Final Representation  76
3.23 Sequence Database after being stripped of unnecessary event attributes  77
3.24 Counting the number of unique sequences using value_counts()  77
3.25 Improved sunburst visualization  82
3.26 Sunburst with drill down capability  83
3.27 Sunburst zoom and pan capability  83
4.1 Challenges faced in the visualization  86
Acknowledgments
We would like to acknowledge with much gratitude and thank every person who provided assistance and supervision throughout this project in order to make it successful. We would like to express our sincere gratitude especially towards Prof. Gihan Dias and Eng. Charith Chitraranjan, who supervised and mentored us from the beginning to the end of the project, providing us valuable insights, feedback, immense support, and guidance to make this project a success.
Further, we would like to thank Dr. Malaka Walpola, our final year project coordinator, for his continuous support and guidance, which helped us boost our performance and motivated us to do our best.
We are also grateful to all the members of the academic and non-academic staff of the Department of Computer Science and Engineering who helped us in various ways to finish our project.
Last but not least, we are highly grateful to all our colleagues of the CSE '10 batch who helped us in various ways by providing valuable feedback and helping us through technical difficulties. We consider it a privilege to have worked with all these amazing people throughout this project.
Chapter 1
Introduction
Nowadays, every action or event occurring in the real world, whether it be a change of temperature detected by a sensor, a change in stock market prices, or the movement of objects tracked through GPS coordinates, is digitally collected and stored for further exploration and analysis, and sometimes a pre-specified action is triggered in real time when a particular event occurs. Complex Event Processing (CEP) engines are used to analyze these events on the fly and to execute appropriate pre-specified actions.
However, one downside of this real-time event monitoring and processing using a CEP is that a domain expert must write the necessary CEP rules in order to detect interesting events and to trigger an appropriate response. Sometimes the domain expert might lack the knowledge to write efficient CEP rules for a particular CEP engine using its query language, or might need to explore, understand and analyze the incoming event stream prior to writing any rules.
By providing an interactive visualization of data to domain experts, we can help them in their process of generating CEP rules. Chapter 2 contains the literature review we conducted in order to familiarize ourselves with the existing research on interactive multi-dimensional data visualization and automatic Complex Event Processor rule generation. Chapter 3 contains the implementation details of our solution to the aforementioned problem. In Chapter 4 we discuss the challenges we faced during this project and how we overcame them. Finally, Chapter 5 contains the conclusion and future work regarding Vivarana.
Chapter 2
Literature Review
2.1 Introduction
This literature survey mainly contains two parts. Section 2.3 presents our findings on interactive visualization techniques; in that section we describe scatter plots and parallel coordinates in detail and briefly introduce other promising visualization techniques. Section 2.4 contains our findings on two methods of CEP rule generation, namely iCEP and rule parameter tuning. Further, section 2.2 contains an overview of multidimensional visualization (principles, techniques, problems) for the sake of completeness.
2.2 Multidimensional data visualization
Recent advances in technology have enabled the generation of vast amounts of data in a wide range of fields, and these data keep getting more complex. Data analysts look for patterns, anomalies and structures in the data, and analyzing it can lead to important knowledge discoveries that are valuable to users. The benefits of such understanding are reflected in better business decision making, more accurate medical diagnosis, finer engineering and, in a general sense, more refined conclusions.
Visualizing these complex data can provide an overview and a summary of the data, and can help in identifying areas of interest within it. Good data visualization techniques that allow users to explore and manipulate the data can empower them in analyzing it and in identifying important patterns and trends that may otherwise have remained hidden.
Multi-dimensional data visualization is a very active research area that goes back many years [68]. In this survey we have focused on 2D multi-dimensional data visualization techniques, because 2D visualizations make it easier for users to analyze and interact with the data: a 2D surface is more familiar to users and easier to navigate. There are multiple challenges that need to be overcome in multidimensional data visualization, and finding a good visualization involves finding a good compromise among them:
• Mapping - Finding a good mapping from a multi-dimensional space to a two di-
mensional space is not a simple task. The final representation of the data should be
intuitive and interpretable. Users should be able to identify patterns and trends in the
multi-dimensional data using the two dimensional representation.
• Large amounts of data - Modern datasets contain very large amounts of data that can lead to very dense visualizations. This causes a loss of information in the visualization, because users lose the ability to distinguish between small differences in the data.
• Dimensionality - Displaying the information of multiple dimensions in two dimen-
sional space can also lead to very dense and cluttered visualizations. Techniques need
to be developed to allow users to reduce the clutter and identify important informa-
tion in the data. Techniques such as principal component analysis [29] can help in
identifying important dimensions in the data.
• Assessing effectiveness - Information needs vary widely from one data set to another, so no single visualization technique can solve every problem. Different datasets and requirements call for different visualization methods. There is no general method to assess the effectiveness of one visualization method over another, and no fixed process that can be followed to arrive at a visualization method that works for any dataset.
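To make the principal component idea above concrete: in the two-dimensional case the covariance matrix is 2-by-2 and its eigenvalues have a closed form, so the fraction of variance captured by the first principal component can be computed directly. The following is an illustrative sketch in plain Python, not part of the tool, and the data is made up:

```python
import math

def top_variance_fraction(xs, ys):
    """Fraction of total variance captured by the first principal
    component of a 2-D dataset (closed-form 2x2 eigendecomposition)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) ** 2 for x in xs) / n                     # var(x)
    c = sum((y - my) ** 2 for y in ys) / n                     # var(y)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n   # cov(x, y)
    half = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1 = (a + c) / 2 + half                                  # largest eigenvalue
    return lam1 / (a + c)

# Strongly correlated pair: almost all variance lies along one direction,
# so the pair could be summarized by a single derived dimension.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.0, 8.1, 9.9]   # roughly y = 2x
print(top_variance_fraction(xs, ys))
```

A fraction close to 1 indicates that dropping the second derived dimension would lose very little information, which is exactly what makes PCA useful for reducing clutter.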
Further, according to E. R. Tufte [62], a good visualization comprises the following qualities:
• Show data variations instead of design variations. This quality encourages the viewer to think about the substance rather than about methodology, graphic design, the technology of graphic production, etc. One way to achieve this quality in a visualization is to have a high data-to-ink ratio [10] and a high data density.
• Clear, detailed and thorough labeling, and appropriate scales. A visualization can use layering and separation techniques to show the labels of the data items.
• The size of a graphic effect should be directly proportional to the numeric quantity it represents. This can be achieved by avoiding chart junk such as unnecessary 3D and shadowing effects, and by reducing the lie factor [37].
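The lie factor mentioned above is simple arithmetic: the size of the effect shown in the graphic divided by the size of the effect present in the data, with values near 1 indicating an honest graphic. A small illustrative computation (the numbers are hypothetical):

```python
def lie_factor(graphic_change, data_change):
    """Tufte's lie factor: relative effect size shown in the graphic
    divided by the relative effect size present in the data.
    Values far from 1 indicate a misleading graphic."""
    return graphic_change / data_change

# Hypothetical example: the data grows by 50%, but the bar's area grows
# by 125% (both width and height scaled by 1.5), exaggerating the trend.
data_change = (15 - 10) / 10        # 0.5
graphic_change = (1.5 * 1.5) - 1    # 1.25
print(lie_factor(graphic_change, data_change))  # 2.5
```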
To make a visualization more user-friendly, a number of interaction techniques have been proposed [33]. It should be noted that the behavior of these interaction techniques differs from one visualization technique to another. In general, interaction techniques allow the user to directly interact with the visualization and to change it according to the exploration objective. The list below contains the major interaction techniques we have identified.
• Dynamic Projections
Dynamic projection means dynamically changing the projection in order to explore a multidimensional data set. A classic example is the Grand Tour [3], which tries to show all interesting pairs of dimensions of a multidimensional dataset as a series of scatter plots. The sequence of projections can be random, manual, pre-computed, or even data-driven, depending on the visualization technique.
• Interactive Filtering
When exploring a large dataset interactively, partitioning it and focusing on interesting subsets is a must. This can be achieved through direct selection of the desired subset (browsing) or through specifying the properties of the desired subset (querying). However, browsing becomes difficult and querying becomes inaccurate as the dataset grows larger. As a solution to this problem, techniques such as Magic Lens [5] and InfoCrystal [54] have been developed to improve interactive filtering in data exploration.
• Interactive Zooming
Zooming is used in almost all interactive visualizations. When dealing with large amounts of data, the data is sometimes shown highly compressed in order to provide an overview of it. In such cases, zooming does not only mean displaying the data objects larger; the data representation should also change automatically to present more details at higher zoom levels (decompressing). The initial (compressed) view allows the user to identify patterns, correlations and outliers, and by zooming in to an area of interest the user can study the data objects within that region in more detail.
• Interactive Distortion
Interactive distortion techniques help the data exploration process by providing a way to focus on details while preserving an overview of the data. The basic idea of distortion is to show a portion of the data at a high level of detail while the rest is shown at a lower level of detail.
• Interactive Linking and Brushing
The idea of linking and brushing is to combine different visualization methods to
overcome the shortcomings of single techniques. As an example one could visualize a
scatterplot matrix (section 3.1) for a data set and when some points in a particular
scatterplot is brushed those points will get highlighted in all other scatterplots. Hence
interactive changes made in one visualization are automatically reflected in the other
visualizations.
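Conceptually, linking and brushing reduces to sharing one selection mask across all views: brushing in one plot recomputes the mask, and every linked view highlights the same rows. A minimal sketch of this idea in Python follows; the record layout and column names are hypothetical:

```python
def brush(rows, column, low, high):
    """Return a boolean mask selecting the rows whose `column` value
    falls inside the brushed interval [low, high]."""
    return [low <= row[column] <= high for row in rows]

def highlighted(rows, mask):
    """Rows that every linked view should draw as highlighted."""
    return [row for row, keep in zip(rows, mask) if keep]

# Hypothetical car records shared by a scatterplot and a bar chart.
cars = [
    {"mpg": 18, "cylinders": 6},
    {"mpg": 30, "cylinders": 4},
    {"mpg": 16, "cylinders": 8},
]
# Brushing the mpg axis in one view produces a mask that all linked
# views consult when rendering, so the same rows light up everywhere.
mask = brush(cars, "mpg", 15, 20)
```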
2.3 Visualization techniques
Scatter Plots
Scatterplots are a commonly used visualization technique for multivariate data sets. There are mainly 2D and 3D scatterplot visualizations. In a 2D scatterplot, data points from two dimensions of a dataset are plotted in a Cartesian coordinate system whose two axes represent the selected dimensions, resulting in a scattering of points. An example of a scatterplot showing the distribution of drivers' visibility range against their age is shown in Figure 2.1.
The positions of the data points represent the corresponding dimension values. Scatterplots are useful for visually identifying correlations between two selected variables of a multidimensional data set, or for finding clusters and outliers in the dataset.

Figure 2.1: A scatterplot of the distribution of drivers' visibility range against their age

A single scatterplot can only depict the correlation between two dimensions; a limited number of additional dimensions can be mapped to the color, size or shape of the plotted points.
Advocates of 3D scatterplots argue that since the natural world is three dimensional, users
can readily grasp 3D representations. However, there is substantial empirical evidence that
for multidimensional ordinal data (rather than 3D real objects such as chairs or skeletons),
users struggle with occlusion and the cognitive burden of navigation as they try to find desired
viewpoints [51]. Advocates of higher dimensional displays have demonstrated attractive
possibilities, but their strategies are still difficult to grasp for most users.
Since two-dimensional scatterplot presentations offer ample power while maintaining comprehensibility, many variations have been proposed. One of the methods used to visualize multivariate data using 2D scatterplots is the scatterplot matrix (SPLOM) [68].
Each individual plot in the SPLOM is identified by its row and column number in the
matrix [68]. For example, the identity of the upper left plot of the matrix in Figure 2.2 is
(1, 3) and the lower right plot is (3, 1). The empty diagonal cells display the variable names. Plot (2, 1) is the scatter plot of parameter X against Y, while plot (1, 2) is the reverse, i.e. Y versus X.
One of the major disadvantages of a SPLOM is that as the number of dimensions of the data set grows, the n-by-n SPLOM grows with it and each individual scatterplot gets less space. The following frameworks address this problem by incorporating interaction techniques into the traditional SPLOM.
Figure 2.2: A scatterplot matrix displays of data with three variates X, Y , and Z.
Figure 2.3: Rank-by-feature framework interface for scatterplots (2D).
Rank-by-feature framework
Many variations have been proposed to the initial SPLOM to enhance its interactivity and
interpretability. One such enhancement is presented with the rank-by-feature framework [51].
Instead of directly visualizing the data points against all pairs of dimensions, this framework allows the user to select an interesting ranking criterion, as described later in this section.
Figure 2.3 shows a dataset of demographic and health-related statistics for 3138 U.S. counties with 17 attributes, visualized through the rank-by-feature framework. Its interface consists of four coordinated components: the control panel (Figure 2.3A), score overview (Figure 2.3B), ordered list (Figure 2.3C), and scatterplot browser (Figure 2.3D).
Users can select an ordering criterion in the control panel (Figure 2.3A), and the ordered list (Figure 2.3C) shows the pairs of dimensions (scatterplots) sorted according to the score of that criterion, with the scores color-coded in the background. However, users cannot see an overview of all the relationships between variables at a glance in the ordered list. Hence the score overview (Figure 2.3B), an m-by-m grid view in which all dimensions are aligned along the rows and columns, has been implemented. Each cell of the score overview represents a scatterplot whose horizontal and vertical axes are the dimensions at the corresponding column and row, respectively.
Since this matrix is symmetric, only the lower-triangular part is shown. Each cell is color-coded by its score value using the same mapping scheme as in the ordered list. The scatterplot corresponding to the selected cell is simultaneously shown in the scatterplot browser (Figure 2.3D), and the corresponding item is highlighted in the ordered list (Figure 2.3C). In the scatterplot browser, users can quickly look through scatterplots by using the item sliders attached to the scatterplot view. Simply by dragging the vertical or horizontal item slider bar, users can change the dimension for the horizontal or vertical axis, respectively, while preserving the other axis.
The list below contains the ranking criteria suggested by this framework.
• Correlation coefficient (-1 to 1)

The Pearson correlation coefficient r for a scatterplot S with n points [46] is defined in Equation 1:

    r = Σ (xi - x̄)(yi - ȳ) / sqrt( Σ (xi - x̄)² · Σ (yi - ȳ)² )    (1)

Pearson's r is a number between -1 and 1; its sign and magnitude tell the direction and the strength of the relationship, respectively. Although correlation does not necessarily imply causality, it can provide a good clue to the true cause, which could be another variable. Linear relationships are the most common and the simplest to understand. As a visual representation of the linear relationship between two variables, the line of best fit (the regression line) is drawn over the scatterplots.
• Least square error for curvilinear regression (0 to 1)

This criterion sorts scatterplots in terms of the least-square error from the optimal quadratic curve fit, so that the user can isolate the scatterplots where all points are closely/loosely arranged along a quadratic curve. In some scenarios it might be interesting to find non-linear relationships in the data set in addition to linear ones.
• Quadracity (0 to infinity)

Figure 2.4: Rank-by-feature visualization for a data set of demographic and health related statistics for 3138 U.S. counties

The "Quadracity" criterion is added to emphasize truly quadratic relationships. It ranks scatterplots according to the coefficient of the highest-degree term, so that users can easily identify the ones that are more quadratic than others.
• The number of potential outliers (0 to n)

Distance-based outlier detection methods such as DB-out [36], or density-based outlier detection methods such as the Local Outlier Factor (LOF) based method [6], can be used to detect outliers in a scatterplot. The rank-by-feature framework uses the LOF-based method (Figure 2.4), since it is more flexible and dynamic in terms of outlier definition and detection. The outliers are highlighted with yellow triangles in the scatterplot browser view.
• The number of items in the region of interest (0 to n)
This criterion allows the user to draw a free-formed polygon region of interest on the
scatterplot. Then the framework will use the number of data points in the region to
order all scatterplots, so that the user can easily find the ones with the most/least items in the specified region.
• Uniformity of scatterplots (0 to infinity)

To calculate this criterion, the two-dimensional space is divided into regular grid cells, and each cell is used as a bin. For example, if a k-by-k grid has been generated, the entropy of a scatterplot S would be

    H(S) = - Σi Σj pij log2 pij

where pij is the probability that an item belongs to the cell at (i, j) of the grid.
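Each ranking criterion above reduces to a score function computed over the points of a single scatterplot; the scatterplots are then sorted by the chosen score. The sketch below illustrates three of the criteria in plain Python. The function names and the DB-out threshold parameters are our own illustrative choices, not the framework's:

```python
import math
from collections import Counter

def pearson_r(xs, ys):
    """Correlation coefficient criterion (-1 to 1), as in Equation 1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def db_outliers(points, dist, frac):
    """Potential-outliers criterion in the DB-out style: a point counts
    as an outlier if at least `frac` of the other points lie farther
    than `dist` away (thresholds are illustrative)."""
    out = 0
    for i, (px, py) in enumerate(points):
        far = sum(1 for j, (qx, qy) in enumerate(points)
                  if i != j and math.hypot(px - qx, py - qy) > dist)
        if far >= frac * (len(points) - 1):
            out += 1
    return out

def grid_entropy(points, k):
    """Uniformity criterion: entropy of a k-by-k binning of the plot."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    def bin_of(v, lo, hi):
        return min(int((v - lo) / (hi - lo + 1e-12) * k), k - 1)
    cells = Counter((bin_of(x, min(xs), max(xs)),
                     bin_of(y, min(ys), max(ys))) for x, y in points)
    n = len(points)
    return -sum((c / n) * math.log2(c / n) for c in cells.values())
```

A rank-by-feature style interface would evaluate the selected score for every pair of dimensions, sort the pairs, and color-code the score in the overview grid.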
Figure 2.5: Scatterplot matrix navigation for a digital camera dataset.
Rolling Dice Framework
Rolling dice is another framework that utilizes a SPLOM to visualize multidimensional data [13]. In this framework, transitions from one scatterplot to another are performed as animated rotations in 3D space, similar to rolling a die. The rolling dice framework also suggests a visual querying technique, so that a user can refine a query by exploring how the same query would look in any other scatterplot.
The interface proposed by the framework mainly consists of three components: the scatterplot component (Figure 2.5B), the scatterplot matrix component (Figure 2.5A) and the query layer component (Figure 2.5C). The scatterplot component shows the currently viewed cell of the scatterplot matrix, with the names and labels of the two displayed axes. The scatterplot matrix component can be used both as an overview and as a navigational tool. Navigation in the scatterplot matrix is restricted to orthogonal movement along the same row or column of the matrix, so that one dimension of the focused scatterplot is always preserved while the other changes. The change is visualized using a 3D rotation animation, which gives a semantic meaning to the movement of the points, allowing the human mind to interpret the motion as shape [64].
The transition between scatterplots is performed as a three-stage animation: extrusion into 3D, rotation, and projection back into 2D. More specifically, given the two currently visualized dimensions x and y and a vertical transition to a new dimension y', the animation follows the steps below (also depicted in Figure 2.6).

Figure 2.6: Stage-by-stage overview of the scatterplot animated transition
• Extrusion: The scatterplot visualizing the x and y axes is extruded into 3D, where the new dimension y' becomes the depth coordinate of each data point. At the end of this step the 2D scatterplot has become 3D (Figures 2.6A and 2.6B).
• Rotation: The scatterplot is rotated 90 degrees up or down, causing the axis previously along the depth dimension to become the new vertical axis (Figure 2.6C).
• Projection: The 3D plot is projected back into 2D with x and y' as the new horizontal and vertical axes (Figures 2.6D and 2.6E).
Further, the rolling dice framework suggests a method called query sculpting, which allows selecting data items in the main scatterplot visualization using 2D bounding shapes (convex hulls) and iteratively refining that selection from other viewpoints while navigating the scatterplot matrix. As shown in Figure 2.5C, the query layer component is used for selecting, naming and clearing color-coded queries during the visual exploration. Clicking and dragging one query onto another performs a union or intersection operation (by dragging with the left or right mouse button, respectively). Each query layer also provides a visual indication of the percentage of items it currently selects.
Figure 2.7: Scatterplot matrix for the Nuts-and-bolts dataset
Shortcomings of Scatterplot Matrix (SPLOM)
In order to discuss the shortcomings of SPLOM, let's consider a fictitious "nuts-and-bolts" dataset. This dataset, shown in Table 1, involves 3 (independent) categorical variables: Region (North, Central, and South), Month (January, February, ...), and Product (Nuts or Bolts). It also contains 3 (dependent) continuous variables: Sales, Equipment costs, and Labor costs.

Figure 2.7 shows the SPLOM for the "nuts-and-bolts" dataset. The top three scatterplots (e.g. Month vs Region) each show a crossing of two categorical variables, resulting in an uninformative grid of points. Further, the scatterplots showing continuous vs categorical variables suffer from overplotting (e.g. Sales vs Product).
To overcome this issue, the Generalized Plot Matrix (GPLOM) [27] has been proposed. The GPLOM uses heatmaps to visualize pairs of categorical variables, bar charts to visualize continuous vs categorical variables, and scatterplots to visualize pairs of continuous variables. It is important to note that in this scenario the scatterplots show individual tuples, whereas the bar charts and heatmaps show aggregated data. Figure 2.8 shows the GPLOM for the nuts-and-bolts dataset. Even though a GPLOM is a better choice than a SPLOM for visualizing a combination of continuous and categorical variables, since it uses 3 types of charts it loses the consistency of the matrix.

Figure 2.8: Generalized Plot Matrix for the Nuts-and-bolts dataset
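The GPLOM cell-selection rule described above can be stated as a small dispatch on the types of the two variables. An illustrative sketch follows; the function and type names are our own, not from the GPLOM paper:

```python
def gplom_cell(type_a, type_b):
    """Chart type a GPLOM-style matrix would use for one cell:
    heatmaps for two categorical variables, bar charts for mixed
    pairs, and scatterplots for two continuous variables."""
    if type_a == "categorical" and type_b == "categorical":
        return "heatmap"        # aggregated counts per category pair
    if type_a == "continuous" and type_b == "continuous":
        return "scatterplot"    # individual tuples
    return "barchart"           # continuous aggregated per category

# Nuts-and-bolts dataset: Region, Month and Product are categorical;
# Sales, Equipment costs and Labor costs are continuous.
print(gplom_cell("categorical", "categorical"))  # heatmap
print(gplom_cell("continuous", "categorical"))   # barchart
```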
Parallel Coordinates
Parallel coordinates, introduced by Inselberg and Dimsdale [28][30], is a popular technique for transforming multidimensional data into a 2D image. The m-dimensional data items are represented as lines crossing m parallel axes, each axis corresponding to one dimension of the original data. Fundamentally, parallel coordinates differ from the other visualization methodologies in that they yield a graphical representation of multidimensional data rather than just visualizing a finite set of points.

Figure 2.9: Parallel coordinate plot with 8 variables for 250 cars

Figure 2.9 displays a parallel coordinate plot with 8 variables, using a dataset that contains information about cars, such as economy (mpg), cylinders and displacement (cc), for a selected sample of cars manufactured between 1970 and 1982.
Definition and Representation
On the plane with xy-Cartesian coordinates, starting on the y-axis, N copies of the real line, labeled x1, x2, ..., xN, are placed equidistant and perpendicular to the x-axis. They are the axes of the parallel coordinate system for the Euclidean N-dimensional space RN, all having the same positive orientation as the y-axis [28].
Figure 2.10 shows how a point C with coordinates (c1, c2, ..., cN) can be represented by a polygonal line. In the same way, m data points can be represented by m polygonal lines.
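The construction of a polygonal line from one data item can be sketched as follows; the equal axis spacing and min-max normalization are assumptions of this illustration:

```python
def to_polyline(point, mins, maxs, width=1.0):
    """Map an m-dimensional point to polyline vertices over m
    equally spaced parallel axes, normalizing each dimension
    to [0, 1] between its min and max."""
    m = len(point)
    xs = [i * width / (m - 1) for i in range(m)]  # axis positions
    ys = [(v - lo) / (hi - lo) for v, lo, hi in zip(point, mins, maxs)]
    return list(zip(xs, ys))

# A 3-dimensional point becomes a polyline with 3 vertices.
print(to_polyline((5, 50, 0.5), mins=(0, 0, 0), maxs=(10, 100, 1)))
# [(0.0, 0.5), (0.5, 0.5), (1.0, 0.5)]
```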
For lines with negative slope (m < 0), the intersection point lies between the axes, as in Figure 2.11. For m > 1 the intersection point lies to the left of the X1 axis, while for lines with 0 < m < 1 it lies to the right of the X2 axis, as in Figure 2.12.
The above point-line duality can be considered one of the main advantages of parallel coordinates: parallel coordinate representations can provide statistical interpretations of the data. In the statistical setting, the following interpretations can be made: for highly negatively correlated pairs, the dual line segments in parallel coordinates tend to cross near a single
Figure 2.10: Parallel Coordinate plot for a point
Figure 2.11: Parallel Coordinate plot for points in a line with m < 0
Figure 2.12: Parallel Coordinate plot for points in a line with 0 < m < 1
point between the two parallel coordinate axes. Parallel or almost parallel lines between axes indicate positive correlation between variables [49] [60]. For example, we can see that there is a strong negative correlation between weight and year in Figure 2.13.
Over the years parallel coordinates have been enhanced by many researchers, who have improved the technique for better data investigation and for easier, user-friendly interaction by adding brushing, data clustering, real-time re-ordering of coordinate axes, etc.
Brushing
Brushing is considered a very effective technique for specifying an explicit focus during information visualization [20]. The user actively marks subsets of the dataset as being especially interesting, and the points contained by the brush are colored differently from the other points to make them stand out [42]. For example, a user interested in cars having 6 cylinders can use brushing as depicted in Figure 2.14.
The introduction of composite brushes [42] allows users to define their focus more specifically. Composite brushes are combinations of single brushes whose result is the conjunction of those single brushes. For example, a user interested in cars having 6 cylinders that were produced in '76 can use composite brushing as depicted in Figure 2.15.
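A composite brush is simply the conjunction of per-axis interval predicates; a minimal sketch with invented car records:

```python
def brush(records, **ranges):
    """Composite brush: keep records whose value on every brushed
    axis falls inside that axis's [low, high] interval."""
    def inside(rec):
        return all(lo <= rec[axis] <= hi for axis, (lo, hi) in ranges.items())
    return [rec for rec in records if inside(rec)]

cars = [
    {"cylinders": 6, "year": 76, "mpg": 21},
    {"cylinders": 6, "year": 72, "mpg": 19},
    {"cylinders": 4, "year": 76, "mpg": 31},
]
# Single brush: 6 cylinders; composite brush: 6 cylinders AND year '76.
print(len(brush(cars, cylinders=(6, 6))))                 # 2
print(len(brush(cars, cylinders=(6, 6), year=(76, 76))))  # 1
```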
The brushing techniques we have seen up to now use a discrete distinction between focus and
Figure 2.13: Negative correlation between Car Weight and the Year
Figure 2.14: Using brushing to filter Cars with 6 cylinders
Figure 2.15: Using composite brushing to Filter Cars with 6 cylinders made in 76
context. With that, we do not understand how similar the other data points are to the focused data points. The solution brought forward for this is called smooth brushing [20], where a multi-valued or even continuous transition is allowed, which inherently supports showing the similarity between data points in focus and their context. This corresponds to a degree-of-interest (DOI) function which maps non-binarily into the [0, 1] range. Often, such a non-binary DOI function is defined by means of spatial distances, i.e., the DOI value reflects the distance of a data point from a so-called center-of-interest.
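A distance-based DOI function of the kind described can be sketched as follows; the linear fall-off and the Euclidean distance are assumptions of this illustration:

```python
def doi(point, center, radius):
    """Non-binary degree of interest in [0, 1]: 1 at the
    center-of-interest, falling off linearly to 0 at `radius`."""
    dist = sum((p - c) ** 2 for p, c in zip(point, center)) ** 0.5
    return max(0.0, 1.0 - dist / radius)

print(doi((0, 0), (0, 0), radius=10))  # 1.0 (fully in focus)
print(doi((6, 8), (0, 0), radius=10))  # 0.0 (at the boundary)
print(doi((3, 4), (0, 0), radius=10))  # 0.5 (context, half interest)
```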
Standard brushing primarily acts along the axes, but the technique called angular brushing enables the space between axes for brushing [20]. The user can interactively specify a subset of slopes, which then marks as part of the current focus all data points that exhibit the matching correlation between the brushed axes. For example, a user interested only in data that has a negative correlation between Horsepower and Acceleration can use angular brushing as shown in Figure 2.17.
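Angular brushing can be sketched as a filter on the slope of each line segment between two adjacent (normalized) axes; the record values below are invented:

```python
def angular_brush(records, axis_a, axis_b, slope_range):
    """Angular brush between two adjacent axes: keep records whose
    line segment between the (normalized) axes has a slope inside
    slope_range. Negative slopes indicate negative correlation."""
    lo, hi = slope_range
    return [r for r in records if lo <= (r[axis_b] - r[axis_a]) <= hi]

cars = [{"hp": 0.9, "accel": 0.2},
        {"hp": 0.3, "accel": 0.8},
        {"hp": 0.5, "accel": 0.4}]
# Keep only segments sloping downwards (negative correlation).
selected = angular_brush(cars, "hp", "accel", slope_range=(-1.0, 0.0))
print(len(selected))  # 2
```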
Axis Reordering
One strength of parallel coordinates, as described in section 3.2.1, is its effectiveness in visualizing relations between coordinate axes. By bringing axes next to each other interactively, the user can investigate how values are related to each other with respect to two of the data dimensions. The order of the axes clearly affects the patterns revealed by parallel coordinate plots. Figure 2.18 shows 3 of the N! (N = 8 in this case) ways of ordering the axes, but only plot C in Figure 2.18 is capable of showing that there is a strong negative correlation between weight and economy.
Many researchers address this problem using some measure to score an ordering of the axes, while others discuss how to visualize multiple orderings in a single display [21] [24]. Several approaches based on the combination of the Nonlinear Correlation Coefficient and the Singular Value Decomposition algorithm [25] have been suggested. Using these approaches, the first remarkable axis can be selected on a mathematical basis, and all axes are re-ordered in line with the degree of similarity among them [39].
Figure 2.16: An example of Smooth brushing
Data Clustering
Parallel coordinates are a good technique for showing clusters in a data set, and researchers have used many techniques to display them.
Coloring is one method that has been used to show clusters in parallel coordinates [17]: different colors are assigned to different clusters. Figure 2.19 shows two explicitly given clusters represented with two different colors. Figure 2.20 shows the same cluster visualization technique for many clusters, for a data set taken from the USDA National Nutrient Database.
Figure 2.17: Angular Brushing
Variable-length opacity bands [17] are another technique for showing clusters in parallel coordinates. Figure 2.21 shows a graduated band, faded from a dense middle to transparent edges, that visually encodes information for a cluster. The mean stretches across the middle of the band and is encoded with the deepest opacity. This allows the user to differentiate sparse, broad clusters from narrow, dense clusters. The top and bottom edges of the band have full transparency, and the opacity across the rest of the band is linearly interpolated. The thickness of the band at each axis represents the extents of the cluster in that dimension.
Curved bundling [40] is also used to visualize clusters in parallel coordinates. Bundled
Figure 2.18: Multiple ways of ordering N axes in parallel coordinates
Figure 2.19: Two clusters represented in parallel coordinates
Figure 2.20: Multiple clusters visualized in parallel coordinates in different colors
Figure 2.21: Variable length Opacity Bands representing a cluster in parallel coordinates
Figure 2.22: Parallel-coordinates plot using polylines and using bundled curves
curve plots extend the traditional polyline plots and are designed to reveal the structure of clusters previously identified in the input data. Given a data point (P1, P2, ..., PN), its corresponding polyline is replaced by a piecewise cubic Bézier curve preserving the following properties. (Denote the main axes by X1, X2, ..., XN to avoid confusion between them and the added axes.)
• The curve interpolates P1, P2, ..., PN at the main axes.
• Curves corresponding to data points that belong to the same cluster are bundled between adjacent main axes. This is accomplished by inserting a virtual axis midway between the main axes and by appropriately positioning the Bézier control points along the virtual axis. To support curve bundling, control points that define curves within the same cluster are attracted toward a cluster centroid along the virtual axis.
Figure 2.22 compares a polyline plot with its counterpart using bundled curves. Polylines require color coding to distinguish clusters, whereas curve bundles rely on geometrical proximity to naturally represent cluster information. The cluttered visualization of color-coded polylines, which is the standard approach to cluster-membership visualization, motivates the new geometry-based method.
Bundling violates the point-line duality discussed in section 3.2.1, but it can be used to visualize clusters using geometry only, leaving the color channel free for other uses such as the statistical coloring described in section 3.2.6. Many algorithms have been proposed for adjusting the shape of the Bézier curves [40], [22], [69].
Figure 2.23: Statistically colored Parallel Coordinates plot on weight of cars
Statistical Coloring
Coloring the polygonal lines can be used to display statistical properties of an axis. A popular color scheme is to color by the z-score for a chosen dimension, so that the data distribution of that dimension can be understood. Figure 2.23 shows how z-score coloring has been applied to the weight dimension of the cars data set.
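The z-score computation behind this coloring can be sketched as follows; in practice the resulting values would feed a diverging color map, and the car weights here are invented:

```python
from statistics import mean, stdev

def z_colors(values):
    """Color polylines by the z-score of one dimension: map each
    value to its signed distance from the mean, in standard deviations."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

weights = [1500, 2000, 2500, 3000, 3500]
print(z_colors(weights))  # symmetric around 0 for this symmetric sample
```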
Scaling
Scaling of the axes is also an interesting property of parallel coordinates. The default is to plot all values over the full range of each axis, between the minimum and the maximum of the variable. Several other scaling methods have been suggested by researchers [60]; a common one is to use a common scale over all axes. Figure 2.24 shows the difference between two scaling methods, using the individual stage times of the 155 cyclists who finished the 2005 Tour de France bicycle race. Figure 2.24A is plotted with the default scaling and Figure 2.24B using a common scale over all axes. Neither Figure 2.24A nor Figure 2.24B is capable of revealing correlations between axes, even though Figure 2.24B shows the outliers clearly; the spread between the first and the last cyclist is almost invisible for most of the stages. In Figure 2.24C, a common scale for all stages is used, but each stage is aligned at the median value of that stage. It is the user's experience, domain knowledge and use case that define the scale and alignment on the parallel coordinates [60].
Figure 2.24: Three scaling options for visualizing the stage times in the Tour de France
Figure 2.25: Parallel Coordinates plot for a data set with 8000 rows
Limitations
Even though parallel coordinates are a great tool for visualizing high-dimensional data, they soon reach their limits. When using a very large dataset there are some identified weaknesses in parallel coordinates, such as:
1. Cross-over problem - The zigzagging polygonal lines used for data representation are not continuous. They generally lose visual continuation across the parallel coordinate axes, making it difficult to follow lines that share a common point along an axis.
2. When two or more data points have the same or similar values for a subset of the attributes, the corresponding polylines may overlap and clutter the visualization.
Figure 2.25 depicts these two problems on a parallel coordinate plot drawn for 8000 data points. Given a very large data set, with these two problems it is not easy to come to a conclusion about the correlation between axes, and brushing also will not give a clear idea about the data.
One solution to the above problems is to use α-blending [60]. When α-blending is used, each polygon is plotted with only α percent opacity. With smaller α values, areas of high line density are more visible and hence are better contrasted against areas with a small density.
The data in Figure 2.26 are real data from Forina et al. [15] on the fatty acid content of Italian olive oil samples from nine regions. Figures 2.26 A, B and C show the same plot of all eight fatty acids with α values of 0.5, 0.1, and 0.01 respectively. Depending on the amount of α-blending applied, the group structure of some of the nine regions is more or less visible [60].
It is hard to come to a conclusion about a value for α; the user must adjust the value until the visualization gives enough insight.
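The contrast gain from α-blending follows from simple compositing arithmetic; a sketch, assuming standard "over" compositing of k overlapping lines:

```python
def stacked_opacity(alpha, k):
    """Resulting opacity where k polylines of opacity `alpha`
    overlap under standard 'over' compositing: 1 - (1 - alpha)^k."""
    return 1.0 - (1.0 - alpha) ** k

# With alpha = 0.1 a single line is faint, but fifty overlapping lines
# are nearly opaque, so dense regions stand out against sparse ones.
print(round(stacked_opacity(0.1, 1), 3))   # 0.1
print(round(stacked_opacity(0.1, 50), 3))  # 0.995
```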
Clustering and statistical coloring, mentioned in sections 3.2.5 and 3.2.6, also reduce the weaknesses of parallel coordinates. As Figure 2.27 shows, the point-line duality is better preserved when statistical coloring is used. Two data preprocessing techniques can also be used to overcome the limitations of parallel coordinates: data selection and data aggregation. Data selection means that a display does not represent a dataset as a whole but only a portion of it, which is selected in a certain way [30]. The display is supplied with interactive controls for changing the current selection, which results in showing another portion of the data [1].
Figure 2.28 shows how displaying a portion of the data can overcome the weaknesses of parallel coordinates. Figure 2.28A displays only the food group of sausages and luncheon meats; Figures 2.28B and 2.28C display the food groups of beef products and of spices and herbs respectively, which yields a better visualization than plotting the whole data set.
Data aggregation reduces the amount of data under visualization by grouping individual items into subsets, often called aggregates, for which some collective characteristics can be computed. The aggregates and their characteristics (jointly called aggregated data) are then explored instead of the original data. For example, in parallel coordinates a whole cluster can be drawn as just one polygonal line, which reduces the limitations mentioned at the beginning of this section.
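Collapsing each cluster to one representative line can be sketched as follows; using the per-dimension mean as the representative is an assumption of this illustration:

```python
from statistics import mean

def aggregate_clusters(records, labels):
    """Collapse each cluster to a single representative polyline
    (here the per-dimension mean), reducing visual clutter."""
    clusters = {}
    for rec, lab in zip(records, labels):
        clusters.setdefault(lab, []).append(rec)
    return {lab: [mean(dim) for dim in zip(*recs)]
            for lab, recs in clusters.items()}

data = [(1, 10), (3, 12), (8, 2), (10, 4)]
print(aggregate_clusters(data, ["a", "a", "b", "b"]))
# {'a': [2, 11], 'b': [9, 3]}
```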
Parallel coordinates might be the plot least affected by the curse of dimensionality, since they can represent as many dimensions as the screen width permits. But a limitation still appears for high-dimensional data, because the distance d between two adjacent axes decreases as the number of dimensions increases. As a result, the correlation between axes might not be clear in the plot. Most applications assume it is up to the user to decide which attributes should be kept in, or removed from, a visualization. This is not a good approach for a user without domain knowledge; parallel coordinates themselves can be used to reduce the dimensionality of the data set [2].
When discussing axis reordering in section 3.2.4, we talked about obtaining a measure of axis similarity. Once the most similar axes are identified through that algorithm, the application can suggest that the user remove them and keep one significant axis out of each group of similar axes [2]. In that way redundant attributes can be removed from the visualization, and the space can be used efficiently to represent the remaining attributes.
Figure 2.26: Parallel Coordinates for the Olive Oils data, showing how α-blending can improve dense visualizations
Figure 2.27: Parallel Coordinates visualization with Z Score coloring
Parallel coordinates are a good technique for visualizing data. They support many user interactions and data analysis techniques, and although they have limits, researchers have found many ways to overcome them. Parallel coordinates remain a hot topic in data visualization research.
Radviz
The Radviz (Radial Visualization) method [23] maps a set of n-dimensional data points onto a two-dimensional space. All dimensions are represented by a set of equally spaced anchor points on the circumference of a circle.
For each data instance, imagine a set of springs connecting the data point to the anchor point of each dimension. The spring constant of the spring connected to the ith anchor corresponds to the value of the ith dimension of the data instance. Each data point is then displayed where the sum of all the spring forces equals zero. All the data point values are usually normalized to values between 0 and 1.
Consider the example in Figure 2.29.A; this data has 8 dimensions d1, d2, ..., d8, and each data point is connected to the anchors as shown in the diagram using springs. Following this procedure for all the records in the dataset leads to the Radviz display. Figure 2.29.B shows a Radviz representation of a dataset on transitional cell carcinoma (TCC) of the bladder generated by the Clifford Lab at LSUHSC-S [58].
One major disadvantage of this method is the overlap of points. Consider the following two points in a 4-dimensional data space: (1, 1, 1, 1) and (10, 10, 10, 10). These two data records will overlap in a Radviz display, even though they are clearly different, because the dimensions pull them both equally.
Figure 2.28: Parallel Coordinates drawn on same data set using data selection
Figure 2.29: Radviz Visualization for multi dimensional data
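The spring equilibrium reduces to a value-weighted mean of the anchor positions, which also makes the overlap problem easy to demonstrate; a minimal sketch:

```python
import math

def radviz(point):
    """Place an n-dimensional point at the equilibrium of springs
    pulling towards n equally spaced anchors on the unit circle;
    the equilibrium is the value-weighted mean of the anchors."""
    n = len(point)
    anchors = [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
               for i in range(n)]
    total = sum(point)
    x = sum(v * ax for v, (ax, _) in zip(point, anchors)) / total
    y = sum(v * ay for v, (_, ay) in zip(point, anchors)) / total
    return (x, y)

# The overlap problem: both points are pulled equally by every anchor,
# so they land on the same spot even though they are clearly different.
print(radviz((1, 1, 1, 1)))      # both approximately (0.0, 0.0)
print(radviz((10, 10, 10, 10)))
```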
Categorical dimensions cannot be visualized directly with Radviz and require additional preprocessing: each categorical dimension first needs to be flattened to create a new dimension for each possible category. This becomes problematic as the number of possible categories increases and may lead to poor visualizations.
Another challenge in generating good visualizations with this method is identifying a good ordering for the anchor points that correspond to the dimensions, one that makes it easy to identify patterns in the data. An interactive approach that allows changing the positions of the anchor points can help users overcome this issue.
Mosaic Plots
Mosaic plots [19], [16] are a popular method of visualizing categorical data. They provide a way of visualizing the counts in a multivariate n-way contingency table: the frequencies in the table are represented by a group of rectangles whose areas are proportional to the frequency of each cell.
Figure 2.30: Mosaic plot for the Titanic data showing the distribution of passengers' survival based on their class and sex
A mosaic plot starts as a rectangle. At each stage of plot creation, the rectangles are split parallel to one of the two axes according to the proportions of data belonging to each category. An example of a mosaic plot is shown in Figure 2.30: a mosaic plot for the Titanic dataset, which describes the attributes of the passengers on the Titanic and details of their survival.
The process of creating a mosaic display can be described as follows [24]. Assume that we want to construct a mosaic plot for p categorical variables X1, ..., Xp, and let ci be the number of categories of variable Xi, i = 1, ..., p.
1. Start with one single rectangle r (of width w and height h), and let i = 1.
2. Cut rectangle r(i−1) into ci pieces: find all observations corresponding to rectangle r(i−1), and find the breakdown for variable Xi (i.e., count the number of observations that fall into each of its categories). Split the width (height) of rectangle r(i−1) into ci pieces whose widths (heights) are proportional to the breakdown, keeping the height (width) of each piece the same as that of r(i−1). Call these new rectangles r(j, i), with j = 1, ..., ci.
3. Increase i by 1.
4. While i ≤ p, repeat steps 2 and 3 for all rectangles r(j, i−1) with j = 1, ..., c(i−1).
Figure 2.31: Double Decker plot for the Titanic data
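One splitting step of the procedure above can be sketched as follows; the (x, y, w, h) rectangle representation is an assumption of this illustration, and the counts are the familiar per-class passenger totals of the Titanic dataset:

```python
def mosaic_split(rect, counts, horizontal=True):
    """One step of mosaic-plot construction: split a rectangle
    (x, y, w, h) into pieces proportional to category counts,
    along the horizontal or vertical direction."""
    x, y, w, h = rect
    total = sum(counts)
    pieces, offset = [], 0.0
    for c in counts:
        frac = c / total
        if horizontal:
            pieces.append((x + offset, y, w * frac, h))
            offset += w * frac
        else:
            pieces.append((x, y + offset, w, h * frac))
            offset += h * frac
    return pieces

# First split of the unit square by passenger class (1st, 2nd, 3rd, crew).
print(mosaic_split((0, 0, 1.0, 1.0), counts=[325, 285, 706, 885]))
```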
In standard mosaic plots the rectangle is divided both horizontally and vertically. A variation that divides the rectangle only horizontally, called the Double Decker plot, has been proposed [19]; these can be used to visualize association rules. An example of a double decker plot is shown in Figure 2.31 for the same data as in Figure 2.30. There are other variations of mosaic plots, such as fluctuation diagrams, that try to increase their usability.
Mosaic plots are an interesting visualization technique for categorical data, but they cannot handle continuous data: to display continuous data in a mosaic plot, the data first needs to be converted to categorical form through a process such as binning. Mosaic plots also require the visual comparison of rectangles and their sizes to understand the data. This becomes complicated as the number of rectangles grows and the distance between them increases, so the plots become harder to interpret and understand. Vastly different aspect ratios of the rectangles further compound the difficulty of comparing their sizes.
Another issue with mosaic plots is that they become more complex as the number of dimensions in the data increases. Each additional dimension requires the rectangles to be split again, which at least doubles the possible number of rectangles, leading to a final visualization that is not very user friendly.
Self Organizing Maps
The self-organizing map (SOM) [58] is a type of neural network that has been used widely in data exploration and visualization, among its many other uses. SOMs use an unsupervised learning algorithm to perform a topology-preserving mapping from a high-dimensional data space to a lower-dimensional map (usually a two-dimensional lattice). The mapping preserves the topology of the high-dimensional data space, such that data points lying near each other in the original multidimensional space map to nearby units in the output space.
Generating a self-organizing map consists of training a set of neurons with the dataset. At each step of the training, an input data item is matched against the neurons, and the closest one is chosen as the winner. Then the weights of the winner and its neighborhood are updated to reinforce this behavior. The final result is a topology-preserving ordering in which similar new data entries match neurons near each other.
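The training loop can be sketched for a tiny one-dimensional map; the neighborhood radius, learning-rate decay, and data values are illustrative assumptions:

```python
import random

def train_som(data, n_units, steps=200, lr=0.5, radius=1, seed=0):
    """Train a tiny 1-D SOM: at each step the closest unit (winner)
    and its neighbors are pulled towards a random input item."""
    rng = random.Random(seed)
    units = [rng.random() for _ in range(n_units)]
    for _ in range(steps):
        x = rng.choice(data)
        winner = min(range(n_units), key=lambda i: abs(units[i] - x))
        for i in range(n_units):
            if abs(i - winner) <= radius:  # neighborhood update
                units[i] += lr * (x - units[i])
        lr *= 0.99                          # decay the learning rate
    return units

# Two well-separated 1-D clusters; the map units settle near the data.
print(train_som([0.1, 0.12, 0.9, 0.88], n_units=4))
```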
An example of a self-organizing map is shown in Figure 2.33: a self-organizing map trained on the poverty levels of countries [31]. As can be seen, countries with similar poverty levels are matched to neurons close to each other. The USA, Canada and other countries with low poverty sit together in the yellow and green areas, while countries such as Afghanistan and Mali, which have high poverty levels, are grouped together in the purple areas. This shows the topology-preserving aspect of SOMs.
There are some challenges with using self-organizing maps for multidimensional data
visualization.
1. SOMs are not unique. The same data can lead to widely different outcomes based on
the initialization of the SOM. So the same data may yield different visualizations and
lead to confusion.
Figure 2.32: Training a self organizing map.
Figure 2.33: A self organizing map trained on the poverty levels of countries
2. While similar data points are grouped together in SOMs, similar groups are not guar-
anteed to be close to each other. Some SOMs may be created that have similar groups
in multiple places in the map.
3. SOMs are not very user friendly compared with other visualization techniques. It is not easy to look at a SOM and interpret the data.
4. The process of creating a SOM is computationally expensive. The computational
requirements grow as the dimensionality of data increases. In modern data sources
that are highly complex and detailed this becomes a major drawback.
Sunburst Visualization
The Sunburst technique, like the Tree Map [65], is a space-filling visualization, but it uses a radial rather than a rectangular layout to visualize hierarchical information [55]. It is comparable to a nested pie chart and can be used to show hierarchical information such as the elements of a decision tree. This compact visualization avoids the problem of decision trees growing too wide to fit the display area; it is akin to visualizing the tree top-down, with the center representing the root of the decision tree and the ring around it its children.
In Sunburst, the top of the hierarchy is at the center and deeper levels lie farther from the center. The angle swept out by an item and its color correspond to some attribute of the data. For instance, in a visualization of a file system, the angle may correspond to the file/directory size and the color to the file type. An example Sunburst display is shown in Figure 2.34. This visualization has been used to summarize user navigation paths through a website [48], and also to visualize frequent item sets [34].
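The proportional-angle layout can be sketched with a short recursion; the (name, size, children) hierarchy structure and the example values are assumptions of this illustration:

```python
def sunburst_angles(node, start=0.0, end=360.0, depth=0, out=None):
    """Assign each node of a hierarchy an angular extent proportional
    to its size, nesting children inside the parent's arc; the depth
    becomes the ring index (root at the center)."""
    if out is None:
        out = []
    name, size, children = node
    out.append((name, depth, start, end))
    total = sum(c[1] for c in children) or 1
    a = start
    for child in children:
        span = (end - start) * child[1] / total
        sunburst_angles(child, a, a + span, depth + 1, out)
        a += span
    return out

# A tiny file-system-like hierarchy (names and sizes are invented).
tree = ("root", 100, [("docs", 75, []), ("src", 25, [])])
print(sunburst_angles(tree))
# [('root', 0, 0.0, 360.0), ('docs', 1, 0.0, 270.0), ('src', 1, 270.0, 360.0)]
```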
Trellis Visualization
A Trellis chart, also known as small multiples [61], a panel chart, a lattice chart, or a grid chart, is a layout of smaller charts in a grid with consistent scales. Each smaller chart represents an item in a category, named a condition [67]; the data displayed in each smaller chart is conditional on items of that category. Trellis charts are useful for finding structure and patterns in complex data. The grid layout looks similar to a garden trellis, hence the name.
Figure 2.34: A sunburst visualization summarizing user paths through a fictional e-commerce site.
The main aspects of trellis displays are columns, rows, panels and pages [46]. Figure 2.35 consists of 4 columns, 1 row, 4 panels and 1 page. Trellised visualizations enable the user to quickly recognize similarities or differences between different categories in the data. Each individual panel in a trellis visualization displays a subset of the original data table, where the subsets are defined by the categories available in a column or hierarchy. To make plots comparable across rows and columns, the same scales are used in all the panel plots [59].
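Partitioning a table into conditioned panels on a grid can be sketched as follows; the column names and the four-column grid are illustrative assumptions:

```python
def trellis_panels(rows, condition_key, n_columns=4):
    """Split a data table into panels conditioned on one categorical
    column, and lay the panels out on a grid with a fixed column count."""
    panels = {}
    for row in rows:
        panels.setdefault(row[condition_key], []).append(row)
    layout = {cat: divmod(i, n_columns)  # (grid row, grid column)
              for i, cat in enumerate(sorted(panels))}
    return panels, layout

sales = [{"region": r, "amount": a}
         for r, a in [("East", 5), ("West", 3), ("East", 7), ("North", 2)]]
panels, layout = trellis_panels(sales, "region")
print(sorted(panels))  # ['East', 'North', 'West']
print(layout["East"])  # (0, 0): first grid row, first column
```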
The benefits of trellis charts are:
• They are easy to understand. A Trellis Chart is a basic chart type repeated many times; if you understand the basic chart type, you can understand the whole Trellis Chart.
• Having many small charts lets you view complex multi-dimensional data in a flat 2D layout, avoiding the need for confusing 3D charts.
• The grid layout combined with consistent scales makes data comparison simple: just look up/down or across the charts.
Figure 2.35: Trellis Chart for a data set on sales
Figure 2.36 contains a trellis chart for the Minnesota barley data from The Design of Experiments [14] by R.A. Fisher. The trial involved planting 10 varieties of barley at 6 different sites over two different years, and the researchers measured the yield in bushels per acre for each of the 120 possibilities.
Grand Tour
The grand tour is one of the tour methods used to find structure in multidimensional data, and it can be applied to show multidimensional data on a 2D computer display. A tour is a subset of all the possible projections of the multidimensional data; the different tour methods combine several static projections, using different interpolation techniques, into a movie, which is called a tour [9].
Figure 2.36: Trellis Display of Scatter Plots (Relationship of Gifts Given/Received on Revenue)
Tour
In a static projection, some of the information in the dataset is lost to the user. But if several projections onto different planes are shown to the user step by step, the user can get an overview of the structure of the multivariate data.
Tours provide a general approach to choosing and viewing data projections, allowing the viewer to mentally connect disparate views and thus supporting the exploration of a high-dimensional space.
Figure 2.37: A snapshot of the grand tour; a projection of the data onto a single plane is illustrated in (B)
Tour methods
• Grand Tour - Shows all projections of the multivariate data by a random walk through the landscape.
• Projection Pursuit (PP) guided tour - The tour concentrates on the more interesting views, based on a PP index.
• Manual Control - The user decides the direction the tour takes.
The grand tour method chooses the target plane by random selection: a frame is randomly selected from the space of all possible projections. A target frame is chosen by standardizing a random vector from a standard multivariate normal distribution: sample p values from a standard univariate normal distribution, giving a sample from a standard multivariate normal. Standardizing this vector to have length one gives a random value on a (p−1)-dimensional sphere, that is, a randomly generated projection vector. Doing this twice gives a 2D projection, where the second vector is orthonormalized against the first. Figure 2.38 illustrates the tour path.
Figure 2.38: Grand tour path in 3D space
The solid circle in Figure 2.38 indicates the first point on the tour path, corresponding to the starting frame. The solid square indicates the last point on the tour path, i.e., the last projection computed. Each point corresponds to a projection from 3 dimensions to one dimension; the projection looks as if the data space were viewed from that direction. In the grand tour this point is chosen randomly.
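The frame-selection procedure described above can be sketched as a plain Gram-Schmidt orthonormalization of two random Gaussian vectors:

```python
import math
import random

def random_frame(p, seed=0):
    """Pick a random 2-D projection frame for the grand tour:
    two p-dimensional unit vectors, the second orthonormalized
    against the first (Gram-Schmidt)."""
    rng = random.Random(seed)
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]
    u = unit([rng.gauss(0, 1) for _ in range(p)])  # random direction
    w = [rng.gauss(0, 1) for _ in range(p)]
    dot = sum(a * b for a, b in zip(u, w))
    v = unit([b - dot * a for a, b in zip(u, w)])  # remove the u component
    return u, v

u, v = random_frame(p=5)
dot = sum(a * b for a, b in zip(u, v))
print(abs(dot) < 1e-9)  # True: the frame is orthonormal
```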
2.4 CEP Rule generation
Recent advances in technology have enabled the generation of vast amounts of data in a wide range of fields. This data is created continuously, in large quantities over time, as data streams. Complex Event Processing (CEP) can be used to analyze and process these large data streams to identify interesting situations and respond to them as quickly as possible.
Complex event processors are used in almost every domain: vehicular traffic analysis, network monitoring, sensor data analysis [7], stock market trend analysis [11], and fraud
detection [50]. Any system that requires real-time monitoring can use a complex event processor.
In CEP, processing takes place according to user-defined rules, which specify the relations between the observed events and the actions required by the user. For example, in a network monitoring system a complex event processor can be used to notify the system admin about excessive internet usage by a user on that network. An example rule would look like the following,
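A rule of this kind might be sketched in a SiddhiQL-like syntax; the stream and attribute names (NetworkUsageStream, bandwidth, userId) and the threshold are illustrative assumptions, not a listing from a real system:

```sql
-- Hypothetical SiddhiQL-style rule (names are illustrative only):
-- when a usage event exceeds the bandwidth limit, emit a notification.
from NetworkUsageStream[bandwidth > 500]
select userId, bandwidth
insert into AdminNotificationStream;
```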
Where if a user’s bandwidth exceeds the limit, the admin will receive a notification. The
value of the ”limit” in this example should be low enough to catch high usage as well as it
should be high enough to ignore normal users.
Any complex event processing rule will have a condition to check, and an action associated
with that condition. So regardless of the domain, any system using a CEP heavily depends
on the rules defined by the user.
In current complex event processing applications, users need to manually specify the rules that are used to identify and act on important patterns in the event streams. This is a complex and arduous task: it is time consuming, involves a lot of trial and error, and typically requires domain-specific knowledge that is hard to identify accurately.
Rule writing is therefore typically done by domain experts who study the parameters available in the event streams, manually or using external data analysis tools, to identify the events that need special handling. Needless to say, incorrect estimation of the relevant parameters in the rules negatively impacts the utility of the systems that depend on accurate processing of these events. Even for domain experts, manually specifying textual rules in a CEP-specific rule language is not a user-friendly experience. Moreover, keeping a rule working as data and behavior change may require periodic updates that demand the same effort as writing the rule in the first place.
Several approaches [41], [63], [44] have been proposed to overcome these difficulties, using data mining and knowledge discovery techniques to generate rules based on available data. These give users the ability to automatically generate rules based on their requirements. Two such approaches can help in generating CEP rules. One is a framework that learns, from historical traces, the hidden causality between the received events and the situations to detect, and uses it to automatically generate CEP rules [41]. The other starts from a skeleton of the rule and uses historical traces to tune the parameters of the final rule [63].
iCEP
iCEP [41] analyzes historical traces and learns from them. It adopts a highly modular design,
with different components considering different aspects of the rule.
The following terminology and definitions are used in the framework. Each event notification is assumed to be characterized by a type and a set of attributes. The event type defines the number, order, names, and types of the attributes that compose the event itself. It is also assumed that events occur instantaneously at some point in time; accordingly, each notification includes a timestamp, which represents the time of occurrence of the event it encodes. The authors use the following example event of type Temp.
Temp@10(room=123, value=24.5)
This event encodes the fact that the air temperature measured inside room 123 at time 10 was 24.5 °C.
Another aspect of the terminology used by the authors is the distinction between primitive and composite events. Simple events like the one given above are considered primitive events. A composite event is defined using a pattern of primitive events; when such a pattern is identified, the CEP engine derives that a composite event has occurred and notifies the interested components. An event trace that ends with the occurrence of the composite event is called a positive event trace.
The iCEP framework uses the following basic building blocks, common to most CEP systems, to generate filters for events.
• Selection: filters relevant event notifications according to the values of their attributes.
• Conjunction: combines event notifications together.
• Parameterization: introduces constraints involving the values carried by different events.
• Sequence: introduces ordering relations among events.
• Window: defines the maximum timeframe of a pattern.
• Aggregation: introduces constraints involving aggregated values.
iCEP uses a set of modules that generate a combination of the above building blocks to form CEP rules. The framework uses a training data set created from historical traces to generate rules via a supervised learning technique.
The learning method rests on the following consideration. Consider the following positive event trace:
1 : A@0, B@2, C@3
This implies the following set of constraints, S1:
- A: an event of type A must occur
- B: an event of type B must occur
- C: an event of type C must occur
- AB: the event of type A must occur before that of type B
- AC: the event of type A must occur before that of type C
- BC: the event of type B must occur before that of type C
We can assert that, for each rule r and event trace τ, r fires on τ if and only if Sr ⊆ S(τ), where Sr is the complete set of constraints that must be satisfied for the rule to fire and S(τ) is the constraint set implied by τ, as above.
Using this observation, the problem of rule generation can be expressed as the problem of identifying Sr. Given a single positive trace τ, S(τ) can be considered an over-constraining approximation of Sr. To produce a better approximation of Sr, we can consider the set of all positive traces collectively and take the conjunction (intersection) of all the constraint sets they generate.
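This intersection intuition can be illustrated with a small sketch (a simplified illustration, not the iCEP implementation; the encoding of a trace as (event type, timestamp) pairs is an assumption):

```python
from itertools import combinations

def constraints(trace):
    """Derive the naive constraint set S(tau) for one positive trace.

    A trace is a list of (event_type, timestamp) pairs; constraints are
    'X' (an event of type X must occur) and 'X<Y' (X occurs before Y).
    """
    first_seen = {}
    for event_type, timestamp in trace:
        first_seen.setdefault(event_type, timestamp)
    s = set(first_seen)  # presence constraints
    for a, b in combinations(sorted(first_seen), 2):
        if first_seen[a] < first_seen[b]:
            s.add(f"{a}<{b}")
        elif first_seen[b] < first_seen[a]:
            s.add(f"{b}<{a}")
    return s

def learn_rule(positive_traces):
    """Approximate Sr as the intersection of S(tau) over all positive traces."""
    return set.intersection(*(constraints(t) for t in positive_traces))
```

Given the trace A@0, B@2, C@3 together with a second trace in which C precedes B, the ordering constraint B<C drops out of the intersection, while A, B, C, A<B, and A<C survive.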
Using these intuitions, the iCEP framework takes the following steps to generate rules.
1. Determine the relevant timeframe to consider (window size)
2. Identify the relevant event types and attributes
3. Determine the selection and parameter constraints
4. Discover ordering constraints (sequences)
5. Identify aggregate and negation constraints.
The final structure of the framework is shown in Figure 2.39. The problem is broken down into subproblems and solved by different modules (described below) that work together.
• Event Learner: The event learner tries to determine which primitive event types are required for the composite event to occur. It takes the window size as an optional input parameter and cuts each positive trace so that it ends with the occurrence of the composite event. For each positive trace, the event learner extracts the set of event types the trace contains; then, following the general intuition described above, it computes and outputs the intersection of all these sets.
Figure 2.39: Structure of the iCEP framework
• Window Learner: The window learner is responsible for learning the size of the window that includes all primitive events required for a composite event. If the required event types are known, the window learner tries to identify a window size that ensures all required primitive events are present in all positive traces. If the required event types are not known, the window learner and event learner use an iterative approach in which increasing window sizes are fed to the event learner until the required rule accuracy is reached.
• Constraint Learner: This module receives the filtered event traces from the above two modules and tries to identify possible constraints on the parameters. For each parameter it first looks for an equality constraint, where all positive traces contain a single value; failing that, it generates an inequality constraint that accepts values between the minimum and maximum values observed across all positive traces.
• Aggregate Learner: As shown in Figure 2.39, the aggregate learner runs in parallel with the constraint learner. Instead of looking at single values, it applies aggregation functions such as sum and average over all events of a given type within the time window, and generates constraints on the aggregated values.
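The constraint learner's equality-or-range strategy, and the aggregate learner's variant of it, can be sketched as follows (a simplified illustration; the function names are ours, not iCEP's):

```python
def learn_attribute_constraint(values):
    """Return an equality constraint when every positive trace carries
    the same value, otherwise a [min, max] range constraint (simplified)."""
    distinct = set(values)
    if len(distinct) == 1:
        return ("==", distinct.pop())
    return ("between", min(values), max(values))

def learn_aggregate_constraint(windows, aggregate=sum):
    """Aggregate-learner analogue: constrain an aggregate (e.g. the sum)
    computed over each window's values instead of over single values."""
    return learn_attribute_constraint([aggregate(w) for w in windows])
```

For example, attribute values (5, 5, 5) across the positive traces yield the equality constraint ('==', 5), while (3, 7, 5) yield the range constraint ('between', 3, 7).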
The other modules in the framework use the same methods to identify the remaining aspects of the rule. The effectiveness of the framework has been assessed using the following steps.
1. Use an existing rule created by a domain expert that identifies a set of composite events in a data stream, and collect the positive traces.
2. Use iCEP with the data collected in the above step to generate a rule.
3. Run the data again through the CEP engine with the generated rule and capture the composite events triggered.
4. Compare the two versions and calculate precision and recall.
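Step 4 amounts to a set comparison between the composite events fired by the expert rule and those fired by the generated rule; a minimal sketch (representing events by hashable identifiers is an assumption):

```python
def precision_recall(expert_events, generated_events):
    """Compare composite events fired by the expert rule (taken as ground
    truth) with those fired by the generated rule, as sets of identifiers."""
    expert, generated = set(expert_events), set(generated_events)
    true_positives = len(expert & generated)
    precision = true_positives / len(generated) if generated else 0.0
    recall = true_positives / len(expert) if expert else 0.0
    return precision, recall
```

For instance, if the expert rule fires on events {1, 2, 3, 4} and the generated rule on {2, 3, 4, 5}, both precision and recall are 0.75.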
The results have been promising, with precision around 94% in some of the tests run by the authors. But the system is far from perfect, and the following are some of the challenges that need to be overcome.
1. A large training data set with many positive traces is required to generate good rules with high precision. The training methodology considers only the conjunction of all the positive traces, so without a large number of positive traces covering the variation in the data, generating accurate rules is difficult.
2. High computational requirements. The iterative approach used by the window learner and event learner translates into a large amount of computation, so without hints from a domain expert on the window size or the required events and parameters, the runtime and computational cost increase rapidly.
3. The generated rules require tuning and cleanup by the user. Because the rules are generated automatically, the constraints may be over-constraining or may fail under previously unseen conditions, so a final cleanup by the user is required.
Tuning rule parameters using the Prediction-Correction Paradigm
A mechanism has been proposed by Yulia Turchin to automate both the initial definition of rules and their update over time [63]. It consists of two main repetitive stages: rule parameter prediction and rule parameter correction. Parameter prediction updates the parameters using available expert knowledge about how the parameters are expected to change. Rule parameter correction uses expert feedback about the actual past occurrence of events, together with the events materialized by the CEP framework, to tune the rule parameters.
For example, in an intrusion detection system [4], a domain expert can specify a rule as follows: if the size of a packet received from a user deviates strongly from the normal packet size, with estimated mean m1 and standard deviation σ1, infer an event E1
Figure 2.40: Prediction Correction Paradigm
representing the anomaly level of the packet size. It is hard to determine the values for m1 and σ1; moreover, the specified values can change over time due to the dynamic nature of network traffic.
Rule parameter determination and tuning can be done as follows: given a set of rules, provide an initial value for each rule parameter and then modify it as required. For example, for a given rule, the rule tuning algorithm might suggest replacing the value m1 with a value m2 such that m2 < m1. The initial prediction of m1 can be treated as a special case of tuning, in which an arbitrary value is corrected to m1 by the tuning algorithm. The tuning algorithm should be tied to the system's ability to correctly predict events, so that it can recognize, for instance, that the parameter m1 is too high and many intrusions went undetected, and that m1 therefore needs to be reduced to m2.
The proposed framework is based on the Kalman estimator, a simple type of supervised, Bayesian, predict-correct estimator [18]. As shown in Figure 2.40, the framework learns and updates the system state in two stages: rule parameter prediction and rule parameter update. Rule parameter prediction is unsupervised: parameters are updated without any user feedback, relying on preexisting knowledge of how the parameters might change over time and on the events created by the inference algorithm. In the rule parameter update stage, the parameters are tuned in a supervised manner using domain experts' feedback and recently generated events. User feedback can take two forms: direct feedback involves changes to the system state, while indirect feedback provides an assessment of the correctness of the estimated event history.
Model
The model of this method consists of events, rules, and the system state. Here an event is a significant (of interest to the system) actual occurrence in the system; examples include notifications of login attempts and failures of IT components. We can therefore define an event history h as the set of all events of interest to the system, together with their associated data. An event notification is an estimate that an event occurred: some events may not be notified, and some non-occurring events may be notified because of faulty equipment. We can therefore define an estimated event history ĥ of notified events of interest to the system. Events can be of two types: explicit events and inferred events.
is an explicit event. Inferred events are the events materialized by the system based on other
events, for example an illegal connection attempt event is an inferred event materialized
by the network security system, based on the explicit event of a new network connection,
and an inferred event of unsuccessful user authorization. Inferred events, just like explicit
events, belong to event histories. Inferred events that actually occurred in the real world belong to the event history h, while those that are only estimated to have occurred belong to the estimated event history ĥ.
Events can be inferred by rules. A rule can be represented by a quadruple r = ⟨sr, pr, ar, mr⟩. sr is a selection function that filters the events relevant to rule r; its input is an event history h, and the events it selects are said to be relevant events. pr is a predicate, defined over the filtered event history, that determines when events become candidates for materialization. ar is an association function that defines how many events should be materialized, as well as which subsets of selectable events are associated with each materialized event. mr is a mapping function that determines the attribute values of the materialized events.
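The quadruple can be sketched directly as data (a toy illustration only; the callable signatures are assumptions, since the paper defines the functions abstractly):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    """r = <sr, pr, ar, mr>; field semantics follow the text, but the
    callable signatures are assumptions made for this illustration."""
    sr: Callable  # selection: filter the events relevant to the rule
    pr: Callable  # predicate: when do events become materialization candidates
    ar: Callable  # association: which event subsets yield a materialized event
    mr: Callable  # mapping: attribute values of each materialized event

    def infer(self, h):
        """Apply the rule to an event history h (a list of attribute dicts)."""
        relevant = self.sr(h)
        if not self.pr(relevant):
            return []
        return [self.mr(group) for group in self.ar(relevant)]

# A toy rule that materializes one 'alert' event per oversized packet:
rule = Rule(
    sr=lambda h: [e for e in h if e["type"] == "packet"],
    pr=lambda h: any(e["size"] > 1000 for e in h),
    ar=lambda h: [[e] for e in h if e["size"] > 1000],
    mr=lambda group: {"type": "alert", "size": group[0]["size"]},
)
```

Running `rule.infer` over a history with packets of size 500 and 1500 materializes a single alert for the 1500-byte packet.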
System State
It is expected that an expert can provide the form of sr, pr, ar, and mr, but providing accurate values is difficult. These values are called rule parameters, and the set of all parameters is called the system state. The system state is updated by the framework as shown in Figure 2.40. In the predict stage, parameters are updated using knowledge of how the rule might change over time and the updated event history ĥ. In the update stage, parameters are updated by direct feedback, in which exact rule parameter values are given, or in an indirect manner, in which events in the estimated event history ĥ are marked according to whether they actually occurred.
Rule Tuning Mechanism
To tune rule parameters, this framework uses the discrete Kalman filter technique. The filter estimates the process state at some time and then obtains feedback in the form of (noisy) measurements.
The rule tuning model consists of two recursive equations: a time equation, which describes how parameters change over time, and a history equation, which describes the outcome of a set of rules and their parameters. The time equation is a function of the previous system state (the set of rule parameters) and the actual event history of that time period; its output is the current system state. The history equation is a function of the current rule parameters, the set of explicit events during that time period, and the actual event history of the previous time period; its output is the actual event history. Since the current system state is not known, an estimated event history equation is used instead; it differs from the original history equation in using the estimated current system state (the estimated current rule parameters), and its output is the estimated current event history. This can be used to evaluate the performance of the inference mechanism: the estimated event history received from the inference mechanism is compared with the actual event history provided by expert feedback at the end of time interval k, from which we can measure precision and recall. Precision is the percentage of correctly inferred events relative to the total number of events inferred in this time interval. Recall is the percentage of correctly inferred events (i.e., true positives) relative to the actual total number of events that occurred in this time interval.
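A scalar predict-correct cycle in the spirit of the discrete Kalman filter can be sketched for a single rule parameter, such as the packet-size mean m1 (the noise variances q and r below are assumed values, not taken from the paper):

```python
def tune_parameter(m_est, p_est, feedback, q=0.01, r=1.0):
    """One predict-correct cycle for a scalar rule parameter.

    m_est, p_est -- current parameter estimate and its variance
    feedback     -- a noisy measurement of the parameter derived from
                    expert-labelled events
    q, r         -- process and measurement noise variances (assumed)
    """
    # Predict: the parameter is assumed locally constant, so only the
    # uncertainty grows by the process noise.
    m_pred, p_pred = m_est, p_est + q
    # Correct: blend the prediction with the feedback using the Kalman gain.
    k = p_pred / (p_pred + r)
    m_new = m_pred + k * (feedback - m_pred)
    p_new = (1.0 - k) * p_pred
    return m_new, p_new

# Repeated feedback around 1200 pulls an initial guess of 1000 toward it:
m, p = 1000.0, 100.0
for _ in range(20):
    m, p = tune_parameter(m, p, 1200.0)
```

Each cycle first inflates the uncertainty (predict) and then blends the prediction with expert-derived feedback in proportion to the Kalman gain (correct), so repeated feedback gradually pulls the parameter toward its true value.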
Figure 2.41: An overview of the rule tuning method
The Rule Tuning Method consists of a repetitive sequence of actions that should be
performed for correct evaluation and dynamic update of rule parameters. The sequence is
illustrated in Figure 2.41.
The above is a generic model for automating rule parameter tuning in CEP systems, and it serves as a proof of concept for automatic rule parameter tuning where doing so manually becomes a cognitive challenge. However, because the model is generic, an actual implementation will require considerable work and tailoring to the specific requirement (such as the intrusion detection example mentioned here). Given the promising results of the empirical study, the model can nevertheless serve as a theoretical basis for any such work.
Chapter 3
Solution
3.1 Overview
We implemented our tool as a web application. As shown in Figure 3.1, the implementation consists of two main components: a client-side component and a server-side component. The server-side component performs most of the computationally intensive work, which is the main reason we chose a client-server web architecture over a standalone application: the server-side component can be deployed on a high-performance server, while the user accesses the tool as a web application through a web browser without requiring a high-end machine.
Our solution consists of a Django web application [12]. We mainly considered the Django and Shiny [53] web frameworks, because we intended to use a Python- or R-based development environment: both languages offer many libraries for data mining and machine learning. One of the main reasons we dropped Shiny was its lack of documentation; Shiny is a relatively new web framework and not as mature as Django. Further, we found a technique for executing R code within a Python environment, so we chose Django, which lets us use both Python and R libraries.
As for the complex event processing engine, we considered both Siddhi and Esper. Initially we planned to implement query generation for both engines, but due to time constraints we implemented it only for Siddhi CEP. We plan to add support for the Esper query language in the future.
Since we are using a web browser as our front-end we had to narrow down our data
Figure 3.1: Architecture of the implementation of Vivarana
visualization library to one that supports JavaScript, CSS, and HTML. From the libraries we considered, such as GGobi, flot.js, plotly, D3, and Tableau, we chose D3 because it is written entirely in JavaScript and has built-in support for all the functionality we planned to implement on top of the basic parallel coordinates implementation we used as a starting point.
We selected parallel coordinates as our main multidimensional data visualization technique. The reasons for selecting parallel coordinates, and how we modified it to enhance interactivity, are described in Section 3.2.
In our implementation we mainly focused on interactively generating CEP rules for web server log data. To support this, we implemented an Apache web log parser module as an extension. This log parser handles all the preprocessing steps for a web log data set specified by the user, and the preprocessed data it returns is used by the other components. To generate rules for a different type of data, one can write an extension that preprocesses that data and implements the API we have defined, so that the preprocessed data returned by the extension is usable by the other components we have implemented. Apart from the Apache web log parser, we have also implemented a parser for comma-separated files.
Furthermore, clustering and anomaly detection components are implemented to help the user identify interesting patterns. The clustering algorithms used and the implementation details are described in Section 3.3.
The aggregation component performs aggregation operations. Complex event processors support sum, average, count, and maximum aggregations, along with group-by and having conditions in queries. Through this component we allow users to apply these operations to the data over a moving window. The implementation details of this component are elaborated in Section 3.3.
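The moving-window behaviour of the aggregation component can be sketched as follows (a simplified illustration with a length-based window; the field name bandwidth is an assumption):

```python
from collections import deque

def windowed_aggregates(events, window, key):
    """Sliding, length-`window` aggregation over an event stream,
    mirroring the sum/average/count/max operations a CEP engine exposes."""
    buffer = deque(maxlen=window)  # oldest value drops out automatically
    results = []
    for event in events:
        buffer.append(event[key])
        results.append({
            "count": len(buffer),
            "sum": sum(buffer),
            "avg": sum(buffer) / len(buffer),
            "max": max(buffer),
        })
    return results

rows = [{"bandwidth": b} for b in (10, 20, 60)]
aggregates = windowed_aggregates(rows, window=2, key="bandwidth")
# The final window holds 20 and 60: sum 80, avg 40.0, max 60.
```

A time-based window would replace the fixed-length deque with eviction by timestamp, but the per-event aggregation step is the same.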
3.2 Visualization - Parallel Coordinates
Data commonly associated with CEP engines is very large and has many dimensions. Displaying this information in a clear, intuitive, and interactive manner is a challenge that has been the focus of a large amount of research. As discussed in the literature review, many visualization techniques, such as scatter plot matrices [Scatterplot], parallel coordinates [Parcords], mosaic plots [MozPlots], and self-organizing maps [SOM], have been proposed over the years to tackle this challenge. After researching these methods we decided to focus on parallel coordinates for our implementation, since it fits most of our requirements as a visualization method for the kind of data we intend to use our tool with. Parallel coordinates, introduced by Inselberg and Dimsdale, is a popular technique for transforming multidimensional data into a 2D visualization. m-dimensional data items are represented as lines crossing m parallel vertical axes, where each axis corresponds to one dimension of the original data. Each element of the data set corresponds to a polyline joining its values on each of the axes; in this way, n data points can be represented by n polygonal lines. Parallel coordinates offers several advantages over other visualization techniques.
1. With parallel coordinates there is no need to project all the dimensions down to two, as with most other visualizations; all the dimensions can be represented in a single 2D image.

2. Thanks to the point-line duality of parallel coordinates, it is easy to observe relationships between dimensions. For example, two dimensions with a highly negative correlation can be identified by their data lines intersecting at a point between the two axes.
3. Parallel coordinates can handle more dimensions than most other visualizations. The number of dimensions is bounded only by the width of the screen, whereas techniques such as the scatter plot matrix become very large and unclear with many dimensions.

4. Parallel coordinates offers several techniques that make it easy for the user to interact with the visualization and identify patterns in the data set. These interaction techniques are discussed in the implementation details.
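The core parallel-coordinates mapping, an m-dimensional item drawn as a polyline over m vertical axes, can be sketched as follows (a minimal illustration of the geometry, not the D3 code we build on):

```python
def to_polyline(record, mins, maxs):
    """Map one m-dimensional record to the vertices of its polyline:
    axis i is drawn at x = i, and the record's value on that axis is
    normalised to [0, 1] using the column's min and max."""
    points = []
    for i, (value, lo, hi) in enumerate(zip(record, mins, maxs)):
        y = (value - lo) / (hi - lo) if hi != lo else 0.5
        points.append((i, y))
    return points

# Two 3-dimensional records become two polylines over three axes:
data = [(1.0, 200.0, 30.0), (3.0, 100.0, 90.0)]
mins = [min(col) for col in zip(*data)]
maxs = [max(col) for col in zip(*data)]
polylines = [to_polyline(rec, mins, maxs) for rec in data]
```

A renderer then only has to connect each record's vertices with straight (or bundled) line segments.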
Given these advantages, we decided to build our implementation on parallel coordinates. There is a basic D3-based implementation of parallel coordinates by Jason Davies that offers only two interaction techniques, 1D brushing and axis reordering.1 Using that implementation as our basis, we added further interaction techniques to give the user a better experience. The following sections describe these techniques, starting with the two the implementation already had, axis reordering and 1D brushing; the rest describe the techniques we added.
Figure 3.2: Basic Implementation of Parallel Coordinates
Axis Reordering
Axis reordering is an important feature that was already present in the parallel coordinates implementation we used, because one strength of parallel coordinates, as described before, is its effectiveness at visualizing relations between coordinate axes. The order of the axes
1Available at http://bl.ocks.org/jasondavies/1341281
clearly affects the patterns revealed by parallel coordinate plots. Many approaches have been suggested for arranging the order of the axes, some using a measure to score an ordering and others visualizing multiple orderings in a single display [21]. Several approaches based on a combination of the nonlinear correlation coefficient and the singular value decomposition algorithm have also been suggested [39]; with these, the first notable axis can be selected on a mathematical basis and all axes re-ordered according to the degree of similarity among them [39]. In our implementation we saw a few disadvantages in using a mathematical model to determine the order of the axes. Axis similarities would have to be computed before visualizing each data set, and since we deal with large data sets, this computation would take considerable time and affect the performance of the tool. More importantly, the aim of our tool is to let the user interact with the visualization to identify patterns in the data set. So rather than presenting a fixed axis order determined by a mathematical model, we allow the user to drag axes next to each other interactively; the user can then investigate how values are related with respect to two particular data dimensions, using his or her domain knowledge to hypothesize which axes are correlated and confirming it later with the help of the visualization.
Brushing
Brushing can be used to distinguish an area the user is interested in from the rest of the data points. The user actively marks subsets of the data set as especially interesting, and the points contained by the brush are colored differently from the other points to make them stand out. For example, if the user is interested in a certain region of a dimension, he can use brushing to highlight it. In Figure 3.3 the user is interested only in the POST method, so he has brushed that axis to distinguish POST data points from the rest of the data.

1D brushing in our implementation is not limited to a single axis. If the user is interested in an area involving two dimensions, he can use composite brushing, a combination of single brushes whose result is the conjunction of those brushes. As in Figure 3.5, if the user is interested in data having the POST method and more than 10,000 MB of bandwidth, he can use composite brushing to specify it.
As described earlier, the 1D brushing technique that was already in the implementation primarily acts along the axes.

Figure 3.3: Example 1D Brushing

Figure 3.4: Example Composite Brushing

With the 2D brushing technique that we added to the visualization, a user can also mark subsets of the data between the axes, marking a region of interest with respect to two axes.
Using brushing, the user can mark a subset of the data set and then generate complex event processing rules that identify it separately from the rest of the data, as discussed in Section 3.4 on rule generation.
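The idea of turning brushes into queries can be sketched as follows (a simplified illustration; the query template and the encoding of brush extents are assumptions, not Vivarana's exact generator):

```python
def brushes_to_siddhi(stream, extents):
    """Turn brush extents into a Siddhi-style filter query.

    `extents` maps a column either to a (low, high) numeric range or to
    a set of selected categorical values.
    """
    conditions = []
    for column, extent in extents.items():
        if isinstance(extent, tuple):  # numeric 1D brush
            low, high = extent
            conditions.append(f"{column} >= {low} and {column} <= {high}")
        else:                          # categorical selection
            alternatives = " or ".join(
                f"{column} == '{v}'" for v in sorted(extent))
            conditions.append(f"({alternatives})")
    condition = " and ".join(conditions)
    return f"from {stream}[{condition}]\nselect *\ninsert into FilteredStream;"

query = brushes_to_siddhi(
    "LogStream", {"method": {"POST"}, "bandwidth": (10000, 50000)})
```

Brushing the POST method and a bandwidth range thus yields a single filter query whose condition is the conjunction of the individual brushes.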
SlickGrid
One major change we made to the user interface of the base parallel coordinates implementation is the introduction of a table showing the details of the data contained in the plot. When the user performs brushing, the table is updated to show the brushed data, and when the user hovers over a record in the table, the corresponding polyline in the plot is highlighted.

Figure 3.5: Example Composite Brushing

The table helps overcome one weakness of parallel coordinates: once the data is displayed, it is difficult to read the exact value of a data point, especially in numeric columns. With the table, the user can easily look up the exact value of any data point. To build the table we used
the SlickGrid library. Whenever the user performs an update operation on the parallel coordinates, such as brushing, keeping, or removing (described later), the grid updates automatically with the relevant data. SlickGrid has its own advantages: it is capable of adaptive virtual scrolling, handling thousands of rows with extreme responsiveness, and it renders extremely fast. SlickGrid uses virtual rendering to work with hundreds of thousands of items without any drop in performance; there is effectively no difference between a grid with 10 rows and one with 100,000 rows, because only what is visible on the screen, plus a small buffer, is actually rendered. As the user scrolls, DOM nodes are continuously created and removed, and these operations are tuned to perform well in all browsers. The grid also adapts to the direction and speed of scrolling to minimize the number of rows that need to be swapped out, and it dynamically switches between synchronous and asynchronous rendering. SlickGrid also takes a flexible approach to updates: in the simplest scenario it accesses data through an array interface (indexing to reach an item at a given position and data.length to determine the number of items), but the API is structured so that it is easy to make the grid react to any change in the underlying data. This is important in our use case, since the user continuously performs brushing, keeping, and removing actions. With a large data set, SlickGrid also allows the user to specify a page size (the number of records per page) and then navigate through the pages
Figure 3.6: Slick Grid along with the Parallel Coordinates
easily.
One disadvantage of this JavaScript library is that it does not support the IE6 browser, but as IE6 is no longer common, we accepted that limitation.
Cluster Visualization
Clustering is a technique commonly used in analyzing large data sets to identify their
underlying structure: it groups together data items that are similar to each other. Our
implementation gives the user the ability to perform clustering on data sets, which is
discussed further in Section 3.3. Since clustering is an important operation, the resulting
clusters need to be visualized properly within the Parallel Coordinates view. There are
two methods to visualize clusters in Parallel Coordinates: cluster coloring and cluster
bundling. After the user has performed clustering on the data set, cluster coloring assigns
a unique color to each cluster so that the clusters can be distinguished clearly.
In cluster bundling, the lines belonging to one cluster are bundled together between axes.
Cluster bundling has two parameters: smoothness and bundling strength. Since suitable
values for these vary from use case to use case, our implementation lets the user adjust
both through sliders while observing the visualization. The advantage of cluster bundling
is that it frees the color channel for another purpose (e.g., statistical coloring).
Figure 3.7: Cluster Coloring
Figure 3.8: Cluster Bundling
Alpha Blending
As discussed in the literature survey, with a big data set the point-line duality in parallel
coordinates is no longer clearly visible, so the relationships between dimensions cannot be
observed. The proposed solution is alpha blending: each polyline is plotted with only alpha
percent opacity. With smaller alpha values, areas of high line density stand out and are
better contrasted against areas of low density. It is hard to settle on a single value for
alpha, so in our implementation the user can adjust the alpha value through a slider until
the visualization provides enough insight.
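The density-enhancing effect can also be reasoned about numerically. Assuming standard "over" compositing of identical strokes (an assumption for this sketch, not a description of the renderer), n overlapping lines drawn at opacity a accumulate to a perceived opacity of 1 - (1 - a)^n, so dense regions saturate toward full opacity while sparse regions stay faint:

```python
# Accumulated opacity of n overlapping strokes, each drawn with alpha `a`,
# under standard "over" compositing (an assumption for this sketch).
def accumulated_opacity(a, n):
    return 1.0 - (1.0 - a) ** n

# With a low alpha, a region crossed by 100 lines renders far more strongly
# than one crossed by 2 lines -- the density contrast alpha blending provides.
dense = accumulated_opacity(0.05, 100)
sparse = accumulated_opacity(0.05, 2)
```

This is why the slider matters: the useful range of a depends on how many lines typically overlap in the data set at hand.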
Figure 3.9: Parallel Coordinates without Alpha Blending
Figure 3.10: Parallel Coordinates with Alpha Blending
For example, in Figure 3.9 it is not easy to tell whether there is a relationship between
the cluster ID and the ID. With the alpha blending applied in Figure 3.10, it is easy to
see that there is no actual relationship between the two.
Other techniques
There are some other techniques used in our implementation to give a better user
experience. We have added axis removal, axis flipping, keep/exclude actions and statistical
coloring of data on top of the basic implementation. If the user feels an axis does not
contribute to the visualization, it can be removed; if flipping an axis would reveal a
pattern more clearly, it can be flipped. Furthermore, if the user wants to examine only a
subset of the data set, that subset can be selected with brushing and kept for closer
examination with the keep action; the exclude action works the same way in reverse.
Figure 3.11: Parallel Coordinates with Statistical Coloring
Using statistical coloring it is easy to pick out the outliers within one dimension. Data
points in the top 2.5% of the z-score distribution are drawn in one distinct color, and
data points in the bottom 2.5% in another. In Figure 3.11 the data has been statistically
colored on the size attribute.
With the above techniques the user can interact with the visualization to identify hidden
patterns in the data set, which makes CEP rule generation easier.
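The statistical-coloring rule described above can be sketched with Pandas; the cutoff z = 1.96 corresponds to the top and bottom 2.5% tails of a normal distribution, and the color-class names here are illustrative, not the tool's actual values:

```python
import pandas as pd

# Sketch of statistical coloring: flag points whose z-score falls in the
# top or bottom 2.5% tail (|z| > 1.96 under a normal assumption).
def statistical_color(series, z=1.96):
    zscores = (series - series.mean()) / series.std()
    colors = pd.Series("default", index=series.index)
    colors[zscores > z] = "high-outlier"
    colors[zscores < -z] = "low-outlier"
    return colors
```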
3.3 Other functionalities
Parsing collected datasets
In addition to supporting simple CSV files, we provide support for visualizing and
analyzing Apache web server log files. An Apache log file consists of one string per log
entry, and each string needs to be broken down into the data fields it contains. For this
we use an open-source log parser2 which outputs a simple dictionary for each entry in the
log file. This output is collected and converted into a Pandas DataFrame object, which we
can manipulate easily. After breaking a log entry into its separate data elements using
the parser, we perform conversions such as converting the timestamp to DateTime format to
ease later manipulation. In addition, the request line in the log file is broken down into
its constituent elements: method, URL and protocol.
2Available at : https://code.google.com/p/apachelog/
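The parsing step can be sketched as follows; the regex is a minimal hand-written stand-in for the Apache common log format, not the apachelog library's actual implementation:

```python
import re
import pandas as pd

# Minimal regex for the Apache common log format (illustrative only; the
# tool uses the open-source apachelog parser instead).
LOG_RE = re.compile(
    r'(?P<remote_host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_line(line):
    entry = LOG_RE.match(line).groupdict()
    # Break the request line into its constituent elements
    entry["method"], entry["url"], entry["protocol"] = entry.pop("request").split()
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

lines = ['127.0.0.1 - - [10/Oct/2015:13:55:36 +0530] "GET /index.html HTTP/1.1" 200 2326']
df = pd.DataFrame([parse_line(l) for l in lines])
# Convert the timestamp to DateTime format to ease later manipulation
df["time"] = pd.to_datetime(df["time"], format="%d/%b/%Y:%H:%M:%S %z")
```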
We have also provided the ability to easily extend the support for additional file formats.
This can be done by defining a file extension for the new file type and providing a module
that can convert the input file in to a Pandas Dataframe.
Aggregation
The CEP query model allows defining a window: a limited subset of events from an event
stream. WSO2 CEP supports seven types of windows; in Vivarana we have implemented support
for two of them, the time window and the length window. It should be noted that these are
moving windows.
Figure 3.12: Specifying the size of either Time or Event window
The user can use one type of window at a time and perform sum, average, maximum, minimum
and count aggregation operations over the specified window. Furthermore, the user can
group by a certain attribute and then perform the aggregation operations per group.
As an example, consider a user analyzing a web server log who wants to know the total
bandwidth usage of each user over the last 15 minutes. He can first specify the 15 minutes
in the aggregation menu, then select group by from the remote host attribute's context
menu and sum from the size attribute's context menu.
We implemented the aggregation operations using the Python Pandas library. Once an
aggregation operation is performed, the result is cached, because aggregation becomes an
expensive operation as the size of the data grows. After calculating the aggregated
values, the parallel coordinates visualization is updated so that the user can interact
with it to identify useful patterns.
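The 15-minute bandwidth example can be sketched with Pandas; the column names follow the Apache log example, and the tiny inline DataFrame stands in for a parsed log with a DateTime index (the tool's actual implementation may differ):

```python
import pandas as pd

# Illustrative parsed-log fragment with a DateTime index.
df = pd.DataFrame(
    {"remote_host": ["a", "a", "b", "a"], "size": [100, 200, 50, 300]},
    index=pd.to_datetime(
        ["2015-01-01 10:00", "2015-01-01 10:10",
         "2015-01-01 10:12", "2015-01-01 10:20"]
    ),
)

# Total bandwidth per remote host over a moving 15-minute time window:
# group by host, then apply a time-offset rolling sum.
windowed = (
    df.groupby("remote_host")["size"]
      .rolling("15min")
      .sum()
)
```

The result is indexed by (remote_host, timestamp); each entry is the sum of that host's sizes within the preceding 15 minutes.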
(a) Before performing any aggregation operation
(b) After performing the minimum aggregation operation with a 15-minute time window on the Size attribute.
Figure 3.13: State of visualization after performing an aggregation operation
Clustering
Clustering is one of the major data-mining techniques we have integrated into Vivarana.
Initially we implemented support for three clustering algorithms. K-modes clustering[26]
was introduced mainly to cluster data with non-numeric content; it uses the
simple-matching distance to find the dissimilarity between two objects and partitions the
given objects into k groups. The second algorithm we implemented was fuzzy analysis
clustering[32], which uses the Euclidean distance between observations; in contrast to
k-modes clustering, fuzzy clustering produces clusters whose boundaries are not crisp. The
third method was hierarchical clustering[43], which uses Gower's coefficient[18] as the
distance measure between observations.
The major disadvantage of the aforementioned algorithms is that they all require the
number of clusters as an initial user input. Since clustering is an unsupervised learning
technique, the user might not know the optimal number of clusters beforehand. Hence, we
decided to present a dendrogram when hierarchical clustering is used. Further, we
implemented clustering in a way that lets the user identify clusters within clusters, as
depicted in Figure 3.14. It should be noted that we used R libraries to implement the
clustering support.
(a) Initial clustering using hierarchical clustering
(b) Selecting cluster number 3 and performing clustering within that cluster
(c) After performing hierarchical clustering within initial cluster number 3
Figure 3.14: Performing clustering within clusters in a web server log data set
Both hierarchical and fuzzy clustering are implemented through the cluster library[8], and
k-modes clustering through the klaR library[35].
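The simple-matching distance at the heart of k-modes is easy to illustrate. The following is a minimal, naively initialised pure-Python sketch of the algorithm; the tool itself delegates to the R klaR implementation, not this code:

```python
from collections import Counter

# Simple-matching distance: the number of attributes on which two
# categorical records disagree.
def matching_distance(a, b):
    return sum(x != y for x, y in zip(a, b))

# Minimal k-modes sketch: assign each record to its nearest mode, then
# recompute each mode as the most frequent value per attribute.
def kmodes(records, k, iterations=10):
    modes = [list(r) for r in records[:k]]   # naive initialisation
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for r in records:
            best = min(range(k), key=lambda i: matching_distance(r, modes[i]))
            clusters[best].append(r)
        for i, cluster in enumerate(clusters):
            if cluster:
                modes[i] = [Counter(col).most_common(1)[0][0]
                            for col in zip(*cluster)]
    return modes, clusters
```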
3.4 Rule Generation
As mentioned earlier, there are existing solutions that automatically generate CEP rules,
but most of them do not follow an interactive process in which the user can adjust
parameters and aid in the rule generation; instead they try to generate every aspect of
the rule automatically. Considering the interactive nature of our solution, we used a
simple method based on recursive partitioning and regression trees (CART)[38] to generate
CEP rules while taking into account the inputs the user has provided through the
interactive visualization.
Classification and Regression Trees(CART)
Classification and Regression Trees, also known as CART trees, are a popular and widely
used classification method. The method consists of training a classification model that
partitions a data set into categories based on conditional decision rules. For example,
consider the CART tree in Figure 3.15. It shows a classification tree built on the Iris
flower data set, which features various attributes of flowers and their species. The
decision tree tries to classify the flowers into their species based on those attributes.
By starting at the root and traversing the tree according to the parameters of a data
entry, we can get a prediction for the class it belongs to.
Figure 3.15: A decision tree to classify the Iris data set.
We used the 'rpart' package available with R to generate decision trees that classify the
set of events selected by the user against the other events in the data set. That is, we
create two categories of events based on whether or not the user marked them as important.
To create a CART tree, the training data is recursively partitioned into subsets based on
a single-parameter condition at each partition. The parameter is selected using an
impurity index; with the 'rpart' package we use the Gini impurity index, which reaches
zero when only one class is present in a partition. At each partition the Gini index is
calculated for all possible splits to identify the split that yields partitions with the
least impurity. This recursive partitioning continues until no new partition of the
specified minimum size can be generated, or all partitions contain events belonging to a
single category.
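The impurity measure can be illustrated with a small generic function (a sketch of the formula, not rpart's internal code): for class proportions p_i, the Gini index of a partition is 1 - Σ p_i².

```python
# Gini impurity: 1 minus the sum of squared class proportions. Zero when
# the partition is pure (one class), maximal for an even class mix.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))
```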
Left unchecked, this leads to trees that overfit the training data set and reduce the
value of the classification. To prevent overfitting and produce better decision trees, the
tree generated in the previous step is pruned to limit the number of splits. The CART
implementation in the rpart library handles this with a complexity parameter used together
with cross-validation to produce the best possible tree: every split that does not improve
the classification by at least the amount specified by the complexity parameter is removed
from the tree. The complexity of the final tree can thus be controlled through this
parameter.
Figure 3.16: A decision tree to classify the Iris data set. Paths to follow to get to a Virginica flower are highlighted.
Generating rules based on the decision tree
After the rpart library generates the tree, we traverse it to generate rules that filter
out the positive events. The path to each relevant leaf node is identified, and the paths
are then merged to produce the final rule: the disjunction of the rules generated by each
path.
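The traversal can be sketched as follows. The nested-dict tree representation here is our own simplification for illustration, not the structure rpart actually returns; for each leaf of the target class we collect the (variable, operator, threshold) conditions along the root-to-leaf path:

```python
# Collect root-to-leaf condition paths ending in the target class; the
# final rule is the disjunction (OR) of these paths.
def leaf_paths(node, target, path=()):
    if "leaf" in node:
        return [list(path)] if node["leaf"] == target else []
    left = leaf_paths(node["left"], target,
                      path + ((node["var"], "<", node["split"]),))
    right = leaf_paths(node["right"], target,
                       path + ((node["var"], ">=", node["split"]),))
    return left + right
```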
For example, consider the classification in Figure 3.16. If we want to select flowers
belonging to 'Virginica', the paths we can follow are highlighted in the figure. By
traversing the tree we can conclude that for a flower to belong to 'Virginica' it either
needs a 'Petal Length' between 2.5 and 4.9, or a 'Petal Length' greater than 4.9 together
with a 'Petal Width' less than 1.6. The rule we can generate to filter out 'Virginica'
flowers from the rest then becomes:
IF('Petal Length' >= 2.5 AND 'Petal Length' < 4.9)
OR
IF('Petal Length' >= 2.5 AND 'Petal Length' > 4.9 AND 'Petal Width' < 1.6)
=> OUTPUT 'Virginica'
So a rule can be generated for each leaf node by traversing the tree from the root to the
leaf and combining the constraints at each split. One issue with the rpart library is that
it only allows binary splits, so the same variable may appear in the splitting criterion
at multiple levels of the classification tree. To generate better and more user-friendly
CEP queries, it is therefore desirable to merge all such partial constraints into a single
constraint.
Consider again the classification in Figure 3.16. The second path to a 'Virginica' leaf
node gives the following list of conditions:
'Petal Length' >= 2.5
'Petal Length' >= 4.9
'Petal Width' < 1.6
The two conditions on Petal Length can be merged into the single condition
'Petal Length' >= 4.9. We perform this type of merge on every set of conditions obtained
by following the paths to the leaf nodes of interest: our system goes through each
constraint list and merges together all the constraints on a single variable. After
merging we obtain a better, more user-friendly rule:
IF(’Petal Length’ >= 2.5 AND ’Petal Length’ < 4.9)
OR
IF(’Petal Length’ > 4.9 AND ’Petal Width’ < 1.6)
=> OUTPUT 'Virginica'
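The merging step can be sketched as keeping, per variable, only the tightest lower and upper bound; the (variable, operator, value) condition format is illustrative:

```python
# Merge constraints on the same variable: keep the largest ">=" bound and
# the smallest "<" bound, so "x >= 2.5 AND x >= 4.9" collapses to "x >= 4.9".
def merge_constraints(conds):
    lower, upper = {}, {}
    for var, op, val in conds:
        if op == ">=":
            lower[var] = max(lower.get(var, float("-inf")), val)
        elif op == "<":
            upper[var] = min(upper.get(var, float("inf")), val)
    merged = []
    for var in sorted(set(lower) | set(upper)):
        if var in lower:
            merged.append((var, ">=", lower[var]))
        if var in upper:
            merged.append((var, "<", upper[var]))
    return merged
```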
In addition to the results derived from the classification tree, we need to consider the
interactions performed by the user. As mentioned in an earlier section of this report, our
tool lets the user apply aggregate functions that introduce new parameters into the data
set, possibly depending on windows the user has defined. These values are not available in
the data stream that the final rule will run on, so they need to be calculated by the CEP
engine, which requires the necessary syntax to be included in the final rule. To support
this functionality we keep track of and store the interactions performed by the user, such
as the time windows applied and the new data attributes created.
After the conditions are derived from the classification tree, we check each parameter to
see whether it was created by the user or was available in the original data stream. If
the user created the parameter, we augment the CEP query with the syntax for the aggregate
function that created it.
The final step in the rule generation process is the translation of these constraints into
the CEP query format. We combine the constraints from the classification tree with the
information needed to generate the user-created functions, along with the time windows, to
produce a CEP query. Our current implementation supports generating queries in the Siddhi
query language. To extend the application to support another query language, two syntax
elements need to be provided.
(a) User selects the type of events he wants to detect
(b) CEP query is generated for his selection and false positives are highlighted.
Figure 3.17: Rule generation process
1. The syntax used to perform aggregate functions. We currently support sum, average and
count as aggregate functions, so there must be a mapping to the rule syntax for these
functions that lets the CEP engine run them in real time to generate the values the rule
needs.
2. A translation from generic constraints to the query syntax required by the user. The
constraints on the parameters are generated in a generic format (Equality, LessThan,
MoreThan, etc.), so a mapping from these generic constraints to a specific CEP query
format yields a legitimate CEP query.
Providing the application with these two elements easily extends the language support.
After a rule is generated, we need to assess its quality. We provide the user with two
ways of examining the quality of a rule.
1. Applying the rule to the data set and comparing the rule's classification with the
user's selection to calculate the accuracy and precision of the rule. We apply the rule to
the Pandas DataFrame and generate a confusion matrix between the events selected by the
user and the events filtered out by the generated rule. The user can use these measures to
check whether the rule has the qualities he/she needs. For example, if the user wants the
matched events to be exactly those selected in the tool, the rule needs a high precision.
2. The rule is also applied in the visualization to highlight the false positives and
false negatives, showing how filtering through the rule differs from the user's intention.
Along with the rule, we highlight the events the rule selected in the visualization, with
false positives colored red for clarity.
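The first quality check can be sketched as deriving precision and recall from the user's selection and the rule's matches (a generic sketch, not the tool's exact code):

```python
# Compare the user's selection with the events the generated rule matches
# and derive precision and recall from the confusion-matrix counts.
def rule_quality(selected, matched):
    tp = sum(s and m for s, m in zip(selected, matched))           # true positives
    fp = sum((not s) and m for s, m in zip(selected, matched))     # false positives
    fn = sum(s and (not m) for s, m in zip(selected, matched))     # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```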
3.5 Other Approaches Attempted
Sunburst Visualization
The primary visualization method described above generates CEP rules that classify and
detect events satisfying specific criteria. However, it does not concern itself with event
patterns that occur in a sequence.
For Example,
from every (a1 = infoStock[action == "buy"]
-> a2 = confirmOrder[command == "OK"] )
-> b1 = StockExchangeStream [price > infoStock.price]
within 3000
select a1.action as action, b1.price as price
insert into StockQuote
This is an example pattern-matching query in the "Siddhi" language specification, which is
used in the "Siddhi" complex event processing engine (WSO2)[45]. To create these kinds of
CEP rules we first had to look into methods of visualizing sequences.
Patterns and Sequences
Patterns In the scenario described above, the specified query is a pattern-matching query.
It matches events that occur in a pattern within a constrained amount of time (3000
milliseconds). This means there may be other, unrelated events between the events
specified in the pattern.
Sequences In a sequence, all events must occur in the specified order with no unrelated
events in between.
For example,
A ⇒ A ⇒ B ⇒ B ⇒ B ⇒ C
This sequence of events follows the A, B, C sequence.
Sunburst Partition
After looking into existing visualizations, we picked the sunburst plot [56, 55] for
visualizing patterns and sequences. A sunburst is a visualization resembling a multilevel
pie chart that can display hierarchical information. It consists of concentric circles of
varying radii: the circle in the center represents the root node, and lower levels of the
hierarchy are represented by circles further from the center. Each circle is segmented by
radial lines; each segment represents a node in the hierarchy, and child nodes are drawn
within the angle occupied by their parent node [66].
Visualizing sequences in Sunburst
For example, take the sequences:
1. A → B → C → D → E
2. B → C → D
3. A → C → B → D
At the first level, sequences 1 and 3 share the same first element A, while sequence 2
begins with element B. This can be shown in a tree structure as in Diagram 3.1.
Root
├── A
└── B

Diagram 3.1: sequence prefixes example

In this manner the above sequences can be shown in a tree structure in which the longest
common prefixes are shared by several sequences. So the above three sequences can be
represented as the tree in Diagram 3.2:

Root
├── A
│   ├── B ── C ── D ── E
│   └── C ── B ── D
└── B ── C ── D

Diagram 3.2: sequence tree
This tree structure can be directly represented in the sunburst visualization: the angle
of a segment represents how many sequences share that common prefix, and the color of the
segment represents the type of the element (A, B, C).
Why use Sunburst
We found several research studies evaluating space-filling visualizations, such as the
sunburst, that display hierarchical information. These studies concluded that the sunburst
is the visualization method that uses space most effectively to display such information
while remaining intuitive to the user [66, 57, 33]. Further research shows the sunburst
being used to visualize frequent patterns using the same approach we described earlier
[34]. An additional motivating factor was existing work that used a sunburst diagram to
visualize summarized user navigation paths through a web site [48, 52], displayed below in
Figure 3.18. We used a modified version of this implementation as groundwork for our
project. The limitations of that visualization and the improvements made in our project
are described below.
Figure 3.18: Sunburst Visualization which we used as the foundation for our project
We used the visualization depicted in Figure 3.18, originally used to display user
navigation paths through a web site, as the basic foundation for our project. This stock
visualization had several limitations:
1. It needed to load data from a CSV in a specific format. Since our project involves
loading various kinds of data files, we needed a method to load data of arbitrary formats
into this visualization.
2. It could only draw sequences of limited length, and the text value of each element
needed to be short to display clearly. For example, we could not show long element names
such as "\ta \online \query \query \string" in the breadcrumb trail; only short names like
"product" could be shown without error.
3. It could only contain a limited number of unique elements (name value types). With many
unique elements (each shown in a different color), many elements would look the same and
confuse the user.
4. The user could not perform many operations on the visualization (no drill-down
operations, no zooming). Simply put, this stock visualization could only display a limited
amount of data on a static web page.
To improve upon this visualization, we first had to develop a method to process the data
taken from the data files into a format usable by the visualization, described in the
following section.
Data Processing
From the preprocessing page of our application, the user first moves to the sunburst tab
and selects a grouping column and a grouped column to be used in the data processing
stage, as shown in Figure 3.19.
Figure 3.19: Select grouping and grouped columns
These attributes are loaded from the column names of the uploaded file. Specifying them in
this manner causes the data file to be grouped by the grouping attribute (like an SQL
GROUP BY operation).
The data elements in the data file are first converted into a Python Pandas DataFrame
object for processing, automatically in the backend of the application. Group-by
operations are then performed on the DataFrame to create a database of sequences.
Why use Python Pandas
Python Pandas is a high-performance data analysis and data preparation toolkit developed
for the Python platform. It provides many easy-to-use, high-performance data structures,
data analysis functions and tools that make data analysis tasks easier, and it saves the
user from moving to a domain-specific language such as R to execute these kinds of tasks.
Python is generally preferred for data analytics because of the wide range of data
analysis tools available in the Python ecosystem; it is therefore the preferred
general-purpose programming language for many scientific and research projects involving
data analytics. These were our main motivations for developing the core application in
Python.
Creating the Sequence Database A DataFrame built from a multicolumn CSV looks as shown in
figure 3.20.
Figure 3.20: Data file representation in Python DataFrame
The Pandas group-by operator splits the DataFrame into multiple groups based on a given
criterion; since we have specified a group-by attribute, it splits the DataFrame based on
the unique values of that column. The DataFrame.groupby command and its inputs are shown
below:
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True,
group_keys=True, squeeze=False)
After the grouping operation, Pandas creates a DataFrameGroupBy object consisting of
multiple DataFrames, each holding the rows of the original DataFrame that share the same
value for the grouped attribute. In this example we have grouped by the "Remote host"
attribute, so each split DataFrame holds all the requests made by a particular remote
host.
Figure 3.21: Python DataFrame after Group By operation
Since the representation shown in figure 3.21 is not suitable for further data analysis
operations, the data is processed further as follows:
• Each row of these grouped DataFrames is converted into a Python "dict", and all the
"dicts" are stored in a Python "list":
[{col_name1: value1, col_name2: value2, col_name3: value3},
{col_name1: value1, col_name2: value2, col_name3: value3}]
• Since all these log events occurred in a sequence, a "sequence number" attribute is
added to each of the "dicts".
Figure 3.22: Sequence Database Final Representation
The final "sequence database" structure looks as shown in figure 3.22.
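The steps above can be sketched with Pandas; the column names follow the web-log example, and the tiny inline DataFrame stands in for a parsed log:

```python
import pandas as pd

# Illustrative parsed-log fragment.
df = pd.DataFrame({
    "remote_host": ["h1", "h2", "h1"],
    "url": ["/a", "/x", "/b"],
})

# Group by remote host, turn each group into a list of dicts, and add a
# running sequence number to each dict.
sequence_db = {}
for host, group in df.groupby("remote_host", sort=False):
    events = group.to_dict("records")
    for seq_no, event in enumerate(events):
        event["seq_no"] = seq_no
    sequence_db[host] = events
```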
To use this data as input to the sunburst diagram, the original sunburst implementation
requires sequences to be in the format:
event1 'separator' event2 'separator' event3 'separator', number_of_times_
this_sequence_occurred
The 'separator' is used to separate the events. After modifying the original sunburst
implementation, we added the capability for it to accept a two-element JSON array
consisting of a sequence and the number of occurrences of that sequence in the database,
and to visualize those sequences.
To convert the data into this format, further operations were needed on the sequence
database. Since the sunburst diagram can show only one attribute of the data, the
sequences in the sequence database need to be stripped of unnecessary attributes. The
attribute to keep is decided by the grouped-column value the user specified in Figure
3.19. In this example the user has specified "URL", so Pandas operations are run on the
sequence database DataFrame to keep only the "URL" attribute of each sequence element. The
sequence elements are then joined into a string using the 'separator' mentioned above.
The stripped sequence database now looks as shown in figure 3.23. Note that in this
representation we used "|-|" as the separator, to prevent confusion with the elements of
the sequences.
Figure 3.23: Sequence Database after being stripped of unnecessary event attributes
Counting the occurrences of unique sequences Each row of the stripped sequence database
shown in figure 3.23 represents one "Remote host" and the sequence of URLs requested by
that host. There is a high probability that the same requests were made by some other user
in the same sequence of steps. The number of users that requested the same sequence of
URLs is the number-of-times-this-sequence-occurred value needed as input to the sunburst
visualization. This is computed with the value_counts() operator in Pandas.
The representation of the sequences now looks as shown in figure 3.24.
Figure 3.24: Counting the number of unique sequences using value_counts()
This data is then given to the sunburst as a two-element array consisting of the URL
sequence and the frequency of that sequence.
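The counting step can be sketched as follows; each host's stripped URL sequence has already been joined with the "|-|" separator, and the example sequences are illustrative:

```python
import pandas as pd

# Each entry is one host's URL sequence joined with the "|-|" separator.
sequences = pd.Series(["/a|-|/b", "/a|-|/b", "/x"])

# value_counts() gives how many hosts produced each identical sequence.
counts = sequences.value_counts()

# Two-element [sequence, frequency] pairs, the input format the modified
# sunburst accepts.
sunburst_input = [[seq, int(n)] for seq, n in counts.items()]
```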
Improvements to the Sunburst Diagram
As specified above, the sunburst display we used as a base had a number of limitations.
This section describes how those limitations were overcome.
1. Load data from CSV -
This limitation was overcome by loading data formatted through the steps described above
via an Ajax call to the back end. The original sunburst implementation was modified to
accept a JSON array object as input to visualize the data.
2. Each name value of an event (sequence element) needed to be of limited length, and each
sequence needed to be of limited length, to display correctly -
To overcome this limitation we added functionality to clip the value of the sequence
element in the breadcrumb trail, together with tooltips that display the full value when
the mouse cursor points at the element. We also added the capability to show sequences of
arbitrary length in the visualization without affecting coherence.
3. Shows only a limited number of unique elements in a limited number of colors -
We improved the visualization to show many unique elements in distinct colors by
developing a color function that assigns an almost unique color to each unique element
(colors are guaranteed to be unique if the number of unique values is below 121). The
original 6 hard-coded color values were extended to 121. These colors were selected so
that most browsers can display them and so that they match the contrast of the overall
visualization; for this purpose, certain light colors were removed from the admissible
range of the 140 HTML5 and CSS3 color names [25].
(Note: in the diagram displayed in figure 3.25, different name value types have been
assigned the same color because of the high number of unique name value types, in this
case 700, while the CSS3 and HTML5 specifications contain only 140 web-safe colors.)
4. Limitation of drill-down and zoom functionality -
When sequences are of arbitrary length and have many unique elements, the visualization
tends to look cluttered and small elements become incoherent. So we added functionality to
drill down into sequences in a separate visualization beside the main one. When the user
clicks on an element, the sequence from that element onward is visualized in the second
display. User operations on the second display (moving the mouse over elements) are also
referenced in the main visualization, so that the user keeps a coherent and intuitive view
of the whole visualization. We also added pan and zoom functionality to the main display
so that the user can inspect hard-to-see elements by zooming in and panning.
Note: in the visualization shown in figure 3.26, the children of the "online\images\f"
element are depicted in the second diagram, demonstrating the drill-down functionality;
moving the mouse pointer over elements in the second diagram highlights the referenced
elements in the main diagram. Figure 3.27 shows how hard-to-see areas can be zoomed and
panned in the main diagram.
Sequential Pattern Mining (Pattern Search)
To use the above visualization for our CEP rule generation use case, patterns need to be
searched for in the sequence database.
Expected use case The user specifies event name values, or a regex pattern that matches
the name values, as a pattern:
A, B, \C[A-Z]
The pattern search engine returns the sequences that contain these events occurring in the
order specified by the user and visualizes them in the display. It also returns the length
of the time window within which this sequence occurs in the sequence database in the
average case. For example:
A, B, ˆ\[C[A-Z]
Average sequence length 19, min = 3, max = 32
Average time window 1300 milliseconds, min = 50 milliseconds, max = 2500 milliseconds
To generate these parameters, the pattern search engine utilizes the timestamp and
sequence-number attributes of each event in the sequence database. These parameters can be
used to develop a CEP rule that matches and detects this sequence in a data stream.
Pattern Search Methodology
There is much existing work on sequential pattern mining, but it involves finding all the
sequential patterns in a sequence database given a support value. In our use case we need
to search for one specific pattern in the database. For that purpose we used a modified
approach based on the prefix-projected pattern search utilized in the PrefixSpan algorithm
[47]. In this approach, each element of the pattern projects the sequences in the
database, discarding unmatched prefixes and creating a new database of sequences. In the
long run this is very efficient, because the size of the sequence database shrinks with
each successive iteration.
An example sequence database of four sequences is shown below:
1. A,B,C,D,A,E
2. B,C,A,E
3. C,A,C,B,D
4. A,B,C,D,C,E
If we search for the pattern A,B,C in the given sequence database, in the first iteration the sequence database is projected using the first element of the pattern, A. This results in the following database of sequences:
1. B,C,D,A,E
2. E
3. C,B,D
4. B,C,D,C,E
Notice that the length of every sequence has been reduced. If these sequences are projected again using the second element of the search pattern (B), we obtain:
1. C,D,A,E
2. D
3. C,D,C,E
Note that the projection has caused one sequence to disappear from the database altogether (sequence 2, whose projected remainder E did not contain a B).
Using this approach we can quickly obtain the sequences that contain a specified pattern, and then, by scanning those sequences, calculate the time-window and length-window parameters of the pattern as it occurs in the sequence database.
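The projection steps above can be sketched in Python as follows (a minimal reconstruction for illustration; the function and variable names are ours, not the tool's):

```python
def find_pattern(database, pattern):
    """Return the indices of sequences that contain `pattern` as an ordered
    subsequence, by repeatedly projecting on each pattern element
    (PrefixSpan-style prefix projection)."""
    # Track (original index, remaining suffix) pairs through each projection.
    live = list(enumerate(database))
    for event in pattern:
        # Keep only sequences containing the event, and discard the prefix
        # up to and including its first occurrence.
        live = [(i, seq[seq.index(event) + 1:]) for i, seq in live if event in seq]
    return [i for i, _ in live]

# The four-sequence example database from the text.
database = [
    ["A", "B", "C", "D", "A", "E"],
    ["B", "C", "A", "E"],
    ["C", "A", "C", "B", "D"],
    ["A", "B", "C", "D", "C", "E"],
]
print(find_pattern(database, ["A", "B", "C"]))  # [0, 3]: sequences 1 and 4 match
```

A real implementation would also carry each event's timestamp and sequence number through the projections, so that the matched sequences can be scanned for the window statistics described earlier.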
Exclusion of this Component from the Project
Although this component adds further functionality to our project, we judged that the impact of the above use case was low compared to the overall usability of the project. Furthermore, developing this functionality would have required building the rule generation and precision calculation methods from scratch, because the components developed for the main Parallel Coordinates visualization are not reusable here.
Developing these methods would have taken the project beyond its scope and schedule. Therefore, given the limited time available, we decided to drop further development of this module in favor of improving the existing main functionality of the project.
Figure 3.25: Improved sunburst visualization
Figure 3.26: Sunburst with drill down capability
Figure 3.27: Sunburst zoom and pan capability
Chapter 4
Discussion
Using our visualization and rule generation methodology we were able to produce good results in simple use cases. This methodology can help users identify trends and information in data streams that would otherwise go unnoticed, and generate CEP queries to identify similar events with the click of a button, which makes the whole process simpler and easier. We tested our tool by generating rules for events occurring in web logs and in other generic data types, and we were able to produce usable CEP queries for most of our needs.
The biggest challenge we faced in the implementation of our project was visualizing, and generating rules for, sequential patterns. While Parallel Coordinates is an excellent visualization method for multidimensional data, it does not translate well to displaying sequences, so the user does not get a chance to identify sequential patterns in the data using our visualization. We tried to address this by introducing the sunburst visualization, which displays sequences based on their frequency of occurrence and allows the user to select a sequence from it. But we felt that this addressed only a single use case of sequence detection and was not useful in a generic environment.
In addition, implementing this would have required us to completely overhaul the rule generation mechanism: the current implementation, which uses decision trees, does not support sequence matching and therefore cannot be used for pattern detection. The methodology presented in iCEP [41] is better suited for generating these types of rules, but given its complexity and its disregard for user interaction, we decided we were better off focusing on the simpler non-sequential rule generation.
Other issues we faced during the project include:
• Displaying labels for String type data
When displaying categorical variables that contain a large number of unique labels, the display tends to become dense and the information hard to distinguish. This issue needs to be handled better for data items such as names, which are unique among data items. Currently, as we display the data table alongside the visualization, users can look these values up in the table, but we need a better method of displaying this type of data.
• Displaying data with long labels
Another problem with the visualization is long labels that cannot be displayed in full. Displaying the entire data label obstructs other aspects of the visualization and causes it to lose its value. While we can easily shorten the labels and display only partial values, this leads to loss of information for data values such as URLs, which are commonly very long.
• Loss of interactivity when displaying large data sets
As the visualization runs in a web browser, its performance depends heavily on the resources available on the machine the user is viewing it on. This results in the visualization losing interactivity and becoming slow when displaying larger data sets; when the visualized data set reaches hundreds of thousands of rows, the whole visualization becomes unusable. One solution to this problem is to avoid visualizing such large data sets by reducing the data set to a much smaller size through methods such as sampling.
• Different data formats
As the tool currently stands, we support the .csv format and Apache web log files. To support other formats and data types, we need to add parsers that convert those data types into Pandas DataFrames.
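As an illustration of what such a parser might look like, the sketch below converts Apache common-log lines into a Pandas DataFrame. The regular expression and column names are hypothetical, chosen for this example rather than taken from Vivarana's actual implementation:

```python
import re
import pandas as pd

# Illustrative pattern for the Apache "common" log format.
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_apache_log(lines):
    """Parse an iterable of log lines into a DataFrame, skipping
    lines that do not match the expected format."""
    rows = [m.groupdict() for m in map(LOG_RE.match, lines) if m]
    df = pd.DataFrame(rows)
    df["status"] = df["status"].astype(int)
    return df

lines = ['127.0.0.1 - - [10/Oct/2015:13:55:36 +0000] '
         '"GET /index.html HTTP/1.1" 200 2326']
print(parse_apache_log(lines))
```

Any new format would need only a similar parser function producing a DataFrame; the rest of the pipeline could then remain unchanged.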
(a) Large numbers of unique labels make them indistinguishable
(b) Long labels block details below them
Figure 4.1: Challenges faced in the visualization
Chapter 5
Conclusion and Future Work
We started this project with the task of creating a better method for analyzing large data streams and acting on them through CEP engines, and we feel that we have taken a successful first step in this direction. With our tool, users can examine a data set and look for patterns and trends without being experts in data mining, and write CEP queries to identify the events they deem important with the click of a button.
To move towards a completely automated process, further functionality is needed. We have identified several improvements that would make the project more usable and provide more functionality to the user:
• Handling large amounts of data
We need to optimize the tool so that it can handle more data in a single visualization without compromising interactivity. This can be done by optimizing the visualization we created. But even if all the data could be displayed without interactivity issues, we would reach a point where the large number of data points makes the display too dense for visual analysis. The method our tool currently provides to avoid this is reducing the size of the data through sampling: the user specifies a sample size, and the sample is filled with randomly selected items from the data set. While this reduces the data set to a manageable size, random sampling may lead to loss of valuable information. A better approach would be to scan the data for anomalies and other important events and display those clearly, but identifying ’important’ data subsets is a complex issue that needs more research.
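The sampling step described above can be as simple as the following sketch (the function name, default sample size, and seed are illustrative, not the tool's actual values):

```python
import pandas as pd

def downsample(df, max_rows=5000, seed=0):
    """Randomly sample a DataFrame down to at most max_rows rows,
    so the parallel-coordinates display stays interactive.
    A fixed seed keeps the sample reproducible across reloads."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)

df = pd.DataFrame({"x": range(100000)})
print(len(downsample(df)))  # 5000
```

An anomaly-aware reducer would replace the uniform `df.sample` call with a scoring pass that keeps outlying rows preferentially, which is the open research direction noted above.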
• Handling sequential patterns
This is one of the most important paths that needs further exploration to make this tool valuable. CEPs are very commonly used to look for distinct patterns in the data, so we need to be able to handle that aspect of rule generation. This is a very complex task, as the sequences required vary widely from use case to use case: for a store application we may need to track purchase sequences, while for a banking application the sequences may be sequences of purchases grouped by their relative size. Writing a visualization that can create and show these types of patterns therefore becomes increasingly complex, and this aspect of our tool needs further improvement.
• Allow the user to change parameters in the generated rule
Allowing the user to change parameters in the generated rule can help the user get exactly what he or she wants: we can let the user adjust the parameters of the rule and show how the changes affect its application and how the result differs from the originally generated rule. This can help produce a better final result.
• Improve the data mining aspect of the application
We need to provide more functionality and support more use cases. We currently provide cluster analysis and anomaly detection, among a few other data mining tools, but the user still has to depend on visual analysis of the data to find its important factors. We can improve this by supporting more manipulation operations, such as search functionality, and additional mining operations.
We believe that with these improvements and further work we can create a tool that makes using CEP engines to process real-life data streams simpler and more enjoyable, and that allows people who are not experts in data mining and Complex Event Processing to use these valuable tools with ease.
Bibliography
[1] Gennady Andrienko and Natalia Andrienko. “Blending aggregation and selection:
Adapting parallel coordinates for the visualization of large datasets”. In: The Car-
tographic Journal 42.1 (2005), pp. 49–60.
[2] Almir Olivette Artero, Maria Cristina Ferreira de Oliveira, and Haim Levkowitz. “En-
hanced high dimensional data visualization through dimension reduction and attribute
arrangement”. In: Information Visualization, 2006. IV 2006. Tenth International Con-
ference on. IEEE. 2006, pp. 707–712.
[3] Daniel Asimov. “The grand tour: a tool for viewing multidimensional data”. In: SIAM
journal on scientific and statistical computing 6.1 (1985), pp. 128–143.
[4] Stefan Axelsson. Intrusion detection systems: A survey and taxonomy. Tech. rep. Tech-
nical report, 2000.
[5] Eric A Bier et al. “Toolglass and magic lenses: the see-through interface”. In: Proceed-
ings of the 20th annual conference on Computer graphics and interactive techniques.
ACM. 1993, pp. 73–80.
[6] Markus M Breunig et al. “LOF: identifying density-based local outliers”. In: ACM
sigmod record. Vol. 29. 2. ACM. 2000, pp. 93–104.
[7] Krysia Broda et al. SAGE: a logical agent-based environment monitoring and control
system. Springer, 2009, pp. 112–117.
[8] cluster: Cluster Analysis Extended Rousseeuw et al. http://cran.r-project.org/
web/packages/cluster/index.html/. [Online; accessed 03-February-2015]. 2015.
[9] Dianne Cook et al. “Grand tours, projection pursuit guided tours, and manual con-
trols”. In: Handbook of data visualization. Springer, 2008, pp. 295–314.
[10] Data-Ink Ratio. http://www.infovis-wiki.net/index.php/Data-Ink_Ratio. [Online; accessed 03-February-2015]. 2015.
[11] Alan Demers et al. “Towards expressive publish/subscribe systems”. In: Advances in
Database Technology-EDBT 2006. Springer, 2006, pp. 627–644.
[12] Django overview, Django. https://www.djangoproject.com/start/overview/.
[Online; accessed 03-February-2015]. 2015.
[13] Niklas Elmqvist, Pierre Dragicevic, and Jean-Daniel Fekete. “Rolling the dice: Multi-
dimensional visual exploration using scatterplot matrix navigation”. In: Visualization
and Computer Graphics, IEEE Transactions on 14.6 (2008), pp. 1539–1148.
[14] Ronald Aylmer Fisher. “The design of experiments.” In: (1935).
[15] M Forina et al. “Classification of olive oils from their fatty acid composition”. In: Food
research and data analysis: proceedings from the IUFoST Symposium, September 20-
23, 1982, Oslo, Norway/edited by H. Martens and H. Russwurm, Jr. London: Applied
Science Publishers, 1983. 1983, pp. 189–214.
[16] Michael Friendly. “A brief history of the mosaic display”. In: Journal of Computational
and Graphical Statistics 11.1 (2002).
[17] Ying-Huey Fua, Matthew O Ward, and Elke A Rundensteiner. “Hierarchical parallel
coordinates for exploration of large datasets”. In: Proceedings of the conference on
Visualization’99: celebrating ten years. IEEE Computer Society Press. 1999, pp. 43–
50.
[18] John C Gower. “A general coefficient of similarity and some of its properties”. In:
Biometrics (1971), pp. 857–871.
[19] John A Hartigan and Beat Kleiner. “Mosaics for contingency tables”. In: Computer
science and statistics: Proceedings of the 13th symposium on the interface. Springer.
1981, pp. 268–273.
[20] Helwig Hauser, Florian Ledermann, and Helmut Doleisch. “Angular brushing of ex-
tended parallel coordinates”. In: Information Visualization, 2002. INFOVIS 2002.
IEEE Symposium on. IEEE. 2002, pp. 127–130.
[21] Julian Heinrich and Daniel Weiskopf. “State of the art of parallel coordinates”. In:
STAR Proceedings of Eurographics 2013 (2013), pp. 95–116.
[22] Julian Heinrich et al. “Evaluation of a bundling technique for parallel coordinates”.
In: arXiv preprint arXiv:1109.6073 (2011).
[23] Patrick Hoffman et al. “DNA visual and analytic data mining”. In: Visualization’97.,
Proceedings. IEEE. 1997, pp. 437–441.
[24] Heike Hofmann. “Mosaic plots and their variants”. In: Handbook of data visualization.
Springer, 2008, pp. 617–642.
[25] HTML Color Names. http://www.w3schools.com/html/html_colornames.asp.
[Online; accessed 03-February-2015]. 2015.
[26] Zhexue Huang. “A Fast Clustering Algorithm to Cluster Very Large Categorical Data
Sets in Data Mining.” In: DMKD. Citeseer. 1997.
[27] J-F Im, Michael J McGuffin, and Rock Leung. “GPLOM: the generalized plot matrix
for visualizing multidimensional multivariate data”. In: Visualization and Computer
Graphics, IEEE Transactions on 19.12 (2013), pp. 2606–2614.
[28] Alfred Inselberg and Bernard Dimsdale. Parallel coordinates for visualizing multi-
dimensional geometry. Springer, 1987.
[29] Ian Jolliffe. Principal component analysis. Wiley Online Library, 2002.
[30] Rudolph Emil Kalman. “A new approach to linear filtering and prediction problems”.
In: Journal of Fluids Engineering 82.1 (1960), pp. 35–45.
[31] Samuel Kaski and Teuvo Kohonen. “Exploratory data analysis by the self-organizing
map: Structures of welfare and poverty in the world”. In: Neural networks in financial
engineering. Proceedings of the third international conference on neural networks in
the capital markets. Citeseer. 1996.
[32] Leonard Kaufman and Peter J Rousseeuw. Finding groups in data: an introduction to
cluster analysis. Vol. 344. John Wiley & Sons, 2009.
[33] Daniel A Keim. “Information visualization and visual data mining”. In: Visualization
and Computer Graphics, IEEE Transactions on 8.1 (2002), pp. 1–8.
[34] Daniel A Keim, Jorn Schneidewind, and Mike Sips. “Fp-viz: Visual frequent pattern
mining”. In: (2005).
[35] klaR: Classification and visualization. http://cran.r-project.org/web/packages/
klaR/index.html. [Online; accessed 03-February-2015]. 2015.
[36] Edwin M Knorr, Raymond T Ng, and Vladimir Tucakov. “Distance-based outliers: algorithms and applications”. In: The VLDB Journal: The International Journal on Very Large Data Bases 8.3-4 (2000), pp. 237–253.
[37] Lie Factor. http://www.infovis-wiki.net/index.php?title=Lie_Factor. [Online;
accessed 03-February-2015]. 2015.
[38] Wei-Yin Loh. “Classification and regression trees”. In: Wiley Interdisciplinary Reviews:
Data Mining and Knowledge Discovery 1.1 (2011), pp. 14–23. issn: 1942-4795. doi:
10.1002/widm.8. url: http://dx.doi.org/10.1002/widm.8.
[39] Liang Fu Lu, Mao Lin Huang, and Tze-Haw Huang. “A new axes re-ordering method in
parallel coordinates visualization”. In: Machine Learning and Applications (ICMLA),
2012 11th International Conference On. Vol. 2. IEEE. 2012, pp. 252–257.
[40] Yuan Luo et al. “Cluster Visualization in Parallel Coordinates Using Curve Bundles”.
In: Visualization and Computer Graphics, IEEE Transactions on 20 (2008).
[41] Alessandro Margara, Gianpaolo Cugola, and Giordano Tamburrelli. “Learning from
the past: automated rule generation for complex event processing”. In: Proceedings
of the 8th ACM International Conference on Distributed Event-Based Systems. ACM.
2014, pp. 47–58.
[42] Allen R Martin and Matthew O Ward. “High dimensional brushing for interactive
exploration of multivariate data”. In: Proceedings of the 6th Conference on Visualiza-
tion’95. IEEE Computer Society. 1995, p. 271.
[43] Fionn Murtagh and A Heck. “Multivariate data analysis with Fortran, C and Java
code”. In: Northern Ireland: Queen University Belfast, Astronomical Observatory Stras-
bourg (2000), p. 272.
[44] Christopher Mutschler and Michael Philippsen. “Learning event detection rules with
noise hidden markov models”. In: Adaptive Hardware and Systems (AHS), 2012 NASA/ESA
Conference on. IEEE. 2012, pp. 159–166.
[45] Patterns. https://docs.wso2.com/display/CEP310/Patterns. [Online; accessed
03-February-2015]. 2015.
[46] Karl Pearson. “Note on regression and inheritance in the case of two parents”. In:
Proceedings of the Royal Society of London 58.347-352 (1895), pp. 240–242.
[47] Jian Pei et al. “PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth”. In: Proceedings of the 17th International Conference on Data Engineering (ICDE). IEEE Computer Society. 2001, pp. 215–224.
[48] Kerry Rodden. “Applying a Sunburst Visualization to Summarize User Navigation
Sequences”. In: (2014).
[49] S Savoska and S Loskovska. “Parallel Coordinates as Tool of Exploratory Data Anal-
ysis”. In: 17th Telecommunications Forum TELFOR, Belgrade, Serbia. 2009, pp. 24–
26.
[50] Nicholas Poul Schultz-Møller, Matteo Migliavacca, and Peter Pietzuch. “Distributed
complex event processing with query rewriting”. In: Proceedings of the Third ACM
International Conference on Distributed Event-Based Systems. ACM. 2009, p. 4.
[51] Jinwook Seo and Ben Shneiderman. “A rank-by-feature framework for interactive ex-
ploration of multidimensional data”. In: Information Visualization 4.2 (2005), pp. 96–
113.
[52] Sequences Sunburst. http://bl.ocks.org/kerryrodden/7090426. [Online; accessed
03-February-2015]. 2015.
[53] Shiny. http://shiny.rstudio.com/. [Online; accessed 03-February-2015]. 2015.
[54] Anselm Spoerri. “InfoCrystal, a visual tool for information retrieval”. PhD thesis.
Massachusetts Institute of Technology, 1995.
[55] John T. Stasko. SunBurst. http://www.cc.gatech.edu/gvu/ii/sunburst/. [Online;
accessed 03-February-2015]. 2015.
[56] John Stasko and Eugene Zhang. “Focus+ context display and navigation techniques for
enhancing radial, space-filling hierarchy visualizations”. In: Information Visualization,
2000. InfoVis 2000. IEEE Symposium on. IEEE. 2000, pp. 57–65.
[57] John Stasko et al. “An evaluation of space-filling information visualizations for depict-
ing hierarchical structures”. In: International Journal of Human-Computer Studies
53.5 (2000), pp. 663–694.
[58] Randolph Stone et al. “Identification of genes correlated with early-stage bladder can-
cer progression”. In: Cancer Prevention Research 3.6 (2010), pp. 776–786.
[59] Martin Theus. “High Dimensional Data Visualizations”. In: Handbook of data visual-
ization. Springer, 2008, pp. 156–163.
[60] Martin Theus. “Parallel Coordinate Plots”. In: Handbook of data visualization. Springer,
2008, pp. 164–174.
[61] Edward R Tufte. “Small Multiples”. In: Envisioning Information. Graphics press Cheshire,
CT, 1990, pp. 67–80.
[62] Edward R Tufte and PR Graves-Morris. The visual display of quantitative information.
Vol. 2. Graphics press Cheshire, CT, 1983.
[63] Yulia Turchin, Avigdor Gal, and Segev Wasserkrug. “Tuning complex event processing
rules using the prediction-correction paradigm”. In: Proceedings of the Third ACM
International Conference on Distributed Event-Based Systems. ACM. 2009, p. 10.
[64] Shimon Ullman. The interpretation of visual motion. Massachusetts Inst of Technology
Pr, 1979.
[65] Roel Vliegen, Jarke J van Wijk, and E-J Van der Linden. “Visualizing business data
with generalized treemaps”. In: Visualization and Computer Graphics, IEEE Transac-
tions on 12.5 (2006), pp. 789–796.
[66] Richard Webbera, Ric D Herbertb, and Wei Jiangbc. “Space-filling Techniques in Vi-
sualizing Output from Computer Based Economic Models”. In: ().
[67] What is a Trellis Chart? http://trellischarts.com/what-is-a-trellis-chart.
[Online; accessed 03-February-2015]. 2015.
[68] Pak Chung Wong and R Daniel Bergeron. “30 Years of Multidimensional Multivariate
Visualization.” In: Scientific Visualization. 1994, pp. 3–33.
[69] Hong Zhou et al. “Visual clustering in parallel coordinates”. In: Computer Graphics
Forum. Vol. 27. 3. Wiley Online Library. 2008, pp. 1047–1054.