Intelligent Adaptation of Ensemble Size in Data Streams Using Online Bagging
Author: Muhammed Kehinde Olorunnimbe
Supervisor: Dr. Herna L. Viktor
Thesis submitted to the Faculty of Graduate and Postdoctoral Studies
in partial fulfilment of the requirements for the degree of
Master of Science in Systems Science
School of Electrical Engineering and Computer Science
Faculty of Engineering
University of Ottawa
© Muhammed Kehinde Olorunnimbe, Ottawa, Canada, 2015
http://www.uottawa.ca
Abstract
In this era of the Internet of Things and Big Data, a proliferation of connected devices
continuously produce massive amounts of fast evolving streaming data. There is a need to
study the relationships in such streams for analytic applications, such as network intrusion
detection, fraud detection and financial forecasting, amongst others. In this setting, it is
crucial to create data mining algorithms that are able to seamlessly adapt to temporal
changes in data characteristics that occur in data streams. These changes are called
concept drifts. The resultant models produced by such algorithms should not only be highly
accurate and able to swiftly adapt to changes; the data mining techniques
should also be fast, scalable, and efficient in terms of resource allocation. It then becomes
important to consider issues such as storage space needs and memory utilization. This is
especially relevant when we aim to build personalized, near-instant models in a Big Data
setting.
This research work focuses on mining in a data stream with concept drift, using an
online bagging method, with consideration to the memory utilization. Our aim is to
take an adaptive approach to resource allocation during the mining process. Specifically,
we consider metalearning, in which the models of multiple classifiers are combined into an
ensemble; this approach has been very successful in building accurate models against data streams.
However, little work has been done to explore the interplay between accuracy, efficiency
and utility. This research focuses on this issue. We introduce an adaptive metalearning
algorithm that takes advantage of the memory utilization cost of concept drift, in order to
vary the ensemble size during the data mining process. We aim to minimize the memory
usage, while maintaining highly accurate models with a high utility.
We evaluated our method against a number of benchmarking datasets and compared
our results against the state of the art. Return on Investment (ROI) was used to evaluate
the gain in performance in terms of accuracy, in contrast to the time and memory invested.
We aimed to achieve high ROI without compromising on the accuracy of the result. Our
experimental results indicate that we achieved this goal.
Acknowledgements
All praises to God Almighty.
A special gratitude to my supervisor, Dr. Herna L. Viktor, for guiding me through
this thesis, from the beginning when I was confused and overwhelmed, to the end when I
was quite less confused but still overwhelmed. Thank you for your patience, support and
understanding. I have learnt so much from you, and it has been an honour working with
you.
Thank you to my mum, my twin sister and the rest of my siblings for the unwavering
love and encouragement, through this, and other aspects of my life. Thank you to my dear
wife, Jemilat Animashaun, for your limitless love, selfless companionship and enduring
support through this journey. I love you.
I would also like to thank my friends and colleagues who have been encouraging and
supportive through this endeavour.
Contents
Abstract ii
Acknowledgements iii
List of Figures viii
List of Tables ix
List of Algorithms x
List of Abbreviations xi
1 Introduction 1
1.1 Motivation . . . . . . . . . . 2
1.2 Thesis Objective . . . . . . . . . . 3
1.3 Thesis Organization . . . . . . . . . . 4
I PRELIMINARIES 5
2 Data Stream Mining: Fundamentals 6
2.1 Taxonomy of Data Mining Tasks and Techniques . . . . . . . . . . 6
2.1.1 Data Mining Techniques . . . . . . . . . . 7
2.1.2 Data Mining Tasks . . . . . . . . . . 9
2.2 Data Streams . . . . . . . . . . 11
2.2.1 Online vs Offline Learning . . . . . . . . . . 12
2.2.2 Data Stream Mining . . . . . . . . . . 13
2.3 Methodologies of Data Stream Mining Systems . . . . . . . . . . 14
2.3.1 Data-Based Methods . . . . . . . . . . 14
2.3.2 Task-Based Methods . . . . . . . . . . 16
2.3.3 Discussion . . . . . . . . . . 17
2.4 Concept Drift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.1 Learner Adaptivity and Model Selection . . . . . . . . . . 18
2.4.2 Detecting and Handling Concept Drift . . . . . . . . . . 19
2.4.3 Discussion . . . . . . . . . . 20
2.5 Utility and Efficiency Consideration In Data Streams . . . . . . . . . . 20
2.6 Challenges in Mining Data Streams . . . . . . . . . . 23
2.6.1 Streaming Data . . . . . . . . . . 23
2.6.2 Security/Privacy . . . . . . . . . . 23
2.6.3 Streaming Data Management . . . . . . . . . . 24
2.6.4 Algorithms Evaluation . . . . . . . . . . 24
2.6.5 Legacy Systems . . . . . . . . . . 25
2.6.6 Model Construction . . . . . . . . . . 25
2.6.7 Entity Stream Mining . . . . . . . . . . 26
2.7 Applications of Mining Data Streams with Concept Drift . . . . . . . . . . 27
2.7.1 Monitoring and Control . . . . . . . . . . 27
2.7.2 Personal Assistance and Information . . . . . . . . . . 28
2.7.3 Decision Making . . . . . . . . . . 29
2.7.4 Artificial Intelligence and Robotics . . . . . . . . . . 29
2.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3 Metalearning and Model Selection 31
3.1 Metalearning . . . . . . . . . . 31
3.2 Model Selection in Ensemble Methods . . . . . . . . . . 32
3.3 ADWIN Drift Detector . . . . . . . . . . 33
3.4 Meta-level Algorithms . . . . . . . . . . 35
3.4.1 Bagging . . . . . . . . . . 35
3.4.2 OzaBag . . . . . . . . . . 36
3.4.3 Boosting . . . . . . . . . . 38
3.4.4 OzaBoost . . . . . . . . . . 39
3.4.5 Discussion . . . . . . . . . . 40
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
II ADAPTIVE ENSEMBLE SIZE 42
4 Adaptive Ensemble Size (AES) Online Bagging 43
4.1 Hoeffding tree Algorithm . . . . . . . . . . 44
4.2 Adaptive Ensemble Size Methodology . . . . . . . . . . 47
4.3 Adaptive Ensemble Size Online Bagging Algorithm . . . . . . . . . . 48
4.4 AES Algorithms . . . . . . . . . . 49
4.4.1 KDD’99 . . . . . . . . . . 50
4.4.2 Size of Ensemble Object in Memory . . . . . . . . . . 51
4.4.3 Ensemble Size . . . . . . . . . . 52
4.4.4 Improvement . . . . . . . . . . 53
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5 Experimentation Setup 55
5.1 Datasets . . . . . . . . . . 55
5.1.1 LED Dataset Generator . . . . . . . . . . 55
5.1.2 Poker Hand . . . . . . . . . . 56
5.1.3 IMDb . . . . . . . . . . 56
5.1.4 Forest Cover Type . . . . . . . . . . 57
5.1.5 Electricity . . . . . . . . . . 57
5.1.6 Airline . . . . . . . . . . 57
5.2 Pre-processing the Datasets . . . . . . . . . . 58
5.3 WEKA . . . . . . . . . . 59
5.4 MOA . . . . . . . . . . 60
5.5 ADAMS . . . . . . . . . . 62
5.6 Measuring Learner Performance . . . . . . . . . . 64
5.6.1 Cost-Sensitive Learning and Accuracy . . . . . . . . . . 64
5.6.2 Accuracy Measurement in Data Streams with Concept Drift . . . . . . . . . . 65
5.6.3 Utility Measurement . . . . . . . . . . 66
5.6.4 Testing for Statistical Significance . . . . . . . . . . 67
5.7 Discussion . . . . . . . . . . 68
5.8 Summary . . . . . . . . . . 69
6 Experimental Results and Discussion 70
6.1 Experimentation Results . . . . . . . . . . 70
6.1.1 Memory Utilization with Concept Drift . . . . . . . . . . 71
6.1.2 Accuracy Comparison . . . . . . . . . . 72
6.1.3 Statistical Significance of Accuracy Results . . . . . . . . . . 75
6.1.4 Memory Comparison . . . . . . . . . . 75
6.1.5 Time Comparison . . . . . . . . . . 78
6.1.6 ROI and Results Discussion . . . . . . . . . . 79
6.2 Synthesis and Lessons Learned . . . . . . . . . . 79
6.3 Conclusion . . . . . . . . . . 83
7 Conclusions 84
7.1 Thesis Contributions . . . . . . . . . . 84
7.2 Future Work . . . . . . . . . . 85
Bibliography 87
A Implementation in the MOA Source Code 95
A.1 Modification to Measurement Class . . . . . . . . . . 95
A.2 Implementation of Algorithm . . . . . . . . . . 96
List of Figures
2.1 Data Mining Techniques . . . . . . . . . . 8
2.2 Data Mining Tasks . . . . . . . . . . 10
2.3 Taxonomy of Data Stream Mining Systems’ Methods . . . . . . . . . . 14
2.4 Illustration of Types of Concept Drift (Žliobaitė, 2010) . . . . . . . . . . 18
3.1 Ensemble Methods . . . . . . . . . . 33
3.2 Exponential Histogram Illustration (Datar, Gionis, Indyk, and Motwani, 2002) . . . . . . . . . . 34
4.1 A Decision Tree . . . . . . . . . . 45
4.2 Memory Increase in the Data Stream . . . . . . . . . . 51
4.3 Error and RAM-Hour With Increase in Ensemble Size . . . . . . . . . . 52
5.1 WEKA Pre-processing . . . . . . . . . . 59
5.2 MOA Framework Workflow . . . . . . . . . . 60
5.3 MOA Classification and Task Configuration Interface . . . . . . . . . . 62
5.4 ADAMS Flow Editor and Result Summary . . . . . . . . . . 63
6.1 Memory Change with Concept Drift . . . . . . . . . . 71
6.2 Accuracy Graph and Box Plot for KDD, Poker and IMDB Datasets . . . . . . . . . . 73
6.3 Accuracy Graph and Box Plot for Forest, Electricity and Airline Datasets . . . . . . . . . . 74
6.4 Memory and Time Plot for KDD and Poker Datasets . . . . . . . . . . 76
6.5 Memory and Time Plot for IMDB, Forest and Electricity Datasets . . . . . . . . . . 77
6.6 Memory and Time Plot for Airline Dataset . . . . . . . . . . 78
6.7 ROI Plot for the Datasets . . . . . . . . . . 80
List of Tables
2.1 Difference Between Data Mining and Data Stream Mining . . . . . . . . . . 13
2.2 Summary of Data Stream Methods (Han, Kamber, and Pei, 2011) . . . . . . . . . . 17
4.1 Categories of attacks in the KDD ’99 dataset . . . . . . . . . . 50
4.2 Accuracy, Time and RAM-Hour for Different Ensemble Sizes . . . . . . . . . . 53
5.1 Poker Hand Dataset Class Labels . . . . . . . . . . 56
5.2 Forest Cover Type Dataset Class Labels . . . . . . . . . . 57
5.3 Summary of Datasets Used . . . . . . . . . . 58
5.4 Weighted Average . . . . . . . . . . 63
5.5 Confusion Matrix . . . . . . . . . . 64
6.1 Kappa Plus Statistics Median for the datasets . . . . . . . . . . 72
6.2 Result of Friedman’s Test . . . . . . . . . . 75
6.3 Memory Utilized in Building the Models (M’bytes) . . . . . . . . . . 75
6.4 Time Taken to Build the Models . . . . . . . . . . 78
6.5 ROI Comparison between the Algorithms . . . . . . . . . . 79
6.6 Summary of ROI and Accuracy of Algorithms . . . . . . . . . . 81
List of Algorithms
1 ADWIN . . . . . . . . . . 34
2 Bagging . . . . . . . . . . 35
3 OzaBag . . . . . . . . . . 36
4 OzaBagADWIN . . . . . . . . . . 37
5 Boosting . . . . . . . . . . 38
6 OzaBoost . . . . . . . . . . 39
7 Hoeffding tree algorithm with VFDT Enhancements . . . . . . . . . . 46
8 AES with OzaBag . . . . . . . . . . 48
9 AES with OzaBagADWIN . . . . . . . . . . 49
List of Abbreviations
ADAMS Advanced Data mining And Machine learning System
ADWIN Adaptive Windowing
AES Adaptive Ensemble Size
AI Artificial Intelligence
Amazon EC2 Amazon Elastic Compute Cloud
ANOVA Analysis of Variance
API Application Program Interface
AUC Area Under the Curve
Bagging Bootstrap Aggregating
CART Classification And Regression Trees
DARPA Defense Advanced Research Projects Agency
DOS Denial-Of-Service
EHA Event History Analysis
FN False Negative
FP False Positive
GUI Graphical User Interface
HT Hoeffding tree
ICMP Internet Control Message Protocol
ID3 Iterative Dichotomiser 3
IMDb Internet Movie Database
IoT the Internet of Things
KDD Knowledge Discovery in Databases
LED Light Emitting Diode
MEKA Multi-label Extension to WEKA
MOA Massive Online Analysis
OzaBag Oza and Russell’s Online Bootstrap Aggregating
OzaBagADWIN OzaBag with ADWIN
OzaBoost Oza and Russell’s Online Boosting
OzaBoostADWIN OzaBoost with ADWIN
PAC Probably Approximately Correct
R2L Remote to Local
RAM Random Access Memory
RIS Resource Information System
ROI Return on Investment
TCP Transmission Control Protocol
TF-IDF Term Frequency - Inverse Document Frequency
TN True Negative
TP True Positive
U2R User to Root
UBDM Utility-Based Data Mining
UDP User Datagram Protocol
UKD Ubiquitous Knowledge Discovery
USFS United States Forest Service
USGS United States Geological Survey
VFDT Very Fast Decision Trees
WEKA Waikato Environment for Knowledge Analysis
Chapter 1
Introduction
Making sense of data is a very important task. Like never before, we are faced with very
fast-paced data that are produced at a large scale. “Big Data” is now a common phrase
in data analytics, with data being generated every second at an exponential rate. This
trend is projected to continue. According to a 2012 report by the data analytics company,
IDC (Gantz and Reinsel, 2012), the future data growth rate is estimated at 300%, with
the overall data generated reaching 40 zettabytes by 2020. That is 40 trillion gigabytes or,
to put it in perspective, 5,200 gigabytes for every person on earth.
In this scenario, there is a constant flow of data that needs to be analysed at some point,
and it is very difficult to analyse all data at any given point. Aside from the complexity
introduced by the constant flow of data, there is also the issue of unforeseen changes in data
distribution, with time. This change phenomenon is referred to as concept drift (Gama,
Žliobaitė, Bifet, Pechenizkiy, and Bouchachia, 2014). Techniques are therefore needed to
analyse these types of fast evolving data streams.
In this thesis, we focus our attention on algorithms that aim to discover knowledge
from data streams that are susceptible to concept drift. Specifically, we study the use
of ensemble methods, since this type of supervised classification technique has shown
promising results for data mining tasks in streams with concept drift (Kuncheva, 2004; Bifet,
Holmes, Pfahringer, Kirkby, and Gavaldà, 2009b; Wang and Pineau, 2013). An ensemble
method involves building a better predictive model by integrating multiple models from
one or multiple classification algorithms (or so-called base learners), effectively reducing the
chances of misclassification. According to Bifet, Holmes, Pfahringer, Kirkby, and Gavaldà
(2009b), ensemble methods “can adapt to change quickly by pruning under-performing
parts of the ensemble, and they therefore usually also generate more accurate concept
descriptions”, in comparison to single classifier methods.
The number (or size) of the ensemble of models to build the composite model from
is usually predetermined. However, this may not be the ideal setting, especially when a
stream is susceptible to concept drift. This research introduces a methodology that adapts
the size of the ensemble based on concept drift in the stream, in order to balance accuracy
and cost, where cost is defined in terms of resource allocation.
The rest of this chapter contains the motivation for our study, and the outline of this
thesis.
1.1 Motivation
In data analytics, there is a constant requirement to improve on the results obtained. In
applications such as intrusion detection, medical prediction and fraud detection, amongst
others, misclassification could be counter-productive and dangerous. It, therefore, becomes
imperative to continuously strive for better classification accuracy. In the current era of
virtualization, parallel and distributed computing, we may choose to increase the resources
allocated to a mining task in a bid to improve the results obtained. However, this
comes at a cost, in terms of memory utilization, time and processor usage. Although the
cost of computing has become relatively cheaper over time, the volume of data has also
become massive. This research work takes this cost into consideration, by way of adapting
the mining technique to the resource usage. Our approach optimizes for accuracy, with
consideration to the computing cost.
Many data mining and machine learning algorithms have been proposed and have
been successful in numerous domains (Brazdil, Giraud-Carrier, Soares, and Vilalta, 2009;
Rossi, De Carvalho, Soares, and De Souza, 2014). Specifically, ensemble methods, such
as Boosting and Bagging, have been found to improve on classification accuracy when
compared to standard base learners. Much of this previous research focuses on factors
such as the right combination of ensemble of models (simply called ensemble), the right
size to use for the ensemble, the kind of ensemble technique to use, and other such subjects
relating to the characteristics of an ensemble. The cost factor, referred to as the economic
cost of stream mining, has, however, not been given as much attention.
Furthermore, determining the optimal size of the ensemble is not a trivial task. In the
context of resource allocation, the direct resource associated with metalearning is the size
of the ensemble used. This translates indirectly into computing cost, and we further
focus on this issue in our research.
We considered the online versions of two state-of-the-art ensemble learners. Online
learners are the types of classification algorithms that are ideal for data streams because
they are able to process chunks of data sequentially, to produce models, without having the
complete dataset. The ensemble learning algorithms we considered are Online Bagging and
Online Boosting. Online Bagging draws a random subset of examples from a data stream,
and generates an independent ensemble of models. The most frequently predicted label is assigned
to the new classification instance. Online Boosting, on the other hand, focuses on the
hard to learn examples, and uses a weighted vote of the ensemble models to assign a new
classification instance (Oza and Russell, 2001).
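Oza and Russell's key insight is that bootstrap resampling can be simulated online by presenting each arriving example to each ensemble member k times, where k is drawn from a Poisson(1) distribution. The sketch below illustrates this idea only; the class names and the trivial base learner are our own, and real implementations (such as MOA's OzaBag) use Hoeffding trees as base learners:

```python
import math
import random
from collections import Counter

class MajorityClassLearner:
    """Trivial base learner, used only to keep the sketch self-contained;
    a real deployment would use e.g. a Hoeffding tree."""
    def __init__(self):
        self.counts = Counter()
    def learn(self, x, y):
        self.counts[y] += 1
    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

def poisson1(rng):
    # Knuth's method for sampling Poisson(lambda = 1).
    limit, k, p = math.exp(-1.0), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

class OnlineBagging:
    def __init__(self, make_learner, n_models, seed=42):
        self.rng = random.Random(seed)
        self.models = [make_learner() for _ in range(n_models)]
    def learn(self, x, y):
        # Each model sees the example k ~ Poisson(1) times,
        # approximating a bootstrap sample of the stream.
        for m in self.models:
            for _ in range(poisson1(self.rng)):
                m.learn(x, y)
    def predict(self, x):
        # Majority vote across the ensemble.
        votes = Counter(m.predict(x) for m in self.models)
        return votes.most_common(1)[0][0]
```

Because Poisson(1) has mean 1, each model processes roughly one copy of each example on average, which mimics sampling with replacement without ever storing the stream.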
A number of studies have shown that Boosting performs better, in terms of accuracy,
when compared to Bagging, especially for non-evolving datasets. However, literature
also suggests that the Online Bagging algorithm performs better, in terms of accuracy and
consistency, in a data stream with concept drift (Oza and Russell, 2001; Bifet, Holmes, and
Pfahringer, 2010c). For this reason, we focus on improving the Online Bagging algorithm
with consideration to the cost of the mining process.
1.2 Thesis Objective
Current research in data streams works towards improving classification accuracy,
with no specific concern for the monetary cost of the mining process. There is, however,
a monetary cost associated with Big Data: the computing means must be accounted for,
and the larger the resources utilized, the larger the cost. This
issue is more prominent in a stream with concept drift, where a fixed estimate cannot be
preassigned up front. There is, therefore, a need to introduce cost-efficient algorithms
that facilitate improved accuracy, as well as adaptable resource usage.
To this end, we introduced the Adaptive Ensemble Size (AES) Online Bagging algorithm,
as an extension of the Online Bagging algorithm. We created an adaptation technique to
adapt the size of the ensemble between optimal maximum and minimum values, instead
of having a fixed size, as is currently implemented in Online Bagging. This approach
was implemented to consider the changes in the data stream with concept drift, and only
increase the size of the ensemble when required, in order to guarantee higher accuracy.
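Such an adaptation rule can be sketched as follows. This is a hypothetical illustration, not the exact AES algorithm detailed in Chapter 4; the doubling/decrement policy and the bounds are placeholder choices:

```python
def adapt_ensemble_size(current, drift_detected, n_min=2, n_max=25):
    """Grow the ensemble when drift is flagged; shrink it back toward
    the minimum during stable periods to reclaim memory.
    (Hypothetical policy for illustration only.)"""
    if drift_detected:
        # Spend more resources while the concept is changing.
        return min(current * 2, n_max)
    # Gradually release resources when the stream is stable.
    return max(current - 1, n_min)
```

The point of such a rule is that the memory cost of a large ensemble is only paid during drift, when the extra diversity actually improves accuracy.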
Although there has been some research in utility-based data mining in the past
(UBDM, 2005, 2006; Zadrozny, Weiss, and Saar-Tsechansky, 2006), it is still an emerging
field in data stream mining. Factors such as cost-sensitive learning and time have received
more attention than the cost of building and applying models, in terms of physical resources.
In a Big Data setting, we need to take resource utilization into consideration, in order to
design energy-efficient learning algorithms. Recently, the term cost sensitive adaptation
was proposed in (Žliobaite, Budka, and Stahl, 2015), and a framework was introduced
using Return on Investment (ROI) as a criterion for measuring costs and benefits in data
streams. In the work by Žliobaite, Budka, and Stahl (2015), however, the
focus is on measuring the cost of adapting a model, rather than focusing on a holistic cost
measure.
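In its simplest form, ROI relates the benefit obtained to the resources invested. A minimal sketch of the idea follows; the precise definitions of benefit (e.g. accuracy gain) and cost (e.g. RAM-hours) used in this thesis are given in Chapter 5, so the interpretation here is only indicative:

```python
def roi(benefit, cost):
    """Return on Investment: net gain relative to the amount invested.
    Positive values mean the benefit outweighed the cost."""
    return (benefit - cost) / cost
```

For example, a benefit valued at 3.0 units against a cost of 2.0 units yields an ROI of 0.5, i.e. a 50% return on the invested resources.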
In this thesis, we tackle this issue of holistic cost adaptation for data streams that contain
concept drift. Our goal is to obtain better ROI, without compromising on the accuracy
obtained during the mining process. We compare the efficiency of our implementation with
standard Online Bagging, using ROI as the measurement metric. Our results are promising.
The strength of our method is that it may be combined with many state-of-the-art ensemble
learning algorithms, in order to facilitate cost sensitive learning.
1.3 Thesis Organization
The remainder of this thesis is divided into two parts, consisting of six chapters. The
first part consists of the background study and the literature review, in Chapters 2 and 3,
respectively. Chapter 2 starts with the introduction of ideas, concepts and methodologies
in data mining, and focuses on specific concepts in data stream mining such as concept
drift and adaptivity. Utility, efficiency and cost-sensitive adaptation in data stream mining
are also discussed in this chapter. In Chapter 3, we introduce metalearning and discuss the
different algorithms that were utilized in this thesis.
We start the second part of this thesis by introducing our cost and energy efficient
approach to improving classification accuracy in a stream with concept drift in Chapter 4. A
methodology is detailed in this chapter with the preliminary experimentation and reasoning
behind our implementation. The dataset used, our experimentation setup and performance
measurement criteria, are detailed in Chapter 5. We present our experimentation results
and analysis in Chapter 6. A comparison between our implementation and the state-of-the-art
algorithms is provided in this chapter. We provide a conclusion to our research in
Chapter 7, providing possible directions for future work.
Part I
PRELIMINARIES
Chapter 2
Data Stream Mining: Fundamentals
It is important to understand data streams and the basic concepts and terminologies in
data stream mining. Since data stream mining, as a process of data analytics,
is a subset of data mining, we structure this literature review from a top-down
perspective. Data mining and data stream mining environments share several techniques
and terminologies that are frequently used in their mining processes, and the taxonomy
discussed in section 2.1 is applicable to both. In section 2.2, we
proceed with the discussion on data stream mining specifically. We also highlight concept
drift and various terminologies associated with it. In the last sections of this chapter, we
discuss the issues and applications of data stream mining.
2.1 Taxonomy of Data Mining Tasks and Techniques
There are many definitions of data mining, depending on who is asked. One general
interpretation from these definitions, however, is that data mining involves knowledge
discovery from a relatively large amount of data, referred to as KDD (Knowledge Discovery
in Databases). This involves using some form of automated computational and statistical
processes to examine large datasets in order to generate new information. These processes
are referred to as machine learning, an important application of artificial intelligence
(Russell, Norvig, Candy, Malik, and Edwards, 1996).
A taxonomy of data mining can be based on the task to be performed or the technique
in use. We will first take a look at the broader classification by the technique employed
before looking at the more specific classification by major tasks.
2.1.1 Data Mining Techniques
Data mining problems can generally be categorized into Supervised (or Predictive) and
Unsupervised (or Explanatory) based on the learning technique. Supervised learning
methods attempt to discover the relationship between input attributes (the independent
variables) and the target attributes (the dependent variables). The relationship discovered
is represented in a structure referred to
as a model. Usually, models describe and explain phenomena, which are hidden in the
dataset and can be used for predicting the values of new unlabelled data based on the
labelled training data.
Prediction methods are used to automatically build this behavioural model with the
training dataset, and are able to predict values of one or more variables for new and
unseen test data. They also develop patterns, which form the discovered knowledge, in a
way that is understandable and easy to operate upon. Some prediction oriented methods
can also help to provide understanding of the dataset. Most of these mining techniques are
based on inductive learning, where a model is constructed by generalizing from a sufficient
number of training examples. In the inductive approach, the implicit assumption is that
the trained model is applicable to future unseen data (Maimon and Rokach, 2010). The
supervised methods are implemented in a variety of domains, such as marketing, finance
and manufacturing.
Unsupervised learning refers mostly to techniques that group instances without a
pre-specified, dependent attribute. They are oriented to data interpretation, with focus on
understanding the relationship between the underlying data and its parts. They are used
to uncover hidden patterns within unlabelled data.
Many applications employ both supervised and unsupervised learning techniques.
Such an approach is referred to as semi-supervised learning. This method uses
some unlabelled data with a relatively smaller amount of labelled data for training. This is
represented with the dotted lines in figure 2.1. The four primary categorizations of these
techniques are also shown in the figure (Kotu and Deshpande, 2015). They are further
broken down based on the specific method used, but we will not go into details of these
in this review. We will discuss the algorithms that apply to this research work later in
Chapter 3.
Figure 2.1: Data Mining Techniques
Classification
The classification method is arguably the most common technique used in data mining.
Classification algorithms, also referred to as classifiers, map the input data into predefined
classes provided by the training data. As an example, a classification task might require
classifying students by those who have paid their tuition and those with outstanding
payment. In that case, a classifier labels students who have paid their tuition as “good”
and those with outstanding payment as “bad”. The classification technique is an instance
of supervised learning, because of the provided training data with correct labels to base
the learning on. In the student example, the classifier is able to identify the good and the
bad students based on the training set provided.
Regression
Regression technique estimates the relationship between different variables in a given
dataset. Using their respective models, regression algorithms, also referred to as regressors,
can predict the demand for certain products given their characteristics. Early work
on regression was published by Sir Francis Galton in his 1888 paper (Galton, 1888). He
discovered that the heights of children of tall parents tend to be slightly shorter than the
8
Chapter 2. Data Stream Mining: Fundamentals
parents’ while children of short parents are slight taller; the sample regressed towards a
population mean, hence the term regression. Regression involves fitting data with functions
or function fitting. The value of a dependent variable y is obtained by combining the
predictor variables X into a function y = f(X). The value of the known is used to
formulate the function that is used to determine the unknown.
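To make the idea of function fitting concrete, the sketch below fits a linear f by ordinary least squares on a tiny hypothetical dataset; the data values and variable names are invented for illustration only.

```python
# Hypothetical example: fit a linear function y = f(x) by ordinary
# least squares, illustrating regression as "function fitting".
xs = [1.0, 2.0, 3.0, 4.0]          # predictor variable X (known)
ys = [3.0, 5.0, 7.0, 9.0]          # dependent variable y (here y = 2x + 1)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# Closed-form least-squares estimates for slope and intercept.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

def f(x):
    """The fitted function, used to predict unknown values of y."""
    return slope * x + intercept

print(f(5.0))  # → 11.0
```

The known values (xs, ys) determine f; f is then applied to obtain the unknown, exactly as described above.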
Clustering
Besides the two supervised learning methods briefly explained above, for clustering and
association analysis, no training data is provided in the learning process. In clustering,
the algorithm identifies points within the dataset that are similar to each other and
groups them into clusters. As a result, the constituent data within each cluster are
similar to one another, and dissimilar to data elements in other clusters. Similarity is
commonly measured using the Euclidean distance or other measures such as
the Chebyshev distance and the percentage disagreement (StatSoft, 2014).
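As an illustration, the two distance measures named above can be computed as follows; the sample points are hypothetical.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def chebyshev(p, q):
    """Maximum coordinate-wise difference between two points."""
    return max(abs(a - b) for a, b in zip(p, q))

p, q = (0.0, 0.0), (3.0, 4.0)   # hypothetical data points
print(euclidean(p, q))  # → 5.0
print(chebyshev(p, q))  # → 4.0
```

A clustering algorithm would assign each point to the cluster whose centre minimizes such a distance.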
Association Analysis
A key concept in unsupervised learning is the frequent pattern, as this is an important property
of datasets. Frequent patterns are simply patterns that appear frequently in data. Frequent
pattern mining is mining items in a dataset to find relationships between them. It is
the core concept in Association Analysis (Aggarwal and Han, 2014). Unlike predictive
methods like classification and regression, this method of data mining is used to find useful
patterns in the co-occurrence of multiple items. It measures the strength of co-occurrence
amongst these items, discovering hidden patterns in the form of easily recognizable rules
in the dataset. It was made better known by Agrawal, Imieliński, and Swami (1993) through
the application of the method to determine the sets of retail items frequently purchased
together in market basket analysis.
2.1.2 Data Mining Tasks
We can categorize data mining problems based on the mining task to be performed. These
tasks are shown in figure 2.2 (Kotu and Deshpande, 2015).
Classification: This involves identifying the category that an incoming data instance belongs to,
based on already known labels of the training set. This task uses supervised classification
techniques to sort data into two or more distinct classes or buckets. Models are developed
from the training datasets and are applied on new and unseen data. An example of a
Figure 2.2: Data Mining Tasks
classification task is how spam filters detect spam emails based on known message content
and header (Kotu and Deshpande, 2015).
Regression: This is very similar to the classification task, in the sense that the prediction is
based on a previously known dataset. The main difference is that while in classification the
output variables are categorical or polynomial, in regression the output variable is numeric
(Kotu and Deshpande, 2015).
Clustering: Unlike classification where we have training dataset with known labels,
clustering identifies the natural grouping in the data set using unsupervised clustering
techniques. An example of a clustering task is the grouping of students in a classroom
based on seating arrangement. This differs from a classification task on the same
student group, which might involve classifying by gender. The key difference is that
while classification involves determining whether an attribute belongs to a known group,
clustering involves dividing data into meaningful groups that are not previously specified.
This could also be employed as a preprocessing step for a classification task (Kotu and
Deshpande, 2015).
Association Analysis: The objective of association analysis is to find patterns in the
co-occurrence of item sets. The task is to determine the relationship that exists between
different attributes in a dataset. A well known application of this is the market basket
analysis in which co-occurrences are determined between retail items within the same
customer transactions. This kind of knowledge enables the retailer to take advantage
of the association by placing these items together in the store front, or by bundling their
prices together (Kotu and Deshpande, 2015).
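A minimal sketch of how the support and confidence of an association rule could be computed over hypothetical transactions; the item names and transaction contents are invented for illustration.

```python
# Hypothetical retail transactions for a market basket example.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread"}))                 # → 0.75
print(round(confidence({"bread"}, {"butter"}), 3))  # → 0.667
```

The rule {bread} -> {butter} would be reported when both measures exceed user-chosen thresholds.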
Anomaly/Outlier Detection: The task here is to identify the data instances that
are significantly different from the others in a dataset. This involves using a supervised
or semi-supervised learning technique, and the learning accuracy improves with time.
This task is used in cases of bank fraud, intrusion detection, amongst others (Kotu and
Deshpande, 2015).
Time Series Analysis: This involves making forecast into future values or events based
on past values of the same attribute. The technique frequently used for this is regression
analysis. We may recall that in normal regression analysis, we use the values of known
variables to formulate the fitting function used to obtain the unknown, usually
of a different class. In time series analysis, we use historic data of a specific class to make
forecast about the same class. Weather forecasting and pattern recognition are typical
applications of time series analysis (Kotu and Deshpande, 2015).
Text Mining: With the relatively recent exponential increase in social media usage,
text mining has become very important in data mining and predictive analytics. As the
name suggests, this is data mining in which the input data is text. This can be in
the form of documents, emails, tweets, or Facebook posts, amongst others. The texts are first
converted to semi-structured data, where each unique word is considered as an attribute.
We can then apply any of the already mentioned data mining techniques, depending
on the task at hand. Sentiment analysis is a text mining task, where the polarity of
a sentence or phrase is determined based on comparison of the constituent words with
training examples (Kotu and Deshpande, 2015).
Feature Selection: This literally means selecting only the attributes or features that
matter at the point in time. This is usually a data preprocessing task performed before a
data mining technique is used. Feature selection is particularly useful in a regression
analysis task with many predictor variables. As the number of predictors X increases, the
computational overhead grows and the ability to obtain good models diminishes. It becomes
essential to use feature selection to reduce the number of predictors to the required
minimum while still guaranteeing a good result (Kotu and Deshpande, 2015).
2.2 Data Streams
Advances in technology in recent years have enabled us to automatically transact infor-
mation about many activities at a fast rate. Such information transactions generate huge
amounts of online data growing at an unlimited rate. This kind of continuous flow of data
is referred to as a data stream. From computer network traffic to social network streams,
it is practically impossible to do without streams of data in modern life. It becomes
imperative to be able to extract meaningful information from this vast, fast-paced data,
hence the need for data stream mining.
Data streams differ from static data in a number of aspects, and these differences are
fundamental to data stream processing. Some of the key characteristics are itemized below:
• Unboundedness: Data streams are potentially infinite in nature and they are typically
not stored in their entirety (Gaber, Zaslavsky, and Krishnaswamy, 2004; Gama,
2010).
• Temporally ordered: In most applications of data streams, the characteristics of the
stream elements evolve over time. This property is referred to as temporal locality and
adds an inherent temporal component to the data stream mining process (Aggarwal,
2007). This characteristic will be an important factor in measuring the accuracy of
the algorithms to be discussed later.
• High rate of data generation: Data stream elements are known to be generated at a
rapid rate in some applications. Thus, such elements must be processed in a timely
manner in order to keep up (Franke, 2008).
The temporally ordered characteristic causes the data distribution to change over
time, yielding the phenomenon called concept drift (Gama, Žliobaitė, Bifet, Pechenizkiy,
and Bouchachia, 2014). This area of data stream mining is frequently called evolving data
stream mining or non-stationary learning, and it is an active area of research. This
thesis is in this field of study. We discuss concept drift further in section 2.4.
2.2.1 Online vs Offline Learning
In sections 2.1.1 and 2.1.2, we discussed data mining and its various techniques and tasks.
In those discussions, we made a common assumption that the dataset to be processed
is available as a whole at the time of processing. This form of learning is referred to as
offline learning. The model produced from the offline learning process can then be used
for prediction when the training is completed. This differs considerably in the case of data
stream mining. In data stream mining, the data to be evaluated is never fully available at
one time, and can only be evaluated in sequence. Hence the method to be used should
be able to process the dataset sequentially and produce models for predictions without
having the complete dataset. Such a learning process is referred to as online learning. In
an online learning process, the model is continuously updated as more data arrives (Gama,
Žliobaitė, Bifet, Pechenizkiy, and Bouchachia, 2014).
Online learning algorithms update their model by incremental learning. In incremental
learning, the learning process takes place batch by batch, and the model is updated when
new data becomes available. It should be noted that an incremental learner does not
necessarily have to process streaming data, as is typically required of an online learner (Khreich,
Granger, Miri, and Sabourin, 2012). In online learning, data is typically processed only once,
with limited memory and time for each processed item.
Online learning methods have the advantage of processing real-time, fast-paced and
adaptive datasets, referred to as 'fast data'. Offline learning methods, on the other hand,
have the advantage of processing large datasets, called 'big data', which require longer
processing times and larger abstractions.
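The contrast can be sketched with a minimal online learner: a running-mean predictor that processes each arriving element exactly once, in constant memory, and is ready to predict at any time. This is a toy illustration of the online update loop, not an algorithm from this thesis.

```python
# Minimal sketch of online learning: the model is updated once per
# arriving element, without ever storing the full stream.
class OnlineMean:
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        """Process each stream element exactly once, in constant memory."""
        self.n += 1
        self.mean += (x - self.mean) / self.n   # incremental mean update

    def predict(self):
        """The model is ready to predict at any point in the stream."""
        return self.mean

model = OnlineMean()
for x in [2.0, 4.0, 6.0]:       # elements arrive sequentially
    model.update(x)
print(model.predict())          # → 4.0
```

An offline learner would instead collect the whole dataset first and compute the same statistic in one batch pass.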
2.2.2 Data Stream Mining
Recall that we discussed the different data mining tasks and techniques in section 2.1. While
some of these concepts apply to fast evolving data streams, the general understanding of
conventional data mining approaches is that we are processing a static, offline dataset. In
data mining, the learning algorithm can run through the same set of examples
multiple times. The processing time and memory are also relatively unlimited, as we have
control over these parameters and can alter them as we see fit. We also get the most
precise representation of the result from the learning algorithm, and we have the luxury of
adjusting different parameters multiple times to obtain even better results (Gama, 2010).
All of these points, however, are the direct opposite when we consider the process
of mining streams of data. In most instances of data stream mining, the algorithm will
only have at most one pass over the data, under limited time and processing power.
Because we usually cannot evaluate the data more than once, and only with a portion of
the dataset, only approximate accuracy is guaranteed. The data stream might also be
distributed across multiple sources, and it becomes imperative for our stream mining system
to consolidate the datasets in the mining process (Gama, 2010).
Table 2.1: Difference Between Data Mining and Data Stream Mining

                        Data Mining     Data Stream Mining
    Number of passes    Multiple        Single
    Processing time     Unlimited       Restricted
    Memory usage        Unlimited       Restricted
    Type of result      Accurate        Approximate
These key differences are summarized in table 2.1.
2.3 Methodologies of Data Stream Mining Systems
Recall that we previously itemized the characteristics of data streams in section 2.2. The
unbounded nature and high rate of data generation of a data stream limit the amount
of data that can be processed and the time in which it can be processed. This has led to the
various data stream summarization and reduction methodologies employed in mining systems.
While the classifications of techniques and tasks discussed in section 2.1 apply to both data
mining and data stream mining, the methodologies shown in figure 2.3 apply only
to stream mining, because of these characteristics. The techniques are used to
produce approximate answers from the data stream, usually by transforming the data
into a form suitable for analysis, considering that we cannot capture the entirety of a data
stream for analysis.
Figure 2.3: Taxonomy of Data Stream Mining Systems’ Methods
2.3.1 Data-Based Methods
Sampling: Sampling is the process of representing the data stream with a small sample
of the stream, by statistically selecting the elements of the incoming stream to be analysed
at periodic intervals (Toivonen, 1996). It is the easiest form of stream summarization,
and other techniques can be built from the sample. The Very Fast Machine Learning tech-
nique (Domingos and Hulten, 2000) uses the Hoeffding bound (Hoeffding, 1963) in its
measurement of the sample size, according to a loss function derived from the running
algorithm. The problem with sampling for data stream analysis is the unknown data
size and fluctuating data rates. A modification of sampling that addresses this problem is
reservoir sampling.
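Classic reservoir sampling can be sketched as follows: it maintains a uniform sample of fixed size k from a stream of unknown length in a single pass. The function and variable names are our own, for illustration.

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Keep a uniform random sample of size k from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, x in enumerate(stream):
        if i < k:
            reservoir.append(x)        # fill the reservoir first
        else:
            # Element i replaces a reservoir slot with probability k / (i + 1),
            # which keeps every element equally likely to be in the sample.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = x
    return reservoir

sample = reservoir_sample(range(10_000), k=10)
print(len(sample))  # → 10
```

The sample size stays fixed no matter how long the stream runs, which is exactly what the unknown stream length requires.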
Sketching: Unlike the other techniques, sketching trades off accuracy for storage. Sketch-
ing builds a statistical summary of the data using a small amount of memory, by vertically
sampling the incoming stream (Babcock, Babu, Datar, Motwani, and Widom, 2002;
Cormode, Datar, Indyk, and Muthukrishnan, 2002). The entirety of the dataset is repre-
sented with only the important information, using a very small amount of space. Sketching
techniques are very convenient for distributed computation over multiple streams. The
primary drawback, though, is accuracy, which makes sketching relatively ineffective in stream
mining.
Synopsis Data Structures: Creating a synopsis of data refers to the process of applying
summarization techniques that are capable of summarizing the incoming stream for further
analysis. Synopsis data structures use small-space approximation solutions to massive
dataset problems. Examples are wavelet analysis (Gilbert, Kotidis, Muthukrishnan,
and Strauss, 2003), histograms, quantiles and frequency moments (Babcock, Babu, Datar,
Motwani, and Widom, 2002). Wavelet coefficients are projections of the given set of
data onto an orthogonal set of basis vectors, and they analyse only the precise data by
detecting and determining the positions of abrupt signals. In the histogram technique, the
data is partitioned into a set of contiguous buckets, with varying width and depth based
on the partitioning rule. Histograms can be used to approximate query answers rather
than using sampling techniques.
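For instance, a simple equal-width histogram synopsis can answer approximate count queries from bucket counts instead of the raw stream. This is a minimal sketch with invented parameters, not a specific method from the cited works.

```python
# Equal-width histogram synopsis: stream elements are counted into
# contiguous buckets, and queries are answered from the counts alone.
class StreamHistogram:
    def __init__(self, lo, hi, buckets):
        self.lo = lo
        self.width = (hi - lo) / buckets
        self.counts = [0] * buckets

    def add(self, x):
        """Count an arriving element into its bucket (constant space)."""
        idx = min(int((x - self.lo) / self.width), len(self.counts) - 1)
        self.counts[idx] += 1

    def approx_count_leq(self, x):
        """Approximate number of seen elements <= x, at bucket granularity."""
        full = int((x - self.lo) / self.width)
        return sum(self.counts[:full])

h = StreamHistogram(0.0, 100.0, buckets=10)
for v in range(100):                 # simulated stream of values 0..99
    h.add(float(v))
print(h.approx_count_leq(50.0))      # → 50
```

The answer is exact here because the query boundary aligns with a bucket edge; in general it is approximate, as the section notes.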
Aggregation: Aggregation is the representation of a number of elements by one element,
using statistical measures such as the mean or variance. The primary problem with
aggregation is that it does not perform well on a stream with a fluctuating data distribution.
This approach has been successfully used in distributed stream data environments
and with continuous queries over data streams (Babu and Widom, 2001). Merging of
online stream aggregation and offline mining has been extensively studied for clustering and
classification of data streams (Aggarwal, Han, Wang, and Yu, 2003, 2004a,b).
Load Shedding: Load shedding refers to the process of eliminating a batch of subsequent
elements from being analysed (Tatbul, Çetintemel, Zdonik, Cherniack, and Stonebraker,
2003; Mayur, Babcock, Datar, and Motwani, 2003). This method is not frequently used,
because it drops chunks of data that might represent a pattern of interest. It also shares
similar problems with sampling. The Loadstar algorithm represents the first attempt at using
load shedding to solve the high-speed data stream classification problem (Chi, Wang, and
Yu, 2005).
2.3.2 Task-Based Methods
Sliding Window: The idea of the sliding window technique is that rather than running
statistical computations on all or some of the data seen so far, we make decisions based on
"recent" data only (Babcock, Babu, Datar, Motwani, and Widom, 2002). More formally,
at every time t, a new data element arrives. This element expires at time t + w, where
w is the window size or length. This technique has the advantage of reduced memory
requirements, because not all data is analysed or stored. The window size may be
fixed or variable over time (Gama, Žliobaitė, Bifet, Pechenizkiy,
and Bouchachia, 2014). For a fixed-size sliding window, a fixed number of the most recent
data elements is stored, and the oldest is discarded when new data arrives. In a data stream with
concept drift, a variable-sized window is preferred, because the window shrinks or
grows based on the data distribution. Section 2.4 discusses concept drift in more detail.
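A fixed-size sliding window is easily sketched with a bounded queue, where the oldest element is discarded automatically when a new one arrives; the window size w = 3 and the stream values are arbitrary choices for illustration.

```python
from collections import deque

w = 3                               # window size (arbitrary for this sketch)
window = deque(maxlen=w)            # bounded queue: oldest element auto-expires

for element in [10, 20, 30, 40, 50]:
    window.append(element)          # element arriving at the current time step
    # statistics are computed over "recent" data only
    mean = sum(window) / len(window)

print(list(window))  # → [30, 40, 50]
```

A variable-sized window would instead adjust `w` based on detected changes in the data distribution.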
Approximation Algorithm: Approximation algorithms are specifically designed for
computationally hard problems (Muthukrishnan, 2003). This makes them desirable for data
stream mining, given its features of continuity, speed and resource constraints. These
features, however, make it hard for approximation algorithms to provide exact solutions,
hence they provide approximate solutions with error bounds. Other tools are used with
these algorithms to adapt to the available resources. Approximation algorithms have been
used in association analysis of streaming data (Deng, 2007).
Algorithm Output Granularity: This is the first resource-aware approach to data
analysis, making it particularly applicable to data stream mining, especially streams with
concept drift (Gaber, Zaslavsky, and Krishnaswamy, 2004). The data analysis is performed
on a resource-constrained device that generates or receives streams of information. This
approach has been successfully used in clustering, classification and association analysis
(Gaber, Krishnaswamy, and Zaslavsky, 2005). The first stage of the algorithm output
granularity approach is mining. In the second stage, the algorithm is adapted to the available
resources and the streaming rate. The third and last stage involves merging the generated
knowledge structures when running out of memory.
2.3.3 Discussion
In this section, we explained the methodologies employed by data stream mining
systems. As discussed, these generally fall into two categories, based either on the data to be
processed or on the specific task. These methodologies were explained with some of their usage
scenarios. Table 2.2 provides a summary of these methods, with their advantages
and disadvantages.
Table 2.2: Summary of Data Stream Methods (Han, Kamber, and Pei, 2011)

Data-based methods:
    Sampling: choosing a data subset for analysis. Advantage: error bounds guaranteed. Disadvantage: poor for anomaly detection.
    Load shedding: ignoring a chunk of data. Advantage: efficient for queries. Disadvantage: very poor for anomaly detection.
    Sketching: random projection on the feature set. Advantage: extremely efficient. Disadvantage: may ignore relevant features.
    Synopsis structure: quick transformation. Advantage: analysis task independent. Disadvantage: not sufficient for very fast streams.
    Aggregation: compiling summary statistics. Advantage: analysis task independent. Disadvantage: may ignore relevant features.

Task-based methods:
    Sliding window: analyzing the most recent stream elements. Advantage: general. Disadvantage: ignores part of the stream.
    Approximation algorithm: algorithms with error bounds. Advantage: efficient. Disadvantage: resource adaptivity with data rates not always possible.
    Algorithm output granularity: highly resource-aware technique handling memory and fluctuating data rates. Advantage: general. Disadvantage: cost overhead of the resource-aware component.
The next section introduces the concept to represent changes in class labels in streaming
data.
2.4 Concept Drift
Concept drift refers to a change in the data distribution over time. This causes a mismatch
between the training and test datasets, and is in particular a non-stationary learning problem.
Concept drift is generally classified into four types, as shown in figure 2.4. An overview
of concept drift is provided here. For simplicity, we will restrict the number of data
chunks over time to C1 and C2 (Žliobaitė, 2010).
• Sudden drift: This represents the simplest pattern of change. It means that at time
t, a chunk of data, C1, is suddenly replaced by another chunk, C2.
• Gradual drift: This kind of drift comes in two types:
– The first type of gradual drift refers to a period when both chunks are active
and as time goes, the probability of sampling from C1 decreases, while the
Figure 2.4: Illustration of Types of Concept Drift (Žliobaitė, 2010)
probability of sampling from C2 increases. Note that at the beginning of this
type of gradual drift, more instances from C1 are visible, as instances from C2
might be easily mixed up with random noise. This type is generally the one
referred to when talking about gradual drift.
– The second type of gradual drift includes more than two chunks, however, the
difference between them is very small, and the drift is noticed only when looking
at a longer time period. This type of gradual drift is referred to as incremental
or stepwise drift.
• Reoccurring context drift: This refers to a drift in which a previously active concept
reappears after some time. This type of drift is not necessarily periodic, as it is
not clear when the source might reappear.
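The first type of gradual drift described above can be sketched by letting the probability of sampling from C2 grow during a drift interval while the probability for C1 shrinks; the interval endpoints and the linear schedule are hypothetical choices for illustration.

```python
import random

def p_c2(t, drift_start, drift_end):
    """Probability that the element at time t is drawn from chunk C2."""
    if t < drift_start:
        return 0.0                 # before the drift: only C1 is active
    if t >= drift_end:
        return 1.0                 # after the drift: only C2 is active
    # During the drift both chunks are active, with C2 growing linearly.
    return (t - drift_start) / (drift_end - drift_start)

def sample_source(t, rng, drift_start=100, drift_end=200):
    """Pick which chunk generated the element arriving at time t."""
    return "C2" if rng.random() < p_c2(t, drift_start, drift_end) else "C1"

rng = random.Random(0)
print(p_c2(100, 100, 200))  # → 0.0
print(p_c2(150, 100, 200))  # → 0.5
print(p_c2(250, 100, 200))  # → 1.0
```

Early in the drift, most sampled elements still come from C1, which matches the observation that C2 instances can initially be mistaken for random noise.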
2.4.1 Learner Adaptivity and Model Selection
Adaptivity refers to the ability of a system to adapt its behaviour according to changes in
its environment. This is particularly necessary in a stream with concept drift. The main
areas of adaptivity in a data stream framework are the following: the base learner;
learner parameterization; training set formation (set selection, set manipulation and feature
set manipulation); and rules for learners or models (Žliobaitė, 2010).
In this study, our focus will be on the adaptivity strategy that is based on training
set selection. This is due to the fact that our objective is to be able to select the
most informative examples in the data stream to learn from. This is further divided
into windowing (selecting training instances that are consecutive in time) and instance
selection (selecting training instances that are not necessarily consecutive in time) (Žliobaitė,
2010). Windowing is preferred for sudden drift, while instance selection is preferred for gradual
and reoccurring context drift. Model selection and evaluation strongly depend on the
assumption about the adaptivity strategy with which the learner will be applied.
2.4.2 Detecting and Handling Concept Drift
In this section, we will briefly review related work studying concept drift. This will give a
general overview in relation to the approach proposed in this thesis. In essence, there are
two types of approaches, namely learners that evolve and techniques that are triggered
when concept drift occurs.
Evolving Learners
Evolving learners adapt continuously, without employing an explicit change detection
mechanism, which reduces computational complexity. There are four main groups of these
techniques, as follows (Žliobaitė, 2010):
• Adaptive Ensembles – This is a classifier ensemble method that combines or selects
classification outputs of several models to get a final decision. The combination or
selection rules are called fusion rules and they are used to achieve adaptivity by
assigning weights to individual models at each instance.
• Instance weighting – The learner can consist of one algorithm or an ensemble, but
adaptivity is achieved by systematic training set formation. Ideas from boosting are
usually used to give more attention to misclassified instances.
• Feature space – These methods manipulate the feature space to achieve adaptivity, i.e. new
features are added to the training instances, either by transfer learning using information
from past model performance, or by using time.
• Base model specific – In this group of evolving learning algorithms, adaptivity is
achieved by managing a specific model design or parameter.
Learners with triggers
These methods use triggers to determine how the models or sampling should be
changed at a given time. Here, a trigger refers to the detection of a potential drift. The
main groups are as follows (Žliobaitė, 2010):
• Change detectors – This is a trigger technique, and it is related to sudden drift. The
method may be based on the raw data, a learner parameter, or the output of the learners.
The detection methods usually cut the training window at the change point, although
the window might be different in some cases.
• Training windows – Some learners in this group use heuristics related to error
monitoring to determine the training window size, using a look-up table principle:
there is an action for each possible trigger value. Others use base-learner-specific
methods or historical accuracy to determine the window size.
• Adaptive sampling – The previously listed trigger methods learn during the training
windows, or by using instance selection for testing unlabelled incoming instances.
In adaptive sampling, on the other hand, a new training set is selected based on
the relationship between the testing instance and a predefined or historic training
instance.
2.4.3 Discussion
In this section, we explained the meaning of concept drift and its different types. In
real-world data stream scenarios, it is almost impossible to have a stream without
concept drift, because of the constantly changing nature of data. It is, therefore, important
to be able to deal with it by adapting our learning process to its occurrence. The two
groups of learning methods that deal with concept drift were presented. In this thesis, we
employ a combination of the adaptive ensemble and feature space methods, as an evolving
learner technique to handle concept drift. Instead of using fusion rules, we use the memory
information from the past model as a new feature. Also, instead of assigning weights
to each model in the ensemble, we adapt the size of the ensemble.
In the next section, the utility factor of data stream mining is explained, before looking
at data stream mining challenges and applications in subsequent sections.
2.5 Utility and Efficiency Considerations in Data Streams
There has been extensive research on improving learning accuracy in data stream
mining, and indeed in the general field of data mining. There is, however, relatively little
work on economic utility considerations. In this context, utility has to do with "factors
related to acquiring data, building models, and applying models" (UBDM, 2006). While
Utility-Based Data Mining (UBDM) also covers other topics relating to utility in data
Utility-Based Data Mining (UBDM) also covers other topics relating to utility in data
mining such as cost-sensitive learning, our focus in this research is on maximizing the
economic factors in data stream mining algorithms.
Coincidentally, the term cost-sensitive adaptation was recently introduced by Žliobaite,
Budka, and Stahl (2015). This should not be mistaken for cost-sensitive learning. While
cost-sensitive learning considers different kinds of errors at the model level, cost-sensitive
adaptation considers the system-level cost of updating the model due to the evolving
nature of data streams. That paper highlighted four adaptation requirements in data
stream mining, namely one-time processing of examples; memory limitation; limited time;
and having the model ready to predict at any time. It also concluded that a
data stream mining algorithm should be able to process incoming data in linear time;
use limited memory; and execute adaptation only if the expected utility is sufficient
(Žliobaite, Budka, and Stahl, 2015).
Cost-sensitive classification and time are intrinsic factors in UBDM that have
received more attention than extrinsic factors such as the cost of building and applying
models (Zadrozny, Weiss, and Saar-Tsechansky, 2006; UBDM, 2006, 2005). With the advent
of Big Data and only finitely available resources, these factors require more priority than they
currently have. They enable us to take profitability into consideration when designing energy-
efficient learning algorithms, while still considering conventional assessment measures like
accuracy and F-measure. We will discuss cost-sensitive learning, accuracy and
F-measure further in Chapter 5, when we discuss performance evaluation.
Utility considerations such as the physical memory or computing power utilized by
the algorithm in the learning process are examples of these extrinsic factors. The overall
memory utilization by the algorithm vis-a-vis the time taken during the learning process
is considered in this approach. The computing cost becomes more important when we
consider that most cloud services bill using Random Access Memory (RAM) utilized
per hour of using the service. Google Cloud Platform and Amazon Elastic Compute
Cloud (Amazon EC2) are examples of cloud services that take this into consideration in
their billing processes (Google, 2014; Amazon, 2014). RAM-hour was introduced as an
evaluation measure in (Bifet, Holmes, Pfahringer, and Frank, 2010d). Žliobaite, Budka,
and Stahl (2015) proposed a method for quantifying the gain in performance for each
resource invested, using Return on Investment (ROI). The ROI on RAM-hour over time,
between two periods of adaptation T and T + 1, is given as:

    ROI_T = γ_T / ψ_T        (2.1)
where γ_T is the change in prediction accuracy (or error) between adaptations, called
the gain of the adaptation, and ψ_T is the overall adaptation cost. The adaptation cost ψ_T
is approximated as the computational cost ψ_T^com, and it is measured in RAM-hours (Rh).
The ROI can be used either as a performance statistic measuring algorithm effectiveness
over time, or as a comparison measure between an adaptive and a non-adaptive model.
To compute the average ROI over all adaptations, a weighted average is given as:

    ROI = ( Σ_{i=1}^{T′} N_i × ROI_i ) / ( Σ_{i=1}^{T′} N_i ) = ( 1 / Σ_{i=1}^{T′} N_i ) · Σ_{i=1}^{T′} ( N_i γ_i / ψ_i )        (2.2)
where T′ is the total number of adaptations and N_i is the number of samples after the
last adaptation (Žliobaite, Budka, and Stahl, 2015).
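Equations 2.1 and 2.2 can be sketched directly in code; the gains, costs and sample counts below are hypothetical values for illustration, not results from this thesis.

```python
def roi(gain, cost):
    """Equation 2.1: accuracy gain per unit of adaptation cost (RAM-hours)."""
    return gain / cost

def weighted_avg_roi(gains, costs, samples):
    """Equation 2.2: each ROI_i weighted by N_i, the samples after adaptation i."""
    total = sum(samples)
    return sum(n * roi(g, c) for g, c, n in zip(gains, costs, samples)) / total

gains   = [0.02, 0.01]   # gamma_i: accuracy gain of each adaptation (hypothetical)
costs   = [0.5, 0.25]    # psi_i: adaptation cost in RAM-hours (hypothetical)
samples = [100, 300]     # N_i: samples processed after each adaptation

print(weighted_avg_roi(gains, costs, samples))  # → 0.04
```

A negative γ_i would yield a negative ROI_i, flagging an adaptation that cost resources without improving accuracy.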
The above equations enable us to consider the value gained by the adaptive approach
employed, and allow us to make an informed decision considering the monetary implications.
The utility factor also affects the relative usefulness of the model to our system. In deciding
how many resources to allocate for adaptivity, we consider the available resources in
our system, and a weighted importance of the expected results. Different adaptation
strategies have been developed for different kinds of adaptation in data mining algorithms.
With consideration to efficiency and resources, there are five possible strategies (Žliobaite,
Budka, and Stahl, 2015):
• Fully Incremental: The model is updated using the last window and the current
window;
• Summary Incremental: The model is updated using the last window, the current
window and a summary of all the previous windows;
• Batch Incremental: A batch of data windows is kept, and updated in sequence;
• Summary Batch Incremental: Similar to the batch increment, but with a summary
of all the previous data;
• Non-incremental: A new model is built from past observations whenever required.
This means that a significant amount of data is held in memory at any given time, for
the model to be built from when required.
With these cost considerations in mind, and also considering the relative utility of
the model to be constructed, there is a need to derive a reasonable compromise. In this
research, we have employed the fully incremental approach. Our objective is to obtain a
better algorithm, with consideration to the available resources, for a stream with concept
22
Chapter 2. Data Stream Mining: Fundamentals
drift. We focus on the ROI, while making efficiency considerations based on memory usage
induced by concept drift. The aim is to develop a method that uses memory utilization in
its learner adaptivity technique, when mining in a data stream with concept drift.
2.6 Challenges in Mining Data Streams
In the previous sections, we introduced data stream mining with various considerations and methods. In this section, we provide an encompassing overview of the research challenges associated with data stream mining. These vary from the characteristics of the data and its preprocessing to the evaluation of streaming algorithms (Krempl, Žliobaite, Brzeziński, Hüllermeier, Last, Lemaire, Noack, Shaker, Sievi, Spiliopoulou, and Stefanowski, 2014). These challenges are presented below.
2.6.1 Streaming Data
The inherent characteristics of volume, velocity and volatility of data streams constitute the primary research challenges in data stream mining. Much of the discussion in the previous sections is aimed at overcoming these challenges; they are so fundamental, in fact, that they are seldom explicitly categorized as data stream mining challenges.
2.6.2 Security/Privacy
As with every other field of technology, privacy and confidentiality are very important in data stream mining, and they constitute a key challenge associated with this field of study. Privacy preserving data mining techniques have been in active research for many years and are still actively researched (Lindell and Pinkas, 2000; Malik, Ghazi, and Ali, 2012; Lu, Zhu, Liu, Liu, and Shao, 2014). Unlike in offline mining, where we can check the privacy concerns before releasing the model, in data stream mining privacy considerations and measures have to be taken beforehand, as the mining is done online. There has been research in this area (Wang and Liu, 2008), but not quite as much as desired, and no data mining or data stream mining security framework exists. The issue of data integrity also comes into consideration, because altering data attributes in a bid for higher security is bound to affect the resulting model.
Two main concerns are identified in relation to privacy in data stream mining:
• Incompleteness of Information: Because data arrive in portions at intervals, the model is never finalized; hence it becomes difficult to judge the need for privacy before seeing all the data.
• Concept Drift: Because a data stream with concept drift evolves over time, fixed privacy rules may no longer hold after some time in this kind of data stream. This could also go the other way, i.e. a stream evolving to require privacy when it originally did not. Adaptive privacy preservation mechanisms are therefore an interesting area for further research in such cases.
2.6.3 Streaming Data Management
Unlike conventional data mining, where there is the luxury of tuning the data as required before the mining process, this is not always possible in data stream mining. This section takes a brief look at challenges associated with the preprocessing and availability of streaming data.
• Preprocessing: While there are many procedures available for preprocessing offline data (García, Luengo, and Herrera, 2015), data stream preprocessing has received far less attention. This can be attributed to the fact that manual preprocessing is not feasible in streaming data mining; fully automated and autonomous methods are required in this case. It would also be necessary for such preprocessing methods to be able to update themselves as required with evolving and newly arriving data.
• Information Availability: Most data mining algorithms make the general assumption that the information related to the data is complete and ready to be processed. This is not always the case in data streams. Challenges such as how to address the unpredictability of missing value frequency, how best to feed the stream, and how to trade off speed against statistical accuracy are some of the key points to be dealt with (Nelwamondo and Marwala, 2008). Skewness in data distribution (He and Ma, 2013), latency (Krempl, 2011) and cost sensitivity issues (Spiliopoulou and Krempl, 2013) are also areas of challenge being actively researched.
2.6.4 Algorithms Evaluation
Because the characteristics of a data stream are significantly different from those of static data, most current evaluation criteria do not apply in these environments. Factors such as concept drift, latency and cost make most existing evaluation criteria insufficient. New methods have been proposed for evaluating stream accuracy, both in normal streams and in streams with concept drift involving one type of change (Gama, Sebastião, and Rodrigues, 2013; Bifet, Read, Žliobaitė, Pfahringer, and Holmes, 2013), but not much has been done for streams involving many types of changes (Brzezinski and Stefanowski, 2014b). More advanced tools for visualizing changes in algorithm prediction over time are also desired.
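Gama, Sebastião, and Rodrigues (2013) study prequential (test-then-train) evaluation, in which each arriving instance is first used to test the current model and then to train it, so accuracy can be tracked without a separate holdout set. A minimal sketch of the scheme, using a hypothetical majority-class learner in place of a real streaming classifier:

```python
from collections import Counter

def prequential_accuracy(labels):
    """Test-then-train accuracy over a stream of true labels."""
    counts, correct = Counter(), 0
    for y in labels:
        # Test: predict with the model as it stands *before* seeing y
        prediction = counts.most_common(1)[0][0] if counts else None
        correct += (prediction == y)
        # Train: update the model with the true label
        counts[y] += 1
    return correct / len(labels)

acc = prequential_accuracy(["a", "a", "a", "b", "a"])
```

Because every instance is tested before it trains the model, the resulting accuracy reflects performance on genuinely unseen data; fading factors or sliding windows can be layered on top to track drift, as in the cited work.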
2.6.5 Legacy Systems
Unlike in a controlled research environment, where we have the luxury of a distinct and controllable data source, in most real life applications this is not the case. Data is generated from multiple sources, of varying times of origin and technologies. Considering that these data sources still provide huge amounts of useful data, there may be a need to analyse them using real time streaming approaches. Given that it is not always possible to change the existing infrastructure, it becomes imperative to efficiently adapt stream mining techniques to these systems.

A recent attempt to this end is the use of data stream mining to monitor the International Space Station (ISS) Columbus Failure Management System (Noack, Luedtke, Schmitt, Noack, Schaumlöffel, Hauke, Stamminger, and Frisk, 2012). Reliability of the system is a key consideration because the lives of astronauts and the success of the mission depend on it. Another crucial factor in that mission is complexity, considering that the system consists of various legacy modules. As was discussed in section 2.6.3, challenges associated with streaming data availability were also an issue. Issues of latency and transmission interruption come into play, considering that the system needs to communicate with the ground station in near real time.
2.6.6 Model Construction
Simplicity of the model is another area of active research in data stream mining. Because of the many varying factors in a data stream, it is very tempting to parameterize the system in the name of adding features for various tunable settings. Simplicity should be the feature. Models that are constructed from a wide range of parameters become unreliable and impractical for data stream mining, because it is difficult for too many parameters to keep up with the evolving nature of data streams. Research work on tuning such model parameters has been done in offline settings (Guyon, Saffari, Dror, and Cawley, 2010), but not so much in stream mining.
Constructing models that can work for both big and fast data is still a challenge. This area of research is now actively discussed, and some research work has started appearing in that direction (Zhang, Sow, Turaga, and van der Schaar, 2014), but not quite as much as desired. Because of resource constraints, consideration should also be given to optimizing memory performance, self tuning and auto adaptivity when designing algorithms for streaming data. Methodologies are lacking for optimizing such factors in data stream mining.
2.6.7 Entity Stream Mining
In conventional data stream mining, mining algorithms learn over a single stream of arriving entities or records. Each entity is all-encompassing in the information it contains, and there is no assumption of any adjoining information for individual records arriving at a later time. This is quite different in entity stream mining, in which stream entities are linked to instances from another stream, which may lie at a past, current or future point in time. This is related to relational databases, where records are linked together with foreign keys, except that the links are not static.
An example of this is the arrival of patients at a hospital. Patients come and go at different times, and there are multiple pieces of information associated with each patient at different times. In this scenario, the entity mining task is that of creating a model that incorporates the adjoining information, irrespective of when it appears in the mining process. The unsupervised clustering task is that of adapting the clusters over time as new information becomes available, with consideration to the speed of availability. Similarly, the supervised classification task involves learning and adapting the classifier as new information becomes available.
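The aggregation step in this hospital scenario can be sketched as follows. The triple-based event format, the entity keys and the function name are illustrative assumptions, not part of the thesis:

```python
from collections import defaultdict

def aggregate_entities(events):
    """Link arriving events back to their entity.

    events: iterable of (entity_id, attribute, value) triples, in arrival
    order. For each entity, the newest value of an attribute wins, so a
    reappearing entity updates its own evolving record.
    """
    entities = defaultdict(dict)
    for entity_id, attribute, value in events:
        entities[entity_id][attribute] = value
    return dict(entities)

state = aggregate_entities([
    ("patient-1", "ward", "A"),
    ("patient-2", "ward", "B"),
    ("patient-1", "ward", "C"),   # patient-1 reappears with new information
])
```

A learner sitting downstream of this aggregation would always see the most recent consolidated view of each entity, which is exactly what makes entity drift (discussed next) visible.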
The challenge becomes that of aggregating this discrete information to its corresponding record, as well as prediction. The fact that distinct entities appear and reappear at different times also poses a learning challenge in this kind of data stream. This is because, at the point of reappearance, there might be new information associated with that entity, linking the entity to a conceptually different instance. This phenomenon is referred to as entity drift, which is related to, but distinct from, concept drift. There has been active research in this area (Spiliopoulou and Krempl, 2013), but it still remains a major challenge because of its complexity.
Event Data Analysis
Learning from the occurrence of discrete events can also be seen as a form of entity stream mining, considering that an event can be seen as a degenerate instance consisting of a single value. Events can be produced by single or multiple sources with varying features. Events are not often discussed in data stream mining, even though they are studied through Event History Analysis (EHA) in static mining environments. In this method, statistical techniques such as survival analysis and hazard rates are used to approximate the likelihood of occurrence of an event (Cox and Oakes, 1984).
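As an illustration of the kind of quantity EHA estimates, a discrete-time empirical hazard rate, i.e. the probability that an event occurs at time t given that it has not occurred before t, can be sketched as follows. The function and data are hypothetical:

```python
def empirical_hazard(event_times, horizon):
    """Discrete-time empirical hazard rate.

    event_times: observed event times, one per subject, all <= horizon.
    Returns, for each t in 1..horizon, the fraction of subjects still at
    risk at t whose event occurs exactly at t.
    """
    hazards = []
    at_risk = len(event_times)
    for t in range(1, horizon + 1):
        events_at_t = sum(1 for e in event_times if e == t)
        hazards.append(events_at_t / at_risk if at_risk else 0.0)
        at_risk -= events_at_t          # those subjects leave the risk set
    return hazards

h = empirical_hazard([1, 2, 2, 3], horizon=3)
```

In a streaming setting these per-interval rates would themselves have to be re-estimated as the underlying model changes over time, which is precisely the added complexity discussed below.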
These statistical models can also be used in similar ways for streaming data mining; however, the fact that the model may change over time introduces another level of complexity. Dealing with this kind of model change is clearly a challenge for entity analysis in data streams. While there has been work on combining EHA with other methods for event detection (Sakaki, Okazaki, and Matsuo, 2013), there is still a long way to go in this area of research, considering the need for real-time action for some events, such as intrusion detection.
2.7 Applications of Mining Data Streams with Concept
Drift
In this section, we discuss real-world applications of data stream mining in streams with naturally occurring concept drift. These problems are relevant in both supervised and unsupervised learning processes. According to Žliobaitė (2010), these applications generally fall into four categories, namely Monitoring and Control, Personal Assistance and Information, Decision Making, and Artificial Intelligence and Robotics.
2.7.1 Monitoring and Control
Monitoring and control is usually an unsupervised learning problem that detects irregular behaviour. This is typically a problem of analysing big and fast data with time sensitivity. The monitoring task could involve protection against adversarial actions. Examples of these kinds of problems are intrusion detection in network security (Day, 2013) and fraud detection in telecommunications (Hilas, 2009). Supervised and unsupervised learning techniques can also be combined, as is done for fraud detection in finance (Bolton and Hand, 2002). In all of these scenarios, the response time is crucial.
The monitoring could also be for management. This kind of application can be seen in transportation for traffic management (Crespo and Weber, 2005), positioning systems (Liao, Patterson, Fox, and Kautz, 2007) and industrial monitoring (Bakker, Pechenizkiy, Žliobaitė, Ivannikov, and Kärkkäinen, 2009). In these scenarios, changes in pattern need to be picked up with as much accuracy as possible, so as not to disrupt the flow of other activities.
2.7.2 Personal Assistance and Information
These applications are mainly concerned with the flow and organization of information, which could be either user specific or related to a business. Concerns in these sorts of applications are usually not very critical, and the class label is categorized as 'soft'. Personal assistance applications deal with personalizing information flow (Gauch, Speretta, Chandramouli, and Micarelli, 2007), referred to as information filtering. Applications related to web personalization and dynamics (Scanlan et al., 2008), textual data (Lebanon and Zhao, 2008) and spam filtering are also relevant areas in stream mining.
Profiling aggregated customer data is another application of stream mining, one that enables us to segment customers according to interest. This is particularly useful for tasks like direct marketing, where there is a requirement to classify different customers based on product or service preference. Social network analysis has also been employed for this task, and it was observed that the interests of multiple users evolve apart (Lathia, Hailes, and Capra, 2008).
Handling data distribution over a long period of time is another application of stream mining to information. The task of document organization is to extract meaningful structures, such as topics, from emails, news, or document streams. A significant project in this area was the organization and analysis of 120 years (1881-1999) of scientific articles from Science magazine, showing the emergence, peak and decline of various topics (Blei and Lafferty, 2006).
In economics, mining streams with concept drift is applicable to making macroeconomic forecasts (Giacomini and Rossi, 2009). It is also applicable in software project management, where precise engineering and logic become inaccurate if concept drift is not considered in the planning phase (Ekanayake, Tappolet, Gall, and Bernstein, 2009).
2.7.3 Decision Making
Decision making in finance, for services such as credit scoring or bankruptcy decisions, also employs data stream mining because of the variety of factors that come into play. While it is easy to see these as stationary problems, concept drift is closely related to other factors not considered in the original model. Biomedical applications are also subject to concept drift, due to the mutating and evolutionary resistance of micro-organisms to antibiotics. Concept drift applies to adaptivity to changes caused by human demographics (Kukar, 2003), the discovery of emerging resistance and nosocomial infections in hospitals (Jermaine, Ranka, and Archibald, 2008), as well as biometric authentication (Poh, Wong, Kittler, and Roli, 2009).
2.7.4 Artificial Intelligence and Robotics
In Artificial Intelligence (AI), an AI agent learns to interact with a dynamic environment using adaptive data stream mining techniques. Ubiquitous Knowledge Discovery (UKD) has to do with distributed systems operating in unstable and complex environments. Stanley, the robot that won the 2005 Defense Advanced Research Projects Agency (DARPA) Grand Challenge, is an example of a robot having to deal with the complexities of UKD using data stream mining with concept drift techniques (Thrun, Montemerlo, Dahlkamp, Stavens, Aron, Diebel, Fong, Gale, Halpenny, Hoffmann, Lau, Oakley, Palatucci, Pratt, Stang, Strohband, Dupont, Jendrossek, Koelen, Markey, Rummel, van Niekerk, Jensen, Alessandrini, Bradski, Davies, Ettinger, Kaehler, Nefian, and Mahoney, 2006).
In the current era of the Internet of Things (IoT), adaptability is also a requirement, hence the employment of adaptive data stream mining (Gubbi, Buyya, Marusic, and Palaniswami, 2013). Virtual reality is another fast growing area of technological advancement requiring adaptive stream mining, since it is mostly used for fast changing applications like computer games and flight simulations.
2.8 Summary
In this chapter, we reviewed the concept of data stream mining, starting from its origin in data mining. We analyzed the various taxonomies of data mining and data stream mining based on technique, task and streaming method. The important notion of concept drift was also visited in section 2.4, where we explained concept drift and the methods for handling it.
This chapter also presented the various challenges associated with mining data streams. We discussed the issues of security, data management, evaluation methods, legacy systems, model construction, and entity stream mining. We ended this chapter by mentioning some real-world applications of mining data streams with concept drift. In the next chapter, we consider various data mining algorithms, specifically those that apply to our current work. We will also discuss metalearning and the different ensemble mining methods, as these form the bedrock of our thesis methodology.
Chapter 3. Metalearning and Model Selection
In this chapter, we shall be looking at state-of-the-art algorithms in data stream mining, explaining their foundations and processes. We start by explaining what metalearning is and how it matters to our research. Some metalearning algorithms are discussed in section 3.2. We explain the model selection procedure in the Bagging and Boosting algorithms, as these form the basis of the online metalearning methods discussed in section 3.4. ADWIN, a technique that works well for detecting changes in data streams, is explained in section 3.3. This is particularly important in a stream with concept drift, because it allows learning algorithms to react to the changes detected.
3.1 Metalearning
Metalearning is the study of methods that exploit knowledge about knowledge, called
metaknowledge, to obtain efficient models and solutions by adapting machine learning
and data mining processes (Brazdil, Giraud-Carrier, Soares, and Vil