Advance Database System May, 2015
A comparative study on data storage
techniques according to the nature of data
Muhammad Usman
Kinza Sardar
University of Management and Technology (UMT)
Lahore, Pakistan
Abstract
Data mining has attracted considerable attention over the past few decades. It is regarded as one of the most accurate domains where security is concerned, and a great deal of work has been done in this area. It covers everything from classifying or clustering samples to matching records using mathematical, statistical, probabilistic or intelligent models. This article evaluates and compares a few well-known, state-of-the-art data mining models.
INTRODUCTION
The process of extracting useful and informative data from a huge pool of data is known as data mining. Extraction can be done with many techniques, such as classification, clustering, trend and evolution analysis, concept description and outlier analysis. The two main techniques discussed in this paper are clustering and classification. Each can be carried out in many ways, depending on the nature of the system. Data mining techniques are now used in many applications, such as financial management systems, medical data analysis and telecommunication systems.
Classification assigns each item in a set of data to one of a predefined set of classes or groups. The classification process has two steps. The first step is to build a model from a predetermined set of data classes or concepts; because the class labels are known in advance, this step is known as supervised learning. The second step is to use that model to classify new data. There are many classification techniques, such as Neural Networks [1], Fuzzy Logic, Support Vector Machine (SVM) [4,6], Decision Trees, Nearest Neighbors [2] and Artificial Immune Systems. The most common classification technique in use today is SVM [4,6], which is widely used because of its simple structure and good performance. It can be applied to both linear and non-linear data, and it is reported to show better accuracy rates than the other techniques. Its main limitation is that the numeric values of the input (for example, of an image) should lie within a specific range, although some authors [4,6] applied their own algorithms to this technique and showed that it can also be used with a wide range of values. The other techniques have likewise been discussed and used by many authors, as required by the systems they studied.
The other technique discussed in this paper is clustering [3,5], which is also among the most widely used techniques in data mining. It is an unsupervised learning technique, because it discovers knowledge about the data by itself. In clustering, similar data items are grouped together, and different algorithms and search criteria are then applied to the groups for further processing. Each cluster has a defined cluster centroid. The boundary of a cluster is defined by measuring the distance from a data point to the cluster centroid, and different algorithms are used to compare the distance values and fix the exact boundary. Data items of the same type, with approximately the same distance measures, together form a cluster.
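The centroid-based grouping described above can be sketched in Python as follows. This is a minimal illustration and is not taken from any of the surveyed papers: the function names, the sample points, the starting centroids and the choice of Euclidean distance are all assumptions made for the example.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign_to_clusters(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    labels = []
    for p in points:
        distances = [euclidean(p, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels

def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:
            new_centroids.append(old)  # an empty cluster keeps its old centroid
            continue
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centroids

# Hypothetical two-dimensional sample data and starting centroids.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
labels = assign_to_clusters(points, centroids)
centroids = update_centroids(points, labels, centroids)
```

Repeating the two functions until the labels stop changing gives the usual centroid-based clustering loop.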
In this paper, we discuss different systems that use different techniques, together with their proposed algorithms. We conclude, through comparative analysis, with the best algorithm for both the clustering and the classification techniques of data mining.
LITERATURE SURVEY
ARTIFICIAL NEURAL NETWORK
The first paper is “Vegetable Price Prediction Using Data Mining Classification Technique” [1], published in 2012 by G. M. Nasira & N. Hemageetha from India. The authors propose a BPNN prediction model for estimating vegetable prices in the market. They describe an ANN (Artificial Neural Network) as a network of many interconnected neurons. There are many types of ANN, but the most frequently used is the multilayer perceptron, in which the neurons are divided among layers. The input layer receives the signal and passes the information into the network. Neurons in the input layer do not perform any computation; the information is sent on to the output layer, whose neurons generate the output of the network. There may also be one or more hidden layers between the input and output layers. The other network discussed in the paper is the BPNN (Back Propagation Neural Network), which is also a multilayer neural network. The authors applied the following algorithm to this network to estimate vegetable prices:
Step 1: Send the normalized data as input to the network and calculate the results.
Step 2: Compare the calculated results with the actual results to find the error.
Step 3: Based on the error, adjust the membership functions and the connection weights.
Step 4: If the error is greater than expected, go back to Step 1; otherwise stop the process.
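The four steps above can be sketched as a small back-propagation training loop. This is an illustrative sketch under stated assumptions, not the authors' MATLAB model: the network size (one input, three hidden neurons, one output), the sigmoid activation, the learning rate, the tolerance and the training pairs are all hypothetical, and only connection weights and biases are adjusted (membership functions are not modeled).

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a 1-input, one-hidden-layer, 1-output network."""
    hidden = [sigmoid(w * x + b) for w, b in zip(w_hidden, b_hidden)]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
    return hidden, output

def mean_squared_error(data, params):
    return sum((forward(x, *params)[1] - y) ** 2 for x, y in data) / len(data)

def train(data, n_hidden=3, rate=0.5, tolerance=0.001, max_epochs=5000, seed=0):
    rng = random.Random(seed)
    w_hidden = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b_hidden = [0.0] * n_hidden
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b_out = 0.0
    for _ in range(max_epochs):
        for x, y in data:
            # Step 1: feed the normalized input forward through the network.
            hidden, out = forward(x, w_hidden, b_hidden, w_out, b_out)
            # Step 2: compare the calculated result with the actual result.
            delta_out = (out - y) * out * (1.0 - out)
            # Step 3: adjust the connection weights according to the error.
            for j in range(n_hidden):
                delta_h = delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                w_out[j] -= rate * delta_out * hidden[j]
                w_hidden[j] -= rate * delta_h * x
                b_hidden[j] -= rate * delta_h
            b_out -= rate * delta_out
        # Step 4: stop once the error is within tolerance, otherwise repeat.
        if mean_squared_error(data, (w_hidden, b_hidden, w_out, b_out)) <= tolerance:
            break
    return w_hidden, b_hidden, w_out, b_out

# Hypothetical normalized (input, target) pairs.
data = [(0.0, 0.0), (0.25, 0.25), (0.5, 0.5), (0.75, 0.75), (1.0, 1.0)]
untrained = train(data, max_epochs=0)   # initial random weights only
trained = train(data)
```

Comparing `mean_squared_error(data, untrained)` with `mean_squared_error(data, trained)` shows how much the weight adjustments reduce the error.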
The ANN workflow can be represented as the following flow chart:
Collect Data
Process Data
Normalize Data
Determine the Network
Select Optimization Method
Train Network
Test Network
Forecast the Sample
Figure 1 ANN Flow Chart
In the first step, the authors collected data. Vegetable prices are directly affected by many attributes, such as demand, season, supply, festivals and climate, and these factors make price prediction more difficult. The authors therefore considered only one vegetable, the tomato, for price prediction. They then fixed the size and frequency of the data: weekly and monthly data were used because they contain less noise, and the period from 2009 to 2011 was taken for building the model. The next step was to normalize the data, which is very important for speeding up the training process. Data can be normalized in many ways; this paper uses min-max normalization, which can be represented mathematically as:

$$x' = (max_{target} - min_{target}) \cdot \frac{x_i - x_{min}}{x_{max} - x_{min}} + min_{target}$$

Here $x'$ is the normalized input, $x_i$ is the actual input, $x_{max}$ and $x_{min}$ are the maximum and minimum values of the old data range, and the maximum and minimum of the target range are 1 and 0 respectively. The normalized data are used for further processing by the neural network. The data are first split into two sets, one used for training and the other for validating the network.
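Min-max normalization to a target range can be sketched as below. The function name and its defaults are assumptions for illustration; the sample values are the actual prices of weeks 1 to 5 as listed in Table 1.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # A constant series has no spread; map every value to new_min.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (old_max - old_min)
    return [(v - old_min) * scale + new_min for v in values]

prices = [26, 31, 34, 25, 18]
normalized = min_max_normalize(prices)
```

With the default target range of 0 to 1, the lowest price maps to 0 and the highest to 1, which keeps all network inputs on a common scale.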
The last step is to design the model. There are no specific rules for the design, since each problem needs a different number of neurons in its hidden layers, and increasing the number of neurons increases the computation time. The authors chose the number of neurons by considering many factors and proposed an ANN model for price prediction, which can be represented as:
Input Layer Hidden Layer Output Layer
Figure 2 Proposed Model
The authors used MATLAB to produce their results and ran experiments on a weekly and a monthly basis. They concluded that a neural network is one of the best techniques for predicting vegetable prices, with high accuracy rates and good performance. Their results can be shown in tabular form:
Week   Actual price   Predicted price   Error %
1      26             20                23%
2      31             26                16%
3      34             28                17%
4      25             24                4%
5      18             20                11%
6      18             18                0%
7      14             17                21%
8      11             13                18%
9      9              10                11%
10     10             10                0%
Table 1 Results
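The source does not state how the Error % column was computed. The listed values are consistent with the absolute error taken as a percentage of the actual price and truncated to a whole number; the sketch below rests on that inference and is not the authors' stated formula.

```python
actual = [26, 31, 34, 25, 18, 18, 14, 11, 9, 10]
predicted = [20, 26, 28, 24, 20, 18, 17, 13, 10, 10]

def percent_error(a, p):
    """Absolute error as a whole-number percentage of the actual price (truncated)."""
    return (100 * abs(a - p)) // a

errors = [percent_error(a, p) for a, p in zip(actual, predicted)]
mean_error = sum(errors) / len(errors)  # average of the ten weekly error percentages
```

Under this assumption the ten weekly errors average about 12%.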
KNN & C4.5 CLASSIFIERS
The second paper is “Intraurban land cover classification using IKONOS II images and data mining techniques: a comparative analysis” [2], published in 2013 by Vanessa da Silva Brum Bastos, Leila Maria Garcia Fonseca, Thales Sehn Korting, Carolina Moutinho Duque Pinho & Rafael Duarte Coelho dos Santos. It is a survey paper: the authors discuss several classification techniques and their algorithms and, after performing many experiments with each, conclude with the best one.
The first technique discussed is the KNN classifier. The authors describe it as an instance-based learning algorithm: the stored training instances are used directly to predict new data, without building a model in a separate training phase. The classifier uses Euclidean distance to measure how far a test object is from its neighbors. K is the number of training objects considered: the algorithm finds the group of K training objects closest to a test object and labels the test object with the predominant class in that group. A higher value of K makes the classifier less sensitive to noise, but an unsuitable value of K can lead the experiment to incorrect results.
The second technique is the C4.5 classifier, which generates a decision tree for classification. Decisions are taken at the decision nodes of the tree, and each leaf node is assigned a class. The technique is strictly rule-based. The third classifier discussed is the multilayer perceptron artificial neural network, which has already been described above.
The authors used an IKONOS II image with a 1.0-m panchromatic band and 4.0-m multispectral bands, 11-bit radiometric resolution and an incidence angle of 4.85º. The bands were merged and the result was segmented at five levels. The Definiens software was used for segmentation and feature extraction. For model training, the desired segmentation level was taken and its segments were divided into two training sets, one with 1630 instances and the other with 819 instances; each segment was described by 524 attributes. To compare the classifiers, the authors performed a Monte Carlo simulation, running 30 iterations of each classifier for each value of the GVP (Growing Variable Parameter), which is the Minimum number of Objects per Leaf (MOL) for C4.5, K for KNN and the Number of Hidden Neurons (NHN) for MLP. Using this large data set, they concluded that selecting appropriate parameters for the classification process is a major factor in improving performance. The resulting parameters can be represented in tabular form as:
Technique   GVP value (MOL / K / NHN)   Accuracy   Portability
KNN         7                           80%        Average
MLP         20                          90%        Low
C4.5        9                           85%        High
Table 2 Comparison of Techniques
They thus showed that the MLP algorithm gives the best classification accuracy of the three techniques. Its only disadvantage is that it is difficult to translate into semantic networks. The authors used the GEPBIA, InterImage and GeoDMA software to implement the semantic networks.
NEURAL GAS
The third paper is “A Novel Approach for Data Mining Clustering Technique using Neural Gas Algorithm” [3], published in 2014 by Mohnish Patel, Prashant Richhariya & Anurag Shrivastava from India. The authors discuss clustering techniques with reference to database security and compare algorithms that hide sensitive data by applying mining rules. The existing algorithms they discuss are ISL, DSR, K-Means and Neural Gas.
In the ISL algorithm, there are no efficient rules for hiding the desired transaction. The DSR algorithm, on the other hand, decreases the support of the right-hand side of a rule: it modifies one item at a time in a selected transaction, changing that item's value from 1 to 0. The third algorithm is K-Means, whose accuracy depends on the choice of centroids: when each centroid represents a group of similar objects, the resulting clustering is better than in other scenarios. The main objective of K-Means is to divide objects into classes such that objects in the same class are closer to each other than to objects of different classes. The last clustering technique the authors experimented with is Neural Gas. It is based on ANNs and inspired by the SOM (Self-Organizing Map), and it is well suited to representing data on the basis of features. The algorithm is named "neural gas" because of the dynamics of the feature vectors during the adaptation process, which distribute themselves like a gas within the data space.
The authors compared all techniques on the basis of database size and number of clusters. They concluded that the Neural Gas algorithm is the best for clustering unstructured data sets efficiently, and that it also addresses imbalance issues in clustering. They further concluded that the K-Means algorithm is less sensitive to initialization, and that parameter values must be selected carefully for best performance. They experimented with a large data set and report the best accuracy rate together with better performance. As future work, they plan to study the selection of parameter values.
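The DSR idea described above, weakening a rule by changing an item on its right-hand side from 1 to 0 in one supporting transaction at a time, can be sketched as follows. This is an illustration of the general principle only, not the algorithm from [3]: the function names, the confidence threshold and the sample transactions are assumptions.

```python
def count(transactions, items):
    """Number of transactions in which every listed item has value 1."""
    return sum(1 for t in transactions if all(t[i] == 1 for i in items))

def support(transactions, items):
    return count(transactions, items) / len(transactions)

def confidence(transactions, lhs, rhs):
    lhs_count = count(transactions, lhs)
    return count(transactions, lhs + rhs) / lhs_count if lhs_count else 0.0

def hide_rule(transactions, lhs, rhs, min_confidence):
    """Flip one right-hand-side item from 1 to 0 per supporting transaction
    until the rule lhs -> rhs drops below min_confidence."""
    data = [dict(t) for t in transactions]  # work on a copy
    for t in data:
        if confidence(data, lhs, rhs) < min_confidence:
            break
        if all(t[i] == 1 for i in lhs + rhs):
            t[rhs[0]] = 0
    return data

# Hypothetical transactions; the sensitive rule is bread -> milk.
transactions = [
    {"bread": 1, "milk": 1},
    {"bread": 1, "milk": 1},
    {"bread": 1, "milk": 1},
    {"bread": 1, "milk": 0},
    {"bread": 0, "milk": 1},
]
hidden = hide_rule(transactions, ["bread"], ["milk"], min_confidence=0.5)
```

Each flip lowers the support of the right-hand side, so the rule's confidence falls until it is no longer reported by a miner using the same threshold.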
Time for execution in milliseconds:
No. of Clusters   K-Mean+ISL+DSR   NeuralGas+IS+DSR   ISL        DSR
2                 2365 ms          5354 ms            12365 ms   9323 ms
3                 4301 ms          8313 ms            19314 ms   17375 ms
4                 15813 ms         15621 ms           37813 ms   28217 ms
5                 17342 ms         17324 ms           37435 ms   31123 ms
Table 3: Comparison according to No. of clusters
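The Neural Gas adaptation described in this section, where every feature vector moves toward each input by an amount that depends on its distance rank, can be sketched as below. This follows the commonly described rank-based update rule and is not the implementation evaluated in [3]; the decay schedules, parameter values and sample data are assumptions.

```python
import math
import random

def neural_gas(data, n_units=2, epochs=50, eps_start=0.5, eps_end=0.01,
               lam_start=1.0, lam_end=0.1, seed=0):
    rng = random.Random(seed)
    # Start each unit (feature vector) on a randomly chosen data point.
    units = [list(p) for p in rng.sample(data, n_units)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = data[:]
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            eps = eps_start * (eps_end / eps_start) ** frac  # decaying step size
            lam = lam_start * (lam_end / lam_start) ** frac  # decaying neighborhood range
            # Rank all units by squared distance to x (rank 0 is the closest).
            ranked = sorted(units, key=lambda w: sum((a - b) ** 2 for a, b in zip(x, w)))
            for rank, w in enumerate(ranked):
                h = math.exp(-rank / lam)
                for d in range(len(w)):
                    w[d] += eps * h * (x[d] - w[d])
            step += 1
    return units

# Hypothetical data: two well-separated groups of points.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
units = sorted(neural_gas(data), key=lambda w: w[0])
```

Because every unit is updated on every input, with weights that shrink with rank, the units spread out over the data instead of depending only on the nearest-centroid assignment used by K-Means.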
SVM (SUPPORT VECTOR MACHINE)
The fourth paper we considered is “Customer Relationship Management Classification Using Data Mining Techniques” [4], published in 2014 by S.Ummugulthum Natchiar & Dr.S.Baulkani from India. The authors experimented with a highly unbalanced and very noisy data set and used several classifiers. They first extracted the features and then applied the J48, Naïve Bayes, SVM and KNN classifiers to different data sets, finishing with 10-fold cross validation.
The first classifier is J48, which uses the C4.5 algorithm already discussed in this paper. It generates a decision tree from a labeled training data set. At each node a decision is taken and the node splits into two or more child nodes; the attribute with the highest information gain is used for the decision. Splitting ends when all instances at a node belong to the same class. The next classifier is Naïve Bayes, which uses Bayes' theorem for its probability calculations. It assumes that, for a given class, the value of each predictor is independent of the values of the other predictors. The third and most efficient classifier discussed is SVM. It maps the data from the original dimension to a higher dimension and, in the new space, decides how to divide the objects among the classes; the decision boundary is built from support vectors and margins. The last classifier discussed is KNN. It compares the selected object with the training objects and assigns a class according to Euclidean distance, the same distance measure discussed above for the K-Means algorithm.
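The KNN decision described above, finding the K closest training objects by Euclidean distance and taking the predominant class, can be sketched as follows. The function name, the value of K, the feature values and the class labels are hypothetical.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """train is a list of (features, label) pairs; return the majority label
    among the k training objects closest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training objects with made-up class labels.
train = [
    ((1, 1), "loyal"), ((1, 2), "loyal"), ((2, 1), "loyal"),
    ((8, 8), "churn"), ((9, 8), "churn"),
]
label_a = knn_classify(train, (2, 2))
label_b = knn_classify(train, (8, 9))
```

A larger `k` averages over more neighbors, which dampens the effect of a single noisy training object at the cost of blurring small classes.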
The procedure followed by the authors consists of the following steps:
1. Enter the noisy and unbalanced data set.
2. Pre-process the data and remove features that have 90% missing or empty values.
4. Select attributes as follows:
i. Use Information Gain.
ii. Select the maximum-gain binary features using a ranker.
iii. Select attributes using the full training set.
5. Split the data set into two groups, a testing data set and a training data set.
6. Apply the algorithm to the data set with 10-fold cross validation.
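Two parts of the procedure above, ranking features by information gain and splitting the rows for 10-fold cross validation, can be sketched as below. This is an illustrative sketch, not the tooling used in [4]; the function names and the toy rows are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy of the labels minus the weighted entropy after splitting on feature."""
    total = len(rows)
    remainder = 0.0
    for value in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def rank_features(rows, labels):
    """Feature names ordered from highest to lowest information gain."""
    return sorted(rows[0].keys(),
                  key=lambda f: information_gain(rows, labels, f), reverse=True)

def k_fold_indices(n_rows, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Hypothetical binary features: "a" determines the label, "b" carries no information.
rows = [{"a": 1, "b": 0}, {"a": 1, "b": 1}, {"a": 0, "b": 0}, {"a": 0, "b": 1}]
labels = ["yes", "yes", "no", "no"]
ranking = rank_features(rows, labels)
splits = list(k_fold_indices(20))
```

Each of the ten splits holds out a different tenth of the rows for testing, so every row is used for testing exactly once.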
Applying the above procedure gives the following parameter values, shown in tabular form:
Parameters    J48    Naïve Bayes   SVM    KNN
Accuracy      99     94            99     98.2
ROC           0.9    1             0.7    0.9
Sensitivity   1      0.9           1      1
Specificity   0.5    0.1           0.5    0.7
Precision     1      0.9           1      0.9
Recall        1      0.9           1      0.9
CC            98.8   93.8          98.9   98.3
Error Rate    1.3    6.3           1.2    1.8
Table 4: Comparison of parameters
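Several of the parameters in Table 4 are derived from a confusion matrix. The sketch below shows the standard definitions; the counts are hypothetical and are chosen only to illustrate how an unbalanced data set can give high accuracy together with low specificity. They are not the counts from [4].

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from true/false positive and negative counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "error_rate": (fp + fn) / total,
    }

# Hypothetical unbalanced data set: 980 positive and 20 negative instances.
metrics = classification_metrics(tp=975, fp=10, tn=10, fn=5)
```

Here accuracy is 98.5% although only half of the negative instances are recognized, which is why specificity and ROC are reported alongside accuracy for unbalanced data.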
SVM thus shows the highest accuracy rate, while Naïve Bayes gives the highest ROC value. As future work, the authors plan to use multiple algorithms together for classification in order to achieve higher performance.
COMPARATIVE ANALYSIS
The first paper chosen for our survey is “Vegetable Price Prediction Using Data Mining Classification Technique” [1], published in 2012. It proposes a BPNN prediction model for estimating vegetable prices in the market. The authors used MATLAB and ran experiments on a weekly and a monthly basis. They concluded that a neural network is one of the best techniques for predicting vegetable prices, with high accuracy rates and good performance.
The second paper is “Intraurban land cover classification using IKONOS II images and data mining techniques: a comparative analysis” [2], published in 2013. The authors used a large amount of data and three algorithms, C4.5, MLP and KNN, for classification. They concluded that selecting appropriate parameters for the classification process is a major factor in improving performance. By comparing the parameter values of the three techniques, they showed that MLP gives the best classification accuracy. Its only disadvantage is that it is difficult to translate into semantic networks. The authors used the GEPBIA, InterImage and GeoDMA software to implement the semantic networks.
The third paper is “A Novel Approach for Data Mining Clustering Technique using Neural Gas Algorithm” [3], published in 2014. The authors discuss the techniques with reference to database security and propose an algorithm that hides sensitive data by applying mining rules, building on the already published K-Means clustering and ISL algorithms. They concluded that data hiding performs better and needs much less database scanning time. They experimented with a large data set and report the best accuracy rate together with better performance.
The fourth paper we considered is “Customer Relationship Management Classification Using Data Mining Techniques” [4], published in 2014. The authors experimented with a highly unbalanced and very noisy data set and used several classifiers. They showed that SVM gives the highest accuracy rate, while Naïve Bayes gives the highest ROC value. They used multiple algorithms for classification in order to achieve high performance.
The papers discussed above can be compared directly, and the comparative analysis can be represented in tabular form as:
The accuracy and performance parameters mentioned above can be shown graphically as:
Figure 3: Parametric comparison chart (Performance and Accuracy on a 0 to 100 scale, one series per surveyed paper)
Data types handled by each technique:
Techniques                     Noisy Data   Multi-Dimensional Data   Irregular Data   Un Balanced Data   Redundant Data   Linear Complexity
Neural Networks                Yes          No                       Yes              No                 No               Yes
C4.5, MLP and KNN              No           No                       Yes              No                 Yes              No
K-Means Clustering & ISL       Yes          Yes                      Yes              No                 No               Yes
SVM                            Yes          No                       Yes              Yes                No               No
Table 5: Overall Balance
In conclusion, each technique has advantages as well as disadvantages. All of the authors used irregular data, and some of them used noisy data as well. The best technique we found is the combination of K-Means and ISL, which shows the best performance with high accuracy rates; its authors used large databases for experimentation and also provided the best database security. The second most accurate technique is SVM. It is widely used in classification because it gives the best results on every type of data, including irregular and noisy data, and it is also applicable to high-dimensional data. It achieves its best accuracy rate with comparatively little computation.
REFERENCES
1. G. M. Nasira & N. Hemageetha. “Vegetable Price Prediction Using Data Mining Classification Technique.” Proceedings of the International Conference on Pattern Recognition, Informatics and Medical Engineering, March 21-23, 2012.
2. Prashant Vats (HMRITM, Delhi, India; J.R.N. Raj. Vidyapeeth, Udaipur, India) & Anjana Gosain (USICT, GGSIPU, Delhi, India). “A Comparative Analysis of Various Cluster Detection Techniques for Data Mining.” 2014 International Conference on Electronic Systems, Signal Processing and Computing Technologies.
3. Mohnish Patel, Prashant Richhariya & Anurag Shrivastava (RGPV CSE, NIRT, Bhopal, India). “A Novel Approach for Data Mining Clustering Technique using NeuralGas Algorithm.”
4. S.Ummugulthum Natchiar (Information Technology, Sethu Institute of Technology, Virudhu Nagar, India) & Dr.S.Baulkani (Electronics and Communication Engineering, Government College of Engineering, Tirunelveli, India). “Customer Relationship Management Classification Using Data Mining Techniques.” IEEE-32331.
5. Leila Maria Garcia Fonseca (Image Processing Division (DPI), National Institute for Space Research (INPE), São José dos Campos, Brazil), Thales Sehn Korting (Image Processing Division (DPI), National Institute for Space Research (INPE), São José dos Campos, Brazil) & Carolina Moutinho Duque Pinho (Centro de política e economia do setor público (CEPESP), Fundação Getúlio Vargas (FGV-SP), São Paulo, Brazil). “Intraurban land cover classification using IKONOS II images and data mining techniques: a comparative analysis.”