ADB_ Research Paper-Final-M Usman

Advance Database System May, 2015


A comparative study on data storage techniques according to the nature of data

Muhammad Usman

Kinza Sardar

University of Management and Technology (UMT)

Lahore, Pakistan

E-mail: [email protected]

Abstract

Data mining has gained much attention over the past few decades. It is regarded as one of the most accurate domains as far as security is concerned, and a great deal of work has been done in this area. It covers everything from classifying or clustering samples to matching records using mathematical, statistical, probabilistic or intelligent models. This article evaluates and compares a few of the well-known state-of-the-art data mining models.

INTRODUCTION

Data mining is the process of extracting useful and informative data from a huge pool of data. Extraction can be done using many techniques, such as classification, clustering, trend and evolution analysis, concept description and outlier analysis. The two main techniques we discuss in this paper are clustering and classification. Each of them can be carried out in many ways, depending on the nature of the system. Nowadays, data mining techniques are used in many applications, such as financial management systems, medical data analysis, telecommunications systems and others.

Classification assigns each item in a set of data to one of a predefined set of classes or groups. The classification process is done in two steps. The first step builds a model from a predetermined set of labeled data classes or concepts; because the class labels are known in advance, this step is known as supervised learning. The second step uses that model to classify new data. There are many classification techniques, such as Neural Networks [1], Fuzzy Logic, Support Vector Machine (SVM) [4,6], Decision Trees, Nearest Neighbors [2] and Artificial Immune Systems. The most common classification technique in use nowadays is SVM [4,6]. It is widely used because of its simple structure and good performance, and it can be applied to both linear and non-linear data. SVM also shows the best accuracy rates compared to the other techniques. Its only flaw is that the numeric values of an image should lie within a specific range. However, some authors [4,6] applied their algorithms to this classification technique and showed that it can be used with a wide range of values as well. Other techniques are likewise discussed and used by many authors in their research, as required by the systems they studied.

The other technique we discuss in this paper is clustering [3,5]. It is also one of the most widely used techniques in the data mining process. It is an unsupervised learning technique, as it discovers knowledge about the data by itself. In clustering, similar data items are grouped together, and different algorithms and search criteria are then applied to the groups for further processing. Each cluster has a defined centroid. The boundary of a cluster is defined by measuring the distance from each data point to the cluster centroid. Different algorithms compare these distance values to define the exact boundary of the cluster. In this process, data items of the same type with approximately the same distance measures together form a cluster.

In this paper, we discuss different systems that use different types of techniques, together with their proposed algorithms. We conclude with the best algorithm for both clustering and classification, based on a comparative analysis.

LITERATURE SURVEY

ARTIFICIAL NEURAL NETWORK

The first paper was “Vegetable Price Prediction Using Data Mining Classification Technique” [1], published in 2012 by G. M. Nasira & N. Hemageetha from India. In this paper, the authors proposed a BPNN prediction model for estimating vegetable prices in the market. They described an ANN (Artificial Neural Network) as a network consisting of many neurons that are interconnected with each other. There are many types of ANN, but the one most frequently used is the multilayer perceptron. In this type, neurons are divided among different layers. The input layer receives the signal and passes the information into the network. Neurons in the input layer do not actually perform any computation; the information is sent on to the output layer, whose neurons generate the output of the network. There may also be one or more hidden layers between the input and output layers. The other network discussed in this paper is the BPNN (Back Propagation Neural Network), which is also a multilayered neural network. The authors took this network and applied an algorithm for estimating vegetable prices. The algorithm can be described in the steps below:

Step 1: Send the normalized data as input to the network and calculate the results.

Step 2: Compare the calculated results with the actual results to find the error.

Step 3: Based on the error, adjust the membership functions and connection weights.

Step 4: If the error is greater than expected, go back to the first step; otherwise stop the process.
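The four steps above can be sketched in code. The following is a minimal illustration only, not the authors' MATLAB implementation: it assumes a single hidden layer, sigmoid activations, online weight updates, and hypothetical names and defaults (`train_bpnn`, `rate`, `target_error`). It adjusts connection weights only and does not model the membership functions mentioned in Step 3. The sample data is made up.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_bpnn(samples, hidden=3, rate=0.5, target_error=0.01, max_epochs=5000, seed=0):
    """Train a one-hidden-layer network on (inputs, target) pairs scaled to [0, 1].

    Returns the list of mean squared errors, one per epoch.
    """
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # Each hidden neuron has one weight per input plus a bias (last entry).
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(hidden)]
    # The single output neuron has one weight per hidden neuron plus a bias.
    w_out = [rng.uniform(-1, 1) for _ in range(hidden + 1)]
    history = []
    for _ in range(max_epochs):
        total = 0.0
        for inputs, target in samples:
            # Step 1: forward pass on the normalized input.
            h = [sigmoid(sum(w * x for w, x in zip(ws[:-1], inputs)) + ws[-1])
                 for ws in w_hidden]
            y = sigmoid(sum(w * v for w, v in zip(w_out[:-1], h)) + w_out[-1])
            # Step 2: compare the calculated result with the actual one.
            err = target - y
            total += err * err
            # Step 3: propagate the error backwards and adjust the weights.
            delta_out = err * y * (1.0 - y)
            delta_h = [delta_out * w_out[j] * h[j] * (1.0 - h[j]) for j in range(hidden)]
            for j in range(hidden):
                w_out[j] += rate * delta_out * h[j]
            w_out[-1] += rate * delta_out
            for j in range(hidden):
                for i in range(n_in):
                    w_hidden[j][i] += rate * delta_h[j] * inputs[i]
                w_hidden[j][-1] += rate * delta_h[j]
        mse = total / len(samples)
        history.append(mse)
        # Step 4: stop once the error is within the expected bound, else repeat.
        if mse <= target_error:
            break
    return history


# Made-up normalized samples: two inputs and one target per row.
samples = [([0.1, 0.2], 0.15), ([0.4, 0.5], 0.45), ([0.8, 0.9], 0.85), ([0.3, 0.1], 0.2)]
history = train_bpnn(samples)
```

The returned `history` makes the loop in Step 4 visible: training repeats until the mean squared error drops to the chosen bound or the epoch limit is reached.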

The flow chart of the ANN procedure consists of the following stages, in order:

1. Collect Data

2. Process Data

3. Normalize Data

4. Determine the Network

5. Select Optimization Method

6. Train Network

7. Test Network

8. Forecast the Sample

Figure 1: ANN Flow Chart

In the first step, the authors collected data. Vegetable prices are directly affected by many attributes, such as demand, season, supply, festivals and climate, and these factors make price prediction more difficult. The authors therefore took only tomato as the vegetable for price prediction. They then decided on the size and frequency of the data: weekly and monthly data were used because they contain less noise. Data from the period 2009 to 2011 was taken for model creation. The next step was to normalize the data, which is very important for speeding up the training process. There are many ways in which data can be normalized. In this paper, min-max normalization is used, which can be represented mathematically as:

\[ X' = \left(X^{new}_{max} - X^{new}_{min}\right) \cdot \frac{X_i - X_{min}}{X_{max} - X_{min}} + X^{new}_{min} \]

Here $X'$ stands for the normalized input data, $X_i$ is the actual input, $X_{max}$ and $X_{min}$ are the maximum and minimum values of the old data range, and $X^{new}_{max}$ and $X^{new}_{min}$ are the bounds of the new range, with values 1 and 0 respectively. The normalized data is used for further processing by the neural network. The data is first split into two sets, one of which is used for training and the other for validation of the network.
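As an illustration of min-max normalization, the short sketch below rescales a price series to the range 0 to 1. The function name `min_max_normalize` and its parameters are hypothetical, and the handling of a constant series is an assumption of this sketch. The sample input reuses the actual weekly prices reported in Table 1 purely as example data.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # A constant series has no spread; map every value to new_min (assumption).
        return [new_min for _ in values]
    scale = (new_max - new_min) / (old_max - old_min)
    return [(v - old_min) * scale + new_min for v in values]


weekly_prices = [26, 31, 34, 25, 18, 18, 14, 11, 9, 10]
normalized = min_max_normalize(weekly_prices)
```

With the default target range, the lowest price (9) maps to 0 and the highest (34) maps to 1, so all inputs reach the network on a common scale.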

The last step is designing the model. There are no specific rules for the design, as each problem needs a different number of neurons in the hidden layers. Increasing the number of neurons increases the computational time. The authors decided the number of neurons by considering many factors and proposed an ANN model for price prediction, which can be represented as:

[Figure: network diagram with an Input Layer, a Hidden Layer and an Output Layer]

Figure 2: Proposed Model

The authors used MATLAB to produce their results. They ran experiments on a weekly and a monthly basis. They concluded that using a neural network for predicting vegetable prices is one of the best techniques: accuracy rates were high and performance was good. Their results can be shown in tabular form as:

Week   Actual Price   Predicted Price   Error %
1      26             20                23%
2      31             26                16%
3      34             28                17%
4      25             24                4%
5      18             20                11%
6      18             18                0%
7      14             17                21%
8      11             13                18%
9      9              10                11%
10     10             10                0%

Table 1: Results
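The error column in Table 1 appears consistent with the absolute difference between actual and predicted price, divided by the actual price. This is an inference from the tabulated values rather than a formula stated by the authors, and the function name `percent_error` is hypothetical. A minimal sketch of that calculation:

```python
def percent_error(actual, predicted):
    """Absolute prediction error as a percentage of the actual price."""
    return abs(actual - predicted) / actual * 100.0


# (week, actual price, predicted price) taken from three rows of Table 1.
rows = [(1, 26, 20), (4, 25, 24), (6, 18, 18)]
errors = {week: percent_error(actual, predicted) for week, actual, predicted in rows}
```

For week 1 this gives roughly 23.1, which matches the tabulated 23% once rounded.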

KNN & C4.5 CLASSIFIERS

The second paper was “Intraurban land cover classification using IKONOS II images and data mining techniques: a comparative analysis” [2], published in 2013 by Vanessa da Silva Brum Bastos, Leila Maria Garcia Fonseca, Thales Sehn Korting, Carolina Moutinho Duque Pinho & Rafael Duarte Coelho dos Santos. This was a survey paper: the authors discussed several classification techniques and their algorithms, and concluded with the best technique after performing many experiments on each one.

The first technique discussed was the KNN classifier. The authors stated that it is an instance-based learning algorithm: the stored training instances are used directly to predict the class of new data, without an explicit training phase. The classifier uses Euclidean distance to measure how far a test object is from its neighbors. K is the number of neighbors considered: the algorithm finds the group of K training objects that are closest to a test object and labels it according to the predominant class in this group. A higher value of K makes the classifier less sensitive to noise, while an unsuitable value of K can lead to incorrect results.

The second technique was the C4.5 classifier, which generates a decision tree for classification. Each internal node of the tree is a decision node, at which a decision takes place, and each leaf represents a class assigned as the outcome of those decisions. This technique is strictly rule-based. The third classifier discussed was the multilayer perceptron (MLP) artificial neural network, which is already discussed above in this paper.

The authors used an IKONOS II image with a 1.0-m panchromatic band and 4.0-m multispectral bands, 11-bit radiometric resolution and an incidence angle of 4.85º. The bands were merged, and the resulting image was segmented into five levels. Software named Definiens was used for segmentation and feature extraction. In the model training step, the desired segmentation level was taken and divided into two training sets, one consisting of 1630 instances and the other of 819 instances. Each segment was characterized by 524 attributes. To compare the classifiers, the authors performed a Monte Carlo simulation, with 30 iterations on each classifier for each value of the GVP (Growing Variable Parameter). The GVP is the Minimum number of Objects per Leaf (MOL) for C4.5, K for KNN and the Number of Hidden Neurons (NHN) for MLP. They drew their conclusions from a large data set and concluded that selecting appropriate parameters for classification is a major concern for improving performance. The calculated parameters can be represented in tabular form as:

Technique   MOL   Accuracy   Portability
KNN         7     80%        Average
MLP         20    90%        Low
C4.5        9     85%        High

Table 2: Comparison of Techniques

Hence they showed that the MLP algorithm gives the best classification accuracy compared to the other techniques. Its only disadvantage was the difficulty of translating it to semantic networks. They used the GEPBIA, InterImage and GeoDMA software for the implementation of semantic networks.
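The KNN procedure described in this section (find the K closest training objects by Euclidean distance, then take the predominant class) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `knn_classify`, the two-feature vectors and the land-cover labels are all made up and are not taken from the surveyed paper.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Label `query` with the predominant class among its k nearest training objects.

    `train` is a list of (feature_vector, label) pairs.
    """
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature segments with made-up land-cover labels.
training = [
    ((0.1, 0.2), "vegetation"),
    ((0.2, 0.1), "vegetation"),
    ((0.15, 0.25), "vegetation"),
    ((0.8, 0.9), "roof"),
    ((0.9, 0.8), "roof"),
    ((0.85, 0.95), "roof"),
]
label = knn_classify(training, (0.2, 0.2), k=3)
```

Because the vote is taken over K neighbors, a single mislabeled training object near the query is outvoted when K is larger than 1, which is the noise effect discussed above.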

NEURAL GAS

The third paper was “A Novel Approach for Data Mining Clustering Technique using Neural Gas Algorithm” [3], published in 2014 by Mohnish Patel, Prashant Richhariya & Anurag Shrivastava from India. They discussed clustering techniques with reference to database security and compared algorithms that hide sensitive data by applying mining rules. They discussed the existing algorithms ISL, DSR, K-Means and Neural Gas.

In the ISL algorithm, there were no efficient rules that could hide the desired transaction. The DSR algorithm, on the other hand, decreases the support of the right-hand side of the rule. It modifies one item at a time in a selected transaction, changing the item's value from 1 to 0.

The third algorithm was K-Means. Its accuracy depends on the selection of the centroid values: if each centroid represents a group of similar objects, the clustering obtained is better than in other scenarios. The main objective of this algorithm is to divide objects into classes such that objects in the same class have less distance between them than objects of different classes.

The last clustering technique on which the authors experimented was Neural Gas. This technique is based on ANN but is inspired by the SOM (Self-Organizing Map). It is good for feature-based data representation. The algorithm was named "neural gas" because of the dynamics of the feature vectors during the adaptation process, which distribute themselves like a gas within the data space.

They compared all techniques on the basis of database size as well as number of clusters. The authors concluded that the Neural Gas algorithm is best for clustering unstructured data sets efficiently, and that it also addresses imbalance issues in clustering. They also concluded that the K-Means algorithm is less sensitive to initialization. The authors further found that, for best performance, parameter values should be selected carefully. They used a large data set for experimentation and reported the best accuracy rate together with better performance. In their future work, they plan to work on the selection of parameter values.
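The K-Means objective summarized in this section (objects in the same class lie closer to each other than to objects of other classes, with results depending on the initial centroids) can be illustrated with a short sketch. It is a plain textbook version under stated assumptions, not the code used in the surveyed paper: the function name `k_means`, the fixed iteration count and the sample points are all hypothetical.

```python
import math


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it in place if the cluster is empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Made-up two-dimensional points forming two well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

The initial centroids are passed in explicitly so that the effect of different starting values on the final clusters can be tried directly.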

Execution time in milliseconds:

No. of Clusters   K-Mean+ISL+DSR   NeuralGas+IS+DSR   ISL        DSR
2                 2365 ms          5354 ms            12365 ms   9323 ms
3                 4301 ms          8313 ms            19314 ms   17375 ms
4                 15813 ms         15621 ms           37813 ms   28217 ms
5                 17342 ms         17324 ms           37435 ms   31123 ms

Table 3: Comparison according to No. of clusters
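The rule-hiding idea attributed to DSR in this section, lowering the support of a rule's right-hand side by changing one item at a time from 1 to 0, can be sketched loosely as below. This is an illustration of the idea as summarized here, not the published DSR algorithm: the function names `support` and `hide_rule`, the confidence threshold and the transactions are all assumptions.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    return sum(1 for t in transactions if items <= t) / len(transactions)


def hide_rule(transactions, lhs, rhs, min_confidence):
    """Remove one right-hand-side item at a time until lhs -> rhs falls below min_confidence."""
    data = [set(t) for t in transactions]  # work on a copy
    for t in data:
        lhs_support = support(data, lhs)
        if lhs_support == 0 or support(data, lhs | rhs) / lhs_support < min_confidence:
            break
        if lhs | rhs <= t:
            # Flip a single rhs item from present (1) to absent (0) in this transaction.
            t.discard(next(iter(rhs)))
    return data


# Made-up market-basket transactions; the sensitive rule is bread -> milk.
transactions = [{"bread", "milk"}, {"bread", "milk"}, {"bread", "milk"}, {"bread"}, {"milk"}]
hidden = hide_rule(transactions, lhs={"bread"}, rhs={"milk"}, min_confidence=0.6)
confidence = support(hidden, {"bread", "milk"}) / support(hidden, {"bread"})
```

In this example the rule's confidence starts at 0.75 and drops below the 0.6 threshold after a single item is removed, so the remaining transactions are left untouched.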

SVM (SUPPORT VECTOR MACHINE)

The fourth paper we considered was “Customer Relationship Management Classification Using Data Mining Techniques” [4], published in 2014 by S.Ummugulthum Natchiar & Dr.S.Baulkani from India. They took a highly unbalanced and very noisy data set for experimentation and used several classifiers. First they extracted the features, and then classification was applied using different data sets on the J48, Naïve Bayes, SVM and KNN classifiers. At the end, 10-fold cross-validation was applied.

The first classifier is J48. It uses the C4.5 algorithm, which is already discussed in this paper. It generates a decision tree from a labeled training data set. A decision takes place at each internal node, and the node then splits into two or more nodes according to the decision. The attribute with the highest information gain is used for the decision at each node. Splitting ends where all instances belong to the same class.

The next classifier is Naïve Bayes. It uses Bayes' theorem, which deals with probability calculations, and assumes that, for a given class, the value of each predictor is independent of the values of the other predictors.

The third and most efficient classifier discussed is SVM. It maps the data from its original dimension to a higher dimension. Using the new dimensions, a decision is made for dividing objects among classes. The decision boundary is defined on the basis of support vectors and margins.

The last classifier discussed is KNN. It compares the selected object with the training objects, and the class is assigned according to Euclidean distance, the same distance measure discussed above for the K-Means algorithm.
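The splitting criterion described for J48, choosing the attribute with the highest information gain, can be illustrated with a small sketch. It computes plain information gain as the text describes; C4.5 as published normalizes this quantity into a gain ratio, which is not shown here. The function names and the customer records are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Reduction in entropy obtained by splitting `rows` on `attribute`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical customer records; the attribute names and values are made up.
rows = [
    {"contract": "monthly", "region": "north"},
    {"contract": "monthly", "region": "south"},
    {"contract": "yearly", "region": "north"},
    {"contract": "yearly", "region": "south"},
]
labels = ["churn", "churn", "stay", "stay"]
best = max(["contract", "region"], key=lambda a: information_gain(rows, labels, a))
```

Here splitting on `contract` separates the two classes completely, so it has the higher gain and would be chosen for the decision at that node; splitting stops once every group contains a single class.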

The procedure followed by the authors in this paper consists of the following steps:

1. Enter the noisy and unbalanced data set.

2. Pre-process the data and remove features that have 90% missing values or are empty.

4. Select attributes according to the following considerations:

i. Use Information Gain.

ii. Select the maximum-gain binary features by using a ranker.

iii. Select attributes by using the full training set.

5. Split the data set into two groups, a testing data set and a training data set.

6. Apply the algorithm to the data set with 10-fold cross-validation.
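The 10-fold cross-validation in the last step can be sketched as an index split. This is a minimal illustration, not the tooling used by the authors: the function name `k_fold_indices` is hypothetical, the sample count is made up, and no shuffling or class stratification is performed.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(25, k=10))
```

Each sample lands in the test set exactly once, so every record contributes to the reported accuracy while never being used to train the model that classifies it.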

Applying the above procedure gives the following parameter values, represented in tabular form:

Parameters    J48    Naïve Bayes   SVM    KNN
Accuracy      99     94            99     98.2
ROC           0.9    1             0.7    0.9
Sensitivity   1      0.9           1      1
Specificity   0.5    0.1           0.5    0.7
Precision     1      0.9           1      0.9
Recall        1      0.9           1      0.9
CC            98.8   93.8          98.9   98.3
Error Rate    1.3    6.3           1.2    1.8

Table 4: Comparison of parameters
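Several of the parameters in Table 4 are standard measures derived from a binary confusion matrix. The sketch below shows their usual definitions; the function name `classification_metrics` and the counts are made up for illustration and are not the authors' data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive common evaluation measures from binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),  # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


# Made-up counts for an unbalanced data set: 950 positives and 50 negatives.
metrics = classification_metrics(tp=940, fp=25, tn=25, fn=10)
```

With these counts the accuracy is 0.965 while the specificity is only 0.5, which illustrates how, on an unbalanced data set, a classifier can reach high accuracy and sensitivity and still perform poorly on the minority class.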

Hence the SVM technique shows the highest accuracy rate, while Naïve Bayes resulted in the highest ROC value. As future work, the authors will use multiple algorithms together for classification so that higher performance can be achieved.



COMPARATIVE ANALYSIS

The first paper we chose for our survey is “Vegetable Price Prediction Using Data Mining Classification Technique” [1], published in 2012. In this paper, the authors proposed a BPNN prediction model for estimating vegetable prices in the market. They used MATLAB to produce their results and ran experiments on a weekly and a monthly basis. They concluded that using a neural network for predicting vegetable prices is one of the best techniques: accuracy rates were high and performance was good.

The second paper was “Intraurban land cover classification using IKONOS II images and data mining techniques: a comparative analysis” [2], published in 2013. The authors used a large amount of data. They concluded that selecting appropriate parameters for classification is a major concern for improving performance. They used three algorithms, C4.5, MLP and KNN, for classification of data. By comparing the parameter values of these techniques, they showed that the MLP algorithm gives the best classification accuracy. Its only disadvantage was the difficulty of translating it to semantic networks. They used the GEPBIA, InterImage and GeoDMA software for the implementation of semantic networks.

The third paper was “A Novel Approach for Data Mining Clustering Technique using Neural Gas Algorithm” [3], published in 2014. The authors discussed the techniques with reference to database security. They proposed an algorithm that hides sensitive data by applying mining rules, building on the already published K-Means clustering algorithm and ISL. They concluded that data hiding performs better and needs much less database scanning time. They used a large data set for experimentation and reported the best accuracy rate together with better performance.

The fourth paper we considered was “Customer Relationship Management Classification Using Data Mining Techniques” [4], published in 2014. The authors took a highly unbalanced and very noisy data set for experimentation and used several classifiers. They showed that the SVM technique gives the highest accuracy rate, while Naïve Bayes resulted in the highest ROC value. As future work, they plan to use multiple algorithms together for classification so that higher performance can be achieved.

All of the papers discussed above can be compared directly, and the comparative analysis can be represented in tabular form as:

Data Types

Techniques                     Noisy   Multi-Dimensional   Irregular   Un Balanced   Redundant   Linear
                               Data    Data                Data        Data          Data        Complexity
Neural Networks                Yes     No                  Yes         No            No          Yes
C4.5, MLP and KNN              No      No                  Yes         No            Yes         No
K-Means Clustering & ISL       Yes     Yes                 Yes         No            No          Yes
SVM                            Yes     No                  Yes         Yes           No          No

Table 5: Overall Balance

A graph of the above-mentioned parameters of accuracy and performance can be shown as:

[Figure: bar chart titled "Parametric Comparison", comparing Performance and Accuracy on a scale of 0 to 100, with one series per paper: G. M. Nasira & N. Hemageetha; Vanessa da Silva Brum Bastos, Leila Maria Garcia Fonseca, Thales Sehn Korting, Carolina Moutinho Duque Pinho & Rafael Duarte Coelho dos Santos; Mohnish Patel, Prashant Richhariya & Anurag Shrivastava; S.Ummugulthum Natchiar & Dr.S.Baulkani]

Figure 3: Parametric comparison chart

In conclusion, each technique has advantages as well as disadvantages. All authors used irregular data, and some of them used noisy data as well. The best technique we found is the combination of K-Means and ISL. This technique shows the best performance with high accuracy rates. Its authors used large databases for experimentation and provided the best database security as well. The second most accurate technique is SVM. It is widely used in classification because it provides the best results on every type of data, such as irregular and noisy data, and it is applicable to high-dimensional data as well. It gives the best accuracy rate with less computation.



REFERENCES

1. G. M. Nasira & N. Hemageetha. 2012. Vegetable Price Prediction Using Data Mining Classification Technique. Proceedings of the International Conference on Pattern Recognition, Informatics and Medical Engineering, March 21-23, 2012.

2. Prashant Vats (HMRITM, Delhi, India; J.R.N. Raj. Vidyapeeth, Udaipur, India) & Anjana Gosain (USICT, GGSIPU, Delhi, India). 2014. A Comparative Analysis of Various Cluster Detection Techniques for Data Mining. 2014 International Conference on Electronic Systems, Signal Processing and Computing Technologies.

3. Mohnish Patel, Prashant Richhariya & Anurag Shrivastava (RGPV CSE, NIRT, Bhopal, India). A Novel Approach for Data Mining Clustering Technique using NeuralGas Algorithm.

4. S.Ummugulthum Natchiar (Information Technology, Sethu Institute of Technology, Virudhu Nagar, India) & Dr.S.Baulkani (Electronics and Communication Engineering, Government College of Engineering, Tirunelveli, India). Customer Relationship Management Classification Using Data Mining Techniques. IEEE-32331.

5. Leila Maria Garcia Fonseca (Image Processing Division (DPI), National Institute for Space Research (INPE), São José dos Campos, Brazil), Thales Sehn Korting (Image Processing Division (DPI), National Institute for Space Research (INPE), São José dos Campos, Brazil) & Carolina Moutinho Duque Pinho (Centro de política e economia do setor público (CEPESP), Fundação Getúlio Vargas (FGV-SP), São Paulo, Brazil). Intraurban land cover classification using IKONOS II images and data mining techniques: a comparative analysis.
