
LITERATURE SURVEYlib.unipune.ac.in:8080/.../123456789/8434/11/11_chapter2.pdf · 2018. 10. 19. · Distributed Algorithm for Pattern Classification and Prediction of Big Data by Using


Distributed Algorithm for Pattern Classification and Prediction of Big Data by Using Machine Learning Techniques

Chapter 2

LITERATURE SURVEY

In the literature, most authors have used different machine learning algorithms to analyse time-changing big data. In [12], the authors put forth a neural-network-based algorithm to analyse customer behaviour using a social media data set. In [13], the author explores a way to find links between two users of social media such as Twitter or Facebook using a machine learning algorithm. In [14, 15, 16], the authors throw light on big data architecture and its challenges. Isvani Frías-Blanco and José del Campo-Ávila [40] suggest a moving average method in their work on Online and Non-Parametric Drift Detection Methods Based on Hoeffding's Bounds. In Short-Term Load Forecasting Based on Big Data Technologies, Pei Zhang [5] explains a decision tree framework for forecasting short-term loads such as electricity. In the paper "Decision Tree Induction Methods and their Applications to Big Data", Petra Perner claims that decision tree induction is more suitable than traditional methods for big data mining.

A lot of work is also available on social media data analysis. Naaman et al. [31] study the characteristics of social activities and patterns of communication on Twitter. Davidov et al. [32] use hashtags and other sentiment labels for sentiment analysis. Hannon et al. [33] build an effective and efficient followee recommender system. Kwak et al. [34] propose methods to recommend influential users. Burton et al. [35] compare Twitter use within and across organizations and geographic markets. Kim et al. [36] explore how to maximize the outcomes of social media marketing (SMM) through Word-of-Mouth (WOM) marketing by identifying the core group of users.

Some work is available on decision trees and their distributed implementation. In [27], the author defines a way to extract knowledge using Decision Tree and Naïve Bayes algorithms for classification and generation of actionable knowledge for direct marketing. Literature is also available on distributed implementations: a distributed implementation of support vector machines is proposed in [21], and in [39] the author proposes a MapReduce implementation of the C4.5 decision tree algorithm.

The motivation for choosing unstructured data is shown in the graph in Figure 2.1: day by day, the usage of unstructured data is increasing relative to structured data.

Ph. D Thesis Computer Engineering 11


Figure 2.1: Motivational Graph for Unstructured Data Usage [Source: IDC’s Digital Universe Study]

In the beginning, different decision tree learning methods were used to analyse big data. Hall et al. [10] define an approach for forming learning rules from a large set of training data: a single decision system is generated from n large, independent subsets of the data. Patil et al., in contrast, use a hybrid approach that combines a genetic algorithm and a decision tree to create an optimized decision tree, thus improving the efficiency and performance of computation. Many authors in the literature have tried to exploit machine learning algorithms for different structured and unstructured data; this is summarized in Table 2.1. The table shows that most of the available literature deals with image data, which is structured.


2.1 Literature Survey on Methodologies Used for Social Data Mining

Table 2.1: Comparison of Social Data Mining Techniques

| Sr. No | Title | Technique Used | Outcome |
|---|---|---|---|
| 1 | On Distributed Fuzzy Decision Trees for Big Data (IEEE 2017) | A distributed fuzzy decision tree algorithm based on the map-reduce architecture is proposed | Implementation is based on the MLlib library |
| 2 | Sentiment Analysis of Top Colleges in India Using Twitter Data (IEEE 2016) | Naïve Bayes, Support Vector Machine and an Artificial Neural Network model | Highlights a comparison between the results obtained by exploiting the following machine learning algorithms: Naïve Bayes, Support Vector Machine and an Artificial Neural Network model |
| 3 | Mining Social Media Data for Understanding Students' Learning Experience (IEEE 2015) | SVM multi-label classifier | Focused on engineering students' Twitter posts to understand issues and problems in their educational experiences |
| 4 | Smart text-classification of user-generated data in educational social networks (IEEE 2014) | Partial-supervised learning for Hierarchical Dirichlet Process (HDP) for text classification with inherent hierarchical structure in education | More flexible way and better guide for the model learning from the unlabelled documents |
| 5 | Network-Based Modelling and Intelligent Data Mining of Social Media for Improving Care (IEEE 2015) | Network-based approach for modelling users' forum interactions, employing a network partitioning method based on optimizing a stability quality measure | Used to determine consumer opinion and identify influential users within the retrieved modules using information derived from both word-frequency data and network-based properties |
| 6 | A novel data-mining approach leveraging social media to monitor and respond to outcomes of diabetes drugs and treatment (IEEE 2013) | A novel data-mining method was developed to gauge the experiences of medical devices and drugs by patients with diabetes mellitus | Rapid data collection, feedback, and analysis that would enable improved outcomes and solutions for public health |

2.2 Literature Survey on Methodologies Used for Big Data Mining

• In [4], drift detection methods based on Hoeffding's bounds are proposed for online, non-parametric settings, and a moving average method is suggested to identify drift in the data. The authors propose methods to monitor the performance of a learning algorithm during data stream classification. To handle concept drift when it occurs, two methods are used: moving averages, for detecting sudden changes, and weighted moving averages, for detecting slow changes. The main advantage of the proposed method is that it is independent of the learning algorithm and can be used with any classifier to track concept drift. The authors used a Naive Bayes classifier and a Perceptron.

• In [5], Pei Zhang implements a short-term load forecasting model based on big data technologies. A decision tree framework is proposed to forecast short-term loads such as electricity.

• In [6], Petra Perner discusses decision tree induction methods and their applications to big data. The author explains why decision tree induction is more suitable than traditional methods.

• The paper Learning with Drift Detection by J. Gama, P. Medas, G. Castillo and P. Rodrigues (2004) [55] proposes a technique that monitors streaming data and the errors made by the classification algorithm. Concept drift is handled with time windows. Two approaches to classification are distinguished: the first relearns the model at fixed time intervals without considering whether a change has happened; the second first detects a change in the data stream and then adapts the model. An increase in the error rate indicates that concept drift has occurred. Two registers, Pmin and Smin (the minimum error rate and its standard deviation), are used to track the error rate, and together they define the warning level and the alarm level.

• The paper Early Drift Detection Method by M. Baena-Garcia, J. Campo-Avila, R. Fidalgo, A. Bifet, R. Gavalda and R. Morales-Bueno (2006) [57] proposes the EDDM method, which detects concept drift and is suited to slow, gradual changes, even when the change is very slow. The distance between two classification errors is used to detect drift. The method can handle noisy datasets and is independent of the classification algorithm. It can be used with any classification algorithm in two ways: as a wrapper around a batch learning algorithm, or implemented inside an online algorithm.

• The paper Learning from Time-Changing Data with Adaptive Windowing by Albert Bifet and Ricard Gavaldà (2007) [60] presents a method for handling concept drift when learning from time-evolving data. A sliding window technique is used in which the window size is not fixed: the window grows while the data is stationary, to achieve greater accuracy, and shrinks when drift occurs, to remove old data from the window. The authors also propose the ADWIN2 algorithm, which is time and memory efficient, and combine it with a Naïve Bayes predictor to keep the classification results up to date. An advantage is that the window size grows and shrinks automatically according to the observed rate of change. A disadvantage is that ADWIN is costly, because it analyses all the sub-windows of the current window for a suitable cut. Another is that it deletes one element each time it detects drift in the data.

• The paper Exponentially Weighted Moving Average Charts for Detecting Concept Drift by G. J. Ross, N. M. Adams, D. Tasoulis and D. Hand (2012) [59] presents the EWMA method for monitoring the error rate of a classifier. It is a single-pass, computationally efficient algorithm. It also keeps the rate of misclassified instances under control.

• The paper Active Learning with Drifting Streaming Data by Indrė Žliobaitė, Albert Bifet, Bernhard Pfahringer and Geoffrey Holmes (2014) [61] gives the background of data stream classification and presents active learning methods for processing dynamic data. The techniques manage and allocate the labelling cost over time, control the labelling so that the classifiers remain correct, and detect drift. The authors state that their analysis of the results shows the methods to be effective when the labelling budget is small. The advantages of this strategy are that it provides a base for incremental active learning and that it also works on uncertainty.
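The warning and alarm levels described above for the drift detection method of Gama et al. [55] can be illustrated with a minimal Python sketch. It is only an approximation under stated assumptions: the class name DriftDetector, the 30-sample warm-up, and the factors 2 and 3 for the warning and alarm levels are not given in this chapter; they are the values commonly quoted for DDM and are used here purely for illustration.

```python
import math


class DriftDetector:
    """Sketch of a DDM-style detector driven by the online error rate.

    Assumed (not stated in the chapter): a warm-up of 30 samples and the
    thresholds p_min + 2*s_min (warning) and p_min + 3*s_min (alarm).
    """

    def __init__(self, min_samples=30, warning_factor=2.0, alarm_factor=3.0):
        self.min_samples = min_samples
        self.warning_factor = warning_factor
        self.alarm_factor = alarm_factor
        self.n = 0                # samples seen so far
        self.errors = 0           # misclassified samples seen so far
        self.p_min = math.inf     # register Pmin: error rate at the best point
        self.s_min = math.inf     # register Smin: its standard deviation

    def update(self, error):
        """Feed 1 for a misclassified sample, 0 otherwise; return the state."""
        self.n += 1
        self.errors += error
        p = self.errors / self.n                  # current error rate
        s = math.sqrt(p * (1.0 - p) / self.n)     # its standard deviation
        if self.n < self.min_samples:
            return "stable"
        if p + s < self.p_min + self.s_min:       # new best point: update registers
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.alarm_factor * self.s_min:
            return "alarm"
        if p + s > self.p_min + self.warning_factor * self.s_min:
            return "warning"
        return "stable"
```

In use, a learner would typically start buffering recent examples at the warning level and rebuild its model from that buffer once the alarm level is reached; that part is left out of the sketch.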

Table 2.2: Comparison of Big Data Mining Techniques

| Sr. No. | Publication / Author | Paper Title | Algorithm / Tech. | Pros | Cons |
|---|---|---|---|---|---|
| 01 | F. Blanco, J. C. A., G. R. Jimenez, R. M. Bueno, and Y. C. Mota, 2015 | Online and Non-Parametric Drift Detection Methods Based on Hoeffding's Bounds | HDDM | Accurately finds drifted data and updates the model. | Issues related to speed and accuracy. |
| 02 | Yanhuang Jiang, Qiangli Zhao, Yutong Lu, 2014 | Ensemble based Data Stream Mining with Recalling and Forgetting Mechanisms | MAE | Ensemble pruning is used as a recalling mechanism to select useful component classifiers for each incoming data chunk. | More experiments are needed to optimize the values of the parameters in the MAE algorithm, such as memory capacity and forgetting factor. |
| 03 | G. Ross, N. Adams, D. Tasoulis, and D. Hand, 2012 | Exponentially weighted moving average charts for detecting concept drift | EWMA | Points from the data stream should be processed only once and discarded rather than stored in memory. | The time required to process each point should be small and constant over time. |
| 04 | F. Cao, J. Liang, L. Bai, X. Zhao, C. Dang, 2010 | A framework for clustering categorical time-evolving data | Clustering algorithm | It is effective for large datasets. It not only accurately detects the drifting concepts but also attains clustering results of better quality. | Compared with other algorithms, this algorithm needs fewer parameters, which is favourable for specific applications. |
| 05 | L. L. Minku, A. P. White, X. Yao, 2010 | The impact of diversity on online ensemble learning in the presence of concept drift | Diversity on ensemble learning | It is used to reduce the initial increase in the error caused by a drift. | An additional mechanism is required to recover from the drift and converge to the new concept. |
| 06 | A. Bifet and R. Gavaldà, SIAM Int. Conf. Data Min., 2007 | Learning from time-changing data with adaptive windowing | ADWIN | Uses sliding windows; the window grows automatically when the data is stationary, for greater accuracy, and shrinks automatically when change is taking place, to discard stale data. | It is inefficient in time and memory. Expensive because it checks all large enough sub-windows of the current window for a possible cut. |
| 07 | M. Baena, J. del Campo, R. Fidalgo, A. Bifet, R. Gavalda, and R. Morales, 2006 | Early drift detection method | EDDM | It works with slow gradual changes. It uses the distance between classification errors to detect changes. | Does not provide rigorous guarantees of performance. |
| 08 | J. Gama, P. Medas, G. Castillo, and P. Rodrigues, 2004 | Learning with drift detection | DDM | It controls the online error rate of the learning algorithm. Sudden changes are detected easily. | Does not deal with slow gradual changes. Does not provide rigorous guarantees of performance. |
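The adaptive windowing behaviour summarized for ADWIN above (the window grows on stationary data and shrinks when a change is detected) can be sketched in a few lines of Python. This is a deliberately naive illustration, not the ADWIN2 implementation: it stores the window in a plain list, tests every split point, and drops one oldest element per detected cut. The names SimpleAdaptiveWindow and cut_threshold and the default delta of 0.002 are assumptions, inputs are assumed to lie in [0, 1], and the threshold is the Hoeffding-style cut bound usually quoted for ADWIN.

```python
import math


def cut_threshold(n0, n1, delta):
    """Hoeffding-style bound for the gap between two sub-window means."""
    m = 1.0 / (1.0 / n0 + 1.0 / n1)        # harmonic-mean term of the two sizes
    delta_prime = delta / (n0 + n1)         # confidence spread over the window length
    return math.sqrt(math.log(4.0 / delta_prime) / (2.0 * m))


class SimpleAdaptiveWindow:
    """Naive adaptive window over values in [0, 1] (illustrative only)."""

    def __init__(self, delta=0.002):
        self.delta = delta
        self.window = []

    def _has_cut(self):
        # Check every split of the window into an older and a newer part.
        n = len(self.window)
        total = sum(self.window)
        head = 0.0
        for n0 in range(1, n):
            head += self.window[n0 - 1]
            n1 = n - n0
            gap = abs(head / n0 - (total - head) / n1)
            if gap > cut_threshold(n0, n1, self.delta):
                return True
        return False

    def add(self, value):
        """Append a value; drop oldest elements while some split shows a change."""
        self.window.append(value)
        shrunk = False
        while len(self.window) > 1 and self._has_cut():
            self.window.pop(0)
            shrunk = True
        return shrunk

    def mean(self):
        return sum(self.window) / len(self.window)
```

The linear scan over all splits on every insertion is what makes this naive form expensive, which matches the cost noted in the table.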


2.3 Literature Survey on Different Machine Learning (ML) Techniques Used for Sentiment Analysis

In paper [68], the authors developed a workflow that integrates qualitative analysis with large-scale data mining techniques, focusing on engineering students' Twitter posts to understand issues and problems in their educational experiences. The authors first conducted a qualitative analysis on samples taken from about 25,000 tweets related to engineering students' college life. They found that engineering students encounter problems such as heavy study load, lack of social engagement, and sleep deprivation. Based on these results, the authors implemented a multi-label classification algorithm to classify tweets reflecting students' problems.
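To make the idea of multi-label classification of tweets concrete, the sketch below trains one yes/no Naïve Bayes model per label (the "binary relevance" strategy) in plain Python. This is an assumption chosen for illustration and is not claimed to be the algorithm used in [68]; the class BinaryNB, the helper functions, the toy tweets and the label names are all hypothetical.

```python
import math
from collections import Counter


class BinaryNB:
    """Multinomial Naive Bayes for a single yes/no label, Laplace-smoothed."""

    def fit(self, docs, flags):
        self.counts = {True: Counter(), False: Counter()}
        self.n_docs = {True: 0, False: 0}
        for tokens, flag in zip(docs, flags):
            self.counts[flag].update(tokens)
            self.n_docs[flag] += 1
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def log_score(self, tokens, flag):
        total = sum(self.counts[flag].values())
        prior = (self.n_docs[flag] + 1) / (sum(self.n_docs.values()) + 2)
        score = math.log(prior)
        for t in tokens:
            if t in self.vocab:  # words never seen in training are ignored
                score += math.log((self.counts[flag][t] + 1) / (total + len(self.vocab)))
        return score

    def predict(self, tokens):
        return self.log_score(tokens, True) > self.log_score(tokens, False)


def fit_multilabel(docs, label_sets):
    """Train one binary model per label; each tweet may carry several labels."""
    labels = sorted(set().union(*label_sets))
    return {lab: BinaryNB().fit(docs, [lab in s for s in label_sets]) for lab in labels}


def predict_multilabel(models, tokens):
    return {lab for lab, model in models.items() if model.predict(tokens)}


# Hypothetical toy data, loosely modelled on the problem categories named above.
train_docs = [
    "so much homework no sleep tonight".split(),
    "three exams this week too much homework".split(),
    "up all night cannot sleep".split(),
    "no time for friends no social life".split(),
]
train_labels = [
    {"heavy_study_load", "sleep_deprivation"},
    {"heavy_study_load"},
    {"sleep_deprivation"},
    {"lack_of_social_engagement"},
]
models = fit_multilabel(train_docs, train_labels)
predicted = predict_multilabel(models, "too much homework and no sleep".split())
```

With this toy data the query tweet receives both the study-load and the sleep labels, which a single-label classifier could not express.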

In paper [69], the authors proposed partial supervised learning for HDP, which enables HDP to use partially known knowledge to guide the model learning process. This partial learning enables HDP, which is aimed at solving clustering problems, to tackle classification problems, and it also helps improve the classification accuracy. They applied the proposed partial supervised learning for HDP to classify posts (micro-blogs) in an educational environment.

In paper [70], the authors propose a novel application of text categorization to identify relevant and irrelevant micro-blogging questions asked in a classroom. Several modelling approaches and several weighting or pre-processing configurations are studied for this application through extensive experiments.

In paper [71], the authors propose a two-step analysis framework that focuses on positive and negative sentiment, as well as the side effects of treatment, in users' forum posts, and identifies user communities (modules) and influential users for the purpose of ascertaining user opinion of cancer treatment. They used a Self-Organizing Map to analyse word frequency data derived from users' forum posts. They introduced a novel network-based approach for modelling users' forum interactions and employed a network partitioning method based on optimizing a stability quality measure.

In paper [72], the authors explored privacy concerns related to mining social media networks. Specifically, they looked at the issue in a crime incident mining context, considering matters related to social media data ownership, legal protection of personal information, methods that may be used to anonymize users, and some ethical dilemmas that arise when processing identifying information, especially for an application such as a crime incident reporting tool.

In paper [73], the authors developed a novel data-mining method to gauge the experiences of medical devices and drugs by patients with diabetes mellitus. Self-organizing maps were used to analyse forum posts numerically to better understand user opinion of medical devices and drugs. The end result is a word list compilation that correlates certain positive and negative word cluster groups with medical drugs and devices. This novel data-mining method could open new avenues of research into rapid data collection, feedback, and analysis that would enable improved outcomes and solutions for public health.

In paper [74], the authors presented a scalable user-profiling solution that extracts term-based and concept-based user profiles from social media conversation data, implemented using the Apache Hadoop framework. The authors also discussed the challenges and presented some evaluation. In addition, they wish to extend the profile to include other data sources, both structured data (e.g., transaction logs) and unstructured data (e.g., mobile browsing logs), and thus be able to verify and generate more robust profiles.

Table 2.3: Comparison of Sentiment Analysis Techniques

| Sr. No | Title | Technique Used | Outcome |
|---|---|---|---|
| 01 | Mining Social Media Data for Understanding Students' Learning Experience | SVM multi-label classifier | Focused on engineering students' Twitter posts to understand issues and problems in their educational experiences |
| 02 | Smart text-classification of user-generated data in educational social networks | Partial-supervised learning for Hierarchical Dirichlet Process (HDP) for text classification with inherent hierarchical structure in education | More flexible way and better guide for the model learning from the unlabelled documents |
| 03 | Network-Based Modelling and Intelligent Data Mining of Social Media for Improving Care | Network-based approach for modelling users' interactions, employing a network partitioning method based on optimizing a stability quality measure | Used to determine consumer opinion and identify influential users within the retrieved modules using information derived from both word-frequency data and network-based properties |
| 04 | A novel data-mining approach leveraging social media to monitor and respond to outcomes of diabetes drugs and treatment | A novel data-mining method was developed to gauge the experiences of medical devices and drugs by patients with diabetes mellitus | Rapid data collection, feedback, and analysis that would enable improved outcomes and solutions for public health |

2.4 Literature Survey on Different Machine Learning (ML) Techniques Used for Distributed Data Mining

In paper [50], "A novel algorithm for distributed data mining in HDFS", the author proposes an algorithm named Association Rule Mining based on Hadoop (ARMH), which utilizes the clusters effectively to mine frequent patterns from large databases. The Hadoop distributed framework helps in managing the workload among the clusters. ARMH was implemented in Hadoop using the MapReduce programming paradigm.
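The general map-and-reduce pattern behind frequent pattern mining can be sketched in plain Python as follows. This is not the ARMH algorithm from [50] and does not use Hadoop; it only illustrates, under simplifying assumptions, how each partition can count candidate itemsets locally (map) before the partial counts are merged and filtered by a support threshold (reduce). The function names and the toy transactions are hypothetical.

```python
from collections import Counter
from itertools import combinations


def map_partition(transactions, size):
    """Map step: count every itemset of the given size inside one partition."""
    local = Counter()
    for transaction in transactions:
        for itemset in combinations(sorted(set(transaction)), size):
            local[itemset] += 1
    return local


def reduce_counts(partials):
    """Reduce step: merge the per-partition counts into global counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def frequent_itemsets(partitions, size, min_support):
    """Keep itemsets whose global count reaches the support threshold."""
    totals = reduce_counts(map_partition(p, size) for p in partitions)
    return {itemset: count for itemset, count in totals.items() if count >= min_support}


# Two hypothetical partitions, as if stored on two different nodes.
partitions = [
    [["bread", "milk"], ["bread", "butter", "milk"]],
    [["milk", "butter"], ["bread", "milk"]],
]
pairs = frequent_itemsets(partitions, size=2, min_support=2)
```

On a real cluster the map step would run in parallel on the nodes holding each partition, and only the compact count tables would travel over the network.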

Paper [51] analyses the drawbacks of existing DDM systems and puts forward a service-oriented architecture for DDM on the grid. The mining algorithms and distributed data sets in the proposed framework are abstracted as Web service resources (WS-resources), which can cooperate dynamically to perform DDM as required. Finally, a grid based on a local area network was built with Globus Toolkit 4.0 Beta, and the algorithm WS-resource and dataset WS-resource for data mining on the grid were developed.


In paper [52], "Distributed data mining: a survey", the author surveys the state-of-the-art algorithms and applications in distributed data mining and discusses future research opportunities.

In paper [53], "Study of Distributed Data Mining", the author studies distributed data mining algorithms, methods and trends for discovering knowledge from distributed data in an effective and efficient way. The author explains DDM (Distributed Data Mining) based on multi-agent systems and parallel data mining techniques.

In paper [54], "Privacy-Preserving Distributed Data Mining Techniques: A Survey", the author provides an extensive survey of different privacy-preserving data mining methods and analyses the representative techniques. The survey mainly discusses distributed privacy preservation techniques that provide secure solutions using primitive operations of cryptographic protocols such as secure multi-party computation (SMPC), secret sharing schemes (SSS) and homomorphic encryption (HE).
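As a small illustration of how secret sharing supports a secure multi-party computation, the Python sketch below lets several parties compute the sum of their private values without any party seeing another's input. It uses additive secret sharing modulo a prime, which is one textbook scheme and is not claimed to be the specific protocol discussed in [54]; the names PRIME, share and secure_sum are assumptions, and the whole exchange is simulated inside a single process.

```python
import secrets

PRIME = 2**61 - 1  # public modulus (assumed); all arithmetic is done modulo this prime


def share(value, n_parties):
    """Split a value into n random-looking shares that sum to it modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def secure_sum(private_values):
    """Simulate the protocol: owner i sends its j-th share to party j."""
    n = len(private_values)
    all_shares = [share(v, n) for v in private_values]
    # Each party adds up the shares it received; a single share reveals nothing.
    partial_sums = [sum(s[j] for s in all_shares) % PRIME for j in range(n)]
    # Publishing only the partial sums reveals the total and nothing else.
    return sum(partial_sums) % PRIME


total = secure_sum([12, 30, 8])
```

The result is exact as long as the true sum is smaller than the modulus; a real deployment would also need authenticated channels between the parties, which this sketch leaves out.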

2.5 Literature Survey on Different Machine Learning (ML) Techniques

Used for Data Mining Table 2.4: Comparison of different ML algorithms available in literature

Sr

no

Title Year methodol

ogy

Applic

ation

Classific

ation

Or

Predicti

on

Advantages Disadvant

ages

1. Robust and Effective Component-based Banknote Recognition by SURF Features

2011 Speeded Up Robust Features (SURF).

Banknote recognition for blind or visually impaired people

Classification

100% recognition rate on challenging dataset & faster

-

2. Automatic detection and classification Of objects with

2014 Support vector machine(SVM), Neural

Automatic selling of goods,

Classification

SVMs deliver a unique solution, gains

Limitation of the SV approach lies in choice of

Ph. D Thesis Computer Engineering 23

Page 14: LITERATURE SURVEYlib.unipune.ac.in:8080/.../123456789/8434/11/11_chapter2.pdf · 2018. 10. 19. · Distributed Algorithm for Pattern Classification and Prediction of Big Data by Using

Distributed Algorithm for Pattern Classification and Prediction of Big Data by Using Machine Learning Techniques

optimized search space

network. vending Machines

Stability & flexibility. It produce accurate and robust classification results.

the kernel, Speed & size.

3. A MapReduce based distributed SVM algorithm for binary classification

2013 SVM in cloud computing environment.

- prediction

Use for training big datasets.

-

4. Forgery Detection and Value Identification of Euro Banknotes

2013 Use of both hardware & software modules. Proposed approach: 1.Calibration 2.training 3.use

To detect counterfeits of Euro banknotes.

Classification

Robust to changes in environmental Lighting & non-uniformity of the infrared light.

-

5.

Banknote recognition using inductive learning

2013 RULES-3 inductive learning

Petrol station automats , Parking automats , Currency exchange machines

classification

Saves memory space. Decision can be made in a short time. Easy & cheap to develop the system

Sometimes frustrating, May reach false conclusions.

6 Employing multiple-kernel support vector

2011 Multiple kernel support vector machine

Automatic good selling machin

Classification

Suppose more counterfeitpreventive features are

The performance of SVMs largely

Ph. D Thesis Computer Engineering 24

Page 15: LITERATURE SURVEYlib.unipune.ac.in:8080/.../123456789/8434/11/11_chapter2.pdf · 2018. 10. 19. · Distributed Algorithm for Pattern Classification and Prediction of Big Data by Using

Distributed Algorithm for Pattern Classification and Prediction of Big Data by Using Machine Learning Techniques

machines for counterfeit banknote recognition

e Vending machines Automatic monetary transaction machine

added to the banknotes. Our system can still be capable of distinguishing between genuine and forged banknotes without any modification

depends on the choice of kernels.

7 ANN based currency recognition system using compressed gray scale and application for sriodell currency notes

2008 SLCRec ATM Machine

classification

Capability of separating classes properly in varying image conditions,better robustness for noise

-

8 Using hidden marcov model for paper currency recognition

2013 Hidden marcov model(HMM).

ATMs and vending machine

Accuracy and robustness

Uses size and color properties which are same for many countries

9 Recognition on Indian currency based on LBP

2012 Local binary partition(LBP)

ATMs and vending machine

classification

Simplicity and high speed,high recognition rate,good performance for low noise,low computational complexity

Cannot detect counterfeit banknotes

10 Recognition of Mexican banknote s

2012 Local binary partition(

In countries

classification

High recognition performanc

Cost is

high,canno

Ph. D Thesis Computer Engineering 25

Page 16: LITERATURE SURVEYlib.unipune.ac.in:8080/.../123456789/8434/11/11_chapter2.pdf · 2018. 10. 19. · Distributed Algorithm for Pattern Classification and Prediction of Big Data by Using

Distributed Algorithm for Pattern Classification and Prediction of Big Data by Using Machine Learning Techniques

via their color and texture features

LBP) and RGB space,LVQ network as classifier

where colors are employed to identify different denominations

e,less processing time,invariant to image rotation,

t detect

counterfeit

banknotes

11 | Support Vector Machine-Based Classification Scheme for Myoelectric Control Applied to Upper Limb | 2008 | Myoelectric control, support vector machine (SVM), MES | Hands-free human–machine interfaces for disabled people | Classification | Demonstrates exceptional accuracy, robust performance, and low computational load | Does not invalidate the achieved conclusions in the design of pattern-recognition-based myoelectric control

12 | Constructing L2-SVM-Based Fuzzy Classifiers in High-Dimensional Space With Automatic Model Selection and Fuzzy Rule Ranking | 2007 | L2-SVM | Image and video classification | Classification | Automatically chooses the number of fuzzy rules and identifies the important input features at the same time; more reasonable rule ranking scheme | High-dimensional problem; does not select variables automatically

13 | Cutting Plane Training for Linear Support Vector Machines | 2011 | SVM | Text and hypertext categorization; handwriting recognition | Classification | Asymptotic time complexity scales more reasonably; reduces training time | -

14 | Euro banknote recognition system using a three-layered model and RBF networks | 2003 | Three-layered model using a neural network; radial basis function (RBF) network | ATM | Classification | Good performance in both accepting valid banknotes and rejecting invalid data; performance of the validation part without using IR images | The three-layered model becomes smaller than the RBF network by reducing redundant input neurons

15 | Recognition System for Pakistani Paper Currency | 2013 | Euclidean distance classifier, weighted Euclidean distance classifier, k-NN classifier | ATM machines, auto-seller machines, bank money-counters | Classification | Low-cost machine working efficiently | -

16 | Automatic recognition of serial numbers in bank notes | 2014 | Feature extraction methods (gradient direction feature, Gabor feature, and CNN trainable feature); classifiers (SVM, LDF, MQDF, and CNN) | Counterfeit recognition of RMB (renminbi bank note, the paper currency used in China) | Classification | Applies cascade schemes to the context of rejection, which could dramatically reduce the number of rejected samples while achieving 100% reliability; highest test accuracy | Only used to recognize the serial number of RMB

17 | Location based Recommendation System | 2013 | A method was proposed to predict preferred restaurants based on weather and customer demographics such as age and mood; a Bayesian network was used | Social media | Prediction | Location data is used along with the user's biographic data | The data was collected manually by tracking seven volunteers in the real world

18 | Analysis of location based social media data | 2013 | How the radius of gyration varies according to city demographics such as population and household income | Social media | Classification | - | -

19 | Predicting customers purchase behaviour | 2013 | A multiclass classifier was used | Social media | Prediction | Only reflected the relationship between the user's profile and the user's purchase behaviour | Other affecting factors, like comments about the product, were not considered

20 | Employing multiple kernel Support vector machines for counterfeit banknote recognition | Chi-Yuan Yeh, Wen-Pin Su, Shie-Jue Lee | Each banknote is divided into partitions, and the luminance histograms of the partitions are taken as the input of the system. Each partition is associated with its own kernels. A linearly weighted combination is adopted to combine multiple kernels into a combined matrix. Optimal weights with kernel matrices in the combination are obtained through semi-definite programming (SDP) learning | Banking | Classification | Two strategies are adopted to reduce the amount of time and space required by the SDP method: one assumes the non-negativity of the kernel weights, and the other sets the sum of the weights to unity | The proposed approach outperforms single-kernel SVMs, standard SVMs with SDP, and multiple-SVM classifiers

21 | Using Hidden Markov Models for paper currency recognition | Hamid Hassanpour, Payam M. Farahabadi | By employing HMM, the texture characteristics of paper currencies are modelled as a random process. A similarity measure is used for classification in the proposed algorithm | Banking | Classification | The proposed algorithm can be used for distinguishing paper currency from different countries | Only texture characteristics are considered


The motivational graph given below suggests the need for research on social network big data.

Figure 2.2: Motivational Graph for Approach Selection

The graph shows that little work is available on parallel implementation and on social network data, and that few papers suggest ways to reduce the time required. Applying popular machine learning algorithms to large amounts of data raised new challenges for ML practitioners. Traditional ML libraries do not handle huge datasets well, so new approaches were needed. Parallelization using modern parallel computing frameworks such as MapReduce, CUDA, or Dryad gained popularity and acceptance, resulting in new ML libraries built on top of these frameworks.
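As an illustration of the MapReduce style of parallelization mentioned above, the following minimal Python sketch counts class labels across data partitions. It is not taken from any of the surveyed papers: the function names (map_partition, reduce_counts, class_counts) and the sample records are hypothetical, and the map step runs sequentially here, whereas a real framework would run it on separate workers.

```python
from collections import Counter
from functools import reduce


def map_partition(records):
    """Map step: count the class labels found in one data partition."""
    return Counter(label for _features, label in records)


def reduce_counts(left, right):
    """Reduce step: merge the partial counts of two partitions."""
    return left + right


def class_counts(partitions):
    """Apply the map step to every partition, then fold the partial results."""
    partial_counts = map(map_partition, partitions)
    return reduce(reduce_counts, partial_counts, Counter())


if __name__ == "__main__":
    # Hypothetical toy data: (feature vector, class label) pairs in two partitions.
    partitions = [
        [((1.0, 0.2), "genuine"), ((0.4, 0.9), "forged")],
        [((0.8, 0.1), "genuine"), ((0.7, 0.3), "genuine")],
    ]
    print(class_counts(partitions))
```

Because each partition is summarized independently and the partial counts are merged by an associative operation, the same statistics (for example, the class frequencies needed by a decision tree or Naive Bayes learner) can be gathered without moving the raw data to a single machine.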

Social Motivation

The predictive analysis of big data will help businesses understand market trends and customer behaviour, collect feedback on different products and services, and support friend recommendation and link prediction.

[Figure 2.2 (chart): vertical axis "No of Papers", scale 0 to 30; the data labels list the reference numbers of the surveyed papers in each category.]


Technical Motivation

Very little work is available on distributed big data analysis that can train on and handle large data streams. A distributed implementation will help reduce the time required for classification and prediction [1]. Previous literature has not considered data clean-up and pre-processing techniques. Most social network studies considered bibliographic information [1, 13]; location-specific information is not considered.

Educational Motivation

Prediction and classification of big data on social networks is a new area of research. It will help gather business intelligence from social media data. This study will help enhance knowledge of the distributed machine learning domain. It will add value to existing machine learning algorithms so that they work efficiently for big data.

2.6 Summary

In this chapter, a literature survey of different machine learning techniques is presented. Many authors have used different machine learning algorithms to analyse time-changing big data. Some authors have used machine learning algorithms for semantic analysis. For concept drift detection, methods such as moving average, Hoeffding bound methods, and windowing methods have been suggested. Some authors have used neural networks; their accuracy is good, but computation is slow, complexity is high, and the results are not very self-explanatory. For data mining, some authors have used support vector machines (SVM). SVM results are difficult to visualize, and the use of kernels adds further complexity. Some authors have used swarm intelligence for optimization and clustering.
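To make the Hoeffding bound idea for concept drift detection more concrete, the sketch below compares the mean error of a reference window with that of a recent window. It is only a simplified two-window check based on Hoeffding's inequality for values bounded in [0, 1], not the exact method of any surveyed paper; the function names and the default delta = 0.05 are assumptions chosen for illustration.

```python
import math


def hoeffding_epsilon(n_ref, n_recent, delta):
    """Threshold on the difference of two sample means of values in [0, 1].

    From Hoeffding's inequality, the difference of the means exceeds its
    expectation by more than this value with probability at most delta.
    """
    inverse_sizes = 1.0 / n_ref + 1.0 / n_recent
    return math.sqrt(inverse_sizes * math.log(1.0 / delta) / 2.0)


def drift_detected(reference, recent, delta=0.05):
    """Flag drift when the recent mean exceeds the reference mean by more
    than the Hoeffding threshold. Inputs are sequences of values in [0, 1],
    for example 0/1 prediction errors."""
    mean_ref = sum(reference) / len(reference)
    mean_recent = sum(recent) / len(recent)
    eps = hoeffding_epsilon(len(reference), len(recent), delta)
    return (mean_recent - mean_ref) > eps


if __name__ == "__main__":
    # Hypothetical error streams: 10% errors before, 60% errors after.
    reference_errors = [1] * 10 + [0] * 90
    recent_errors = [1] * 60 + [0] * 40
    print(drift_detected(reference_errors, recent_errors))
```

A smaller delta makes the check more conservative, since a larger gap between the two window means is then required before drift is reported.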
