A Database-Hadoop Hybrid Approach to Scalable Machine Learning
Makoto YUI, Isao Kojima, AIST, Japan <[email protected]>
June 30, 2013, IEEE BigData Congress 2013, Santa Clara

A Database-Hadoop Hybrid Approach to Scalable Machine Learning


My presentation slide at IEEE 2nd International Congress on Big Data on June 30, 2013. http://www.ieeebigdata.org/2013/


Page 1:

A Database-Hadoop Hybrid Approach to Scalable Machine Learning

Makoto YUI, Isao Kojima AIST, Japan

<[email protected]>

June 30, 2013, IEEE BigData Congress 2013, Santa Clara

Page 2:

Outline

1. Motivation & Problem Description

2. Our Hybrid Approach to Scalable Machine Learning

Architecture

Our batch learning scheme on Hive

3. Experimental Evaluation

4. Conclusions and Future Directions

Page 3:

As we have seen in the keynote and the panel discussion (2nd day) of this conference, data analytics and machine learning are clearly attracting more attention along with Big Data

Suppose, then, that you are a developer and your manager is willing to ..


Page 4:

What are the possible choices out there?

In-database analytics:
MADlib (open-source project led by Greenplum)
Bismarck (project at wisc.edu, SIGMOD'12)
SAS In-Database Analytics
Fuzzy Logix (Sybase), and more..

Machine learning on Hadoop:
Apache Mahout
Vowpal Wabbit (open-source project at MS Research)
In-house analytical tools, e.g., Twitter, SIGMOD'12


Two popular schools of thought for performing large-scale machine learning on data that does not fit in memory

Page 5:

Four issues that need to be considered

1. Scalability


Scalability is always a concern when handling Big Data

Page 6:

Four issues that need to be considered

1. Scalability

2. Data movement


Moving data becomes a critical issue as dataset sizes shift from terabytes to petabytes and beyond

Page 7:

Four issues that need to be considered

1. Scalability

2. Data movement

3. Transactions


Transactions matter for real-time/online prediction because most transaction records, which are valuable for prediction, are stored in relational databases

Page 8:

Four issues that need to be considered

1. Scalability

2. Data movement

3. Transactions

4. Latency and throughput


Latency and throughput are the key issues for achieving online prediction and/or real-time analytics

Page 9:

Which is better?


Scalability | Data movement | Transactions | Latency | Throughput

In-database analytics

Machine learning on Hadoop

+ Fault tolerance
+ Straggler node handling
+ Scale-out

Page 10:

Which is better?


Scalability | Data movement | Transactions | Latency | Throughput

In-database analytics

Machine learning on Hadoop

It depends on where the data is initially stored and the purpose of using the data

+ HDFS is useful for append-only and archiving purposes
+ ETL processing (feature engineering)

+ RDBMS is reliable as a transactional data store

Page 11:

Which is better?


Scalability | Data movement | Transactions | Latency | Throughput

In-database analytics

Machine learning on Hadoop

+ Small-fraction updates
+ Index lookup for online prediction

Page 12:

Which is better?


Scalability | Data movement | Transactions

In-database analytics

Machine learning on Hadoop

+ Incremental learning for each training instance

- High-latency bottleneck in the job submission process

+ Batch processing

Latency | Throughput
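The in-database side of this comparison can be pictured with a small sketch. The slides do not spell out the exact update rule or hyperparameters, so this is just a minimal logistic-regression SGD step over a sparse feature vector, with an illustrative learning rate and binary features assumed:

```python
import math

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp()
    return 1.0 / (1.0 + math.exp(-z))

def incremental_update(weights, features, label, lr=0.1):
    """One incremental learning step for one training instance.

    weights  -- dict: feature id -> weight (the prediction model)
    features -- active feature ids of the instance (binary features)
    label    -- +1 or -1
    """
    p = sigmoid(sum(weights.get(f, 0.0) for f in features))
    grad = p - (1.0 if label > 0 else 0.0)  # gradient of the log loss
    for f in features:
        # only the instance's active features are touched, which is why
        # small-fraction, index-lookup updates suit a relational database
        weights[f] = weights.get(f, 0.0) - lr * grad
    return weights

model = {}
incremental_update(model, [1, 7, 9], +1)  # weights of features 1, 7, 9 move up
```

Each instance touches only a handful of weights, so the step maps naturally onto indexed row updates in an RDBMS.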

Page 13:

Idea behind DB-Hadoop Hybrid approach


Scalability | Data movement | Transactions

Batch learning on Hadoop

Incremental learning and prediction in a relational database

Just an illustration, you know. Next, we will see what happens inside the box.

Latency | Throughput

Page 14:

Inside the box (an overview)

– How to combine them


Postgres

Hadoop cluster

node

node

node

・・・

OLTP transactions

Training data

Prediction model

Incremental learning

implemented as a database stored procedure

Trickle training data to Hadoop HDFS little by little and bring prediction models back periodically

Batch learning
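The trickle path can be sketched with in-memory stand-ins (a queue for the staging table, a list for the HDFS training-data sink). The batch size and polling style here are assumptions for illustration, not details from the slides:

```python
import queue

staging = queue.Queue()   # stands in for the staging table in the RDBMS
training_sink = []        # stands in for the training-data sink on HDFS

def pull_updates(batch_size=1000):
    """Periodically drain up to batch_size queued updates into the sink,
    trickling training data to Hadoop little by little."""
    batch = []
    while len(batch) < batch_size:
        try:
            batch.append(staging.get_nowait())
        except queue.Empty:
            break
    training_sink.extend(batch)
    return len(batch)

# a few (label, features) rows arrive via OLTP transactions
for row in [(+1, [1, 2]), (-1, [1, 3, 9]), (+1, [3, 8])]:
    staging.put(row)
pulled = pull_updates()
```

Because the puller only drains what is queued, the source database never blocks on Hadoop.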

Page 15:

Trickle updates

Source database

Trickle updates in the queue periodically

Hadoop cluster / Relational Database

Staging table

Pull updates in the queue

Training data sink

The Detailed Architecture ― Data-to-Prediction Cycle

Incremental Learner

Page 16:

Trickle updates

Source database

Hadoop cluster / Relational Database

Staging table

Pull updates in the queue

Training data sink

Prediction model

Batch learning process builds a model

The Detailed Architecture ― Data-to-Prediction Cycle

Incremental Learner

Up-to-date model

Export a prediction model

Trickle updates in the queue periodically

Page 17:

Trickle updates

Source database

Hadoop cluster / Relational Database

Staging table

Pull updates in the queue

Training data sink

Prediction model

Batch learning process builds a model

The Detailed Architecture ― Data-to-Prediction Cycle

Incremental Learner

Up-to-date model

Select the latest one

Insert a new one

Export a prediction model

Trickle updates in the queue periodically

Page 18:

Trickle updates

Source database

Hadoop cluster / Relational Database

Staging table

Pull updates in the queue

Training data sink

Prediction model

Batch learning process builds a model

The Detailed Architecture ― Data-to-Prediction Cycle

Incremental Learner

Up-to-date model

Select the latest one

Insert a new one

Export a prediction model

Transactional updates

Online prediction

Users can control the flow considering requirements and performance. Real-time prediction is possible using database triggers on the staging table.

Trickle updates in the queue periodically
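The model-swap step ("Insert a new one", "Select the latest one") can be sketched as a versioned model table; the table layout here is an assumption for illustration, not the paper's schema:

```python
model_table = []  # stands in for the prediction-model table: (version, weights)

def insert_model(weights):
    """Insert a newly exported model as a new row instead of overwriting,
    so online prediction never reads a half-written model."""
    version = len(model_table) + 1
    model_table.append((version, dict(weights)))
    return version

def select_latest():
    """Select the latest model, as the online predictor does."""
    return max(model_table, key=lambda row: row[0])[1]

insert_model({1: 0.4, 7: -0.2})           # model exported from a batch run
insert_model({1: 0.3, 7: -0.1, 9: 0.5})   # a newer export supersedes it
```

Keeping old versions also leaves room for rolling back to a past model if a new one misbehaves.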

Page 19:

Trickle updates

Source database

Hadoop cluster / Relational Database

Staging table

Pull updates in the queue

Training data sink

Prediction model

Batch learning process builds a model

The Detailed Architecture ― Data-to-Prediction Cycle

Incremental Learner

Up-to-date model

Select the latest one

Insert a new one

Export a prediction model

The workflow consists of continuous and independent processes

Trickle updates in the queue periodically

Page 20:

Existing Approach for Parallel Batch Learning ― Machine Learning as User-Defined Aggregates (UDAF)


(Diagram: rows of tuple<label, array<features>> are read from the training table, e.g., +1, <1,2> and -1, <1,3,9>; trainers run in parallel and emit array<weight> models; partial aggregates of array<sum of weight>, array<count> flow through merge steps into a single final merge inside the UDAF, which produces the prediction model.)

Problems observed:

Bottleneck in the final merge: scalability is limited by the maximum fan-out of the final merge

Scalar aggregates computing a single large result are not suitable for shared-nothing settings

An aggregate tree (parallel aggregation, as in Google Dremel) for merging prediction models is not supported in Hadoop/MapReduce

Even though MPP databases and Hive parallelize user-defined aggregates, the above problems prevent using them
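The UDAF scheme can be sketched as partial aggregates of per-feature weight sums and counts, merged pairwise until one final merge emits the averaged model; every partial funnels into that last call, which is the fan-out limit the slide points out. A minimal sketch (the aggregate representation is illustrative, not Hive's actual UDAF interface):

```python
def merge(a, b):
    """Combine two partial aggregates: per-feature weight sums and counts."""
    sums, counts = dict(a[0]), dict(a[1])
    for f, w in b[0].items():
        sums[f] = sums.get(f, 0.0) + w
    for f, c in b[1].items():
        counts[f] = counts.get(f, 0) + c
    return sums, counts

def final_merge(partials):
    """Every partial aggregate funnels into this single call, so
    scalability is limited by how many inputs it can absorb."""
    acc = ({}, {})
    for p in partials:
        acc = merge(acc, p)
    return {f: acc[0][f] / acc[1][f] for f in acc[0]}  # averaged weights

# Two workers, each contributing a locally trained model as (sums, counts):
model = final_merge([({1: 0.4, 7: -0.2}, {1: 1, 7: 1}),
                     ({1: 0.2, 9: 0.5}, {1: 1, 9: 1})])
```

With millions of features per partial model, the single `final_merge` node becomes the serial bottleneck.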

Page 21:

Purely Relational Approach for Parallel Learning

Implemented the trainer as a set-returning function (UDTF) instead of a UDAF: a purely relational way that scales on MPP databases and Hive/Hadoop

Shuffle by feature to Reducers

Run trainers independently on mappers and aggregate the results on reducers. Embarrassingly parallel, as the numbers of mappers and reducers are controllable.


(Diagram: rows of tuple<label, array<features>> from the training table, e.g., +1, <1,2> and -1, <1,3,9>, are trained independently on mappers by the UDTF, which emits tuple<feature, weight> rows; these are shuffled by feature to reducers, where param-mix steps produce the prediction model as Relation<feature, weights>.)

Our solution for parallel machine learning on Hadoop/Hive

SELECT
  feature,
  -- reducers perform model averaging in parallel
  avg(weight) AS weight
FROM (
  SELECT trainLogistic(features, label, ..) AS (feature, weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers

Parameter Mixing: K. B. Hall et al., in Proc. NIPS Workshop on Learning on Cores, Clusters, and Clouds, 2010.

Key points
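The query above can be mirrored in a few lines: each mapper's model is a bag of (feature, weight) rows, the shuffle groups rows by feature, and each reducer averages its group independently, with no final merge. A sketch (the shard models are made-up values for illustration):

```python
from collections import defaultdict

def parameter_mix(shard_models):
    """Average per-feature weights across independently trained models,
    the same work as GROUP BY feature / avg(weight) on the reducers."""
    buckets = defaultdict(list)
    for model in shard_models:               # each: dict feature -> weight
        for feature, weight in model.items():
            buckets[feature].append(weight)  # "shuffle by feature"
    # each feature's bucket is averaged independently -- no final merge
    return {f: sum(ws) / len(ws) for f, ws in buckets.items()}

mixed = parameter_mix([{1: 0.4, 7: -0.2},   # model from shard 1
                       {1: 0.2, 9: 0.5}])   # model from shard 2
```

Because every feature is averaged in its own group, adding reducers spreads the work instead of funneling it into one node.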

Page 22:

Experimental Evaluation

1. Compared the performance of our batch learning scheme to state-of-the-art machine learning tools, namely Bismarck and Vowpal Wabbit

2. Conducted an online prediction scenario to measure the latency and throughput of our incremental learning scheme

Dataset: the KDD Cup 2012, Track 2 dataset, one of the largest publicly available machine learning datasets, provided by a commercial search engine provider

Experimental environment: 33 in-house commodity servers (32 slave nodes for Hadoop), each equipped with 8 processors and 24 GB of memory


Given a prediction model created with 80% of the training data by batch learning, the remaining 20% of the data is supplied for incremental learning

The task is predicting the click-through rate (CTR) of search engine ads. The training data is about 235 million records, 23 GB in size.

Page 23:

Performance Evaluation of Batch Learning

Our batch learning scheme on Hive is 5x and 7.65x faster than Vowpal Wabbit and Bismarck, respectively


AUC value (Green Bar) represents prediction accuracy


Throughput: 2.3 million tuples/sec on 32 nodes
Latency: 96 sec for training 235 million records (23 GB)

CAUTION: the detailed numbers and settings can be found in our paper

Page 24:

Performance Analysis in the Evaluation


Source database

Hadoop cluster / Relational Database

Staging table

Training data sink

Prediction model

Incremental Learner

Up-to-date model

Low latency (5 sec) under moderate updates (70,000 tuples/sec)

96 sec for training; excellent throughput (2.3 million tuples/sec) on 32 nodes

Page 25:

Performance Analysis in the Evaluation


Sqoop required 3 min 32 s to migrate the 80% prediction model, containing about 1.56 million records (323 MB)

Model conversion to a dense format, which is suited for online learning/prediction on Postgres, required 58 seconds

Source database

Hadoop cluster / Relational Database

Staging table

Training data sink

Prediction model

Incremental Learner

Up-to-date model


Non-trivial costs in model migration

Page 26:

Performance Analysis in the Evaluation


Source database

Hadoop cluster / Relational Database

Staging table

Training data sink

Prediction model

Incremental Learner

Up-to-date model

“Data migration time > Training time” justifies the rationale behind in-database analytics

The cost of moving data is critical for online prediction as well as for Big Data analysis. Model migration costs can be amortized with our approach.

Key observations

Page 27:

Conclusions

A DB-Hadoop hybrid architecture for online prediction, in which the prediction model needs to be updated with low latency

A design principle for achieving scalable machine learning on Hadoop/Hive

Excellent throughput and Scalability

Our Batch learning scheme on Hive is 5 and 7.65 times faster than Vowpal Wabbit and Bismarck, respectively

Acceptably Small Latency

Possibly less than 5 s, under moderate transactional updates


Going hybrid brings low latency to Big Data analytics

Page 28:

Directions for Future Work


Source database

Hadoop cluster / Relational Database

Staging table

Training data sink

Prediction model

Incremental Learner

Up-to-date model

Online testing

Integrate online testing schemes (e.g., multi-armed bandits and A/B testing) into the prediction pipeline

Develop a scheme to select the best prediction model among past models for each user in each session

Online prediction

Page 29:

Backup slides


Page 30:

Directions for Future Work


Source database

Hadoop cluster / Relational Database

Staging table

Training data sink

Prediction model

Incremental Learner

Up-to-date model

Take into consideration a common OLTP setting in which the database is partitioned across servers (a.k.a. database sharding)

Page 31:

Evaluation of Incremental Learning

Given a prediction model created with 80% of the training data by batch learning, the remaining 20% of the data is supplied for incremental learning


Built model with ..    | elapsed time (sec) | throughput (tuples/sec) | AUC
Batch only (80%)       | 96.33              | 2,067,418.3             | 0.7177
+0.1% updates (80.1%)  | 4.99               | 36,155.4                | 0.7197
+1% updates (81%)      | 25.96              | 69,812.8                | 0.7242
+10% updates (90%)     | 256.03             | 71,278.1                | 0.7291
+20% updates (100%)    | 499.61             | 72,901.4                | 0.7349
Batch only (100%)      | 102.52             | 2,298,010.8             | 0.7356

Page 32:

Special Thanks

Font
Lato by Łukasz Dziedzic

Symbols by the Noun Project

Data Analysis designed by Brennan Novak

Elephant designed by Ted Mitchner

Scale designed by Laurent Patain

Heavy Load designed by Olivier Guin

Receipt designed by Benjamin Orlovski

Gauge designed by Márcio Duarte

Stopwatch designed by Ilsur Aptukov

Box designed by Travis J. Lee

Sprint Cycle designed by Jeremy J Bristol

Dilbert characters by Scott Adams Inc.

10-12-10 and 7-29-12
