55
1 © 2015 The MathWorks, Inc. Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks [email protected]

Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

Embed Size (px)

Citation preview

Page 1: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

1© 2015 The MathWorks, Inc.

Bringing Big Data Modelling into the

Hands of Domain Experts

David Willingham

Senior Application Engineer

MathWorks

[email protected]

Page 2: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

2

“Data is the sword of the 21st century, those who

wield it the samurai.”

Big data — how to create

it, manipulate it, and put it

to good use.

“If you want to work at

Google, make sure you

can use MATLAB.”

Google’s Former SVP - Jonathan Rosenberg

Page 3: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

3

Who is looking to do Big Data Analytics now?

Software Developers

– Programmers using languages such as Java / .NET

– They may not be domain experts

End Users (non technical)

– Non programmers

– They may or may not be domain experts

Domain Users

– Scientists, Engineers, Analysts, etc..

– Have some programming skills

– Might use MATLAB to protype ideas, algorithms models

– They want their ideas to scale with the size of data easily

Page 4: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

4

Key Industries

Aerospace and defense

Automotive

Biotech and pharmaceutical

Communications

Education

Electronics and semiconductors

Energy production

Financial services

Industrial automation and

machinery

Medical devices

Page 5: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

5

Tesco uses supply chain analytics to

save £100m a year.

4 years of sales data held in

a Teradata data warehouse.

MATLAB is used to forecast

the optimum stock levels.

“We can run 100 stores for 100 days in about half an hour.

We can figure out quickly whether what we are doing is

right and we can optimise that.”

Page 6: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

6

Preventing lethal ship collisions with whales

Cornell University collected

terabytes of ocean acoustic

data.

Crowdsourced algorithms

to detect and classify

animal signals in the

presence of noise.

“A data set that would have taken months to process can

now be processed multiple times in just a few days using

different detection algorithms.”

Page 7: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

7

Earth’s topography

on a Miller cylindrical

projection, created

with MATLAB and

Mapping Toolbox.

MathWorks Today

More than 1 million users worldwide.

3500 universities

21 AU universities have MATLAB campus-wide license.

Page 8: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

8

Challenges of Big Data

“Any collection of data sets so large and complex that it becomes

difficult to process using … traditional data processing applications.”(Wikipedia)

Getting started

Rapid data exploration

Development of scalable algorithms

Ease of deployment

Page 9: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

9

Big Data Capabilities in MATLAB

Memory and Data Access

64-bit processors

Memory Mapped Variables

Disk Variables

Databases

Datastores

Platforms

Desktop (Multicore, GPU)

Clusters

Cloud Computing (MDCS on EC2)

Hadoop

Programming Constructs

Streaming

Block Processing

Parallel-for loops

GPU Arrays

SPMD and Distributed Arrays

MapReduce

Page 10: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

10

MATLAB and MemoryBest Practices for Memory Usage

Use the appropriate data storage

– Use only the precision your need

– Sparse Matrices

– Categorical Arrays

– Be aware of overhead of cells and structures

Minimize Data Copies

– In place operations, if possible

– Use nested functions

– If using objects, consider

handle classes

Page 11: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

11

Considerations for Choosing an Approach

Data characteristics

– Size, type and location of your data

Compute platform

– Single desktop machine or cluster

Analysis Characteristics

– Embarrassingly Parallel

– Analyze sub-segments of data and aggregate results

– Operate on entire dataset

Page 12: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

12

Techniques for Big Data in MATLAB

Complexity

Embarrassingly

Parallel

Non-

Partitionable

MapReduce

Distributed Memory

SPMD and distributed arrays

Load,

Analyze,

Discard

parfor, datastore,

out-of-memory

in-memory

Page 13: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

13

Techniques for Big Data in MATLAB

Embarrassingly

Parallel

Non-

Partitionable

MapReduce

Distributed Memory

SPMD and distributed arrays

Load,

Analyze,

Discard

parfor, datastore,

out-of-memory

in-memory

Complexity

Page 14: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

14

Parallel Computing with MATLAB

MATLAB

Desktop (Client)

Worker

Worker

Worker

Worker

Worker

Worker

Page 15: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

15

Demo: Determining Land UseUsing Parallel for-loops (parfor)

Data

– Arial images of agriculture land

– 24 TIF files

Analysis

– Find and measure

irrigation fields

– Determine which irrigation

circles are in use (by color)

– Calculate area under irrigation

Page 16: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

16

When to Use parfor

Data Characteristics

– Can be of any format (i.e. text, images)

as long as it can be broken into pieces

– The data for each iteration must

fit in memory

Compute Platform

– Desktop (Parallel Computing Toolbox)

– Cluster (MATLAB Distributed Computing Server)

Analysis Characteristics

– Each iteration of your loop must be independent

Page 17: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

17

Techniques for Big Data in MATLAB

Embarrassingly

Parallel

Non-

Partitionable

MapReduce

Distributed Memory

SPMD and distributed arrays

Load,

Analyze,

Discard

parfor, datastore,

out-of-memory

in-memory

Complexity

Page 18: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

18

Access Big Datadatastore

Easily specify data set

– Single text file (or collection of text files)

Preview data structure and format

Select data to import

using column names

Incrementally read

subsets of the dataairdata = datastore('*.csv');

airdata.SelectedVariables = {'Distance', 'ArrDelay‘};

data = read(airdata);

Page 19: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

19

Demo: Vehicle Registry AnalysisUsing a DataStore

Data

– Massachusetts Vehicle Registration

Data from 2008-2011

– 16M records, 45 fields

Analysis

– Examine hybrid adoptions

– Calculate % of hybrids registered

by quarter

– Fit growth to predict further

adoption

Page 20: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

20

When to Use datastore

Data Characteristics

– Text data in files, databases or stored in the

Hadoop Distributed File System (HDFS)

Compute Platform

– Desktop

Analysis Characteristics

– Supports Load, Analyze, Discard workflows

– Incrementally read chunks of data,

process within a while loop

Page 21: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

21

Big Data Capabilities in MATLAB

Memory and Data Access

64-bit processors

Memory Mapped Variables

Disk Variables

Databases

Datastores

Platforms

Desktop (Multicore, GPU)

Clusters

Cloud Computing (MDCS on EC2)

Hadoop

Programming Constructs

Streaming

Block Processing

Parallel-for loops

GPU Arrays

SPMD and Distributed Arrays

MapReduce

Page 22: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

22

Reading in Part of a Dataset from Files

Text file, ASCII file

– datastore

MAT file

– Load and save part of a variable using the matfile

Binary file

– Read and write directly to/from file using memmapfile

– Maps address space to file

Databases

– ODBC and JDBC-compliant (e.g. Oracle, MySQL, Microsoft, SQL Server)

Page 23: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

23

Techniques for Big Data in MATLAB

ComplexityEmbarrassingly

Parallel

Non-

Partitionable

MapReduce

Distributed Memory

SPMD and distributed arrays

Load,

Analyze,

Discard

parfor, datastore,

out-of-memory

in-memory

Page 24: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

24

Analyze Big Datamapreduce

Use the powerful MapReduce programming

technique to analyze big data

– mapreduce uses a datastore to process data

in small chunks that individually fit into memory

– Useful for processing multiple keys, or when

Intermediate results do not fit in memory

mapreduce on the desktop

– Increase compute capacity (Parallel Computing Toolbox)

– Access data on HDFS to develop algorithms for use on Hadoop

mapreduce with Hadoop

– Run on Hadoop using MATLAB Distributed Computing Server

– Deploy applications and libraries for Hadoop using MATLAB Compiler

********************************

* MAPREDUCE PROGRESS *

********************************

Map 0% Reduce 0%

Map 20% Reduce 0%

Map 40% Reduce 0%

Map 60% Reduce 0%

Map 80% Reduce 0%

Map 100% Reduce 25%

Map 100% Reduce 50%

Map 100% Reduce 75%

Map 100% Reduce 100%

Page 25: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

25

mapreduce

Data Store Map Reduce

Veh_typ Q3_08 Q4_08 Q1_09 Hybrid

Car 1 1 1 0

SUV 0 1 1 1

Car 1 1 1 1

Car 0 0 1 1

Car 0 1 1 1

Car 1 1 1 1

Car 0 0 1 1

SUV 0 1 1 0

Car 1 1 1 0

SUV 1 1 1 1

Car 0 1 1 1

Car 1 0 0 0

Key: Q3_080

1

1

0

1

1

1

Key: Q4_08

0

1

1

1

1

Key: Q1_09

Hybrid

0

0

0

1

0

1

Key % Hybrid (Value)

Q3_08 0.4

Q4_08 0.67

Q1_09 0.75

Key: Q3_08

Key: Q4_08

Key: Q1_09

Key: Q3_080

1

1

0

1

1

1

Key: Q4_08

0

1

1

1

1

Key: Q1_09

Hybrid

0

0

0

1

0

1

Shuffle

and Sort

Page 26: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

26

Demo: Vehicle Registry AnalysisUsing MapReduce

Data

– Massachusetts Vehicle Registration

Data from 2008-2011

– 16M records, 45 fields

Analysis

– Examine hybrid adoptions

– Calculate % of hybrids registered

By Quarter

By Regional Area

– Create map of results

Page 27: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

27

Datastore

HDFS

The Big Data Platform

Reduce

Node

Node

Node Data

Data

Data

Map

ReduceMap

ReduceMap

Map Reduce

Map

Map

Reduce

Reduce

Page 28: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

28

Datastore

Explore and Analyze Data on Hadoop

MATLAB

MapReduce

Code

HDFS

Node Data

MATLAB

Distributed

Computing

Server

Node Data

Node Data

Map Reduce

Map Reduce

Map Reduce

Page 29: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

29

Deployed Applications with Hadoop

MATLAB

MapReduce

Code

Datastore

HDFS

Node Data

Node Data

Node Data

Map Reduce

Map Reduce

Map Reduce

MATLAB

runtime

Page 30: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

30

When to Use mapreduce

Data Characteristics

– Text data in files, databases or stored in the

Hadoop Distributed File System (HDFS)

– Dataset will not fit into memory

Compute Platform

– Desktop

– Scales to run within Hadoop MapReduce

on data in HDFS

Analysis Characteristics

– Must be able to be Partitioned into two phases

1. Map: filter or process sub-segments of data

2. Reduce: aggregate interim results and calculate final answer

Page 31: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

31

Techniques for Big Data in MATLAB

ComplexityEmbarrassingly

Parallel

Non-

Partitionable

MapReduce

Distributed Memory

SPMD and distributed arrays

Load,

Analyze,

Discard

parfor, datastore,

out-of-memory

in-memory

Page 32: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

32

Parallel Computing – Distributed Memory

Core 1

Core 3 Core 4

Core 2

RAM

Using More Computers (RAM)

Core 1

Core 3 Core 4

Core 2

RAM

Page 33: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

33

TOOLBOXES

BLOCKSETS

Distributed Array

Lives on the Cluster

Remotely Manipulate Array

from Desktop

11 26 41

12 27 42

13 28 43

14 29 44

15 30 45

16 31 46

17 32 47

17 33 48

19 34 49

20 35 50

21 36 51

22 37 52

Distributed ArraysAvailable from Parallel Computing Toolbox

MATLAB Distributed Computing

Server

Page 34: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

34

spmd blocks

spmd

% single program across workers

end

Mix parallel and serial code in the same function

Run on a pool of MATLAB resources

Single Program runs simultaneously across

workers

Multiple Data spread across multiple workers

Page 35: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

35

Example: Airline Delay Analysis

Data

– BTS/RITA Airline

On-Time Statistics

– 123.5M records, 29 fields

Analysis

– Calculate delay patterns

– Visualize summaries

– Estimate & evaluate

predictive models

Page 36: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

36

When to Use Distributed Memory

Data Characteristics

– Data must be fit in collective

memory across machines

Compute Platform

– Prototype (subset of data) on desktop

– Run on a cluster or cloud

Analysis Characteristics

– Consists of:

Parts that can be run on data in memory (spmd)

Supported functions for distributed arrays

Page 37: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

37

Big Data on the Desktop

Expand workspace 64 bit processor support – increased in-memory data set handling

Access portions of data too big to fit into memory Memory mapped variables – huge binary file

Datastore – huge text file or collections of text files

Database – query portion of a big database table

Variety of programming constructs System Objects – analyze streaming data

MapReduce – process text files that won’t fit into memory

Increase analysis speed Parallel for-loops with multicore/multi-process machines

GPU Arrays

Page 38: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

38

Further Scaling Big Data Capacity

MATLAB supports a number of programming

constructs for use with clusters

General compute clusters Parallel for-loops – embarrassingly parallel algorithms

SPMD and distributed arrays – distributed memory

Hadoop clusters MapReduce – analyze data stored in the Hadoop Distributed File System

Use these constructs

on the desktop to

develop your algorithms

Migrate to a

cluster without

algorithm changes

Page 39: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

39

Learn More

MATLAB Documentation

– Strategies for Efficient Use of Memory

– Resolving "Out of Memory" Errors

Big Data with MATLAB– www.mathworks.com/discovery/big-data-matlab.html

MATLAB MapReduce and Hadoop– www.mathworks.com/discovery/matlab-mapreduce-hadoop.html

Page 40: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

40

EXTRA SLIDES

Page 41: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

41

Running into Big Data Issues?

Performance

– Takes too long to process all

of your data

“Out of memory”

– Running out of address space

Slow processing (swapping)

– Data too large to be efficiently

managed between RAM and

virtual memory

Page 42: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

42

MATLAB and MemoryMemory and Your Operating System

32-bit operating systems

– 4GB of addressable memory per process

– Part of it is reserved by the OS,

leaving the application < 4GB

64-bit operating systems

– 8TB of addressable memory

per process

– Limited by the amount of memory

available on the computer

Use 64-bit OS, if possible

Memory per Process

Available for Process

Reserved by

Operating System

Page 43: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

43

MATLAB and MemoryBest Practices for Memory Usage

Use the appropriate data storage

– Use only the precision your need

– Sparse Matrices

– Categorical Arrays

– Be aware of overhead of cells and structures

Minimize Data Copies

– In place operations, if possible

– Use nested functions

– If using objects, consider

handle classes

Page 44: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

44

No dependencies or communications between tasks

Ideal problem for parallel computing (parfor)

Independent Tasks or Iterations

TimeTime

Page 45: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

45

Block Processing Images

blockproc automatically divides an

image into blocks for processing

Reduces memory usage

– Read and write block

directly from image file

Processes arbitrarily large images

Available from Image Processing Toolbox

Page 46: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

46

System Objects

A class of MATLAB objects that

support streaming workflows

Simplifies data access for streaming applications

– Manages flow of data from files or network

– Handles data indexing and buffering

Contain algorithms to work with streaming data

– Manages algorithm state

– Available for Signal Processing, Communications,

Video Processing, and Phased Array Applications

Available from DSP System Toolbox

Communications System Toolbox

Computer Vision System Toolbox

Phased Array System Toolbox

Image Acquisition Toolbox

Page 47: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

47

Platform Desktop Only Desktop + Cluster Desktop + Hadoop

Data Size 100’s MB -10’s GB 100’s MB -100’s GB 100’s GB – PBs

Techniques • parfor

• datastore

• mapreduce

• parfor

• distributed data• spmd

• mapreduce

Summary: Options for Handling Large Data

MATLAB

Desktop (Client)

Hadoop Cluster

Hadoop Scheduler

… … …

..…

..…

..…

MATLAB

Desktop (Client)

Cluster

Scheduler

… … …..…

..…

..…

MATLAB

Desktop (Client)

Page 48: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

48

Performance Gain with More Hardware

Using GPUs

Device Memory

GPU cores

Device Memory

Core 1

Core 3 Core 4

Core 2

Cache

Using More Cores (CPUs)

Page 49: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

49

Scaling Up Your ComputationsMATLAB Distributed Computing Server

Transition from desktop to cluster with

minimal code changes

Supports both interactive and

batch workflows

Integrates with commonly

used schedulers

Cluster

Computer Cluster

Scheduler

Page 50: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

50

MapReduce Programming Model

mapreducer

datastore

mapreduce

MAP REDUCE

Execution

Environment

Output FilesInput Files

Page 51: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

51

mapreducer

mapreducer: Control the Execution Environment

of the MapReduce operation

MAP REDUCE

Output FilesInput Files

Execution

Environment

Page 52: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

52

mapreducer

Serial Hadoop ClusterLocal Pool

Multi-core CPU

Page 53: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

53

datastore

datastore: Stores information regarding

input files and output files

MAP REDUCE

Execution

Environment

Output FilesInput Files

Page 54: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

54

mapreduce

MAP REDUCE

Output FilesInput Files

Execution

Environment

mapreduce: Executes the MapReduce algorithm

Page 55: Bringing Big Data Modelling into the Hands of Domain …datamining.it.uts.edu.au/bigdata/bigdatasummit15/wp-content/uploads... · Bringing Big Data Modelling into the Hands of Domain

55

mapreduce

mapreduce: - Input datastore object

- Map function

- Reduce function

MAP REDUCE

Output FilesInput Files

Execution

Environment