Predictive Analytics and Big Data with MATLAB

Ian McKenna, Ph.D.

© 2015 The MathWorks, Inc.


Agenda

Introduction

Predictive Modeling

– Supervised Machine Learning

– Time Series Modeling

Big Data Analysis

– Load, Analyze, Discard workflows

– Scale computations with parallel computing

– Distributed processing of large data sets

Moving to Production with MATLAB


Financial Modeling Workflow

Access: Files, Databases, Datafeeds

Explore and Prototype: Data Analysis & Visualization, Financial Modeling, Application Development

Share: Reporting, Applications, Production

Scale

Small/Big Data, Predictive Modeling, Deploy


Financial Modeling Workflow

Explore and Prototype: Data Analysis & Visualization, Financial Modeling, Application Development

Predictive Modeling


What is Predictive Modeling?

Use of mathematical language to make predictions about the future

Input/Predictors → Predictive model → Output/Response

Examples

Electricity Demand: $EL = f(T, t, DP, \ldots)$

Trading strategies
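
As a minimal sketch of the idea, a predictive model can be viewed as a fitted function from predictors to a response. The example below fits a linear model to synthetic data using only base MATLAB. The predictor meanings (temperature `T`, hour of day `t`, dew point `DP`), the coefficients, and the noise level are illustrative assumptions, not values from the presentation.

```matlab
% Synthetic predictors (assumed meanings): temperature, hour of day, dew point
rng(0);                                  % reproducible random numbers
n  = 500;
T  = 10 + 20*rand(n,1);
t  = randi([0 23], n, 1);
DP = 5 + 10*rand(n,1);

% Synthetic response: an assumed linear relationship plus noise
betaTrue = [50; 2.0; 0.5; 1.5];
X  = [ones(n,1) T t DP];                 % design matrix with intercept column
EL = X*betaTrue + 0.1*randn(n,1);

% Train: least-squares estimate of the coefficients
betaHat = X \ EL;

% Predict: apply the fitted model to new predictor values
xNew  = [1 25 14 12];                    % [intercept, T, t, DP]
ELhat = xNew*betaHat;
fprintf('Predicted load: %.2f\n', ELhat);
```

A real load model would likely be nonlinear; the linear form is used here only to keep the train and predict steps visible.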


Why develop predictive models?

Forecast prices/returns

Price complex instruments

Analyze impact of predictors (sensitivity analysis)

Stress testing

Gain economic/market insight

And many more reasons


Challenges

Significant technical expertise required

No “one size fits all” solution

Locked into Black Box solutions

Time required to conduct the analysis


Predictive Modeling Workflow

Train: Iterate until you find the best model

LOAD DATA → PREPROCESS DATA (summary statistics, PCA, filters, cluster analysis) → SUPERVISED LEARNING (classification, regression) → MODEL

Predict: Integrate trained models into applications

NEW DATA → PREPROCESS DATA (summary statistics, PCA, filters, cluster analysis) → MODEL → PREDICTION


Classes of Response Variables

Type: Categorical, Continuous

Structure: Non-Sequential, Sequential


Examples

Classification Learner App

Predicting Customer Response

– Classification techniques

– Measure accuracy and compare models

Predicting S&P 500

– ARIMA modeling

– GARCH modeling

[Figure: Realized vs Median Forecasted Path. S&P 500 level (800 to 1800), May-01 to May-12; series: Original Data, Simulated Data]

[Figure: Bank Marketing Campaign Misclassification Rate. Percentage (0 to 100) of No and Yes responses misclassified by Neural Net, Logistic Regression, Discriminant Analysis, k-nearest Neighbors, Naive Bayes, Support VM, Decision Trees, TreeBagger, Reduced TB]


Getting Started with Predictive Modeling

Perform common tasks interactively

– Classification Learner App

– Neural Net App


Example – Bank Marketing Campaign

Goal:

– Predict whether a customer will subscribe to a bank term deposit based on different attributes

Approach:

– Train a classifier using different models

– Measure accuracy and compare models

– Reduce model complexity

– Use classifier for prediction

Data set downloaded from UCI Machine Learning repository

http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
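
The approach above can be sketched as follows, assuming Statistics and Machine Learning Toolbox is available. The data here is synthetic (two made-up numeric predictors and a yes/no response), not the UCI bank marketing data, and the four classifiers are an arbitrary selection for illustration.

```matlab
% Synthetic two-class data standing in for the customer attributes (assumption)
rng(1);
n = 400;
X = [randn(n/2, 2); randn(n/2, 2) + 1.5];
y = [repmat({'no'}, n/2, 1); repmat({'yes'}, n/2, 1)];

% Hold out 30% of the observations for testing
cv     = cvpartition(y, 'HoldOut', 0.3);
Xtrain = X(training(cv), :);   ytrain = y(training(cv));
Xtest  = X(test(cv), :);       ytest  = y(test(cv));

% Train several classifiers on the same training set
names  = {'Decision tree', 'k-nearest neighbors', 'Naive Bayes', 'Discriminant analysis'};
models = {fitctree(Xtrain, ytrain), ...
          fitcknn(Xtrain, ytrain, 'NumNeighbors', 5), ...
          fitcnb(Xtrain, ytrain), ...
          fitcdiscr(Xtrain, ytrain)};

% Measure accuracy: misclassification rate on the held-out data
errRate = zeros(numel(models), 1);
for k = 1:numel(models)
    yhat       = predict(models{k}, Xtest);
    errRate(k) = 100 * mean(~strcmp(yhat, ytest));
    fprintf('%-22s %5.1f%% misclassified\n', names{k}, errRate(k));
end

% Use a trained classifier for prediction on a new observation
newLabel = predict(models{1}, [1.2 0.8]);
```

Comparing all models on one held-out set keeps the sketch short; cross-validation would give a more stable comparison.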

[Figure: Bank Marketing Campaign Misclassification Rate. Percentage (0 to 100) of No and Yes responses misclassified by Neural Net, Logistic Regression, Discriminant Analysis, k-nearest Neighbors, Naive Bayes, Support VM, Decision Trees, TreeBagger, Reduced TB]


Classification Techniques

Regression and Classification techniques:

– Linear Regression

– Non-linear Reg. (GLM, Logistic)

– Decision Trees

– Ensemble Methods

– Neural Networks

– Nearest Neighbor

– Discriminant Analysis

– Naive Bayes

– Support Vector Machines


Example – Bank Marketing Campaign

Numerous predictive models with rich documentation

Interactive visualizations and apps to aid discovery

Built-in parallel computing support

Quick prototyping; focus on modeling, not programming

[Figure: Bank Marketing Campaign Misclassification Rate. Percentage (0 to 100) of No and Yes responses misclassified by Neural Net, Logistic Regression, Discriminant Analysis, k-nearest Neighbors, Naive Bayes, Support VM, Decision Trees, TreeBagger, Reduced TB]


Example – Time Series Modeling and Forecasting for the S&P 500 Index

Goal:

– Model S&P 500 time series as a combined ARIMA/GARCH process and forecast on test data

Approach:

– Fit ARIMA model with S&P 500 returns and estimate parameters

– Fit GARCH model for S&P 500 volatility

– Perform statistical tests for time

series attributes e.g. stationarity
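
A rough sketch of this approach, assuming Econometrics Toolbox is available. The returns are simulated from an arbitrary AR(1)/GARCH(1,1) process rather than taken from S&P 500 data, and the model orders, parameter values, and starting index level `P0` are assumptions chosen for illustration.

```matlab
% Synthetic daily log returns from an assumed AR(1) + GARCH(1,1) process
rng(2);
TrueMdl = arima('Constant', 3e-4, 'AR', 0.1, ...
    'Variance', garch('Constant', 2e-6, 'GARCH', 0.85, 'ARCH', 0.1));
r = simulate(TrueMdl, 1000);

% Statistical test for stationarity: h = 1 rejects the unit-root null
h = adftest(r);

% Fit an AR(1) conditional mean with a GARCH(1,1) conditional variance
Mdl    = arima('ARLags', 1, 'Variance', garch(1,1));
EstMdl = estimate(Mdl, r, 'Display', 'off');

% Simulate forecast paths of returns, conditioned on the observed series
horizon  = 60;
numPaths = 200;
Rsim = simulate(EstMdl, horizon, 'NumPaths', numPaths, 'Y0', r);

% Convert log returns to index levels and take the median path
P0         = 1500;                       % assumed last observed index level
Psim       = P0 * exp(cumsum(Rsim));
medianPath = median(Psim, 2);
```

Plotting `Psim` and `medianPath` against held-out data gives a comparison in the spirit of the "Realized vs Forecasted" charts.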

[Figure: Realized vs All Forecasted Paths. S&P 500 level (1000 to 11000), May-01 to May-12; series: Original Data, Simulated Data]

[Figure: Realized vs Median Forecasted Path. S&P 500 level (800 to 1800), May-01 to May-12; series: Original Data, Simulated Data]


Models for Time Series Data

Conditional Mean Models

AR – Autoregressive

MA – Moving Average

ARIMA – Integrated

ARIMAX – eXogenous inputs

VARMA – Vector ARMA

VARMAX – eXogenous inputs

VEC – Vector Error Correcting

State Space Models

Time Varying

Time Invariant

Conditional Variance Models

ARCH

GARCH

EGARCH

GJR

Non-Linear Models

NAR Neural Network

NARX Neural Network

Regression

Regression with ARIMA errors


Example – Time Series Modeling and Forecasting for the S&P 500 Index

Numerous ARIMAX and GARCH modeling techniques with rich documentation

Interactive visualizations

Code parallelization to maximize computing resources

Rapid exploration & development

[Figure: Realized vs All Forecasted Paths. S&P 500 level (1000 to 11000), May-01 to May-12; series: Original Data, Simulated Data]

[Figure: Realized vs Median Forecasted Path. S&P 500 level (800 to 1800), May-01 to May-12; series: Original Data, Simulated Data]


Financial Modeling Workflow: Scale


Challenges of Big Data

“Any collection of data sets so large and complex that it becomes difficult to process using … traditional data processing applications.” (Wikipedia)

Volume

– The amount of data

Velocity

– The speed data is generated/analyzed

Variety

– Range of data types and sources

Value

– What business intelligence can be obtained from the data?


Big Data Capabilities in MATLAB

Memory and Data Access

64-bit processors

Memory Mapped Variables

Disk Variables

Databases

Datastores

Platforms

Desktop (Multicore, GPU)

Clusters

Cloud Computing (MDCS on EC2)

Hadoop

Programming Constructs

Streaming

Block Processing

Parallel-for loops

GPU Arrays

SPMD and Distributed Arrays

MapReduce

Databases: Native ODBC interface, Database datastore, Fetch in batches, Scrollable cursors


Techniques for Big Data in MATLAB

[Chart axes: Complexity (Embarrassingly Parallel to Non-Partitionable) and Scale (RAM to Hard drive)]

Techniques: 64bit Workstation, parfor, datastore, SPMD, Distributed Memory, MapReduce

Consulting


Techniques for Big Data in MATLAB: 64bit Workstation


Memory Usage Best Practices

Expand Workspace: 64bit MATLAB

Use the appropriate data storage

– Categorical Arrays

– Be aware of overhead of cells and structures

– Use only the precision you need

– Sparse Matrices

Minimize Data Copies

– In place operations, if possible

– Use nested functions

– Inherit data using object handles
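
To make the storage choices concrete, the sketch below compares memory use for a few representations with `whos`, using only base MATLAB. The variable names and sizes are arbitrary, and actual byte counts depend on the release and platform.

```matlab
% Text labels: cell array of character vectors vs. categorical array
labels    = repmat({'AIG'; 'AMZN'; 'GE'; 'YHOO'}, 25000, 1);
labelsCat = categorical(labels);

% Precision: double (8 bytes per element) vs. single (4 bytes per element)
xDouble = rand(1e6, 1);
xSingle = single(xDouble);

% Mostly-zero data: full vs. sparse storage
Afull = zeros(2000);
Afull(1:2001:end) = 1;                   % ones on the main diagonal
Asparse = sparse(Afull);

% Report the memory used by each variable
info = whos('labels', 'labelsCat', 'xDouble', 'xSingle', 'Afull', 'Asparse');
for k = 1:numel(info)
    fprintf('%-10s %12d bytes\n', info(k).name, info(k).bytes);
end
```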


Techniques for Big Data in MATLAB: parfor


Parallel Computing with MATLAB

[Diagram: MATLAB Desktop (Client) and six Workers]


Example: Analyzing an Investment Strategy

Optimize portfolios against target benchmark

Analyze and report performance over time

Backtest over 20-year period, parallelize 3-month rebalance


When to Use parfor

Data Characteristics

– The data for each iteration must fit in memory

– Loop iterations must be independent

Transition from desktop to cluster with minimal code changes

Speed up analysis on big data
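
A minimal `parfor` sketch under these conditions: each iteration stands for one independent rebalance period. The period count, the random "asset returns", and the equal-weight portfolio are placeholders invented for illustration; a real backtest would run an optimization inside the loop. Without Parallel Computing Toolbox, `parfor` runs the iterations serially.

```matlab
numPeriods = 80;              % e.g. quarterly periods over 20 years (assumed)
numAssets  = 10;
periodRet  = zeros(numPeriods, 1);

parfor p = 1:numPeriods
    % Per-iteration random stream so results do not depend on worker order
    s = RandStream('mt19937ar', 'Seed', p);
    assetRet = 0.02 + 0.05*randn(s, numAssets, 1);   % placeholder returns
    w = ones(numAssets, 1) / numAssets;              % placeholder weights
    periodRet(p) = w' * assetRet;                    % sliced output variable
end

% Serial post-processing on the client
cumulative = prod(1 + periodRet) - 1;
fprintf('Cumulative return: %.2f%%\n', 100*cumulative);
```

Because no iteration reads another iteration's result, the loop body can be distributed across workers without changes.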


Techniques for Big Data in MATLAB: SPMD, Distributed Memory


Parallel Computing – Distributed Memory

[Diagram: Using More Computers (RAM). Two machines, each with four cores (Core 1 to Core 4) and its own RAM]


spmd blocks

spmd

% single program across workers

end

Mix parallel and serial code in the same function

Single Program runs simultaneously across workers

Multiple Data spread across multiple workers
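
As a small illustration of mixing parallel and serial code, the sketch below splits a sum of squares across workers inside an `spmd` block and combines the partial results on the client. It assumes Parallel Computing Toolbox; `labindex` and `numlabs` are the worker-identification functions of the 2015-era releases (recent releases name them `spmdIndex` and `spmdSize`).

```matlab
% Parallel section: the same code runs on every worker
spmd
    idx      = labindex:numlabs:1000;    % this worker's share of 1..1000
    localSum = sum(idx.^2);              % partial result on this worker
end

% Serial section on the client: localSum is a Composite,
% holding one value per worker
total = 0;
for w = 1:length(localSum)
    total = total + localSum{w};
end
fprintf('Sum of squares: %d\n', total);
```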


Example: Airline Delay Analysis

Data

– Airline On-Time Statistics

– 123.5M records, 29 fields

Analysis

– Calculate delay patterns

– Visualize summaries

– Estimate & evaluate predictive models


When to Use Distributed Memory

Data Characteristics

– Data must fit in the collective memory across machines

Compute Platform

– Prototype (subset of data) on desktop

– Run on a cluster or cloud

Analysis Characteristics

– Distributed arrays support a subset of functions
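
A brief sketch of the distributed-array workflow, assuming Parallel Computing Toolbox and an available parallel pool. The random matrix stands in for a subset of real data used for desktop prototyping.

```matlab
% Prototype on the desktop with a small in-memory array
A = rand(2000, 50);

% Partition the array across the workers of the parallel pool
D = distributed(A);

% Supported functions operate on the distributed array directly
colMean = mean(D, 1);
colMax  = max(D, [], 1);

% gather brings the (small) results back to the client
colMean = gather(colMean);
colMax  = gather(colMax);
```

On a cluster the array would normally be built on the workers rather than copied from the client, since the full data set may not fit in client memory.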


Techniques for Big Data in MATLAB: datastore


Access Big Data: datastore

Easily specify data set

– Single text file (or collection of text files)

Preview data structure and format

Select data to import using column names

Incrementally read subsets of the data

airdata = datastore('*.csv');

airdata.SelectedVariableNames = {'Distance', 'ArrDelay'};

data = read(airdata);


Example: Determine unique tickers

15 years of daily S&P 500 data

Data in multiple files of different sizes

Many irrelevant columns in dataset


When to Use datastore

Data Characteristics

– Text files, databases, or data stored in the Hadoop Distributed File System (HDFS)

Analysis Characteristics

– Load, Analyze, Discard workflows

– Incrementally read chunks of data,

process within a while loop
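
The load, analyze, discard pattern can be sketched with a `while` loop over a datastore. The example writes two small CSV files to a temporary folder so that it is self-contained; the file names, column names, and chunk size are assumptions made for illustration.

```matlab
% Create two small CSV files of different sizes in a temporary folder
folder = tempname;
mkdir(folder);
rng(3);
for k = 1:2
    nRows = 500*k;
    T = table(randi(2000, nRows, 1), 30*randn(nRows, 1), ...
              'VariableNames', {'Distance', 'ArrDelay'});
    writetable(T, fullfile(folder, sprintf('flights%d.csv', k)));
end

% Treat all files as one pool of data and import only one column
ds = datastore(fullfile(folder, '*.csv'));
ds.SelectedVariableNames = {'ArrDelay'};
ds.ReadSize = 200;                        % maximum rows per call to read

total = 0;
count = 0;
while hasdata(ds)
    chunk = read(ds);                     % load one chunk
    total = total + sum(chunk.ArrDelay);  % analyze it
    count = count + height(chunk);
end                                       % each chunk is overwritten (discarded)
meanDelay = total / count;
fprintf('Mean delay over %d rows: %.2f\n', count, meanDelay);
```

Only running totals are kept between iterations, so memory use is bounded by the chunk size rather than by the size of the data set.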


Reading in Part of a Dataset from Files

Text file, ASCII file

– Read part of a collection of files using datastore

MAT file

– Load and save part of a variable using matfile

Binary file

– Read and write directly to/from file using memmapfile

Databases

– ODBC and JDBC-compliant (e.g. Oracle, MySQL, Microsoft SQL Server)
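
For the MAT-file and binary-file cases, a minimal sketch using base MATLAB is shown below. The file names are temporary and the array contents are arbitrary; the point is only that a block of a larger variable is read without loading the whole file.

```matlab
% MAT-file: save in -v7.3 format, which supports efficient partial loading
matName = [tempname '.mat'];
A = magic(1000);
save(matName, 'A', '-v7.3');
m     = matfile(matName);
block = m.A(1:10, 1:5);                  % reads only a 10-by-5 block

% Binary file: map the file into memory and index it like an array
binName = [tempname '.bin'];
fid = fopen(binName, 'w');
fwrite(fid, 1:1000, 'double');
fclose(fid);
mm     = memmapfile(binName, 'Format', 'double');
first5 = mm.Data(1:5)';                  % values are read on access
clear mm                                 % release the file mapping
```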


Techniques for Big Data in MATLAB: MapReduce


Analyze Big Data: mapreduce

MapReduce programming technique to analyze big data

– mapreduce uses a datastore to process data in small chunks that individually fit into memory

mapreduce on the desktop

– Access data on HDFS

– Integrates with Parallel Computing Toolbox

mapreduce with Hadoop

– Run on Hadoop using MATLAB Distributed Computing Server

– Deploy to Hadoop using MATLAB Compiler

********************************

* MAPREDUCE PROGRESS *

********************************

Map 0% Reduce 0%

Map 20% Reduce 0%

Map 40% Reduce 0%

Map 60% Reduce 0%

Map 80% Reduce 0%

Map 100% Reduce 25%

Map 100% Reduce 50%

Map 100% Reduce 75%

Map 100% Reduce 100%
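
As an illustration of the technique, the sketch below uses `mapreduce` to find the unique tickers with a valid return on each date, in the spirit of the unique-tickers example. The sample rows, the function names `tickerMapper` and `tickerReducer`, and the temporary file locations are assumptions made for this sketch, not the presenter's code. Local functions in a script require R2016b or later; in earlier releases, save the two functions in their own files.

```matlab
% Write a small sample data set to a CSV file
T = table( ...
    {'3-Jan';'3-Jan';'3-Jan';'3-Jan';'3-Jan';'4-Jan';'4-Jan';'5-Jan';'5-Jan';'5-Jan';'5-Jan';'5-Jan'}, ...
    {'AIG';'AMZN';'GE';'INTC';'AIG';'YHOO';'INTC';'GE';'AIG';'AMZN';'GE';'YHOO'}, ...
    [-0.051; NaN; -0.040; NaN; -0.051; -0.067; -0.046; 0.025; NaN; 0.078; 0.025; -0.039], ...
    'VariableNames', {'Date', 'Ticker', 'Return'});
csvFile = [tempname '.csv'];
writetable(T, csvFile);

% Read Date and Ticker as text and Return as a number
ds = datastore(csvFile, 'TextscanFormats', {'%q', '%q', '%f'});

mapreducer(0);                           % run serially in the client session
outFolder = tempname;
mkdir(outFolder);
out = mapreduce(ds, @tickerMapper, @tickerReducer, ...
                'OutputFolder', outFolder, 'Display', 'off');
result = readall(out);                   % table with Key and Value columns
disp(result)

function tickerMapper(data, ~, intermKVStore)
    % Emit one (date, ticker) pair for every row with a valid return
    valid = ~isnan(data.Return);
    addmulti(intermKVStore, data.Date(valid), data.Ticker(valid));
end

function tickerReducer(dateKey, intermValIter, outKVStore)
    % Collect all tickers emitted for this date and remove duplicates
    tickers = {};
    while hasnext(intermValIter)
        tickers{end+1} = getnext(intermValIter); %#ok<AGROW>
    end
    u = unique(tickers);
    add(outKVStore, dateKey, strjoin(u(:)', ', '));
end
```

The mapper sees one chunk at a time, so it never needs the whole data set in memory; grouping values by key between the map and reduce phases is handled by `mapreduce` itself.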


MapReduce: Data Store → Map → Shuffle & Sort → Reduce

Data Store

Date Ticker Return

3-Jan AIG -0.051

3-Jan AMZN NaN

3-Jan GE -0.040

3-Jan INTC NaN

3-Jan AIG -0.051

4-Jan YHOO -0.067

4-Jan INTC -0.046

5-Jan GE 0.025

5-Jan AIG NaN

5-Jan AMZN 0.078

5-Jan GE 0.025

5-Jan YHOO -0.039

Map output keys: 3-Jan, 3-Jan, 4-Jan, 5-Jan, 5-Jan

Shuffle & Sort groups the keys: 3-Jan, 4-Jan, 5-Jan

Reduce output

Key Unique Tickers

3-Jan AIG, GE

4-Jan YHOO, INTC

5-Jan AMZN, GE, YHOO


Example: Calculate covariance of S&P 500 Using MapReduce

15 years of daily S&P 500 returns stored in multiple files

Use all the data to calculate the mean and covariance

Computation must scale to 1-minute bars for 30 years of data


Challenges

Multiple files of differing sizes


How do we read/partition this dataset if it doesn’t fit in memory?

Missing data (explicit/implicit)

Date Ticker Open High Low Close Volume Return

3-Jan-2000 AIG 107.13 107.44 103 103.94 166500 NaN

3-Jan-2000 AMZN 87.25 89.56 79.05 89.56 16117600 NaN

3-Jan-2000 GE 147.25 148 144 144 22121400 -0.040

8-Jan-2000 AMZN 81.5 89.56 79.05 89.38 16117600 NaN

4-Jan-2000 AIG 101.5 102.13 98.31 98.63 364000 -0.051

Jan 4,2000 YHOO 464.5 500.12 442 443 69868800 -0.067

4-Jan-2000 INTC 85.44 87.88 82.25 92.94 51019600 -0.046

4-Jan-2000 GE 147.25 148 144 144 22121400 -0.040

8-Jan-2000 GE 143.12 146.94 142.63 145.67 19873200 0.013

Date Ticker Return

3-Jan-2000 AIG NaN

3-Jan-2000 AMZN NaN

3-Jan-2000 GE -0.040

8-Jan-2000 AMZN NaN

4-Jan-2000 AIG -0.051

Jan 4,2000 YHOO -0.067

4-Jan-2000 INTC -0.046

4-Jan-2000 GE -0.040

8-Jan-2000 GE 0.013


Mean

– Coupling between rows

Covariance

– Coupling between rows

– Coupling between columns
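
One common way to handle this coupling, sketched below in base MATLAB, is to accumulate per-chunk row counts, column sums, and cross products, then combine them at the end. This is an illustrative approach, not necessarily the one used in the presentation; the data is synthetic, the chunk boundaries are arbitrary, and missing values are assumed to have been handled beforehand.

```matlab
rng(4);
X = 0.01*randn(1000, 4);                 % synthetic returns for 4 tickers
edges = [0 150 400 1000];                % three chunks of different sizes

n = 0;                                   % running row count
s = zeros(1, 4);                         % running column sums
Q = zeros(4);                            % running cross products X'*X
for c = 1:numel(edges)-1
    chunk = X(edges(c)+1:edges(c+1), :); % only this chunk is needed in memory
    n = n + size(chunk, 1);
    s = s + sum(chunk, 1);
    Q = Q + chunk' * chunk;
end

% Combine the accumulated statistics
mu = s / n;                              % mean of each column
C  = (Q - n*(mu'*mu)) / (n - 1);         % sample covariance matrix
```

The per-chunk quantities are simple sums, so they fit naturally into a map phase (per chunk) and a reduce phase (add them up). When the means are large relative to the variances, this raw cross-product formula can lose precision, and a centered update is preferable.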


Date AIG AMZN GE YHOO

3-Jan-2000 -0.012 NaN 0.051 NaN

4-Jan-2000 0.097 NaN NaN -0.035

Approach

Reading in chunks – do we have a full column of data?

Solution: convert to tabular form with all columns

Further memory savings (ticker/date not repeated)
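
The conversion from one row per date and ticker to one column per ticker can be sketched with `unstack`, a base MATLAB table function. The values below are sample rows for illustration; ticker and date combinations that do not occur come out as `NaN`.

```matlab
% Long format: one row per (date, ticker) pair
Date   = {'3-Jan-2000';'3-Jan-2000';'3-Jan-2000'; ...
          '4-Jan-2000';'4-Jan-2000';'4-Jan-2000';'4-Jan-2000'};
Ticker = {'AIG';'AMZN';'GE';'AMZN';'AIG';'YHOO';'GE'};
Return = [-0.012; NaN; 0.051; NaN; 0.097; -0.035; NaN];
long   = table(Date, Ticker, Return);

% Wide format: one row per date, one column per ticker
wide = unstack(long, 'Return', 'Ticker');
disp(wide)
```

In the wide form each date and each ticker is stored once, which is the memory saving noted above.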

Date Ticker Return

3-Jan-2000 AIG -0.012

3-Jan-2000 AMZN NaN

3-Jan-2000 GE 0.051

4-Jan-2000 AMZN NaN

4-Jan-2000 AIG 0.097

4-Jan-2000 YHOO -0.035

4-Jan-2000 GE NaN



Approach

Goal: Calculate mean/covariance for big data sets

[Diagram elements: Data Store (S&P500 Data File 1, S&P500 Data File 2, S&P500 Data File N); Unique tickers; Tabular conversion; Calculate mean/cov; Combine mean/cov; MapReduce (two stages); Validate; Scale with Hadoop]


The Big Data Platform

[Diagram: a Datastore on HDFS spanning three nodes, each holding Data and running Map and Reduce]

programming model for

HDFS: Fault-tolerant distributed data storage

MapReduce: Take the computation to the data


Deployed Applications with Hadoop

[Diagram: MATLAB MapReduce Code and the MATLAB runtime; a Datastore on HDFS; three nodes, each holding Data and running Map and Reduce]


Solution

Datastore

– Treat multiple files as a pool of data

– Parse data in chunks to determine unique values

Mapreduce

– Group, filter, and calculate summary statistics

Hadoop

– Algorithm is the same as the one developed on desktop

– Easily deploy to Hadoop using interactive tools

MATLAB Interactive Environment

– Debugger and profiler

– Validate algorithms using built-in functions for rapid prototyping


Big Data Summary

Access portions of data with datastore

Cluster-ready programming constructs

– parfor

– SPMD

– MapReduce

– Distributed arrays

Prototype code for your cluster

– Transition from desktop to cluster with no algorithm changes

[Diagram: MATLAB Desktop (Client), Scheduler, Cluster]


Financial Modeling Workflow

Share: Reporting, Applications, Production

Deploy: Desktop, Web, Enterprise, Hadoop


Deployed Applications

Example: Portfolio optimization and simulation

Example: Day-ahead system load forecasting


MATLAB Production Server

[Diagram: MATLAB Production Server with Request Broker & Program Manager; Web Server; App Server]

Enterprise framework for running packaged MATLAB programs

Scalable & reliable

– Service large numbers of concurrent requests

Use with web, database & application servers

– Easily integrates with IT systems (Java, .NET, C++, Python)


Integrating with IT systems

[Diagram elements: Web Server, Application Server, Database Server; Pricing, Risk Analytics, Portfolio Optimization; MATLAB Production Server; MATLAB Compiler SDK™; Web Applications, Desktop Applications, Excel®]


Benefits of the MATLAB Production Server

Reduce cost of building and deploying in-house analytics

– Quants/Analysts/Financial Modelers do not have to rewrite code in another language

– Update deployed models easily without restarting the server

– Single environment for model development and testing

IT can efficiently integrate models/analytics into production systems

– Centrally manage packaged MATLAB programs

– Handoff from Quant to IT only requires function signatures

– Easily support analytics built with multiple releases of MATLAB

– Simultaneous multiple instances of MATLAB Production Server


Summary

Challenges and MATLAB Solutions

Time (loss of productivity): Rapid analysis and application development

– Easily access big data sets, interactive exploratory analysis and visualization, apps to get started, debugger

No “one-size-fits-all”: Multiple algorithms and programming constructs

– Regression, machine learning, time series modeling, parfor, MapReduce, datastore

Big data and scaling: Work on the desktop and scale to clusters

– Hadoop support, no algorithm changes required

Time to deploy & integrate: Ease of deployment and leveraging enterprise

– Push-button deployment into production


Financial Modeling Workflow

Access (Files, Databases, Datafeeds): Datafeed, Database, Spreadsheet Link EX

Research and Quantify (Data Analysis and Visualization, Financial Modeling, Application Development): Financial, Statistics & Machine Learning, Optimization, Financial Instruments, Econometrics, Trading, Neural Networks, Curve Fitting

Share (Reporting, Applications, Production): Report Generator, MATLAB Compiler, MATLAB Compiler SDK, Production Server

MATLAB, Parallel Computing, MATLAB Distributed Computing Server


Learn More: Big Data

MATLAB Documentation

– Strategies for Efficient Use of Memory

– Resolving "Out of Memory" Errors

Big Data with MATLAB

– www.mathworks.com/discovery/big-data-matlab.html

MATLAB MapReduce and Hadoop

– www.mathworks.com/discovery/matlab-mapreduce-hadoop.html


Training Services

Classroom Training

– Customized curriculum

– Usually 2-5 day consecutive format

Live Online

– Flexible scheduling

– Full or Half Day Sessions

Self-Paced

– Learn whenever you want and at your own pace

– Online discussion boards and live trainer chats

CPE APPROVED PROVIDER: Earn one CPE credit per hour of content.

mathworks.com/training


Training Roadmap

MATLAB for Financial Applications

Programming Techniques

Interactive User Interfaces

Parallel Computing Time-Series Modeling (Econometrics)

Statistical Methods

Optimization Techniques

Data Analysis and Modeling Application Development

Risk Management

Machine Learning

Asset Allocation

Interfacing with Databases

Interfacing with Excel

Content for On-site Customization


Consulting Services

Accelerating return on investment

A global team of experts supporting every stage of tool and process integration

Supplier Involvement, Product Engineering Teams, Advanced Engineering, Research

Advisory Services, Process Assessment, Jumpstart, Migration Planning, Component Deployment, Full Application Deployment, Process and Technology Standardization, Process and Technology Automation

Continuous Improvement


Q&A