USING SAS HIGH PERFORMANCE STATISTICS FOR PREDICTIVE MODELLING

Regan LU, CFA, FRM

SAS certified Statistical Business Analyst & SAS certified Advanced Programmer

Future of Work Taskforce

Department of Jobs and Small Business

Part 1: Using Predictive Modelling in the Public Sector

Part 2: Building Predictive Models using SAS

Part 3: Advantages of SAS High Performance Statistics

Part 4: Examples and Comparison

PART 1: PREDICTIVE MODELLING IN THE PUBLIC SECTOR

Use longitudinal administrative data to build predictive models.

Predictive models can be applied to resolve policy problems.

For instance, we could use a predictive model to estimate the labour force participation of a target group in the 2016-17 financial year.

PREDICTIVE MODELS COULD BE USED FOR EVIDENCE-BASED POLICY DEVELOPMENT

1. Generalised Linear Model (GLM)

Binomial

Multinomial

Gamma Distribution

2. Decision Trees

Classification Tree

Regression Tree

BUILD PREDICTIVE MODELS USING DIFFERENT MACHINE LEARNING ALGORITHMS FOR EVIDENCE-BASED POLICY DEVELOPMENT

Supervised Learning

• Generalised Linear Model (Forward, Backward, Stepwise)

• Decision Trees

• Random Forest

• Neural Network

• Support Vector Machine, etc.

Unsupervised Learning

• K-means Clustering

• Cosine Similarity

MODEL VALIDATION 1: RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE

Using longitudinal data, a model was built to predict the labour force participation rate of the target group next year.

The closer the curve follows the left-hand border and the top border of the ROC space, the more accurate the predictive model is.
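To make the ROC statement concrete, the following is a minimal Python sketch (not the SAS code behind the slide) of how the curve and the area under it can be computed from predicted scores. The function names `roc_points` and `auc` and the toy labels and scores are hypothetical, chosen only for illustration.

```python
def roc_points(labels, scores):
    """Return ROC points (false positive rate, true positive rate),
    sweeping the classification threshold from high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Rank observations by descending predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last observation of a tied score group.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical toy data: 1 = participates in the labour force, 0 = does not.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(labels, scores)
area = auc(curve)
print(curve)
print(f"AUC = {area:.3f}")
```

A curve that hugs the left-hand and top borders gives an area close to 1, while a model no better than chance gives an area close to 0.5.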

MODEL VALIDATION 2: PARTIAL DEPENDENCY PLOT

Using longitudinal data, a model was built to predict the labour force participation rate of the target group next year.

The following chart indicates the accuracy of the prediction between different target groups.

[Chart: prediction accuracy for target groups A, B, C and D]

PART 2: BUILDING PREDICTIVE MODELS USING SAS

Regression

Y = β_1 X_1 + β_2 X_2 + … + β_L X_L + ε

SAS Statistical Package:

• PROC LOGISTIC

• PROC GENMOD

High Performance SAS Statistics

• PROC HPLOGISTIC

• PROC HPGENSELECT
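For a binary response such as labour force participation, a logistic regression (the logit link, which is the default in PROC LOGISTIC) applies the linear predictor to the log-odds of the outcome rather than to Y directly. A sketch of that relationship, reusing the symbols of the regression equation and, like it, leaving out an explicit intercept (an assumption made only for consistency with the slide):

```latex
\[
\log\frac{p}{1-p} = \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_L X_L,
\qquad
p = \Pr(Y = 1 \mid X_1, \dots, X_L)
  = \frac{1}{1 + e^{-(\beta_1 X_1 + \beta_2 X_2 + \dots + \beta_L X_L)}}
\]
```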

SYNTAX: LOGISTIC PROCEDURE

PROC LOGISTIC DATA=DATASET <OPTIONS>;

MODEL RESPONSE=PREDICTOR /<OPTIONS>;

OUTPUT OUT=SAS-DATASET;

RUN;

The SAS High-Performance Statistics procedures use a similar syntax.

PART 3: ADVANTAGES OF SAS HIGH PERFORMANCE STATISTICS

SAS high-performance statistics take advantage of parallel processing. This is crucial for big data analytics.

PROC HPLOGISTIC DATA=DATASET <OPTIONS>;

PERFORMANCE CPUCOUNT=24 NTHREADS=24;

MODEL RESPONSE=PREDICTOR /<OPTIONS>;

OUTPUT OUT=SAS-DATASET;

RUN;

ADVANTAGES OF SAS HIGH PERFORMANCE STATISTICS

For building predictive models using different machine learning algorithms such as GLM, we found that high-performance statistics significantly improved the efficiency of building models.

However, high-performance statistics do not always outperform the traditional SAS package (e.g. HPSUMMARY for descriptive statistics).

PART 4: EXAMPLES AND COMPARISONS

Build a model using GLM to predict the labour force attachment of a target group.

SAS data: 638k observations with 121 variables

COMPARISON OF TIME SPENT ON RUNNING ONE REGRESSION

During the lunch break

                      PROC LOGISTIC         PROC HPLOGISTIC    PROC HPLOGISTIC
                      (Non-HP statistics)   (HP statistics)    (HP statistics)
Number of Threads     (default: 1)          4                  24
Real Time (mm:ss)     28:50.45              1:33.74            1:32.78
CPU Time (mm:ss)      28:50.68              16:33.03           16:20.17

Between 2 p.m. and 3 p.m.

                      PROC LOGISTIC         PROC HPLOGISTIC    PROC HPLOGISTIC
                      (Non-HP statistics)   (HP statistics)    (HP statistics)
Number of Threads     (default: 1)          4                  24
Real Time (mm:ss)     28:15.13              4:24.48            1:42.51
CPU Time (mm:ss)      28:14.54              16:22.67           16:44.18
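The real-time figures above can be turned into speed-up ratios with a few lines of arithmetic. The Python sketch below is only an illustration of that calculation. The helper `to_seconds` and the dictionary layout are hypothetical, and the timing strings are copied from the two comparisons.

```python
def to_seconds(stamp):
    """Convert a 'mm:ss.ss' timing string to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


# Real (wall-clock) times copied from the two comparisons above.
real_times = {
    "lunch break": {
        "PROC LOGISTIC": "28:50.45",
        "PROC HPLOGISTIC, 4 threads": "1:33.74",
        "PROC HPLOGISTIC, 24 threads": "1:32.78",
    },
    "2 p.m. to 3 p.m.": {
        "PROC LOGISTIC": "28:15.13",
        "PROC HPLOGISTIC, 4 threads": "4:24.48",
        "PROC HPLOGISTIC, 24 threads": "1:42.51",
    },
}

# Speed-up = PROC LOGISTIC real time / PROC HPLOGISTIC real time, per period.
speedups = {}
for period, times in real_times.items():
    baseline = to_seconds(times["PROC LOGISTIC"])
    for label, stamp in times.items():
        if label != "PROC LOGISTIC":
            speedups[(period, label)] = baseline / to_seconds(stamp)

for (period, label), ratio in speedups.items():
    print(f"{period:18s} {label:28s} {ratio:5.1f}x faster")
```

On these figures, PROC HPLOGISTIC finished roughly 18 to 19 times faster than PROC LOGISTIC during the lunch break with either thread count. In the afternoon run, the gain was about 6.4 times with 4 threads and about 16.5 times with 24 threads.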

USING SAS HIGH-PERFORMANCE PACKAGE TO BUILD A PREDICTIVE MODEL FROM SCRATCH USING DIFFERENT MACHINE LEARNING ALGORITHMS

Both columns use the High Performance Statistical Package.

Machine Learning Algorithm            Procedure          1 Core                          24 Cores
                                                         Real time     User CPU time     Real time     User CPU time
Generalized Linear Model (Stepwise)   PROC HPGENSELECT   31 min 42 s   30 min 44 s       3 min 42 s    31 min 03 s
Decision Tree (Classification Tree)   PROC HPSPLIT       21 s          19 s              11 s          26 s
Random Forest                         PROC HPFOREST      10 min 4 s    9 min 55 s        4 min 31 s    12 min 15 s
Neural Network (MLP, one hidden       PROC HPNEURAL      64 min 23 s   64 min 3 s        10 min        65 min 41 s
layer, 30 neurons)
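The one-core and 24-core timings can also be compared per algorithm. The Python sketch below is a rough illustration, not part of the original analysis. It derives three quantities from the reported times: the real-time speed-up, the parallel efficiency (speed-up divided by the 24 cores), and the ratio of user CPU time on 24 cores to user CPU time on one core. The names `summarise`, `efficiency` and `cpu_overhead` are hypothetical labels introduced here.

```python
CORES = 24

# (real time, user CPU time) in seconds, copied from the timings above.
timings = {
    "PROC HPGENSELECT": {
        "one_core": (31 * 60 + 42, 30 * 60 + 44),
        "many_cores": (3 * 60 + 42, 31 * 60 + 3),
    },
    "PROC HPSPLIT": {
        "one_core": (21, 19),
        "many_cores": (11, 26),
    },
    "PROC HPFOREST": {
        "one_core": (10 * 60 + 4, 9 * 60 + 55),
        "many_cores": (4 * 60 + 31, 12 * 60 + 15),
    },
    "PROC HPNEURAL": {
        "one_core": (64 * 60 + 23, 64 * 60 + 3),
        "many_cores": (10 * 60, 65 * 60 + 41),
    },
}


def summarise(one_core, many_cores, cores=CORES):
    """Derive speed-up, parallel efficiency and CPU-time overhead."""
    real_1, cpu_1 = one_core
    real_n, cpu_n = many_cores
    speedup = real_1 / real_n
    return {
        "speedup": speedup,              # how many times faster in real time
        "efficiency": speedup / cores,   # share of the ideal 24x speed-up
        "cpu_overhead": cpu_n / cpu_1,   # extra CPU work spent when parallel
    }


summary = {proc: summarise(**t) for proc, t in timings.items()}

for proc, s in summary.items():
    print(f"{proc:18s} speed-up {s['speedup']:4.1f}x  "
          f"efficiency {s['efficiency']:5.1%}  "
          f"CPU overhead {s['cpu_overhead']:4.2f}x")
```

On the reported times, the stepwise GLM and the neural network gain the most from 24 cores, while the decision tree, which already runs in seconds, gains the least.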

COMPARE PREDICTIVE MODELS BUILT BY COMPUTER USING DIFFERENT MACHINE LEARNING ALGORITHMS

Model                      Pseudo R-square
Generalized Linear Model   11.7%
Decision Tree              11.8%
Neural Network             24.7%

ADDITIONAL HINT

Using SAS high performance statistics, a model could be quickly built and risk factors could be automatically selected within a very short period of time (e.g. building a GLM from scratch only takes a few minutes.)

This makes big data analytics and machine learning feasible using our SAS server.

THANKS!
