Kaggle "Give me some credit" challenge overview


DESCRIPTION

Full description of the work associated with this project can be found at: http://www.npcompleteheart.com/project/kaggle-give-me-some-credit/


Predicting delinquency on debt

What is the problem?

• X Store has a retail credit card available to customers

• There can be a number of sources of loss from this product, but one is customers defaulting on their debt

• This prevents the store from collecting payment for products and services rendered

Is this problem big enough to matter?

• Examining a slice of the customer database (150,000 customers), we find that 6.6% of customers were seriously delinquent in payment in the last two years

• If only 5% of their carried debt was on the store credit card, this is potentially:

• An average loss of $8.12 per customer

• A potential overall loss of $1.2 million

What can be done?

• There are numerous models that can be used to predict which customers will default

• This could be used to decrease credit limits or cancel credit lines for current risky customers to minimize potential loss

• Or to better screen which customers are approved for the card

How will I do this?

• This is a basic classification problem with important business implications

• We'll examine a few simplistic models to get an idea of baseline performance

• Then explore decision tree methods to achieve better performance

What will the models use to predict delinquency?

Each customer has a number of attributes:

• John Smith (Delinquent: Yes, Age: 23, Income: $1,600, Number of Lines: 4)

• Mary Rasmussen (Delinquent: No, Age: 73, Income: $2,200, Number of Lines: 2)

• ...

We will use the customer attributes to predict whether they were delinquent

How do we make sure that our solution actually has predictive power?

We have two slices of the customer dataset:

• Train: 150,000 customers, delinquency in dataset

• Test: 101,000 customers, delinquency not in dataset

None of the customers in the test dataset are used to train the model

Internally we validate our model performance with cross-fold validation

Using only the train dataset we can get a sense of how well our model performs without externally validating it:

• The train dataset is divided into folds (Train 1, Train 2, Train 3)

• The algorithm is trained on two folds (e.g., Train 1 and Train 2)

• The held-out fold (Train 3) is used to test the algorithm (sketched in code below)
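As a concrete illustration, here is a minimal sketch of the 3-fold validation described above, in Python with scikit-learn. The cs-training.csv filename and column names follow the Kaggle competition data; the deck does not show its own code, so the model and settings here are placeholders.

```python
# Minimal 3-fold cross-validation sketch (assumes scikit-learn and the
# Kaggle "Give Me Some Credit" training file).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

train = pd.read_csv("cs-training.csv", index_col=0)
y = train["SeriousDlqin2yrs"]                           # 1 = seriously delinquent
X = train.drop(columns=["SeriousDlqin2yrs"]).fillna(0)  # crude imputation

# Each of the three folds is held out once while the other two train.
model = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(model, X, y, cv=3, scoring="roc_auc")
print(scores.mean())
```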

What matters is how well we can predict the test dataset

We judge this using accuracy: the number of correct predictions out of the total number of predictions made

So with 100,000 customers and 80% accuracy, we will have correctly predicted whether 80,000 customers will default in the next two years

Putting accuracy in context

We could save $600,000 over two years if we correctly predicted 50% of the customers that would default and changed their accounts to prevent it

The potential loss is reduced by ~$8,000 for every 100,000 customers with each percentage point increase in accuracy
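A quick back-of-the-envelope check of these figures; the $8.12 average loss per customer is the deck's own assumption from the earlier slide.

```python
# Back-of-the-envelope check of the loss figures quoted above.
customers = 150_000
avg_loss = 8.12                                  # dollars per customer (assumed)

total_loss = customers * avg_loss                # ~= $1.2 million overall
saved_at_50pct = 0.5 * total_loss                # ~= $600,000 over two years
per_accuracy_point = 0.01 * 100_000 * avg_loss   # ~= $8,000 per 100k customers

print(f"${total_loss:,.0f}  ${saved_at_50pct:,.0f}  ${per_accuracy_point:,.0f}")
```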

Looking at the actual data

[Data preview omitted; for fields with missing values, assume $2,500 and assume 0]

There is a continuum of algorithmic choices to tackle the problem

Simpler, Quicker → Complex, Slower

• Random Chance: 50%

• Simple Classification

For simple classification we pick a single attribute and find the best split in the customers

[Histogram: number of customers vs. times past due, with candidate splits at 1, 2, ...; regions labeled true positive, true negative, false positive, false negative]

We evaluate possible splits using accuracy, precision, and sensitivity:

• Accuracy = number correct / total number

• Precision = true positives / number of people predicted delinquent

• Sensitivity = true positives / number of people actually delinquent

[Plot: accuracy, precision, and sensitivity (0 to 0.8) vs. number of times 30-59 days past due (0 to 100)]

The best split achieves 0.61 KGI on the test set (scoring sketched below)
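Here is a sketch of scoring one threshold split on a single attribute with the three definitions above. The split_metrics helper is hypothetical; the column name in the comment follows the Kaggle data dictionary.

```python
# Score a single-attribute threshold split with accuracy, precision,
# and sensitivity, using the definitions above.
import numpy as np

def split_metrics(attribute, delinquent, threshold):
    """Predict 'delinquent' when the attribute value is >= threshold."""
    predicted = attribute >= threshold
    tp = np.sum(predicted & delinquent)           # true positives
    tn = np.sum(~predicted & ~delinquent)         # true negatives
    accuracy = (tp + tn) / len(delinquent)
    precision = tp / max(predicted.sum(), 1)      # among those flagged
    sensitivity = tp / max(delinquent.sum(), 1)   # among true delinquents
    return accuracy, precision, sensitivity

# e.g. scan integer thresholds on "NumberOfTime30-59DaysPastDueNotWorse":
# times = X["NumberOfTime30-59DaysPastDueNotWorse"].to_numpy()
# dlq = y.to_numpy().astype(bool)
# best = max(range(1, 100), key=lambda t: split_metrics(times, dlq, t)[0])
```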

However, not all fields are as informative

Using the number of times past due 60-89 days, we achieve a KGI of 0.5

The approach is naive and could be improved, but our time is better spent on different algorithms

Exploring algorithmic choices further

Simpler, Quicker → Complex, Slower

• Random Chance: 0.50

• Simple Classification: 0.50-0.61

• Random Forests

A random forest starts from a decision tree

• Customer data: find the best split in a set of randomly chosen attributes, e.g., "Is age < 30?"

• No: 75,000 customers ≥ 30

• Yes: 25,000 customers < 30

• Each branch is then split again, and so on (a one-split sketch follows below)
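To make the "best split in a random subset of attributes" step concrete, here is a depth-1 decision tree in scikit-learn, where max_features restricts each split to a random subset. This assumes X and y from the cross-validation sketch above; it is an illustration, not the deck's code.

```python
# One tree, one split: a depth-1 tree picks the best split from a
# random subset of attributes, as in the diagram above.
from sklearn.tree import DecisionTreeClassifier, export_text

stump = DecisionTreeClassifier(
    max_depth=1,          # a single split, e.g. "Is age < 30?"
    max_features="sqrt",  # consider only a random subset of attributes
    random_state=0,
).fit(X, y)
print(export_text(stump, feature_names=list(X.columns)))
```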

A random forest is composed of many decision trees

[Diagram: many independent trees, each repeatedly finding a best split and dividing the customer data into subsets]

Class assignment of a customer is based on how the individual decision trees "vote" on the customer's class

We use a large number of trees so as not to over-fit to the training data

The Random Forest algorithm is easily implemented

• In Python or R for initial testing and validation (a sketch follows below)

• Also parallelized with Mahout and Hadoop, since there is no dependence from one tree to the next
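A minimal random-forest sketch in scikit-learn; the deck's exact settings are not shown, so the parameters below are placeholders. It assumes X and y from the cross-validation sketch, and the cs-test.csv filename and columns follow the competition data.

```python
# Minimal random-forest sketch (one possible implementation).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=150,  # the deck reports runs with 10, 150, and 1000 trees
    n_jobs=-1,         # trees are independent, so they train in parallel
    random_state=0,
).fit(X, y)

# Kaggle scores a predicted probability of delinquency per test customer.
X_test = (pd.read_csv("cs-test.csv", index_col=0)
            .drop(columns=["SeriousDlqin2yrs"])
            .fillna(0))
test_probs = forest.predict_proba(X_test)[:, 1]
```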

A random forest performs well on the test set

• 10 trees: 0.779 KGI

• 150 trees: 0.843 KGI

• 1000 trees: 0.850 KGI

[Bar chart: accuracy (0.4 to 0.9) for random chance, simple classification, and random forests]

Exploring algorithmic choices further

Simpler, Quicker → Complex, Slower

• Random Chance: 0.50

• Simple Classification: 0.50-0.61

• Random Forests: 0.78-0.85

• Gradient Tree Boosting

Boosting Trees is similar to a Random Forest

• Each tree again splits the customer data (e.g., "Is age < 30?")

• But instead of searching a random subset of attributes, we do an exhaustive search for the best split

How Gradient Boosting Trees differs from Random Forest

• The first tree is optimized to minimize a loss function describing the data

• The next tree is then optimized to fit whatever variability the first tree didn't fit, and so on

• This is a sequential process, in comparison to the random forest's independent trees

• We also run the risk of over-fitting to the data, hence the learning rate, which shrinks each tree's contribution

Implementing Gradient Boosted Trees

• Initial testing and validation are easy in Python or R (a sketch follows below)

• There are implementations that use Hadoop, but it is more complicated to achieve the best performance
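A minimal gradient-boosting sketch matching the settings quoted below (100 trees, 0.1 learning rate); it assumes X, y, and X_test from the earlier sketches, and is an illustration rather than the deck's code.

```python
# Minimal gradient-boosting sketch: trees are fit sequentially, each
# on the residual of the last; the learning rate shrinks each tree's
# contribution to limit over-fitting.
from sklearn.ensemble import GradientBoostingClassifier

gbt = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    random_state=0,
).fit(X, y)

gbt_probs = gbt.predict_proba(X_test)[:, 1]
```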

Gradient Boosting Trees performs well on the dataset

• 100 trees, 0.1 learning rate: 0.865022 KGI

• 1000 trees, 0.1 learning rate: 0.865248 KGI

[Plot: KGI (0.75 to 0.85) vs. learning rate (0 to 0.8)]

[Bar chart: accuracy (0.4 to 0.9) for random chance, classification, random forests, and boosting trees]

Moving one step further in complexity

Simpler, Quicker → Complex, Slower

• Random Chance: 0.50

• Simple Classification: 0.50-0.61

• Random Forests: 0.78-0.85

• Gradient Tree Boosting: 0.71-0.8659

• Blended Method

Or more accurately, an ensemble of ensemble methods

Algorithm progression:

• Random Forest

• Extremely Random Forest

• Gradient Tree Boosting

Each algorithm produces a probability of delinquency for every customer in the train data (e.g., 0.1, 0.5, 0.01, 0.8, 0.7, ... from one model; 0.15, 0.6, 0.0, 0.75, 0.68, ... from another)

Combine all of the model information (sketched in code below):

• Optimize the set of train probabilities against the known delinquencies

• Apply the same weighting scheme to the set of test data probabilities
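A sketch of the blend described above: each tree ensemble produces per-customer probabilities, a combiner is fit to weight the train-set probabilities against the known delinquencies, and the same weights are applied to the test-set probabilities. It reuses forest, gbt, X, y, and X_test from the earlier sketches; logistic regression as the combiner is an assumption here, since the deck does not name its optimizer.

```python
# Blend several ensembles by weighting their predicted probabilities.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression

# "Extremely Random Forest" from the progression above.
extra = ExtraTreesClassifier(n_estimators=150, random_state=0).fit(X, y)
models = [forest, gbt, extra]

# One column of probabilities per model, one row per customer.
train_matrix = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
test_matrix = np.column_stack([m.predict_proba(X_test)[:, 1] for m in models])

# Fit the weighting on the train probabilities, apply it to the test set.
blender = LogisticRegression().fit(train_matrix, y)
blended_probs = blender.predict_proba(test_matrix)[:, 1]
```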

Implementation can be done in a number of ways

• Testing in Python or R is slower, due to the sequential nature of applying the algorithms

• It could be made faster when parallelized, running each algorithm separately and combining the results

Assessing model performance

Blending performance, 100 trees: 0.864394 KGI

But this performance, and the possibility of additional gains, comes at a distinct time cost

[Bar chart: accuracy (0.4 to 0.9) for random chance, classification, random forests, boosting trees, and the blended method]

Examining the continuum of choices

Simpler, Quicker → Complex, Slower

• Random Chance: 0.50

• Simple Classification: 0.50-0.61

• Random Forests: 0.78-0.85

• Gradient Tree Boosting: 0.71-0.8659

• Blended Method: 0.864

What would be best to implement?

• There is a large amount of optimization of the blended method that could still be done

• However, this algorithm takes the longest to run, and this constraint will also apply in testing and validation

• Random Forests returns a reasonably good result; it is quick and easily parallelized

• Gradient Tree Boosting returns the best result and runs reasonably fast, though it is not as easily parallelized

Increases in predictive performance have real business value

• Using any of the more complex algorithms, we achieve an increase of 35% in comparison to random

• Potential decrease of ~$420k in losses by identifying customers likely to default, in the training set alone

Thank you for your time
