Kaggle "Give me some credit" challenge overview


DESCRIPTION

Full description of the work associated with this project can be found at: http://www.npcompleteheart.com/project/kaggle-give-me-some-credit/


Predicting delinquency on debt

What is the problem?

• X Store has a retail credit card available to customers

• There can be a number of sources of loss from this product, but one is customers defaulting on their debt

• This prevents the store from collecting payment for products and services rendered

Is this problem big enough to matter?

• Examining a slice of the customer database (150,000 customers), we find that 6.6% of customers were seriously delinquent in payment in the last two years

• If only 5% of their carried debt was on the store credit card, this is potentially:

• An average loss of $8.12 per customer

• A potential overall loss of $1.2 million

What can be done?

• There are numerous models that can be used to predict which customers will default

• This could be used to decrease credit limits or cancel credit lines for current risky customers to minimize potential loss

• Or to better screen which customers are approved for the card

How will I do this?

• This is a basic classification problem with important business implications

• We'll examine a few simplistic models to get an idea of baseline performance

• Then explore decision tree methods to achieve better performance

What will the models use to predict delinquency?

Each customer has a number of attributes:

• John Smith (Delinquent: Yes, Age: 23, Income: $1,600, Number of Lines: 4)

• Mary Rasmussen (Delinquent: No, Age: 73, Income: $2,200, Number of Lines: 2)

• ...

We will use the customer attributes to predict whether they were delinquent

How do we make sure that our solution actually has predictive power?

We have two slices of the customer dataset:

• Train: 150,000 customers, delinquency in dataset

• Test: 101,000 customers, delinquency not in dataset

None of the customers in the test dataset are used to train the model

Internally we validate our model performance with cross-fold validation

Using only the train dataset we can get a sense of how well our model performs without externally validating it:

• The train dataset is divided into folds (Train 1, Train 2, Train 3)

• The algorithm is trained on two folds (e.g., Train 1 and Train 2)

• The held-out fold (Train 3) is used to test the algorithm (sketched in code below)
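As a concrete illustration, here is a minimal sketch of the 3-fold validation described above, in Python with scikit-learn. The cs-training.csv filename and column names follow the Kaggle competition data; the deck does not show its own code, so the model and settings here are placeholders.

```python
# Minimal 3-fold cross-validation sketch (assumes scikit-learn and the
# Kaggle "Give Me Some Credit" training file).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

train = pd.read_csv("cs-training.csv", index_col=0)
y = train["SeriousDlqin2yrs"]                           # 1 = seriously delinquent
X = train.drop(columns=["SeriousDlqin2yrs"]).fillna(0)  # crude imputation

# Each of the three folds is held out once while the other two train.
model = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(model, X, y, cv=3, scoring="roc_auc")
print(scores.mean())
```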

What matters is how well we can predict the test dataset

We judge this using accuracy: the number of correct predictions out of the total number of predictions made

So with 100,000 customers and 80% accuracy, we will have correctly predicted whether 80,000 customers will default in the next two years

Putting accuracy in context

We could save $600,000 over two years if we correctly predicted 50% of the customers that would default and changed their accounts to prevent it

The potential loss is reduced by ~$8,000 for every 100,000 customers with each percentage point increase in accuracy
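A quick back-of-the-envelope check of these figures; the $8.12 average loss per customer is the deck's own assumption from the earlier slide.

```python
# Back-of-the-envelope check of the loss figures quoted above.
customers = 150_000
avg_loss = 8.12                                  # dollars per customer (assumed)

total_loss = customers * avg_loss                # ~= $1.2 million overall
saved_at_50pct = 0.5 * total_loss                # ~= $600,000 over two years
per_accuracy_point = 0.01 * 100_000 * avg_loss   # ~= $8,000 per 100k customers

print(f"${total_loss:,.0f}  ${saved_at_50pct:,.0f}  ${per_accuracy_point:,.0f}")
```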

Looking at the actual data

[Data preview omitted; for fields with missing values, assume $2,500 and assume 0]

There is a continuum of algorithmic choices to tackle the problem

Simpler, Quicker → Complex, Slower

• Random Chance: 50%

• Simple Classification

For simple classification we pick a single attribute and find the best split in the customers

[Histogram: number of customers vs. times past due, with candidate splits at 1, 2, ...; regions labeled true positive, true negative, false positive, false negative]

We evaluate possible splits using accuracy, precision, and sensitivity:

• Accuracy = number correct / total number

• Precision = true positives / number of people predicted delinquent

• Sensitivity = true positives / number of people actually delinquent

[Plot: accuracy, precision, and sensitivity (0 to 0.8) vs. number of times 30-59 days past due (0 to 100)]

The best split achieves 0.61 KGI on the test set (scoring sketched below)
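Here is a sketch of scoring one threshold split on a single attribute with the three definitions above. The split_metrics helper is hypothetical; the column name in the comment follows the Kaggle data dictionary.

```python
# Score a single-attribute threshold split with accuracy, precision,
# and sensitivity, using the definitions above.
import numpy as np

def split_metrics(attribute, delinquent, threshold):
    """Predict 'delinquent' when the attribute value is >= threshold."""
    predicted = attribute >= threshold
    tp = np.sum(predicted & delinquent)           # true positives
    tn = np.sum(~predicted & ~delinquent)         # true negatives
    accuracy = (tp + tn) / len(delinquent)
    precision = tp / max(predicted.sum(), 1)      # among those flagged
    sensitivity = tp / max(delinquent.sum(), 1)   # among true delinquents
    return accuracy, precision, sensitivity

# e.g. scan integer thresholds on "NumberOfTime30-59DaysPastDueNotWorse":
# times = X["NumberOfTime30-59DaysPastDueNotWorse"].to_numpy()
# dlq = y.to_numpy().astype(bool)
# best = max(range(1, 100), key=lambda t: split_metrics(times, dlq, t)[0])
```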

However, not all fields are as informative

Using the number of times past due 60-89 days, we achieve a KGI of 0.5

The approach is naive and could be improved, but our time is better spent on different algorithms

Exploring algorithmic choices further

Simpler, Quicker → Complex, Slower

• Random Chance: 0.50

• Simple Classification: 0.50-0.61

• Random Forests

A random forest starts from a decision tree

• Customer data: find the best split in a set of randomly chosen attributes, e.g., "Is age < 30?"

• No: 75,000 customers ≥ 30

• Yes: 25,000 customers < 30

• Each branch is then split again, and so on (a one-split sketch follows below)
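To make the "best split in a random subset of attributes" step concrete, here is a depth-1 decision tree in scikit-learn, where max_features restricts each split to a random subset. This assumes X and y from the cross-validation sketch above; it is an illustration, not the deck's code.

```python
# One tree, one split: a depth-1 tree picks the best split from a
# random subset of attributes, as in the diagram above.
from sklearn.tree import DecisionTreeClassifier, export_text

stump = DecisionTreeClassifier(
    max_depth=1,          # a single split, e.g. "Is age < 30?"
    max_features="sqrt",  # consider only a random subset of attributes
    random_state=0,
).fit(X, y)
print(export_text(stump, feature_names=list(X.columns)))
```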

A random forest is composed of many decision trees

[Diagram: many independent trees, each repeatedly finding a best split and dividing the customer data into subsets]

Class assignment of a customer is based on how the individual decision trees "vote" on the customer's class

We use a large number of trees so as not to over-fit to the training data

The Random Forest algorithm is easily implemented

• In Python or R for initial testing and validation (a sketch follows below)

• Also parallelized with Mahout and Hadoop, since there is no dependence from one tree to the next
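A minimal random-forest sketch in scikit-learn; the deck's exact settings are not shown, so the parameters below are placeholders. It assumes X and y from the cross-validation sketch, and the cs-test.csv filename and columns follow the competition data.

```python
# Minimal random-forest sketch (one possible implementation).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=150,  # the deck reports runs with 10, 150, and 1000 trees
    n_jobs=-1,         # trees are independent, so they train in parallel
    random_state=0,
).fit(X, y)

# Kaggle scores a predicted probability of delinquency per test customer.
X_test = (pd.read_csv("cs-test.csv", index_col=0)
            .drop(columns=["SeriousDlqin2yrs"])
            .fillna(0))
test_probs = forest.predict_proba(X_test)[:, 1]
```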

A random forest performs well on the test set

• 10 trees: 0.779 KGI

• 150 trees: 0.843 KGI

• 1000 trees: 0.850 KGI

[Bar chart: accuracy (0.4 to 0.9) for random chance, simple classification, and random forests]

Exploring algorithmic choices further

Simpler, Quicker → Complex, Slower

• Random Chance: 0.50

• Simple Classification: 0.50-0.61

• Random Forests: 0.78-0.85

• Gradient Tree Boosting

Boosting Trees is similar to a Random Forest

• Each tree again splits the customer data (e.g., "Is age < 30?")

• But instead of searching a random subset of attributes, we do an exhaustive search for the best split

How Gradient Boosting Trees differs from Random Forest

• The first tree is optimized to minimize a loss function describing the data

• The next tree is then optimized to fit whatever variability the first tree didn't fit, and so on

• This is a sequential process, in comparison to the random forest's independent trees

• We also run the risk of over-fitting to the data, hence the learning rate, which shrinks each tree's contribution

Implementing Gradient Boosted Trees

• Initial testing and validation are easy in Python or R (a sketch follows below)

• There are implementations that use Hadoop, but it is more complicated to achieve the best performance
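A minimal gradient-boosting sketch matching the settings quoted below (100 trees, 0.1 learning rate); it assumes X, y, and X_test from the earlier sketches, and is an illustration rather than the deck's code.

```python
# Minimal gradient-boosting sketch: trees are fit sequentially, each
# on the residual of the last; the learning rate shrinks each tree's
# contribution to limit over-fitting.
from sklearn.ensemble import GradientBoostingClassifier

gbt = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    random_state=0,
).fit(X, y)

gbt_probs = gbt.predict_proba(X_test)[:, 1]
```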

Gradient Boosting Trees performs well on the dataset

• 100 trees, 0.1 learning rate: 0.865022 KGI

• 1000 trees, 0.1 learning rate: 0.865248 KGI

[Plot: KGI (0.75 to 0.85) vs. learning rate (0 to 0.8)]

[Bar chart: accuracy (0.4 to 0.9) for random chance, classification, random forests, and boosting trees]

Moving one step further in complexity

Simpler, Quicker → Complex, Slower

• Random Chance: 0.50

• Simple Classification: 0.50-0.61

• Random Forests: 0.78-0.85

• Gradient Tree Boosting: 0.71-0.8659

• Blended Method

Or more accurately, an ensemble of ensemble methods

Algorithm progression:

• Random Forest

• Extremely Random Forest

• Gradient Tree Boosting

Each algorithm produces a probability of delinquency for every customer in the train data (e.g., 0.1, 0.5, 0.01, 0.8, 0.7, ... from one model; 0.15, 0.6, 0.0, 0.75, 0.68, ... from another)

Combine all of the model information (sketched in code below):

• Optimize the set of train probabilities against the known delinquencies

• Apply the same weighting scheme to the set of test data probabilities
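A sketch of the blend described above: each tree ensemble produces per-customer probabilities, a combiner is fit to weight the train-set probabilities against the known delinquencies, and the same weights are applied to the test-set probabilities. It reuses forest, gbt, X, y, and X_test from the earlier sketches; logistic regression as the combiner is an assumption here, since the deck does not name its optimizer.

```python
# Blend several ensembles by weighting their predicted probabilities.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression

# "Extremely Random Forest" from the progression above.
extra = ExtraTreesClassifier(n_estimators=150, random_state=0).fit(X, y)
models = [forest, gbt, extra]

# One column of probabilities per model, one row per customer.
train_matrix = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
test_matrix = np.column_stack([m.predict_proba(X_test)[:, 1] for m in models])

# Fit the weighting on the train probabilities, apply it to the test set.
blender = LogisticRegression().fit(train_matrix, y)
blended_probs = blender.predict_proba(test_matrix)[:, 1]
```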

Implementation can be done in a number of ways

• Testing in Python or R is slower, due to the sequential nature of applying the algorithms

• It could be made faster when parallelized, running each algorithm separately and combining the results

Assessing model performance

Blending performance, 100 trees: 0.864394 KGI

But this performance, and the possibility of additional gains, comes at a distinct time cost

[Bar chart: accuracy (0.4 to 0.9) for random chance, classification, random forests, boosting trees, and the blended method]

Examining the continuum of choices

Simpler, Quicker → Complex, Slower

• Random Chance: 0.50

• Simple Classification: 0.50-0.61

• Random Forests: 0.78-0.85

• Gradient Tree Boosting: 0.71-0.8659

• Blended Method: 0.864

What would be best to implement?

• There is a large amount of optimization of the blended method that could still be done

• However, this algorithm takes the longest to run, and this constraint will also apply in testing and validation

• Random Forests returns a reasonably good result; it is quick and easily parallelized

• Gradient Tree Boosting returns the best result and runs reasonably fast, though it is not as easily parallelized

Increases in predictive performance have real business value

• Using any of the more complex algorithms, we achieve an increase of 35% in comparison to random

• Potential decrease of ~$420k in losses by identifying customers likely to default, in the training set alone

Thank you for your time
