L6. Unbalanced Datasets

Poul Petersen

BigML

Unbalanced Dataset?

[Figure: DATASET]

How Does it Happen?

Campus Population: Students, Faculty, Visitors

Consider: Campus Survey by this guy

Result: a pretty BIASED SAMPLE

FIX: Re-sample

Sometimes it’s Reality

[Figure: class counts for Fraud vs. Not Fraud (axis 0 to 3000) and for Earthquake vs. No Earthquake (axis 0 to 500)]

Not Always a Problem

Switch  Room?
on      bright
on      bright
off     dark
on      bright
on      bright
on      bright
on      bright
off     dark
on      bright

[Figure: counts of bright vs. dark rooms (axis 0 to 8)]

Tightly correlated: Switch on <=> bright

When is it a problem?

Imagine: A Fraud dataset with 100 rows… and only ONE fraud instance

Forget building a model, just always return False:

This is 99% Accurate! …but the recall of the fraud class is 0% (its precision is undefined, since fraud is never predicted)
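The "always return False" baseline can be checked with a few lines of Python. This is a minimal sketch on made-up data matching the slide (100 rows, one fraud); the variable names are illustrative only. With no positive predictions, precision is formally 0/0, so the sketch reports it as undefined, which many tools display as 0%.

```python
# Hypothetical toy data matching the slide: 100 rows, exactly one fraud.
labels = [True] + [False] * 99
predictions = [False] * len(labels)  # the "model" that always returns False

pairs = list(zip(predictions, labels))
tp = sum(p and y for p, y in pairs)              # fraud predicted, fraud
fp = sum(p and not y for p, y in pairs)          # fraud predicted, not fraud
fn = sum((not p) and y for p, y in pairs)        # fraud missed
tn = sum((not p) and (not y) for p, y in pairs)  # not fraud, correctly

accuracy = (tp + tn) / len(labels)
recall = tp / (tp + fn)
# No positive predictions means precision is 0/0; report None instead.
precision = tp / (tp + fp) if (tp + fp) else None

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision}")
```

The high accuracy comes entirely from the 99 easy negatives; the one instance that matters is missed.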

What’s the Problem?

Front Door  …  Robbed?
unlocked    …  no
unlocked    …  no
unlocked    …  no
unlocked    …  yes
unlocked    …  no
unlocked    …  no
unlocked    …  no
unlocked    …  no
unlocked    …  no
unlocked    …  no

Imagine: Dataset with 10 identical inputs and 9/10 identical outcomes

What does the model learn?

Front Door unlocked? No Robbery, with slightly less than perfect confidence
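A rough way to see why: when every input is identical, the only thing left to learn is the frequency of each outcome. The sketch below uses a naive frequency as the "confidence"; this is an assumption for illustration, and a real model engine may compute confidence differently.

```python
from collections import Counter

# Outcomes of the 10 rows from the slide; every input is "unlocked".
outcomes = ["no", "no", "no", "yes", "no", "no", "no", "no", "no", "no"]

counts = Counter(outcomes)
prediction, votes = counts.most_common(1)[0]  # majority outcome
confidence = votes / len(outcomes)            # naive frequency estimate

print(prediction, confidence)
```

The single "yes" only lowers the confidence of the "no" prediction; it never changes the prediction itself.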

!!! IMPORTANT !!!

• The ML algorithm treats all instances equally

• It does not know the relative cost of different outcomes, unless you tell it!

• This is important even if the classes are balanced. One class can still be more important to get right.

• No Free Lunch: there are ways to fix this, but there is always a tradeoff
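One way to "tell it" the relative cost is to pick the prediction with the lowest expected cost instead of the most likely class. The function name and the cost values below are hypothetical, chosen only to illustrate the idea.

```python
def cheapest_decision(p_positive, cost_false_negative, cost_false_positive):
    """Return True (predict positive) when that has the lower expected cost."""
    # Predicting negative is wrong with probability p_positive.
    expected_cost_if_negative = p_positive * cost_false_negative
    # Predicting positive is wrong with probability 1 - p_positive.
    expected_cost_if_positive = (1 - p_positive) * cost_false_positive
    return expected_cost_if_positive < expected_cost_if_negative

p_robbed = 0.1  # estimated probability of the rare class

# Equal costs: the rare class is never predicted.
equal_costs = cheapest_decision(p_robbed, cost_false_negative=1, cost_false_positive=1)
# Assumed costs: a missed robbery is 100 times worse than a false alarm.
unequal_costs = cheapest_decision(p_robbed, cost_false_negative=100, cost_false_positive=1)

print(equal_costs, unequal_costs)
```

The tradeoff is visible here too: with the unequal costs, the rule raises an alarm even though the alarm is wrong 90% of the time.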

Sub-sampling

Throw out instances from the “over-represented” class, either randomly or using clustering

Front Door  …  Robbed?
unlocked    …  no
unlocked    …  no
unlocked    …  no
unlocked    …  yes
unlocked    …  no
unlocked    …  no
unlocked    …  no
unlocked    …  no
unlocked    …  no
unlocked    …  no

[Figure: charts of class counts, Robbed vs. Not Robbed]
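A minimal sketch of the random variant, assuming rows are plain tuples and that every class is cut down to the size of the smallest one. The helper name `subsample` is hypothetical, and the clustering variant is not shown.

```python
import random

def subsample(rows, label_of, seed=0):
    """Randomly keep as many rows per class as the smallest class has."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    smallest = min(len(group) for group in by_class.values())
    sampled = []
    for group in by_class.values():
        sampled.extend(rng.sample(group, smallest))
    return sampled

# The 10 rows from the slide: 9 "no" and 1 "yes".
rows = [("unlocked", "no")] * 3 + [("unlocked", "yes")] + [("unlocked", "no")] * 6
balanced = subsample(rows, label_of=lambda row: row[1])

print(balanced)
```

The cost is lost data: here 8 of the 10 rows are thrown away to reach a 1:1 balance.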

Over-sampling

Count instances from the “under-represented” class more than once

[Figure: charts of class counts, Robbed vs. Not Robbed; annotation: 9 X]
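Over-sampling by plain duplication can be sketched as follows, assuming every class is repeated until it matches the largest class. The helper name `oversample` is hypothetical.

```python
def oversample(rows, label_of):
    """Repeat rows of smaller classes until every class matches the largest."""
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    largest = max(len(group) for group in by_class.values())
    result = []
    for group in by_class.values():
        repeats, remainder = divmod(largest, len(group))
        # Whole copies of the class, plus a partial copy if needed.
        result.extend(group * repeats + group[:remainder])
    return result

# The 10 rows from the slide: 9 "no" and 1 "yes".
rows = [("unlocked", "no")] * 3 + [("unlocked", "yes")] + [("unlocked", "no")] * 6
balanced = oversample(rows, label_of=lambda row: row[1])

counts = {label: sum(1 for row in balanced if row[1] == label) for label in ("no", "yes")}
print(counts)
```

On this data the single "yes" row ends up counted nine times. No information is added; the same instance simply carries more influence.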

Weighting

Tell the model engine which instances are more “important” to learn from

Front Door  …  Robbed?  weight
unlocked    …  no       1
unlocked    …  no       1
unlocked    …  no       1
unlocked    …  yes      1000
unlocked    …  no       1
unlocked    …  no       1
unlocked    …  no       1
unlocked    …  no       1
unlocked    …  no       1
unlocked    …  no       1
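To illustrate the effect of a weight column, the sketch below replaces instance counts with weighted counts. This is only one simple way a learner could use weights; how a particular model engine applies them is not stated in the slides.

```python
# (label, weight) pairs: nine "no" rows with weight 1, one "yes" with weight 1000.
rows = [("no", 1)] * 3 + [("yes", 1000)] + [("no", 1)] * 6

totals = {}
for label, weight in rows:
    totals[label] = totals.get(label, 0) + weight  # weighted count per class

total_weight = sum(totals.values())
shares = {label: weight / total_weight for label, weight in totals.items()}
prediction = max(shares, key=shares.get)  # class with the largest weighted share

print(totals, prediction)
```

With a weight of 1000, the lone "yes" outweighs all nine "no" rows, so the weighted majority flips.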

Auto Balancing

Tell the model engine to add weights so all classes have equal representation

Front Door  …  Robbed?  weight
unlocked    …  no       1
unlocked    …  no       1
unlocked    …  no       1
unlocked    …  yes      9
unlocked    …  no       1
unlocked    …  no       1
unlocked    …  no       1
unlocked    …  no       1
unlocked    …  no       1
unlocked    …  no       1
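One formula that reproduces these weights is "size of the largest class divided by size of the instance's class". This is an assumption made for illustration, not necessarily the formula a given model engine uses.

```python
from collections import Counter

labels = ["no", "no", "no", "yes", "no", "no", "no", "no", "no", "no"]

counts = Counter(labels)
largest = max(counts.values())
# Assumed balancing rule: largest class size / own class size.
class_weight = {label: largest / n for label, n in counts.items()}
weights = [class_weight[label] for label in labels]

# After weighting, every class carries the same total weight.
weighted_totals = {
    label: sum(w for l, w in zip(labels, weights) if l == label) for label in counts
}
print(class_weight, weighted_totals)
```

The result is equivalent in spirit to over-sampling the rare class nine times, without physically duplicating rows.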

The Trade-off

Positive Class: Fraud
Negative Class: Not Fraud

Evaluation      Accuracy  Precision  Recall
no weighting    70%       50%        66%
with weighting  60%       43%        100%

[Figure: Fraud and Not Fraud instances split into “Classified Fraud” and “Classified Not Fraud”, once for each evaluation]
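The three metrics follow directly from confusion-matrix counts. The counts below are an assumption: the slide does not state them, and they were chosen only because, for 10 instances with 3 frauds, they reproduce the percentages shown.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall for the positive (Fraud) class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # share of "Classified Fraud" that is fraud
    recall = tp / (tp + fn)     # share of actual fraud that was caught
    return accuracy, precision, recall

# Assumed counts (not given on the slide), 3 frauds among 10 instances.
no_weighting = metrics(tp=2, fp=2, fn=1, tn=5)
with_weighting = metrics(tp=3, fp=4, fn=0, tn=3)

for name, (acc, prec, rec) in [("no weighting", no_weighting),
                               ("with weighting", with_weighting)]:
    print(f"{name}: accuracy={acc:.0%} precision={prec:.0%} recall={rec:.0%}")
```

Under these counts, weighting catches the one missed fraud but adds two false alarms, which is why recall rises while precision and accuracy fall. Recall without weighting is 2/3, which rounds to 67% and is shown truncated as 66% on the slide.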

• Weighting is typically a tradeoff between precision and recall.

• What to do depends on what is important in the “business” sense.

• There are some ways to optimize

Sometimes Useful

Force an unbalanced dataset to improve a model

Start with equal weights:

feature_1  …  feature_n  label  weight
3.4        …  4          TRUE   1
6.7        …  5          FALSE  1
1.0        …  1          FALSE  1
5.5        …  23         TRUE   1

Build a model and check each prediction:

feature_1  …  feature_n  label  weight  predict  result
3.4        …  4          TRUE   1       TRUE     correct
6.7        …  5          FALSE  1       TRUE     wrong
1.0        …  1          FALSE  1       FALSE    correct
5.5        …  23         TRUE   1       FALSE    wrong

Re-weight: 0.5 for correct predictions, 2 for wrong ones:

feature_1  …  feature_n  label  weight
3.4        …  4          TRUE   0.5
6.7        …  5          FALSE  2
1.0        …  1          FALSE  0.5
5.5        …  23         TRUE   2

Repeat … this is a type of Boosting
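A single re-weighting round can be sketched as below. The fixed factors 0.5 and 2 are taken from the slide's example; real boosting algorithms such as AdaBoost derive the factors from the model's weighted error rather than fixing them, so treat this as an illustration only. The function name `reweight` is hypothetical.

```python
def reweight(weights, labels, predictions, up=2.0, down=0.5):
    """Raise the weight of misclassified rows and lower the rest."""
    return [
        w * (down if label == predicted else up)
        for w, label, predicted in zip(weights, labels, predictions)
    ]

# Labels and predictions of the four rows from the slide.
labels = [True, False, False, True]
predictions = [True, True, False, False]
weights = [1.0, 1.0, 1.0, 1.0]

new_weights = reweight(weights, labels, predictions)
print(new_weights)
```

The next model is trained with `new_weights`, so it concentrates on the rows the previous model got wrong; repeating this loop yields the sequence of models that boosting combines.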
