4
How Does it Happen?
Campus Population: Students, Faculty, Visitors
Consider: Campus Survey by this guy
Pretty BIASED SAMPLE
FIX: Re-sample
5
Sometimes it’s Reality
[Bar chart: Fraud vs. Not Fraud, axis 0 to 3000]
[Bar chart: Earthquake vs. No Earthquake, axis 0 to 500]
6
Not Always a Problem
Switch   Room?
on       bright
on       bright
off      dark
on       bright
on       bright
on       bright
on       bright
off      dark
on       bright

[Bar chart: bright vs. dark, axis 0 to 8]
Tightly correlated: Switch on <=> bright
7
When is it a problem?
Imagine: A fraud dataset with 100 rows… and only ONE fraud instance
Forget building a model, just always return False:
This is 99% accurate! …but the recall of the fraud class is 0% (its precision is undefined, and often reported as 0%, because no fraud is ever predicted)
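The arithmetic on this slide can be checked with a small sketch. The dataset below is a hypothetical stand-in with the same shape as the slide's example (100 rows, one fraud), and reporting the 0/0 precision as 0.0 is a convention assumed here, not something the slides specify.

```python
# Hypothetical dataset matching the slide: 100 rows, exactly one fraud instance.
labels = [True] + [False] * 99


def always_false(_row):
    """The 'model' from the slide: it never predicts fraud."""
    return False


predictions = [always_false(row) for row in labels]

true_pos = sum(1 for y, p in zip(labels, predictions) if y and p)
false_pos = sum(1 for y, p in zip(labels, predictions) if not y and p)
false_neg = sum(1 for y, p in zip(labels, predictions) if y and not p)
correct = sum(1 for y, p in zip(labels, predictions) if y == p)

accuracy = correct / len(labels)
recall = true_pos / (true_pos + false_neg)

# Precision is 0/0 here because nothing is ever predicted as fraud.
# Falling back to 0.0 is an assumed convention.
predicted_pos = true_pos + false_pos
precision = true_pos / predicted_pos if predicted_pos else 0.0

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

The high accuracy comes entirely from the 99 easy negatives; the single fraud row is always missed.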
8
What’s the Problem?
Front Door   …   Robbed?
unlocked     …   no
unlocked     …   no
unlocked     …   no
unlocked     …   yes
unlocked     …   no
unlocked     …   no
unlocked     …   no
unlocked     …   no
unlocked     …   no
unlocked     …   no

Imagine: Dataset with 10 identical inputs and 9/10 identical outcomes
What does the model learn?
Front Door unlocked?
No Robbery, with slightly less than perfect confidence
!!! IMPORTANT !!!
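A minimal sketch of why the model ends up at "No Robbery, with slightly less than perfect confidence": when every input is identical, about all a model can learn is the frequency of each outcome. The frequency-counting "model" below is an illustrative assumption, not the learning algorithm used in the slides.

```python
from collections import Counter

# The ten rows from the slide: the only feature is identical in every row.
rows = [("unlocked", "no")] * 3 + [("unlocked", "yes")] + [("unlocked", "no")] * 6

# With identical inputs, the learned "rule" reduces to outcome frequencies.
outcome_counts = Counter(outcome for _door, outcome in rows)
total = sum(outcome_counts.values())
learned = {outcome: count / total for outcome, count in outcome_counts.items()}

# The prediction is the most frequent outcome, with its frequency as confidence.
prediction = max(learned, key=learned.get)
print(prediction, learned[prediction])
```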
9
What’s the Problem?
• The ML algorithm treats all instances equally
• It does not know the relative cost of different outcomes, unless you tell it!
• This is important even if the classes are balanced. One class can still be more important to get right.
• No Free Lunch: there are ways to fix this, but there is always a trade-off
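The point about relative cost can be illustrated with a small sketch. The cost values and the two sets of predictions are hypothetical, chosen only to show that two models with the same accuracy on a balanced dataset can differ widely once mistakes are priced.

```python
# Hypothetical costs, not from the slides: a missed positive is assumed to be
# 100 times as expensive as a false alarm.
COST_FALSE_NEGATIVE = 100.0
COST_FALSE_POSITIVE = 1.0


def total_cost(labels, predictions):
    """Sum the cost of every mistake; correct predictions cost nothing."""
    cost = 0.0
    for actual, predicted in zip(labels, predictions):
        if actual and not predicted:
            cost += COST_FALSE_NEGATIVE
        elif predicted and not actual:
            cost += COST_FALSE_POSITIVE
    return cost


def accuracy(labels, predictions):
    """Fraction of predictions that match the labels."""
    return sum(a == p for a, p in zip(labels, predictions)) / len(labels)


# A balanced toy dataset: 5 positives followed by 5 negatives.
labels = [True] * 5 + [False] * 5
# Model A misses two positives; model B raises two false alarms.
model_a = [True, True, True, False, False] + [False] * 5
model_b = [True] * 5 + [True, True, False, False, False]

print(accuracy(labels, model_a), total_cost(labels, model_a))
print(accuracy(labels, model_b), total_cost(labels, model_b))
```

Both models make two mistakes, so an algorithm that treats all instances equally cannot tell them apart; only the stated costs do.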
10
Sub-sampling
[Table: the same 10 rows as before, Front Door = unlocked in every row, Robbed? = yes in 1 row and no in 9]
[Bar chart: Robbed vs. Not Robbed, axis 0 to 9]
Throw out instances from the “over-represented” class, either randomly or using clustering
[Second bar chart: Robbed vs. Not Robbed, axis 0 to 1]
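A minimal sketch of the random variant of sub-sampling, using the slide's ten rows. The helper name `sub_sample` and its `seed` parameter are hypothetical; the clustering-based variant mentioned on the slide is not covered here.

```python
import random


def sub_sample(rows, label_of, seed=0):
    """Randomly drop rows so every class keeps as many rows as the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    # The rarest class sets the size that every class is cut down to.
    target = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))
    return balanced


rows = [("unlocked", "no")] * 3 + [("unlocked", "yes")] + [("unlocked", "no")] * 6
balanced = sub_sample(rows, label_of=lambda row: row[1])
print(balanced)
```

On this dataset only two rows survive, which shows the cost of the approach: most of the data is discarded.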
11
Over-sampling
[Table: the same 10 rows as before, Front Door = unlocked in every row, Robbed? = yes in 1 row and no in 9]
[Bar chart: Robbed vs. Not Robbed, axis 0 to 9]
Count instances from the “under-represented” class more than once
[Second bar chart: Robbed vs. Not Robbed, axis 0 to 1; annotation: 9 X]
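Over-sampling can be sketched as sampling the rare class with replacement until it matches the largest class. The helper name `over_sample` and its `seed` parameter are hypothetical, and this is only one simple way to "count instances more than once".

```python
import random


def over_sample(rows, label_of, seed=0):
    """Repeat rows of smaller classes until each class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw extra copies with replacement until the class reaches the target size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


rows = [("unlocked", "no")] * 3 + [("unlocked", "yes")] + [("unlocked", "no")] * 6
balanced = over_sample(rows, label_of=lambda row: row[1])
print(len(balanced), sum(1 for _door, label in balanced if label == "yes"))
```

Here the single "yes" row ends up counted nine times, so both classes contribute nine rows each.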
12
Weighting
Tell the model engine which instances are more “important” to learn from

Front Door   …   Robbed?   weight
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   yes       1000
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
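A rough sketch of what a weight does, assuming a model that simply learns weighted outcome frequencies (an illustrative simplification, not the behaviour of any particular model engine). It uses the slide's weight of 1000 on the single "yes" row.

```python
def weighted_outcome_shares(rows):
    """Return each outcome's share of the total weight."""
    totals = {}
    for _door, outcome, weight in rows:
        totals[outcome] = totals.get(outcome, 0.0) + weight
    grand_total = sum(totals.values())
    return {outcome: total / grand_total for outcome, total in totals.items()}


# Rows as (front_door, robbed, weight), with weight 1000 on the "yes" row.
rows = (
    [("unlocked", "no", 1)] * 3
    + [("unlocked", "yes", 1000)]
    + [("unlocked", "no", 1)] * 6
)
shares = weighted_outcome_shares(rows)
print(shares)
```

With this weight the one robbery outweighs the nine non-robberies, so the frequency-based prediction flips to "yes".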
13
Auto Balancing
Tell the model engine to add weights so all classes have equal representation

Front Door   …   Robbed?   weight
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   yes       9
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
unlocked     …   no        1
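One way such balancing weights could be derived is to weight each class by the ratio of the largest class count to its own count, which reproduces the weights of 9 and 1 in the table. The helper name `balancing_weights` is hypothetical, and a real model engine may scale its weights differently while keeping the same ratio.

```python
from collections import Counter


def balancing_weights(labels):
    """Weight each class by (largest class count / its own count)."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {label: largest / count for label, count in counts.items()}


labels = ["no"] * 3 + ["yes"] + ["no"] * 6
weights = balancing_weights(labels)

# After weighting, every class carries the same total weight.
totals = {label: weights[label] * count for label, count in Counter(labels).items()}
print(weights, totals)
```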
14
The Trade-off
[Figure: two panels of instances grouped into “Classified Fraud” and “Classified Not Fraud”; the legend marks each instance as Fraud or Not Fraud]
Positive class: Fraud
Negative class: Not Fraud
Evaluation with no weighting: Accuracy = 70%, Precision = 50%, Recall = 66%
Evaluation with weighting: Accuracy = 60%, Precision = 43%, Recall = 100%
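The two sets of metrics can be reproduced from confusion-matrix counts. The counts below are an assumption: they are inferred as one set of 10 instances (3 fraud, 7 not fraud) that is consistent with the percentages on the slide, not numbers the slide states.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall for the positive (fraud) class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall


# Assumed counts (inferred from the percentages): 10 instances, 3 of them fraud.
no_weighting = metrics(tp=2, fp=2, fn=1, tn=5)
with_weighting = metrics(tp=3, fp=4, fn=0, tn=3)

for name, (acc, prec, rec) in [
    ("no weighting", no_weighting),
    ("weighting", with_weighting),
]:
    print(f"{name}: accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

Under these assumed counts, weighting catches the one missed fraud (recall rises from 2/3 to 1) at the price of two extra false alarms (precision falls from 1/2 to 3/7).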
15
The Trade-off
• Weighting is typically a trade-off between precision and recall.
• What to do depends on what is important in the “business” sense.
• There are some ways to optimize
16
Sometimes Useful
Force an unbalanced dataset to improve a model

feature_1   …   feature_n   label   weight
3.4         …   4           TRUE    1
6.7         …   5           FALSE   1
1.0         …   1           FALSE   1
5.5         …   23          TRUE    1
feature_1   …   feature_n   label   weight   predict   result
3.4         …   4           TRUE    1        TRUE      correct
6.7         …   5           FALSE   1        TRUE      wrong
1.0         …   1           FALSE   1        FALSE     correct
5.5         …   23          TRUE    1        FALSE     wrong

[Values shown beside the table: 0.52, 0.52]
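One way to "force" an unbalanced dataset is to raise the weight of rows the current model gets wrong and lower the weight of rows it gets right, then retrain. The sketch below shows only the reweighting step; the two factors are hypothetical and not taken from the slides, and the feature values are left out because the slide elides them.

```python
# Rows as (label, predicted), in the same order as the table.
rows = [(True, True), (False, True), (False, False), (True, False)]
weights = [1.0, 1.0, 1.0, 1.0]

# Hypothetical factors, not taken from the slides.
CORRECT_FACTOR = 0.5
WRONG_FACTOR = 2.0


def reweight(rows, weights):
    """Shrink the weight of correctly predicted rows and grow the weight of wrong ones."""
    return [
        weight * (CORRECT_FACTOR if label == predicted else WRONG_FACTOR)
        for (label, predicted), weight in zip(rows, weights)
    ]


new_weights = reweight(rows, weights)
print(new_weights)
```

A model retrained with these weights pays more attention to the rows it previously misclassified.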