L6. Unbalanced Datasets

Poul Petersen

Unbalanced Dataset?

DATASET

How Does it Happen?

Campus Population: Students, Faculty, Visitors

Consider: Campus Survey by this guy

Pretty BIASED SAMPLE

FIX: Re-sample

Sometimes it’s Reality

Fraud vs. Not Fraud

Earthquake vs. No Earthquake

Not Always a Problem

Switch  Room?
on      bright
on      bright
off     dark
on      bright
on      bright
on      bright
on      bright
off     dark
on      bright

Classes: bright, dark

Tightly Correlated: Switch on <=> bright

When is it a problem?

Imagine: a fraud dataset with 100 rows … and only ONE fraud instance

Forget building a model, just always return False.

This is 99% accurate! … but the recall of the fraud class is 0%, and its precision is undefined (often reported as 0%) because fraud is never predicted.
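The arithmetic behind this example can be checked with a short sketch. The data below is hypothetical (99 non-fraud rows and one fraud row, as described above), and the variable names are illustrative only.

```python
# Hypothetical data: 100 rows, exactly one fraud instance (True = fraud).
labels = [False] * 99 + [True]

# The "model": always predict Not Fraud.
predictions = [False] * len(labels)

# Confusion-matrix counts for the positive (fraud) class.
tp = sum(p and y for p, y in zip(predictions, labels))
fp = sum(p and not y for p, y in zip(predictions, labels))
fn = sum((not p) and y for p, y in zip(predictions, labels))
tn = sum((not p) and (not y) for p, y in zip(predictions, labels))

accuracy = (tp + tn) / len(labels)
recall = tp / (tp + fn) if (tp + fn) else None
# Precision is 0/0 here because fraud is never predicted; None marks it undefined.
precision = tp / (tp + fp) if (tp + fp) else None

print(f"accuracy={accuracy:.2f} recall={recall} precision={precision}")
```

The high accuracy comes entirely from the 99 easy negatives; the single fraud case, which is the one that matters, is missed.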

What’s the Problem?

Front Door  …  Robbed?
unlocked    …  no
unlocked    …  no
unlocked    …  no
unlocked    …  yes
unlocked    …  no
unlocked    …  no
unlocked    …  no
unlocked    …  no
unlocked    …  no
unlocked    …  no

Imagine: Dataset with 10 identical inputs and 9/10 identical outcomes

What does the model learn?

Front Door unlocked? => No Robbery, with slightly less than perfect confidence
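A minimal sketch of why the model lands on "no": with identical inputs, the best it can do is predict the most common outcome. The raw class proportion is used here as a stand-in for confidence, which is an assumption; real model engines may use a more conservative estimate.

```python
from collections import Counter

# The ten outcomes from the table: nine "no" and one "yes".
outcomes = ["no"] * 3 + ["yes"] + ["no"] * 6

counts = Counter(outcomes)
prediction, votes = counts.most_common(1)[0]

# Assumption: confidence approximated by the share of rows that agree.
confidence = votes / len(outcomes)

print(prediction, confidence)
```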

!!! IMPORTANT !!!

• The ML algorithm treats all instances equally

• It does not know the relative cost of different outcomes, unless you tell it!

• This is important even if the classes are balanced. One class can still be more important to get right.

• No Free Lunch - there are ways to fix it, but there is always a trade-off

Sub-sampling

Robbed vs. Not Robbed

Throw out instances from the “over-represented” class, either randomly or using clustering
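A minimal sketch of the random variant, assuming each row is a tuple whose last element is the class label. The helper name `subsample` and its parameters are hypothetical; the clustering variant is not shown.

```python
import random

def subsample(rows, label_of, seed=0):
    """Randomly drop rows so every class is as small as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        # Sample without replacement: some majority-class rows are thrown out.
        balanced.extend(rng.sample(group, target))
    return balanced

rows = [("unlocked", "no")] * 3 + [("unlocked", "yes")] + [("unlocked", "no")] * 6
balanced = subsample(rows, label_of=lambda row: row[1])
print(balanced)
```

On the ten-row example this keeps one "no" row and the single "yes" row, which shows the cost of the approach: most of the data is discarded.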

Over-sampling

Robbed vs. Not Robbed

Count instances from the “under-represented” class more than once
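One way to count minority instances more than once is to duplicate them at random until the classes are the same size. This is a sketch under that assumption; the helper name `oversample` is hypothetical.

```python
import random

def oversample(rows, label_of, seed=0):
    """Duplicate rows at random so every class is as large as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the target size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

rows = [("unlocked", "no")] * 3 + [("unlocked", "yes")] + [("unlocked", "no")] * 6
balanced = oversample(rows, label_of=lambda row: row[1])
print(len(balanced))
```

No data is lost, but the duplicated rows carry no new information, so the model can overfit to the few minority examples it has.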

Weighting

Tell the model engine which instances are more “important” to learn from

Front Door  …  Robbed?  weight
unlocked    …  no       1
unlocked    …  no       1
unlocked    …  no       1
unlocked    …  yes      1000
unlocked    …  no       1
unlocked    …  no       1
unlocked    …  no       1
unlocked    …  no       1
unlocked    …  no       1
unlocked    …  no       1
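To see the effect of the weight column, the sketch below replaces a plain majority vote with a weighted one. The function `weighted_vote` is a hypothetical illustration, not how any particular model engine applies weights internally.

```python
from collections import defaultdict

def weighted_vote(rows):
    """Return the label with the largest total weight and its weight share."""
    totals = defaultdict(float)
    for label, weight in rows:
        totals[label] += weight
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / sum(totals.values())

# (Robbed?, weight) pairs: first with equal weights, then with the table's weights.
unweighted = [("no", 1)] * 3 + [("yes", 1)] + [("no", 1)] * 6
weighted = [("no", 1)] * 3 + [("yes", 1000)] + [("no", 1)] * 6

print(weighted_vote(unweighted))
print(weighted_vote(weighted))
```

With equal weights the vote is "no"; with a weight of 1000 on the single robbery, that one instance outweighs the other nine combined.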

Auto Balancing

Tell the model engine to add weights so all classes have equal representation

Front Door  …  Robbed?  weight
unlocked    …  no       1
unlocked    …  no       1
unlocked    …  no       1
unlocked    …  yes      9
unlocked    …  no       1
unlocked    …  no       1
unlocked    …  no       1
unlocked    …  no       1
unlocked    …  no       1
unlocked    …  no       1
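The weight of 9 in the table can be reproduced with a simple rule: give each instance a weight equal to the size of the largest class divided by the size of its own class. This rule is an assumption that happens to match the table; a given model engine may compute balancing weights differently.

```python
from collections import Counter

def balancing_weights(labels):
    """Weight each instance so every class has the same total weight."""
    counts = Counter(labels)
    largest = max(counts.values())
    return [largest / counts[label] for label in labels]

labels = ["no"] * 3 + ["yes"] + ["no"] * 6
weights = balancing_weights(labels)

# Total weight per class after balancing.
total_yes = sum(w for w, y in zip(weights, labels) if y == "yes")
total_no = sum(w for w, y in zip(weights, labels) if y == "no")
print(weights, total_yes, total_no)
```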

The Trade-off

Positive Class: Fraud
Negative Class: Not Fraud

Evaluation with no weighting: Accuracy = 70%, Precision = 50%, Recall = 66%

Evaluation with weighting: Accuracy = 60%, Precision = 43%, Recall = 100%
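The percentages above follow from the standard definitions of accuracy, precision, and recall. The slide does not give the underlying counts, so the confusion-matrix counts below are an assumption: one set of values for 10 instances (3 of them fraud) that is consistent with the reported figures.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall for the positive (fraud) class."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Assumed counts, chosen to match the slide's percentages.
no_weighting = metrics(tp=2, fp=2, fn=1, tn=5)
with_weighting = metrics(tp=3, fp=4, fn=0, tn=3)

print(no_weighting)
print(with_weighting)
```

Under these assumed counts, weighting catches the one fraud that was previously missed but adds two false alarms, which is why recall rises while precision and accuracy fall.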

The Trade-off

• Weighting is typically a trade-off between precision and recall.

• What to do depends on what is important in the “business” sense.

• There are some ways to optimize

Sometimes Useful

Force an unbalanced dataset to improve a model

feature_1  …  feature_n  label  weight
3.4        …  4          TRUE   1
6.7        …  5          FALSE  1
1.0        …  1          FALSE  1
5.5        …  23         TRUE   1

feature_1  …  feature_n  label  weight  predict
3.4        …  4          TRUE   1       TRUE     correct
6.7        …  5          FALSE  1       TRUE     wrong
1.0        …  1          FALSE  1       FALSE    correct
5.5        …  23         TRUE   1       FALSE    wrong

feature_1  …  feature_n  label  weight
3.4        …  4          TRUE   0.5
6.7        …  5          FALSE  2
1.0        …  1          FALSE  0.5
5.5        …  23         TRUE   2

Repeat … this is a type of Boosting
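The reweighting step in the tables above can be sketched as follows. The factors 2 and 0.5 are read off the slide's example; they are not a general rule, and boosting algorithms such as AdaBoost derive the factor from the model's error rate instead. The function name `reweight` is hypothetical.

```python
def reweight(rows, up=2.0, down=0.5):
    """Raise the weight of misclassified rows and lower it for correct ones."""
    updated = []
    for label, predicted, weight in rows:
        factor = down if label == predicted else up
        updated.append((label, predicted, weight * factor))
    return updated

# (label, predict, weight) for the four rows in the example.
rows = [
    (True, True, 1.0),    # correct
    (False, True, 1.0),   # wrong
    (False, False, 1.0),  # correct
    (True, False, 1.0),   # wrong
]

new_weights = [weight for _, _, weight in reweight(rows)]
print(new_weights)
```

Training the next model on the reweighted rows makes it focus on the instances the previous model got wrong; repeating the cycle gives the boosting behavior the slide refers to.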
