Robust Feature Selection by Mutual Information Distributions

Marco Zaffalon & Marcus Hutter
IDSIA, Galleria 2, 6928 Manno (Lugano), Switzerland
www.idsia.ch/~{zaffalon,marcus} · {zaffalon,marcus}@idsia.ch
Mutual Information (MI)

Consider two discrete random variables ($\imath$, $\jmath$)
(In)Dependence is often measured by MI
– Also known as cross-entropy or information gain
– Examples
  Inference of Bayesian nets, classification trees
  Selection of relevant variables for the task at hand
$\pi_{ij}$ = joint chance of $(i,j)$, with $i = 1,\dots,r$ and $j = 1,\dots,s$
$\pi_{i+} = \sum_j \pi_{ij}$ = marginal chance of $i$
$\pi_{+j} = \sum_i \pi_{ij}$ = marginal chance of $j$

$$I(\boldsymbol{\pi}) \;=\; \sum_{ij} \pi_{ij} \log \frac{\pi_{ij}}{\pi_{i+}\,\pi_{+j}} \;\ge\; 0$$
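To make the definition concrete, here is a minimal Python sketch (our own illustration, not code from the paper; the function name and the example matrix are ours):

```python
# Mutual information of a joint chance matrix pi, with pi[i, j] = P(i, j).
import numpy as np

def mutual_information(pi):
    """I(pi) = sum_ij pi_ij * log(pi_ij / (pi_i+ * pi_+j)), in nats."""
    pi = np.asarray(pi, dtype=float)
    row = pi.sum(axis=1, keepdims=True)    # marginal chances pi_i+
    col = pi.sum(axis=0, keepdims=True)    # marginal chances pi_+j
    mask = pi > 0                          # 0 * log 0 = 0 by convention
    return float((pi[mask] * np.log(pi[mask] / (row @ col)[mask])).sum())

pi = np.array([[0.4, 0.1],
               [0.2, 0.3]])
print(mutual_information(pi))   # > 0, so the two variables are dependent
```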
MI-Based Feature-Selection Filter (F) — Lewis, 1992

Classification
– Predicting the class value given values of features
– Features (or attributes) and class = random variables
– Learning the rule 'features → class' from data

Filter's goal: removing irrelevant features
– More accurate predictions, easier models

MI-based approach
– Remove a feature if the class does not depend on it: $I(\boldsymbol{\pi}) = 0$
– Or: remove it if $I(\boldsymbol{\pi}) \le \varepsilon$, where $\varepsilon$ is an arbitrary threshold of relevance
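In code, filter F reduces to a thresholding loop (a sketch with our own names; the paper prescribes no implementation). Each feature is scored by its MI with the class, here estimated empirically as defined on the next slide:

```python
# Filter F: keep feature k iff its MI with the class exceeds eps.
import numpy as np

def empirical_mi(x, y):
    """Empirical MI (nats) of two integer-coded discrete samples."""
    x, y = np.asarray(x), np.asarray(y)
    n_ij = np.zeros((x.max() + 1, y.max() + 1))
    np.add.at(n_ij, (x, y), 1)               # contingency table of counts
    pi = n_ij / n_ij.sum()                   # empirical chances
    row = pi.sum(axis=1, keepdims=True)
    col = pi.sum(axis=0, keepdims=True)
    mask = pi > 0
    return float((pi[mask] * np.log(pi[mask] / (row @ col)[mask])).sum())

def filter_F(X, y, eps=0.01):
    """Indices of columns of X (features) whose empirical MI with y > eps."""
    return [k for k in range(X.shape[1]) if empirical_mi(X[:, k], y) > eps]
```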
Empirical Mutual Information
a common way to use MI in practice

Data → contingency table of counts:

    i\j   1      2      …    s
    1     n_11   n_12   …    n_1s
    2     n_21   n_22   …    n_2s
    ⋮
    r     n_r1   n_r2   …    n_rs

$n_{ij}$ = # of times $(i,j)$ occurred
$n_{i+} = \sum_j n_{ij}$ = # of times $i$ occurred
$n_{+j} = \sum_i n_{ij}$ = # of times $j$ occurred
$n = \sum_{ij} n_{ij}$ = dataset size

– Empirical (sample) probability: $\hat\pi_{ij} = n_{ij}/n$
– Empirical mutual information: $I(\hat{\boldsymbol\pi})$

Problems of the empirical approach
– Is $I(\hat{\boldsymbol\pi}) > 0$ due to random fluctuations? (finite sample)
– How to know if it is reliable, e.g. via $P(I > \varepsilon \mid \mathbf{n})$?
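As a worked example, take the 2×2 table n = [(8,2),(4,16)] that reappears in the density plots below (a sketch of the empirical pipeline):

```python
# Empirical pipeline: counts -> empirical chances -> empirical MI.
import numpy as np

n_ij = np.array([[8., 2.],
                 [4., 16.]])
n = n_ij.sum()                            # dataset size: 30
pi_hat = n_ij / n                         # empirical chances
row = pi_hat.sum(axis=1, keepdims=True)   # n_i+ / n
col = pi_hat.sum(axis=0, keepdims=True)   # n_+j / n
I_hat = (pi_hat * np.log(pi_hat / (row @ col))).sum()
print(I_hat)   # about 0.17 nats -- but is it > 0 only by chance?
```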
We Need the Distribution of MI

Bayesian approach
– Prior distribution $p(\boldsymbol\pi)$ for the unknown chances $\boldsymbol\pi$ (e.g., Dirichlet)
– Posterior: $p(\boldsymbol\pi \mid \mathbf{n}) \propto p(\boldsymbol\pi) \prod_{ij} \pi_{ij}^{n_{ij}}$

Posterior probability density of MI:

$$p(I \mid \mathbf{n}) \;=\; \int \delta\bigl(I(\boldsymbol\pi) - I\bigr)\, p(\boldsymbol\pi \mid \mathbf{n})\, d\boldsymbol\pi$$

How to compute it?
– Fitting a curve by the exact mean and approximate variance
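A brute-force way to inspect p(I|n) is Monte Carlo (our own sketch, assuming a uniform Dirichlet prior and reusing the example counts; the paper instead fits a curve to the exact moments on the next slide):

```python
# Sample chances from the Dirichlet posterior and summarize I(pi).
import numpy as np

rng = np.random.default_rng(0)
n_ij = np.array([[8., 2.],
                 [4., 16.]])
alpha = 1.0                                  # assumed uniform Dirichlet prior
post = (n_ij + alpha).ravel()                # posterior Dirichlet parameters

def mi(pi):
    row = pi.sum(axis=1, keepdims=True)
    col = pi.sum(axis=0, keepdims=True)
    return (pi * np.log(pi / (row @ col))).sum()   # draws are a.s. positive

draws = rng.dirichlet(post, size=20_000)     # draws of the joint chances
I_draws = np.array([mi(d.reshape(n_ij.shape)) for d in draws])
print(I_draws.mean(), I_draws.var())         # sample moments of p(I|n)
```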
Mean and Variance of MI
Hutter, 2001; Wolpert & Wolf, 1995

Exact mean ($\psi$ is the digamma function; for integer counts the $\psi$ differences reduce to sums of $1/k$):

$$E[I \mid \mathbf{n}] \;=\; \frac{1}{n} \sum_{ij} n_{ij} \bigl[ \psi(n_{ij}+1) - \psi(n_{i+}+1) - \psi(n_{+j}+1) + \psi(n+1) \bigr]$$

Leading and next-to-leading order (NLO) terms for the variance:

$$\mathrm{Var}[I \mid \mathbf{n}] \;=\; \frac{1}{n} \Biggl[ \sum_{ij} \frac{n_{ij}}{n} \log^2 \frac{n_{ij}\,n}{n_{i+}\,n_{+j}} \;-\; \Bigl( \sum_{ij} \frac{n_{ij}}{n} \log \frac{n_{ij}\,n}{n_{i+}\,n_{+j}} \Bigr)^{2} \Biggr] \;+\; \text{NLO term} \;+\; O(n^{-3})$$

Computational complexity O(rs)
– As fast as empirical MI
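The two formulas translate into a few vectorized lines (a sketch: we assume the counts already carry any Dirichlet prior weight, and we omit the NLO variance correction). On the running example, the output should roughly match the Monte Carlo moments above:

```python
# Exact posterior mean and leading-order variance of MI, both O(rs).
import numpy as np
from scipy.special import psi   # digamma function

def mi_mean_var(n_ij):
    n_ij = np.asarray(n_ij, dtype=float)
    n = n_ij.sum()
    ni = n_ij.sum(axis=1, keepdims=True)   # n_i+
    nj = n_ij.sum(axis=0, keepdims=True)   # n_+j
    mean = (n_ij * (psi(n_ij + 1) - psi(ni + 1)
                    - psi(nj + 1) + psi(n + 1))).sum() / n
    mask = n_ij > 0                         # skip empty cells in the logs
    log_t = np.zeros_like(n_ij)
    log_t[mask] = np.log((n_ij * n / (ni * nj))[mask])
    K = (n_ij / n * log_t**2).sum()
    J = (n_ij / n * log_t).sum()
    return mean, (K - J**2) / n

print(mi_mean_var([[8, 2], [4, 16]]))
```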
MI Density Example Graphs

[Figure: "Distribution of Mutual Information for Dirichlet Priors" — posterior density p(I|n) plotted against I (0 to 0.9), comparing the exact density with Gauss, Gamma, and Beta approximations for the count matrices n = [(40,10),(20,80)], n = [(20,5),(10,40)], and n = [(8,2),(4,16)].]
Robust Feature Selection

Filters: two new proposals
– FF: include feature iff $P(I > \varepsilon \mid \mathbf{n}) \ge 0.95$ (include iff "proven" relevant)
– BF: exclude feature iff $P(I < \varepsilon \mid \mathbf{n}) \ge 0.95$ (exclude iff "proven" irrelevant)

Examples
[Figure: three posterior densities of I plotted against the threshold ε — mass concentrated above ε: FF includes, BF includes; mass straddling ε: FF excludes, BF includes; mass concentrated below ε: FF excludes, BF excludes.]
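With a Gaussian fit to p(I|n) — one of the approximating curves in the plots above — both rules become one-liners (a sketch; ε = 0.003 is only an illustrative threshold, and mi_mean_var is the moments routine from before). Note the asymmetry: FF keeps only features proven relevant (few features), while BF keeps everything not proven irrelevant (many features):

```python
# FF / BF decisions via a Gaussian approximation of p(I|n).
import numpy as np
from scipy.special import psi
from scipy.stats import norm

def mi_mean_var(n_ij):
    n_ij = np.asarray(n_ij, float)
    n = n_ij.sum()
    ni = n_ij.sum(axis=1, keepdims=True)
    nj = n_ij.sum(axis=0, keepdims=True)
    mean = (n_ij * (psi(n_ij+1) - psi(ni+1) - psi(nj+1) + psi(n+1))).sum() / n
    mask = n_ij > 0
    lt = np.zeros_like(n_ij)
    lt[mask] = np.log((n_ij * n / (ni * nj))[mask])
    return mean, ((n_ij/n * lt**2).sum() - (n_ij/n * lt).sum()**2) / n

def ff_includes(n_ij, eps=0.003):
    mean, var = mi_mean_var(n_ij)
    return norm.sf(eps, mean, np.sqrt(var)) >= 0.95    # P(I > eps | n)

def bf_excludes(n_ij, eps=0.003):
    mean, var = mi_mean_var(n_ij)
    return norm.cdf(eps, mean, np.sqrt(var)) >= 0.95   # P(I < eps | n)
```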
Comparing the Filters

Experimental set-up
– Filter (F, FF, BF) + Naive Bayes classifier
– Sequential learning and testing

Collected measures for each filter
– Average # of correct predictions (prediction accuracy)
– Average # of features used

[Diagram: each test instance k is passed through the filter, classified by Naive Bayes using the learning data gathered so far, and stored in the learning data after classification; the procedure repeats for instances k+1, …, N.]
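The diagram amounts to the following loop (a skeletal sketch; filter_features, nb_fit, and nb_predict are hypothetical stand-ins for one of the filters and a Naive Bayes learner):

```python
# Sequential learning and testing: predict instance k from instances
# 1..k-1, then store it ("store after classification").
def sequential_test(instances, labels, filter_features, nb_fit, nb_predict):
    seen_x, seen_y, correct = [], [], 0
    for x, y in zip(instances, labels):
        if seen_x:                                   # need learning data first
            keep = filter_features(seen_x, seen_y)   # indices of kept features
            model = nb_fit([[r[i] for i in keep] for r in seen_x], seen_y)
            correct += nb_predict(model, [x[i] for i in keep]) == y
        seen_x.append(x)
        seen_y.append(y)
    return correct / max(len(instances) - 1, 1)      # prediction accuracy
```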
Results on 10 Complete Datasets

# of used features

Accuracies NOT significantly different
– Except Chess & Spam with FF

    # Instances   # Features   Dataset         FF      F       BF
    690           36           Australian      32.6    34.3    35.9
    3196          36           Chess           12.6    18.1    26.1
    653           15           Crx             11.9    13.2    15.0
    1000          17           German-org      5.1     8.8     15.2
    2238          23           Hypothyroid     4.8     8.4     17.1
    3200          24           Led24           13.6    14.0    24.0
    148           18           Lymphography    18.0    18.0    18.0
    5800          8            Shuttle-small   7.1     7.7     8.0
    1101          21611        Spam            123.1   822.0   13127.4
    435           16           Vote            14.0    15.2    16.0
Results on 10 Complete Datasets - ctd

[Bar chart: percentages of used features (0%–100%) for FF, F, and BF on Australian, Chess, Crx, German-org, Hypothyroid, Led24, Lymphography, Shuttle-small, Spam, and Vote.]
FF: Significantly Better Accuracies

Chess
[Figure: prediction accuracy (0.7–1) on Chess for FF and F over instance number (0–3000).]

Spam
[Figure: prediction accuracy (0.5–1) on Spam for F, FF, and BF over instance number (0–1100).]
[Figure: average number of excluded features (0–22000) on Spam for F, FF, and BF over instance number (0–1100).]
Extension to Incomplete Samples

MAR assumption
– General case: missing features and class → EM + closed-form expressions (a minimal EM sketch follows the figure below)
– Missing features only → closed-form approximate expressions for mean and variance; complexity still O(rs)

New experiments
– 5 data sets
– Similar behavior
[Figure: prediction accuracy (0.9–1) on Hypothyroidloss for F and FF over instance number (0–3000).]
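For the general MAR case, a single EM routine for the joint chances might look as follows (our own minimal illustration of the E and M steps; the paper combines EM with its closed-form expressions, which we do not reproduce here):

```python
# EM for the joint chances pi under MAR, with missing values coded as None.
import numpy as np

def em_chances(pairs, r, s, iters=50):
    pi = np.full((r, s), 1.0 / (r * s))           # uniform start
    for _ in range(iters):
        counts = np.zeros((r, s))
        for i, j in pairs:                        # E-step: expected cell counts
            if i is not None and j is not None:
                counts[i, j] += 1
            elif i is not None:                   # class missing: spread over j
                counts[i, :] += pi[i, :] / pi[i, :].sum()
            elif j is not None:                   # feature missing: spread over i
                counts[:, j] += pi[:, j] / pi[:, j].sum()
            else:                                 # both missing
                counts += pi
        pi = counts / counts.sum()                # M-step: renormalize
    return pi   # feed into the MI mean/variance formulas as before

pairs = [(0, 0), (0, 1), (1, 1), (0, None), (None, 1), (1, 0)]
print(em_chances(pairs, r=2, s=2))
```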
Conclusions

Expressions for several moments of the MI distribution are available
– The distribution can be approximated well
– Safer inferences, same computational complexity as empirical MI
– Why not use it?

Robust feature selection shows the power of the MI distribution
– FF outperforms the traditional filter F

Many useful applications possible
– Inference of Bayesian nets
– Inference of classification trees
– …