Scalable Bayesian Network Classifiers

Geoff Webb, Ana Martinez, Nayyar Zaidi

Introduction

Bayesian Network Classifiers

Selective KDB

Experiments

Conclusions & Future Research


Introduction

• Learning from large data is not just about scaling up existing algorithms.

• Large data is best tackled by fundamentally new learning algorithms.



Learning curves

[Figure: learning curves, RMSE (0 to 0.8) against training set size (0 to 1,000,000).]

• Error typically reduces as data quantity increases.

• Different algorithms have different curves.


Learning curves

[Figure: learning curves, RMSE (0 to 0.8) against training set size (0 to 1,000,000), annotated "Overfitting on small data" and "Best on large data".]

• Algorithms that closely fit complex multivariate distributions will tend to overfit small data, but can better fit large data: the bias-variance tradeoff.




Capacity to fit = model space + optimization

[Figure: RMSE (0 to 1) against training set size for Naïve Bayes and Logistic Regression.]




Much research of questionable relevance

[Figure: learning curves, RMSE (0 to 0.8) against training set size (0 to 1,000,000).]

• Most prior research has used relatively small data.


Requirements

[Figure: learning curves, RMSE (0 to 0.8) against training set size (0 to 1,000,000).]

• Need algorithms that can closely fit complex multivariate distributions, while being very computationally efficient.
  • low bias
  • few passes through data
  • out of core




State-of-the-art

• Most low-bias algorithms do not scale and are inherently in-core
  • Random Forest
  • SVM
  • Deep Learning

• Selective KDB is a scalable low-bias Bayesian Network Classifier




Outline

1 Introduction

2 Bayesian Network Classifiers

3 Selective KDB

4 Experiments

5 Conclusions & Future Research




Bayesian Network Classifiers

• Defined by parent relation π and Conditional Probability Tables (CPTs)
  • π encodes conditional independence / structure
  • CPTs encode conditional probabilities

• Classifies using $P(y \mid x) \propto P(y \mid \pi_Y) \prod_i P(x_i \mid \pi_i)$
  • Usually makes Y a parent of all $X_i$

[Figure: network with class Y and attributes X1 to X5.]

• Given π, CPTs can be learned by counting joint frequencies
  • single pass
  • incremental
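As an illustration of the counting step, here is a minimal Python sketch. It assumes discrete attributes, the class as an implicit parent of every attribute, and a hypothetical `parents` list giving the attribute parents of each attribute. The class name `CountingBNC`, the `update`/`posterior` methods and the add-one smoothing are illustrative assumptions, not taken from the slides.

```python
class CountingBNC:
    """Bayesian network classifier with a fixed structure, learned by counting."""

    def __init__(self, parents, n_values, classes):
        self.parents = parents      # parents[i] = tuple of attribute indices
        self.n_values = n_values    # n_values[i] = number of values of attribute i
        self.classes = list(classes)
        self.n = 0
        self.class_counts = {}
        # joint[i][(y, parent values, x_i)] and context[i][(y, parent values)]
        self.joint = [{} for _ in parents]
        self.context = [{} for _ in parents]

    def update(self, x, y, weight=1):
        """Add one example in a single pass; weight=-1 removes it again."""
        self.n += weight
        self.class_counts[y] = self.class_counts.get(y, 0) + weight
        for i, pa in enumerate(self.parents):
            pv = tuple(x[p] for p in pa)
            jk, ck = (y, pv, x[i]), (y, pv)
            self.joint[i][jk] = self.joint[i].get(jk, 0) + weight
            self.context[i][ck] = self.context[i].get(ck, 0) + weight

    def posterior(self, x):
        """Normalised P(y | x) from the smoothed counts (add-one is an assumption)."""
        scores = {}
        for y in self.classes:
            p = (self.class_counts.get(y, 0) + 1) / (self.n + len(self.classes))
            for i, pa in enumerate(self.parents):
                pv = tuple(x[p] for p in pa)
                num = self.joint[i].get((y, pv, x[i]), 0) + 1
                den = self.context[i].get((y, pv), 0) + self.n_values[i]
                p *= num / den
            scores[y] = p
        total = sum(scores.values())
        return {y: s / total for y, s in scores.items()}


# Two attributes; attribute 1 has attribute 0 as an extra parent.
model = CountingBNC(parents=[(), (0,)], n_values=[2, 2], classes=[0, 1])
for x, y in [((0, 0), 0), ((0, 0), 0), ((1, 1), 1), ((1, 1), 1)]:
    model.update(x, y)
probs = model.posterior((0, 0))
```

Because the model is nothing but counts, examples can be streamed from disk one at a time, and the same `update` call with a negative weight removes an example again.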




k-Dependence Bayes (KDB)

• Two pass learning

• 1st pass, learn structure:
  • Collect counts for pairs of attributes with the class.
  • Order attributes based on mutual information (MI) with the class.
  • Select parents based on conditional mutual information (CMI).
    • no more than k parents
    • parents must be earlier in the order

[Figure: network with class Y and attributes X1 to X5.]

• 2nd pass, learn CPTs:
  • Collect statistics according to the structure learned.
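The structure-learning pass can be sketched as follows. This is a simplified illustration, not the authors' implementation: for clarity it recomputes MI and CMI from in-memory columns instead of from pairwise counts gathered in one pass over the data, and the function names (`mutual_information`, `conditional_mi`, `kdb_structure`) are hypothetical.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """I(A;B) in nats from a list of (a, b) observations."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum(c / n * math.log(c * n / (left[a] * right[b]))
               for (a, b), c in joint.items())


def conditional_mi(xi, xj, y):
    """I(Xi;Xj | Y) as the class-weighted average of I(Xi;Xj | Y=y)."""
    n = len(y)
    total = 0.0
    for label, count in Counter(y).items():
        pairs = [(a, b) for a, b, c in zip(xi, xj, y) if c == label]
        total += count / n * mutual_information(pairs)
    return total


def kdb_structure(columns, y, k):
    """Order attributes by MI with the class, then give each attribute up to
    k parents, chosen by CMI among the attributes earlier in the order."""
    mi = [mutual_information(list(zip(col, y))) for col in columns]
    order = sorted(range(len(columns)), key=lambda i: -mi[i])
    parents = {}
    for pos, i in enumerate(order):
        earlier = order[:pos]
        ranked = sorted(
            earlier, key=lambda j: -conditional_mi(columns[i], columns[j], y))
        parents[i] = tuple(ranked[:k])
    return order, parents


y = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 0, 1, 1, 1, 1],   # identical to the class
    [0, 0, 0, 1, 1, 1, 1, 0],   # partly informative
    [0, 1, 0, 1, 0, 1, 0, 1],   # independent of the class
]
order, parents = kdb_structure(columns, y, k=1)
```

Keeping the parents of each attribute ranked by CMI means the structure for a smaller k is a prefix of the structure for a larger k.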





• Training Space: $O(y a^2 v^2 + y a v^{k+1})$
• Classification Space: $O(y a v^{k+1})$
• Training Time: $O(t a^2 + y a^2 v^2 + t a k)$
• Classification Time: $O(y a k)$

t = number of training examples; a = number of attributes; v = average number of values; y = number of classes
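To give a feel for how the space terms scale, here is a worked instance under assumed, purely illustrative values $y = 2$, $a = 50$, $v = 5$ (they are not taken from the slides). It evaluates the pairwise-count term and the CPT term for $k = 1$ and $k = 5$:

```latex
\begin{align*}
y a^2 v^2 &= 2 \cdot 50^2 \cdot 5^2 = 125{,}000 \\
y a v^{k+1}\big|_{k=1} &= 2 \cdot 50 \cdot 5^{2} = 2{,}500 \\
y a v^{k+1}\big|_{k=5} &= 2 \cdot 50 \cdot 5^{6} = 1{,}562{,}500
\end{align*}
```

The CPT term grows by a factor of $v$ with each increment of $k$. Under these assumed values it overtakes the pairwise-count term at $k = 4$.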





• Parameter k controls bias-variance tradeoff

[Figure: RMSE (0 to 0.8) against training set size (0 to 1,000,000) for Naïve Bayes and KDB with k = 1 to 5.]

• No a priori means to anticipate best k

• Spurious attributes may increase error




KDB models are nested

• $M_{i,j}$ = KDB with k = i using attributes 1 to j.

      Attributes (ordered by MI)
k     1      2      3      4
1     M1,1   M1,2   M1,3   M1,4
2     M2,1   M2,2   M2,3   M2,4
3     M3,1   M3,2   M3,3   M3,4
4     M4,1   M4,2   M4,3   M4,4

• $M_{i,j}$ is a minor extension of $M_{i,j-1}$ and $M_{i-1,j}$
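One way to read the table is that every cell is a truncation of the single full structure. The sketch below assumes each attribute's parents are stored ranked by decreasing CMI; the function name `submodel` and the example structure are hypothetical.

```python
def submodel(order, ranked_parents, i, j):
    """Structure of M_{i,j}: the first j attributes in `order`, each keeping
    only its first i parents (parents assumed ranked by decreasing CMI)."""
    kept = order[:j]
    return {att: tuple(ranked_parents[att][:i]) for att in kept}


# Hypothetical full structure over four attributes with at most 3 parents each.
order = [2, 0, 1, 3]
ranked_parents = {2: (), 0: (2,), 1: (2, 0), 3: (0, 2, 1)}

# All nested structures M_{i,j} for i = 1..3 and j = 1..4.
grid = {(i, j): submodel(order, ranked_parents, i, j)
        for i in range(1, 4) for j in range(1, 5)}
```

Moving one cell to the right adds one attribute. Moving one cell down adds at most one parent per attribute. That matches the "minor extension" relation stated on the slide.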




Selective KDB

• In one extra pass, select an attribute subset and the best k using leave-one-out CV.

• The full model subsumes all k × a submodels.

• Very efficient selection between a large class of strong models.




Leave-one-out CV

• Very low bias estimator of out-of-sample performance

• Incremental cross-validation makes it VERY efficient for BNC
  • Collect counts from all data once
  • When classifying a hold-out object, subtract it from the counts

• All k × a nested models can be evaluated with little more computation than the full KDB model
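The count-subtraction idea can be sketched with the simplest case, naive Bayes (k = 0). The counts are collected once, and each held-out example is removed from the counts that it touched before it is classified. The function name, the add-one smoothing and the RMSE definition (error of the probability assigned to the true class) are assumptions made for this sketch; the slides do not spell them out.

```python
import math
from collections import Counter


def loocv_rmse_naive_bayes(X, y, n_values):
    """Leave-one-out RMSE of naive Bayes using a single set of counts."""
    classes = sorted(set(y))
    n = len(y)
    # Collect counts from all data once.
    class_counts = Counter(y)
    joint = Counter((i, xi, c) for x, c in zip(X, y) for i, xi in enumerate(x))
    squared_error = 0.0
    for x, c_true in zip(X, y):
        scores = {}
        for c in classes:
            # The held-out example only contributed to counts of its own class.
            held = 1 if c == c_true else 0
            nc = class_counts[c] - held
            p = (nc + 1) / (n - 1 + len(classes))
            for i, xi in enumerate(x):
                p *= (joint[(i, xi, c)] - held + 1) / (nc + n_values[i])
            scores[c] = p
        posterior_true = scores[c_true] / sum(scores.values())
        squared_error += (1.0 - posterior_true) ** 2
    return math.sqrt(squared_error / n)


X = [(0,), (0,), (1,), (1,)]
y = [0, 0, 1, 1]
loo_rmse = loocv_rmse_naive_bayes(X, y, n_values=[1 + 1])
```

No model is retrained here. Each of the n evaluations costs the same as an ordinary classification, and the same subtraction applies to the higher-order counts of a KDB model.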




Selective KDB

• Training Space: $O(y a^2 v^2 + y a v^{k+1})$
• Classification Space: $O(y a^* v^{k^*+1})$
• Training Time: $O(t a^2 + y a^2 v^2 + t a y k)$
• Classification Time: $O(y a^* k^*)$

t = number of training examples; a = number of attributes; v = average number of values; y = number of classes; $a^*$ = number of attributes selected ($a^* \le a$); $k^*$ = best value of k found ($k^* \le k_{max}$).




Selective KDB

[Figure: RMSE (0 to 0.8) against training set size (0 to 1,000,000) for Naïve Bayes, KDB k=1 to k=5, SKDB Only K, SKDB k=5 Only Atts, and SKDB.]




16 datasets (165K-54M examples)

Name                 No. of cases (million)   No. of atts.   No. of classes   Size
localization         0.165                    5              11               11MB
census-income        0.299                    41             2                136MB
USPSExtended         0.342                    676            2                603MB (s)
MITFaceSetA          0.474                    361            2                584MB
MITFaceSetB          0.489                    361            2                603MB
MSDYear-Prediction   0.515                    90             90               601MB (s)
covtype              0.581                    54             7                72MB
MITFaceSetC          0.839                    361            2                1.1GB
poker-hand           1.025                    10             10               24MB
uscensus1990         2.458                    67             4                325MB
PAMAP2               3.851                    54             19               1.7GB
kddcup               5.210                    41             40               754MB
linkage              5.749                    11             2                251MB
mnist8ms             8.100                    784            10               19GB (s)
satellite            8.705                    138            24               3.6GB
splice               54.628                   141            2                7.3GB (s)

(s) = sparse format.




Numeric attributes

• 5 bin equal frequency discretisation
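Equal-frequency discretisation chooses cut points so that each bin receives roughly the same number of training values. A minimal sketch, assuming the values of one numeric attribute fit in memory (the slides do not say how the cut points were obtained for out-of-core data); both function names are hypothetical:

```python
import bisect


def equal_frequency_cut_points(values, n_bins=5):
    """Cut points that split the sorted values into n_bins groups of
    (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for b in range(1, n_bins):
        cut = ordered[(b * n) // n_bins]
        if not cuts or cut > cuts[-1]:   # skip repeated cut points
            cuts.append(cut)
    return cuts


def discretise(value, cuts):
    """Index of the bin that `value` falls into (0 to len(cuts))."""
    return bisect.bisect_right(cuts, value)


values = list(range(100))
cuts = equal_frequency_cut_points(values, n_bins=5)
bins = [discretise(v, cuts) for v in values]
```

With heavily repeated values, fewer than five distinct cut points may remain, so the attribute ends up with fewer bins.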




Out-of-core BNCs

[Figures: RMSE comparisons, KDB5 vs SKDB and best KDB vs SKDB.]




Out-of-core BNCs

• Training Time Comparisons

[Figure: time in seconds (axis ticks 0.1 to 1000) per dataset for NB, TAN, AODE, KDB k=5 and SKDB. Datasets: localization, census-income, poker-hand, donation, covtype, uscensus, MITFaceSetA, MITFaceSetB, USPSExtended, MITFaceSetC, kddcup.]




Out-of-core BNCs

• Classification Time Comparisons

[Figure: time in seconds (axis ticks 0.1 to 1000) per dataset for NB, TAN, AODE, KDB k=5 and SKDB, with per-dataset k annotations (values between 2 and 5). Datasets: localization, poker-hand, census-income, donation, covtype, uscensus, MITFaceSetA, USPSExtended, MITFaceSetB, MITFaceSetC, kddcup.]




Out-of-core SGD

• SGD in Vowpal Wabbit (VW):
  • Squared and logistic loss functions.
  • Quadratic features (best results).
  • Different number of passes (3 or 10).
  • Discrete attributes converted into binary features.
  • One-against-all for multiclass classification.




Out-of-core SGD

[Figure: two scatter plots on log-scale axes from 0.0001 to 1. First panel: SKDB (max k=5) RMSE against VWLF RMSE. Second panel: SKDB (max k=5) RMSE against VWSF RMSE. Panel captions: "VW (logistic) - RMSE" and "VW (squared) - 0-1 Loss".]

                 VW RMSE (logistic)   VW 0-1 Loss (squared)
Selective KDB    (7+2)-0-7            8-1-7




Out-of-core SGD

[Figure: time in seconds (axis ticks 1 to 10000) per dataset for VWSF, VWLF and SKDB. Datasets: localization, census-income, poker-hand, donation, covtype, uscensus, MITFaceSetA, MITFaceSetB, USPSExtended, MITFaceSetC, kddcup.]




Out-of-core SGD

[Figure: time in seconds (axis ticks 0.1 to 100) per dataset for VWSF, VWLF and SKDB. Datasets: localization, poker-hand, census-income, donation, covtype, uscensus, MITFaceSetA, USPSExtended, MITFaceSetB, MITFaceSetC, kddcup.]




In-core BayesNet and RF

[Figures: comparisons against BayesNet and RF; x = sampled dataset.]

                   BayesNet   RF (Num)
k-selective KDB    6-3-3      5-0-6






[Figure: time in seconds (axis ticks 0.1 to 100000) per dataset for BayesNet, RF and SKDB. Datasets: localization, census-income, donation, covtype, MITFaceSetA, MITFaceSetB, USPSExtended.]






[Figure: time in seconds (axis ticks 0.1 to 100) per dataset for BayesNet, RF and SKDB. Datasets: localization, census-income, donation, covtype, MITFaceSetA, MITFaceSetB, USPSExtended.]




Global comparisons

• Cumulative Ranking

• Selective KDB copes well with high-dimensional datasets and datasets with more than 1 million points.

• RF performs better on datasets with a small number of attributes.

• VW has an advantage for sparse numeric data.




Global comparisons

• Ranking

[Figure: ranking per learner (axis ticks 1 to 8). Values as they appear in the figure, left to right:]

Learner        Ranking
NB             7.63
AODE           6.81
TAN            5.88
BayesNet       3.47
VWLF           3.44
KDB (best k)   3.34
RF (Num)       2.91
SKDB           2.53




Global comparisons (Time)

[Figure: two panels, captioned "Training" and "Test", showing time in seconds per learner for NB, TAN, AODE, KDB, ksKDB, VWsq and VWlog; the second panel uses a log scale.]




Stepping back

• The key trick is nested evaluation of a large class of count-based models

• Works also with 2-pass selective evaluation of AnDE models

• Combines the efficiency of simple generative techniques with the power of discriminative learning

• Can use any loss function that is a function of each $\hat{P}(y \mid x)$
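Two loss functions of this kind appear in the experiments: RMSE and 0-1 loss. The sketch below shows one common way to compute each from per-example posteriors. The exact definitions used by the authors are not given on the slides, so treat these as assumptions; in particular, RMSE here is the error of the probability assigned to the true class.

```python
import math


def rmse(posteriors, labels):
    """Root mean squared error of the probability given to the true class."""
    total = sum((1.0 - p[y]) ** 2 for p, y in zip(posteriors, labels))
    return math.sqrt(total / len(labels))


def zero_one_loss(posteriors, labels):
    """Fraction of examples whose most probable class is not the true class."""
    wrong = sum(1 for p, y in zip(posteriors, labels)
                if max(p, key=p.get) != y)
    return wrong / len(labels)


# Hypothetical posteriors for three examples with true labels 0, 1, 1.
posteriors = [{0: 0.8, 1: 0.2}, {0: 0.4, 1: 0.6}, {0: 0.7, 1: 0.3}]
labels = [0, 1, 1]
example_rmse = rmse(posteriors, labels)
example_error = zero_one_loss(posteriors, labels)
```

Either function can be accumulated example by example during the leave-one-out pass, once per candidate submodel.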




Future Research

• Numeric attributes
  • it seems that there should be something better than discretisation.
  • this remains elusive ...

• Overfitting avoidance

• Increase number of alternative models considered

• Explore other forms of nested BNCs

• Two pass Selective KDB
  • sample small test set in second pass

• Single pass generative/discriminative learning
  • initially learn a simple generative model
  • collect discriminative statistics and refine the model when there is sufficient evidence to make a choice
  • refinements might be attribute selection, structural refinement or weighting
  • repeat




Satellite learning curves


[Figure: RMSE (0 to 1) against training set size for NB, KDB K=1 to K=8, SKDB K=10 and Random Forest.]




Conclusions

• Large data calls for fundamentally new learning algorithms

• We are pioneering a new generation of theoretically well-founded algorithms that are both scalable to very large data and capable of exploiting the fine detail inherent therein
