
1/28

Bayesian Optimization and Automated Machine Learning

Jungtaek Kim (jtkim@postech.ac.kr)

Machine Learning Group, Department of Computer Science and Engineering, POSTECH,

77 Cheongam-ro, Nam-gu, Pohang 37673, Gyeongsangbuk-do, Republic of Korea

June 12, 2018

2/28

Table of Contents

Bayesian Optimization
  Global Optimization
  Bayesian Optimization
  Background: Gaussian Process Regression
  Acquisition Function
  Synthetic Examples
  bayeso

Automated Machine Learning
  Automated Machine Learning
  Previous Works
  AutoML Challenge 2018
  Automated Machine Learning for Soft Voting in an Ensemble of Tree-based Classifiers
  AutoML Challenge 2018 Result

References

3/28

Bayesian Optimization

4/28

Global Optimization

[Figure from Wikipedia (https://en.wikipedia.org/wiki/Local_optimum): illustration of local and global optima.]

I A method to find the global minimum or maximum of a given target function:

  x∗ = arg min_x L(x),   or   x∗ = arg max_x L(x).
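As a small illustration of these definitions, the sketch below evaluates a hypothetical, cheap 1-D target L(x) on a dense grid and takes the arg min and arg max. The function and grid are assumptions for illustration only; this kind of exhaustive evaluation is exactly what becomes infeasible for the expensive targets considered next.

```python
import numpy as np

def L(x):
    # hypothetical, cheap 1-D target used only for illustration
    return np.cos(3.0 * x) + 0.5 * x ** 2

xs = np.linspace(-3.0, 3.0, 10001)   # dense grid of candidate inputs
values = L(xs)

x_min = xs[np.argmin(values)]        # x* = arg min_x L(x)
x_max = xs[np.argmax(values)]        # x* = arg max_x L(x)
print(x_min, x_max)
```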

5/28

Target Functions in Bayesian Optimization

I Usually an expensive black-box function.

I Unknown functional forms or local geometric features such as saddle points, global optima, and local optima.

I Uncertain function continuity.

I High-dimensional and mixed-variable domain space.

6/28

Bayesian Approach

I In Bayesian inference, given prior knowledge of the parameters, p(θ|λ), and a likelihood over the dataset conditioned on the parameters, p(D|θ, λ), the posterior distribution is

  p(θ|D, λ) = p(D|θ, λ) p(θ|λ) / p(D|λ) = p(D|θ, λ) p(θ|λ) / ∫ p(D|θ, λ) p(θ|λ) dθ,

where θ is a vector of parameters, D is an observed dataset, and λ is a vector of hyperparameters.

I Produces an uncertainty estimate as well as a prediction.
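The sketch below illustrates the posterior formula numerically on an assumed coin-flip example: a Bernoulli likelihood, a Beta prior with hyperparameters λ = (α, β), and the evidence approximated by a sum over a grid of θ values.

```python
import numpy as np

theta = np.linspace(1e-6, 1.0 - 1e-6, 1001)        # grid over the parameter theta
alpha, beta = 2.0, 2.0                             # hyperparameters lambda of a Beta prior (assumption)
prior = theta ** (alpha - 1.0) * (1.0 - theta) ** (beta - 1.0)   # p(theta | lambda), unnormalized

data = np.array([1, 0, 1, 1, 0, 1])                # observed dataset D (coin flips, assumption)
k, n = data.sum(), len(data)
likelihood = theta ** k * (1.0 - theta) ** (n - k) # p(D | theta, lambda)

unnormalized = likelihood * prior
evidence = unnormalized.sum() * (theta[1] - theta[0])   # approximates the integral p(D | lambda)
posterior = unnormalized / evidence                # p(theta | D, lambda)

print(theta[np.argmax(posterior)])                 # MAP estimate; the full posterior also carries uncertainty
```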

7/28

Bayesian Optimization

I A powerful strategy for finding the extrema of objective functions that are expensive to evaluate,

I where one does not have a closed-form expression for the objective function,

I but where one can obtain observations at sampled values.

I Since the target function is unknown, Bayesian optimization optimizes an acquisition function instead of the target function itself.

I The acquisition function is computed from the outputs of a Bayesian regression model.

8/28

Bayesian Optimization

Algorithm 1 Bayesian Optimization

Input: Initial data D_{1:I} = {(x_i, y_i)}_{i=1}^{I}.
1: for t = 1, 2, . . . do
2:   Predict a function f∗(x | D_{1:I+t−1}) considered as the objective function.
3:   Find x_{I+t} that maximizes an acquisition function, x_{I+t} = arg max_x a(x | D_{1:I+t−1}).
4:   Sample the true objective function, y_{I+t} = f(x_{I+t}) + ε_{I+t}.
5:   Update D_{1:I+t} = D_{1:I+t−1} ∪ {(x_{I+t}, y_{I+t})}.
6: end for
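A minimal sketch of Algorithm 1 for a 1-D minimization problem is given below. It assumes scikit-learn's GaussianProcessRegressor as the Bayesian regression model and a GP-UCB-style acquisition maximized over random candidates; the toy objective and all settings are illustrative, not the configuration used later in the talk.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def f(x):
    # toy stand-in for an expensive black-box objective (assumption)
    return np.cos(3.0 * x) + 0.5 * x ** 2

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(3, 1))      # initial data D_{1:I}
y = f(X).ravel()

for t in range(10):
    # step 2: fit a surrogate (Bayesian regression model) to the data so far
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)

    # step 3: maximize the acquisition function over random candidates
    candidates = rng.uniform(-3.0, 3.0, size=(1000, 1))
    mu, sigma = gp.predict(candidates, return_std=True)
    acq = -mu + 2.0 * sigma                  # GP-UCB-style acquisition for minimization
    x_next = candidates[np.argmax(acq)]

    # step 4: evaluate the true objective at the acquired point
    y_next = f(x_next).item()

    # step 5: update the dataset
    X = np.vstack([X, x_next])
    y = np.append(y, y_next)

print(X[np.argmin(y)], y.min())              # best observation found
```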

9/28

Background: Gaussian Process

I A collection of random variables, any finite number of which have a joint Gaussian distribution [Rasmussen and Williams, 2006].

I Generally, a Gaussian process (GP) is written as

f ∼ GP(m(x), k(x, x′))

where

m(x) = E[f(x)]

k(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))].
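The definition above says that any finite set of inputs induces a joint Gaussian over the corresponding function values, so sample "functions" can be drawn from a multivariate Gaussian. The sketch below assumes a zero mean function and the squared-exponential covariance introduced on a later slide.

```python
import numpy as np

def se_kernel(x1, x2, sigma_f=1.0, length=1.0):
    # squared-exponential covariance k(x, x') for 1-D inputs
    return sigma_f ** 2 * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / length ** 2)

xs = np.linspace(-3.0, 3.0, 200)
mean = np.zeros_like(xs)                           # m(x) = 0 (assumption)
cov = se_kernel(xs, xs) + 1e-8 * np.eye(len(xs))   # small jitter for numerical stability

rng = np.random.default_rng(1)
samples = rng.multivariate_normal(mean, cov, size=3)   # three functions drawn from the GP prior
print(samples.shape)                               # (3, 200)
```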

10/28

Background: Gaussian Process Regression

[Figure: a one-dimensional Gaussian process regression example; y plotted over x ∈ [−3, 3].]

11/28

Background: Gaussian Process Regression

I One of the basic covariance functions, the squared-exponential covariance function in one dimension, is

  k(x, x′) = σ_f² exp(−(x − x′)² / (2l²)) + σ_n² δ_{xx′},

where σ_f is the signal standard deviation, l is the length scale, and σ_n is the noise standard deviation [Rasmussen and Williams, 2006].

I Posterior mean function and covariance function:

  µ∗ = K(X∗, X) (K(X, X) + σ_n² I)⁻¹ y,

  Σ∗ = K(X∗, X∗) − K(X∗, X) (K(X, X) + σ_n² I)⁻¹ K(X, X∗).
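The sketch below evaluates these posterior equations directly on an assumed 1-D toy dataset with the squared-exponential covariance; a practical implementation would use a Cholesky factorization instead of an explicit inverse.

```python
import numpy as np

def se_kernel(x1, x2, sigma_f=1.0, length=1.0):
    # squared-exponential covariance k(x, x') for 1-D inputs
    return sigma_f ** 2 * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / length ** 2)

X = np.array([-2.0, -0.5, 1.0, 2.5])             # training inputs (assumption)
y = np.sin(X)                                    # toy observations, modeled as noisy with std sigma_n
X_star = np.linspace(-3.0, 3.0, 100)             # test inputs
sigma_n = 0.1                                    # noise standard deviation (assumption)

K = se_kernel(X, X) + sigma_n ** 2 * np.eye(len(X))      # K(X, X) + sigma_n^2 I
K_star = se_kernel(X_star, X)                            # K(X*, X)
K_inv = np.linalg.inv(K)                                 # explicit inverse, fine for a sketch

mu_star = K_star @ K_inv @ y                                          # posterior mean
Sigma_star = se_kernel(X_star, X_star) - K_star @ K_inv @ K_star.T   # posterior covariance
print(mu_star.shape, Sigma_star.shape)
```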

12/28

Background: Gaussian Process Regression

I If a non-zero mean prior is given, the posterior mean and covariance functions are

  µ∗ = µ(X∗) + K(X∗, X) (K(X, X) + σ_n² I)⁻¹ (y − µ(X)),

  Σ∗ = K(X∗, X∗) − K(X∗, X) (K(X, X) + σ_n² I)⁻¹ K(X, X∗).

13/28

Acquisition Functions

I A function that acquires the next point to evaluate for an expensive black-box function.

I Traditionally, the probability of improvement (PI) [Kushner, 1964], the expected improvement (EI) [Mockus et al., 1978], and GP upper confidence bound (GP-UCB) [Srinivas et al., 2010] are used.

I Several functions, such as entropy search [Hennig and Schuler, 2012] and a combination of existing functions [Kim and Choi, 2018b], have recently been proposed.

14/28

Traditional Acquisition Functions (Minimization Case)

I PI [Kushner, 1964]

aPI(x|D, λ) = Φ(Z),

I EI [Mockus et al., 1978]

aEI(x|D, λ) = { (f(x+) − µ(x)) Φ(Z) + σ(x) φ(Z)   if σ(x) > 0
              { 0                                  if σ(x) = 0,

I GP-UCB [Srinivas et al., 2010]

aUCB(x|D,λ) = −µ(x) + βσ(x),

where

Z = { (f(x+) − µ(x)) / σ(x)   if σ(x) > 0
    { 0                       if σ(x) = 0,

µ(x) := µ(x|D, λ),   σ(x) := σ(x|D, λ),

and x+ is the best observation so far.
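The sketch below computes the three acquisition values for the minimization case, given posterior means µ(x), standard deviations σ(x), and the best observation f(x+); the numeric inputs are assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm

def acquisitions(mu, sigma, f_best, beta=2.0):
    # Z = (f(x+) - mu) / sigma where sigma > 0, and 0 otherwise
    z = np.where(sigma > 0, (f_best - mu) / np.maximum(sigma, 1e-12), 0.0)
    pi = norm.cdf(z)                                          # probability of improvement
    ei = np.where(sigma > 0,
                  (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z),
                  0.0)                                        # expected improvement
    ucb = -mu + beta * sigma                                  # GP-UCB (minimization form)
    return pi, ei, ucb

mu = np.array([0.2, -0.1, 0.5])          # posterior means at candidate points (assumption)
sigma = np.array([0.3, 0.0, 0.8])        # posterior standard deviations (assumption)
pi, ei, ucb = acquisitions(mu, sigma, f_best=0.0)
print(pi, ei, ucb)
```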

15/28

Synthetic Examples

[Figure: six panels, (a) Iteration 1 through (f) Iteration 6. Each panel plots the objective y (top) and the acquisition function value (bottom) over x ∈ [−5, 5].]

Figure 1: y = 4.0 cos(x) + 0.1x + 2.0 sin(x) + 0.4(x − 0.5)². EI is used to optimize.

16/28

bayeso

I A simple but essential Bayesian optimization package.

I Written in Python.

I Licensed under the MIT license.

I https://github.com/jungtaekkim/bayeso

17/28

Automated Machine Learning

18/28

Automated Machine Learning

I Attempts to automatically find the optimal machine learning model without human intervention.

I Usually includes feature transformation, algorithm selection, and hyperparameter optimization.

I Given a training dataset D_train and a validation dataset D_val, the optimal hyperparameter vector λ∗ for an automated machine learning system is

  λ∗ = AutoML(D_train, D_val, Λ),

where AutoML is an automated machine learning system and Λ is the search space of hyperparameter vectors λ.
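As a minimal sketch of this definition, the code below treats AutoML(·) as a search over a small, assumed candidate set Λ and returns the hyperparameter vector with the best validation score; in practice the search would be driven by Bayesian optimization as described earlier, and the model and data split here are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)          # toy data (assumption)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.4, random_state=0)

# Lambda: a small, finite candidate set of hyperparameter vectors (assumption)
Lambda = [{"n_estimators": n, "max_depth": d} for n in (50, 100) for d in (3, None)]

def automl(X_train, y_train, X_val, y_val, Lambda):
    scores = []
    for lam in Lambda:
        model = RandomForestClassifier(random_state=0, **lam).fit(X_train, y_train)
        scores.append(model.score(X_val, y_val))     # validation accuracy for this lambda
    return Lambda[scores.index(max(scores))]         # lambda* with the best validation score

print(automl(X_train, y_train, X_val, y_val, Lambda))
```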

19/28

Previous Works

I Bayesian optimization and hyperparameter optimization
  I GPyOpt [The GPyOpt authors, 2016]
  I SMAC [Hutter et al., 2011]
  I BayesOpt [Martinez-Cantin, 2014]
  I bayeso
  I SigOpt API [Martinez-Cantin et al., 2018]

I Automated machine learning framework
  I auto-sklearn [Feurer et al., 2015]
  I Auto-WEKA [Thornton et al., 2013]
  I Our previous work [Kim et al., 2016]

20/28

AutoML Challenge 2018

I Two phases: feedback phase and AutoML challenge phase.

I In the feedback phase, five datasets for binary classification are provided.

I Given training/validation/test datasets, after a code or prediction file is submitted, a validation measure is posted on the leaderboard.

I In the AutoML challenge phase, challenge winners are determined by comparing a normalized area under the ROC curve (AUC) metric on blind datasets:

  Normalized AUC = 2 · AUC − 1.
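A small illustration of this metric, assuming hypothetical labels and scores and using scikit-learn's roc_auc_score:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1]                 # hypothetical binary labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.7]     # hypothetical predicted scores

auc = roc_auc_score(y_true, y_score)
normalized_auc = 2.0 * auc - 1.0         # maps AUC in [0, 1] to [-1, 1]
print(auc, normalized_auc)
```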

21/28

AutoML Challenge 2018

Figure 2: Datasets of the feedback phase in AutoML Challenge 2018. Train. #, Valid. #, Test #, Feature #, Chrono., and Budget stand for training dataset size, validation dataset size, test dataset size, the number of features, chronological order, and time budget, respectively. The time budget is shown in seconds.

22/28

Background: Soft Majority Voting

I An ensemble method that constructs a classifier using a majority vote of k base classifiers.

I Class assignment of the soft majority voting classifier:

  c_i = arg max Σ_{j=1}^{k} w_j p_i^{(j)}

for 1 ≤ i ≤ n, where n is the number of instances, arg max returns the index of the maximum value in the given vector, w_j ≥ 0 is the weight of base classifier j, and p_i^{(j)} is the class-probability vector of base classifier j.
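The sketch below applies this weighted soft-voting rule to assumed class-probability vectors from k = 3 base classifiers for n = 2 instances.

```python
import numpy as np

# class-probability vectors p_i^(j) from k = 3 base classifiers for n = 2 instances (assumption)
probs = np.array([
    [[0.7, 0.3], [0.2, 0.8]],    # classifier 1
    [[0.6, 0.4], [0.4, 0.6]],    # classifier 2
    [[0.3, 0.7], [0.1, 0.9]],    # classifier 3
])
weights = np.array([0.5, 0.3, 0.2])              # w_j >= 0 (assumption)

weighted = np.tensordot(weights, probs, axes=1)  # sum_j w_j p_i^(j), shape (n, classes)
c = np.argmax(weighted, axis=1)                  # predicted class per instance
print(c)
```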

23/28

Our AutoML System [Kim and Choi, 2018a]

[Figure: system diagram. Dataset → Automated Machine Learning System, consisting of a Voting Classifier over a Gradient Boosting Classifier, an Extra-trees Classifier, and a Random Forests Classifier, tuned by Bayesian Optimization → Prediction.]

Figure 3: Our automated machine learning system. A voting classifier constructed from three tree-based classifiers (gradient boosting, extra-trees, and random forests classifiers) produces predictions, where the voting classifier and the tree-based classifiers are iteratively optimized by Bayesian optimization for the given time budget.

24/28

Our AutoML System [Kim and Choi, 2018a]

I Written in Python.

I Use scikit-learn and our own Bayesian optimization package.

I Split the training dataset into training (0.6) and validation (0.4) sets for Bayesian optimization.

I Optimize six hyperparameters:

1. extra-trees classifier weight / gradient boosting classifier weight for the voting classifier,
2. random forests classifier weight / gradient boosting classifier weight for the voting classifier,
3. the number of estimators for the gradient boosting classifier,
4. the number of estimators for the extra-trees classifier,
5. the number of estimators for the random forests classifier,
6. maximum depth of the gradient boosting classifier.

I Use GP-UCB as the acquisition function.
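A minimal sketch of the model family described above: a soft-voting ensemble of the three tree-based classifiers built with scikit-learn, with the six tuned quantities exposed as arguments. The fixed default values are placeholders; in the actual system these values are chosen by GP-UCB Bayesian optimization within the time budget.

```python
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)

def build_voting_classifier(w_et=1.0, w_rf=1.0, n_gb=100, n_et=100, n_rf=100,
                            max_depth_gb=3):
    # voting weights are relative to the gradient boosting classifier (weight 1.0)
    gb = GradientBoostingClassifier(n_estimators=n_gb, max_depth=max_depth_gb)
    et = ExtraTreesClassifier(n_estimators=n_et)
    rf = RandomForestClassifier(n_estimators=n_rf)
    return VotingClassifier(
        estimators=[("gb", gb), ("et", et), ("rf", rf)],
        voting="soft",
        weights=[1.0, w_et, w_rf],
    )

# hypothetical usage with data split as described above:
# clf = build_voting_classifier().fit(X_train, y_train)
# proba = clf.predict_proba(X_val)
```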

25/28

AutoML Challenge 2018 Result

Figure 4: AutoML Challenge 2018 result. A normalized area under the ROC curve (AUC) score (upper cell in each row) is computed for each dataset, and a dataset rank (lower cell in each row) is determined by numerical order of the normalized AUC scores. Finally, an overall rank is determined by the average rank over the five datasets.

26/28

References

27/28

References I

M. Feurer, A. Klein, K. Eggensperger, J. T. Springenberg, M. Blum, and F. Hutter. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems (NIPS), pages 2962–2970, Montreal, Quebec, Canada, 2015.

P. Hennig and C. J. Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13:1809–1837, 2012.

F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proceedings of the International Conference on Learning and Intelligent Optimization, pages 507–523, Rome, Italy, 2011.

J. Kim and S. Choi. Automated machine learning for soft voting in an ensemble of tree-based classifiers, 2018a. https://github.com/jungtaekkim/automl-challenge-2018.

J. Kim and S. Choi. Clustering-guided GP-UCB for Bayesian optimization. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Calgary, Alberta, Canada, 2018b.

J. Kim, J. Jeong, and S. Choi. AutoML Challenge: AutoML framework using random space partitioning optimizer. In International Conference on Machine Learning Workshop on Automatic Machine Learning, New York, New York, USA, 2016.

H. J. Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86(1):97–106, 1964.

28/28

References II

R. Martinez-Cantin. BayesOpt: A Bayesian optimization library for nonlinear optimization, experimental design and bandits. Journal of Machine Learning Research, 15:3735–3739, 2014.

R. Martinez-Cantin, K. Tee, and M. McCourt. Practical Bayesian optimization in the presence of outliers. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Playa Blanca, Lanzarote, Canary Islands, 2018.

J. Mockus, V. Tiesis, and A. Zilinskas. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2:117–129, 1978.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the International Conference on Machine Learning (ICML), pages 1015–1022, Haifa, Israel, 2010.

The GPyOpt authors. GPyOpt: A Bayesian optimization framework in Python, 2016. https://github.com/SheffieldML/GPyOpt.

C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pages 847–855, Chicago, Illinois, USA, 2013.
