
1/28

Bayesian Optimization and Automated Machine Learning

Jungtaek Kim (jtkim@postech.ac.kr)

Machine Learning Group, Department of Computer Science and Engineering, POSTECH,

77 Cheongam-ro, Nam-gu, Pohang 37673, Gyeongsangbuk-do, Republic of Korea

June 12, 2018

2/28

Table of Contents

Bayesian Optimization
  Global Optimization
  Bayesian Optimization
  Background: Gaussian Process Regression
  Acquisition Function
  Synthetic Examples
  bayeso

Automated Machine Learning
  Automated Machine Learning
  Previous Works
  AutoML Challenge 2018
  Automated Machine Learning for Soft Voting in an Ensemble of Tree-based Classifiers
  AutoML Challenge 2018 Result

References

3/28

Bayesian Optimization

4/28

Global Optimization

[Figure from Wikipedia (https://en.wikipedia.org/wiki/Local_optimum): illustration of local and global optima.]

I A method to find the global minimum or maximum of a given target function:

  x∗ = arg min_x L(x),   or   x∗ = arg max_x L(x).
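As a small illustration of these definitions, the sketch below evaluates a hypothetical, cheap 1-D target L(x) on a dense grid and takes the arg min and arg max. The function and grid are assumptions for illustration only; this kind of exhaustive evaluation is exactly what becomes infeasible for the expensive targets considered next.

```python
import numpy as np

def L(x):
    # hypothetical, cheap 1-D target used only for illustration
    return np.cos(3.0 * x) + 0.5 * x ** 2

xs = np.linspace(-3.0, 3.0, 10001)   # dense grid of candidate inputs
values = L(xs)

x_min = xs[np.argmin(values)]        # x* = arg min_x L(x)
x_max = xs[np.argmax(values)]        # x* = arg max_x L(x)
print(x_min, x_max)
```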

5/28

Target Functions in Bayesian Optimization

I Usually an expensive black-box function.

I Unknown functional forms or local geometric features such as saddle points, global optima, and local optima.

I Uncertain function continuity.

I High-dimensional and mixed-variable domain space.

6/28

Bayesian Approach

I In Bayesian inference, given prior knowledge of the parameters, p(θ|λ), and a likelihood over the dataset conditioned on the parameters, p(D|θ, λ), the posterior distribution is

  p(θ|D, λ) = p(D|θ, λ) p(θ|λ) / p(D|λ) = p(D|θ, λ) p(θ|λ) / ∫ p(D|θ, λ) p(θ|λ) dθ,

where θ is a vector of parameters, D is an observed dataset, and λ is a vector of hyperparameters.

I Produces an uncertainty estimate as well as a prediction.
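The sketch below illustrates the posterior formula numerically on an assumed coin-flip example: a Bernoulli likelihood, a Beta prior with hyperparameters λ = (α, β), and the evidence approximated by a sum over a grid of θ values.

```python
import numpy as np

theta = np.linspace(1e-6, 1.0 - 1e-6, 1001)        # grid over the parameter theta
alpha, beta = 2.0, 2.0                             # hyperparameters lambda of a Beta prior (assumption)
prior = theta ** (alpha - 1.0) * (1.0 - theta) ** (beta - 1.0)   # p(theta | lambda), unnormalized

data = np.array([1, 0, 1, 1, 0, 1])                # observed dataset D (coin flips, assumption)
k, n = data.sum(), len(data)
likelihood = theta ** k * (1.0 - theta) ** (n - k) # p(D | theta, lambda)

unnormalized = likelihood * prior
evidence = unnormalized.sum() * (theta[1] - theta[0])   # approximates the integral p(D | lambda)
posterior = unnormalized / evidence                # p(theta | D, lambda)

print(theta[np.argmax(posterior)])                 # MAP estimate; the full posterior also carries uncertainty
```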

7/28

Bayesian Optimization

I A powerful strategy for finding the extrema of objective functions that are expensive to evaluate,

I where one does not have a closed-form expression for the objective function,

I but where one can obtain observations at sampled values.

I Since the target function is unknown, Bayesian optimization optimizes an acquisition function instead of the target function itself.

I The acquisition function is computed from the outputs of a Bayesian regression model.

8/28

Bayesian Optimization

Algorithm 1 Bayesian Optimization

Input: Initial data D_{1:I} = {(x_i, y_i)}_{i=1}^{I}.
1: for t = 1, 2, . . . do
2:   Predict a function f∗(x | D_{1:I+t−1}) considered as the objective function.
3:   Find x_{I+t} that maximizes an acquisition function, x_{I+t} = arg max_x a(x | D_{1:I+t−1}).
4:   Sample the true objective function, y_{I+t} = f(x_{I+t}) + ε_{I+t}.
5:   Update D_{1:I+t} = D_{1:I+t−1} ∪ {(x_{I+t}, y_{I+t})}.
6: end for
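A minimal sketch of Algorithm 1 for a 1-D minimization problem is given below. It assumes scikit-learn's GaussianProcessRegressor as the Bayesian regression model and a GP-UCB-style acquisition maximized over random candidates; the toy objective and all settings are illustrative, not the configuration used later in the talk.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def f(x):
    # toy stand-in for an expensive black-box objective (assumption)
    return np.cos(3.0 * x) + 0.5 * x ** 2

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(3, 1))      # initial data D_{1:I}
y = f(X).ravel()

for t in range(10):
    # step 2: fit a surrogate (Bayesian regression model) to the data so far
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)

    # step 3: maximize the acquisition function over random candidates
    candidates = rng.uniform(-3.0, 3.0, size=(1000, 1))
    mu, sigma = gp.predict(candidates, return_std=True)
    acq = -mu + 2.0 * sigma                  # GP-UCB-style acquisition for minimization
    x_next = candidates[np.argmax(acq)]

    # step 4: evaluate the true objective at the acquired point
    y_next = f(x_next).item()

    # step 5: update the dataset
    X = np.vstack([X, x_next])
    y = np.append(y, y_next)

print(X[np.argmin(y)], y.min())              # best observation found
```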

9/28

Background: Gaussian Process

I A collection of random variables, any finite number of which have a joint Gaussian distribution [Rasmussen and Williams, 2006].

I Generally, a Gaussian process (GP) is written as

f ∼ GP(m(x), k(x, x′))

where

m(x) = E[f(x)]

k(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))].
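The definition above says that any finite set of inputs induces a joint Gaussian over the corresponding function values, so sample "functions" can be drawn from a multivariate Gaussian. The sketch below assumes a zero mean function and the squared-exponential covariance introduced on a later slide.

```python
import numpy as np

def se_kernel(x1, x2, sigma_f=1.0, length=1.0):
    # squared-exponential covariance k(x, x') for 1-D inputs
    return sigma_f ** 2 * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / length ** 2)

xs = np.linspace(-3.0, 3.0, 200)
mean = np.zeros_like(xs)                           # m(x) = 0 (assumption)
cov = se_kernel(xs, xs) + 1e-8 * np.eye(len(xs))   # small jitter for numerical stability

rng = np.random.default_rng(1)
samples = rng.multivariate_normal(mean, cov, size=3)   # three functions drawn from the GP prior
print(samples.shape)                               # (3, 200)
```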

10/28

Background: Gaussian Process Regression

[Figure: a one-dimensional Gaussian process regression example; y plotted over x ∈ [−3, 3].]

11/28

Background: Gaussian Process Regression

I One of the basic covariance functions, the squared-exponential covariance function in one dimension, is

  k(x, x′) = σ_f² exp(−(x − x′)² / (2l²)) + σ_n² δ_{xx′},

where σ_f is the signal standard deviation, l is the length scale, and σ_n is the noise standard deviation [Rasmussen and Williams, 2006].

I Posterior mean function and covariance function:

  µ∗ = K(X∗, X) (K(X, X) + σ_n² I)⁻¹ y,

  Σ∗ = K(X∗, X∗) − K(X∗, X) (K(X, X) + σ_n² I)⁻¹ K(X, X∗).
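The sketch below evaluates these posterior equations directly on an assumed 1-D toy dataset with the squared-exponential covariance; a practical implementation would use a Cholesky factorization instead of an explicit inverse.

```python
import numpy as np

def se_kernel(x1, x2, sigma_f=1.0, length=1.0):
    # squared-exponential covariance k(x, x') for 1-D inputs
    return sigma_f ** 2 * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / length ** 2)

X = np.array([-2.0, -0.5, 1.0, 2.5])             # training inputs (assumption)
y = np.sin(X)                                    # toy observations, modeled as noisy with std sigma_n
X_star = np.linspace(-3.0, 3.0, 100)             # test inputs
sigma_n = 0.1                                    # noise standard deviation (assumption)

K = se_kernel(X, X) + sigma_n ** 2 * np.eye(len(X))      # K(X, X) + sigma_n^2 I
K_star = se_kernel(X_star, X)                            # K(X*, X)
K_inv = np.linalg.inv(K)                                 # explicit inverse, fine for a sketch

mu_star = K_star @ K_inv @ y                                          # posterior mean
Sigma_star = se_kernel(X_star, X_star) - K_star @ K_inv @ K_star.T   # posterior covariance
print(mu_star.shape, Sigma_star.shape)
```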

12/28

Background: Gaussian Process Regression

I If a non-zero mean prior is given, the posterior mean and covariance functions are

  µ∗ = µ(X∗) + K(X∗, X) (K(X, X) + σ_n² I)⁻¹ (y − µ(X)),

  Σ∗ = K(X∗, X∗) − K(X∗, X) (K(X, X) + σ_n² I)⁻¹ K(X, X∗).

13/28

Acquisition Functions

I A function that acquires the next point to evaluate for an expensive black-box function.

I Traditionally, the probability of improvement (PI) [Kushner, 1964], the expected improvement (EI) [Mockus et al., 1978], and GP upper confidence bound (GP-UCB) [Srinivas et al., 2010] are used.

I Several functions, such as entropy search [Hennig and Schuler, 2012] and a combination of existing functions [Kim and Choi, 2018b], have recently been proposed.

14/28

Traditional Acquisition Functions (Minimization Case)

I PI [Kushner, 1964]

aPI(x|D, λ) = Φ(Z),

I EI [Mockus et al., 1978]

aEI(x|D, λ) = { (f(x+) − µ(x)) Φ(Z) + σ(x) φ(Z)   if σ(x) > 0
              { 0                                  if σ(x) = 0,

I GP-UCB [Srinivas et al., 2010]

aUCB(x|D,λ) = −µ(x) + βσ(x),

where

Z = { (f(x+) − µ(x)) / σ(x)   if σ(x) > 0
    { 0                       if σ(x) = 0,

µ(x) := µ(x|D, λ),   σ(x) := σ(x|D, λ),

and x+ is the best observation so far.
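The sketch below computes the three acquisition values for the minimization case, given posterior means µ(x), standard deviations σ(x), and the best observation f(x+); the numeric inputs are assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm

def acquisitions(mu, sigma, f_best, beta=2.0):
    # Z = (f(x+) - mu) / sigma where sigma > 0, and 0 otherwise
    z = np.where(sigma > 0, (f_best - mu) / np.maximum(sigma, 1e-12), 0.0)
    pi = norm.cdf(z)                                          # probability of improvement
    ei = np.where(sigma > 0,
                  (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z),
                  0.0)                                        # expected improvement
    ucb = -mu + beta * sigma                                  # GP-UCB (minimization form)
    return pi, ei, ucb

mu = np.array([0.2, -0.1, 0.5])          # posterior means at candidate points (assumption)
sigma = np.array([0.3, 0.0, 0.8])        # posterior standard deviations (assumption)
pi, ei, ucb = acquisitions(mu, sigma, f_best=0.0)
print(pi, ei, ucb)
```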

15/28

Synthetic Examples

[Figure: six panels, (a) Iteration 1 through (f) Iteration 6. Each panel plots the objective y (top) and the acquisition function value (bottom) over x ∈ [−5, 5].]

Figure 1: y = 4.0 cos(x) + 0.1x + 2.0 sin(x) + 0.4(x − 0.5)². EI is used to optimize.

16/28

bayeso

I A simple but essential Bayesian optimization package.

I Written in Python.

I Licensed under the MIT license.

I https://github.com/jungtaekkim/bayeso

17/28

Automated Machine Learning

18/28

Automated Machine Learning

I Attempts to automatically find the optimal machine learning model without human intervention.

I Usually includes feature transformation, algorithm selection, and hyperparameter optimization.

I Given a training dataset D_train and a validation dataset D_val, the optimal hyperparameter vector λ∗ for an automated machine learning system is

  λ∗ = AutoML(D_train, D_val, Λ),

where AutoML is an automated machine learning system and Λ is the search space of hyperparameter vectors λ.
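As a minimal sketch of this definition, the code below treats AutoML(·) as a search over a small, assumed candidate set Λ and returns the hyperparameter vector with the best validation score; in practice the search would be driven by Bayesian optimization as described earlier, and the model and data split here are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)          # toy data (assumption)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.4, random_state=0)

# Lambda: a small, finite candidate set of hyperparameter vectors (assumption)
Lambda = [{"n_estimators": n, "max_depth": d} for n in (50, 100) for d in (3, None)]

def automl(X_train, y_train, X_val, y_val, Lambda):
    scores = []
    for lam in Lambda:
        model = RandomForestClassifier(random_state=0, **lam).fit(X_train, y_train)
        scores.append(model.score(X_val, y_val))     # validation accuracy for this lambda
    return Lambda[scores.index(max(scores))]         # lambda* with the best validation score

print(automl(X_train, y_train, X_val, y_val, Lambda))
```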

19/28

Previous Works

I Bayesian optimization and hyperparameter optimization
  I GPyOpt [The GPyOpt authors, 2016]
  I SMAC [Hutter et al., 2011]
  I BayesOpt [Martinez-Cantin, 2014]
  I bayeso
  I SigOpt API [Martinez-Cantin et al., 2018]

I Automated machine learning framework
  I auto-sklearn [Feurer et al., 2015]
  I Auto-WEKA [Thornton et al., 2013]
  I Our previous work [Kim et al., 2016]

20/28

AutoML Challenge 2018

I Two phases: feedback phase and AutoML challenge phase.

I In the feedback phase, five datasets for binary classification are provided.

I Given training/validation/test datasets, after a code or prediction file is submitted, a validation measure is posted on the leaderboard.

I In the AutoML challenge phase, challenge winners are determined by comparing a normalized area under the ROC curve (AUC) metric on blind datasets:

  Normalized AUC = 2 · AUC − 1.
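A small illustration of this metric, assuming hypothetical labels and scores and using scikit-learn's roc_auc_score:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1]                 # hypothetical binary labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.7]     # hypothetical predicted scores

auc = roc_auc_score(y_true, y_score)
normalized_auc = 2.0 * auc - 1.0         # maps AUC in [0, 1] to [-1, 1]
print(auc, normalized_auc)
```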

21/28

AutoML Challenge 2018

Figure 2: Datasets of the feedback phase in AutoML Challenge 2018. Train. #, Valid. #, Test #, Feature #, Chrono., and Budget stand for training dataset size, validation dataset size, test dataset size, the number of features, chronological order, and time budget, respectively. The time budget is shown in seconds.

22/28

Background: Soft Majority Voting

I An ensemble method that constructs a classifier using a majority vote of k base classifiers.

I Class assignment of the soft majority voting classifier:

  c_i = arg max Σ_{j=1}^{k} w_j p_i^{(j)}

for 1 ≤ i ≤ n, where n is the number of instances, arg max returns the index of the maximum value in the given vector, w_j ≥ 0 is the weight of base classifier j, and p_i^{(j)} is the class-probability vector of base classifier j.
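The sketch below applies this weighted soft-voting rule to assumed class-probability vectors from k = 3 base classifiers for n = 2 instances.

```python
import numpy as np

# class-probability vectors p_i^(j) from k = 3 base classifiers for n = 2 instances (assumption)
probs = np.array([
    [[0.7, 0.3], [0.2, 0.8]],    # classifier 1
    [[0.6, 0.4], [0.4, 0.6]],    # classifier 2
    [[0.3, 0.7], [0.1, 0.9]],    # classifier 3
])
weights = np.array([0.5, 0.3, 0.2])              # w_j >= 0 (assumption)

weighted = np.tensordot(weights, probs, axes=1)  # sum_j w_j p_i^(j), shape (n, classes)
c = np.argmax(weighted, axis=1)                  # predicted class per instance
print(c)
```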

23/28

Our AutoML System [Kim and Choi, 2018a]

[Figure: system diagram. Dataset → Automated Machine Learning System, consisting of a Voting Classifier over a Gradient Boosting Classifier, an Extra-trees Classifier, and a Random Forests Classifier, tuned by Bayesian Optimization → Prediction.]

Figure 3: Our automated machine learning system. A voting classifier constructed from three tree-based classifiers (gradient boosting, extra-trees, and random forests classifiers) produces predictions, where the voting classifier and the tree-based classifiers are iteratively optimized by Bayesian optimization for the given time budget.

24/28

Our AutoML System [Kim and Choi, 2018a]

I Written in Python.

I Use scikit-learn and our own Bayesian optimization package.

I Split the training dataset into training (0.6) and validation (0.4) sets for Bayesian optimization.

I Optimize six hyperparameters:

1. extra-trees classifier weight / gradient boosting classifier weight for the voting classifier,
2. random forests classifier weight / gradient boosting classifier weight for the voting classifier,
3. the number of estimators for the gradient boosting classifier,
4. the number of estimators for the extra-trees classifier,
5. the number of estimators for the random forests classifier,
6. maximum depth of the gradient boosting classifier.

I Use GP-UCB as the acquisition function.
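A minimal sketch of the model family described above: a soft-voting ensemble of the three tree-based classifiers built with scikit-learn, with the six tuned quantities exposed as arguments. The fixed default values are placeholders; in the actual system these values are chosen by GP-UCB Bayesian optimization within the time budget.

```python
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)

def build_voting_classifier(w_et=1.0, w_rf=1.0, n_gb=100, n_et=100, n_rf=100,
                            max_depth_gb=3):
    # voting weights are relative to the gradient boosting classifier (weight 1.0)
    gb = GradientBoostingClassifier(n_estimators=n_gb, max_depth=max_depth_gb)
    et = ExtraTreesClassifier(n_estimators=n_et)
    rf = RandomForestClassifier(n_estimators=n_rf)
    return VotingClassifier(
        estimators=[("gb", gb), ("et", et), ("rf", rf)],
        voting="soft",
        weights=[1.0, w_et, w_rf],
    )

# hypothetical usage with data split as described above:
# clf = build_voting_classifier().fit(X_train, y_train)
# proba = clf.predict_proba(X_val)
```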

25/28

AutoML Challenge 2018 Result

Figure 4: AutoML Challenge 2018 result. A normalized area under the ROC curve (AUC) score (upper cell in each row) is computed for each dataset, and a dataset rank (lower cell in each row) is determined by numerical order of the normalized AUC scores. Finally, an overall rank is determined by the average rank over the five datasets.

26/28

References

27/28

References I

M. Feurer, A. Klein, K. Eggensperger, J. T. Springenberg, M. Blum, and F. Hutter. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems (NIPS), pages 2962–2970, Montreal, Quebec, Canada, 2015.

P. Hennig and C. J. Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13:1809–1837, 2012.

F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proceedings of the International Conference on Learning and Intelligent Optimization, pages 507–523, Rome, Italy, 2011.

J. Kim and S. Choi. Automated machine learning for soft voting in an ensemble of tree-based classifiers, 2018a. https://github.com/jungtaekkim/automl-challenge-2018.

J. Kim and S. Choi. Clustering-guided GP-UCB for Bayesian optimization. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Calgary, Alberta, Canada, 2018b.

J. Kim, J. Jeong, and S. Choi. AutoML Challenge: AutoML framework using random space partitioning optimizer. In International Conference on Machine Learning Workshop on Automatic Machine Learning, New York, New York, USA, 2016.

H. J. Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86(1):97–106, 1964.

28/28

References II

R. Martinez-Cantin. BayesOpt: A Bayesian optimization library for nonlinear optimization, experimental design and bandits. Journal of Machine Learning Research, 15:3735–3739, 2014.

R. Martinez-Cantin, K. Tee, and M. McCourt. Practical Bayesian optimization in the presence of outliers. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Playa Blanca, Lanzarote, Canary Islands, 2018.

J. Mockus, V. Tiesis, and A. Zilinskas. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2:117–129, 1978.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the International Conference on Machine Learning (ICML), pages 1015–1022, Haifa, Israel, 2010.

The GPyOpt authors. GPyOpt: A Bayesian optimization framework in Python, 2016. https://github.com/SheffieldML/GPyOpt.

C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pages 847–855, Chicago, Illinois, USA, 2013.
