A Mathematically Derived Number of Resamplings for Noisy Optimization

Jialin Liu, David L. Saint-Pierre, Olivier Teytaud
[email protected], [email protected], [email protected]

Team TAO, INRIA Saclay-CNRS-LRI, Univ. Paris-Sud, France

Noisy Black-box Optimization
Objective function f : R^d → R
Optimum x* = argmin_{x ∈ R^d} f(x)
Recommendation at iteration n: x_n
Recommendation after #eval evaluations: x_{#eval}

Resampling rules
For a sphere function, √d σ_n ∼ O(||x − x*||) and σ_Noise ∼ O(1/√r_n). To keep the standard deviation of the same order as the differences between fitness values, which is d σ_n^2, we should have r_n ∼ σ_n^{-4} d^{-2}. Using a log-linear convergence log(σ_n) ≃ −Cn, we suggest C decreasing as Θ(1/d) in

    r_n ∼ exp(Cn)/d^2.    (1)
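Spelling out the substitution behind Eq. (1) (our rephrasing; the multiplicative constant in the exponent is absorbed into C, which the rule only fixes up to such constants): from r_n ∼ σ_n^{-4} d^{-2} and log(σ_n) ≃ −Cn we get

    r_n ∼ exp(4Cn)/d^2,

i.e. an expression of the form exp(const·n)/d^2 whose constant is Θ(1/d) whenever C is; Eq. (1) writes that constant simply as C.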

We use:

    Notation    r_n
    linear      n
    quadratic   n^2
    cubic       n^3
    2exp        2^n
    exp/10      ⌈exp(n/10)⌉
    math1       ⌈exp(4n/(5d))/d^2⌉
    math2       ⌈exp(n/10)/d^2⌉
    math3       ⌈exp(n/√d)/d^2⌉
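As a reading aid, here is a minimal sketch (Python; the function name resamplings and the rule strings are ours, not from the poster) of these non-adaptive schedules as functions of the generation index n and the dimension d:

import math

def resamplings(rule, n, d):
    """Number of noisy re-evaluations per individual at generation n, dimension d."""
    if rule == "linear":
        return n
    if rule == "quadratic":
        return n ** 2
    if rule == "cubic":
        return n ** 3
    if rule == "2exp":
        return 2 ** n
    if rule == "exp/10":
        return math.ceil(math.exp(n / 10))
    if rule == "math1":
        return math.ceil(math.exp(4 * n / (5 * d)) / d ** 2)
    if rule == "math2":
        return math.ceil(math.exp(n / 10) / d ** 2)
    if rule == "math3":
        return math.ceil(math.exp(n / math.sqrt(d)) / d ** 2)
    raise ValueError("unknown rule: " + rule)

# Example: in dimension d = 10, math2 stays at 1 resampling for many generations
# (the 1/d^2 factor delays the growth), then increases exponentially with n.
print([resamplings("math2", n, 10) for n in (1, 50, 100, 200)])  # [1, 2, 221, 4851652]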

Conclusion
• 2exp performs badly in large dimension;
• Polynomial rules perform badly in small dimension;
• math2 has the best of both worlds and performs nearly optimally among our non-adaptive rules in the considered setting, i.e. σ_Noise ∼ O(f).

Notations
f: fitness function
d: search space dimension
x: search point ∈ R^d
ω: random process
n: generation index
r_n: resampling number at generation n
#eval: total number of evaluations
σ_n: step-size at generation n
N: standard Gaussian random variable

Some References
[1] Sandra Astete-Morales, Jialin Liu, and Olivier Teytaud. Log-log convergence for noisy optimization. In Proceedings of EA 2013, LNCS. Springer, 2013.
[2] Anne Auger. Convergence results for the (1,λ)-SA-ES using the theory of φ-irreducible Markov chains. Theoretical Computer Science, 334(1):35–69, 2005.
[3] Olivier Teytaud, Jérémie Decock, et al. Noisy optimization complexity. In FOGA - Foundations of Genetic Algorithms XII, 2013.
[4] Hervé Fournier and Olivier Teytaud. Lower bounds for comparison-based evolution strategies using VC-dimension and sign patterns. Algorithmica, 59(3):387–408, 2011.

Abstract
• Resampling = classical tool for adapting evolution strategies (ES) to noisy optimization.
• Goal of this work: a conclusive answer for parameter-free resampling rules in simple settings.
• We work with Self-Adaptive (µ/µ,λ)-ES.
• Here, local noisy optimization (no local minima involved).

Convergence Rate
Noise-free case: log-linear convergence [2]

    log ||x_n|| / n ∼ A' < 0    (2)

Noisy case: [1] proves that using exponential rules with ad hoc parameters can lead to the following log-log convergence [3]:

    log ||x_{#eval}|| / log(#eval) ∼ A'' < 0    (3)

Simple Regret (SR) and slope(SR):

    SR = E f(x_{#eval}) − E f(x*)    (4)

    slope(SR) = lim sup_{#eval → ∞} log(SR) / log(#eval)    (5)
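In practice, slope(SR) in Eq. (5) can be estimated from a single run by a least-squares fit of log(SR) against log(#eval) over the tail of the run; a small sketch (our helper, not from the poster):

import numpy as np

def estimated_slope_sr(evals, regrets, tail_fraction=0.5):
    """Slope of log(SR) vs log(#eval), fitted on the last part of a run.

    evals: increasing evaluation counts at which x_#eval was recorded.
    regrets: corresponding simple regrets (assumed > 0 here).
    """
    x = np.log(np.asarray(evals, dtype=float))
    y = np.log(np.asarray(regrets, dtype=float))
    start = int(len(x) * (1.0 - tail_fraction))              # keep only the tail
    slope, _intercept = np.polyfit(x[start:], y[start:], 1)  # degree-1 fit
    return slope   # negative slope <=> log-log convergence, Eq. (3)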

Experiments on f(x, ω) = ||x − x*||^2 + N, x ∈ R^d, x* = 0
For f(x, ω) = ||x − x*||^2 + N (x* = 0), the Simple Regret after #eval evaluations is SR = E fitness(x_{#eval}). A non-adaptive rule with a polynomial resampling number can lead to slope(SR) ∼ A''' < 0.
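To make the setup concrete, here is a minimal sketch of a self-adaptive (µ/µ,λ)-ES with r_n resamplings per offspring on this noisy sphere (Python; it reuses the resamplings() helper above, and the self-adaptation rate tau and other details are standard choices of ours, not necessarily those used for the poster's experiments):

import math
import numpy as np

def noisy_sphere(x, rng):
    """f(x, omega) = ||x - x*||^2 + N, with x* = 0 and N standard Gaussian."""
    return float(np.dot(x, x)) + rng.standard_normal()

def sa_es_with_resampling(d, lam, mu, rule, budget, rng=None):
    """Self-adaptive (mu/mu, lambda)-ES averaging r_n noisy evaluations per offspring."""
    rng = rng or np.random.default_rng(0)
    x, sigma = np.ones(d), 1.0                   # parent and its step-size
    tau = 1.0 / math.sqrt(2.0 * d)               # standard self-adaptation rate
    n, used = 1, 0
    while used < budget:
        r = resamplings(rule, n, d)              # r_n resamplings at generation n
        offspring = []
        for _ in range(lam):
            s = sigma * math.exp(tau * rng.standard_normal())   # mutate step-size
            y = x + s * rng.standard_normal(d)                  # mutate search point
            fit = np.mean([noisy_sphere(y, rng) for _ in range(r)])
            offspring.append((fit, y, s))
            used += r
        best = sorted(offspring, key=lambda t: t[0])[:mu]        # mu best by averaged fitness
        x = np.mean([y for _, y, _ in best], axis=0)             # recombine search points
        sigma = float(np.exp(np.mean([np.log(s) for _, _, s in best])))  # recombine step-sizes
        n += 1
    return x                                     # recommendation x_#eval

# e.g. sa_es_with_resampling(d=10, lam=44, mu=11, rule="math2", budget=10**5)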

Figure 1: quadratic (left); 2exp (middle); math2 (right) for d = 10. x-axis: log(#eval); y-axis: log(SR). Legend: (λ=32, µ=1), (λ=32, µ=8), (λ=44, µ=1), (λ=44, µ=10), (λ=44, µ=11).

Figure 2: quadratic (left); 2exp (middle); math2 (right) for d = 100. x-axis: log(#eval); y-axis: log(SR). Legend: (λ=1000, µ=1), (λ=1000, µ=100), (λ=1000, µ=200), (λ=404, µ=1), (λ=404, µ=100), (λ=404, µ=101).

For small dimension:
(i) math3 converges fastest when λ is large;
(ii) r_n = n (linear) is the slowest;
(iii) r_n = n^3 (cubic) and math2 are fast for all values of λ and µ;
(iv) r_n = ⌈2^n⌉ (2exp), math1 and math3 converge faster than r_n = n^3 (cubic) when λ is large.

For moderate dimension:
(i) exponential numbers of resamplings have similar performance for all values of λ and µ, but math3 has a slightly faster slope(SR) than the others;
(ii) the choices of λ and µ do not change the results much for dimension 10.

For large dimension:
(i) r_n = ⌈exp(n/10)⌉ (exp/10) converges the most slowly when µ = 1 and converges slowly at the beginning when µ > 1;
(ii) r_n = n^3 (cubic) does not converge at the beginning;
(iii) math3 has a faster slope(SR) for all values of λ and µ; math2 is almost as efficient.