
Page 1: Probabilistic Modeling for Combinatorial Optimization

Probabilistic Modeling for Combinatorial Optimization

Scott Davies

School of Computer Science

Carnegie Mellon University

Joint work with Shumeet Baluja

Page 2: Probabilistic Modeling for Combinatorial Optimization

Combinatorial Optimization

Maximize “evaluation function” f(x)
- input: fixed-length bitstring x
- output: real value

x might represent:
- job shop schedules
- TSP tours
- discretized numeric values
- etc.

Our focus: “Black Box” optimization
- No domain-dependent heuristics

Page 3: Probabilistic Modeling for Combinatorial Optimization

Most Commonly Used Approaches

Hill-climbing, simulated annealing:
- Generate candidate solutions “neighboring” single current working solution (e.g. differing by one bit)
- Typically make no attempt to model how particular bits affect solution quality

Genetic algorithms:
- Attempt to implicitly capture dependency of solution quality on bit values by maintaining a population of candidate solutions
- Use crossover and mutation operators on population members to generate new candidate solutions

Page 4: Probabilistic Modeling for Combinatorial Optimization

Using Explicit Probabilistic Models

Maintain an explicit probability distribution P from which we generate new candidate solutions:
- Initialize P to uniform distribution
- Until termination criteria met:
  - Stochastically generate K candidate solutions from P
  - Evaluate them
  - Update P to make it more likely to generate solutions “similar” to the “good” solutions

Several different choices for what sorts of P to use and how to update it after candidate solution evaluation
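As a rough illustration, here is a minimal Python sketch of this loop. The `init_model`, `sample`, `update`, and `evaluate` callables are hypothetical placeholders for the concrete choices discussed on the following slides.

```python
def model_based_optimize(evaluate, init_model, sample, update,
                         K=100, iterations=50):
    """Generic model-based optimization loop: sample K candidates from
    the current distribution P, evaluate them, and update P toward the
    good ones. All four callables are supplied by the caller."""
    model = init_model()                        # uniform distribution to start
    best, best_score = None, float("-inf")
    for _ in range(iterations):
        candidates = [sample(model) for _ in range(K)]
        candidates.sort(key=evaluate, reverse=True)   # best first
        score = evaluate(candidates[0])
        if score > best_score:
            best, best_score = candidates[0], score
        model = update(model, candidates)       # favor the "good" solutions
    return best, best_score
```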

Page 5: Probabilistic Modeling for Combinatorial Optimization

Probability Distributions Over Bitstrings

Let x = (x1, x2, …, xn), where xi can take one of the values {0, 1} and n is the length of the bitstring.

Can factorize any distribution P(x1…xn) bit by bit:

P(x1,…,xn) = P(x1) P(x2 | x1) P(x3 | x1, x2)…P(xn|x1, …, xn-1)

In general, the above formula is just another way of representing a big lookup table with one entry for each of the 2^n possible bitstrings.

Obviously too many parameters to estimate from limited data!

Page 6: Probabilistic Modeling for Combinatorial Optimization

Representing Independencies withBayesian Networks

Graphical representation of probability distributions:
- Each variable is a vertex
- Each variable’s probability distribution is conditioned only on its parents in the directed acyclic graph (“dag”)

[Figure: example network with variables F = “Wean on Fire”, D = “Office Door Open”, I = “Ice Cream Truck Nearby”, A = “Fire Alarm Activated”, H = “Hear Bells”; arcs F → I, F → A, and D, I, A → H]

P(F,D,I,A,H) = P(F) · P(D) · P(I|F) · P(A|F) · P(H|D,I,A)

Page 7: Probabilistic Modeling for Combinatorial Optimization

“Bayesian Networks” for Bitstring Optimization?

P(x1,…,xn) = P(x1) P(x2 | x1) P(x3 | x1, x2)…P(xn|x1, …, xn-1)

[Figure: fully connected network over x1, x2, x3, …, xn, with arcs from every earlier bit to every later bit]

Yuck. Let’s just assume all the bits are independent instead.

(For now.)

[Figure: network over x1, x2, x3, …, xn with no edges]

Ahhh. Much better.

P(x1,…,xn) = P(x1) P(x2) P(x3)…P(xn)

Page 8: Probabilistic Modeling for Combinatorial Optimization

Population-Based Incremental Learning

Population-Based Incremental Learning (PBIL) [Baluja, 1995]
- Maintains a vector of probabilities: one independent probability P(xi=1) for each bit xi.
- Until termination criteria met:
  - Generate a population of K bitstrings from P
  - Evaluate them
  - Use the best M of the K to update P as follows:

$$P'(x_i{=}1) = P(x_i{=}1) + \alpha\,(1 - P(x_i{=}1)) \quad \text{if } x_i \text{ is set to 1}$$
$$P'(x_i{=}1) = (1 - \alpha)\,P(x_i{=}1) \quad \text{if } x_i \text{ is set to 0}$$

(α is the learning rate.)

Optionally, also update P similarly with the bitwise complement of the worst of the K bitstrings

Return best bitstring ever evaluated
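A compact sketch of PBIL in Python, assuming a caller-supplied `evaluate` function; the optional negative update from the worst bitstring is omitted.

```python
import random

def pbil(evaluate, n_bits, K=100, M=2, alpha=0.1, generations=1000):
    """PBIL sketch: one independent probability P(x_i=1) per bit.
    Each generation, sample K bitstrings, then nudge the probability
    vector toward the best M of them with learning rate alpha."""
    p = [0.5] * n_bits                         # uniform distribution to start
    best, best_score = None, float("-inf")
    for _ in range(generations):
        pop = [tuple(int(random.random() < pi) for pi in p) for _ in range(K)]
        pop.sort(key=evaluate, reverse=True)
        score = evaluate(pop[0])
        if score > best_score:
            best, best_score = pop[0], score
        for x in pop[:M]:
            for i, bit in enumerate(x):
                # bit=1: p + alpha*(1-p);  bit=0: (1-alpha)*p
                p[i] += alpha * (bit - p[i])
        # (optional negative update from the worst bitstring omitted)
    return best, best_score
```

For example, `pbil(lambda x: sum(x), n_bits=20)` drives every P(xi=1) toward 1 on the OneMax function.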

Page 9: Probabilistic Modeling for Combinatorial Optimization

PBIL vs. Discrete Learning Automata

Equivalent to a team of Discrete Learning Automata, one automaton per bit [Thathachar & Sastry, 1987]:
- Learning automata choose actions independently, but receive a common reinforcement signal dependent on all their actions
- PBIL update rule equivalent to the linear reward-inaction algorithm [Hilgard & Bower, 1975] with “success” defined as “best in the bunch”
- However, Discrete Learning Automata were typically used previously in problems with few variables but noisy evaluation functions

Page 10: Probabilistic Modeling for Combinatorial Optimization

PBIL vs. Genetic Algorithms

PBIL originated as a tool for understanding GA behavior

Similar to Bit-Based Simulated Crossover (BSC) [Syswerda, 1993]:
- Regenerates P from scratch after every generation
- All K used to update P, weighted according to probabilities that a GA would have selected them for reproduction

Why might normal GAs be better?
- Implicitly capture inter-bit dependencies with population
- However, because model is only implicit, crossover must be randomized
- Also, limited population size often leads to premature convergence based on noise in samples

Page 11: Probabilistic Modeling for Combinatorial Optimization

Four Peaks Problem

Problem used in [Baluja & Caruana, 1995] to test how well GAs maintain multiple solutions before converging.

Given input vector X with N bits, and difficulty parameter T:

FourPeaks(T,X) = MAX(head(1,X), tail(0,X)) + Bonus(T,X)

head(b,X) = # of contiguous leading bits in X set to b
tail(b,X) = # of contiguous trailing bits in X set to b
Bonus(T,X) = 100 if (head(1,X) > T) AND (tail(0,X) > T), or 0 otherwise

Should theoretically be easy for GA to handle with single-point crossover
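A direct Python transcription of these definitions (function names chosen here for illustration):

```python
def head(b, x):
    """Number of contiguous leading bits of x equal to b."""
    n = 0
    for bit in x:
        if bit != b:
            break
        n += 1
    return n

def tail(b, x):
    """Number of contiguous trailing bits of x equal to b."""
    return head(b, x[::-1])

def four_peaks(T, x):
    bonus = 100 if head(1, x) > T and tail(0, x) > T else 0
    return max(head(1, x), tail(0, x)) + bonus

# e.g. N=10, T=2: [1,1,1,0,0,0,0,0,0,0] scores max(3,7) + 100 = 107
```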

Page 12: Probabilistic Modeling for Combinatorial Optimization

Four Peaks Problem Results

[Figure: Four Peaks problem results]

Page 13: Probabilistic Modeling for Combinatorial Optimization

Large-Scale PBIL Empirical Comparison

[Table: large-scale PBIL empirical comparison]

Page 14: Probabilistic Modeling for Combinatorial Optimization

Modeling Inter-Bit Dependencies

How about automatically learning probability distributions in which at least some dependencies between variables are modeled?

Problem statement: given a dataset D and a set of allowable Bayesian networks {Bi}, find the Bi with the maximum posterior probability:

$$B^* = \arg\max_{B_i} P(B_i \mid D) = \arg\max_{B_i} P(B_i)\, P(D \mid B_i)$$

Page 15: Probabilistic Modeling for Combinatorial Optimization

Equations. (Mwuh hah hah hah!)

$$\log P(D \mid B) = \sum_{d^j \in D} \log P(d^j \mid B) = \sum_{d^j \in D} \sum_{i=1}^{n} \log P\!\left(x_i = d_i^j \,\middle|\, \Pi_i = d_{\Pi_i}^j,\, B\right)$$

$$= |D| \sum_{i=1}^{n} \;\sum_{\text{all possible } x_i = a,\, \Pi_i = b} P'(x_i{=}a,\, \Pi_i{=}b) \,\log P(x_i{=}a \mid \Pi_i{=}b,\, B)$$

Where:
- d^j is the jth datapoint
- d_i^j is the value assigned to x_i by d^j
- Π_i is the set of x_i’s parents in B
- d_{Π_i}^j is the set of values assigned to Π_i by d^j
- P’ is the empirical probability distribution exhibited by D

Page 16: Probabilistic Modeling for Combinatorial Optimization

Mmmm...Entropy.

Pick B to maximize:

$$\sum_{i=1}^{n} \;\sum_{\text{all possible } x_i = a,\, \Pi_i = b} P'(x_i{=}a,\, \Pi_i{=}b) \,\log P(x_i{=}a \mid \Pi_i{=}b,\, B)$$

Fact: given that B has a network structure S, the optimal probabilities to use in B are just the probabilities in D, i.e. P’. So:

Pick S to maximize:

$$\sum_{i=1}^{n} \;\sum_{\text{all possible } x_i = a,\, \Pi_i = b} P'(x_i{=}a,\, \Pi_i{=}b) \,\log P'(x_i{=}a \mid \Pi_i{=}b,\, S) \;=\; -\sum_{i=1}^{n} H'(x_i \mid \Pi_i, S)$$

(i.e., pick S to minimize the summed conditional entropies)

Page 17: Probabilistic Modeling for Combinatorial Optimization

Entropy Calculation Example

[Figure: network fragment with arcs x17 → x20 and x32 → x20]

x20’s contribution to score:

                 # times x20=0   # times x20=1   total # times
x17=0, x32=0            5               5              10
x17=0, x32=1            5              15              20
x17=1, x32=0           20              10              30
x17=1, x32=1           10              10              20

(80 total datapoints)

$$H'(x_{20} \mid x_{17}, x_{32}) = \tfrac{10}{80} H(\tfrac{5}{10}, \tfrac{5}{10}) + \tfrac{20}{80} H(\tfrac{5}{20}, \tfrac{15}{20}) + \tfrac{30}{80} H(\tfrac{20}{30}, \tfrac{10}{30}) + \tfrac{20}{80} H(\tfrac{10}{20}, \tfrac{10}{20})$$

where H(p,q) = −p log p − q log q.
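A small Python check of this arithmetic, assuming base-2 logs; the counts are keyed by (x17, x32) configuration.

```python
from math import log2

def cond_entropy(counts):
    """H'(child | parents): counts maps each parent configuration to
    [# times child=0, # times child=1]."""
    total = sum(c0 + c1 for c0, c1 in counts.values())
    h = 0.0
    for c0, c1 in counts.values():
        n = c0 + c1
        for c in (c0, c1):
            if c:
                h -= (n / total) * (c / n) * log2(c / n)
    return h

# The slide's table, keyed by (x17, x32):
counts = {(0, 0): [5, 5], (0, 1): [5, 15], (1, 0): [20, 10], (1, 1): [10, 10]}
print(cond_entropy(counts))        # H'(x20 | x17, x32) ≈ 0.922 bits
```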

Page 18: Probabilistic Modeling for Combinatorial Optimization

Single-Parent Tree-Shaped Networks

Now let’s allow each bit to be conditioned on at most one other bit.

[Figure: example tree-shaped network rooted at x17, with children x5, x22, x2 and further descendants x6, x19, x13, x14, x8]

Adding an arc from xj to xi increases the network score by:

H(xi) − H(xi|xj)   (xj’s “information gain” with xi)
= H(xj) − H(xj|xi)   (not necessarily obvious, but true)
= I(xi, xj)   (mutual information between xi and xj)

Page 19: Probabilistic Modeling for Combinatorial Optimization

Optimal Single-Parent Tree-Shaped Networks

To find optimal single-parent tree-shaped network, just find maximum spanning tree using I(xi, xj) as the weight for the edge between xi and xj [Chow and Liu, 1968]:
- Start with an arbitrary root node xr.
- Until all n nodes have been added to the tree:
  - Of all pairs of nodes xin and xout, where xin has already been added to the tree but xout has not, find the pair with the largest I(xin, xout).
  - Add xout to the tree with xin as its parent.

Can be done in O(n2) time (assuming D has already been reduced to sufficient statistics)
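A transparent (not asymptotically optimal) Python sketch: it estimates mutual information directly from the dataset D, a list of bit tuples, rather than from precomputed sufficient statistics, so it is slower than the O(n²) bound above.

```python
import math
from collections import Counter

def mutual_information(D, i, j):
    """I(x_i, x_j) estimated from dataset D (a list of bit tuples)."""
    n = len(D)
    pij = Counter((d[i], d[j]) for d in D)
    pi = Counter(d[i] for d in D)
    pj = Counter(d[j] for d in D)
    mi = 0.0
    for (a, b), c in pij.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi

def build_tree(D, n_bits):
    """Prim-style maximum spanning tree over mutual-information edge
    weights; returns parent[i] = index of x_i's parent (None for root)."""
    I = [[0.0] * n_bits for _ in range(n_bits)]
    for i in range(n_bits):
        for j in range(i + 1, n_bits):
            I[i][j] = I[j][i] = mutual_information(D, i, j)
    parent = {0: None}                     # arbitrary root: x_0
    while len(parent) < n_bits:
        x_in, x_out = max(((i, o) for i in parent for o in range(n_bits)
                           if o not in parent), key=lambda e: I[e[0]][e[1]])
        parent[x_out] = x_in
    return parent
```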

Page 20: Probabilistic Modeling for Combinatorial Optimization

Optimal Dependency Trees for Combinatorial Optimization

[Baluja & Davies, 1997]
- Start with a dataset D initialized from the uniform distribution.
- Until termination criteria met:
  - Build optimal dependency tree T with which to model D.
  - Generate K bitstrings from the probability distribution represented by T. Evaluate them.
  - Add best M bitstrings to D after decaying the weight of all datapoints already in D by a factor between 0 and 1.
- Return best bitstring ever evaluated.

Running time: O(K·n + n²) per iteration
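Generating a bitstring from T amounts to ancestral sampling: sample each bit after its parent. A minimal sketch; the layout of the conditional probability tables here is an illustrative assumption.

```python
import random

def sample_from_tree(parent, root_p, cpt):
    """Ancestral sampling from a tree-shaped network.
    parent[i] : index of x_i's parent, or None for the root
    root_p    : P(root = 1)
    cpt[i][v] : P(x_i = 1 | parent of x_i has value v)"""
    n = len(parent)
    x, seen, order = [None] * n, set(), []

    def visit(i):                          # visit parents before children
        if i in seen:
            return
        seen.add(i)
        if parent[i] is not None:
            visit(parent[i])
        order.append(i)

    for i in range(n):
        visit(i)
    for i in order:
        p1 = root_p if parent[i] is None else cpt[i][x[parent[i]]]
        x[i] = int(random.random() < p1)
    return tuple(x)

# e.g. a 3-node chain x0 -> x1 -> x2:
# sample_from_tree({0: None, 1: 0, 2: 1}, 0.5,
#                  {1: [0.9, 0.1], 2: [0.2, 0.8]})
```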

Page 21: Probabilistic Modeling for Combinatorial Optimization

Tree-based Optimization vs. MIMIC

Tree-based optimization algorithm inspired by Mutual Information Maximization for Input Clustering (MIMIC) [De Bonet, et al., 1997]
- Learned chain-shaped networks rather than tree-shaped networks

Dataset: best N% of all bitstrings ever evaluated
- +: dataset has simple, well-defined interpretation
- −: have to remember bitstrings
- −: seems to converge too quickly on some larger problems

[Figure: example chain-shaped network over x4, x8, x17, x5]

Page 22: Probabilistic Modeling for Combinatorial Optimization

MIMIC dataset vs. Exp. Decay dataset

[Figure: MIMIC-style dataset vs. exponential-decay dataset comparison]

Page 23: Probabilistic Modeling for Combinatorial Optimization

Graph-Coloring Example

[Figure: graph-coloring example results]

Page 24: Probabilistic Modeling for Combinatorial Optimization

Peaks problems results

[Figure: peaks problems results]

Page 25: Probabilistic Modeling for Combinatorial Optimization

Tree-Max problem results

[Figure: Tree-Max problem results]

Page 26: Probabilistic Modeling for Combinatorial Optimization

Checkerboard problem results

[Figure: checkerboard problem results]

Page 27: Probabilistic Modeling for Combinatorial Optimization

Linear Equations Results

[Figure: linear equations results]

Page 28: Probabilistic Modeling for Combinatorial Optimization

Modeling Higher-Order Dependencies

The maximum spanning tree algorithm gives us the optimal Bayesian Network in which each node has at most one parent.

What about finding the best network in which each node has at most K parents for K>1?
- NP-complete problem! [Chickering, et al., 1995]
- However, can use search heuristics to look for “good” network structures, e.g. hillclimbing [Heckerman, et al., 1995].

Page 29: Probabilistic Modeling for Combinatorial Optimization

Scoring Function for Arbitrary Networks

Rather than restricting K directly, add penalty term to scoring function to limit total network size:
- Equivalent to priors favoring simpler network structures
- Alternatively, lends itself nicely to MDL interpretation
- Size of penalty controls exploration/exploitation tradeoff

$$\log P(B \mid D) \;\propto\; -\sum_{i=1}^{n} H'(x_i \mid \Pi_i, B) \;-\; \gamma\,|B|$$

where γ is the penalty factor and |B| is the number of parameters in B.
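A sketch of this penalized score in Python, counting one free parameter per parent configuration of each binary variable (an assumption consistent with, but not stated on, the slide):

```python
from collections import Counter
from math import log2

def penalized_score(D, parents, gamma):
    """Penalized log-likelihood (up to constants):
    -sum_i H'(x_i | Pi_i)  -  gamma * |B|.
    parents[i] is the collection of x_i's parent indices (empty for roots);
    D is a list of bit tuples."""
    n_pts, score, n_params = len(D), 0.0, 0
    for i, pa in enumerate(parents):
        n_params += 2 ** len(pa)          # one P(x_i=1 | config) per config
        joint = Counter((tuple(d[p] for p in pa), d[i]) for d in D)
        margin = Counter(tuple(d[p] for p in pa) for d in D)
        for (cfg, _value), c in joint.items():
            # sum of P'(cfg, value) * log P'(value | cfg) = -H'(x_i | Pi_i)
            score += (c / n_pts) * log2(c / margin[cfg])
    return score - gamma * n_params
```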

Page 30: Probabilistic Modeling for Combinatorial Optimization

Bayesian Network-Based Combinatorial Optimization

Initialize D with C bitstrings from uniform distribution, and Bayesian network B to empty network containing no edges

Until termination criteria met:
- Perform steepest-ascent hillclimbing from B to find locally optimal network B’. Repeat until no changes increase score:
  - Evaluate how each possible edge addition, removal, or reversal would affect the penalized log-likelihood score.
  - Perform the change that maximizes the increase in score.
- Set B to B’.
- Generate and evaluate K bitstrings from B.
- Decay weight of datapoints in D by a factor between 0 and 1.
- Add best M of the K recently generated datapoints to D.

Return best bitstring ever evaluated.
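A simplified sketch of the structure search, assuming a caller-supplied `score` callable (e.g. the `penalized_score` above with D and γ bound in); arc reversal is omitted here to keep the sketch short.

```python
def hillclimb_structure(n_bits, score):
    """Steepest-ascent over network structures: repeatedly apply the single
    edge addition or removal that most improves score(parents), until no
    change helps. parents[i] is the set of x_i's parents (kept a DAG)."""
    parents = [set() for _ in range(n_bits)]   # start with the empty network

    def creates_cycle(child, new_parent):
        # adding new_parent -> child cycles iff child is an ancestor of it
        stack = [new_parent]
        while stack:
            v = stack.pop()
            if v == child:
                return True
            stack.extend(parents[v])
        return False

    current, improved = score(parents), True
    while improved:
        improved, best = False, (0.0, None, None, False)
        for i in range(n_bits):
            for j in range(n_bits):
                if i == j:
                    continue
                adding = j not in parents[i]
                if adding and creates_cycle(i, j):
                    continue
                (parents[i].add if adding else parents[i].remove)(j)
                delta = score(parents) - current              # try change...
                (parents[i].remove if adding else parents[i].add)(j)  # undo
                if delta > best[0]:
                    best = (delta, i, j, adding)
        delta, bi, bj, adding = best
        if bi is not None:                                    # apply best
            (parents[bi].add if adding else parents[bi].remove)(bj)
            current += delta
            improved = True
    return parents
```

Usage might look like `hillclimb_structure(n, lambda ps: penalized_score(D, ps, 0.5))`; this naive version rescoring the whole network per candidate change is exactly what the caching tricks on the next slide avoid.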

Page 31: Probabilistic Modeling for Combinatorial Optimization

Cutting Computational Costs

Can cache contingency tables for all possible one-arc changes to network structure:
- Only have to recompute scores associated with at most two nodes after arc added, removed, or reversed.
- Prevents having to slog through the entire dataset recomputing score changes for every possible arc change when dataset changes.

However, kiss memory goodbye

Only a few network structure changes required after each iteration since dataset hasn’t changed much (one or two structural changes on average)

Page 32: Probabilistic Modeling for Combinatorial Optimization

Evolution of Network Complexity

[Figure: evolution of network complexity on a parity-based problem]

Page 33: Probabilistic Modeling for Combinatorial Optimization

Summation Cancellation

Minimize magnitudes of cumulative sum of discretized numeric parameters (s1, …, sn) represented with standard binary encoding:

$$s_i \in [-0.16,\, 0.15] \qquad y_1 = s_1 \qquad y_i = y_{i-1} + s_i \;\text{ for } i = 2 \ldots N$$

$$f = \frac{100000}{0.1 + \sum_{i=1}^{N} |y_i|}$$

Average value over 50 runs of best solution found in 2000 generations:

Parameters   Bits   Bayesian Network   Tree   PBIL   GA
    10        50         2797          53.7   21     23.6
    12        60         32.2          29.3   16.1   18.8
    15        75         20.3          16.8   11.2   14.6
    20       100         12.4          11     7.5    8.9
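A worked evaluation function in Python. The per-parameter bit width and linear scaling are assumptions consistent with "standard binary encoding" and the table's parameters/bits ratio; the objective's constants are as reconstructed above.

```python
def summation_cancellation(x, N):
    """Evaluate a bitstring for the summation-cancellation problem.
    x encodes N parameters, each in len(x)//N bits (standard binary),
    scaled linearly into [-0.16, 0.15]."""
    bits = len(x) // N
    total, y = 0.0, 0.0
    for k in range(N):
        chunk = x[k * bits:(k + 1) * bits]
        v = int("".join(map(str, chunk)), 2)          # standard binary
        s = -0.16 + 0.31 * v / (2 ** bits - 1)        # scale to [-0.16, 0.15]
        y += s                                        # y_i = y_{i-1} + s_i
        total += abs(y)
    return 100000.0 / (0.1 + total)   # large when the running sums cancel
```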

Page 34: Probabilistic Modeling for Combinatorial Optimization

Bayesian Networks: Empirical Results Summary

Does better than Tree-based optimization algorithm on some toy problems:
- Significantly better on “Summation Cancellation” problem
- 10% reduction in error on System of Linear Equations problems

Roughly the same as Tree-based algorithm on others, e.g. small Knapsack problems

Significantly more computation despite efficiency hacks, however.

Why not much better performance?
- Too much emphasis on exploitation rather than exploration?
- Steepest-ascent hillclimbing over network structures not good enough?

Page 35: Probabilistic Modeling for Combinatorial Optimization

Using Probabilistic Models for Intelligent Restarts

Tree-based algorithm’s O(n²) execution time per generation very expensive for large problems

Even more so for more complicated Bayesian networks

One possible approach: use probabilistic models to select good starting points for faster optimization algorithms, e.g. hillclimbing

Page 36: Probabilistic Modeling for Combinatorial Optimization

COMIT

Combining Optimizers with Mutual Information Trees [Baluja & Davies, 1997b]:
- Initialize dataset D with bitstrings drawn from uniform distribution
- Until termination criteria met:
  - Build optimal dependency tree T with which to model D.
  - Generate K bitstrings from the distribution represented by T. Evaluate them.
  - Execute a hillclimbing run starting from the single best bitstring of these K.
  - Replace up to M bitstrings in D with the best bitstrings found during the hillclimbing run.
- Return best bitstring ever evaluated
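An end-to-end sketch of this loop. `build_tree` and `sample` stand for the dependency-tree routines sketched earlier (passed in as callables), and the bit-flip hillclimber below accepts only strict improvements, a simplification of the PATIENCE rule for equal-value moves described on the next slide.

```python
import random

def comit(evaluate, n_bits, build_tree, sample, C=200, K=100, M=50,
          iterations=20, patience=500):
    """COMIT-style driver: use the tree model to pick good starting
    points for a fast hillclimber, then fold the hillclimber's best
    discoveries back into the model's dataset."""
    D = [tuple(random.randint(0, 1) for _ in range(n_bits)) for _ in range(C)]
    best, best_score = None, float("-inf")

    def hillclimb(x):
        trace, score, stall = [x], evaluate(x), 0
        while stall < patience:
            i = random.randrange(n_bits)
            y = x[:i] + (1 - x[i],) + x[i + 1:]       # flip one random bit
            s = evaluate(y)
            if s > score:
                x, score, stall = y, s, 0
                trace.append(x)
            else:
                stall += 1                            # non-improving flip
        return trace

    for _ in range(iterations):
        model = build_tree(D)
        samples = [sample(model) for _ in range(K)]
        start = max(samples, key=evaluate)            # best of the K samples
        visited = sorted(hillclimb(start), key=evaluate, reverse=True)
        replaced = visited[:M]
        D = replaced + D[:len(D) - len(replaced)]     # replace up to M in D
        if evaluate(visited[0]) > best_score:
            best, best_score = visited[0], evaluate(visited[0])
    return best, best_score
```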

Page 37: Probabilistic Modeling for Combinatorial Optimization

COMIT, cont’d

Empirical tests performed with stochastic hillclimbing algorithm that allows at most PATIENCE moves to points of equal value before restarting

Compare COMIT vs.:
- Hillclimbing with restarts from bitstrings chosen randomly from uniform distribution
- Hillclimbing with restarts from best bitstring out of K chosen randomly from uniform distribution
- Genetic algorithms?

Page 38: Probabilistic Modeling for Combinatorial Optimization

COMIT: Example of Behavior

[Figure: all evaluations over time in the TSP domain]

Page 39: Probabilistic Modeling for Combinatorial Optimization

COMIT: Empirical Comparisons

[Table: COMIT empirical comparisons]

Page 40: Probabilistic Modeling for Combinatorial Optimization

Summary

PBIL uses a very simple probability distribution from which new solutions are generated, yet works surprisingly well

Algorithm using tree-based distributions seems to work even better, though at significantly more computational expense

More sophisticated networks: past the point of diminishing marginal returns? “Future research”

COMIT makes tree-based algorithm applicable to much larger problems

Page 41: Probabilistic Modeling for Combinatorial Optimization

Future Work

Making algorithm based on complex Bayesian Networks more practical
- Combine w/ simpler search algorithms, à la COMIT?

Applying COMIT to more interesting problems
- WALKSAT?

Optimization in real-valued state spaces. What sorts of PDF representations might be useful?
- Gaussians?
- Kernel-based representations?
- Hierarchical representations?

Page 42: Probabilistic Modeling for Combinatorial Optimization

Acknowledgements

Shumeet Baluja
Doug Baker, Justin Boyan, Lonnie Chrisman, Greg Cooper, Geoff Gordon, Andrew Moore, …?