Page 1: Fast Probabilistic Modeling for Combinatorial Optimization

Fast Probabilistic Modeling for Combinatorial Optimization

Scott Davies

School of Computer Science

Carnegie Mellon University (and Justsystem Pittsburgh Research Center)

Joint work with Shumeet Baluja

Page 2: Fast Probabilistic Modeling for Combinatorial Optimization

Combinatorial Optimization

Maximize “evaluation function” f(x)
- input: fixed-length bitstring x = (x1, x2, …, xn)
- output: real value

x might represent: job shop schedules, TSP tours, discretized numeric values, etc.

Our focus: “Black Box” optimization, with no domain-dependent heuristics

Page 3: Fast Probabilistic Modeling for Combinatorial Optimization

Commonly Used Approaches

Hill-climbing, simulated annealing:
- Generate candidate solutions “neighboring” the single current working solution (e.g., differing by one bit)
- Typically make no attempt to model how particular bits affect solution quality

Genetic algorithms:
- Attempt to implicitly capture the dependency of solution quality on bit values by maintaining a population of candidate solutions
- Use crossover and mutation operators on population members to generate new candidate solutions

Page 4: Fast Probabilistic Modeling for Combinatorial Optimization

Using Explicit Probabilistic Models

Maintain an explicit probability distribution P from which we generate new candidate solutions:
- Initialize P to the uniform distribution
- Until termination criteria met:
  - Stochastically generate K candidate solutions from P
  - Evaluate them
  - Update P to make it more likely to generate solutions “similar” to the “good” solutions
- Return the best bitstring ever evaluated

Several different choices exist for what sort of P to use and how to update it after candidate solutions are evaluated.
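A minimal Python sketch of this generic loop, assuming only an evaluation function f to maximize; sample and update stand in for whatever model family P comes from, and none of these names are the authors' code:

```python
import numpy as np

def model_based_search(f, sample, update, iters=100, k=100):
    """Generic probabilistic-model optimization loop (illustrative sketch).

    sample(k) draws k candidate bitstrings from the current model P;
    update(xs, fs) shifts P toward the better-scoring candidates.
    """
    best_x, best_f = None, -np.inf
    for _ in range(iters):
        xs = sample(k)                        # K candidates from P
        fs = np.array([f(x) for x in xs])     # evaluate them
        update(xs, fs)                        # make "good" solutions likelier
        if fs.max() > best_f:                 # track best-ever bitstring
            best_x, best_f = xs[int(fs.argmax())], fs.max()
    return best_x, best_f
```

The algorithms on the following slides are instances of this loop that differ in how expressive P is and how it is re-fit to the good solutions.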

Page 5: Fast Probabilistic Modeling for Combinatorial Optimization

Population-Based Incremental Learning

Population-Based Incremental Learning (PBIL) [Baluja, 1995]:
- Maintains a vector of probabilities: one independent probability P(xi=1) for each bit xi, each initialized to 0.5
- Until termination criteria met:
  - Generate a population of K bitstrings from P; evaluate them
  - For each of the best M bitstrings of these K, adjust P to make that bitstring more likely:

$P'(x_i{=}1) = P(x_i{=}1) + \alpha\,(1 - P(x_i{=}1))$ if $x_i$ is set to 1, or
$P'(x_i{=}1) = (1 - \alpha)\,P(x_i{=}1)$ if $x_i$ is set to 0

Also sometimes uses “mutation” and/or pushes P away from bad solutions.

Often works very well (compared to hillclimbing and GAs) despite its independence assumption.
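A compact Python sketch of PBIL as just described; the single assignment P' = (1 - alpha)*P + alpha*x reproduces both cases of the update rule. Parameter names and defaults are illustrative, not the authors' code:

```python
import numpy as np

def pbil(f, n_bits, k=50, m=1, alpha=0.15, iters=1000, rng=None):
    """PBIL sketch: a vector of independent per-bit probabilities."""
    rng = rng or np.random.default_rng()
    p = np.full(n_bits, 0.5)                  # P(x_i = 1), initialized to 0.5
    best_x, best_f = None, -np.inf
    for _ in range(iters):
        pop = (rng.random((k, n_bits)) < p).astype(int)
        scores = np.array([f(x) for x in pop])
        for i in np.argsort(scores)[-m:]:     # best M of the K bitstrings
            # P' = P + alpha*(1 - P) where the bit is 1,
            # P' = (1 - alpha)*P    where the bit is 0:
            p = (1 - alpha) * p + alpha * pop[i]
        if scores.max() > best_f:
            best_x, best_f = pop[scores.argmax()].copy(), scores.max()
    return best_x, best_f
```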

Page 6: Fast Probabilistic Modeling for Combinatorial Optimization

Modeling Inter-Bit Dependencies

MIMIC [De Bonet et al., 1997]: sort the bits into a Markov chain in which each bit's distribution is conditioned on the value of the previous bit in the chain, e.g.:
P(x1, x2, …, xn) = P(x3)·P(x6|x3)·P(x2|x6)·…
Generate the chain from the best N% of bitstrings evaluated so far; generate new bitstrings from the chain; repeat.

More general framework: Bayesian networks

Page 7: Fast Probabilistic Modeling for Combinatorial Optimization

Bayesian Networks

P(x1,…,xn) = P(x3)*P(x12)*P(x2|x3,x12)*P(x6|x3)*...

[Figure: example network structure (a DAG) over nodes x3, x12, x6, x2, x9, x17]

Specified by:
- Network structure (must be a DAG)
- Probability distribution for each variable (bit) given each possible combination of its parents’ values
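Generating a bitstring from such a network is one ancestral-sampling pass in topological order. A minimal sketch; the data structures (parent lists plus a conditional probability table per variable) are illustrative:

```python
import numpy as np

def sample_network(order, parents, cpt, rng):
    """Draw one bitstring from a Bayesian network over binary variables.

    order:   variable indices in a topological order of the DAG
    parents: parents[i] = tuple of i's parent indices
    cpt:     cpt[i][pv] = P(x_i = 1 | parents take values pv),
             keyed by the tuple pv of parent bit values
    """
    x = np.zeros(len(order), dtype=int)
    for i in order:                     # every parent is sampled before i
        pv = tuple(x[j] for j in parents[i])
        x[i] = int(rng.random() < cpt[i][pv])
    return x
```

For the factorization above, order would begin with x3 and x12 (empty parent tuples), so x2's parents are already set when it is sampled.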

Page 8: Fast Probabilistic Modeling for Combinatorial Optimization

Optimization with Bayesian Networks

Two operations we need to perform during optimization:
- Given a network, generate bitstrings from the probability distribution represented by that network (trivial with Bayesian networks)
- Given a set of “good” bitstrings, find the network most likely to have generated it
  - Given a fixed network structure, trivial to fill in the probability tables optimally
  - But what structure should we use?

Page 9: Fast Probabilistic Modeling for Combinatorial Optimization

Learning Network Structures from Data

Problem statement: given a dataset D and a set of allowable Bayesian networks, find the network B with the maximum posterior probability:

$\arg\max_B P(B \mid D) = \arg\max_B P(B)\,P(D \mid B) = \arg\max_B \bigl[\log P(B) + \log P(D \mid B)\bigr]$

Page 10: Fast Probabilistic Modeling for Combinatorial Optimization

Learning Network Structures from Data

Maximizing log P(D|B) reduces to maximizing

$-\sum_{i=1}^{n} H'(x_i \mid \Pi_i)$

where $\Pi_i$ is the set of $x_i$'s parents in B, and $H'(\cdot \mid \cdot)$ is the average entropy of an empirical distribution exhibited in D.

Page 11: Fast Probabilistic Modeling for Combinatorial Optimization

Single-Parent Tree-Shaped Networks

Let’s allow each bit to be conditioned on at most one other bit.

[Figure: example tree-shaped network over bits x17, x5, x22, x2, x6, x19, x13, x14, x8]

Adding an arc from xj to xi increases the network score by

H(xi) - H(xi|xj)   (xj's “information gain” with xi)
= H(xj) - H(xj|xi)   (not necessarily obvious, but true)
= I(xi, xj)   (the mutual information between xi and xj)
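The middle equality is the symmetry of mutual information; it follows from the chain rule H(X|Y) = H(X,Y) - H(Y):

```latex
\begin{align*}
H(x_i) - H(x_i \mid x_j)
  &= H(x_i) - \bigl( H(x_i, x_j) - H(x_j) \bigr) \\
  &= H(x_i) + H(x_j) - H(x_i, x_j)              \\
  &= H(x_j) - H(x_j \mid x_i) \,=\, I(x_i, x_j)
\end{align*}
```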

Page 12: Fast Probabilistic Modeling for Combinatorial Optimization

Optimal Single-Parent Tree-Shaped Networks

To find the optimal single-parent tree-shaped network, just find the maximum spanning tree using I(xi, xj) as the weight for the edge between xi and xj [Chow and Liu, 1968].

Can be done in O(n²) time (assuming D has already been reduced to sufficient statistics).
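A Python sketch of the construction, assuming the dataset is a 0/1 matrix with optional per-row weights (so the decayed datasets used on the next slide fit the same interface); the pairwise mutual-information and Prim's maximum-spanning-tree steps are spelled out, and nothing here is the authors' code:

```python
import numpy as np

def chow_liu_tree(data, weights=None):
    """Build the optimal single-parent dependency tree (Chow & Liu, 1968).

    data: (m, n) array of 0/1 rows; weights: optional per-row weights.
    Returns arcs as (parent, child) pairs, rooted arbitrarily at bit 0.
    """
    m, n = data.shape
    w = np.ones(m) if weights is None else np.asarray(weights, float)
    w = w / w.sum()

    def mutual_info(a, b):
        # Empirical mutual information I(x_a, x_b) under the row weights.
        total = 0.0
        for va in (0, 1):
            for vb in (0, 1):
                p_ab = w[(data[:, a] == va) & (data[:, b] == vb)].sum()
                p_a = w[data[:, a] == va].sum()
                p_b = w[data[:, b] == vb].sum()
                if p_ab > 0:
                    total += p_ab * np.log(p_ab / (p_a * p_b))
        return total

    # Prim's algorithm on the complete graph with I(x_i, x_j) edge weights.
    arcs = []
    best = {j: (mutual_info(0, j), 0) for j in range(1, n)}  # best arc into j
    while best:
        j = max(best, key=lambda v: best[v][0])
        _, parent = best.pop(j)
        arcs.append((parent, j))
        for v in best:
            gain = mutual_info(j, v)
            if gain > best[v][0]:
                best[v] = (gain, j)
    return arcs
```

The O(n²) cost quoted above is visible here: every pair of bits gets a mutual-information evaluation, and Prim's algorithm on the dense graph is also quadratic.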

Page 13: Fast Probabilistic Modeling for Combinatorial Optimization

Optimal Dependency Trees for Combinatorial Optimization

[Baluja & Davies, 1997]
- Start with a dataset D initialized from the uniform distribution
- Until termination criteria met:
  - Build the optimal dependency tree T with which to model D
  - Generate K bitstrings from the probability distribution represented by T; evaluate them
  - Add the best M bitstrings to D after decaying the weight of all datapoints already in D by a factor between 0 and 1
- Return the best bitstring ever evaluated

Running time: O(K·n + M·n²) per iteration

Page 14: Fast Probabilistic Modeling for Combinatorial Optimization

Graph-Coloring Example

Noisy graph-coloring example: for each edge connecting vertices of different colors, add +1 to the evaluation function with probability 0.5.

[Figure: pogp.fig, the graph used in the noisy graph-coloring example]

Page 15: Fast Probabilistic Modeling for Combinatorial Optimization

Optimal Dependency Tree Results

Tested the following optimization algorithms on a variety of small problems, up to 256 bits long:
- Optimization with optimal dependency trees
- Optimization with chain-shaped networks (à la MIMIC)
- PBIL
- Hillclimbing
- Genetic algorithms

General trend: hillclimbing and GAs did relatively poorly. Among the probabilistic methods, PBIL < chains < trees: more accurate models are better.

Page 16: Fast Probabilistic Modeling for Combinatorial Optimization

Modeling Higher-Order Dependencies

The maximum spanning tree algorithm gives us the optimal Bayesian Network in which each node has at most one parent.

What about finding the best network in which each node has at most K parents, for K > 1? NP-complete problem! [Chickering et al., 1995] However, search heuristics such as hillclimbing can be used to look for “good” network structures (e.g., [Heckerman et al., 1995]).

Page 17: Fast Probabilistic Modeling for Combinatorial Optimization

Bayesian Network-Based Combinatorial Optimization

Initialize D with C bitstrings from the uniform distribution, and Bayesian network B to the empty network containing no edges.

Until termination criteria met:
- Perform steepest-ascent hillclimbing from B to find a locally optimal network B'. Repeat until no changes increase the score:
  - Evaluate how each possible edge addition, removal, or reversal would affect the penalized log-likelihood score
  - Perform the change that maximizes the increase in score
- Set B to B'
- Generate and evaluate K bitstrings from B
- Decay the weight of datapoints in D by a factor between 0 and 1
- Add the best M of the K recently generated datapoints to D

Return the best bitstring ever evaluated.
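A sketch of the inner structure-search step; score and neighbors are assumed helpers (the penalized log-likelihood and the acyclicity-preserving single-edge changes), not part of the original slides:

```python
def hillclimb_structure(score, start_arcs, neighbors):
    """Steepest-ascent hillclimbing over network structures (illustrative).

    score(arcs) returns the penalized log-likelihood of a structure;
    neighbors(arcs) yields every structure reachable by a single edge
    addition, removal, or reversal that keeps the graph acyclic.
    """
    arcs, current = start_arcs, score(start_arcs)
    while True:
        scored = [(score(a), a) for a in neighbors(arcs)]
        if not scored:
            return arcs
        best_score, best_arcs = max(scored, key=lambda t: t[0])
        if best_score <= current:
            return arcs            # no single change increases the score
        arcs, current = best_arcs, best_score
```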

Page 18: Fast Probabilistic Modeling for Combinatorial Optimization

Bayesian Networks Results

Does better than the tree-based optimization algorithm on some toy problems; roughly the same as the tree-based algorithm on others.

Requires significantly more computation than the tree-based algorithm, however, despite efficiency hacks.

Why not much better results?
- Too much emphasis on exploitation rather than exploration?
- Steepest-ascent hillclimbing over network structures not good enough, particularly when starting from old networks?

Page 19: Fast Probabilistic Modeling for Combinatorial Optimization

Using Probabilistic Models for Intelligent Restarts

Tree-based algorithm’s O(n²) execution time per iteration is very expensive for large problems

Even more so for the algorithm based on more complicated Bayesian networks

One possible approach: use probabilistic models to select good starting points for faster optimization algorithms, e.g. hillclimbing or PBIL

Page 20: Fast Probabilistic Modeling for Combinatorial Optimization

COMIT

Combining Optimizers with Mutual Information Trees (COMIT) [Baluja and Davies, 1998]

Initialize dataset D with bitstrings drawn from the uniform distribution.

Until termination criteria met:
- Build the optimal dependency tree T with which to model D
- Use T to stochastically generate K bitstrings; evaluate them
- Execute a fast-search procedure, initialized with the best solutions generated from T
- Replace up to M of the worst bitstrings in D with the best bitstrings found during the fast-search run just executed

Return the best bitstring ever evaluated.
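A COMIT skeleton under the same illustrative conventions; learn_tree, sample_tree, and fast_search are parameters standing in for the tree learner (e.g., the Chow-Liu sketch above), the tree sampler, and the fast searcher such as hillclimbing:

```python
import numpy as np

def comit(f, n_bits, learn_tree, sample_tree, fast_search,
          k=1000, m=100, d_size=1000, iters=50, rng=None):
    """COMIT skeleton (illustrative, not the authors' code).

    fast_search(x0) runs the fast searcher from x0 and returns the
    bitstrings it visited along with their evaluations.
    """
    rng = rng or np.random.default_rng()
    D = (rng.random((d_size, n_bits)) < 0.5).astype(int)   # uniform init
    d_scores = np.array([f(x) for x in D])
    best_x, best_f = None, -np.inf
    for _ in range(iters):
        tree = learn_tree(D)                                # model the dataset
        xs = [sample_tree(tree, rng) for _ in range(k)]
        fs = np.array([f(x) for x in xs])
        visited, scores = fast_search(xs[int(fs.argmax())]) # best sample
        visited, scores = np.asarray(visited), np.asarray(scores)
        top = np.argsort(scores)[-m:]                       # best M visited
        worst = np.argsort(d_scores)[:len(top)]             # worst M in D
        D[worst], d_scores[worst] = visited[top], scores[top]
        if scores.max() > best_f:
            best_x, best_f = visited[scores.argmax()], scores.max()
    return best_x, best_f
```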

Page 21: Fast Probabilistic Modeling for Combinatorial Optimization

COMIT with Hill-climbing

Use the single best bitstring generated by the tree as the starting point of a stochastic hillclimbing algorithm that allows at most PATIENCE moves to points of equal value before giving up.

|D| kept at 1000; M set to 100.

We compare COMIT vs.:
- Hillclimbing with restarts from bitstrings chosen randomly from the uniform distribution
- Hillclimbing with restarts from the best bitstring out of K chosen randomly from the uniform distribution

Page 22: Fast Probabilistic Modeling for Combinatorial Optimization

COMIT w/Hillclimbing: Example of Behavior

TSP domain: 100 cities, 700 bits

[Figure: tour length (×10³) vs. evaluation number (×10³) for the hillclimber and COMIT]

Page 23: Fast Probabilistic Modeling for Combinatorial Optimization

COMIT w/Hillclimbing: Results

- Each number is an average over at least 25 runs
- Each algorithm was given 200,000 evaluation-function calls
- Highlighted: better than each non-COMIT hillclimber with confidence > 95%
- AHCxxx: pick the best of xxx randomly generated starting points before hillclimbing
- COMITxxx: pick the best of xxx starting points generated by the tree before hillclimbing

Problem            Min/Max  # bits  HC     AHC100  AHC1000  COMIT100  COMIT1000
Knapsack 1         Max      512     3238   3377    3335     6684      6259
Knapsack 2         Max      900     3403   3418    3488     7733      7182
Knapsack 3         Max      1200    5226   5270    5280     13052     12829
Jobshop 1          Min      500     998    988     982      978       970
Jobshop 2          Min      700     965    961     957      954       953
Jobshop 3          Min      700     1207   1200    1199     1196      1195
Bin-packing 1      Min      504     170    158     162      156       145
Bin-packing 2      Min      800     154    150     138      111       124
TSP 1 (100 city)   Min      700     1629   1599    1573     1335      1336
TSP 2 (200 city)   Min      1600    15119  15286   15100    15189     15117
TSP 3 (150 city)   Min      1200    11451  11247   11290    9812      9077
Sum. Canc.         Min      675     64     61      59       54        52

Page 24: Fast Probabilistic Modeling for Combinatorial Optimization

COMIT with PBIL

Generate K samples from T; evaluate them. Initialize PBIL’s P vector according to the unconditional distributions contained in the best C% of these examples.

PBIL run terminated after 5000 evaluations without improvement.

|D| kept at 1000; M set to 100 (as before). PBIL parameters: α = 0.15; update P with the single best example out of 50 each iteration; some “mutation” of P used as well.

Page 25: Fast Probabilistic Modeling for Combinatorial Optimization

COMIT w/PBIL: Results

Problem            Min/Max  # bits  PBIL   PBIL+R  COMIT+PBIL
Knapsack 1         Max      512     7823   7765    7872
Knapsack 2         Max      900     9601   9402    10134
Knapsack 3         Max      1200    14668  13768   18014
Jobshop 1          Min      500     982    963     961
Jobshop 2          Min      700     959    943     940
Jobshop 3          Min      700     1188   1176    1170
Bin-packing 1      Min      504     140    280     280
Bin-packing 2      Min      800     830    960     1000
TSP 1 (100 city)   Min      700     1331   1518    1021
TSP 2 (200 city)   Min      1600    17148  21858   14870
TSP 3 (150 city)   Min      1200    11035  13698   9078
Sum. Canc.         Min      675     25     33      33

- Each number is an average over at least 50 runs
- Each algorithm was given 600,000 evaluation-function calls
- Highlighted: better than the opposing algorithm(s) with confidence > 95%
- PBIL+R: PBIL with restart after 5000 evaluations without improvement (P reinitialized to 0.5)

Page 26: Fast Probabilistic Modeling for Combinatorial Optimization

Conclusions & Future Work

COMIT makes probabilistic modeling applicable to much larger problems

COMIT led to significant improvements over the baseline search algorithm on most problems tested.

Future work:
- Applying COMIT to more interesting problems
- Using COMIT to combine the results of multiple search algorithms
- Incorporating domain knowledge

Page 27: Fast Probabilistic Modeling for Combinatorial Optimization

Future Work

Making the algorithm based on complex Bayesian networks more practical
- Combine with simpler search algorithms, à la COMIT?

Applying COMIT to more interesting problems and other baseline search algorithms
- WALKSAT?

Using more sophisticated probabilistic models

Using COMIT to combine the results of multiple search algorithms