Sampling and Soundness: Can We Have Both? Carla Gomes, Bart Selman, Ashish Sabharwal Cornell University Jörg Hoffmann DERI Innsbruck …and I am: Frank van Harmelen

Page 1:

Sampling and Soundness: Can We Have Both?

Carla Gomes, Bart Selman, Ashish Sabharwal

Cornell University

Jörg Hoffmann

DERI Innsbruck

…and I am: Frank van Harmelen

Page 2:

Nov 11, 2007 ISWC’07 2

Talk Roadmap

- A Sampling Method with a Correctness Guarantee
- Can we apply this to the Semantic Web?
- Discussion

Page 3:

How Might One Count?

Problem characteristics:

Space naturally divided into rows, columns, sections, …

Many seats empty

Uneven distribution of people (e.g. more near door, aisles, front, etc.)

How many people are present in the hall?

Page 4:

#1: Brute-Force Counting

Idea: Go through every seat; if occupied, increment the counter

Advantage: Simplicity, accuracy

Drawback: Scalability

Page 5:

#2: Branch-and-Bound (DPLL-style)

Idea: Split the space into sections (e.g. front/back, left/right/center, …); use smart detection of full/empty sections; add up all partial counts

Advantage: Relatively faster, exact

Drawback: Still "accounts for" every single person present: needs extremely fine granularity. Scalability

Framework used in DPLL-based systematic exact counters, e.g. Relsat [Bayardo et al. '00], Cachet [Sang et al. '04]
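The section-splitting idea maps directly onto DPLL-style model counting. A minimal Python sketch (illustrative only; real counters such as Relsat and Cachet add clause learning and component caching), with clauses as lists of DIMACS-style integer literals:

```python
def simplify(clauses, var, value):
    """Fix var := value: drop satisfied clauses, shorten the rest."""
    out = []
    for c in clauses:
        if any(abs(l) == var and (l > 0) == value for l in c):
            continue                      # clause satisfied, drop it
        out.append([l for l in c if abs(l) != var])
    return out

def dpll_count(clauses, variables):
    """DPLL-style exact model counting: split the space on a variable,
    detect 'empty sections' (a falsified clause) and 'full sections'
    (no clauses left), and add up the partial counts."""
    if any(len(c) == 0 for c in clauses):
        return 0                          # empty section: no models here
    if not clauses:
        return 2 ** len(variables)        # full section: every assignment works
    var = next(iter(variables))
    rest = variables - {var}
    return (dpll_count(simplify(clauses, var, True), rest) +
            dpll_count(simplify(clauses, var, False), rest))
```

For example, dpll_count([[1, 2], [3, 4]], {1, 2, 3, 4}) counts the models of (x1 ∨ x2) ∧ (x3 ∨ x4), i.e. 9 = 3 × 3.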

Page 6:

#3: Naïve Sampling Estimate

Idea: Randomly select a region; count within this region; scale up appropriately

Advantage: Quite fast

Drawback: Robustness: can easily under- or over-estimate. Scalability in sparse spaces: e.g. 10^60 solutions out of 10^300 means the region must be much larger than 10^240 to "hit" any solutions
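The naive estimator can be sketched in a few lines (the seat model and function name are mine, not from the talk):

```python
import random

def naive_estimate(occupied, total_seats, region_size, rng=random):
    """Naive sampling: pick a random region of seats, count the
    occupants in it, and scale up by total_seats / region_size."""
    region = rng.sample(range(total_seats), region_size)
    hits = sum(1 for seat in region if seat in occupied)
    return hits * total_seats / region_size
```

With sparse occupancy the estimate is fragile: if only 10^60 of 10^300 "seats" are occupied, any feasible region almost surely contains 0 of them, and the scaled estimate is 0.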

Page 7:

Sampling with a Guarantee

Idea: Identify a "balanced" row split or column split (roughly equal number of people on each side); use local search for the estimate. Pick one side at random; count on that side recursively; multiply the result by 2.

This provably yields the true count on average! Even when an unbalanced row/column is accidentally picked for the split, e.g. even when samples are biased or insufficiently many. Surprisingly good in practice, using local search as the sampler.

Page 8:

Algorithm SampleCount

Input: Boolean formula F

1. Set numFixed = 0, slack = some constant (e.g. 2, 4, 7, …)

2. Repeat until F becomes feasible for exact counting

a. Obtain s solution samples for F

b. Identify the most balanced variable and variable pair ["x is balanced": s/2 samples have x = 0, s/2 have x = 1; "(x,y) is balanced": s/2 samples have x = y, s/2 have x = ¬y]

c. If x is more balanced than (x,y), randomly set x to 0 or 1; else randomly replace x with y or ¬y; simplify F

d. Increment numFixed

Output: model count ≥ 2^(numFixed − slack) × exactCount(simplified F), with confidence (1 − 2^(−slack))

Note: showing one trial

[Gomes-Hoffmann-Sabharwal-Selman IJCAI’07]
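A toy Python rendering of one trial. It is a sketch under strong simplifications: the sampler here enumerates all models instead of running SampleSat-style local search, and only single variables (not variable pairs) are candidates for fixing; all function names are mine.

```python
import random
from itertools import product

def models(clauses, variables):
    """Enumerate all satisfying assignments as dicts {var: bool}."""
    vs = sorted(variables)
    for bits in product([False, True], repeat=len(vs)):
        a = dict(zip(vs, bits))
        if all(any(a[abs(l)] == (l > 0) for l in c) for c in clauses):
            yield a

def fix(clauses, var, value):
    """Set var := value: drop satisfied clauses, shorten the rest."""
    return [[l for l in c if abs(l) != var] for c in clauses
            if not any(abs(l) == var and (l > 0) == value for l in c)]

def sample_count_trial(clauses, variables, s=20, slack=2, cutoff=3, rng=random):
    """One SampleCount-style trial: repeatedly fix the most balanced
    variable at random until few variables remain, count exactly, and
    scale back up: 2^(numFixed - slack) * exactCount(simplified F)."""
    free, num_fixed = set(variables), 0
    while len(free) > cutoff:
        sols = list(models(clauses, free))
        if not sols:
            return 0                        # simplified formula became UNSAT
        sample = [rng.choice(sols) for _ in range(s)]
        # most balanced variable: fraction of True samples closest to 1/2
        var = min(free, key=lambda v: abs(sum(a[v] for a in sample) / s - 0.5))
        clauses = fix(clauses, var, rng.random() < 0.5)
        free.discard(var)
        num_fixed += 1
    exact = sum(1 for _ in models(clauses, free))
    return 2 ** max(num_fixed - slack, 0) * exact   # clamp keeps the bound valid
```

Running t independent trials and reporting the minimum gives the (1 − 2^(−slack·t)) confidence guarantee stated on the next slide.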

Page 9:

Correctness Guarantee

Key properties: Holds irrespective of the quality of the local search estimates

No free lunch! Bad estimates ⇒ high variance of the trial outcome; min over trials is high-confidence but not tight

Confidence grows exponentially with slack and t

Ideas used in the proof: expected model count = true count (for each trial); use Markov's inequality Pr[X > k·E[X]] < 1/k to bound the error probability (X is the outcome of one trial)

Theorem: SampleCount with t trials gives a correct lower bound with probability (1 − 2^(−slack·t))

e.g. slack = 2, t = 4 ⇒ 99% correctness confidence
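The proof idea compresses to two lines. Let X be the raw outcome of one trial (before dividing by 2^slack), so that E[X] = M, the true model count:

```latex
% one trial: Markov's inequality with k = 2^slack
\Pr\left[2^{-\mathit{slack}}\,X > M\right]
  = \Pr\left[X > 2^{\mathit{slack}}\,\mathbb{E}[X]\right]
  < 2^{-\mathit{slack}}
% t independent trials, report the minimum of the scaled outcomes:
\Pr\left[\min_{i \le t} 2^{-\mathit{slack}}\,X_i > M\right]
  < \left(2^{-\mathit{slack}}\right)^{t} = 2^{-\mathit{slack}\cdot t}
```

With slack = 2 and t = 4 this gives 1 − 2^(−8) ≈ 99.6% confidence, matching the "99%" claim.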

Page 10:

Circuit Synthesis, Random CNFs

Instance      True Count     SampleCount (99% conf.)    Relsat (exact)         Cachet (exact)
2bitmax_6     2.1 x 10^29    2.4 x 10^28   (29 sec)     2.1 x 10^29  (66 sec)  2.1 x 10^29  (2 sec)
wff-3-3.5     1.4 x 10^14    1.6 x 10^13   (4 min)      1.4 x 10^14  (2 hrs)   1.4 x 10^14  (7 min)
wff-3-1.5     1.8 x 10^21    1.6 x 10^20   (4 min)      4.0 x 10^17  (12 hrs)  1.8 x 10^21  (3 hrs)
wff-4-5.0     ---            8.0 x 10^15   (2 min)      1.8 x 10^12  (12 hrs)  1.0 x 10^14  (12 hrs)
3bitadd_32    ---            5.9 x 10^1339 (32 min)     ---          (12 hrs)  ---          (12 hrs)

Page 11:

Talk Roadmap

- A Sampling Method with a Correctness Guarantee
- Can we apply this to the Semantic Web?
- Discussion

Page 12:

Talk Roadmap

- A Sampling Method with a Correctness Guarantee
- Can we apply this to the Semantic Web? [Highly speculative]
- Discussion

Page 13:

Counting in the Semantic Web…

… should certainly be possible with this method. Example: given an RDF database D, count how many triples comply with a query q:

- Throw a constraint cutting the set of all triples in half
- If feasible, count the n remaining triples exactly; return n · 2^(#constraints − slack)
- Else, iterate

"Merely" technical challenges:

- What are "constraints" cutting the set of all triples in half?
- How to "throw" a constraint?
- When to stop throwing constraints?
- How to efficiently count the remaining triples?
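A rough sketch of what this could look like over a materialized set of query answers. Everything here is hypothetical (the talk deliberately leaves "constraint" open); I use a salted hash parity as a stand-in for a constraint that cuts the answer set roughly in half:

```python
import random

def half_cut(items, rng):
    """A random 'constraint': keep items whose salted hash is even,
    which removes roughly half of them."""
    salt = rng.getrandbits(64)
    return {x for x in items if hash((salt, x)) % 2 == 0}

def estimate_answer_count(answers, threshold=8, slack=2, rng=random):
    """Throw half-cutting constraints until at most `threshold` answers
    remain, count those exactly, and scale back up:
    n * 2^(#constraints - slack)."""
    s, k = set(answers), 0
    while len(s) > threshold:
        s = half_cut(s, rng)
        k += 1
    return 2 ** max(k - slack, 0) * len(s)
```

The point of the exercise would be to evaluate such a constraint inside the query engine, so that the full answer set is never materialized; this sketch cheats by starting from the explicit set.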

Page 14:

What about Deduction?

Does φ follow from Θ? Exploit the connection "implication ↔ UNSAT ↔ upper bounds"?

A similar theorem does NOT hold for upper bounds. In a nutshell: Markov's inequality Pr[X > k·E[X]] < 1/k has no symmetric "Pr[X < E[X]/k]" counterpart. An adaptation is possible but has many problems → does not look too promising.

Heuristic alternative:

- Add constraints to Θ to obtain Θ'; check whether Θ' implies φ
- If "no", stop; if "yes", go to the next trial
- After t successful trials, output "it's enough, I believe it"
- No provable confidence, but may work well in practice
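In propositional terms the heuristic can be sketched as follows (toy code: entailment is checked by brute-force enumeration, the "added constraint" is a random unit clause, and at least two variables are assumed; note an UNSAT strengthened theory passes a trial vacuously, which a real implementation would have to detect). A "no" answer is sound, since every model of the strengthened Θ' is a model of Θ:

```python
import random
from itertools import product

def entails(clauses, lit, n_vars):
    """Theta |= lit  iff  Theta together with the negation of lit
    has no model (brute-force check over all assignments)."""
    test = clauses + [[-lit]]
    for bits in product([False, True], repeat=n_vars):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in test):
            return False                   # found a countermodel
    return True

def heuristic_entails(clauses, lit, n_vars, trials=4, rng=random):
    """Strengthen Theta with a random unit constraint and test the
    (hopefully easier) strengthened theory. A 'no' is sound; t 'yes'
    answers give only heuristic confidence."""
    for _ in range(trials):
        var = rng.choice([v for v in range(1, n_vars + 1) if v != abs(lit)])
        unit = var if rng.random() < 0.5 else -var
        if not entails(clauses + [[unit]], lit, n_vars):
            return False                   # provably: Theta does not entail lit
    return True                            # "it's enough, I believe it"
```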

Page 15:

What about Deduction?

Does φ follow from Θ? Much more distant adaptation:

- "Constraint" = something that removes half of Θ!!
- Throw some constraints and check whether Θ' ⊨ φ

Confidence problematic:

- Can we draw any conclusions if Θ' does NOT imply φ?
- May be that ψ1, ψ2 ∈ Θ with ψ1 ∧ ψ2 ⊨ φ, but a constraint separated ψ1 from ψ2
- May be that all relevant ψ are thrown out
- Are there interesting cases where we can bound the probability of these events??

Page 16:

Talk Roadmap

- A Sampling Method with a Correctness Guarantee
- Can we apply this to the Semantic Web? [Highly speculative]
- Discussion

Page 17:

Discussion

In propositional CNF, one can efficiently obtain high-confidence lower bounds on the number of models by sampling

Application to the Semantic Web:

- Adaptation to counting tasks should be possible
- Adaptation for deduction (Θ ⊨ φ?) via upper bounds is problematic
- Promising: a heuristic method sacrificing the confidence guarantee
- Alternative adaptation weakens Θ instead of strengthening it: "sampling the knowledge base". Confidence guarantees??

Your feedback and thoughts are highly appreciated!!

Page 18:

What about Deduction?

Does φ follow from Θ? Straightforward adaptation:

There is a variant of this algorithm that computes high-confidence upper bounds instead.

- Throw "large" constraints, check if Θ' is SAT
- If SAT, no implication; if UNSAT in each of t iterations, confidence in an upper bound on #models

Many problems:

- Is Θ' actually easier to check??
- "Large" constraints are tough even in the propositional CNF context! ("Large" = involves half of the propositional variables; needed for confidence)
- An upper bound on #models is not confidence in UNSAT!