Efficient Evaluation of HAVING Queries on a Probabilistic Database
Christopher Re and Dan Suciu, University of Washington


High level Overview

Evaluation of conjunctive Boolean queries with aggregate tests on probabilistic DBs:
HAVING in SQL, e.g. is the SUM(profit) > 100k?

Looking for optimal algorithms (dichotomies): for all queries q with aggregate A, want either
a P time algorithm, call this A-Safe [DS04,DS07]
or some instance s.t. q is hard (#P).

Technique:
In safe plans, use multiplication
In A-safe plans, use convolution (on monoids)

Motivation

Profit
Item      Forecaster   Amount   P
Widget    Alice        $-99k    0.99
Widget    Bob          $100M    0.01
Whatsit   Alice        $1M      1

Expectation Style [Prior Art]
SELECT SUM(Amount) FROM Profit WHERE item='Widget'
Ans: -99k * 0.99 + 100M * 0.01 ~ 900K

HAVING style
SELECT item FROM Profit WHERE item='Widget' GROUP BY item HAVING SUM(Amount) > 0
Ans: 0.01

Overview
Preliminaries
Formal Problem Description
Query plans and Datalog
Monoid Random Variables and Convolutions
Max, Min, Count and hints for others
Conclusions
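The two answers in the Motivation example can be checked by enumerating possible worlds. The following is a minimal sketch, assuming the two Widget rows are independent tuples (the restriction used in the talk). The names `widget_rows` and `possible_worlds` are illustrative and do not come from the slides.

```python
from itertools import product

# Widget rows of the Profit table: (forecaster, amount in dollars, probability).
# Assumption: the rows are independent tuples.
widget_rows = [
    ("Alice", -99_000, 0.99),
    ("Bob", 100_000_000, 0.01),
]

def possible_worlds(rows):
    """Yield (present_rows, probability) for every subset of independent rows."""
    for mask in product([False, True], repeat=len(rows)):
        prob = 1.0
        present = []
        for keep, row in zip(mask, rows):
            prob *= row[2] if keep else 1.0 - row[2]
            if keep:
                present.append(row)
        yield present, prob

# Expectation style: expected value of SUM(Amount) over all worlds.
expected_sum = sum(
    prob * sum(amount for _, amount, _ in world)
    for world, prob in possible_worlds(widget_rows)
)

# HAVING style: probability that the group exists and SUM(Amount) > 0.
having_prob = sum(
    prob
    for world, prob in possible_worlds(widget_rows)
    if world and sum(amount for _, amount, _ in world) > 0
)

print(round(expected_sum), round(having_prob, 4))
```

The expectation is close to 900K because of one unlikely, very large forecast, while the HAVING answer reports how likely a positive total actually is.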


SELECT ITEM FROM PROFIT
WHERE ITEM='Widget'
GROUP BY ITEM
HAVING SUM(PROFIT) > 0

HAVING Query semantics

NB: Assume SQL-like semantics

Conjunctive rule: No repeated symbols

Aggregates

Comparison:

k is a constant

[Speaker note: paste in the definitions here; explain the restrictions for the talk vs. the paper.]

Probabilistic Semantics

NB: In paper, allow disjoint tuples

Possible worlds, model

Query Semantics

In talk, restrict to tuple independence

[Speaker note: give example here.]

Complexity and formal problem

Data complexity: Fix the query. The instance grows.
In practice, the query is small.

Consider k, e.g. 1000, as part of the input

Skeleton,


Monoids and Semirings

NB: n=1 is logical OR

A monoid is a triple (M, +, 0) where M is a set and + is an associative operation on M with identity 0.
e.g.

A commutative semiring is a structure (S, +, *, 0, 1) where
Both (S, +, 0) and (S, *, 1) are commutative monoids
* distributes over +
e.g. a Boolean algebra
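A small sketch of these definitions on finite sets. It assumes that the note "n=1 is logical OR" refers to addition on {0, ..., n} capped at n; that reading, and the helper names `capped_add`, `is_monoid` and `distributes`, are assumptions made for illustration.

```python
from itertools import product

def capped_add(n):
    """Operation of the assumed monoid ({0, ..., n}, +, 0), with sums capped at n."""
    def add(a, b):
        return min(a + b, n)
    return add

def is_monoid(elements, op, identity):
    """Brute-force check of associativity and the identity law on a finite set."""
    elements = list(elements)
    assoc = all(op(op(a, b), c) == op(a, op(b, c))
                for a, b, c in product(elements, repeat=3))
    ident = all(op(identity, a) == a == op(a, identity) for a in elements)
    return assoc and ident

def distributes(elements, times, plus):
    """Check that `times` distributes over `plus` on a finite set."""
    elements = list(elements)
    return all(times(a, plus(b, c)) == plus(times(a, b), times(a, c))
               for a, b, c in product(elements, repeat=3))

# With n = 1 the capped sum behaves like logical OR on {0, 1}.
or_like = capped_add(1)
or_table = [or_like(a, b) for a, b in product([0, 1], repeat=2)]

# The Boolean semiring: OR as +, AND as *.
bools = [False, True]
boolean_ok = (
    is_monoid(bools, lambda a, b: a or b, False)
    and is_monoid(bools, lambda a, b: a and b, True)
    and distributes(bools, lambda a, b: a and b, lambda a, b: a or b)
)
```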

Fix a semiring S.
An annotation is a function to S with finite support

Plans defined inductively:

[GKT07] : Datalog + Semirings

Goal: define the value of a tuple t in a plan P, and its support, i.e. the tuples contributing to a value

Value of a plan, i.e., the annotation it computes

[GKT07] Inductive definition


Annotations and HAVING

X   Y     t(Y)   probability
A   10    1      0.2
B   100   1      0.4
C   1     2      0.1

Monoid sum is 1 iff all values are bigger than 3

0 is tuple not present
1 is tuple present, y > 3
2 is tuple present, y <= 3

Monoids and Aggregates
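A sketch of how the annotation t(Y) and the monoid sum could work on a deterministic instance. It assumes the monoid on {0, 1, 2} uses max as its operation, which is one reading consistent with "sum is 1 iff all values are bigger than 3" (0 is the identity, 2 absorbs everything). The constant and function names are illustrative.

```python
from functools import reduce

# Monoid elements, following the legend on the slide.
ABSENT, ALL_ABOVE, SOME_BELOW = 0, 1, 2

def annotate(y, threshold=3):
    """Annotation t(Y) of a present tuple: 1 if y > threshold, else 2."""
    return ALL_ABOVE if y > threshold else SOME_BELOW

def monoid_sum(values):
    """Sum in the assumed monoid ({0, 1, 2}, max, 0)."""
    return reduce(max, values, ABSENT)

table = {"A": 10, "B": 100, "C": 1}
t = {x: annotate(y) for x, y in table.items()}

all_present = monoid_sum(t.values())          # C has y <= 3, so the sum is 2
without_c = monoid_sum([t["A"], t["B"]])      # every present value is > 3
nothing_present = monoid_sum([])              # empty sum is the identity
```

The test MIN(Y) > 3 holds exactly in the worlds whose monoid sum equals 1.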

How can we deal with probabilities?

An M-random variable (rv) is

Correlations
r, s are independent if for any m, m' in M

Extended to sets via total independence

Monoid Random Variables

[Speaker note: switch text to have the running example in the corner. Use min/max.]

Monoid Convolutions

Let r be an rv. A marginal vector is

The monoid convolution * (depending on +) is

[Speaker notes: 1. Label vectors 0, 1 for monoid values (say "monoid"). 2. Below the convolution, label how each entry is derived, i.e. 0 + 1, 1 + 0, 1 + 1.]

Convolutions

Convolutions are efficient if M is not too big

If r,s monoid rvs then r+s is an rv defined as

PROP: If r,s are independent then the distribution of r + s is given by convolution:

PROP: The convolution of n r.v.s can be computed in
Single convolution in time
Convolution is associative.
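A minimal sketch of monoid convolution on marginal vectors, assuming the monoid elements are indexed 0..|M|-1 and a marginal vector stores P(r = m) at index m. The example monoid (COUNT capped at 2) and all function names are assumptions chosen for illustration.

```python
from functools import reduce

def convolve(p, q, add):
    """Marginal vector of r + s for independent r, s with marginal vectors p, q.

    `add` is the monoid operation on element indices.
    """
    out = [0.0] * len(p)
    for a, pa in enumerate(p):
        for b, qb in enumerate(q):
            out[add(a, b)] += pa * qb
    return out

def convolve_all(vectors, add, identity=0):
    """Fold n marginal vectors together, one convolution per vector."""
    size = len(vectors[0])
    start = [1.0 if m == identity else 0.0 for m in range(size)]
    return reduce(lambda acc, v: convolve(acc, v, add), vectors, start)

def capped_count(a, b):
    """Assumed example monoid: counts in {0, 1, 2}, capped at 2."""
    return min(a + b, 2)

# Three independent tuples, each present with probability 0.5.
vectors = [[0.5, 0.5, 0.0] for _ in range(3)]
dist = convolve_all(vectors, capped_count)
prob_count_at_least_2 = dist[2]

# Associativity: the grouping of convolutions does not matter.
left = convolve(convolve(vectors[0], vectors[1], capped_count), vectors[2], capped_count)
right = convolve(vectors[0], convolve(vectors[1], vectors[2], capped_count), capped_count)
```

In this sketch the double loop performs |M|^2 multiplications per convolution, which is why the approach is only practical when M is small.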


Annotations and HAVING

X   Y     t(Y)   probability   marginal vector
A   10    1      0.2           (0.8, 0.2, 0)
B   100   1      0.4           (0.6, 0.4, 0)
C   1     2      0.1           (0.9, 0, 0.1)

Monoid sum is 1 iff all values are bigger than 3
Marginal of 1 after convolution = value of query

0 is tuple not present
1 is tuple present, y > 3
2 is tuple present, y <= 3

Monoids and Aggregates
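A worked sketch for the marginal vectors (0.8, 0.2, 0), (0.6, 0.4, 0) and (0.9, 0, 0.1). It assumes tuple independence and that the monoid operation on {0, 1, 2} is max; both the operation and the brute-force cross-check are illustrative additions, not part of the slides.

```python
from itertools import product

# Marginal vectors: index 0 = absent, 1 = present with y > 3, 2 = present with y <= 3.
marginals = {
    "A": (0.8, 0.2, 0.0),
    "B": (0.6, 0.4, 0.0),
    "C": (0.9, 0.0, 0.1),
}

def convolve(p, q):
    """Convolution under the assumed monoid ({0, 1, 2}, max, 0)."""
    out = [0.0, 0.0, 0.0]
    for a, pa in enumerate(p):
        for b, qb in enumerate(q):
            out[max(a, b)] += pa * qb
    return out

# Fold the three marginal vectors, starting from the identity element 0.
result = [1.0, 0.0, 0.0]
for vector in marginals.values():
    result = convolve(result, vector)

query_value = result[1]  # marginal of 1 after convolution

def brute_force():
    """Cross-check: enumerate every joint outcome of the three tuples."""
    total = [0.0, 0.0, 0.0]
    for world in product(range(3), repeat=len(marginals)):
        prob = 1.0
        for value, vector in zip(world, marginals.values()):
            prob *= vector[value]
        total[max(world)] += prob
    return total

check = brute_force()
print([round(x, 3) for x in result])
```

Three convolutions over a 3-element monoid replace the enumeration of all joint outcomes, and the two computations agree.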

Compute value of Safe Plans:

Plan is safe [DS04] if all projects and joins are over independent tuples, else #P
THM: value is correct if the plan is safe.

Safe plans for semirings

Only efficient if the semiring is small
Gives dichotomy for MIN, MAX, COUNT, not the others

Additional Results

Dichotomy for SUM, AVG, COUNT DISTINCT
Not all safe plans allowed!
e.g. cannot have independent projections on top

Disjoint tuples in the paper
Need a disjoint projection operation
More work for dichotomies

Algorithms for finding safe plans (P time)

Conclusion

Semantics for aggregation queries on prob DBs
Similar to HAVING in SQL
Proposed a complexity measure for such queries

Central technique was marginal vectors and convolutions

Dichotomy for HAVING queries w.o. self-joins

Conjunctive rule: No repeated subgoals

Aggregates

Comparison:

k is a constant

SELECT ITEM FROM PROFIT
WHERE ITEM='Widget'
GROUP BY ITEM
HAVING SUM(PROFIT) > 0

HAVING Query semantics

NB: Assume SQL-like semantics
