Efficient Evaluation of HAVING Queries on a Probabilistic Database
Christopher Re and Dan Suciu
University of Washington

High level Overview
Evaluation of conjunctive Boolean queries with aggregate tests on probabilistic DBs:
  HAVING in SQL, e.g. is SUM(profit) > 100k?
Looking for optimal algorithms (dichotomies): for all queries q with aggregate A, want
  P time algorithm, call this A-safe [DS04, DS07], or
  some instance s.t. q is hard (#P).
Technique:
  In safe plans, use multiplication
  In A-safe plans, use convolution (on monoids)

Motivation
Profit
Item     Forecaster  Amount  P
Widget   Alice       $-99k   0.99
Widget   Bob         $100M   0.01
Whatsit  Alice       $1M     1

Expectation style [Prior Art]
SELECT SUM(Amount)
FROM Profit
WHERE item='Widget'
Ans: -99k * 0.99 + 100M * 0.01 ~ 900K

HAVING style
SELECT item FROM Profit
WHERE item='Widget'
GROUP BY item
HAVING SUM(Amount) > 0
Ans: 0.01

Overview
Preliminaries
Formal Problem Description
Query plans and Datalog
Monoid Random Variables and Convolutions
Max, Min, Count and hints for others
Conclusions
HAVING Query semantics
SELECT ITEM FROM PROFIT
WHERE ITEM='Widget'
GROUP BY ITEM
HAVING SUM(PROFIT) > 0
NB: Assume SQL-like semantics
Conjunctive rule: no repeated relation symbols (no self-joins)
Aggregates
Comparison: the aggregate is compared with k, where k is a constant (0 in the query above)
Probabilistic Semantics
NB: In paper, allow disjoint tuples
Possible worlds, model
Query Semantics
In talk, restrict to tuple independence
Complexity and formal problem
Data complexity: fix the query; the instance grows. In practice, the query is small.
Consider k, e.g. 1000, as part of the input
Skeleton,
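The possible-worlds semantics under tuple independence can be illustrated by brute force on the Widget rows of the Profit table from the Motivation slide. This is only a sketch: the function names are hypothetical, and enumerating all 2^n worlds is exactly the exponential cost that a P time algorithm has to avoid as the instance grows.

```python
from itertools import product

# Widget rows of the Profit table: (forecaster, amount in dollars, probability).
WIDGET_TUPLES = [
    ("Alice", -99_000, 0.99),
    ("Bob", 100_000_000, 0.01),
]


def having_sum_probability(tuples, k):
    """P(SUM(amount) > k), summing the weight of every possible world."""
    total = 0.0
    for present in product([False, True], repeat=len(tuples)):
        weight = 1.0
        amount_sum = 0
        for is_in, (_, amount, p) in zip(present, tuples):
            # Tuple independence: each tuple is in or out on its own.
            weight *= p if is_in else 1.0 - p
            if is_in:
                amount_sum += amount
        # An empty group yields no row, so it cannot satisfy HAVING.
        if any(present) and amount_sum > k:
            total += weight
    return total


def expected_sum(tuples):
    """Expectation-style answer: E[SUM(amount)] by linearity."""
    return sum(amount * p for _, amount, p in tuples)


print(having_sum_probability(WIDGET_TUPLES, 0))
print(expected_sum(WIDGET_TUPLES))
```

The first value comes out at about 0.01 (the HAVING-style answer) and the second at about 902k (the expectation-style answer, rounded to 900K on the slide).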
Monoids and Semirings
A monoid is a triple (M, +, 0) where M is a set and + is associative with identity 0.
e.g.
NB: n=1 is logical OR
A commutative semiring is a tuple (S, +, *, 0, 1) where
  (S, +, 0) and (S, *, 1) are both commutative monoids
  * distributes over +
  e.g. a Boolean algebra
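The monoid laws can be checked mechanically on a small example. The example that followed "e.g." on the slide is not recoverable, so the truncated-sum monoid on {0, ..., n} below is an assumption, chosen because for n=1 it behaves as logical OR, which matches the NB. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Monoid:
    elements: List[int]
    plus: Callable[[int, int], int]
    zero: int


def truncated_sum_monoid(n: int) -> Monoid:
    """Assumed example: {0, ..., n} with a + b capped at n."""
    return Monoid(list(range(n + 1)), lambda a, b: min(a + b, n), 0)


def is_monoid(m: Monoid) -> bool:
    """Check the identity and associativity laws by exhaustive search."""
    identity_ok = all(
        m.plus(m.zero, a) == a and m.plus(a, m.zero) == a for a in m.elements
    )
    assoc_ok = all(
        m.plus(m.plus(a, b), c) == m.plus(a, m.plus(b, c))
        for a in m.elements
        for b in m.elements
        for c in m.elements
    )
    return identity_ok and assoc_ok


or_monoid = truncated_sum_monoid(1)
print(is_monoid(truncated_sum_monoid(3)))
print([(a, b, or_monoid.plus(a, b)) for a in (0, 1) for b in (0, 1)])
```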
Fix a semiring S.
An annotation is a function to S with finite support
Plans defined inductively:
[GKT07] : Datalog + Semirings
Goal: define the value of a tuple t in a plan P, and its support, i.e. the tuples contributing to a value
Value of a plan, i.e. the annotation it computes
[GKT07] Inductive definition
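The inductive definition itself is not reproduced in this transcript. As a rough sketch of the general idea of semiring-annotated plans, assuming the usual convention that a join multiplies annotations and a projection adds them, one could write the following. The relation encoding (a dict from tuple to annotation, whose keys are the finite support) and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    zero: Any
    one: Any
    add: Callable[[Any, Any], Any]
    mul: Callable[[Any, Any], Any]


BOOLEAN = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)
COUNTING = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)


def join(left, right, left_col, right_col, s):
    """Join on left[left_col] == right[right_col]; annotations are multiplied."""
    out = {}
    for lt, lv in left.items():
        for rt, rv in right.items():
            if lt[left_col] == rt[right_col]:
                key = lt + rt
                out[key] = s.add(out.get(key, s.zero), s.mul(lv, rv))
    return out


def project(rel, keep, s):
    """Keep the listed columns; annotations of merged tuples are added."""
    out = {}
    for tup, val in rel.items():
        key = tuple(tup[i] for i in keep)
        out[key] = s.add(out.get(key, s.zero), val)
    return out


# R(x, y) and S(y), every tuple annotated with the semiring's 1.
R = {("a", 1): 1, ("a", 2): 1, ("b", 1): 1}
S = {(1,): 1, (2,): 1}
counts = project(join(R, S, 1, 0, COUNTING), [0], COUNTING)
print(counts)  # number of derivations of each x
```

Swapping COUNTING for BOOLEAN (with True annotations) turns the same plan into an ordinary existence test.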
Monoids and Aggregates
Annotations and HAVING
X  Y    t(Y)  probability
A  10   1     0.2
B  100  1     0.4
C  1    2     0.1
Monoid sum is 1 iff all values are bigger than 3
0 is tuple not present
1 is tuple present, y > 3
2 is tuple present, y <= 3
How can we deal with probabilities?
Monoid Random Variables
An M-random variable (rv) is
Correlations: r, s are independent if for any m, m' in M
Extended to sets via total independence
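The formulas on this slide did not survive extraction. A plausible reading, assuming W denotes the set of possible worlds (a symbol not given in the transcript) and using the standard notion of independence, is:

```latex
% Assumed reconstruction, not the slide's own notation.
\[
  r : W \to M, \qquad
  \Pr[r = m] \;=\; \sum_{w \in W,\; r(w) = m} \Pr(w)
\]
\[
  r, s \text{ independent} \iff
  \Pr[r = m \wedge s = m'] \;=\; \Pr[r = m]\,\Pr[s = m']
  \quad \text{for all } m, m' \in M
\]
```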
Monoid Convolutions
Let r be an rv. A marginal vector is
The monoid convolution * (depending on +) is
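The two definitions are missing from the transcript. One reading that agrees with the PROP stated later (for independent r, s the distribution of r + s is the convolution of their marginal vectors) is sketched below; the vector names u, v are assumptions.

```latex
% Assumed reconstruction: marginal vector of r, indexed by monoid elements.
\[
  m^{r}[i] \;=\; \Pr[r = i], \qquad i \in M
\]
% Monoid convolution; the sum ranges over pairs whose monoid sum is k.
\[
  (u * v)[k] \;=\; \sum_{\substack{i, j \in M \\ i + j = k}} u[i]\, v[j]
\]
% Worked instance on the OR monoid {0, 1} with u = (0.8, 0.2), v = (0.6, 0.4):
\[
  (u * v)[0] = 0.8 \cdot 0.6 = 0.48, \qquad
  (u * v)[1] = 0.8 \cdot 0.4 + 0.2 \cdot 0.6 + 0.2 \cdot 0.4 = 0.52
\]
```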
Convolutions
Convolutions are efficient if M is not too big
If r,s monoid rvs then r+s is an rv defined as
PROP: If r,s are independent then the distribution of r + s is given by convolution:
PROP: The convolution of n r.v.s can be computed in
Single convolution in time
Convolution is associative.
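A naive version of the convolution of n independent rvs can be sketched as a left fold, which is legitimate because convolution is associative. The monoid is assumed to be encoded as the integers 0..size-1, and the truncated COUNT monoid used in the example is an illustration rather than something taken from the slides. This naive version performs size^2 multiplications per single convolution; the bound stated on the slide is not recoverable and may differ.

```python
from functools import reduce


def convolve(u, v, plus):
    """Monoid convolution of two marginal vectors indexed by 0..len(u)-1."""
    out = [0.0] * len(u)
    for i, p_i in enumerate(u):
        for j, p_j in enumerate(v):
            out[plus(i, j)] += p_i * p_j
    return out


def convolve_all(vectors, plus, size, zero=0):
    """Fold n marginal vectors, starting from the point mass on the identity."""
    identity = [0.0] * size
    identity[zero] = 1.0
    return reduce(lambda a, b: convolve(a, b, plus), vectors, identity)


def count_up_to_two(a, b):
    """COUNT truncated at 2: element 2 means 'two or more tuples present'."""
    return min(a + b, 2)


# Three independent tuples, each present with probability 0.5.
vectors = [[0.5, 0.5, 0.0]] * 3
result = convolve_all(vectors, count_up_to_two, size=3)
print(result)  # [P(count = 0), P(count = 1), P(count >= 2)]
```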
Monoids and Aggregates
Annotations and HAVING
X  Y    t(Y)  probability  marginal vector
A  10   1     0.2          (0.8, 0.2, 0)
B  100  1     0.4          (0.6, 0.4, 0)
C  1    2     0.1          (0.9, 0, 0.1)
Monoid sum is 1 iff all values are bigger than 3
Marginal of 1 after convolution = value of query
0 is tuple not present
1 is tuple present, y > 3
2 is tuple present, y <= 3
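The numbers on this slide can be checked end to end. The sketch below assumes the three-element monoid is max on {0, 1, 2} (0 is the identity, 2 absorbs, 1 + 1 = 1), which is consistent with "monoid sum is 1 iff all values are bigger than 3", and it assumes the underlying test is MIN(Y) > 3. It compares the convolution result against brute-force enumeration of the 8 possible worlds.

```python
from itertools import product

# Rows from the slide: (X, Y, probability).
ROWS = [("A", 10, 0.2), ("B", 100, 0.4), ("C", 1, 0.1)]


def marginal_vector(y, p):
    """(P[absent], P[present and y > 3], P[present and y <= 3])."""
    return (1.0 - p, p, 0.0) if y > 3 else (1.0 - p, 0.0, p)


def convolve(u, v):
    """Convolution over the assumed monoid ({0, 1, 2}, max, 0)."""
    out = [0.0, 0.0, 0.0]
    for i, p_i in enumerate(u):
        for j, p_j in enumerate(v):
            out[max(i, j)] += p_i * p_j
    return out


def by_convolution(rows):
    acc = [1.0, 0.0, 0.0]  # point mass on the identity element 0
    for _, y, p in rows:
        acc = convolve(acc, marginal_vector(y, p))
    return acc


def by_possible_worlds(rows):
    """P(group is non-empty and MIN(Y) > 3), enumerating every world."""
    total = 0.0
    for present in product([False, True], repeat=len(rows)):
        weight = 1.0
        ys = []
        for is_in, (_, y, p) in zip(present, rows):
            weight *= p if is_in else 1.0 - p
            if is_in:
                ys.append(y)
        if ys and min(ys) > 3:
            total += weight
    return total


final = by_convolution(ROWS)
print(final[1], by_possible_worlds(ROWS))
```

Both values come out at about 0.468, the marginal of 1 after convolution.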
Safe plans for semirings
Compute value of Safe Plans:
Plan is safe [DS04] if all projects and joins are on independent tuples, else #P
THM: value is correct if the plan is safe.
Only efficient if the semiring is small
Gives dichotomy for MIN, MAX, COUNT, not the others

Additional Results
Dichotomy for SUM, AVG, COUNT DISTINCT
  Not all safe plans allowed!
  e.g. cannot have independent projections on top
Disjoint tuples in the paper
  Need a disjoint projection operation
  More work for dichotomies
Algorithms for finding safe plans (P time)
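For contrast with convolution, the "use multiplication" rule for ordinary safe plans can be sketched with the usual formulas for independent events. The function names are hypothetical and this covers only the plain Boolean case, not the A-safe plans of the talk.

```python
from math import prod


def independent_join(p_left, p_right):
    """Both independent tuples must be present."""
    return p_left * p_right


def independent_project(probs):
    """At least one of the independent tuples is present."""
    return 1.0 - prod(1.0 - p for p in probs)


# Probabilities of the three tuples A, B, C from the running example.
p_any = independent_project([0.2, 0.4, 0.1])
print(p_any)
```

The result, about 0.568, is one minus the product 0.8 * 0.6 * 0.9 of the "tuple not present" entries of the three marginal vectors.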
Conclusion
Semantics for aggregation queries on prob DBs
  Similar to HAVING in SQL
Proposed a complexity measure for such queries
Central technique was marginal vectors and convolutions
Dichotomy for HAVING queries w.o. self-joins