OLAP Over Uncertain and Imprecise Data
Adapted from a talk by T.S. Jayram (IBM Almaden)
with Doug Burdick (Wisconsin), Prasad Deshpande (IBM), Raghu Ramakrishnan (Wisconsin), Shivakumar
Vaithyanathan (IBM)
Adapted by S. Sudarshan
Dimensions in OLAP
Location: All → East (MA, NY), West (TX, CA)
Automobile: All → Truck (F150, Sierra), Sedan (Civic, Camry)
Measures, Facts, and Queries
Fact: Auto = F150, Loc = NY, Repair = $200
Query: Auto = Truck, Loc = East, SUM(Repair) = ?
[Figure: facts p1 to p8 placed on a grid of Automobile (F150, Sierra, Civic, Camry; Truck, Sedan; ALL) by Location (MA, NY, TX, CA; East, West; ALL); one grid square is labeled Cell]
Restriction on Imprecision
We restrict the sets of values in an imprecise fact to either:
1. A singleton set consisting of a leaf-level member of the hierarchy, or
2. The set of all the leaf-level members under some non-leaf-level member of the hierarchy.
Cells and Regions
A region is a vector of attribute values, one from the imprecise domain of each dimension of the cube.
A cell is a region in which all values are leaf-level members.
Let reg(R) represent the set of cells in a region R.
Queries on precise data
A query Q = (R, M, A) refers to a region R, a measure M, and an aggregate function A.
E.g.: (<Ambassador, Location>, Repairs, Sum)
The result of the query in a precise database is obtained by applying A to the measure M of all cells in R.
For the example above, the result is (P1 + P2).
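A minimal sketch of evaluating such a query over precise data. The hierarchy table `LEAVES`, the helper names, and the repair amounts are illustrative assumptions, not taken from the talk.

```python
from itertools import product

# Hypothetical dimension hierarchies: non-leaf member -> leaf members under it.
LEAVES = {
    "Truck": ["F150", "Sierra"],
    "Sedan": ["Civic", "Camry"],
    "East": ["MA", "NY"],
    "West": ["TX", "CA"],
}

def leaves_of(member):
    """Return the leaf-level members covered by a hierarchy member."""
    return LEAVES.get(member, [member])

def cells_in_region(region):
    """reg(R): all cells (tuples of leaf members) inside region R."""
    return set(product(*(leaves_of(m) for m in region)))

def query(facts, region, aggregate):
    """Apply `aggregate` to the measure of every precise fact whose cell lies in `region`."""
    cells = cells_in_region(region)
    return aggregate(measure for cell, measure in facts if cell in cells)

# Illustrative precise facts: ((auto, loc), repair cost).
facts = [(("F150", "NY"), 200), (("Sierra", "MA"), 150), (("Civic", "TX"), 75)]
total = query(facts, ("Truck", "East"), sum)  # only the two truck facts in the East qualify
```

Here a region is expanded into its cells first, so the same `query` function works for a single cell and for a coarse region such as (Truck, East).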
Extend the OLAP model to handle data
ambiguity
Imprecision
Uncertainty
Imprecision
Imprecise fact: Auto = F150, Loc = East, Repair = $200
[Figure: the Automobile by Location grid, now holding facts p1 to p11, some of them imprecise]
Representing Imprecision using Dimension Hierarchies
Dimension hierarchies lead to a natural space
of “partially specified” objects
Sources of imprecision: incomplete data,
multiple sources of data
Motivating Example
[Figure: facts p1 to p5 over Automobile (F150, Sierra; Truck) and Location (MA, NY; East)]
Query: COUNT
We propose desiderata that enable appropriate definition of query semantics for imprecise data
Queries on imprecise data
Consider the query region <Pune, Model> in the figure. It overlaps two imprecise facts, P4 and P5.
Three (naive) options for including a fact in a query:
Contains: consider the fact only if it is contained in the query region
Overlaps: consider the fact if it overlaps the query region
None: ignore all imprecise facts
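The three naive options can be sketched as a single membership test. This is an illustration only: facts and queries are modeled as sets of cells, and the function name and sample data are assumptions.

```python
def include_fact(fact_cells, query_cells, option):
    """Decide whether a fact takes part in a query under a naive option.

    fact_cells:  set of cells the fact may occupy (a single cell if precise).
    query_cells: set of cells in the query region.
    """
    precise = len(fact_cells) == 1
    if option == "none":        # ignore every imprecise fact
        return precise and fact_cells <= query_cells
    if option == "contains":    # the fact's region must lie inside the query region
        return fact_cells <= query_cells
    if option == "overlaps":    # any shared cell is enough
        return bool(fact_cells & query_cells)
    raise ValueError(f"unknown option: {option}")

query_cells = {("F150", "MA"), ("Sierra", "MA")}   # hypothetical query region
imprecise = {("F150", "MA"), ("F150", "NY")}       # hypothetical fact: an F150 somewhere in East
```

With this data the imprecise fact is counted under Overlaps but dropped under both Contains and None.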
Desideratum I: Consistency
Consistency specifies the relationship between answers to related queries on a fixed data set
[Figure: facts p1 to p5 over Automobile (F150, Sierra; Truck) and Location (MA, NY; East)]
Notions of Consistency
Generic idea: if a query region is partitioned and the aggregate is applied to each partition, then the aggregate q on the whole region must be consistent in some way with the aggregates qi on the partitions
General idea: alpha consistency for property alpha
Specific forms of consistency discussed in detail in paper
Sum consistency (for count/sum)
Boundedness consistency (for average)
Contains option : Consistency
Intuitively, consistency means that the answer to a query should agree with the aggregates from the individual partitions of the query.
Using the Contains option could give rise to inconsistent results.
For example, consider the sum aggregate of the query above and that of its individual cells. With the Contains option, will the individual results add up to the collective result?
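A small numeric check of that question, using made-up facts and amounts (the fact names and values here are assumptions for illustration):

```python
# Hypothetical data: fact -> (set of candidate cells, repair amount).
facts = {
    "p1": ({("F150", "NY")}, 100),                   # precise fact
    "p3": ({("F150", "MA"), ("F150", "NY")}, 150),   # imprecise fact: an F150 somewhere in East
}

def contains_sum(query_cells):
    """SUM under the Contains option: count a fact only if its region is inside the query."""
    return sum(m for cells, m in facts.values() if cells <= query_cells)

whole = {("F150", "MA"), ("F150", "NY")}
parts = [{("F150", "MA")}, {("F150", "NY")}]

sum_whole = contains_sum(whole)                   # p1 and p3 are both contained
sum_parts = sum(contains_sum(p) for p in parts)   # p3 fits inside neither single cell
```

The whole region sums to 250 while its two cells sum to only 100, so under Contains the parts do not add up to the whole.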
Desideratum II: Faithfulness
Faithfulness specifies the relationship between answers to a fixed query on related data sets
Notion of result quality relative to the quality of the data input to the query.
– For example, the answer computed for Q=F150,MA should be of higher quality if p3 were precisely known.
[Figure: Data Set 1, Data Set 2, Data Set 3: three versions of facts p1 to p5 over Automobile (F150, Sierra) and Location (MA, NY)]
Formal definitions of both Consistency and Faithfulness depend on the underlying aggregation operator
Can we define query semantics that satisfy these desiderata?
Query Semantics
[Figure: facts p1 to p5 over Automobile (F150, Sierra) and Location (MA, NY)]
Possible Worlds [Kripke63,…]
[Figure: four possible worlds w1 to w4, each placing facts p1 to p5 in cells of Automobile (F150, Sierra) by Location (MA, NY)]
Possible Worlds Query Semantics
Given all possible worlds together with their probabilities, queries are easily answered (using expected values)
But number of possible worlds is exponential!
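A brute-force sketch of possible-worlds evaluation, to make the exponential cost concrete. The facts, probabilities, and the assumption that facts are completed independently are all illustrative, not taken from the talk.

```python
from itertools import product

# Hypothetical imprecise facts: candidate cells with probabilities, plus a measure.
facts = {
    "p3": ({("F150", "MA"): 0.6, ("F150", "NY"): 0.4}, 150),
    "p5": ({("F150", "MA"): 0.5, ("Sierra", "MA"): 0.5}, 100),
}

def possible_worlds(facts):
    """Yield (probability, {fact: cell}) for every way of completing the facts."""
    names = list(facts)
    choices = [list(facts[n][0].items()) for n in names]
    for combo in product(*choices):
        prob = 1.0
        world = {}
        for name, (cell, p) in zip(names, combo):
            prob *= p          # assumption: facts are completed independently
            world[name] = cell
        yield prob, world

def expected_sum(facts, query_cells):
    """Expected SUM of the measure over the query region, across all worlds."""
    return sum(
        prob * sum(facts[n][1] for n, cell in world.items() if cell in query_cells)
        for prob, world in possible_worlds(facts)
    )

worlds = list(possible_worlds(facts))            # 2 * 2 = 4 worlds
answer = expected_sum(facts, {("F150", "MA")})   # equals 0.6 * 150 + 0.5 * 100
```

Each additional imprecise fact multiplies the number of worlds by its number of candidate cells, which is why direct enumeration does not scale.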
Allocation
Allocation gives facts weighted assignments to possible completions, leading to an extended version of the data
Size increase is linear in number of (completions of) imprecise facts
Queries operate over this extended version
Key contributions:
Appropriate characterization of the large space of allocation policies
Designing efficient allocation policies that take into account the correlations in the data
Storing Allocations using Extended Data Model
[Figure: facts p1 to p5 over Automobile (F150, Sierra; Truck) and Location (MA, NY; East)]
ID  FactID  Auto    Loc  Repair  Weight
1   1       F150    NY   100     1.0
2   2       Sierra  NY   500     1.0
3   3       F150    MA   150     0.6
4   3       F150    NY   150     0.4
5   4       Sierra  MA   200     1.0
6   5       F150    MA   100     0.5
7   5       Sierra  MA   100     0.5
Advantages of EDM
No extra infrastructure required for representing imprecision
Efficient algorithms for aggregate queries:
SUM and COUNT: linear-time algorithm
AVERAGE: a slightly more complicated algorithm running in O(m + n^3) for m precise facts and n imprecise facts
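A sketch of the single-pass SUM and COUNT evaluation over the extended data model, using the rows of the allocation table from the previous slide. The function name and the tuple layout are assumptions made for this example.

```python
# Rows of the extended data model: (FactID, Auto, Loc, Repair, Weight).
EDM = [
    (1, "F150", "NY", 100, 1.0),
    (2, "Sierra", "NY", 500, 1.0),
    (3, "F150", "MA", 150, 0.6),
    (3, "F150", "NY", 150, 0.4),
    (4, "Sierra", "MA", 200, 1.0),
    (5, "F150", "MA", 100, 0.5),
    (5, "Sierra", "MA", 100, 0.5),
]

def expected_sum_and_count(rows, query_cells):
    """One pass over the rows: weight each row's contribution by its allocation."""
    total = count = 0.0
    for _fact_id, auto, loc, repair, weight in rows:
        if (auto, loc) in query_cells:
            total += weight * repair
            count += weight
    return total, count

s, c = expected_sum_and_count(EDM, {("F150", "MA")})
```

For the cell (F150, MA) only the two weighted rows of facts 3 and 5 contribute, giving a weighted sum of 0.6 * 150 + 0.5 * 100 and a weighted count of 0.6 + 0.5.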
Aggregating Uncertain Measures
Opinion pooling: provide a consensus opinion from a set of opinions Θ.
The opinions in Θ as well as the consensus opinion are represented as pdfs over a discrete domain O
The linear operator LinOp(Θ) produces a consensus pdf P that is a weighted linear combination of the pdfs in Θ.
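A minimal sketch of such a linear opinion pool over a discrete domain. The function name `lin_op`, the equal default weights, and the sample opinions are assumptions for illustration; the talk does not specify how the weights are chosen.

```python
def lin_op(pdfs, weights=None):
    """Weighted linear combination of discrete pdfs (dicts: outcome -> probability)."""
    if weights is None:
        weights = [1.0 / len(pdfs)] * len(pdfs)   # assumption: equal weights
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    consensus = {}
    for pdf, w in zip(pdfs, weights):
        for outcome, p in pdf.items():
            consensus[outcome] = consensus.get(outcome, 0.0) + w * p
    return consensus

# Two hypothetical opinions over the domain {Yes, No}.
opinions = [{"Yes": 0.9, "No": 0.1}, {"Yes": 0.5, "No": 0.5}]
consensus = lin_op(opinions)
```

Because the weights sum to 1 and each input is a pdf, the consensus is again a pdf over the same domain.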
Allocation Policies
For every region r in the database, we want to assign an allocation p_{c,r} to each cell c in Reg(r), such that
\sum_{c \in Reg(r)} p_{c,r} = 1
Three ways of doing so:
1. Uniform: assign each cell c in a region r an equal probability.
p_{c,r} = 1 / |Reg(r)|
However, we can do better. Some cells may naturally deserve more probability than others. E.g.: Mumbai will clearly have more repairs than Bhopal. We can capture this automatically by giving more probability to cells with a higher number of precise facts.
2. Count based:
p_{c,r} = N_c / \sum_{c' \in Reg(r)} N_{c'}
where N_c is the number of precise facts in cell c
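A sketch of count-based allocation, assuming the weight of a cell is its precise-fact count normalized over the region. The uniform fallback for regions with no precise facts is an added assumption, as are the function name and sample data.

```python
from collections import Counter

def count_based_allocation(region_cells, precise_cells):
    """Allocate in proportion to N_c, the number of precise facts in each cell of reg(r)."""
    counts = Counter(precise_cells)
    total = sum(counts[c] for c in region_cells)
    if total == 0:
        # Assumption: with no precise facts in the region, fall back to uniform.
        return {c: 1.0 / len(region_cells) for c in region_cells}
    return {c: counts[c] / total for c in region_cells}

# Hypothetical precise facts: three in (F150, MA), one in (F150, NY).
precise = [("F150", "MA"), ("F150", "MA"), ("F150", "MA"), ("F150", "NY")]
alloc = count_based_allocation([("F150", "MA"), ("F150", "NY")], precise)
```

An imprecise F150 fact in East would then be allocated 0.75 to MA and 0.25 to NY, instead of 0.5 each under the uniform policy.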
Again, we can arguably get a better result by looking not just at the count, but at the actual value of the measure in question.
3. Measure based : next slide.
Measure Based Allocation
Assumes the following model: the given database D with imprecise facts has been generated by randomly injecting imprecision into a precise database D'.
D' assigns value o to a cell c according to some unknown pdf P(o, c).
If we could determine this pdf, the allocation is simply
p_{c,r} = P(c) / \sum_{c' \in Reg(r)} P(c')
Classifying Allocation Policies
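[Figure: 2x2 grid classifying allocation policies by whether Measure Correlation and Dimension Correlation are Ignored or Used; the policies placed on the grid are Uniform, Count, and EM]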
Results on Query Semantics
Evaluating queries over extended version of data yields expected value of the aggregation operator over all possible worlds
intuitively, the correct value to compute
Efficient query evaluation algorithms for SUM, COUNT
consistency and faithfulness for SUM, COUNT are satisfied under appropriate conditions
Dynamic programming algorithm for AVERAGE
Unfortunately, consistency does not hold for AVERAGE
Alternative Semantics for AVERAGE
APPROXIMATE AVERAGE: E[SUM] / E[COUNT] instead of E[SUM/COUNT]
simpler and more efficient
satisfies consistency
extends to aggregation operators for uncertain measures
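A sketch of the approximate average computed as a ratio of weighted sums over allocated rows. The row layout, function name, and the choice to return `None` for an empty region are assumptions for this example.

```python
# Allocated rows: (Auto, Loc, Repair, Weight), as in an extended data model.
ROWS = [
    ("F150", "NY", 100, 1.0),
    ("Sierra", "NY", 500, 1.0),
    ("F150", "MA", 150, 0.6),
    ("F150", "NY", 150, 0.4),
    ("Sierra", "MA", 200, 1.0),
    ("F150", "MA", 100, 0.5),
    ("Sierra", "MA", 100, 0.5),
]

def approximate_average(rows, query_cells):
    """E[SUM] / E[COUNT]: ratio of the weighted sum to the weighted count."""
    exp_sum = exp_count = 0.0
    for auto, loc, repair, weight in rows:
        if (auto, loc) in query_cells:
            exp_sum += weight * repair
            exp_count += weight
    if exp_count == 0:
        return None   # assumption: no answer when nothing is allocated to the region
    return exp_sum / exp_count

avg_ma = approximate_average(ROWS, {("F150", "MA"), ("Sierra", "MA")})
```

Since both the numerator and the denominator are plain weighted sums, they can be accumulated in the same single pass used for SUM and COUNT.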
Maximum Likelihood Principle
A reasonable estimate for this function P can be that which maximises the probability of generating the given imprecise data set D.
Example: Suppose the pdf depends only on the cells and is independent of the measure values. Thus, the pdf is a mapping P : C → ℝ, where C is the set of cells.
This pdf can be found by maximising the likelihood function:
\mathcal{L}(P) = \prod_{r \in D} \sum_{c \in Reg(r)} P(c)
EM Algorithm
The Expectation Maximization algorithm provides a standard way of maximizing the likelihood, when we have some unknown variables in the observation set.
Expectation step (compute data): calculate the expected value of the unknown variables, given the current estimate of the variables.
Maximization step (compute generator): calculate the distribution that maximizes the probability of the current estimated data set.
EM Algorithm : Example
Initialization Step: Data: [4, 10, ?, ?], Initial mean value: 0, New Data: [4, 10, 0, 0]
Step 1: New Mean: 3.5, New Data: [4, 10, 3.5, 3.5]
Step 2: New Mean: 5.25, New Data: [4, 10, 5.25, 5.25]
Step 3: New Mean: 6.125, New Data: [4, 10, 6.125, 6.125]
Step 4: New Mean: 6.5625, New Data: [4, 10, 6.5625, 6.5625]
Step 5: New Mean: 6.78125, New Data: [4, 10, 6.78125, 6.78125]
Result: New Mean: 6.890625
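The iteration in this example can be reproduced with a few lines. This is a sketch of the toy mean-estimation loop only; the function name and parameters are made up for illustration.

```python
def em_fill_missing(observed, n_missing, initial_mean=0.0, steps=6):
    """Alternate between filling missing values with the mean and re-estimating the mean."""
    mean = initial_mean
    history = []
    for _ in range(steps):
        data = observed + [mean] * n_missing   # expectation: fill in the unknown values
        mean = sum(data) / len(data)           # maximization: re-estimate the mean
        history.append(mean)
    return history

history = em_fill_missing([4, 10], n_missing=2)
# history == [3.5, 5.25, 6.125, 6.5625, 6.78125, 6.890625]
```

The fixed point of the update m = (4 + 10 + 2m) / 4 is m = 7, the mean of the two observed values, and the distance to it halves at every step.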
EM Algorithm : Application
Experiments : Allocation run time
Experiments : Query run time
Experiments : Query run time
Experiments : Accuracy
Uncertainty
Measure value is modeled as a probability
distribution function over some base domain
e.g., measure Brake is a pdf over values {Yes,No}
sources of uncertainty: measures extracted from text
using classifiers
Adapt well-known concepts from statistics to
derive appropriate aggregation operators
Our framework and solutions for dealing with
imprecision also extend to uncertain measures
Summary
Consistency and faithfulness desiderata for designing query semantics for imprecise data
Allocation is the key to our framework
Efficient algorithms for aggregation operators with appropriate guarantees of consistency and faithfulness
Iterative algorithms for allocation policies
Correlation-based Allocation
Involves defining an objective function to capture some underlying correlation structure
a more stringent requirement on the allocations
solving the resulting optimization problem yields the allocations
EM-based iterative allocation policy
Interesting highlight: allocations are re-scaled iteratively by computing appropriate aggregations
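A hedged sketch of what such an iterative re-scaling could look like for a count-style objective: each round aggregates the mass per cell (precise facts plus currently allocated fractions) and then re-normalizes every imprecise fact's allocation over its region. This is an illustration under these assumptions, not the authors' exact algorithm; all names are hypothetical.

```python
def em_allocation(regions, precise_counts, iterations=20):
    """Iteratively re-scale allocations of imprecise facts.

    regions:        list of cell lists, one per imprecise fact (its reg(r)).
    precise_counts: dict cell -> number of precise facts in that cell.
    """
    # Start from uniform allocations.
    alloc = [{c: 1.0 / len(r) for c in r} for r in regions]
    for _ in range(iterations):
        # Aggregate: mass of each cell = precise facts + allocated fractions.
        mass = dict(precise_counts)
        for a in alloc:
            for c, p in a.items():
                mass[c] = mass.get(c, 0.0) + p
        # Re-scale: each fact's allocation is proportional to the cell masses in its region.
        alloc = [
            {c: mass[c] / sum(mass[c2] for c2 in r) for c in r}
            for r in regions
        ]
    return alloc

# One hypothetical imprecise fact over two cells holding 3 and 1 precise facts.
regions = [[("F150", "MA"), ("F150", "NY")]]
alloc = em_allocation(regions, {("F150", "MA"): 3, ("F150", "NY"): 1})
```

In this single-fact case the allocation moves from the uniform start (0.5, 0.5) toward (0.75, 0.25), the same split that the precise counts alone would suggest.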