Explanations in Data Systems
Fotis Savva
Ph.D. student, IDEAS Group, University of Glasgow
Explanations are sought when anomalies appear in a dataset or when we want to further investigate query answers.
Example #1: Outlier Explanation
Outlier Detection
● Automatic detection [Bailis et al., 2016]
● The user selects outlier points [Wu-Madden, 2013]

Explaining Outliers
● Find (attribute, value) combinations [Bailis et al., 2016]
● Find predicates [Wu-Madden, 2013]
Explanation as (attribute, value): (SensorId, 3) (fine-grained)
Explanation as predicate: SensorId == 3 (coarse-grained)
Example #2: Answers to Database Queries [Roy et al., 2015]
Award(aid, amount, title, year,
startdate, enddate, dir, div)
Investigator(aid, PIName, emailID)
Institution(aid, instName, address)
Query: Top 5 institutions ranked by total awards

SELECT TOP 5 B.instName, SUM(A.amount) AS totalAward
FROM Award A, Institution B
WHERE A.aid = B.aid AND dir = 'CS' AND year >= 1990
GROUP BY B.instName
ORDER BY totalAward DESC
Answers to Database Queries [Roy et al., 2015]

instName                                      totalAward
University of Illinois at Urbana-Champaign    1,169,673,252
University of California-San Diego              723,335,212
Carnegie-Mellon University                      472,915,775
University of Texas at Austin                   319,437,217
Massachusetts Institute of Technology           292,662,491
Why such a difference?
Potential explanations:
[PIName = 'Robert Pennington']: associated awards of $893M for UIUC and $26M for CMU.
[div = 'ACI']: associated awards of $580M.
Evaluating Explanations
● Find the tuples matching the specified predicate.
● Measure the change in output: removing the associated tuples helped balance out the awards for each institution.
● This means the predicates are an effective explanation.
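This evaluation loop can be sketched as a simple intervention: delete the tuples matched by the candidate predicate, recompute the per-institution totals, and check whether the gap between the institutions shrinks. The toy rows and the `totals` helper below are illustrative assumptions, not the actual algorithm of [Roy et al., 2015].

```python
# Evaluate a predicate explanation by intervention (illustrative sketch).
def totals(rows, predicate=None):
    out = {}
    for r in rows:
        if predicate is not None and predicate(r):
            continue  # drop tuples matched by the explanation predicate
        out[r["inst"]] = out.get(r["inst"], 0) + r["amount"]
    return out

# Hypothetical award tuples (amounts in $M).
rows = [
    {"inst": "UIUC", "amount": 900, "div": "ACI"},
    {"inst": "UIUC", "amount": 270, "div": "CCF"},
    {"inst": "CMU",  "amount": 30,  "div": "ACI"},
    {"inst": "CMU",  "amount": 440, "div": "CCF"},
]

before = totals(rows)                              # totals with all tuples
after = totals(rows, lambda r: r["div"] == "ACI")  # totals without [div = 'ACI'] tuples
```

If the gap between the two institutions is much smaller in `after` than in `before`, the predicate is an effective explanation for the difference.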
Main Issues to Consider
1. Which domain to consider? Explanations of what? (e.g. aggregates, database queries, outliers)
2. How to represent explanations? (attribute-value combinations, predicates, visualisations)
3. How to find explanations efficiently? (SQL data systems, big-data engines)
4. How to rank explanations?
Domain
● Streaming IoT data from sensors: automatically detecting outliers and providing explanations for those outliers [Bailis et al., 2016]
● Allowing the user to interactively select outliers in aggregate results and providing explanations [Wu-Madden, 2013]
● Answers to database queries [Roy et al., 2015]
● Given relation instances augmented with a binary feature, explaining the outcome of the binary feature [El Gebaly et al., 2015]
● Explaining incorrect results in knowledge bases [Wang et al., 2015]
Representation of Explanations
1. (Attribute, value) pairs, e.g. (SensorID, 3)
2. Predicates, e.g. (SensorID == 3 && avg(volt) >= 2.5)
3. Lists of (attribute, value) or (attribute, *), where "*" matches all values [El Gebaly et al., 2015]
4. Simple pre-defined descriptions [Roy et al., 2015]
Efficiently Finding Explanations
● Frequent itemsets and a heavy-hitters sketch [Bailis et al., 2016]
● A feature-selection approach [Wang et al., 2015]
● A decision-tree approach [Wu-Madden, 2013]
● Storing explanations as relations and then executing JOINs [Roy et al., 2015]
● Using the CUBE operator to generate patterns [El Gebaly et al., 2015]
Frequent Itemsets and Heavy-Hitters Sketch [Bailis et al., 2016]
● A heavy-hitters sketch (AMC) maintains approximate counts for single attributes.
● Combinations of attributes are handled with an M-CPS tree and FPGrowth: a similar frequency-based tree is maintained and updated in a streaming fashion, then mined using the FPGrowth algorithm.
Image source: https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Frequent_Pattern_Mining/The_FP-Growth_Algorithm
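The single-attribute counting can be sketched with a standard heavy-hitters counter. The Space-Saving-style eviction below is a simplified stand-in, not MacroBase's actual AMC sketch; the `HeavyHitters` class and its method names are hypothetical.

```python
# Minimal heavy-hitters sketch (Space-Saving style); an illustrative sketch,
# not the AMC implementation from [Bailis et al., 2016].
class HeavyHitters:
    def __init__(self, capacity):
        self.capacity = capacity  # max number of tracked items
        self.counts = {}

    def update(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
        else:
            # Evict the minimum-count item; the new item inherits its count + 1,
            # so all stored counts are overestimates bounded by the evicted count.
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1

    def heavy_hitters(self, threshold):
        return {k: v for k, v in self.counts.items() if v >= threshold}

hh = HeavyHitters(capacity=3)
for value in ["a"] * 5 + ["b"] * 3 + ["c", "d", "a"]:
    hh.update(value)
```

Only a fixed number of counters is kept, so the sketch can run over an unbounded stream of attribute values.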
Decision Tree Approach [Wu-Madden, 2013]
● Partitions the input set of tuples into a number of predicates (= partitions) such that the 'influence' within each partition is maximised.
● Adjacent predicates are then merged to find a better one.
● Sampling is also used, so there is no need to go through all (attribute, value) tuples.
Feature-Selection Approach [Wang et al., 2015]
● Each feature (attribute, value) is associated with an error rate ε_i, derived directly from the number of incorrect elements (which are known beforehand).
● A cost is then computed for each feature; this cost is used to select the features that best explain the incorrect elements (selection follows a top-down iterative approach).
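The idea can be illustrated with a toy error-rate computation and a greedy selection. The cost model below is an assumed simplification (lower cost when a feature covers mostly incorrect elements), not Data X-Ray's actual cost function; `error_rate`, `select_features`, and the toy elements are hypothetical.

```python
# Illustrative error-rate-based feature selection; NOT Data X-Ray's cost model.
def error_rate(feature, elements, incorrect):
    """Fraction of elements covered by `feature` that are correct (lower is a
    stronger signal that the feature explains the errors)."""
    covered = [e for e in elements if feature in e["features"]]
    if not covered:
        return 1.0
    wrong = sum(1 for e in covered if e["id"] in incorrect)
    return 1.0 - wrong / len(covered)

def select_features(elements, incorrect, k=1):
    # Greedy top-down selection: pick the k lowest-cost features.
    features = {f for e in elements for f in e["features"]}
    ranked = sorted(features, key=lambda f: error_rate(f, elements, incorrect))
    return ranked[:k]

# Toy data: elements 1 and 2 are known to be incorrect.
elements = [
    {"id": 1, "features": {("sensor", 3)}},
    {"id": 2, "features": {("sensor", 3)}},
    {"id": 3, "features": {("sensor", 5)}},
]
incorrect = {1, 2}
```

Here `select_features(elements, incorrect)` picks `("sensor", 3)`, since every element it covers is incorrect.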
Ranking Explanations
1. Scorpion [Wu-Madden, 2013]: explanations (= predicates over aggregates) are ranked by their influence, defined as the ratio between the change in the output and the number of tuples that satisfy the predicate.
2. MacroBase [Bailis et al., 2016]: explanations (= attribute-value combinations) are ranked by their frequency.
3. Data X-Ray [Wang et al., 2015]: explanations (= features) are ranked by their diagnosis cost.
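The influence ratio from item 1 can be written down directly. This is a simplified sketch of that ratio, not Scorpion's full influence definition; the `influence` helper and the toy tuples are assumptions.

```python
# Influence of a predicate = change in aggregate output when the matching
# tuples are removed, divided by the number of matching tuples (sketch).
def influence(tuples, predicate, agg=sum):
    matching = [t for t in tuples if predicate(t)]
    if not matching:
        return 0.0
    full = agg(t["value"] for t in tuples)
    without = agg(t["value"] for t in tuples if not predicate(t))
    return abs(full - without) / len(matching)

# Toy data: one extreme tuple flagged "hot" dominates the sum.
tuples = [{"value": 10}, {"value": 10}, {"value": 100, "hot": True}]
hot = lambda t: t.get("hot", False)
```

A predicate that removes a single tuple yet shifts the output a lot (here, `hot`) gets a high influence score, which is exactly the kind of explanation Scorpion ranks first.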
Research Ideas: Domain
Assisting data exploration by allowing data scientists to pose the explanation question along with the aggregate query.
A common exploratory technique is to request a COUNT over range predicates.

Spark example, on a dataset containing tuples of (x, y) coordinates, with the range predicate -0.5 <= x <= 0.5 && -0.5 <= y <= 0.5:

rdd.filter((x, y) -> range_predicate(x, y)).count()

Result: 678
But what are the underlying sub-ranges that contribute the most to this result?
Representation
1. A list of sub-range predicates showing the COUNT result for each one:

   range_dim1      range_dim2      COUNT
   [-0.3, -0.1]    [0.2, 0.3]      260
   [-0.1, 0.1]     [-0.2, 0.1]     100

   The user can select 'k' to show only the top-k ranges, and can choose whether sub-ranges may overlap.
2. Visualisations.
Finding Explanations
1. Naive solution: split the filtered dataset into 'k' partitions, get min(x, y) and max(x, y) for each partition, then construct a bounding box and get the COUNT result for each one.
2. Partition the filtered dataset with clustering algorithms: probabilistic (mixture models), hard-assignment (k-means), or k-independent methods.
3. Recursive partitioning (regression/decision trees), partitioning based on COUNT and the minimum distance between points.
4. Construct a histogram.
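The naive solution (step 1) can be sketched as follows, under the assumption that "split into k partitions" means equal-size chunks of the points sorted by x; `naive_subranges` is a hypothetical helper, not an implementation from the cited work.

```python
# Naive sub-range explanation (sketch): equal-size partitions of the sorted
# points, each summarised by its bounding box and COUNT contribution.
def naive_subranges(points, k):
    pts = sorted(points)                    # sort by x, then y
    size = (len(pts) + k - 1) // k          # ceil(len / k) points per partition
    out = []
    for i in range(0, len(pts), size):
        chunk = pts[i:i + size]
        xs = [p[0] for p in chunk]
        ys = [p[1] for p in chunk]
        # (x-range, y-range, COUNT) for this partition's bounding box.
        out.append(((min(xs), max(xs)), (min(ys), max(ys)), len(chunk)))
    return out

boxes = naive_subranges([(0.1, 0.2), (0.15, 0.25), (-0.3, 0.0), (-0.2, 0.4)], k=2)
```

Each returned triple is one row of the sub-range table from the Representation slide: a range per dimension plus the COUNT it contributes to the overall result.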
Major Differences from the Other Work
1. Not dealing with SQL systems, as [El Gebaly et al., 2015] and [Roy et al., 2015] do.
2. No need to compute backwards provenance as in [Wu-Madden, 2013], because the explanation is provided along with the aggregate result.
3. The user does not have to specify whether a given result is high or low [Wu-Madden, 2013]; in some cases the user has no idea whether a given value is high or low for a dataset.
Thank you for your time.
References
1. Bailis, Peter, Deepak Narayanan, and Samuel Madden. "MacroBase: Analytic Monitoring for the Internet of Things." arXiv preprint arXiv:1603.00567 (2016).
2. Wu, Eugene, and Samuel Madden. "Scorpion: Explaining Away Outliers in Aggregate Queries." Proceedings of the VLDB Endowment 6.8 (2013): 553-564.
3. El Gebaly, Kareem, et al. "Interpretable and Informative Explanations of Outcomes." Proceedings of the VLDB Endowment 8.1 (2014): 61-72.
4. Wang, Xiaolan, Xin Luna Dong, and Alexandra Meliou. "Data X-Ray: A Diagnostic Tool for Data Errors." Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015.
5. Roy, Sudeepa, Laurel Orr, and Dan Suciu. "Explaining Query Answers with Explanation-Ready Databases." Proceedings of the VLDB Endowment 9.4 (2015): 348-359.