Explanations in Data Systems
Fotis Savva
Ph.D. student, IDEAS Group, University of Glasgow
Explanations are sought when anomalies appear in a dataset or when we want to further investigate query answers.
Example #1: Outlier Explanation
Outlier Detection
● Automatic detection [Bailis et al., 2016]
● The user selects outlier points [Wu-Madden, 2013]

Explaining Outliers
● Find (attribute, value) combinations [Bailis et al., 2016]
● Find predicates [Wu-Madden, 2013]
Explanation as (attribute, value): (SensorId, 3) (fine-grained)
Explanation as predicate: SensorId == 3 (coarse-grained)
Example #2: Answers to Database Queries [Roy et al., 2015]
Award(aid, amount, title, year,
startdate, enddate, dir, div)
Investigator(aid, PIName, emailID)
Institution(aid, instName, address)
Query: Top 5 institutions ranked by total awards

SELECT TOP 5 B.instName, SUM(A.amount) AS totalAward
FROM Award A, Institution B
WHERE A.aid = B.aid AND dir = 'CS' AND year >= 1990
GROUP BY B.instName
ORDER BY totalAward DESC
Answers to Database Queries [Roy et al., 2015]

instName                                      totalAward
University of Illinois at Urbana-Champaign    1,169,673,252
University of California-San Diego              723,335,212
Carnegie-Mellon University                      472,915,775
University of Texas at Austin                   319,437,217
Massachusetts Institute of Technology           292,662,491
Why such a difference?
Potential explanations:
[PIName = 'Robert Pennington']: associated awards of $893M for UIUC and $26M for CMU.
[div = 'ACI']: associated awards of $580M.
Evaluating Explanations
● Find the tuples matching the specified predicate.
● Measure the change in output: removing the associated tuples helped balance out the awards for each institution.
● This means the predicates are an effective explanation.
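This evaluation loop can be sketched as a simple intervention: delete the tuples matched by the candidate predicate, recompute the per-institution totals, and check whether the gap between the institutions shrinks. The toy rows and the `totals` helper below are illustrative assumptions, not the actual algorithm of [Roy et al., 2015].

```python
# Evaluate a predicate explanation by intervention (illustrative sketch).
def totals(rows, predicate=None):
    out = {}
    for r in rows:
        if predicate is not None and predicate(r):
            continue  # drop tuples matched by the explanation predicate
        out[r["inst"]] = out.get(r["inst"], 0) + r["amount"]
    return out

# Hypothetical award tuples (amounts in $M).
rows = [
    {"inst": "UIUC", "amount": 900, "div": "ACI"},
    {"inst": "UIUC", "amount": 270, "div": "CCF"},
    {"inst": "CMU",  "amount": 30,  "div": "ACI"},
    {"inst": "CMU",  "amount": 440, "div": "CCF"},
]

before = totals(rows)                              # totals with all tuples
after = totals(rows, lambda r: r["div"] == "ACI")  # totals without [div = 'ACI'] tuples
```

If the gap between the two institutions is much smaller in `after` than in `before`, the predicate is an effective explanation for the difference.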
Main Issues to Consider
1. Which domain to consider? Explanations of what? (e.g. aggregates, database queries, outliers)
2. How to represent explanations? (attribute-value combinations, predicates, visualisations)
3. How to find explanations efficiently? (SQL data systems, big-data engines)
4. How to rank explanations?
Domain
● Streaming IoT data from sensors: automatically detecting outliers and providing explanations for those outliers [Bailis et al., 2016]
● Allowing the user to interactively select outliers in aggregate results and providing explanations [Wu-Madden, 2013]
● Answers to database queries [Roy et al., 2015]
● Given relation instances augmented with a binary feature, explaining the outcome of the binary feature [El Gebaly et al., 2015]
● Explaining incorrect results in knowledge bases [Wang et al., 2015]
Representation of Explanations
1. (Attribute, value) pairs, e.g. (SensorID, 3)
2. Predicates, e.g. (SensorID == 3 && avg(volt) >= 2.5)
3. Lists of (attribute, value) or (attribute, *), where "*" matches all values [El Gebaly et al., 2015]
4. Simple pre-defined descriptions [Roy et al., 2015]
Efficiently Finding Explanations
● Frequent itemsets and a heavy-hitters sketch [Bailis et al., 2016]
● A feature-selection approach [Wang et al., 2015]
● A decision-tree approach [Wu-Madden, 2013]
● Storing explanations as relations and then executing JOINs [Roy et al., 2015]
● Using the CUBE operator to generate patterns [El Gebaly et al., 2015]
Frequent Itemsets and Heavy-Hitters Sketch [Bailis et al., 2016]
● A heavy-hitters sketch (AMC) maintains approximate counts for single attributes.
● Combinations of attributes are handled with an M-CPS tree and FPGrowth: a similar frequency-based tree is maintained and updated in a streaming fashion, then mined using the FPGrowth algorithm.
Image source: https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Frequent_Pattern_Mining/The_FP-Growth_Algorithm
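The single-attribute counting can be sketched with a standard heavy-hitters counter. The Space-Saving-style eviction below is a simplified stand-in, not MacroBase's actual AMC sketch; the `HeavyHitters` class and its method names are hypothetical.

```python
# Minimal heavy-hitters sketch (Space-Saving style); an illustrative sketch,
# not the AMC implementation from [Bailis et al., 2016].
class HeavyHitters:
    def __init__(self, capacity):
        self.capacity = capacity  # max number of tracked items
        self.counts = {}

    def update(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
        else:
            # Evict the minimum-count item; the new item inherits its count + 1,
            # so all stored counts are overestimates bounded by the evicted count.
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1

    def heavy_hitters(self, threshold):
        return {k: v for k, v in self.counts.items() if v >= threshold}

hh = HeavyHitters(capacity=3)
for value in ["a"] * 5 + ["b"] * 3 + ["c", "d", "a"]:
    hh.update(value)
```

Only a fixed number of counters is kept, so the sketch can run over an unbounded stream of attribute values.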
Decision Tree Approach [Wu-Madden, 2013]
● Partitions the input set of tuples into a number of predicates (= partitions) such that the 'influence' within each partition is maximised.
● Adjacent predicates are then merged to find a better one.
● Sampling is also used, so there is no need to go through all (attribute, value) tuples.
Feature-Selection Approach [Wang et al., 2015]
● Each feature (attribute, value) is associated with an error rate ε_i, derived directly from the number of incorrect elements (which are known beforehand).
● A cost is then computed for each feature; this cost is used to select the features that best explain the incorrect elements (selection follows a top-down iterative approach).
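The idea can be illustrated with a toy error-rate computation and a greedy selection. The cost model below is an assumed simplification (lower cost when a feature covers mostly incorrect elements), not Data X-Ray's actual cost function; `error_rate`, `select_features`, and the toy elements are hypothetical.

```python
# Illustrative error-rate-based feature selection; NOT Data X-Ray's cost model.
def error_rate(feature, elements, incorrect):
    """Fraction of elements covered by `feature` that are correct (lower is a
    stronger signal that the feature explains the errors)."""
    covered = [e for e in elements if feature in e["features"]]
    if not covered:
        return 1.0
    wrong = sum(1 for e in covered if e["id"] in incorrect)
    return 1.0 - wrong / len(covered)

def select_features(elements, incorrect, k=1):
    # Greedy top-down selection: pick the k lowest-cost features.
    features = {f for e in elements for f in e["features"]}
    ranked = sorted(features, key=lambda f: error_rate(f, elements, incorrect))
    return ranked[:k]

# Toy data: elements 1 and 2 are known to be incorrect.
elements = [
    {"id": 1, "features": {("sensor", 3)}},
    {"id": 2, "features": {("sensor", 3)}},
    {"id": 3, "features": {("sensor", 5)}},
]
incorrect = {1, 2}
```

Here `select_features(elements, incorrect)` picks `("sensor", 3)`, since every element it covers is incorrect.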
Ranking Explanations
1. Scorpion [Wu-Madden, 2013]: explanations (= predicates over aggregates) are ranked by their influence, defined as the ratio between the change in the output and the number of tuples that satisfy the predicate.
2. MacroBase [Bailis et al., 2016]: explanations (= attribute-value combinations) are ranked by their frequency.
3. Data X-Ray [Wang et al., 2015]: explanations (= features) are ranked by their diagnosis cost.
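The influence ratio from item 1 can be written down directly. This is a simplified sketch of that ratio, not Scorpion's full influence definition; the `influence` helper and the toy tuples are assumptions.

```python
# Influence of a predicate = change in aggregate output when the matching
# tuples are removed, divided by the number of matching tuples (sketch).
def influence(tuples, predicate, agg=sum):
    matching = [t for t in tuples if predicate(t)]
    if not matching:
        return 0.0
    full = agg(t["value"] for t in tuples)
    without = agg(t["value"] for t in tuples if not predicate(t))
    return abs(full - without) / len(matching)

# Toy data: one extreme tuple flagged "hot" dominates the sum.
tuples = [{"value": 10}, {"value": 10}, {"value": 100, "hot": True}]
hot = lambda t: t.get("hot", False)
```

A predicate that removes a single tuple yet shifts the output a lot (here, `hot`) gets a high influence score, which is exactly the kind of explanation Scorpion ranks first.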
Research Ideas: Domain
Assisting data exploration by allowing data scientists to pose the explanation question along with the aggregate query.
A common exploratory technique is to request a COUNT over range predicates.

Spark example, on a dataset containing tuples of (x, y) coordinates, with the range predicate -0.5 <= x <= 0.5 && -0.5 <= y <= 0.5:

rdd.filter((x, y) -> range_predicate(x, y)).count()

Result: 678
But what are the underlying sub-ranges that contribute the most to this result?
Representation
1. A list of sub-range predicates showing the COUNT result for each one:

   range_dim1      range_dim2      COUNT
   [-0.3, -0.1]    [0.2, 0.3]      260
   [-0.1, 0.1]     [-0.2, 0.1]     100

   The user can select 'k' to show only the top-k ranges, and can choose whether sub-ranges may overlap.
2. Visualisations.
Finding Explanations
1. Naive solution: split the filtered dataset into 'k' partitions, get min(x, y) and max(x, y) for each partition, then construct a bounding box and get the COUNT result for each one.
2. Partition the filtered dataset with clustering algorithms: probabilistic (mixture models), hard-assignment (k-means), or k-independent methods.
3. Recursive partitioning (regression/decision trees), partitioning based on COUNT and the minimum distance between points.
4. Construct a histogram.
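The naive solution (step 1) can be sketched as follows, under the assumption that "split into k partitions" means equal-size chunks of the points sorted by x; `naive_subranges` is a hypothetical helper, not an implementation from the cited work.

```python
# Naive sub-range explanation (sketch): equal-size partitions of the sorted
# points, each summarised by its bounding box and COUNT contribution.
def naive_subranges(points, k):
    pts = sorted(points)                    # sort by x, then y
    size = (len(pts) + k - 1) // k          # ceil(len / k) points per partition
    out = []
    for i in range(0, len(pts), size):
        chunk = pts[i:i + size]
        xs = [p[0] for p in chunk]
        ys = [p[1] for p in chunk]
        # (x-range, y-range, COUNT) for this partition's bounding box.
        out.append(((min(xs), max(xs)), (min(ys), max(ys)), len(chunk)))
    return out

boxes = naive_subranges([(0.1, 0.2), (0.15, 0.25), (-0.3, 0.0), (-0.2, 0.4)], k=2)
```

Each returned triple is one row of the sub-range table from the Representation slide: a range per dimension plus the COUNT it contributes to the overall result.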
Major Differences from the Other Work
1. Not dealing with SQL systems, as [El Gebaly et al., 2015] and [Roy et al., 2015] do.
2. No need to compute backwards provenance as in [Wu-Madden, 2013], because the explanation is provided along with the aggregate result.
3. The user does not have to specify whether a given result is high or low [Wu-Madden, 2013]; in some cases the user has no idea whether a given value is high or low for a dataset.
Thank you for your time.
References
1. Bailis, Peter, Deepak Narayanan, and Samuel Madden. "MacroBase: Analytic Monitoring for the Internet of Things." arXiv preprint arXiv:1603.00567 (2016).
2. Wu, Eugene, and Samuel Madden. "Scorpion: Explaining Away Outliers in Aggregate Queries." Proceedings of the VLDB Endowment 6.8 (2013): 553-564.
3. El Gebaly, Kareem, et al. "Interpretable and Informative Explanations of Outcomes." Proceedings of the VLDB Endowment 8.1 (2014): 61-72.
4. Wang, Xiaolan, Xin Luna Dong, and Alexandra Meliou. "Data X-Ray: A Diagnostic Tool for Data Errors." Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015.
5. Roy, Sudeepa, Laurel Orr, and Dan Suciu. "Explaining Query Answers with Explanation-Ready Databases." Proceedings of the VLDB Endowment 9.4 (2015): 348-359.