10/30/17
Closing the Loop on Data Analysis
eugenewu.net
Assistant Professor, Columbia University
Data Science Institute
Agenda
Smoke: Fast Lineage + Interactions
Precision Interfaces: Interface for All Analyses
Scorpion: Explaining Outliers
[Figure: DB, SQL, Vis, BI, ML, … and “The World”. Illustration from Unflattening, Nick Sousanis]
Data Visualization Management System (DVMS)
Co-design end-to-end, human-in-the-loop data analysis
DVMS Projects
[Figure: DB, BI, ML, … and “The World”]
Why?
  Scorpion: Explaining Outliers [VLDB ’13]
  NeuroFlash: Explaining Neural Networks [in progress]
  Explaining Social Media Popularities [in progress]
How to create and scale?
  DeVIL: Use human limits [CIDR ’17]
  SMOKE: Lineage for Interactive Vis [revision]
  VISTREAM: Prefetching architecture [in progress]
Prep the data?
  ActiveClean: Interactive Cleaning for ML [VLDB ’16]
  QFix: Cleaning past queries [SIGMOD ’17]
  PreCog: Quality Push-down [in review]
What interfaces to create?
  PI: Scalable Interface Generation [HILDA ’17]
  S4: Spreadsheet-style search [SIGMOD ’15]
  PVD: Physical Visualization Design [in progress]
Agenda
Smoke: Fast Lineage + Interactions
Precision Interfaces: Interface for All Analyses
Scorpion: Explaining Outliers
Smoke: Fast Lineage + Interactions
[Figure: linked views (Result 1, Result 2) over Revenue, Profit, Price, and Product. Labels: backward_trace(), forward_trace(), view_refresh()]
refresh(backward_trace( , input)) ⨝
SPLOT = SELECT 8 AS radius, 'gray' AS stroke, 'gray' AS fill,
               lscale(revenue, sx) AS center_x,
               lscale(profit, sy) AS center_y
        FROM A, B, sx, sy WHERE …;
HIST = SELECT 4 AS width, 'blue' AS fill, hscale(price, hx) AS height
       FROM B, C, hx WHERE …;
render(SELECT * FROM SPLOT);
render(SELECT * FROM HIST);
BT = BACKWARD TRACE FROM HIST@vnow-1 AS HS, clicked
     WHERE clicked.id = HS.id
     TO A;
SPLOT = SELECT ..., 'red' AS fill FROM BT, B WHERE …
        UNION
        SELECT ..., 'gray' AS fill FROM (A EXCEPT BT), B WHERE …
HIST = SELECT ..., 'red' AS fill FROM BT, C WHERE …
       UNION
       SELECT ..., 'blue' AS fill FROM (A EXCEPT BT), C WHERE …
interaction(vis(database))
SQL(Lineage(SQL))
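The click-to-highlight interaction above can be sketched outside SQL. The following is a minimal Python sketch, not Smoke's implementation: the rows, the column order, and the helper names `build_hist`, `backward_trace`, and `refresh_splot` are all assumptions made for illustration. It shows the pattern refresh(backward_trace(selection, input)): the histogram keeps, per bar, the input rows that produced it, and a click recolors the scatterplot marks derived from those same rows.

```python
# Hypothetical input rows: (product, revenue, profit, price).
ROWS = [("p1", 10, 3, 5), ("p2", 20, 8, 5), ("p3", 15, 4, 9)]


def build_hist(rows):
    """Count rows per price and remember which input rows fed each bar."""
    bars, lineage = {}, {}
    for rid, (_, _, _, price) in enumerate(rows):
        bars[price] = bars.get(price, 0) + 1
        lineage.setdefault(price, []).append(rid)
    return bars, lineage


def backward_trace(lineage, clicked_bar):
    """Return the ids of the input rows behind the clicked bar."""
    return set(lineage.get(clicked_bar, []))


def refresh_splot(rows, traced):
    """Re-render scatterplot marks, coloring traced rows red."""
    return [
        {"center_x": rev, "center_y": prof,
         "fill": "red" if rid in traced else "gray"}
        for rid, (_, rev, prof, _) in enumerate(rows)
    ]


bars, lineage = build_hist(ROWS)
traced = backward_trace(lineage, clicked_bar=5)
marks = refresh_splot(ROWS, traced)
```

Here the lineage is captured while the histogram is built, so the click handler does no scanning of the input.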
Fine-grained Lineage Capture
Query: γ_{id, SUM(qty ∗ $)}(A ⨝ B)

A
      id  $
a1    1   40
a2    2   5

B
      id  qty
b1    1   6
b2    1   1
b3    2   9

A ⨝ B
      id  qty  $
j1    1   6    40
j2    1   1    40
j3    2   9    5

Output of γ
      id  sum
o1    1   280
o2    2   45

Capture lineage graph w/ low overhead to answer lineage queries efficiently
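To make the running example concrete, here is a minimal Python sketch that evaluates the query over A and B and records, for each output group, which input records contributed to it. The dictionary layout and the name `run_with_lineage` are assumptions for illustration; the slides do not prescribe this representation.

```python
from collections import defaultdict

# Base relations from the running example, keyed by record id.
A = {"a1": {"id": 1, "price": 40}, "a2": {"id": 2, "price": 5}}
B = {"b1": {"id": 1, "qty": 6}, "b2": {"id": 1, "qty": 1},
     "b3": {"id": 2, "qty": 9}}


def run_with_lineage(a_rel, b_rel):
    """Evaluate SUM(qty * price) GROUP BY id over A join B, with lineage."""
    sums = defaultdict(int)
    lineage = defaultdict(lambda: {"A": set(), "B": set()})
    for a_rid, a in a_rel.items():
        for b_rid, b in b_rel.items():
            if a["id"] == b["id"]:  # join condition
                sums[a["id"]] += b["qty"] * a["price"]
                lineage[a["id"]]["A"].add(a_rid)
                lineage[a["id"]]["B"].add(b_rid)
    return dict(sums), dict(lineage)


result, lineage = run_with_lineage(A, B)
```

The sums match the output table (280 for id 1, 45 for id 2), and `lineage[1]["B"]` is the answer to a backward trace from o1 to B.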
How do people capture lineage today?
Lazy, aka don’t capture
Eager via query rewrites
Eager via instrumentation
Lazy Approach [Cui et al. and Ikeda et al.]
Rewrite lineage Qs into SQL, e.g. Backward_trace(o1, B) = σ_{id=1}(B)
PROS
  No capture overhead
  Good for high-selectivity
CONS
  Bad for low-selectivity
  No support for non-invertible ops
  Complex rewrite predicates
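The lazy rewrite can be sketched in a few lines. This is an illustrative Python sketch under assumed names (`OUTPUT_KEYS`, `backward_trace_lazy`), not the rewrite rules of the cited systems: nothing is recorded while the query runs, and a backward trace from o1 to B becomes the selection σ_{id=1}(B), evaluated by scanning B at trace time.

```python
# Relation B from the running example: (rid, id, qty).
B = [("b1", 1, 6), ("b2", 1, 1), ("b3", 2, 9)]

# Group key of each output record; this is all the lazy approach needs.
OUTPUT_KEYS = {"o1": 1, "o2": 2}


def backward_trace_lazy(out_rid, relation):
    """Answer a backward trace by re-running a selection on the input."""
    key = OUTPUT_KEYS[out_rid]
    return [row for row in relation if row[1] == key]


traced = backward_trace_lazy("o1", B)
```

The trade-off is visible in the code: the base query pays nothing, while every trace pays for a scan of the input.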
Eager Logical Denormalized [Perm, GProM, and DBNotes]
Rewrite original query into single big query (⨝’ γ’)

      output      A          B
      id  $       pid  $     pid  qty
o1    1   280     1    40    1    6
o2    1   280     1    40    1    1
o3    2   45      2    5     2    9

PROS
  Leverage DB query opts
  Flexibility
  Use existing database
CONS
  Introduces redundancy
  Result must be further processed
    Index result to use it
    Addtl projection to get real result
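The denormalized table above can be reproduced with a short Python sketch. The function name `denormalized` and the tuple layout are assumptions for illustration, and this is not how Perm, GProM, or DBNotes perform the rewrite. It shows the two costs listed under CONS: the output value is repeated once per contributing input pair, and the real result needs an extra projection.

```python
A = [(1, 40), (2, 5)]         # (id, price)
B = [(1, 6), (1, 1), (2, 9)]  # (id, qty)


def denormalized(a_rel, b_rel):
    """Return output rows widened with the A and B rows that produced them."""
    totals = {}
    for a_id, price in a_rel:
        for b_id, qty in b_rel:
            if a_id == b_id:
                totals[a_id] = totals.get(a_id, 0) + qty * price
    # One wide row per (output, A row, B row) combination.
    return [
        (a_id, totals[a_id], a_id, price, b_id, qty)
        for a_id, price in a_rel
        for b_id, qty in b_rel
        if a_id == b_id
    ]


wide = denormalized(A, B)
# Additional projection (with duplicate elimination) to get the real result.
real_result = sorted({(row[0], row[1]) for row in wide})
```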
Eager Logical Normalized [Trio, and DBNotes]
Operators ⨝ and γ; lineage is kept in separate lineage tables, e.g.
A  O
1  1
2  2
PROS
  Reduces redundancy
  Easily add annotations
  Use existing database
CONS
  Extra Qs to make lineage tables
  Need to index lineage tables
  Lineage tracing requires join

Eager Physical [Subzero, NewT, Ramp, Chothia et al., Titian]
Operators ⨝’ and γ’; a Lineage Subsystem stores pointers such as (a1, j1) and (j1, o1)
PROS
  Avoid relational overheads
  Control over physical rep
  Easier to integrate
CONS
  RPC/virtual function calls expensive
  Write-inefficient lineage storage
  No physical plan optimizations
Can Lineage Express Interactions?
Performance issues
  High capture overhead
  Slow lineage tracing
Issues come from:
  Redundant work
  Inefficient representations
  Per-pointer overheads
Are these issues necessary? No. See Smoke.
4 Design Principles
Make capture fast
  Tight Integration: operator instrumentation; write-efficient lineage idxs
  Reuse work: lineage indexes ≈ hash tables; intra-plan hash table reuse
Workload-based optimizations (lineage workload)
  Apriori Knowledge: don’t capture if not used
  Lineage Consumption: push computation into lineage capture
Smoke Overview
Eager physical approach
Write- and read-efficient lineage indexes
Two instrumentation approaches
Lineage Index Representation
[Figure: an operator (Op) between input records r1, r2, r3, … and output records o1, o2, o3, o4. 1-to-1 relationships use an rid array (one rid per entry); N-to-1 relationships use an rid index (a list of rids per entry)]
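The two index shapes can be illustrated with a minimal Python sketch. The operator choices (a selection for the 1-to-1 case, a group-by for the N-to-1 case) and all function names are assumptions for illustration; Smoke's actual indexes are built inside a compiled engine.

```python
ROWS = [(1, 6, 40), (1, 1, 40), (2, 9, 5)]  # (id, qty, price); rid = position


def select_with_rid_array(rows, predicate):
    """1-to-1: each output row points at exactly one input rid."""
    out, rid_array = [], []
    for rid, row in enumerate(rows):
        if predicate(row):
            out.append(row)
            rid_array.append(rid)  # rid_array[i] is the input of out[i]
    return out, rid_array


def group_with_rid_index(rows, key):
    """N-to-1: each output group points at a list of input rids."""
    position, out_keys, rid_index = {}, [], []
    for rid, row in enumerate(rows):
        k = key(row)
        if k not in position:
            position[k] = len(out_keys)
            out_keys.append(k)
            rid_index.append([])
        rid_index[position[k]].append(rid)
    return out_keys, rid_index


selected, rid_array = select_with_rid_array(ROWS, lambda r: r[1] > 1)
groups, rid_index = group_with_rid_index(ROWS, lambda r: r[0])
```

A backward trace is then a single array lookup by output position in either case.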
Two Capture Approaches
Input
      id  qty  $
j1    1   6    40
j2    1   1    40
j3    2   9    5
Output of γ
      id  sum
o1    1   2
o2    2   1
Defer: plain γ, followed by a join (⋈) of the output with the input
Inject: instrumented γ’
[Figure: both approaches are annotated with the values 1, 1, 2]
Inject Capture for GROUPBY
γ runs as two phases: γbuild, then γagg
[Figure, three steps over the input j1, j2, j3:
 1. γbuild reads j1 and creates the hash entry (1, 1)
 2. After the build, the entries are (1, 2) with rids j1, j2 and (2, 1) with rid j3
 3. γagg emits o1 = (1, 2) and o2 = (2, 1); the rid lists link each output to its inputs]
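The three steps above can be sketched as follows. This is a minimal Python sketch under assumptions: the aggregate is a count (matching the 2 and 1 in the output table), and the names `groupby_count_inject`, `backward`, and `forward` are hypothetical. The point it illustrates is that the group-by hash table is reused as the lineage index, so capture adds one append per input row.

```python
ROWS = [("j1", (1, 6, 40)), ("j2", (1, 1, 40)), ("j3", (2, 9, 5))]


def groupby_count_inject(rows):
    # Build phase: the usual hash table, plus an injected rid list per entry.
    table = {}
    for rid, (gid, _qty, _price) in rows:
        entry = table.setdefault(gid, {"count": 0, "rids": []})
        entry["count"] += 1
        entry["rids"].append(rid)
    # Aggregate phase: emit outputs and finalize both lineage directions.
    output, backward, forward = [], [], {}
    for gid, entry in table.items():
        oid = len(output)
        output.append((gid, entry["count"]))
        backward.append(entry["rids"])   # output position -> input rids
        for rid in entry["rids"]:
            forward[rid] = oid           # input rid -> output position
    return output, backward, forward


output, backward, forward = groupby_count_inject(ROWS)
```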
Experiments
Setup
  Custom in-memory query compiled engine
  Execution comparable with MonetDB
  Smoke vs Logical vs Subsystem vs Lazy
  TPC-H, synthetic, and cross-filter
Smoke is fast
  Lowest capture overhead
  Fastest tracing & lineage query perf
  Interactive capture and tracing speeds
Capture Overhead GROUPBY
SELECT z, COUNT(*), SUM(v), SUM(v*v), SUM(sqrt(v)), MIN(v), MAX(v)
FROM zipf
GROUP BY z
Smoke-I best overall → 0.7x overhead
Array resizing → ~1/2 of Smoke overhead
Write-inefficient idxs → ~4x overhead
Virtual function calls → ~1.6x
Logical penalized by denormalized representation

Cross Filtering Experiments
[Figures: cross-filtering results]
Stepping Back
Lineage capture for high-throughput workflows need not cripple normal execution
Lineage capture can directly create idxs and pre-compute results for future Qs
  Smells like cracking. Working on partial data cubes
Lineage tracing Qs fast enough for interactive vis
Extending to other applications, e.g., ML
Agenda
Smoke: Fast Lineage + Interactions
Precision Interfaces: Interface for All Analyses
Scorpion: Explaining Outliers
Interfaces Make Life Easy
SQL
Building Interfaces
Specs, Engineering
It takes work!
Interfaces for Everybody
[Figure: x-axis: task i; y-axis: # people who do the task. Labels: SQL, Worth building, Not worth building]
Our Vision
An Interface for Every Task
Existing Approach 1: Help Developers Build Interfaces
Specs, Engineering
Existing Approach 2: Non-programmers Program
Specs, Engineering
Precision Interfaces (PI)
Read Minds, Gen. Interface
Precision Interfaces (PI)
Mine Logs (SQL, SPARQL), Gen. Interface
What is an Interface?
Vis renders program output
Interactions change program
Program1, Program2, Program3, …
Precision Interfaces
Stages: Interaction Mining, Interface Generation
Interaction Mining
  Detect interactions in log
  Interaction(P) = P’
  Proxy: program differences expressible by interactions
  Subtree transform Ti
  Interaction Graph
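The idea of treating program differences as subtree transforms can be sketched with a small tree diff. This is a minimal Python sketch under assumptions: queries are modeled as nested tuples `(label, child, ...)`, the example query and the table name `ontime` are hypothetical, and `diff_subtrees` is an illustrative helper rather than the mining algorithm from the Precision Interfaces work.

```python
def diff_subtrees(p, q, path=()):
    """Return (path, old, new) for each minimal subtree where p and q differ."""
    if p == q:
        return []
    # Leaves, different labels, or different arity: replace the whole subtree.
    if (not isinstance(p, tuple) or not isinstance(q, tuple)
            or p[0] != q[0] or len(p) != len(q)):
        return [(path, p, q)]
    diffs = []
    for i, (pc, qc) in enumerate(zip(p[1:], q[1:])):
        diffs.extend(diff_subtrees(pc, qc, path + (i,)))
    return diffs


# Two consecutive queries from a hypothetical log; only the month changes.
q1 = ("select", ("project", "carrier"), ("from", "ontime"),
      ("where", ("=", "month", 1)))
q2 = ("select", ("project", "carrier"), ("from", "ontime"),
      ("where", ("=", "month", 2)))

transforms = diff_subtrees(q1, q2)
```

The single transform found here replaces the literal 1 with 2 at one fixed path, which is the kind of change a slider or dropdown could express.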
Interface Generation
[Figure: interaction graph over queries Q1 … Q8]
Visually simpler, less efficient
Visually complex, more efficient
~Set Cover
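The set-cover framing can be illustrated with the standard greedy heuristic for weighted set cover. This is a sketch under loud assumptions: the widget names, their costs, and the transform ids t1 … t5 are invented for illustration, and the slides only say the problem resembles set cover, not that this heuristic is what PI uses.

```python
def greedy_widget_cover(transforms, widgets):
    """Pick widgets until every transform is expressible by a chosen widget.

    widgets maps a name to (cost, set of transforms it can express).
    """
    uncovered = set(transforms)
    chosen = []
    while uncovered:
        def score(name):
            cost, covers = widgets[name]
            return len(covers & uncovered) / cost  # newly covered per unit cost

        best = max(widgets, key=score)
        gain = widgets[best][1] & uncovered
        if not gain:
            raise ValueError(f"no widget expresses: {sorted(uncovered)}")
        chosen.append(best)
        uncovered -= gain
    return chosen


# Hypothetical candidates: one text box expresses everything but is costly
# to use; two cheaper widgets each express part of the log.
WIDGETS = {
    "text_box": (5.0, {"t1", "t2", "t3", "t4", "t5"}),
    "month_slider": (1.0, {"t1", "t2", "t3"}),
    "carrier_dropdown": (1.0, {"t4", "t5"}),
}

interface = greedy_widget_cover({"t1", "t2", "t3", "t4", "t5"}, WIDGETS)
```

Changing the costs shifts the result between one generic widget and several specialized ones, which mirrors the simpler-versus-more-efficient trade-off on the slide.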
Exp. 1 - Synthetic Data
Random walk through OLAP space, OnTime Flight Dataset
Simple design: more typing
Complex design: no typing, lots of clicking
Happy medium: drag & drop
Stepping Back
In the paper:
  More logs, e.g., Sloan Digital Sky Survey, company X
  2 languages: SQL and SPARQL
  Running user studies
In the future:
  Multi-language
  Leverage query plans + ASTs
  General visual log summarization
Interfaces generate queries... DVMS generates interfaces
Agenda
Smoke: Fast Lineage + Interactions
Precision Interfaces: Interface for All Analyses
Scorpion: Explaining Outliers
Scorpion: Data Explanation
Example: Data Cleaning
Attributes: sensor, light, voltage, humidity, temperature
54 sensors, 3.2k readings/hour
Q: Why are highlighted std(temp) pts so wacky?
A: “sensors with low voltage”
Other Questions
Q: “Why did Obama’s Oct. campaign spend $millions?”
A: company = GMMC
[Figure: spending over time; x-axis ticks Oct 11, Apr 12, Oct 12; y-axis ticks $0, $2M, $5M]
Q: “Why does one district test better than the rest?”
A: income > 100k ^ nchildren = 1
[Figure: Temp over Time; x-axis ticks 11, 12, 1; y-axis ticks 50, 100]

Time  Sensor  Volt  Humid  Temp
11    1       2.64  0.4    34
11    2       2.65  0.3    40
11    3       2.63  0.3    35
12    1       2.7   0.5    35
12    2       2.7   0.4    38
12    3       2.2   0.3    100
1     1       2.7   0.5    35
1     2       2.65  0.5    38
1     3       2.3   0.5    80

1. Label inputs of selected outliers
2. Label inputs of normal results
3. Apply rule-learning algorithm
   Volt < 2.5 & Sensor = 3
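The labeling steps above can be sketched on the example table. This is a minimal Python sketch, not a rule learner: it assumes the user selected the two hot readings (Temp 100 and 80), and it only scores candidate rules by precision and recall against those labels. The helper names `label_inputs` and `rule_quality` are hypothetical.

```python
READINGS = [  # (time, sensor, volt, humid, temp)
    (11, 1, 2.64, 0.4, 34), (11, 2, 2.65, 0.3, 40), (11, 3, 2.63, 0.3, 35),
    (12, 1, 2.7, 0.5, 35), (12, 2, 2.7, 0.4, 38), (12, 3, 2.2, 0.3, 100),
    (1, 1, 2.7, 0.5, 35), (1, 2, 2.65, 0.5, 38), (1, 3, 2.3, 0.5, 80),
]


def label_inputs(readings, selected):
    """Steps 1 and 2: split readings into outlier and normal inputs."""
    outliers = [r for r in readings if (r[0], r[1]) in selected]
    normal = [r for r in readings if (r[0], r[1]) not in selected]
    return outliers, normal


def rule_quality(rule, outliers, normal):
    """Score a candidate rule as (precision, recall) on the labels."""
    tp = sum(1 for r in outliers if rule(r))
    fp = sum(1 for r in normal if rule(r))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / len(outliers) if outliers else 0.0
    return precision, recall


# Assumed user selection: (time, sensor) of the two anomalous points.
outliers, normal = label_inputs(READINGS, {(12, 3), (1, 3)})
slide_rule = rule_quality(lambda r: r[2] < 2.5 and r[1] == 3, outliers, normal)
sensor_only = rule_quality(lambda r: r[1] == 3, outliers, normal)
```

The rule from the slide separates the labels perfectly, while a rule on the sensor alone also matches the normal reading of sensor 3 at time 11.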
Big data means can’t plot everything
SELECT time, AVG(Temp)
FROM readings
GROUP BY time
[Figure: AVG(Temp) over Time; x-axis ticks 11, 12, 1; y-axis ticks 50, 100]
Confounds the anomalous & normal readings!
Volt > 2.67
[Figure: DB, Process Data, and the AVG(Temp) over Time plot; x-axis ticks 11, 12, 1; y-axis ticks 50, 100]
IGNORE: Volt < 2.5 & sensor = 3
System: data cleaning for systematic errors
User: quality and error specification
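The effect of the IGNORE predicate can be checked on the nine example readings. This is a minimal Python sketch with assumed helper names (`avg_temp_by_time`, `low_voltage_sensor_3`); it recomputes AVG(Temp) per time with and without the readings that match the predicate.

```python
from collections import defaultdict

READINGS = [  # (time, sensor, volt, humid, temp)
    (11, 1, 2.64, 0.4, 34), (11, 2, 2.65, 0.3, 40), (11, 3, 2.63, 0.3, 35),
    (12, 1, 2.7, 0.5, 35), (12, 2, 2.7, 0.4, 38), (12, 3, 2.2, 0.3, 100),
    (1, 1, 2.7, 0.5, 35), (1, 2, 2.65, 0.5, 38), (1, 3, 2.3, 0.5, 80),
]


def avg_temp_by_time(readings, ignore=lambda r: False):
    """SELECT time, AVG(temp) ... GROUP BY time, skipping ignored readings."""
    acc = defaultdict(lambda: [0.0, 0])  # time -> [sum, count]
    for r in readings:
        if ignore(r):
            continue
        acc[r[0]][0] += r[4]
        acc[r[0]][1] += 1
    return {t: total / n for t, (total, n) in acc.items()}


def low_voltage_sensor_3(r):
    """The predicate from the slide: Volt < 2.5 & sensor = 3."""
    return r[2] < 2.5 and r[1] == 3


before = avg_temp_by_time(READINGS)
after = avg_temp_by_time(READINGS, ignore=low_voltage_sensor_3)
```

Ignoring the two matching readings brings the averages at times 12 and 1 down to 36.5 and leaves time 11 unchanged.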
Stepping Back
Vis + Cleaning is natural step
Scorpion: specific type of data cleaning
  User finds outliers, specifies through vis
  Generate program to delete culprits
Arachnida: generalizes ideas
  User finds any error, specifies through vis
  Generate broader classes of cleaning programs
From Presentation to Manipulation
[Figure: DB and Vis]
DVMS
Why?
  Scorpion: Explaining Outliers [VLDB ’13]
  NeuroFlash: Explaining Neural Networks [in progress]
  Explaining Social Media Popularities [in progress]
Prep?
  ActiveClean: Interactive Cleaning for ML [VLDB ’16]
  QFix: Cleaning past queries [SIGMOD ’17]
  PreCog: Quality Push-down [in review]
What?
  PI: Scalable Interface Generation [HILDA ’17]
  S4: Spreadsheet-style search [SIGMOD ’15]
  PVD: Physical Visualization Design [in progress]
How?
  DeVIL: Use human limits [CIDR ’17]
  SMOKE: Lineage for Interactive Vis [revision]
  VISTREAM: Prefetching architecture [in progress]
eugenewu.net
We are hiring!