
10/30/17 - Statistics Research, stats.research.att.com/nycseminars/slides/wu.pdf


10/30/17

Closing the Loop on Data Analysis

eugenewu.net
assistant professor, columbia university
data science institute

Agenda

Smoke: Fast Lineage + Interactions
Precision Interfaces: Interface for All Analyses
Scorpion: Explaining Outliers

[Diagram: DB, SQL, Vis, BI, ML, …, and “The World”]

[Image: Unflattening - Nick Sousanis]

[Diagram: DB, Vis, BI, ML, and “The World”]

Data Visualization Management System (DVMS)

Co-design end-to-end human-in-the-loop data analysis


[Diagram: DB, Vis, BI, ML, and “The World”, annotated with four questions]

How to create and scale?
What interfaces to create?
Prep the data?
Why?

DVMS Projects

Why?
  Scorpion: Explaining Outliers [VLDB ’13]
  NeuroFlash: Explaining Neural Networks [in progress]
  Explaining Social Media Popularities [in progress]

How?
  DeVIL: Use human limits [CIDR ’17]
  SMOKE: Lineage for Interactive Vis [revision]
  VISTREAM: Prefetching architecture [in progress]

Prep?
  ActiveClean: Interactive Cleaning for ML [VLDB ’16]
  QFix: Cleaning past queries [SIGMOD ’17]
  PreCog: Quality Push-down [in review]

What?
  PI: Scalable Interface Generation [HILDA ’17]
  S4: Spreadsheet-style search [SIGMOD ’15]
  PVD: Physical Visualization Design [in progress]

Agenda

Smoke: Fast Lineage + Interactions
Precision Interfaces: Interface for All Analyses
Scorpion: Explaining Outliers

Smoke: Fast Lineage + Interactions

[Figure, built up over four slides: panels labeled Result 1 and Result 2; chart axes Revenue, Profit, Price, and Product; operations backward_trace(), forward_trace(), and view_refresh().]

refresh(backward_trace( , input)) ⨝

SPLOT = SELECT 8 AS radius, 'gray' AS stroke, 'gray' AS fill,
               lscale(revenue, sx) AS center_x,
               lscale(profit, sy) AS center_y
        FROM A, B, sx, sy WHERE …;

HIST = SELECT 4 AS width, 'blue' AS fill, hscale(price, hx) AS height
       FROM B, C, hx WHERE …;

render(SELECT * FROM SPLOT);
render(SELECT * FROM HIST);

BT = BACKWARD TRACE FROM HIST@vnow-1 AS HS, clicked
     WHERE clicked.id = HS.id
     TO A;

SPLOT = SELECT ..., 'red' AS fill FROM BT, B WHERE …
        UNION
        SELECT ..., 'gray' AS fill FROM (A EXCEPT BT), B WHERE …

HIST = SELECT ..., 'red' AS fill FROM BT, C WHERE …
       UNION
       SELECT ..., 'blue' AS fill FROM (A EXCEPT BT), C WHERE …

interaction(vis(database))

SQL(Lineage(SQL))
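The linked-brushing pattern in the queries above (trace a clicked histogram bar back to its input rows, then re-derive the scatter plot with those rows in red) can be sketched in a few lines of Python. This is only an illustration: the sample rows, the `hist_lineage` mapping, and both function names are hypothetical and not part of Smoke or the DVMS.

```python
# Hypothetical rows of input relation A, keyed by row id (rid).
A = {
    0: {"revenue": 10, "profit": 2},
    1: {"revenue": 20, "profit": 5},
    2: {"revenue": 30, "profit": 9},
}

# Hypothetical backward lineage of the histogram: bar id -> rids of A.
hist_lineage = {"bar0": [0, 2], "bar1": [1]}


def backward_trace(lineage, clicked_ids):
    """Return the set of input rids that contributed to the clicked marks."""
    traced = set()
    for mark_id in clicked_ids:
        traced.update(lineage[mark_id])
    return traced


def refresh_splot(rows, traced):
    """Re-derive the scatter-plot marks: red if traced, gray otherwise."""
    return [
        {
            "rid": rid,
            "center_x": row["revenue"],
            "center_y": row["profit"],
            "fill": "red" if rid in traced else "gray",
        }
        for rid, row in sorted(rows.items())
    ]


bt = backward_trace(hist_lineage, ["bar0"])   # plays the role of BT
marks = refresh_splot(A, bt)                  # plays the role of the new SPLOT
fills = [m["fill"] for m in marks]
print(fills)
```

The conditional fill stands in for the UNION of the `BT` branch and the `A EXCEPT BT` branch in the SQL version.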

Fine-grained Lineage Capture

Query: γ_{id, SUM(qty * $)}(A ⨝ B)

A
     id   $
a1   1    40
a2   2    5

B
     id   qty
b1   1    6
b2   1    1
b3   2    9

A ⨝ B
     id   qty   $
j1   1    6     40
j2   1    1     40
j3   2    9     5

γ output
     id   sum
o1   1    280
o2   2    45

Capture lineage graph w/ low-overhead to answer lineage queries efficiently
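As a rough illustration of what fine-grained lineage means for this example, the sketch below evaluates the join and the grouped sum over the slide's tables A and B while recording, for each result, which input rows produced it. The dictionary layout, the name `price` for the `$` column, and the helper `backward_trace_to_B` are assumptions made for the example; this is not how Smoke stores lineage.

```python
from collections import defaultdict

A = {"a1": (1, 40), "a2": (2, 5)}                  # rid -> (id, price)
B = {"b1": (1, 6), "b2": (1, 1), "b3": (2, 9)}     # rid -> (id, qty)

# Join A and B on id, recording which input rids produced each join row.
joined = []         # join rows (id, qty, price), i.e. j1, j2, j3
join_lineage = []   # parallel list of (a_rid, b_rid)
for b_rid, (b_id, qty) in B.items():
    for a_rid, (a_id, price) in A.items():
        if a_id == b_id:
            joined.append((b_id, qty, price))
            join_lineage.append((a_rid, b_rid))

# GROUP BY id, SUM(qty * price), recording which join rows fed each group.
sums = defaultdict(int)
group_lineage = defaultdict(list)
for j, (gid, qty, price) in enumerate(joined):
    sums[gid] += qty * price
    group_lineage[gid].append(j)


def backward_trace_to_B(gid):
    """Rows of B that contributed to the output group with this id."""
    return [join_lineage[j][1] for j in group_lineage[gid]]


print(dict(sums))
print(backward_trace_to_B(1))
```

The sums reproduce the slide's output (280 for id 1, 45 for id 2), and tracing group 1 backward returns b1 and b2.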

How do people capture lineage today?
Lazy aka don’t capture
Eager via Query rewrites
Eager via Instrumentation

Lazy Approach

[Example tables A, B, A ⨝ B, and the γ output, as on the Fine-grained Lineage Capture slides]

Rewrite lineage qs into SQL
Backward_trace(o1, B) = σ_{id=1}(B)
[Cui et al. and Ikeda et al.]

PROS
No capture overhead
Good for high-selectivity

CONS
Bad for low-selectivity
No support for non-invertible ops
Complex rewrite predicates
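A minimal sketch of the lazy approach, assuming the example tables are loaded into an in-memory SQLite database: nothing is captured while the original query runs, and the backward trace for an output group is rewritten into a selection over B. The schema (an explicit `rid` column, `price` standing in for `$`) and the function name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (rid TEXT, id INTEGER, price INTEGER);
    CREATE TABLE B (rid TEXT, id INTEGER, qty INTEGER);
    INSERT INTO A VALUES ('a1', 1, 40), ('a2', 2, 5);
    INSERT INTO B VALUES ('b1', 1, 6), ('b2', 1, 1), ('b3', 2, 9);
""")


def backward_trace_lazy(conn, group_id):
    """Backward_trace(o, B) rewritten as the selection sigma_{id = o.id}(B)."""
    cur = conn.execute(
        "SELECT rid FROM B WHERE id = ? ORDER BY rid", (group_id,)
    )
    return [row[0] for row in cur.fetchall()]


print(backward_trace_lazy(conn, 1))
```

The rewrite is this simple only because grouping on `id` is invertible from the output row; the slide's cons (non-invertible operators, complex rewrite predicates) are the cases where no such selection exists or where it becomes expensive.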

Eager Logical Denormalized

Rewrite original query into single big query (⨝’, γ’ over A and B)

Rewritten query output:
     id   $     pid   $    pid   qty
o1   1    280   1     40   1     6
o2   1    280   1     40   1     1
o3   2    45    2     5    2     9

[Example tables A, B, and A ⨝ B, as on the Fine-grained Lineage Capture slides]

PROS
Leverage DB query opts
Flexibility
Use existing database

CONS
Introduces redundancy
Result must be further processed
Index result to use it
Addtl projection to get real result

[Perm, GProM, and DBNotes]

Eager Logical Normalized

[Example tables A, B, A ⨝ B, and the γ output, as on the Fine-grained Lineage Capture slides]

Lineage table:
A   O
1   1
2   2

PROS
Reduces redundancy
Easily add annotations
Use existing database

CONS
Extra Qs to make lineage tables
Need to index lineage tables
Lineage tracing requires join

[Trio and DBNotes]

Eager Physical

[Example tables A, B, A ⨝ B, and the γ output, as on the Fine-grained Lineage Capture slides; operators ⨝’ and γ’]

Lineage Subsystem
(a1, j1) (j1, o1)

PROS
Avoid relational overheads
Control over physical rep
Easier to integrate

CONS
RPC/virtual function calls expensive
Write-inefficient lineage storage
No physical plan optimizations

[Subzero, NewT, Ramp, Chothia et al., Titian]

Can Lineage Express Interactions?

Performance Issues
  High capture overhead
  Slow lineage tracing

Issues come from:
  Redundant work
  Inefficient representations
  Per-pointer overheads

Are these issues necessary?
No. See Smoke

4 Design Principles

Tight Integration
  Operator instrumentation
  Write efficient lineage idxs

Reuse work
  Lineage indexes ≈ Hash tables
  Intra-plan hash table reuse

Apriori Knowledge
  Don’t capture if not used

Lineage Consumption
  Push computation into lineage capture

Slide annotations: Make capture fast; Workload-based optimizations (lineage workload)

Smoke Overview

Eager physical approach
Write and read-efficient lineage indexes
Two instrumentation approaches

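Lineage Index Representation

[Figure: an operator (Op) between input records (r1, r2, r3) and output records (o1 to o4). N-to-1 lineage is stored as a rid index, one list of rids per record; 1-to-1 lineage is stored as a rid array, one rid per record.]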
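The two representations named on this slide can be mimicked with plain Python lists: a rid array holds one input rid per output position (1-to-1), and a rid index holds a list of input rids per output position (N-to-1). The sketch below is an assumption-laden illustration, not Smoke's physical layout; `filter_capture` and `forward_from_backward` are hypothetical helper names, and the rows are the (id, qty, $) join rows from the earlier example.

```python
def filter_capture(rows, pred):
    """Selection: each output has exactly one input, so a flat rid array suffices."""
    out, rid_array = [], []
    for rid, row in enumerate(rows):
        if pred(row):
            out.append(row)
            rid_array.append(rid)   # backward lineage of out[i] is rid_array[i]
    return out, rid_array


def forward_from_backward(rid_index, n_inputs):
    """Invert an N-to-1 backward rid index into a forward rid array."""
    forward = [None] * n_inputs
    for out_rid, in_rids in enumerate(rid_index):
        for in_rid in in_rids:
            forward[in_rid] = out_rid
    return forward


rows = [(1, 6, 40), (1, 1, 40), (2, 9, 5)]            # rids 0, 1, 2

# 1-to-1: keep rows with qty > 1.
out, rid_array = filter_capture(rows, lambda r: r[1] > 1)

# N-to-1: backward rid index of GROUP BY id (group 0 <- rids 0 and 1, group 1 <- rid 2).
rid_index = [[0, 1], [2]]
forward = forward_from_backward(rid_index, len(rows))

print(rid_array, forward)
```

Both lookups are positional, so a backward or forward trace is an array access rather than a join.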

Two Capture Approaches

[Figure: two panels, labeled Defer and Inject, each showing a GROUPBY over the input rows (id, qty, $) j1 (1, 6, 40), j2 (1, 1, 40), j3 (2, 9, 5) that produces the output rows (id, sum) o1 (1, 2) and o2 (2, 1). One panel shows the operator γ, the other γ’.]

Inject Capture for GROUPBY

[Figure, built up over three slides: γbuild reads the input rows (id, qty, $) j1 (1, 6, 40), j2 (1, 1, 40), j3 (2, 9, 5) into hash table entries (1, 2) and (2, 1), with input rids j1, j2, and j3 attached to the entries; γagg emits the output rows (id, sum) o1 (1, 2) and o2 (2, 1) together with their rid lists.]
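One way to read the Inject figure is that the group-by's own hash table is reused as the lineage index: the build phase appends each input rid to the entry it updates, and the output phase hands those lists over as backward lineage. The sketch below follows that reading for a COUNT(*) grouped by id; the function name and the in-memory layout are hypothetical, and it is not taken from Smoke's implementation.

```python
def groupby_inject(rows):
    """GROUP BY id, COUNT(*), with lineage captured inside the operator."""
    # Build phase: one hash table entry per group holds the running count
    # and the rids of the input rows that landed in that group.
    table = {}   # id -> [count, rids]
    for rid, (gid, qty, price) in enumerate(rows):
        entry = table.setdefault(gid, [0, []])
        entry[0] += 1
        entry[1].append(rid)

    # Output phase: emit one row per group and reuse the rid lists
    # as the backward lineage index (output position -> input rids).
    output, backward = [], []
    for gid, (count, rids) in table.items():
        output.append((gid, count))
        backward.append(rids)
    return output, backward


rows = [(1, 6, 40), (1, 1, 40), (2, 9, 5)]   # j1, j2, j3 as rids 0, 1, 2
output, backward = groupby_inject(rows)
print(output, backward)
```

No second pass over the input is needed, which is the sense in which lineage indexes are close to the hash tables the operator already builds.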

Experiments

Setup
  Custom in-memory query compiled engine
  Execution comparable with MonetDB
  Smoke vs Logical vs Subsystem vs Lazy
  TPC-H, synthetic, and cross-filter

Smoke is fast
  Lowest capture overhead
  Fastest tracing & lineage query perf
  Interactive capture and tracing speeds

Capture Overhead GROUPBY

SELECT z, COUNT(*), SUM(v), SUM(v*v), SUM(sqrt(v)), MIN(v), MAX(v)
FROM zipf
GROUP BY z

Smoke-I best overall → 0.7x overhead
Array resizing → ~1/2 of smoke overhead
Write-inefficient idxs → ~4x overhead
Virtual function calls → ~1.6x
Logical penalized by denormalized representation

Cross Filtering Experiments

Stepping Back

Lineage capture for high-throughput workflows need not cripple normal execution

Lineage capture can directly create idxs and pre-compute results for future Qs
  Smells like cracking. Working on partial data cubes

Lineage tracing Qs fast enough for interactive vis

Extending to other applications e.g., ML

Agenda

Smoke: Fast Lineage + Interactions
Precision Interfaces: Interface for All Analyses
Scorpion: Explaining Outliers

SQL

Interfaces Make Life Easy

Building Interfaces

Specs
Engineering
It takes work!

Interfaces for Everybody

[Chart: x-axis "task i", y-axis "# people who do the task"; regions labeled "Worth building" and "Not worth building"; label "SQL".]

Our Vision

An Interface for Every Task

[Chart: same axes, "task i" vs. "# people who do the task".]

Existing Approach 1: Help Developers Build Interfaces
Specs, Engineering

Existing Approach 2: Non-programmers Program
Specs, Engineering

Precision Interfaces (PI)
Read Minds, Gen. Interface

Precision Interfaces (PI)
Mine Logs (SQL, SPARQL), Gen. Interface

What is an Interface?

Vis renders program output
Interactions change program

Program1, Program2, Program3, …

Precision Interfaces (PI)

Precision Interfaces (PI)

Two stages: Interaction Mining, Interface Generation

Interaction Mining
  Detect interactions in log
  Interaction(P) = P’
  Proxy: program differences expressible by interactions
  Subtree transform Ti
  Interaction Graph
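To make "Interaction(P) = P’" and "subtree transform" concrete, here is a small sketch that diffs two query ASTs and reports the subtree that changed, then replays that change on the first query. The nested-tuple AST encoding, the sample queries, and the names `diff` and `apply_transform` are assumptions for illustration; the slides do not specify PI's actual representation or diff algorithm.

```python
def diff(p, q, path=()):
    """Return (path, old_subtree, new_subtree) triples where two ASTs differ."""
    if p == q:
        return []
    same_shape = (
        isinstance(p, tuple) and isinstance(q, tuple)
        and len(p) == len(q) and p[0] == q[0]
    )
    if not same_shape:
        return [(path, p, q)]          # replace this whole subtree
    changes = []
    for i in range(1, len(p)):         # index 0 is the node type
        changes.extend(diff(p[i], q[i], path + (i,)))
    return changes


def apply_transform(p, path, new):
    """Replace the subtree of p at path with new, returning a new AST."""
    if not path:
        return new
    i = path[0]
    return p[:i] + (apply_transform(p[i], path[1:], new),) + p[i + 1:]


# Hypothetical log entries, encoded as (node_type, child, ...).
q1 = ("select", ("proj", "avg(delay)"), ("from", "flights"),
      ("where", ("=", "month", 1)))
q2 = ("select", ("proj", "avg(delay)"), ("from", "flights"),
      ("where", ("=", "month", 2)))

changes = diff(q1, q2)
print(changes)
```

Here the only difference is the literal in the WHERE clause, the kind of change a slider or dropdown could express; applying the recorded transform to `q1` yields `q2`.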

Precision Interfaces (PI)

Two stages: Interaction Mining, Interface Generation

[Figure: Interaction Graph over queries Q1 to Q8]

Interface Generation trade-off
  Visually simpler, less efficient
  Visually complex, more efficient

~Set Cover
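The slide only says that interface generation is "~Set Cover". One plausible reading, sketched below, is a greedy weighted set cover: each candidate widget can express some edges of the interaction graph, each widget has a cost, and widgets are picked until every edge is expressible. The widget names, the edges each one covers, and the costs are all invented for the example, and the greedy heuristic is a standard technique, not necessarily the one PI uses.

```python
def greedy_cover(edges, widgets, cost):
    """Pick widgets by best coverage-per-cost until all edges are expressible."""
    uncovered = set(edges)
    chosen = []
    while uncovered:
        best = max(
            sorted(widgets),   # sorted for a deterministic tie-break
            key=lambda w: len(widgets[w] & uncovered) / cost[w],
        )
        if not widgets[best] & uncovered:
            raise ValueError("some interactions cannot be expressed")
        chosen.append(best)
        uncovered -= widgets[best]
    return chosen


# Hypothetical interaction-graph edges between logged queries.
edges = {("Q1", "Q2"), ("Q2", "Q3"), ("Q3", "Q4"), ("Q1", "Q5")}

# Hypothetical widgets and the edges each can express.
widgets = {
    "textbox": set(edges),                                   # typing expresses anything
    "slider": {("Q1", "Q2"), ("Q2", "Q3"), ("Q3", "Q4")},
    "dropdown": {("Q1", "Q5")},
}
cost = {"textbox": 4, "slider": 1, "dropdown": 1}            # hypothetical user effort

chosen = greedy_cover(edges, widgets, cost)
print(chosen)
```

Raising or lowering the textbox cost moves the result between the two extremes on the slide: a single general widget (visually simpler, less efficient) or several specialized ones (visually complex, more efficient).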

Exp. 1 - Synthetic Data

Random walk through OLAP space
OnTime Flight Dataset

Simple design: More typing
Complex design: No typing, lots of clicking
Happy medium: Drag & drop

Stepping Back

In the paper:
  More logs e.g., Sloan Digital Sky Survey, company X
  2 Languages: SQL and SPARQL

Running user studies

In the future:
  Multi-language
  Leverage query plans + ASTs
  General visual log summarization
  Interfaces generate queries... DVMS generates interfaces

Agenda

Smoke: Fast Lineage + Interactions
Precision Interfaces: Interface for All Analyses
Scorpion: Explaining Outliers

Scorpion: Data Explanation

Example: Data Cleaning

Attributes: sensor, light, voltage, humidity, temperature
54 sensors, 3.2k readings/hour

Why are highlighted std(temp) pts so wacky?

“sensors with low voltage”

Other Questions

Q: “Why did Obama’s Oct. campaign spend $millions?”
A: company = GMMC

[Chart: spending from $0 to $5M over time, with ticks at Oct 11, Apr 12, and Oct 12.]

Q: “Why does one district test better than the rest?”
A: income > 100k ^ nchildren = 1

Time   Sensor   Volt   Humid   Temp
11     1        2.64   0.4     34
11     2        2.65   0.3     40
11     3        2.63   0.3     35
12     1        2.7    0.5     35
12     2        2.7    0.4     38
12     3        2.2    0.3     100
1      1        2.7    0.5     35
1      2        2.65   0.5     38
1      3        2.3    0.5     80

[Plot: Temp (ticks at 50 and 100) vs. Time (11, 12, 1).]

1. Label inputs of selected outliers
2. Label inputs of normal results
3. Apply rule-learning algorithm

Volt < 2.5 & Sensor = 3
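The effect of the learned rule can be checked directly on the nine readings from the slide. The sketch below labels the inputs of the outlier times (12 and 1) and of the normal time (11), then compares AVG(Temp) per time with and without the rows matching Volt < 2.5 & Sensor = 3. It only evaluates a given predicate; it is not a rule-learning algorithm and not Scorpion's method, and the function names are hypothetical.

```python
from collections import defaultdict

# (time, sensor, volt, humid, temp), copied from the slide's table.
readings = [
    (11, 1, 2.64, 0.4, 34), (11, 2, 2.65, 0.3, 40), (11, 3, 2.63, 0.3, 35),
    (12, 1, 2.7, 0.5, 35), (12, 2, 2.7, 0.4, 38), (12, 3, 2.2, 0.3, 100),
    (1, 1, 2.7, 0.5, 35), (1, 2, 2.65, 0.5, 38), (1, 3, 2.3, 0.5, 80),
]

outlier_times = {12, 1}   # the results a user would select in the plot

# Steps 1 and 2: label each input row by whether it feeds an outlier result.
labeled = [(row, row[0] in outlier_times) for row in readings]


def avg_temp_by_time(rows):
    """Compute AVG(temp) grouped by time."""
    groups = defaultdict(list)
    for time, sensor, volt, humid, temp in rows:
        groups[time].append(temp)
    return {t: sum(v) / len(v) for t, v in groups.items()}


def rule(row):
    """The predicate shown on the slide: Volt < 2.5 & Sensor = 3."""
    time, sensor, volt, humid, temp = row
    return volt < 2.5 and sensor == 3


before = avg_temp_by_time(readings)
after = avg_temp_by_time([r for r in readings if not rule(r)])
matched = [row for row, is_outlier_input in labeled if rule(row)]

print(before, after)
```

Removing the two matching rows drops the averages at times 12 and 1 to 36.5 and leaves time 11 untouched, and every matching row is an input of an outlier result.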

Big data means can’t plot everything

[Same readings table as above; the plot now shows AVG(Temp) vs. Time (11, 12, 1).]

SELECT time, AVG(Temp)
FROM readings
GROUP BY time

Confounds the anomalous & normal readings!

Volt > 2.67

[Diagram, built up over three slides: DB, Process Data, and a plot of AVG(Temp) vs. Time.]

IGNORE: Volt < 2.5 & sensor = 3

System: Data cleaning for systematic errors
User: Quality and error specification

Stepping Back

[Diagram: DB, Vis, BI, ML]

Vis + Cleaning is natural step

Scorpion: specific type of data cleaning
  User finds outliers, specifies through vis
  Generate program to delete culprits

Arachnida: generalizes ideas
  User finds any error, specifies through vis
  Generate broader classes of cleaning programs

From Presentation to Manipulation

[Diagram: DB and Vis]

DVMS

Why?
  Scorpion: Explaining Outliers [VLDB ’13]
  NeuroFlash: Explaining Neural Networks [in progress]
  Explaining Social Media Popularities [in progress]

Prep?
  ActiveClean: Interactive Cleaning for ML [VLDB ’16]
  QFix: Cleaning past queries [SIGMOD ’17]
  PreCog: Quality Push-down [in review]

What?
  PI: Scalable Interface Generation [HILDA ’17]
  S4: Spreadsheet-style search [SIGMOD ’15]
  PVD: Physical Visualization Design [in progress]

How?
  DeVIL: Use human limits [CIDR ’17]
  SMOKE: Lineage for Interactive Vis [revision]
  VISTREAM: Prefetching architecture [in progress]

eugenewu.net
We are hiring!