
Page 1: Three Tools for "Human-in-the-loop" Data Science

Aditya Parameswaran
Assistant Professor
University of Illinois

http://data-people.cs.illinois.edu

Three Tools for “Human-in-the-loop” Data Science

Page 2: Three Tools for "Human-in-the-loop" Data Science

Many many contributors!

• PIs: Kevin Chang, Karrie Karahalios, Aaron Elmore, Sam Madden, Amol Deshpande (Spanning Illinois, UMD, MIT, Chicago)

• PhD Students: Mangesh Bendre, Himel Dev, John Lee, Albert Kim, Manasi Vartak, Liqi Xu, Silu Huang, Sajjadur Rahman, Stephen Macke

• MS Students: Vipul Venkataraman, Tarique Siddiqui, Chao Wang, Sili Hui

• Undergrads: Paul Zhou, Ding Zhang, Kejia Jiang, Bofan Sun, Ed Xue, Sean Zou, Jialin Liu, Changfeng Liu, Xiaofo Yu

Page 3: Three Tools for "Human-in-the-loop" Data Science

Scale is a Solved Problem

Most work in the database community is myopically focused on scale: the ability to pose SQL queries on larger and larger datasets.

My claim: scale is a solved problem.

Findings:
– Median job size at Microsoft and Yahoo is 16GB
– >90% of the jobs within Facebook are <100GB

The bottleneck is no longer our ability to pose SQL queries on large datasets!

Of course, exceptions exist: the “1%” of data analysis needs

Page 4: Three Tools for "Human-in-the-loop" Data Science

What about the Needs of the 99%?

The bottleneck is actually the “humans-in-the-loop”

As our data size has grown, what has stayed constant is:
• the time for analysis,
• the human cognitive load,
• the skills to extract value from data

There is a severe need for tools that can help analysts extract value from even moderately sized datasets

From “Big Data and Its Technical Challenges”, CACM 2014:

For big data to fully reach its potential, we need to consider scale not just for the system but also from the perspective of humans. We have to make sure that the end points—humans—can properly “absorb” the results of the analysis and not get lost in a sea of data.

Page 5: Three Tools for "Human-in-the-loop" Data Science

Need of the hour: Human-In-the-Loop Data Analytics Tools

HILDA tools:
• treat both humans and data as first-class citizens
• reduce human labor
• minimize complexity

[Figure: Venn diagram of Interaction, Data Mining, and Databases: taking the human perspective into account, going beyond SQL, with scalability/interactivity still important. The magic happens at the intersection.]

Page 6: Three Tools for "Human-in-the-loop" Data Science

A Maslow’s Hierarchy for HILDA

Background: Maslow developed a theory of what motivates individuals in 1943; it has been highly influential.

[Figure: pyramid ranging from Basic Needs at the bottom to Complex Needs at the top]

Page 7: Three Tools for "Human-in-the-loop" Data Science

A Maslow’s Hierarchy for HILDA

[Figure: pyramid with Touch & Feel at the base, Play & View in the middle, and Share & Collaborate at the top; sophistication of analysis increases toward the top]

Page 8: Three Tools for "Human-in-the-loop" Data Science

Touch and Feel: DataSpread is a spreadsheet-database hybrid

Goal: Marrying the flexibility and ease of use of spreadsheets with the scalability and power of databases

Enables the “99%” with large datasets but limited programming skills to open, touch, and examine their datasets

http://dataspread.github.io

[VLDB’15,VLDB’15,ICDE’16]

Page 9: Three Tools for "Human-in-the-loop" Data Science

Play and View: Zenvisage is an effortless visual exploration tool

Goal: “fast-forward” to visual patterns and trends, without having the analyst step through each visualization individually

Enables individuals to play with, and extract insights from, large datasets in a fraction of the time.

http://zenvisage.github.io

[TR’16,VLDB’16,VLDB’15,DSIA’15,VLDB’14,VLDB’14]

Page 10: Three Tools for "Human-in-the-loop" Data Science

Collaborate and Share:

OrpheusDB is a tool for managing dataset versions with a database

Goal: building a versioned database system to reduce the burden of recording datasets in various stages of analysis

Enables individuals to collaborate on data analysis, and share, keep track of, and retrieve dataset versions.

http://orpheus-db.github.io

[VLDB’16,VLDB’15,VLDB’15,TAPP’15,CIDR’15]

(also part of DataHub, a collaborative analysis system with MIT & UMD)

Page 11: Three Tools for "Human-in-the-loop" Data Science

This talk

About 10 minutes per system: overview + architecture + one key technical challenge

Common theme: if you torture databases enough, you can get them to do what you want!

[Figure: the HILDA pyramid again: Touch & Feel, Play & View, Share & Collaborate, with increasing sophistication of analysis]

Page 12: Three Tools for "Human-in-the-loop" Data Science

Page 13: Three Tools for "Human-in-the-loop" Data Science

Motivation

Most people doing ad-hoc data manipulation and analysis use spreadsheets, e.g., Excel.

Why?

• Easy to use: direct manipulation
• Built-in visualization capabilities
• Flexible: no need for a schema

Page 14: Three Tools for "Human-in-the-loop" Data Science

But Spreadsheets are Terrible!

– Slow
• a single change can mean waiting minutes on a 10,000 x 10 spreadsheet
• can’t even open a spreadsheet with >1M cells
• speed by itself can prevent analysis

– Tedious + not Powerful
• filters via copy-paste
• only FK joins via VLOOKUPs; others impossible
• even simple operations are cumbersome

– Brittle
• sharing Excel sheets around, no collaboration/recovery
• using spreadsheets for collaboration is painful and error-prone

Page 15: Three Tools for "Human-in-the-loop" Data Science

Let’s turn to Databases

Databases are:
• Slow → Scalable
• Tedious + not Powerful → Powerful and expressive (SQL)
• Brittle → Collaboration, recovery, succinct

So why not use databases? Well, for the same reason why spreadsheets are so useful:

• Easy to use → Not easy to use
• Built-in visualization → No built-in visualization
• Flexible → Not flexible

Page 16: Three Tools for "Human-in-the-loop" Data Science

Combining the benefits of spreadsheets and databases

Spreadsheet as a frontend interface
Databases as a backend engine

Result: retain the benefits of both!

But it’s not that simple…

Page 17: Three Tools for "Human-in-the-loop" Data Science

Different Ideologies

Databases and spreadsheets have different ideologies that need to be reconciled…

Due to this, the integration is not trivial…

Feature       | Databases                    | Spreadsheets
Data Model    | Schema-first                 | Dynamic/No schema
Addressing    | Tuples, with PK              | Cells, using Row/Col
Presentation  | Set-oriented, no such notion | Notion of current window, order
Modifications | Must correspond to queries   | Can be done at any granularity
Computation   | Query at a time              | Value at a time

Page 18: Three Tools for "Human-in-the-loop" Data Science

First Problem: Representation

Q: how do we represent spreadsheet data?

Dense spreadsheets: represent as tables (Row #, Col1 val, Col2 val, …)

Sparse spreadsheets: represent as triples (Row #, Column #, Value)
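
As a minimal SQL sketch of the two extreme layouts (table and column names here are hypothetical):

```sql
-- Dense region: one relational attribute per spreadsheet column.
CREATE TABLE sheet_dense (
    row_num  INT PRIMARY KEY,
    col1_val TEXT,
    col2_val TEXT
    -- ... one attribute per spreadsheet column
);

-- Sparse region: one triple per non-empty cell.
CREATE TABLE sheet_sparse (
    row_num INT,
    col_num INT,
    value   TEXT,
    PRIMARY KEY (row_num, col_num)
);
```

Dense tables make row access cheap but waste space on empty cells; triples store only non-empty cells but make range access slower.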

Page 19: Three Tools for "Human-in-the-loop" Data Science

First Problem: Representation

Q: how do we represent spreadsheet data?

Can we do even better than the two extremes? Yes!

Carve out dense areas and store them as tables; store sparse areas as triples.

Page 20: Three Tools for "Human-in-the-loop" Data Science

First Problem: Representation

However, even if we only use “tables”, carving out the ideal number of partitions (minimizing storage, modification, and access costs) is NP-hard, by reduction from minimum edge-length partition of rectilinear polygons.

Thankfully, we have a way out…

Page 21: Three Tools for "Human-in-the-loop" Data Science

Solution: Constrain the Problem

A new class of partitionings: recursive decomposition.

A very natural class of partitionings!

Page 22: Three Tools for "Human-in-the-loop" Data Science

Solution: Constrain the Problem

The optimal recursive-decomposition partitioning can be found in PTIME using dynamic programming.

This is still quadratic in the number of rows and columns; merging rows/columns with identical signatures brings the cost down to roughly the time for a single scan.
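
A sketch of the recurrence (the notation here is assumed, not from the talk): let cost(R) be the cost of storing a rectangular region R directly, as one table or as triples. Then

$$\mathrm{Opt}(R) \;=\; \min\Big\{ \mathrm{cost}(R),\;\; \min_{h} \big[\mathrm{Opt}(R_{\mathrm{above}\,h}) + \mathrm{Opt}(R_{\mathrm{below}\,h})\big],\;\; \min_{v} \big[\mathrm{Opt}(R_{\mathrm{left}\,v}) + \mathrm{Opt}(R_{\mathrm{right}\,v})\big] \Big\}$$

where h and v range over the horizontal and vertical cuts of R; memoizing Opt over sub-rectangles yields the PTIME bound.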

Page 23: Three Tools for "Human-in-the-loop" Data Science

Initial Progress and Architecture

Postgres backend
ZK Spreadsheet: an open-source web frontend

Comfortably scales to arbitrarily many rows + handles SQL queries

Hopefully we can bring spreadsheets to the big data age!

Page 24: Three Tools for "Human-in-the-loop" Data Science
Page 25: Three Tools for "Human-in-the-loop" Data Science

Standard Visual Data Analysis Recipe:

1. Load dataset into viz tool
2. Select viz to be generated
3. See if it matches desired visual pattern or insight
4. Repeat until you find a match

Page 26: Three Tools for "Human-in-the-loop" Data Science

Tedious and Time-consuming!

Page 27: Three Tools for "Human-in-the-loop" Data Science

Key Issue: Visualizations can be generated by
• varying subsets of data, and
• varying attributes being visualized

Too many visualizations to look at to find desired visual patterns!

Page 28: Three Tools for "Human-in-the-loop" Data Science

Motivation

This is a real problem!

• Advertisers at Turn – find keywords with similar CTRs to a specific one

• Bioinformaticians at an NIH genomics center – find aspects on which two sets of genes differ

• Battery scientists at CMU – find solvents with desired properties

Common theme: finding the “right” visualization can take several hours of combing through visualizations manually.

Page 29: Three Tools for "Human-in-the-loop" Data Science

Key Insight

We can automate that!
• instead of combing through visualizations manually,
• tell us what you want, and we can “fast-forward” to desired insights

Desiderata for automation:
• Expressive – the ability to specify what you want
• Interactive – interact with the results, catering to non-programmers
• Scalable – get interesting results quickly

Enter Zenvisage:(zen + envisage: to effortlessly visualize)

Page 30: Three Tools for "Human-in-the-loop" Data Science

Effortless Visual Exploration of Large Datasets with Zenvisage

Ingredients:
• Drag-and-drop and sketch-based interactions, to find specific patterns
• A sophisticated visual exploration language, ZQL, to ask more elaborate questions
• A scalable visualization generation engine: preprocessing, batched and parallel evaluation for interactive results
• Rapid pattern-matching algorithms: sampling-based techniques

Page 31: Three Tools for "Human-in-the-loop" Data Science

Screenshots

[Screenshot: Zenvisage UI showing attribute selection, a sketching canvas, matches, and typical trends and outliers, plus ZQL, the advanced exploration interface]

Page 32: Three Tools for "Human-in-the-loop" Data Science

Screenshots

Page 33: Three Tools for "Human-in-the-loop" Data Science

Challenges: One Specific Instance

Find visualizations on which two groups of data differ most.

Examples:
• find visualizations where solvent x differs from solvent y
• find visualizations where product x differs from product y

We represent a visualization using [d, m, f]:
• dimension d = x axis
• measure m = y axis
• function f = aggregate applied to y

Each [d,m,f] on a specific subset of data can be computed using a single SQL query.
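
For instance, as a sketch with a hypothetical products(product, region, year, sales, profit) table: the visualization [year, sales, SUM] for product x is a single aggregate query.

```sql
-- d = year (x axis), m = sales (y axis), f = SUM,
-- computed on the subset of data for product 'x'.
SELECT year, SUM(sales)
FROM products
WHERE product = 'x'
GROUP BY year;
```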

Page 34: Three Tools for "Human-in-the-loop" Data Science

Challenge: One Specific Instance

Find visualizations on which two groups of data differ most.

Naïve approach:
For each [d, m, f]: compute the visualization for both products (two SQL queries), then compare.
Pick the k best (“highest-utility”) [d, m, f].

Utility metric: we ignore how to compare for now, but there are many standard distance metrics.

Scale: 10s of dimensions, 10s of measures, a handful of aggregates → 100s of queries for a single user task!

Page 35: Three Tools for "Human-in-the-loop" Data Science

Issues w/ Naïve Approach

• Repeated processing of same data in sequence across queries → Sharing
• Computation wasted on low-utility visualizations → Pruning

Page 36: Three Tools for "Human-in-the-loop" Data Science

Sharing Optimizations

1. Minimize # of queries: group queries together
• Combine multiple aggregates: (d1, m1, f1), (d1, m2, f1) → (d1, [m1, m2], f1)
• Combine multiple group-bys: (d1, m1, f1), (d2, m1, f1) → ([d1, d2], m1, f1)

2. Minimize sequential execution: parallel query evaluation

A bit tricky! (See the SQL sketch below.)
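
As a sketch of the combining step, on the same hypothetical products table as before: visualizations sharing a dimension and function merge into one scan, and visualizations sharing a measure and function can merge via grouping sets.

```sql
-- (region, sales, AVG) and (region, profit, AVG):
-- one combined scan instead of two.
SELECT region, AVG(sales), AVG(profit)
FROM products
GROUP BY region;

-- (region, sales, AVG) and (year, sales, AVG):
-- combined using GROUPING SETS (supported by Postgres).
SELECT region, year, AVG(sales)
FROM products
GROUP BY GROUPING SETS ((region), (year));
```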

Page 37: Three Tools for "Human-in-the-loop" Data Science

Pruning Optimizations

• Keep running estimates of utility
• Prune visualizations based on estimates: two flavors
– Vanilla confidence-interval-based pruning
– Multi-armed bandit pruning

Discard low-utility views early to avoid wasted computation
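
A sketch of the confidence-interval flavor (notation assumed here): after each batch of data, keep a running utility estimate $\hat{u}_i$ for visualization $i$ with confidence-interval half-width $\varepsilon_i$, and prune $i$ once even its optimistic bound cannot beat the pessimistic bound of the current $k$-th best:

$$\hat{u}_i + \varepsilon_i \;<\; \hat{u}_{(k)} - \varepsilon_{(k)}$$

Processing for a pruned visualization stops immediately, saving its remaining scans.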

Page 38: Three Tools for "Human-in-the-loop" Data Science

[Architecture: the viz interface sends 100s of queries to a middleware layer, whose optimizer applies sharing and pruning before issuing queries to the DBMS and returning visualizations]

Page 39: Three Tools for "Human-in-the-loop" Data Science

Experimental Findings

Up to 300X speedup: <1s for SM, 4s for L

Page 40: Three Tools for "Human-in-the-loop" Data Science

Effortless Visual Exploration of Large Datasets with Zenvisage

Ingredients:
• Drag-and-drop and sketch-based interactions
• A sophisticated visual exploration language, ZQL
• A scalable visualization generation engine
• Rapid pattern-matching algorithms

Page 41: Three Tools for "Human-in-the-loop" Data Science

Page 42: Three Tools for "Human-in-the-loop" Data Science

Motivation

Collaborative data science is ubiquitous
• Many users, many versions of the same dataset stored at many stages of analysis
• Status quo: stored in a file system, relationships unknown

Challenge: can we build a versioned data store?
– Support efficient access, retrieval, querying, and modification of versions

Page 43: Three Tools for "Human-in-the-loop" Data Science

Motivation: Starting Points

• VCS: Git/SVN are inefficient and unsuitable
– Ordered semantics
– No data manipulation API
– No efficient multi-version queries
– Poor support for massive files

• DBMS: Relational databases don’t support versioning, but are efficient and scalable

Page 44: Three Tools for "Human-in-the-loop" Data Science

OrpheusDB: Current Focus

PostgreSQL + Versioning Commands

Page 45: Three Tools for "Human-in-the-loop" Data Science

Challenge: Storing Versions Compactly / Retrieving Versions Quickly

1000s of versions, spanning millions of records.

Store all versions independently → huge storage, but version access time is very small.
Store one version, and all others via chains of “deltas” → very small storage, but version access time is high.
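
One middle ground, as a minimal sketch (a hypothetical layout for illustration, not necessarily OrpheusDB’s exact schema): store each record once, tagged with the array of versions that contain it, so any version can be materialized in one scan.

```sql
-- Each record appears once, with the ids of all versions containing it.
CREATE TABLE employees_cvd (
    rid      SERIAL PRIMARY KEY,  -- record id
    name     TEXT,
    salary   INT,
    versions INTEGER[]            -- versions containing this record
);

-- Check out (materialize) version 7:
SELECT rid, name, salary
FROM employees_cvd
WHERE 7 = ANY (versions);
```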

Page 46: Three Tools for "Human-in-the-loop" Data Science

And Answer Queries…

• Retrieve the first version that contains this tuple
• Find versions where the average(salary) is greater than 1000
• Find all pairs of versions where over 100 new tuples were added
• Show the history of the tuple with record id 34
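
Under the hypothetical array-tagged layout sketched above, the first two queries might look like:

```sql
-- First version containing the tuple with record id 34
-- (assuming version ids increase over time):
SELECT MIN(v)
FROM employees_cvd, UNNEST(versions) AS v
WHERE rid = 34;

-- Versions whose average salary exceeds 1000:
SELECT v AS version
FROM employees_cvd, UNNEST(versions) AS v
GROUP BY v
HAVING AVG(salary) > 1000;
```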

For more examples, see [TAPP’15]

Page 47: Three Tools for "Human-in-the-loop" Data Science

Framework

[Architecture: a User Interface Layer (git-style commands, or SQL with versions as relations) feeds a “Versioning” Layer (translation/bookkeeping: parser & translator, layout optimizer), which sits on an unmodified Postgres backend that is not aware of versions]

Page 48: Three Tools for "Human-in-the-loop" Data Science

Summary: Make Data Analytics Great Again!

[Figure: the HILDA pyramid with the three tools, increasing in sophistication of analysis:
Touch & Feel: dataspread.github.io
Play & View: zenvisage.github.io
Share & Collaborate: orpheus-db.github.io]

My website: http://data-people.cs.illinois.edu
Twitter: @adityagp