Interactive Data Exploration using Constraints

Alexander Kalinin, Ugur Cetintemel, Stan Zdonik

CP + DBMS for Data-Intensive Exploration

Interactive Data Exploration (IDE)
Searching for the “interesting” within big data

• Exploratory analysis: ad hoc and repetitive
  • Questions are not well defined
  • “Interesting” can be complex

• Human-in-the-loop operation
  • Fast, online results
  • Query refinement

Where’s Waldo? Where’s Horrible Gelatinous Blob?

Exploratory Queries: Some examples
• First-order
  • “Celestial 3-5° by 5-7° regions with brightness > 0.8”
• Higher-order
  • “Pairs of 2° by 2° celestial regions with similarity > 0.5”
• Optimized
  • “Celestial 3° by 7° region with maximum brightness”

Sloan Digital Sky Survey (SDSS)

“Celestial 3-5° by 5-7° regions with average brightness > 0.8” in SQL

1. Divide the data into cells
2. Enumerate all regions
3. Final filtering (> 0.8)
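The three steps above can be sketched outside SQL to show why the enumerate-then-filter approach is expensive. This is a minimal illustration, not the query from the slides: it assumes the data has already been divided into a grid of cells holding one brightness value each, and the names `prefix_sums` and `enumerate_regions` are hypothetical.

```python
from itertools import product

def prefix_sums(grid):
    """2-D prefix sums: ps[i][j] is the sum of grid[0..i-1][0..j-1]."""
    rows, cols = len(grid), len(grid[0])
    ps = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            ps[i + 1][j + 1] = (grid[i][j] + ps[i][j + 1]
                                + ps[i + 1][j] - ps[i][j])
    return ps

def enumerate_regions(grid, heights, widths, threshold):
    """Steps 2 and 3: enumerate every h-by-w region, then filter by average.

    Assumes step 1 is done: grid[i][j] is the value of one cell.
    """
    rows, cols = len(grid), len(grid[0])
    ps = prefix_sums(grid)
    results = []
    for h, w in product(heights, widths):
        for i in range(rows - h + 1):
            for j in range(cols - w + 1):
                total = (ps[i + h][j + w] - ps[i][j + w]
                         - ps[i + h][j] + ps[i][j])
                avg = total / (h * w)
                if avg > threshold:  # final filtering
                    results.append((i, j, h, w, avg))
    return results
```

For the query on this slide, the call would look like `enumerate_regions(grid, range(3, 6), range(5, 8), 0.8)`. Every candidate region is visited regardless of how promising it is, and no result is available until its region has been fully evaluated.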

DBMSs for IDE?
• No native support for exploratory constructs
  • No power set
  • No user-defined objective functions

• No support for interactivity
  • No online results
  • No notion of a “query session”

Data Exploration as a CP problem

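Query: “Celestial 3-5° by 5-7° regions with average brightness > 0.8”

Decision variables: the region's left-most corner and its lengths

Constraints: the lengths lie within 3-5° and 5-7°; the average brightness exceeds 0.8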
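To make the formulation concrete, here is a small sketch of the query as a constraint model with a generic backtracking search. It is written in plain Python rather than against a real CP solver, and all names (`window_avg`, `solve`, `region_model`, the variables `x`, `y`, `lx`, `ly`) are assumptions made for illustration.

```python
def window_avg(grid, x, y, lx, ly):
    """Average of the cells in the lx-by-ly window whose corner is (x, y)."""
    cells = [grid[i][j] for i in range(x, x + lx) for j in range(y, y + ly)]
    return sum(cells) / len(cells)

def solve(domains, constraints, assignment=None):
    """Yield every full assignment that satisfies all constraints.

    domains: dict mapping a decision variable to its candidate values.
    constraints: list of (variable_names, predicate). A predicate is checked
    as soon as all of its variables are assigned, which prunes the branch.
    """
    assignment = dict(assignment or {})
    unassigned = [v for v in domains if v not in assignment]
    if not unassigned:
        yield assignment
        return
    var = unassigned[0]
    for value in domains[var]:
        assignment[var] = value
        ok = all(pred(*(assignment[n] for n in names))
                 for names, pred in constraints
                 if var in names and all(n in assignment for n in names))
        if ok:
            yield from solve(domains, constraints, assignment)
        del assignment[var]

def region_model(grid, lx_range, ly_range, threshold):
    """Decision variables: corner (x, y) and lengths (lx, ly)."""
    rows, cols = len(grid), len(grid[0])
    domains = {"x": range(rows), "y": range(cols),
               "lx": lx_range, "ly": ly_range}
    constraints = [
        # Shape constraints come first so the window is in bounds
        # before the content constraint reads any cells.
        (("x", "lx"), lambda x, lx: x + lx <= rows),
        (("y", "ly"), lambda y, ly: y + ly <= cols),
        (("x", "y", "lx", "ly"),
         lambda x, y, lx, ly: window_avg(grid, x, y, lx, ly) > threshold),
    ]
    return domains, constraints
```

With a grid of 1° cells, the query on this slide would correspond to `region_model(grid, range(3, 6), range(5, 8), 0.8)`. A real solver would add propagation and search heuristics on top of this basic structure.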

CP Solvers
• Large variety of methods for exploring a search space
  • Branch-and-Cut
  • Large Neighborhood Search (LNS)
  • Randomized search with restarts

• Highly extensible – important for ad hoc exploration!
  • New constraints/functions
  • New search heuristics

• But… compared with DBMSs
  • In-memory data (CP) vs. efficient disk data handling (DBMS)
  • No I/O cost-awareness (CP) vs. cost-based query planning (DBMS)

SearchLight
• A fusion of CP solvers and DBMSs
  • The DBMS stores and maintains data
  • The CP solver explores the constrained search space

• SearchLight is a mediator
  • Extends CP solvers
  • Provides buffering, prefetching
  • Distributes the search
  • Makes CP solvers cost-aware

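[Architecture diagram. Components: CP Solver (OR-Tools, Gecode) with constraints/functions and search heuristics; SearchLight with metadata and buffering; DBMS (PostgreSQL, SciDB). Input: exploration query. Edge labels between the CP solver and SearchLight: “Requests, solutions” and “Data, estimates, decisions”. Edge labels between SearchLight and the DBMS: “Data requests, constraints” and “Data, schema info”.]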

Research Issues
• A cost model for data-intensive CP
  • Each search decision has an I/O cost

• Mediation of data access
  • Metadata for guiding and optimizing search (annotated trees, samples, etc.)
  • Prefetching

• Distributed search
  • Multi-node parallel branch processing

• CP/DBMS integrated query planning
  • Propagating CP/schema constraints

Semantic Windows (SW)
• First step towards constraint-based exploration

• Supports first-order queries
  • Exploration via multi-dimensional “windows of interest”
  • Shape-based constraints (“a 3-5° by 5-7° region”)
  • Content-based constraints (“avg_br() > 0.8”)

• Custom distributed cost-aware solver

SQL/CP Extensions for Data Exploration

    SELECT lb(ra), rb(ra), lb(dec), rb(dec), avg(brightness)
    FROM sdss
    GRID BY ra BETWEEN 100 AND 300 STEP 1
            dec BETWEEN 5 AND 40 STEP 1
    HAVING avg(brightness) > 0.8 AND
           size(ra) = 5 AND size(dec) >= 5 AND size(dec) <= 7

Cost-aware Solver
• Best-first search based on utility
  • Utility = f(benefit, cost)

• Benefit: how close a window is to satisfying the constraints
  • A distance between the constraint’s value and the estimated value

• Cost: how expensive it is to read a window from disk
  • Measured in the number of cells that have to be read
  • Adjustments are made for skewed data
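A best-first search driven by utility can be sketched with a priority queue. The slides do not define f, so the `utility` and `benefit` formulas below are placeholders chosen only for illustration, and `estimate_fn`, `cost_fn`, and `read_fn` are hypothetical callbacks standing in for sampling-based estimates, cell counts, and the actual disk read.

```python
import heapq

def benefit(estimate, threshold):
    """Closeness of an estimated value to satisfying `value > threshold`.

    Assumed form: 1.0 when the estimate already satisfies the constraint,
    decaying as the distance to the threshold grows.
    """
    distance = max(0.0, threshold - estimate)
    return 1.0 / (1.0 + distance)

def utility(benefit_value, cost):
    """Assumed combination: higher benefit and lower cost rank first."""
    return benefit_value / (1.0 + cost)

def best_first(windows, estimate_fn, cost_fn, read_fn, threshold):
    """Read windows in decreasing utility order; yield those that qualify."""
    heap = []
    for order, w in enumerate(windows):
        u = utility(benefit(estimate_fn(w), threshold), cost_fn(w))
        # heapq is a min-heap, so the utility is negated; `order` breaks ties.
        heapq.heappush(heap, (-u, order, w))
    while heap:
        _, _, w = heapq.heappop(heap)
        value = read_fn(w)  # the expensive step: reading the window
        if value > threshold:
            yield w, value
```

Because the function is a generator, qualifying windows are reported as soon as they are read, which matches the goal of online results.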

Optimizations
• Cost and benefit are estimated by sampling

• Objective function values are cached in a cell cache
  • Dynamic utility updates
  • Avoids re-reading the same cells

• Constraint-based pruning during the search

• Distributed search
  • Multiple nodes work in parallel
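The cell cache idea can be illustrated with a small sketch. It assumes that a window is a list of cell identifiers and that `read_cell` is a hypothetical function performing the disk read; the class and method names are illustrative and not taken from the system.

```python
class CellCache:
    """Caches per-cell values so overlapping windows do not re-read cells."""

    def __init__(self, read_cell):
        self._read_cell = read_cell  # hypothetical disk-read callback
        self._values = {}
        self.disk_reads = 0

    def get(self, cell):
        """Return the cell's value, reading from disk only on a miss."""
        if cell not in self._values:
            self._values[cell] = self._read_cell(cell)
            self.disk_reads += 1
        return self._values[cell]

    def uncached_cells(self, window):
        """Cells of `window` that would still have to be read from disk."""
        return [c for c in window if c not in self._values]

def window_sum(cache, window):
    """Aggregate a window's value through the cache."""
    return sum(cache.get(c) for c in window)
```

One way to realize dynamic utility updates under these assumptions is to recompute a window's cost from `len(cache.uncached_cells(window))`, so that windows overlapping already-read cells become cheaper over time.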

Adaptive Prefetching
• Dispersed reads hurt total performance

• Prefetching: read the neighborhood with every window

• Progress-driven prefetching: how much?
  • Finding new results? Prefetch a small amount
  • No new results? Increase the prefetch exponentially
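The progress-driven policy can be sketched as a small controller. The initial size, growth factor, and cap are assumed parameters, and dropping back to the initial size whenever a new result is found is one possible reading of "prefetch a small amount".

```python
class AdaptivePrefetcher:
    """Tracks how much neighborhood to prefetch around each window read."""

    def __init__(self, initial=1, factor=2, maximum=64):
        # All three parameters are illustrative assumptions.
        self.initial = initial
        self.factor = factor
        self.maximum = maximum
        self.size = initial

    def update(self, found_new_results):
        """Adjust the prefetch size after a read and return the new size."""
        if found_new_results:
            # Results are being found: keep the prefetch small.
            self.size = self.initial
        else:
            # No progress: grow the prefetch exponentially, up to a cap.
            self.size = min(self.size * self.factor, self.maximum)
        return self.size
```

A solver loop would call `update` after each window read and use the returned size to decide how many neighboring cells to fetch with the next window.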

[Figure: two panels, “No prefetching” and “With prefetching”, each showing windows numbered 1-4.]

Online vs. Total Performance Results
• 35GB data set (part of the SDSS)
• 4GB total memory (1GB shared buffer)
• First results in 10-20 seconds

[Chart: time in seconds (0 to 6000) vs. % of results returned (20%, 40%, 60%, 80%, 100%, total) for Static, Adaptive, and PostgreSQL.]

Conclusions
• Integrate CP and DBMS technologies

• SearchLight: Data-Intensive CP Engine

• Initial implementation: Semantic Windows
  • Cost-aware solver
  • Mediating disk access (sampling, prefetching)
  • Distributed search

• Current work:
  • OR-Tools as the CP solver
  • SciDB as the DBMS

Questions?

Supported by:
