Interactive Data Exploration using Constraints
Alexander Kalinin, Ugur Cetintemel, Stan Zdonik
2
CP + DBMS for Data Intensive Exploration
3
Interactive Data Exploration (IDE)
Searching for the “interesting” within big data
• Exploratory analysis: ad hoc and repetitive
  • Questions are not well defined
  • “Interesting” can be complex
• Human-in-the-loop operation
  • Fast, online results
  • Query refinement
Where’s Waldo?
Where’s Horrible Gelatinous Blob?
4
Exploratory Queries: Some examples
• First-order
  • “Celestial 3-5° by 5-7° regions with brightness > 0.8”
• Higher-order
  • “Pairs of 2° by 2° celestial regions with similarity > 0.5”
• Optimized
  • “Celestial 3° by 7° region with maximum brightness”
Sloan Digital Sky Survey (SDSS)
5
“Celestial 3-5° by 5-7° regions with average brightness > 0.8” in SQL
1. Divide the data into cells
2. Enumerate all regions
3. Final filtering (> 0.8)
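The three steps above can be sketched in Python as a brute-force enumeration. This is only an illustration, not the SQL the slide refers to: the function name `enumerate_regions`, the grid layout, and the assumption that every cell holds the same number of objects (so that averaging per-cell averages is valid) are all hypothetical.

```python
from itertools import product

def enumerate_regions(cells, w_range, h_range, threshold):
    """Brute-force region search over a grid of cells.

    cells     -- 2D list of per-cell average brightness (step 1 is assumed
                 done; cells are assumed to hold equal numbers of objects)
    w_range   -- (min, max) region size along the first axis, in cells
    h_range   -- (min, max) region size along the second axis, in cells
    threshold -- keep regions whose average brightness exceeds this value
    """
    rows, cols = len(cells), len(cells[0])
    results = []
    sizes = product(range(w_range[0], w_range[1] + 1),
                    range(h_range[0], h_range[1] + 1))
    for w, h in sizes:
        # Step 2: enumerate every placement of a w-by-h region.
        for x in range(rows - w + 1):
            for y in range(cols - h + 1):
                vals = [cells[i][j]
                        for i in range(x, x + w)
                        for j in range(y, y + h)]
                avg = sum(vals) / len(vals)
                # Step 3: final filtering.
                if avg > threshold:
                    results.append((x, y, w, h, avg))
    return results
```

Every region is materialized before it is filtered, so the work grows with the number of placements times the region area, whether or not a region has any chance of qualifying.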
6
DBMSs for IDE?
• No native support for exploratory constructs
  • No power set
  • No user-defined objective functions
• No support for interactivity
  • No online results
  • No notion of a “query session”
7
Data Exploration as a CP problem
“Celestial 3-5° by 5-7° regions with average brightness > 0.8”
Decision variables:
• Left-most corner
• Lengths
Constraints:
• Region size (3-5° by 5-7°)
• Average brightness (> 0.8)
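One possible way to write this query as a CP model is sketched below. The slide's own formulas are not available here, so the symbols $x$, $y$, $l_x$, $l_y$ and the function name $\mathrm{avg\_br}$ (borrowed from the Semantic Windows slide) are assumptions chosen for illustration.

```latex
\begin{align*}
\text{Decision variables:}\quad
  & (x, y)     && \text{left-most corner of the region} \\
  & (l_x, l_y) && \text{lengths of the region along each axis} \\[4pt]
\text{Constraints:}\quad
  & 3 \le l_x \le 5 \\
  & 5 \le l_y \le 7 \\
  & \mathrm{avg\_br}(x, y, l_x, l_y) > 0.8
\end{align*}
```

Each assignment of the four variables that satisfies all three constraints corresponds to one region in the query result.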
8
CP Solvers
• Large variety of methods for exploring a search space
  • Branch-and-Cut
  • Large Neighborhood Search (LNS)
  • Randomized search with Restarts
• Highly extensible: important for ad-hoc exploration!
  • New constraints/functions
  • New search heuristics
• But… compared with DBMSs
  • In-memory data (CP) vs. efficient disk data handling (DBMS)
  • No I/O cost-awareness (CP) vs. cost-based query planning (DBMS)
9
SearchLight
• A fusion of CP solvers and DBMSs
  • The DBMS stores and maintains data
  • The CP solver explores the constrained search space
• SearchLight is a mediator
  • Extends CP solvers
  • Provides buffering, prefetching
  • Distributes the search
  • Makes CP solvers cost-aware
[Figure: SearchLight architecture. Components: Exploration Query; CP Solver (OR-tools, Gecode) with Constraints/Functions and Search Heuristics; SearchLight with Metadata and Buffering; DBMS (PostgreSQL, SciDB). Arrow labels: “Data, estimates, decisions”; “Requests, Solutions”; “Data, schema info”; “Data requests, constraints”.]
10
Research Issues
• A cost model for data-intensive CP
  • Each search decision has an I/O cost
• Mediation of data access
  • Meta-data for guiding and optimizing search (annotated trees, samples, etc.)
  • Prefetching
• Distributed search
  • Multi-node parallel branch processing
• CP/DBMS integrated query planning
  • Propagating CP/Schema constraints
11
Semantic Windows (SW)
• First step towards constraint-based exploration
  • Supports first-order queries
  • Exploration via multi-dimensional “windows of interest”
  • Shape-based constraints (“a 3-5° by 5-7° region”)
  • Content-based constraints (“avg_br() > 0.8”)
• Custom distributed cost-aware solver
12
SQL/CP Extensions for Data Exploration
SELECT lb(ra), rb(ra), lb(dec), rb(dec), avg(brightness)
FROM sdss
GRID BY ra BETWEEN 100 AND 300 STEP 1
        dec BETWEEN 5 AND 40 STEP 1
HAVING avg(brightness) > 0.8 AND
       size(ra) = 5 AND size(dec) >= 5 AND size(dec) <= 7
13
Cost-aware Solver
• Best-first search based on the utility
  • Utility = f(benefit, cost)
• Benefit: how close a window is to satisfying the constraints
  • A distance between the constraint’s value and the estimated value
• Cost: how expensive it is to read a window from disk
  • Measured in cells we have to read
  • Adjustments are made for skewed data
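A minimal sketch of utility-ordered best-first search follows. The slide leaves f unspecified, so the form used here (benefit divided by cells to read), the benefit formula, and the names `benefit`, `utility`, `best_first`, and `read_window` are all assumptions made for illustration; the skew adjustments mentioned above are not modeled.

```python
import heapq

def benefit(estimate, threshold):
    """Assumed benefit: 1.0 when the estimate meets the threshold,
    shrinking as the estimate falls further below it."""
    shortfall = max(0.0, threshold - estimate)
    return 1.0 / (1.0 + shortfall)

def utility(estimate, threshold, cells_to_read):
    """Hypothetical f(benefit, cost): benefit per cell that must be read."""
    return benefit(estimate, threshold) / max(1, cells_to_read)

def best_first(windows, threshold, read_window):
    """Visit windows in order of decreasing utility.

    windows     -- iterable of (window_id, estimated_value, cells_to_read)
    read_window -- callable returning the exact value of a window
                   (stands in for the disk read)
    Yields (window_id, exact_value) for windows whose exact value
    exceeds the threshold.
    """
    # heapq is a min-heap, so utilities are negated.
    heap = [(-utility(est, threshold, cost), wid)
            for wid, est, cost in windows]
    heapq.heapify(heap)
    while heap:
        _, wid = heapq.heappop(heap)
        value = read_window(wid)
        if value > threshold:
            yield wid, value
```

Because the function is a generator, qualifying windows are reported as soon as they are read, which matches the goal of online results. Utilities are computed once here; this sketch does not re-rank windows as new data arrives.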
14
Optimizations
• Cost and benefit are estimated by sampling
• Objective function values are cached in a cell cache
  • Dynamic utility updates
  • Avoiding re-reads of the same cells
• Constraint-based pruning during the search
• Distributed search
  • Multiple nodes work in parallel
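The cell cache idea can be illustrated with a small sketch. The class name `CellCache`, the `read_cell` callable, and the choice to store a (sum, count) pair per cell are hypothetical; the slides do not describe the actual cache layout.

```python
class CellCache:
    """Caches per-cell aggregates so overlapping windows share reads."""

    def __init__(self, read_cell):
        # read_cell is a hypothetical callable (i, j) -> (sum, count)
        # that stands in for reading one cell from disk.
        self._read_cell = read_cell
        self._cache = {}
        self.disk_reads = 0

    def cell(self, i, j):
        key = (i, j)
        if key not in self._cache:
            self._cache[key] = self._read_cell(i, j)
            self.disk_reads += 1
        return self._cache[key]

    def window_avg(self, x, y, w, h):
        """Average over the w-by-h window whose corner is cell (x, y)."""
        total = 0.0
        count = 0
        for i in range(x, x + w):
            for j in range(y, y + h):
                s, c = self.cell(i, j)
                total += s
                count += c
        return total / count if count else 0.0
```

Two overlapping 2-by-2 windows that share two cells trigger six cell reads in total rather than eight, because the shared cells are served from the cache on the second call.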
15
Adaptive Prefetching
• Dispersed reads hurt total performance
• Prefetching: read the neighborhood with every window
• Progress-driven prefetching: how much?
  • Finding new results? Prefetch a small amount
  • No new results? Increase the prefetch exponentially
[Figure: windows numbered 1-4, shown for “No prefetching” and “With prefetching”.]
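The progress-driven policy above can be sketched as a single update rule. The function name `next_prefetch` and the parameters `base`, `factor`, and `limit` are assumptions; the slides state only that the prefetch stays small while results are being found and grows exponentially otherwise.

```python
def next_prefetch(current, found_new_results, base=1, factor=2, limit=64):
    """Return the prefetch size to use for the next window read.

    current           -- prefetch size used for the previous read
    found_new_results -- True if the previous read produced new results
    base              -- assumed small prefetch size used while results arrive
    factor            -- assumed growth factor when no results are found
    limit             -- assumed cap so the prefetch cannot grow unboundedly
    """
    if found_new_results:
        return base
    return min(current * factor, limit)
```

For example, starting from a prefetch size of 1, three reads in a row without new results give sizes 2, 4, and 8; a read that then finds a result resets the size to 1.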
16
Online vs. Total Performance Results
• 35GB data set (part of the SDSS)
• 4GB total memory (1GB shared buffer)
• First results in 10-20 seconds
[Figure: time in seconds (0 to 6000) vs. % of results returned (20%, 40%, 60%, 80%, 100%, total) for Static, Adaptive, and PostgreSQL.]
17
Conclusions
• Integrate CP and DBMS technologies
• SearchLight: Data-Intensive CP Engine
• Initial implementation: Semantic Windows
  • Cost-aware solver
  • Mediating disk access (sampling, prefetching)
  • Distributed search
• Current work:
  • OR-Tools as the CP solver
  • SciDB as the DBMS
18
Questions?
Supported by: