Ohio State University, Department of Computer Science and Engineering
Supporting SQL-3 Aggregations on Grid-based Data Repositories
Li Weng, Gagan Agrawal, Umit Catalyurek, Joel Saltz

Supporting SQL-3 Aggregations on Grid-based Data Repositories

  • Upload
    sveta

  • View
    59

  • Download
    0

Embed Size (px)

DESCRIPTION

Supporting SQL-3 Aggregations on Grid-based Data Repositories. Li Weng, Gagan Agrawal, Umit Catalyurek, Joel Saltz. Scientific data repositories Large volume Gigabyte, Terabyte, Petabyte Distributed datasets Generated/collected by scientific simulations or instruments - PowerPoint PPT Presentation

Citation preview

Supporting SQL-3 Aggregations on Grid-based Data Repositories
Li Weng, Gagan Agrawal,
Umit Catalyurek, Joel Saltz
Scientific Data Analysis
Scientific data repositories: large volume (gigabytes, terabytes, petabytes); distributed datasets generated or collected by scientific simulations and instruments.
Multi-dimensional datasets
Motivating Scientific Applications
Magnetic Resonance Imaging
Oil Reservoir Management
Cancer Studies using MRI
Telepathology with Digitized Slides
Satellite Data Processing
Virtual Microscope

Current Approaches
Good, but is it too heavyweight for read-mostly scientific data?
Manual implementation based on low-level datasets (HDF5, NetCDF, etc.): needs a detailed understanding of the low-level formats.
Layout descriptions (BinX, BFD, DFDL): no single established standard; machine-readable descriptions, but the application is still dependent on a specific layout.
Our Approach
Express the query and the computation declaratively over a virtual relational table view:
A dataset in a complex, low-level layout can be abstracted as an SQL-3 table for scientists.
A basic SELECT query specifies the subset of interest.
Data analysis on the subset of interest can be defined as an SQL-3 aggregate function over the SQL-3 relation.
Our Approach
Generate a data extraction service and a data aggregation service: a lightweight layer on top of the datasets.
The STORM runtime middleware works in coordination with the generated services.
System Overview
[Figure: the query frontend passes queries to the generated extraction service, which runs on the STORM runtime.]
Outline
Introduction
Motivation
Compiler analysis and code generation
Design a meta-data descriptor
Experimental results
Related work
Canonical Query Structure
SELECT <attribute list>, <AGG_name(...)>
FROM <Dataset Name>
WHERE <Expression>
GROUP BY <group-by attribute list>;
Oil Reservoir Management (IPARS)
SELECT X, Y, Z, ipars_bypass_sum(IPARS)
FROM IPARS
WHERE REL IN (0,5,10) AND TIME >= 1000 AND TIME <= 1200
GROUP BY X, Y, Z HAVING ipars_bypass_sum(OIL) > 0;

CREATE AGGREGATE ipars_bypass_sum ( BASETYPE = IPARS,
SFUNC = ipars_func, STYPE = int, INITCOND = 0 );

CREATE FUNCTION ipars_func(int, IPARS) RETURNS int AS '
SELECT CASE WHEN $2.soil > 0.7 AND
|/($2.oilx * $2.oilx + $2.oily * $2.oily + $2.oilz * $2.oilz) < 30.0
THEN $1 + 1 ELSE $1 END;' LANGUAGE SQL;
Compiler Analysis and Code Generation
Transform the canonical query into two pipelined sub-queries.
Data Extraction Service
Data Aggregation Service
SELECT <attribute list>, <AGG_name(Dataset Name)> FROM TempDataset GROUP BY <group-by attribute list>;
Ohio State University
Design a Meta-data Descriptor
The dataset comprises several simulations on the same grid.
For each realization and each grid point, a number of attributes are stored.
The dataset is stored on a 4 node cluster.
Component I: Dataset Schema Description
[IPARS] //{* Dataset schema name *}
TIME = int
X = float
Y = float
Z = float
SOIL = float
SGAS = float
[IparsData] //{* Dataset name *}
An Example
Oil Reservoir Management
Use the LOOP keyword to capture the repetitive structure within a file.
The grid has 4 partitions (0–3).
“IparsData” comprises “ipars1” and “ipars2”: “ipars1” describes the data files storing the spatial coordinates; “ipars2” specifies the data files storing the other attributes.
Component III: Dataset Layout Description
DATASET “IparsData” {     //{* Name for Dataset *}
  DATATYPE { IPARS }      //{* Schema for Dataset *}
  DATAINDEX { REL TIME }
  DATASET “ipars1” {
    X Y Z
  }                       //{* end of DATASET “ipars1” *}
  DATASET “ipars2” {
    SOIL SGAS
    $DIRID = 0:3:1
  }                       //{* end of DATASET “ipars2” *}
}
Generate Data Extraction Service
Our tool parses the meta-data descriptor and generates function code.
At run time, the query provides parameters that invoke the generated functions to create Aligned File Chunks.
Generate Data Aggregation Service
1. Aggregate function analysis
Projection push-down extracts only the data needed for a particular query and its aggregation.
TempDataset = SELECT <useful attributes> FROM <Dataset Name> WHERE <Expression>;
For the IPARS application, only 7 of the 22 attributes are needed for the considered query; the volume of data to be retrieved and communicated is reduced by 66%.
For the TITAN application, 5 of the 8 attributes are needed, a reduction of 38%.
Generate Data Aggregation Service
2. Aggregate function decomposition
The first step involves computations applied to each tuple; the second step updates the aggregate status variable.
The compiler replaces the largest expression with TempAttr. For IPARS, the number of attributes is reduced further from 7 to 4.
CREATE FUNCTION ipars_func(int, IPARS) RETURNS int AS '
SELECT CASE WHEN $2.TempAttr
THEN $1 + 1 ELSE $1 END;' LANGUAGE SQL;
Generate Data Aggregation Service
If more client nodes are available as computing units, partition the subset of interest based on the values of the group-by attributes.
Construct a hash table using the values of the group-by attributes as the hash key, and translate the SQL-3 aggregate function into imperative C/C++ code.
Experimental Setup & Design
A Linux cluster connected via switched Fast Ethernet. Each node has a PIII 933 MHz CPU, 512 MB main memory, and three 100 GB IDE disks.
Scalability test, varying the number of nodes hosting data and performing the computations;
Performance test, increasing the amount of data to be processed;
Comparison with hand-written code.
Experimental Results for IPARS
Scale the number of nodes hosting the data and the number of nodes for processing.
Extract a 640 MB subset of interest by scanning 1.9 GB of data.
The execution times scale almost linearly.
The performance difference between compiler-generated and hand-written versions varies between 6% and 20%, with an average of 14%.
Aggregate decomposition reduces the difference to between 1% and 10%.
[Chart: execution time of hand-written (Hand) vs. compiler-generated (Comp) versions on 1, 2, 4, and 8 nodes]
Experimental Results for IPARS
Evaluate the system’s ability to scale to larger datasets.
Use 8 data source nodes and 8 client nodes.
The execution time stays proportional to the amount of data to be retrieved and processed.
[Chart: execution time of Hand vs. Comp versions for 1.9, 3.8, 5.7, and 7.6 GB of data]
Experimental Results for TITAN
Scale the number of nodes hosting the data and the number of nodes for processing.
Extract a 228 MB subset of interest by scanning 456 MB of data.
The execution times scale almost linearly.
The performance difference is 17%.
Aggregate decomposition reduces the difference to 6%.
[Chart: execution time of Hand vs. Comp versions on 1, 2, 4, and 8 nodes]
Experimental Results for TITAN
Evaluate the system’s ability to scale to larger datasets.
Use 8 data source nodes and 8 client nodes.
The execution time stays proportional to the amount of data to be retrieved and processed.
[Chart: execution time of Hand vs. Comp versions for 228, 456, 684, and 912 MB of data]
Related Work
Data cubes
Runtime strategies for supporting reductions in a distributed environment
Conclusions
A compiler-based system supporting SQL-3 aggregate functions and SELECT queries with GROUP BY on flat-file scientific datasets.
Both the extraction of the subset of interest and the aggregate computation can be expressed declaratively.
Using a meta-data descriptor to represent the layout of the dataset, our compiler generates an efficient data extraction service.
The compiler analyzes the user-defined aggregate function and generates code for a parallel environment.
Processing Remotely Sensed Data (AVHRR)
Sensor readings are gathered to form an instantaneous field of view (IFOV); a dataset is stored as a single file of IFOVs.