CCGrid, 2012
Supporting User Defined Subsetting and Aggregation over Parallel
NetCDF Datasets
Yu Su and Gagan Agrawal
Department of Computer Science and Engineering
The Ohio State University
CCGrid 2012, Ottawa, Canada
Outline
• Motivation and Introduction
• Background
• System Overview
• Experiment
• Conclusion
Motivation
• Science is becoming increasingly data driven
• Strong desire for efficient data analysis
• Challenges
  – Data sizes grow rapidly
  – Slow I/O and network bandwidth
• An example
  – Different kinds of subsetting requests
  – Different scientific data formats
An Example
• GCRM (Global Cloud Resolving Model)
  – A global atmospheric circulation model

Parameter                      Value
Current Grid Cell Size         4 km
Number of Cells                3 billion
Number of Layers               > 100
Time Step                      10 seconds
Data Generation Speed          100 TB per day
Future Grid Cell Size          1 km
Future Data Generation Speed   6.4 PB per day
Network Speed                  10 GB per sec

Moving one day of future output (6.4 PB) at 10 GB per sec takes 7.4 days!
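The 7.4 days figure can be checked from the two table entries it depends on. The sketch below assumes decimal units (1 PB = 1,000,000 GB) and a link that sustains its full rate; the function name is only illustrative.

```python
# Assumption: decimal units and a fully utilized link; not taken from the slides.
GB_PER_PB = 1_000_000
SECONDS_PER_DAY = 86_400


def transfer_days(data_pb: float, link_gb_per_sec: float) -> float:
    """Days needed to move data_pb petabytes over a link_gb_per_sec link."""
    seconds = data_pb * GB_PER_PB / link_gb_per_sec
    return seconds / SECONDS_PER_DAY


# One day of future GCRM output (6.4 PB) over the 10 GB per sec network.
days = transfer_days(6.4, 10)
print(f"{days:.1f} days")
```

In other words, the network would need about a week to ship what the model produces in one day, which is the motivation for subsetting and aggregating on the server side.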
Client-side vs. Server-side subsetting and aggregation
[Figure: Simple Request and Advanced Request]
Data Virtualization
• Support SQL queries over scientific datasets
  – Standard
  – Flexible
• Keep data in native format (e.g., NetCDF, HDF5)
• Compare with other scientific data management tools
  – SciDB: support for data arrays in parallel
  – OPeNDAP: no flexible subsetting and aggregation
Our Approach
• User-defined subsetting and aggregations
  – Subsetting: Dimensions, Coordinates, Variables
  – Aggregation: SUM, AVG, COUNT, MAX, MIN
• Support NetCDF data format
  – Developed by UCAR
  – Widely used in climate simulation
• Parallel Data Access
  – Data Partition Strategy
  – Different Parallel Levels
Background - NetCDF
netcdf mynetcdf {
dimensions:
    X = 4;
    Y = 4;
    Time = UNLIMITED;
variables:
    float X(X);
    float Y(Y);
    int Time(Time);
    float Temperature(Time, Y, X);
    Temperature:unit = "Celsius";
data:
    X = 10, 20, 30, 40;
    Y = 110, 120, 130, 140;
    Time = 31, 59, 90;
    Temperature =
        111, 211, 311, 411, 121, 221, 321, 421,
        131, 231, 331, 431, 141, 241, 341, 441,
        112, 212, 312, 412, 122, 222, 322, 422,
        132, 232, 332, 432, 142, 242, 342, 442,
        113, 213, 313, 413, 123, 223, 323, 423,
        133, 233, 333, 433, 143, 243, 343, 443;
}
[Figure: 3-D array with axes X (1 to 4), Y (1 to 4) and Time (1 to 3). Labels: Metadata; Actual values stored in a multi-dimensional array]
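To make the layout concrete, the sketch below indexes the Temperature values from the example as a flat array. It relies on the CDL convention that the last dimension (X) varies fastest; the helper names `at` and `subset` are illustrative and not part of any NetCDF API.

```python
# Temperature values copied from the example: 3 time steps * 4 Y * 4 X = 48 values.
N_TIME, N_Y, N_X = 3, 4, 4
flat = [
    111, 211, 311, 411, 121, 221, 321, 421, 131, 231, 331, 431, 141, 241, 341, 441,
    112, 212, 312, 412, 122, 222, 322, 422, 132, 232, 332, 432, 142, 242, 342, 442,
    113, 213, 313, 413, 123, 223, 323, 423, 133, 233, 333, 433, 143, 243, 343, 443,
]


def at(t: int, y: int, x: int) -> int:
    """Value at zero-based (t, y, x); X varies fastest, then Y, then Time."""
    return flat[(t * N_Y + y) * N_X + x]


def subset(t_range: range, y_range: range, x_range: range) -> list:
    """Dimension-based subsetting: collect the values of a rectangular region."""
    return [at(t, y, x) for t in t_range for y in y_range for x in x_range]


# First time step, second Y row, all X positions.
print(subset(range(0, 1), range(1, 2), range(0, N_X)))
```

The digits of each value encode its position (hundreds = X, tens = Y, units = Time), which makes it easy to see which cells a subset touches.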
System Architecture
• Parse the SQL expression
• Parse the metadata file
• Physical Metadata
• Logical Metadata
• Generate Query Request
• Partition Criteria
  – Subsetting: Disk Access
  – Aggregation: Data Transfer
• Read Data
• Post-filter Data
• Local Data Aggregation
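The read and post-filter steps can be illustrated with a small sketch. It assumes that a condition on a coordinate variable can be turned into an index range before reading (which requires sorted coordinate values), while a condition on the data values can only be applied after the read. All function names are hypothetical and do not come from the system described here.

```python
from bisect import bisect_left, bisect_right


def coord_range_to_index_range(coords, low, high):
    """Map the closed value range [low, high] on an ascending coordinate
    variable to a half-open index range (start, stop)."""
    return bisect_left(coords, low), bisect_right(coords, high)


def post_filter(values, predicate):
    """Keep only the values that satisfy a condition on the variable itself."""
    return [v for v in values if predicate(v)]


# Coordinate values and one row of Temperature from the NetCDF example.
x_coords = [10, 20, 30, 40]
row = [111, 211, 311, 411]

# Coordinate condition "X BETWEEN 15 AND 35" narrows what is read.
start, stop = coord_range_to_index_range(x_coords, 15, 35)
block = row[start:stop]

# Value condition "Temperature > 250" is checked on the data that was read.
kept = post_filter(block, lambda v: v > 250)
print(start, stop, block, kept)
```

Resolving coordinate conditions before the read keeps disk access small; only value conditions need the extra filtering pass.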
Data Aggregation
SQL: SELECT SUM(pressure) FROM GCRM
Slave Processes
Master Process
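A minimal sketch of the master/slave pattern for such a query, run in a single process for clarity: each slave reduces its own data block to a small partial result, and the master merges the partials. The `Partial` type, the function names and the sample values are assumptions for illustration; the slides do not show the actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Partial:
    """What a slave sends to the master instead of its raw data block."""
    total: float
    count: int
    lo: float
    hi: float


def local_aggregate(block):
    """Slave side: reduce one non-empty data block to a partial result."""
    return Partial(sum(block), len(block), min(block), max(block))


def merge(partials):
    """Master side: combine the partial results from all slaves."""
    return Partial(
        total=sum(p.total for p in partials),
        count=sum(p.count for p in partials),
        lo=min(p.lo for p in partials),
        hi=max(p.hi for p in partials),
    )


# Hypothetical pressure values, already partitioned across three slaves.
blocks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
result = merge([local_aggregate(b) for b in blocks])

print("SUM:", result.total)
print("AVG:", result.total / result.count)  # AVG is derived from SUM and COUNT
```

Each slave transfers four numbers regardless of block size, which is why aggregation queries cut the data transfer cost.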
Data Parallelism
Level 1: data file (2 < 12?)
Level 2: variable (5 < 12?)
Level 3: data block (12)
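One reading of the three levels is a fallback rule: use the coarsest unit of work that still gives every process something to do. The sketch below encodes that reading, assuming 12 in the figure is the number of processes, 2 the number of files and 5 the number of variables. The function and its rule are an interpretation, not a confirmed description of the system.

```python
def choose_parallel_level(num_procs: int, num_files: int, num_variables: int) -> int:
    """Pick the coarsest level with at least one unit of work per process.

    Level 1 partitions by data file, level 2 by variable, and level 3 splits
    the data into blocks, one per process.
    """
    if num_files >= num_procs:
        return 1
    if num_variables >= num_procs:
        return 2
    return 3


# Figure values under the stated assumption: 2 files and 5 variables are
# both fewer than 12 processes, so the data is split into 12 blocks.
print(choose_parallel_level(num_procs=12, num_files=2, num_variables=5))
```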
Experiment Goals
• To compare the functionality and performance of our system with OPeNDAP
  – OPeNDAP makes local data accessible to remote locations regardless of local storage format
  – Data Translation Mechanism
  – No flexible subsetting and aggregation support
• To evaluate the parallel scalability of our system
• To show how aggregation queries reduce the data transfer cost
Comparison with OPeNDAP for Type 1 Queries
• Data size: 4GB
• Input: 50 SQL queries
• Query Type: queries only include dimensions
• Objects compared:
  – Baseline: NetCDF query time
  – Our system without parallelism
  – OPeNDAP
• Relative Speedup: 2.34 – 3.10
Comparison with OPeNDAP for Type 2 and Type 3 Queries
• Data size: 4GB
• Input: 50 SQL queries
• Query Type: queries include coordinates and variables
• Objects compared:
  – Baseline
  – Our system without parallelism
  – OPeNDAP + Filter
• Relative Speedup: 1.58 – 3.47
Parallel Optimization – Different Data Size
• Data size: 4GB – 32GB
• Process number: 1 to 16
• Input: select the whole variable
• Relative Speedup:
  – 4 procs: 2.17 – 2.87
  – 8 procs: 4.06 – 5.54
  – 16 procs: 7.23 – 9.33
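The speedup ranges can also be read as parallel efficiency (speedup divided by process count). The sketch below assumes the reported speedups are relative to a single-process run, which the slide does not state explicitly.

```python
# Reported relative speedup ranges (low, high) per process count.
reported = {4: (2.17, 2.87), 8: (4.06, 5.54), 16: (7.23, 9.33)}


def efficiency(speedup: float, procs: int) -> float:
    """Parallel efficiency: 1.0 would mean perfect linear scaling."""
    return speedup / procs


for procs, (low, high) in reported.items():
    print(f"{procs:>2} procs: efficiency "
          f"{efficiency(low, procs):.2f} to {efficiency(high, procs):.2f}")
```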
Parallel Optimization – Different Queries
• Data size: 32GB
• Process number: 1 to 16
• Input: 100 SQL queries
• Query Type: queries include dimensions, coordinates and variables
• Relative Speedup:
  – 4 procs: 2.20 – 2.92
  – 8 procs: 3.95 – 4.21
  – 16 procs: 7.25 – 7.74
Data Aggregation - Time
• Data size: 16GB
• Process number: 1 to 16
• Input: 60 aggregation queries
• Query Type:
  – Only Agg
  – Agg + Group by + Having
  – Agg + Group by
• Relative Speedup:
  – 4 procs: 2.61 – 3.08
  – 8 procs: 4.31 – 5.52
  – 16 procs: 6.65 – 9.54
Data Aggregation – Data Transfer Amount
• Data size: 16GB
• Process number: 1 to 16
• Input: 60 aggregation queries
• Query Type:
  – Only Agg
  – Agg + Group by + Having
  – Agg + Group by
Conclusion
• Data sizes are growing rapidly
• Goal: find the exact data subset the user specifies
• Data virtualization on top of NetCDF datasets
• Query request partitioning and parallel processing
• A good speedup compared with OPeNDAP
Thanks
Pre-filter Module
• Phase 1: Dataset Storage Metadata
• Phase 2: Dataset Logical Metadata
• Phase 3: Request Partition Strategy