Oral Exam 2013
A Virtualization-based Data Management Framework for Big Data Applications
Yu Su
Advisor: Dr. Gagan Agrawal,
The Ohio State University
Motivation: Scientific Data Analysis
• Science is becoming increasingly data driven
• Strong requirements for efficient data analysis
• Road-runner EC3 simulation: 4000³ records, 7 attributes (X, Y, VX, … MASS), 36 bytes per record, simulation speed: 2.3 TB
• Parallel Ocean Program: 3-D grid of 42 × 2400 × 3600, > 30 attributes (TEMP, SALT …), 1.4 GB per attribute, simulation speed: > 50 GB
Motivation: Big Data
• “Big Data” Challenge:
– Fast data generation speed
– Slow disk I/O and network speed
– The gap will become bigger in the future
– Different data formats
• Observations:
– Scientific analysis runs over data subsets (e.g., Community Climate System Model, data pipelines from tomography, X-ray photon correlation spectroscopy): attribute subsets, spatial subsets, value subsets
– Multi-resolution data analysis
– Wide-area data transfer protocols
An Example of Ocean Simulation
(Figure: a remote data server holds POP.nc with attributes TEMP, SALT, UVEL, VVEL; clients connect over the network)
• “I want to analyze TEMP within the North Atlantic Ocean!” – transferring only the data subset is more efficient than shipping the entire data file
• “I want to see the average TEMP of the ocean!” – only the aggregation result needs to be transferred
• “I want to quickly view the general global ocean TEMP” – data samples suffice
• Conclusion: combine flexible data management with a wide-area data transfer protocol
Introduction
• A server-side data virtualization method
– Standard SQL queries over scientific datasets: translate SQL into low-level data access code; data formats: NetCDF, HDF5
– Data subsetting and aggregation: multiple subsetting and aggregation types; greatly decrease the data transfer volume
– Data sampling: efficient data analysis with a small accuracy loss
– Combine with wide-area transfer protocols: flexible data management + efficient data transfer; SDQuery_DSI in Globus GridFTP
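To make the SQL-to-data-access translation concrete, here is a minimal Python sketch of how a dim-based predicate might map onto low-level array access. The grid layout, variable contents, and the helper name `dim_predicate_to_slice` are assumptions for illustration only; a real backend would call the NetCDF/HDF5 libraries instead of numpy.

```python
import numpy as np

def dim_predicate_to_slice(lo, hi):
    """Map a condition like 'depth_t >= lo AND depth_t < hi' to a slice.
    Hypothetical helper; the thesis system generates such access code."""
    return slice(lo, hi)

# Stand-in for a NetCDF variable: TEMP over a (depth, lat, lon) grid.
TEMP = np.arange(4 * 3 * 2, dtype=float).reshape(4, 3, 2)

# "SELECT TEMP FROM POP WHERE depth_t >= 1 AND depth_t < 3"
subset = TEMP[dim_predicate_to_slice(1, 3), :, :]
print(subset.shape)  # (2, 3, 2): only the requested depth levels are read
```

The point of the virtualization layer is that only `subset`, not the full `TEMP` array, crosses the network.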
Thesis Work
• Existing Work:
– Supporting User-Defined Subsetting and Aggregation over Parallel NetCDF Datasets (CCGrid 2012)
– Indexing and Parallel Query Processing Support for Visualizing Climate Datasets (ICPP 2012)
– Taming Massive Distributed Datasets: Data Sampling Using Bitmap Indices (HPDC 2013)
– SDQuery DSI: Integrating Data Management Support with a Wide Area Data Transfer Protocol (SC 2013)
• Future Work:
– Correlation data analysis among multiple variables: bitmap indexing for better efficiency and more flexibility
– Correlation data mining over scientific data
Outline
• Current Work
– Parallel Server-side Data Subsetting and Aggregation
– Flexible Data Sampling and Efficient Error Calculation
– Combine Data Management with Data Transfer Protocol
• Proposed Work
– Flexible Correlation Analysis over Multi-Variables
– Correlation Mining over Scientific Datasets
• Conclusion
Contribution
• Server-side subsetting and aggregation
– Subsetting: dimensions, coordinates, values
– Bitmap indexing: two-phase optimizations
– Aggregation: SUM, AVG, COUNT, MAX, MIN
• Keep data in its native format (e.g., NetCDF, HDF5)
– SciDB, OPeNDAP: huge data loading or transformation cost
• Parallel data processing
– Data partition strategy
– Multiple parallel levels: files, attributes, blocks
• Data visualization
– SDQueryReader in ParaView
– Visualize only subsets of data
Background: Bitmap Indexing
• Widely used in scientific data management
• Suitable for floating-point values by binning small value ranges
• Run-length compression (WAH, BBC)
– Compresses bitvectors based on runs of continuous 0s or 1s
• Can be treated as a small profile of the data
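The binning idea above can be sketched in a few lines of Python. The bin edges and values are illustrative assumptions; real bitmap-index systems additionally apply WAH/BBC run-length compression to each bitvector, exploiting the long runs of 0s and 1s.

```python
import numpy as np

def build_bitmap_index(values, bin_edges):
    """Build one bitvector per bin: bit i is set iff element i falls in
    that bin's value range. Uncompressed sketch; WAH/BBC would compress
    the runs of identical bits in each bitvector."""
    bins = np.digitize(values, bin_edges)              # bin id per element
    return {int(b): (bins == b) for b in np.unique(bins)}

values = np.array([0.2, 1.7, 0.4, 2.9, 1.1])
# Three bins: (-inf, 1.0), [1.0, 2.0), [2.0, inf)
index = build_bitmap_index(values, bin_edges=[1.0, 2.0])

print(index[0])  # bitvector of elements with value < 1.0
```

Because each bin's bitvector records the full element layout, the index doubles as a small profile of the data's value distribution and spatial locality.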
Overview of Server-side Data Subsetting and Aggregation
(Figure: processing pipeline)
1. Parse the SQL expression
2. Parse the metadata file
3. Generate the query request (index generation, index retrieval)
4. Generate the data subset based on point IDs
5. Perform data aggregation
6. Generate the unstructured grid
Bitmap Index Optimizations
• Run-length compression (WAH, BBC)
– Pros: compression rate, fast bitwise operations
– Cons: the ability to locate a dim subset is lost
• Value predicates vs. dim predicates
• Two traditional methods:
– Without bitmap indices: post-filter on values
– With bitmap indices (FastBit): post-filter on dim info
• Two-phase optimizations:
– Index generation: distributed indices over sub-blocks
– Index retrieval: transform dim subsetting conditions into bitvectors; support bitwise operations between dim and value bitvectors
Optimization 1: Distributed Index Generation
• Index generation:
– Generate multiple small indices over sub-blocks of the data
• Partition strategy:
– Study the relationship between queries and partitions
– Partition the data based on query preferences
– α rate: redundancy rate of data elements
• Index retrieval:
– Filter the indices based on dim-based query conditions
Partition Strategy
• Queries involve both value and dim conditions
– Bitmap indexing + dim filter
– Worst case: all elements have to be involved
– Ideal case: elements exactly match the dim subset
• α rate: redundancy rate of data elements
– Number of elements in the index / total data size
• Partition strategies:
– User queries have preferences (timestamp, longitude, latitude)
– Study the relationship between queries and partitions
– Partition the data based on query preferences
– The α rate can be greatly decreased
Optimization 2: Index Retrieval
• Value-based predicates:
– Find the satisfying bitvectors from the index files on disk
• Dim-based predicates:
– Dynamically generate dim bitvectors that satisfy the current predicates, instead of post-filtering
• Fast bitwise operations:
– Logical AND operations are performed between dim and value bitvectors to generate the point ID set
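The retrieval step above can be sketched directly: the value predicate's bitvector comes from the index, the dim predicate's bitvector is generated on the fly from the grid layout, and one logical AND yields the point ID set. Grid shape and the satisfying positions are illustrative assumptions.

```python
import numpy as np

n = 12                                   # flattened 3x4 grid
value_bv = np.zeros(n, dtype=bool)
value_bv[[1, 5, 6, 9, 11]] = True        # from index: points with TEMP in some range

# Dim predicate "rows 1..2 of the 3x4 grid": generate the bitvector dynamically
dim_bv = np.zeros(n, dtype=bool)
dim_bv.reshape(3, 4)[1:3, :] = True      # reshape view marks the wanted rows

# Fast bitwise AND replaces any post-filtering pass over the data
point_ids = np.flatnonzero(value_bv & dim_bv)
print(point_ids)  # [ 5  6  9 11]
```

Only these point IDs are then used to read the actual data, so no element outside the combined predicate is ever loaded.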
Parallel Processing Framework
(Figure: three levels of parallelism)
• L1: data file
• L2: attribute
• L3: data block
Experiment Setup
• Goals:
– Index-based subsetting vs. load + filter in ParaView
– Scalability of the parallel indexing method
– Parallel indexing vs. FastQuery
– Server-side aggregation vs. client-side aggregation
• Datasets:
– POP (Parallel Ocean Program)
– GCRM (Global Cloud Resolving Model)
• Environment:
– IBM Xeon cluster, 8 cores, 2.53 GHz, 12 GB memory
Efficiency Comparison with Filtering in ParaView
• Data size: 5.6 GB; input: 400 queries
• Performance depends on the subset percentage
• The general index method is better than filtering when the data subset is < 60%
• The two-phase optimization achieved a 0.71 – 11.17x speedup compared with the traditional bitmap indexing method
Legend – Index m1: traditional bitmap indexing, no optimization; Index m2: bitwise operations instead of post-filtering; Index m3: both bitwise operations and index partitioning; Filter: load all data + filter
Memory Comparison with Filtering in ParaView
• Data size: 5.6 GB; input: 400 queries
• Memory cost depends on the subset percentage
• The general index method has a much smaller memory cost than the filtering method
• The two-phase optimization adds only a small extra memory cost
Legend – Index m1: bitmap indexing, no optimization; Index m2: bitwise operations instead of post-filtering; Index m3: both bitwise operations and index partitioning; Filter: load all data + filter
Scalability with Different Numbers of Processes
• Data size: 8.4 GB; processes: 6, 24, 48, 96; input: 100 queries
• X axis: subset percentage; Y axis: time
• Each process takes care of one sub-block
• Good scalability as the number of processes increases
Compare with FastQuery
• FastQuery:
– A parallel indexing method based on FastBit: builds a relational table view over the dataset and generates parallel indices based on partitions of the table
– Pros: a standard way to process data based on tables
– Cons: the multi-dim feature is lost; only row-based partitioning is supported; basic reading unit: continuous rows (1-dim segments)
• Our method:
– Flexible partition strategy: partition the multi-dim data based on users’ query preferences
– Fewer read operations; basic reading unit: multi-dim blocks
Execution Time Comparison with FastQuery
• Data size: 8.4 GB; 48 processes
• Query types: value + 1st dim, value + 2nd dim, value + 3rd dim, overall
• Input: 100 queries for each query type
• Achieved a 1.41 – 2.12x speedup compared with FastQuery
Parallel Data Aggregation Efficiency
• Data size: 16 GB; process number: 1 – 16; input: 60 aggregation queries
• Query types: aggregation only; aggregation + GROUP BY; aggregation + GROUP BY + HAVING
• Much smaller data transfer volume
• Relative speedup: 2.61 – 3.08 with 4 processes, 4.31 – 5.52 with 8 processes, 6.65 – 9.54 with 16 processes
Contributions
• Statistical sampling techniques:
– A subset of individuals represents the whole population
– Information loss and error metrics: mean, variance, histogram, Q-Q plot
• Challenges:
– Sampling accuracy must consider data features
– Error calculation has high overhead
• Support data sampling over bitmap indices
– Data samples have better accuracy
– Support error prediction before sampling the data
– Support data sampling over a flexible data subset
– No data reorganization is needed
Data Sampling over Bitmap Indices
• Features of bitmap indexing:
– Each bin (bitvector) corresponds to one value range
– Different bins reflect the entire value distribution
– Each bin keeps the data’s spatial locality: it contains all space IDs (0-bits and 1-bits), whether laid out in row major, column major, Hilbert curve, or Z-order curve
• Method:
– Perform stratified random sampling over each bin
– Multi-level indices generate multi-level samples
Stratified Random Sampling over Bins
• S1: Index generation
• S2: Divide each bitvector into equal strides
• S3: Randomly select a certain percentage of 1’s out of each stride
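Steps S2 and S3 above can be sketched for a single bin's bitvector. The stride size, fraction, and bitvector contents are illustrative assumptions; the idea is that sampling a fixed fraction of 1-bits per stride preserves the bin's spatial locality.

```python
import numpy as np

def stratified_sample_bin(bitvector, stride, fraction, rng):
    """S2: cut the bitvector into equal strides; S3: randomly keep a
    fixed fraction of the 1-bits in each stride (at least one if any)."""
    keep = np.zeros_like(bitvector)
    for start in range(0, len(bitvector), stride):
        ones = np.flatnonzero(bitvector[start:start + stride]) + start
        k = max(1, int(len(ones) * fraction)) if len(ones) else 0
        if k:
            keep[rng.choice(ones, size=k, replace=False)] = True
    return keep

rng = np.random.default_rng(0)
bv = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0], dtype=bool)  # one bin
sample = stratified_sample_bin(bv, stride=4, fraction=0.5, rng=rng)
print(int(sample.sum()))  # one 1-bit kept per 4-bit stride here -> 3
```

Running this over every bin gives a sample that matches the full data's value distribution, which is why the index-based samples are more accurate than simple random sampling.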
Error Prediction vs. Error Calculation
(Figure: two workflows)
• Traditional error calculation: sampling request → data sampling → sample → error calculation; if the error metrics are not good, sampling and error calculation must be repeated multiple times
• Index-based error prediction: predict request → error prediction (repeated cheaply as needed) → error metrics feedback → decide sampling → one sampling request → sample
Error Prediction
• Pre-estimate the error metrics before sampling
• Calculate error metrics based on bins
– Bitmap indices classify the data into bins: each bin corresponds to one value or value range; find representative values for each bin: Vi
– Enforce an equal sampling percentage for each bin: extra metadata records the number of 1-bits of each bin, Ci; compute the number of samples of each bin, Si
– Pre-calculate the error metrics based on Vi and Si
• Representative values:
– Small bin: mean value
– Big bin: lower bound, upper bound, mean value
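The pre-estimation above can be sketched for the mean: with an equal sampling percentage per bin, the expected sample mean follows from each bin's representative value Vi and 1-bit count Ci alone, before any data is touched. The bin values and counts below are illustrative assumptions.

```python
import numpy as np

V = np.array([0.5, 1.5, 2.5])    # Vi: representative value per bin (bin mean)
C = np.array([100, 300, 600])    # Ci: number of 1-bits per bin (index metadata)
p = 0.01                         # sampling percentage, equal for all bins

# Si: predicted number of samples drawn from each bin
S = np.maximum(1, (C * p).astype(int))

# Predicted mean of the eventual sample, computed from metadata only
predicted_mean = float((V * S).sum() / S.sum())
print(predicted_mean)  # 2.0
```

Because this uses only per-bin metadata, many candidate sampling percentages can be evaluated before committing to a single, one-time pass over the data.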
Data Subsetting + Data Sampling
• S1: Find the value subset, e.g., Value = [2, 3)
• S2: Find the spatial ID subset, e.g., RID = (9, 25)
• S3: Perform stratified sampling on the subset
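Steps S1–S3 above amount to one bitvector intersection followed by sampling over the surviving 1-bits. The layout and predicate positions below are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

n = 30
value_bv = np.zeros(n, dtype=bool)
value_bv[::3] = True                  # S1: points whose value falls in [2, 3)
spatial_bv = np.zeros(n, dtype=bool)
spatial_bv[9:25] = True               # S2: spatial IDs in (9, 25)

subset_bv = value_bv & spatial_bv     # combined subset: the sampling domain
ids = np.flatnonzero(subset_bv)

# S3: sample only within the subset (simple random here for brevity;
# the stratified variant would stride over subset_bv instead)
rng = np.random.default_rng(1)
sample_ids = rng.choice(ids, size=len(ids) // 2, replace=False)
print(ids.tolist())
```

Since the subset is resolved at the bitvector level, the sampler never touches data outside the requested region, which is the source of the reported subsetting speedups.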
Experiment Results
• Goals:
– Accuracy among different sampling methods
– Compare predicted error with actual error
– Efficiency among different sampling methods
– Speedup from combining data sampling with subsetting
• Datasets:
– Ocean data: multi-dimensional arrays
– Cosmos data: separate points with 7 attributes
• Environment:
– Darwin cluster: 120 nodes, 48 cores, 64 GB memory
Sample Accuracy Comparison
• Sampling methods:
– Simple random method
– Stratified random method
– KDTree stratified random method
– Big Bin index random method
– Small Bin index random method
• Error metrics:
– Means over 200 separate sectors
– Histogram using 200 value intervals
– Q-Q plot with 200 quantiles
• Sampling percentage: 0.1%
Sample Accuracy Comparison
• Traditional sampling methods cannot achieve good accuracy
• The Small Bin method achieves the best accuracy in most cases
• The Big Bin method achieves accuracy comparable to the KDTree sampling method
(Figure panels: Mean, Histogram, Q-Q Plot)
Predicted Error vs. Actual Error
(Figure: Means, Histogram, and Q-Q Plot for the Small Bin method)
(Figure: Means, Histogram, and Q-Q Plot for the Big Bin method)
Efficiency Comparison
• Index-based sample generation time is proportional to the number of bins (1.10 – 3.98x slower)
• Error calculation based on bins is much faster than error calculation based on the data (> 28x faster)
(Figure panels: Sample Generation Time, Error Calculation Time)
Total Time vs. Resampling Times
(Figure: total sampling time)
• Index-based sampling: multiple error predictions, one-time sampling
• Other sampling methods: multiple samplings, multiple error calculations
• X axis: resampling times
• Speedup of the Small Bin method: 0.91 – 20.12
Speedup of Sampling over Subsets
• X axis: data subsetting percentage (100%, 50%, 30%, 10%, 1%)
• Y axis: index loading time + sample generation time
• Sampling percentage: 25%
• Speedup: 1.47 – 4.98 for spatial subsetting, 2.25 – 21.54 for value subsetting
(Figure panels: Subset over spatial IDs, Subset over values)
Background: Wide-Area Data Transfer Protocols
• Efficient data transfers over a wide-area network
• Globus GridFTP:
– Striped, streaming, parallel data transfer
– Reliable and restartable data transfer
• Limitation:
– The basic data transfer unit is a file (GB or TB level)
– Strong requirements for transferring data subsets
• Goal: integrate core data management functionality with wide-area data transfer protocols
Contribution
• Challenges:
– How should the method be designed to allow easy use and integration with existing GridFTP installations?
– How can users view a remote file and specify the subsets of data?
– How can efficient data retrieval be supported under different subsetting scenarios?
– How can data retrieval be parallelized and benefit from multi-streaming?
• GridFTP SDQuery DSI
– Efficient data transfer over flexible file subsets
– Dynamic loading/unloading with small overhead
– Performance-model-based hybrid data reading
– Parallel streaming data reading and transferring
Motivation: Correlation Analysis
• Correlation analysis among attributes (variables)
– Study relationships among variables
– Make scientific discoveries
– Two scenarios: basic scientific rule verification and discovery; feature mining (halo finding, eddy finding)
• Challenges:
– Correlation analysis is useful but extremely time consuming and resource costly
– No existing method supports flexible correlation analysis on data subsets
Correlation Metrics
• Multi-dimensional histogram:
– Value distributions of the variables
• Entropy:
– A metric showing the variability of the dataset
– Low => constant, predictable data; high => random data
• Mutual information:
– A metric for computing the dependence between two variables
– Low => the two variables are independent; high => one variable provides information about the other
• Pearson correlation coefficient:
– A metric quantifying the linear correspondence between two variables
– Value range: [-1, 1]
– < 0: inversely proportional; > 0: proportional; = 0: independent
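The three metrics above can be computed from binned counts, which is exactly what makes an index-based implementation possible. A minimal numpy sketch with illustrative toy inputs (not POP output):

```python
import numpy as np

def entropy(counts):
    """Shannon entropy in bits of a count (or probability) vector."""
    p = np.asarray(counts, dtype=float).ravel()
    p = p / p.sum()
    p = p[p > 0]                          # 0 * log 0 := 0
    return float(-(p * np.log2(p)).sum())

def mutual_information(joint_counts):
    """MI(X; Y) = H(X) + H(Y) - H(X, Y) from a 2-D joint histogram."""
    j = np.asarray(joint_counts, dtype=float)
    return entropy(j.sum(axis=1)) + entropy(j.sum(axis=0)) - entropy(j)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x                               # perfectly linear in x
pearson = float(np.corrcoef(x, y)[0, 1])  # close to 1.0

joint = np.array([[2, 0], [0, 2]])        # binning where X determines Y
print(pearson, mutual_information(joint))
```

With 2 equally likely bins per variable and full dependence, the mutual information is 1 bit, matching the low/high interpretation on the slide.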
Our Solution and Contribution
• A framework supporting both individual and correlation data analysis based on bitmap indexing
– Individual analysis: flexible data subsetting
– Correlation analysis: interactive queries among multiple variables; correlation metric calculation based on indices; correlation analysis over data subsets
• Support correlation analysis over bitmap indices
– Better efficiency, smaller memory cost
– Support both static indexing and dynamic indexing
– Support correlation analysis over data samples
Use Cases of Correlation Analysis
• Please enter the variable names on which you want to perform correlation queries: TEMP SALT UVEL
• Please enter your SQL query: SELECT TEMP FROM POP WHERE TEMP>0 AND TEMP<1 AND depth_t<50;
– Entropy: TEMP (2.19), SALT (1.90), UVEL (1.48)
– Mutual information: TEMP->SALT: 0.18, TEMP->UVEL: 0.017
– Pearson correlation: …; histograms: (SALT), (UVEL)
• Please enter your SQL query: SELECT SALT FROM POP WHERE SALT<0.0346;
– Entropy: TEMP (2.29), SALT (2.99), UVEL (2.68)
– Mutual information: TEMP->UVEL: 0.02, SALT->UVEL: 0.19
– Pearson correlation: …; histograms: (UVEL)
• Please enter your SQL query: UNDO
– Entropy: TEMP (2.19), SALT (1.90), UVEL (1.48)
– Mutual information: TEMP->SALT: 0.18, TEMP->UVEL: 0.017
– Pearson correlation: …; histograms: (SALT), (UVEL)
• Please enter your query: …
Dynamic Indexing
• No indexing support:
– Load all data for A and B
– Filter A and B to generate the subset
– Combined bins: generate (A1, B1)->count1, …, (Am, Bm)->countm from each data element within the data subset
– Calculate the correlation information based on the combined bins
• Dynamic indexing (indices for each variable):
– Query the bitvectors for A and B (no data loading cost, zero or very small filtering cost)
– Combined bins: generate (A1, B1)->count1, …, (Am, Bm)->countm via bitwise operations between A’s and B’s bitvectors (much faster, because the number of bitvectors is much smaller than the number of elements)
– Calculate the correlation information based on the combined bins
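The dynamic-indexing step above reduces to ANDing bin bitvectors and counting 1-bits: each pair (Ai, Bj) becomes one combined bin whose count is the popcount of the intersection. The bitvectors below are illustrative assumptions.

```python
import numpy as np

# Two bins per variable, over the same 6 elements (no raw data loaded)
A_bins = {0: np.array([1, 1, 0, 0, 1, 0], dtype=bool),
          1: np.array([0, 0, 1, 1, 0, 1], dtype=bool)}
B_bins = {0: np.array([1, 0, 1, 0, 1, 0], dtype=bool),
          1: np.array([0, 1, 0, 1, 0, 1], dtype=bool)}

# Combined bins: count of elements falling in (A bin a, B bin b),
# i.e. the 2-D joint histogram needed for the correlation metrics
combined = {(a, b): int((A_bins[a] & B_bins[b]).sum())
            for a in A_bins for b in B_bins}
print(combined)
```

The resulting counts are exactly the joint histogram that entropy, mutual information, and Pearson estimates consume, so the whole pipeline runs on index metadata alone.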
Static Indexing
• Dynamic indexing: one index for each variable; bitwise operations are still needed to generate the combined bins
• Static indexing: generate one big index file over multiple variables; only bitvector filtering or combining is needed (extremely small cost)
Correlation Mining
• Challenges of correlation queries:
– Users do not know which subsets contain important correlations
– They must keep submitting queries to explore correlations
• Correlation mining:
– Automatically find important correlations
– Suggest correlations to users
• A bottom-up method:
– Generate correlations over basic spatial and value units
– Use bitmap indexing to speed up this process
– Use association rule mining to find and combine similar correlations
Generate Scientific Association Rules
• Example rule: t_lon(10.1−15.1), t_lat(25.2−30.2), depth_t(1−10), TEMP(0−1), SALT(0.01−0.02) → Mutual Information(0.23, High)
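One way to read the rule above: each basic spatial/value unit becomes a transaction labeled with its correlation level, and rules are scored with the usual support/confidence counts from association rule mining. The unit table and ranges below are illustrative assumptions, not mined output.

```python
# Each tuple: (depth_t range, TEMP range, mutual-information label) for one
# basic unit; a hypothetical toy table standing in for the per-unit metrics.
units = [
    ("1-10", "0-1", "High"),
    ("1-10", "0-1", "High"),
    ("1-10", "1-2", "Low"),
    ("10-20", "0-1", "Low"),
]

# Candidate rule: depth_t(1-10), TEMP(0-1) -> Mutual Information High
antecedent = ("1-10", "0-1")
matches = [u for u in units if (u[0], u[1]) == antecedent]
hits = [u for u in matches if u[2] == "High"]

support = len(hits) / len(units)        # rule frequency over all units
confidence = len(hits) / len(matches)   # reliability given the antecedent
print(support, confidence)  # 0.5 1.0
```

High-confidence rules like this are the "important correlations" the bottom-up miner would suggest to users instead of making them probe subsets by hand.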
Feature Mining
• Feature mining based on correlation analysis
– Sub-halo: correlation between space and velocity
– Eddy: correlation between speeds in different directions
• The OW criterion to find eddies
– OW > 0: not an eddy; OW <= 0: might be an eddy
– One detection method: build v based on row major (x, y); build u based on column major (y, x); an eddy cannot exist across a long sequence of 1-bits
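Assuming the OW quantity here is the standard Okubo-Weiss parameter W = s_n² + s_s² − ω² (normal strain, shear strain, relative vorticity), the criterion can be sketched with finite differences. The flow field is a synthetic solid-body vortex for illustration, and derivatives are taken in index units since only the sign of W matters.

```python
import numpy as np

# Synthetic velocity field on a 21x21 grid: solid-body rotation,
# i.e. pure vorticity with no strain -> W < 0 everywhere
y, x = np.mgrid[-1:1:21j, -1:1:21j]
u, v = -y, x

# np.gradient returns derivatives along (axis 0 = y, axis 1 = x)
du_dy, du_dx = np.gradient(u)
dv_dy, dv_dx = np.gradient(v)

s_n = du_dx - dv_dy          # normal strain
s_s = dv_dx + du_dy          # shear strain
omega = dv_dx - du_dy        # relative vorticity
W = s_n**2 + s_s**2 - omega**2

print(bool((W <= 0).all()))  # every cell is vorticity-dominated
```

Cells with W <= 0 become the 1-bits of a candidate-eddy bitvector, which is what the row-major/column-major bitvector construction on the slide then scans.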
Conclusion
• The “Big Data” challenge
• A server-side data virtualization method
• Server-side data subsetting and aggregation
• Data sampling based on bitmap indexing
• Integrate flexible data management with an efficient data transfer protocol
• Future work:
– Correlation queries
– Correlation mining
Thanks for your attention!
Q & A