Oral Exam 2013
A Virtualization-based Data Management Framework for Big Data Applications
Yu Su
Advisor: Dr. Gagan Agrawal, The Ohio State University


Page 1: A Virtualization-based Data Management Framework for Big Data Applications

Oral Exam 2013

A Virtualization-based Data Management Framework for Big Data Applications

Yu Su
Advisor: Dr. Gagan Agrawal,

The Ohio State University

Page 2

Motivation: Scientific Data Analysis

• Science becomes increasingly data driven
• Strong requirements for efficient data analysis

Roadrunner EC3 simulation: 4000³ records; 7 attributes (X, Y, VX, … MASS); 36 bytes per record; simulation speed: 2.3 TB

Parallel Ocean Program: 3-D grid: 42 × 2400 × 3600; > 30 attributes (TEMP, SALT, …); 1.4 GB per attribute; simulation speed: > 50 GB

Page 3

Motivation: Big Data

• “Big Data” Challenge:
– Fast data generation speed
– Slow disk I/O and network speed
– The gap will become bigger in the future
– Different data formats

• Observations:
– Scientific analysis over data subsets
  Community Climate System Model; Data Pipelines from Tomography; X-ray Photon Correlation Spectroscopy
  Attribute Subset, Spatial Subset, Value Subset
– Multi-resolution data analysis
– Wide-area data transfer protocols

Page 4

An Example of Ocean Simulation

[Figure: a remote data server holds POP.nc with variables TEMP, SALT, UVEL, VVEL; requests travel over the network, and only the needed result comes back instead of the entire data file.]

• “I want to analyze TEMP within the North Atlantic Ocean!” → Data Subset (more efficient than transferring the entire data file)
• “I want to see the average TEMP of the ocean!” → Aggregation Result
• “I want to quickly view the general global ocean TEMP.” → Data Samples

Combine Flexible Data Management with a Wide-Area Data Transfer Protocol

Page 5

Introduction

• A server-side data virtualization method
– Standard SQL queries over scientific datasets
  Translate SQL into low-level data access code
  Data formats: NetCDF, HDF5
– Data subsetting and aggregation
  Multiple subsetting and aggregation types
  Greatly decrease the data transfer volume
– Data sampling
  Efficient data analysis with small accuracy loss
– Combine with wide-area transfer protocols
  Flexible data management + efficient data transfer
  SDQuery_DSI in Globus GridFTP
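The subsetting half of this translation can be sketched in a few lines. This is an illustrative sketch only: the `translate` helper and the dimension names are invented for the example, while the actual system parses full SQL and generates NetCDF/HDF5 access code.

```python
# Hypothetical sketch of the virtualization idea: an already-parsed
# dimension predicate from a SQL-style query is mapped to array slices,
# so only the requested subset of a variable would be read.

def translate(dim_predicates, dim_order):
    """Map {dim: (lo, hi)} predicates to a tuple of slices, one per dimension."""
    return tuple(slice(*dim_predicates.get(d, (None, None))) for d in dim_order)

# e.g. "SELECT TEMP FROM POP WHERE 10 <= t_lat < 20", predicate already parsed:
slices = translate({"t_lat": (10, 20)}, ["depth_t", "t_lat", "t_lon"])
print(slices)  # (slice(None, None, None), slice(10, 20, None), slice(None, None, None))
```

With a real NetCDF file, such a tuple could be applied directly to a variable (e.g. `dataset.variables["TEMP"][slices]` in netCDF4-python), so the transfer volume matches the subset rather than the file.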

Page 6

Thesis Work

• Existing Work:

– Supporting User-Defined Subsetting and Aggregation over Parallel NetCDF Datasets (CCGrid2012)

– Indexing and Parallel Query Processing Support for Visualizing Climate Datasets (ICPP2012)

– Taming Massive Distributed Datasets: Data Sampling Using Bitmap Indices (HPDC2013)

– SDQuery DSI: Integrating Data Management Support with a Wide Area Data Transfer Protocol (SC2013)

• Future Work:
– Correlation Data Analysis among Multiple Variables
  Bitmap indexing: better efficiency, more flexibility
– Correlation Data Mining over Scientific Data

Page 7

Outline

• Current Work
– Parallel Server-side Data Subsetting and Aggregation
– Flexible Data Sampling and Efficient Error Calculation
– Combine Data Management with Data Transfer Protocol

• Proposed Work
– Flexible Correlation Analysis over Multi-Variables
– Correlation Mining over Scientific Datasets

• Conclusion

Page 8

Contribution

• Server-side subsetting and aggregation
– Subsetting: Dimensions, Coordinates, Values
– Bitmap Indexing: two-phase optimizations
– Aggregation: SUM, AVG, COUNT, MAX, MIN

• Keep data in native format (e.g., NetCDF, HDF5)
– SciDB, OPeNDAP: huge data loading or transformation cost

• Parallel data processing
– Data Partition Strategy
– Multiple Parallel Levels: Files, Attributes, Blocks

• Data visualization
– SDQueryReader in ParaView
– Visualize only subsets of data

Page 9

Background: Bitmap Indexing

• Widely used in scientific data management

• Suitable for floating-point values by binning small ranges
• Run-Length Compression (WAH, BBC)
– Compress bitvectors based on runs of continuous 0s or 1s
• Can be treated as a small profile of the data
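A toy version of the idea, assuming nothing beyond what the slide states: one bitvector per value bin, plus run-length compression of the continuous 0s and 1s. The helper names are invented, and real systems operate on WAH/BBC-compressed bitvectors rather than plain lists.

```python
def build_bitmap_index(values, bin_edges):
    """One bitvector (plain 0/1 list here) per value bin."""
    bins = [[0] * len(values) for _ in range(len(bin_edges) - 1)]
    for i, v in enumerate(values):
        for b in range(len(bin_edges) - 1):
            if bin_edges[b] <= v < bin_edges[b + 1]:
                bins[b][i] = 1
                break
    return bins

def run_length_encode(bitvector):
    """Toy run-length compression: (bit, run_length) pairs."""
    runs = []
    for bit in bitvector:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return [tuple(r) for r in runs]

temps = [0.5, 2.1, 2.3, 0.9, 3.7]
index = build_bitmap_index(temps, [0, 1, 2, 3, 4])
print(index[2])                     # bin [2, 3): rows 1 and 2 are set
print(run_length_encode(index[2]))  # runs of continuous 0s and 1s
```

Because each bin records where a value range occurs, the set of bitvectors doubles as a compact "profile" of the value distribution, which the later sampling and correlation work exploits.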

Page 10

Overview of Server-side Data Subsetting and Aggregation

1. Parse the SQL expression
2. Parse the metadata file
3. Generate query request
4. Index generation
5. Index retrieval
6. Generate the data subset based on IDs
7. Perform data aggregation
8. Generate the unstructured grid

Page 11

Bitmap Index Optimizations

• Run-Length Compression (WAH, BBC)
– Pros: compression rate, fast bitwise operations
– Cons: the ability to locate a dim subset is lost

• Value Predicates vs. Dim Predicates
• Two traditional methods:
– Without bitmap indices: post-filter on values
– With bitmap indices (FastBit): post-filter on dim info

• Two-phase optimizations:
– Index Generation: distributed indices over sub-blocks
– Index Retrieval:
  Transform dim subsetting conditions into bitvectors
  Support bitwise operations between dim and value bitvectors
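The retrieval optimization can be illustrated with uncompressed bitvectors. The helpers below are hypothetical: a dim-range predicate is turned into a bitvector on the fly and ANDed with a value bitvector from the index, rather than post-filtering the candidate points. The actual system performs these operations on compressed WAH bitvectors.

```python
def dim_bitvector(shape, dim, lo, hi):
    """Bitvector with a 1 for every grid point whose index along `dim`
    lies in [lo, hi), generated on the fly from the predicate."""
    nx, ny = shape
    return [1 if lo <= (x, y)[dim] < hi else 0
            for x in range(nx) for y in range(ny)]

def bit_and(a, b):
    return [x & y for x, y in zip(a, b)]

# 3x4 grid; the value predicate was already answered by the index:
value_bits = [1, 0, 1, 0,  0, 1, 1, 0,  1, 1, 0, 0]
# dim predicate: column index in [1, 3)
dim_bits = dim_bitvector((3, 4), 1, 1, 3)
ids = [i for i, b in enumerate(bit_and(value_bits, dim_bits)) if b]
print(ids)  # point IDs satisfying both predicates: [2, 5, 6, 9]
```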

Page 12

Optimization 1: Distributed Index Generation

• Index Generation:
– Generate multiple small indices over sub-blocks of data
• Partition Strategy:
– Study the relationship between queries and partitions
– Partition the data based on query preferences
– α rate: redundancy rate of data elements
• Index Retrieval:
– Filter the indices based on dim-based query conditions

Page 13

Partition Strategy

• Queries involve both value and dim conditions
– Bitmap Indexing + Dim Filter
– Worst case: all elements have to be involved
– Ideal case: elements exactly the same as the dim subset

• α rate: redundancy rate of data elements
– Number of elements in index / total data size

• Partition Strategies:
– Users' queries have preferences
  Timestamp, Longitude, Latitude
– Study the relationship between queries and partitions
– Partition the data based on query preferences
– α rate can be greatly decreased
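For a single preferred dimension, the effect of the partition strategy on the α rate can be sketched with a simplified 1-D model (the function and numbers are illustrative, not the thesis implementation): finer blocks along the dimension users query most reduce the redundant elements the index must touch.

```python
def alpha_rate(total, block_size, q_lo, q_hi):
    """alpha = fraction of the dataset covered by the index blocks that
    intersect the queried range [q_lo, q_hi) along the preferred dim."""
    first = q_lo // block_size
    last = (q_hi - 1) // block_size
    touched = (last - first + 1) * block_size
    return min(touched, total) / total

print(alpha_rate(100, 25, 13, 27))  # coarse blocks: 0.5
print(alpha_rate(100, 10, 13, 27))  # finer blocks along the queried dim: 0.2
```

The exact dim subset here holds 14 of 100 elements, so the finer partition is much closer to the ideal case where α equals the subset fraction.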

Page 14

Optimization 2: Index Retrieval

Post-filter?

• Value-based Predicates:
– Find satisfied bitvectors from index files on disk
• Dim-based Predicates:
– Dynamically generate dim bitvectors which satisfy the current predicates
• Fast Bitwise Operations:
– Logical AND operations are performed between dim and value bitvectors to generate the point ID set

Page 15

Parallel Processing Framework

L1: data file
L2: attribute
L3: data block

Page 16

Experiment Setup

• Goals:
– Index-based Subsetting vs. Load + Filter in ParaView
– Scalability of the Parallel Indexing Method
– Parallel Indexing vs. FastQuery
– Server-side Aggregation vs. Client-side Aggregation

• Datasets:
– POP (Parallel Ocean Program)
– GCRM (Global Cloud Resolving Model)

• Environment:
– IBM Xeon cluster, 8 cores, 2.53 GHz
– 12 GB memory

Page 17

Efficiency Comparison with Filtering in Paraview

• Data size: 5.6 GB
• Input: 400 queries
• Depends on subset percentage
• The general index method is better than filtering when the data subset < 60%
• The two-phase optimization achieved a 0.71 – 11.17 speedup compared with the traditional bitmap indexing method

Index m1: traditional bitmap indexing, no optimization
Index m2: use bitwise operations instead of post-filtering
Index m3: use both bitwise operations and index partitioning
Filter: load all data + filter

Page 18

Memory Comparison with Filtering in Paraview

• Data size: 5.6 GB
• Input: 400 queries
• Depends on subset percentage
• The general index method has a much smaller memory cost than the filtering method
• The two-phase optimization only has a small extra memory cost

Index m1: bitmap indexing, no optimization
Index m2: use bitwise operations instead of post-filtering
Index m3: use both bitwise operations and index partitioning
Filter: load all data + filter

Page 19

Scalability with Different Proc#

• Data size: 8.4 GB
• Proc#: 6, 24, 48, 96
• Input: 100 queries
• X axis: subset percentage
• Y axis: time
• Each process takes care of one sub-block
• Good scalability as the number of processes increases

Page 20

Comparison with FastQuery

• FastQuery:
– A parallel indexing method based on FastBit
  Builds a relational table view over the dataset
  Generates parallel indices based on a partition of the table
– Pros: a standard way to process data based on tables
– Cons: the multi-dim feature is lost
  Only supports row-based partitioning
  Basic reading unit: continuous rows (1-dim segments)

• Our method:
– Flexible Partition Strategy
  Partition the multi-dim data based on users' query preferences
– Fewer reads
  Basic reading unit: multi-dim blocks

Page 21

Execution Time Comparison with FastQuery

• Data size: 8.4 GB, 48 processes
• Query types: value + 1st dim, value + 2nd dim, value + 3rd dim, overall
• Input: 100 queries for each query type
• Achieved a 1.41 – 2.12 speedup compared with FastQuery

Page 22

Parallel Data Aggregation Efficiency

• Data size: 16 GB
• Process number: 1 – 16
• Input: 60 aggregation queries
• Query types:
– Only Agg
– Agg + Group By + Having
– Agg + Group By
• Much smaller data transfer volume
• Relative speedup:
– 4 procs: 2.61 – 3.08
– 8 procs: 4.31 – 5.52
– 16 procs: 6.65 – 9.54

Page 23

Outline

• Current Work
– Parallel Server-side Data Subsetting and Aggregation
– Flexible Data Sampling and Efficient Error Calculation
– Combine Data Management with Data Transfer Protocol

• Proposed Work
– Flexible Correlation Analysis over Multi-Variables
– Correlation Mining over Scientific Datasets

• Conclusion

Page 24

Contributions

• Statistical Sampling Techniques:
– A subset of individuals to represent the whole population
– Information Loss and Error Metrics:
  Mean, Variance, Histogram, Q-Q Plot

• Challenges:
– Sampling accuracy considering data features
– Error calculation with high overhead

• Support Data Sampling over Bitmap Indices
– Data samples have better accuracy
– Support error prediction before sampling the data
– Support data sampling over a flexible data subset
– No data reorganization is needed

Page 25

Data Sampling over Bitmap Indices

• Features of Bitmap Indexing:
– Each bin (bitvector) corresponds to one value range
– Different bins reflect the entire value distribution
– Each bin keeps the data's spatial locality
  Contains all space IDs (0-bits and 1-bits)
  Row Major, Column Major, Hilbert Curve, Z-Order Curve

• Method:
– Perform stratified random sampling over each bin
– Multi-level indices generate multi-level samples

Page 26

Stratified Random Sampling over Bins

S1: Index Generation
S2: Divide Bitvector into Equal Strides
S3: Randomly select a certain % of the 1's out of each stride
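Steps S2 and S3 can be sketched as follows. The code is illustrative: `sample_bitvector` is an invented helper operating on one bin's uncompressed bitvector, and the stride and percentage values are arbitrary.

```python
import random

def sample_bitvector(bits, stride, fraction, rng):
    """Stratified sampling over one bin's bitvector:
    pick `fraction` of the 1-bits in each stride."""
    sampled = []
    for s in range(0, len(bits), stride):
        ones = [i for i in range(s, min(s + stride, len(bits))) if bits[i]]
        k = max(1, round(fraction * len(ones))) if ones else 0
        sampled.extend(rng.sample(ones, k))
    return sorted(sampled)

bits = [1, 0, 1, 1,   0, 1, 1, 0,   1, 1, 1, 1]  # one bin, 12 positions
ids = sample_bitvector(bits, 4, 0.5, random.Random(7))
print(ids)  # about 50% of the 1-bits, drawn from every stride
```

Because every bin is sampled this way, the result preserves both the value distribution (across bins) and the spatial locality (across strides).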

Page 27

Error Prediction vs. Error Calculation

[Figure: two workflows compared.
Error Calculation: Sampling Request → Data Sampling → Sample → Error Calculation → Not Good? → repeat (multi-time samplings).
Error Prediction: Predict Request → Error Prediction → Error Metrics Feedback → Decide Sampling → Sampling Request → Sample (one-time sampling).]

Page 28

Error Prediction

• Pre-estimate the error metrics before sampling
• Calculate error metrics based on bins
– Bitmap indices classify the data into bins
  Each bin corresponds to one value or value range
  Find some representative values for each bin: Vi
– Enforce an equal sampling percentage for each bin
  Extra metadata: number of 1-bits of each bin: Ci
  Compute the number of samples of each bin: Si
– Pre-calculate error metrics based on Vi and Si

• Representative Values:
– Small Bin: mean value
– Big Bin: lower bound, upper bound, mean value
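As a concrete illustration of the pre-calculation, the sketch below predicts the sample mean from bin representatives Vi and 1-bit counts Ci alone, before any data is read. The function name and the bin values are made up for the example; the thesis predicts further metrics (variance, histogram, Q-Q plot) the same way.

```python
def predicted_mean(bin_reps, bin_counts, sample_pct):
    """Pre-estimate the sample mean before any data is read:
    with an equal sampling percentage per bin, bin i contributes
    S_i = pct * C_i samples, all approximated by its representative V_i."""
    total = 0.0
    n = 0.0
    for v, c in zip(bin_reps, bin_counts):
        s = sample_pct * c      # S_i
        total += v * s
        n += s
    return total / n

# Bins [0,1), [1,2), [2,3) with mean representatives V_i and 1-bit counts C_i:
est = predicted_mean([0.5, 1.5, 2.5], [100, 300, 100], 0.1)
print(est)  # close to 1.5
```

Since only the per-bin metadata (Vi, Ci) is touched, this prediction is cheap enough to run many times while the user searches for an acceptable sampling percentage.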

Page 29

Data Subsetting + Data Sampling

S1: Find the value subset: Value = [2, 3)
S2: Find the spatial ID subset: RID = (9, 25)
S3: Perform stratified sampling on the subset

Page 30

Experiment Results

• Goals:
– Accuracy among different sampling methods
– Compare predicted error with actual error
– Efficiency among different sampling methods
– Speedup for combining data sampling with subsetting

• Datasets:
– Ocean Data: multi-dimensional arrays
– Cosmos Data: separate points with 7 attributes

• Environment:
– Darwin cluster: 120 nodes, 48 cores, 64 GB memory

Page 31

Sample Accuracy Comparison

• Sampling Methods:
– Simple Random Method
– Stratified Random Method
– KDTree Stratified Random Method
– Big Bin Index Random Method
– Small Bin Index Random Method

• Error Metrics:
– Means over 200 separate sectors
– Histogram using 200 value intervals
– Q-Q Plot with 200 quantiles

• Sampling Percentage: 0.1%

Page 32

Sample Accuracy Comparison

• Traditional sampling methods cannot achieve good accuracy;
• The Small Bin method achieves the best accuracy in most cases;
• The Big Bin method achieves accuracy comparable to the KDTree sampling method.

(Panels: Mean, Histogram, Q-Q Plot)

Page 33

Predicted Error vs. Actual Error

Means, Histogram, Q-Q Plot for Small Bin Method

Means, Histogram, Q-Q Plot for Big Bin Method

Page 34

Efficiency Comparison

• Index-based sample generation time is proportional to the number of bins (1.10 to 3.98 times slower).
• The error calculation time based on bins is much smaller than that based on data (> 28 times faster).

(Panels: Sample Generation Time, Error Calculation Time)

Page 35

Total Time based on Resampling Times

Total Sampling Time

• Index-based Sampling:
– Multi-time error calculations
– One-time sampling
• Other Sampling Methods:
– Multi-time samplings
– Multi-time error calculations
• X axis: resampling times
• Speedup of Small Bin: 0.91 – 20.12

Page 36

Speedup of Sampling over Subset

• X axis: Data Subsetting Percentage (100%, 50%, 30%, 10%, 1%)
• Y axis: Index Loading Time + Sample Generation Time
• 25% Sampling Percentage
• Speedup: 1.47 – 4.98 for spatial subsetting, 2.25 – 21.54 for value subsetting

(Panels: Subset over Spatial IDs, Subset over Values)

Page 37

Outline

• Current Work
– Parallel Server-side Data Subsetting and Aggregation
– Flexible Data Sampling and Efficient Error Calculation
– Combine Data Management with Data Transfer Protocol

• Proposed Work
– Flexible Correlation Analysis over Multi-Variables
– Correlation Mining over Scientific Datasets

• Conclusion

Page 38

Background: Wide-Area Data Transfer Protocols

• Efficient data transfers over wide-area networks
• Globus GridFTP:
– Striped, streaming, parallel data transfer
– Reliable and restartable data transfer
• Limitation: data volume
– The basic data transfer unit is a file (GB or TB level)
– Strong requirements for transferring data subsets

• Goal: Integrate core data management functionality with wide-area data transfer protocols

Page 39

Contribution

• Challenges:
– How should the method be designed to allow easy use and integration with existing GridFTP installations?
– How can users view a remote file and specify subsets of the data?
– How can efficient data retrieval be supported under different subsetting scenarios?
– How can data retrieval be parallelized to benefit from multi-streaming?

• GridFTP SDQuery DSI
– Efficient data transfer over flexible file subsets
– Dynamic loading / unloading with small overhead
– Performance-model-based hybrid data reading
– Parallel streaming data reading and transferring

Page 40

Outline

• Current Work
– Parallel Server-side Data Subsetting and Aggregation
– Flexible Data Sampling and Efficient Error Calculation
– Combine Data Management with Data Transfer Protocol

• Proposed Work
– Flexible Correlation Analysis over Multi-Variables
– Correlation Mining over Scientific Datasets

• Conclusion

Page 41

Motivation: Correlation Analysis

• Correlation (Multi-Variable) Analysis
– Study relationships among variables
– Make scientific discoveries
– Two Scenarios:
  Basic Scientific Rule Verification and Discovery
  Feature Mining: halo finding, eddy finding

• Challenge:
– Correlation analysis is useful but extremely time-consuming and resource-costly
– No method supports flexible correlation analysis on a data subset

Page 42

Correlation Metrics

• Multi-Dimensional Histogram:
– Value distributions of variables;

• Entropy
– A metric to show the variability of the dataset;
– Low => constant, predictable data;
– High => random data;

• Mutual Information
– A metric for computing the dependence between two variables;
– Low => the two variables are independent;
– High => one variable provides information about the other;

• Pearson Correlation Coefficient
– A metric to quantify the linear correspondence between two variables;
– Value Range: [-1, 1];
– < 0: inversely proportional; > 0: proportional; = 0: independent;
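The first two metrics can be computed directly from (joint) histograms, which is what makes the bin counts held in a bitmap index sufficient for correlation analysis. A minimal sketch of entropy and mutual information (Pearson correlation is omitted; function names are illustrative):

```python
import math

def entropy(probs):
    """Shannon entropy (bits) of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(A;B) in bits from a joint histogram (list of rows of counts)."""
    n = sum(sum(row) for row in joint)
    pa = [sum(row) / n for row in joint]
    pb = [sum(joint[i][j] for i in range(len(joint))) / n
          for j in range(len(joint[0]))]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, c in enumerate(row):
            if c:
                pab = c / n
                mi += pab * math.log2(pab / (pa[i] * pb[j]))
    return mi

# Two perfectly dependent binary variables: I(A;B) equals H(A).
joint = [[50, 0], [0, 50]]
print(entropy([0.5, 0.5]))        # 1.0 bit
print(mutual_information(joint))  # 1.0 bit
```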

Page 43

Our Solution and Contribution

• A framework which supports both individual and correlation data analysis based on bitmap indexing
– Individual Analysis: flexible data subsetting
– Correlation Analysis:
  Interactive queries among multiple variables
  Correlation metrics calculation based on indices
  Support correlation analysis over data subsets

• Support Correlation Analysis over Bitmap Indices
– Better efficiency, smaller memory cost
– Support both Static Indexing and Dynamic Indexing
– Support correlation analysis over data samples

Page 44

Use Cases of Correlation Analysis

• Please enter the variable names on which you want to perform correlation queries:
  TEMP SALT UVEL
• Please enter your SQL query:
  SELECT TEMP FROM POP WHERE TEMP>0 AND TEMP<1 AND depth_t<50;
  Entropy: TEMP(2.19), SALT(1.90), UVEL(1.48)
  Mutual Information: TEMP-SALT: 0.18, TEMP-UVEL: 0.017
  Pearson Correlation: …  Histogram: (SALT), (UVEL)
• Please enter your SQL query:
  SELECT SALT FROM POP WHERE SALT<0.0346;
  Entropy: TEMP(2.29), SALT(2.99), UVEL(2.68)
  Mutual Information: TEMP-UVEL: 0.02, SALT-UVEL: 0.19
  Pearson Correlation: …  Histogram: (UVEL)
• Please enter your SQL query:
  UNDO
  Entropy: TEMP(2.19), SALT(1.90), UVEL(1.48)
  Mutual Information: TEMP-SALT: 0.18, TEMP-UVEL: 0.017
  Pearson Correlation: …  Histogram: (SALT), (UVEL)
• Please enter your query:

Page 45

Dynamic Indexing

• No Indexing Support:
– Load all data for A and B;
– Filter A and B to generate the subset;
– Combined Bins: generate (A1, B1)->count1, …, (Am, Bm)->countm based on each data element within the data subset;
– Calculate correlation information based on the combined bins;

• Dynamic Indexing (indices for each variable):
– Query bitvectors for A and B; (no data loading cost, zero or very small filtering cost)
– Combined Bins: generate (A1, B1)->count1, …, (Am, Bm)->countm based on bitwise operations between A and B (much faster because the number of bitvectors is much smaller than the number of elements)
– Calculate correlation information based on the combined bins
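The combined-bin construction under dynamic indexing can be sketched with uncompressed bitvectors (`combined_bins` is a hypothetical helper; the real system ANDs WAH-compressed bitvectors): each (Ai, Bj) count is the number of 1-bits shared by A's bin i and B's bin j, so the raw data is never scanned.

```python
def combined_bins(index_a, index_b):
    """(i, j) -> number of elements in bin i of A and bin j of B,
    counted by ANDing bitvectors instead of scanning the raw data."""
    counts = {}
    for i, va in enumerate(index_a):
        for j, vb in enumerate(index_b):
            c = sum(x & y for x, y in zip(va, vb))
            if c:
                counts[(i, j)] = c
    return counts

index_a = [[1, 1, 0, 0, 1], [0, 0, 1, 1, 0]]  # A's two bins over 5 elements
index_b = [[1, 0, 1, 0, 0], [0, 1, 0, 1, 1]]  # B's bins over the same elements
print(combined_bins(index_a, index_b))
```

The resulting (i, j) -> count table is exactly the joint histogram that the entropy and mutual-information metrics consume.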

Page 46

Static Indexing

• Dynamic Indexing: one index for each variable. Still needs to perform bitwise operations to generate combined bins.
• Static Indexing: generate one big index file over multiple variables. Only needs to perform bitvector filtering or combining. (Extremely small cost)

Page 47

Outline

• Current Work
– Parallel Server-side Data Subsetting and Aggregation
– Flexible Data Sampling and Efficient Error Calculation
– Combine Data Management with Data Transfer Protocol

• Proposed Work
– Flexible Correlation Analysis over Multi-Variables
– Correlation Mining over Scientific Datasets

• Conclusion

Page 48

Correlation Mining

• Challenges of Correlation Queries
– Users do not know which subsets contain important correlations
– They keep submitting queries to explore correlations

• Correlation Mining:
– Automatically find important correlations
– Suggest correlations to users

• A bottom-up method:
– Generate correlations over basic spatial and value units
– Use bitmap indexing to speed up this process
– Use association rule mining to find and combine similar correlations

Page 49

Generate Scientific Association Rule

Association Rule Example: t_lon(10.1−15.1), t_lat(25.2−30.2), depth_t(1−10), TEMP(0−1), SALT(0.01−0.02) → Mutual Information(0.23, High)

Page 50

Feature Mining

• Feature Mining based on Correlation Analysis
– Sub-halo: correlation between space and velocity
– Eddy: correlation between speeds in different directions

• OW distance to find eddies
– OW > 0: not an eddy; OW <= 0: might be an eddy
– One detection method:
  Build v based on row-major order (x, y)
  Build u based on column-major order (y, x)
  An eddy cannot exist for a long sequence of 1-bits

Page 51

Conclusion

• “Big Data” challenge
• A server-side data virtualization method
• Server-side data subsetting and aggregation
• Data sampling based on bitmap indexing
• Integrate flexible data management with an efficient data transfer protocol
• Future work:
– Correlation queries
– Correlation mining

Page 52

Thanks for your attention!
Q & A