Predicting Multiple Metrics for Queries: Better Decisions Enabled by Machine Learning

By: Archana Ganapathi, Harumi Kuno, Umeshwar Dayal, Janet L. Wiener, Armando Fox, Michael Jordan, David Patterson

Problem:

• Predicting the performance (running time, resource usage) of a query before executing it helps with:

– Workload management: query scheduling

– System sizing: what a system requires to answer a query within a time constraint

– Capacity planning: given an expected workload, does the system require an upgrade?

Why it is a hard problem

• Sources of uncertainty

– Skewed data distribution

– Inaccurate cardinality prediction

• Complex query plans

• Huge amounts of data

• Different schemas for different databases make using ML a big challenge

Solution

• The solution should simultaneously predict all performance metrics, for both short- and long-running queries, using only information available prior to query execution.

• Potential candidates:

• Cost models

– Manually model the performance of each operator under each configuration setting, then estimate the final value from the query plan.

– Estimation errors propagate.

• Machine learning

– Build a model from training data.

– Not sensitive to estimation errors, since prediction is based on similarity.

Experiment set up (data)

• Machines used to gather training and test query performance metrics:

• HP Neoview database system

• Machines with 4, 8, 16 and 32 processors

• Fixed memory allocated per CPU

• Each CPU has its own disk, and data is partitioned roughly equally across all 4 disks.

Experiment set up (query)

• Categorize queries by runtime:

– 0 min < feather < 3 min

– 3 min < golfball < 30 min

– 30 min < bowling ball < 2 h

• Standard decision-support benchmark (TPC-DS) templates were used to generate feather queries.

• For longer queries, new templates were written from real queries that took at least 4 hours to compute.

• Some feather queries from another database with a different schema were included in the training and test sets.

• Producing queries with the appropriate performance was hard and time-consuming, since changing a single constant might turn a feather into a bowling ball or vice versa.
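The runtime categories above can be sketched as a small helper. This is only an illustration: the function name, the handling of the exact boundary values (3 min, 30 min, 2 h) and the "uncategorized" label are assumptions, since the slides give only strict inequalities.

```python
def categorize_query(elapsed_minutes: float) -> str:
    """Map an elapsed time in minutes to the runtime category from the slides."""
    if elapsed_minutes <= 0:
        raise ValueError("elapsed time must be positive")
    if elapsed_minutes < 3:
        return "feather"
    if elapsed_minutes < 30:   # assumption: exactly 3 min counts as a golfball
        return "golfball"
    if elapsed_minutes < 120:  # assumption: exactly 30 min counts as a bowling ball
        return "bowlingball"
    # The slides define no category at or beyond 2 hours.
    return "uncategorized"


categories = [categorize_query(m) for m in (1, 10, 90)]
```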

Independent modelling of performance metrics

• Regression

• Individually model each performance metric: y = A1X1 + A2X2 + … + AnXn

• Regression uses a different set of features for each performance metric, which makes it hard to unify all performance metrics in one model.
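To make the "one model per metric" point concrete, here is a minimal least-squares sketch of y = A1X1 + … + AnXn fitted separately for two metrics. The function name `fit_linear`, the toy feature rows and the metric values are all made up for illustration, and for brevity both metrics share the same features here, whereas the slide notes that regression ends up using different features per metric.

```python
def fit_linear(rows, targets):
    """Least-squares fit of y = A1*X1 + ... + An*Xn (no intercept).

    Solves the normal equations (X^T X) A = X^T y by Gauss-Jordan elimination.
    """
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(rows, targets))]
        for i in range(n)
    ]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("features are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(n):
            if k != col:
                factor = aug[k][col] / aug[col][col]
                aug[k] = [a - factor * b for a, b in zip(aug[k], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Hypothetical training data: two query features, two performance metrics.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
metrics = {
    "elapsed_time": [2.0, 3.0, 5.0, 7.0],
    "disk_io": [10.0, 1.0, 11.0, 21.0],
}

# One independent model (coefficient vector) per metric.
models = {name: fit_linear(features, ys) for name, ys in metrics.items()}
```

Each entry of `models` is fitted in isolation, so nothing ties the prediction of one metric to the prediction of another.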

Joint modelling of performance metrics

• Clustering: groups the entries of a single dataset based on their similarity.

• PCA: projects a dataset onto the dimensions of maximal variance, for clustering.

• (K)CCA: finds the dimensions of maximal correlation between a pair of datasets and maps each dataset onto those dimensions. The notion of similarity can be defined by the user in a kernel function.

[Figure: query features (before running) and performance features (after running) as the two inputs to KCCA]

KCCA

• We are given N queries.

• We produce two N×N similarity matrices: one over the query features and one over the query performance features.
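A minimal sketch of building the two N×N similarity matrices. The slides say only that similarity is defined by a kernel function; the Gaussian kernel, the scale parameter `tau` and the toy vectors below are assumptions made for illustration. The KCCA step itself, which finds the correlated projections from these matrices, is not shown.

```python
import math


def gaussian_kernel_matrix(vectors, tau):
    """Return the N x N matrix K with K[i][j] = exp(-||v_i - v_j||^2 / tau)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        [math.exp(-squared_distance(a, b) / tau) for b in vectors]
        for a in vectors
    ]


# Hypothetical data for N = 3 queries.
query_features = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
performance_features = [[5.0, 1.0], [6.0, 1.0], [9.0, 4.0]]

K_query = gaussian_kernel_matrix(query_features, tau=2.0)
K_perf = gaussian_kernel_matrix(performance_features, tau=2.0)
```

Both matrices are symmetric with ones on the diagonal, since every query is maximally similar to itself under this kernel.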

Prediction using KCCA

Evaluation

• Predictive risk

• A predictive risk close to 1 indicates near-perfect prediction.

• This metric is very sensitive to outliers, and removing the top outliers can significantly improve it.
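The slide does not spell out the formula. A minimal sketch, assuming predictive risk is defined as one minus the ratio of the squared prediction error to the squared deviation of the actual values from their mean (an R²-like measure); the numbers are made up to show the outlier sensitivity.

```python
def predictive_risk(predicted, actual):
    """1 - sum((pred - actual)^2) / sum((actual - mean(actual))^2)."""
    mean_actual = sum(actual) / len(actual)
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    total_variation = sum((a - mean_actual) ** 2 for a in actual)
    if total_variation == 0:
        raise ValueError("actual values are all identical")
    return 1 - squared_error / total_variation


# Four reasonable predictions: the value is close to 1 (about 0.964).
good = predictive_risk([12, 18, 33, 39], [10, 20, 30, 40])

# Adding one badly mispredicted query drives the value far below zero.
with_outlier = predictive_risk([12, 18, 33, 39, 200], [10, 20, 30, 40, 50])
```

Under this definition a single large miss dominates the numerator, which matches the observation that dropping the top outliers changes the score substantially.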

Performance feature vector

• Performance features: 6 measures computed by the DBMS after running a query.

– Elapsed time

– Disk I/O

– Message count

– Message bytes

– Records accessed

– Records used

Query feature vector

• Information available prior to query execution

1. SQL text of the query

• Number of nested sub-queries

• Total number of selection predicates

• Number of equality selection predicates

• Total number of join predicates

• Number of equi-join predicates

• Number of non-equi-join predicates

• Number of sort columns

• Number of aggregation columns

Query feature vector

2. Query execution plan (a tree of query operators with estimated cardinalities)

• Instance count and cardinality sum for each operator
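The plan-based feature vector can be sketched as follows. The operator names, the flat list-of-pairs plan representation and the function name are hypothetical; the only thing taken from the slides is that each operator contributes an instance count and a sum of estimated cardinalities.

```python
from collections import defaultdict


def plan_feature_vector(plan_operators, operator_names):
    """Build [count, cardinality_sum] per operator, in a fixed operator order.

    plan_operators: iterable of (operator_name, estimated_cardinality) pairs,
    one per node of the query plan tree.
    """
    counts = defaultdict(int)
    cardinality_sums = defaultdict(float)
    for name, cardinality in plan_operators:
        counts[name] += 1
        cardinality_sums[name] += cardinality

    vector = []
    for name in operator_names:
        # Operators absent from this plan contribute zeros.
        vector.extend([counts[name], cardinality_sums[name]])
    return vector


# Hypothetical plan with four operator nodes.
plan = [
    ("file_scan", 1000.0),
    ("file_scan", 500.0),
    ("hash_join", 200.0),
    ("sort", 200.0),
]
known_operators = ["file_scan", "hash_join", "nested_join", "sort"]
vector = plan_feature_vector(plan, known_operators)
```

Fixing the operator order up front keeps every query's vector the same length, so vectors from different plans can be compared directly.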

Prediction based on neighbours

• How to find the ‘nearest’ neighbour?

• Euclidean distance captures the magnitude-wise closest neighbour.

• Cosine distance captures the direction-wise closest neighbour.

• Experiments suggest that Euclidean distance provides better prediction.
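The difference between the two distance measures can be seen in a small sketch. The function names and the toy points are illustrative assumptions; in the approach described here the points would be the projected query feature vectors.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance: sensitive to magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity: sensitive to direction only."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1 - dot / (norm_a * norm_b)


def nearest_neighbours(query, training, k, distance):
    """Return the indices of the k training points closest to the query."""
    ranked = sorted(range(len(training)), key=lambda i: distance(query, training[i]))
    return ranked[:k]


# Hypothetical training points and test query.
training = [[1.0, 1.0], [10.0, 10.0], [2.0, 0.0]]
query = [2.0, 2.0]

by_euclidean = nearest_neighbours(query, training, 2, euclidean_distance)
by_cosine = nearest_neighbours(query, training, 2, cosine_distance)
```

With these points, Euclidean distance prefers the two points of similar magnitude (indices 0 and 2), while cosine distance prefers the two points lying in the same direction as the query (indices 0 and 1), regardless of how far away they are.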

Prediction based on neighbours

• How many neighbours to consider when making a prediction?

• According to the experiments, using the 3 nearest neighbours provides a good trade-off.

Prediction based on neighbours

• How to map from the neighbours' performance metrics to the test query's performance metrics? Combine the neighbours' performance feature vectors:

• Equally weighted

• 1:2:3 weighted based on distance ranking

• Weighted proportionally to distance from the test query's feature vector
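The weighting options above all reduce to a weighted average of the neighbours' performance vectors. A minimal sketch follows; the function name and the numbers are made up, and the slides do not say which rank receives which share of the ratio, so giving the nearest neighbour the largest share is an assumption.

```python
def combine_neighbours(neighbour_metrics, weights=None):
    """Weighted average of the neighbours' performance vectors.

    neighbour_metrics: one performance vector per neighbour, nearest first.
    weights: one non-negative weight per neighbour; equal weights if omitted.
    """
    if weights is None:
        weights = [1.0] * len(neighbour_metrics)
    if len(weights) != len(neighbour_metrics):
        raise ValueError("need exactly one weight per neighbour")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_metrics = len(neighbour_metrics[0])
    return [
        sum(w * m[j] for w, m in zip(weights, neighbour_metrics)) / total
        for j in range(n_metrics)
    ]


# Hypothetical (elapsed time, records used) vectors of the 3 nearest neighbours.
neighbours = [[10.0, 100.0], [20.0, 200.0], [60.0, 600.0]]

equal = combine_neighbours(neighbours)
# Rank-based ratio; assumption: the nearest neighbour gets the largest share.
ranked = combine_neighbours(neighbours, weights=[3, 2, 1])
```

A distance-based scheme fits the same function by passing weights derived from each neighbour's distance to the test query.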

Experiment design

• Experiment 1: Train the model with a realistic mix of query types: 1027 (30b + 230g + 767f)

• Experiment 2: Train the model with 30 queries of each type: 120 (30b + 30g + 30f)

• Experiment 3: 2-step prediction with query-type-specific models

• Experiment 4: Training and testing on queries that use different data tables and schemas

Experiment 1- Time

Experiment 1- Record usage

Experiment 1- Message count

Experiment 2- Time

Experiment 3- Time

Experiment 4- Time

Conclusions

• ML can predict performance metrics using only information available before a query is executed.

• Prediction can greatly improve system sizing, capacity planning and workload management.

• I want to predict the percentage of up-to-date results for a query answered from cache, based on the statistics of similar queries.
