Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning By: Archana Ganapathi, Harumi Kuno, Umeshwar Dayal, Janet L. Wiener, Armando Fox, Michael Jordan, David Patterson

Page 1: Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning

Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning

By: Archana Ganapathi, Harumi Kuno, Umeshwar Dayal, Janet L. Wiener, Armando Fox, Michael Jordan, David Patterson

Page 2: Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning

Problem:

• Predicting the performance (running time, resource usage) of a query before executing it helps with:
  • Workload management: query scheduling
  • System sizing: what a system needs in order to answer a query within a time constraint
  • Capacity planning: given an expected workload, does the system require an upgrade?

Page 3: Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning

Why it is a hard problem

• Sources of uncertainty
  • Skewed data distribution
  • Inaccurate cardinality prediction
• Complex query plans
• Huge amount of data
• Different schemas for different databases make using ML a big challenge

Page 4: Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning

Solution

• It should be able to predict all performance metrics simultaneously, using only information available prior to query execution, for both short- and long-running queries.
• Potential candidates:
  • Cost models
    – Manually model the performance output of each operator for each configuration setting, then estimate the final value based on the query plan.
    – Estimation errors propagate.
  • Machine learning
    – Build a model based on training data.
    – Not sensitive to estimation error, since it works on the basis of similarity.

Page 5: Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning

Experiment set up (data)

• Machine used to gather training and test query performance metrics:
  • HP Neoview database system.
  • Machines with 4, 8, 16, and 32 processors.
  • Fixed memory allocated per CPU.
  • Each CPU has its own disk, and data is partitioned roughly equally across all 4 disks.

Page 6: Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning

Experiment set up (query)

• Categorize queries by runtime:
  • 0 min < feather < 3 min
  • 3 min < golfball < 30 min
  • 30 min < bowlingball < 2 h
• Use standard TPC-DS decision support benchmark templates to generate feather queries.
• For longer queries, write new templates based on real queries that took at least 4 hours to run.
• Include some feather queries from another database with a different schema in the training and test sets.
• Producing queries with the appropriate runtime was hard and time-consuming, since changing a single constant might turn a feather into a bowling ball or vice versa.
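The runtime categories above can be written down as a small labelling function. This is only a sketch: the function name `categorize` is hypothetical, and the slide does not say which category a query of exactly 3 or 30 minutes belongs to, so the boundary handling here is an assumption.

```python
def categorize(runtime_minutes):
    """Label a query by its measured runtime in minutes.

    Assumption: a runtime exactly on a boundary (3 or 30 min) falls into the
    longer category; the slide leaves boundaries unspecified.
    """
    if runtime_minutes < 0:
        raise ValueError("runtime must be non-negative")
    if runtime_minutes < 3:
        return "feather"
    if runtime_minutes < 30:
        return "golfball"
    if runtime_minutes < 120:
        return "bowlingball"
    return None  # 2 h or longer: outside the three categories listed


labels = [categorize(m) for m in (0.5, 12, 45, 300)]
```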

Page 7: Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning

Independent modelling of performance metrics

• Regression
  • Individually model each performance metric: y = A1X1 + A2X2 + … + AnXn
  • Regression uses a different set of features for each performance metric, which makes it hard to unify all performance metrics in one model.
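A minimal sketch of what "individually model each performance metric" could look like: one least-squares fit of y = A1X1 + … + AnXn per metric, with no intercept, solved through the normal equations. The toy data, the metric names, and the helper names (`solve_linear_system`, `fit_linear`, `predict_linear`) are all made up for illustration and are not from the slides.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("features are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_linear(features, targets):
    """Least-squares coefficients A1..An via the normal equations."""
    n = len(features[0])
    xtx = [[sum(row[i] * row[j] for row in features) for j in range(n)]
           for i in range(n)]
    xty = [sum(row[i] * y for row, y in zip(features, targets))
           for i in range(n)]
    return solve_linear_system(xtx, xty)


def predict_linear(coefficients, feature_row):
    """Evaluate y = A1*X1 + ... + An*Xn for one query."""
    return sum(a * x for a, x in zip(coefficients, feature_row))


# Hypothetical toy data: two query features per query.
query_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
# Each metric gets its own, independently fitted model.
metrics = {
    "elapsed_time": [2.0, 3.0, 5.0, 7.0],
    "records_used": [10.0, 1.0, 11.0, 21.0],
}
models = {name: fit_linear(query_features, ys) for name, ys in metrics.items()}
```

Because every metric ends up with its own coefficient vector (and, in practice, possibly its own feature subset), nothing ties the predictions for one query together, which is the drawback the slide points out.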

Page 8: Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning

Joint modelling of performance metrics

• Clustering: groups the entries of a single dataset based on their similarity.

• PCA: projects a dataset onto its dimensions of maximal variance, for clustering.

• (K)CCA: finds dimensions of maximal correlation between a pair of datasets and maps each dataset onto those dimensions. The notion of similarity can be defined by the user in a kernel function.
  • Dataset 1: query features (before running)
  • Dataset 2: performance features (after running)

Page 9: Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning

KCCA

Page 10: Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning

KCCA

• We are given N queries.
• We produce two N×N similarity matrices: one over the query features and one over the query performance features.
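As a rough illustration of the two N×N matrices, the sketch below builds them with a Gaussian kernel. The slides do not name the kernel function, so the Gaussian form, the scale parameter `tau`, and the toy vectors are all assumptions; only the construction of the matrices is shown, not the KCCA solve itself.

```python
import math


def gaussian_kernel_matrix(vectors, tau):
    """N x N similarity matrix with entry [i][j] = exp(-||v_i - v_j||^2 / tau).

    Assumption: a Gaussian kernel with a hand-picked scale tau.
    """
    n = len(vectors)
    k = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            squared = sum((a - b) ** 2 for a, b in zip(vectors[i], vectors[j]))
            k[i][j] = math.exp(-squared / tau)
    return k


# Hypothetical data for N = 3 queries.
query_features = [[2.0, 1.0, 0.0], [2.0, 1.0, 1.0], [9.0, 4.0, 3.0]]
performance_features = [[1.5, 200.0], [1.7, 220.0], [40.0, 9000.0]]

# One matrix per dataset; row i and column i always refer to the same query.
k_query = gaussian_kernel_matrix(query_features, tau=10.0)
k_perf = gaussian_kernel_matrix(performance_features, tau=1.0e4)
```

Each matrix is symmetric with ones on the diagonal, since every query is maximally similar to itself.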

Page 11: Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning

Prediction using KCCA

Page 12: Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning

Evaluation

• Predictive risk
  • Predictive risk ≈ 1 means near-perfect prediction.
  • This metric is very sensitive to outliers, and removing the top outliers can significantly improve predictive risk.
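The formula for predictive risk did not survive in this transcript. The sketch below assumes the usual R²-style definition, 1 minus the ratio of squared prediction error to the squared deviation of the actual values from their mean. That definition is an assumption here, and the numbers are made up to show the outlier sensitivity mentioned above.

```python
def predictive_risk(predicted, actual):
    """Assumed definition: 1 - sum((pred - act)^2) / sum((act - mean(act))^2)."""
    mean_actual = sum(actual) / len(actual)
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    total_variation = sum((a - mean_actual) ** 2 for a in actual)
    if total_variation == 0:
        raise ValueError("actual values are all identical")
    return 1.0 - squared_error / total_variation


# Hypothetical elapsed times: three exact predictions and one bad outlier.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [10.0, 20.0, 30.0, 140.0]

risk_with_outlier = predictive_risk(predicted, actual)
# Dropping the single worst prediction changes the score drastically.
risk_without_outlier = predictive_risk(predicted[:3], actual[:3])
```

Under this definition a value close to 1 means near-perfect prediction, and the value can become negative when the errors exceed the natural spread of the data.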

Page 13: Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning

Performance feature vector

• Performance features: 6 measures computed by the DBMS after running a query.
  – Elapsed time
  – Disk I/O
  – Message count
  – Message bytes
  – Records accessed
  – Records used

Page 14: Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning

Query feature vector

• Information available prior to query execution
1. SQL text of the query
  • Number of nested sub-queries
  • Total number of selection predicates
  • Number of equality selection predicates
  • Total number of join predicates
  • Number of equi-join predicates
  • Number of non-equi-join predicates
  • Number of sort columns
  • Number of aggregation columns

Page 15: Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning

Query feature vector

2. Query execution plan (a tree of query operators with estimated cardinalities)
  • Instance count and cardinality sum for each operator.
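One way to turn a plan into the "instance count and cardinality sum for each operator" vector is sketched below. The operator names, the fixed operator ordering, and the flat list-of-pairs plan representation are hypothetical; a real plan is a tree, but counts and sums do not depend on the tree shape.

```python
# Hypothetical, fixed operator vocabulary; a real system has its own list.
OPERATORS = ["file_scan", "nested_join", "hash_groupby", "sort", "root"]


def plan_feature_vector(plan_operators, operator_names=OPERATORS):
    """Build [count_op1, cardsum_op1, count_op2, cardsum_op2, ...].

    plan_operators: iterable of (operator_name, estimated_cardinality) pairs,
    one pair per operator instance in the plan.
    """
    counts = {name: 0 for name in operator_names}
    cardinality_sums = {name: 0.0 for name in operator_names}
    for name, estimated_cardinality in plan_operators:
        if name not in counts:
            raise KeyError("unknown operator: " + name)
        counts[name] += 1
        cardinality_sums[name] += estimated_cardinality
    vector = []
    for name in operator_names:
        vector.extend([counts[name], cardinality_sums[name]])
    return vector


# Hypothetical plan: two scans feeding a join, then a sort and the root.
plan = [
    ("file_scan", 1000),
    ("file_scan", 500),
    ("nested_join", 200),
    ("sort", 200),
    ("root", 200),
]
features = plan_feature_vector(plan)
```

Operators that do not occur in a plan contribute zeros, so every query yields a vector of the same length.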

Page 16: Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning

Prediction based on neighbours

• How to find the ‘nearest’ neighbour?
  • Euclidean distance captures the magnitude-wise closest neighbour.
  • Cosine distance captures the direction-wise closest neighbour.
  • Experiments suggest that Euclidean distance provides better prediction.
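The difference between the two distance measures can be seen on a toy example. The vectors below are made up: `far_same_direction` points the same way as the test vector but is much larger, while `close_other_direction` is close in magnitude but points slightly differently.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance; small when vectors are close in magnitude."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """1 - cos(angle between u and v); small when vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


test_query = [1.0, 1.0]
candidates = {
    "far_same_direction": [10.0, 10.0],
    "close_other_direction": [1.0, 2.0],
}

nearest_euclidean = min(
    candidates, key=lambda name: euclidean_distance(test_query, candidates[name]))
nearest_cosine = min(
    candidates, key=lambda name: cosine_distance(test_query, candidates[name]))
```

The two measures pick different neighbours here: Euclidean distance prefers the candidate of similar size, cosine distance the candidate with the same direction.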

Page 17: Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning

Prediction based on neighbours

• How many neighbours to consider when calculating freshness?

• According to the experiments, using the 3 nearest neighbours provides a good trade-off.

Page 18: Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning

Prediction based on neighbours

• How to map from the neighbours' performance metrics to the test query's performance metrics? Combine the neighbours' performance feature vectors:
  • Equally weighted
  • 1:2:3 weighting based on distance ranking
  • Weighting proportional to distance from the test query feature vector
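A sketch of combining the performance vectors of the k = 3 nearest neighbours under the three weighting schemes. Several details are assumptions, not taken from the slides: the nearest neighbour gets the largest rank weight, the distance-based scheme uses inverse distance so that closer neighbours count more, and the input vectors are taken to be already in whatever space neighbours are compared in. All data and function names are hypothetical.

```python
import math


def nearest_neighbours(test_vector, training_vectors, k=3):
    """Return (distance, index) pairs of the k closest training vectors (Euclidean)."""
    scored = []
    for index, vector in enumerate(training_vectors):
        distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(test_vector, vector)))
        scored.append((distance, index))
    scored.sort()
    return scored[:k]


def predict_metrics(test_vector, training_vectors, training_metrics,
                    k=3, scheme="equal"):
    """Weighted average of the neighbours' performance feature vectors."""
    nearest = nearest_neighbours(test_vector, training_vectors, k)
    if scheme == "equal":
        weights = [1.0] * len(nearest)
    elif scheme == "rank":
        # Assumption: nearest neighbour gets the largest weight (3, 2, 1 for k = 3).
        weights = [float(len(nearest) - rank) for rank in range(len(nearest))]
    elif scheme == "distance":
        # Assumption: inverse distance; the small constant avoids division by zero.
        weights = [1.0 / (distance + 1e-9) for distance, _ in nearest]
    else:
        raise ValueError("unknown scheme: " + scheme)
    total = sum(weights)
    neighbour_metrics = [training_metrics[index] for _, index in nearest]
    return [
        sum(w * m[j] for w, m in zip(weights, neighbour_metrics)) / total
        for j in range(len(neighbour_metrics[0]))
    ]


# Hypothetical training data: feature vectors and [elapsed_time, records_used].
training_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [10.0, 10.0]]
training_metrics = [[10.0, 100.0], [20.0, 200.0], [60.0, 600.0], [1000.0, 9000.0]]
test_vector = [0.0, 0.5]

equal_prediction = predict_metrics(test_vector, training_vectors, training_metrics)
rank_prediction = predict_metrics(test_vector, training_vectors, training_metrics,
                                  scheme="rank")
distance_prediction = predict_metrics(test_vector, training_vectors, training_metrics,
                                      scheme="distance")
```

Because one weighted average is taken over whole performance vectors, all metrics for the test query are predicted in a single step.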

Page 19: Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning

Experiment design

• Experiment 1: Train the model with a realistic mix of query types: 1027 (30b + 230g + 767f)

• Experiment 2: Train the model with 30 queries of each type: 120 (30b + 30g + 30f)

• Experiment 3: 2-step prediction with query type-specific models

• Experiment 4: Training and testing on queries that use different data tables and schemas

Page 20: Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning

Experiment 1- Time

Page 21: Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning

Experiment 1- Record usage

Page 22: Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning

Experiment 1- Message count

Page 23: Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning

Experiment 2- Time

Page 24: Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning

Experiment 3- Time

Page 25: Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning

Experiment 4- Time

Page 26: Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning

Conclusions

• ML can predict performance metrics using only information available before a query is executed.

• Prediction can greatly improve system sizing, capacity planning, and workload management.

• I want to predict the percentage of up-to-date results in a query result served from cache, based on the statistics of similar queries.