[Diagram: speculation. A distribution is fitted to the error sequence and, when its fit error is below 0.05, is used to estimate the number of iterations; an error above 0.05 means the fit is not yet usable.]
Improving the life of a data scientist
[Cartoon: the data scientist today; what people think he does, what he thinks he does, and what he actually does.]
Observation
An ML task (classification, clustering, ...) is expressed as an optimization problem of the form
  $\min_w \sum_{i \in \text{data}} f_i(w) + g(w)$,
solved with Gradient Descent (GD), i.e., stochastic GD, batch GD, or mini-batch GD, and executed on a processing platform.
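As a concrete instance (our illustration; the poster does not fix a particular task), l2-regularized logistic regression fits this template, with per-point loss f_i and regularizer g:

  $f_i(w) = \log\left(1 + e^{-y_i\, w^{\top} x_i}\right), \qquad
   g(w) = \frac{\lambda}{2}\,\lVert w \rVert_2^2, \qquad
   \nabla f_i(w) = \frac{-\,y_i\, x_i}{1 + e^{\,y_i\, w^{\top} x_i}}.$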
[Diagram: from a GD task (declarative query) to a solution. The general problem spans hyperparameter tuning, algorithm selection, and implementation; our focus is selecting and implementing the algorithm, since no single choice is an all-times winner. The cost-based GD optimizer combines the GD abstraction, a rewriter, a planner over the GD plan space, a GD cost model, and a GD iteration estimator to produce the solution.]
[Diagram: an SGD plan over the GD abstraction. Preparation phase (1): Transform, Stage. Processing phase (2): Sample, Compute, Update. Convergence phase (3): Converge, Loop; while the convergence condition is false the plan loops back, and when it is true it outputs the Model.]
For example, the Transform operator turns sparse input data units
  +1 2:0.1 4:0.4 10:0.3
  -1 3:0.3 4:0.5 9:0.5
  +1 1:0.1 2:0.7 6:0.2
into (label, indices, values) tuples:
  (+1, [2, 4, 10], [0.1, 0.4, 0.3])
  (-1, [3, 4, 9], [0.3, 0.5, 0.5])
  (+1, [1, 2, 6], [0.1, 0.7, 0.2])
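A minimal sketch of such a Transform step (ours, not the actual operator code), assuming LIBSVM-style input lines:

    def transform(line):
        """Parse a sparse LIBSVM line, e.g. '+1 2:0.1 4:0.4 10:0.3',
        into a (label, indices, values) data unit."""
        parts = line.split()
        label = float(parts[0])
        indices, values = [], []
        for pair in parts[1:]:
            idx, val = pair.split(":")
            indices.append(int(idx))
            values.append(float(val))
        return label, indices, values

    print(transform("+1 2:0.1 4:0.4 10:0.3"))  # (1.0, [2, 4, 10], [0.1, 0.4, 0.3])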
How to model ML tasks?
How to optimize GD plans?
GD abstraction operators: Transform, Stage, Sample, Compute, Update, Converge, Loop.
Sampling techniques (illustrated over data units 1-5 spread across partitions):
• Bernoulli: scan all partitions, keeping each data unit with a given probability.
• Random-partition: sample within a single randomly chosen partition.
• Shuffle-partition: shuffle the data units across partitions once (e.g., 3 1 2 4 5) and then read them sequentially.
Lazy transformation: transform only the data units that are actually sampled, instead of eagerly transforming the whole dataset up front.
GD plan space:
• BGD: eager or lazy transformation.
• SGD/MGD: eager or lazy transformation, each combined with Bernoulli, random-partition, or shuffle-partition sampling.
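A rough sketch of the three sampling techniques (our illustration; the partition layout and function signatures are assumed, not taken from the system):

    import random

    def bernoulli_sample(partitions, fraction):
        # Scans every partition; keeps each data unit with probability `fraction`.
        return [x for part in partitions for x in part if random.random() < fraction]

    def random_partition_sample(partitions, k):
        # Jumps to one randomly chosen partition and samples k units inside it.
        part = random.choice(partitions)
        return random.sample(part, min(k, len(part)))

    def shuffle_partition_sample(shuffled_partitions, k, offset):
        # The data was shuffled once up front; afterwards, samples are simply
        # read sequentially, touching only the partitions that hold them.
        flat = [x for part in shuffled_partitions for x in part]
        return flat[offset:offset + k]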
How to get # iterations?
Starting from $i_0 = 0$, $\alpha = 0.1$, and $w_0 = [0.0, 0.0, \ldots, 0.0]$, each iteration computes
  $w_{k+1} = w_k - \alpha \nabla f(w_k)$
  $i_{k+1} = i_k + 1$
  $\delta = \lVert w_{k+1} - w_k \rVert$
and loops while $\delta > 0.01$, finally emitting $(i, w)$.
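The same loop as a runnable sketch (plain NumPy; the gradient function is a stand-in for whatever the Compute operator evaluates):

    import numpy as np

    def gradient_descent(grad_f, dim, alpha=0.1, tol=0.01, max_iter=1000):
        """Run w_{k+1} = w_k - alpha * grad_f(w_k) until ||w_{k+1} - w_k|| <= tol."""
        w = np.zeros(dim)
        for i in range(1, max_iter + 1):
            w_next = w - alpha * grad_f(w)
            delta = np.linalg.norm(w_next - w)
            w = w_next
            if delta <= tol:
                break
        return i, w  # (#iterations, model)

    # Example: minimize ||w - b||^2 / 2, whose gradient is (w - b).
    b = np.array([1.0, -2.0, 0.5])
    iters, w = gradient_descent(lambda w: w - b, dim=3)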
Rheem: A Cross-Platform System
More info:
• Web page: http://da.qcri.org/rheem
• Source code (the ML4all abstraction included): https://github.com/rheem-ecosystem/rheem
• Rheem: Enabling Multi-Platform Task Execution. SIGMOD 2016, San Francisco, USA (demo paper).
• Road to Freedom in Data Analytics. EDBT 2016, Bordeaux, France (vision paper).
Join us at the Spark Summit 2017, San Francisco, USA.
ML4all: A DB-like Machine Learning System
[Figure 9 plots omitted: training time (sec, log scale) per dataset (adult, covtype, yearpred, rcv1, higgs, svm1, svm2, svm3) for MLlib, SystemML (with its binary-format conversion time shown separately), and our system. (a) BGD; (b) MGD; (c) SGD.]
Figure 9: Training time (sec). Our system significantly outperforms both MLlib and SystemML, thanks to its novel sampling mechanisms and its lazy transformation technique.
[Figure 10 plots and raw measurement notes omitted: training time (sec) of SGD for MLlib, ML4all (eager-random), and ML4all (lazy-shuffle). (a) Scaling #points: 2.7M (5GB) up to 88M (160GB); (b) Scaling #features: 1k (180MB) up to 500k (90GB).]
Figure 10: Our system's scalability compared to MLlib. It scales gracefully with both the number of data points and features.
(1) For BGD (Figure 9(a)), we observe that, even though sampling and lazy transformation are not used in BGD, our system is still faster than MLlib. This is because we used mapPartitions and reduce instead of treeAggregate, which resulted in better data locality and hence better response times for larger datasets. Notice that SystemML is slightly faster than our system for the small datasets, because SystemML processes them locally. The largest bottleneck of SystemML for small datasets is the time to convert the dataset to its binary format. However, we observe that our system significantly outperforms SystemML for larger datasets, when SystemML runs on Spark. In fact, we had to stop SystemML after 3 hours for the higgs dataset, while for the three dense synthetic datasets SystemML failed with out-of-memory exceptions.
(2) For MGD (Figure 9(b)), we observe that our system outperforms, on average, both MLlib and SystemML. It has similar performance to MLlib and SystemML for the small datasets, although SystemML additionally requires the overhead of converting the data to its binary representation. It is up to 28 times faster than MLlib and more than 2 orders of magnitude faster than SystemML for large datasets (higgs, svm1, and svm2). Especially for the dataset svm3, which does not fit entirely into Spark's cache, MLlib incurred disk IOs in each iteration, resulting in a training time of 6 min per iteration; thus, we had to terminate the execution after 3 hours. The large benefits of our system come from the shuffle-partition sampling technique, which significantly saves IO costs.
(3) For SGD (Figure 9(c)), we observe that our system is significantly faster than MLlib (by a factor ranging from 2 for small datasets to 46 for larger ones). In fact, similarly to MGD, MLlib incurred many disk IOs for svm3, and we had to stop the execution after 3 hours. In contrast, SystemML has lower training times for the very small datasets (adult, covtype, and yearpred), thanks to its binary data representation, which makes local processing faster. However, the cost of converting the data to this binary representation is higher than the training time itself, which makes SystemML slower than our system overall (except for covtype). Things get worse for SystemML as the data grows: our system is more than 2 orders of magnitude faster. The benefits of our system on SGD are mainly due to its lazy transformation. In fact, as for BGD and MGD, we had to stop SystemML after 3 hours for the higgs dataset, while it failed with out-of-memory exceptions for the three dense datasets. Notice that the training time for a larger dataset may be smaller if the number of iterations to converge is smaller. For example, the dataset covtype required 923 iterations to converge using SGD, in contrast to rcv1, which required only 196; this resulted in a smaller training time for rcv1 than for covtype.
8.4.2 Scalability
Figure 10 shows the scalability results for SGD on the two largest synthetic datasets (SVM A and SVM B), when increasing the number of data points (Figure 10(a)) and the number of features (Figure 10(b)). Notice that we discarded SystemML as it was not able to run on these dense datasets. We plot the runtimes of the eager-random and the lazy-shuffle GD plans. We observe that both plans outperform MLlib by more than one order of magnitude in both cases. In particular, our system scales gracefully with both the number of data points and the number of features, while MLlib does not. This is even more prominent for the datasets that do not fit in Spark's cache memory. Especially, we observe that the lazy-shuffle plan scales better than the eager-random one. This shows the high efficiency of our shuffle-partition sampling mechanism in combination with the lazy transformation. Note that we had to stop the execution of MLlib after 24 hours for the largest dataset of 88 million points in Figure 10(a). MLlib took 4.3 min for each iteration and thus would have required 3 days to complete, while our GD plan took only 25 minutes. This amounts to more than 2 orders of magnitude improvement over MLlib.
8.4.3 Benefits and overhead of abstraction
We also evaluate the benefits and overhead of using the ML4all abstraction. For this, we implemented the plan produced by ML4all directly on top of Spark. We also implemented the Bismarck abstraction [12], which comes with a Prepare UDF while Compute and Update are combined, on Spark. Recall that a key advantage of separating Compute from Update is that the former can be parallelized, whereas the latter has to be effectively serialized (see the sketch at the end of this subsection). When these two operators are combined into one, parallelization cannot be leveraged; the Prepare UDF, however, can be parallelized. Figure 11 illustrates the results of these experiments.
[Figure 11 plots and source data omitted: training time (sec) of SGD, MGD(1K), MGD(10K), and BGD for a hard-coded Spark implementation, our system, and Bismarck-on-Spark. (a) adult dataset; (b) rcv1 dataset; (c) svm1 dataset.]
Figure 11: ML4all abstraction benefits and overhead. The proposed abstraction has negligible overhead w.r.t. hard-coded Spark programs while it allows for exhaustive distributed execution.
We observe that ML4all adds almost no additional overhead to plan execution, as it has very similar runtimes to the pure Spark implementation. We also observe that our system and Bismarck have similar runtimes for SGD and MGD(1k) on all three datasets. This is because our prototype runs in a hybrid mode and parts of the plan are executed in a centralized fashion, thus negating the separation of the Compute and Update steps. As the dataset cardinality or dimensionality increases, the advantages of ML4all become clear. Our system is (i) slightly faster for MGD(10k) on a small dataset (Figure 11(a)), (ii) more than 3 times faster for MGD(10k) in Figure 11(c), because of the distribution of the gradient computation, and (iii) able to run MGD(10k) in Figure 11(b), while the Bismarck abstraction fails due to the large number of features of rcv1. This is also the reason the Bismarck abstraction fails to run BGD for rcv1; for svm1, it fails because of the large number of data points. This clearly shows that the Bismarck abstraction cannot scale with the dataset size. In contrast, our system scales gracefully in all cases, as it executes the algorithms in a distributed fashion whenever required.
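To make the benefit of separating Compute from Update concrete, here is a minimal sketch (ours, not ML4all's actual operator code; the least-squares gradient is an assumed example): Compute maps over the sampled data units in parallel, while Update is a single serialized step on the aggregated gradients.

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def compute(unit, w):
        # Per-data-unit gradient for least squares: (w.x - y) * x. Parallelizable.
        x, y = unit
        return (np.dot(w, x) - y) * x

    def update(w, gradients, alpha=0.1):
        # Single serialized step on the aggregated gradient.
        return w - alpha * np.mean(gradients, axis=0)

    w = np.zeros(2)
    batch = [(np.array([1.0, 2.0]), 3.0), (np.array([0.5, -1.0]), 0.0)]
    with ThreadPoolExecutor() as pool:
        grads = list(pool.map(lambda u: compute(u, w), batch))
    w = update(w, grads)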
8.4.4 Summary
The high efficiency of our system comes from its (i) lazy transformation technique, (ii) novel sampling mechanisms, and (iii) efficient execution operators. All these results not only show the high efficiency of our optimization techniques, but also the power of the ML4all abstraction, which allows for such optimizations without adding any overhead.
8.5 Accuracy
The reader might think that our system achieves high performance at the cost of sacrificing accuracy. However, this is far from the truth. To demonstrate this, we measure the testing error of each system for each GD algorithm. We used the test datasets from LIBSVM when available; otherwise, we randomly split the initial dataset into training (80%) and testing (20%) sets. We then apply the model (i.e., the weights vector) produced on the training dataset to each example in the testing dataset to determine its output label, and we plot the mean square error of the output labels compared to the ground truth. Recall that we used the same parameters (e.g., step size) in all systems.
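For reference, this testing error amounts to the following computation (a minimal sketch assuming a linear model whose predicted label is the sign of the score; regression tasks would use the raw score):

    import numpy as np

    def testing_error(w, X_test, y_test, classification=True):
        """MSE of the predicted output labels against the ground truth."""
        scores = X_test @ w
        preds = np.sign(scores) if classification else scores
        return float(np.mean((preds - y_test) ** 2))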
Let us first note that, as expected, all systems return the same model for BGD; we hence omit this graph, as the testing error is exactly the same. Figure 12 shows the results for MGD and SGD. We omit the results for svm3, as only our system could converge in a reasonable amount of time. Although our system uses aggressive sampling techniques in some cases, such as shuffle-partition for the large datasets in MGD⁴, its error is very close to the errors of MLlib and SystemML. The only case where shuffle-partition influences the testing error is for rcv1 in SGD: the testing error for MLlib is 0.08, while in our case it is 0.18. This is due to the skewness of the data; SystemML, with a testing error of 0.3, also seems to suffer from this problem.
⁴Table 4 in Appendix E shows the plan chosen in each case.
[Figure 12 plots omitted: testing error (MSE, 0 to 0.6) per dataset (adult, covtype, yearpred, rcv1, higgs, svm1, svm2) for MLlib, SystemML, and our system. (a) MGD; (b) SGD.]
Figure 12: Testing error (mean square error). For SGD/MGD, our system achieves an error close to MLlib even if it uses different sampling methods.
[Figure 13 plots and source data omitted: training time (sec, log scale) per dataset (adult, covtype, yearpred, rcv1, higgs, svm1, svm2), comparing Bernoulli, random-partition, and shuffle-partition sampling. (a) Eager transformation; (b) Lazy transformation.]
Figure 13: Sampling effect in MGD for eager and lazy transformation.
We are currently working to improve this sampling technique for such cases. However, in cases where the data is not skewed, our testing error even for SGD is very close to the one of MLlib. Thus, we can conclude that ML4all decreases training times without affecting the accuracy of the model.
8.6 In-Depth
We analyze in detail how the sampling and transformation techniques affect performance when running MGD with 1,000 samples and SGD until convergence, with the tolerance set to 0.001 and a maximum of 1,000 iterations.
8.6.1 Varying the sampling technique
We first fix the transformation and vary the sampling technique. Figure 13 shows how the sampling technique affects MGD when using eager and lazy transformation. First, with eager transformation on small datasets, Bernoulli sampling is more beneficial (Figure 13(a)). This is because MGD needs a thousand samples per iteration and thus a full scan of the whole dataset per iteration does not penalize the total execution time. However, for larger datasets that consist of more partitions, shuffle-partition is faster in all cases, as it accesses only a few partitions.
For the lazy transformation (Figure 13(b)), we ran only the random-partition and shuffle-partition sampling techniques, because using a plan with Bernoulli sampling and lazy transformation is always inefficient, as explained in Section 6. We observe that for MGD and the two small datasets of
[Figure 6 plots omitted: real vs. estimated number of iterations (log scale) for BGD, MGD, and SGD at tolerances 0.1, 0.01, and 0.001 (0.1 and 0.01 only for rcv1). (a) adult dataset; (b) covtype dataset; (c) rcv1 dataset.]
Figure 6: ML4all obtains good estimates for the number of iterations for all GD algorithms.
[Figure 7 plots and source data omitted: real vs. estimated training time (sec) per dataset (adult, covtype, yearpred, rcv1). (a) Run of 1,000 iterations; (b) Run to convergence.]
Figure 7: ML4all obtains accurate time estimates.
[Figure 8 plot omitted: training time (sec, log scale) of the best (Min) and worst (Max) GD plans vs. our system (plan execution plus speculation) for adult, covtype, yearpred, rcv1, higgs, svm1, svm2, and svm3; the plans chosen by our optimizer are BGD (adult), MGD lazy-random (covtype), SGD eager-shuffle (yearpred), and SGD lazy-shuffle (rcv1, higgs, svm1, svm2, svm3).]
Figure 8: ML4all always performs very close to the best plan, by choosing it and adding only a small overhead.
by our optimizer. For this, we used a larger variety of real and synthetic datasets and measured the training time.
Figure 8 illustrates the training times of the best (Min) and worst (Max) GD plans, as well as of the GD plan selected by ML4all for each dataset. Notice that the latter time includes the time taken by our optimizer to choose the GD plan (the speculation part) plus the time to execute it. The legend above the green bars indicates which GD plan our optimizer chose. Although for most datasets SGD was the best choice, other GD algorithms can be the winner for different tolerance values and tasks, as we showed in the introduction. We make two observations from these results. First, ML4all always selects the fastest GD plan; second, ML4all incurs a very low overhead due to the speculation. Therefore, even with the optimization overhead, ML4all still achieves very low training times, close to the ones a user would achieve if she knew which plan to run. In fact, the optimization time is between 4.6 and 8 seconds for all datasets. Of this overhead, around 4 seconds is Spark's job-initialization overhead for collecting the sample. Given that the training time of ML models is usually in the order of hours, a few seconds are negligible. It is worth noting that we observed an optimization time of less than 100 milliseconds when just the number of iterations is given.
All the above results show the efficiency of our cost model and the accuracy of ML4all in estimating the number of iterations a GD algorithm requires to converge, while keeping the optimization cost negligible.
8.4 The Power of Abstraction
We proceed to demonstrate the power of the ML4all abstraction. We show how (i) commuting the Transform and Loop operators (i.e., lazy vs. eager transformation) can yield rich performance dividends, and (ii) decoupling the Compute operator from the choice of sampling method for MGD and SGD can yield substantial performance gains too. In particular, we show how these optimization techniques allow our system to outperform baseline systems as well as to scale with the number of data points and features. Moreover, we show the benefits and overhead of the proposed GD abstraction.
8.4.1 System performance
We compare our system with MLlib and SystemML. As neither of these systems has an equivalent of a GD optimizer, we ran BGD, MGD, and SGD and used ML4all just to find the best plan for a given GD algorithm, i.e., which sampling to use and whether or not to use lazy transformation. We ran BGD, SGD, and MGD (with a batch size of 1,000) in all three systems until convergence. We considered a tolerance of 0.001 and a maximum of 1,000 iterations.
Let us now stress three important points. First, note that the API of MLlib allows users to specify the fraction of the data that will be processed in each iteration. Thus, we set this fraction to 1 for BGD, while for SGD and MGD we compute the fraction as the batch size over the total size of the dataset (sketched below). However, the Bernoulli sampling mechanism implemented in Spark (and used in MLlib) does not return exactly the number of samples requested. For this reason, for SGD, we set the fraction slightly higher to reduce the chances that the sample will be empty; we found this to be more efficient than checking whether the sample is empty and, if it is, running the sampling process again. Second, we used the DeveloperApi in order to be able to specify a convergence condition instead of a constant number of iterations. Third, as SystemML does not support the LIBSVM format, we had to convert all our real datasets into SystemML's binary representation. We used the source code provided to us by the authors of [8], which first converts the input file into a Spark RDD using the MLlib tools and then converts it into matrix binary blocks. The performance results for SystemML show the breakdown between the training time and this few-seconds conversion time.
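Concretely, the fraction handed to MLlib is computed along these lines (a sketch; the exact inflation factor for SGD is our illustrative assumption, the text only says "slightly higher"):

    def minibatch_fraction(batch_size, n, gd_variant):
        if gd_variant == "BGD":
            return 1.0
        fraction = batch_size / n
        if gd_variant == "SGD":
            # Spark's Bernoulli sampling may return an empty sample for tiny
            # fractions, so we inflate it slightly rather than resampling.
            fraction *= 1.1  # assumed safety factor
        return min(fraction, 1.0)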
Figure 9 shows the training time in log scale for the different real datasets and two larger synthetic ones. Note that for our system, the plots of SGD and MGD show the runtime of the best plan for the specific GD algorithm. Details on these plans, as well as the number of iterations required to converge, can be found in Table 4 in Appendix E. From these results we can make the following three observations.
A Cost-based Optimizer for Gradient Descent Optimization
Zoi Kaoudi, Jorge Quiané-Ruiz, Saravanan Thirumuruganathan, Sanjay Chawla, Divy Agrawal
Key observations:
1. The error sequence follows a known distribution.
2. The shape of the error sequence on a sample D' << D approximates the shape of the error sequence over the full dataset D.
Speculative approach:
1. Take a sample D' << D.
2. Run GD for a larger error.
3. Fit the distribution.
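A minimal sketch of steps 1-3, assuming the error sequence decays as a power law a*k**(-b) (one plausible "known distribution"; the actual family used by the optimizer may differ):

    import numpy as np
    from scipy.optimize import curve_fit

    def estimate_iterations(sample_errors, tolerance):
        # Fit err(k) ~ a * k**(-b) to the error sequence observed while running
        # GD on the sample D', then extrapolate the k at which err(k) = tolerance.
        k = np.arange(1, len(sample_errors) + 1, dtype=float)
        (a, b), _ = curve_fit(lambda k, a, b: a * k ** (-b),
                              k, sample_errors, p0=(sample_errors[0], 1.0))
        return int(np.ceil((a / tolerance) ** (1.0 / b)))

    errs = [0.90, 0.50, 0.36, 0.29, 0.24, 0.21]  # toy error sequence on D'
    print(estimate_iterations(errs, tolerance=0.01))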
[Bar chart omitted: training time (sec, log scale) of batch GD, stochastic GD, and mini-batch GD on adult, covtype, and rcv1.]