11
Benchmarking Deep Learning Frameworks: Design Considerations, Metrics and Beyond Ling Liu, Yanzhao Wu, Wenqi Wei, Wenqi Cao, Semih Sahin, Qi Zhang Georgia Institute of Technology Abstract—With increasing number of open-source deep learning (DL) software tools made available, benchmarking DL software frameworks and systems is in high demand. This paper presents design considerations, metrics and challenges towards developing an effective benchmark for DL software frameworks and illustrate our observations through a comparative study of three popular DL frameworks: TensorFlow, Caffe, and Torch. First, we show that these deep learning frameworks are optimized with their default configurations settings. However, the default configuration optimized on one specific dataset may not work effectively for other datasets with respect to runtime performance and learning accuracy. Second, the default configuration optimized on a dataset by one DL framework does not work well for another DL framework on the same dataset. Third, we show through experiments that different DL frameworks exhibit different levels of robustness against adversarial examples. Through this study, we conjecture that effectively benchmarking deep learning software frameworks and systems is significantly more challenging than traditional performance-driven benchmarks. I. I NTRODUCTION Deep learning (DL) applications and systems have blossomed in recently years as more variety of data is entering cyber space and increasing number of open-source deep learning (DL) software frameworks are made available. Benchmarking DL frameworks and systems is in high demand [1], [2], [3], [4], [5], [6], [7], [8], [9]. However, benchmarking deep learning software frameworks and systems is notably more difficult than traditional performance-driven benchmarks. This is simply because big data powered deep learning systems are inherently both computation-intensive and data-intensive, demanding intelligent integration of massive data parallelism and massive computation-parallelism at all levels of a deep learning framework. For instance, a deep learning framework typically has a large set of model parameters and system parameters that need to be configured and tuned. Many of the parameters are interacting with one another in a complex manner from both model learning perspective and system runtime optimization perspective, making the tuning of such large space of parameters substantially more tricky than what have been experimented and understood from systems administration of conventional computer systems, software tools and applications. This paper presents design considerations, metrics and insights towards benchmarking DL software frameworks through a comparative study of three popular deep learning frameworks: TensorFlow [10], Caffe [11] and Torch [12]. First, we show that although deep learning software frameworks are optimized with their default configuration settings, for a given DL framework, its default configuration optimized to train on one dataset may not work effectively for other datasets. Second, the default configuration optimized on a dataset by one DL framework may not work well when used to train the same dataset by another DL framework. Hence, it may not be meaningful to compare different DL frameworks under the same configuration. Third, different DL frameworks exhibit different levels of robustness in response to adversarial behaviors, and different sensitivity boundaries over potential biases or noise levels inherent in different training datasets. Through this experimental study, we show that system runtime performance, learning accuracy, and model robustness against adversarial behaviors and consequence of overfitting are the three sets of metrics that are equally important for effectively configuring, measuring and comparing different deep learning software frameworks. II. DEEP LEARNING REFERENCE MODEL Deep learning software frameworks are scalable software implementations of deep neural networks (DNN) on modern computers with many-core CPUs without or with GPUs. A. DNN Model A deep neural network refers to a N-layer neural network with N 2. Learning over the input data to a DNN is typically performed through a sequence of transformations of input data layer by layer with each

Benchmarking Deep Learning Frameworks: Design ... · models by transforming the data to LMDB format, defining network architecture in the .prototxt file, defining hyperparameters

  • Upload
    others

  • View
    12

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Benchmarking Deep Learning Frameworks: Design ... · models by transforming the data to LMDB format, defining network architecture in the .prototxt file, defining hyperparameters

Benchmarking Deep Learning Frameworks:Design Considerations, Metrics and Beyond

Ling Liu, Yanzhao Wu, Wenqi Wei, Wenqi Cao, Semih Sahin, Qi ZhangGeorgia Institute of Technology

Abstract—With increasing number of open-source deeplearning (DL) software tools made available, benchmarkingDL software frameworks and systems is in high demand.This paper presents design considerations, metrics andchallenges towards developing an effective benchmark forDL software frameworks and illustrate our observationsthrough a comparative study of three popular DLframeworks: TensorFlow, Caffe, and Torch. First, weshow that these deep learning frameworks are optimizedwith their default configurations settings. However, thedefault configuration optimized on one specific dataset maynot work effectively for other datasets with respect toruntime performance and learning accuracy. Second, thedefault configuration optimized on a dataset by one DLframework does not work well for another DL frameworkon the same dataset. Third, we show through experimentsthat different DL frameworks exhibit different levels ofrobustness against adversarial examples. Through thisstudy, we conjecture that effectively benchmarking deeplearning software frameworks and systems is significantlymore challenging than traditional performance-drivenbenchmarks.

I. INTRODUCTION

Deep learning (DL) applications and systems haveblossomed in recently years as more variety of datais entering cyber space and increasing number ofopen-source deep learning (DL) software frameworksare made available. Benchmarking DL frameworksand systems is in high demand [1], [2], [3], [4],[5], [6], [7], [8], [9]. However, benchmarking deeplearning software frameworks and systems is notablymore difficult than traditional performance-drivenbenchmarks. This is simply because big datapowered deep learning systems are inherently bothcomputation-intensive and data-intensive, demandingintelligent integration of massive data parallelism andmassive computation-parallelism at all levels of a deeplearning framework. For instance, a deep learningframework typically has a large set of model parametersand system parameters that need to be configured andtuned. Many of the parameters are interacting withone another in a complex manner from both modellearning perspective and system runtime optimizationperspective, making the tuning of such large space

of parameters substantially more tricky than whathave been experimented and understood from systemsadministration of conventional computer systems,software tools and applications.

This paper presents design considerations, metrics andinsights towards benchmarking DL software frameworksthrough a comparative study of three popular deeplearning frameworks: TensorFlow [10], Caffe [11] andTorch [12]. First, we show that although deep learningsoftware frameworks are optimized with their defaultconfiguration settings, for a given DL framework,its default configuration optimized to train on onedataset may not work effectively for other datasets.Second, the default configuration optimized on adataset by one DL framework may not work wellwhen used to train the same dataset by anotherDL framework. Hence, it may not be meaningfulto compare different DL frameworks under the sameconfiguration. Third, different DL frameworks exhibitdifferent levels of robustness in response to adversarialbehaviors, and different sensitivity boundaries overpotential biases or noise levels inherent in differenttraining datasets. Through this experimental study,we show that system runtime performance, learningaccuracy, and model robustness against adversarialbehaviors and consequence of overfitting are the threesets of metrics that are equally important for effectivelyconfiguring, measuring and comparing different deeplearning software frameworks.

II. DEEP LEARNING REFERENCE MODEL

Deep learning software frameworks are scalablesoftware implementations of deep neural networks(DNN) on modern computers with many-core CPUswithout or with GPUs.

A. DNN ModelA deep neural network refers to a N-layer neuralnetwork with N ≥ 2. Learning over the input data toa DNN is typically performed through a sequence oftransformations of input data layer by layer with each

Page 2: Benchmarking Deep Learning Frameworks: Design ... · models by transforming the data to LMDB format, defining network architecture in the .prototxt file, defining hyperparameters

layer representing a network of neurons extracted frominput data. Although different layers extract differentrepresentations of features of the input data, each layerlearns to extract more complex, deeper features fromits previous layer. A typical layer of a neural networkconsists of weights w, biases b and activation functionact(), in addition to the set of loosely connected neurons(parameters). It takes as input the neuron map producedfrom its previous layer, and produces the output asy = act(w ∗ x + b), where x is the input of thelayer. By connecting these layers together, a deep neuralnetwork is a function y = F (x) in which x ∈ Rn isa n-dimension input and y ∈ Rm is the output of am-dimension vector.

The neural network model F consists of many modelparameters θF . The values of parameters θF are tunedduring the training phase where a large number ofinput-output pairs as training dataset is fed into theneural network. The training process is conducted inmultiple rounds/iterations. During each round, the neuralnetwork uses the parameters from the previous iterationwith the training data input to predict output forwardlyon the N-layer neural network. Then, the DNN computesa loss function between the predicted output and the real(pre-labeled) output. Using the loss function, the DNNupdates the parameters using backpropagation withan optimizer, e.g., stochastic gradient descent (SGD)[13] or Adam [14], which minimizes the pre-definedloss function. The general principle of defining a lossfunction is to measure the difference between thecomputed output with the ground truth (real output).

In addition to network parameters tuned in the trainingprocess, hyperparameters also need to be tuned. Thelearning rate and batch size are among the mostimportant ones. Both are used to control the extent ofparameter updates for each iteration. In general, largerlearning rate and/or larger batch size will bring aboutfaster convergence. However, if the learning rate istoo large, the training process may not be sophisticatedenough and may suffers from fluctuation. If the batchsize is too large to fit each mini-batch in memoryof GPU or CPU core, the training process may takemuch longer to complete. Smaller learning rate leads toslower convergence but makes the training process morefine-grained. Smaller batch size leads to larger numberof bagging and thus larger number of epochs and mayresult in lower accuracy, but it can avoid out of memory(OOM) induced runtime performance penalty. Anotherkey hyperparameter is the number of kernels (weightfilers), which produces the number of feature maps in

each layer of the neural networks. Usually more featuremaps enable a deep learning model to give the input dataa more refined representation, but at higher runtime cost.

Other hyperparameters, such as the networkarchitecture, number of layers, paddings, strides,kernel sizes of layers, the type of loss function,optimizer, and regularization method, also influence theperformance of the deep learning model, each fromtheir own perspective. Note that regularization methodcan reduce overfitting. Overfitting occurs when thedeep learning model is able to achieve high accuracyon the training data but such high accuracy cannot begeneralized to the testing data.

Once the DNN is trained, the parameters θF are fixed,producing a deep learning model that is used in thetesting phase to make predictions and classifications.Usually, the trained DNN model may be re-trainedBefore its actual deployment, to ensure that it passesthe validation test and the system uses the trained DNNmodel can provide sufficiently accurate results. Thetesting phase refers to both the validation and the useof a trained DNN model in real application system.

B. Reference DL FrameworksThree mainstream DL frameworks: TensorFlow, Caffeand Torch, are selected for this study.

TensorFlow [10] is a open source software libraryand implemented based on data flow graph. Neurons aretensors (data arrays) flow between nodes via the edges.The nodes are used to represent mathematical operationsand the edges represent the data flow. The tensorbased data flow makes TensorFlow an ideal API andimplementation tool for Convolutional Neural Networks(CNNs) and Recurrent Neural Neworks (RNNs).However, TensorFlow does not support dynamic inputsize, which are crucial for applications like NLP.

Caffe [11] supports many different types of deeplearning architectures (CNN, RCNN, LSTM and fullyconnected neural network) geared towards imageclassification and segmentation. Caffe can be usedsimply in command lines and it builds neural networkmodels by transforming the data to LMDB format,defining network architecture in the .prototxt file,defining hyperparameters such as learning rate andtraining epochs in the “solver” file and then training.Caffe works layer-wisely for deep learning applications.

Torch [12] is an open source machine learning librarythat provides a wide range of algorithms for DL. Itscomputing framework is based on the Lua programminglanguage [15], a script language. Torch has the mostcomprehensive set of convolutions, and it supports

2

Page 3: Benchmarking Deep Learning Frameworks: Design ... · models by transforming the data to LMDB format, defining network architecture in the .prototxt file, defining hyperparameters

temporal convolution with variable input length.

C. Evaluation MetricsMetrics are a critical component for a benchmark. Thefollowing metrics are used in this measurement study.

Training Time. It is a key performance indicatorfor DL frameworks. Training time is the time spent onbuilding a DNN model over the training dataset. Foroptimization purposes, models need to be trained forseveral times with different parameters in order to findthe best parameters that achieves the optimal design.Although pre-trained models are made available on manyplatforms, such as Caffe Model Zoo [16], training is stillessential for new model development and for retrainingover new datasets or incrementally enhanced datasets.

Testing Time. It is another important performanceindicator for DL frameworks. Testing time is the timespent on testing the trained model using a validationdataset. It indicates the potential latency of using thetrained model for the prediction or classification basedinference when the model is deployed in real-worldapplications. Thus, testing time affects user experienceto a large extent and affects the performance of actualapplications. Both training and testing time can beinfluenced by the configurations of system-specific andmodel specific parameters.

Learning/Prediction Accuracy. The learningaccuracy metric measures the utility of the trainingframework in the training phase, and the predictionaccuracy measures the utility of the trained DNN modelat testing phase. Accuracy measurement is highlysensitive to both data-specific parameters, such as thetype of datasets, the characteristics of the datasets, suchas the number of classes, the number of training samplesper class, and the type of deep learning architectures andmachine learning library used, such as the collection ofalgorithms/optimizations included, the configurations ofmany model specific parameters.

Adversarial Robustness. This metric is designedto measure the resilience of the DL framework andits trained DNN model against adversarial behaviors,including targeted attacks and random (untargeted)attacks, as well as effect of overfitting against potentialbiases and noise levels in the training dataset during thetesting phase. It can also be used a measure to evaluatethe effectiveness of regularization techniques deployed indifferent DL frameworks. For example, TensorFlow usesdropout, while Caffe has weight decay. In this paper, weuse the success rate of crafting adversarial examples asa measure of adversarial Robustness.

Two types of adversarial attacks are considered

in this study: untargeted Fast Gradient Sign Method(FGSM) [17] and targeted Jacobian-based attacks [18].Adversarial examples x′ consists of an input x, and itsadversarial perturbation δx. With some perturbation, theadversarial example is classified as a new class differentfrom its original class, i.e., x′ = x+ δx, F (x) = y andF (x′) = y′ both hold while y′ 6= y.

FGSM: a simple and fast way to generate untargetedadversarial examples:

x′ = x+ εsign(∇xL(x, y)), (1)

where L(x, y) is the loss function of original input andthe true label. sign() is a mathematical function:

sign(x) =

{1, x > 0,0, x = 0,−1, x < 0

Jacobian-based attacks: a targeted attack launchedby adversaries to generate adversarial examples such thatthey are classified as targeted class t instead of their trulylegitimate source class. Concretely, for each feature i, theperturbations δx are formed by decreasing saliency mapS(x, t)[i] of the network rather than the loss function,e.g.,

S(x, t)[i] =

0, if ∂Ft

∂xi(x) < 0

or∑

j 6=t∂Fj

∂xi(x) > 0,

∂Ft

∂xi(x)|

∑j 6=t

∂Fj

∂xi(x)|,

otherwise,

(2)

where matrix JF = [∂Fj

∂xi]i,j is the Jacobian matrix

of the neural network function. The goal of Equation(2) is to reject input features with a negative targetderivative or overall positive derivative on classes otherthan the targeted class. In fact, saliency maps exploitinput features that contribute to increasing the probabilityof target class or decreasing source class or other classessignificantly, or both.

III. EXPERIMENTS

We conduct three sets of experiments to answer thefollowing questions: (i) How effective the default setting(configuration) of one DL framework is in comparisonwith that of another DL framework for the samedatasets? (ii) How efficient the default setting used totrain one dataset will be when it is deployed to trainanother dataset using the same DL framework? (iii)Would the default setting used by one DL frameworkbe effective when used by another DL framework totrain the same dataset (i.e., dataset dependent defaultconfiguration)? (iv) How well the default setting of aDL framework, which is optimized for training on onedataset, can perform when it is used by another DLframework to train on the same dataset (i.e., frameworkdependent default configuration)?

3

Page 4: Benchmarking Deep Learning Frameworks: Design ... · models by transforming the data to LMDB format, defining network architecture in the .prototxt file, defining hyperparameters

Frameworks Version Hash Tag Library Interface LoC License WebsiteTensorFlow 1.3.0 ab0fcac Eigen & CUDA Java, Python, Go, R 1281085 Apache https://www.tensorflow.org/

Caffe 1.0.0 c430690 OpenBLAS & CUDA Python, Matlab 69608 BSD http://caffe.berkeleyvision.org/Torch torch7 0219027 optim & CUDA Lua 29750 BSD http://torch.ch/

TABLE I: Deep Learning Software Frameworks and Basic Properties

All experiments are conducted on Intel Xeon(R)E5-1620 server with CPU: 3.6Ghz, Memory: DDR31600 MHz 8GB × 4 (32GB), Hard drive: SSD 256GB,GPU: Nvidia GeForce GTX 1080 Ti (11GB), installedwith Ubuntu 16.04 LTS, CUDA 8.0 and cuDNN 6.0.

DL Frameworks. TensorFlow [10], Caffe [11], andTorch [12] are selected for this measurement study.Table I shows some of their statistics.

Datasets. Datasets play a definitive role in theperformance of deep learning in most cases [19], [20],[21], [22]. For the objectives of this study, we choose thetwo classic datasets: MNIST [23] and CIFAR-10 [24], asall 3 DL frameworks have tuned their configurations forthese two datasets. MNIST consists of 70,000 imagesof ten handwritten digits, each image is 28 × 28 insize. CIFAR-10 consists of 60,000 colorful images of10 classes, each is 32× 32 in size.

A. Default Settings in DL FrameworksWe compare their primary hyperparameter settings usedfor MINST and CIFAR-10 in Table II and Table IIIrespectively. All three DL frameworks select their ownpreferred default training parameters for both datasets.For MINST, the base learning rate. TensorFlow prefersAdam[14] as its optimizer while Caffe and Torchuse SGD [13]. TensorFlow uses the smallest baselearning rate, and Caffe uses the largest batch sizealong with the smallest number of training epochs,Torch uses the largest base learning rate and largernumber of training epochs. For MNIST, TensorFlow setsits maximum steps to 20,000, and Caffe sets its maxiterations to 10,000, thus by #Epochs = max steps *batch size / #Training Samples, we obtain 20,000 * 50/ 60,000= 16.67 epochs for TensorFlow and 10,000 * 64/ 60,000 = 10.67 epochs for Caffe. For Torch, the max#Epochs is manually set to 20 and the #max iterationsis set to (20 * 60,000) / 10 = 120,000. For CIFAR-10,SGD is used by all as their optimizer. Caffe adopts atwo-phase training, the learning rate for its first phase is0.001 and 0.0001 for the second phase. Also Caffe uses8 epochs for the first phase training and 2 epochs forthe second phase. Using the same formula, TensorFlowhas its maximum steps set to (2,560 * 50,000) / 128 =1,000,000, Caffe sets its max iterations to (10 * 50,000) /100 = 5,000, and Torch sets it to 20 * 5,000 / 1 = 100,000for CIFAR-10, since #Training Samples of CIFAR-10 is50,000 while 60,000 for MNIST.

Framework TensorFlow Caffe TorchAlgorithm Adam SGD SGD

Base Learning Rate 0.0001 0.01 0.05Batch Size 50 64 10

#Max Iterations 20,000 10,000 120,000#Epochs 16.67 10.67 20

TABLE II: Default training parameters on MNISTFramework TensorFlow Caffe TorchAlgorithm SGD SGD SGD

Base Learning Rage 0.1 0.001→ 0.0001 0.001Batch Size 128 100 1

#Max Iterations 1,000,000 5,000 100,000#Epochs 2560 8+2 20

TABLE III: Default training parameters on CIFAR-10Next, we compare their primary default parameters

about neural network structures, which are configuredby the framework creators to work optimally withMINST and CIFAR10, shown in Table IV and TableV respectively. For MNIST, three frameworks adopt asimilar network structure with 2 convolution layers and2 fully connected layers, derived from LeNet [23]. Allset their kernel/neuron size as 5 × 5. 1 → 32 indicatesthe number of input feature maps is 1. However, otherparameters are selected differently for MNIST, such asthe activation operation, the number of kernels/featuremaps extracted. In comparison, for CIFAR-10, theirnetwork structures vary more significantly from the onesconfigured for MNIST. Caffe and TensorFlow use a5-layer network structure and Torch uses 4-layer instead.Caffe employs 3 convolution layers and TensorFlow andTorch both use 2 convolution layers.

We make two observations from Table IV andTable V. First, a DL framework may vary theirdefault parameters for different datasets, as shown.We refer to this type of default setting variationsas dataset-dependent default settings. Second, differentframeworks may NOT use the same default configurationto train the same dataset because each frameworkmay optimize its performance using a different settingof model-turning and system-turning parameters, evenwhen trained over the same dataset. We call this type ofdefault setting variations as framework-dependent defaultsettings, which refers to the default settings used bydifferent DL frameworks to train on a specific dataset.Interesting questions arise:

• Can the default setting optimized to train onedataset be effective to train a different datasetusing the same DL software framework? (sameframework on different datasets)

4

Page 5: Benchmarking Deep Learning Frameworks: Design ... · models by transforming the data to LMDB format, defining network architecture in the .prototxt file, defining hyperparameters

Framework TensorFlow Caffe Torch

1st Layer (conv) 5× 5, 1→ 32 5× 5, 1→ 20 5× 5, 1→ 32ReLU, MaxPooling(2× 2) MaxPooling(2× 2) Tanh, MaxPooling(3× 3)

2nd Layer (conv) 5× 5, 32→ 64 5× 5, 20→ 50 5× 5, 32→ 64ReLU, MaxPooling(2× 2) MaxPooling(2× 2) Tanh, MaxPooling(3× 3)

3rd Layer (fc) 7× 7× 64→ 1024 4× 4× 50→ 500 3× 3× 64→ 200ReLU ReLU Tanh

4th Layer (fc) 1024→ 10 500→ 10 200→ 10

TABLE IV: Primary Default Neural Network Parameters on MNISTFramework TensorFlow Caffe Torch

1st Layer (conv) 5× 5, 3→ 64 5× 5, 3→ 32 5× 5, 3→ 16ReLU, MaxPooling(3× 3), Normalization MaxPooling(3× 3), ReLU Tanh, MaxPooling(2× 2)

2nd Layer (conv) 5× 5, 64→ 64 5× 5, 32→ 32 5× 5, 16→ 256ReLU, Normalization, MaxPooling(3× 3) ReLU, AveragePooling(3× 3) Tanh, MaxPooling(2× 2)

3rd Layer 7× 7× 64→ 384 5× 5, 32→ 64 5× 5× 256→ 128fc, ReLU conv, ReLU, AveragePooling(3× 3) fc, Tanh

4th Layer (fc) 384→ 192 4× 4× 64→ 64 128→ 10ReLU — —

5th Layer(fc) 192→ 10 64→ 10 —

TABLE V: Primary Default Neural Network Parameters on CIFAR-10

• Can the default setting trained on a dataset byone DL framework be used effectively to configureanother framework to train the same dataset?(different frameworks on the same dataset)

• Will the default setting, optimized by oneframework, work effectively for other DLframeworks? (different frameworks on differentdatasets)

B. Impact of Default SettingsWe first study the impact of default settings by oneframework on different datasets to answer the firstquestion (same framework on different datasets).

This set of experiments compares the performanceof three DL frameworks on MNIST using their owndefault setting optimized for MNIST (Figure 1) andsimilar comparison also on CIFAR-10 (Figure 2). Wehighlight three observations: (1) For both datasets, Torchspent longest time in testing as well as its trainingon MNIST. TensorFlow has the highest accuracy. Eventhough Torch’s accuracy is higher than Caffe and lowerthan TensorFlow for MNIST, its accuracy for CIFAR-10is the worst of the three. One reason that Caffe hasshorter training and testing time might be that Caffe wastrained for the least number of epochs and less featuremaps were extracted for inference, whereas TensorFlowhas largest number of feature maps, which helps itto achieve the highest accuracy with low testing time.(2) For MNIST, TensorFlow and Caffe show similartraining and testing time, but Caffe has the lowestaccuracy. For CIFAR-10, Caffe spent least time fortraining in CPU or GPU settings and least testing timefor GPU setting. TensorFlow has significantly longertraining time and its GPU testing time doubles that of

Caffe. But TensorFlow has the highest accuracy withCaffe ranking the second, followed by Torch. Note thatTorch GPU has slightly lower accuracy (65.61%) thanTorch-CPU (66.16%). One reason could be that Torchuses SpatialConvolutionMap [25] on CPU for CIFAR-10,but it lacks the corresponding implementation on GPU.Thus, SpatialConvolutionMM [26] is used as the default.(3) All three frameworks have shortened their trainingand testing time with GPU for both datasets. Concretely,GPU acceleration for MNIST, TensorFlow is faster by16 times and 10 times, Caffe is faster by 5 times and6 times, and Torch is faster by 28 times and 32 times,in training time and testing time, respectively. However,TensorFlow CPU and Torch CPU obtain slightly higheraccuracy on MNIST compared to TensorFlow GPUand Torch GPU respectively, though accuracy of allsettings on MNIST are above 99% with highest byTF CPU (99.28%) and lowest by Caffe CPU (99.03%).For CIFAR-10, Torch CPU has slightly higher accuracythan Torch GPU. It is worth to note that TensorFlowCPU with 256 epochs can achieve 86.6% accuracy withtraining time of 21673.81sec ( 6.02 hours), whereasTensorFlow CPU with 2560 epochs achieves 86.90%accuracy at the cost of training time of 60.88 hours. Thisresult demonstrates that GPU acceleration shortens thetraining/testing time but it may not ensure high accuracydue to multiple factors, such as mini-batch size, baggingalgorithm, per unit memory capacity of GPU. For moredetail, see [27].

In summary, the deep learning model demonstratesmuch better accuracy on the sparse and gray-scaleMNIST dataset over the color-rich and content-richCIFAR-10 dataset. The sparseness and gray scale ofMNIST give the data low entropy. We attribute the better

5

Page 6: Benchmarking Deep Learning Frameworks: Design ... · models by transforming the data to LMDB format, defining network architecture in the .prototxt file, defining hyperparameters

11

14

.34

51

2.1

8

16

09

6.6

2

68

.51

97

.02 56

3.2

8

1

10

100

1000

10000

100000

TF

Caf

fe

Torc

h TF

Caf

fe

Torc

h

CPU GPU

Trai

nin

g Ti

me

(s)

(a) Training Time

2.7

3

3.3

3 56

.62

0.2

6 0.5

5

1.7

6

0.1

1

10

100

TF

Caf

fe

Torc

h TF

Caf

fe

Torc

h

CPU GPU

Test

ing

Tim

e (s

)

(b) Testing Time

99

.28

99

.03

99

.20

99

.22

99

.13 99

.18

98.9

98.95

99

99.05

99.1

99.15

99.2

99.25

99.3

TF

Caf

fe

Torc

h TF

Caf

fe

Torc

h

CPU GPU

Acc

ura

cy (

%)

(c) Accuracy

Fig. 1: Experimental Results on MNIST, using MNIST Default Settings by 3 frameworks

21

91

69

.14

17

30

.89

38

26

8.6

7

12

47

7.0

5

16

3.5

1

72

2.1

5

1

10

100

1000

10000

100000

1000000

TF

Caf

fe

Torc

h TF

Caf

fe

Torc

h

CPU GPU

Trai

nin

g Ti

me

(s)

(a) Training Time

4.8

0 14

.35

12

1.1

1

2.3

4

1.3

6 3.6

6

1

10

100

1000

TF

Caf

fe

Torc

h TF

Caf

fe

Torc

h

CPU GPU

Test

ing

Tim

e (s

)

(b) Testing Time

86

.90

75

.39

66

.16

87

.00

75

.52

65

.61

60

65

70

75

80

85

90

TF

Caf

fe

Torc

h TF

Caf

fe

Torc

h

CPU GPU

Acc

ura

cy (

%)

(c) Accuracy

Fig. 2: Experimental Results on CIFAR-10, using CIFAR-10 Default Settings by 3 frameworks

accuracy performance to the lower entropy of the datasince it is easier for the deep learning model to learn.The low entropy also makes the training and testing ofthe DNN model much faster on MNIST than CIFAR-10.

C. Impact of Dataset-dependent Default SettingsThe next set of experiments studies the impact ofdataset-dependent default settings and their variationson the performance of different DL frameworks. Weconfigure all three DL frameworks using the samesetting on MNIST first, and compare their performance.To be fair, we choose the default setting preferredby each of the three DL frameworks, one at a time.Figure 3 shows the results. In Figure 3a, we comparethe training time of the three frameworks on MNIST.There are two colored bars for each framework. ForTensorFlow, the training time using its own MNISTdefault setting is the blue bar and the training timeusing its own CIFAR-10 default setting is the red barrespectively. Similar evaluation settings for Caffe andfor Torch. Testing time comparison is in Figure 3b,and accuracy comparison is in Figure 3c. We maketwo observations. First, all frameworks perform worstusing their own CIFAR-10 default setting on MNISTdataset with longer training and testing time. RecallTable IV and Table V, all three frameworks choose

deeper and more complex neural network structuresfor CIFAR-10, thus, the longer training and testingtime. Second, surprisingly, TensorFlow and Torch usingtheir CIFAR-10 default settings on MNIST allow bothto achieve high accuracy and almost identical to thebest accuracy they produced when using their ownMNIST default setting on MNIST. However, Caffe doesnot enjoy the same result and its performance usingits own CIFAR-10 setting on MNIST is surprisinglyworsened. In summary, the longer training time and morecomplex NN structures do not necessarily guaranteehigher accuracy. A possible explanation is that overtraining may bring about the worst overfitting.

Similarly, for CIFAR-10 dataset, Figure 4 shows thecomparison results for dataset dependent default settings.Figure 4a and Figure 4b show a similar trend: all threeframeworks have much shorter training and testing timewhen using their default MNIST settings to train and teston the CIFAR-10. This is simply because all frameworkschoose simpler neural network structures for training andtesting MNIST dataset. Thus, the MNIST default settingruns faster. Now we look at how accuracy respondsto the simpler NN structure and shorter training andtesting time. Figure 4c shows that both TensorFlowand Caffe suffer from lower accuracy when using

6

Page 7: Benchmarking Deep Learning Frameworks: Design ... · models by transforming the data to LMDB format, defining network architecture in the .prototxt file, defining hyperparameters

68

.51

14

27

3.5

9

97

.02

16

4.6

8

56

3.2

8

29

78

.52

1

10

100

1000

10000

100000

MN

IST

CIF

AR

-10

MN

IST

CIF

AR

-10

MN

IST

CIF

AR

-10

TensorFlow Caffe Torch

Trai

nin

g Ti

me

(s)

(a) Training Time

0.2

6

0.6

0

0.5

5

1.4

7

1.7

6

3.7

0

0.00.51.01.52.02.53.03.54.0

MN

IST

CIF

AR

-10

MN

IST

CIF

AR

-10

MN

IST

CIF

AR

-10

TensorFlow Caffe Torch

Test

ing

Tim

e (s

)

(b) Testing Time

99

.22

99

.31

99

.13

91

.79

99

.18

99

.17

91

93

95

97

99

101

MN

IST

CIF

AR

-10

MN

IST

CIF

AR

-10

MN

IST

CIF

AR

-10

TensorFlow Caffe Torch

Acc

ura

cy (

%)

(c) Accuracy

Fig. 3: Experimental Results on MNIST (Dataset-dependent Default Settings on GPU)

15

1.6

7

12

47

7.0

5

11

5.3

0

16

3.5

1

63

8.0

0

72

2.1

5

1

10

100

1000

10000

100000

MN

IST

CIF

AR

-10

MN

IST

CIF

AR

-10

MN

IST

CIF

AR

-10

TensorFlow Caffe Torch

Trai

nin

g Ti

me

(s)

(a) Training Time

1.3

2

2.3

4

0.6

4 1.3

6

3.4

7

3.6

6

0.00.51.01.52.02.53.03.54.0

MN

IST

CIF

AR

-10

MN

IST

CIF

AR

-10

MN

IST

CIF

AR

-10

TensorFlow Caffe Torch

Test

ing

Tim

e (

s)

(b) Testing Time

69

.76

87

.00

11

.03

75

.52

66

.40

65

.61

0

20

40

60

80

100

MN

IST

CIF

AR

-10

MN

IST

CIF

AR

-10

MN

IST

CIF

AR

-10

TensorFlow Caffe Torch

Acc

ura

cy (

%)

(c) Accuracy

Fig. 4: Experimental Results on CIFAR-10 (Dataset-dependent Default Settings on GPU)

their own default MNIST settings to train and test onthe CIFAR-10 dataset, but Torch shows very similaraccuracy using either its own MNIST default or its ownCIFAR-10 default setting. It is worth noting that Caffe’sperformance is very sensitive to its dataset-dependentdefault setting. Thus, For CIFAR-10, when Caffe uses itsown MNIST setting, it fails to train on CIFAR-10 suchthat the training does not converge, resulting in a verylow accuracy also at testing phrase. Figure 5 shows thetraining loss for Caffe as the training process progressesin this experiment. Using Caffe CIFAR-10 default settingon CIFAR-10 dataset, the training loss rate of Caffeduring training declines as the training proceeds to 5,000iterations (expected), while the training loss of usingCaffe MNIST default setting on CIFAR-10 is almostconstant, at 87.34%, indicating that Caffe using itsMNIST setting to train on CIFAR-10 dataset will notconverge. Figure 5 measures the training loss by varyingthe number of iterations since Caffe only provides thestatistics for each iteration.

In summary, this two sets of experiments show thatfor a DL framework, its own default setting optimizedfor one dataset may not work well for other datasets.

D. Impact of framework-dependent default settingWe evaluate and compare the impact of using differentframework default settings when trained on the same

0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000Training Iteration

Tra

inin

g L

oss

Training Loss of Caffe on CIFAR-10

Dataset Default Settings

0

1

87

87.5

88

88.5

89

89.5

90

MNIST-Settings

CIFAR-10-Settings

Fig. 5: Training Loss (convergence) rate of Caffe onCIFAR-10 with its MNIST and CIFAR-10 default

dataset. The results show that the default setting that isoptimized for training on a dataset by one frameworkmay not work effectively for other frameworks to trainon the same dataset.

We first conduct experiments for MNIST. Figure 6shows the results. In Figure 6a, the first set of threebars compares the training time of using TensorFlow totrain on MNIST with three different settings (i) using itsown MNIST default setting (blue bar), (ii) using CaffeMNIST default setting (red bar) and (iii) using TorchMNIST default setting (orange bar). Similarly the firstset of three bars in Figure 6b shows the testing time ofTensorFlow using the three framework-dependent defaultsettings on MNIST, and the first set of three bars in

7

Page 8: Benchmarking Deep Learning Frameworks: Design ... · models by transforming the data to LMDB format, defining network architecture in the .prototxt file, defining hyperparameters

68

.51

21

.32

17

6.2

3

20

6.6

6

97

.02 2

35

.57

32

1.6

3

18

7.5

4

56

3.2

8

0

100

200

300

400

500

600

TF-M

NIS

T

Caf

fe-M

NIS

T

Torc

h-M

NIS

T

TF-M

NIS

T

Caf

fe-M

NIS

T

Torc

h-M

NIS

T

TF-M

NIS

T

Caf

fe-M

NIS

T

Torc

h-M

NIS

T

TensorFlow Caffe Torch

Trai

nin

g Ti

me

(s)

(a) Training Time

0.2

6

0.1

2

0.1

3

0.7

1

0.5

5 0.7

6

1.5

3

1.3

7 1.7

6

00.20.40.60.8

11.21.41.61.8

2

TF-M

NIS

T

Caf

fe-M

NIS

T

Torc

h-M

NIS

T

TF-M

NIS

T

Caf

fe-M

NIS

T

Torc

h-M

NIS

T

TF-M

NIS

T

Caf

fe-M

NIS

T

Torc

h-M

NIS

T

TensorFlow Caffe Torch

Test

ing

Tim

e (s

)

(b) Testing Time

99

.22

98

.51

99

.10

99

.94

99

.13

94

.14

99

.11

98

.78

99

.18

94

95

96

97

98

99

100

101

TF-M

NIS

T

Caf

fe-M

NIS

T

Torc

h-M

NIS

T

TF-M

NIS

T

Caf

fe-M

NIS

T

Torc

h-M

NIS

T

TF-M

NIS

T

Caf

fe-M

NIS

T

Torc

h-M

NIS

T

TensorFlow Caffe Torch

Acc

ura

cy (

%)

(c) Accuracy

Fig. 6: Experimental Results on MNIST (Framework-dependent Default Settings on GPU)

12

47

7.0

5

32

.98

21

00

.61

33

90

8.4

3

16

3.5

1

68

2.5

8

12

63

04

.27

39

6.8

6

72

2.1

5

1

10

100

1000

10000

100000

1000000

TF-C

IFA

R-1

0

Caf

fe-C

IFA

R-1

0

Torc

h-C

IFA

R-1

0

TF-C

IFA

R-1

0

Caf

fe-C

IFA

R-1

0

Torc

h-C

IFA

R-1

0

TF-C

IFA

R-1

0

Caf

fe-C

IFA

R-1

0

Torc

h-C

IFA

R-1

0

TensorFlow Caffe Torch

Trai

nin

g Ti

me

(s)

(a) Training Time

2.3

4

1.4

0

7.1

00

.91

1.3

6

0.5

8

4.1

8

4.1

1

3.6

6

012345678

TF-C

IFA

R-1

0

Caf

fe-C

IFA

R-1

0

Torc

h-C

IFA

R-1

0

TF-C

IFA

R-1

0

Caf

fe-C

IFA

R-1

0

Torc

h-C

IFA

R-1

0

TF-C

IFA

R-1

0

Caf

fe-C

IFA

R-1

0

Torc

h-C

IFA

R-1

0

TensorFlow Caffe Torch

Test

ing

Tim

e (s

)

(b) Testing Time

87

.00

55

.96

55

.04

10

.10

75

.52

59

.27

73

.74

31

.47

65

.61

0102030405060708090

100

TF-C

IFA

R-1

0

Caf

fe-C

IFA

R-1

0

Torc

h-C

IFA

R-1

0

TF-C

IFA

R-1

0

Caf

fe-C

IFA

R-1

0

Torc

h-C

IFA

R-1

0

TF-C

IFA

R-1

0

Caf

fe-C

IFA

R-1

0

Torc

h-C

IFA

R-1

0

TensorFlow Caffe Torch

Acc

ura

cy (

%)

(c) Accuracy

Fig. 7: Experimental Results on CIFAR-10 (Framework-dependent Default Settings on GPU)

Figure 6c shows the accuracy of TensorFlow using thethree framework settings. The same measurements aredone for Caffe and Torch, as shown in the secondand third set of 3 bars respectively, in Figure 6. Wemake two observations. First, using Caffe MNIST defaultsetting, all three frameworks have shorter training timeand testing time on MNIST. One reason is that thenumber of training epochs in Caffe MNIST setting isthe smallest, and its NN structure is relatively simple.Second, TensorFlow and Torch achieved the highestaccuracy when using their own MNIST default settingsto train and test on MNIST, but Caffe has a higheraccuracy when using TensorFlow MNIST default setting,at the cost of more than twice the training time.Thus, it is still fair to say that the default setting ofCaffe is optimal for MNIST dataset considering bothtraining/testing time and accuracy.

Next, we conduct the same experiments for CIFAR-10and show results in Figure 7. We highlight threeobservations. First, Figure 7a shows that the trainingtime of Caffe’s own CIFAR-10 default setting is theshortest as it uses simple NN structures and least numberof training epochs. TensorFlow’s own CIFAR-10 defaultsetting works poorly with very long training time whenused by Caffe and Torch to train on CIFAR-10 dataset.

Second, from Figure 7a and Figure 7b, we observethat Torch’s own CIFAR-10 default setting does notwork well when used by TensorFlow on CIFAR-10,because it took much longer training and testing time.The reason can be attributed to the framework-specificimplementation and different levels of complexity oftheir NN structures. Third, TensorFlow and Caffe offerhigher accuracy using their own CIFAR-10 defaultsettings. However, Torch achieved much higher accuracyusing TensorFlow’s CIFAR-10 default setting thanTorch’s own CIFAR-10 default, at a huge cost on trainingtime. Finally, when Caffe is using TensorFlow’s ownCIFAR-10 default setting to train on CIFAR-10, Caffefailed to generate a DNN model again as the trainingprocess did not converge. The reason is similar to thoseexplained in Section III.C and Figure 5.

Table VI and Table VII provide a summary ofthe results reported so far on MNIST and CIFAR-10datasets respectively. Table VIa and Table VIIa show theexperimental results with default settings of these threeframeworks, serving as the baseline. Table VIb and TableVIIb compared the performance of dataset-dependentdefault settings, while Table VIc and Table VIIc presentthe experimental results with framework-dependentdefault settings.

8

Page 9: Benchmarking Deep Learning Frameworks: Design ... · models by transforming the data to LMDB format, defining network architecture in the .prototxt file, defining hyperparameters

Framework TrainingTime (s)

TestingTime (s)

Accuracy(%)

TF-CPU 1114.34 2.73 99.28Caffe-CPU 512.18 3.33 99.03Torch-CPU 16096.62 56.62 99.20

TF-GPU 68.51 0.26 99.22Caffe-GPU 97.02 0.55 99.13Torch-GPU 563.28 1.76 99.18

(a) Baseline Default Comparison

Framework(GPU)

DefaultSettings

TrainingTime (s)

TestingTime (s)

Accuracy(%)

TF TF MNIST 68.51 0.26 99.22TF TF CIFAR-10 14273.59 0.60 99.31

Caffe Caffe MNIST 97.02 0.55 99.13Caffe Caffe CIFAR-10 164.68 1.47 91.79Torch Torch MNIST 563.28 1.76 99.18Torch Torch CIFAR-10 2978.52 3.70 99.17

(b) Dataset-dependent Default Comparison

Framework(GPU)

DefaultSettings

TrainingTime (s)

TestingTime (s)

Accuracy(%)

TF TF MNIST 68.51 0.26 99.22TF Caffe MNIST 21.32 0.12 98.51TF Torch MNIST 176.23 0.13 99.10

Caffe TF MNIST 206.66 0.71 99.94Caffe Caffe MNIST 97.02 0.55 99.13Caffe Torch MNIST 235.57 0.76 94.14Torch TF MNIST 321.63 1.53 99.11Torch Caffe MNIST 187.54 1.37 98.78Torch Torch MNIST 563.28 1.76 99.18

(c) Framework Default Comparison

TABLE VI: Configurations for Training MNIST using TensorFlow (TF), Caffe and Torch

Framework TrainingTime (s)

TestingTime (s)

Accuracy(%)

TF-CPU 219169.14 4.80 86.90Caffe-CPU 1730.89 14.35 75.39Torch-CPU 38268.67 121.11 66.16

TF-GPU 12477.05 2.34 87.00Caffe-GPU 163.51 1.36 75.52Torch-GPU 722.15 3.66 65.61

(a) Baseline Default Comparison

Framework(GPU)

DefaultSettings

TrainingTime (s)

TestingTime (s)

Accuracy(%)

TF TF MNIST 151.67 1.32 69.76TF TF CIFAR-10 12477.05 2.34 87.00

Caffe Caffe MNIST 115.30 0.64 11.03Caffe Caffe CIFAR-10 163.51 1.36 75.52Torch Torch MNIST 638.00 3.47 66.40Torch Torch CIFAR-10 722.15 3.66 65.61

(b) Dataset-dependent Default Comparison

Framework(GPU)

DefaultSettings

TrainingTime (s)

TestingTime (s)

Accuracy(%)

TF TF CIFAR-10 12477.05 2.34 87.00TF Caffe CIFAR-10 32.98 1.40 55.96TF Torch CIFAR-10 2100.61 7.10 55.04

Caffe TF CIFAR-10 33908.43 0.91 10.10Caffe Caffe CIFAR-10 163.51 1.36 75.52Caffe Torch CIFAR-10 682.58 0.58 59.27Torch TF CIFAR-10 126304.27 4.18 73.74Torch Caffe CIFAR-10 396.86 4.11 31.47Torch Torch CIFAR-10 722.15 3.66 65.61

(c) Framework Default Comparison

TABLE VII: Configurations for Training CIFAR-10 using TensorFlow (TF), Caffe and Torch

0

0.2

0.4

0.6

0.8

1

1.2

0 1 2 3 4 5 6 7 8 9

TensorFlow MNIST

0 1 2 3 4 5 6 7 8 9

SR

0.9

97

0.9

98

0.8

92

0.9

77

0.9

77

0.9

89

0.9

75

0.9

92

0.9

79

0.9

88

(a) TensorFlow Model under UntargetedAttack

0

0.2

0.4

0.6

0.8

1

1.2

0 1 2 3 4 5 6 7 8 9

Caffe MNIST

0 1 2 3 4 5 6 7 8 9

SR 1.0

00

1.0

00

0.9

79

0.9

86

0.9

95

0.9

84

0.9

95

0.9

88

0.9

85

0.9

91

(b) Caffe Model under Untargeted Attack

0.003

0.002

0.087

0.009

0.018

0.009

0.006

-0.004

0.006

-0.007

-0.02

0

0.02

0.04

0.06

0.08

0.1

0 1 2 3 4 5 6 7 8 9

Success Rate Difference

(c) Success Rate Difference onUntargeted FGSM Attacks

Fig. 8: Experimental Results on Untargeted Attacks

0.0

14

0

0.8

02

0.5

96

0.4

21

0.0

22

0.0

70

0.6

33

0.9

91

0.2

71

0.0

18

0

0.7

21

0.4

82

0.3

77

0.0

25

0.1

13

0.5

82

0.8

23

0.1

19

0.5

84

0

0.8

93

0.8

02

0.7

21

0.0

46

0.5

33

0.9

12

0.9

25

0.3

27

0.9

24

0

0.8

99 0.9

95

0.9

93

0.0

49

0.8

70 0.9

82

0.9

98

0.4

41

0

0.2

0.4

0.6

0.8

1

1.2

0 1 2 3 4 5 6 7 8 9

TF (TF) TF (Caffe) Caffe (TF) Caffe (Caffe)

Fig. 9: Success Rate of Crafting digit 1

E. The Impact of Adversarial BehaviorsIn this section, we generate adversarial examples inTensorflow and Caffe on MNIST with their defaultsettings and compare the effective of these adversarialexamples in terms of the success rate. We first useuntargeted FGSM to launch adversarial attack on neuralnetwork (NN) models trained by TensorFlow and Cafferespectively, measure and compare the attack successrates of these models. The parameter ε in the experiment

is set to 0.001 (recall Formula (1) in Section II.C).Figure 8a and Figure 8b show the success rates of

the ten digits for TensorFlow trained DNN model andCaffe trained DNN model respectively. It is observed thatsome digits tend to be crafted more easily into specificclasses than to other classes. For instance, consider digit5, for TensorFlow trained MNIST model, the attack cansuccessfully change its class to digit 3 with highestprobability, followed by digit 8, then digit 2 and withsmall probability to digit 9 and so forth. Similarly, forCaffe trained MNIST model, the FGSM attack has thenon-zero probability to misclassify digit 5 to all other 9classes with different probabilities, but the top 4 highestare the same as TensorFlow: 3, 8, 2, 9. More in-depthanalysis can be found in [28].

Framework TF TF Caffe Caffeparameter TF Caffe TF Caffe

average time 113min 92min 187min 134min

TABLE VIII: Average Crafting Time of Targeted Attackson MNIST

9

Page 10: Benchmarking Deep Learning Frameworks: Design ... · models by transforming the data to LMDB format, defining network architecture in the .prototxt file, defining hyperparameters

Framework third-layer Regularization 0 2 3 4 5 6 7 8 9TF (TF) 3136 → 1024 drop out 0.014 0.802 0.596 0.421 0.022 0.070 0.633 0.991 0.271

TF (Caffe) 800 → 500 drop out 0.018 0.721 0.482 0.377 0.025 0.113 0.582 0.823 0.119Caffe (TF) 3136 → 1024 weight decay 0.584 0.893 0.802 0.721 0.046 0.533 0.912 0.925 0.327

Caffe (Caffe) 800 → 500 weight decay 0.924 0.995 0.995 0.993 0.049 0.870 0.982 0.998 0.441

TABLE IX: Impact of Default Feature Maps/Regularization Methods on MNIST

Figure 8c compares the average success rate of tendigits by the difference of subtracting the success rateof TensorFlow trained MNIST model (Figure 8a) fromthe success rate of Caffe trained MNIST model (Figure8b). We make an interesting observation: in general,the success rate of generating adversarial examples inTensorFlows tained DNN model is lower than that ofCaffe. The high success rate for digit 2 shows that digits2 is the most possible class to which an untargetedadversarial example crafted by FGSM method would fallin.

We then study the impact of Jacobian-based targetedattack on the deep learning models trained on MNISTby TensorFlow and Caffe respectively. To understandwhether the size of the feature maps may impact on theattack success rate, for these two sets of experiments,we reduced the feature maps of both at the third layerby the same percentage: for TensorFlow, the featuremaps are reduced from 3136 to 1024 and for Caffe,the feature maps are condensed from 800 to 500. Wecompare the two different sizes of feature maps forboth models. It is observed that the smaller numberof feature maps tends to result in a much faster rateof crafting adversarial examples, no matter the networkmodel is trained by TensorFlow or Caffe. The result isshown in Table VIII. First, TensorFlow MNIST modelis much faster than Caffe MNIST model when all theparameters are the same (either using TF parameter:113mins v.s. 187mins or Caffe parameter: 92mins v.s.134mins). Second, the smaller amount of feature mapsaccelerates the training process even more. Caffe MNISTmodel with 500 feature maps could generate adversarialexamples two times as much as TensorFlow with 1024feature maps in the same amount of time. For MNISTdataset, the DNN model trained by TensorFlow is tosome extend more robust against both types of attacksthan the DNN model trained by Caffe. Figure 9 showsthe success rate of crafting digit 1 to other nine classes.Table IX shows the success rate result for digit 1with default regulation methods under different amountof feature maps. The notation TF(TF) means that theTensorFlow framework uses the TensorFlow defaultparameters, and TF(Caffe) means that the TensorFlowframework uses the Caffe default parameter setting.

We observe that larger number of feature maps, inmost cases, would introduce higher robustness regardlessof the frameworks. Also, TensorFlow trained modeldemonstrates higher robustness than that of Caffe. Onepossible reason is that the dropout in TensorFlow isslightly weaker regularization than the weight decay inCaffe. Such difference may affect the inductive bias ofalgorithms using one regularizer or the other. Furtherstudy on this subject refers to [28], [27].

IV. RELATED WORK AND CONCLUSION

This paper rethinks the problems of benchmarkingdeep learning software frameworks.

Several DL benchmark efforts have been putforward [8], [1], [7], [9], [3], [6], [2], [4], [5],each studied a small subset of the popular DLframeworks. However, these proposals do not examinemodel specific parameters about both neural networkstructure and hyperparameters of DL frameworks andtheir interactions with system runtime performanceparameters.

We presented a comparative study of TensorFlow,Caffe and Torch with respect to training and testingtime, learning and prediction accuracy, as well asmodel robustness against adversarial examples [29],[30], [18], [31], [32], [33], [17]. We highlight threeobservations from our in-depth experiments: (1) Thesedeep learning software frameworks are optimized withtheir default configurations settings. However, the defaultconfiguration optimized on one specific dataset maynot work effectively for other datasets. (2) The defaultconfiguration optimized for one framework to trainon a dataset may not work well when used byanother DL framework to train on the same dataset.(3) Different DL frameworks exhibit different levelsof robustness against adversarial examples. Our studydemonstrates that benchmarking deep learning softwareframeworks is significantly more challenging thantraditional performance-driven benchmarks.

10

Page 11: Benchmarking Deep Learning Frameworks: Design ... · models by transforming the data to LMDB format, defining network architecture in the .prototxt file, defining hyperparameters

REFERENCES

[1] S. Bahrampour, N. Ramakrishnan, L. Schott, and M. Shah,“Comparative study of caffe, neon, theano, and torch for deeplearning,” CoRR, vol. abs/1511.06435, 2015. [Online]. Available:http://arxiv.org/abs/1511.06435

[2] S. Shams, R. Platania, K. Lee, and S. J. Park, “Evaluation of deeplearning frameworks over different hpc architectures,” in 2017IEEE 37th International Conference on Distributed ComputingSystems (ICDCS), June 2017, pp. 1389–1396.

[3] P. Xu, S. Shi, and X. Chu, “Performance Evaluation of DeepLearning Tools in Docker Containers,” ArXiv e-prints, Nov. 2017.

[4] C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi,P. Bailis, K. Olukotun, C. Re, and M. Zaharia, “Dawnbench: Anend-to-end deep learning benchmark and competition,” in NIPSML Systems Workshop, 2017.

[5] B. Research, “Benchmarking deep learning operationson different hardware,” https://github.com/baidu-research/DeepBench, 2017, [Online; accessed 28-Jan-2018].

[6] H. Kim, H. Nam, W. Jung, and J. Lee, “Performance analysisof cnn frameworks for gpus,” in 2017 IEEE InternationalSymposium on Performance Analysis of Systems and Software(ISPASS), April 2017, pp. 55–64.

[7] S. Shi, Q. Wang, P. Xu, and X. Chu, “Benchmarkingstate-of-the-art deep learning software tools,” CoRR, vol.abs/1608.07249, 2016. [Online]. Available: http://arxiv.org/abs/1608.07249

[8] S. Kovalev, “Performance of deep learning frameworks:Caffe, deeplearning4j, tensorflow, theano, and torch,”https://www.altoros.com/performance-deep-learning-frameworks-caffe-deeplearning4j-tensorflow-theano-torch.html,2016, [Online; accessed 04-Dec-2017].

[9] S. Shi and X. Chu, “Performance Modeling and Evaluationof Distributed Deep Learning Frameworks on GPUs,” ArXive-prints, Nov. 2017.

[10] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen,C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin,S. Ghemawat, I. J. Goodfellow, A. Harp, G. Irving, M. Isard,Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg,D. Mane, R. Monga, S. Moore, D. G. Murray, C. Olah,M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar,P. A. Tucker, V. Vanhoucke, V. Vasudevan, F. B. Viegas,O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu,and X. Zheng, “Tensorflow: Large-scale machine learning onheterogeneous distributed systems,” CoRR, vol. abs/1603.04467,2016. [Online]. Available: http://arxiv.org/abs/1603.04467

[11] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long,R. Girshick, S. Guadarrama, and T. Darrell, “Caffe:Convolutional architecture for fast feature embedding,” inProceedings of the 22Nd ACM International Conferenceon Multimedia, ser. MM ’14. New York, NY,USA: ACM, 2014, pp. 675–678. [Online]. Available:http://doi.acm.org/10.1145/2647868.2654889

[12] R. Collobert and K. Kavukcuoglu, “Torch7: A matlab-likeenvironment for machine learning,” in In BigLearn, NIPSWorkshop, 2011.

[13] L. Bottou, “Stochastic gradient descent tricks,” in Neuralnetworks: Tricks of the trade. Springer, 2012, pp. 421–436.

[14] D. P. Kingma and J. Ba, “Adam: A method for stochasticoptimization,” CoRR, vol. abs/1412.6980, 2014. [Online].Available: http://arxiv.org/abs/1412.6980

[16] Y. Jia, “Caffe model zoo,” http://caffe.berkeleyvision.org/modelzoo.html, 2017, [Online; accessed 04-Dec-2017].

[15] L. Developers, “The programming language lua,” https://www.lua.org/, 2018, [Online; accessed 15-Jan-2018].

[17] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explainingand harnessing adversarial examples,” arXiv preprintarXiv:1412.6572, 2014.

[18] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik,and A. Swami, “The limitations of deep learning in adversarialsettings,” in 2016 IEEE European Symposium on Security andPrivacy (EuroS&P). IEEE, 2016, pp. 372–387.

[19] X. W. Chen and X. Lin, “Big data deep learning: Challenges andperspectives,” IEEE Access, vol. 2, pp. 514–525, 2014.

[20] O. Y. Al-Jarrah, P. D. Yoo, S. Muhaidat, G. K. Karagiannidis,and K. Taha, “Efficient machine learning for big data:A review,” Big Data Research, vol. 2, no. 3, pp. 87– 93, 2015, big Data, Analytics, and High-PerformanceComputing. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S2214579615000271

[21] T. Condie, P. Mineiro, N. Polyzotis, and M. Weimer, “Machinelearning on big data,” in 2013 IEEE 29th InternationalConference on Data Engineering (ICDE), April 2013, pp.1242–1244.

[22] D. C. Cirean, U. Meier, L. M. Gambardella, and J. Schmidhuber,“Deep, big, simple neural nets for handwritten digit recognition,”Neural Computation, vol. 22, no. 12, pp. 3207–3220, 2010,pMID: 20858131. [Online]. Available: https://doi.org/10.1162/NECO a 00052

[23] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-basedlearning applied to document recognition,” Proceedings of theIEEE, vol. 86, no. 11, pp. 2278–2324, Nov 1998.

[24] A. Krizhevsky and G. Hinton, “Learning multiple layers offeatures from tiny images,” 2009.

[25] T. Developers, “Convolutional layers-spatialconvolutionmap,”https://nn.readthedocs.io/en/rtd/convolution/index.html#nn.SpatialConvolutionMap, 2018, [Online; accessed 15-Jan-2018].

[26] T. Developters, “SpatialConvolutionMM Source Code,” https://github.com/torch/nn/blob/master/SpatialConvolutionMM.lua,2018, [Online; accessed 15-Jan-2018].

[27] Y. Wu, L. Liu, C. Pu, and W. Wei, “GIT DLBench: ABenchmark Suite for Deep Learning Frameworks: CharacterizingPerformance, Accuracy and Adversarial Robustness,” GeorgiaInstitute of Technology, School of Computer Science, Tech. Rep.,02 2018.

[28] W. Wei, L. Liu, S. Truex, L. Yu, E. Gursoy, and Y. Wu,“Demystifying Adversarial Behaviors in Deep Learning,”Georgia Institute of Technology, School of Computer Science,Tech. Rep., 02 2018.

[29] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan,I. Goodfellow, and R. Fergus, “Intriguing properties of neuralnetworks,” arXiv preprint arXiv:1312.6199, 2013.

[30] A. Nguyen, J. Yosinski, and J. Clune, “Deep neural networksare easily fooled: High confidence predictions for unrecognizableimages,” in Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition, 2015, pp. 427–436.

[31] R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner,“Detecting adversarial samples from artifacts,” arXiv preprintarXiv:1703.00410, 2017.

[32] N. Carlini and D. Wagner, “Towards evaluating the robustnessof neural networks,” in 2017 IEEE Symposium on Security andPrivacy (SP). IEEE, 2017, pp. 39–57.

[33] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu,“Towards deep learning models resistant to adversarial attacks,”arXiv preprint arXiv:1706.06083, 2017.

11