
TinyTL: Reduce Memory, Not Parameters for Efficient On-Device Learning

Han Cai¹, Chuang Gan², Ligeng Zhu¹, Song Han¹
¹Massachusetts Institute of Technology, ²MIT-IBM Watson AI Lab

http://tinyml.mit.edu/

Abstract

On-device learning enables edge devices to continually adapt AI models to new data, which requires a small memory footprint to fit the tight memory constraints of edge devices. Existing work addresses this problem by reducing the number of trainable parameters. However, this does not directly translate to memory savings, since the major bottleneck is the activations, not the parameters. In this work, we present Tiny-Transfer-Learning (TinyTL) for memory-efficient on-device learning. TinyTL freezes the weights and learns only the bias modules, so there is no need to store the intermediate activations. To maintain the adaptation capacity, we introduce a new memory-efficient bias module, the lite residual module, which refines the feature extractor by learning small residual feature maps while adding only 3.8% memory overhead. Extensive experiments show that TinyTL significantly reduces memory (by up to 6.5×) with little accuracy loss compared to fine-tuning the full network. Compared to fine-tuning the last layer, TinyTL provides significant accuracy improvements (up to 34.1%) with little memory overhead. Furthermore, combined with feature extractor adaptation, TinyTL provides 7.3-12.9× memory saving without sacrificing accuracy compared to fine-tuning the full Inception-V3.

1 Introduction

Intelligent edge devices with rich sensors (e.g., billions of mobile phones and IoT devices)¹ have become ubiquitous in our daily lives. These devices keep collecting new and sensitive data through their sensors every day while being expected to provide high-quality and customized services without sacrificing privacy². This poses new challenges to efficient AI systems, which must not only run inference but also continually fine-tune pre-trained models on newly collected data (i.e., on-device learning).

Though on-device learning can enable many appealing applications, it is an extremely challenging problem. First, edge devices are memory-constrained. For example, a Raspberry Pi 1 Model A has only 256MB of memory, which is sufficient for inference but far from sufficient for training (Figure 1 left), even with a lightweight neural network architecture (MobileNetV2 [1]). Furthermore, the memory is shared by various on-device applications (e.g., other deep learning models) and the operating system. A single application may be allocated only a small fraction of the total memory, which makes this challenge more critical. Second, edge devices are energy-constrained. DRAM access consumes two orders of magnitude more energy than on-chip SRAM access. The large memory footprint of activations cannot fit into the limited on-chip SRAM, so the activations have to be accessed from DRAM. For instance, the training memory of MobileNetV2, under batch size 16, is close to 1GB, which is far larger than the SRAM size of an AMD EPYC CPU³ (Figure 1 left), not to mention lower-end

¹ https://www.statista.com/statistics/330695/number-of-smartphone-users-worldwide/
² https://ec.europa.eu/info/law/law-topic/data-protection_en
³ https://www.amd.com/en/products/cpu/amd-epyc-7302

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

arXiv:2007.11622v4 [cs.CV] 8 Jan 2021


[Figure 1 graphics. Left: MobileNetV2 memory footprint (MB) versus batch size (1, 2, 4, 8, 16) for training and inference, with reference levels for TPU SRAM (28MB) and Raspberry Pi 1 Model A DRAM (256MB); the inference footprint at batch size 1 is about 20MB. Inset: energy (pJ) of 32-bit operations: float mult 3.7 pJ, SRAM access 5 pJ, DRAM access 640 pJ (128× more expensive than SRAM access). Right: Param (MB) and Activation (MB) for ResNet-50 (102, 1414.4) and MbV2-1.4 (24, 1252.8); parameters are 4.3× smaller but activations only 1.1× smaller. Underlying data (MB): MobileNetV2 parameters 14.02, activations 54.80 per sample, training memory 68.82 at batch size 1 and 890.82 at batch size 16, inference memory 19.62; ResNet-50 parameters 102.23, activations 88.4 per sample, training memory 190.63 at batch size 1 and 1516.63 at batch size 16, inference memory 108.65.]

Figure 1: Left: The memory footprint required by training is much larger than that of inference. Right: Memory cost comparison between ResNet-50 and MobileNetV2-1.4 under batch size 16. Recent advances in efficient model design only reduce the size of parameters, but the activation size, which is the main bottleneck for training, does not improve much.

edge platforms. If the training memory can fit in on-chip SRAM, it will drastically improve the speed and energy efficiency.
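The growth of training memory with batch size can be sketched with a simple linear model. The snippet below is only an illustration, assuming that the footprint is the parameter size plus a per-sample activation size times the batch size; the MobileNetV2 numbers are the values recorded in the data behind Figure 1, and the function names are hypothetical. It ignores memory taken by the operating system and other applications.

```python
# Values (in MB) transcribed from the data behind Figure 1.
MOBILENETV2 = {"param_mb": 14.02, "act_mb_per_sample": 54.80}
RASPBERRY_PI_1_DRAM_MB = 256


def training_memory_mb(model, batch_size):
    """Assumed model: parameters plus activations that scale linearly with batch size."""
    return model["param_mb"] + model["act_mb_per_sample"] * batch_size


def largest_batch_that_fits(model, budget_mb, candidates=(1, 2, 4, 8, 16)):
    """Return the largest candidate batch size whose footprint fits, or None."""
    fitting = [bs for bs in candidates if training_memory_mb(model, bs) <= budget_mb]
    return max(fitting) if fitting else None


for bs in (1, 2, 4, 8, 16):
    print(f"batch size {bs:2d}: {training_memory_mb(MOBILENETV2, bs):7.2f} MB")

print("largest batch within 256 MB:",
      largest_batch_that_fits(MOBILENETV2, RASPBERRY_PI_1_DRAM_MB))
```

Under this model, batch size 16 needs about 890.82MB, which matches the "close to 1GB" figure quoted above.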

There are plenty of efficient inference techniques that reduce the number of trainable parameters and the computation FLOPs [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. However, parameter-efficient or FLOPs-efficient techniques do not directly save training memory. It is the activations that bottleneck the training memory, not the parameters. For example, Figure 1 (right) compares ResNet-50 and MobileNetV2-1.4. In terms of parameter size, MobileNetV2-1.4 is 4.3× smaller than ResNet-50. However, in terms of training activation size, MobileNetV2-1.4 is almost the same as ResNet-50 (only 1.1× smaller), leading to little memory reduction. It is essential to reduce the size of the intermediate activations required by back-propagation, which is the key memory bottleneck for efficient on-device training.
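The comparison can be checked with a few lines of arithmetic. This is a sketch using the rounded values shown in Figure 1 (right) under batch size 16; because those values are rounded, the parameter ratio comes out as 4.25 rather than exactly the 4.3× reported in the text.

```python
# Sizes in MB under batch size 16, as shown in Figure 1 (right).
resnet50 = {"param_mb": 102.0, "activation_mb": 1414.4}
mbv2_14 = {"param_mb": 24.0, "activation_mb": 1252.8}

# How much smaller MobileNetV2-1.4 is than ResNet-50 on each axis.
param_ratio = resnet50["param_mb"] / mbv2_14["param_mb"]
act_ratio = resnet50["activation_mb"] / mbv2_14["activation_mb"]


def activation_share(model):
    """Fraction of (parameters + activations) taken up by activations."""
    return model["activation_mb"] / (model["param_mb"] + model["activation_mb"])


print(f"parameters: {param_ratio:.2f}x smaller")
print(f"activations: {act_ratio:.2f}x smaller")
print(f"activation share, ResNet-50: {activation_share(resnet50):.1%}")
print(f"activation share, MbV2-1.4: {activation_share(mbv2_14):.1%}")
```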

In this paper, we propose Tiny-Transfer-Learning (TinyTL) to address these challenges. By analyzing the memory footprint during the backward pass, we notice that the intermediate activations (the main bottleneck) are only needed when updating the weights, not the biases (Eq. 2). Inspired by this finding, we propose to freeze the weights of the pre-trained feature extractor and only update the biases to reduce the memory footprint (Figure 2b). To compensate for the capacity loss, we introduce a memory-efficient bias module, called the lite residual module, which improves the model capacity by refining the intermediate feature maps of the feature extractor (Figure 2c). Meanwhile, we aggressively shrink the resolution and width of the lite residual module to keep its memory overhead small (only 3.8%). Extensive experiments on 9 image classification datasets with the same pre-trained model (ProxylessNAS-Mobile [11]) demonstrate the effectiveness of TinyTL compared to previous transfer learning methods. Further, combined with a pre-trained once-for-all network [10], TinyTL can select a specialized sub-network as the feature extractor for each transfer dataset (i.e., feature extractor adaptation): given a more difficult dataset, a larger sub-network is selected, and vice versa. TinyTL achieves the same level of (or even higher) accuracy compared to fine-tuning the full Inception-V3 while reducing the training memory footprint by up to 12.9×. Our contributions can be summarized as follows:

• We propose TinyTL, a novel transfer learning method that reduces the training memory footprint by an order of magnitude for efficient on-device learning. We systematically analyze the memory of training and find that the bottleneck comes from updating the weights, not the biases (assuming ReLU activation).

• We also introduce the lite residual module, a memory-efficient bias module that improves the model capacity with little memory overhead.

• Extensive experiments on transfer learning tasks show that our method is highly memory-efficient and effective. It reduces the training memory footprint by up to 12.9× without sacrificing accuracy.

2 Related Work

Efficient Inference Techniques. Improving the inference efficiency of deep neural networks on resource-constrained edge devices has recently drawn extensive attention. Starting from [4, 5, 12, 13, 14], one line of research focuses on compressing pre-trained neural networks, including i) network pruning that removes less-important units [4, 15] or channels [16, 17]; ii) network quantization that reduces the bitwidth of parameters [5, 18] or activations [19, 20]. However, these techniques cannot handle the training phase, as they rely on a well-trained model on the target task as the starting point.

Another line of research focuses on lightweight neural architectures obtained by either manual design [1, 2, 3, 21, 22] or neural architecture search [6, 8, 11, 23]. These lightweight neural networks provide highly competitive accuracy [10, 24] while significantly improving inference efficiency. However, concerning training memory efficiency, the key bottleneck is not solved: the training memory is dominated by activations, not parameters (Figure 1).

There are also some non-deep learning methods [25, 26, 27] that are designed for efficient inference on edge devices. These methods are suitable for handling simple tasks like MNIST. However, for more complicated tasks, we still need the representation capacity of deep neural networks.

Memory Footprint Reduction. Researchers have been seeking ways to reduce the training memory footprint. One typical approach is to re-compute discarded activations during the backward pass [28, 29]. This approach reduces memory usage at the cost of a large computation overhead, and is thus not preferred for edge devices. Layer-wise training [30] can also reduce the memory footprint compared to end-to-end training. However, it cannot achieve the same level of accuracy as end-to-end training. Another representative approach is activation pruning [31], which builds a dynamic sparse computation graph to prune activations during training. Similarly, [32] proposes to reduce the bitwidth of training activations by introducing new reduced-precision floating-point formats. Besides reducing the training memory cost, there are some techniques that focus on reducing the peak inference memory cost, such as RNNPool [33] and MemNet [34]. Our method is orthogonal to these techniques and can be combined with them to further reduce the memory footprint.

Transfer Learning. Neural networks pre-trained on large-scale datasets (e.g., ImageNet [35]) are widely used as fixed feature extractors for transfer learning, in which case only the last layer needs to be fine-tuned [36, 37, 38, 39]. This approach does not require storing the intermediate activations of the feature extractor, and is thus memory-efficient. However, the capacity of this approach is limited, resulting in poor accuracy, especially on datasets [40, 41] whose distribution is far from ImageNet (e.g., only 45.9% Aircraft top1 accuracy achieved by Inception-V3 [42]). Alternatively, fine-tuning the full network can achieve better accuracy [43, 44], but it requires a vast memory footprint and hence is not friendly to training on edge devices. Recently, [45, 46] propose to update only the parameters of the batch normalization (BN) [47] layers, which greatly reduces the number of trainable parameters. Unfortunately, parameter efficiency does not translate to memory efficiency: this approach still requires a large amount of memory (e.g., 326MB under batch size 8) to store the input activations of the BN layers (Table 3). Additionally, its accuracy is still much worse than that of fine-tuning the full network (70.7% vs. 85.5%; Table 3). One can also fine-tune only some of the layers, but how many layers to select is still ad hoc. This paper provides a systematic approach to save memory without losing accuracy.

3 Tiny Transfer Learning

3.1 Understanding the Memory Footprint of Back-propagation

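Without loss of generality, we consider a neural network $\mathcal{M}$ that consists of a sequence of layers:

$$\mathcal{M}(\cdot) = \mathcal{F}_{w_n}(\mathcal{F}_{w_{n-1}}(\cdots \mathcal{F}_{w_2}(\mathcal{F}_{w_1}(\cdot)) \cdots)), \tag{1}$$

where $w_i$ denotes the parameters of the $i$-th layer. Let $a_i$ and $a_{i+1}$ be the input and output activations of the $i$-th layer, respectively, and $\mathcal{L}$ be the loss. In the backward pass, given $\frac{\partial \mathcal{L}}{\partial a_{i+1}}$, there are two goals for the $i$-th layer: computing $\frac{\partial \mathcal{L}}{\partial a_i}$ and $\frac{\partial \mathcal{L}}{\partial w_i}$.

Assuming the $i$-th layer is a linear layer whose forward process is given as $a_{i+1} = a_i W + b$, then its backward process under batch size 1 is

$$\frac{\partial \mathcal{L}}{\partial a_i} = \frac{\partial \mathcal{L}}{\partial a_{i+1}} \frac{\partial a_{i+1}}{\partial a_i} = \frac{\partial \mathcal{L}}{\partial a_{i+1}} W^T, \quad \frac{\partial \mathcal{L}}{\partial W} = a_i^T \frac{\partial \mathcal{L}}{\partial a_{i+1}}, \quad \frac{\partial \mathcal{L}}{\partial b} = \frac{\partial \mathcal{L}}{\partial a_{i+1}}. \tag{2}$$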

[Figure 2 graphics. Panels: (a) Fine-tune the full network (conventional); (b) Fine-tune bias only; (c) Lite residual learning; (d) Feature network adaptation (train a once-for-all network, then derive specialized sub-networks, e.g., for Aircraft, Cars, Flowers). Legend: fmap in memory, fmap not in memory, learnable params, fixed params, weight, bias, i-th mobile inverted bottleneck block. The main branch consists of 1x1 Conv, Depth-wise Conv, 1x1 Conv with widths C, 6C, 6C, C at resolution R (little computation but large activation); the lite residual branch consists of Downsample, Group Conv, 1x1 Conv, Upsample with width C at resolution 0.5R, avoiding the inverted bottleneck to keep activations small while using group conv to increase the arithmetic intensity.]

Figure 2: TinyTL overview (“C” denotes the width and “R” denotes the resolution). Conventional transfer learning relies on fine-tuning the weights to adapt the model (Fig. a), which requires a large amount of activation memory (in blue) for back-propagation. TinyTL reduces the memory usage by fixing the weights (Fig. b) while only fine-tuning the bias. (Fig. c) exploits lite residual learning to compensate for the capacity loss, using group convolution and avoiding the inverted bottleneck to achieve high arithmetic intensity and a small memory footprint. The skip connection remains unchanged (omitted for simplicity).

According to Eq. (2), the intermediate activations (i.e., $\{a_i\}$) that dominate the memory footprint are only required to compute the gradient of the weights (i.e., $\frac{\partial \mathcal{L}}{\partial W}$), not the bias. If we only update the bias, a large amount of training memory can be saved. This property also applies to convolution layers and normalization layers (e.g., batch normalization [47], group normalization [48], etc.), since they can be considered special types of linear layers.

Regarding non-linear activation layers (e.g., ReLU, sigmoid, h-swish), sigmoid and h-swish need to store $a_i$ to compute $\frac{\partial \mathcal{L}}{\partial a_i}$ (Table 1), hence they are not memory-efficient. Consequently, activation layers that build upon them, such as tanh and swish [49], are not memory-efficient either. In contrast, ReLU and other ReLU-styled activation layers (e.g., LeakyReLU [50]) only need to store a binary mask representing whether the value is smaller than 0, which is 32× smaller than storing $a_i$.
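The asymmetry in Eq. (2) can be made concrete with a small sketch of the backward pass of a single linear layer under batch size 1. This is an illustrative pure-Python version with hypothetical function names, not the authors' implementation: the bias-only path never reads the stored input activation, while the weight gradient cannot be formed without it.

```python
def matvec_transposed(g, W):
    """Compute g @ W^T for a row vector g (length n_out) and W of shape (n_in, n_out)."""
    return [sum(g[j] * W[i][j] for j in range(len(g))) for i in range(len(W))]


def backward_bias_only(g, W):
    """Backward pass with frozen W: the input activation a_i is never needed."""
    grad_b = list(g)                  # dL/db = dL/da_{i+1}
    grad_a = matvec_transposed(g, W)  # dL/da_i = dL/da_{i+1} W^T
    return grad_a, grad_b


def backward_full(a, g, W):
    """Backward pass with trainable W: a_i must have been kept in memory."""
    grad_a, grad_b = backward_bias_only(g, W)
    grad_W = [[a_k * g_j for g_j in g] for a_k in a]  # dL/dW = a_i^T dL/da_{i+1}
    return grad_a, grad_W, grad_b


a = [1.0, 2.0]                             # input activation a_i, shape (1, 2)
W = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.5]]    # weights, shape (2, 3)
g = [0.1, -0.2, 0.3]                       # upstream gradient dL/da_{i+1}, shape (1, 3)

grad_a, grad_b = backward_bias_only(g, W)
_, grad_W, _ = backward_full(a, g, W)
print("dL/da_i:", grad_a)
print("dL/db:  ", grad_b)
print("dL/dW:  ", grad_W)
```

Note that `backward_bias_only` takes no `a` argument at all, which is the memory saving that bias-only fine-tuning relies on.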

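Table 1: Detailed forward and backward processes of non-linear activation layers. $\lvert a_i \rvert$ denotes the number of elements of $a_i$. “$\circ$” denotes the element-wise product. $(1_{a_i \geq 0})_j = 0$ if $(a_i)_j < 0$ and $(1_{a_i \geq 0})_j = 1$ otherwise. $\mathrm{ReLU6}(a_i) = \min(6, \max(0, a_i))$.

| Layer Type | Forward | Backward | Memory Cost |
|---|---|---|---|
| ReLU | $a_{i+1} = \max(0, a_i)$ | $\frac{\partial \mathcal{L}}{\partial a_i} = \frac{\partial \mathcal{L}}{\partial a_{i+1}} \circ 1_{a_i \geq 0}$ | $\lvert a_i \rvert$ bits |
| sigmoid | $a_{i+1} = \sigma(a_i) = \frac{1}{1 + \exp(-a_i)}$ | $\frac{\partial \mathcal{L}}{\partial a_i} = \frac{\partial \mathcal{L}}{\partial a_{i+1}} \circ \sigma(a_i) \circ (1 - \sigma(a_i))$ | $32 \lvert a_i \rvert$ bits |
| h-swish [7] | $a_{i+1} = a_i \circ \frac{\mathrm{ReLU6}(a_i + 3)}{6}$ | $\frac{\partial \mathcal{L}}{\partial a_i} = \frac{\partial \mathcal{L}}{\partial a_{i+1}} \circ \left( \frac{\mathrm{ReLU6}(a_i + 3)}{6} + a_i \circ \frac{1_{-3 \leq a_i \leq 3}}{6} \right)$ | $32 \lvert a_i \rvert$ bits |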
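The difference in what each activation layer must keep for the backward pass can be sketched as follows. This is a minimal illustration with hypothetical helper names, assuming 32-bit floating-point activations: ReLU keeps one boolean per element, whereas sigmoid needs the full-precision input.

```python
import math


def relu_forward(a):
    """Return the output and the 1-bit-per-element mask kept for the backward pass."""
    out = [max(0.0, x) for x in a]
    mask = [x >= 0 for x in a]  # (1_{a >= 0})_j, as defined for Table 1
    return out, mask


def relu_backward(g, mask):
    """dL/da_i = dL/da_{i+1} * mask; the original input is not needed."""
    return [g_j if m else 0.0 for g_j, m in zip(g, mask)]


def sigmoid_backward(g, a):
    """dL/da_i = dL/da_{i+1} * s * (1 - s); requires the full-precision input a_i."""
    s = [1.0 / (1.0 + math.exp(-x)) for x in a]
    return [g_j * s_j * (1.0 - s_j) for g_j, s_j in zip(g, s)]


def saved_bits(num_elements, layer):
    """Bits stored for backward: 1 per element for ReLU, 32 otherwise (fp32 assumed)."""
    return num_elements * (1 if layer == "relu" else 32)


a = [-1.0, 0.0, 2.0]
g = [1.0, 1.0, 1.0]
_, mask = relu_forward(a)
print("ReLU grad:   ", relu_backward(g, mask))
print("sigmoid grad:", sigmoid_backward(g, a))
print("memory ratio:", saved_bits(len(a), "sigmoid") // saved_bits(len(a), "relu"))
```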

3.2 Lite Residual Learning

Based on the memory footprint analysis, one possible way to reduce the memory cost is to freeze the weights of the pre-trained feature extractor and update only the biases (Figure 2b). However, only updating biases has limited adaptation capacity. Therefore, we introduce lite residual learning, which exploits a new class of generalized memory-efficient bias modules to refine the intermediate feature maps (Figure 2c).


Formally, a layer with frozen weights and learnable biases can be represented as:

$$a_{i+1} = \mathcal{F}_W(a_i) + b. \tag{3}$$

To improve the model capacity while keeping a small memory footprint, we propose to add a lite residual module that generates a residual feature map to refine the output:

$$a_{i+1} = \mathcal{F}_W(a_i) + b + \mathcal{F}_{w_r}(a'_i = \mathrm{reduce}(a_i)), \tag{4}$$

where $a'_i = \mathrm{reduce}(a_i)$ is the reduced activation. According to Eq. (2), learning these lite residual modules only requires storing the reduced activations $\{a'_i\}$ rather than the full activations $\{a_i\}$.

Implementation (Figure 2c). We apply Eq. (4) to mobile inverted bottleneck blocks (MB-block) [1]. The key principle is to keep the activation small. Following this principle, we explore two design dimensions to reduce the activation size:

• Width. The widely-used inverted bottleneck requires a huge number of channels (6×) to compensate for the small capacity of a depthwise convolution, which is parameter-efficient but highly activation-inefficient. Even worse, converting 1× channels to 6× channels back and forth requires two 1×1 projection layers, which doubles the total activation to 12×. Depthwise convolution also has a very low arithmetic intensity (with 256 channels, its OPs/Byte is less than 4% of that of a 1×1 convolution), and is thus highly memory-inefficient with little reuse. To address these limitations, our lite residual module employs group convolution, which has much higher arithmetic intensity than depthwise convolution, providing a good trade-off between FLOPs and memory. This also removes the 1×1 projection layer, reducing the total channel number by (6×2+1)/(1+1) = 6.5×.

• Resolution. The activation size grows quadratically with the resolution. Therefore, we shrink the resolution in the lite residual module by employing a 2×2 average pooling to downsample the input feature map. The output of the lite residual module is then upsampled to match the size of the main branch’s output feature map via bilinear upsampling. Combining the resolution and width optimizations, the activation of our lite residual module is roughly 2² × 6.5 = 26× smaller than that of the inverted bottleneck.
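The two reduction factors above combine multiplicatively, which a short calculation makes explicit. This sketch assumes one reading of the width formula: the inverted bottleneck holds activations of C + 6C + 6C channels (the "6×2+1" term) while the lite residual module holds C + C (the "1+1" term); the variable names are illustrative.

```python
EXPANSION_RATIO = 6   # channel expansion of the inverted bottleneck
DOWNSAMPLE = 2        # 2x2 average pooling in the lite residual module

# Width: channel multiples of C whose activations are held in each design.
inverted_bottleneck_channels = EXPANSION_RATIO * 2 + 1  # C + 6C + 6C = 13C
lite_residual_channels = 1 + 1                          # C + C = 2C
width_reduction = inverted_bottleneck_channels / lite_residual_channels

# Resolution: activation size scales with height x width.
resolution_reduction = DOWNSAMPLE ** 2

total_reduction = width_reduction * resolution_reduction
print(f"width: {width_reduction}x, resolution: {resolution_reduction}x, "
      f"total: {total_reduction}x")
```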

3.3 Discussions

Normalization Layers. As discussed in Section 3.1, TinyTL flexibly supports different normalization layers, including batch normalization (BN), group normalization (GN), layer normalization (LN), and so on. In particular, BN is the most widely used one in vision tasks. However, BN requires a large batch size to accurately estimate running statistics during training, which is not suitable for on-device learning, where we want a small training batch size to reduce the memory footprint. Moreover, the data may arrive in a streaming fashion in on-device learning, which requires a training batch size of 1. In contrast to BN, GN can handle a small training batch size, as the statistics in GN are computed independently for each input. In our experiments, GN with a small training batch size (e.g., 8) performs slightly worse than BN with a large training batch size (e.g., 256). However, as we target on-device learning, we choose GN in our models.
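Why GN tolerates a batch size of 1 can be seen from a minimal sketch of its normalization step. The function below is an illustrative pure-Python version that omits the learnable scale and shift; it operates on a single sample, so no statistic ever depends on other inputs in the batch.

```python
import math


def group_norm(sample, num_groups, eps=1e-5):
    """Normalize one sample, given as a list of channels (each a list of values).

    Mean and variance are computed per group of channels within this sample only.
    """
    channels_per_group = len(sample) // num_groups
    normalized = []
    for g in range(num_groups):
        group = sample[g * channels_per_group:(g + 1) * channels_per_group]
        values = [v for channel in group for v in channel]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        inv_std = 1.0 / math.sqrt(var + eps)
        normalized.extend([[(v - mean) * inv_std for v in channel] for channel in group])
    return normalized


# One sample with 4 channels of 2 values each, split into 2 groups.
sample = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
for channel in group_norm(sample, num_groups=2):
    print([round(v, 3) for v in channel])
```

A BN layer, by contrast, would need statistics gathered across the batch dimension, which is what breaks down when the batch contains a single streamed sample.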

Feature Extractor Adaptation. TinyTL can be applied to different backbone neural networks, such as MobileNetV2 [1], ProxylessNASNets [11], EfficientNets [24], etc. However, since the weights of the feature extractor are frozen in TinyTL, we find that using the same backbone neural network for all transfer tasks is sub-optimal. Therefore, we choose the backbone of TinyTL using a pre-trained once-for-all network [10] to adaptively select the specialized feature extractor that best fits the target transfer dataset. Specifically, a once-for-all network is a special kind of neural network that is sparsely activated, from which many different sub-networks can be derived without retraining by sparsely activating parts of the model according to the architecture configuration (i.e., depth, width, kernel size, resolution), while the weights are shared. This allows us to efficiently evaluate the effectiveness of a backbone neural network on the target transfer dataset without the expensive pre-training process. Further details of the feature extractor adaptation process are provided in Appendix A.

Table 2: Comparison between TinyTL and conventional transfer learning methods (training memory footprint is calculated assuming the batch size is 8 and the classifier head for Flowers is used). For object classification datasets, we report the top1 accuracy (%), while for CelebA we report the average top1 accuracy (%) over 40 facial attribute classification tasks. ‘B’ represents Bias while ‘L’ represents LiteResidual. FT-Last means only the last layer is fine-tuned. FT-Norm+Last means the normalization layers and the last layer are fine-tuned. FT-Full means the full network is fine-tuned. The backbone neural network is ProxylessNAS-Mobile, and the resolution is 224 except for ‘TinyTL-L+B@320’, whose resolution is 320. TinyTL consistently outperforms FT-Last and FT-Norm+Last by a large margin with a similar or lower training memory footprint. By increasing the resolution to 320, TinyTL can reach the same level of accuracy as FT-Full while being 6× more memory-efficient.

| Method | Train. Mem. | Flowers | Cars | CUB | Food | Pets | Aircraft | CIFAR10 | CIFAR100 | CelebA |
|---|---|---|---|---|---|---|---|---|---|---|
| FT-Last | 31MB | 90.1 | 50.9 | 73.3 | 68.7 | 91.3 | 44.9 | 85.9 | 68.8 | 88.7 |
| TinyTL-B | 32MB | 93.5 | 73.4 | 75.3 | 75.5 | 92.1 | 63.2 | 93.7 | 78.8 | 90.4 |
| TinyTL-L | 37MB | 95.3 | 84.2 | 76.8 | 79.2 | 91.7 | 76.4 | 96.1 | 80.9 | 91.2 |
| TinyTL-L+B | 37MB | 95.5 | 85.0 | 77.1 | 79.7 | 91.8 | 75.4 | 95.9 | 81.4 | 91.2 |
| TinyTL-L+B@320 | 65MB | 96.8 | 88.8 | 81.0 | 82.9 | 92.9 | 82.3 | 96.1 | 81.5 | - |
| FT-Norm+Last | 192MB | 94.3 | 77.9 | 76.3 | 77.0 | 92.2 | 68.1 | 94.8 | 80.2 | 90.4 |
| FT-Full | 391MB | 96.8 | 90.2 | 81.0 | 84.6 | 93.0 | 86.0 | 97.1 | 84.1 | 91.4 |

4 Experiments

4.1 Setups

Datasets. Following common practice [43, 44, 45], we use ImageNet [35] as the pre-training dataset, and then transfer the models to 8 downstream object classification tasks, including Cars [41], Flowers [51], Aircraft [40], CUB [52], Pets [53], Food [54], CIFAR10 [55], and CIFAR100 [55]. Besides object classification, we also evaluate TinyTL on human facial attribute classification tasks, where CelebA [56] is the transfer dataset and VGGFace2 [57] is the pre-training dataset.

Model Architecture. To justify the effectiveness of TinyTL, we first apply TinyTL and previous transfer learning methods to the same backbone neural network, ProxylessNAS-Mobile [11]. For each MB-block in ProxylessNAS-Mobile, we insert a lite residual module as described in Section 3.2 and Figure 2 (c). The group number is 2, and the kernel size is 5. We use the ReLU activation since it is more memory-efficient according to Section 3.1. We replace all BN layers with GN layers to better support small training batch sizes. We set the number of channels per group to 8 for all GN layers. Following [58], we apply weight standardization [59] to convolution layers that are followed by GN.

For feature extractor adaptation, we build the once-for-all network using the MobileNetV2 design space [10, 11], which contains five stages with a gradually decreasing resolution, and each stage consists of a sequence of MB-blocks. At the stage level, it supports elastic depth (i.e., 2, 3, 4). At the block level, it supports elastic kernel size (i.e., 3, 5, 7) and elastic width expansion ratio (i.e., 3, 4, 6). Similarly, for each MB-block in the once-for-all network, we insert a lite residual module that supports elastic group number (i.e., 2, 4) and elastic kernel size (i.e., 3, 5).

Training Details. We freeze the memory-heavy modules (weights of the feature extractor) and only update memory-efficient modules (bias, lite residual, classifier head) during transfer learning. The models are fine-tuned for 50 epochs using the Adam optimizer [60] with batch size 8 on a single GPU. The initial learning rate is tuned for each dataset, while a cosine schedule [61] is adopted for learning rate decay. We apply 8-bit weight quantization [5] to the frozen weights to reduce the parameter size, which causes a negligible accuracy drop in our experiments. For all compared methods, we also assume that 8-bit weight quantization is applied where eligible when calculating their training memory footprint. Additionally, as PyTorch does not support explicit fine-grained memory management, we use the theoretically calculated training memory footprint for comparison in our experiments. For simplicity, we assume the batch size is 8 for all compared methods throughout the experiment section.
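To give a feel for this design space, the sketch below draws one random sub-network configuration from the elastic dimensions listed above. It is only an illustration: the dictionary keys and the `sample_subnet` helper are hypothetical, and uniform random sampling is an assumption made here, not the selection procedure of Appendix A.

```python
import random

NUM_STAGES = 5
DEPTHS = (2, 3, 4)             # elastic depth per stage
KERNEL_SIZES = (3, 5, 7)       # elastic kernel size per MB-block
EXPAND_RATIOS = (3, 4, 6)      # elastic width expansion ratio per MB-block
LITE_GROUPS = (2, 4)           # elastic group number of the lite residual module
LITE_KERNEL_SIZES = (3, 5)     # elastic kernel size of the lite residual module


def sample_subnet(rng):
    """Draw one sub-network configuration uniformly at random (illustrative only)."""
    stages = []
    for _ in range(NUM_STAGES):
        depth = rng.choice(DEPTHS)
        stages.append([
            {
                "kernel_size": rng.choice(KERNEL_SIZES),
                "expand_ratio": rng.choice(EXPAND_RATIOS),
                "lite_groups": rng.choice(LITE_GROUPS),
                "lite_kernel_size": rng.choice(LITE_KERNEL_SIZES),
            }
            for _ in range(depth)
        ])
    return stages


config = sample_subnet(random.Random(0))
print("blocks per stage:", [len(stage) for stage in config])
print("first block:", config[0][0])
```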
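The cosine learning rate decay mentioned above can be sketched as follows. This is a minimal version assuming a single decay cycle without warm restarts and a final learning rate of zero; the base learning rate of 1e-3 is a placeholder, since the actual value is tuned per dataset.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr (at step 0) to min_lr (at total_steps)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


EPOCHS = 50       # as stated in the training details
BASE_LR = 1e-3    # placeholder value (assumption)

for epoch in (0, 10, 25, 40, 50):
    print(f"epoch {epoch:2d}: lr = {cosine_lr(epoch, EPOCHS, BASE_LR):.6f}")
```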


[Figure 3 graphics. Eight panels (Cars, Aircraft, Flowers, CUB, Food, Pets, CIFAR10, CIFAR100) plotting top1 accuracy (%) against Training Memory (MB) for TinyTL (LiteResidual+Bias), TinyTL (Bias), FT-Norm+Last, FT-Last, and FT-Full at input resolutions between 96 and 448. Annotated memory savings include 292MB vs. 45MB (6.5×), 209MB vs. 45MB (4.6×), 292MB vs. 65MB (4.5×), and 391MB vs. 65MB (6.0×).]

Figure 3: Top1 accuracy results of different transfer learning methods under varied resolutions using the same pre-trained neural network (ProxylessNAS-Mobile). With the same level of accuracy, TinyTL achieves 3.9-6.5× memory saving compared to fine-tuning the full network.

4.2 Main Results

Effectiveness of TinyTL. Table 2 reports the comparison between TinyTL and previous transferlearning methods including: i) fine-tuning the last linear layer [36, 37, 39] (referred to as FT-Last);ii) fine-tuning the normalization layers (e.g., BN, GN) and the last linear layer [42] (referred to asFT-Norm+Last) ; iii) fine-tuning the full network [43, 44] (referred to as FT-Full). We also studyseveral variants of TinyTL including: i) TinyTL-B that fine-tunes biases and the last linear layer;ii) TinyTL-L that fine-tunes lite residual modules and the last linear layer; iii) TinyTL-L+B thatfine-tunes lite residual modules, biases, and the last linear layer. All compared methods use the samepre-trained model but fine-tune different parts of the model as discussed above. We report the averageaccuracy across five runs.

Compared to FT-Last, TinyTL maintains a similar training memory footprint while improving the top1accuracy by a significant margin. In particular, TinyTL-L+B improves the top1 accuracy by 34.1% onCars, by 30.5% on Aircraft, by 12.6% on CIFAR100, by 11.0% on Food, etc. It shows the improvedadaptation capacity of our method over FT-Last. Compared to FT-Norm+Last, TinyTL-L+B improvesthe training memory efficiency by 5.2× while providing up to 7.3% higher top1 accuracy, whichshows that our method is not only more memory-efficient but also more effective than FT-Norm+Last.Compared to FT-Full, TinyTL-L+B@320 can achieve the same level of accuracy while providing6.0× training memory saving.Regarding the comparison between different variants of TinyTL, both TinyTL-L and TinyTL-L+Bhave clearly better accuracy than TinyTL-B while incurring little memory overhead. It shows that thelite residual modules are essential in TinyTL. Besides, we find that TinyTL-L+B is slightly betterthan TinyTL-L on most of the datasets while maintaining the same memory footprint. Therefore, wechoose TinyTL-L+B as the default.
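The gains and ratios quoted above follow directly from Table 2. The snippet below is a small sanity check over a subset of the table's entries (memory in MB, top1 accuracy in %); the helper names are illustrative.

```python
# Subset of Table 2: training memory (MB) and top1 accuracy (%).
TABLE2 = {
    "FT-Last":        {"mem_mb": 31,  "Cars": 50.9, "Aircraft": 44.9, "CIFAR100": 68.8, "Food": 68.7},
    "TinyTL-L+B":     {"mem_mb": 37,  "Cars": 85.0, "Aircraft": 75.4, "CIFAR100": 81.4, "Food": 79.7},
    "TinyTL-L+B@320": {"mem_mb": 65},
    "FT-Norm+Last":   {"mem_mb": 192, "Cars": 77.9, "Aircraft": 68.1, "CIFAR100": 80.2, "Food": 77.0},
    "FT-Full":        {"mem_mb": 391},
}


def accuracy_gain(method, baseline, dataset):
    """Absolute top1 accuracy improvement of method over baseline, in points."""
    return round(TABLE2[method][dataset] - TABLE2[baseline][dataset], 1)


def memory_saving(baseline, method):
    """How many times less training memory method needs than baseline."""
    return TABLE2[baseline]["mem_mb"] / TABLE2[method]["mem_mb"]


for dataset in ("Cars", "Aircraft", "CIFAR100", "Food"):
    print(f"{dataset}: +{accuracy_gain('TinyTL-L+B', 'FT-Last', dataset)} over FT-Last")

print(f"vs FT-Norm+Last: {memory_saving('FT-Norm+Last', 'TinyTL-L+B'):.1f}x less memory, "
      f"+{accuracy_gain('TinyTL-L+B', 'FT-Norm+Last', 'Aircraft')} on Aircraft")
print(f"vs FT-Full: {memory_saving('FT-Full', 'TinyTL-L+B@320'):.1f}x less memory")
```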

Figure 3 demonstrates the results under different input resolutions. We can observe that simplyreducing the input resolution will result in significant accuracy drops for FT-Full. In contrast,TinyTL can reduce the memory footprint by 3.9-6.5× while having the same or even higher accuracycompared to fine-tuning the full network.

Combining TinyTL and Feature Extractor Adaptation. Table 3 summarizes the results of TinyTL and previously reported transfer learning results, where different backbone neural networks are used as the feature extractor. Combined with feature extractor adaptation, TinyTL achieves 7.5-12.9× memory saving compared to fine-tuning the full Inception-V3, reducing the training memory from 850MB to 66-114MB while providing the same level of accuracy. Additionally, we try updating the last two layers besides biases and lite residual modules (indicated by †), which results in 2MB of extra training memory footprint. This slightly improves the accuracy, from 90.7% to 91.5% on Cars, from 85.0% to 86.0% on Food, and from 84.8% to 85.4% on Aircraft.

Table 3: Comparison with previous transfer learning results under different backbone neural networks. ‘I-V3’ is Inception-V3; ‘N-A’ is NASNet-A Mobile; ‘M2-1.4’ is MobileNetV2-1.4; ‘R-50’ is ResNet-50; ‘PM’ is ProxylessNAS-Mobile; ‘FA’ represents feature extractor adaptation. † indicates the last two layers are updated besides biases and lite residual modules in TinyTL. TinyTL+FA reduces the training memory by 7.5-12.9× without sacrificing accuracy compared to fine-tuning the widely used Inception-V3.

| Method | Net | Train. mem. | Reduce ratio | Flowers | Cars | CUB | Food | Pets | Aircraft | CIFAR10 | CIFAR100 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FT-Full | I-V3 [44] | 850MB | 1.0× | 96.3 | 91.3 | 82.8 | 88.7 | - | 85.5 | - | - |
| FT-Full | R-50 [43] | 802MB | 1.1× | 97.5 | 91.7 | - | 87.8 | 92.5 | 86.6 | 96.8 | 84.5 |
| FT-Full | M2-1.4 [43] | 644MB | 1.3× | 97.5 | 91.8 | - | 87.7 | 91.0 | 86.8 | 96.1 | 82.5 |
| FT-Full | N-A [43] | 566MB | 1.5× | 96.8 | 88.5 | - | 85.5 | 89.4 | 72.8 | 96.8 | 83.9 |
| FT-Norm+Last | I-V3 [42] | 326MB | 2.6× | 90.4 | 81.0 | - | - | - | 70.7 | - | - |
| FT-Last | I-V3 [42] | 94MB | 9.0× | 84.5 | 55.0 | - | - | - | 45.9 | - | - |
| TinyTL | PM@320 | 65MB | 13.1× | 96.8 | 88.8 | 81.0 | 82.9 | 92.9 | 82.3 | 96.1 | 81.5 |
| TinyTL | FA@256 | 66MB | 12.9× | 96.8 | 89.6 | 80.8 | 83.4 | 93.0 | 82.4 | 96.8 | 82.7 |
| TinyTL | FA@352 | 114MB | 7.5× | 97.4 | 90.7 | 82.4 | 85.0 | 93.4 | 84.8 | - | - |
| TinyTL | FA@352† | 116MB | 7.3× | - | 91.5 | - | 86.0 | - | 85.4 | - | - |

[Figure 4 plot: three panels (Flowers102, Aircraft, Stanford-Cars), each showing top1 accuracy (%) against training memory (MB) for TinyTL, Activation Pruning (ResNet-50), and Activation Pruning (MobileNetV2), with pruning ratios from 0% to 60% marked on the activation pruning curves.]

Figure 4: Compared with dynamic activation pruning [31], TinyTL saves memory more effectively.
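The "Reduce ratio" column of Table 3 appears to be the training memory of FT-Full on Inception-V3 (850MB) divided by each method's training memory. The short sketch below recomputes a few entries under that assumption; the function and dictionary names are illustrative only.

```python
# Sketch: recompute Table 3 reduce ratios relative to FT-Full on Inception-V3.
BASELINE_MB = 850  # FT-Full, I-V3 (Table 3)

# Training memory in MB, copied from Table 3.
TRAIN_MEM_MB = {
    "FT-Norm+Last (I-V3)": 326,
    "FT-Last (I-V3)": 94,
    "TinyTL PM@320": 65,
    "TinyTL FA@256": 66,
    "TinyTL FA@352": 114,
    "TinyTL FA@352 (last two layers updated)": 116,
}


def reduce_ratio(train_mem_mb, baseline_mb=BASELINE_MB):
    """Memory saving factor relative to the baseline, rounded to one decimal."""
    return round(baseline_mb / train_mem_mb, 1)


if __name__ == "__main__":
    for method, mem in TRAIN_MEM_MB.items():
        print(f"{method}: {reduce_ratio(mem)}x")
```

Under this reading, the 7.5-12.9× range quoted in the text corresponds to the FA@352 and FA@256 rows.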

4.3 Ablation Studies and Discussions

Comparison with Dynamic Activation Pruning. The comparison between TinyTL and dynamic activation pruning [31] is summarized in Figure 4. TinyTL is more effective because it redesigns the transfer learning framework (lite residual module, feature extractor adaptation) rather than pruning an existing architecture. The transfer accuracy drops quickly when the pruning ratio increases beyond 50% (only 2× memory saving). In contrast, TinyTL can achieve much higher memory reduction without loss of accuracy.

Initialization for Lite Residual Modules. By default, we use the weights pre-trained on the pre-training dataset to initialize the lite residual modules. This requires the lite residual modules to be present during both the pre-training phase and the transfer learning phase. When applying TinyTL to existing pre-trained neural networks that do not have lite residual modules during the pre-training phase, we need to use another initialization strategy for the lite residual modules during transfer learning. To verify the effectiveness of TinyTL under this setting, we also evaluate the performance of TinyTL when using random weights [62] to initialize the lite residual modules, except for the scaling parameter of the final normalization layer in each lite residual module. These scaling parameters are initialized with zeros.
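One way to read the zero-initialized scaling parameter: at the start of transfer learning, the randomly initialized residual branch contributes nothing, so each block initially behaves like the pre-trained main branch alone. The sketch below illustrates this with a simplified one-dimensional normalization; it is an illustration under that assumption, not the authors' implementation.

```python
import math
import random


def normalize(values, gamma, beta=0.0, eps=1e-5):
    """Normalize a list of floats, then apply scale gamma and shift beta."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


def block_output(main_branch, residual_branch, gamma):
    """Add the normalized residual branch output to the main branch output."""
    refined = normalize(residual_branch, gamma)
    return [m + r for m, r in zip(main_branch, refined)]


if __name__ == "__main__":
    rng = random.Random(0)
    main = [rng.uniform(-1, 1) for _ in range(8)]
    residual = [rng.gauss(0, 1) for _ in range(8)]  # random-weight branch output
    # With gamma = 0 the block output equals the main branch output.
    print(block_output(main, residual, gamma=0.0) == main)
```

With a non-zero `gamma` the residual branch starts to alter the features, which is what training the scaling parameter gradually allows.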

Table 4 reports the summarized results. We find that using the pre-trained weights to initialize the lite residual modules consistently outperforms using random weights. We also find that TinyTL-RandomL+B still provides highly competitive results on Cars, Food, Aircraft, CIFAR10, CIFAR100, and CelebA. Therefore, if the budget allows, it is better to use pre-trained weights to initialize the lite residual modules. If not, TinyTL can still be applied and provides competitive results on datasets whose distribution is far from the pre-training dataset.

Table 4: Results of TinyTL under different initialization strategies for lite residual modules. TinyTL-L+B adds lite residual modules starting from the pre-training phase and uses the pre-trained weights to initialize the lite residual modules during transfer learning. In contrast, TinyTL-RandomL+B uses random weights to initialize the lite residual modules. Using random weights for initialization hurts the performance of TinyTL. But on datasets whose distribution is far from the pre-training dataset, TinyTL-RandomL+B still provides competitive results.

| Method | Train. Mem. | Flowers | Cars | CUB | Food | Pets | Aircraft | CIFAR10 | CIFAR100 | CelebA |
|---|---|---|---|---|---|---|---|---|---|---|
| FT-Last | 31MB | 90.1 | 50.9 | 73.3 | 68.7 | 91.3 | 44.9 | 85.9 | 68.8 | 88.7 |
| TinyTL-RandomL+B | 37MB | 88.0 | 82.4 | 72.9 | 79.3 | 84.3 | 73.6 | 95.7 | 81.4 | 91.2 |
| TinyTL-L+B | 37MB | 95.5 | 85.0 | 77.1 | 79.7 | 91.8 | 75.4 | 95.9 | 81.4 | 91.2 |
| FT-Norm+Last | 192MB | 94.3 | 77.9 | 76.3 | 77.0 | 92.2 | 68.1 | 94.8 | 80.2 | 90.4 |
| FT-Full | 391MB | 96.8 | 90.2 | 81.0 | 84.6 | 93.0 | 86.0 | 97.1 | 84.1 | 91.4 |

[Figure 5 plot: three panels (Flowers102, Aircraft, Stanford-Cars), each showing top1 accuracy (%) against training memory (MB) for TinyTL (batch size 1) and TinyTL (batch size 8), with a marker at 16MB labeled "Typical L3 Cache Size".]

Figure 5: Results of TinyTL when trained with batch size 1. It further reduces the training memory footprint to around 16MB (typical L3 cache size), making it possible to train on the cache (SRAM) instead of DRAM.

Results of TinyTL under Batch Size 1. Figure 5 shows the results of TinyTL when using a training batch size of 1. We tune the initial learning rate for each dataset while keeping the other training settings unchanged. As our model employs group normalization rather than batch normalization (Section 3.3), we observe little to no loss of accuracy compared to training with batch size 8. Meanwhile, the training memory footprint is further reduced to around 16MB, a typical L3 cache size. This makes it much easier to train on the cache (SRAM), which can greatly reduce energy consumption compared to DRAM training.
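The reason group normalization tolerates batch size 1 can be illustrated with a small sketch: its statistics are computed per sample, so a sample's normalized output does not depend on what else is in the batch, whereas batch normalization statistics do. The pure-Python functions below are simplified illustrations (no learned scale or shift), not the layers used in the paper.

```python
import math


def _standardize(values, eps=1e-5):
    """Return (mean, std) of a flat list of floats."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var + eps)


def group_norm(sample, num_groups):
    """sample: list of channels, each a flat list of activations."""
    per_group = len(sample) // num_groups
    out = []
    for g in range(num_groups):
        group = sample[g * per_group:(g + 1) * per_group]
        mean, std = _standardize([v for ch in group for v in ch])
        out.extend([[(v - mean) / std for v in ch] for ch in group])
    return out


def batch_norm(batch):
    """batch: list of samples; statistics are shared across the whole batch."""
    num_channels = len(batch[0])
    out = [[None] * num_channels for _ in batch]
    for c in range(num_channels):
        mean, std = _standardize([v for s in batch for v in s[c]])
        for i, s in enumerate(batch):
            out[i][c] = [(v - mean) / std for v in s[c]]
    return out


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]      # 2 channels, 2 positions each
    b = [[10.0, 20.0], [30.0, 40.0]]
    # Group norm: same result for `a` alone or alongside `b`.
    print(group_norm(a, 1) == [group_norm(s, 1) for s in (a, b)][0])
    # Batch norm: the result for `a` changes once `b` joins the batch.
    print(batch_norm([a])[0] == batch_norm([a, b])[0])
```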

5 Conclusion

We proposed Tiny-Transfer-Learning (TinyTL) for memory-efficient on-device learning that aims to adapt pre-trained models to newly collected data on edge devices. Unlike previous methods that focus on reducing the number of parameters or FLOPs, TinyTL directly optimizes the training memory footprint by fixing the memory-heavy modules (i.e., weights) while learning memory-efficient bias modules. We further introduce lite residual modules that significantly improve the adaptation capacity of the model with little memory overhead. Extensive experiments on benchmark datasets consistently show the effectiveness and memory-efficiency of TinyTL, paving the way for efficient on-device machine learning.


Broader Impact

The proposed efficient on-device learning technique greatly reduces the training memory footprint of deep neural networks, enabling pre-trained models to be adapted to new data locally on edge devices without leaking the data to the cloud. It can democratize AI for people in rural areas where the Internet is unavailable or the network condition is poor. They can not only run inference but also fine-tune AI models on their local devices without connections to cloud servers. This can also benefit privacy-sensitive AI applications, such as health care and smart homes.

Acknowledgements

We thank MIT-IBM Watson AI Lab, NSF CAREER Award #1943349 and NSF Award #2028888 for supporting this research. We thank MIT Satori cluster for providing the computation resource.

References

[1] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, 2018.

[2] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[3] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In CVPR, 2018.

[4] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NeurIPS, 2015.

[5] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In ICLR, 2016.

[6] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. Mnasnet: Platform-aware neural architecture search for mobile. In CVPR, 2019.

[7] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In ICCV, 2019.

[8] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In CVPR, 2019.

[9] Han Cai, Ji Lin, Yujun Lin, Zhijian Liu, Kuan Wang, Tianzhe Wang, Ligeng Zhu, and Song Han. Automl for architecting efficient and specialized neural networks. IEEE Micro, 2019.

[10] Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once for all: Train one network and specialize it for efficient deployment. In ICLR, 2020.

[11] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019.

[12] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.

[13] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NeurIPS, 2014.

[14] Vincent Vanhoucke, Andrew Senior, and Mark Z Mao. Improving the speed of neural networks on cpus. In NeurIPS Deep Learning and Unsupervised Feature Learning Workshop, 2011.

[15] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2019.

[16] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In ICCV, 2017.

[17] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In ICCV, 2017.

[18] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In NeurIPS, 2015.

[19] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In CVPR, 2018.

[20] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. Haq: Hardware-aware automated quantization. In CVPR, 2019.

[21] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016.

[22] Gao Huang, Shichen Liu, Laurens Van der Maaten, and Kilian Q Weinberger. Condensenet: An efficient densenet using learned group convolutions. In CVPR, 2018.

[23] Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Efficient architecture search by network transformation. In AAAI, 2018.

[24] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.

[25] Ashish Kumar, Saurabh Goyal, and Manik Varma. Resource-efficient machine learning in 2 kb ram for the internet of things. In ICML, 2017.

[26] Chirag Gupta, Arun Sai Suggala, Ankit Goyal, Harsha Vardhan Simhadri, Bhargavi Paranjape, Ashish Kumar, Saurabh Goyal, Raghavendra Udupa, Manik Varma, and Prateek Jain. Protonn: Compressed and accurate knn for resource-scarce devices. In ICML, 2017.

[27] Don Kurian Dennis, Yash Gaurkar, Sridhar Gopinath, Sachin Goyal, Chirag Gupta, Moksh Jain, Ashish Kumar, Aditya Kusupati, Chris Lovett, Shishir G Patil, Oindrila Saha, and Harsha Vardhan Simhadri. EdgeML: Machine Learning for resource-constrained edge devices.

[28] Audrunas Gruslys, Rémi Munos, Ivo Danihelka, Marc Lanctot, and Alex Graves. Memory-efficient backpropagation through time. In NeurIPS, 2016.

[29] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.

[30] Klaus Greff, Rupesh K Srivastava, and Jürgen Schmidhuber. Highway and residual networks learn unrolled iterative estimation. arXiv preprint arXiv:1612.07771, 2016.

[31] Liu Liu, Lei Deng, Xing Hu, Maohua Zhu, Guoqi Li, Yufei Ding, and Yuan Xie. Dynamic sparse graph for efficient deep learning. In ICLR, 2019.

[32] Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, and Kailash Gopalakrishnan. Training deep neural networks with 8-bit floating point numbers. In NeurIPS, 2018.

[33] Oindrila Saha, Aditya Kusupati, Harsha Vardhan Simhadri, Manik Varma, and Prateek Jain. Rnnpool: Efficient non-linear pooling for ram constrained inference. arXiv preprint arXiv:2002.11921, 2020.

[34] Peiye Liu, Bo Wu, Huadong Ma, Pavan Kumar Chundi, and Mingoo Seok. Memnet: Memory-efficiency guided neural architecture search with augment-trim learning. arXiv preprint arXiv:1907.09569, 2019.

[35] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.

[36] Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.

[37] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.

[38] Chuang Gan, Naiyan Wang, Yi Yang, Dit-Yan Yeung, and Alex G Hauptmann. Devnet: A deep event network for multimedia event detection and evidence recounting. In CVPR, pages 2568–2577, 2015.

[39] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. Cnn features off-the-shelf: an astounding baseline for recognition. In CVPR Workshops, 2014.

[40] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.

[41] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2013.

[42] Pramod Kaushik Mudrakarta, Mark Sandler, Andrey Zhmoginov, and Andrew Howard. K for the price of 1: Parameter efficient multi-task and transfer learning. In ICLR, 2019.

[43] Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better imagenet models transfer better? In CVPR, 2019.

[44] Yin Cui, Yang Song, Chen Sun, Andrew Howard, and Serge Belongie. Large scale fine-grained categorization and domain-specific transfer learning. In CVPR, 2018.

[45] Pramod Kaushik Mudrakarta, Mark Sandler, Andrey Zhmoginov, and Andrew Howard. K for the price of 1: Parameter-efficient multi-task and transfer learning. In ICLR, 2019.

[46] Jonathan Frankle, David J Schwab, and Ari S Morcos. Training batchnorm and only batchnorm: On the expressive power of random features in cnns. arXiv preprint arXiv:2003.00152, 2020.

[47] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

[48] Yuxin Wu and Kaiming He. Group normalization. In ECCV, 2018.

[49] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. In ICLR Workshop, 2018.

[50] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.

[51] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Sixth Indian Conference on Computer Vision, Graphics & Image Processing, 2008.

[52] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.

[53] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In CVPR, 2012.

[54] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. In ECCV, 2014.

[55] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[56] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Large-scale celebfaces attributes (celeba) dataset. Retrieved August, 2018.

[57] Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. Vggface2: A dataset for recognising faces across pose and age. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), 2018.

[58] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (bit): General visual representation learning. arXiv preprint arXiv:1912.11370, 2019.

[59] Siyuan Qiao, Huiyu Wang, Chenxi Liu, Wei Shen, and Alan Yuille. Weight standardization. arXiv preprint arXiv:1903.10520, 2019.

[60] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[61] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.

[62] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

[63] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420, 2019.

A Details of Feature Extractor Adaptation

Conventional transfer learning chooses the feature extractor according to the pre-training accuracy (e.g., ImageNet accuracy) and uses the same one for all transfer tasks [44, 45]. However, we find this approach sub-optimal since different target tasks may need very different feature extractors, and high pre-training accuracy does not guarantee good transferability of the pre-trained weights. This is especially critical in our case where the weights are frozen.

[Figure 6 plot: four bar-chart panels showing ImageNet Top1 (%), Flowers Top1 (%), Cars Top1 (%), and Aircraft Top1 (%) for eight ImageNet pre-trained models. The plotted values are listed below.]

| Model | ImageNet Top1 (%) | Flowers102 Top1 (%) | Stanford Cars Top1 (%) | Aircraft Top1 (%) |
|---|---|---|---|---|
| MobileNetV2 | 72.0 | 91.7 | 51.6 | 44.2 |
| ResNet-34 | 73.3 | 90.6 | 49.2 | 41.3 |
| MnasNet | 74.0 | 90.8 | 47.3 | 41.6 |
| ProxylessNAS | 74.6 | 90.3 | 48.8 | 43.2 |
| MobileNetV3 | 75.2 | 83.2 | 42.4 | 37.4 |
| ResNet-50 | 77.2 | 92.4 | 51.6 | 41.5 |
| ResNet-101 | 78.3 | 92.1 | 53.4 | 41.5 |
| Inception-v3 | 78.8 | 84.5 | 55.0 | 45.9 |

Figure 6: Transfer learning performances of various ImageNet pre-trained models with the last linear layer trained. The relative accuracy order between different pre-trained models changes significantly among ImageNet and the transfer learning datasets.

Figure 6 shows the top1 accuracy of various widely used ImageNet pre-trained models on three transfer datasets by only learning the last layer, which reflects the transferability of their pre-trained weights. The relative order between different pre-trained models is not consistent with their ImageNet accuracy on all three datasets. This result indicates that the ImageNet accuracy is not a good proxy for transferability. Besides, we also find that the same pre-trained model can have very different rankings on different tasks. For instance, Inception-V3 gives poor accuracy on Flowers but provides top results on the other two datasets.

Therefore, we need to specialize the feature extractor to best match the target dataset. In this work, we achieve this by using a pre-trained once-for-all network [10] that comprises many different sub-networks. Specifically, given a pre-trained once-for-all network on ImageNet, we fine-tune it on the target transfer dataset with the weights of the main branches (i.e., MB-blocks) frozen and the other parameters (i.e., biases, lite residual modules, classifier head) updated via gradient descent. In this phase, we randomly sample one sub-network in each training step. The peak memory cost of this phase is 61MB under resolution 224, which is reached when the largest sub-network is sampled. Regarding the computation cost, the average MAC (forward & backward; see footnote 4) of sampled sub-nets is (776M + 2510M) / 2 = 1643M per sample, where 776M is the training MAC of the smallest sub-network and 2510M is the training MAC of the largest sub-network. Therefore, the total MAC of this phase is 1643M × 2040 × 0.8 × 50 = 134T on Flowers, where 2040 is the number of total training samples, 0.8 means the once-for-all network is fine-tuned on 80% of the training samples (the remaining 20% is reserved for search), and 50 is the number of training epochs.
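The fine-tuning phase described above, where one sub-network is sampled at random in each training step, can be sketched as follows. The search space keys and values and the `train_step` callback are hypothetical placeholders for illustration; the actual once-for-all search space and training code are not specified in this section.

```python
import random

# Hypothetical, simplified search space (not the paper's actual one).
SEARCH_SPACE = {
    "depth": [2, 3, 4],
    "expand_ratio": [3, 4, 6],
    "kernel_size": [3, 5, 7],
}


def sample_subnet(search_space, rng):
    """Pick one option per architectural dimension uniformly at random."""
    return {name: rng.choice(options) for name, options in search_space.items()}


def finetune_supernet(num_steps, search_space, train_step, seed=0):
    """Run `num_steps` steps, each on a freshly sampled sub-network.

    `train_step(subnet)` stands in for one forward/backward pass that updates
    only biases, lite residual modules, and the classifier head, while the
    main-branch weights stay frozen.
    """
    rng = random.Random(seed)
    sampled = []
    for _ in range(num_steps):
        subnet = sample_subnet(search_space, rng)
        train_step(subnet)
        sampled.append(subnet)
    return sampled


if __name__ == "__main__":
    log = []
    finetune_supernet(3, SEARCH_SPACE, log.append)
    print(log)
```

Because any sub-network may be drawn, the peak memory of this phase is set by the largest one, which matches the statement that the 61MB peak is reached when the largest sub-network is sampled.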

Based on the fine-tuned once-for-all network, we collect 500 [sub-net, accuracy] pairs on the validation set (20% randomly sampled training data) and train an accuracy predictor (see footnote 5) using the collected data [10]. We employ evolutionary search [63] based on the accuracy predictor to find the sub-network that best matches the target transfer dataset. No back-propagation on the once-for-all network is required in this step, so it incurs no additional memory overhead. The primary computation cost of this phase comes from collecting the 500 [sub-net, accuracy] pairs required to train the accuracy predictor. It only involves the forward processes of sampled sub-nets, and no back-propagation is required. The average MAC (only forward) of sampled sub-nets is (355M + 1182M) / 2 = 768.5M per sample, where 355M is the inference MAC of the smallest sub-network and 1182M is the inference MAC of the largest sub-network. Therefore, the total MAC of this phase is 768.5M × 2040 × 0.2 × 500 = 157T on Flowers, where 2040 is the number of total training samples, 0.2 means the validation set consists of 20% of the training samples, and 500 is the number of measured sub-nets.

Footnote 4: The training MAC of a sampled sub-network is roughly 2× its inference MAC, rather than 3×, since we do not need to update the weights of the main branches.

Footnote 5: Details of the accuracy predictor are provided in Appendix B.

[Figure 7 plot: left side, bar charts comparing FT-Full (MbV2-1.4) and TinyTL on Pets: memory cost 644MB vs. 65MB (9.9× smaller), training cost 8,900 TMAC vs. 330 TMAC (27× smaller), and Pets top1 accuracy 91.0% vs. 92.9%. Right side, memory cost breakdown of TinyTL: frozen parameters 3MB (4.6%), trained parameters 7MB (10.8%), activations 55MB (84.6%).]

Figure 7: On-device training cost on Pets. TinyTL requires 9.9× smaller memory cost (assuming the same batch size) and 27× smaller computation cost compared to fine-tuning the full MobileNetV2-1.4 [43] while having better accuracy.
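The search step, in which an evolutionary search queries the accuracy predictor instead of running back-propagation, can be sketched as below. This is a generic evolutionary search with elitism written for illustration; the population size, mutation probability, architecture encoding, and the toy predictor in the usage example are all assumptions, not the settings of [63] or of the paper.

```python
import random


def evolutionary_search(predict_accuracy, choices_per_slot, population_size=20,
                        num_parents=5, generations=10, mutate_prob=0.1, seed=0):
    """Return (best architecture, best predicted score per generation)."""
    rng = random.Random(seed)

    def random_arch():
        return tuple(rng.choice(options) for options in choices_per_slot)

    def mutate(arch):
        # Resample each slot independently with probability `mutate_prob`.
        return tuple(rng.choice(options) if rng.random() < mutate_prob else gene
                     for gene, options in zip(arch, choices_per_slot))

    def crossover(a, b):
        # Take each slot from one of the two parents at random.
        return tuple(rng.choice(pair) for pair in zip(a, b))

    population = [random_arch() for _ in range(population_size)]
    best_per_generation = []
    for _ in range(generations):
        ranked = sorted(population, key=predict_accuracy, reverse=True)
        parents = ranked[:num_parents]
        best_per_generation.append(predict_accuracy(parents[0]))
        children = []
        while len(children) < population_size - num_parents:
            if rng.random() < 0.5:
                children.append(mutate(rng.choice(parents)))
            else:
                children.append(crossover(rng.choice(parents), rng.choice(parents)))
        population = parents + children  # parents survive, so the best never regresses
    return max(population, key=predict_accuracy), best_per_generation


if __name__ == "__main__":
    # Toy stand-in for the trained accuracy predictor: larger choices score higher.
    toy_predictor = lambda arch: sum(arch)
    best, history = evolutionary_search(toy_predictor, [(3, 5, 7)] * 6)
    print(best, history)
```

Only predictor evaluations are needed inside the loop, which is consistent with the statement that this step adds no training memory overhead.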

Finally, we fine-tune the searched sub-network with the weights of the main branches frozen and the other parameters updated, using the full training set to get the final results. The memory cost of this phase is 66MB under resolution 256 on Flowers. The total MAC is 2190M × 2040 × 1.0 × 50 = 223T on Flowers, where 2190M is the training MAC, 2040 is the number of total training samples, 1.0 means the full training set is used, and 50 is the number of training epochs.
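The three cost figures quoted for Flowers (134T, 157T, and 223T) all follow the same pattern: MAC per sample × number of samples × fraction of the data used × number of passes. The sketch below reproduces that arithmetic; the function and variable names are illustrative.

```python
M = 1e6   # mega
T = 1e12  # tera
NUM_TRAIN_SAMPLES = 2040  # Flowers training set size, as stated in the text


def total_mac(mac_per_sample, num_samples, data_fraction, passes):
    """Total multiply-accumulate operations for one phase."""
    return mac_per_sample * num_samples * data_fraction * passes


# Phase 1: fine-tune the once-for-all network (forward + backward), 50 epochs on 80%.
avg_train_mac = (776 * M + 2510 * M) / 2          # 1643M per sample
phase1 = total_mac(avg_train_mac, NUM_TRAIN_SAMPLES, 0.8, 50)

# Phase 2: evaluate 500 sub-nets (forward only) on the 20% validation split.
avg_infer_mac = (355 * M + 1182 * M) / 2          # 768.5M per sample
phase2 = total_mac(avg_infer_mac, NUM_TRAIN_SAMPLES, 0.2, 500)

# Phase 3: fine-tune the searched sub-network on the full training set, 50 epochs.
phase3 = total_mac(2190 * M, NUM_TRAIN_SAMPLES, 1.0, 50)

if __name__ == "__main__":
    for name, value in [("supernet fine-tuning", phase1),
                        ("sub-net evaluation", phase2),
                        ("final fine-tuning", phase3)]:
        print(f"{name}: {value / T:.1f}T MAC")
```

The smallest and largest sub-networks have training-to-inference MAC ratios of about 776/355 ≈ 2.2 and 2510/1182 ≈ 2.1, in line with the footnote's remark that training costs roughly 2× the inference MAC when main-branch weights are not updated.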

B Details of the Accuracy Predictor

The accuracy predictor is a three-layer feed-forward neural network with a hidden dimension of 400 and ReLU as the activation function for each layer. It takes the one-hot encoding of the sub-network’s architecture as the input and outputs the predicted accuracy of the given sub-network. The inference MAC of this accuracy predictor is only 0.37M, which is 3-4 orders of magnitude smaller than the inference MAC of the CNN classification models. The memory footprint of this accuracy predictor is only 5KB. Therefore, both the computation overhead and the memory overhead of the accuracy predictor are negligible.
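A minimal sketch of such a predictor is shown below, assuming that "three-layer" means three linear layers (input → 400 → 400 → 1) and that ReLU is applied after the two hidden layers only. The input width is not given in this section; the value 524 used in the usage example is a hypothetical choice, picked because it makes the MAC count under this layout come to 370,000, close to the reported 0.37M. The random weights stand in for trained ones.

```python
import random


def make_predictor(input_dim, hidden_dim=400, seed=0):
    """Build random (untrained) weights for an input -> 400 -> 400 -> 1 MLP."""
    rng = random.Random(seed)
    dims = [input_dim, hidden_dim, hidden_dim, 1]
    layers = []
    for fan_in, fan_out in zip(dims[:-1], dims[1:]):
        weights = [[rng.gauss(0.0, fan_in ** -0.5) for _ in range(fan_in)]
                   for _ in range(fan_out)]
        biases = [0.0] * fan_out
        layers.append((weights, biases))
    return layers


def predict(layers, one_hot):
    """Forward pass: linear layers with ReLU on the hidden layers."""
    x = one_hot
    for index, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x[0]


def inference_mac(layers):
    """One multiply-accumulate per weight."""
    return sum(len(weights) * len(weights[0]) for weights, _ in layers)


if __name__ == "__main__":
    layers = make_predictor(input_dim=524)  # hypothetical one-hot width
    encoding = [0.0] * 524
    encoding[3] = 1.0
    print(inference_mac(layers), predict(layers, encoding))
```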

C Cost Details

The on-device training cost of TinyTL and FT-Full on Pets is summarized in Figure 7 (left side), while the memory cost breakdown of TinyTL is provided in Figure 7 (right side). Compared to fine-tuning the full MobileNetV2-1.4, TinyTL not only greatly reduces the activation size but also reduces the parameter size (by applying weight quantization on the frozen parameters) and the training cost (by not updating weights of the feature extractor and using fewer training steps). Specifically, TinyTL reduces the memory footprint by 9.9×, and reduces the training computation by 27× without loss of accuracy. Therefore, TinyTL is not only more memory-efficient but also more computation-efficient.
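The savings and the memory breakdown can be recomputed from the values plotted in Figure 7 (644MB and 8,900 TMAC for FT-Full on MobileNetV2-1.4; 65MB and 330 TMAC for TinyTL; a 3MB / 7MB / 55MB split of the TinyTL memory). The variable names in this sketch are illustrative.

```python
# Values read from Figure 7 (Pets).
FT_FULL = {"memory_mb": 644, "train_tmac": 8900}
TINYTL = {"memory_mb": 65, "train_tmac": 330}
TINYTL_BREAKDOWN_MB = {"frozen params": 3, "trained params": 7, "activation": 55}

memory_saving = round(FT_FULL["memory_mb"] / TINYTL["memory_mb"], 1)
compute_saving = round(FT_FULL["train_tmac"] / TINYTL["train_tmac"])

total_mb = sum(TINYTL_BREAKDOWN_MB.values())
shares = {part: round(100 * mb / total_mb, 1)
          for part, mb in TINYTL_BREAKDOWN_MB.items()}

if __name__ == "__main__":
    print(f"memory: {memory_saving}x smaller, compute: {compute_saving}x smaller")
    print(shares)
```

Even after freezing the weights, activations still account for most of the remaining memory in this breakdown, while the quantized frozen parameters are the smallest part.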
