



Permute, Quantize, and Fine-tune: Efficient Compression of Neural Networks

Julieta Martinez∗,1 Jashan Shewakramani∗,1,2 Ting Wei Liu∗,1,2

Ioan Andrei Bârsan1,3 Wenyuan Zeng1,3 Raquel Urtasun1,3

1Uber Advanced Technologies Group   2University of Waterloo   3University of Toronto
{julieta,jashan,tingwei.liu,andreib,wenyuan,urtasun}@uber.com

Abstract

Compressing large neural networks is an important step for their deployment in resource-constrained computational platforms. In this context, vector quantization is an appealing framework that expresses multiple parameters using a single code, and has recently achieved state-of-the-art network compression on a range of core vision and natural language processing tasks. Key to the success of vector quantization is deciding which parameter groups should be compressed together. Previous work has relied on heuristics that group the spatial dimension of individual convolutional filters, but a general solution remains unaddressed. This is desirable for pointwise convolutions (which dominate modern architectures), linear layers (which have no notion of spatial dimension), and convolutions (when more than one filter is compressed to the same codeword). In this paper we make the observation that the weights of two adjacent layers can be permuted while expressing the same function. We then establish a connection to rate-distortion theory and search for permutations that result in networks that are easier to compress. Finally, we rely on an annealed quantization algorithm to better compress the network and achieve higher final accuracy. We show results on image classification, object detection, and segmentation, reducing the gap with the uncompressed model by 40 to 70% w.r.t. the current state of the art. All our experiments can be reproduced using the code at https://github.com/uber-research/permute-quantize-finetune.

1. Introduction

State-of-the-art approaches to many computer vision

tasks are currently based on deep neural networks. These networks often have large memory and computational requirements, limiting the range of hardware platforms on which they can operate. This poses a challenge for applications such as virtual reality and robotics, which naturally rely on mobile and low-power computational platforms for large-scale deployment. At the same time, these networks

are often overparameterized [5], which implies that it is possible to compress them – thereby reducing their memory and computation demands – without much loss in accuracy.

Scalar quantization is a popular approach to network compression where each network parameter is compressed individually, thereby limiting the achievable compression rates. To address this limitation, a recent line of work has focused on vector quantization (VQ) [13, 47, 54], which compresses multiple parameters into a single code. Conspicuously, these approaches have recently achieved state-of-the-art compression-to-accuracy ratios on core computer vision and natural language processing tasks [10, 48].

A key advantage of VQ is that it can naturally exploit redundancies among groups of network parameters, for example, by grouping the spatial dimensions of convolutional filters in a single vector to achieve high compression rates. However, finding which network parameters should be compressed jointly can be challenging; for instance, there is no notion of spatial dimension in fully connected layers, and it is not clear how vectors should be formed when the vector size is larger than a single convolutional filter – which is always true for pointwise convolutions. Current approaches either employ clustering (e.g., k-means) using the order of the weights as obtained by the network [13, 47, 54], which is suboptimal, or search for groups of parameters that, when compressed jointly, minimize the reconstruction error of the network activations [10, 48, 54], which is hard to optimize.

In this paper, we formalize the notion of redundancy among parameter groups using concepts from rate-distortion theory, and leverage this analysis to search for permutations of the network weights that yield functionally equivalent, yet easier-to-quantize networks. The result is Permute, Quantize, and Fine-tune (PQF), an efficient algorithm that first searches for permutations, codes and codebooks that minimize the reconstruction error of the network weights, and then uses gradient-based optimization to recover the accuracy of the uncompressed network. Our main contributions can be summarized as follows:

1. We study the invariance of neural networks under permutation of their weights, focusing on constraints induced


by the network topology. We then formulate a permutation optimization problem to find functionally equivalent networks that are easier to quantize. Our result focuses on improving a quantization lower bound of the weights; therefore

2. We use an efficient annealed quantization algorithm that reduces quantization error and leads to higher accuracy of the compressed networks. Finally,

3. We show that the reconstruction error of the network parameters is inversely correlated with the final network accuracy after gradient-based fine-tuning.

Put together, the above contributions define a novel method that produces state-of-the-art results in terms of model size vs. accuracy. We benchmark our method by compressing popular architectures for image classification, and object detection & segmentation, showcasing the wide applicability of our approach. Our results show a 40-60% relative error reduction on ImageNet object classification over the current state of the art when compressing a ResNet-50 [21] down to about 3 MB (∼31× compression). We also demonstrate a relative 60% (resp. 70%) error reduction in object detection (resp. mask segmentation) on COCO over previous work, by compressing a Mask-RCNN architecture down to about 6.6 MB (∼26× compression).

2. Related Work

There is a vast literature on compressing neural networks. Efforts in this area can broadly be divided into pruning, low-rank approximations, and quantization.

Weight pruning: In its simplest form, weight pruning can be achieved by removing small weights [16, 18], or approximating the importance of each parameter using second-order terms [7, 19, 30]. More sophisticated approaches use meta-learning to obtain pruning policies that generalize to multiple models [22], or use regularization terms during training to reduce parameter count [36]. Most of these methods prune individual weights, and result in sparse networks that are difficult to accelerate on commonly available hardware. To address these issues, another line of work aims to remove unimportant channels, producing networks that are easier to accelerate in practice [23, 31, 38].

Low-rank approximations: These methods can achieve acceleration by design [6, 25, 29, 42], as they typically factorize the original weight matrix into several smaller matrices. As a result, the original computationally-heavy forward pass can be replaced by a multiplication of several smaller vectors and matrices.

Scalar quantization: These techniques constrain the number of bits that each parameter may take, in the extreme case using binary [4, 39, 44, 53] or ternary [57] values. 8-bit quantization methods have proven robust and efficient, which has motivated their native support by popular deep learning libraries such as PyTorch1 and Tensorflow Lite2, with acceleration often targeting CPUs. We refer the reader to the survey by [41] for a recent comprehensive overview of the subject. In this context, reducing each parameter to a single bit yields a theoretical compression ratio of 32× (although, in practice, fully-connected and batch norm layers are not quantized [39]). To obtain higher compression ratios, researchers have turned to vector quantization.

Vector quantization (VQ): VQ of neural networks was pioneered by Gong et al. [13], who investigated scalar, vector, and product quantization [26] (PQ) of fully-connected (FC) layers, which were the most memory-demanding layers of convolutional neural networks (CNNs) at the time. Wu et al. [54] used PQ to compress both FC and convolutional layers of CNNs; they noticed that minimizing the quantization error of the network parameters produces much worse results than minimizing the error of the activations, so they sequentially quantized the layers to minimize error accumulation. However, neither Gong et al. [13] nor Wu et al. [54] explored end-to-end training, which is necessary to recover the network accuracy as the compression ratio increases.

Son et al. [47] clustered 3×3 convolutions using vector quantization, and fine-tuned the centroids via gradient descent using additional bits to encode filter rotation, resulting in very compact codebooks. However, they did not explore the compression of FC layers nor pointwise convolutions (which dominate modern architectures), and did not explore the relationship of quantization error to accuracy.

Stock et al. [48] use PQ to compress convolutional and FC layers using a clustering technique designed to minimize the reconstruction error of the layer outputs (which is computationally expensive), followed by end-to-end training of the cluster centroids via distillation. However, their approach does not optimize the grouping of the network parameters for quantization, which we find to be crucial to obtain good compression. Chen et al. [2] improve upon the results of [48] by minimizing the reconstruction error of the parameters and the task loss jointly; however, their method also uses more fine-tuning epochs, so a direct comparison is hard.

Different from previous approaches, our method exploits the invariance of neural networks under permutation of their weights for the purpose of vector compression. Based on this observation, we draw connections to rate-distortion theory, and use an efficient permutation optimization algorithm that makes the network easier to quantize. We also use an annealed clustering algorithm to further reduce quantization error, and show that there is a direct correlation between the

1 pytorch.org/docs/stable/quantization.html
2 tensorflow.org/lite/performance/post_training_quantization


quantization error of a network's weights and its final accuracy after fine-tuning. These contributions result in an efficient method that largely outperforms its competitors on a wide range of applications.

3. Learning to Compress a Neural Network

In this paper we compress a neural network by compressing the weights of its layers. Specifically, instead of storing the weight matrix W of a layer explicitly, we learn an encoding B(W) that takes considerably less memory. Intuitively, we can decode B to a matrix Ŵ that is "close" to W, and use Ŵ as the weight matrix for the layer. The idea is that if Ŵ is similar to W, the activations of the layer should also be similar. Note that the encoding will be different for each of the layers.

3.1. Designing the Encoding

For a desired compression rate, we design the encoding B to consist of a codebook C, a set of codes B, and a permutation matrix P. The permutation matrix preprocesses the weights so that they are easier to compress without affecting the input-output mapping of the network, while the codes and codebook attempt to express the permuted weights as accurately as possible using limited memory.

Codes and codebook: Let W ∈ Rm×n denote the weight matrix of a fully-connected (FC) layer, with m the input size of the layer, and n the size of its output. We split each column of W into column subvectors wi,j ∈ Rd×1, which are then compressed individually:

W =
\begin{bmatrix}
w_{1,1} & w_{1,2} & \cdots & w_{1,n} \\
w_{2,1} & w_{2,2} & \cdots & w_{2,n} \\
\vdots  & \vdots  & \ddots & \vdots  \\
w_{\bar{m},1} & w_{\bar{m},2} & \cdots & w_{\bar{m},n}
\end{bmatrix} ,   (1)

where m̄ = m/d, and m̄ · n is the total number of subvectors. Intuitively, larger d results in fewer subvectors and thus higher compression rates. The set {wi,j} is thus a collection of d-dimensional blocks that can be used to construct W.

Instead of storing all these subvectors, we approximate them by a smaller set C = {c(1), . . . , c(k)} ⊆ Rd×1, which we call the codebook for the layer. We refer to the elements of C as centroids. Let bi,j ∈ {1, . . . , k} be the index of the element in C that is closest to wi,j in Euclidean space:

bi,j = argmin_t ‖wi,j − c(t)‖²₂ .   (2)

The codes B = {bi,j} are the indices of the centroids in the codebook that best reconstruct every subvector {wi,j}. The approximation Ŵ of W is thus the matrix obtained by replacing each subvector wi,j with c(bi,j):

\widehat{W} =
\begin{bmatrix}
c(b_{1,1}) & c(b_{1,2}) & \cdots & c(b_{1,n}) \\
c(b_{2,1}) & c(b_{2,2}) & \cdots & c(b_{2,n}) \\
\vdots & \vdots & \ddots & \vdots \\
c(b_{\bar{m},1}) & c(b_{\bar{m},2}) & \cdots & c(b_{\bar{m},n})
\end{bmatrix} .   (3)

We refer to the process of expressing the weight matrix in terms of codes and a codebook as quantization.
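For concreteness, the following NumPy sketch (ours; the helper names and the use of plain Lloyd k-means in place of the annealed SR-C of Section 3.2.2 are our own simplifications) builds the column subvectors of Eq. (1), assigns codes as in Eq. (2), and reconstructs the approximation of Eq. (3):

import numpy as np

def quantize_layer(W, d=4, k=256, iters=20, seed=0):
    """Toy PQ-style encoding of a weight matrix W (m x n): split every column
    into d-dimensional subvectors, cluster them with plain k-means, and return
    the codebook C (k x d), the codes B, and the reconstruction W_hat."""
    rng = np.random.default_rng(seed)
    m, n = W.shape
    assert m % d == 0
    # Column subvectors of Eq. (1): block (i, j) holds rows i*d..(i+1)*d-1 of column j.
    subvecs = W.reshape(m // d, d, n).transpose(0, 2, 1).reshape(-1, d)
    # Plain Lloyd k-means (the paper's SR-C annealing is a drop-in replacement).
    C = subvecs[rng.choice(len(subvecs), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((subvecs[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        B = dists.argmin(1)                      # code assignment, Eq. (2)
        for j in range(k):                       # centroid update
            if np.any(B == j):
                C[j] = subvecs[B == j].mean(0)
    # Reconstruction of Eq. (3): replace every subvector by its centroid.
    W_hat = C[B].reshape(m // d, n, d).transpose(0, 2, 1).reshape(m, n)
    return C, B, W_hat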

Permutation: The effectiveness of a given set of codes and codebooks depends on their ability to represent the original weight matrix W accurately. Intuitively, this is easier to achieve if the subvectors wi,j are similar to one another. Therefore, it is natural to consider transformations of W that make the resulting subvectors easier to compress.

A feedforward network can be thought of as a directed acyclic graph (DAG), where nodes represent layers and edges represent the flow of information in the network. We refer to the starting node of an edge as a parent layer, and to the end node as a child layer. We note that the network is invariant under permutation of its weights, as long as the same permutation is applied to the output dimension for parent layers and the input dimension for children layers. Here, our key insight is that we can search for permutations that make the network easier to quantize.

Formally, consider a network comprised of two layers:

f(x) = φ(xW2), W2 ∈ Rm×n (4)

g(x) = φ(xW1), W1 ∈ Rp×m (5)

where φ represents a non-linear activation function. The network can be described as the function

f ◦ g(x) = f(g(x)) = φ(g(x)W2), (6)

where x ∈ R1×p is the input to the network. Furthermore, from a topological point of view, g is the parent of f.

Given a permutation π of m elements π : {1, . . . , m} → {1, . . . , m}, we denote P as the permutation matrix that results from reordering the rows of the m × m identity matrix according to π. Left-multiplying P with X has the effect of reordering the rows of X according to π.

Let fP2 be the layer that results from applying the permutation matrix P2 to the input dimension of the weights of f:

fP2(x) = φ(xP2W2). (7)

Analogously, let gP2 be the layer that results from applying the permutation P2 to the output dimension of the weights of g:

gP2(x) = φ(x(P2W1⊤)⊤) = φ(xW1P2⊤).   (8)

Importantly, so long as φ is an element-wise operator, gP2 produces the same output as g, only permuted:

gP2(x) = g(x)P2⊤ ,   (9)


Figure 1: Permutation optimization of a fully-connected layer. Our goal is to find a permutation P of the weights W such that the resulting subvectors are easier to compress.

then we have

fP2 ◦ gP2(x) = fP2(gP2(x))          (10)
             = fP2(g(x)P2⊤)          (11)
             = φ(g(x)P2⊤P2W2)        (12)
             = φ(g(x)W2)             (13)
             = f(g(x))               (14)
             = f ◦ g(x),  ∀ P2, x.   (15)

This functional equivalence has previously been used to characterize the optimization landscape of neural networks [1, 43]. In contrast, here we focus on quantizing the permuted weight P2W2, and denote its subvectors as {w^{P2}_{i,j}}. We depict the process of applying a permutation and obtaining new subvectors in Figure 1.
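The equivalence of Eqs. (10)-(15) can be checked numerically; the small script below (our own illustration, not part of the released code) builds a random two-layer network and verifies that the permuted and unpermuted versions produce the same output:

import numpy as np

rng = np.random.default_rng(0)
p, m, n = 5, 8, 3
W1 = rng.standard_normal((p, m))               # parent layer g
W2 = rng.standard_normal((m, n))               # child layer f
x = rng.standard_normal((1, p))
phi = np.tanh                                  # any element-wise non-linearity

P2 = np.eye(m)[rng.permutation(m)]             # random permutation matrix

out_original = phi(phi(x @ W1) @ W2)                      # f(g(x)), Eq. (6)
out_permuted = phi(phi(x @ (W1 @ P2.T)) @ (P2 @ W2))      # f_P2(g_P2(x)), Eqs. (7)-(8)

assert np.allclose(out_original, out_permuted)            # Eqs. (10)-(15)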

Extension to convolutional layers: The encoding of convolutional layers is closely related to that of fully-connected layers. Let W ∈ RCin×Cout×K×K denote the weights of a convolutional layer with Cin input channels, Cout output channels, and a kernel size of K × K. The idea is to reshape W into a 2d matrix Wr of size CinK² × Cout, and then apply the same encoding method that we use with fully-connected layers. The result is an approximation Ŵr to Wr. We then apply the inverse of the reshaping operation on Ŵr to get our approximation to W.

When K > 1, we set the codeword size d to a multiple of K² and limit the permutation matrix P to have a block structure such that the spatial dimensions of filters are quantized together. For pointwise convolutions (i.e., K = 1), we set d to 4 or 8, depending on the desired compression rate.
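As an illustration of this reshaping, the sketch below (ours) maps a convolutional weight tensor, stored in the PyTorch (Cout, Cin, K, K) layout, to the CinK² × Cout matrix and back; the exact row ordering within each filter is our assumption:

import torch

def conv_to_matrix(W):
    """Reshape conv weights from the PyTorch (Cout, Cin, K, K) layout to the
    (Cin*K*K) x Cout matrix used for quantization (sketch; names are ours)."""
    c_out, c_in, k, _ = W.shape
    return W.permute(1, 2, 3, 0).reshape(c_in * k * k, c_out)

def matrix_to_conv(Wr, c_in, k):
    """Inverse reshaping, used to map the approximation back to conv weights."""
    c_out = Wr.shape[1]
    return Wr.reshape(c_in, k, k, c_out).permute(3, 0, 1, 2)

W = torch.randn(64, 32, 3, 3)                  # Cout=64, Cin=32, K=3
assert torch.equal(matrix_to_conv(conv_to_matrix(W), c_in=32, k=3), W)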

We have so far considered networks where each layer has a single parent and a single child (i.e., the topology is a chain). We now consider architectures where some layers may have more than one child or more than one parent.

Extension beyond chain architectures: AlexNet [28] and VGG [46] are examples of popular architectures with a chain topology. As a consequence, each layer can have a different permutation. However, architectures with more complicated topologies have more constraints on the permutations that they admit.

For example, consider Figure 2, which depicts six resblocks as used in the popular ResNet-50 architecture. We start by finding a permutation for layer 4a, and realize that its parent is layer 3c. We also notice that layers 3c and 2c must share the same permutation for the residual addition to have matching channels. By induction, this is also true of layers 1c and 1d, which are now all parents of our initial layer 4a. These parents have children of their own (layers 2a, 3a and 4d), so these must be counted as siblings of 4a, and must share the same permutation as 4a. However, note that all b and c layers are only children, so they can have their own independent permutation.

Operations such as reshaping and concatenation in parent layers may also affect the permutations that a layer can tolerate while preserving functional equivalence. For example, in the detection head of Mask-RCNN [20], the output (of shape 256 × 7 × 7) of a convolutional layer is reshaped (to 12 544) before entering a FC layer. Moreover, the same tensor is used in another convolutional layer (without reshaping) for mask prediction. Therefore, the FC layer and the child convolutional layer must share the same permutation. In this case, the FC layer must keep blocks of 7 × 7 = 49 contiguous dimensions together to respect the channel ordering of its parent (and to match the permutation of its sibling). Determining the maximum set of independent permutations that an arbitrary network may admit (and finding efficient algorithms to do so) is a problem we leave for future work.

3.2. Learning the Encoding

Our overarching goal is to learn an encoding of each layer such that the final output of the network is preserved. Towards this goal, we search for a set of codes, codebook, and permutation that minimizes the quantization error Et of every layer t of the network:

Et = min_{Pt, Bt, Ct} (1/mn) ‖Ŵt − PtWt‖²_F .   (16)

Our optimization procedure consists of three steps:

1. Permute: We search for a permutation of each layer that results in subvectors that are easier to quantize. We do this by minimizing the determinant of the covariance of the resulting subvectors.


[Figure 2 diagram: six resblocks (Resblock 0–5), each built from a. 1x1 conv, b. 3x3 conv, and c. 1x1 conv layers joined by residual additions, with additional d. 1x1 conv layers on some shortcut paths.]

Figure 2: Parent-child dependencies in the Resblocks of a ResNet-50 architecture. Purple nodes are children and yellow nodes are parents, and must share the same permutation (in Cin for children and in Cout for parents) for the network to produce the same output.

2. Quantize: We obtain codes and codebooks for each layer by minimizing the difference between the approximated weight and the permuted weight.

3. Fine-tune: Finally, we jointly fine-tune all the codebooks with gradient-based optimization by minimizing the loss function of the original network over the training dataset.

We have found that minimizing the quantization error of the network weights (Eq. (16)) results in small inaccuracies that accumulate over multiple layers, reducing performance; therefore, it is important to jointly fine-tune the network so that it can recover its original accuracy with gradient descent. We have also observed that the quality of the initial reconstruction has a direct impact on the final network accuracy. We now describe the three steps in detail.

3.2.1 Permute

In this step, our goal is to estimate a permutation Pt such that the permuted weight matrix PtWt has subvectors {w^{Pt}_{i,j}} that are easily quantizable. Intuitively, we want to minimize the spread of the vectors, as more compact vectors can be expressed more accurately given a fixed number of centroids. We now formalize this intuition and propose a simple algorithm to find good permutations.

A quantization lower bound: We assume that the weight subvectors that form the input to the quantization step come from a Gaussian distribution, w^{Pt}_{i,j} ∼ N(0, Σt), with zero mean and covariance Σt ∈ Rd×d, which is a positive semi-definite matrix. Thanks to rate-distortion theory [12], we know that the expected reconstruction error Et must follow

Et ≥ k^{−2/d} · d · |Σt|^{1/d} ;   (17)

in other words, the error is lower-bounded by the determinant of the covariance of the subvectors of PtWt. We assume that we have access to a good minimizer such that, roughly, this bound is equal to the reconstruction error achieved by our quantization algorithm. Thus, for a fixed target compression bit-rate, we can focus on finding a permutation Pt that minimizes |Σt|.
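As a hedged illustration, the bound of Expression (17) can be estimated directly from the subvectors induced by a candidate permutation; the helper below (ours) does so under the Gaussian assumption, and comparing its value across permutations indicates which one should be easier to quantize:

import numpy as np

def quantization_lower_bound(subvecs, k):
    """Evaluate Expression (17), E_t >= k^(-2/d) * d * |Sigma_t|^(1/d), from a set
    of d-dimensional subvectors (assumes the Gaussian model described above)."""
    d = subvecs.shape[1]
    sigma = np.cov(subvecs, rowvar=False)        # d x d covariance of the subvectors
    return k ** (-2.0 / d) * d * np.linalg.det(sigma) ** (1.0 / d)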

Searching for permutations: We make use of Expression (17) and focus on obtaining a permutation Pt that minimizes the determinant of the covariance of the set {w^{Pt}_{i,j}}. We follow an argument similar to that of Ge et al. [11], and note that the determinant of any positive semi-definite matrix Σt ∈ Rd×d, with elements σ^t_{i,j}, satisfies Hadamard's inequality:

|Σt| ≤ ∏_{i=1}^{d} σ^t_{i,i} ;   (18)

that is, the determinant of Σt is upper-bounded by the product of its diagonal elements.

Motivated by this inequality, we greedily obtain an initial Pt that minimizes the product of the diagonal elements of Σt by creating d buckets of row indices, each with capacity to hold m̄ = m/d elements. We then compute the variance of each row of Wt, and greedily assign each row index to the non-full bucket that results in the lowest bucket variance. Finally, we obtain Pt by interlacing rows from the buckets so that rows from the same bucket are placed d rows apart. K×K convolutions can be handled similarly, assuming that Pt has a block structure, and making use of the more general Fischer's inequality. Please refer to the appendix for more details.
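A minimal sketch of this greedy initialization is given below (ours; the cost of adding a row to a bucket and the visiting order, here by decreasing row variance, are details we choose for illustration and are not specified above):

import numpy as np

def greedy_permutation(W, d):
    """Greedy bucket initialization described above (sketch, names are ours)."""
    m, _ = W.shape
    cap = m // d                                  # each bucket holds m/d rows
    row_var = W.var(axis=1)
    buckets = [[] for _ in range(d)]
    # Visiting order (largest row variance first) is our own choice.
    for r in np.argsort(-row_var):
        costs = []
        for b in buckets:
            if len(b) == cap:
                costs.append(np.inf)              # bucket is full
            else:
                costs.append(W[b + [int(r)]].var())  # variance if row r joins bucket b
        buckets[int(np.argmin(costs))].append(int(r))
    # Interlace: the j-th row of bucket b goes to position j*d + b, so rows from
    # the same bucket end up d rows apart (same subvector dimension).
    perm = np.empty(m, dtype=int)
    for b, rows in enumerate(buckets):
        for j, r in enumerate(rows):
            perm[j * d + b] = r
    return perm                                   # permuted matrix is W[perm]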

There are O(m!) possible permutations of Wt, so greedy algorithms are bound to have limitations on the quality of the solution that they can find. Thus, we refine our solution via stochastic local search [24]. Specifically, we iteratively improve the candidate permutation by flipping two dimensions chosen at random, and keeping the new permutation if it results in a set of subvectors whose covariance has lower determinant |Σt|. We repeat this procedure for a fixed number of iterations, and return the best permutation obtained.
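The sketch below (ours) illustrates this refinement for a fully-connected layer, comparing log-determinants of the subvector covariance rather than raw determinants, which is equivalent for accepting or rejecting a swap:

import numpy as np

def local_search(W, perm, d, iters=1000, seed=0):
    """Refine a candidate row permutation by random pairwise swaps, keeping a swap
    if the determinant of the subvector covariance decreases (sketch, ours)."""
    rng = np.random.default_rng(seed)
    n_cols = W.shape[1]

    def cov_logdet(p):
        sub = W[p].reshape(-1, d, n_cols).transpose(0, 2, 1).reshape(-1, d)
        return np.linalg.slogdet(np.cov(sub, rowvar=False))[1]   # log|Sigma|

    best = cov_logdet(perm)
    for _ in range(iters):
        i, j = rng.choice(len(perm), size=2, replace=False)
        cand = perm.copy()
        cand[i], cand[j] = cand[j], cand[i]
        score = cov_logdet(cand)
        if score < best:
            perm, best = cand, score
    return perm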

3.2.2 Quantize

In this step, we estimate the codes Bt and codebook Ct that approximate the permuted weight PtWt. Given a fixed permutation, this is equivalent to the well-known k-means problem. We use an annealed quantization algorithm called SR-C, originally due to Zeger et al. [56], and recently adapted by Martinez et al. [40] to multi-codebook quantization. Empirically, SR-C achieves lower quantization error than the


vanilla k-means algorithm, and is thus a better minimizer of Expr. (17).

A stochastic relaxation of clustering: The quantization lower bound from Expression (17) suggests that the k-means algorithm can be annealed by scheduling a perturbation such that the determinant of the covariance of the set {w^{Pt}_{i,j}} decreases over time. Due to Hadamard's inequality (i.e., Expression (18)), this can be achieved by adding decreasing amounts of noise to w^{Pt}_{i,j}, sampled from a zero-mean Gaussian with diagonal covariance.

Therefore, after randomly initializing the codes, we iteratively update the codebook and codes with a noisy codebook update (which operates on subvectors with additive diagonalized Gaussian noise), and a standard k-means code update. We decay the noise according to the schedule (1 − (τ/T))^γ, where τ is the current iteration, T is the total number of update iterations, and γ is a constant. We use γ = 0.5 in all our experiments. For a detailed description, please refer to Algorithm 1.

Algorithm 1 SR-C: Stochastic relaxation of k-means.

1: procedure SR-C({w^{Pt}_{i,j}}, Σt, k, T, γ)
2:   Bt ← InitializeCodes(k)
3:   for τ ← 1, . . . , T do
       # Add scheduled noise to subvectors
4:     for w^{Pt}_{i,j} ∈ {w^{Pt}_{i,j}} do
5:       x_{i,j} ∼ N(0, diag(Σt))
6:       w^{Pt}_{i,j} ← w^{Pt}_{i,j} + x_{i,j} · (1 − (τ/T))^γ
7:     end for
       # Noisy codebook update
8:     Ct ← argmin_C ∑_{i,j} ‖w^{Pt}_{i,j} − c(b_{i,j})‖²₂
       # Regular codes update
9:     Bt ← argmin_B ∑_{i,j} ‖w^{Pt}_{i,j} − c(b_{i,j})‖²₂
10:  end for
11:  return Bt, Ct
12: end procedure
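A compact NumPy sketch of Algorithm 1 is shown below (ours); we re-sample the noise from the original subvectors at every iteration, which is our reading of line 6, and note that at τ = T the schedule reaches zero, so the final updates are noiseless:

import numpy as np

def src_kmeans(subvecs, k, T=100, gamma=0.5, seed=0):
    """NumPy sketch of Algorithm 1 (SR-C): k-means in which the updates see
    subvectors perturbed by diagonal Gaussian noise that decays to zero."""
    rng = np.random.default_rng(seed)
    n, d = subvecs.shape
    diag_std = subvecs.std(axis=0)                    # sqrt of diag(Sigma_t)
    codes = rng.integers(0, k, size=n)                # random code initialization
    codebook = np.zeros((k, d))
    for tau in range(1, T + 1):
        scale = (1.0 - tau / T) ** gamma              # noise schedule
        noisy = subvecs + rng.standard_normal((n, d)) * diag_std * scale
        # Noisy codebook update (line 8): centroids are means of noisy subvectors.
        for j in range(k):
            if np.any(codes == j):
                codebook[j] = noisy[codes == j].mean(axis=0)
        # Codes update (line 9): assign subvectors to their nearest centroid.
        dists = ((noisy[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        codes = dists.argmin(axis=1)
    return codes, codebook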

3.2.3 Fine-tune

Encoding each layer independently causes errors in the activations to accumulate, resulting in degradation of performance. It is thus important to fine-tune the encoding in order to recover the original accuracy of the network. In particular, we fix the codes and permutations for the remainder of the procedure.

Let L be the original loss function of the network (e.g., cross-entropy for classification). We note that L is differentiable with respect to each of the learned centroids – since these are continuous – so we use the original training set to fine-tune the centroids with gradient-based learning:

c(i) ← c(i) − u(∂L/∂c(i), θ) ,   (19)

Model       Regime        dK    dpw   dfc
ResNet-18   Small blocks  K²    4     4
            Large blocks  2K²   4     4
ResNet-50   Small blocks  K²    4     4
            Large blocks  2K²   8     4

Table 1: Subvector sizes and compression regimes.

where u(·, ·) is an update rule (such as SGD, RMSProp [50] or Adam [27]) with hyperparameters θ (such as learning rate, momentum, and decay rates).
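In practice, this step amounts to rebuilding each layer's weight from its (fixed) codes and (learnable) codebook on every forward pass, and letting the task gradients update the codebook as in Eq. (19). The PyTorch sketch below (ours, with made-up shapes and an MSE stand-in for the task loss) shows the idea for a single fully-connected layer:

import torch
import torch.nn.functional as F

m, n, d, k = 512, 256, 4, 256                         # illustrative shapes
codes = torch.randint(0, k, (m // d * n,))            # fixed after quantization
codebook = torch.randn(k, d, requires_grad=True)      # the only learned tensor
optimizer = torch.optim.Adam([codebook], lr=1e-3)

def forward(x):
    # Rebuild the (m x n) weight from codes and codebook on every forward pass.
    W_hat = codebook[codes].reshape(m // d, n, d).permute(0, 2, 1).reshape(m, n)
    return F.linear(x, W_hat.t())                     # x @ W_hat (bias omitted)

x = torch.randn(32, m)
target = torch.randn(32, n)
loss = F.mse_loss(forward(x), target)                 # stand-in for the task loss L
optimizer.zero_grad()
loss.backward()
optimizer.step()                                      # u(dL/dc(i), theta) with Adam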

4. Experiments

We test our method on ResNet [21] architectures for image classification and Mask R-CNN [20] for object detection and instance segmentation. We compress standard ResNet-18 and ResNet-50 models that have been pre-trained on ImageNet, taking the weights directly from the PyTorch model zoo. We train different networks with k ∈ {256, 512, 1024, 2048}. We also clamp the size of the codebook for each layer to min(k, n × Cout/4).

Small vs. large block sizes: To further assess the trade-off between compression and accuracy, we use two compression regimes. In the large blocks regime, we use a larger subvector size d for each layer, which allows the weight matrix to be encoded with fewer codes, and thus leads to higher compression rates. To describe the subvector sizes we use for each layer, we let dK denote the subvector size for a convolutional layer with filters of size K × K. In the special case when K = 1, corresponding to a pointwise convolution, we denote the subvector size by dpw. Finally, fully-connected layers have a subvector size of dfc. We summarize our subvector sizes for each model and compression regime in Table 1.

Bit allocation: We compress all the fully-connected and convolutional layers of a network. However, following [48], we do not compress the first convolutional layer (since it occupies less than 0.05% of the network size), the bias of the fully-connected layers, or the batchnorm layers. While we train with 32-bit floats, we store our final model using 16-bit floats, which has a negligible impact on validation accuracy (less than 0.02%). Finally, we fuse batchnorm layers into two vectors, which can be done with algebraic manipulation and is a trick normally used to speed up inference. Please refer to the appendix for a detailed breakdown of the bit allocation in our models.
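For completeness, the snippet below sketches the standard batchnorm folding referred to above: in inference mode a BatchNorm layer reduces to a per-channel scale and shift, which is why it can be stored as two vectors (our own illustration):

import torch

def fuse_batchnorm(bn):
    """Fold an inference-mode BatchNorm layer into a per-channel (scale, shift)
    pair, so that bn(x) == x * scale + shift (standard folding, sketched here)."""
    std = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / std
    shift = bn.bias - bn.running_mean * scale
    return scale, shift

bn = torch.nn.BatchNorm2d(64).eval()
x = torch.randn(2, 64, 8, 8)
scale, shift = fuse_batchnorm(bn)
fused = x * scale.view(1, -1, 1, 1) + shift.view(1, -1, 1, 1)
assert torch.allclose(bn(x), fused, atol=1e-6)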

Hyperparameters: We use a batch size of 128 for ResNet-18 and a batch size of 64 for ResNet-50. For annealed k-means, we implement SR-C on the GPU, and run it for 1 000 iterations. We fine-tune the codebooks for 9 epochs using Adam [27] with an initial learning rate of 10−3, which is


                     Ratio   Size       Acc.    Gap
Semi-sup R50 [55]    –       97.50 MB   79.30   –
BGD [48]             19×     5.20 MB    76.12   3.18
Semi-sup R50 [55]    –       97.50 MB   *78.72  –
Our PQF              19×     5.09 MB    77.15   1.57

Table 2: ImageNet classification starting from a semi-supervised ResNet-50. We set a new state of the art in terms of accuracy vs. model size. *Reproduced from downloaded model.

gradually reduced to 10−6 using cosine annealing [37]. Fine-tuning is the most expensive part of this process, and takes around 8 hours both for ResNet-18 (with 1 GPU) and for ResNet-50 (with 4 GPUs). In the latter case, we scale the learning rate by a factor of 4, following Goyal et al. [14]. For permutation optimization, we perform 1 000 local search iterations; this is done on the CPU in parallel for each independent permutation. This process takes less than 5 minutes for ResNet-18, and about 10 minutes for ResNet-50 on a 12-core CPU.

Baselines: We compare the results of our method against a variety of network compression methods: Binary Weight Network (BWN) [44], Trained Ternary Quantization (TTQ) [57], ABC-Net [35], LR-Net [45], Deep Compression (DC) [17], Hardware-Aware Automated Quantization (HAQ) [52], CLIP-Q [51], Hessian AWare Quantization of Neural Networks with Mixed Precision (HAWQ) [9], and HAWQ-V2 [8]. We compare extensively against the recently-proposed Bit Goes Down (BGD) method of [48] because it is the current state of the art by a large margin. BGD uses as initialization the method due to Wu et al. [54], and thus subsumes it. All results presented are taken either from the original papers, or from two additional surveys [3, 15].

4.1. Image Classification

A summary of our results can be found in Figure 3. From the Figure, it is clear that our method outperforms all its competitors. On ResNet-18, for example, we can surpass the performance of ABC-Net (M=5) with our small blocks models at roughly 3× the compression rate. Our biggest improvement generally comes from higher compression rates, and is especially apparent for the larger ResNet-50. When using large blocks and k = 256 centroids, we obtain a top-1 accuracy of 72.18% using only ∼3 MB of memory. This represents an absolute ∼4% improvement over the state of the art. On ResNet-50, our method consistently reduces the remaining error by 40-60% w.r.t. the state of the art.

Semi-supervised ResNet-50: We also benchmark our method using a stronger backbone as a starting point. We start from the recently released ResNet-50 model due to Yalniz et al. [55], which has been pre-trained on unlabelled images from the YFCC100M dataset [49], and fine-tuned on

Perm.  SR-C  Adam    Acc.    ∆
                     62.29   −1.02
             ✓       62.55   −0.76
       ✓     ✓       62.92   −0.39
✓      ✓     ✓       63.31    0.00

Table 3: Ablation study. ResNet-18 on ImageNet w/ large blocks.

ImageNet. While the accuracy of this model is reported to be 79.30%, we obtain a slightly lower 78.72% after downloading the publicly-available model3 (contacting the authors, we learned that the previous, slightly more accurate model is no longer available for download). We use the small blocks compression regime with k = 256, mirroring the procedure described previously.

We show our results in Table 2, where our model attains a top-1 accuracy of 77.15%. This means that we are able to outperform previous work by over 1% absolute accuracy, with a much smaller gap w.r.t. the uncompressed model. We find this result particularly interesting, as we originally expected distillation to be necessary to transfer the knowledge of the larger network pretrained on a large corpus of unlabeled images. However, our results show that at least part of this knowledge is retained through the initialization and structure that low-error clustering imposes on the compressed network.

Ablation study: In Table 3, we show results for ResNet-18 using large blocks, for which we obtain a final accuracy of 63.31%. We add permutation optimization (Sec. 3.2.1), annealed k-means as opposed to plain k-means (called SR-C in Sec. 3.2.2), and the use of the Adam optimizer with cosine annealing instead of plain SGD, as in previous work. From the Table, we can see that all our components are important and complementary to achieve top accuracy. It is also interesting to note that a baseline that simply does k-means and SGD fine-tuning is already ∼1% better than the current state of the art. Since both annealed k-means and permutation optimization directly reduce quantization error before fine-tuning, these experiments demonstrate that minimizing the quantization error of the weights leads to higher final network accuracy.

4.2. Object Detection and Segmentation

We also benchmark our method on the task of object detection by compressing the popular ResNet-50 Mask-RCNN FPN architecture [20] using the MS COCO 2017 dataset [34]. We start from the pretrained model available on the PyTorch model zoo, and apply the same procedure described above for all the convolutional and linear layers (plus one deconvolutional layer, which we treat as a convolutional layer for the

3 https://github.com/facebookresearch/semi-supervised-ImageNet1K-models


[Figure 3 plots: top-1 accuracy (%) vs. compression ratio. Left: ResNet-18 on ImageNet, with reference methods ABC-Net (M=1, 3, 5), TTQ, BWN, and LR-Net (binary, ternary), and an in-plot gap annotation of 2.13%. Right: ResNet-50 on ImageNet, with reference methods DC (2, 3, 4 bits), HAQ (2, 3, 4 bits), CLIP-Q, HAWQ, and HAWQ-V2, and an in-plot gap annotation of 3.97%. Both panels show our PQF (small and large blocks), BGD (small and large blocks), and the original model.]

Figure 3: Compression results on ResNet-18 and ResNet-50. We compare accuracy vs. model size, using models from the PyTorch zoo as a starting point. In general, our method achieves higher accuracy compared to previous work.

                                          Size       Ratio   APbb  APbb50  APbb75  APmk  APmk50  APmk75
RetinaNet [33] (uncompressed)             145.00 MB  –       35.6  –       –       –     –       –
Direct                                    18.13 MB   8.0×    31.5  –       –       –     –       –
FQN [32]                                  18.13 MB   8.0×    32.5  51.5    34.7    –     –       –
HAWQ-V2 [8]                               17.90 MB   8.1×    34.8  –       –       –     –       –
Mask-RCNN R-50 FPN [20] (uncompressed)    169.40 MB  –       37.9  59.2    41.1    34.6  56.0    36.8
BGD [48]                                  6.65 MB    26.0×   33.9  55.5    36.2    30.8  52.0    32.2
Our PQF                                   6.65 MB    26.0×   36.3  57.9    39.4    33.5  54.7    35.6

Table 4: Object detection results on MS COCO 2017. We compress a Mask R-CNN network with a ResNet-50 backbone, and include different object detection architectures used by other baselines. We report both bounding box (bb) and mask (mk) metrics for Mask R-CNN. We also report the accuracy at different IoU when available. The memory taken by [48] corresponds to the (correct) latest version on arXiv.

                               Ratio   APbb   APmk
Mask-RCNN R-50 FPN [20]        –       37.9   34.6
BGD [48]                       26.0×   33.9   30.8
Our PQF (no perm., no SR-C)    26.0×   35.6   33.0
Our PQF (no perm.)             26.0×   35.8   33.1
Our PQF (full)                 26.0×   36.3   33.5

Table 5: Ablation results on MS COCO 2017. Permutation optimization is particularly important for Mask-RCNN.

purpose of compression). We use the small blocks regime with k = 256 centroids, for a model of 6.65 MB.

We compress and fine-tune the network on a single Nvidia GTX 1080Ti GPU with a batch size of 2 for 4 epochs. As before, we use Adam [27] and cosine annealing [37], but with an initial learning rate of 5 × 10−5. Our results are presented in Table 4. We also compare against recent baselines such as the Fully Quantized Network (FQN) [32], and the second version of Hessian Aware Quantization (HAWQ-V2) [8], which showcase results compressing RetinaNet [33].

Our method obtains a box AP of 36.3, and a mask AP of 33.5, which represent improvements of 2.4% and 2.7% over the best previously reported result, closing the gap to the uncompressed model by 60-70%. Compared to BGD [48], we also use fewer computational resources, as they used 8 V100 GPUs and distributed training for compression, while we use a single 1080Ti GPU. In Table 5, we show again that using both SR-C and permutation optimization is crucial to obtain the best results. These results demonstrate the ability of our method to generalize to more complex tasks beyond image classification.

5. Conclusion

We have demonstrated that the quantization error of the weights of a neural network is inversely correlated with its accuracy after codebook fine-tuning. We have further proposed a method that exploits the functional equivalence of the network under permutation of its weights to find configurations of the weights that are easier to quantize. We have also shown that using an annealed k-means algorithm further reduces quantization error and improves final network accuracy. On ResNet-50, our method closes the relative gap to the uncompressed model by 40-70% compared to the previous state of the art in a variety of visual tasks.


Our optimization method consists of three stages that focus on different variables of the encoding. Future work may focus on techniques that jointly fine-tune the codes and the codebooks, or optimization methods that learn the weight permutation jointly with the codes and codebook. The determinant of the covariance of the weights is a continuous metric that could be minimized as the network is trained from scratch, resulting in networks that are easier to compress by design. Last but not least, demonstrating practical hardware acceleration on deep architectures that have been compressed with product codes also remains an open area of research.

References

[1] An Mei Chen, Haw-minn Lu, and Robert Hecht-Nielsen. On the geometry of feedforward neural network error surfaces. Neural Computation, 5(6):910–927, 1993.
[2] Weihan Chen, Peisong Wang, and Jian Cheng. Towards convolutional neural networks compression via global & progressive product quantization. In BMVC, 2020.
[3] Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey on model compression and acceleration for deep neural networks. CoRR, 2017.
[4] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1. arXiv preprint arXiv:1602.02830, 2016.
[5] Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, and Nando de Freitas. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, 2013.
[6] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, 2014.
[7] Xin Dong, Shangyu Chen, and Sinno Pan. Learning to prune deep neural networks via layer-wise optimal brain surgeon. In Advances in Neural Information Processing Systems, 2017.
[8] Zhen Dong, Zhewei Yao, Yaohui Cai, Daiyaan Arfreen, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. HAWQ-V2: Hessian aware trace-weighted quantization of neural networks. In Advances in Neural Information Processing Systems, 2020.
[9] Zhen Dong, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. Hessian-aware quantization of neural networks with mixed precision. In ICCV, 2019.
[10] Angela Fan, Pierre Stock, Benjamin Graham, Edouard Grave, Remi Gribonval, Herve Jegou, and Armand Joulin. Training with quantization noise for extreme model compression. In ICLR, 2021.
[11] Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. Optimized product quantization for approximate nearest neighbor search. In CVPR, 2013.
[12] Allen Gersho and Robert M Gray. Vector Quantization and Signal Compression, chapter 8, pages 228–243. Springer Science & Business Media, 1991.
[13] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
[14] Priya Goyal, Piotr Dollar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[15] Yunhui Guo. A survey on methods and theories of quantized neural networks. arXiv preprint arXiv:1808.04752, 2018.
[16] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. In Advances in Neural Information Processing Systems, 2016.
[17] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR, 2016.
[18] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, 2015.
[19] Babak Hassibi and David G Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems, 1993.
[20] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[22] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: AutoML for model compression and acceleration on mobile devices. In ECCV, 2018.
[23] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In ICCV, 2017.
[24] Holger H Hoos and Thomas Stutzle. Stochastic Local Search: Foundations and Applications. Elsevier, 2004.
[25] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. In BMVC, 2014.
[26] Herve Jegou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1), 2010.
[27] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[28] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[29] Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. In ICLR, 2015.
[30] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, 1990.
[31] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. In ICLR, 2016.
[32] Rundong Li, Yan Wang, Feng Liang, Hongwei Qin, Junjie Yan, and Rui Fan. Fully quantized network for object detection. In CVPR, 2019.
[33] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. Focal loss for dense object detection. In ICCV, pages 2980–2988, 2017.
[34] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[35] Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems, 2017.
[36] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In ICCV, 2017.
[37] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In ICLR, 2017.
[38] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. In ICCV, 2017.
[39] Brais Martinez, Jing Yang, Adrian Bulat, and Georgios Tzimiropoulos. Training binary neural networks with real-to-binary convolutions. In ICLR, 2020.
[40] Julieta Martinez, Shobhit Zakhmi, Holger H Hoos, and James J Little. LSQ++: Lower running time and higher recall in multi-codebook quantization. In ECCV, 2018.
[41] James O'Neill. An overview of neural network compression. arXiv preprint arXiv:2006.03669, 2020.
[42] Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems, 2015.
[43] A Emin Orhan and Xaq Pitkow. Skip connections eliminate singularities. In ICLR, 2018.
[44] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, 2016.
[45] Oran Shayer, Dan Levi, and Ethan Fetaya. Learning discrete weights using the local reparameterization trick. In ICLR, 2018.
[46] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[47] Sanghyun Son, Seungjun Nah, and Kyoung Mu Lee. Clustering convolutional kernels to compress deep neural networks. In ECCV, pages 216–232, 2018.
[48] Pierre Stock, Armand Joulin, Remi Gribonval, Benjamin Graham, and Herve Jegou. And the bit goes down: Revisiting the quantization of neural networks. In ICLR, 2020.
[49] Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.
[50] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.
[51] Frederick Tung and Greg Mori. Deep neural network compression by in-parallel pruning-quantization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[52] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. HAQ: Hardware-aware automated quantization with mixed precision. In CVPR, 2019.
[53] Ziwei Wang, Jiwen Lu, Chenxin Tao, Jie Zhou, and Qi Tian. Learning channel-wise interactions for binary convolutional neural networks. In CVPR, 2019.
[54] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convolutional neural networks for mobile devices. In CVPR, 2016.
[55] I. Zeki Yalniz, Herve Jegou, Kan Chen, Manohar Paluri, and Dhruv Mahajan. Billion-scale semi-supervised learning for image classification. arXiv preprint arXiv:1905.00546, 2019.
[56] Kenneth Zeger, Jacques Vaisey, Allen Gersho, et al. Globally optimal vector quantizer design by stochastic relaxation. IEEE Transactions on Signal Processing, 40(2):310–322, 1992.
[57] Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. In ICLR, 2017.


Appendices

A: Permutation initialization for K × K convolutions

In Section 3.2.1 we describe our goal of finding a permutation Pt of the rows of the weight matrix Wt, to create a new matrix PtWt whose subvectors {w^{Pt}_{i,j}} are easier to quantize. The algorithm as described works for a linear layer; that is, when Wt ∈ Rm×n.

We also describe in Section 3.1 how we reshape the weights of convolutional layers, W ∈ RCin×Cout×K×K, into a matrix of shape Wr ∈ RCinK²×Cout before permutation and quantization. In the case when K = 1 (i.e., pointwise convolutions), we can treat the resulting reshaped weight Wr as a linear layer, and apply the same algorithm described in the main paper. We now show the generalization of our permutation initialization algorithm to K × K convolutional layers for K > 1.

K × K convolutions have a natural partition for quantization; namely, clustering the K² contiguous values that make up a single K × K convolutional filter. This natural order has been exploited in previous work [47, 48], and lends itself to acceleration with lookup tables, as commonly done on product-quantized databases [26]. Maintaining this partition is also necessary if we want to optimize permutations, as applying a permutation on the Cout dimension of a parent layer has the effect of permuting the order of the channels of the output tensor, and applying the same permutation on the Cin dimension of children layers preserves the function expressed by the network – this naturally keeps K × K convolutional filters together.

Since the transformed convolution weight matrix Wr ∈ RCinK²×Cout contains the K² values of each filter stacked across its rows, we preserve this order by limiting the permutation to only move groups of K² contiguous rows. In practice, this forces the permutation matrix Pt (of size CinK² × CinK²) to have a block structure:

P_t =
\begin{bmatrix}
P^t_{1,1} & \cdots & P^t_{1,C_{\mathrm{in}}} \\
\vdots & \ddots & \vdots \\
P^t_{C_{\mathrm{in}},1} & \cdots & P^t_{C_{\mathrm{in}},C_{\mathrm{in}}}
\end{bmatrix} ,   (20)

where every submatrix P^t_{i,j} ∈ {0_{K×K}, I_{K×K}} is a K × K matrix with either all zeros or the identity matrix. Please refer to Figure 4 for an illustration of how this matrix permutes blocks of weights. As before, our goal is to minimize the determinant of the covariance of the resulting subvectors {w^{Pt}_{i,j}}. We follow Ge et al. [11] and observe that any square, positive semidefinite matrix Σ may be decomposed into M² square blocks Σi,j:

\Sigma =
\begin{bmatrix}
\Sigma_{1,1} & \cdots & \Sigma_{1,M} \\
\vdots & \ddots & \vdots \\
\Sigma_{M,1} & \cdots & \Sigma_{M,M}
\end{bmatrix} .   (21)

In this case, Fischer's inequality states that

|Σ| ≤ ∏_{i=1}^{M} |Σ_{i,i}| ;   (22)

in other words, the determinant of Σ is bounded by the product of the determinants of its block-diagonal submatrices. Note that Hadamard's inequality (which we use in our main paper) can be seen as a corollary of this result. To recap, our previously-described algorithm finds an initial permutation that minimizes the product of the diagonal elements by:

1. Creating d buckets

2. Computing the variance of each row of PWr

3. Assigning each row to the non-full bucket that yields the lowest variance for the dimensions in that bucket.

Thus, we can modify our algorithm to find an initial permutation that minimizes the product of the block-diagonal submatrices as follows:

1. Instead of creating d buckets, we create d/K² buckets

2. Instead of computing the variance of each row of PWr, we compute the determinant of the covariance of every K² contiguous rows in PWr

3. We assign each group of K² contiguous row indices to the non-full bucket that yields the lowest determinant of the covariance for the dimensions in that bucket.

Figure 4: Permutation optimization of a 3 × 3 convolutional layer. Our goal is to find a permutation P such that the resulting input to k-means is easier to quantize. We construct P with a block structure that swaps K² = 9 contiguous rows at a time, preserving the natural structure of 3 × 3 convolutions.

Note that for pointwise convolutions, i.e., when K = 1, the above steps naturally yield the algorithm that we described in the main paper.

After initialization, we also use our iterated local search (ILS) algorithm (described in Section 3.2.2) to find improved permutations. However, instead of swapping pairs of rows of PWr, we swap groups of K² contiguous rows at a time.

Finally, note that this implies that in our current setting we only optimize the permutation of K × K convolutions when the block size is at least 2K². In practice, this occurs when dK = 2K² = 2 · 3² = 18 in Table 1, which corresponds to the large blocks compression regime. On the other hand, permutations for 1 × 1 convolutions can be optimized under both large and small blocks regimes.
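The block-structured search can be sketched as follows (ours; it assumes d is a multiple of K² and shows only the local-search refinement, with the Fischer-inequality initialization omitted):

import numpy as np

def block_local_search(Wr, K, d, iters=1000, seed=0):
    """Local-search refinement for K x K convolutions (sketch, ours): swap groups
    of K*K contiguous rows of the reshaped weight Wr (Cin*K^2 x Cout) and keep a
    swap if the determinant of the subvector covariance decreases."""
    rng = np.random.default_rng(seed)
    g = K * K                                       # rows move in groups of K^2
    n_groups = Wr.shape[0] // g
    n_cols = Wr.shape[1]

    def logdet(group_order):
        rows = np.concatenate([np.arange(i * g, (i + 1) * g) for i in group_order])
        sub = Wr[rows].reshape(-1, d, n_cols).transpose(0, 2, 1).reshape(-1, d)
        return np.linalg.slogdet(np.cov(sub, rowvar=False))[1]

    order = np.arange(n_groups)
    best = logdet(order)
    for _ in range(iters):
        i, j = rng.choice(n_groups, size=2, replace=False)
        cand = order.copy()
        cand[i], cand[j] = cand[j], cand[i]
        score = logdet(cand)
        if score < best:
            order, best = cand, score
    return order                                    # order of the K^2-row groups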

B: Bit allocation

Tables 6 and 7 show the bit allocation of ResNet-18 and ResNet-50 under the small and large blocks compression regimes, respectively. Note that we borrow the bit allocation from [48], so our compression ratios match theirs exactly. As we mention in the main paper, we store the codes using the smallest number of bits possible, i.e., 8 bits for codebooks of size k = 256, 9 bits for k = 512, 10 bits for k = 1024, and 11 bits for k = 2048. We also store the codebooks using float16, and layers that are left uncompressed, such as all the batch normalization layers, are stored with float32. We observe that the codes tend to take more memory than the codebooks, and that the linear layer tends to be the most memory-heavy part of the network.
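These counts can be verified directly from the table entries. The short Python snippet below (ours; the helper names are purely illustrative) reproduces a few rows of Table 6 from the storage scheme just described.

import math

def codebook_bits(k, d):                 # k centroids of dimension d, stored in float16
    return k * d * 16

def codes_bits(n_codes, k):              # one integer index per subvector
    return n_codes * math.ceil(math.log2(k))

# layer1.0.conv1 in Table 6: codebook (256, 9), codes matrix (64, 64)
print(codebook_bits(256, 9))             # 36864 bits, as reported
print(codes_bits(64 * 64, 256))          # 32768 bits, as reported

# fc in Table 6: codebook (2048, 4), codes matrix (1000, 128), hence 11-bit codes
print(codebook_bits(2048, 4))            # 131072 bits
print(codes_bits(1000 * 128, 2048))      # 1408000 bits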

C: Permutation groups

There are 12 permutation groups for ResNet-18, and 37 groups for ResNet-50. Listing 1 shows the permutation groups for the default PyTorch implementation of ResNet-18, and Listings 2 and 3 contain those for ResNet-50. These sets of parents and children determine the maximum number of independent permutations that we can apply without changing the function expressed by the network. While we obtained these permutation groups by hand, we believe that computing them automatically for arbitrary networks is an interesting area of future work.
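To illustrate how one of these groups is used, the sketch below (ours, not the released tooling) applies a random permutation to the first group of Listing 1 on a torchvision ResNet-18: the parents are permuted along their output dimension (including the batch-norm parameters and running statistics), the child along its input dimension, and the network output is unchanged.

import torch
import torchvision.models as models

model = models.resnet18().eval()          # randomly initialized weights are enough here
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    ref = model(x)

    # First group of Listing 1: parents [layer1.0.conv1, layer1.0.bn1], children [layer1.0.conv2].
    perm = torch.randperm(64)             # layer1.0.conv1 has 64 output channels
    conv, bn = model.layer1[0].conv1, model.layer1[0].bn1
    conv.weight.copy_(conv.weight[perm])  # parents: permute along C_out
    for t in (bn.weight, bn.bias, bn.running_mean, bn.running_var):
        t.copy_(t[perm])
    child = model.layer1[0].conv2
    child.weight.copy_(child.weight[:, perm])  # children: permute along C_in

    out = model(x)

print(torch.allclose(ref, out, atol=1e-4))     # True: the network expresses the same function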


Name layer type shape dtype bits
conv1.weight Conv2d (64, 3, 7, 7) torch.float32 301056
bn1.weight BatchNorm2d (64,) torch.float32 2048
bn1.bias (64,) torch.float32 2048
layer1.0.conv1.codebook Conv2d (256, 9) torch.float16 36864
layer1.0.conv1.codes matrix (64, 64) torch.uint8 32768
layer1.0.bn1.weight BatchNorm2d (64,) torch.float32 2048
layer1.0.bn1.bias (64,) torch.float32 2048
layer1.0.conv2.codebook Conv2d (256, 9) torch.float16 36864
layer1.0.conv2.codes matrix (64, 64) torch.uint8 32768
layer1.0.bn2.weight BatchNorm2d (64,) torch.float32 2048
layer1.0.bn2.bias (64,) torch.float32 2048
layer1.1.conv1.codebook Conv2d (256, 9) torch.float16 36864
layer1.1.conv1.codes matrix (64, 64) torch.uint8 32768
layer1.1.bn1.weight BatchNorm2d (64,) torch.float32 2048
layer1.1.bn1.bias (64,) torch.float32 2048
layer1.1.conv2.codebook Conv2d (256, 9) torch.float16 36864
layer1.1.conv2.codes matrix (64, 64) torch.uint8 32768
layer1.1.bn2.weight BatchNorm2d (64,) torch.float32 2048
layer1.1.bn2.bias (64,) torch.float32 2048
layer2.0.conv1.codebook Conv2d (256, 9) torch.float16 36864
layer2.0.conv1.codes matrix (128, 64) torch.uint8 65536
layer2.0.bn1.weight BatchNorm2d (128,) torch.float32 4096
layer2.0.bn1.bias (128,) torch.float32 4096
layer2.0.conv2.codebook Conv2d (256, 9) torch.float16 36864
layer2.0.conv2.codes matrix (128, 128) torch.uint8 131072
layer2.0.bn2.weight BatchNorm2d (128,) torch.float32 4096
layer2.0.bn2.bias (128,) torch.float32 4096
layer2.0.downsample.0.codebook Conv2d (256, 4) torch.float16 16384
layer2.0.downsample.0.codes matrix (128, 16) torch.uint8 16384
layer2.0.downsample.1.weight Conv2d (128,) torch.float32 4096
layer2.0.downsample.1.bias (128,) torch.float32 4096
layer2.1.conv1.codebook Conv2d (256, 9) torch.float16 36864
layer2.1.conv1.codes matrix (128, 128) torch.uint8 131072
layer2.1.bn1.weight BatchNorm2d (128,) torch.float32 4096
layer2.1.bn1.bias (128,) torch.float32 4096
layer2.1.conv2.codebook Conv2d (256, 9) torch.float16 36864
layer2.1.conv2.codes matrix (128, 128) torch.uint8 131072
layer2.1.bn2.weight BatchNorm2d (128,) torch.float32 4096
layer2.1.bn2.bias (128,) torch.float32 4096
layer3.0.conv1.codebook Conv2d (256, 9) torch.float16 36864
layer3.0.conv1.codes matrix (256, 128) torch.uint8 262144
layer3.0.bn1.weight BatchNorm2d (256,) torch.float32 8192
layer3.0.bn1.bias (256,) torch.float32 8192
layer3.0.conv2.codebook Conv2d (256, 9) torch.float16 36864
layer3.0.conv2.codes matrix (256, 256) torch.uint8 524288
layer3.0.bn2.weight BatchNorm2d (256,) torch.float32 8192
layer3.0.bn2.bias (256,) torch.float32 8192
layer3.0.downsample.0.codebook Conv2d (256, 4) torch.float16 16384
layer3.0.downsample.0.codes matrix (256, 32) torch.uint8 65536
layer3.0.downsample.1.weight Conv2d (256,) torch.float32 8192
layer3.0.downsample.1.bias (256,) torch.float32 8192
layer3.1.conv1.codebook Conv2d (256, 9) torch.float16 36864
layer3.1.conv1.codes matrix (256, 256) torch.uint8 524288
layer3.1.bn1.weight BatchNorm2d (256,) torch.float32 8192
layer3.1.bn1.bias (256,) torch.float32 8192
layer3.1.conv2.codebook Conv2d (256, 9) torch.float16 36864
layer3.1.conv2.codes matrix (256, 256) torch.uint8 524288
layer3.1.bn2.weight BatchNorm2d (256,) torch.float32 8192
layer3.1.bn2.bias (256,) torch.float32 8192
layer4.0.conv1.codebook Conv2d (256, 9) torch.float16 36864
layer4.0.conv1.codes matrix (512, 256) torch.uint8 1048576
layer4.0.bn1.weight BatchNorm2d (512,) torch.float32 16384
layer4.0.bn1.bias (512,) torch.float32 16384
layer4.0.conv2.codebook Conv2d (256, 9) torch.float16 36864
layer4.0.conv2.codes matrix (512, 512) torch.uint8 2097152
layer4.0.bn2.weight BatchNorm2d (512,) torch.float32 16384
layer4.0.bn2.bias (512,) torch.float32 16384
layer4.0.downsample.0.codebook Conv2d (256, 4) torch.float16 16384
layer4.0.downsample.0.codes matrix (512, 64) torch.uint8 262144
layer4.0.downsample.1.weight Conv2d (512,) torch.float32 16384
layer4.0.downsample.1.bias (512,) torch.float32 16384
layer4.1.conv1.codebook Conv2d (256, 9) torch.float16 36864
layer4.1.conv1.codes matrix (512, 512) torch.uint8 2097152
layer4.1.bn1.weight BatchNorm2d (512,) torch.float32 16384
layer4.1.bn1.bias (512,) torch.float32 16384
layer4.1.conv2.codebook Conv2d (256, 9) torch.float16 36864
layer4.1.conv2.codes matrix (512, 512) torch.uint8 2097152
layer4.1.bn2.weight BatchNorm2d (512,) torch.float32 16384
layer4.1.bn2.bias (512,) torch.float32 16384
fc.bias Linear (1000,) torch.float32 32000
fc.codebook (2048, 4) torch.float16 131072
fc.codes matrix (1000, 128) torch.int16 1408000

total bits 12927232
total bytes 1615904
total KB 1578.03
total MB 1.54 MB

Table 6: Bit allocation for ResNet-18. Small blocks regime, k = 256.


Name layer type shape dtype bits
conv1.weight Conv2d (64, 3, 7, 7) torch.float32 301056
bn1.weight BatchNorm2d (64,) torch.float32 2048
bn1.bias (64,) torch.float32 2048
layer1.0.conv1.codebook Conv2d (128, 8) torch.float16 16384
layer1.0.conv1.codes matrix (64, 8) torch.uint8 3584
layer1.0.bn1.weight BatchNorm2d (64,) torch.float32 2048
layer1.0.bn1.bias (64,) torch.float32 2048
layer1.0.conv2.codebook Conv2d (256, 18) torch.float16 73728
layer1.0.conv2.codes matrix (64, 32) torch.uint8 16384
layer1.0.bn2.weight BatchNorm2d (64,) torch.float32 2048
layer1.0.bn2.bias (64,) torch.float32 2048
layer1.0.conv3.codebook Conv2d (256, 8) torch.float16 32768
layer1.0.conv3.codes matrix (256, 8) torch.uint8 16384
layer1.0.bn3.weight BatchNorm2d (256,) torch.float32 8192
layer1.0.bn3.bias (256,) torch.float32 8192
layer1.0.downsample.0.codebook Conv2d (256, 8) torch.float16 32768
layer1.0.downsample.0.codes matrix (256, 8) torch.uint8 16384
layer1.0.downsample.1.weight Conv2d (256,) torch.float32 8192
layer1.0.downsample.1.bias (256,) torch.float32 8192
layer1.1.conv1.codebook Conv2d (256, 8) torch.float16 32768
layer1.1.conv1.codes matrix (64, 32) torch.uint8 16384
layer1.1.bn1.weight BatchNorm2d (64,) torch.float32 2048
layer1.1.bn1.bias (64,) torch.float32 2048
layer1.1.conv2.codebook Conv2d (256, 18) torch.float16 73728
layer1.1.conv2.codes matrix (64, 32) torch.uint8 16384
layer1.1.bn2.weight BatchNorm2d (64,) torch.float32 2048
layer1.1.bn2.bias (64,) torch.float32 2048
layer1.1.conv3.codebook Conv2d (256, 8) torch.float16 32768
layer1.1.conv3.codes matrix (256, 8) torch.uint8 16384
layer1.1.bn3.weight BatchNorm2d (256,) torch.float32 8192
layer1.1.bn3.bias (256,) torch.float32 8192
layer1.2.conv1.codebook Conv2d (256, 8) torch.float16 32768
layer1.2.conv1.codes matrix (64, 32) torch.uint8 16384
layer1.2.bn1.weight BatchNorm2d (64,) torch.float32 2048
layer1.2.bn1.bias (64,) torch.float32 2048
layer1.2.conv2.codebook Conv2d (256, 18) torch.float16 73728
layer1.2.conv2.codes matrix (64, 32) torch.uint8 16384
layer1.2.bn2.weight BatchNorm2d (64,) torch.float32 2048
layer1.2.bn2.bias (64,) torch.float32 2048
layer1.2.conv3.codebook Conv2d (256, 8) torch.float16 32768
layer1.2.conv3.codes matrix (256, 8) torch.uint8 16384
layer1.2.bn3.weight BatchNorm2d (256,) torch.float32 8192
layer1.2.bn3.bias (256,) torch.float32 8192
layer2.0.conv1.codebook Conv2d (256, 8) torch.float16 32768
layer2.0.conv1.codes matrix (128, 32) torch.uint8 32768
layer2.0.bn1.weight BatchNorm2d (128,) torch.float32 4096
layer2.0.bn1.bias (128,) torch.float32 4096
layer2.0.conv2.codebook Conv2d (256, 18) torch.float16 73728
layer2.0.conv2.codes matrix (128, 64) torch.uint8 65536
layer2.0.bn2.weight BatchNorm2d (128,) torch.float32 4096
layer2.0.bn2.bias (128,) torch.float32 4096
layer2.0.conv3.codebook Conv2d (256, 8) torch.float16 32768
layer2.0.conv3.codes matrix (512, 16) torch.uint8 65536
layer2.0.bn3.weight BatchNorm2d (512,) torch.float32 16384
layer2.0.bn3.bias (512,) torch.float32 16384
layer2.0.downsample.0.codebook Conv2d (256, 8) torch.float16 32768
layer2.0.downsample.0.codes matrix (512, 32) torch.uint8 131072
layer2.0.downsample.1.weight Conv2d (512,) torch.float32 16384
layer2.0.downsample.1.bias (512,) torch.float32 16384
layer2.1.conv1.codebook Conv2d (256, 8) torch.float16 32768
layer2.1.conv1.codes matrix (128, 64) torch.uint8 65536
layer2.1.bn1.weight BatchNorm2d (128,) torch.float32 4096
layer2.1.bn1.bias (128,) torch.float32 4096
layer2.1.conv2.codebook Conv2d (256, 18) torch.float16 73728
layer2.1.conv2.codes matrix (128, 64) torch.uint8 65536
layer2.1.bn2.weight BatchNorm2d (128,) torch.float32 4096
layer2.1.bn2.bias (128,) torch.float32 4096
layer2.1.conv3.codebook Conv2d (256, 8) torch.float16 32768
layer2.1.conv3.codes matrix (512, 16) torch.uint8 65536
layer2.1.bn3.weight BatchNorm2d (512,) torch.float32 16384
layer2.1.bn3.bias (512,) torch.float32 16384
layer2.2.conv1.codebook Conv2d (256, 8) torch.float16 32768
layer2.2.conv1.codes matrix (128, 64) torch.uint8 65536
layer2.2.bn1.weight BatchNorm2d (128,) torch.float32 4096
layer2.2.bn1.bias (128,) torch.float32 4096
layer2.2.conv2.codebook Conv2d (256, 18) torch.float16 73728
layer2.2.conv2.codes matrix (128, 64) torch.uint8 65536
layer2.2.bn2.weight BatchNorm2d (128,) torch.float32 4096
layer2.2.bn2.bias (128,) torch.float32 4096
layer2.2.conv3.codebook Conv2d (256, 8) torch.float16 32768
layer2.2.conv3.codes matrix (512, 16) torch.uint8 65536
layer2.2.bn3.weight BatchNorm2d (512,) torch.float32 16384
layer2.2.bn3.bias (512,) torch.float32 16384
layer2.3.conv1.codebook Conv2d (256, 8) torch.float16 32768
layer2.3.conv1.codes matrix (128, 64) torch.uint8 65536
layer2.3.bn1.weight BatchNorm2d (128,) torch.float32 4096
layer2.3.bn1.bias (128,) torch.float32 4096
layer2.3.conv2.codebook Conv2d (256, 18) torch.float16 73728
layer2.3.conv2.codes matrix (128, 64) torch.uint8 65536
layer2.3.bn2.weight BatchNorm2d (128,) torch.float32 4096
layer2.3.bn2.bias (128,) torch.float32 4096
layer2.3.conv3.codebook Conv2d (256, 8) torch.float16 32768
layer2.3.conv3.codes matrix (512, 16) torch.uint8 65536
layer2.3.bn3.weight BatchNorm2d (512,) torch.float32 16384
layer2.3.bn3.bias (512,) torch.float32 16384
layer3.0.conv1.codebook Conv2d (256, 8) torch.float16 32768
layer3.0.conv1.codes matrix (256, 64) torch.uint8 131072
layer3.0.bn1.weight BatchNorm2d (256,) torch.float32 8192
layer3.0.bn1.bias (256,) torch.float32 8192
layer3.0.conv2.codebook Conv2d (256, 18) torch.float16 73728
layer3.0.conv2.codes matrix (256, 128) torch.uint8 262144
layer3.0.bn2.weight BatchNorm2d (256,) torch.float32 8192
layer3.0.bn2.bias (256,) torch.float32 8192
layer3.0.conv3.codebook Conv2d (256, 8) torch.float16 32768
layer3.0.conv3.codes matrix (1024, 32) torch.uint8 262144
layer3.0.bn3.weight BatchNorm2d (1024,) torch.float32 32768
layer3.0.bn3.bias (1024,) torch.float32 32768
layer3.0.downsample.0.codebook Conv2d (256, 8) torch.float16 32768
layer3.0.downsample.0.codes matrix (1024, 64) torch.uint8 524288
layer3.0.downsample.1.weight Conv2d (1024,) torch.float32 32768
layer3.0.downsample.1.bias (1024,) torch.float32 32768
layer3.1.conv1.codebook Conv2d (256, 8) torch.float16 32768
layer3.1.conv1.codes matrix (256, 128) torch.uint8 262144
layer3.1.bn1.weight BatchNorm2d (256,) torch.float32 8192
layer3.1.bn1.bias (256,) torch.float32 8192
layer3.1.conv2.codebook Conv2d (256, 18) torch.float16 73728
layer3.1.conv2.codes matrix (256, 128) torch.uint8 262144
layer3.1.bn2.weight BatchNorm2d (256,) torch.float32 8192
layer3.1.bn2.bias (256,) torch.float32 8192
layer3.1.conv3.codebook Conv2d (256, 8) torch.float16 32768
layer3.1.conv3.codes matrix (1024, 32) torch.uint8 262144
layer3.1.bn3.weight BatchNorm2d (1024,) torch.float32 32768
layer3.1.bn3.bias (1024,) torch.float32 32768
layer3.2.conv1.codebook Conv2d (256, 8) torch.float16 32768
layer3.2.conv1.codes matrix (256, 128) torch.uint8 262144
layer3.2.bn1.weight BatchNorm2d (256,) torch.float32 8192
layer3.2.bn1.bias (256,) torch.float32 8192
layer3.2.conv2.codebook Conv2d (256, 18) torch.float16 73728
layer3.2.conv2.codes matrix (256, 128) torch.uint8 262144
layer3.2.bn2.weight BatchNorm2d (256,) torch.float32 8192
layer3.2.bn2.bias (256,) torch.float32 8192
layer3.2.conv3.codebook Conv2d (256, 8) torch.float16 32768
layer3.2.conv3.codes matrix (1024, 32) torch.uint8 262144
layer3.2.bn3.weight BatchNorm2d (1024,) torch.float32 32768
layer3.2.bn3.bias (1024,) torch.float32 32768
layer3.3.conv1.codebook Conv2d (256, 8) torch.float16 32768
layer3.3.conv1.codes matrix (256, 128) torch.uint8 262144
layer3.3.bn1.weight BatchNorm2d (256,) torch.float32 8192
layer3.3.bn1.bias (256,) torch.float32 8192
layer3.3.conv2.codebook Conv2d (256, 18) torch.float16 73728
layer3.3.conv2.codes matrix (256, 128) torch.uint8 262144
layer3.3.bn2.weight BatchNorm2d (256,) torch.float32 8192
layer3.3.bn2.bias (256,) torch.float32 8192
layer3.3.conv3.codebook Conv2d (256, 8) torch.float16 32768
layer3.3.conv3.codes matrix (1024, 32) torch.uint8 262144
layer3.3.bn3.weight BatchNorm2d (1024,) torch.float32 32768
layer3.3.bn3.bias (1024,) torch.float32 32768
layer3.4.conv1.codebook Conv2d (256, 8) torch.float16 32768
layer3.4.conv1.codes matrix (256, 128) torch.uint8 262144
layer3.4.bn1.weight BatchNorm2d (256,) torch.float32 8192
layer3.4.bn1.bias (256,) torch.float32 8192
layer3.4.conv2.codebook Conv2d (256, 18) torch.float16 73728
layer3.4.conv2.codes matrix (256, 128) torch.uint8 262144
layer3.4.bn2.weight BatchNorm2d (256,) torch.float32 8192
layer3.4.bn2.bias (256,) torch.float32 8192
layer3.4.conv3.codebook Conv2d (256, 8) torch.float16 32768
layer3.4.conv3.codes matrix (1024, 32) torch.uint8 262144
layer3.4.bn3.weight BatchNorm2d (1024,) torch.float32 32768
layer3.4.bn3.bias (1024,) torch.float32 32768
layer3.5.conv1.codebook Conv2d (256, 8) torch.float16 32768
layer3.5.conv1.codes matrix (256, 128) torch.uint8 262144
layer3.5.bn1.weight BatchNorm2d (256,) torch.float32 8192
layer3.5.bn1.bias (256,) torch.float32 8192
layer3.5.conv2.codebook Conv2d (256, 18) torch.float16 73728
layer3.5.conv2.codes matrix (256, 128) torch.uint8 262144
layer3.5.bn2.weight BatchNorm2d (256,) torch.float32 8192
layer3.5.bn2.bias (256,) torch.float32 8192
layer3.5.conv3.codebook Conv2d (256, 8) torch.float16 32768
layer3.5.conv3.codes matrix (1024, 32) torch.uint8 262144
layer3.5.bn3.weight BatchNorm2d (1024,) torch.float32 32768
layer3.5.bn3.bias (1024,) torch.float32 32768
layer4.0.conv1.codebook Conv2d (256, 8) torch.float16 32768
layer4.0.conv1.codes matrix (512, 128) torch.uint8 524288
layer4.0.bn1.weight BatchNorm2d (512,) torch.float32 16384
layer4.0.bn1.bias (512,) torch.float32 16384
layer4.0.conv2.codebook Conv2d (256, 18) torch.float16 73728
layer4.0.conv2.codes matrix (512, 256) torch.uint8 1048576
layer4.0.bn2.weight BatchNorm2d (512,) torch.float32 16384
layer4.0.bn2.bias (512,) torch.float32 16384
layer4.0.conv3.codebook Conv2d (256, 8) torch.float16 32768
layer4.0.conv3.codes matrix (2048, 64) torch.uint8 1048576
layer4.0.bn3.weight BatchNorm2d (2048,) torch.float32 65536
layer4.0.bn3.bias (2048,) torch.float32 65536
layer4.0.downsample.0.codebook Conv2d (256, 8) torch.float16 32768
layer4.0.downsample.0.codes matrix (2048, 128) torch.uint8 2097152
layer4.0.downsample.1.weight Conv2d (2048,) torch.float32 65536
layer4.0.downsample.1.bias (2048,) torch.float32 65536
layer4.1.conv1.codebook Conv2d (256, 8) torch.float16 32768
layer4.1.conv1.codes matrix (512, 256) torch.uint8 1048576
layer4.1.bn1.weight BatchNorm2d (512,) torch.float32 16384
layer4.1.bn1.bias (512,) torch.float32 16384
layer4.1.conv2.codebook Conv2d (256, 18) torch.float16 73728
layer4.1.conv2.codes matrix (512, 256) torch.uint8 1048576
layer4.1.bn2.weight BatchNorm2d (512,) torch.float32 16384
layer4.1.bn2.bias (512,) torch.float32 16384
layer4.1.conv3.codebook Conv2d (256, 8) torch.float16 32768
layer4.1.conv3.codes matrix (2048, 64) torch.uint8 1048576
layer4.1.bn3.weight BatchNorm2d (2048,) torch.float32 65536
layer4.1.bn3.bias (2048,) torch.float32 65536
layer4.2.conv1.codebook Conv2d (256, 8) torch.float16 32768
layer4.2.conv1.codes matrix (512, 256) torch.uint8 1048576
layer4.2.bn1.weight BatchNorm2d (512,) torch.float32 16384
layer4.2.bn1.bias (512,) torch.float32 16384
layer4.2.conv2.codebook Conv2d (256, 18) torch.float16 73728
layer4.2.conv2.codes matrix (512, 256) torch.uint8 1048576
layer4.2.bn2.weight BatchNorm2d (512,) torch.float32 16384
layer4.2.bn2.bias (512,) torch.float32 16384
layer4.2.conv3.codebook Conv2d (256, 8) torch.float16 32768
layer4.2.conv3.codes matrix (2048, 64) torch.uint8 1048576
layer4.2.bn3.weight BatchNorm2d (2048,) torch.float32 65536
layer4.2.bn3.bias (2048,) torch.float32 65536
fc.bias Linear (1000,) torch.float32 32000
fc.codebook (1024, 4) torch.float16 65536
fc.codes matrix (1000, 512) torch.int16 5120000

total bits 26718976
total bytes 3339872
total KB 3261.59
total MB 3.19 MB

Table 7: Bit allocation for ResNet-50. Large blocks regime, k = 256.


import torchvision.models as models
resnet18 = models.resnet18()

permutations:
-
  - parents: [layer1.0.conv1, layer1.0.bn1]
  - children: [layer1.0.conv2]
-
  - parents: [layer1.1.conv1, layer1.1.bn1]
  - children: [layer1.1.conv2]
-
  - parents: [layer2.0.conv1, layer2.0.bn1]
  - children: [layer2.0.conv2]
-
  - parents: [layer2.1.conv1, layer2.1.bn1]
  - children: [layer2.1.conv2]
-
  - parents: [layer3.0.conv1, layer3.0.bn1]
  - children: [layer3.0.conv2]
-
  - parents: [layer3.1.conv1, layer3.1.bn1]
  - children: [layer3.1.conv2]
-
  - parents: [layer4.0.conv1, layer4.0.bn1]
  - children: [layer4.0.conv2]
-
  - parents: [layer4.1.conv1, layer4.1.bn1]
  - children: [layer4.1.conv2]
-
  - parents: [
      conv1, bn1,
      layer1.0.conv2, layer1.0.bn2,
      layer1.1.conv2, layer1.1.bn2,
    ]
  - children: [
      layer1.0.conv1,
      layer2.0.downsample.0,
      layer1.1.conv1,
      layer2.0.conv1,
    ]
-
  - parents: [
      layer2.0.downsample.0, layer2.0.downsample.1,
      layer2.0.conv2, layer2.0.bn2,
      layer2.1.conv2, layer2.1.bn2,
    ]
  - children: [
      layer2.1.conv1,
      layer3.0.downsample.0,
      layer3.0.conv1,
    ]
-
  - parents: [
      layer3.0.downsample.0, layer3.0.downsample.1,
      layer3.0.conv2, layer3.0.bn2,
      layer3.1.conv2, layer3.1.bn2,
    ]
  - children: [
      layer3.1.conv1,
      layer4.0.downsample.0,
      layer4.0.conv1,
    ]
-
  - parents: [
      layer4.0.conv2, layer4.0.bn2,
      layer4.0.downsample.0, layer4.0.downsample.1,
      layer4.1.conv2, layer4.1.bn2,
    ]
  - children: [
      layer4.1.conv1,
      fc,
    ]

Listing 1 Permutation groups for ResNet-18.


import torchvision.models as models
resnet50 = models.resnet50()

permutations:
-
  - parents: [conv1, bn1]
  - children: [
      layer1.0.conv1,
      layer1.0.downsample.0
    ]
-
  - parents: [layer1.0.conv1, layer1.0.bn1]
  - children: [layer1.0.conv2]
-
  - parents: [layer1.0.conv2, layer1.0.bn2]
  - children: [layer1.0.conv3]
-
  - parents: [
      layer1.0.conv3, layer1.0.bn3,
      layer1.0.downsample.0, layer1.0.downsample.1,
      layer1.1.conv3, layer1.1.bn3,
      layer1.2.conv3, layer1.2.bn3,
    ]
  - children: [
      layer1.1.conv1,
      layer1.2.conv1,
      layer2.0.conv1,
      layer2.0.downsample.0
    ]
-
  - parents: [layer1.1.conv1, layer1.1.bn1]
  - children: [layer1.1.conv2]
-
  - parents: [layer1.1.conv2, layer1.1.bn2]
  - children: [layer1.1.conv3]
-
  - parents: [layer1.2.conv1, layer1.2.bn1]
  - children: [layer1.2.conv2]
-
  - parents: [layer1.2.conv2, layer1.2.bn2]
  - children: [layer1.2.conv3]
-
  - parents: [layer2.0.conv1, layer2.0.bn1]
  - children: [layer2.0.conv2]
-
  - parents: [layer2.0.conv2, layer2.0.bn2]
  - children: [layer2.0.conv3]
-
  - parents: [
      layer2.0.conv3, layer2.0.bn3,
      layer2.0.downsample.0, layer2.0.downsample.1,
      layer2.1.conv3, layer2.1.bn3,
      layer2.2.conv3, layer2.2.bn3,
      layer2.3.conv3, layer2.3.bn3,
    ]
  - children: [
      layer2.1.conv1,
      layer2.2.conv1,
      layer2.3.conv1,
      layer3.0.conv1,
      layer3.0.downsample.0
    ]
-
  - parents: [layer2.1.conv1, layer2.1.bn1]
  - children: [layer2.1.conv2]
-
  - parents: [layer2.1.conv2, layer2.1.bn2]
  - children: [layer2.1.conv3]
-
  - parents: [layer2.2.conv1, layer2.2.bn1]
  - children: [layer2.2.conv2]
-
  - parents: [layer2.2.conv2, layer2.2.bn2]
  - children: [layer2.2.conv3]
-
  - parents: [layer2.3.conv1, layer2.3.bn1]
  - children: [layer2.3.conv2]
-
  - parents: [layer2.3.conv2, layer2.3.bn2]
  - children: [layer2.3.conv3]
-
  - parents: [layer3.0.conv1, layer3.0.bn1]
  - children: [layer3.0.conv2]
-
  - parents: [layer3.0.conv2, layer3.0.bn2]
  - children: [layer3.0.conv3]

Listing 2 Permutation groups for ResNet-50 (part 1/2).


-
  - parents: [
      layer3.0.conv3, layer3.0.bn3,
      layer3.0.downsample.0, layer3.0.downsample.1,
      layer3.1.conv3, layer3.1.bn3,
      layer3.2.conv3, layer3.2.bn3,
      layer3.3.conv3, layer3.3.bn3,
      layer3.4.conv3, layer3.4.bn3,
      layer3.5.conv3, layer3.5.bn3,
    ]
  - children: [
      layer3.1.conv1,
      layer3.2.conv1,
      layer3.3.conv1,
      layer3.4.conv1,
      layer3.5.conv1,
      layer4.0.conv1,
      layer4.0.downsample.0
    ]
-
  - parents: [layer3.1.conv1, layer3.1.bn1]
  - children: [layer3.1.conv2]
-
  - parents: [layer3.1.conv2, layer3.1.bn2]
  - children: [layer3.1.conv3]
-
  - parents: [layer3.2.conv1, layer3.2.bn1]
  - children: [layer3.2.conv2]
-
  - parents: [layer3.2.conv2, layer3.2.bn2]
  - children: [layer3.2.conv3]
-
  - parents: [layer3.3.conv1, layer3.3.bn1]
  - children: [layer3.3.conv2]
-
  - parents: [layer3.3.conv2, layer3.3.bn2]
  - children: [layer3.3.conv3]
-
  - parents: [layer3.4.conv1, layer3.4.bn1]
  - children: [layer3.4.conv2]
-
  - parents: [layer3.4.conv2, layer3.4.bn2]
  - children: [layer3.4.conv3]
-
  - parents: [layer3.5.conv1, layer3.5.bn1]
  - children: [layer3.5.conv2]
-
  - parents: [layer3.5.conv2, layer3.5.bn2]
  - children: [layer3.5.conv3]
-
  - parents: [layer4.0.conv1, layer4.0.bn1]
  - children: [layer4.0.conv2]
-
  - parents: [layer4.0.conv2, layer4.0.bn2]
  - children: [layer4.0.conv3]
-
  - parents: [layer4.1.conv1, layer4.1.bn1]
  - children: [layer4.1.conv2]
-
  - parents: [layer4.1.conv2, layer4.1.bn2]
  - children: [layer4.1.conv3]
-
  - parents: [layer4.2.conv1, layer4.2.bn1]
  - children: [layer4.2.conv2]
-
  - parents: [layer4.2.conv2, layer4.2.bn2]
  - children: [layer4.2.conv3]
-
  - parents: [
      layer4.0.conv3, layer4.0.bn3,
      layer4.0.downsample.0, layer4.0.downsample.1,
      layer4.1.conv3, layer4.1.bn3,
      layer4.2.conv3, layer4.2.bn3,
    ]
  - children: [
      layer4.1.conv1,
      layer4.2.conv1,
      fc,
    ]

Listing 3 Permutation groups for ResNet-50 (part 2/2).
