Model Compression
Introduction
Choudhary+, [A comprehensive survey on model compression and acceleration](https://link.springer.com/content/pdf/10.1007/s10462-020-09816-7.pdf), Artificial Intelligence Review 2020
Table 2: summary of different methods for network compression.
Method | Strengths | Limitations |
---|---|---|
Knowledge distillation | Can downsize a network regardless of the structural difference between the teacher and the student network | Can only be applied to classification tasks with a softmax loss function |
Low-rank factorization | Standard pipeline based on well-understood matrix decompositions (e.g. SVD) | Performed layer by layer, so it cannot perform global parameter compression |
Data quantization | Significantly reduces memory usage and floating-point operations | Quantized weights make neural networks harder to converge |
Pruning | Can improve the inference time and model-size vs. accuracy tradeoff for a given architecture | Generally does not help as much as switching to a better architecture |
Cheng+, Model Compression and Acceleration for Deep Neural Networks: The Principles, Progress, and Challenges, IEEE Signal Processing Magazine, Vol. 35, Iss. 1, pp. 126-136, 2018
Low-Rank Factorization
In low-rank factorization, a weight matrix A of dimension m × n with rank r is replaced by smaller matrices. In feed-forward NNs and CNNs, singular value decomposition (SVD) is a common and popular factorization scheme for reducing the number of parameters. SVD factorizes the original weight matrix into three smaller matrices that replace it: for any matrix A ∈ ℝ^(m×n) there exists a factorization A = U S V^T, where U ∈ ℝ^(m×r), S ∈ ℝ^(r×r), and V^T ∈ ℝ^(r×n). S is a diagonal matrix with the singular values on the diagonal, arranged in non-increasing order, and U and V have orthonormal columns. Storing U, S, and V requires r(m + n + 1) values instead of mn, so when the model size must be reduced, low-rank factorization helps by replacing one large matrix with much smaller ones; keeping only the largest singular values gives the best low-rank approximation of the original weights.
Yu+, On Compressing Deep Models by Low Rank and Sparse Decomposition, CVPR 2017
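As a rough illustration of how this could look in our Scala setting (our own sketch, not code from the cited papers), the snippet below uses the Breeze linear-algebra library, assumed to be available, to build a rank-k approximation of a randomly generated weight matrix; the matrix sizes and the rank k are arbitrary example values.

```scala
import breeze.linalg.{diag, norm, svd, DenseMatrix}

object LowRankDemo {
  /** Replace an m x n weight matrix W by two thin factors A (m x k) and B (k x n)
    * obtained from a truncated SVD, keeping only the k largest singular values. */
  def factorize(w: DenseMatrix[Double], k: Int): (DenseMatrix[Double], DenseMatrix[Double]) = {
    val svd.SVD(u, s, vt) = svd(w)                  // W = U * diag(s) * Vt
    val a = u(::, 0 until k).copy                   // m x k
    val b = diag(s(0 until k)) * vt(0 until k, ::)  // k x n (S folded into Vt)
    (a, b)
  }

  def main(args: Array[String]): Unit = {
    val w = DenseMatrix.rand(256, 512)              // example weight matrix
    val k = 32                                      // example rank
    val (a, b) = factorize(w, k)
    val err = norm((w - a * b).toDenseVector)       // Frobenius norm of the residual
    println(s"parameters: ${w.size} -> ${a.size + b.size}, ||W - AB||_F = $err")
  }
}
```

With m = 256, n = 512 and k = 32, the two factors hold 32 × (256 + 512) = 24,576 values instead of 131,072.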
Knowledge Distillation
Knowledge distillation (KD): The basic idea of KD is to distill knowledge from a large teacher model into a small student model by having the student learn the class distributions output by the teacher via a softened softmax.
Mishra+, Apprentice: Using Knowledge Distillation Techniques to Improve Low-Precision Network Accuracy, ICLR 2018
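To make "softened softmax" concrete, here is a minimal, self-contained Scala sketch (our own illustration, not code from the cited papers) of a temperature-scaled softmax and a distillation loss that matches the student's softened outputs to the teacher's; the logits and the temperature T = 4.0 are made-up example values.

```scala
object DistillDemo {
  /** Softmax with temperature t: a higher t gives a softer class distribution. */
  def softmax(logits: Array[Double], t: Double): Array[Double] = {
    val scaled = logits.map(_ / t)
    val m      = scaled.max                          // subtract max for numerical stability
    val exps   = scaled.map(z => math.exp(z - m))
    val sum    = exps.sum
    exps.map(_ / sum)
  }

  /** Cross-entropy of the student's softened distribution against the teacher's. */
  def distillLoss(teacherLogits: Array[Double], studentLogits: Array[Double], t: Double): Double = {
    val p = softmax(teacherLogits, t)                // soft targets from the teacher
    val q = softmax(studentLogits, t)
    -(p zip q).map { case (pi, qi) => pi * math.log(qi + 1e-12) }.sum
  }

  def main(args: Array[String]): Unit = {
    val teacher = Array(8.0, 2.0, 0.5)               // example logits
    val student = Array(5.0, 1.5, 0.3)
    println(distillLoss(teacher, student, t = 4.0))  // T = 4.0 is an arbitrary example
  }
}
```

In practice this distillation term is combined with the ordinary hard-label loss on the ground-truth classes.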
Data Quantization
In quantization, we reduce the number of bits required to store each weight. The idea can be further extended to represent gradients and activations in quantized form. Weights can be quantized to 16-bit, 8-bit, 4-bit or even 1-bit (a special case of quantization in which weights are represented with binary values only, known as weight binarization).
Our model is implemented in Scala, where parameters are stored as Double (64 bits), Float (32 bits), or Int (32 bits).
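Since our parameters are 64-bit or 32-bit values, an 8-bit representation cuts storage by 4-8x. Below is a minimal sketch, assuming a simple affine (scale and zero-point) scheme that maps Double weights to signed 8-bit integers; the scheme and the example weights are illustrative only, not the quantizer of any particular framework.

```scala
object QuantizeDemo {
  /** Affine quantization of Double weights to signed 8-bit integers. */
  final case class Quantized(values: Array[Byte], scale: Double, zeroPoint: Int)

  def quantize(weights: Array[Double]): Quantized = {
    val (lo, hi) = (weights.min, weights.max)
    val scale    = if (hi > lo) (hi - lo) / 255.0 else 1.0  // map [lo, hi] onto 256 levels
    val zero     = math.round(-128 - lo / scale).toInt      // integer that represents 0.0
    val q = weights.map { w =>
      val v = math.round(w / scale).toInt + zero
      math.max(-128, math.min(127, v)).toByte               // clamp to the int8 range
    }
    Quantized(q, scale, zero)
  }

  /** Recover approximate real values (dequantization). */
  def dequantize(q: Quantized): Array[Double] =
    q.values.map(b => (b.toInt - q.zeroPoint) * q.scale)

  def main(args: Array[String]): Unit = {
    val w = Array(-0.42, 0.03, 0.51, 1.2, -0.9)             // example 64-bit weights
    val q = quantize(w)
    println(q.values.mkString(", "))
    println(dequantize(q).mkString(", "))                   // close to w, within one step
  }
}
```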
- QAT Quantizer: Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference, CVPR 2018
  - Weights are quantized before they are convolved with the input. If batch normalization is used for the layer, the batch normalization parameters are “folded into” the weights before quantization. Activations are quantized at the points where they would be during inference, e.g. after the activation function is applied to a convolutional or fully connected layer’s output, or after a bypass connection adds or concatenates the outputs of several layers, as in ResNets (see the fake-quantization sketch after this list).
- BNN Quantizer: Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1
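The QAT bullet above describes what is often called fake quantization: during training, weights and activations are quantized and immediately dequantized in the forward pass, so the network learns with the rounding error of low-bit storage while the underlying parameters stay in floating point. The sketch below is our own simplified illustration of that round-trip (it omits the batch-norm folding mentioned above), together with BNN-style sign binarization:

```scala
object FakeQuantDemo {
  /** Quantize to int8 and immediately dequantize: the value the layer actually
    * uses in the forward pass, carrying the rounding error of 8-bit storage. */
  def fakeQuantize(weights: Array[Double]): Array[Double] = {
    val (lo, hi) = (weights.min, weights.max)
    val scale    = if (hi > lo) (hi - lo) / 255.0 else 1.0
    weights.map { w =>
      val q = math.max(-128, math.min(127, math.round((w - lo) / scale).toInt - 128))
      (q + 128) * scale + lo                      // back to a real value on the int8 grid
    }
  }

  /** BNN-style binarization: constrain weights to +1 / -1 via the sign function. */
  def binarize(weights: Array[Double]): Array[Double] =
    weights.map(w => if (w >= 0) 1.0 else -1.0)

  def main(args: Array[String]): Unit = {
    val w = Array(-0.37, 0.12, 0.9, -1.4)         // example weights
    println(fakeQuantize(w).mkString(", "))       // close to w, but snapped to 256 levels
    println(binarize(w).mkString(", "))           // +1/-1 only
  }
}
```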
Pruning
Pruning: remove the parameters that have the least effect on the accuracy of the network, which reduces model complexity and can mitigate over-fitting.
- Weight pruning: In unimportant-weight-connection pruning, we prune (zero out) weight connections if they are below some predefined threshold (Han et al. 2015) or if they are redundant (see the sketch after this list). Han+, Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
- Neuron pruning: Instead of removing the weights one by one, which is a time-consuming process, we can also remove individual neurons if they are redundant (Srinivas and Babu 2015). In that case, all the incoming and outgoing connections of the neuron are removed as well. There are many other ways of removing individual weight connections or neurons. Srinivas+, Data-free Parameter Pruning for Deep Neural Networks, BMVC 2015
- Filter pruning: In filter pruning, filters are ranked according to their importance, and the least important (lowest-ranked) filters are removed from the network. The importance of the filters can be calculated by the L1/L2 norm (Li et al. 2017) or by other measures such as their influence on the error (see the sketch after this list). Li+, Pruning Filters for Efficient ConvNets, ICLR 2017
- Layer pruning: Similarly, from a very deep network, some of the layers can also be pruned (Chen and Zhao 2018). Chen+, Shallowing Deep Networks: Layer-Wise Pruning Based on Feature Representations, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 41, Iss. 12, pp. 3048-3056, 2019
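As referenced in the weight pruning and filter pruning bullets, here is a rough Scala sketch of the two simplest criteria: magnitude-based weight pruning (zero out connections below a threshold) and L1-norm filter ranking (keep only the filters with the largest L1 norm). The threshold, filter contents, and number of filters kept are arbitrary example values.

```scala
object PruneDemo {
  /** Weight pruning: zero out connections whose magnitude is below a threshold. */
  def pruneWeights(weights: Array[Double], threshold: Double): Array[Double] =
    weights.map(w => if (math.abs(w) < threshold) 0.0 else w)

  /** Filter pruning: rank filters by L1 norm and keep only the most important ones.
    * Each filter is represented here as a flat array of its weights. */
  def pruneFilters(filters: Array[Array[Double]], keep: Int): Array[Array[Double]] =
    filters
      .sortBy(f => -f.map(x => math.abs(x)).sum)  // sort by L1 norm, largest first
      .take(keep)                                 // drop the least important filters

  def main(args: Array[String]): Unit = {
    val w = Array(0.8, -0.02, 0.003, -0.6, 0.05)
    println(pruneWeights(w, threshold = 0.04).mkString(", "))  // small weights become 0.0

    val filters = Array(
      Array(0.9, -0.7, 0.2),    // large L1 norm -> kept
      Array(0.01, 0.02, -0.01), // small L1 norm -> pruned
      Array(0.5, 0.4, -0.3)
    )
    println(pruneFilters(filters, keep = 2).map(_.mkString("[", ", ", "]")).mkString(" "))
  }
}
```

In a real network, dropping a filter also removes the corresponding output channel and the matching input channels of the next layer; the sketch above only drops entries from an array for illustration.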
Criteria
The standard criteria for measuring the quality of model compression and acceleration are the compression rate and the speedup rate.
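One common way to make these precise (following the surveys cited above; treat the exact symbols as our paraphrase) is to compare the original model M with the compressed model M*:

```latex
% a, a^*: number of parameters of the original model M and the compressed model M^*
\text{compression rate: } \alpha(M, M^{*}) = \frac{a}{a^{*}}
% s, s^*: running (inference) time of M and M^*
\text{speedup rate: } \delta(M, M^{*}) = \frac{s}{s^{*}}
```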
table source: Choudhary+, A comprehensive survey on model compression and acceleration, Artificial Intelligence Review 2020