Model Compression
Introduction
Choudhary+, [A comprehensive survey on model compression and acceleration](https://link.springer.com/content/pdf/10.1007/s10462-020-09816-7.pdf), Artificial Intelligence Review 2020
Table 2: summary of different methods for network compression.
Method | Strengths | Limitations |
---|---|---|
Knowledge distillation | Can downsize a network regardless of the structural difference between the teacher and the student network | Can only be applied to classification tasks with a softmax loss function |
Low-rank factorization | Standard pipeline based on well-understood matrix decompositions (e.g. SVD) | Performed layer by layer, so it cannot perform global parameter compression |
Data quantization | Significantly reduces memory usage and floating-point operations | Quantized weights make neural networks harder to converge |
Pruning | Can improve the inference time and model-size vs. accuracy tradeoff for a given architecture | Generally does not help as much as switching to a better architecture |
Cheng+, Model Compression and Acceleration for Deep Neural Networks: The Principles, Progress, and Challenges, IEEE Signal Processing Magazine, Vol. 35, Iss. 1, pp. 126-136, 2018
Low-Rank Factorization
In low-rank factorization, a weight matrix A of dimension m × n with rank r is replaced by smaller matrices. In feed-forward NNs and CNNs, singular value decomposition (SVD) is a common and popular factorization scheme for reducing the number of parameters. SVD factorizes the original weight matrix into three smaller matrices that replace it: for any matrix A ∈ ℝ^(m×n) there exists a factorization A = U S V^T, where U ∈ ℝ^(m×r), S ∈ ℝ^(r×r), and V^T ∈ ℝ^(r×n). S is a diagonal matrix with the singular values on the diagonal, arranged in non-increasing order, and U and V have orthonormal columns. Storing U, S, and V requires r(m + n + 1) values instead of mn, so when the model size must be reduced, low-rank factorization helps by replacing one large matrix with much smaller ones; keeping only the largest singular values gives the best low-rank approximation of the original weights.
Yu+, On Compressing Deep Models by Low Rank and Sparse Decomposition, CVPR 2017
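As a rough illustration of how this could look in our Scala setting (our own sketch, not code from the cited papers), the snippet below uses the Breeze linear-algebra library, assumed to be available, to build a rank-k approximation of a randomly generated weight matrix; the matrix sizes and the rank k are arbitrary example values.

```scala
import breeze.linalg.{diag, norm, svd, DenseMatrix}

object LowRankDemo {
  /** Replace an m x n weight matrix W by two thin factors A (m x k) and B (k x n)
    * obtained from a truncated SVD, keeping only the k largest singular values. */
  def factorize(w: DenseMatrix[Double], k: Int): (DenseMatrix[Double], DenseMatrix[Double]) = {
    val svd.SVD(u, s, vt) = svd(w)                  // W = U * diag(s) * Vt
    val a = u(::, 0 until k).copy                   // m x k
    val b = diag(s(0 until k)) * vt(0 until k, ::)  // k x n (S folded into Vt)
    (a, b)
  }

  def main(args: Array[String]): Unit = {
    val w = DenseMatrix.rand(256, 512)              // example weight matrix
    val k = 32                                      // example rank
    val (a, b) = factorize(w, k)
    val err = norm((w - a * b).toDenseVector)       // Frobenius norm of the residual
    println(s"parameters: ${w.size} -> ${a.size + b.size}, ||W - AB||_F = $err")
  }
}
```

With m = 256, n = 512 and k = 32, the two factors hold 32 × (256 + 512) = 24,576 values instead of 131,072.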
Knowledge Distillation
Knowledge distillation (KD): The basic idea of KD is to distill knowledge from a large teacher model into a small student model by having the student learn the class distributions output by the teacher via a softened softmax.
Mishra+, Apprentice: Using Knowledge Distillation Techniques to Improve Low-Precision Network Accuracy, ICLR 2018
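To make "softened softmax" concrete, here is a minimal, self-contained Scala sketch (our own illustration, not code from the cited papers) of a temperature-scaled softmax and a distillation loss that matches the student's softened outputs to the teacher's; the logits and the temperature T = 4.0 are made-up example values.

```scala
object DistillDemo {
  /** Softmax with temperature t: a higher t gives a softer class distribution. */
  def softmax(logits: Array[Double], t: Double): Array[Double] = {
    val scaled = logits.map(_ / t)
    val m      = scaled.max                          // subtract max for numerical stability
    val exps   = scaled.map(z => math.exp(z - m))
    val sum    = exps.sum
    exps.map(_ / sum)
  }

  /** Cross-entropy of the student's softened distribution against the teacher's. */
  def distillLoss(teacherLogits: Array[Double], studentLogits: Array[Double], t: Double): Double = {
    val p = softmax(teacherLogits, t)                // soft targets from the teacher
    val q = softmax(studentLogits, t)
    -(p zip q).map { case (pi, qi) => pi * math.log(qi + 1e-12) }.sum
  }

  def main(args: Array[String]): Unit = {
    val teacher = Array(8.0, 2.0, 0.5)               // example logits
    val student = Array(5.0, 1.5, 0.3)
    println(distillLoss(teacher, student, t = 4.0))  // T = 4.0 is an arbitrary example
  }
}
```

In practice this distillation term is combined with the ordinary hard-label loss on the ground-truth classes.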
Data Quantization
In quantization, we reduce the number of bits required to store each weight. The idea can be further extended to represent gradients and activations in quantized form. Weights can be quantized to 16-bit, 8-bit, 4-bit or even 1-bit (a special case of quantization in which weights are represented with binary values only, known as weight binarization).
Our model is implemented in Scala, where parameters are stored as Double (64 bits), Float (32 bits), or Int (32 bits).
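Since our parameters are 64-bit or 32-bit values, an 8-bit representation cuts storage by 4-8x. Below is a minimal sketch, assuming a simple affine (scale and zero-point) scheme that maps Double weights to signed 8-bit integers; the scheme and the example weights are illustrative only, not the quantizer of any particular framework.

```scala
object QuantizeDemo {
  /** Affine quantization of Double weights to signed 8-bit integers. */
  final case class Quantized(values: Array[Byte], scale: Double, zeroPoint: Int)

  def quantize(weights: Array[Double]): Quantized = {
    val (lo, hi) = (weights.min, weights.max)
    val scale    = if (hi > lo) (hi - lo) / 255.0 else 1.0  // map [lo, hi] onto 256 levels
    val zero     = math.round(-128 - lo / scale).toInt      // integer that represents 0.0
    val q = weights.map { w =>
      val v = math.round(w / scale).toInt + zero
      math.max(-128, math.min(127, v)).toByte               // clamp to the int8 range
    }
    Quantized(q, scale, zero)
  }

  /** Recover approximate real values (dequantization). */
  def dequantize(q: Quantized): Array[Double] =
    q.values.map(b => (b.toInt - q.zeroPoint) * q.scale)

  def main(args: Array[String]): Unit = {
    val w = Array(-0.42, 0.03, 0.51, 1.2, -0.9)             // example 64-bit weights
    val q = quantize(w)
    println(q.values.mkString(", "))
    println(dequantize(q).mkString(", "))                   // close to w, within one step
  }
}
```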
- QAT Quantizer: Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference, CVPR 2018
  - Weights are quantized before they are convolved with the input. If batch normalization is used for the layer, the batch normalization parameters are “folded into” the weights before quantization. Activations are quantized at the points where they would be during inference, e.g. after the activation function is applied to a convolutional or fully connected layer’s output, or after a bypass connection adds or concatenates the outputs of several layers, as in ResNets (see the fake-quantization sketch after this list).
- BNN Quantizer: Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1
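The QAT bullet above describes what is often called fake quantization: during training, weights and activations are quantized and immediately dequantized in the forward pass, so the network learns with the rounding error of low-bit storage while the underlying parameters stay in floating point. The sketch below is our own simplified illustration of that round-trip (it omits the batch-norm folding mentioned above), together with BNN-style sign binarization:

```scala
object FakeQuantDemo {
  /** Quantize to int8 and immediately dequantize: the value the layer actually
    * uses in the forward pass, carrying the rounding error of 8-bit storage. */
  def fakeQuantize(weights: Array[Double]): Array[Double] = {
    val (lo, hi) = (weights.min, weights.max)
    val scale    = if (hi > lo) (hi - lo) / 255.0 else 1.0
    weights.map { w =>
      val q = math.max(-128, math.min(127, math.round((w - lo) / scale).toInt - 128))
      (q + 128) * scale + lo                      // back to a real value on the int8 grid
    }
  }

  /** BNN-style binarization: constrain weights to +1 / -1 via the sign function. */
  def binarize(weights: Array[Double]): Array[Double] =
    weights.map(w => if (w >= 0) 1.0 else -1.0)

  def main(args: Array[String]): Unit = {
    val w = Array(-0.37, 0.12, 0.9, -1.4)         // example weights
    println(fakeQuantize(w).mkString(", "))       // close to w, but snapped to 256 levels
    println(binarize(w).mkString(", "))           // +1/-1 only
  }
}
```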
Pruning
Pruning: remove the parameters that have the least effect on the accuracy of the network, which reduces model complexity and can mitigate over-fitting.
- Weight pruning: In unimportant-weight-connection pruning, we prune (zero out) weight connections if they are below some predefined threshold (Han et al. 2015) or if they are redundant (see the sketch after this list). Han+, Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
- Neuron pruning: Instead of removing the weights one by one, which is a time-consuming process, we can also remove individual neurons if they are redundant (Srinivas and Babu 2015). In that case, all the incoming and outgoing connections of the neuron are removed as well. There are many other ways of removing individual weight connections or neurons. Srinivas+, Data-free Parameter Pruning for Deep Neural Networks, BMVC 2015
- Filter pruning: In filter pruning, filters are ranked according to their importance, and the least important (lowest-ranked) filters are removed from the network. The importance of the filters can be calculated by the L1/L2 norm (Li et al. 2017) or by other measures such as their influence on the error (see the sketch after this list). Li+, Pruning Filters for Efficient ConvNets, ICLR 2017
- Layer pruning: Similarly, from a very deep network, some of the layers can also be pruned (Chen and Zhao 2018). Chen+, Shallowing Deep Networks: Layer-Wise Pruning Based on Feature Representations, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 41, Iss. 12, pp. 3048-3056, 2019
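As referenced in the weight pruning and filter pruning bullets, here is a rough Scala sketch of the two simplest criteria: magnitude-based weight pruning (zero out connections below a threshold) and L1-norm filter ranking (keep only the filters with the largest L1 norm). The threshold, filter contents, and number of filters kept are arbitrary example values.

```scala
object PruneDemo {
  /** Weight pruning: zero out connections whose magnitude is below a threshold. */
  def pruneWeights(weights: Array[Double], threshold: Double): Array[Double] =
    weights.map(w => if (math.abs(w) < threshold) 0.0 else w)

  /** Filter pruning: rank filters by L1 norm and keep only the most important ones.
    * Each filter is represented here as a flat array of its weights. */
  def pruneFilters(filters: Array[Array[Double]], keep: Int): Array[Array[Double]] =
    filters
      .sortBy(f => -f.map(x => math.abs(x)).sum)  // sort by L1 norm, largest first
      .take(keep)                                 // drop the least important filters

  def main(args: Array[String]): Unit = {
    val w = Array(0.8, -0.02, 0.003, -0.6, 0.05)
    println(pruneWeights(w, threshold = 0.04).mkString(", "))  // small weights become 0.0

    val filters = Array(
      Array(0.9, -0.7, 0.2),    // large L1 norm -> kept
      Array(0.01, 0.02, -0.01), // small L1 norm -> pruned
      Array(0.5, 0.4, -0.3)
    )
    println(pruneFilters(filters, keep = 2).map(_.mkString("[", ", ", "]")).mkString(" "))
  }
}
```

In a real network, dropping a filter also removes the corresponding output channel and the matching input channels of the next layer; the sketch above only drops entries from an array for illustration.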
Criteria
The standard criteria for measuring the quality of model compression and acceleration are the compression rate and the speedup rate.
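One common way to make these precise (following the surveys cited above; treat the exact symbols as our paraphrase) is to compare the original model M with the compressed model M*:

```latex
% a, a^*: number of parameters of the original model M and the compressed model M^*
\text{compression rate: } \alpha(M, M^{*}) = \frac{a}{a^{*}}
% s, s^*: running (inference) time of M and M^*
\text{speedup rate: } \delta(M, M^{*}) = \frac{s}{s^{*}}
```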
table source: Choudhary+, A comprehensive survey on model compression and acceleration, Artificial Intelligence Review 2020