Model Compression

M1 student @ Tokyo Tech

Bing BAI

June 19, 2021

Choudhary+, A comprehensive survey on model compression and acceleration, Artificial Intelligence Review 2020

Table 2: summary of different methods for network compression.

| Method | Strengths | Limitations |
|---|---|---|
| Knowledge distillation | Can downsize a network regardless of the structural difference between the teacher and the student network | Can only be applied to classification tasks with a softmax loss function |
| Low-rank factorization | Standard pipeline | Performed layer by layer; cannot perform global parameter compression |
| Data quantization | Significantly reduces memory usage and floating-point operations | Quantized weights make neural networks harder to converge |
| Pruning | Can improve the inference time and the model size vs. accuracy tradeoff for a given architecture | Generally does not help as much as switching to a better architecture |

Cheng+, Model Compression and Acceleration for Deep Neural Networks: The Principles, Progress, and Challenges, IEEE Signal Processing Magazine, Vol.35, pp.126-136, 2018

Low-Rank Factorization

In low-rank factorization, a weight matrix A of size m × n and rank r is replaced by a product of smaller matrices. In feed-forward NNs and CNNs, singular value decomposition (SVD) is a common factorization scheme for reducing the number of parameters. SVD factorizes the original weight matrix into three smaller matrices that replace it: for any matrix A ∈ ℝ^(m×n) of rank r, there exists a factorization A = U S V^T, where U ∈ ℝ^(m×r), S ∈ ℝ^(r×r), and V^T ∈ ℝ^(r×n). S is a diagonal matrix with the singular values on the diagonal in non-increasing order (each entry is greater than or equal to the next), and U and V have orthonormal columns. Keeping only the k largest singular values (k < r) gives a rank-k approximation that stores k(m + n + 1) values instead of mn, which is how the factorization reduces model size.

Yu+, On Compressing Deep Models by Low Rank and Sparse Decomposition, CVPR 2017
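To make the saving concrete, the sketch below counts stored values for a dense m × n weight matrix versus a truncated factorization that keeps only the k largest singular values (U as m × k, the k diagonal entries of S, and V^T as k × n). The function names and the 1024 × 1024 layer with k = 64 are hypothetical, chosen only for illustration; this is not taken from the cited papers.

```python
def dense_param_count(m: int, n: int) -> int:
    """Values stored by a dense m x n weight matrix."""
    return m * n


def factored_param_count(m: int, n: int, k: int) -> int:
    """Values stored by rank-k factors: U (m x k), diag(S) (k), V^T (k x n)."""
    return m * k + k + k * n


def break_even_rank(m: int, n: int) -> int:
    """Largest k for which the factors are strictly smaller than the dense matrix."""
    # Solve k * (m + n + 1) < m * n for the largest integer k.
    return (m * n - 1) // (m + n + 1)


if __name__ == "__main__":
    m, n, k = 1024, 1024, 64  # hypothetical layer shape and kept rank
    dense = dense_param_count(m, n)
    factored = factored_param_count(m, n, k)
    print(f"dense: {dense}, rank-{k} factors: {factored}")
    print(f"reduction: {dense / factored:.2f}x, break-even rank: {break_even_rank(m, n)}")
```

The break-even rank shows why the factorization only pays off when the kept rank is well below min(m, n): beyond that point the three factors store more values than the original matrix.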

Knowledge Distillation

Knowledge distillation (KD): The basic idea is to distill knowledge from a large teacher model into a small student model by training the student to match the class distributions that the teacher outputs through a softened (temperature-scaled) softmax.

Mishra+, Apprentice: using knowledge distillation techniques to improve low-precision network accuracy, ICLR 2018
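A minimal sketch of the softened softmax and a distillation loss, assuming the common formulation in which logits are divided by a temperature T before the softmax and the student is trained with the cross-entropy against the teacher's softened distribution. The function names, the default temperature of 4.0, and the example logits are illustrative assumptions, not values from the cited paper.

```python
import math


def softened_softmax(logits, temperature=1.0):
    """Softmax over logits / temperature; a higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Cross-entropy between the teacher's and the student's softened distributions."""
    teacher_probs = softened_softmax(teacher_logits, temperature)
    student_probs = softened_softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]   # hypothetical teacher logits
    student = [1.5, 1.2, 0.3]   # hypothetical student logits
    print("teacher, T=1:", softened_softmax(teacher, 1.0))
    print("teacher, T=4:", softened_softmax(teacher, 4.0))
    print("distillation loss:", distillation_loss(student, teacher))
```

In practice this term is usually combined with the ordinary cross-entropy against the ground-truth labels; that weighting is left out here to keep the sketch short.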

Data Quantization

In quantization, we reduce the number of bits used to store each weight. The same idea can be extended to gradients and activations. Weights can be quantized to 16, 8, or 4 bits, or even to 1 bit; this special case, in which weights take only binary values, is known as weight binarization.
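One simple way to realize this is uniform quantization, sketched below under the assumption that weights are mapped linearly from their [min, max] range onto integer codes. The function names and the sample weights are hypothetical; real schemes (per-channel scales, quantization-aware training) are more involved.

```python
def quantize(weights, num_bits=8):
    """Map floats to integer codes in [0, 2**num_bits - 1] with a uniform step size."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0  # guard against a constant vector
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate float weights from the integer codes."""
    return [lo + c * scale for c in codes]


def binarize(weights):
    """1-bit special case: keep only the sign of each weight."""
    return [1 if w >= 0 else -1 for w in weights]


if __name__ == "__main__":
    weights = [-0.5, 0.0, 0.25, 1.0]  # hypothetical weights
    codes, scale, lo = quantize(weights, num_bits=8)
    print("codes:", codes)
    print("reconstructed:", dequantize(codes, scale, lo))
    print("binarized:", binarize(weights))
```

The reconstruction error of each weight is at most half a step (scale / 2), so fewer bits mean a coarser step and a larger error.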

Our model is implemented in Scala, where a parameter stored as a Double takes 64 bits, a Float takes 32 bits, and an Int takes 32 bits.
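The bit widths above translate directly into storage size. The arithmetic can be sketched as follows; the parameter count of one million and the names in the table are hypothetical, and the figure ignores any container or metadata overhead.

```python
# Bits per stored parameter for a few representations (illustrative names).
BITS_PER_PARAMETER = {"double": 64, "float": 32, "int8": 8, "binary": 1}


def model_size_bytes(num_parameters: int, bits_per_parameter: int) -> int:
    """Raw storage for the parameters alone, rounded up to whole bytes."""
    return (num_parameters * bits_per_parameter + 7) // 8


if __name__ == "__main__":
    num_parameters = 1_000_000  # hypothetical model size
    for name, bits in BITS_PER_PARAMETER.items():
        print(f"{name}: {model_size_bytes(num_parameters, bits)} bytes")
```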


Pruning

Pruning removes the parameters that have the least effect on the accuracy of the network, which reduces model complexity and can mitigate over-fitting.
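A minimal sketch of one pruning criterion, magnitude pruning, which assumes that weights with the smallest absolute value matter least for accuracy. That assumption is a widely used proxy, not something the notes above prescribe, and the function name, sparsity level, and sample weights are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute values."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    num_to_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_to_prune]:
        pruned[i] = 0.0
    return pruned


if __name__ == "__main__":
    weights = [0.9, -0.05, 0.3, 0.01, -0.7, 0.2]  # hypothetical weights
    print(magnitude_prune(weights, sparsity=0.5))
```

The zeroed weights only save memory or time if they are stored and executed in a sparse format; pruned networks are also typically fine-tuned afterwards to recover accuracy.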


The standard criteria to measure the quality of model compression and acceleration are the compression rate and the speedup rate.
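Both criteria can be computed as simple ratios, assuming the usual definitions: compression rate as the number of parameters in the original model divided by that of the compressed model, and speedup rate as the original running time divided by the compressed running time. The function names and the numbers in the example are hypothetical.

```python
def compression_rate(original_params: int, compressed_params: int) -> float:
    """Ratio of parameter counts: original model over compressed model."""
    return original_params / compressed_params


def speedup_rate(original_time: float, compressed_time: float) -> float:
    """Ratio of running times: original model over compressed model."""
    return original_time / compressed_time


if __name__ == "__main__":
    # Hypothetical measurements for an original and a compressed model.
    print("compression rate:", compression_rate(1_000_000, 250_000))
    print("speedup rate:", speedup_rate(120.0, 40.0))
```

The two numbers need not move together: a method can shrink the parameter count substantially while giving little speedup, so both are worth reporting.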

Table source: Choudhary+, A comprehensive survey on model compression and acceleration, Artificial Intelligence Review 2020