NAS: General Problem Setup
Overview
- Source: Elsken et al., Neural Architecture Search: A Survey, 2018
- Search Space: defines which architectures can be represented in principle
- Search Strategy: details how to explore the search space
  - the search space is often exponentially large or even unbounded
- Performance Estimation Strategy: estimates an architecture's performance
  - standard training and validation of each architecture may be computationally expensive and limits the number of architectures that can be explored
Cell-based Search Space
- The NASNet search space (Zoph et al., 2018) defines the architecture of a conv net as the same cell repeated multiple times, where each cell contains several operations predicted by the NAS algorithm (a minimal stacking sketch follows this list).
- A cell is a small directed acyclic graph representing a feature transformation.
- The NASNet search space learns 2 types of cells:
  - Normal Cell: the input and output feature maps have the same dimensions
  - Reduction Cell: the output feature map has its width and height reduced by half
- Zoph et al., Learning Transferable Architectures for Scalable Image Recognition, CVPR 2018
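To make the cell-stacking idea above concrete, here is a minimal sketch of a NASNet-style macro-architecture, assuming PyTorch; `normal_cell_cls` and `reduction_cell_cls` are hypothetical placeholders for the searched cells, and the exact placement of reduction cells differs between papers.

```python
import torch.nn as nn

def build_network(normal_cell_cls, reduction_cell_cls, channels, num_blocks=3, repeats=4):
    """Stack the same searched cells: blocks of normal cells separated by reduction cells."""
    layers, c = [], channels
    for block in range(num_blocks):
        # `repeats` copies of the normal cell (spatial size preserved) ...
        layers += [normal_cell_cls(c) for _ in range(repeats)]
        # ... followed by one reduction cell that halves width/height and doubles channels
        if block < num_blocks - 1:
            layers.append(reduction_cell_cls(c, 2 * c))
            c *= 2
    return nn.Sequential(*layers)
```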
Pros
- A well-designed cell module enables transferability between datasets.
- Easy to scale the model size up or down by adjusting the number of cell repeats.
Cons
- The arrangement of operations is restricted.
- It assumes that a CNN can be obtained by stacking the same cell and that an RNN can be obtained by recursively connecting the same cell.
Comparison of different methods
- Comparison with state-of-the-art architectures on ImageNet (mobile setting)
Table source: Chen et al., Progressive Differentiable Architecture Search: Bridging the Depth Gap between Search and Evaluation, ICCV 2019
| | Reinforcement Learning | Evolutionary Algorithm | Differentiable Search |
| --- | --- | --- | --- |
| Computation cost | High | High | Low |
| Search space | Large | Large | Restricted |
Differentiable Architecture Search: Gradient-based method
Liu et al., DARTS: Differentiable Architecture Search, ICLR 2019
Overview of DARTS
- (a) Operations on the edges are initially unknown
- (b) Continuous relaxation of the search space by placing a mixture of candidate operations on each edge.
- (c) Joint optimization of the mixing probabilities and the network weights by solving a bilevel optimization problem.
- (d) Deriving the final architecture from the learned mixing probabilities
Continuous relaxation and optimization
Relaxation
- The operation mixing weights for a pair of nodes $(i, j)$ are parameterized by a vector $\alpha^{(i, j)}$ of dimension $|\mathcal O|$
- The method relaxes the categorical choice of a particular operation to a softmax over all operations (sketched in code after this list):
$$ \overline{o}^{(i,j)}(x) = \sum_{o \in \mathcal O} \frac{\exp(\alpha_{o}^{(i,j)})}{\sum_{o' \in \mathcal O} \exp(\alpha_{o'}^{(i,j)})} \, o(x) $$
- The task of architecture search then reduces to learning a set of continuous variables $\alpha = \lbrace \alpha^{(i, j)} \rbrace$
- At the end of search, obtain a discrete architecture by replacing each mixed operation $\overline{o}^{(i,j)}$ with the most likely operation, i.e. $o^{(i,j)} = \underset{o \in \mathcal O}{\mathrm{argmax}} \ \alpha_{o}^{(i, j)}$
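Below is a minimal sketch of this relaxation as a PyTorch module, assuming each candidate operation is an `nn.Module` that preserves the input shape; the names `MixedOp` and `candidate_ops` are illustrative, not the DARTS implementation's API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Mixed operation o_bar^{(i,j)}: softmax-weighted sum of all candidate operations."""
    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)
        # alpha^{(i,j)}: one architecture weight per candidate operation
        self.alpha = nn.Parameter(torch.zeros(len(candidate_ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)           # continuous relaxation
        return sum(w * op(x) for w, op in zip(weights, self.ops))

    def derive_op(self):
        # discretization at the end of search: keep the most likely operation (argmax over alpha)
        return self.ops[int(torch.argmax(self.alpha))]
```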
Optimization
- After relaxation, the goal is to jointly learn the architecture $\alpha$ and the weights $w$ within all the mixed operations (e.g. the weights of the convolution filters)
- The goal for the architecture is to find $\alpha^{*}$ that minimizes the validation loss $\mathcal L_{val}(w^{*}, \alpha^{*})$,
- where the weights $w^{*}$ associated with the architecture are obtained by minimizing the training loss: $w^{*} = \underset{w}{\mathrm{argmin}} \ \mathcal L_{train}(w, \alpha^{*})$
- This implies a bilevel optimization problem with $\alpha$ as the upper-level variable and $w$ as the lower-level variable:
$$ \underset{\alpha}{\min} \ \mathcal L_{val}(w^{*}(\alpha), \alpha) $$
$$ \text{s.t.} \ \ w^{*}(\alpha) = \underset{w}{\mathrm{argmin}} \ \mathcal L_{train}(w, \alpha) $$
Approximate Architecture Gradient
- Evaluating the architecture gradient exactly can be prohibitive because the inner optimization is expensive.
- DARTS therefore uses the following approximation scheme:
$$ \nabla_{\alpha} \mathcal L_{val}(w^{*}(\alpha), \alpha) \approx \nabla_{\alpha} \mathcal L_{val}(w - \xi \nabla_{w} \mathcal L_{train}(w, \alpha), \alpha) $$
- The idea is to approximate $w^{*}(\alpha)$ by adapting $w$ with only a single training step using learning rate $\xi$, rather than solving the inner optimization $w^{*}(\alpha) = \underset{w}{\mathrm{argmin}} \ \mathcal L_{train}(w, \alpha)$ completely by training until convergence.
- The procedure is outlined as Algorithm 1; a simplified sketch is given below.
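A minimal sketch of the alternating updates, assuming PyTorch and that `alpha_optimizer` / `w_optimizer` were built over the architecture parameters and the network weights respectively. For brevity this shows the first-order variant ($\xi = 0$); the second-order update would instead evaluate the validation loss at the one-step-unrolled weights $w'$:

```python
def search_step(model, train_batch, val_batch, loss_fn, w_optimizer, alpha_optimizer):
    # 1) update the architecture parameters alpha on a validation batch
    x_val, y_val = val_batch
    alpha_optimizer.zero_grad()
    loss_fn(model(x_val), y_val).backward()
    alpha_optimizer.step()        # only updates alpha (its own parameter group)

    # 2) update the network weights w on a training batch
    x_train, y_train = train_batch
    w_optimizer.zero_grad()
    loss_fn(model(x_train), y_train).backward()
    w_optimizer.step()            # only updates w
```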
- Applying the chain rule to the approximate architecture gradient gives: $$\color{red}{\nabla_{\alpha}} \mathcal L_{val}(w - \xi \nabla_{w} \mathcal L_{train}(w, \color{red}{\alpha}), \color{red}{\alpha}) = \color{blue}{\nabla_{\alpha}} \mathcal L_{val}(w^{'}, \color{blue}{\alpha}) - \xi \nabla_{\alpha, w}^{2} \mathcal L_{train}(w, \alpha) \cdot \color{blue}{\nabla_{w^{'}}} \mathcal L_{val}(\color{blue}{w^{'}}, \alpha)$$
- where $w^{'} = w - \xi \nabla_{w} \mathcal L_{train}(w, \alpha)$ denotes the weights for a one-step forward model.
Derivation: by the multivariable chain rule, $$ \nabla_{\alpha} f(g_{1}(\alpha), g_{2}(\alpha)) = \color{blue}{D_{1} f(g_{1}(\alpha), g_{2}(\alpha))} \cdot \color{red}{\nabla_{\alpha} g_{1}(\alpha)} + \color{blue}{D_{2} f(g_{1}(\alpha), g_{2}(\alpha))} \cdot \color{red}{\nabla_{\alpha} g_{2}(\alpha)} $$
- $f(\cdot, \cdot) = \mathcal L_{val}(\cdot, \cdot)$
- $g_{1}(\alpha) = w^{'} = w - \xi \nabla_{w} \mathcal L_{train}(w, \alpha)$
- $g_{2}(\alpha) = \alpha$
Differentiate:
- $\nabla_{\alpha} g_{1}(\alpha) = -\xi \nabla_{\alpha, w}^{2} \mathcal L_{train}(w, \alpha)$
- $\nabla_{\alpha} g_{2}(\alpha) = 1$
- $D_{1} f(g_{1}(\alpha), g_{2}(\alpha)) = \color{blue}{\nabla_{w^{'}}} \mathcal L_{val}(\color{blue}{w^{'}}, \alpha)$
- $D_{2} f(g_{1}(\alpha), g_{2}(\alpha)) = \color{blue}{\nabla_{\alpha}} \mathcal L_{val}(w^{'}, \color{blue}{\alpha})$
- The expression above contains an expensive matrix-vector product in its second term. DARTS reduces this cost using a finite difference approximation (see the sketch after this block). Let $\epsilon$ be a small scalar, $\epsilon = \frac{0.01}{\lVert \color{blue}{\nabla_{w^{'}}} \mathcal L_{val}(\color{blue}{w^{'}}, \alpha) \rVert_{2}}$,
- and $w^{\pm} = w \pm \epsilon \color{blue}{\nabla_{w^{'}}} \mathcal L_{val}(\color{blue}{w^{'}}, \alpha)$.
- Then
$$ \nabla_{\alpha, w}^{2} \mathcal L_{train}(w, \alpha) \cdot \color{blue}{\nabla_{w^{'}}} \mathcal L_{val}(\color{blue}{w^{'}}, \alpha) \approx \frac{\nabla_{\alpha} \mathcal L_{train}(w^{+}, \alpha) - \nabla_{\alpha} \mathcal L_{train}(w^{-}, \alpha)}{2 \epsilon} $$
- The complexity is reduced from $O(|\alpha| |w|)$ to $O(|\alpha| + |w|)$, since evaluating the finite difference requires only two forward passes for the weights and two backward passes for $\alpha$.
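Below is a minimal sketch of this finite-difference approximation, assuming PyTorch; `weights` and `alphas` are the lists of network weights $w$ and architecture parameters $\alpha$, `train_loss_fn` recomputes $\mathcal L_{train}(w, \alpha)$ on a training batch, and `vector` stands for $\nabla_{w'} \mathcal L_{val}(w', \alpha)$ (same shapes as `weights`):

```python
import torch

def approx_hessian_vector_product(train_loss_fn, weights, alphas, vector, scale=0.01):
    """Central-difference estimate of (d^2 L_train / d alpha d w) . vector."""
    eps = scale / torch.cat([v.reshape(-1) for v in vector]).norm()

    def grad_alpha_at(sign):
        with torch.no_grad():                       # perturb: w <- w + sign * eps * vector
            for w, v in zip(weights, vector):
                w.add_(sign * eps * v)
        grads = torch.autograd.grad(train_loss_fn(), alphas)
        with torch.no_grad():                       # restore the original w
            for w, v in zip(weights, vector):
                w.sub_(sign * eps * v)
        return grads

    g_plus, g_minus = grad_alpha_at(+1.0), grad_alpha_at(-1.0)
    # (grad_alpha L_train(w+, alpha) - grad_alpha L_train(w-, alpha)) / (2 eps)
    return [(gp - gm) / (2.0 * eps) for gp, gm in zip(g_plus, g_minus)]
```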
Derivation: recall the Taylor series: $$ f(x) = f(x_0) + \frac{f^{'}(x_0)}{1!}(x - x_0) + \frac{f^{''}(x_0)}{2!}(x - x_0)^2 + \cdots $$
Substituting $x = x_0 + hA$ and $x = x_0 - hA$ gives the following expressions:
- $f(x_{0} + hA) = f(x_0) + \frac {f^{'} (x_0) }{ 1!}hA + \cdots $
- $ f(x_{0} - hA) = f(x_0) -\frac {f^{'} (x_0) }{ 1!}hA + \cdots $
- $f^{'} (x_{0} )\cdot A \approx \frac {f(x_{0} + hA )-f(x_{0} - hA )} { 2h } $
Then substitute:
- $h = \epsilon$
- $A = \color{blue}{\nabla_{w^{'}}} {\mathcal L_{val}}(\color{blue}{w^{'}},\alpha) $
- $x_{0} = w$
- $f(\cdot, \cdot) = \nabla_{\alpha} {\mathcal L_{train}(\cdot, \cdot)}$
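As a quick sanity check of the central-difference identity $f^{'}(x_{0}) \cdot A \approx \frac{f(x_{0}+hA) - f(x_{0}-hA)}{2h}$, here is a tiny numerical example with a scalar stand-in (the function and numbers are made up purely for illustration):

```python
import numpy as np

f = np.sin                                   # scalar stand-in for nabla_alpha L_train
x0, A, h = 0.3, 2.0, 1e-4
exact = np.cos(x0) * A                       # f'(x0) * A
approx = (f(x0 + h * A) - f(x0 - h * A)) / (2 * h)
print(exact, approx)                         # the two values agree up to O(h^2) error
```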
Experiments and Results
Experiment setting
- The following operations are included in $\mathcal O$ (a simplified code sketch follows the list):
- 3 × 3 dilated separable convolution
- 5 × 5 dilated separable convolution
- 3 × 3 depth-wise separable convolution
- 5 × 5 depth-wise separable convolution
- 3 × 3 max pooling
- 3 × 3 average pooling
- no connection (zero)
- skip connection (identity)
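A simplified sketch of this candidate set in PyTorch is given below; it could be passed to the `MixedOp` sketch earlier. Note the real DARTS operations additionally wrap the convolutions with ReLU and batch normalization (and apply the separable convolution twice), so this is only an illustration:

```python
import torch
import torch.nn as nn

class Zero(nn.Module):
    """'No connection': outputs an all-zero tensor of the same shape as the input."""
    def forward(self, x):
        return torch.zeros_like(x)

def sep_conv(c, k):
    # depth-wise separable conv: depth-wise k x k conv followed by a 1 x 1 point-wise conv
    return nn.Sequential(nn.Conv2d(c, c, k, padding=k // 2, groups=c, bias=False),
                         nn.Conv2d(c, c, 1, bias=False))

def dil_conv(c, k, d=2):
    # dilated separable conv: dilated depth-wise conv + 1 x 1 point-wise conv
    return nn.Sequential(nn.Conv2d(c, c, k, padding=d * (k // 2), dilation=d, groups=c, bias=False),
                         nn.Conv2d(c, c, 1, bias=False))

def candidate_ops(c):
    return [dil_conv(c, 3), dil_conv(c, 5),
            sep_conv(c, 3), sep_conv(c, 5),
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.AvgPool2d(3, stride=1, padding=1),
            Zero(),            # no connection
            nn.Identity()]     # skip connection
```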
Results
- CNN results
- RNN results
- The performance reported for ENAS (an RL-based method) looks similar to that of NAS; I will explore it next time.
- Pham et al., Efficient Neural Architecture Search via Parameter Sharing, ICML 2018
Conclusion
- DARTS greatly reduces the search cost in GPU hours: the CNN cell search (cells with 7 nodes) took only 1.5 days on a single GPU.
- However, it suffers from high GPU memory consumption due to its continuous representation of the network architecture.
Candidate NAS for my research
- Random Search (as baseline)
- DARTS (Gradient)
- ENAS (RL + parameter sharing)
Possible future direction
- Search efficiency
- Moving towards less constrained Search Space
- Designing efficient architectures: automated scaling, pruning and quantization (model compression techniques mentioned at the last bw meeting)
Materials to learn about NAS:
[1] https://lilianweng.github.io/lil-log/2020/08/06/neural-architecture-search.html#evolutionary-algorithms
[2] https://hangzhang.org/ECCV2020/
Engineering part
- The implementation of DARTS is available at https://github.com/quark0/darts
- Apply it to my current research domain to see how it performs?