Neural Architecture Search: Learning How to Learn

Kwanghee Choi
Local Optima 2019

Reference
- Neural Architecture Search with Reinforcement Learning (ICLR 2017)
- Learning Transferable Architectures for Scalable Image Recognition (CVPR 2018)
- Large-Scale Evolution of Image Classifiers (ICML 2017)
- Hierarchical Representations for Efficient Architecture Search (ICLR 2018)
- Regularized Evolution for Image Classifier Architecture Search (AAAI 2019)
- Progressive Neural Architecture Search (ECCV 2018)
- Neural Architecture Optimization (NIPS 2018)
- Exploring Randomly Wired Neural Networks for Image Recognition (2019)
- Weight Agnostic Neural Networks (2019)
- HyperNetworks (ICLR 2017)
- SMASH: One-Shot Model Architecture Search through HyperNetworks (ICLR 2018)
- Efficient Neural Architecture Search via Parameter Sharing (ICML 2018)
- Understanding and Simplifying One-Shot Architecture Search (ICML 2018)
- DARTS: Differentiable Architecture Search (ICLR 2019)
- ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware (ICLR 2019)
- MnasNet: Platform-Aware Neural Architecture Search for Mobile (CVPR 2019)
- FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search (CVPR 2019)
- EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks (ICML 2019)
- ScarletNAS: Bridging the Gap Between Scalability and Fairness in Neural Architecture Search (2019)
- NAS-Bench-101: Towards Reproducible Neural Architecture Search (ICML 2019)

Introduction
- Excerpt from Exploring Randomly Wired Neural Networks for Image
Recognition (2019)
- Neural networks for image recognition have evolved through extensive
manual design, e.g., ResNet, DenseNet.
- What we call deep learning today descends from the connectionist approach
to cognitive science, a paradigm reflecting the hypothesis that how
computational networks are wired is crucial for building intelligent machines.
- NAS (Neural Architecture Search): Optimization of wiring and operation types,
but possible wirings or operations are constrained.

Neural Architecture Search
with Reinforcement Learning
Barret Zoph, Quoc V. Le (Google)
ICLR 2017

- A gradient-based method for finding good architectures
- Use a recurrent network to generate the model descriptions of neural
networks and train this RNN with reinforcement learning to maximize the
expected accuracy of the generated architectures on a validation set.
- The structure and connectivity of a neural network can be typically specified
by a variable-length string. It is therefore possible to use a recurrent network –
the controller – to generate such string.
- Architecture engineering with CNNs often identifies repeated motifs
consisting of combinations of convolutional filter banks,
nonlinearities and a prudent selection of connections
to achieve state-of-the-art results.

Controller Recurrent Neural Network
- Every prediction is carried out by a softmax classifier and then fed into the next time step as input.
- The process of generating an architecture stops if the number of layers exceeds a certain value.
- Once the controller RNN finishes generating an architecture,
a neural network with this architecture is built and trained.
- At convergence, the accuracy of the network
on a held-out validation set is recorded.

Training with REINFORCE
- The list of tokens that the controller predicts can be viewed as a list of
actions $a_{1:T}$ to design an architecture for a child network.
- At convergence, this child network will achieve an accuracy R on a held-out
dataset.
- We can use this accuracy R as the reward signal and use reinforcement
learning to train the controller.
- REINFORCE by Williams (1992), Sutton (2000)
- We do not predict the learning rate and we also assume that the architectures
consist of only convolutional layers,
which is also quite restrictive.
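For reference, a sketch of the policy-gradient update as given in the NAS paper. Here $\theta_c$ are the controller parameters, $m$ is the number of architectures sampled per batch, $T$ is the number of tokens per architecture, $R_k$ is the validation accuracy of the k-th child network, and $b$ is a baseline (an exponential moving average of previous rewards in the paper):

```latex
J(\theta_c) = \mathbb{E}_{P(a_{1:T};\,\theta_c)}[R]

\nabla_{\theta_c} J(\theta_c) \approx
  \frac{1}{m} \sum_{k=1}^{m} \sum_{t=1}^{T}
  \nabla_{\theta_c} \log P\left(a_t \mid a_{(t-1):1};\, \theta_c\right) (R_k - b)
```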

Distributed training for NAS
- We use a set of S parameter servers to store and send parameters to K controller replicas.
- Each controller replica then samples m architectures and runs the corresponding child models in parallel.
- The accuracy of each child model is recorded to compute the gradients with respect to $\theta_c$, which are
then sent back to the parameter servers.

Generating Skip Connections
- At layer N, we add an anchor point
which has N − 1 content-based sigmoids
to indicate the previous layers that need to be connected.
- Each sigmoid is a function of the current hidden state of the controller and the previous hidden states
of the previous N − 1 anchor points.
- $P(\text{layer } j \text{ is an input to layer } i) = \mathrm{sigmoid}\left(v^{T} \tanh(W_{prev} h_j + W_{curr} h_i)\right)$,
where $h_j$ represents the hidden state of the controller at the anchor point for the j-th layer, and j ranges
from 0 to N − 1.
- We then sample from these sigmoids to decide what previous layers to be used as inputs to the
current layer.
- The matrices $W_{prev}$, $W_{curr}$ and the vector $v$ are trainable parameters.
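The anchor-point computation can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the hidden size, the random initialization, and the function name `skip_connection_probs` are assumptions, and random vectors stand in for real controller hidden states.

```python
import numpy as np

def skip_connection_probs(h_prev, h_curr, W_prev, W_curr, v):
    """Return P(layer j is an input to layer i) for every earlier anchor point j.

    h_prev: (num_prev, d) controller hidden states at earlier anchor points
    h_curr: (d,) controller hidden state at the current anchor point
    """
    # Row-vector form of W_prev h_j + W_curr h_i, one row per earlier layer j.
    scores = np.tanh(h_prev @ W_prev.T + h_curr @ W_curr.T) @ v
    return 1.0 / (1.0 + np.exp(-scores))  # element-wise sigmoid

rng = np.random.default_rng(0)
d, num_prev = 8, 3                        # assumed sizes for the example
W_prev = rng.normal(size=(d, d))
W_curr = rng.normal(size=(d, d))
v = rng.normal(size=d)
h_prev = rng.normal(size=(num_prev, d))   # stand-ins for real hidden states
h_curr = rng.normal(size=d)

probs = skip_connection_probs(h_prev, h_curr, W_prev, W_curr, v)
# Sample each connection independently from its sigmoid.
use_as_input = rng.random(num_prev) < probs
```

Each earlier layer gets its own independent Bernoulli draw, which is why a layer may end up with no input at all.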

Generating Skip Connections
- Skip connections can cause “compilation failures” where one layer is not
compatible with another layer, or one layer may not have any input or output.
- If a layer is not connected to any input layer then the image is used as the input layer.
- At the final layer we take all layer outputs that have not been connected and concatenate them
before sending this final hidden state to the classifier.
- If input layers to be concatenated have different sizes, we pad the small layers with zeros so
that the concatenated layers have the same sizes.

Generating Recurrent Cells
- The computations for basic RNN and LSTM cells can be generalized as a tree of steps that take $x_t$ and
$h_{t-1}$ as inputs and produce $h_t$ as the final output.
- The controller RNN needs to label each node in the tree with a combination method (add, dot product,
etc.) and an activation function (tanh, sigmoid, etc.) to merge two inputs to produce one output.
- Two outputs are then fed as inputs to the next node in the tree.
- Two leaf nodes (Tree Index 0, 1): thus it is called a “base 2” architecture.
- In our experiments, we use a base number of 8
to make sure that the cell is expressive.

Transfer Learning Performance (PTB)
- To understand whether the cell can generalize to a different task, we apply it
to the character language modeling task on the same dataset (PTB).
- The new cell was found on word level language modeling.

Learning Transferable Architectures
for Scalable Image Recognition
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, Quoc V. Le (Google)
CVPR 2018

Transferable Architectures
- We propose to search for an architectural building block on a small dataset
and then transfer the block to a larger dataset: the design of a new search
space (“NASNet search space”) which enables transferability.
- Applying NAS, or any other search methods, directly to a large dataset is computationally
expensive.
- NASNet search space: A search space so that the complexity of the architecture is
independent of the depth of the network and the size of input images.
- All convolutional networks in our search space are composed of convolutional layers (or
“cells”) with identical structure but different weights. Searching for the best convolutional
architectures is therefore reduced to searching for the best cell structure.
- By simply varying the number of convolutional cells and the number of filters,
we can create different versions of NASNets
with different computational demands.

Transferable Architectures
- Two types of cells:
- Normal Cell: return a feature map of the same dimension
- Reduction Cell: return a feature map where height and width are
reduced by a factor of two.
- We empirically found it beneficial to learn two separate
architectures.
- We use a common heuristic to double the number of filters in
the output whenever the spatial activation size is reduced in
order to maintain roughly constant hidden state dimension.
- We consider the # of motif repetitions and
the # of initial convolutional filters
as free parameters.

Controller Model Architecture
- Select a hidden state from $h_i$, $h_{i-1}$ or from the set of hidden states created in previous blocks.
- In our experiments, selecting B = 5 provides good results, although we have not exhaustively
searched this space due to computational limitations.
- To allow the controller RNN to predict both Normal Cell and Reduction Cell,
we simply make the controller have 2 × 5B predictions in total.
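To make the 2 × 5B count concrete, here is a small sketch in which uniform random choices stand in for the controller RNN. The five decisions per block (two input hidden states, one operation for each, and a combination method) follow the NASNet paper; the operation list is an illustrative subset and the names `OPS`, `COMBINE`, and `sample_cell` are assumptions.

```python
import random

OPS = ["identity", "3x3 sep conv", "5x5 sep conv", "3x3 avg pool", "3x3 max pool"]  # illustrative subset
COMBINE = ["add", "concat"]

def sample_cell(num_blocks, rng):
    """Sample one cell as a list of blocks, each made of five decisions."""
    hidden = ["h_i", "h_i-1"]           # hidden states available to the first block
    blocks = []
    for b in range(num_blocks):
        in1 = rng.choice(hidden)        # decision 1: first input hidden state
        in2 = rng.choice(hidden)        # decision 2: second input hidden state
        op1 = rng.choice(OPS)           # decision 3: operation applied to in1
        op2 = rng.choice(OPS)           # decision 4: operation applied to in2
        comb = rng.choice(COMBINE)      # decision 5: how to combine the two results
        blocks.append((in1, in2, op1, op2, comb))
        hidden.append(f"block_{b}")     # the new hidden state is selectable later
    return blocks

rng = random.Random(0)
B = 5
normal_cell = sample_cell(B, rng)
reduction_cell = sample_cell(B, rng)
num_predictions = sum(len(block) for block in normal_cell + reduction_cell)
```

With B = 5, `num_predictions` is 2 × 5 × 5 = 50, one controller softmax per decision.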

Transfer Learning Performance (ImageNet)
- The new cell was found on CIFAR-10.

Other Cell Types
- NASNet-B
- Does not concatenate the output hidden states; instead, each output hidden state is used as a hidden
state in the future layers.
- We allow addition followed by layer normalization or instance normalization.
- NASNet-C
- We allow addition followed by layer normalization or instance normalization.

Large-Scale Evolution
of Image Classifiers
Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena,
Yutaka Leon Suematsu, Jie Tan, Quoc V. Le, Alexey Kurakin (Google)
ICML 2017

Large-Scale Evolution
- Starting out with poor-performing models with no convolutions, the algorithm
must evolve complex convolutional neural networks while navigating a fairly
unrestricted search space.
- We use a simplified graph as our DNA, which is transformed to a full neural
network graph for training and evaluation.
- Mutations were chosen for their similarity to the actions that a human
designer may take when improving an architecture.
- We allow the children to inherit the parents’ weights whenever possible.
Namely, if a layer has matching shapes,
the weights are preserved.

Progress of an Evolution Experiment

Hierarchical Representations
for Efficient Architecture Search
Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, Koray Kavukcuoglu (CMU / Google)
ICLR 2018

Hierarchical Representations
for describing neural network architectures

Tournament Selection
- Starting from an initial population of random genotypes, tournament selection
provides a mechanism to pick promising genotypes from the population, and
to place its mutated offspring back into the population.
- By repeating this process, the quality of the population keeps being refined
over time.
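A minimal sketch of one tournament-selection step on a toy problem. The bit-string genotype, the fitness function, and the 5% sample fraction are assumptions chosen for illustration (a real run would train and evaluate each architecture instead of calling `fitness`).

```python
import random

def tournament_step(population, fitness, mutate, rng, sample_fraction=0.05):
    """Pick the fittest of a random sample and add its mutated copy to the population."""
    k = max(2, int(len(population) * sample_fraction))
    contestants = rng.sample(population, k)
    winner = max(contestants, key=fitness)
    population.append(mutate(winner, rng))  # nothing is removed in this variant

# Toy problem (assumption): a genotype is a list of bits, fitness counts the ones.
def fitness(genotype):
    return sum(genotype)

def flip_one_bit(genotype, rng):
    child = list(genotype)
    i = rng.randrange(len(child))
    child[i] = 1 - child[i]
    return child

rng = random.Random(0)
population = [[rng.randint(0, 1) for _ in range(16)] for _ in range(40)]
initial_best = max(map(fitness, population))
for _ in range(200):
    tournament_step(population, fitness, flip_one_bit, rng)
final_best = max(map(fitness, population))
```

Because no genotype is ever removed here, the best fitness in the population can only stay the same or improve.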

Cell Found
- We use the proposed search framework to learn the architecture of a
convolutional cell, rather than the entire model.
- Only motifs 1,3,4,5 are used to construct the cell,
among which motifs 3 and 5 are dominating.

Transfer Learning Performance (ImageNet)

Regularized Evolution for Image
Classifier Architecture Search
Esteban Real, Alok Aggarwal, Yanping Huang, Quoc V Le (Google)
AAAI 2019

Regularized Evolution
- In standard tournament selection, the best genotypes
(architectures) are kept. We instead propose to
associate each genotype with an age and
bias the tournament selection toward
younger genotypes by removing the oldest
member of the population at each step.
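The aging idea can be sketched as below, using a queue so that the oldest genotype is always the one discarded. The toy bit-string problem and all parameter values are assumptions; in the paper, evaluating `fitness` means training the architecture.

```python
import collections
import random

def regularized_evolution(random_genotype, fitness, mutate, cycles,
                          population_size, sample_size, rng):
    """Aging evolution: returns every genotype ever evaluated, in order."""
    population = collections.deque()
    history = []
    while len(population) < population_size:
        genotype = random_genotype(rng)
        population.append(genotype)
        history.append(genotype)
    for _ in range(cycles):
        sample = rng.sample(list(population), sample_size)
        parent = max(sample, key=fitness)     # best of the random sample
        child = mutate(parent, rng)
        population.append(child)              # youngest goes to the right
        history.append(child)
        population.popleft()                  # discard the oldest, not the worst
    return history

# Toy problem (assumption): bit strings, fitness counts the ones.
def random_bits(rng):
    return [rng.randint(0, 1) for _ in range(16)]

def count_ones(genotype):
    return sum(genotype)

def flip_one_bit(genotype, rng):
    child = list(genotype)
    i = rng.randrange(len(child))
    child[i] = 1 - child[i]
    return child

rng = random.Random(0)
history = regularized_evolution(random_bits, count_ones, flip_one_bit,
                                cycles=100, population_size=20, sample_size=5, rng=rng)
best = max(history, key=count_ones)
```

The only difference from plain tournament selection is the `popleft()` line: a genotype survives only by producing offspring, never by being good once.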

Mutations for NASNet cell structure
- Simplest set of mutations that would allow
evolving in the NASNet search space: Hidden
state mutation, Op mutation, and Identity.

Progressive Neural Architecture Search
Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li,
Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy (Google / Stanford)
ECCV 2018

Sequential Model-Based Optimization (SMBO)
- Searching for structures in order of increasing complexity, while
simultaneously learning a surrogate model to guide the search through
structure space.
- 5x more efficient (# of models evaluated to achieve desired accuracy)
- 8x faster than NAS (no reranking)

Sequential Model-Based Optimization (SMBO)
- At iteration b of the algorithm, we have a set of K candidate cells
(each of size b blocks), which we train and evaluate on a dataset of
interest.
- Since this process is expensive, we also learn a model or surrogate
function which can predict the performance of a structure without
needing to train it.
- We expand the K candidates of size b into K′ ≫ K children, each of
size b + 1.
- We apply our surrogate function to rank all of the K′ children, pick
the top K, and then train and evaluate them.
- We continue in this way until b = B, which is the maximum number
of blocks we want to use in our cell.
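The loop above can be sketched on a toy problem. Everything specific here is an assumption for illustration: a cell is a tuple of operation ids, `train_and_evaluate` is a cheap stand-in for real training, and `MeanOpSurrogate` is a deliberately simple predictor (the paper uses a learned LSTM or MLP).

```python
OPS = [0, 1, 2, 3]                                   # hypothetical operation ids
TOY_VALUE = {0: 0.1, 1: 0.4, 2: 0.2, 3: 0.3}         # toy per-op quality

def train_and_evaluate(cell):
    """Stand-in for training a cell and measuring validation accuracy."""
    return sum(TOY_VALUE[op] for op in cell) / len(cell)

class MeanOpSurrogate:
    """Toy surrogate: scores a cell by the average observed score of its ops."""
    def __init__(self):
        self.total = {op: 0.0 for op in OPS}
        self.count = {op: 0 for op in OPS}

    def fit(self, cells, scores):
        for cell, score in zip(cells, scores):
            for op in cell:
                self.total[op] += score
                self.count[op] += 1

    def predict(self, cell):
        est = [self.total[op] / self.count[op] if self.count[op] else 0.0 for op in cell]
        return sum(est) / len(est)

def progressive_search(max_blocks, K):
    candidates = [(op,) for op in OPS]               # all cells with one block
    surrogate = MeanOpSurrogate()
    for b in range(1, max_blocks + 1):
        scores = [train_and_evaluate(c) for c in candidates]
        surrogate.fit(candidates, scores)
        if b == max_blocks:
            break
        # Expand every candidate by one block, then keep the top K by prediction.
        children = [c + (op,) for c in candidates for op in OPS]
        children.sort(key=surrogate.predict, reverse=True)
        candidates = children[:K]
    best_score, best_cell = max(zip(scores, candidates))
    return best_cell, best_score

best_cell, best_score = progressive_search(max_blocks=3, K=2)
```

Only K cells are actually "trained" at each size after the first, while the surrogate ranks all K′ children cheaply.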

SMBO Advantages
- The simple structures train faster, so we get some initial results to train the
surrogate quickly.
- We only ask the surrogate to predict the quality of structures that are slightly
different (larger) from the ones it has seen.
- We factorize the search space into a product of smaller search spaces,
allowing us to potentially search models with many more blocks.

SMBO Predictors
- Handle variable-sized inputs
- Correlated with true performance
- Preserving the ranking is more important than minimizing the MSE of predicted accuracy
- Sample efficiency
- We want to train and evaluate as few cells as possible, which means training data is scarce.
- → used LSTM

Neural Architecture Optimization
Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, Tie-Yan Liu (USTC, Microsoft)
NIPS 2018

Continuous Optimization
- (1) An encoder embeds/maps neural network architectures into a continuous
space.
- We use a sequence consisting of discrete string tokens to describe a CNN or RNN
architecture.
- (2) A predictor p takes the continuous representation of a network as input
and predicts its accuracy.
- If models are symmetric (e.g., x2 is formed via swapping two branches within a node in
x1), their embeddings should be close to produce the same performance prediction scores,
so p(x1) = p(x2) = s.
- (3) A decoder maps a continuous representation of
a network back to its architecture.

NAO Algorithm
- (For N iterations)
- 1. Train each candidate architecture found
- 2. Train encoder, predictor, decoder by previous history of Model → Score
- 3. Pick K architectures, forming seed architectures
- 4. Find new candidate architecture representation, using encoder
representation and predictor.
- 5. Decode each candidate architecture representation
- The performance predictor and the encoder enable us to perform gradient-based optimization in the
continuous space to find the embedding of a new architecture
with potentially better accuracy. Such a better embedding is then
decoded to a network by the decoder.
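Step 4 amounts to one gradient-ascent step on the predictor in embedding space, followed by decoding. In the sketch below, $E$, $f$, and $D$ denote the encoder, predictor, and decoder, and $\eta$ is a step size; the symbol names are assumptions chosen to match the three components listed earlier.

```latex
e_x = E(x), \qquad
e_{x'} = e_x + \eta \, \frac{\partial f}{\partial e_x}, \qquad
x' = D(e_{x'})
```

For a small enough $\eta$, the predicted score satisfies $f(e_{x'}) \geq f(e_x)$, which is what makes $x'$ a candidate with potentially better accuracy.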

NAO with ENAS
- NAO tries to reduce the huge computational cost brought by the search
algorithm.
- Weight-sharing aims to ease the huge complexity brought by massive child
models via the one-shot model setup.
- So NAO and weight-sharing (ENAS) are complementary.

Transfer Learning Performance (WikiText-2)

Exploring Randomly Wired Neural
Networks for Image Recognition
Saining Xie, Alexander Kirillov, Ross Girshick, Kaiming He (Facebook)
2019. 4.

Randomly Wired Neural Networks
- The NAS network generator is hand-designed, and the space of allowed wiring
patterns is constrained to a small subset of all possible graphs.
- What happens if we loosen this constraint and design novel network
generators?
- Explore a more diverse set of connectivity patterns through the lens of randomly wired
neural networks.
- 1. Define a stochastic network generator that encapsulates the entire network generation
process.
- 2. Generate randomly wired graphs.

Generator Prior
- Each random graph model has certain probabilistic behaviors such that
sampled graphs likely exhibit certain properties (e.g., WS is highly clustered).
- Ultimately, the generator design determines a probabilistic distribution over
networks, and as a result these networks tend to have certain properties.
- The generator design underlies the prior and thus should not be overlooked.
- Random graphs used
- Erdos-Renyi (ER), Barabasi-Albert (BA), Watts-Strogatz (WS)

Stochastic Network Generators
- We define a network generator as a mapping g from a parameter space Θ to a
space of neural network architectures N, g: Θ → N.
- g(θ) performs a deterministic mapping.
- We can extend g to accept an additional argument s that is the seed of a
pseudo-random number generator that is used internally by g.
- We call generators of the form g(θ, s) stochastic network generators.
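A minimal sketch of a stochastic generator g(θ, s), using the Erdos-Renyi model as the example. The function name `er_generator` and the choice θ = (number of nodes, edge probability) are assumptions; orienting each edge from the lower to the higher node index is one simple way to guarantee the result is a DAG.

```python
import random

def er_generator(theta, seed):
    """g(theta, s): sample an Erdos-Renyi graph and return it as a DAG edge list.

    theta = (num_nodes, p), where p is the probability of each possible edge.
    seed  = s, the seed of the pseudo-random number generator used internally.
    """
    num_nodes, p = theta
    rng = random.Random(seed)
    return [(i, j)
            for i in range(num_nodes)
            for j in range(i + 1, num_nodes)   # edges always point from i to j > i
            if rng.random() < p]

theta = (8, 0.4)
graph_a = er_generator(theta, seed=0)
graph_b = er_generator(theta, seed=0)   # same theta and seed: same network
graph_c = er_generator(theta, seed=1)   # same theta, new seed: another sample
```

For a fixed seed the mapping is deterministic; varying the seed under the same θ draws different networks from the same distribution.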

NAS vs. Stochastic Network Generators
- LSTM is only part of the complete NAS network generator, which is in fact a
stochastic network generator.
- The output of each LSTM time-step is a probability distribution conditioned on
θ.
- Given this distribution and the seed s, each step samples a construction
action.
- Network space N has been carefully restricted by hand designed rules.
e.g. “Cell”, M=5, No output concat to avg…

Mapping from Graphs to Neural Networks
- We define that edges are data flow.
- We define the operations represented by one node as
- Aggregation: Combined via weighted sum, weights: learnable &
positive
- Transformation: The aggregated data is processed by a
transformation defined as a ReLU-convolution-BN triplet = conv
- Distribution: The same copy of the transformed data is sent out by
the output edges of the node.
- Nodes without any input edge are input nodes,
and nodes without any output edge are output nodes.

Properties of Node Operations
- Additive aggregation (unlike concatenation) maintains the same number of
output channels as input channels, and this prevents the convolution growing
large in computation.
- The transformation should have the same number of output and input
channels, to make sure the transformed data can be combined with the data
from any other nodes.
- Aggregation and distribution are almost parameter free (except for a
negligible number of parameters for weighted summation).
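The three node operations can be sketched as follows. This is an illustration under stated assumptions: a sigmoid is used to keep the aggregation weights positive, and `relu_only` is a stand-in for the ReLU-convolution-BN triplet that merely preserves the channel count.

```python
import numpy as np

def node_forward(inputs, raw_weights, transform, num_out_edges):
    """Aggregation -> transformation -> distribution for a single node."""
    # Aggregation: weighted sum with positive weights (sigmoid of free parameters).
    positive = 1.0 / (1.0 + np.exp(-np.asarray(raw_weights, dtype=float)))
    aggregated = sum(w * x for w, x in zip(positive, inputs))
    # Transformation: must keep the number of channels unchanged.
    transformed = transform(aggregated)
    # Distribution: the same tensor is sent along every output edge.
    return [transformed] * num_out_edges

def relu_only(x):
    """Stand-in (assumption) for the ReLU-conv-BN triplet; shape-preserving."""
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
inputs = [rng.normal(size=(4, 8, 8)) for _ in range(2)]   # two input edges, C=4
outputs = node_forward(inputs, raw_weights=[0.0, 1.0],
                       transform=relu_only, num_out_edges=3)
```

Because aggregation is additive, the output has the same shape as each input regardless of how many edges feed the node.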

RandWire Architectures
- We use a simple strategy: the random graph generated above defines one
stage (layer) of the network, e.g., a conv stage.

Weight Agnostic Neural Networks
Adam Gaier, David Ha (Google)
2019. 6.

Network Architectures that Encode Solutions
- It is never claimed that the solution from NAS approach is innate to the
structure of the network – no one supposes these networks will solve the task
without training. The weights are the solution; the found architectures merely
a better substrate for the weights to inhabit.
- To produce architectures that themselves encode solutions, the importance
of weights must be minimized. Rather than judging networks by their
performance with optimal weight values, we can instead measure their
performance when their weight values are drawn from a random distribution.
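A toy sketch of scoring a topology without training. All connections share one weight value, and the loss is averaged over several such values. The fixed set of weight values (used here in place of random draws), the tiny feed-forward format, and the |x| regression task are all assumptions for illustration.

```python
def forward(topology, shared_weight, inputs):
    """Evaluate a tiny feed-forward net whose connections all share one weight.

    topology: list of (source_indices, activation), one entry per non-input node,
    in topological order; the last node is the output.
    """
    values = list(inputs)
    for sources, activation in topology:
        values.append(activation(shared_weight * sum(values[i] for i in sources)))
    return values[-1]

def weight_agnostic_loss(topology, dataset,
                         weight_values=(-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)):
    """Mean squared error averaged over several shared weight values."""
    losses = []
    for w in weight_values:
        errors = [(forward(topology, w, x) - y) ** 2 for x, y in dataset]
        losses.append(sum(errors) / len(errors))
    return sum(losses) / len(losses)

# Toy task (assumption): predict |x| from x.
dataset = [((-1.0,), 1.0), ((0.5,), 0.5), ((2.0,), 2.0)]
linear_net = [((0,), lambda v: v)]   # one output node, identity activation
abs_net = [((0,), abs)]              # same wiring, abs activation

linear_loss = weight_agnostic_loss(linear_net, dataset)
abs_loss = weight_agnostic_loss(abs_net, dataset)
```

The network with the `abs` activation scores better for every weight value, so its advantage comes from the structure rather than from tuned weights.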

Weight Agnostic Neural Network Search

Topology Search
- Inspired by neuroevolution algorithm NEAT.
- (1) Insert Node: a new node is inserted by splitting an existing connection.
- (2) Add Connection: a new connection is added by connecting two previously unconnected
nodes.
- (3) Change Activation: the activation function of a hidden node is reassigned.

Experimental Results: CartPoleSwingUp & MNIST

HyperNetworks
David Ha, Andrew Dai, Quoc V. Le (Google)
ICLR 2017

HyperNetwork
- Schmidhuber has suggested the concept of fast weights in which one
network (HyperNetwork) can produce context-dependent weight changes for
a second network.
- Recurrent networks: imposing weight-sharing across layers, which makes
them inflexible and difficult to learn due to vanishing gradients.
- Convolutional networks: having redundant parameters when the networks are
deep.
- Hypernetworks can be viewed as a relaxed form of weight-sharing across
layers.

Static and Dynamic HyperNetworks

SMASH: One-Shot
Model Architecture Search
through HyperNetworks
Andrew Brock, Theodore Lim, J.M. Ritchie, Nick Weston (Heriot-Watt Univ., Renishaw PLC)
ICLR 2018

Why HyperNetworks?
- Bypass the expensive procedure of fully training candidate models by instead
training an auxiliary model, a HyperNet, to dynamically generate the weights
of a main model with variable architecture.
- By comparing validation performance for a set of architectures using
generated weights, we can approximately rank numerous architectures at the
cost of a single training run.

SMASH
- At each training step, we randomly sample a network architecture, generate
the weights for that architecture using a HyperNet, and train the entire system
end-to-end through backpropagation.
- When the model is finished training, we sample a number of random
architectures and evaluate their performance on a validation set, using
weights generated by the HyperNet.
- We then select the architecture with the best estimated validation
performance and train its weights normally.

Efficient Neural Architecture Search
via Parameter Sharing
Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, Jeff Dean (Google / CMU / Stanford)
ICML 2018

Efficient Neural Architecture Search (ENAS)
- A fast and inexpensive approach for automatic model design,
1000x less expensive than standard NAS.
- The main contribution of this work is to improve the efficiency of NAS
by forcing all child models to share weights to eschew training each child
model from scratch to convergence while delivering strong empirical
performances.
- Central to the idea of ENAS is the observation that all of the graphs which
NAS ends up iterating over can be viewed as sub-graphs of a larger graph.
In other words, we can represent NAS’s search space
using a single directed acyclic graph (DAG).

Recurrent Cells
- To design recurrent cells, we employ a DAG with N nodes, where the nodes
represent local computations, and the edges represent the flow of information
between the N nodes.
- ENAS’s controller is an RNN that decides:
- 1) which edges are activated
- 2) which computations are performed at each node in the DAG.
- Our search space allows ENAS to design both the topology and the
operations in RNN cells, and hence is more flexible than NAS.

Recurrent Cells
- First node: The controller first samples an activation function.
- Middle nodes: samples a previous index and an activation function.
- Output node: we simply average all the loose ends, i.e. the nodes that are not
selected as inputs to any other nodes.
- Note that for each pair of nodes j < ℓ, there is an independent parameter
matrix $W^{(h)}_{\ell,j}$. → Shared weights.
- 4 activation functions, N nodes: search space = 4^N × N!
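To get a feel for the size of this space, the count can be evaluated directly. N = 12 is used as the example because, as far as I recall, that is the setting in the ENAS experiments; treat it as an assumption. The helper name is also hypothetical.

```python
import math

def enas_rnn_search_space(num_nodes, num_activations=4):
    """Number of recurrent cells: num_activations^N * N!, as stated on the slide."""
    return num_activations ** num_nodes * math.factorial(num_nodes)

size = enas_rnn_search_space(12)
order_of_magnitude = math.floor(math.log10(size))
```

For N = 12 this gives a count on the order of 10^15 cells, far too many to train individually, which is the motivation for sharing weights.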

Training ENAS
- In ENAS, there are two sets of learnable parameters:
the parameters of the controller LSTM, denoted by θ,
and the shared parameters of the child models, denoted by ω.
- The first phase trains ω, the shared parameters of the child models,
on a whole pass through the training data set.
(Fix policy, choose model based on policy, minimize loss of the model)
- Surprisingly, we can update ω using the gradient from any single model m sampled from
policy. It just works fine.
- The second phase trains θ, the parameters of the controller LSTM, for a fixed
number of steps. (Trains policy to maximize on validation set)
- Two phases are alternated during the training of ENAS.

Deriving architectures from trained ENAS model
- We first sample several models from the trained policy π(m, θ).
- For each sampled model, we compute its reward on a single minibatch
sampled from the validation set. → Model chosen
- We then take only the model with the highest reward to re-train from scratch.
→ Train the chosen model

Convolutional Cells (Macro)
- Chooses 1) what previous nodes to connect to
and 2) what computation operation to use
- (vs. Recurrent Cells. 1) what previous nodes to connect to, 2) what activation to use)
- It allows the model to form skip connections.
- As for recurrent cells, each operation at each layer in our ENAS convolutional
network has a distinct set of parameters.

Convolutional Cells Found (Macro)

Convolutional Cells (Micro)
- Same search space as Learning Transferable Architectures for Scalable Image Recognition (NASNet)
- We utilize the ENAS computational DAG with B nodes to represent the
computations that happen locally in a cell.
- We sample the reduction cell conditioned on the convolutional cell, hence
making the controller RNN run for a total of 2(B − 2) blocks.

Convolutional Cells Found (Micro)

Performance (CIFAR-10)
- Cutout: Simple regularization
technique of randomly masking
out square regions of input during
training

NAS vs. ENAS
- Even a minimal change to ENAS degrades its performance.
- We thus believe that the controller RNN learned by ENAS is as good as the
controller RNN learned by NAS.
- The performance gap between NAS and ENAS is due to the fact that we do
not sample multiple architectures from our trained controller, train them, and
then select the best architecture on the validation data.
- This extra step benefits NAS’s performance.

Understanding and Simplifying
One-Shot Architecture Search
Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, Quoc Le (Google)
ICML 2018

One-shot Model
- It is possible to efficiently identify promising architectures from a complex
search space without either hypernetworks or RL.
- Train a large one-shot model containing every possible operation in the
search space.
- Zero out some of the operations and measure the impact on the model’s
prediction accuracies. Network automatically focuses its capacity on the
operations that are most useful for generating good predictions.

One-shot Architecture Search
- (1) Design a search space that allows us to represent a wide variety of
architectures using a single one-shot model.
○ Enabling or disabling incoming connections makes the size of the search space grow
exponentially while the size of the one-shot model grows only linearly.
- (2) Train the one-shot model to make it predictive of the validation accuracies
of the architectures.
○ If we train naively, the components can co-adapt. Removing operations – even unimportant
ones – from the network can cause the quality of the model’s predictions to degrade severely.
- (3) Evaluate candidate architectures on the validation set using the
pre-trained one shot model.
- (4) Re-train the most promising architectures from scratch
and evaluate their performance on the test set.

DARTS:
Differentiable Architecture Search
Hanxiao Liu, Karen Simonyan, Yiming Yang (CMU, Google)
ICLR 2019

Differentiable Architecture Search
- Unlike conventional approaches of applying evolution or reinforcement
learning over a discrete and non-differentiable search space,
our method is based on the continuous relaxation of the architecture
representation, allowing efficient search of the architecture using gradient descent.

Overview
- (a) Operations on the edges are initially unknown.
- (b) Continuous relaxation of the search space
by placing a mixture of candidate operations on each edge.
- (c) Joint optimization of the mixing probabilities and the network weights
by solving a bilevel optimization problem.
- (d) Inducing the ﬁnal architecture from the learned mixing probabilities.

Continuous Relaxation
- To make the search space continuous, we relax the categorical choice of a
particular operation o to a softmax over all possible operations O.
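- $\bar{o}^{(i,j)}(x) = \sum_{o \in O} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o' \in O} \exp(\alpha_{o'}^{(i,j)})} \, o(x)$,
where the operation mixing weights for a pair of nodes (i, j)
are parameterized by a vector $\alpha^{(i,j)}$ of dimension |O|.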
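A minimal NumPy sketch of a mixed operation on one edge. The three candidate operations (identity, zero, ReLU) are toy stand-ins for the real candidate set, and the names `mixed_op` and `softmax` are assumptions.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))   # subtract the max for numerical stability
    return e / e.sum()

def mixed_op(x, alpha, ops):
    """Softmax-weighted sum of all candidate operations applied to x."""
    weights = softmax(alpha)
    return sum(w * op(x) for w, op in zip(weights, ops))

# Toy candidate set (assumption): identity, zero, ReLU.
ops = [lambda x: x, lambda x: np.zeros_like(x), lambda x: np.maximum(x, 0.0)]
alpha = np.zeros(len(ops))      # equal mixing weights at initialization
x = np.array([-1.0, 2.0])

y = mixed_op(x, alpha, ops)     # uniform mixture of the three outputs
chosen = int(np.argmax(alpha))  # final discretization keeps the strongest op
```

During search the edge computes the mixture; at the end, each edge is replaced by the single operation with the largest mixing weight.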

Joint Optimization (Bilevel Optimization)
- Jointly learn the architecture α and the weights w within all the mixed
operations (e.g. weights of the convolution ﬁlters).
- (While not converged)
- 1. Update architecture α by descending the validation loss
- 2. Update weights w by descending the training loss
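For reference, the bilevel problem and the alternating updates as I understand them from the DARTS paper. $\mathcal{L}_{train}$ and $\mathcal{L}_{val}$ are the training and validation losses; $\xi$ is the step size of the one-step approximation of $w^{*}(\alpha)$, and setting $\xi = 0$ gives the cheaper first-order variant.

```latex
\min_{\alpha} \; \mathcal{L}_{val}\left(w^{*}(\alpha), \alpha\right)
\quad \text{s.t.} \quad
w^{*}(\alpha) = \arg\min_{w} \; \mathcal{L}_{train}(w, \alpha)

% Alternating updates, with w*(alpha) approximated by a single gradient step:
\alpha \leftarrow \alpha - \eta_{\alpha} \, \nabla_{\alpha}
  \mathcal{L}_{val}\left(w - \xi \nabla_{w} \mathcal{L}_{train}(w, \alpha), \, \alpha\right)

w \leftarrow w - \eta_{w} \, \nabla_{w} \mathcal{L}_{train}(w, \alpha)
```

The learning rates $\eta_{\alpha}$ and $\eta_{w}$ are assumed names for illustration.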

ProxylessNAS:
Direct Neural Architecture Search
on Target Task and Hardware
Han Cai, Ligeng Zhu, Song Han (MIT)
ICLR 2019

Proxyless Training
- Differentiable NAS can reduce the cost of GPU hours via a continuous
representation of network architecture, but suffers from high GPU memory
consumption (which grows linearly w.r.t. candidate set size).
- As a result, such methods need to utilize proxy tasks.
○ e.g., a smaller dataset, learning with only a few blocks, or training for just a few epochs
- Architectures optimized on proxy tasks are not guaranteed to be optimal on the target task.
- ProxylessNAS can directly learn the architectures for large-scale target
tasks and target hardware platforms by training memory-efficiently.

MnasNet: Platform-Aware Neural
Architecture Search for Mobile
Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard,
Quoc V. Le (Google)
CVPR 2019

Model Latency Problem
- Explicitly incorporate model latency into the main objective so that the search
can identify a model that achieves a good trade-off between accuracy and
latency.
- Unlike previous work, where latency is considered via another, often
inaccurate proxy (e.g., FLOPS), our approach directly measures real-world
inference latency by executing the model on mobile phones.
- FLOPS is often an inaccurate proxy: for example, MobileNet and NASNet have
similar FLOPS (575M vs. 564M), but their latencies are significantly different
(113ms vs. 183ms)

Model Latency Problem
- While previous approaches mainly perform architecture search on smaller
tasks such as CIFAR10, we find those small proxy tasks don’t work when
model latency is taken into account, because one typically needs to scale up
the model when applying to larger problems.
- In this paper, we directly perform our architecture search on the ImageNet
training set but with fewer training steps (5 epochs).

Factorized Hierarchical Search Space
- Previous approaches mainly search for a few types of cells and then
repeatedly stack them. This simplifies the search process, but also precludes layer
diversity that is important for computational efficiency.
- Advantage of balancing the diversity of layers and the size of total search
space

Pareto Optimal
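- Treating latency as a hard constraint (maximize accuracy subject to latency ≤ target)
only maximizes a single metric and does not provide multiple
Pareto optimal solutions.
- Instead, a weighted product of accuracy and latency is used as the reward to approximate
Pareto optimal solutions.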
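The two formulations contrasted on this slide can be written as follows, based on my reading of the MnasNet paper. $ACC(m)$ and $LAT(m)$ are the accuracy and measured latency of model $m$, $T$ is the target latency, and $\alpha$, $\beta$ are application-specific constants controlling how strongly latency is penalized.

```latex
% Hard-constraint formulation (single metric):
\max_{m} \; ACC(m) \quad \text{s.t.} \quad LAT(m) \leq T

% Weighted-product formulation (approximates Pareto optimal solutions):
\max_{m} \; ACC(m) \times \left[ \frac{LAT(m)}{T} \right]^{w},
\qquad
w = \begin{cases} \alpha, & \text{if } LAT(m) \leq T \\ \beta, & \text{otherwise} \end{cases}
```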

FBNet: Hardware-Aware Efficient
ConvNet Design via Differentiable
Neural Architecture Search
Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu,
Yuandong Tian, Peter Vajda, Yangqing Jia, Kurt Keutzer (Facebook)
CVPR 2019

Designing Convnets is hard!
- Intractable design space: The design space of a ConvNet is combinatorial &
training a ConvNet is very time-consuming.
- Nontransferable optimality: the optimality is conditioned on many factors
such as input resolutions and target devices. Once these factors change, the
optimal architecture is likely to be different.
- Inconsistent efficiency metrics: Most of the efficiency metrics we care about
are dependent on not only the ConvNet architecture but also the hardware
and software configurations on the target device.

Differentiable NAS
- Layer-wise search space where we can choose a different block for each layer
of the network
- By using the Gumbel Softmax technique, we can directly train the architecture
distribution using gradient-based optimization, which is extremely fast
compared with previous reinforcement learning (RL) based methods.
- We measure the latency of each operator and use a lookup table model.
Overall latency is computed by adding up each operator. Using this allows us
to quickly estimate latency, and it makes the latency differentiable with
respect to layer-wise block choices.
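A sketch of how a lookup table combines with Gumbel Softmax to give a latency estimate that is a smooth function of the architecture parameters. The latency values, the temperature, and the table layout are made-up assumptions; only the forward computation is shown, without gradients.

```python
import numpy as np

def gumbel_softmax(logits, temperature, rng):
    """Relaxed one-hot sample per row of `logits`."""
    uniform = rng.uniform(1e-10, 1.0, size=logits.shape)
    gumbel = -np.log(-np.log(uniform))              # Gumbel(0, 1) noise
    z = (logits + gumbel) / temperature
    z = z - z.max(axis=-1, keepdims=True)           # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical lookup table: LATENCY_MS[layer, block] in milliseconds.
LATENCY_MS = np.array([[1.0, 2.5, 4.0],
                       [0.8, 2.0, 3.5]])
logits = np.zeros_like(LATENCY_MS)                  # architecture parameters

rng = np.random.default_rng(0)
mask = gumbel_softmax(logits, temperature=5.0, rng=rng)
# Overall latency: sum over layers of the mask-weighted operator latencies.
expected_latency = float((mask * LATENCY_MS).sum())
```

Since `expected_latency` is a weighted sum of table entries with weights produced by a softmax, it is differentiable with respect to `logits` when written in an autodiff framework.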

Performance (ImageNet)
- Achieves better accuracy and lower latency than MnasNet, and we estimate
that the search cost of DNAS is 420x smaller.

EfficientNet:
Rethinking Model Scaling for
Convolutional Neural Networks
Mingxing Tan, Quoc V. Le (Google)
ICML 2019

Scaling up CNNs
- Convolutional Neural Networks (ConvNets) are commonly developed at a
fixed resource budget, and then scaled up for better accuracy if more
resources are available.
- Carefully balancing network depth, width, and resolution can lead to better
performance.
- A new scaling method that uniformly scales all dimensions of
depth (e.g., ResNet, Inception) / width (e.g., WideResNet, MobileNet)
/ resolution (e.g., NASNet, GPipe) using a simple yet highly effective compound
coefficient.

Compound Scaling
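- depth: $d = \alpha^{\phi}$, width: $w = \beta^{\phi}$, resolution: $r = \gamma^{\phi}$,
subject to $\alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2$ and $\alpha \geq 1, \beta \geq 1, \gamma \geq 1$.
- φ is a user-specified coefficient that controls how many more resources are
available for model scaling, while α, β, γ specify how to assign these extra
resources to network depth, width, and resolution, respectively.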
- FLOPS of a regular convolution op is proportional to $d$, $w^2$, $r^2$.
- Total FLOPS will approximately increase by $2^{\phi}$.
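A short numeric check of the FLOPS claim. The constants α = 1.2, β = 1.1, γ = 1.15 are the values I recall being reported for EfficientNet-B0; treat them as assumptions. The function name `compound_scale` is hypothetical.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # assumed: depth, width, resolution bases

def compound_scale(phi):
    """Return (depth, width, resolution, flops) multipliers for coefficient phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPS scale linearly with depth and quadratically with width and resolution.
    flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops

d1, w1, r1, flops1 = compound_scale(1)
d2, w2, r2, flops2 = compound_scale(2)
```

With these constants, one unit of φ multiplies FLOPS by about 1.92, close to the target factor of 2, and two units multiply it by about 1.92 squared.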

Compound Scaling
- Developed baseline network by leveraging a multi-objective neural
architecture search from MnasNet.
- Starting from the baseline EfficientNet-B0, we apply our compound scaling
method to scale it up in two steps.
- 1. Fix φ = 1, assuming twice as many resources are available, and do a small
grid search of α, β, γ to find the optimal values.
- 2. Fix α, β, γ as constants and scale up the baseline network with different φ.
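
Step 1 can be sketched as a constrained grid search. The grid step, upper bound, tolerance, and the scoring function below are all assumptions made for illustration; in particular `placeholder_score` merely stands in for "train the scaled network and measure validation accuracy" and carries no real meaning.

```python
import itertools

def candidate_coefficients(step=0.05, upper=1.5, target=2.0, tolerance=0.1):
    """Enumerate (alpha, beta, gamma) >= 1 with alpha * beta^2 * gamma^2 near target."""
    n = int(round((upper - 1.0) / step))
    grid = [round(1.0 + step * i, 2) for i in range(n + 1)]
    return [
        (a, b, g)
        for a, b, g in itertools.product(grid, repeat=3)
        if abs(a * b ** 2 * g ** 2 - target) <= tolerance
    ]

def placeholder_score(alpha, beta, gamma):
    """Hypothetical stand-in for validation accuracy: prefers balanced coefficients."""
    return -(max(alpha, beta, gamma) - min(alpha, beta, gamma))

candidates = candidate_coefficients()
best = max(candidates, key=lambda c: placeholder_score(*c))
print(len(candidates), best)
```

The constraint keeps every candidate at roughly twice the baseline FLOPS, so the search compares ways of spending the same budget rather than different budgets.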

ScarletNAS: Bridging the Gap
Between Scalability and Fairness
in Neural Architecture Search
Xiangxiang Chu, Bo Zhang, Jixiang Li, Qingyuan Li, Ruijun Xu (Xiaomi)
2019. 8.

Supernet training with variable depths
- One-shot NAS features fast training of a supernet in a single run,
but the weight-sharing approach lacks scalability.
- An identity block helps build a scalable supernet (with variable depths), but
it makes supernet training unstable.
- Introduce a linearly equivalent transformation to soothe training turbulence,
with a proof that the transformed path is identical to the original one in
terms of representational power.

Linearly Equivalent Transformation
- As pure identity blocks are direct short paths and don't learn any
information, we have to compensate for this defect by injecting a learning unit.
- Here we remedy the issue with a 1 × 1 convolution without non-linear
activations.
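
One way to see why such a unit adds no representational power: a pointwise convolution with no activation is a linear map, so composing it with the next convolution collapses into a single convolution with merged weights. The sketch below checks this numerically on a 1-D analogue (kernel size 1 followed by kernel size 3, no bias); the shapes, the 1-D simplification, and the function name are assumptions for illustration, not the paper's proof.

```python
import random

def conv1d(x, w):
    """Naive 'valid' 1-D cross-correlation without bias.

    x: list of C_in channels, each a list of length L.
    w: w[o][c] is the length-k kernel from input channel c to output channel o.
    """
    k = len(w[0][0])
    out_len = len(x[0]) - k + 1
    return [
        [sum(w_o[c][t] * x[c][p + t] for c in range(len(x)) for t in range(k))
         for p in range(out_len)]
        for w_o in w
    ]

random.seed(0)
c_in, c_mid, c_out, length = 3, 4, 2, 10
x = [[random.gauss(0, 1) for _ in range(length)] for _ in range(c_in)]
# Pointwise (kernel size 1) convolution with no activation afterwards.
w_point = [[[random.gauss(0, 1)] for _ in range(c_in)] for _ in range(c_mid)]
# The convolution that follows it (kernel size 3).
w_next = [[[random.gauss(0, 1) for _ in range(3)] for _ in range(c_mid)]
          for _ in range(c_out)]

two_step = conv1d(conv1d(x, w_point), w_next)

# Merge both layers into one kernel: sum over the intermediate channels.
w_merged = [
    [[sum(w_next[o][m][t] * w_point[m][c][0] for m in range(c_mid)) for t in range(3)]
     for c in range(c_in)]
    for o in range(c_out)
]
one_step = conv1d(x, w_merged)

max_abs_diff = max(abs(a - b)
                   for row_a, row_b in zip(two_step, one_step)
                   for a, b in zip(row_a, row_b))
print(max_abs_diff)
```

The two outputs agree up to floating-point rounding, which is the sense in which the transformed path matches the original one.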

NAS-Bench-101:
Towards Reproducible
Neural Architecture Search
Chris Ying, Aaron Klein, Esteban Real, Eric Christiansen, Kevin Murphy, Frank Hutter
(Google)
ICML 2019

NAS research is hard!
- NAS demands tremendous computational resources, which makes it difficult
to reproduce experiments and imposes a barrier to entry for researchers
without access to large-scale computation.
- Although recent improvements have yielded more efficient methods, different
methods are not comparable to each other due to different training procedures
and different search spaces, which makes it difficult to attribute the success
of each method to the search algorithm itself.

Architecture Dataset
- We carefully constructed a search space, exploiting graph isomorphisms to
identify 423k unique convolutional architectures.
- Each cell is encoded as a 7-vertex directed acyclic graph plus an operation
label, one for each of the 5 intermediate vertices (recall that the input and
output vertices are fixed).
- To support both ResNet and Inception-like cells and to keep the space
tractable: tensors going to the output vertex are concatenated and those
going into other vertices are summed.
- We trained and evaluated all of these architectures multiple times on
CIFAR-10 and compiled the results into a large dataset of
over 5 million trained models.
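
A rough sketch of how such a cell could be represented and sanity-checked: an upper-triangular adjacency matrix for the graph plus one operation label per vertex. The label strings, the helper names, and the example cell are hypothetical; this is not the benchmark's actual API.

```python
INPUT, OUTPUT = "input", "output"
ALLOWED_OPS = {"conv3x3", "conv1x1", "maxpool3x3"}  # assumed label names
MAX_VERTICES = 7

def is_valid_cell(matrix, ops):
    """Check size, fixed input/output vertices, op labels, and acyclicity."""
    n = len(ops)
    if n < 2 or n > MAX_VERTICES:
        return False
    if len(matrix) != n or any(len(row) != n for row in matrix):
        return False
    if ops[0] != INPUT or ops[-1] != OUTPUT:
        return False
    if any(op not in ALLOWED_OPS for op in ops[1:-1]):
        return False
    # Strictly upper-triangular adjacency: edges only go forward, so no cycles.
    return all(matrix[i][j] == 0 for i in range(n) for j in range(n) if j <= i)

def merge_rule(vertex, ops):
    """Tensors entering the output vertex are concatenated, all others summed."""
    return "concat" if ops[vertex] == OUTPUT else "sum"

# Hypothetical Inception-like cell: two parallel branches joined at the output.
ops = [INPUT, "conv1x1", "conv3x3", OUTPUT]
matrix = [
    [0, 1, 1, 0],  # input -> conv1x1, input -> conv3x3
    [0, 0, 0, 1],  # conv1x1 -> output
    [0, 0, 0, 1],  # conv3x3 -> output
    [0, 0, 0, 0],
]
print(is_valid_cell(matrix, ops), merge_rule(3, ops), merge_rule(1, ops))
```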

Metrics
- training accuracy, validation accuracy, testing accuracy,
training time in seconds, number of trainable model parameters
- Only metrics on the training and validation set should be used to search
models within a single NAS algorithm, and testing accuracy should only be
used for an offline evaluation. The training time metric allows benchmarking
algorithms that optimize for accuracy while operating under a time limit and
also allows the evaluation of multi-objective optimization methods.

Accuracy
- We repeat the training and evaluation of all architectures 3 times to obtain a
measure of variance, and we train all architectures with four increasing
epoch budgets: {4, 12, 36, 108}.
- Figure: (left) train/valid/test accuracy after training for 108 epochs and
(right) the noise, defined as the standard deviation of the test accuracy
between the three trials.
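
As a small illustration of the noise measure, the sketch below computes the standard deviation of test accuracy over three repeated runs of one architecture. The accuracy values are hypothetical, and using the sample (n - 1) standard deviation is an assumption; the slides do not say which variant is used.

```python
import statistics

# Hypothetical test accuracies of one architecture across three repeated runs.
test_accuracy_runs = [0.931, 0.934, 0.929]

# Noise for this architecture: spread of test accuracy between the trials.
noise = statistics.stdev(test_accuracy_runs)
print(f"noise = {noise:.4f}")
```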

Pareto Frontier
- Hand-designed cells, such as ResNet and Inception, perform near the Pareto
frontier of accuracy over cost, which suggests that topology and operation
selection are critical for finding both high-accuracy and low-cost models.
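
For readers unfamiliar with the term, a Pareto frontier of accuracy over cost keeps only the models that no other model beats on both axes. A minimal sketch, with hypothetical model names and numbers:

```python
def pareto_frontier(models):
    """Return names of models not dominated by any other model.

    A model is dominated if another one has lower-or-equal cost and
    higher-or-equal accuracy, with at least one of the two strictly better.
    """
    frontier = []
    for name, cost, acc in models:
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for _, c, a in models
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical (name, training time in seconds, validation accuracy) triples.
models = [
    ("cell_a", 900.0, 0.920),
    ("cell_b", 1500.0, 0.940),
    ("cell_c", 1600.0, 0.935),  # slower and less accurate than cell_b
    ("cell_d", 2500.0, 0.946),
]
frontier = pareto_frontier(models)
print(frontier)
```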

Locality of Architecture Search Space
- Locality across the whole space
○ Random-walk autocorrelation (RWA), defined as the autocorrelation of the
accuracies of points visited as we perform a random walk through the space,
shows high correlations for lower distances, indicating locality. The
correlations become indistinguishable from noise beyond a distance of about 6.
- Locality around a global accuracy maximum
○ Fitness-distance correlation metric (FDC) shows that there is locality around the global
maximum as well and the peak also has a coarse-grained width of about 6.
- Locality around inception-like cell
○ Fraction of the search space volume that lies within a given distance
to the closest high peak.
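
The random-walk autocorrelation idea can be sketched on a toy search space. Here architectures are replaced by bit strings, one step of the walk flips a single bit (edit distance 1), and the "accuracy" is a hypothetical smooth function; none of this reproduces the benchmark's actual space or numbers.

```python
import random

def random_walk_accuracies(fitness, n_bits, steps, rng):
    """Record the fitness of each point visited on a single-bit-flip random walk."""
    state = [rng.randint(0, 1) for _ in range(n_bits)]
    trace = []
    for _ in range(steps):
        trace.append(fitness(state))
        state[rng.randrange(n_bits)] ^= 1  # move to a neighbour at distance 1
    return trace

def autocorrelation(series, lag):
    """Autocorrelation of the series with itself shifted by `lag` steps."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series) / n
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag)) / (n - lag)
    return cov / var

def toy_fitness(state):
    """Hypothetical smooth landscape: fraction of bits set to 1."""
    return sum(state) / len(state)

rng = random.Random(0)
trace = random_walk_accuracies(toy_fitness, n_bits=20, steps=5000, rng=rng)
rwa = {lag: autocorrelation(trace, lag) for lag in (1, 2, 4, 8, 16)}
print(rwa)
```

On a landscape with locality, the correlation is high at short lags and decays as the walk moves further from its starting point.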
