Neural Architecture Search:
Learning How to Learn
Kwanghee Choi
Local Optima 2019
Reference
- Neural Architecture Search with Reinforcement Learning (ICLR 2017)
- Learning Transferable Architectures for Scalable Image Recognition (CVPR 2018)
- Large-Scale Evolution of Image Classifiers (ICML 2017)
- Hierarchical Representations for Efficient Architecture Search (ICLR 2018)
- Regularized Evolution for Image Classifier Architecture Search (AAAI 2019)
- Progressive Neural Architecture Search (ECCV 2018)
- Neural Architecture Optimization (NIPS 2018)
- Exploring Randomly Wired Neural Networks for Image Recognition (2019)
- Weight Agnostic Neural Networks (2019)
- HyperNetworks (ICLR 2017)
- SMASH: One-Shot Model Architecture Search through HyperNetworks (ICLR 2018)
- Efficient Neural Architecture Search via Parameter Sharing (ICML 2018)
- Understanding and Simplifying One-Shot Architecture Search (ICML 2018)
- DARTS: Differentiable Architecture Search (ICLR 2019)
- ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware (ICLR 2019)
- MnasNet: Platform-Aware Neural Architecture Search for Mobile (CVPR 2019)
- FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search (CVPR 2019)
- EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks (ICML 2019)
- ScarletNAS: Bridging the Gap Between Scalability and Fairness in Neural Architecture Search (2019)
- NAS-Bench-101: Towards Reproducible Neural Architecture Search (ICML 2019)
Introduction
- Excerpt from Exploring Randomly Wired Neural Networks for Image
Recognition (2019)
- Neural networks for image recognition have evolved through extensive
manual design, e.g., ResNet and DenseNet.
- What we call deep learning today descends from the connectionist approach
to cognitive science — a paradigm reflecting the hypothesis that how
computational networks are wired is crucial for building intelligent machines.
- NAS (Neural Architecture Search): optimizes the wiring and the operation types,
but the possible wirings and operations are constrained.
Neural Architecture Search
with Reinforcement Learning
Barret Zoph, Quoc V. Le (Google)
ICLR 2017
Neural Architecture Search
Neural Architecture Search
- A gradient-based method for finding good architectures
- Use a recurrent network to generate the model descriptions of neural
networks and train this RNN with reinforcement learning to maximize the
expected accuracy of the generated architectures on a validation set.
- The structure and connectivity of a neural network can typically be specified
by a variable-length string. It is therefore possible to use a recurrent network –
the controller – to generate such a string.
- Architecture engineering with CNNs often identifies repeated motifs
consisting of combinations of convolutional filter banks,
nonlinearities and a prudent selection of connections
to achieve state-of-the-art results.
Controller Recurrent Neural Network
- Every prediction is carried out by a softmax classifier and then fed into the next time step as input.
- The process of generating an architecture stops if the number of layers exceeds a certain value.
- Once the controller RNN finishes generating an architecture,
a neural network with this architecture is built and trained.
- At convergence, the accuracy of the network
on a held-out validation set is recorded.
Training with REINFORCE
- The list of tokens that the controller predicts can be viewed as a list of
actions a_{1:T} to design an architecture for a child network.
- At convergence, this child network will achieve an accuracy R on a held-out
dataset.
- We can use this accuracy R as the reward signal and use reinforcement
learning to train the controller.
- REINFORCE by Williams (1992), Sutton (2000)
- We do not predict the learning rate and we also assume that the architectures
consist of only convolutional layers,
which is also quite restrictive.
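A minimal sketch of the REINFORCE update described above, assuming a simplified controller with per-step logits instead of an RNN; train_child, N_STEPS, and N_CHOICES are illustrative placeholders, not the authors' setup.

```python
import numpy as np

N_STEPS, N_CHOICES, LR = 6, 4, 0.1        # tokens per architecture, options per token, step size
logits = np.zeros((N_STEPS, N_CHOICES))   # controller parameters theta_c (no RNN for brevity)
baseline = 0.0                            # moving-average baseline to reduce variance

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def train_child(actions):
    """Placeholder: build a child network from `actions`, train it to convergence,
    and return its held-out validation accuracy R."""
    return np.random.rand()

for _ in range(1000):
    probs = [softmax(logits[t]) for t in range(N_STEPS)]
    actions = [np.random.choice(N_CHOICES, p=p) for p in probs]
    R = train_child(actions)                           # reward = validation accuracy
    baseline = 0.95 * baseline + 0.05 * R
    for t, a in enumerate(actions):
        grad_log_p = -probs[t]                         # gradient of log softmax w.r.t. logits
        grad_log_p[a] += 1.0
        logits[t] += LR * (R - baseline) * grad_log_p  # ascend the expected reward
```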
Distributed training for NAS
- We use a set of S parameter servers to store and send parameters to K controller replicas.
- Each controller replica then samples m architectures and runs the m child models in parallel.
- The accuracy of each child model is recorded to compute the gradients with respect to θ_c,
which are then sent back to the parameter servers.
Generating Skip Connections
- At layer N, we add an anchor point
which has N − 1 content-based sigmoids
to indicate the previous layers that need to be connected.
- Each sigmoid is a function of the current hidden state of the controller and the hidden states
of the previous N − 1 anchor points.
- P(Layer j is an input to layer i) = sigmoid(v^T tanh(W_prev * h_j + W_curr * h_i)),
where h_j represents the hidden state of the controller at the anchor point for the j-th layer,
and j ranges from 0 to N − 1.
- We then sample from these sigmoids to decide which previous layers to use as inputs to the
current layer.
- The matrices W_prev, W_curr and the vector v are trainable parameters.
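A small sketch of the anchor-point attention above, assuming hidden-state dimension D; W_prev, W_curr, and v play the role of the trainable parameters in the formula.

```python
import numpy as np

D = 32
rng = np.random.default_rng(0)
W_prev, W_curr, v = rng.normal(size=(D, D)), rng.normal(size=(D, D)), rng.normal(size=D)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def skip_probability(h_j, h_i):
    """P(layer j is an input to layer i)."""
    return sigmoid(v @ np.tanh(W_prev @ h_j + W_curr @ h_i))

# Sample skip connections for layer i given anchor states of layers 0..4.
anchor_states = [rng.normal(size=D) for _ in range(5)]   # h_0 .. h_4
h_i = rng.normal(size=D)
skips = [rng.random() < skip_probability(h_j, h_i) for h_j in anchor_states]
```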
Generating Skip Connections
- Skip connections can cause “compilation failures” where one layer is not
compatible with another layer, or one layer may not have any input or output.
- If a layer is not connected to any input layer then the image is used as the input layer.
- At the final layer we take all layer outputs that have not been connected and concatenate them
before sending this final hidden state to the classifier.
- If input layers to be concatenated have different sizes, we pad the small layers with zeros so
that the concatenated layers have the same sizes.
Generating Recurrent Cells
- The computations for basic RNN and LSTM cells can be generalized as a tree of steps that take
x_t and h_{t−1} as inputs and produce h_t as the final output.
- The controller RNN needs to label each node in the tree with a combination method (add, dot product,
etc.) and an activation function (tanh, sigmoid, etc.) to merge two inputs to produce one output.
- Two outputs are then fed as inputs to the next node in the tree.
- Two leaf nodes (Tree Index 0, 1): thus it is called a “base 2” architecture.
- In our experiments, we use a base number of 8
to make sure that the cell is expressive.
Performance (CIFAR-10)
Performance (PTB)
Transfer Learning Performance (PTB)
- To understand whether the cell can generalize to a different task, we apply it
to the character language modeling task on the same dataset (PTB).
- The new cell was originally found on word-level language modeling.
Learning Transferable Architectures
for Scalable Image Recognition
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, Quoc V. Le (Google)
CVPR 2018
Transferable Architectures
- We propose to search for an architectural building block on a small dataset
and then transfer the block to a larger dataset: the design of a new search
space (“NASNet search space”) which enables transferability.
- Applying NAS, or any other search methods, directly to a large dataset is computationally
expensive.
- NASNet search space: A search space so that the complexity of the architecture is
independent of the depth of the network and the size of input images.
- All convolutional networks in our search space are composed of convolutional layers (or
“cells”) with identical structure but different weights. Searching for the best convolutional
architectures is therefore reduced to searching for the best cell structure.
- By simply varying # of the convolutional cells and # of filters,
we can create different versions of NASNets
with different computational demands.
Transferable Architectures
- Two types of cells:
- Normal Cell: returns a feature map of the same dimension
- Reduction Cell: returns a feature map whose height and width are
reduced by a factor of two.
- We empirically found it beneficial to learn two separate
architectures.
- We use a common heuristic to double the number of filters in
the output whenever the spatial activation size is reduced in
order to maintain roughly constant hidden state dimension.
- We consider the # of motif repetitions and
the # of initial convolutional filters
as free parameters.
Controller Model Architecture
- Select a hidden state from h_i, h_{i−1}, or from the set of hidden states created in previous blocks.
- In our experiments, selecting B = 5 provides good results, although we have not exhaustively
searched this space due to computational limitations.
- To allow the controller RNN to predict both Normal Cell and Reduction Cell,
we simply make the controller have 2 × 5B predictions in total.
NASNet-A Cells
Transfer Learning Performance (ImageNet)
- The new cell was found on CIFAR-10.
Other Cell Types
- NASNet-B
- The output hidden states are not concatenated; each output hidden state is used as a hidden
state in future layers.
- We allow addition followed by layer normalization or instance normalization.
- NASNet-C
- We allow addition followed by layer normalization or instance normalization.
Large-Scale Evolution
of Image Classifiers
Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena,
Yutaka Leon Suematsu, Jie Tan, Quoc V. Le, Alexey Kurakin (Google)
ICML 2017
Large-Scale Evolution
- Starting out with poor-performing models with no convolutions, the algorithm
must evolve complex convolutional neural networks while navigating a fairly
unrestricted search space.
- We use a simplified graph as our DNA, which is transformed to a full neural
network graph for training and evaluation.
- Mutations were chosen for their similarity to the actions that a human
designer may take when improving an architecture.
- We allow the children to inherit the parents’ weights whenever possible:
if a layer has matching shapes, its weights are preserved.
Progress of an Evolution Experiment
Performance (CIFAR)
Performance (CIFAR)
Hierarchical Representations
for Efficient Architecture Search
Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, Koray Kavukcuoglu (Google)
ICLR 2018
Hierarchical Representations
for describing neural network architectures
Tournament Selection
- Starting from an initial population of random genotypes, tournament selection
provides a mechanism to pick promising genotypes from the population, and
to place its mutated offspring back into the population.
- By repeating this process, the quality of the population keeps being refined
over time.
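A minimal sketch of one tournament-selection step under these assumptions; fitness() and mutate() stand in for training/evaluating a genotype and applying a random mutation.

```python
import random

def tournament_step(population, fitness, mutate, sample_size=8):
    """One round: sample genotypes, mutate the fittest, add the child back."""
    sample = random.sample(population, sample_size)
    parent = max(sample, key=fitness)     # the most promising genotype in the sample
    child = mutate(parent)                # its mutated offspring
    population.append(child)              # offspring joins the population
    return child
```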
Cell Found
- We use the proposed search framework to learn the architecture of a
convolutional cell, rather than the entire model.
- Only motifs 1, 3, 4, and 5 are used to construct the cell,
among which motifs 3 and 5 are dominant.
Motifs Found
Performance (CIFAR-10)
Transfer Learning Performance (ImageNet)
Regularized Evolution for Image
Classifier Architecture Search
Esteban Real, Alok Aggarwal, Yanping Huang, Quoc V Le (Google)
AAAI 2019
Regularized Evolution
- In tournament selection, the best genotypes
(architectures) are kept, we propose to
associate each genotype with an age, and
bias the tournament selection to choose the
younger genotypes, by killing the oldest
population.
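A short sketch of the aging twist, assuming the population is kept as a queue ordered by insertion time; the only change relative to plain tournament selection is that the oldest genotype is removed instead of the worst.

```python
import collections
import random

population = collections.deque()        # seeded elsewhere with random genotypes (left = oldest)

def aging_evolution_step(fitness, mutate, sample_size=8):
    sample = random.sample(list(population), sample_size)
    parent = max(sample, key=fitness)
    population.append(mutate(parent))   # youngest genotype enters on the right
    population.popleft()                # the oldest is discarded, regardless of fitness
```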
Mutations for NASNet cell structure
- Simplest set of mutations that would allow
evolving in the NASNet search space: Hidden
state mutation, Op mutation, and Identity.
Transfer Learning Performance (ImageNet)
Progressive
Neural Architecture Search
Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens , Wei Hua , Li-Jia Li,
Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy (Google / Stanford)
ECCV 2018
Sequential Model-Based Optimization (SMBO)
- Searching for structures in order of increasing complexity, while
simultaneously learning a surrogate model to guide the search through
structure space.
- 5x more efficient (# of models evaluated to achieve desired accuracy)
- 8x faster than NAS (without reranking)
Sequential Model-Based Optimization (SMBO)
- At iteration b of the algorithm, we have a set of K candidate cells
(each of size b blocks), which we train and evaluate on a dataset of
interest.
- Since this process is expensive, we also learn a model or surrogate
function which can predict the performance of a structure without
needing to train it.
- We expand the K candidates of size b into K′ ≫ K children, each of
size b + 1.
- We apply our surrogate function to rank all of the K′ children, pick
the top K, and then train and evaluate them.
- We continue in this way until b = B, which is the maximum number
of blocks we want to use in our cell.
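A sketch of the progressive (SMBO) loop just described; expand(), train_and_eval(), and the surrogate object are placeholders, not the paper's implementation.

```python
def progressive_search(initial_cells, expand, train_and_eval, surrogate, K, B):
    candidates = list(initial_cells)                      # cells with b = 1 block
    history = []
    for b in range(1, B + 1):
        scores = [train_and_eval(c) for c in candidates]  # the expensive step
        history += list(zip(candidates, scores))
        surrogate.fit(history)                            # learn to predict accuracy
        if b == B:
            break
        children = [c2 for c in candidates for c2 in expand(c)]   # K' >> K cells of size b + 1
        children.sort(key=surrogate.predict, reverse=True)
        candidates = children[:K]                         # keep the top K by predicted score
    return max(history, key=lambda cs: cs[1])             # best (cell, accuracy) seen
```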
SMBO Advantages
- The simple structures train faster, so we get some initial results to train the
surrogate quickly.
- We only ask the surrogate to predict the quality of structures that are slightly
different (larger) from the ones it has seen.
- We factorize the search space into a product of smaller search spaces,
allowing us to potentially search models with many more blocks.
SMBO Predictors
- Handle variable-sized inputs
- Correlated with true performance
- Preserving the ranking order is more important than the accuracy MSE
- Sample efficiency
- We want to train and evaluate as few cells as possible, which means training data is scarce.
- → An LSTM is used as the surrogate predictor
Transfer Learning Performance (ImageNet)
Neural Architecture Optimization
Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, Tie-Yan Liu (USTC, Microsoft)
NIPS 2018
Continuous Optimization
- (1) An encoder embeds/maps neural network architectures into a continuous
space.
- We use a sequence consisting of discrete string tokens to describe a CNN or RNN
architecture.
- (2) A predictor p takes the continuous representation of a network as input
and predicts its accuracy.
- If two models are symmetric (e.g., x2 is formed by swapping two branches within a node in
x1), their embeddings should be close so that they produce the same performance prediction
score, i.e., p(x1) = p(x2) = s.
- (3) A decoder maps a continuous representation of
a network back to its architecture.
NAO Algorithm
- (For N iterations)
- 1. Train each candidate architecture found so far
- 2. Train the encoder, predictor, and decoder on the history of model → score pairs
- 3. Pick K architectures to form the seed architectures
- 4. Find new candidate architecture representations using the encoder
representations and the predictor
- 5. Decode each candidate architecture representation
- The performance predictor and the encoder enable us to perform gradient based optimization in the
continuous space to find the embedding of a new architecture
with potentially better accuracy. Such a better embedding is then
decoded to a network by the decoder.
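A sketch of step 4, assuming already-trained encoder, predictor, and decoder modules (here treated as PyTorch callables): gradient ascent on the predicted accuracy in the continuous embedding space, followed by decoding.

```python
import torch

def improve_architecture(arch_tokens, encoder, predictor, decoder,
                         step_size=0.1, n_steps=10):
    z = encoder(arch_tokens).detach()          # continuous representation of the seed
    for _ in range(n_steps):
        z = z.clone().requires_grad_(True)
        score = predictor(z)                   # predicted validation accuracy (scalar)
        score.backward()
        z = (z + step_size * z.grad).detach()  # move toward higher predicted accuracy
    return decoder(z)                          # map back to a discrete architecture
```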
NAO with ENAS
- NAO tries to reduce the huge computational cost brought by the search
algorithm.
- Weight-sharing aims to ease the huge complexity brought by massive child
models via the one-shot model setup.
- So NAO and weight-sharing (ENAS) are complementary.
Performance (CIFAR-10)
Transfer Learning Performance (ImageNet)
Performance (PTB)
Transfer Learning Performance (WikiText-2)
Exploring Randomly Wired Neural
Networks for Image Recognition
Saining Xie, Alexander Kirillov, Ross Girshick, Kaiming He (Facebook)
April 2019
Randomly Wired Neural Networks
- The NAS network generator is hand-designed, and the space of allowed wiring
patterns is constrained to a small subset of all possible graphs.
- What happens if we loosen this constraint and design novel network
generators?
- More diverse set of connectivity patterns through the lens of randomly wired
neural networks.
- 1. Define a stochastic network generator that encapsulates the entire network generation
process.
- 2. Generate randomly wired graphs.
Generator Prior
- Each random graph model has certain probabilistic behaviors such that
sampled graphs likely exhibit certain properties (e.g., WS is highly clustered).
- Ultimately, the generator design determines a probabilistic distribution over
networks, and as a result these networks tend to have certain properties.
- The generator design underlies the prior and thus should not be overlooked.
- Random graphs used
- Erdos-Renyi (ER), Barabasi-Albert (BA), Watts-Strogatz (WS)
Stochastic Network Generators
- We define a network generator as a mapping g from a parameter space Θ to a
space of neural network architectures N , g: Θ→N
- g(θ) performs a deterministic mapping.
- We can extend g to accept an additional argument s that is the seed of a
pseudo-random number generator that is used internally by g.
- We call generators of the form g(θ, s) stochastic network generators.
NAS vs. Stochastic Network Generators
- LSTM is only part of the complete NAS network generator, which is in fact a
stochastic network generator.
- The output of each LSTM time-step is a probability distribution conditioned on
θ.
- Given this distribution and the seed s, each step samples a construction
action.
- Network space N has been carefully restricted by hand designed rules.
e.g. “Cell”, M=5, No output concat to avg…
Mapping from Graphs to Neural Networks
- We define that edges are data flow.
- We define the operations represented by one node as
- Aggregation: Combined via weighted sum, weights: learnable &
positive
- Transformation: The aggregated data is processed by a
transformation defined as a ReLU-convolution-BN triplet = conv
- Distribution: The same copy of the transformed data is sent out by
the output edges of the node.
- Nodes without any input edge are input nodes,
and nodes without any output edge are output nodes.
Properties of Node Operations
- Additive aggregation (unlike concatenation) maintains the same number of
output channels as input channels, and this prevents the convolution growing
large in computation.
- The transformation should have the same number of output and input
channels, to make sure the transformed data can be combined with the data
from any other nodes.
- Aggregation and distribution are almost parameter free (except for a
negligible number of parameters for weighted summation).
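A sketch of one node under these rules, assuming C input/output channels: sigmoid-positive aggregation weights, followed by the ReLU-conv-BN triplet; this is illustrative rather than the paper's exact module.

```python
import torch
import torch.nn as nn

class RandWireNode(nn.Module):
    def __init__(self, in_degree, channels):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(in_degree))    # learnable aggregation weights
        self.transform = nn.Sequential(                  # ReLU-conv-BN triplet ("conv")
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, inputs):                  # list of tensors from the input edges
        w = torch.sigmoid(self.w)               # keep the weights positive
        x = sum(wi * xi for wi, xi in zip(w, inputs))   # aggregation by weighted sum
        return self.transform(x)                # the same copy goes to every output edge
```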
RandWire Architectures
- We use a simple strategy: the random graph generated above defines one
stage (layer), e.g., a conv stage.
Performance
Weight Agnostic Neural Networks
Adam Gaier, David Ha (Google)
June 2019
Network Architectures that Encodes Solutions
- It is never claimed that the solution from NAS approach is innate to the
structure of the network – no one supposes these networks will solve the task
without training. The weights are the solution; the found architectures merely
a better substrate for the weights to inhabit.
- To produce architectures that themselves encode solutions, the importance
of weights must be minimized. Rather than judging networks by their
performance with optimal weight values, we can instead measure their
performance when their weight values are drawn from a random distribution.
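A sketch of this weight-agnostic evaluation, assuming a rollout(topology, w) helper that runs the task with every connection weight set to the single shared value w; both names are placeholders.

```python
import numpy as np

SHARED_WEIGHTS = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]

def evaluate_topology(topology, rollout):
    """Score a topology without training: run it once per shared weight value
    and summarize the rewards."""
    rewards = [rollout(topology, w) for w in SHARED_WEIGHTS]
    return np.mean(rewards), np.max(rewards)    # mean performance and best-case performance
```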
Weight Agnostic Neural Network Search
Topology Search
- Inspired by neuroevolution algorithm NEAT.
- (1) Insert Node: a new node is inserted by splitting an existing connection.
- (2) Add Connection: a new connection is added by connecting two previously unconnected
nodes.
- (3) Change Activation: the activation function of a hidden node is reassigned.
Experimental Results: CartPoleSwingUp & MNIST
HyperNetworks
David Ha, Andrew Dai, Quoc V. Le (Google)
ICLR 2017
HyperNetwork
- Schmidhuber has suggested the concept of fast weights in which one
network (HyperNetwork) can produce context-dependent weight changes for
a second network.
- Recurrent networks: imposing weight-sharing across layers, which makes
them inflexible and difficult to learn due to vanishing gradient.
- Convolutional networks: having redundant parameters when the networks are
deep.
- Hypernetworks can be viewed as a relaxed form of weight-sharing across
layers.
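A minimal sketch of a static hypernetwork for a fully connected layer, assuming a learned per-layer embedding z; the hypernetwork generates the layer's weight matrix from z instead of storing it directly.

```python
import torch
import torch.nn as nn

class HyperLinear(nn.Module):
    def __init__(self, in_features, out_features, embed_dim=16):
        super().__init__()
        self.z = nn.Parameter(torch.randn(embed_dim))                   # layer embedding
        self.hyper = nn.Linear(embed_dim, in_features * out_features)   # generates the weights
        self.in_features, self.out_features = in_features, out_features

    def forward(self, x):
        W = self.hyper(self.z).view(self.out_features, self.in_features)
        return x @ W.t()                        # the main layer uses the generated weights
```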
Static and Dynamic HyperNetworks
SMASH: One-Shot
Model Architecture Search
through HyperNetworks
Andrew Brock, Theodore Lim, J.M. Ritchie, Nick Weston (Heriot-Watt Univ., Renishaw PLC)
ICLR 2018
Why HyperNetworks?
- Bypass the expensive procedure of fully training candidate models by instead
training an auxiliary model, a HyperNet, to dynamically generate the weights
of a main model with variable architecture.
- By comparing validation performance for a set of architectures using
generated weights, we can approximately rank numerous architectures at the
cost of a single training run.
SMASH
- At each training step, we randomly sample a network architecture, generate
the weights for that architecture using a HyperNet, and train the entire system
end-to-end through backpropagation.
- When the model is finished training, we sample a number of random
architectures and evaluate their performance on a validation set, using
weights generated by the HyperNet.
- We then select the architecture with the best estimated validation
performance and train its weights normally.
Efficient Neural Architecture Search
via Parameter Sharing
Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, Jeff Dean (Google / CMU / Stanford)
ICML 2018
Efficient Neural Architecture Search (ENAS)
- A fast and inexpensive approach for automatic model design.
1000x less expensive than standard NAS.
- The main contribution of this work is to improve the efficiency of NAS
by forcing all child models to share weights to eschew training each child
model from scratch to convergence while delivering strong empirical
performances.
- Central to the idea of ENAS is the observation that all of the graphs which
NAS ends up iterating over can be viewed as sub-graphs of a larger graph.
In other words, we can represent NAS’s search space
using a single directed acyclic graph (DAG).
Recurrent Cells
- To design recurrent cells, we employ a DAG with N nodes, where the nodes
represent local computations, and the edges represent the flow of information
between the N nodes.
- ENAS’s controller is an RNN that decides:
- 1) which edges are activated
- 2) which computations are performed at each node in the DAG.
- Our search space allows ENAS to design both the topology and the
operations in RNN cells, and hence is more flexible than NAS.
Recurrent Cells
- First node: The controller first samples an activation function.
- Middle nodes: samples a previous index and an activation function.
- Output node: we simply average all the loose ends, i.e. the nodes that are not
selected as inputs to any other nodes.
- Note that for each pair of nodes j < ℓ, there is an independent parameter
matrix W^(h)_{ℓ,j}. → These parameter matrices are shared across all child models
(shared weights).
- 4 activation functions, N nodes: search space = 4^N × N!
Training ENAS
- In ENAS, there are two sets of learnable parameters:
the parameters of the controller LSTM, denoted by θ,
and the shared parameters of the child models, denoted by ω.
- The first phase trains ω, the shared parameters of the child models,
on a whole pass through the training data set.
(Fix policy, choose model based on policy, minimize loss of the model)
- Surprisingly, we can update ω using the gradient from any single model m sampled from
policy. It just works fine.
- The second phase trains θ, the parameters of the controller LSTM, for a fixed
number of steps. (Trains policy to maximize on validation set)
- Two phases are alternated during the training of ENAS.
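A sketch of the alternation, assuming controller and shared_model objects with the listed placeholder methods; it is meant to show the structure of the two phases, not ENAS's actual code.

```python
def train_enas_epoch(controller, shared_model, train_loader, valid_batch):
    # Phase 1: fix the policy, train the shared parameters omega on the training set.
    for x, y in train_loader:
        arch = controller.sample()               # one child model m ~ pi(m; theta)
        loss = shared_model.loss(arch, x, y)     # forward pass of the sampled child
        shared_model.update(loss)                # SGD step on omega only

    # Phase 2: fix omega, train the controller parameters theta with REINFORCE,
    # using the validation accuracy of sampled children as the reward.
    for _ in range(50):                          # a fixed number of controller steps
        arch = controller.sample()
        reward = shared_model.accuracy(arch, *valid_batch)
        controller.reinforce_update(reward)      # policy-gradient step on theta only
```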
Deriving architectures from trained ENAS model
- We first sample several models from the trained policy π(m, θ).
- For each sampled model, we compute its reward on a single minibatch
sampled from the validation set. → Model chosen
- We then take only the model with the highest reward to re-train from scratch.
→ Train the chosen model
Convolutional Cells (Macro)
- Chooses 1) what previous nodes to connect to
and 2) what computation operation to use
- (vs. Recurrent Cells. 1) what previous nodes to connect to, 2) what activation to use)
- It allows the model to form skip connections.
- As for recurrent cells, each operation at each layer in our ENAS convolutional
network has a distinct set of parameters.
Convolutional Cells Found (Macro)
Convolutional Cells (Micro)
- Same search space as in Learning Transferable Architectures for Scalable Image Recognition (NASNet)
- We utilize the ENAS computational DAG with B nodes to represent the
computations that happen locally in a cell.
- We sample the reduction cell conditioned on the convolutional cell, hence
making the controller RNN run for a total of 2(B − 2) blocks.
Convolutional Cells Found (Micro)
Performance (PTB)
Performance (CIFAR-10)
- Cutout: Simple regularization
technique of randomly masking
out square regions of input during
training
NAS vs. ENAS
- Even a minimal change to ENAS results in noticeably worse performance.
- We thus believe that the controller RNN learned by ENAS is as good as the
controller RNN learned by NAS.
- The performance gap between NAS and ENAS is due to the fact that we do
not sample multiple architectures from our trained controller, train them, and
then select the best architecture on the validation data.
- This extra step benefits NAS’s performance.
Understanding and Simplifying
One-Shot Architecture Search
Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, Quoc Le (Google)
ICML 2018
One-shot Model
- It is possible to efficiently identify promising architectures from a complex
search space without either hypernetworks or RL.
- Train a large one-shot model containing every possible operation in the
search space.
- Zero out some of the operations and measure the impact on the model’s
prediction accuracy. The network automatically focuses its capacity on the
operations that are most useful for generating good predictions.
One-shot Architecture Search
- (1) Design a search space that allows us to represent a wide variety of
architectures using a single one-shot model.
○ Enabling or disabling incoming connections makes the size of the search space grow
exponentially, while the size of the one-shot model grows only linearly.
- (2) Train the one-shot model to make it predictive of the validation accuracies
of the architectures.
○ If we train naively, the components can co-adapt. Removing operations – even unimportant
ones – from the network can cause the quality of the model’s predictions to degrade severely.
- (3) Evaluate candidate architectures on the validation set using the
pre-trained one shot model.
- (4) Re-train the most promising architectures from scratch
and evaluate their performance on the test set.
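A sketch of step (3), assuming the one-shot model exposes a hypothetical set_operation_mask() that zeroes out disabled operations and that candidates are binary masks over the operations.

```python
def rank_candidates(one_shot_model, candidates, evaluate):
    """`candidates` are binary masks over the one-shot model's operations;
    `evaluate` returns validation accuracy for the currently masked model."""
    scores = []
    for mask in candidates:
        one_shot_model.set_operation_mask(mask)   # zero out the disabled operations
        scores.append((evaluate(one_shot_model), mask))
    return sorted(scores, key=lambda s: s[0], reverse=True)  # best estimated architectures first
```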
DARTS:
Differentiable Architecture Search
Hanxiao Liu, Karen Simonyan, Yiming Yang (CMU, Google)
ICLR 2019
Differentiable Architecture Search
- Unlike conventional approaches of applying evolution or reinforcement
learning over a discrete and non-differentiable search space,
- our method is based on the continuous relaxation of the architecture
representation,
- allowing efficient search of the architecture using gradient descent.
Overview
- (a) Operations on the edges are initially unknown.
- (b) Continuous relaxation of the search space
by placing a mixture of candidate operations on each edge.
- (c) Joint optimization of the mixing probabilities and the network weights
by solving a bilevel optimization problem.
- (d) Inducing the final architecture from the learned mixing probabilities.
Continuous Relaxation
- To make the search space continuous, we relax the categorical choice of a
particular operation o to a softmax over all possible operations O.
- ō^(i,j)(x) = Σ_{o ∈ O} [ exp(α_o^(i,j)) / Σ_{o′ ∈ O} exp(α_{o′}^(i,j)) ] · o(x),
where the operation mixing weights for a pair of nodes (i, j)
are parameterized by a vector α^(i,j).
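A sketch of the relaxation for a single edge, written as a PyTorch-style mixed operation; candidate_ops is a placeholder list of modules implementing the operation set O.

```python
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)                     # the operation set O
        self.alpha = nn.Parameter(torch.zeros(len(candidate_ops)))  # architecture parameters

    def forward(self, x):
        weights = torch.softmax(self.alpha, dim=0)                  # mixing probabilities
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# After search, each edge keeps only the operation with the largest alpha.
```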
Joint Optimization (Bilevel Optimization)
- Jointly learn the architecture α and the weights w within all the mixed
operations (e.g. weights of the convolution filters).
- min_α L_val(w*(α), α)   s.t.   w*(α) = argmin_w L_train(w, α)
- (While not converged)
- 1. Update architecture α by descending the validation loss
- 2. Update weights w by descending the training loss
Cells Found
Performance (CIFAR-10)
Performance (PTB)
Transfer Learning Performance (ImageNet)
Transfer Learning Performance (WikiText-2)
ProxylessNAS:
Direct Neural Architecture Search
on Target Task and Hardware
Han Cai, Ligeng Zhu, Song Han (MIT)
ICLR 2019
Proxyless Training
- Differentiable NAS can reduce the cost in GPU hours via a continuous
representation of the network architecture, but suffers from high GPU memory
consumption (which grows linearly w.r.t. the candidate set size).
- As a result, they need to utilize proxy tasks.
○ ex. smaller dataset, learning with only a few blocks, or training just for a few epochs
- Architectures optimized on proxy tasks are not guaranteed to be optimal on the target task.
- ProxylessNAS can directly learn architectures for large-scale target
tasks and target hardware platforms by training memory-efficiently.
Binarized Path
v.s.
Differentiable Latency
ImageNet Performance
MnasNet: Platform-Aware Neural
Architecture Search for Mobile
Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard,
Quoc V. Le (Google)
CVPR 2019
Model Latency Problem
- Explicitly incorporate model latency into the main objective so that the search
can identify a model that achieves a good trade-off between accuracy and
latency.
- Unlike previous work, where latency is considered via another, often
inaccurate proxy (e.g., FLOPS), our approach directly measures real-world
inference latency by executing the model on mobile phones.
- FLOPS is often an inaccurate proxy: for example, MobileNet and NASNet have
similar FLOPS (575M vs. 564M), but their latencies are significantly different
(113ms vs. 183ms)
Model Latency Problem
- While previous approaches mainly perform architecture search on smaller
tasks such as CIFAR10, we find those small proxy tasks don’t work when
model latency is taken into account, because one typically needs to scale up
the model when applying to larger problems.
- In this paper, we directly perform our architecture search on the ImageNet
training set but with fewer training steps (5 epochs).
Factorized Hierarchical Search Space
- Previous approaches mainly search for a few types of cells and then
repeatedly stack. This simplifies the search process, but also precludes layer
diversity that is important for computational efficiency.
- Advantage: balances the diversity of layers against the size of the total search space.
Pareto Optimal
- A hard-constraint formulation (maximize ACC(m) subject to LAT(m) ≤ T) only maximizes a
single metric and does not provide multiple Pareto optimal solutions.
- MnasNet instead optimizes a weighted product, ACC(m) × [LAT(m)/T]^w, where the exponent w
softly penalizes latency.
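A sketch of such a latency-aware reward; the target latency T and the exponents are assumptions (the paper's soft-constraint setting reportedly uses α = β = −0.07).

```python
def mnas_reward(accuracy, latency_ms, target_ms, alpha=-0.07, beta=-0.07):
    """Weighted-product objective: trade accuracy against measured latency."""
    w = alpha if latency_ms <= target_ms else beta
    return accuracy * (latency_ms / target_ms) ** w
```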
Performance
FBNet: Hardware-Aware Efficient
ConvNet Design via Differentiable
Neural Architecture Search
Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu,
Yuandong Tian, Peter Vajda, Yangqing Jia, Kurt Keutzer (Facebook)
CVPR 2019
Designing Convnets is hard!
- Intractable design space: The design space of a ConvNet is combinatorial &
training a ConvNet is very time-consuming.
- Nontransferable optimality: the optimality is conditioned on many factors
such as input resolutions and target devices. Once these factors change, the
optimal architecture is likely to be different.
- Inconsistent efficiency metrics: Most of the efficiency metrics we care about
are dependent on not only the ConvNet architecture but also the hardware
and software configurations on the target device.
Differentiable NAS
- Layer-wise search space where we can choose a different block for each layer
of the network
- By using the Gumbel Softmax technique, we can directly train the architecture
distribution using gradient-based optimization, which is extremely fast
compared with previous reinforcement learning (RL) based methods.
- We measure the latency of each operator and use a lookup table model:
the overall latency is computed by adding up the latencies of the individual operators.
This allows us to quickly estimate latency and makes the latency differentiable with
respect to layer-wise block choices.
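A sketch of one searchable layer in this spirit, assuming a list of candidate block modules and a per-block latency table measured offline; the expected latencies of all layers would be summed into the loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SearchableLayer(nn.Module):
    def __init__(self, blocks, latency_table_ms):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.theta = nn.Parameter(torch.zeros(len(blocks)))              # sampling logits
        self.register_buffer("latency", torch.tensor(latency_table_ms))  # measured per block

    def forward(self, x, tau=1.0):
        mask = F.gumbel_softmax(self.theta, tau=tau, hard=False)  # differentiable block choice
        out = sum(m * block(x) for m, block in zip(mask, self.blocks))
        expected_latency = (mask * self.latency).sum()            # lookup-table latency term
        return out, expected_latency
```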
Performance (ImageNet)
- Achieves better accuracy and lower latency than MnasNet, but we estimate
the search cost of DNAS is 420x smaller.
EfficientNet:
Rethinking Model Scaling for
Convolutional Neural Networks
Mingxing Tan, Quoc V. Le (Google)
ICML 2019
Scaling up CNNs
- Convolutional Neural Networks (ConvNets) are commonly developed at a
fixed resource budget, and then scaled up for better accuracy if more
resources are available.
- Carefully balancing network depth, width, and resolution can lead to better
performance.
- A new scaling method that uniformly scales all dimensions of
depth (ex. ResNet, Inception) / width (ex. WideResNet, MobileNet)
/ resolution (NASNet, GPipe) using a simple yet highly effective compound
coefficient.
Compound Scaling
- depth: d = α^φ, width: w = β^φ, resolution: r = γ^φ, subject to α · β² · γ² ≈ 2 and α, β, γ ≥ 1.
- φ is a user-specified coefficient that controls how many more resources are
available for model scaling, while α, β, γ specify how to assign these extra
resources to network width, depth, and resolution.
- FLOPS of a regular convolution op is proportional to d, w², r².
- Total FLOPS will approximately increase by 2^φ.
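A sketch of the compound scaling rule, assuming the coefficients reported for EfficientNet-B0 (α = 1.2, β = 1.1, γ = 1.15, so α·β²·γ² ≈ 2).

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Return the depth, width, and resolution multipliers for a given phi."""
    depth_mult = alpha ** phi        # number of layers
    width_mult = beta ** phi         # number of channels
    resolution_mult = gamma ** phi   # input image resolution
    return depth_mult, width_mult, resolution_mult

# FLOPS grow roughly like depth * width^2 * resolution^2, i.e. about 2 ** phi.
```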
Compound Scaling
- Developed baseline network by leveraging a multi-objective neural
architecture search from MnasNet.
- Starting from the baseline EfficientNet-B0, we apply our compound scaling
method to scale it up with two steps.
- 1. Fix φ = 1, assuming twice more resources are available, and do a small grid
search of α, β, γ to find the optimal values.
- 2. Fix α, β, γ as constants and scale up baseline network with different φ.
Performance
ScarletNAS: Bridging the Gap
Between Scalability and Fairness
in Neural Architecture Search
Xiangxiang Chu, Bo Zhang, Jixiang Li, Qingyuan Li, Ruijun Xu (Xiaomi)
August 2019
Supernet training with variable depths
- One-shot NAS features fast training of a supernet in a single run,
but the weight-sharing approach lacks scalability.
- An identity block helps build a scalable supernet (with variable depths), but it
makes supernet training unstable.
- We introduce a linearly equivalent transformation to smooth out training turbulence,
with a proof that the transformed path has the same representational power as the
original one.
Linearly Equivalent Transformation
- As pure identity blocks are direct
short paths and don’t learn any
information, we have to
accommodate this defect by
injecting a learning unit.
- Here we remedy the issue with 1 × 1
convolution without non-linear
activations.
Performance (ImageNet)
NAS-Bench-101:
Towards Reproducible
Neural Architecture Search
Chris Ying, Aaron Klein, Esteban Real, Eric Christiansen, Kevin Murphy, Frank Hutter
(Google)
ICML 2019
NAS research is hard!
- NAS demands tremendous computational resources, which makes it difficult
to reproduce experiments and imposes a barrier-to-entry to researchers
without access to large-scale computation.
- Although recent improvements have yielded more efficient methods, different methods
are not comparable to each other due to different training procedures and
different search spaces, which makes it difficult to attribute the success of
each method to the search algorithm itself.
Architecture Dataset
- We carefully constructed a search space, exploiting graph isomorphisms to
identify 423k unique convolutional architectures.
- 7-vertex directed acyclic graph, with one operation label for each of the 5 intermediate vertices
(recall that the input and output vertices are fixed).
- To support both ResNet and Inception-like cells and to keep the space
tractable: tensors going to the output vertex are concatenated and those
going into other vertices are summed.
- We trained and evaluated all of these architectures multiple times on
CIFAR-10 and compiled the results into a large dataset of
over 5 million trained models.
Metrics
- training accuracy, validation accuracy, testing accuracy,
training time in seconds, number of trainable model parameters
- Only metrics on the training and validation set should be used to search
models within a single NAS algorithm, and testing accuracy should only be
used for an offline evaluation. The training time metric allows benchmarking
algorithms that optimize for accuracy while operating under a time limit and
also allows the evaluation of multi-objective optimization methods.
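An illustrative sketch (not the actual NAS-Bench-101 API) of how a tabular benchmark is used: the search algorithm queries precomputed metrics instead of training, and charges itself the recorded training time.

```python
def run_random_search(benchmark, sample_spec, budget_seconds):
    """`benchmark[spec]` -> dict with 'validation_accuracy', 'test_accuracy',
    'training_time'; `sample_spec()` draws a random cell. Both are assumptions."""
    best, elapsed = None, 0.0
    while elapsed < budget_seconds:
        spec = sample_spec()
        metrics = benchmark[spec]                   # table lookup instead of training a model
        elapsed += metrics["training_time"]         # charge the simulated training cost
        if best is None or metrics["validation_accuracy"] > best["validation_accuracy"]:
            best = metrics
    return best                                     # its test accuracy is reported offline
```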
Accuracy
- We repeat the training and evaluation of all architectures 3 times to obtain a
measure of variance, and we trained all our architectures with four increasing
epoch budgets: {4, 12, 36, 108}.
- train/valid/test accuracy after training for 108 epochs and (right) the noise,
defined as the standard deviation of the test accuracy between the three trials
Pareto Frontier
- Hand-designed cells, such as ResNet and Inception, perform near the Pareto
frontier of accuracy over cost, which suggests that topology and operation
selection are critical for finding both high-accuracy and low-cost models.
Architectural Design
Locality of Architecture Search Space
- Locality across the whole space
○ Random-walk autocorrelation (RWA), defined as the
autocorrelation of the accuracies of points visited
as we perform random walks through the space,
shows high correlations for lower distances,
indicating locality. The correlations become indistinguishable beyond a distance of about 6.
- Locality around a global accuracy maximum
○ Fitness-distance correlation metric (FDC) shows that there is locality around the global
maximum as well and the peak also has a coarse-grained width of about 6.
- Locality around inception-like cell
○ Fraction of the search space volume that lies within a given distance
to the closest high peak.

Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 

Neural Architecture Search: Learning How to Learn

  • 1. Neural Architecture Search: Learning How to Learn Kwanghee Choi Local Optima 2019
  • 2. Reference - Neural Architecture Search with Reinforcement Learning (ICLR 2017) - Learning Transferable Architectures for Scalable Image Recognition (CVPR 2018) - Large-Scale Evolution of Image Classifiers (ICML 2017) - Hierarchical Representations for Efficient Architecture Search (ICLR 2018) - Regularized Evolution for Image Classifier Architecture Search (AAAI 2019) - Progressive Neural Architecture Search (ECCV 2018) - Neural Architecture Optimization (NIPS 2018) - Exploring Randomly Wired Neural Networks for Image Recognition (2019) - Weight Agnostic Neural Networks (2019) - HyperNetworks (ICLR 2016) - SMASH: One-Shot Model Architecture Search through HyperNetworks (ICLR 2018) - Efficient Neural Architecture Search via Parameter Sharing (ICML 2018) - Understanding and Simplifying One-Shot Architecture Search (ICML 2018) - DARTS: Differentiable Architecture Search (ICLR 2019) - ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware (ICLR 2019) - MnasNet: Platform-Aware Neural Architecture Search for Mobile (CVPR 2019) - FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search (CVPR 2019) - EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks (ICML 2019) - ScarletNAS: Bridging the Gap Between Scalability and Fairness in Neural Architecture Search (2019) - NAS-Bench-101: Towards Reproducible Neural Architecture Search (ICML 2019)
  • 3. Introduction - Excerpt from Exploring Randomly Wired Neural Networks for Image Recognition (2019) - Neural networks for image recognition have evolved through extensive manual design. ex) ResNet, DenseNet - What we call deep learning today descends from the connectionist approach to cognitive science — a paradigm reflecting the hypothesis that how computational networks are wired is crucial for building intelligent machines. - NAS (Neural Architecture Search): Optimization of wiring and operation types, but possible wirings or operations are constrained.
  • 4. Neural Architecture Search with Reinforcement Learning Barret Zoph, Quoc V. Le (Google) ICLR 2017
  • 6. Neural Architecture Search - A gradient-based method for finding good architectures - Use a recurrent network to generate the model descriptions of neural networks and train this RNN with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set. - The structure and connectivity of a neural network can be typically specified by a variable-length string. It is therefore possible to use a recurrent network – the controller – to generate such string. - Architecture engineering with CNNs often identifies repeated motifs consisting of combinations of convolutional filter banks, nonlinearities and a prudent selection of connections to achieve state-of-the-art results.
  • 7. Controller Recurrent Neural Network - Every prediction is carried out by a softmax classifier and then fed into the next time step as input. - The process of generating an architecture stops if the number of layers exceeds a certain value. - Once the controller RNN finishes generating an architecture, a neural network with this architecture is built and trained. - At convergence, the accuracy of the network on a held-out validation set is recorded.
  • 8. Training with REINFORCE - The list of tokens that the controller predicts can be viewed as a list of actions a1:T to design an architecture for a child network. - At convergence, this child network will achieve an accuracy R on a held-out dataset. - We can use this accuracy R as the reward signal and use reinforcement learning to train the controller. - REINFORCE by Williams (1992), Sutton (2000) - We do not predict the learning rate and we also assume that the architectures consist of only convolutional layers, which is also quite restrictive.
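A minimal, illustrative sketch of this REINFORCE loop (not the authors' implementation): the Controller below is a toy single-layer LSTM that emits architectural tokens, and evaluate_child is a placeholder standing in for building and training a child network and returning its validation accuracy R.

```python
import torch
import torch.nn as nn

class Controller(nn.Module):
    """Toy controller: an LSTM that emits `steps` architectural tokens."""
    def __init__(self, num_tokens=4, hidden=32, steps=6):
        super().__init__()
        self.hidden, self.steps = hidden, steps
        self.cell = nn.LSTMCell(hidden, hidden)
        self.embed = nn.Embedding(num_tokens, hidden)
        self.head = nn.Linear(hidden, num_tokens)

    def sample(self):
        h = torch.zeros(1, self.hidden)
        c = torch.zeros(1, self.hidden)
        x = torch.zeros(1, self.hidden)        # "start" input
        tokens, log_probs = [], []
        for _ in range(self.steps):
            h, c = self.cell(x, (h, c))
            dist = torch.distributions.Categorical(logits=self.head(h))
            a = dist.sample()
            tokens.append(a.item())
            log_probs.append(dist.log_prob(a))
            x = self.embed(a)                  # feed the prediction into the next time step
        return tokens, torch.stack(log_probs).sum()

def evaluate_child(tokens):
    """Placeholder: in the paper this builds and trains a child network and returns accuracy R."""
    return len(set(tokens)) / len(tokens)

controller, baseline = Controller(), 0.0
opt = torch.optim.Adam(controller.parameters(), lr=3e-4)
for _ in range(10):
    tokens, log_prob = controller.sample()
    R = evaluate_child(tokens)
    baseline = 0.9 * baseline + 0.1 * R        # moving-average baseline to reduce variance
    loss = -(R - baseline) * log_prob          # REINFORCE: ascend the expected reward
    opt.zero_grad(); loss.backward(); opt.step()
```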
  • 9. Distributed training for NAS - We use a set of S parameter servers to store and send parameters to K controller replicas. - Each controller replica then samples m architectures and runs the m child models in parallel. - The accuracy of each child model is recorded to compute the gradients with respect to θ_c, which are then sent back to the parameter servers.
  • 10. Generating Skip Connections - At layer N, we add an anchor point which has N − 1 content-based sigmoids to indicate the previous layers that need to be connected. - Each sigmoid is a function of the current hidden state of the controller and the hidden states of the previous N − 1 anchor points. - P(Layer j is an input to layer i) = sigmoid(v^T tanh(W_prev · h_j + W_curr · h_i)), where h_j is the hidden state of the controller at the anchor point for the j-th layer, with j ranging from 0 to N − 1. - We then sample from these sigmoids to decide which previous layers are used as inputs to the current layer. - The matrices W_prev, W_curr and the vector v are trainable parameters.
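A small illustrative sketch of this sampling step (our own toy code, not the paper's): each previous anchor state h_j is scored against the current state h_i with the learned attention above, and a Bernoulli sample decides whether layer j feeds layer i.

```python
import torch
import torch.nn as nn

hidden = 32
W_prev = nn.Linear(hidden, hidden, bias=False)
W_curr = nn.Linear(hidden, hidden, bias=False)
v = nn.Linear(hidden, 1, bias=False)

def sample_skip_connections(h_prev_anchors, h_i):
    """h_prev_anchors: list of (1, hidden) anchor states for layers 0..i-1; h_i: (1, hidden)."""
    connections = []
    for h_j in h_prev_anchors:
        p = torch.sigmoid(v(torch.tanh(W_prev(h_j) + W_curr(h_i))))  # P(layer j feeds layer i)
        connections.append(bool(torch.bernoulli(p)))                 # sample the connection
    return connections

# toy usage: decide which of 3 previous layers feed the current layer
prev = [torch.randn(1, hidden) for _ in range(3)]
print(sample_skip_connections(prev, torch.randn(1, hidden)))
```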
  • 11. Generating Skip Connections - Skip connections can cause “compilation failures” where one layer is not compatible with another layer, or one layer may not have any input or output. - If a layer is not connected to any input layer then the image is used as the input layer. - At the final layer we take all layer outputs that have not been connected and concatenate them before sending this final hiddenstate to the classifier. - If input layers to be concatenated have different sizes, we pad the small layers with zeros so that the concatenated layers have the same sizes.
  • 12. Generating Recurrent Cells - The computations for basic RNN and LSTM cells can be generalized as a tree of steps that take xt and ht−1 as inputs and produce ht as final output. - The controller RNN needs to label each node in the tree with a combination method (add, dot product, etc.) and an activation function (tanh, sigmoid, etc.) to merge two inputs to produce one output. - Two outputs are then fed as inputs to the next node in the tree. - Two leaf nodes (Tree Index 0, 1): thus it is called a “base 2” architecture. - In our experiments, we use a base number of 8 to make sure that the cell is expressive.
  • 15. Transfer Learning Performance (PTB) - To understand whether the cell can generalize to a different task, we apply it to the character language modeling task on the same dataset (PTB). - The new cell was found on word level language modeling.
  • 16. Learning Transferable Architectures for Scalable Image Recognition Barret Zoph, Vijay Vasudevan, Jonathon Shlens, Quoc V. Le (Google) CVPR 2018
  • 17. Transferable Architectures - We propose to search for an architectural building block on a small dataset and then transfer the block to a larger dataset: the design of a new search space (“NASNet search space”) which enables transferability. - Applying NAS, or any other search methods, directly to a large dataset is computationally expensive. - NASNet search space: A search space so that the complexity of the architecture is independent of the depth of the network and the size of input images. - All convolutional networks in our search space are composed of convolutional layers (or “cells”) with identical structure but different weights. Searching for the best convolutional architectures is therefore reduced to searching for the best cell structure. - By simply varying # of the convolutional cells and # of filters, we can create different versions of NASNets with different computational demands.
  • 18. Transferable Architectures - Two types of cells: - Normal Cell: return a feature map of the same dimension - Reduction Cell: return a feature map where height and width is reduced by a factor of two. - We empirically found it beneficial to learn two separate architectures. - We use a common heuristic to double the number of filters in the output whenever the spatial activation size is reduced in order to maintain roughly constant hidden state dimension. - We consider the # of motif repetitions and the # of initial convolutional filters as free parameters.
  • 19. Controller Model Architecture - Select a hidden state from h_i, h_{i−1} or from the set of hidden states created in previous blocks. - In our experiments, selecting B = 5 provides good results, although we have not exhaustively searched this space due to computational limitations. - To allow the controller RNN to predict both the Normal Cell and the Reduction Cell, we simply make the controller have 2 × 5B predictions in total.
  • 21. Transfer Learning Performance (ImageNet) - The new cell was found on CIFAR-10.
  • 22. Other Cell Types - NASNet-B - Do not concatenate the output hidden states, each output hidden state is used as a hidden state in the future layers. - We allow addition followed by layer normalization or instance normalization. - NASNet-C - We allow addition followed by layer normalization or instance normalization.
  • 23. Large-Scale Evolution of Image Classifiers Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc V. Le, Alexey Kurakin (Google) ICML 2017
  • 24. Large-Scale Evolution - Starting out with poor-performing models with no convolutions, the algorithm must evolve complex convolutional neural networks while navigating a fairly unrestricted search space. - We use a simplified graph as our DNA, which is transformed to a full neural network graph for training and evaluation. - Mutations were chosen for their similarity to the actions that a human designer may take when improving an architecture. - We allow the children to inherit the parents' weights whenever possible: if a layer has matching shapes, the weights are preserved.
  • 25. Progress of an Evolution Experiment
  • 28. Hierarchical Representations for Efficient Architecture Search Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, Koray Kavukcuoglu (Google) ICLR 2018
  • 29. Hierarchical Representations for describing neural network architectures
  • 30. Tournament Selection - Starting from an initial population of random genotypes, tournament selection provides a mechanism to pick promising genotypes from the population, and to place its mutated offspring back into the population. - By repeating this process, the quality of the population keeps being refined over time.
  • 31. Cell Found - We use the proposed search framework to learn the architecture of a convolutional cell, rather than the entire model. - Only motifs 1,3,4,5 are used to construct the cell, among which motifs 3 and 5 are dominating.
  • 35. Regularized Evolution for Image Classifier Architecture Search Esteban Real, Alok Aggarwal, Yanping Huang, Quoc V Le (Google) AAAI 2019
  • 36. Regularized Evolution - In plain tournament selection the best genotypes (architectures) are kept indefinitely; we instead associate each genotype with an age and bias tournament selection toward younger genotypes by always removing the oldest individual from the population.
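A compact sketch of this aging-evolution loop (toy code under our own assumptions: fitness and mutate are placeholders for training a child model and mutating a NASNet-style cell).

```python
import collections
import random

def fitness(arch):                 # placeholder for "train the model, return validation accuracy"
    return sum(arch) / len(arch)

def mutate(arch):                  # placeholder for hidden-state / op / identity mutations
    child = list(arch)
    child[random.randrange(len(child))] = random.random()
    return child

population = collections.deque()   # the oldest genotype sits at the left end
history = []
for _ in range(20):                # initialize with random genotypes
    arch = [random.random() for _ in range(8)]
    population.append((arch, fitness(arch)))
    history.append(population[-1])

for _ in range(200):               # evolution loop
    sample = random.sample(list(population), k=5)      # tournament
    parent = max(sample, key=lambda x: x[1])
    child = mutate(parent[0])
    population.append((child, fitness(child)))
    population.popleft()           # aging: always remove the oldest, not the worst
    history.append(population[-1])

print(max(history, key=lambda x: x[1])[1])             # best fitness seen during the search
```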
  • 37. Mutations for NASNet cell structure - Simplest set of mutations that would allow evolving in the NASNet search space: Hidden state mutation, Op mutation, and Identity.
  • 39. Progressive Neural Architecture Search Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens , Wei Hua , Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy (Google / Stanford) ECCV 2018
  • 40. Sequential Model-Based Optimization (SMBO) - Searching for structures in order of increasing complexity, while simultaneously learning a surrogate model to guide the search through structure space. - 5x more efficient (in # of models evaluated to achieve the desired accuracy) - 8x faster than NAS (without reranking)
  • 41. Sequential Model-Based Optimization (SMBO) - At iteration b of the algorithm, we have a set of K candidate cells (each of size b blocks), which we train and evaluate on a dataset of interest. - Since this process is expensive, we also learn a model or surrogate function which can predict the performance of a structure without needing to train it. - We expand the K candidates of size b into K′ ≫ K children, each of size b + 1. - We apply our surrogate function to rank all of the K′ children, pick the top K, and then train and evaluate them. - We continue in this way until b = B, which is the maximum number of blocks we want to use in our cell.
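A rough sketch of this progressive loop (expand, rank with a surrogate, keep the top K, train, refit the surrogate). All helpers are illustrative placeholders, not the authors' implementation; the paper uses an LSTM predictor where the toy dictionary surrogate appears below.

```python
import random

def expand(cell):                       # add one more block in every possible way (toy op set)
    return [cell + [op] for op in ("conv3x3", "conv5x5", "maxpool", "identity")]

def train_and_eval(cell):               # placeholder for training the cell and measuring accuracy
    return random.random()

def surrogate_predict(history, cell):
    # Toy surrogate: score a child by its parent's observed accuracy
    # (the paper trains an LSTM predictor on (cell, accuracy) pairs instead).
    return history.get(tuple(cell[:-1]), 0.5)

K, B = 4, 3
candidates = [[op] for op in ("conv3x3", "conv5x5", "maxpool", "identity")]   # b = 1
history = {}
for b in range(1, B):
    scores = [(cell, train_and_eval(cell)) for cell in candidates]            # expensive step
    history.update({tuple(c): s for c, s in scores})                          # refit the surrogate
    children = [child for cell, _ in scores for child in expand(cell)]        # K' >> K children, size b+1
    children.sort(key=lambda c: surrogate_predict(history, c), reverse=True)
    candidates = children[:K]                                                 # keep the top K
print(candidates)
```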
  • 42. SMBO Advantages - The simple structures train faster, so we get some initial results to train the surrogate quickly. - We only ask the surrogate to predict the quality of structures that are slightly different (larger) from the ones it has seen. - We factorize the search space into a product of smaller search spaces, allowing us to potentially search models with many more blocks.
  • 43. SMBO Predictors - Handle variable-sized inputs - Correlated with true performance - Preserving the ordering matters more than the accuracy MSE - Sample efficiency - We want to train and evaluate as few cells as possible, which means training data is scarce. - → used an LSTM
  • 45. Neural Architecture Optimization Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, Tie-Yan Liu (USTC, Microsoft) NIPS 2018
  • 46. Continuous Optimization - (1) An encoder embeds/maps neural network architectures into a continuous space. - We use a sequence consisting of discrete string tokens to describe a CNN or RNN architecture. - (2) A predictor p takes the continuous representation of a network as input and predicts its accuracy. - If models are symmetric (e.g., x2 is formed via swapping two branches within a node in x1), their embeddings should be close and produce the same performance prediction scores, so p(x1) = p(x2) = s. - (3) A decoder maps a continuous representation of a network back to its architecture.
  • 47. NAO Algorithm - (For N iterations) - 1. Train each candidate architecture found so far - 2. Train the encoder, predictor, and decoder on the accumulated history of Model → Score pairs - 3. Pick K architectures as seed architectures - 4. Find new candidate architecture representations using the encoder representation and the predictor - 5. Decode each candidate architecture representation - The performance predictor and the encoder enable us to perform gradient-based optimization in the continuous space to find the embedding of a new architecture with potentially better accuracy. Such a better embedding is then decoded to a network by the decoder.
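A minimal sketch of the gradient step in embedding space (toy modules under our own assumptions: the encoder and decoder below are simple linear layers standing in for the sequence models in the paper).

```python
import torch
import torch.nn as nn

dim = 16
encoder = nn.Linear(32, dim)       # stands in for the sequence encoder
predictor = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))
decoder = nn.Linear(dim, 32)       # stands in for the sequence decoder

x = torch.randn(1, 32)             # token representation of a seed architecture
e = encoder(x).detach().requires_grad_(True)

eta = 0.1
for _ in range(10):                # gradient ascent on the predicted accuracy in embedding space
    score = predictor(e)
    grad, = torch.autograd.grad(score.sum(), e)
    e = (e + eta * grad).detach().requires_grad_(True)

new_arch_repr = decoder(e)         # decode the improved embedding back to an architecture string
print(predictor(e).item())         # predicted accuracy of the new embedding
```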
  • 48. NAO with ENAS - NAO tries to reduce the huge computational cost brought by the search algorithm. - Weight-sharing aims to ease the huge complexity brought by massive numbers of child models via the one-shot model setup. - So NAO and weight-sharing (ENAS) are complementary.
  • 53. Exploring Randomly Wired Neural Networks for Image Recognition Saining Xie, Alexander Kirillov, Ross Girshick, Kaiming He (Facebook) 2019. 4.
  • 54. Randomly Wired Neural Networks - NAS network generator is hand designed and the space of allowed wiring patterns is constrained in a small subset of all possible graphs. - What happens if we loosen this constraint and design novel network generators? - More diverse set of connectivity patterns through the lens of randomly wired neural networks. - 1. Define a stochastic network generator that encapsulates the entire network generation process. - 2. Generate randomly wired graphs.
  • 55. Generator Prior - Each random graph model has certain probabilistic behaviors such that sampled graphs likely exhibit certain properties (e.g., WS is highly clustered). - Ultimately, the generator design determines a probabilistic distribution over networks, and as a result these networks tend to have certain properties. - The generator design underlies the prior and thus should not be overlooked. - Random graphs used - Erdos-Renyi (ER), Barabasi-Albert (BA), Watts-Strogatz (WS)
  • 56. Stochastic Network Generators - We define a network generator as a mapping g from a parameter space Θ to a space of neural network architectures N , g: Θ→N - g(θ) performs a deterministic mapping. - We can extend g to accept an additional argument s that is the seed of a pseudo-random number generator that is used internally by g. - We call generators of the form g(θ, s) stochastic network generators.
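A stochastic network generator g(θ, s) in this sense can be sketched with networkx, where θ parameterizes a Watts-Strogatz random-graph model and s is the seed, so the same (θ, s) always yields the same wiring (illustrative only).

```python
import networkx as nx

def g(theta, s):
    n, k, p = theta              # number of nodes, ring neighbors, rewiring probability
    return nx.watts_strogatz_graph(n, k, p, seed=s)

graph_a = g((32, 4, 0.75), s=0)
graph_b = g((32, 4, 0.75), s=0)  # identical wiring: g is deterministic given (theta, s)
graph_c = g((32, 4, 0.75), s=1)  # a different sample from the same prior
print(set(graph_a.edges()) == set(graph_b.edges()))   # True
print(set(graph_a.edges()) == set(graph_c.edges()))   # almost surely False
```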
  • 57. NAS vs. Stochastic Network Generators - LSTM is only part of the complete NAS network generator, which is in fact a stochastic network generator. - The output of each LSTM time-step is a probability distribution conditioned on θ. - Given this distribution and the seed s, each step samples a construction action. - Network space N has been carefully restricted by hand designed rules. e.g. “Cell”, M=5, No output concat to avg…
  • 58. Mapping from Graphs to Neural Networks - We define that edges are data flow. - We define the operations represented by one node as - Aggregation: combined via weighted sum, with learnable positive weights - Transformation: the aggregated data is processed by a transformation defined as a ReLU-convolution-BN triplet ("conv") - Distribution: the same copy of the transformed data is sent out along the output edges of the node. - Nodes without any input edge are input nodes, and nodes without any output edge are output nodes.
  • 59. Properties of Node Operations - Additive aggregation (unlike concatenation) maintains the same number of output channels as input channels, and this prevents the convolution growing large in computation. - The transformation should have the same number of output and input channels, to make sure the transformed data can be combined with the data from any other nodes. - Aggregation and distribution are almost parameter free (except for a negligible number of parameters for weighted summation).
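A sketch of one such node (our own toy code): learnable positive weights for the aggregation, then a ReLU-conv-BN transformation with equal input and output channels; a plain 3×3 convolution stands in for the separable convolution used in the paper.

```python
import torch
import torch.nn as nn

class NodeOp(nn.Module):
    def __init__(self, channels, in_degree):
        super().__init__()
        self.agg_weights = nn.Parameter(torch.zeros(in_degree))    # sigmoid keeps them positive
        self.transform = nn.Sequential(                            # ReLU-conv-BN triplet
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, inputs):                                     # inputs: tensors from incoming edges
        w = torch.sigmoid(self.agg_weights)
        x = sum(w[i] * t for i, t in enumerate(inputs))            # weighted-sum aggregation
        return self.transform(x)                                   # same output sent along all out-edges

node = NodeOp(channels=16, in_degree=3)
out = node([torch.randn(2, 16, 8, 8) for _ in range(3)])
print(out.shape)   # torch.Size([2, 16, 8, 8]): channel count is preserved
```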
  • 60. RandWire Architectures - We use a simple strategy: each random graph generated above defines one stage (layer), ex) a conv stage.
  • 62. Weight Agnostic Neural Networks Adam Gaier, David Ha (Google) 2019. 6.
  • 63. Network Architectures that Encode Solutions - It is never claimed that the solution from the NAS approach is innate to the structure of the network – no one supposes these networks will solve the task without training. The weights are the solution; the found architectures are merely a better substrate for the weights to inhabit. - To produce architectures that themselves encode solutions, the importance of weights must be minimized. Rather than judging networks by their performance with optimal weight values, we can instead measure their performance when their weight values are drawn from a random distribution.
  • 64. Weight Agnostic Neural Network Search
  • 65. Topology Search - Inspired by neuroevolution algorithm NEAT. - (1) Insert Node: a new node is inserted by splitting an existing connection. - (2) Add Connection: a new connection is added by connecting two previously unconnected nodes. - (3) Change Activation: the activation function of a hidden node is reassigned.
  • 67. HyperNetworks David Ha, Andrew Dai, Quoc V. Le (Google) ICLR 2016
  • 68. HyperNetwork - Schmidhuber has suggested the concept of fast weights in which one network (HyperNetwork) can produce context-dependent weight changes for a second network. - Recurrent networks: imposing weight-sharing across layers, which makes them inflexible and difficult to learn due to vanishing gradient. - Convolutional networks: having redundant parameters when the networks are deep. - Hypernetworks can be viewed as relaxed form of weight-sharing across layers.
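A toy sketch in the spirit of a static hypernetwork (illustrative shapes and names only; the paper's weight generation is more structured): a small network maps a per-layer embedding to the weights of a convolution in the main network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperConv(nn.Module):
    def __init__(self, c_in, c_out, k=3, z_dim=8):
        super().__init__()
        self.c_in, self.c_out, self.k = c_in, c_out, k
        self.z = nn.Parameter(torch.randn(z_dim))                  # layer embedding
        self.hyper = nn.Linear(z_dim, c_out * c_in * k * k)        # the hypernetwork

    def forward(self, x):
        w = self.hyper(self.z).view(self.c_out, self.c_in, self.k, self.k)
        return F.conv2d(x, w, padding=self.k // 2)                 # main-network conv with generated weights

layer = HyperConv(3, 16)
print(layer(torch.randn(1, 3, 32, 32)).shape)   # torch.Size([1, 16, 32, 32])
```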
  • 69. Static and Dynamic HyperNetworks
  • 70. SMASH: One-Shot Model Architecture Search through HyperNetworks Andrew Brock, Theodore Lim, J.M. Ritchie, Nick Weston (Heriot-Watt Univ., Renishaw PLC) ICLR 2018
  • 71. Why HyperNetworks? - Bypass the expensive procedure of fully training candidate models by instead training an auxiliary model, a HyperNet, to dynamically generate the weights of a main model with variable architecture. - By comparing validation performance for a set of architectures using generated weights, we can approximately rank numerous architectures at the cost of a single training run.
  • 72. SMASH - At each training step, we randomly sample a network architecture, generate the weights for that architecture using a HyperNet, and train the entire system end-to-end through backpropagation. - When the model is finished training, we sample a number of random architectures and evaluate their performance on a validation set, using weights generated by the HyperNet. - We then select the architecture with the best estimated validation performance and train its weights normally.
  • 73. Efficient Neural Architecture Search via Parameter Sharing Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, Jeff Dean (Google / CMU / Stanford) ICML 2018
  • 74. Efficient Neural Architecture Search (ENAS) - A fast and inexpensive approach for automatic model design. 1000x less expensive than standard NAS. - The main contribution of this work is to improve the efficiency of NAS by forcing all child models to share weights to eschew training each child model from scratch to convergence while delivering strong empirical performances. - Central to the idea of ENAS is the observation that all of the graphs which NAS ends up iterating over can be viewed as sub-graphs of a larger graph. In other words, we can represent NAS’s search space using a single directed acyclic graph (DAG).
  • 75. Recurrent Cells - To design recurrent cells, we employ a DAG with N nodes, where the nodes represent local computations, and the edges represent the flow of information between the N nodes. - ENAS’s controller is an RNN that decides: - 1) which edges are activated - 2) which computations are performed at each node in the DAG. - Our search space allows ENAS to design both the topology and the operations in RNN cells, and hence is more flexible than NAS.
  • 76. Recurrent Cells - First node: The controller first samples an activation function. - Middle nodes: samples a previous index and an activation function. - Output node: we simply average all the loose ends, i.e. the nodes that are not selected as inputs to any other nodes. - Note that for each pair of nodes j < ℓ, there is an independent parameter matrix Wℓ,j (h). → Shared weights. - 4 activation functions, N nodes: search space = 4^N × N!
  • 77. Training ENAS - In ENAS, there are two sets of learnable parameters: the parameters of the controller LSTM, denoted by θ, and the shared parameters of the child models, denoted by ω. - The first phase trains ω, the shared parameters of the child models, on a whole pass through the training data set. (Fix policy, choose model based on policy, minimize loss of the model) - Surprisingly, we can update ω using the gradient from any single model m sampled from policy. It just works fine. - The second phase trains θ, the parameters of the controller LSTM, for a fixed number of steps. (Trains policy to maximize on validation set) - Two phases are alternated during the training of ENAS.
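A very rough sketch of the two alternating phases (toy stand-ins under our own assumptions: `shared` plays the role of ω, `controller_logits` the role of θ, and the "task" is trivial; not the authors' code).

```python
import torch
import torch.nn as nn

ops = [torch.tanh, torch.relu, torch.sigmoid]                    # candidate node operations
shared = nn.Linear(8, 8)                                         # omega, shared by every sampled child
controller_logits = torch.zeros(len(ops), requires_grad=True)    # theta (an LSTM in the paper)
opt_w = torch.optim.SGD(shared.parameters(), lr=0.1)
opt_theta = torch.optim.Adam([controller_logits], lr=0.05)

def child_loss(op, x, y):                                        # "training" a sampled child model
    return ((op(shared(x)) - y) ** 2).mean()

x, y = torch.randn(64, 8), torch.randn(64, 8)
for epoch in range(5):
    # Phase 1: fix the policy, sample a child, update the shared weights omega on training data.
    dist = torch.distributions.Categorical(logits=controller_logits)
    m = dist.sample()
    loss = child_loss(ops[int(m)], x, y)
    opt_w.zero_grad(); loss.backward(); opt_w.step()

    # Phase 2: fix omega, update theta with REINFORCE on a validation-style reward.
    with torch.no_grad():
        reward = -child_loss(ops[int(m)], x, y)                  # higher is better
    policy_loss = -reward * dist.log_prob(m)
    opt_theta.zero_grad(); policy_loss.backward(); opt_theta.step()
```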
  • 78. Deriving architectures from trained ENAS model - We first sample several models from the trained policy π(m, θ). - For each sampled model, we compute its reward on a single minibatch sampled from the validation set. → Model chosen - We then take only the model with the highest reward to re-train from scratch. → Train the chosen model
  • 79. Convolutional Cells (Macro) - Chooses 1) what previous nodes to connect to and 2) what computation operation to use - (vs. Recurrent Cells. 1) what previous nodes to connect to, 2) what activation to use) - It allows the model to form skip connections. - As for recurrent cells, each operation at each layer in our ENAS convolutional network has a distinct set of parameters.
  • 81. Convolutional Cells (Micro) - Same as the cell-based search space of the scalable (NASNet-style) architectures - We utilize the ENAS computational DAG with B nodes to represent the computations that happen locally in a cell. - We sample the reduction cell conditioned on the convolutional cell, hence making the controller RNN run for a total of 2(B − 2) blocks.
  • 84. Performance (CIFAR-10) - Cutout: Simple regularization technique of randomly masking out square regions of input during training
  • 85. NAS vs. ENAS - Even a minimal change to ENAS leads to poor performance. - We thus believe that the controller RNN learned by ENAS is as good as the controller RNN learned by NAS. - The performance gap between NAS and ENAS is due to the fact that we do not sample multiple architectures from our trained controller, train them, and then select the best architecture on the validation data. - This extra step benefits NAS’s performance.
  • 86. Understanding and Simplifying One-Shot Architecture Search Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, Quoc Le (Google) ICML 2018
  • 87. One-shot Model - It is possible to efficiently identify promising architectures from a complex search space without either hypernetworks or RL. - Train a large one-shot model containing every possible operation in the search space. - Zero out some of the operations and measure the impact on the model’s prediction accuracies. Network automatically focuses its capacity on the operations that are most useful for generating good predictions.
  • 88. One-shot Architecture Search - (1) Design a search space that allows us to represent a wide variety of architectures using a single one-shot model. ○ Enabling or disabling incoming connections makes the size of the search space grow exponentially while the size of the one-shot model grows only linearly. - (2) Train the one-shot model to make it predictive of the validation accuracies of the architectures. ○ If we train naively, the components can co-adapt: removing operations – even unimportant ones – from the network can cause the quality of the model’s predictions to degrade severely. - (3) Evaluate candidate architectures on the validation set using the pre-trained one-shot model. - (4) Re-train the most promising architectures from scratch and evaluate their performance on the test set.
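A sketch of the evaluation idea in step (3) (toy code, not the paper's implementation): a layer holds every candidate operation, and an architecture is scored by zeroing out the operations it does not use and measuring validation accuracy with the remaining ones.

```python
import torch
import torch.nn as nn

class OneShotLayer(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.MaxPool2d(3, stride=1, padding=1),
        ])

    def forward(self, x, mask):            # mask[i] = 1 keeps op i, 0 zeroes it out
        return sum(m * op(x) for m, op in zip(mask, self.ops))

layer = OneShotLayer(8)
x = torch.randn(2, 8, 16, 16)
full = layer(x, mask=[1, 1, 1])            # training uses the (sub-sampled) one-shot model
candidate = layer(x, mask=[1, 0, 0])       # evaluation: zero out ops to emulate one architecture
print(full.shape, candidate.shape)
```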
  • 89. DARTS: Differentiable Architecture Search Hanxiao Liu, Karen Simonyan, Yiming Yang (CMU, Google) ICLR 2019
  • 90. Differentiable Architecture Search - Unlike conventional approaches of applying evolution or reinforcement learning over a discrete and non-differentiable search space, - our method is based on the continuous relaxation of the architecture representation, - allowing efficient search of the architecture using gradient descent.
  • 91. Overview - (a) Operations on the edges are initially unknown. - (b) Continuous relaxation of the search space by placing a mixture of candidate operations on each edge. - (c) Joint optimization of the mixing probabilities and the network weights by solving a bilevel optimization problem. - (d) Inducing the final architecture from the learned mixing probabilities.
  • 92. Continuous Relaxation - To make the search space continuous, we relax the categorical choice of a particular operation o to a softmax over all possible operations O, where the operation mixing weights for a pair of nodes (i, j) are parameterized by a vector α^{(i,j)} with one component α_o^{(i,j)} per operation o (see the equation below).
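Written out in the paper's notation, the mixed operation on edge (i, j) is the softmax-weighted sum of the candidate operations:

\[
\bar{o}^{(i,j)}(x) \;=\; \sum_{o \in \mathcal{O}}
  \frac{\exp\big(\alpha_o^{(i,j)}\big)}{\sum_{o' \in \mathcal{O}} \exp\big(\alpha_{o'}^{(i,j)}\big)}\; o(x)
\]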
  • 93. Joint Optimization (Bilevel Optimization) - Jointly learn the architecture α and the weights w within all the mixed operations (e.g. weights of the convolution filters). - (While not converged) - 1. Update the architecture α by descending the validation loss - 2. Update the weights w by descending the training loss
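A first-order sketch of this alternation (the second-order variant also differentiates through one training step of w); the MixedOp module and the losses below are toy stand-ins for a full DARTS cell.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.ops = nn.ModuleList([nn.Linear(dim, dim), nn.Linear(dim, dim, bias=False), nn.Identity()])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))      # architecture parameters

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)                           # continuous relaxation
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

model = MixedOp(8)
weights = [p for n, p in model.named_parameters() if n != "alpha"]
opt_w = torch.optim.SGD(weights, lr=0.05)
opt_alpha = torch.optim.Adam([model.alpha], lr=3e-3)

x_tr, y_tr = torch.randn(32, 8), torch.randn(32, 8)
x_val, y_val = torch.randn(32, 8), torch.randn(32, 8)
for step in range(20):
    # 1) update alpha by descending the *validation* loss
    loss_val = F.mse_loss(model(x_val), y_val)
    opt_alpha.zero_grad(); loss_val.backward(); opt_alpha.step()
    # 2) update w by descending the *training* loss
    loss_tr = F.mse_loss(model(x_tr), y_tr)
    opt_w.zero_grad(); loss_tr.backward(); opt_w.step()

print(F.softmax(model.alpha, dim=0))   # the final architecture keeps the strongest op(s)
```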
  • 99. ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware Han Cai, Ligeng Zhu, Song Han (MIT) ICLR 2019
  • 100. Proxyless Training - Differentiable NAS can reduce the cost of GPU hours via a continuous representation of network architecture but suffers from high GPU memory consumption (it grows linearly w.r.t. the candidate set size). - As a result, these methods need to utilize proxy tasks. ○ ex. a smaller dataset, learning with only a few blocks, or training just for a few epochs - Architectures optimized on proxy tasks are not guaranteed to be optimal on the target task. - ProxylessNAS can directly learn architectures for large-scale target tasks and target hardware platforms by training memory-efficiently.
  • 104. MnasNet: Platform-Aware Neural Architecture Search for Mobile Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, Quoc V. Le (Google) CVPR 2019
  • 105. Model Latency Problem - Explicitly incorporate model latency into the main objective so that the search can identify a model that achieves a good trade-off between accuracy and latency. - Unlike previous work, where latency is considered via another, often inaccurate proxy (e.g., FLOPS), our approach directly measures real-world inference latency by executing the model on mobile phones. - FLOPS is often an inaccurate proxy: for example, MobileNet and NASNet have similar FLOPS (575M vs. 564M), but their latencies are significantly different (113ms vs. 183ms)
  • 106. Model Latency Problem - While previous approaches mainly perform architecture search on smaller tasks such as CIFAR10, we find those small proxy tasks don’t work when model latency is taken into account, because one typically needs to scale up the model when applying to larger problems. - In this paper, we directly perform our architecture search on the ImageNet training set but with fewer training steps (5 epochs).
  • 107. Factorized Hierarchical Search Space - Previous approaches mainly search for a few types of cells and then repeatedly stack them. This simplifies the search process, but also precludes layer diversity that is important for computational efficiency. - Advantage: balances the diversity of layers against the size of the total search space.
  • 108. Pareto Optimal - Treating the latency budget as a hard constraint only maximizes a single metric and does not provide multiple Pareto optimal solutions, so the reward is softened into a weighted product (see below).
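The weighted-product reward from the MnasNet paper, where T is the target latency and α, β are application-specific exponents:

\[
\max_m \; \mathrm{ACC}(m) \times \left[\frac{\mathrm{LAT}(m)}{T}\right]^{w},
\qquad
w = \begin{cases} \alpha & \text{if } \mathrm{LAT}(m) \le T \\ \beta & \text{otherwise} \end{cases}
\]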
  • 110. FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, Kurt Keutzer (Facebook) CVPR 2019
  • 111. Designing Convnets is hard! - Intractable design space: The design space of a ConvNet is combinatorial & training a ConvNet is very time-consuming. - Nontransferable optimality: the optimality is conditioned on many factors such as input resolutions and target devices. Once these factors change, the optimal architecture is likely to be different. - Inconsistent efficiency metrics: Most of the efficiency metrics we care about are dependent on not only the ConvNet architecture but also the hardware and software configurations on the target device.
  • 112. Differentiable NAS - Layer-wise search space where we can choose a different block for each layer of the network. - By using the Gumbel Softmax technique, we can directly train the architecture distribution using gradient-based optimization, which is extremely fast compared with previous reinforcement learning (RL) based methods. - We measure the latency of each operator and use a lookup-table model; the overall latency is computed by adding up the per-operator latencies. This allows us to quickly estimate latency and makes the latency differentiable with respect to layer-wise block choices.
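A sketch of the differentiable latency term for a single layer (our own toy numbers and names, not FBNet's code): Gumbel-softmax weights over the candidate blocks, combined with a measured per-block latency lookup table, give a latency estimate that is differentiable w.r.t. the architecture parameters; in FBNet this term is added to the task loss.

```python
import torch
import torch.nn.functional as F

latency_table = torch.tensor([3.1, 5.4, 0.8, 2.2])       # ms per candidate block (measured offline)
theta = torch.zeros(4, requires_grad=True)               # architecture parameters for this layer

def expected_latency(theta, tau=1.0):
    m = F.gumbel_softmax(theta, tau=tau)                  # soft one-hot sample over blocks
    return (m * latency_table).sum()                      # differentiable lookup-table latency

opt = torch.optim.Adam([theta], lr=0.1)
for _ in range(50):                                       # minimizing latency alone, for illustration
    loss = expected_latency(theta)
    opt.zero_grad(); loss.backward(); opt.step()
print(theta.softmax(dim=0))                               # mass shifts toward the cheapest block
```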
  • 113. Performance (ImageNet) - Achieves better accuracy and lower latency than MnasNet, but we estimate the search cost of DNAS is 420x smaller.
  • 114. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks Mingxing Tan, Quoc V. Le (Google) ICML 2019
  • 115. Scaling up CNNs - Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available. - Carefully balancing network depth, width, and resolution can lead to better performance. - A new scaling method that uniformly scales all dimensions of depth (ex. ResNet, Inception) / width (ex. WideResNet, MobileNet) / resolution (NASNet, GPipe) using a simple yet highly effective compound coefficient.
  • 116. Compound Scaling - φ is a user-specified coefficient that controls how many more resources are available for model scaling, while α, β, γ specify how to assign these extra resources to network width, depth, and resolution. - FLOPS of a regular convolution op is proportional to d, w², r². - Total FLOPS will approximately increase by 2^φ.
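Concretely, the compound scaling rule from the paper is:

\[
\text{depth: } d = \alpha^{\phi}, \qquad
\text{width: } w = \beta^{\phi}, \qquad
\text{resolution: } r = \gamma^{\phi},
\]
\[
\text{s.t. } \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \qquad
\alpha \ge 1,\; \beta \ge 1,\; \gamma \ge 1
\]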
  • 117. Compound Scaling - Developed the baseline network by leveraging the multi-objective neural architecture search from MnasNet. - Starting from the baseline EfficientNet-B0, we apply our compound scaling method to scale it up in two steps. - 1. Fix φ = 1, assuming twice more resources are available, and do a small grid search of α, β, γ to find their optimal values. - 2. Fix α, β, γ as constants and scale up the baseline network with different φ.
  • 119. ScarletNAS: Bridging the Gap Between Scalability and Fairness in Neural Architecture Search Xiangxiang Chu, Bo Zhang, Jixiang Li, Qingyuan Li, Ruijun Xu (Xiaomi) 2019. 8.
  • 120. Supernet training with variable depths - One-shot NAS features fast training of a supernet in a single run, but the weight-sharing approach lacks scalability. - Identity blocks help to build a scalable supernet (with variable depths) but make supernet training unstable. - Introduce a linearly equivalent transformation to soothe the training turbulence, with a proof that the transformed path is identical to the original one in representational power.
  • 121. Linearly Equivalent Transformation - As pure identity blocks are direct short paths and don’t learn any information, we have to accommodate this defect by injecting a learning unit. - Here we remedy the issue with 1 × 1 convolution without non-linear activations.
  • 123. NAS-Bench-101: Towards Reproducible Neural Architecture Search Chris Ying, Aaron Klein, Esteban Real, Eric Christiansen, Kevin Murphy, Frank Hutter (Google) ICML 2019
  • 124. NAS research is hard! - NAS demands tremendous computational resources, which makes it difficult to reproduce experiments and imposes a barrier-to-entry on researchers without access to large-scale computation. - Although recent improvements have yielded more efficient methods, different methods are not comparable to each other due to different training procedures and different search spaces, which makes it difficult to attribute the success of each method to the search algorithm itself.
  • 125. Architecture Dataset - We carefully constructed a search space, exploiting graph isomorphisms to identify 423k unique convolutional architectures. - Each cell is a 7-vertex directed acyclic graph, with an operation chosen for each of the 5 intermediate vertices (recall that the input and output vertices are fixed). - To support both ResNet- and Inception-like cells and to keep the space tractable: tensors going to the output vertex are concatenated and those going into other vertices are summed. - We trained and evaluated all of these architectures multiple times on CIFAR-10 and compiled the results into a large dataset of over 5 million trained models.
  • 126. Metrics - training accuracy, validation accuracy, testing accuracy, training time in seconds, number of trainable model parameters - Only metrics on the training and validation set should be used to search models within a single NAS algorithm, and testing accuracy should only be used for an offline evaluation. The training time metric allows benchmarking algorithms that optimize for accuracy while operating under a time limit and also allows the evaluation of multi-objective optimization methods.
  • 127. Accuracy - We repeat the training and evaluation of all architectures 3 times to obtain a measure of variance, and we trained all our architectures with four increasing epoch budgets: {4, 12, 36, 108}. - (Figure: left) train/valid/test accuracy after training for 108 epochs and (right) the noise, defined as the standard deviation of the test accuracy between the three trials
  • 128. Pareto Frontier - Hand-designed cells, such as ResNet and Inception, perform near the Pareto frontier of accuracy over cost, which suggests that topology and operation selection are critical for finding both high-accuracy and low-cost models.
  • 130. Locality of Architecture Search Space - Locality across the whole space ○ Random-walk autocorrelation (RWA), defined as the autocorrelation of the accuracies of points visited as we perform random walks through the space, shows high correlations for lower distances, indicating locality. The correlations become indistinguishable beyond a distance of about 6. - Locality around a global accuracy maximum ○ The fitness-distance correlation metric (FDC) shows that there is locality around the global maximum as well, and the peak also has a coarse-grained width of about 6. - Locality around the Inception-like cell ○ Fraction of the search space volume that lies within a given distance to the closest high peak.