Corinna Cortes, Head of Research, Google, at MLconf NYC 2017

Harnessing Neural Networks
Corinna Cortes
Google Research, NY

Harnessing the Power of Neural Networks
Introduction
How do we standardize the output?
How do we speed up inference?
How do we automatically find a good network architecture?

Google’s mission is to organize the world’s information
and make it universally accessible and useful.

Smart Reply in Inbox
10% of all responses sent on mobile

LSTMs and Extrapolation
They daydream or hallucinate :-)
Feature or bug?

DeepDream Art Auction and Symposium (A&MI)

Magenta
[Figure: recurrent unit A with input Xt and output ht]
A.I. Duet
https://aiexperiments.withgoogle.com/ai-duet/view/

Restricting the Output. Smart Replies.
http://www.kdd.org/kdd2016/papers/files/Paper_1069.pdf
● Ungrammatical and inappropriate answers
○ thanks hon!; Yup, got it thx; Leave me alone!
● Work with a Fixed Response Set
○ Sanitized answers are clustered into groups of semantically similar answers using label propagation;
○ The answers in the clusters are used to filter the candidate set generated by the LSTM. Diversity is ensured by using top answers from different clusters.
● Efficient search via tries
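The filter-and-diversify step in the bullets above can be sketched as follows. This is a minimal illustration, not the Smart Reply implementation: the function name `pick_diverse`, the scores, and the cluster labels are all hypothetical.

```python
def pick_diverse(scored_candidates, cluster_of, k=3):
    """Return up to k responses, at most one per semantic cluster.

    scored_candidates: list of (response, score) pairs, e.g. produced by an LSTM.
    cluster_of: dict mapping each sanitized response to a cluster ID (assumed given).
    """
    chosen, seen_clusters = [], set()
    # Visit candidates from highest to lowest score.
    for response, _score in sorted(scored_candidates, key=lambda p: p[1], reverse=True):
        cluster = cluster_of.get(response)
        if cluster is None:           # not in the fixed response set: filter it out
            continue
        if cluster in seen_clusters:  # this intent is already represented
            continue
        chosen.append(response)
        seen_clusters.add(cluster)
        if len(chosen) == k:
            break
    return chosen

# Hypothetical data, for illustration only.
cluster_of = {
    "Sounds good!": "agree", "Works for me.": "agree",
    "Sorry, I can't.": "decline", "What time?": "ask_time",
}
candidates = [
    ("Sounds good!", 0.9), ("Works for me.", 0.8), ("thanks hon!", 0.7),
    ("What time?", 0.6), ("Sorry, I can't.", 0.5),
]
suggestions = pick_diverse(candidates, cluster_of)
```

Here "thanks hon!" is dropped because it is not in the response set, and "Works for me." is dropped because its cluster is already covered by a higher-scoring answer.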

Search Tree, Trie, for Valid Responses
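[Figure: trie over valid responses. Prefixes such as “I can do” and “How about” branch to “Tuesday”, “Wednesday”, “Tuesday?”, “Wednesday?”; another branch holds “What time works for you?” with “.” and “!” variants. Leaves are grouped into response clusters.]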

Computational Complexity
● Exhaustive: R × l
○ R: size of the response set; l: length of the longest sentence
● Beam search: b × l
○ Typical size of R: millions; typical size of b: 10-30
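A rough sketch of beam search restricted to a trie of valid responses, which is where the b × l cost comes from: at each of up to l steps only b prefixes are kept. The token scores stand in for LSTM log-probabilities and are invented for illustration; `build_trie`, `beam_search`, and `toy_score` are hypothetical names.

```python
def build_trie(responses):
    """Build a nested-dict trie over tokenized responses; "<end>" marks a full response."""
    root = {}
    for response in responses:
        node = root
        for token in response.split():
            node = node.setdefault(token, {})
        node["<end>"] = {}
    return root

def beam_search(trie, score_token, beam_width=2, max_len=10):
    """Keep the beam_width best prefixes per step, expanding only along trie edges."""
    beams = [(0.0, [], trie)]  # (log-probability, tokens so far, trie node)
    finished = []
    for _ in range(max_len + 1):
        expanded = []
        for logp, tokens, node in beams:
            for token, child in node.items():
                if token == "<end>":
                    finished.append((logp, " ".join(tokens)))
                else:
                    step = score_token(tokens, token)
                    expanded.append((logp + step, tokens + [token], child))
        if not expanded:
            break
        expanded.sort(key=lambda b: b[0], reverse=True)
        beams = expanded[:beam_width]
    finished.sort(key=lambda f: f[0], reverse=True)
    return [text for _, text in finished]

# Invented per-token log-probabilities; a real system would query the LSTM here.
TOKEN_LOGP = {"I": -0.4, "How": -1.2, "can": -0.1, "do": -0.1, "about": -0.1,
              "Tuesday": -0.5, "Wednesday": -1.1, "Tuesday?": -0.3, "Wednesday?": -1.5}

def toy_score(prefix, token):
    return TOKEN_LOGP.get(token, -5.0)

valid = ["I can do Tuesday", "I can do Wednesday",
         "How about Tuesday?", "How about Wednesday?"]
ranked = beam_search(build_trie(valid), toy_score, beam_width=2)
```

Every returned string is a member of the response set by construction, and low-scoring prefixes are pruned before they are fully scored.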

What if the Response Set is in the Billions?
● A more elegant solution based on rules
○ Exploit rules to efficiently enlarge the response set:
■ “Can you do Monday?” “Yes, I can do Monday”
■ “Can you do Tuesday?” “Yes, I can do Tuesday”
■ ...
○ Generalize to a template: “Can you do <time>?” → “Yes, I can do <time>” or “No, I can do <time + 1>”
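One way to read the `<time>` rule is as a template that is instantiated on demand rather than stored once per value. The sketch below assumes, purely for illustration, that `<time>` ranges over weekday names and that `<time + 1>` means the following day; `rule_responses` is a hypothetical name.

```python
import re

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
PATTERN = re.compile(r"^Can you do (?P<time>\w+)\?$")

def rule_responses(message):
    """Instantiate the <time> rule for a message; return [] if the rule does not match."""
    match = PATTERN.match(message)
    if not match or match.group("time") not in DAYS:
        return []
    day = match.group("time")
    # Assumed reading of "<time + 1>": the next day of the week.
    next_day = DAYS[(DAYS.index(day) + 1) % len(DAYS)]
    return [f"Yes, I can do {day}", f"No, I can do {next_day}"]

replies = rule_responses("Can you do Monday?")
```

A single rule covers every value of `<time>`, so the response set grows without enumerating each sentence.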

Rules for Response Set

Text Normalization for Text-to-Speech (TTS) Systems
Navigation assistant

Text Normalization
Richard Sproat, Navdeep Jaitly, Google: “RNN Approaches to Text Normalization: A Challenge”
https://arxiv.org/pdf/1611.00068.pdf

Break the Task into Two
● Channel model
○ What are the possible normalizations of a token? Sequence of tokens to words.
○ Example: 123
■ one hundred twenty three, one two three, one twenty three, ...
● Language model
○ Which normalization is appropriate in the given context? Words to words.
○ Example: 123
■ 123 King Ave.: the correct reading in American English would normally be one twenty three.
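The two-stage split can be illustrated with a toy example. The sketch assumes a three-digit token with a non-zero first digit, and it replaces the learned language model with a hand-written context rule; all function names and the list of street words are hypothetical.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
         "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

def two_digits(n):
    """Verbalize 0..99 as a cardinal."""
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    return TENS[n // 10] + ("" if n % 10 == 0 else " " + ONES[n % 10])

def channel_model(token):
    """Propose candidate readings for a three-digit token (toy channel model)."""
    hundreds, rest = int(token[0]), int(token[1:])
    cardinal = ONES[hundreds] + " hundred" + ("" if rest == 0 else " " + two_digits(rest))
    digits = " ".join(ONES[int(d)] for d in token)
    paired = ONES[hundreds] + " " + two_digits(rest)  # address-style reading
    return {"cardinal": cardinal, "digits": digits, "paired": paired}

def language_model(candidates, following_words):
    """Choose a reading from context; a hand-written rule stands in for a learned model."""
    street_words = {"Ave.", "St.", "Rd.", "Blvd."}
    if any(word in street_words for word in following_words):
        return candidates["paired"]
    return candidates["cardinal"]

candidates = channel_model("123")
reading = language_model(candidates, ["King", "Ave."])
```

The channel model only enumerates readings; the choice among them depends entirely on the surrounding words.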

Combining the Models
One combined LSTM

Add a Grammar to Constrain the Output
Rule: <number> + <measurement abbreviation> => <number> + the possible verbalizations of the measurement abbreviation.
Instantiation: 24.2kg => twenty four point two kilogram, twenty four point two kilograms, twenty four point two kilo.
Finite State Transducer (FST): a finite state automaton that produces output as well as reading input; used for pattern matching and regular expressions.

Thrax Grammar
MEASURE: <number> + <measurement abbreviation> -> <number> + measurement verbalizations
Input: 5 kg -> five kilo/kilograms/kilogram
MONEY: $ <number> -> <number> dollars
The input is composed with the FSTs. The output of the FSTs is used to restrict the output of the LSTM.
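The idea of letting a grammar restrict the network's output can be sketched without Thrax or an FST library. Below, regular expressions play the role of the MEASURE and MONEY rules, only single-digit numbers are handled, and the candidate scores are invented; none of this is the actual Thrax grammar or the system described in the talk.

```python
import re

DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}
MEASURES = {"kg": ["kilo", "kilograms", "kilogram"]}  # verbalizations listed on the slide

def allowed_outputs(token):
    """Return the verbalizations the toy grammar permits, or None if no rule applies."""
    measure = re.fullmatch(r"(\d) ?([a-z]+)", token)
    if measure and measure.group(2) in MEASURES:
        number = DIGITS[measure.group(1)]
        return {f"{number} {unit}" for unit in MEASURES[measure.group(2)]}
    money = re.fullmatch(r"\$ ?(\d)", token)
    if money:
        return {f"{DIGITS[money.group(1)]} dollars"}
    return None

def constrain(scored_candidates, token):
    """Pick the best-scoring candidate that the grammar allows for this token."""
    allowed = allowed_outputs(token)
    if allowed is not None:
        scored_candidates = [c for c in scored_candidates if c[0] in allowed]
    if not scored_candidates:
        return None
    return max(scored_candidates, key=lambda c: c[1])[0]

# Invented network outputs: the top-scoring reading is wrong and gets filtered out.
lstm_candidates = [("five kilometers", 0.6), ("five kilograms", 0.3), ("five kg", 0.1)]
best = constrain(lstm_candidates, "5 kg")
```

Tokens that match no rule are left unconstrained, so the grammar only intervenes where it has something to say.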

TTS: RNN + FST
Measure and Money
restricted by grammar.

Super-Multiclass Classification Problem
One class per image type (horse, car, …), M classes.
Neural network inference: computing the last layer alone requires MN multiply-adds.
[Figure: output layer, M units; last hidden layer, N units]

Asymmetric Hashing
Weights to the output layer, W1, W2, W3, …, WM, each partitioned into N/k chunks
● Represent each chunk with a set of cluster centers (256) using k-means.
● Save the coordinates of the centers, (ID, coordinates).
● Save each weight vector as a set of closest IDs, its hash code.

Asymmetric Hashing, Example Hash Codes
[Figure: each weight vector stored as a row of cluster-center IDs]
78 184 15
12 63 192
56 82 72
201 37 51

Asymmetric Hashing, Searching
● For a given activation u, divide it into its N/k chunks, u_j:
○ Compute the 256 · N/k distances to the centers. Each chunk has k coordinates, so this takes 256N multiply-adds, not MN.
○ Compute the distances to all hash codes: MN/k additions needed.
● The “Asymmetric” in “Asymmetric Hashing” refers to the fact that we hash the weight vectors but not the activation vector.
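The index-then-search procedure can be sketched in plain Python. Several things here are assumptions made for illustration: the sizes are tiny (16 centers rather than 256), k-means is a bare-bones Lloyd's iteration, and scoring uses inner products (the quantity an output layer computes) where the slides speak of distances. All function names are hypothetical.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def chunks(vec, k):
    """Split a vector into consecutive chunks of k coordinates."""
    return [vec[i:i + k] for i in range(0, len(vec), k)]

def kmeans(points, n_centers, iters=10, seed=0):
    """Plain Lloyd's k-means; returns the list of centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, n_centers)
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(n_centers), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [[sum(col) / len(g) for col in zip(*g)] if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

def build_index(weights, k, n_centers):
    """Quantize every weight vector chunk by chunk; return (codebooks, hash codes)."""
    per_chunk = list(zip(*[chunks(w, k) for w in weights]))  # chunk j of every vector
    codebooks = [kmeans(list(col), n_centers) for col in per_chunk]
    codes = [[min(range(n_centers), key=lambda c: sq_dist(ch, codebooks[j][c]))
              for j, ch in enumerate(chunks(w, k))] for w in weights]
    return codebooks, codes

def approx_scores(u, codebooks, codes, k):
    """Score all M classes with table lookups instead of M full dot products."""
    # One table per chunk: u_j against every center (n_centers * N multiply-adds in total).
    tables = [[dot(uj, center) for center in book]
              for uj, book in zip(chunks(u, k), codebooks)]
    # Each class then costs N/k lookups and additions.
    return [sum(tables[j][cid] for j, cid in enumerate(code)) for code in codes]

rng = random.Random(1)
M, N, k = 50, 8, 2  # hypothetical sizes
weights = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(M)]
codebooks, codes = build_index(weights, k, n_centers=16)
u = [rng.gauss(0, 1) for _ in range(N)]
scores = approx_scores(u, codebooks, codes, k)
```

Only the weight vectors are replaced by codes; the activation `u` is used at full precision when the lookup tables are built.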

Asymmetric Hashing
Incredible saving in inference time
Sometimes also with a bit of improved accuracy

“Learning to Learn”, a.k.a. “Automated Hyperparameter Tuning”
Google: AdaNet; Architecture Search with Reinforcement Learning
MIT: Designing Neural Network Architectures Using Reinforcement Learning
Harvard, Toronto, MIT, Intel: Scalable Bayesian Optimization Using Deep Neural Networks
Genetic Algorithms, Reinforcement Learning, Boosting Algorithms

Modeling Challenges for ML
The right model choice can significantly improve performance. For deep learning this is particularly hard, as the search space is huge and there is
● Difficult non-convex optimization
● Lack of sufficient theory
Questions
● Can neural network architectures be learned together with their weights?
● Can this problem be solved efficiently and in a principled way?
● Can we capture the end-to-end process?

AdaNet
● Incremental construction: At each round, the algorithm adds a subnetwork to
the existing neural network;
● The algorithm leverages previously learned embeddings;
● Adaptively grows network, balancing trade-off between empirical error and
model complexity;
● Learning bound:
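The incremental construction in the bullets above can be illustrated with a generic greedy loop that adds one candidate per round while a penalized objective keeps improving. This is a loose sketch, not the AdaNet algorithm: it does not implement AdaNet's objective or its learning bound, and the squared-error loss, the fixed mixing step, the complexity numbers, and the name `grow` are all assumptions.

```python
def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def grow(candidates, target, penalty=0.01, step=0.5, rounds=5):
    """Greedily add candidate subnetworks while the penalized objective improves.

    candidates: list of (name, predictions, complexity) triples, where predictions
    are the candidate's outputs on the training set (hypothetical inputs).
    """
    ensemble = [0.0] * len(target)
    chosen, total_complexity = [], 0.0
    best = mse(ensemble, target)
    for _ in range(rounds):
        trials = []
        for name, preds, complexity in candidates:
            trial = [e + step * p for e, p in zip(ensemble, preds)]
            # Empirical error plus a penalty on the complexity accumulated so far.
            objective = mse(trial, target) + penalty * (total_complexity + complexity)
            trials.append((objective, name, trial, complexity))
        objective, name, trial, complexity = min(trials, key=lambda t: t[0])
        if objective >= best:  # no candidate improves the trade-off: stop growing
            break
        best, ensemble = objective, trial
        total_complexity += complexity
        chosen.append(name)
    return chosen, best

# Two invented candidates with identical predictions but different complexity.
target = [1.0, 1.0, 1.0, 1.0]
candidates = [("shallow", [1.0, 1.0, 1.0, 1.0], 1.0),
              ("deep", [1.0, 1.0, 1.0, 1.0], 5.0)]
chosen, objective = grow(candidates, target)
```

With equal predictions the cheaper candidate wins each round, and growth stops once adding anything would raise the penalized objective.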

Experimental Results, AdaNet
CIFAR-10: 60,000 images, 10 classes
Standard deviation of all numbers: 0.01
Label Pair         AdaNet   Log. Reg.   NN
deer-truck         0.94     0.90        0.92
deer-horse         0.84     0.77        0.81
automobile-truck   0.85     0.80        0.81
cat-dog            0.69     0.67        0.66
dog-horse          0.84     0.80        0.81

Neural Architecture Search with RL
[Figure panels: error rates on CIFAR-10; perplexity on Penn Treebank]
Current accuracy of NAS on ImageNet: 78%
State of the art: 80.x%
