A deeper talk on the Transformer architecture from the webinar at NTR
https://www.ntr.ai/webinar/transformery
Google slides version: https://docs.google.com/presentation/d/1dIadh_nIszxXG8-672vJmvFGT6jBp0mOqzNV4g3e2Lc/edit?usp=sharing
2. ● Understanding of the Transformer architecture
○ Original paper: https://arxiv.org/abs/1706.03762
○ Great visual explanation: http://jalammar.github.io/illustrated-transformer
○ Lecture #12 from my DL course
https://github.com/che-shr-cat/deep-learning-for-biology-hse-2019-course
● This talk is a follow-up to the talk from the GDG DevParty
○ https://www.youtube.com/watch?v=KZ9NXYcXVBY
Prerequisites
4. Transformer
A new simple network architecture,
the Transformer:
● Is an encoder-decoder architecture
● Based solely on attention mechanisms
(no RNN/CNN)
● The major component in the Transformer is
the multi-head self-attention unit.
● Fast: only matrix multiplications
● Strong results on standard WMT datasets
7. The Transformer adopts scaled dot-product
attention: the output is a weighted sum of the
values, where the weight assigned to each value
is computed from the dot product of the query
with the corresponding key, scaled and passed
through a softmax:
Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k)·V
The input consists of queries and keys of
dimension d_k, and values of dimension d_v.
Scaled dot-product attention
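A minimal NumPy sketch of this formula (illustrative only, not the talk's or the paper's code; the function name and the toy shapes are just for this example):
```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v) -> (n_queries, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys) dot products
    weights = softmax(scores, axis=-1)   # each query's weights over all keys
    return weights @ V                   # weighted sum of the values

# Toy usage: 4 queries/keys of dimension d_k=8, values of dimension d_v=16
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
V = rng.normal(size=(4, 16))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 16)
```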
8. Problems with vanilla transformers
● It’s a pretty heavy model
→ hard to train, tricky training
schedule
● Its attention mechanism has O(N²)
computational complexity
→ scales poorly
● It has limited context span
(mostly due to the complexity),
typically 512 tokens
→ can’t process long sequences.
● May need a different inductive bias
for other types of data (e.g. images,
sound, etc.)
16. Input elements
● Character: Character-Level Language Modeling with Deeper Self-Attention,
https://arxiv.org/abs/1808.04444
● BPE (subword units): used by most Transformers
● Word: still an option, but less flexible with out-of-vocabulary words
● Pixels: Image transformer or iGPT
● MIDI notes: Music transformer
● ...
17. Image GPT (iGPT)
Just GPT-2 trained on images unrolled into long sequences of pixels!
Waiting for GPT-3 (uses sparse attention) trained on images.
https://openai.com/blog/image-gpt/
18. Dimensionality: 1D, 2D, ...
Axial Transformer: for images and other data organized as high-dimensional tensors.
Axial Attention in Multidimensional Transformers
https://arxiv.org/abs/1912.12180
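A rough sketch of the axial idea, assuming a small 2D grid and plain self-attention without learned projections (names and shapes are illustrative, not the paper's implementation): attend along rows, then along columns, so each position pays for H+W neighbours instead of H·W.
```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, K, V):
    # scaled dot-product attention over the last axis of the grid
    d = Q.shape[-1]
    w = softmax(Q @ np.swapaxes(K, -1, -2) / np.sqrt(d), axis=-1)
    return w @ V

def axial_attention(x):
    """x: (H, W, d). Attend along rows, then along columns.
    Cost is O(H*W*(H+W)) instead of O((H*W)^2) for full attention."""
    x = attend(x, x, x)                 # row attention: each row of W positions
    x_t = np.swapaxes(x, 0, 1)          # (W, H, d)
    x_t = attend(x_t, x_t, x_t)         # column attention
    return np.swapaxes(x_t, 0, 1)

x = np.random.default_rng(1).normal(size=(8, 8, 16))
print(axial_attention(x).shape)  # (8, 8, 16)
```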
19. Positional Encoding
1. Sinusoidal Position Encoding (Vaswani et al, 2017,
https://arxiv.org/abs/1706.03762)
Uses sine/cosine waves as in the original paper.
2. Learned Position Encoding (Gehring et al, 2017,
https://arxiv.org/abs/1705.03122)
Embed the absolute position of input elements. Can’t extrapolate to lengths
it has never seen during training.
3. Relative Position Representations (Shaw et al, 2018,
https://arxiv.org/abs/1803.02155)
Model the input as a labeled, directed, fully-connected graph. Learn edge
representations.
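A minimal sketch of the sinusoidal encoding from option 1 (illustrative NumPy, not any library's API):
```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
       PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))"""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angle = pos / np.power(10000.0, 2 * i / d_model)   # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = sinusoidal_positional_encoding(max_len=512, d_model=64)
print(pe.shape)  # (512, 64): added element-wise to the token embeddings
```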
20. Transformer with added recurrence: it can see the previous segment's
representations, so it can process longer sequences.
Recurrence: Transformer-XL
https://arxiv.org/abs/1901.02860
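A toy sketch of the segment-level recurrence idea, assuming a single layer and ignoring causal masking and Transformer-XL's relative positional encoding (function names are made up for this example):
```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def segment_attention(h_current, memory):
    """h_current: (L, d) hidden states of the current segment,
    memory: (M, d) cached hidden states of the previous segment (kept fixed).
    Queries come from the current segment only; keys/values span memory + current."""
    kv = np.concatenate([memory, h_current], axis=0)      # (M + L, d)
    d = h_current.shape[-1]
    w = softmax(h_current @ kv.T / np.sqrt(d), axis=-1)   # (L, M + L)
    return w @ kv

rng = np.random.default_rng(2)
memory = rng.normal(size=(128, 32))    # previous segment representations (no gradients)
segment = rng.normal(size=(128, 32))   # current segment
print(segment_attention(segment, memory).shape)  # (128, 32)
```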
21. The Compressive Transformer keeps a fine-grained memory of past activations,
which are then compressed into coarser compressed memories.
Recurrence & Mem: Compressive Transformer
Compressive Transformers for Long-Range Sequence Modelling
https://arxiv.org/abs/1911.05507
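A tiny sketch of one possible compression function, mean pooling over the oldest memories (the paper also considers convolutions and learned compression; this is illustrative, not the authors' code):
```python
import numpy as np

def compress(old_memories, rate=3):
    """Compress the oldest fine-grained memories by average pooling
    with compression rate `rate`."""
    M, d = old_memories.shape
    M_trim = (M // rate) * rate
    return old_memories[:M_trim].reshape(M_trim // rate, rate, d).mean(axis=1)

old = np.random.default_rng(3).normal(size=(96, 32))
print(compress(old).shape)  # (32, 32): 96 activations -> 32 compressed memories
```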
25. Attention mechanism: Image Transformer
Local self-attention: in every self-attention layer, each position in a query block
attends to all positions in the memory block.
Image Transformer, https://arxiv.org/abs/1802.05751
26. Sparse factorizations of the attention matrix reduce the complexity to O(N·√N).
Can generate sounds and images.
Attention mechanism: Sparse Transformer
Generating Long Sequences with Sparse Transformers
https://arxiv.org/abs/1904.10509
https://openai.com/blog/sparse-transformer/
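An illustrative mask in the spirit of the strided sparse pattern, assuming one combined local + strided pattern per row (the paper splits these patterns across heads; names are hypothetical):
```python
import numpy as np

def strided_sparse_mask(n, stride):
    """Boolean (n, n) causal mask: position i attends to the previous `stride`
    positions (local) and to every stride-th earlier position (strided).
    Each row has O(sqrt(n)) ones when stride ~ sqrt(n)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i
    local = (i - j) < stride
    strided = ((i - j) % stride) == 0
    return causal & (local | strided)

mask = strided_sparse_mask(n=16, stride=4)
print(mask.sum(axis=1))  # number of attended positions per query stays small
```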
28. Reformer is an optimized Transformer:
● Uses less memory (reversible
layers avoid storing activations;
feed-forward computations are chunked)
● Computes attention using LSH
(locality-sensitive hashing)
○ O(L²) → O(L·log L)
○ Approximates the softmax via LSH (the softmax
is dominated by the largest elements, so
for each query qi we only need to focus on the keys in K that are closest to qi)
● => can process longer sequences!
64K-token sequences on one GPU!
Reformer
Reformer: The Efficient Transformer
https://arxiv.org/abs/2001.04451
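A rough sketch of the angular LSH hashing step, assuming shared query/key vectors as in the paper (the sorting and chunked attention over buckets are omitted; names are illustrative):
```python
import numpy as np

def lsh_buckets(x, n_buckets, seed=0):
    """Angular LSH: project vectors with a random matrix and take the argmax
    over [xR; -xR] as the bucket id, so vectors that are close under cosine
    similarity tend to land in the same bucket. x: (n, d) -> bucket ids (n,)."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(x.shape[-1], n_buckets // 2))
    projected = x @ R                                    # (n, n_buckets/2)
    return np.argmax(np.concatenate([projected, -projected], axis=-1), axis=-1)

qk = np.random.default_rng(4).normal(size=(1024, 64))    # shared query/key vectors
buckets = lsh_buckets(qk, n_buckets=32)
# Attention is then computed only among positions in the same bucket,
# reducing the cost from O(L^2) towards O(L*log L).
print(np.bincount(buckets, minlength=32))
```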
32. Use local sliding-window attention + global attention on pre-selected positions.
Longformer
Longformer: The Long-Document Transformer
https://arxiv.org/abs/2004.05150
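A minimal sketch of such a mask, assuming a symmetric sliding window and one global position (illustrative only, not the Longformer implementation):
```python
import numpy as np

def longformer_mask(n, window, global_positions):
    """Boolean (n, n) mask: sliding-window attention of width `window`
    around each position, plus full attention to and from a few
    pre-selected global positions (e.g. a [CLS]-like token)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    mask = np.abs(i - j) <= window // 2   # local sliding window
    for g in global_positions:
        mask[g, :] = True                 # global position attends everywhere
        mask[:, g] = True                 # everyone attends to the global position
    return mask

mask = longformer_mask(n=16, window=4, global_positions=[0])
print(mask.sum())  # far fewer entries than the 16*16 of full attention
```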
34. ● Another local + global attention.
● Can incorporate structured data into the model!
Extended Transformer Construction (ETC)
ETC: Encoding Long and Structured Data in Transformers
https://arxiv.org/abs/2004.08483
35. Idea:
● Apply ACT to Transformers
● Apply a variable number of repetitions when computing each position's
representation: the Universal Transformer (UT)
● Use dynamic attention span: Adaptive Attention Span in Transformers
Adaptive Computation Time in Transformers
Adaptive Computation Time (ACT) in Neural Networks [3/3]
https://medium.com/@moocaholic/adaptive-computation-time-act-in-neural-networks-3-3-99452b2eff18
36. ● Two flavors of UT in the paper:
○ UT with a fixed number of repetitions.
○ UT with dynamic halting.
● The UT repeatedly refines a series of vector representations for each position
of the sequence in parallel, by combining information from different positions
using self-attention and applying a recurrent transition function across all time
steps.
○ Fixed UT: the number of time steps, T, is arbitrary but fixed (no ACT here, a fixed
number of repetitions).
○ Adaptive UT: the number of time steps, T, is dynamic (a dynamic ACT halting
mechanism is applied to each position in the input sequence).
Universal Transformer (UT): Implementation
“Universal Transformers”,
https://arxiv.org/abs/1807.03819
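A toy sketch of the fixed-repetitions flavor, with a single weight-tied block applied T times (the real transition function, timestep/position signals and layer norm are omitted; names are made up):
```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shared_block(h, W):
    """One refinement step: self-attention followed by a (here: linear + tanh)
    transition function, with the SAME parameters W reused at every step."""
    d = h.shape[-1]
    attn = softmax(h @ h.T / np.sqrt(d), axis=-1) @ h
    return np.tanh(attn @ W)

def universal_transformer_fixed(h, W, T=6):
    """Fixed number of repetitions T: in spirit, a T-layer Transformer
    with parameters tied across its layers."""
    for _ in range(T):
        h = shared_block(h, W)
    return h

rng = np.random.default_rng(5)
h = rng.normal(size=(10, 32))          # 10 positions, d=32
W = rng.normal(size=(32, 32)) * 0.1
print(universal_transformer_fixed(h, W).shape)  # (10, 32)
```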
37. UT with a fixed number of repetitions
“Moving Beyond Translation with the Universal Transformer”,
https://ai.googleblog.com/2018/08/moving-beyond-translation-with.html
38. Adaptive UT with dynamic halting
“Universal Transformers”,
https://mostafadehghani.com/2019/05/05/universal-transformers/
39. ● Universal Transformer is a recurrent function (not in time, but in depth) that
evolves per-symbol hidden states in parallel, based at each step on the
sequence of previous hidden states.
○ In that sense, UT is similar to architectures such as the Neural GPU
and the Neural Turing Machine.
● When running for a fixed number of steps, the Universal Transformer is
equivalent to a multi-layer Transformer with tied parameters across its layers.
● Adaptive UT: as the recurrent transition function can be applied any number
of times, this implies that adaptive UTs can have variable depth (number of
per-symbol processing steps).
● Universal Transformer can be shown to be Turing-complete (or
“computationally universal”)
Universal Transformer (UT): Notes
“Universal Transformers”,
https://arxiv.org/abs/1807.03819
40. Related idea: cross-layer parameter sharing (ALBERT)
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
https://arxiv.org/abs/1909.11942
41. ● The problem with the vanilla transformer is its fixed context size (or attention
span).
● It cannot be very large because of the computation cost of the attention
mechanism (it requires O(n²) computations).
● Let the layer (or even the attention head) decide the required context size on
its own.
● There are two options:
○ Learnable (the adaptive attention span): let each attention head learn its
own attention span independently from the other heads. It is learnable,
but still fixed after training is done.
○ ACT-like (the dynamic attention span): changes the span dynamically
depending on the current input.
Adaptive Attention Span: Idea & Implementation
“Adaptive Attention Span in Transformers”,
https://arxiv.org/abs/1905.07799
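A minimal sketch of the soft masking function from the adaptive-span paper, m_z(x) = clamp((R + z - x)/R, 0, 1), where z is a learnable span per head and R is a ramp-width hyperparameter (illustrative NumPy, not the authors' code):
```python
import numpy as np

def span_mask(distances, z, R=32):
    """Soft masking over query-key distance x: full weight for x < z,
    a linear ramp of width R, and zero beyond ~z + R. Attention weights are
    multiplied by this mask (and re-normalized), so each head only pays
    for the span it actually uses."""
    return np.clip((R + z - distances) / R, 0.0, 1.0)

distances = np.arange(0, 256)   # distance of each key from the query
print(span_mask(distances, z=64)[:5], span_mask(distances, z=64)[200:205])
# near keys keep full weight; keys past the learned span are masked out
```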
42. The models are smaller, the performance is better.
Adaptive Attention Span: Performance
“Adaptive Attention Span in Transformers”,
https://arxiv.org/abs/1905.07799
43. Adaptive spans (in log scale) of every attention head in a 12-layer model with
span limit S = 4096. Only a few attention heads require long attention spans.
Adaptive spans are learned larger when needed
“Adaptive Attention Span in Transformers”,
https://arxiv.org/abs/1905.07799
44. Example of average dynamic attention span as a function of the input sequence.
The span is averaged over the layers and heads.
Dynamic spans adapt to the input sequence
“Adaptive Attention Span in Transformers”,
https://arxiv.org/abs/1905.07799
47. ● Transformers are cool and produce great results!
● There are many modifications; it's a kind of LEGO set, you can combine the pieces.
● More good source code and libraries are available (Hugging Face, Colab
notebooks, etc.)
● Definitely more transformers to come!
● GET INVOLVED!
You CAN move things forward!
(just combine several
ideas from these
slides 🙂)
Wrap up