A deeper talk on the Transformer architecture from the webinar at NTR
https://www.ntr.ai/webinar/transformery
Google slides version: https://docs.google.com/presentation/d/1dIadh_nIszxXG8-672vJmvFGT6jBp0mOqzNV4g3e2Lc/edit?usp=sharing
2. ● Understanding of the Transformer architecture
○ Original paper: https://arxiv.org/abs/1706.03762
○ Great visual explanation: http://jalammar.github.io/illustrated-transformer
○ Lecture #12 from my DL course
https://github.com/che-shr-cat/deep-learning-for-biology-hse-2019-course
● This talk is a follow-up to the talk from the GDG DevParty
○ https://www.youtube.com/watch?v=KZ9NXYcXVBY
Prerequisites
4. Transformer
A new simple network architecture,
the Transformer:
● Is an encoder-decoder architecture
● Based solely on attention mechanisms
(no RNN/CNN)
● The major component in the Transformer is
the multi-head self-attention unit.
● Fast: only matrix multiplications
● Strong results on standard WMT datasets
7. The Transformer adopts scaled dot-product
attention: the output is a weighted sum of the
values, where the weight assigned to each value
is computed from the dot product of the query
with the corresponding key, scaled and passed
through a softmax:
Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k)·V
The input consists of queries and keys of
dimension d_k, and values of dimension d_v.
Scaled dot-product attention
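A minimal NumPy sketch of this formula (illustrative only, not the talk's or the paper's code; the function name and the toy shapes are just for this example):
```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v) -> (n_queries, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys) dot products
    weights = softmax(scores, axis=-1)   # each query's weights over all keys
    return weights @ V                   # weighted sum of the values

# Toy usage: 4 queries/keys of dimension d_k=8, values of dimension d_v=16
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
V = rng.normal(size=(4, 16))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 16)
```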
8. Problems with vanilla transformers
● It’s a pretty heavy model
→ hard to train, tricky training
schedule
● Its attention mechanism has O(N²)
computational complexity
→ scales poorly
● It has limited context span
(mostly due to the complexity),
typically 512 tokens
→ can’t process long sequences.
● May need a different inductive bias
for other types of data (e.g. images,
sound, etc.)
16. Input elements
● Character: Character-Level Language Modeling with Deeper Self-Attention,
https://arxiv.org/abs/1808.04444
● BPE (subword units): used by most Transformers
● Word: still an option, but less flexible with out-of-vocabulary words
● Pixels: Image transformer or iGPT
● MIDI notes: Music transformer
● ...
17. Image GPT (iGPT)
Just GPT-2 trained on images unrolled into long sequences of pixels!
Waiting for GPT-3 (uses sparse attention) trained on images.
https://openai.com/blog/image-gpt/
18. Dimensionality: 1D, 2D, ...
Axial Transformer: for images and other data organized as high-dimensional tensors.
Axial Attention in Multidimensional Transformers
https://arxiv.org/abs/1912.12180
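A rough sketch of the axial idea, assuming a small 2D grid and plain self-attention without learned projections (names and shapes are illustrative, not the paper's implementation): attend along rows, then along columns, so each position pays for H+W neighbours instead of H·W.
```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, K, V):
    # scaled dot-product attention over the last axis of the grid
    d = Q.shape[-1]
    w = softmax(Q @ np.swapaxes(K, -1, -2) / np.sqrt(d), axis=-1)
    return w @ V

def axial_attention(x):
    """x: (H, W, d). Attend along rows, then along columns.
    Cost is O(H*W*(H+W)) instead of O((H*W)^2) for full attention."""
    x = attend(x, x, x)                 # row attention: each row of W positions
    x_t = np.swapaxes(x, 0, 1)          # (W, H, d)
    x_t = attend(x_t, x_t, x_t)         # column attention
    return np.swapaxes(x_t, 0, 1)

x = np.random.default_rng(1).normal(size=(8, 8, 16))
print(axial_attention(x).shape)  # (8, 8, 16)
```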
19. Positional Encoding
1. Sinusoidal Position Encoding (Vaswani et al, 2017,
https://arxiv.org/abs/1706.03762)
Uses sine/cosine waves as in the original paper.
2. Learned Position Encoding (Gehring et al, 2017,
https://arxiv.org/abs/1705.03122)
Embed the absolute position of input elements. Can’t extrapolate to lengths
it has never seen during training.
3. Relative Position Representations (Shaw et al, 2018,
https://arxiv.org/abs/1803.02155)
Model the input as a labeled, directed, fully-connected graph. Learn edge
representations.
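A minimal sketch of the sinusoidal encoding from option 1 (illustrative NumPy, not any library's API):
```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
       PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))"""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angle = pos / np.power(10000.0, 2 * i / d_model)   # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = sinusoidal_positional_encoding(max_len=512, d_model=64)
print(pe.shape)  # (512, 64): added element-wise to the token embeddings
```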
20. Transformer with added recurrence: it can see the previous segment's
representations, so it can process longer sequences.
Recurrence: Transformer-XL
https://arxiv.org/abs/1901.02860
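A toy sketch of the segment-level recurrence idea, assuming a single layer and ignoring causal masking and Transformer-XL's relative positional encoding (function names are made up for this example):
```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def segment_attention(h_current, memory):
    """h_current: (L, d) hidden states of the current segment,
    memory: (M, d) cached hidden states of the previous segment (kept fixed).
    Queries come from the current segment only; keys/values span memory + current."""
    kv = np.concatenate([memory, h_current], axis=0)      # (M + L, d)
    d = h_current.shape[-1]
    w = softmax(h_current @ kv.T / np.sqrt(d), axis=-1)   # (L, M + L)
    return w @ kv

rng = np.random.default_rng(2)
memory = rng.normal(size=(128, 32))    # previous segment representations (no gradients)
segment = rng.normal(size=(128, 32))   # current segment
print(segment_attention(segment, memory).shape)  # (128, 32)
```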
21. The Compressive Transformer keeps a fine-grained memory of past activations,
which are then compressed into coarser compressed memories.
Recurrence & Mem: Compressive Transformer
Compressive Transformers for Long-Range Sequence Modelling
https://arxiv.org/abs/1911.05507
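A tiny sketch of one possible compression function, mean pooling over the oldest memories (the paper also considers convolutions and learned compression; this is illustrative, not the authors' code):
```python
import numpy as np

def compress(old_memories, rate=3):
    """Compress the oldest fine-grained memories by average pooling
    with compression rate `rate`."""
    M, d = old_memories.shape
    M_trim = (M // rate) * rate
    return old_memories[:M_trim].reshape(M_trim // rate, rate, d).mean(axis=1)

old = np.random.default_rng(3).normal(size=(96, 32))
print(compress(old).shape)  # (32, 32): 96 activations -> 32 compressed memories
```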
25. Attention mechanism: Image Transformer
Local self-attention: in every self-attention layer, each position in a query block
attends to all positions in the memory block.
Image Transformer, https://arxiv.org/abs/1802.05751
26. Sparse factorizations of the attention matrix reduce the complexity to O(N·√N).
Can generate sounds and images.
Attention mechanism: Sparse Transformer
Generating Long Sequences with Sparse Transformers
https://arxiv.org/abs/1904.10509
https://openai.com/blog/sparse-transformer/
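An illustrative mask in the spirit of the strided sparse pattern, assuming one combined local + strided pattern per row (the paper splits these patterns across heads; names are hypothetical):
```python
import numpy as np

def strided_sparse_mask(n, stride):
    """Boolean (n, n) causal mask: position i attends to the previous `stride`
    positions (local) and to every stride-th earlier position (strided).
    Each row has O(sqrt(n)) ones when stride ~ sqrt(n)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i
    local = (i - j) < stride
    strided = ((i - j) % stride) == 0
    return causal & (local | strided)

mask = strided_sparse_mask(n=16, stride=4)
print(mask.sum(axis=1))  # number of attended positions per query stays small
```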
28. Reformer is an optimized Transformer:
● Uses less memory (reversible
layers avoid storing activations;
feed-forward computations are chunked)
● Computes attention using LSH
(locality-sensitive hashing)
○ O(L²) → O(L·log L)
○ Approximates the softmax via LSH (the softmax
is dominated by the largest elements, so
for each query qi we only need to focus on the keys in K that are closest to qi)
● => can process longer sequences!
64K-token sequences on one GPU!
Reformer
Reformer: The Efficient Transformer
https://arxiv.org/abs/2001.04451
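A rough sketch of the angular LSH hashing step, assuming shared query/key vectors as in the paper (the sorting and chunked attention over buckets are omitted; names are illustrative):
```python
import numpy as np

def lsh_buckets(x, n_buckets, seed=0):
    """Angular LSH: project vectors with a random matrix and take the argmax
    over [xR; -xR] as the bucket id, so vectors that are close under cosine
    similarity tend to land in the same bucket. x: (n, d) -> bucket ids (n,)."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(x.shape[-1], n_buckets // 2))
    projected = x @ R                                    # (n, n_buckets/2)
    return np.argmax(np.concatenate([projected, -projected], axis=-1), axis=-1)

qk = np.random.default_rng(4).normal(size=(1024, 64))    # shared query/key vectors
buckets = lsh_buckets(qk, n_buckets=32)
# Attention is then computed only among positions in the same bucket,
# reducing the cost from O(L^2) towards O(L*log L).
print(np.bincount(buckets, minlength=32))
```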
32. Use local sliding-window attention + global attention on pre-selected positions.
Longformer
Longformer: The Long-Document Transformer
https://arxiv.org/abs/2004.05150
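A minimal sketch of such a mask, assuming a symmetric sliding window and one global position (illustrative only, not the Longformer implementation):
```python
import numpy as np

def longformer_mask(n, window, global_positions):
    """Boolean (n, n) mask: sliding-window attention of width `window`
    around each position, plus full attention to and from a few
    pre-selected global positions (e.g. a [CLS]-like token)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    mask = np.abs(i - j) <= window // 2   # local sliding window
    for g in global_positions:
        mask[g, :] = True                 # global position attends everywhere
        mask[:, g] = True                 # everyone attends to the global position
    return mask

mask = longformer_mask(n=16, window=4, global_positions=[0])
print(mask.sum())  # far fewer entries than the 16*16 of full attention
```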
34. ● Another local + global attention.
● Can incorporate structured data into the model!
Extended Transformer Construction (ETC)
ETC: Encoding Long and Structured Data in Transformers
https://arxiv.org/abs/2004.08483
35. Idea:
● Apply ACT to Transformers
● Apply a variable number of repetitions when computing each position's
representation: the Universal Transformer (UT)
● Use dynamic attention span: Adaptive Attention Span in Transformers
Adaptive Computation Time in Transformers
Adaptive Computation Time (ACT) in Neural Networks [3/3]
https://medium.com/@moocaholic/adaptive-computation-time-act-in-neural-networks-3-3-99452b2eff18
36. ● Two flavors of UT in the paper:
○ UT with a fixed number of repetitions.
○ UT with dynamic halting.
● The UT repeatedly refines a series of vector representations for each position
of the sequence in parallel, by combining information from different positions
using self-attention and applying a recurrent transition function across all time
steps.
○ Fixed UT: the number of time steps, T, is arbitrary but fixed (no ACT here, a fixed
number of repetitions).
○ Adaptive UT: the number of time steps, T, is dynamic (a dynamic ACT halting
mechanism is applied to each position in the input sequence).
Universal Transformer (UT): Implementation
“Universal Transformers”,
https://arxiv.org/abs/1807.03819
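A toy sketch of the fixed-repetitions flavor, with a single weight-tied block applied T times (the real transition function, timestep/position signals and layer norm are omitted; names are made up):
```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shared_block(h, W):
    """One refinement step: self-attention followed by a (here: linear + tanh)
    transition function, with the SAME parameters W reused at every step."""
    d = h.shape[-1]
    attn = softmax(h @ h.T / np.sqrt(d), axis=-1) @ h
    return np.tanh(attn @ W)

def universal_transformer_fixed(h, W, T=6):
    """Fixed number of repetitions T: in spirit, a T-layer Transformer
    with parameters tied across its layers."""
    for _ in range(T):
        h = shared_block(h, W)
    return h

rng = np.random.default_rng(5)
h = rng.normal(size=(10, 32))          # 10 positions, d=32
W = rng.normal(size=(32, 32)) * 0.1
print(universal_transformer_fixed(h, W).shape)  # (10, 32)
```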
37. UT with a fixed number of repetitions
“Moving Beyond Translation with the Universal Transformer”,
https://ai.googleblog.com/2018/08/moving-beyond-translation-with.html
38. Adaptive UT with dynamic halting
“Universal Transformers”,
https://mostafadehghani.com/2019/05/05/universal-transformers/
39. ● Universal Transformer is a recurrent function (not in time, but in depth) that
evolves per-symbol hidden states in parallel, based at each step on the
sequence of previous hidden states.
○ In that sense, UT is similar to architectures such as the Neural GPU
and the Neural Turing Machine.
● When running for a fixed number of steps, the Universal Transformer is
equivalent to a multi-layer Transformer with tied parameters across its layers.
● Adaptive UT: as the recurrent transition function can be applied any number
of times, this implies that adaptive UTs can have variable depth (number of
per-symbol processing steps).
● Universal Transformer can be shown to be Turing-complete (or
“computationally universal”)
Universal Transformer (UT): Notes
“Universal Transformers”,
https://arxiv.org/abs/1807.03819
40. Related idea: cross-layer parameter sharing (ALBERT)
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
https://arxiv.org/abs/1909.11942
41. ● The problem with the vanilla transformer is its fixed context size (or attention
span).
● It cannot be very large because of the computation cost of the attention
mechanism (it requires O(n²) computations).
● Let the layer (or even the attention head) decide the required context size on
its own.
● There are two options:
○ Learnable (the adaptive attention span): let each attention head learn its
own attention span independently from the other heads. It is learnable,
but still fixed after training is done.
○ ACT-like (the dynamic attention span): changes the span dynamically
depending on the current input.
Adaptive Attention Span: Idea & Implementation
“Adaptive Attention Span in Transformers”,
https://arxiv.org/abs/1905.07799
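A minimal sketch of the soft masking function from the adaptive-span paper, m_z(x) = clamp((R + z - x)/R, 0, 1), where z is a learnable span per head and R is a ramp-width hyperparameter (illustrative NumPy, not the authors' code):
```python
import numpy as np

def span_mask(distances, z, R=32):
    """Soft masking over query-key distance x: full weight for x < z,
    a linear ramp of width R, and zero beyond ~z + R. Attention weights are
    multiplied by this mask (and re-normalized), so each head only pays
    for the span it actually uses."""
    return np.clip((R + z - distances) / R, 0.0, 1.0)

distances = np.arange(0, 256)   # distance of each key from the query
print(span_mask(distances, z=64)[:5], span_mask(distances, z=64)[200:205])
# near keys keep full weight; keys past the learned span are masked out
```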
42. The models are smaller, the performance is better.
Adaptive Attention Span: Performance
“Adaptive Attention Span in Transformers”,
https://arxiv.org/abs/1905.07799
43. Adaptive spans (in log scale) of every attention head in a 12-layer model with
span limit S = 4096. Only a few attention heads require long attention spans.
Adaptive spans are learned larger when needed
“Adaptive Attention Span in Transformers”,
https://arxiv.org/abs/1905.07799
44. Example of average dynamic attention span as a function of the input sequence.
The span is averaged over the layers and heads.
Dynamic spans adapt to the input sequence
“Adaptive Attention Span in Transformers”,
https://arxiv.org/abs/1905.07799
47. ● Transformers are cool and produce great results!
● There are many modifications; it's a kind of LEGO set, you can combine the pieces.
● More good source code and libraries are available (Hugging Face, Colab
notebooks, etc.)
● Definitely more transformers to come!
● GET INVOLVED!
You CAN move things forward!
(just combine several
ideas from these
slides 🙂)
Wrap up