Transformer Zoo
(a deeper dive)
Grigory Sapunov
NTR Seminars
07.07.2020
gs@inten.to
● Transformer architecture understanding
○ Original paper: https://arxiv.org/abs/1706.03762
○ Great visual explanation: http://jalammar.github.io/illustrated-transformer
○ Lecture #12 from my DL course
https://github.com/che-shr-cat/deep-learning-for-biology-hse-2019-course
● This talk is a follow-up talk for the one from the GDG DevParty
○ https://www.youtube.com/watch?v=KZ9NXYcXVBY
Prerequisites
Recap: Transformer Architecture
Transformer
A new simple network architecture,
the Transformer:
● Is an encoder-decoder architecture
● Based solely on attention mechanisms
(no RNN/CNN)
● Its major building block is the
multi-head self-attention
mechanism
● Fast: only matrix multiplications
● Strong results on standard WMT datasets
Multi-head self-attention mechanism
Essentially, Multi-Head Attention is just
several attention layers (“heads”) run in parallel
on different linear projections of the same
input, with their outputs concatenated.
The transformer adopts scaled dot-product
attention: the output is a weighted sum of the
values, where the weight assigned to each value
is determined by the dot product of the query
with all the keys:

Attention(Q, K, V) = softmax(Q·Kᵀ / √dk)·V

The input consists of queries and keys of
dimension dk, and values of dimension dv.
Scaled dot-product attention
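A minimal NumPy sketch of the two slides above (illustrative shapes and random weights only; a real layer learns Wq/Wk/Wv/Wo and adds masking, dropout and a residual path):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, dk), K: (n_k, dk), V: (n_k, dv)
    dk = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(dk)       # dot product of each query with all keys
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of the values: (n_q, dv)

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    # Several attention layers ("heads") on different linear projections of the
    # same input X, concatenated and projected back with Wo.
    heads = [scaled_dot_product_attention(X @ q, X @ k, X @ v)
             for q, k, v in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

n, d, H = 10, 512, 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = ([rng.standard_normal((d, d // H)) for _ in range(H)] for _ in range(3))
Wo = rng.standard_normal((d, d))
out = multi_head_attention(rng.standard_normal((n, d)), Wq, Wk, Wv, Wo)   # (10, 512)
```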
Problems with vanilla transformers
● It’s a pretty heavy model
→ hard to train, tricky training
schedule
● Its attention mechanism has O(N²)
computational complexity
→ scales poorly (see the estimate below)
● It has limited context span
(mostly due to the complexity),
typically 512 tokens
→ can’t process long sequences.
● May need a different inductive bias
for other types of data (e.g. images,
sound, etc.)
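A back-of-the-envelope illustration of the quadratic blow-up (a single attention matrix in float32; real usage also multiplies by batch size, heads and layers):

```python
# Storing one full N x N attention matrix in float32.
def attn_matrix_bytes(n_tokens):
    return n_tokens * n_tokens * 4      # 4 bytes per float32, single head / layer

print(attn_matrix_bytes(512)   / 2**20, "MiB")   # ~1 MiB  -- fine
print(attn_matrix_bytes(65536) / 2**30, "GiB")   # ~16 GiB -- per head, per layer!
```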
What can be changed?
Many transformers exist
● Image Transformer
● Music Transformer
● Universal Transformer
● Transformer-XL
● Sparse Transformer
● Star-Transformer
● R-Transformer
● Reformer
● Compressive Transformer
● Longformer
● Extended Transformer
Construction (ETC)
● Levenshtein Transformer, Insertion Transformer, Imputer, KERMIT, …
● ...
Axes of variation
● General architecture:
○ encoder/decoder/both;
○ #layers, #attn.heads, hidden dim, attention span, ...
● Input elements: symbols/BPE/words/pixels/…
● Dimensionality: 1D, 2D, ...
● Positional encodings: sinusoidal, learned, relative, …
● Attentional mechanism: original, sparse, LSH, local, global, …
● Recurrence: segments/depth/…
● Memory
● Adaptivity: ACT, adaptive span, …
● Generation order: autoregressive, non-autoregressive
● ...
General architecture
● Encoder: BERT
● Decoder: GPT
● Both: original transformer for NMT, BART
http://jalammar.github.io/illustrated-gpt2/
BART: “classic” seq2seq
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and
Comprehension, https://arxiv.org/abs/1910.13461
BERT encoder
+
GPT decoder
General architecture
● #layers/heads/dhid:
○ BERT-base (12-layer, 768-hidden, 12-heads)
○ BERT-large (24-layer, 1024-hidden, 16-heads)
○ The “GPT-3” (96-layer, 12288-hidden, 96-heads)
https://blog.inten.to/gpt-3-language-models-are-few-shot-learners-a13d1ae8b1f9
https://arxiv.org/abs/2005.14165
GPT-3 is 10 screens higher!!!
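As a rough sanity check on these configurations, the usual back-of-the-envelope estimate (about 12·d² parameters per layer plus the embedding matrix; biases, LayerNorms and exact vocabulary sizes are ignored, so numbers are approximate) lands close to the published model sizes:

```python
# Each block has ~4*d^2 attention weights and ~8*d^2 feed-forward weights
# (4x expansion), i.e. ~12*d^2 per layer, plus the token-embedding matrix.
def approx_params(n_layers, d_model, vocab=50000):
    return 12 * n_layers * d_model**2 + vocab * d_model

for name, n_layers, d in [("BERT-base", 12, 768),
                          ("BERT-large", 24, 1024),
                          ("GPT-3", 96, 12288)]:
    print(f"{name}: ~{approx_params(n_layers, d) / 1e9:.1f}B params")
# -> ~0.1B, ~0.4B and ~175B, in the ballpark of the published 110M / 340M / 175B.
```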
Input elements
● Character: Character-Level Language Modeling with Deeper Self-Attention,
https://arxiv.org/abs/1808.04444
● BPE (subword units): most of the transformers
● Word: still an option, but less flexible when handling out-of-vocabulary words
● Pixels: Image transformer or iGPT
● MIDI notes: Music transformer
● ...
Image GPT (iGPT)
Just GPT-2 trained on images unrolled into long sequences of pixels!
Waiting for GPT-3 (uses sparse attention) trained on images.
https://openai.com/blog/image-gpt/
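A toy sketch of the unrolling step (random image, hypothetical sizes):

```python
import numpy as np

img = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)   # toy RGB image
pixels = img.reshape(-1, 3)   # raster order: 32*32 = 1024 (R, G, B) triples
# iGPT additionally maps each RGB triple to an entry of a small colour palette
# (obtained by clustering), so the model sees a plain 1-D token sequence of
# length H*W, just like text, and is trained on it autoregressively.
print(pixels.shape)           # (1024, 3)
```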
Dimensionality: 1D, 2D, ...
Axial Transformer: for images and other data organized as high-dimensional tensors.
Axial Attention in Multidimensional Transformers
https://arxiv.org/abs/1912.12180
Positional Encoding
1. Sinusoidal Position Encoding (Vaswani et al, 2017,
https://arxiv.org/abs/1706.03762)
Uses sine/cosine waves as in the original paper.
2. Learned Position Encoding (Gehring et al, 2017,
https://arxiv.org/abs/1705.03122)
Embed the absolute position of input elements. Can’t extrapolate to lengths
it has never seen during training.
3. Relative Position Representations (Shaw et al, 2018,
https://arxiv.org/abs/1803.02155)
Model the input as a labeled, directed, fully-connected graph. Learn edge
representations.
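A sketch of option 1, the sinusoidal encoding from the original paper (d_model assumed even):

```python
import numpy as np

def sinusoidal_positions(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model / 2)
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(512, 768)   # added elementwise to the token embeddings
```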
Transformer with added recurrence: it can see the previous segment
representations, so it can process longer sequences.
Recurrence: Transformer-XL
https://arxiv.org/abs/1901.02860
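A simplified single-head sketch of the recurrence idea (the real Transformer-XL also uses relative positional encodings and stops gradients into the cache; this only shows reusing the cached segment):

```python
import numpy as np
from scipy.special import softmax

def segment_attention(h_curr, h_prev, Wq, Wk, Wv):
    # Queries come from the current segment only, but keys/values also cover the
    # cached hidden states of the previous segment (treated as constants).
    context = np.concatenate([h_prev, h_curr], axis=0)   # (L_prev + L_curr, d)
    Q, K, V = h_curr @ Wq, context @ Wk, context @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V                  # (L_curr, d_head)

d, dh = 512, 64
rng = np.random.default_rng(0)
h_prev, h_curr = rng.standard_normal((128, d)), rng.standard_normal((128, d))
out = segment_attention(h_curr, h_prev,
                        *(rng.standard_normal((d, dh)) for _ in range(3)))
```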
The Compressive Transformer keeps a fine-grained memory of past activations,
which are then compressed into coarser compressed memories.
Recurrence & Mem: Compressive Transformer
Compressive Transformers for Long-Range Sequence Modelling
https://arxiv.org/abs/1911.05507
Compressive Transformers for Long-Range Sequence Modelling
https://arxiv.org/abs/1911.05507
Simple baselines show the memory can help
Memory Transformer
Memory Transformer
https://arxiv.org/abs/2006.11527
Memory Transformer
https://arxiv.org/abs/2006.11527
Attention mechanism: Image Transformer
Local self-attention: in every self-attention layer, each position in a query block
attends to all positions in the memory block.
Image Transformer, https://arxiv.org/abs/1802.05751
Sparse factorizations of the attention matrix reduce complexity to O(N·√N).
Can generate sound and images.
Attention mechanism: Sparse Transformer
Generating Long Sequences with Sparse Transformers
https://arxiv.org/abs/1904.10509
https://openai.com/blog/sparse-transformer/
Generating Long Sequences with Sparse Transformers
https://arxiv.org/abs/1904.10509
https://openai.com/blog/sparse-transformer/
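An illustrative take on the strided factorized pattern (not the exact implementation, which splits the pattern across heads and uses custom sparse kernels):

```python
import numpy as np

def strided_sparse_masks(n, stride):
    # One component attends to the previous `stride` positions, the other to
    # every `stride`-th earlier position, so each row has O(sqrt(n)) non-zeros
    # when stride ~ sqrt(n).
    local = np.zeros((n, n), dtype=bool)
    strided = np.zeros((n, n), dtype=bool)
    for i in range(n):
        local[i, max(0, i - stride + 1):i + 1] = True
        js = np.arange(0, i + 1)
        strided[i, js[js % stride == stride - 1]] = True
    return local, strided

local, strided = strided_sparse_masks(n=1024, stride=32)   # 32 = sqrt(1024)
print(local.mean(), strided.mean())   # each row is only ~stride/n dense
```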
Reformer is an optimized transformer:
● Uses less memory (reversible
layers do not store activations;
feed-forward computations are chunked)
● Computes attention using LSH
(locality-sensitive hashing)
○ O(L²) → O(L·log L)
○ Approximates softmax via LSH (softmax
is dominated by the largest elements,
so for each query qi we only need the keys in K that are closest to qi)
● ⇒ can process longer sequences!
64K Sequences on One GPU!
Reformer
Reformer: The Efficient Transformer
https://arxiv.org/abs/2001.04451
Attention matrices
Reformer
Reformer: The Efficient Transformer
https://arxiv.org/abs/2001.04451
https://twitter.com/huggingface/status/1263850138595987457
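A rough sketch of the hashing step only (the real model ties queries and keys, hashes several times, then sorts and chunks positions by bucket before attending):

```python
import numpy as np

def lsh_buckets(x, n_buckets, seed=0):
    # Angular LSH in the spirit of Reformer: project onto random directions and
    # take the argmax over [xR ; -xR]; nearby vectors tend to share a bucket.
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((x.shape[-1], n_buckets // 2))
    proj = x @ R
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

qk = np.random.randn(1024, 64)          # shared query/key vectors
buckets = lsh_buckets(qk, n_buckets=32)
# Full attention is then replaced by attention within (sorted, chunked) buckets,
# bringing the cost from O(L^2) down to roughly O(L log L).
print(np.bincount(buckets, minlength=32))
```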
ETC: Encoding Long and Structured Data in Transformers
https://arxiv.org/abs/2004.08483
Use local sliding window attention + add global attention for pre-selected positions.
Longformer
Longformer: The Long-Document Transformer
https://arxiv.org/abs/2004.05150
Scales linearly!
Longformer
Longformer: The Long-Document Transformer
https://arxiv.org/abs/2004.05150
● Another local + global attention.
● Can incorporate structured data into the model!
Extended Transformer Construction (ETC)
ETC: Encoding Long and Structured Data in Transformers
https://arxiv.org/abs/2004.08483
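A sketch of the shared local + global pattern as a boolean mask (illustrative window size and global positions; real implementations never materialize the dense n×n mask):

```python
import numpy as np

def local_global_mask(seq_len, window, global_positions):
    # True where attention is allowed: each token sees +/- `window` neighbours,
    # and a few pre-selected "global" positions see, and are seen by, everyone.
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        mask[i, max(0, i - window):min(seq_len, i + window + 1)] = True
    for g in global_positions:
        mask[g, :] = True
        mask[:, g] = True
    return mask

m = local_global_mask(4096, window=256, global_positions=[0])  # e.g. a [CLS]-like token
print(m.mean())   # non-zeros grow as O(n * window), not O(n^2)
```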
Idea:
● Apply ACT to Transformers
● Apply a variable number of repetitions for calculating each position: a
Universal Transformer (UT)
● Use dynamic attention span: Adaptive Attention Span in Transformers
Adaptive Computation Time in Transformers
Adaptive Computation Time (ACT) in Neural Networks [3/3]
https://medium.com/@moocaholic/adaptive-computation-time-act-in-neural-networks-3-3-99452b2eff18
● Two flavors of UT in the paper:
○ UT with a fixed number of repetitions.
○ UT with dynamic halting.
● The UT repeatedly refines a series of vector representations for each position
of the sequence in parallel, by combining information from different positions
using self-attention and applying a recurrent transition function across all time
steps.
○ The number of time steps, T, is arbitrary but fixed (no ACT here, fixed
number of repetitions).
○ The number of time steps, T, is dynamic (a dynamic ACT halting
mechanism is applied to each position in the input sequence).
Universal Transformer (UT): Implementation
“Universal Transformers”,
https://arxiv.org/abs/1807.03819
UT with a fixed number of repetitions
“Moving Beyond Translation with the Universal Transformer”,
https://ai.googleblog.com/2018/08/moving-beyond-translation-with.html
Adaptive UT with dynamic halting
“Universal Transformers”,
https://mostafadehghani.com/2019/05/05/universal-transformers/
● Universal Transformer is a recurrent function (not in time, but in depth) that
evolves per-symbol hidden states in parallel, based at each step on the
sequence of previous hidden states.
○ In that sense, UT is similar to architectures such as the Neural GPU
and the Neural Turing Machine.
● When running for a fixed number of steps, the Universal Transformer is
equivalent to a multi-layer Transformer with tied parameters across its layers.
● Adaptive UT: as the recurrent transition function can be applied any number
of times, this implies that adaptive UTs can have variable depth (number of
per-symbol processing steps).
● Universal Transformer can be shown to be Turing-complete (or
“computationally universal”)
Universal Transformer (UT): Notes
“Universal Transformers”,
https://arxiv.org/abs/1807.03819
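A minimal PyTorch sketch of the fixed-repetition case, i.e. one weight-tied layer applied T times (the real UT also injects a per-step timestep signal, and the adaptive variant wraps this loop in per-position ACT halting):

```python
import torch
import torch.nn as nn

# A UT with a fixed number of repetitions is just one transformer layer whose
# parameters are reused T times ("recurrence in depth").
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)

def universal_encode(x, T=6):
    # x: (seq_len, batch, d_model); the same weights are applied at every step.
    for _ in range(T):
        x = layer(x)
    return x

y = universal_encode(torch.randn(10, 2, 512))   # (10, 2, 512)
```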
Related idea: cross-layer parameter sharing (ALBERT)
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
https://arxiv.org/abs/1909.11942
● The problem with the vanilla transformer is its fixed context size (or attention
span).
● It cannot be very large because of the computation cost of the attention
mechanism (it requires O(n²) computations).
● Let the layer (or even the attention head) decide the required context size on
its own.
● There are two options:
○ Learnable (the adaptive attention span): let each attention head learn its
own attention span independently of the other heads. The span is learnable,
but fixed once training is done.
○ ACT-like (the dynamic attention span): changes the span dynamically
depending on the current input.
Adaptive Attention Span: Idea & Implementation
“Adaptive Attention Span in Transformers”,
https://arxiv.org/abs/1905.07799
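A sketch of the learnable soft mask, written from the paper's description (z is the learned per-head span, R the ramp length; exact details may differ):

```python
import numpy as np

def span_mask(distance, z, R=32):
    # Attention is kept in full within the learned span z, then ramps linearly
    # down to zero over R extra positions.
    return np.clip((R + z - distance) / R, 0.0, 1.0)

dist = np.arange(512)            # how far back each key is from the current query
m = span_mask(dist, z=100.0)
print(m[95:100], m[140:145])     # 1.0 inside the span, 0.0 well outside it
# The mask multiplies the attention weights (which are then re-normalized), and
# a penalty on z pushes each head towards the shortest span it actually needs.
```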
The models are smaller, the performance is better.
Adaptive Attention Span: Performance
“Adaptive Attention Span in Transformers”,
https://arxiv.org/abs/1905.07799
Adaptive spans (in log scale) of every attention head in a 12-layer model with
span limit S = 4096. Only a few attention heads require long attention spans.
Adaptive spans are learned larger when needed
“Adaptive Attention Span in Transformers”,
https://arxiv.org/abs/1905.07799
Example of average dynamic attention span as a function of the input sequence.
The span is averaged over the layers and heads.
Dynamic spans adapt to the input sequence
“Adaptive Attention Span in Transformers”,
https://arxiv.org/abs/1905.07799
Non-autoregressive generation
KERMIT: Generative Insertion-Based Modeling for Sequences,
https://arxiv.org/abs/1906.01604
Wrap up
● Transformers are cool and produce great results!
● There are many modifications; it’s a kind of LEGO you can combine.
● More good source code and libraries are available (Hugging Face, Colab
notebooks, etc.)
● Definitely more transformers to come!
● GET INVOLVED!
You CAN move things forward!
(just combine several
ideas from these
slides 🙂)
Wrap up
https://ru.linkedin.com/in/grigorysapunov
gs@inten.to
Thanks!
