A Study of BFLOAT16 for Deep Learning Training
Dhiraj Kalamkar¹, Dheevatsa Mudigere², Naveen Mellempudi∗¹, Dipankar Das¹, Kunal Banerjee¹, Sasikanth Avancha¹, Dharma Teja Vooturi†¹, Nataraj Jammalamadaka‡¹, Jianyu Huang², Hector Yuen², Jiyan Yang², Jongsoo Park², Alexander Heinecke¹, Evangelos Georganas¹, Sudarshan Srinivasan¹, Abhisek Kundu¹, Misha Smelyanskiy², Bharat Kaul¹, and Pradeep Dubey¹

¹Parallel Computing Lab, Intel Labs
²Facebook, 1 Hacker Way, Menlo Park, CA

∗Corresponding author: naveen.k.mellempudi@intel.com
†IIIT Hyderabad; ‡Lab126 Amazon; work done while at Intel

Preprint. Under review. arXiv:1905.12322v3 [cs.LG] 13 Jun 2019
Abstract
This paper presents the first comprehensive empirical study demonstrating the
efficacy of the Brain Floating Point (BFLOAT16) half-precision format for Deep
Learning training across image classification, speech recognition, language model-
ing, generative networks and industrial recommendation systems. BFLOAT16 is
attractive for Deep Learning training for two reasons: the range of values it can represent is the same as that of the IEEE 754 single-precision floating-point format (FP32), and conversion to/from FP32 is simple. Maintaining the same range as FP32 is important to
ensure that no hyper-parameter tuning is required for convergence; e.g., IEEE 754
compliant half-precision floating point (FP16) requires hyper-parameter tuning.
In this paper, we discuss the flow of tensors and various key operations in mixed
precision training, and delve into details of operations, such as the rounding modes
for converting FP32 tensors to BFLOAT16. We have implemented a method to
emulate BFLOAT16 operations in Tensorflow, Caffe2, IntelCaffe, and Neon for
our experiments. Our results show that deep learning training using BFLOAT16
tensors achieves the same state-of-the-art (SOTA) results across domains as FP32
tensors in the same number of iterations and with no changes to hyper-parameters.
1 Introduction
The spectacular success of Deep Learning has come riding on the ready availability of data and a
tremendous growth in compute capability of deep learning systems. In recent years, compute growth
has been driven by specialized architectures for GEMM (General Matrix Multiply) acceleration,
and the shift to low precision compute. Inference has witnessed a proliferation of mixed precision
compute [19, 33, 27, 25] where different operations execute at different precision, all the way from
binary/ternary operands to 16b floating point. Similarly, training has also witnessed its share of mixed
precision methods, where a combination of half- and single-precision compute is used.
There are at least three half-precision formats in the domain of mixed precision training of large neural
networks: FP16 [28], 16-bit Integer based [9], and BFLOAT16 [10, 4, 3]. All these methods have 16-
bit input operands and 32-bit accumulators for all the computations. Of the three formats, only the first
two have publicly available descriptions of training methodology and experimental results on a wide variety of neural networks, although BFLOAT16 was originally conceived for deep learning training.
BFLOAT16 data format was first introduced as part of distributed training frameworks DistBelief [10]
and Tensorflow [4] as a low precision storage format used to reduce communication volumes of
weights and activations shared between compute nodes during distributed parallel training (special thanks to Jeff Dean for pointing this out). It has since become an alternative numeric format specifically targeted towards accelerating deep learning training (mainly within the Google ecosystem [3]), because of its wider dynamic range and smaller
footprint. In this work, we present a detailed mixed precision methodology using BFLOAT16 and
demonstrate coverage by training a variety of workloads from image processing (including GANs),
to speech/language processing, and recommendation systems. For our experiments we employ a
method where FP32 operations emulate the behavior of BFLOAT16 operations by rounding the input operands and zeroing out their lower 16 bits.
We have developed a library called Quantlib to implement the emulation in multiple deep learning
frameworks such as IntelCaffe, Caffe2, Neon and Tensorflow. One of the functions Quantlib provides
is appropriately modifying the elements of an input FP32 tensor to emulate the behavior of BFLOAT16.
Specifically, it zeroes out the lower 16 bits of the FP32 elements and performs RNE (Round to Nearest
Even) rounding based on those bits. This modification ensures that the tensor possesses FP32 precision
(so that FP32-hardware and FP32-libraries can operate on it), and also provides the exact precision
and rounding as would be afforded by BFLOAT16 hardware. Quantlib is called prior to GEMM
operations (or other operations which are planned to be implemented in BFLOAT16) to emulate
the behavior of BFLOAT16 input operands, while FP32 output of the GEMM naturally fits into the
“BFLOAT16-input, FP32-accumulator” schema of this training methodology.
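To make the emulation concrete, here is a minimal NumPy sketch of the FP32-tensor modification described above; the function name and the NumPy setting are ours, since Quantlib itself is an internal library.

```python
import numpy as np

def emulate_bfloat16(x: np.ndarray) -> np.ndarray:
    """Round an FP32 tensor to BFLOAT16 precision while keeping FP32 storage.

    Implements round-to-nearest-even (RNE) on the lower 16 bits and then zeroes
    them, so the result is an FP32 tensor that carries exactly BFLOAT16 precision.
    NaN inputs are not special-cased in this sketch.
    """
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    bias = np.uint32(0x7FFF) + ((bits >> np.uint32(16)) & np.uint32(1))  # RNE bias
    return ((bits + bias) & np.uint32(0xFFFF0000)).view(np.float32)
```

Applying such a routine to the two input operands of an FP32 GEMM, while leaving the FP32 product untouched, reproduces the "BFLOAT16-input, FP32-accumulator" schema described above.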
Using multiple deep learning frameworks modified to insert appropriate Quantlib calls, we provide
SOTA results for: AlexNet [24], ResNet-50 [20], DC-GAN [32], SR-GAN [26], DeepSpeech2
[5], GNMT [40], and two industrial workloads namely: a Deep and Cross Network, and a DNN
Recommendation System. We also qualitatively compare and contrast the training methodologies
using BFLOAT16, FP16 and INT16. We observe that FP16-based training requires tuning an
additional hyper-parameter for loss scaling to achieve SOTA results; INT16-based training requires
fine-grained block quantization and maintaining block-level scaling factors to achieve SOTA results. In comparison, all BFLOAT16 experiments are performed without any hyperparameter changes, and BFLOAT16 kernels are expected to be relatively straightforward to implement.
The rest of the paper is organized as follows. Section 2 provides a survey of the literature and
describes various attempts at half-precision based training. Section 3 discusses the BFLOAT16
format, operations and data flow in detail. Section 4 describes our experimental results in detail.
Section 5 discusses our concluding thoughts.
2 Related Work
The application of low-precision datatypes to deep learning is a well-explored research topic. The literature shows that a variety of reduced-precision data representations have been investigated, which can be broadly classified into two types: the more standard floating-point based formats [28, 15, 12] and custom fixed-point based formats [37, 7, 18, 22, 23].
Custom fixed point representations may offer more flexibility than typical floating point based ones
in terms of both increased precision and dynamic range by maintaining separate integer values
for precision and range. Consequently, fixed point representations may provide more robust and
accurate training of an underlying application. The work reported in [37] leverages the dynamically
scaled fixed point representation proposed in [39] to speed up convolutional neural networks by 4×
over an optimized floating point implementation on general purpose CPU hardware. In [18], the
authors present a comprehensive study on the effect of low precision fixed point computation for
deep learning. They also train smaller networks using 16-bit fixed point on specialized hardware.
Researchers have ventured into less than 16-bit precision as well and almost all of them use custom
fixed point schemes. Reference [7] uses a dynamic fixed-point format with low-precision multiplications using up to 12-bit operations. This idea is further advanced in [8], where the authors showcase
training with only binary weights while keeping all other tensors and operations in full precision.
Another extension [21] uses binary activations as well; however, the gradients and the weights are
maintained in full precision. Another related work [22] uses activations and weights quantized into
6-bits for neural network training with gradients in full precision. The method described in [33]
uses binary representation for all components including gradients. However, all the aforementioned
methods are shown to work only for smaller benchmark models and data-sets and invariably result in a non-trivial drop in accuracy on the larger ImageNet data-set [11] and classification task [35]. The
authors of [23] advocate for a fixed point numerical format called Flexpoint that is specifically
tailored for executing deep neural networks on specialized hardware; this datatype is shown to
outperform FP16 and achieve numerical parity with FP32 across a diverse set of workloads. A more
general dynamic fixed point representation and associated compute primitives are presented in [9],
which leverages general purpose hardware using the integer-compute pipeline to match FP32 baseline
accuracy across state-of-the-art convolutional neural networks.
The disadvantage of these integer-based representations, in contrast to floating point, is the additional overhead of handling shared exponents and managing accumulator overflow. Köster et al. [23] proposed an algorithm that predicts the shared exponent ahead of time to eliminate some of these overheads. However, this solution requires collecting additional statistics at each layer, which
cannot be efficiently computed on general purpose hardware.
A mixed precision training methodology using FP16 and FP32 is reported in [28]. This work employs
FP16 for storing activations, weights and gradients. The computations during forward pass and
back propagation use the FP16 datatype while results are accumulated into FP32. A master copy of the FP32 weights is preserved for the update operation. The authors successfully perform deep
learning training on a wide range of applications encompassing deep networks and larger data-sets
(ILSVRC-class problems) at the expense of minimal loss compared to baseline FP32 results. This
work, however, underlines that FP16/FP32 mixed precision training entails loss scaling [15] to attain
near-SOTA results. Loss scaling ensures that back-propagated gradient values are shifted
into a range which can be represented by FP16 and therefore the small magnitude (negative exponent)
values which are critical for accuracy are preserved.
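For contrast, the sketch below illustrates the static loss-scaling step that FP16 mixed precision typically requires and that BFLOAT16 avoids. It is a minimal illustration, not the procedure of [28]: the scale value of 1024 is an arbitrary choice, and `grad_fn` is a hypothetical callback standing in for the framework's FP16 backward pass.

```python
import numpy as np

def fp16_backward_with_loss_scaling(loss_fp32, grad_fn, params, scale=1024.0):
    """Illustrative static loss scaling for FP16 training (not needed for BFLOAT16).

    grad_fn(scaled_loss, params) is assumed to return gradients computed in FP16.
    The loss is multiplied by `scale` before backpropagation so small-magnitude
    gradients stay representable in FP16; the resulting gradients are divided by
    the same scale in FP32 before the weight update.
    """
    scaled_loss = loss_fp32 * scale
    grads_fp16 = grad_fn(scaled_loss, params)                 # FP16 backprop (hypothetical)
    grads_fp32 = [g.astype(np.float32) / scale for g in grads_fp16]
    if any(not np.all(np.isfinite(g)) for g in grads_fp32):   # overflow: skip this update
        return None                                           # (and typically lower the scale)
    return grads_fp32
```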
The need for loss scaling can be avoided by using the BFLOAT16 datatype. The hardware numerics of BFLOAT16 on Intel architecture are described in [1]. BFLOAT16 has been highlighted as a crucial ingredient for achieving peta-FLOPS scale on the image classification task in [42]. In [3], Google notes the speedups achieved by using this datatype over FP32 for TensorFlow models on various tasks, such as image classification, image segmentation, object detection, and machine translation. It may be noted that the benefits of BFLOAT16 are not restricted to the machine learning paradigm: the recently published work [41] uses this representation for performing Monte Carlo simulations of the Ising model, an important model in statistical physics. In fact, the authors of [41] mention that BFLOAT16 “provides better training and model accuracy than the IEEE half-precision representation”. The Julia language [6], which is designed to provide high performance without sacrificing ease of programming, has also been shown to benefit from BFLOAT16 [13]. OpenAI
has also mentioned that optimizing kernels targeting BFLOAT16 is “in active development” [17].
3 Training with Brain Floating Point
Numerous studies have shown that 16 bits of precision are sufficient for training deep neural networks [18, 28, 23, 9]. Researchers have experimented with various numeric formats to op-
timize training platforms for power and performance. State-of-the-art training platforms today have
chosen IEEE-754 half-precision floating point as the preferred numeric format for deep learning
training. However, the narrow dynamic range of half-precision floating point is not sufficient to
represent error gradients during back propagation. To mitigate this, training methods use loss scaling
techniques [28] to shift gradients into the representable range supported by half-precision floats. While this
is easier to implement for certain feed-forward networks as a simple multiplication of loss with a
constant scaling factor, others (e.g., recurrent networks) require a more sophisticated approach to determine
the right scaling parameter. This process is often iterative and requires significant time and resource
investment from data scientists to optimize it to their network. These software overheads become a
hindrance for seamless migration of new deep learning applications to take advantage of the low-
precision hardware. Recent developments in software tools such as “automatic mixed precision” [30]
are aimed at easing some of this burden from data scientists. However, these tools in their current
form also require changes in the original model code and are not guaranteed to result in sufficient
performance gains.
BFLOAT16 values are represented as truncated full-precision floating-point values with 8 bits of mantissa and a dynamic range comparable to FP32 (Table 1). The extended dynamic range can now represent
smaller gradient values without applying complicated loss scaling methods, which enables easier
migration of deep learning workloads to BFLOAT16 hardware. There are some additional benefits to
adopting BFLOAT16 numeric format to build hardware for deep learning. Core compute primitives
such as FMA can be built using 8-bit multipliers which lead to significant area and power savings
while preserving the full dynamic range of FP32. Table 1 shows the comparison of BFLOAT16 with
other standard IEEE floating point formats.
Table 1: Comparison of the BFLOAT16 numeric format with the IEEE-754 FP32 and FP16 formats.

Data Type   Bit Format (s, e, m)   Max Normal   Min Normal   Min Subnormal   Acc. Size
FP32        1, 8, 23               3.40e38      1.17e−38     1.40e−45        float32
FP16        1, 5, 10               6.55e4       6.10e−5      5.96e−8         float32
BFLOAT16    1, 8, 7                3.38e38      1.17e−38     N/A             float32
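The range entries in Table 1 follow directly from the (s, e, m) bit splits; the short check below recomputes the max-normal and min-normal values for each format (a sanity check we added, not part of the original study).

```python
def fp_range(exp_bits, man_bits):
    """Largest/smallest positive normal values for a (1, exp_bits, man_bits) format."""
    bias = 2 ** (exp_bits - 1) - 1
    max_normal = (2.0 - 2.0 ** -man_bits) * 2.0 ** bias   # max exponent, all-ones mantissa
    min_normal = 2.0 ** (1 - bias)                        # exponent field = 1, zero mantissa
    return max_normal, min_normal

for name, e, m in [("FP32", 8, 23), ("FP16", 5, 10), ("BFLOAT16", 8, 7)]:
    mx, mn = fp_range(e, m)
    print(f"{name:9s} max normal ~ {mx:.2e}   min normal ~ {mn:.2e}")
```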
Figure 1: Mixed precision data flow used for training DNNs with BFLOAT16 data format
Figure 1 shows the mixed precision data flow used to train deep neural networks using BFLOAT16
numeric format. The core compute kernels represented as GEMM operations accept inputs as
BFLOAT16 tensors and accumulate the output to FP32 tensors. Quantlib (shown as Q in Figure 1)
modifies these output tensors to BFLOAT16 format before passing them to the next layer. Quantlib
is also employed to modify a copy of the FP32 weights to BFLOAT16 for the forward pass. Error gradients with respect to the inputs are also computed in BFLOAT16 format. Non-GEMM compute operations
including batch-normalization, and activation functions such as ReLU, tanh and sigmoid also accept
BFLOAT16 tensors as inputs. Bias tensors are always maintained in FP32. The weight update step
(e.g., in SGD solver) uses the FP32 copy of the weights to maintain model accuracy.
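A minimal sketch of one emulated training step following this data flow is shown below; it reuses the illustrative emulate_bfloat16 helper from the earlier sketch, uses a single fully-connected layer as a stand-in for the GEMM-based layers, and is not the authors' implementation.

```python
import numpy as np

def train_step(x_fp32, w_master_fp32, grad_out_fp32, lr=0.01):
    """One emulated mixed-precision step for a single fully-connected layer:
    BFLOAT16 GEMM inputs, FP32 accumulation, FP32 master weights for the update."""
    # Forward pass: activations and a copy of the master weights are rounded
    # to BFLOAT16 precision before the GEMM; the output stays in FP32.
    x_bf16 = emulate_bfloat16(x_fp32)
    w_bf16 = emulate_bfloat16(w_master_fp32)
    y_fp32 = x_bf16 @ w_bf16

    # Backward pass: incoming error gradients are also in BFLOAT16 precision.
    g_bf16 = emulate_bfloat16(grad_out_fp32)
    grad_w_fp32 = x_bf16.T @ g_bf16          # weight gradient, FP32 accumulation
    grad_x_fp32 = g_bf16 @ w_bf16.T          # input gradient, passed to the previous layer

    # Weight update (SGD) uses the FP32 master copy to preserve small updates.
    w_master_fp32 = w_master_fp32 - lr * grad_w_fp32
    return y_fp32, grad_x_fp32, w_master_fp32
```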
4 Results
Our evaluation of BFLOAT16 consists of the aforementioned deep learning models from different
application domains and frameworks, using the tensor modification method via Quantlib discussed in Section 1.
4.1 Convolutional Neural Networks
Convolutional neural networks (CNN) have been primarily used for computer vision applications such
as image classification, object detection and semantic segmentation. CNNs have been extensively
studied both in academia and industry, primarily driven by public benchmarks such as the ImageNet
Large Scale Visual Recognition Competition (ILSVRC). Over the past few years, the CNNs that have won the ILSVRC competition have become well-established benchmarks. Here we choose
AlexNet (ILSVRC 2012) [24] and ResNet-50 (ILSVRC 2015) [20] as representative models for
the BFLOAT16 evaluation. In addition to Convolution and InnerProduct layers (which contribute to the majority of the computations), we use BFLOAT16 emulations for ReLU, BatchNorm, Pooling,
Dropout and EltWise layers as well. This ensures that the full training pipeline uses BFLOAT16, not
necessitating the use of higher precision for the intermediate tensor outputs.
4.1.1 AlexNet
For AlexNet, we used a global minibatch of 1024 running data parallel on 16 nodes for 88 epochs and
achieved 57.4% top-1 and 80.7% top-5 accuracy. As shown in Figure 2, our BFLOAT16 emulation follows the actual FP32 run very closely and achieves 57.2% top-1 and 80.1% top-5 accuracy.
Figure 2: ImageNet-1K training, top-1 and top-5 validation accuracy plots for CNNs. (a) AlexNet; (b) ResNet-50.
4.1.2 ResNet
Our ResNet-50 experiments use a global minibatch of 1024, running data parallel on 32 nodes, using SGD with Nesterov momentum. We trained for 90 epochs with learning rate warm-up for the first 5 epochs. The baseline FP32 run achieved a top-1 accuracy of 74.7% and a top-5 accuracy of 92.0%, and as shown in Figure 2, our BFLOAT16 emulation follows the baseline almost exactly, achieving the same top-1 and top-5 accuracy. During training we use local batch statistics for the batch normalization to compute validation accuracy after every epoch. The fully trained BFLOAT16 model achieves 75.7% top-1 test accuracy with global sample statistics, matching the baseline FP32 results.
4.2 Recurrent Neural Networks
Recurrent neural networks (RNNs), unlike feedforward networks, can capture temporal information due to their feedback connections. These models have been popularly used for applications
such as automatic speech recognition (ASR) and language processing, which primarily involve
sequence-based learning. RNNs have been observed to have more demanding numerical range
requirements [28] and to be more sensitive to the half-precision datatype. For this class of networks we
identify Baidu’s DeepSpeech2 [5] and Google’s neural machine translation (GNMT) model [40] as
representative candidates for the BFLOAT16 evaluation.
4.2.1 DeepSpeech2
The DeepSpeech2 (DS2) topology consists of two convolution layers followed by three bi-directional gated recurrent unit (GRU) layers with 2048 cells and a final inner-product layer as a classifier. We use the Adam optimizer to minimize the connectionist temporal classification (CTC) loss [16]. We use a batch size of 64 and a learning rate of 0.0005. The model is trained on the LibriSpeech dataset [31], which consists of a 460-hour corpus.
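The topology description maps to a compact layer stack. The PyTorch-style sketch below is purely illustrative: the spectrogram feature size (161), character vocabulary (29), and convolution kernel/stride/padding choices are assumptions on our part, and the actual experiments run in the modified frameworks listed in Section 1, not in this code.

```python
import torch
import torch.nn as nn

def conv_out(n, k, s, p):
    """Output size of a convolution along one axis: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

class DS2Sketch(nn.Module):
    """Illustrative DS2-like stack: 2 conv layers, 3 bi-directional GRU layers
    with 2048 cells, and a final inner-product classifier."""
    def __init__(self, n_feats=161, n_classes=29, hidden=2048):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2), padding=(20, 5)),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=(21, 11), stride=(2, 1), padding=(10, 5)),
            nn.ReLU(),
        )
        f_out = conv_out(conv_out(n_feats, 41, 2, 20), 21, 2, 10)
        self.gru = nn.GRU(32 * f_out, hidden, num_layers=3,
                          bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, spec):                             # spec: (batch, 1, n_feats, frames)
        h = self.conv(spec)                              # (batch, 32, f_out, frames')
        b, c, f, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)   # (batch, frames', features)
        h, _ = self.gru(h)                               # bi-directional GRU stack
        return self.classifier(h)                        # per-frame class logits for CTC

model = DS2Sketch()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)  # batch size 64 in the paper
ctc_loss = nn.CTCLoss()                                      # CTC objective [16]
```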
4.2.2 Neural Machine Translation
Google’s Neural Machine Translation (GNMT) is the SOTA neural machine translation model using
a recurrent network. It uses a stack of long short-term memory (LSTM) layers, along with an attention model, for language modeling and translation. Table 2 compares translation accuracy in terms of
achieved BLEU scores for the baseline FP32 run and the BFLOAT16 emulation. We use the small Vietnamese (VI) to English (EN) model and the big German (DE) to English (EN) model. The BFLOAT16 emulation achieves the same or better accuracy than the baseline. Figure 3 shows how closely the BFLOAT16 emulation run follows the baseline FP32 run.

Figure 3: RNN training using the BFLOAT16 data type. (a) DeepSpeech2; (b) GNMT.
Table 2: GNMT BLEU scores for De→En and Vi→En on WMT’16 and IWSLT’15 datasets.
TASK FP32 BFLOAT16
DE→EN, WMT’16 29.3 29.3
VI→EN, IWSLT’15 + ATTENTION 17.1 18.3
4.3 Generative Adversarial Networks (GANs)
Generative Adversarial Networks (GANs) have become a very important class of networks; they can be used to learn and mimic any arbitrary distribution of data. GANs achieve this by using two separate networks, a generator and a discriminator, in a tightly coupled way. Because GANs combine regression and discrimination tasks during training, they tend to have different requirements for numerical precision and range. For the BFLOAT16 evaluation we consider the DC-GAN [32] and SR-GAN [26] models.
4.3.1 DC-GAN
DC-GAN [32] represents a critical step in designing GAN architectures, which were previously known to be notoriously difficult to train. The architecture consists of fractionally-strided convolutions with ReLU activations in the generator, whereas convolutions with leaky ReLU activations are used in the discriminator; batch normalization layers are used in both the generator and the discriminator. We have implemented DC-GAN in Caffe, and for our experiments all the input tensors (activations, weights)
are converted to BFLOAT16 for convolution layers (in both the generator and the discriminator),
while only the input activations are converted to BFLOAT16 for batch normalization layers; all other
tensors are maintained in full precision.
A comparison between FP32 and BFLOAT16 is shown in Table 3 in terms of inception scores and
MS-SSIM. As evident from the table, the outputs obtained for FP32 and BFLOAT16 are comparable.
Table 3: Comparison between FP32 and BFLOAT16 for DC-GAN on face dataset
Datatype Inception Score MS-SSIM
Baseline (FP32) 1.97 ± 0.054 0.262
BFLOAT16 2.06 ± 0.055 0.217
4.3.2 SR-GAN
SR-GAN generates photo-realistic high-resolution images by super-resolving from a single shot of the low-resolution image [26]. The low-resolution images are scaled 4×, preserving the spatial features while minimizing the noise. The quality of the output is measured using the SSIM (structural similarity), MS-SSIM (multi-scale structural similarity) and PSNR (peak signal-to-noise ratio) metrics. The topology consists of a “generator” network based on the ResNet architecture, while the “discriminator” consists of 8 convolution layers, each followed by batchnorm and LeakyReLU. The network uses a “VGG loss” function based on a pre-trained VGG19 network [36].
For the BFLOAT16 experiments, we converted all the input tensors (weights, activations) at convolution layers to BFLOAT16, while keeping the rest of the layers (BatchNorm, ReLU, LeakyReLU and EltWise) at full precision.
Table 4: SR-GAN model trained with BFLOAT16 on DIV2K (http://www.vision.ee.ethz.ch/ntire17/)
dataset.
DATATYPE PSNR SSIM MS-SSIM
BASELINE (FP32) 26.1749 0.73753 0.99999
BFLOAT16 26.1415 0.74079 0.99999
Figure 4: Training loss for SR-GAN training using BFLOAT16. (a) Generator network; (b) Discriminator network.
4.4 Industrial Scale Recommendation System
Recommendation and personalization models are very important for many practical-scale applications. Here we evaluate the Deep & Cross Network [38] on the small Kaggle Criteo dataset⁴ and a typical DNN recommender system [43, 29] on the large Terabyte Criteo dataset⁵, both of which target predicting the ads click-through rate [34]. The accuracy of the recommendation models, which predict whether users will click on ads, is measured by the log loss [38, 43, 29]. A lower log loss translates into higher prediction accuracy for the recommendation model. Note that a degradation of 0.001 in log loss is considered unacceptable in practice.
For our BFLOAT16 experiments, all input tensors (activations and weights) are converted to BFLOAT16 for fully connected layers in both the forward and backward propagation passes. During the weight update stage, we use an FP32 master copy [28] to reduce the additional accuracy loss. We use either the round-to-nearest or the direct truncation scheme for the conversion from FP32 to BFLOAT16.
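The two conversion schemes differ only in how the low 16 bits are dropped. The sketch below, mirroring the bit manipulation shown in Section 1, contrasts direct truncation with round-to-nearest-even on an FP32 array; it is our own illustration of the two schemes compared in Table 5, not the production conversion code.

```python
import numpy as np

def to_bfloat16_truncate(x: np.ndarray) -> np.ndarray:
    """Direct truncation: simply zero the lower 16 bits of each FP32 value."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

def to_bfloat16_rne(x: np.ndarray) -> np.ndarray:
    """Round-to-nearest-even before truncating (NaNs not special-cased here)."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    bias = np.uint32(0x7FFF) + ((bits >> np.uint32(16)) & np.uint32(1))
    return ((bits + bias) & np.uint32(0xFFFF0000)).view(np.float32)

x = np.random.randn(1_000_000).astype(np.float32)
for name, fn in [("truncate", to_bfloat16_truncate), ("rne", to_bfloat16_rne)]:
    err = np.abs(fn(x) - x).mean()
    print(f"{name:8s} mean abs conversion error: {err:.3e}")
```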
The accuracy evaluation results are shown in Table 5. As we can observe, BFLOAT16 with the round-to-nearest scheme achieves almost the same accuracy as the FP32 baseline, while BFLOAT16 with the direct truncation scheme suffers a tiny accuracy degradation (∼0.02%).
⁴ https://www.kaggle.com/c/criteo-display-ad-challenge/data
⁵ https://www.criteo.com/news/press-releases/2015/07/criteo-releases-industrys-largest-ever-dataset/
Table 5: The log loss for Deep & Cross Network [38] on a small Kaggle Criteo dataset and a
DNN recommender system [43] on a TeraByte Criteo dataset trained with BFLOAT16 (with either
round-to-nearest or direct truncation for the conversion from FP32 to BFLOAT16).
RECOMMENDATION SYSTEM BASELINE(FP32) BFLOAT16 (RND) BFLOAT16 (TRUNC)
DEEP & CROSS NETWORK 0.44372 0.44372 0.44393
DNN RECOMMENDER SYSTEM 0.12520 0.12520 0.12537
4.5 Beyond Emulation - Towards Bare Metal Execution
We close this section by highlighting that our presented emulation strategy is an excellent approx-
imation of future Intel Xeon CPUs. We therefore took the slim CNN training framework GxM,
which was presented in [14], and implemented all important operators (convolution, fully-connected,
batch-normalization and pooling layers) utilizing the AVX512BF16 instruction set extensions [2].
As a result, all activation and weight data were present only as 16-bit data in memory, and the special VNNI (vector neural network instructions) data layout was employed to support the BFLOAT16 dot-product instruction with FP32 accumulation. The execution of these instructions was done by bit-accurate emulation on current AVX512 silicon with only a very slight performance tax. When training ResNet-50 on ImageNet using the AVX512BF16 instructions, we achieved a top-1 accuracy of 75.62%, which matches the current state-of-the-art performance. Additionally, the code is already heavily optimized and ready for prime time. We have also implemented a full BFLOAT16 LSTM cell, which is currently being integrated into TensorFlow.
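To illustrate the instruction semantics, the NumPy sketch below behaviorally emulates what a BFLOAT16 dot-product step with FP32 accumulation computes: raw BFLOAT16 inputs are expanded to FP32, multiplied pairwise, and accumulated into an FP32 accumulator. This is a behavioral model only and does not reflect the intrinsics-level AVX512BF16 code used in GxM.

```python
import numpy as np

def bf16_to_fp32(bits16: np.ndarray) -> np.ndarray:
    """Expand raw BFLOAT16 bit patterns (uint16) to FP32 by left-shifting 16 bits."""
    return (bits16.astype(np.uint32) << np.uint32(16)).view(np.float32)

def bf16_dot_fp32_accumulate(a_bits: np.ndarray, b_bits: np.ndarray,
                             acc: np.float32 = np.float32(0.0)) -> np.float32:
    """Behavioral model of a BFLOAT16 dot product with FP32 accumulation."""
    a = bf16_to_fp32(a_bits)
    b = bf16_to_fp32(b_bits)
    for ai, bi in zip(a, b):            # accumulate product by product in FP32
        acc = np.float32(acc + np.float32(ai * bi))
    return acc

# Example: build BFLOAT16 bit patterns by truncating random FP32 values.
a = (np.random.randn(8).astype(np.float32).view(np.uint32) >> np.uint32(16)).astype(np.uint16)
b = (np.random.randn(8).astype(np.float32).view(np.uint32) >> np.uint32(16)).astype(np.uint16)
print(bf16_dot_fp32_accumulate(a, b))
```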
5 Conclusion
Our goal in this paper was to establish BFLOAT16 as an alternative half-precision format for Deep
Learning training, given that its dynamic range is the same as that of FP32 and conversion to/from
FP32 is straightforward. Our empirical study demonstrates that, unlike IEEE 754 half-precision
format and 16-bit Integer, BFLOAT16 based training eliminates the need for hyperparameter tuning or
complex software management for block quantization. Our study also demonstrates that BFLOAT16
is a robust datatype capable of covering the range of tensors across application domains
including vision, speech, language, generative networks, and recommendation systems. We expect
industry-wide adoption of BFLOAT16 across emerging domains.
References
[1] BFLOAT16 - hardware numerics definition. https://software.intel.com/sites/default/files/managed/40/8b/bf16-hardware-numerics-definition-white-paper.pdf. Accessed: 2019-03-22.
[2] Intel architecture instruction set extensions and future features programming reference. https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf. Accessed: 2019-05-21.
[3] Using bfloat16 with tensorflow models. https://cloud.google.com/tpu/docs/bfloat16. Accessed: 2019-04-16.
[4] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro,
Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale
machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467,
2016.
[5] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg,
Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech 2:
End-to-end speech recognition in english and mandarin. In ICML, pages 173–182, 2016.
[6] Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B. Shah. Julia: A fresh approach to
numerical computing. SIAM Review, 59(1):65–98, 2017.
[7] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Training deep neural networks
with low precision multiplications. arXiv preprint arXiv:1412.7024, 2014.
[8] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep
neural networks with binary weights during propagations. In NIPS, pages 3123–3131, 2015.
[9] Dipankar Das, Naveen Mellempudi, Dheevatsa Mudigere, Dhiraj D. Kalamkar, Sasikanth
Avancha, Kunal Banerjee, Srinivas Sridharan, Karthik Vaidyanathan, Bharat Kaul, Evange-
los Georganas, Alexander Heinecke, Pradeep Dubey, Jesús Corbal, Nikita Shustrov, Roman
Dubtsov, Evarist Fomenko, and Vadim O. Pirogov. Mixed precision training of convolutional
neural networks using integer operations. In ICLR, 2018.
[10] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew
Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In
Advances in neural information processing systems, pages 1223–1231, 2012.
[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale
hierarchical image database. In CVPR, pages 248–255, 2009.
[12] Tim Dettmers. 8-bit approximations for parallelism in deep learning. arXiv preprint
arXiv:1511.04561, 2015.
[13] Keno Fischer and Elliot Saba. Automatic full compilation of julia programs and ML models to
cloud TPUs. CoRR, abs/1810.09868, 2018.
[14] Evangelos Georganas, Sasikanth Avancha, Kunal Banerjee, Dhiraj Kalamkar, Greg Henry, Hans
Pabst, and Alexander Heinecke. Anatomy of high-performance deep learning convolutions
on simd architectures. In Proceedings of the International Conference for High Performance
Computing, Networking, Storage, and Analysis, SC ’18, pages 66:1–66:12, Piscataway, NJ,
USA, 2018. IEEE Press.
[15] Boris Ginsburg, Sergei Nikolaev, and Paulius Micikevicius. Training of deep networks with
half-precision float. NVidia GPU Technology Conference, 2017.
[16] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist
temporal classification: labelling unsegmented sequence data with recurrent neural networks. In
Proceedings of the 23rd international conference on Machine learning, pages 369–376. ACM,
2006.
[17] Scott Gray, Alec Radford, and Diederik P. Kingma. GPU kernels for block-sparse weights. https://s3-us-west-2.amazonaws.com/openai-assets/blocksparse/blocksparsepaper.pdf. Accessed: 2019-04-16.
[18] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning
with limited numerical precision. In ICML, pages 1737–1746, 2015.
[19] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural
network with pruning, trained quantization and huffman coding. In ICLR, 2016.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In CVPR, pages 770–778, 2016.
[21] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized
neural networks. In NIPS, pages 4107–4115, 2016.
[22] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quan-
tized neural networks: Training neural networks with low precision weights and activations.
arXiv preprint arXiv:1609.07061, 2016.
[23] Urs Köster, Tristan Webb, Xin Wang, Marcel Nassar, Arjun K. Bansal, William Constable,
Oguz Elibol, Stewart Hall, Luke Hornof, Amir Khosrowshahi, Carey Kloss, Ruby J. Pai, and
Naveen Rao. Flexpoint: An adaptive numerical format for efficient training of deep neural
networks. CoRR, abs/1711.02213, 2017.
[24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep
convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[25] Abhisek Kundu, Kunal Banerjee, Naveen Mellempudi, Dheevatsa Mudigere, Dipankar Das,
Bharat Kaul, and Pradeep Dubey. Ternary residual networks. CoRR, abs/1707.04679, 2017.
[26] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro
Acosta, Andrew P Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic
single image super-resolution using a generative adversarial network. In CVPR, volume 2,
page 4, 2017.
[27] Naveen Mellempudi, Abhisek Kundu, Dheevatsa Mudigere, Dipankar Das, Bharat Kaul, and
Pradeep Dubey. Ternary neural networks with fine-grained quantization. arXiv preprint
arXiv:1705.01462, 2017.
[28] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia,
Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision
training. arXiv preprint arXiv:1710.03740, 2017.
[29] Maxim Naumov, Dheevatsa Mudigere, et al. Deep learning recommendation model for person-
alization and recommendation systems. arXiv preprint arXiv:1906.00091, 2019.
[30] NVIDIA. Automatic mixed precision for deep learning. https://developer.nvidia.com/automatic-mixed-precision. Accessed: 2019-05-22.
[31] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an
asr corpus based on public domain audio books. In 2015 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE, 2015.
[32] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with
deep convolutional generative adversarial networks. In ICLR, 2016.
[33] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet
classification using binary convolutional neural networks. In ECCV, 2016.
[34] Matthew Richardson, Ewa Dominowska, and Robert Ragno. Predicting clicks: Estimating the
click-through rate for new ads. In Proceedings of the 16th International Conference on World
Wide Web, WWW ’07, pages 521–530, New York, NY, USA, 2007. ACM.
[35] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng
Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual
recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[36] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556, 2014.
[37] Vincent Vanhoucke, Andrew Senior, and Mark Z Mao. Improving the speed of neural networks
on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop,
volume 1, 2011.
[38] Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. Deep & cross network for ad click
predictions. In Proceedings of the ADKDD’17, ADKDD’17, pages 12:1–12:7, New York, NY,
USA, 2017. ACM.
[39] Darrell Williamson. Dynamically scaled fixed point arithmetic. In PacRim, pages 315–318,
1991.
[40] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang
Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine
translation system: Bridging the gap between human and machine translation. arXiv preprint
arXiv:1609.08144, 2016.
[41] Kun Yang, Yi-Fan Chen, George Roumpos, Chris Colby, and John R. Anderson. High per-
formance monte carlo simulation of ising model on TPU clusters. CoRR, abs/1903.11714,
2019.
[42] Chris Ying, Sameer Kumar, Dehao Chen, Tao Wang, and Youlong Cheng. Image classification
at supercomputer scale. CoRR, abs/1811.06992, 2018.
[43] Jian Zhang, Jiyan Yang, and Hector Yuen. Training with low-precision embedding tables. In
Systems for Machine Learning Workshop at NeurIPS 2018. 2018.
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER
 
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
 
Fast Incremental Community Detection on Dynamic Graphs : NOTES
Fast Incremental Community Detection on Dynamic Graphs : NOTESFast Incremental Community Detection on Dynamic Graphs : NOTES
Fast Incremental Community Detection on Dynamic Graphs : NOTES
 
Can you fix farming by going back 8000 years : NOTES
Can you fix farming by going back 8000 years : NOTESCan you fix farming by going back 8000 years : NOTES
Can you fix farming by going back 8000 years : NOTES
 
HITS algorithm : NOTES
HITS algorithm : NOTESHITS algorithm : NOTES
HITS algorithm : NOTES
 
Basic Computer Architecture and the Case for GPUs : NOTES
Basic Computer Architecture and the Case for GPUs : NOTESBasic Computer Architecture and the Case for GPUs : NOTES
Basic Computer Architecture and the Case for GPUs : NOTES
 
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDESDynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES
 
Are Satellites Covered in Gold Foil : NOTES
Are Satellites Covered in Gold Foil : NOTESAre Satellites Covered in Gold Foil : NOTES
Are Satellites Covered in Gold Foil : NOTES
 
Taxation for Traders < Markets and Taxation : NOTES
Taxation for Traders < Markets and Taxation : NOTESTaxation for Traders < Markets and Taxation : NOTES
Taxation for Traders < Markets and Taxation : NOTES
 
A Generalization of the PageRank Algorithm : NOTES
A Generalization of the PageRank Algorithm : NOTESA Generalization of the PageRank Algorithm : NOTES
A Generalization of the PageRank Algorithm : NOTES
 
ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...
ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...
ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...
 
Income Tax Calender 2021 (ITD) : NOTES
Income Tax Calender 2021 (ITD) : NOTESIncome Tax Calender 2021 (ITD) : NOTES
Income Tax Calender 2021 (ITD) : NOTES
 
Youngistaan Foundation: Annual Report 2020-21 : NOTES
Youngistaan Foundation: Annual Report 2020-21 : NOTESYoungistaan Foundation: Annual Report 2020-21 : NOTES
Youngistaan Foundation: Annual Report 2020-21 : NOTES
 

Recently uploaded

BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.PraveenaKalaiselvan1
 
PROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and VerticalPROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and VerticalMAESTRELLAMesa2
 
trihybrid cross , test cross chi squares
trihybrid cross , test cross chi squarestrihybrid cross , test cross chi squares
trihybrid cross , test cross chi squaresusmanzain586
 
User Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationUser Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationColumbia Weather Systems
 
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》rnrncn29
 
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...D. B. S. College Kanpur
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxmalonesandreagweneth
 
GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024Jene van der Heide
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubaikojalkojal131
 
Topic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxTopic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxJorenAcuavera1
 
basic entomology with insect anatomy and taxonomy
basic entomology with insect anatomy and taxonomybasic entomology with insect anatomy and taxonomy
basic entomology with insect anatomy and taxonomyDrAnita Sharma
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)riyaescorts54
 
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxSTOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxMurugaveni B
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxpriyankatabhane
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024innovationoecd
 
Environmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial BiosensorEnvironmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial Biosensorsonawaneprad
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxEran Akiva Sinbar
 
User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)Columbia Weather Systems
 
User Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationUser Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationColumbia Weather Systems
 

Recently uploaded (20)

BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
 
PROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and VerticalPROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and Vertical
 
trihybrid cross , test cross chi squares
trihybrid cross , test cross chi squarestrihybrid cross , test cross chi squares
trihybrid cross , test cross chi squares
 
User Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationUser Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather Station
 
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
 
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
 
GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
 
Topic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxTopic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptx
 
basic entomology with insect anatomy and taxonomy
basic entomology with insect anatomy and taxonomybasic entomology with insect anatomy and taxonomy
basic entomology with insect anatomy and taxonomy
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
 
Let’s Say Someone Did Drop the Bomb. Then What?
Let’s Say Someone Did Drop the Bomb. Then What?Let’s Say Someone Did Drop the Bomb. Then What?
Let’s Say Someone Did Drop the Bomb. Then What?
 
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxSTOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptx
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024
 
Environmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial BiosensorEnvironmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial Biosensor
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptx
 
User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)
 
User Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationUser Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather Station
 

A Study of BFLOAT16 for Deep Learning Training

  • 1. A Study of BFLOAT16 for Deep Learning Training Dhiraj Kalamkar1 , Dheevatsa Mudigere2 , Naveen Mellempudi∗1 , Dipankar Das1 , Kunal Banerjee1 , Sasikanth Avancha1 , Dharma Teja Vooturi †1 , Nataraj Jammalamadaka‡1 , Jianyu Huang2 , Hector Yuen2 , Jiyan Yang2 , Jongsoo Park2 , Alexander Heinecke1 , Evangelos Georganas1 , Sudarshan Srinivasan1 , Abhisek Kundu1 , Misha Smelyanskiy2 , Bharat Kaul1 , and Pradeep Dubey1 1 Parallel Computing Lab, Intel Labs 2 Facebook, 1 Hacker Way, Menlo Park, CA Abstract This paper presents the first comprehensive empirical study demonstrating the efficacy of the Brain Floating Point (BFLOAT16) half-precision format for Deep Learning training across image classification, speech recognition, language model- ing, generative networks and industrial recommendation systems. BFLOAT16 is attractive for Deep Learning training for two reasons: the range of values it can represent is the same as that of IEEE 754 floating-point format (FP32) and conver- sion to/from FP32 is simple. Maintaining the same range as FP32 is important to ensure that no hyper-parameter tuning is required for convergence; e.g., IEEE 754 compliant half-precision floating point (FP16) requires hyper-parameter tuning. In this paper, we discuss the flow of tensors and various key operations in mixed precision training, and delve into details of operations, such as the rounding modes for converting FP32 tensors to BFLOAT16. We have implemented a method to emulate BFLOAT16 operations in Tensorflow, Caffe2, IntelCaffe, and Neon for our experiments. Our results show that deep learning training using BFLOAT16 tensors achieves the same state-of-the-art (SOTA) results across domains as FP32 tensors in the same number of iterations and with no changes to hyper-parameters. 1 Introduction The spectacular success of Deep Learning has come riding on the ready availability of data and a tremendous growth in compute capability of deep learning systems. In recent years, compute growth has been driven by specialized architectures for GEMM (General Matrix Multiply) acceleration, and the shift to low precision compute. Inference has witnessed a proliferation of mixed precision compute [19, 33, 27, 25] where different operations execute at different precision, all the way from binary/ternary operands to 16b floating point. Similarly, training has also witnessed its share of mixed precision methods, where a combination of half- and single-precision compute is used. There are at least three half-precision formats in the domain of mixed precision training of large neural networks: FP16 [28], 16-bit Integer based [9], and BFLOAT16 [10, 4, 3]. All these methods have 16- bit input operands and 32-bit accumulators for all the computations. Of the three formats, only the first two have publicly available description of training methodology and experimental results on a wide variety of neural networks although BFLOAT16 was originally conceived for deep learning training. BFLOAT16 data format was first introduced as part of distributed training frameworks DistBelief [10] and Tensorflow [4] as a low precision storage format used to reduce communication volumes of ∗ corresponding author {naveen.k.mellempudi}@intel.com † IIIT Hyderabad, ‡ Lab126 Amazon, work done while at at Intel Preprint. Under review. arXiv:1905.12322v3 [cs.LG] 13 Jun 2019
training (mainly within the Google ecosystem [3]), because of its wider dynamic range and smaller footprint. In this work, we present a detailed mixed precision methodology using BFLOAT16 and demonstrate coverage by training a variety of workloads from image processing (including GANs) to speech/language processing and recommendation systems.

3 Special thanks to Jeff Dean for pointing this out.

For our experiments we employ a method where FP32 operations emulate the behavior of BFLOAT16 operations by zeroing out the lower 16 bits of the input operands and rounding them appropriately. We have developed a library called Quantlib to implement this emulation in multiple deep learning frameworks such as IntelCaffe, Caffe2, Neon and Tensorflow. One of the functions Quantlib provides modifies the elements of an input FP32 tensor to emulate the behavior of BFLOAT16: it zeroes out the lower 16 bits of the FP32 elements and performs RNE (round to nearest even) rounding based on those bits. This modification ensures that the tensor retains FP32 storage (so that FP32 hardware and FP32 libraries can operate on it), while providing exactly the precision and rounding that BFLOAT16 hardware would afford. Quantlib is called prior to GEMM operations (or other operations planned to be implemented in BFLOAT16) to emulate BFLOAT16 input operands, while the FP32 output of the GEMM naturally fits into the "BFLOAT16-input, FP32-accumulator" schema of this training methodology.
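To make the rounding step concrete, the snippet below is a minimal NumPy sketch of the kind of FP32-to-BFLOAT16 emulation described above (add a round-to-nearest-even bias, then clear the lower 16 bits). It illustrates the general technique rather than the actual Quantlib implementation, and it does not treat NaN payloads specially.

```python
import numpy as np

def emulate_bfloat16(x: np.ndarray) -> np.ndarray:
    """Round an FP32 tensor to BFLOAT16 precision (RNE) while keeping FP32 storage.

    Illustrative sketch only; NaN payloads are not handled specially here.
    """
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    # Round-to-nearest-even bias: 0x7FFF, plus 1 when the lowest kept mantissa bit
    # is set, so that ties round to an even mantissa.
    bias = np.uint32(0x7FFF) + ((bits >> np.uint32(16)) & np.uint32(1))
    rounded = (bits + bias) & np.uint32(0xFFFF0000)  # keep sign, exponent, top 7 mantissa bits
    return rounded.view(np.float32)

# The result still has dtype float32 but contains only BFLOAT16-representable values,
# so calling this on GEMM inputs before a stock FP32 GEMM reproduces the
# "BFLOAT16-input, FP32-accumulator" behavior.
a = np.array([1.0, 3.14159265, 1e-3, 65504.0], dtype=np.float32)
print(emulate_bfloat16(a))
```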
Using multiple deep learning frameworks modified to insert appropriate Quantlib calls, we provide SOTA results for AlexNet [24], ResNet-50 [20], DC-GAN [32], SR-GAN [26], DeepSpeech2 [5], GNMT [40], and two industrial workloads, namely a Deep & Cross Network and a DNN recommendation system. We also qualitatively compare and contrast the training methodologies using BFLOAT16, FP16 and INT16. We observe that FP16-based training requires tuning an additional hyper-parameter for loss scaling to achieve SOTA results, while INT16-based training requires fine-grained block quantization and maintenance of block-level scaling factors. In comparison, all BFLOAT16 experiments are performed without any hyper-parameter changes, and BFLOAT16 kernels are expected to be relatively straightforward.

The rest of the paper is organized as follows. Section 2 surveys the literature and describes various attempts at half-precision based training. Section 3 discusses the BFLOAT16 format, operations and data flow in detail. Section 4 describes our experimental results in detail. Section 5 presents our concluding thoughts.

2 Related Work

The application of low precision datatypes to deep learning is a well explored research topic. The literature shows that a variety of reduced precision data representations have been investigated; these can be broadly classified into two types: the more standard floating point based formats [28, 15, 12] and custom fixed point based formats [37, 7, 18, 22, 23]. Custom fixed point representations may offer more flexibility than typical floating point formats in terms of both precision and dynamic range, by maintaining separate integer values for precision and range. Consequently, fixed point representations may provide more robust and accurate training of an underlying application. The work reported in [37] leverages the dynamically scaled fixed point representation proposed in [39] to speed up convolutional neural networks by 4x over an optimized floating point implementation on general purpose CPU hardware. In [18], the authors present a comprehensive study of the effect of low precision fixed point computation on deep learning and also train smaller networks using 16-bit fixed point on specialized hardware.

Researchers have ventured below 16-bit precision as well, almost always with custom fixed point schemes. Reference [7] uses a dynamic fixed point format with low precision multiplications of up to 12 bits. This idea is advanced further in [8], where the authors showcase training with only binary weights while keeping all other tensors and operations in full precision. Another extension [21] uses binary activations as well; however, the gradients and the weights are maintained in full precision. A related work [22] uses activations and weights quantized to 6 bits for neural network training, with gradients in full precision. The method described in [33] uses binary representations for all components, including gradients. However, all the aforementioned methods are shown to work only for smaller benchmark models and datasets, and invariably result in a non-trivial drop in accuracy on the larger ImageNet dataset [11] and classification task [35].
The authors of [23] advocate a fixed point numerical format called Flexpoint that is specifically tailored for executing deep neural networks on specialized hardware; this datatype is shown to outperform FP16 and achieve numerical parity with FP32 across a diverse set of workloads. A more general dynamic fixed point representation and associated compute primitives are presented in [9], which leverage the integer compute pipeline of general purpose hardware to match FP32 baseline accuracy across state-of-the-art convolutional neural networks. The disadvantage of these integer based representations, in contrast to floating point, is the additional overhead of handling shared exponents and managing accumulator overflow. Köster et al. [23] propose an algorithm that predicts the shared exponent ahead of time to eliminate some of these overheads; however, this solution requires collecting additional statistics at each layer, which cannot be computed efficiently on general purpose hardware.

A mixed precision training methodology using FP16 and FP32 is reported in [28]. This work employs FP16 for storing activations, weights and gradients. The computations during the forward pass and back propagation use the FP16 datatype, while results are accumulated into FP32. A master copy of the FP32 weights is preserved for the update operation. The authors successfully perform deep learning training on a wide range of applications encompassing deep networks and larger datasets (ILSVRC-class problems) with minimal loss compared to baseline FP32 results. This work, however, underlines that FP16/FP32 mixed precision training requires loss scaling [15] to attain near-SOTA results. Loss scaling ensures that back-propagated gradient values are shifted into a range representable by FP16, so that the small-magnitude (negative exponent) values which are critical for accuracy are preserved.

The need for loss scaling can be avoided by using the BFLOAT16 datatype. The hardware numerics of BFLOAT16 on Intel architecture are described in [1]. BFLOAT16 has been identified as a crucial ingredient for achieving peta-FLOPS scale on an image classification task in [42]. In [3], Google notes the speedups achieved by using this datatype over FP32 in TensorFlow models for various tasks such as image classification, image segmentation, object detection and machine translation. The benefits of BFLOAT16 are not restricted to machine learning: the recently published work [41] uses this representation for Monte Carlo simulations of the Ising model, an important model in statistical physics. In fact, the authors of [41] note that BFLOAT16 "provides better training and model accuracy than the IEEE half-precision representation". The Julia language [6], which is designed to provide high performance without sacrificing ease of programming, has also been shown to benefit from BFLOAT16 [13]. OpenAI has likewise mentioned that optimizing kernels targeting BFLOAT16 is "in active development" [17].
3 Training with Brain Floating Point

Numerous studies have shown that 16 bits of precision is sufficient for training deep neural networks [18, 28, 23, 9]. Researchers have experimented with various numeric formats to optimize training platforms for power and performance. State-of-the-art training platforms today have chosen IEEE-754 half-precision floating point as the preferred numeric format for deep learning training. However, the narrow dynamic range of half-precision floating point is not sufficient to represent error gradients during back propagation. To mitigate this, training methods use loss scaling techniques [28] to shift gradients into the expressible range supported by half-precision floats. While this is easy to implement for certain feed-forward networks as a simple multiplication of the loss by a constant scaling factor, other networks (e.g., recurrent ones) require a more sophisticated approach to determine the right scaling parameter. This process is often iterative and requires significant time and resource investment from data scientists to optimize it for their network. These software overheads hinder the seamless migration of new deep learning applications onto low-precision hardware. Recent developments in software tools such as "automatic mixed precision" [30] aim to ease some of this burden, but these tools in their current form still require changes to the original model code and are not guaranteed to yield sufficient performance gains.
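For context, the sketch below shows the constant loss-scaling scheme that FP16 mixed precision training typically relies on (and that BFLOAT16 avoids): the loss is multiplied by a scale before backpropagation and the gradients are divided by the same scale before the weight update. This is a generic TensorFlow-style illustration of the technique from [28, 15], not code from this work; the model, optimizer, loss function and scale value are placeholders.

```python
import tensorflow as tf

def train_step_with_loss_scaling(model, optimizer, loss_fn, x, y, loss_scale=1024.0):
    """One training step with a constant loss scale, as used in FP16 mixed precision.

    The constant loss_scale is a hypothetical value; in practice it must be tuned
    per network, which is exactly the extra hyper-parameter BFLOAT16 training avoids.
    """
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x))
        scaled_loss = loss * loss_scale              # shift small gradients into FP16 range
    scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
    grads = [g / loss_scale for g in scaled_grads]   # undo the scaling before the update
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```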
BFLOAT16 values are represented as truncated full precision (FP32) floating point values with 8 bits of mantissa (7 explicit bits plus the implicit leading bit) and a dynamic range comparable to FP32 (Table 1). This extended dynamic range can represent small gradient values without complicated loss scaling methods, which enables easier migration of deep learning workloads to BFLOAT16 hardware. There are additional benefits to adopting the BFLOAT16 numeric format when building hardware for deep learning: core compute primitives such as FMA can be built using 8-bit multipliers, which leads to significant area and power savings while preserving the full dynamic range of FP32. Table 1 compares BFLOAT16 with the other standard IEEE floating point formats.

Table 1: Comparison of the BFLOAT16 numeric format with the IEEE-754 FP32 and FP16 formats.

  Data Type   Bit Format (s, e, m)   Max Normal   Min Normal   Min Subnormal   Acc. Size
  FP32        1, 8, 23               3.40e38      1.17e-38     1.40e-45        float32
  FP16        1, 5, 10               6.55e4       6.10e-5      5.96e-8         float32
  BFLOAT16    1, 8, 7                3.38e38      1.17e-38     N/A             float32

[Figure 1: Mixed precision data flow used for training DNNs with the BFLOAT16 data format.]

Figure 1 shows the mixed precision data flow used to train deep neural networks with the BFLOAT16 numeric format. The core compute kernels, represented as GEMM operations, accept inputs as BFLOAT16 tensors and accumulate the output into FP32 tensors. Quantlib (shown as Q in Figure 1) converts these output tensors to BFLOAT16 format before passing them to the next layer. Quantlib is also employed to convert a copy of the FP32 weights to BFLOAT16 for the forward pass. Error gradients with respect to the inputs are also in BFLOAT16 format. Non-GEMM compute operations, including batch normalization and activation functions such as ReLU, tanh and sigmoid, also accept BFLOAT16 tensors as inputs. Bias tensors are always maintained in FP32. The weight update step (e.g., in the SGD solver) uses the FP32 copy of the weights to maintain model accuracy.
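The data flow above can be summarized in a few lines of emulation code. The sketch below is an illustrative NumPy version of one fully connected layer's forward and backward GEMMs under the "BFLOAT16-input, FP32-accumulator" scheme, reusing an emulate_bfloat16 helper like the one sketched earlier; the layer shapes and learning rate are arbitrary placeholders, not settings from this paper.

```python
import numpy as np

def emulate_bfloat16(x):
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    bias = np.uint32(0x7FFF) + ((bits >> np.uint32(16)) & np.uint32(1))
    return ((bits + bias) & np.uint32(0xFFFF0000)).view(np.float32)

rng = np.random.default_rng(0)
w_fp32 = rng.standard_normal((256, 128)).astype(np.float32)    # FP32 master weights
x = rng.standard_normal((32, 256)).astype(np.float32)          # activations from the previous layer

# Forward pass: the GEMM consumes BFLOAT16 operands and accumulates in FP32.
y = emulate_bfloat16(x) @ emulate_bfloat16(w_fp32)

# Backward pass: error gradients are also kept in BFLOAT16 for the GEMMs.
dy = emulate_bfloat16(rng.standard_normal(y.shape).astype(np.float32))
dw = emulate_bfloat16(x).T @ dy          # weight gradient, FP32 accumulation
dx = dy @ emulate_bfloat16(w_fp32).T     # input gradient passed to the previous layer

# Weight update (e.g., SGD) is applied to the FP32 master copy.
lr = 0.01
w_fp32 -= lr * dw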
4 Results

Our evaluation of BFLOAT16 covers the aforementioned deep learning models from different application domains and frameworks, using the tensor modification method via Quantlib discussed in Section 1.

4.1 Convolutional Neural Networks

Convolutional neural networks (CNNs) are primarily used for computer vision applications such as image classification, object detection and semantic segmentation. CNNs have been studied extensively in both academia and industry, driven primarily by public benchmarks such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Over the past few years, the CNNs that won the ILSVRC competition have become well established benchmarks. Here we choose AlexNet (ILSVRC 2012) [24] and ResNet-50 (ILSVRC 2015) [20] as representative models for the BFLOAT16 evaluation. In addition to the Convolution and InnerProduct layers (which contribute the majority of the computation), we use BFLOAT16 emulation for the ReLU, BatchNorm, Pooling, Dropout and EltWise layers as well. This ensures that the full training pipeline uses BFLOAT16 and does not require higher precision for the intermediate tensor outputs.

4.1.1 AlexNet

For AlexNet, we used a global minibatch of 1024 running data parallel on 16 nodes for 88 epochs and achieved 57.4% top-1 and 80.7% top-5 accuracy. As shown in Figure 2, our BFLOAT16 emulation follows the actual FP32 run very closely and achieves 57.2% top-1 and 80.1% top-5 accuracy.

[Figure 2: ImageNet-1K training, top-1 and top-5 validation accuracy plots for CNNs. (a) AlexNet, (b) ResNet-50.]

4.1.2 ResNet

Our ResNet-50 experiments use a global minibatch of 1024 running data parallel on 32 nodes, with SGD and Nesterov momentum. We trained for 90 epochs with learning rate warm-up for the first 5 epochs. The baseline FP32 run achieved a top-1 accuracy of 74.7% and a top-5 accuracy of 92.0%, and as shown in Figure 2, our BFLOAT16 emulation follows the baseline almost exactly, achieving the same top-1 and top-5 accuracy. During training we use local batch statistics for batch normalization to compute validation accuracy after every epoch. The fully trained BFLOAT16 model achieves 75.7% top-1 test accuracy with global sample statistics, matching the baseline FP32 results.

4.2 Recurrent Neural Networks

Recurrent neural networks (RNNs), unlike feedforward networks, capture temporal information through their feedback connections. These models are popular for applications such as automatic speech recognition (ASR) and language processing, which primarily involve sequence-based learning. RNNs have been observed to have more demanding numerical range requirements [28] and are more sensitive to the half-precision datatype. For this class of networks we select Baidu's DeepSpeech2 [5] and Google's neural machine translation (GNMT) model [40] as representative candidates for the BFLOAT16 evaluation.

4.2.1 DeepSpeech2

The DeepSpeech2 (DS2) topology consists of two convolution layers, followed by three bi-directional gated recurrent unit (GRU) layers with 2048 cells, and a final inner-product layer as a classifier. We use the Adam optimizer with the connectionist temporal classification (CTC) loss [16], a batch size of 64 and a learning rate of 0.0005. The model is trained on the LibriSpeech dataset [31], a 460-hour corpus.
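As a rough illustration of this training configuration, the snippet below wires up an Adam optimizer and a CTC loss with the stated batch size and learning rate using TensorFlow; the model construction and tensor shapes are placeholders, since the experiments described here were run with a different, emulation-instrumented framework.

```python
import tensorflow as tf

# Training configuration for a DS2-style setup. The model itself is a placeholder:
# the actual topology is 2 conv layers, 3 bi-directional GRU layers with 2048 cells,
# and a final inner-product classifier.
batch_size = 64
optimizer = tf.keras.optimizers.Adam(learning_rate=0.0005)

def ds2_loss(labels, logits, label_length, logit_length):
    # Connectionist temporal classification (CTC) loss [16], averaged over the batch.
    return tf.reduce_mean(
        tf.nn.ctc_loss(labels=labels, logits=logits,
                       label_length=label_length, logit_length=logit_length,
                       logits_time_major=False, blank_index=-1))
```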
4.2.2 Neural Machine Translation

Google's Neural Machine Translation (GNMT) is the SOTA neural machine translation model based on a recurrent network. It uses a stack of long short-term memory (LSTM) layers, along with an attention model, for language modeling and translation. Table 2 compares translation accuracy in terms of achieved BLEU scores for the baseline FP32 and the BFLOAT16 emulation. We use the small Vietnamese (VI) to English (EN) model and the big German (DE) to English (EN) model. The BFLOAT16 emulation achieves the same or better accuracy than the baseline. Figure 3 shows how closely the BFLOAT16 emulation run follows the baseline FP32 run.

[Figure 3: RNN training using the BFLOAT16 data type. (a) DeepSpeech2, (b) GNMT.]

Table 2: GNMT BLEU scores for De→En and Vi→En on the WMT'16 and IWSLT'15 datasets.

  Task                           FP32   BFLOAT16
  DE→EN, WMT'16                  29.3   29.3
  VI→EN, IWSLT'15 + attention    17.1   18.3

4.3 Generative Adversarial Networks (GANs)

Generative Adversarial Networks (GANs) have become a very important class of networks; they can be used to learn and mimic an arbitrary distribution of data. GANs achieve this by training two separate networks, a generator and a discriminator, in a tightly coupled way. Because GANs combine regression and discrimination tasks during training, they tend to have different requirements for numerical precision and range. For the BFLOAT16 evaluation we consider the DC-GAN [32] and SR-GAN [26] models.

4.3.1 DC-GAN

DC-GAN [32] represents a critical step in designing GAN architectures, which were earlier known to be notoriously difficult to train. It consists of fractionally-strided convolutions with ReLU activations in the generator, whereas convolutions with leaky ReLU activations are used in the discriminator; batch normalization layers are used in both the generator and the discriminator. We have implemented DC-GAN in Caffe. For our experiments, all input tensors (activations, weights) are converted to BFLOAT16 for the convolution layers (in both the generator and the discriminator), while only the input activations are converted to BFLOAT16 for the batch normalization layers; all other tensors are maintained in full precision. A comparison between FP32 and BFLOAT16 in terms of inception score and MS-SSIM is shown in Table 3. As evident from the table, the outputs obtained for FP32 and BFLOAT16 are comparable.

Table 3: Comparison between FP32 and BFLOAT16 for DC-GAN on the face dataset.

  Datatype          Inception Score   MS-SSIM
  Baseline (FP32)   1.97 ± 0.054      0.262
  BFLOAT16          2.06 ± 0.055      0.217
4.3.2 SR-GAN

SR-GAN generates photo-realistic high-resolution images by super-resolving a single shot of a low-resolution image [26]. The low-resolution images are scaled 4x, preserving the spatial features while minimizing noise. The quality of the output is measured using the SSIM (structural similarity), MS-SSIM (multi-scale structural similarity) and PSNR (peak signal-to-noise ratio) metrics. The topology consists of a "generator" network based on the ResNet architecture, while the "discriminator" consists of 8 convolution layers, each followed by batchnorm and leaky ReLU. The network uses a "VGG loss" function based on a pre-trained VGG19 network [36]. For the BFLOAT16 experiments, we converted all input tensors (weights, activations) of the convolution layers to BFLOAT16, while keeping the rest of the layers (BatchNorm, ReLU, LeakyReLU and Eltwise) at full precision.

Table 4: SR-GAN model trained with BFLOAT16 on the DIV2K dataset (http://www.vision.ee.ethz.ch/ntire17/).

  Datatype          PSNR      SSIM      MS-SSIM
  Baseline (FP32)   26.1749   0.73753   0.99999
  BFLOAT16          26.1415   0.74079   0.99999

[Figure 4: Training loss for SR-GAN training using BFLOAT16. (a) Generator network, (b) Discriminator network.]

4.4 Industrial Scale Recommendation System

Recommendation and personalization models are very important for many practical-scale applications. Here we evaluate the Deep & Cross Network [38] on the small Kaggle Criteo dataset4 and a typical DNN recommender system [43, 29] on the large Terabyte Criteo dataset5; both target predicting the ads click-through rate [34]. The accuracy of the recommendation models is measured by the log loss [38, 43, 29], which reflects how well the model predicts whether users will click on ads. Lower log loss translates into higher prediction accuracy for the recommendation model. Note that an accuracy loss of 0.001 in log loss is considered unacceptable in practice.

4 https://www.kaggle.com/c/criteo-display-ad-challenge/data
5 https://www.criteo.com/news/press-releases/2015/07/criteo-releases-industrys-largest-ever-dataset/

For our BFLOAT16 experiments, all input tensors (activations and weights) are converted to BFLOAT16 for the fully connected layers in both the forward and backward propagation passes. During the weight update stage, we use an FP32 master copy [28] to reduce the additional accuracy loss. We use either the round-to-nearest or the direct truncation scheme for the conversion from FP32 to BFLOAT16. The accuracy evaluation results are shown in Table 5. As we can observe, BFLOAT16 with the round-to-nearest scheme matches the FP32 baseline accuracy almost exactly, while BFLOAT16 with the direct truncation scheme suffers a tiny accuracy degradation (~0.02%).
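To make the two conversion schemes concrete, the sketch below contrasts direct truncation (simply clearing the low 16 bits) with round-to-nearest-even. This is an illustrative NumPy comparison of the general techniques, not the production code used for these experiments.

```python
import numpy as np

def to_bfloat16_truncate(x):
    """Direct truncation: drop the lower 16 bits of the FP32 representation (rounds toward zero)."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

def to_bfloat16_rne(x):
    """Round to nearest even before dropping the lower 16 bits."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    bias = np.uint32(0x7FFF) + ((bits >> np.uint32(16)) & np.uint32(1))
    return ((bits + bias) & np.uint32(0xFFFF0000)).view(np.float32)

x = np.array([0.1, -2.7182818], dtype=np.float32)
print(to_bfloat16_truncate(x))  # always biased toward zero
print(to_bfloat16_rne(x))       # unbiased on average, which matters for accumulated updates
```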
Table 5: Log loss for the Deep & Cross Network [38] on the small Kaggle Criteo dataset and a DNN recommender system [43] on the Terabyte Criteo dataset, trained with BFLOAT16 (with either round-to-nearest or direct truncation for the conversion from FP32 to BFLOAT16).

  Recommendation System    Baseline (FP32)   BFLOAT16 (RND)   BFLOAT16 (TRUNC)
  Deep & Cross Network     0.44372           0.44372          0.44393
  DNN Recommender System   0.12520           0.12520          0.12537

4.5 Beyond Emulation - Towards Bare Metal Execution

We close this section by highlighting that our emulation strategy is an excellent approximation of future Intel Xeon CPUs. We took the slim CNN training framework GxM, presented in [14], and implemented all the important operators (convolution, fully-connected, batch-normalization and pooling layers) using the AVX512BF16 instruction set extensions [2]. All activation and weight data was therefore present only as 16-bit data in memory, and the special VNNI (vector neural network instructions) data layout was employed to support the BFLOAT16 dot-product instruction with FP32 accumulation. These instructions were executed via bit-accurate emulation on current AVX512 silicon with only a very slight performance tax. When training ResNet-50 on ImageNet using the AVX512BF16 instructions, we achieved a top-1 accuracy of 75.62%, which matches the current state-of-the-art. The code is already heavily optimized and ready for prime time. Additionally, we have implemented a full BFLOAT16 LSTM cell which is currently being integrated into Tensorflow.
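For intuition, the snippet below emulates in NumPy what a BFLOAT16 dot-product instruction with FP32 accumulation computes over a VNNI-style pair layout: each FP32 accumulator lane consumes a pair of BFLOAT16 values from each operand, multiplies them elementwise and adds both products to the lane. This is only a behavioral illustration of the instruction class described above, not a description of the actual AVX512BF16 encoding or intrinsics.

```python
import numpy as np

def bf16_pair_dot_fp32(acc, a_bf16, b_bf16):
    """Emulate a BFLOAT16 pair dot product with FP32 accumulation.

    acc:            FP32 accumulator lanes, shape (n,)
    a_bf16, b_bf16: BFLOAT16-valued operands stored as float32, shape (n, 2);
                    each accumulator lane consumes one pair of elements per step.
    """
    # BF16 x BF16 products are exactly representable in FP32 (barring overflow).
    prod = a_bf16.astype(np.float32) * b_bf16.astype(np.float32)
    return acc + prod.sum(axis=1, dtype=np.float32)   # accumulate both products into FP32

# Example with 4 accumulator lanes; operands are assumed already rounded to BFLOAT16.
acc = np.zeros(4, dtype=np.float32)
a = np.ones((4, 2), dtype=np.float32)
b = np.full((4, 2), 0.5, dtype=np.float32)
print(bf16_pair_dot_fp32(acc, a, b))   # -> [1. 1. 1. 1.]
```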
5 Conclusion

Our goal in this paper was to establish BFLOAT16 as an alternative half-precision format for Deep Learning training, given that its dynamic range is the same as that of FP32 and conversion to/from FP32 is straightforward. Our empirical study demonstrates that, unlike the IEEE 754 half-precision format and 16-bit integer formats, BFLOAT16-based training eliminates the need for hyper-parameter tuning or complex software management for block quantization. Our study also demonstrates that BFLOAT16 is a robust datatype able to cover the range of tensors across application domains including vision, speech, language, generative networks, and recommendation systems. We expect industry-wide adoption of BFLOAT16 across emerging domains.

References

[1] BFLOAT16 - hardware numerics definition. https://software.intel.com/sites/default/files/managed/40/8b/bf16-hardware-numerics-definition-white-paper.pdf. Accessed: 2019-03-22.
[2] Intel architecture instruction set extensions and future features programming reference. https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf. Accessed: 2019-05-21.
[3] Using bfloat16 with tensorflow models. https://cloud.google.com/tpu/docs/bfloat16. Accessed: 2019-16-04.
[4] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[5] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In ICML, pages 173-182, 2016.
[6] Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B. Shah. Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65-98, 2017.
[7] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Training deep neural networks with low precision multiplications. arXiv preprint arXiv:1412.7024, 2014.
[8] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In NIPS, pages 3123-3131, 2015.
[9] Dipankar Das, Naveen Mellempudi, Dheevatsa Mudigere, Dhiraj D. Kalamkar, Sasikanth Avancha, Kunal Banerjee, Srinivas Sridharan, Karthik Vaidyanathan, Bharat Kaul, Evangelos Georganas, Alexander Heinecke, Pradeep Dubey, Jesús Corbal, Nikita Shustrov, Roman Dubtsov, Evarist Fomenko, and Vadim O. Pirogov. Mixed precision training of convolutional neural networks using integer operations. In ICLR, 2018.
[10] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223-1231, 2012.
[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248-255, 2009.
[12] Tim Dettmers. 8-bit approximations for parallelism in deep learning. arXiv preprint arXiv:1511.04561, 2015.
[13] Keno Fischer and Elliot Saba. Automatic full compilation of julia programs and ML models to cloud TPUs. CoRR, abs/1810.09868, 2018.
[14] Evangelos Georganas, Sasikanth Avancha, Kunal Banerjee, Dhiraj Kalamkar, Greg Henry, Hans Pabst, and Alexander Heinecke. Anatomy of high-performance deep learning convolutions on simd architectures. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC '18, pages 66:1-66:12, Piscataway, NJ, USA, 2018. IEEE Press.
[15] Boris Ginsburg, Sergei Nikolaev, and Paulius Micikevicius. Training of deep networks with half-precision float. NVidia GPU Technology Conference, 2017.
[16] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369-376. ACM, 2006.
[17] Scott Gray, Alec Radford, and Diederik P. Kingma. Gpu kernels for block-sparse weights. https://s3-us-west-2.amazonaws.com/openai-assets/blocksparse/blocksparsepaper.pdf. Accessed: 2019-16-04.
[18] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In ICML, pages 1737-1746, 2015.
[19] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. In ICLR, 2016.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770-778, 2016.
[21] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In NIPS, pages 4107-4115, 2016.
[22] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016.
[23] Urs Köster, Tristan Webb, Xin Wang, Marcel Nassar, Arjun K. Bansal, William Constable, Oguz Elibol, Stewart Hall, Luke Hornof, Amir Khosrowshahi, Carey Kloss, Ruby J. Pai, and Naveen Rao. Flexpoint: An adaptive numerical format for efficient training of deep neural networks. CoRR, abs/1711.02213, 2017.
[24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097-1105, 2012.
[25] Abhisek Kundu, Kunal Banerjee, Naveen Mellempudi, Dheevatsa Mudigere, Dipankar Das, Bharat Kaul, and Pradeep Dubey. Ternary residual networks. CoRR, abs/1707.04679, 2017.
[26] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew P Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, volume 2, page 4, 2017.
[27] Naveen Mellempudi, Abhisek Kundu, Dheevatsa Mudigere, Dipankar Das, Bharat Kaul, and Pradeep Dubey. Ternary neural networks with fine-grained quantization. arXiv preprint arXiv:1705.01462, 2017.
[28] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.
[29] Maxim Naumov, Dheevatsa Mudigere, et al. Deep learning recommendation model for personalization and recommendation systems. arXiv preprint arXiv:1906.00091, 2019.
[30] NVIDIA. Automatic mixed precision for deep learning. https://developer.nvidia.com/automatic-mixed-precision. Accessed: 2019-22-05.
[31] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206-5210. IEEE, 2015.
[32] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
[33] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, 2016.
[34] Matthew Richardson, Ewa Dominowska, and Robert Ragno. Predicting clicks: Estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, pages 521-530, New York, NY, USA, 2007. ACM.
[35] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.
[36] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[37] Vincent Vanhoucke, Andrew Senior, and Mark Z Mao. Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, volume 1, 2011.
[38] Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. Deep & cross network for ad click predictions. In Proceedings of the ADKDD'17, ADKDD'17, pages 12:1-12:7, New York, NY, USA, 2017. ACM.
[39] Darrell Williamson. Dynamically scaled fixed point arithmetic. In PacRim, pages 315-318, 1991.
[40] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
[41] Kun Yang, Yi-Fan Chen, George Roumpos, Chris Colby, and John R. Anderson. High performance monte carlo simulation of ising model on TPU clusters. CoRR, abs/1903.11714, 2019.
[42] Chris Ying, Sameer Kumar, Dehao Chen, Tao Wang, and Youlong Cheng. Image classification at supercomputer scale. CoRR, abs/1811.06992, 2018.
[43] Jian Zhang, Jiyan Yang, and Hector Yuen. Training with low-precision embedding tables. In Systems for Machine Learning Workshop at NeurIPS 2018, 2018.