AI Hardware Landscape 2021

AI:
Hardware Landscape
Grigory Sapunov
OpenTalks.AI / 2021.02.05
gs@inten.to

Executive Summary :)
Most hardware is focused on DL, which requires a lot of computation:
● There’s much more diversity in CPUs now, not only x86.
● GPUs (mostly NVIDIA) are the most popular choice. Intel and AMD may
offer interesting alternatives this year.
● There are some available ASIC alternatives: Google TPU (in cloud only),
Graphcore, Huawei Ascend.
● More ASICs are coming into this field: Cerebras, Habana, etc.
● Some companies use FPGAs and make them available in the cloud
(Microsoft, AWS).
● Edge AI is everywhere already! More to come!
● Neuromorphic computing is on the rise (IBM TrueNorth, Tianjic,
memristors, etc)
● Quantum computing can potentially benefit machine learning as well
(but it probably won’t be a desktop or in-house server solution)

The Deep Learning Revolution and Its Implications for Computer Architecture and Chip Design
https://arxiv.org/abs/1911.05289

Typically multi-core even on the desktop market:
● usually from 2 to 10 cores in modern Core i3-i9 Intel CPUs
● up to 18 cores/36 threads in high-end Intel CPUs (i9-7980XE/9980XE/10980XE)
[https://en.wikipedia.org/wiki/List_of_Intel_Core_i9_microprocessors]
● up to 64 cores/128 threads in AMD Ryzen Threadripper
(Ryzen Threadripper 3990X, Ryzen Threadripper Pro 3995WX)
x86: Desktops

On the server market:
● Intel Xeon: up to 56 cores/112 threads (Xeon Platinum 9282 Processor)
● AMD EPYC: up to 64 cores/128 threads (EPYC 7702/7742)
● usually having more cores than desktop processors and some other useful
capabilities (supporting more RAM, multi-processor configurations, ECC, etc)
x86: Servers

AVX-512: Fused Multiply Add (FMA) core instructions for enabling lower-precision
operations. List of CPUs with AVX-512 support:
https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512
VNNI (Vector Neural Network Instructions): Multiply and add for integers, etc.
designed to accelerate convolutional neural network-based algorithms.
https://en.wikichip.org/wiki/x86/avx512vnni
DL Boost: AVX512-VNNI + Brain floating-point format (bfloat16)
designed for inference acceleration.
https://en.wikichip.org/wiki/brain_floating-point_format
x86: ML instructions (SIMD)
https://blog.inten.to/cpu-hardware-for-deep-learning-b91f53cb18af
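
The "multiply and add for integers" idea behind VNNI can be sketched in plain Python. This is only a behavioural model of one 32-bit lane of a byte dot-product instruction in the style of VPDPBUSD (unsigned 8-bit times signed 8-bit, four products summed into a 32-bit accumulator, no saturation); the function names are hypothetical and nothing here touches real SIMD hardware.

```python
def to_int32(x):
    """Wrap a Python int to a signed 32-bit value (wrap-around, no saturation)."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def dot4_u8s8_accumulate(acc, a_u8, b_s8):
    """Model one 32-bit lane: acc + sum of four (unsigned 8-bit * signed 8-bit) products."""
    assert len(a_u8) == len(b_s8) == 4
    assert all(0 <= a <= 255 for a in a_u8)      # activations as unsigned bytes
    assert all(-128 <= b <= 127 for b in b_s8)   # weights as signed bytes
    return to_int32(acc + sum(a * b for a, b in zip(a_u8, b_s8)))


# 10 + (1*5 + 2*(-6) + 3*7 + 4*(-8)) = 10 - 18
print(dot4_u8s8_accumulate(10, [1, 2, 3, 4], [5, -6, 7, -8]))  # -> -8
```

A 512-bit register holds 16 such lanes, so one instruction covers 64 byte-wise multiplies; that fusion is where the INT8 inference speed-up comes from.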

● BigDL: distributed deep learning library for Apache Spark
https://github.com/intel-analytics/BigDL
● Deep Neural Network Library (DNNL): an open-source performance library
for deep learning applications. Layer primitives, etc.
https://intel.github.io/mkl-dnn/
● PlaidML: advanced and portable tensor compiler for enabling deep learning
on laptops, embedded devices, or other devices.
Supports Keras, ONNX, and nGraph.
https://github.com/plaidml/plaidml
● OpenVINO Toolkit: for computer vision
https://docs.openvinotoolkit.org/
x86: Optimized ML Libraries

Some CPU-optimized DL libraries:
● Caffe Con Troll (research project, latest commit in 2016)
https://github.com/HazyResearch/CaffeConTroll
● Intel Caffe (optimized for Xeon):
https://github.com/intel/caffe
● Intel DL Boost can be used in many popular frameworks:
TensorFlow, PyTorch, MXNet, PaddlePaddle, Intel Caffe
https://www.intel.ai/increasing-ai-performance-intel-dlboost/
x86: Optimized ML Libraries

● nGraph: open source C++ library, compiler and runtime for Deep Learning.
Frameworks using nGraph Compiler stack to execute workloads have shown
up to 45X performance boost when compared to native framework
implementations. https://www.ngraph.ai/
Graph compilers

Graph compilers: watch for MLIR!
https://www.tensorflow.org/mlir/overview

● #Cores
● PCIe bandwidth
● PCIe generation (gen3, gen4)
● PCIe lanes (x16, x8, etc) at the processor/chipset side
● Memory type (DDR4, DDR3, etc)
● Memory speed (2133, 2666, 3200, etc)
● Memory channels (1, 2, 4, …)
● Memory size
● Memory speed/bandwidth
● ECC support
● Power usage (Watts)
● Price
● ...
Important dimensions
https://blog.inten.to/cpu-hardware-for-deep-learning-b91f53cb18af
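
Several of the dimensions above combine into simple peak-bandwidth estimates. The sketch below assumes a 64-bit (8-byte) bus per DDR channel and 128b/130b line encoding for PCIe gen3/gen4 (8 and 16 GT/s per lane); the function names are illustrative, and real sustained throughput is lower than these theoretical peaks.

```python
def ddr_bandwidth_gbs(mega_transfers_per_s, channels, bus_bytes=8):
    """Theoretical peak memory bandwidth in GB/s (assumes an 8-byte bus per channel)."""
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9


def pcie_bandwidth_gbs(giga_transfers_per_s, lanes, encoding=128 / 130):
    """Theoretical peak PCIe bandwidth per direction in GB/s (gen3/gen4 encoding assumed)."""
    return giga_transfers_per_s * encoding / 8 * lanes


print(f"DDR4-3200, 2 channels: {ddr_bandwidth_gbs(3200, 2):.1f} GB/s")
print(f"DDR4-2666, 4 channels: {ddr_bandwidth_gbs(2666, 4):.1f} GB/s")
print(f"PCIe gen3 x16: {pcie_bandwidth_gbs(8, 16):.2f} GB/s")
print(f"PCIe gen4 x16: {pcie_bandwidth_gbs(16, 16):.2f} GB/s")
```

This is why memory channels and PCIe generation matter as much as core count: doubling the channels or moving from gen3 to gen4 doubles the respective peak.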

● Single-board computers: Raspberry Pi, part of Jetson
Nano, and Google Coral Dev Board.
● Mobile: Qualcomm, Apple A11, etc
● Server: Marvell ThunderX, Ampere eMAG,
Amazon A1 instance, etc; NVIDIA announced
GPU-accelerated Arm-based servers.
● Laptops: Apple M1, Microsoft Surface Pro X
● ARM also has ML/AI Ethos NPU and Mali GPU
ARM

● ARM announces Neoverse N1 platform (scales up to 128 cores)
https://www.networkworld.com/article/3342998/arm-introduces-neoverse-high-performance-cpus-for-servers-5g.html
● Qualcomm made an ARM server processor for cloud applications, the Centriq 2400 (48 single-thread cores,
2.2GHz). The project was stopped.
https://www.tomshardware.com/news/qualcomm-server-chip-exit-china-centriq-2400,38223.html
● Ampere Altra is the first 80-core ARM-based server processor
https://venturebeat.com/2020/03/03/ampere-altra-is-the-first-80-core-arm-based-server-processor/
● Ampere announces 128-core Arm server processor
https://www.networkworld.com/article/3564514/ampere-announces-128-core-arm-server-processor.html
● Ampere eMAG ARM server microprocessors (up to 32 cores, up to 3.3 GHz)
https://amperecomputing.com/product/, https://en.wikichip.org/wiki/ampere_computing/emag
● Marvell ThunderX ARM Processors (up to 48 cores, up to 2.5 GHz)
https://www.marvell.com/server-processors/thunderx-arm-processors/
● Amazon Graviton ARM processor (16 cores, 2.3GHz)
https://en.wikichip.org/wiki/annapurna_labs/alpine/al73400
https://aws.amazon.com/blogs/aws/new-ec2-instances-a1-powered-by-arm-based-aws-graviton-processors/
● Huawei Kunpeng 920 ARM Server CPU (64 cores, 2.6 GHz)
https://www.huawei.com/en/press-events/news/2019/1/huawei-unveils-highest-performance-arm-based-cpu
ARM: Servers

NVIDIA to Acquire Arm for $40 Billion

Current architecture is POWER9:
● 12 cores x 8 threads or 24 cores x 4 threads (96 threads).
● PCIe v.4, 48 PCIe lanes
● Nvidia NVLink 2.0: the industry’s only CPU-to-GPU Nvidia NVLink connection
● CAPI 2.0, OpenCAPI 3.0 (for heterogeneous computing with FPGA/ASIC)
IBM POWER

An open-source hardware instruction set architecture.
Examples:
● SiFive U5, U7 and U8 cores
https://www.anandtech.com/show/15036/sifive-announces-first-riscv-ooo-cpu-core-the-u8series-processor-ip
● Alibaba's RISC-V processor Xuantie 910 with Vector Engine for AI
Acceleration: 12nm, 64-bit, 16 cores clocked at up to 2.5GHz,
the fastest RISC-V processor to date
https://www.theregister.co.uk/2019/07/27/alibaba_risc_v_chip/
● Western Digital SweRV Core designed for embedded devices
supporting data-intensive edge applications.
https://www.westerndigital.com/company/innovations/risc-v
● Manticore: A 4096-core RISC-V Chiplet Architecture for
Ultra-efficient Floating-point Computing https://arxiv.org/abs/2008.06502
● Esperanto Technologies is building an AI chip with 1k+ cores
https://www.esperanto.ai/technology/
RISC-V

NVIDIA slides: http://www.nvidia.com/content/events/geoInt2015/LBrown_DL.pdf

… → Kepler → Maxwell → Pascal → Volta → Turing → Ampere → ...
NVIDIA Architectures

● Peak performance (GFLOPS) at FP32/16/...
● #Cores (+Tensor Cores)
● Memory size
● Memory speed/bandwidth
● Precision support
● Can connect using NVLink?
● Power usage (Watts)
● Price
● GFLOPS/USD
● GFLOPS/Watt
● Form factor (for desktop or server?)
● ECC memory
● Legal restrictions (e.g. GeForce is not allowed to be used in datacenters)
Important dimensions
https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664

● FP64 (64-bit float), not used for DL
● FP32 — the most commonly used for training
● FP16 or Mixed precision (FP32+FP16) — becoming the new default
● INT8 — usually for inference
● INT4, INT1 — experimental modes for inference
Precision
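
To make the INT8 inference mode concrete, here is a minimal sketch of symmetric per-tensor quantization. It is one simple scheme among many, chosen for illustration; the helper names are hypothetical and no specific framework's quantizer is implied.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the quantized integers."""
    return [x * scale for x in q]


weights = [1.27, -0.64, 0.0, 0.3]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, scale)
print(max(abs(a - b) for a, b in zip(weights, restored)))  # worst-case rounding error
```

The rounding error per value is bounded by half the scale, which is why INT8 usually works for inference but is rarely used for training gradients.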

bfloat16 is now supported on Ampere GPUs and on TPU gen3, and will be
supported on AMD GPUs and Intel CPUs.
Precision: bfloat16
https://moocaholic.medium.com/fp64-fp32-fp16-bfloat16-tf32-and-other-members-of-the-zoo-a1ca7897d407
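
bfloat16 is the top half of an FP32 value: the same sign bit and 8-bit exponent, but only 7 mantissa bits. The sketch below converts by plain truncation using the standard library; real hardware typically rounds to nearest instead, so treat this as a simplified illustration with hypothetical helper names.

```python
import struct


def float32_bits(x):
    """Return the IEEE 754 single-precision bit pattern of x as an integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def to_bfloat16_bits(x):
    """Keep only the upper 16 bits of the FP32 pattern (truncation, not rounding)."""
    return float32_bits(x) >> 16


def from_bfloat16_bits(b):
    """Expand a 16-bit bfloat16 pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


for value in (1.0, 3.14159, 1e38):
    print(value, "->", from_bfloat16_bits(to_bfloat16_bits(value)))
```

The example shows the trade-off: 3.14159 keeps only about two to three decimal digits, while 1e38 still fits, whereas FP16 would overflow far earlier (its maximum is 65504).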

Precision: many caveats

Not only FLOPS: Roofline Performance Model

Roofline Performance Model: Example
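
The roofline model bounds attainable performance by the smaller of peak compute and memory bandwidth times arithmetic intensity (FLOPs per byte moved). A minimal sketch, using the AMD Instinct MI100 FP32 and bandwidth figures quoted elsewhere in this deck as inputs; the function names are illustrative.

```python
def attainable_tflops(peak_tflops, bandwidth_gbs, flop_per_byte):
    """Roofline bound: min(peak compute, memory bandwidth * arithmetic intensity)."""
    memory_bound_tflops = bandwidth_gbs * flop_per_byte / 1000.0  # GB/s * FLOP/B = GFLOP/s
    return min(peak_tflops, memory_bound_tflops)


def ridge_point(peak_tflops, bandwidth_gbs):
    """Arithmetic intensity (FLOP/byte) at which a kernel stops being memory-bound."""
    return peak_tflops * 1000.0 / bandwidth_gbs


PEAK_FP32_TFLOPS = 23.1    # MI100 vector FP32 peak
BANDWIDTH_GBS = 1228.8     # MI100 HBM2 bandwidth

print(f"ridge point: {ridge_point(PEAK_FP32_TFLOPS, BANDWIDTH_GBS):.1f} FLOP/byte")
for intensity in (0.25, 1.0, 10.0, 100.0):
    bound = attainable_tflops(PEAK_FP32_TFLOPS, BANDWIDTH_GBS, intensity)
    print(f"{intensity:>6} FLOP/byte -> at most {bound:.2f} TFLOPS")
```

Kernels left of the ridge point (element-wise ops, small batches) are limited by memory, not FLOPS, which is why peak TFLOPS alone is a poor predictor of real DL throughput.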

Separate cards can be joined using NVLink; SLI is not relevant for DL, it’s for
graphics.
NVSwitch: The Fully Connected NVLink
NCCL 1: multi-GPU collective communication primitives library
NVIDIA: Single-machine Multi-GPU

Distributed training is now a commodity (but scaling is sublinear).
NCCL 2: multi-node collective communication primitives library
NVIDIA: Distributed Multi-GPU
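
The central collective primitive in data-parallel training is all-reduce: every worker ends up with the sum of all workers' gradients. The sketch below simulates a ring all-reduce in plain Python, assuming each worker's vector is split into exactly as many chunks as there are workers (one number per chunk here). It illustrates the algorithm only; it does not use or describe NCCL's actual API or internals.

```python
def ring_allreduce(worker_data):
    """Simulate ring all-reduce (sum). worker_data[i] is worker i's list of n chunks."""
    n = len(worker_data)
    data = [list(v) for v in worker_data]
    assert all(len(v) == n for v in data), "one chunk per worker is assumed"

    # Phase 1, reduce-scatter: after n-1 steps worker i holds the full sum of chunk (i+1) % n.
    for step in range(n - 1):
        messages = [((i + 1) % n, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for dst, chunk, value in messages:   # all sends happen "simultaneously"
            data[dst][chunk] += value

    # Phase 2, all-gather: completed chunks travel around the ring and overwrite stale ones.
    for step in range(n - 1):
        messages = [((i + 1) % n, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for dst, chunk, value in messages:
            data[dst][chunk] = value
    return data


print(ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]]))
```

Each worker sends 2(n-1) chunks in total regardless of n, so per-worker traffic stays nearly constant as workers are added; latency and stragglers are among the reasons scaling is still sublinear in practice.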

Intel offered the following performance numbers, given as peak GFLOPs of FP32
math using the OpenCL-based CLPeak benchmark.
GPU: Intel Xe
https://www.anandtech.com/show/16018/intel-xe-hp-graphics-early-samples-offer-42-tflops-of-fp32-performance

Peak Performance:
● 46.1 TFLOPs Single Precision Matrix (FP32)
● 23.1 TFLOPs Single Precision (FP32)
● 184.6 TFLOPs Half Precision (FP16)
● 11.5 TFLOPs Double Precision (FP64)
● 92.3 TFLOPs bfloat16
● 184.6 TOPs INT8 and INT4
32 GB HBM2, Up to 1228.8 GB/s
300W
Announced support for TensorFlow, PyTorch, etc!
AMD Instinct MI100
https://www.amd.com/en/products/server-accelerators/instinct-mi100
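
GFLOPS/Watt, one of the comparison dimensions listed earlier, follows directly from the figures on this slide. The sketch divides each peak number by the 300W board power; these are peak-over-rated-power ratios, not measured efficiency, and the helper name is illustrative.

```python
MI100_PEAK_TFLOPS = {
    "FP64": 11.5,
    "FP32": 23.1,
    "FP32 matrix": 46.1,
    "bfloat16": 92.3,
    "FP16": 184.6,
}
BOARD_POWER_WATTS = 300


def gflops_per_watt(tflops, watts):
    """Peak throughput per watt, in GFLOPS/W."""
    return tflops * 1000.0 / watts


for precision, tflops in MI100_PEAK_TFLOPS.items():
    print(f"{precision:>12}: {gflops_per_watt(tflops, BOARD_POWER_WATTS):6.1f} GFLOPS/W")
```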

There should be: Intel Xe, 42.2 TFLOPS (FP32); AMD Instinct MI100, 46.1 TFLOPS (FP32)
Problems
Serious problems with the current processors (CPU/GPU) are:
● Energy efficiency:
○ The version of AlphaGo playing against Lee Sedol used 1,920 CPUs and
280 GPUs (https://en.wikipedia.org/wiki/AlphaGo)
○ Estimated power consumption was approximately 1 MW (200 W per
CPU and 200 W per GPU), compared with only 20 watts used by the human
brain (https://jacquesmattheij.com/another-way-of-looking-at-lee-sedol-vs-alphago/)
● Architecture:
○ good for matrix multiplication (still the essence of DL)
○ but not well suited to brain-like computations
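
The energy comparison above is simple arithmetic, sketched here with the per-chip figures from the slide. Note that the processors alone add up to about 0.44 MW; the gap to the quoted estimate of roughly 1 MW is presumably system-level overhead, which is an assumption here and not something the cited sources are claimed to state.

```python
CPUS, GPUS = 1920, 280
WATTS_PER_CPU = 200
WATTS_PER_GPU = 200
BRAIN_WATTS = 20
QUOTED_ESTIMATE_WATTS = 1_000_000   # the ~1 MW figure quoted on the slide


def chip_power_watts(cpus, gpus, watts_per_cpu, watts_per_gpu):
    """Power drawn by the processors alone, ignoring cooling, networking, etc."""
    return cpus * watts_per_cpu + gpus * watts_per_gpu


chips = chip_power_watts(CPUS, GPUS, WATTS_PER_CPU, WATTS_PER_GPU)
print(f"processors only: {chips / 1e6:.2f} MW")
print(f"processors vs brain: {chips // BRAIN_WATTS}x")
print(f"quoted estimate vs brain: {QUOTED_ESTIMATE_WATTS // BRAIN_WATTS}x")
```

Either way the gap to the brain is four to five orders of magnitude, which is the point of the slide.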

FPGA
● FPGA (field-programmable gate array) is an integrated circuit designed to be
configured by a customer or a designer after manufacturing
● Both FPGAs and ASICs (see later) are usually much more energy-efficient than
general purpose processors (so more productive with respect to GFLOPS per
Watt). FPGAs are usually used for inference, not training.
● OpenCL can be used as a development language for FPGAs (as can C/C++),
and some ML/DL libraries use OpenCL too (for example, Caffe). So an easy
way to do low-level ML on FPGAs could emerge.
● For high-level ML there are vendor tools and graph compilers (inference only).
● Can use FPGA in the cloud!
● See also MLIR (mentioned earlier).
● The learning curve for FPGAs is still too steep :(

FPGA in production
There is some interesting movement toward FPGAs:
● Amazon has FPGA F1 instances https://aws.amazon.com/ec2/instance-types/f1/
● Alibaba has FPGA F3 instances in the cloud
https://www.alibabacloud.com/blog/deep-dive-into-alibaba-cloud-f3-fpga-as-a-service-instances_594057
● Yandex uses FPGAs for its own DL inference.
● Microsoft ran (in 2015) Project Catapult that uses clusters of FPGAs
https://blogs.msdn.microsoft.com/msr_er/2015/11/12/project-catapult-servers-available-to-academic-researchers/
https://www.microsoft.com/en-us/research/project/project-catapult/
● Microsoft Project Brainwave: AI inference on FPGA
https://www.microsoft.com/en-us/research/project/project-brainwave/
● Microsoft Azure allows deploying pretrained models on FPGA (!).
https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-deploy-fpga-web-service
● Baidu has FPGA instances https://cloud.baidu.com/product/fpga.html
● ...

Two main manufacturers: Intel (ex. Altera) and Xilinx.
The ‘world’s largest’ FPGA chips
● Intel Stratix 10 GX 10M
>10.2 million logic cells, 43.3B transistors
https://www.techpowerup.com/260906/intel-unveils-worlds-largest-fpga
● Xilinx Virtex UltraScale+ VU19P
9M system logic cells, 35B transistors
https://www.xilinx.com/products/silicon-devices/fpga/virtex-ultrascale-plus-vu19p.html
Intel has a hybrid Xeon+FPGA chip
https://www.top500.org/news/intel-ships-xeon-skylake-processor-with-integrated-fpga/
Intel has FPGA acceleration cards
https://www.intel.com/content/www/us/en/programmable/solutions/acceleration-hub/platforms.html
FPGA chips

Adaptive compute acceleration platform (ACAP)
Xilinx Versal ACAP, a fully software-programmable,
heterogeneous compute platform that combines Scalar Engines,
Adaptable Engines, and Intelligent Engines.
The Intelligent Engines are an array of VLIW and SIMD
processing engines and memories, all interconnected with 100s
of terabits per second of interconnect and memory bandwidth.
These permit 5X–10X performance improvement for ML and
DSP applications.
https://www.xilinx.com/products/silicon-devices/acap/versal-premium.html
https://www.xilinx.com/products/silicon-devices/acap/versal-ai-core.html
https://www.xilinx.com/support/documentation/white_papers/wp505-versal-acap.pdf

FPGA: Xilinx Vitis AI
Vitis AI is Xilinx’s development stack
for AI inference on Xilinx hardware
platforms, including both edge devices
and Alveo cards.
It consists of optimized IP, tools,
libraries, models, and example
designs.
Xilinx ML Suite is now deprecated.
https://github.com/Xilinx/Vitis-AI

FPGA: Intel OpenVINO toolkit
The OpenVINO (Open Visual Inference and Neural Network Optimization) toolkit
offers software developers a single toolkit to accelerate their solutions across
multiple hardware platforms including FPGAs
https://software.intel.com/content/www/us/en/develop/tools/openvino-toolkit.html

ASIC custom chips
ASIC (application-specific integrated circuit) is an integrated circuit customized for a
particular use, rather than intended for general-purpose use.
There is a lot of movement toward ASICs right now:
● Google has Tensor Processing Units (TPU v2/v3) in the cloud, v4 exists too.
● Intel acquired Habana, Mobileye, Movidius, Nervana and has processors for training
and inference.
● Graphcore has its second generation IPU.
● AWS has its own chips for training and inference
● Alibaba Hanguang 800
● Huawei Ascend 310, 910
● Bitmain Sophon, Cerebras, Groq, and many many others …
Many ASICs are built for multi-chip and supercomputer configurations!
https://blog.inten.to/hardware-for-deep-learning-part-4-asic-96a542fe6a81

Case: AlphaGo Zero
https://deepmind.com/blog/alphago-zero-learning-scratch/

ASIC: Google TPU
TPU v2
● 180 TFLOPS (bfloat16)
● 64 GB HBM
● $4.50 / TPU hour
https://cloud.google.com/tpu/
https://cloud.google.com/tpu/docs/tpus
https://cloud.google.com/tpu/docs/system-architecture
TPU v3
● 420 TFLOPS (bfloat16)
● 128 GB HBM
● $8.00 / TPU hour
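
The two price points can be normalized to cost per unit of peak compute. A small sketch using only the figures on this slide (peak bfloat16 TFLOPS and the listed hourly price); real cost-efficiency depends on achieved utilization, which is not modeled, and the function name is illustrative.

```python
TPUS = {
    "TPU v2": {"tflops": 180, "usd_per_hour": 4.50},
    "TPU v3": {"tflops": 420, "usd_per_hour": 8.00},
}


def usd_per_tflops_hour(tflops, usd_per_hour):
    """Hourly price divided by peak throughput."""
    return usd_per_hour / tflops


for name, spec in TPUS.items():
    cost = usd_per_tflops_hour(spec["tflops"], spec["usd_per_hour"])
    print(f"{name}: ${cost:.4f} per peak TFLOPS-hour")
```

On these numbers v3 costs about 1.8x more per hour but delivers about 2.3x the peak compute, so it is cheaper per peak TFLOPS.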

A “TPU v3 pod” 100+ petaflops, 32 TB HBM, 2-D toroidal mesh network
Many ASICs are built for multi-chip configurations

ASIC: Intel (Nervana) NNP-T [discontinued]
Processor for training. Can build PODs (say 10-rack POD with 480 NNP-T)
● 24 Tensor Processing Cluster (TPC)
● PCIe Gen 4 x16 accelerator card, 300W
● OCP Accelerator Module, 375W
● 119 TOPS bfloat16
● 32 GB HBM2
https://www.intel.ai/nervana-nnp/nnpt/
https://en.wikichip.org/wiki/nervana/microarchitectures/spring_crest

ASIC: Intel (Nervana) NNP-I [discontinued]
Processor for inference using mixed precision math, with a special emphasis on low-precision
computations using INT8.
● 12 inference compute engines (ICE) + 2 Intel architecture cores (AVX+VNNI)
● M.2 form factor (1 chip): 12W, up to 50 TOPS.
● PCIe card (2 chips): 75W, up to 170 TOPS.
https://www.intel.ai/nervana-nnp/nnpi
https://en.wikichip.org/wiki/intel/microarchitectures/spring_hill

ASIC: Habana
Gaudi: training chip HL-2000. Designed to scale well.
● PCIe 4.0 x16, 32 GB HBM2, 1Tb/s, ECC, RDMA
● 200-300W
● FP32, BF16, INT/UINT 32, 16, 8
https://habana.ai/training/
Goya: inference chip HL-1000
● PCIe 4.0 x16, 4/8/16 GB DDR4, ECC, 200W
● FP32, INT/UINT 32, 16, 8
https://habana.ai/inference/

ASIC: Graphcore IPU
Graphcore IPU: for both training and inference.
Allows new and emerging machine intelligence
workloads to be realized.
Colossus MK2 GC200 IPU:
● 59.4B transistors, 1472 independent processor cores running 8832 independent
parallel program threads
● 250 TFLOPS mixed precision
● 900MB in-processor mem, 47.5TB/s memory bandwidth
● 8TB/s on-chip exchange between cores, 320GB/s chip-to-chip bandwidth
● IPU-M2000 systems with 4xIPU (1 PFLOPS FP16)
IPU on Azure
https://www.graphcore.ai/posts/microsoft-and-graphcore-collaborate-to-accelerate-artificial-intelligence

● Cerebras Systems Wafer Scale Engine (WSE), an
AI chip that measures 8.46x8.46 inches, making it
almost the size of an iPad.
● WSE chip has 1.2 trillion transistors. For
comparison, NVIDIA’s A100 GPU contains 54
billion transistors, 22x less!
● 400,000 computing cores and 18 gigabytes of
memory with 9 PB/s memory bandwidth.
● Cerebras CS-1 is a system built on WSE.
● WSE 2nd gen is announced! 850,000 AI-optimized
cores, 2.6 Trillion Transistors
https://cerebras.net/
Cerebras

ASIC: AWS Inferentia Chips
AWS Inferentia chips are designed to accelerate inference.
● 64 TOPS on 16-bit floating point (FP16 and BF16) and mixed-precision data.
● 128 TOPS on 8-bit integer (INT8) data.
● Up to 16 chips in the largest instance (inf1.24xlarge)
https://aws.amazon.com/machine-learning/inferentia/
https://github.com/aws/aws-neuron-sdk
https://aws.amazon.com/blogs/aws/amazon-ec2-update-inf1-instances-with-aws-inferentia-chips-for-high-performance-cost-effective-inferencing/

ASIC: AWS Trainium
(December 1st, 2020) Amazon announced its AWS Trainium chip. It will be available in 2021.
https://aws.amazon.com/machine-learning/trainium/

ASIC: Huawei Ascend 310
● 22 TOPS INT8
● 11 TFLOPS FP16
● 8W of power consumption.
Atlas 300I Inference Card:
● 32 GB LPDDR4X with a bandwidth of 204.8 GB/s
● PCIe x16 Gen3.0 device, max 67 W
● A single card provides up to 88 TOPS INT8

ASIC: Huawei Ascend 910
● 32 built-in Da Vinci AI Cores and 16 TaiShan Cores
● 320 TFLOPS (FP16), 640 TOPS (INT8). It’s pretty
close to NVIDIA’s A100 BF16 peak performance of
312 TFLOPS
Atlas 300T Training Card
● 32 GB HBM or 16GB DDR4 2933
● PCIe x16 Gen4.0
● up to 300W power consumption

ASIC: Bitmain Sophon
Tensor Computing Processors:
● BM1680 (1st gen, 2 TFLOPS FP32, 32MB SRAM, 25W)
● BM1682 (2nd gen, 3 TFLOPS FP32, 16MB SRAM)
● BM1684 (3rd gen, 2.2TFLOPS FP32, 17.6 TOPS INT8, 32 MB SRAM)
● BM1880 (1 TOPS INT8).
There are Deep Learning Acceleration PCIe Cards:
● SC3 with a BM1682 chip (8 GB DDR memory, 65W)
● SC5 and SC5H with a BM1684 chip and 12 GB RAM
(up to 16 GB) with 30W max power consumption
● SC5+ with 3x BM1684 and 36 GB memory (up to 48 GB)
with 75W max power consumption.
https://www.sophon.ai/product/introduce/bm1684.html

ASIC: Alibaba Hanguang 800
AI-Inference Chip
Its performance is independent of the batch size.

ASIC: Baidu Kunlun
● 14nm chip
● 16GB HBM memory with
512 GB/s bandwidth
● up to 260 TOPS INT8
(that’s twice the INT8 performance of NVIDIA TESLA T4)
● 64 TFLOPS INT16/FP16 at 150W.
● This chip looks like an inference chip. The processor can be
accessed via Baidu Cloud.
September 2020, Baidu announced Kunlun 2. The new chip uses 7
nm process technology and its computational capability is over three
times that of the previous generation. The mass production of the
chip is expected to begin in early 2021.

ASIC: Groq TSP
Groq develops its own Tensor Streaming Processor.
Jonathan Ross, Groq’s CEO, previously co-founded
Google’s first TPU project.
● 14nm chip, 26.8B transistors
● 220MB SRAM with 80TB/s on-die memory bandwidth
● no board memory
● PCIe Gen4 x16 with 31.5 GB/s in each direction.
● up to 1000 TOPS INT8 and 250 TFLOPS FP16 (with FP32 acc).
For comparison, NVIDIA A100 has 312 TFLOPS on dense FP16
calculations with FP32 acc, and 624 TOPS INT8. Groq’s INT8 figure is even
larger than the 825 TOPS INT8 of Alibaba’s Hanguang 800.

ASIC: Others
● Qualcomm Cloud AI 100 (inference)
https://www.qualcomm.com/products/cloud-artificial-intelligence/cloud-ai-100
● Wave Computing Dataflow Processing Unit (DPU)
https://wavecomp.ai/products/
● ARM ML inference NPU Ethos-N78
https://www.arm.com/products/silicon-ip-cpu/machine-learning/arm-ml-processor
● SambaNova came out of stealth-mode in December 2020 with their
Reconfigurable Dataflow Architecture (RDA) delivering “100s of TFLOPS”.
https://sambanova.ai/
● Mythic focuses on Compute-in-Memory, Dataflow Architecture, and Analog
Computing https://www.mythic-ai.com/technology/
● Intel eASIC: an intermediary technology between FPGAs and standard-cell
ASICs with lower unit-cost and faster time-to-market
https://www.intel.com/content/www/us/en/products/programmable/asic/easic-devices.html
● ...

ASIC: Summary
● Very diverse field!
● Hard to directly compare different solutions based on their characteristics (the
architectures can be too different).
● You can use a common benchmark like https://mlperf.org/
● DL framework support is usually limited, some solutions use their own
frameworks/libraries.

AI at the edge
● NVidia Jetson TK1/TX1/TX2/Xavier/Nano
○ 192/256/256/512/128 CUDA Cores
○ 4/4/6/8/4-Core ARM CPU, 2/4/8/16/4 GB Mem
● Tablets, Smartphones
○ Qualcomm Snapdragon 845/855, Apple A11/12/Bionic, Huawei Kirin 970/980/990 etc
● Raspberry Pi 4 (1.5 GHz 4-core, 4GB mem)
● Movidius Neural Compute Stick, Stick 2
● Google Edge TPU

(Nov 25, 2020) “Our brand-new 6th gen Qualcomm AI Engine includes the
Qualcomm® Hexagon™ 780 Processor with a fused AI-accelerator architecture,
plus the Tensor Accelerator with 2 times the compute capacity. This Qualcomm AI
Engine astonishes with up to 26 TOPS performance.”
https://www.qualcomm.com/products/snapdragon-888-5g-mobile-platform
Mobile AI: Qualcomm SnapDragon 888

“HUAWEI’s self-developed Da Vinci architecture NPU delivers better power efficiency,
stronger processing capabilities and higher accuracy. The powerful Big-Core plus ultra-low
consumption Tiny-Core contribute to an enormous boost in AI performance. In AI face
recognition, the efficiency of NPU Tiny-Core can be enhanced up to 24x than the Big-Core.
With 2 Big-Core plus 1 Tiny-Core, the NPU of Kirin 990 5G is ready to unlock the magic
of the future.”
https://consumer.huawei.com/en/campaign/kirin-990-series/
“Huawei intends to scale this AI processing block from servers to smartphones. It supports both INT8
and FP16 on both cores, whereas the older Cambricon design could only perform INT8 on one core.
There’s also a new ‘Tiny Core’ NPU. It’s a smaller version of the Da Vinci architecture focused on power
efficiency above all else, and it can be used for polling or other applications where performance isn’t
particularly time critical. The 990 5G will have two “big” NPU cores and a single Tiny Core, while the Kirin
990 (LTE) has one big core and one tiny core.”
https://www.extremetech.com/mobile/298028-huaweis-kirin-990-soc-is-the-first-chip-with-an-integrated-5g-modem
Mobile AI: Huawei Kirin 970, 980, 990 (NPU)

(Sep 15, 2020) Apple unveils A14 Bionic processor with
40% faster CPU and 11.8 billion transistors
“The chip has a 16-core neural engine that can execute 11 trillion
AI operations per second. The neural engine core count is twice
the previous chip, and can perform machine learning computations
10 times faster. The A14 has six CPU cores and four graphics
processing unit (GPU) cores.”
https://venturebeat.com/2020/09/15/apple-unveils-a14-bionic-processor-with-40-faster-cpu-and-11-8-billion-transistors/
Mobile AI: Apple (Neural Engine)

(January 12, 2021) Samsung sets new standard for
flagship mobile processors with Exynos 2100
“AI capabilities will also enjoy a significant boost with the Exynos
2100. The newly-designed tri-core NPU has architectural enhancements such as
minimizing unnecessary operations for high effective utilization and support for
feature-map and weight compression. Exynos 2100 can perform up to 26-trillion-
operations-per-second (TOPS) with more than twice the power efficiency than
the previous generation. With on-device AI processing and support for advanced
neural networks, users will be able to enjoy more interactive and smart features as
well as enhanced computer vision performance in applications such as imaging.”
https://www.samsung.com/semiconductor/minisite/exynos/newsroom/pressrelease/samsung-sets-new-standard-for-flagship-mobile-processors-with-exynos-2100/
Mobile AI: Samsung (NPU)

(Aug 7, 2019) MediaTek Announces Dimensity 1000 ARM
Chip With Integrated 5G Modem
“The Dimensity 1000 doesn’t just bring new branding; it’s also
sporting four Cortex A77 CPU cores and four Cortex A55 CPU
cores, all built on a 7nm process node. There’s also a 9-core Mali GPU, a 5-core
ISP, and a 6-core AI processor.
The MediaTek AI Processing Unit APU 3.0 is a brand new architecture. It
houses six AI processors (two big cores, three small cores and a single tiny core)
The new APU 3.0 brings devices a significant performance boost at 4.5 TOPS.”
https://www.extremetech.com/extreme/302712-mediatek-announces-dimensity-1000-arm-chip-with-integrated-5g-modem
https://i.mediatek.com/mediatek-5g
Mobile AI: MediaTek (APU)

AI at the Edge: Jetson Nano
Price: $99 ($59 for 2GB)
NVIDIA Jetson Nano Developer Kit is a small, powerful
computer that lets you run multiple neural networks in parallel
for applications like image classification, object detection, segmentation,
and speech processing. All in an easy-to-use platform that runs in as little as 5
watts.
● 128-core Maxwell GPU + Quad-core ARM A57, 472 GFLOPS
● 4 GB 64-bit LPDDR4 25.6 GB/s
https://developer.nvidia.com/embedded/jetson-nano-developer-kit
See also Jetson TX1, TX2, Xavier: https://developer.nvidia.com/embedded/develop/hardware

Neural Compute Stick 2 (~$70)
The latest generation of Intel® VPUs includes 16
powerful processing cores (called SHAVE cores) and
a dedicated deep neural network hardware accelerator for high-performance
vision and AI inference applications—all at low power.
● Supports Convolutional Neural Network (CNN)
● Support: TensorFlow, Caffe, Apache MXNet, ONNX, PyTorch, and
PaddlePaddle via an ONNX conversion
● Processor: Intel Movidius Myriad X Vision Processing Unit (VPU)
● Connectivity: USB 3.0 Type-A
https://software.intel.com/en-us/neural-compute-stick
AI at the Edge: Movidius

AI at the Edge: Google Edge TPU
The Edge TPU is a small ASIC designed by Google that provides
high performance ML inferencing for low-power devices. For
example, it can execute state-of-the-art mobile vision models such
as MobileNet V2 at 400 FPS, in a power efficient manner.
The on-board Edge TPU coprocessor is capable of performing 4 TOPS
using 0.5 watts for each TOPS (2 TOPS per watt).
TensorFlow Lite models can be compiled to run on the Edge TPU.
USB/Mini PCIe/M.2 A+E key/M.2 B+M key/SoM/Dev Board
https://cloud.google.com/edge-tpu/
https://coral.ai/products/

● Sophon Neural Network Stick (NNS)
https://www.sophon.ai/product/introduce/nns.html
● Xilinx Edge AI (FPGA!)
https://www.xilinx.com/applications/industrial/analytics-machine-learning.html
● The Hailo-8 M.2 Module
https://hailo.ai/product-hailo/hailo-8-m2-module/
● More:
https://github.com/crespum/edge-ai
AI at the Edge: Others

Problems
Even with FPGA/ASIC and edge devices:
● Energy efficiency:
○ Better than CPU/GPU, but still far from the 20 watts used by the human brain
● Architecture:
○ Even more specialized for ML/DL computations, but...
○ Still far from brain-like computations

Neuromorphic chips
● Neuromorphic computing - brain-inspired computing - has emerged as a new
technology to enable information processing at very low energy cost using
electronic devices that emulate the electrical behaviour of (biological) neural
networks.
● Neuromorphic chips attempt to model in silicon the massively parallel way the
brain processes information as billions of neurons and trillions of synapses
respond to sensory inputs such as visual and auditory stimuli.
● DARPA SyNAPSE program (Systems of Neuromorphic Adaptive Plastic
Scalable Electronics)
● IBM TrueNorth; Stanford Neurogrid; HRL neuromorphic chip; Human Brain
Project SpiNNaker and HICANN.
https://www.technologyreview.com/s/526506/neuromorphic-chips/
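
As an illustration of what "emulating the electrical behaviour of neurons" means computationally, here is a minimal leaky integrate-and-fire (LIF) neuron in discrete time. LIF is a common textbook spiking model used here as an assumption; the parameter values are arbitrary, and no claim is made that any chip named above implements exactly this.

```python
def simulate_lif(input_current, threshold=1.0, leak=0.9, reset=0.0):
    """Return a 0/1 spike train for a leaky integrate-and-fire neuron."""
    potential = reset
    spikes = []
    for current in input_current:
        potential = leak * potential + current   # membrane leaks, then integrates input
        if potential >= threshold:
            spikes.append(1)                     # fire and reset
            potential = reset
        else:
            spikes.append(0)
    return spikes


print(simulate_lif([0.5] * 6))    # strong input: periodic spikes
print(simulate_lif([0.05] * 10))  # weak input: the leak wins, no spikes
```

The neuron only produces output (and, on event-driven hardware, only consumes significant energy) when it spikes, which is the intuition behind the low power figures quoted for neuromorphic chips.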

Neuromorphic chips: IBM TrueNorth
● 1M neurons, 256M synapses, 4096 neurosynaptic
cores on a chip, est. 46B synaptic ops per sec per W
● Uses 70mW, power density is 20 milliwatts per
cm^2— almost 1/10,000th the power of most modern
microprocessors
● “Our sights are now set high on the ambitious goal of
integrating 4,096 chips in a single rack with 4B neurons and 1T synapses while
consuming ~4kW of power”.
● Currently IBM is making plans to commercialize it.
● (2016) Lawrence Livermore National Lab got a cluster of 16 TrueNorth chips
(16M neurons, 4B synapses, for context, the human brain has 86B neurons).
When running flat out, the entire cluster will consume a grand total of 2.5 watts.
http://spectrum.ieee.org/tech-talk/computing/hardware/ibms-braininspired-computer-chip-comes-from-the-future
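
The scaling figures on this slide can be cross-checked with simple arithmetic from the per-chip numbers (1M neurons, 256M synapses). The sketch also derives the implied power per chip at system level from the quoted cluster and rack totals; the function name is illustrative.

```python
NEURONS_PER_CHIP = 1_000_000
SYNAPSES_PER_CHIP = 256_000_000


def scale(chips):
    """Total neurons and synapses for a system built from `chips` TrueNorth chips."""
    return chips * NEURONS_PER_CHIP, chips * SYNAPSES_PER_CHIP


for chips, total_watts in ((16, 2.5), (4096, 4000.0)):
    neurons, synapses = scale(chips)
    print(f"{chips} chips: {neurons / 1e6:.0f}M neurons, {synapses / 1e9:.1f}B synapses, "
          f"{total_watts / chips * 1000:.0f} mW per chip at system level")
```

The 16-chip cluster comes out at 16M neurons and about 4.1B synapses, and the 4,096-chip rack at about 4.1B neurons and about 1.05T synapses, matching the rounded figures on the slide. Both system-level power figures per chip are above the 70mW quoted for the chip itself.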

Neuromorphic chips: IBM TrueNorth
● (03.2016) IBM Research demonstrated convolutional neural nets with close to
state of the art performance:
“Convolutional Networks for Fast, Energy-Efficient Neuromorphic Computing”, http://arxiv.org/abs/1603.08270

Neuromorphic chips: Intel Loihi
● Fully asynchronous neuromorphic many core mesh that
supports a wide range of sparse, hierarchical and recurrent
neural network topologies
● Each neuromorphic core includes a learning engine that can
be programmed to adapt network parameters during
operation, supporting supervised, unsupervised,
reinforcement and other learning paradigms.
● Fabrication on Intel’s 14 nm process technology.
● A total of 130,000 neurons and 130 million synapses.
● Development and testing of several algorithms with high
algorithmic efficiency for problems including path planning,
constraint satisfaction, sparse coding, dictionary learning,
and dynamic pattern learning and adaptation.
https://newsroom.intel.com/editorials/intels-new-self-learning-chip-promises-accelerate-artificial-intelligence/
https://techcrunch.com/2018/01/08/intel-shows-off-its-new-loihi-ai-chip-and-a-new-49-qubit-quantum-chip/
https://ieeexplore.ieee.org/document/8259423
https://en.wikichip.org/wiki/intel/loihi

Neuromorphic chips: Intel Pohoiki Beach
(Jul 15, 2019) “Intel announced that an 8 million-neuron
neuromorphic system comprising 64 Loihi research chips
— codenamed Pohoiki Beach — is now available to the broader
research community. With Pohoiki Beach, researchers can
experiment with Intel’s brain-inspired research chip, Loihi, which
applies the principles found in biological brains to computer
architectures. ”
https://newsroom.intel.com/news/intels-pohoiki-beach-64-chip-neuromorphic-system-delivers-breakthrough-results-research-tests/

Neuromorphic chips: Intel Pohoiki Springs
https://www.nextplatform.com/2020/03/19/intel-smells-neuromorphic-opportunity/

“Using Intel's Loihi neuromorphic research chip and
ABR's Nengo Deep Learning toolkit, we analyze the
inference speed, dynamic power consumption, and
energy cost per inference of a two-layer neural
network keyword spotter trained to recognize a single
phrase. We perform comparative analyses of this
keyword spotter running on more conventional
hardware devices including a CPU, a GPU, Nvidia's
Jetson TX1, and the Movidius Neural Compute
Stick.”
Benchmarking Keyword Spotting Efficiency on Neuromorphic Hardware

Intel Benchmarks for Loihi Neuromorphic Computing Chip
https://www.eetasia.com/intel-benchmarks-for-loihi-neuromorphic-computing-chip/

https://newsroom.intel.com/wp-content/uploads/sites/11/2020/12/Neuromorphic-Computing-slides-B.pdf

NxTF: a Keras-like API for SNNs on Loihi
“NxTF: An API and Compiler for Deep Spiking Neural Networks on Intel Loihi”
https://github.com/intel-nrc-ecosystem/models/tree/master/nxsdk_modules_ncl/dnn

Neuromorphic chips: Tianjic
Tianjic’s unified function core (FCore) combines essential
building blocks for both artificial neural networks and biologically
inspired networks: axon, synapse, dendrite and soma blocks. The 28-nm
chip consists of 156 FCores, containing approximately 40,000
neurons and 10 million synapses in an area of 3.8×3.8 mm^2.
Tianjic delivers an internal memory bandwidth of more than 610 GB
per second, and a peak performance of 1.28 TOPS per watt for
running artificial neural networks. In the biologically-inspired spiking
neural network mode, Tianjic achieves a peak performance of about
650 giga synaptic operations per second (GSOPS) per watt.
https://medium.com/syncedreview/nature-cover-story-chinese-teams-tianjic-chip-bridges-machine-learning-and-neuroscience-in-f1c3e8a03113
https://www.nature.com/articles/s41586-019-1424-8

Neuromorphic chips: Others
● SpiNNaker (1,036,800 ARM9 cores)
http://apt.cs.manchester.ac.uk/projects/SpiNNaker/
● SpiNNaker-2
https://niceworkshop.org/wp-content/uploads/2018/05/2-27-SHoppner-SpiNNaker2.pdf
https://arxiv.org/abs/1911.02385 “SpiNNaker 2: A 10 Million Core Processor System for Brain
Simulation and Machine Learning”
● BrainScaleS, HICANN: 20x 8-inch silicon wafers, each incorporating 50 x 10^6
plastic synapses and 200,000 biologically realistic neurons.
https://www.humanbrainproject.eu/en/silicon-brains/how-we-work/hardware/
● Akida NSoC: 1.2 million neurons and 10 billion synapses
https://www.brainchipinc.com/products/akida-neuromorphic-system-on-chip
https://www.nextplatform.com/2020/01/30/neuromorphic-chip-maker-takes-aim-at-the-edge/
https://en.wikichip.org/wiki/brainchip/akida
● Neurogrid: Neurogrid can model a slab of cortex with up to 16x256x256
neurons (>1M) https://web.stanford.edu/group/brainsinsilicon/neurogrid.html
https://web.stanford.edu/group/brainsinsilicon/documents/BenjaminEtAlNeurogrid2014.pdf

From: https://d1io3yog0oux5.cloudfront.net/_51d5497ffa729abd180ed52c4234217f/brainchipinc/db/217/1582/pdf/Akida+Launch+Presentation.pdf

Other approaches
● Memristors https://spectrum.ieee.org/semiconductors/design/the-mysterious-memristor
● Quantum computing https://ai.googleblog.com/2019/10/quantum-supremacy-using-programmable.html
● Optical computing https://www.nextplatform.com/2019/05/31/startup-looks-to-light-up-machine-learning/
● DNA computing https://www.wired.com/story/finally-a-dna-computer-that-can-actually-be-reprogrammed/
● Unconventional computing: cellular automata, reservoir computing, using
biological cells/neurons, chemical computation, membrane computing, slime
mold computing and much more https://www.springer.com/gp/book/9781493968824
● ...

References:
Hardware for Deep Learning series of posts:
https://blog.inten.to/hardware-for-deep-learning-current-state-and-trends-51c01ebbb6dc
● Part 1: Introduction and Executive summary
● Part 2: CPU
● Part 3: GPU
● Part 4: ASIC
● Part 5: FPGA
● Part 6: Mobile AI
● Part 7: Neuromorphic computing
● Part 8: Quantum computing

https://ru.linkedin.com/in/grigorysapunov
gs@inten.to
Thanks!
