
"The Xilinx AI Engine: High Performance with Future-proof Architecture Adaptability," a Presentation from Xilinx


Nick Ni, Director of Product Marketing at Xilinx, presents the "Xilinx AI Engine: High Performance with Future-proof Architecture Adaptability" tutorial at the May 2019 Embedded Vision Summit.

AI inference demands orders-of-magnitude more compute capacity than what today's SoCs offer. At the same time, neural network topologies are changing too quickly to be addressed by ASICs that take years to go from architecture to production. In this talk, Ni introduces the Xilinx AI Engine, which complements the dynamically-programmable FPGA fabric to enable ASIC-like performance via custom data flows and a flexible memory hierarchy. This combination provides an orders-of-magnitude boost in AI performance along with the hardware architecture flexibility needed to quickly adapt to rapidly evolving neural network topologies.


"The Xilinx AI Engine: High Performance with Future-proof Architecture Adaptability," a Presentation from Xilinx

  1. © 2019 Xilinx. The Xilinx AI Engine: High Performance with Future-proof Architecture Adaptability. Vinod Kathail, Xilinx, May 2019.
  2. Motivation for AI Engine
  3. Motivation for AI Engine: Dynamic Markets Require an Adaptive Compute Acceleration Platform (ACAP). "AI everywhere" applications (machine learning, ADAS/AD, 5G, smart city, smart factory, data center workloads) demand compute intensity, real-time capability, and power efficiency, while technology scaling (Moore's Law performance and power scaling) and traditional single-/multi-core architectures fall short.
  4. Versal ACAP Architecture Overview. Scalar Engines: platform control, edge compute. Adaptable Engines: 2X compute density. AI Engines: AI compute, diverse DSP workloads. Programmable I/O: any interface or sensor, including 4.2Gb/s MIPI. DDR Memory: 3200-DDR4, 3200-LPDDR4, 2X bandwidth/pin. Protocol Engines: integrated 600G cores, 4X encrypted bandwidth. PCIe & CCIX: 2X PCIe & DMA bandwidth, cache-coherent interface to accelerators. Transceivers: broad range, 25G to 112G, 58G in mainstream devices. Network-on-Chip: guaranteed bandwidth, enables SW programmability.
  5. Introducing the AI Engine. Targets artificial intelligence (CNN, LSTM, MLP) and signal processing (computer vision). 1GHz+ multi-precision vector processor; high-bandwidth extensible memory; up to 400 AI Engines per device; 8X compute density; 40% lower power. SW programmable, adaptable, intelligent, deterministic, efficient.
  6. Software Programmable: Any Developer. Three programming abstraction levels: (1) frameworks; (2) data flow with Xilinx libraries (4G/5G/radar, AI, vision); (3) data flow with user-defined libraries and kernel programs in C/C++. The AI Engine compiler supports the compile, design, and run flow.
  7. AI Engine Application Performance & Power Efficiency. Image classification (GoogleNet, <1ms): 20x AI inference compute for Xilinx 7nm Versal with AI Engine vs. Xilinx 16nm UltraScale+, at 40% less power. Massive MIMO radio (DUC, DDC, CFR, DPD): 5x 5G wireless bandwidth vs. Xilinx UltraScale+.
  8. AI Engine Architecture
  9. AI Engine: Tile-Based Architecture. Each tile contains an ISA-based vector processor (software programmable, e.g., in C/C++) with AI and 5G vector extensions; local multi-bank memory shared across neighboring cores; data movers with integrated synchronization primitives for non-neighbor data communication; a non-blocking interconnect with high GB/s bandwidth per tile; and a cascade interface that passes partial results to the next core. The array connects to the PL, PS, and I/O.
  10. AI Engine: Array Architecture. Modular and scalable: more tiles mean more compute, up to 400 AI Engines per device (Versal AI Core VC1902). An array of AI Engines increases compute, memory, and communication bandwidth, and a distributed memory hierarchy maximizes memory bandwidth, delivering deterministic performance and low latency.
  11. AI Engine: Processor Core. A 32-bit scalar RISC processor (scalar register file, scalar ALU, non-linear functions) is paired with a vector processor built around a 512-bit SIMD datapath (fixed-point and floating-point vector units, vector register file) supporting 8/16/32-bit and SPFP operands. Local, shareable memory: 32KB local, 128KB addressable. Up to 128 MACs per clock cycle per core (INT8). Instruction parallelism via VLIW gives 7+ operations per clock cycle (2 vector loads, 1 multiply, 1 store, 2 scalar ops, stream access), supported by instruction fetch & decode, three AGUs, two load units, one store unit, and a highly parallel memory interface plus a stream interface.
  12. Data Movement Architecture. Memory communication: neighboring cores share memory buffers (e.g., ping-pong banks B0/B1 and B2/B3) through the memory interface, forming dataflow graphs and dataflow pipelines. Streaming communication: the stream interface supports non-neighbor and multicast connections, and the cascade interface streams partial results directly between cores.
  13. AI Engine Integration in Versal. TB/s of interface bandwidth from the AI Engine to the programmable logic and to the NoC. Leveraging NoC connectivity, the processing system manages configuration, debug, and trace, and the AI Engine can access DRAM without going through the PL.
  14. AI Engine: Multi-Core Compute with Dedicated Memory. Traditional multi-core (cache-based) architectures use a fixed, shared interconnect (blocking limits compute; timing is not deterministic) and replicate data across the L0/L1/L2 caches and DRAM (robbing bandwidth and reducing capacity). The AI Engine array instead uses a dedicated, non-blocking, deterministic interconnect and local, distributed memory: no cache misses, higher bandwidth, and less capacity required.
  15. AI Engine Delivers High Compute Efficiency. Vector processor efficiency (peak kernel vs. theoretical performance): 95% for ML convolutions (block-based matrix multiplication, (32×64) × (64×32)), 80% for FFT (1024-pt FFT/iFFT), 98% for DPD (Volterra-based forward-path DPD). Enabled by an adaptable, non-blocking interconnect (flexible data movement that avoids interconnect bottlenecks); an adaptable memory hierarchy (local, distributed, and shareable for extreme bandwidth, with no cache misses or data replication, extensible to PL memory, BRAM/URAM); and data transfer while the AI Engine computes, overlapping compute and communication.
  16. AI Engine Programming and Applications
  17. Versal ACAP Development Tools. User tools span three audiences: supported frameworks for AI and data scientists, a unified software development environment for software application developers, and the Vivado Design Suite for hardware developers.
  18. Software Development Environment. A unified development environment provides full-chip programming: the whole application (e.g., C/C++) is software programmable with heterogeneous SW acceleration across the scalar processing subsystem, the adaptable programmable logic, and the intelligent AI Engines, guided by performance constraints, with full system simulation, hardware system debug & profiling, and a familiar software development experience.
  19. AI Engine Programming: Dataflow Model. (1) The user defines the dataflow logic (kernels a, b, c, d, e). (2) The user describes the dataflow graph using C/C++ APIs. (3) The compiler transparently manages placement and interconnect, physically mapping the kernels onto AI Engine vector cores and memories (and the PL).
  20. Accelerating AI Inference. (1) The user works in the deep learning framework of choice, develops and trains a custom network, and provides the trained model. (2) The Xilinx DNN Compiler implements the network, targeting AI inference on FPGA and Versal, applying optimizations (quantize, merge layers, prune) and compiling to AI Engines. (3) The Xilinx AI inference domain-specific architecture scales across hardware targets: start with Alveo U200/U250/U280 FPGA boards today, with new Versal-based acceleration cards to follow.
  21. AI Engine Delivers Real-time Inference Leadership. ResNet-50 inference performance (images/sec): Versal projected at 3.5x the GPU's throughput, and 4.5x with Xilinx pruning. Sources: GPU: Nvidia T4, TensorRT 5, published March 2019 (INT8, batch=4, 1.5ms latency); Versal card, projected (INT8, batch=8, 1.5ms latency).
  22. AI Engine: Accelerating AI Inference & Signal Processing. Software programmable (frameworks & C/C++; SW compile, debug & deploy). Deterministic (max throughput with low latency; real-time inference leadership). Efficient (up to 8X compute density at ~40% lower power). Headline gains: 10x for AI inference, 5x for signal processing.
  23. Additional Resources. For more information on ACAP and Versal, please visit: Please visit EVS Booth #610: face recognition using a Xilinx FPGA.