SlideShare a Scribd company logo
1 of 24
Download to read offline
SIMD Programming Introduction
Champ Yen (嚴梓鴻)
champ.yen@gmail.com
http://champyen.blogspt.com
Agenda
2
● What & Why SIMD
● SIMD in different Processors
● SIMD for Software Optimization
● What are important in SIMD?
● Q & A
Link of This Slides
https://goo.gl/Rc8xPE
What is SIMD (Single Instruction Multiple Data)
3
for(y = 0; y < height; y++){
for(x = 0; x < width; x+=8){
//process 8 point simutaneously
uin16x8_t va, vb, vout;
va = vld1q_u16(a+x);
vb = vld1q_u16(b+x);
vout = vaddq_u16(va, vb);
vst1q_u16(out, vout);
}
a+=width; b+=width; out+=width;
}
for(y = 0; y < height; y++){
for(x = 0; x < width; x++){
//process 1 point
out[x] = a[x]+b[x];
}
a+=width; b+=width; out+=width;
}
one lane
Why do we need to use SIMD?
4
Why & How do we use SIMD?
5
Image
Processing
Scientific Computing
Gaming
Deep
Neural
Network
SIMD in different Processor - CPU
6
● x86
− MMX
− SSE
− AVX/AVX-512
● ARM - Application
− v5 DSP Extension
− v6 SIMD
− v7 NEON
− v8 Advanced SIMD (NEON)
https://software.intel.com/sites/landingpage/IntrinsicsGuide/
http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A_arm_neon_intrinsics_ref.pdf
SIMD in different Processor - GPU
7
● SIMD
− AMD GCN
− ARM Mali
● WaveFront
− Nvidia
− Imagination PowerVR Rogue
SIMD in different Processor - DSP
8
● Qualcomm Hexagon 600 HVX
● Cadence IVP P5
● Synopsys EV6x
● CEVA XM4
SIMD Optimization
9
SIMD
Hardware
Design
SIMD
Framework /
Programming
Model
SIMD
Framework /
Programming
Model
Software
SIMD Optimization
10
● Auto/Semi-Auto Method
● Compiler Intrinsics
● Specific Framework/Infrastructure
● Coding in Assembly
● What are the difficult parts of SIMD Programming?
SIMD Optimization – Auto/SemiAuto
11
● compiler
− auto-vectorization optimization options
− #pragma
− IR optimization
● CilkPlus/OpenMP/OpenACC
Serial Code
for(i = 0; i < N; i++){
A[i] = B[i] + C[i];
}
#pragma omp simd
for(i = 0; i < N; i++){
A[i] = B[i] + C[i];
}
SIMD Pragma
https://software.intel.com/en-us/articles/performance-essentials-with-openmp-40-vectorization
SIMD Optimization – Intrinsics
12
● Arch-Dependent Intrinsics
− Intel SSE/AVX
− ARM NEON/MIPS ASE
− Vector-based DSP
● Common Intrisincs
− GCC/Clang vector intrinsics
− OpenCL / SIMD.js
● take Vector Width into consideration
● Portability between compilers
SIMD.js example from:
https://01.org/node/1495
SIMD Optimization – Intrinsics
13
4x4 Matrix Multiplication ARM NEON Example
http://www.fixstars.com/en/news/?p=125
//...
//Load matrixB into four vectors
uint16x4_t vectorB1, vectorB2, vectorB3, vectorB4;
vectorB1 = vld1_u16 (B[0]);
vectorB2 = vld1_u16 (B[1]);
vectorB3 = vld1_u16 (B[2]);
vectorB4 = vld1_u16 (B[3]);
//Temporary vectors to use with calculating the dotproduct
uint16x4_t vectorT1, vectorT2, vectorT3, vectorT4;
// For each row in A...
for (i=0; i<4; i++){
//Multiply the rows in B by each value in A's row
vectorT1 = vmul_n_u16(vectorB1, A[i][0]);
vectorT2 = vmul_n_u16(vectorB2, A[i][1]);
vectorT3 = vmul_n_u16(vectorB3, A[i][2]);
vectorT4 = vmul_n_u16(vectorB4, A[i][3]);
//Add them together
vectorT1 = vadd_u16(vectorT1, vectorT2);
vectorT1 = vadd_u16(vectorT1, vectorT3);
vectorT1 = vadd_u16(vectorT1, vectorT4);
//Output the dotproduct
vst1_u16 (C[i], vectorT1);
}
//...
A B C
A B C
SIMD Optimization –
Data Parallel Frameworks
14
● OpenCL/Cuda/C++AMP
● OpenVX/Halide
● SIMD-Optimized Libraries
− Apple Accelerate
− OpenCV
− ffmpeg/x264
− fftw/Ne10
core idea of Halide Lang
workitems in OpenCL
SIMD Optimization –
Coding in Assembly
15
● WHY !!!!???
● Extreme Performance Optimization
− precise code size/cycle/register usage
● The difficulty depends on ISA design and
assembler
SIMD Optimization –
The Difficult Parts
16
● Finding Parallelism in Algorithm
● Portability between different intrinsics
− sse-to-neon
− neon-to-sse
● Boundary handling
− Padding, Predication, Fallback
● Divergence
− Predication, Fallback to scalar
● Register Spilling
− multi-stages, # of variables + braces, assembly
● Non-Regular Access/Processing Pattern/Dependency
− Multi-stages/Reduction ISA/Enhanced DMA
● Unsupported Operations
− Division, High-level function (eg: math functions)
● Floating-Point
− Unsupported/cross-deivce Compatiblilty
What are important in SIMD?
17
TLP Programming
ILP Programming
Data Parallel Algorithm
Architecture
What are important in SIMD?
18
● ISA design
● Memory Model
● Thread / Execution Model
● Scalability / Extensibility
● The Trends
What are important in SIMD?
ISA Design
19
● SuperScalar vs VLIW
● vector register design
− Unified or Dedicated float/predication/accumulators
registers
● ALU
● Inter-Lane
− reduction, swizzle, pack/unpack
● Multiply
● Memory Access
− Load/Store, Scatter/Gather, LUT, DMA ...
● Scalar ↔ Vector
What are important in SIMD?
Memory Model
20
● DATA! DATA! DATA!
− it takes great latency to pull data from DRAM to register!
● Buffers Between Host Processor and SIMD Processor
− Handled in driver/device-side
● TCM
− DMA
− relative simple hw/control/understand
− fixed cycle (good for simulation/estimation)
● Cache
− prefetch
− transparent
− portable
− performance scalability
− simple code flow
● Mixed
− powerful
What are important in SIMD?
Thread / Execution Model
21
● The flow to trigger DSP to work
Qualcomm FastRPC
w/ IDL Compiler CEVA Link
TI DSP Link
Synopsys EV6x SW
ecosystem
What are important in SIMD?
Scalability/Extensibility
22
● What to do when ONE core doesn’t meet
performance requirement?
● HWA/Co-processor Interface
● Add another core?
− Cost
− Multicore Programming Mode
What are important in SIMD?
Trends
23
Vector Width
128/256 bit vector width
CEVA DSP/SSE/AVX
64 bit vector width
Hexagon/ARM DSP/MIPS ASE/MMX
512 bit vector width
AVX-512/IVP P5/EV6x/HVX
1024/2048 bit vector width
HVX/ARM SVE
ILP/TLP
Count
Explicit
vector
programming
Autovectorization
Time
Thank You, Q & A
24

More Related Content

What's hot

7nm "Navi" GPU - A GPU Built For Performance
7nm "Navi" GPU - A GPU Built For Performance 7nm "Navi" GPU - A GPU Built For Performance
7nm "Navi" GPU - A GPU Built For Performance AMD
 
用Raspberry Pi 學Linux I2C Driver
用Raspberry Pi 學Linux I2C Driver用Raspberry Pi 學Linux I2C Driver
用Raspberry Pi 學Linux I2C Driver艾鍗科技
 
System on chip architectures
System on chip architecturesSystem on chip architectures
System on chip architecturesA B Shinde
 
ISSCC 2018: "Zeppelin": an SoC for Multi-chip Architectures
ISSCC 2018: "Zeppelin": an SoC for Multi-chip ArchitecturesISSCC 2018: "Zeppelin": an SoC for Multi-chip Architectures
ISSCC 2018: "Zeppelin": an SoC for Multi-chip ArchitecturesAMD
 
3D V-Cache
3D V-Cache 3D V-Cache
3D V-Cache AMD
 
Single instruction multiple data
Single instruction multiple dataSingle instruction multiple data
Single instruction multiple dataSyed Zaid Irshad
 
AMD EPYC™ Microprocessor Architecture
AMD EPYC™ Microprocessor ArchitectureAMD EPYC™ Microprocessor Architecture
AMD EPYC™ Microprocessor ArchitectureAMD
 
Deep learning: Hardware Landscape
Deep learning: Hardware LandscapeDeep learning: Hardware Landscape
Deep learning: Hardware LandscapeGrigory Sapunov
 
Linux Porting to a Custom Board
Linux Porting to a Custom BoardLinux Porting to a Custom Board
Linux Porting to a Custom BoardPatrick Bellasi
 
TIP1 - Overview of C/C++ Debugging/Tracing/Profiling Tools
TIP1 - Overview of C/C++ Debugging/Tracing/Profiling ToolsTIP1 - Overview of C/C++ Debugging/Tracing/Profiling Tools
TIP1 - Overview of C/C++ Debugging/Tracing/Profiling ToolsXiaozhe Wang
 

What's hot (20)

eBPF/XDP
eBPF/XDP eBPF/XDP
eBPF/XDP
 
7nm "Navi" GPU - A GPU Built For Performance
7nm "Navi" GPU - A GPU Built For Performance 7nm "Navi" GPU - A GPU Built For Performance
7nm "Navi" GPU - A GPU Built For Performance
 
用Raspberry Pi 學Linux I2C Driver
用Raspberry Pi 學Linux I2C Driver用Raspberry Pi 學Linux I2C Driver
用Raspberry Pi 學Linux I2C Driver
 
System on chip architectures
System on chip architecturesSystem on chip architectures
System on chip architectures
 
ISSCC 2018: "Zeppelin": an SoC for Multi-chip Architectures
ISSCC 2018: "Zeppelin": an SoC for Multi-chip ArchitecturesISSCC 2018: "Zeppelin": an SoC for Multi-chip Architectures
ISSCC 2018: "Zeppelin": an SoC for Multi-chip Architectures
 
3D V-Cache
3D V-Cache 3D V-Cache
3D V-Cache
 
Zynq ultrascale
Zynq ultrascaleZynq ultrascale
Zynq ultrascale
 
ARM CORTEX M3 PPT
ARM CORTEX M3 PPTARM CORTEX M3 PPT
ARM CORTEX M3 PPT
 
USB Drivers
USB DriversUSB Drivers
USB Drivers
 
Asic
AsicAsic
Asic
 
Embedded systems basics
Embedded systems basicsEmbedded systems basics
Embedded systems basics
 
Single instruction multiple data
Single instruction multiple dataSingle instruction multiple data
Single instruction multiple data
 
AMD EPYC™ Microprocessor Architecture
AMD EPYC™ Microprocessor ArchitectureAMD EPYC™ Microprocessor Architecture
AMD EPYC™ Microprocessor Architecture
 
Deep learning: Hardware Landscape
Deep learning: Hardware LandscapeDeep learning: Hardware Landscape
Deep learning: Hardware Landscape
 
Lect01 flow
Lect01 flowLect01 flow
Lect01 flow
 
Linux Porting to a Custom Board
Linux Porting to a Custom BoardLinux Porting to a Custom Board
Linux Porting to a Custom Board
 
Uvm dac2011 final_color
Uvm dac2011 final_colorUvm dac2011 final_color
Uvm dac2011 final_color
 
TIP1 - Overview of C/C++ Debugging/Tracing/Profiling Tools
TIP1 - Overview of C/C++ Debugging/Tracing/Profiling ToolsTIP1 - Overview of C/C++ Debugging/Tracing/Profiling Tools
TIP1 - Overview of C/C++ Debugging/Tracing/Profiling Tools
 
Ixgbe internals
Ixgbe internalsIxgbe internals
Ixgbe internals
 
Qemu
QemuQemu
Qemu
 

Similar to SIMD Programming Introduction Guide

Vectorization on x86: all you need to know
Vectorization on x86: all you need to knowVectorization on x86: all you need to know
Vectorization on x86: all you need to knowRoberto Agostino Vitillo
 
Building a Big Data Machine Learning Platform
Building a Big Data Machine Learning PlatformBuilding a Big Data Machine Learning Platform
Building a Big Data Machine Learning PlatformCliff Click
 
Anatomy of ROCgdb presentation at gcc cauldron 2022
Anatomy of ROCgdb presentation at gcc cauldron 2022Anatomy of ROCgdb presentation at gcc cauldron 2022
Anatomy of ROCgdb presentation at gcc cauldron 2022ssuser866937
 
Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)
Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)
Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)Laurent Leturgez
 
“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMD
“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMD“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMD
“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMDEdge AI and Vision Alliance
 
Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019
Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019 Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019
Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019 Unity Technologies
 
Raspberry Pi - HW/SW Application Development
Raspberry Pi - HW/SW Application DevelopmentRaspberry Pi - HW/SW Application Development
Raspberry Pi - HW/SW Application DevelopmentCorley S.r.l.
 
Cryptography and secure systems
Cryptography and secure systemsCryptography and secure systems
Cryptography and secure systemsVsevolod Stakhov
 
SIMD inside and outside oracle 12c
SIMD inside and outside oracle 12cSIMD inside and outside oracle 12c
SIMD inside and outside oracle 12cLaurent Leturgez
 
GBM in H2O with Cliff Click: H2O API
GBM in H2O with Cliff Click: H2O APIGBM in H2O with Cliff Click: H2O API
GBM in H2O with Cliff Click: H2O APISri Ambati
 
Megamodeling of Complex, Distributed, Heterogeneous CPS Systems
Megamodeling of Complex, Distributed, Heterogeneous CPS SystemsMegamodeling of Complex, Distributed, Heterogeneous CPS Systems
Megamodeling of Complex, Distributed, Heterogeneous CPS SystemsEugenio Villar
 
A Journey into Hexagon: Dissecting Qualcomm Basebands
A Journey into Hexagon: Dissecting Qualcomm BasebandsA Journey into Hexagon: Dissecting Qualcomm Basebands
A Journey into Hexagon: Dissecting Qualcomm BasebandsPriyanka Aash
 
Data-Level Parallelism in Microprocessors
Data-Level Parallelism in MicroprocessorsData-Level Parallelism in Microprocessors
Data-Level Parallelism in MicroprocessorsDilum Bandara
 
tau 2015 spyrou fpga timing
tau 2015 spyrou fpga timingtau 2015 spyrou fpga timing
tau 2015 spyrou fpga timingTom Spyrou
 
SIMD inside and outside Oracle 12c In Memory
SIMD inside and outside Oracle 12c In MemorySIMD inside and outside Oracle 12c In Memory
SIMD inside and outside Oracle 12c In MemoryLaurent Leturgez
 
Happy To Use SIMD
Happy To Use SIMDHappy To Use SIMD
Happy To Use SIMDWei-Ta Wang
 
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Intel® Software
 

Similar to SIMD Programming Introduction Guide (20)

Vectorization on x86: all you need to know
Vectorization on x86: all you need to knowVectorization on x86: all you need to know
Vectorization on x86: all you need to know
 
Joel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMDJoel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMD
 
Building a Big Data Machine Learning Platform
Building a Big Data Machine Learning PlatformBuilding a Big Data Machine Learning Platform
Building a Big Data Machine Learning Platform
 
Anatomy of ROCgdb presentation at gcc cauldron 2022
Anatomy of ROCgdb presentation at gcc cauldron 2022Anatomy of ROCgdb presentation at gcc cauldron 2022
Anatomy of ROCgdb presentation at gcc cauldron 2022
 
Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)
Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)
Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)
 
“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMD
“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMD“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMD
“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMD
 
Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019
Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019 Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019
Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019
 
Raspberry Pi - HW/SW Application Development
Raspberry Pi - HW/SW Application DevelopmentRaspberry Pi - HW/SW Application Development
Raspberry Pi - HW/SW Application Development
 
Fedor Polyakov - Optimizing computer vision problems on mobile platforms
Fedor Polyakov - Optimizing computer vision problems on mobile platforms Fedor Polyakov - Optimizing computer vision problems on mobile platforms
Fedor Polyakov - Optimizing computer vision problems on mobile platforms
 
Cryptography and secure systems
Cryptography and secure systemsCryptography and secure systems
Cryptography and secure systems
 
SIMD inside and outside oracle 12c
SIMD inside and outside oracle 12cSIMD inside and outside oracle 12c
SIMD inside and outside oracle 12c
 
GBM in H2O with Cliff Click: H2O API
GBM in H2O with Cliff Click: H2O APIGBM in H2O with Cliff Click: H2O API
GBM in H2O with Cliff Click: H2O API
 
Megamodeling of Complex, Distributed, Heterogeneous CPS Systems
Megamodeling of Complex, Distributed, Heterogeneous CPS SystemsMegamodeling of Complex, Distributed, Heterogeneous CPS Systems
Megamodeling of Complex, Distributed, Heterogeneous CPS Systems
 
A Journey into Hexagon: Dissecting Qualcomm Basebands
A Journey into Hexagon: Dissecting Qualcomm BasebandsA Journey into Hexagon: Dissecting Qualcomm Basebands
A Journey into Hexagon: Dissecting Qualcomm Basebands
 
Vectorization in ATLAS
Vectorization in ATLASVectorization in ATLAS
Vectorization in ATLAS
 
Data-Level Parallelism in Microprocessors
Data-Level Parallelism in MicroprocessorsData-Level Parallelism in Microprocessors
Data-Level Parallelism in Microprocessors
 
tau 2015 spyrou fpga timing
tau 2015 spyrou fpga timingtau 2015 spyrou fpga timing
tau 2015 spyrou fpga timing
 
SIMD inside and outside Oracle 12c In Memory
SIMD inside and outside Oracle 12c In MemorySIMD inside and outside Oracle 12c In Memory
SIMD inside and outside Oracle 12c In Memory
 
Happy To Use SIMD
Happy To Use SIMDHappy To Use SIMD
Happy To Use SIMD
 
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
 

More from Champ Yen

Halide tutorial 2019
Halide tutorial 2019Halide tutorial 2019
Halide tutorial 2019Champ Yen
 
Linux SD/MMC Driver Stack
Linux SD/MMC Driver Stack Linux SD/MMC Driver Stack
Linux SD/MMC Driver Stack Champ Yen
 
Video Compression Standards - History & Introduction
Video Compression Standards - History & IntroductionVideo Compression Standards - History & Introduction
Video Compression Standards - History & IntroductionChamp Yen
 
OpenCL Kernel Optimization Tips
OpenCL Kernel Optimization TipsOpenCL Kernel Optimization Tips
OpenCL Kernel Optimization TipsChamp Yen
 
OpenGL ES 2.x Programming Introduction
OpenGL ES 2.x Programming IntroductionOpenGL ES 2.x Programming Introduction
OpenGL ES 2.x Programming IntroductionChamp Yen
 
Chrome OS Observation
Chrome OS ObservationChrome OS Observation
Chrome OS ObservationChamp Yen
 
Play With Android
Play With AndroidPlay With Android
Play With AndroidChamp Yen
 
Linux Porting
Linux PortingLinux Porting
Linux PortingChamp Yen
 

More from Champ Yen (8)

Halide tutorial 2019
Halide tutorial 2019Halide tutorial 2019
Halide tutorial 2019
 
Linux SD/MMC Driver Stack
Linux SD/MMC Driver Stack Linux SD/MMC Driver Stack
Linux SD/MMC Driver Stack
 
Video Compression Standards - History & Introduction
Video Compression Standards - History & IntroductionVideo Compression Standards - History & Introduction
Video Compression Standards - History & Introduction
 
OpenCL Kernel Optimization Tips
OpenCL Kernel Optimization TipsOpenCL Kernel Optimization Tips
OpenCL Kernel Optimization Tips
 
OpenGL ES 2.x Programming Introduction
OpenGL ES 2.x Programming IntroductionOpenGL ES 2.x Programming Introduction
OpenGL ES 2.x Programming Introduction
 
Chrome OS Observation
Chrome OS ObservationChrome OS Observation
Chrome OS Observation
 
Play With Android
Play With AndroidPlay With Android
Play With Android
 
Linux Porting
Linux PortingLinux Porting
Linux Porting
 

Recently uploaded

Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noidabntitsolutionsrishis
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfStefano Stabellini
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfLivetecs LLC
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)jennyeacort
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 

Recently uploaded (20)

Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdf
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdf
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 

SIMD Programming Introduction Guide

  • 1. SIMD Programming Introduction Champ Yen (嚴梓鴻) champ.yen@gmail.com http://champyen.blogspt.com
  • 2. Agenda 2 ● What & Why SIMD ● SIMD in different Processors ● SIMD for Software Optimization ● What are important in SIMD? ● Q & A Link of This Slides https://goo.gl/Rc8xPE
  • 3. What is SIMD (Single Instruction Multiple Data) 3 for(y = 0; y < height; y++){ for(x = 0; x < width; x+=8){ //process 8 point simutaneously uin16x8_t va, vb, vout; va = vld1q_u16(a+x); vb = vld1q_u16(b+x); vout = vaddq_u16(va, vb); vst1q_u16(out, vout); } a+=width; b+=width; out+=width; } for(y = 0; y < height; y++){ for(x = 0; x < width; x++){ //process 1 point out[x] = a[x]+b[x]; } a+=width; b+=width; out+=width; } one lane
  • 4. Why do we need to use SIMD? 4
  • 5. Why & How do we use SIMD? 5 Image Processing Scientific Computing Gaming Deep Neural Network
  • 6. SIMD in different Processor - CPU 6 ● x86 − MMX − SSE − AVX/AVX-512 ● ARM - Application − v5 DSP Extension − v6 SIMD − v7 NEON − v8 Advanced SIMD (NEON) https://software.intel.com/sites/landingpage/IntrinsicsGuide/ http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A_arm_neon_intrinsics_ref.pdf
  • 7. SIMD in different Processor - GPU 7 ● SIMD − AMD GCN − ARM Mali ● WaveFront − Nvidia − Imagination PowerVR Rogue
  • 8. SIMD in different Processor - DSP 8 ● Qualcomm Hexagon 600 HVX ● Cadence IVP P5 ● Synopsys EV6x ● CEVA XM4
  • 10. SIMD Optimization 10 ● Auto/Semi-Auto Method ● Compiler Intrinsics ● Specific Framework/Infrastructure ● Coding in Assembly ● What are the difficult parts of SIMD Programming?
  • 11. SIMD Optimization – Auto/SemiAuto 11 ● compiler − auto-vectorization optimization options − #pragma − IR optimization ● CilkPlus/OpenMP/OpenACC Serial Code for(i = 0; i < N; i++){ A[i] = B[i] + C[i]; } #pragma omp simd for(i = 0; i < N; i++){ A[i] = B[i] + C[i]; } SIMD Pragma https://software.intel.com/en-us/articles/performance-essentials-with-openmp-40-vectorization
  • 12. SIMD Optimization – Intrinsics 12 ● Arch-Dependent Intrinsics − Intel SSE/AVX − ARM NEON/MIPS ASE − Vector-based DSP ● Common Intrisincs − GCC/Clang vector intrinsics − OpenCL / SIMD.js ● take Vector Width into consideration ● Portability between compilers SIMD.js example from: https://01.org/node/1495
  • 13. SIMD Optimization – Intrinsics 13 4x4 Matrix Multiplication ARM NEON Example http://www.fixstars.com/en/news/?p=125 //... //Load matrixB into four vectors uint16x4_t vectorB1, vectorB2, vectorB3, vectorB4; vectorB1 = vld1_u16 (B[0]); vectorB2 = vld1_u16 (B[1]); vectorB3 = vld1_u16 (B[2]); vectorB4 = vld1_u16 (B[3]); //Temporary vectors to use with calculating the dotproduct uint16x4_t vectorT1, vectorT2, vectorT3, vectorT4; // For each row in A... for (i=0; i<4; i++){ //Multiply the rows in B by each value in A's row vectorT1 = vmul_n_u16(vectorB1, A[i][0]); vectorT2 = vmul_n_u16(vectorB2, A[i][1]); vectorT3 = vmul_n_u16(vectorB3, A[i][2]); vectorT4 = vmul_n_u16(vectorB4, A[i][3]); //Add them together vectorT1 = vadd_u16(vectorT1, vectorT2); vectorT1 = vadd_u16(vectorT1, vectorT3); vectorT1 = vadd_u16(vectorT1, vectorT4); //Output the dotproduct vst1_u16 (C[i], vectorT1); } //... A B C A B C
  • 14. SIMD Optimization – Data Parallel Frameworks 14 ● OpenCL/Cuda/C++AMP ● OpenVX/Halide ● SIMD-Optimized Libraries − Apple Accelerate − OpenCV − ffmpeg/x264 − fftw/Ne10 core idea of Halide Lang workitems in OpenCL
  • 15. SIMD Optimization – Coding in Assembly 15 ● WHY !!!!??? ● Extreme Performance Optimization − precise code size/cycle/register usage ● The difficulty depends on ISA design and assembler
  • 16. SIMD Optimization – The Difficult Parts 16 ● Finding Parallelism in Algorithm ● Portability between different intrinsics − sse-to-neon − neon-to-sse ● Boundary handling − Padding, Predication, Fallback ● Divergence − Predication, Fallback to scalar ● Register Spilling − multi-stages, # of variables + braces, assembly ● Non-Regular Access/Processing Pattern/Dependency − Multi-stages/Reduction ISA/Enhanced DMA ● Unsupported Operations − Division, High-level function (eg: math functions) ● Floating-Point − Unsupported/cross-deivce Compatiblilty
  • 17. What are important in SIMD? 17 TLP Programming ILP Programming Data Parallel Algorithm Architecture
  • 18. What are important in SIMD? 18 ● ISA design ● Memory Model ● Thread / Execution Model ● Scalability / Extensibility ● The Trends
  • 19. What are important in SIMD? ISA Design 19 ● SuperScalar vs VLIW ● vector register design − Unified or Dedicated float/predication/accumulators registers ● ALU ● Inter-Lane − reduction, swizzle, pack/unpack ● Multiply ● Memory Access − Load/Store, Scatter/Gather, LUT, DMA ... ● Scalar ↔ Vector
  • 20. What are important in SIMD? Memory Model 20 ● DATA! DATA! DATA! − it takes great latency to pull data from DRAM to register! ● Buffers Between Host Processor and SIMD Processor − Handled in driver/device-side ● TCM − DMA − relative simple hw/control/understand − fixed cycle (good for simulation/estimation) ● Cache − prefetch − transparent − portable − performance scalability − simple code flow ● Mixed − powerful
  • 21. What are important in SIMD? Thread / Execution Model 21 ● The flow to trigger DSP to work Qualcomm FastRPC w/ IDL Compiler CEVA Link TI DSP Link Synopsys EV6x SW ecosystem
  • 22. What are important in SIMD? Scalability/Extensibility 22 ● What to do when ONE core doesn’t meet performance requirement? ● HWA/Co-processor Interface ● Add another core? − Cost − Multicore Programming Mode
  • 23. What are important in SIMD? Trends 23 Vector Width 128/256 bit vector width CEVA DSP/SSE/AVX 64 bit vector width Hexagon/ARM DSP/MIPS ASE/MMX 512 bit vector width AVX-512/IVP P5/EV6x/HVX 1024/2048 bit vector width HVX/ARM SVE ILP/TLP Count Explicit vector programming Autovectorization Time
  • 24. Thank You, Q & A 24