All About SSE
Use SIMD to boost your program!
CPU-Z
What are all these about?
Outline
• What is SSE?
• Why SSE?
• How to use SSE?
• CPUID
• Useful References
• Discussions
SSE
• Streaming SIMD Extensions
A set of CPU instructions dedicated to applications like signal
processing, scientific computation or 3D graphics.
SIMD
• Single Instruction, Multiple Data
A CPU instruction is said to be SIMD when the same operation is
applied on multiple data at the same time, i.e. operate on a “vector”
of data with a single instruction.
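To make the "vector of data" idea concrete, here is a minimal sketch in C using the SSE intrinsics from xmmintrin.h. The helper name add4 is hypothetical and chosen only for this illustration; the unaligned load/store variants are used so the caller does not have to worry about 16-byte alignment.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Adds four pairs of floats with one SIMD addition.
   add4 is a hypothetical helper name used only for this sketch. */
void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);        /* load 4 floats, no alignment required */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vsum = _mm_add_ps(va, vb);   /* a single instruction performs 4 additions */
    _mm_storeu_ps(out, vsum);           /* write the 4 results back to memory */
}
```

A scalar version would need four separate additions; here the same operation is applied to all four lanes at once.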
Flynn’s taxonomy
• Flynn's taxonomy is a classification of computer architectures,
proposed by Michael Flynn in 1966.
                        Single instruction stream   Multiple instruction streams
Single data stream      SISD                        MISD
Multiple data streams   SIMD                        MIMD
PU: Processing Unit
More on SSE
• Streaming SIMD Extensions (SSE) is an SIMD instruction set extension
to the x86 architecture, designed by Intel and introduced in 1999 in
their Pentium III series processors as a reply to AMD's 3DNow!
• SSE contains 70 new instructions, most of which work on single
precision floating point data.
• Intel's first IA-32 SIMD effort was the MMX instruction set.
• SSE was subsequently expanded by Intel to SSE2, SSE3, SSSE3, SSE4
and AVX.
• SSE was originally called Katmai New Instructions (KNI), Katmai being
the code name for the first Pentium III core revision.
SSE Registers
• SSE originally added eight new 128-bit registers known as XMM0
through XMM7. The x86-64 architecture later added eight more, XMM8
through XMM15.
• There is also a new 32-bit control/status register, MXCSR, which
provides control and status bits for operations performed on XMM
registers.
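As a rough illustration of MXCSR as a control/status register, the sketch below reads and rewrites its rounding-control field through the _mm_getcsr and _mm_setcsr intrinsics. The two function names are hypothetical; the _MM_ROUND_* constants come from xmmintrin.h.

```c
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr, _MM_ROUND_* constants */

/* Returns the rounding-control field currently stored in MXCSR.
   current_rounding_mode is a hypothetical name for this sketch. */
unsigned int current_rounding_mode(void)
{
    return _mm_getcsr() & _MM_ROUND_MASK;
}

/* Installs a new rounding mode (for example _MM_ROUND_TOWARD_ZERO) and
   returns the previous one so the caller can restore it afterwards.
   swap_rounding_mode is also a hypothetical name. */
unsigned int swap_rounding_mode(unsigned int new_mode)
{
    unsigned int csr = _mm_getcsr();
    _mm_setcsr((csr & ~_MM_ROUND_MASK) | (new_mode & _MM_ROUND_MASK));
    return csr & _MM_ROUND_MASK;
}
```

Saving and restoring the previous mode matters because MXCSR is per-thread state that affects every later SSE floating-point operation.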
SSE instructions
• Packed and scalar single-precision floating-point instructions
  - Data movement instructions
  - Arithmetic instructions
  - Logical instructions
  - Comparison instructions
  - Shuffle instructions
  - Conversion instructions
• 64-bit SIMD integer instructions
  - Operate on data in MMX registers and 64-bit memory locations.
• State management instructions
  - LDMXCSR
  - STMXCSR
• Cacheability control, prefetch, and memory ordering instructions
  - Give programs more control over the caching of data
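The instruction groups above are often combined. As one possible sketch, the function below uses a comparison instruction to build a per-lane mask and logical instructions to select between two inputs, giving a branch-free element-wise maximum. The name max4 is hypothetical; SSE also has a dedicated _mm_max_ps, so this is only meant to show how comparison and logical instructions work together.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Element-wise maximum of two 4-float vectors without branches.
   max4 is a hypothetical helper name for this sketch. */
void max4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 mask = _mm_cmpgt_ps(va, vb);           /* all-ones bits in lanes where a > b */
    __m128 from_a = _mm_and_ps(mask, va);         /* keep a where the mask is set */
    __m128 from_b = _mm_andnot_ps(mask, vb);      /* keep b where the mask is clear */
    _mm_storeu_ps(out, _mm_or_ps(from_a, from_b));
}
```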
Intel CPU SIMD technology evolution
Advantages of SIMD
• Many real-world problems, especially in science and engineering,
map well to computation on arrays.
• SIMD instructions can greatly increase performance when exactly the
same operations are to be performed on multiple data objects
(arrays).
• Typical applications are digital signal processing and graphics
processing.
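A typical array computation of this kind can be sketched as follows: y[i] += k * x[i] over n elements, processed four at a time with a scalar loop for the leftover elements. The function name scale_add and its parameters are assumptions made for this example, not something from the slides.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Computes y[i] += k * x[i] for i in [0, n).
   scale_add is a hypothetical name used only for this sketch. */
void scale_add(float *y, const float *x, float k, size_t n)
{
    size_t i = 0;
    __m128 vk = _mm_set1_ps(k);                 /* broadcast k into all 4 lanes */

    /* Main loop: 4 elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        vy = _mm_add_ps(vy, _mm_mul_ps(vk, vx));
        _mm_storeu_ps(y + i, vy);
    }

    /* Tail loop: handle the remaining 0 to 3 elements one by one. */
    for (; i < n; ++i)
        y[i] += k * x[i];
}
```

The tail loop is what keeps the function correct when n is not a multiple of four.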
Think twice before you go
• What is your application?
• Is there a better algorithm?
• Will the effort eventually pay off in performance? How much?
• Which SSE version suits best?
• Does your CPU support SSE? If so, up to what version?
• Does your operating system have SSE support?
• How will you code the SSE programs? Assembly or high level?
• …
Identify if applicable
• SIMD improves the performance of 3D graphics, speech recognition, image
processing, scientific applications and applications that have the
following characteristics:
  - Inherently parallel.
  - Recurring memory access patterns.
  - Localized recurring operations performed on the data.
  - Data-independent control flow.
• Support must be ensured on:
  - CPU
  - Operating System
• SIMD application candidates:
  - Speech compression algorithms and filters.
  - Speech recognition algorithms.
  - Video display and capture routines.
  - Rendering routines.
  - 3D graphics (geometry).
  - Image and video processing algorithms.
  - Spatial (3D) audio.
  - Physical modeling (graphics, CAD).
  - Workstation applications.
  - Encryption algorithms.
  - Complex arithmetic.
Choose the right instructions – Refer to Intel
Optimization Manual 2.9
• MMX
• SSE
• SSE2
• SSE3
• SSSE3
• SSE4
• AESNI and PCLMULQDQ
• AVX, FMA and AVX2
Coding methodologies for SIMD
• Assembly
• Intrinsic
• Classes
• Automatic Vectorization
Assembly
• Key loops can be coded directly in assembly language using an
assembler or by using inline assembly (C-ASM) in C/C++ code.
• This model offers the opportunity for attaining greatest performance,
but this performance is not portable across the different processor
architectures.
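As a small sketch of the inline-assembly route, the function below adds two 4-float vectors with ADDPS. It assumes GCC or Clang extended asm in AT&T syntax on an x86 or x86-64 target; other compilers (MSVC, for example) use a different inline-assembly syntax. The name add4_asm is hypothetical.

```c
/* Adds four pairs of floats using hand-written SSE instructions.
   Assumes GCC/Clang extended asm (AT&T syntax) on x86 or x86-64.
   add4_asm is a hypothetical helper name for this sketch. */
void add4_asm(const float *a, const float *b, float *out)
{
    __asm__ volatile(
        "movups (%0), %%xmm0\n\t"    /* xmm0 = a[0..3] (unaligned load) */
        "movups (%1), %%xmm1\n\t"    /* xmm1 = b[0..3] */
        "addps  %%xmm1, %%xmm0\n\t"  /* xmm0 += xmm1, four lanes at once */
        "movups %%xmm0, (%2)\n\t"    /* out[0..3] = xmm0 */
        :                            /* no register outputs; result goes to memory */
        : "r"(a), "r"(b), "r"(out)
        : "xmm0", "xmm1", "memory"); /* tell the compiler what is modified */
}
```

The clobber list is the part that is easy to get wrong: without it the compiler may assume xmm0, xmm1 and the output array are untouched.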
Intrinsic
• Intrinsics provide access to ISA functionality using C/C++-style
coding instead of assembly language.
• https://software.intel.com/sites/landingpage/IntrinsicsGuide/#
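For comparison with assembly, here is a minimal intrinsics sketch: a 4-element dot product that multiplies lane-wise and then reduces the four products with shuffles. It uses only SSE1 intrinsics from xmmintrin.h; the name dot4 is hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Dot product of two 4-float vectors.
   dot4 is a hypothetical helper name for this sketch. */
float dot4(const float *a, const float *b)
{
    /* prod = (p0, p1, p2, p3) with p_i = a[i] * b[i] */
    __m128 prod = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));

    /* Swap neighbours: shuf = (p1, p0, p3, p2) */
    __m128 shuf = _mm_shuffle_ps(prod, prod, _MM_SHUFFLE(2, 3, 0, 1));

    /* sums = (p0+p1, p0+p1, p2+p3, p2+p3) */
    __m128 sums = _mm_add_ps(prod, shuf);

    /* Move the upper pair of sums into the lower half: lane 0 = p2+p3 */
    shuf = _mm_movehl_ps(shuf, sums);

    /* Lane 0 now holds p0+p1+p2+p3 */
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}
```

The compiler handles register allocation and scheduling, which is the main practical advantage over writing the same sequence by hand.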
Header File      Instructions & CPU
x86intrin.h      x86 instructions
mmintrin.h       MMX (Pentium MMX!)
mm3dnow.h        3DNow! (K6-2) (deprecated)
xmmintrin.h      SSE + MMX (Pentium 3, Athlon XP)
emmintrin.h      SSE2 + SSE + MMX (Pentium 4, Athlon 64)
pmmintrin.h      SSE3 + SSE2 + SSE + MMX (Pentium 4 Prescott, Athlon 64 San Diego)
tmmintrin.h      SSSE3 + SSE3 + SSE2 + SSE + MMX (Core 2, Bulldozer)
popcntintrin.h   POPCNT (Core i7, Phenom; subset of SSE4.2 and SSE4A)
ammintrin.h      SSE4A + SSE3 + SSE2 + SSE + MMX (Phenom)
smmintrin.h      SSE4_1 + SSSE3 + SSE3 + SSE2 + SSE + MMX (Core i7, Bulldozer)
nmmintrin.h      SSE4_2 + SSE4_1 + SSSE3 + SSE3 + SSE2 + SSE + MMX (Core i7, Bulldozer)
wmmintrin.h      AES (Core i7 Westmere, Bulldozer)
immintrin.h      AVX, SSE4_2 + SSE4_1 + SSSE3 + SSE3 + SSE2 + SSE + MMX (Core i7 Sandy Bridge, Bulldozer)
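One way to choose among these headers at compile time is to test the feature macros that the compiler predefines for the selected target. The sketch below assumes GCC or Clang, which define __SSE__, __SSE2__ and so on according to flags such as -msse4.2 or -mavx; MSVC exposes this information differently. SIMD_LEVEL and compiled_simd_level are hypothetical names.

```c
/* Include the widest intrinsics header the build target allows.
   Assumes GCC/Clang predefined macros; SIMD_LEVEL is a hypothetical name. */
#if defined(__AVX__)
#  include <immintrin.h>
#  define SIMD_LEVEL "AVX"
#elif defined(__SSE4_2__)
#  include <nmmintrin.h>
#  define SIMD_LEVEL "SSE4.2"
#elif defined(__SSE4_1__)
#  include <smmintrin.h>
#  define SIMD_LEVEL "SSE4.1"
#elif defined(__SSSE3__)
#  include <tmmintrin.h>
#  define SIMD_LEVEL "SSSE3"
#elif defined(__SSE3__)
#  include <pmmintrin.h>
#  define SIMD_LEVEL "SSE3"
#elif defined(__SSE2__)
#  include <emmintrin.h>
#  define SIMD_LEVEL "SSE2"
#elif defined(__SSE__)
#  include <xmmintrin.h>
#  define SIMD_LEVEL "SSE"
#else
#  define SIMD_LEVEL "none"
#endif

/* Reports which instruction set this translation unit was compiled for.
   This is a compile-time property, not a check of the CPU it runs on. */
const char *compiled_simd_level(void)
{
    return SIMD_LEVEL;
}
```

This only tells you what the compiler was allowed to emit; whether the machine running the binary supports it still has to be checked at run time.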
Classes
• A set of C++ classes is defined and available in the Intel C++
Compiler to provide both a higher-level abstraction and more
flexibility for programming with SIMD technology.
Automatic Vectorization
• The Intel C++ Compiler provides an optimization mechanism by which
loops, such as the one in Example 4-13, can be automatically vectorized,
or converted into Streaming SIMD Extensions code.
• Compile this code using the -QAX and -QRESTRICT switches of the
Intel C++ Compiler, version 4.0 or later.
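The kind of loop a vectorizer handles well can be sketched as below. This is not the manual's example, only an illustrative loop written for this note; add_arrays is a hypothetical name. The C99 restrict qualifiers promise the compiler that the arrays do not overlap, which plays the same role as the restrict switch mentioned above.

```c
#include <stddef.h>

/* A simple loop that is a good candidate for automatic vectorization.
   restrict tells the compiler that dst, a and b never alias each other,
   so iterations are independent and can be executed in SIMD lanes.
   add_arrays is a hypothetical name used only for this sketch. */
void add_arrays(float *restrict dst,
                const float *restrict a,
                const float *restrict b,
                size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

With GCC, for instance, building with -O3 (or -O2 -ftree-vectorize) enables the vectorizer, and -fopt-info-vec reports which loops were vectorized.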
SSE Demo
CPUID
• CPU IDentification
• The CPUID instruction can be used to retrieve a variety of
information about your CPU, such as its vendor string and model number,
the size of internal caches and (more interestingly) the list of CPU
features supported.
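A run-time feature check might look like the following sketch. It assumes GCC or Clang, whose cpuid.h provides the __get_cpuid helper; the struct and function names are hypothetical. The bit positions are those documented for CPUID leaf 1 (EDX bits 25 and 26 for SSE and SSE2; ECX bits 0, 9, 19, 20 and 28 for SSE3, SSSE3, SSE4.1, SSE4.2 and AVX).

```c
#include <cpuid.h>  /* __get_cpuid, provided by GCC and Clang */

/* Flags filled in from CPUID leaf 1. Hypothetical type for this sketch. */
struct sse_support {
    int sse, sse2, sse3, ssse3, sse4_1, sse4_2, avx;
};

/* Queries CPUID leaf 1 and decodes the SIMD feature bits.
   Returns 1 on success, 0 if the leaf is not available.
   query_sse_support is a hypothetical name. */
int query_sse_support(struct sse_support *out)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;

    out->sse    = (edx >> 25) & 1;
    out->sse2   = (edx >> 26) & 1;
    out->sse3   = (ecx >> 0)  & 1;
    out->ssse3  = (ecx >> 9)  & 1;
    out->sse4_1 = (ecx >> 19) & 1;
    out->sse4_2 = (ecx >> 20) & 1;
    /* The AVX bit only says the CPU implements AVX; the operating system
       must also save the wider registers, which needs a separate check. */
    out->avx    = (ecx >> 28) & 1;
    return 1;
}
```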
CPUID evolution
• 1. Originally, Intel published code sequences that could detect minor
implementation or architectural differences to identify processor
generations.
• 2. With the advent of the Intel386 processor, Intel implemented
processor signature identification that provided the processor family,
model, and stepping numbers to software, but only upon reset.
• 3. As the Intel Architecture evolved, Intel extended the processor
signature identification into the CPUID instruction. The CPUID
instruction not only provides the processor signature, but also
provides information about the features supported by and
implemented on the Intel processor.
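The processor signature mentioned above can be read with CPUID as sketched below, again assuming the cpuid.h helper from GCC or Clang. Leaf 0 returns the vendor string in EBX, EDX, ECX (in that order), and leaf 1 returns the family, model and stepping fields in EAX. The function and struct names are hypothetical, and the extended family/model combination follows the rule Intel documents for its own processors.

```c
#include <cpuid.h>   /* __get_cpuid, provided by GCC and Clang */
#include <string.h>

/* Copies the 12-character vendor string (e.g. "GenuineIntel") into out.
   Returns 1 on success. cpu_vendor is a hypothetical name. */
int cpu_vendor(char out[13])
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 0;
    memcpy(out, &ebx, 4);       /* characters 0-3  */
    memcpy(out + 4, &edx, 4);   /* characters 4-7  */
    memcpy(out + 8, &ecx, 4);   /* characters 8-11 */
    out[12] = '\0';
    return 1;
}

/* Decoded processor signature. Hypothetical type for this sketch. */
struct cpu_signature {
    unsigned int family, model, stepping;
};

/* Decodes CPUID leaf 1 EAX into family, model and stepping. */
int read_cpu_signature(struct cpu_signature *sig)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;

    unsigned int base_family = (eax >> 8) & 0xF;
    unsigned int base_model  = (eax >> 4) & 0xF;
    unsigned int ext_family  = (eax >> 20) & 0xFF;
    unsigned int ext_model   = (eax >> 16) & 0xF;

    sig->stepping = eax & 0xF;
    /* Extended fields are only meaningful for certain base families. */
    sig->family = (base_family == 0xF) ? base_family + ext_family : base_family;
    sig->model  = (base_family == 0x6 || base_family == 0xF)
                      ? (ext_model << 4) + base_model
                      : base_model;
    return 1;
}
```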
CPUID Demo
Useful References
• http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
• http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
(Chapter 4 Coding For SIMD Architectures, Chapters 5, 6, 10 and 11)
• https://software.intel.com/en-us/isa-extensions
• https://www.scss.tcd.ie/Jeremy.Jones/CS4021/processor-identification-cpuid-instruction-note.pdf
• https://software.intel.com/en-us/articles/intel-software-development-emulator
• http://supercomputingblog.com/optimization/getting-started-with-sse-programming/
• http://felix.abecassis.me/2011/09/cpp-getting-started-with-sse/
• http://wiki.osdev.org/CPUID
• http://sandpile.org/x86/cpuid.htm
• http://www.etallen.com/cpuid.html
More to explore
• Memory alignment
• AVX
• FMA
• ARM NEON
• Intel® SHA Extensions
• Intel® VTune™ Amplifier
• Intel® VTune™ Performance Analyzer
• Intel® Software Development Emulator
• …
Thank You!
Lihang Li @ IEG

Boost Program Speed with SSE

  • 1. SSE的那些事儿 (All About SSE): Use SIMD to boost your program!
  • 3. Outline • What is SSE? • Why SSE? • How to use SSE? • CPUID • Useful References • Discussions
  • 5. SSE • Streaming SIMD Extensions A set of CPU instructions dedicated to applications like signal processing, scientific computation or 3D graphics.
  • 6. SIMD • Single Instruction, Multiple Data A CPU instruction is said to be SIMD when the same operation is applied on multiple data at the same time, i.e. operate on a “vector” of data with a single instruction.
  • 7. Flynn’s taxonomy • Flynn's taxonomy is a classification of computer architectures, proposed by Michael Flynn in 1966. Single instruction stream Multiple instruction streams Single data stream SISD MISD Multiple data streams SIMD MIMD PU: Processing Unit
  • 8. More on SSE • Streaming SIMD Extensions (SSE) is an SIMD instruction set extension to the x86 architecture, designed by Intel and introduced in 1999 in their Pentium III series processors as a reply to AMD's 3DNow! • SSE contains 70 new instructions, most of which work on single precision floating point data. • Intel's first IA-32 SIMD effort was the MMX instruction set. • SSE was subsequently expanded by Intel to SSE2, SSE3, SSSE3, SSE4 and AVX. • SSE was originally called Katmai New Instructions (KNI), Katmai being the code name for the first Pentium III core revision.
  • 9. SSE Registers • SSE originally added eight new 128-bit registers known as XMM0 through XMM7. Later versions add more registers. • There is also a new 32-bit control/status register, MXCSR, which provides control and status bits for operations performed on XMM registers.
  • 10. SSE instructions • Packed and scalar single-precision floating-point instructions: data movement, arithmetic, logical, comparison, shuffle, and conversion instructions. • 64-bit SIMD integer instructions: operate on data in MMX registers and 64-bit memory locations. • State management instructions: LDMXCSR and STMXCSR. • Cacheability control, prefetch, and memory ordering instructions: give programs more control over the caching of data.
  • 11. Intel CPU SIMD technology evolution
  • 13. Advantages of SIMD • Many real-world problems, especially in science and engineering, map well to computation on arrays. • SIMD instructions can greatly increase performance when exactly the same operations are to be performed on multiple data objects (arrays). • Typical applications are digital signal processing and graphics processing.
  • 15. Think twice before you go • What is your application? • Is there a better algorithm? • Will the effort eventually yield a performance gain? How much? • Which SSE version suits it best? • Does your CPU support SSE? If so, up to which version? • Does your operating system support SSE? • How will you code the SSE programs? Assembly or a high-level language? • …
  • 16. Identify if applicable • SIMD improves the performance of 3D graphics, speech recognition, image processing, scientific applications, and applications that have the following characteristics: inherently parallel; recurring memory access patterns; localized recurring operations performed on the data; data-independent control flow. • Support must be ensured on both the CPU and the operating system. • SIMD application candidates: speech compression algorithms and filters; speech recognition algorithms; video display and capture routines; rendering routines; 3D graphics (geometry); image and video processing algorithms; spatial (3D) audio; physical modeling (graphics, CAD); workstation applications; encryption algorithms; complex arithmetic.
  • 17. Choose the right instructions – Refer to Intel Optimization Manual 2.9 • MMX • SSE • SSE2 • SSE3 • SSSE3 • SSE4 • AESNI and PCLMULQDQ • AVX, FMA and AVX2
  • 18. Coding methodologies for SIMD • Assembly • Intrinsic • Classes • Automatic Vectorization
  • 19. Assembly • Key loops can be coded directly in assembly language using an assembler or by using inline assembly (C-ASM) in C/C++ code. • This model offers the opportunity for attaining greatest performance, but this performance is not portable across the different processor architectures.
  • 20. Intrinsic • Intrinsics provide access to the ISA functionality using C/C++ style coding instead of assembly language. • https://software.intel.com/sites/landingpage/IntrinsicsGuide/#
  • 21. Header File       Instructions & CPU
       x86intrin.h       x86 instructions
       mmintrin.h        MMX (Pentium MMX!)
       mm3dnow.h         3dnow! (K6-2) (deprecated)
       xmmintrin.h       SSE + MMX (Pentium 3, Athlon XP)
       emmintrin.h       SSE2 + SSE + MMX (Pentium 4, Athlon 64)
       pmmintrin.h       SSE3 + SSE2 + SSE + MMX (Pentium 4 Prescott, Athlon 64 San Diego)
       tmmintrin.h       SSSE3 + SSE3 + SSE2 + SSE + MMX (Core 2, Bulldozer)
       popcntintrin.h    POPCNT (Core i7, Phenom; subset of SSE4.2 and SSE4A)
       ammintrin.h       SSE4A + SSE3 + SSE2 + SSE + MMX (Phenom)
       smmintrin.h       SSE4_1 + SSSE3 + SSE3 + SSE2 + SSE + MMX (Core i7, Bulldozer)
       nmmintrin.h       SSE4_2 + SSE4_1 + SSSE3 + SSE3 + SSE2 + SSE + MMX (Core i7, Bulldozer)
       wmmintrin.h       AES (Core i7 Westmere, Bulldozer)
       immintrin.h       AVX, SSE4_2 + SSE4_1 + SSSE3 + SSE3 + SSE2 + SSE + MMX (Core i7 Sandy Bridge, Bulldozer)
  • 22. Classes • A set of C++ classes has been defined and is available in the Intel C++ Compiler to provide both a higher-level abstraction and more flexibility for programming with SIMD technology.
  • 23. Automatic Vectorization • The Intel C++ Compiler provides an optimization mechanism by which loops, such as in Example 4-13 can be automatically vectorized, or converted into Streaming SIMD Extensions code. • Compile this code using the -QAX and -QRESTRICT switches of the Intel C++ Compiler, version 4.0 or later.
  • 26. CPUID • CPU IDentification • The CPUID instruction can be used to retrieve various pieces of information about your CPU, such as its vendor string and model number, the size of its internal caches, and (more interestingly) the list of CPU features it supports.
  • 27. CPUID evolution • 1. Originally, Intel published code sequences that could detect minor implementation or architectural differences to identify processor generations. • 2. With the advent of the Intel386 processor, Intel implemented processor signature identification that provided the processor family, model, and stepping numbers to software, but only upon reset. • 3. As the Intel Architecture evolved, Intel extended the processor signature identification into the CPUID instruction. The CPUID instruction not only provides the processor signature, but also provides information about the features supported by and implemented on the Intel processor.
  • 30. Useful References • http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html • http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf (Chapter 4 Coding For SIMD Architectures, Chapter 5 & 6 & 10 & 11) • https://software.intel.com/en-us/isa-extensions • https://www.scss.tcd.ie/Jeremy.Jones/CS4021/processor-identification-cpuid-instruction-note.pdf • https://software.intel.com/en-us/articles/intel-software-development-emulator • http://supercomputingblog.com/optimization/getting-started-with-sse-programming/ • http://felix.abecassis.me/2011/09/cpp-getting-started-with-sse/ • http://wiki.osdev.org/CPUID • http://sandpile.org/x86/cpuid.htm • http://www.etallen.com/cpuid.html
  • 31. More to explore • Memory alignment • AVX • FMA • ARM NEON • Intel® SHA Extensions • Intel® VTune™ Amplifier • Intel® VTune™ Performance Analyzer • Intel® Software Development Emulator • …
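The intrinsic approach from slide 20 and the memory alignment topic from slide 31 can be illustrated together with a minimal sketch. It is not taken from the slides: the function names `add_arrays_sse` and `is_aligned16` are hypothetical, and the code assumes an x86 compiler that ships `xmmintrin.h` and that all three arrays start on a 16-byte boundary (which `_mm_load_ps` and `_mm_store_ps` require).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps */

/* Hypothetical helper: nonzero when p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical example: dst[i] = a[i] + b[i] for i in [0, n).
 * Assumes a, b and dst are all 16-byte aligned. */
void add_arrays_sse(const float *a, const float *b, float *dst, size_t n)
{
    assert(is_aligned16(a) && is_aligned16(b) && is_aligned16(dst));

    size_t i = 0;

    /* Main loop: one SSE addition handles four floats at a time. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_load_ps(a + i);   /* aligned 128-bit load */
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(dst + i, _mm_add_ps(va, vb));
    }

    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
}
```

If alignment cannot be guaranteed, `_mm_loadu_ps` and `_mm_storeu_ps` accept unaligned addresses, possibly at some cost in speed on older processors.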

Editor's Notes

  1. The first use of SIMD instructions was in vector supercomputers of the early 1970s such as the CDC Star-100 and the Texas Instruments ASC, which could operate on a "vector" of data with a single instruction. Supercomputing moved away from the SIMD approach when inexpensive scalar MIMD approaches based on commodity processors such as the Intel i860 XP became more powerful, and interest in SIMD waned. The current era of SIMD processors grew out of the desktop-computer market rather than the supercomputer market. As desktop processors became powerful enough to support real-time gaming and video processing, demand grew for this particular type of computing power, and microprocessor vendors turned to SIMD to meet the demand. The first widely deployed desktop SIMD was Intel's MMX extensions to the x86 architecture in 1996. This sparked the introduction of the much more powerful AltiVec system in Motorola's PowerPC and IBM's POWER systems. Intel responded in 1999 by introducing the all-new SSE system. Since then, there have been several extensions to the SIMD instruction sets for both architectures.
  2. MMX had two main problems: it reused the existing floating-point registers, so the CPU could not work on floating-point and SIMD data at the same time, and it only worked on integers. SSE floating-point instructions operate on a new, independent register set (the XMM registers), and SSE adds a few integer instructions that work on MMX registers. During the Katmai project, Intel sought to distinguish the new design from its earlier product line, particularly its flagship Pentium II. The instruction set was later renamed Intel Streaming SIMD Extensions (ISSE), then SSE. AMD eventually added support for SSE instructions, starting with its Athlon XP and Duron (Morgan core) processors.
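  3. In 1996 Intel introduced the MMX (Multi Media eXtensions) instruction set, which also pioneered SIMD (Single Instruction, Multiple Data) instruction sets, in which a single instruction can perform operations on multiple data items within one cycle. MMX put the Pentium MMX in the spotlight at the time. The SSE (Streaming SIMD Extensions) instruction set was first introduced by Intel in 1999 with the Pentium III processor, widening vector processing from 64 bits to 128 bits. With the Willamette-core Pentium 4, Intel upgraded the extension set to SSE2 (2000), and SSE3 (2004) first appeared in the Prescott-core Pentium 4. SSE4 (2007) was the largest instruction set extension since SSE. It is actually split into SSE4.1, which appeared in Penryn, and SSE4.2, which appeared in Nehalem. SSE4.1 accounts for most of the instructions, 47 in total; the SSE4 update in Nehalem was small, only 7 instructions, bringing the total to 54 instructions, known as SSE4.2. Just as everyone assumed by inertia that Intel would release SSE5, an unexpected challenger appeared: in August 2007 AMD announced the SSE5 instruction set first (SSE through SSE4 all came from Intel). Intel immediately declared that it would not support SSE5, and instead announced in March 2008 that the Sandy Bridge microarchitecture would introduce the all-new AVX instruction set. Intel published the AVX specification in April of that year and has kept updating it since. The industry widely regards AVX support as the single most important advance in Sandy Bridge.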
  4. For optimal use, the Streaming SIMD Extensions need data aligned on a 16-byte boundary.
  5. These classes provide an easy-to-use and flexible interface to the intrinsic functions, allowing developers to write more natural C++ code without worrying about which intrinsic or assembly language instruction to use for a given operation. Since the intrinsic functions underlie the implementation of these C++ classes, the performance of applications using this methodology can approach that of applications using the intrinsics directly. Here, fvec.h is the class definition file and F32vec4 is the class representing an array of four floats. The “+” and “=” operators are overloaded so that the actual Streaming SIMD Extensions implementation in the previous example is abstracted out, or hidden, from the developer. Note how much more closely this resembles the original code, allowing for simpler and faster programming.
  6. The compiler uses similar techniques to those used by a programmer to identify whether a loop is suitable for conversion to SIMD. This involves determining whether the following might prevent vectorization: • The layout of the loop and the data structures used. • Dependencies amongst the data accesses in each iteration and across iterations.
  7. By taking advantage of the CPUID instruction, software developers can create software applications and tools that can execute compatibly across the widest range of Intel processor generations and models, past, present, and future.
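The CPUID feature check described in slide 26 and in note 7 can be sketched as follows. This is an illustration under stated assumptions rather than code from the deck: it relies on the `__get_cpuid` helper from GCC/Clang's `cpuid.h` (MSVC offers `__cpuid` in `intrin.h` instead), and the `simd_features` struct and `detect_simd` function are hypothetical names. The bit positions are the feature flags that CPUID leaf 1 returns in EDX and ECX.

```c
#include <cpuid.h>  /* GCC/Clang helper: __get_cpuid(leaf, &eax, &ebx, &ecx, &edx) */

/* Hypothetical container for the feature flags of interest (each 0 or 1). */
typedef struct {
    int sse, sse2, sse3, ssse3, sse41, sse42;
} simd_features;

/* Hypothetical helper: query CPUID leaf 1 and decode the SSE-family bits. */
simd_features detect_simd(void)
{
    simd_features f = {0};
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 when the requested leaf is not supported. */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        f.sse   = (edx >> 25) & 1;  /* EDX bit 25: SSE    */
        f.sse2  = (edx >> 26) & 1;  /* EDX bit 26: SSE2   */
        f.sse3  = (ecx >> 0)  & 1;  /* ECX bit 0:  SSE3   */
        f.ssse3 = (ecx >> 9)  & 1;  /* ECX bit 9:  SSSE3  */
        f.sse41 = (ecx >> 19) & 1;  /* ECX bit 19: SSE4.1 */
        f.sse42 = (ecx >> 20) & 1;  /* ECX bit 20: SSE4.2 */
    }
    return f;
}
```

A program would typically call `detect_simd()` once at startup and pick a code path accordingly. Detecting AVX takes more than a CPUID bit, because the operating system must also save the wider registers, so it is left out of this sketch.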