This document discusses streaming SIMD extensions (SSE) and how to use SIMD instructions to boost program performance. It defines SSE as a set of CPU instructions for applications like signal processing that use single instruction, multiple data (SIMD) parallelism. The document outlines what SSE is, the advantages of SIMD, how to identify if an application can benefit from SSE, different SSE versions, coding methods like assembly and intrinsics, and references for further information.
3. Outline
• What is SSE?
• Why SSE?
• How to use SSE?
• CPUID
• Useful References
• Discussions
5. SSE
• Streaming SIMD Extensions
A set of CPU instructions dedicated to applications like signal
processing, scientific computation or 3D graphics.
6. SIMD
• Single Instruction, Multiple Data
A CPU instruction is said to be SIMD when the same operation is
applied on multiple data at the same time, i.e. operate on a “vector”
of data with a single instruction.
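As a minimal sketch of this idea (assuming SSE intrinsics via `xmmintrin.h`, covered later in this deck), one ADDPS instruction adds four pairs of floats at once where scalar code would need four separate additions:

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Add four pairs of floats with a single SIMD instruction (ADDPS).
// The scalar equivalent would need four separate additions.
void add4(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);             // load 4 floats into an XMM register
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  // one instruction, 4 results
}
```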
7. Flynn’s taxonomy
• Flynn's taxonomy is a classification of computer architectures,
proposed by Michael Flynn in 1966.
                        Single instruction stream   Multiple instruction streams
Single data stream      SISD                        MISD
Multiple data streams   SIMD                        MIMD
8. More on SSE
• Streaming SIMD Extensions (SSE) is an SIMD instruction set extension
to the x86 architecture, designed by Intel and introduced in 1999 in
their Pentium III series processors as a reply to AMD's 3DNow!
• SSE contains 70 new instructions, most of which work on single
precision floating point data.
• Intel's first IA-32 SIMD effort was the MMX instruction set.
• SSE was subsequently expanded by Intel to SSE2, SSE3, SSSE3, SSE4
and AVX.
• SSE was originally called Katmai New Instructions (KNI), Katmai being
the code name for the first Pentium III core revision.
9. SSE Registers
• SSE originally added eight new 128-bit registers known as XMM0
through XMM7. Later versions add more registers.
• There is also a new 32-bit control/status register, MXCSR, which
provides control and status bits for operations performed on XMM
registers.
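A small sketch of working with MXCSR (assuming GCC/Clang on x86 and `xmmintrin.h`): the `_mm_getcsr`/`_mm_setcsr` intrinsics read and write the register, and bit 15 is the flush-to-zero (FTZ) control:

```cpp
#include <xmmintrin.h>

// Read MXCSR, set the flush-to-zero (FTZ) bit, and write it back.
// FTZ makes SSE operations return 0 instead of denormals,
// trading accuracy for speed.
unsigned enable_ftz(void) {
    unsigned csr = _mm_getcsr();   // 32-bit control/status register
    _mm_setcsr(csr | 0x8000);      // bit 15 = flush-to-zero
    return _mm_getcsr();
}
```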
10. SSE instructions
• Packed and scalar single-precision floating-point instructions
Data movement instructions
Arithmetic instructions
Logical instructions
Comparison instructions
Shuffle instructions
Conversion instructions
• 64-bit SIMD integer instructions
Operate on data in MMX registers and 64-bit memory locations.
• State management instructions
LDMXCSR
STMXCSR
• Cacheability control, prefetch, and memory ordering instructions
Give programs more control over the caching of data
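To make these categories concrete, here is a minimal sketch (assuming `xmmintrin.h`) that touches three of them in a few lines — data movement (MOVUPS), arithmetic (ADDPS), and shuffle (SHUFPS):

```cpp
#include <xmmintrin.h>

// One packed instruction from several of the categories above:
// data movement (MOVUPS), arithmetic (ADDPS), shuffle (SHUFPS).
void sum_reversed(const float *a, const float *b, float *out) {
    __m128 va  = _mm_loadu_ps(a);        // data movement
    __m128 vb  = _mm_loadu_ps(b);
    __m128 sum = _mm_add_ps(va, vb);     // arithmetic
    // shuffle: reverse the four lanes of the sum
    __m128 rev = _mm_shuffle_ps(sum, sum, _MM_SHUFFLE(0, 1, 2, 3));
    _mm_storeu_ps(out, rev);             // data movement
}
```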
13. Advantages of SIMD
• Many real-world problems, especially in science and engineering,
map well to computation on arrays.
• SIMD instructions can greatly increase performance when exactly the
same operations are to be performed on multiple data objects
(arrays).
• Typical applications are digital signal processing and graphics
processing.
15. Think twice before you go
• What is your application?
• Is there a better algorithm?
• Will the effort eventually yield a performance gain? How much?
• Which SSE version suits best?
• Does your CPU support SSE? If so, up to what version?
• Does your operating system have SSE support?
• How will you code the SSE programs? Assembly or high level?
• …
16. Identify if applicable
• SIMD improves the performance of 3D graphics, speech recognition, image processing, scientific applications, and applications that have the following characteristics:
Inherently parallel.
Recurring memory access patterns.
Localized recurring operations performed on the data.
Data-independent control flow.
• Support must be ensured on:
CPU
Operating System
• SIMD application candidates:
Speech compression algorithms and filters.
Speech recognition algorithms.
Video display and capture routines.
Rendering routines.
3D graphics (geometry).
Image and video processing algorithms.
Spatial (3D) audio.
Physical modeling (graphics, CAD).
Workstation applications.
Encryption algorithms.
Complex arithmetic.
17. Choose the right instructions – Refer to Intel
Optimization Manual 2.9
• MMX
• SSE
• SSE2
• SSE3
• SSSE3
• SSE4
• AESNI and PCLMULQDQ
• AVX, FMA and AVX2
19. Assembly
• Key loops can be coded directly in assembly language using an
assembler or by using inline assembly (C-ASM) in C/C++ code.
• This model offers the opportunity for attaining greatest performance,
but this performance is not portable across the different processor
architectures.
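A minimal inline-assembly sketch, assuming GCC/Clang extended `asm` syntax on x86 (the `"x"` constraint pins an operand to an SSE register; MSVC uses a different syntax entirely):

```cpp
#include <xmmintrin.h>

// GCC/Clang extended inline assembly: one ADDPS on XMM registers.
// "+x" makes `a` an input/output SSE-register operand;
// this is compiler- and architecture-specific, hence not portable.
__m128 add_ps_asm(__m128 a, __m128 b) {
    __asm__("addps %1, %0" : "+x"(a) : "x"(b));
    return a;
}
```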
20. Intrinsics
• Intrinsics provide access to the ISA functionality using C/C++ style
coding instead of assembly language.
• https://software.intel.com/sites/landingpage/IntrinsicsGuide/#
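As a sketch of the intrinsics style (assuming `xmmintrin.h`): each intrinsic maps roughly one-to-one onto an SSE instruction, but the compiler handles register allocation and instruction scheduling.

```cpp
#include <xmmintrin.h>

// _mm_sqrt_ps maps onto SQRTPS: four square roots at once,
// written in plain C/C++ instead of assembly.
void sqrt4(const float *in, float *out) {
    __m128 v = _mm_loadu_ps(in);
    _mm_storeu_ps(out, _mm_sqrt_ps(v));
}
```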
22. Classes
• A set of C++ classes has been defined and made available in the Intel
C++ Compiler to provide both a higher-level abstraction and more
flexibility for programming with SIMD technology.
23. Automatic Vectorization
• The Intel C++ Compiler provides an optimization mechanism by which
loops, such as the one in Example 4-13, can be automatically vectorized,
i.e. converted into Streaming SIMD Extensions code.
• Compile this code using the -Qax and -Qrestrict switches of the
Intel C++ Compiler, version 4.0 or later.
26. CPUID
• CPU IDentification
• The CPUID instruction can be used to retrieve a variety of
information about your CPU, such as its vendor string and model number,
the sizes of its internal caches and, most interestingly, the list of
CPU features supported.
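A minimal sketch of a feature check, assuming GCC/Clang on x86 (the `<cpuid.h>` helper is compiler-specific; MSVC instead offers `__cpuid` in `<intrin.h>`). CPUID leaf 1 returns feature flags: EDX bit 25 indicates SSE, bit 26 SSE2, and ECX bit 0 SSE3.

```cpp
#include <cpuid.h>  // GCC/Clang helper for the CPUID instruction

// Query CPUID leaf 1 and test EDX bit 25 (SSE support).
bool has_sse(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;           // leaf 1 not supported
    return (edx >> 25) & 1;     // bit 25 of EDX = SSE
}
```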
27. CPUID evolution
• 1. Originally, Intel published code sequences that could detect minor
implementation or architectural differences to identify processor
generations.
• 2. With the advent of the Intel386 processor, Intel implemented
processor signature identification that provided the processor family,
model, and stepping numbers to software, but only upon reset.
• 3. As the Intel Architecture evolved, Intel extended the processor
signature identification into the CPUID instruction. The CPUID
instruction not only provides the processor signature, but also
provides information about the features supported by and
implemented on the Intel processor.
The first use of SIMD instructions was in vector supercomputers of the early 1970s such as the CDC Star-100 and the Texas Instruments ASC, which could operate on a "vector" of data with a single instruction.
Supercomputing moved away from the SIMD approach when inexpensive scalar MIMD approaches based on commodity processors such as the Intel i860 XP [3] became more powerful, and interest in SIMD waned.
The current era of SIMD processors grew out of the desktop-computer market rather than the supercomputer market. As desktop processors became powerful enough to support real-time gaming and video processing, demand grew for this particular type of computing power, and microprocessor vendors turned to SIMD to meet the demand.
The first widely deployed desktop SIMD was Intel's MMX extensions to the x86 architecture in 1996. This sparked the introduction of the much more powerful AltiVec system in Motorola's PowerPC and IBM's POWER systems. Intel responded in 1999 by introducing the all-new SSE system. Since then, there have been several extensions to the SIMD instruction sets of both architectures.
MMX had two main problems: it re-used the existing floating-point registers, making the CPU unable to work on floating-point and SIMD data at the same time, and it only worked on integers. SSE floating-point instructions operate on a new, independent register set (the XMM registers), and SSE adds a few integer instructions that work on the MMX registers.
During the Katmai project, Intel sought to distinguish the new extensions from their earlier product line, particularly their flagship Pentium II. The extensions were later renamed Internet Streaming SIMD Extensions (ISSE), then SSE. AMD eventually added support for SSE instructions, starting with its Athlon XP and Duron (Morgan core) processors.
For optimal use of the Streaming SIMD Extensions, data needs to be aligned on a 16-byte boundary.
These classes provide an easy-to-use and flexible interface to the intrinsic functions, allowing developers to write more natural C++ code without worrying about which intrinsic or assembly-language instruction to use for a given operation. Since the intrinsic functions underlie the implementation of these C++ classes, the performance of applications using this methodology can approach that of applications written directly with intrinsics.
Here, fvec.h is the class definition file and F32vec4 is the class representing an array of four floats. The “+” and “=” operators are overloaded so that the actual Streaming SIMD Extensions implementation in the previous example is abstracted out, or hidden, from the developer. Note how much more this resembles the original code, allowing for simpler and faster programming.
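Intel's fvec.h is not reproduced here, but the idea can be sketched with a hypothetical F32vec4-style wrapper (an illustration built only on `xmmintrin.h`, not Intel's actual class): operator overloading hides the intrinsics from the caller.

```cpp
#include <xmmintrin.h>

// Illustrative sketch of an F32vec4-style wrapper (NOT Intel's fvec.h):
// the "+" operator hides _mm_add_ps from the developer.
struct Vec4f {
    __m128 v;
    explicit Vec4f(__m128 x) : v(x) {}
    explicit Vec4f(const float *p) : v(_mm_loadu_ps(p)) {}
    Vec4f operator+(const Vec4f &o) const { return Vec4f(_mm_add_ps(v, o.v)); }
    void store(float *p) const { _mm_storeu_ps(p, v); }
};
```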
The compiler uses similar techniques to those used by a programmer to identify whether a loop is suitable for conversion to SIMD. This involves determining whether the following might prevent vectorization: • The layout of the loop and the data structures used. • Dependencies amongst the data accesses in each iteration and across iterations.
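As a sketch of the dependency point above: a loop where each iteration reads the previous iteration's result cannot be split across the four lanes of an XMM register as written.

```cpp
// A loop-carried dependency: iteration i reads the result of
// iteration i-1, so the lanes cannot execute independently and
// the loop is not vectorizable in this form.
void prefix_sum(float *a, int n) {
    for (int i = 1; i < n; ++i)
        a[i] += a[i - 1];
}
```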
By taking advantage of the CPUID instruction, software developers can create software applications and tools that can execute compatibly across the widest range of Intel processor generations and models, past, present, and future.