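ITRI (Industrial Technology Research Institute)
Heterogeneous System Architecture (HSA) Design
Jay Wang (王振傑)
Division for Embedded System and SoC Technology, System Architecture Design Department (D200)
Information and Communications Research Laboratories (ICL)
ccwang.jay@itri.org.tw
2015-04-30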
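Industrial Technology Research Institute (ITRI)
Information and Communications Research Laboratories (ICL)
Division for Embedded System and SoC Technology
• Embedded System Hardware Technology Department (D100)
• System Architecture Design Department (D200)
• Embedded System Software Technology Department (D300)
• Smart Electronics Industry Promotion Department (D400)
• System Integration and Application Department (D500)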
HSA Design (2015-04-30) @ NCKU, Tainan
What is HSA?
An intelligent computing architecture that enables CPU, GPU and other
processors to work in harmony on a single piece of silicon by seamlessly
moving the right tasks to the best suited processing element.
Three Eras of Processor Performance

Single-Core Era (single-thread performance over time; "we are here" is marked on the curve, followed by a "?")
Enabled by:
• Moore's Observation
• Voltage Scaling
• Micro-Architecture
Constrained by:
• Power
• Complexity
Programming: Assembly → C/C++ → Java …

Multi-Core Era (throughput performance over time, i.e. # of processors; "we are here" is marked on the curve)
Enabled by:
• Moore's Observation
• Desire for Throughput
• 20 years of SMP arch
Constrained by:
• Power
• Parallel SW availability
• Scalability
Programming: pthreads → OpenMP / TBB …

Heterogeneous Systems Era (modern application performance over time, i.e. data-parallel exploitation; "we are here" is marked on the curve)
Enabled by:
• Moore's Observation
• Abundant data parallelism
• Power efficient data parallel processing (GPUs)
Constrained by:
• Programming models
• Communication overheads
Programming: Shader → CUDA → OpenCL → C++ and Java
SOURCE : HSA INTRODUCTION, HSA FOUNDATION (PHIL ROGERS, AMD)
HSA Foundation
• Founded in June 2012
• www.hsafoundation.com
• Developing a new platform for heterogeneous systems
• Launched the official v1.0 specification set in March 2015
HSA Foundation Members (April 2015)
Founders
Promoters
Contributors
Academics
Supporters
HSA Platform Model
In an HSA system, a regular device is called an HSA agent; an HSA agent that can run kernels is also an HSA kernel agent.
[Figure: HSA platform model. The host CPU (an HSA agent running the OS and the HSA runtime) handles serial and task-parallel workloads. An HSA kernel agent contains several compute units (CUs) and handles SIMD data-parallel workloads. Each CU is made up of lanes (processing elements). The wavefront size is a power of 2 in the range from 1 to 256 inclusive.]
Jay Wang, Taiwan, 2015.03
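The wavefront-size rule from the platform model figure (a power of 2 in the range from 1 to 256 inclusive) can be checked with a short Python sketch. The function name `is_valid_wavefront_size` is hypothetical and is not part of any HSA API.

```python
def is_valid_wavefront_size(n: int) -> bool:
    """Return True if n is a power of 2 in the range 1..256 inclusive."""
    # n & (n - 1) clears the lowest set bit; the result is 0 only for powers of 2.
    return 1 <= n <= 256 and (n & (n - 1)) == 0


valid_sizes = [n for n in range(1, 257) if is_valid_wavefront_size(n)]
print(valid_sizes)  # [1, 2, 4, 8, 16, 32, 64, 128, 256]
```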
HSA Intermediate Language (HSAIL)
The HSA Foundation members are building a heterogeneous compute software ecosystem
built on open, royalty-free industry standards and open-source software: the HSA
runtimes and compilation tools are based on open-source technologies such as LLVM and
GCC. ( https://github.com/HSAFoundation )
[Figure: HSAIL compilation flow. Parallel programming languages (OpenCL, C++ AMP, Java, OpenMP, DSLs) and HSA runtime libraries compile to the HSA Intermediate Language (HSAIL), a virtual parallel ISA. CLOC compiles OpenCL kernels to HSAIL. Vendor finalizers (Company A CPU, Company B CPU, Company C GPU, Company D GPU, Company E DSP, ...) translate HSAIL for Company A CPUs, Company B CPUs, Company C GPU, Company D GPU, Company E DSP, and other hardware accelerators. Jay Wang, Taiwan, 2014.10]
HSAIL Programming Model
HSA Runtime Stack
[Figure: HSA runtime stack. On the host CPU, the HSA application (an HSA agent) consists of the user application (CPU code + HSAIL kernel code), a language runtime (e.g. the OpenCL runtime), and the HSA runtime. HSA finalizers produce the target ISA, and the HSA kernel mode driver sits alongside. Work reaches the HSA kernel agents (CPU, GPU, DSP) through HSA user mode queuing (Architected Queuing Language) plus HSA signaling. Jay Wang, Taiwan, 2015.04]
Kernel Execution
HSA Memory Consistency Model (Relaxed Model)

Rows give the first operation, columns the second operation.
Column groups: atomic_rlx = atomic_rlx, atomicNoRet_rlx; acq = atomic_acq, atomicNoRet_acq, fence_acq; rel = atomic_rel, atomicNoRet_rel, fence_rel; ar = atomic_ar, atomicNoRet_ar, fence_ar.

First operation                          ld_rlx  st_rlx  atomic_rlx  acq  rel  ar
ld_rlx or st_rlx                         yes     yes     yes         yes  no   no
atomic_rlx, atomicNoRet_rlx              yes     yes     yes         no   no   no
atomic_acq, atomicNoRet_acq, fence_acq   no      no      no          no   no   no
atomic_rel, atomicNoRet_rel              yes     yes     no          no   no   no
fence_rel                                yes     no      no          no   no   no
atomic_ar, atomicNoRet_ar, fence_ar      no      no      no          no   no   no

Memory orders shown: relaxed, acquire, release, acq_rel.
System Arch. Requirements
1. Shared Virtual Memory
2. Cache Coherency Domains
3. Flat Addressing
4. Endianness
5. Signaling and Synchronization
6. Atomic Memory Operations
7. HSA System Timestamp
8. User Mode Queuing
9. Architected Queuing Language (AQL)
10. Agent Scheduling
11. Kernel Agent Context Switching
12. IEEE754-2008 Floating Point Exceptions
13. Kernel Agent Hardware Debug Infrastructure
14. HSA Platform Topology Discovery
15. Images
Source: HSA Platform System Architecture Specification, Version 1.0 Final (2015-03-16)
Legacy GPU Compute
• Multiple memory pools and address spaces
• Data copies before/after GPU compute
[Figure: legacy GPU compute. The host CPUs (HSA agent) use system memory through Virtual Memory #1, and the GPU (HSA kernel agent) uses GPU memory through Virtual Memory #2. Three numbered steps (1, 2, 3) are marked on the diagram. Jay Wang, Taiwan, 2015.04]
Shared Virtual Memory (HSA)
[Figure: shared virtual memory. The host CPUs (HSA agent) and the GPU (HSA kernel agent) share one virtual memory covering system memory and GPU memory. The MMU and the IOMMU both work from the OS page table. A 32-bit HSA system uses 32-bit virtual addresses; a 64-bit HSA system uses virtual addresses of at least 48 bits. Jay Wang, Taiwan, 2015.04]
HSA Memory Hierarchy
Memory segments:
1) Global
2) Group
3) Private
4) Kernarg
5) Readonly
6) Spill
7) Arg
[Figure: HSA memory hierarchy. A kernel dispatch grid is divided into work-groups of work-items (WI). Each work-item has a private segment, each work-group has a group segment, and all share the global segment. The private, group, and global segments all sit within the flat address space of the HSA agent, which is a virtual address range reservation (system memory or device local memory). Local registers per work-item: 1-bit control registers (c registers, $c0 to $c7), 32-bit registers (s registers, $s0 to $s127), 64-bit registers (d registers, $d0 to $d63), and 128-bit registers (q registers, $q0 to $q31). Jay Wang, Taiwan, 2014.10]
Cache Coherency Domains
[Figure: cache coherency domains. The kernel dispatch grid, with its private, group, and global segments within the flat address space, is shown together with the host CPUs and the HSA kernel agent. Their caches sit in front of system memory and are kept coherent. Jay Wang, Taiwan, 2015.04]
Signaling and Synchronization
• The required mechanisms for HSAIL and the HSA runtime are:
  • Allocate and destroy an HSA signal
  • Read the current HSA signal value
  • Wait on an HSA signal to meet a specified condition (with a maximum wait duration requested)
  • Send an HSA signal value
  • Atomically read-modify-write an HSA signal value
[Figure: HSA signals shown next to the POSIX semaphore calls (sem_init(), sem_wait(), sem_post(), sem_destroy()) and mutex calls (pthread_mutex_init(), pthread_mutex_lock(), pthread_mutex_unlock(), pthread_mutex_destroy()). A signal handle (hsa_signal_t) refers to implementation-defined data holding a signal value (hsa_signal_value_t, Sig32 or Sig64). The host CPU uses HSA runtime APIs and the HSA kernel agent uses HSAIL instructions to operate on it. Jay Wang, Taiwan, 2015.04]
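The required signal mechanisms (read, send, atomic read-modify-write, and wait on a condition with a maximum duration) can be modelled in a few lines of Python. This is a toy sketch built on `threading.Condition`, not the HSA runtime: the class `ToySignal` and its method names are invented for illustration, and the relaxed/acquire/release variants are not modelled.

```python
import threading


class ToySignal:
    """Toy model of a signal: a shared value with read, send, RMW and wait."""

    def __init__(self, initial=0):
        self._value = initial
        self._cond = threading.Condition()

    def load(self):
        # Read the current signal value.
        with self._cond:
            return self._value

    def store(self, value):
        # Send a new signal value and wake any waiters.
        with self._cond:
            self._value = value
            self._cond.notify_all()

    def add(self, delta):
        # Atomic read-modify-write: the lock makes the update indivisible.
        with self._cond:
            self._value += delta
            self._cond.notify_all()

    def wait(self, predicate, timeout):
        # Block until predicate(value) holds or `timeout` seconds elapse,
        # then return the value that was observed.
        with self._cond:
            self._cond.wait_for(lambda: predicate(self._value), timeout=timeout)
            return self._value


# A worker decrements the signal; the main thread waits for it to reach 0.
done = ToySignal(1)
worker = threading.Thread(target=lambda: done.add(-1))
worker.start()
final = done.wait(lambda v: v == 0, timeout=5.0)
worker.join()
print(final)  # 0
```

The wait returns the observed value rather than a boolean so that a caller can tell a satisfied condition from a timeout by checking the value itself.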
HSA Runtime APIs for Signaling
HSA Runtime APIs ( for HSA application )
• hsa_signal_create ( )
• hsa_signal_destroy ( )
• hsa_signal_load_{acquire, relaxed} ( )
• hsa_signal_store_{relaxed, release} ( )
• hsa_signal_exchange_{acq_rel, acquire, relaxed, release} ( )
• hsa_signal_cas_{acq_rel, acquire, relaxed, release} ( )
• hsa_signal_add_{acq_rel, acquire, relaxed, release} ( )
• hsa_signal_subtract_{acq_rel, acquire, relaxed, release} ( )
• hsa_signal_and_{acq_rel, acquire, relaxed, release} ( )
• hsa_signal_or_{acq_rel, acquire, relaxed, release} ( )
• hsa_signal_xor_{acq_rel, acquire, relaxed, release} ( )
• hsa_signal_wait_{acquire, relaxed} ( )
Source: HSA Runtime Programmer's Reference Manual (v1.0), Section 2.4 Signals
HSAIL Instructions for Signaling
Source: HSA Programmer's Reference Manual: HSAIL Virtual ISA and Programming Model, Compiler Writer's Guide, and Object Format (BRIG) (v1.0), Section 6.8 Notification (signal) Instructions
Atomic Memory Operations
• HSA requires the following standard atomic memory operations to be supported by HSA Kernel Agents (other HSA Agents only need to support the subset of these operations required by their role in the system):
  • Load from memory
  • Store to memory
  • Fetch from memory, apply a logic operation (bitwise AND/OR/XOR) with one additional operand, and store back.
  • Fetch from memory, apply an integer arithmetic operation (add, subtract, increment, decrement, minimum, maximum) with one additional operand, and store back.
  • Exchange memory location with operand.
  • Compare-and-swap (CAS): load memory location, compare with first operand, and if equal then store second operand back to memory location.
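The semantics of a few of these operations (load, fetch-and-add, exchange, compare-and-swap) can be sketched in Python. This is an illustrative model only: the class `AtomicCell` and the helper `atomic_max` are hypothetical names, and a lock stands in for the hardware atomicity that an HSA kernel agent would provide.

```python
import threading


class AtomicCell:
    """One memory location whose operations are made indivisible by a lock."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def fetch_add(self, operand):
        # Fetch, apply integer add with one operand, store back; return old value.
        with self._lock:
            old = self._value
            self._value = old + operand
            return old

    def exchange(self, operand):
        # Swap the memory location with the operand; return old value.
        with self._lock:
            old, self._value = self._value, operand
            return old

    def compare_and_swap(self, expected, desired):
        # Load, compare with `expected`, store `desired` only if equal.
        with self._lock:
            old = self._value
            if old == expected:
                self._value = desired
            return old


def atomic_max(cell, candidate):
    """Build a 'maximum' update from CAS by retrying until it lands."""
    seen = cell.load()
    while seen < candidate:
        witnessed = cell.compare_and_swap(seen, candidate)
        if witnessed == seen:
            break
        seen = witnessed


counter = AtomicCell(0)

def bump():
    for _ in range(1000):
        counter.fetch_add(1)

threads = [threading.Thread(target=bump) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.load())  # 4000

high = AtomicCell(5)
atomic_max(high, 9)
atomic_max(high, 7)
print(high.load())  # 9
```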
HSA System Timestamp
• The HSA system provides a low-overhead mechanism for determining the passing of time.
• A system timestamp is required that can be read from HSAIL or through the HSA runtime.
• The system timestamp frequency can also be determined through the HSA runtime.
[Figure: the 64-bit timestamp is read by the HSA kernel agent with the HSAIL clock instruction and by the host CPU through HSA runtime APIs. The HSA runtime also reports the timestamp frequency (1~400MHz). Jay Wang, Taiwan, 2015.04]
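Turning two timestamp readings into elapsed time only needs the timestamp frequency. The Python sketch below assumes the readings are raw 64-bit tick counts and that the frequency is given in Hz; the function name `elapsed_seconds` is hypothetical, and the range check mirrors the 1 to 400 MHz range shown on the slide.

```python
MASK64 = (1 << 64) - 1  # the timestamp is a 64-bit counter


def elapsed_seconds(start_ticks, end_ticks, frequency_hz):
    """Convert two 64-bit timestamp readings into elapsed seconds."""
    if not 1_000_000 <= frequency_hz <= 400_000_000:
        raise ValueError("frequency outside the 1 to 400 MHz range")
    # Masking keeps the difference correct across a single counter wrap.
    elapsed_ticks = (end_ticks - start_ticks) & MASK64
    return elapsed_ticks / frequency_hz


print(elapsed_seconds(1_000, 101_000, 100_000_000))  # 0.001
```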
User Mode Queuing
• Multiple user-level command queues
• Runtime-allocated
• Architected Queuing Language (AQL)
[Figure: user mode queues in the HSA runtime stack. The user application, the language runtime (e.g. the OpenCL runtime), and the HSA runtime form the HSA application (an HSA agent) on the CPU, shown with the HSA finalizers and the HSA kernel mode driver. Kernel dispatch packets (K) travel through AQL kernel dispatch queues to the HSA kernel agents (CPU, GPU), and agent dispatch packets (A) travel through AQL agent dispatch queues. Jay Wang, Taiwan, 2015.04]
HSA Packet Processor

User Mode Queue Structure (hsa_queue_t)
Offset  Field
0x00    type (supports single or multiple producers)
0x04    features (supports KERNEL_DISPATCH and/or AGENT_DISPATCH packets)
0x08    base_address (64-bit, through 0x0C)
0x10    doorbell_signal (64-bit, through 0x14)
0x18    size
0x1C    reserved (must be 0)
0x20    id (64-bit, through 0x24)

Ring buffer of AQL packets (64 bytes each), indexed by read_index (64-bit) and write_index (64-bit):
• Next packet to read: base_address + ( (read_index % size) * AQL packet size )
• Next packet to write: base_address + ( (write_index % size) * AQL packet size )
Jay Wang, Taiwan, 2015.03
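The ring-buffer address formula from the queue structure figure can be tried out directly. In this Python sketch the base address and queue size are made-up example values, and `packet_address` is a hypothetical helper name; only the formula and the 64-byte packet size come from the slide.

```python
AQL_PACKET_SIZE = 64  # bytes, as stated on the slide


def packet_address(base_address, index, size):
    """base_address + ((index % size) * AQL packet size)."""
    return base_address + (index % size) * AQL_PACKET_SIZE


base_address, size = 0x1000, 4  # example values, not from the slide
for write_index in range(6):
    print(write_index, hex(packet_address(base_address, write_index, size)))
# 0 0x1000
# 1 0x1040
# 2 0x1080
# 3 0x10c0
# 4 0x1000
# 5 0x1040
```

Because the index is reduced modulo the queue size only when the address is computed, the read and write indices themselves can keep growing for the lifetime of the queue.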
User Mode Queue Operations
[Figure: the user application, the language runtime (e.g. the OpenCL runtime), and the HSA runtime on the CPU (the HSA application, an HSA agent) exchange kernel dispatch (K) and agent dispatch (A) packets with the HSA kernel agent (GPU). Jay Wang, Taiwan, 2015.04]
HSA Runtime APIs ( for HSA application )
• hsa_queue_create ( )
• hsa_soft_queue_create ( )
• hsa_queue_destroy ( )
• hsa_queue_inactivate ( )
• hsa_queue_load_write_index_{acquire, relaxed} ( )
• hsa_queue_store_write_index_{relaxed, release} ( )
• hsa_queue_cas_write_index_{acq_rel, acquire, relaxed, release} ( )
• hsa_queue_add_write_index_{acq_rel, acquire, relaxed, release} ( )
• hsa_queue_load_read_index_{acquire, relaxed} ( )
• hsa_queue_store_read_index_{relaxed, release} ( )
HSAIL Instructions ( for HSA Kernel Agent)
• queueid_u32 dest
• queueptr_uLength dest
• ldqueuewriteindex_segment_order_u64 dest, address
• stqueuewriteindex_segment_order_u64 address, src
• casqueuewriteindex_segment_order_u64 dest, address, src0, src1
• addqueuewriteindex_segment_order_u64 dest, address, src
• ldqueuereadindex_segment_order_u64 dest, address
• stqueuereadindex_segment_order_u64 address, src
AQL Packet Types
Every AQL packet occupies offsets 0x00 to 0x3C (64 bytes) and uses one of three layouts:
• Kernel Dispatch Packet: header, dimensions (2-bit), workgroup_size_x/y/z, grid_size_x/y/z, private_segment_size_bytes, group_segment_size_bytes, kernel_object, kernarg_address, reserved fields, completion_signal
• Agent Dispatch Packet: header, type, return_address, arg0 to arg3, reserved fields, completion_signal
• Barrier-AND / Barrier-OR Packet: header, dep_signal0 to dep_signal4, reserved fields, completion_signal
• completion_signal: HSA signaling object handle used to indicate completion of the job.

Packet header (16 bits)
Bits    Field
15-13   reserved (3-bit)
12-11   release_fence_scope (2-bit)
10-9    acquire_fence_scope (2-bit)
8       barrier (1-bit)
7-0     format (8-bit)

AQL_FORMAT
0  VENDOR_SPECIFIC
1  INVALID
2  KERNEL_DISPATCH
3  BARRIER_AND
4  AGENT_DISPATCH
5  BARRIER_OR
Jay Wang, Taiwan, 2015.03
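The 16-bit packet header (format in bits 7-0, barrier in bit 8, acquire_fence_scope in bits 10-9, release_fence_scope in bits 12-11) can be packed and unpacked with plain shifts and masks. The Python sketch below uses the AQL_FORMAT values from the slide; the function names and the scope value 2 in the example are illustrative assumptions, not taken from the HSA runtime.

```python
from enum import IntEnum


class AqlFormat(IntEnum):
    VENDOR_SPECIFIC = 0
    INVALID = 1
    KERNEL_DISPATCH = 2
    BARRIER_AND = 3
    AGENT_DISPATCH = 4
    BARRIER_OR = 5


def pack_header(fmt, barrier, acquire_scope, release_scope):
    """Pack the header fields into a 16-bit integer (reserved bits stay 0)."""
    if not 0 <= acquire_scope <= 3 or not 0 <= release_scope <= 3:
        raise ValueError("fence scopes are 2-bit fields")
    return (
        (int(fmt) & 0xFF)
        | (int(bool(barrier)) << 8)
        | (acquire_scope << 9)
        | (release_scope << 11)
    )


def unpack_header(header):
    """Split a 16-bit header back into its named fields."""
    return {
        "format": AqlFormat(header & 0xFF),
        "barrier": (header >> 8) & 0x1,
        "acquire_fence_scope": (header >> 9) & 0x3,
        "release_fence_scope": (header >> 11) & 0x3,
    }


header = pack_header(AqlFormat.KERNEL_DISPATCH, barrier=True,
                     acquire_scope=2, release_scope=2)
print(hex(header))  # 0x1502
print(unpack_header(header)["format"].name)  # KERNEL_DISPATCH
```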
Kernel Dispatch Packet
Offset  Bits 31-16            Bits 15-0
0x00    dimensions (2-bit)    header
0x04    workgroup_size_y      workgroup_size_x
0x08    reserved              workgroup_size_z
0x0C    grid_size_x
0x10    grid_size_y
0x14    grid_size_z
0x18    private_segment_size_bytes
0x1C    group_segment_size_bytes
0x20    kernel_object (64-bit, through 0x24)
0x28    kernarg_address (64-bit, through 0x2C)
0x30    reserved (64-bit, through 0x34)
0x38    completion_signal (64-bit, through 0x3C)
Jay Wang, Taiwan, 2015.03

• workgroup_size_x/y/z: work-group size
• grid_size_x/y/z: grid size
• private_segment_size_bytes, group_segment_size_bytes: segment size
• kernel_object: pointer to the kernel
• kernarg_address: pointer to the arguments
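The grid size and work-group size fields of the Kernel Dispatch Packet together determine how many work-groups a dispatch launches. The Python sketch below assumes that a grid size which is not a multiple of the work-group size is rounded up to one extra, partially filled work-group in that dimension; the function name and the example sizes are illustrative only.

```python
def work_group_counts(grid_size, workgroup_size):
    """Work-groups per dimension, rounding up for a partial last group."""
    # -(-g // w) is integer ceiling division.
    return tuple(-(-g // w) for g, w in zip(grid_size, workgroup_size))


counts = work_group_counts((1920, 1080, 1), (16, 16, 1))
total = counts[0] * counts[1] * counts[2]
print(counts, total)  # (120, 68, 1) 8160
```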
Agent Dispatch Packet
Offset  Bits 31-16   Bits 15-0
0x00    type         header
0x04    reserved
0x08    return_address (64-bit, through 0x0C)
0x10    arg0 (64-bit, through 0x14)
0x18    arg1 (64-bit, through 0x1C)
0x20    arg2 (64-bit, through 0x24)
0x28    arg3 (64-bit, through 0x2C)
0x30    reserved (64-bit, through 0x34)
0x38    completion_signal (64-bit, through 0x3C)
Jay Wang, Taiwan, 2015.03

• type: the function to be performed by the destination agent. The function codes are application defined.
• return_address: pointer to the location in which to store the function return value(s)
• arg0 to arg3: 64-bit direct or indirect arguments
Barrier-AND / Barrier-OR Packet
Offset  Bits 31-16   Bits 15-0
0x00    reserved     header
0x04    reserved
0x08    dep_signal0 (64-bit, through 0x0C)
0x10    dep_signal1 (64-bit, through 0x14)
0x18    dep_signal2 (64-bit, through 0x1C)
0x20    dep_signal3 (64-bit, through 0x24)
0x28    dep_signal4 (64-bit, through 0x2C)
0x30    reserved (64-bit, through 0x34)
0x38    completion_signal (64-bit, through 0x3C)
Jay Wang, Taiwan, 2015.03

• The Barrier packet defines dependencies for the HSA Packet Processor to monitor.
• The HSA Packet Processor will not launch any further packets until the Barrier-AND / Barrier-OR packet is complete.
• dep_signal0 to dep_signal4: handles for dependent signaling objects to be evaluated by the packet processor.
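How a packet processor might evaluate the dependent signals of a barrier packet can be sketched as follows. The slide does not state the completion condition, so this Python sketch rests on labelled assumptions: a dependency counts as satisfied when its signal value is 0, Barrier-AND needs all used slots satisfied, Barrier-OR needs at least one, and unused slots (modelled here as `None`) are ignored. The function name is hypothetical.

```python
def barrier_condition_met(kind, dep_signal_values):
    """Evaluate up to five dependent signal values for a barrier packet.

    kind: "AND" or "OR".
    dep_signal_values: signal values, with None marking an unused slot.
    """
    if len(dep_signal_values) > 5:
        raise ValueError("a barrier packet has five dep_signal slots")
    used = [v for v in dep_signal_values if v is not None]
    if not used:
        return True  # assumption: no dependencies means nothing to wait for
    if kind == "AND":
        return all(v == 0 for v in used)  # assumption: 0 means satisfied
    if kind == "OR":
        return any(v == 0 for v in used)
    raise ValueError("kind must be 'AND' or 'OR'")


print(barrier_condition_met("AND", [0, 0, None, None, None]))  # True
print(barrier_condition_met("AND", [0, 1, None, None, None]))  # False
print(barrier_condition_met("OR", [3, 0, None, None, None]))   # True
```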
Packet Process Flow
Launch Phase
• All preceding packets in the queue must have completed their launch phase.
• If the barrier bit in the packet header is set, then all preceding packets in the queue must have completed.
• An acquire memory fence is applied for Kernel/Agent Dispatch packets before the packet enters the active phase.
Active Phase
• Kernel Dispatch packets and Agent Dispatch packets execute on the Kernel Agent/Agent, and the active phase ends when the task completes.
• Barrier-AND and Barrier-OR packets remain in the active phase until their condition is met.
Completion Phase
• If the packet is a Barrier-AND or Barrier-OR packet, then an acquire memory fence is applied as the first step.
• After execution of the acquire fence, the memory release fence is applied.
• After the memory release fence completes, the signal specified by the completion_signal field in the AQL packet is signaled with a decrementing atomic operation.
Barrier-bit Example
[Figure: AQL packets move through a queue from enqueue to dequeue and pass through the launch, active, and completion phases. The diagram highlights an AQL packet with barrier bit = 1 and the completionSignal. Jay Wang, Taiwan, 2015.04]
If the barrier bit is set, then processing of the packet will only begin when all preceding packets are complete.
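The launch rules (all preceding packets launched, and all preceding packets completed when the barrier bit is set) fit in one small predicate. The Python sketch below is a simplified model: the `Packet` class and `may_launch` function are hypothetical, and memory fences and completion signals are left out.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    barrier_bit: bool
    launched: bool = False
    completed: bool = False


def may_launch(queue, i):
    """Return True if packet i is allowed to start its launch phase."""
    earlier = queue[:i]
    # Rule 1: every preceding packet has completed its launch phase.
    if not all(p.launched for p in earlier):
        return False
    # Rule 2: with the barrier bit set, every preceding packet has completed.
    if queue[i].barrier_bit and not all(p.completed for p in earlier):
        return False
    return True


queue = [Packet(False, launched=True), Packet(False, launched=True), Packet(True)]
print(may_launch(queue, 2))  # False: earlier packets launched but not complete
for p in queue[:2]:
    p.completed = True
print(may_launch(queue, 2))  # True
```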
Barrier-AND Packet Example
Agent Scheduling
[Figure: agent scheduling. Applications #1, #2, and #3 submit AQL packets (Agent/Kernel Dispatch packets or Barrier-AND/OR packets) to several AQL queues, shown alongside a non-HSA task pool. The agent scheduling logic of the HSA (kernel) agent is poked on three events: (1) task execution completed, (2) new AQL packet submission, (3) barrier packet completed. Jay Wang, Taiwan, 2015.04]
Kernel Agent Context Switching
[Figure: kernel agent context switching. Applications #1, #2, and #3 feed AQL queues, shown alongside a non-HSA task pool. HSA agent scheduling assigns kernel programs to the compute units (CUs) of the HSA kernel agent, and context switching swaps the work-groups (WG) running on each CU. Jay Wang, Taiwan, 2015.04]
1. Switch (required)
2. Preempt (required as soon as possible)
3. Terminate and context reset (terminated as fast as possible)
FP Exception Reporting
• A Kernel Agent shall report certain defined exceptions related to the execution of the HSAIL code to the HSA Runtime.

IEEE754-2008 exceptions
Code  Description
0     Invalid operation
1     Divide-by-zero
2     Overflow
3     Underflow
4     Inexact

Control directives:
• enablebreakexceptions #EC (BREAK policy)
• enabledetectexceptions #EC (DETECT policy)
HSAIL instructions:
• cleardetectexcept_u32
• getdetectexcept_u32
• setdetectexcept_u32

[Figure: FP exception reporting. Work-groups (Work-Group 0 to Work-Group X) are split into wavefronts (Wavefront 0 to Wavefront Y). A compute unit (CU) of the HSA kernel agent runs a wavefront in SIMD (Single Instruction, Multiple Data) style with one PC and one work-item per lane (Lane 0 to Lane N-1). An exception module holds BreakEn bits, DetectEn bits, and status bits, and uses signaling to reach the exception handler in the HSA runtime on the host CPU. Jay Wang, Taiwan, 2015.04]
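The interplay of the DetectEn, BreakEn, and status bits can be illustrated with a toy model. This Python sketch makes several assumptions that the slide does not confirm: exception code n is mapped to bit n of a mask, the DETECT policy only records enabled exceptions in the status bits, and the BREAK policy reports whether the handler should be signaled. The class and method names are invented; they only loosely mirror the get/clear/set detect instructions.

```python
from enum import IntFlag


class FpException(IntFlag):
    # Assumption: exception code n from the table maps to bit n.
    INVALID_OPERATION = 1 << 0
    DIVIDE_BY_ZERO = 1 << 1
    OVERFLOW = 1 << 2
    UNDERFLOW = 1 << 3
    INEXACT = 1 << 4


class ToyExceptionModule:
    def __init__(self, detect_enable=FpException(0), break_enable=FpException(0)):
        self.detect_enable = detect_enable
        self.break_enable = break_enable
        self.status = FpException(0)

    def report(self, exc):
        """Record exc under DETECT; return True if BREAK should signal."""
        self.status |= exc & self.detect_enable
        return bool(exc & self.break_enable)

    def get_detect(self):
        return self.status

    def clear_detect(self, mask):
        self.status &= ~mask

    def set_detect(self, mask):
        self.status |= mask


module = ToyExceptionModule(
    detect_enable=FpException.DIVIDE_BY_ZERO | FpException.OVERFLOW,
    break_enable=FpException.INVALID_OPERATION,
)
print(module.report(FpException.DIVIDE_BY_ZERO))     # False (recorded only)
print(module.report(FpException.INEXACT))            # False (not enabled)
print(module.report(FpException.INVALID_OPERATION))  # True (break)
print(module.get_detect() == FpException.DIVIDE_BY_ZERO)  # True
```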
Debug Infrastructure
• The Kernel Agent shall provide mechanisms to allow system software and some select application software (for example, debuggers and profilers) to set breakpoints and collect throughput information for profiling.
[Figure: debug infrastructure. A grid is split into work-groups (Work-Group 0 to 2) and wavefronts (Wavefront 0 to 3). A compute unit of the HSA kernel agent runs a wavefront in SIMD (Single Instruction, Multiple Data) style with one PC and one work-item per lane (Lane 0 to Lane N-1). Debuggers and profilers on the host CPU (HSA agent) reach the debug module through the HSA kernel agent debug interface. The debug module supports instruction breakpoints, memory breakpoints, and conditional breakpoints. Jay Wang, Taiwan, 2015.04]
Execution Environment
You have 2 OpenCL platform(s)
----------------------------------------------
Platform[0].Name = NVIDIA CUDA
Platform[0].Vendor = NVIDIA Corporation
Platform[0].Version = OpenCL 1.1 CUDA 4.2.1
Platform[0].Profile = FULL_PROFILE
----------------------------------------------
Platform[1].Name = Intel(R) OpenCL
Platform[1].Vendor = Intel(R) Corporation
Platform[1].Version = OpenCL 1.2
Platform[1].Profile = FULL_PROFILE
----------------------------------------------
Platform[0] has 1 device(s)
----------------------------------------------
Device[0].Type = CL_DEVICE_TYPE_GPU
Device[0].Name = GeForce GT 625
Device[0].Vendor = NVIDIA Corporation
Device[0].Version = OpenCL 1.1 CUDA
Device[0].DriverVersion = 320.49
Device[0].Profile = FULL_PROFILE
Device[0].OpenCL_C = OpenCL C 1.1
Device[0].MaxComputeUnits = 1
Device[0].MaxWiDimensions = 3
Device[0].MaxWiSize = (1024,1024,64)
Device[0].MaxWgSize = 1024
Device[0].MaxClkFrequency = 1747 MHz
Device[0].AddrSpaceSize = 32 bits
Platform[1] has 1 device(s)
----------------------------------------------
Device[0].Type = CL_DEVICE_TYPE_CPU
Device[0].Name = Intel(R) Core(TM) i5-4440 CPU @ 3.10GHz
Device[0].Vendor = Intel(R) Corporation
Device[0].Version = OpenCL 1.2 (Build 80752)
Device[0].DriverVersion = 3.0.1.15216
Device[0].Profile = FULL_PROFILE
Device[0].OpenCL_C = OpenCL C 1.2
Device[0].MaxComputeUnits = 4
Device[0].MaxWiDimensions = 3
Device[0].MaxWiSize = (1024,1024,1024)
Device[0].MaxWgSize = 1024
Device[0].MaxClkFrequency = 3100 MHz
Device[0].AddrSpaceSize = 32 bits
OpenCL APIs
HSA Platform Topology Discovery
• HSA platform resources: Agent, Memory, Compute Properties, Caches, and I/O
[Figure: example HSA platform with five nodes (Node 0 to Node 4) joined by a socket interconnect and PCIe links, one of them through a PCIe bridge. The HSA APUs each contain CPU cores, a GPU with H-CUs, an HSA MMU, and system memory with (cacheable) coherent and (non-cacheable) non-coherent regions. Optional add-in boards carry HSA discrete GPUs with H-CUs and device local memory (coherent and non-coherent). Firmware labels shown: SBIOS, UEFI, VBIOS, UEFI GOP.]
Images
• A graphics feature that can sometimes be useful in data-parallel computing
• Used to store one-, two-, or three-dimensional images
  • Predefined image formats
• Image memory is a special kind of memory access
  • Dedicated hardware to speed up image operations.
• The OpenCL™ Specification Version 2.1, Section 5.3 Image Objects: https://www.khronos.org/registry/cl/specs/opencl-2.1.pdf
[Figure: an image object in the HSA runtime. The image handle (hsa_ext_image_handle_t) refers to an image descriptor (image channel type, image channel order, image geometry, image data size) and to the image data (1D, 2D, or 3D images) stored in the global segment. The HSA kernel agent accesses images with the rdimage, ldimage, and stimage instructions. Jay Wang, Taiwan, 2015.04]
Summary
• Programming model issues
  • HSA Intermediate Language (HSAIL) + HSA Runtime
  • Architected Queuing Language (AQL) + Signaling
  • Debug infrastructure
• Communication overhead issues
  • Cache coherent shared virtual memory (CC-SVM)
  • Architected Queuing Language (AQL) for user mode queuing
  • Hardware-assisted signaling and atomic operations for synchronization
[Figure: CPUs, GPU, DSP, and other devices share HSAIL, unified coherent memory, the HSA runtime, and AQL. Jay Wang, Taiwan, 2015.04]
[Figure: the HSA runtime stack (user application with CPU code + HSAIL kernel code, language runtime such as the OpenCL runtime, HSA runtime, HSA finalizers, HSA kernel mode driver, HSA user mode queuing with AQL + HSA signaling, and the CPU, GPU, and DSP HSA kernel agents) annotated with four roles: parallel application designer, HSA system software designer, HSA system architecture designer, and HSA kernel agent designer. A callout reads "Mom, I'm here!". Jay Wang, Taiwan, 2015.04]
• OpenCL Standards ( https://www.khronos.org/opencl/ )
• HSA Standards ( http://www.hsafoundation.com/html/HSA_Library.htm )
  • HSA Platform System Architecture Specification v1.0
  • HSA Programmer Reference Manual Specification v1.0
  • HSA Runtime Specification v1.0
• HSA Foundation GitHub ( https://github.com/HSAFoundation )
Taiwan HSA Group @ Facebook

More Related Content

What's hot

Tegra 186のu-boot & Linux
Tegra 186のu-boot & LinuxTegra 186のu-boot & Linux
Tegra 186のu-boot & LinuxMr. Vengineer
 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDatabricks
 
UM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareBrendan Gregg
 
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all started
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all startedKernel Recipes 2019 - ftrace: Where modifying a running kernel all started
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all startedAnne Nicolas
 
GPU Virtualization in SUSE
GPU Virtualization in SUSEGPU Virtualization in SUSE
GPU Virtualization in SUSELiang Yan
 
Security Monitoring with eBPF
Security Monitoring with eBPFSecurity Monitoring with eBPF
Security Monitoring with eBPFAlex Maestretti
 
Introduction to OpenVX
Introduction to OpenVXIntroduction to OpenVX
Introduction to OpenVX家榮 張
 
NVIDIA CEO Jensen Huang Presentation at Supercomputing 2019
NVIDIA CEO Jensen Huang Presentation at Supercomputing 2019NVIDIA CEO Jensen Huang Presentation at Supercomputing 2019
NVIDIA CEO Jensen Huang Presentation at Supercomputing 2019NVIDIA
 
Understanding blue store, Ceph's new storage backend - Tim Serong, SUSE
Understanding blue store, Ceph's new storage backend - Tim Serong, SUSEUnderstanding blue store, Ceph's new storage backend - Tim Serong, SUSE
Understanding blue store, Ceph's new storage backend - Tim Serong, SUSEOpenStack
 
Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Brendan Gregg
 
Introduction to OpenCL, 2010
Introduction to OpenCL, 2010Introduction to OpenCL, 2010
Introduction to OpenCL, 2010Tomasz Bednarz
 
2017 red hat open stack(rhosp) function overview (samuel,2017-0516)
2017 red hat open stack(rhosp) function overview (samuel,2017-0516)2017 red hat open stack(rhosp) function overview (samuel,2017-0516)
2017 red hat open stack(rhosp) function overview (samuel,2017-0516)SAMUEL SJ Cheon
 
Make your PySpark Data Fly with Arrow!
Make your PySpark Data Fly with Arrow!Make your PySpark Data Fly with Arrow!
Make your PySpark Data Fly with Arrow!Databricks
 
Android audio system(오디오 출력-트랙생성)
Android audio system(오디오 출력-트랙생성)Android audio system(오디오 출력-트랙생성)
Android audio system(오디오 출력-트랙생성)fefe7270
 

What's hot (20)

Tegra 186のu-boot & Linux
Tegra 186のu-boot & LinuxTegra 186のu-boot & Linux
Tegra 186のu-boot & Linux
 
librados
libradoslibrados
librados
 
Embedded Android : System Development - Part IV
Embedded Android : System Development - Part IVEmbedded Android : System Development - Part IV
Embedded Android : System Development - Part IV
 
Ixgbe internals
Ixgbe internalsIxgbe internals
Ixgbe internals
 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.x
 
UM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of Software
 
HSA Design (2015-04-30)

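  • 1. ITRI (Industrial Technology Research Institute). Heterogeneous System Architecture (HSA) Design. 王振傑 (Jay Wang), System Architecture Design Department (D200), Division for Embedded System and SoC Technology, Information and Communications Research Laboratories (ICL). ccwang.jay@itri.org.tw. 2015-04-30
  • 2. Division for Embedded System and SoC Technology, Information and Communications Research Laboratories, Industrial Technology Research Institute. Departments: Embedded System Hardware Technology (D100); System Architecture Design (D200); Embedded System Software Technology (D300); Smart Electronics Industry Promotion (D400); System Integration and Application (D500).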
  • 3. HSA Design (2015-04-30) @ NCKU, Tainan. What is HSA? An intelligent computing architecture that enables CPU, GPU and other processors to work in harmony on a single piece of silicon by seamlessly moving the right tasks to the best-suited processing element.
  • 4. Three Eras of Processor Performance (source: HSA Introduction, HSA Foundation; Phil Rogers, AMD). Each era's plot carries a "we are here" marker.
      Single-Core Era (single-thread performance vs. time). Enabled by: Moore's Observation, voltage scaling, micro-architecture. Constrained by: power, complexity. Languages: Assembly → C/C++ → Java …
      Multi-Core Era (throughput performance vs. time, i.e. number of processors). Enabled by: Moore's Observation, desire for throughput, 20 years of SMP architecture. Constrained by: power, parallel software availability, scalability. Languages: pthreads → OpenMP / TBB …
      Heterogeneous Systems Era (modern application performance vs. time, i.e. data-parallel exploitation). Enabled by: Moore's Observation, abundant data parallelism, power-efficient data-parallel processing (GPUs). Constrained by: programming models, communication overheads. Languages: Shader → CUDA → OpenCL → C++ and Java
  • 5. HSA Foundation: founded in June 2012 (www.hsafoundation.com); developing a new platform for heterogeneous systems; launched the official v1.0 specification set in March 2015.
  • 6. HSA Foundation Members (April 2015). [Figure: member logos grouped into Founders, Promoters, Contributors, Academics, and Supporters.]
  • 7. HSA Platform Model. In an HSA system, a regular device is called an HSA agent; an HSA agent that can run kernels is also an HSA kernel agent. [Figure: the host CPU (OS, HSA runtime) is an HSA agent and handles serial and task-parallel workloads; the HSA kernel agent is built from compute units (CUs), each made of lanes (processing elements), and handles SIMD data-parallel workloads. The wavefront size is a power of 2 in the range from 1 to 256 inclusive.] (Jay Wang, Taiwan, 2015.03)
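The wavefront-size rule on this slide is easy to check mechanically. The sketch below is only an illustration: `is_valid_wavefront_size` and `wavefronts_per_workgroup` are hypothetical helper names, not HSA runtime calls, and the second helper assumes that a work-group is covered by whole wavefronts.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not an HSA runtime call): true if `size` is a
 * power of 2 in the range 1..256 inclusive, as the slide requires. */
static bool is_valid_wavefront_size(uint32_t size)
{
    /* A power of 2 has exactly one bit set, so size & (size - 1) == 0. */
    return size >= 1 && size <= 256 && (size & (size - 1)) == 0;
}

/* Hypothetical helper: number of wavefronts needed to cover a
 * work-group of `work_items` work-items (rounded up). */
static uint32_t wavefronts_per_workgroup(uint32_t work_items,
                                         uint32_t wavefront_size)
{
    return (work_items + wavefront_size - 1) / wavefront_size;
}
```

For example, a 100-item work-group on a 64-lane wavefront needs two wavefronts, the second only partly filled.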
  • 8. HSA Intermediate Language (HSAIL). The HSA Foundation members are building a heterogeneous compute software ecosystem on open, royalty-free industry standards and open-source software: the HSA runtimes and compilation tools are based on open-source technologies such as LLVM and GCC (https://github.com/HSAFoundation). [Figure: parallel programming languages (OpenCL, C++ AMP, Java, OpenMP, DSL), together with the HSA runtime libraries, target HSAIL, a virtual parallel ISA; CLOC compiles OpenCL kernels to HSAIL. Per-vendor finalizers (Company A CPU, Company B CPU, Company C GPU, Company D GPU, Company E DSP, ...) sit between HSAIL and the hardware: Company A CPUs, Company B CPUs, Company C GPU, Company D GPU, Company E DSP, and other hardware accelerators.] (Jay Wang, Taiwan, 2014.10)
  • 9. HSAIL Programming Model
  • 10. HSA Runtime Stack. [Figure: software stack with these layers: user application (CPU code + HSAIL kernel code); language runtime (e.g. OpenCL runtime); HSA application (an HSA agent on the host CPU) with the HSA runtime and the HSA finalizers, which produce target ISA; HSA user mode queuing (Architected Queuing Language) plus HSA signaling; HSA kernel mode driver; HSA kernel agents (CPU, GPU, DSP).] (Jay Wang, Taiwan, 2015.04)
  • 11. Kernel Execution
  • 12. HSA Memory Consistency Model (Relaxed Model). Memory orders shown: relaxed; acquire; release; acq_rel.
      Second-operation columns: (1) ld_rlx; (2) st_rlx; (3) atomic_rlx, atomicNoRet_rlx; (4) atomic_acq, atomicNoRet_acq, fence_acq; (5) atomic_rel, atomicNoRet_rel, fence_rel; (6) atomic_ar, atomicNoRet_ar, fence_ar.
      First operation                          (1)  (2)  (3)  (4)  (5)  (6)
      ld_rlx or st_rlx                         yes  yes  yes  yes  no   no
      atomic_rlx, atomicNoRet_rlx              yes  yes  yes  no   no   no
      atomic_acq, atomicNoRet_acq, fence_acq   no   no   no   no   no   no
      atomic_rel, atomicNoRet_rel              yes  yes  no   no   no   no
      fence_rel                                yes  no   no   no   no   no
      atomic_ar, atomicNoRet_ar, fence_ar      no   no   no   no   no   no
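The acquire and release orders named on this slide can be illustrated with a message-passing pattern. The sketch below uses C11 `<stdatomic.h>` rather than HSAIL, on the assumption that the C11 memory orders behave analogously to the HSAIL `_rel` and `_acq` suffixes; `publish` and `try_consume` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;                /* ordinary (non-atomic) data */
static atomic_bool ready = false;  /* flag used for synchronization */

/* Producer: write the data, then publish it with a release store.
 * The release keeps the payload write from moving after the flag write. */
static void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer: an acquire load that observes `true` also makes the
 * producer's earlier write to `payload` visible.
 * Returns false if nothing has been published yet. */
static bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

With relaxed orders on both sides, the consumer could see the flag set and still read a stale payload; the release/acquire pair is what rules that out.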
  • 13. System Arch. Requirements (source: HSA Platform System Architecture Specification, Version 1.0 Final, 2015-03-16): 1. Shared Virtual Memory 2. Cache Coherency Domains 3. Flat Addressing 4. Endianness 5. Signaling and Synchronization 6. Atomic Memory Operations 7. HSA System Timestamp 8. User Mode Queuing 9. Architected Queuing Language (AQL) 10. Agent Scheduling 11. Kernel Agent Context Switching 12. IEEE754-2008 Floating Point Exceptions 13. Kernel Agent Hardware Debug Infrastructure 14. HSA Platform Topology Discovery 15. Images
  • 14. Legacy GPU Compute: multiple memory pools and address spaces; data copies before/after GPU compute. [Figure: host CPUs (HSA agent) with system memory and the GPU (HSA kernel agent) with GPU memory, in two separate virtual address spaces (Virtual Memory #1 and Virtual Memory #2); arrows numbered 1, 2, 3 mark the steps.] (Jay Wang, Taiwan, 2015.04)
  • 15. Shared Virtual Memory (HSA). [Figure: host CPUs (HSA agent) and the GPU (HSA kernel agent) use one shared virtual memory covering system memory and GPU memory; labels: MMU, IOMMU, OS page table. A 32-bit HSA system uses 32-bit virtual addresses; a 64-bit HSA system uses virtual addresses of at least 48 bits.] (Jay Wang, Taiwan, 2015.04)
  • 16. HSA Memory Hierarchy. Segments: 1) Global 2) Group 3) Private 4) Kernarg 5) Readonly 6) Spill 7) Arg. [Figure: a kernel dispatch grid is divided into work-groups of work-items (WI); each work-item has a private segment and each work-group has a group segment, alongside the global segment. The global, group, and private segments all lie within the flat address space of the HSA agent, which is a virtual address range reservation (system memory or device local memory).] Local registers per work-item: 32-bit s registers ($s0 to $s127); 64-bit d registers ($d0 to $d63); 128-bit q registers ($q0 to $q31); 1-bit control c registers ($c0 to $c7). (Jay Wang, Taiwan, 2014.10)
  • 17. Cache Coherency Domains. [Figure: the kernel dispatch grid with its work-groups, work-items (WI), and private, group, and global segments within the flat address space; the caches of the host CPUs and of the HSA kernel agent are linked by cache coherency over system memory.] (Jay Wang, Taiwan, 2015.04)
  • 18. System Arch. Requirements (same 15-item list as slide 13)
  • 19. Signaling and Synchronization. The required mechanisms for HSAIL and the HSA runtime are: allocate/destroy an HSA signal; read the current HSA signal value; wait on an HSA signal to meet a specified condition (with a maximum wait duration requested); send an HSA signal value; atomically read-modify-write an HSA signal value. POSIX calls shown alongside: sem_init(), sem_wait(), sem_post(), sem_destroy(); pthread_mutex_init(), pthread_mutex_lock(), pthread_mutex_unlock(), pthread_mutex_destroy(). [Figure: a signal handle (hsa_signal_t) refers to a signal value (hsa_signal_value_t; Sig32 or Sig64) plus implementation-defined data; the host CPU reaches it through HSA runtime APIs and the HSA kernel agent through HSAIL instructions.] (Jay Wang, Taiwan, 2015.04)
  • 20. HSA Runtime APIs for Signaling (for HSA applications; HSA Runtime Programmer’s Reference Manual v1.0, section 2.4, Signals): hsa_signal_create(); hsa_signal_destroy(); hsa_signal_load_{acquire, relaxed}(); hsa_signal_store_{relaxed, release}(); hsa_signal_exchange_{acq_rel, acquire, relaxed, release}(); hsa_signal_cas_{acq_rel, acquire, relaxed, release}(); hsa_signal_add_{acq_rel, acquire, relaxed, release}(); hsa_signal_subtract_{acq_rel, acquire, relaxed, release}(); hsa_signal_and_{acq_rel, acquire, relaxed, release}(); hsa_signal_or_{acq_rel, acquire, relaxed, release}(); hsa_signal_xor_{acq_rel, acquire, relaxed, release}(); hsa_signal_wait_{acquire, relaxed}()
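To make the signal operations listed above concrete, here is a minimal model built on C11 atomics. It is deliberately not the HSA runtime: `toy_signal_t` and the `toy_signal_*` functions are hypothetical stand-ins that mirror the load, store, subtract, and wait operations, and a poll count stands in for the "maximum wait duration".

```c
#include <stdatomic.h>
#include <stdint.h>

/* Toy stand-in for an HSA signal. This is NOT hsa_signal_t. */
typedef struct {
    _Atomic int64_t value;
} toy_signal_t;

static void toy_signal_init(toy_signal_t *s, int64_t initial)
{
    atomic_init(&s->value, initial);
}

static int64_t toy_signal_load_acquire(toy_signal_t *s)
{
    return atomic_load_explicit(&s->value, memory_order_acquire);
}

static void toy_signal_store_release(toy_signal_t *s, int64_t v)
{
    atomic_store_explicit(&s->value, v, memory_order_release);
}

/* Atomic read-modify-write: subtract `v` from the signal value. */
static void toy_signal_subtract_acq_rel(toy_signal_t *s, int64_t v)
{
    atomic_fetch_sub_explicit(&s->value, v, memory_order_acq_rel);
}

/* Poll until the value equals `expected` or `max_polls` re-reads have
 * been made. Returns the last value observed, so the caller can tell
 * a satisfied condition from a timeout. */
static int64_t toy_signal_wait_eq(toy_signal_t *s, int64_t expected,
                                  uint64_t max_polls)
{
    int64_t seen = toy_signal_load_acquire(s);
    while (seen != expected && max_polls-- > 0)
        seen = toy_signal_load_acquire(s);
    return seen;
}
```

A typical use is a completion counter: the dispatcher initializes the signal to 1, the worker subtracts 1 when it finishes, and the dispatcher waits for 0.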
  • 21. HSAIL Instructions for Signaling. Reference: HSA Programmer’s Reference Manual: HSAIL Virtual ISA and Programming Model, Compiler Writer’s Guide, and Object Format (BRIG) (v1.0), section 6.8, Notification (signal) Instructions.
  • 22. Atomic Memory Operations
      HSA requires the following standard atomic memory operations to be supported by HSA Kernel Agents (other HSA Agents only need to support the subset of these operations required by their role in the system):
      - Load from memory
      - Store to memory
      - Fetch from memory, apply a logic operation (bitwise AND/OR/XOR) with one additional operand, and store back
      - Fetch from memory, apply an integer arithmetic operation (add, subtract, increment, decrement, minimum, maximum) with one additional operand, and store back
      - Exchange memory location with operand
      - Compare-and-swap (CAS): load memory location, compare with first operand, and if equal, then store second operand back to memory location
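To make the compare-and-swap rule concrete, here is a minimal sketch in Python. The names `AtomicCell` and `atomic_max` are hypothetical, and a lock emulates the indivisibility that real hardware provides. It also shows how a fetch-and-maximum can be built from CAS with a retry loop.

```python
import threading


class AtomicCell:
    """Lock-based emulation of one memory location with atomic operations."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def exchange(self, new_value):
        # Swap in new_value and return what was there before.
        with self._lock:
            old, self._value = self._value, new_value
            return old

    def compare_and_swap(self, expected, desired):
        # Load, compare with the first operand, store the second operand only
        # if they are equal. The value found in memory is returned, so the
        # swap succeeded exactly when the result equals `expected`.
        with self._lock:
            old = self._value
            if old == expected:
                self._value = desired
            return old


def atomic_max(cell, candidate):
    """Fetch-and-maximum built from CAS; returns the previous value."""
    while True:
        old = cell.load()
        if candidate <= old:
            return old
        if cell.compare_and_swap(old, candidate) == old:
            return old
        # Another thread changed the cell in between; retry.
```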
  • 23. HSA System Timestamp
      - The HSA system provides a low-overhead mechanism for determining the passing of time.
      - A system timestamp is required that can be read from HSAIL or through the HSA runtime.
      - The system timestamp frequency can also be determined through the HSA runtime.
      [Figure: a 64-bit timestamp read by the host CPU through HSA Runtime APIs and by the HSA Kernel Agent through the HSAIL clock instruction; the HSA Runtime reports the timestamp frequency (1 to 400 MHz).]
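Converting two timestamp readings into elapsed time only needs the reported frequency. A minimal sketch in Python, where the function name `elapsed_seconds` is hypothetical and the wrap-around handling assumes the 64-bit counter shown on the slide:

```python
UINT64_MASK = (1 << 64) - 1


def elapsed_seconds(start_ticks, end_ticks, frequency_hz):
    """Elapsed time between two 64-bit timestamp readings."""
    if frequency_hz <= 0:
        raise ValueError("frequency_hz must be positive")
    # Masking keeps the difference correct if the counter wrapped once.
    delta_ticks = (end_ticks - start_ticks) & UINT64_MASK
    return delta_ticks / frequency_hz
```

For example, 100,000 ticks at a frequency of 100 MHz correspond to one millisecond.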
  • 25. User Mode Queuing
      - Multiple user-level command queues
      - Runtime-allocated
      - Architected Queuing Language (AQL)
      [Figure: a user application and a language runtime (e.g. the OpenCL runtime) sit on top of the HSA Runtime and HSA Finalizers in the HSA application (HSA Agent) on the CPU; the HSA Kernel Mode Driver sits below. AQL queues connect the application to HSA Kernel Agents (CPU, GPU). Legend: K = AQL Kernel Dispatch Queue, A = AQL Agent Dispatch Queue.]
  • 26. HSA Packet Processor
      User Mode Queue Structure (hsa_queue_t):
        Offset       Field
        0x00         type
        0x04         features
        0x08-0x0C    base_address
        0x10-0x14    doorbell_signal
        0x18         size
        0x1C         reserved (must be 0)
        0x20-0x24    id
      Ring buffer of AQL packets (64 bytes each), indexed by a 64-bit read_index and a 64-bit write_index:
      - Read location:  base_address + ((read_index % size) * AQL packet size)
      - Write location: base_address + ((write_index % size) * AQL packet size)
      Supports single or multiple producers.
      Supports KERNEL_DISPATCH and/or AGENT_DISPATCH packets.
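The ring-buffer arithmetic above can be checked with a few lines of Python. The helper names `slot_address` and `is_full` are hypothetical, and the fullness test (write index at least `size` packets ahead of the read index) is a common ring-buffer convention assumed here rather than quoted from the slide.

```python
AQL_PACKET_SIZE = 64  # bytes, as stated on the slide


def slot_address(base_address, index, queue_size):
    """Address of the packet slot that a monotonically growing index maps to."""
    return base_address + (index % queue_size) * AQL_PACKET_SIZE


def is_full(write_index, read_index, queue_size):
    """Assumed convention: no free slot once the producer is a full lap ahead."""
    return write_index - read_index >= queue_size
```

Because both indices only ever increase, the modulo wraps them back into the buffer, and their difference gives the number of packets still pending.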
  • 27. User Mode Queue Operations
      HSA Runtime APIs (for HSA applications):
      - hsa_queue_create()
      - hsa_soft_queue_create()
      - hsa_queue_destroy()
      - hsa_queue_inactivate()
      - hsa_queue_load_write_index_{acquire, relaxed}()
      - hsa_queue_store_write_index_{relaxed, release}()
      - hsa_queue_cas_write_index_{acq_rel, acquire, relaxed, release}()
      - hsa_queue_add_write_index_{acq_rel, acquire, relaxed, release}()
      - hsa_queue_load_read_index_{acquire, relaxed}()
      - hsa_queue_store_read_index_{relaxed, release}()
      HSAIL instructions (for HSA Kernel Agents):
      - queueid_u32 dest
      - queueptr_uLength dest
      - ldqueuewriteindex_segment_order_u64 dest, address
      - stqueuewriteindex_segment_order_u64 address, src
      - casqueuewriteindex_segment_order_u64 dest, address, src0, src1
      - addqueuewriteindex_segment_order_u64 dest, address, src
      - ldqueuereadindex_segment_order_u64 dest, address
      - stqueuereadindex_segment_order_u64 address, src
      [Figure: the user application, language runtime (e.g. the OpenCL runtime) and HSA Runtime on the CPU (HSA Agent) exchange packets with the HSA Kernel Agent (GPU) through AQL kernel dispatch (K) and agent dispatch (A) queues.]
  • 28. AQL Packet Types
      Three 64-byte packet layouts:
      - Kernel Dispatch Packet: header, dimensions (2-bit), workgroup_size_x/y/z, grid_size_x/y/z, private_segment_size_bytes, group_segment_size_bytes, kernel_object, kernarg_address, reserved fields, completion_signal
      - Agent Dispatch Packet: header, type, return_address, arg0 to arg3, reserved fields, completion_signal
      - Barrier-AND / Barrier-OR Packet: header, dep_signal0 to dep_signal4, reserved fields, completion_signal
      completion_signal: HSA signaling object handle used to indicate completion of the job.
      Packet header (16 bits):
        Bits     Field
        7:0      format (8-bit)
        8        barrier (1-bit)
        10:9     acquire_fence_scope (2-bit)
        12:11    release_fence_scope (2-bit)
        15:13    reserved (3-bit)
      AQL_FORMAT values:
        0  VENDOR_SPECIFIC
        1  INVALID
        2  KERNEL_DISPATCH
        3  BARRIER_AND
        4  AGENT_DISPATCH
        5  BARRIER_OR
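The 16-bit header layout can be exercised with plain bit operations. The sketch below is Python, with hypothetical helper names `pack_header` and `unpack_header`. The fence scope encoding used in the example (0 = none, 1 = agent, 2 = system) is an assumption, since the slide gives only the field widths.

```python
AQL_FORMAT = {
    "VENDOR_SPECIFIC": 0,
    "INVALID": 1,
    "KERNEL_DISPATCH": 2,
    "BARRIER_AND": 3,
    "AGENT_DISPATCH": 4,
    "BARRIER_OR": 5,
}


def pack_header(packet_format, barrier, acquire_scope, release_scope):
    """Pack the header fields into a 16-bit integer (reserved bits stay 0)."""
    if not 0 <= packet_format <= 0xFF:
        raise ValueError("format must fit in 8 bits")
    if not (0 <= acquire_scope <= 3 and 0 <= release_scope <= 3):
        raise ValueError("fence scopes must fit in 2 bits")
    return (packet_format
            | (int(bool(barrier)) << 8)
            | (acquire_scope << 9)
            | (release_scope << 11))


def unpack_header(header):
    """Inverse of pack_header; returns a dict of the four fields."""
    return {
        "format": header & 0xFF,
        "barrier": (header >> 8) & 0x1,
        "acquire_fence_scope": (header >> 9) & 0x3,
        "release_fence_scope": (header >> 11) & 0x3,
    }
```

A kernel dispatch header with the barrier bit set and both scopes equal to 2 packs to 0x1502.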
  • 29. Kernel Dispatch Packet
        Offset       Bits 31:16             Bits 15:0
        0x00         dimensions (2-bit)     header
        0x04         workgroup_size_y       workgroup_size_x
        0x08         reserved               workgroup_size_z
        0x0C         grid_size_x
        0x10         grid_size_y
        0x14         grid_size_z
        0x18         private_segment_size_bytes
        0x1C         group_segment_size_bytes
        0x20-0x24    kernel_object
        0x28-0x2C    kernarg_address
        0x30-0x34    reserved
        0x38-0x3C    completion_signal
      Field groups:
      - workgroup_size_x/y/z: work-group size
      - grid_size_x/y/z: grid size
      - private_segment_size_bytes, group_segment_size_bytes: segment sizes
      - kernel_object: pointer to the kernel
      - kernarg_address: pointer to the arguments
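As an illustration of how the fields fill exactly 64 bytes, the packet can be serialized with Python's `struct` module. This is a sketch: the field order and widths follow my reading of the layout on this slide (little-endian, 16-bit header and setup words, 16-bit work-group sizes, 32-bit grid and segment sizes, 64-bit pointers and signal handle), and `build_kernel_dispatch` is a hypothetical helper, not a runtime API.

```python
import struct

# header, setup, wg_x, wg_y, wg_z, reserved | grid_x, grid_y, grid_z,
# private_size, group_size | kernel_object, kernarg_address, reserved,
# completion_signal
KERNEL_DISPATCH = struct.Struct("<HHHHHHIIIIIQQQQ")


def build_kernel_dispatch(header, dimensions, workgroup, grid,
                          private_bytes, group_bytes,
                          kernel_object, kernarg_address, completion_signal):
    """Serialize one kernel dispatch packet into 64 bytes."""
    if not 1 <= dimensions <= 3:
        raise ValueError("dimensions must be 1, 2, or 3")
    wg_x, wg_y, wg_z = workgroup
    grid_x, grid_y, grid_z = grid
    return KERNEL_DISPATCH.pack(
        header, dimensions,          # dimensions sits in the low 2 setup bits
        wg_x, wg_y, wg_z, 0,
        grid_x, grid_y, grid_z,
        private_bytes, group_bytes,
        kernel_object, kernarg_address, 0,
        completion_signal,
    )
```

In a real queue the header word would be written last, after the rest of the packet is in place, so the packet processor never sees a half-filled packet; the sketch ignores that ordering.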
  • 30. Agent Dispatch Packet
        Offset       Bits 31:16             Bits 15:0
        0x00         type                   header
        0x04         reserved
        0x08-0x0C    return_address
        0x10-0x14    arg0
        0x18-0x1C    arg1
        0x20-0x24    arg2
        0x28-0x2C    arg3
        0x30-0x34    reserved
        0x38-0x3C    completion_signal
      Field notes:
      - type: the function to be performed by the destination agent. The function codes are application defined.
      - return_address: pointer to the location in which to store the function return value(s).
      - arg0 to arg3: 64-bit direct or indirect arguments.
  • 31. Barrier-AND / Barrier-OR Packet
        Offset       Bits 31:16             Bits 15:0
        0x00         reserved               header
        0x04         reserved
        0x08-0x0C    dep_signal0
        0x10-0x14    dep_signal1
        0x18-0x1C    dep_signal2
        0x20-0x24    dep_signal3
        0x28-0x2C    dep_signal4
        0x30-0x34    reserved
        0x38-0x3C    completion_signal
      - dep_signal0 to dep_signal4: handles for dependent signaling objects to be evaluated by the packet processor.
      - The Barrier packet defines dependencies for the HSA Packet Processor to monitor.
      - The HSA Packet Processor will not launch any further packets until the Barrier-AND / Barrier-OR packet is complete.
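How the packet processor might evaluate the dependencies can be sketched in Python. The rule assumed here, which the slide does not spell out, is that a dependency counts as satisfied once its signal value is 0; Barrier-AND then needs all dependencies satisfied and Barrier-OR needs at least one. The function name `barrier_satisfied` is hypothetical, and the case of a packet with no dependencies is deliberately left out.

```python
MAX_DEP_SIGNALS = 5  # dep_signal0 .. dep_signal4


def barrier_satisfied(kind, dep_values):
    """Evaluate a barrier condition over observed dependency signal values.

    kind       -- "and" or "or"
    dep_values -- current values of the dependent signals (1 to 5 entries)
    """
    if not 1 <= len(dep_values) <= MAX_DEP_SIGNALS:
        raise ValueError("expected between 1 and 5 dependency values")
    satisfied = [value == 0 for value in dep_values]  # assumed rule
    if kind == "and":
        return all(satisfied)
    if kind == "or":
        return any(satisfied)
    raise ValueError("kind must be 'and' or 'or'")
```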
  • 32. Packet Process Flow
      Launch Phase
      - All preceding packets in the queue must have completed their launch phase.
      - If the barrier bit in the packet header is set, then all preceding packets in the queue must have completed.
      - An acquire memory fence is applied for Kernel/Agent Dispatch packets before the packet enters the active phase.
      Active Phase
      - Kernel Dispatch packets and Agent Dispatch packets execute on the Kernel Agent/Agent, and the active phase ends when the task completes.
      - Barrier-AND and Barrier-OR packets remain in the active phase until their condition is met.
      Completion Phase
      - If the packet is a Barrier-AND or Barrier-OR packet, then an acquire memory fence is applied as the first step.
      - After execution of the acquire fence, the memory release fence is applied.
      - After the memory release fence completes, the signal specified by the completion_signal field in the AQL packet is signaled with a decrementing atomic operation.
  • 33. Barrier-bit Example
      If the barrier bit is set, then processing of the packet will only begin when all preceding packets are complete.
      [Figure: AQL packets are enqueued and dequeued in order and pass through the launch, active, and completion phases; a packet with barrier bit = 1 waits for the completionSignal of every preceding packet.]
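The effect of the barrier bit on launch order can be seen in a small timeline model. This Python sketch makes simplifying assumptions that are not in the slides: launching takes no time, the agent has unlimited execution resources, and each packet is described only by a duration and its barrier bit. The function name `schedule` is hypothetical.

```python
def schedule(packets):
    """Compute (start, finish) times for packets on one queue.

    packets -- list of (duration, barrier_bit) in queue order
    """
    timeline = []
    previous_start = 0   # packets launch in queue order
    latest_finish = 0    # completion time of the slowest earlier packet
    for duration, barrier_bit in packets:
        start = previous_start
        if barrier_bit:
            # Barrier bit set: wait until every preceding packet is complete.
            start = max(start, latest_finish)
        finish = start + duration
        timeline.append((start, finish))
        previous_start = start
        latest_finish = max(latest_finish, finish)
    return timeline
```

With packets of duration 3, 1, 2 (barrier bit set) and 1, the first two overlap from time 0, the third starts at time 3 once both have finished, and the fourth launches alongside it.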
  • 34. Barrier-AND Packet Example
  • 36. Agent Scheduling
      [Figure: applications #1 to #3 submit AQL packets (Agent/Kernel Dispatch packets or Barrier-AND/OR packets) to their AQL queues; together with a non-HSA task pool, these feed the agent scheduling stage of the HSA (Kernel) Agent.]
      Events that "poke" the scheduler:
      (1) Task execution completed
      (2) New AQL packet submission
      (3) Barrier packet completed
  • 37. Kernel Agent Context Switching
      1. Switch (required)
      2. Preempt (required as soon as possible)
      3. Terminate and context reset (terminated as fast as possible)
      [Figure: AQL queues from applications #1 to #3 and a non-HSA task pool feed HSA agent scheduling; the work-groups (WG) of the kernel programs run on the compute units (CU) of the HSA Kernel Agent, where context switching takes place.]
  • 39. FP Exception Reporting
      A Kernel Agent shall report certain defined exceptions related to the execution of the HSAIL code to the HSA Runtime.
      IEEE 754-2008 exception codes:
        Code   Description
        0      Invalid operation
        1      Divide-by-zero
        2      Overflow
        3      Underflow
        4      Inexact
      Policies and controls:
      - DETECT policy: enabled with the control directive enabledetectexceptions #EC (DetectEn bits); results are recorded in status bits, accessed with the HSAIL instructions cleardetectexcept_u32, getdetectexcept_u32, and setdetectexcept_u32.
      - BREAK policy: enabled with the control directive enablebreakexceptions #EC (BreakEn bits); the exception module signals the exception handler in the HSA Runtime on the host CPU.
      [Figure: a grid of work-groups (Work-Group 0 to X) is split into wavefronts (Wavefront 0 to Y); each wavefront runs on a compute unit (CU) of the HSA Kernel Agent in SIMD (Single Instruction, Multiple Data) style, one work-item per lane (Lane 0 to N-1), sharing one PC.]
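The exception codes map naturally onto a bit mask, one bit per code. The Python sketch below models the DETECT policy loosely: the class `DetectStatus` and its methods are hypothetical, and the behaviour (only enabled exceptions are recorded, and status bits stay set until cleared) is an assumption about how the status bits work, not a transcription of the HSAIL instructions.

```python
EXCEPTION_NAMES = {
    0: "invalid operation",
    1: "divide-by-zero",
    2: "overflow",
    3: "underflow",
    4: "inexact",
}


def decode(mask):
    """List the exception names whose bits are set in mask."""
    return [name for bit, name in sorted(EXCEPTION_NAMES.items())
            if mask & (1 << bit)]


class DetectStatus:
    """Sticky status bits filtered by a detect-enable mask (assumed model)."""

    def __init__(self, detect_enable_mask):
        self.enable = detect_enable_mask
        self.status = 0

    def record(self, raised_mask):
        # Only exceptions enabled for detection are remembered.
        self.status |= raised_mask & self.enable

    def get(self):
        return self.status

    def clear(self, mask):
        self.status &= ~mask
```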
  • 40. Debug Infrastructure
      The Kernel Agent shall provide mechanisms to allow system software and some select application software (for example, debuggers and profilers) to set breakpoints and collect throughput information for profiling.
      Breakpoint types handled by the debug module: instruction breakpoint, memory breakpoint, conditional breakpoint.
      [Figure: debuggers and profilers on the host CPU (HSA Agent) reach the debug module through the HSA Kernel Agent debug interface; the grid's work-groups are split into wavefronts that run on a compute unit in SIMD (Single Instruction, Multiple Data) style, one work-item per lane (Lane 0 to N-1), sharing one PC.]
  • 42. Execution Environment
      Platform and device information queried through the OpenCL APIs:
        You have 2 OpenCL platform(s)
        ----------------------------------------------
        Platform[0].Name    = NVIDIA CUDA
        Platform[0].Vendor  = NVIDIA Corporation
        Platform[0].Version = OpenCL 1.1 CUDA 4.2.1
        Platform[0].Profile = FULL_PROFILE
        ----------------------------------------------
        Platform[1].Name    = Intel(R) OpenCL
        Platform[1].Vendor  = Intel(R) Corporation
        Platform[1].Version = OpenCL 1.2
        Platform[1].Profile = FULL_PROFILE
        ----------------------------------------------
        Platform[0] has 1 device(s)
        ----------------------------------------------
        Device[0].Type            = CL_DEVICE_TYPE_GPU
        Device[0].Name            = GeForce GT 625
        Device[0].Vendor          = NVIDIA Corporation
        Device[0].Version         = OpenCL 1.1 CUDA
        Device[0].DriverVersion   = 320.49
        Device[0].Profile         = FULL_PROFILE
        Device[0].OpenCL_C        = OpenCL C 1.1
        Device[0].MaxComputeUnits = 1
        Device[0].MaxWiDimensions = 3
        Device[0].MaxWiSize       = (1024,1024,64)
        Device[0].MaxWgSize       = 1024
        Device[0].MaxClkFrequency = 1747 MHz
        Device[0].AddrSpaceSize   = 32 bits
        Platform[1] has 1 device(s)
        ----------------------------------------------
        Device[0].Type            = CL_DEVICE_TYPE_CPU
        Device[0].Name            = Intel(R) Core(TM) i5-4440 CPU @ 3.10GHz
        Device[0].Vendor          = Intel(R) Corporation
        Device[0].Version         = OpenCL 1.2 (Build 80752)
        Device[0].DriverVersion   = 3.0.1.15216
        Device[0].Profile         = FULL_PROFILE
        Device[0].OpenCL_C        = OpenCL C 1.2
        Device[0].MaxComputeUnits = 4
        Device[0].MaxWiDimensions = 3
        Device[0].MaxWiSize       = (1024,1024,1024)
        Device[0].MaxWgSize       = 1024
        Device[0].MaxClkFrequency = 3100 MHz
        Device[0].AddrSpaceSize   = 32 bits
  • 43. HSA Platform Topology Discovery
      HSA platform resources: Agent, Memory, Compute Properties, Caches, and I/O
      [Figure: an HSA platform with Nodes 0 to 4. HSA APUs (CPU cores, GPU H-CUs, HSA MMU) attach to system memory with cacheable coherent and non-cacheable non-coherent regions and are linked by a socket interconnect. HSA discrete GPUs (H-CUs with device local memory, coherent and non-coherent) sit on optional add-in boards behind PCIe and a PCIe bridge. Firmware shown: SBIOS UEFI, VBIOS UEFI GOP.]
  • 45. Images
      - A graphics feature that can sometimes be useful in data-parallel computing
      - Used to store one-, two-, or three-dimensional images
      - Predefined image formats
      - Image memory uses a special kind of memory access
      - Dedicated hardware speeds up image operations
      Reference: The OpenCL™ Specification Version 2.1, Section 5.3 Image Objects, https://www.khronos.org/registry/cl/specs/opencl-2.1.pdf
      [Figure: an image object in the HSA Runtime consists of an image handle (hsa_ext_image_handle_t), an image descriptor (image channel type, image channel order, image geometry, image data size), and image data (1D, 2D, or 3D images) in the global segment; the HSA Kernel Agent accesses the image data with the rdimage, ldimage, and stimage instructions.]
  • 46. Summary
      Programming model issues:
      - HSA Intermediate Language (HSAIL) + HSA Runtime
      - Architected Queuing Language (AQL) + Signaling
      - Debug infrastructure
      Communication overhead issues:
      - Cache coherent shared virtual memory (CC-SVM)
      - Architected Queuing Language (AQL) for user mode queuing
      - Hardware-assisted signaling and atomic operations for synchronization
      [Figure: CPUs, GPU, DSP, and other processors share unified coherent memory and are programmed through HSAIL, the HSA Runtime, and AQL.]
  • 47. Designer roles and references
      [Figure: the HSA software stack. A user application (CPU code + HSAIL kernel code) and a language runtime (e.g. the OpenCL runtime) run on the HSA Runtime and HSA Finalizers in the HSA application (HSA Agent) on the host CPU, above the HSA Kernel Mode Driver. HSA user mode queuing (Architected Queuing Language) + HSA signaling connect it to HSA Kernel Agents (CPU, GPU, DSP). Roles marked on the stack: Parallel Application Designer, HSA System Software Designer, HSA System Architecture Designer, HSA Kernel Agent Designer. One position carries the marker "Mom, I'm here!"]
      References:
      - OpenCL Standards ( https://www.khronos.org/opencl/ )
      - HSA Standards ( http://www.hsafoundation.com/html/HSA_Library.htm )
          - HSA Platform System Architecture Specification v1.0
          - HSA Programmer Reference Manual Specification v1.0
          - HSA Runtime Specification v1.0
      - HSA Foundation Github ( https://github.com/HSAFoundation )
  • 48. Taiwan HSA Group @ Facebook