Hardware acceleration based on FPGAs has become increasingly popular in recent years, thanks to their availability as cloud commodities and the continuous improvement of High-Level Synthesis (HLS) tools, which simplify the adoption of this technology for software developers. HLS tools allow designers to generate several hardware implementations of a given function and to explore different trade-offs between performance and resource requirements. Nevertheless, when addressing complex applications featuring many candidate functions for acceleration, the designer faces new challenges and design opportunities. Indeed, the designer has to: map each candidate function to a specific hardware implementation, partition the kernel functions into one or more FPGA configurations, and schedule the FPGA reconfigurations throughout the execution of the application. In this work, we present an approach that automatically performs the mapping, partitioning and scheduling of the candidate functions for hardware acceleration so as to minimize the overall execution time of the application. Furthermore, based on the identified solution, we also provide automatic code transformations that integrate the original software code with calls to the accelerated hardware functions and with the FPGA reconfiguration requests.
Automatic mapping, partitioning and scheduling for hardware acceleration on FPGAs
Slide 1
DIPARTIMENTO DI ELETTRONICA,
INFORMAZIONE E BIOINGEGNERIA
AMPS
Automatic Mapping, Partitioning and Scheduling
for hardware acceleration on FPGAs
Mirko Salaris: mirko.salaris@mail.polimi.it
Marco Rabozzi: marco.rabozzi@polimi.it
May 17-31, 2019
NGCX at San Francisco
Slide 2
Steps for software acceleration on FPGA
• Software profiling
• Identification of candidate hardware functions
• Design Space Exploration of hardware function implementations
• Choice of the function to implement in hardware
Slide 3
Steps for software acceleration on FPGA
✓ All four steps are already supported by CAOS [1]
[1] CAOS: CAD as an Adaptive Open Platform Service, http://caos.necst.it
Slide 4
Steps for software acceleration on FPGA
What about the acceleration of multiple functions?
Slides 5-6
Steps for software acceleration on FPGA
✓ Software profiling
✓ Identification of candidate hardware functions
✓ Design Space Exploration of hardware function implementations
• Selection of the hardware function implementations
• Partitioning of the hardware functions into one or more bitstreams
• Scheduling of the FPGA reconfigurations
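The partitioning and scheduling steps interact: packing functions that are called close together into the same bitstream reduces the number of reconfigurations. A minimal sketch of this coupling, under a simple load-on-demand policy (the partition, trace and policy are illustrative, not the AMPS model):

```python
# Sketch: count the FPGA reconfigurations implied by a given partition of the
# hardware functions into bitstreams, under a load-on-demand schedule.
# All names and the partition below are illustrative.
partition = {          # bitstream id -> functions packed into that configuration
    0: {"funA", "funC"},
    1: {"funB", "funD"},
}
bitstream_of = {f: b for b, funcs in partition.items() for f in funcs}

def count_reconfigurations(trace, bitstream_of):
    """Reconfigure only when the next hardware call needs a bitstream that is
    not currently loaded; calls to software-only functions are skipped."""
    loaded, reconfs = None, 0
    for call in trace:
        needed = bitstream_of.get(call)       # None -> runs in software
        if needed is not None and needed != loaded:
            loaded, reconfs = needed, reconfs + 1
    return reconfs

trace = ["funB", "funD", "funA", "funA", "funB", "funD", "funC"]
print(count_reconfigurations(trace, bitstream_of))   # prints 4
```

With funB and funD packed into the same bitstream, their alternating calls never trigger a reconfiguration; splitting them across bitstreams would make every alternation pay the reconfiguration cost.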
Slide 8
AMPS – Profiling

Function   Self Time %   Total Time %
funA       98.71%        32.26%
funB       92.65%        12.83%
funC       89.26%        27.98%
funD       94.37%        9.41%
funE       2.73%         68.52%
…          …             …
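The two profiling columns play complementary roles: Total Time % (inclusive of callees) tells how much of the application a function covers, while Self Time % tells whether that time is spent in the function's own body, and hence whether accelerating that body alone can pay off. A minimal selection sketch using the numbers from the table (the thresholds and helper name are illustrative, not part of AMPS):

```python
# Hypothetical profiling summary with the values from the slides.
# "self_pct": share of the function's inclusive time spent in its own body;
# "total_pct": share of the whole application's run time (callees included).
profile = {
    "funA": {"self_pct": 98.71, "total_pct": 32.26},
    "funB": {"self_pct": 92.65, "total_pct": 12.83},
    "funC": {"self_pct": 89.26, "total_pct": 27.98},
    "funD": {"self_pct": 94.37, "total_pct": 9.41},
    "funE": {"self_pct": 2.73,  "total_pct": 68.52},
}

def acceleration_candidates(profile, min_self=80.0, min_total=5.0):
    """Keep functions that both account for a meaningful share of the run
    (high total %) and spend that time in their own body (high self %).
    funE fails the second test: it mostly waits on its callees, so moving
    its body to hardware would gain almost nothing."""
    return sorted(
        (name for name, p in profile.items()
         if p["self_pct"] >= min_self and p["total_pct"] >= min_total),
        key=lambda name: -profile[name]["total_pct"],
    )

print(acceleration_candidates(profile))   # ['funA', 'funC', 'funB', 'funD']
```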
Slides 10-12
AMPS – Call Trace Analysis
The list of function calls, in order:
funE, funB, funD, funA, funA, funA, funA, funA, funF, funG, funG, funG,
funE, funB, funD, funB, funD, funF, funG, funH, funC,
funE, funA, funA, funA, funF, funG, funG, funH, funC, funF, funG, funG,
funE, funA, funA, funA, funB, funD, funB, funD, funB, funF, funG, funF, funG, funH, funC,
funE, funA, funA, funA
NOT synthesizable in hardware: funE, funF, funG, funH
Slides 13-16
AMPS – Call Trace Analysis
The list of function calls, in order (restricted to the synthesizable functions):
funB, funD, funA, funA, funA, funA, funA,
funB, funD, funB, funD, funC,
funA, funA, funA, funC, funA, funA, funA,
funB, funD, funB, funD, funB, funC,
funA, funA, funA
funA is always called in blocks of multiple calls.
funB and funD are always called in quick succession and in an alternating fashion.
Other patterns?
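Patterns like the ones annotated above can be detected mechanically on the filtered trace. A minimal sketch over the trace from the slides (the helper functions are hypothetical, not part of AMPS):

```python
from itertools import groupby

# Filtered call trace from the slides (non-synthesizable functions removed).
trace = ["funB", "funD", "funA", "funA", "funA", "funA", "funA",
         "funB", "funD", "funB", "funD", "funC",
         "funA", "funA", "funA", "funC", "funA", "funA", "funA",
         "funB", "funD", "funB", "funD", "funB", "funC",
         "funA", "funA", "funA"]

def run_lengths(trace, name):
    """Lengths of the maximal runs of consecutive calls to `name`."""
    return [len(list(group)) for key, group in groupby(trace) if key == name]

def always_preceded_by(trace, b, a):
    """True if every call to `b` immediately follows a call to `a`."""
    return all(i > 0 and trace[i - 1] == a
               for i, f in enumerate(trace) if f == b)

# funA only ever appears in blocks of multiple calls:
print(run_lengths(trace, "funA"))                  # [5, 3, 3, 3]
# funB/funD alternate: every funD directly follows a funB (the converse does
# not hold, since the last funB of a burst may be followed by funC):
print(always_preceded_by(trace, "funD", "funB"))   # True
```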
Slide 17
AMPS – DSE

Function    Implementation  Performance               Resources
function_1  F1.impl_1       Execution time: 14.68 s   BRAM_18K: 1523 (35%)
                            Clock frequency: 200 MHz  FF: 1211 (0.05%)
                                                      LUT: 2211 (0.19%)
                                                      [...]
            F1.impl_2       Execution time: 12.47 s   BRAM_18K: 3 (0.07%)
                            Clock frequency: 220 MHz  FF: 1274 (0.05%)
                                                      LUT: 1937 (0.16%)
                                                      [...]
function_2  F2.impl_1       [...]                     [...]

Automated Design Space Exploration and Roofline Analysis for FPGA-based HLS Applications.
Marco Siracusa, Marco Rabozzi, Lorenzo di Tucci, Marco Domenico Santambrogio
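Given the per-implementation DSE results, a natural first filter before mapping is to keep only the Pareto-optimal implementations in the performance/resources plane. A minimal sketch; the first two points mirror the table above, while F1.impl_3 is an extra, faster-but-larger point added purely for illustration:

```python
# Hypothetical DSE results shaped like the table above: estimated execution
# time vs. a resource footprint (LUT count here).
implementations = {
    "F1.impl_1": {"time_s": 14.68, "luts": 2211},
    "F1.impl_2": {"time_s": 12.47, "luts": 1937},
    "F1.impl_3": {"time_s": 10.02, "luts": 5804},   # invented for illustration
}

def pareto_front(impls):
    """Drop every implementation dominated by another one, i.e. one that is
    at least as fast AND at least as small, and strictly better on one axis."""
    front = {}
    for name, p in impls.items():
        dominated = any(
            q["time_s"] <= p["time_s"] and q["luts"] <= p["luts"]
            and (q["time_s"] < p["time_s"] or q["luts"] < p["luts"])
            for other, q in impls.items() if other != name)
        if not dominated:
            front[name] = p
    return front

print(sorted(pareto_front(implementations)))   # ['F1.impl_2', 'F1.impl_3']
```

Here F1.impl_2 dominates F1.impl_1 on both axes (faster and smaller, as in the table), so only two points survive for the later mapping step to choose between.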
Slide 21
Partitioning, Mapping and Scheduling: is this the best?
Slides 22-27
Partitioning, Mapping and Scheduling
The profiling and call-trace findings feed the Partitioning, Mapping and Scheduling step:

Function   Self Time %   Total Time %
funA       98.71%        32.26%
funC       89.26%        27.98%

funA is always called in blocks of multiple calls.
funC is called only a few times.
funB and funD are always called in quick succession and in an alternating fashion.

Is this the best?
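The question "is this the best?" can only be answered against the objective AMPS minimizes: the end-to-end execution time implied by a mapping, a partitioning and a reconfiguration schedule. A minimal cost-model sketch; all timings, the chosen mapping (funA and funC in hardware) and the one-function-per-bitstream partition are invented for illustration, not measured data:

```python
# Illustrative per-call timings for a candidate solution.
SW_TIME = {"funA": 1.00, "funB": 0.40, "funC": 2.50, "funD": 0.30}  # seconds
HW_TIME = {"funA": 0.05, "funC": 0.60}       # functions mapped to hardware
RECONF_TIME = 0.8                            # seconds per bitstream load
bitstream_of = {"funA": 0, "funC": 1}        # one function per bitstream

def total_time(trace):
    """Walk the trace, paying a reconfiguration cost whenever the next
    hardware call needs a bitstream different from the loaded one."""
    loaded, t = None, 0.0
    for call in trace:
        if call in HW_TIME:
            if bitstream_of[call] != loaded:
                t += RECONF_TIME
                loaded = bitstream_of[call]
            t += HW_TIME[call]
        else:
            t += SW_TIME[call]
    return t

trace = ["funA", "funA", "funA", "funC", "funA", "funA", "funA"]
print(round(total_time(trace), 2))   # 3.3
```

Even paying three reconfigurations, this hardware mapping (3.3 s) beats the all-software run (6 × 1.0 s + 2.5 s = 8.5 s); a partition packing funA and funC into the same bitstream would remove the reconfigurations entirely, which is exactly the kind of trade-off the three coupled steps must explore together.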
Slide 28
Conclusions
• The concurrent acceleration of multiple functions requires multiple steps
• There is no easy way to decouple these steps while still guaranteeing optimality
Future work:
• Validate the proposed flow on a set of real applications
• Integrate this flow into CAOS