Characterization of Emu Chick with Microbenchmarks
1. Characterization of the Emu Chick with Microbenchmarks
E. Jason Riedy
Center for Research into Novel Computing Hierarchies at Georgia Tech
23 January 2019
3. Memory-centric HPDA
• “Big data” platforms fare poorly versus a single thread plus a large SSD. (McSherry, Isard, Murray. “Scalability! But at what COST?” HotOS XV, 2015.)
• New architecture proposals are difficult to evaluate
via simulation and modeling alone.
Evaluate the FPGA-based prototype Emu Chick...
• But by what criteria?
• Chose memory bandwidth utilization.
• Memory-centric architecture
• BW is equivalent to MFLOP/s in SpMV, TEPS in BFS
Emu: µbenchmarks — 23 Jan 2019 3/27
4. Emu Technology’s PGAS Architecture
[Figure: Emu architecture block diagram. Labels: 1 nodelet (Gossamer Core 1 ... Gossamer Core 4, Memory-Side Processor); Migration Engine; RapidIO; Disk I/O; Stationary Core; 8 nodelets per node; 64 nodelets per Chick.]
• Multithreaded multicore
• Memory-side “processor” for operations in narrow-channel DRAM
• Stationary core for OS
• Threads migrate in hardware on reads!
• Optimize for weak locality
5. Baseline: Emu STREAM ADD c[i] = a[i] + b[i]
GC Config     Nodelets   Scale   Threads   BW (MB/s)
1             8          30      512        1,599.86
3             4          29      384        1,288.39
1             64         31      4096      12,790.31
3             32         31      6144       7,241.07
Theor. Peak   8          -       -          9,600
Theor. Peak   64         -       -         76,800
STREAM results are used to compare bandwidth utilization for the current prototype. The 3 GC configuration is experimental and has (had?) half the memory controllers.¹
¹ Eric Hein, Young, Srinivas Eswar, Jiajia Li, Patrick Lavin, Riedy, Vuduc, Conte. “A Microbenchmark Characterization of the Emu Chick,” (in submission, https://arxiv.org/abs/1809.07696).
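As a rough illustration of how the STREAM baseline turns into a utilization figure, the sketch below divides the measured bandwidths in the table by the listed theoretical peaks. The function names and the 8-byte element size are assumptions for illustration; the slides give no peak for 4 or 32 nodelets, so those rows are skipped rather than guessed.

```python
# Measured STREAM ADD bandwidth in MB/s, copied from the table,
# keyed by (GC configuration, nodelets).
measured = {
    (1, 8): 1599.86,
    (3, 4): 1288.39,
    (1, 64): 12790.31,
    (3, 32): 7241.07,
}

# Theoretical peaks in MB/s, as listed for 8 and 64 nodelets.
theoretical_peak = {8: 9600.0, 64: 76800.0}


def utilization(bw_mb_s, peak_mb_s):
    """Fraction of the theoretical peak that was actually achieved."""
    return bw_mb_s / peak_mb_s


def stream_add_bytes(n, word_bytes=8):
    """Bytes moved by c[i] = a[i] + b[i]: two reads and one write per element.

    word_bytes=8 is an assumption; the slides do not state the element size.
    """
    return 3 * n * word_bytes


for (gc, nodelets), bw in measured.items():
    peak = theoretical_peak.get(nodelets)
    if peak is None:
        # The table lists no peak for 4 or 32 nodelets, so none is guessed here.
        continue
    print(f"{gc} GC, {nodelets} nodelets: {utilization(bw, peak):.1%} of peak")
```

With the table's numbers, both 1 GC rows come out near 17% of the theoretical peak.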
11. SpMV Layout, Synthetic (5pt Laplacian)
CSR layouts: Local (1 nodelet), 1D (8+ nodelets), 2D (8+ nodelets).
[Figure: diagrams of Y = X x for each layout, with the CSR arrays row, col, and v.]
[Figure: bandwidth (MB/s) versus number of rows for the local, 1D, and 2D data layouts.]
Single node, integer entries.¹
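For readers unfamiliar with the kernel, here is a minimal sequential sketch of CSR SpMV on a synthetic 5-point Laplacian with integer entries. The names `row_ptr`, `col_idx`, and `vals` stand in for the slide's row, col, and v arrays. The grid construction and function names are assumptions for illustration, and nothing here models the Emu data layouts.

```python
def laplacian_5pt_csr(k):
    """CSR arrays (row_ptr, col_idx, vals) for the 5-point Laplacian on a k-by-k grid."""
    row_ptr, col_idx, vals = [0], [], []
    for i in range(k):
        for j in range(k):
            row = i * k + j
            # Emit neighbors in ascending column order, skipping those off the grid.
            if i > 0:
                col_idx.append(row - k)
                vals.append(-1)
            if j > 0:
                col_idx.append(row - 1)
                vals.append(-1)
            col_idx.append(row)
            vals.append(4)
            if j < k - 1:
                col_idx.append(row + 1)
                vals.append(-1)
            if i < k - 1:
                col_idx.append(row + k)
                vals.append(-1)
            row_ptr.append(len(col_idx))
    return row_ptr, col_idx, vals


def spmv_csr(row_ptr, col_idx, vals, x):
    """Compute y = A x for a matrix A stored in CSR form."""
    y = [0] * (len(row_ptr) - 1)
    for r in range(len(y)):
        acc = 0
        for p in range(row_ptr[r], row_ptr[r + 1]):
            # The indirect read x[col_idx[p]] is the access whose placement matters.
            acc += vals[p] * x[col_idx[p]]
        y[r] = acc
    return y


row_ptr, col_idx, vals = laplacian_5pt_csr(3)
y = spmv_csr(row_ptr, col_idx, vals, [1] * 9)
print(y)
```

The indirect read of `x` inside the inner loop is the access pattern that the Local, 1D, and 2D layouts place differently.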
12. SpMV Synthetic, Replicated – Single node, 1 GC
[Figure: SpMV (Emu Chick, single node): bandwidth (MB/s) versus matrix size (MB) for 64, 128, 256, and 512 threads.]
Good bandwidth utilization with high thread counts and
replicated x.
13. SpMV Synthetic – Single node, 1 GC
[Figure: SpMV (Emu Chick, single node): bandwidth (MB/s) versus matrix size (MB) for 256 and 512 threads.]
The 5pt Laplacian without replicating x bounces between migratory and non-migratory areas.¹
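To make the role of replicating x concrete, the toy model below counts how many reads of x would land on a different nodelet than the row being processed. It assumes a contiguous block distribution of rows and x across nodelets, which is a hypothetical layout chosen for illustration and not necessarily the one used on the Chick. It also ignores what happens after a thread migrates, so it should not be read as a model of the measured curves.

```python
def remote_x_reads(k, nodelets=8, replicated=False):
    """Count x reads that fall on another nodelet (toy model, block layout assumed).

    Returns (remote_reads, total_reads) for the 5-point Laplacian on a k-by-k grid.
    """
    n = k * k
    block = -(-n // nodelets)  # ceiling division: rows (and x entries) per nodelet

    def owner(idx):
        return idx // block

    remote = total = 0
    for i in range(k):
        for j in range(k):
            row = i * k + j
            cols = [row]
            if i > 0:
                cols.append(row - k)
            if i < k - 1:
                cols.append(row + k)
            if j > 0:
                cols.append(row - 1)
            if j < k - 1:
                cols.append(row + 1)
            for c in cols:
                total += 1
                # With a replicated x, every read is served by a local copy.
                if not replicated and owner(c) != owner(row):
                    remote += 1
    return remote, total


print(remote_x_reads(4, nodelets=2))
print(remote_x_reads(4, nodelets=2, replicated=True))
```

For a 4-by-4 grid split across two nodelets, this returns (8, 64) without replication and (0, 64) with it.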
14. SpMV Synthetic – Single node, 1 and 3 GC
[Figure: SpMV (Emu Chick, single node, 512 threads): bandwidth (MB/s) versus number of rows for 1 GC and 3 GC.]
3 GC version: half the nodelets, half the memory controllers
15. SpMV Synthetic – Single node, 1 and 3 GC
[Figure: SpMV (Emu Chick, single node, 512 threads): bandwidth utilization versus matrix size (MB) for 1 GC and 3 GC.]
3 GC results demonstrate that SpMV is compute-bound
from address computation.
16. SpMV Synthetic, Replicated – Multinode, 1 GC
[Figure: SpMV (Emu Chick, multi-node): bandwidth (MB/s) versus matrix size (MB) for 64, 128, 256, 512, 1024, 2048, and 4096 threads.]
SpMV scales up to 50% of bandwidth for high thread counts and replicated x.¹
17. SpMV Synthetic – Multinode, 1 GC
[Figure: SpMV (Emu Chick, multi-node): bandwidth (MB/s) versus matrix size (MB) for 1024, 2048, and 4096 threads.]
But migrations for fetching x hurt with eight nodes.
18. SpMV Real-World Results, Replicated
SpMV multinode bandwidths (in MB/s) for real-world graphs (Tim Davis’s collection),
along with matrix dimension, number of non-zeros (NNZ), and the average and
maximum row degrees. Run with 4K threads.
Matrix     Rows    NNZ     Avg Deg   Max Deg   BW (MB/s)
mc2depi    526K    2.1M       3.99         4     3870.31
ecology1   1.0M    5.0M       5.00         5     4425.61
amazon03   401K    3.2M       7.99        10     4494.79
Delor295   296K    2.4M       8.12        11     4492.47
roadNet-   1.39M   3.84M      2.76        12     3811.57
mac_econ   206K    1.27M      6.17        44     3735.54
cop20k_A   121K    2.62M     21.65        81     4520.05
watson_2   352K    1.85M      5.25        93     3486.30
ca2010     710K    3.49M      4.91       141     4075.97
poisson3   86K     2.37M     27.74       145     4031.20
gyro_k     17K     1.02M     58.82       360     2446.36
vsp_fina   140K    1.1M       7.90       669     1335.59
Stanford   282K    2.31M      8.20     38606      287.82
ins2       309K    2.75M      8.89    309412       43.91
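One way to read the table is to compare each matrix's maximum row degree with its average. The sketch below computes that ratio from the printed values. The ratio as an imbalance indicator is an interpretation added here, not a metric the slides define, and matrix names are kept exactly as printed.

```python
# (name as printed, average degree, maximum degree, bandwidth in MB/s)
rows = [
    ("mc2depi", 3.99, 4, 3870.31),
    ("ecology1", 5.00, 5, 4425.61),
    ("amazon03", 7.99, 10, 4494.79),
    ("Delor295", 8.12, 11, 4492.47),
    ("roadNet-", 2.76, 12, 3811.57),
    ("mac_econ", 6.17, 44, 3735.54),
    ("cop20k_A", 21.65, 81, 4520.05),
    ("watson_2", 5.25, 93, 3486.30),
    ("ca2010", 4.91, 141, 4075.97),
    ("poisson3", 27.74, 145, 4031.20),
    ("gyro_k", 58.82, 360, 2446.36),
    ("vsp_fina", 7.90, 669, 1335.59),
    ("Stanford", 8.20, 38606, 287.82),
    ("ins2", 8.89, 309412, 43.91),
]


def degree_ratio(avg_deg, max_deg):
    """Maximum row degree relative to the average (hypothetical imbalance indicator)."""
    return max_deg / avg_deg


# List matrices from most balanced to least balanced rows.
for name, avg_deg, max_deg, bw in sorted(rows, key=lambda r: degree_ratio(r[1], r[2])):
    print(f"{name:10s} ratio={degree_ratio(avg_deg, max_deg):10.1f}  BW={bw:8.2f} MB/s")
```

The two matrices with the largest ratios, Stanford and ins2, are also the two with the lowest bandwidth in the table.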
19. Breadth-First Search with Remote Writes
1. For each vertex in the frontier, try to set self as parent of each neighbor vertex
   • Done using remote writes, no migrations
   • Last writer wins (benign race condition)
2. Double-buffer: Check to see which vertices acquired a new parent, and add them to the queue
   • This step is completely nodelet-local
   • Caveat: also scans inactive vertices
20. BFS Pseudo-code
Listing 1: BFS algorithm using remote writes
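def bfs(root, num_vertices, out_edges):
    parent = [-1] * num_vertices
    new_parent = [-1] * num_vertices
    parent[root] = root          # mark the root as visited
    queue = [root]
    while len(queue) > 0:
        for src in queue:
            for dst in out_edges(src):
                # Remote write; last writer wins (benign race)
                new_parent[dst] = src
        queue = []               # double-buffer: start the next frontier
        # Nodelet-local scan; also visits inactive vertices
        for v in range(num_vertices):
            if parent[v] == -1:
                if new_parent[v] != -1:
                    parent[v] = new_parent[v]
                    queue.append(v)
    return parent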
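The listing can be exercised on a small graph. The sequential sketch below is an instrumented variant that also records, per level, the frontier size, the number of remote writes, and the number of vertices scanned, which makes the "also scans inactive vertices" caveat visible. The adjacency-list input and the `stats` tuple are assumptions added for illustration, and a sequential run cannot show the race itself.

```python
def bfs_remote_writes(adj, root):
    """BFS in the remote-write style; adj[v] lists the out-neighbors of v.

    Returns (parent, stats), where stats holds one
    (frontier size, remote writes, vertices scanned) tuple per level.
    """
    n = len(adj)
    parent = [-1] * n
    new_parent = [-1] * n
    parent[root] = root
    queue = [root]
    stats = []
    while queue:
        writes = 0
        for src in queue:
            for dst in adj[src]:
                new_parent[dst] = src  # stands in for a remote write; last writer wins
                writes += 1
        frontier_size = len(queue)
        queue = []
        # The scan touches every vertex, active or not.
        for v in range(n):
            if parent[v] == -1 and new_parent[v] != -1:
                parent[v] = new_parent[v]
                queue.append(v)
        stats.append((frontier_size, writes, n))
    return parent, stats


# Diamond-shaped graph: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3.
parent, stats = bfs_remote_writes([[1, 2], [3], [3], []], root=0)
print(parent)
print(stats)
```

In this run every level scans all four vertices even when the frontier holds a single vertex, which is the cost the caveat refers to.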
21. BFS on a Dynamic Data Structure
[Figure: BFS performance versus scale (15 to 21), in MTEPS and edge bandwidth (MB/s), for Emu single node (Cilk), Emu multi-node (Cilk), x86 Haswell (STINGER), and x86 Haswell (Cilk).]
Note: Streaming data structure, not statically optimized. But these are Erdős-Rényi graphs; RMAT graphs show load imbalance.³
³ Hein, Eswar, Abdurrahman Yasar, Prasanth Chatarasi, Li, Young, Conte, Ümit Çatalyürek, Vuduc, Riedy, Bora Uçar. “Programming Strategies for Irregular Algorithms on the Emu Chick,” (in submission).
22. Labeled Subgraph Alignment
[Figure: speedup versus number of threads (1 to 128) for Multi-BLK, Multi-HCB, Single-BLK, and Single-HCB.]
gsaNA, the first parallel algorithm: strong scaling on the DBLP graph (2048 vertices). The block (BLK) vertex layout is slightly worse than the Hilbert curve (HCB) layout.³
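For context on what a Hilbert curve ordering buys, the sketch below uses the standard bitwise conversion from grid coordinates to a position along the curve and compares it with a row-major order. The slides do not say how gsaNA applies the curve, so `block_index` is only a hypothetical stand-in for a block layout, and the grid is an illustrative assumption.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve over an n-by-n grid.

    n must be a power of two.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def block_index(n, x, y):
    """Row-major order, used here as a hypothetical stand-in for a block layout."""
    return y * n + x


def max_step(order):
    """Largest Manhattan distance between consecutive cells in an ordering."""
    return max(abs(a[0] - b[0]) + abs(a[1] - b[1]) for a, b in zip(order, order[1:]))


n = 4
cells = [(x, y) for y in range(n) for x in range(n)]
hilbert_order = sorted(cells, key=lambda c: hilbert_index(n, c[0], c[1]))
block_order = sorted(cells, key=lambda c: block_index(n, c[0], c[1]))
print(max_step(hilbert_order), max_step(block_order))
```

Consecutive cells along the Hilbert order are always grid neighbors, while the row-major order jumps across the grid at the end of every row.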
23. Lessons Learned i
• Finding appropriate metrics is difficult:
• Comparing ASICs (e.g. x86) to FPGA-based prototypes can be unfair either way.
• Fraction of peak bandwidth for the idealized problem?
• Measured peak is much lower than theoretical peak.
• The Chick is compute bound.
• SpMV: FLOP/s ∝ BW, level 2 sparse BLAS op.
• Graph500 BFS: TEPS ∝ BW
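The two proportionalities can be written as simple conversions once a byte count per operation is fixed. The sketch below assumes 16 bytes per nonzero (an 8-byte value plus an 8-byte column index), 2 operations per nonzero, 16 bytes per traversed edge, and 1 MB = 10^6 bytes. All of these are illustrative assumptions; the slides do not state how bytes are counted.

```python
def spmv_ops_per_sec(bw_bytes_per_sec, bytes_per_nnz=16.0, ops_per_nnz=2.0):
    """Operation rate implied by a bandwidth, for a fixed byte cost per nonzero.

    bytes_per_nnz and ops_per_nnz are assumptions (multiply plus add per nonzero).
    """
    return bw_bytes_per_sec * ops_per_nnz / bytes_per_nnz


def bfs_teps(bw_bytes_per_sec, bytes_per_edge=16.0):
    """Traversed edges per second implied by a bandwidth (bytes_per_edge is assumed)."""
    return bw_bytes_per_sec / bytes_per_edge


# ecology1 from the real-world SpMV table: 4425.61 MB/s.
bw = 4425.61e6
print(f"{spmv_ops_per_sec(bw) / 1e6:.1f} million operations per second")
print(f"{bfs_teps(bw) / 1e6:.1f} million edges per second at the same bandwidth")
```

Because both rates are a constant times the bandwidth, comparing bandwidth utilization across kernels avoids committing to any one of these constants.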
24. Lessons Learned ii
• Distilling observations on architecture ↔ programming model:
• Program data location for load (BW) balance.
• Remote memory operations vs. migration exposes the architecture.
• Migrations cost more than they appear to. Computation?
• Stack spills/access can cause ping-ponging.
• How does HW support for top-down (Cilk-ish) affect bottom-up (UPC) PGAS programming?
• Memory allocation similar to UPC, SHMEM
• UPC++ rpc_ff vs. Emu thread migration?
25. Integrating the Chick with Flexible Infrastructure
[Figure: Rogues Gallery infrastructure diagram. Labels under "Scheduling, Tools, and Admin" and "User Resources": login, rg-adm, Slurm Ctl, toolbox (NFS), rg-db, Slurm DBD, fpaa-host, power-host, nvidia-tegra-1 .. nvidia-tegra-N, fpaa-dev, emu-dev, emu-chick, fpga-dev-1 .. N, fpga-hmc, fpga-intel. Key: schedulable resource, physical resource, VM, USB device.]
Powell, Riedy, Young, and Conte. “Wrangling Rogues: Managing
Experimental Post-Moore Architectures.”
https://arxiv.org/abs/1808.06334
• Available. Plans to integrate with NSF XSEDE.
• Scheduler being deployed.
• Incorporates Singularity and virtual machines for OS/library versioning.
26. Umbrella Project: CRNCH Rogues Gallery
A physical & virtual space for hosting novel computing
architectures, systems, and accelerators.
Host / manage remote access for novel architectures!
• Emu Chick
• FPGA + HMC: 3D stacked
• FPAA: Analog/Neuromorphic
Amortize effort and cost of trying novel architectures.
Break the “but it’s too much work” barrier.
http://crnch.gatech.edu/rogues-gallery
27. Acknowledgments
• Srinivas Eswar (GT CSE)
• Dr. Eric Hein (GT ECE ⇒ Emu)
• Patrick Lavin (GT CSE)
• Jiajia Li (GT CSE ⇒ PNNL)
• Abdurrahman Yaşar (GT CSE)
• Dr. Ümit Çatalyürek (GT CSE)
• Dr. Tom Conte (GT CS/ECE)
• Dr. Bora Uçar (ENS Lyon CNRS)
• Dr. Rich Vuduc (GT CSE)
• Dr. Jeffrey S. Young (GT CS)
Code:
• https://gitlab.com/crnch-rg (soon)
• https://github.com/ehein6/emu-microbench