The National Institute of Advanced Industrial Science and Technology (AIST) in Japan focuses on bridging innovative technological seeds to commercialization, yet Japan currently lacks cutting-edge computing infrastructure dedicated to AI and big data that is openly available. The AI Bridging Cloud Infrastructure (ABCI) project aims to provide a large-scale open AI infrastructure that accelerates joint academic-industry R&D for AI in Japan. ABCI features 1088 compute nodes with 4352 NVIDIA Tesla V100 GPUs delivering 0.550 exaflops of AI performance, connected by an InfiniBand EDR network and cooled with warm liquid cooling technologies. It provides an open platform for AI research, applications, services, and infrastructure design through industry-academia collaboration.
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
1. ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
Hitoshi Sato
Artificial Intelligence Research Center
National Institute of Advanced Industrial Science and Technology (AIST)
2. About AIST
• One of the public research institutes in Japan under METI*, focusing on “bridging” the gap between innovative technological seeds and commercialization
• Information Technology Research Institute (ITRI)
• Artificial Intelligence Research Center (AIRC) @Odaiba, Tokyo
• Real World Big-Data Computing – Open Innovation Laboratory (joint lab w/ Tokyo Tech)
• HQ @Tsukuba
*Ministry of Economy, Trade and Industry
3. Current Status of AI R&D in Japan: AI requires the convergence of Algorithms, Big Data, and Computing Infrastructure
[Diagram: three converging elements]
• Algorithms: deep learning, machine learning, etc.
• Big Data
• Computing infrastructure: cutting-edge infrastructures, clouds, supercomputers, etc.
We lack cutting-edge infrastructure dedicated to AI and Big Data that is open to public use, serving big companies' AI/BD R&D (and science), AI venture startups, and AI/BD centers & labs in national labs & universities.
4. Different Workload Characteristics, But Similar to HPC
• Dense Problem
  – Compute-intensive
  – Dense matrix ops, multiply-add
  – Reduced precision
  – Regular data/memory accesses
• Sparse Problem
  – Graph data analytics, symbol processing
  – Data-intensive
  – Sparse matrix ops, integer
  – Big data kernels (Sort, PrefixSum, etc.)
  – Irregular data/memory accesses
→ AI requires HPC
5. Example of Real AI/DL Apps Usage
• lang2program
  – https://github.com/kelvinguu/lang2program
  – Implementation for an ACL 2017 (http://acl2017.org/) paper
  – Provided as a Dockerfile
    • Includes TensorFlow, CUDA, CuDNN, PostgreSQL, Python pip packages, etc.
    • Various software dependencies to install
Can traditional shared HPC systems support such DL apps?
• Docker is not allowed for security reasons (rootless access is required)
• Manual installation requires elaborate effort
6. Cloud (Big Data) vs. HPC Software Ecosystems
Typical Big Data / Cloud software stack:
• User applications
• MapReduce frameworks: Spark/Hadoop
• ML libraries: MLlib/Mahout; graph libraries: GraphX/Giraph
• SQL: Hive/Pig; RDB: PostgreSQL; CloudDB/NoSQL: HBase/Cassandra/MongoDB
• Coordination engine: ZooKeeper
• Distributed filesystems: HDFS
• Languages: Java, Scala, IDL
• VMs / Linux containers on a Linux OS
• x86 CPUs, Ethernet, local node storage
Typical HPC software stack: user applications, numerical libraries, Fortran/C/C++ w/ MPI and OpenMP, batch schedulers, parallel filesystems such as Lustre, Linux OS, x86 CPUs + accelerators, InfiniBand
Clouds vs. supercomputers: applications, system software, hardware
• Cloud jobs are often interactive w/ resource control via REST APIs; HPC jobs are batch-oriented, w/ resource control by MPI
• Cloud employs high-productivity languages, but performance is neglected; the focus is on data analytics and dynamic, frequent changes
• HPC employs high-performance languages but requires “ninja” programmers and has low productivity; kernels & compilers are well tuned and their results are shared by many programs, w/ less rewriting
• Cloud HW is based on “commodity” x86 web servers, w/ distributed storage on nodes assuming REST API access
• HPC HW aggressively adopts new technologies such as GPUs, focusing on ultimate performance at higher cost, w/ shared storage to support legacy apps
Various convergence research efforts are underway, but no realistic converged SW stack exists yet
• Cloud focuses on databases and data-manipulation workflows
• HPC focuses on compute kernels, even for data processing; jobs scale to thousands of processes, which complicates debugging and performance tuning
• Cloud requires purpose-specific computing/data environments as well as their mutual isolation & security
• HPC requires environments for fast & lean use of resources, but modern machines need considerable system-software support
7. Characteristics of Deep Learning I/O Workloads
• Small read I/O to a huge number of files
  – Native media files such as JPEG, WAV, text, etc.
  – Painful I/O patterns for shared parallel filesystems
  – Increasing metadata operations
• Randomness is essential
  – Runtime data augmentation for training accuracy (see the sketch below)
    • Random crop
    • Random horizontal flip
    • Normalization, etc.
Example: the ImageNet-1K training dataset
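A minimal Python sketch of the runtime augmentation steps listed above (random crop, random horizontal flip, per-channel normalization) applied to one native media file, assuming Pillow and NumPy are available; the 224-pixel crop size and the ImageNet-style mean/std constants are illustrative choices, not values taken from the slides.

    # Illustrative runtime data augmentation on a single JPEG file.
    # Crop size and mean/std normalization constants are assumptions for this example.
    import random
    import numpy as np
    from PIL import Image

    def augment(path, crop=224):
        img = Image.open(path).convert("RGB")            # small read of a native media file
        w, h = img.size
        x, y = random.randint(0, w - crop), random.randint(0, h - crop)
        img = img.crop((x, y, x + crop, y + crop))       # random crop
        if random.random() < 0.5:
            img = img.transpose(Image.FLIP_LEFT_RIGHT)   # random horizontal flip
        arr = np.asarray(img, dtype=np.float32) / 255.0
        mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
        std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
        return (arr - mean) / std                        # per-channel normalization

Because each training step reads a few such small files at random positions in the dataset, the resulting access pattern is dominated by small random reads and metadata lookups rather than large sequential I/O.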
8. • Expected power limit of real HPC systems
  – Multi-MW per system
  – 60–70 kW per rack
• General-purpose data centers
  – Multi-kW per rack
  – Lack the power density for HPC
→ How to introduce supercomputer cooling technologies to data centers
  ✓ Power efficiency, thermal density, maintainability, etc.
  ✓ Inexpensive build
Prior systems:
• TSUBAME2 (water cooling): #2 Green500, Nov 2010
• TSUBAME-KFC (liquid immersion cooling): #1 Green500, Nov 2013
• AIST AI Cloud (air cooling): #3 Green500, June 2017
9. • Accelerating AI/Big Data R&D
– Dedicated HPC Infra converged w/ AI/Big Data
• Extreme Computing Power
• Ultra-high bandwidth and low latency in memory, network, and storage
• Extreme Green
• Bridging AI Tech through Common Platform
– For AI R&D, Apps, Services, and Infrastructure Designs
– Aiming at fast tech transfer through industry-academia collaboration
– De facto Standard Software
• Cloud-Datacenter-ready Infrastructure, but Supercomputing
– Supercomputer Cooling Technologies to Datacenter
– Inexpensive Facilities, etc.
10. AI Bridging Cloud Infrastructure
as World’s First Large-scale Open AI Infrastructure
• Open, Public, and Dedicated infrastructure for AI/Big Data
• Platform to accelerate joint academic-industry R&D for AI in Japan
• Top-level compute capability w/ 0.550 EFlops (AI) and 37.2 PFlops (DP)
Univ. Tokyo Kashiwa II Campus
Operation Scheduled in 2018
11. • 1088x compute nodes w/ 4352x NVIDIA Tesla V100 GPUs, 476 TiB of memory, 1.6 PB of NVMe SSDs, 22 PB of HDD-based storage, and an InfiniBand EDR network
• Ultra-dense IDC design from the ground up w/ 20x the thermal density of a standard IDC
• Extreme green w/ ambient warm liquid cooling, high-efficiency power supplies, etc., commoditizing supercomputer cooling technologies to clouds (≈2.3 MW total, 70 kW/rack)
[System overview]
• Gateway and firewall; external networks via SINET5; service network (10 GbE); 100 GbE links
• Computing nodes: 0.550 EFlops (AI, half precision), 37.2 PFlops (DP), 476 TiB memory, 1.6 PB NVMe SSD
  – High-performance computing nodes (w/ GPU) x1088: Intel Xeon Gold 6148 (2.4 GHz, 20 cores) x2, NVIDIA Tesla V100 (SXM2) x4, 384 GiB memory, 1.6 TB NVMe SSD
  – Multi-platform nodes (w/o GPU) x10: Intel Xeon Gold 6132 (2.6 GHz, 14 cores) x2, 768 GiB memory, 3.8 TB NVMe SSD
  – Interactive nodes
• Storage: 22 PB GPFS
  – DDN SFA14K (w/ SS8462 enclosure x10) x3
  – 12 TB 7.2 Krpm NL-SAS HDD x2400, 3.84 TB SAS SSD x216, NSD servers x12
  – Object storage / protocol nodes
• Interconnect: InfiniBand EDR
12. ABCI: AI Bridging Cloud Infrastructure
0.550 EFlops (AI), 37.2 PFlops (DP); 19.88 PFlops Linpack, ranked #5 on the June 2018 Top500
• Chips (GPU, CPU)
  – GPU: NVIDIA Tesla V100 (16 GB, SXM2), 7.8 TFlops (DP), 125 TFlops (AI)
  – CPU: Intel Xeon Gold 6148 (27.5 MB cache, 2.40 GHz, 20 cores), 1.53 TFlops (DP), 3.07 TFlops (AI)
• Compute node (4 GPUs, 2 CPUs): 34.2 TFlops (DP), 506 TFlops (AI), 3.72 TB/s memory BW, 384 GiB memory, 200 Gbps network BW, 1.6 TB NVMe SSD
• Node chassis (2 compute nodes): 68.5 TFlops (DP), 1.01 PFlops (AI)
• Rack (17 chassis): 1.16 PFlops (DP), 17.2 PFlops (AI), 131 TB/s memory BW, full bisection BW within the rack, 70 kW max
• System (32 racks): 1088 compute nodes, 4352 GPUs, 37.2 PFlops (DP), 0.550 EFlops (AI), 4.19 PB/s memory BW, 1/3 oversubscription BW between racks, 2.3 MW
The per-level numbers compose multiplicatively, as checked in the sketch below.
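A quick arithmetic check in Python of how the per-chip figures above roll up to the node, chassis, rack, and system levels; all inputs are taken from this slide, and the small differences from the printed values come only from rounding.

    # Per-chip peak performance from the slide (TFlops)
    gpu_dp, gpu_ai = 7.8, 125.0      # NVIDIA Tesla V100
    cpu_dp, cpu_ai = 1.53, 3.07      # Intel Xeon Gold 6148

    # Compute node: 4 GPUs + 2 CPUs
    node_dp = 4 * gpu_dp + 2 * cpu_dp        # ~34.2 TFlops (DP)
    node_ai = 4 * gpu_ai + 2 * cpu_ai        # ~506 TFlops (AI)

    chassis_dp, chassis_ai = 2 * node_dp, 2 * node_ai      # 2 nodes per chassis
    rack_dp, rack_ai = 17 * chassis_dp, 17 * chassis_ai    # 17 chassis (34 nodes) per rack
    system_dp, system_ai = 32 * rack_dp, 32 * rack_ai      # 32 racks per system

    print(f"node:    {node_dp:.2f} TFlops DP, {node_ai:.1f} TFlops AI")
    print(f"chassis: {chassis_dp:.1f} TFlops DP, {chassis_ai / 1e3:.2f} PFlops AI")
    print(f"rack:    {rack_dp / 1e3:.2f} PFlops DP, {rack_ai / 1e3:.1f} PFlops AI")
    # system prints ~37.3 PFlops DP and ~0.551 EFlops AI, i.e. the slide's 37.2/0.550 after rounding
    print(f"system:  {system_dp / 1e3:.1f} PFlops DP, {system_ai / 1e6:.3f} EFlops AI")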
13. GPU Compute Nodes
• NVIDIA Tesla V100 (16 GB, SXM2) x4
• Intel Xeon Gold 6148 x2 sockets
  – 20 cores per socket
• 384 GiB of DDR4 memory
• 1.6 TB NVMe SSD x1
  – Intel DC P4600 (U.2)
• EDR InfiniBand HCA x2
  – Connected to other compute nodes and filesystems
[Node block diagram]
• Xeon Gold 6148 x2, linked by UPI x3 @ 10.4 GT/s
• DDR4-2666 32 GB DIMMs x6 per socket, 128 GB/s memory BW per socket (checked below)
• IB HCA (100 Gbps) x2 and NVMe SSD
• PCIe gen3 switches (x48 and x64 lanes) fan out to the GPUs
• Tesla V100 SXM2 x4, each attached via PCIe gen3 x16 and interconnected w/ NVLink2 x2
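A small Python check of the 128 GB/s per-socket figure in the diagram, assuming the six DDR4-2666 channels shown and the standard 64-bit (8-byte) channel width.

    # Peak memory bandwidth per socket = channels * transfer rate * bytes per transfer
    channels = 6                 # DDR4 channels per Xeon Gold 6148 socket (from the diagram)
    mt_per_s = 2666e6            # DDR4-2666: 2666 mega-transfers per second
    bytes_per_transfer = 8       # 64-bit channel width
    per_socket = channels * mt_per_s * bytes_per_transfer / 1e9
    print(f"{per_socket:.0f} GB/s per socket")    # ~128 GB/s, matching the diagram
    # The node-level 3.72 TB/s figure on the previous slide additionally counts
    # the HBM2 bandwidth of the four V100 GPUs on top of the two CPU sockets.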
14. Rack as Dense-packaged “Pod”
• One rack is a dense-packaged “pod”: CX400 chassis #1–17 (CX2570 nodes #1–34), leaf switches #1–4 (SB7890), and FBB switches #1–3 (SB7890)
• Node to leaf: InfiniBand EDR x1 per HCA
• Leaf to FBB: InfiniBand EDR x6 per leaf-FBB pair, giving full bisection BW within the pod (IB-EDR x72)
• FBB to spine (CS7500 x2): InfiniBand EDR x4 per FBB-spine pair, giving 1/3 oversubscription BW between pods (IB-EDR x24, checked below)
• x32 pods in total
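A small Python check of the per-pod link counts, assuming the EDR x6 and x4 figures from the diagram apply per leaf-FBB pair and per FBB-spine pair respectively.

    # Per-pod InfiniBand EDR link counts (interpretation of the diagram)
    leaf_switches, fbb_switches, spine_switches = 4, 3, 2
    links_leaf_to_fbb = 6        # EDR x6 between each leaf and each FBB switch
    links_fbb_to_spine = 4       # EDR x4 between each FBB and each spine switch

    intra_pod = leaf_switches * fbb_switches * links_leaf_to_fbb    # 72 links inside the pod
    uplinks = fbb_switches * spine_switches * links_fbb_to_spine    # 24 links toward the spines

    # 34 nodes x 2 HCAs = 68 node links <= 72, so traffic inside a pod is non-blocking,
    # while inter-pod traffic sees 24/72 = 1/3 of full bisection bandwidth.
    print(intra_pod, uplinks, uplinks / intra_pod)   # 72 24 0.333...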
15. Hierarchical Storage Tiers
• Local storage (used as burst buffers)
  – 1.6 TB NVMe SSD (Intel DC P4600, U.2) per node
  – Local storage aggregation w/ BeeOND (see the staging sketch below)
• Parallel filesystem
  – 22 PB of GPFS
    • DDN SFA14K (w/ SS8462 enclosure x10) x3 sets
    • Bare-metal NSD servers and flash-based metadata volumes to accelerate metadata operations
  – Home and shared use
• Object storage (used as campaign storage)
  – Part of GPFS, exposed via OpenStack Swift
  – S3-like API access, globally shared use
  – Additional secure volumes w/ encryption (planned)
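To illustrate how the local NVMe tier can act as a burst buffer, here is a hedged Python sketch that stages a training dataset from the shared GPFS tier onto node-local SSD before training, so the small random reads described earlier hit local flash instead of the parallel filesystem. The /gpfs and /local paths are hypothetical placeholders, not actual ABCI mount points, and on ABCI the same role can also be filled by BeeOND aggregation across nodes.

    # Hedged sketch: stage a dataset from shared GPFS to node-local NVMe SSD.
    # Paths are hypothetical placeholders for illustration only.
    import os
    import shutil

    SHARED_DATASET = "/gpfs/datasets/imagenet-1k/train"   # hypothetical GPFS path
    LOCAL_SCRATCH = "/local/scratch/imagenet-1k/train"    # hypothetical NVMe path

    def stage_in(src=SHARED_DATASET, dst=LOCAL_SCRATCH):
        """Copy the dataset to local storage once per node (skip if already staged)."""
        if not os.path.isdir(dst):
            shutil.copytree(src, dst)
        return dst

    if __name__ == "__main__":
        local_path = stage_in()
        print("training data staged at", local_path)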
16. • Environments
  – ABCI, 64 nodes (256 GPUs)
  – Framework: ChainerMN v1.3.0
    • Chainer 4.2.0, CuPy 4.2.3, mpi4py 3.0.0, Python 3.6.5
  – Bare metal
    • CentOS 7.4, gcc 4.8.5, CUDA 9.2, CuDNN 7.1.4, NCCL 2.2, Open MPI 2.1.3
• Settings
  – Dataset: ImageNet-1K
  – Model: ResNet-50
  – Training:
    • Batch size: 32 per GPU, 32 x 256 in total
    • Learning rate: start at 0.1, x0.1 at epochs 30, 60, and 80, w/ warm-up scheduling (see the sketch below)
    • Optimization: momentum SGD (momentum = 0.9)
    • Weight decay: 0.0001
    • Training epochs: 100
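A minimal Python sketch of the learning-rate schedule described above: base rate 0.1, multiplied by 0.1 at epochs 30, 60, and 80, with a linear warm-up at the start. The 5-epoch warm-up length is an assumption, since the slide only says "warm-up scheduling".

    # Step-decay learning-rate schedule with linear warm-up (warm-up length assumed).
    def learning_rate(epoch, base_lr=0.1, warmup_epochs=5):
        if epoch < warmup_epochs:                      # linear warm-up phase
            return base_lr * (epoch + 1) / warmup_epochs
        lr = base_lr
        for boundary in (30, 60, 80):                  # decay points from the slide
            if epoch >= boundary:
                lr *= 0.1                              # x0.1 at each boundary
        return lr

    # e.g. epoch 0 -> 0.02, epoch 29 -> 0.1, epoch 30 -> 0.01, epoch 80 -> 0.0001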
17. Data Center Facility
• Server room: 19 m x 24 m x 6 m
• 72 racks w/ hybrid (warm-water + air) cooling, 18 racks w/ UPS
• High-voltage transformers (3.25 MW)
• Passive cooling tower (free cooling), cooling capacity 3 MW
• Active chillers, cooling capacity 200 kW
• Lithium battery (planned): 1 MWh, 1 MVA
• Future expansion space
[Figure: rack layout plan with 32 compute racks, IB switch racks, storage racks, and multi-platform racks; the IB switch racks are placed in the UPS zone]
18. Server Room Cooling
• Skeleton frame structure for computer racks, water circuit, air-cooling units, and cables
• 70 kW/rack cooling capacity: 60 kW liquid (32°C water) + 10 kW air
[Figure: 19- or 23-inch 48U racks; water blocks on CPUs and/or accelerators fed by a 32°C cold-water circuit returning at 40°C via the hot-water circuit; CDU; fan coil units (cooling capability ~10 kW each); 35°C cold aisle and 40°C hot aisle separated by capping walls]
19. Software Stack for ABCI
• Batch Job Scheduler
– High throughput computing
• Minimum Pre-installed Software
– Users can deploy their environments
using anaconda, pip, python venv, etc.
– Reduce operational cost
• Container Support
– Singularity for multi-node jobs w/
user customized images
– Docker for single-node jobs w/
site-certified images
[ABCI software stack]
• User applications, DL frameworks
• Languages: Python, Ruby, R, Java, Scala, Perl, Lua, etc.
• ISV apps, OSS, Hadoop/Spark
• Compilers & MPI: GCC, PGI, Intel Parallel Studio XE Cluster Edition, OpenMPI, MVAPICH2
• CUDA/CuDNN/NCCL
• Storage: GPFS, BeeOND, OpenStack Swift
• Job scheduler: Univa Grid Engine
• Containers: Singularity, Docker
• OS: CentOS/RHEL
20. • Singularity
– HPC Linux Container developed at LBNL and Sylabs Inc.
• http://singularity.lbl.gov/
• https://github.com/singularityware/singularity
– Rootless program execution and storage access
  • No daemons running w/ root privileges (unlike Docker)
– Computing resources are managed by job schedulers such as UGE
– Root privileges are granted only through setuid commands
  • Mount, Linux namespace creation, bind paths between host and container
– Existing Docker images are available via Docker Hub
– HPC software stacks are available
  • CUDA, MPI, InfiniBand, etc.
21. How to Create Container Images for Distributed DL
• Host
  – Mount native device drivers and system libraries into containers
    • GPU, InfiniBand, etc.
    • cf. nvidia-docker
• Containers
  – Install user-space libraries
    • CUDA, CuDNN, NCCL2, ibverbs, MPI, etc.
  – Install distributed deep learning frameworks
    • Builds optimized for the underlying computing resources
[Diagram]
• Base drivers & libraries on the host: CUDA drivers, InfiniBand drivers, filesystem libraries (GPFS, Lustre), mounted into the container
• Userland libraries in the container: CUDA, CuDNN, NCCL2, ibverbs, MPI (mpi4py)
• Distributed deep learning frameworks: Caffe2, ChainerMN, Distributed TensorFlow, MXNet
22. AICloudModules: Container Image Collection for AI
• Base Images
  – Optimized OS images (CentOS, Ubuntu) for the AI Cloud
    • Include CUDA, CuDNN, NCCL, MPI, etc. libraries
    • Exclude the native host drivers (CUDA, InfiniBand, GPFS, etc.)
• DL Images
  – Deep learning frameworks and AI applications installed on top of the base OS images
    • Caffe2, ChainerMN, Distributed TensorFlow, MXNet, CNTK
• Prototype published at https://hub.docker.com/r/aistairc/aaic/
23. • Accommodate open ecosystems of data and software for AI R&D
– Sharable/Reusable/Reproducible algorithm methods,
training data w/ labels, trained models, etc.
• Creating various social applications that use AI (ML, DL, etc.)
[Workflow]
• Prepare training datasets: data collection and cleansing from public data on the Internet and open data in enterprises and government
• Modeling & learning: trial and error for designing networks and parameters in a deep learning framework; accelerate learning w/ HPC to improve model accuracy
• Inference: publish trained models to the public so they can be applied to various domains
• Reuse trained models to implement AI technologies in society, e.g., object detection and tracking for retail stores, factories, and logistics centers
24. Summary
• Issues for Public AI Infrastructures
– Accelerating AI R&D
– Bridging AI Tech through Common Platform
– Cloud-Datacenter-ready Infrastructure, but Supercomputing
• ABCI (AI Bridging Cloud Infrastructure)
– 0.550 EFlops (AI), 37.2 PFlops (DP); 19.88 PFlops Linpack, ranked #5 on the June 2018 Top500
– Open, public, dedicated infrastructure for AI/Big Data
– A reference design for AI/Big Data and HPC systems