CuPy: A NumPy-compatible Library for GPU
Shohei Hido
VP of Research
Preferred Networks
Preferred Networks: An AI Startup in Japan
● Founded: March 2014 (120 engineers and researchers)
● Major news
○ $100+M investment from Toyota for autonomous driving
○ 2nd place at Amazon Robotics Challenge 2016
○ Fastest ImageNet training on GPU cluster (15 minutes using 1,024 GPUs)
● Deep learning research and industrial applications: manufacturing, automotive, healthcare
Key takeaways
● CuPy is an open-source, NumPy-compatible library for NVIDIA GPUs
● Python users can easily write CPU/GPU-agnostic code
● Existing NumPy code can be accelerated by the GPU and CUDA libraries
● What is CuPy
● Example: CPU/GPU agnostic implementation of k-means
● Introduction to CuPy
● Recent updates & conclusion
CuPy: A NumPy-Compatible Library for NVIDIA GPU
● NumPy is used extensively in Python but does not support GPUs
● GPUs are getting faster and more important for scientific computing
# CPU (NumPy)
import numpy as np
x_cpu = np.random.rand(10)
W_cpu = np.random.rand(10, 5)
y_cpu = np.dot(x_cpu, W_cpu)

# NVIDIA GPU (CuPy)
import cupy as cp
x_gpu = cp.random.rand(10)
W_gpu = cp.random.rand(10, 5)
y_gpu = cp.dot(x_gpu, W_gpu)

# Transfer between CPU and GPU
y_gpu = cp.asarray(y_cpu)  # CPU -> GPU
y_cpu = cp.asnumpy(y_gpu)  # GPU -> CPU

# CPU/GPU-agnostic
for xp in [np, cp]:
    x = xp.random.rand(10)
    W = xp.random.rand(10, 5)
    y = xp.dot(x, W)
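The agnostic loop can also be wrapped in a helper that picks the array module at run time. The sketch below is only an illustration, not code from the slides: pick_array_module and affine are hypothetical names, and falling back to NumPy when CuPy or a GPU is unavailable is an assumption made so the snippet runs on any machine.

```python
import numpy as np

try:
    import cupy as cp
    # Importing can succeed on a machine without a usable GPU, so probe for one.
    _HAS_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy missing, or no CUDA device/driver
    cp = None
    _HAS_GPU = False


def pick_array_module(prefer_gpu=True):
    """Hypothetical helper: return cupy when a GPU is usable, else numpy."""
    return cp if (prefer_gpu and _HAS_GPU) else np


def affine(xp, n_in=10, n_out=5):
    """Same computation as on the slide, written once against the module `xp`."""
    x = xp.random.rand(n_in)
    W = xp.random.rand(n_in, n_out)
    return xp.dot(x, W)


xp = pick_array_module()
y = affine(xp)
# Copy back to host memory only if the result lives on the GPU.
y_host = cp.asnumpy(y) if xp is not np else y
print(type(y_host).__name__, y_host.shape)
```

When the input arrays already exist, cupy.get_array_module(x) serves the same purpose by returning the module that matches the array type.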
CuPy is actively developed (1,600+ GitHub stars, 11,000+ commits)
Ryosuke Okuta, CTO, Preferred Networks
Python libraries powered by CuPy
● Chainer: deep learning framework (https://chainer.org/)
● pomegranate: probabilistic and graphical modeling (https://github.com/jmschrei/pomegranate)
● spaCy: natural language processing (https://spacy.io/)
Reputation (1/2): Travis Oliphant, creator of NumPy and SciPy
Reputation (2/2): Stephen Merity of Salesforce Research (MetaMind)
Our mission: make CuPy the default tool for GPU computation in Python
https://anaconda.org/anaconda/cupy/
● CuPy is now available on Anaconda in collaboration with the Anaconda team
● You can install CuPy with "$ conda install cupy" on 64-bit Linux
● We are working on a Windows version
Don’t have a GPU for CuPy? Google Colaboratory gives you one (for free!)
…
Example: CPU/GPU agnostic implementation of k-means
Implementation of CPU/GPU agnostic k-means fit(): 37 lines
https://github.com/cupy/cupy/blob/master/examples/kmeans/kmeans.py
K-means (1/3): Call function and initialization
● fit() follows the training API of scikit-learn
● xp represents either numpy or cupy (the caller specifies NumPy or CuPy)
● Cluster centers are initialized to the positions of randomly chosen samples
K-means (2/3): Calculate distance to all of the cluster centers
● xp.linalg.norm computes the distances and is supported in both numpy and cupy
● _fit_calc_distances() uses a custom kernel on cupy
Customized kernel with C++ snippet in cupy.ElementwiseKernel
● A kernel is generated from an element-wise operation defined in a C++ snippet
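As a hedged illustration of the idea, the sketch below defines a small cupy.ElementwiseKernel that computes a squared difference. It is not the kernel used in the k-means example; the names squared_diff_reference and make_squared_diff_kernel are hypothetical, and the GPU path is only exercised when CuPy and a CUDA device are available (otherwise only the NumPy reference runs).

```python
import numpy as np

try:
    import cupy as cp
    _HAS_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy missing, or no CUDA device/driver
    cp = None
    _HAS_GPU = False


def squared_diff_reference(x, y):
    """NumPy reference for the element-wise operation."""
    return (x - y) * (x - y)


def make_squared_diff_kernel():
    """Build the GPU kernel; the C++ body is applied to each element."""
    return cp.ElementwiseKernel(
        'float32 x, float32 y',   # input arguments
        'float32 z',              # output argument
        'z = (x - y) * (x - y)',  # C++ snippet
        'squared_diff')           # kernel name


x = np.arange(6, dtype=np.float32)
y = np.full(6, 2, dtype=np.float32)
expected = squared_diff_reference(x, y)

if _HAS_GPU:
    kernel = make_squared_diff_kernel()
    z = kernel(cp.asarray(x), cp.asarray(y))
    # The GPU result should match the NumPy reference.
    assert np.allclose(cp.asnumpy(z), expected)
```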
K-means (3/3): Update positions of cluster centers
● xp.stack builds the updated cluster centers and is supported in both numpy and cupy
● _fit_calc_center() is also based on a custom kernel
Another element-wise kernel
● It simply sums the points in each cluster and counts them
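The three k-means steps (initialization from random samples, distances via xp.linalg.norm, center update via xp.stack) can be sketched as follows. This is a simplified illustration under stated assumptions, not the 37-line fit() from the repository: the xp and seed parameters are hypothetical additions, the custom-kernel path is omitted, and the demo runs on NumPy only.

```python
import numpy as np


def fit(X, n_clusters, max_iter, xp=np, seed=0):
    """Sketch of k-means; `xp` is numpy here and could be cupy for GPU arrays."""
    # 1) Initialization: use randomly chosen samples as cluster centers.
    rng = np.random.RandomState(seed)
    initial = rng.choice(len(X), n_clusters, replace=False)
    centers = X[xp.asarray(initial)]
    pred = None
    for _ in range(max_iter):
        # 2) Distance from every sample to every center: (n_samples, n_clusters).
        distances = xp.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_pred = xp.argmin(distances, axis=1)
        if pred is not None and bool(xp.all(new_pred == pred)):
            break  # assignments stopped changing
        pred = new_pred
        # 3) Update: each center becomes the mean of the samples assigned to it.
        centers = xp.stack([X[pred == k].mean(axis=0) for k in range(n_clusters)])
    return centers, pred


# Two well-separated groups of five 2-D points each.
offsets = np.arange(5) * 0.1
blob_a = np.stack([offsets, np.zeros(5)], axis=1)
blob_b = np.stack([10 + offsets, np.full(5, 10.0)], axis=1)
X = np.concatenate([blob_a, blob_b])

centers, labels = fit(X, n_clusters=2, max_iter=10)
print(centers)
print(labels)
```

The sketch does not handle a cluster that becomes empty; a production implementation would need to re-seed such a center.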
Introduction to CuPy
Performance comparison with NumPy
● CuPy is faster than NumPy even for simple manipulation of a large matrix; for small sizes NumPy is faster
[Benchmark code: not included in the text]
Benchmark result (6x faster at size 10^8)
Size    CuPy [ms]   NumPy [ms]
10^4    0.58        0.03
10^5    0.97        0.20
10^6    1.84        2.00
10^7    12.48       55.55
10^8    84.73       517.17
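The "6x faster" figure follows directly from the table. The short calculation below uses only the numbers reported on the slide and divides the NumPy time by the CuPy time for each size; a ratio below 1 means NumPy was faster.

```python
# Timings copied from the benchmark table (milliseconds).
sizes = ["10^4", "10^5", "10^6", "10^7", "10^8"]
cupy_ms = [0.58, 0.97, 1.84, 12.48, 84.73]
numpy_ms = [0.03, 0.20, 2.00, 55.55, 517.17]

# Speedup of CuPy over NumPy for each size.
speedups = [n / c for n, c in zip(numpy_ms, cupy_ms)]
for size, s in zip(sizes, speedups):
    print(f"{size}: NumPy time / CuPy time = {s:.2f}")
```

In this benchmark the crossover is around size 10^6, where the two libraries take roughly the same time.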
Compatibility with NumPy
● Data types (dtypes)
○ bool_, int8, int16, int32, int64, uint8, uint16, uint32, uint64, float16, float32, float64, complex64, and complex128
● All basic indexing
○ indexing by ints, slices, newaxes, and Ellipsis
● Most of advanced indexing
○ except indexing patterns with boolean masks
● Most of the array creation routines
○ empty, ones_like, diag, etc.
● Most of the array manipulation routines
○ reshape, rollaxis, concatenate, etc.
● All operators with broadcasting
● All universal functions for element-wise operations
○ except those for complex numbers
● Linear algebra functions accelerated by cuBLAS
○ including product: dot, matmul, etc.
○ including decomposition: cholesky, svd, etc.
● Reduction along axes
○ sum, max, argmax, etc.
● Sort operations implemented by Thrust
○ sort, argsort, and lexsort
● Sparse matrix accelerated by cuSPARSE
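A few of the listed features can be exercised with one function written against a module argument. The sketch below is illustrative only: demo is a hypothetical name, and it is run here with NumPy; passing cupy instead is expected to give the same shapes and values on a machine with a GPU.

```python
import numpy as np


def demo(xp):
    """Exercise indexing, broadcasting, reduction, and sorting via module `xp`."""
    a = xp.arange(12, dtype=xp.float32).reshape(3, 4)
    basic = a[1:, ::2]               # basic indexing with slices
    expanded = a[None, ..., 0]       # newaxis (None) and Ellipsis
    centered = a - a.mean(axis=0)    # operator with broadcasting
    row_sums = a.sum(axis=1)         # reduction along an axis
    order = xp.argsort(-row_sums)    # sort operation (descending row sums)
    return basic.shape, expanded.shape, centered.shape, row_sums, order


basic_shape, expanded_shape, centered_shape, row_sums, order = demo(np)
print(basic_shape, expanded_shape, centered_shape)
print(row_sums, order)
```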
Comparison with other Python libraries for/on CUDA
● CuPy is the only library designed for high compatibility with NumPy while still allowing users to write customized CUDA kernels for better performance

                              CuPy   PyCUDA   MinPy*
NVIDIA CUDA support           ✔      ✔        ✔
CPU/GPU agnostic coding       ✔      -        ✔
Automatic gradient support    **     -        ✔
NumPy compatible interface    ✔      -        ✔
User-defined CUDA kernel      ✔      ✔        -

* https://github.com/dmlc/minpy
** Autograd is supported by Chainer
Inside CuPy
● CuPy relies extensively on NVIDIA libraries for better performance
● Layers, top to bottom: CuPy, functional areas and their libraries, CUDA, NVIDIA GPU
○ DNN: cuDNN
○ Linear algebra: cuBLAS, cuSOLVER
○ Random numbers: cuRAND
○ Sparse matrix: cuSPARSE
○ Multi-GPU data transfer: NCCL
○ Sort: Thrust
○ User-defined CUDA kernel
○ Utility
Looks very easy?
● CUDA and its libraries are not designed for Python or NumPy
━ CuPy is not just a wrapper of CUDA libraries for Python
━ CuPy is a fast numerical computation library on GPU with a NumPy-compatible API
● NumPy’s specification is not documented
━ We have carefully investigated some unexpected behaviors of NumPy
━ CuPy tries to replicate NumPy’s behavior as much as possible
● NumPy’s behavior varies between versions
━ e.g., NumPy v1.14 changed the output format of __str__
• `[ 0. 1.]` -> `[0. 1.]` (no leading space)
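The formatting change can be observed from a single NumPy installation. The sketch below assumes NumPy 1.15 or later (for the np.printoptions context manager) and uses the legacy="1.13" print option to emulate the older output; it only illustrates the kind of version-dependent behavior a compatible library has to track.

```python
import numpy as np

a = np.array([0.0, 1.0])

# Current formatting (NumPy 1.14 and later).
current = str(a)

# Emulate the pre-1.14 formatting through the `legacy` print option.
with np.printoptions(legacy="1.13"):
    legacy = str(a)

print(repr(current))
print(repr(legacy))
```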
Advanced features of CuPy (1/2)
Memory pool
● Avoiding cudaMalloc is a common practice in CUDA programming
● CuPy supports memory pooling using the Best-Fit with Coalescing (BFC) algorithm
● It reduces memory usage to 25% on a seq2seq model
GPU memory profiler
● This enables function-wise memory profiling on Chainer
Function name     Used bytes   Acquired bytes   Occurrence
LinearFunction    5.16GB       0.18GB           3900
ReLU              0.99GB       0.46GB           1300
SoftMaxEntropy    7.71MB       5.08MB           1300
Accuracy          0.62MB       0.35MB           700
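The pool can be inspected from Python. The sketch below assumes a recent CuPy release that provides cupy.get_default_memory_pool() with used_bytes(), total_bytes(), and free_all_blocks(); pool_report is a hypothetical helper, and it returns None when CuPy or a GPU is unavailable so the snippet still runs on a CPU-only machine.

```python
try:
    import cupy as cp
    _HAS_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy missing, or no CUDA device/driver
    cp = None
    _HAS_GPU = False


def pool_report():
    """Return pool statistics while an array is alive and after it is freed."""
    if not _HAS_GPU:
        return None
    pool = cp.get_default_memory_pool()
    x = cp.zeros((1024, 1024), dtype=cp.float32)  # 4 MiB taken from the pool
    in_use = (pool.used_bytes(), pool.total_bytes())
    del x  # the block goes back to the pool, not to the CUDA driver
    cached = (pool.used_bytes(), pool.total_bytes())
    pool.free_all_blocks()  # hand cached blocks back to the GPU
    return in_use, cached


report = pool_report()
print(report)
```

On a GPU machine, used bytes should drop after the array is deleted while total bytes stay the same until free_all_blocks() is called, which is the effect of reusing memory instead of calling cudaMalloc again.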
Advanced features of CuPy (2/2)
Kernel fusion (experimental)
import cupy as cp

@cp.fuse()
def fused_func(x, y, z):
    return (x * y) + z
● With the @cp.fuse() decorator, CuPy records the series of operations in the function
● It then compiles a single kernel to execute those operations
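A hedged usage sketch of the fused function follows. The wrapper logic is an assumption added for portability, not part of the slides: cp.fuse() is applied only when CuPy and a GPU are available, and otherwise the plain Python function is used with NumPy so that the result can be checked anywhere. The name multiply_add is hypothetical.

```python
import numpy as np

try:
    import cupy as cp
    _HAS_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy missing, or no CUDA device/driver
    cp = None
    _HAS_GPU = False


def multiply_add(x, y, z):
    """Two element-wise operations that fusion can merge into one kernel."""
    return (x * y) + z


# Apply the decorator only when a GPU is usable; otherwise keep the plain function.
fused_func = cp.fuse()(multiply_add) if _HAS_GPU else multiply_add

xp = cp if _HAS_GPU else np
x = xp.arange(4, dtype=xp.float32)
out = fused_func(x, x, x)  # x * x + x
result = cp.asnumpy(out) if _HAS_GPU else out
print(result)
```

Without fusion, x * y and the following addition would each launch a separate kernel and materialize an intermediate array.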
Recent updates & conclusion
What’s new in CuPy v4?
• Pre-built wheel packages of CuPy are now provided
– cupy-cuda80, cupy-cuda90, and cupy-cuda91
– $ pip install cupy-cuda80
• Memory pool is now the default allocator
– Added line memory profiler using memory hook and traceback
• CUDA stream is fully supported

import cupy
x = cupy.random.rand(10)

# Option 1: use the stream as a context manager
stream = cupy.cuda.stream.Stream()
with stream:
    y = cupy.linalg.norm(x)
stream.synchronize()

# Option 2: make the stream current with use()
stream = cupy.cuda.stream.Stream()
stream.use()
y = cupy.linalg.norm(x)
stream.synchronize()
Newly added functions in v4
cupy.argpartition
cupy.unravel_index
cupy.percentile
cupy.moveaxis
cupy.blackman
cupy.hamming
cupy.hanning
cupy.isclose
cupy.iscomplex
cupy.iscomplexobj
cupy.isfortran
cupy.isreal
cupy.isrealobj
cupy.linalg.tensorinv
cupy.random.shuffle
cupy.random.set_random_state
cupy.random.RandomState.tomaxint
cupy.sparse.random
cupy.sparse.csr_matrix.eliminate_zeros
cupy.sparse.coo_matrix.eliminate_zeros
cupy.sparse.csc_matrix.eliminate_zeros
cupyx.scatter_add
cupy.fft
Standard FFTs:
fft, ifft, fft2, ifft2, fftn, ifftn
Real FFTs:
rfft, irfft, rfft2, irfft2, rfftn, irfftn
Hermitian FFTs:
hfft, ihfft
Helper routines:
fftfreq, rfftfreq, fftshift, ifftshift
CuPy v5 - planned features
• Windows support
• AMD GPU support via HIP
• More useful fusion function
• Add more functions (NumPy, SciPy)
• Add more probability distributions
• Provide simple CUDA kernel
• Support DLPack and Tensor Comprehensions
– toDLPack() and fromDLPack()

@cupy.fuse()
def sample2(x, y):
    return cupy.sum(x + y, axis=0) * 2
Summary: CuPy is a drop-in replacement for NumPy on GPU
1. Highly compatible with NumPy
━ data types, indexing, broadcasting, operations
━ Users can write CPU/GPU-agnostic code
2. High performance on NVIDIA GPUs
━ cuBLAS, cuDNN, cuRAND, cuSPARSE, and NCCL
3. Easy to install
━ $ pip install cupy
━ $ conda install cupy
4. Easy to write custom kernel
━ ElementwiseKernel, ReductionKernel

# CPU (NumPy)
import numpy as np
x = np.random.rand(10)
W = np.random.rand(10, 5)
y = np.dot(x, W)

# to GPU ↓ / ↑ to CPU

# GPU (CuPy)
import cupy as cp
x = cp.random.rand(10)
W = cp.random.rand(10, 5)
y = cp.dot(x, W)

Your contribution will be highly appreciated & We are hiring!
