Performance Tools for
Computer Vision Applications
2018/12/15 コンピュータビジョン勉強会@関東 (Kanto Computer Vision Study Group)
Performance Tools for CV - Agenda
● NVIDIA GPU Profiler
○ nvprof
○ nvvp
○ NVIDIA Nsight Systems
● TensorFlow / Keras
○ tf.timeline
● Others
○ perf, gperftools, ...
○ cProfile, yep, ...
This talk focuses on the GPU tools.
What is Profiling?
● Measuring the performance of an application
● Simple Profiling
○ Measure the processing time of each part (e.g. by inserting timers)
● Advanced Profiling
○ Analyze why a given part is slow (requires dedicated tools)
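As an illustration of the "inserting timers" approach, here is a minimal sketch in Python. The `timed` helper and the `preprocess` / `infer` functions are hypothetical stand-ins, not part of the original slides.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Accumulate the wall-clock time of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = results.get(label, 0.0) + (time.perf_counter() - start)


def preprocess(n):
    # Hypothetical stand-in for an image preprocessing step.
    return [i * 0.5 for i in range(n)]


def infer(xs):
    # Hypothetical stand-in for a model forward pass.
    return sum(x * x for x in xs)


timings = {}
with timed("preprocess", timings):
    data = preprocess(100_000)
with timed("infer", timings):
    score = infer(data)

for label, seconds in timings.items():
    print(f"{label}: {seconds * 1e3:.3f} ms")
```

Note that CUDA kernel launches are asynchronous, so a host-side timer like this only measures GPU work correctly if the device is synchronized before the timer stops.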
nvprof
● Command-line profiler
○ /usr/local/cuda/bin/nvprof
● Usage
$ nvprof [nvprof-options] <app> [arguments]
==17126== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 28.09% 9.6689ms 32 302.15us 231.69us 544.94us maxwell_scudnn_winograd_128x128_ldg1_ldg4_tile148n_nt
8.13% 2.7997ms 34 82.344us 42.145us 121.09us maxwell_scudnn_128x128_relu_interior_nn
7.46% 2.5673ms 72 35.657us 4.6720us 569.87us normalize_kernel(int, float*, float*, float*, int, int, int)
7.01% 2.4122ms 104 23.194us 1.7600us 243.82us copy_kernel(int, float*, int, int, float*, int, int)
6.96% 2.3956ms 116 20.651us 1.1840us 242.98us activate_array_kernel(float*, int, ACTIVATION)
6.46% 2.2237ms 75 29.649us 3.0400us 301.03us add_bias_kernel(float*, float*, int, int, int)
6.24% 2.1489ms 72 29.846us 3.3600us 299.91us scale_bias_kernel(float*, float*, int, int)
5.98% 2.0587ms 184 11.188us 1.4080us 112.64us fill_kernel(int, float, float*, int)
5.87% 2.0187ms 3 672.90us 28.960us 1.8676ms [CUDA memcpy DtoH]
5.16% 1.7760ms 4 444.00us 414.16us 516.91us maxwell_scudnn_128x128_relu_small_nn
5.10% 1.7553ms 32 54.854us 7.2320us 163.24us void cudnn::winograd::generateWinogradTilesKernel<int=0, float, float>(cudnn::winograd::GenerateWinogradTilesParams<float, float>)
Four profiling modes
● summary mode (default)
$ nvprof <application>
● trace mode
$ nvprof --print-gpu-trace --print-api-trace
○ --print-api-trace traces CUDA Runtime API and Driver API calls
● event/metric summary mode
$ nvprof --events <event-name> --metrics <metric-name>
● event/metric trace mode
$ nvprof --aggregate-mode off [event|metric]
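The four modes differ only in the flags passed to nvprof, so they can be scripted. The sketch below builds the argument list for each mode using only the flags shown on this slide. The `nvprof_command` helper, the mode names, and the application path are hypothetical.

```python
import shlex


def nvprof_command(app, mode="summary", events=(), metrics=()):
    """Build an nvprof argument list for one of the four profiling modes."""
    cmd = ["nvprof"]
    if mode == "summary":
        pass  # default mode, no extra flags
    elif mode == "trace":
        cmd += ["--print-gpu-trace", "--print-api-trace"]
    elif mode in ("event_metric_summary", "event_metric_trace"):
        if not (events or metrics):
            raise ValueError("event/metric modes need at least one event or metric")
        if mode == "event_metric_trace":
            cmd += ["--aggregate-mode", "off"]
        if events:
            cmd += ["--events", ",".join(events)]
        if metrics:
            cmd += ["--metrics", ",".join(metrics)]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return cmd + list(app)


app = ["./my_app", "--input", "image.png"]  # hypothetical application
for mode in ("summary", "trace"):
    print(shlex.join(nvprof_command(app, mode)))
print(shlex.join(nvprof_command(
    app, "event_metric_summary", metrics=["tensor_precision_fu_utilization"])))
```

The returned list can be handed to `subprocess.run` on a machine where nvprof is installed.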
● summary mode (default)
==17126== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 28.09% 9.6689ms 32 302.15us 231.69us 544.94us maxwell_scudnn_winograd_128x128_ldg1_ldg4_tile148n_nt
8.13% 2.7997ms 34 82.344us 42.145us 121.09us maxwell_scudnn_128x128_relu_interior_nn
7.46% 2.5673ms 72 35.657us 4.6720us 569.87us normalize_kernel(int, float*, float*, float*, int, int, int)
7.01% 2.4122ms 104 23.194us 1.7600us 243.82us copy_kernel(int, float*, int, int, float*, int, int)
6.96% 2.3956ms 116 20.651us 1.1840us 242.98us activate_array_kernel(float*, int, ACTIVATION)
6.46% 2.2237ms 75 29.649us 3.0400us 301.03us add_bias_kernel(float*, float*, int, int, int)
6.24% 2.1489ms 72 29.846us 3.3600us 299.91us scale_bias_kernel(float*, float*, int, int)
5.98% 2.0587ms 184 11.188us 1.4080us 112.64us fill_kernel(int, float, float*, int)
5.87% 2.0187ms 3 672.90us 28.960us 1.8676ms [CUDA memcpy DtoH]
5.16% 1.7760ms 4 444.00us 414.16us 516.91us maxwell_scudnn_128x128_relu_small_nn
5.10% 1.7553ms 32 54.854us 7.2320us 163.24us void cudnn::winograd::generateWinogradTilesKernel<int=0, float, float>(cudnn::winograd::GenerateWinogradTilesParams<float, float>)
4.00% 1.3780ms 23 59.912us 18.272us 250.79us shortcut_kernel(int, int, int, int, int, int, int, int, int, int, float*, int, int, int, float, float, float*)
1.36% 467.82us 1 467.82us 467.82us 467.82us maxwell_scudnn_128x64_relu_small_nn
0.86% 296.90us 1 296.90us 296.90us 296.90us maxwell_scudnn_128x32_relu_small_nn
0.48% 165.99us 2 82.994us 82.882us 83.106us maxwell_scudnn_128x64_relu_interior_nn
0.39% 134.98us 1 134.98us 134.98us 134.98us maxwell_scudnn_128x32_relu_interior_nn
0.26% 89.508us 43 2.0810us 1.7280us 9.3440us
0.17% 58.018us 2 29.009us 19.809us 38.209us upsample_kernel(unsigned long, float*, int, int, int, int, int, int, float, float*)
API calls: 90.08% 285.51ms 798 357.78us 3.4690us 282.40ms cudaLaunch
9.68% 30.674ms 3 10.225ms 1.6737ms 24.491ms cudaMemcpy
0.11% 363.03us 3540 102ns 86ns 1.5170us cudaSetupArgument
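Summary rows mix units (ns, us, ms), which makes them awkward to compare by eye. The following sketch converts rows like the ones above to milliseconds. The helper names are hypothetical, and the row layout is assumed to match the `Time(%) Time Calls Avg Min Max Name` header shown here.

```python
import re

UNIT_TO_MS = {"ns": 1e-6, "us": 1e-3, "ms": 1.0, "s": 1e3}


def to_ms(text):
    """Convert an nvprof duration string such as '357.78us' to milliseconds."""
    m = re.fullmatch(r"([0-9.]+)(ns|us|ms|s)", text)
    if m is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return float(m.group(1)) * UNIT_TO_MS[m.group(2)]


def parse_row(row):
    """Parse one row laid out as 'Time(%) Time Calls Avg Min Max Name'."""
    percent, total, calls, avg, lo, hi, name = row.split(None, 6)
    return {
        "percent": float(percent.rstrip("%")),
        "total_ms": to_ms(total),
        "calls": int(calls),
        "avg_ms": to_ms(avg),
        "min_ms": to_ms(lo),
        "max_ms": to_ms(hi),
        "name": name,
    }


# Two "API calls" rows copied from the summary above.
rows = [
    "90.08% 285.51ms 798 357.78us 3.4690us 282.40ms cudaLaunch",
    "9.68% 30.674ms 3 10.225ms 1.6737ms 24.491ms cudaMemcpy",
]
for parsed in map(parse_row, rows):
    share = parsed["max_ms"] / parsed["total_ms"]
    print(f"{parsed['name']}: slowest call accounts for {share:.1%} of total time")
```

For cudaLaunch, a single call accounts for nearly all of the total time, which is the kind of outlier the Min / Max columns help spot.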
Example: checking Tensor Core utilization
● A Tensor Core executes a 4x4 matrix multiply in one cycle
○ Available on the Volta architecture
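To make the "4x4 multiply" concrete: the Tensor Core operation is usually described as a fused multiply-accumulate D = A x B + C on 4x4 matrices. The sketch below spells out that arithmetic in plain Python. It ignores the mixed-precision details of the hardware, and `mma_4x4` is a hypothetical name.

```python
def mma_4x4(a, b, c):
    """Return D = A x B + C for 4x4 matrices given as nested lists."""
    n = 4
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) + c[i][j] for j in range(n)]
        for i in range(n)
    ]


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
a = [[float(4 * i + j) for j in range(4)] for i in range(4)]
ones = [[1.0] * 4 for _ in range(4)]

d = mma_4x4(a, identity, ones)
# Each of the 16 outputs needs 4 multiply-adds, i.e. 4 * 4 * 4 = 64 in total.
print(d[0])
```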
Listing the available metrics
● --query-metrics
$ nvprof --query-metrics
Available Metrics:
Name Description
Device 0 (TITAN V):
shared_load_transactions_per_request: Average number of shared memory load transactions performed for each shared memory load
shared_store_transactions_per_request: Average number of shared memory store transactions performed for each shared memory store
local_load_transactions_per_request: Average number of local memory load transactions performed for each local memory load
local_store_transactions_per_request: Average number of local memory store transactions performed for each local memory store
half_precision_fu_utilization: The utilization level of the multiprocessor function units that execute 16 bit floating-point instructions on a scale of 0 to 10. Note that this doesn't specify the utilization level of tensor core unit
tensor_precision_fu_utilization: The utilization level of the multiprocessor function units that execute tensor core instructions on a scale of 0 to 10
↑ tensor_precision_fu_utilization is the Tensor Core metric!
● Run with the metric specified
$ nvprof --metrics tensor_precision_fu_utilization <application>
Invocations Metric Name Metric Description Min Max Avg
Device "TITAN V (0)"
Kernel: volta_s884cudnn_fp16_128x128_ldg8_relu_exp_interior_nhwc_tn_v1
3 tensor_precision_fu_utilization Tensor-Precision Function Unit Utilization Mid (4) Mid (5) Mid (4)
Kernel: volta_fp16_s884cudnn_fp16_128x128_ldg8_relu_f2f_exp_small_nhwc2nchw_tn_v1
27 tensor_precision_fu_utilization Tensor-Precision Function Unit Utilization Mid (4) High (7) Mid (6)
Kernel: volta_fp16_s884cudnn_fp16_128x128_ldg8_relu_f2f_exp_interior_nhwc2nchw_tn_v1
20 tensor_precision_fu_utilization Tensor-Precision Function Unit Utilization Mid (4) Mid (6) Mid (5)
Kernel: volta_fp16_s884cudnn_fp16_256x128_ldg8_relu_filter1x1_stg8_interior_nchw_nn_v1
14 tensor_precision_fu_utilization Tensor-Precision Function Unit Utilization Low (2) Mid (5) Mid (4)
Kernel: volta_fp16_s884cudnn_fp16_256x64_ldg8_relu_f2f_exp_small_nhwc2nchw_tn_v1
11 tensor_precision_fu_utilization Tensor-Precision Function Unit Utilization Low (3) High (7) Mid (6)
The Min / Max / Avg columns show the utilization level (on a scale of 0 to 10).
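With many kernels, it can help to extract these levels programmatically. Here is a rough sketch that parses text shaped like the report above and ranks kernels by average level. The regular expression is an assumption based only on the lines shown here, not a documented nvprof format.

```python
import re

# Matches rows such as:
#   "3 tensor_precision_fu_utilization <description> Mid (4) Mid (5) Mid (4)"
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+.*?\((\d+)\)\s+\w+\s+\((\d+)\)\s+\w+\s+\((\d+)\)\s*$"
)


def parse_metric_report(text):
    """Return one dict per kernel with its invocation count and min/max/avg level."""
    results = []
    kernel = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Kernel:"):
            kernel = line[len("Kernel:"):].strip()
            continue
        m = ROW.match(line)
        if m and kernel is not None:
            invocations, metric, lo, hi, avg = m.groups()
            results.append({
                "kernel": kernel,
                "metric": metric,
                "invocations": int(invocations),
                "min": int(lo),
                "max": int(hi),
                "avg": int(avg),
            })
    return results


report = """\
Invocations Metric Name Metric Description Min Max Avg
Device "TITAN V (0)"
Kernel: volta_s884cudnn_fp16_128x128_ldg8_relu_exp_interior_nhwc_tn_v1
3 tensor_precision_fu_utilization Tensor-Precision Function Unit Utilization Mid (4) Mid (5) Mid (4)
Kernel: volta_fp16_s884cudnn_fp16_128x128_ldg8_relu_f2f_exp_small_nhwc2nchw_tn_v1
27 tensor_precision_fu_utilization Tensor-Precision Function Unit Utilization Mid (4) High (7) Mid (6)
"""

parsed = parse_metric_report(report)
for entry in sorted(parsed, key=lambda e: e["avg"]):
    print(entry["avg"], entry["invocations"], entry["kernel"])
```

Kernels with a low average level are the first candidates to check for shapes or data types that prevent Tensor Core use.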
Profiling Scope
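● Restrict profiling to a region of interest
○ Wrap the code to measure with cudaProfilerStart(); and cudaProfilerStop();
#include <cuda_profiler_api.h>
cudaProfilerStart();
// do something to profile
cudaProfilerStop();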
$ nvprof --profile-from-start off <application>
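What if the CUDA API is called through Python?
● Just run it under nvprof as usual
$ nvprof [nvprof-options] python ...
● You can also use ctypes to call cudaProfilerStart / cudaProfilerStop
Python Script
import ctypes
_cudart = ctypes.CDLL('libcudart.so')  # library name is an assumption; adjust to your CUDA install
ret = _cudart.cudaProfilerStart()
# call cuda-based methods (e.g. a Python extension library that uses CUDA)
ret = _cudart.cudaProfilerStop()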
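The start/stop pair is easy to leave unbalanced when the profiled code raises, so one option is to wrap it in a context manager. This is a sketch: `cuda_profiler_region` and `FakeCudart` are hypothetical names, and the fake object exists only so the example runs without a GPU. On a real machine, pass a `ctypes.CDLL` handle to the CUDA runtime instead.

```python
from contextlib import contextmanager


@contextmanager
def cuda_profiler_region(cudart):
    """Enable profiling only while the enclosed block runs.

    `cudart` is any object exposing cudaProfilerStart / cudaProfilerStop,
    for example a ctypes.CDLL handle to the CUDA runtime library.
    """
    status = cudart.cudaProfilerStart()
    if status != 0:  # 0 is cudaSuccess
        raise RuntimeError(f"cudaProfilerStart failed with code {status}")
    try:
        yield
    finally:
        cudart.cudaProfilerStop()


class FakeCudart:
    """Stand-in for the CUDA runtime that just records the calls."""

    def __init__(self):
        self.calls = []

    def cudaProfilerStart(self):
        self.calls.append("start")
        return 0

    def cudaProfilerStop(self):
        self.calls.append("stop")
        return 0


fake = FakeCudart()
with cuda_profiler_region(fake):
    fake.calls.append("work")  # call CUDA-based methods here
print(fake.calls)
```

As with the C API, this only limits the profile if the script is launched with `nvprof --profile-from-start off`.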
nvvp: NVIDIA Visual Profiler
● GUI version of the profiler
○ Usable by simply clicking through the guided navigation
$ nvvp
compute resources / memory bandwidth / latency
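Primary Performance Limiter
● Both High:
○ Both the compute units and the memory bandwidth are highly utilized
● Memory High, Compute Low:
○ Bound by memory bandwidth
● Compute High, Memory Low:
○ Bound by compute resources
(Chart axes: Compute, Memory)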
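The classification above can be written down as a small decision rule. In this sketch the 0.6 threshold and the function name are assumptions for illustration. Treating the "both low" case as latency bound is my reading of the compute / memory bandwidth / latency triad mentioned earlier, not something stated on this slide.

```python
def primary_limiter(compute_util, memory_util, high=0.6):
    """Classify a kernel from its compute and memory utilization (0.0 to 1.0).

    `high` is an assumed threshold, not a value taken from nvvp.
    """
    compute_high = compute_util >= high
    memory_high = memory_util >= high
    if compute_high and memory_high:
        return "compute and memory both highly utilized"
    if memory_high:
        return "memory bandwidth bound"
    if compute_high:
        return "compute bound"
    return "latency bound"


for compute, memory in [(0.9, 0.8), (0.2, 0.85), (0.75, 0.3), (0.2, 0.25)]:
    print(f"compute={compute:.2f} memory={memory:.2f} -> "
          f"{primary_limiter(compute, memory)}")
```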
NVLink view
● Shows the NVLink topology and communication throughput
○ Note: the topology between GPUs can also be checked with $ nvidia-smi topo --matrix
Remote Profiling
● Running nvvp over X forwarding is slow
○ Instead, dump the profile with nvprof and copy it over with scp
● Placing a script on a relay server is another option
$ nvprof --analysis-metrics -o profile.nvvp <application>
● However...
$ nvprof -o profile.nvvp <application>
○ Little more than the timeline is visible
$ nvprof --analysis-metrics -o profile.nvvp <application>
○ Kernels get replayed over and over, which is slow
○ Restrict the scope with --kernels … ?
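The dump-and-copy workflow can be scripted in two steps: run nvprof on the GPU server over ssh, then fetch the resulting file with scp. The sketch below only builds the two command lines. The host name, application path, and helper name are hypothetical.

```python
import shlex


def remote_profile_commands(host, app, remote_out="profile.nvvp",
                            local_dir=".", analysis_metrics=False):
    """Return (run, fetch) argument lists for profiling on `host` and copying the result."""
    nvprof = ["nvprof"]
    if analysis_metrics:
        # Collects the data nvvp needs for analysis, at the cost of kernel replays.
        nvprof.append("--analysis-metrics")
    nvprof += ["-o", remote_out] + list(app)
    run = ["ssh", host, shlex.join(nvprof)]
    fetch = ["scp", f"{host}:{remote_out}", local_dir]
    return run, fetch


run, fetch = remote_profile_commands(
    "gpu-server", ["./my_app"], analysis_metrics=True)  # hypothetical host and app
print(shlex.join(run))
print(shlex.join(fetch))
```

Each list can be passed to `subprocess.run(..., check=True)`, after which the copied file is opened locally in nvvp.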
● Instead of dumping *.nvvp and copying it over, you can also profile directly from the remote side
Take-home message: NVIDIA knows best
> Note that Visual Profiler and nvprof will be deprecated in a future CUDA release.
> It is recommended to use next-generation tools NVIDIA Nsight Compute for GPU profiling and NVIDIA Nsight Systems for GPU and CPU sampling and tracing.
Nsight Systems
NVIDIA Nsight Systems for GPU and CPU sampling and tracing
Watch the official video! (admittedly a cop-out)
1. I use GPUs in my everyday research / development
2. I have written a CUDA kernel
3. I profile my code properly
4. I completely understand GPU architecture
TensorFlow timeline
● Profiling facility bundled with TensorFlow itself
import tensorflow as tf
from tensorflow.python.client import timeline
# build your model ...
ops = …
with tf.Session() as sess:
    # add additional options to trace the session execution
    options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(ops, options=options, run_metadata=run_metadata)
    # Create the Timeline object, and write it to a json file
    fetched_timeline = timeline.Timeline(run_metadata.step_stats)
    chrome_trace = fetched_timeline.generate_chrome_trace_format()
    with open('timeline.json', 'w') as f:
        f.write(chrome_trace)
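tf.timeline from Keras
● Also available from Keras with the TensorFlow backend
import tensorflow as tf
from tensorflow.python.client import timeline
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
# assumed step (lost from the slide): the TF 1.x Keras backend forwards these kwargs to sess.run()
model.compile(..., options=run_options, run_metadata=run_metadata)
# ... train or predict with the existing Keras model ...
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(trace.generate_chrome_trace_format())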
● The output follows the Chrome Trace Event Format
○ Load it at chrome://tracing in the Chrome browser
● You can also zoom in on the part you want to focus on
● Example: comparing All-Reduce algorithms
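Because the format is plain JSON, custom timings can be put into the same viewer. The following is a minimal sketch that emits "complete" events (`"ph": "X"`) with microsecond timestamps. The event names and durations are invented for illustration.

```python
import json


def complete_event(name, start_us, dur_us, pid=0, tid=0):
    """Build one 'complete' event: a named span with a start time and a duration."""
    return {
        "name": name,
        "ph": "X",       # phase: complete event
        "ts": start_us,  # start timestamp in microseconds
        "dur": dur_us,   # duration in microseconds
        "pid": pid,
        "tid": tid,
    }


events = [
    complete_event("conv1", 0, 300),
    complete_event("relu1", 300, 80),
    complete_event("memcpy_DtoH", 380, 670, tid=1),  # shown on a separate row
]
trace_text = json.dumps({"traceEvents": events}, indent=2)
print(trace_text)
```

Saving `trace_text` to a `.json` file and loading it at chrome://tracing should show the three spans on two rows.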
● Chrome Performance tools*
○ Used by Chrome / Go / Android
○ Trace Event Format details
● Projects
○ Trace-viewer: JavaScript codebase that loads trace files and creates the UI
○ Telemetry
○ Performance Dashboard
○ Systrace
○ Web Page Replay
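A trace in this format can also be post-processed outside the viewer. As a sketch, the function below sums the duration of complete events per name, which gives a quick per-op total from a file such as timeline.json. The sample data is invented, and the assumption that ops appear as `"ph": "X"` events with a `dur` field should be checked against your own trace.

```python
import json
from collections import defaultdict


def total_duration_by_name(trace_json):
    """Sum 'dur' (microseconds) of complete events, grouped by event name."""
    trace = json.loads(trace_json)
    # The format allows either {"traceEvents": [...]} or a bare list of events.
    events = trace if isinstance(trace, list) else trace.get("traceEvents", [])
    totals = defaultdict(int)
    for event in events:
        if event.get("ph") == "X":
            totals[event["name"]] += event.get("dur", 0)
    return dict(totals)


sample = json.dumps({"traceEvents": [
    {"name": "process_name", "ph": "M", "pid": 0, "args": {"name": "GPU"}},
    {"name": "conv1", "ph": "X", "ts": 0, "dur": 300, "pid": 0, "tid": 0},
    {"name": "relu1", "ph": "X", "ts": 300, "dur": 80, "pid": 0, "tid": 0},
    {"name": "conv1", "ph": "X", "ts": 380, "dur": 80, "pid": 0, "tid": 0},
]})

totals = total_duration_by_name(sample)
for name, dur in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {dur} us")
```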
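TensorFlow Profiler and Advisor
● This talk may feel a little out of place at this meetup, but ...
○ It gave a brief introduction to GPU profiling tools
○ Master the tools and aim for the world's fastest!!!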

GPU profiling for computer vision applications

  • 1. Performance Tools for Computer Vision Applications. @denkiwakame, 2018/12/15, Computer Vision Study Group @Kanto (コンピュータビジョン勉強会@関東)
  • 2. Performance Tools for CV - Agenda
      ● NVIDIA GPU Profiler
        ○ nvprof
        ○ nvvp
        ○ NVIDIA Nsight Systems
      ● TensorFlow / Keras
        ○ tf.timeline
      ● Others
        ○ perf, gperftools, ...
        ○ cProfile, yep, ...
      This talk focuses on the GPU.
  • 4. What is Profiling?
      ● Measuring the performance of an application
      ● Simple profiling
        ○ Measure the processing time of each part (for example, by inserting timers)
      ● Advanced profiling
        ○ Analyze why a given part is slow (requires dedicated tools)
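
The "simple profiling" case above (inserting timers) can be sketched in a few lines. This is only an illustration: the `timed` helper and the "preprocess" label are hypothetical names, not something from the talk. Because CUDA kernel launches are asynchronous, a host-side timer like this only measures GPU work correctly if the timed region synchronizes with the device before it ends.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# label -> list of elapsed wall-clock times in seconds
timings = defaultdict(list)


@contextmanager
def timed(label):
    """Record the wall-clock time spent inside the with-block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label].append(time.perf_counter() - start)


# Hypothetical workload standing in for one stage of a CV pipeline.
with timed("preprocess"):
    sum(i * i for i in range(100_000))

for label, samples in timings.items():
    mean_ms = sum(samples) / len(samples) * 1e3
    print(f"{label}: {mean_ms:.3f} ms over {len(samples)} run(s)")
```

This tells you where the time goes, but not why a part is slow, which is where the dedicated tools on the following slides come in.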
  • 6. nvprof
      ● Command-line profiler
        ○ /usr/local/cuda/bin/nvprof
      ● Usage
          $ nvprof [nvprof-options] <app> [arguments]
          ==17126== Profiling result:
          Type             Time(%)  Time      Calls  Avg       Min       Max       Name
          GPU activities:  28.09%   9.6689ms  32     302.15us  231.69us  544.94us  maxwell_scudnn_winograd_128x128_ldg1_ldg4_tile148n_nt
                           8.13%    2.7997ms  34     82.344us  42.145us  121.09us  maxwell_scudnn_128x128_relu_interior_nn
                           7.46%    2.5673ms  72     35.657us  4.6720us  569.87us  normalize_kernel(int, float*, float*, float*, int, int, int)
                           7.01%    2.4122ms  104    23.194us  1.7600us  243.82us  copy_kernel(int, float*, int, int, float*, int, int)
                           6.96%    2.3956ms  116    20.651us  1.1840us  242.98us  activate_array_kernel(float*, int, ACTIVATION)
                           6.46%    2.2237ms  75     29.649us  3.0400us  301.03us  add_bias_kernel(float*, float*, int, int, int)
                           6.24%    2.1489ms  72     29.846us  3.3600us  299.91us  scale_bias_kernel(float*, float*, int, int)
                           5.98%    2.0587ms  184    11.188us  1.4080us  112.64us  fill_kernel(int, float, float*, int)
                           5.87%    2.0187ms  3      672.90us  28.960us  1.8676ms  [CUDA memcpy DtoH]
                           5.16%    1.7760ms  4      444.00us  414.16us  516.91us  maxwell_scudnn_128x128_relu_small_nn
                           5.10%    1.7553ms  32     54.854us  7.2320us  163.24us  void cudnn::winograd::generateWinogradTilesKernel<int=0, float, float>(cudnn::winograd::GenerateWinogradTilesParams<float, float>)
  • 7. Four profiling modes
      ● summary mode (default)
          $ nvprof <application>
      ● trace mode
          $ nvprof --print-gpu-trace --print-api-trace
        ○ --print-gpu-trace: every activity that occurs on the GPU
        ○ --print-api-trace: CUDA Runtime API and Driver API calls
      ● event/metric summary mode
          $ nvprof --events <event-name> --metrics <metric-name>
      ● event/metric trace mode
          $ nvprof --aggregate-mode off [event|metric]
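
When switching between these modes from a script, it can help to build the command line in one place. The sketch below only assembles argument lists from the flags shown on this slide; the mode labels ("summary", "trace", "metric-summary", "metric-trace") and the `nvprof_command` helper are hypothetical names chosen for illustration, and nothing is executed.

```python
def nvprof_command(mode, app_argv, metrics=None):
    """Return an argv list for nvprof in one of the four profiling modes.

    `mode` uses illustrative labels, not nvprof terminology:
    "summary", "trace", "metric-summary" or "metric-trace".
    """
    cmd = ["nvprof"]
    if mode == "summary":
        pass  # default mode: no extra flags
    elif mode == "trace":
        cmd += ["--print-gpu-trace", "--print-api-trace"]
    elif mode in ("metric-summary", "metric-trace"):
        if not metrics:
            raise ValueError("metric modes need at least one metric name")
        if mode == "metric-trace":
            # report each invocation separately instead of aggregating
            cmd += ["--aggregate-mode", "off"]
        cmd += ["--metrics", ",".join(metrics)]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return cmd + list(app_argv)


# "./app" is a placeholder for the application under test.
print(" ".join(nvprof_command("trace", ["./app"])))
print(" ".join(nvprof_command("metric-trace", ["./app"],
                              metrics=["tensor_precision_fu_utilization"])))
```

The resulting list can be passed to `subprocess.run` on a machine where nvprof is installed.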
  • 8. nvprof
      ● summary mode (default)
          ==17126== Profiling result:
          Type             Time(%)  Time      Calls  Avg       Min       Max       Name
          GPU activities:  28.09%   9.6689ms  32     302.15us  231.69us  544.94us  maxwell_scudnn_winograd_128x128_ldg1_ldg4_tile148n_nt
                           8.13%    2.7997ms  34     82.344us  42.145us  121.09us  maxwell_scudnn_128x128_relu_interior_nn
                           7.46%    2.5673ms  72     35.657us  4.6720us  569.87us  normalize_kernel(int, float*, float*, float*, int, int, int)
                           7.01%    2.4122ms  104    23.194us  1.7600us  243.82us  copy_kernel(int, float*, int, int, float*, int, int)
                           6.96%    2.3956ms  116    20.651us  1.1840us  242.98us  activate_array_kernel(float*, int, ACTIVATION)
                           6.46%    2.2237ms  75     29.649us  3.0400us  301.03us  add_bias_kernel(float*, float*, int, int, int)
                           6.24%    2.1489ms  72     29.846us  3.3600us  299.91us  scale_bias_kernel(float*, float*, int, int)
                           5.98%    2.0587ms  184    11.188us  1.4080us  112.64us  fill_kernel(int, float, float*, int)
                           5.87%    2.0187ms  3      672.90us  28.960us  1.8676ms  [CUDA memcpy DtoH]
                           5.16%    1.7760ms  4      444.00us  414.16us  516.91us  maxwell_scudnn_128x128_relu_small_nn
                           5.10%    1.7553ms  32     54.854us  7.2320us  163.24us  void cudnn::winograd::generateWinogradTilesKernel<int=0, float, float>(cudnn::winograd::GenerateWinogradTilesParams<float, float>)
                           4.00%    1.3780ms  23     59.912us  18.272us  250.79us  shortcut_kernel(int, int, int, int, int, int, int, int, int, int, float*, int, int, int, float, float, float*)
                           1.36%    467.82us  1      467.82us  467.82us  467.82us  maxwell_scudnn_128x64_relu_small_nn
                           0.86%    296.90us  1      296.90us  296.90us  296.90us  maxwell_scudnn_128x32_relu_small_nn
                           0.48%    165.99us  2      82.994us  82.882us  83.106us  maxwell_scudnn_128x64_relu_interior_nn
                           0.39%    134.98us  1      134.98us  134.98us  134.98us  maxwell_scudnn_128x32_relu_interior_nn
                           0.26%    89.508us  43     2.0810us  1.7280us  9.3440us  cudnn::maxwell::gemm::computeOffsetsKernel(cudnn::maxwell::gemm::ComputeOffsetsParams)
                           0.17%    58.018us  2      29.009us  19.809us  38.209us  upsample_kernel(unsigned long, float*, int, int, int, int, int, int, float, float*)
          API calls:       90.08%   285.51ms  798    357.78us  3.4690us  282.40ms  cudaLaunch
                           9.68%    30.674ms  3      10.225ms  1.6737ms  24.491ms  cudaMemcpy
                           0.11%    363.03us  3540   102ns     86ns      1.5170us  cudaSetupArgument
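
The columns of the summary table are related: Avg is Time divided by Calls. As a quick sanity check, the sketch below converts nvprof-style time strings to microseconds and recomputes the average for the first row of the table. The helper names `to_microseconds` and `average_us` are illustrative, not part of any tool.

```python
import re

# Conversion factors from the unit suffixes seen in the table to microseconds.
_UNIT_TO_US = {"ns": 1e-3, "us": 1.0, "ms": 1e3, "s": 1e6}


def to_microseconds(text):
    """Convert a string such as '9.6689ms' or '102ns' to microseconds."""
    match = re.fullmatch(r"([0-9]*\.?[0-9]+)(ns|us|ms|s)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized time value: {text!r}")
    return float(match.group(1)) * _UNIT_TO_US[match.group(2)]


def average_us(total, calls):
    """Average time per call in microseconds (Time / Calls)."""
    return to_microseconds(total) / calls


# First row of the summary: Time = 9.6689ms over 32 calls.
print(f"{average_us('9.6689ms', 32):.2f}us")  # prints 302.15us, matching the Avg column
```

Reading the table this way also makes the API section easier to interpret: the Max of cudaLaunch (282.40ms) accounts for almost all of its 285.51ms total, so a single call dominates that row.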
  • 9. Example: checking Tensor Core utilization
      ● Tensor Cores execute a 4x4 matrix multiplication in one cycle
        ○ Available on the Volta architecture
  • 10. Finding the available metrics
      ● --query-metrics
          $ nvprof --query-metrics
          Available Metrics:
              Name   Description
          Device 0 (TITAN V):
          ...
          shared_load_transactions_per_request: Average number of shared memory load transactions performed for each shared memory load
          shared_store_transactions_per_request: Average number of shared memory store transactions performed for each shared memory store
          local_load_transactions_per_request: Average number of local memory load transactions performed for each local memory load
          local_store_transactions_per_request: Average number of local memory store transactions performed for each local memory store
          ...
          half_precision_fu_utilization: The utilization level of the multiprocessor function units that execute 16 bit floating-point instructions on a scale of 0 to 10. Note that this doesn't specify the utilization level of tensor core unit
          tensor_precision_fu_utilization: The utilization level of the multiprocessor function units that execute tensor core instructions on a scale of 0 to 10
      Slide callouts: the shared_* entries are the shared-memory metrics; tensor_precision_fu_utilization is the Tensor Core metric.
  • 11. Trying it out
      ● Run with the metric specified
          $ nvprof --metrics tensor_precision_fu_utilization <application>
          Invocations  Metric Name  Metric Description  Min  Max  Avg
          Device "TITAN V (0)"
            Kernel: volta_s884cudnn_fp16_128x128_ldg8_relu_exp_interior_nhwc_tn_v1
              3   tensor_precision_fu_utilization  Tensor-Precision Function Unit Utilization  Mid (4)  Mid (5)   Mid (4)
            Kernel: volta_fp16_s884cudnn_fp16_128x128_ldg8_relu_f2f_exp_small_nhwc2nchw_tn_v1
              27  tensor_precision_fu_utilization  Tensor-Precision Function Unit Utilization  Mid (4)  High (7)  Mid (6)
            Kernel: volta_fp16_s884cudnn_fp16_128x128_ldg8_relu_f2f_exp_interior_nhwc2nchw_tn_v1
              20  tensor_precision_fu_utilization  Tensor-Precision Function Unit Utilization  Mid (4)  Mid (6)   Mid (5)
            Kernel: volta_fp16_s884cudnn_fp16_256x128_ldg8_relu_filter1x1_stg8_interior_nchw_nn_v1
              14  tensor_precision_fu_utilization  Tensor-Precision Function Unit Utilization  Low (2)  Mid (5)   Mid (4)
            Kernel: volta_fp16_s884cudnn_fp16_256x64_ldg8_relu_f2f_exp_small_nhwc2nchw_tn_v1
              11  tensor_precision_fu_utilization  Tensor-Precision Function Unit Utilization  Low (3)  High (7)  Mid (6)
      Slide callout: the Min / Max / Avg columns show the utilization level (0 to 10).
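  • 12. Profiling Scope
      ● Restrict profiling to a specific region
        ○ Wrap the code you want to measure in cudaProfilerStart(); and cudaProfilerStop();
          #include <cuda_profiler_api.h>

          cudaProfilerStart();
          // do something to profile ...
          cudaProfilerStop();
      ● The following option is required so that profiling does not start at program launch:
          $ nvprof --profile-from-start off <application>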
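  • 13. What about calling the CUDA API through Python?
      ● Just run the script under nvprof as usual
          $ nvprof [nvprof-options] python ...
      ● ctypes can also be used to limit the profiled region (Python script):
          import ctypes
          _cudart = ctypes.CDLL('')  # library name elided in the source (slide callout: libcuda…)
          ret = _cudart.cudaProfilerStart()
          # call cuda-based methods (e.g. Python extension libraries that use CUDA)
          ret = _cudart.cudaProfilerStop()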
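
One way to make the ctypes approach reusable is a context manager that starts and stops the profiler around a block. This is a minimal sketch under stated assumptions: it assumes the CUDA runtime can be located with `ctypes.util.find_library("cudart")`, which depends on the installation, and the names `load_cudart` and `cuda_profile_range` are hypothetical. When the library is not found, the context manager does nothing, so the script still runs on machines without CUDA.

```python
import ctypes
import ctypes.util
from contextlib import contextmanager


def load_cudart():
    """Try to load the CUDA runtime; return None if it cannot be found.

    Assumption: the runtime is discoverable as "cudart" on this system.
    """
    name = ctypes.util.find_library("cudart")
    if name is None:
        return None
    try:
        return ctypes.CDLL(name)
    except OSError:
        return None


@contextmanager
def cuda_profile_range(cudart):
    """Call cudaProfilerStart/cudaProfilerStop around the with-block.

    `cudart` is the handle from load_cudart(); None turns this into a no-op.
    """
    if cudart is not None:
        cudart.cudaProfilerStart()
    try:
        yield
    finally:
        if cudart is not None:
            cudart.cudaProfilerStop()
```

Typical use would be `with cuda_profile_range(load_cudart()): ...` around the CUDA-backed calls, with the script launched via `nvprof --profile-from-start off python ...`.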
  • 15. nvvp: NVIDIA Visual Profiler
      ● GUI version of the profiler
        ○ Usable by simply clicking through the guided navigation
          $ nvvp
  • 18. Examining kernel performance
      ● What is the bottleneck: compute resources, memory bandwidth, or latency?
      ● From there, move on to a more detailed analysis
  • 19. Primary Performance Limiter (Compute vs. Memory)
      ● Both High:
        ○ Both the compute units and the memory bandwidth are highly utilized
      ● Memory High, Compute Low:
        ○ Limited by memory bandwidth
      ● Compute High, Memory Low:
        ○ Limited by compute resources
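
The three cases above amount to a small decision rule over two utilization numbers. The sketch below spells it out. The 0.6 threshold is an arbitrary assumption for illustration, not a value taken from nvvp, and the "both low" branch (reported here as likely latency-bound) is an addition based on the previous slide, which lists latency as the third candidate bottleneck.

```python
def primary_limiter(compute_util, memory_util, high=0.6):
    """Classify a kernel from compute and memory utilization in [0, 1].

    `high` is an illustrative threshold for calling a utilization "high".
    """
    compute_high = compute_util >= high
    memory_high = memory_util >= high
    if compute_high and memory_high:
        return "both high"
    if memory_high:
        return "memory-bound"
    if compute_high:
        return "compute-bound"
    # Not listed on the slide: neither resource is busy (assumed latency-bound).
    return "likely latency-bound"


# Hypothetical utilization pairs (compute, memory).
for compute, memory in [(0.9, 0.8), (0.2, 0.9), (0.85, 0.3), (0.2, 0.1)]:
    print(compute, memory, "->", primary_limiter(compute, memory))
```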
  • 21. NVLink view
      ● Shows the NVLink topology and communication throughput
        ○ Note: the topology between GPUs can also be checked with $ nvidia-smi topo --matrix
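  • 22. Remote Profiling
      ● Running nvvp over X forwarding is slow
        ○ Instead, write the profile to a file with nvprof and copy it over with scp
          $ nvprof --analysis-metrics -o profile.nvvp <application>
        ○ --analysis-metrics is required for detailed kernel analysis
        ○ The machine that opens the profile does not need a GPU
      ● Placing a script on a relay server is another option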
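  • 23. Remote Profiling
      ● However...
          $ nvprof -o profile.nvvp <application>
        ○ Shows little more than the timeline
          $ nvprof --analysis-metrics -o profile.nvvp <application>
        ○ Replays each kernel many times, so even a moderately complex application becomes extremely slow to profile
      ● Inconvenient... restrict the profiled kernels with --kernels ...?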
  • 24. Remote Profiling
      ● Profiling can also be driven directly from the remote side, without dumping and transferring *.nvvp files
  • 25. Take-home message: NVIDIA's own material is the most detailed
      ● 8_JeremyAppleyard.pdf
  • 26. "Note that Visual Profiler and nvprof will be deprecated in a future CUDA release. It is recommended to use next-generation tools NVIDIA Nsight Compute for GPU profiling and NVIDIA Nsight Systems for GPU and CPU sampling and tracing."
  • 27. Nsight Systems: NVIDIA Nsight Systems for GPU and CPU sampling and tracing
  • 30. Question:
      1. I use GPUs in my everyday research/development
      2. I have written a CUDA kernel
      3. I profile my code properly
      4. I completely understand GPU architecture
  • 32. TensorFlow timeline
      ● Profiling feature bundled with TensorFlow itself (TensorFlow 1.x session API)
          import tensorflow as tf
          from tensorflow.python.client import timeline

          # build your model ...
          ops = …

          with tf.Session() as sess:
              # add additional options to trace the session execution
              options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
              run_metadata = tf.RunMetadata()
              sess.run(ops, options=options, run_metadata=run_metadata)

              # Create the Timeline object, and write it to a json file
              fetched_timeline = timeline.Timeline(run_metadata.step_stats)
              chrome_trace = fetched_timeline.generate_chrome_trace_format()
              with open('timeline.json', 'w') as f:
                  f.write(chrome_trace)
  • 33. tf.timeline from Keras
      ● Also available from Keras with the TensorFlow backend
          import tensorflow as tf
          from tensorflow.python.client import timeline

          run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
          run_metadata = tf.RunMetadata()
          # the extra keyword arguments are forwarded to the underlying session run
          model.compile(loss='...', optimizer='...', options=run_options, run_metadata=run_metadata)
          …
          trace = timeline.Timeline(step_stats=run_metadata.step_stats)
          with open('timeline.json', 'w') as f:
              f.write(trace.generate_chrome_trace_format())
  • 34. chrome://tracing
      ● The output conforms to the Chrome Trace Event Format
        ○ Load it at chrome://tracing in the Chrome browser
      ● The view shows the timeline and the time spent per node
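
Because the trace is plain JSON, the per-node time shown in the viewer can also be computed offline. The sketch below builds a tiny hand-written trace in the Trace Event Format (complete events with `"ph": "X"`, timestamps and durations in microseconds) and sums the duration per event name. The events and the `time_per_node` helper are hypothetical; a real timeline.json written by TensorFlow contains more event types and fields than shown here.

```python
import json
from collections import defaultdict

# A minimal, hand-written trace: three complete ("X") events on one thread.
trace = {
    "traceEvents": [
        {"name": "conv2d", "ph": "X", "ts": 0, "dur": 120, "pid": 0, "tid": 0},
        {"name": "relu", "ph": "X", "ts": 120, "dur": 30, "pid": 0, "tid": 0},
        {"name": "conv2d", "ph": "X", "ts": 150, "dur": 100, "pid": 0, "tid": 0},
    ]
}


def time_per_node(trace_dict):
    """Sum the duration (microseconds) of complete events per event name."""
    totals = defaultdict(int)
    for event in trace_dict.get("traceEvents", []):
        if event.get("ph") == "X":
            totals[event["name"]] += event.get("dur", 0)
    return dict(totals)


# Round-trip through JSON, as if the trace had been read back from a file.
loaded = json.loads(json.dumps(trace))
for name, total_us in sorted(time_per_node(loaded).items(), key=lambda kv: -kv[1]):
    print(f"{name}: {total_us} us")
```

To apply this to a real file, replace the round-trip with `json.load(open('timeline.json'))`.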
  • 37. Catapult
      ● Chrome Performance tools [*]
        ○ Used by Chrome / Go / Android
        ○ Trace Event Format details
          ■ I0nSsKchNAySU/preview
      ● Projects
        ○ Trace-viewer: JavaScript codebase that loads trace files and creates the UI
        ○ Telemetry
        ○ Performance Dashboard
        ○ Systrace
        ○ Web Page Replay
  • 38. TensorFlow Profiler and Advisor: core/profiler/
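  • 39. Summary
      ● This talk may feel a little out of place at this meetup, but ...
        ○ It briefly introduced GPU profiling tools
        ○ Master the tools and aim to be the fastest in the world!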