SlideShare a Scribd company logo
1 of 21
Download to read offline
science + computing ag
IT Service and Software Solutions for Complex Computing Environments
Tuebingen | Muenchen | Berlin | Duesseldorf
Runtime Performance Optimizations for an OpenFOAM Simulation
Dr. Oliver Schröder
HPC Services and Software
21 June 2014
HPC Advisory Council European Conference 2014
Many Thanks to…
© 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de
 Applications and Performance Team
 Madeleine Richards
 Rafael Eduardo Escovar Mendez
 HPC Services and Software Team
 Fisnik Kraja
 Josef Hellauer
2
Overview
 Introduction
 The Test Case
 The Benchmarking Environment
 Initial Results (the starting point)
 Dependency Analysis
• CPU Frequency
• Interconnect
• Memory Hierarchy
 Performance Improvements
• MPI Communications
 Tuning MPI Routines
 Intel MPI vs. Bull MPI
• Domain Decomposition
 Conclusions
© 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de
3
Introduction
© 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de
 How can we improve application performance?
• By using the resources efficiently
• By selecting the appropriate resources
 In order to do this we have to:
• Analyze the behavior of the application
• And then tune the runtime environment accordingly
 In this study we:
• Analyze the dependencies of an OpenFOAM Simulation
• Improve the performance of OpenFOAM by
• Tuning MPI communications
• Selecting the appropriate domain decomposition
4
The Test Environment
© 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de
5
Compute Cluster Sid Robin
Hardware
Processor Intel E5-2680 V2 Ivy-Bridge Intel E5-2697 V2 Ivy-Bridge
Frequency 2.80 GHz 2.70 GHz
Cores per processor 10 12
Sockets per node 2 2
Cores per nodes 20 24
Cache 25 MB (L3), 10x256 KB (L2) 30 MB (L3), 12 x 256 KB (L2)
Memory 64 GB 64 GB
Frequency of memory 1866 MHz 1866 MHz
IO – FS NFS over IB NFS over IB
Interconnect IB FDR IB FDR
Software
OpenFOAM 2.2.2 2.2.2
Intel C Compiler 13.1.3.192 13.1.3.192
GNU C Compiler 4.7.3 4.7.3
Compiler Options -O3 –fPIC –m64 -O3 –fPIC –m64
Intel MPI 4.1.0.030 4.1.0.030
Bull MPI 1.2.4.1 1.2.4.1
SLURM 2.6.0 2.5.0
OFED 1.5.4.1 1.5.4.1
Initial Results on SID (E5-2680v2) (1)
 GCC(IMPI) vs. ICC (IMPI)
 Better performance is obtained with the OpenFOAM version compiled with GCC
 During the development and optimization of OpenFOAM mainly GCC is used
 8 vs. 10 Tasks / CPU
 Better Performance with 8 Tasks/CPU. The reasons might be:
 With fewer tasks we use fewer cores in each CPU, having so more cache and more memory
bandwidth per core.
 The domain decomposition is different. We will see later the impact of Domain Decomposition on
Performance
© 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de
6
256 Tasks 512 Tasks 768 Tasks
16 Nodes 32 Nodes 48 Nodes
GCC 1767 1019 869
ICC 1806 1080 884
0
200
400
600
800
1000
1200
1400
1600
1800
2000
ClockTime[s]
GCC vs. ICC
16 Nodes 32 Nodes 48 Nodes
8 Tasks/CPU 1767 1019 869
10 Tasks/CPU 1799 1116 1010
0
200
400
600
800
1000
1200
1400
1600
1800
2000
ClockTime[s]
8 vs. 10 Tasks /CPU
Initial Results on SID (E5-2680v2) (2)
© 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de
7
320 Tasks 640 Tasks 960 Tasks
16 Nodes 32 Nodes 48 Nodes
No_CPU_Binding 1801 1121 1007
With_CPU_BINDING 1799 1116 1010
0
200
400
600
800
1000
1200
1400
1600
1800
2000
ClockTime[s]
OpenFOAM: CPU Binding
DAPL
Fabric
OFA Fabric
zone
reclaim
huge pages
hyperthrea
ding
turbo
boost
16 Nodes 16 Nodes 16 Nodes 16 Nodes 16 Nodes 16 Nodes
320 Tasks 320 Tasks 320 Tasks 320 Tasks 640 Tasks 320 Tasks
Clock Time [s] 1804 1845 1802 1807 2229 1750
0
500
1000
1500
2000
2500
Seconds
OpenFOAM Tests on 16 Nodes
0,0
500,0
1000,0
1500,0
2000,0
2500,0
3000,0
3500,0
4000,0
120 Tasks 240 Tasks 480 Tasks 960 Tasks
6 Nodes 12 Nodes 24 Nodes 48 Nodes
Time[s]
Example results with Star-CCM+
(cb=cpu_bind, zr=zone reclaim, thp=transparent huge pages, trb=turbo, ht=hyperthreading)
Opt: none
Opt: cb
Opt: cb,zr
Opt: cb,zr,thp
Opt: cb,zr,thp,trb
Opt: cb,rz,thp,ht
Initial Results on SID (E5-2680v2) (3)
 CPU Binding
 With other applications we often observed performance improvements when we
bind the Tasks to specific CPUs (especially when Hyperthreading=ON)
 This seems not to be the case for this OpenFOAM simulation
 We applied also other optimization techniques but we observed
improvement only in the case of Turbo Boost.
 Actually with Turbo Boost we expected more than 3% improvement.
 We really need to analyze the dependencies of
OpenFOAM in order to understand this behavior.
© 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de
8
Overview
 Introduction
 The Test Case
 The Benchmarking Environment
 Initial Results (the starting point)
 Dependency Analysis
• CPU Frequency
• Interconnect
• Memory Hierarchy
 Performance Improvements
• MPI Communications
 Tuning MPI Routines
 Intel MPI vs. Bull MPI
• Domain Decomposition
 Conclusions
© 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de
9
CPU Comparison
E5-2680v2 vs. E5-2697v2
© 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de
 Core-to-Core Comparison
 16 nodes: better on E5-2697v2 CPUs considering the additional 5 MB of L3 cache
 24 nodes: very small differences – MPI overhead hides the benefits from the cache
 Node-to-Node Comparison
 16 nodes: 2% better on E5-2697v2 CPUs
 24 nodes: <1% better on E5-2697v2 CPUs
10
256 Tasks 320 Tasks 384 Tasks 480 Tasks
16 Nodes 16 Nodes 24 Nodes 24 Nodes
E5-2680v2 on SID 1912 1925 1364 1345
E5-2697v2 on Robin 1849 1803 1355 1344
0
500
1000
1500
2000
2500
ClockTime[s]
Core-to-Core Comparison
OpenFOAM @ 2.4 GHz
16 Nodes 24 Nodes
10c@2.8GHz - E5-2680v2 1799 1294
12c@2.7GHz - E5-2697v2 1765 1284
0
200
400
600
800
1000
1200
1400
1600
1800
2000
ClockTime[s]
Node-to-Node Comparison
OpenFOAM on Fully Populated Nodes
OpenFOAM Dependency on CPU Frequency
Tests on SID B71010c (E5-2680v2)
 Dependency on the CPU frequency only modest, approx 45%
 Beyond 2.5 GHz there is not much improvement
© 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de
11
y = 69878x-0,463
R² = 0,9618
y = 36256x-0,441
R² = 0,9697
y = 25120x-0,408
R² = 0,9702
0
500
1000
1500
2000
2500
3000
1000 1500 2000 2500 3000
WallclocktimeinSeconds
CPU frequency in MHz
16 Nodes - 320 Tasks
32 Nodes - 640 Tasks
48 Nodes - 960 Tasks
Pot.(16 Nodes - 320 Tasks)
Pot.(32 Nodes - 640 Tasks)
Pot.(48 Nodes - 960 Tasks)
OpenFOAM Dependency on Interconnect
Compact Task Placement vs. Fully Populated
© 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de
 With the compact task placement we bind all 10 tasks on one of the CPUs in
each node to make sure that cache size and memory bandwidth stay the
same. In this way we give to each task more interconnect bandwidth.
 For tests with 320 tasks we obtain 3% improvement
 For tests with 640 tasks we obtain 11% improvement
 Conclusion:
 Dependency on Interconnect Bandwidth increases with the number of tasks (nodes)
12
320 Tasks 640 Tasks
Speedup 1,03 1,11
1,00
1,02
1,04
1,06
1,08
1,10
1,12
Compact vs. Fully Populated
 Test
 Fully Populated: 20 Tasks/Node
 320 Tasks on 16 Nodes
 640 Tasks on 32 Nodes
 Compact: 10 Tasks/Node (2x #Nodes)
 320 Tasks on 32 Nodes
 640 Tasks on 64 Nodes
OpenFOAM Dependency on Memory Hierarchy
Scatter vs. Compact Task Placement
 The dependency on the memory hierarchy is bigger when running
OpenFOAM on a small number of nodes. On more nodes, the bottleneck is
not anymore the memory hierarchy but the interconnect.
 The impact of the L3 cache stays almost the same when doubling the
number of nodes. However, the impact coming from the memory
throughput is lower.
© 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de
13
28%
4%
22%
24%
0%
10%
20%
30%
40%
50%
60%
320 Tasks 640 Tasks
32 Nodes 64 Nodes
Improvement in Memory Bandwidth
Improvement from the L3 Cache
1,50
1,28
1,00
1,10
1,20
1,30
1,40
1,50
1,60
320 Tasks 640 Tasks
32 Nodes 64 Nodes
Speedup
OpenFOAM Dependency on Memory Hierarchy
Scatter vs. Compact Task Placement
 We tested OpenFOAM on Lustre with various
 Stripe Counts
 Stripe Sizes
 We could not obtain any improvement
© 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de
14
0
200
400
600
800
1000
1200
1400
1600
1800
2000
16 32 48
ClockTime[s]
Number of Nodes (20 Tasks/Node)
1 Stripe of 1 MB
2 Stripes of 1 MB
4 Stripes of 1 MB
6 Stripes of 1 MB
1 Stripe of 2 MB
1 Stripe of 4 MB
1 Stripe of 6MB
Overview
 Introduction
 The Test Case
 The Benchmarking Environment
 Initial Results (the starting point)
 Dependency Analysis
• CPU Frequency
• Interconnect
• Memory Hierarchy
 Performance Improvements
• MPI Communications
 Tuning MPI Routines
 Intel MPI vs. Bull MPI
• Domain Decomposition
 Conclusions
© 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de
15
MPI Profiling (with Intel MPI)
Tests on SID B71010c (E5-2680v2)
 The portion of time spent in MPI increases with the number of nodes
 Looking at the MPI Calls
 ~ 50 % of MPI Time is spent on MPI_Allreduce
 ~ 30 % of MPI Time is spent on MPI_Waitall
© 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de
16
76%
58%
42%
23,5%
41,6%
57,9%
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
320 Tasks 640 Tasks 960 Tasks
16 Nodes 32 Nodes 48 Nodes
MPI User
48,6% 53,5% 51,2%
31,9%
29,6% 28,4%
15,1% 11,7% 14,0%
4,4% 5,2% 6,4%
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
320 Tasks 640 Tasks 960 Tasks
16 Nodes 32 Nodes 48 Nodes
MPI_Allreduce MPI_Waitall MPI_Recv Other
MPI All Reduce Tuning (with Intel MPI)
Tests on SID B71010c (E5-2680v2)
 The best All_Reduce algorithm:
 (5) Binominal Gather + Scatter
 Overall obtained improvement
 None on 16 nodes – 320 Tasks
 11% on 32 Nodes – 640 Tasks
© 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de
17
1. Recursive doubling algorithm
2. Rabenseifner's algorithm
3. Reduce + Bcast algorithm
4. Topology aware Reduce + Bcast algorithm
5. Binomial gather + scatter algorithm
6. Topology aware binominal gather + scatter algorithm
7. Shumilin's ring algorithm
8. Ring algorithm
1797 1893 2075 2232
1864 1801 1860
2208
2813
0
1000
2000
3000
4000
5000
6000
7000
8000
default 1 2 3 4 5 6 7 8
ClockTime[s]
MPI All Reduce Algorithms
16 Nodes - 320 Tasks
1117 1210
1597
4096
1490
1004
1526
3985
7373
0
1000
2000
3000
4000
5000
6000
7000
8000
default 1 2 3 4 5 6 7 8
ClockTime[s]
MPI All Reduce Algorithms
32 Nodes - 640 Tasks
Finding the Right Domain Decomposition
Tests on Robin with E6-2697v2 CPUs
 We obtain
 10 - 14% improvement on 16 nodes
 3 – 5 % improvement on 24 nodes
 This means that decomposition improves data locality impacting thus the intra node
communication, but it cannot have a high impact when the inter-node communication
overhead increases.
© 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de
18
1765
1546 1550
1592
1,000
1,142 1,139
1,109
0,800
0,900
1,000
1,100
1,200
1,300
1,400
1,500
0
200
400
600
800
1000
1200
1400
1600
1800
2000
2 2 2 2 3 2 4 4 24 2 3 4 4 6 8 8
2 96 4 6 4 8 4 24 4 12 8 6 8 8 6 8
96 2 48 32 32 24 24 4 4 16 16 16 12 8 8 6 Speedup
ClockTime[s]
Domain Decomposition (Bottom Up: Y Z X)
OpenFOAM on 16 Nodes with 24 Tasks/Node
Time [s]
Speedup
1284 1246
1224
1249
1,000
1,030 1,049 1,028
0,900
0,950
1,000
1,050
1,100
1,150
1,200
1,250
1,300
0
200
400
600
800
1000
1200
1400
2 2 2 3 2 4 2 3 2 3 4 2 4 3 4 6 4 6 8
2 4 6 4 8 4 9 6 12 8 6 16 8 12 9 6 12 8 8
144 72 48 48 36 36 32 32 24 24 24 18 18 16 16 16 12 12 9
Speeedup
ClockTime[s]
Domain Decomposition (Bottom Up : Y Z X)
OpenFOAM on 24 Nodes (24 Tasks/Node)
Time [s] Speedup
Bull MPI vs. Intel MPI
Tests on Robin with E6-2697v2 CPUs
 Bull MPI is based on OpenMPI
 It has been enhanced with many
features such as:
 effective abnormal pattern detection
 network-aware collective operations
 multi-path network failover.
 It has been designed to boost the
performance of MPI applications
thanks to full integration with
 bullx PFS (based on Lustre)
 bullx BM (based on SLURM)
© 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de
19
0
200
400
600
800
1000
1200
1400
1600
1800
2000
X=2 X=2 X=2 X=2
Z=2 Z=6 Z=2 Z=9
Y=96 Y=32 Y=144 Y=32
384 Tasks 576 Tasks
16 Nodes 24 Nodes
Intel MPI
Bull MPI
6.2%
3.2%
1.4%
3.9%
Conclusions
© 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de
1. CPU Frequency
• OpenFOAM shows a dependency under 50% on the CPU Frequency
2. Cache Size
• OpenFOAM is dependent on the Cache Size per Core
3. Limiting factor changes as number of nodes change:
1. Memory Bandwidth:
OpenFOAM is memory bandwidth dependent on a small number of nodes
2. Communication:
MPI Time increases substantially with increasing number of nodes
4. Tuning MPI All Reduce Operations
• Up to 11 % improvement can be obtained
5. Domain Decomposition
• Finding the right mesh dimensions is very important
6. File System
• We could not obtain better results with different stripe counts or size in Lustre
20
Thank you for your attention.
science + computing ag
www.science-computing.de
Oliver Schröder
Email: o.schroeder@science-computing.de

More Related Content

What's hot

スーパーコンピューターとクラウドでのOpenFOAM性能・費用ベンチマークテスト
スーパーコンピューターとクラウドでのOpenFOAM性能・費用ベンチマークテストスーパーコンピューターとクラウドでのOpenFOAM性能・費用ベンチマークテスト
スーパーコンピューターとクラウドでのOpenFOAM性能・費用ベンチマークテストMasanori Sumitomo
 
OpenFOAMにおける相変化解析
OpenFOAMにおける相変化解析OpenFOAMにおける相変化解析
OpenFOAMにおける相変化解析takuyayamamoto1800
 
PreCICE CHT with OpenFOAM and CalculiX
PreCICE CHT with OpenFOAM and CalculiXPreCICE CHT with OpenFOAM and CalculiX
PreCICE CHT with OpenFOAM and CalculiX守淑 田村
 
About multiphaseEulerFoam
About multiphaseEulerFoamAbout multiphaseEulerFoam
About multiphaseEulerFoam守淑 田村
 
Dynamic Mesh in OpenFOAM
Dynamic Mesh in OpenFOAMDynamic Mesh in OpenFOAM
Dynamic Mesh in OpenFOAMFumiya Nozaki
 
DEXCS2022 for preCICE
DEXCS2022 for preCICEDEXCS2022 for preCICE
DEXCS2022 for preCICEEtsuji Nomura
 
無償のモデリングソフトウェアCAESESを使ってみた
無償のモデリングソフトウェアCAESESを使ってみた無償のモデリングソフトウェアCAESESを使ってみた
無償のモデリングソフトウェアCAESESを使ってみたFumiya Nozaki
 
About chtMultiRegionFoam
About chtMultiRegionFoam About chtMultiRegionFoam
About chtMultiRegionFoam 守淑 田村
 
Inter flowによる軸対称milkcrown解析
Inter flowによる軸対称milkcrown解析Inter flowによる軸対称milkcrown解析
Inter flowによる軸対称milkcrown解析守淑 田村
 
OpenFOAMソルバの実行時ベイズ最適化
OpenFOAMソルバの実行時ベイズ最適化OpenFOAMソルバの実行時ベイズ最適化
OpenFOAMソルバの実行時ベイズ最適化Masashi Imano
 
Paraviewの等高面を綺麗に出力する
Paraviewの等高面を綺麗に出力するParaviewの等高面を綺麗に出力する
Paraviewの等高面を綺麗に出力するtakuyayamamoto1800
 
Mixer vessel by cfmesh
Mixer vessel by cfmeshMixer vessel by cfmesh
Mixer vessel by cfmeshEtsuji Nomura
 
流体解析入門者向け超初級講習会
流体解析入門者向け超初級講習会流体解析入門者向け超初級講習会
流体解析入門者向け超初級講習会mmer547
 
Free cad 0.19.2 and cfdof (Japanese Ver.)
Free cad 0.19.2 and cfdof (Japanese Ver.)Free cad 0.19.2 and cfdof (Japanese Ver.)
Free cad 0.19.2 and cfdof (Japanese Ver.)YohichiShiina
 
Turbulence Models in OpenFOAM
Turbulence Models in OpenFOAMTurbulence Models in OpenFOAM
Turbulence Models in OpenFOAMFumiya Nozaki
 
OpenFOAMの混相流用改造solver(S-CLSVOF法)の設定・使い方
OpenFOAMの混相流用改造solver(S-CLSVOF法)の設定・使い方OpenFOAMの混相流用改造solver(S-CLSVOF法)の設定・使い方
OpenFOAMの混相流用改造solver(S-CLSVOF法)の設定・使い方takuyayamamoto1800
 
DEXCS2022OF_Install.pdf
DEXCS2022OF_Install.pdfDEXCS2022OF_Install.pdf
DEXCS2022OF_Install.pdfEtsuji Nomura
 
OpenFOAM の境界条件をまとめよう!
OpenFOAM の境界条件をまとめよう!OpenFOAM の境界条件をまとめよう!
OpenFOAM の境界条件をまとめよう!Fumiya Nozaki
 

What's hot (20)

スーパーコンピューターとクラウドでのOpenFOAM性能・費用ベンチマークテスト
スーパーコンピューターとクラウドでのOpenFOAM性能・費用ベンチマークテストスーパーコンピューターとクラウドでのOpenFOAM性能・費用ベンチマークテスト
スーパーコンピューターとクラウドでのOpenFOAM性能・費用ベンチマークテスト
 
OpenFOAMにおける相変化解析
OpenFOAMにおける相変化解析OpenFOAMにおける相変化解析
OpenFOAMにおける相変化解析
 
PreCICE CHT with OpenFOAM and CalculiX
PreCICE CHT with OpenFOAM and CalculiXPreCICE CHT with OpenFOAM and CalculiX
PreCICE CHT with OpenFOAM and CalculiX
 
About multiphaseEulerFoam
About multiphaseEulerFoamAbout multiphaseEulerFoam
About multiphaseEulerFoam
 
Dynamic Mesh in OpenFOAM
Dynamic Mesh in OpenFOAMDynamic Mesh in OpenFOAM
Dynamic Mesh in OpenFOAM
 
DEXCS2022 for preCICE
DEXCS2022 for preCICEDEXCS2022 for preCICE
DEXCS2022 for preCICE
 
無償のモデリングソフトウェアCAESESを使ってみた
無償のモデリングソフトウェアCAESESを使ってみた無償のモデリングソフトウェアCAESESを使ってみた
無償のモデリングソフトウェアCAESESを使ってみた
 
multiDimAMR
multiDimAMRmultiDimAMR
multiDimAMR
 
About chtMultiRegionFoam
About chtMultiRegionFoam About chtMultiRegionFoam
About chtMultiRegionFoam
 
Inter flowによる軸対称milkcrown解析
Inter flowによる軸対称milkcrown解析Inter flowによる軸対称milkcrown解析
Inter flowによる軸対称milkcrown解析
 
OpenFOAMソルバの実行時ベイズ最適化
OpenFOAMソルバの実行時ベイズ最適化OpenFOAMソルバの実行時ベイズ最適化
OpenFOAMソルバの実行時ベイズ最適化
 
Paraviewの等高面を綺麗に出力する
Paraviewの等高面を綺麗に出力するParaviewの等高面を綺麗に出力する
Paraviewの等高面を綺麗に出力する
 
Mixer vessel by cfmesh
Mixer vessel by cfmeshMixer vessel by cfmesh
Mixer vessel by cfmesh
 
流体解析入門者向け超初級講習会
流体解析入門者向け超初級講習会流体解析入門者向け超初級講習会
流体解析入門者向け超初級講習会
 
Free cad 0.19.2 and cfdof (Japanese Ver.)
Free cad 0.19.2 and cfdof (Japanese Ver.)Free cad 0.19.2 and cfdof (Japanese Ver.)
Free cad 0.19.2 and cfdof (Japanese Ver.)
 
Turbulence Models in OpenFOAM
Turbulence Models in OpenFOAMTurbulence Models in OpenFOAM
Turbulence Models in OpenFOAM
 
OpenFOAMの混相流用改造solver(S-CLSVOF法)の設定・使い方
OpenFOAMの混相流用改造solver(S-CLSVOF法)の設定・使い方OpenFOAMの混相流用改造solver(S-CLSVOF法)の設定・使い方
OpenFOAMの混相流用改造solver(S-CLSVOF法)の設定・使い方
 
DEXCS2022OF_Install.pdf
DEXCS2022OF_Install.pdfDEXCS2022OF_Install.pdf
DEXCS2022OF_Install.pdf
 
Helyx os dexcs2020
Helyx os dexcs2020Helyx os dexcs2020
Helyx os dexcs2020
 
OpenFOAM の境界条件をまとめよう!
OpenFOAM の境界条件をまとめよう!OpenFOAM の境界条件をまとめよう!
OpenFOAM の境界条件をまとめよう!
 

Similar to Runtime Performance Optimizations for an OpenFOAM Simulation

Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...Fisnik Kraja
 
Benchmark Analysis of Multi-core Processor Memory Contention April 2009
Benchmark Analysis of Multi-core Processor Memory Contention April 2009Benchmark Analysis of Multi-core Processor Memory Contention April 2009
Benchmark Analysis of Multi-core Processor Memory Contention April 2009James McGalliard
 
Understanding and Measuring I/O Performance
Understanding and Measuring I/O PerformanceUnderstanding and Measuring I/O Performance
Understanding and Measuring I/O PerformanceGlenn K. Lockwood
 
Configuration Optimization for Big Data Software
Configuration Optimization for Big Data SoftwareConfiguration Optimization for Big Data Software
Configuration Optimization for Big Data SoftwarePooyan Jamshidi
 
Parallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPParallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPAnil Bohare
 
Application Profiling at the HPCAC High Performance Center
Application Profiling at the HPCAC High Performance CenterApplication Profiling at the HPCAC High Performance Center
Application Profiling at the HPCAC High Performance Centerinside-BigData.com
 
Poly3D Case Study: The Impact of Cache Misses on Performance of CPU-Intensive...
Poly3D Case Study: The Impact of Cache Misses on Performance of CPU-Intensive...Poly3D Case Study: The Impact of Cache Misses on Performance of CPU-Intensive...
Poly3D Case Study: The Impact of Cache Misses on Performance of CPU-Intensive...terraborealis
 
Building A Linux Cluster Using Raspberry PI #2!
Building A Linux Cluster Using Raspberry PI #2!Building A Linux Cluster Using Raspberry PI #2!
Building A Linux Cluster Using Raspberry PI #2!A Jorge Garcia
 
Large-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC WorkloadsLarge-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC Workloadsinside-BigData.com
 
Top 10 Supercomputers With Descriptive Information & Analysis
Top 10 Supercomputers With Descriptive Information & AnalysisTop 10 Supercomputers With Descriptive Information & Analysis
Top 10 Supercomputers With Descriptive Information & AnalysisNomanSiddiqui41
 
Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Ahsan Javed Awan
 
What’s eating python performance
What’s eating python performanceWhat’s eating python performance
What’s eating python performancePiotr Przymus
 
Understanding Android Benchmarks
Understanding Android BenchmarksUnderstanding Android Benchmarks
Understanding Android BenchmarksKoan-Sin Tan
 
A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersIntel® Software
 
Early Successes Debugging with TotalView on the Intel Xeon Phi Coprocessor
Early Successes Debugging with TotalView on the Intel Xeon Phi CoprocessorEarly Successes Debugging with TotalView on the Intel Xeon Phi Coprocessor
Early Successes Debugging with TotalView on the Intel Xeon Phi CoprocessorIntel IT Center
 
Analysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific ApplicationsAnalysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific ApplicationsJames McGalliard
 

Similar to Runtime Performance Optimizations for an OpenFOAM Simulation (20)

Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
 
Benchmark Analysis of Multi-core Processor Memory Contention April 2009
Benchmark Analysis of Multi-core Processor Memory Contention April 2009Benchmark Analysis of Multi-core Processor Memory Contention April 2009
Benchmark Analysis of Multi-core Processor Memory Contention April 2009
 
Understanding and Measuring I/O Performance
Understanding and Measuring I/O PerformanceUnderstanding and Measuring I/O Performance
Understanding and Measuring I/O Performance
 
Configuration Optimization for Big Data Software
Configuration Optimization for Big Data SoftwareConfiguration Optimization for Big Data Software
Configuration Optimization for Big Data Software
 
Parallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPParallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMP
 
Application Profiling at the HPCAC High Performance Center
Application Profiling at the HPCAC High Performance CenterApplication Profiling at the HPCAC High Performance Center
Application Profiling at the HPCAC High Performance Center
 
Poly3D Case Study: The Impact of Cache Misses on Performance of CPU-Intensive...
Poly3D Case Study: The Impact of Cache Misses on Performance of CPU-Intensive...Poly3D Case Study: The Impact of Cache Misses on Performance of CPU-Intensive...
Poly3D Case Study: The Impact of Cache Misses on Performance of CPU-Intensive...
 
The Computing Continuum.pdf
The Computing Continuum.pdfThe Computing Continuum.pdf
The Computing Continuum.pdf
 
Building A Linux Cluster Using Raspberry PI #2!
Building A Linux Cluster Using Raspberry PI #2!Building A Linux Cluster Using Raspberry PI #2!
Building A Linux Cluster Using Raspberry PI #2!
 
Large-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC WorkloadsLarge-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC Workloads
 
Top 10 Supercomputers With Descriptive Information & Analysis
Top 10 Supercomputers With Descriptive Information & AnalysisTop 10 Supercomputers With Descriptive Information & Analysis
Top 10 Supercomputers With Descriptive Information & Analysis
 
01-06 OCRE Test Suite - Fernandes.pdf
01-06 OCRE Test Suite - Fernandes.pdf01-06 OCRE Test Suite - Fernandes.pdf
01-06 OCRE Test Suite - Fernandes.pdf
 
Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...
 
Ch1
Ch1Ch1
Ch1
 
Ch1
Ch1Ch1
Ch1
 
What’s eating python performance
What’s eating python performanceWhat’s eating python performance
What’s eating python performance
 
Understanding Android Benchmarks
Understanding Android BenchmarksUnderstanding Android Benchmarks
Understanding Android Benchmarks
 
A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing Clusters
 
Early Successes Debugging with TotalView on the Intel Xeon Phi Coprocessor
Early Successes Debugging with TotalView on the Intel Xeon Phi CoprocessorEarly Successes Debugging with TotalView on the Intel Xeon Phi Coprocessor
Early Successes Debugging with TotalView on the Intel Xeon Phi Coprocessor
 
Analysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific ApplicationsAnalysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific Applications
 

Recently uploaded

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 

Recently uploaded (20)

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 

Runtime Performance Optimizations for an OpenFOAM Simulation

  • 1. science + computing ag IT Service and Software Solutions for Complex Computing Environments Tuebingen | Muenchen | Berlin | Duesseldorf Runtime Performance Optimizations for an OpenFOAM Simulation Dr. Oliver Schröder HPC Services and Software 21 June 2014 HPC Advisory Council European Conference 2014
  • 2. Many Thanks to… © 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de  Applications and Performance Team  Madeleine Richards  Rafael Eduardo Escovar Mendez  HPC Services and Software Team  Fisnik Kraja  Josef Hellauer 2
  • 3. Overview  Introduction  The Test Case  The Benchmarking Environment  Initial Results (the starting point)  Dependency Analysis • CPU Frequency • Interconnect • Memory Hierarchy  Performance Improvements • MPI Communications  Tuning MPI Routines  Intel MPI vs. Bull MPI • Domain Decomposition  Conclusions © 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de 3
  • 4. Introduction © 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de  How can we improve application performance? • By using the resources efficiently • By selecting the appropriate resources  In order to do this we have to: • Analyze the behavior of the application • And then tune the runtime environment accordingly  In this study we: • Analyze the dependencies of an OpenFOAM Simulation • Improve the performance of OpenFOAM by • Tuning MPI communications • Selecting the appropriate domain decomposition 4
  • 5. The Test Environment © 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de 5 Compute Cluster Sid Robin Hardware Processor Intel E5-2680 V2 Ivy-Bridge Intel E5-2697 V2 Ivy-Bridge Frequency 2.80 GHz 2.70 GHz Cores per processor 10 12 Sockets per node 2 2 Cores per nodes 20 24 Cache 25 MB (L3), 10x256 KB (L2) 30 MB (L3), 12 x 256 KB (L2) Memory 64 GB 64 GB Frequency of memory 1866 MHz 1866 MHz IO – FS NFS over IB NFS over IB Interconnect IB FDR IB FDR Software OpenFOAM 2.2.2 2.2.2 Intel C Compiler 13.1.3.192 13.1.3.192 GNU C Compiler 4.7.3 4.7.3 Compiler Options -O3 –fPIC –m64 -O3 –fPIC –m64 Intel MPI 4.1.0.030 4.1.0.030 Bull MPI 1.2.4.1 1.2.4.1 SLURM 2.6.0 2.5.0 OFED 1.5.4.1 1.5.4.1
  • 6. Initial Results on SID (E5-2680v2) (1)  GCC(IMPI) vs. ICC (IMPI)  Better performance is obtained with the OpenFOAM version compiled with GCC  During the development and optimization of OpenFOAM mainly GCC is used  8 vs. 10 Tasks / CPU  Better Performance with 8 Tasks/CPU. The reasons might be:  With fewer tasks we use fewer cores in each CPU, having so more cache and more memory bandwidth per core.  The domain decomposition is different. We will see later the impact of Domain Decomposition on Performance © 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de 6 256 Tasks 512 Tasks 768 Tasks 16 Nodes 32 Nodes 48 Nodes GCC 1767 1019 869 ICC 1806 1080 884 0 200 400 600 800 1000 1200 1400 1600 1800 2000 ClockTime[s] GCC vs. ICC 16 Nodes 32 Nodes 48 Nodes 8 Tasks/CPU 1767 1019 869 10 Tasks/CPU 1799 1116 1010 0 200 400 600 800 1000 1200 1400 1600 1800 2000 ClockTime[s] 8 vs. 10 Tasks /CPU
  • 7. Initial Results on SID (E5-2680v2) (2) © 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de 7 320 Tasks 640 Tasks 960 Tasks 16 Nodes 32 Nodes 48 Nodes No_CPU_Binding 1801 1121 1007 With_CPU_BINDING 1799 1116 1010 0 200 400 600 800 1000 1200 1400 1600 1800 2000 ClockTime[s] OpenFOAM: CPU Binding DAPL Fabric OFA Fabric zone reclaim huge pages hyperthrea ding turbo boost 16 Nodes 16 Nodes 16 Nodes 16 Nodes 16 Nodes 16 Nodes 320 Tasks 320 Tasks 320 Tasks 320 Tasks 640 Tasks 320 Tasks Clock Time [s] 1804 1845 1802 1807 2229 1750 0 500 1000 1500 2000 2500 Seconds OpenFOAM Tests on 16 Nodes 0,0 500,0 1000,0 1500,0 2000,0 2500,0 3000,0 3500,0 4000,0 120 Tasks 240 Tasks 480 Tasks 960 Tasks 6 Nodes 12 Nodes 24 Nodes 48 Nodes Time[s] Example results with Star-CCM+ (cb=cpu_bind, zr=zone reclaim, thp=transparent huge pages, trb=turbo, ht=hyperthreading) Opt: none Opt: cb Opt: cb,zr Opt: cb,zr,thp Opt: cb,zr,thp,trb Opt: cb,rz,thp,ht
  • 8. Initial Results on SID (E5-2680v2) (3)  CPU Binding  With other applications we often observed performance improvements when we bind the Tasks to specific CPUs (especially when Hyperthreading=ON)  This seems not to be the case for this OpenFOAM simulation  We applied also other optimization techniques but we observed improvement only in the case of Turbo Boost.  Actually with Turbo Boost we expected more than 3% improvement.  We really need to analyze the dependencies of OpenFOAM in order to understand this behavior. © 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de 8
  • 9. Overview  Introduction  The Test Case  The Benchmarking Environment  Initial Results (the starting point)  Dependency Analysis • CPU Frequency • Interconnect • Memory Hierarchy  Performance Improvements • MPI Communications  Tuning MPI Routines  Intel MPI vs. Bull MPI • Domain Decomposition  Conclusions © 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de 9
  • 10. CPU Comparison E5-2680v2 vs. E5-2697v2 © 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de  Core-to-Core Comparison  16 nodes: better on E5-2697v2 CPUs considering the additional 5 MB of L3 cache  24 nodes: very small differences – MPI overhead hides the benefits from the cache  Node-to-Node Comparison  16 nodes: 2% better on E5-2697v2 CPUs  24 nodes: <1% better on E5-2697v2 CPUs 10 256 Tasks 320 Tasks 384 Tasks 480 Tasks 16 Nodes 16 Nodes 24 Nodes 24 Nodes E5-2680v2 on SID 1912 1925 1364 1345 E5-2697v2 on Robin 1849 1803 1355 1344 0 500 1000 1500 2000 2500 ClockTime[s] Core-to-Core Comparison OpenFOAM @ 2.4 GHz 16 Nodes 24 Nodes 10c@2.8GHz - E5-2680v2 1799 1294 12c@2.7GHz - E5-2697v2 1765 1284 0 200 400 600 800 1000 1200 1400 1600 1800 2000 ClockTime[s] Node-to-Node Comparison OpenFOAM on Fully Populated Nodes
  • 11. OpenFOAM Dependency on CPU Frequency Tests on SID B71010c (E5-2680v2)  Dependency on the CPU frequency only modest, approx 45%  Beyond 2.5 GHz there is not much improvement © 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de 11 y = 69878x-0,463 R² = 0,9618 y = 36256x-0,441 R² = 0,9697 y = 25120x-0,408 R² = 0,9702 0 500 1000 1500 2000 2500 3000 1000 1500 2000 2500 3000 WallclocktimeinSeconds CPU frequency in MHz 16 Nodes - 320 Tasks 32 Nodes - 640 Tasks 48 Nodes - 960 Tasks Pot.(16 Nodes - 320 Tasks) Pot.(32 Nodes - 640 Tasks) Pot.(48 Nodes - 960 Tasks)
  • 12. OpenFOAM Dependency on Interconnect Compact Task Placement vs. Fully Populated © 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de  With the compact task placement we bind all 10 tasks on one of the CPUs in each node to make sure that cache size and memory bandwidth stay the same. In this way we give to each task more interconnect bandwidth.  For tests with 320 tasks we obtain 3% improvement  For tests with 640 tasks we obtain 11% improvement  Conclusion:  Dependency on Interconnect Bandwidth increases with the number of tasks (nodes) 12 320 Tasks 640 Tasks Speedup 1,03 1,11 1,00 1,02 1,04 1,06 1,08 1,10 1,12 Compact vs. Fully Populated  Test  Fully Populated: 20 Tasks/Node  320 Tasks on 16 Nodes  640 Tasks on 32 Nodes  Compact: 10 Tasks/Node (2x #Nodes)  320 Tasks on 32 Nodes  640 Tasks on 64 Nodes
  • 13. OpenFOAM Dependency on Memory Hierarchy Scatter vs. Compact Task Placement  The dependency on the memory hierarchy is bigger when running OpenFOAM on a small number of nodes. On more nodes, the bottleneck is not anymore the memory hierarchy but the interconnect.  The impact of the L3 cache stays almost the same when doubling the number of nodes. However, the impact coming from the memory throughput is lower. © 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de 13 28% 4% 22% 24% 0% 10% 20% 30% 40% 50% 60% 320 Tasks 640 Tasks 32 Nodes 64 Nodes Improvement in Memory Bandwidth Improvement from the L3 Cache 1,50 1,28 1,00 1,10 1,20 1,30 1,40 1,50 1,60 320 Tasks 640 Tasks 32 Nodes 64 Nodes Speedup
  • 14. OpenFOAM Dependency on Memory Hierarchy Scatter vs. Compact Task Placement  We tested OpenFOAM on Lustre with various  Stripe Counts  Stripe Sizes  We could not obtain any improvement © 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de 14 0 200 400 600 800 1000 1200 1400 1600 1800 2000 16 32 48 ClockTime[s] Number of Nodes (20 Tasks/Node) 1 Stripe of 1 MB 2 Stripes of 1 MB 4 Stripes of 1 MB 6 Stripes of 1 MB 1 Stripe of 2 MB 1 Stripe of 4 MB 1 Stripe of 6MB
  • 15. Overview  Introduction  The Test Case  The Benchmarking Environment  Initial Results (the starting point)  Dependency Analysis • CPU Frequency • Interconnect • Memory Hierarchy  Performance Improvements • MPI Communications  Tuning MPI Routines  Intel MPI vs. Bull MPI • Domain Decomposition  Conclusions © 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de 15
  • 16. MPI Profiling (with Intel MPI) Tests on SID B71010c (E5-2680v2)  The portion of time spent in MPI increases with the number of nodes  Looking at the MPI Calls  ~ 50 % of MPI Time is spent on MPI_Allreduce  ~ 30 % of MPI Time is spent on MPI_Waitall © 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de 16 76% 58% 42% 23,5% 41,6% 57,9% 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 320 Tasks 640 Tasks 960 Tasks 16 Nodes 32 Nodes 48 Nodes MPI User 48,6% 53,5% 51,2% 31,9% 29,6% 28,4% 15,1% 11,7% 14,0% 4,4% 5,2% 6,4% 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 320 Tasks 640 Tasks 960 Tasks 16 Nodes 32 Nodes 48 Nodes MPI_Allreduce MPI_Waitall MPI_Recv Other
  • 17. MPI All Reduce Tuning (with Intel MPI) Tests on SID B71010c (E5-2680v2)  The best All_Reduce algorithm:  (5) Binominal Gather + Scatter  Overall obtained improvement  None on 16 nodes – 320 Tasks  11% on 32 Nodes – 640 Tasks © 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de 17 1. Recursive doubling algorithm 2. Rabenseifner's algorithm 3. Reduce + Bcast algorithm 4. Topology aware Reduce + Bcast algorithm 5. Binomial gather + scatter algorithm 6. Topology aware binominal gather + scatter algorithm 7. Shumilin's ring algorithm 8. Ring algorithm 1797 1893 2075 2232 1864 1801 1860 2208 2813 0 1000 2000 3000 4000 5000 6000 7000 8000 default 1 2 3 4 5 6 7 8 ClockTime[s] MPI All Reduce Algorithms 16 Nodes - 320 Tasks 1117 1210 1597 4096 1490 1004 1526 3985 7373 0 1000 2000 3000 4000 5000 6000 7000 8000 default 1 2 3 4 5 6 7 8 ClockTime[s] MPI All Reduce Algorithms 32 Nodes - 640 Tasks
  • 18. Finding the Right Domain Decomposition Tests on Robin with E6-2697v2 CPUs  We obtain  10 - 14% improvement on 16 nodes  3 – 5 % improvement on 24 nodes  This means that decomposition improves data locality impacting thus the intra node communication, but it cannot have a high impact when the inter-node communication overhead increases. © 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de 18 1765 1546 1550 1592 1,000 1,142 1,139 1,109 0,800 0,900 1,000 1,100 1,200 1,300 1,400 1,500 0 200 400 600 800 1000 1200 1400 1600 1800 2000 2 2 2 2 3 2 4 4 24 2 3 4 4 6 8 8 2 96 4 6 4 8 4 24 4 12 8 6 8 8 6 8 96 2 48 32 32 24 24 4 4 16 16 16 12 8 8 6 Speedup ClockTime[s] Domain Decomposition (Bottom Up: Y Z X) OpenFOAM on 16 Nodes with 24 Tasks/Node Time [s] Speedup 1284 1246 1224 1249 1,000 1,030 1,049 1,028 0,900 0,950 1,000 1,050 1,100 1,150 1,200 1,250 1,300 0 200 400 600 800 1000 1200 1400 2 2 2 3 2 4 2 3 2 3 4 2 4 3 4 6 4 6 8 2 4 6 4 8 4 9 6 12 8 6 16 8 12 9 6 12 8 8 144 72 48 48 36 36 32 32 24 24 24 18 18 16 16 16 12 12 9 Speeedup ClockTime[s] Domain Decomposition (Bottom Up : Y Z X) OpenFOAM on 24 Nodes (24 Tasks/Node) Time [s] Speedup
  • 19. Bull MPI vs. Intel MPI Tests on Robin with E6-2697v2 CPUs  Bull MPI is based on OpenMPI  It has been enhanced with many features such as:  effective abnormal pattern detection  network-aware collective operations  multi-path network failover.  It has been designed to boost the performance of MPI applications thanks to full integration with  bullx PFS (based on Lustre)  bullx BM (based on SLURM) © 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de 19 0 200 400 600 800 1000 1200 1400 1600 1800 2000 X=2 X=2 X=2 X=2 Z=2 Z=6 Z=2 Z=9 Y=96 Y=32 Y=144 Y=32 384 Tasks 576 Tasks 16 Nodes 24 Nodes Intel MPI Bull MPI 6.2% 3.2% 1.4% 3.9%
  • 20. Conclusions © 2014 science + computing agDr. Oliver Schröder | o.schroeder@science-computing.de 1. CPU Frequency • OpenFOAM shows a dependency under 50% on the CPU Frequency 2. Cache Size • OpenFOAM is dependent on the Cache Size per Core 3. Limiting factor changes as number of nodes change: 1. Memory Bandwidth: OpenFOAM is memory bandwidth dependent on a small number of nodes 2. Communication: MPI Time increases substantially with increasing number of nodes 4. Tuning MPI All Reduce Operations • Up to 11 % improvement can be obtained 5. Domain Decomposition • Finding the right mesh dimensions is very important 6. File System • We could not obtain better results with different stripe counts or size in Lustre 20
  • 21. Thank you for your attention. science + computing ag www.science-computing.de Oliver Schröder Email: o.schroeder@science-computing.de