
A challenge for thread parallelism on OpenFOAM

Presentation material from the 7th OpenFOAM Conference, held 15–17 October 2019.



  1. 1. A challenge for thread parallelism on OpenFOAM YOSHIFUJI Naoki* TOMIOKA Minoru FUJIWARA Ko SAWAHARA Masataka ITO Yuki MARUISHI Takafumi Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation
  2. 2. Who we are Japanese software company – Accelerating customers' software – In any area, on any device Professionals in software speedup – Not a manufacturer using CAE software – Not a CAE software developer 2 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation
  3. 3. Who I am Computational Civil Engineer – Lead Engineer @ Solution Div., Fixstars Corporation – Doctoral student @ Coastal and Ocean Lab., Nagoya University Interests and profession – High performance computing (HPC) – Computational Fluid Dynamics (CFD) – Speeding up software (from SoC to supercomputer) 3 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation Name: YOSHIFUJI Naoki / 𠮷藤 尚生 Online ID: @LWisteria Email: yoshifuji@fixstars.com – Feel free to contact me about anything
  4. 4. What we've done: x13.5 speedup 1. Abstract • Case: OpenFOAM Benchmark Test case "channelReTau110" provided by The Open CAE Society of Japan • Solver: pimpleFOAM with DIC-PCG • Base OpenFOAM version: the Foundation version 16b559c1 • Average time of the first five steps • Computer: Intel Ninja Developer Platform (Intel Xeon Phi 7210, DDR4) 4 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation
  5. 5. Abstract 5 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation 1. Experimental implementation with OpenMP 2. Target = pimpleFoam for the channel flow benchmark 3. Solver = DIC-PCG, one of the most challenging cases for thread parallelism 4. Measured speedup factor is x13.5 (without the CM method) over a single process on Intel Knights Landing 5. The potential of thread parallelism is shown in this study 6. Improvements and investigation of other cases will continue in the future 1. Abstract
  6. 6. Table of Contents 1. Abstract 2. Background and motivation 3. Parallel methodology 4. Performance measurement 5. Conclusion and future work 6 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation
  7. 7. Table of Contents 1. Abstract 2. Background and motivation 3. Parallel methodology 4. Performance measurement 5. Conclusion and future work 2. Background and motivation 7 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation
  8. 8. Modern engineering and OpenFOAM • All product designers and engineers need CAE • OpenFOAM is one of the most widely used CAE software packages • Speeding up OpenFOAM is important in modern engineering 2. Background and motivation 8 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation
  9. 9. OpenFOAM is slow on modern computers 2. Background and motivation 9 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation Quoted from Imano (2017): "OpenFOAMによる流体解析ベンチマークテスト FOCUS・クラウド・スパコンでのチャネルおよびボックスファン流れ解析", 第17回PCクラスタシンポジウム, p.19. Copyright 2017 OCAEL All rights reserved.
  10. 10. Difference of computers 2. Background and motivation 10 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation
      |                                               | Ancient computer | Modern computer |
      | Num. of CPU cores                             | Single / a few   | Many            |
      | Num. of computer nodes                        | A few            | Many            |
      | CPU speed relative to network speed           | Low              | High            |
      | i.e. Num. of MPI processes                    | A few            | Massive         |
      | MPI management cost vs. arithmetic operations | Light            | Heavy           |
      | MPI communication cost vs. arithmetic operations | Light         | Heavy           |
  11. 11. Solution: Parallelism 2. Background and motivation 11 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation
      |                      | Current OpenFOAM | This study           |
      | Framework            | MPI              | OpenMP               |
      | Mechanism            | Process          | Thread               |
      | Data communication   | Socket           | Shared memory        |
      | Target               | All inter-core   | Within the same node |
      | i.e. Management cost | Heavy            | Light                |
      | Communication cost   | Heavy            | Light                |
      → Using OpenMP could speed up OpenFOAM
  12. 12. Our goal 2. Background and motivation 12 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation 1. Implement thread parallelism with OpenMP for the intra-node parallelism (hybrid parallel) 2. Measure the performance improvement 3. Share the implementation and results with the world [Diagram: CPU cores grouped into nodes — OpenMP within each node, MPI between nodes]
  13. 13. Our goal in this study 2. Background and motivation 13 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation This study shows the progress and an intermediate (incomplete) result pimpleFOAM & DIC-PCG only – to estimate the worst-case improvement Only a single node – little MPI cost – expected to be as fast as flat MPI • possibly faster on multiple nodes
  14. 14. Extra motivation for our business 2. Background and motivation 14 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation Outreach to customers – Provide an example of Fixstars' work – We're happy if you place an order with us to speed up your software • https://www.fixstars.com/en/service/acceleration/ Employee training – Provide an exercise for Fixstars engineers – A CPU-only problem is good for beginners
  15. 15. Table of Contents 1. Abstract 2. Background and motivation 3. Parallel methodology 4. Performance measurement 5. Conclusion and future work 3. Parallel methodology 15 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation
  16. 16. Target in this study 16 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation Speed up / parallelize the solution of sparse linear equations – generally known to take the largest part of CFD runtime Solver: DIC-PCG – Diagonal Incomplete Cholesky preconditioner – Preconditioned Conjugate Gradient Many-core CPU, only one node A challenge to one of the hardest cases for thread parallelism – to estimate the worst-case improvement 3. Parallel methodology
  17. 17. Components of DIC-PCG 17 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation 3. Parallel methodology Amul: Sparse matrix-vector multiply (SpMV) DIC precondition WAXPBY: Vector-vector addition sumMag: Sum of the absolute values of a vector's elements sumProd: Vector inner product DIC-PCG consists only of matrix/vector operations – they are element-independent, thus easy to parallelize (in principle)
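The following is a minimal, schematic sketch (not the actual OpenFOAM code) of one preconditioned CG iteration, only to show where the kernels listed above appear. The function pcgStep and the identity stubs for Amul and precondition are hypothetical placeholders.

```cpp
// Schematic sketch of one PCG iteration; Amul and precondition are stubbed.
#include <vector>
#include <cstddef>

using Vec = std::vector<double>;

double sumProd(const Vec& a, const Vec& b) {          // vector inner product
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}
void waxpby(double alpha, const Vec& x, double beta, const Vec& y, Vec& w) {
    for (std::size_t i = 0; i < x.size(); ++i) w[i] = alpha * x[i] + beta * y[i];
}
void Amul(Vec& Ap, const Vec& p) { Ap = p; }           // SpMV, stubbed as identity
void precondition(Vec& z, const Vec& r) { z = r; }     // DIC precondition, stubbed

// One CG iteration: x = solution, r = residual, p = search direction.
// (The very first iteration would set p = z instead of using rho_old.)
void pcgStep(Vec& x, Vec& r, Vec& p, double& rho_old) {
    Vec z(r.size()), Ap(r.size());
    precondition(z, r);                 // DIC precondition
    double rho = sumProd(r, z);         // reduction
    double beta = rho / rho_old;
    waxpby(1.0, z, beta, p, p);         // p = z + beta*p   (WAXPBY)
    Amul(Ap, p);                        // SpMV
    double alpha = rho / sumProd(p, Ap);
    waxpby(1.0, x, alpha, p, x);        // x += alpha*p
    waxpby(1.0, r, -alpha, Ap, r);      // r -= alpha*Ap
    rho_old = rho;
}
```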
  18. 18. Parallelization of DIC-PCG with OpenMP 18 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation Difficult – lduMatrix format for Amul – DIC’s substitution operation Easy – Elementwise operation • WAXPBY – Parallel reduction • sumMag • sumProd 3. Parallel methodology
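A minimal sketch of how the "easy" kernels can be parallelized with OpenMP: element-wise loops get a plain parallel for, and the two sums use OpenMP reduction clauses. These are hypothetical free functions for illustration, not the actual OpenFOAM member functions modified in this work.

```cpp
#include <cmath>

// WAXPBY: w[i] = a*x[i] + b*y[i] -- independent elements, plain parallel for.
void waxpby(long n, double a, const double* x, double b, const double* y, double* w)
{
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) w[i] = a * x[i] + b * y[i];
}

// sumMag: sum of absolute values -- OpenMP reduction clause.
double sumMag(long n, const double* v)
{
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (long i = 0; i < n; ++i) s += std::fabs(v[i]);
    return s;
}

// sumProd: inner product -- also a reduction.
double sumProd(long n, const double* a, const double* b)
{
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (long i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}
```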
  19. 19. lduMatrix SpMV parallelization 19 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation 3. Parallel methodology
      #pragma omp parallel for
      for (label face=0; face<nFaces; face++)
      {
          ApsiPtr[uPtr[face]] += lowerPtr[face]*psiPtr[lPtr[face]];
          ApsiPtr[lPtr[face]] += upperPtr[face]*psiPtr[uPtr[face]];
      }
      src/OpenFOAM/matrices/lduMatrix/lduMatrix/lduMatrixATmul.C::Foam::lduMatrix::Amul()
      [Figure: example matrix with face addressing — face: 0, 1, 2; lPtr: 0, 0, 2; uPtr: 2, 3, 3]
      Dependency among faces – data race (different faces may write to the same element at the same time) → the lduMatrix face loop cannot be parallelized naively
  20. 20. lduMatrix to CSR format 20 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation Compressed Sparse Row (CSR) Widely used sparse matrix format 3. Parallel methodology
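Below is a hedged sketch of one way the lduMatrix addressing (diag/upper/lower with lowerAddr/upperAddr, see the appendix slides) can be converted to CSR. The helper lduToCsr and the Csr struct are hypothetical names; the actual conversion used in this work is not shown in the slides. Column indices within each row are left unsorted here, which is sufficient for SpMV.

```cpp
#include <vector>

struct Csr {
    std::vector<int>    offset;  // row start positions, size nCells+1
    std::vector<int>    column;  // column index per stored element
    std::vector<double> data;    // value per stored element
};

Csr lduToCsr(int nCells, int nFaces,
             const int* lowerAddr, const int* upperAddr,
             const double* diag, const double* lower, const double* upper)
{
    Csr m;
    m.offset.assign(nCells + 1, 0);

    // Count entries per row: one diagonal per cell, one upper entry in row
    // lowerAddr[f], one lower entry in row upperAddr[f].
    for (int c = 0; c < nCells; ++c) m.offset[c + 1] += 1;
    for (int f = 0; f < nFaces; ++f) {
        m.offset[lowerAddr[f] + 1] += 1;
        m.offset[upperAddr[f] + 1] += 1;
    }
    for (int c = 0; c < nCells; ++c) m.offset[c + 1] += m.offset[c];  // prefix sum

    m.column.resize(m.offset[nCells]);
    m.data.resize(m.offset[nCells]);

    std::vector<int> pos(m.offset.begin(), m.offset.end() - 1);  // next free slot per row
    for (int c = 0; c < nCells; ++c) {                           // diagonal entries
        m.column[pos[c]] = c;
        m.data[pos[c]++] = diag[c];
    }
    for (int f = 0; f < nFaces; ++f) {
        int r = lowerAddr[f], c = upperAddr[f];                  // upper element at (r, c)
        m.column[pos[r]] = c;
        m.data[pos[r]++] = upper[f];
        m.column[pos[c]] = r;                                    // lower element at (c, r)
        m.data[pos[c]++] = lower[f];
    }
    return m;
}
```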
  21. 21. DIC preconditioner 21 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation 3. Parallel methodology
      #pragma omp parallel for
      for (label face=0; face<nFaces; face++)
      {
          wAPtr[uPtr[face]] -= rDPtr[uPtr[face]]*upperPtr[face]*wAPtr[lPtr[face]];
      }
      src/OpenFOAM/matrices/lduMatrix/preconditioners/DICPreconditioner/DICPreconditioner.C::Foam::DICPreconditioner::precondition()
      [Figure: the same example matrix with face addressing — face: 0, 1, 2; lPtr: 0, 0, 2; uPtr: 2, 3, 3]
      Substitution phase (forward): wA(face=2) uses wA(face=0) – data race (the result would change) → cannot be parallelized naively
  22. 22. Cuthill-McKee ordering 22 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation 3. Parallel methodology Example: 4-point stencil on a regular 3×3 mesh (cells numbered 0–8) [Figure: the mesh and the corresponding matrix — 4 on the diagonal, 1 for each neighbouring cell]
  23. 23. Cuthill-McKee ordering 23 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation 3. Parallel methodology [Figure: the same mesh and matrix, with the dependencies between neighbouring cells highlighted]
  24. 24. Cuthill-McKee ordering 24 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation 3. Parallel methodology [Figure: the same mesh and matrix, cells grouped into colors (levels)] Cells are independent within each color → cells of the same color can be processed in parallel
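As a concrete illustration of that statement, here is a minimal sketch (not the actual implementation from this work) of the DIC forward-substitution sweep executed level by level after a Cuthill-McKee-style reordering/coloring. The inputs are assumptions for the sketch: cells renumbered so that each level is contiguous, levelStart[l] giving the first cell of level l, and the strictly lower triangle stored per row in CSR form (lowOffset/lowColumn/lowData).

```cpp
// Level-scheduled forward substitution: levels stay sequential, cells within
// a level have no mutual dependency and can be processed with a parallel for.
void dicForwardByLevel(int nLevels, const int* levelStart,
                       const int* lowOffset, const int* lowColumn,
                       const double* lowData, const double* rD, double* wA)
{
    for (int level = 0; level < nLevels; ++level)            // sequential over levels
    {
        #pragma omp parallel for
        for (int row = levelStart[level]; row < levelStart[level + 1]; ++row)
        {
            double s = 0.0;
            // All lower neighbours of 'row' belong to earlier levels,
            // so their wA values are already final.
            for (int k = lowOffset[row]; k < lowOffset[row + 1]; ++k)
            {
                s += lowData[k] * wA[lowColumn[k]];
            }
            wA[row] -= rD[row] * s;                           // same update as the face loop above
        }
    }
}
```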
  25. 25. Parallelization of DIC-PCG 25 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation 3. Parallel methodology Amul: CSR format DIC precondition: Cuthill-McKee WAXPBY: parallel elementwise sumMag: parallel reduction sumProd: parallel reduction The whole of DIC-PCG can be parallelized
  26. 26. Table of Contents 1. Abstract 2. Background and motivation 3. Parallel methodology 4. Performance measurement 5. Conclusion and future work 4. Performance measurement 26 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation
  27. 27. Benchmark condition 27 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation Case: OpenFOAM Benchmark Test case "channelReTau110" provided by The Open CAE Society of Japan Solver: pimpleFOAM with DIC-PCG Base OpenFOAM version: the Foundation version 16b559c1 Computer: Intel Ninja Developer Platform (Intel Xeon Phi 7210, 256 logical cores, DDR4) Average time of the first five steps – a clock timer was inserted manually at the beginning and end of each measured function in the source code 4. Performance measurement
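For reference, a minimal sketch of the kind of manual timing instrumentation described above, using std::chrono; the ScopedTimer helper is hypothetical and not the code actually inserted into OpenFOAM.

```cpp
#include <chrono>
#include <cstdio>

// Prints the elapsed wall-clock time of the enclosing scope on destruction.
struct ScopedTimer {
    const char* name;
    std::chrono::steady_clock::time_point start;
    explicit ScopedTimer(const char* n)
        : name(n), start(std::chrono::steady_clock::now()) {}
    ~ScopedTimer() {
        const auto end = std::chrono::steady_clock::now();
        const double sec = std::chrono::duration<double>(end - start).count();
        std::fprintf(stderr, "%s: %.6f s\n", name, sec);  // accumulate/average externally
    }
};

// Usage: place one at the top of each measured function, e.g.
//   void Foam::lduMatrix::Amul(...) { ScopedTimer t("Amul"); /* ... */ }
```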
  28. 28. Single process result 28 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation 4. Performance measurement PCG is the largest part of the whole pimpleFOAM runtime Matrix operations for PCG – DIC::precondition – Amul [Pie chart: breakdown of single-process time — PCG, DIC, Amul]
  29. 29. Step-by-step speedup (0) 29 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation 4. Performance measurement The base version – same as previous page
  30. 30. Step-by-step speedup (1) 30 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation 4. Performance measurement Change to the CSR format – DIC became longer??
  31. 31. Step-by-step speedup (2) 31 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation 4. Performance measurement ldu-CSR format: divide the CSR matrix into – lower triangular – upper triangular – diagonal parts Went back to the original time – improved cache misses
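A sketch of what an SpMV over this split "ldu-CSR" layout could look like: a dense diagonal array plus separate CSR storage for the strictly lower and strictly upper triangles, accumulated per row. The TriCsr struct and the amulLduCsr function are hypothetical names for illustration only.

```cpp
// One triangle of the matrix in CSR form.
struct TriCsr {
    const int*    offset;   // row start positions (size n+1)
    const int*    column;   // column indices
    const double* data;     // values
};

// y = A*x with A stored as diagonal + lower-triangular CSR + upper-triangular CSR.
void amulLduCsr(int n, const double* diag,
                const TriCsr& lower, const TriCsr& upper,
                const double* x, double* y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        double yi = diag[i] * x[i];
        for (int k = lower.offset[i]; k < lower.offset[i + 1]; ++k)
            yi += lower.data[k] * x[lower.column[k]];
        for (int k = upper.offset[i]; k < upper.offset[i + 1]; ++k)
            yi += upper.data[k] * x[upper.column[k]];
        y[i] = yi;
    }
}
```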
  32. 32. Step-by-step speedup (3) 32 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation 4. Performance measurement Parallelize matrix op. – Amul – DIC precondition + Cuthill-McKee x2.0
  33. 33. Step-by-step speedup (4) 33 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation 4. Performance measurement Parallelize vector op. – WAXPBY – sumMag – sumProd x3.4
  34. 34. Step-by-step speedup (5) 34 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation 4. Performance measurement Change OpenMP setting – From 256 threads – To 64 threads x4.8
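A small sketch of how the thread count can be limited to 64 (one thread per physical core of the Xeon Phi 7210) instead of all 256 hardware threads; whether the study used the environment variable or the API call is not stated in the slides.

```cpp
// Either set OMP_NUM_THREADS=64 in the environment before launching the
// solver, or call the equivalent OpenMP API inside the program:
#include <omp.h>
#include <cstdio>

int main()
{
    omp_set_num_threads(64);  // limit to 64 threads
    #pragma omp parallel
    {
        #pragma omp single
        std::printf("running with %d threads\n", omp_get_num_threads());
    }
    return 0;
}
```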
  35. 35. Achieved speedup without CM x13.5 35 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation 4. Performance measurement CM (Cuthill-McKee reordering) cost can be excluded – it is required only when the mesh changes (remeshing)
  36. 36. vs. flat MPI 36 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation Roughly half the speed of flat MPI
  37. 37. Speedup by each function 37 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation 4. Performance measurement
      | Function           | Single [s] | OpenMP [s] | Speedup factor |
      | Amul               | 60.0       | 3.0        | x19.8          |
      | DIC::precondition  | 80.1       | 7.9        | x10.1          |
      | WAXPBY             | 21.1       | 0.4        | x53.2          |
      | sumMag             | 6.1        | 0.1        | x44.6          |
      | sumProd            | 12.3       | 0.3        | x43.6          |
      | Cuthill-McKee      | 0.0        | 24.0       | ---            |
      | (other)            | 1.1        | 1.6        | x0.7           |
      | total              | 180.8      | 37.4       | x4.8           |
      | total excluding CM | 180.8      | 13.4       | x13.5          |
  38. 38. Why slower than MPI (1) 38 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation DIC was slow – Theoretical reason • DIC is hard to parallelize with threads – only a small number of threads can work in parallel – Implementation reason • the reordering of the input/output vectors could be reduced 4. Performance measurement
  39. 39. Why slower than MPI (2) 39 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation Expected: the number of iterations decreases with OpenMP – convergence with OpenMP is expected to be better than with MPI because OpenMP does not require domain decomposition – domain decomposition degrades convergence Actual: it did not decrease – the benchmark used already converges too well 4. Performance measurement
  40. 40. Table of Contents 1. Abstract 2. Background and motivation 3. Parallel methodology 4. Performance measurement 5. Conclusion and future work 5. Conclusion and future work 40 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation
  41. 41. Conclusion 41 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation Achieved – x13.5 speedup over a single process – about half the speed of flat MPI Conditions – channel flow benchmark with a regular mesh – DIC-PCG solver 5. Conclusion and future work A very difficult method on a very simple problem = even this worst condition shows only a ½ degradation (half the speed of flat MPI)
  42. 42. Future work 42 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation Speed up DIC – eliminate the vector reordering Simpler preconditioners / solvers – diagonal, GAMG More complicated benchmarks – motorBike, dam break Multi-node supercomputers – efficiency of reducing the number of MPI processes 5. Conclusion and future work Please look forward to our next work
  43. 43. return 0; Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation
  44. 44. lduMatrix format 44 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation Sparse matrix storage format Used by pimpleFoam (and also by many other solvers) Three parts – Upper triangular part U: column major – Lower triangular part L: row major – Diagonal part D Equivalent to the COO format 3. Parallel methodology
  45. 45. Example of lduMatrix 45 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation 3. Parallel methodology
      [Figure: 4x4 example matrix, split into diagonal D, upper triangle U and lower triangle L —
        row 0: 1 . 5 6 / row 1: . 2 . . / row 2: 8 . 3 7 / row 3: 9 . 10 4]
      diag = 1, 2, 3, 4 : values of the diagonal elements
      upper = 5, 6, 7 : values of the upper triangular elements
      lower = 8, 9, 10 : values of the lower triangular elements
      lowerAddr = 0, 0, 2 : column numbers of the lower elements, rows of the upper elements
      upperAddr = 2, 3, 3 : column numbers of the upper elements, rows of the lower elements
  46. 46. Example of CSR matrix 46 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation 3. Parallel methodology
      [Figure: the same 4x4 example matrix — row 0: 1 . 5 6 / row 1: . 2 . . / row 2: 8 . 3 7 / row 3: 9 . 10 4]
      data = [1, 5, 6, 2, 8, 3, 7, 9, 10, 4] : element values
      column = [0, 2, 3, 1, 0, 2, 3, 0, 2, 3] : element column numbers
      offset = [0, 3, 4, 7, 10] : start position of each row
  47. 47. CSR SpMV parallelization 47 Otherwise noted, available under GPL version 3; ©2019 Fixstars Corporation 3. Parallel methodology
      #pragma omp parallel for
      for (label i = 0; i < n; i++)
      {
          double y_i = 0.0;
          for (label index = offset[i]; index < offset[i + 1]; index++)
          {
              y_i += data[index] * x[column[index]];
          }
          y[i] = y_i;
      }
      Iterations are independent among i – never write to the same element (different i) → Can be parallelized
