OFC/NFOEC: GPU-based Parallelization of System Modeling

Stephan Pachnicke, 18.03.2013

Outline

• Motivation

• Numerical System Modeling

• GPU-Parallelization

• Comparison of Speedup and Accuracy

• Conclusion

© 2013 ADVA Optical Networking. All rights reserved.

Acknowledgments

The author would like to acknowledge the help and
contributions of

Adam Chachaj – Krone Messtechnik

Heinrich Müller – TU Dortmund

Peter Krummrich – TU Dortmund

Markus Roppelt – ADVA Optical Networking

Michael Eiselt – ADVA Optical Networking


Motivation


In Short: Computational Performance

Graphics Processing Unit (GPU) vs. CPU cluster


Increase in GFlop/s

• GPU performance is growing even faster than predicted by Moore's
law and is significantly higher than CPU performance

• GPUs are also attractive for general-purpose computing
(complex numerical simulations)


Optical System Modeling

• Simulating (long-haul) optical transmission systems requires a
numerical solution of the nonlinear Schrödinger equation

→ High computational effort: accurate simulation of nonlinear fiber
effects requires small step sizes

• Precise estimation of the bit error ratio requires Monte Carlo
simulations for PMD and noise

→ Requires a large number of simulated bits
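
As a rough guide to how many bits "a large number" means, the following estimate assumes direct error counting with statistically independent bit errors (an assumption made here for illustration, not stated on the slide). With true bit error ratio p, N simulated bits and a target relative standard error ε:

```latex
\frac{\sigma_{\hat p}}{p} = \sqrt{\frac{1-p}{N\,p}} \approx \frac{1}{\sqrt{N\,p}}
\quad\Longrightarrow\quad
N \approx \frac{1}{p\,\varepsilon^{2}}
```

For p = 10^-3 and ε = 0.1 this gives N ≈ 10^5 bits, i.e. about 100 counted errors.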


Split-Step Fourier Method (SSFM)
• Splits the nonlinear Schrödinger equation into linear and nonlinear parts
• The linear and nonlinear parts are solved separately

• The linear part is solved in the frequency domain and the nonlinear
part in the time domain (acceptable for small step sizes)

[Figure: chain of alternating FFT and IFFT operations, with one split-step marked]
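
The alternation between frequency and time domain can be sketched in a few lines. This is a minimal illustration, not the implementation behind the slides: it assumes a scalar, lossless fiber, the simple (unsymmetrized) first-order splitting, and the common sign convention dA/dz = -i(β2/2) d²A/dt² + iγ|A|²A. The function names and the fiber and pulse parameters are hypothetical, and a small pure-Python radix-2 FFT stands in for an optimized library.

```python
import cmath
import math

def fft(x, sign=-1):
    """Recursive radix-2 transform (length must be a power of two).
    sign=-1 is the forward FFT, sign=+1 the unscaled inverse."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], sign)
    odd = fft(x[1::2], sign)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft(x):
    """Inverse FFT including the 1/n scaling."""
    n = len(x)
    return [v / n for v in fft(x, sign=+1)]

def ssfm(a, dt, dz, n_steps, beta2, gamma):
    """Propagate the field samples `a` over n_steps split-steps of length dz."""
    n = len(a)
    # Angular frequencies in FFT bin order (non-negative bins first).
    omega = [2 * math.pi * (k if k < n // 2 else k - n) / (n * dt)
             for k in range(n)]
    # Linear (dispersion) operator for one step, applied in the frequency domain.
    lin = [cmath.exp(0.5j * beta2 * w * w * dz) for w in omega]
    for _ in range(n_steps):
        spectrum = fft(a)                                    # FFT
        a = ifft([s * h for s, h in zip(spectrum, lin)])     # linear part, IFFT
        # Nonlinear (Kerr) phase rotation, applied in the time domain.
        a = [v * cmath.exp(1j * gamma * abs(v) ** 2 * dz) for v in a]
    return a

# Hypothetical example: 10 ps Gaussian pulse, 1 mW peak power, 5 km of fiber.
n, dt = 256, 1e-12
t0, p0 = 10e-12, 1e-3
pulse = [math.sqrt(p0) * math.exp(-0.5 * ((k - n / 2) * dt / t0) ** 2)
         for k in range(n)]
out = ssfm(pulse, dt, dz=100.0, n_steps=50, beta2=-21.7e-27, gamma=1.3e-3)
energy_in = sum(abs(v) ** 2 for v in pulse)
energy_out = sum(abs(v) ** 2 for v in out)
print(energy_in, energy_out)
```

Each split-step costs one FFT and one IFFT, which is why FFT speed and FFT accuracy dominate the overall simulation. Because both operators only rotate phases in this lossless sketch, the pulse energy should be conserved up to rounding error.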

Speedup Factor (GPU vs CPU)

[Figure: speedup factor of GPU over CPU for single precision (SP) and double precision (DP)]

Legend
DP: Nvidia CUDA FFT
SP: FFT using pre-calculated twiddle factors

• Single precision arithmetic has much higher performance on GPUs
(because the main target market is computer gaming)

• Longer block lengths allow better parallelization

→ A single precision implementation is desirable


Accuracy (in single precision)

Legend
CUFFT: Nvidia CUDA FFT
FFTW: Fastest Fourier Transform in the West
IPP: Intel Integrated Performance Primitives
LUT: LUT-based FFT (trigonometric functions precalculated in DP)

• Total accuracy of SSFM dominated by FFT accuracy

• Backward error grows linearly with increasing number of FFTs

• CUDA FFT shows considerably higher error than other FFT
implementations


Analysis: Accuracy

Why is the accuracy of CUFFT in SP relatively low?

→ FFT accuracy depends crucially on the accuracy of the twiddle
factors (trigonometric functions)

→ The hardware implementation of trigonometric functions in SP on GPUs
is optimized for peak performance, not accuracy

What can be done to increase accuracy in single precision?

→ Implement a Taylor series expansion (slow!)

→ Compute the trigonometric functions in DP on the CPU and store them in
a look-up table on the GPU
(especially suited to the split-step Fourier method, with thousands
of FFTs of similar length)

J. C. Schatzman, SIAM J. Scientific Comput. (1996).
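
The effect of twiddle-factor accuracy can be illustrated on the CPU with a small, hedged sketch. It builds two tables of single-precision twiddle factors: one evaluated in double precision and then rounded (the look-up-table idea), and one generated by repeated multiplication with rounding to single precision after every step. The second table is only a stand-in for any low-accuracy twiddle source; it is not how CUFFT computes its factors. All function names are hypothetical, and the butterflies themselves run in Python's double precision, so only the twiddle error is isolated.

```python
import cmath
import math
import random
import struct

def to_single(v):
    """Round a Python float (double) to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]

def lut_from_double(n):
    """Twiddles exp(-2*pi*i*k/n), k < n/2: evaluated in DP, stored in SP."""
    table = []
    for k in range(n // 2):
        w = cmath.exp(-2j * math.pi * k / n)
        table.append(complex(to_single(w.real), to_single(w.imag)))
    return table

def lut_from_sp_recurrence(n):
    """Low-accuracy stand-in: w[k+1] = w[k] * w[1], rounded to SP at every step."""
    w1 = lut_from_double(n)[1]
    w, table = 1 + 0j, []
    for _ in range(n // 2):
        table.append(w)
        p = w * w1
        w = complex(to_single(p.real), to_single(p.imag))
    return table

def fft_with_lut(x, table, stride=1):
    """Radix-2 FFT that reads every twiddle factor from `table`."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_with_lut(x[0::2], table, stride * 2)
    odd = fft_with_lut(x[1::2], table, stride * 2)
    out = [0j] * n
    for k in range(n // 2):
        t = table[k * stride] * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def dft_reference(x):
    """Direct O(n^2) DFT in double precision, used as the reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * math.pi * ((j * k) % n) / n)
                for j in range(n)) for k in range(n)]

def rel_error(result, reference):
    """Maximum absolute deviation, normalized to the largest reference value."""
    return (max(abs(u - v) for u, v in zip(result, reference))
            / max(abs(v) for v in reference))

n = 256
rng = random.Random(1)
x = [complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
ref = dft_reference(x)
err_lut = rel_error(fft_with_lut(x, lut_from_double(n)), ref)
err_rec = rel_error(fft_with_lut(x, lut_from_sp_recurrence(n)), ref)
print("DP-precomputed table:", err_lut)
print("SP recurrence table: ", err_rec)
```

Because the table depends only on the FFT length, it can be computed once and reused for every FFT of that length, which is what makes the approach attractive for the split-step method.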


Illustrative Example
[Figure: two panels, CUDA FFT (SP) and LUT-based FFT (SP), each comparing GPU and CPU results]

• The look-up-table-based FFT provides significantly higher accuracy in single-precision arithmetic
• The look-up table holds pre-calculated twiddle-factor values

Source: S. Pachnicke, et al, OFC 2011.


System Analysis (SSFM Simulation)

[Figure: required OSNR deviation in dB at BER = 10^-3, GPU simulation (in SP or DP) vs. CPU simulation (in DP), for 11 x 112 Gb/s CP-QPSK]

• GPU double precision results are (almost) identical to CPU results

• The OSNR penalty of our single precision implementation remains below
0.1 dB up to approx. 125,000 split-steps

Source: S. Pachnicke, IEEE ICTON, 2010.


Combined Simulation in SP & DP
→ Calculate an approximate division of the parameter space into strata using fast single-precision simulations.
→ The ellipses represent parameter combinations for which bit errors occur during transmission.
→ Execute double-precision simulations sparsely in the different strata to assess the BER.

→ Combined simulation with single and double precision and automatic
(algorithmic) choice of the number of single precision simulations

P. Serena, et al, IEEE JLT, 2009.
S. Pachnicke, et al, OFC 2011.
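
The two-stage idea can be sketched with a toy model. Everything below is a hypothetical illustration rather than the algorithm from the cited papers: the "simulations" are simple point-in-ellipse tests on a two-dimensional parameter space, the cheap model is a slightly distorted copy of the accurate one, and only two strata are used (cheap model predicts errors, or does not).

```python
import math
import random

def accurate_has_error(x, y):
    """Toy stand-in for an expensive DP run: errors occur inside an ellipse."""
    return ((x - 0.5) / 0.30) ** 2 + ((y - 0.5) / 0.20) ** 2 <= 1.0

def fast_has_error(x, y):
    """Toy stand-in for a cheap SP run: the same ellipse, slightly distorted."""
    return ((x - 0.5) / 0.31) ** 2 + ((y - 0.5) / 0.19) ** 2 <= 1.0

def stratified_estimate(n_fast, n_accurate, rng):
    """Estimate the error probability with many cheap and few accurate runs."""
    # Stage 1: many cheap runs split the sampled parameter points into strata.
    strata = {False: [], True: []}
    for _ in range(n_fast):
        point = (rng.random(), rng.random())
        strata[fast_has_error(*point)].append(point)
    # Stage 2: a few accurate runs per stratum, weighted by the stratum size.
    estimate = 0.0
    for points in strata.values():
        if not points:
            continue
        weight = len(points) / n_fast
        subset = rng.sample(points, min(n_accurate, len(points)))
        rate = sum(accurate_has_error(*p) for p in subset) / len(subset)
        estimate += weight * rate
    return estimate

rng = random.Random(2013)
estimate = stratified_estimate(n_fast=20000, n_accurate=500, rng=rng)
exact = math.pi * 0.30 * 0.20  # area of the accurate ellipse in the unit square
print(estimate, exact)
```

In this sketch the final estimate relies on the accurate model only; the cheap model merely decides where the accurate runs are spent, so its inaccuracy affects the efficiency of the estimate rather than its correctness.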


Discussion

The robustness of the algorithm has been checked by deliberately
selecting a large number of split-steps (880,000)

• Results of combined (SP & DP) GPU simulations match well with results obtained
from CPU simulations in DP
• Speedup of up to a factor of 180 possible compared to CPU
→ Stratified Monte Carlo sampling allows an algorithmic choice of the number of DP
simulations required for a given accuracy

Source: S. Pachnicke, et al, OFC 2011.


Design Advantages
• GPU parallelization allows simulation of a long-distance 80-channel WDM system on
a PC in reasonable time

Source: C. Xia, D. van den Borne, OFC, 2011

• Result: The system performance can be estimated much more precisely than with
CPU-based simulations (which typically model only 10-channel WDM systems)


Conclusion

• GPUs offer a much higher computational peak performance
than CPUs

• Full benefit of GPU power only in single precision

• Single precision accuracy can be increased by pre-computing the
trigonometric function values for the FFTs

• Speedup in simulation time of more than a factor of 100 possible
compared to CPU


Further Reading

• N. K. Govindaraju, B. Lloyd, Y. Dotsenko, B. Smith, J. Manferdelli, “High
Performance Discrete Fourier Transforms on Graphics Processors”, Proc. of
IEEE conference on Supercomputing (SC), article no. 2 (2008).

• S. Pachnicke, “Fiber-Optic Transmission Networks: Efficient Design and
Dynamic Operation”, Springer (2011).

• J. C. Schatzman, “Accuracy of the Discrete Fourier Transform and the Fast
Fourier Transform”, SIAM J. Scientific Comput. 17, 1150-1166 (1996).

• G. Falcao, V. Silva, L. Sousa, “How GPUs can outperform ASICs for fast LDPC
decoding”, Proc. of ACM International Conference on Supercomputing
(ICS), 390-399 (2009).

• J. A. Stratton, S. S. Stone, W.-M. W. Hwu, “MCUDA: An Efficient
Implementation of CUDA Kernels for Multi-core CPUs”, Lecture Notes in
Computer Science 5335, 16-30 (2008).

• R. R. Exposito, G. L. Taboada, S. Ramos, J. Tourino, R. Doallo, “General-
purpose computation on GPUs for high performance cloud computing”, Wiley J.
Concurrency and Computation 24 (2012).


Thank you

spachnicke@advaoptical.com

