All computing must be parallel to take advantage of modern systems like multicore processors, GPUs, and distributed systems. Results that are not bit-wise reproducible introduce doubt on many levels. Sometimes that is appropriate. Reproducibility limitations occur because underlying libraries do not specify their reproducibility requirements. New advances in interfaces, algorithms, and architectures allow selecting among those requirements in the future. This talk covers many of the upcoming options and their trade-offs.
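A minimal Python sketch (all names mine) of why bit-wise reproducibility is fragile: floating-point addition is not associative, so any change in reduction order — a different thread count, a different blocking — can change the bits of the result.

```python
# Binary64 addition is not associative: regrouping the same three
# summands produces two different answers.
a, b, c = 1e16, -1e16, 1.0
left = (a + b) + c    # 0.0 + 1.0 = 1.0
right = a + (b + c)   # b + c rounds back to -1e16, so the sum is 0.0
print(left, right)    # 1.0 0.0
assert left != right
```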
4. Informal UCB survey
Sca/LAPACK prospectus (2006) identified usefulness of reproducibility. A request appeared on NA-Digest in 2010. So Dr. Demmel asked around 100 relevant UCB faculty whether “reproducibility” matters.¹
• Typical response: Debugging.
• Atypical responses:
• Error analysis! (One guess...)
• Investigating rare instances in simulated data, so
must be reproducible.
• UN uses my code to detect nuclear tests.
¹ Demmel, Riedy, Ahrens. “Reproducible BLAS: Make Addition Associative Again!” SIAM News, Oct. 2018
Reproducible Linear Algebra, Alg. to Arch. — ICIAM 2019 2/21
5. (Some) kinds of reproducibility
• Debugging
• At first, just the developer’s platform.
• Then users show up.
• Investigate rare instances
• Small job, similar to debugging.
• Larger? # proc changes.
• Schrödinger’s nuke, climate negotiations, ...
• Likely little control over the runtime environment.
• Accounting, some finance
• Legal: identical across history.
6. Growing “demand,” or at least interest
• Reproducibility is a tag on SC sessions.
• Often means reproducible (“replicable”)
experiments.
• But many sessions and talks on numerical
reproducibility.
• Vendors want to reduce support calls.
• New MATLAB™ version, new optimized BLAS, new
cache hierarchy ⇒ “My results changed! Fix it!”
• Compiler A evaluates constant f.p. expressions at run
time, compiler B at compile time ⇒ “My results
changed! Fix it!”
• Vendors (HW & SW) want to sell new versions.
7. What Stands in the Way?
1. Performance.
2. Shared definitions.
3. Performance.
4. “Ease of use.”
5. Performance.
[Photo: Summit, from OLCF at ORNL]
Debugging run much longer than an email check ⇒ No.
Production run uses up an allocation ⇒ No.
Some help comes from the growing surplus of processing cycles relative to memory bandwidth, and of memory relative to interconnect.
8. Shared definitions: Exceptional behavior?
0 × NaN ⇒ NaN except when ⇒ 0?
0 × ∞ ⇒ NaN except when ⇒ 0?
GER: BLAS2 routine for A = A + α x yᵀ. In the reference implementation:
• If α = 0, x yᵀ is not evaluated, so no propagation.
• If yₖ = 0, column k is skipped, no propagation.
A symmetric update of a general matrix may not be a
symmetric update!
Some vendor libraries behave differently.
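A sketch in Python of the reference-BLAS control flow (`ger_reference` is my name, not an actual BLAS entry point) showing how the quick returns suppress NaN propagation:

```python
import math

def ger_reference(alpha, x, y, A):
    # Rank-1 update A += alpha * x * y^T following the reference BLAS
    # control flow (a simplified sketch, not the actual Fortran).
    if alpha == 0.0:
        return A                  # quick return: x*y^T is never formed
    for j, yj in enumerate(y):
        if yj == 0.0:
            continue              # reference code skips zero columns of y
        temp = alpha * yj
        for i, xi in enumerate(x):
            A[i][j] += xi * temp
    return A

A = [[1.0]]
ger_reference(0.0, [float("nan")], [1.0], A)
print(A)                          # [[1.0]] — the NaN in x never propagated
# Evaluating the formula literally would give 1.0 + 0.0*nan*1.0 = nan.
assert not math.isnan(A[0][0])
```

A library that always evaluates α x yᵀ would instead fill A with NaNs; both behaviors are defensible, which is exactly the shared-definitions problem.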
9. Other exceptional issues
ISAMAX, ICAMAX: Find entry of max. |magnitude|.
• Return NaN’s location only if it’s first.
• ICAMAX uses |real| + |imag|, which may overflow!
• Actual |complex| may differ (e.g. C++ v. C).
GEMM: BLAS3, C = α op(A) op(B) + βC
• If α = 0, then A, B do not propagate.
• If β = 0, then C does not...
• BUT traditionally β = 0 overwrites C.
More examples in Demmel, Gates, Henry, Li, Riedy, Tang, “A Proposal for a Next-Generation BLAS,” http://goo.gl/D1UKnw.
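The ICAMAX pitfall above, demonstrated in Python (assuming binary64 floats):

```python
import math

# ICAMAX ranks complex entries by |Re z| + |Im z| (cheap: no square
# root). That proxy can overflow even when the true modulus is finite:
z = complex(1e308, 1e308)
proxy = abs(z.real) + abs(z.imag)     # 2e308 exceeds the binary64 range
true_mod = abs(z)                     # hypot-style |z| ≈ 1.414e308, finite
print(proxy, math.isfinite(true_mod)) # inf True
```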
10. Complex complexity
It gets worse.
• C99 & C11: Complex is infinite if either
component is, regardless of NaNs.
• Hence multiplication yields complex ∞, but
• “by hand” would yield NaN + iNaN.
• C standard defines 30+ line complex
multiplication.
• Vectorization tends to ignore this definition.
• Fortran, C++ do not define semantics!
• People expect their particular compiler’s
result.
• Division is at least as bad.
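Python itself multiplies complex numbers with the naive componentwise formula, so it can demonstrate the mismatch (a sketch; the Annex G behavior is described in comments, not executed):

```python
import math

# CPython uses the naive formula (ac - bd) + i(ad + bc), so an
# infinite operand can produce NaN components:
z = complex(float("inf"), 0.0) * complex(0.0, 1.0)
print(z.real, z.imag)             # nan inf
# C99/C11 Annex G instead declares a complex value infinite whenever
# either component is infinite, so a conforming C compiler applies a
# fix-up and returns a complex infinity here — different bits, both
# "correct" under their own specifications.
assert math.isnan(z.real) and math.isinf(z.imag)
```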
12. Assuming “agreement” on exceptions...
(Some) current approaches:
• Specific platform reproducibility for debugging.
• Intel CNR, NVIDIA
• Arbitrary precision / exact comp.
• Not saying more on this.
• Correctly rounded results
• ExBLAS
• Not “faithful” rounding: faithful rounding allows either of the two neighboring values, and another implementation may choose the other.
• Reproducible accumulators
• Very wide accumulators (Kulisch, AMD HPA)
• Binned accumulators (ReproBLAS)
13. Vendor debugging support
Intel Conditional Numerical Reproducibility in MKL²:
• “[...] calls to Intel® MKL occur in a single executable”
• “the number of computational threads [...] remains constant [...]” with SSE/AVX selection routines
• Strict CNR Mode: Any # threads for GEMM, TRSM, and SYMM.
NVIDIA cuBLAS³:
• “By design, all CUBLAS API routines from a given toolkit version,
generate the same bit-wise results at every run when executed
on GPUs with the same architecture and the same number of
SMs.”
• No promise across versions, can disable for performance
² https://software.intel.com/en-us/articles/introduction-to-the-conditional-numerical-reproducibility-cnr
³ https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
14. Correctly rounded results
Example: ExBLAS, https://exblas.lip6.fr/
• Delivers correctly rounded results for AXPY, SUM, DOT, TRSV, GEMV, GEMM, ...
• Reproducible regardless of parallelism, vectorization, or data distribution
• Used in fluid simulations with discontinuous Galerkin (arXiv:1807.01971) and parallel PCG (Wednesday!)
• Memory / communication limited.
Built on accumulators using exact operations
• twoSum (a + b ⇒ s + e), and
• twoProd (a · b ⇒ p + e).
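The “correctly rounded ⇒ reproducible” point can be illustrated with Python’s `math.fsum` standing in for ExBLAS (this sketch does not call ExBLAS itself): a correctly rounded sum has exactly one possible answer, so reordering the data cannot change it.

```python
import math, random

rng = random.Random(42)
data = [rng.uniform(-1.0, 1.0) for _ in range(10_000)]
shuffled = list(data)
rng.shuffle(shuffled)

# Left-to-right sums typically disagree in the last bits after a
# shuffle; correctly rounded sums are identical bit for bit.
naive = (sum(data), sum(shuffled))
exact = (math.fsum(data), math.fsum(shuffled))
print(naive[0] == naive[1], exact[0] == exact[1])
assert exact[0] == exact[1]
```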
15. Reproducible accumulators: Fixed-point
Expose the accumulators. Can compose across uses.
• With a sufficiently wide accumulator, floating point
becomes fixed point.
• Fixed point is reproducible.
• Kulisch superaccumulator: binary64 ⇒ >4000 bits
• Various implementations and optimizations: e.g.
Koenig et al. (ARITH17)
• AMD High-Precision Anchored (HPA)⁴ numbers in SVE
• Narrowed by data range, likely ≤ 200 bits
• Kinda wide, kinda binned
• Int / fixed-point summation is much faster than f.p.
⁴ Burgess et al., IEEE ToC 7/2019, doi: 10.1109/TC.2018.2855729
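A toy superaccumulator sketch in Python (the names and the SHIFT constant are mine), using arbitrary-precision integers in place of a fixed register; a real Kulisch accumulator for binary64 dedicates >4000 bits to cover the full exponent range in hardware:

```python
# Represent every binary64 value exactly as an integer multiple of
# 2**-SHIFT and accumulate with Python's bignum integers. Integer
# addition is associative, so the result is order-independent.
SHIFT = 1200  # covers the binary64 exponent range downward (subnormals)

def to_fixed(x: float) -> int:
    n, d = x.as_integer_ratio()    # x == n / d, with d a power of two
    return n * (2**SHIFT // d)     # exact because d divides 2**SHIFT

vals = [1e16, 1.0, -1e16]
acc = sum(to_fixed(v) for v in vals)   # exact fixed-point accumulation
print(acc / 2**SHIFT)                  # 1.0 — no cancellation error
assert sum(to_fixed(v) for v in reversed(vals)) == acc
```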
16. Reproducible accumulators: Binned
Expose the accumulators. Example: ReproBLAS⁵:
Goals for summation:
• Build on standard IEEE 754 binary ops
• Tunable precision, ≥ conventional sum
• Handle exceptions reproducibly
• One read-only pass, one reduction
• Minimal memory for tiling
• “Usable:” Can build higher-level and parallel
routines (for some α, β, ...)
⁵ Demmel, Ahrens, Nguyen, https://bebop.cs.berkeley.edu/reproblas/
17. ReproBLAS binning
• p: Precision
• K: K-fold (in)
• W: Width (in)

         bin32   bin64   bin128
W        13      40      100
K        3       3       3
# sums   2³³     2⁶⁴     2¹²⁴

7n flops for n sums. But twoSum again!
20% overhead for adding 1M #s on 1K XC30 processors
19. Defining twoSum
Long history...
quickTwoSum(a, b) ⇒ (s, e), |a| ≥ |b|
1. s = a + b
2. e = b − (s − a)
twoSum(a, b) ⇒ (s, e)
1. s = a + b
2. t := s − a
3. e = (a − (s − t)) + (b − t)
Many sign re-arrangements possible.
Each generates different ±0, NaNs, ...
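The two algorithms above, transcribed into Python (function names mine); the defining property, that s + e equals a + b exactly, can be checked with rationals:

```python
from fractions import Fraction

def quick_two_sum(a, b):
    # Requires |a| >= |b|; two flops.
    s = a + b
    e = b - (s - a)
    return s, e

def two_sum(a, b):
    # Knuth's branch-free variant; six flops, no magnitude test.
    s = a + b
    t = s - a
    e = (a - (s - t)) + (b - t)
    return s, e

a, b = 1e16, 1.0
s, e = two_sum(a, b)
print(s, e)        # 1e+16 1.0 — the 1.0 lost in s is recovered in e
# Exactness: s + e equals a + b with no rounding error at all.
assert Fraction(s) + Fraction(e) == Fraction(a) + Fraction(b)
```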
20. Specifying twoSum ⇒ augmentedAddition
Finally: IEEE 754-2019 (to be fully available on 7/22?)
• augmentedAddition(a, b) ⇒ (s, e)
• augmentedSubtraction(a, b) ⇒ (d, e)
• augmentedMultiplication(a, b) ⇒ (p, e)
Fully specified exceptional cases including overflow,
underflow, invalid, and signed zeros.
“Funny” aspect: Round ties towards zero.
• Mode exists for decimal but not binary.
• Reproducibility: Rounding must not depend on
value!
21. augmented∗ operations i
• Rationale: Riedy, Demmel. “Augmented Arithmetic
Operations Proposed for IEEE-754 2018.” ARITH 18,
doi: 10.1109/ARITH.2018.8464813
• Will be expanded in IEEE 754 background documents
• Software:
• FP implementation: Boldo, Lauter, Muller. “Emulating
[ties→0] ‘augmented’ [...] operations using
[ties→even] arithmetic.” HAL 2137968.
• Int impl: (proof eternally in progress)
22. augmented∗ operations ii
• Hardware: There is “interest” from processor
designers...
• In a two-operation form, one for each piece.
• There need to be users / uses of $ importance.
23. Possible HW performance: single CPU core
Emulating augmentedAddition as two instructions
improves double-double:
Operation                    Skylake   Haswell
Addition latency             −55%      −45%
Addition throughput          +36%      +18%

DDGEMM MFLOP/s from reduced insn dependencies:

Implementation               Intel Skylake      Intel Haswell
“Typical”                    1732 (≈ 1/37 DP)   1199 (≈ 1/45 DP)
Two-insn augmentedAddition   3344 (≈ 1/19 DP)   2283 (≈ 1/24 DP)
Dukhan, Riedy, Vuduc. “Wanted: Floating-point add round-off error instruction,” PMAA 2016, ArXiv 1603.00491.
ReproBLAS dot product: 33% rate improvement,
only 2× slower than non-reproducible.
25. Closing
• Even before the “interesting” parts, challenges in
reproducible linear algebra:
• Exceptional conditions and propagation
• Different complex arithmetics
• (Also: Signaling exceptions and more)
• Each approach has benefits and limitations.
• ExBLAS: Cannot easily compose, but easy to
understand.
• ReproBLAS: Composable (multipliers ±1, 0).
• Aside: Composing is useful for streaming data...
• Performance can be ok, so long as networks are
involved.
• Really need hardware for day-to-day use.
26. Upcoming issues
• Current work assumes a processor-centric view.
• Novel memory-centric processors may minimize
state, limiting accumulators.
• FPGAs near / in memory change engineering choices.
• Coming flood of 8-bit multipliers.
• Energy and space scale as bits².
• Machine learning “wants” them (bfloat16).
• Reproducibility and neuromorphic? Quantum?
[Photos: Emu Chick (FPGAs & HMC/HBM), FPAA — Georgia Tech CRNCH Rogues Gallery]