All computing must be parallel to take advantage of modern systems like multicore processors, GPUs, and distributed systems. Results that are not bit-wise reproducible introduce doubt on many levels. Sometimes that is appropriate. Reproducibility limitations occur because underlying libraries do not specify their reproducibility requirements. New advances in interfaces, algorithms, and architectures allow selecting among those requirements in the future. This talk covers many of the upcoming options and their trade-offs.
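A minimal Python sketch (all names mine) of why bit-wise reproducibility is fragile: floating-point addition is not associative, so any change in reduction order — a different thread count, a different blocking — can change the bits of the result.

```python
# Binary64 addition is not associative: regrouping the same three
# summands produces two different answers.
a, b, c = 1e16, -1e16, 1.0
left = (a + b) + c    # 0.0 + 1.0 = 1.0
right = a + (b + c)   # b + c rounds back to -1e16, so the sum is 0.0
print(left, right)    # 1.0 0.0
assert left != right
```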
4. Informal UCB survey
Sca/LAPACK prospectus (2006) identified usefulness of reproducibility. A request appeared on NA-Digest in 2010. So Dr. Demmel asked around 100 relevant UCB faculty whether “reproducibility” matters.¹
• Typical response: Debugging.
• Atypical responses:
• Error analysis! (One guess...)
• Investigating rare instances in simulated data, so
must be reproducible.
• UN uses my code to detect nuclear tests.
¹ Demmel, Riedy, Ahrens. “Reproducible BLAS: Make Addition Associative Again!” SIAM News, Oct. 2018
Reproducible Linear Algebra, Alg. to Arch. — ICIAM 2019 2/21
5. (Some) kinds of reproducibility
• Debugging
• At first, just the developer’s platform.
• Then users show up.
• Investigate rare instances
• Small job, similar to debugging.
• Larger? # proc changes.
• Schrödinger’s nuke, climate negotiations, ...
• Likely little control over the runtime environment.
• Accounting, some finance
• Legal: identical across history.
6. Growing “demand,” or at least interest
• Reproducibility is a tag on SC sessions.
• Often means reproducible (“replicable”)
experiments.
• But many sessions and talks on numerical
reproducibility.
• Vendors want to reduce support calls.
• New MATLAB™ version, new optimized BLAS, new
cache hierarchy ⇒ “My results changed! Fix it!”
• Compiler A evaluates constant f.p. expressions at run
time, compiler B at compile time ⇒ “My results
changed! Fix it!”
• Vendors (HW & SW) want to sell new versions.
7. What Stands in the Way?
1. Performance.
2. Shared definitions.
3. Performance.
4. “Ease of use.”
5. Performance.
[Photo: Summit, from OLCF at ORNL]
Debugging run much longer than an email check ⇒ No.
Production run uses up an allocation ⇒ No.
Some help comes from the growing surplus of processing cycles relative to memory bandwidth, and of memory relative to interconnect.
8. Shared definitions: Exceptional behavior?
0 × NaN ⇒ NaN except when ⇒ 0?
0 × ∞ ⇒ NaN except when ⇒ 0?
GER: BLAS2 routine for A = A + α x yᵀ. In the reference implementation:
• If α = 0, x yᵀ is not evaluated, so no propagation.
• If yₖ = 0, column k is skipped, no propagation.
A symmetric update of a general matrix may not be a
symmetric update!
Some vendor libraries behave differently.
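A sketch in Python of the reference-BLAS control flow (`ger_reference` is my name, not an actual BLAS entry point) showing how the quick returns suppress NaN propagation:

```python
import math

def ger_reference(alpha, x, y, A):
    # Rank-1 update A += alpha * x * y^T following the reference BLAS
    # control flow (a simplified sketch, not the actual Fortran).
    if alpha == 0.0:
        return A                  # quick return: x*y^T is never formed
    for j, yj in enumerate(y):
        if yj == 0.0:
            continue              # reference code skips zero columns of y
        temp = alpha * yj
        for i, xi in enumerate(x):
            A[i][j] += xi * temp
    return A

A = [[1.0]]
ger_reference(0.0, [float("nan")], [1.0], A)
print(A)                          # [[1.0]] — the NaN in x never propagated
# Evaluating the formula literally would give 1.0 + 0.0*nan*1.0 = nan.
assert not math.isnan(A[0][0])
```

A library that always evaluates α x yᵀ would instead fill A with NaNs; both behaviors are defensible, which is exactly the shared-definitions problem.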
9. Other exceptional issues
ISAMAX, ICAMAX: Find entry of max. |magnitude|.
• Return NaN’s location only if it’s first.
• ICAMAX uses |real| + |imag|, which may overflow!
• Actual |complex| may differ (e.g. C++ v. C).
GEMM: BLAS3, C = α op(A) op(B) + βC
• If α = 0, then A, B do not propagate.
• If β = 0, then C does not...
• BUT traditionally β = 0 overwrites C.
More examples in Demmel, Gates, Henry, Li, Riedy, Tang, “A Proposal for a Next-Generation BLAS,” http://goo.gl/D1UKnw.
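The ICAMAX pitfall above, demonstrated in Python (assuming binary64 floats):

```python
import math

# ICAMAX ranks complex entries by |Re z| + |Im z| (cheap: no square
# root). That proxy can overflow even when the true modulus is finite:
z = complex(1e308, 1e308)
proxy = abs(z.real) + abs(z.imag)     # 2e308 exceeds the binary64 range
true_mod = abs(z)                     # hypot-style |z| ≈ 1.414e308, finite
print(proxy, math.isfinite(true_mod)) # inf True
```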
10. Complex complexity
It gets worse.
• C99 & C11: Complex is infinite if either
component is, regardless of NaNs.
• Hence multiplication yields complex ∞, but
• “by hand” would yield NaN + iNaN.
• C standard defines 30+ line complex
multiplication.
• Vectorization tends to ignore this definition.
• Fortran, C++ do not define semantics!
• People expect their particular compiler’s
result.
• Division is at least as bad.
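Python itself multiplies complex numbers with the naive componentwise formula, so it can demonstrate the mismatch (a sketch; the Annex G behavior is described in comments, not executed):

```python
import math

# CPython uses the naive formula (ac - bd) + i(ad + bc), so an
# infinite operand can produce NaN components:
z = complex(float("inf"), 0.0) * complex(0.0, 1.0)
print(z.real, z.imag)             # nan inf
# C99/C11 Annex G instead declares a complex value infinite whenever
# either component is infinite, so a conforming C compiler applies a
# fix-up and returns a complex infinity here — different bits, both
# "correct" under their own specifications.
assert math.isnan(z.real) and math.isinf(z.imag)
```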
12. Assuming “agreement” on exceptions...
(Some) current approaches:
• Specific platform reproducibility for debugging.
• Intel CNR, NVIDIA
• Arbitrary precision / exact comp.
• Not saying more on this.
• Correctly rounded results
• ExBLAS
• Not “faithful” rounding: faithful rounding allows either of the two neighboring values, and another implementation may choose the other.
• Reproducible accumulators
• Very wide accumulators (Kulisch, AMD HPA)
• Binned accumulators (ReproBLAS)
13. Vendor debugging support
Intel Conditional Numerical Reproducibility in MKL²:
• “[...] calls to Intel® MKL occur in a single executable”
• “the number of computational threads [...] remains constant [...]” with SSE/AVX selection routines
• Strict CNR Mode: Any # threads for GEMM, TRSM, and SYMM.
NVIDIA cuBLAS³:
• “By design, all CUBLAS API routines from a given toolkit version,
generate the same bit-wise results at every run when executed
on GPUs with the same architecture and the same number of
SMs.”
• No promise across versions, can disable for performance
² https://software.intel.com/en-us/articles/introduction-to-the-conditional-numerical-reproducibility-cnr
³ https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
14. Correctly rounded results
Example: ExBLAS, https://exblas.lip6.fr/
• Delivers correctly rounded results for AXPY, SUM, DOT, TRSV, GEMV, GEMM, ...
• Reproducible regardless of parallelism, vectorization, or data distribution
• Used in fluid simulations with discontinuous Galerkin (arXiv:1807.01971) and parallel PCG (Wednesday!)
• Memory / communication limited.
Built on accumulators using exact operations
• twoSum (a + b ⇒ s + e), and
• twoProd (a · b ⇒ p + e).
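The “correctly rounded ⇒ reproducible” point can be illustrated with Python’s `math.fsum` standing in for ExBLAS (this sketch does not call ExBLAS itself): a correctly rounded sum has exactly one possible answer, so reordering the data cannot change it.

```python
import math, random

rng = random.Random(42)
data = [rng.uniform(-1.0, 1.0) for _ in range(10_000)]
shuffled = list(data)
rng.shuffle(shuffled)

# Left-to-right sums typically disagree in the last bits after a
# shuffle; correctly rounded sums are identical bit for bit.
naive = (sum(data), sum(shuffled))
exact = (math.fsum(data), math.fsum(shuffled))
print(naive[0] == naive[1], exact[0] == exact[1])
assert exact[0] == exact[1]
```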
15. Reproducible accumulators: Fixed-point
Expose the accumulators. Can compose across uses.
• With a sufficiently wide accumulator, floating point
becomes fixed point.
• Fixed point is reproducible.
• Kulisch superaccumulator: binary64 ⇒ >4000 bits
• Various implementations and optimizations: e.g.
Koenig et al. (ARITH17)
• AMD High-Precision Anchored (HPA)⁴ numbers in SVE
• Narrowed by data range, likely ≤ 200 bits
• Kinda wide, kinda binned
• Int / fixed-point summation is much faster than f.p.
⁴ Burgess et al., IEEE ToC 7/2019, doi: 10.1109/TC.2018.2855729
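A toy superaccumulator sketch in Python (the names and the SHIFT constant are mine), using arbitrary-precision integers in place of a fixed register; a real Kulisch accumulator for binary64 dedicates >4000 bits to cover the full exponent range in hardware:

```python
# Represent every binary64 value exactly as an integer multiple of
# 2**-SHIFT and accumulate with Python's bignum integers. Integer
# addition is associative, so the result is order-independent.
SHIFT = 1200  # covers the binary64 exponent range downward (subnormals)

def to_fixed(x: float) -> int:
    n, d = x.as_integer_ratio()    # x == n / d, with d a power of two
    return n * (2**SHIFT // d)     # exact because d divides 2**SHIFT

vals = [1e16, 1.0, -1e16]
acc = sum(to_fixed(v) for v in vals)   # exact fixed-point accumulation
print(acc / 2**SHIFT)                  # 1.0 — no cancellation error
assert sum(to_fixed(v) for v in reversed(vals)) == acc
```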
16. Reproducible accumulators: Binned
Expose the accumulators. Example: ReproBLAS⁵:
Goals for summation:
• Build on standard IEEE 754 binary ops
• Tunable precision, ≥ conventional sum
• Handle exceptions reproducibly
• One read-only pass, one reduction
• Minimal memory for tiling
• “Usable:” Can build higher-level and parallel
routines (for some α, β, ...)
⁵ Demmel, Ahrens, Nguyen, https://bebop.cs.berkeley.edu/reproblas/
17. ReproBLAS binning
• p: Precision
• K: K-fold (in)
• W: Width (in)

         bin32   bin64   bin128
W        13      40      100
K        3       3       3
# sums   2³³     2⁶⁴     2¹²⁴

7n flops for n sums. But twoSum again!
20% overhead for adding 1M #s on 1K XC30 processors
19. Defining twoSum
Long history...
quickTwoSum(a, b) ⇒ (s, e), |a| ≥ |b|
1. s = a + b
2. e = b − (s − a)
twoSum(a, b) ⇒ (s, e)
1. s = a + b
2. t := s − a
3. e = (a − (s − t)) + (b − t)
Many sign re-arrangements possible.
Each generates different ±0, NaNs, ...
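The two algorithms above, transcribed into Python (function names mine); the defining property, that s + e equals a + b exactly, can be checked with rationals:

```python
from fractions import Fraction

def quick_two_sum(a, b):
    # Requires |a| >= |b|; two flops.
    s = a + b
    e = b - (s - a)
    return s, e

def two_sum(a, b):
    # Knuth's branch-free variant; six flops, no magnitude test.
    s = a + b
    t = s - a
    e = (a - (s - t)) + (b - t)
    return s, e

a, b = 1e16, 1.0
s, e = two_sum(a, b)
print(s, e)        # 1e+16 1.0 — the 1.0 lost in s is recovered in e
# Exactness: s + e equals a + b with no rounding error at all.
assert Fraction(s) + Fraction(e) == Fraction(a) + Fraction(b)
```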
20. Specifying twoSum ⇒ augmentedAddition
Finally: IEEE 754-2019 (to be fully available on 7/22?)
• augmentedAddition(a, b) ⇒ (s, e)
• augmentedSubtraction(a, b) ⇒ (d, e)
• augmentedMultiplication(a, b) ⇒ (p, e)
Fully specified exceptional cases including overflow,
underflow, invalid, and signed zeros.
“Funny” aspect: Round ties towards zero.
• Mode exists for decimal but not binary.
• Reproducibility: Rounding must not depend on
value!
21. augmented∗ operations i
• Rationale: Riedy, Demmel. “Augmented Arithmetic
Operations Proposed for IEEE-754 2018.” ARITH 18,
doi: 10.1109/ARITH.2018.8464813
• Will be expanded in IEEE 754 background documents
• Software:
• FP implementation: Boldo, Lauter, Muller. “Emulating
[ties→0] ‘augmented’ [...] operations using
[ties→even] arithmetic.” HAL 2137968.
• Int impl: (proof eternally in progress)
22. augmented∗ operations ii
• Hardware: There is “interest” from processor
designers...
• In a two-operation form, one for each piece.
• There need to be users / uses of $ importance.
23. Possible HW performance: single CPU core
Emulating augmentedAddition as two instructions
improves double-double:
Operation                    Skylake   Haswell
Addition latency             −55%      −45%
Addition throughput          +36%      +18%

DDGEMM MFLOP/s from reduced insn dependencies:

Implementation               Intel Skylake      Intel Haswell
“Typical”                    1732 (≈ 1/37 DP)   1199 (≈ 1/45 DP)
Two-insn augmentedAddition   3344 (≈ 1/19 DP)   2283 (≈ 1/24 DP)
Dukhan, Riedy, Vuduc. “Wanted: Floating-point add round-off error instruction,” PMAA 2016, ArXiv 1603.00491.
ReproBLAS dot product: 33% rate improvement,
only 2× slower than non-reproducible.
25. Closing
• Even before the “interesting” parts, challenges in
reproducible linear algebra:
• Exceptional conditions and propagation
• Different complex arithmetics
• (Also: Signaling exceptions and more)
• Each approach has benefits and limitations.
• ExBLAS: Cannot easily compose, but easy to
understand.
• ReproBLAS: Composable (multipliers ±1, 0).
• Aside: Composing is useful for streaming data...
• Performance can be ok, so long as networks are
involved.
• Really need hardware for day-to-day use.
26. Upcoming issues
• Current work assumes a processor-centric view.
• Novel memory-centric processors may minimize
state, limiting accumulators.
• FPGAs near / in memory change engineering choices.
• Coming flood of 8-bit multipliers.
• Energy and space scale as bits².
• Machine learning “wants” them (bfloat16).
• Reproducibility and neuromorphic? Quantum?
[Photos: Emu Chick (FPGAs & HMC/HBM), FPAA — Georgia Tech CRNCH Rogues Gallery]