Reproducible Linear Algebra from Application to Architecture
Jason Riedy (GT) James Demmel (UCB) Peter Ahrens (MIT)
ICIAM, 16 July 2019
Outline
Motivation: Why worry?
Higher level methods & libraries
Implementations and architectures
Closing
Reproducible Linear Algebra, Alg. to Arch. — ICIAM 2019 1/21
Motivation: Why worry?
Informal UCB survey
Sca/LAPACK prospectus (2006) identified usefulness of
reproducibility. A request appeared on NA-Digest in 2010.
So Dr. Demmel asked around 100 relevant UCB faculty if
“reproducibility” matters [1].
• Typical response: Debugging.
• Atypical responses:
• Error analysis! (One guess...)
• Investigating rare instances in simulated data, so
must be reproducible.
• UN uses my code to detect nuclear tests.
[1] Demmel, Riedy, Ahrens. “Reproducible BLAS: Make Addition Associative Again!” SIAM News, Oct. 2018
(Some) kinds of reproducibility
• Debugging
• Just developer’s platform.
• Then users occur.
• Investigate rare instance
• Small job, similar to debugging.
• Larger? # proc changes.
• Schrödinger’s nuke, climate
negotiations, ...
• Likely little control over the
runtime environment.
• Accounting, some finance
• Legal: identical across history.
Growing “demand,” or at least interest
• Reproducibility is a tag on SC sessions.
• Often means reproducible (“replicable”)
experiments.
• But many sessions and talks on numerical
reproducibility.
• Vendors want to reduce support calls.
• New MATLAB™ version, new optimized BLAS, new
cache hierarchy ⇒ “My results changed! Fix it!”
• Compiler A evaluates constant f.p. expressions at run
time, compiler B at compile time ⇒ “My results
changed! Fix it!”
• Vendors (HW & SW) want to sell new versions.
What Stands in the Way?
1. Performance.
2. Shared definitions.
3. Performance.
4. “Ease of use.”
5. Performance.
(Summit photo from OLCF at ORNL.)
A debugging run much longer than an email check ⇒ No.
A production run that uses up an allocation ⇒ No.
Some help comes from the excess of processing cycles relative to memory, and of memory relative to the interconnect.
Shared definitions: Exceptional behavior?
0 × NaN ⇒ NaN except when ⇒ 0?
0 × ∞ ⇒ NaN except when ⇒ 0?
GER: BLAS2 routine for $A = A + \alpha x y^T$. In the reference implementation:
• If $\alpha = 0$, $x y^T$ is not evaluated, no propagation.
• If $y_k = 0$, skip column, no propagation.
A symmetric update of a general matrix may not be a
symmetric update!
Some vendor libraries behave differently.
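A minimal Python sketch of the two GER shortcuts listed above, to make the asymmetry concrete. The function names are hypothetical, and the loops only imitate the structure described here (early exit for α = 0, column skip for a zero entry of y); they are not any library's code.

```python
import math

def ger_reference_style(alpha, x, y, A):
    """A += alpha * x * y^T, with the two shortcuts described above."""
    if alpha == 0.0:
        return A                      # x y^T is never evaluated
    for j, yj in enumerate(y):
        if yj == 0.0:
            continue                  # whole column skipped, NaNs in x ignored
        temp = alpha * yj
        for i, xi in enumerate(x):
            A[i][j] += xi * temp
    return A

def ger_no_shortcuts(alpha, x, y, A):
    """Same update, evaluating every product so 0 * NaN propagates."""
    for j, yj in enumerate(y):
        for i, xi in enumerate(x):
            A[i][j] += alpha * xi * yj
    return A

x = y = [math.nan, 0.0]               # symmetric update: x equals y
skipping = ger_reference_style(1.0, x, y, [[1.0, 2.0], [2.0, 1.0]])
plain = ger_no_shortcuts(1.0, x, y, [[1.0, 2.0], [2.0, 1.0]])
```

With the shortcuts, entry (1, 0) becomes NaN while entry (0, 1) keeps its old value, so a symmetric matrix receives a non-symmetric update. Without them, both off-diagonal entries become NaN.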
Other exceptional issues
ISAMAX, ICAMAX: Find entry of max. |magnitude|.
• Return NaN’s location only if it’s first.
• ICAMAX uses |real| + |imag|, which may overflow!
• Actual |complex| may differ (e.g. C++ v. C).
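A small Python sketch of the first two bullets, assuming a search loop that keeps the current maximum and updates it only on a strict `>` comparison. The names `iamax_reference_style` and `abs1` are made up for illustration, indices are 0-based, and the input is assumed non-empty.

```python
import math

def iamax_reference_style(x):
    """0-based index of the first entry of largest |x[i]|, using only '>'."""
    best, best_abs = 0, abs(x[0])
    for i in range(1, len(x)):
        if abs(x[i]) > best_abs:      # any comparison against NaN is False
            best, best_abs = i, abs(x[i])
    return best

def abs1(z):
    """|real| + |imag|, the magnitude used by the complex variants."""
    return abs(z.real) + abs(z.imag)

nan_first = iamax_reference_style([math.nan, 1.0, 5.0])
nan_later = iamax_reference_style([1.0, math.nan, 5.0])
big = complex(1e308, 1e308)
```

A leading NaN wins because nothing compares greater than it; a later NaN is skipped for the same reason. For `big`, `abs1` overflows to infinity even though the Euclidean magnitude `abs(big)` is still finite.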
GEMM: BLAS3, C = α op(A) op(B) + βC
• If α = 0, then A, B do not propagate.
• If β = 0, then C does not...
• BUT traditionally β = 0 overwrites C.
More examples in Demmel, Gates, Henry, Li, Riedy, Tang, “A Proposal
for a Next-Generation BLAS,” http://goo.gl/D1UKnw.
Complex complexity
It gets worse.
• C99 & C11: Complex is infinite if either
component is, regardless of NaNs.
• Hence multiplication yields complex ∞, but
• “by hand” would yield NaN + iNaN.
• C standard defines 30+ line complex
multiplication.
• Vectorization tends to ignore this definition.
• Fortran, C++ do not define semantics!
• People expect their particular compiler’s
result.
• Division is at least as bad.
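To illustrate the gap between the two views, the sketch below multiplies "by hand" with the textbook formula and applies the C99/C11 test for a complex infinity. It does not implement the C standard's recovery logic; the helper names are hypothetical.

```python
import math

def mul_by_hand(a, b, c, d):
    """(a + ib)(c + id) from the textbook formula, no special cases."""
    return (a * c - b * d, a * d + b * c)

def is_infinite_c_rule(re, im):
    """C99/C11 rule: infinite if either part is, even if the other is NaN."""
    return math.isinf(re) or math.isinf(im)

inf = math.inf
# (inf + i*inf) * (1 + i*0): an infinite operand times a finite nonzero one.
re, im = mul_by_hand(inf, inf, 1.0, 0.0)
```

Both parts of the by-hand product are NaN here (each contains an inf × 0 term), so the result no longer counts as infinite under the C rule, although the first operand does.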
Higher level methods & libraries
Assuming “agreement” on exceptions...
(Some) current approaches:
• Specific platform reproducibility for debugging.
• Intel CNR, NVIDIA
• Arbitrary precision / exact comp.
• Not saying more on this.
• Correctly rounded results
• ExBLAS
• Not “faithful” rounding. One of two choices, but
another implementation may choose the other.
• Reproducible accumulators
• Very wide accumulators (Kulisch, Arm HPA)
• Binned accumulators (ReproBLAS)
Vendor debugging support
Intel Conditional Numerical Reproducibility in MKL [2]:
• “[...] calls to Intel® MKL occur in a single executable”
• “the number of computational threads [...] remains constant
[...]” with SSE/AVX selection routines
• Strict CNR Mode: Any # threads for GEMM, TRSM, and SYMM.
NVIDIA cuBLAS [3]:
• “By design, all CUBLAS API routines from a given toolkit version,
generate the same bit-wise results at every run when executed
on GPUs with the same architecture and the same number of
SMs.”
• No promise across versions, can disable for performance
[2] https://software.intel.com/en-us/articles/introduction-to-the-conditional-numerical-reproducibility-cnr
[3] https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
Correctly rounded results
Example: ExBLAS, https://exblas.lip6.fr/
• Deliver correctly rounded results for AXPY, SUM, DOT,
TRSV, GEMV, GEMM, ...
• Reproducible no matter parallelism, vectors,
distribution
• Used in fluid simulations with discontinuous
Galerkin (arXiv 1807.01971), parallel PCG (Wednesday!)
• Memory / communication limited.
Built on accumulators using exact operations
• twoSum (a + b ⇒ s + e), and
• twoProd (a · b ⇒ p + e).
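A sketch of twoProd without a fused multiply-add, using the classical Veltkamp splitting and Dekker product. It assumes binary64 in round-to-nearest and inputs far from overflow and underflow; the function names are illustrative. Where a hardware FMA is available, the error term is usually computed with it instead.

```python
from fractions import Fraction

def split(a):
    """Veltkamp split of a binary64 value into high and low halves."""
    c = 134217729.0 * a               # 2**27 + 1
    hi = c - (c - a)
    return hi, a - hi

def two_prod(a, b):
    """Return (p, e) with p = fl(a * b) and p + e == a * b exactly."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    err1 = p - ah * bh
    err2 = err1 - al * bh
    err3 = err2 - ah * bl
    return p, al * bl - err3

p, e = two_prod(0.1, 0.3)
exact = Fraction(0.1) * Fraction(0.3)   # exact product of the stored values
```

Comparing `Fraction(p) + Fraction(e)` with `exact` checks the a · b ⇒ p + e property in rational arithmetic.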
Reproducible accumulators: Fixed-point
Expose the accumulators. Can compose across uses.
• With a sufficiently wide accumulator, floating point
becomes fixed point.
• Fixed point is reproducible.
• Kulisch superaccumulator: binary64 ⇒ >4000 bits
• Various implementations and optimizations: e.g.
Koenig et al. (ARITH17)
• Arm High-Precision Anchored (HPA) [4] number in SVE
• Narrowed by data range, likely ≤ 200 bits
• Kinda wide, kinda binned
• Int / fixed-point summation is much faster than f.p.
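The fixed-point idea can be sketched in a few lines of Python, with an arbitrary-precision integer standing in for the wide accumulator. This models sums only (not products), ignores infinities and NaNs, and says nothing about performance; the helper names are assumptions.

```python
from fractions import Fraction

SCALE_BITS = 1074   # every finite binary64 is an integer multiple of 2**-1074

def to_fixed(x):
    """Exact fixed-point image of a finite binary64 value."""
    n, d = x.as_integer_ratio()       # d is a power of two, at most 2**1074
    return n * ((1 << SCALE_BITS) // d)

def fixed_point_sum(xs):
    total = 0
    for x in xs:
        total += to_fixed(x)          # exact integer addition, order-free
    return float(Fraction(total, 1 << SCALE_BITS))   # one final rounding

def naive_sum(xs):
    s = 0.0
    for x in xs:
        s += x                        # rounds after every addition
    return s

a = [1e16, 1.0, -1e16, 1.0]
b = [1.0, 1.0, 1e16, -1e16]           # same values, different order
```

The two orderings give different results with `naive_sum` but the same result with `fixed_point_sum`, because integer addition is associative and the only rounding happens once at the end.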
[4] Burgess et al., IEEE ToC 7/2019, doi: 10.1109/TC.2018.2855729
Reproducible accumulators: Binned
Expose the accumulators. Example: ReproBLAS [5]:
Goals for summation:
• Build on standard IEEE 754 binary ops
• Tunable precision, ≥ conventional sum
• Handle exceptions reproducibly
• One read-only pass, one reduction
• Minimal memory for tiling
• “Usable:” Can build higher-level and parallel
routines (for some α, β, ...)
[5] Demmel, Ahrens, Nguyen https://bebop.cs.berkeley.edu/reproblas/
ReproBLAS binning
• p: Precision
• K: -fold (in)
• W: Width (in)
          bin32   bin64   bin128
W         13      40      100
K         3 (all formats)
# sums    2^33    2^64    2^124
7n flops for n sums. But twoSum again!
20% overhead for adding 1M #s on 1K XC30 processors
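A heavily simplified sketch of the binning idea, not the ReproBLAS algorithm or its API. It assumes |x| ≤ 1 and a short input, fixes the bin boundaries in advance instead of deriving them from the data, and omits the carry handling that lets the real scheme reach the summand counts in the table. The parameters `top_exp`, `width`, and `folds` are illustrative stand-ins for the roles of W and K.

```python
import math
import random

def binned_sum(xs, top_exp=20, width=40, folds=3):
    """Order-independent sum for |x| <= 1 and modest len(xs) (sketch only)."""
    bins = [0.0] * folds
    for x in xs:
        r = x
        for k in range(folds):
            anchor = 1.5 * 2.0 ** (top_exp - k * width)
            q = (r + anchor) - anchor   # r rounded to a multiple of ulp(anchor)
            bins[k] += q                # exact while the bin has headroom
            r -= q                      # exact remainder, passed to next bin
    total = 0.0
    for s in reversed(bins):            # fixed combination order
        total += s
    return total

rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
ys = xs[:]
rng.shuffle(ys)                         # same values, different order
reference = math.fsum(xs)
```

Each slice `q` depends only on the value being added, and the additions into each bin are exact under the stated assumptions, so `binned_sum(xs)` and `binned_sum(ys)` agree bit for bit.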
Implementations and architectures
Defining twoSum
Long history...
quickTwoSum(a, b) ⇒ (s, e), |a| ≥ |b|
1. s = a + b
2. e = b − (s − a)
twoSum(a, b) ⇒ (s, e)
1. s = a + b
2. t = s − a
3. e = (a − (s − t)) + (b − t)
Many sign re-arrangements possible.
Each generates different ±0, NaNs, ...
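The two definitions above translate directly to Python floats (binary64, round-to-nearest-even). The check against exact rational arithmetic and the infinite input are added here for illustration.

```python
from fractions import Fraction
import math

def quick_two_sum(a, b):
    """Requires |a| >= |b|; returns (s, e) with s + e == a + b."""
    s = a + b
    e = b - (s - a)
    return s, e

def two_sum(a, b):
    """No ordering requirement; returns (s, e) with s + e == a + b."""
    s = a + b
    t = s - a
    e = (a - (s - t)) + (b - t)
    return s, e

s, e = two_sum(1.0, 1e16)               # the 1.0 is lost from s, kept in e
exact = Fraction(1.0) + Fraction(1e16)
_, e_inf = two_sum(math.inf, 1.0)       # exceptional input
```

For finite inputs without overflow, `s + e` reproduces the exact sum. With an infinite input this arrangement yields a NaN error term, one instance of the exceptional-case differences noted above.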
Specifying twoSum ⇒ augmentedAddition
Finally: IEEE 754-2019 (to be fully available on 7/22?)
• augmentedAddition(a, b) ⇒ (s, e)
• augmentedSubtraction(a, b) ⇒ (d, e)
• augmentedMultiplication(a, b) ⇒ (p, e)
Fully specified exceptional cases including overflow,
underflow, invalid, and signed zeros.
“Funny” aspect: Round ties towards zero.
• Mode exists for decimal but not binary.
• Reproducibility: Rounding must not depend on
value!
augmented∗ operations i
• Rationale: Riedy, Demmel. “Augmented Arithmetic
Operations Proposed for IEEE-754 2018.” ARITH 18,
doi: 10.1109/ARITH.2018.8464813
• Will be expanded in IEEE 754 background documents
• Software:
• FP implementation: Boldo, Lauter, Muller. “Emulating
[ties→0] ‘augmented’ [...] operations using
[ties→even] arithmetic.” HAL 2137968.
• Int impl: (proof eternally in progress)
augmented∗ operations ii
• Hardware: There is “interest” from processor
designers...
• In a two-operation form, one for each piece.
• There need to be users / uses of $ importance.
Possible HW performance: single CPU core
Emulating augmentedAddition as two instructions
improves double-double:
Operation              Skylake   Haswell
Addition latency       −55%      −45%
Addition throughput    +36%      +18%
DDGEMM MFLOP/s from reduced insn dependencies:
Operation                      Intel Skylake       Intel Haswell
“Typical” implementation       1732 (≈ 1/37 DP)    1199 (≈ 1/45 DP)
Two-insn augmentedAddition     3344 (≈ 1/19 DP)    2283 (≈ 1/24 DP)
Dukhan, Riedy, Vuduc. “Wanted: Floating-point add round-off error instruction,” PMAA 2016, arXiv 1603.00491.
ReproBLAS dot product: 33% rate improvement,
only 2× slower than non-reproducible.
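For context, a sketch of one common double-double addition (the short, less accurate variant), showing where twoSum sits on the critical path. This is an illustration in Python, not the benchmarked code, and the names are assumptions.

```python
def two_sum(a, b):
    s = a + b
    t = s - a
    e = (a - (s - t)) + (b - t)         # dependent chain after the first add
    return s, e

def quick_two_sum(a, b):
    s = a + b
    return s, b - (s - a)               # requires |a| >= |b|

def dd_add(x, y):
    """Add double-double values x = (hi, lo) and y = (hi, lo)."""
    s, e = two_sum(x[0], y[0])          # sum of high parts plus its error
    e += x[1] + y[1]                    # fold in the low parts
    return quick_two_sum(s, e)          # renormalize into (hi, lo)

result = dd_add((1.0, 0.0), (2.0**-60, 0.0))
```

Every operation in `two_sum` after the first addition depends on the one before it. An instruction pair that delivers `s` and `e` directly would shorten that chain, which is the effect the tables above measure.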
Closing
• Even before the “interesting” parts, challenges in
reproducible linear algebra:
• Exceptional conditions and propagation
• Different complex arithmetics
• (Also: Signaling exceptions and more)
• Each approach has benefits and limitations.
• ExBLAS: Cannot easily compose, but easy to
understand.
• ReproBLAS: Composable (multipliers ±1, 0).
• Aside: Composing is useful for streaming data...
• Performance can be ok, so long as networks are
involved.
• Really need hardware for day-to-day use.
Upcoming issues
• Current work assumes a processor-centric view.
• Novel memory-centric processors may minimize
state, limiting accumulators.
• FPGAs near / in memory change engineering choices.
• Coming flood of 8-bit multipliers.
• Energy and space scale as bits².
• Machine learning “wants” them (bfloat16).
• Reproducibility and neuromorphic? Quantum?
(Images: Emu Chick; FPGAs & HMC/HBM; FPAA. Georgia Tech CRNCH Rogues Gallery.)
Reproducible Linear Algebra, Alg. to Arch. — ICIAM 2019 21/21

More Related Content

What's hot

Artificial Neural Networks for Storm Surge Prediction in North Carolina
Artificial Neural Networks for Storm Surge Prediction in North CarolinaArtificial Neural Networks for Storm Surge Prediction in North Carolina
Artificial Neural Networks for Storm Surge Prediction in North CarolinaAnton Bezuglov
 
Implementation of Low Power and Area-Efficient Carry Select Adder
Implementation of Low Power and Area-Efficient Carry Select AdderImplementation of Low Power and Area-Efficient Carry Select Adder
Implementation of Low Power and Area-Efficient Carry Select AdderIJMTST Journal
 
A 12-Bit High Speed Analog To Digital Convertor Using μp 8085
A 12-Bit High Speed Analog To Digital Convertor Using μp 8085A 12-Bit High Speed Analog To Digital Convertor Using μp 8085
A 12-Bit High Speed Analog To Digital Convertor Using μp 8085IJERA Editor
 
Design optimization of BOP for fatigue and strength in HPHT environment using...
Design optimization of BOP for fatigue and strength in HPHT environment using...Design optimization of BOP for fatigue and strength in HPHT environment using...
Design optimization of BOP for fatigue and strength in HPHT environment using...Arindam Chakraborty, Ph.D., P.E. (CA, TX)
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)ijceronline
 
128 bit low power and area efficient carry select adder amit bakshi academia
128 bit low power and area efficient carry select adder   amit bakshi   academia128 bit low power and area efficient carry select adder   amit bakshi   academia
128 bit low power and area efficient carry select adder amit bakshi academiagopi448
 
A digital calibration algorithm with variable amplitude dithering for domain-...
A digital calibration algorithm with variable amplitude dithering for domain-...A digital calibration algorithm with variable amplitude dithering for domain-...
A digital calibration algorithm with variable amplitude dithering for domain-...VLSICS Design
 
Reza Talk En Kf 09
Reza Talk En Kf 09Reza Talk En Kf 09
Reza Talk En Kf 09rezatavakoli
 

What's hot (9)

Artificial Neural Networks for Storm Surge Prediction in North Carolina
Artificial Neural Networks for Storm Surge Prediction in North CarolinaArtificial Neural Networks for Storm Surge Prediction in North Carolina
Artificial Neural Networks for Storm Surge Prediction in North Carolina
 
Implementation of Low Power and Area-Efficient Carry Select Adder
Implementation of Low Power and Area-Efficient Carry Select AdderImplementation of Low Power and Area-Efficient Carry Select Adder
Implementation of Low Power and Area-Efficient Carry Select Adder
 
A 12-Bit High Speed Analog To Digital Convertor Using μp 8085
A 12-Bit High Speed Analog To Digital Convertor Using μp 8085A 12-Bit High Speed Analog To Digital Convertor Using μp 8085
A 12-Bit High Speed Analog To Digital Convertor Using μp 8085
 
Design optimization of BOP for fatigue and strength in HPHT environment using...
Design optimization of BOP for fatigue and strength in HPHT environment using...Design optimization of BOP for fatigue and strength in HPHT environment using...
Design optimization of BOP for fatigue and strength in HPHT environment using...
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
 
128 bit low power and area efficient carry select adder amit bakshi academia
128 bit low power and area efficient carry select adder   amit bakshi   academia128 bit low power and area efficient carry select adder   amit bakshi   academia
128 bit low power and area efficient carry select adder amit bakshi academia
 
A digital calibration algorithm with variable amplitude dithering for domain-...
A digital calibration algorithm with variable amplitude dithering for domain-...A digital calibration algorithm with variable amplitude dithering for domain-...
A digital calibration algorithm with variable amplitude dithering for domain-...
 
Reza Talk En Kf 09
Reza Talk En Kf 09Reza Talk En Kf 09
Reza Talk En Kf 09
 
Electromagnetic Simulations for Aerospace Applications
Electromagnetic Simulations for Aerospace ApplicationsElectromagnetic Simulations for Aerospace Applications
Electromagnetic Simulations for Aerospace Applications
 

Similar to ICIAM 2019: Reproducible Linear Algebra from Application to Architecture

ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph AnalysisICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph AnalysisJason Riedy
 
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splinesOptimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splinesIntel® Software
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)inventionjournals
 
Static Energy Prediction in Software: A Worst-Case Scenario Approach
Static Energy Prediction in Software: A Worst-Case Scenario ApproachStatic Energy Prediction in Software: A Worst-Case Scenario Approach
Static Energy Prediction in Software: A Worst-Case Scenario ApproachGreenLabAtDI
 
Products go Green: Worst-Case Energy Consumption in Software Product Lines
Products go Green: Worst-Case Energy Consumption in Software Product LinesProducts go Green: Worst-Case Energy Consumption in Software Product Lines
Products go Green: Worst-Case Energy Consumption in Software Product LinesGreenLabAtDI
 
Scalable machine learning
Scalable machine learningScalable machine learning
Scalable machine learningArnaud Rachez
 
ParaForming - Patterns and Refactoring for Parallel Programming
ParaForming - Patterns and Refactoring for Parallel ProgrammingParaForming - Patterns and Refactoring for Parallel Programming
ParaForming - Patterns and Refactoring for Parallel Programmingkhstandrews
 
Low Power High-Performance Computing on the BeagleBoard Platform
Low Power High-Performance Computing on the BeagleBoard PlatformLow Power High-Performance Computing on the BeagleBoard Platform
Low Power High-Performance Computing on the BeagleBoard Platforma3labdsp
 
(Paper) Efficient Evaluation Methods of Elementary Functions Suitable for SIM...
(Paper) Efficient Evaluation Methods of Elementary Functions Suitable for SIM...(Paper) Efficient Evaluation Methods of Elementary Functions Suitable for SIM...
(Paper) Efficient Evaluation Methods of Elementary Functions Suitable for SIM...Naoki Shibata
 
Low power & area efficient carry select adder
Low power & area efficient carry select adderLow power & area efficient carry select adder
Low power & area efficient carry select adderSai Vara Prasad P
 
Barker_SIAMCSE15
Barker_SIAMCSE15Barker_SIAMCSE15
Barker_SIAMCSE15Karen Pao
 
The Berkeley View on the Parallel Computing Landscape
The Berkeley View on the Parallel Computing LandscapeThe Berkeley View on the Parallel Computing Landscape
The Berkeley View on the Parallel Computing Landscapeugur candan
 
PR-297: Training data-efficient image transformers & distillation through att...
PR-297: Training data-efficient image transformers & distillation through att...PR-297: Training data-efficient image transformers & distillation through att...
PR-297: Training data-efficient image transformers & distillation through att...Jinwon Lee
 
ALEA:Fine-grain Energy Profiling with Basic Block sampling
ALEA:Fine-grain Energy Profiling with Basic Block samplingALEA:Fine-grain Energy Profiling with Basic Block sampling
ALEA:Fine-grain Energy Profiling with Basic Block samplingLev Mukhanov
 
A Performance Analysis of Self-* Evolutionary Algorithms on Networks with Cor...
A Performance Analysis of Self-* Evolutionary Algorithms on Networks with Cor...A Performance Analysis of Self-* Evolutionary Algorithms on Networks with Cor...
A Performance Analysis of Self-* Evolutionary Algorithms on Networks with Cor...Rafael Nogueras
 
Realizing Robust and Scalable Evolutionary Algorithms toward Exascale Era
Realizing Robust and Scalable Evolutionary Algorithms toward Exascale EraRealizing Robust and Scalable Evolutionary Algorithms toward Exascale Era
Realizing Robust and Scalable Evolutionary Algorithms toward Exascale EraMasaharu Munetomo
 
Design and Implementation of Different types of Carry skip adder
Design and Implementation of Different types of Carry skip adderDesign and Implementation of Different types of Carry skip adder
Design and Implementation of Different types of Carry skip adderIRJET Journal
 
OpenCL caffe IWOCL 2016 presentation final
OpenCL caffe IWOCL 2016 presentation finalOpenCL caffe IWOCL 2016 presentation final
OpenCL caffe IWOCL 2016 presentation finalJunli Gu
 
C-SAW: A Framework for Graph Sampling and Random Walk on GPUs
C-SAW: A Framework for Graph Sampling and Random Walk on GPUsC-SAW: A Framework for Graph Sampling and Random Walk on GPUs
C-SAW: A Framework for Graph Sampling and Random Walk on GPUsPandey_G
 

Similar to ICIAM 2019: Reproducible Linear Algebra from Application to Architecture (20)

ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph AnalysisICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
 
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splinesOptimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines
 
Bs25412419
Bs25412419Bs25412419
Bs25412419
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
Static Energy Prediction in Software: A Worst-Case Scenario Approach
Static Energy Prediction in Software: A Worst-Case Scenario ApproachStatic Energy Prediction in Software: A Worst-Case Scenario Approach
Static Energy Prediction in Software: A Worst-Case Scenario Approach
 
Products go Green: Worst-Case Energy Consumption in Software Product Lines
Products go Green: Worst-Case Energy Consumption in Software Product LinesProducts go Green: Worst-Case Energy Consumption in Software Product Lines
Products go Green: Worst-Case Energy Consumption in Software Product Lines
 
Scalable machine learning
Scalable machine learningScalable machine learning
Scalable machine learning
 
ParaForming - Patterns and Refactoring for Parallel Programming
ParaForming - Patterns and Refactoring for Parallel ProgrammingParaForming - Patterns and Refactoring for Parallel Programming
ParaForming - Patterns and Refactoring for Parallel Programming
 
Low Power High-Performance Computing on the BeagleBoard Platform
Low Power High-Performance Computing on the BeagleBoard PlatformLow Power High-Performance Computing on the BeagleBoard Platform
Low Power High-Performance Computing on the BeagleBoard Platform
 
(Paper) Efficient Evaluation Methods of Elementary Functions Suitable for SIM...
(Paper) Efficient Evaluation Methods of Elementary Functions Suitable for SIM...(Paper) Efficient Evaluation Methods of Elementary Functions Suitable for SIM...
(Paper) Efficient Evaluation Methods of Elementary Functions Suitable for SIM...
 
Low power & area efficient carry select adder
Low power & area efficient carry select adderLow power & area efficient carry select adder
Low power & area efficient carry select adder
 
Barker_SIAMCSE15
Barker_SIAMCSE15Barker_SIAMCSE15
Barker_SIAMCSE15
 
The Berkeley View on the Parallel Computing Landscape
The Berkeley View on the Parallel Computing LandscapeThe Berkeley View on the Parallel Computing Landscape
The Berkeley View on the Parallel Computing Landscape
 
PR-297: Training data-efficient image transformers & distillation through att...
PR-297: Training data-efficient image transformers & distillation through att...PR-297: Training data-efficient image transformers & distillation through att...
PR-297: Training data-efficient image transformers & distillation through att...
 
ALEA:Fine-grain Energy Profiling with Basic Block sampling
ALEA:Fine-grain Energy Profiling with Basic Block samplingALEA:Fine-grain Energy Profiling with Basic Block sampling
ALEA:Fine-grain Energy Profiling with Basic Block sampling
 
A Performance Analysis of Self-* Evolutionary Algorithms on Networks with Cor...
A Performance Analysis of Self-* Evolutionary Algorithms on Networks with Cor...A Performance Analysis of Self-* Evolutionary Algorithms on Networks with Cor...
A Performance Analysis of Self-* Evolutionary Algorithms on Networks with Cor...
 
Realizing Robust and Scalable Evolutionary Algorithms toward Exascale Era
Realizing Robust and Scalable Evolutionary Algorithms toward Exascale EraRealizing Robust and Scalable Evolutionary Algorithms toward Exascale Era
Realizing Robust and Scalable Evolutionary Algorithms toward Exascale Era
 
Design and Implementation of Different types of Carry skip adder
Design and Implementation of Different types of Carry skip adderDesign and Implementation of Different types of Carry skip adder
Design and Implementation of Different types of Carry skip adder
 
OpenCL caffe IWOCL 2016 presentation final
OpenCL caffe IWOCL 2016 presentation finalOpenCL caffe IWOCL 2016 presentation final
OpenCL caffe IWOCL 2016 presentation final
 
C-SAW: A Framework for Graph Sampling and Random Walk on GPUs
C-SAW: A Framework for Graph Sampling and Random Walk on GPUsC-SAW: A Framework for Graph Sampling and Random Walk on GPUs
C-SAW: A Framework for Graph Sampling and Random Walk on GPUs
 

More from Jason Riedy

Lucata at the HPEC GraphBLAS BoF
Lucata at the HPEC GraphBLAS BoFLucata at the HPEC GraphBLAS BoF
Lucata at the HPEC GraphBLAS BoFJason Riedy
 
LAGraph 2021-10-13
LAGraph 2021-10-13LAGraph 2021-10-13
LAGraph 2021-10-13Jason Riedy
 
Lucata at the HPEC GraphBLAS BoF
Lucata at the HPEC GraphBLAS BoFLucata at the HPEC GraphBLAS BoF
Lucata at the HPEC GraphBLAS BoFJason Riedy
 
Graph analysis and novel architectures
Graph analysis and novel architecturesGraph analysis and novel architectures
Graph analysis and novel architecturesJason Riedy
 
GraphBLAS and Emus
GraphBLAS and EmusGraphBLAS and Emus
GraphBLAS and EmusJason Riedy
 
PEARC19: Wrangling Rogues: A Case Study on Managing Experimental Post-Moore A...
PEARC19: Wrangling Rogues: A Case Study on Managing Experimental Post-Moore A...PEARC19: Wrangling Rogues: A Case Study on Managing Experimental Post-Moore A...
PEARC19: Wrangling Rogues: A Case Study on Managing Experimental Post-Moore A...Jason Riedy
 
Novel Architectures for Applications in Data Science and Beyond
Novel Architectures for Applications in Data Science and BeyondNovel Architectures for Applications in Data Science and Beyond
Novel Architectures for Applications in Data Science and BeyondJason Riedy
 
Characterization of Emu Chick with Microbenchmarks
Characterization of Emu Chick with MicrobenchmarksCharacterization of Emu Chick with Microbenchmarks
Characterization of Emu Chick with MicrobenchmarksJason Riedy
 
CRNCH 2018 Summit: Rogues Gallery Update
CRNCH 2018 Summit: Rogues Gallery UpdateCRNCH 2018 Summit: Rogues Gallery Update
CRNCH 2018 Summit: Rogues Gallery UpdateJason Riedy
 
Augmented Arithmetic Operations Proposed for IEEE-754 2018
Augmented Arithmetic Operations Proposed for IEEE-754 2018Augmented Arithmetic Operations Proposed for IEEE-754 2018
Augmented Arithmetic Operations Proposed for IEEE-754 2018Jason Riedy
 
Graph Analysis: New Algorithm Models, New Architectures
Graph Analysis: New Algorithm Models, New ArchitecturesGraph Analysis: New Algorithm Models, New Architectures
Graph Analysis: New Algorithm Models, New ArchitecturesJason Riedy
 
CRNCH Rogues Gallery: A Community Core for Novel Computing Platforms
CRNCH Rogues Gallery: A Community Core for Novel Computing PlatformsCRNCH Rogues Gallery: A Community Core for Novel Computing Platforms
CRNCH Rogues Gallery: A Community Core for Novel Computing PlatformsJason Riedy
 
CRNCH Rogues Gallery: A Community Core for Novel Computing Platforms
CRNCH Rogues Gallery: A Community Core for Novel Computing PlatformsCRNCH Rogues Gallery: A Community Core for Novel Computing Platforms
CRNCH Rogues Gallery: A Community Core for Novel Computing PlatformsJason Riedy
 
A New Algorithm Model for Massive-Scale Streaming Graph Analysis
A New Algorithm Model for Massive-Scale Streaming Graph AnalysisA New Algorithm Model for Massive-Scale Streaming Graph Analysis
A New Algorithm Model for Massive-Scale Streaming Graph AnalysisJason Riedy
 
High-Performance Analysis of Streaming Graphs
High-Performance Analysis of Streaming Graphs High-Performance Analysis of Streaming Graphs
High-Performance Analysis of Streaming Graphs Jason Riedy
 
High-Performance Analysis of Streaming Graphs
High-Performance Analysis of Streaming GraphsHigh-Performance Analysis of Streaming Graphs
High-Performance Analysis of Streaming GraphsJason Riedy
 
Updating PageRank for Streaming Graphs
Updating PageRank for Streaming GraphsUpdating PageRank for Streaming Graphs
Updating PageRank for Streaming GraphsJason Riedy
 
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming GraphsScalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming GraphsJason Riedy
 
Graph Analysis Beyond Linear Algebra
Graph Analysis Beyond Linear AlgebraGraph Analysis Beyond Linear Algebra
Graph Analysis Beyond Linear AlgebraJason Riedy
 
Network Challenge: Error and Sensitivity Analysis
Network Challenge: Error and Sensitivity AnalysisNetwork Challenge: Error and Sensitivity Analysis
Network Challenge: Error and Sensitivity AnalysisJason Riedy
 

More from Jason Riedy (20)

Lucata at the HPEC GraphBLAS BoF
Lucata at the HPEC GraphBLAS BoFLucata at the HPEC GraphBLAS BoF
Lucata at the HPEC GraphBLAS BoF
 
LAGraph 2021-10-13
LAGraph 2021-10-13LAGraph 2021-10-13
LAGraph 2021-10-13
 
Lucata at the HPEC GraphBLAS BoF
Lucata at the HPEC GraphBLAS BoFLucata at the HPEC GraphBLAS BoF
Lucata at the HPEC GraphBLAS BoF
 
Graph analysis and novel architectures
Graph analysis and novel architecturesGraph analysis and novel architectures
Graph analysis and novel architectures
 
GraphBLAS and Emus
GraphBLAS and EmusGraphBLAS and Emus
GraphBLAS and Emus
 
PEARC19: Wrangling Rogues: A Case Study on Managing Experimental Post-Moore A...
PEARC19: Wrangling Rogues: A Case Study on Managing Experimental Post-Moore A...PEARC19: Wrangling Rogues: A Case Study on Managing Experimental Post-Moore A...
PEARC19: Wrangling Rogues: A Case Study on Managing Experimental Post-Moore A...
 
ICIAM 2019: Reproducible Linear Algebra from Application to Architecture

  • 1. Reproducible Linear Algebra from Application to Architecture Jason Riedy (GT) James Demmel (UCB) Peter Ahrens (MIT) ICIAM, 16 July 2019
  • 2. Outline Motivation: Why worry? Higher level methods & libraries Implementations and architectures Closing Reproducible Linear Algebra, Alg. to Arch. — ICIAM 2019 1/21
  • 4. Informal UCB survey
    The Sca/LAPACK prospectus (2006) identified the usefulness of reproducibility. A request appeared on NA-Digest in 2010. So Dr. Demmel asked around 100 relevant UCB faculty if “reproducibility” matters. [1]
    • Typical response: Debugging.
    • Atypical responses:
      • Error analysis! (One guess...)
      • Investigating rare instances in simulated data, so must be reproducible.
      • UN uses my code to detect nuclear tests.
    [1] Demmel, Riedy, Ahrens. “Reproducible BLAS: Make Addition Associative Again!” SIAM News, Oct. 2018.
  • 5. (Some) kinds of reproducibility
    • Debugging
      • Just developer’s platform.
      • Then users occur.
    • Investigate rare instance
      • Small job, similar to debugging.
      • Larger? # proc changes.
    • Schrödinger’s nuke, climate negotiations, ...
      • Likely little control over the runtime environment.
    • Accounting, some finance
      • Legal: identical across history.
  • 6. Growing “demand,” or at least interest
    • Reproducibility is a tag on SC sessions.
      • Often means reproducible (“replicable”) experiments.
      • But many sessions and talks on numerical reproducibility.
    • Vendors want to reduce support calls.
      • New MATLAB™ version, new optimized BLAS, new cache hierarchy ⇒ “My results changed! Fix it!”
      • Compiler A evaluates constant f.p. expressions at run time, compiler B at compile time ⇒ “My results changed! Fix it!”
    • Vendors (HW & SW) want to sell new versions.
  • 7. What Stands in the Way?
    1. Performance.
    2. Shared definitions.
    3. Performance.
    4. “Ease of use.”
    5. Performance.
    (Photo: Summit, from OLCF at ORNL.)
    Debugging run much longer than an email check ⇒ No.
    Production run uses up an allocation ⇒ No.
    Some help from the excess of processing cycles compared to memory, and then compared to the interconnect.
  • 8. Shared definitions: Exceptional behavior?
    0 × NaN ⇒ NaN, except when ⇒ 0?
    0 × ∞ ⇒ NaN, except when ⇒ 0?
    GER: BLAS2 routine for A = A + α x y^T. In the reference implementation:
    • If α = 0, x y^T is not evaluated, no propagation.
    • If y_k = 0, skip column, no propagation.
    A symmetric update of a general matrix may not be a symmetric update! Some vendor libraries behave differently.
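The propagation difference described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ger_skipping` and `ger_plain` are hypothetical names, matrices are plain lists of rows, and the early exits mimic the two reference-implementation rules listed above rather than reproducing any particular BLAS source.

```python
import math

def ger_skipping(alpha, x, y, a):
    """A + alpha*x*y^T with the two early exits listed on the slide."""
    out = [row[:] for row in a]
    if alpha == 0.0:
        return out                  # x*y^T is never evaluated
    for j, yj in enumerate(y):
        if yj == 0.0:
            continue                # whole column skipped, nothing propagates
        t = alpha * yj
        for i, xi in enumerate(x):
            out[i][j] += xi * t
    return out

def ger_plain(alpha, x, y, a):
    """Same update with every product evaluated, so NaN always propagates."""
    out = [row[:] for row in a]
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[i][j] += alpha * xi * yj
    return out

nan = float("nan")
v = [0.0, nan]                      # symmetric update: x = y = v
zero = [[0.0, 0.0], [0.0, 0.0]]
skipped = ger_skipping(1.0, v, v, zero)
plain = ger_plain(1.0, v, v, zero)
print(skipped)                      # entry (1,0) stays 0.0, entry (0,1) is NaN
print(plain)                        # both off-diagonal entries are NaN
```

With x = y, the skipping variant leaves entry (1,0) at zero while entry (0,1) becomes NaN, so the "symmetric" update yields an unsymmetric result; the plain variant stays symmetric.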
  • 9. Other exceptional issues
    ISAMAX, ICAMAX: Find entry of max. |magnitude|.
    • Return NaN’s location only if it’s first.
    • ICAMAX uses |real| + |imag|, which may overflow!
    • Actual |complex| may differ (e.g. C++ v. C).
    GEMM: BLAS3, C = α op(A) op(B) + βC
    • If α = 0, then A, B do not propagate.
    • If β = 0, then C does not...
    • BUT traditionally β = 0 overwrites C.
    More examples in Demmel, Gates, Henry, Li, Riedy, Tang, “A Proposal for a Next-Generation BLAS,” http://goo.gl/D1UKnw.
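A small Python sketch of the ICAMAX points above. The function names are hypothetical, complex entries are `(re, im)` tuples, and the loop structure (seed with the first entry, then strict `>` comparisons) is an assumption modeled on the behavior the slide describes, not a copy of any library.

```python
import math

def argmax_abs1(z):
    """Index of the first maximum of |re| + |im| (ICAMAX-style measure)."""
    best, best_val = 0, abs(z[0][0]) + abs(z[0][1])
    for k in range(1, len(z)):
        val = abs(z[k][0]) + abs(z[k][1])
        if val > best_val:          # comparisons with NaN are always False
            best, best_val = k, val
    return best

def argmax_modulus(z):
    """Same search using the true modulus sqrt(re^2 + im^2)."""
    best, best_val = 0, math.hypot(z[0][0], z[0][1])
    for k in range(1, len(z)):
        val = math.hypot(z[k][0], z[k][1])
        if val > best_val:
            best, best_val = k, val
    return best

# |re| + |im| overflows to infinity for the first entry, the modulus does not.
z = [(1e308, 1e308), (1.7e308, 0.0)]
print(argmax_abs1(z), argmax_modulus(z))

nan = float("nan")
print(argmax_abs1([(nan, 0.0), (5.0, 0.0)]))   # NaN first: its index is returned
print(argmax_abs1([(1.0, 0.0), (nan, 0.0)]))   # NaN later: it is never selected
```

The two measures pick different entries for `z`, and a NaN is reported only when it sits in the first position.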
  • 10. Complex complexity
    It gets worse.
    • C99 & C11: Complex is infinite if either component is, regardless of NaNs.
      • Hence multiplication yields complex ∞, but
      • “by hand” would yield NaN + iNaN.
      • C standard defines 30+ line complex multiplication.
      • Vectorization tends to ignore this definition.
    • Fortran, C++ do not define semantics!
      • People expect their particular compiler’s result.
    • Division is at least as bad.
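To make the "by hand" case concrete, here is a minimal Python sketch using explicit `(re, im)` tuples instead of a built-in complex type, so that no language-specific complex semantics interfere. `mul_by_hand` and `is_c99_infinite` are hypothetical helper names; the second one only encodes the classification rule quoted above.

```python
import math

def mul_by_hand(a, b):
    """Textbook complex product (ar + i*ai) * (br + i*bi)."""
    (ar, ai), (br, bi) = a, b
    return (ar * br - ai * bi, ar * bi + ai * br)

def is_c99_infinite(z):
    """C99/C11 view: infinite if either component is infinite, NaN or not."""
    return math.isinf(z[0]) or math.isinf(z[1])

inf, nan = float("inf"), float("nan")
z = (inf, nan)        # counts as an infinity under the C rule
w = (2.0, 3.0)        # finite and nonzero
p = mul_by_hand(z, w)
print(is_c99_infinite(z), p)
```

The operand is classified as infinite, yet the textbook formula returns NaN in both components, which is the mismatch the slide points at.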
  • 11. Higher level methods & libraries
  • 12. Assuming “agreement” on exceptions...
    (Some) current approaches:
    • Specific platform reproducibility for debugging.
      • Intel CNR, NVIDIA
    • Arbitrary precision / exact comp.
      • Not saying more on this.
    • Correctly rounded results
      • ExBLAS
      • Not “faithful” rounding: that permits either of two neighboring values, and another implementation may choose the other.
    • Reproducible accumulators
      • Very wide accumulators (Kulisch, Arm HPA)
      • Binned accumulators (ReproBLAS)
  • 13. Vendor debugging support
    Intel Conditional Numerical Reproducibility (CNR) in MKL [2]:
    • “[...] calls to Intel® MKL occur in a single executable”
    • “the number of computational threads [...] remains constant [...]” with SSE/AVX selection routines
    • Strict CNR Mode: Any # threads for GEMM, TRSM, and SYMM.
    NVIDIA cuBLAS [3]:
    • “By design, all CUBLAS API routines from a given toolkit version, generate the same bit-wise results at every run when executed on GPUs with the same architecture and the same number of SMs.”
    • No promise across versions, can disable for performance
    [2] https://software.intel.com/en-us/articles/introduction-to-the-conditional-numerical-reproducibility-cnr
    [3] https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
  • 14. Correctly rounded results
    Example: the ExBLAS, https://exblas.lip6.fr/
    • Deliver correctly rounded results for AXPY, SUM, DOT, TRSV, GEMV, GEMM, ...
    • Reproducible no matter parallelism, vectors, distribution
    • Used in fluid simulations with discontinuous Galerkin (arXiv 1807.01971), parallel PCG (Wednesday!)
    • Memory / communication limited.
    Built on accumulators using exact operations
    • twoSum (a + b ⇒ s + e), and
    • twoProd (a · b ⇒ p + e).
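Why a correctly rounded sum is reproducible can be seen with Python's standard `math.fsum`, which returns the correctly rounded sum of its inputs. It is used here purely as a stand-in to illustrate the property; it is not ExBLAS and says nothing about how ExBLAS is implemented. `naive_sum` is a hypothetical helper that performs plain left-to-right floating-point addition.

```python
import math

def naive_sum(xs):
    """Plain left-to-right floating-point accumulation."""
    acc = 0.0
    for x in xs:
        acc += x
    return acc

data = [1e16, 1.0, -1e16, 1.0]
flipped = list(reversed(data))

# Ordinary summation depends on the order of the operands.
print(naive_sum(data), naive_sum(flipped))

# A correctly rounded sum has a single possible answer, whatever the order.
print(math.fsum(data), math.fsum(flipped))
```

Because the correctly rounded result is a function of the input multiset alone, any reordering caused by threading, vectorization, or data distribution yields the same bits.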
  • 15. Reproducible accumulators: Fixed-point
    Expose the accumulators. Can compose across uses.
    • With a sufficiently wide accumulator, floating point becomes fixed point.
      • Fixed point is reproducible.
    • Kulisch superaccumulator: binary64 ⇒ >4000 bits
      • Various implementations and optimizations: e.g. Koenig et al. (ARITH17)
    • Arm High-Precision Anchored (HPA) [4] number in SVE
      • Narrowed by data range, likely ≤ 200 bits
      • Kinda wide, kinda binned
    • Int / fixed-point summation is much faster than f.p.
    [4] Burgess et al., IEEE ToC 7/2019, doi: 10.1109/TC.2018.2855729
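The "floating point becomes fixed point" idea can be sketched with Python's arbitrary-precision integers standing in for a very wide accumulator. This is an illustration only, assuming finite binary64 inputs; `SCALE`, `to_fixed`, `fixed_sum`, and `to_float` are hypothetical names, and a hardware or library superaccumulator would use a fixed number of bits instead of an unbounded integer.

```python
import random

SCALE = 1 << 1074       # 2**-1074 is the smallest positive binary64 value

def to_fixed(x):
    """Exact integer equal to x * 2**1074 (finite x only)."""
    n, d = x.as_integer_ratio()     # exact; d is a power of two <= 2**1074
    return n * (SCALE // d)

def fixed_sum(xs):
    """Accumulate in integer (fixed-point) form: exact and associative."""
    acc = 0
    for x in xs:
        acc += to_fixed(x)
    return acc

def to_float(acc):
    """Round the accumulator once; int / int division is correctly rounded."""
    return acc / SCALE

data = [1e16, 1.0, -1e16, 1.0, 0.1, -0.3]
shuffled = data[:]
random.Random(0).shuffle(shuffled)

acc_a = fixed_sum(data)
acc_b = fixed_sum(shuffled)
print(acc_a == acc_b, to_float(acc_a))
```

Since integer addition is associative, partial accumulators from different threads or nodes can be combined in any order, which is what makes the exposed accumulator composable.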
  • 16. Reproducible accumulators: Binned
    Expose the accumulators. Example: ReproBLAS [5]
    Goals for summation:
    • Build on standard IEEE 754 binary ops
    • Tunable precision, ≥ conventional sum
    • Handle exceptions reproducibly
    • One read-only pass, one reduction
    • Minimal memory for tiling
    • “Usable:” Can build higher-level and parallel routines (for some α, β, ...)
    [5] Demmel, Ahrens, Nguyen, https://bebop.cs.berkeley.edu/reproblas/
  • 17. ReproBLAS binning
    • p: Precision
    • K: -fold (in)
    • W: Width (in)

                 bin32    bin64    bin128
      W          13       40       100
      K          3
      # sums     2^33     2^64     2^124

    7n flops for n sums. But twoSum again!
    20% overhead for adding 1M #s on 1K XC30 processors
  • 19. Defining twoSum
    Long history...
    quickTwoSum(a, b) ⇒ (s, e), requires |a| ≥ |b|
      1. s = a + b
      2. e = b − (s − a)
    twoSum(a, b) ⇒ (s, e)
      1. s = a + b
      2. t = s − a
      3. e = (a − (s − t)) + (b − t)
    Many sign re-arrangements possible. Each generates different ±0, NaNs, ...
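The two definitions on this slide translate directly into Python, whose `float` is IEEE 754 binary64 on common platforms. The function names are this sketch's own, and the exactness check with `Fraction` assumes finite inputs and no overflow.

```python
from fractions import Fraction

def quick_two_sum(a, b):
    """Error-free sum, valid only when |a| >= |b|."""
    s = a + b
    e = b - (s - a)
    return s, e

def two_sum(a, b):
    """Error-free sum for any ordering of a and b."""
    s = a + b
    t = s - a
    e = (a - (s - t)) + (b - t)
    return s, e

s, e = two_sum(0.1, 0.2)
# s is the rounded sum and e the rounding error, so s + e equals a + b exactly.
print(s, e, Fraction(s) + Fraction(e) == Fraction(0.1) + Fraction(0.2))

print(quick_two_sum(1e16, 1.0))   # the 1.0 lost by rounding reappears in e
print(two_sum(1.0, 1e16))         # same pair without the ordering precondition
```

Exact rational arithmetic is used only to check the result; the operations themselves are ordinary floating-point additions and subtractions.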
  • 20. Specifying twoSum ⇒ augmentedAddition
    Finally: IEEE 754-2019 (to be fully available on 7/22?)
    • augmentedAddition(a, b) ⇒ (s, e)
    • augmentedSubtraction(a, b) ⇒ (d, e)
    • augmentedMultiplication(a, b) ⇒ (p, e)
    Fully specified exceptional cases including overflow, underflow, invalid, and signed zeros.
    “Funny” aspect: Round ties towards zero.
    • Mode exists for decimal but not binary.
    • Reproducibility: Rounding must not depend on value!
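The effect of rounding ties toward zero can be shown with a reference-style sketch that works from the exact rational sum. This is an assumption-laden illustration, not the standard's specification: `augmented_addition` is a hypothetical name, only finite inputs without overflow are handled, none of the exceptional cases are modeled, and `math.nextafter` requires Python 3.9 or later.

```python
import math
from fractions import Fraction

def augmented_addition(a, b):
    """(s, e): s is a + b rounded to nearest with ties toward zero, e = a + b - s."""
    exact = Fraction(a) + Fraction(b)
    s = float(exact)                       # nearest, ties to even
    if abs(Fraction(s)) > abs(exact):      # rounded away from zero
        toward_zero = math.nextafter(s, 0.0)
        if Fraction(s) - exact == exact - Fraction(toward_zero):
            s = toward_zero                # exact tie: take the smaller magnitude
    e = float(exact - Fraction(s))
    return s, e

# 1e16 + 3 lies exactly halfway between 1e16 + 2 and 1e16 + 4.
print(1e16 + 3.0)                          # ties-to-even picks 1e16 + 4
print(augmented_addition(1e16, 3.0))       # ties-toward-zero picks 1e16 + 2
print(augmented_addition(-1e16, -3.0))
```

Default binary arithmetic breaks this tie by looking at the parity of the candidate significands, whereas the augmented operation's choice depends only on which candidate is closer to zero.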
  • 21. augmented∗ operations i
    • Rationale: Riedy, Demmel. “Augmented Arithmetic Operations Proposed for IEEE-754 2018.” ARITH 18, doi: 10.1109/ARITH.2018.8464813
      • Will be expanded in IEEE 754 background documents
    • Software:
      • FP implementation: Boldo, Lauter, Muller. “Emulating [ties→0] ‘augmented’ [...] operations using [ties→even] arithmetic.” HAL 2137968.
      • Int impl: (proof eternally in progress)
  • 22. augmented∗ operations ii
    • Hardware: There is “interest” from processor designers...
      • In a two-operation form, one for each piece.
      • There need to be users / uses of $ importance.
  • 23. Possible HW performance: single CPU core
    Emulating augmentedAddition as two instructions improves double-double:

      Operation              Skylake    Haswell
      Addition latency       −55%       −45%
      Addition throughput    +36%       +18%

    DDGEMM MFLOP/s from reduced insn dependencies:

      Operation                     Intel Skylake       Intel Haswell
      “Typical” implementation      1732 (≈ 1/37 DP)    1199 (≈ 1/45 DP)
      Two-insn augmentedAddition    3344 (≈ 1/19 DP)    2283 (≈ 1/24 DP)

    Dukhan, Riedy, Vuduc. “Wanted: Floating-point add round-off error instruction,” PMAA 2016, arXiv 1603.00491.
    ReproBLAS dot product: 33% rate improvement, only 2× slower than non-reproducible.
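For readers unfamiliar with double-double arithmetic, the following Python sketch shows where the error-free addition sits in such code. It is a simplified illustration and not the DDGEMM kernel measured above: `dd_add_double` and `dd_sum` are hypothetical names, and each `two_sum` call marks a spot that a hardware augmentedAddition could replace.

```python
def two_sum(a, b):
    """Error-free addition: s + e equals a + b exactly (no overflow)."""
    s = a + b
    t = s - a
    e = (a - (s - t)) + (b - t)
    return s, e

def dd_add_double(hi, lo, y):
    """Add a double y to the double-double value hi + lo."""
    s, e = two_sum(hi, y)       # candidate for one augmentedAddition
    e += lo
    return two_sum(s, e)        # renormalize; a second candidate

def dd_sum(xs):
    """Accumulate doubles into a double-double (hi, lo) pair."""
    hi, lo = 0.0, 0.0
    for x in xs:
        hi, lo = dd_add_double(hi, lo, x)
    return hi, lo

print(dd_sum([1e16, 1.0, -1e16, 1.0]))
```

Each software `two_sum` costs six dependent floating-point operations, which is the dependency chain the two-instruction form shortens.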
  • 25. Closing
    • Even before the “interesting” parts, challenges in reproducible linear algebra:
      • Exceptional conditions and propagation
      • Different complex arithmetics
      • (Also: Signaling exceptions and more)
    • Each approach has benefits and limitations.
      • ExBLAS: Cannot easily compose, but easy to understand.
      • ReproBLAS: Composable (multipliers ±1, 0).
      • Aside: Composing is useful for streaming data...
    • Performance can be ok, so long as networks are involved.
    • Really need hardware for day-to-day use.
  • 26. Upcoming issues
    • Current work assumes a processor-centric view.
      • Novel memory-centric processors may minimize state, limiting accumulators.
      • FPGAs near / in memory change engineering choices.
    • Coming flood of 8-bit multipliers.
      • Energy and space scale as bits².
      • Machine learning “wants” them (bfloat16).
    • Reproducibility and neuromorphic? Quantum?
    (Images: Emu Chick; FPGAs & HMC/HBM; FPAA. Georgia Tech CRNCH Rogues Gallery.)