Austin_SIAMCSE15
1. Incorporating Error Detection and Recovery into Hierarchically Semi-Separable Matrix Operations
Brian Austin
Alex Druinsky, Osni Marquez, Eric Roman, Sherry Li (LBNL)
April 8, 2015
2. Towards Optimal Order Resilient Solvers at Extreme Scale (TOORSES)
Linear solvers are ubiquitous in scientific computing
• Performance
– HSS matrix format reduces computational complexity
• Resilience
– Error rates may increase on extreme scale systems.
• Increased concurrency – more parts that might fail
• Potentially lower part reliability (smaller transistors, near-threshold voltage)
6. Algorithm Based Fault Tolerance (ABFT) for Dense Matrices (Huang & Abraham, 1984)
Checksum protection for individual matrices
Recovers up to one error per row/column
(e is the vector of all ones, eT its transpose; brackets mark a stored checksum)
eT.A = [eTA]
A.e = [Ae]
Matrix multiplication preserves checksums
[eTA].B = eT.[AB]
A.[Be] = [AB].e
[Figure: checksum-augmented product A × B = C. A carries [eTA] and [Ae], B carries [eTB] and [Be]; the checksums of C satisfy A.[Be] = C.e and [eTA].B = eT.C.]
Checksum relationships can be derived from associative properties.
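The checksum relationships above can be tried out on a small example. The following is a minimal sketch in plain Python, assuming exact integer arithmetic so that checksums can be compared with `==`; the helper names (`col_checksum`, `row_checksum`, `locate_and_fix`) are illustrative and not taken from the talk or from any library.

```python
# Sketch of Huang & Abraham style checksums on list-of-lists matrices.
# All helper names are hypothetical, chosen for this illustration only.

def matmul(A, B):
    """Plain dense product C = A.B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def col_checksum(M):
    """eT.M: one sum per column."""
    return [sum(col) for col in zip(*M)]

def row_checksum(M):
    """M.e: one sum per row."""
    return [sum(row) for row in M]

def vec_mat(v, M):
    """Row vector times matrix."""
    return [sum(x * m for x, m in zip(v, col)) for col in zip(*M)]

def mat_vec(M, v):
    """Matrix times column vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def locate_and_fix(C, expected_cols, expected_rows):
    """Correct at most one corrupted entry of C in place; return (i, j) or None."""
    bad_cols = [j for j, (s, t) in enumerate(zip(col_checksum(C), expected_cols)) if s != t]
    bad_rows = [i for i, (s, t) in enumerate(zip(row_checksum(C), expected_rows)) if s != t]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("more than one error: these checksums cannot correct it")
    i, j = bad_rows[0], bad_cols[0]
    C[i][j] += expected_rows[i] - row_checksum(C)[i]
    return (i, j)

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

# Checksums of C derived from the inputs alone, without touching C.
expected_cols = vec_mat(col_checksum(A), B)   # [eTA].B
expected_rows = mat_vec(A, row_checksum(B))   # A.[Be]

C = matmul(A, B)
C[1][0] += 100                                # inject a single fault
fixed_at = locate_and_fix(C, expected_cols, expected_rows)
```

The mismatching row and column checksums intersect at the corrupted entry, and the row-sum discrepancy gives the correction. With floating-point data the equality tests would need a tolerance.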
7. Intermediate error checking for HSS-MV
• Observation: each pair of parentheses in the nested HSS form of A encloses an implicit (i.e. not explicitly stored) matrix.
• Many invariant conditions can be constructed using associativity.
• For example:
y . [ U(3) . U(2) . U(1) . e ] = [ y. U(3) . U(2) . U(1) ] . e
• Many options for error checking at different stages
of HSS-MV
A = D(3) + U(3) ( B(2) + U(2) ( B(1) + U(1) B(0) V(1)* ) V(2)* ) V(3)*
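The associativity invariant on this slide can be checked stage by stage. The sketch below is an assumption-laden illustration, not the authors' implementation: it uses three small integer factors standing in for U(1), U(2), U(3), precomputes the left checksum vectors once, and compares the sum of each intermediate vector against the matching checksum. The function names and the `corrupt_stage` fault-injection parameter are hypothetical.

```python
def mat_vec(M, v):
    """Matrix times column vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def vec_mat(v, M):
    """Row vector times matrix."""
    return [sum(x * m for x, m in zip(v, col)) for col in zip(*M)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def left_checksum(factors):
    """e.(Uk ... U1) for factors given in application order [U1, ..., Uk]."""
    w = [1] * len(factors[-1])          # e has one entry per row of Uk
    for M in reversed(factors):
        w = vec_mat(w, M)
    return w

def checked_chain(factors, x, corrupt_stage=None):
    """Apply U1, then U2, ... to x; return (first failing stage or None, vector)."""
    checks = [left_checksum(factors[:k + 1]) for k in range(len(factors))]
    v = x
    for k, M in enumerate(factors):
        v = mat_vec(M, v)
        if k == corrupt_stage:
            v[0] += 1                   # simulated soft error in this stage
        if sum(v) != dot(checks[k], x): # e.[Uk...U1.x] vs [e.Uk...U1].x
            return k, v
    return None, v

U1 = [[2, 1], [1, 1]]                               # 2 x 2
U2 = [[1, 2], [0, 1], [3, 0]]                       # 3 x 2
U3 = [[1, 0, 2], [0, 1, 1], [2, 1, 0], [1, 1, 1]]   # 4 x 3
x = [1, 2]

stage_clean, b = checked_chain([U1, U2, U3], x)
stage_faulty, _ = checked_chain([U1, U2, U3], x, corrupt_stage=1)

# The slide's identity y.[U3.U2.U1.e] = [y.U3.U2.U1].e, evaluated both ways.
y = [1, 2, 3, 4]
lhs = dot(y, mat_vec(U3, mat_vec(U2, mat_vec(U1, [1, 1]))))
rhs = sum(vec_mat(vec_mat(vec_mat(y, U3), U2), U1))
```

In a real code the checksum vectors would presumably be computed once when the HSS matrix is built and reused for every multiplication; here they are recomputed inside `checked_chain` only to keep the sketch short.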
8. ABFT for HSS-MV
Error checking with adjustable granularity
• Coarse + CD
[e.AHSS].x = e.[AHSS .x]
• Medium + CD
e.[V(L).x] = [e.V(L)].x
e.[V(0)…V(L).x] = [e.V(0)…V(L)].x
[e.AHSS].x = e.[AHSS .x]
• Fine + CD
Detect errors in each MV
• Encoded
Detect & correct errors in
each MV
HSS Matrix-Vector multiplication: b=A.x
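In floating-point arithmetic the two sides of the coarse check [e.AHSS].x = e.[AHSS.x] agree only up to rounding, so a practical test needs a tolerance. The sketch below uses a small dense matrix as a stand-in for the HSS operator; the function name `coarse_check` and the tolerance values are assumptions chosen for illustration, and a fault smaller than the tolerance would go undetected.

```python
import math

def mat_vec(M, v):
    """Matrix times column vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def coarse_check(col_sums, x, b, rel_tol=1e-9):
    """Compare [e.A].x with e.[A.x] up to rounding error."""
    lhs = sum(c * xi for c, xi in zip(col_sums, x))
    rhs = sum(b)
    return math.isclose(lhs, rhs, rel_tol=rel_tol, abs_tol=1e-12)

# Dense stand-in for the HSS operator (an assumption for this sketch).
A = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6],
     [0.7, 0.8, 0.9]]
x = [1.5, -2.25, 3.125]

col_sums = [sum(col) for col in zip(*A)]   # [e.A], computed once and stored
b = mat_vec(A, x)
ok_clean = coarse_check(col_sums, x, b)

b_faulty = list(b)
b_faulty[1] += 1e-3                        # simulated error in the result
ok_faulty = coarse_check(col_sums, x, b_faulty)
```

The same comparison applies unchanged to the medium and fine checks, with the stored checksum vector and the checked vector swapped for those of the intermediate stage.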
9. Error recovery by Containment Domains (CDs)
Error Detection
• Classical ABFT cannot recover all errors.
– Multiple errors per row.
– Errors in both A and B.
– Redesign for every algorithm.
• Containment Domains provide more robust recovery techniques.
– Users supply validation tests.
– Remote safe store
– Composable (nested,…)
– Automatic escalation
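CD pseudocode
CD_Begin()
  // First pass: CD_Preserve stores "safe" copies of A, B and their checksums.
  // Second pass (re-execution after a failed CD_Assert): CD_Preserve restores them.
  CD_Preserve(A, [eTA], [Ae])
  CD_Preserve(B, [eTB], [Be])
  C = A.B                        // compute
  CD_Assert(eT.C == [eTA].B)     // validate the column checksums of C
CD_Complete()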
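The preserve / compute / assert / re-execute pattern in the pseudocode can be mimicked in a few lines of plain Python. This is only a sketch of the control flow under simplifying assumptions: `run_in_domain` is a hypothetical helper, not the Containment Domains library API, the "safe store" is an in-memory deep copy, and the transient fault is simulated by corrupting A during the first execution only.

```python
import copy

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def col_checksum(M):
    return [sum(col) for col in zip(*M)]

def vec_mat(v, M):
    return [sum(x * m for x, m in zip(v, col)) for col in zip(*M)]

def run_in_domain(state, compute, validate, max_retries=2):
    """Preserve state, run compute, and re-execute from the safe copy on failure."""
    safe = copy.deepcopy(state)                  # preserve inputs and checksums
    for attempt in range(max_retries + 1):
        result = compute(state)
        if validate(state, result):              # user-supplied validation test
            return result, attempt
        state.clear()
        state.update(copy.deepcopy(safe))        # restore before re-executing
    raise RuntimeError("validation kept failing: escalate to the parent domain")

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
state = {"A": A, "B": B, "eTA": col_checksum(A)}

faults = iter([True])                            # fault on the first execution only

def compute(s):
    if next(faults, False):
        s["A"][0][0] += 1                        # corrupt an input in place
    return matmul(s["A"], s["B"])

def validate(s, C):
    return col_checksum(C) == vec_mat(s["eTA"], s["B"])   # eT.C == [eTA].B

C, retries = run_in_domain(state, compute, validate)
```

Because the fault hits the input A rather than the output C, correcting C from its checksums alone would not be enough; restoring the preserved copy of A and re-executing is what recovers the result in this sketch.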
10. Runtime overhead without error injection
[Figure: bar chart of time per HSS-MV iteration (s) for each error-checking configuration; y-axis from 0.234 to 0.246. Memory use (GB) per configuration is listed below.]
Configuration       Memory (GB)
None                1148.2
Coarse              2290.4
Medium              2290.9
Fine                2292.2
Encoded + Coarse    2295.9
Encoded             1164.2
• Overhead is less than 2%
• Comparable to natural performance variation.
12. Conclusions & Future work
• Identified checksum relationships to validate HSS-MV
operations.
• Fine grained error checking:
– has very low overhead
– maintains excellent efficiency at high error rates.
• Containment Domains
– Fine-grained preservation incurs minimal runtime overhead.
– Preservation doubles memory capacity requirements.
• Merge fault-tolerance branch into main (parallel) HSS code.
• Incorporate into a linear solver.
13. Acknowledgement
• TOORSES (LBNL)
– Sherry Li (PI)
– Eric Roman
– Osni Marquez
– Alex Druinsky
• Strumpack – HSS Library
– Francois-Henry Rouet
• Containment Domains (UT)
– Mattan Erez
– Kyushick Lee
• Support
– This material is based upon work supported by the U.S. Department of Energy, Office of
Science, Office of Advanced Scientific Computing Research, Applied Mathematics program
under contract number DE-AC02-05CH11231.
– This research used resources of the National Energy Research Scientific Computing Center, a
DOE Office of Science User Facility supported by the Office of Science of the U.S. Department
of Energy under Contract No. DE-AC02-05CH11231.
----- Meeting Notes (3/13/15 16:33) -----
audience likely knows Huang and Abraham, so zip past this
fix date on the slide
fix colors on plots
make relative time
clearer explanation of CD
----- Meeting Notes (3/13/15 16:37) -----
CD slide is really dense
what is a containment domain
walk through pseudo code as slowly as i did during q&a
make it clear that checksums are also being preserved
----- Meeting Notes (3/13/15 16:33) -----
need a better introduction
not necessarily a slide
why am i here and what am i going to talk about