Incorporating Error Detection and Recovery into Hierarchically Semi-Separable Matrix Operations
Brian Austin
Alex Druinsky, Osni Marquez, Eric Roman, Sherry Li (LBNL)
April 8, 2015
Towards Optimal Order Resilient Solvers at Extreme Scale (TOORSES)
Linear solvers are ubiquitous in scientific computing
• Performance
  – HSS matrix format reduces computational complexity
• Resilience
  – Error rates may increase on extreme-scale systems.
    • Increased concurrency – more parts that might fail
    • Potentially lower part reliability (smaller transistors, near-threshold voltage)
Outline
• Hierarchically Semi-Separable (HSS) decomposition
• Algorithm-based fault tolerance (ABFT) for dense matrices
• Error detection for HSS matrix-vector multiplication
• Error recovery using Containment Domains
• Performance results
Hierarchically Semi-Separable (HSS) Matrix Decomposition
• Exploits low numerical rank of matrix.
• Structured block sparsity
• Factorization has bounded error.
A = D(3) + U(3) ( B(2) + U(2) ( B(1) + U(1) B(0) V(1)* ) V(2)* ) V(3)*
HSS Matrix-Vector Multiplication
HSS matrix-vector multiplication: b = A.x
A = D(3) + U(3) ( B(2) + U(2) ( B(1) + U(1) B(0) V(1)* ) V(2)* ) V(3)*
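The nested form of A means that b = A.x can be evaluated with one downward sweep through the V factors and one upward sweep through the B and U factors, without ever forming A. The sketch below is a minimal illustration under simplifying assumptions: each level is modeled as a single small dense matrix rather than a block-diagonal one, entries are real integers (so V* is the plain transpose), and every function and variable name is hypothetical rather than taken from the HSS code used in the talk.

```python
# Illustrative sketch only: names and test data are assumptions, and each
# level is one dense matrix instead of the block-diagonal factors of real HSS.

def transpose(M):
    return [list(col) for col in zip(*M)]

def matvec(M, v):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(P, Q):
    Qt = transpose(Q)
    return [[sum(p * q for p, q in zip(row, col)) for col in Qt] for row in P]

def madd(P, Q):
    return [[p + q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]

def vadd(u, v):
    return [a + b for a, b in zip(u, v)]

def hss_matvec(D3, U, V, B, x):
    """b = A.x for A = D3 + U3 (B2 + U2 (B1 + U1 B0 V1*) V2*) V3*."""
    # Downward sweep: t3 = V3*.x, t2 = V2*.t3, t1 = V1*.t2
    t, cur = {}, x
    for lvl in (3, 2, 1):
        cur = matvec(transpose(V[lvl]), cur)
        t[lvl] = cur
    # Upward sweep: s1 = B0.t1, s2 = B1.t2 + U1.s1, s3 = B2.t3 + U2.s2
    s = matvec(B[0], t[1])
    for lvl in (1, 2):
        s = vadd(matvec(B[lvl], t[lvl + 1]), matvec(U[lvl], s))
    return vadd(matvec(D3, x), matvec(U[3], s))

def hss_dense(D3, U, V, B):
    """Form A explicitly (for checking only; this defeats the purpose of HSS)."""
    M = B[0]
    for lvl in (1, 2):
        M = madd(B[lvl], matmul(matmul(U[lvl], M), transpose(V[lvl])))
    return madd(D3, matmul(matmul(U[3], M), transpose(V[3])))

# Small made-up factors with compatible shapes (n = 4, ranks 3, 2, 1).
D3 = [[2, 0, 0, 0], [0, 3, 0, 0], [0, 0, 4, 0], [0, 0, 0, 5]]
U = {3: [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]],
     2: [[1, 0], [0, 1], [1, 1]],
     1: [[1], [2]]}
V = {3: [[1, 1, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1]],
     2: [[1, 1], [0, 1], [1, 0]],
     1: [[1], [1]]}
B = {2: [[1, 2, 0], [0, 1, 1], [1, 0, 1]], 1: [[1, 1], [0, 2]], 0: [[3]]}
x = [1, 2, 3, 4]

b_nested = hss_matvec(D3, U, V, B, x)
b_dense = matvec(hss_dense(D3, U, V, B), x)
print(b_nested == b_dense)
```

Integer entries keep the comparison exact; with floating-point factors the two results would agree only up to rounding.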
Algorithm-Based Fault Tolerance (ABFT) for Dense Matrices (Huang & Abraham, 1984)
Checksum protection for individual matrices (e is the all-ones vector; brackets denote a stored checksum)
Recovers up to one error per row/column
eT.A = [eTA]
A.e = [Ae]
Matrix multiplication preserves checksums
[eTA].B = eT.[AB]
A.[Be] = [AB].e
[Figure: checksum-augmented product A × B = C. A carries [eTA] and [Ae], B carries [eTB] and [Be], and the product satisfies [eTA].B = eT.C and A.[Be] = C.e.]
Checksum relationships can be derived from associative properties.
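The checksum identities on this slide can be exercised on a small example. The following sketch assumes integer matrices (so comparisons are exact) and uses hypothetical helper names; it detects a single corrupted entry of C = A.B from the row and column checksums, locates it at the intersection of the failing row and column, and repairs it. It illustrates the classical Huang and Abraham idea and is not the code used in the talk.

```python
# Illustrative sketch only: helper names and data are assumptions.

def matmul(P, Q):
    Qt = [list(col) for col in zip(*Q)]
    return [[sum(p * q for p, q in zip(row, col)) for col in Qt] for row in P]

def col_sums(M):
    """eT.M: one checksum per column."""
    return [sum(col) for col in zip(*M)]

def row_sums(M):
    """M.e: one checksum per row."""
    return [sum(row) for row in M]

def vecmat(v, M):
    """Row vector times matrix."""
    return [sum(v[i] * M[i][j] for i in range(len(M))) for j in range(len(M[0]))]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def check_and_correct(C, col_ref, row_ref):
    """Return None if C matches both checksums; otherwise fix one bad entry
    in place and return its (row, column) position."""
    bad_cols = [j for j, (got, ref) in enumerate(zip(col_sums(C), col_ref)) if got != ref]
    bad_rows = [i for i, (got, ref) in enumerate(zip(row_sums(C), row_ref)) if got != ref]
    if not bad_cols and not bad_rows:
        return None
    if len(bad_cols) != 1 or len(bad_rows) != 1:
        raise ValueError("more than one corrupted entry; checksums cannot correct it")
    i, j = bad_rows[0], bad_cols[0]
    C[i][j] -= row_sums(C)[i] - row_ref[i]  # remove the excess seen in row i
    return (i, j)

A = [[1, 2], [3, 4], [5, 6]]
B = [[1, 0, 2], [0, 1, 3]]

# Reference checksums of C, obtained without using C itself.
col_ref = vecmat(col_sums(A), B)   # [eTA].B = eT.C
row_ref = matvec(A, row_sums(B))   # A.[Be] = C.e

C = matmul(A, B)
C_good = [row[:] for row in C]
clean_result = check_and_correct(C, col_ref, row_ref)

C[1][2] += 7                        # simulated soft error in one entry
location = check_and_correct(C, col_ref, row_ref)
print(clean_result, location, C == C_good)
```

With floating-point data the equality tests would have to be replaced by comparisons against a tolerance, which is one reason the choice of threshold matters in practice.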
Intermediate error checking for HSS-MV
• Observation: between each pair of parentheses there is an implicit (i.e., not explicitly stored) matrix.
• Many invariant conditions can be constructed using associativity.
• For example:
  y . [ U(3) . U(2) . U(1) . e ] = [ y . U(3) . U(2) . U(1) ] . e
• Many options for error checking at different stages of HSS-MV
A = D(3) + U(3) ( B(2) + U(2) ( B(1) + U(1) B(0) V(1)* ) V(2)* ) V(3)*
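The example invariant can be checked numerically. In the sketch below, the checksum vector U(3).U(2).U(1).e is computed once ahead of time, and the chain y.U(3).U(2).U(1) is evaluated stage by stage with an optional simulated fault. The matrices, the fault model, and all names are assumptions made for illustration.

```python
# Illustrative sketch only: factors, fault model and names are assumptions.

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def vecmat(v, M):
    """Row vector times matrix."""
    return [sum(v[i] * M[i][j] for i in range(len(M))) for j in range(len(M[0]))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def chain_checksum(factors):
    """[U3.U2.U1.e], built right to left from the all-ones vector e."""
    w = [1] * len(factors[-1][0])
    for U in reversed(factors):
        w = matvec(U, w)
    return w

def apply_chain(y, factors, corrupt_stage=None):
    """y.U3.U2.U1, optionally corrupting the result of one stage."""
    z = list(y)
    for stage, U in enumerate(factors):
        z = vecmat(z, U)
        if stage == corrupt_stage:
            z[0] += 1  # simulated soft error in an intermediate vector
    return z

U3 = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
U2 = [[1, 0], [0, 1], [1, 1]]
U1 = [[1, 2], [0, 1]]
factors = [U3, U2, U1]
y = [1, 2, 3, 4]

w = chain_checksum(factors)            # precomputed once
expected = dot(y, w)                   # y.[U3.U2.U1.e]

ok = sum(apply_chain(y, factors)) == expected
faulty = sum(apply_chain(y, factors, corrupt_stage=1)) == expected
print(ok, faulty)
```

Because the checksum vector depends only on the factors, it can be reused for every input y; only the dot product and the final sum are paid per multiplication.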
ABFT for HSS-MV
Error checking with adjustable granularity
• Coarse + CD
  [e.A_HSS].x = e.[A_HSS.x]
• Medium + CD
  e.[V(L).x] = [e.V(L)].x
  e.[V(0)…V(L).x] = [e.V(0)…V(L)].x
  [e.A_HSS].x = e.[A_HSS.x]
• Fine + CD
  Detect errors in each MV
• Encoded
  Detect & correct errors in each MV
HSS Matrix-Vector multiplication: b=A.x
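As a rough illustration of the coarse and medium checks, the sketch below precomputes the column-sum checksum [e.A_HSS] by running one HSS-MV with the transposed factors on the all-ones vector, then validates a later product b = A.x with a single dot product. It also shows one medium-grained check on the first V stage. The same simplifying assumptions apply as before: one dense matrix per level, real integer entries, and hypothetical names throughout.

```python
# Illustrative sketch only: single dense factor per level, made-up data,
# hypothetical names; the fault is injected into the output vector.

def transpose(M):
    return [list(col) for col in zip(*M)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def vadd(u, v):
    return [a + b for a, b in zip(u, v)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hss_matvec(D3, U, V, B, x, fault=None):
    t3 = matvec(transpose(V[3]), x)
    t2 = matvec(transpose(V[2]), t3)
    t1 = matvec(transpose(V[1]), t2)
    s = matvec(B[0], t1)
    s = vadd(matvec(B[1], t2), matvec(U[1], s))
    s = vadd(matvec(B[2], t3), matvec(U[2], s))
    b = vadd(matvec(D3, x), matvec(U[3], s))
    if fault is not None:
        b[fault] += 1  # simulated soft error in one output entry
    return b

def coarse_checksum(D3, U, V, B):
    """[e.A_HSS]: the transpose of A has the same nested form with U and V
    swapped and every D and B transposed, so one HSS-MV on e suffices."""
    Bt = {lvl: transpose(M) for lvl, M in B.items()}
    return hss_matvec(transpose(D3), V, U, Bt, [1] * len(D3))

D3 = [[2, 0, 0, 0], [0, 3, 0, 0], [0, 0, 4, 0], [0, 0, 0, 5]]
U = {3: [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]],
     2: [[1, 0], [0, 1], [1, 1]],
     1: [[1], [2]]}
V = {3: [[1, 1, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1]],
     2: [[1, 1], [0, 1], [1, 0]],
     1: [[1], [1]]}
B = {2: [[1, 2, 0], [0, 1, 1], [1, 0, 1]], 1: [[1, 1], [0, 2]], 0: [[3]]}
x = [1, 2, 3, 4]

c = coarse_checksum(D3, U, V, B)            # computed once per matrix

# Coarse check: [e.A_HSS].x == e.[A_HSS.x]
coarse_ok = dot(c, x) == sum(hss_matvec(D3, U, V, B, x))
coarse_faulty = dot(c, x) == sum(hss_matvec(D3, U, V, B, x, fault=2))

# Medium check on the first stage: e.[V3*.x] == [e.V3*].x
v3_checksum = [sum(row) for row in V[3]]    # e.V3* equals the row sums of V3
medium_ok = sum(matvec(transpose(V[3]), x)) == dot(v3_checksum, x)
print(coarse_ok, coarse_faulty, medium_ok)
```

The coarse check costs one dot product per multiplication but can only say that something went wrong somewhere; the per-stage checks narrow down where, at the price of more stored checksums and more comparisons.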
Error recovery by Containment Domains (CDs)
Error Detection
• Classical ABFT cannot recover all errors.
  – Multiple errors per row.
  – Errors in both A and B.
  – Redesign for every algorithm.
• Containment Domains provide more robust recovery techniques.
  – Users supply validation tests.
  – Remote safe store
  – Composable (nested, …)
  – Automatic escalation
CD pseudocode
  CD_Begin()
    // first pass: store "safe" copies of A, B and their checksums
    // second pass: restore A, B and their checksums
    CD_Preserve(A, [eTA], [Ae])
    CD_Preserve(B, [eTB], [Be])
    Compute: C = A.B
    CD_Assert(eT.C == [eTA].B)
  CD_Complete()
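The preserve, validate, and re-execute pattern in the pseudocode can be mimicked in a few lines. The sketch below is not the Containment Domains library and does not use its API: `run_in_domain`, the `state` dictionary, and the fault injected on the first attempt are all hypothetical stand-ins chosen to show the control flow.

```python
# Illustrative sketch only: a hand-rolled stand-in for the CD pattern,
# not the Containment Domains API. All names are hypothetical.
import copy

def matmul(P, Q):
    Qt = [list(col) for col in zip(*Q)]
    return [[sum(p * q for p, q in zip(row, col)) for col in Qt] for row in P]

def col_sums(M):
    return [sum(col) for col in zip(*M)]

def vecmat(v, M):
    return [sum(v[i] * M[i][j] for i in range(len(M))) for j in range(len(M[0]))]

def run_in_domain(state, body, validate, max_retries=3):
    """Preserve `state`, run `body`, and re-execute from the preserved copy
    whenever `validate` rejects the result."""
    safe = copy.deepcopy(state)                 # first pass: store "safe" copies
    for attempt in range(max_retries + 1):
        if attempt > 0:                         # later passes: restore them
            for key in safe:
                state[key] = copy.deepcopy(safe[key])
        result = body(state, attempt)
        if validate(state, result):
            return result, attempt
    raise RuntimeError("validation kept failing; escalate to the parent domain")

def multiply(state, attempt):
    if attempt == 0:
        state["A"][0][0] += 5                   # simulated soft error corrupting A
    return matmul(state["A"], state["B"])

def checksum_test(state, C):
    # eT.C == [eTA].B, using the preserved checksum rather than the live A
    return col_sums(C) == vecmat(state["eTA"], state["B"])

A = [[1, 2], [3, 4]]
B = [[1, 0], [2, 1]]
state = {"A": A, "B": B, "eTA": col_sums(A)}

C, attempts = run_in_domain(state, multiply, checksum_test)
print(C, attempts)
```

Preserving the checksum alongside the matrix matters here: if [eTA] were recomputed from the corrupted A, the validation test would pass on a wrong result.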
Runtime overhead without error injection
[Figure: bar chart of time per HSS-MV iteration (s), axis range 0.234 to 0.246, one bar per protection scheme.]
Scheme             Memory (GB)
None               1148.2
Coarse             2290.4
Medium             2290.9
Fine               2292.2
Encoded + Coarse   2295.9
Encoded            1164.2
• Overhead is less than 2%
• Comparable to natural performance variation.
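The memory figures attached to each scheme on this slide can be turned into ratios against the unprotected run. The short calculation below uses only the numbers shown on the slide, with the unit taken as labeled there; the variable names are arbitrary.

```python
# Memory figures copied from the slide (unit as labeled there: GB).
memory_gb = {
    "None": 1148.2,
    "Coarse": 2290.4,
    "Medium": 2290.9,
    "Fine": 2292.2,
    "Encoded + Coarse": 2295.9,
    "Encoded": 1164.2,
}

baseline = memory_gb["None"]
ratios = {name: gb / baseline for name, gb in memory_gb.items()}

for name, ratio in ratios.items():
    print(f"{name:17s} {ratio:.3f}x")
```

Every scheme that uses containment-domain preservation lands just under 2x the baseline, while the Encoded scheme alone adds roughly 1.4%.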
injection.
[Figure: time per HSS-MV iteration (s), axis range 0.20 to 0.45, versus error rate (#/s) from 1.0E-3 to 1.0E+0, with one curve each for Coarse, Medium, Fine, and Encoded.]
Conclusions & Future work
• Identified checksum relationships to validate HSS-MV operations.
• Fine-grained error checking:
  – has very low overhead
  – maintains excellent efficiency at high error rates.
• Containment Domains
  – Fine-grained preservation incurs minimal runtime overhead.
  – Preservation doubles memory capacity requirements.
• Merge fault-tolerance branch into main (parallel) HSS code.
• Incorporation into linear solver
Acknowledgement
• TOORSES (LBNL)
– Sherry Li (PI)
– Eric Roman
– Osni Marquez
– Alex Druinsky
• STRUMPACK – HSS Library
– Francois-Henry Rouet
• Containment Domains (UT)
– Mattan Erez
– Kyushick Lee
• Support
– This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under contract number DE-AC02-05CH11231.
– This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
National Energy Research Scientific Computing Center


Editor's Notes

  1. ----- Meeting Notes (3/13/15 16:33) ----- Audience likely knows Huang and Abraham, so zip past this; fix date on the slide; fix colors on plots; make relative time clearer; explanation of CD.
  2. ----- Meeting Notes (3/13/15 16:37) ----- CD slide is really dense; what is a containment domain; walk through pseudocode as slowly as I did during Q&A; make it clear that checksums are also being preserved.
  3. ----- Meeting Notes (3/13/15 16:33) ----- Need a better introduction, not necessarily a slide: why am I here and what am I going to talk about.