SlideShare a Scribd company logo
1 of 29
Download to read offline
Reliability, Availability, and
Serviceability(RAS) on ARM64
Wei Fu
ENGINEERS AND DEVICES
WORKING TOGETHER
AGENDA
● What is RAS?
○ Introduction: Definition, Importance, History on X86
○ Overview of RAS functions
● ARMv8 CPU requirements for RAS
○ CPU core, GICv2/GICv3
○ RAS Extension(ARMv8.2)
● SDEI(Software Delegated Exception Interface)
● APEI(ACPI Platform Error Interfaces)
○ BERT and CPER, HEST and GHESv2, EINJ/ERST
● SW components for RAS(in example)
○ All SW components
○ How it works: BERT, HEST
● Current status and Planning
ENGINEERS AND DEVICES
WORKING TOGETHER
What is RAS?
There are three key aspects to robust systems:
Reliability: Continuity, Computation needs be correct and
reliable.
Availability: Readiness, System needs to remain available as
long as possible.
Serviceability: Ability to undergo modifications and repairs,
System should provide information to administrator to aid in
system servicing.
The RAS architecture mostly describes data corruption faults, which
mostly occur in memory and on data links, can also be used for handling
other types of physical faults found in systems.
In another word, The RAS architecture primarily cares about errors
produced from hardware .
ENGINEERS AND DEVICES
WORKING TOGETHER
How important is RAS ?
Enterprise systems often provide mission-critical services:
Impacts
System failure or data
corruption impacts the
customer’s business
and reputation
Inevitability
Although faults are
rare, enterprise
systems can be very
large. So failures are
inevitable.
OPEX
Operating Expense (OPEX) for maintenance is reduced
by replacing only failed parts, and scheduled
maintenance is cheaper than unscheduled service
outages.
So we have to maintain system very well, and Operating Expense (OPEX)
for maintenance is inevitable. How to make the cost down?
ENGINEERS AND DEVICES
WORKING TOGETHER
What if we don’t have RAS?
We DO have some other solutions:
To Be Successful in Business,
You Need a Little Luck.
--Richard Branson
ENGINEERS AND DEVICES
WORKING TOGETHER
RAS on X86 History
● Machine Check Architecture (MCA)
○ An x86 mechanism in which the CPU
reports hardware errors to the
operating system
○ model-specific registers (MSRs)
■ set up machine checking
■ recorde detected errors
■ the information they contain is
CPU specific
○ Machine Check Exception (MCE)
■ signals the detection of an
uncorrected machine-check error
■ handler collect information about
error from MSRs
■ mcelog
● EDAC (Error Detection and
Correction)
○ designed to report and possibly act on
hardware errors
○ inspect the hardware directly
(system-specific handling and
decoding.)
○ only support memory controller and
PCI/AGP errors
● PCI-E: Advanced Error Reporting
(AER)
● Other Hardware features
○ ECC in memory controllers and I/O
RAMs
ENGINEERS AND DEVICES
WORKING TOGETHER
Overview of RAS functions
ENGINEERS AND DEVICES
WORKING TOGETHER
ARMv8 CPU requirements for RAS
● CPU
○ ARMv8-A architecture (ARMv8.2)
○ EL2, EL3, or both
■ Virtualization extension or Security extensions or both
● Generic Interrupt Controller (GICv2/GICv3)
○ Interrupt routing modes
○ Private and shared interrupts(PPI/SPI)
○ Ability to set an interrupt pending event signaling and delegation
○ Interrupt groups/priority
● RAS Extension
○ ESB (Error Synchronization Barrier) instructions
○ RAS Extension registers
○ corrupted data poisoning
ENGINEERS AND DEVICES
WORKING TOGETHER
ESB
Synchronization barrier
instruction, used to
synchronize Unrecoverable
errors, to make sure
containable errors
architecturally consumed by
the PE and not silently
propagated.
Registers Group
● Processor Feature
● Trapping accesses to
Error system
● Error records
(Single/Group)
● Deferred Interrupt
Status(work with ESB)
Overview of RAS Extension
The RAS extension deals primarily with errors produced from hardware faults.
POISON
signals corrupted data
(hardware mechanism)
ENGINEERS AND DEVICES
WORKING TOGETHER
Taxonomy of Error in Diagram
ENGINEERS AND DEVICES
WORKING TOGETHER
ESB instruction
ESB(Error Synchronization Barrier) can be used to isolate
Unrecoverable errors. Software can determine that:
● The error was reported as Unrecoverable.
● The preferred return address of the SEI is an ESB
instruction.
The software between that ESB and the previous ESB can be
isolated. ESB might update
● DISR_EL1 / DISR (Deferred Interrupt Status Register)
○ SEIs Generated by instructions occurring in program order before the
ESB, are either taken before or at the ESB or pended in the
DISR_EL1 / DISR register (depending on whether the SEI exception is
masked).
● VDISR_EL2 / VDISR(Virtual Deferred Interrupt Status
Register)
○ for virtualization
ENGINEERS AND DEVICES
WORKING TOGETHER
System register view
● Processor/Memory Model Feature Register
● Error Record Register
○ ID, Select Register
○ Record Register(ERX*_EL1)
■ Feature
■ Control
■ Record Primary Syndrome
■ Record Address Register
■ Record Miscellaneous Registers
● Hypervisor Configuration Register
● Virtual SError Exception Syndrome Register
● Secure Configuration Register
RAS Extension registers
Memory-mapped view
● Peripheral/Component ID Register
● Error Record ID Register
● Record Register(ERR<n>*)
○ Feature
○ Control
○ Record Primary Syndrome
○ Record Address Register
○ Record Miscellaneous Registers
○ Group Status Registers
● Interrupt Register
○ Error Interrupt Configuration Register
(Fault-Handling/Recovery)
○ Generic Error Interrupt Configuration Register
○ Error Interrupt Status Register
● Device Affinity Register
● Device Architecture Register
ENGINEERS AND DEVICES
WORKING TOGETHER
RAS Extension
How to route interrupts and Exception abort
ENGINEERS AND DEVICES
WORKING TOGETHER
SDEI: Software Delegated Exception Interface
A PSCI-like interface for registering and servicing system events from
system firmware( But it’s NOT only for RAS)
ENGINEERS AND DEVICES
WORKING TOGETHER
SDEI: Some Interface Functions
Event control
SDEI_EVENT_REGISTER
SDEI_EVENT_ENABLE/DISABLE
SDEI_EVENT_CONTEXT
SDEI_EVENT_COMPLETE
SDEI_EVENT_COMPLETE_AND_RE
SUME
SDEI_EVENT_UNREGISTER
SDEI_EVENT_STATUS
SDEI_EVENT_GET_INFO
SDEI_EVENT_ROUTING_SET
SDEI_EVENT_SIGNAL
info & other control
SDEI_VERSION
SDEI_FEATURES
SDEI_INTERRUPT_BIND/RELEASE
SDEI_PE_UNMASK
SDEI_PE_DATA_RESET
SDEI_SYSTEM_DATA_RESET
ENGINEERS AND DEVICES
WORKING TOGETHER
APEI(ACPI Platform Error Interfaces)
ENGINEERS AND DEVICES
WORKING TOGETHER
BERT: Boot Error Record Table
Scenarios: Record errors in
emergency (OS crash/reset)
Mechanism: Report
unhandled errors that
occurred in the previous
boot
Key info:
WHERE are the error records
SIZE of the error record
region
CPER : Common Platform Error Record
The description is in the Appendix of UEFI Specification. With the help of CPER
and FF(Firmware-First mode), OS can get all kinds of error we could think of.
ENGINEERS AND DEVICES
WORKING TOGETHER
HEST: Hardware Error Source Table
Scenarios: Record errors in
runtime (OS still can work)
Mechanism: describes a
standardized mechanism
platforms may use to describe
their error sources by Error
Key info:
○ HOW to get trigger
○ WHERE are the error
records
○ HOW to release records’
mem
ENGINEERS AND DEVICES
WORKING TOGETHER
HEST: Hardware Error Source Table
For ARM64 : GHES v2
HOW to get trigger:
Notification Structure
WHERE are the error records:
Error Status Address
(GAS : Generic Address
Structure)
HOW to release records’
mem:
Read Ack Register
For IA-32 :
MCE/CMC/NMI
For PCI:
AER Root
Port/Endpoint/Bridge
For generic hardware:
GHES(Generic Hardware
Error Source) V1/V2
ENGINEERS AND DEVICES
WORKING TOGETHER
ERST: Error Record Serialization Table
Scenarios: Record and
Retrieve errors in persistent
storage
Mechanism : Operation
abstract, provides details
necessary to communicate
with on-board persistent
storage for error recording
Plan B: IPMI , MTD, block
device, ABRT...
ENGINEERS AND DEVICES
WORKING TOGETHER
EINJ: Error Injection Table
Scenarios: Test OSPM error
handling stack
Mechanism: Operation
abstract, provides a generic
interface which OSPM can inject
hardware errors to the platform
without requiring platform specific
software.
Possible use case:
ENGINEERS AND DEVICES
WORKING TOGETHER
SW components for RAS
Firmware:
ARM TF(ARM Trusted Firmware)with RAS support
UEFI(Unified Extensible Firmware Interface): tianocore-edk2
ACPI tables(with APEI tables)
OS:
Linux kernel( with APEI drivers)
Userspace:
rasdaemon
How it works: BERT, HEST
ENGINEERS AND DEVICES
WORKING TOGETHER
Current status
● Hardware: ARMv8.2 spec includes RAS extension
● Firmware:
○ RAS extension doc has described some different software
sequences based on different scenarios, but the spec itself
is underdevelopment
○ SDEI is underdevelopment
● OS(Linux): APEI on ARM64 can be enabled in kernel.
○ BERT works
○ GHESv2 of HEST is upstreaming(v11) by Qualcomm
engineers
○ ERST and EINJ are untested because of the lack of FW
● Daemon(Application)
○ rasdaemon can be launched on ARM64
ENGINEERS AND DEVICES
WORKING TOGETHER
What we are missing and planning
● Hardware: need a hardware or a simulator (ARMv8.2, including
RAS extension)
● Firmware:
○ RAS extension and SDEI spec
○ ARM TF with RAS extension support and SDEI dispatcher
○ REST and EINJ implementation
● OS(Linux):
○ SDEI client and dispatcher(virtual SDEI)
○ GHESv2 support(upstreaming)
○ RAS event (How to pass/keep info?)
● Daemon(Application)
○ rasdaemon on ARM64 for RAS event
ENGINEERS AND DEVICES
WORKING TOGETHER
Acknowledgments
Al Stone (Red Hat)
Charles Garcia-Tobin(ARM)
John Feeney(Red Hat)
Jon Masters(Red Hat)
Zhigao Li(Huawei)
Thank You
#BUD17
For further information: www.linaro.org
BUD17 keynotes and videos on: connect.linaro.org

More Related Content

What's hot

The Theory and Implementation of DVFS on Linux
The Theory and Implementation of DVFS on LinuxThe Theory and Implementation of DVFS on Linux
The Theory and Implementation of DVFS on LinuxPicker Weng
 
LCU13: An Introduction to ARM Trusted Firmware
LCU13: An Introduction to ARM Trusted FirmwareLCU13: An Introduction to ARM Trusted Firmware
LCU13: An Introduction to ARM Trusted FirmwareLinaro
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Brendan Gregg
 
Static partitioning virtualization on RISC-V
Static partitioning virtualization on RISC-VStatic partitioning virtualization on RISC-V
Static partitioning virtualization on RISC-VRISC-V International
 
Project ACRN: SR-IOV implementation
Project ACRN: SR-IOV implementationProject ACRN: SR-IOV implementation
Project ACRN: SR-IOV implementationGeoffroy Van Cutsem
 
HKG15-505: Power Management interactions with OP-TEE and Trusted Firmware
HKG15-505: Power Management interactions with OP-TEE and Trusted FirmwareHKG15-505: Power Management interactions with OP-TEE and Trusted Firmware
HKG15-505: Power Management interactions with OP-TEE and Trusted FirmwareLinaro
 
U Boot or Universal Bootloader
U Boot or Universal BootloaderU Boot or Universal Bootloader
U Boot or Universal BootloaderSatpal Parmar
 
LCA13: Power State Coordination Interface
LCA13: Power State Coordination InterfaceLCA13: Power State Coordination Interface
LCA13: Power State Coordination InterfaceLinaro
 
HKG15-505: Power Management interactions with OP-TEE and Trusted Firmware
HKG15-505: Power Management interactions with OP-TEE and Trusted FirmwareHKG15-505: Power Management interactions with OP-TEE and Trusted Firmware
HKG15-505: Power Management interactions with OP-TEE and Trusted FirmwareLinaro
 
ACPI Debugging from Linux Kernel
ACPI Debugging from Linux KernelACPI Debugging from Linux Kernel
ACPI Debugging from Linux KernelSUSE Labs Taipei
 
Project ACRN Device Passthrough Introduction
Project ACRN Device Passthrough IntroductionProject ACRN Device Passthrough Introduction
Project ACRN Device Passthrough IntroductionProject ACRN
 
Trusted firmware deep_dive_v1.0_
Trusted firmware deep_dive_v1.0_Trusted firmware deep_dive_v1.0_
Trusted firmware deep_dive_v1.0_Linaro
 
What is Universal Flash Storage (UFS)?
What is Universal Flash Storage (UFS)?What is Universal Flash Storage (UFS)?
What is Universal Flash Storage (UFS)?UniversalFlash
 
LAS16-402: ARM Trusted Firmware – from Enterprise to Embedded
LAS16-402: ARM Trusted Firmware – from Enterprise to EmbeddedLAS16-402: ARM Trusted Firmware – from Enterprise to Embedded
LAS16-402: ARM Trusted Firmware – from Enterprise to EmbeddedLinaro
 
XPDDS17: PVH Dom0: The Road so Far - Roger Pau Monné, Citrix
XPDDS17: PVH Dom0: The Road so Far - Roger Pau Monné, CitrixXPDDS17: PVH Dom0: The Road so Far - Roger Pau Monné, Citrix
XPDDS17: PVH Dom0: The Road so Far - Roger Pau Monné, CitrixThe Linux Foundation
 

What's hot (20)

The Theory and Implementation of DVFS on Linux
The Theory and Implementation of DVFS on LinuxThe Theory and Implementation of DVFS on Linux
The Theory and Implementation of DVFS on Linux
 
LCU13: An Introduction to ARM Trusted Firmware
LCU13: An Introduction to ARM Trusted FirmwareLCU13: An Introduction to ARM Trusted Firmware
LCU13: An Introduction to ARM Trusted Firmware
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)
 
PCI Drivers
PCI DriversPCI Drivers
PCI Drivers
 
Static partitioning virtualization on RISC-V
Static partitioning virtualization on RISC-VStatic partitioning virtualization on RISC-V
Static partitioning virtualization on RISC-V
 
Interrupts on xv6
Interrupts on xv6Interrupts on xv6
Interrupts on xv6
 
Project ACRN: SR-IOV implementation
Project ACRN: SR-IOV implementationProject ACRN: SR-IOV implementation
Project ACRN: SR-IOV implementation
 
Ixgbe internals
Ixgbe internalsIxgbe internals
Ixgbe internals
 
HKG15-505: Power Management interactions with OP-TEE and Trusted Firmware
HKG15-505: Power Management interactions with OP-TEE and Trusted FirmwareHKG15-505: Power Management interactions with OP-TEE and Trusted Firmware
HKG15-505: Power Management interactions with OP-TEE and Trusted Firmware
 
U Boot or Universal Bootloader
U Boot or Universal BootloaderU Boot or Universal Bootloader
U Boot or Universal Bootloader
 
LCA13: Power State Coordination Interface
LCA13: Power State Coordination InterfaceLCA13: Power State Coordination Interface
LCA13: Power State Coordination Interface
 
HKG15-505: Power Management interactions with OP-TEE and Trusted Firmware
HKG15-505: Power Management interactions with OP-TEE and Trusted FirmwareHKG15-505: Power Management interactions with OP-TEE and Trusted Firmware
HKG15-505: Power Management interactions with OP-TEE and Trusted Firmware
 
ACPI Debugging from Linux Kernel
ACPI Debugging from Linux KernelACPI Debugging from Linux Kernel
ACPI Debugging from Linux Kernel
 
Project ACRN Device Passthrough Introduction
Project ACRN Device Passthrough IntroductionProject ACRN Device Passthrough Introduction
Project ACRN Device Passthrough Introduction
 
Trusted firmware deep_dive_v1.0_
Trusted firmware deep_dive_v1.0_Trusted firmware deep_dive_v1.0_
Trusted firmware deep_dive_v1.0_
 
Spi drivers
Spi driversSpi drivers
Spi drivers
 
USB Drivers
USB DriversUSB Drivers
USB Drivers
 
What is Universal Flash Storage (UFS)?
What is Universal Flash Storage (UFS)?What is Universal Flash Storage (UFS)?
What is Universal Flash Storage (UFS)?
 
LAS16-402: ARM Trusted Firmware – from Enterprise to Embedded
LAS16-402: ARM Trusted Firmware – from Enterprise to EmbeddedLAS16-402: ARM Trusted Firmware – from Enterprise to Embedded
LAS16-402: ARM Trusted Firmware – from Enterprise to Embedded
 
XPDDS17: PVH Dom0: The Road so Far - Roger Pau Monné, Citrix
XPDDS17: PVH Dom0: The Road so Far - Roger Pau Monné, CitrixXPDDS17: PVH Dom0: The Road so Far - Roger Pau Monné, Citrix
XPDDS17: PVH Dom0: The Road so Far - Roger Pau Monné, Citrix
 

Similar to Reliability, Availability, and Serviceability (RAS) on ARM64 Systems

Spike yuan server ras and uefi cper final
Spike yuan  server ras and uefi cper finalSpike yuan  server ras and uefi cper final
Spike yuan server ras and uefi cper finalparth bera
 
AMulti-coreSoftwareHardwareCo-DebugPlatform_Final
AMulti-coreSoftwareHardwareCo-DebugPlatform_FinalAMulti-coreSoftwareHardwareCo-DebugPlatform_Final
AMulti-coreSoftwareHardwareCo-DebugPlatform_FinalAlan Su
 
Virtualization Support in ARMv8+
Virtualization Support in ARMv8+Virtualization Support in ARMv8+
Virtualization Support in ARMv8+Aananth C N
 
Developing a Windows CE OAL.ppt
Developing a Windows CE OAL.pptDeveloping a Windows CE OAL.ppt
Developing a Windows CE OAL.pptKundanSingh887495
 
Oracle R12 EBS Performance Tuning
Oracle R12 EBS Performance TuningOracle R12 EBS Performance Tuning
Oracle R12 EBS Performance TuningScott Jenner
 
8086 MICROPROCESSOR
8086 MICROPROCESSOR8086 MICROPROCESSOR
8086 MICROPROCESSORAlxus Shuvo
 
Embedded System basic and classifications
Embedded System basic and classificationsEmbedded System basic and classifications
Embedded System basic and classificationsrajkciitr
 
ARM® Cortex™ M Bootup_CMSIS_Part_3_3_Debug_Architecture
ARM® Cortex™ M Bootup_CMSIS_Part_3_3_Debug_ArchitectureARM® Cortex™ M Bootup_CMSIS_Part_3_3_Debug_Architecture
ARM® Cortex™ M Bootup_CMSIS_Part_3_3_Debug_ArchitectureRaahul Raghavan
 
Arm architecture
Arm architectureArm architecture
Arm architectureMinYeop Na
 
HKG18-116 - RAS Solutions for Arm64 Servers
HKG18-116 - RAS Solutions for Arm64 ServersHKG18-116 - RAS Solutions for Arm64 Servers
HKG18-116 - RAS Solutions for Arm64 ServersLinaro
 
Processor organization &amp; register organization
Processor organization &amp; register organizationProcessor organization &amp; register organization
Processor organization &amp; register organizationGhanshyam Patel
 
Computer Organization and Architecture.
Computer Organization and Architecture.Computer Organization and Architecture.
Computer Organization and Architecture.CS_GDRCST
 

Similar to Reliability, Availability, and Serviceability (RAS) on ARM64 Systems (20)

Spike yuan server ras and uefi cper final
Spike yuan  server ras and uefi cper finalSpike yuan  server ras and uefi cper final
Spike yuan server ras and uefi cper final
 
AMulti-coreSoftwareHardwareCo-DebugPlatform_Final
AMulti-coreSoftwareHardwareCo-DebugPlatform_FinalAMulti-coreSoftwareHardwareCo-DebugPlatform_Final
AMulti-coreSoftwareHardwareCo-DebugPlatform_Final
 
Module-2 Instruction Set Cpus.pdf
Module-2 Instruction Set Cpus.pdfModule-2 Instruction Set Cpus.pdf
Module-2 Instruction Set Cpus.pdf
 
Virtualization Support in ARMv8+
Virtualization Support in ARMv8+Virtualization Support in ARMv8+
Virtualization Support in ARMv8+
 
Cao 2012
Cao 2012Cao 2012
Cao 2012
 
Embedded system
Embedded system Embedded system
Embedded system
 
Developing a Windows CE OAL.ppt
Developing a Windows CE OAL.pptDeveloping a Windows CE OAL.ppt
Developing a Windows CE OAL.ppt
 
Oracle R12 EBS Performance Tuning
Oracle R12 EBS Performance TuningOracle R12 EBS Performance Tuning
Oracle R12 EBS Performance Tuning
 
Embedded Systems
Embedded SystemsEmbedded Systems
Embedded Systems
 
8086 MICROPROCESSOR
8086 MICROPROCESSOR8086 MICROPROCESSOR
8086 MICROPROCESSOR
 
Embedded System basic and classifications
Embedded System basic and classificationsEmbedded System basic and classifications
Embedded System basic and classifications
 
Agnostic Device Drivers
Agnostic Device DriversAgnostic Device Drivers
Agnostic Device Drivers
 
Processor types
Processor typesProcessor types
Processor types
 
ARM® Cortex™ M Bootup_CMSIS_Part_3_3_Debug_Architecture
ARM® Cortex™ M Bootup_CMSIS_Part_3_3_Debug_ArchitectureARM® Cortex™ M Bootup_CMSIS_Part_3_3_Debug_Architecture
ARM® Cortex™ M Bootup_CMSIS_Part_3_3_Debug_Architecture
 
esunit1.pptx
esunit1.pptxesunit1.pptx
esunit1.pptx
 
Arm architecture
Arm architectureArm architecture
Arm architecture
 
Assignment
AssignmentAssignment
Assignment
 
HKG18-116 - RAS Solutions for Arm64 Servers
HKG18-116 - RAS Solutions for Arm64 ServersHKG18-116 - RAS Solutions for Arm64 Servers
HKG18-116 - RAS Solutions for Arm64 Servers
 
Processor organization &amp; register organization
Processor organization &amp; register organizationProcessor organization &amp; register organization
Processor organization &amp; register organization
 
Computer Organization and Architecture.
Computer Organization and Architecture.Computer Organization and Architecture.
Computer Organization and Architecture.
 

More from Linaro

Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea GalloDeep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea GalloLinaro
 
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta VekariaArm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta VekariaLinaro
 
Huawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua MoraHuawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua MoraLinaro
 
Bud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qaBud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qaLinaro
 
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018Linaro
 
HPC network stack on ARM - Linaro HPC Workshop 2018
HPC network stack on ARM - Linaro HPC Workshop 2018HPC network stack on ARM - Linaro HPC Workshop 2018
HPC network stack on ARM - Linaro HPC Workshop 2018Linaro
 
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...Linaro
 
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...Linaro
 
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...Linaro
 
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...Linaro
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineLinaro
 
HKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening KeynoteHKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening KeynoteLinaro
 
HKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP WorkshopHKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP WorkshopLinaro
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineLinaro
 
HKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and allHKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and allLinaro
 
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse HypervisorHKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse HypervisorLinaro
 
HKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMUHKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMULinaro
 
HKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8MHKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8MLinaro
 
HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation Linaro
 
HKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted bootHKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted bootLinaro
 

More from Linaro (20)

Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea GalloDeep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
 
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta VekariaArm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
 
Huawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua MoraHuawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
 
Bud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qaBud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qa
 
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
 
HPC network stack on ARM - Linaro HPC Workshop 2018
HPC network stack on ARM - Linaro HPC Workshop 2018HPC network stack on ARM - Linaro HPC Workshop 2018
HPC network stack on ARM - Linaro HPC Workshop 2018
 
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
 
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
 
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
 
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
 
HKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening KeynoteHKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening Keynote
 
HKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP WorkshopHKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP Workshop
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
 
HKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and allHKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and all
 
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse HypervisorHKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
 
HKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMUHKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMU
 
HKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8MHKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8M
 
HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation
 
HKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted bootHKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted boot
 

Recently uploaded

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 

Recently uploaded (20)

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 

Reliability, Availability, and Serviceability (RAS) on ARM64 Systems

  • 2. ENGINEERS AND DEVICES WORKING TOGETHER AGENDA ● What is RAS? ○ Introduction: Definition, Importance, History on X86 ○ Overview of RAS functions ● ARMv8 CPU requirements for RAS ○ CPU core, GICv2/GICv3 ○ RAS Extension(ARMv8.2) ● SDEI(Software Delegated Exception Interface) ● APEI(ACPI Platform Error Interfaces) ○ BERT and CPER, HEST and GHESv2, EINJ/ERST ● SW components for RAS(in example) ○ All SW components ○ How it works: BERT, HEST ● Current status and Planning
  • 3. ENGINEERS AND DEVICES WORKING TOGETHER What is RAS? There are three key aspects to robust systems: Reliability: Continuity, Computation needs be correct and reliable. Availability: Readiness, System needs to remain available as long as possible. Serviceability: Ability to undergo modifications and repairs, System should provide information to administrator to aid in system servicing. The RAS architecture mostly describes data corruption faults, which mostly occur in memory and on data links, can also be used for handling other types of physical faults found in systems. In another word, The RAS architecture primarily cares about errors produced from hardware .
  • 4. ENGINEERS AND DEVICES WORKING TOGETHER How important is RAS ? Enterprise systems often provide mission-critical services: Impacts System failure or data corruption impacts the customer’s business and reputation Inevitability Although faults are rare, enterprise systems can be very large. So failures are inevitable. OPEX Operating Expense (OPEX) for maintenance is reduced by replacing only failed parts, and scheduled maintenance is cheaper than unscheduled service outages. So we have to maintain system very well, and Operating Expense (OPEX) for maintenance is inevitable. How to make the cost down?
  • 5. ENGINEERS AND DEVICES WORKING TOGETHER What if we don’t have RAS? We DO have some other solutions: To Be Successful in Business, You Need a Little Luck. --Richard Branson
  • 6. ENGINEERS AND DEVICES WORKING TOGETHER RAS on X86 History ● Machine Check Architecture (MCA) ○ An x86 mechanism in which the CPU reports hardware errors to the operating system ○ model-specific registers (MSRs) ■ set up machine checking ■ recorde detected errors ■ the information they contain is CPU specific ○ Machine Check Exception (MCE) ■ signals the detection of an uncorrected machine-check error ■ handler collect information about error from MSRs ■ mcelog ● EDAC (Error Detection and Correction) ○ designed to report and possibly act on hardware errors ○ inspect the hardware directly (system-specific handling and decoding.) ○ only support memory controller and PCI/AGP errors ● PCI-E: Advanced Error Reporting (AER) ● Other Hardware features ○ ECC in memory controllers and I/O RAMs
  • 7. ENGINEERS AND DEVICES WORKING TOGETHER Overview of RAS functions
  • 8. ENGINEERS AND DEVICES WORKING TOGETHER ARMv8 CPU requirements for RAS ● CPU ○ ARMv8-A architecture (ARMv8.2) ○ EL2, EL3, or both ■ Virtualization extension or Security extensions or both ● Generic Interrupt Controller (GICv2/GICv3) ○ Interrupt routing modes ○ Private and shared interrupts(PPI/SPI) ○ Ability to set an interrupt pending event signaling and delegation ○ Interrupt groups/priority ● RAS Extension ○ ESB (Error Synchronization Barrier) instructions ○ RAS Extension registers ○ corrupted data poisoning
  • 9. ENGINEERS AND DEVICES WORKING TOGETHER ESB Synchronization barrier instruction, used to synchronize Unrecoverable errors, to make sure containable errors architecturally consumed by the PE and not silently propagated. Registers Group ● Processor Feature ● Trapping accesses to Error system ● Error records (Single/Group) ● Deferred Interrupt Status(work with ESB) Overview of RAS Extension The RAS extension deals primarily with errors produced from hardware faults. POISON signals corrupted data (hardware mechanism)
  • 10. ENGINEERS AND DEVICES WORKING TOGETHER Taxonomy of Error in Diagram
  • 11. ENGINEERS AND DEVICES WORKING TOGETHER ESB instruction ESB(Error Synchronization Barrier) can be used to isolate Unrecoverable errors. Software can determine that: ● The error was reported as Unrecoverable. ● The preferred return address of the SEI is an ESB instruction. The software between that ESB and the previous ESB can be isolated. ESB might update ● DISR_EL1 / DISR (Deferred Interrupt Status Register) ○ SEIs Generated by instructions occurring in program order before the ESB, are either taken before or at the ESB or pended in the DISR_EL1 / DISR register (depending on whether the SEI exception is masked). ● VDISR_EL2 / VDISR(Virtual Deferred Interrupt Status Register) ○ for virtualization
  • 12. ENGINEERS AND DEVICES WORKING TOGETHER System register view ● Processor/Memory Model Feature Register ● Error Record Register ○ ID, Select Register ○ Record Register(ERX*_EL1) ■ Feature ■ Control ■ Record Primary Syndrome ■ Record Address Register ■ Record Miscellaneous Registers ● Hypervisor Configuration Register ● Virtual SError Exception Syndrome Register ● Secure Configuration Register RAS Extension registers Memory-mapped view ● Peripheral/Component ID Register ● Error Record ID Register ● Record Register(ERR<n>*) ○ Feature ○ Control ○ Record Primary Syndrome ○ Record Address Register ○ Record Miscellaneous Registers ○ Group Status Registers ● Interrupt Register ○ Error Interrupt Configuration Register (Fault-Handling/Recovery) ○ Generic Error Interrupt Configuration Register ○ Error Interrupt Status Register ● Device Affinity Register ● Device Architecture Register
  • 13. ENGINEERS AND DEVICES WORKING TOGETHER RAS Extension How to route interrupts and Exception abort
  • 14. ENGINEERS AND DEVICES WORKING TOGETHER SDEI: Software Delegated Exception Interface A PSCI-like interface for registering and servicing system events from system firmware( But it’s NOT only for RAS)
  • 15. ENGINEERS AND DEVICES WORKING TOGETHER SDEI: Some Interface Functions Event control SDEI_EVENT_REGISTER SDEI_EVENT_ENABLE/DISABLE SDEI_EVENT_CONTEXT SDEI_EVENT_COMPLETE SDEI_EVENT_COMPLETE_AND_RE SUME SDEI_EVENT_UNREGISTER SDEI_EVENT_STATUS SDEI_EVENT_GET_INFO SDEI_EVENT_ROUTING_SET SDEI_EVENT_SIGNAL info & other control SDEI_VERSION SDEI_FEATURES SDEI_INTERRUPT_BIND/RELEASE SDEI_PE_UNMASK SDEI_PE_DATA_RESET SDEI_SYSTEM_DATA_RESET
  • 16. ENGINEERS AND DEVICES WORKING TOGETHER APEI(ACPI Platform Error Interfaces)
  • 17. ENGINEERS AND DEVICES WORKING TOGETHER BERT: Boot Error Record Table Scenarios: Record errors in emergency (OS crash/reset) Mechanism: Report unhandled errors that occurred in the previous boot Key info: WHERE are the error records SIZE of the error record region
  • 18. CPER : Common Platform Error Record The description is in the Appendix of UEFI Specification. With the help of CPER and FF(Firmware-First mode), OS can get all kinds of error we could think of.
  • 19. ENGINEERS AND DEVICES WORKING TOGETHER HEST: Hardware Error Source Table Scenarios: Record errors in runtime (OS still can work) Mechanism: describes a standardized mechanism platforms may use to describe their error sources by Error Key info: ○ HOW to get trigger ○ WHERE are the error records ○ HOW to release records’ mem
  • 20. ENGINEERS AND DEVICES WORKING TOGETHER HEST: Hardware Error Source Table For ARM64 : GHES v2 HOW to get trigger: Notification Structure WHERE are the error records: Error Status Address (GAS : Generic Address Structure) HOW to release records’ mem: Read Ack Register For IA-32 : MCE/CMC/NMI For PCI: AER Root Port/Endpoint/Bridge For generic hardware: GHES(Generic Hardware Error Source) V1/V2
  • 21. ENGINEERS AND DEVICES WORKING TOGETHER ERST: Error Record Serialization Table Scenarios: Record and Retrieve errors in persistent storage Mechanism : Operation abstract, provides details necessary to communicate with on-board persistent storage for error recording Plan B: IPMI , MTD, block device, ABRT...
  • 22. ENGINEERS AND DEVICES WORKING TOGETHER EINJ: Error Injection Table Scenarios: Test OSPM error handling stack Mechanism: Operation abstract, provides a generic interface which OSPM can inject hardware errors to the platform without requiring platform specific software. Possible use case:
  • 23. ENGINEERS AND DEVICES WORKING TOGETHER SW components for RAS Firmware: ARM TF(ARM Trusted Firmware)with RAS support UEFI(Unified Extensible Firmware Interface): tianocore-edk2 ACPI tables(with APEI tables) OS: Linux kernel( with APEI drivers) Userspace: rasdaemon
  • 24. How it works: BERT, HEST
  • 25.
  • 26. ENGINEERS AND DEVICES WORKING TOGETHER Current status ● Hardware: ARMv8.2 spec includes RAS extension ● Firmware: ○ RAS extension doc has described some different software sequences based on different scenarios, but the spec itself is underdevelopment ○ SDEI is underdevelopment ● OS(Linux): APEI on ARM64 can be enabled in kernel. ○ BERT works ○ GHESv2 of HEST is upstreaming(v11) by Qualcomm engineers ○ ERST and EINJ are untested because of the lack of FW ● Daemon(Application) ○ rasdaemon can be launched on ARM64
  • 27. ENGINEERS AND DEVICES WORKING TOGETHER What we are missing and planning ● Hardware: need a hardware or a simulator (ARMv8.2, including RAS extension) ● Firmware: ○ RAS extension and SDEI spec ○ ARM TF with RAS extension support and SDEI dispatcher ○ REST and EINJ implementation ● OS(Linux): ○ SDEI client and dispatcher(virtual SDEI) ○ GHESv2 support(upstreaming) ○ RAS event (How to pass/keep info?) ● Daemon(Application) ○ rasdaemon on ARM64 for RAS event
  • 28. ENGINEERS AND DEVICES WORKING TOGETHER Acknowledgments Al Stone (Red Hat) Charles Garcia-Tobin(ARM) John Feeney(Red Hat) Jon Masters(Red Hat) Zhigao Li(Huawei)
  • 29. Thank You #BUD17 For further information: www.linaro.org BUD17 keynotes and videos on: connect.linaro.org