SlideShare a Scribd company logo
1 of 41
Download to read offline
Copyright 2018 FUJITSU LIMITED
The ideal and reality of
NVDIMM RAS
Yasunori Goto / QI Fuli
Linux Development Division
Fujitsu Limited
China Linux Kernel
Developer Conference
Oct 14, 2018
Copyright 2018 FUJITSU LIMITED
n Introduction of NVDIMM
n Issues of RAS which we found and solved
n Distinguish and replace broken NVDIMM (by Yasunori Goto)
n Monitoring feature of NVDIMM (by Qi Fuli)
Agenda
Introduction of NVDIMM
Copyright 2018 FUJITSU LIMITED
NVDIMM is coming
n Non Volatile DIMM (NVDIMM)
n Some vendor released / will release NVDIMM
• Intel Optane DC persistent memory
n CPU can read/write NVDIMM directly like RAM
n Data is not volatile even if system is powered down
n Though its access latency is a little slower than DRAM, it is expected to
be very faster than NVMe
n May be huge amount (*)
n Use case
n For example, On memory Database
(*) It depends on type of NVDIMM
Copyright 2018 FUJITSU LIMITED
Impact of NVDIMM
n Traditional I/O layer is not suitable for NVDIMM
n Page cache is not necessary at least
• It was created for SLOW I/O storage
• But, it is just redundant for NVDIMM
n Even system call may be waste of time due to switch kernel mode and
user mode
• Ideal is that application can access to NVDIMM directly without kernel
n If application calls cpu flush instruction, then data becomes persistent
• calling sync()/fsync()/msync() becomes redundant
n To achieve the above feature, New access method is expected
n DAX (Direct Access Mode) is developed in Linux
Copyright 2018 FUJITSU LIMITED
Filesystem is still useful for traditional software
n On the other hand,
Many software assumes that memory is VOLATILE
• They allocate memory areas on the RAM and make structures on them
• When allocate memory area on the NVDIMM like RAM, what will happen?
n For NVDIMM, many things are necessary. For example
• Need data structure compatibility on NVDIMM
• Should not change data structures in NVDIMM
• If the structures are changed, software update will be cause of disaster
• Need to detect / correct collapsing data
• CPU cache is still volatile. If system power down suddenly, then some data may not
be stored
• If the data is broken, software need to detect it and correct it
n Filesystem is still useful for traditional software
n Filesystem gives many solution for the above considerations
• Format compatibility of filesystem, data correction, region management, ,
authority check, etc...
Copyright 2018 FUJITSU LIMITED
Interfaces of NVDIMM
n Linux provides some interfaces for application
n Storage Access
• Application can access NVDIMM via traditional I/O
• This mode is for old application
n Filesystem DAX(*) and Device DAX
• Application can read / write NVDIMM area directly if it calls mmap()
• Need modification or
making new software
n PMDK is provided
• libraries and tools for
software which uses DAX
Copyright 2018 FUJITSU LIMITED
DAX enabled
filesystem
NVDIMM
user layer
kernel layer
Pmem (or block window) driver device dax driver
/dev/dax
New App with Device
DAX
BTT driver
Traditional
filesystem
Traditional
App
New App with
Filesystem DAX
PMDK
management
command
sysfs
ACPI driver
ACPI
(*) DAX= Direct Access Mode
Region and Namespace(1/2)
n To manage huge amount of NVDIMM, “Region” and
“namespace” is defined
n Region
• Region is created by hardware and firmware feature
• Each NVDIMM modules are placed on a System Board(Mother Board)
• You can create regions on NVDIMM modules
• Interleave is available
Copyright 2018 FUJITSU LIMITED
region 1
region 2
region 0
(This is
Interleaved
region)
System
Board
NVDIMM module1
NVDIMM module2
NVDIMM module 3
Region and Namespace (2/2)
nNamespace
• You can create namespaces on each region by ndctl command
• Namespace concept is likely LUN of SCSI
• You can make plural namespaces on a region, if you wish
• You can specify storage, filesystem-dax, or device dax at this time
• NVDIMM driver names each NVDIMM module “nmemX”
• /dev/pmemX or /dev/daxY is named for each namespace
• If Storage mode or Filesystem DAX, you can make partition and filesystem
on the /dev/pmemXXX
Copyright 2018 FUJITSU LIMITED
region 1
region 2
System
Board
NVDIMM module1
(nmem1)
NVDIMM module2
(nmem2)
NVDIMM module 3
(nmem3)
namespace0.0
/dev/pmem0
namespace1.0
/dev/pmem1s
namespace1.1
/dev/pmem1.1
namespac2.0
/dev/dax2
(region 0)
n ndctl
n Many features
• Create / Manage namespaces
• Set access mode (storage / Filesystem DAX/ Device DAX)
• Show status of NVDIMM modules, regions, namespaces
• Support difference among NVDIMM vendors (NVDIMM-N or 3D-xpoint...)
• etc..
n It use ACPI via sysfs
and acpi driver for
NVDIMM (nfit.ko)
n We add some feature
in it for RAS
DAX enabled
filesystem
NVDIMM
user layer
kernel layer
Pmem (or block window) driver device dax driver
/dev/dax
New App with Device
DAX
BTT driver
Traditional
filesystem
Traditional
App
New App with
Filesystem DAX
PMDK
ndctl
sysfs
ACPI
driver
ACPI
Management command
Copyright 2018 FUJITSU LIMITED
NVDIMMACPI
ACPI spec for NVDIMM
n ACPI v6.0 or later supports specification for NVDIMM
n NFIT table is defined for NVDIMM
• Many sub tables are specified for NVDIMM basic information
• (Very huge table)
n In DSDT
• Some specification of NVDIMM is defined
• “NVDIMM root device”
• defined for whole system
• Its _DSM method has some features like Address Range Scrubbing
• Notification is used for NFIT update or Memory error detection
• “NVDIMM device”
• defined for each NVDIMM modules
• Its _DSM method is used for manage namespace or Health (SMART) information, etc.
• If health status is changed, firmware notifies a event to the “NVDIMM device” in DSDT
• There is a little bit difference among vendors
(Intel) http://pmem.io/documents/NVDIMM_DSM_Interface-V1.7.pdf
(HPE) https://github.com/HewlettPackard/hpe-nvm
Copyright 2018 FUJITSU LIMITED
n Issues and what I solved
Replacement of NVDIMM
Copyright 2018 FUJITSU LIMITED
NVDIMM may be broken
n We need to consider when NVDIMM becomes broken
n The life of NVDIMM is limited
• Write endurance of NVDIMM is / will be not infinite
• NVDIMM module has / may have some feature
• wear leveling
• spare blocks instead of broken block
• If spare block is consumed completely, the NVDIMM modules can not
recover
• NVDIMM-N has a battery in the module
• It will be also deteriorated
n Bit error may also occur on NVDIMM
• “Scrubbing” feature may rescue it from fatal error if possible
• If it can not rescue, the block becomes broken
Copyright 2018 FUJITSU LIMITED
We need to have a way to replace broken NVDIMM
Question
n Please imagine a difference between HDD/SSD and NVDIMM
n How to replace its device?
Copyright 2018 FUJITSU LIMITED
Answer
n There is a difference between SSD/HDD and NVDIMM
replacement like the followings
Copyright 2018 FUJITSU LIMITED
•
•
•
•
•
•
•
Replacing of NVDIMM is more difficult than SSD/HDD
Replacing broken NVDIMM is complex (1/2)
n In addition, the specification of NVDIMM (region and
namespace) is cause of more difficulty of replacement
n In this figure, when a block on region 0 is broken eternally, and user
need to replace it, what information is necessary?
Copyright 2018 FUJITSU LIMITEDCopyright 2018 FUJITSU LIMITED
region 1
region 2
System
Board
NVDIMM module1
(nmem1)
NVDIMM module2
(nmem2)
NVDIMM module 3
(nmem3)
namespace0.0
/dev/pmem0
namespace1.0
/dev/pmem1s
namespace1.1
/dev/pmem1.1
namespac2.0
/dev/dax2broken
(region 0)
Replacing broken NVDIMM is complex(2/2)
n To replace broken NVDIMM, the following
information is necessary, at least
Copyright 2018 FUJITSU LIMITED
region 1
region 2
System
Board
NVDIMM module1
(nmem1)
NVDIMM module2
(nmem2)
NVDIMM module 3
(nmem3)
namespac0.0
/dev/pmem0
namespace1.0
/dev/pmem1s
namespace1.1
/dev/pmem1.1
namespac2.0
/dev/dax2broken
1. Need a way to distinguish
which NVDIMM module
is broken (especially, if
the region is interleaved,
only firmware know it)
2. To replace NVDIMM, Need information
of relationship between
“Physical” location of NVDIMM module
(Ex. Silk print on mother board),
and Device name of namespace /dev/pmemX)
3. Need to know
other namespace
on same DIMM
modules for backup
before replace
(region 0)
What I developed
n I made 2 features of ndctl
n To show important information for replace
Copyright 2018 FUJITSU LIMITED
region 1
region 2
region 0
System
Board
NVDIMM module1
(nmem1)
NVDIMM module2
(nmem2)
NVDIMM module 3
(nmem3)
namespac0.0
/dev/pmem0
namespace1.0
/dev/pmem1s
namespace1.1
/dev/pmem1.1
namespac2.0
/dev/dax2broken
1. Show NVDIMM info
which has broken
block in the region
2. Show the id of NVDIMM module
to find physical location
Fortunately, current ndctl can show
all namespaces on a NVDIMM already,
(I will mention how to do it later)
n ndctl did not show broken NVDIMM module information
n It just showed broken “block number” in the region
n When the region is Interleaved, kernel/driver cannot distinguish NVDIMM
• Only firmware know the relationship between physical address and NVDIMM
module
n Fortunately, “Translate SPA” is defined in _DSM method of “NVDIMM
root Device” at ACPI ver.6.2
• Firmware translates from System Physical Address to “NFIT DIMM handle”
and Physical address in the DIMM( DPA)
n I made the following translation in the ndctl
Copyright 2018 FUJITSU LIMITED
To show broken NVDIMM
block number
of the region
Its System
Physical address
Get NFIT
DIMM handle
NVDIMM
device name
(nmemX)
ndctl
firmware
call _DSM
Transrate SPA
display bad block information (1/2)
8 B 12D
G
: B -4
G
8 - : #
B - ( 0 . ( , )) 0." #
-
6 :B -4
G
8 - #
-
H#
G
8 - #
-
H
5#
-
"
Copyright 2018 FUJITSU LIMITED
region 0 is
interleaved region
by nmem1 and nmem0
command to show
region info with
dimm, health and bad block info
(output is JSON format)
display bad block information (2/2)
, ,5 : #
, ,5
00 :
5 1: #
366
6 6"
Copyright 2018 FUJITSU LIMITED
Current ndctl displays
which DIMM has badblock
in this region
In this example,
nmem0 has broken block
badblock info in the region
(the previous ndctl showed only
this information)
To show NVDIMM device location (1/2)
n What can shows the physical location of NVDIMM modules?
n /dev/nmemX is named for NVDIMM module by NVDIMM driver
• It is also used by ndctl
• However, its id (X) is not related with device location
• The driver decided it by the order which it found
n I thought the handle of “NVDIMM Device” in ACPI DSDT may be good
• Handle is created with the followings
• node controller id, socket id, memory controller id, etc...
• However, these id may be dynamic due to hardware/firmware
configuration
• Especially, our some servers can change # of nodes by dividing the box with
system configuration
• So, this may not be good to find physical location too
Copyright 2018 FUJITSU LIMITED
To show NVDIMM device location (2/2)
nThe remaining option is SMBIOS handle
•You can confirm physical location of a device with
“dmidecode” command
• In addition, each device has SMBIOS handle, and shown in the
command
•SMBIOS handle is also “NVDIMM Region mapping
structure” table in NFIT
•So, if ndctl command can show it, you can find the location
with the handle
•I made a patch that ndctl show it as phys_id
Copyright 2018 FUJITSU LIMITED
Example of SMBIOS Handle of ndctl
# ndctl list -Du
[
{
"dev":"nmem1",
"id":"XXXX-XX-XXXX-XXXXXXXX",
"handle":"0x120",
"phys_id":"0x1c"
},
{
"dev":"nmem0",
"id":"XXXX-XX-XXXX-XXXXXXXX",
"handle":"0x20",
"phys_id":"0x10",
"flag_failed_flush":true,
"flag_smart_event":true
}
]
Copyright 2018 FUJITSU LIMITED
display DIMM infomation
with human readable format
(phys_id becomes Hexadecimal)
phys_id is SMBIOS handle of
these DIMM
0x10 is broken DIMM’s handle
in this example
(The nmem0 included broken block
in previous result)
dmidecode can show the location of DIMM
# demidecode
:
Handle 0x0010, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0004
:
:
Locator: DIMM-Location-example-Slot-A
:
Type Detail: Non-Volatile Registered (Buffered)
Copyright 2018 FUJITSU LIMITED
SMBIOS handle of DIMM
Locator: shows the place of the DIMM
How to distinguish all namespaces on the DIMM
n ndctl list command has many filtering feature
n by region, namespace, bus, and dimm
n Then list namespace with filtering dimm by ndctl, you can find all the
namespaces
Copyright 2018 FUJITSU LIMITED
721 4 "- "2
.
23 , 7063 9013 #
6 23 , 31 :
4 3 ,
,
23 , 7063 9013 #
6 23 , 31 :
4 3 ,
,
command to show
all namespaces on the nmem0
This dimm includes 2 namespaces
You need to back up them
before replacing nmem0
n ndctl monitor
NVDIMM Monitoring Daemon
Copyright 2018 FUJITSU LIMITED
Who am I?
nQI Fuli
nSoftware Engineer at Fujitsu Ltd.
nPhD Student at the University of Tokyo
nWorking on R&D of PM
nEmail: qi.fuli@fujitsu.com
Copyright 2018 FUJITSU LIMITED
Why the status of NVDIMM must be monitored?
nNVDIMM has a life span
nBlocks of NVDIMM will be broken due to endurance problem
nNVDIMM modules consume spare block
nWhen spare decreases to 0, it cannot rescue broken block
nNVDIMM has no feature, such as mirroring to save its data
nBackup and replacement are needed before no spares
nUsers should know the best time to do the backup and
replacement
Copyright 2018 FUJITSU LIMITED
Status of NVDIMM needs to be monitored and
be notified before NVIMM becomes broken
Ndctl monitor
nI created a daemon to monitor NVDIMM
nmonitor health status
nnotify critical status
https://pmem.io/ndctl/ndctl-monitor.html
NVDIMM will be broken !!!
Spare block ratio is low !!!
5%
( 1 /( ) 1 1 /
nSMART events come from NVDIMM can be monitored
Copyright 2018 FUJITSU LIMITED
SMART event trigger effect
spares-
remaining
spare block value goes below
the spare threshold limit NVDIMM becomes broken;
data will be lost;
etc.
health-status
normal health status of
NVDIMM changes
unclean-
shutdown
last shutdown to be recorded
at the next boot
saving data target failed
on NVDIMM; etc.
media-
temperature
temperature value of nvdimm
goes above threshold limit
server is getting too hot;
may need remediation;
a specific fan fails;
etc.,
controller-
temperature
controller temperature value
goes above threshold limit
( 2/( ) /
nActions after monitoring
nLog the output notification
•Log destination
• syslog
• arbitrary file set in the configuration file or set by command option
•Output format
• JSON (can be analyzed by other log collectors, such as fluentd
nKick other application to be implemented
•When the monitor detects a SMART health event,
the application will move the data in real time.
Copyright 2018 FUJITSU LIMITED
/ ( )
nFilter of ndctl monitor
nobjects to monitor can be filtered
• all NVDIMMs are monitored by default
• objects can be filtered by namespace, region, bus, and dimm
• backup the data of applications which are running on a certain namespace
⇒ filter by namespace
• allowing to monitor for any media error on the bus
⇒ filter by bus
nSMART health events to monitor can be filtered
Copyright 2018 FUJITSU LIMITED
ACPI supports for NVDIMM
n NVDIMM has SMART Health Info like SSD/HDD
n SMART Health Info can be retrieved via a _DSM method
n SMART Health Info includes spare block value and spare threshold limit
n ACPI 6.1 adds “NFIT Health Event Notification”
n a notification will be sent when the spare block value goes below the
spare threshold limit
n a poll(2) event will be triggered when the notification is received
Copyright 2018 FUJITSU LIMITED
NFIT Health Event Notification
triggers a poll(2) event on /nmemX/nfit/flags
spare block value below threshold
ACPI (Firmware)
NVDIMM (Hardware)
Kernel (OS)
Copyright 2018 FUJITSU LIMITED
triggers a poll(2) event on /nmemX/nfit/flags
output notification
filter health event(s) to monitor
retrieve NVDIMM SMART status
Kernel (OS)
ndctl monitor
standard output, syslog, /var/log/ndctl/monitor.log etc.
monitor poll(2) event(s) on /nmemX/nfit/flags
filter object(s) to monitor
retrieve NVDIMM SMART status
"timestamp":"1537323709.432459532",
"pid":28189,
"event":{ "dimm-spares-remaining":true },
"dimm":{
"dev":"nmem1", "health":{
"health_state":"non-critical",
"temperature_celsius":23,
"controller_temperature_celsius":25,
"spares_percentage":4,
"temperature_threshold":40,
"controller_temperature_threshold":30,
…
"spares_threshold":5,
"shutdown_state":"clean“
}
}
Copyright 2018 FUJITSU LIMITED
current spare block value
spare threshold limit
SMART Health event
nLinux distribution support systemd
nSystemd service
⇒ systemctl start ndctl-monitor.service
nLinux distribution does not support systemd
nndctl monitor --daemon
nUsers could run monitor as a one shot command
nndctl monitor
Copyright 2018 FUJITSU LIMITED
Summary and Future work
nSummary
nBasis of NVDIMM
nReplacement
•Replacement of NVDIMM is complex
•Adding some information in ndctl list for replacing NVDIMM
nMonitoring
•Monitoring SMART status of NVDIMM is important
•Ndctl monitor daemon
nFuture work
nKicks other applications by daemon
nplural daemon execution via systemctl
Copyright 2018 FUJITSU LIMITED
Copyright 2018 FUJITSU LIMITED
Emulation of NVDIMM
n You can try NVDIMM interface easily
n custom e820
• To try interface for filesystem or DAX
• it can used with kernel boot option (memmap=XXG!YYG)
• XX Gbyte RAM is used for NVDIMM interfaces
n nfit_test.ko
• Try to test management command (ndctl)
• It works instead of the firmware for NVDIMM
Copyright 2018 FUJITSU LIMITED
Types of NVDIMM
n JEDEC defines some types of DIMM module
Copyright 2018 FUJITSU LIMITED
NVDIMM-N
• DIMM module includes
DRAM and NVMEM
• Amount of RAM =
Amount of NVMEM
• The module includes
battely
• Usually, RAM is used
• Backup to NVMEM at
Power down
NVDIMM-F
• Only NVMEM is used
• access like storage
NVDIMM-P
• DIMM module includes
DRAM and NVMEM
• Amount of RAM is less
than NVMEM
• RAM is used as cache
DRAM NVMEM
battery
DRAM NVMEMNVMEM

More Related Content

What's hot

The TCP/IP Stack in the Linux Kernel
The TCP/IP Stack in the Linux KernelThe TCP/IP Stack in the Linux Kernel
The TCP/IP Stack in the Linux KernelDivye Kapoor
 
The Linux Block Layer - Built for Fast Storage
The Linux Block Layer - Built for Fast StorageThe Linux Block Layer - Built for Fast Storage
The Linux Block Layer - Built for Fast StorageKernel TLV
 
PostgreSQL 15 開発最新情報
PostgreSQL 15 開発最新情報PostgreSQL 15 開発最新情報
PostgreSQL 15 開発最新情報Masahiko Sawada
 
ACRN vMeet-Up EU 2021 - shared memory based inter-vm communication introduction
ACRN vMeet-Up EU 2021 - shared memory based inter-vm communication introductionACRN vMeet-Up EU 2021 - shared memory based inter-vm communication introduction
ACRN vMeet-Up EU 2021 - shared memory based inter-vm communication introductionProject ACRN
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Brendan Gregg
 
IntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingIntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingBrendan Gregg
 
Linux Performance Profiling and Monitoring
Linux Performance Profiling and MonitoringLinux Performance Profiling and Monitoring
Linux Performance Profiling and MonitoringGeorg Schönberger
 
Ceph and RocksDB
Ceph and RocksDBCeph and RocksDB
Ceph and RocksDBSage Weil
 
Monitoring IO performance with iostat and pt-diskstats
Monitoring IO performance with iostat and pt-diskstatsMonitoring IO performance with iostat and pt-diskstats
Monitoring IO performance with iostat and pt-diskstatsBen Mildren
 
Process Address Space: The way to create virtual address (page table) of user...
Process Address Space: The way to create virtual address (page table) of user...Process Address Space: The way to create virtual address (page table) of user...
Process Address Space: The way to create virtual address (page table) of user...Adrian Huang
 
Linux Memory Management
Linux Memory ManagementLinux Memory Management
Linux Memory ManagementNi Zo-Ma
 
Page reclaim
Page reclaimPage reclaim
Page reclaimsiburu
 
LAS16-403: GDB Linux Kernel Awareness
LAS16-403: GDB Linux Kernel AwarenessLAS16-403: GDB Linux Kernel Awareness
LAS16-403: GDB Linux Kernel AwarenessLinaro
 
The linux networking architecture
The linux networking architectureThe linux networking architecture
The linux networking architecturehugo lu
 
10GbE時代のネットワークI/O高速化
10GbE時代のネットワークI/O高速化10GbE時代のネットワークI/O高速化
10GbE時代のネットワークI/O高速化Takuya ASADA
 
Physical Memory Management.pdf
Physical Memory Management.pdfPhysical Memory Management.pdf
Physical Memory Management.pdfAdrian Huang
 
CXL_説明_公開用.pdf
CXL_説明_公開用.pdfCXL_説明_公開用.pdf
CXL_説明_公開用.pdfYasunori Goto
 
OpenStackを使用したGPU仮想化IaaS環境 事例紹介
OpenStackを使用したGPU仮想化IaaS環境 事例紹介OpenStackを使用したGPU仮想化IaaS環境 事例紹介
OpenStackを使用したGPU仮想化IaaS環境 事例紹介VirtualTech Japan Inc.
 
Page cache in Linux kernel
Page cache in Linux kernelPage cache in Linux kernel
Page cache in Linux kernelAdrian Huang
 
Futex Scaling for Multi-core Systems
Futex Scaling for Multi-core SystemsFutex Scaling for Multi-core Systems
Futex Scaling for Multi-core SystemsDavidlohr Bueso
 

What's hot (20)

The TCP/IP Stack in the Linux Kernel
The TCP/IP Stack in the Linux KernelThe TCP/IP Stack in the Linux Kernel
The TCP/IP Stack in the Linux Kernel
 
The Linux Block Layer - Built for Fast Storage
The Linux Block Layer - Built for Fast StorageThe Linux Block Layer - Built for Fast Storage
The Linux Block Layer - Built for Fast Storage
 
PostgreSQL 15 開発最新情報
PostgreSQL 15 開発最新情報PostgreSQL 15 開発最新情報
PostgreSQL 15 開発最新情報
 
ACRN vMeet-Up EU 2021 - shared memory based inter-vm communication introduction
ACRN vMeet-Up EU 2021 - shared memory based inter-vm communication introductionACRN vMeet-Up EU 2021 - shared memory based inter-vm communication introduction
ACRN vMeet-Up EU 2021 - shared memory based inter-vm communication introduction
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)
 
IntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingIntelON 2021 Processor Benchmarking
IntelON 2021 Processor Benchmarking
 
Linux Performance Profiling and Monitoring
Linux Performance Profiling and MonitoringLinux Performance Profiling and Monitoring
Linux Performance Profiling and Monitoring
 
Ceph and RocksDB
Ceph and RocksDBCeph and RocksDB
Ceph and RocksDB
 
Monitoring IO performance with iostat and pt-diskstats
Monitoring IO performance with iostat and pt-diskstatsMonitoring IO performance with iostat and pt-diskstats
Monitoring IO performance with iostat and pt-diskstats
 
Process Address Space: The way to create virtual address (page table) of user...
Process Address Space: The way to create virtual address (page table) of user...Process Address Space: The way to create virtual address (page table) of user...
Process Address Space: The way to create virtual address (page table) of user...
 
Linux Memory Management
Linux Memory ManagementLinux Memory Management
Linux Memory Management
 
Page reclaim
Page reclaimPage reclaim
Page reclaim
 
LAS16-403: GDB Linux Kernel Awareness
LAS16-403: GDB Linux Kernel AwarenessLAS16-403: GDB Linux Kernel Awareness
LAS16-403: GDB Linux Kernel Awareness
 
The linux networking architecture
The linux networking architectureThe linux networking architecture
The linux networking architecture
 
10GbE時代のネットワークI/O高速化
10GbE時代のネットワークI/O高速化10GbE時代のネットワークI/O高速化
10GbE時代のネットワークI/O高速化
 
Physical Memory Management.pdf
Physical Memory Management.pdfPhysical Memory Management.pdf
Physical Memory Management.pdf
 
CXL_説明_公開用.pdf
CXL_説明_公開用.pdfCXL_説明_公開用.pdf
CXL_説明_公開用.pdf
 
OpenStackを使用したGPU仮想化IaaS環境 事例紹介
OpenStackを使用したGPU仮想化IaaS環境 事例紹介OpenStackを使用したGPU仮想化IaaS環境 事例紹介
OpenStackを使用したGPU仮想化IaaS環境 事例紹介
 
Page cache in Linux kernel
Page cache in Linux kernelPage cache in Linux kernel
Page cache in Linux kernel
 
Futex Scaling for Multi-core Systems
Futex Scaling for Multi-core SystemsFutex Scaling for Multi-core Systems
Futex Scaling for Multi-core Systems
 

Similar to The ideal and reality of NVDIMM RAS

The Forefront of the Development for NVDIMM on Linux Kernel
The Forefront of the Development for NVDIMM on Linux KernelThe Forefront of the Development for NVDIMM on Linux Kernel
The Forefront of the Development for NVDIMM on Linux KernelYasunori Goto
 
The Forefront of the Development for NVDIMM on Linux Kernel (Linux Plumbers c...
The Forefront of the Development for NVDIMM on Linux Kernel (Linux Plumbers c...The Forefront of the Development for NVDIMM on Linux Kernel (Linux Plumbers c...
The Forefront of the Development for NVDIMM on Linux Kernel (Linux Plumbers c...Yasunori Goto
 
PowerEdge Rack and Tower Server Masters - AMD Server Memory.pptx
PowerEdge Rack and Tower Server Masters - AMD Server Memory.pptxPowerEdge Rack and Tower Server Masters - AMD Server Memory.pptx
PowerEdge Rack and Tower Server Masters - AMD Server Memory.pptxNeoKenj
 
NVDIMM block drivers with NFIT
NVDIMM block drivers with NFITNVDIMM block drivers with NFIT
NVDIMM block drivers with NFITjoeylikernel
 
Q4.11: Introduction to eMMC
Q4.11: Introduction to eMMCQ4.11: Introduction to eMMC
Q4.11: Introduction to eMMCLinaro
 
IMC Summit 2016 Keynote - Arthur Sainio - NVDIMM: Changes are Here So What’s ...
IMC Summit 2016 Keynote - Arthur Sainio - NVDIMM: Changes are Here So What’s ...IMC Summit 2016 Keynote - Arthur Sainio - NVDIMM: Changes are Here So What’s ...
IMC Summit 2016 Keynote - Arthur Sainio - NVDIMM: Changes are Here So What’s ...In-Memory Computing Summit
 
Tuning DB2 in a Solaris Environment
Tuning DB2 in a Solaris EnvironmentTuning DB2 in a Solaris Environment
Tuning DB2 in a Solaris EnvironmentJignesh Shah
 
Volatile Uses for Persistent Memory
Volatile Uses for Persistent MemoryVolatile Uses for Persistent Memory
Volatile Uses for Persistent MemoryIntel® Software
 
Veritas Software Foundations
Veritas Software FoundationsVeritas Software Foundations
Veritas Software Foundations.Gastón. .Bx.
 
Amd epyc update_gdep_xilinx_ai_web_seminar_20201028
Amd epyc update_gdep_xilinx_ai_web_seminar_20201028Amd epyc update_gdep_xilinx_ai_web_seminar_20201028
Amd epyc update_gdep_xilinx_ai_web_seminar_20201028ssuser5b12d1
 
We4IT lcty 2013 - infra-man - domino run faster
We4IT lcty 2013 - infra-man - domino run faster We4IT lcty 2013 - infra-man - domino run faster
We4IT lcty 2013 - infra-man - domino run faster We4IT Group
 
Open Source Data Deduplication
Open Source Data DeduplicationOpen Source Data Deduplication
Open Source Data DeduplicationRedWireServices
 
LCE13: Android Graphics Upstreaming
LCE13: Android Graphics UpstreamingLCE13: Android Graphics Upstreaming
LCE13: Android Graphics UpstreamingLinaro
 
High Performance Scaling Techniques in Golang Using Go Assembly
High Performance Scaling Techniques in Golang Using Go AssemblyHigh Performance Scaling Techniques in Golang Using Go Assembly
High Performance Scaling Techniques in Golang Using Go AssemblyMinio
 
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...Stefano Di Carlo
 

Similar to The ideal and reality of NVDIMM RAS (20)

The Forefront of the Development for NVDIMM on Linux Kernel
The Forefront of the Development for NVDIMM on Linux KernelThe Forefront of the Development for NVDIMM on Linux Kernel
The Forefront of the Development for NVDIMM on Linux Kernel
 
The Forefront of the Development for NVDIMM on Linux Kernel (Linux Plumbers c...
The Forefront of the Development for NVDIMM on Linux Kernel (Linux Plumbers c...The Forefront of the Development for NVDIMM on Linux Kernel (Linux Plumbers c...
The Forefront of the Development for NVDIMM on Linux Kernel (Linux Plumbers c...
 
PowerEdge Rack and Tower Server Masters - AMD Server Memory.pptx
PowerEdge Rack and Tower Server Masters - AMD Server Memory.pptxPowerEdge Rack and Tower Server Masters - AMD Server Memory.pptx
PowerEdge Rack and Tower Server Masters - AMD Server Memory.pptx
 
NVDIMM block drivers with NFIT
NVDIMM block drivers with NFITNVDIMM block drivers with NFIT
NVDIMM block drivers with NFIT
 
Q4.11: Introduction to eMMC
Q4.11: Introduction to eMMCQ4.11: Introduction to eMMC
Q4.11: Introduction to eMMC
 
IMC Summit 2016 Keynote - Arthur Sainio - NVDIMM: Changes are Here So What’s ...
IMC Summit 2016 Keynote - Arthur Sainio - NVDIMM: Changes are Here So What’s ...IMC Summit 2016 Keynote - Arthur Sainio - NVDIMM: Changes are Here So What’s ...
IMC Summit 2016 Keynote - Arthur Sainio - NVDIMM: Changes are Here So What’s ...
 
DDR DIMM Design
DDR DIMM DesignDDR DIMM Design
DDR DIMM Design
 
DDR SDRAM : Notes
DDR SDRAM : NotesDDR SDRAM : Notes
DDR SDRAM : Notes
 
Module 1 unit 4
Module 1 unit 4Module 1 unit 4
Module 1 unit 4
 
Dlm ppt
Dlm pptDlm ppt
Dlm ppt
 
Linux Huge Pages
Linux Huge PagesLinux Huge Pages
Linux Huge Pages
 
Tuning DB2 in a Solaris Environment
Tuning DB2 in a Solaris EnvironmentTuning DB2 in a Solaris Environment
Tuning DB2 in a Solaris Environment
 
Volatile Uses for Persistent Memory
Volatile Uses for Persistent MemoryVolatile Uses for Persistent Memory
Volatile Uses for Persistent Memory
 
Veritas Software Foundations
Veritas Software FoundationsVeritas Software Foundations
Veritas Software Foundations
 
Amd epyc update_gdep_xilinx_ai_web_seminar_20201028
Amd epyc update_gdep_xilinx_ai_web_seminar_20201028Amd epyc update_gdep_xilinx_ai_web_seminar_20201028
Amd epyc update_gdep_xilinx_ai_web_seminar_20201028
 
We4IT lcty 2013 - infra-man - domino run faster
We4IT lcty 2013 - infra-man - domino run faster We4IT lcty 2013 - infra-man - domino run faster
We4IT lcty 2013 - infra-man - domino run faster
 
Open Source Data Deduplication
Open Source Data DeduplicationOpen Source Data Deduplication
Open Source Data Deduplication
 
LCE13: Android Graphics Upstreaming
LCE13: Android Graphics UpstreamingLCE13: Android Graphics Upstreaming
LCE13: Android Graphics Upstreaming
 
High Performance Scaling Techniques in Golang Using Go Assembly
High Performance Scaling Techniques in Golang Using Go AssemblyHigh Performance Scaling Techniques in Golang Using Go Assembly
High Performance Scaling Techniques in Golang Using Go Assembly
 
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
 

Recently uploaded

Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfStefano Stabellini
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Hr365.us smith
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....kzayra69
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 

Recently uploaded (20)

Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdf
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your BusinessAdvantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 

The ideal and reality of NVDIMM RAS

  • 1. Copyright 2018 FUJITSU LIMITED The ideal and reality of NVDIMM RAS Yasunori Goto / QI Fuli Linux Development Division Fujitsu Limited China Linux Kernel Developer Conference Oct 14, 2018
  • 2. Copyright 2018 FUJITSU LIMITED n Introduction of NVDIMM n Issues of RAS which we found and solved n Distinguish and replace broken NVDIMM (by Yasunori Goto) n Monitoring feature of NVDIMM (by Qi Fuli) Agenda
  • 3. Introduction of NVDIMM Copyright 2018 FUJITSU LIMITED
  • 4. NVDIMM is coming n Non Volatile DIMM (NVDIMM) n Some vendor released / will release NVDIMM • Intel Optane DC persistent memory n CPU can read/write NVDIMM directly like RAM n Data is not volatile even if system is powered down n Though its access latency is a little slower than DRAM, it is expected to be very faster than NVMe n May be huge amount (*) n Use case n For example, On memory Database (*) It depends on type of NVDIMM Copyright 2018 FUJITSU LIMITED
  • 5. Impact of NVDIMM n Traditional I/O layer is not suitable for NVDIMM n Page cache is not necessary at least • It was created for SLOW I/O storage • But, it is just redundant for NVDIMM n Even system call may be waste of time due to switch kernel mode and user mode • Ideal is that application can access to NVDIMM directly without kernel n If application calls cpu flush instruction, then data becomes persistent • calling sync()/fsync()/msync() becomes redundant n To achieve the above feature, New access method is expected n DAX (Direct Access Mode) is developed in Linux Copyright 2018 FUJITSU LIMITED
  • 6. Filesystem is still useful for traditional software n On the other hand, Many software assumes that memory is VOLATILE • They allocate memory areas on the RAM and make structures on them • When allocate memory area on the NVDIMM like RAM, what will happen? n For NVDIMM, many things are necessary. For example • Need data structure compatibility on NVDIMM • Should not change data structures in NVDIMM • If the structures are changed, software update will be cause of disaster • Need to detect / correct collapsing data • CPU cache is still volatile. If system power down suddenly, then some data may not be stored • If the data is broken, software need to detect it and correct it n Filesystem is still useful for traditional software n Filesystem gives many solution for the above considerations • Format compatibility of filesystem, data correction, region management, , authority check, etc... Copyright 2018 FUJITSU LIMITED
  • 7. Interfaces of NVDIMM n Linux provides some interfaces for application n Storage Access • Application can access NVDIMM via traditional I/O • This mode is for old application n Filesystem DAX(*) and Device DAX • Application can read / write NVDIMM area directly if it calls mmap() • Need modification or making new software n PMDK is provided • libraries and tools for software which uses DAX Copyright 2018 FUJITSU LIMITED DAX enabled filesystem NVDIMM user layer kernel layer Pmem (or block window) driver device dax driver /dev/dax New App with Device DAX BTT driver Traditional filesystem Traditional App New App with Filesystem DAX PMDK management command sysfs ACPI driver ACPI (*) DAX= Direct Access Mode
  • 8. Region and Namespace(1/2) n To manage huge amount of NVDIMM, “Region” and “namespace” is defined n Region • Region is created by hardware and firmware feature • Each NVDIMM modules are placed on a System Board(Mother Board) • You can create regions on NVDIMM modules • Interleave is available Copyright 2018 FUJITSU LIMITED region 1 region 2 region 0 (This is Interleaved region) System Board NVDIMM module1 NVDIMM module2 NVDIMM module 3
  • 9. Region and Namespace (2/2) nNamespace • You can create namespaces on each region by ndctl command • Namespace concept is likely LUN of SCSI • You can make plural namespaces on a region, if you wish • You can specify storage, filesystem-dax, or device dax at this time • NVDIMM driver names each NVDIMM module “nmemX” • /dev/pmemX or /dev/daxY is named for each namespace • If Storage mode or Filesystem DAX, you can make partition and filesystem on the /dev/pmemXXX Copyright 2018 FUJITSU LIMITED region 1 region 2 System Board NVDIMM module1 (nmem1) NVDIMM module2 (nmem2) NVDIMM module 3 (nmem3) namespace0.0 /dev/pmem0 namespace1.0 /dev/pmem1s namespace1.1 /dev/pmem1.1 namespac2.0 /dev/dax2 (region 0)
  • 10. n ndctl n Many features • Create / Manage namespaces • Set access mode (storage / Filesystem DAX/ Device DAX) • Show status of NVDIMM modules, regions, namespaces • Support difference among NVDIMM vendors (NVDIMM-N or 3D-xpoint...) • etc.. n It use ACPI via sysfs and acpi driver for NVDIMM (nfit.ko) n We add some feature in it for RAS DAX enabled filesystem NVDIMM user layer kernel layer Pmem (or block window) driver device dax driver /dev/dax New App with Device DAX BTT driver Traditional filesystem Traditional App New App with Filesystem DAX PMDK ndctl sysfs ACPI driver ACPI Management command Copyright 2018 FUJITSU LIMITED NVDIMMACPI
  • 11. ACPI spec for NVDIMM n ACPI v6.0 or later supports specification for NVDIMM n NFIT table is defined for NVDIMM • Many sub tables are specified for NVDIMM basic information • (Very huge table) n In DSDT • Some specification of NVDIMM is defined • “NVDIMM root device” • defined for whole system • Its _DSM method has some features like Address Range Scrubbing • Notification is used for NFIT update or Memory error detection • “NVDIMM device” • defined for each NVDIMM modules • Its _DSM method is used for manage namespace or Health (SMART) information, etc. • If health status is changed, firmware notifies a event to the “NVDIMM device” in DSDT • There is a little bit difference among vendors (Intel) http://pmem.io/documents/NVDIMM_DSM_Interface-V1.7.pdf (HPE) https://github.com/HewlettPackard/hpe-nvm Copyright 2018 FUJITSU LIMITED
  • 12. n Issues and what I solved Replacement of NVDIMM Copyright 2018 FUJITSU LIMITED
  • 13. NVDIMM may be broken n We need to consider when NVDIMM becomes broken n The life of NVDIMM is limited • Write endurance of NVDIMM is / will be not infinite • NVDIMM module has / may have some feature • wear leveling • spare blocks instead of broken block • If spare block is consumed completely, the NVDIMM modules can not recover • NVDIMM-N has a battery in the module • It will be also deteriorated n Bit error may also occur on NVDIMM • “Scrubbing” feature may rescue it from fatal error if possible • If it can not rescue, the block becomes broken Copyright 2018 FUJITSU LIMITED We need to have a way to replace broken NVDIMM
  • 14. Question n Please imagine a difference between HDD/SSD and NVDIMM n How to replace its device? Copyright 2018 FUJITSU LIMITED
  • 15. Answer n There is a difference between SSD/HDD and NVDIMM replacement like the followings Copyright 2018 FUJITSU LIMITED • • • • • • • Replacing of NVDIMM is more difficult than SSD/HDD
  • 16. Replacing broken NVDIMM is complex (1/2) n In addition, the specification of NVDIMM (region and namespace) is cause of more difficulty of replacement n In this figure, when a block on region 0 is broken eternally, and user need to replace it, what information is necessary? Copyright 2018 FUJITSU LIMITEDCopyright 2018 FUJITSU LIMITED region 1 region 2 System Board NVDIMM module1 (nmem1) NVDIMM module2 (nmem2) NVDIMM module 3 (nmem3) namespace0.0 /dev/pmem0 namespace1.0 /dev/pmem1s namespace1.1 /dev/pmem1.1 namespac2.0 /dev/dax2broken (region 0)
  • 17. Replacing broken NVDIMM is complex(2/2) n To replace broken NVDIMM, the following information is necessary, at least Copyright 2018 FUJITSU LIMITED region 1 region 2 System Board NVDIMM module1 (nmem1) NVDIMM module2 (nmem2) NVDIMM module 3 (nmem3) namespac0.0 /dev/pmem0 namespace1.0 /dev/pmem1s namespace1.1 /dev/pmem1.1 namespac2.0 /dev/dax2broken 1. Need a way to distinguish which NVDIMM module is broken (especially, if the region is interleaved, only firmware know it) 2. To replace NVDIMM, Need information of relationship between “Physical” location of NVDIMM module (Ex. Silk print on mother board), and Device name of namespace /dev/pmemX) 3. Need to know other namespace on same DIMM modules for backup before replace (region 0)
  • 18. What I developed n I made 2 features of ndctl n To show important information for replace Copyright 2018 FUJITSU LIMITED region 1 region 2 region 0 System Board NVDIMM module1 (nmem1) NVDIMM module2 (nmem2) NVDIMM module 3 (nmem3) namespac0.0 /dev/pmem0 namespace1.0 /dev/pmem1s namespace1.1 /dev/pmem1.1 namespac2.0 /dev/dax2broken 1. Show NVDIMM info which has broken block in the region 2. Show the id of NVDIMM module to find physical location Fortunately, current ndctl can show all namespaces on a NVDIMM already, (I will mention how to do it later)
  • 19. n ndctl did not show broken NVDIMM module information n It just showed broken “block number” in the region n When the region is Interleaved, kernel/driver cannot distinguish NVDIMM • Only firmware know the relationship between physical address and NVDIMM module n Fortunately, “Translate SPA” is defined in _DSM method of “NVDIMM root Device” at ACPI ver.6.2 • Firmware translates from System Physical Address to “NFIT DIMM handle” and Physical address in the DIMM( DPA) n I made the following translation in the ndctl Copyright 2018 FUJITSU LIMITED To show broken NVDIMM block number of the region Its System Physical address Get NFIT DIMM handle NVDIMM device name (nmemX) ndctl firmware call _DSM Transrate SPA
  • 20. display bad block information (1/2) 8 B 12D G : B -4 G 8 - : # B - ( 0 . ( , )) 0." # - 6 :B -4 G 8 - # - H# G 8 - # - H 5# - " Copyright 2018 FUJITSU LIMITED region 0 is interleaved region by nmem1 and nmem0 command to show region info with dimm, health and bad block info (output is JSON format)
  • 21. display bad block information (2/2) , ,5 : # , ,5 00 : 5 1: # 366 6 6" Copyright 2018 FUJITSU LIMITED Current ndctl displays which DIMM has badblock in this region In this example, nmem0 has broken block badblock info in the region (the previous ndctl showed only this information)
  • 22. To show NVDIMM device location (1/2) n What can shows the physical location of NVDIMM modules? n /dev/nmemX is named for NVDIMM module by NVDIMM driver • It is also used by ndctl • However, its id (X) is not related with device location • The driver decided it by the order which it found n I thought the handle of “NVDIMM Device” in ACPI DSDT may be good • Handle is created with the followings • node controller id, socket id, memory controller id, etc... • However, these id may be dynamic due to hardware/firmware configuration • Especially, our some servers can change # of nodes by dividing the box with system configuration • So, this may not be good to find physical location too Copyright 2018 FUJITSU LIMITED
  • 23. To show NVDIMM device location (2/2) nThe remaining option is SMBIOS handle •You can confirm physical location of a device with “dmidecode” command • In addition, each device has SMBIOS handle, and shown in the command •SMBIOS handle is also “NVDIMM Region mapping structure” table in NFIT •So, if ndctl command can show it, you can find the location with the handle •I made a patch that ndctl show it as phys_id Copyright 2018 FUJITSU LIMITED
  • 24. Example of SMBIOS Handle of ndctl # ndctl list -Du [ { "dev":"nmem1", "id":"XXXX-XX-XXXX-XXXXXXXX", "handle":"0x120", "phys_id":"0x1c" }, { "dev":"nmem0", "id":"XXXX-XX-XXXX-XXXXXXXX", "handle":"0x20", "phys_id":"0x10", "flag_failed_flush":true, "flag_smart_event":true } ] Copyright 2018 FUJITSU LIMITED display DIMM infomation with human readable format (phys_id becomes Hexadecimal) phys_id is SMBIOS handle of these DIMM 0x10 is broken DIMM’s handle in this example (The nmem0 included broken block in previous result)
  • 25. dmidecode can show the location of DIMM # demidecode : Handle 0x0010, DMI type 17, 40 bytes Memory Device Array Handle: 0x0004 : : Locator: DIMM-Location-example-Slot-A : Type Detail: Non-Volatile Registered (Buffered) Copyright 2018 FUJITSU LIMITED SMBIOS handle of DIMM Locator: shows the place of the DIMM
  • 26. How to distinguish all namespaces on the DIMM n ndctl list command has many filtering feature n by region, namespace, bus, and dimm n Then list namespace with filtering dimm by ndctl, you can find all the namespaces Copyright 2018 FUJITSU LIMITED 721 4 "- "2 . 23 , 7063 9013 # 6 23 , 31 : 4 3 , , 23 , 7063 9013 # 6 23 , 31 : 4 3 , , command to show all namespaces on the nmem0 This dimm includes 2 namespaces You need to back up them before replacing nmem0
  • 27. n ndctl monitor NVDIMM Monitoring Daemon Copyright 2018 FUJITSU LIMITED
  • 28. Who am I? nQI Fuli nSoftware Engineer at Fujitsu Ltd. nPhD Student at the University of Tokyo nWorking on R&D of PM nEmail: qi.fuli@fujitsu.com Copyright 2018 FUJITSU LIMITED
  • 29. Why the status of NVDIMM must be monitored? nNVDIMM has a life span nBlocks of NVDIMM will be broken due to endurance problem nNVDIMM modules consume spare block nWhen spare decreases to 0, it cannot rescue broken block nNVDIMM has no feature, such as mirroring to save its data nBackup and replacement are needed before no spares nUsers should know the best time to do the backup and replacement Copyright 2018 FUJITSU LIMITED Status of NVDIMM needs to be monitored and be notified before NVIMM becomes broken
  • 30. Ndctl monitor nI created a daemon to monitor NVDIMM nmonitor health status nnotify critical status https://pmem.io/ndctl/ndctl-monitor.html NVDIMM will be broken !!! Spare block ratio is low !!! 5%
  • 31. ( 1 /( ) 1 1 / nSMART events come from NVDIMM can be monitored Copyright 2018 FUJITSU LIMITED SMART event trigger effect spares- remaining spare block value goes below the spare threshold limit NVDIMM becomes broken; data will be lost; etc. health-status normal health status of NVDIMM changes unclean- shutdown last shutdown to be recorded at the next boot saving data target failed on NVDIMM; etc. media- temperature temperature value of nvdimm goes above threshold limit server is getting too hot; may need remediation; a specific fan fails; etc., controller- temperature controller temperature value goes above threshold limit
  • 32. ( 2/( ) / nActions after monitoring nLog the output notification •Log destination • syslog • arbitrary file set in the configuration file or set by command option •Output format • JSON (can be analyzed by other log collectors, such as fluentd nKick other application to be implemented •When the monitor detects a SMART health event, the application will move the data in real time. Copyright 2018 FUJITSU LIMITED
  • 33. / ( ) nFilter of ndctl monitor nobjects to monitor can be filtered • all NVDIMMs are monitored by default • objects can be filtered by namespace, region, bus, and dimm • backup the data of applications which are running on a certain namespace ⇒ filter by namespace • allowing to monitor for any media error on the bus ⇒ filter by bus nSMART health events to monitor can be filtered Copyright 2018 FUJITSU LIMITED
  • 34. ACPI supports for NVDIMM n NVDIMM has SMART Health Info like SSD/HDD n SMART Health Info can be retrieved via a _DSM method n SMART Health Info includes spare block value and spare threshold limit n ACPI 6.1 adds “NFIT Health Event Notification” n a notification will be sent when the spare block value goes below the spare threshold limit n a poll(2) event will be triggered when the notification is received Copyright 2018 FUJITSU LIMITED NFIT Health Event Notification triggers a poll(2) event on /nmemX/nfit/flags spare block value below threshold ACPI (Firmware) NVDIMM (Hardware) Kernel (OS)
  • 35. Copyright 2018 FUJITSU LIMITED triggers a poll(2) event on /nmemX/nfit/flags output notification filter health event(s) to monitor retrieve NVDIMM SMART status Kernel (OS) ndctl monitor standard output, syslog, /var/log/ndctl/monitor.log etc. monitor poll(2) event(s) on /nmemX/nfit/flags filter object(s) to monitor retrieve NVDIMM SMART status
  • 36. "timestamp":"1537323709.432459532", "pid":28189, "event":{ "dimm-spares-remaining":true }, "dimm":{ "dev":"nmem1", "health":{ "health_state":"non-critical", "temperature_celsius":23, "controller_temperature_celsius":25, "spares_percentage":4, "temperature_threshold":40, "controller_temperature_threshold":30, … "spares_threshold":5, "shutdown_state":"clean“ } } Copyright 2018 FUJITSU LIMITED current spare block value spare threshold limit SMART Health event
  • 37. nLinux distribution support systemd nSystemd service ⇒ systemctl start ndctl-monitor.service nLinux distribution does not support systemd nndctl monitor --daemon nUsers could run monitor as a one shot command nndctl monitor Copyright 2018 FUJITSU LIMITED
  • 38. Summary and Future work nSummary nBasis of NVDIMM nReplacement •Replacement of NVDIMM is complex •Adding some information in ndctl list for replacing NVDIMM nMonitoring •Monitoring SMART status of NVDIMM is important •Ndctl monitor daemon nFuture work nKicks other applications by daemon nplural daemon execution via systemctl Copyright 2018 FUJITSU LIMITED
  • 40. Emulation of NVDIMM n You can try NVDIMM interface easily n custom e820 • To try interface for filesystem or DAX • it can used with kernel boot option (memmap=XXG!YYG) • XX Gbyte RAM is used for NVDIMM interfaces n nfit_test.ko • Try to test management command (ndctl) • It works instead of the firmware for NVDIMM Copyright 2018 FUJITSU LIMITED
  • 41. Types of NVDIMM n JEDEC defines some types of DIMM module Copyright 2018 FUJITSU LIMITED NVDIMM-N • DIMM module includes DRAM and NVMEM • Amount of RAM = Amount of NVMEM • The module includes battely • Usually, RAM is used • Backup to NVMEM at Power down NVDIMM-F • Only NVMEM is used • access like storage NVDIMM-P • DIMM module includes DRAM and NVMEM • Amount of RAM is less than NVMEM • RAM is used as cache DRAM NVMEM battery DRAM NVMEMNVMEM