Troubleshooting Disk Lifetime

An important part of keeping a firewall running reliably is to ensure its storage is in good condition.

If the disk in a firewall fails, the firewall may continue to run in a reduced capacity until the system restarts. Exactly which parts fail depends on the services and packages in use and on the roles the firewall performs. For example, packet filtering may continue to function indefinitely, but the firewall may not be able to update its rules. Certain types of disks, such as SSD and eMMC disks, may fail into a read-only state where disk writes fail or are discarded, but data can still be read.

Checking Disk Health & Lifetime

Contrary to popular rumors, the vast majority of modern flash storage found in SSD and eMMC drives is very resilient. Early SSDs were more prone to failure or only supported a comparatively small number of writes, but the technology has improved vastly over the years.

Traditional spinning platter hard drives have their own problems, such as the failure of mechanical moving parts, which is harder to predict.

Over its lifetime, any disk will typically encounter failing spots and remap them to spares. This happens on traditional HDDs as well as SSDs. It’s normal to see a small number of remapped spots over time, but if the pool of spares is almost consumed, the disk should be replaced.

Depending on the hardware, it may be possible to query the disk for information about its health. Not all platforms support each method, and disk OEMs track data in different ways. When in doubt, contact the manufacturer of the disk for details.

SSD

Many SSDs can report their health through S.M.A.R.T. data, as described in S.M.A.R.T. Hard Disk Status, and can also perform disk tests. Disks connected using SATA, mSATA, M.2, and other similar interfaces can typically be queried in the same way as traditional HDDs, but may have SSD-specific health information in their S.M.A.R.T. data. For example, some SSDs include a life time estimate or a media wear indication level.

NVMe disks require special handling, but can also be queried through S.M.A.R.T., though the data they report may be in a different format from typical SSDs.
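As a rough sketch of how the SSD-specific fields might be picked out of a long S.M.A.R.T. report, the helper below filters a report for lines that mention wear, life, or spare capacity. The function name wear_lines and its keyword list are assumptions made for illustration; vendors label these attributes differently, so the list may need adjusting for a given disk. It is written for POSIX sh rather than the default interactive shell.

```shell
#!/bin/sh
# Hypothetical helper: keep only the lines of a S.M.A.R.T. report (read
# from stdin) that mention wear, remaining life, or spare capacity.
# The keyword list is an assumption; attribute names vary by vendor.
wear_lines() {
    grep -Ei 'wear|life|percentage used|spare'
}
```

One possible use is to pipe a full report into it, for example smartctl -a /dev/ada0 | wear_lines, where /dev/ada0 is only a placeholder for the actual disk device.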

eMMC

eMMC disks are unique in that they do not support S.M.A.R.T., but hardware which supports the correct revisions of the eMMC specification is capable of reporting health in its own way.

Install MMC Utilities

The first step is to install the mmc-utils package from an SSH or console shell prompt:

# pkg install -y mmc-utils; rehash

Note

This package is currently only available on pfSense® Plus software and does not have a GUI component. It must be run from an SSH or console shell prompt.

Check MMC Health Status

To check the health of the first MMC disk, run the following command:

# mmc extcsd read /dev/mmcsd0rpmb

If the disk supports reporting its health, it should return output. For additional MMC disks, increase the device number in the command.
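For systems with more than one MMC disk, the device number can be stepped in a loop. The following is a minimal sketch in POSIX sh; the function names mmc_device and check_all_mmc and the upper bound of four disks are assumptions made for illustration, and the device path simply follows the /dev/mmcsd0rpmb pattern from the command above.

```shell
#!/bin/sh
# Build the device path for MMC disk number N (0 is the first disk),
# following the /dev/mmcsd0rpmb naming used in the command above.
mmc_device() {
    printf '/dev/mmcsd%srpmb\n' "$1"
}

# Read the extended CSD of each MMC disk that exists.
# The limit of four disks is an arbitrary, hypothetical upper bound.
check_all_mmc() {
    n=0
    while [ "$n" -lt 4 ]; do
        dev=$(mmc_device "$n")
        if [ -e "$dev" ]; then
            echo "== $dev =="
            mmc extcsd read "$dev"
        fi
        n=$((n + 1))
    done
}
```

Calling check_all_mmc prints nothing on a system that has no matching devices.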

Interpreting MMC Health Data

The primary fields to look at in MMC health are the life time estimations and Pre-EOL estimation.

Note

Not all disks support all of these fields.

# mmc extcsd read /dev/mmcsd0rpmb | egrep 'LIFE|EOL'
eMMC Life Time Estimation A [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_A]: 0x01
eMMC Life Time Estimation B [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_B]: 0x02
eMMC Pre EOL information [EXT_CSD_PRE_EOL_INFO]: 0x01
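If the values need to be consumed by a script, the field names and hexadecimal values can be separated from the descriptive text. The sketch below assumes the "[NAME]: 0xNN" layout shown in the sample output above; the function name mmc_health_fields is hypothetical, and other disks or mmc-utils versions may format their output differently.

```shell
#!/bin/sh
# Read `mmc extcsd read` output on stdin and print "NAME VALUE" for each
# life time or Pre-EOL line, e.g. "EXT_CSD_PRE_EOL_INFO 0x01".
# Assumes the "[NAME]: 0xNN" layout shown in the sample output.
mmc_health_fields() {
    grep -E 'LIFE|EOL' |
        sed -En 's/^.*\[(EXT_CSD_[A-Z_]+)\]: *(0x[0-9A-Fa-f]+).*$/\1 \2/p'
}
```

For example, mmc extcsd read /dev/mmcsd0rpmb | mmc_health_fields would reduce the sample output to three short name and value pairs.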

Type A

An estimate for life time of SLC (and pseudo-SLC) eraseblocks in steps of 10%.

Type B

An estimate for life time of MLC eraseblocks in steps of 10%.

Type A and B Values

The values of the A and B life time estimations are in 10% increments based on the hexadecimal value returned by the disk. This is only an estimate and the value can exceed 100%.

Possible values include:

Value   Meaning
0x00    Not defined
0x01    The disk has used 0%-10% of its estimated life time
0x02    The disk has used 10%-20% of its estimated life time
0x03    The disk has used 20%-30% of its estimated life time
0x04    The disk has used 30%-40% of its estimated life time
0x05    The disk has used 40%-50% of its estimated life time
0x06    The disk has used 50%-60% of its estimated life time
0x07    The disk has used 60%-70% of its estimated life time
0x08    The disk has used 70%-80% of its estimated life time
0x09    The disk has used 80%-90% of its estimated life time
0x0a    The disk has used 90%-100% of its estimated life time
0x0b    The disk has used 100%-110% of its estimated life time

Warning

This is only an estimation. Though useful as a general guideline, it does not necessarily indicate that a disk will fail at a given time.
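The mapping from a Type A or Type B value to a percentage range is simple arithmetic: a value n from 0x01 through 0x0b corresponds to the range from (n - 1) x 10% to n x 10%. A minimal sketch in POSIX sh follows; the function name life_time_range is hypothetical, and values outside the listed range are reported as unlisted rather than interpreted.

```shell
#!/bin/sh
# Translate a life time estimation value (Type A or Type B) into the
# percentage range it represents: (n - 1) * 10% to n * 10%.
life_time_range() {
    # printf converts hexadecimal input such as 0x0a to decimal.
    n=$(printf '%d' "$1" 2>/dev/null) || return 1
    if [ "$n" -eq 0 ]; then
        echo "not defined"
    elif [ "$n" -ge 1 ] && [ "$n" -le 11 ]; then
        echo "$(( (n - 1) * 10 ))%-$(( n * 10 ))%"
    else
        echo "not listed"
    fi
}
```

For the sample output shown earlier, life_time_range 0x01 gives 0%-10% for Type A and life_time_range 0x02 gives 10%-20% for Type B.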

Pre-EOL

Pre-EOL information is an overall status for the reserved blocks on the disk.

Possible values are:

Value   Severity   Meaning
0x00    -          Not defined
0x01    Normal     The disk has consumed less than 80% of its reserved blocks
0x02    Warning    The disk has consumed more than 80% of its reserved blocks
0x03    Urgent     The disk has consumed more than 90% of its reserved blocks
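A monitoring script could turn the Pre-EOL value into its severity with a simple lookup. The sketch below mirrors the values listed above; the function name pre_eol_status is hypothetical, and the nonzero return code for Warning, Urgent, and unlisted values is a design choice made for this example so that a caller can react to it.

```shell
#!/bin/sh
# Map a Pre-EOL value to its severity and meaning.
# Returns 0 for Normal or Not defined, 1 for anything else, so a caller
# can test the result (the return codes are a choice for this sketch).
pre_eol_status() {
    case "$1" in
        0x00) echo "Not defined" ;;
        0x01) echo "Normal: less than 80% of reserved blocks consumed" ;;
        0x02) echo "Warning: more than 80% of reserved blocks consumed"; return 1 ;;
        0x03) echo "Urgent: more than 90% of reserved blocks consumed"; return 1 ;;
        *)    echo "Not listed"; return 1 ;;
    esac
}
```

For the sample output shown earlier, pre_eol_status 0x01 reports Normal.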

HDD

Most HDDs can report their health through S.M.A.R.T. data, as described in S.M.A.R.T. Hard Disk Status, and can also perform disk tests. S.M.A.R.T. may need to be enabled both in the system BIOS and on the disk, though on modern systems both tend to be enabled by default.

USB Disks

Disks connected through USB controllers are unlikely to support health queries, no matter what type of disk they are.

Taking Action

If a disk is showing signs that it might be failing, the safest action is to replace the disk. If a device has a built-in disk that cannot be replaced, such as an eMMC disk, it may be capable of taking an additional drive through another interface such as NVMe, M.2, mSATA, or SATA. When in doubt, contact the device OEM for guidance.

If the disk is showing some wear but still has a lot of life left, consider making changes to reduce disk writes to potentially extend its remaining lifetime.