Bit Rot: What It Is and How To Stop It From Destroying Your Data

In Backup & Archiving Workflow by Michael Grecco

Bit Rot: What it is and how to avoid it
Bit rot sounds like something organic that happens over time. But the truth is much more immediate and technical. Hard disks are marketed as supremely reliable and often quote mean times between failure (MTBF) in the hundreds of thousands of hours. But while drives keep on keeping on, it is still possible to lose data thanks to this phenomenon of bit rot.

What is Bit Rot, exactly?

Pull out a microscope and peer at the surface of a hard disk and you’ll see a bumpy landscape of exotic metals arrayed in reasonably neat patterns.

The metals need to be neat because a disk drive applies a very precise magnetic field to a very small region of the disk, setting its magnetic polarity to denote stored data.

Sometimes, those regions spontaneously lose or change their polarity, a phenomenon known as ‘flipping’. When a region on a disk flips, the data it contains is erased, corrupted or rendered unreadable. To denote the mysterious nature of this degradation, the industry has adopted the organic-sounding term ‘bit rot’ for the phenomenon.

Storage array vendors are aware of bit rot and build their products to identify flaws in disks before they place them in arrays, and then monitor disks in production to detect rot before it becomes a problem.

“EMC only purchases, and then sells, drives that have a low percentage of ‘manufacturing’ sector failures,” explains Clive Gold, Marketing Chief Technology Officer for EMC Australia New Zealand.

The company also scans drives to make sure bit rot is not destroying data.

“All data that is received by the front end is ‘tagged’ and this allows the backend to check the data that is stored on the disk to ensure it hasn’t changed as it has gone through the storage system,” Gold explains. “In fact, where an application like Oracle databases has a checksum, we use that to ensure end-to-end integrity, from application to the rust on the disk! These technologies do detection as well as correction.”
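The tagging idea Gold describes can be illustrated with a small sketch. This is not EMC's implementation; it assumes a CRC32 checksum as the "tag", and the function names `tag_block`, `verify_block` and `flip_bit` are hypothetical, chosen only for this example.

```python
import zlib

def tag_block(data: bytes) -> tuple[bytes, int]:
    """Return the data together with a CRC32 tag computed when it arrives."""
    return data, zlib.crc32(data)

def verify_block(data: bytes, tag: int) -> bool:
    """Recompute the CRC32 and compare it with the tag stored at write time."""
    return zlib.crc32(data) == tag

def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate bit rot by flipping a single bit in a copy of the data."""
    rotted = bytearray(data)
    rotted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(rotted)

block, tag = tag_block(b"customer record 0001")
rotted = flip_bit(block, 5)

print(verify_block(block, tag))   # True: data unchanged since it was tagged
print(verify_block(rotted, tag))  # False: the flipped bit is detected
```

A checksum like this only detects a change. Correcting it requires redundancy of some kind, such as parity or an error correcting code.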

Adrian De Luca, Hitachi Data Systems’ Director of Pre-Sales and Solutions for Australia and New Zealand, says his company also takes care to ensure that damaged drives don’t destroy data, through connectivity precautions as well as corruption checks.

“HDS ensures all physical disk drives are dual-ported into the backplane, controllers and cache to ensure there is no physical single point of failure as data comes in through the front end controllers and out to the physical disks,” he says. “We also support Oracle H.A.R.D (Hardware Assisted Resilient Data) to prevent corrupted data blocks generated in the database-to-storage system infrastructure from being written onto the disk storage.”

Does Bit Rot Occur in Solid State Drives (SSDs)?

The simple answer: yes. However, bit rot in flash SSDs is quite different from bit rot on hard disk drives.

As we learned, bit rot in HDDs occurs when the magnetic polarity of a bit spontaneously flips because of electromagnetic radiation in the surroundings. Flash SSD bit rot occurs when the state of a NAND cell changes because of electron leakage.

As the number of states within a cell increases, so does the potential for electron leakage. SLC has two states (0, 1); MLC has four states (00, 01, 10, 11); and TLC has eight states (000, 001, 010, 011, 100, 101, 110, 111). That means bit rot is most likely to occur with TLC NAND flash drives.
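The state counts above follow from the number of bits stored per cell. The sketch below shows the arithmetic and, as a deliberately simplified assumption, treats each state as getting an equal share of a fixed voltage window. Real cells do not divide the window evenly, so the margin figure is only a rough indication of why more states leave less room for leakage.

```python
def states_per_cell(bits_per_cell: int) -> int:
    """A cell storing n bits must distinguish 2**n charge levels."""
    return 2 ** bits_per_cell

def relative_margin(bits_per_cell: int) -> float:
    """Share of a fixed voltage window per state (simplified, equal split)."""
    return 1 / states_per_cell(bits_per_cell)

for name, bits in [("SLC", 1), ("MLC", 2), ("TLC", 3)]:
    print(f"{name}: {states_per_cell(bits)} states, "
          f"{relative_margin(bits):.3f} of the window per state")
```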

Manufacturers are handling the increased probability of bit rot through the extensive use of error correcting codes (ECC). Obviously, the ECC for TLC must be considerably stronger than the ECC for SLC or MLC. The 3D NAND TLC drive vendors know this and have incorporated much stronger ECC.
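To show what "error correcting" means in practice, here is a minimal sketch of the classic Hamming(7,4) code, which stores 4 data bits in 7 bits and can repair any single flipped bit. It is an illustration only: SSD controllers use far stronger codes (such as BCH or LDPC) over much larger blocks, and the function names here are hypothetical.

```python
def hamming74_encode(d: list[int]) -> list[int]:
    """Encode 4 data bits as 7 bits; parity bits sit at positions 1, 2 and 4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(codeword: list[int]) -> list[int]:
    """Correct at most one flipped bit, then return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_pos = s1 + 2 * s2 + 4 * s3   # 0 means no error was detected
    if error_pos:
        c[error_pos - 1] ^= 1          # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]]

data = [1, 0, 1, 1]
stored = hamming74_encode(data)
stored[4] ^= 1                          # simulate one bit rotting in storage
print(hamming74_decode(stored) == data)  # True: the error was corrected
```

Two flipped bits in the same 7-bit word would defeat this code, which is why denser flash needs stronger ECC.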

Determining how effective a 3D NAND TLC drive is at combating bit rot comes down to the unrecoverable bit error rate (UBER) as rated by the 3D NAND flash vendor. Keep in mind that 3D NAND TLC drives are best suited to read-optimized rather than write-optimized applications, which is most similar to the application fit for nearline or “fat” HDDs. The UBER for a SATA HDD is 10^-15, or one unrecoverable error per 10^15 bits read. The UBER for nearline SAS HDDs is 10^-16. The UBER ratings for 3D NAND TLC drives have not been released as of this writing; however, they are expected to be at least as good as those of SATA or SAS HDDs.
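To put those UBER figures in perspective, the following sketch estimates the chance of hitting at least one unrecoverable error while reading a given amount of data. It assumes errors are independent and occur exactly at the rated UBER, which is a simplification, and the 10 TB read size is a hypothetical example rather than a figure from any vendor.

```python
import math

def expected_unrecoverable_errors(bytes_read: float, uber: float) -> float:
    """Expected number of unrecoverable bit errors over the data read."""
    return bytes_read * 8 * uber

def prob_at_least_one_error(bytes_read: float, uber: float) -> float:
    """1 - (1 - uber)**bits, computed in a numerically stable way."""
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-uber))

TB = 10 ** 12  # decimal terabyte, as used on drive labels

for label, uber in [("SATA HDD", 1e-15), ("nearline SAS HDD", 1e-16)]:
    p = prob_at_least_one_error(10 * TB, uber)
    print(f"{label}: {p:.2%} chance of an unrecoverable error in a 10 TB read")
```

Under these assumptions, a full 10 TB read comes out at roughly a 7.7 percent chance of an unrecoverable error at 10^-15 and roughly 0.8 percent at 10^-16.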

How dangerous is Bit Rot?

While Bit Rot is something most storage vendors work to counter, NetApp has recently conducted studies that play down the risk it poses.

“While ‘bit rot’ has received a reasonable amount of attention recently, two NetApp sponsored studies show that bit rot is far less of a problem for storage array reliability than many other factors,” says John Martin, Principal Technologist for NetApp Australia New Zealand.

One of the papers Martin refers to, “A Highly Accurate Method for Assessing Reliability of Redundant Arrays of Inexpensive Disks (RAID)” by Jon G. Elerath and Michael Pecht, appeared in IEEE Transactions on Computers, vol. 58, no. 3, March 2009.

Martin summarizes the paper by saying that Bit Rot is a risk, as it “raises the specter, not just of a lost or corrupted file, but of the potential to completely lose an entire RAID group after the failure of a single drive due to the ‘Media Error on Data Reconstruct’ problem.”

But Martin adds that “The less catastrophic issue on an enterprise-class array is far less because the additional error detection and correction available through the use of RAID and block level checksums means the chances of bit rot causing the loss or corruption of a file is vanishingly remote.”

Drawing on Elerath and Pecht’s paper, Martin therefore offers four other phenomena as more likely sources of data loss, namely:

  • “Thermal asperities” – Instances of high heat for a short duration caused by head-disk contact. This is usually the result of heads hitting small “bumps” created by particles embedded in the media surface during the manufacturing process. The heat generated on a single contact may not be sufficient to thermally erase data but may be sufficient after many contacts;
  • Disk head issues – Disk heads are designed to push particles away, but contaminants can still become lodged between the head and disk. Hard particles used in the manufacture of an HDD can cause surface scratches and data erasure any time the disk is rotating;
  • Soft particle corruption – Other “soft” materials such as stainless steel can come from assembly tooling. Soft particles tend to smear across the surface of the media, rendering the data unreadable;
  • Corrosion – Although carefully controlled, corrosion can also cause data erasure and may be accelerated by the heat that thermal asperities generate.

Whatever the cause of lost data, storage administrators need a way to combat it, and NetApp’s Martin recommends ‘disk scrubs’, the practice of periodically reading every block on a disk and checking it against its checksum or parity so that problem sectors can be found and repaired. Another alternative is to “Use additional levels of RAID protection such as RAID-6 which allows for higher levels of resiliency and error correction in the event of hitting a latent block error when reconstructing a RAID set. NetApp uses both approaches as studies have shown that the risk of losing data through these kinds of events is thousands of times higher than predicted by most simple ‘MTBF’ failure models.”
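How a scrub combines with RAID-style redundancy can be sketched in a few lines. This toy example assumes a scrub that reads each block, verifies it against a stored CRC32, and rebuilds a bad block from single XOR parity, as in RAID-5. It is not NetApp's implementation, and `make_stripe`, `scrub` and `xor_blocks` are hypothetical names.

```python
import zlib
from functools import reduce

def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_stripe(data_blocks: list[bytes]) -> tuple[list[bytes], list[int]]:
    """Append an XOR parity block and record a CRC32 for every block."""
    blocks = list(data_blocks) + [xor_blocks(data_blocks)]
    return blocks, [zlib.crc32(b) for b in blocks]

def scrub(blocks: list[bytes], checksums: list[int]) -> list[int]:
    """Verify every block in place; rebuild one bad block from the rest.

    Returns the indices that were repaired.
    """
    bad = [i for i, b in enumerate(blocks) if zlib.crc32(b) != checksums[i]]
    if len(bad) > 1:
        raise ValueError("single parity cannot repair more than one bad block")
    for i in bad:
        others = [b for j, b in enumerate(blocks) if j != i]
        blocks[i] = xor_blocks(others)
    return bad

blocks, sums = make_stripe([b"AAAA", b"BBBB", b"CCCC"])
blocks[1] = b"BBBC"                 # simulate a latent error on one disk
repaired = scrub(blocks, sums)
print(repaired, blocks[1])          # [1] b'BBBB'
```

The `ValueError` branch is the case that extra parity such as RAID-6 is meant to cover: with only one parity block, a second bad block found during a rebuild cannot be reconstructed.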

Keith Busson, Quantum’s Country Manager for Australia and New Zealand, has more prosaic advice for ameliorating Bit Rot.

“Quantum recommends that IT organizations stage practice data recoveries on a regular basis,” he says. “It is important to demonstrate the ability of fast, comprehensive data recovery before it is required in an emergency situation. Such testing is a test not only of hardware and software but of people and processes.”
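One way to make a practice recovery measurable is to record a hash of every file before backup and compare the restored copy against that record. The sketch below is a minimal illustration using SHA-256; it is not a Quantum tool, and the function names and manifest layout are assumptions made for this example.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path relative to root to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def verify_restore(manifest: dict[str, str], restored_root: Path) -> list[str]:
    """Return a list of problems found in the restored tree (empty if none)."""
    problems = []
    for rel, digest in manifest.items():
        target = restored_root / rel
        if not target.is_file():
            problems.append(f"missing: {rel}")
        elif sha256_of(target) != digest:
            problems.append(f"corrupt: {rel}")
    return problems
```

In a drill, `build_manifest` would run against the live data before the backup, and `verify_restore` against the directory the backup was restored into. An empty result means every file came back intact.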