RAID Server “Write Hole” Phenomenon

A power outage during a write to a RAID array can result in the “write hole” phenomenon. All RAID arrays that store redundant data, including RAID 1, RAID 5 and RAID 6, are subject to this problem, in which it is impossible to determine on which disk the data blocks or parity information were not written.

The biggest issue is that the problem is undetectable when it occurs, so it goes unnoticed and can cause problems at a later time. The phenomenon is quite rare, but it can lead to serious problems, especially if a later failure requires data recovery.

Data Not Committed

In the event of a power failure, it is possible that not all pending data is written to all the disks in a RAID array. Most modern file systems incorporate journaling, which means that a power failure is not usually a problem, as interrupted writes are still recorded in the journal. A RAID array, however, performs multiple read/write tasks in parallel, which may lead to data not being written to all disks due to timing issues.

If a data block belonging to a file was not written, journaling may well correct the issue when the file system is next mounted. A failure to write the parity for a stripe, however, could go undetected until the parity data is required, for example to rebuild a failed disk.
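The effect of a stale parity block can be illustrated with a minimal sketch, assuming a simplified RAID 5 style stripe of two data blocks and one XOR parity block. The block contents and the helper name xor_blocks are hypothetical and chosen only for illustration; a real array works on much larger blocks.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks byte by byte (RAID 5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# A consistent stripe: two data blocks plus their parity.
d0_old, d1 = b"AAAA", b"BBBB"
parity = xor_blocks(d0_old, d1)
stripe_was_consistent = xor_blocks(d0_old, d1) == parity

# Power fails mid-update: the new d0 reaches its disk,
# but the matching parity update never does.
d0_new = b"CCCC"
stale_parity = parity

# Much later the disk holding d1 fails, and d1 is rebuilt
# from the surviving data block and the stale parity.
rebuilt_d1 = xor_blocks(d0_new, stale_parity)
rebuild_is_correct = rebuilt_d1 == d1

print("stripe consistent before the outage:", stripe_was_consistent)
print("d1 rebuilt correctly after the outage:", rebuild_is_correct)
```

In this sketch d1 was never being written when the power failed, yet it is the block that comes back wrong after the rebuild, which is why the damage can stay hidden until the parity is actually needed.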

Resynchronising the Array

A RAID 1 mirrored pair, in which the same data is written to two disks, illustrates the problem. When a discrepancy between the two disks is detected following a power failure, it is almost impossible to know which disk holds the correct version of the data. The same is true for a RAID level that calculates parity when the parity data does not match the data blocks stored in a stripe.

It is often advised that the array should be resynchronised regularly. However, running a resynchronisation process could result in incorrect data being committed as part of the RAID, which could lead to corruption of either file system data structures or file contents. It is therefore debatable whether regular resynchronisation as part of RAID maintenance is a good thing.
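The risk can be sketched for a mirrored pair as follows. This is an illustration under a stated assumption, not the behaviour of any particular controller: the hypothetical resync_mirror function simply trusts the first disk and copies it over the second, and the record contents are made up.

```python
from typing import Tuple

def resync_mirror(primary: bytes, secondary: bytes) -> Tuple[bytes, bytes]:
    """Hypothetical policy: make the pair identical by trusting the primary."""
    return primary, primary

OLD, NEW = b"old record", b"new record"

# Case 1: power failed after the write reached the primary only.
case1 = resync_mirror(primary=NEW, secondary=OLD)

# Case 2: power failed after the write reached the secondary only.
case2 = resync_mirror(primary=OLD, secondary=NEW)

# Both pairs are now identical, so a later comparison finds nothing wrong...
both_consistent = case1[0] == case1[1] and case2[0] == case2[1]

# ...but in case 2 the newer data has been silently overwritten.
new_data_survived = (NEW in case1, NEW in case2)

print("mirrors consistent after resync:", both_consistent)
print("new data survived (case 1, case 2):", new_data_survived)
```

The resynchronisation sees only a mismatch in both cases. It has no information about which copy is newer, so a fixed rule such as this one is right in one case and wrong in the other, while leaving the array looking healthy either way.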

UPS and Data Recovery

Installing an uninterruptible power supply (UPS) on a server running a RAID is the best way to avoid the “write hole” phenomenon. It allows the server to perform a controlled shutdown in which all pending data is committed to the RAID, thereby avoiding file system corruption.

A “write hole” is almost impossible to detect, and once one is present it is almost impossible to determine which disks hold the correct data. Manual intervention may make it possible to resolve some of these issues, but in most cases the correct data cannot be identified, so it’s important to reduce the risk of suffering “write hole” damage in the first place.

