No matter how diligent you are about conducting regular data backups, all that hard work comes to nothing if corruption or backup data loss keeps you from performing a successful restore.
How big of a problem is it? Well, the figures are all over the place. Google “tape restoration failure rate” and the first few hits give you estimates ranging from a 10% to a 71% failure rate. Whatever the actual figure is, it clearly is too high.
Disks are also prone to failure. Although Fibre Channel drive manufacturers quote a bit error rate of 1 in 10^15 (1 in 10^14 for most SATA drives), each drive holds a lot of bits and a data center holds a lot of drives. Since a single 1TB SATA drive holds 8×10^12 bits, you would expect a bit error roughly once every 12 times you read the entire drive. In practice, bit error rate is dwarfed by actual drive failures. The disks themselves are also not the only source of bit errors; every other point along the data transport path can introduce them. A 2008 study of the issue by researchers at the University of Wisconsin-Madison, the University of Toronto and Network Appliance looked at 1.53 million disk drives in production storage systems and found “more than 400,000 instances of checksum mismatches over the 41-month period.”
A UW/NetApp study from a year earlier found that after 24 months of use, 5% to 20% of nearline disks had at least one sector error. Enterprise disks had a significantly lower, but still too high, error rate: “1.46% of enterprise class disks develop at least one latent sector error within twelve months of their ship date.”
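The "once every 12 reads" figure quoted for a 1TB SATA drive follows from simple arithmetic. The sketch below reproduces it, assuming a decimal terabyte and treating the vendor bit error rate as a flat per-bit probability; the function name is illustrative, and the 1TB capacity for the Fibre Channel case is an assumption made only for comparison.

```python
def full_reads_per_bit_error(capacity_bytes: float, bit_error_rate: float) -> float:
    """Average number of complete drive reads per expected bit error.

    bit_error_rate is errors per bit read, e.g. 1e-14 for "1 in 10^14".
    """
    bits_per_full_read = capacity_bytes * 8
    expected_errors_per_read = bits_per_full_read * bit_error_rate
    return 1 / expected_errors_per_read


ONE_TB = 1e12  # decimal terabyte, as drive vendors count it (assumption)

sata_reads = full_reads_per_bit_error(ONE_TB, 1e-14)
fc_reads = full_reads_per_bit_error(ONE_TB, 1e-15)  # assumed 1TB for comparison

print(f"SATA (1 in 10^14): about {sata_reads:.1f} full reads per error")
print(f"Fibre Channel (1 in 10^15): about {fc_reads:.0f} full reads per error")
```

The SATA case works out to roughly 12.5 full reads per expected error, which is where the rounded figure of 12 comes from.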
Now, in some cases, such as a Word document, a single bit error may not be catastrophic, but it can wreak havoc for certain databases or compressed data streams. Given that both tape and disk storage will suffer from data corruption over time, here are some steps to ensure that data can be restored when needed.
1. Use Strong RAID Encoding
RAID-5 uses an n+1 architecture, meaning it is designed to protect against the failure or corruption of one of the disks in the set. Switching to RAID-6, which uses an n+2 architecture, guards against two simultaneous failures and is considerably more robust. RAID-6 is also better for another reason that is often overlooked: with RAID-5, if the blocks in a RAID set don’t agree (i.e., a parity mismatch), the system has no way of telling which value is “correct”; the parity could be wrong, or one of the data disks could be wrong. With RAID-6, if there is a single error, the two independent parity blocks can be checked against the data to determine which “one of these things is not like the other,” and the data can be corrected without resorting to a restore from backup.
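To make the "which one is wrong" argument concrete, here is a minimal sketch of how two independent parities can locate a single corrupted block. It assumes the common P+Q construction over GF(2^8) with polynomial 0x11d and generator 2, and it works on one byte per disk for brevity. The function names are illustrative, and this is not a description of any particular RAID controller.

```python
# GF(2^8) log/antilog tables (assumed polynomial 0x11d, generator 2).
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8)."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def parity(data):
    """Return (P, Q) for one byte per data disk: P is plain XOR, Q is weighted."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(EXP[i], d)
    return p, q


def locate_and_fix(data, p, q):
    """Assuming at most one bad block, return (bad_block_label, repaired_data)."""
    p_now, q_now = parity(data)
    sp, sq = p ^ p_now, q ^ q_now          # the two syndromes
    if sp == 0 and sq == 0:
        return "ok", list(data)
    if sq == 0:
        return "P", list(data)             # only P disagrees: P itself is bad
    if sp == 0:
        return "Q", list(data)             # only Q disagrees: Q itself is bad
    # A data error e on disk z gives sp = e and sq = g^z * e, so z = log(sq/sp).
    z = (LOG[sq] - LOG[sp]) % 255
    if z >= len(data):
        raise ValueError("syndromes are inconsistent with a single error")
    fixed = list(data)
    fixed[z] ^= sp
    return f"D{z}", fixed


good = [0x11, 0x22, 0x33, 0x44]
p, q = parity(good)
damaged = list(good)
damaged[2] ^= 0x5A                         # silently corrupt data disk 2
label, repaired = locate_and_fix(damaged, p, q)
print(label, repaired == good)
```

With only the P parity (the RAID-5 case), the same corruption would produce a nonzero `sp` but no way to tell which disk it came from; the second syndrome is what pins down the position.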
2. Back Up to Multiple Nodes
Although not specifically a “corruption” issue, this protects against failure of an entire backup device. Zetta’s RAIN-6 (Redundant Array of Independent Nodes) technology, for example, stripes the data across independent computers, protecting the data whether there is a network, power supply, memory, or disk failure. RAIN makes the backup service more available, so it is there when you need to back up or restore your data.
3. Make Backup Copies of All Tapes
While this adds to the cost, and retrieving and loading a second tape when the first fails is slow, it does increase reliability. Assuming the two tapes fail independently, the probability of both failing is the product of the probability of each tape failing.
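As a quick worked example of that product rule, the sketch below plugs in the two extremes of the failure-rate range quoted earlier (10% and 71%). It assumes the two copies fail independently, which will not hold if both tapes share a drive fault, a bad batch of media, or the same storage conditions.

```python
def both_copies_fail(p_first: float, p_second: float) -> float:
    """Probability that both tapes fail, assuming independent failures."""
    return p_first * p_second


for single_tape_rate in (0.10, 0.71):
    combined = both_copies_fail(single_tape_rate, single_tape_rate)
    print(f"single tape {single_tape_rate:.0%} -> both copies {combined:.2%}")
```

Even at the pessimistic end of the range, a second copy cuts the chance of an unrecoverable restore noticeably; at the optimistic end it drops to about 1%.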
4. Ensure Your Off-Site Schedule Meets Business Needs
If you are taking tapes off site once a week, you have on average a 3.5-day window of vulnerability (and in the worst case up to 7 days) during which the tapes have been written but are still on site, where they could be destroyed in a site disaster.
5. Verify All Tapes
Don’t assume that tapes will retain accuracy sitting in a vault. Verify each tape before archiving it, and pull tapes out on a regular basis to check for errors.
6. Implement CRC
A cyclic redundancy check (CRC) is a small fingerprint of data that can identify bit errors or corruption in a larger data set. A common CRC method is to extend the normal 512-byte disk sector to 520 bytes or more, and use that extra space for the CRC information. Sector-level CRC, however, is usually limited to high-end storage arrays using Fibre Channel and SAS disks, and is not something that can typically be added “after the fact” if the system wasn’t designed to implement sector-based CRCs from the ground up.
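The idea of a 520-byte protected sector can be sketched in a few lines. The layout below (a CRC-32 plus the sector number in an 8-byte trailer) is a hypothetical one chosen for illustration; real arrays define their own field layouts and CRC polynomials, and do this work in firmware or hardware.

```python
import struct
import zlib

SECTOR_SIZE = 512
# Hypothetical 8-byte trailer: CRC-32 of the payload, then the sector number.
TRAILER = struct.Struct(">II")


def protect(payload: bytes, sector_number: int) -> bytes:
    """Return a 520-byte block: 512 bytes of data plus the integrity trailer."""
    if len(payload) != SECTOR_SIZE:
        raise ValueError("payload must be exactly 512 bytes")
    return payload + TRAILER.pack(zlib.crc32(payload), sector_number)


def verify(block: bytes, sector_number: int) -> bool:
    """Check that the payload matches its CRC and sits at the expected sector."""
    payload, trailer = block[:SECTOR_SIZE], block[SECTOR_SIZE:]
    stored_crc, stored_sector = TRAILER.unpack(trailer)
    return stored_crc == zlib.crc32(payload) and stored_sector == sector_number


block = protect(bytes(range(256)) * 2, sector_number=42)
flipped = bytearray(block)
flipped[100] ^= 0x01                      # simulate a single flipped bit
print(len(block), verify(block, 42), verify(bytes(flipped), 42))
```

Storing the sector number alongside the CRC also catches a block that is intact but was written to, or read from, the wrong location.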
7. Use Cryptographic Hashes
Create a cryptographic hash such as SHA-1 or SHA-256 for any stored data to ensure that the recovered data is identical to what was originally stored. If there is any discrepancy, and you are using a redundant storage method, you can restore an uncorrupted version of the data. Being able to know and prove that a backup is exactly the same as when it was taken is critical for many industries and applications.
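A minimal sketch of that check, using SHA-256 from Python's standard `hashlib` module: record a digest when the backup is taken, then recompute it over the restored data and compare. The helper names are illustrative, and in-memory streams stand in for real backup files.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in chunks so large backups need not fit in memory."""
    digest = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
    return digest.hexdigest()


def restore_is_intact(restored_stream, recorded_digest: str) -> bool:
    """Compare the digest of restored data with the one recorded at backup time."""
    return sha256_of_stream(restored_stream) == recorded_digest


original = b"payroll records " * 1000
recorded = sha256_of_stream(io.BytesIO(original))   # stored with the backup

corrupted = bytearray(original)
corrupted[5000] ^= 0x01                             # one flipped bit

print(restore_is_intact(io.BytesIO(original), recorded))
print(restore_is_intact(io.BytesIO(bytes(corrupted)), recorded))
```

For a real backup, the stream would be a file opened in binary mode, and the recorded digest should be kept somewhere that does not share a failure mode with the backup itself.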
An alternative approach to preventing backup data corruption is to use online backup and disaster recovery, such as Zetta provides.