I am Jeff Whitehead, Zetta.net’s CTO, and I run all technical operations for Zetta.net. I’m privileged to work with a team that collectively has decades of experience managing and administering large-scale distributed systems. One of the main components of any distributed system or enterprise is its operational storage. Years ago, we managed petabytes of data when a petabyte was considered unfathomably huge. We have experienced firsthand the challenges of massive data growth, and with our team of software developers we are bringing to market a solution for large-scale, multi-tenant data management.
The challenges associated with scaling data storage are no longer confined to a few niche, data-intensive applications. Driven by the continually expanding information footprints of both existing and new applications, they now confront IT professionals in every industry.
At a high level, the scaling problem typically manifests as sub-linear scaling, data corruption, or data loss, because as you increase the population of hard drives, the probability of a drive failure (or other type of error) increases. Enterprises can’t avoid increasing the number of hard drives, because capacity increases are not keeping pace with data growth[i]. In addition, while disk drives are getting bigger, they are not getting much faster.
Our primary design consideration for the Zetta.net system is data integrity. Data integrity means that we don’t corrupt or lose data, even at tremendous scale and over long periods of time, regardless of the corruption vector.
An Overview of RAID
One of the most common tools in the IT professional’s arsenal for managing large data sets is RAID (“Redundant Array of Independent Disks,” or Drives; I will use “disk” and “hard drive” interchangeably). RAID is the encoding of data across a group of multiple hard drives managed as a single set, such that you don’t lose data when a single disk fails. RAID can also give better performance than a single hard drive by splitting the data workload across the constituent disks. Different RAID implementations have different degrees of redundancy and different performance characteristics, based on how the data is laid out, but they share a common fundamental building block: the hard drive itself.
Most consumer and virtually all enterprise RAID systems properly handle the case of a single drive failure, allowing the faulty drive to be removed[ii] and replaced with a new drive, onto which the data from the failed drive is reconstructed from the healthy disks.
Figure 1 shows a sample RAID-5 layout. RAID-5 is an “N+1” architecture; this example is 4+1, meaning that four disks’ worth of data are encoded across five disks, such that as long as any four disks are available, the data remains available and recoverable.
Figure 1: RAID-5, 4+1 configuration protects against data loss through the failure of any single disk.
In the Figure 1 example, if disk 5 were to fail, logical blocks 8, 12, 16, and 20 would still be available by reading the rest of each stripe and using the parity calculation to recompute the missing values. A read error returned by the hard drive to the RAID controller is corrected in the same way. It is also significant that reading the remaining stripe consumes IO resources on the healthy disks during a rebuild.
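The parity arithmetic behind that reconstruction is just XOR. Here is a minimal, illustrative sketch (real controllers operate on whole sectors and rotate parity across disks, but the math is the same):

```python
from functools import reduce

def parity(blocks):
    """XOR a list of equal-length byte blocks into one parity block."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]  # 4 data disks
p = parity(data)                             # the 5th (parity) disk

# Disk 3 fails: rebuild its block by XOR-ing the survivors with parity.
survivors = [data[0], data[1], data[3], p]
rebuilt = parity(survivors)
assert rebuilt == data[2]  # the lost block is recovered
```

Because XOR is its own inverse, any single missing block falls out of the same calculation that produced the parity in the first place.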
RAID-5 protects against the failure of a single hard drive, so that you can replace the drive and rebuild the data set; data is lost, however, if a second failure occurs before the rebuild completes. Additionally, RAID-5 has one major non-obvious limitation: it has no defense against “bit rot,” or “silent data corruption.” Silent data corruption is the case where the computer issues a read command and the hard drive returns data different from what was originally written (i.e., the data is corrupt), yet no errors are raised by the disk, either prior to or during the read. The lack of an error condition is why the corruption is called “silent.”
These corruption events can be transient (meaning you can re-read the block and it will be OK) or can indicate permanent data corruption. Hard drive manufacturers express these corruption events as a Bit Error Rate[iii], and helpfully publish specifications, typically 1 in 10^14 operations for consumer-level drives and 1 in 10^15 operations for enterprise drives.
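To get a feel for what these rates mean at today’s capacities, consider a back-of-the-envelope calculation (the 12 TB drive size here is just an assumed example, not a figure from the article):

```python
import math

# P(at least one bit error) when reading an entire drive end to end,
# assuming the published consumer-drive rate of 1 error per 10**14 bits.
ber = 1e-14                  # errors per bit read
bits = 12e12 * 8             # a full 12 TB drive, expressed in bits

# 1 - (1 - ber)**bits, computed stably for tiny probabilities:
p_err = -math.expm1(bits * math.log1p(-ber))
print(f"{p_err:.0%}")        # roughly 62%
```

At these capacities, a bit error during a full-drive read (exactly what a rebuild is) stops being a rare event, which is why the degraded-state vulnerability discussed below matters so much.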
It’s worth noting that RAID-5, in the absence of other data integrity techniques, is vulnerable to bit errors in the non-degraded state (i.e., all drives online and working), and especially vulnerable in the degraded state, where a bit error corrupts the data rebuilt onto the replacement drive and can lie dormant and undiscovered until the file is accessed[iv].
Many RAID vendors have encouraged customers to move to newer RAID-6 technology, an “N+2” implementation. RAID-6 can tolerate two simultaneous device failures, which is important both because proliferating drive counts make simultaneous drive failures more probable, and because lengthening rebuild times (hard drive performance is not keeping pace with capacity increases) widen the window of vulnerability.
The most important driver for RAID-6 adoption, however, has been largely ignored. In the normal operating state, RAID-6 has the opportunity to provide a single level of protection against the bit error rate problem. In a nutshell, there is the data hard drive, and two parity hard drives, resulting in three “votes.” In the event that one of the drives returns bad data, the other two can “vote,” with the majority accepted as the correct data, allowing the system to both detect and correct the error. Since the probability of a bit error is small, the probability of two of them is very, very small (the product of the two probabilities).
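The detect-and-locate property of the two parities can be sketched with the classic P/Q scheme over GF(2^8). This is an illustrative toy, not any vendor’s implementation: the P syndrome yields the error value, the Q syndrome yields that value scaled by g^j, and dividing the two identifies the corrupted drive j.

```python
# Build log/antilog tables for GF(2^8) with the usual 0x11d polynomial.
EXP, LOG = [0] * 512, [0] * 256
x = 1
for i in range(255):
    EXP[i], LOG[x] = x, i
    x <<= 1
    if x & 0x100:
        x ^= 0x11d
for i in range(255, 512):
    EXP[i] = EXP[i - 255]

def gmul(a, b):
    """Multiply in GF(2^8) via the log tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

def pq(data):
    """P = XOR of blocks; Q = XOR of g^i * block_i, per byte."""
    p, q = bytearray(len(data[0])), bytearray(len(data[0]))
    for i, blk in enumerate(data):
        for k, byte in enumerate(blk):
            p[k] ^= byte
            q[k] ^= gmul(EXP[i], byte)
    return bytes(p), bytes(q)

data = [bytes([d] * 4) for d in (10, 20, 30, 40)]  # 4 data drives
P, Q = pq(data)

# Drive 2 silently corrupts a byte; recompute syndromes to locate it.
bad = [bytearray(b) for b in data]
bad[2][0] ^= 0x5A
p2, q2 = pq([bytes(b) for b in bad])
sp = P[0] ^ p2[0]              # error value e
sq = Q[0] ^ q2[0]              # g^j * e
j = (LOG[sq] - LOG[sp]) % 255  # solve for the drive index j
assert j == 2 and sp == 0x5A   # found the culprit and the flipped bits
```

With only one parity there is a tie (footnote [iv]); the second, independent parity is what breaks it.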
There are multiple RAID-6 implementations, but Figure 2 is one example.
Figure 2: RAID-6 with 2 parities (in this example, 8+2) allows data to be reconstructed even in the event of 2 simultaneous failures.
One downside of this implementation is that in order for the system to have the data corruption protection, it must always operate on an entire stripe at a time, for both reads and writes. In a typical 8+2 scenario, a stripe of 10 disks has the same small random read performance as a single disk, although the streaming (large-file) performance is typically that of all 10 disks.
Addressing the bit error rate issues of traditional RAID
A technique used by enterprise storage vendors to address the bit error rate weakness is a calculation called a cyclic redundancy check, or CRC. Simply put, a CRC is a small “fingerprint” of data that can identify bit errors or corruptions in a larger set of data, to a degree of probability that depends on the “strength” of the CRC. By using the CRC, the storage system can compare a set of data to its CRC before allowing it to vote. The typical implementation in enterprise storage systems is to format each sector with 520 bytes or more, instead of 512, and use the additional sector space for the CRC and other information. One downside is that generally only enterprise-level Fibre Channel and SAS drives support this low-level formatting option, and the information is generally proprietary and available only inside the storage array, limiting its usefulness.
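A toy version of the per-block CRC check, using Python’s standard zlib.crc32 (the 520-byte sector mechanics are abstracted away):

```python
import zlib

def write_block(data: bytes):
    """Store the block together with its CRC fingerprint."""
    return data, zlib.crc32(data)

def read_block(data: bytes, stored_crc: int) -> bytes:
    """Refuse to return (or let 'vote') a block whose CRC has drifted."""
    if zlib.crc32(data) != stored_crc:
        raise IOError("silent corruption detected")
    return data

blk, crc = write_block(b"important payload")
assert read_block(blk, crc) == blk           # a clean read passes

rotted = bytes([blk[0] ^ 0x01]) + blk[1:]    # one silently flipped bit
try:
    read_block(rotted, crc)
    caught = False
except IOError:
    caught = True  # the corruption is caught before the data is used
assert caught
```

The key point is that the check happens on every read, so a silent error is promoted to a loud one before anyone relies on the data.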
With data integrity as our primary service objective, we surveyed the existing RAID implementations and found a few opportunities for improvement, mainly to support massive scale over extended timeframes, and also to support our other objectives of availability and performance.
We call our implementation “RAIN-6.” We call it RAIN instead of RAID because we redundantly encode across nodes (i.e., computers), not just disks. In this scheme we are tolerant not just of hard drive failure or bit error, but also of complete node failure from any cause, whether network, power supply, memory, or hard drive. This node striping not only improves data integrity, but also improves system availability and the overall scalability of the system.
Integral to the RAIN-6 encoding is our CRC implementation. Zetta.net employs a similar but generalized approach to the enterprise 520 byte encoding, which works with all hard drive hardware and is applicable to network transport as well as all other bit error sources. We verify CRC on every read, so we are guaranteed to a very high level of probability to never pass back data that is different from the data that was originally written.
Verifiably Comprehensive Data Integrity Solution
This set of features, RAIN-6 network encoding and strong CRCs, provides a solid foundation for a data protection strategy. In a future article I will write about our snapshots and replication (including geo-diverse replication), which combine with RAIN-6 to provide a truly comprehensive data management solution.
As an IT professional, however, I know that best practice is to “trust but verify,” meaning I want the ability to verify that my data is available and free from corruption, independently of the vendor’s opinion. I’m proud that Zetta.net offers that degree of transparency through our “write receipts” API, which provides customer-accessible cryptographic hashes of files stored on the Zetta.net system. Cryptographic hashes (such as SHA-1 and SHA-256) are like CRCs, with the added property that it is mathematically very difficult to “make up” a file that matches the fingerprint.
By recording these “write receipts,” Zetta.net’s customers can independently recalculate the hashes and verify that the data returned from Zetta.net is free from corruption before relying on it.
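The verification step itself needs nothing more than a standard hash library. The receipt-recording flow below is a hypothetical sketch of the idea, not the actual write-receipts API:

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Cryptographic fingerprint of a file's contents."""
    return hashlib.sha256(data).hexdigest()

# At upload time, record the hash of what was sent (the "receipt").
uploaded = b"quarterly-results.csv contents"
receipt = sha256_hex(uploaded)

# Later, before relying on the downloaded bytes, recompute and compare.
downloaded = uploaded  # bytes read back from the storage service
assert sha256_hex(downloaded) == receipt  # independently verified
```

Because the customer holds the receipt, the verification does not depend on trusting the storage vendor’s own integrity checks.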
The features I’ve outlined here (RAIN-6, CRCs, continuous scrubbing, and file-level hashes) combine with other Zetta.net features to provide better data availability, integrity, and protection than is possible with even the most expensive on-premises enterprise storage systems. At Zetta.net, we’ve learned how to build, manage, and deliver storage services, and we’re now in the business of doing that for our customers.
[i]I’m using hard drives as an example here, but this also holds true for solid state disk (flash) and tape, which is still in wide use.
[ii] Many enterprises return these drives to the manufacturer for replacement under warranty. However, despite the failure, in almost all cases some or all of the data remains on the hard drive, creating potential data loss or exposure. Zetta.net encrypts all data at rest with strong, standards-based encryption to ensure that even if a physical drive is taken, the data is unusable.
[iii] Some causes of a nonzero bit error rate: a) misdirected writes, where the hard drive writes the data correctly but in the wrong place (now two sectors are corrupt!); b) torn writes, where the hard drive doesn’t finish a write (for example, it lost power while writing); c) non-writes, where the hard drive reports that it wrote data but didn’t.
[iv]Some implementations verify the parity data against the main data, looking for these issues. This only works in conjunction with other data integrity techniques, because if the parity doesn’t match the stripe data, absent other information, the system is unable to determine which of the hard drives returned the correct data and which returned the corrupt data. To simplify, there are two voters, the parity disk, and the data disk. If they “vote” differently, there is a tie, and the system is unable to determine the correct value.