Zetta Scalabytes Blog

Jeff Whitehead

June 10, 2009

Calculating Mean Time To Data Loss (and probability of silent data corruption)

Jeff Whitehead, Vice President of Technical Operations, CTO, and Zetta co-founder, is responsible for delivering the Zetta Enterprise Cloud Storage Service. Jeff spent more than two years as CIO of Shutterfly, growing and managing over six petabytes of storage infrastructure. Twitter: @jwhitehead

IT professionals are well aware of the many challenges related to scaling storage: the capital required to house data and manage backups, data center space, power, and cooling. One area many IT professionals haven’t had time to look at, however, is how increasing data footprints translate into increased risk of data loss or data corruption. To put this in context, IDC recently reported that data volumes will increase by a “factor of almost five,” while “total IT budgets worldwide will only grow by a factor of 1.2 and IT staff by a factor of 1.1.” Under these constraints, with teams asked to do more with less, risk inevitably increases unless data risk management gets special attention.

I believe that many IT professionals and CIOs will be very surprised to see that while Data Loss (i.e., simultaneous drive failures) may not be very probable, Data Corruption (the data on disk is no longer what was originally written out by the application) is shockingly likely, and has caused outages in even some of the most technologically advanced high-end environments.

The objective of this blog is to introduce or reintroduce the concept of “Mean Time To Data Loss (MTTDL),” with which IT professionals, CIOs, and risk managers can build a probabilistic model for evaluating the reliability of their current environment and its probability of data loss, and compare that with how Zetta is advancing the state of the art for cost-effective data protection.

MTTDL is a tool, and to be effective one must understand its limitations. The inputs to the model are as follows:

  1. The number of hard drives (data set size/system performance)
  2. The reliability of each hard drive
  3. The probability of reading a given hard drive correctly without error (see prior blog about silent data corruption)
  4. The redundancy encoding of the system
  5. The rebuild rate

Mean Time to Data Loss is in many respects a best case scenario, because it ignores risks to data integrity such as fire, natural disaster, human error, and other common causes of storage failures. It also ignores autocorrelation, or drives failing at the same time due to similar workload, similar manufacturing batches, firmware issues, or the like. Despite these limitations, MTTDL is still one of the better tools for evaluating the data protection features of a storage system.
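
To make the five inputs concrete, here is a minimal Python sketch of the textbook MTTDL approximations for a single redundancy group. These are the standard formulas that assume independent drive failures at a constant rate; they are not taken from the spreadsheet, which may model things differently. The function names, the 8-drive group, and the 50 MB/s rebuild rate are illustrative assumptions.

```python
# Textbook MTTDL approximations for one redundancy group.
# Assumes independent drive failures at a constant rate (no autocorrelation).
# Function names and example values are hypothetical, not from the spreadsheet.

HOURS_PER_YEAR = 8760


def rebuild_hours(drive_bytes, rebuild_bytes_per_sec):
    """Hours needed to rebuild one failed drive at a sustained rebuild rate."""
    return drive_bytes / rebuild_bytes_per_sec / 3600


def mttdl_single_parity(n_drives, mttf_hours, mttr_hours):
    """Group survives one failure (RAID 5 style): data is lost if a second
    drive fails among the remaining n-1 while the first is rebuilding."""
    return mttf_hours ** 2 / (n_drives * (n_drives - 1) * mttr_hours)


def mttdl_double_parity(n_drives, mttf_hours, mttr_hours):
    """Group survives two failures (RAID 6 style): data is lost only if a
    third drive fails while two rebuilds are still in progress."""
    return mttf_hours ** 3 / (
        n_drives * (n_drives - 1) * (n_drives - 2) * mttr_hours ** 2
    )


# Hypothetical example: 8 x 1 TB drives, 1.2 million hour MTBF, 50 MB/s rebuild.
mttr = rebuild_hours(1e12, 50e6)
single = mttdl_single_parity(8, 1.2e6, mttr)
double = mttdl_double_parity(8, 1.2e6, mttr)
print(f"rebuild time:        {mttr:.1f} hours")
print(f"single parity MTTDL: {single / HOURS_PER_YEAR:,.0f} years")
print(f"double parity MTTDL: {double / HOURS_PER_YEAR:,.0f} years")
```

Note how the number of drives and the rebuild time sit in the denominator: more drives or slower rebuilds shorten MTTDL, while each extra level of redundancy multiplies it by roughly MTTF divided by the rebuild time.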

Download MTTDL calculator spreadsheet

The attached spreadsheet will allow you to model Mean Time To Data Loss, Probability of Data Loss over a fixed period of time, and Probability of Data Corruption over a fixed period of time. I define data corruption as any change in the data from its original form. For some applications, such as a text file, a single byte error may not be catastrophic. A single byte error in the middle of a relational database or compressed data stream, however, probably means the entire data set is invalid.
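
As a rough illustration of how an MTTDL figure turns into a probability over a fixed period, the sketch below assumes data loss events follow an exponential distribution with mean MTTDL. That is a common simplification and an assumption on my part, not necessarily what the spreadsheet does. The function names and the example MTTDL value are hypothetical.

```python
import math

HOURS_PER_YEAR = 8760


def prob_data_loss(mttdl_hours, period_hours):
    """P(at least one data loss event within the period), assuming
    exponentially distributed time to data loss with mean mttdl_hours."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-period_hours / mttdl_hours)


def prob_any_group_loses_data(p_single_group, n_groups):
    """P(at least one of n independent redundancy groups loses data)."""
    return 1.0 - (1.0 - p_single_group) ** n_groups


# Hypothetical example: one group with an MTTDL of 1 million hours, 5 years.
p_one = prob_data_loss(1e6, 5 * HOURS_PER_YEAR)
p_fleet = prob_any_group_loses_data(p_one, 100)
print(f"one group, 5 years:  {p_one:.2%}")
print(f"100 groups, 5 years: {p_fleet:.2%}")
```

The second function shows why scale matters: a risk of a few percent for one group becomes close to a certainty somewhere in a fleet of a hundred such groups.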

Now, selecting inputs. First, decide on the data set size you’d like to model, and the size of the hard drives. You don’t need to worry about the RAID configuration at this point.

Next, you need to pick the reliability of each hard drive. Interestingly enough, despite the wide variety of storage systems on the market, there are only a few drive manufacturers, and for the most part all storage systems use the same drives, whether from Seagate, Hitachi, Samsung, or Toshiba. The hard drive manufacturers specify their expected reliability in terms of “Mean Time Between Failures (MTBF)” or “Annual Failure Rate (AFR),” which are equivalent measures: AFR is approximately the number of hours in a year (8,760) divided by the MTBF. Seagate publishes a 1.2 million hour MTBF for its 1TB ES.2, which is equivalent to a 0.73% annual failure rate. Hard drive manufacturers are probably a bit optimistic, as several studies have found real-world AFRs to be more typically in the 3-4% range, and sometimes even above 10%.
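
The MTBF and AFR figures above can be converted into each other with a few lines of Python. This is a sketch assuming a constant failure rate and a drive powered on 8,760 hours a year; the function names are my own.

```python
import math

HOURS_PER_YEAR = 8760


def afr_linear(mtbf_hours):
    """Simple approximation: hours in a year divided by MTBF."""
    return HOURS_PER_YEAR / mtbf_hours


def afr_exponential(mtbf_hours):
    """P(a drive fails within one year) under a constant failure rate."""
    return -math.expm1(-HOURS_PER_YEAR / mtbf_hours)


def mtbf_from_afr(afr):
    """Inverse of afr_exponential: the MTBF implied by an observed AFR."""
    return -HOURS_PER_YEAR / math.log1p(-afr)


print(f"1.2M hour MTBF -> AFR {afr_linear(1.2e6):.2%} (linear)")
print(f"1.2M hour MTBF -> AFR {afr_exponential(1.2e6):.2%} (exponential)")
print(f"3% AFR  -> MTBF {mtbf_from_afr(0.03):,.0f} hours")
print(f"10% AFR -> MTBF {mtbf_from_afr(0.10):,.0f} hours")
```

Both forms give about 0.73% for the 1.2 million hour figure. Running the conversion in reverse shows how far a real-world AFR of 3% or more sits from the published MTBF, which is why the choice of this input matters so much to the model.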

The third input is the rate of silent corruption for the drives. A recent study over a 41-month period analyzed checksum errors on 1.53 million drives. During the 41-month period, silent data corruption was observed on 0.86% of 358,000 nearline SATA drives and 0.065% of 1.17 million enterprise-class Fibre Channel drives. Again, this real-world data is somewhat at variance with the manufacturer’s published Bit Error Rate of 1 in 10^15.
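
To get a feel for these numbers, the sketch below works through two back-of-the-envelope calculations: the expected number of bit errors when reading a full 1 TB drive at a bit error rate read here as 1 error per 10^15 bits, and the chance that at least one drive in a fleet is affected given the per-drive fractions from the study. It assumes errors are independent, and the fleet size of 100 drives is a hypothetical example. The function names are mine, not the spreadsheet's.

```python
import math


def expected_bit_errors(bytes_read, bit_error_rate):
    """Expected number of bit errors when reading bytes_read bytes."""
    return bytes_read * 8 * bit_error_rate


def prob_at_least_one_error(bytes_read, bit_error_rate):
    """P(one or more bit errors), Poisson approximation, independent errors."""
    return -math.expm1(-expected_bit_errors(bytes_read, bit_error_rate))


def prob_any_drive_affected(per_drive_fraction, n_drives):
    """P(at least one of n drives shows corruption over the study period),
    treating drives as independent."""
    return 1.0 - (1.0 - per_drive_fraction) ** n_drives


# One full read of a 1 TB drive at the published rate of 1 in 10^15 bits.
print(f"expected errors per full read: {expected_bit_errors(1e12, 1e-15):.3f}")
print(f"P(error) per full read:        {prob_at_least_one_error(1e12, 1e-15):.2%}")

# Hypothetical fleet of 100 drives, using the study's 41-month fractions.
print(f"nearline SATA, 100 drives:     {prob_any_drive_affected(0.0086, 100):.0%}")
print(f"enterprise FC, 100 drives:     {prob_any_drive_affected(0.00065, 100):.0%}")
```

Even a fraction below 1% per drive adds up quickly: with the nearline figure, a fleet of 100 drives is more likely than not to see at least one corruption event over a period like the one studied.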

I tried to make this spreadsheet simple to use: pick or enter values for the yellow cells in the B column, then read off the calculated probability of data loss and, in cell I86, the expected number of undetected bit errors (the expected rate of silent data corruption).
