
Mean Time to Data Loss (MTTDL): Calculation, Meaning & More

by Jeff Whitehead

What does Data Loss mean?

IT professionals are well aware of the many challenges of scaling storage: the capital required to house data, managing backups, data center space, power, and cooling. One area many haven’t had time to examine, however, is how growing data footprints translate into increased risk of data loss or data corruption. To put this in context, IDC recently reported that data volumes will increase by a “factor of almost five,” while “total IT budgets worldwide will only grow by a factor of 1.2 and IT staff by a factor of 1.1.” Under these constraints, with teams asked to do more with less, risk inevitably increases unless special attention is paid to data risk management.

I believe many IT professionals and CIOs will be surprised to see that while Data Loss (i.e., simultaneous drive failures) may not be very probable, Data Corruption (the data on disk is no longer what the application originally wrote) is shockingly likely, and has caused outages in even some of the most technologically advanced, high-end environments.

MTTDL: Mean Time to Data Loss

The objective of this blog is to introduce, or reintroduce, the concept of “Mean Time To Data Loss (MTTDL),” with which IT professionals, CIOs, and risk managers can build a probabilistic model for evaluating the reliability and probability of data loss in their current environment, and also compare and contrast it with how Zetta is advancing the state of the art in cost-effective data protection.

MTTDL is a tool, and to use it effectively one must understand its limitations.


The Inputs of the MTTDL Model

  • The number of hard drives (driven by data set size and system performance)
  • The reliability of each hard drive
  • The probability of reading a given hard drive correctly without error (Find out more about silent data corruption)
  • The redundancy encoding of the system
  • The rebuild rate

Mean Time to Data Loss is in many respects a best-case scenario, because it ignores risks to data integrity such as fire, natural disaster, human error, and other common causes of storage failures. It also ignores correlated failures, i.e., drives failing at the same time due to similar workloads, similar manufacturing batches, firmware issues, and the like. Despite these limitations, MTTDL is still one of the better tools for evaluating the data protection features of a storage system.
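
As a rough illustration of how these inputs combine, here is a minimal Python sketch of the textbook MTTDL formulas for single- and dual-parity redundancy (RAID 5 / RAID 6 style); the drive count, MTBF, and rebuild-time values below are placeholders of mine, not figures from this article or the spreadsheet.

```python
# Textbook MTTDL formulas for single- and dual-parity arrays.
# All input values here are illustrative placeholders.

def mttdl_single_parity(n_drives, mtbf_hours, rebuild_hours):
    """Data is lost if a second drive fails while the first is rebuilding."""
    return mtbf_hours ** 2 / (n_drives * (n_drives - 1) * rebuild_hours)

def mttdl_dual_parity(n_drives, mtbf_hours, rebuild_hours):
    """Data is lost only after a third failure during two overlapping rebuilds."""
    return mtbf_hours ** 3 / (
        n_drives * (n_drives - 1) * (n_drives - 2) * rebuild_hours ** 2
    )

dataset_tb = 48          # data set size to model
drive_tb = 1             # capacity of each drive
n = dataset_tb // drive_tb
mtbf = 1_200_000         # vendor-quoted MTBF, hours
rebuild = 24             # time to rebuild one failed drive, hours

print(f"Single-parity MTTDL: {mttdl_single_parity(n, mtbf, rebuild):,.0f} hours")
print(f"Dual-parity MTTDL:   {mttdl_dual_parity(n, mtbf, rebuild):,.0f} hours")
```

Note how the dual-parity figure grows by roughly another factor of MTBF divided by rebuild time, which is why rebuild rate appears as an input in its own right.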

Calculating Mean Time To Data Loss

This spreadsheet will allow you to model Mean Time To Data Loss, Probability of Data Loss over a fixed period of time, and Probability of Data Corruption over a fixed period of time. I define data corruption as any change in the data from its original form. For some applications, such as a text file, a single byte error may not be catastrophic. A single byte error in the middle of a relational database or compressed data stream, however, probably means the entire data set is invalid.
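
For readers following along without the spreadsheet, converting an MTTDL figure into a probability of data loss over a fixed window is straightforward if loss events are modeled as a Poisson process; that modeling assumption is mine, and may differ from how the spreadsheet does it.

```python
import math

def p_loss(mttdl_hours, window_years):
    """Probability of at least one data-loss event in the window,
    assuming losses arrive as a Poisson process with rate 1/MTTDL."""
    window_hours = window_years * 8760
    return 1 - math.exp(-window_hours / mttdl_hours)

# Example: a system with a 1,000,000-hour MTTDL over a 5-year horizon
print(f"{p_loss(1_000_000, 5):.2%}")   # ~4.29%
```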

Now, on to selecting inputs. First, decide on the data set size you’d like to model and the size of the hard drives. You don’t need to worry about the RAID configuration at this point.

Next, you need to pick the reliability of each hard drive. Interestingly, despite the wide variety of storage systems on the market, there are only a few drive manufacturers, and for the most part all storage systems use the same drives, whether from Seagate, Hitachi, Samsung, or Toshiba. The hard drive manufacturers helpfully specify their expected reliability as “Mean Time Between Failure (MTBF)” or “Annual Failure Rate (AFR),” which are equivalent measures. Seagate, for example, publishes a 1.2 million hour MTBF for its 1TB ES.2, which is equivalent to a 0.73% annual failure rate. Hard drive manufacturers are probably a bit optimistic, as several studies have found real-world AFRs more typically in the 3-4% range, and sometimes even above 10%.
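
The MTBF-to-AFR conversion quoted above can be checked in a couple of lines. The exponential-failure model is my assumption, but at these rates the simple ratio of hours per year to MTBF gives essentially the same answer:

```python
import math

mtbf_hours = 1_200_000          # vendor-quoted MTBF for the 1TB ES.2
hours_per_year = 8760

afr_exact = 1 - math.exp(-hours_per_year / mtbf_hours)   # exponential model
afr_approx = hours_per_year / mtbf_hours                  # simple ratio

print(f"{afr_exact:.2%}  vs  {afr_approx:.2%}")           # both ~0.73%
```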

The third input is the rate of silent corruption for the drives. A disk drive error study over a 41-month period analyzed checksum errors on 1.53 million drives. During that period, silent data corruption was observed on 0.86% of 358,000 nearline SATA drives and 0.065% of 1.17 million enterprise-class Fibre Channel drives.
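
To get a feel for what those rates imply at scale, here is a rough sketch; annualizing the 41-month figure and treating drives as independent are simplifying assumptions of this sketch, not claims made by the study.

```python
# What the study's nearline SATA corruption rate implies for an array.
# Annualizing the 41-month figure and assuming independence across drives
# are simplifications for illustration only.

study_rate = 0.0086        # 0.86% of nearline SATA drives over 41 months
study_months = 41

annual_rate = study_rate * 12 / study_months   # ~0.25% per drive per year

def p_any_corruption(n_drives, years):
    """Probability that at least one drive silently corrupts data."""
    return 1 - (1 - annual_rate) ** (n_drives * years)

print(f"{p_any_corruption(48, 3):.1%}")   # 48 drives over 3 years -> ~30%
```

Even modest per-drive rates compound quickly across a large array, which is why the corruption probability tends to dwarf the probability of outright data loss.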

I tried to make this spreadsheet simple to use: pick or enter values for the yellow cells in column B, and see the calculated probability of data loss and the expected number of undetected bit errors (the expected rate of silent data corruption) in cell I86.

Learn more about preventing data loss with disaster recovery planning and how a disaster recovery solution can help.

Jeff Whitehead

Jeff is Zetta's co-founder, CTO, and VP of Technical Operations and Engineering.