Zetta Scalabytes Blog

In this blog, hear from Zetta’s founders and leaders about cloud computing, storage and data management best practices and Zetta Enterprise Cloud Storage technology.

Posts Tagged ‘availability’

Chris Schin

January 19, 2010

Hosting Primary, Unstructured Enterprise Data in the Cloud – Part 6: Continuous Availability

Chris Schin, VP Products, is responsible for coordinating all Zetta product-related initiatives including product strategy, direction, and marketing, as well as business model and go-to-market process definition. Prior to joining Zetta, Chris was acting GM and Senior Director for Symantec Protection Network, Symantec's Software as a Service platform.

For those of you just joining here, I’m using this blog series to document what enterprise IT professionals have told us about the baseline requirements that would need to be met by a cloud storage service before they would consider storing their enterprise primary data in the cloud. This list outlines the high-level requirements and hyperlinks to previous posts:

 

 

This post lists a few questions you should ask your cloud storage vendor about their architecture for delivering availability before considering placing a primary copy of your data in their cloud:

 

  • “Does your solution have redundant network links from different top-tier networking providers?” It must; networks go down every day, no matter how expensive they are or what brand is behind them. Redundancy in networks is a baseline requirement for placing primary data in the cloud.

     

  • “Does your solution reside in a data center that has redundant power and cooling?” It must; if the environs of the systems holding your data are not adequately protected, failure of the solutions is inevitable, resulting in availability outages.

     

  • “Does your solution offer triple-layer redundancy at the storage controller tier at no additional cost?” It must; the controller tier holds the brains of the storage solution, and cannot afford downtime or corruption — this is not only key to system availability, but extends to data integrity as well.

     

  • “Does your solution leverage an advanced RAID algorithm to ensure that the data is available?” It must; holding single copies of data in multiple locations is not nearly as available and protected as holding RAID-6-protected copies of data in multiple locations.

 

Before you even consider putting a primary copy of your data into a cloud storage provider’s infrastructure, you should certainly ask these questions and receive detailed, satisfactory answers. If you are using a cloud solution today and don’t know the answers to these questions (or even whom to ask these questions), then you should be concerned about the availability and protection of your data.

 

Zetta’s CTO, Jeff Whitehead, is fond of using a nuclear submarine analogy when discussing system availability, as in “imagine you are on a nuclear submarine right now — would you be satisfied knowing that submarine was highly available, or would you demand that it be continuously available?” An enterprise solution must be built to the stringent demands of an enterprise IT professional, and when it comes to data, an enterprise IT professional demands continuous availability.

Chris Schin

December 16, 2009

Hosting Primary, Unstructured Enterprise Data in the Cloud – Part 4: Comprehensive data integrity/protection


Hi – for those of you just tuning in, this is part four of a 9-part blog series in which I am describing the Zetta solution, and how it is built from the ground up to host primary, unstructured enterprise data in the Cloud. Here again is the list of requirements; the first two have already been addressed, along with an introduction to the series:

 

 

In this post I’m going to discuss the position that in order for a cloud storage service to be a viable option to host primary enterprise data sets, it must provide comprehensive data integrity/protection.

 

At Zetta, our most important design consideration was data integrity (or data “protection” – whatever term is used, this is the idea that we won’t allow data to be lost or corrupted), since ultimately a data storage solution is worthless if it allows the loss or corruption of data.

 

A dirty little secret in the storage community is that data corruption happens all the time – though the relative rate of corruption seems low on its face, the increasing scale of data stored guarantees that corruption events are always occurring. For more on this topic at a deeper technical level, along with a calculator to help you gauge your own data integrity risk, please see JW’s post on Calculating Mean Time To Data Loss (and probability of silent data corruption).

 

So any solution for storing primary enterprise data MUST assume data corruption will happen, and must be designed to adapt to that reality and repair corrupted data, thereby guaranteeing data integrity.

 

Here are some of the unique ways the Zetta solution has been designed to automatically detect data corruption and repair it; taken collectively, these tools truly give Zetta an unparalleled data integrity profile.

 

Zetta Comprehensive Data Integrity/Protection Requirements

  • Write Receipts — Zetta creates a strong SHA-1 hash of every file that enters a Zetta customer virtual volume, and we do two things with that hash (one of which is optionally available at the customer’s request, one mandatory though transparent to the customer).
    • First, at a customer’s request, we can place these hashes on the customer’s volume, allowing a customer to ensure that what we have stored at Zetta is what was sent by the customer.
    • Second, we store each hash in perpetuity. This allows us to compare a read file with the one that was originally received; if there is any difference, we repair the file before completing the read, guaranteeing that what is read is identical to what was written.

       

  • RAIN6 N+3 — Zetta employs a best-in-class RAID algorithm. It builds on RAID 6 (which uses Reed-Solomon encoding) and adds a third parity node: RAID 6 traditionally has two parity nodes, while the Zetta solution has three (JW lays this out in great detail in his post on Data Integrity in the Cloud). We also refer to it as “RAIN” because we stripe data not just across independent disks, but across independent nodes (i.e., storage servers). This level of redundant protection is not available even in traditional storage hardware from top vendors, and it preserves the integrity (and availability) of data in the event of up to three independent computer failures.

     

  • Proactive Error Correction — In addition to creating a SHA hash of every complete file that enters the Zetta storage cloud, the Zetta solution also creates a SHA hash of every “chunk” of data encoded and striped across the disks in our lower-level storage servers. Then, using any spare system processing cycles, a background process on the system traverses all hard drives and compares those stored hashes to the current chunk on disk, proactively detecting and repairing any data corruptions on disk using our triple redundancy RAIN6 encoding.

     

  • Snapshots — Zetta cloud storage comes with a full-featured file system (a distributed, clustered, highly parallelized file system that we’ll be discussing in a future post). As with most file systems, the Zetta file system provides full snapshotting capabilities – either scheduled or ad hoc snapshots. And Zetta snapshots are free from the capacity and performance limitations of single devices and fixed-size clusters. This provides a customer-controlled protection mechanism – once a snapshot is created, the file system is preserved in that state until the snapshot is deleted, allowing a user to go back and restore files and directories from the “.snapshot” directory, as with any on-premises filer.

     

  • Geo-Replication — All customer data stored at Zetta is replicated to another data center. In 2010 we expect to begin to offer full asynchronous replication to our customers who want a fully-mountable volume resident in another Zetta data center, either for performance or for data integrity.
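
To make the write-receipt and proactive-checking ideas above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Zetta’s implementation: the `ReceiptStore` class and its method names are hypothetical, data lives in memory, hashes are kept per object rather than per chunk, and a detected corruption is only reported, where the real system is described as repairing it from its redundant encoding.

```python
import hashlib
from typing import Dict, List


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of data as a hex string."""
    return hashlib.sha1(data).hexdigest()


class ReceiptStore:
    """Hypothetical toy store that keeps one hash (the "write receipt") per object."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytearray] = {}
        self._receipts: Dict[str, str] = {}

    def write(self, name: str, data: bytes) -> str:
        """Store data and return the receipt computed at write time."""
        self._blobs[name] = bytearray(data)
        self._receipts[name] = sha1_hex(data)
        return self._receipts[name]

    def read(self, name: str) -> bytes:
        """Return data only if it still matches the receipt stored at write time."""
        data = bytes(self._blobs[name])
        if sha1_hex(data) != self._receipts[name]:
            # Assumption: a production system would repair from redundancy here.
            raise IOError("corruption detected in %r" % name)
        return data

    def scrub(self) -> List[str]:
        """Background-style pass: list every object whose hash no longer matches."""
        return [
            name
            for name, blob in self._blobs.items()
            if sha1_hex(bytes(blob)) != self._receipts[name]
        ]


if __name__ == "__main__":
    store = ReceiptStore()
    receipt = store.write("report.doc", b"quarterly numbers")
    print("receipt:", receipt)
    store._blobs["report.doc"][0] ^= 0x01  # simulate a silent bit flip on disk
    print("objects failing scrub:", store.scrub())
```

A customer holding the receipt returned by `write` can recompute the hash on their own copy and compare, which is the point of placing the hashes on the customer’s volume.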

 

Again, the Zetta solution was designed with the core premise that preventing data loss was our primary charter, and these are some of the unique features we’ve put into the solution to live up to that charter.

 

Compare this with what is available today from the HTTP cloud storage vendors. I reiterate that this is not a knock on those solutions – they do an excellent job for their customer target, but their target is not the enterprise, and they don’t provide the requisite features to host primary enterprise data. These solutions do not provide write receipts, have no RAID implementations, lack proactive error correction, and offer no file systems with snapshots.

 

I’ll be back soon to discuss Zetta’s approach to data security & privacy.

Jeff Whitehead

June 10, 2009

Calculating Mean Time To Data Loss (and probability of silent data corruption)

Jeff Whitehead, Vice President of Technical Operations and CTO, and Zetta co-founder, is responsible for delivering the Zetta Enterprise Cloud Storage Service. Jeff spent more than two years as CIO of Shutterfly growing and managing over six petabytes of storage infrastructure. Twitter: @jwhitehead

IT professionals are well aware of many challenges related to scaling storage: the capital required to house data, managing backups, data center space, power and cooling. One area many IT professionals haven’t had time to look at, however, is how increasing data footprints translate into increased risk of data loss or data corruption. To put this in context, IDC recently reported that data volumes will increase by a “factor of almost five,” while “total IT budgets worldwide will only grow by a factor of 1.2 and IT staff by a factor of 1.1.” With IT being asked to do more with less under these constraints, risk inevitably increases unless data risk management gets special attention.

I believe that many IT professionals and CIOs will be very surprised to see that while Data Loss (i.e., simultaneous drive failures) may not be very probable, Data Corruption (the data on disk is no longer what was originally written out by the application) is shockingly likely, and has caused outages in even some of the most technologically advanced high-end environments.

The objective of this blog is to introduce or reintroduce the concept of “Mean Time To Data Loss (MTTDL),” with which IT professionals, CIOs, and risk managers can create a probabilistic model for evaluating the reliability and probability of data loss in their current environment, and also compare and contrast it with how Zetta is advancing the state of the art for cost-effective data protection.

MTTDL is a tool, and to be effective one must understand its limitations. The inputs to the model are as follows:

  1. The number of hard drives (data set size/system performance)
  2. The reliability of each hard drive
  3. The probability of reading a given hard drive correctly without error (see prior blog about silent data corruption)
  4. The redundancy encoding of the system
  5. The rebuild rate.

Mean Time to Data Loss is in many respects a best case scenario, because it ignores risks to data integrity such as fire, natural disaster, human error, and other common causes of storage failures. It also ignores autocorrelation, or drives failing at the same time due to similar workload, similar manufacturing batches, firmware issues, or the like. Despite these limitations, MTTDL is still one of the better tools for evaluating the data protection features of a storage system.
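
For readers who prefer code to spreadsheets, here is a rough Python sketch of how the inputs above combine. It uses the common textbook approximation for a single redundancy group, which assumes independent drive failures with a constant failure rate; it is not necessarily the formula used in the spreadsheet. The function names, the 12-drive group, and the 24-hour rebuild time are illustrative assumptions.

```python
import math

HOURS_PER_YEAR = 24 * 365


def mttdl_hours(n_drives: int, parity: int, mtbf_hours: float, rebuild_hours: float) -> float:
    """Approximate MTTDL for one group of n_drives that survives `parity` failures.

    Data is lost when parity + 1 drives are down at the same time:
        MTTDL ~= MTBF**(p + 1) / (N * (N - 1) * ... * (N - p) * rebuild**p)
    With p = 1 this is the usual RAID 5 estimate, with p = 2 the RAID 6 one.
    """
    if not 0 <= parity < n_drives:
        raise ValueError("parity must be in the range [0, n_drives)")
    drive_product = 1
    for i in range(parity + 1):
        drive_product *= n_drives - i
    return mtbf_hours ** (parity + 1) / (drive_product * rebuild_hours ** parity)


def prob_data_loss(mttdl: float, years: float) -> float:
    """Probability of at least one data-loss event within `years`."""
    return 1.0 - math.exp(-years * HOURS_PER_YEAR / mttdl)


if __name__ == "__main__":
    # Assumed example: 12 drives, 1.2 million hour MTBF, 24 hour rebuild.
    for parity in (1, 2, 3):
        hours = mttdl_hours(12, parity, 1.2e6, 24.0)
        print(parity, "parity drives:", hours / HOURS_PER_YEAR, "years MTTDL,",
              prob_data_loss(hours, 5.0), "chance of loss in 5 years")
```

Note how each extra parity drive multiplies MTTDL by roughly MTBF divided by the rebuild time, which is why rebuild rate appears in the input list. Plugging in a lower, real-world MTBF shrinks the result quickly.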

Download MTTDL calculator spreadsheet

The attached spreadsheet will allow you to model Mean Time To Data Loss, Probability of Data Loss over a fixed period of time, and Probability of Data Corruption over a fixed period of time. I define data corruption as any change in the data from its original form. For some applications, such as a text file, a single byte error may not be catastrophic. A single byte error in the middle of a relational database or compressed data stream, however, probably means the entire data set is invalid.

Now, selecting inputs. First, decide on the data set size you’d like to model, and the size of the hard drives. You don’t need to worry about the RAID configuration at this point.

Next, you need to pick the reliability of each hard drive. Interestingly enough, despite the wide variety of storage systems on the market, there are only a few drive manufacturers, and for the most part all storage systems use the same drives, whether from Seagate, Hitachi, Samsung, or Toshiba. The hard drive manufacturers specify their expected reliability in terms of “Mean Time Between Failure (MTBF)” or “Annual Failure Rate (AFR),” which are equivalent measures. Seagate helpfully publishes a 1.2 million hour MTBF for their 1TB ES.2, which is equivalent to a 0.73% annual failure rate. Hard drive manufacturers are probably a bit optimistic, as several studies have found real-world AFRs to be more typically in the 3-4% range, and sometimes even above 10%.
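
The MTBF-to-AFR conversion is simple enough to check yourself. The sketch below assumes drives run all 8,760 hours of the year and fail at a constant rate; the function names are mine, not from any vendor tool.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 powered-on hours per year (assumed 24x7 duty)


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Annual failure rate implied by an MTBF, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def mtbf_from_afr(afr: float) -> float:
    """Inverse conversion: the MTBF implied by an observed annual failure rate."""
    return -HOURS_PER_YEAR / math.log(1.0 - afr)


if __name__ == "__main__":
    print("AFR for a 1.2 million hour MTBF: %.2f%%" % (100 * afr_from_mtbf(1.2e6)))
    print("MTBF implied by a 3%% AFR: %.0f hours" % mtbf_from_afr(0.03))
```

For small rates the simpler ratio 8,760 / MTBF gives nearly the same answer, which is where the 0.73% figure comes from. Run in reverse, a real-world AFR of 3% implies an MTBF of roughly 288,000 hours, about a quarter of the published figure.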

The third input is the rate of silent corruption for the drives. A recent study over a 41-month period analyzed checksum errors on 1.53 million drives. During that period, silent data corruption was observed on 0.86% of 358,000 nearline SATA drives and 0.065% of 1.17 million enterprise-class Fibre Channel drives. Again, this real-world data is somewhat at variance with the manufacturers’ published Bit Error Rate of 1 in 10^15.
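
To get a feel for what a published bit error rate of one error per 10^15 bits means at scale, here is a small Python sketch. It assumes every bit read fails independently with that probability, which is a simplification, and the function names and data sizes are illustrative.

```python
import math

PUBLISHED_BER = 1e-15  # one bit error per 10**15 bits read


def expected_bit_errors(bytes_read: float, ber: float = PUBLISHED_BER) -> float:
    """Expected number of bit errors when reading `bytes_read` bytes."""
    return bytes_read * 8 * ber


def prob_at_least_one_error(bytes_read: float, ber: float = PUBLISHED_BER) -> float:
    """Probability of one or more bit errors, assuming independent bit reads."""
    # 1 - (1 - ber) ** bits, computed with log1p/expm1 to keep precision for tiny ber.
    return -math.expm1(bytes_read * 8 * math.log1p(-ber))


if __name__ == "__main__":
    for label, size in (("1 TB", 1e12), ("100 TB", 1e14), ("1 PB", 1e15)):
        print(label, expected_bit_errors(size), prob_at_least_one_error(size))
```

Reading a single terabyte gives an expected 0.008 bit errors, which sounds negligible. Reading a petabyte gives an expected 8, so at that scale at least one error is close to certain even if the drives meet their published specification.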

I tried to make this spreadsheet simple to use: pick or enter values for the yellow cells in the B column, and see the calculated values for probability of data loss, along with the expected number of undetected bit errors (the expected rate of silent data corruption) in cell I86.

Jeff Whitehead

May 12, 2009

Service Level Agreements – Meaningful or Meaningless?


 

Hello, my name is Jeff Whitehead, and I’m the CTO for Zetta. My background is in scale computing, and in running mission-critical operations sites. This is the first entry in a blog series where I will share some of my experiences with scale computing, and describe how the Zetta Enterprise Cloud Storage Service is designed to make data storage simpler, safer, and more approachable.

 

Today, I want to talk about the SLA, or Service Level Agreement. A Service Level Agreement is an agreement, typically between a service provider organization and its customers, that sets expectations for the service. I have to admit, as a buyer of IT services, I’m generally not satisfied with what providers offer as SLAs.

 

Most SLAs in the market are uninteresting because they lack consequences and are not aligned with the real business issue. An example would be a typical “premium,” or “platinum,” support package on some enterprise software. The SLA states that calls will be answered within 15 minutes 24x7x365. This illustrates a fundamental misalignment of purposes between the service organization and the customer. There is often quite a path (in terms of complexity and duration) between “answer the call” and “fix the problem,” and fixing the problem is the customer’s true objective.

 

SLAs are effective business tools only when they align business interests between two parties, have consequences, and are properly defined. Good SLAs are ones that ensure that the output of a technology system is business useful. There is a small technical difference between an ISP’s SLA of “being able to ping the upstream router,” and “being able to ping customers,” but a very large business utility difference; the first has zero business utility, and as such is an inappropriate target for an SLA.

 

Zetta attempts to align its business interests with those of its customers by providing targets that are meaningful to the business, and by backing them with consequences in the form of financial penalties.

 

Smart cloud storage customers have four major questions of their provider:

  1. “Can I get to the data right now?” I.e., is the data available? Availability is a pretty common SLA metric.
  2. “Can I get to the data ever? Can you prove that the data hasn’t changed while it was in your system?” Data Integrity is critical.
  3. “Can I get to the data at a rate sufficient for my business needs?” I.e., is performance consistent and guaranteed?
  4. “Is my data secure?” Ensure only authorized access to the system.

 

Zetta’s SLA covers all of these concerns and is backed by financial consequences. It represents a meaningful, business useful tool.

 
