Double-parity RAID, or RAID-6, is the de facto industry standard for storage; when I started talking about triple-parity RAID for ZFS earlier this year, the need wasn't always immediately obvious. Double-parity RAID, of course, provides protection from up to two failures (data corruption or whole-drive failure) within a RAID stripe. The necessity of triple-parity RAID arises from the observation that while hard drive capacity has roughly followed Kryder's law, doubling annually, hard drive throughput has improved far more modestly. Accordingly, the time to populate a replacement drive in a RAID stripe is increasing rapidly. Today, a 1TB SAS drive takes about 4 hours to fill at its theoretical peak throughput; in a real-world environment that number can easily double, and the 2TB and 3TB drives expected this year and next won't move data much faster. Those long periods spent in a degraded state increase the exposure to the bit errors and other drive failures that would in turn lead to data loss. The industry moved to double-parity RAID because one parity disk was insufficient; longer resilver times mean that we're spending more and more time effectively back at single-parity. From that it follows that double-parity will soon become insufficient. (I'm working on an article that examines these phenomena quantitatively, so stay tuned… update Dec 21, 2009: you can find the article here)
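As a rough sanity check on those numbers (assuming a sustained rate of roughly 70MB/s, which is about what the 4-hour figure implies): 1TB ÷ 70MB/s ≈ 14,000 seconds, or just under 4 hours. A 2TB drive at the same throughput is already pushing 8 hours, and that's before any real-world contention from the array doing its day job.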
Last week I integrated triple-parity RAID into ZFS. You can take a look at the implementation and the details of the algorithm here, but rather than describing the specifics, I wanted to describe its genesis. For double-parity RAID-Z, we drew on the work of Peter Anvin, which was also the basis of RAID-6 in Linux. This work was more or less a tutorial for systems programmers, simplifying some of the more subtle underlying mathematics with an eye towards optimization. While a systems programmer by trade, I have a background in mathematics, so I was interested to understand the foundational work. James S. Plank's paper A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems describes a technique for generalized N+M RAID. Not only was it simple to implement, but it could easily be made to perform well. I struggled for far too long trying to make the code work before discovering trivial flaws in the math itself. A bit more digging revealed that the author himself had published Note: Correction to the 1997 Tutorial on Reed-Solomon Coding 8 years later, addressing those same flaws.
Predictably, the mathematically accurate version was far harder to optimize, stifling my enthusiasm for the generalized case. My more serious concern was that the double-parity RAID-Z code suffered from a similar systemic flaw. That fear was quickly assuaged as I verified that the RAID-6 algorithm was sound. Further, from this investigation I was able to find a related method for doing triple-parity RAID-Z that was nearly as simple as its double-parity cousin. The math is a bit dense, but the key observation was that, given that 3 is the smallest factor of 255 (the largest value representable by an unsigned byte), it was possible to find exactly 3 different seed or generator values; beyond those, there were collections of failures that formed uncorrectable singularities. Using that technique I was able to implement a triple-parity RAID-Z scheme that performed nearly as well as the double-parity version.
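To give a flavor of what that looks like in practice, here's a minimal, illustrative sketch (not the actual vdev_raidz.c code; the function names are invented for the example) of computing the three parity syndromes for one byte column over GF(2^8), assuming the generators 1, 2 and 4 and the 0x11d field polynomial familiar from the RAID-6 literature:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Multiply a GF(2^8) element by 2, reducing by x^8 + x^4 + x^3 + x^2 + 1. */
static uint8_t gf_mul2(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0x00));
}

/*
 * Compute the P (generator 1, i.e. plain XOR), Q (generator 2) and
 * R (generator 4) syndromes for one byte column across n data disks,
 * using Horner's rule: walking the disks from last to first multiplies
 * each earlier contribution by another power of the generator.
 */
static void raidz3_parity_column(const uint8_t *d, size_t n,
    uint8_t *p, uint8_t *q, uint8_t *r)
{
    uint8_t pp = 0, qq = 0, rr = 0;

    for (size_t i = n; i > 0; i--) {
        pp ^= d[i - 1];                       /* P = d0 ^ d1 ^ ...      */
        qq = gf_mul2(qq) ^ d[i - 1];          /* Q = sum of 2^i * d[i]  */
        rr = gf_mul2(gf_mul2(rr)) ^ d[i - 1]; /* R = sum of 4^i * d[i]  */
    }
    *p = pp;
    *q = qq;
    *r = rr;
}

int main(void)
{
    /* One byte column taken from a hypothetical 4-wide data stripe. */
    uint8_t d[4] = { 0x11, 0x22, 0x33, 0x44 };
    uint8_t p, q, r;

    raidz3_parity_column(d, 4, &p, &q, &r);
    printf("P=%02x Q=%02x R=%02x\n", p, q, r);
    return 0;
}

Reconstruction after failures then amounts to solving the resulting system (at most 3x3) over GF(2^8) for the missing columns, which is exactly where the choice of generators determines whether the matrix stays invertible.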
As far as generic N-way RAID-Z goes, it's still something I'd like to add to ZFS. Triple-parity will suffice for quite a while, but we may want more parity sooner for a variety of reasons. Plank's revised algorithm is an excellent start. The test will be whether it can be made to perform well enough or whether some new, clever algorithm will need to be devised. Now, as for what to call these additional RAID levels, I'm not sure. RAID-7 or RAID-8 seem a bit ridiculous, and RAID-TP and RAID-QP aren't any better. Fortunately, in ZFS triple-parity RAID is just raidz3.
A little over three years ago, I integrated double-parity RAID-Z into ZFS, a feature expected of enterprise-class storage. This was in the early days of Fishworks, when much of our focus was on addressing functional gaps. The move to triple-parity RAID-Z comes in the wake of a number of our unique advancements to the state of the art, such as DTrace-powered Analytics and the Hybrid Storage Pool, as the Sun Storage 7000 series products meet and exceed the standards set by the industry. Triple-parity RAID-Z will, of course, be a feature included in the next major software update for the 7000 series (2009.Q3).
14 Responses
Another great addition to the RAIDZ feature set, but you know what's still missing! Is the BP rewrite work still on hold, or can we expect to see news about the raft of new features it will bring if/when it finally happens, e.g. vdev resizing/evacuation, etc.? It's been talked about for years and seems to be a real missing feature of ZFS compared with other RAID systems. Maybe it's less important for enterprise customers, but even here I'm sure there are those who would like an 'undo' option after adding the wrong vdev to a pool or adding a vdev to the wrong pool.
@Graham BP rewrite and device removal are looking good. Keep an eye out for them later this year.
I’m curious about the logical terminus of always needing one more parity set.
The law of diminishing returns is going to kick in and enterprise storage is going to be even more expensive than it is now, which is saying something.
Drive access times may eventually catch up with SSDs, and hopefully MTBF will too, but I’m not familiar with how those are looking yet. Something has to give somewhere.
@Matt Interesting observation. My guess is that for a fixed BER and MTBF, the number of parity devices desired will increase logarithmically with the size of the drives; I'll run the numbers when I get the opportunity.
Regarding the cost of enterprise storage, I actually believe that the sort of software we're building in ZFS and the Sun Storage 7000 series has dramatically reduced, and will continue to reduce, the cost of enterprise storage by enabling the use of cheaper components.
As to flash-based SSDs, drive access times are NEVER going to catch up. Today a 15K RPM drive will give you 2ms average rotational latency; to reach the < 200μs of today's SSDs you'd need to spin drives at least 10 times faster, which simply isn't practical for a variety of reasons. Conversely, as feature sizes get smaller, flash SSDs will need increasingly elaborate tricks to maintain their MTBF.
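To put rough numbers on that: 15,000 RPM is 250 revolutions per second, or 4ms per revolution, so the average (half-rotation) latency is about 2ms; getting under 200μs would mean spinning the platters at something like 150,000 RPM.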
[Trackback] By distributing the load across the whole system I get excellent results: a 1TB SATA disk (more than 80% utilized) has a rebuild time on the order of 16 hours (seen in the field with a Compellent with a series 20 controller in produ…
Good news on the addition of raidz3. At what point does raidzN become redundant, making it more sensible to mirror the raidzN instead?
Serious respect for this and all of ZFS.
But what I want most of all (ideally :) ) is for someone to implement a ZFS>CIFS stack on a PCIe card (hand-waving the details, naturally).
I'd like to say I'm joking, or dreaming, but the object would be to allow Win desktops to deal with serious datasets without needing infrastructure or loading that same infrastructure. There's enough desktop horsepower, but the combination of proprietary RAID hardware + NTFS is an integrity and management mess. Put ZFS right at a trader/quant's machine and it'd be simple to copy a dataset and round-trip it to larger systems or shared storage.
Surely someone is working on this? 🙂
@sophie The advantages of RAID-Z (or RAID-5, RAID-6, etc.) over mirroring are that capacity can be used more efficiently and there's more redundancy per grouping of disks. If you start to combine mirroring and RAID-Z, you lose those advantages. That said, mirroring has a significant performance advantage for random IOPS.
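To make that concrete: an 8-disk raidz2 group, for example, exposes 6/8 = 75% of its raw capacity and survives any two disk failures in the group, while the same 8 disks arranged as two-way mirrors expose only 50% and can lose data if the wrong two disks fail.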
@JMK Well, what if that PCIe card is a 10GbE, FC or SAS card and it connects to a Sun Storage 7000 series system to provide the power of ZFS on that Windows desktop? We’re definitely working on that, and have delivered on much of it already.
Hi Adam, thanks for your reply . .
Can you imagine I'm a little skeptical of pseudo-DAS, throwing LUNs at desktops?
(this really wouldn't feel good with big media files in post, for sure)
I do think I’ve missed something about ZFS development. Any hints please?
Regarding 10GbE etc. – my point is to drop latency below that fabric, as close to what ICH10 can do. The overhead scales up when doing, e.g., time series over the network on a MatLab sim. Without "local" ZFS, there's no saying goodbye to the inefficiencies that come from a lack of pool optimisation.
Note this is a backhanded compliment in every way, though I'm caffeine twitching 🙂
What I can't do is port the front-end apps to Sol10x86. Not easily, anyhow :)
Intriguing are the HP Z series boxen, which allow simultaneous VMs for Win/Linux. Now, if that worked with Solaris . . . problem solved! (save for boot order!)
I have a problematic app: huge metadata + good-sized amounts of underlying media objects. I'd like to sandbox that as close to the user as possible without paying for low-latency 10GbE everywhere. Cost that against what you can do with the latest RE4 WD drives in a bundle, assuming your f/e licenses are already paid.
I want to hint at how it'd be great to have Sol management for all data, and you sell the upstream end for that. I'd love even to say that you could do all this in VirtualBox, but that'd knock out people doing serious VIS in OpenGL.
this idealist signs out . . 🙂
– john
If ZFS is adding triple parity, ZFS should also add an option to bypass the hardware CRC checking on the drive so the disk space can be reclaimed, since ZFS is really guaranteeing the integrity of the data.
@sophie That's exactly what I was thinking; capacity is continually increasing, but disk speeds are not keeping up, so to keep decent speed you'll have to look at RAID-10 and other, simpler RAID levels. You lose a bit of protection, but that is partly offset by faster rebuilds and having to read data from only a single disk for the restore.
@sophie About three years ago my nightmare scenario happened when 3 disks failed in a 14-disk RAID set; despite having 2 hot spare drives, the RAID set was dead. Is there a formula for ZFS to calculate the tipping point where switching from a single raidzN to a mirrored pair of raidzN sets becomes advantageous? Yes, there will be a cost issue for the extra drives, cabinets and power, but availability and performance take priority when the downtime costs are more expensive.
@Adam Thanks for the update on BP rewrite – looks like interesting developments are ahead. Presumably this would also allow an existing raidz2 pool to be upgraded to raidz3; otherwise, and until then, the raidz3 feature will only be available for new vdevs.
@Sophie I notice you mentioned 2 hot spares, but was this a single-parity raid set? The hot spares only help make sure the array is rebuilt as soon as the fault is found, but they don't add redundancy. Also, 14 disks in one raid set is probably a bit much and outside the recommended number of disks for a single raidz set. The more disks in a raid set, the greater the probability that one will fail. It would be better to have two 7-disk sets.
I had a recent scare when an arbitrary decision to run a one-off scrub on an 11-disk set uncovered 4 disks with checksum errors (silent corruption). Fortunately the affected blocks were all in different places, so ZFS repaired them all from the raidz2 redundancy information. This is another reason for wanting higher parity protection – silent corruption will degrade your effective redundancy level for the affected blocks. If you use single-parity raidz, then regular scrubbing is strongly recommended, because if a disk fails whilst there is some undetected/unrepaired corruption on another disk, data will be lost.
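(For example, assuming a pool named tank, a weekly "zpool scrub tank" scheduled from cron is cheap insurance: the scrub reads and verifies every allocated block and repairs anything it can from the available redundancy.)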
@Sophie, Unfortunately, it was a single-parity raid set. The 14-disk raid set was imposed by a limitation of the hardware/firmware of the disk system; it has since been replaced and that hardware is no longer being made.
What is best practice for ZFS raids?
i.e. for 10TB of usable storage:
a) 10 x 1TB drives + 2 x 1TB @ Raidz2
b) 10 x 1TB drives + 3 x 1TB @ Raidz3
c) 14 x 750GB hdd + 2 x 750GB hdd @ Raidz2
d) 5 x 2TB hdd + 2 x 2TB hdd @ Raidz2
e) 5 x 2TB hdd + 3 x 2TB hdd @ Raidz3