June 2006 – Adam Leventhal's blog

Double-Parity RAID-Z

June 18, 2006

When ZFS first started, it was just Jeff trying to pair old problems with new solutions in margins too small to contain either. Then Matt joined up to bring some young blood to the project. By the time the project putback, the team had grown to more than a dozen. And now I’ve been pulled in — if only for a cameo.

When ZFS first hit the streets, Jeff wrote about RAID-Z, an implementation of RAID designed for ZFS. RAID-Z improves upon previous RAID schemes primarily in that it eliminates the so-called “write hole” by using a full (and variable-sized) stripe for all write operations. It’s worth noting that RAID-Z exploits the fact that ZFS is an end-to-end solution such that metadata (traditionally associated with the filesystem layer) is used to interpret the RAID layout on disk (an operation usually ascribed to a volume manager). In that post, Jeff mentioned that a double-parity version of RAID-Z was in the works. What he actually meant is that he had read a paper, and thought it might work out — you’d be forgiven for inferring that actual code had been written.

Over lunch, Bill — yet another elite ZFS hacker — mentioned double-parity RAID-Z and their plans for implementing it. I pressed for details, read the paper, got interested in the math, and started yakking about it enough for Bill to tell me to put up or shut up.

RAID-6

The basic notion behind double-parity RAID or RAID-6 is that a stripe can survive two failures without losing data where RAID-5 can survive only a single failure. There are a number of different ways of implementing double-parity RAID; the way Jeff and Bill had chosen (due to its computational simplicity and lack of legal encumbrance) was one described by H. Peter Anvin in this paper. It’s a nice read, but I’ll attempt to summarize some of the math (warning: this summary is going to be boring and largely unsatisfying so feel free to skip it).

For a given stripe of n data blocks, D₀ .. D_{n-1}, RAID-5 computes the contents of the parity disk P by taking the bitwise XOR of those data blocks. If any D_x is corrupted or missing, we can recover it by taking the XOR of all other data blocks with P. With RAID-6, we need to compute another parity disk Q using a different technique such that Q alone can reconstruct any D_x and P and Q together can reconstruct any two data blocks.
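A minimal Python sketch of the single-parity case described above. The block contents and the helper name `xor_blocks` are made up for illustration; this is not the ZFS code.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of a list of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three tiny illustrative data blocks, two bytes each.
data = [b"\x01\x02", b"\x10\x20", b"\xaa\x55"]
p = xor_blocks(data)

# Pretend data[1] is lost: XOR the surviving blocks with P to get it back.
survivors = [d for i, d in enumerate(data) if i != 1]
recovered = xor_blocks(survivors + [p])
```

Because XOR is its own inverse, the same function both generates the parity and repairs a single missing block.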

To talk about this, it’s easier — believe it or not — to define a Galois field (or a finite field as I learned it) over the integers [0..255] — the values that can be stored in a single byte. The addition field operation (+) is just bitwise XOR. Multiplication (x) by 2 is given by this bitwise operation for x x 2 = y:

y₇	=	x₆
y₆	=	x₅
y₅	=	x₄
y₄	=	x₃ + x₇
y₃	=	x₂ + x₇
y₂	=	x₁ + x₇
y₁	=	x₀
y₀	=	x₇
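The table above can be sketched in Python two ways: as a shift followed by a conditional XOR, and as a literal bit-by-bit transcription. The constant 0x1d is just bits 4, 3, 2 and 0 (the positions where x₇ is folded back in); the function names are illustrative.

```python
def gf_mul2(x):
    """Multiply a byte by 2 in the field: shift left, and if the old top bit
    was set, XOR it back into bits 4, 3, 2 and 0 (the constant 0x1d)."""
    y = (x << 1) & 0xff
    if x & 0x80:
        y ^= 0x1d
    return y

def gf_mul2_bitwise(x):
    """The same operation, transcribed row by row from the table."""
    b = [(x >> i) & 1 for i in range(8)]          # b[i] is bit x_i
    y = [b[7],                                    # y0 = x7
         b[0],                                    # y1 = x0
         b[1] ^ b[7],                             # y2 = x1 + x7
         b[2] ^ b[7],                             # y3 = x2 + x7
         b[3] ^ b[7],                             # y4 = x3 + x7
         b[4],                                    # y5 = x4
         b[5],                                    # y6 = x5
         b[6]]                                    # y7 = x6
    return sum(bit << i for i, bit in enumerate(y))
```

The two forms agree on all 256 byte values; the shift-and-XOR form is the one that is cheap to compute.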

A couple of simple things worth noting: addition (+) is the same as subtraction (-), 0 is the additive identity and the multiplicative annihilator, 1 is the multiplicative identity. Slightly more subtle: each element of the field except for 0 (i.e. [1..255]) can be represented as 2ⁿ for some n. And importantly: x^-1 = x²⁵⁴. Also note that x x y can be rewritten as 2^{log x} x 2^{log y} or 2^{log x + log y} (where + in that case is normal integer addition).
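These properties can be checked with a short Python sketch. It builds power and logarithm tables from repeated multiplication by 2, then uses them for general multiplication and inversion. The table and function names are illustrative, and the exponent sum is reduced modulo 255 because 2²⁵⁵ = 1 in this field.

```python
def gf_mul2(x):
    """Multiply by 2 in the field (shift left, fold the top bit back in)."""
    y = (x << 1) & 0xff
    return y ^ 0x1d if x & 0x80 else y

# EXP[n] = 2^n for n in 0..254, and LOG is the inverse mapping.
EXP = [1] * 255
for n in range(1, 255):
    EXP[n] = gf_mul2(EXP[n - 1])
LOG = {value: n for n, value in enumerate(EXP)}

def gf_mul(a, b):
    """a x b = 2^(log a + log b); 0 annihilates."""
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % 255]

def gf_pow(a, k):
    """a raised to the k-th power by repeated multiplication."""
    result = 1
    for _ in range(k):
        result = gf_mul(result, a)
    return result

def gf_inv(a):
    """Multiplicative inverse of a nonzero element: a^254."""
    return gf_pow(a, 254)
```

Every value in [1..255] appears exactly once in `EXP`, which is the "each element can be represented as 2ⁿ" property.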

We compute Q as
Q = 2^{n-1} x D₀ + 2^{n-2} x D₁ + … + D_{n-1}
or equivalently
Q = (…((D₀ x 2 + D₁) x 2 + D₂) x 2 + … + D_{n-2}) x 2 + D_{n-1}.
Computing Q isn’t much slower than computing P since we’re just dealing with a few simple bitwise operations.
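A sketch of generating both parity blocks in one pass, using the nested (Horner-style) form of Q so that only multiplication by 2 and XOR are needed. The name `parity_pq` and the sample bytes are assumptions for illustration, not the ZFS implementation.

```python
def gf_mul2(x):
    """Multiply by 2 in the field (shift left, fold the top bit back in)."""
    y = (x << 1) & 0xff
    return y ^ 0x1d if x & 0x80 else y

def parity_pq(blocks):
    """Return (P, Q) for a list of equal-length data blocks D_0 .. D_{n-1}."""
    size = len(blocks[0])
    p = bytearray(size)
    q = bytearray(size)
    for block in blocks:                 # D_0 first, D_{n-1} last
        for i, byte in enumerate(block):
            p[i] ^= byte                 # P is the plain XOR
            q[i] = gf_mul2(q[i]) ^ byte  # Q = (Q x 2) + D_k
    return bytes(p), bytes(q)

p, q = parity_pq([b"\x01", b"\x02", b"\x03"])
```

For this three-block example, Q works out to 4 x 1 + 2 x 2 + 3, and since 4 + 4 cancels under XOR, Q is 3.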

With P and Q we can recover from any two failures. If D_x fails, we can repair it with P. If P also fails, we can recover D_x by computing Q_x where Q_x = Q + 2^{n-1-x} x D_x (easily done by performing the same computation as for generating Q but with D_x set to 0); D_x is then (Q_x + Q) / 2^{n-1-x} = (Q_x + Q) x 2^{x+1-n}, where the exponent is taken modulo 255 since 2²⁵⁵ = 1. Once we solve for D_x, then we recompute P as we had initially.
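The data-plus-P case can be sketched as follows: recompute Q with the missing block zeroed, XOR against the real Q, and divide by the block's coefficient. Function names and sample data are illustrative assumptions, and division is done with logarithm tables rather than however the real code does it.

```python
def gf_mul2(x):
    """Multiply by 2 in the field (shift left, fold the top bit back in)."""
    y = (x << 1) & 0xff
    return y ^ 0x1d if x & 0x80 else y

# EXP[k] = 2^k and LOG[2^k] = k.
EXP = [1] * 255
for k in range(1, 255):
    EXP[k] = gf_mul2(EXP[k - 1])
LOG = {value: k for k, value in enumerate(EXP)}

def gf_div(a, b):
    """a / b for nonzero b, by subtracting logarithms modulo 255."""
    return 0 if a == 0 else EXP[(LOG[a] - LOG[b]) % 255]

def compute_q(blocks):
    """Q for data blocks D_0 .. D_{n-1}, using the nested form."""
    q = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            q[i] = gf_mul2(q[i]) ^ byte
    return bytes(q)

def recover_with_q(blocks, x, q):
    """Rebuild block x from Q alone; blocks[x] is ignored (may be None)."""
    n = len(blocks)
    zeroed = [bytes(len(q)) if i == x else b for i, b in enumerate(blocks)]
    q_x = compute_q(zeroed)              # Q computed with D_x set to 0
    coeff = EXP[n - 1 - x]               # 2^(n-1-x); assumes n <= 255
    return bytes(gf_div(a ^ b, coeff) for a, b in zip(q_x, q))

data = [b"\x12\x34", b"\xab\xcd", b"\x00\xff", b"\x80\x01"]
q = compute_q(data)
damaged = [data[0], None, data[2], data[3]]
recovered = recover_with_q(damaged, 1, q)
```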

When two data disks are missing, D_x and D_y, that’s when the rubber really meets the road. We compute P_xy and Q_xy such that P_xy + D_x + D_y = P and Q_xy + 2^{n-1-x} x D_x + 2^{n-1-y} x D_y = Q (as before). Using those two expressions and some basic algebra, we can solve for D_x and then plug that in to solve for D_y. The actual expressions are a little too hairy for HTML, but you can check out equation 16 in the paper or the code for the gory details.
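One way to carry out that algebra, sketched in Python. Writing a = 2^{n-1-x} and b = 2^{n-1-y}, the two expressions give D_x + D_y = P + P_xy and a x D_x + b x D_y = Q + Q_xy; multiplying the first by b and adding it to the second leaves (a + b) x D_x, so D_x = ((Q + Q_xy) + b x (P + P_xy)) / (a + b) and D_y = (P + P_xy) + D_x. This is a rearrangement from the two expressions in the text, not a transcription of equation 16 or of the ZFS code, and all names are illustrative.

```python
def gf_mul2(x):
    """Multiply by 2 in the field (shift left, fold the top bit back in)."""
    y = (x << 1) & 0xff
    return y ^ 0x1d if x & 0x80 else y

EXP = [1] * 255                          # EXP[k] = 2^k
for k in range(1, 255):
    EXP[k] = gf_mul2(EXP[k - 1])
LOG = {value: k for k, value in enumerate(EXP)}

def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[(LOG[a] + LOG[b]) % 255]

def gf_div(a, b):
    return 0 if a == 0 else EXP[(LOG[a] - LOG[b]) % 255]

def parity_pq(blocks):
    p, q = bytearray(len(blocks[0])), bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            p[i] ^= byte
            q[i] = gf_mul2(q[i]) ^ byte
    return bytes(p), bytes(q)

def recover_two(blocks, x, y, p, q):
    """Rebuild data blocks x and y (x != y) from the survivors plus P and Q."""
    n = len(blocks)
    zeroed = [bytes(len(p)) if i in (x, y) else blk
              for i, blk in enumerate(blocks)]
    p_xy, q_xy = parity_pq(zeroed)       # parity with D_x and D_y set to 0
    a, b = EXP[n - 1 - x], EXP[n - 1 - y]
    dp = bytes(s ^ t for s, t in zip(p, p_xy))   # D_x + D_y
    dq = bytes(s ^ t for s, t in zip(q, q_xy))   # a x D_x + b x D_y
    d_x = bytes(gf_div(s ^ gf_mul(b, t), a ^ b) for s, t in zip(dq, dp))
    d_y = bytes(s ^ t for s, t in zip(dp, d_x))
    return d_x, d_y

data = [b"\x12\x34", b"\xab\xcd", b"\x00\xff", b"\x80\x01", b"\x7e\x10"]
p, q = parity_pq(data)
damaged = [data[0], None, data[2], None, data[4]]
d1, d3 = recover_two(damaged, 1, 3, p, q)
```

The divisor a + b is never zero here because x and y differ and the stripe is assumed to have at most 255 data blocks, so a and b are distinct powers of 2.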

Double-Parity RAID-Z

As of build 42 of OpenSolaris, RAID-Z comes in a double-parity version to complement the existing single-parity version — and it only took about 400 additional lines of code. Check out the code here. Of particular interest are the code to generate both parity blocks and the code to do double block reconstruction. What’s especially cool about ZFS is that we don’t just blithely reconstruct data, but we can verify it against the known checksum. This means, for example, that we could get back seemingly valid data from all disks, but fail the checksum; in that case we’d first try reconstructing each individual block, and then try reconstructing every pair of blocks until we’ve found something that checksums. You can see the code for combinatorial reconstruction here.
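A simplified Python sketch of the combinatorial idea: trust nothing, and keep rebuilding candidate blocks until the result matches the known checksum. It makes several assumptions that are not from the ZFS source: SHA-256 stands in for the block checksum, only data columns are treated as suspect (P and Q are assumed good), and all function names are hypothetical.

```python
import hashlib
from itertools import combinations

def gf_mul2(x):
    y = (x << 1) & 0xff
    return y ^ 0x1d if x & 0x80 else y

EXP = [1] * 255                          # EXP[k] = 2^k
for k in range(1, 255):
    EXP[k] = gf_mul2(EXP[k - 1])
LOG = {value: k for k, value in enumerate(EXP)}

def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[(LOG[a] + LOG[b]) % 255]

def gf_div(a, b):
    return 0 if a == 0 else EXP[(LOG[a] - LOG[b]) % 255]

def parity_pq(blocks):
    p, q = bytearray(len(blocks[0])), bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            p[i] ^= byte
            q[i] = gf_mul2(q[i]) ^ byte
    return bytes(p), bytes(q)

def rebuild(blocks, missing, p, q):
    """Copy of blocks with the one or two `missing` data columns rebuilt."""
    n = len(blocks)
    out = [bytes(len(p)) if i in missing else b for i, b in enumerate(blocks)]
    p_m, q_m = parity_pq(out)
    dp = bytes(s ^ t for s, t in zip(p, p_m))    # XOR of the missing columns
    if len(missing) == 1:
        out[missing[0]] = dp                     # single block: P is enough
        return out
    x, y = missing
    dq = bytes(s ^ t for s, t in zip(q, q_m))
    a, b = EXP[n - 1 - x], EXP[n - 1 - y]
    out[x] = bytes(gf_div(s ^ gf_mul(b, t), a ^ b) for s, t in zip(dq, dp))
    out[y] = bytes(s ^ t for s, t in zip(dp, out[x]))
    return out

def checksum(blocks):
    """Stand-in checksum over the whole logical block (an assumption)."""
    return hashlib.sha256(b"".join(blocks)).digest()

def combinatorial_reconstruct(blocks, p, q, expected):
    """Try the data as read, then every single column, then every pair."""
    if checksum(blocks) == expected:
        return blocks
    for count in (1, 2):
        for missing in combinations(range(len(blocks)), count):
            candidate = rebuild(blocks, missing, p, q)
            if checksum(candidate) == expected:
                return candidate
    return None                                  # more than two bad columns

good = [b"\x12\x34", b"\xab\xcd", b"\x00\xff", b"\x80\x01"]
p, q = parity_pq(good)
expected = checksum(good)
silently_bad = [good[0], b"\xff\xff", good[2], b"\x00\x00"]
repaired = combinatorial_reconstruct(silently_bad, p, q, expected)
```

The point of the sketch is the search order: nothing tells us which columns are wrong, so the checksum is the only arbiter of whether a given reconstruction is the right one.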

Using raidz2

To make a double-parity RAID-Z vdev, specify raidz2 to zpool(1M):

# zpool create pool raidz2 c1t0d0 c1t0d1 c1t0d2 c1t0d3 c1t0d4

This will create a pool with a double-parity RAID-Z vdev of width 5 where all data can sustain up to two failures be they corrupt data coming off the drives or drives that are failed or missing. The raidz vdev type continues to mean single-parity RAID-Z as does the new alias raidz1.

Double-parity RAID-Z is probably going to supplant the use of its single-parity predecessor in many if not most cases. As Dave Hitz of NetApp helpfully noted in a recent post, double-parity RAID doesn’t actually cost you any additional space because you’ll typically have wider stripes. Rather than having two single-parity stripes of 5 disks each, you’ll have one double-parity stripe with 10 disks: the same capacity with extra protection against failures. It also shouldn’t cost you in terms of performance because the total number of disk operations will be the same and the additional math, while slightly more complex, is still insignificant compared with actually getting bits on disk. So enjoy the extra parity.
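The capacity comparison is easy to check by counting. The sketch below numbers ten disks 0..9, lays them out either as two 5-wide single-parity stripes or one 10-wide double-parity stripe, and counts how many of the possible two-disk failures each layout survives. The helper names are illustrative, and the count says nothing about how likely any particular pair of failures is.

```python
from itertools import combinations

def usable_disks(stripes, width, parity):
    """Number of disks' worth of data capacity in a layout."""
    return stripes * (width - parity)

def survives(failed, stripes, width, parity):
    """True if no stripe has more failed disks than it has parity disks."""
    per_stripe = [0] * stripes
    for disk in failed:
        per_stripe[disk // width] += 1
    return all(count <= parity for count in per_stripe)

pairs = list(combinations(range(10), 2))          # every two-disk failure

raidz1_capacity = usable_disks(2, 5, 1)
raidz2_capacity = usable_disks(1, 10, 2)
raidz1_ok = sum(survives(f, 2, 5, 1) for f in pairs)
raidz2_ok = sum(survives(f, 1, 10, 2) for f in pairs)
```

Both layouts leave 8 disks of capacity. Of the 45 possible two-disk failures, the pair of single-parity stripes survives the 25 in which the failures land in different stripes, while the double-parity stripe survives all 45.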

Technorati Tags: OpenSolaris ZFS

DTrace on Geek Muse

June 16, 2006

DTrace was recently featured on episode 35 of Geek Muse. DTrace was brought to their attention because of John Birrell’s recent work to port it to FreeBSD (nice work, John!). The plug was nice, but I did want to respond to a few things:

DTrace was referred to as “a scripting language for debugging”. While I can understand why one might get that impression, it’s kind of missing the point. DTrace, concisely, is a systemic observability framework that’s designed explicitly for use on mission-critical systems. It lets users and system administrators get concise answers to arbitrary questions. The scripting language aspect to DTrace lets you express those questions, but that’s really just a component. James Dickens took a stab at an all-encompassing definition of DTrace….

One of the podcasters said something to the effect of “I’m just a web developer…” One of the great things about DTrace is that it has uses for developers at almost any layer of the stack. Initially DTrace could only view the kernel, and C and C++ code, but since its release in Solaris 10 well over a year ago, DTrace has been extended to Java, Ruby, PHP, Python, Perl, and a handful of other dynamic languages that folks who are “just web developers” tend to use. In addition to being able to understand how your own code works, you’ll be able to see how it interacts with every level of the system all the way down to things like disk I/O and the CPU scheduler.

Shortly after that, someone opined “I could use it for looking at XML-RPC communication”. For sure! DTrace is crazy useful for understanding communication between processes, and in particular for XML-RPC for viewing calls and replies quickly and easily.

At one point they also identified the need to make sure users can’t use DTrace to spy on each other. By default, DTrace is only executable by the root user. System administrators can dole out various levels of DTrace privilege to users as desired. Check out the manual — and the security chapter in particular.

Technorati Tags: DTrace Geek Muse
