Adam Leventhal's blog

Search
Close this search box.

Category: ZFS

I’ve been dabbling a bit in ZFS recently, and what’s amazing is not just how well it solved the well-understood filesystem problem, but how its design opens the door to novel ways to manage data. Compression is a great example. An almost accidental by-product of the design is that your data can be stored compressed on disk. This is especially interesting in an era when we have CPU cycles to spare, many too few available IOPs, and disk latencies that you can measure with a stop watch (well, not really, but you get the idea). With ZFS can you trade in some of those spare CPU cycles for IOPs by turning on compression, and the additional latency introduced by decompression is dwarfed by the time we spend twiddling our thumbs waiting for the platter to complete another revolution.

smaller and smaller

Turning on compression in zfs (zfs compression=on <dataset>) enables the so called LZJB compression algorithm — a variation on Lempel-Ziv tagged by its humble author. LZJB is fast, reasonably effective, and quite simple (compress and decompress are implemented in about a hundred lines of code). But the ZFS architecture can support many compression algorithms. Just as users can choose from several different checksum algorithms (fletcher2, fletcher4, or sha256), ZFS lets you pick your compression routine — it’s just that there’s only the one so far.

putting the z(lib) in ZFS

I thought it might be interesting to add a gzip compression algorithm based on zlib. I was able to hack this up pretty quicky because the Solaris kernel already contains a complete copy of zlib (albeit scattered around a little) for decompressing CTF data for DTrace, and apparently for some sort of compressed PPP streams module (or whatever… I don’t care). Here’s what the ZFS/zlib mash-up looks like (for the curious, this is with the default compression level — 6 on a scale from 1 to 9):

# zfs create pool/gzip
# zfs set compression=gzip pool/gzip
# cp -r /pool/lzjb/* /pool/gzip
# zfs list
NAME        USED  AVAIL  REFER  MOUNTPOINT
pool/gzip  64.9M  33.2G  64.9M  /pool/gzip
pool/lzjb   128M  33.2G   128M  /pool/lzjb

That’s with a 1.2G crash dump (pretty much the most compressible file imaginable). Here are the compression ratios with a pile of ELF binaries (/usr/bin and /usr/lib):

# zfs get compressratio
NAME       PROPERTY       VALUE      SOURCE
pool/gzip  compressratio  3.27x      -
pool/lzjb  compressratio  1.89x      -

Pretty cool. Actually compressing these files with gzip(1) yields a slightly smaller result, but it’s very close, and the convenience of getting the same compression transparently from the filesystem is awfully compelling. It’s just a prototype at the moment. I have no idea how well it will perform in terms of speed, but early testing suggests that it will be lousy compared to LZJB. I’d be very interested in any feedback: Would this be a useful feature? Is there an ideal trade-off between CPU time and compression ratio? I’d like to see if this is worth integrating into OpenSolaris.


Technorati Tags:

When ZFS first started, it was just Jeff trying to pair old problems with new solutions in margins too small to contain either. Then Matt joined up to bring some young blood to the project. By the time the project putback, the team had grown to more than a dozen. And now I’ve been pulled in — if only for a cameo.

When ZFS first hit the streets, Jeff wrote about RAID-Z, an implementation of RAID designed for ZFS. RAID-Z improves upon previous RAID schemes primarily in that it eliminates the so-called “write hole” by using a full (and variable-sized) stripe for all write operations. It’s worth noting that RAID-Z exploits the fact that ZFS is an end-to-end solution such that metadata (traditionally associated with the filesystem layer) is used to interpret the RAID layout on disk (an operation usually ascribed to a volume manager). In that post, Jeff mentioned that a double-parity version of RAID-Z was in the works. What he actually meant is that he had read a paper, and thought it might work out — you’d be forgiven for inferring that actual code had been written.

Over lunch, Bill — yet another elite ZFS hacker — mentioned double-parity RAID-Z and their plans for implementing it. I pressed for details, read the paper, got interested in the math, and started yakking about it enough for Bill to tell me to put up or shut up.

RAID-6

The basic notion behind double-parity RAID or RAID-6 is that a stripe can survive two failures without losing data where RAID-5 can survive only a single failure. There are a number of different ways of implementing double-parity RAID; the way Jeff and Bill had chosen (due to its computational simplicity and lack of legal encumbrance) was one described by H. Peter Anvin in this paper. It’s a nice read, but I’ll attempt to summarize some of the math (warning: this summary is going to be boring and largely unsatisfying so feel free to skip it).

For a given stripe of n data blocks, D0 .. Dn-1, RAID-5 computes the contents of the parity disk P by taking the bitwise XOR of those data blocks. If any Dn is corrupted or missing, we can recover it by taking the XOR of all other data blocks with P. With RAID-6, we need to compute another prity disk Q using a different technique such that Q alone can reconstruct any Dn and P and Q together can reconstruction any two data blocks.

To talk about this, it’s easier — believe it or not — to define a Galois field (or a finite field as I learned it) over the integers [0..255] — the values that can be stored in a single byte. The addition field operation (+) is just bitwise XOR. Multiplication (x) by 2 is given by this bitwise operation for x x 2 = y:

y7 = x6
y6 = x5
y5 = x4
y4 = x3 + x7
y3 = x2 + x7
y2 = x1 + x7
y1 = x0
y0 = x7

A couple of simple things worth noting: addition (+) is the same as subtraction (-), 0 is the additive identity and the multiplicative annihilator, 1 is the multiplicative identity. Slightly more subtle: each element of the field except for 0 (i.e. [1..255]) can be represented as 2n for some n. And importantly: x-1 = x254. Also note that x x y can be rewritten as 2log x x 2log y or 2log x + log y (where + in that case is normal integer addition).

We compute Q as
2n-1 D0 + 2n-2 D1 … + Dn-1
or equivalently
((…(((D0 x 2 + D1 + …) x 2 + Dn-2) x 2 + Dn-1.
Computing Q isn’t much slower than computing P since we’re just dealing with a few simple bitwise operations.

With P and Q we can recover from any two failures. If Dx fails, we can repair it with P. If P also fails, we can recover Dx by computing Qx where Qi = Q + 2n – 1 – x x Dx (easily done by performing the same computation as for generating Q but with Dx set to 0); Dx is then (Qx + Q) / 2n – 1 – x = (Qx + Q) x 2x + 1 – n. Once we solve for Dx, then we recompute P as we had initially.

When two data disks are missing, Dx and Dy, that’s when the rubber really meets the road. We compute Pxy and Qxy such that Pxy + Dx + Dy = P and Qxy + 2n – 1 – x x Dx + 2n – 1 – y x Dy = Q (as before). Using those two expressions and some basic algebra, we can solve for Dx and then plug that in to solve for Dy. The actual expressions are a little too hairy for HTML, but you can check out equation 16 in the paper or the code for the gory details.

Double-Parity RAID-Z

As of build 42 of OpenSolaris, RAID-Z comes in a double-parity version to complement the existing single-parity version — and it only took about 400 additional lines of code. Check out the code here. Of particular interest are the code to generate both parity blocks and the code to do double block reconstruction. What’s especially cool about ZFS is that we don’t just blithely reconstruct data, but we can verify it against the known checksum. This means, for example, that we could get back seemingly valid data from all disks, but fail the checksum; in that case we’d first try reconstructing each individual block, and then try reconstructing every pair of blocks until we’ve found something that checksums. You can see the code for combinatorial reconstruction here.

Using raidz2

To make a double-parity RAID-Z vdev, specify raidz2 to zpool(1M):

# zpool create pool raidz2 c1t0d0 c1t0d1 c1t0d2 c1t0d3 c1t0d4

This will create a pool with a double-parity RAID-Z vdev of width 5 where all data can sustain up to two failures be they corrupt data coming off the drives or drives that are failed or missing. The raidz vdev type continues to mean single-parity RAID-Z as does the new alias raidz1.

Double-parity RAID-Z is probably going to supplant the use of its single-parity predecessor in many if not most cases. As Dave Hitz of NetApp helpfully noted in a recent post double-parity RAID doesn’t actually cost you any additional space because you’ll typically have wider stripes. Rather than having two single-parity stripes of 5 disks each, you’ll have one double-parity stripe with 10 disks — the same capacity with extra protection against failures. It also shouldn’t cost you in terms of performance because the total number of disk operations will be the same and the additional math, while slightly more complex, is still insignificant compared with actually getting bits on disk. So enjoy the extra parity.


Technorati Tags:

Recent Posts

April 17, 2024
January 13, 2024
December 29, 2023
February 12, 2017
December 18, 2016

Archives

Archives