Adam Leventhal's blog

Close this search box.

Month: November 2012

Back in October I was pleased to attend — and my employer, Delphix, was pleased to sponsor — illumos day and ZFS day, run concurrently with Oracle Open World. Inspired by the success of dtrace.conf(12) in the Spring, the goal was to assemble developers, practitioners, and users of ZFS and illumos-derived distributions to educate, share information, and discuss the future.

illumos day

The week started with the developer-centric illumos day. While illumos picked up the torch when Oracle re-closed OpenSolaris, each project began with a very different focus. Sun and the OpenSolaris community obsessed with inclusion, and developer adoption — often counterproductively. The illumos community is led by those building products based on the unique technologies in illumos — DTrace, ZFS, Zones, COMSTAR, etc. While all are welcome, it’s those who contribute the most whose voices are most relevant.

I was asked to give a talk about technologies unique to illumos that are unlikely to appear in Oracle Solaris. It was only when I started to prepare the talk that the difference in focuses of illumos and Oracle Solaris fell into sharp focus. In the illumos community, we’ve advanced technologies such as ZFS in ways that would benefit Oracle Solaris greatly, but Oracle has made it clear that open source is anathema for its version of Solaris. For example, at Delphix we’ve recently been fixing bugs, asking ourselves, “how has Oracle never seen this?”.

Yet the differences between illumos and Oracle Solaris are far deeper. In illumos we’re building products that rely on innovation and differentiation in the operating system, and it’s those higher-level products that our various customers use. At Oracle, the priorities are more traditional: support for proprietary SPARC platforms, packaging and updating for administrators, and ease-of-use. In my talk, rather than focusing on the sundry contributions to illumos, I picked a few of my favorites. The slides are more or less incomprehensible on their own; many thanks to Deirdre Straughan for posting the video (and for putting together the event!) — check out 40:30 for a photo of Jean-Luc Picard attending the DTrace talk at OOW.

[youtube_sc url=””]

ZFS day

While illumos day was for developers, ZFS day was for users of ZFS to learn from each others’ experiences, and hear from ZFS developers. I had the ignominous task of presenting an update on the Hybrid Storage Pool (HSP). We developed the HSP at Fishworks as the first enterprise storage system to add flash memory into the storage hierarchy to accelerate reads and writes. Since then, economics and physics have thrown up some obstacles: DRAM has gotten cheaper, and flash memory is getting harder and harder to turn into a viable enterprise solution. In addition, the L2ARC that adds flash as a ZFS read cache, has languished; it has serious problems that no one has been motivated or proficient enough to address.

I’ll warn you that after the explanation of the HSP, it’s mostly doom and gloom (also I was sick as a dog when I prepared and gave the talk), but check out the slides and video for more on the promise and shortcomings of the HSP.

[youtube_sc url=””]


For both illumos day and ZFS day, it was a mostly full house. Reuniting with the folks I already knew was fun as always, but even better was to meet so many who I had no idea were building on illumos or using ZFS. The events highlighted that we need to facilitate more collaboration — especially around ZFS — between the illumos distros, FreeBSD, and Linux — hell, even Oracle Solaris would be welcome.

Lately, I’ve been rooting around in the bowels of ZFS as we’ve explored some long-standing performance pathologies. To that end I’ve been fortunate to learn at the feet of Matt Ahrens who was half of the ZFS founding team and George Wilson who has forgotten more about ZFS than most people will ever know. I wanted to start sharing some of the interesting details I’ve unearthed.

For allocation purposes, ZFS carves vdevs (disks) into a number of “metaslabs” — simply smaller, more manageable chunks of the whole. How many metaslabs? Around 200:

vdev_metaslab_set_size(vdev_t *vd)
         * Aim for roughly 200 metaslabs per vdev.
        vd->vdev_ms_shift = highbit(vd->vdev_asize / 200);
        vd->vdev_ms_shift = MAX(vd->vdev_ms_shift, SPA_MAXBLOCKSHIFT);

Why 200? Well, that just kinda worked and was never revisited. Is it optimal? Almost certainly not. Should there be more or less? Should metaslab size be independent of vdev size? How much better could we do? All completely unknown.

The space in the vdev is allotted proportionally, and contiguously to those metaslabs. But what happens when a vdev is expanded? This can happen when a disk is replaced by a larger disk or if an administrator grows a SAN-based LUN. It turns out that ZFS simply creates more metaslabs — an answer whose simplicity was only obvious in retrospect.

For example, let’s say we start with a 2T disk; then we’ll have 200 metaslabs of 10G each. If we then grow the LUN to 4TB then we’ll have 400 metaslabs. If we started instead from a 200GB LUN that we eventually grew to 4TB we’d end up with 4,000 metaslabs (each 1G). Further, if we started with a 40TB LUN (why not) and grew it by 100G ZFS would not have enough space to allocate a full metaslab and we’d therefore not be able to use that additional space.

At Delphix our metaslabs can become highly fragmented because most of our datasets use a 8K record size (read up on space maps to understand how metaslabs are managed — truly fascinating), and our customers often expand LUNs as a mechanism for adding more space. It’s not clear how much room there is for improvement, but these are curious phenomena that we intend to investigate along with the structure of space maps, the idiosyncrasies of the allocation path, and other aspects of ZFS as we continue to understand and improve performance. Stay tuned.

Recent Posts

April 17, 2024
January 13, 2024
December 29, 2023
February 12, 2017
December 18, 2016