Adam Leventhal's blog

Close this search box.

Tag: flash

This series of posts covers APFS, Apple’s new filesystem announced at WWDC 2016. See the first post for the table of contents.


APFS claims to be optimized for flash. Flash memory (NAND) is the stuff in your speedy SSD. Apple changed the computing industry when it put flash into the iPod and iPhone, volumes for which fundamentally changed the economics of flash. This consumer change impacted the enterprise (as it often does), giving rise to hybrid and all-flash arrays. Ten years ago flash cost as much as DRAM; now it’s challenging the economics of hard disks.

SSDs mimic the block interface of conventional hard drives, but the underlying technology is completely different. In particular while magnetic media can read or write sectors arbitrarily, flash erases large chunks (blocks) and reads and writes smaller chunks (pages). The management is done by what’s called the flash translation layer (FTL), software that makes blocks and pages appear more like a hard drive. An FTL is very similar to a file system, creating a virtual mapping (a translation) between block addresses and locations within the media. Apple controls the full stack including the SSD, FTL, and file system; they could have built something differentiated, optimizing this components to work together. What APFS does, however, is simply write in patterns known to be more easily handled by NAND. It’s a file system with flash-aware characteristics rather than one written explicitly for the native flash interfaces, more or less what you’d expect in 2016.

Also on the topic of flash, APFS includes TRIM support. TRIM is a command in the ATA protocol that allows a file system to indicate to an SSD (specifically, its FTL) that some space has been freed. SSDs require significant free space and perform better when there’s more of it; they include more physical space than they advertise. For example, my 1TB SSD includes 1TB (240 = 10244) bytes of flash but only reports 931GB of available space, sneakily matching the storage industry’s self-serving definition of 1TB (10004 = 1 trillion bytes). With more free space, FTLs can trade off space efficiency for performance and longevity. TRIM has become expected of file systems; it’s unsurprising that APFS supports it. The problem with TRIM though is that it’s only useful when there’s free space: it’s something of a benchmark special. Once your disk is mostly full (as mine are in my laptop and phone basically at all times) TRIM doesn’t do anything for you. I doubt that TRIM will bring any discernible benefit for APFS users beyond the placebo effect of feature parity.

APFS also focuses on latency; Apple’s number one goal is to avoid the beachball of doom. APFS addresses this with I/O QoS (quality of service) to prioritize accesses that are immediately visible to the user over background activity that doesn’t have the same time-constraints. This is inarguably a benefit to users and a sophisticated file system capability.


Next in this series: Data Integrity

[latexpage] RAID algorithms have become a particular fascination of mine, and I recently read a very interesting paper that describes an optimization for RAID reconstruction (by Xiang, Xu, Lui, Chang, Pan, and Li). Before writing double- and triple-parity RAID algorithms for ZFS, I spent a fair bit of time researching the subject and have stayed interested since. Before describing the reconstruction optimization, some preamble is required. RAID algorithms can be divided into two buckets: one-dimensional algorithms, and multi-dimensional algorithms (terms of my own choosing; I haven’t seen this distinction discussed in literature).

One-dimensional RAID

A one-dimensional algorithm is one in which all data in a single RAID stripe is used to compute all parities. The RAID algorithm used by ZFS falls into this category as do most algorithms derived from Reed-Solomon coding. For a given RAID stripe’s set of data blocks, $D$, we can compute the nth parity block with some function $p(D, n)$. For example, ZFS, roughly, uses the formula \[ p(D,n) = \sum_{i=1}^{\left|{D}\right|} 2^{(i-1)(n-1)} \cdot D_{i} \] Here, addition and multiplication are defined over a Galois Field – the explanation would be far longer than it would be interesting or relevant so I’ll omit it from this post. It is worth noting that this particular approach only works for three parity disks or fewer, but that too is an entirely different subject (albeit an interesting one). Reconstruction of a missing block in a one-dimensional algorithm requires reading the available data, and performing some computation; each stripe may be reconstructed separately (and thus, in parallel).

Multi-dimensional RAID

A multi-dimensional algorithm is one in which parts of multiple logical RAID stripes may contribute to parity calculation. Examples of this include IBM’s EVENODD and NetApp’s slight simplification, Row-Diagonal Parity (RDP). These are most easily conveyed through diagrams:


With both EVENODD and RDP, calculation of the first parity block simply XOR the data blocks in that RAID stripe. The second parity block is calculated by a simple XOR of data values across RAID stripes more or less diagonally. Both of these techniques place constraints on the width of a RAID stripe.

Optimizing RAID reconstruction for fewer reads

The paper, A Hybrid Approach of Failed Disk Recovery Using RAID-6 Codes: Algorithms and Performance Evaluation, describes a optimization for reconstruction under multi-dimensional RAID algorithms. The key insight is that with parity calculations that effectively overlap, a clever reconstruction algorithm can use fewer blocks, thus incurring fewer disk reads. As described in the paper, normally when a given disk fails, all remaining data blocks and blocks from the first parity disk are used for reconstruction:

simple recovery
simple recovery

It is, however, possible to read fewer total blocks by taking advantage of the fact that certain blocks can be multiply used. In the reconstruction below, blocks with circles are used for “row” reconstruction, and blocks with squares are used for “diagonal” reconstruction.

optimized recovery
optimized recovery

Note that the simple approach requires reading 36 blocks (none from disk 7) whereas the reconstruction described in the paper requires reading only 27 blocks. This applies generally: the new approach requires 25% fewer blocks to perform the same reconstruction. And the paper includes a method of balancing the reads between disks.

Disappointing results

Unfortunately, optimizing for fewer reads didn’t translate to significant performance improvements in the overall recovery. For RDP it was about 12% better in the best case, but typically closer to 7%. For EVENODD the improvement was less than 5% in all cases. Why? The naïve recovery algorithm streams data off the healthy hard drives, performs a simple computation, and streams good data onto the replacement drive. Streaming is what drives do best – 3.5” or 2.5”, 7200, 10K, or 15K RPM; SATA, SAS, or FC they all stream pretty well. There may be some contention for I/O resources, but either that contention isn’t severe or the “skips” in the I/O patterns interrupt the normal streaming efficiency.

Applicability in flash

There’s another medium, however, that has throughput and IOPS to spare where this RAID reconstruction approach could be highly effective: flash. With SSDs, it’s possible to see throughput that strains the limits of I/O systems; reading less data could be a significant improvement, and the non-contiguous read patterns wouldn’t degrade performance as they do with HDDs. For all-flash arrays, this sort of optimization may be one of many in its class; with a surplus of IOPS, compute, and memory, the RAID algorithms designed for slow disks, slow CPUs, and a dearth of DRAM, may need to be scrapped and rethought.

For a short while, I ran the flash memory strategy at Sun and then Oracle, so I still keep my ear to the ground regarding flash news. That news is often frustratingly light — journalists in the space who are fully capable of providing analysis end up brushing the surface. With a tip of the hat to the FJM crew, here’s my commentary on a recent article.
NetApp has Hybrid Aggregate drives coming, with data moved automatically in real time between flash located next to the spinning disks. The company now says that this is a better technology than PCIe flash approaches.
Sounds interesting. NetApp had previously stacked its chips on a PCIe approach for flash called the performance acceleration module (PAM); I read about it in the same publication. This apparent change of strategy is significant, and I wish that the article would have explored the issue, but it was never mentioned.
NetApp, presenting at an Analyst Day event in New York on 30 June, said that having networked storage move as it were into the host server environment was disadvantageous. This was according to Stifel Nicolaus analyst Aaron Rakers.
1. So is this a quote from NetApp or a quote from an analyst or a quote from NetApp quoting an analyst? I’m confused.
2. This is a dense and interesting statement so allow me to unpack it. Moving storage to the host server is code for Fusion-io. These guys make a flash-laden PCIe card that you put in your compute node for super-fast local data access, and they connect a bunch of them together with an IB backplane to share the contents of different cards between hosts. They recently went public, and customers love the performance they offer over traditional SANs. I assume the term “disadvantageous” was left intentionally vague as those being disadvantaged may be NTAP shareholders rather than customers implementing such a solution.
Manish Goel, NetApp’s product ops EVP, said SSDs used as hard disk drive replacements were not as interesting as using flash at the disk layer in a Hybrid Aggregate drive approach – and this was coming.
An Aggregate is the term NetApp uses for a collection of drives. A Hybrid Aggregate — presumably — is some new thing that mixes HDDs and SSDs. Maybe it’s like Sun’s hybrid storage pool. I would have liked to see Manish Goel’s statement vetted or explained, but that’s all we get.
Flash Cache in the controller is a straightforward array read I/O accelerator. PCIe flash in host servers is a complementary technology but will not decentralise the storage market and move networked storage back into the host servers.
Is this still the NetApp announcement or is this back to the journalism? It’s a new paragraph so I guess it’s the latter. Fusion-io will be happy to learn that it only took a couple of lines to be upgraded from “disadvantageous” to “complementary”. And you may be interested to know why NetApp says that host-based flash is complementary. There’s a vendor out there working with NetApp on a host-based flash PCIe card that NetApp will treat as part of its caching tier, pushing data to the card for fast access by the host. I’d need to dig up my notes from the many vendor roadmaps I saw to recall who is building this, but in the context of a public blog post it’s probably better that I don’t.
NetApp has a patent in this Hybrid Aggregate disk drive area called “Mechanisms for moving data in a Hybrid Aggregate”.
I won’t bore you by reposting the except from the patent, but the broad language of the patent does recall to mind the many recent invalidated NetApp patents…
Surely this is what we all understand as auto-placement of data in a virtual storage pool comprising SSD and fast disk tiers, such as Compellent’s block-level Data Progression? Not so, according to a person close to the situation: “It’s much more automatic, real-time and granular. Compellent needs policies and is not real-time. [NetApp] will be automatic and always move data real-time, rather than retroactively.”
What could have followed this — but didn’t — was a response from a representative from Compellent or someone familiar with their technology. Compellent, EMC, Oracle, and others all have strategies that involve mixing flash memory with conventional hard drives. It’s the rare article that discusses those types of connections. Oracle’s ZFSproducts uses flash as a caching tier, automatically populating it with useful data. Compellent has a clever technique of moving data between storage tiers seamlessly — and customers seem to love it. EMC just hucks a bunch of SSDs into an array — and customers seem to grin and bear it. NetApp’s approach? It’s hard to decipher what it would mean to “move data in real-time, rather than retroactively.” Does that mean that data is moved when it’s written and then never moved again? That doesn’t sound better. My guess is that NetApp’s approach is very much like Compellent’s — something they should be touting rather than parrying. And I’d love to read that article.

This year’s flash memory summit got me thinking about our use of SSDs over the years at Fishworks. The picture of our left is a visual history of SSD evals in rough chronological order from the oldest at the bottom to the newest at the top (including some that have yet to see the light of day).

Early Days

When we started Fishworks, we were inspired by the possibilities presented by ZFS and Thumper. Those components would be key building blocks in the enterprise storage solution that became the 7000 series. An immediate deficiency we needed to address was how to deliver competitive performance using 7,200 RPM disks. Folks like NetApp and EMC use PCI-attached NV-DRAM as a write accelerator. We evaluated something similar, but found the solution lacking because it had limited scalability (the biggest NV-DRAM cards at the time were 4GB), consumed our limited PCIe slots, and required a high-speed connection between nodes in a cluster (e.g. IB, further eating into our PCIe slot budget).

The idea we had was to use flash. None of us had any experience with flash beyond cell phones and USB sticks, but we had the vague notion that flash was fast and getting cheaper. By luck, flash SSDs were just about to be where we needed them. In late 2006 I started evaluating SSDs on behalf of the group, looking for what we would eventually call Logzilla. At that time, SSDs were getting affordable, but were designed primarily for environments such as military use where ruggedness was critical. The performance of those early SSDs was typically awful.


STEC — still Simpletech in those days — realized that their early samples didn’t really suit our needs, but they had a new device (partly due to the acquisition of Gnutech) that would be a good match. That first sample was fibre-channel and took some finagling to get working (memorably it required metric screw of an odd depth), but the Zeus IOPS, an 18GB 3.5″ SATA SSD using SLC NAND, eventually became our Logzilla (we’ve recently updated it with a SAS version for our updated SAS-2 JBODs). Logzilla addressed write performance economically, and scalably in a way that also simplified clustering; the next challenge was read performance.


Intent on using commodity 7,200 RPM drives, we realized that our random read latency would be about twice that of 15K RPM drives (duh). Fortunately, most users don’t access all of their data randomly (regardless of how certain benchmarks are designed). We already had much more DRAM cache than other storage products in our market segment, but we thought that we could extend that cache further by using SSDs. In fact, the invention of the L2ARC followed a slightly different thought process: seeing the empty drive bays in the front of our system (just two were used as our boot disks) and the piles of SSDs laying around, I stuck the SSDs in the empty bays and figured out how we’d use them.

It was again STEC who stepped up to provide our Readzilla, a 100GB 2.5″ SATA SSD using SLC flash.

Next Generation

Logzilla and Readzilla are important features of the Hybrid Storage Pool. For the next generation expect the 7000 series to move away from SLC NAND flash. It was great for the first generation, but other technologies provide better $/IOPS for Logzilla and better $/GB for Readzilla (while maintaining low latency). For Logzilla we think that NV-DRAM is a better solution (I reviewed one such solution here), and for Readzilla MLC flash has sufficient performance at much lower cost and ZFS will be able to ensure the longevity.

As I mentioned in my previous post, I wrote an article about the hybrid storage pool (HSP); that article appears in the recently released July issue of Communications of the ACM. You can find it here. In the article, I talk about a novel way of augmenting the traditional storage stack with flash memory as a new level in the hierarchy between DRAM and disk, as well as the ways in which we’ve adapted ZFS and optimized it for use with flash.

So what’s the impact of the HSP? Very simply, the article demonstrates that, considering the axes of cost, throughput, capacity, IOPS and power-efficiency, HSPs can match and exceed what’s possible with either drives or flash alone. Further, an HSP can be built or modified to address specific goals independently. For example, it’s common to use 15K RPM drives to get high IOPS; unfortunately, they’re expensive, power-hungry, and offer only a modest improvement. It’s possible to build an HSP that can match the necessary IOPS count at a much lower cost both in terms of the initial investment and the power and cooling costs. As another example, people are starting to consider all-flash solutions to get very high IOPS with low power consumption. Using flash as primary storage means that some capacity will be lost to redundancy. An HSP can provide the same IOPS, but use conventional disks to provide redundancy yielding a significantly lower cost.

My hope — perhaps risibly naive — is that HSPs will mean the eventual death of the 15K RPM drive. If it also puts to bed the notion of flash as general purpose mass storage, well, I’d be happy to see that as well.

Recent Posts

February 12, 2017
December 18, 2016
August 9, 2016
August 2, 2016