Adam Leventhal's blog

Tag: SSD

This series of posts covers APFS, Apple’s new filesystem announced at WWDC 2016. See the first post for the table of contents.

Performance

APFS claims to be optimized for flash. Flash memory (NAND) is the stuff in your speedy SSD. Apple changed the computing industry when it put flash into the iPod and iPhone, products whose sales volumes fundamentally changed the economics of flash. This consumer change impacted the enterprise (as it often does), giving rise to hybrid and all-flash arrays. Ten years ago flash cost as much as DRAM; now it’s challenging the economics of hard disks.

SSDs mimic the block interface of conventional hard drives, but the underlying technology is completely different. In particular, while magnetic media can read or write sectors arbitrarily, flash erases large chunks (blocks) and reads and writes smaller chunks (pages). The management is done by what’s called the flash translation layer (FTL), software that makes blocks and pages appear more like a hard drive. An FTL is very similar to a file system, creating a virtual mapping (a translation) between block addresses and locations within the media. Apple controls the full stack including the SSD, FTL, and file system; they could have built something differentiated, optimizing these components to work together. What APFS does, however, is simply write in patterns known to be more easily handled by NAND. It’s a file system with flash-aware characteristics rather than one written explicitly for the native flash interfaces, more or less what you’d expect in 2016.
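To make the translation idea concrete, here is a minimal sketch of a toy FTL. It is not how any real SSD firmware works; the class name, the sizes, and the append-only allocation policy are all assumptions chosen for illustration. It shows the one property that matters: an overwrite never updates flash in place, it writes a fresh page and remaps the logical address, leaving a stale page behind that can only be reclaimed by erasing its whole block.

```python
class ToyFTL:
    """Toy flash translation layer (hypothetical, for illustration only).

    Logical pages are mapped to physical pages. A physical page can be
    written once; it becomes reusable only when its whole block is erased.
    """

    def __init__(self, num_blocks=4, pages_per_block=4):
        self.pages_per_block = pages_per_block
        self.num_pages = num_blocks * pages_per_block
        self.flash = [None] * self.num_pages  # physical page contents
        self.mapping = {}                     # logical page -> physical page
        self.stale = set()                    # physical pages holding dead data
        self.next_free = 0                    # next never-written physical page

    def write(self, logical_page, data):
        if self.next_free >= self.num_pages:
            # A real FTL would garbage collect here; this sketch does not.
            raise RuntimeError("no free pages left")
        old = self.mapping.get(logical_page)
        if old is not None:
            self.stale.add(old)               # overwrite = remap, never in place
        self.flash[self.next_free] = data
        self.mapping[logical_page] = self.next_free
        self.next_free += 1

    def read(self, logical_page):
        return self.flash[self.mapping[logical_page]]

    def erasable_blocks(self):
        """Blocks whose pages are all stale, so erasing them loses nothing."""
        num_blocks = self.num_pages // self.pages_per_block
        result = []
        for b in range(num_blocks):
            start = b * self.pages_per_block
            pages = range(start, start + self.pages_per_block)
            if all(p in self.stale for p in pages):
                result.append(b)
        return result
```

Writing the same logical page repeatedly walks forward through physical pages, which is why an FTL wants spare space: the more free pages it has, the longer it can defer the expensive erase.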

Also on the topic of flash, APFS includes TRIM support. TRIM is a command in the ATA protocol that allows a file system to indicate to an SSD (specifically, its FTL) that some space has been freed. SSDs require significant free space and perform better when there’s more of it; they include more physical space than they advertise. For example, my 1TB SSD includes 1TB (2^40 = 1024^4) bytes of flash but only reports 931GB of available space, sneakily matching the storage industry’s self-serving definition of 1TB (1000^4 = 1 trillion bytes). With more free space, FTLs can trade off space efficiency for performance and longevity. TRIM has become expected of file systems; it’s unsurprising that APFS supports it. The problem with TRIM though is that it’s only useful when there’s free space: it’s something of a benchmark special. Once your disk is mostly full (as mine are in my laptop and phone basically at all times) TRIM doesn’t do anything for you. I doubt that TRIM will bring any discernible benefit for APFS users beyond the placebo effect of feature parity.
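The capacity arithmetic in that example is easy to check. The sketch below only restates the two definitions of a terabyte from the text; the variable names are mine, and the "spare" figure assumes the drive really carries exactly 2^40 bytes of raw flash, as described above.

```python
TIB = 1024 ** 4  # binary terabyte: 2**40 bytes of raw flash
TB = 1000 ** 4   # storage-industry terabyte: 1 trillion bytes
GIB = 1024 ** 3  # binary gigabyte, the unit the OS reports as "GB"

# What the OS reports when the drive exposes 1000**4 bytes.
advertised_gib = TB / GIB

# Flash the FTL holds back for itself, in bytes and as a fraction of raw flash.
spare_bytes = TIB - TB
spare_fraction = spare_bytes / TIB

print(f"advertised: {advertised_gib:.1f} GiB")
print(f"spare: {spare_bytes:,} bytes ({spare_fraction:.1%} of raw flash)")
```

So roughly 9% of the raw flash is never visible to the file system, and that is the headroom the FTL always has even when the disk looks full.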

APFS also focuses on latency; Apple’s number one goal is to avoid the beachball of doom. APFS addresses this with I/O QoS (quality of service) to prioritize accesses that are immediately visible to the user over background activity that doesn’t have the same time-constraints. This is inarguably a benefit to users and a sophisticated file system capability.
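Apple hasn't published how APFS implements this, so the following is only a sketch of the general idea of I/O prioritization, not of APFS itself. The priority classes, the `IOScheduler` name, and the request strings are all hypothetical. Requests are kept in a heap ordered by priority class, with a sequence number so that requests in the same class are served in arrival order.

```python
import heapq
import itertools

# Hypothetical priority classes: lower number is served first.
USER_INTERACTIVE = 0
BACKGROUND = 1


class IOScheduler:
    """Strict-priority I/O queue (illustrative only)."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker: FIFO within a class

    def submit(self, priority, request):
        heapq.heappush(self._heap, (priority, next(self._seq), request))

    def next_request(self):
        _priority, _seq, request = heapq.heappop(self._heap)
        return request


sched = IOScheduler()
sched.submit(BACKGROUND, "index files for search")
sched.submit(USER_INTERACTIVE, "open document")
sched.submit(BACKGROUND, "backup snapshot")

order = [sched.next_request() for _ in range(3)]
print(order)
```

The user-visible read jumps ahead of the background work that was queued before it. A strict ordering like this can starve background requests under sustained interactive load, so a production scheduler would need some form of aging or bandwidth reservation.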

 

Next in this series: Data Integrity

Back at Fishworks, my colleagues had a nickname for me: Adam Leventhal, Hardware Engineer. I wasn’t designing hardware; I wasn’t even particularly more involved with hardware specs. The name referred to my preternatural ability to fit round pegs into square holes, to know when parts would bend but not break (or if they broke how to clean up the evidence), and when a tight fit necessitated a running start.

I first earned the nickname when we got the prototype hardware for what would eventually become the Sun Storage 7410, part of our initial product line and our first product to support clustering. When the system arrived, and I tried to install a SAS HBA, I encountered my first hardware bug. In the Solaris kernel group I had hit microprocessor bugs, but this was pretty different: the actual sheet metal was designed for cards to drop in horizontally, and the designers hadn’t considered connectors that protruded from a PCI card’s faceplate.

To solve the problem, I had to (carefully) bend back the retaining metal supports, drop in the card, and then try to bend them back. I think my colleagues were just impressed that I didn’t break anything.

The hardware team took our feedback and designed a different mechanism for inserting PCI cards.

Science Experiments with SSDs

Another task that fell to Adam Leventhal, Hardware Engineer was conducting the science experiments we needed to determine whether something was a stupid idea or merely a crazy one. Often this took the form of trying to make something fit somewhere it wasn’t supposed to fit. For example, we often had 2.5″ SSDs that we wanted to stick into 3.5″ drive bays to eliminate as many variables as possible when baking off a 2.5″ SSD versus a 3.5″ one. Here are some examples:

some SSDs in a Thumper (SS7200)
an SSD in a Riverwalk (J4400)

The Ice-Cream Sandwich

Another favorite experiment involving SSDs came when we were first investigating Readzilla candidates. We wanted to get as much capacity as we could in the 2.5″ drive bay. The prototypes of the Intel X25-E were only 7mm high so we speculated that we could make an “ice-cream sandwich” with some sort of chip to present them as a single SATA device. Well, we found such a chip, and so I ran the experiment to see what the hardware would look like to our OS and what the performance characteristics would be.

You can see the two Intel SSDs duct-taped together, and connected to a power supply in the background and the test board on the right. The test board has another SATA cable that snakes into the box and connects where the drive connector is at the back of the drive bay. Yes, it was a huge pain to connect that final cable; not pictured is the duct tape in the drive bay to keep the SATA cable in place.

The thing worked, but the performance was lousy, and we determined that two drives and some sort of interposer might fit, but it would be like sticking a potato up the tailpipe — all airflow would be blocked.

Conjoined Twins

By far my favorite science project was the conjoined twin Iwashis (SS7110). Iwashi was a stand-alone storage box with an internal SAS HBA that connected to a 16-disk backplane. It turned out, though, that only one of the two SAS connections was needed to see all the disks. Sitting around at lunch one day we had an idea: could we provide high availability for user data by getting a pair of Iwashis and cross-wiring their HBAs to connect to each other’s backplanes? We would then mirror the data (or something) between the two boxes.

Note that the two systems needed to be placed head-to-toe in order to let the cables reach; take note of a few features in the picture above:

  1. The SAS HBA in the right system…
  2. connects up to the right system’s own backplane…
  3. and to the backplane on the left (note that running between the fan trays was the only option)…
  4. which also connects up to the SAS HBA in the left system.

This all required running with the lid off. Those systems contained a magnetic kill switch — if you removed the lid, the power would shut off. This was — wisely — to ensure proper airflow and to avoid overheating. But this interfered with this (and many other) experiments, so I just unscrewed the magnet from the lid and let it connect directly with its main chassis mate.

I could get the lid onto the left system, but I didn’t want the fan tray lid pinching the SAS cable that ran between the two boxes. To this day, I think that propping up the fan tray lid is the best use of those discarded PCI faceplate fillers.

We scrapped this idea over a variety of concerns, both mundane (we needed both SAS connectors to drive the LEDs for each drive) and fundamental (it was pretty clearly goofy).

Still Hardware Engineering

At Delphix, we’re selling a virtual appliance so the opportunities for Adam Leventhal, Hardware Engineer to shine are fewer and farther between. But hardware engineering has always been more of a state of mind… and there’s still the occasional opportunity to stab at a jumper with a bread knife from the kitchen to generate an NMI and initiate a kernel panic!

For a short while, I ran the flash memory strategy at Sun and then Oracle, so I still keep my ear to the ground regarding flash news. That news is often frustratingly light — journalists in the space who are fully capable of providing analysis end up brushing the surface. With a tip of the hat to the FJM crew, here’s my commentary on a recent article.
NetApp has Hybrid Aggregate drives coming, with data moved automatically in real time between flash located next to the spinning disks. The company now says that this is a better technology than PCIe flash approaches.
Sounds interesting. NetApp had previously stacked its chips on a PCIe approach for flash called the performance acceleration module (PAM); I read about it in the same publication. This apparent change of strategy is significant, and I wish that the article had explored the issue, but it was never mentioned.
NetApp, presenting at an Analyst Day event in New York on 30 June, said that having networked storage move as it were into the host server environment was disadvantageous. This was according to Stifel Nicolaus analyst Aaron Rakers.
1. So is this a quote from NetApp or a quote from an analyst or a quote from NetApp quoting an analyst? I’m confused.
2. This is a dense and interesting statement so allow me to unpack it. Moving storage to the host server is code for Fusion-io. These guys make a flash-laden PCIe card that you put in your compute node for super-fast local data access, and they connect a bunch of them together with an IB backplane to share the contents of different cards between hosts. They recently went public, and customers love the performance they offer over traditional SANs. I assume the term “disadvantageous” was left intentionally vague as those being disadvantaged may be NTAP shareholders rather than customers implementing such a solution.
Manish Goel, NetApp’s product ops EVP, said SSDs used as hard disk drive replacements were not as interesting as using flash at the disk layer in a Hybrid Aggregate drive approach – and this was coming.
An Aggregate is the term NetApp uses for a collection of drives. A Hybrid Aggregate — presumably — is some new thing that mixes HDDs and SSDs. Maybe it’s like Sun’s hybrid storage pool. I would have liked to see Manish Goel’s statement vetted or explained, but that’s all we get.
Flash Cache in the controller is a straightforward array read I/O accelerator. PCIe flash in host servers is a complementary technology but will not decentralise the storage market and move networked storage back into the host servers.
Is this still the NetApp announcement or is this back to the journalism? It’s a new paragraph so I guess it’s the latter. Fusion-io will be happy to learn that it only took a couple of lines to be upgraded from “disadvantageous” to “complementary”. And you may be interested to know why NetApp says that host-based flash is complementary. There’s a vendor out there working with NetApp on a host-based flash PCIe card that NetApp will treat as part of its caching tier, pushing data to the card for fast access by the host. I’d need to dig up my notes from the many vendor roadmaps I saw to recall who is building this, but in the context of a public blog post it’s probably better that I don’t.
NetApp has a patent in this Hybrid Aggregate disk drive area called “Mechanisms for moving data in a Hybrid Aggregate”.
I won’t bore you by reposting the excerpt from the patent, but the broad language of the patent does call to mind the many recent invalidated NetApp patents…
Surely this is what we all understand as auto-placement of data in a virtual storage pool comprising SSD and fast disk tiers, such as Compellent’s block-level Data Progression? Not so, according to a person close to the situation: “It’s much more automatic, real-time and granular. Compellent needs policies and is not real-time. [NetApp] will be automatic and always move data real-time, rather than retroactively.”
What could have followed this, but didn’t, was a response from a representative from Compellent or someone familiar with their technology. Compellent, EMC, Oracle, and others all have strategies that involve mixing flash memory with conventional hard drives. It’s the rare article that discusses those types of connections. Oracle’s ZFS products use flash as a caching tier, automatically populating it with useful data. Compellent has a clever technique of moving data between storage tiers seamlessly, and customers seem to love it. EMC just hucks a bunch of SSDs into an array, and customers seem to grin and bear it. NetApp’s approach? It’s hard to decipher what it would mean to “move data in real-time, rather than retroactively.” Does that mean that data is moved when it’s written and then never moved again? That doesn’t sound better. My guess is that NetApp’s approach is very much like Compellent’s, something they should be touting rather than parrying. And I’d love to read that article.

This year’s flash memory summit got me thinking about our use of SSDs over the years at Fishworks. The picture on the left is a visual history of SSD evals in rough chronological order from the oldest at the bottom to the newest at the top (including some that have yet to see the light of day).

Early Days

When we started Fishworks, we were inspired by the possibilities presented by ZFS and Thumper. Those components would be key building blocks in the enterprise storage solution that became the 7000 series. An immediate deficiency we needed to address was how to deliver competitive performance using 7,200 RPM disks. Folks like NetApp and EMC use PCI-attached NV-DRAM as a write accelerator. We evaluated something similar, but found the solution lacking because it had limited scalability (the biggest NV-DRAM cards at the time were 4GB), consumed our limited PCIe slots, and required a high-speed connection between nodes in a cluster (e.g. IB, further eating into our PCIe slot budget).

The idea we had was to use flash. None of us had any experience with flash beyond cell phones and USB sticks, but we had the vague notion that flash was fast and getting cheaper. By luck, flash SSDs were just about to be where we needed them. In late 2006 I started evaluating SSDs on behalf of the group, looking for what we would eventually call Logzilla. At that time, SSDs were getting affordable, but were designed primarily for environments such as military use where ruggedness was critical. The performance of those early SSDs was typically awful.

Logzilla

STEC (still Simpletech in those days) realized that their early samples didn’t really suit our needs, but they had a new device (partly due to the acquisition of Gnutech) that would be a good match. That first sample was fibre-channel and took some finagling to get working (memorably it required a metric screw of an odd depth), but the Zeus IOPS, an 18GB 3.5″ SATA SSD using SLC NAND, eventually became our Logzilla (we’ve recently updated it with a SAS version for our updated SAS-2 JBODs). Logzilla addressed write performance economically and scalably in a way that also simplified clustering; the next challenge was read performance.

Readzilla

Intent on using commodity 7,200 RPM drives, we realized that our random read latency would be about twice that of 15K RPM drives (duh). Fortunately, most users don’t access all of their data randomly (regardless of how certain benchmarks are designed). We already had much more DRAM cache than other storage products in our market segment, but we thought that we could extend that cache further by using SSDs. In fact, the invention of the L2ARC followed a slightly different thought process: seeing the empty drive bays in the front of our system (just two were used as our boot disks) and the piles of SSDs lying around, I stuck the SSDs in the empty bays and figured out how we’d use them.
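The "about twice" figure falls out of the rotational speed alone. As a rough sketch (it ignores seek time and transfer time, which also contribute to random read latency, and the function name is mine), the average rotational delay is the time for half a revolution:

```python
def avg_rotational_latency_ms(rpm):
    """Average wait for the target sector: half of one revolution, in ms."""
    ms_per_revolution = 60_000 / rpm
    return 0.5 * ms_per_revolution


lat_7200 = avg_rotational_latency_ms(7_200)
lat_15k = avg_rotational_latency_ms(15_000)

print(f"7,200 RPM: {lat_7200:.2f} ms")
print(f"15K RPM:   {lat_15k:.2f} ms")
print(f"ratio:     {lat_7200 / lat_15k:.2f}x")
```

That works out to a little over 4 ms against 2 ms, a ratio of just over two, which is the gap a cache of flash between DRAM and disk is meant to hide.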

It was again STEC who stepped up to provide our Readzilla, a 100GB 2.5″ SATA SSD using SLC flash.

Next Generation

Logzilla and Readzilla are important features of the Hybrid Storage Pool. For the next generation expect the 7000 series to move away from SLC NAND flash. It was great for the first generation, but other technologies provide better $/IOPS for Logzilla and better $/GB for Readzilla (while maintaining low latency). For Logzilla we think that NV-DRAM is a better solution (I reviewed one such solution here), and for Readzilla MLC flash has sufficient performance at much lower cost and ZFS will be able to ensure the longevity.
