Adam Leventhal's blog

Category: Fishworks

Back at Fishworks, my colleagues had a nickname for me: Adam Leventhal, Hardware Engineer. I wasn’t designing hardware; I wasn’t even particularly more involved with hardware specs. The name referred to my preternatural ability to fit round pegs into square holes, to know when parts would bend but not break (or if they broke how to clean up the evidence), and when a tight fit necessitated a running start.

I first earned the nickname when we got the prototype hardware for what would eventually become the Sun Storage 7410 — part of our initial product line, and our first product to support clustering. When the system arrived and I tried to install a SAS HBA, I encountered my first hardware bug. In the Solaris kernel group I had hit microprocessor bugs, but this was pretty different: the actual sheet metal was designed for cards to drop in horizontally, and the designers hadn’t considered connectors that protruded from a PCI card’s faceplate.

To solve the problem, I had to (carefully) bend the retaining metal supports out of the way, drop in the card, and then try to bend them back into place. I think my colleagues were just impressed that I didn’t break anything.

The hardware team took our feedback and designed a different mechanism for inserting PCI cards.

Science Experiments with SSDs

Another task that fell to Adam Leventhal, Hardware Engineer was conducting the science experiments we needed to verify whether something was a stupid idea or merely a crazy one. Often this took the form of trying to make something fit somewhere it wasn’t supposed to fit. For example, we often had 2.5″ SSDs that we wanted to stick into 3.5″ drive bays to eliminate as many variables as possible when baking off a 2.5″ SSD versus a 3.5″ one. Here are some examples:

some SSDs in a Thumper (SS7210)
an SSD in a Riverwalk (J4400)

The Ice-Cream Sandwich

Another favorite experiment involving SSDs came when we were first investigating Readzilla candidates. We wanted to get as much capacity as we could into the 2.5″ drive bay. The prototypes of the Intel X25-E were only 7mm high, so we speculated that we could make an “ice-cream sandwich” with some sort of chip to present the pair as a single SATA device. Well, we found such a chip, and so I ran the experiment to see what the hardware would look like to our OS and what the performance characteristics would be.

You can see the two Intel SSDs duct-taped together, connected to a power supply in the background and to the test board on the right. The test board has another SATA cable that snakes into the box and connects to the drive connector at the back of the drive bay. Yes, it was a huge pain to connect that final cable; not pictured is the duct tape in the drive bay keeping the SATA cable in place.

The thing worked, but the performance was lousy, and we determined that two drives and some sort of interposer might fit, but it would be like sticking a potato up the tailpipe — all airflow would be blocked.

Conjoined Twins

By far my favorite science project was the conjoined twin Iwashis (SS7110). Iwashi was a stand-alone storage box with an internal SAS HBA that connected to a 16-disk backplane. It turned out, though, that only one of the two SAS connections was needed to see all the disks. Sitting around at lunch one day we had an idea: could we provide high availability for user data by getting a pair of Iwashis and cross-wiring their HBAs to connect to each other’s backplanes? We would then mirror the data (or something) between the two boxes.

Note that the two systems needed to be placed head-to-toe to let the cables reach; take note of a few features in the picture above:

  1. The SAS HBA in the right system…
  2. connects up to the right system’s own backplane…
  3. and to the backplane on the left (note that running between the fan trays was the only option)…
  4. which also connects up to the SAS HBA in the left system.

This all required running with the lid off. Those systems contained a magnetic kill switch — if you removed the lid, the power would shut off. This was — wisely — to ensure proper airflow and to avoid overheating. But it interfered with this (and many other) experiments, so I just unscrewed the magnet from the lid and attached it directly to its mate on the chassis.

I could get the lid onto the left system, but I didn’t want the fan tray lid pinching the SAS cable that ran between the two boxes. To this day, I think that propping up the fan tray lid is the best use of those discarded PCI faceplate fillers.

We scrapped the idea over a variety of concerns, both mundane (we needed both SAS connectors to drive the LEDs for each drive) and fundamental (it was pretty clearly goofy).

Still Hardware Engineering

At Delphix, we’re selling a virtual appliance so the opportunities for Adam Leventhal, Hardware Engineer to shine are fewer and farther between. But hardware engineering has always been more of a state of mind… and there’s still the occasional opportunity to stab at a jumper with a bread knife from the kitchen to generate an NMI and initiate a kernel panic!

I joined the Solaris Kernel Group in 2001 at what turned out to be a remarkable place and time for the industry. More by luck and intuition than by premonition, I found myself surrounded by superlative engineers working on revolutionary technologies that were the products of their own experience and imagination rather than managerial fiat. I feel very lucky to have worked with Bryan and Mike on DTrace; it was amazing that just down the hall our colleagues reinvented the operating system with Zones, ZFS, FMA, SMF and other innovations.

With Solaris 10 behind us, lauded by customers and pundits, I was looking for that next remarkable place and time, and found it with Fishworks. The core dozen or so are some of the finest engineers I could hope to work with, but there were so many others who contributed to the success of the 7000 series: the executives who enabled our fledgling skunkworks in its nascent days, our Solaris colleagues building fundamental technologies like ZFS, COMSTAR, SMF, networking, I/O, and IPS, and the OpenStorage team who toiled to bring a product to market, educating us without deflating our optimism in the process.

I would not trade the last 9 years for anything. There are many engineers who never experience a single such confluence of talent, organizational will, and success; I’m grateful to my colleagues and to Sun for those two opportunities. Now I’m off to look for my next remarkable place and time beyond the walls of Oracle. My last day will be August 20th, 2010.

Thank you to the many readers of this blog. After six years and 130 posts I’d never think of giving it up. You’ll be able to find my new blog at dtrace.org/blogs/ahl (comments to this post are open there); I can’t wait to begin chronicling my next endeavors. You can reach me by email here: my initials at alumni dot brown dot edu. I look forward to your continued comments and emails. Thanks again!

This year’s Flash Memory Summit got me thinking about our use of SSDs over the years at Fishworks. The picture on the left is a visual history of SSD evals in rough chronological order, from the oldest at the bottom to the newest at the top (including some that have yet to see the light of day).

Early Days

When we started Fishworks, we were inspired by the possibilities presented by ZFS and Thumper. Those components would be key building blocks in the enterprise storage solution that became the 7000 series. An immediate deficiency we needed to address was how to deliver competitive performance using 7,200 RPM disks. Folks like NetApp and EMC use PCI-attached NV-DRAM as a write accelerator. We evaluated something similar, but found the solution lacking because it had limited scalability (the biggest NV-DRAM cards at the time were 4GB), consumed our limited PCIe slots, and required a high-speed connection between nodes in a cluster (e.g. IB, further eating into our PCIe slot budget).

The idea we had was to use flash. None of us had any experience with flash beyond cell phones and USB sticks, but we had the vague notion that flash was fast and getting cheaper. By luck, flash SSDs were just about to be where we needed them. In late 2006 I started evaluating SSDs on behalf of the group, looking for what we would eventually call Logzilla. At that time, SSDs were getting affordable, but were designed primarily for environments such as military use where ruggedness was critical. The performance of those early SSDs was typically awful.

Logzilla

STEC — still Simpletech in those days — realized that their early samples didn’t really suit our needs, but they had a new device (partly due to the acquisition of Gnutech) that would be a good match. That first sample was fibre-channel and took some finagling to get working (memorably, it required metric screws of an odd depth), but the Zeus IOPS, an 18GB 3.5″ SATA SSD using SLC NAND, eventually became our Logzilla (we’ve recently updated it with a SAS version for our updated SAS-2 JBODs). Logzilla addressed write performance economically and scalably, in a way that also simplified clustering; the next challenge was read performance.

Readzilla

Intent on using commodity 7,200 RPM drives, we realized that our random read latency would be about twice that of 15K RPM drives (duh). Fortunately, most users don’t access all of their data randomly (regardless of how certain benchmarks are designed). We already had much more DRAM cache than other storage products in our market segment, but we thought that we could extend that cache further by using SSDs. In fact, the invention of the L2ARC followed a slightly different thought process: seeing the empty drive bays in the front of our system (just two were used as our boot disks) and the piles of SSDs lying around, I stuck the SSDs in the empty bays and figured out how we’d use them.
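
The “about twice” figure falls out of simple rotational-latency arithmetic: the average rotational delay is half a revolution, plus an average seek. Here’s a minimal sketch of that back-of-the-envelope math in Python (the seek times are ballpark assumptions for illustration, not measurements of any particular drive):

    # Average rotational latency is half a revolution; the seek times are
    # rough assumed figures for illustration only.
    def avg_read_latency_ms(rpm, avg_seek_ms):
        rotational_ms = (60_000 / rpm) / 2  # half a revolution, in ms
        return rotational_ms + avg_seek_ms

    sata_7200 = avg_read_latency_ms(7_200, avg_seek_ms=8.5)   # ~12.7 ms
    sas_15k = avg_read_latency_ms(15_000, avg_seek_ms=3.5)    # ~5.5 ms
    print(f"7,200 RPM ~{sata_7200:.1f} ms vs. 15K RPM ~{sas_15k:.1f} ms "
          f"({sata_7200 / sas_15k:.1f}x)")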

It was again STEC who stepped up to provide our Readzilla, a 100GB 2.5″ SATA SSD using SLC flash.

Next Generation

Logzilla and Readzilla are important features of the Hybrid Storage Pool. For the next generation, expect the 7000 series to move away from SLC NAND flash. It was great for the first generation, but other technologies provide better $/IOPS for Logzilla and better $/GB for Readzilla (while maintaining low latency). For Logzilla we think that NV-DRAM is a better solution (I reviewed one such solution here), and for Readzilla MLC flash has sufficient performance at much lower cost, and ZFS will be able to ensure its longevity.

On the heels of the 2010.Q1 software release, we’ve provided a new version of the Sun Storage 7000 simulator that can be found at this new location. As noted previously, the simulator is a terrific way to take the Sun Storage 7000 user interface for a spin; it includes the exact same software as a physical 7000 series system with the same features.

The last release of the simulator added support for VirtualBox; this release now removes support for VMware. VMware was tremendously useful for our initial release, but VirtualBox has addressed the problems that initially excluded it, and the maintenance burden of supporting two virtual platforms has led us to drop VMware in this release. Future releases of the simulator will continue to be VirtualBox-only; our hope is that this doesn’t create a significant impediment as VirtualBox is free and readily available.

When I first wrote about triple-parity RAID in ZFS and the Sun Storage 7000 series, I alluded to a looming requirement for triple-parity RAID due to a growing disparity between disk capacity and throughput. I’ve written an article in ACM Queue examining this phenomenon in detail, and making the case for triple-parity RAID. Dominic Kay helped me sift through hard drive data for the past ten years to build a model for how long it takes to fully populate a drive. I’ve reproduced a graph here from the paper that displays the timing data for a few common drive types — the trends are quite clear.

The time to populate a drive is directly relevant for RAID rebuild. As disks in RAID systems take longer to reconstruct, the reliability of the total system decreases due to increased periods running in a degraded state. Today that can be four hours or longer; that could easily grow to days or weeks. RAID-6 grew out of a need for a system more reliable than what RAID-5 could offer. We are approaching a time when RAID-6 is no more reliable than RAID-5 once was. At that point, we will again need to refresh the reliability of RAID, and RAID-7, triple-parity RAID, will become the new standard.
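
To make the trend concrete, here’s a rough sketch of the underlying model: the time to populate a drive is bounded below by capacity divided by sustained throughput, and capacity has historically compounded much faster than throughput. The starting point and growth rates below are illustrative assumptions, not the drive data from the paper:

    # Populate time is bounded below by capacity / sustained throughput.
    # The starting point and growth rates are assumptions for illustration,
    # not the drive data used in the Queue article.
    capacity_gb = 1000.0       # hypothetical ~1 TB drive in year 0
    throughput_mb_s = 100.0    # hypothetical sustained sequential rate
    for year in range(0, 11, 2):
        hours = capacity_gb * 1024 / throughput_mb_s / 3600
        print(f"year {year:2d}: {capacity_gb / 1000:5.1f} TB at "
              f"{throughput_mb_s:4.0f} MB/s -> ~{hours:4.1f} h to populate")
        capacity_gb *= 1.40 ** 2      # assume ~40%/year capacity growth
        throughput_mb_s *= 1.10 ** 2  # assume ~10%/year throughput growth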

Triple-Parity RAID and Beyond
ADAM LEVENTHAL, SUN MICROSYSTEMS
As hard-drive capacities continue to outpace their throughput, the time has come for a new level of RAID. How much longer will current RAID techniques persevere? The RAID levels were codified in the late 1980s; double-parity RAID, known as RAID-6, is the current standard for high-availability, space-efficient storage. The incredible growth of hard-drive capacities, however, could impose serious limitations on the reliability even of RAID-6 systems. Recent trends in hard drives show that triple-parity RAID must soon become pervasive. In 2005, Scientific American reported on Kryder’s law, which predicts that hard-drive density will double annually. While the rate of doubling has not quite maintained that pace, it has been close.

Problematically for RAID, hard-disk throughput has failed to match that exponential rate of growth. Today repairing a high-density disk drive in a RAID group can easily take more than four hours, and the problem is getting significantly more pronounced as hard-drive capacities continue to outpace their throughput. As the time required for rebuilding a disk increases, so does the likelihood of data loss. The ability of hard-drive vendors to maintain reliability while pushing to higher capacities has already been called into question in this magazine. Perhaps even more ominously, in a few years, reconstruction will take so long as to effectively strip away a level of redundancy. What follows is an examination of RAID, the rate of capacity growth in the hard-drive industry, and the need for triple-parity RAID as a response to diminishing reliability.


The Hybrid Storage Pool integrates flash into the storage hierarchy in two specific ways: as a massive read cache and as fast log devices. For read cache devices, Readzillas, there’s no need for redundant configurations; it’s a clean cache so the data necessarily also resides on disk. For log devices, Logzillas, redundancy is essential, but how that translates to their configuration can be complicated. How to decide whether to stripe or mirror?

ZFS intent log devices

Logzillas are used as ZFS intent log devices (slogs in ZFS jargon). For certain synchronous write operations, data is written to the Logzilla so the operation can be acknowledged to the client quickly; the data is streamed out to disk later. Rather than the milliseconds of latency for disks, Logzillas respond in about 100μs. If there’s a power failure or system crash before the data can be written to disk, the log is replayed when the system comes back up, the only scenario in which Logzillas are read. Under normal operation they are effectively write-only. Unlike Readzillas, Logzillas are integral to data integrity: they are relied upon in the case of a system failure.
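
As a toy illustration of the pattern (and emphatically not the actual ZFS implementation), the intent-log write path looks something like this: acknowledge once the record is on the fast log, stream it to disk afterward, and read the log only to replay after a crash.

    # Toy model of an intent-log write path; an illustration of the idea,
    # not the actual ZFS ZIL implementation.
    class ToyIntentLog:
        def __init__(self):
            self.log = []   # stands in for the fast log device (Logzilla)
            self.disk = []  # stands in for the pool of spinning disks

        def sync_write(self, record):
            # Append to the fast log and acknowledge immediately (~100us on
            # a Logzilla versus milliseconds for a disk write).
            self.log.append(record)
            return "acknowledged"

        def flush_to_disk(self):
            # Later, buffered data is streamed out to disk; the log entries
            # are then no longer needed.
            self.disk.extend(self.log)
            self.log.clear()

        def replay_after_crash(self):
            # The only time the log is read: records acknowledged but not
            # yet on disk are replayed at boot.
            self.flush_to_disk()

    zil = ToyIntentLog()
    zil.sync_write("txn-1")     # client gets an ack as soon as this returns
    zil.replay_after_crash()    # after a crash, txn-1 still reaches disk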

A common misconception is that a non-redundant Logzilla configuration introduces a single point of failure into the system. This is not the case, since the data contained on the log devices is also held in system memory. Though that memory is indeed volatile, data loss could only occur if both the Logzilla failed and the system failed within a fairly small time window.
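
A back-of-the-envelope calculation shows why that window matters; the failure rates below are made up purely for illustration:

    # Data loss requires a Logzilla failure *and* a system failure within
    # the short window before the in-memory copy reaches disk. The rates
    # below are made-up figures for illustration only.
    SECONDS_PER_YEAR = 365 * 24 * 3600

    logzilla_afr = 0.02     # assumed 2% annualized failure rate
    system_afr = 0.05       # assumed 5% annualized failure rate
    window_s = 30           # assumed window before data reaches disk

    # Crude approximation: chance the system also fails within the window
    # following a Logzilla failure.
    p_system_in_window = system_afr * window_s / SECONDS_PER_YEAR
    p_both_per_year = logzilla_afr * p_system_in_window
    print(f"~{p_both_per_year:.1e} chance per year of losing "
          f"recently acknowledged writes")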

Logzilla configuration

While a Logzilla doesn’t represent a single point of failure, redundant configurations are still desirable in many situations. The Sun Storage 7000 series implements the Hybrid Storage Pool and offers several different redundant disk configurations. Some of those configurations add a single level of redundancy: mirroring and single-parity RAID. Others provide additional redundancy: triple-mirroring, double-parity RAID, and triple-parity RAID. For disk configurations that provide double disk redundancy or better, the best practice is to mirror Logzillas to achieve a similar level of reliability. For singly redundant disk configurations, non-redundant Logzillas might suffice, but there are conditions, such as a critically damaged JBOD, that could affect both Logzilla and controller more or less simultaneously. Mirrored Logzillas add protection against such scenarios.

Note that the Logzilla configuration screen (pictured) includes a column for No Single Point of Failure (NSPF). Logzillas are never truly a single point of failure, as previously discussed; instead, this column refers to the arrangement of Logzillas in JBODs. A value of true indicates that the configuration is resilient against JBOD failure.

The most important factors to consider when deciding between mirrored and striped Logzillas are the consequences of potential data loss. In the event of a combined Logzilla and controller failure, data will not be corrupted, but the last 5-30 seconds’ worth of transactions could be lost. For example, while it typically makes sense to mirror Logzillas for triple-parity RAID configurations, it may be that the data stored is less important and the implications of data loss aren’t worth the cost of another Logzilla device. Conversely, while a mirrored or single-parity RAID disk configuration provides only a single level of redundancy, the implications of data loss might be such that the redundancy of volatile system memory is insufficient. Just as it’s important to choose the appropriate disk configuration for the right balance of performance, capacity, and reliability, it’s at least as important to take care and gather data to make an informed decision about Logzilla configuration.
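
One way to frame that decision is to put a rough number on what’s actually at risk; again, the figures below are assumptions for illustration:

    # Rough framing of the mirror-vs-stripe decision: how much acknowledged
    # but not-yet-on-disk data is exposed if a Logzilla and the controller
    # fail together? The write rate here is an assumption for illustration.
    sync_write_mb_s = 50    # assumed synchronous write rate
    window_s = 30           # upper end of the 5-30 second window
    data_at_risk_mb = sync_write_mb_s * window_s
    print(f"up to ~{data_at_risk_mb} MB of recent transactions at risk")
    # If losing that much recent data is unacceptable, mirror the Logzillas;
    # if it's tolerable, a second Logzilla may not be worth the cost.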
