Adam Leventhal's blog

Tag: DTrace

In previous posts I discussed the problems with the legacy ZFS write throttle that cause degraded performance and wildly variable latencies. I then presented the new OpenZFS write throttle and I/O scheduler that Matt Ahrens and I designed. In addition to solving several problems in ZFS, the new approach was designed to be easy to reason about, measure, and adjust. In this post I’ll cover performance analysis and tuning — using DTrace of course. These details are intended for those using OpenZFS and trying to optimize performance — if you have only a casual interest in ZFS consider yourself warned!

Buffering dirty data

OpenZFS limits the amount of dirty data on the system according to the tunable zfs_dirty_data_max. It’s default value is 10% of memory up to 4GB. The tradeoffs are pretty simple:

Lower Higher
Less memory reserved for use by OpenZFS More memory reserved for use by OpenZFS
Able to absorb less workload variation before throttling Able to absorb more workload variation before throttling
Less data in each transaction group More data in each transaction group
Less time spent syncing out each transaction group More time spent syncing out each transaction group
More metadata written due to less amortization Less metadata written due to more amortization


Most workloads contain variability. Think of the dirty data as a buffer for that variability. Let’s say the LUNs assigned to your OpenZFS storage pool are able to sustain 100MB/s in aggregate. If a workload consistently writes at 100MB/s then only a very small buffer would be required. If instead the workload oscillates between 200MB/s and 0MB/s for 10 seconds each, then a small buffer would limit performance. A buffer of 800MB would be large enough to absorb the full 20 second cycle over which the average is 100MB/s. A buffer of only 200MB would cause OpenZFS to start to throttle writes — inserting artificial delays — after less than 2 seconds during which the LUNs could flush 200MB of dirty data while the client tried to generate 400MB.

Track the amount of outstanding dirty data within your storage pool to know which way to adjust zfs_dirty_data_max:

        this->dp = (dsl_pool_t *)arg0;

/this->dp->dp_spa->spa_name == $$1/
        printf("%4dMB of %4dMB used", this->dp->dp_dirty_total / 1024 / 1024,
            `zfs_dirty_data_max / 1024 / 1024);

# dtrace -s dirty.d pool
dtrace: script 'dirty.d' matched 2 probes
11 8730 txg_sync_thread:txg-syncing 966MB of 4096MB used
0 8730 txg_sync_thread:txg-syncing 774MB of 4096MB used
10 8730 txg_sync_thread:txg-syncing 954MB of 4096MB used
0 8730 txg_sync_thread:txg-syncing 888MB of 4096MB used
0 8730 txg_sync_thread:txg-syncing 858MB of 4096MB used

The write throttle kicks in once the amount of dirty data exceeds zfs_delay_min_dirty_percent of the limit (60% by default). If the the amount of dirty data fluctuates above and below that threshold, it might be possible to avoid throttling by increasing the size of the buffer. If the metric stays low, you may reduce zfs_dirty_data_max. Weigh this tuning against other uses of memory on the system (a larger value means that there’s less memory for applications or the OpenZFS ARC for example).

A larger buffer also means that flushing a transaction group will take longer. This is relevant for certain OpenZFS administrative operations (sync tasks) that occur when a transaction group is committed to stable storage such as creating or cloning a new dataset. If the interactive latency of these commands is important, consider how long it would take to flush zfs_dirty_data_max bytes to disk. You can measure the time to sync transaction groups (recall, there are up to three active at any given time) like this:

/((dsl_pool_t *)arg0)->dp_spa->spa_name == $$1/
        start = timestamp;

/start && ((dsl_pool_t *)arg0)->dp_spa->spa_name == $$1/
        this->d = timestamp - start;
        printf("sync took %d.%02d seconds", this->d / 1000000000,
            this->d / 10000000 % 100);

# dtrace -s duration.d pool
dtrace: script 'duration.d' matched 2 probes
5 8729 txg_sync_thread:txg-synced sync took 5.86 seconds
2 8729 txg_sync_thread:txg-synced sync took 6.85 seconds
11 8729 txg_sync_thread:txg-synced sync took 6.25 seconds
1 8729 txg_sync_thread:txg-synced sync took 6.32 seconds
11 8729 txg_sync_thread:txg-synced sync took 7.20 seconds
1 8729 txg_sync_thread:txg-synced sync took 5.14 seconds

Note that the value of zfs_dirty_data_max is relevant when sizing a separate intent log device (SLOG). zfs_dirty_data_max puts a hard limit on the amount of data in memory that has yet been written to the main pool; at most, that much data is active on the SLOG at any given time. This is why small, fast devices such as the DDRDrive make for great log devices. As an aside, consider the ostensible upgrade that Oracle brought to the ZFS Storage Appliance a few years ago replacing the 18GB “Logzilla” with a 73GB upgrade.

I/O scheduler

Where ZFS had a single IO queue for all IO types, OpenZFS has five IO queues for each of the different IO types: sync reads (for normal, demand reads), async reads (issued from the prefetcher), sync writes (to the intent log), async writes (bulk writes of dirty data), and scrub (scrub and resilver operations). Note that bulk dirty data described above are scheduled in the async write queue. See vdev_queue.c for the related tunables:

uint32_t zfs_vdev_sync_read_min_active = 10;
uint32_t zfs_vdev_sync_read_max_active = 10;
uint32_t zfs_vdev_sync_write_min_active = 10;
uint32_t zfs_vdev_sync_write_max_active = 10;
uint32_t zfs_vdev_async_read_min_active = 1;
uint32_t zfs_vdev_async_read_max_active = 3;
uint32_t zfs_vdev_async_write_min_active = 1;
uint32_t zfs_vdev_async_write_max_active = 10;
uint32_t zfs_vdev_scrub_min_active = 1;
uint32_t zfs_vdev_scrub_max_active = 2;

Each of these queues has tunable values for the min and max number of outstanding operations of the given type that can be issued to a leaf vdev (LUN). The tunable zfs_vdev_max_active limits the number of IOs issued to a single vdev. If its value is less than the sum of the zfs_vdev_*_max_active tunables, then the minimums come into play. The minimum number of each queue will be scheduled and the remainder of zfs_vdev_max_active is issued from the queues in priority order.

At a high level, the appropriate values for these tunables will be specific to your LUNs. Higher maximums lead to higher throughput with potentially higher latency. On some devices such as storage arrays with distinct hardware for reads and writes, some of the queues can be thought of as independent; on other devices such as traditional HDDs, reads and writes will likely impact each other.

A simple way to tune these values is to monitor I/O throughput and latency under load. Increase values by 20-100% until you find a point where throughput no longer increases, but latency is acceptable.

#pragma D option quiet

        start = timestamp;

        ts[args[0]->b_edev, args[0]->b_lblkno] = timestamp;

/ts[args[0]->b_edev, args[0]->b_lblkno]/
        this->delta = (timestamp - ts[args[0]->b_edev, args[0]->b_lblkno]) / 1000;
        this->name = (args[0]->b_flags & (B_READ | B_WRITE)) == B_READ ?
            "read " : "write ";

        @q[this->name] = quantize(this->delta);
        @a[this->name] = avg(this->delta);
        @v[this->name] = stddev(this->delta);
        @i[this->name] = count();
        @b[this->name] = sum(args[0]->b_bcount);

        ts[args[0]->b_edev, args[0]->b_lblkno] = 0;


        normalize(@i, (timestamp - start) / 1000000000);
        normalize(@b, (timestamp - start) / 1000000000 * 1024);

        printf("%-30s %11s %11s %11s %11s\n", "", "avg latency", "stddev",
            "iops", "throughput");
        printa("%-30s %@9uus %@9uus %@9u/s %@8uk/s\n", @a, @v, @i, @b);

# dtrace -s rw.d -c 'sleep 60'

           value  ------------- Distribution ------------- count
              32 |                                         0
              64 |                                         23
             128 |@                                        655
             256 |@@@@                                     1638
             512 |@@                                       743
            1024 |@                                        380
            2048 |@@@                                      1341
            4096 |@@@@@@@@@@@@                             5295
            8192 |@@@@@@@@@@@                              5033
           16384 |@@@                                      1297
           32768 |@@                                       684
           65536 |@                                        400
          131072 |                                         225
          262144 |                                         206
          524288 |                                         127
         1048576 |                                         19
         2097152 |                                         0

           value  ------------- Distribution ------------- count
              32 |                                         0
              64 |                                         47
             128 |                                         469
             256 |                                         591
             512 |                                         327
            1024 |                                         924
            2048 |@                                        6734
            4096 |@@@@@@@                                  43416
            8192 |@@@@@@@@@@@@@@@@@                        102013
           16384 |@@@@@@@@@@                               60992
           32768 |@@@                                      20312
           65536 |@                                        6789
          131072 |                                         860
          262144 |                                         208
          524288 |                                         153
         1048576 |                                         36
         2097152 |                                         0

                               avg latency      stddev        iops  throughput
write                              19442us     32468us      4064/s   261889k/s
read                               23733us     88206us       301/s    13113k/s

Async writes

Dirty data governed by zfs_dirty_data_max is written to disk via async writes. The I/O scheduler treats async writes a little differently than other operations. The number of concurrent async writes scheduled depends on the amount of dirty data on the system. Recall that there is a fixed (but tunable) limit of dirty data in memory. With a small amount of dirty data, the scheduler will only schedule a single operation (zfs_vdev_async_write_min); the idea is to preserve low latency of synchronous operations when there isn’t much write load on the system. As the amount of dirty data increases, the scheduler will push the LUNs harder to flush it out by issuing more concurrent operations.

The old behavior was to schedule a fixed number of operations regardless of the load. This meant that the latency of synchronous operations could fluctuate significantly. While writing out dirty data ZFS would slam the LUNs with writes, contending with synchronous operations and increasing their latency. After the syncing transaction group had completed, there would be a period of relatively low async write activity during which synchronous operations would complete more quickly. This phenomenon was known as “picket fencing” due to the square wave pattern of latency over time. The new OpenZFS I/O scheduler is optimized for consistency.

In addition to tuning the minimum and maximum number of concurrent operations sent to the device, there are two other tunables related to asynchronous writes: zfs_vdev_async_write_active_min_dirty_percent and zfs_vdev_async_write_active_max_dirty_percent. Along with the min and max operation counts (zfs_vdev_async_write_min_active and zfs_vdev_aysync_write_max_active), these four tunables define a piece-wise linear function that determines the number of operations scheduled as depicted in this lovely ASCII art graph excerpted from the comments:

 * The number of concurrent operations issued for the async write I/O class
 * follows a piece-wise linear function defined by a few adjustable points.
 *        |                   o---------| <-- zfs_vdev_async_write_max_active
 *   ^    |                  /^         |
 *   |    |                 / |         |
 * active |                /  |         |
 *  I/O   |               /   |         |
 * count  |              /    |         |
 *        |             /     |         |
 *        |------------o      |         | <-- zfs_vdev_async_write_min_active
 *       0|____________^______|_________|
 *        0%           |      |       100% of zfs_dirty_data_max
 *                     |      |
 *                     |      `-- zfs_vdev_async_write_active_max_dirty_percent
 *                     `--------- zfs_vdev_async_write_active_min_dirty_percent

In a relatively steady state we’d like to see the amount of outstanding dirty data stay in a narrow band between the min and max percentages, by default 30% and 60% respectively.

Tune zfs_vdev_async_write_max_active as described above to maximize throughput without hurting latency. The only reason to increase zfs_vdev_async_write_min_active is if additional writes have little to no impact on latency. While this could be used to make sure data reaches disk sooner, an alternative approach is to decrease zfs_vdev_async_write_active_min_dirty_percent thereby starting to flush data despite less dirty data accumulating.

To tune the min and max percentages, watch both latency and the number of scheduled async write operations. If the operation count fluctuates wildly and impacts latency, you may want to flatten the slope by decreasing the min and/or increasing the max (note below that you will likely want to increase zfs_delay_min_dirty_percent if you increase zfs_vdev_async_write_active_max_dirty_percent — see below).

#pragma D option aggpack
#pragma D option quiet

        self->spa = args[0];
/self->spa && self->spa->spa_name == $$1/
        @ = lquantize(args[1], 0, 30, 1);


        self->spa = 0;

# dtrace -s q.d dcenter

min .--------------------------------. max | count
< 0 : ▃▆ : >= 30 | 23279

min .--------------------------------. max | count
< 0 : █ : >= 30 | 18453

min .--------------------------------. max | count
< 0 : █ : >= 30 | 27741

min .--------------------------------. max | count
< 0 : █ : >= 30 | 3455

min .--------------------------------. max | count
< 0 : : >= 30 | 0

Write delay

In situations where LUNs cannot keep up with the incoming write rate, OpenZFS artificially delays writes to ensure consistent latency (see the previous post in this series). Until a certain amount of dirty data accumulates there is no delay. When enough dirty data accumulates OpenZFS gradually increases the delay. By delaying writes OpenZFS effectively pushes back on the client to limit the rate of writes by forcing artificially higher latency. There are two tunables that pertain to delay: how much dirty data there needs to be before the delay kicks in, and the factor by which that delay increases as the amount of outstanding dirty data increases.

The tunable zfs_delay_min_dirty_percent determines when OpenZFS starts delaying writes. The default is 60%; note that we don’t start delaying client writes until the IO scheduler is pushing out data as fast as it can (zfs_vdev_async_write_active_max_dirty_percent also defaults to 60%).

The other relevant tunable is zfs_delay_scale is really the only magic number here. It roughly corresponds to the inverse of the maximum number of operations per second (denominated in nanoseconds), and is used as a scaling factor.

Delaying writes is an aggressive step to ensure consistent latency. It is required if the client really is pushing more data than the system can handle, but unnecessarily delaying writes degrades overall throughput. There are two goals to tuning delay: reduce or remove unnecessary delay, and ensure consistent delays when needed.

First check to see how often writes are delayed. This simple DTrace one-liner does the trick:

# dtrace -n fbt::dsl_pool_need_dirty_delay:return'{ @[args[1] == 0 ? "no delay" : "delay"] = count(); }'

If a relatively small percentage of writes are delayed, increasing the amount of dirty data allowed (zfs_dirty_data_max) or even pushing out the point at which delays start (zfs_delay_min_dirty_percent). When increasing zfs_dirty_data_max consider the other users of DRAM on the system, and also note that a small amount of small delays does not impact performance significantly.

If many writes are being delayed, the client really is trying to push data faster than the LUNs can handle. In that case, check for consistent latency, again, with a DTrace one-liner:

# dtrace -n delay-mintime'{ @ = quantize(arg2); }'

With high variance or if many write operations are being delayed for the maximum zfs_delay_max_ns (100ms by default) then try increasing zfs_delay_scale by a factor of 2 or more, or try delaying earlier by reducing zfs_delay_min_dirty_percent (remember to also reduce zfs_vdev_async_write_active_max_dirty_percent).

Summing up

Our experience at Delphix tuning the new write throttle has been so much better than in the old ZFS world: each tunable has a clear and comprehensible purpose, their relationships are well-defined, and the issues in tension pulling values up or down are both easy to understand and — most importantly — easy to measure. I hope that this tuning guide helps others trying to get the most out of their OpenZFS systems whether on Linux, FreeBSD, Mac OS X, illumos — not to mention the support engineers for the many products that incorporate OpenZFS into a larger solution.

At the illumos hackathon last week, Robert Mustacchi and I prototyped better support for manipulating user-land structures. As anyone who’s used it knows, DTrace is currently very kernel-centric — this both reflects the reality of how operating systems and DTrace are constructed, and the origins of DTrace itself in the Solaris Kernel Group. Discussions at dtrace.conf(12) this spring prompted me to chart a path to better user-land support. This prototype of copyin-automagic was a first step.

What we implemented was a new ‘user’ keyword to denote that a type is a user-land structure. For example, let’s say we had the address of a 4-byte integer; today we’d access its value using copyin():

this->i = *(int *)copyin(this->addr, sizeof (int));

With our prototype, this gets simpler and more intuitive:

this->i = *(user int *)addr;

The impact is even more apparent when it comes to pointer chasing through structures. Today if we need to get to the third element of a linked list, the D code would look like this:

this->p = (node_t *)copyin(this->addr, sizeof (node_t));
this->p = (node_t *)copyin((uintptr_t)this->p->next, sizeof (node_t));
this->p = (node_t *)copyin((uintptr_t)this->p->next, sizeof (node_t));

Again, it’s much simpler with the user keyword:

this->p = (user node_t *)this->addr;

D programs are compiled into a series of instructions — DIF — that are executed by the code DTrace framework when probes fire. We use the new keyword to generate instructions that load from the address space of the currently executing process rather than that of the kernel.

Adding a new keyword feels a little clunky, and I’m not sure if it’s the right way forward, but it does demonstrate a simpler model for accessing user-land structures, and was a critical first step. We already have three main sources of user-land values; the next steps are to make use of those.

Types for system calls

Arguments to system calls (mostly) have well-known types. Indeed those types are encoded in truss in excruciating and exotic detail. We should educate DTrace about those types. What I’d propose is that we create a single repository of all system call metadata. This could be, for example, and XML file that listed every system call, its name, code, subcode, types, etc. Of course we could use that as the source of type information for the syscall provider, but we could also use that to generate everything from the decoding tables in truss to the libc and kernel stubs for system calls.

As an aside, there are a couple of system calls whose parameter types — ioctl(2) is the obvious example. It would be interesting to assess the utility of an ioctl provider whose probes would be the various codes that are passed as the second parameter. Truss already has this information; why not DTrace?

Types for the pid provider

Another obvious source of type information is the process being traced. When a user specifies the -p or -c option to dtrace(1M) we’re examining a particular process, and that process can have embedded type information. We could use those types and implicitly identify them as belonging to user-space rather than the kernel. Pid provider probes correspond to the entry and return from user-land functions; we could identify the appropriate types for those parameters. We could simplify it further by doing all type handling in libdtrace (in user-land) rather than pushing the types into the kernel.

Types for USDT

User-land statically defined tracing — tracepoints explicitly inserted into code — can already have types associated with them. A first step would be to implicitly identify those types as belonging to user-land. I believe that this could even be done without adversely affecting existing scripts.

Thorny issues

While there are some clear paths forward, there are some tricky issues that remain. In particular that processes can have different data models — 32-bit v. 64-bit — presents a real challenge. Both the width of a load and offsets into structures change depending on the process that’s running. There might be some shortcuts for system calls, and we might be able to constrain the problem for the pid provider by requiring -p or -c, or we might have to compile our D program twice and then choose which version to run based on the data model of the process. In the spirit of the hackathon, Robert and I punted for our ‘user’ keyword prototype, but these problems need to be well understood and sufficiently solved.

Next steps

I’ll be working on some of these problems on the back burner; I’m especially interested in the Grand Unified Syscall Project — an idea I’ve been touting for more years than I care to relate — to bring types to the syscall provider. If you have ideas for user-land tracing with DTrace, or want to work on anything I’ve mention, leave a comment or drop me a note.

DTrace first peered into Java in early 2005 thanks to an early prototype by Jarod Jenson that led eventually to the inclusion of USDT probes in the HotSpot JVM. If you want to see where, say, the method is called, you can simply run this DTrace script:

/copyinstr(arg1, arg2) == "java/net/SocketOutputStream" &&
 copyinstr(arg3, arg4) == "write"/
        jstack(50, 8000);

And that will work as long as you rememember to start your JVM with the -XX:+ExtendedDTraceProbes option or you use the jinfo utility to enable it after the fact. And as long as you don’t mind a crippling performance penalty (hint: you probably do).

Inspired by dtrace.conf a few weeks ago, I wanted to sketch out what the real Java provider would look like:


And check it out:

# -p $(pgrep java) -n 'java${ jstack(50,8000); }'
dtrace: script '/tmp/jdtrace.19092/jdtrace.d' matched 0 probes
CPU     ID                    FUNCTION:NAME
0  64991 Java_com_sun_btrace_BTraceRuntime_dtraceProbe0:event`Java_com_sun_btrace_BTraceRuntime_dtraceProbe0+0xbb
StubRoutines (1)`__1cJJavaCallsLcall_helper6FpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v_+0x21d`__1cCosUos_exception_wrapper6FpFpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v2468_v_+0x27`__1cJJavaCallsMcall_virtual6FpnJJavaValue_nGHandle_nLKlassHandle_nMsymbolHandle_5pnGThread__v_+0x149`__1cMthread_entry6FpnKJavaThread_pnGThread__v_+0x113`__1cKJavaThreadDrun6M_v_+0x2c6`java_start+0x1f2`_thrp_setup+0x9b`_lwp_start

Obviously there’s something fishy going on. First, we’re using perl — the shibboleth of fake-o-ware — and there’s this BTrace stuff in the output.

Faking it with BTrace

BTrace is a dynamic instrumentation tool for Java; it is both inspired by DTrace and contains some DTrace integration. The perl script above takes the DTrace syntax and generates a DTrace script and a BTrace-enabled Java source file.

Like DTrace, BTrace lets you specify the points of instrumentation in your Java program as well as the actions to take. Here’s what our generated source file looks like.

import com.sun.btrace.annotations.*;
import static com.sun.btrace.BTraceUtils.*;
public class jdtrace {
        @OnMethod(clazz="", method="write", location=@Location(Kind.ENTRY))
        public static void probe1(@ProbeClassName String c,
            @ProbeMethodName String m) {
                String name = "entry";
                String p = Strings.strcat(c, Strings.strcat(":",
                    Strings.strcat(m, Strings.strcat(":", name))));
                D.probe(p, "");

Note that we specify where to trace (this can be a regular expression), and then take the action of joining the class, method, and “entry” string into a single string that we pass to the D.probe() method that causes a BTrace USDT probe to fire.

Here’s what the D script looks like:

        this->__jd_arg = copyinstr(arg0);
        this->__jd_mod = strtok(this->__jd_arg, ":");
        this->__jd_func = strtok(NULL, ":");
        this->__jd_name = strtok(NULL, ":");

/((this->__jd_mod == "" &&
 this->__jd_func == "write" &&
 this->__jd_name == "entry"))/

It’s pretty simple. We parse the string that was passed to D.probe(), and disassemble it into the DTrace notion of module, function, and name. We then use that information so that the specified actions are executed as appropriate (we could have specified different Java methods to probe, and different actions to take for each). Here’s the code if you’re interested.

This isn’t the real Java provider, but is it close enough? Unfortunately not. The most glaring problem is that BTrace sometimes renders my Java process unresponsive. Other times it leaves instrumentation behind with no way of extracting it. The word “safe” appears as the third word on the BTrace website (“BTrace is safe”), but apparently there’s still some way to go to achieve the requisite level of safety.

A Better BTrace

BTrace is an interesting tool for examining Java programs, but one obvious obstacle is that the programs are pretty cumbersome to write. With BTrace, we should be able to write a simple one-liner to see where we are when the method is called, but instead we have to write a fairly cumbersome program:

import com.sun.btrace.annotations.*;
import static com.sun.btrace.BTraceUtils.*;
public class TraceWrite {
        @OnMethod(clazz="", method="write", location=@Location(Kind.ENTRY))
        public static void onWrite() {

DTrace-inspired syntax would let users iterate much more quickly:

$ dbtrace -p $(pgrep -n java) -n '{ jstack(); }'

With BTrace, you can trace nearly arbitrary information about a program’s state, but instead of doing something like this:

dbtrace -p $(pgrep -n java) -n '{ printFields(this.impl); }'

You have to do this:

import com.sun.btrace.annotations.*;
import com.sun.btrace.AnyType;
import static com.sun.btrace.BTraceUtils.Reflective.*;
public class TraceWrite {
        @OnMethod(clazz="", method="write", location=@Location(Kind.ENTRY))
        public static void onWrite(@Self Object self) {
                Object impl = get(field(classOf(self), "impl"), self);
$ ./bin/btrace $(pgrep -n java)
{server=null, port=1080, external_address=null, useV4=false, cmdsock=null, cmdIn=null, cmdOut=null, applicationSetProxy=false, timeout=0, trafficClass=0, shut_rd=false, shut_wr=false,, fdUseCount=0, fdLock=java.lang.Object@ab5443, closePending=false, CONNECTION_NOT_RESET=0, CONNECTION_RESET_PENDING=1, CONNECTION_RESET=2, resetState=0, resetLock=java.lang.Object@292936, fd1=null, anyLocalBoundAddr=null, lastfd=-1, stream=false, socket=Socket[addr=/,port=38832,localport=8765], serverSocket=null,, address=/, port=38832, localport=8765, }

BTrace needs a language that enables rapid iteration — piggybacking on Java is holding it back — and it needs some hard safety guarantees. With those, many developers and support engineers would use BTrace as part of their daily work — we certainly would here at Delphix.

Back to DTrace. Even with a useable solution for Java only, the ability to have lightweight and focused tracing for Java (and other dynamic languages) could be highly valuable. We’ll see how far BTrace can take us.

For the second time in as many quadrennial dtrace.confs, I was impressed at how well the unconference format worked out. Sharing coffee with the DTrace community, it was great to see some of the oldest friends of DTrace — Jarod Jenson, Stephen O’Grady, Jonathan Adams to name a few — and to put faces to names — Scott Fritchie, Dustin Sallings, Blake Irvin, etc — of the many new additions to the DTrace community. You can see all the slides and videos; these are my thoughts and notes on the day.

Bryan provided a typically eloquent review of the state of the community. DTrace development is alive and well — after a lull while Oracle’s acquisition of Sun settled in — with new support for a variety of languages and runtimes, and new products that rely heavily on DTrace as a secret sauce. Bryan laid out some important development goals, areas where many have started straying from the edges of the completed DTrace features into the partially complete or starkly missing. We all then set to work hammering out a loose schedule for the day; I’ll admit that at first I was worried that we’d have too many listeners and not enough presenters, but the schedule quickly filled — and with more topics than we’d end up having time to cover.

User-land CTF and Dynamic Translators

DTrace, from its inception, has been a systemic analysis tool, but the earliest development focused on kernel observability — not a surprise since Bryan, Mike, and I developed it while working in the Solaris kernel development. After its use spread (quickly) beyond the kernel team, use shifted more and more to features focused on understanding C and C++ applications in user-land, and then to applications written in a variety of higher-level languages — Java, Ruby, Perl, Javascript, Erlang, etc. User-land Statically Defined Tracing (USDT) is the DTrace facility that enables rich tracing of higher-level languages. It was a relatively late addition to DTrace (integrated in 2004, well after the initial integration in 2003), and since then we’ve learned a lot about what we got right, what we got wrong, and where it’s rough — in some cases very rough — around the edges.

In his opening remarks, Bryan identified USDT improvements as a key area for the community’s focus. In DTrace development we tried to focus on making the impossible possible rather than making the possible easier. In its current form, some things are still impossible with DTrace, namely consumption of type structures from user-land programs; stable, non-privileged use of DTrace; and support for different runtime versions. Dave Pacheco and I took the first  slot on the schedule and spoke (at length — sorry) about solutions to these problems.

While others had the benefit of a bit more time to prepare, I did have the advantage of spending many years idly contemplating the problem space and possible solutions. On the subject of user-land type information (in the form of CTF), I identified the key parts of the code that would would need some work. For the USDT enhancements, we discussed dynamic translators — D code that would be linked and executed at runtime, contrasted with today’s static translators that are compiled into a D program — how they would address the problem, and how these ideas could be extended to the kernel (for once, user-land is actually a bit ahead).

I’ll go into the details of our off the cuff proposals, and delve into the code to firm up those ideas in a future blog post. Beyond the extensive implementation work we laid out, the next step is to gather the most complicated, extant USDT providers and proposals for other providers, and figure out what they should look like in the new, dynamic translator world.

The D Language

Next up, my long-time colleague, DTrace contributor, Eric Schrock led the discussion on D language additions. The format of a D program is heavily tied to DTrace’s implementation: all clauses must trace a fixed amount of data, and infinite loops are forbidden. For this reason, D lacks the backward branches needed for traditional looping, subroutines for common code, and if/else clauses for control flow. Each of these has a work-alike — unrolled loops, macros, and predicates or the ternary operator — but their absence renders D confusing to some — especially those unaware of the motivation. Further, the D language need not necessarily hold the underlying implementation so central.

Eric discussed some proposals for how each might be addressed, and I noted that it would be possible to create a prototype environment where we could try out these “D++” features by compiling into D work-alikes. The next step is to identify the most complicated D scripts, and see what they might look like for various incarnations of those language features.

Work with DTrace

The next few sessions focused not on changes to DTrace, but interesting work done using DTrace:

John Thompson of Sony talked about their port of DTrace to the Playstation Vita (!). Sony developers are given access to DTrace, but found it to be unfamiliar and unapproachable. John spoke his attempts to remedy this by replacing D with a C++-like interface which he implemented by replacing the D compiler with Clang.

My Fishworks colleague, Brendan Gregg, showed some of beautiful visualizations they’ve been developing at Joyent, and talked about the analyses those visualizations enabled. As always, it was fascinating stuff. If you don’t read Brendan’s blog, you really should. Long-time DTrace advocate, Theo Schlossnagle, talked about the visualizations they’re doing in Circonus — also fascinating stuff for anyone thinking about how to present system activity in comprehensible ways. Richard Elling showed the DTrace-based visualizations Nexenta used at VMworld to rave reviews.

Mark Cavage presented Joyent’s work bringing DTrace to node.js; Scott Fritchie talked about DTrace for Erlang. Both were useful sources of ideas for how we could improve USDT.

Ryan Stone presented the state of DTrace on FreeBSD. That DTrace is not enabled in the build by default remains a key obstacle for adoption. I hope that Ryan et al. are able to persuade the FreeBSD leadership that their licensing fears are misguided.

DTrace for OEL

I was delighted that Kris van Hees was able to attend to present the Oracle port to Linux. DTrace for OEL was announced at Oracle Open World 2011, but the initial beta didn’t live up to its billing at OOW. As is often the case, this was more a failure of messaging than of engineering. Kris and his team are making steady progress. While it’s not yet in the public beta, they have the kernel function boundary tracing provider (fbt) implemented. Most heartening of all, Oracle intends to keep DTrace for OEL moving forward as the community evolves and improves DTrace — rather than forking it. How that plays out, and what that means for DTrace on Oracle Solaris will be interesting to see, but it’s great to hear that Kris sees the value of DTrace ubiquity and DTrace compatibility.

As was remarked several times, having DTrace available on the fastest growing deployment platform will be the single most significant accelerator for DTrace adoption. The work Kris and his team at Oracle are doing is probably the most important in the DTrace ecosystem, and I think that I speak for the entire DTrace community in offering to assist in any way possible.

A ZFS DTrace Provider

Matt Ahrens and George Wilson — respectively the co-inventor of ZFS, and the preeminent SPA developer — presented a proposal for a DTrace provider for ZFS. ZFS is a highly sophisticated filesystem, but one that is also difficult to understand. Building in rich instrumentation is going to be a tremendous step forward for anyone using ZFS (for example, our mutual employer, Delphix).

Whither DTrace?

Jarod Jenson — the first DTrace user outside of Sun — took the stage in the final session to talk about DTrace adoption. Jarod has made DTrace a significant part of his business for many years. What continues to amazing him, despite numerous presentations, demonstrations, and lessons, is the relatively low level of DTrace adoption. DTrace is a tool that comes alive in the hands of a skilled, scientific, incisive practitioner — and in all of those, Jarod is superlative — but it can have a high bar of entry. There were many concrete suggestions for how to improve DTrace adoption. Most of them didn’t hold water for me — different avenue for education, further documentation, community outreach, higher level tools, visualizations, etc. — but two were quite compelling: DTrace for Linux, and DTrace on (and the like). I don’t know how much room there is to participate in the former, but by all means if there are DTrace one-liners that solve problems (on Mac OS X for example), post them, and get people covertly using DTrace.

The core DTrace community is growing. It was great to see old friends like Steve Peters who worked on porting DTrace to Mac OS X in the same room as Kris van Hees as he spoke about his port to Linux. It was inspiring to see so many new members of the community, eager to use, build and improve DTrace. And personally it inspired me to get back into the code to finish up some projects I had in flight, and to chart out the course for some of the projects we discussed.

Thanks to everyone who attended dtrace.conf in person or online. And thanks especially to Deirdre Straughan who made it happen.

A few months ago I took DTrace on OEL for a spin after Oracle announced it. The results were ugly; as one of the authors of DTrace, I admit to being shocked by shoddiness of the effort. Yesterday, Oracle dropped an updated beta so I wanted to see how far they’ve come in the 4+ months since that initial false start.

Whither the probes?

Back in October there were 574 functional probes (and 13 more that didn’t work). Here’s the quantitative state of DTrace for OEL today:

[root@screven drivers]# dtrace -l | wc -l

Okay. Steady improvement. By way of unfair comparison, here’s what it looks like on my Mac OS X laptop:

qadi /Users/ahl # dtrace -l | wc -l

What’s new?

Back in October, I tried enabling all system call probes (i.e. all functional probes); the result was that ssh started failing mysteriously. It was a gross violation of the core principles — it would be unacceptable for DTrace to cause harm to the production systems on which it operates. Good to see that Oracle fixed it.

Previously, profile provider probes weren’t working. The profile probes have been removed — you can’t do arbitrary resolution timer-based profiling — but the simple, tick probes are there:

[root@screven drivers]# dtrace -l -n profile:::
ID   PROVIDER            MODULE                          FUNCTION NAME
612    profile                                                     tick-1
613    profile                                                     tick-10
614    profile                                                     tick-100
615    profile                                                     tick-500
616    profile                                                     tick-1000
617    profile                                                     tick-5000

… and seem to work:

[root@screven ~]# dtrace -n 'tick-1{ printf("%Y", walltimestamp); }'
dtrace: description 'tick-1' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0    612                          :tick-1 2012 Feb 23 04:31:27
  0    612                          :tick-1 2012 Feb 23 04:31:28
  0    612                          :tick-1 2012 Feb 23 04:31:29

They’ve also added some inscrutable SDT (statically defined tracing) probes:

[root@screven ~]# dtrace -l -n sdt:::
   ID   PROVIDER            MODULE                          FUNCTION NAME
  597        sdt           vmlinux                    __handle_sysrq -handle_sysrq
  601        sdt           vmlinux                  oom_kill_process oom_kill_process
  602        sdt           vmlinux                   check_hung_task check_hung_task
  603        sdt           vmlinux                   sys_init_module init_module
  604        sdt           vmlinux                 sys_delete_module delete_module
  611        sdt           vmlinux                      signal_fault signal_fault

More usefully, the beta includes a partially implemented proc provider; the proc provider traces high level process activity (check the docs).

[root@screven ~]# dtrace -l -n proc:::
   ID   PROVIDER            MODULE                          FUNCTION NAME
  598       proc           vmlinux                  do_execve_common exec-success
  599       proc           vmlinux                  do_execve_common exec-failure
  600       proc           vmlinux                  do_execve_common exec
  605       proc           vmlinux             get_signal_to_deliver signal-handle
  606       proc           vmlinux                     __send_signal signal-send
  607       proc           vmlinux                           do_exit exit
  608       proc           vmlinux                           do_exit lwp-exit
  609       proc           vmlinux                           do_fork create
  610       proc           vmlinux                           do_fork lwp-create

For reference, here’s what it looks like on DelphixOS, an illumos derivative (which of course includes DTrace):

root@argos:~# dtrace -l -n proc:::
   ID   PROVIDER            MODULE                          FUNCTION NAME
10589       proc              unix                   lwp_rtt_initial lwp-start
10629       proc              unix                   lwp_rtt_initial start
10631       proc              unix                              trap fault
10761       proc           genunix                      sigtimedwait signal-clear
10762       proc           genunix                              psig signal-handle
10763       proc           genunix                         sigtoproc signal-discard
10764       proc           genunix                         sigtoproc signal-send
10831       proc           genunix                        lwp_create lwp-create
10868       proc           genunix                             cfork create
10870       proc           genunix                         proc_exit exit
10871       proc           genunix                          lwp_exit lwp-exit
10872       proc           genunix                         proc_exit lwp-exit
10873       proc           genunix                       exec_common exec-success
10874       proc           genunix                       exec_common exec-failure
10875       proc           genunix                       exec_common exec

Each DTrace probe has arguments that convey information about the activity that caused the probe to fire. For example, with the kernel function boundary tracing (fbt) provider (not yet implemented in OEL), the arguments for the function entry probe correspond to the arguments passed to the function. With static providers such as the proc provider, the parameters include useful information… but I can never seem to remember the types and order. Fortunately, DTrace lets you add in the -v option to get more information about a probe. Unfortunately, this hasn’t been hooked up in Oracle’s port (just an bug, I’m sure):

[root@screven ~]# dtrace -lv -n proc:::signal-send
   ID   PROVIDER            MODULE                          FUNCTION NAME
  606       proc           vmlinux                     __send_signal signal-send

	Probe Description Attributes
		Identifier Names: Private
		Data Semantics:   Private
		Dependency Class: Unknown

	Argument Attributes
		Identifier Names: Evolving
		Data Semantics:   Evolving
		Dependency Class: ISA

	Argument Types
		args[0]: (unknown)
		args[1]: (unknown)
		args[2]: (unknown)
		args[3]: (unknown)
		args[4]: (unknown)
		args[5]: (unknown)
		args[6]: (unknown)
		args[7]: (unknown)
		args[8]: (unknown)
		args[9]: (unknown)
		args[10]: (unknown)
		args[11]: (unknown)
		args[12]: (unknown)
		args[13]: (unknown)
		args[14]: (unknown)
		args[15]: (unknown)
		args[16]: (unknown)
		args[17]: (unknown)
		args[18]: (unknown)
		args[19]: (unknown)
		args[20]: (unknown)
		args[21]: (unknown)
		args[22]: (unknown)
		args[23]: (unknown)
		args[24]: (unknown)
		args[25]: (unknown)
		args[26]: (unknown)
		args[27]: (unknown)
		args[28]: (unknown)
		args[29]: (unknown)
		args[30]: (unknown)
		args[31]: (unknown)

Here’s what it looks like on DelphixOS:

root@argos:~# dtrace -lv -n proc:::signal-send
   ID   PROVIDER            MODULE                          FUNCTION NAME
10764       proc           genunix                         sigtoproc signal-send

        Probe Description Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: Unknown

        Argument Attributes
                Identifier Names: Evolving
                Data Semantics:   Evolving
                Dependency Class: ISA

        Argument Types
                args[0]: lwpsinfo_t *
                args[1]: psinfo_t *
                args[2]: int

Even without the type system being hooked up, you can definitely do some useful work with this beta. For example, I can use the proc provider to look at what commands are executing on my system:

[root@screven ~]# dtrace -n proc:::exec'{ trace(stringof(arg0)); }'
dtrace: description 'proc:::exec' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0    600            do_execve_common:exec   /usr/bin/staprun
  0    600            do_execve_common:exec   /usr/sbin/perf
  0    600            do_execve_common:exec   /bin/uname
  0    600            do_execve_common:exec   /usr/libexec/perf.2.6.39-101.0.1.el6uek.x86_64

On his blog, Wim Coekaerts showed some examples of use of the proc provider that included this common idiom:

        this->pid = *((int *)arg0 + 171);

It’s hard to know where that 171 constant came from or how a user would figure that out. I assume that this is because OEL doesn’t yet have proper types and it’s a hardcoded offset into some structure. Here’s what that would look like on completed DTrace implementations:

        this->pid = args[0]->pr_pid;


There’s a long way to go, but it looks like the folks at Oracle are making progress. It will be interesting to see the source code that goes along with this updated beta — as of this writing, the git repository has not been updated. Personally, I’m eager to see what user-land tracing looks like in the form of the pid provider and USDT. In the tradition of other ports such as Apple’s and FreeBSD’s, I’d invite the Oracle team to present their work at the upcoming DTrace conference, dtrace.conf.

Every once in a rare while our development machines encounter an fatal error during boot because we couldn’t unmount tmpfs. This weekend I cracked the case, so I thought I’d share my uses of boot-time DTrace, and the musty corners of the operating systems that I encountered along the way. First I should explain a little bit about what happens during boot and why we were unmounting a tmpfs filesystem.

Upgrade and boot

When you upgrade your Delphix system from one version to the next, we perserve your configuration. Part of that system configuration lives in SMF, the Service Management Facility, which is stored in the filesystem at /etc/svc. We keep /etc/svc in its own ZFS dataset so that we can snapshot and clone it for an upgrade to save the old data (in case we need to roll back) and keep the settings. This gets a tad tricky because the kernel mounts tmpfs on /etc/svc/volatile to provide scratch space for early processes like init(1); before we can mount on /etc/svc, we have to unmount /etc/svc/volatile. Here’s what that part of our boot script looks like:

# The kernel mounts tmpfs on /etc/svc/volatile so we we need to
# unmount that before mounting the svc dataset on /etc/svc.
umount /etc/svc/volatile
mount -F zfs $base/running/svc /etc/svc
mount -F tmpfs /etc/svc/volatile

The problem

Infrequently — but not never — we’d see that unmount of/etc/svc/volatile fail with EBUSY. The boot script would stop and report the error; subsequent attempts to unmount /etc/svc/volatile would succeed. So there wasn’t much to go on. The tmpfs code did reveal this:

static int
tmp_unmount(struct vfs *vfsp, int flag, struct cred *cr)
        tnp = tm->tm_rootnode;
        if (TNTOV(tnp)->v_count > 1) {
                return (EBUSY);
        for (tnp = tnp->tn_forw; tnp; tnp = tnp->tn_forw) {
                if ((vp = TNTOV(tnp))->v_count > 0) {
                        return (EBUSY);

So someone had an additional hold on either the root of the filesystem or a file in it. I looked at the contents of /etc/svc/volatile and found one file: init.state. Digging through the code for init(1) I was surprised to find that init(1) keeps a state file around with a list of processes of interest. It doesn’t keep the file descriptor open (which would prevent us from unmounting the filesystem), but it does rewrite the file from time to time. I was worried that init(1) might be racing with our script. I didn’t want to understand the brutal compexity of the code, so I amended our boot script to do the following:

pstop 1 # stop init
umount /etc/svc/volatile
mount -F zfs $base/running/svc /etc/svc
mount -F tmpfs /etc/svc/volatile
prun 1 # resume init

If the unmount failed, I’d be able to use pfiles(1) to see if init(1) did, in fact, have something open in /etc/svc/volatile. I was convinced that in trying to observe the problem, I’d chase it away — a Heisenbug — but after a short while of running a reboot loop, we hit the problem, and init(1) didn’t have anything open in /etc/svc/volatile. What next…

Boot-time DTrace

The problem was that by the time I’d get to the system, the conditions that caused the error had resolved themselves. What I wanted to do was panic the system when tmp_unmount() returned EBUSY so that I could poke around with a debugger. On many systems that would entail compiling in debug logic, but fortunately a DTrace-enabled system has a better option. My former colleague Bryan Cantrill invented anonymous DTrace for looking at boot-time issues — earlier in boot than when one could execute the dtrace(1M) command-line utility. To use boot-time DTrace, specify the D program as usual, but add the -A option to add the D program to the DTrace kernel module’s boot-time configuration. After rebooting, DTrace will enable the program whose output can later be retreived with dtrace -a. In my case, I wanted to drop into the kernel debugger when tmp_unmount() returned EBUSY, so I ran DTrace like this:

dtrace -A -w -n 'fbt:tmpfs:tmp_unmount:return/arg1 == EBUSY/{ panic(); }'

Again after many reboots, we hit the problem and dropped into the debugger. Thanks to infrastructure put into the kernel by my colleague Eric Schrock, I was able to quickly see the identity of the file whose reference count prevented us from unmounting tmpfs:

[5]> ffffff0197321b00::print vnode_t v_path
v_path = 0xffffff0197125d60 "/etc/svc/volatile/init-next.state"

It’s worth noting that v_path isn’t guaranteed to be correct, but may reflect some state state. In this case, I examined the directory structure of tmpfs and found that the filename was actually /etc/svc/volatile/init.state — v_path isn’t updated on a rename. But I couldn’t for the life of me figure out who had that file open. None of the (few) other processes were touching the file. I looked through the fsflush code which flushes cached data back to disk, but that didn’t make a lot of sense, and didn’t seem to be causing problems. The pageout thread isn’t supposed to run unless the system is low on memory. I used kmdb’s ::kgrep command to find places where the vnode_t or its associated page were referenced. There were many, and I quickly got lost in the bowels of the VM system. Rather than groveling through the kernel’s structures, I decided to turn back to DTrace. The next question I wanted to answer was this: after tmp_unmount() returns EBUSY, who is it that releases the reference on that tmpfs vnode_t? To answer it, I wrote this D script:

        self->vfs = args[0];

/arg1 == EBUSY/
        gotit = self->vfs;

        self->vfs = NULL;

/gotit != NULL && args[0]->v_vfsp = gotit/

I installed that as my anonymous DTrace enabling, rebooted, and waited.

Who dunnit

Like the Mystery Inc. gang unmasking the criminal, helplessly caught by the elaborate trap, I used the kernel debugger to identify the subsystem to find that it was none other than harmless old Mr. Pageout. Gasp! Why was pageout running at all? The system had plenty of memory so it wouldn’t normally be running except there’s an exception made very very early in boot (it turns out). In the first second after boot, pageout will execute exactly four times in order to fill in certain performance-related parameters that let it predict how long it will take to page out memory in the future. When it executes, pageout will identify unused pages and take a temporary hold on them — this is exactly the pathology at the root of our problem!


I’m still working on exactly how to solve this. The simplest approach would be to sleep for a second before trying the unmount. Slightly more complicated would be to try unmounting in a loop until a second had passed (checking $SECONDS in bash). More complicated still would be to do a rethink of pageout — I still don’t fully understand how it works, but it really seems like it’s making assumptions that have been invalidated in the past decade and contains this gem of a comment:

For now, this thing is in very rough form.

Note that “now” in this case referred to 1987 or possibly earlier — as Roger Faulkner would say, “it came from New Jersey.”


Pageout would have gotten away with it if it hadn’t been for these meddling tools! DTrace during boot is awesome — when you need it, it’s a life saver. There are some places so early in boot that DTrace can’t help; for that VProbes can give you some DTrace-like functionality. And mature systems can have some musty corners so your tools had better be up to the task.

Recent Posts

February 12, 2017
December 18, 2016
August 9, 2016
August 2, 2016