Adam Leventhal's blog


Month: August 2004

As Bryan has observed in the past, software has a quality unique among engineering disciplines in that you can build it, but you can’t see it. DTrace changes that by opening windows into parts of the system that were previously unobservable, and it does so in a way that minimally changes what you’re attempting to observe; this software “uncertainty principle” has limited the utility of previous observability tools. One of the darkest areas of debugging in user-land has been around lock contention.

In multi-threaded programs, synchronization primitives (mutexes, R/W locks, semaphores, etc.) are required to coordinate each thread’s efforts and make sure shared data is accessed safely. If many threads are kept waiting while another thread owns a synchronization primitive, the program is said to suffer from lock contention. In the kernel, we’ve had lockstat(1m) for many years, but in user-land, the techniques for observing lock behavior and sorting out the cause, or even the presence, of contention have been very ad hoc.
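To make the definition concrete, here is a minimal Python sketch (not from the original post, and unrelated to how plockstat itself works) in which several threads queue up behind one lock and each records how long it waited. The name measure_contention and its parameters are made up for illustration.

```python
import threading
import time


def measure_contention(num_threads=4, hold_seconds=0.02):
    """Start num_threads workers that all want the same lock.

    Each worker records how long it waited to acquire the lock, then
    holds it for hold_seconds. Returns the wait times, sorted ascending.
    """
    lock = threading.Lock()          # the contended lock
    waits = []
    waits_guard = threading.Lock()   # protects the shared results list

    def worker():
        start = time.perf_counter()
        with lock:
            waited = time.perf_counter() - start
            time.sleep(hold_seconds)  # simulate work done while holding the lock
        with waits_guard:
            waits.append(waited)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(waits)


if __name__ == "__main__":
    for waited in measure_contention():
        print(f"waited {waited * 1000:.1f} ms")
```

The first thread to arrive acquires the lock almost immediately; each later thread waits roughly one more hold period than the one before it, which is the kind of pattern a lock-contention tool is meant to expose.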

the plockstat provider

I just finished work on the plockstat provider for DTrace as well as a new plockstat(1m) command for observing user-land synchronization objects. If you’re unfamiliar with DTrace, you might want to take a quick look at the Solaris Dynamic Tracing Guide (look through it for some examples); that will help ground some of this explanation.

The plockstat provider has these probes:

mutex-acquire fires when a mutex is acquired
mutex-release fires when a mutex is released
mutex-block fires when a thread blocks waiting for a mutex
mutex-spin fires when a thread spins waiting for a mutex
rw-acquire fires when an R/W lock is acquired
rw-release fires when an R/W lock is released
rw-block fires when a thread blocks waiting for an R/W lock

It’s possible with other tools to observe these points, but, as anyone who’s tried it can attest, other tools can alter the effects you’re trying to observe. Traditional debuggers can effectively serialize your parallel program, removing any trace of the lock contention you’d see during a normal run. DTrace and the plockstat provider eliminate this problem.

With the plockstat provider you can answer questions that were previously very difficult to answer, such as “where is my program blocked on mutexes”:

bash-2.05b# dtrace -n plockstat1173:::mutex-block'{ @[ustack()] = count() }'
dtrace: description 'plockstat1173:::mutex-block' matched 2 probes
^C
libc.so.1`mutex_lock_queue+0xa9
libc.so.1`slow_lock+0x3d
libc.so.1`mutex_lock_impl+0xec
libc.so.1`mutex_lock+0x38
libnspr4.so`PR_Lock+0x1a
libnspr4.so`PR_EnterMonitor+0x35
libxpcom.so`__1cGnsPipePGetWriteSegment6MrpcrI_I_+0x3e
libxpcom.so`__1cSnsPipeOutputStreamNWriteSegments6MpFpnPnsIOutputStream_pvpcIIpI_I3I5_I_+0x4f
c4654d3c
libxpcom.so`__1cUnsThreadPoolRunnableDRun6M_I_+0xb0
libxpcom.so`__1cInsThreadEMain6Fpv_v_+0x32
c4ec1d6a
libc.so.1`_thr_setup+0x50
libc.so.1`_lwp_start
1

(any guesses as to what program this might be?)

Not just a new view for DTrace, but a new view for user-land.

the plockstat(1m) command

DTrace is an incredibly powerful tool, but some tasks are so common that we want to make it as easy as possible to use DTrace’s facilities without knowing anything about DTrace. The plockstat(1m) command wraps up a bunch of knowledge about lock contention in a neat and easy to use package:

# plockstat -s 10 -A -p `pgrep locker`
^C
Mutex block
-------------------------------------------------------------------------------
Count     nsec Lock                         Caller
13 22040260 locker`lock1                 locker`go_lock+0x47
nsec ---- Time Distribution --- count Stack
65536 |@@@@@@@@@@@@@@          |     8 libc.so.1`mutex_lock+0x38
131072 |                        |     0 locker`go_lock+0x47
262144 |@@@@@                   |     3 libc.so.1`_thr_setup+0x50
524288 |                        |     0 libc.so.1`_lwp_start
1048576 |                        |     0
2097152 |                        |     0
4194304 |                        |     0
8388608 |                        |     0
16777216 |@                       |     1
33554432 |                        |     0
67108864 |                        |     0
134217728 |                        |     0
268435456 |@                       |     1
...
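The Time Distribution column groups block times into power-of-two buckets, in the style of DTrace's quantize() aggregation. A rough Python sketch of that bucketing follows. The helper names and the sample values are invented for illustration, and the bar scaling is an assumption that happens to match the bar lengths in the output above; it is not taken from plockstat's source.

```python
from collections import Counter


def pow2_bucket(nsec):
    """Return the largest power of two that is <= nsec (nsec must be >= 1)."""
    if nsec < 1:
        raise ValueError("bucket is only defined for positive values")
    return 1 << (nsec.bit_length() - 1)


def distribution(samples):
    """Count how many samples fall into each power-of-two bucket."""
    return Counter(pow2_bucket(s) for s in samples)


def render(dist, width=24):
    """Render one line per bucket from the smallest to the largest seen.

    Assumed scaling: each bar gets count * width // total characters.
    """
    total = sum(dist.values())
    lines = []
    bucket, top = min(dist), max(dist)
    while bucket <= top:
        count = dist.get(bucket, 0)
        bar = "@" * (count * width // total)
        lines.append(f"{bucket:>10} |{bar:<{width}}| {count:>5}")
        bucket <<= 1
    return lines


if __name__ == "__main__":
    # Made-up block times in nanoseconds, chosen to split 8/3/1/1.
    samples = [70_000] * 8 + [300_000] * 3 + [20_000_000, 300_000_000]
    print("\n".join(render(distribution(samples))))
```

Reading the real output the same way: 8 of the 13 blocks lasted between 65536 and 131071 nsec, while one outlier took more than a quarter of a second.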

This has been a bit of a teaser. I only integrated plockstat into Solaris 10 yesterday and it will be a few weeks before you can access plockstat as part of the Solaris Express program, but keep an eye on the DTrace Solaris Express Schedule.

go to the Solaris 10 top 11-20 list for more

Getting back to the business of the Solaris 10 top 11-20, Eric Schrock has written up a great piece on kmdb, the new kernel-mode debugger, which is newly available in Solaris Express 8/04. Check it out.

Since a few people in various forums have been asking about it, I thought I’d explain a little about how Solaris Express works. I know the story best from the kernel side, but keep in mind there are other parts of Solaris — Java, the X server, etc. — that have slightly different processes.

In kernel development we cut a build of Solaris 10 every two weeks; these are numbered s10_XX (for example, Solaris Express 7/04 is s10_60). Those take a week or two to coagulate into the WOS (Wad Of Stuff) which combines the kernel with the latest cut of the X server, gnome, etc. We spend another week or three making sure there’s nothing too toxic in that build and release it in the form of a Solaris Express build. The lag time between when the build cuts and when it hits the streets in a Solaris Express build is usually about 4-6 weeks. We’re about to release Solaris Express 8/04 (s10_63) and we just cut s10_66 on Monday. Note that Solaris Express isn’t some release which we spend extensive time polishing; unless there’s some real tragic problem, you’re using the same bits that I’m using on my desktop. Since we cut a build every 2 weeks, we choose the best, most stable of the two or three builds since the last Solaris Express release, but usually it’s the latest stuff. It can be pretty daunting to know that once you integrate a change into Solaris there’s very little time to make sure it’s right. We take a lot of pride in making sure Solaris is stable not just for every release of Solaris Express, but every numbered build and, in fact, every nightly build.

As far as what to expect in future releases, I have some hints for DTrace here, but other than that, I think you just have to bite your nails and wait for the release notes. I will tell you that SX 9/04 is going to be exciting; check out Stephen’s weblog for why.

As I mentioned, SX 8/04 will be out very soon. Check out my DTrace Solaris Express decoder ring to see what new DTrace features are in this release (hint: -c and -p are way cool). Dan Price has written up a great description of all the stuff that’s new in Solaris Express 8/04.

While trawling through b.s.c., a comment caught my eye in this post from Glenn’s weblog:

As a shareholder, I do NOT want you to “open source” solaris in its entirety (ESPECIALLY DTrace!). I want you to keep the good stuff completely sun-only, accessible only under NDA.

Certainly, this echoes some of the same concern I had when I started hearing rumblings about OpenSolaris — we in Solaris have spent years of our lives making these innovations (ESPECIALLY DTrace!), and we don’t want to see them robbed. I’m also a shareholder and I don’t want to see my investments of time, effort, and — forgive me — money go to waste.

Now that I know more about OpenSolaris and open source in general, I’m confident that Sun isn’t selling out Solaris or giving away the company’s crown jewel, rather we’re going to make Solaris better, and more widely used. That sounds a little Rah Rah Solaris, but let’s look more closely at OpenSolaris and what it might mean for Solaris and for Sun (and for the author of the comment, a shareholder).

Open source is an interesting dichotomy: on one side there are the developers and the community with the spirit of the free trade of software and ideas, and on the other side there are the Linux vendors selling service contracts to fat cat customers. The former is clearly the benefit of OpenSolaris: a larger community of developers and users will improve Solaris and grow its audience. The latter is the potential risk: we’re concerned that other companies might directly steal Sun’s customers by using Sun’s technology. The specifics of the OpenSolaris license haven’t been finalized so it’s possible that the license and patents will prevent Linux vendors from selling technologies developed in Solaris outright. Regardless, Solaris isn’t just a bunch of code, it’s the support and service and documentation and us, the Solaris developers.

When a Sun customer pays for Solaris, they’re paying for someone over here to answer the phone when they call and for me and others in Solaris kernel development to do the things they need. Even when the source code is available, customers will still want to tap into the origins of that code and talk to the people who made it. If there are problems they’ll want to be able to rely on the experts to fix them.

What about documentation? The Solaris Dynamic Tracing Guide is still going to be free, but only as in beer; we’re not open sourcing our documentation (at least as far as I know). So let’s say dtrace.c was dropped into Linux, would they then rewrite the entire answer book (400 pages and counting) from scratch? Maybe this wouldn’t matter much to ordinary users, but if you’re giving some Linux vendor a big sweaty wad of cash to support DTrace on Linux you expect some documentation! The Solaris docs would be close enough for some users, but not customers shelling out the big big dollars for a service contract.

Even if Linux were able to replicate DTrace and document it and a Linux vendor were able to support it, I’m confident the existing and growing Solaris community could keep innovating and push Solaris ahead. On a more personal note, I’m also excited about OpenSolaris because it means if I were ever to leave Sun, I could still work on DTrace, mdb, nohup, and the other parts of Solaris that I consider my own.

OpenSolaris can only help Sun. If it succeeds, there will be a larger community of Solaris developers making it work with more platforms and devices, fixing more problems, and improving the quality of life on Solaris which will spawn an even larger community of Solaris users, both individuals and paying customers; if OpenSolaris fails, then it won’t help to create those communities, and I think that’s the only consequence.

go to the Solaris 10 top 11-20 list for more

p-tools

Since Solaris 7 we’ve included a bunch of process observability tools, the so-called “p-tools”. Some of them inspect aspects of the process as a whole. For example, the pmap(1) command shows you information about a process’s mappings, their location and ancillary information (the associated file, shmid, etc.). pldd(1) is another example; it shows which shared objects a process has opened.

Other p-tools apply to the threads in a process. The pstack(1) utility shows the call stacks for each thread in a process. New in Solaris 10, Eric and Andrei have modified the p-tools that apply to threads so that you can specify the threads you’re interested in rather than having to sift through all of them.

pstack(1)

Developers and administrators often use pstack(1) to see what a process is doing and if it’s making progress. You’ll often turn to pstack(1) after prstat(1) or top(1) shows a process consuming a bunch of CPU time: what’s that guy up to? Complex processes can have many, many threads; fortunately prstat(1)’s -L flag will split out each thread in a process as its own row so you can quickly see that thread 5, say, is the one that’s hammering the processor. Now rather than sifting through all 100 threads to find thread 5, you can just do this:

$ pstack 107/5
100225: /usr/sbin/nscd
-----------------  lwp# 5 / thread# 5  --------------------
c2a0314c nanosleep (c25edfb0, c25edfb8)
08056a96 gethost_revalidate (0) + 4b
c2a02d10 _thr_setup (c2949000) + 50
c2a02ed0 _lwp_start (c2949000, 0, 0, c25edff8, c2a02ed0, c2949000)

Alternatively, you can specify a range of threads (5-7 or 11-) or a combination of ranges (5-7,11-), giving us something like this:

$ pstack 107/5-7,11-
100225: /usr/sbin/nscd
-----------------  lwp# 5 / thread# 5  --------------------
c2a0314c nanosleep (c25edfb0, c25edfb8)
08056a96 gethost_revalidate (0) + 4b
c2a02d10 _thr_setup (c2949000) + 50
c2a02ed0 _lwp_start (c2949000, 0, 0, c25edff8, c2a02ed0, c2949000)
-----------------  lwp# 6 / thread# 6  --------------------
c2a0314c nanosleep (c24edfb0, c24edfb8)
080577d6 getnode_revalidate (0) + 4b
c2a02d10 _thr_setup (c2949400) + 50
c2a02ed0 _lwp_start (c2949400, 0, 0, c24edff8, c2a02ed0, c2949400)
-----------------  lwp# 7 / thread# 7  --------------------
c2a0314c nanosleep (c23edfb0, c23edfb8)
08055f56 getgr_revalidate (0) + 4b
c2a02d10 _thr_setup (c2949800) + 50
c2a02ed0 _lwp_start (c2949800, 0, 0, c23edff8, c2a02ed0, c2949800)
-----------------  lwp# 11 / thread# 11  --------------------
c2a0314c nanosleep (c1fcdf60, c1fcdf68)
0805887d reap_hash (80ca918, 8081140, 807f2f8, 259) + ed
0805292a nsc_reaper (807f92c, 80ca918, 8081140, 807f2f8, c1fcdfec, c2a02d10) + 6d
08055ded getpw_uid_reaper (0) + 1d
c2a02d10 _thr_setup (c20d0800) + 50
c2a02ed0 _lwp_start (c20d0800, 0, 0, c1fcdff8, c2a02ed0, c20d0800)
...
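The range syntax used above (a single id, a closed range like 5-7, an open-ended range like 11-, or a comma-separated mix) can be modeled with a small parser. This is only a Python sketch of the idea, not the p-tools implementation; the function names are hypothetical, and forms the post doesn't show (such as a range with no lower bound) are simply rejected.

```python
def parse_lwp_spec(spec):
    """Parse a spec such as '5-7,11-' into a list of (low, high) pairs.

    high is None for an open-ended range like '11-'.
    Raises ValueError for anything this sketch does not understand.
    """
    ranges = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty range in {spec!r}")
        if "-" in part:
            low_text, high_text = part.split("-", 1)
            low = int(low_text)  # ValueError if the lower bound is missing
            high = int(high_text) if high_text else None
        else:
            low = high = int(part)
        if high is not None and high < low:
            raise ValueError(f"backwards range {part!r}")
        ranges.append((low, high))
    return ranges


def lwp_selected(lwpid, ranges):
    """Return True if lwpid falls inside any of the parsed ranges."""
    return any(
        low <= lwpid and (high is None or lwpid <= high)
        for low, high in ranges
    )


if __name__ == "__main__":
    ranges = parse_lwp_spec("5-7,11-")
    print([lwp for lwp in range(1, 14) if lwp_selected(lwp, ranges)])
```

With the spec 5-7,11- and thread ids 1 through 13, the sketch selects 5, 6, 7, 11, 12, and 13, which lines up with the threads pstack printed above.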

The thread specification syntax also works for core files if you’re just trying to drill down on, say, the thread that caused the fatal problem:

$ pstack core/2
core 'core/2' of 100225:        /usr/sbin/nscd
-----------------  lwp# 2 / thread# 2  --------------------
c2a04888 door     (c28fbdc0, 74, 0, 0, c28fde00, 4)
080540bd ???????? (deadbeee, c28fddec, 11, 0, 0, 8053d33)
c2a0491c _door_return () + bc

truss(1)

The truss(1) utility is the mother of all p-tools. It lets you trace a process’s system calls, faults, and signals as well as user-land function calls. In addition to consuming pretty much every lower- and upper-case command line option, truss(1) now also supports the thread specification syntax. Now you can follow just the threads that are doing something interesting:

$ truss -p 107/5
openat(-3041965, ".", O_RDONLY|O_NDELAY|O_LARGEFILE) = 3
fcntl(3, F_SETFD, 0x00000001)                   = 0
fstat64(3, 0x08047800)                          = 0
getdents64(3, 0xC2ABE000, 8192)                 = 8184
brk(0x080721C8)                                 = 0
...

pbind(1)

The pbind(1) utility isn’t an observability tool, rather this p-tool binds a process to a particular CPU so that it will only run on that CPU (except in some unusual circumstances; see the man page for details). For multi-threaded processes, the process is clearly not the right granularity for this kind of activity — you want to be able to bind this thread to that CPU, and those threads to some other CPU. In Solaris 10, that’s a snap:

$ pbind -b 1 107/2
lwp id 107/2: was not bound, now 1
$ pbind -b 0 107/2-5
lwp id 107/2: was 1, now 0
lwp id 107/3: was not bound, now 0
lwp id 107/4: was not bound, now 0
lwp id 107/5: was not bound, now 0

These are perfect examples of Solaris responding to requests from users: there was no easy way to solve these problems, and that was causing our users pain, so we fixed it. After the BOF at OSCON, a Solaris user had a laundry list of problems and requests, and was skeptical about our interest in fixing them, but I convinced him that we do care, but we need to hear about them. So let’s hear about your gripes and wish lists for Solaris. Many of the usability features (the p-tools for example) came out of our own use of Solaris in kernel development. Once OpenSolaris lets everyone be a Solaris kernel developer, I’m sure we’ll be stumbling onto many more quality of life tools like pstack(1), truss(1), and pbind(1).

This evening I was waiting for my bags in SFO (having tacked on an ultimate frisbee tournament in Portland after attending OSCON) when I noticed someone eyeing me suspiciously. “You work on DTrace, right? I was at your BOF the other night; DTrace looks great!” I had assumed he was glaring because another player had opened my lip with his elbow, and I was unshowered from the day’s games — I was pleasantly surprised he had identified me as one of the DTrace three. His buddy who hadn’t been at the BOF lit up, “DTrace looks really cool” (he then asked if he could touch me). These guys were both very enthused about DTrace, and while I was planning to explain away my fat lip as the trophy from a barroom brawl over DTrace v. DProbes, they seemed too pleased to be talking about DTrace to bring it up. I’ve been thinking a bunch about what we need to do to build a community around DTrace and Solaris — this made me think we’re doing something right.

In unrelated news, a few more Solaris kernel engineers are on b.s.c; check these out: Liane Praza, Dan Price, Dilpreet Bindra, Andrei “Dr. Dre” Dorofeev
