July 2004 – Page 3 – Adam Leventhal's blog

Number 11 of 20: libumem

July 13, 2004

go to the Solaris 10 top 11-20 list for more

libumem

In Solaris 2.4 we replaced the old buddy allocator¹ with the slab allocator² invented by Jeff Bonwick. The slab allocator is covered in pretty much every operating systems textbook, and that’s because most operating systems now use it. In Solaris 10³, Jonathan Adams brought the slab allocator to user-land in the form of libumem⁴.

Getting started with libumem is easy; just do the linker trick of setting LD_PRELOAD to “libumem.so” and any program you execute will use libumem’s malloc(3C) and free(3C) (or new and delete if you’re into that sort of thing). Alternatively, if you like what you see, you can start linking your programs against libumem by passing -lumem to your compiler or linker. But I’m getting ahead of myself; why is libumem so great?

Scalability

The slab allocator is designed for systems with many threads and many CPUs. Memory allocation with naive allocators can be a serious bottleneck (in fact we recently used DTrace to find such a bottleneck; using libumem got us a 50% improvement). There are other highly scalable allocators out there, but libumem is about the same or better in terms of performance, has compelling debugging features, and it’s free and fully supported by Sun.

Debugging

The scalability and performance are impressive, but not unique to libumem; where libumem really sets itself apart is in debugging. If you’ve ever spent more than 20 seconds debugging heap corruption or chasing down a memory leak, you need libumem. Once you’ve used libumem it’s hard to imagine debugging this sort of problem without it.

You can use libumem to find double-frees, use-after-free, and many other problems, but my favorite is memory leaks. Memory leaks can really be a pain, especially in large systems; libumem makes leaks easy to detect and easy to diagnose. Here’s a simple example:

$ LD_PRELOAD=libumem.so
$ export LD_PRELOAD
$ UMEM_DEBUG=default
$ export UMEM_DEBUG
$ /usr/bin/mdb ./my_leaky_program
> ::sysbp _exit
> ::run
mdb: stop on entry to _exit
mdb: target stopped at:
libc.so.1`exit+0x14:    ta        8
mdb: You've got symbols!
mdb: You've got symbols!
Loading modules: [ ld.so.1 libumem.so.1 libc.so.1 ]
> ::findleaks
CACHE     LEAKED   BUFCTL CALLER
0002c508       1 00040000 main+4
----------------------------------------------------------------------
Total       1 buffer, 24 bytes
> 00040000::bufctl_audit
ADDR  BUFADDR    TIMESTAMP THR  LASTLOG CONTENTS    CACHE     SLAB     NEXT
DEPTH
00040000 00039fc0 3e34b337e08ef   1 00000000 00000000 0002c508 0003bfb0 00000000
5
libumem.so.1`umem_cache_alloc+0x13c
libumem.so.1`umem_alloc+0x60
libumem.so.1`malloc+0x28
main+4
_start+0x108

Obviously, this is a toy leak, but you get the idea, and it’s really that simple to find memory leaks. Other utilities exist for debugging memory leaks, but they dramatically impact performance (to the point where it’s difficult to actually run the thing you’re trying to debug), and can omit or incorrectly identify leaks. Do you have a memory leak today? Go download Solaris Express, slap your app on it and run it under libumem. I’m sure it will be well worth the time spent.

You can use other mdb dcmds like ::umem_verify to look for corruption. The kernel versions of these dcmds are described in the Solaris Modular Debugger Guide today; we’ll be updating the documentation for Solaris 10 to describe all the libumem debugging commands.

Programmatic Interface

In addition to offering the well-known malloc() and free(), libumem also has a programmatic interface for creating your own object caches backed by the heap or memory mapped files or whatever. This offers additional flexibility and precision and allows you to further optimize your application around libumem. Check out the man pages for umem_alloc() and umem_cache_alloc() for all the details.

Summary

Libumem is a hugely important feature in Solaris 10 that just slipped off the top 10 list, but I doubt there’s a Solaris user (or soon-to-be Solaris user) who won’t fall in love with it. I’ve only just touched on what you can do with libumem, but Jonathan Adams (libumem’s author) will soon be joining the ranks of blogs.sun.com to tell you more. Libumem is fast, it makes debugging a snap, it’s easy to use, and you can get down and dirty with its expanded API; what else could anyone ask for in an allocator?

1. Jeff’s USENIX paper is definitely worth a read
2. For more about Solaris history, and the internals of the slab allocator check out Solaris Internals
3. Actually, Jonathan slipped libumem into Solaris 9 Update 3 so you might have had libumem all this time and not known…
4. Jeff and Jonathan wrote a USENIX paper about some additions to the allocator and its extension to user-land in the form of libumem

The Solaris 10 top 11-20

July 12, 2004

Solaris 10 has way more features than any release of Solaris that I can remember, and Sun’s been marketing the hell out of them. Here’s my top 10 list roughly in order of how cool I think each is:

DTrace – of course…
ZFS – the amazing new file system
AMD64 Support – Opteron is so obviously great
Zones – N1 Grid Containers for those of you keeping score at home
Predictive Self Healing – never worry about flaky hardware again
Performance – Solaris 10 is just faster, faster networking, faster everything
Linux Compatibility – run Linux binaries unmodified, ‘nuf said
Service Management Facility – managing a box just got much easier
Process Rights Management – super-user is no longer a binary proposition
NFSv4 – nfs++ (++++++)

Blah blah blah. That’s for sure amazing stuff, but there are dozens of places where you can read about it (I was going to include some links to news stories, but I’m sure google can find you the same stuff it found for me).

But is that it for Solaris 10? Not by a long shot. There are literally dozens of features and improvements which would have cracked the top 10 for the last few Solaris releases. Without further ado, I present my Solaris 10 top 11-20 list:

libumem – the tool for debugging dynamic allocation problems; oh, and it scales as well or better than any other memory allocator
pfiles(1) with file names – you can get at the file name info through /proc too; very cool
Improved coreadm(1M) – core files are now actually useful on other machines, administrators and users can specify the content of core files
System V IPC – no more clumsy system tunables and reboots, it’s all dynamic, and (guess what?) faster too
kmdb – if you don’t care, ok, but if you do care, you really really care: mdb(1)’s cousin replaces kadb(1M)
Watchpoints – now they work and they scale
pstack(1) for java – see java stack frames in a JVM or core file and through DTrace
pmap(1) features – see thread stacks, and core file content
per-thread p-tools – apply pstack(1) and truss(1) to just the threads you care about
Event Ports – a generic API for dealing with heterogeneous event sources

There were some other features in the running (Serial ATA support, vacation(1) filtering, other p-tools improvements, etc.), but I had to draw the line somewhere. Remember this is my list; Solaris 10 has fancy new java and gnome stuff, but, while it’s cool I guess, it doesn’t do it for me in the way these things do. I’d be doing these features an injustice if I tried to summarize them all in one weblog entry, so I’ll bite off one at a time and explain them in detail over the next few days; stay tuned.

Inside nohup -p

July 9, 2004

I always thought it was cool, but I was surprised by the amount of interest expressed for my recent post on nohup -p. There was even a comment asking how nohup manages the trick of redirecting the output of a running process. I’ll describe in some detail how nohup -p works.

First, a little background material: Eric Schrock recently had a nice post about the history of the /proc file system; nohup makes use of Solaris’s /proc and the agent LWP in particular which Eric also described in detail. All of the /proc and agent LWP tricks I describe are documented in the proc(4) man page.

Historically, nohup invoked a process with SIGHUP and SIGQUIT masked and the output directed to a file called nohup.out. When you run a command inside a terminal there can be two problems: all the output is just recorded to that terminal, and if the terminal goes away the command will receive a SIGHUP, killing it by default. You use nohup to both capture the output in a file and protect the process against the terminal being killed (e.g. if your telnet connection drops).
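To make the historical behavior concrete, here is a minimal sketch in Python (not nohup’s actual source). The function name run_nohup_style is hypothetical, and ignoring the two signals is an assumption standing in for “masked”; an ignored disposition is one setting that survives the exec of the new command.

```python
import signal
import subprocess


def run_nohup_style(argv, out_path="nohup.out"):
    """Start argv with SIGHUP/SIGQUIT ignored and its output appended to out_path.

    Hypothetical helper for illustration only; returns the Popen object.
    """

    def ignore_hangup():
        # Runs in the child after fork and before exec. Ignored signal
        # dispositions are preserved across exec, so the command inherits them.
        # (preexec_fn is not safe in multi-threaded parents; fine for a sketch.)
        signal.signal(signal.SIGHUP, signal.SIG_IGN)
        signal.signal(signal.SIGQUIT, signal.SIG_IGN)

    # Append rather than truncate, so repeated runs accumulate in one file.
    with open(out_path, "ab") as out:
        return subprocess.Popen(
            argv,
            stdin=subprocess.DEVNULL,
            stdout=out,
            stderr=subprocess.STDOUT,  # stderr follows stdout into the file
            preexec_fn=ignore_hangup,
        )
```

Something like run_nohup_style(["sleep", "600"]) would then keep running after the terminal goes away, with anything it prints landing in nohup.out.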

To “nohup” a running process we need to both mask SIGHUP and SIGQUIT and redirect the output to the file nohup.out. The agent LWP makes this possible. First we create the agent LWP and have it execute the sigaction(2) system call to mask off SIGHUP and SIGQUIT. Next we need to redirect any output intended for the controlling terminal to the file nohup.out. This is easy in principle: we find all file descriptors open to the controlling terminal, have the agent LWP close them, and then reopen them to the file nohup.out. The problem is that other LWPs (threads) in the process might be using those file descriptors (e.g. with the read(2) or write(2) system calls), and the close(2) will actually block until those operations have completed. When the agent LWP is present in a process, none of the other LWPs can run, so none of the outstanding operations on those file descriptors can complete, and the process would deadlock. Note that we can work ourselves out of the deadlock by removing the agent LWP, but we still have a problem.

The solution is this: with all LWPs in the process stopped, we identify all the file descriptors that we’ll need to close and reopen, and then abort (using the PRSABORT flag listed in the proc(4) man page) those system calls. Once all outstanding operations have been aborted (or successfully completed) we know that there won’t be any possibility of deadlocking the process. The agent LWP executes the open(2) system call to open the nohup.out file and then has the victim process dup2(3C) that file descriptor over the ones open to the process’s controlling terminal (implicitly closing them). Actually, dup2(3C) is a library call so we have the agent LWP execute a fcntl(2) system call with the F_DUP2FD command.
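The open-then-dup2 sequence is easier to see from inside a single process. The sketch below is only an in-process analogue in Python: it does not use /proc or the agent LWP, so the deadlock concern never arises. The name redirect_terminal_fds and its parameters are hypothetical. The default check, os.isatty, matches any terminal rather than specifically the controlling terminal, which is a simplification; a stricter predicate can be passed in.

```python
import os


def redirect_terminal_fds(out_path="nohup.out", candidates=(0, 1, 2),
                          is_terminal=os.isatty):
    """Point every candidate descriptor that is_terminal accepts at out_path.

    Hypothetical helper for illustration; returns the descriptors redirected.
    """
    # Identify all the descriptors to replace before touching any of them.
    targets = [fd for fd in candidates if is_terminal(fd)]
    if not targets:
        return []

    # Open the output file once...
    out_fd = os.open(out_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        # ...then duplicate it over each target. dup2 implicitly closes
        # whatever the target descriptor referred to before.
        for fd in targets:
            os.dup2(out_fd, fd)
    finally:
        # The duplicates keep the file open; the original is no longer needed.
        os.close(out_fd)
    return targets
```

Calling redirect_terminal_fds() with the defaults from a program started in a terminal would send its later writes on descriptors 1 and 2 to nohup.out instead of the screen.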

Whew. Complicated to be sure, but at the end of it all, our precious process is protected against SIGHUP and SIGQUIT and through our arduous labors, output once intended for the terminal is now safely kept in a file. If this made sense or was even useful, I’d love to hear it…
