I always thought it was cool, but I was surprised by the amount of interest in my recent post on nohup -p. There was even a comment asking how nohup manages the trick of redirecting the output of a running process. I’ll describe in some detail how nohup -p works.
First, a little background material: Eric Schrock recently had a nice post about the history of the /proc file system. nohup makes use of Solaris’s /proc, and in particular of the agent LWP, which Eric also described in detail. All of the /proc and agent LWP tricks I describe are documented in the proc(4) man page.
Historically, nohup invoked a process with SIGHUP and SIGQUIT masked and the output directed to a file called nohup.out. When you run a command inside a terminal there can be two problems: all the output is just recorded to that terminal, and if the terminal goes away the command will receive a SIGHUP, killing it by default. You use nohup to both capture the output in a file and protect the process against the terminal being killed (e.g. if your telnet connection drops).
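As a rough illustration of that historical behavior, here is a minimal Python sketch. It is not the Solaris implementation: it assumes that “masked” means setting the disposition of both signals to ignored, it always redirects output rather than checking for a terminal first, and the name nohup_run is made up for this example.

```python
import signal
import subprocess


def nohup_run(argv, out_path="nohup.out"):
    """Run argv with SIGHUP and SIGQUIT ignored and output appended to out_path.

    Hypothetical helper for illustration only.
    """
    def ignore_signals():
        # Runs in the child between fork and exec; an ignored
        # disposition is preserved across exec.
        signal.signal(signal.SIGHUP, signal.SIG_IGN)
        signal.signal(signal.SIGQUIT, signal.SIG_IGN)

    with open(out_path, "ab") as out:
        done = subprocess.run(
            argv,
            stdout=out,
            stderr=subprocess.STDOUT,
            preexec_fn=ignore_signals,
        )
    return done.returncode
```

Calling nohup_run(["sleep", "600"]) would start the command with both signals ignored and anything it prints collected in nohup.out. The hard part, covered next, is doing the same thing to a process that is already running.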
To “nohup” a running process we need to both mask SIGHUP and SIGQUIT and redirect the output to the file nohup.out. The agent LWP makes this possible. First we create the agent LWP and have it execute the sigaction(2) system call to mask off SIGHUP and SIGQUIT. Next we need to redirect any output intended for the controlling terminal to the file nohup.out. This is easy in principle: we find all file descriptors open to the controlling terminal, have the agent LWP close them, and then reopen them to the file nohup.out. The problem is that other LWPs (threads) in the process might be using those file descriptors (e.g. with the read(2) or write(2) system calls), and the close(2) will actually block until those operations have completed. While the agent LWP is present in a process, none of the other LWPs can run, so none of the outstanding operations on those file descriptors can complete, and the process would deadlock. Note that we can work ourselves out of the deadlock by removing the agent LWP, but then we are back where we started, with the descriptors still open to the terminal.
The solution is this: with all LWPs in the process stopped, we identify all the file descriptors that we’ll need to close and reopen, and then abort (using the PRSABORT flag listed in the proc(4) man page) any system calls outstanding on them. Once all outstanding operations have been aborted (or successfully completed) we know that there won’t be any possibility of deadlocking the process. We then have the agent LWP execute the open(2) system call to open the nohup.out file and dup2(3C) that file descriptor over the ones open to the process’s controlling terminal (implicitly closing them). Actually, dup2(3C) is a library call, so we have the agent LWP execute an fcntl(2) system call with the F_DUP2FD command instead.
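The open-then-dup2 step is easier to see in miniature. The Python sketch below does it inside the current process rather than through an agent LWP in a victim process, so it shows only the descriptor shuffling, not the /proc machinery or the aborting of outstanding system calls. The helper names terminal_fds and redirect_fds, and the limit of 256 descriptors, are assumptions for this example; os.dup2 stands in for the fcntl(2) F_DUP2FD call.

```python
import os


def terminal_fds(max_fd=256):
    """Return the open descriptors in this process that refer to a terminal."""
    # os.isatty() returns False for descriptors that are closed or not
    # terminals, so no error handling is needed here.
    return [fd for fd in range(max_fd) if os.isatty(fd)]


def redirect_fds(fds, out_path="nohup.out"):
    """Point every descriptor in fds at out_path, like the dup2 step above."""
    new_fd = os.open(out_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for fd in fds:
            # dup2 atomically closes whatever fd referred to and makes it
            # a duplicate of new_fd.
            os.dup2(new_fd, fd)
    finally:
        # The temporary descriptor is no longer needed once duplicated.
        os.close(new_fd)


if __name__ == "__main__":
    redirect_fds(terminal_fds())
    print("this line goes to nohup.out if stdout was a terminal")
```

Because dup2 replaces each descriptor in one step, there is never a moment at which the descriptor number is closed and could be handed out to some other open call.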
Whew. Complicated to be sure, but at the end of it all, our precious process is protected against SIGHUP and SIGQUIT and through our arduous labors, output once intended for the terminal is now safely kept in a file. If this made sense or was even useful, I’d love to hear it…
One Response
I had never actually looked into the innards of nohup -p. That is a seriously cool bit of coding, Adam.