I think the thing I love most about debugging software is that each tough bug can seem like an insurmountable challenge — until you figure it out. But until you do, each tough bugs is the hardest problem you’ve ever had to solve. There are weeks when every morning presents me with a seemingly impossible challenge, and each afternoon I get to spike my keyboard and do a little victory dance before running for the train.
For my first OpenSolaris blog post, I thought I talk about one of my favorite such bugs. This particularly nasty bug had to do with a tricky interaction between VMware and DTrace (pre-production versions of each to be clear). My buddy Keith — a fellow Brown CS Grad — gave me a call and told me about some strange behavior he was seeting running DTrace inside of a VMware VM. Keith is a big fan of DTrace, but an intermittant, but reproducable problem was putting a damper on his DTrace enthusiasm. Every once in a while, running DTrace would cause the system to freeze. Because Solaris was running in the virtual environment, he could see that both virtual CPUs where spinning away, but they weren’t making any forward progress. After a couple of back and forths over email, I made the trip down to Palo Alto so we could work on the problem together.
Using some custom debugging tools, we were able to figure out where the two virtual CPUs were spinning. One CPU was in xc_common() and the other was in xc_serv() — code to handle cross calls. So what was going on?
cross calls
Before I can really delve into the problem, I want to give just a brief overview of cross calls. In general terms, a cross call (xcall) is used in a multi-processor (MP) system when one CPU needs to get another CPU to do some work. It works by sending a special type of interrupt which the remote CPU handles. You may have heard the term interprocessor interrupt (IPI) — same thing. One example of when xcalls are used is when unmapping a memory region. To unmap a region, a process will typically call the munmap(2) system call. Remember that in an MP system, any processor may have run threads in this process so those mappings may be present in that any CPU’s TLB. The unmap operation executes on one CPU, but the other CPUs need to remove the relevant mappings from their own TLBs. To accomplish this communication, the kernel uses a xcall.
DTrace uses xcalls synchronize data used by all CPUs by ensuring that all CPUs have reached a certain point in the code. DTrace executes actions with interrupts disabled (an explanation of why this must be so is well beyond the bounds of this discussion) so we can tell that a CPU isn’t in probe context if its able to handle our xcall. When DTrace is stopping tracing activity, for example, it will update some data that affects all CPUs and then use a xcall to make sure that every CPU has seen its effects before proceeding:
10739 /* 10740 * We'll set the activity to DTRACE_ACTIVITY_DRAINING, and issue a sync 10741 * to be sure that every CPU has seen it. See below for the details 10742 * on why this is done. 10743 */ 10744 state->dts_activity = DTRACE_ACTIVITY_DRAINING; 10745 dtrace_sync();
dtrace_sync() sends a xcall to all other CPUs and has them spin in a holding pattern until all CPUs have reached that point at which time the CPU which sent the xcall releases them all (and they go back to whatever they had been doing when they received the interrupt). That’s the high level overview; let’s go into a little more detail on how xcalls work (well, actually a lot more detail).
xcall implementation
If you follow the sequence of functions called by dtrace_sync() (and I encourage you to do so), you’ll find that this eventually calls xc_common() to do the heavy lifting. It’s important to note that this call to xc_common() will have the sync argument set to 1. What’s that mean? In a text book example of good software engineering, someone did a good job documenting what this value means:
411 /* 412 * Common code to call a specified function on a set of processors. 413 * sync specifies what kind of waiting is done. 414 * -1 - no waiting, don't release remotes 415 * 0 - no waiting, release remotes immediately 416 * 1 - run service locally w/o waiting for remotes. 417 * 2 - wait for remotes before running locally 418 */ 419 static void 420 xc_common( 421 xc_func_t func, 422 xc_arg_t arg1, 423 xc_arg_t arg2, 424 xc_arg_t arg3, 425 int pri, 426 cpuset_t set, 427 int sync)
Before you start beating your brain out trying to figure out what you’re missing here, in this particular case, this destinction bewteen sync having the value of 1 and 2 is nil: the service (function pointer specified by the func argument) that we’re running is dtrace_sync_func() which does literally nothing.
Let’s start picking apart xc_common():
446 /* 447 * Request service on all remote processors. 448 */ 449 for (cix = 0; cix < NCPU; cix++) { 450 if ((cpup = cpu[cix]) == NULL || 451 (cpup->cpu_flags & CPU_READY) == 0) { 452 /* 453 * In case CPU wasn't ready, but becomes ready later, 454 * take the CPU out of the set now. 455 */ 456 CPUSET_DEL(set, cix); 457 } else if (cix != lcx && CPU_IN_SET(set, cix)) { 458 CPU_STATS_ADDQ(CPU, sys, xcalls, 1); 459 cpup->cpu_m.xc_ack[pri] = 0; 460 cpup->cpu_m.xc_wait[pri] = sync; 461 if (sync > 0) 462 cpup->cpu_m.xc_state[pri] = XC_SYNC_OP; 463 else 464 cpup->cpu_m.xc_state[pri] = XC_CALL_OP; 465 cpup->cpu_m.xc_pend[pri] = 1; 466 send_dirint(cix, xc_xlat_xcptoipl[pri]); 467 } 468 }
We take a first pass through all the processors; if the processor is ready to go and is in the set of processors we care about (they all are in the case of dtrace_sync()) we set the remote CPU’s ack flag to 0, it’s wait flag to sync (remember, 1 in this case), and it’s state flag to XC_SYNC_OP and then actually send the interrupt with the call to send_dirint().
Next we wait for the remote CPUs to acknowledge that they’ve executed the requested service which they do by setting the ack flag to 1:
479 /* 480 * Wait here until all remote calls complete. 481 */ 482 for (cix = 0; cix < NCPU; cix++) { 483 if (lcx != cix && CPU_IN_SET(set, cix)) { 484 cpup = cpu[cix]; 485 while (cpup->cpu_m.xc_ack[pri] == 0) { 486 ht_pause(); 487 return_instr(); 488 } 489 cpup->cpu_m.xc_ack[pri] = 0; 490 } 491 }
That while loop spins waiting for ack to become 1. If you look at the definition of return_instr() it’s name is actually more descriptive that you might imagine: it’s just a return instruction — the most trivial function possible. I’m not absolutely certain, but I think it was put there so the compiler wouldn’t “optimize” the loop away. The call to the inline function ht_pause() is so that the thread spins in such a way that’s considerate on an hyper-threaded CPU. The call to ht_pause() is probably sufficient to prevent the compiler from being overly clever, but the legacy call to return_instr() remains.
Now let’s look at the other side of this conversation: what happens on a remote CPU as a result of this interrupt? This code is in xc_serv()
138 /* 139 * Acknowledge that we have completed the x-call operation. 140 */ 141 cpup->cpu_m.xc_ack[pri] = 1; 142
I’m sure it comes as no surprise that after executing the given function, it just sets the ack flag.
Since in this case we’re dealing with a synchronous xcall, the remote CPU then needs to just chill out until the CPU that initiated the xcall discovers that all remote CPUs have executed the function and are ready to be released:
146 /* 147 * for (op == XC_SYNC_OP) 148 * Wait for the initiator of the x-call to indicate 149 * that all CPUs involved can proceed. 150 */ 151 while (cpup->cpu_m.xc_wait[pri]) { 152 ht_pause(); 153 return_instr(); 154 } 155 156 while (cpup->cpu_m.xc_state[pri] != XC_DONE) { 157 ht_pause(); 158 return_instr(); 159 }
And here’s the code on the initiating side that releases all the remote CPUs by setting the wait and state flags to the values that the remote CPUs are waiting to see:
502 /* 503 * Release any waiting CPUs 504 */ 505 for (cix = 0; cix < NCPU; cix++) { 506 if (lcx != cix && CPU_IN_SET(set, cix)) { 507 cpup = cpu[cix]; 508 if (cpup != NULL && (cpup->cpu_flags & CPU_READY)) { 509 cpup->cpu_m.xc_wait[pri] = 0; 510 cpup->cpu_m.xc_state[pri] = XC_DONE; 511 } 512 } 513 }
there’s a problem
Wait! Without reading ahead in the code, does anyone see the problem?
Back at VMware, Keith hacked up a version of the virtual machine monitor which allowed us to trace certain points in the code and figure out the precise sequence in which they occurred. We traced the entry and return to xc_common() and xc_serv(). Almost every time we’d see something like this:
- enter xc_common() on CPU 0
- enter xc_serv() on CPU 1
- exit xc_serv() on CPU 1
- exit xc_common() on CPU 0
or:
- enter xc_common() on CPU 0
- enter xc_serv() on CPU 1
- exit xc_common() on CPU 0
- exit xc_serv() on CPU 1
But the problem happened when we saw a sequence like this:
- enter xc_common() on CPU 0
- enter xc_serv() on CPU 1
- exit xc_common() on CPU 0
- enter xc_common() on CPU 0
And nothing futher. What was happening was that after releasing remote CPUs, CPU 0 was exiting from the call to xc_common() and calling it again before the remote invocation of xc_serv() on the other CPU had a change to exit.
Recall that one of the the first things that xc_common() does is set the state flag. If the first call to xc_common() sets the state flag to release the remote CPU from xc_sync(), but when things go wrong, xc_common() was overwriting that flag before the remote CPU got a change to see it.
the problem
We were seeing this repeatably under VMware, but no one had seen this at all on real hardware (yet). The machine Keith and I were using was a 2-way box running Linux. On VMware, each virtual CPU is represented by a thread on the native OS so rather than having absolute control of the CPU, the execution was more or less at the whim of the Linux scheduler.
When this code is running unadulterated on physical CPUs, we won’t see this problem. It’s just a matter of timing — the remote CPU has many many fewer instructions to execute before the state flag gets overwritten by a second xcall so there’s no problem. On VMware, the Linux scheduler might decide that’s your second virtual CPU’s right to the physical CPU is trumped by moving the hands on your xclock (why not?) so there are no garuantees about how long these operations can take.
the fix
There are actually quite a few ways to fix this problem — I’m sure you can think of at least one or two off the top of your head. We just need to make sure that subsequent xcalls can’t interfere with each other. When we found this, Solaris 10 was wrapping up — we were still making changes, but only those deemed of the absolute highest importance. Making changes to the xcall code (which is rather delicate and risky to change) for a bug that only manifests itself on virtual hardware (and which VMware could work around using some clever trickery[1]) didn’t seem worthy of being designated a show- stopper.
Keith predicted a few possible situations where this same bug could manifest itself on physical CPUs: on hyper-threaded CPUs, or in the presence of service management interrupts. And that prediction turned out to be spot on: a few weeks after root causing the bug under VMware, we hit the same problem on a system with four hyper-threaded chips (8 logical CPUs).
Since at that time we were even closer to shipping Solaris 10, I chose the fix I thought was the safest and least likely to have nasty side effects. After releasing remote CPUs, the code in xc_common() would now wait for remote CPUs to check in — wait for them to acknowledge receipt of the directive to proceed.
515 /* 516 * Wait for all CPUs to acknowledge completion before we continue. 517 * Without this check it's possible (on a VM or hyper-threaded CPUs 518 * or in the presence of Service Management Interrupts which can all 519 * cause delays) for the remote processor to still be waiting by 520 * the time xc_common() is next invoked with the sync flag set 521 * resulting in a deadlock. 522 */ 523 for (cix = 0; cix < NCPU; cix++) { 524 if (lcx != cix && CPU_IN_SET(set, cix)) { 525 cpup = cpu[cix]; 526 if (cpup != NULL && (cpup->cpu_flags & CPU_READY)) { 527 while (cpup->cpu_m.xc_ack[pri] == 0) { 528 ht_pause(); 529 return_instr(); 530 } 531 cpup->cpu_m.xc_ack[pri] = 0; 532 } 533 } 534 }
In that comment, I tried to summarize in 6 lines what has just taken me several pages to describe. And maybe I should have said “livelock” — oh well. Here’s the complementary code in xc_serv():
170 /* 171 * Acknowledge that we have received the directive to continue. 172 */ 173 ASSERT(cpup->cpu_m.xc_ack[pri] == 0); 174 cpup->cpu_m.xc_ack[pri] = 1;
conclusions
That was one of my favorite bugs to work on, and it’s actually fairly typical of a lot of the bugs I investigate: something’s going wrong; figure out why. I think the folks who work on Solaris tend to love that kind of stuff as a rule. We spend tons of time building facilities like DTrace, mdb(1), kmdb, CTF, fancy core files, and libdis so that the hart part of investigating mysterious problems isn’t gathering data or testing hypotheses, it’s thinking of the questions to answer and inventing new hypotheses. It’s my hope that OpenSolaris will attract those types of inquisitive minds that thrive on the (seemingly) unsolvable problem.
[1] This sort of problem is hardly unique to DTrace or to Solaris. Apparently (and not surprisingly) there are problems like this in nearly every operating system where the code implicitly or explicitly relies on the relative timing of certain operations. In these cases, VMware has hacks to do things like execute the shifty code in lock step.
Technorati Tag: OpenSolaris
Technorati Tag: Solaris
Technorati Tag: DTrace
Technorati Tag: mdb
One Response
Good writeup on a nasty debugging problem!
I was curious about CTF and libdis? I couldn’t find may references to to them on the web (some on other Sun blogs)?