Debugging Unix Processes: A Detective Story

This document was originally put together to show how Unix debugging tools can be used to track down problems even when you don't have the source code (i.e. commercial products). It may one day become a presentation if I can figure out how to present it without boring an audience senseless.

Background

At work we use IBM Tivoli Remote Control. It allows a Controller to take remote control of a Target (both typically Windows boxes). It is functionally similar to products such as pcANYWHERE, Carbon Copy or VNC. It also supports functions such as chat and file transfer.

In order to support environments where there is no route from Controller to Target, Remote Control has an optional component known as a Remote Control Gateway. The RC Gateway functions as a TCP-level relay between Controller and Target. After connection establishment, the RC Gateway just passes the data straight through.
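
To make the "passes the data straight through" part concrete, here is a minimal sketch of what the core of such a relay might look like for a single Controller/Target pair. This is purely illustrative (we never saw the pcgateway source, and the names relay, ctrl_fd and target_fd are made up); the real process multiplexes many sessions at once.

/* Illustrative relay loop for one Controller<->Target pair; the real
 * pcgateway multiplexes many such sessions in a single process. */
#include <poll.h>
#include <unistd.h>

static void relay(int ctrl_fd, int target_fd)
{
    struct pollfd fds[2];
    char buf[8192];
    ssize_t n;
    int i;

    fds[0].fd = ctrl_fd;   fds[0].events = POLLIN;
    fds[1].fd = target_fd; fds[1].events = POLLIN;

    for (;;) {
        if (poll(fds, 2, -1) < 0)
            return;                         /* poll error: give up */
        for (i = 0; i < 2; i++) {
            if (fds[i].revents & (POLLIN | POLLHUP)) {
                n = read(fds[i].fd, buf, sizeof(buf));
                if (n <= 0)
                    return;                 /* EOF or error: tear down */
                if (write(fds[1 - i].fd, buf, (size_t)n) != n)
                    return;                 /* short write or error */
            }
        }
    }
}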

We had been having problems with our Remote Control. Occasionally, new connections would fail, and when they did, existing connections would freeze. Other Tivoli operations would slow down noticeably.

Some time ago, in another customer environment, we had had similar problems. When Tivoli Support was unable to resolve this intermittent but crippling problem, we reconfigured that environment to eliminate the need for the RC Gateway. Now, with the problem still unresolved, our network architecture and those of our customers required the use of the gateway. So I waded in.

Basic System Performance: top

The first stop was top. Top has been around for half of forever. Running it showed the following.

last pid: 20676;  load averages:  1.02,  1.05,  1.07             20:59:46
69 processes:  67 sleeping, 1 running, 1 on cpu
CPU states: 0.0% idle,  98.3% user,  1.6% kernel,  0.0% iowait,  0.0% swap
Memory: 512M real, 413M swap in use, 934M swap free

  PID USERNAME THR PRI NICE  SIZE   RES STATE   TIME    CPU COMMAND
 6160 root       1  59    0 6016K 2592K run     9:06 98.06% pcgateway
14906 nobody     1  48    0   19M 5184K sleep 174:02  0.95% Collector_prog1
20673 dtucker    1  58    0 2592K 1792K cpu     0:03  0.66% top
  373 root       1  58    0   76M   13M sleep  21.4H  0.55% oserv
29562 root       1  58    0 2600K 1264K sleep   3:53  0.30% sshd

pcgateway is the aforementioned RC Gateway. Now all it does is pass a bit of traffic from Controllers to Targets and back. Only one such process is ever present on a given Tivoli node, so it must simultaneously handle all concurrent sessions.

OK, so it looks like our problem is in the RC Gateway. But what the hell is it doing?

System Call tracing: truss

On most Unix (and similar) platforms it's possible to trace a process's system calls. On Solaris, the tool to do this is truss. Trussing our misbehaving pcgateway showed:

$ truss -p 6160
poll(0x00075B50, 1, 0)                          = 0
poll(0x00075B50, 1, 0)                          = 0
poll(0x00075B50, 1, 0)                          = 0
poll(0x00075B50, 1, 0)                          = 0
[ad infinitum]

So we now know that our pcgateway is caught in a loop, poll()ing over and over. poll() is a system call that tells the process which file descriptors are ready for reading and writing.
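
For anyone who hasn't used poll() directly, a minimal sketch of a typical call follows (a generic example, not pcgateway's code; the function name check_fd is made up). The three arguments match what truss shows above: a pointer to an array of struct pollfd, the number of entries (1 here) and a timeout in milliseconds (0 means "just check and return immediately").

#include <poll.h>

/* Check a single descriptor for the given events, e.g. POLLIN
 * (readable) or POLLOUT (writable). */
int check_fd(int fd, short events, int timeout_ms)
{
    struct pollfd pfd;

    pfd.fd = fd;          /* descriptor we're interested in    */
    pfd.events = events;  /* events we want to know about      */
    pfd.revents = 0;      /* filled in by the kernel on return */

    return poll(&pfd, 1, timeout_ms);  /* >0 ready, 0 timed out, -1 error */
}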

We'll need to know which file descriptors are involved. Fortunately, Vic Abell wrote a tool called lsof.

File Descriptors: lsof

Lsof ("list open files") can show, among other things, all open files and network file descriptors. It's a handy tool to have around.

$ lsof -p 6160
COMMAND    PID USER   FD   TYPE        DEVICE SIZE/OFF   NODE NAME
[snip]
pcgateway 6160 root    5u  inet 0x30002782238      0t0    TCP *:2501 (LISTEN)
pcgateway 6160 root    6u  inet 0x300021e1110      0t0    TCP rcgateway:*->203.2.41.147:* (IDLE)
pcgateway 6160 root    7u  inet 0x300027839f8      0t0    TCP rcgateway:47607->10.2.15.17:2501 (ESTABLISHED)
pcgateway 6160 root    8u  inet 0x30002782198      0t0    TCP rcgateway:*->203.2.41.147:* (IDLE)
pcgateway 6160 root    9u  inet 0x30002783318      0t0    TCP rcgateway:*->10.2.15.17:* (IDLE)

At this point, we'd gained as much info as we could and the service was still down. We forced a core dump with kill -ABRT for post-mortem analysis and restarted the service.

Post Mortem debugging: gdb

The poll man page shows that the arguments are an array of structures (of type "pollfd"), the number of structures in the array (1 in this case) and the timeout. The number of structures and the timeout are both integers, but the array is passed as a pointer, so we can't see directly from the truss output what it contains. A struct pollfd looks like this (from the /usr/include/sys/poll.h header file):

typedef struct pollfd {
        int fd;                         /* file desc to poll */
        short events;                   /* events of interest on fd */
        short revents;                  /* events that occurred on fd */
} pollfd_t;

On a SPARC in 32-bit mode, the file descriptor number is an integer (the first four bytes), the next two bytes are flags for the events we care about, and the last two bytes are the returned events.
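
Rather than working the offsets out by hand, you can also get the compiler to confirm them. A throwaway program like the following (assuming it's compiled as a 32-bit binary on the same platform) prints the size and member offsets:

#include <stdio.h>
#include <stddef.h>
#include <poll.h>

int main(void)
{
    /* On 32-bit SPARC this prints 8, 0, 4 and 6 respectively. */
    printf("sizeof(struct pollfd)     = %u\n", (unsigned)sizeof(struct pollfd));
    printf("offsetof(pollfd, fd)      = %u\n", (unsigned)offsetof(struct pollfd, fd));
    printf("offsetof(pollfd, events)  = %u\n", (unsigned)offsetof(struct pollfd, events));
    printf("offsetof(pollfd, revents) = %u\n", (unsigned)offsetof(struct pollfd, revents));
    return 0;
}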

Now that we know what we're looking for (from the man page and header files) and where to look for it (from the lsof output), we can feed the core dump into a debugger to peek inside the process as it was when it was killed:

$ gdb -q $BINDIR/TME/PCREMOTE/pcgateway pcgateway.core
[snip]
#0  0xfed15d0c in _poll () from /usr/lib/libc.so.1
(gdb) backtrace
#0  0xfed15d0c in _poll () from /usr/lib/libc.so.1
#1  0xfecce6dc in select () from /usr/lib/libc.so.1
#2  0x162e8 in start_gateway ()
#3  0xfef70128 in run_mtc () from /opt/Tivoli/lib/solaris2/libtmf.so
#4  0x19644 in cthread_body ()
(gdb) print *(0x75b50)
$1 = 9
(gdb) x/xh 0x75b50+4
0x75b54:        0x0004
(gdb) x/xh 0x75b50+6
0x75b56:        0x0000
(gdb) quit
$ grep 0x0004 /usr/include/sys/poll.h
#define POLLOUT         0x0004          /* fd is writeable (won't block) */

The "backtrace" shows the sequence of function calls leading up to the current location. The gateway application is actually using select() function call, which is implemented using poll() by the standard C library (libc.so) on Solaris. Since the stack trace shows it was killed in a function called start_gateway, we might infer that it was trying to start a connection.

The "print" shows us the file descriptor and the first "x/xh" shows us a hex dump of the next half-word (ie short integer). A quick look in the header file shows that the flag is POLLOUT.

Referring to our lsof output, file descriptor 9 is a TCP socket connecting to one of the RC targets.

The other interesting thing that we can see is that the socket is in the IDLE state and that it is not bound to any port (compare with fd#7).

So the Tivoli RC gateway is trying to write (continuously) to a socket that's no longer connected and that's failing. Why is it trying to write to an unconnected socket? We don't know yet. It might be a race condition where it's sensitive to the timing of the connections coming down, or it might be due to a connection being torn down from the target side.

Caught in the act!

At this point, we have a few ideas about what might trigger the problem. What we really need is the syscall trace from just before the problem. We ended up running truss in a screen session and logging all of its output to a file. We cleared the log (truss generated about 100MB of output per day) every couple of days.

After a couple of weeks, we caught it in the act!

accept(5, 0x000572A0, 0x00057288, 1)            = 6 # Accepts connection from RC controller, TCP connection is fd#6
getitimer(ITIMER_REAL, 0x00057100)              = 0
recv(6, " T C P : 1 7 2 . 2 0 . 1".., 128, 0)   = 21    # Receives IP address of target host to connect to from controller
so_socket(2, 2, 6, "", 1)                       = 7 # Allocates a new socket for outgoing connection (fd#7)
setsockopt(7, 65535, 4, 0x0004CB38, 4, 1)       = 0             
ioctl(7, FIONBIO, 0x0004CB34)                   = 0         # Sets fd#7 for non-blocking IO
connect(7, 0x0004CCD0, 16, 1)                   Err#150 EINPROGRESS # Initiates TCP connection to RC target
poll(0x0004CA68, 1, 0)                          = 0
getitimer(ITIMER_REAL, 0x0004C9B0)              = 0
poll(0x00035C90, 1, 988)                        = 0
    Received signal #14, SIGALRM, in poll() [caught]
poll(0x00035C90, 1, 988)                        Err#4 EINTR
setcontext(0x0003D798)
poll(0x00035C90, 1, 0)                          = 0
setitimer(ITIMER_REAL, 0x00035AF8, 0x00000000)  = 0
poll(0x0005DC38, 1, 0)                          = 1
recv(9, "\b\0\012\001 '\0", 8192, 0)            = 8 # Receives data on unrelated connection
poll(0x0005DC40, 1, 0)                          = 1
send(8, "\b\0\012\001 '\0", 8, 0)               = 8 # Forwards data
poll(0x0005DC40, 1, 0)                          = 0
getitimer(ITIMER_REAL, 0x0005DB90)              = 0
setitimer(ITIMER_REAL, 0x0005DB90, 0x00000000)  = 0
poll(0x00035C90, 1, 1)                          = 0
[snip]
    Received signal #14, SIGALRM, in poll() [caught]
setcontext(0x0003D798)
poll(0x00035C90, 1, 0)                          = 0
setitimer(ITIMER_REAL, 0x00035AF8, 0x00000000)  = 0
poll(0x0004CA68, 1, 0)                          = 1 # Polls (fd unknown, probably 7) and fd is ready
send(6, " G A T E W A Y _ C O N N".., 128, 0)   = 128   # Sends "GATEWAY_CONNECTED" message to controller on fd#6
getitimer(ITIMER_REAL, 0x0004CBE0)              = 0
setitimer(ITIMER_REAL, 0x0004CBE0, 0x00000000)  = 0
poll(0x00035C90, 1, 1)                          = 0
    Received signal #14, SIGALRM, in poll() [caught]
poll(0x00035C90, 1, 1)                          Err#4 EINTR
setcontext(0x0003D798)
poll(0x00035C90, 1, 0)                          = 0
setitimer(ITIMER_REAL, 0x00035AF8, 0x00000000)  = 0
poll(0x0004CC98, 1, 0)                          = 1
recv(7, 0x0004CE88, 8192, 0)                    Err#146 ECONNREFUSED    # Tries to read from fd#7 to target
                                    # but connection is refused (by target)!
poll(0x0004CCA0, 1, 0)                          = 1
recv(6, "\v\0", 8192, 0)                        = 2 # Receives data from controller on fd#6
poll(0x0004CC98, 1, 0)                          = 0
    Received signal #14, SIGALRM [caught]
setcontext(0x0004C980)
poll(0x0004CC98, 1, 0)                          = 0 # Gets caught in poll loop, polling fd#7 to write.

It seems that there are two potential causes of the problem:

a) The data arriving from the unrelated connection while the new connection is pending causes pcgateway to think that the connection has completed when it hasn't, or
b) pcgateway doesn't check for a successful connection and tries to use the socket anyway.
Or maybe it's a combination of both.
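
Hypothesis (b) is worth spelling out, because the failure mode is subtle: when a non-blocking connect() is refused, poll() still reports the socket as ready, and the error is only delivered on the next read or write. A hedged sketch of that kind of logic follows; this is our guess at the behaviour, not the actual pcgateway source, and the names connect_and_send, target and buf are made up.

#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Guessed-at buggy pattern: readiness after a pending non-blocking
 * connect() is assumed to mean the connection succeeded. */
static int connect_and_send(int s, const struct sockaddr *target,
                            socklen_t tlen, const char *buf, size_t len)
{
    struct pollfd pfd;

    if (connect(s, target, tlen) < 0 && errno != EINPROGRESS)
        return -1;                  /* immediate failure */

    pfd.fd = s;
    pfd.events = POLLOUT;
    pfd.revents = 0;
    if (poll(&pfd, 1, 5000) != 1)
        return -1;                  /* timed out */

    /* BUG: a refused non-blocking connect also causes poll() to report
     * the socket as ready, so this send() can run against a socket
     * that never connected. */
    return (int)send(s, buf, len, 0);
}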

We can't tell much more without access to the source code or a debug build of pcgateway, so we sent off the data and analysis to Support.

Confirmation and resolution

After some to-and-froing with the Support and Devel teams, we had the cause confirmed. The pcgateway process did not correctly check for errors writing to the target. Because it used non-blocking IO, it didn't check for connection failure after the initial connect() attempt (which nearly always returns EINPROGRESS). Later, when it actually went to use the connection, it assumed the connection was up. This was true most of the time, but not always.
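
For reference, the standard way to complete a non-blocking connect() is to wait for the socket to become writable and then retrieve the deferred error with getsockopt(SO_ERROR). The sketch below (the name finish_connect is made up) shows the check the gateway appears to have been missing; it is not necessarily how IBM implemented the fix.

#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

/* Wait for a pending non-blocking connect() to finish and find out
 * whether it actually succeeded.  Returns 0 on success, -1 on failure
 * with errno set (e.g. to ECONNREFUSED). */
static int finish_connect(int s, int timeout_ms)
{
    struct pollfd pfd;
    int err = 0;
    socklen_t errlen = sizeof(err);

    pfd.fd = s;
    pfd.events = POLLOUT;
    pfd.revents = 0;

    if (poll(&pfd, 1, timeout_ms) != 1)
        return -1;                          /* timed out or poll error */
    if (getsockopt(s, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0)
        return -1;
    if (err != 0) {
        errno = err;                        /* the deferred connect error */
        return -1;
    }
    return 0;                               /* connection really is up */
}

Skipping this check works fine as long as the target connects quickly, which is presumably why the bug only showed up in environments like ours.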

Eventually, the fix was released as an APAR with Remote Control 3.7 patch 11:

  IY20447: When trying to establish a Tivoli Remote Control session 
           using the Remote Control Gateway, the "pcgateway(.exe)" 
           process may cause a CPU usage of 100%.

Analysis

The problem turned out to be a bug in pcgateway, triggered by a race condition. We have fast access from the controller to the RC Gateway but slow access (via a WAN link) to the targets.

The Tivoli Framework invokes both controller and target at the same time. The problem occurs when the controller is able to start and connect to the target via the gateway before the target is ready to accept the connection. When the target refuses the connection, the gateway doesn't handle the error and is sent into a tailspin.

This shows the problems you'll get when trying to execute a synchronous operation (Remote Control) over a fundamentally asynchronous infrastructure (Tivoli Management Framework).

Other Hints

Collect information as you go. I find it useful to make a directory for each problem I'm working on and subdirectories for each day I work on it. I then copy all data gathered that day (e.g. process logs, core dumps and keystroke logs) into that day's subdirectory.

The keystroke logs were particularly useful. The Unix command script will create one for you. I started a script before each debugging session and saved its output in that day's directory. If, in the future, I wanted to compare a current instance of the problem with output from a previous one, it was easy. It also made reconstructing this story possible.

References

top

Top provides basic system performance information. It's been around for a long time.

truss

truss is the native tool that provides system call tracing on Solaris. Most SysV Unixes have it, as does AIX 5. AIX 4.3 has a system-wide tool called trcrpt and there's a script to emulate truss using it. strace provides the same facility on Linux, SysV, Solaris and Irix.

HP-UX has trace (HP-UX 10.20 and below) or tusc (HP-UX 11.00 and above).

lsof

Lsof allows you to view a process's open file descriptors, current working directory and memory-mapped files. There's also a companion tool, lslk, for viewing file locks.

gdb

gdb is the GNU debugger. It's a source-level debugger that runs on multiple platforms.

screen

screen lets you run an interactive program in the background and attach and detach it at will.


Page copyright 2002 Darren Tucker. Unrestricted use and reproduction permitted as long as this copyright notice is retained.
Last modified: 2022-05-25