System Interactions

Introduction

Sometimes problems aren't as obvious as a missing check for a NULL pointer. Sometimes they aren't even as obvious as a buffer overflow. Sometimes you need to have a good understanding of the whole system (or a significant part of it) to determine the cause of the problem.

In this instance we investigated a latency issue with a real-time trading system that turned out to be triggered by a seemingly innocuous disk monitoring daemon which, in turn, was invoking a code path through the kernel (Linux 2.4) that led to CPU starvation for other processes and hence the latency problem.

The Brief

The client was using a number of HP DL360 servers to run a (soft) real-time trading application. Trades took the form of messages sent over the network, to which acknowledgements were returned. Normally a trade was sent and acknowledged within a few milliseconds. However, sometimes a trade took hundreds, or even thousands, of milliseconds. Given the volatility of the markets being traded, this raised serious concerns about the pricing strategies used for their own customers, as well as the direct loss of profit caused by a late trade.

Investigation

The first step was some basic monitoring of the system using tools such as top and ps. Likely candidates for the source of the problem would be causing spikes in CPU usage coincident with the messaging delays. Indeed, one process, cmaidad, stood out as consuming significant CPU in bursts every few seconds.
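The exact commands used are not reproduced here, but a crude sampling loop along the following lines (the interval and output path are illustrative) is enough to line CPU bursts up against the application's latency logs:

Example sampling loop

while true; do
    date '+%H:%M:%S'
    ps -eo pid,pcpu,comm --sort=-pcpu | head -5
    sleep 1
done >> /tmp/cpu-samples.log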

The cmaidad process is part of the Insight Manager package provided by HP to monitor their servers. This particular process is responsible for monitoring the disk subsystem. Attaching an strace to it revealed little of interest:

strace output

open("/dev/cciss/c0d0", O_RDONLY)       = 3
ioctl(3, CCISS_PASSTHRU, 0xbfffa0c8)    = 0
close(3)                                = 0
open("/dev/cciss/c0d0", O_RDONLY)       = 3
ioctl(3, CCISS_PASSTHRU, 0xbfffa0c8)    = 0
close(3)                                = 0

In fact it seems to do nothing other than repeatedly open the disk devices, query their status with an ioctl, and close them.
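For reference, a trace of this kind can be captured by attaching to the running daemon; the pidof lookup below is merely one way of finding it:

strace -p $(pidof cmaidad) -e trace=open,ioctl,close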

The next step was to dig a little deeper using oprofile. This tool allows us to see under the hood, as it were, and determine which kernel code paths the cmaidad process executes.

The oprofile report revealed that 99.6% of cmaidad's time was spent in the kernel, primarily in one function named invalidate_bdev.
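The precise oprofile invocation is not recorded here, and it varies with the oprofile version, but on a 2.4 kernel a session of this kind typically looks something like the following (the vmlinux path is an assumption):

opcontrol --vmlinux=/boot/vmlinux
opcontrol --start
# ... wait while cmaidad goes through a few of its bursts ...
opcontrol --stop
opreport -l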

invalidate_bdev is called from kill_bdev (defined in fs/block_dev.c). kill_bdev is called from the block device release code. Release is what happens on last close. And since nothing else has the entire disk device open (the mounted partitions are different block devices) each of these closes is considered a last close! The problem is that invalidate_bdev was never designed to be called frequently - it simply walks the entire buffer list looking for buffers on that device that can be abandoned. So the more data you have buffered, the longer the close of the device is going to take.

(It is worth noting that the kernel in question did not use kernel pre-emption, so the buffer walk prevented other processes from running. A newer kernel with kernel pre-emption may behave differently.)
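The effect can be reproduced by hand along these lines (a sketch only: the device names match the client's CCISS controller, the sizes are arbitrary, and enough RAM to buffer a significant amount of data is assumed):

Reproducing the slow last close

# Pull a large amount of data through the buffer cache.
dd if=/dev/cciss/c0d0p1 of=/dev/null bs=1M count=1024

# Open the whole-disk device, read one sector and close it again.
# On the affected 2.4 kernel the bulk of the elapsed time is the
# close, where invalidate_bdev walks the entire buffer list.
time dd if=/dev/cciss/c0d0 of=/dev/null bs=512 count=1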

The Work-around

Since the problem was being triggered by the last close of the disk devices, a simple script was used to hold the devices open:

Work-around script

/usr/bin/setsid /bin/bash -c '
    # Close stdin, stdout and stderr so the only descriptors this
    # shell keeps open are the disk devices themselves.
    exec 0<&-
    exec 1<&-
    exec 2<&-
    # Open every CCISS whole-disk device on successive descriptors,
    # starting at fd 3.
    n=3
    for d in /dev/cciss/c?d?; do
        eval exec "$n<$d"
        n=$(( $n + 1 ))
    done
    # Suspend; the devices stay open for as long as this shell exists.
    kill -STOP $$
' &

This simple little script opens all the CCISS disk devices it can find and then suspends itself. It uses no further CPU time, but as long as it remains running a last close will never be triggered, so the kernel never performs the buffer walk that was causing all the problems.

Once this script had been started, the CPU time shown by ps for cmaidad became negligible, and the opreport -l output for cmaidad showed nothing that would not be expected from the ioctl, with the kernel functions write_some_buffers and wait_for_buffers now accounting for most of the load.
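A couple of commands are enough to confirm the work-around is holding; these are illustrative rather than a record of what was actually run:

# Something should now be holding the whole-disk devices open...
lsof /dev/cciss/c0d0
# ...and cmaidad's accumulated CPU time should stop growing.
ps -C cmaidad -o pid,pcpu,time,comm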

Conclusion

A problem with the client's application was traced through an unrelated system daemon and on into the kernel. An interaction between the design of the block I/O layer and its use by a monitoring daemon was found to cause resource starvation of other processes, and a simple work-around was devised.

The manufacturer of the controller and monitoring software has subsequently implemented the fix within the daemon.

 
