Thursday, 17 September 2009

Build Server Woes (mptscsi task abort)

I'm posting this here for two reasons:
  • in the hope that someone out there will find this useful

  • personal therapy
Scenario:

We've got a couple of build servers (x86_64 linux, openVZ) that have been having some disk I/O problems. These boxes (boxen) run various virtual machines related to our product builds - hudson masters & slaves, distribution servers, puppet master, test mail server, munin server... yada yada. You get the idea. They're kinda important.

The problem manifests itself by first reporting errors like the following:

Sep 16 14:28:51 hn3 mptscsih: ioc0: attempting task abort! (sc=ffff880422a348c0)
Sep 16 14:28:51 hn3 sd 0:1:2:0: [sda] CDB: cdb[0]=0x2a: 2a 00 12 b2 fc 9f 00 00 08 00
Sep 16 14:28:51 hn3 mptscsih: ioc0: Issue of TaskMgmt failed!

followed shortly by the volume in question getting offlined into readonly mode:

Sep 16 14:29:41 hn3 mptscsih: ioc0: host reset: SUCCESS (sc=ffff880422a348c0)
Sep 16 14:29:41 hn3 sd 0:1:2:0: Device offlined - not ready after error recovery

This ain't that helpful when you've got a whole load of hungry VMs wanting to write stuff to disk. A closer look at the disk controller yields the following:

>lspci
...
0b:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)
...

A quick search through our messages shows the following related information:

>dmesg | grep -i mpt
Fusion MPT base driver 3.04.07
Fusion MPT SPI Host driver 3.04.07
Fusion MPT FC Host driver 3.04.07
Fusion MPT SAS Host driver 3.04.07
mptsas 0000:0b:00.0: PCI INT A -> GSI 35 (level, low) -> IRQ 35
mptbase: ioc0: Initiating bringup
mptbase: ioc0: PCI-MSI enabled
mptsas 0000:0b:00.0: setting latency timer to 64
Fusion MPT misc device (ioctl) driver 3.04.07
mptctl: Registered with Fusion MPT base driver
mptctl: /dev/mptctl @ (major,minor=10,220)

A quick google turns up quite a few different issues with these controllers but no clear resolution (no surprise).

To cut a (very) long story short we appear to have a solution by using the drivers supplied by LSI rather than those shipped with the latest linux kernel. Patching the kernel (by replacing the drivers/message/fusion folder with the equivalent found in LSI's MPTLINUX_RHEL5_SLES10_PH16-4.18.00.00-1.zip distribution) with version 4.18 of the MPT drivers has yielded an (apparently) stable system, tested under reasonably high load (load average ~20).

Incidentally, for those of you who like acronyms, MPT stands for 'Message passing technology'.

My work here is done.