- in the hope that someone out there will find this useful
- personal therapy
We've got a couple of build servers (x86_64 Linux, OpenVZ) that have been having some disk I/O problems. These boxes (boxen) run various virtual machines related to our product builds: Hudson masters and slaves, distribution servers, Puppet master, test mail server, Munin server... yada yada. You get the idea. They're kinda important.
The problem first shows up as errors like the following:
Sep 16 14:28:51 hn3 mptscsih: ioc0: attempting task abort! (sc=ffff880422a348c0)
Sep 16 14:28:51 hn3 sd 0:1:2:0: [sda] CDB: cdb[0]=0x2a: 2a 00 12 b2 fc 9f 00 00 08 00
Sep 16 14:28:51 hn3 mptscsih: ioc0: Issue of TaskMgmt failed!
followed shortly by the volume in question being offlined and dropping into read-only mode:
Sep 16 14:29:41 hn3 mptscsih: ioc0: host reset: SUCCESS (sc=ffff880422a348c0)
Sep 16 14:29:41 hn3 sd 0:1:2:0: Device offlined - not ready after error recovery
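To get a feel for how often this happens, a rough sketch like the one below can tally the two signatures from a syslog stream. The function name count_mpt_errors is made up for illustration, and the patterns are lifted straight from the log lines above, so they may need adjusting for other driver versions.

```shell
#!/bin/sh
# Hypothetical helper (not part of any package): reads syslog lines on
# stdin and prints "<task aborts> <devices offlined>".
count_mpt_errors() {
    awk '
        /mptscsih: .*attempting task abort!/ { aborts++ }
        /Device offlined/                     { offlined++ }
        END { printf "%d %d\n", aborts, offlined }
    '
}
```

Feed it whatever file your syslog daemon writes kernel messages to, for example count_mpt_errors < /var/log/messages (that path is an assumption and varies by distribution).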
This ain't that helpful when you've got a whole load of hungry VMs wanting to write stuff to disk. A closer look at the disk controller yields the following:
>lspci
...
0b:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)
...
A quick search through the kernel messages shows the following related information:
>dmesg | grep -i mpt
Fusion MPT base driver 3.04.07
Fusion MPT SPI Host driver 3.04.07
Fusion MPT FC Host driver 3.04.07
Fusion MPT SAS Host driver 3.04.07
mptsas 0000:0b:00.0: PCI INT A -> GSI 35 (level, low) -> IRQ 35
mptbase: ioc0: Initiating bringup
mptbase: ioc0: PCI-MSI enabled
mptsas 0000:0b:00.0: setting latency timer to 64
Fusion MPT misc device (ioctl) driver 3.04.07
mptctl: Registered with Fusion MPT base driver
mptctl: /dev/mptctl @ (major,minor=10,220)
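Since the driver version turns out to matter here, this is a small sketch for pulling it out of that output. The function name mpt_base_version is hypothetical, and it assumes the banner line keeps the "Fusion MPT base driver <version>" shape shown above.

```shell
#!/bin/sh
# Hypothetical helper: reads dmesg output on stdin and prints the version
# from the first "Fusion MPT base driver <version>" banner line.
# Lines such as "Registered with Fusion MPT base driver" carry no version
# number and are skipped.
mpt_base_version() {
    sed -n 's/.*Fusion MPT base driver \([0-9][0-9.]*\).*/\1/p' | head -n 1
}
```

Typical use would be dmesg | mpt_base_version, which makes it easy to compare boxes or to confirm which driver a freshly booted kernel actually loaded.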
A quick Google search turns up quite a few different issues with these controllers but no clear resolution (no surprise).
To cut a (very) long story short, we appear to have a solution: use the drivers supplied by LSI rather than those shipped with the latest Linux kernel. We patched the kernel by replacing the drivers/message/fusion folder with the equivalent found in LSI's MPTLINUX_RHEL5_SLES10_PH16-4.18.00.00-1.zip distribution, which gives us version 4.18 of the MPT drivers. The result is an (apparently) stable system, tested under reasonably high load (load average ~20).
Incidentally, for those of you who like acronyms, MPT stands for 'Message Passing Technology'.
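For the curious, the folder swap can be sketched roughly as follows. This is a minimal illustration rather than the exact procedure we ran: swap_fusion_driver is a made-up name, both arguments are placeholder paths, and where the LSI zip unpacks its driver sources is not covered here, so you'd need to locate that directory yourself.

```shell
#!/bin/sh
# Hypothetical helper: swap a kernel tree's Fusion MPT driver directory for
# a vendor-supplied copy, keeping the original alongside as fusion.orig.
#   $1 = kernel source tree (assumed path)
#   $2 = directory holding the vendor's fusion driver sources (assumed path)
swap_fusion_driver() {
    tree=$1
    vendor=$2
    target="$tree/drivers/message/fusion"

    [ -d "$target" ] || { echo "no such directory: $target" >&2; return 1; }
    [ -d "$vendor" ] || { echo "no such directory: $vendor" >&2; return 1; }
    # Refuse to clobber an earlier backup.
    [ ! -e "$target.orig" ] || { echo "backup already exists: $target.orig" >&2; return 1; }

    # Move the stock driver aside, then copy the vendor sources into place.
    mv "$target" "$target.orig" &&
    cp -R "$vendor" "$target"
}
```

After the swap, the kernel and its modules still need to be rebuilt and installed in whatever way you normally do that. Keeping fusion.orig around makes it painless to back out if the vendor driver misbehaves.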
My work here is done.