Showing posts with label openvz. Show all posts
Showing posts with label openvz. Show all posts

Thursday, 17 September 2009

Build Server Woes (mptscsi task abort)

I'm posting this here for two reasons:
  • in the hope that someone out there will find this useful

  • personal therapy
Scenario:

We've got a couple of build servers (x86_64 linux, openVZ) that have been having some disk I/O problems. These boxes (boxen) run various virtual machines related to our product builds - hudson masters & slaves, distribution servers, puppet master, test mail server, munin server... yada yada. You get the idea. They're kinda important.

The problem manifests itself by first reporting errors like the following:

Sep 16 14:28:51 hn3 mptscsih: ioc0: attempting task abort! (sc=ffff880422a348c0)
Sep 16 14:28:51 hn3 sd 0:1:2:0: [sda] CDB: cdb[0]=0x2a: 2a 00 12 b2 fc 9f 00 00 08 00
Sep 16 14:28:51 hn3 mptscsih: ioc0: Issue of TaskMgmt failed!

followed shortly by the volume in question getting offlined into readonly mode:

Sep 16 14:29:41 hn3 mptscsih: ioc0: host reset: SUCCESS (sc=ffff880422a348c0)
Sep 16 14:29:41 hn3 sd 0:1:2:0: Device offlined - not ready after error recovery

This ain't that helpful when you've got a whole load of hungry VMs wanting to write stuff to disk. A closer look at the disk controller yields the following:

>lspci
...
0b:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)
...

A quick search through our messages shows the following related information:

>dmesg | grep -i mpt
Fusion MPT base driver 3.04.07
Fusion MPT SPI Host driver 3.04.07
Fusion MPT FC Host driver 3.04.07
Fusion MPT SAS Host driver 3.04.07
mptsas 0000:0b:00.0: PCI INT A -> GSI 35 (level, low) -> IRQ 35
mptbase: ioc0: Initiating bringup
mptbase: ioc0: PCI-MSI enabled
mptsas 0000:0b:00.0: setting latency timer to 64
Fusion MPT misc device (ioctl) driver 3.04.07
mptctl: Registered with Fusion MPT base driver
mptctl: /dev/mptctl @ (major,minor=10,220)

A quick google turns up quite a few different issues with these controllers but no clear resolution (no surprise).

To cut a (very) long story short we appear to have a solution by using the drivers supplied by LSI rather than those shipped with the latest linux kernel. Patching the kernel (by replacing the drivers/message/fusion folder with the equivalent found in LSI's MPTLINUX_RHEL5_SLES10_PH16-4.18.00.00-1.zip distribution) with version 4.18 of the MPT drivers has yielded an (apparently) stable system, tested under reasonably high load (load average ~20).

Incidentally, for those of you who like acronyms, MPT stands for 'Message passing technology'.

My work here is done.