Showing posts with label linux. Show all posts

Thursday, 17 September 2009

Build Server Woes (mptscsi task abort)

I'm posting this here for two reasons:
  • in the hope that someone out there will find this useful
  • personal therapy

Scenario:

We've got a couple of build servers (x86_64 Linux, OpenVZ) that have been having some disk I/O problems. These boxes (boxen) run various virtual machines related to our product builds: Hudson masters & slaves, distribution servers, Puppet master, test mail server, Munin server... yada yada. You get the idea. They're kinda important.

The problem manifests itself by first reporting errors like the following:

Sep 16 14:28:51 hn3 mptscsih: ioc0: attempting task abort! (sc=ffff880422a348c0)
Sep 16 14:28:51 hn3 sd 0:1:2:0: [sda] CDB: cdb[0]=0x2a: 2a 00 12 b2 fc 9f 00 00 08 00
Sep 16 14:28:51 hn3 mptscsih: ioc0: Issue of TaskMgmt failed!

followed shortly by the volume in question getting offlined into read-only mode:

Sep 16 14:29:41 hn3 mptscsih: ioc0: host reset: SUCCESS (sc=ffff880422a348c0)
Sep 16 14:29:41 hn3 sd 0:1:2:0: Device offlined - not ready after error recovery
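
As an aside, the CDB bytes in the task abort message further up say which command the controller choked on. Opcode 0x2a is WRITE(10) in the SCSI command set, with the logical block address in bytes 2-5 and the transfer length in bytes 7-8. Here's a minimal Python sketch that pulls those fields out of the log line. The function name decode_rw10 is my own invention, and it only handles the 10-byte read/write commands.

```python
def decode_rw10(log_line):
    """Decode the READ(10)/WRITE(10) CDB at the end of an sd log line."""
    # Everything after the last colon is the raw CDB as hex bytes.
    cdb = bytes.fromhex(log_line.rsplit(":", 1)[1].strip())
    opcodes = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
    if len(cdb) != 10 or cdb[0] not in opcodes:
        raise ValueError("not a READ(10)/WRITE(10) CDB")
    return {
        "op": opcodes[cdb[0]],
        "lba": int.from_bytes(cdb[2:6], "big"),     # bytes 2-5, big-endian
        "blocks": int.from_bytes(cdb[7:9], "big"),  # bytes 7-8, big-endian
    }


# The log line from the task abort above.
line = ("Sep 16 14:28:51 hn3 sd 0:1:2:0: [sda] CDB: cdb[0]=0x2a: "
        "2a 00 12 b2 fc 9f 00 00 08 00")
print(decode_rw10(line))
```

For the line above this reports a WRITE(10) of 8 blocks at LBA 0x12b2fc9f, so the command that timed out was a small, ordinary write (4 KiB if the volume uses 512-byte sectors, which I'm assuming rather than checking).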

A read-only volume ain't that helpful when you've got a whole load of hungry VMs wanting to write stuff to disk. A closer look at the disk controller yields the following:

>lspci
...
0b:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)
...

A quick search through our messages shows the following related information:

>dmesg | grep -i mpt
Fusion MPT base driver 3.04.07
Fusion MPT SPI Host driver 3.04.07
Fusion MPT FC Host driver 3.04.07
Fusion MPT SAS Host driver 3.04.07
mptsas 0000:0b:00.0: PCI INT A -> GSI 35 (level, low) -> IRQ 35
mptbase: ioc0: Initiating bringup
mptbase: ioc0: PCI-MSI enabled
mptsas 0000:0b:00.0: setting latency timer to 64
Fusion MPT misc device (ioctl) driver 3.04.07
mptctl: Registered with Fusion MPT base driver
mptctl: /dev/mptctl @ (major,minor=10,220)

A quick google turns up quite a few different issues with these controllers but no clear resolution (no surprise).

To cut a (very) long story short, we appear to have a solution: use the drivers supplied by LSI rather than those shipped with the latest Linux kernel. We patched the kernel by replacing the drivers/message/fusion folder with the equivalent found in LSI's MPTLINUX_RHEL5_SLES10_PH16-4.18.00.00-1.zip distribution, which takes the MPT drivers from 3.04.07 to version 4.18. This has yielded an (apparently) stable system, tested under reasonably high load (load average ~20).
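
If you try the same swap, it's worth confirming that the patched drivers are the ones that actually loaded. One way is to read the version banners out of dmesg, like the ones shown earlier. The sketch below is only that: the function name mpt_driver_versions is made up, and it assumes the banners keep the "Fusion MPT <name> driver <version>" shape seen in the 3.04.07 output, which I haven't verified for every LSI release.

```python
import re

# Matches banner lines such as "Fusion MPT SAS Host driver 3.04.07".
# Assumption: newer drivers print banners in the same shape.
BANNER = re.compile(
    r"Fusion MPT (?P<name>.+?) driver (?P<version>\d+(?:\.\d+)+)\s*$"
)


def mpt_driver_versions(dmesg_text):
    """Map each Fusion MPT driver name to the version in its banner."""
    versions = {}
    for text_line in dmesg_text.splitlines():
        match = BANNER.search(text_line)
        if match:
            versions[match.group("name")] = match.group("version")
    return versions


# Sample taken from the dmesg output earlier in this post.
SAMPLE = """\
Fusion MPT base driver 3.04.07
Fusion MPT SAS Host driver 3.04.07
mptbase: ioc0: Initiating bringup
Fusion MPT misc device (ioctl) driver 3.04.07
mptctl: Registered with Fusion MPT base driver
"""

for name, version in mpt_driver_versions(SAMPLE).items():
    print(name, version)
```

Feed it the real output of dmesg after a reboot; if the base driver still reports 3.04.07, the kernel is running the stock code.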

Incidentally, for those of you who like acronyms, MPT stands for 'Message Passing Technology'.

My work here is done.

Sunday, 31 August 2008

Up The Limit

Recently, while using the Searchable plugin with Grails, I've had exceptions citing too many open files. Linux systems typically limit resources such as the number of processes and the number of open files, with the limits configured per user. Running ulimit -a, I see the following:
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 36864
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 36864
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
On this system each process I run can only open 1024 files at once. I can increase this limit in /etc/security/limits.conf by adding something similar to the following:
gus  soft  nofile  16384
gus  hard  nofile  16384
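Each limits.conf entry has four whitespace-separated fields: domain (here the user gus), type (soft or hard), item (nofile) and value. As a rough illustration of that layout, here's a small Python sketch that splits entries into those fields. The names LimitEntry and parse_limits are mine, and the parser is deliberately naive: it skips comments and anything that isn't exactly four fields.

```python
from collections import namedtuple

# One limits.conf entry: <domain> <type> <item> <value>
LimitEntry = namedtuple("LimitEntry", "domain limit_type item value")


def parse_limits(text):
    """Return the four-field entries found in limits.conf-style text."""
    entries = []
    for raw in text.splitlines():
        entry_text = raw.split("#", 1)[0].strip()  # drop comments
        fields = entry_text.split()
        if len(fields) == 4:
            entries.append(LimitEntry(*fields))
    return entries


# The two entries from this post, plus a comment line.
SAMPLE_LIMITS = """\
# raise the open file limit for gus
gus  soft  nofile  16384
gus  hard  nofile  16384
"""

for entry in parse_limits(SAMPLE_LIMITS):
    print(entry.domain, entry.limit_type, entry.item, entry.value)
```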
To ensure that PAM actually takes notice of these limits on login, check /etc/pam.d/login for an entry like:
session    required     pam_limits.so
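The new limits only apply to sessions started after the change, so log out and back in before checking. From a shell, ulimit -n shows the soft limit. From inside a process you can ask the kernel directly, and the sketch below does that with Python's standard resource module. Python is just a convenient stand-in here (Grails runs on the JVM), and the helper names nofile_limits and describe are my own.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(value):
    """Render a limit the way ulimit does, with 'unlimited' for infinity."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


soft, hard = nofile_limits()
print("open files: soft=%s hard=%s" % (describe(soft), describe(hard)))
```

If the entries above took effect, both numbers should come back as 16384 for a process started by gus.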
Can haz filez!