From owner-freebsd-bugs@FreeBSD.ORG Tue Jun 9 09:04:17 2009
Message-ID: <4A2E258B.9090207@bindone.de>
Date: Tue, 09 Jun 2009 11:04:11 +0200
From: Michael
To: freebsd-bugs@freebsd.org
Subject: Adaptec 5405 (aac0) hanging on high load
List-Id: Bug reports

(I also filed this as a PR through the website, but I'm still waiting for confirmation and the assignment of a PR number.)

Hi folks,

I've got issues with an Adaptec 5405 (aac0) that sometimes hangs under high load. It reports "COMMAND xxx TIMEOUT AFTER yyy SECONDS" multiple times and then "Controller is no longer running". I've seen this on 7.1-RELEASE, 7.2-RC2, 7.2-RELEASE and 8-CURRENT. It can be provoked by high load, such as a highly parallel make buildworld or various benchmarks (e.g. /usr/ports/benchmarks/blogbench).
I've been wondering if this is somehow related to the following article in the Adaptec knowledge base:

http://ask.adaptec.com/scripts/adaptec_tic.cfg/php.exe/enduser/std_adp.php?p_faqid=15357&p_created=1225366599&p_sid=NqNtKZrj&p_accessibility=0&p_redirect=&p_lva=&p_sp=cF9zcmNoPSZwX3NvcnRfYnk9JnBfZ3JpZHNvcnQ9JnBfcm93X2NudD0yNjk3LDI2OTcmcF9wcm9kcz0mcF9jYXRzPSZwX3B2PSZwX2N2PSZwX3NlYXJjaF90eXBlPWFuc3dlcnMuc2VhcmNoX25sJnBfcGFnZT0x&p_li=&p_topview=1

It states:

"AACRAID based controllers have an underlying timeout/recovery cycle that is 35 seconds long. The default in some SCSI subsystems was 60 seconds in the past, but is now standardized at 30 seconds which results in an interference pattern between the controller and the Linux SCSI subsystem."

(I've copied and pasted the entire article at the end of this post.)

sys/dev/aac/aacvar.h sets AAC_CMD_TIMEOUT to 30 seconds, which is shorter than that 35-second cycle, so the FreeBSD driver might be hitting the same interference. (There are also timeouts for immediate commands and an interval for the periodic timeout check. I'm not sure how they're used in aac.c and haven't checked.)

The bottom line is that Adaptec states that their AACRAID based controllers may sometimes need more than 35 seconds to process a command under normal operating conditions, if the controller is going through an "error correction cycle on the SAS/SATA bus".

cheers
Michael

-- Complete Adaptec knowledge base entry --

AACRAID based controllers have an underlying timeout/recovery cycle that is 35 seconds long. The default in some SCSI subsystems was 60 seconds in the past, but is now standardized at 30 seconds which results in an interference pattern between the controller and the Linux SCSI subsystem.

The alternate workaround is for the user to adjust the timeout in SYSFS if it is shorter than 35 seconds. Changing the timeout values for a Linux block device can be done via SYSFS.
For example, if /dev/sdc, /dev/sdd and /dev/sde are the device LUNs on a given Linux host, then the following commands need to be issued:

    echo 45 > /sys/block/sdc/device/timeout
    echo 45 > /sys/block/sdd/device/timeout
    echo 45 > /sys/block/sde/device/timeout

In this example the timeout is 45 seconds which should be enough.

Note: Any AACRAID based controller is going through an error correction cycle on the SAS/SATA bus that is delaying the completion of I/O beyond the Linux default timeout set for the device, this may be a hardware issue or a problem with the default timeout value as outlined above.

If changing the timeout value doesn't solve the problem then please follow the steps we recommend to trouble shoot "Host adapter reset request. SCSI hang ?" messages:

- Check for any updated firmware for the motherboard, controller, targets and enclosure on the respective manufacturer's web sites.
- Check per-device queue depth in SYSFS to make sure it is reasonable.
- Engage disk drive manufacturer's technical support department to check through compatibility or drive class issues.
- Engage enclosure manufacturer's technical support department to check through compatibility issues.
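
For anyone trying the Linux workaround quoted above on more than a few devices, here is a minimal sketch of it as a shell function. It only raises a timeout that is currently shorter than the 35-second controller cycle, as the article suggests. The names raise_timeouts, SYSFS_ROOT, MIN_TIMEOUT and NEW_TIMEOUT are my own and not part of the Adaptec article. This covers Linux only and says nothing about how AAC_CMD_TIMEOUT should be changed on FreeBSD.

```shell
#!/bin/sh
# Sketch only: raise the per-device SCSI command timeout via sysfs,
# but only where it is shorter than the controller's recovery cycle.
# All names below are illustrative, not taken from the Adaptec article.

SYSFS_ROOT="${SYSFS_ROOT:-/sys/block}"   # overridable, e.g. for a dry run
MIN_TIMEOUT=35   # controller timeout/recovery cycle quoted in the article
NEW_TIMEOUT=45   # value used in the article's example

raise_timeouts() {
    for dev in "$@"; do
        f="$SYSFS_ROOT/$dev/device/timeout"
        if [ ! -w "$f" ]; then
            # Device missing or not writable (usually: not running as root)
            echo "skipping $dev: $f not writable" >&2
            continue
        fi
        cur=$(cat "$f")
        if [ "$cur" -lt "$MIN_TIMEOUT" ]; then
            echo "$NEW_TIMEOUT" > "$f"
            echo "$dev: $cur -> $NEW_TIMEOUT"
        else
            echo "$dev: $cur (unchanged)"
        fi
    done
}

# Example, as root, for the devices named in the article:
#   raise_timeouts sdc sdd sde
```

The value written this way does not survive a reboot, so it would have to be reapplied at boot time if it turns out to help.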