Message-ID: <4B950ABF.2050403@lynchpin.com>
Date: Mon, 08 Mar 2010 14:33:35 +0000
From: Andrew Hood
To: freebsd-hardware@freebsd.org
Subject: amr lockup on 8.0-RELEASE

Hi,

I recently upgraded to 8.0-RELEASE-p2 (amd64) on a dual-processor Opteron system with an LSI MegaRAID SCSI 320-1. Since then, I have been getting a complete lock-up of the disk subsystem under heavy write load. It copes fine with a kernel build, but an attempt to rsync 150GB or so of data from the machine it is supposed to be replacing routinely hangs.

I can systematically (and pretty immediately) recreate the issue using /usr/ports/sysutils/stress with one hdd hog (stress -d 1). When the hang occurs, the load average gradually moves up to 0.99 with the following CPU states shown in top:

CPU: 0.0% user, 0.0% nice, 0.0% system, 25.0% interrupt, 75.0% idle

I'm guessing 25% is expressed as a proportion of 4 processor cores (2 x dual cores)? If I run top -S, I can see one interrupt handler (?) at 100%:

12 root 20 -60 - 0K 320K WAIT 0 0:08 100.00% intr

From that point, the machine will happily do anything that doesn't involve reading from or writing to disk. Anything attempting to access the disk subsystem will just hang indefinitely. Killing the process that was attempting to access the disk does not restore things. There are no errors at all in syslog or on the console.

The machine had previously been running quite happily on 6.2-RELEASE as a PostgreSQL server without any issues, but equally it may not have been as heavily loaded.

I'm not quite sure where to look next in terms of further diagnosis, and wondered if anyone had experienced anything similar?

Thanks,

Andrew

-- 
Andrew Hood
Managing Director
Lynchpin Analytics
t: 0845 838 1136
f: 0845 838 1137
e: andrew.hood@lynchpin.com

Lynchpin Analytics Limited is registered in Scotland No. SC279857
Registered Office: 5th Floor, 7 Castle Street, Edinburgh, EH2 3AH
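
As a sanity check on the 25% guess in the message, here is a minimal sketch of the arithmetic. It assumes that the summary CPU line in top averages each state across all cores, and that exactly one of the four cores is fully occupied by the interrupt thread. The function name and the per-core figures are hypothetical and are not taken from the affected machine.

```python
def aggregate_cpu_percent(per_core_percent):
    """Average one CPU state across all cores, as a system-wide summary would."""
    return sum(per_core_percent) / len(per_core_percent)


# Assumption: 2 x dual-core Opterons = 4 cores, one of them pinned at
# 100% interrupt time while the other three sit idle.
interrupt_per_core = [100.0, 0.0, 0.0, 0.0]
idle_per_core = [0.0, 100.0, 100.0, 100.0]

print("interrupt:", aggregate_cpu_percent(interrupt_per_core))
print("idle:", aggregate_cpu_percent(idle_per_core))
```

Under those assumptions the averages come out at 25% interrupt and 75% idle, which would be consistent with a single interrupt thread showing 100% in top -S.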