From owner-freebsd-hardware@FreeBSD.ORG  Mon Mar  8 15:00:19 2010
Return-Path: <owner-freebsd-hardware@FreeBSD.ORG>
Delivered-To: freebsd-hardware@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id A3C701065676
	for <freebsd-hardware@freebsd.org>;
	Mon,  8 Mar 2010 15:00:19 +0000 (UTC)
	(envelope-from andrew.hood@lynchpin.com)
Received: from zebedee.abp.lypn.net (zebedee.abp.lypn.net [212.11.77.147])
	by mx1.freebsd.org (Postfix) with ESMTP id 013748FC19
	for <freebsd-hardware@freebsd.org>;
	Mon,  8 Mar 2010 15:00:18 +0000 (UTC)
Received: (qmail 79823 invoked by uid 98); 8 Mar 2010 14:33:35 -0000
Received: from 192.168.13.65 by zebedee.abp.lypn.net (envelope-from
	<andrew.hood@lynchpin.com>, uid 82) with qmail-scanner-2.01 
	(clamdscan: 0.95.2/9703. spamassassin: 3.2.5.  
	Clear:RC:1(192.168.13.65):. 
	Processed in 0.019069 secs); 08 Mar 2010 14:33:35 -0000
Received: from unknown (HELO ?192.168.13.65?) (192.168.13.65)
	by mail.lypn.net with CAMELLIA256-SHA encrypted SMTP;
	8 Mar 2010 14:33:35 -0000
Message-ID: <4B950ABF.2050403@lynchpin.com>
Date: Mon, 08 Mar 2010 14:33:35 +0000
From: Andrew Hood <andrew.hood@lynchpin.com>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB;
	rv:1.9.1.8) Gecko/20100227 Thunderbird/3.0.3
MIME-Version: 1.0
To: freebsd-hardware@freebsd.org
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: amr lockup on 8.0-RELEASE
X-BeenThere: freebsd-hardware@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: General discussion of FreeBSD hardware <freebsd-hardware.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hardware>, 
	<mailto:freebsd-hardware-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-hardware>
List-Post: <mailto:freebsd-hardware@freebsd.org>
List-Help: <mailto:freebsd-hardware-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hardware>, 
	<mailto:freebsd-hardware-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 08 Mar 2010 15:00:19 -0000

Hi,

I recently upgraded a dual-processor Opteron system with an LSI 
MegaRAID SCSI 320-1 to 8.0-RELEASE-p2 (amd64).

Since then, I have been getting a complete lock-up of the disk 
subsystem under heavy write load.

It copes fine with a kernel build, but an attempt to rsync 150GB or so 
of data from the machine it is supposed to be replacing routinely hangs.

I can reliably (and almost immediately) reproduce the issue using 
/usr/ports/sysutils/stress with one hdd hog (stress -d 1).

When the hang occurs, the load average gradually moves up to 0.99 with 
the following CPU states shown in top:

CPU:  0.0% user,  0.0% nice,  0.0% system, 25.0% interrupt, 75.0% idle

I'm guessing 25% is expressed as a proportion of 4 processor cores (2 x 
dual cores)?
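
As a rough check of that guess, here is a minimal sketch of the arithmetic. It assumes 4 logical CPUs (2 x dual core, as guessed above) and that top's summary line averages over all of them; the helper names are made up for illustration.

```python
import re

# The CPU summary line exactly as quoted from top above.
TOP_LINE = "CPU:  0.0% user,  0.0% nice,  0.0% system, 25.0% interrupt, 75.0% idle"


def parse_cpu_states(line):
    """Return a dict such as {'interrupt': 25.0, 'idle': 75.0, ...}."""
    return {name: float(pct) for pct, name in re.findall(r"([\d.]+)% (\w+)", line)}


def busy_cores(percent, ncpu):
    """Convert a machine-wide percentage into an equivalent number of busy cores."""
    return percent / 100.0 * ncpu


states = parse_cpu_states(TOP_LINE)
# Assumption: ncpu = 4 (2 sockets x 2 cores); not confirmed by the output itself.
print(busy_cores(states["interrupt"], 4))
```

If the assumption holds, 25% interrupt time across the machine corresponds to one core spending all of its time in interrupt handling.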

If I run top -S, I can see one interrupt handler (?) at 100%:

12 root       20 -60    -     0K   320K WAIT    0   0:08 100.00% intr

After that, the machine will happily do anything that doesn't 
involve reading or writing to disk. Anything attempting to access the 
disk subsystem will just hang indefinitely. Killing the process that was 
attempting to access the disk does not restore things.

No errors at all in syslog or on the console.

The machine had previously been running quite happily on 6.2-RELEASE 
as a PostgreSQL server without any issues, but it may not have been as 
heavily loaded then.

I'm not quite sure where to look next for further diagnosis. Has 
anyone experienced anything similar?

Thanks,
Andrew

-- 
Andrew Hood
Managing Director
Lynchpin Analytics
t: 0845 838 1136
f: 0845 838 1137
e: andrew.hood@lynchpin.com

Lynchpin Analytics Limited is registered in Scotland No. SC279857
Registered Office: 5th Floor, 7 Castle Street, Edinburgh, EH2 3AH

From owner-freebsd-hardware@FreeBSD.ORG  Tue Mar  9 05:09:41 2010
Return-Path: <owner-freebsd-hardware@FreeBSD.ORG>
Delivered-To: freebsd-hardware@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 11229106566C
	for <freebsd-hardware@freebsd.org>;
	Tue,  9 Mar 2010 05:09:41 +0000 (UTC)
	(envelope-from freebsd@sopwith.solgatos.com)
Received: from sopwith.solgatos.com
	(pool-98-108-131-15.ptldor.fios.verizon.net [98.108.131.15])
	by mx1.freebsd.org (Postfix) with ESMTP id 28F8B8FC15
	for <freebsd-hardware@freebsd.org>;
	Tue,  9 Mar 2010 05:09:39 +0000 (UTC)
Received: by sopwith.solgatos.com (Postfix, from userid 66)
	id 381FFB64F; Mon,  8 Mar 2010 21:09:14 -0800 (PST)
Received: from localhost by sopwith.solgatos.com (8.8.8/6.24)
	id CAA13697; Tue, 9 Mar 2010 02:37:07 GMT
Message-Id: <201003090237.CAA13697@sopwith.solgatos.com>
To: freebsd-hardware@freebsd.org
Date: Mon, 08 Mar 2010 18:37:07 PST
From: Dieter <freebsd@sopwith.solgatos.com>
Subject: siis(4) questions
Reply-To: freebsd@sopwith.solgatos.com
X-List-Received-Date: Tue, 09 Mar 2010 05:09:41 -0000

The siis(4) man page refers to an ada(4) man page, but
I can't find it.  I even tried the online man page tool.

dmesg says:

    ada0 at siisch0 bus 0 target 0 lun 0
    ada0: <Hitachi HDT721010SLA360 ST6OA31B> ATA/ATAPI-8 SATA 2.x device
    ada0: 300.000MB/s transfers
    ada0: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)
    ada0: Native Command Queueing enabled

but camcontrol says:

    # camcontrol inquiry ada0 -R
    150.000MB/s transfers

    # camcontrol identify ada0
    [ ... ]
    Feature                      Support  Enable    Value           Vendor
    write cache                    yes      no
    read ahead                     yes      yes
    Native Command Queuing (NCQ)   yes              31/0x1F

So camcontrol disagrees with dmesg about the transfer speed.
And camcontrol doesn't even fill in whether NCQ is enabled or not.
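
To make the mismatch concrete, here is a minimal sketch that pulls the transfer rate out of each of the two quoted lines and compares them. The function name is made up for illustration, and the regular expression only assumes the "NNN.NNNMB/s transfers" wording shown above.

```python
import re

# Lines copied from the dmesg and camcontrol output quoted above.
DMESG_LINE = "ada0: 300.000MB/s transfers"
CAMCONTROL_LINE = "150.000MB/s transfers"


def transfer_rate_mb_s(text):
    """Extract the rate from an 'NNN.NNNMB/s transfers' line, or None if absent."""
    match = re.search(r"([\d.]+)MB/s transfers", text)
    return float(match.group(1)) if match else None


dmesg_rate = transfer_rate_mb_s(DMESG_LINE)
cam_rate = transfer_rate_mb_s(CAMCONTROL_LINE)
print("dmesg:", dmesg_rate, "camcontrol:", cam_rate, "agree:", dmesg_rate == cam_rate)
```

For what it's worth, 300 MB/s and 150 MB/s are the usual payload rates quoted for SATA 3.0 Gb/s and 1.5 Gb/s links, so the two tools appear to disagree about the negotiated link generation rather than printing arbitrary numbers.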

The readcap and modepage stuff doesn't work at all:

    # camcontrol readcap ada0
    (pass0:siisch0:0:0:0): READ CAPACITY(10). CDB: 25 0 0 0 0 0 0 0 0 0
    (pass0:siisch0:0:0:0): CAM Status: SCSI Status Error
    (pass0:siisch0:0:0:0): SCSI Status: Check Condition

    # camcontrol modepage ada0 -l
    camcontrol: error sending mode sense command
    # camcontrol modepage ada0 -m 0
    camcontrol: error sending mode sense command
    # camcontrol modepage ada0 -m 1
    camcontrol: error sending mode sense command
    # camcontrol modepage ada0 -m 2
    camcontrol: error sending mode sense command
    # camcontrol modepage ada1 -m 2
    camcontrol: error sending mode sense command

I get the same problems with Seagate disks, so it isn't just Hitachi.

I suspect that the dmesg info is correct, but I'm not sure whether
to believe *any* of what camcontrol prints, such as whether the
write cache is on or not.  And with the modepage commands not
working, how do I turn the write cache on and off?