From owner-freebsd-stable@FreeBSD.ORG  Mon Jul 19 03:01:19 2010
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id E14481065673
	for <freebsd-stable@freebsd.org>; Mon, 19 Jul 2010 03:01:19 +0000 (UTC)
	(envelope-from mike@sentex.net)
Received: from lava.sentex.ca (pyroxene.sentex.ca [199.212.134.18])
	by mx1.freebsd.org (Postfix) with ESMTP id 4EBA28FC15
	for <freebsd-stable@freebsd.org>; Mon, 19 Jul 2010 03:01:18 +0000 (UTC)
Received: from mdt-xp.sentex.net (simeon.sentex.ca [192.168.43.27])
	by lava.sentex.ca (8.14.4/8.14.3) with ESMTP id o6J31Hs1045607;
	Sun, 18 Jul 2010 23:01:17 -0400 (EDT) (envelope-from mike@sentex.net)
Message-Id: <201007190301.o6J31Hs1045607@lava.sentex.ca>
X-Mailer: QUALCOMM Windows Eudora Version 7.1.0.9
Date: Sun, 18 Jul 2010 23:01:03 -0400
To: Jeremy Chadwick <freebsd@jdc.parodius.com>
From: Mike Tancsa <mike@sentex.net>
In-Reply-To: <20100719023419.GA91006@icarus.home.lan>
References: <201007182108.o6IL88eG043887@lava.sentex.ca>
	<20100718211415.GA84127@icarus.home.lan>
	<201007182142.o6ILgDQW044046@lava.sentex.ca>
	<20100719023419.GA91006@icarus.home.lan>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; format=flowed
Cc: freebsd-stable@freebsd.org
Subject: Re: deadlock or bad disk ?  RELENG_8

At 10:34 PM 7/18/2010, Jeremy Chadwick wrote:
>On Sun, Jul 18, 2010 at 05:42:14PM -0400, Mike Tancsa wrote:
> > At 05:14 PM 7/18/2010, Jeremy Chadwick wrote:
> >
> > >Where exactly is your swap partition?
> >
> > On one of the areca raidsets.
> >
> > # swapctl -l
> > Device:       1024-blocks     Used:
> > /dev/da0s1b    10485760       108
>
>So is da0 actually a RAID volume "behind the scenes" on the Areca
>controller?  How many disks are involved in that set?

Yes, da0 is a RAID volume with 4 disks behind the scenes.

>Well, the thread I linked you stated that the problem has to do with a
>controller or disk "taking too long".  I have no idea what the threshold
>is.  I suppose it could also indicate that your system is (possibly)
>running low on resources (RAM); I would imagine swap_pager would get
>called if a process needed to be offloaded to swap.  So maybe this is
>a system tuning thing more than a hardware thing.

Before someone rebooted it, the machine had been stuck in this state for
a good 90 minutes.  Apart from an upgrade to a later RELENG_8 to pick up
the security patches, it had been running the same weekly workloads on a
few versions of RELENG_8 without issue.

/boot/loader.conf has
ahci_load="YES"
siis_load="YES"

sysctl.conf has

net.inet.tcp.recvbuf_max=16777216
net.inet.tcp.recvspace=131072
net.inet.tcp.sendbuf_max=16777216
net.inet.tcp.sendspace=32768
net.inet.udp.recvspace=65536
kern.ipc.somaxconn=1024
kern.ipc.maxsockbuf=4194304
net.inet.ip.redirect=0
net.inet.ip.intr_queue_maxlen=4096
net.route.netisr_maxqlen=1024
kern.ipc.nmbclusters=131072
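
For anyone wanting to confirm that settings like these are actually in
effect on a running box, here is a minimal sketch.  The function name
check_sysctl_conf and the SYSCTL_CMD override are made up for
illustration; only "sysctl -n <name>" is a real command.  It assumes
plain name=value lines with no trailing comments or extra whitespace.

```sh
#!/bin/sh
# Report sysctl.conf entries whose running value differs from the file.
# SYSCTL_CMD is a hypothetical override so the lookup can be stubbed out.
SYSCTL_CMD=${SYSCTL_CMD:-sysctl}

check_sysctl_conf() {
    # $1: path to a sysctl.conf-style file (name=value, # starts a comment line)
    while IFS='=' read -r name value; do
        # Skip blank lines and comment lines.
        case "$name" in
            ''|'#'*) continue ;;
        esac
        # Ask the kernel for the current value; empty if the OID is unknown.
        running=$("$SYSCTL_CMD" -n "$name" 2>/dev/null)
        if [ "$running" != "$value" ]; then
            echo "$name: file=$value running=${running:-unknown}"
        fi
    done < "$1"
}
```

Calling check_sysctl_conf /etc/sysctl.conf prints one line per mismatch
and nothing at all if every listed value matches the running kernel.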

I do track some basic mem stats via rrd.  Looking at the graphs up to
that period, nothing unusual was happening.

CPU: 16.6% user,  0.0% nice,  4.3% system,  0.2% interrupt, 78.8% idle
Mem: 443M Active, 5707M Inact, 1462M Wired, 147M Cache, 828M Buf, 166M Free
Swap: 10G Total, 124K Used, 10G Free
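
As a rough sanity check on the Mem: line above, the sketch below adds up
the Inact, Cache and Free figures, on the assumption that those pages
can be handed out without going to swap.  That is a simplification of
how the VM system behaves, not an exact accounting.  The function name
reclaimable_mb is made up for illustration, and the parsing assumes
every figure carries an M suffix as it does here.

```sh
#!/bin/sh
# Sum the Inact, Cache and Free figures (in MB) from a top(1) "Mem:" line.
reclaimable_mb() {
    printf '%s\n' "$1" | awk '
        {
            total = 0
            # After "Mem:", fields come in pairs: "<number>M" then "<label>,"
            for (i = 2; i < NF; i += 2) {
                label = $(i + 1)
                sub(/,$/, "", label)      # drop the trailing comma
                if (label == "Inact" || label == "Cache" || label == "Free") {
                    value = $i
                    sub(/M$/, "", value)  # drop the unit suffix
                    total += value
                }
            }
            print total
        }'
}

reclaimable_mb "Mem: 443M Active, 5707M Inact, 1462M Wired, 147M Cache, 828M Buf, 166M Free"
# prints 6020
```

That is 5707 + 147 + 166 = 6020 MB, which together with only 124K of
swap in use is consistent with the box not being short of RAM at the
time of that snapshot.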


> >  smartctl -a -d 3ware,1 /dev/twa0
>
>Now I'm confused -- this indicates twa(4) is involved, not arcmsr(4).

The other controllers (3ware and onboard ICH in AHCI mode) provide
other storage on the same box.  I only mentioned them because I checked
all of their disks for errors as well, and there were none.  The dmesg
from the original post enumerates all the devices on the box.

         ---Mike


--------------------------------------------------------------------
Mike Tancsa,                                      tel +1 519 651 3400
Sentex Communications,                            mike@sentex.net
Providing Internet since 1994                    www.sentex.net
Cambridge, Ontario Canada                         www.sentex.net/mike