From owner-freebsd-stable@FreeBSD.ORG  Wed Sep 14 08:08:32 2011
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 7D42F106564A
	for <stable@freebsd.org>; Wed, 14 Sep 2011 08:08:32 +0000 (UTC)
	(envelope-from dk@mail.neveragain.de)
Received: from mail.neveragain.de (mail.neveragain.de [IPv6:2001:aa8:fffc::25])
	by mx1.freebsd.org (Postfix) with ESMTP id 497E28FC1C
	for <stable@freebsd.org>; Wed, 14 Sep 2011 08:08:32 +0000 (UTC)
Received: by mail.neveragain.de (Postfix, from userid 1002)
	id 7B68417022; Wed, 14 Sep 2011 10:08:31 +0200 (CEST)
Date: Wed, 14 Sep 2011 10:08:31 +0200
From: Dennis Koegel <dk@neveragain.de>
To: stable@freebsd.org
Message-ID: <20110914080831.GB41431@neveragain.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
User-Agent: Mutt/1.4.2.3i
Cc: 
Subject: System freeze: Adaptec (aac) timeouts (releng 8)
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 14 Sep 2011 08:08:32 -0000

Cheers,

we have a reproducible system freeze due to Adaptec driver (aac) timeouts:

Sep  3 05:26:44 foo kernel: aac0: COMMAND 0xffffff80005ae4c0 (TYPE 502) TIMEOUT AFTER 129 SECONDS
Sep  3 05:26:44 foo kernel: aac0: COMMAND 0xffffff80005ac0e0 (TYPE 502) TIMEOUT AFTER 129 SECONDS
Sep  3 05:26:44 foo kernel: aac0: COMMAND 0xffffff80005b0fa0 (TYPE 502) TIMEOUT AFTER 129 SECONDS
<dozens more of these...>
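
For anyone wanting to tally these messages from a saved log, here is a
minimal sketch. The helper name summarize_timeouts is hypothetical, and
the pattern is only an assumption derived from the three lines quoted
above; other aac message variants would not match.

```python
import re
from collections import Counter

# Assumed line shape, taken from the three log lines quoted above.
TIMEOUT_RE = re.compile(
    r"(?P<unit>aac\d+): COMMAND (?P<addr>0x[0-9a-f]+) "
    r"\(TYPE (?P<type>\d+)\) TIMEOUT AFTER (?P<secs>\d+) SECONDS"
)


def summarize_timeouts(lines):
    """Count timeouts per (controller, command type); track the longest wait."""
    counts = Counter()
    longest = 0
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m is None:
            continue  # not a timeout line, skip it
        counts[(m.group("unit"), int(m.group("type")))] += 1
        longest = max(longest, int(m.group("secs")))
    return counts, longest


# The three lines from the report, used as sample input.
SAMPLE = [
    "Sep  3 05:26:44 foo kernel: aac0: COMMAND 0xffffff80005ae4c0 (TYPE 502) TIMEOUT AFTER 129 SECONDS",
    "Sep  3 05:26:44 foo kernel: aac0: COMMAND 0xffffff80005ac0e0 (TYPE 502) TIMEOUT AFTER 129 SECONDS",
    "Sep  3 05:26:44 foo kernel: aac0: COMMAND 0xffffff80005b0fa0 (TYPE 502) TIMEOUT AFTER 129 SECONDS",
]

if __name__ == "__main__":
    counts, longest = summarize_timeouts(SAMPLE)
    for (unit, cmd_type), n in sorted(counts.items()):
        print(f"{unit} type {cmd_type}: {n} timeouts")
    print(f"longest timeout: {longest} seconds")
```

To run it against a real log, pass an open file object instead of
SAMPLE, since the function accepts any iterable of lines.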

Once this happens, userland still appears to be alive, but the
controller is completely dead. Any process that touches the disk
subsystem hangs forever (e.g. the SSH key exchange still completes, but
a shell never starts).

We observe the same issue on two systems of (mostly) identical spec, so
a defect in one individual piece of hardware seems unlikely.

Apparently this only happens under heavy disk I/O combined with high
CPU load. Notably, high write throughput plus a 'zpool scrub' on a
large GELI-backed zpool usually triggers the problem after a few hours.
Without high activity, both systems run smoothly for weeks.

Both systems are amd64 with an Adaptec 5805 controller and 16 disks (of
which two form a RAID-1 system volume (UFS), and the remaining 14 serve
as JBOD for a large zpool -- a total of 15 "aacd" devices).

Both were originally running 8.2-RELEASE. I have since updated them to
8-STABLE and also applied svn r222951 (which, it seems, was never
MFC'd), but the problem remains.

Any help is greatly appreciated.

Thanks,
- D.