From: Dennis Koegel
To: stable@freebsd.org
Date: Wed, 14 Sep 2011 10:08:31 +0200
Message-ID: <20110914080831.GB41431@neveragain.de>
Subject: System freeze: Adaptec (aac) timeouts (releng 8)

Cheers,

we have a reproducible system freeze due to Adaptec driver (aac) timeouts:

Sep 3 05:26:44 foo kernel: aac0: COMMAND 0xffffff80005ae4c0 (TYPE 502) TIMEOUT AFTER 129 SECONDS
Sep 3 05:26:44 foo kernel: aac0: COMMAND 0xffffff80005ac0e0 (TYPE 502) TIMEOUT AFTER 129 SECONDS
Sep 3 05:26:44 foo kernel: aac0: COMMAND 0xffffff80005b0fa0 (TYPE 502) TIMEOUT AFTER 129 SECONDS

Once this happens, the userland seems to be alive, but the controller is completely dead. As soon as the disk subsystem is involved, any process hangs forever (e.g. SSH crypto-exchange still happens, but a shell won't even start anymore).

We observe the same issue on two systems of (mostly) identical spec, so it's not a hardware issue.

Apparently this only happens under heavy disk I/O and high CPU load.
Notably, high write throughput plus a 'zpool scrub' on a large GELI-backed zpool usually triggers the problem after a few hours. Without high activity, the systems run smoothly for weeks.

Both systems are amd64 with an Adaptec 5805 controller and 16 disks. Two of the disks form a RAID-1 system volume (UFS), and the remaining 14 serve as JBOD for a large zpool, for a total of 15 "aacd" devices.

Both were running 8.2R originally. I've taken them to 8-STABLE now and also applied svn r222951 (where the MFC was forgotten, it seems), but the problem remains.

Any help is greatly appreciated.

Thanks,
- D.