From owner-freebsd-stable@FreeBSD.ORG  Wed Jun 19 13:01:16 2013
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 16E8EB96
 for <freebsd-stable@freebsd.org>; Wed, 19 Jun 2013 13:01:16 +0000 (UTC)
 (envelope-from dk@neveragain.de)
Received: from mail.neveragain.de (mail.neveragain.de [IPv6:2001:aa8:fffc::25])
 by mx1.freebsd.org (Postfix) with ESMTP id D7A841F6F
 for <freebsd-stable@freebsd.org>; Wed, 19 Jun 2013 13:01:15 +0000 (UTC)
Received: from dottie.dus.openit.net (dottie.dus.openit.net
 [IPv6:2001:aa8:fff3::fffd])
 (using TLSv1 with cipher AES128-SHA (128/128 bits))
 (No client certificate requested)
 by mail.neveragain.de (Postfix) with ESMTPSA id 5BA0D14E80
 for <freebsd-stable@freebsd.org>; Wed, 19 Jun 2013 15:01:14 +0200 (CEST)
From: =?iso-8859-1?Q?Dennis_K=F6gel?= <dk@neveragain.de>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable
Subject: Weird I/O hangs (9.1R, arcsas, interrupt spikes on uhci0)
Date: Wed, 19 Jun 2013 15:01:14 +0200
Message-Id: <C2AA9591-CBF4-4956-BABE-08BD8994FF8C@neveragain.de>
To: freebsd-stable@freebsd.org
Mime-Version: 1.0 (Apple Message framework v1283)
X-Mailer: Apple Mail (2.1283)
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-stable>,
 <mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
 <mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 19 Jun 2013 13:01:16 -0000

Hi,

Very regularly, we see I/O hangs of about 10 seconds, roughly once per
minute.

Each time this happens, the I/O rate simply drops to zero, and all disk
access hangs; this is also very noticeable on the shell, for NFS clients
etc. Everything else (networking, kernel, ...) seems to continue
normally.

Environment: FreeBSD 9.1R GENERIC on amd64, using ZFS, on an ARC1320
PCIe controller with 24x Seagate ST33000650SS (3rd party arcsas.ko
driver).

It's easy to observe these hangs under write load, e.g. with
'zpool iostat 1':

void        22.4T  42.6T     34  2.73K  1.07M   293M
void        22.4T  42.6T     20  2.74K   623K   289M
void        22.4T  42.6T    144  2.62K  4.83M   279M
void        22.4T  42.6T     13  2.60K   437K   283M
void        22.4T  42.6T      0      0      0      0 <-- hang starts
void        22.4T  42.6T      0      0      0      0
void        22.4T  42.6T      0      0      0      0
void        22.4T  42.6T      0      0      0      0
void        22.4T  42.6T      0      0      0      0
void        22.4T  42.6T      0      0      0      0
void        22.4T  42.6T      0      0      0      0
void        22.4T  42.6T      0      0      0      0
void        22.4T  42.6T      0    296  4.00K  34.2M <-- hang ends
void        22.4T  42.6T      2  2.64K  73.8K   288M
void        22.4T  42.6T      8  3.12K   278K   329M
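
In case it helps anyone reproduce this, here is a minimal sketch (Python,
not something we run in production) that flags such stalls in captured
'zpool iostat 1' output. The names to_number and find_stalls and the
min_len threshold are made up for illustration; a "stall" is assumed to
mean consecutive samples with zero read and zero write operations.

```python
# Sample copied from the 'zpool iostat 1' output above.
SAMPLE = """\
void        22.4T  42.6T     34  2.73K  1.07M   293M
void        22.4T  42.6T     20  2.74K   623K   289M
void        22.4T  42.6T    144  2.62K  4.83M   279M
void        22.4T  42.6T     13  2.60K   437K   283M
void        22.4T  42.6T      0      0      0      0 <-- hang starts
void        22.4T  42.6T      0      0      0      0
void        22.4T  42.6T      0      0      0      0
void        22.4T  42.6T      0      0      0      0
void        22.4T  42.6T      0      0      0      0
void        22.4T  42.6T      0      0      0      0
void        22.4T  42.6T      0      0      0      0
void        22.4T  42.6T      0      0      0      0
void        22.4T  42.6T      0    296  4.00K  34.2M <-- hang ends
void        22.4T  42.6T      2  2.64K  73.8K   288M
void        22.4T  42.6T      8  3.12K   278K   329M
"""


def to_number(tok):
    """Convert a value such as '2.73K' or '293M' to a float.

    The 1024-based scale is an assumption; only zero vs. non-zero
    matters for stall detection.
    """
    units = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
    if tok[-1] in units:
        return float(tok[:-1]) * units[tok[-1]]
    return float(tok)


def find_stalls(lines, min_len=3):
    """Return (first_line_index, sample_count) for each run of at least
    min_len consecutive samples with zero read and zero write operations."""
    stalls = []
    run_start = None
    run_len = 0
    for i, line in enumerate(lines):
        fields = line.split()
        try:
            # Columns: pool, alloc, free, read ops, write ops, read bw, write bw
            idle = to_number(fields[3]) == 0 and to_number(fields[4]) == 0
        except (IndexError, ValueError):
            continue  # header or separator line: skip without ending a run
        if idle:
            if run_start is None:
                run_start, run_len = i, 0
            run_len += 1
        else:
            if run_start is not None and run_len >= min_len:
                stalls.append((run_start, run_len))
            run_start = None
    if run_start is not None and run_len >= min_len:
        stalls.append((run_start, run_len))
    return stalls


if __name__ == "__main__":
    print(find_stalls(SAMPLE.splitlines()))  # [(4, 8)]
```

With a one-second sampling interval the sample count roughly equals the
stall duration in seconds, so the run above is about 8 seconds long.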

Each time this happens, there is a completely unexplained spike of
interrupts on uhci0: 'systat -vm' then displays numbers around 270k.

# vmstat -i | grep -E '(arcsas|uhci0|Total)'
irq16: uhci0                  1227020890      67708
irq24: arcsas0                  12045211        664
Total                         1266417827      69882
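
Since 'vmstat -i' only shows totals and an average rate since boot, the
spikes are easier to line up with the hangs by diffing successive
snapshots. A rough sketch of that idea follows; parse_vmstat_i, delta and
watch are hypothetical helper names, and the parser assumes the
three-column "name, total, rate" layout shown above.

```python
import subprocess
import time

# Sample copied from the 'vmstat -i' output above.
SAMPLE = """\
irq16: uhci0                  1227020890      67708
irq24: arcsas0                  12045211        664
Total                         1266417827      69882
"""


def parse_vmstat_i(text):
    """Map each interrupt source name to its total count."""
    counts = {}
    for line in text.splitlines():
        parts = line.rsplit(None, 2)  # name may contain spaces; split from the right
        if len(parts) != 3:
            continue
        name, total, _rate = parts
        try:
            counts[name.strip()] = int(total)
        except ValueError:
            continue  # column header line
    return counts


def delta(before, after):
    """Interrupts per source that occurred between two snapshots."""
    return {name: after[name] - before.get(name, 0) for name in after}


def snapshot():
    """Run 'vmstat -i' and parse its output."""
    out = subprocess.run(["vmstat", "-i"], capture_output=True,
                         text=True, check=True).stdout
    return parse_vmstat_i(out)


def watch(interval=1.0):
    """Print a timestamped per-interval interrupt count until interrupted."""
    prev = snapshot()
    while True:
        time.sleep(interval)
        cur = snapshot()
        busy = {k: v for k, v in delta(prev, cur).items() if v}
        print(time.strftime("%H:%M:%S"), busy)
        prev = cur


if __name__ == "__main__":
    totals = parse_vmstat_i(SAMPLE)
    # Fraction of all interrupts since boot attributed to uhci0
    print(totals["irq16: uhci0"] / totals["Total"])
```

Running watch() next to 'zpool iostat 1' should show whether the uhci0
counter jumps in the same seconds in which the pool goes idle.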

Things to note:

- Booting a USB-less kernel or disabling all USB in the BIOS doesn't
  change a thing (no interrupt spikes to be seen, but the hangs remain)
- The hangs / interrupt spikes happen just as often when the system is
  idle
- Board is a Supermicro X8DTH
- There are two igb cards
- Root is ZFS as well (separate pool though)
- BIOS, Areca FW and driver are already at the latest versions
- Moving the controller to a different slot doesn't change the
  behaviour
- We have two identical systems and both show the exact same symptoms,
  so flaky hardware is probably not the issue

Any ideas would be appreciated.

Thanks,
D.