From: David P Discher
Date: Tue, 27 Sep 2011 12:17:38 -0700
To: Adam Nowacki
Cc: freebsd-fs@freebsd.org
Subject: Re: ZFS and 3ware controller resets
Message-Id: <299DCA15-FD90-4238-9DD9-C1B8F94CC726@bitgravity.com>
In-Reply-To: <4E7F61A2.5060908@platinum.linux.pl>
References: <4E7F49A7.1020909@platinum.linux.pl> <20110925165946.GA42447@icarus.home.lan> <4E7F61A2.5060908@platinum.linux.pl>

We use a lot of this exact 3ware controller (and firmware) with ZFS and 8.1-RELEASE. Though I have seen controller resets, I have not seen this exact error with ZFS and 3ware.
We do 2x RAID-1, and a 14-disk RAID 5, 50, or 10, and the controller seems to survive disk failures in RAID config with ZFS. However, sometimes we will hit the "calcru: runtime went backwards" messages while the controller resets and the kernel tries to figure things out. Of course this is likely service-impacting.

When multiple controller resets are detected, we have typically declared the card bad, and RMA'd or replaced it. So far, our VAR has not rejected replacing the card while in the standard 3-year warranty.

I would recommend replacing the controller.

HOWEVER - I have seen this ZFS behavior with a different controller/HBA setup. We have older Xyratex 5400-series 48 bay what-evers connected to the FreeBSD host via Fibre Channel and an LSI 7404EP HBA (mpt). Legacy setups exported LUNs/arrays from the Xyratex as RAID-5, which were then gstripe'ed to form single volumes. Setups upgraded to ZFS of course do away with the gstripe.

With gstripe (and UFS2), when a Xyratex controller crashes and resets, GEOM gets confused, produces read/write errors, and eventually panics. In the ZFS world, these failures are almost silent: zpool never reports an error (we're striping the LUNs in the zpool, no raidz or raidz2). Eventually all processes accessing the disks hang in D state, and the machine grinds to a halt.

The recommendation from the community was to use gmountver(8) from -head and use those vdevs in the zpool. We got it backported to 8.1. However, there were some issues with GEOM tasting order and with which vdevs get picked up by the zpool. I have since abandoned this testing. We were never able to get multipathing working under FreeBSD.

---
David P. Discher
dpd@bitgravity.com * AIM: bgDavidDPD
BITGRAVITY * http://www.bitgravity.com

On Sep 25, 2011, at 10:15 AM, Adam Nowacki wrote:

> On 2011-09-25 18:59, Jeremy Chadwick wrote:
>> On Sun, Sep 25, 2011 at 05:32:55PM +0200, Adam Nowacki wrote:
>>> I have a 20 disk storage system, every now and then a disk dies and
>>> causes 3ware controller to reset because of disk timeouts. This cuts
>>> out ZFS from all disks, even healthy ones and the system requires a
>>> hard reset.
>>> Two issues here:
>>> 1) Why the controller has to reset? Thats a completely insane way of
>>> dealing with drive timeout.
>>> 2) ZFS not reopening the disk after controller reset.
>>>
>>> FreeBSD version: 8.1-RELEASE-p1
>>>
>>> /c0 Driver Version = 3.80.06.003
>>> /c0 Model = 9650SE-16ML
>>> /c0 Available Memory = 224MB
>>> /c0 Firmware Version = FE9X 4.10.00.007
>>> /c0 Bios Version = BE9X 4.08.00.002
>>> /c0 Boot Loader Version = BL9X 3.08.00.001
...
>
> I mean that not only the timeouting disk is affected but all disks that are on the controller. Every single one stops working for ZFS, you can see that in the zpool status output, each disk reports read and write errors. zpool clear won't fix it, ZFS simply loses access to all disks on the controller while for example dd can read from each disk just fine. Also on the same controller I have a disk with UFS filesystem, mounted when the controller resets, this survives the reset as if it didn't even happen. For ZFS the only fix is to hard reset the whole system.