From owner-freebsd-fs@FreeBSD.ORG Wed Jul 25 06:02:15 2007 Return-Path: Delivered-To: freebsd-fs@hub.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 1C3EA16A475; Wed, 25 Jul 2007 06:02:15 +0000 (UTC) (envelope-from remko@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id ED2E213C45D; Wed, 25 Jul 2007 06:02:14 +0000 (UTC) (envelope-from remko@FreeBSD.org) Received: from freefall.freebsd.org (remko@localhost [127.0.0.1]) by freefall.freebsd.org (8.14.1/8.14.1) with ESMTP id l6P62E9c065895; Wed, 25 Jul 2007 06:02:14 GMT (envelope-from remko@freefall.freebsd.org) Received: (from remko@localhost) by freefall.freebsd.org (8.14.1/8.14.1/Submit) id l6P62E6G065891; Wed, 25 Jul 2007 06:02:14 GMT (envelope-from remko) Date: Wed, 25 Jul 2007 06:02:14 GMT Message-Id: <200707250602.l6P62E6G065891@freefall.freebsd.org> To: remko@FreeBSD.org, freebsd-bugs@FreeBSD.org, freebsd-fs@FreeBSD.org From: remko@FreeBSD.org Cc: Subject: Re: kern/114847: [ntfs] [patch] dirmask support for NTFS ala MSDOSFS X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Jul 2007 06:02:15 -0000 Synopsis: [ntfs] [patch] dirmask support for NTFS ala MSDOSFS Responsible-Changed-From-To: freebsd-bugs->freebsd-fs Responsible-Changed-By: remko Responsible-Changed-When: Wed Jul 25 06:02:14 UTC 2007 Responsible-Changed-Why: I think the FS list is a better place for this PR. http://www.freebsd.org/cgi/query-pr.cgi?pr=114847 From owner-freebsd-fs@FreeBSD.ORG Wed Jul 25 06:02:49 2007 Return-Path: Delivered-To: freebsd-fs@hub.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 5400916A469; Wed, 25 Jul 2007 06:02:49 +0000 (UTC) (envelope-from remko@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id 32AA113C45E; Wed, 25 Jul 2007 06:02:49 +0000 (UTC) (envelope-from remko@FreeBSD.org) Received: from freefall.freebsd.org (remko@localhost [127.0.0.1]) by freefall.freebsd.org (8.14.1/8.14.1) with ESMTP id l6P62n6P065983; Wed, 25 Jul 2007 06:02:49 GMT (envelope-from remko@freefall.freebsd.org) Received: (from remko@localhost) by freefall.freebsd.org (8.14.1/8.14.1/Submit) id l6P62nJl065979; Wed, 25 Jul 2007 06:02:49 GMT (envelope-from remko) Date: Wed, 25 Jul 2007 06:02:49 GMT Message-Id: <200707250602.l6P62nJl065979@freefall.freebsd.org> To: remko@FreeBSD.org, freebsd-bugs@FreeBSD.org, freebsd-fs@FreeBSD.org From: remko@FreeBSD.org Cc: Subject: Re: kern/114856: [ntfs] [patch] Bug in NTFS allows bogus file modes. X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Jul 2007 06:02:49 -0000 Synopsis: [ntfs] [patch] Bug in NTFS allows bogus file modes. Responsible-Changed-From-To: freebsd-bugs->freebsd-fs Responsible-Changed-By: remko Responsible-Changed-When: Wed Jul 25 06:02:48 UTC 2007 Responsible-Changed-Why: I think the FS list is a better place for this PR. 
http://www.freebsd.org/cgi/query-pr.cgi?pr=114856 From owner-freebsd-fs@FreeBSD.ORG Wed Jul 25 06:07:03 2007 Return-Path: Delivered-To: freebsd-fs@hub.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id F38C816A417; Wed, 25 Jul 2007 06:07:02 +0000 (UTC) (envelope-from remko@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id D227513C442; Wed, 25 Jul 2007 06:07:02 +0000 (UTC) (envelope-from remko@FreeBSD.org) Received: from freefall.freebsd.org (remko@localhost [127.0.0.1]) by freefall.freebsd.org (8.14.1/8.14.1) with ESMTP id l6P6727M066259; Wed, 25 Jul 2007 06:07:02 GMT (envelope-from remko@freefall.freebsd.org) Received: (from remko@localhost) by freefall.freebsd.org (8.14.1/8.14.1/Submit) id l6P672K4066255; Wed, 25 Jul 2007 06:07:02 GMT (envelope-from remko) Date: Wed, 25 Jul 2007 06:07:02 GMT Message-Id: <200707250607.l6P672K4066255@freefall.freebsd.org> To: remko@FreeBSD.org, freebsd-bugs@FreeBSD.org, freebsd-fs@FreeBSD.org From: remko@FreeBSD.org Cc: Subject: Re: kern/114676: [ufs] snapshot creation panics: snapacct_ufs2: bad block X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Jul 2007 06:07:03 -0000 Synopsis: [ufs] snapshot creation panics: snapacct_ufs2: bad block Responsible-Changed-From-To: freebsd-bugs->freebsd-fs Responsible-Changed-By: remko Responsible-Changed-When: Wed Jul 25 06:07:01 UTC 2007 Responsible-Changed-Why: Seems more FS related, reassign. http://www.freebsd.org/cgi/query-pr.cgi?pr=114676 From owner-freebsd-fs@FreeBSD.ORG Wed Jul 25 09:23:49 2007 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id DEB1116A4A1 for ; Wed, 25 Jul 2007 09:23:49 +0000 (UTC) (envelope-from M.S.Powell@salford.ac.uk) Received: from abbe.salford.ac.uk (abbe.salford.ac.uk [146.87.0.10]) by mx1.freebsd.org (Postfix) with SMTP id 587DF13C4B5 for ; Wed, 25 Jul 2007 09:23:49 +0000 (UTC) (envelope-from M.S.Powell@salford.ac.uk) Received: (qmail 65884 invoked by uid 98); 25 Jul 2007 10:23:46 +0100 Received: from 146.87.255.121 by abbe.salford.ac.uk (envelope-from , uid 401) with qmail-scanner-2.01 (clamdscan: 0.90/3762. spamassassin: 3.1.8. Clear:RC:1(146.87.255.121):. 
Processed in 0.045947 secs); 25 Jul 2007 09:23:46 -0000 Received: from rust.salford.ac.uk (HELO rust.salford.ac.uk) (146.87.255.121) by abbe.salford.ac.uk (qpsmtpd/0.3x.614) with SMTP; Wed, 25 Jul 2007 10:23:46 +0100 Received: (qmail 58773 invoked by uid 1002); 25 Jul 2007 09:23:44 -0000 Received: from localhost (sendmail-bs@127.0.0.1) by localhost with SMTP; 25 Jul 2007 09:23:44 -0000 Date: Wed, 25 Jul 2007 10:23:44 +0100 (BST) From: "Mark Powell" To: Pawel Jakub Dawidek In-Reply-To: <20070721065204.GA2044@garage.freebsd.pl> Message-ID: <20070725095723.T57231@rust.salford.ac.uk> References: <20070719102302.R1534@rust.salford.ac.uk> <20070719135510.GE1194@garage.freebsd.pl> <20070719181313.G4923@rust.salford.ac.uk> <20070721065204.GA2044@garage.freebsd.pl> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-fs@freebsd.org Subject: Re: ZfS & GEOM with many odd drive sizes X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Jul 2007 09:23:50 -0000 On Sat, 21 Jul 2007, Pawel Jakub Dawidek wrote: Thanks for your reply. > Be sure to turn off debugging, ie. remove WITNESS, INVARIANTS and > INVARIANT_SUPPORT options from your kernel configuration. > Other than that, ZFS may just be more CPU hungry... I have. Makes little difference. Think the idea of using an Athlon XP for ZFS has turned out to be a bridge too far. The new 65nm Athlon 64 x2 are very cheap now. Time for an upgrade. You said that replacing one device with another is not a problem. Just to be clear on this as it's a key factor in me going with this solution. I hope this isn't too naive a question, but the answer will be here for others :) Suppose instead of gconcat I used gstripe on the 250+200 combinations: i.e. (slice 1 on all drives is reserved for ufs gmirror of /boot and block device swap) gs0 ad0s2 ad1s2 gs1 ad2s2 ad3s2 gs2 ad4s2 ad5s2 I use these gstripes and the single 400GB drive to construct the zpool: zpool create tank raidz /dev/mirror/gs0 /dev/mirror/gs1 /dev/mirror/gs2 ad6s2 If for example ad3 fails and thus gs1 fails, how is this replaced in the zpool? e.g. suppose I replace both ad2 and ad3 with a new 500GB drive as ad2. Is fixing this as simple as: zpool replace tank /dev/mirror/gs1 ad2s2 Many thanks. -- Mark Powell - UNIX System Administrator - The University of Salford Information Services Division, Clifford Whitworth Building, Salford University, Manchester, M5 4WT, UK. 
Tel: +44 161 295 4837 Fax: +44 161 295 5888 www.pgp.com for PGP key From owner-freebsd-fs@FreeBSD.ORG Wed Jul 25 09:30:58 2007 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B776116A417; Wed, 25 Jul 2007 09:30:58 +0000 (UTC) (envelope-from dfr@rabson.org) Received: from itchy.rabson.org (unknown [IPv6:2001:618:400::50b1:e8f2]) by mx1.freebsd.org (Postfix) with ESMTP id 1E88513C46C; Wed, 25 Jul 2007 09:30:57 +0000 (UTC) (envelope-from dfr@rabson.org) Received: from [80.177.232.250] (herring.rabson.org [80.177.232.250]) by itchy.rabson.org (8.13.3/8.13.3) with ESMTP id l6P9UmpN005605; Wed, 25 Jul 2007 10:30:48 +0100 (BST) (envelope-from dfr@rabson.org) From: Doug Rabson To: Mark Powell In-Reply-To: <20070725095723.T57231@rust.salford.ac.uk> References: <20070719102302.R1534@rust.salford.ac.uk> <20070719135510.GE1194@garage.freebsd.pl> <20070719181313.G4923@rust.salford.ac.uk> <20070721065204.GA2044@garage.freebsd.pl> <20070725095723.T57231@rust.salford.ac.uk> Content-Type: text/plain Date: Wed, 25 Jul 2007 10:30:48 +0100 Message-Id: <1185355848.3698.7.camel@herring.rabson.org> Mime-Version: 1.0 X-Mailer: Evolution 2.10.2 FreeBSD GNOME Team Port Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=-1.4 required=5.0 tests=ALL_TRUSTED autolearn=failed version=3.1.0 X-Spam-Checker-Version: SpamAssassin 3.1.0 (2005-09-13) on itchy.rabson.org X-Virus-Scanned: ClamAV 0.87.1/3762/Wed Jul 25 06:17:29 2007 on itchy.rabson.org X-Virus-Status: Clean Cc: freebsd-fs@freebsd.org, Pawel Jakub Dawidek Subject: Re: ZfS & GEOM with many odd drive sizes X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Jul 2007 09:30:58 -0000 On Wed, 2007-07-25 at 10:23 +0100, Mark Powell wrote: > On Sat, 21 Jul 2007, Pawel Jakub Dawidek wrote: > > Thanks for your reply. > > > Be sure to turn off debugging, ie. remove WITNESS, INVARIANTS and > > INVARIANT_SUPPORT options from your kernel configuration. > > Other than that, ZFS may just be more CPU hungry... > > I have. Makes little difference. Think the idea of using an Athlon XP > for ZFS has turned out to be a bridge too far. The new 65nm Athlon 64 x2 > are very cheap now. Time for an upgrade. > You said that replacing one device with another is not a problem. Just > to be clear on this as it's a key factor in me going with this solution. I > hope this isn't too naive a question, but the answer will be here for > others :) > Suppose instead of gconcat I used gstripe on the 250+200 combinations: > > i.e. (slice 1 on all drives is reserved for ufs gmirror of /boot and > block device swap) > > gs0 ad0s2 ad1s2 > gs1 ad2s2 ad3s2 > gs2 ad4s2 ad5s2 > > I use these gstripes and the single 400GB drive to construct the zpool: > > zpool create tank raidz /dev/mirror/gs0 /dev/mirror/gs1 /dev/mirror/gs2 ad6s2 > > If for example ad3 fails and thus gs1 fails, how is this replaced in the > zpool? e.g. suppose I replace both ad2 and ad3 with a new 500GB drive as > ad2. Is fixing this as simple as: > > zpool replace tank /dev/mirror/gs1 ad2s2 > > Many thanks. I'm not really sure why you are using gmirror, gconcat or gstripe at all. Surely it would be easier to let ZFS manage the mirroring and concatentation. 
If you do that, ZFS can use its checksums to continually monitor the two sides of your mirrors for consistency and will be able to notice as early as possible when one of the drives goes flakey. For concats, ZFS will also spread redundant copies of metadata (and regular data if you use 'zfs set copies=') across the disks in the compat. If you have to replace one half of a mirror, ZFS has enough information to know exactly which blocks needs to be copied to the new drive which can make recovery much quicker. From owner-freebsd-fs@FreeBSD.ORG Wed Jul 25 10:13:20 2007 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id D245416A418 for ; Wed, 25 Jul 2007 10:13:20 +0000 (UTC) (envelope-from M.S.Powell@salford.ac.uk) Received: from abbe.salford.ac.uk (abbe.salford.ac.uk [146.87.0.10]) by mx1.freebsd.org (Postfix) with SMTP id 438CF13C458 for ; Wed, 25 Jul 2007 10:13:20 +0000 (UTC) (envelope-from M.S.Powell@salford.ac.uk) Received: (qmail 3431 invoked by uid 98); 25 Jul 2007 11:13:19 +0100 Received: from 146.87.255.121 by abbe.salford.ac.uk (envelope-from , uid 401) with qmail-scanner-2.01 (clamdscan: 0.90/3762. spamassassin: 3.1.8. Clear:RC:1(146.87.255.121):. Processed in 0.046831 secs); 25 Jul 2007 10:13:19 -0000 Received: from rust.salford.ac.uk (HELO rust.salford.ac.uk) (146.87.255.121) by abbe.salford.ac.uk (qpsmtpd/0.3x.614) with SMTP; Wed, 25 Jul 2007 11:13:19 +0100 Received: (qmail 59922 invoked by uid 1002); 25 Jul 2007 10:13:16 -0000 Received: from localhost (sendmail-bs@127.0.0.1) by localhost with SMTP; 25 Jul 2007 10:13:16 -0000 Date: Wed, 25 Jul 2007 11:13:16 +0100 (BST) From: "Mark Powell" To: Doug Rabson In-Reply-To: <1185355848.3698.7.camel@herring.rabson.org> Message-ID: <20070725103746.N57231@rust.salford.ac.uk> References: <20070719102302.R1534@rust.salford.ac.uk> <20070719135510.GE1194@garage.freebsd.pl> <20070719181313.G4923@rust.salford.ac.uk> <20070721065204.GA2044@garage.freebsd.pl> <20070725095723.T57231@rust.salford.ac.uk> <1185355848.3698.7.camel@herring.rabson.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-fs@freebsd.org Subject: Re: ZfS & GEOM with many odd drive sizes X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Jul 2007 10:13:20 -0000 On Wed, 25 Jul 2007, Doug Rabson wrote: > I'm not really sure why you are using gmirror, gconcat or gstripe at > all. Surely it would be easier to let ZFS manage the mirroring and > concatentation. If you do that, ZFS can use its checksums to continually > monitor the two sides of your mirrors for consistency and will be able > to notice as early as possible when one of the drives goes flakey. For > concats, ZFS will also spread redundant copies of metadata (and regular > data if you use 'zfs set copies=') across the disks in the compat. If > you have to replace one half of a mirror, ZFS has enough information to > know exactly which blocks needs to be copied to the new drive which can > make recovery much quicker. gmirror is only going to used for the ufs /boot parition and block device swap. (I'll ignore the smallish space used by that below.) I thought gstripe was a solution cos I mentioned in the original post that I have the following drives to play with; 1x400GB, 3x250GB, 3x200GB. 
If I make a straight zpool with all those drives I get a total usable 7x200GB raidz with only an effective 6x200GB=1200GB of usable storage. Also a 7 device raidz cries out for being a raidz2? That's a further 200GB of storage lost. My original plan was (because of the largest drive being a single 400GB) was to gconcat (now to gstripe) the smaller drives into 3 pairs of 250GB+200GB, making three new 450GB devices. This would make a zpool of 4 devices i.e. 1x400GB+3x450GB giving effective storage of 1200GB. Yes, it's the same as above (as long as raidz2 is not used there), but I was thinking about future expansion... The advantge this approach seems to give is that when drives fail each device (which is either a single drive or a gstripe pair) can be replaced with a modern larger drive (500GB or 750GB depending on what's economical at the time). Once that replacement has been performed only 4 times, the zpool will increase in size (actually it will increase straight away by 4x50GB total if the 400GB drive fails 1st). In addition, once a couple of drives in a pair have failed and are replaced by a single large drive, there will also be smaller 250GB or 200GB drives spare which can be further added to the zpool as a zfs mirror. The alternative of using a zpool of 7 individual drives means that I need to replace many more drives to actually see an increase in zpool size. Yes, there a large number of combinations here, but it seems that the zpool will increase in size sooner this way? I believe my reasoning is correct here? Let me know if your experience would suggest otherwise. Many thanks. -- Mark Powell - UNIX System Administrator - The University of Salford Information Services Division, Clifford Whitworth Building, Salford University, Manchester, M5 4WT, UK. Tel: +44 161 295 4837 Fax: +44 161 295 5888 www.pgp.com for PGP key From owner-freebsd-fs@FreeBSD.ORG Wed Jul 25 11:17:26 2007 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 08D3416A421 for ; Wed, 25 Jul 2007 11:17:26 +0000 (UTC) (envelope-from M.S.Powell@salford.ac.uk) Received: from akis.salford.ac.uk (akis.salford.ac.uk [146.87.0.14]) by mx1.freebsd.org (Postfix) with SMTP id 6BA4213C4A6 for ; Wed, 25 Jul 2007 11:17:24 +0000 (UTC) (envelope-from M.S.Powell@salford.ac.uk) Received: (qmail 32712 invoked by uid 98); 25 Jul 2007 12:17:23 +0100 Received: from 146.87.255.121 by akis.salford.ac.uk (envelope-from , uid 401) with qmail-scanner-2.01 (clamdscan: 0.90/3762. spamassassin: 3.1.8. Clear:RC:1(146.87.255.121):. 
Processed in 0.0415 secs); 25 Jul 2007 11:17:23 -0000 Received: from rust.salford.ac.uk (HELO rust.salford.ac.uk) (146.87.255.121) by akis.salford.ac.uk (qpsmtpd/0.3x.614) with SMTP; Wed, 25 Jul 2007 12:17:23 +0100 Received: (qmail 60504 invoked by uid 1002); 25 Jul 2007 11:17:21 -0000 Received: from localhost (sendmail-bs@127.0.0.1) by localhost with SMTP; 25 Jul 2007 11:17:21 -0000 Date: Wed, 25 Jul 2007 12:17:21 +0100 (BST) From: "Mark Powell" To: Doug Rabson In-Reply-To: <3A5D89E1-A7B1-4B10-ADB8-F58332306691@rabson.org> Message-ID: <20070725120913.A57231@rust.salford.ac.uk> References: <20070719102302.R1534@rust.salford.ac.uk> <20070719135510.GE1194@garage.freebsd.pl> <20070719181313.G4923@rust.salford.ac.uk> <20070721065204.GA2044@garage.freebsd.pl> <20070725095723.T57231@rust.salford.ac.uk> <1185355848.3698.7.camel@herring.rabson.org> <20070725103746.N57231@rust.salford.ac.uk> <3A5D89E1-A7B1-4B10-ADB8-F58332306691@rabson.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-fs@freebsd.org Subject: Re: ZfS & GEOM with many odd drive sizes X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Jul 2007 11:17:26 -0000 On Wed, 25 Jul 2007, Doug Rabson wrote: >> gmirror is only going to used for the ufs /boot parition and block device >> swap. (I'll ignore the smallish space used by that below.) > > Just to muddy the waters a little - I'm working on ZFS native boot code at > the moment. It probably won't ship with 7.0 but should be available shortly > after. Great work. That will be zfs mirror only right? >> I believe my reasoning is correct here? Let me know if your experience >> would suggest otherwise. > > Your reasoning sounds fine now that I have the bigger picture in my head. I > don't have a lot of experience here - for my ZFS testing, I just bought a > couple of cheap 300GB drives which I'm using as a simple mirror. From what I > have read, mirrors and raidz2 are roughly equivalent in 'mean time to data > loss' terms with raidz1 quite a bit less safe due to the extra vulnerability > window between a drive failure and replacement. So back to my original question :) If one drive in a gconcat gc1 (ad2s2+ad3s2), say ad3 fails, and the broken gconcat is completely replaced with a new 500GB drive ad2, is fixing that as simple as: zpool replace tank gc1 ad2 Many thanks. -- Mark Powell - UNIX System Administrator - The University of Salford Information Services Division, Clifford Whitworth Building, Salford University, Manchester, M5 4WT, UK. 
Tel: +44 161 295 4837 Fax: +44 161 295 5888 www.pgp.com for PGP key From owner-freebsd-fs@FreeBSD.ORG Wed Jul 25 11:23:15 2007 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id D518916A418 for ; Wed, 25 Jul 2007 11:23:15 +0000 (UTC) (envelope-from dfr@rabson.org) Received: from mail.qubesoft.com (gate.qubesoft.com [217.169.36.34]) by mx1.freebsd.org (Postfix) with ESMTP id 646C913C459 for ; Wed, 25 Jul 2007 11:23:15 +0000 (UTC) (envelope-from dfr@rabson.org) Received: from [10.201.19.245] (doug02.dyn.qubesoft.com [10.201.19.245]) by mail.qubesoft.com (8.13.3/8.13.3) with ESMTP id l6PB1u5e002918; Wed, 25 Jul 2007 12:02:00 +0100 (BST) (envelope-from dfr@rabson.org) In-Reply-To: <20070725103746.N57231@rust.salford.ac.uk> References: <20070719102302.R1534@rust.salford.ac.uk> <20070719135510.GE1194@garage.freebsd.pl> <20070719181313.G4923@rust.salford.ac.uk> <20070721065204.GA2044@garage.freebsd.pl> <20070725095723.T57231@rust.salford.ac.uk> <1185355848.3698.7.camel@herring.rabson.org> <20070725103746.N57231@rust.salford.ac.uk> Mime-Version: 1.0 (Apple Message framework v752.2) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <3A5D89E1-A7B1-4B10-ADB8-F58332306691@rabson.org> Content-Transfer-Encoding: 7bit From: Doug Rabson Date: Wed, 25 Jul 2007 12:01:53 +0100 To: Mark Powell X-Mailer: Apple Mail (2.752.2) X-Spam-Status: No, score=-2.8 required=5.0 tests=ALL_TRUSTED autolearn=failed version=3.0.4 X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on mail.qubesoft.com X-Virus-Scanned: ClamAV 0.86.2/3762/Wed Jul 25 06:17:29 2007 on mail.qubesoft.com X-Virus-Status: Clean Cc: freebsd-fs@freebsd.org Subject: Re: ZfS & GEOM with many odd drive sizes X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Jul 2007 11:23:15 -0000 On 25 Jul 2007, at 11:13, Mark Powell wrote: > On Wed, 25 Jul 2007, Doug Rabson wrote: > >> I'm not really sure why you are using gmirror, gconcat or gstripe at >> all. Surely it would be easier to let ZFS manage the mirroring and >> concatentation. If you do that, ZFS can use its checksums to >> continually >> monitor the two sides of your mirrors for consistency and will be >> able >> to notice as early as possible when one of the drives goes flakey. >> For >> concats, ZFS will also spread redundant copies of metadata (and >> regular >> data if you use 'zfs set copies=') across the disks in the >> compat. If >> you have to replace one half of a mirror, ZFS has enough >> information to >> know exactly which blocks needs to be copied to the new drive >> which can >> make recovery much quicker. > > gmirror is only going to used for the ufs /boot parition and block > device swap. (I'll ignore the smallish space used by that below.) Just to muddy the waters a little - I'm working on ZFS native boot code at the moment. It probably won't ship with 7.0 but should be available shortly after. > I thought gstripe was a solution cos I mentioned in the original > post that I have the following drives to play with; 1x400GB, > 3x250GB, 3x200GB. > If I make a straight zpool with all those drives I get a total > usable 7x200GB raidz with only an effective 6x200GB=1200GB of > usable storage. Also a 7 device raidz cries out for being a raidz2? > That's a further 200GB of storage lost. 
> My original plan was (because of the largest drive being a single > 400GB) was to gconcat (now to gstripe) the smaller drives into 3 > pairs of 250GB+200GB, making three new 450GB devices. This would > make a zpool of 4 devices i.e. 1x400GB+3x450GB giving effective > storage of 1200GB. Yes, it's the same as above (as long as raidz2 > is not used there), but I was thinking about future expansion... > The advantge this approach seems to give is that when drives fail > each device (which is either a single drive or a gstripe pair) can > be replaced with a modern larger drive (500GB or 750GB depending on > what's economical at the time). > Once that replacement has been performed only 4 times, the zpool > will increase in size (actually it will increase straight away by > 4x50GB total if the 400GB drive fails 1st). > In addition, once a couple of drives in a pair have failed and > are replaced by a single large drive, there will also be smaller > 250GB or 200GB drives spare which can be further added to the zpool > as a zfs mirror. > The alternative of using a zpool of 7 individual drives means > that I need to replace many more drives to actually see an increase > in zpool size. > Yes, there a large number of combinations here, but it seems that > the zpool will increase in size sooner this way? > I believe my reasoning is correct here? Let me know if your > experience would suggest otherwise. > Many thanks. > Your reasoning sounds fine now that I have the bigger picture in my head. I don't have a lot of experience here - for my ZFS testing, I just bought a couple of cheap 300GB drives which I'm using as a simple mirror. From what I have read, mirrors and raidz2 are roughly equivalent in 'mean time to data loss' terms with raidz1 quite a bit less safe due to the extra vulnerability window between a drive failure and replacement. 
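A rough sketch of the layout under discussion, using the hypothetical device names from earlier in the thread (ad0-ad6, with slice 2 of each drive given to ZFS); the commands are ordinary gconcat(8)/zpool(8) usage, but the names and sizes are illustrative rather than taken from a real system:

    # pair each 250GB drive with a 200GB drive to make three ~450GB devices
    gconcat label -v gc0 /dev/ad0s2 /dev/ad1s2
    gconcat label -v gc1 /dev/ad2s2 /dev/ad3s2
    gconcat label -v gc2 /dev/ad4s2 /dev/ad5s2

    # 4-device raidz: the three concats plus the single 400GB drive
    zpool create tank raidz /dev/concat/gc0 /dev/concat/gc1 /dev/concat/gc2 /dev/ad6s2

    # usable space is roughly (members - 1) x smallest member:
    #   (4 - 1) x ~400GB = ~1200GB, matching the figure quoted above

One caveat if gstripe is used instead of gconcat for the pairs: a stripe of a 250GB and a 200GB slice only yields about 2 x 200GB, since striping is limited by the smallest member, whereas a concat preserves the full 450GB.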
From owner-freebsd-fs@FreeBSD.ORG Wed Jul 25 12:53:57 2007 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 84E3B16A418 for ; Wed, 25 Jul 2007 12:53:57 +0000 (UTC) (envelope-from dfr@rabson.org) Received: from mail.qubesoft.com (gate.qubesoft.com [217.169.36.34]) by mx1.freebsd.org (Postfix) with ESMTP id 19D3713C45B for ; Wed, 25 Jul 2007 12:53:56 +0000 (UTC) (envelope-from dfr@rabson.org) Received: from [10.201.19.245] (doug02.dyn.qubesoft.com [10.201.19.245]) by mail.qubesoft.com (8.13.3/8.13.3) with ESMTP id l6PCrnAC007158; Wed, 25 Jul 2007 13:53:53 +0100 (BST) (envelope-from dfr@rabson.org) In-Reply-To: <20070725120913.A57231@rust.salford.ac.uk> References: <20070719102302.R1534@rust.salford.ac.uk> <20070719135510.GE1194@garage.freebsd.pl> <20070719181313.G4923@rust.salford.ac.uk> <20070721065204.GA2044@garage.freebsd.pl> <20070725095723.T57231@rust.salford.ac.uk> <1185355848.3698.7.camel@herring.rabson.org> <20070725103746.N57231@rust.salford.ac.uk> <3A5D89E1-A7B1-4B10-ADB8-F58332306691@rabson.org> <20070725120913.A57231@rust.salford.ac.uk> Mime-Version: 1.0 (Apple Message framework v752.2) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <6FF8729F-B449-4EFA-B3C6-8B9A9E6F6C4F@rabson.org> Content-Transfer-Encoding: 7bit From: Doug Rabson Date: Wed, 25 Jul 2007 13:53:46 +0100 To: Mark Powell X-Mailer: Apple Mail (2.752.2) X-Spam-Status: No, score=-2.8 required=5.0 tests=ALL_TRUSTED autolearn=failed version=3.0.4 X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on mail.qubesoft.com X-Virus-Scanned: ClamAV 0.86.2/3762/Wed Jul 25 06:17:29 2007 on mail.qubesoft.com X-Virus-Status: Clean Cc: freebsd-fs@freebsd.org Subject: Re: ZfS & GEOM with many odd drive sizes X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Jul 2007 12:53:57 -0000 On 25 Jul 2007, at 12:17, Mark Powell wrote: > On Wed, 25 Jul 2007, Doug Rabson wrote: > >>> gmirror is only going to used for the ufs /boot parition and >>> block device swap. (I'll ignore the smallish space used by that >>> below.) >> >> Just to muddy the waters a little - I'm working on ZFS native boot >> code at the moment. It probably won't ship with 7.0 but should be >> available shortly after. > > Great work. That will be zfs mirror only right? The code is close to being able to support collections of mirrors. No raidz or raidz2 for now though. > >>> I believe my reasoning is correct here? Let me know if your >>> experience would suggest otherwise. >> >> Your reasoning sounds fine now that I have the bigger picture in >> my head. I don't have a lot of experience here - for my ZFS >> testing, I just bought a couple of cheap 300GB drives which I'm >> using as a simple mirror. From what I have read, mirrors and >> raidz2 are roughly equivalent in 'mean time to data loss' terms >> with raidz1 quite a bit less safe due to the extra vulnerability >> window between a drive failure and replacement. > > So back to my original question :) > If one drive in a gconcat gc1 (ad2s2+ad3s2), say ad3 fails, and > the broken gconcat is completely replaced with a new 500GB drive > ad2, is fixing that as simple as: > > zpool replace tank gc1 ad2 That sounds right. 
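A minimal sketch of the replacement sequence just confirmed, using the same hypothetical names (concat gc1 built from ad2s2+ad3s2, the new 500GB drive appearing as ad2 and sliced the same way, with s2 going to the pool); the exact vdev name to pass is whatever zpool status reports for the failed member, so treat the paths as illustrative:

    # identify the faulted vdev
    zpool status tank

    # resilver onto the new drive: zpool replace <pool> <old vdev> <new device>
    zpool replace tank concat/gc1 ad2s2

    # watch resilvering progress and the final state
    zpool status tank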
From owner-freebsd-fs@FreeBSD.ORG Wed Jul 25 16:22:23 2007 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 73F9716A41B for ; Wed, 25 Jul 2007 16:22:23 +0000 (UTC) (envelope-from M.S.Powell@salford.ac.uk) Received: from akis.salford.ac.uk (akis.salford.ac.uk [146.87.0.14]) by mx1.freebsd.org (Postfix) with SMTP id D8A4313C483 for ; Wed, 25 Jul 2007 16:22:22 +0000 (UTC) (envelope-from M.S.Powell@salford.ac.uk) Received: (qmail 4213 invoked by uid 98); 25 Jul 2007 17:22:21 +0100 Received: from 146.87.255.121 by akis.salford.ac.uk (envelope-from , uid 401) with qmail-scanner-2.01 (clamdscan: 0.90/3763. spamassassin: 3.1.8. Clear:RC:1(146.87.255.121):. Processed in 0.070154 secs); 25 Jul 2007 16:22:21 -0000 Received: from rust.salford.ac.uk (HELO rust.salford.ac.uk) (146.87.255.121) by akis.salford.ac.uk (qpsmtpd/0.3x.614) with SMTP; Wed, 25 Jul 2007 17:22:20 +0100 Received: (qmail 62424 invoked by uid 1002); 25 Jul 2007 16:20:22 -0000 Received: from localhost (sendmail-bs@127.0.0.1) by localhost with SMTP; 25 Jul 2007 16:20:22 -0000 Date: Wed, 25 Jul 2007 17:20:22 +0100 (BST) From: "Mark Powell" To: Doug Rabson In-Reply-To: <6FF8729F-B449-4EFA-B3C6-8B9A9E6F6C4F@rabson.org> Message-ID: <20070725171343.M61339@rust.salford.ac.uk> References: <20070719102302.R1534@rust.salford.ac.uk> <20070719135510.GE1194@garage.freebsd.pl> <20070719181313.G4923@rust.salford.ac.uk> <20070721065204.GA2044@garage.freebsd.pl> <20070725095723.T57231@rust.salford.ac.uk> <1185355848.3698.7.camel@herring.rabson.org> <20070725103746.N57231@rust.salford.ac.uk> <3A5D89E1-A7B1-4B10-ADB8-F58332306691@rabson.org> <20070725120913.A57231@rust.salford.ac.uk> <6FF8729F-B449-4EFA-B3C6-8B9A9E6F6C4F@rabson.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-fs@freebsd.org Subject: Re: ZfS & GEOM with many odd drive sizes X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Jul 2007 16:22:23 -0000 On Wed, 25 Jul 2007, Doug Rabson wrote: > On 25 Jul 2007, at 12:17, Mark Powell wrote: >> Great work. That will be zfs mirror only right? > > The code is close to being able to support collections of mirrors. No raidz > or raidz2 for now though. That's great news. So that would mean, if a raidz vdev was required on a system another pool would have to be created with only a mirror vdev in it, to have / on zfs too? Considering the work involved, is raidz / support really worth it? Of course, it's fantastic if you plan to tackle it, but I don't envy you the task :( >> So back to my original question :) >> If one drive in a gconcat gc1 (ad2s2+ad3s2), say ad3 fails, and the broken >> gconcat is completely replaced with a new 500GB drive ad2, is fixing that >> as simple as: >> >> zpool replace tank gc1 ad2 > > That sounds right. Thanks for the info. It's good to know how to fix an array before it's created :) Cheers. -- Mark Powell - UNIX System Administrator - The University of Salford Information Services Division, Clifford Whitworth Building, Salford University, Manchester, M5 4WT, UK. 
Tel: +44 161 295 4837 Fax: +44 161 295 5888 www.pgp.com for PGP key From owner-freebsd-fs@FreeBSD.ORG Wed Jul 25 16:39:33 2007 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 55B7C16A421 for ; Wed, 25 Jul 2007 16:39:33 +0000 (UTC) (envelope-from dfr@rabson.org) Received: from mail.qubesoft.com (gate.qubesoft.com [217.169.36.34]) by mx1.freebsd.org (Postfix) with ESMTP id CA3DD13C4A3 for ; Wed, 25 Jul 2007 16:39:32 +0000 (UTC) (envelope-from dfr@rabson.org) Received: from [10.201.19.245] (doug02.dyn.qubesoft.com [10.201.19.245]) by mail.qubesoft.com (8.13.3/8.13.3) with ESMTP id l6PGdMX1015863; Wed, 25 Jul 2007 17:39:31 +0100 (BST) (envelope-from dfr@rabson.org) In-Reply-To: <20070725171343.M61339@rust.salford.ac.uk> References: <20070719102302.R1534@rust.salford.ac.uk> <20070719135510.GE1194@garage.freebsd.pl> <20070719181313.G4923@rust.salford.ac.uk> <20070721065204.GA2044@garage.freebsd.pl> <20070725095723.T57231@rust.salford.ac.uk> <1185355848.3698.7.camel@herring.rabson.org> <20070725103746.N57231@rust.salford.ac.uk> <3A5D89E1-A7B1-4B10-ADB8-F58332306691@rabson.org> <20070725120913.A57231@rust.salford.ac.uk> <6FF8729F-B449-4EFA-B3C6-8B9A9E6F6C4F@rabson.org> <20070725171343.M61339@rust.salford.ac.uk> Mime-Version: 1.0 (Apple Message framework v752.2) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Message-Id: <77814562-8B5E-4E3C-9018-59F7E8FBF8C8@rabson.org> Content-Transfer-Encoding: 7bit From: Doug Rabson Date: Wed, 25 Jul 2007 17:39:22 +0100 To: Mark Powell X-Mailer: Apple Mail (2.752.2) X-Spam-Status: No, score=-2.8 required=5.0 tests=ALL_TRUSTED autolearn=failed version=3.0.4 X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on mail.qubesoft.com X-Virus-Scanned: ClamAV 0.86.2/3763/Wed Jul 25 16:37:41 2007 on mail.qubesoft.com X-Virus-Status: Clean Cc: freebsd-fs@freebsd.org Subject: Re: ZfS & GEOM with many odd drive sizes X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Jul 2007 16:39:33 -0000 On 25 Jul 2007, at 17:20, Mark Powell wrote: > On Wed, 25 Jul 2007, Doug Rabson wrote: > >> On 25 Jul 2007, at 12:17, Mark Powell wrote: >>> Great work. That will be zfs mirror only right? >> >> The code is close to being able to support collections of mirrors. >> No raidz or raidz2 for now though. > > That's great news. > So that would mean, if a raidz vdev was required on a system > another pool would have to be created with only a mirror vdev in > it, to have / on zfs too? > Considering the work involved, is raidz / support really worth > it? Of course, it's fantastic if you plan to tackle it, but I don't > envy you the task :( In theory supporting raidz isn't that hard although the layout policy is undocumented. I've looked at the code and I could probably borrow some code from the 'real' zfs to figure out the layout and support non-degraded raidz and raidz2. Supported degraded configurations is more effort because of the extra code to re-generate the date from the parity. The biggest problem here is space. The wretched PC platform requires us to bootstrap the system starting from a single sector's worth of code (512 bytes). That code runs in stone-age 16bit mode and loads the second stage from a fixed disk location. 
To keep my sanity, I'm currently trying to limit the code size of the second stage to 16k. This second stage has to understand ZFS well enough to load the third stage /boot/loader code from the pool. I currently have exactly 171 bytes of free space in boot2. I could probably squeeze another 4k into the second stage bootstrap by re-writing boot1 again. I will probably have to do that to support collections of disks/mirrors anyway. Doing that will mean permanently giving up the idea of booting ZFS on systems that don't support LBA addressing. Tthis already disabled in my boot1 code but could be resurrected after some hair pulling - increasing the size of boot2 would make supporting legacy (>10hys old) BIOS machines impossible. From owner-freebsd-fs@FreeBSD.ORG Wed Jul 25 16:58:54 2007 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C8EFF16A421 for ; Wed, 25 Jul 2007 16:58:54 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from phoenix.cs.uoguelph.ca (phoenix.cs.uoguelph.ca [131.104.94.216]) by mx1.freebsd.org (Postfix) with ESMTP id 881B313C457 for ; Wed, 25 Jul 2007 16:58:54 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from muncher.cs.uoguelph.ca (muncher.cs.uoguelph.ca [131.104.96.170]) by phoenix.cs.uoguelph.ca (8.13.1/8.13.1) with ESMTP id l6PGwrcU014063 for ; Wed, 25 Jul 2007 12:58:53 -0400 Received: from localhost (rmacklem@localhost) by muncher.cs.uoguelph.ca (8.11.7p3+Sun/8.11.6) with ESMTP id l6PH3Jw02212 for ; Wed, 25 Jul 2007 13:03:19 -0400 (EDT) X-Authentication-Warning: muncher.cs.uoguelph.ca: rmacklem owned process doing -bs Date: Wed, 25 Jul 2007 13:03:19 -0400 (EDT) From: Rick Macklem X-X-Sender: rmacklem@muncher To: freebsd-fs@freebsd.org Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Scanned-By: MIMEDefang 2.57 on 131.104.94.216 Subject: handling unresonsive NFS servers X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Jul 2007 16:58:54 -0000 I have been thinking about what to do on a client when an NFS server is unresponsive and thought I'd email to see what others thought? "intr mounts" - These don't work correctly and it is nearly impossible to make them work correctly. The problem is that, often, the process which has a termination signal posted against it is blocked waiting for some resource (vnode lock, buffer cache block,...) that another process that is waiting for an RPC reply from the unresponsive server, holds. Also, for NFSv4, a client can't just forget about an RPC that alters state on the server. If it does so, the RPC may have been performed on the server and the client's view of state might become inconsistent with the server's view. (As such, I feel this should be "deprecated or disabled". I don't like things that "sorta work", but I can understand why some might feel that it should remain for NFSv2,3.) "soft mounts" - These have the problem that system calls may terminate abnormally when all you have is a slow, heavily loaded server. As such, they might be ok for read-only mounts using NFSv2,3, but seem too dangerous for anything else. (Very few apps. expect an I/O system call to fail with ETIMEDOUT.) So, about all I can think to do is make "umount -f" work properly. 
Since it terminates all outstanding RPCs on the mount point (and gets rid of all state for NFSv4), this can be made to work well. (Mac OS X does this.) A problem with this is that it can only be done by someone with system priviledge. However, it seems to me that most systems are either personal (laptops or desktops) where the person has system priviledge OR systems running as servers in machine room environments. The latter usually have sysadmin monitoring and also tend to talk to NFS servers where connectivity seldom goes away. As such, needing system priviledge doesn't seem too serious an issue to me. Any other thoughts? rick From owner-freebsd-fs@FreeBSD.ORG Wed Jul 25 17:17:51 2007 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B580716A41A for ; Wed, 25 Jul 2007 17:17:51 +0000 (UTC) (envelope-from anderson@freebsd.org) Received: from ns.trinitel.com (186.161.36.72.static.reverse.layeredtech.com [72.36.161.186]) by mx1.freebsd.org (Postfix) with ESMTP id 8449213C45D for ; Wed, 25 Jul 2007 17:17:48 +0000 (UTC) (envelope-from anderson@freebsd.org) Received: from proton.local (209-163-168-124.static.twtelecom.net [209.163.168.124]) (authenticated bits=0) by ns.trinitel.com (8.14.1/8.14.1) with ESMTP id l6PHHmxn033498; Wed, 25 Jul 2007 12:17:48 -0500 (CDT) (envelope-from anderson@freebsd.org) Message-ID: <46A785BC.8030602@freebsd.org> Date: Wed, 25 Jul 2007 12:17:48 -0500 From: Eric Anderson User-Agent: Thunderbird 2.0.0.5 (Macintosh/20070716) MIME-Version: 1.0 To: Rick Macklem References: In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=ham version=3.1.8 X-Spam-Checker-Version: SpamAssassin 3.1.8 (2007-02-13) on ns.trinitel.com Cc: freebsd-fs@freebsd.org Subject: Re: handling unresonsive NFS servers X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Jul 2007 17:17:51 -0000 Rick Macklem wrote: > I have been thinking about what to do on a client when an NFS server is > unresponsive and thought I'd email to see what others thought? > > "intr mounts" - These don't work correctly and it is nearly impossible > to make them work correctly. The problem is that, often, the > process which has a termination signal posted against it is blocked > waiting for some resource (vnode lock, buffer cache block,...) that > another process that is waiting for an RPC reply from the > unresponsive server, holds. Also, for NFSv4, a client can't just > forget about an RPC that alters state on the server. If it does > so, the RPC may have been performed on the server and the client's > view of state might become inconsistent with the server's view. > (As such, I feel this should be "deprecated or disabled". I don't > like things that "sorta work", but I can understand why some might > feel that it should remain for NFSv2,3.) > > "soft mounts" - These have the problem that system calls may terminate > abnormally when all you have is a slow, heavily loaded server. > As such, they might be ok for read-only mounts using NFSv2,3, > but seem too dangerous for anything else. (Very few apps. expect > an I/O system call to fail with ETIMEDOUT.) > > So, about all I can think to do is make "umount -f" work properly. 
Since > it terminates all outstanding RPCs on the mount point (and gets rid of all > state for NFSv4), this can be made to work well. (Mac OS X does this.) > A problem with this is that it can only be done by someone with system > priviledge. However, it seems to me that most systems are either > personal (laptops or desktops) where the person has system priviledge OR > systems running as servers in machine room environments. The latter > usually have sysadmin monitoring and also tend to talk to NFS servers where > connectivity seldom goes away. As such, needing system priviledge > doesn't seem too serious an issue to me. > > Any other thoughts? rick I agree with you 100%. In datacenters that I have run, umount -f should always work (in my opinion), and should be a superuser privilege. If I am root, and I say 'umount -f ...' - just do it. NFS servers do go away, and sometimes you *have* to umount -f. In linux, you can make that happen (mostly), but FreeBSD doesn't like it much. Anyone who runs with soft mounts, or intr mounts, should be prepared for inconsistent data, or broken apps when there are NFS issues. Typically I expect my hard mounts (non-interruptable) to stick, and applications to block, until the mount comes back. If I need to remove the mount though, I want to be able to do it with a umount -f command, and have it 'just work', since I know the consequences. Eric From owner-freebsd-fs@FreeBSD.ORG Wed Jul 25 17:37:26 2007 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 8182416A418 for ; Wed, 25 Jul 2007 17:37:26 +0000 (UTC) (envelope-from rees@citi.umich.edu) Received: from citi.umich.edu (citi.umich.edu [141.211.133.111]) by mx1.freebsd.org (Postfix) with ESMTP id 5E5F313C465 for ; Wed, 25 Jul 2007 17:37:26 +0000 (UTC) (envelope-from rees@citi.umich.edu) Received: from citi.umich.edu (dumaguete.citi.umich.edu [141.211.133.51]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client CN "Jim Rees", Issuer "CITI Production KCA" (verified OK)) by citi.umich.edu (Postfix) with ESMTP id 7AAA047C2; Wed, 25 Jul 2007 13:12:15 -0400 (EDT) Date: Wed, 25 Jul 2007 13:12:14 -0400 From: Jim Rees To: Rick Macklem Message-ID: <20070725171214.GC25749@citi.umich.edu> References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Cc: freebsd-fs@freebsd.org Subject: Re: handling unresonsive NFS servers X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Jul 2007 17:37:26 -0000 Afs has the same problem, and solves it by marking a server "down" when it doesn't respond. The timeout is very long, like a minute or more. Normally this would permanently hang the client, but once the server is marked down, any subsequent operations fail immediately. The client checks periodically to see if the server has come back up. Failing this way is better than waiting forever, because waiting forever results in a reboot when the machine's owner runs out of patience. And by all means, do fix umount -f. 
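For readers wanting the concrete knobs discussed above, this is roughly how the choices look on a FreeBSD client; the server and mount point names are made up, and whether a forced unmount actually completes against a truly wedged server is exactly the gap Rick describes:

    # default hard mount: processes block until the server responds again
    mount -t nfs server:/export /mnt/data

    # 'intr' lets signals interrupt blocked operations; 'soft' lets RPCs time out
    # (both with the caveats spelled out earlier in the thread)
    mount -t nfs -o intr server:/export /mnt/data
    mount -t nfs -o soft server:/export /mnt/data

    # forced unmount of an unresponsive mount (root only)
    umount -f /mnt/data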
From owner-freebsd-fs@FreeBSD.ORG Wed Jul 25 18:13:36 2007 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id D44E716A419; Wed, 25 Jul 2007 18:13:36 +0000 (UTC) (envelope-from bakul@bitblocks.com) Received: from mail.bitblocks.com (mail.bitblocks.com [64.142.15.60]) by mx1.freebsd.org (Postfix) with ESMTP id B83E213C457; Wed, 25 Jul 2007 18:13:36 +0000 (UTC) (envelope-from bakul@bitblocks.com) Received: from bitblocks.com (localhost.bitblocks.com [127.0.0.1]) by mail.bitblocks.com (Postfix) with ESMTP id 9F47E5B3B; Wed, 25 Jul 2007 10:47:15 -0700 (PDT) To: Doug Rabson In-reply-to: Your message of "Wed, 25 Jul 2007 10:30:48 BST." <1185355848.3698.7.camel@herring.rabson.org> Date: Wed, 25 Jul 2007 10:47:15 -0700 From: Bakul Shah Message-Id: <20070725174715.9F47E5B3B@mail.bitblocks.com> Cc: freebsd-fs@freebsd.org, Pawel Jakub Dawidek , Mark Powell Subject: Re: ZfS & GEOM with many odd drive sizes X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Jul 2007 18:13:36 -0000 > If you do that, ZFS can use its checksums to continually > monitor the two sides of your mirrors for consistency and will be able > to notice as early as possible when one of the drives goes flakey. Does it really do this? As I understood it, only one of the disks in a mirror will be read for a given block. If the checksum fails, the same block from the other disk is read and checksummed. If all the disks in a mirror are read for every block, ZFS read performance would get somewhat worse instead of linear scaling up with more disks in a mirror. In order to monitor data on both disks one would need to periodically run "zpool scrub", no? But that is not *continuous* monitoring of the two sides. 
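Since the idle side of a mirror is only verified when its blocks happen to be read, the usual complement is a periodic scrub; a small sketch, assuming a pool called tank and an arbitrary cron schedule:

    # walk every allocated block, verify checksums, repair from the redundant copy
    zpool scrub tank

    # report progress and any checksum errors that were found and repaired
    zpool status -v tank

    # example root crontab entry: scrub early every Sunday morning
    # 0 3 * * 0 /sbin/zpool scrub tank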
From owner-freebsd-fs@FreeBSD.ORG Wed Jul 25 18:21:10 2007 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2FAEB16A418 for ; Wed, 25 Jul 2007 18:21:10 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from gigi.cs.uoguelph.ca (gigi.cs.uoguelph.ca [131.104.94.210]) by mx1.freebsd.org (Postfix) with ESMTP id C8FE713C474 for ; Wed, 25 Jul 2007 18:21:09 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from muncher.cs.uoguelph.ca (muncher.cs.uoguelph.ca [131.104.96.170]) by gigi.cs.uoguelph.ca (8.13.1/8.13.1) with ESMTP id l6PIL67f025911; Wed, 25 Jul 2007 14:21:06 -0400 Received: from localhost (rmacklem@localhost) by muncher.cs.uoguelph.ca (8.11.7p3+Sun/8.11.6) with ESMTP id l6PIPYQ14848; Wed, 25 Jul 2007 14:25:34 -0400 (EDT) X-Authentication-Warning: muncher.cs.uoguelph.ca: rmacklem owned process doing -bs Date: Wed, 25 Jul 2007 14:25:34 -0400 (EDT) From: Rick Macklem X-X-Sender: rmacklem@muncher To: Jim Rees In-Reply-To: <20070725171214.GC25749@citi.umich.edu> Message-ID: References: <20070725171214.GC25749@citi.umich.edu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Scanned-By: MIMEDefang 2.57 on 131.104.94.210 Cc: freebsd-fs@freebsd.org Subject: Re: handling unresonsive NFS servers X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Jul 2007 18:21:10 -0000 On Wed, 25 Jul 2007, Jim Rees wrote: > Afs has the same problem, and solves it by marking a server "down" when it > doesn't respond. The timeout is very long, like a minute or more. Normally > this would permanently hang the client, but once the server is marked down, > any subsequent operations fail immediately. The client checks periodically > to see if the server has come back up. Failing this way is better than > waiting forever, because waiting forever results in a reboot when the > machine's owner runs out of patience. Linux has something called a "lazy" umount, which I think is similar to the above, except that it is invoked by a sysadmin instead of a timeout (and doesn't come back, just umounts when the RPCs finally happen). I didn't see much use in it, but I can see that setting a mount point "not working for now" might be useful. > > And by all means, do fix umount -f. 
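(For comparison, the Linux facility Rick refers to is the -l flag to umount: "umount -l /mnt/data" detaches the mount point immediately and finishes the cleanup once outstanding references drain, whereas the proposal here is to make FreeBSD's "umount -f" reliably terminate the outstanding RPCs instead.)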
> From owner-freebsd-fs@FreeBSD.ORG Wed Jul 25 18:37:17 2007 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id D235416A417 for ; Wed, 25 Jul 2007 18:37:17 +0000 (UTC) (envelope-from anderson@freebsd.org) Received: from ns.trinitel.com (186.161.36.72.static.reverse.layeredtech.com [72.36.161.186]) by mx1.freebsd.org (Postfix) with ESMTP id 1A21913C442 for ; Wed, 25 Jul 2007 18:37:17 +0000 (UTC) (envelope-from anderson@freebsd.org) Received: from proton.local (209-163-168-124.static.twtelecom.net [209.163.168.124]) (authenticated bits=0) by ns.trinitel.com (8.14.1/8.14.1) with ESMTP id l6PIbGta046864; Wed, 25 Jul 2007 13:37:16 -0500 (CDT) (envelope-from anderson@freebsd.org) Message-ID: <46A7985C.3010202@freebsd.org> Date: Wed, 25 Jul 2007 13:37:16 -0500 From: Eric Anderson User-Agent: Thunderbird 2.0.0.5 (Macintosh/20070716) MIME-Version: 1.0 To: Jim Rees References: <20070725171214.GC25749@citi.umich.edu> In-Reply-To: <20070725171214.GC25749@citi.umich.edu> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=ham version=3.1.8 X-Spam-Checker-Version: SpamAssassin 3.1.8 (2007-02-13) on ns.trinitel.com Cc: freebsd-fs@freebsd.org Subject: Re: handling unresonsive NFS servers X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Jul 2007 18:37:17 -0000 Jim Rees wrote: > Afs has the same problem, and solves it by marking a server "down" when it > doesn't respond. The timeout is very long, like a minute or more. Normally > this would permanently hang the client, but once the server is marked down, > any subsequent operations fail immediately. The client checks periodically > to see if the server has come back up. Failing this way is better than > waiting forever, because waiting forever results in a reboot when the > machine's owner runs out of patience. For 'fail immediately', what does that mean? It returns EIO? That might be sufficient, although I think 1min is pretty low for NFS. Of course, if it's settable, then that's good. 
:) Eric From owner-freebsd-fs@FreeBSD.ORG Wed Jul 25 18:41:26 2007 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 4576916A418 for ; Wed, 25 Jul 2007 18:41:26 +0000 (UTC) (envelope-from rees@citi.umich.edu) Received: from citi.umich.edu (citi.umich.edu [141.211.133.111]) by mx1.freebsd.org (Postfix) with ESMTP id 1F7B013C442 for ; Wed, 25 Jul 2007 18:41:18 +0000 (UTC) (envelope-from rees@citi.umich.edu) Received: from citi.umich.edu (dsl093-001-248.det1.dsl.speakeasy.net [66.93.1.248]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client CN "Jim Rees", Issuer "CITI Production KCA" (verified OK)) by citi.umich.edu (Postfix) with ESMTP id ECA3A47E9; Wed, 25 Jul 2007 14:41:17 -0400 (EDT) Date: Wed, 25 Jul 2007 14:41:15 -0400 From: Jim Rees To: Eric Anderson Message-ID: <20070725184114.GA12728@citi.umich.edu> References: <20070725171214.GC25749@citi.umich.edu> <46A7985C.3010202@freebsd.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <46A7985C.3010202@freebsd.org> Cc: freebsd-fs@freebsd.org Subject: Re: handling unresonsive NFS servers X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Jul 2007 18:41:26 -0000 Eric Anderson wrote: For 'fail immediately', what does that mean? It returns EIO? I don't know. Personally I like EHOSTDOWN. From owner-freebsd-fs@FreeBSD.ORG Wed Jul 25 18:57:42 2007 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id AB8DF16A41F; Wed, 25 Jul 2007 18:57:42 +0000 (UTC) (envelope-from dfr@rabson.org) Received: from itchy.rabson.org (unknown [IPv6:2001:618:400::50b1:e8f2]) by mx1.freebsd.org (Postfix) with ESMTP id 2FDDA13C483; Wed, 25 Jul 2007 18:57:42 +0000 (UTC) (envelope-from dfr@rabson.org) Received: from [80.177.232.250] (herring.rabson.org [80.177.232.250]) by itchy.rabson.org (8.13.3/8.13.3) with ESMTP id l6PIva0C009137; Wed, 25 Jul 2007 19:57:38 +0100 (BST) (envelope-from dfr@rabson.org) From: Doug Rabson To: Bakul Shah In-Reply-To: <20070725174715.9F47E5B3B@mail.bitblocks.com> References: <20070725174715.9F47E5B3B@mail.bitblocks.com> Content-Type: text/plain Date: Wed, 25 Jul 2007 19:57:36 +0100 Message-Id: <1185389856.3698.11.camel@herring.rabson.org> Mime-Version: 1.0 X-Mailer: Evolution 2.10.2 FreeBSD GNOME Team Port Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=-1.4 required=5.0 tests=ALL_TRUSTED autolearn=failed version=3.1.0 X-Spam-Checker-Version: SpamAssassin 3.1.0 (2005-09-13) on itchy.rabson.org X-Virus-Scanned: ClamAV 0.87.1/3763/Wed Jul 25 16:37:41 2007 on itchy.rabson.org X-Virus-Status: Clean Cc: freebsd-fs@freebsd.org, Pawel Jakub Dawidek , Mark Powell Subject: Re: ZfS & GEOM with many odd drive sizes X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Jul 2007 18:57:42 -0000 On Wed, 2007-07-25 at 10:47 -0700, Bakul Shah wrote: > > If you do that, ZFS can use its checksums to continually > > monitor the two sides of your mirrors for consistency and will be able > > to notice as early as possible when one of the drives goes flakey. 
> > Does it really do this? As I understood it, only one of the > disks in a mirror will be read for a given block. If the > checksum fails, the same block from the other disk is read > and checksummed. If all the disks in a mirror are read for > every block, ZFS read performance would get somewhat worse > instead of linear scaling up with more disks in a mirror. In > order to monitor data on both disks one would need to > periodically run "zpool scrub", no? But that is not > *continuous* monitoring of the two sides. This is of course correct. I should have said "continuously checks the data which you are actually looking at on a regular basis". The consistency check is via the block checksum (not comparing the data from the two sides of the mirror). From owner-freebsd-fs@FreeBSD.ORG Wed Jul 25 19:46:38 2007 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id BF84416A419 for ; Wed, 25 Jul 2007 19:46:38 +0000 (UTC) (envelope-from nowickis@tlen.pl) Received: from tur.go2.pl (tur.go2.pl [193.17.41.50]) by mx1.freebsd.org (Postfix) with ESMTP id 7F2BE13C4A3 for ; Wed, 25 Jul 2007 19:46:38 +0000 (UTC) (envelope-from nowickis@tlen.pl) Received: from rekin18.go2.pl (rekin18.go2.pl [193.17.41.40]) by tur.go2.pl (o2.pl Mailer 2.0.1) with ESMTP id F22B7230980 for ; Wed, 25 Jul 2007 21:16:29 +0200 (CEST) Received: from o2.pl (unknown [10.0.0.38]) by rekin18.go2.pl (Postfix) with SMTP id 0254253DD5 for ; Wed, 25 Jul 2007 21:16:28 +0200 (CEST) From: =?UTF-8?Q?nowickis?= To: freebsd-fs@freebsd.org Mime-Version: 1.0 Message-ID: <123cb29.3cae55fc.46a7a18c.933@o2.pl> Date: Wed, 25 Jul 2007 21:16:28 +0200 X-Originator: 89.78.226.21 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Subject: UnionFS X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Jul 2007 19:46:38 -0000 Hi. I'm curious about your experience with unionfs. Have you tried it? Did you have any trouble while using it?
Sebastian From owner-freebsd-fs@FreeBSD.ORG Thu Jul 26 02:25:30 2007 Return-Path: Delivered-To: freebsd-fs@hub.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E35E316A419; Thu, 26 Jul 2007 02:25:30 +0000 (UTC) (envelope-from rodrigc@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id BB63113C468; Thu, 26 Jul 2007 02:25:30 +0000 (UTC) (envelope-from rodrigc@FreeBSD.org) Received: from freefall.freebsd.org (rodrigc@localhost [127.0.0.1]) by freefall.freebsd.org (8.14.1/8.14.1) with ESMTP id l6Q2PULe047966; Thu, 26 Jul 2007 02:25:30 GMT (envelope-from rodrigc@freefall.freebsd.org) Received: (from rodrigc@localhost) by freefall.freebsd.org (8.14.1/8.14.1/Submit) id l6Q2PU38047962; Thu, 26 Jul 2007 02:25:30 GMT (envelope-from rodrigc) Date: Thu, 26 Jul 2007 02:25:30 GMT Message-Id: <200707260225.l6Q2PU38047962@freefall.freebsd.org> To: rodrigc@FreeBSD.org, freebsd-bugs@FreeBSD.org, freebsd-fs@FreeBSD.org From: rodrigc@FreeBSD.org Cc: Subject: Re: kern/112658: [smbfs] [patch] smbfs and caching problems (resolves bin/111004) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 26 Jul 2007 02:25:31 -0000 Synopsis: [smbfs] [patch] smbfs and caching problems (resolves bin/111004) Responsible-Changed-From-To: freebsd-bugs->freebsd-fs Responsible-Changed-By: rodrigc Responsible-Changed-When: Thu Jul 26 02:24:18 UTC 2007 Responsible-Changed-Why: http://www.freebsd.org/cgi/query-pr.cgi?pr=112658 From owner-freebsd-fs@FreeBSD.ORG Thu Jul 26 06:59:28 2007 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 088E916A41F for ; Thu, 26 Jul 2007 06:59:28 +0000 (UTC) (envelope-from M.S.Powell@salford.ac.uk) Received: from abbe.salford.ac.uk (abbe.salford.ac.uk [146.87.0.10]) by mx1.freebsd.org (Postfix) with SMTP id 7226413C46C for ; Thu, 26 Jul 2007 06:59:27 +0000 (UTC) (envelope-from M.S.Powell@salford.ac.uk) Received: (qmail 12054 invoked by uid 98); 26 Jul 2007 07:59:25 +0100 Received: from 146.87.255.121 by abbe.salford.ac.uk (envelope-from , uid 401) with qmail-scanner-2.01 (clamdscan: 0.90/3775. spamassassin: 3.1.8. Clear:RC:1(146.87.255.121):. 
Processed in 0.106566 secs); 26 Jul 2007 06:59:25 -0000 Received: from rust.salford.ac.uk (HELO rust.salford.ac.uk) (146.87.255.121) by abbe.salford.ac.uk (qpsmtpd/0.3x.614) with SMTP; Thu, 26 Jul 2007 07:59:25 +0100 Received: (qmail 68238 invoked by uid 1002); 26 Jul 2007 06:59:23 -0000 Received: from localhost (sendmail-bs@127.0.0.1) by localhost with SMTP; 26 Jul 2007 06:59:23 -0000 Date: Thu, 26 Jul 2007 07:59:23 +0100 (BST) From: "Mark Powell" To: Doug Rabson In-Reply-To: <1185389856.3698.11.camel@herring.rabson.org> Message-ID: <20070726075607.W68220@rust.salford.ac.uk> References: <20070725174715.9F47E5B3B@mail.bitblocks.com> <1185389856.3698.11.camel@herring.rabson.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-fs@freebsd.org, Pawel Jakub Dawidek , Mark Powell Subject: Re: ZfS & GEOM with many odd drive sizes X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 26 Jul 2007 06:59:28 -0000 On Wed, 25 Jul 2007, Doug Rabson wrote: > On Wed, 2007-07-25 at 10:47 -0700, Bakul Shah wrote: >> Does it really do this? As I understood it, only one of the >> disks in a mirror will be read for a given block. If the >> checksum fails, the same block from the other disk is read >> and checksummed. If all the disks in a mirror are read for >> every block, ZFS read performance would get somewhat worse >> instead of linear scaling up with more disks in a mirror. In >> order to monitor data on both disks one would need to >> periodically run "zpool scrub", no? But that is not >> *continuous* monitoring of the two sides. > > This is of course correct. I should have said "continuously checks the > data which you are actually looking at on a regular basis". The > consistency check is via the block checksum (not comparing the data from > the two sides of the mirror). According to this: http://www.opensolaris.org/jive/thread.jspa?threadID=23093&tstart=0 RAID-Z has to read every drive to be able to checksum a block. Isn't this the reason why RAID-Z random reads are so slow and also the reason the pre-fetcher exists to speed up sequential reads? Cheers. -- Mark Powell - UNIX System Administrator - The University of Salford Information Services Division, Clifford Whitworth Building, Salford University, Manchester, M5 4WT, UK.
Tel: +44 161 295 4837 Fax: +44 161 295 5888 www.pgp.com for PGP key From owner-freebsd-fs@FreeBSD.ORG Thu Jul 26 07:29:43 2007 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2068F16A419; Thu, 26 Jul 2007 07:29:43 +0000 (UTC) (envelope-from dfr@rabson.org) Received: from itchy.rabson.org (unknown [IPv6:2001:618:400::50b1:e8f2]) by mx1.freebsd.org (Postfix) with ESMTP id 8111913C45B; Thu, 26 Jul 2007 07:29:42 +0000 (UTC) (envelope-from dfr@rabson.org) Received: from [80.177.232.250] (herring.rabson.org [80.177.232.250]) by itchy.rabson.org (8.13.3/8.13.3) with ESMTP id l6Q7TXR4016034; Thu, 26 Jul 2007 08:29:33 +0100 (BST) (envelope-from dfr@rabson.org) From: Doug Rabson To: Mark Powell In-Reply-To: <20070726075607.W68220@rust.salford.ac.uk> References: <20070725174715.9F47E5B3B@mail.bitblocks.com> <1185389856.3698.11.camel@herring.rabson.org> <20070726075607.W68220@rust.salford.ac.uk> Content-Type: text/plain Date: Thu, 26 Jul 2007 08:29:33 +0100 Message-Id: <1185434973.3698.18.camel@herring.rabson.org> Mime-Version: 1.0 X-Mailer: Evolution 2.10.2 FreeBSD GNOME Team Port Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=-1.4 required=5.0 tests=ALL_TRUSTED autolearn=failed version=3.1.0 X-Spam-Checker-Version: SpamAssassin 3.1.0 (2005-09-13) on itchy.rabson.org X-Virus-Scanned: ClamAV 0.87.1/3775/Thu Jul 26 06:56:02 2007 on itchy.rabson.org X-Virus-Status: Clean Cc: freebsd-fs@freebsd.org, Pawel Jakub Dawidek Subject: Re: ZfS & GEOM with many odd drive sizes X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 26 Jul 2007 07:29:43 -0000 On Thu, 2007-07-26 at 07:59 +0100, Mark Powell wrote: > On Wed, 25 Jul 2007, Doug Rabson wrote: > > On Wed, 2007-07-25 at 10:47 -0700, Bakul Shah wrote: > >> Does it really do this? As I understood it, only one of the > >> disks in a mirror will be read for a given block. If the > >> checksum fails, the same block from the other disk is read > >> and checksummed. If all the disks in a mirror are read for > >> every block, ZFS read performance would get somewhat worse > >> instead of linear scaling up with more disks in a mirror. In > >> order to monitor data on both disks one would need to > >> periodically run "zpool scrub", no? But that is not > >> *continuous* monitoring of the two sides. > > > > This is of course correct. I should have said "continuously checks the > > data which you are actually looking at on a regular basis". The > > consistency check is via the block checksum (not comparing the data from > > the two sides of the mirror). > > According to this: > > http://www.opensolaris.org/jive/thread.jspa?threadID=23093&tstart=0 > > RAID-Z has to read every drive to be able to checksum a block. > Isn't this the reason why RAID-Z random reads are so slow and also the > reason the pre-fetcher exists to speed up sequential reads? > Cheers. When it's reading, RAID-Z only has to read the blocks which contain data - the parity block is only read if either the vdev is in degraded mode after a drive failure or one (two for RAID-Z2) of the data block reads fails. For pools which contain a single RAID-Z or RAID-Z2 group, this is probably a performance issue. Larger pools containing multiple RAID-Z groups can spread the load to improve this.
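To put the two points from this thread in command form: a pool built from several smaller RAID-Z groups lets reads be serviced by the groups in parallel, and a periodic scrub is what actually verifies every copy and parity block rather than only the data a given read happens to touch. A minimal sketch, with a hypothetical pool name and disk names (not a sizing recommendation):

    # Eight disks arranged as two 4-disk (3+1) raidz groups in one pool,
    # rather than a single 8-disk group; reads spread across both groups.
    zpool create tank raidz ad0 ad1 ad2 ad3 raidz ad4 ad5 ad6 ad7

    # Walk the whole pool, verify all copies and parity against their
    # checksums, then check the per-device error counters.
    zpool scrub tank
    zpool status -v tank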
From owner-freebsd-fs@FreeBSD.ORG Thu Jul 26 07:47:17 2007 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 0AA3D16A417 for ; Thu, 26 Jul 2007 07:47:17 +0000 (UTC) (envelope-from M.S.Powell@salford.ac.uk) Received: from abbe.salford.ac.uk (abbe.salford.ac.uk [146.87.0.10]) by mx1.freebsd.org (Postfix) with SMTP id 7462B13C468 for ; Thu, 26 Jul 2007 07:47:16 +0000 (UTC) (envelope-from M.S.Powell@salford.ac.uk) Received: (qmail 33792 invoked by uid 98); 26 Jul 2007 08:47:15 +0100 Received: from 146.87.255.121 by abbe.salford.ac.uk (envelope-from , uid 401) with qmail-scanner-2.01 (clamdscan: 0.90/3775. spamassassin: 3.1.8. Clear:RC:1(146.87.255.121):. Processed in 0.064179 secs); 26 Jul 2007 07:47:15 -0000 Received: from rust.salford.ac.uk (HELO rust.salford.ac.uk) (146.87.255.121) by abbe.salford.ac.uk (qpsmtpd/0.3x.614) with SMTP; Thu, 26 Jul 2007 08:47:15 +0100 Received: (qmail 68571 invoked by uid 1002); 26 Jul 2007 07:47:13 -0000 Received: from localhost (sendmail-bs@127.0.0.1) by localhost with SMTP; 26 Jul 2007 07:47:13 -0000 Date: Thu, 26 Jul 2007 08:47:13 +0100 (BST) From: "Mark Powell" To: Doug Rabson In-Reply-To: <1185434973.3698.18.camel@herring.rabson.org> Message-ID: <20070726083224.O68220@rust.salford.ac.uk> References: <20070725174715.9F47E5B3B@mail.bitblocks.com> <1185389856.3698.11.camel@herring.rabson.org> <20070726075607.W68220@rust.salford.ac.uk> <1185434973.3698.18.camel@herring.rabson.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-fs@freebsd.org, Pawel Jakub Dawidek Subject: Re: ZfS & GEOM with many odd drive sizes X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 26 Jul 2007 07:47:17 -0000 On Thu, 26 Jul 2007, Doug Rabson wrote: > When its reading, RAID-Z only has to read the blocks which contain data > - the parity block is only read if either the vdev is in degraded mode > after a drive failure or one (two for RAID-Z2) of the data block reads > fails. Yes, but that article does not mention reading parity. What it's saying is that every block is striped across multiple drives. The checksum for that block thus applies to data which is on multiple drives. Therefore to checksum a block you have to read all the parts of the block from every drive except one in the RAIDz array: "This makes read performance of a RAID-Z pool be the same as that of a single disk, even if you only needed a small read from block D." > For pools which contain a single RAID-Z or RAID-Z2 group, this is > probably a performance issue. Larger pools containing multiple RAID-Z > groups can spread the load to improve this. This isn't something that's immediately obvious, coming from fixed stripe size raid5. Now it seems that the variable stripe size has a rather serious performance penalty. It seems that if you have 8 drives, it'd be much more prudent to make two RAIDz of 3+1 rather than one of 6+2. Cheers. -- Mark Powell - UNIX System Administrator - The University of Salford Information Services Division, Clifford Whitworth Building, Salford University, Manchester, M5 4WT, UK. 
Tel: +44 161 295 4837 Fax: +44 161 295 5888 www.pgp.com for PGP key From owner-freebsd-fs@FreeBSD.ORG Fri Jul 27 09:32:22 2007 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id D3BCD16A417 for ; Fri, 27 Jul 2007 09:32:22 +0000 (UTC) (envelope-from M.S.Powell@salford.ac.uk) Received: from abbe.salford.ac.uk (abbe.salford.ac.uk [146.87.0.10]) by mx1.freebsd.org (Postfix) with SMTP id 51EC113C458 for ; Fri, 27 Jul 2007 09:32:21 +0000 (UTC) (envelope-from M.S.Powell@salford.ac.uk) Received: (qmail 80895 invoked by uid 98); 27 Jul 2007 10:32:20 +0100 Received: from 146.87.255.121 by abbe.salford.ac.uk (envelope-from , uid 401) with qmail-scanner-2.01 (clamdscan: 0.90/3779. spamassassin: 3.1.8. Clear:RC:1(146.87.255.121):. Processed in 0.056505 secs); 27 Jul 2007 09:32:20 -0000 Received: from rust.salford.ac.uk (HELO rust.salford.ac.uk) (146.87.255.121) by abbe.salford.ac.uk (qpsmtpd/0.3x.614) with SMTP; Fri, 27 Jul 2007 10:32:20 +0100 Received: (qmail 78183 invoked by uid 1002); 27 Jul 2007 09:32:18 -0000 Received: from localhost (sendmail-bs@127.0.0.1) by localhost with SMTP; 27 Jul 2007 09:32:18 -0000 Date: Fri, 27 Jul 2007 10:32:18 +0100 (BST) From: "Mark Powell" To: freebsd-fs@freebsd.org Message-ID: <20070727100039.V68220@rust.salford.ac.uk> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Subject: Breaking raidz and zpool bug? X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 27 Jul 2007 09:32:22 -0000 Hi, I have a machine with two identical drives, ad[01]. Each has a 4GB slice 1 and the rest as slice 2. Slice 1 contains a gmirrored UFS /boot and swap. Slice 2 I foolishly raidz'ed and put / on there. All works well. I realised the error of making a raidz of only 2 drives and wanted to convert this setup to gmirror without any backup/restore or pulling of drives to force them to error. I assume the difficulty in doing this comes from deliberate safeguards to prevent data loss in normal usage? First I needed to break the raidz and stop it using one of the drives, so I could make the mirror on it. I thought I could just wipe ad1s2, but I am prevented from doing that because it's being used by ZFS, even with kern.geom.debugflags=16. I couldn't change the partition details using fdisk for ad1s2, as it doesn't allow that for partitions in use. So I blanked sector 0 on ad1 and rebooted. zpool status showed the raidz as degraded. I thought I'd then create a zpool mirror on ad1s2, but of course I can't, because it's still part of the raidz. I could find no way to remove ad1s2 from the raidz; zpool detach is only for hot spares. I tried to get around the system not letting me do anything with ad1s2 by creating an identical ad1s3 and then changing the slice type of ad1s2 to 1 (DOS FAT 16-bit). I rebooted, but the ZFS root would not mount. I booted into a test environment and zpool status told me the worst: no replicas could be found. At first I assumed I'd made a mess of something, but on reflection I was sure I'd not touched ad0. I changed the type of ad1s2 back to FreeBSD (165) and the ZFS root worked fine again, albeit in the degraded state. Surely it shouldn't be possible to break a raidz simply by changing the slice type? Is this a bug? And does anyone have ideas for what I was trying to do? Cheers.
-- Mark Powell - UNIX System Administrator - The University of Salford Information Services Division, Clifford Whitworth Building, Salford University, Manchester, M5 4WT, UK. Tel: +44 161 295 4837 Fax: +44 161 295 5888 www.pgp.com for PGP key
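As for the raidz-to-gmirror conversion described above, one untested sketch of the general sequence, assuming the pool is named "tank" (the real pool name isn't given in the report): zpool offline releases the slice without the slice-type trick that left the pool unmountable, at the cost of the pool staying degraded until it is destroyed at the end.

    # 1. Take the second slice out of service; the raidz goes DEGRADED
    #    but keeps running on ad0s2.
    zpool offline tank ad1s2

    # 2. Build a one-member gmirror on the freed slice and put UFS on it.
    gmirror label -v gm1 /dev/ad1s2
    newfs /dev/mirror/gm1
    mount /dev/mirror/gm1 /mnt

    # 3. Copy the live root across with a file-level copy (dump/restore
    #    cannot read ZFS).  Illustrative only -- exclude /mnt itself,
    #    /dev and any other mounted filesystems from the copy:
    #      tar -C / -cf - . | tar -C /mnt -xpf -

    # 4. Point fstab and the loader at the gmirror, reboot onto it, then
    #    destroy the degraded pool and attach the first slice as the
    #    second mirror member.
    zpool destroy -f tank
    gmirror insert gm1 /dev/ad0s2

Both ZFS and gmirror keep metadata on the provider itself, so clearing the old ZFS labels on the freed slice (for example by dd'ing zeros over its first and last megabyte) before step 2 would be prudent; and note that none of this removes ad1s2 from the raidz's configuration, which is exactly the limitation the original message ran into.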