From owner-freebsd-fs@FreeBSD.ORG Sun Jul 15 01:54:04 2007 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 243F016A403 for ; Sun, 15 Jul 2007 01:54:04 +0000 (UTC) (envelope-from aaron@goflexitllc.com) Received: from rwcrmhc11.comcast.net (rwcrmhc11.comcast.net [204.127.192.81]) by mx1.freebsd.org (Postfix) with ESMTP id 13A5A13C441 for ; Sun, 15 Jul 2007 01:54:04 +0000 (UTC) (envelope-from aaron@goflexitllc.com) Received: from charlie.anbcs.com (anbcs.com[68.52.106.142]) by comcast.net (rwcrmhc11) with ESMTP id <20070715014355m1100q1h7ie>; Sun, 15 Jul 2007 01:43:59 +0000 Message-ID: <46997CC3.3030405@goflexitllc.com> Date: Sat, 14 Jul 2007 20:47:47 -0500 From: Aaron Hurt User-Agent: Thunderbird 2.0.0.4 (Macintosh/20070604) MIME-Version: 1.0 To: freebsd-fs@freebsd.org X-Content-Filtered-By: Mailman/MimeDel 2.1.5 Subject: gconcat incorrect superblock after adding a disk X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems X-List-Received-Date: Sun, 15 Jul 2007 01:54:04 -0000

I have a fairly decent sized gconcat array composed of 4 disks (ad8, ad10, ad12, ad14), the previously working members. Recently I tried to add another disk using the following procedure:

umount /dev/concat/store1
gconcat stop store1
gconcat label store1 ad8 ad10 ad12 ad14 ad4
growfs /dev/concat/store1

... it was at this point that I got the incorrect superblock error after adding ad4. Now, even if I try to remove ad4 and label with the original disks (keeping the original order), it still will not mount or fsck. The exact messages are below:

schroder# mount /dev/concat/store1 /store
mount: /dev/concat/store1 on /store: incorrect super block

Geom name: store1
State: UP
Status: Total=4, Online=4
Type: AUTOMATIC
ID: 1480896172
Providers:
1. Name: concat/store1
Mediasize: 640167540736 (596G)
Sectorsize: 512
Mode: r0w0e0
Consumers:
1. Name: ad8
Mediasize: 120034123776 (112G)
Sectorsize: 512
Mode: r0w0e0
Start: 0
End: 120034123264
2. Name: ad10
Mediasize: 200049647616 (186G)
Sectorsize: 512
Mode: r0w0e0
Start: 120034123264
End: 320083770368
3. Name: ad12
Mediasize: 160041885696 (149G)
Sectorsize: 512
Mode: r0w0e0
Start: 320083770368
End: 480125655552
4. Name: ad14
Mediasize: 160041885696 (149G)
Sectorsize: 512
Mode: r0w0e0
Start: 480125655552
End: 640167540736

schroder# fdisk /dev/concat/store1
******* Working on device /dev/concat/store1 *******
parameters extracted from in-core disklabel are:
cylinders=77829 heads=255 sectors/track=63 (16065 blks/cyl)

Figures below won't work with BIOS for partitions not in cyl 1
parameters to be used for BIOS calculations are:
cylinders=77829 heads=255 sectors/track=63 (16065 blks/cyl)

fdisk: invalid fdisk partition table found
Media sector size is 512
Warning: BIOS sector numbering starts with sector 1
Information from DOS bootblock is:
The data for partition 1 is:
sysid 165 (0xa5),(FreeBSD/NetBSD/386BSD)
    start 63, size 1250322822 (610509 Meg), flag 80 (active)
        beg: cyl 0/ head 1/ sector 1;
        end: cyl 4/ head 254/ sector 63
The data for partition 2 is:
<UNUSED>
The data for partition 3 is:
<UNUSED>
The data for partition 4 is:
<UNUSED>

Any and all help or suggestions would be appreciated. I can be reached via this email address or by my cell phone number below. The data is not life threatening since it is just my home server, but there are several hundred gigs of personal movie clips, family photos, music and whatnot. I would really like to be able to save this data if at all possible (and then make an immediate backup of it). I do believe I will be purchasing some more disks soon for a separate duplicate machine that can clone this storage box.

Thank You,

--
Aaron Hurt
Managing Partner
Flex I.T., LLC
611 Commerce Street
Suite 3117
Nashville, TN 37203
Phone: 615.438.7101
E-mail: aaron@goflexitllc.com

From owner-freebsd-fs@FreeBSD.ORG Sun Jul 15 03:29:03 2007 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 435AE16A400 for ; Sun, 15 Jul 2007 03:29:03 +0000 (UTC) (envelope-from anderson@freebsd.org) Received: from ns.trinitel.com (186.161.36.72.static.reverse.layeredtech.com [72.36.161.186]) by mx1.freebsd.org (Postfix) with ESMTP id 24ECC13C442 for ; Sun, 15 Jul 2007 03:29:03 +0000 (UTC) (envelope-from anderson@freebsd.org) Received: from neutrino.vnode.org (r74-193-81-203.pfvlcmta01.grtntx.tl.dh.suddenlink.net [74.193.81.203]) (authenticated bits=0) by ns.trinitel.com (8.14.1/8.14.1) with ESMTP id l6F3T1L6040870 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES256-SHA bits=256 verify=NO); Sat, 14 Jul 2007 22:29:02 -0500 (CDT) (envelope-from anderson@freebsd.org) Message-ID: <46999478.1070501@freebsd.org> Date: Sat, 14 Jul 2007 22:28:56 -0500 From: Eric Anderson User-Agent: Thunderbird 2.0.0.4 (X11/20070629) MIME-Version: 1.0 To: Aaron Hurt References: <46997CC3.3030405@goflexitllc.com> In-Reply-To: <46997CC3.3030405@goflexitllc.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=ham version=3.1.8 X-Spam-Checker-Version: SpamAssassin 3.1.8 (2007-02-13) on ns.trinitel.com Cc: freebsd-fs@freebsd.org Subject: Re: gconcat incorrect superblock after adding a disk X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems X-List-Received-Date: Sun, 15 Jul 2007 03:29:03 -0000

On 07/14/07 20:47, Aaron Hurt wrote:
> I have a fairly decent sized gconcat array composed of 4 disks (ad8, ad10, ad12, ad14), the previously working members.
> Recently I tried to add another disk using the following procedure:
>
> umount /dev/concat/store1
> gconcat stop store1
> gconcat label store1 ad8 ad10 ad12 ad14 ad4
> growfs /dev/concat/store1
>
> ... it was at this point that I got the incorrect superblock error after adding ad4. Now, even if I try to remove ad4 and label with the original disks (keeping the original order), it still will not mount or fsck. The exact messages are below:
>
> schroder# mount /dev/concat/store1 /store
> mount: /dev/concat/store1 on /store: incorrect super block
>
> [geom list and fdisk output snipped; quoted in full in Aaron's mail above]
>
> Any and all help or suggestions would be appreciated. [...] I would really like to be able to save this data if at all possible (and then make an immediate backup of it).

First, once you add a disk to a concat and grow the file system, I would not recommend removing the disk from the concat. If possible, and you haven't written anything else to it, put it back on the concat, and then move forward.

It looks like you have a slice on that concat, but you are trying to mount the device, not the slice. Did you fsck the partition after you did the growfs, and before attempting a mount? I would suspect (if you can get the disk back in the concat, and you have not already written over it) that you can just do something like:

fsck -y /dev/concat/store1s1

or maybe

fsck -y /dev/concat/store1s1a

and then mount it using that name. Do a:

ls -al /dev/concat/store1*

and send that output.

Also, for what it's worth, when you add drives to make a single device without any redundancy, you increase your chances of a total failure. The more drives, the less resilient.

Eric
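[Editorial note: to make the suggestion above concrete, here is a minimal recovery sketch. It assumes the file system lives on an MBR slice inside the concat (as the fdisk output suggests) and that ad4 was not written to after it was removed; the concat members are taken from Aaron's mail, but the exact slice/partition node that appears under /dev/concat/ is an assumption until ls confirms it:

# re-create the concat with the same members, in the same order, ad4 last
gconcat label store1 ad8 ad10 ad12 ad14 ad4
# see which slice/partition nodes GEOM tasted on top of it
ls -al /dev/concat/store1*
# check the file system on the slice, not on the raw concat device
fsck /dev/concat/store1s1
# if fsck completes cleanly, mount under the same name
mount /dev/concat/store1s1 /store

Running fsck once without -y first is the safer order here: if growfs was pointed at the raw device while the file system actually lives in the slice, the damage should be inspected before answering 'y' to everything.]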
From owner-freebsd-fs@FreeBSD.ORG Sun Jul 15 07:45:46 2007 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id A12DE16A406 for ; Sun, 15 Jul 2007 07:45:46 +0000 (UTC) (envelope-from pjd@garage.freebsd.pl) Received: from mail.garage.freebsd.pl (arm132.internetdsl.tpnet.pl [83.17.198.132]) by mx1.freebsd.org (Postfix) with ESMTP id 07F5F13C4B7 for ; Sun, 15 Jul 2007 07:45:45 +0000 (UTC) (envelope-from pjd@garage.freebsd.pl) Received: by mail.garage.freebsd.pl (Postfix, from userid 65534) id 9A8A94881A; Sun, 15 Jul 2007 09:45:43 +0200 (CEST) Received: from localhost (154.81.datacomsa.pl [195.34.81.154]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.garage.freebsd.pl (Postfix) with ESMTP id 0CE5C487FA; Sun, 15 Jul 2007 09:45:37 +0200 (CEST) Date: Sun, 15 Jul 2007 09:45:15 +0200 From: Pawel Jakub Dawidek To: Aaron Hurt Message-ID: <20070715074515.GA9823@garage.freebsd.pl> References: <46997CC3.3030405@goflexitllc.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <46997CC3.3030405@goflexitllc.com> User-Agent: Mutt/1.4.2.3i X-PGP-Key-URL: http://people.freebsd.org/~pjd/pjd.asc X-OS: FreeBSD 7.0-CURRENT i386 X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on mail.garage.freebsd.pl X-Spam-Level: X-Spam-Status: No, score=-2.6 required=3.0 tests=BAYES_00 autolearn=ham version=3.0.4 Cc: freebsd-fs@freebsd.org Subject: Re: gconcat incorrect superblock after adding a disk X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems X-List-Received-Date: Sun, 15 Jul 2007 07:45:46 -0000

On Sat, Jul 14, 2007 at 08:47:47PM -0500, Aaron Hurt wrote:
> I have a fairly decent sized gconcat array composed of 4 disks (ad8, ad10, ad12, ad14), the previously working members. Recently I tried to add another disk using the following procedure:
>
> umount /dev/concat/store1
> gconcat stop store1
> gconcat label store1 ad8 ad10 ad12 ad14 ad4
> growfs /dev/concat/store1
>
> ... it was at this point that I got the incorrect superblock error after adding ad4. Now, even if I try to remove ad4 and label with the original disks (keeping the original order), it still will not mount or fsck. The exact messages are below:
>
> schroder# mount /dev/concat/store1 /store
> mount: /dev/concat/store1 on /store: incorrect super block

Your data should be safe if you didn't write anything to the disks yet and growfs(8) didn't corrupt your file system somehow. Gconcat itself won't touch your data - the only thing it does is write its metadata into the last sector of each disk. Are you sure you used exactly the same order as when you created the concatenated device without the ad4 disk? Did growfs finish successfully?

> Geom name: store1
> State: UP
> Status: Total=4, Online=4
> Type: AUTOMATIC
> ID: 1480896172
> [provider and consumer list snipped; quoted in full in Aaron's mail above]

If this is from before you added ad4, then the order is correct.

> schroder# fdisk /dev/concat/store1
> [...]
> The data for partition 1 is:
> sysid 165 (0xa5),(FreeBSD/NetBSD/386BSD)
>     start 63, size 1250322822 (610509 Meg), flag 80 (active)

You have a slice there? You extended /dev/concat/store1, not /dev/concat/store1s1. Where is your file system? On store1, store1s1, store1s1a?

--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!

From owner-freebsd-fs@FreeBSD.ORG Mon Jul 16 10:18:20 2007 Return-Path: X-Original-To: fs@FreeBSD.org Delivered-To: freebsd-fs@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id A37D616A496; Mon, 16 Jul 2007 10:18:20 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail12.syd.optusnet.com.au (mail12.syd.optusnet.com.au [211.29.132.193]) by mx1.freebsd.org (Postfix) with ESMTP id 23D2F13C474; Mon, 16 Jul 2007 10:18:19 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c220-239-235-248.carlnfd3.nsw.optusnet.com.au [220.239.235.248]) by mail12.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id l6GAIET6011564 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Mon, 16 Jul 2007 20:18:16 +1000 Date: Mon, 16 Jul 2007 20:18:14 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Kostik Belousov In-Reply-To: <20070712142127.GD2200@deviant.kiev.zoral.com.ua> Message-ID: <20070716195556.P12807@besplex.bde.org> References: <20070710233455.O2101@besplex.bde.org> <20070712084115.GA2200@deviant.kiev.zoral.com.ua> <20070712225324.F9515@besplex.bde.org> <20070712142127.GD2200@deviant.kiev.zoral.com.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: bugs@FreeBSD.org, fs@FreeBSD.org Subject: Re: msdosfs not MPSAFE X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems X-List-Received-Date: Mon, 16 Jul 2007 10:18:20 -0000

On Thu, 12 Jul 2007, Kostik Belousov wrote:
> On Thu, Jul 12, 2007 at 11:33:40PM +1000, Bruce Evans wrote:
>>
>> On Thu, 12 Jul 2007, Kostik Belousov wrote:
>>
>>> On Wed, Jul 11, 2007 at 12:08:19AM +1000, Bruce Evans wrote:
>>>> msdosfs has been broken since Giant locking for file systems (or syscalls) was removed. It allows multiple threads to race accessing the shared static buffer `nambuf' and related variables.
This causes remarkably >> >>> It seems that msdosfs_lookup() can sleep, thus Giant protection would be >>> lost. >> >> It can certainly block in bread(). > Besides bread(), there is a (re)locking for ".." case, and deget() call, > that itself calls malloc(M_WAITOK), vfs_hash_get(), getnewvnode() and > readep(). The latter itself calls bread(). > > This is from the brief look. I think msdosfs_lookup() doesn't need to own nambuf near the deget() call. Not sure -- I was looking more at msdosfs_readdir(). >> How does my adding Giant locking help? I checked that at least in >> FreeBSD-~5.2-current, msdosfs_readdir() is already Giant-locked, so my >> fix just increments the recursion count. What happens to recursively- >> held Giant locks across sleeps? I think they should cause a KASSERT() >> failure, but if they are handled by only dropping Giant once then my >> fix might sort of work but sleeps would be broken generally. >> > Look at the kern/kern_sync.c:_sleep(). It does DROP_GIANT(), that (from > the sys/mutex.h) calls mtx_unlock() until Giant is owned. So it is very mysterious that Giant locking helped. Anyway, it doesn't work, and cases where it doesn't help showed up in further testing. sx xlocking works, but is not quite right: % Index: msdosfs_denode.c % =================================================================== % RCS file: /home/ncvs/src/sys/fs/msdosfs/msdosfs_denode.c,v % retrieving revision 1.73 % diff -u -2 -r1.73 msdosfs_denode.c % --- msdosfs_denode.c 16 Jun 2004 09:47:03 -0000 1.73 % +++ msdosfs_denode.c 13 Jul 2007 04:58:35 -0000 % @@ -52,10 +52,12 @@ % #include % #include % +#include % #include % #include % +#include % #include % #include % +#include Include this; other changes in this hunk unrelated. % #include % -#include % % #include % @@ -68,9 +70,11 @@ % #include % % +struct sx mbnambuf_lock; % + Declare this. This file has nothing to do with nambuf, but it is the only convenient place to add the init and destroy calls. In -current, there .vfs_init and .vfs_uninit hooks no longer exist, so the patch would be even larger. % static MALLOC_DEFINE(M_MSDOSFSNODE, "MSDOSFS node", "MSDOSFS vnode private part"); % % static struct denode **dehashtbl; % static u_long dehash; /* size of hash table - 1 */ % -#define DEHASH(dev, dcl, doff) (dehashtbl[(minor(dev) + (dcl) + (doff) / \ % +#define DEHASH(dev, dcl, doff) (dehashtbl[(minor(dev) + (dcl) + (doff) / \ % sizeof(struct direntry)) & dehash]) % static struct mtx dehash_mtx; Unrelated cleanup. % @@ -117,4 +121,5 @@ % dehashtbl = hashinit(desiredvnodes/2, M_MSDOSFSMNT, &dehash); % mtx_init(&dehash_mtx, "msdosfs dehash", NULL, MTX_DEF); % + sx_init(&mbnambuf_lock, "mbnambuf"); % return (0); % } % @@ -128,4 +133,5 @@ % free(dehashtbl, M_MSDOSFSMNT); % mtx_destroy(&dehash_mtx); % + sx_destroy(&mbnambuf_lock); % dehash_init--; % return (0); % Index: msdosfs_lookup.c % =================================================================== % RCS file: /home/ncvs/src/sys/fs/msdosfs/msdosfs_lookup.c,v % retrieving revision 1.40 % diff -u -2 -r1.40 msdosfs_lookup.c % --- msdosfs_lookup.c 26 Dec 2003 17:24:37 -0000 1.40 % +++ msdosfs_lookup.c 13 Jul 2007 06:13:04 -0000 % @@ -54,4 +54,5 @@ % #include % #include % +#include % #include % #include % @@ -63,4 +64,6 @@ % #include % % +extern struct sx mbnambuf_lock; % + % /* % * When we search a directory the blocks containing directory entries are This shouldn't be extern. % @@ -78,11 +81,11 @@ % * memory denode's will be in synch. 
% */ % -int % -msdosfs_lookup(ap) % +static int % +msdosfs_lookup_locked( % struct vop_cachedlookup_args /* { % struct vnode *a_dvp; % struct vnode **a_vpp; % struct componentname *a_cnp; % - } */ *ap; % + } */ *ap) % { % struct vnode *vdp = ap->a_dvp; % @@ -560,4 +564,20 @@ % % /* % + * XXX msdosfs_lookup() is split up because unlocking before all the returns % + * in the original function would be too churning. % + */ % +int % +msdosfs_lookup(ap) % + struct vop_cachedlookup_args *ap; % +{ % + int error; % + % + sx_xlock(&mbnambuf_lock); % + error = msdosfs_lookup_locked(ap); % + sx_xunlock(&mbnambuf_lock); % + return (error); % +} % + % +/* % * dep - directory entry to copy into the directory % * ddep - directory to add to % Index: msdosfs_vnops.c % =================================================================== % RCS file: /home/ncvs/src/sys/fs/msdosfs/msdosfs_vnops.c,v % retrieving revision 1.147 % diff -u -2 -r1.147 msdosfs_vnops.c % --- msdosfs_vnops.c 4 Feb 2004 21:52:53 -0000 1.147 % +++ msdosfs_vnops.c 15 Jul 2007 04:07:36 -0000 % @@ -60,4 +62,5 @@ % #include % #include % +#include % #include % #include % @@ -70,14 +73,14 @@ % #include % % -#include % - % #include % -#include % #include % #include % #include % +#include % % #define DOS_FILESIZE_MAX 0xffffffff % % +extern struct sx mbnambuf_lock; % + Declare this; other changes in this hunk unrelated. % /* % * Prototypes for MSDOSFS vnode operations % @@ -1559,4 +1594,5 @@ % } % % + sx_xlock(&mbnambuf_lock); % mbnambuf_init(); % off = offset; % @@ -1687,4 +1727,6 @@ % } % out: % + sx_xunlock(&mbnambuf_lock); % + % /* Subtract unused cookies */ % if (ap->a_ncookies) I didn't add this sx lock in subr_witness.c. This patch is already too large. About half of sx locks seem to be missing in witness. Please fix this better. 
Bruce

From owner-freebsd-fs@FreeBSD.ORG Mon Jul 16 17:23:27 2007 Return-Path: X-Original-To: fs@freebsd.org Delivered-To: freebsd-fs@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id D53AD16A406 for ; Mon, 16 Jul 2007 17:23:27 +0000 (UTC) (envelope-from jazzhills@gmail.com) Received: from ug-out-1314.google.com (ug-out-1314.google.com [66.249.92.173]) by mx1.freebsd.org (Postfix) with ESMTP id 63C9A13C4C1 for ; Mon, 16 Jul 2007 17:23:26 +0000 (UTC) (envelope-from jazzhills@gmail.com) Received: by ug-out-1314.google.com with SMTP id o4so1147874uge for ; Mon, 16 Jul 2007 10:23:25 -0700 (PDT) Received: by 10.78.176.20 with SMTP id y20mr1231651hue.1184605005147; Mon, 16 Jul 2007 09:56:45 -0700 (PDT) Received: by 10.78.170.12 with HTTP; Mon, 16 Jul 2007 09:56:45 -0700 (PDT) Message-ID: <33910a2c0707160956p6d0162d7sf69c428a9fd34146@mail.gmail.com> Date: Mon, 16 Jul 2007 13:56:45 -0300 From: "Jason Hills" To: fs@freebsd.org MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline Cc: Subject: GFarm on FreeBSD X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems X-List-Received-Date: Mon, 16 Jul 2007 17:23:27 -0000

Hello,

Is anyone using Gfarm[1] on FreeBSD here? I saw they have some code for FreeBSD[2], and I am now checking it out from SVN to start testing this FS. If anyone has had production experience with this, I would like to hear about it.
[1]http://datafarm.apgrid.org/document/ [2]http://datafarm.apgrid.org/software/latest/FreeBSD/ -- Jazzie Hills From owner-freebsd-fs@FreeBSD.ORG Tue Jul 17 11:56:47 2007 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 4AE8016A400 for ; Tue, 17 Jul 2007 11:56:47 +0000 (UTC) (envelope-from anderson@freebsd.org) Received: from ns.trinitel.com (186.161.36.72.static.reverse.layeredtech.com [72.36.161.186]) by mx1.freebsd.org (Postfix) with ESMTP id 0CA4F13C491 for ; Tue, 17 Jul 2007 11:56:46 +0000 (UTC) (envelope-from anderson@freebsd.org) Received: from proton.local (209-163-168-124.static.twtelecom.net [209.163.168.124]) (authenticated bits=0) by ns.trinitel.com (8.14.1/8.14.1) with ESMTP id l6HBujs9094642 for ; Tue, 17 Jul 2007 06:56:46 -0500 (CDT) (envelope-from anderson@freebsd.org) Message-ID: <469CAE7D.8090609@freebsd.org> Date: Tue, 17 Jul 2007 06:56:45 -0500 From: Eric Anderson User-Agent: Thunderbird 2.0.0.4 (Macintosh/20070604) MIME-Version: 1.0 To: freebsd-fs@freebsd.org Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=failed version=3.1.8 X-Spam-Checker-Version: SpamAssassin 3.1.8 (2007-02-13) on ns.trinitel.com Subject: NFS on NFS? X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 Jul 2007 11:56:47 -0000 Here's what I'd like to do: - Mount NFS export from filer 'A' - Export that mountpoint to clients via NFS I've already tried it, and it doesn't quite work. FreeBSD allows me to export it (doing tricks like null mounting the NFS mounted directory on a different directory, etc). But when a client mounts it, it has issues. Does anyone know if this is a reasonable problem to solve for FreeBSD, or is it so much work that it isn't worth it? Oh, and please - I understand the implications of doing such a thing, no worries, I still want to. Eric From owner-freebsd-fs@FreeBSD.ORG Tue Jul 17 14:44:28 2007 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 793D416A404 for ; Tue, 17 Jul 2007 14:44:28 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from moe.cs.uoguelph.ca (moe.cs.uoguelph.ca [131.104.94.198]) by mx1.freebsd.org (Postfix) with ESMTP id 20D7713C4BD for ; Tue, 17 Jul 2007 14:44:27 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from muncher.cs.uoguelph.ca (muncher.cs.uoguelph.ca [131.104.96.170]) by moe.cs.uoguelph.ca (8.13.1/8.13.1) with ESMTP id l6HEiNHY028858; Tue, 17 Jul 2007 10:44:23 -0400 Received: from localhost (rmacklem@localhost) by muncher.cs.uoguelph.ca (8.11.7p3+Sun/8.11.6) with ESMTP id l6HEmWu14892; Tue, 17 Jul 2007 10:48:32 -0400 (EDT) X-Authentication-Warning: muncher.cs.uoguelph.ca: rmacklem owned process doing -bs Date: Tue, 17 Jul 2007 10:48:32 -0400 (EDT) From: Rick Macklem X-X-Sender: rmacklem@muncher To: Eric Anderson In-Reply-To: <469CAE7D.8090609@freebsd.org> Message-ID: References: <469CAE7D.8090609@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Scanned-By: MIMEDefang 2.57 on 131.104.94.198 Cc: freebsd-fs@freebsd.org Subject: Re: NFS on NFS? 
X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 Jul 2007 14:44:28 -0000 On Tue, 17 Jul 2007, Eric Anderson wrote: > Here's what I'd like to do: > > - Mount NFS export from filer 'A' > - Export that mountpoint to clients via NFS > > I've already tried it, and it doesn't quite work. FreeBSD allows me to > export it (doing tricks like null mounting the NFS mounted directory on a > different directory, etc). But when a client mounts it, it has issues. > > Does anyone know if this is a reasonable problem to solve for FreeBSD, or is > it so much work that it isn't worth it? > > Oh, and please - I understand the implications of doing such a thing, no > worries, I still want to. > Since this wasn't allowed for NFSv2 and 3 (due to issues such as providing a T stable file handle), clients probably won't handle it well. In general, NFSv2 and 3 clients will get really confused when the fsid or fid changes and break in subtle ways if the file handle is not T stable (refers to that file only, including long after the file is deleted). NFSv4 does allow mount point crossings (fsid to change), but some clients, such as Solaris10 are confused by it. An easier solution might be to write a simple proxy that just forwards the RPC requests/replies to the actual server. rick From owner-freebsd-fs@FreeBSD.ORG Tue Jul 17 15:59:44 2007 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id E76C816A402 for ; Tue, 17 Jul 2007 15:59:44 +0000 (UTC) (envelope-from anderson@freebsd.org) Received: from ns.trinitel.com (186.161.36.72.static.reverse.layeredtech.com [72.36.161.186]) by mx1.freebsd.org (Postfix) with ESMTP id BC65913C4AC for ; Tue, 17 Jul 2007 15:59:44 +0000 (UTC) (envelope-from anderson@freebsd.org) Received: from proton.local (209-163-168-124.static.twtelecom.net [209.163.168.124]) (authenticated bits=0) by ns.trinitel.com (8.14.1/8.14.1) with ESMTP id l6HFxhie056969; Tue, 17 Jul 2007 10:59:44 -0500 (CDT) (envelope-from anderson@freebsd.org) Message-ID: <469CE76F.9040105@freebsd.org> Date: Tue, 17 Jul 2007 10:59:43 -0500 From: Eric Anderson User-Agent: Thunderbird 2.0.0.4 (Macintosh/20070604) MIME-Version: 1.0 To: Rick Macklem References: <469CAE7D.8090609@freebsd.org> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=ham version=3.1.8 X-Spam-Checker-Version: SpamAssassin 3.1.8 (2007-02-13) on ns.trinitel.com Cc: freebsd-fs@freebsd.org Subject: Re: NFS on NFS? X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 Jul 2007 15:59:45 -0000 Rick Macklem wrote: > > > On Tue, 17 Jul 2007, Eric Anderson wrote: > >> Here's what I'd like to do: >> >> - Mount NFS export from filer 'A' >> - Export that mountpoint to clients via NFS >> >> I've already tried it, and it doesn't quite work. FreeBSD allows me >> to export it (doing tricks like null mounting the NFS mounted >> directory on a different directory, etc). But when a client mounts >> it, it has issues. 
>>
>> Does anyone know if this is a reasonable problem to solve for FreeBSD, or is it so much work that it isn't worth it?
>>
>> Oh, and please - I understand the implications of doing such a thing, no worries, I still want to.
>>
> Since this wasn't allowed for NFSv2 and 3 (due to issues such as providing a T stable file handle), clients probably won't handle it well. In general, NFSv2 and 3 clients will get really confused when the fsid or fid changes and break in subtle ways if the file handle is not T stable (refers to that file only, including long after the file is deleted).

Is that really true? It looked like the NFS handle was created by various file system goo, which could come up again some time in the future. For instance, fill a file system's inode table, rm all the files, then do it again (with different data in the files). Wouldn't the NFS handle look the same to the client then, but be a different file? Or when we say 'file' do we mean 'inode' on a file system?

Also, by 'T stable', does 'T' mean 'time' here?

I'm not certain I completely understand why the clients would get confused. Wouldn't it look something like this:

[File system->NFS server->NFS handle]
        |
        V
[NFS client->virtual file system->NFS server->NFS handle2]
        |
        V
[NFS Client->virtual file system->application]

> NFSv4 does allow mount point crossings (fsid to change), but some clients, such as Solaris 10, are confused by it.
>
> An easier solution might be to write a simple proxy that just forwards the RPC requests/replies to the actual server.

Thanks,
Eric

From owner-freebsd-fs@FreeBSD.ORG Tue Jul 17 18:36:52 2007 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id B89D916A403 for ; Tue, 17 Jul 2007 18:36:52 +0000 (UTC) (envelope-from jaharkes@cs.cmu.edu) Received: from delft.aura.cs.cmu.edu (DELFT.AURA.CS.CMU.EDU [128.2.206.88]) by mx1.freebsd.org (Postfix) with ESMTP id 7E21113C4AC for ; Tue, 17 Jul 2007 18:36:52 +0000 (UTC) (envelope-from jaharkes@cs.cmu.edu) Received: from jaharkes by delft.aura.cs.cmu.edu with local (Exim 4.67) (envelope-from ) id 1IArul-0004Yx-O3 for freebsd-fs@freebsd.org; Tue, 17 Jul 2007 14:36:51 -0400 Date: Tue, 17 Jul 2007 14:36:51 -0400 To: freebsd-fs@freebsd.org Message-ID: <20070717183651.GA16599@delft.aura.cs.cmu.edu> Mail-Followup-To: freebsd-fs@freebsd.org References: <2c84c1de0707060800t21f3f993mfb53f7975a881ed4@mail.gmail.com> <1184090521301-git-send-email-jaharkes@cs.cmu.edu> <20070711223527.S97304@fledge.watson.org> <20070711223517.GH5824@delft.aura.cs.cmu.edu> <4695989B.7020200@freebsd.org> <20070712034033.GO5824@delft.aura.cs.cmu.edu> <20070712124134.G27319@fledge.watson.org> <20070712143103.GR5824@delft.aura.cs.cmu.edu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070712143103.GR5824@delft.aura.cs.cmu.edu> User-Agent: Mutt/1.5.13 (2006-08-11) From: Jan Harkes Subject: Re: [PATCH Coda 0/5] X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems X-List-Received-Date: Tue, 17 Jul 2007 18:36:52 -0000

On Thu, Jul 12, 2007 at 10:31:03AM -0400, Jan Harkes wrote:
> On Thu, Jul 12, 2007 at 01:11:03PM +0100, Robert Watson wrote:
> > When I killed venus and restarted it, then the system hung:
> ...
> > 13:49:59 starting FSDB scan (4166, 100000) (25, 75, 4) > > 13:49:59 2 cache files in table (0 blocks) ... > of surprised this actually managed to wedge the system. I wonder if this > has to do with that bit of code where we used to pass a NULL vfs mount. > > -/* cp = make_coda_node(&ctlfid, vfsp, VCHR); > - The above code seems to cause a loop in the cnode links. > - I don't totally understand when it happens, it is caught > - when closing down the system. > - */ > - cp = make_coda_node(&ctlfid, 0, VCHR); > - > + cp = make_coda_node(&ctlfid, vfsp, VCHR); And sure enough, we never released the vnode that we allocated during the mount. And during the unmount it is marked as S_UNMOUNTING or something similar. Then when we remount the file system later on we get stuck, probably waiting for the old vnode with the S_UNMOUNTING flag to disappear. I'm not confident enough to really clean up the coda_ctlvp handling right now. I think it can be allocated only when we actually need it, in lookup and/or vget. But the following patch fixes the hang when I restart venus. Jan -------------------------------------------------------------------- commit 6b860bfa813d1f56925eb37331c8d2dc48faf020 Author: Jan Harkes Date: Thu Jul 12 14:34:51 2007 -0400 Make sure we release the control vnode. We allocate coda_ctlvp when /coda is mounted, but never release it. During the unmount this vnode was marked as UNMOUNTING and when venus is started a second time the system would hang, possibly waiting for the old vnode to disappear. So now we call vrele on the control vnode when file system is unmounted to drop the reference we got during the mount. I'm pretty sure it is also necessary to not skip the handling in coda_inactive for the control vnode, it seems like that is the place we actually get rid of the vnode once the refcount has dropped to 0. 
diff --git a/coda_vfsops.c b/coda_vfsops.c index df6f3f9..db7c11e 100644 --- a/coda_vfsops.c +++ b/coda_vfsops.c @@ -227,6 +227,7 @@ coda_unmount(vfsp, mntflags, td) printf("coda_unmount: ROOT: vp %p, cp %p\n", mi->mi_rootvp, VTOC(mi->mi_rootvp)); #endif vrele(mi->mi_rootvp); + vrele(coda_ctlvp); active = coda_kill(vfsp, NOT_DOWNCALL); ASSERT_VOP_LOCKED(mi->mi_rootvp, "coda_unmount"); mi->mi_rootvp->v_vflag &= ~VV_ROOT; diff --git a/coda_vnops.c b/coda_vnops.c index 3639779..a4d7047 100644 --- a/coda_vnops.c +++ b/coda_vnops.c @@ -745,11 +745,6 @@ coda_inactive(struct vop_inactive_args *ap) /* We don't need to send inactive to venus - DCS */ MARK_ENTRY(CODA_INACTIVE_STATS); - if (IS_CTL_VP(vp)) { - MARK_INT_SAT(CODA_INACTIVE_STATS); - return 0; - } - CODADEBUG(CODA_INACTIVE, myprintf(("in inactive, %s, vfsp %p\n", coda_f2s(&cp->c_fid), vp->v_mount));) From owner-freebsd-fs@FreeBSD.ORG Tue Jul 17 18:58:02 2007 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 3334116A402 for ; Tue, 17 Jul 2007 18:58:02 +0000 (UTC) (envelope-from jaharkes@cs.cmu.edu) Received: from delft.aura.cs.cmu.edu (DELFT.AURA.CS.CMU.EDU [128.2.206.88]) by mx1.freebsd.org (Postfix) with ESMTP id 0BCA613C46B for ; Tue, 17 Jul 2007 18:58:01 +0000 (UTC) (envelope-from jaharkes@cs.cmu.edu) Received: from jaharkes by delft.aura.cs.cmu.edu with local (Exim 4.67) (envelope-from ) id 1IAsFF-00053f-Fz for freebsd-fs@freebsd.org; Tue, 17 Jul 2007 14:58:01 -0400 Date: Tue, 17 Jul 2007 14:58:01 -0400 To: freebsd-fs@freebsd.org Message-ID: <20070717185801.GB16599@delft.aura.cs.cmu.edu> Mail-Followup-To: freebsd-fs@freebsd.org References: <2c84c1de0707060800t21f3f993mfb53f7975a881ed4@mail.gmail.com> <1184090521301-git-send-email-jaharkes@cs.cmu.edu> <20070711223527.S97304@fledge.watson.org> <20070711223517.GH5824@delft.aura.cs.cmu.edu> <4695989B.7020200@freebsd.org> <20070712034033.GO5824@delft.aura.cs.cmu.edu> <20070712124134.G27319@fledge.watson.org> <20070712143103.GR5824@delft.aura.cs.cmu.edu> <20070717183651.GA16599@delft.aura.cs.cmu.edu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070717183651.GA16599@delft.aura.cs.cmu.edu> User-Agent: Mutt/1.5.13 (2006-08-11) From: Jan Harkes Subject: Re: [PATCH Coda 0/5] X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 Jul 2007 18:58:02 -0000 On Tue, Jul 17, 2007 at 02:36:51PM -0400, Jan Harkes wrote: Still trying to get my grips on the workflow :) I've submitted this as a problem report, but had some trouble actually attaching the patch. 
Jan From owner-freebsd-fs@FreeBSD.ORG Tue Jul 17 18:59:59 2007 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 2F48616A405; Tue, 17 Jul 2007 18:59:59 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from galileo.cs.uoguelph.ca (galileo.cs.uoguelph.ca [131.104.94.215]) by mx1.freebsd.org (Postfix) with ESMTP id E4F1F13C4D3; Tue, 17 Jul 2007 18:59:58 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from muncher.cs.uoguelph.ca (muncher.cs.uoguelph.ca [131.104.96.170]) by galileo.cs.uoguelph.ca (8.13.1/8.13.1) with ESMTP id l6HIxvut022978; Tue, 17 Jul 2007 14:59:57 -0400 Received: from localhost (rmacklem@localhost) by muncher.cs.uoguelph.ca (8.11.7p3+Sun/8.11.6) with ESMTP id l6HJ43721111; Tue, 17 Jul 2007 15:04:03 -0400 (EDT) X-Authentication-Warning: muncher.cs.uoguelph.ca: rmacklem owned process doing -bs Date: Tue, 17 Jul 2007 15:04:03 -0400 (EDT) From: Rick Macklem X-X-Sender: rmacklem@muncher To: Eric Anderson In-Reply-To: <469CE76F.9040105@freebsd.org> Message-ID: References: <469CAE7D.8090609@freebsd.org> <469CE76F.9040105@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Scanned-By: MIMEDefang 2.57 on 131.104.94.215 Cc: freebsd-fs@freebsd.org Subject: Re: NFS on NFS? X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 Jul 2007 18:59:59 -0000 On Tue, 17 Jul 2007, Eric Anderson wrote: > Rick Macklem wrote: > > Is that really true? It looked like the NFS handle was created by various > file system goo, which could come up again some time in the future. For > instance, file a file systems inode table, rm all the files, do it again > (with different data in the files). Wouldn't the NFS handle look the same to > the client then, but be a different file? Or when we say 'file' do we mean > 'inode' on a file system? > The file handle also has di_gen (the generation #) in it, which is there specifically to prevent the file handle from accidentally referring to a new file with the same i-node #. The server is expected to return ESTALE when a client tries to use a file handle after the file is deleted and this error is returned when the generation# in the file handle is not the same as di_gen in the i-node. (di_gen is incremented each time the i-node is re-used.) File systems that do not have the equivalent of di_gen cannot be exported via NFS correctly (but some people/systems do so anyhow). Ok if the file system is read-only. > Also, by 'T stable', does 'T' mean 'time' here? Yep. Capital T for a looonnngggg time. > I'm not certain I completely understand why the clients would get confused. > Wouldn't it look something like this: > > [File system->NFS server->NFS handle] > | > V > [NFS client->virtual file system->NFS server->NFS handle2] > | > V > [NFS Client->virtual file system->application] > So long as the intermediate server obeys all the rules, it can work: - File Handle is T-stable (recognized as ESTALE after the file is deleted) and still works the same after server reboots, etc. - fsid in getattr remains the same throughout the file system, even after server reboots, etc. 
- handles RPCs in an atomic way, so that they are either done or not (can't leave things half created after a crash) - NFSv2 and v3 clients don't expect servers to maintain any state and don't know the server rebooted. They simply retry the RPC until they get success or failure back from the server. Where these schemes usually break down is when the intermediate server reboots and no longer does the same file handle translations or assigns a new, different fsid to the file system or crosses a mount point boundary and changes the fsid or ??? Like I said, seems like a simple proxy that passes along the RPCs is easier to do. For NFSv3 (not v2) the intermediary can grow the size of the file handle (to a maximum of 64 bytes) so, if the real server creates file handles less than 64 bytes in size, it can add/remove stuff, but... - it then becomes useful for only certain servers - it has to do lots of copying of args, since the size changes rick From owner-freebsd-fs@FreeBSD.ORG Tue Jul 17 20:19:57 2007 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id B1EC316A400 for ; Tue, 17 Jul 2007 20:19:57 +0000 (UTC) (envelope-from anderson@freebsd.org) Received: from ns.trinitel.com (186.161.36.72.static.reverse.layeredtech.com [72.36.161.186]) by mx1.freebsd.org (Postfix) with ESMTP id 91D3813C441 for ; Tue, 17 Jul 2007 20:19:57 +0000 (UTC) (envelope-from anderson@freebsd.org) Received: from proton.local (209-163-168-124.static.twtelecom.net [209.163.168.124]) (authenticated bits=0) by ns.trinitel.com (8.14.1/8.14.1) with ESMTP id l6HKJu0t057510; Tue, 17 Jul 2007 15:19:56 -0500 (CDT) (envelope-from anderson@freebsd.org) Message-ID: <469D246C.1070003@freebsd.org> Date: Tue, 17 Jul 2007 15:19:56 -0500 From: Eric Anderson User-Agent: Thunderbird 2.0.0.4 (Macintosh/20070604) MIME-Version: 1.0 To: Rick Macklem References: <469CAE7D.8090609@freebsd.org> <469CE76F.9040105@freebsd.org> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=ham version=3.1.8 X-Spam-Checker-Version: SpamAssassin 3.1.8 (2007-02-13) on ns.trinitel.com Cc: freebsd-fs@freebsd.org Subject: Re: NFS on NFS? X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 Jul 2007 20:19:57 -0000 Rick Macklem wrote: > > > On Tue, 17 Jul 2007, Eric Anderson wrote: > >> Rick Macklem wrote: >> >> Is that really true? It looked like the NFS handle was created by >> various file system goo, which could come up again some time in the >> future. For instance, file a file systems inode table, rm all the >> files, do it again (with different data in the files). Wouldn't the >> NFS handle look the same to the client then, but be a different file? >> Or when we say 'file' do we mean 'inode' on a file system? >> > The file handle also has di_gen (the generation #) in it, which is there > specifically to prevent the file handle from accidentally referring to a > new file with the same i-node #. The server is expected to return ESTALE > when a client tries to use a file handle after the file is deleted and > this error is returned when the generation# in the file handle is not the > same as di_gen in the i-node. 
(di_gen is incremented each time the i-node is re-used.) File systems that do not have the equivalent of di_gen cannot be exported via NFS correctly (but some people/systems do so anyhow). Ok if the file system is read-only.

I see. That clears it up a bit.

>> Also, by 'T stable', does 'T' mean 'time' here?
> Yep. Capital T for a looonnngggg time.
>
>> I'm not certain I completely understand why the clients would get confused. Wouldn't it look something like this:
>>
>> [File system->NFS server->NFS handle]
>>         |
>>         V
>> [NFS client->virtual file system->NFS server->NFS handle2]
>>         |
>>         V
>> [NFS Client->virtual file system->application]
>>
> So long as the intermediate server obeys all the rules, it can work:
> - File Handle is T-stable (recognized as ESTALE after the file is deleted) and still works the same after server reboots, etc.
> - fsid in getattr remains the same throughout the file system, even after server reboots, etc.
> - handles RPCs in an atomic way, so that they are either done or not (can't leave things half created after a crash)
> - NFSv2 and v3 clients don't expect servers to maintain any state and don't know the server rebooted. They simply retry the RPC until they get success or failure back from the server.
>
> Where these schemes usually break down is when the intermediate server reboots and no longer does the same file handle translations or assigns a new, different fsid to the file system or crosses a mount point boundary and changes the fsid or ???

I see the point.

> Like I said, seems like a simple proxy that passes along the RPCs is easier to do. For NFSv3 (not v2) the intermediary can grow the size of the file handle (to a maximum of 64 bytes) so, if the real server creates file handles less than 64 bytes in size, it can add/remove stuff, but...

Ok, I understand, and see the utility.

> - it then becomes useful for only certain servers

Why? Because some servers implement large NFS handles? I've only ever seen 32 bytes, but..

> - it has to do lots of copying of args, since the size changes

You mean because you have to map the server's info to your new handle? Or am I missing something?

Thanks for the info. (Is there a good doc on this, besides an RFC?)

Eric
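[Editorial note: a small experiment makes the generation-number argument above concrete. This is a sketch against a UFS file system on the NFS server; the inode-number reuse shown in the comments is typical of UFS allocation but not guaranteed:

# on the NFS server, inside an exported UFS file system
echo old > /export/a
ls -i /export/a        # say this reports inode 12345
# a client that has looked up /export/a now caches a handle built
# roughly from (fsid, inode number 12345, generation G)
rm /export/a
echo new > /export/b   # UFS will often hand back inode 12345 here,
ls -i /export/b        # but with generation G+1 stored in the i-node
# the client's old handle no longer matches the on-disk generation,
# so the server answers ESTALE instead of serving /export/b's data

Without the generation number, the stale handle would silently resolve to the new, unrelated file, which is exactly the confusion Rick describes.]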
From owner-freebsd-fs@FreeBSD.ORG Tue Jul 17 21:09:24 2007 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id CD8C816A40B; Tue, 17 Jul 2007 21:09:24 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from dargo.cs.uoguelph.ca (dargo.cs.uoguelph.ca [131.104.94.197]) by mx1.freebsd.org (Postfix) with ESMTP id 71F2313C4B3; Tue, 17 Jul 2007 21:09:24 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from muncher.cs.uoguelph.ca (muncher.cs.uoguelph.ca [131.104.96.170]) by dargo.cs.uoguelph.ca (8.13.1/8.13.1) with ESMTP id l6HL9Muk004217; Tue, 17 Jul 2007 17:09:22 -0400 Received: from localhost (rmacklem@localhost) by muncher.cs.uoguelph.ca (8.11.7p3+Sun/8.11.6) with ESMTP id l6HLDUA09939; Tue, 17 Jul 2007 17:13:30 -0400 (EDT) X-Authentication-Warning: muncher.cs.uoguelph.ca: rmacklem owned process doing -bs Date: Tue, 17 Jul 2007 17:13:30 -0400 (EDT) From: Rick Macklem X-X-Sender: rmacklem@muncher To: Eric Anderson In-Reply-To: <469D246C.1070003@freebsd.org> Message-ID: References: <469CAE7D.8090609@freebsd.org> <469CE76F.9040105@freebsd.org> <469D246C.1070003@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Scanned-By: MIMEDefang 2.57 on 131.104.94.197 Cc: freebsd-fs@freebsd.org Subject: Re: NFS on NFS? X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems X-List-Received-Date: Tue, 17 Jul 2007 21:09:24 -0000

On Tue, 17 Jul 2007, Eric Anderson wrote:
>> On Tue, 17 Jul 2007, Eric Anderson wrote:
>>
>> - it then becomes useful for only certain servers
>
> Why? Because some servers implement large NFS handles? I've only ever seen 32 bytes, but..
>
Good point. Although an NFSv3 file handle is variable-size up to 64 bytes, an NFSv2 file handle is always 32 bytes, so I bet most servers generate <= 32 byte file handles, since they still support v2.

>> - it has to do lots of copying of args, since the size changes
>
> You mean because you have to map the server's info to your new handle? Or am I missing something?
>
Because it is stored in the RPC request as [length][bytes rounded up to next multiple of 4][next arg]. If you grow the file handle, then everything after it shifts over. If it is still in an mbuf list, you can probably finagle it, but otherwise it's copying all the data.

> Thanks for the info. (Is there a good doc on this, besides an RFC?)
>
Some Usenix papers, mostly 1985 (the original Sun NFS one) through the early 1990s. Mike Eisler wrote a nutshell book, but I'm not sure how much he talked about the protocol in it.

rick

From owner-freebsd-fs@FreeBSD.ORG Thu Jul 19 10:45:59 2007 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id BFD3716A400 for ; Thu, 19 Jul 2007 10:45:59 +0000 (UTC) (envelope-from M.S.Powell@salford.ac.uk) Received: from akis.salford.ac.uk (akis.salford.ac.uk [146.87.0.14]) by mx1.freebsd.org (Postfix) with SMTP id 40BEA13C491 for ; Thu, 19 Jul 2007 10:45:53 +0000 (UTC) (envelope-from M.S.Powell@salford.ac.uk) Received: (qmail 4947 invoked by uid 98); 19 Jul 2007 11:19:11 +0100 Received: from 146.87.255.121 by akis.salford.ac.uk (envelope-from , uid 401) with qmail-scanner-2.01 (clamdscan: 0.90/3697.
spamassassin: 3.1.8. Clear:RC:1(146.87.255.121):. Processed in 0.043798 secs); 19 Jul 2007 10:19:11 -0000 Received: from rust.salford.ac.uk (HELO rust.salford.ac.uk) (146.87.255.121) by akis.salford.ac.uk (qpsmtpd/0.3x.614) with SMTP; Thu, 19 Jul 2007 11:19:11 +0100 Received: (qmail 1880 invoked by uid 1002); 19 Jul 2007 10:19:08 -0000 Received: from localhost (sendmail-bs@127.0.0.1) by localhost with SMTP; 19 Jul 2007 10:19:08 -0000 Date: Thu, 19 Jul 2007 11:19:08 +0100 (BST) From: "Mark Powell" To: freebsd-fs@freebsd.org Message-ID: <20070719102302.R1534@rust.salford.ac.uk> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Subject: ZFS & GEOM with many odd drive sizes X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems X-List-Received-Date: Thu, 19 Jul 2007 10:45:59 -0000

Hi,
I'd like to experiment with ZFS. To that end I'd like to get a running array from a rather ad hoc collection of old drives:

3x250GB
3x200GB
1x400GB

I planned to arrange them in 3 pairs of 250+200. Therefore I'd end up with an effective 4 drives:

3x450GB
1x400GB

I'd gmirror to make a small 2GB root and swap from the extra 50GB on the 3 pairs, then gconcat to join the remaining 448GB from each pair into a volume. Apparently root is possible on ZFS with a small ufs to boot from:

http://wiki.freebsd.org/ZFSOnRoot

Then make a zfs raidz from the 3x448+1x400, effectively giving a zpool of 1200GB real storage. 3x48GB will not be accessible now, as the last volume will only be the 400GB on the last drive.

I want to be able to increase the size of this volume later, by replacing drives when they fail or when it becomes economical to do so. I know removing a volume from a zpool and replacing it with a larger one is possible. The zpool will self-heal the data onto the new volume. Eventually, when the final volume is replaced by a larger one, the extra space becomes available for use. That's correct, right?

What I want to know is: does the new volume have to be the same actual device name, or can it be substituted with another? I.e. can I remove, for example, one of the 448GB gconcats, e.g. gc1, and replace it with a new 750GB drive, e.g. ad6? Eventually, once all volumes are replaced, the zpool could be, for example, 4x750GB or 2.25TB of usable storage.

Many thanks for any advice on these matters which are new to me.

--
Mark Powell - UNIX System Administrator - The University of Salford
Information Services Division, Clifford Whitworth Building,
Salford University, Manchester, M5 4WT, UK.
Tel: +44 161 295 4837  Fax: +44 161 295 5888  www.pgp.com for PGP key
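[Editorial note: a sketch of the layout Mark describes, with invented device names (pairs ad4+ad6, ad8+ad10, ad12+ad14, plus ad16 as the single 400GB disk; ad18 as a hypothetical future replacement). The 2GB gmirror root/swap and the partitioning that carves the 448GB out of each pair are omitted, so whole disks are concatenated here for brevity:

# join each 250GB+200GB pair into one ~450GB provider
gconcat label -v pair0 ad4 ad6
gconcat label -v pair1 ad8 ad10
gconcat label -v pair2 ad12 ad14
# raidz across the three concats and the bare 400GB disk;
# usable space is (4-1) x 400GB = 1200GB, set by the smallest member
zpool create tank raidz concat/pair0 concat/pair1 concat/pair2 ad16
# later, any member can be swapped for a larger provider by name,
# which answers the device-name question in the mail above:
zpool replace tank concat/pair0 ad18

Once every member has been replaced with a larger one, the extra capacity becomes usable, which is the behaviour Pawel confirms in the follow-up.]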
Tel: +44 161 295 4837 Fax: +44 161 295 5888 www.pgp.com for PGP key From owner-freebsd-fs@FreeBSD.ORG Thu Jul 19 13:55:53 2007 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 82A5016A402 for ; Thu, 19 Jul 2007 13:55:53 +0000 (UTC) (envelope-from pjd@garage.freebsd.pl) Received: from mail.garage.freebsd.pl (arm132.internetdsl.tpnet.pl [83.17.198.132]) by mx1.freebsd.org (Postfix) with ESMTP id BC2F113C4B2 for ; Thu, 19 Jul 2007 13:55:52 +0000 (UTC) (envelope-from pjd@garage.freebsd.pl) Received: by mail.garage.freebsd.pl (Postfix, from userid 65534) id 3C5DC4880D; Thu, 19 Jul 2007 15:55:49 +0200 (CEST) Received: from localhost (ana50.internetdsl.tpnet.pl [83.17.82.50]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.garage.freebsd.pl (Postfix) with ESMTP id EA5C345B26; Thu, 19 Jul 2007 15:55:35 +0200 (CEST) Date: Thu, 19 Jul 2007 15:55:10 +0200 From: Pawel Jakub Dawidek To: Mark Powell Message-ID: <20070719135510.GE1194@garage.freebsd.pl> References: <20070719102302.R1534@rust.salford.ac.uk> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="JkW1gnuWHDypiMFO" Content-Disposition: inline In-Reply-To: <20070719102302.R1534@rust.salford.ac.uk> User-Agent: Mutt/1.4.2.3i X-PGP-Key-URL: http://people.freebsd.org/~pjd/pjd.asc X-OS: FreeBSD 7.0-CURRENT i386 X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on mail.garage.freebsd.pl X-Spam-Level: X-Spam-Status: No, score=-5.9 required=3.0 tests=ALL_TRUSTED,BAYES_00 autolearn=ham version=3.0.4 Cc: freebsd-fs@freebsd.org Subject: Re: ZfS & GEOM with many odd drive sizes X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Jul 2007 13:55:53 -0000 --JkW1gnuWHDypiMFO Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Jul 19, 2007 at 11:19:08AM +0100, Mark Powell wrote: > Hi, > I'd like to experiment with ZFS. > To that end I'd like to get a running array from a rather ad hoc=20 > collection of old drives. >=20 > 3x250GB > 3x200GB > 1x400GB >=20 >=20 > I planned to arrange them in 3 pairs of of 250+200. Therefore I'd end up= =20 > with an effective 4 drives: >=20 > 3x450GB > 1x400GB >=20 > I'd gmirror to make a small 2GB root and swap from the extra 50GB on the = 3=20 > pairs. Then gconcat to join the remaining 448GB from each pair into a=20 > volume. Apparently root is possible on ZFS with a small ufs to boot from: >=20 > http://wiki.freebsd.org/ZFSOnRoot >=20 > Then make a zfs raidz from the 3x448+1x400. Effectively giving a zpool= =20 > of 1200GB real storage. 3x48GB will not be accessible now as the last=20 > volume will only be the 400GB on the last drive. > I want to be able to increase the size of this volume later, by=20 > replacing drives when they fail, or it becomes economical to do so. > I know removing a volume from a zpool and replacing it with a larger on= e=20 > is possible. The zpool will self-heal the data onto the new volume.=20 > Eventually when the final volume is replaced by a larger one the extra=20 > space becomes available for use. That's correct right? 
> What I want to know is: does the new volume have to be the same actual
> device name, or can it be substituted with another?
> I.e. can I remove, for example, one of the 448GB gconcats (e.g. gc1) and
> replace it with a new 750GB drive (e.g. ad6)?
> Eventually, once all volumes are replaced, the zpool could be, for
> example, 4x750GB, or 2.25TB of usable storage.
> Many thanks for any advice on these matters, which are new to me.

All you described above should work.

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!

--JkW1gnuWHDypiMFO Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.4 (FreeBSD) iD8DBQFGn20+ForvXbEpPzQRAgvDAKDLvAs/U/m5JJvpG2JE+cYWdtyyKgCgwz6/ nc0Q4BVYJAEa0G5c7XzkQ/M= =0qNP -----END PGP SIGNATURE----- --JkW1gnuWHDypiMFO--

From owner-freebsd-fs@FreeBSD.ORG Thu Jul 19 17:19:18 2007 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 9D3BD16A400 for ; Thu, 19 Jul 2007 17:19:18 +0000 (UTC) (envelope-from M.S.Powell@salford.ac.uk) Received: from abbe.salford.ac.uk (abbe.salford.ac.uk [146.87.0.10]) by mx1.freebsd.org (Postfix) with SMTP id 165A813C441 for ; Thu, 19 Jul 2007 17:19:17 +0000 (UTC) (envelope-from M.S.Powell@salford.ac.uk) Received: (qmail 66374 invoked by uid 98); 19 Jul 2007 18:19:16 +0100 Received: from 146.87.255.121 by abbe.salford.ac.uk (envelope-from , uid 401) with qmail-scanner-2.01 (clamdscan: 0.90/3700. spamassassin: 3.1.8. Clear:RC:1(146.87.255.121):. Processed in 0.042023 secs); 19 Jul 2007 17:19:16 -0000 Received: from rust.salford.ac.uk (HELO rust.salford.ac.uk) (146.87.255.121) by abbe.salford.ac.uk (qpsmtpd/0.3x.614) with SMTP; Thu, 19 Jul 2007 18:19:16 +0100 Received: (qmail 4940 invoked by uid 1002); 19 Jul 2007 17:19:14 -0000 Received: from localhost (sendmail-bs@127.0.0.1) by localhost with SMTP; 19 Jul 2007 17:19:14 -0000 Date: Thu, 19 Jul 2007 18:19:14 +0100 (BST) From: "Mark Powell" To: Pawel Jakub Dawidek In-Reply-To: <20070719135510.GE1194@garage.freebsd.pl> Message-ID: <20070719181313.G4923@rust.salford.ac.uk> References: <20070719102302.R1534@rust.salford.ac.uk> <20070719135510.GE1194@garage.freebsd.pl> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-fs@freebsd.org Subject: Re: ZfS & GEOM with many odd drive sizes X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Jul 2007 17:19:18 -0000

On Thu, 19 Jul 2007, Pawel Jakub Dawidek wrote:
> On Thu, Jul 19, 2007 at 11:19:08AM +0100, Mark Powell wrote:
>> What I want to know is: does the new volume have to be the same actual
>> device name, or can it be substituted with another?
>> I.e. can I remove, for example, one of the 448GB gconcats (e.g. gc1) and
>> replace it with a new 750GB drive (e.g. ad6)?
>> Eventually, once all volumes are replaced, the zpool could be, for
>> example, 4x750GB, or 2.25TB of usable storage.
>> Many thanks for any advice on these matters, which are new to me.
>
> All you described above should work.

Thanks Pawel, for your response, and even more so for all your time spent working on ZFS.

Should I expect much greater CPU usage with ZFS?
I previously had a geom raid5 array which barely broke a sweat on benchmarks, i.e. simple large dd reads and writes. With ZFS on the same hardware I notice 50-60% system CPU usage is usual during such tests. Before, the network was the bottleneck, but now it's the ZFS array. I expected it would have to do a bit more 'thinking', but is such a dramatic increase normal?

Many thanks again.
-- 
Mark Powell - UNIX System Administrator - The University of Salford
Information Services Division, Clifford Whitworth Building,
Salford University, Manchester, M5 4WT, UK.
Tel: +44 161 295 4837 Fax: +44 161 295 5888 www.pgp.com for PGP key

From owner-freebsd-fs@FreeBSD.ORG Thu Jul 19 19:34:04 2007 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 0BD9716A403 for ; Thu, 19 Jul 2007 19:34:04 +0000 (UTC) (envelope-from M.S.Powell@salford.ac.uk) Received: from abbe.salford.ac.uk (abbe.salford.ac.uk [146.87.0.10]) by mx1.freebsd.org (Postfix) with SMTP id 8266113C428 for ; Thu, 19 Jul 2007 19:34:02 +0000 (UTC) (envelope-from M.S.Powell@salford.ac.uk) Received: (qmail 18975 invoked by uid 98); 19 Jul 2007 20:34:01 +0100 Received: from 146.87.255.121 by abbe.salford.ac.uk (envelope-from , uid 401) with qmail-scanner-2.01 (clamdscan: 0.90/3700. spamassassin: 3.1.8. Clear:RC:1(146.87.255.121):. Processed in 0.056676 secs); 19 Jul 2007 19:34:01 -0000 Received: from rust.salford.ac.uk (HELO rust.salford.ac.uk) (146.87.255.121) by abbe.salford.ac.uk (qpsmtpd/0.3x.614) with SMTP; Thu, 19 Jul 2007 20:34:01 +0100 Received: (qmail 5853 invoked by uid 1002); 19 Jul 2007 19:33:59 -0000 Received: from localhost (sendmail-bs@127.0.0.1) by localhost with SMTP; 19 Jul 2007 19:33:59 -0000 Date: Thu, 19 Jul 2007 20:33:59 +0100 (BST) From: "Mark Powell" To: Doug Rabson In-Reply-To: <200707192027.44025.dfr@rabson.org> Message-ID: <20070719203134.B4923@rust.salford.ac.uk> References: <20070719102302.R1534@rust.salford.ac.uk> <20070719135510.GE1194@garage.freebsd.pl> <20070719181313.G4923@rust.salford.ac.uk> <200707192027.44025.dfr@rabson.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-fs@freebsd.org Subject: Re: ZfS & GEOM with many odd drive sizes X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Jul 2007 19:34:04 -0000

On Thu, 19 Jul 2007, Doug Rabson wrote:
> On Thursday 19 July 2007, Mark Powell wrote:
>> Should I expect much greater CPU usage with ZFS?
>> I previously had a geom raid5 array which barely broke a sweat on
>> benchmarks, i.e. simple large dd reads and writes. With ZFS on the same
>> hardware I notice 50-60% system CPU usage is usual during such tests.
>> Before, the network was the bottleneck, but now it's the ZFS array. I
>> expected it would have to do a bit more 'thinking', but is such a
>> dramatic increase normal?
>>
>> Many thanks again.
>
> ZFS does a checksum on every block it reads from the disk which may be
> your problem. In normal usage, this isn't a big deal because many
> reads get data from the cache.

I've turned off checksums, but still my machine is struggling. I think my Athlon XP is a little old for all this work :( Any other tips for speeding ZFS up?
Cheers.
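A minimal sketch of the kind of dd benchmark being discussed, assuming a pool mounted at /tank (pool name, file name and sizes here are hypothetical, not from the thread):

  # write test: stream 4GB of zeros into the pool
  dd if=/dev/zero of=/tank/ddtest bs=1m count=4096
  # read test: stream the file back, discarding the data
  dd if=/tank/ddtest of=/dev/null bs=1m
  # in another terminal, watch system CPU time while the test runs
  top -S

Comparing the system time top reports during the geom raid5 run and the ZFS run on the same hardware gives a rough measure of the extra CPU cost being described.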
-- 
Mark Powell - UNIX System Administrator - The University of Salford
Information Services Division, Clifford Whitworth Building,
Salford University, Manchester, M5 4WT, UK.
Tel: +44 161 295 4837 Fax: +44 161 295 5888 www.pgp.com for PGP key

From owner-freebsd-fs@FreeBSD.ORG Thu Jul 19 19:47:37 2007 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 8BD6016A402 for ; Thu, 19 Jul 2007 19:47:37 +0000 (UTC) (envelope-from dfr@rabson.org) Received: from itchy.rabson.org (mailgate.nlsystems.com [80.177.232.242]) by mx1.freebsd.org (Postfix) with ESMTP id 228BA13C481 for ; Thu, 19 Jul 2007 19:47:36 +0000 (UTC) (envelope-from dfr@rabson.org) Received: from herring.rabson.org (herring.rabson.org [80.177.232.250]) by itchy.rabson.org (8.13.3/8.13.3) with ESMTP id l6JJlZa5053810; Thu, 19 Jul 2007 20:47:35 +0100 (BST) (envelope-from dfr@rabson.org) From: Doug Rabson To: "Mark Powell" Date: Thu, 19 Jul 2007 20:47:34 +0100 User-Agent: KMail/1.9.6 References: <20070719102302.R1534@rust.salford.ac.uk> <200707192027.44025.dfr@rabson.org> <20070719203134.B4923@rust.salford.ac.uk> (sfid-20070719_20342_28B5762E) In-Reply-To: <20070719203134.B4923@rust.salford.ac.uk> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200707192047.34979.dfr@rabson.org> X-Virus-Scanned: ClamAV 0.87.1/3700/Thu Jul 19 14:13:47 2007 on itchy.rabson.org X-Virus-Status: Clean Cc: freebsd-fs@freebsd.org Subject: Re: UNS: Re: ZfS & GEOM with many odd drive sizes X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Jul 2007 19:47:37 -0000

On Thursday 19 July 2007, Mark Powell wrote:
> On Thu, 19 Jul 2007, Doug Rabson wrote:
> > On Thursday 19 July 2007, Mark Powell wrote:
> >> Should I expect much greater CPU usage with ZFS?
> >> I previously had a geom raid5 array which barely broke a sweat
> >> on benchmarks, i.e. simple large dd reads and writes. With ZFS on the
> >> same hardware I notice 50-60% system CPU usage is usual during
> >> such tests. Before, the network was the bottleneck, but now it's the
> >> ZFS array. I expected it would have to do a bit more 'thinking',
> >> but is such a dramatic increase normal?
> >>
> >> Many thanks again.
> >
> > ZFS does a checksum on every block it reads from the disk which may
> > be your problem. In normal usage, this isn't a big deal because
> > many reads get data from the cache.
>
> I've turned off checksums, but still my machine is struggling. I
> think my Athlon XP is a little old for all this work :( Any other
> tips for speeding ZFS up?
> Cheers.

Nothing really comes to mind. You could try simpler geometries (e.g. mirrors or collections of mirrors). Having at least some of your drives in a simple configuration might be useful - I'm working on ZFS boot code at the moment and I don't intend to support raidz or raidz2 (at least to start with). Collections of mirrors and simple disks are much easier.
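As a sketch of the simpler geometry suggested above, a pool built from mirror pairs rather than raidz might look like this (pool name and device names hypothetical):

  # two two-way mirrors striped together into one pool
  zpool create tank mirror ad4 ad6 mirror ad8 ad10
  # confirm the layout
  zpool status tank

Each mirror vdev can later be grown independently by replacing both of its disks, which is simpler to reason about (and, per the above, to boot from) than a raidz built on gconcats.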
From owner-freebsd-fs@FreeBSD.ORG Thu Jul 19 20:07:07 2007 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id BD45416A400 for ; Thu, 19 Jul 2007 20:07:07 +0000 (UTC) (envelope-from dfr@rabson.org) Received: from itchy.rabson.org (mailgate.nlsystems.com [80.177.232.242]) by mx1.freebsd.org (Postfix) with ESMTP id 45C1C13C4A6 for ; Thu, 19 Jul 2007 20:07:07 +0000 (UTC) (envelope-from dfr@rabson.org) Received: from herring.rabson.org (herring.rabson.org [80.177.232.250]) by itchy.rabson.org (8.13.3/8.13.3) with ESMTP id l6JJRiD3053701; Thu, 19 Jul 2007 20:27:45 +0100 (BST) (envelope-from dfr@rabson.org) From: Doug Rabson To: freebsd-fs@freebsd.org Date: Thu, 19 Jul 2007 20:27:43 +0100 User-Agent: KMail/1.9.6 References: <20070719102302.R1534@rust.salford.ac.uk> <20070719135510.GE1194@garage.freebsd.pl> <20070719181313.G4923@rust.salford.ac.uk> In-Reply-To: <20070719181313.G4923@rust.salford.ac.uk> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200707192027.44025.dfr@rabson.org> X-Virus-Scanned: ClamAV 0.87.1/3700/Thu Jul 19 14:13:47 2007 on itchy.rabson.org X-Virus-Status: Clean Cc: Pawel Jakub Dawidek , Mark Powell Subject: Re: ZfS & GEOM with many odd drive sizes X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Jul 2007 20:07:07 -0000

On Thursday 19 July 2007, Mark Powell wrote:
> On Thu, 19 Jul 2007, Pawel Jakub Dawidek wrote:
> > On Thu, Jul 19, 2007 at 11:19:08AM +0100, Mark Powell wrote:
> >> What I want to know is: does the new volume have to be the same
> >> actual device name, or can it be substituted with another?
> >> I.e. can I remove, for example, one of the 448GB gconcats (e.g.
> >> gc1) and replace it with a new 750GB drive (e.g. ad6)?
> >> Eventually, once all volumes are replaced, the zpool could
> >> be, for example, 4x750GB, or 2.25TB of usable storage.
> >> Many thanks for any advice on these matters, which are new to me.
> >
> > All you described above should work.
>
> Thanks Pawel, for your response, and even more so for all your time spent
> working on ZFS.
>
> Should I expect much greater CPU usage with ZFS?
> I previously had a geom raid5 array which barely broke a sweat on
> benchmarks, i.e. simple large dd reads and writes. With ZFS on the same
> hardware I notice 50-60% system CPU usage is usual during such tests.
> Before, the network was the bottleneck, but now it's the ZFS array. I
> expected it would have to do a bit more 'thinking', but is such a
> dramatic increase normal?
>
> Many thanks again.

ZFS does a checksum on every block it reads from the disk which may be your problem. In normal usage, this isn't a big deal because many reads get data from the cache.
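For testing how much of the CPU load is checksumming, the checksum property can be toggled per dataset (pool name hypothetical); note that it only affects newly written blocks, and that disabling it gives up ZFS's protection against silent corruption:

  zfs set checksum=off tank
  zfs get checksum tank
  # re-enable once the test is done
  zfs set checksum=on tank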
From owner-freebsd-fs@FreeBSD.ORG Fri Jul 20 20:17:26 2007 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B87F016A417 for ; Fri, 20 Jul 2007 20:17:26 +0000 (UTC) (envelope-from bounces@nabble.com) Received: from kuber.nabble.com (kuber.nabble.com [216.139.236.158]) by mx1.freebsd.org (Postfix) with ESMTP id 9BA8D13C442 for ; Fri, 20 Jul 2007 20:17:26 +0000 (UTC) (envelope-from bounces@nabble.com) Received: from isper.nabble.com ([192.168.236.156]) by kuber.nabble.com with esmtp (Exim 4.63) (envelope-from ) id 1IBybW-00009I-U0 for freebsd-fs@freebsd.org; Fri, 20 Jul 2007 12:57:34 -0700 Message-ID: <11714958.post@talk.nabble.com> Date: Fri, 20 Jul 2007 12:57:34 -0700 (PDT) From: NostalgiaForInfinity To: freebsd-fs@freebsd.org In-Reply-To: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Nabble-From: astuy@bio.fsu.edu References: Subject: Re: Filesystems larger than 2TB? X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 20 Jul 2007 20:17:26 -0000

Ivan Voras-2 wrote:
>
> Francisco Reyes wrote:
>> In many postings I have seen references to filesystems greater than 2TB,
>> yet I have tried several times to create them and have had problems.
>>
>> Is there a way to create slices and filesystems greater than 2TB in 6.2?
>> Perhaps one needs to do it outside sysinstall?
>
> Yes, you need to do it outside of sysinstall. There are two ways:
>
> 1. don't use partitions/slices at all and create the file system on the
> raw device (i.e. newfs /dev/da0)
> 2. use GPT partitions.
>
> The first one is recommended in 6.x.
>

Thanks, #1 worked for me. I had two 3ware RAID 5 arrays way over 2TB. I did "newfs" on each (i.e. newfs /dev/da0) and the result is:

sigma# uname -a
FreeBSD sigma.bio.fsu.edu 6.2-RELEASE FreeBSD 6.2-RELEASE #0: Fri Jan 12 08:43:30 UTC 2007 root@portnoy.cse.buffalo.edu:/usr/obj/usr/src/sys/SMP amd64
sigma# df -h
Filesystem     Size    Used   Avail Capacity  Mounted on
/dev/da1s1a    989M     40M    870M     4%    /
devfs          1.0K    1.0K      0B   100%    /dev
/dev/da1s1f    4.8G     12K    4.5G     0%    /tmp
/dev/da1s1h     11G     37M     10G     0%    /users
/dev/da1s1d     24G    753M     22G     3%    /usr
/dev/da1s1e     19G    4.0K     18G     0%    /usr/local/www
/dev/da1s1g    7.7G    280K    7.1G     0%    /var
/dev/da0       6.6T    2.5G    6.1T     0%    /data0
/dev/da2       4.6T    1.6G    4.3T     0%    /data2

-- 
View this message in context: http://www.nabble.com/Filesystems-larger-than-2TB--tf3895966.html#a11714958
Sent from the freebsd-fs mailing list archive at Nabble.com.
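For completeness, a sketch of the GPT route (option 2 above), assuming the gpt(8) utility available on 6.x and a hypothetical disk da3:

  gpt create da3        # write a fresh GPT onto the disk
  gpt add -t ufs da3    # one UFS partition spanning the free space -> /dev/da3p1
  newfs /dev/da3p1

Unlike fdisk/MBR slices, GPT partitions are not limited to 2TB, so this route also works for large arrays.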
From owner-freebsd-fs@FreeBSD.ORG Fri Jul 20 21:33:00 2007 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 133D016A41A for ; Fri, 20 Jul 2007 21:33:00 +0000 (UTC) (envelope-from anderson@freebsd.org) Received: from ns.trinitel.com (186.161.36.72.static.reverse.layeredtech.com [72.36.161.186]) by mx1.freebsd.org (Postfix) with ESMTP id D899D13C46A for ; Fri, 20 Jul 2007 21:32:59 +0000 (UTC) (envelope-from anderson@freebsd.org) Received: from proton.local (209-163-168-124.static.twtelecom.net [209.163.168.124]) (authenticated bits=0) by ns.trinitel.com (8.14.1/8.14.1) with ESMTP id l6KLWuxO024538; Fri, 20 Jul 2007 16:32:56 -0500 (CDT) (envelope-from anderson@freebsd.org) Message-ID: <46A12A07.6070503@freebsd.org> Date: Fri, 20 Jul 2007 16:32:55 -0500 From: Eric Anderson User-Agent: Thunderbird 2.0.0.5 (Macintosh/20070716) MIME-Version: 1.0 To: NostalgiaForInfinity References: <11714958.post@talk.nabble.com> In-Reply-To: <11714958.post@talk.nabble.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=ham version=3.1.8 X-Spam-Checker-Version: SpamAssassin 3.1.8 (2007-02-13) on ns.trinitel.com Cc: freebsd-fs@freebsd.org Subject: Re: Filesystems larger than 2TB? X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 20 Jul 2007 21:33:00 -0000

NostalgiaForInfinity wrote:
>
> Ivan Voras-2 wrote:
>> Francisco Reyes wrote:
>>> In many postings I have seen references to filesystems greater than 2TB,
>>> yet I have tried several times to create them and have had problems.
>>>
>>> Is there a way to create slices and filesystems greater than 2TB in 6.2?
>>> Perhaps one needs to do it outside sysinstall?
>> Yes, you need to do it outside of sysinstall. There are two ways:
>>
>> 1. don't use partitions/slices at all and create the file system on the
>> raw device (i.e. newfs /dev/da0)
>> 2. use GPT partitions.
>>
>> The first one is recommended in 6.x.
>>
>
> Thanks, #1 worked for me. I had two 3ware RAID 5 arrays way over 2TB. I
> did "newfs" on each (i.e. newfs /dev/da0) and the result is:
>
> sigma# uname -a
> FreeBSD sigma.bio.fsu.edu 6.2-RELEASE FreeBSD 6.2-RELEASE #0: Fri Jan 12
> 08:43:30 UTC 2007 root@portnoy.cse.buffalo.edu:/usr/obj/usr/src/sys/SMP
> amd64
> sigma# df -h
> Filesystem     Size    Used   Avail Capacity  Mounted on
> /dev/da1s1a    989M     40M    870M     4%    /
> devfs          1.0K    1.0K      0B   100%    /dev
> /dev/da1s1f    4.8G     12K    4.5G     0%    /tmp
> /dev/da1s1h     11G     37M     10G     0%    /users
> /dev/da1s1d     24G    753M     22G     3%    /usr
> /dev/da1s1e     19G    4.0K     18G     0%    /usr/local/www
> /dev/da1s1g    7.7G    280K    7.1G     0%    /var
> /dev/da0       6.6T    2.5G    6.1T     0%    /data0
> /dev/da2       4.6T    1.6G    4.3T     0%    /data2

Did you not want soft updates enabled? You need to specify it with the -U switch.
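For example, soft updates can be enabled at newfs time, or later with tunefs while the filesystem is unmounted (device name hypothetical):

  newfs -U /dev/da0
  # or, on an existing filesystem:
  tunefs -n enable /dev/da0
  tunefs -p /dev/da0    # print the current settings to verify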
Eric

From owner-freebsd-fs@FreeBSD.ORG Sat Jul 21 06:35:04 2007 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 0BAB416A419; Sat, 21 Jul 2007 06:35:04 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from relay02.kiev.sovam.com (relay02.kiev.sovam.com [62.64.120.197]) by mx1.freebsd.org (Postfix) with ESMTP id 97A7713C458; Sat, 21 Jul 2007 06:35:03 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from [89.162.146.170] (helo=skuns.kiev.zoral.com.ua) by relay02.kiev.sovam.com with esmtps (TLSv1:AES256-SHA:256) (Exim 4.67) (envelope-from ) id 1IC8YM-000JiN-Nn; Sat, 21 Jul 2007 09:35:01 +0300 Received: from deviant.kiev.zoral.com.ua (root@[10.1.1.148]) by skuns.kiev.zoral.com.ua (8.14.1/8.14.1) with ESMTP id l6L6Yb9N081612 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sat, 21 Jul 2007 09:34:37 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.1/8.14.1) with ESMTP id l6L6Yaeo050522; Sat, 21 Jul 2007 09:34:36 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.1/8.14.1/Submit) id l6L6YZee050521; Sat, 21 Jul 2007 09:34:35 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Sat, 21 Jul 2007 09:34:34 +0300 From: Kostik Belousov To: Bruce Evans Message-ID: <20070721063434.GI2200@deviant.kiev.zoral.com.ua> References: <20070710233455.O2101@besplex.bde.org> <20070712084115.GA2200@deviant.kiev.zoral.com.ua> <20070712225324.F9515@besplex.bde.org> <20070712142127.GD2200@deviant.kiev.zoral.com.ua> <20070716195556.P12807@besplex.bde.org> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="12mfjY/2IcLDkfPj" Content-Disposition: inline In-Reply-To: <20070716195556.P12807@besplex.bde.org> User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: ClamAV version 0.90.3, clamav-milter version 0.90.3 on skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-1.4 required=5.0 tests=ALL_TRUSTED autolearn=failed version=3.2.1 X-Spam-Checker-Version: SpamAssassin 3.2.1 (2007-05-02) on skuns.kiev.zoral.com.ua X-Scanner-Signature: e7d6a74e6a41bfc4865bf8c76b21c35c X-DrWeb-checked: yes X-SpamTest-Envelope-From: kostikbel@gmail.com X-SpamTest-Group-ID: 00000000 X-SpamTest-Header: Not Detected X-SpamTest-Info: Profiles 1263 [July 20 2007] X-SpamTest-Info: helo_type=3 X-SpamTest-Method: none X-SpamTest-Rate: 0 X-SpamTest-Status: Not detected X-SpamTest-Status-Extended: not_detected X-SpamTest-Version: SMTP-Filter Version 3.0.0 [0255], KAS30/Release Cc: bugs@freebsd.org, fs@freebsd.org Subject: Re: msdosfs not MPSAFE X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 21 Jul 2007 06:35:04 -0000

--12mfjY/2IcLDkfPj Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

On Mon, Jul 16, 2007 at 08:18:14PM +1000, Bruce Evans wrote:
> On Thu, 12 Jul 2007, Kostik Belousov wrote:
>
> >On Thu, Jul 12, 2007 at 11:33:40PM +1000, Bruce Evans wrote:
> >>
> >>On Thu, 12 Jul 2007, Kostik Belousov wrote:
> >>
> >>>On Wed, Jul 11, 2007 at 12:08:19AM +1000, Bruce Evans wrote:
> >>>>msdosfs has been broken since Giant locking for file systems (or
> >>>>syscalls) was removed. It allows multiple threads to race accessing
> >>>>the shared static buffer `nambuf' and related variables. This causes
> >>>>remarkably
> >>
> >>>It seems that msdosfs_lookup() can sleep, thus Giant protection would be
> >>>lost.
> >>
> >>It can certainly block in bread().
>
> >Besides bread(), there is a (re)locking for the ".." case, and a deget()
> >call, which itself calls malloc(M_WAITOK), vfs_hash_get(), getnewvnode()
> >and readep(). The latter itself calls bread().
> >
> >This is from a brief look.
>
> I think msdosfs_lookup() doesn't need to own nambuf near the deget()
> call. Not sure -- I was looking more at msdosfs_readdir().
>
> >>How does my adding Giant locking help? I checked that at least in
> >>FreeBSD-~5.2-current, msdosfs_readdir() is already Giant-locked, so my
> >>fix just increments the recursion count. What happens to recursively-
> >>held Giant locks across sleeps? I think they should cause a KASSERT()
> >>failure, but if they are handled by only dropping Giant once then my
> >>fix might sort of work but sleeps would be broken generally.
> >>
> >Look at kern/kern_synch.c:_sleep(). It does DROP_GIANT(), which (from
> >sys/mutex.h) calls mtx_unlock() while Giant is owned.
>
> So it is very mysterious that Giant locking helped. Anyway, it doesn't
> work, and cases where it doesn't help showed up in further testing.
>
> sx xlocking works, but is not quite right:
> % /*
> % + * XXX msdosfs_lookup() is split up because unlocking before all the
> % + * returns in the original function would be too churning.
> % + */
> % +int
> % +msdosfs_lookup(ap)
> % +	struct vop_cachedlookup_args *ap;
> % +{
> % +	int error;
> % +
> % +	sx_xlock(&mbnambuf_lock);
> % +	error = msdosfs_lookup_locked(ap);
> % +	sx_xunlock(&mbnambuf_lock);
> % +	return (error);
> % +}
> % +
> % +/*

Assume that a directory A is participating in lookup() from two threads:
thread 1 looks up A itself;
thread 2 looks up some entry in A.
Then, thread 1 would have mbnambuf_lock locked, and may wait for A's
vnode lock; thread 2 would own the vnode lock for A, then try to lock
mbnambuf_lock.

I do not see what may prevent this LOR scenario from realizing, or what
makes it harmless. Did I miss something?
--12mfjY/2IcLDkfPj Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (FreeBSD) iD8DBQFGoaj4C3+MBN1Mb4gRAlcwAKCbJh5IOqLlZkd05jW2t6ktgbIaQACgnJNr dtX3BEnDUbOzxeOKkxXrmW8= =ts/I -----END PGP SIGNATURE----- --12mfjY/2IcLDkfPj--

From owner-freebsd-fs@FreeBSD.ORG Sat Jul 21 06:52:54 2007 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id D0F1F16A419 for ; Sat, 21 Jul 2007 06:52:54 +0000 (UTC) (envelope-from pjd@garage.freebsd.pl) Received: from mail.garage.freebsd.pl (arm132.internetdsl.tpnet.pl [83.17.198.132]) by mx1.freebsd.org (Postfix) with ESMTP id 7039B13C45B for ; Sat, 21 Jul 2007 06:52:54 +0000 (UTC) (envelope-from pjd@garage.freebsd.pl) Received: by mail.garage.freebsd.pl (Postfix, from userid 65534) id 313AB48A1B; Sat, 21 Jul 2007 08:52:52 +0200 (CEST) Received: from localhost (public-gprs39163.centertel.pl [91.94.25.47]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.garage.freebsd.pl (Postfix) with ESMTP id 533F9487F4; Sat, 21 Jul 2007 08:52:40 +0200 (CEST) Date: Sat, 21 Jul 2007 08:52:04 +0200 From: Pawel Jakub Dawidek To: Mark Powell Message-ID: <20070721065204.GA2044@garage.freebsd.pl> References: <20070719102302.R1534@rust.salford.ac.uk> <20070719135510.GE1194@garage.freebsd.pl> <20070719181313.G4923@rust.salford.ac.uk> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="LZvS9be/3tNcYl/X" Content-Disposition: inline In-Reply-To: <20070719181313.G4923@rust.salford.ac.uk> User-Agent: Mutt/1.4.2.3i X-PGP-Key-URL: http://people.freebsd.org/~pjd/pjd.asc X-OS: FreeBSD 7.0-CURRENT i386 X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on mail.garage.freebsd.pl X-Spam-Level: X-Spam-Status: No, score=-2.5 required=3.0 tests=BAYES_00,RCVD_IN_NJABL_DUL autolearn=no version=3.0.4 Cc: freebsd-fs@freebsd.org Subject: Re: ZfS & GEOM with many odd drive sizes X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 21 Jul 2007 06:52:54 -0000

--LZvS9be/3tNcYl/X Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

On Thu, Jul 19, 2007 at 06:19:14PM +0100, Mark Powell wrote:
> On Thu, 19 Jul 2007, Pawel Jakub Dawidek wrote:
>
> >On Thu, Jul 19, 2007 at 11:19:08AM +0100, Mark Powell wrote:
> >> What I want to know is: does the new volume have to be the same actual
> >>device name, or can it be substituted with another?
> >> I.e. can I remove, for example, one of the 448GB gconcats (e.g. gc1) and
> >>replace it with a new 750GB drive (e.g. ad6)?
> >> Eventually, once all volumes are replaced, the zpool could be, for
> >>example, 4x750GB, or 2.25TB of usable storage.
> >> Many thanks for any advice on these matters, which are new to me.
> >
> >All you described above should work.
>
> Thanks Pawel, for your response, and even more so for all your time spent
> working on ZFS.
>
> Should I expect much greater CPU usage with ZFS?
> I previously had a geom raid5 array which barely broke a sweat on
> benchmarks, i.e. simple large dd reads and writes.
> With ZFS on the same hardware I notice 50-60% system CPU usage is usual
> during such tests. Before, the network was the bottleneck, but now it's
> the ZFS array. I expected it would have to do a bit more 'thinking', but
> is such a dramatic increase normal?

Be sure to turn off debugging, i.e. remove the WITNESS, INVARIANTS and INVARIANT_SUPPORT options from your kernel configuration. Other than that, ZFS may just be more CPU hungry...

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!

--LZvS9be/3tNcYl/X Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.4 (FreeBSD) iD8DBQFGoa0UForvXbEpPzQRAl5oAKCtt0QPpNfWvPh2aBnIP0I/G/qjuwCgjOQ3 Yp1Kg6GDa1+FasS0vrqdW0U= =W6Xz -----END PGP SIGNATURE----- --LZvS9be/3tNcYl/X--

From owner-freebsd-fs@FreeBSD.ORG Sat Jul 21 13:52:09 2007 Return-Path: Delivered-To: fs@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id DC4E016A418; Sat, 21 Jul 2007 13:52:09 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail16.syd.optusnet.com.au (mail16.syd.optusnet.com.au [211.29.132.197]) by mx1.freebsd.org (Postfix) with ESMTP id 77EB913C458; Sat, 21 Jul 2007 13:52:09 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c220-239-235-248.carlnfd3.nsw.optusnet.com.au [220.239.235.248]) by mail16.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id l6LDq4Tg021276 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sat, 21 Jul 2007 23:52:06 +1000 Date: Sat, 21 Jul 2007 23:52:04 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Kostik Belousov In-Reply-To: <20070721063434.GI2200@deviant.kiev.zoral.com.ua> Message-ID: <20070721233613.Q3366@besplex.bde.org> References: <20070710233455.O2101@besplex.bde.org> <20070712084115.GA2200@deviant.kiev.zoral.com.ua> <20070712225324.F9515@besplex.bde.org> <20070712142127.GD2200@deviant.kiev.zoral.com.ua> <20070716195556.P12807@besplex.bde.org> <20070721063434.GI2200@deviant.kiev.zoral.com.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: bugs@FreeBSD.org, fs@FreeBSD.org Subject: Re: msdosfs not MPSAFE X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 21 Jul 2007 13:52:10 -0000

On Sat, 21 Jul 2007, Kostik Belousov wrote:
> On Mon, Jul 16, 2007 at 08:18:14PM +1000, Bruce Evans wrote:
>> sx xlocking works, but is not quite right:
>> % /*
>> % + * XXX msdosfs_lookup() is split up because unlocking before all the
>> % + * returns in the original function would be too churning.
>> % + */
>> % +int
>> % +msdosfs_lookup(ap)
>> % +	struct vop_cachedlookup_args *ap;
>> % +{
>> % +	int error;
>> % +
>> % +	sx_xlock(&mbnambuf_lock);
>> % +	error = msdosfs_lookup_locked(ap);
>> % +	sx_xunlock(&mbnambuf_lock);
>> % +	return (error);
>> % +}
>> % +
>> % +/*
>
> Assume that a directory A is participating in lookup() from two threads:
> thread 1 looks up A itself;
> thread 2 looks up some entry in A.
> Then, thread 1 would have mbnambuf_lock locked, and may wait for A's
> vnode lock; thread 2 would own the vnode lock for A, then try to lock
> mbnambuf_lock.
>
> I do not see what may prevent this LOR scenario from realizing, or what
> makes it harmless.

Nothing I can see either.
The wrapper is too global. Next try: move locking into the inner loop in msdosfs_lookup(). Unlocking is not as ugly as I feared. The following has only been tested at compile time:

% Index: msdosfs_lookup.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/fs/msdosfs/msdosfs_lookup.c,v
% retrieving revision 1.40
% diff -u -2 -r1.40 msdosfs_lookup.c
% --- msdosfs_lookup.c	26 Dec 2003 17:24:37 -0000	1.40
% +++ msdosfs_lookup.c	21 Jul 2007 13:27:37 -0000
% @@ -54,4 +54,5 @@
%  #include
%  #include
% +#include
%  #include
%  #include
% @@ -63,4 +64,6 @@
%  #include
%
% +extern struct sx mbnambuf_lock;
% +
%  /*
%   * When we search a directory the blocks containing directory entries are
% @@ -192,4 +195,5 @@
%  	 */
%  	tdp = NULL;
% +	sx_xlock(&mbnambuf_lock);
%  	mbnambuf_init();
%  	/*
% @@ -206,4 +210,5 @@
%  		if (error == E2BIG)
%  			break;
% +		sx_xunlock(&mbnambuf_lock);
%  		return (error);
%  	}
% @@ -211,4 +216,5 @@
%  	if (error) {
%  		brelse(bp);
% +		sx_xunlock(&mbnambuf_lock);
%  		return (error);
%  	}
% @@ -240,4 +246,5 @@
%  		if (dep->deName[0] == SLOT_EMPTY) {
%  			brelse(bp);
% +			sx_xunlock(&mbnambuf_lock);
%  			goto notfound;
%  		}
% @@ -301,4 +308,5 @@
%  		dp->de_fndcnt = wincnt - 1;
%
% +		sx_xunlock(&mbnambuf_lock);
%  		goto found;
%  	}
% @@ -310,4 +318,5 @@
%  		brelse(bp);
%  	} /* for (frcn = 0; ; frcn++) */
% +	sx_xunlock(&mbnambuf_lock);
%
%  notfound:

After moving the locking into msdosfs_conv.c and adding assertions there, this should be a good enough fix until the mbnambuf interface is changed. This bug is in all versions since 5.2-RELEASE.

Bruce