From owner-freebsd-fs@FreeBSD.ORG Wed Jul 20 02:16:23 2005
From: Eric Anderson <anderson@centtech.com>
To: freebsd-fs@freebsd.org
Date: Tue, 19 Jul 2005 21:16:18 -0500
Subject: Re: Cluster Filesystem for FreeBSD - any interest?

Bakul Shah wrote:
[..snip..]
>> :) I understand.  Any nudging in the right direction here would be
>> appreciated.
>
> I'd probably start with modelling a single filesystem and how
> it maps to a sequence of disk blocks (*without* using any
> code or worrying about details of formats but capturing the
> essential elements).  I'd describe various operations in
> terms of preconditions and postconditions.  Then, I'd extend
> the model to deal with redundancy and so on.  Then I'd model
> various failure modes, etc.  If you are interested _enough_
> we can take this offline and try to work something out.  You
> may even be able to use perl to create an `executable'
> specification :-)

I've done some research, and read some books/articles/white papers since
I started this thread.

First, porting GFS might be a more universal effort, and might be
'easier'.  However, that doesn't get us a clustered filesystem with a BSD
license (something that sounds good to me).

Clustering UFS2 would be cool.  Here's what I'm looking for:

A clustered filesystem (or layer?) that allows all machines in the
cluster to see the same filesystem as if it were local, with read/write
access.  The cluster will need cache coherency across all nodes, and
there will need to be some sort of lock manager on each node to
communicate with all the other nodes to coordinate file locking.  The
filesystem will have to support journaling.

I'm wondering if one could make a pseudo filesystem, something like
nullfs, that sits on top of a UFS2 partition and essentially monitors
all VFS operations to the filesystem, communicating them over TCP/IP to
the other nodes in the cluster.  That way, each node would know which
inodes and blocks are changing, so it can flush those buffers, and it
would know which blocks (or partial blocks) to view as locked when
another node locks them.
This could be done via multicast, so all nodes in the cluster would have
to be running a distributed lock manager daemon (dlmd) to coordinate it.
I also think the UFS2 filesystem would need a bit set upon mount that
marks the mount as a 'clustered' filesystem mount.  The reason for that
is so that we could modify mount to only mount 'clustered' filesystems
(mount -o clustered) if the dlmd was running, since that would be a
dependency for stable, coherent file control on a mount point.

Does anyone have any insight as to whether a layer would work?  Or maybe
I'm way off here and I need to do more reading :)

Eric

--
Eric Anderson   Sr. Systems Administrator   Centaur Technology
A lost ounce of gold may be found, a lost moment of time never.
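[To make the multicast piece of the idea above concrete, here is a small
C sketch of the announcement such a dlmd might send when its node
dirties an inode.  The group address, port, and wire format (struct
dlm_msg, dlm_announce) are invented for illustration, and byte-order
handling of the 64-bit fields is omitted.]

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <unistd.h>

    struct dlm_msg {                   /* hypothetical wire format */
            uint32_t ino;              /* inode that changed */
            uint64_t blkstart, blkend; /* dirtied block range */
            uint8_t  op;               /* 0=invalidate, 1=lock, 2=unlock */
    };

    /* Announce a change to the cluster so peers can flush stale
     * buffers or honor the lock; every dlmd joins the same group. */
    int
    dlm_announce(uint32_t ino, uint64_t start, uint64_t end, uint8_t op)
    {
            struct sockaddr_in grp;
            struct dlm_msg msg = { htonl(ino), start, end, op };
            unsigned char ttl = 1;     /* stay on the local segment */
            int s = socket(AF_INET, SOCK_DGRAM, 0);
            ssize_t n;

            if (s < 0)
                    return (-1);
            memset(&grp, 0, sizeof(grp));
            grp.sin_family = AF_INET;
            grp.sin_addr.s_addr = inet_addr("239.0.0.42"); /* example group */
            grp.sin_port = htons(4242);                    /* example port */
            setsockopt(s, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, sizeof(ttl));

            n = sendto(s, &msg, sizeof(msg), 0,
                (struct sockaddr *)&grp, sizeof(grp));
            close(s);
            return (n == (ssize_t)sizeof(msg) ? 0 : -1);
    }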
From owner-freebsd-fs@FreeBSD.ORG Wed Jul 20 02:37:26 2005
From: yf-263 <yfyoufeng@263.net>
To: Eric Anderson <anderson@centtech.com>
Cc: freebsd-fs@freebsd.org
Date: Wed, 20 Jul 2005 10:35:46 +0800
Subject: Re: Cluster Filesystem for FreeBSD - any interest?

On Tue, 2005-07-19 at 21:16 -0500, Eric Anderson wrote:
> [..snip..]
> First, porting GFS might be a more universal effort, and might be
> 'easier'.  However, that doesn't get us a clustered filesystem with a
> BSD license (something that sounds good to me).

It has been said it would be a seven man-month effort for a filesystem
expert.

> Clustering UFS2 would be cool.  Here's what I'm looking for:

That is exactly how "Lustre" does its work, though it builds on Ext3;
Lustre's targets are described at http://www.lustre.org/docs/SGSRFP.pdf .

> [..snip: Eric's description of a clustered layer, cache coherency,
> and a distributed lock manager daemon..]

--
yf-263
Unix-driver.org
Zander" To: freebsd-current@freebsd.org Message-ID: <20050720094830.GR782@marvin.riggiland.au> References: <42DD64AB.3000605@centtech.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <42DD64AB.3000605@centtech.com> Organization: Chaotic X-PGP-KeyID: 0xC85996CD X-PGP-URI: http://blackhole.pca.dfn.de:11371/pks/lookup?op=get&search=0xC85996CD X-PGP-Fingerprint: 4F59 75B4 4CE3 3B00 BC61 5400 8DD4 8929 C859 96CD X-Mailer: Marvin Mail (Build 1121849089) X-Operating-System: FreeBSD 5.4-STABLE Cc: freebsd-fs@freebsd.org Subject: Re: mksnap_ffs takes 4-5 minutes? X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 20 Jul 2005 09:48:46 -0000 Hi, On Tue, 19. Jul 2005, at 15:38 -0500, Eric Anderson wrote according to [mksnap_ffs takes 4-5 minutes?]: > This time, when I ran mksnap_ffs, the command took nearly 5 minutes [...] > Filesystem 1K-blocks Used Avail Capacity iused ifree > /dev/da1s1d 406234604 91799154 281936682 25% 1300303 51197103 I have a fs here with similar (but smaller size) parameters concerning inode density and usage: Filesystem 1K-blocks Used Avail Capacity iused ifree /dev/ad1s1d 113390248 92926924 18195520 84% 248434 14424460 time mksnap_ffs'ing gives the following result: 0.007u 1.902s 1:51.94 1.6% 5+217k 4493+8646io 0pf+0w It takes almost 2 minutes which seem to perform similarly to your 5 minutes. (There was not a single file opened when snapping.) I'd expect snapping to speed up by reducing the inode number when doing newfs, but I haven't verified this right now. Riggs (f'up to freebsd-fs) -- - "[...] I talked to the computer at great length and -- explained my view of the Universe to it" said Marvin. --- And what happened?" pressed Ford. ---- "It committed suicide." said Marvin. From owner-freebsd-fs@FreeBSD.ORG Wed Jul 20 11:57:26 2005 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 8353616A41F; Wed, 20 Jul 2005 11:57:26 +0000 (GMT) (envelope-from anderson@centtech.com) Received: from mh1.centtech.com (moat3.centtech.com [207.200.51.50]) by mx1.FreeBSD.org (Postfix) with ESMTP id 1E4C543D45; Wed, 20 Jul 2005 11:57:25 +0000 (GMT) (envelope-from anderson@centtech.com) Received: from [10.177.171.220] (neutrino.centtech.com [10.177.171.220]) by mh1.centtech.com (8.13.1/8.13.1) with ESMTP id j6KBvL43054377; Wed, 20 Jul 2005 06:57:25 -0500 (CDT) (envelope-from anderson@centtech.com) Message-ID: <42DE3C1F.9070704@centtech.com> Date: Wed, 20 Jul 2005 06:57:19 -0500 From: Eric Anderson User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.7.8) Gecko/20050603 X-Accept-Language: en-us, en MIME-Version: 1.0 To: "Thomas E. Zander" References: <42DD64AB.3000605@centtech.com> <20050720094830.GR782@marvin.riggiland.au> In-Reply-To: <20050720094830.GR782@marvin.riggiland.au> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Virus-Scanned: ClamAV 0.82/984/Tue Jul 19 04:16:09 2005 on mh1.centtech.com X-Virus-Status: Clean Cc: freebsd-fs@freebsd.org, freebsd-current@freebsd.org Subject: Re: mksnap_ffs takes 4-5 minutes? 
From owner-freebsd-fs@FreeBSD.ORG Wed Jul 20 11:57:26 2005
From: Eric Anderson <anderson@centtech.com>
To: "Thomas E. Zander" <riggs@rrr.de>
Cc: freebsd-fs@freebsd.org, freebsd-current@freebsd.org
Date: Wed, 20 Jul 2005 06:57:19 -0500
Subject: Re: mksnap_ffs takes 4-5 minutes?

Thomas E. Zander wrote:
> [...]
> It takes almost 2 minutes, which seems in line with your 5 minutes.
> (There was not a single file open when snapping.)
>
> I'd expect snapping to speed up by reducing the number of inodes when
> doing newfs, but I haven't verified this.

A 2TB filesystem with the standard newfs options takes about 30 minutes
to mksnap.  That's really unusable, because the filesystem is suspended
for so long.  Even empty 2TB filesystems take forever, so it's related
to the number of inodes.

How can we make this snappier?

Eric

--
Eric Anderson   Sr. Systems Administrator   Centaur Technology
From owner-freebsd-fs@FreeBSD.ORG Wed Jul 20 12:34:31 2005
From: Eric Anderson <anderson@centtech.com>
To: yfyoufeng@263.net
Cc: freebsd-fs@freebsd.org
Date: Wed, 20 Jul 2005 07:34:18 -0500
Subject: Re: Cluster Filesystem for FreeBSD - any interest?

yf-263 wrote:
> On Tue, 2005-07-19 at 21:16 -0500, Eric Anderson wrote:
>> First, porting GFS might be a more universal effort, and might be
>> 'easier'.  However, that doesn't get us a clustered filesystem with
>> a BSD license (something that sounds good to me).
>
> It has been said it would be a seven man-month effort for a
> filesystem expert.

Then we need to get a small group together and get started..

>> Clustering UFS2 would be cool.  Here's what I'm looking for:
>
> That is exactly how "Lustre" does its work, though it builds on Ext3;
> Lustre's targets are described at
> http://www.lustre.org/docs/SGSRFP.pdf .

Yes, I've read about Lustre.  I like the GFS model much better.

Eric

--
Eric Anderson   Sr. Systems Administrator   Centaur Technology
Zander" To: Eric Anderson Message-ID: <20050720130523.GT782@marvin.riggiland.au> References: <42DD64AB.3000605@centtech.com> <20050720094830.GR782@marvin.riggiland.au> <42DE3C1F.9070704@centtech.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <42DE3C1F.9070704@centtech.com> Organization: Chaotic X-PGP-KeyID: 0xC85996CD X-PGP-URI: http://blackhole.pca.dfn.de:11371/pks/lookup?op=get&search=0xC85996CD X-PGP-Fingerprint: 4F59 75B4 4CE3 3B00 BC61 5400 8DD4 8929 C859 96CD X-Mailer: Marvin Mail (Build 1121863362) X-Operating-System: FreeBSD 5.4-STABLE Cc: freebsd-fs@freebsd.org, freebsd-current@freebsd.org Subject: Re: mksnap_ffs takes 4-5 minutes? X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 20 Jul 2005 13:05:50 -0000 Hi, On Wed, 20. Jul 2005, at 6:57 -0500, Eric Anderson wrote according to [Re: mksnap_ffs takes 4-5 minutes?]: > A 2tb filesystem with the standard newfs options takes about 30 minutes > to mksnap.. That's unusable really, because the filesystem is suspended > for so long. Even empty 2tb filesystems take forever, so it's related > to the amount of inodes. > > How can we make this snappier? For the moment we can workaround by setting inode density appropriately when creating the fs. However this is only feasible if you know what your users are going to do with the fs; it also doesn't help when you *need* a large fs containing many small files. In the long run, dynamic inode (de)allocation would be nice to have. Also...what about the 'preparation' time for snapping? IIRC McKusick said that the lion's share of snapping time is used to delay pending transactions before actually doing the snap. There are quite some scenarios in which you can be certain that there is no file opened for writing, so a snap could be taken immediately. Would it be feasible to implement this feature? Or am I completely wrong? Riggs -- - "[...] I talked to the computer at great length and -- explained my view of the Universe to it" said Marvin. --- And what happened?" pressed Ford. ---- "It committed suicide." said Marvin. From owner-freebsd-fs@FreeBSD.ORG Thu Jul 21 13:46:04 2005 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 617B016A420 for ; Thu, 21 Jul 2005 13:46:04 +0000 (GMT) (envelope-from nb@ravenbrook.com) Received: from raven.ravenbrook.com (raven.ravenbrook.com [193.82.131.18]) by mx1.FreeBSD.org (Postfix) with ESMTP id 10AD843D81 for ; Thu, 21 Jul 2005 13:45:55 +0000 (GMT) (envelope-from nb@ravenbrook.com) Received: from thrush.ravenbrook.com (thrush.ravenbrook.com [193.112.141.145]) by raven.ravenbrook.com (8.12.6p3/8.12.6) with ESMTP id j6LDjqXi030013 for ; Thu, 21 Jul 2005 14:45:53 +0100 (BST) (envelope-from nb@ravenbrook.com) Received: from thrush.ravenbrook.com (localhost [127.0.0.1]) by thrush.ravenbrook.com (8.12.9p2/8.12.9) with ESMTP id j6LDjqFM073798 for ; Thu, 21 Jul 2005 14:45:52 +0100 (BST) (envelope-from nb@thrush.ravenbrook.com) From: Nick Barnes To: freebsd-fs@freebsd.org Date: Thu, 21 Jul 2005 14:45:52 +0100 Message-ID: <73797.1121953552@thrush.ravenbrook.com> Sender: nb@ravenbrook.com Subject: CGSIZE inaccuracy? 
From owner-freebsd-fs@FreeBSD.ORG Thu Jul 21 13:46:04 2005
From: Nick Barnes <nb@ravenbrook.com>
To: freebsd-fs@freebsd.org
Date: Thu, 21 Jul 2005 14:45:52 +0100
Subject: CGSIZE inaccuracy?

I'm running 4.9-RELEASE, and writing some tools to navigate around my
UFS filesystems to figure out what I have lost when I get bad blocks.
There's not much detailed online documentation of UFS beyond fs(5) and
fs.h, so I'm feeling my way through the sources.

Looking at fs.h, I see this:

/*
 * The size of a cylinder group is calculated by CGSIZE. The maximum size
 * is limited by the fact that cylinder groups are at most one block.
 * Its size is derived from the size of the maps maintained in the
 * cylinder group and the (struct cg) size.
 */
#define CGSIZE(fs) \
    /* base cg */       (sizeof(struct cg) + sizeof(int32_t) + \
    /* blktot size */   (fs)->fs_cpg * sizeof(int32_t) + \
    /* blks size */     (fs)->fs_cpg * (fs)->fs_nrpos * sizeof(int16_t) + \
    /* inode map */     howmany((fs)->fs_ipg, NBBY) + \
    /* block map */     howmany((fs)->fs_cpg * (fs)->fs_spc / NSPF(fs), NBBY) + \
    /* if present */    ((fs)->fs_contigsumsize <= 0 ? 0 : \
    /* cluster sum */   (fs)->fs_contigsumsize * sizeof(int32_t) + \
    /* cluster map */   howmany((fs)->fs_cpg * (fs)->fs_spc / NSPB(fs), NBBY)))

In a typical filesystem (fs_fsize = 2048, fs_bsize = 16384,
fs_ipg = 22528, fs_cpg = 89, fs_spc = 4096, fs_nrpos = 1,
fs_contigsumsize = 7), the parts of this sum add up like this:

    172  struct cg
      4  int32_t
    356  blktot: free blocks per cylinder
    178  blks: free blocks per rpos per cylinder
   2816  inode map, one bit per inode
  11392  block map, one bit per fragment
     28  cluster summary, one int32_t per contigsumsize
   1424  cluster map, one bit per block
  -----
  16370  CGSIZE

However, using the cg_* macros from fs.h (e.g. cg_clustersum), I get
offsets like this:

   base   limit    size
      0-    168     168  struct cg less cg_space
    168-    524     356  cg_blktot (free blocks per cylinder)
    524-    702     178  cg_blks (free blocks per rpos per cylinder)
    702-   3518    2816  cg_inosused (inode bitmap)
   3518-  14910   11392  cg_blksfree (fragment bitmap)
  14910-  14912       2  padding
  14912-  14940      28  cg_clustersum (block cluster summaries)
  14940-  16364    1424  cg_clusteroff (block bitmap)
  16364                  nextfreeoff

There are three discrepancies here:

  +4: sizeof(struct cg) is used instead of offsetof(cg_space);
  +4: sizeof(int32_t) is added, mysteriously;
  -2: the padding for cg_clustersum is disregarded.

I don't *think* that this matters, because CGSIZE is only apparently
used (by newfs and fsck) as a conservative approximation of the size of
the cylinder group header.  But it seems odd, given that a correct
calculation is fairly easy.

Nick Barnes
Ravenbrook Limited
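[For comparison, the exact header size those offsets imply can be
computed directly.  This is a sketch assuming the 4.x field names
(cg_space as the start of the variable-length area) and the usual
sys/param.h helpers; it folds in all three fixes: offsetof instead of
sizeof, no stray int32_t, and the int32_t alignment padding before
cg_clustersum.]

    #include <sys/param.h>   /* howmany, roundup, NBBY */
    #include <stddef.h>      /* offsetof */
    #include <ufs/ffs/fs.h>  /* struct fs, struct cg, NSPF, NSPB */

    /* Exact size of the cylinder-group header, laid out the way the
     * cg_* accessor macros actually place the maps. */
    static size_t
    cg_exact_size(const struct fs *fs)
    {
            size_t sz = offsetof(struct cg, cg_space);  /* 168, not 172 */

            sz += fs->fs_cpg * sizeof(int32_t);                /* cg_blktot */
            sz += fs->fs_cpg * fs->fs_nrpos * sizeof(int16_t); /* cg_blks */
            sz += howmany(fs->fs_ipg, NBBY);                   /* cg_inosused */
            sz += howmany(fs->fs_cpg * fs->fs_spc / NSPF(fs), NBBY); /* cg_blksfree */
            if (fs->fs_contigsumsize > 0) {
                    sz = roundup(sz, sizeof(int32_t)); /* the 2-byte padding */
                    sz += fs->fs_contigsumsize * sizeof(int32_t); /* cg_clustersum */
                    sz += howmany(fs->fs_cpg * fs->fs_spc / NSPB(fs), NBBY); /* cg_clusteroff */
            }
            return (sz);  /* 16364 for the example fs, vs. CGSIZE's 16370 */
    }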
From owner-freebsd-fs@FreeBSD.ORG Thu Jul 21 19:10:37 2005
From: Igor Shmukler <igor.shmukler@gmail.com>
To: fs@freebsd.org, hackers@freebsd.org, dillon@apollo.backplane.com
Date: Thu, 21 Jul 2005 15:10:30 -0400
Subject: per file lock list

Hi,

We have a question: how do we get all POSIX locks for a given file?

As far as I know, the existing API does not allow retrieving all of a
file's locks.  Therefore, we need to use kernel-internal structures to
get the applied locks.  Unfortunately, the head of the list of file
locks is attached to the inode rather than the vnode.  As a result, it
is much harder to get at the lock list head, because we need to know
the exact inode type hidden behind the vnode.

Of course, the problem could be resolved in a hackish way: we could
take the address of the VOP_ADVLOCK() method and compare it with all
known FS methods that handle this VOP operation (ufs_advlock, etc.),
then apply the proper cast to vnode->v_data to get a valid inode.
However, that would be a last resort.

So the question: is there an elegant way to get the lock list for a
given file?

Thank you in advance.
From owner-freebsd-fs@FreeBSD.ORG Thu Jul 21 19:26:05 2005
From: Matthew Dillon <dillon@apollo.backplane.com>
To: Igor Shmukler <igor.shmukler@gmail.com>
Cc: hackers@freebsd.org, fs@freebsd.org
Date: Thu, 21 Jul 2005 12:26:05 -0700 (PDT)
Subject: Re: per file lock list

:Hi,
:
:We have a question: how do we get all POSIX locks for a given file?
:...
:So the question: is there an elegant way to get the lock list for a
:given file?
:
:Thank you in advance.

    You can use F_GETLK to iterate through all posix locks held on a
    file.  From man fcntl:

    F_GETLK    Get the first lock that blocks the lock description
               pointed to by the third argument, arg, taken as a
               pointer to a struct flock (see above).  The information
               retrieved overwrites the information passed to fcntl()
               in the flock structure.  If no lock is found that would
               prevent this lock from being created, the structure is
               left unchanged by this function call except for the lock
               type which is set to F_UNLCK.

    So what you do is specify a lock description that covers the whole
    file and call F_GETLK.  You then use the result to shrink the lock
    description to a range that starts just past the returned lock for
    the next call.  You continue iterating until F_GETLK tells you that
    there are no more locks.

                    -Matt
                    Matthew Dillon
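[The iteration Matt describes, written out as a minimal C sketch with
only basic error handling.  One caveat: F_GETLK never reports locks
held by the calling process itself, since those cannot conflict with
the probe.]

    #include <sys/types.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Print every POSIX lock other processes hold on fd, by repeatedly
     * asking "what blocks a write lock from pos to EOF?" and advancing
     * past each answer. */
    static void
    list_locks(int fd)
    {
            struct flock fl;
            off_t pos = 0;

            for (;;) {
                    fl.l_type = F_WRLCK;  /* conflicts with any lock */
                    fl.l_whence = SEEK_SET;
                    fl.l_start = pos;
                    fl.l_len = 0;         /* zero length = to EOF */

                    if (fcntl(fd, F_GETLK, &fl) == -1) {
                            perror("fcntl(F_GETLK)");
                            return;
                    }
                    if (fl.l_type == F_UNLCK)
                            break;        /* no more locks past pos */

                    printf("pid %d holds a %s lock at %jd, len %jd\n",
                        (int)fl.l_pid,
                        fl.l_type == F_WRLCK ? "write" : "read",
                        (intmax_t)fl.l_start, (intmax_t)fl.l_len);

                    if (fl.l_len == 0)
                            break;        /* that lock runs to EOF */
                    pos = fl.l_start + fl.l_len;
            }
    }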
From owner-freebsd-fs@FreeBSD.ORG Fri Jul 22 12:16:57 2005
From: Eric Anderson <anderson@centtech.com>
To: "Thomas E. Zander" <riggs@rrr.de>
Cc: freebsd-fs@freebsd.org, freebsd-current@freebsd.org
Date: Fri, 22 Jul 2005 07:16:30 -0500
Subject: Re: mksnap_ffs takes 4-5 minutes?

Thomas E. Zander wrote:
> For the moment we can work around it by setting the inode density
> appropriately when creating the fs.  However, this is only feasible
> if you know what your users are going to do with the fs; it also
> doesn't help when you *need* a large fs containing many small files.
> In the long run, dynamic inode (de)allocation would be nice to have.

It doesn't seem to make a difference how much of the filesystem is
actually used.  It seems to depend on how many inodes there are, or
maybe more accurately, how many cylinder groups.

> Also... what about the 'preparation' time for snapping?  IIRC
> McKusick said that the lion's share of snapping time is spent letting
> pending transactions drain before actually taking the snap.  There
> are quite a few scenarios in which you can be certain that no file is
> open for writing, so a snap could be taken immediately.  Would it be
> feasible to implement this?  Or am I completely wrong?

The snap seemed to suspend the filesystem nearly immediately, and kept
it suspended for quite some time - I would say probably more than half
the total time.  For snapshots to be very useful, they must work on
large filesystems (100GB+) in a reasonable amount of time (a few
seconds would be ok).  I know for certain that one test filesystem
(2TB) had nothing on it - no processes using the filesystem at all -
and it took well over an hour to run mksnap on it.

Maybe mksnap is broken somehow?
Eric

--
Eric Anderson   Sr. Systems Administrator   Centaur Technology

From owner-freebsd-fs@FreeBSD.ORG Fri Jul 22 13:53:34 2005
From: Eric Masson <e-masson@kisoft-services.com>
To: Eric Anderson <anderson@centtech.com>
Cc: freebsd-fs@freebsd.org, freebsd-current@freebsd.org
Date: Fri, 22 Jul 2005 15:53:39 +0200
Subject: Re: mksnap_ffs takes 4-5 minutes?

Eric Anderson writes:

Hi,

> I know for certain that one test filesystem (2TB) had nothing on it,
> no processes using the filesystem at all, and it took well over an
> hour to run mksnap on it.

I ran some tests on a Dell PowerVault 725N with 5.2 about a year and a
half ago.  Snapshots on a 700GB filesystem were taking roughly half an
hour.

> Maybe mksnap is broken somehow?

I don't think broken is the right term, but it surely lacks
optimization on huge filesystems when compared to snapshots on a NetApp
filer (5 seconds on a terabyte volume, for example).  For a medium-size
fs like the 80GB one I'm using here on my desktop box, approximately 30
seconds seems reasonable to me.  A shorter time would be welcome, sure ;)

Éric Masson

--
Misinterpretation.  The content of the signature must respect the
newsgroup charter on *all* subjects - advertising as well as
Netiquette.  The four lines of a signature are not a lawless zone.
        -+- Lapin in: Oui-Oui casque bleu Neuneuland -+-
From owner-freebsd-fs@FreeBSD.ORG Fri Jul 22 16:52:26 2005
From: Chris Dillon <cdillon@wolves.k12.mo.us>
To: Eric Anderson <anderson@centtech.com>
Cc: freebsd-fs@freebsd.org
Date: Fri, 22 Jul 2005 11:25:40 -0500 (CDT)
Subject: Re: mksnap_ffs takes 4-5 minutes?

On Fri, 22 Jul 2005, Eric Anderson wrote:

> The snap seemed to suspend the filesystem nearly immediately, and
> kept it suspended for quite some time - I would say probably more
> than half the total time.  For snapshots to be very useful, they must
> work on large filesystems (100GB+) in a reasonable amount of time (a
> few seconds would be ok).  I know for certain that one test
> filesystem (2TB) had nothing on it - no processes using the
> filesystem at all - and it took well over an hour to run mksnap on
> it.

Just another datapoint -- I take daily snapshots of a 270GB filesystem
and it takes 3 to 4 minutes (not sure down to the second).  I used to
take multiple snapshots during the day, but suspending the filesystem
for several minutes at peak times wasn't working out (and sometimes
seemed to cause complete system hangs), so I went to once per day
during off-hours.  Making the snapshots seems to be mostly I/O bound,
and this is on a system with a fairly fast RAID5 array of 10K RPM SCSI
drives.  The suspension of the filesystem also seems immediate to me,
so most of the time is apparently spent actually building the snapshot.
Filesystem   1K-blocks      Used     Avail Capacity   iused   ifree %iused  Mounted on
/dev/da1s1a  272092768  16863968 233461380     7%    155146 8435700    2%   /userspace

Jul 20 00:14:11 rshome root: snapshot: daily.0 snapshot on filesystem /userspace made (duration: 4 min)
Jul 21 00:14:01 rshome root: snapshot: daily.0 snapshot on filesystem /userspace made (duration: 3 min)
Jul 22 00:14:04 rshome root: snapshot: daily.0 snapshot on filesystem /userspace made (duration: 4 min)

I'm using Ralf S. Engelschall's snapshot management scripts, though I
know that has no effect on the time it takes to create a snapshot.

--
Chris Dillon - cdillon(at)wolves.k12.mo.us
FreeBSD: The fastest, most open, and most stable OS on the planet
- Available for IA32, IA64, AMD64, PC98, Alpha, and UltraSPARC architectures
- PowerPC, ARM, MIPS, and S/390 under development
- http://www.freebsd.org

A: Because it reverses the logical flow of conversation.
Q: Why is putting a reply at the top of the message frowned upon?