From owner-freebsd-fs Mon Feb 5 7:26:56 2001
Delivered-To: freebsd-fs@freebsd.org
Received: from peorth.iteration.net (peorth.iteration.net [208.190.180.178])
 by hub.freebsd.org (Postfix) with ESMTP id 382F537B65D;
 Mon, 5 Feb 2001 07:26:31 -0800 (PST)
Received: by peorth.iteration.net (Postfix, from userid 1001)
 id 1CC6D57610; Mon, 5 Feb 2001 09:26:59 -0600 (CST)
Date: Mon, 5 Feb 2001 09:26:59 -0600
From: "Michael C . Wu" <keichii@iteration.net>
To: hackers@freebsd.org
Cc: fs@freebsd.org
Subject: Extremely large (70TB) File system/server planning
Message-ID: <20010205092658.A97400@peorth.iteration.net>
Reply-To: "Michael C . Wu" <keichii@peorth.iteration.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
X-PGP-Fingerprint: 5025 F691 F943 8128 48A8 5025 77CE 29C5 8FA1 2E20
X-PGP-Key-ID: 0x8FA12E20
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Hello Everyone,

While talking to a friend about what his company is planning to do,
I found out that he is planning a 70TB filesystem/servers/cluster/db.
(Yes, seventy t-e-r-a-b-y-t-e...)

Apparently, he has files that go up to 2gb each, and actually require
such a horribly sized cluster.

If he wanted a PC cluster, and having 5TB on each PC, he would have
350 machines to maintain. From past experience maintaining clusters,
I guarantee that he will have at least 1 box failing every other day.
And I really do not think his idea of using NFS is that good. ;-)

Now if we were to go to the high-end route (and probably more cost
effective), we can pick SAN's, large Sun fileservers, or somesuch.
I still cannot picture him being able to maintain file integrity.

I say that he should attempt to split his filesystems into much
smaller chunks, say 1TB each. And attempt some way of having a RAID5
array. Mirroring or other RAID configurations would prove too costly.
What would you guys do in this case?
:)
--
+------------------------------------------------------------------+
| keichii@peorth.iteration.net         | keichii@bsdconspiracy.net |
| http://peorth.iteration.net/~keichii | Yes, BSD is a conspiracy. |
+------------------------------------------------------------------+

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message

From owner-freebsd-fs Mon Feb 5 7:39:30 2001
Date: Mon, 5 Feb 2001 10:39:02 -0500 (EST)
From: Mitch Collinsworth <mitch@ccmr.cornell.edu>
To: "Michael C . Wu" <keichii@peorth.iteration.net>
Cc: hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: Extremely large (70TB) File system/server planning
In-Reply-To: <20010205092658.A97400@peorth.iteration.net>
Message-ID: <Pine.LNX.4.10.10102051036410.22516-100000@ruby.ccmr.cornell.edu>

You didn't say what applications this thing is going to support.
That does matter. A lot. One thing worth looking at is AFS,
or maybe MR-AFS. And now OpenAFS.

-Mitch

On Mon, 5 Feb 2001, Michael C . Wu wrote:

> Hello Everyone,
>
> While talking to a friend about what his company is planning to do,
> I found out that he is planning a 70TB filesystem/servers/cluster/db.
> (Yes, seventy t-e-r-a-b-y-t-e...)
>
> Apparently, he has files that go up to 2gb each, and actually require
> such a horribly sized cluster.
>
> If he wanted a PC cluster, and having 5TB on each PC, he would have
> 350 machines to maintain. From past experience maintaining clusters,
> I guarantee that he will have at least 1 box failing every other day.
> And I really do not think his idea of using NFS is that good. ;-)
>
> Now if we were to go to the high-end route (and probably more cost
> effective), we can pick SAN's, large Sun fileservers, or somesuch.
> I still cannot picture him being able to maintain file integrity.
>
> I say that he should attempt to split his filesystems into much
> smaller chunks, say 1TB each. And attempt some way of having a RAID5
> array. Mirroring or other RAID configurations would prove too costly.
> What would you guys do in this case? :)
> --
> +------------------------------------------------------------------+
> | keichii@peorth.iteration.net         | keichii@bsdconspiracy.net |
> | http://peorth.iteration.net/~keichii | Yes, BSD is a conspiracy. |
> +------------------------------------------------------------------+

From owner-freebsd-fs Mon Feb 5 7:52:59 2001
Date: Mon, 5 Feb 2001 09:52:30 -0600
From: Goblin <ahkbarr@yahoo.com>
To: "Michael C .
 Wu" <keichii@peorth.iteration.net>
Cc: hackers@freebsd.org, fs@freebsd.org
Subject: Re: Extremely large (70TB) File system/server planning
Message-ID: <20010205095229.A30253@msp-65-25-230-128.mn.rr.com>
References: <20010205092658.A97400@peorth.iteration.net>
In-Reply-To: <20010205092658.A97400@peorth.iteration.net>;
 from keichii@iteration.net on Mon, Feb 05, 2001 at 09:26:59AM -0600

NetApp filers?

And what exactly is too costly? He's got enormous costs just in doing
backups of this thing, and the savings in using NetApp filers for doing
"snapshots" instead of standard backups will buy you some disk in the
end...

What is this data used for? Archival? How oft is it accessed? How much
of the data is "live"? Has he looked at something other than plain disk?

Broaden his horizons and get specifics of his needs.

On 02/05, Michael C . Wu rearranged the electrons to read:
> Hello Everyone,
>
> While talking to a friend about what his company is planning to do,
> I found out that he is planning a 70TB filesystem/servers/cluster/db.
> (Yes, seventy t-e-r-a-b-y-t-e...)
>
> Apparently, he has files that go up to 2gb each, and actually require
> such a horribly sized cluster.
>
> If he wanted a PC cluster, and having 5TB on each PC, he would have
> 350 machines to maintain. From past experience maintaining clusters,
> I guarantee that he will have at least 1 box failing every other day.
> And I really do not think his idea of using NFS is that good. ;-)
>
> Now if we were to go to the high-end route (and probably more cost
> effective), we can pick SAN's, large Sun fileservers, or somesuch.
> I still cannot picture him being able to maintain file integrity.
>
> I say that he should attempt to split his filesystems into much
> smaller chunks, say 1TB each.
> And attempt some way of having a RAID5
> array. Mirroring or other RAID configurations would prove too costly.
> What would you guys do in this case? :)
> --
> +------------------------------------------------------------------+
> | keichii@peorth.iteration.net         | keichii@bsdconspiracy.net |
> | http://peorth.iteration.net/~keichii | Yes, BSD is a conspiracy. |
> +------------------------------------------------------------------+

Your eyes are weary from staring at the CRT. You feel sleepy. Notice how
restful it is to watch the cursor blink. Close your eyes. The opinions
stated above are yours. You cannot imagine why you ever felt otherwise.

From owner-freebsd-fs Mon Feb 5 8: 0:10 2001
Date: Mon, 5 Feb 2001 10:00:16 -0600
From: "Michael C . Wu" <keichii@iteration.net>
To: Mitch Collinsworth <mitch@ccmr.cornell.edu>
Cc: hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: Extremely large (70TB) File system/server planning
Message-ID: <20010205100016.C97400@peorth.iteration.net>
Reply-To: "Michael C .
 Wu" <keichii@peorth.iteration.net>
References: <20010205092658.A97400@peorth.iteration.net>
 <Pine.LNX.4.10.10102051036410.22516-100000@ruby.ccmr.cornell.edu>
In-Reply-To: <Pine.LNX.4.10.10102051036410.22516-100000@ruby.ccmr.cornell.edu>;
 from mitch@ccmr.cornell.edu on Mon, Feb 05, 2001 at 10:39:02AM -0500

On Mon, Feb 05, 2001 at 10:39:02AM -0500, Mitch Collinsworth scribbled:
| You didn't say what applications this thing is going to support.
| That does matter. A lot. One thing worth looking at is AFS,
| or maybe MR-AFS. And now OpenAFS.

He has database(s) of graphics simulation results. i.e. large files that
are largely unrelated to each other. Compression is not an option.

The files are accessed approximately 3 or 4 times a day on average.
Older files are archived for reference purpose and may never
be accessed after a week.
--
+------------------------------------------------------------------+
| keichii@peorth.iteration.net         | keichii@bsdconspiracy.net |
| http://peorth.iteration.net/~keichii | Yes, BSD is a conspiracy.
|
+------------------------------------------------------------------+

From owner-freebsd-fs Mon Feb 5 8:48:21 2001
Date: Mon, 5 Feb 2001 11:47:58 -0500 (EST)
From: Mitch Collinsworth <mitch@ccmr.cornell.edu>
To: "Michael C . Wu" <keichii@peorth.iteration.net>
Cc: hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: Extremely large (70TB) File system/server planning
In-Reply-To: <20010205100016.C97400@peorth.iteration.net>
Message-ID: <Pine.LNX.4.10.10102051146300.22516-100000@ruby.ccmr.cornell.edu>

On Mon, 5 Feb 2001, Michael C . Wu wrote:

> On Mon, Feb 05, 2001 at 10:39:02AM -0500, Mitch Collinsworth scribbled:
> | You didn't say what applications this thing is going to support.
> | That does matter. A lot. One thing worth looking at is AFS,
> | or maybe MR-AFS. And now OpenAFS.
>
> He has database(s) of graphics simulation results. i.e. large files that
> are largely unrelated to each other. Compression is not an option.
>
> The files are accessed approximately 3 or 4 times a day on average.
> Older files are archived for reference purpose and may never
> be accessed after a week.

Ok, this is a start.
Now is the 70 TB the size of the active files?
Or does that also include the older archived files that may never be
accessed again?

-Mitch

From owner-freebsd-fs Mon Feb 5 9:24:10 2001
Date: Mon, 5 Feb 2001 11:24:20 -0600
From: "Michael C . Wu" <keichii@iteration.net>
To: Mitch Collinsworth <mitch@ccmr.cornell.edu>
Cc: hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: Extremely large (70TB) File system/server planning
Message-ID: <20010205112420.A98288@peorth.iteration.net>
Reply-To: "Michael C . Wu" <keichii@peorth.iteration.net>
References: <20010205100016.C97400@peorth.iteration.net>
 <Pine.LNX.4.10.10102051146300.22516-100000@ruby.ccmr.cornell.edu>
In-Reply-To: <Pine.LNX.4.10.10102051146300.22516-100000@ruby.ccmr.cornell.edu>;
 from mitch@ccmr.cornell.edu on Mon, Feb 05, 2001 at 11:47:58AM -0500

On Mon, Feb 05, 2001 at 11:47:58AM -0500, Mitch Collinsworth scribbled:
| On Mon, 5 Feb 2001, Michael C . Wu wrote:
| > On Mon, Feb 05, 2001 at 10:39:02AM -0500, Mitch Collinsworth scribbled:
| > | You didn't say what applications this thing is going to support.
| > | That does matter. A lot. One thing worth looking at is AFS,
| > | or maybe MR-AFS. And now OpenAFS.
| >
| > He has database(s) of graphics simulation results. i.e.
| > large files that
| > are largely unrelated to each other. Compression is not an option.
| >
| > The files are accessed approximately 3 or 4 times a day on average.
| > Older files are archived for reference purpose and may never
| > be accessed after a week.
|
| Ok, this is a start. Now is the 70 TB the size of the active files?
| Or does that also include the older archived files that may never be
| accessed again?

70TB is the size of the sum of all files, access or no access.
(They still want to maintain accessibility even though the chances are slim.)
--
+------------------------------------------------------------------+
| keichii@peorth.iteration.net         | keichii@bsdconspiracy.net |
| http://peorth.iteration.net/~keichii | Yes, BSD is a conspiracy. |
+------------------------------------------------------------------+

From owner-freebsd-fs Mon Feb 5 9:51: 1 2001
Date: Mon, 5 Feb 2001 09:50:35 -0800 (PST)
From: Matt Dillon <dillon@earth.backplane.com>
Message-Id: <200102051750.f15HoZ021657@earth.backplane.com>
To: "Michael C .
 Wu" <keichii@iteration.net>
Cc: Mitch Collinsworth <mitch@ccmr.cornell.edu>, hackers@FreeBSD.ORG,
 fs@FreeBSD.ORG
Subject: Re: Extremely large (70TB) File system/server planning
References: <20010205100016.C97400@peorth.iteration.net>
 <Pine.LNX.4.10.10102051146300.22516-100000@ruby.ccmr.cornell.edu>
 <20010205112420.A98288@peorth.iteration.net>

:| > The files are accessed approximately 3 or 4 times a day on average.
:| > Older files are archived for reference purpose and may never
:| > be accessed after a week.
:|
:| Ok, this is a start. Now is the 70 TB the size of the active files?
:| Or does that also include the older archived files that may never be
:| accessed again?
:70TB is the size of the sum of all files, access or no access.
:(They still want to maintain accessibility even though the chances are slim.)
:--
:+------------------------------------------------------------------+
:| keichii@peorth.iteration.net         | keichii@bsdconspiracy.net |
:| http://peorth.iteration.net/~keichii | Yes, BSD is a conspiracy. |
:+------------------------------------------------------------------+

This doesn't sound like something you can just throw together with
off-the-shelf PCs and still have something reliable to show for it.
You need a big honking RAID system - maybe a NetApp, maybe something
else. You have to look at the filesystem and file size limitations
of the unit and the client(s).

FreeBSD can only support 1 TB sized filesystems. Our device layer
converts everything to DEV_BSIZE'd (512) blocks, so to be safe:
2^31 x 512 bytes = 1 TB on Intel boxes. Our NFS implementation has the
same per-filesystem limitation. Theoretically UFS/FFS are limited
to 2^31 x blocksize, where blocksize can be larger (e.g. 16384 bytes,
65536 bytes), but I have grave doubts that that actually works..
I'm fairly certain that we still convert things to 512 byte block
numbers at the device level, and we only use a 32 bit int to store
the block number.

So FreeBSD could be used as an NFS client, but probably not a server
for your application. Considering the number of disks you need to
manage, something like a NetApp or other completely self contained
RAID-5-capable system for handling the disks is mandatory.

-Matt

From owner-freebsd-fs Mon Feb 5 9:51:49 2001
Date: Mon, 5 Feb 2001 12:51:26 -0500 (EST)
From: Mitch Collinsworth <mitch@ccmr.cornell.edu>
To: "Michael C . Wu" <keichii@peorth.iteration.net>
Cc: hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: Extremely large (70TB) File system/server planning
In-Reply-To: <20010205112420.A98288@peorth.iteration.net>
Message-ID: <Pine.LNX.4.10.10102051238190.22516-100000@ruby.ccmr.cornell.edu>

On Mon, 5 Feb 2001, Michael C . Wu wrote:

> On Mon, Feb 05, 2001 at 11:47:58AM -0500, Mitch Collinsworth scribbled:
> | On Mon, 5 Feb 2001, Michael C .
> | Wu wrote:
> | > On Mon, Feb 05, 2001 at 10:39:02AM -0500, Mitch Collinsworth scribbled:
> | > | You didn't say what applications this thing is going to support.
> | > | That does matter. A lot. One thing worth looking at is AFS,
> | > | or maybe MR-AFS. And now OpenAFS.
> | >
> | > He has database(s) of graphics simulation results. i.e. large files that
> | > are largely unrelated to each other. Compression is not an option.
> | >
> | > The files are accessed approximately 3 or 4 times a day on average.
> | > Older files are archived for reference purpose and may never
> | > be accessed after a week.
> |
> | Ok, this is a start. Now is the 70 TB the size of the active files?
> | Or does that also include the older archived files that may never be
> | accessed again?
> 70TB is the size of the sum of all files, access or no access.
> (They still want to maintain accessibility even though the chances are slim.)

Ok, well the next question to look at is how do they define "maintain
accessibility". In other words, what do they consider acceptable?
Accessible in 5 seconds, accessible in 1 minute, accessible in 10
minutes, accessible in 1 hour, accessible overnight?

70 TB, as you have already noticed, is no simple feat to accomplish.
No matter how you slice it, it's going to cost $$. Different levels of
accessibility requirement for the archived data can be accomplished with
differing technologies and at differing costs. You could rough out a
plan for keeping the whole thing online and spinning for instant access
and then compare the costs of that with various options that keep the
hot data online and archive the rest in varying ways that allow for
differing speed of access.

Maybe you can archive old data on CDs or tapes. Perhaps keep more recent
archives "online" in a jukebox where they are fairly quickly accessible,
while older archives are on a rack where someone has to retrieve them
as needed.
The real question here is: are they really willing to spend what it
would take to keep an archive of this size spinning, including systems
programmers and administrators? Or are they willing to spend less and
have it take a bit longer to get access to the older data?

-Mitch

From owner-freebsd-fs Mon Feb 5 10:21:12 2001
Message-ID: <20010205132042.A324@pix.net>
Date: Mon, 5 Feb 2001 13:20:42 -0500
From: "Kurt J. Lidl" <lidl@pix.net>
To: Matt Dillon <dillon@earth.backplane.com>,
 "Michael C . Wu" <keichii@iteration.net>
Cc: Mitch Collinsworth <mitch@ccmr.cornell.edu>, hackers@FreeBSD.ORG,
 fs@FreeBSD.ORG
Subject: Re: Extremely large (70TB) File system/server planning
References: <20010205100016.C97400@peorth.iteration.net>
 <Pine.LNX.4.10.10102051146300.22516-100000@ruby.ccmr.cornell.edu>
 <20010205112420.A98288@peorth.iteration.net>
 <200102051750.f15HoZ021657@earth.backplane.com>
In-Reply-To: <200102051750.f15HoZ021657@earth.backplane.com>;
 from Matt Dillon on Mon, Feb 05, 2001 at 09:50:35AM -0800

On Mon, Feb 05, 2001 at 09:50:35AM -0800, Matt Dillon wrote:
> :70TB is the size of the sum of all files, access or no access.
> :(They still want to maintain accessibility even though the chances are slim.)
>
> This doesn't sound like something you can just throw together with
> off-the-shelf PCs and still have something reliable to show for it.
> You need a big honking RAID system - maybe a NetApp, maybe something
> else.
> You have to look at the filesystem and file size limitations
> of the unit and the client(s).

NetApp's biggest box can "only" handle 6TB of data, currently, using
the latest and greatest software. They claim (and I believe them) that
12TB will be the limit later this year.

> So FreeBSD could be used as an NFS client, but probably not a server
> for your application. Considering the number of disks you need to
> manage, something like a NetApp or other completely self contained
> RAID-5-capable system for handling the disks is mandatory.

Netapps are actually RAID-4 (dedicated parity disk), not RAID-5 (parity
data is recorded across all drives).

-Kurt

From owner-freebsd-fs Mon Feb 5 10:24:55 2001
Message-ID: <9164771DDCABD3118333005004E9446E2B7784@mother.netcentralen.dk>
From: Michael Aronsen <mar@netcentralen.dk>
To: "'fs@freebsd.org'" <fs@freebsd.org>
Subject: SV: Extremely large (70TB) File system/server planning
Date: Mon, 5 Feb 2001 19:30:44 +0100

How about an SGI system - XFS is claimed to have no size limit?

//Michael Aronsen

-----Original Message-----
From: Kurt J.
Lidl [mailto:lidl@pix.net]
Sent: February 5, 2001 19:21
To: Matt Dillon; Michael C . Wu
Cc: Mitch Collinsworth; hackers@FreeBSD.ORG; fs@FreeBSD.ORG
Subject: Re: Extremely large (70TB) File system/server planning

On Mon, Feb 05, 2001 at 09:50:35AM -0800, Matt Dillon wrote:
> :70TB is the size of the sum of all files, access or no access.
> :(They still want to maintain accessibility even though the chances are slim.)
>
> This doesn't sound like something you can just throw together with
> off-the-shelf PCs and still have something reliable to show for it.
> You need a big honking RAID system - maybe a NetApp, maybe something
> else. You have to look at the filesystem and file size limitations
> of the unit and the client(s).

NetApp's biggest box can "only" handle 6TB of data, currently, using
the latest and greatest software. They claim (and I believe them) that
12TB will be the limit later this year.

> So FreeBSD could be used as an NFS client, but probably not a server
> for your application. Considering the number of disks you need to
> manage, something like a NetApp or other completely self contained
> RAID-5-capable system for handling the disks is mandatory.

Netapps are actually RAID-4 (dedicated parity disk), not RAID-5 (parity
data is recorded across all drives).
-Kurt

From owner-freebsd-fs Mon Feb 5 10:29:52 2001
Date: Mon, 5 Feb 2001 10:29:34 -0800 (PST)
From: Matt Dillon <dillon@earth.backplane.com>
Message-Id: <200102051829.f15ITYY22891@earth.backplane.com>
To: "Michael C . Wu" <keichii@iteration.net>,
 Mitch Collinsworth <mitch@ccmr.cornell.edu>, hackers@FreeBSD.ORG,
 fs@FreeBSD.ORG
Subject: Re: Extremely large (70TB) File system/server planning
References: <20010205100016.C97400@peorth.iteration.net>
 <Pine.LNX.4.10.10102051146300.22516-100000@ruby.ccmr.cornell.edu>
 <20010205112420.A98288@peorth.iteration.net>
 <200102051750.f15HoZ021657@earth.backplane.com>

: 2^31 x 512 bytes = 1 TB on Intel boxes. Our NFS implementation has the
: same per-filesystem limitation. Theoretically UFS/FFS are limited

Oops. I meant, per-file limitation for NFS clients, not per-filesystem.
1TB per file.
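
A rough sketch of the arithmetic behind the 1 TB figure quoted above, assuming a signed 32-bit block number (so 2^31 usable values) multiplied by the block size. The names `SKETCH_BLKNO_BITS`, `max_addressable_bytes` and `filesystems_needed` are made up for illustration; they are not FreeBSD kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: block numbers are held in a signed 32-bit int,
 * leaving 31 bits for non-negative values. */
#define SKETCH_BLKNO_BITS 31

/* Largest size, in bytes, addressable with the given block size. */
static uint64_t max_addressable_bytes(uint64_t block_size)
{
    assert(block_size > 0);
    return (UINT64_C(1) << SKETCH_BLKNO_BITS) * block_size;
}

/* Number of limit-sized units needed to hold total_bytes, rounded up. */
static uint64_t filesystems_needed(uint64_t total_bytes, uint64_t block_size)
{
    uint64_t limit = max_addressable_bytes(block_size);
    return (total_bytes + limit - 1) / limit;
}
```

With 512-byte blocks this gives 2^40 bytes, the 1 TB limit mentioned in the thread, so 70 TB would need at least 70 such units. With 16384-byte blocks the same formula gives 2^45 bytes, which is the theoretical UFS/FFS case the earlier message expresses doubts about.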
-Matt

From owner-freebsd-fs Mon Feb 5 12:51:38 2001
Message-Id: <200102052052.f15KqOe00985@mass.dis.org>
To: Matt Dillon <dillon@earth.backplane.com>
Cc: "Michael C . Wu" <keichii@iteration.net>,
 Mitch Collinsworth <mitch@ccmr.cornell.edu>, hackers@FreeBSD.ORG,
 fs@FreeBSD.ORG
Subject: Re: Extremely large (70TB) File system/server planning
In-reply-to: Your message of "Mon, 05 Feb 2001 09:50:35 PST."
 <200102051750.f15HoZ021657@earth.backplane.com>
Date: Mon, 05 Feb 2001 12:52:24 -0800
From: Mike Smith <msmith@freebsd.org>

>
> :| > The files are accessed approximately 3 or 4 times a day on average.
> :| > Older files are archived for reference purpose and may never
> :| > be accessed after a week.
> :|
> :| Ok, this is a start. Now is the 70 TB the size of the active files?
> :| Or does that also include the older archived files that may never be
> :| accessed again?
> :70TB is the size of the sum of all files, access or no access.
> :(They still want to maintain accessibility even though the chances are slim.)
...
> This doesn't sound like something you can just throw together with
> off-the-shelf PCs and still have something reliable to show for it.
> You need a big honking RAID system - maybe a NetApp, maybe something
> else. You have to look at the filesystem and file size limitations
> of the unit and the client(s).
You can't do this with a NetApp either; they max out at about 6TB now
(going up to around 12 or so soon). You might want to talk to EMC and/or
IBM, both of whom have *extremely* large filers.

Your friend may also want to look at Traakan, who have a novel product
in this space.

--
... every activity meets with opposition, everyone who acts has his
rivals and unfortunately opponents also. But not because people want
to be opponents, rather because the tasks and relationships force
people to take different points of view. [Dr. Fritz Todt]
V I C T O R Y   N O T   V E N G E A N C E

From owner-freebsd-fs Mon Feb 5 12:59: 7 2001
Message-ID: <5FE9B713CCCDD311A03400508B8B3013054E3F50@bdr-xcln.is.matchlogic.com>
From: Charles Randall <crandall@matchlogic.com>
To: "Michael C . Wu" <keichii@iteration.net>,
 'Mike Smith' <msmith@freebsd.org>,
 Matt Dillon <dillon@earth.backplane.com>
Cc: Mitch Collinsworth <mitch@ccmr.cornell.edu>, hackers@FreeBSD.ORG,
 fs@FreeBSD.ORG
Subject: RE: Extremely large (70TB) File system/server planning
Date: Mon, 5 Feb 2001 13:58:22 -0700

Does this have to be a single filesystem? If not, just provide a
database front-end that maps some kind of resource identifier to the
filesystem name. With that, you can span filers and/or filesystems.

Seems like the only thing that would be reasonable.
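
A minimal sketch of that mapping idea, with a static in-memory table standing in for the database front-end. Everything here is hypothetical and for illustration only: the `fs_mapping` struct, the `sim-000N` identifiers, the `/filerNN/volN` names and `lookup_filesystem` are not from any real product.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical catalog row: which filesystem holds a given resource. */
struct fs_mapping {
    const char *resource_id;  /* illustrative identifier */
    const char *filesystem;   /* illustrative filesystem name */
};

/* Stand-in for the database; a real front-end would query one instead. */
static const struct fs_mapping catalog[] = {
    { "sim-0001", "/filer01/vol0" },
    { "sim-0002", "/filer01/vol1" },
    { "sim-0003", "/filer02/vol0" },
};

/* Return the filesystem holding resource_id, or NULL if it is unknown. */
static const char *lookup_filesystem(const char *resource_id)
{
    assert(resource_id != NULL);
    for (size_t i = 0; i < sizeof(catalog) / sizeof(catalog[0]); i++) {
        if (strcmp(catalog[i].resource_id, resource_id) == 0)
            return catalog[i].filesystem;
    }
    return NULL;
}
```

Clients would only ever ask for a resource identifier, so data could be spread over many filesystems of 1 TB or less, and moved between them by updating the table rather than the clients.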
Charles -----Original Message----- From: Mike Smith [mailto:msmith@freebsd.org] Sent: Monday, February 05, 2001 1:52 PM To: Matt Dillon Cc: Michael C . Wu; Mitch Collinsworth; hackers@FreeBSD.ORG; fs@FreeBSD.ORG Subject: Re: Extremely large (70TB) File system/server planning > > :| > The files are accessed approximately 3 or 4 times a day on average. > :| > Older files are archived for reference purpose and may never > :| > be accessed after a week. > :| > :| Ok, this is a start. Now is the 70 TB the size of the active files? > :| Or does that also include the older archived files that may never be > :| accessed again? > :70TB is the size of the sum of all files, access or no access. > :(They still want to maintain accessibility even though the chances are slim.) ... > This doesn't sound like something you can just throw together with > off-the-shelf PCs and still have something reliable to show for it. > You need a big honking RAID system - maybe a NetApp, maybe something > else. You have to look at the filesystem and file size limitations > of the unit and the client(s). You can't do this with a NetApp either; they max out at about 6TB now (going up to around 12 or so soon). You might want to talk to EMC and/or IBM, both of whom have *extremely* large filers. Your friend may also want to look at Traakan, who have a novel product in this space. -- ... every activity meets with opposition, everyone who acts has his rivals and unfortunately opponents also. But not because people want to be opponents, rather because the tasks and relationships force people to take different points of view. [Dr. 
Fritz Todt] V I C T O R Y N O T V E N G E A N C E To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-hackers" in the body of the message To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Feb 6 3: 0:30 2001 Delivered-To: freebsd-fs@freebsd.org Received: from relay.butya.kz (butya-gw.butya.kz [212.154.129.94]) by hub.freebsd.org (Postfix) with ESMTP id 5817737B503; Tue, 6 Feb 2001 03:00:08 -0800 (PST) Received: by relay.butya.kz (Postfix, from userid 1000) id BAE6628E66; Tue, 6 Feb 2001 17:00:03 +0600 (ALMT) Received: from localhost (localhost [127.0.0.1]) by relay.butya.kz (Postfix) with ESMTP id ABBCA28E46; Tue, 6 Feb 2001 17:00:03 +0600 (ALMT) Date: Tue, 6 Feb 2001 17:00:03 +0600 (ALMT) From: Boris Popov <bp@butya.kz> To: freebsd-arch@freebsd.org Cc: freebsd-fs@freebsd.org Subject: vnode interlock API Message-ID: <Pine.BSF.4.21.0102061638280.82511-100000@lion.butya.kz> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Hello, Few months ago simple locks used for vnode interlock were replaced by mutexes. It causes additional pain for externally maintained filesystems and lowers portability of the code between -stable and -current. So, I suggest to introduce two macro definitions which will hide implementation details for interlocks: #define VI_LOCK(vp) mtx_enter(&(vp)->v_interlock, MTX_DEF) #define VI_UNLOCK(vp) mtx_exit(&(vp)->v_interlock, MTX_DEF) for RELENG_4 they will look like this: #define VI_LOCK(vp) simple_lock(&(vp)->v_interlock) #define VI_UNLOCK(vp) simple_unlock(&(vp)->v_interlock) Any comments, suggestions ? 
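The proposal above could be sketched as a single compatibility header selected at compile time. This is only an illustration: `COMPAT_RELENG_4` is a hypothetical switch (a real header would more likely test the kernel version), and `struct fake_lock` together with the four lock functions are user-space stand-ins, not the kernel definitions, so that the pattern can be compiled and exercised outside a kernel:

```c
#include <assert.h>

/* User-space stand-ins so the pattern compiles outside a kernel.
 * None of these are the real kernel definitions. */
struct fake_lock { int held; };
struct vnode { struct fake_lock v_interlock; };  /* only the field the macros touch */
#define MTX_DEF 0                                /* stand-in value */

static inline void mtx_enter(struct fake_lock *m, int type)
{ (void)type; assert(!m->held); m->held = 1; }
static inline void mtx_exit(struct fake_lock *m, int type)
{ (void)type; assert(m->held); m->held = 0; }
static inline void simple_lock(struct fake_lock *l)
{ assert(!l->held); l->held = 1; }
static inline void simple_unlock(struct fake_lock *l)
{ assert(l->held); l->held = 0; }

/* COMPAT_RELENG_4 is a hypothetical switch chosen for this sketch. */
#ifdef COMPAT_RELENG_4
#define VI_LOCK(vp)   simple_lock(&(vp)->v_interlock)
#define VI_UNLOCK(vp) simple_unlock(&(vp)->v_interlock)
#else
#define VI_LOCK(vp)   mtx_enter(&(vp)->v_interlock, MTX_DEF)
#define VI_UNLOCK(vp) mtx_exit(&(vp)->v_interlock, MTX_DEF)
#endif

/* Filesystem code written once against the macros, for either branch. */
static int
bump_under_interlock(struct vnode *vp, int *counter)
{
    int value;

    VI_LOCK(vp);
    value = ++*counter;
    VI_UNLOCK(vp);
    return value;
}
```

An externally maintained filesystem would then use only `VI_LOCK()` and `VI_UNLOCK()` and could build unchanged against either branch.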
-- Boris Popov http://www.butya.kz/~bp/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Feb 6 3: 3:47 2001 Delivered-To: freebsd-fs@freebsd.org Received: from critter.freebsd.dk (flutter.freebsd.dk [212.242.40.147]) by hub.freebsd.org (Postfix) with ESMTP id CEB2037B401; Tue, 6 Feb 2001 03:03:24 -0800 (PST) Received: from critter (localhost [127.0.0.1]) by critter.freebsd.dk (8.11.1/8.11.1) with ESMTP id f16B33B33409; Tue, 6 Feb 2001 12:03:03 +0100 (CET) (envelope-from phk@critter.freebsd.dk) To: Boris Popov <bp@butya.kz> Cc: freebsd-arch@FreeBSD.ORG, freebsd-fs@FreeBSD.ORG Subject: Re: vnode interlock API In-Reply-To: Your message of "Tue, 06 Feb 2001 17:00:03 +0600." <Pine.BSF.4.21.0102061638280.82511-100000@lion.butya.kz> Date: Tue, 06 Feb 2001 12:03:03 +0100 Message-ID: <33407.981457383@critter> From: Poul-Henning Kamp <phk@critter.freebsd.dk> Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Sounds like something which should have been done long time ago... In message <Pine.BSF.4.21.0102061638280.82511-100000@lion.butya.kz>, Boris Popov writes: > Hello, > > Few months ago simple locks used for vnode interlock were replaced >by mutexes. It causes additional pain for externally maintained >filesystems and lowers portability of the code between -stable and >-current. > > So, I suggest to introduce two macro definitions which will hide >implementation details for interlocks: > >#define VI_LOCK(vp) mtx_enter(&(vp)->v_interlock, MTX_DEF) >#define VI_UNLOCK(vp) mtx_exit(&(vp)->v_interlock, MTX_DEF) > > for RELENG_4 they will look like this: > >#define VI_LOCK(vp) simple_lock(&(vp)->v_interlock) >#define VI_UNLOCK(vp) simple_unlock(&(vp)->v_interlock) > > Any comments, suggestions ? 
> >-- >Boris Popov >http://www.butya.kz/~bp/ > > > >To Unsubscribe: send mail to majordomo@FreeBSD.org >with "unsubscribe freebsd-arch" in the body of the message > -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Feb 6 7:29: 4 2001 Delivered-To: freebsd-fs@freebsd.org Received: from implode.root.com (root.com [209.102.106.178]) by hub.freebsd.org (Postfix) with ESMTP id A4C6537B401; Tue, 6 Feb 2001 07:28:44 -0800 (PST) Received: from implode.root.com (localhost [127.0.0.1]) by implode.root.com (8.8.8/8.8.5) with ESMTP id HAA27735; Tue, 6 Feb 2001 07:18:31 -0800 (PST) Message-Id: <200102061518.HAA27735@implode.root.com> To: "Michael C . Wu" <keichii@peorth.iteration.net> Cc: hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: Extremely large (70TB) File system/server planning In-reply-to: Your message of "Mon, 05 Feb 2001 09:26:59 CST." <20010205092658.A97400@peorth.iteration.net> From: David Greenman <dg@root.com> Reply-To: dg@root.com Date: Tue, 06 Feb 2001 07:18:31 -0800 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org >While talking to a friend about what his company is planning to do, >I found out that he is planning a 70TB filesystem/servers/cluster/db. >(Yes, seventy t-e-r-a-b-y-t-e...) We could do this using about 44 of the not-yet-announced TSR-3100 fibre channel RAID storage systems. These are 1.8TB (1.62TB usable) capacity units in a 3U cabinet. It would take around 200A @ 120VAC (about 18KW) to power all of them and should fit in about 5 rack cabinets. Total cost would be about $3 million. -DG David Greenman Co-founder, The FreeBSD Project - http://www.freebsd.org President, TeraSolutions, Inc. 
- http://www.terasolutions.com Pave the road of life with opportunities. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Feb 6 8:32:23 2001 Delivered-To: freebsd-fs@freebsd.org Received: from mailout02.sul.t-online.com (mailout02.sul.t-online.com [194.25.134.17]) by hub.freebsd.org (Postfix) with ESMTP id 8948D37B4EC; Tue, 6 Feb 2001 08:32:03 -0800 (PST) Received: from fwd07.sul.t-online.com by mailout02.sul.t-online.com with smtp id 14QB27-00052q-00; Tue, 06 Feb 2001 17:31:59 +0100 Received: from frolic.no-support.loc (520094253176-0001@[217.80.111.106]) by fmrl07.sul.t-online.com with esmtp id 14QB1l-2Kk35mC; Tue, 6 Feb 2001 17:31:37 +0100 Received: (from bjoern@localhost) by frolic.no-support.loc (8.11.1/8.9.3) id f16GLp600648; Tue, 6 Feb 2001 17:21:51 +0100 (CET) (envelope-from bjoern) From: Bjoern Fischer <bfischer@Techfak.Uni-Bielefeld.DE> Date: Tue, 6 Feb 2001 17:21:50 +0100 To: Boris Popov <bp@butya.kz> Cc: freebsd-arch@FreeBSD.ORG, freebsd-fs@FreeBSD.ORG Subject: Re: vnode interlock API Message-ID: <20010206172150.A528@frolic.no-support.loc> References: <Pine.BSF.4.21.0102061638280.82511-100000@lion.butya.kz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <Pine.BSF.4.21.0102061638280.82511-100000@lion.butya.kz>; from bp@butya.kz on Tue, Feb 06, 2001 at 05:00:03PM +0600 X-Sender: 520094253176-0001@t-dialin.net Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Hello, > Few months ago simple locks used for vnode interlock were replaced > by mutexes. It causes additional pain for externally maintained > filesystems and lowers portability of the code between -stable and > -current. 
> > So, I suggest to introduce two macro definitions which will hide > implementation details for interlocks: > > #define VI_LOCK(vp) mtx_enter(&(vp)->v_interlock, MTX_DEF) > #define VI_UNLOCK(vp) mtx_exit(&(vp)->v_interlock, MTX_DEF) BTW, does this mean that -current vnode locking works sufficiently enough to support stacked file systems a la Eric Zadok's FiST software? Bjoern -- -----BEGIN GEEK CODE BLOCK----- GCS d--(+) s++: a- C+++(-) UB++++OSI++++$ P+++(-) L---(++) !E W- N+ o>+ K- !w !O !M !V PS++ PE- PGP++ t+++ !5 X++ tv- b+++ D++ G e+ h-- y+ ------END GEEK CODE BLOCK------ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Feb 6 11:51:55 2001 Delivered-To: freebsd-fs@freebsd.org Received: from meow.osd.bsdi.com (meow.osd.bsdi.com [204.216.28.88]) by hub.freebsd.org (Postfix) with ESMTP id 192A337B401; Tue, 6 Feb 2001 11:51:34 -0800 (PST) Received: from laptop.baldwin.cx (john@jhb-laptop.osd.bsdi.com [204.216.28.241]) by meow.osd.bsdi.com (8.11.1/8.9.3) with ESMTP id f16Jo9345186; Tue, 6 Feb 2001 11:50:09 -0800 (PST) (envelope-from jhb@FreeBSD.org) Message-ID: <XFMail.010206115111.jhb@FreeBSD.org> X-Mailer: XFMail 1.4.0 on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: <Pine.BSF.4.21.0102061638280.82511-100000@lion.butya.kz> Date: Tue, 06 Feb 2001 11:51:11 -0800 (PST) From: John Baldwin <jhb@FreeBSD.org> To: Boris Popov <bp@butya.kz> Subject: RE: vnode interlock API Cc: freebsd-fs@FreeBSD.org, freebsd-arch@FreeBSD.org Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On 06-Feb-01 Boris Popov wrote: > Hello, > > Few months ago simple locks used for vnode interlock were replaced > by mutexes. It causes additional pain for externally maintained > filesystems and lowers portability of the code between -stable and > -current. Sounds good. 
-- John Baldwin <jhb@FreeBSD.org> -- http://www.FreeBSD.org/~jhb/ PGP Key: http://www.baldwin.cx/~john/pgpkey.asc "Power Users Use the Power to Serve!" - http://www.FreeBSD.org/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Feb 6 13:16: 9 2001 Delivered-To: freebsd-fs@freebsd.org Received: from bingnet2.cc.binghamton.edu (bingnet2.cc.binghamton.edu [128.226.1.18]) by hub.freebsd.org (Postfix) with ESMTP id A6AAD37B401 for <freebsd-fs@freebsd.org>; Tue, 6 Feb 2001 13:15:52 -0800 (PST) Received: from opal (cs.binghamton.edu [128.226.123.101]) by bingnet2.cc.binghamton.edu (8.11.2/8.11.2) with ESMTP id f16LFp002770 for <freebsd-fs@freebsd.org>; Tue, 6 Feb 2001 16:15:51 -0500 (EST) Date: Tue, 6 Feb 2001 16:15:45 -0500 (EST) From: Zhiui Zhang <zzhang@cs.binghamton.edu> X-Sender: zzhang@opal To: freebsd-fs@freebsd.org Subject: Design a journalled file system Message-ID: <Pine.SOL.4.21.0102061544230.6584-100000@opal> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org I am considering the design of a journalled file system in FreeBSD. I think each transaction corresponds to a file system update operation and will therefore consists of a list of modified buffers. The important thing is that these buffers should not be written to disk until they have been logged into the log area. To do so, we need to pin these buffers in memory for a while. The concept should be simple, but I run into a problem which I have no idea how to solve it: If you access a lot of files quickly, some vnodes will be reused. These vnodes can contain buffers that are still pinned in the memory because of the write-ahead logging constraints. After a vnode is gone, we have no way to recover its buffers. Note that whenever we need a new vnode, we are in the process of creating a new file. 
At this point, we can not flush the buffers to the log area. The result is a deadlock. I could make copies of the buffers that are still pinned, but that incurs memory copy and need buffer headers, which is also a rare resource. The design is similar to ext3fs of linux (they do not seem to have a vnode layer and they use device + physical block number instead of vnode + logical block number to index buffers, which, I guess, means that buffers can exist after the inode is gone). I know Mckusick has a paper on journalling FFS, but I just want to know if this design can work or not. Any ideas? Thanks for your help! -Zhihui To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Feb 6 13:47:31 2001 Delivered-To: freebsd-fs@freebsd.org Received: from gw.errno.com (node-d1d4bd7a.powerinter.net [209.212.189.122]) by hub.freebsd.org (Postfix) with ESMTP id 944B337B491 for <freebsd-fs@FreeBSD.ORG>; Tue, 6 Feb 2001 13:47:14 -0800 (PST) Received: from melange (melange.errno.com [209.212.166.36]) by gw.errno.com (8.9.0/8.9.0) with SMTP id NAA28653; Tue, 6 Feb 2001 13:47:12 -0800 (PST) Message-ID: <0e9101c09086$5ca812b0$24a6d4d1@melange> From: "Sam Leffler" <sam@errno.com> To: "Zhiui Zhang" <zzhang@cs.binghamton.edu>, <freebsd-fs@FreeBSD.ORG> References: <Pine.SOL.4.21.0102061544230.6584-100000@opal> Subject: Re: Design a journalled file system Date: Tue, 6 Feb 2001 13:47:11 -0800 Organization: Errno Consulting MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 5.00.3018.1300 X-MimeOLE: Produced By Microsoft MimeOLE V5.00.3018.1300 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org If you really want to work on another filesystem, learn about/from SGI's XFS. They've made a GPL'd version for Linux version available for public ftp. 
Sam To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Feb 6 13:53:36 2001 Delivered-To: freebsd-fs@freebsd.org Received: from fw.wintelcom.net (ns1.wintelcom.net [209.1.153.20]) by hub.freebsd.org (Postfix) with ESMTP id 90BF237B401 for <freebsd-fs@FreeBSD.ORG>; Tue, 6 Feb 2001 13:53:19 -0800 (PST) Received: (from bright@localhost) by fw.wintelcom.net (8.10.0/8.10.0) id f16LrI626721; Tue, 6 Feb 2001 13:53:18 -0800 (PST) Date: Tue, 6 Feb 2001 13:53:18 -0800 From: Alfred Perlstein <bright@wintelcom.net> To: Zhiui Zhang <zzhang@cs.binghamton.edu> Cc: freebsd-fs@FreeBSD.ORG Subject: Re: Design a journalled file system Message-ID: <20010206135317.Z26076@fw.wintelcom.net> References: <Pine.SOL.4.21.0102061544230.6584-100000@opal> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <Pine.SOL.4.21.0102061544230.6584-100000@opal>; from zzhang@cs.binghamton.edu on Tue, Feb 06, 2001 at 04:15:45PM -0500 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org * Zhiui Zhang <zzhang@cs.binghamton.edu> [010206 13:16] wrote: > > I am considering the design of a journalled file system in FreeBSD. I > think each transaction corresponds to a file system update operation and > will therefore consists of a list of modified buffers. The important > thing is that these buffers should not be written to disk until they have > been logged into the log area. To do so, we need to pin these buffers in > memory for a while. The concept should be simple, but I run into a problem > which I have no idea how to solve it: > > If you access a lot of files quickly, some vnodes will be reused. These > vnodes can contain buffers that are still pinned in the memory because of > the write-ahead logging constraints. After a vnode is gone, we have > no way to recover its buffers. 
Note that whenever we need a new vnode, we > are in the process of creating a new file. At this point, we can not flush > the buffers to the log area. The result is a deadlock. > > I could make copies of the buffers that are still pinned, but that incurs > memory copy and need buffer headers, which is also a rare resource. > > The design is similar to ext3fs of linux (they do not seem to have a vnode > layer and they use device + physical block number instead of vnode + > logical block number to index buffers, which, I guess, means that buffers > can exist after the inode is gone). I know Mckusick has a paper on > journalling FFS, but I just want to know if this design can work or not. > > Any ideas? Thanks for your help! There's ways to reassign buffers to other vnodes, you can remove the buffers from the vnodes at reclaim time (there has to be a hook for this) and link them to a special vnode linked from your mount structure. -- -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org] "I have the heart of a child; I keep it in a jar on my desk." 
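The reassignment idea above can be modelled in user space roughly as follows. This is a sketch under stated assumptions, not kernel code: `struct jbuf`, `struct jvnode`, `struct jmount`, the holding vnode `m_holdvp`, and `journal_reclaim()` are all hypothetical names, and the singly linked list stands in for the real per-vnode buffer lists.

```c
#include <assert.h>
#include <stddef.h>

struct jvnode;

/* Simplified stand-in for a buffer header. */
struct jbuf {
    struct jbuf   *b_next;  /* next buffer on the owning vnode's list */
    struct jvnode *b_vp;    /* owning vnode, or NULL if detached */
    int            pinned;  /* nonzero while the log write is still pending */
};

struct jvnode { struct jbuf *v_bufs; };

/* The mount structure carries one special vnode that adopts orphans. */
struct jmount { struct jvnode m_holdvp; };

static void
jvnode_add_buf(struct jvnode *vp, struct jbuf *bp)
{
    bp->b_vp = vp;
    bp->b_next = vp->v_bufs;
    vp->v_bufs = bp;
}

/*
 * Hypothetical reclaim hook: before vp is reused, move every still-pinned
 * buffer to the mount's holding vnode and detach the rest.
 * Returns the number of buffers that were moved.
 */
static int
journal_reclaim(struct jmount *mp, struct jvnode *vp)
{
    struct jbuf *bp = vp->v_bufs, *next;
    int moved = 0;

    vp->v_bufs = NULL;
    for (; bp != NULL; bp = next) {
        next = bp->b_next;
        if (bp->pinned) {
            jvnode_add_buf(&mp->m_holdvp, bp);
            moved++;
        } else {
            bp->b_next = NULL;
            bp->b_vp = NULL;
        }
    }
    return moved;
}
```

The point of the design is that the vnode can be recycled immediately, while the pinned buffers remain reachable through the mount structure until the log has been written.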
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Feb 6 18:21:16 2001 Delivered-To: freebsd-fs@freebsd.org Received: from bingnet2.cc.binghamton.edu (bingnet2.cc.binghamton.edu [128.226.1.18]) by hub.freebsd.org (Postfix) with ESMTP id 6B12937B401 for <freebsd-fs@FreeBSD.ORG>; Tue, 6 Feb 2001 18:20:58 -0800 (PST) Received: from opal (cs.binghamton.edu [128.226.123.101]) by bingnet2.cc.binghamton.edu (8.11.2/8.11.2) with ESMTP id f172Kt025746; Tue, 6 Feb 2001 21:20:55 -0500 (EST) Date: Tue, 6 Feb 2001 21:20:50 -0500 (EST) From: Zhiui Zhang <zzhang@cs.binghamton.edu> X-Sender: zzhang@opal To: Alfred Perlstein <bright@wintelcom.net> Cc: freebsd-fs@FreeBSD.ORG Subject: Re: Design a journalled file system In-Reply-To: <20010206135317.Z26076@fw.wintelcom.net> Message-ID: <Pine.SOL.4.21.0102062118020.21503-100000@opal> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Tue, 6 Feb 2001, Alfred Perlstein wrote: > > There's ways to reassign buffers to other vnodes, you can remove > the buffers from the vnodes at reclaim time (there has to be a hook > for this) and link them to a special vnode linked from your mount > structure. Thanks, I guess that I can write a function that steals the pages from the disappearing buffer and move it over to the new buffer that is going to replace it. 
-Zhihui To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Feb 6 18:51:38 2001 Delivered-To: freebsd-fs@freebsd.org Received: from relay.butya.kz (butya-gw.butya.kz [212.154.129.94]) by hub.freebsd.org (Postfix) with ESMTP id 5181E37B401 for <freebsd-fs@FreeBSD.ORG>; Tue, 6 Feb 2001 18:51:21 -0800 (PST) Received: by relay.butya.kz (Postfix, from userid 1000) id 5ACB129073; Wed, 7 Feb 2001 08:51:19 +0600 (ALMT) Received: from localhost (localhost [127.0.0.1]) by relay.butya.kz (Postfix) with ESMTP id 4BA9F29072; Wed, 7 Feb 2001 08:51:19 +0600 (ALMT) Date: Wed, 7 Feb 2001 08:51:19 +0600 (ALMT) From: Boris Popov <bp@butya.kz> To: Bjoern Fischer <bfischer@Techfak.Uni-Bielefeld.DE> Cc: freebsd-fs@FreeBSD.ORG Subject: Re: vnode interlock API In-Reply-To: <20010206172150.A528@frolic.no-support.loc> Message-ID: <Pine.BSF.4.21.0102070847080.4563-100000@lion.butya.kz> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Tue, 6 Feb 2001, Bjoern Fischer wrote: > > #define VI_LOCK(vp) mtx_enter(&(vp)->v_interlock, MTX_DEF) > > #define VI_UNLOCK(vp) mtx_exit(&(vp)->v_interlock, MTX_DEF) > > BTW, does this mean that -current vnode locking works sufficiently > enough to support stacked file systems a la Eric Zadok's FiST software? Hmm, I didn't see how this relates to the stacked file systems, but can say that there is mostly finished generic code to support stacked file systems. I hope to post it for review in few weeks. 
-- Boris Popov http://www.butya.kz/~bp/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Feb 7 13:26:41 2001 Delivered-To: freebsd-fs@freebsd.org Received: from smtp03.primenet.com (smtp03.primenet.com [206.165.6.133]) by hub.freebsd.org (Postfix) with ESMTP id 2A49137B65D; Wed, 7 Feb 2001 13:26:18 -0800 (PST) Received: (from daemon@localhost) by smtp03.primenet.com (8.9.3/8.9.3) id OAA27535; Wed, 7 Feb 2001 14:23:20 -0700 (MST) Received: from usr08.primenet.com(206.165.6.208) via SMTP by smtp03.primenet.com, id smtpdAAA7zaWQ1; Wed Feb 7 14:23:10 2001 Received: (from tlambert@localhost) by usr08.primenet.com (8.8.5/8.8.5) id OAA24284; Wed, 7 Feb 2001 14:26:00 -0700 (MST) From: Terry Lambert <tlambert@primenet.com> Message-Id: <200102072126.OAA24284@usr08.primenet.com> Subject: Re: vnode interlock API To: bp@butya.kz (Boris Popov) Date: Wed, 7 Feb 2001 21:26:00 +0000 (GMT) Cc: freebsd-arch@FreeBSD.ORG, freebsd-fs@FreeBSD.ORG In-Reply-To: <Pine.BSF.4.21.0102061638280.82511-100000@lion.butya.kz> from "Boris Popov" at Feb 06, 2001 05:00:03 PM X-Mailer: ELM [version 2.5 PL2] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org > So, I suggest to introduce two macro definitions which will hide > implementation details for interlocks: > > #define VI_LOCK(vp) mtx_enter(&(vp)->v_interlock, MTX_DEF) > #define VI_UNLOCK(vp) mtx_exit(&(vp)->v_interlock, MTX_DEF) > > for RELENG_4 they will look like this: > > #define VI_LOCK(vp) simple_lock(&(vp)->v_interlock) > #define VI_UNLOCK(vp) simple_unlock(&(vp)->v_interlock) > > Any comments, suggestions ? 1) Macros are good; interfaces are better. I've consistently recommended that the NFS cookie interface be rewritten to not require cookies, even though the FreeBSD/NetBSD/OpenBSD differences _could_ be masked with macros.
The issue is one of binary vs. source compatibility.

2) If you are going to wrap vnode handling, it would probably be a good idea to wrap it using the same approach that another OS uses, instead of being gratuitously different in naming. I would suggest using the Solaris names, but I will admit that doing that depends heavily on the semantics being the same (I think they would be). Worst case, pick an OS with the same semantics; if there are none, this may be an opportunity to learn from other OSs _why_ they don't have the same semantics.

3) It seems to me that the additional parameter of MTX_DEF is gratuitous, and tries to stretch mutex semantics further than they should be stretched. I personally would have no problem with the conversion of simple_{un}lock() into the equivalent mtx_*() calls. Even if the MTX_DEF can not be murdered without a large public outcry, using this as the default semantic for the simple_*() equivalents isn't really a bad idea, in my book, and could be done with inline wrappers. Best case, one could apply the WITNESS code to debugging 4.x problems, with some work.

4) You need to wrap the calls with "{ ... }"; this is because it may be useful in the future to institute turnstile or single wakeup semantics, and defining the macro as a single statement instead of a statement block would mean a potentially large amount of work to cope with the change later, whereas you already seem to plan to touch all those spots now. Again, the Solaris SMP vnode lock management macros are, I think, a good example (or at least they were, six years ago, when Solaris faced the same problem).

I have other comments, but these are the four most important ones, IMO, and I've been making a conscious effort to not clutter arguments by giving more detail than people seem to want to hear before they overflow and tune out. 8-).
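Point 4 can be illustrated with a small user-space sketch. Note the substitution: instead of bare braces it uses the common `do { ... } while (0)` idiom, which keeps a statement-block macro usable as a single statement in front of an `else`. Everything here is an assumption made for illustration: `struct fake_lock`, `fake_acquire()`, `fake_release()`, and `lock_if()` are hypothetical stand-ins, not kernel interfaces.

```c
#include <assert.h>

/* Hypothetical stand-in for the interlock; it counts acquisitions for the demo. */
struct fake_lock { int held; int acquisitions; };
struct vnode { struct fake_lock v_interlock; };

static inline void
fake_acquire(struct fake_lock *l)
{
    assert(!l->held);
    l->held = 1;
    l->acquisitions++;
}

static inline void
fake_release(struct fake_lock *l)
{
    assert(l->held);
    l->held = 0;
}

/*
 * Statement-block macros: further steps (for example wakeup handling)
 * could later be added inside the block without touching any caller.
 */
#define VI_LOCK(vp)   do { fake_acquire(&(vp)->v_interlock); } while (0)
#define VI_UNLOCK(vp) do { fake_release(&(vp)->v_interlock); } while (0)

/* The unbraced if/else compiles because each macro use is one statement. */
static int
lock_if(struct vnode *vp, int want)
{
    if (want)
        VI_LOCK(vp);
    else
        return 0;
    VI_UNLOCK(vp);
    return 1;
}
```

With bare braces, the `;` after `VI_LOCK(vp)` in `lock_if()` would end the `if` statement and orphan the `else`, which is why this sketch prefers the loop form.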
Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Feb 7 13:48:32 2001 Delivered-To: freebsd-fs@freebsd.org Received: from smtp10.phx.gblx.net (smtp10.phx.gblx.net [206.165.6.140]) by hub.freebsd.org (Postfix) with ESMTP id DD70437B6AD for <freebsd-fs@FreeBSD.ORG>; Wed, 7 Feb 2001 13:48:14 -0800 (PST) Received: (from daemon@localhost) by smtp10.phx.gblx.net (8.9.3/8.9.3) id OAA17450; Wed, 7 Feb 2001 14:47:39 -0700 Received: from usr08.primenet.com(206.165.6.208) via SMTP by smtp10.phx.gblx.net, id smtpdwci6aa; Wed Feb 7 14:47:33 2001 Received: (from tlambert@localhost) by usr08.primenet.com (8.8.5/8.8.5) id OAA25001; Wed, 7 Feb 2001 14:48:05 -0700 (MST) From: Terry Lambert <tlambert@primenet.com> Message-Id: <200102072148.OAA25001@usr08.primenet.com> Subject: Re: Design a journalled file system To: sam@errno.com (Sam Leffler) Date: Wed, 7 Feb 2001 21:48:05 +0000 (GMT) Cc: zzhang@cs.binghamton.edu (Zhiui Zhang), freebsd-fs@FreeBSD.ORG In-Reply-To: <0e9101c09086$5ca812b0$24a6d4d1@melange> from "Sam Leffler" at Feb 06, 2001 01:47:11 PM X-Mailer: ELM [version 2.5 PL2] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org > If you really want to work on another filesystem, learn about/from SGI's > XFS. They've made a GPL'd version for Linux version available for public > ftp. Unfortunately, this license means that it can not be distributed compiled into a FreeBSD kernel, since clause 6 of the GPL will specifically prohibit such distribution. 
The upshot of this is that it can never be the default FS used to boot FreeBSD, out of the box, nor to install by default, since the module would have to be loaded from an FS which the system can not understand until after it has loaded the module. Historically, the solution that is often suggested for this second problem is to use a simpler boot FS that the kernel understands (Xenix, SCO UNIX, and SVR4 have all taken this approach), but doing this makes the bootfs a single point of failure for boot, and therefore the increased MTBF that supposedly comes from using an advanced FS does nothing for the overall MTBF. In other words, the SGI XFS is an interesting curiosity, and may or may not be a useful reference implementation for another work, but it can never be used in a commercially usable OS, for which source code is inconvenient or impossible to distribute (even SGI can not take modifications made to repair bugs in the Linux version, without having to place all of IRIX under the GPL -- this they can not do, since IRIX contains code that was licensed from vendors who are not anxious to have their property given away free). I rather suspect that the GPL was intentionally chosen by SGI to permit them to jump on the Linux/Open Source bandwagon, without exposing them to the risk of a commercial organization which competes with SGI being able to benefit from the technology being released (QNX, Windows NT, and Solaris are all obvious candidates for this anticompetitive practice). Conclusion: Creating a truly free journalled FS implementation, even if it were to end up being bidirectionally data-compatible with XFS disks, is a worthy project. Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers.
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Feb 7 13:56:50 2001 Delivered-To: freebsd-fs@freebsd.org Received: from mail.integratus.com (unknown [63.209.2.83]) by hub.freebsd.org (Postfix) with SMTP id B77FF37B6AD for <freebsd-fs@FreeBSD.ORG>; Wed, 7 Feb 2001 13:56:32 -0800 (PST) Received: (qmail 3611 invoked from network); 7 Feb 2001 21:56:32 -0000 Received: from kungfu.integratus.com (HELO integratus.com) (172.20.5.168) by tortuga1.integratus.com with SMTP; 7 Feb 2001 21:56:32 -0000 Message-ID: <3A81C490.598F7EB7@integratus.com> Date: Wed, 07 Feb 2001 13:56:32 -0800 From: Jack Rusher <jar@integratus.com> Organization: http://www.integratus.com/ X-Mailer: Mozilla 4.73 [en] (X11; I; Linux 2.2.12 i386) X-Accept-Language: en MIME-Version: 1.0 To: Terry Lambert <tlambert@primenet.com> Cc: Sam Leffler <sam@errno.com>, Zhiui Zhang <zzhang@cs.binghamton.edu>, freebsd-fs@FreeBSD.ORG Subject: Re: Design a journalled file system References: <200102072148.OAA25001@usr08.primenet.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Terry Lambert wrote: > > Unfortunately, this license means that it can not be distributed > compiled into a FreeBSD kernel, since clause 6 of the GPL will > specifically prohibit such distribution. I have been wondering about this legal issue lately. What is the law with regards to implementing XFS as a KLM for FreeBSD & shipping the source in contrib? It won't help people who are trying to make commercial products with embedded FreeBSD, but it might be useful for sysadmins. > point of failure for boot, and therefore the increased MTBF > that supposedly comes from using an advanced FS does nothing > for the overall MTBF. Mirror the boot partition with vinum? 
> I rather suspect that the GPL was intentionally chosen by SGI > to permit them to jump on the Linux/Open Source bandwagon, > without exposing them to the risk of a commercial organization > which competes with SGI being able to benefit from the technology This is unquestionably true. I have word from some of the architects who helped design XFS that this was exactly the reason GPL was chosen over the BSD license. -- Jack Rusher, Senior Engineer | mailto:jar@integratus.com Integratus, Inc. | http://www.integratus.com To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Feb 7 14:10:17 2001 Delivered-To: freebsd-fs@freebsd.org Received: from smtp03.primenet.com (smtp03.primenet.com [206.165.6.133]) by hub.freebsd.org (Postfix) with ESMTP id CDBAC37B401 for <freebsd-fs@FreeBSD.ORG>; Wed, 7 Feb 2001 14:09:57 -0800 (PST) Received: (from daemon@localhost) by smtp03.primenet.com (8.9.3/8.9.3) id PAA13495; Wed, 7 Feb 2001 15:06:59 -0700 (MST) Received: from usr08.primenet.com(206.165.6.208) via SMTP by smtp03.primenet.com, id smtpdAAAq5aisA; Wed Feb 7 15:06:49 2001 Received: (from tlambert@localhost) by usr08.primenet.com (8.8.5/8.8.5) id PAA25657; Wed, 7 Feb 2001 15:09:43 -0700 (MST) From: Terry Lambert <tlambert@primenet.com> Message-Id: <200102072209.PAA25657@usr08.primenet.com> Subject: Re: Design a journalled file system To: zzhang@cs.binghamton.edu (Zhiui Zhang) Date: Wed, 7 Feb 2001 22:09:43 +0000 (GMT) Cc: freebsd-fs@FreeBSD.ORG In-Reply-To: <Pine.SOL.4.21.0102061544230.6584-100000@opal> from "Zhiui Zhang" at Feb 06, 2001 04:15:45 PM X-Mailer: ELM [version 2.5 PL2] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org > I am considering the design of a journalled file system in FreeBSD. 
I > think each transaction corresponds to a file system update operation and > will therefore consists of a list of modified buffers. The important > thing is that these buffers should not be written to disk until they have > been logged into the log area. To do so, we need to pin these buffers in > memory for a while. The concept should be simple, but I run into a problem > which I have no idea how to solve it: > > If you access a lot of files quickly, some vnodes will be reused. These > vnodes can contain buffers that are still pinned in the memory because of > the write-ahead logging constraints. After a vnode is gone, we have > no way to recover its buffers. Note that whenever we need a new vnode, we > are in the process of creating a new file. At this point, we can not flush > the buffers to the log area. The result is a deadlock. > > I could make copies of the buffers that are still pinned, but that incurs > memory copy and need buffer headers, which is also a rare resource. > > The design is similar to ext3fs of linux (they do not seem to have a vnode > layer and they use device + physical block number instead of vnode + > logical block number to index buffers, which, I guess, means that buffers > can exist after the inode is gone). I know Mckusick has a paper on > journalling FFS, but I just want to know if this design can work or not. Soft updates provides this guarantee. It's one approach. If you look at the Ganger/Patt paper, it's pretty obvious that the solution to the graph dependency problem could be generalized. This would let you externalize hooks into the graph, so that you could have dependencies span stacking layers, or so that you could externalize a transaction interface to user space, or so that you could implement a distributed cache coherency protocol, over a network transport, on the bottom end.
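A minimal sketch of the write-ahead constraint described in the quoted question, written as a user-space Python model rather than kernel code. The names `Buffer`, `Journal`, `modify`, `commit` and `write_in_place` are hypothetical and are not part of any FreeBSD interface. The sketch assumes the journal keeps its own list of pinned buffers, so a pinned buffer stays reachable without going through a vnode.

```python
class Buffer:
    """A cached disk block (hypothetical model, not a kernel struct buf)."""

    def __init__(self, blkno, data=b""):
        self.blkno = blkno
        self.data = data
        self.pinned = False  # True until its log record reaches the log area


class Journal:
    """Toy write-ahead log that owns the list of pinned buffers."""

    def __init__(self):
        self.pending = []   # buffers modified by not-yet-committed updates
        self.log_area = []  # (blkno, data) records that are "on disk"
        self.disk = {}      # home locations: blkno -> data

    def modify(self, buf, data):
        """Record an update; the buffer is pinned until commit()."""
        buf.data = data
        buf.pinned = True
        if not any(b is buf for b in self.pending):
            self.pending.append(buf)

    def commit(self):
        """Write every pending record to the log area, then unpin."""
        for buf in self.pending:
            self.log_area.append((buf.blkno, buf.data))
            buf.pinned = False
        self.pending.clear()

    def write_in_place(self, buf):
        """Write a buffer to its home location; refused while pinned."""
        if buf.pinned:
            return False
        self.disk[buf.blkno] = buf.data
        return True
```

In this model a buffer can only reach its home location after `commit()` has put its record in the log area, which is the ordering rule the question states. It does not model vnode reuse or the deadlock itself.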
In the limit, though, it means that you should think of an FS in terms of a set of ordered metadata and data transactions, and then simply ensure that transactions are handled in sufficient order ("sufficient" means that FFS can lose data, but never become inconsistent; a journalled FS would not have this luxury). For journalling, this is a slightly tougher problem, since you must include the idea of data consistency, not just metadata consistency, but the problem is not insoluble. Starting from first principles, you should look at the transactions you intend to support. You should probably _not_ commit to a storage paradigm (e.g. "... similar to ext3fs of Linux ... "), until _after_ you have mapped out the operations, and what they imply about conflict domains (e.g. several objects in one disk block, or one page, which is what leads to much of the complexity of the FFS soft updates implementation). Probably the first thing you will notice is that the VOP_ABORT semantics are horribly broken: I noticed the same thing, when looking at implementing a writeable NTFS for Windows 95/98/2000, using the Heidemann framework ported from FreeBSD. I would say that you were also constrained by POSIX guaranteed semantics, though it would be convenient to be able to turn most of these off, to avoid vnode/data seeks, though this is an anecdotal conclusion from some recent literature (don't trust it until you can conclude what the effect will be under non-single-threaded FS load). NB: I was unable to convince either Ganger or McKusick of the idea of generalization, where on mount you register conflict resolvers into a dependency graph, which you maintain as stacking is done and undone, and VOPs are added and removed. Both cited different reasons for objecting. Kirk objected to what he saw as a larger in-core dependency accounting storage requirement. IMO, Kirk's reasons were not really correct, since any given dependency could be expressed and resolved using the same structures.
I was unable to provide a proof of concept due to license issues, which I very well understand Kirk wanting to enforce at the time. Gregory had different objections, which I laid off to familiarity with graph theory (you _can_ maintain a running accounting of transitive closure over a graph, particularly one that doesn't change except on mount or unmount), but I wouldn't dismiss either of them on the basis of their gut feelings (I trust mine, but they trust theirs, which is right for them to do). That aside, even if you don't do a generalized implementation, the approach of considering an FS in terms of transactions (events) is still sound, and I think most modern FS researchers would agree with the approach, even if they did not agree on implementation. Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Feb 7 15:23:24 2001 Delivered-To: freebsd-fs@freebsd.org Received: from VL-MS-MR002.sc1.videotron.ca (relais.videotron.ca [24.201.245.36]) by hub.freebsd.org (Postfix) with ESMTP id 71B2A37B6C3; Wed, 7 Feb 2001 15:23:04 -0800 (PST) Received: from jehovah ([24.201.144.31]) by VL-MS-MR002.sc1.videotron.ca (Netscape Messaging Server 4.15) with SMTP id G8EUAA03.L5I; Wed, 7 Feb 2001 18:22:58 -0500 Message-ID: <002e01c0915d$326a7ec0$1f90c918@jehovah> From: "Bosko Milekic" <bmilekic@technokratis.com> To: "Terry Lambert" <tlambert@primenet.com>, "Boris Popov" <bp@butya.kz> Cc: <freebsd-arch@FreeBSD.ORG>, <freebsd-fs@FreeBSD.ORG> References: <200102072126.OAA24284@usr08.primenet.com> Subject: Re: vnode interlock API Date: Wed, 7 Feb 2001 18:25:02 -0500 MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 5.00.2919.6700 X-MimeOLE: Produced By
Microsoft MimeOLE V5.00.2919.6700 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Terry Lambert wrote: [...] > 3) It seems to mee that the additional parameter of MTX_DEF is > gratuitous, and tries to stretch mutex semantics further > than they should be stretched. I personally would have no > problem with the conversion of simple_{un}lock() into the > equivalent mtx_*() calls. Even if the MTX_DEF can not be > murdered without a large public outcry, using this as the Actually, it has been murdered: http://people.freebsd.org/~bmilekic/code/mutex_cleanup-7.1.diff Presently under testing. > the default demantic for the simple_*() equivalents isn't > really a bad idea, in my book, and could be done with > inline wrappers. Best case, one could apply the WITNESS > code to debugging 4.x problems, with some work. [...] > > Terry Lambert > terry@lambert.org > --- > Any opinions in this posting are my own and not those of my present > or previous employers. Regards, Bosko. 
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Feb 7 15:24: 0 2001 Delivered-To: freebsd-fs@freebsd.org Received: from smtp04.primenet.com (smtp04.primenet.com [206.165.6.134]) by hub.freebsd.org (Postfix) with ESMTP id 411A637B65D for <freebsd-fs@FreeBSD.ORG>; Wed, 7 Feb 2001 15:23:41 -0800 (PST) Received: (from daemon@localhost) by smtp04.primenet.com (8.9.3/8.9.3) id QAA17710; Wed, 7 Feb 2001 16:18:19 -0700 (MST) Received: from usr08.primenet.com(206.165.6.208) via SMTP by smtp04.primenet.com, id smtpdAAAsva4DI; Wed Feb 7 16:18:07 2001 Received: (from tlambert@localhost) by usr08.primenet.com (8.8.5/8.8.5) id QAA27692; Wed, 7 Feb 2001 16:23:23 -0700 (MST) From: Terry Lambert <tlambert@primenet.com> Message-Id: <200102072323.QAA27692@usr08.primenet.com> Subject: Re: Design a journalled file system To: jar@integratus.com (Jack Rusher) Date: Wed, 7 Feb 2001 23:23:17 +0000 (GMT) Cc: tlambert@primenet.com (Terry Lambert), sam@errno.com (Sam Leffler), zzhang@cs.binghamton.edu (Zhiui Zhang), freebsd-fs@FreeBSD.ORG Reply-To: freebsd-chat@FreeBSD.ORG In-Reply-To: <3A81C490.598F7EB7@integratus.com> from "Jack Rusher" at Feb 07, 2001 01:56:32 PM X-Mailer: ELM [version 2.5 PL2] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org > > Unfortunately, this license means that it can not be distributed > > compiled into a FreeBSD kernel, since clause 6 of the GPL will > > specifically prohibit such distribution. > > I have been wondering about this legal issue lately. What is the law > with regards to implementing XFS as a KLM for FreeBSD & shipping the > source in contrib? It won't help people who are trying to make > commercial products with embedded FreeBSD, but it might be useful for > sysadmins. You won't be able to boot from it, unless you compile your own kernel. 
This was pretty much the Soft Updates status, until recently. The problem with the GPL clause 6 is that it prohibits any additional restrictions, and requiring the distribution of another license, even if it does not otherwise conflict, is a restriction on what can be done with the code. Without that other license, the right granted to you to use the code in question doesn't exist, since it is the license which was the origin of the grant. Like Matt Dillon and Best Internet did with the Soft Updates code, a local administrator could use it, but it could not be distributed in a usable form. Actually, this brings up a separate sticky legal point, which is how the assets of Best Internet were transferred when it was sold, since I assume that the machines that had Soft Updates on them kept Soft Updates on them. I suppose that the new owners could have rebuilt the kernels on all the machines, getting identical kernels, after first booting to a non-Soft Updates kernel for the transfer of legal possession. Distribution of a binary kernel module would really depend on whether you could get away with treating a kernel as a library, under the GPL allowing the linking of GPL'ed programs against system libraries. You have to wonder if a kernel module is a program or just a program component, with the kernel being the program. BeOS side-steps this for non-boot drivers by running the driver in a user space process, so it's provably a program. Anyway, that's the kind of hoop-jumping that you _could_ do to get around the problem (maybe). I have no idea what the transfer of ownership clauses in the GPL would do if a company were to IPO, for example, or what the concept of "publicly held" would mean in that context (since anyone who holds the ownership of the software can demand the source, and the source itself is not legal to distribute, under the conflicting licenses). Not really my problem, though, since I tend to try to avoid just this sort of entanglement.
So did IBM, when I was working for them. 8-). [ ... boot MTBF ... ] > Mirror the boot partition with vinum? I'm not sure this works yet. Hardware RAID mirroring certainly would, since it'd have to deal with the BIOS boot device issue. > > I rather suspect that the GPL was intentionally chosen by SGI > > to permit them to jump on the Linux/Open Source bandwagon, > > without exposing them to the risk of a commercial organization > > which competes with SGI being able to benefit from the technology > > This is unquestionably true. I have word from some of the architects > who helped design XFS that this was exactly the reason GPL was chosen > over the BSD license. I had a pretty long discussion with their V.P. of engineering, who made the decision (they have a number of "V.P. of engineering" lying around). He didn't come out and say the same thing, and I really didn't attribute it to that, since it means that any bug fixes are GPL-code derived, and therefore also GPL. That would mean that they really don't expect any useful work to come out of the Linux community, or that they expected people to just sign over rights to anything interesting, which I think would be a bit naive, to say the least. FYI: Followups set to -chat... Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers.
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Feb 7 15:41: 0 2001 Delivered-To: freebsd-fs@freebsd.org Received: from bingnet2.cc.binghamton.edu (bingnet2.cc.binghamton.edu [128.226.1.18]) by hub.freebsd.org (Postfix) with ESMTP id D787B37B503 for <freebsd-fs@FreeBSD.ORG>; Wed, 7 Feb 2001 15:40:40 -0800 (PST) Received: from onyx (onyx.cs.binghamton.edu [128.226.140.171]) by bingnet2.cc.binghamton.edu (8.11.2/8.11.2) with ESMTP id f17NeWI21997; Wed, 7 Feb 2001 18:40:32 -0500 (EST) Date: Wed, 7 Feb 2001 18:40:21 -0500 (EST) From: Zhiui Zhang <zzhang@cs.binghamton.edu> X-Sender: zzhang@onyx To: Terry Lambert <tlambert@primenet.com> Cc: freebsd-fs@FreeBSD.ORG Subject: Re: Design a journalled file system In-Reply-To: <200102072209.PAA25657@usr08.primenet.com> Message-ID: <Pine.SOL.4.21.0102071833210.3918-100000@onyx> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Thanks for your email! Even if I think I have a fairly good understanding of the FFS code (not soft-update) by actually studying/modifying the code, I still have a long way to go to understand the bigger picture which you have described. -Zhihui On Wed, 7 Feb 2001, Terry Lambert wrote: > > I am considering the design of a journalled file system in FreeBSD. I > > think each transaction corresponds to a file system update operation and > > will therefore consists of a list of modified buffers. The important > > thing is that these buffers should not be written to disk until they have > > been logged into the log area. To do so, we need to pin these buffers in > > memory for a while. The concept should be simple, but I run into a problem > > which I have no idea how to solve it: > > > > If you access a lot of files quickly, some vnodes will be reused. 
These > > vnodes can contain buffers that are still pinned in the memory because of > > the write-ahead logging constraints. After a vnode is gone, we have > > no way to recover its buffers. Note that whenever we need a new vnode, we > > are in the process of creating a new file. At this point, we can not flush > > the buffers to the log area. The result is a deadlock. > > > > I could make copies of the buffers that are still pinned, but that incurs > > memory copy and need buffer headers, which is also a rare resource. > > > > The design is similar to ext3fs of linux (they do not seem to have a vnode > > layer and they use device + physical block number instead of vnode + > > logical block number to index buffers, which, I guess, means that buffers > > can exist after the inode is gone). I know Mckusick has a paper on > > journalling FFS, but I just want to know if this design can work or not. > > Soft updates provides this guarantee. It's one approach. > > If you look at the Ganger/Patt paper, it's pretty obvious that > the soloution to the graph dependency problem could be generalized. > > This would let you externalize hooks into the graph, so that you > yould have dependencies span stacking layers, or so that you could > externalize a transation interface to user space, or so that you > could implement a distributed cache coherency protocol, over a > network transport, on the bottom end. > > > In the limit, though, it means that you should think of an FS in > terms of a set of ordered metadata and data transactions, and then > simply ensure that transactions are handled in sufficient order > ("sufficient" means that FFS can lose data, but never become > inconsistant; a journalled FS would not have this luxury). > > For journalling, this is a slightly tougher problem, since you > must include the idea of data consistency, not just metadata > consistency, but the problem is not insoluable. 
> > Starting from first principles, you should look at the transactions > you intend to support. You should probably _not_ commit to a > storage paradigm (e.g. "... similar to ext3fs of Linux ... "), > until _after_ you have mapped out the operations, and what they > imply about conflict domains (e.g. several objects in one disk > block, or one page, which is what leads to much of the complexity > of the FFS soft updates implementation). > > Probably the first thing you will notice is that the VOP_ABORT > semantics are horribly broken: I noticed the same thing, when > looking at implementing a writeable NTFS for Windows 95/98/2000, > using the Heidemann framework ported from FreeBSD. > > I would say that you were also constrained by POSIX guaranteed > semantics, though it would be convenient to be able to turn most > of these off, to avoid vnode/data seeks, though this is an anecdotal > conclusion from some recent literature (don't trust it until you > can conclude what the effect will be under non-single-threaded FS > load). > > > NB: I was unable to convince either Ganger or McKusick of the idea > of generalization, where on mount you register conflict resolvers > into a dependency graph, which you maintain as stacking is done and > undone, and VOPs are added and removed. Both cited different > reasons for objecting. Kirk objected to what he saw as a larger > in-core dependency accounting storage requirement. IMO, Kirk's > reasons were not really correct, since any given dependency could > be expressed and resolved using the same structures. I was unable > to provide a proof of concept due to license issues, which I very > well understand Kirk wanting to enforce at the time. 
Gregory had > different objections, which I laid off to familiarity with graph > theory (you _can_ maintain a running accounting of transitive > colsure over a graph, particularly one that doesn't change except > on mount or unmount), but I wouldn't dismiss either of them on > the basis of their gut feelings (I trust mine, but they trust > theirs, which is right for them to do). > > That aside, even if you don't do a generalized implementation, the > approach of considering an FS in terms of transactions (events) is > still sound, and I think most modern FS researchers would agree with > the approach, even if they did not agree on implementation. > > > Terry Lambert > terry@lambert.org To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Feb 7 23: 3:49 2001 Delivered-To: freebsd-fs@freebsd.org Received: from smtp1b.mail.yahoo.com (smtp3.mail.yahoo.com [128.11.68.135]) by hub.freebsd.org (Postfix) with SMTP id 0BA5B37B491 for <fs@freebsd.org>; Wed, 7 Feb 2001 23:03:28 -0800 (PST) Received: from nat-198-95-226-208.netapp.com (HELO fdevijvelap) (198.95.226.208) by smtp.mail.vip.suc.yahoo.com with SMTP; 8 Feb 2001 08:10:21 -0000 X-Apparently-From: <fdevijve@yahoo.com> Message-ID: <05cd01c0919c$c77b0db0$1fc9a8c0@europe.netapp.com> From: "fab" <fdevijve@yahoo.com> To: "Matt Dillon" <dillon@earth.backplane.com>, "Mike Smith" <msmith@freebsd.org> Cc: "Michael C . 
Wu" <keichii@iteration.net>, "Mitch Collinsworth" <mitch@ccmr.cornell.edu>, <hackers@FreeBSD.ORG>, <fs@FreeBSD.ORG> References: <200102052052.f15KqOe00985@mass.dis.org> Subject: Re: Extremely large (70TB) File system/server planning Date: Thu, 8 Feb 2001 07:36:30 +0100 MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 5.00.2314.1300 X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Hi all, it's true that filers can't exceed 6TB, but you can easily get pretty good performance with them. If you go with an EMC or IBM box, you will have to manage things that are not your job (I/O, for example). I think NetApp can be a very simple solution (where other vendors sell complexity). Thanks Fab. ----- Original Message ----- From: Mike Smith <msmith@freebsd.org> To: Matt Dillon <dillon@earth.backplane.com> Cc: Michael C . Wu <keichii@iteration.net>; Mitch Collinsworth <mitch@ccmr.cornell.edu>; <hackers@FreeBSD.ORG>; <fs@FreeBSD.ORG> Sent: Monday, February 05, 2001 9:52 PM Subject: Re: Extremely large (70TB) File system/server planning > > > > :| > The files are accessed approximately 3 or 4 times a day on average. > > :| > Older files are archived for reference purpose and may never > > :| > be accessed after a week. > > :| > > :| Ok, this is a start. Now is the 70 TB the size of the active files? > > :| Or does that also include the older archived files that may never be > > :| accessed again? > > :70TB is the size of the sum of all files, access or no access. > > :(They still want to maintain accessibility even though the chances are slim.) > ... > > This doesn't sound like something you can just throw together with > > off-the-shelf PCs and still have something reliable to show for it.
> > You need a big honking RAID system - maybe a NetApp, maybe something > > else. You have to look at the filesystem and file size limitations > > of the unit and the client(s). > > You can't do this with a NetApp either; they max out at about 6TB now > (going up to around 12 or so soon). You might want to talk to EMC and/or > IBM, both of whom have *extremely* large filers. > > Your friend may also want to look at Traakan, who have a novel product in > this space. > > -- > ... every activity meets with opposition, everyone who acts has his > rivals and unfortunately opponents also. But not because people want > to be opponents, rather because the tasks and relationships force > people to take different points of view. [Dr. Fritz Todt] > V I C T O R Y N O T V E N G E A N C E > > > > > To Unsubscribe: send mail to majordomo@FreeBSD.org > with "unsubscribe freebsd-hackers" in the body of the message _________________________________________________________ Do You Yahoo!? Get your free @yahoo.com address at http://mail.yahoo.com To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Feb 8 23: 6: 8 2001 Delivered-To: freebsd-fs@freebsd.org Received: from lips.borg.umn.edu (lips.borg.umn.edu [160.94.232.50]) by hub.freebsd.org (Postfix) with ESMTP id 4692B37B491; Thu, 8 Feb 2001 23:05:40 -0800 (PST) Received: from thebarn.com (nic-31-c12-219.mn.mediaone.net [24.31.12.219]) by lips.borg.umn.edu (8.11.2/8.10.1) with ESMTP id f1975Zb30917; Fri, 9 Feb 2001 01:05:36 -0600 (CST) Message-ID: <3A8396B9.CA8C09E4@thebarn.com> Date: Fri, 09 Feb 2001 01:05:29 -0600 From: Russell Cattelan <cattelan@thebarn.com> X-Mailer: Mozilla 4.74 [en] (X11; U; Linux 2.2.12 i386) X-Accept-Language: en MIME-Version: 1.0 To: freebsd-chat@FreeBSD.ORG Cc: Jack Rusher <jar@integratus.com>, Terry Lambert <tlambert@primenet.com>, Sam Leffler <sam@errno.com>, Zhiui Zhang <zzhang@cs.binghamton.edu>, freebsd-fs@FreeBSD.ORG 
Subject: Re: Design a journalled file system References: <200102072323.QAA27692@usr08.primenet.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Terry Lambert wrote: > {...} > > > > I rather suspect that the GPL was intentionally chosen by SGI > > > to permit them to jump on the Linux/Open Source bandwagon, > > > without exposing them to the risk of a commercial organization > > > which competes with SGI being able to benefit from the technology > > > > This is unquestionably true. I have word from some of the architects > > who helped design XFS that this was exactly the reason GPL was chosen > > over the BSD license. > > I had a pretty long discussion with their V.P. of engineering, > who made the decision (they have a number of "V.P. of engineering" > lying around). He didn't come out and say the same thing, and I > really didn't attribute it to that, since it means that any bug > fixes are GPL-code derived, and therefore also GPL. That would > mean that they really don't expect any useful work to come out of > the Linux community, or that they expected people to just sign > over rights to anything interesting, which I think would be a bit > naieve, to say the least. I'm not sure who you talked with? but it really it that simple. The reason the GPL was chosen for XFS. It's the license Linux is using, and since the port is being done for Linux it makes sense. SGI is also doing work with the XFree code, the work is being released under the X license (which is also an anti GPL license). SGI is basically matching license for licensee to whatever project they are contributing to. This from the lawyer that is doing all the open source work. I have stated this in the past but I will bring it up again. If sufficient momentum can be generated toward an fbsd port of XFS, it may be possible to go to the lawyers and have a another license drawn up. 
But unless the bsd community can show they are serious about XFS being ported it would be a waste of time to ask for something that SGI has very little business interest in doing. Note Darwin might be a big win in terms of making a business case for another platform. The license shouldn't be that big of an issue. Lots of fbsd uses GPL'ed code... hmm gcc for example. Let's get to the point where XFS is in such demand on fbsd we can get a petition going if necessary to have the license updated. BTW if anybody is interested a few of us have started looking at actually doing the port. Not much has been done at this point... basically battling through header file cleanup. Ohh one other comment: The only time SGI may ask for a copyright reassignment is if the contributed code affects the filesystem compatibility between irix and linux. This would have to be a major contribution before something like this would be an issue, and some negotiation will most certainly be involved. Up to this point all bug fixes have been linux related only so it really isn't an issue. This isn't SGI trying to be an ass... rather SGI trying to provide the most compatible FS it can within the constraints of many legal issues.
-- Russell Cattelan cattelan@thebarn.com To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Feb 8 23:13:11 2001 Delivered-To: freebsd-fs@freebsd.org Received: from lips.borg.umn.edu (lips.borg.umn.edu [160.94.232.50]) by hub.freebsd.org (Postfix) with ESMTP id 1C66637B401 for <freebsd-fs@FreeBSD.ORG>; Thu, 8 Feb 2001 23:12:53 -0800 (PST) Received: from thebarn.com (nic-31-c12-219.mn.mediaone.net [24.31.12.219]) by lips.borg.umn.edu (8.11.2/8.10.1) with ESMTP id f197Cpb30997; Fri, 9 Feb 2001 01:12:51 -0600 (CST) Message-ID: <3A83986E.55789E59@thebarn.com> Date: Fri, 09 Feb 2001 01:12:46 -0600 From: Russell Cattelan <cattelan@thebarn.com> X-Mailer: Mozilla 4.74 [en] (X11; U; Linux 2.2.12 i386) X-Accept-Language: en MIME-Version: 1.0 To: Zhiui Zhang <zzhang@cs.binghamton.edu> Cc: freebsd-fs@FreeBSD.ORG Subject: Re: Design a journalled file system References: <Pine.SOL.4.21.0102061544230.6584-100000@opal> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Zhiui Zhang wrote: > I am considering the design of a journalled file system in FreeBSD. I > think each transaction corresponds to a file system update operation and > will therefore consists of a list of modified buffers. The important > thing is that these buffers should not be written to disk until they have > been logged into the log area. To do so, we need to pin these buffers in > memory for a while. The concept should be simple, but I run into a problem > which I have no idea how to solve it: > > If you access a lot of files quickly, some vnodes will be reused. These > vnodes can contain buffers that are still pinned in the memory because of > the write-ahead logging constraints. After a vnode is gone, we have > no way to recover its buffers. Note that whenever we need a new vnode, we > are in the process of creating a new file. 
At this point, we can not flush > the buffers to the log area. The result is a deadlock. XFS: All pinned buffers are kept on a queue to be flushed by a daemon that walks the queue looking for buffers that have recently become unlocked and unpinned. > > > I could make copies of the buffers that are still pinned, but that incurs > memory copy and need buffer headers, which is also a rare resource. > > The design is similar to ext3fs of linux (they do not seem to have a vnode > layer and they use device + physical block number instead of vnode + > logical block number to index buffers, which, I guess, means that buffers > can exist after the inode is gone). I know Mckusick has a paper on Yup. All metadata buffers use an absolute device offset. > journalling FFS, but I just want to know if this design can work or not. > > Any ideas? Thanks for your help! > > -Zhihui > > To Unsubscribe: send mail to majordomo@FreeBSD.org > with "unsubscribe freebsd-fs" in the body of the message -- Russell Cattelan cattelan@thebarn.com To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Feb 9 0:57:26 2001 Delivered-To: freebsd-fs@freebsd.org Received: from smtp05.primenet.com (smtp05.primenet.com [206.165.6.135]) by hub.freebsd.org (Postfix) with ESMTP id 6D41F37B503; Fri, 9 Feb 2001 00:56:51 -0800 (PST) Received: (from daemon@localhost) by smtp05.primenet.com (8.9.3/8.9.3) id BAA18553; Fri, 9 Feb 2001 01:51:58 -0700 (MST) Received: from usr08.primenet.com(206.165.6.208) via SMTP by smtp05.primenet.com, id smtpdAAA5RaOkK; Fri Feb 9 01:51:48 2001 Received: (from tlambert@localhost) by usr08.primenet.com (8.8.5/8.8.5) id BAA08304; Fri, 9 Feb 2001 01:56:29 -0700 (MST) From: Terry Lambert <tlambert@primenet.com> Message-Id: <200102090856.BAA08304@usr08.primenet.com> Subject: Re: Design a journalled file system To: cattelan@thebarn.com (Russell Cattelan) Date: Fri, 9 Feb 2001 08:56:29 +0000 (GMT) Cc:
freebsd-chat@FreeBSD.ORG, jar@integratus.com (Jack Rusher), tlambert@primenet.com (Terry Lambert), sam@errno.com (Sam Leffler), zzhang@cs.binghamton.edu (Zhiui Zhang), freebsd-fs@FreeBSD.ORG In-Reply-To: <3A8396B9.CA8C09E4@thebarn.com> from "Russell Cattelan" at Feb 09, 2001 01:05:29 AM X-Mailer: ELM [version 2.5 PL2] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org OK, this is not a license war. I will lay it on the line. I am offering to do a preliminary port of the XFS code, potentially to the point of minimally a read-only mount, and perhaps much further, depending on the effort required. The resulting code will have some nasty strings, based on me assuming your comments are correct, and wanting some guarantees on that, on my part. The strings go away when your claims to SGI's actions are met. Below is my reply to your message, including the philosophical basis for the strings, a description of the strings, and the details of my offer. This offer is good for a starting date before 01 March 2001. -- > I'm not sure who you talked with? but it really it that simple. Vijay. The V.P. of Engineering at SGI who negotiated the release of the code. I will quote one of his statements, made to me in email: ] Can't you just relicense FreeBSD [under the GPL]? > The reason the GPL was chosen for XFS. > It's the license Linux is using, and since > the port is being done for Linux it makes sense. One would think a dual license. Alternately, one would think they would use the LGPL, which would let people link it into their kernels, as long as they gave source (which BSD does) or otherwise permitted relinking. In other words, the GPL is not really an optimal license, if they wanted wide use AND specific Linux license compatability. 
I concluded from their choice that they were not going for wide use, but instead wanted the marketing benefit of being associated with Linux (lots of press, etc.). > SGI is also doing work with the XFree code, the work > is being released under the X license (which is also > an anti GPL license). The BSD and MIT licenses predate the GPL, so careful with the word "anti" there... > SGI is basically matching license for licensee to > whatever project they are contributing to. > This from the lawyer that is doing all the open source work. Rather than the use to which the software is put; that's a bit naieve, then, again. > I have stated this in the past but I will bring it up again. > If sufficient momentum can be generated toward an fbsd port > of XFS, it may be possible to go to the lawyers and have a another > license drawn up. If we had it in writing that the code would be released under a license usable by the BSD kernel, preferrably "matching license for license", as you state, then we would commit to do the work. The problem we have is that the code under the current license is useless to us, and unless we can be ensured that the code we write to glue it in won't end up also being useless to us, there is really no reason to commit the effort. > But unless the bsd community can show they are serious about > XFS being ported it would be a waste of time to ask for > something that SGI has very little business interesting in doing. So if we were to do a port, then SGI would have a business interest, and would relicense the code? Can we have that in writing? > Note Darwin might be a big win in terms of making a business case > for another platform. Darwin support would be automatic, with a FreeBSD port. Darwin can use FreeBSD FS code, unmodified. > The license shouldn't be that big of an issue. It shouldn't, but it is. 
I would have been ecstatic to use XFS in the Whistle InterJet, as a means of getting rid of the need for a UPS; as a technology for doing exactly that, it's superior to Soft Updates (Soft Updates has other valuable attributes, but that was the one we were interested in obtaining). The is not a chance in hell of IBM shipping a product based on code without a license grant in perpetuity already locked in a vault. > Lots of fbsd uses GPL'ed code... hmm gcc for example. FreeBSD _utilizes_ this code, it does not _use_ it. The gcc code can be diked out of a FreeBSD system, without crippling the utility of the system. In an embeded product, that code _is_ diked out. There is no gcc code linked into the FreeBSD kernel. > Let get to the point were XFS is in such demand on fbsd > we can get a petition going if necessary to have the license > updated. Demand is very different; it is an aspect of marketing. How much demand do you want, and where do you want it directed? I believe that it would be a trivial exercise to generate as much demand as you require. > BTW if anybody is interested a few of us have started looking > at actually doing the port. Not much has been done at this > point... basically battling through header file cleanup. If you have your head wrapped around it already, file system code is really very trivial, particularly if you have code that already works in one environment, and are merely porting it. I'll tell you what: give me a pointer to the code without the Linux modifications, so that I won't inadvertantly include code that is derived from GPL'ed code, and I will create a FreeBSD port of the code, with all code additions, which will compile and link successfully in a FreeBSD kernel, in a matter of a few days. I will additionally require an image of an XFS FS on a floppy disk, which I can use for compatability testing. 
There should be one file with an example of each thing the FS is capable of representing, including a directory, a directory with a subdirectory, a file, and a directory with two files; the files should be short, but if immediate files exist, one should be long enough to trigger indirection. It would be most useful if the image were zero'ed before it was created, so I am able to distinguish XFS written data from "blank floppy" contents (and to aid compression of the image). I will provide my code for FTP, which will be licensed to explicitly prohibit all but developement use, with a license which will transform itself to the three clause Berkeley license, if the XFS code which it's designed to work with is also released under a Berkeley-style license, and a release from patent claims in the covered code. In other words, the code I provide will be useless to everyone but FS researchers, unless the SGI license on the XFS components it must be linked with change to permit BSD to use the code as a boot FS, and further, permit commercial use by not hiding submerged patent infringement lawsuits which will be sprung on the unwary, as soon as someone with deep pockets uses the code. Call me distrustful, but I am fully capable of delivering in a very short time frame, so I'm pretty much the only game. > Ohh one other comment: > The only time SGI may ask for a copy write reassignment is if > the contributed code affects the filesystem compatibility > between irix and linux. This would have to be a major > contribution before something like this would be an issue, and > some negotiation will most certainly be involved. You're damn straight there will be: SGI will be begging the author to assign rights to a derivative work of SGI's own code. 
If that author is philosophically adamant about the GPL, the assignment of rights will never happen, unless the author also lacks personal integrity, and SGI is willing to buy them out of their philosophical stubborness, or pay their own engineers to recreate the code. > Up to to this point all bug fixes have been linux related only > so it really isn't an issue. I maintain it probably never will be. Ask Vijay for my arguments in this regard; they boil down to the level of effort and complexity involved in FS hacking. It takes a professional, someone with academic rigor, to do useful work. Consider that the only minds capable of adding Soft Updates technology to XFS, without a huge capital expenditure, are existant _only_ in the BSD community. > This isn't SGI trying to be an ass... rather SGI trying to > provide the most compatible FS it can within the constrains > of many legal issues. A library style license of the Mozilla bent would have been able to accomplish this rather easily, without losing SGI rights to (putative) improvements, and without limiting the compatability of the license to nothing but Linux. Linux could archive it and treat it as a statically linked library used by the kernel or a kernel module. The effect on BSD would have been to require it to do what it does already, and for systems vendors to provide an "ld -r"'ed kernel and XFS source code. A pain in the ass, but livable for most commercial users and embedded systems vendors. I can't believe SGI's lawyers didn't know precisely what they were giving away, and what they weren't. -- So, are you going to point me at the pure (convertable to another license, since it contains only SGI contributions) SGI XFS code, and an image of a sample FS that I can write to a floppy for testing purposes? Meanwhile, I think the FreeBSD community should continue to pursue their own JFS, under a useful license that could then trigger commercial support for the programming required... 
That's how the BSD community gets professional programmers to do complex and unpleasent tasks, while other communities never get the unpleasent tasks (e.g. Soft Updates [Whistle/IBM], fully unified VM and buffer cache [Oracle], etc.) done at all, after all. Marketing is a poor coin for getting long term work done; it's too ephemeral for a long term investment to be worthwhile. Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Feb 9 8:31:39 2001 Delivered-To: freebsd-fs@freebsd.org Received: by hub.freebsd.org (Postfix, from userid 753) id F16E237B97D; Fri, 9 Feb 2001 08:08:01 -0800 (PST) Date: Fri, 9 Feb 2001 08:08:01 -0800 From: Adrian Chadd <adrian@FreeBSD.org> To: tlambert@primenet.com Cc: Russell Cattelan <cattelan@thebarn.com>, freebsd-chat@FreeBSD.ORG, Jack Rusher <jar@integratus.com>, Terry Lambert <tlambert@primenet.com>, Sam Leffler <sam@errno.com>, Zhiui Zhang <zzhang@cs.binghamton.edu>, freebsd-fs@FreeBSD.ORG Subject: Re: XFS Message-ID: <20010209080801.A56926@hub.freebsd.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Terry said: > I am offering to do a preliminary port of the XFS code, > potentially to the point of minimally a read-only mount, and > perhaps much further, depending on the effort required. .. and I'm already (only initially) trudging my way through the linux XFS code and slowly fixing it up. I've hit a sticker - the lacking mount interface we have - which I'm also slowly reworking to be more flexible and suited to the XFS requirements. 
So Terry, if you'd like to help, let's sort out the mount interface,
help me finish bits of the userland interface, and then we can
work on getting the XFS kernel code in.

.. I might say that from what I hear, it might be easier to port
XFS to FreeBSD based on the original XFS code before it was
Linux-ified, but I'm willing to walk through the linux code.

Adrian

From owner-freebsd-fs Fri Feb 9 9:12:26 2001
Delivered-To: freebsd-fs@freebsd.org
Received: from relay.butya.kz (butya-gw.butya.kz [212.154.129.94]) by hub.freebsd.org (Postfix) with ESMTP id 133B837B9B9; Fri, 9 Feb 2001 08:22:01 -0800 (PST)
Received: by relay.butya.kz (Postfix, from userid 1000) id E9B652863E; Fri, 9 Feb 2001 22:21:56 +0600 (ALMT)
Received: from localhost (localhost [127.0.0.1]) by relay.butya.kz (Postfix) with ESMTP id CA656285D3; Fri, 9 Feb 2001 22:21:56 +0600 (ALMT)
Date: Fri, 9 Feb 2001 22:21:56 +0600 (ALMT)
From: Boris Popov <bp@butya.kz>
To: freebsd-fs@freebsd.org
Cc: freebsd-hackers@freebsd.org
Subject: Re: smbfs-1.3.3 released
In-Reply-To: <Pine.BSF.4.21.0101281356020.30001-100000@lion.butya.kz>
Message-ID: <Pine.BSF.4.21.0102092217160.31739-100000@lion.butya.kz>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Sun, 28 Jan 2001, Boris Popov wrote:

> Well, next version of smbfs for FreeBSD released today. It
> includes minor bug fixes and significantly reworked connection engine.

As usual, major rewrites tend to introduce some bugs. So, I've
released 1.3.5 as an update:

09.02.2001 1.3.5
- The user and server names were swapped in the "TreeConnect" request
  (fixed by Jonathan Hanna).
- smb requester could cause a panic if there are no free mbufs - fixed.
- It is possible to use smbfs with devfs now, but it wasn't tested
  under SMP. Also note that device permissions will be wrong, because
  devfs does not allow passing of credentials to the cloning function.
- nsmbX device moved from the /dev/net directory to /dev directory.

31.01.2001 1.3.4
- Maintenance: sync with changes in the recent -current

An updated version can be downloaded from
ftp://ftp.butya.kz/pub/smbfs/smbfs.tar.gz

--
Boris Popov
http://www.butya.kz/~bp/

From owner-freebsd-fs Fri Feb 9 9:25:28 2001
Delivered-To: freebsd-fs@freebsd.org
Received: from bingnet2.cc.binghamton.edu (bingnet2.cc.binghamton.edu [128.226.1.18]) by hub.freebsd.org (Postfix) with ESMTP id 1C00437E0A4 for <freebsd-fs@FreeBSD.ORG>; Fri, 9 Feb 2001 09:25:10 -0800 (PST)
Received: from onyx (onyx.cs.binghamton.edu [128.226.140.171]) by bingnet2.cc.binghamton.edu (8.11.2/8.11.2) with ESMTP id f19HP8c17870; Fri, 9 Feb 2001 12:25:08 -0500 (EST)
Date: Fri, 9 Feb 2001 12:24:54 -0500 (EST)
From: Zhiui Zhang <zzhang@cs.binghamton.edu>
X-Sender: zzhang@onyx
To: Russell Cattelan <cattelan@thebarn.com>
Cc: freebsd-fs@FreeBSD.ORG
Subject: Re: Design a journalled file system
In-Reply-To: <3A83986E.55789E59@thebarn.com>
Message-ID: <Pine.SOL.4.21.0102091214440.4738-100000@onyx>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

I guess that this will involve either memory copying or changing the
buffer header directly. Linux seems to address buffers directly via
physical (not logical) block number, so there is no need to change the
buffer header. Plus, Linux has a reference count to prevent a buffer from
disappearing (being brelse()'ed).

Another difficulty is that if several transactions are in progress at the
same time, we must remember which metadata buffers are modified by which
transactions. When we copy/rename the buffer, we must inform those
transactions that we did the copy/rename.
The buffers modified by one transaction must be flushed at the same time.
BTW, the Linux GFS code seems to allow only ONE transaction in progress at
any time.

-Zhihui

On Fri, 9 Feb 2001, Russell Cattelan wrote:

> Zhiui Zhang wrote:
>
> > I am considering the design of a journalled file system in FreeBSD. I
> > think each transaction corresponds to a file system update operation and
> > will therefore consists of a list of modified buffers. The important
> > thing is that these buffers should not be written to disk until they have
> > been logged into the log area. To do so, we need to pin these buffers in
> > memory for a while. The concept should be simple, but I run into a problem
> > which I have no idea how to solve it:
> >
> > If you access a lot of files quickly, some vnodes will be reused. These
> > vnodes can contain buffers that are still pinned in the memory because of
> > the write-ahead logging constraints. After a vnode is gone, we have
> > no way to recover its buffers. Note that whenever we need a new vnode, we
> > are in the process of creating a new file. At this point, we can not flush
> > the buffers to the log area. The result is a deadlock.
>
> XFS:
> All pinned buffers are keep on a queue to be flushed by a
> daemon that walks the queue looking for buffer that
> have recently become unlocked and unpinned.
>
> >
> > I could make copies of the buffers that are still pinned, but that incurs
> > memory copy and need buffer headers, which is also a rare resource.
> >
> > The design is similar to ext3fs of linux (they do not seem to have a vnode
> > layer and they use device + physical block number instead of vnode +
> > logical block number to index buffers, which, I guess, means that buffers
> > can exist after the inode is gone). I know Mckusick has a paper on
>
> Yup. All meta data buffer use and absolute device offset.
>
> > journalling FFS, but I just want to know if this design can work or not.
> >
> > Any ideas? Thanks for your help!
> > > > -Zhihui > > > > To Unsubscribe: send mail to majordomo@FreeBSD.org > > with "unsubscribe freebsd-fs" in the body of the message > > -- > Russell Cattelan > cattelan@thebarn.com > > > > To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Feb 9 10:43:22 2001 Delivered-To: freebsd-fs@freebsd.org Received: from sgi.com (sgi.SGI.COM [192.48.153.1]) by hub.freebsd.org (Postfix) with ESMTP id B566537B401; Fri, 9 Feb 2001 10:43:01 -0800 (PST) Received: from ledzep.americas.sgi.com (relay.cray.com [137.38.226.97]) by sgi.com (980327.SGI.8.8.8-aspam/980304.SGI-aspam: SGI does not authorize the use of its proprietary systems or networks for unsolicited or bulk email from the Internet.) via ESMTP id KAA05895; Fri, 9 Feb 2001 10:42:51 -0800 (PST) mail_from (cattelan@thebarn.com) Received: from gibble.americas.sgi.com (gibble.americas.sgi.com [128.162.195.80]) by ledzep.americas.sgi.com (SGI-SGI-8.9.3/americas-smart-nospam1.1) with ESMTP id MAA25243; Fri, 9 Feb 2001 12:42:50 -0600 (CST) Received: from thebarn.com (localhost [127.0.0.1]) by gibble.americas.sgi.com (8.11.0/8.11.0) with ESMTP id f19Ifo020453; Fri, 9 Feb 2001 12:41:50 -0600 Message-ID: <3A8439ED.57011A40@thebarn.com> Date: Fri, 09 Feb 2001 12:41:50 -0600 From: Russell Cattelan <cattelan@thebarn.com> X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.1-XFS i686) X-Accept-Language: en MIME-Version: 1.0 To: Adrian Chadd <adrian@FreeBSD.ORG> Cc: tlambert@primenet.com, freebsd-fs@FreeBSD.ORG Subject: Re: XFS References: <20010209080801.A56926@hub.freebsd.org> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Adrian Chadd wrote: > Terry said: > > > I am offering to do a preliminary port of the XFS code, > > potentially to the point of minimally a read-only mount, and > > perhaps much further, depending on the effort required. > > .. 
and I'm already (only initially) trudging my way through the
> linux XFS code and slowly fixing it up.
>
> I've hit a sticker - the lacking mount interface we have - which
> I'm also slowly reworking to be more flexible and suited to
> the XFS requirements.
>
> So Terry, if you'd like to help, lets sort out the mount interface,
> help me finish bits of the userland interface, and then we can
> work on getting the XFS kernel code in.
>
> .. i might say that from what I hear, it might be easier to port
> XFS to FreeBSD based on the original XFS code before it was
> Linux-ified, but I'm willing to walk through the linux code.

I can go back in time and dig up any of the old interface code.
It will have to be used only as a reference, since it may have old
license issues; most of it was clean, but a couple of places had
problems. The VFS and VNODE stuff was clean, based on the fact that
the BSD code is out there.

Note: we put a layer over the top of the XFS vfs/vnode interface.
Most of the interface is intact, so it should be a matter of
stripping off the linvfs_ layer. CXFS needs stackable FS's, and the
linux VFS layer doesn't have any concept of this, so we needed
to keep the vfs/vnode stuff.

Behaviors will have to be added... this shouldn't be too much of a problem.

Right now the vnode is part of the linux inode structure... all the
vnode members were left in place... the only thing that was pushed up
was the count, but this was done with a macro, so this should be a
trivial conversion.

>

I wish I had more time to work on this stuff, but the linux
port has a lot of work on the todo list. But please keep asking
questions; I really would like to see XFS on a decent OS.

>
> Adrian

--
Russell Cattelan

--
Digital Elves inc. -- Currently on loan to SGI
Linux XFS core developer.
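
A minimal sketch of the pinned-buffer scheme discussed in the "Design a
journalled file system" thread, assuming a toy in-memory model. The class
and method names (Buffer, Journal, Transaction, flush_pass) are
hypothetical and are not taken from XFS, ext3 or FreeBSD. It only
illustrates the ordering rule: a modified buffer stays pinned until its
transaction has been written to the log, and a flush daemon walking the
queue skips anything still pinned or locked. Because buffers are keyed by
absolute device block number and held on the journal's own queue, they do
not depend on a vnode staying alive.

```python
from collections import deque


class Buffer:
    """A metadata buffer indexed by absolute device block number."""

    def __init__(self, blkno):
        self.blkno = blkno
        self.data = b""
        self.pin_count = 0    # > 0 while an unlogged transaction holds it
        self.locked = False   # True while some thread is modifying it


class Journal:
    def __init__(self):
        self.log = []                # stand-in for the on-disk log area
        self.disk = {}               # stand-in for the home locations
        self.flush_queue = deque()   # buffers waiting for write-back

    def flush_pass(self):
        """One walk of the queue by the flush daemon."""
        written = []
        for _ in range(len(self.flush_queue)):
            buf = self.flush_queue.popleft()
            if buf.pin_count > 0 or buf.locked:
                self.flush_queue.append(buf)   # not safe yet, retry later
            else:
                self.disk[buf.blkno] = buf.data
                written.append(buf.blkno)
        return written


class Transaction:
    """One file system update: a list of modified buffers."""

    def __init__(self, journal):
        self.journal = journal
        self.buffers = []

    def modify(self, buf, data):
        if buf not in self.buffers:
            buf.pin_count += 1               # pin until this transaction is logged
            self.buffers.append(buf)
        if buf not in self.journal.flush_queue:
            self.journal.flush_queue.append(buf)
        buf.data = data

    def commit(self):
        # Write-ahead rule: the log record is written first, then unpin.
        record = [(b.blkno, b.data) for b in self.buffers]
        self.journal.log.append(record)
        for b in self.buffers:
            b.pin_count -= 1
        self.buffers = []
```

In this model, calling flush_pass() before commit() writes nothing to the
home location, and calling it afterwards writes the buffer out. A real
implementation would also have to handle log space, locking and recovery,
none of which is modelled here.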
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Feb 9 11:19:12 2001 Delivered-To: freebsd-fs@freebsd.org Received: from mail.webmonster.de (datasink.webmonster.de [194.162.162.209]) by hub.freebsd.org (Postfix) with SMTP id AB45D37B6A2 for <fs@FreeBSD.ORG>; Fri, 9 Feb 2001 11:18:51 -0800 (PST) Received: (qmail 84247 invoked by uid 1000); 9 Feb 2001 19:18:49 -0000 Date: Fri, 9 Feb 2001 20:18:49 +0100 From: "Karsten W. Rohrbach" <karsten@rohrbach.de> To: Mike Smith <msmith@freebsd.org> Cc: Matt Dillon <dillon@earth.backplane.com>, "Michael C . Wu" <keichii@iteration.net>, Mitch Collinsworth <mitch@ccmr.cornell.edu>, hackers@FreeBSD.ORG, fs@FreeBSD.ORG Subject: Re: Extremely large (70TB) File system/server planning Message-ID: <20010209201849.B48420@rohrbach.de> Reply-To: karsten@rohrbach.de References: <200102051750.f15HoZ021657@earth.backplane.com> <200102052052.f15KqOe00985@mass.dis.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0i In-Reply-To: <200102052052.f15KqOe00985@mass.dis.org>; from msmith@freebsd.org on Mon, Feb 05, 2001 at 12:52:24PM -0800 X-Arbitrary-Number-Of-The-Day: 42 X-Sender: karsten@rohrbach.de Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Mike Smith(msmith@freebsd.org)@Mon, Feb 05, 2001 at 12:52:24PM -0800: > > You can't do this with a NetApp either; they max out at about 6TB now > (going up to around 12 or so soon). You might want to talk to EMC and/or > IBM, both of whom have *extremely* large filers. from my experiences with filers (we have both, country and western here - eg. netapp f740/760 and emc^2 symmetrix/connectrix) i can only say that emc is a pile of sh** - no pun intended. actually the boxes work okay, but you need a shitload of datamover boxes from emc to achieve performance similar to netapp's 760 series (up to 12 data movers with 2gig of ram each). 
emc goes brute force, netapp use their brains. when it comes to ibm, as far as i understand you have to hook up their filers to rs/6000(aix) or s/370 or s/390 systems since they are "only" fibrechannel or ficon attached raid subsystems, so the client platform is responsible for handling all the filesystem stuff. you might also check out lsi logic's filer products, i think they support 12tb via nas. > > Your friend may also want to look at Traakan, who have a novel product in > this space. i checked out their website which says "under construction" strange... /k > > -- > ... every activity meets with opposition, everyone who acts has his > rivals and unfortunately opponents also. But not because people want > to be opponents, rather because the tasks and relationships force > people to take different points of view. [Dr. Fritz Todt] > V I C T O R Y N O T V E N G E A N C E > > > > > To Unsubscribe: send mail to majordomo@FreeBSD.org > with "unsubscribe freebsd-fs" in the body of the message -- > Hackers know all the right MOVs. KR433/KR11-RIPE -- http://www.webmonster.de -- ftp://ftp.webmonster.de To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Feb 9 11:23:56 2001 Delivered-To: freebsd-fs@freebsd.org Received: from mail.webmonster.de (datasink.webmonster.de [194.162.162.209]) by hub.freebsd.org (Postfix) with SMTP id A233137B6A6 for <freebsd-fs@freebsd.org>; Fri, 9 Feb 2001 11:23:37 -0800 (PST) Received: (qmail 84389 invoked by uid 1000); 9 Feb 2001 19:23:36 -0000 Date: Fri, 9 Feb 2001 20:23:36 +0100 From: "Karsten W. 
Rohrbach" <karsten@rohrbach.de> To: Zhiui Zhang <zzhang@cs.binghamton.edu> Cc: freebsd-fs@freebsd.org Subject: Re: Design a journalled file system Message-ID: <20010209202336.D48420@rohrbach.de> Reply-To: karsten@rohrbach.de References: <Pine.SOL.4.21.0102061544230.6584-100000@opal> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0i In-Reply-To: <Pine.SOL.4.21.0102061544230.6584-100000@opal>; from zzhang@cs.binghamton.edu on Tue, Feb 06, 2001 at 04:15:45PM -0500 X-Arbitrary-Number-Of-The-Day: 42 X-Sender: karsten@rohrbach.de Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org there was a post last year to either -fs or -hackers that described the use of a single disk or block device for logging all writes and committing them afterwards. i think it was presented at usenic last year or 99. can't find it now - please check the mailing list archives. /k Zhiui Zhang(zzhang@cs.binghamton.edu)@Tue, Feb 06, 2001 at 04:15:45PM -0500: > > I am considering the design of a journalled file system in FreeBSD. I > think each transaction corresponds to a file system update operation and > will therefore consists of a list of modified buffers. The important > thing is that these buffers should not be written to disk until they have > been logged into the log area. To do so, we need to pin these buffers in > memory for a while. The concept should be simple, but I run into a problem > which I have no idea how to solve it: > > If you access a lot of files quickly, some vnodes will be reused. These > vnodes can contain buffers that are still pinned in the memory because of > the write-ahead logging constraints. After a vnode is gone, we have > no way to recover its buffers. Note that whenever we need a new vnode, we > are in the process of creating a new file. At this point, we can not flush > the buffers to the log area. The result is a deadlock. 
> > I could make copies of the buffers that are still pinned, but that incurs > memory copy and need buffer headers, which is also a rare resource. > > The design is similar to ext3fs of linux (they do not seem to have a vnode > layer and they use device + physical block number instead of vnode + > logical block number to index buffers, which, I guess, means that buffers > can exist after the inode is gone). I know Mckusick has a paper on > journalling FFS, but I just want to know if this design can work or not. > > Any ideas? Thanks for your help! > > -Zhihui > > > > To Unsubscribe: send mail to majordomo@FreeBSD.org > with "unsubscribe freebsd-fs" in the body of the message -- > Booze is the answer. I don't remember the question. KR433/KR11-RIPE -- http://www.webmonster.de -- ftp://ftp.webmonster.de To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Feb 9 11:39: 8 2001 Delivered-To: freebsd-fs@freebsd.org Received: from deliverator.sgi.com (deliverator.sgi.com [204.94.214.10]) by hub.freebsd.org (Postfix) with ESMTP id EC6F937B6EE; Fri, 9 Feb 2001 11:38:44 -0800 (PST) Received: from ledzep.americas.sgi.com (ledzep.americas.sgi.com [137.38.226.97]) by deliverator.sgi.com (980309.SGI.8.8.8-aspam-6.2/980310.SGI-aspam) via ESMTP id LAA10650; Fri, 9 Feb 2001 11:37:32 -0800 (PST) mail_from (cattelan@thebarn.com) Received: from gibble.americas.sgi.com (gibble.americas.sgi.com [128.162.195.80]) by ledzep.americas.sgi.com (SGI-SGI-8.9.3/americas-smart-nospam1.1) with ESMTP id NAA26851; Fri, 9 Feb 2001 13:38:32 -0600 (CST) Received: from thebarn.com (localhost [127.0.0.1]) by gibble.americas.sgi.com (8.11.0/8.11.0) with ESMTP id f19JbV020646; Fri, 9 Feb 2001 13:37:31 -0600 Message-ID: <3A8446FA.DCD17C7E@thebarn.com> Date: Fri, 09 Feb 2001 13:37:30 -0600 From: Russell Cattelan <cattelan@thebarn.com> X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.1-XFS i686) X-Accept-Language: en MIME-Version: 
1.0
To: Terry Lambert <tlambert@primenet.com>
Cc: freebsd-chat@FreeBSD.ORG, Jack Rusher <jar@integratus.com>, Sam Leffler <sam@errno.com>, Zhiui Zhang <zzhang@cs.binghamton.edu>, freebsd-fs@FreeBSD.ORG
Subject: Re: Design a journalled file system
References: <200102090856.BAA08304@usr08.primenet.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Terry Lambert wrote:

> OK, this is not a license war. I will lay it on the line.
>

Ok, I did get a response from the lawyer... as is typical of lawyer talk,
he didn't give much of a response either way, but I think he is open to
discussion.

Ok, somebody from the BSD camp should provide an example of an
acceptable license. If I can present something other than an abstract
concept, more progress can be made.

The one major requirement is that somebody like Sun or IBM can't pick up
the code and start commercializing it. And no, I'm not talking about
restricting a commercial product that uses XFS, but about restricting
somebody from making XFS a commercial product unto itself.

--
Russell Cattelan

--
Digital Elves inc. -- Currently on loan to SGI
Linux XFS core developer.
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Feb 9 11:42: 7 2001 Delivered-To: freebsd-fs@freebsd.org Received: from deliverator.sgi.com (deliverator.sgi.com [204.94.214.10]) by hub.freebsd.org (Postfix) with ESMTP id D5B3437B6AE; Fri, 9 Feb 2001 11:41:21 -0800 (PST) Received: from ledzep.americas.sgi.com (ledzep.americas.sgi.com [137.38.226.97]) by deliverator.sgi.com (980309.SGI.8.8.8-aspam-6.2/980310.SGI-aspam) via ESMTP id KAA04146; Fri, 9 Feb 2001 10:09:15 -0800 (PST) mail_from (cattelan@thebarn.com) Received: from gibble.americas.sgi.com (gibble.americas.sgi.com [128.162.195.80]) by ledzep.americas.sgi.com (SGI-SGI-8.9.3/americas-smart-nospam1.1) with ESMTP id MAA33083; Fri, 9 Feb 2001 12:10:15 -0600 (CST) Received: from thebarn.com (localhost [127.0.0.1]) by gibble.americas.sgi.com (8.11.0/8.11.0) with ESMTP id f19I9E020316; Fri, 9 Feb 2001 12:09:14 -0600 Message-ID: <3A843249.D93D5952@thebarn.com> Date: Fri, 09 Feb 2001 12:09:14 -0600 From: Russell Cattelan <cattelan@thebarn.com> X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.1-XFS i686) X-Accept-Language: en MIME-Version: 1.0 To: Terry Lambert <tlambert@primenet.com> Cc: freebsd-chat@FreeBSD.ORG, Jack Rusher <jar@integratus.com>, Sam Leffler <sam@errno.com>, Zhiui Zhang <zzhang@cs.binghamton.edu>, freebsd-fs@FreeBSD.ORG Subject: Re: Design a journalled file system References: <200102090856.BAA08304@usr08.primenet.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Terry Lambert wrote: > OK, this is not a license war. I will lay it on the line. > > I am offering to do a preliminary port of the XFS code, > potentially to the point of minimally a read-only mount, and > perhaps much further, depending on the effort required. 
> > The resulting code will have some nasty strings, based on > me assuming your comments are correct, and wanting some > guarantees on that, on my part. The strings go away when > your claims to SGI's actions are met. > > Below is my reply to your message, including the philosophical > basis for the strings, a description of the strings, and the > details of my offer. > > This offer is good for a starting date before 01 March 2001. > > -- > > > I'm not sure who you talked with? but it really it that simple. > > Vijay. The V.P. of Engineering at SGI who negotiated the > release of the code. I will quote one of his statements, > made to me in email: > > ] Can't you just relicense FreeBSD [under the GPL]? > > > The reason the GPL was chosen for XFS. > > It's the license Linux is using, and since > > the port is being done for Linux it makes sense. > > One would think a dual license. Alternately, one would think > they would use the LGPL, which would let people link it into > their kernels, as long as they gave source (which BSD does) > or otherwise permitted relinking. > > In other words, the GPL is not really an optimal license, if > they wanted wide use AND specific Linux license compatability. > I concluded from their choice that they were not going for > wide use, but instead wanted the marketing benefit of being > associated with Linux (lots of press, etc.). > > > SGI is also doing work with the XFree code, the work > > is being released under the X license (which is also > > an anti GPL license). > > The BSD and MIT licenses predate the GPL, so careful with the > word "anti" there... Well yes... but my point was that it is a more open license and the XFree projects has stated they want to keep it that way. > > > > SGI is basically matching license for licensee to > > whatever project they are contributing to. > > This from the lawyer that is doing all the open source work. > > Rather than the use to which the software is put; that's a bit > naieve, then, again. 
>
> > I have stated this in the past but I will bring it up again.
> > If sufficient momentum can be generated toward an fbsd port
> > of XFS, it may be possible to go to the lawyers and have a another
> > license drawn up.
>
> If we had it in writing that the code would be released under
> a license usable by the BSD kernel, preferrably "matching license
> for license", as you state, then we would commit to do the work.
>
> The problem we have is that the code under the current license
> is useless to us, and unless we can be ensured that the code we
> write to glue it in won't end up also being useless to us, there
> is really no reason to commit the effort.
>
> > But unless the bsd community can show they are serious about
> > XFS being ported it would be a waste of time to ask for
> > something that SGI has very little business interesting in doing.
>
> So if we were to do a port, then SGI would have a business interest,
> and would relicense the code? Can we have that in writing?
>
> > Note Darwin might be a big win in terms of making a business case
> > for another platform.
>
> Darwin support would be automatic, with a FreeBSD port. Darwin
> can use FreeBSD FS code, unmodified.

Ohh? I got the impression the vm system is quite different.
vfs and vnode may map quite effortlessly, but that's not the part I'm
concerned about. 95% of the work for the linux port has been in the IO path.

>
> > Let get to the point were XFS is in such demand on fbsd
> > we can get a petition going if necessary to have the license
> > updated.
>
> Demand is very different; it is an aspect of marketing. How
> much demand do you want, and where do you want it directed? I
> believe that it would be a trivial exercise to generate as much
> demand as you require.

I need something that lets me say "hey look, people really want to use this."
The half a dozen or so emails I've gotten requesting it aren't enough to
present to the lawyers and say people really, really want this.
I can't promise anything, but I will send a note to the lawyer and see
what kind of suggestion SGI would be open to.
Would the LGPL satisfy things? This one might be the easiest to
propose since it is close to the GPL (something they already understand).
Otherwise, provide an example of a license I can present.

This won't be an easy task, since the general attitude I will
probably encounter is: why should we care, we're doing Linux, not BSD.
But I will try.

>
> > BTW if anybody is interested a few of us have started looking
> > at actually doing the port. Not much has been done at this
> > point... basically battling through header file cleanup.
>
> If you have your head wrapped around it already, file system
> code is really very trivial, particularly if you have code that
> already works in one environment, and are merely porting it.
>
> I'll tell you what: give me a pointer to the code without the
> Linux modifications, so that I won't inadvertently include code
> that is derived from GPL'ed code, and I will create a FreeBSD
> port of the code, with all code additions, which will compile
> and link successfully in a FreeBSD kernel, in a matter of a few
> days. I will additionally require an image of an XFS FS on a
> floppy disk, which I can use for compatibility testing. There
> should be one file with an example of each thing the FS is
> capable of representing, including a directory, a directory
> with a subdirectory, a file, and a directory with two files;
> the files should be short, but if immediate files exist, one
> should be long enough to trigger indirection. It would be most
> useful if the image were zeroed before it was created, so I am
> able to distinguish XFS written data from "blank floppy" contents
> (and to aid compression of the image).

Hmm, XFS can't run on a floppy; it's too small.
Adrian Chad is working on the userland stuff now.
Once mkfs is running, XFS can be written to a file, and by use of
a proto file the image can be prepopulated.
>
> I will provide my code for FTP, which will be licensed to
> explicitly prohibit all but development use, with a license
> which will transform itself to the three clause Berkeley
> license, if the XFS code which it's designed to work with
> is also released under a Berkeley-style license, and a release
> from patent claims in the covered code.
>
> In other words, the code I provide will be useless to everyone
> but FS researchers, unless the SGI license on the XFS components
> it must be linked with change to permit BSD to use the code as a
> boot FS, and further, permit commercial use by not hiding submerged
> patent infringement lawsuits which will be sprung on the unwary,
> as soon as someone with deep pockets uses the code.
>
> Call me distrustful, but I am fully capable of delivering in a
> very short time frame, so I'm pretty much the only game.
>
> > Ohh one other comment:
> > The only time SGI may ask for a copyright reassignment is if
> > the contributed code affects the filesystem compatibility
> > between irix and linux. This would have to be a major
> > contribution before something like this would be an issue, and
> > some negotiation will most certainly be involved.
>
> You're damn straight there will be: SGI will be begging the
> author to assign rights to a derivative work of SGI's own
> code. If that author is philosophically adamant about the
> GPL, the assignment of rights will never happen, unless the
> author also lacks personal integrity, and SGI is willing to
> buy them out of their philosophical stubbornness, or pay
> their own engineers to recreate the code.
>
> > Up to this point all bug fixes have been linux related only
> > so it really isn't an issue.
>
> I maintain it probably never will be. Ask Vijay for my
> arguments in this regard; they boil down to the level of
> effort and complexity involved in FS hacking. It takes a
> professional, someone with academic rigor, to do useful work.
>
> Consider that the only minds capable of adding Soft Updates
> technology to XFS, without a huge capital expenditure, are
> existent _only_ in the BSD community.
>
> > This isn't SGI trying to be an ass... rather SGI trying to
> > provide the most compatible FS it can within the constraints
> > of many legal issues.
>
> A library style license of the Mozilla bent would have been
> able to accomplish this rather easily, without losing SGI
> rights to (putative) improvements, and without limiting the
> compatibility of the license to nothing but Linux. Linux
> could archive it and treat it as a statically linked library
> used by the kernel or a kernel module.
>
> The effect on BSD would have been to require it to do what
> it does already, and for systems vendors to provide an "ld -r"'ed
> kernel and XFS source code. A pain in the ass, but livable
> for most commercial users and embedded systems vendors.
>
> I can't believe SGI's lawyers didn't know precisely what they
> were giving away, and what they weren't.

This Open Source thing is new to the closed world... they are
struggling to understand how to best protect themselves yet work
with the community.

>
>
> --
>
> So, are you going to point me at the pure (convertible to
> another license, since it contains only SGI contributions)
> SGI XFS code, and an image of a sample FS that I can write
> to a floppy for testing purposes?

I'll try to generate a tree from GPL release day 1. March 2000
Otherwise simply look at the CVS tree for the tag GPL-ENCUMBRANCE;
it was put on all the XFS code.

>
>
> Meanwhile, I think the FreeBSD community should continue to
> pursue their own JFS, under a useful license that could then
> trigger commercial support for the programming required...

Granted. But given the number of people SGI has doing the initial
XFS work and the 3 years it took them just to get the FS off the
ground, I don't think we'll have anything real soon.
> That's how the BSD community gets professional programmers
> to do complex and unpleasant tasks, while other communities
> never get the unpleasant tasks (e.g. Soft Updates [Whistle/IBM],
> fully unified VM and buffer cache [Oracle], etc.) done at all,
> after all. Marketing is a poor coin for getting long term
> work done; it's too ephemeral for a long term investment to
> be worthwhile.
>
> Terry Lambert
> terry@lambert.org
> ---
> Any opinions in this posting are my own and not those of my present
> or previous employers.

--
Russell Cattelan
--
Digital Elves inc.
--
Currently on loan to SGI
Linux XFS core developer.


From owner-freebsd-fs Fri Feb 9 11:59:30 2001
Delivered-To: freebsd-fs@freebsd.org
Received: from urban.iinet.net.au (urban.iinet.net.au [203.59.24.231])
	by hub.freebsd.org (Postfix) with ESMTP
	id 64C5937B69B; Fri, 9 Feb 2001 11:59:08 -0800 (PST)
Received: from muzak.iinet.net.au (muzak.iinet.net.au [203.59.24.237])
	by urban.iinet.net.au (8.8.7/8.8.7) with ESMTP id DAA31350;
	Sat, 10 Feb 2001 03:58:59 +0800
Received: from elischer.org (reggae-13-225.nv.iinet.net.au [203.59.79.225])
	by muzak.iinet.net.au (8.8.5/8.8.5) with ESMTP id DAA03325;
	Sat, 10 Feb 2001 03:56:25 +0800
Message-ID: <3A844BFF.D2C68053@elischer.org>
Date: Fri, 09 Feb 2001 11:58:55 -0800
From: Julian Elischer <julian@elischer.org>
X-Mailer: Mozilla 4.7 [en] (X11; U; FreeBSD 5.0-CURRENT i386)
X-Accept-Language: en, hu
MIME-Version: 1.0
To: Russell Cattelan <cattelan@thebarn.com>
Cc: Terry Lambert <tlambert@primenet.com>, freebsd-chat@FreeBSD.ORG,
	Jack Rusher <jar@integratus.com>, Sam Leffler <sam@errno.com>,
	Zhiui Zhang <zzhang@cs.binghamton.edu>, freebsd-fs@FreeBSD.ORG
Subject: Re: Design a journalled file system
References: <200102090856.BAA08304@usr08.primenet.com>
	<3A843249.D93D5952@thebarn.com>
Content-Type: text/plain; charset=iso-8859-15
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Russell Cattelan wrote:
>
> Terry Lambert wrote:
> >
> > Darwin support would be automatic, with a FreeBSD port. Darwin
> > can use FreeBSD FS code, unmodified.

Unmodified is a bit of hyperbole. Let's say "there's probably a
close mapping due to common ancestors".

>
> Ohh? I got the impression the vm system is quite different.
> vfs and vnode may map quite effortlessly but that's not the
> part I'm concerned about.
> 95% of the work for linux port has been in the IO path.

Remember that Darwin is based on Mach, and that FreeBSD is based
on 4.4BSD, which used the Mach VM, so we have a common ancestor
in the VM systems too.

>

--
      __--_|\  Julian Elischer
     /       \ julian@elischer.org
    (   OZ    ) World tour 2000-2001
---> X_.---._/
            v


From owner-freebsd-fs Fri Feb 9 12:20:48 2001
Delivered-To: freebsd-fs@freebsd.org
Received: from mass.dis.org (c228380-a.sfmissn1.sfba.home.com [24.20.90.44])
	by hub.freebsd.org (Postfix) with ESMTP
	id AA24537B6A2; Fri, 9 Feb 2001 12:20:28 -0800 (PST)
Received: from mass.dis.org (localhost [127.0.0.1])
	by mass.dis.org (8.11.1/8.11.1) with ESMTP id f19KMDH00585;
	Fri, 9 Feb 2001 12:22:14 -0800 (PST)
	(envelope-from msmith@mass.dis.org)
Message-Id: <200102092022.f19KMDH00585@mass.dis.org>
X-Mailer: exmh version 2.1.1 10/15/1999
To: karsten@rohrbach.de
Cc: hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: Extremely large (70TB) File system/server planning
In-reply-to: Your message of "Fri, 09 Feb 2001 20:18:49 +0100."
	<20010209201849.B48420@rohrbach.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Fri, 09 Feb 2001 12:22:13 -0800
From: Mike Smith <msmith@freebsd.org>
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

>
> when it comes to ibm, as far as i understand you have to hook up their
> filers to rs/6000(aix) or s/370 or s/390 systems since they are "only"
> fibrechannel or ficon attached raid subsystems, so the client platform
> is responsible for handling all the filesystem stuff.

Hrrm. The last box I looked at included a pair of RS6000's in the
cabinet, and they were touting it as a NAS, but I wasn't paying so much
attention then.

> > Your friend may also want to look at Traakan, who have a novel product in
> > this space.
> i checked out their website which says "under construction"
> strange...

Definitely; they had some neat stuff up there a week or two ago...

--
... every activity meets with opposition, everyone who acts has his
rivals and unfortunately opponents also. But not because people want
to be opponents, rather because the tasks and relationships force
people to take different points of view. [Dr. Fritz Todt]
           V I C T O R Y   N O T   V E N G E A N C E


From owner-freebsd-fs Fri Feb 9 12:34:38 2001
Delivered-To: freebsd-fs@freebsd.org
Received: from orthanc.ab.ca (207-167-15-66.dsl.worldgate.ca [207.167.15.66])
	by hub.freebsd.org (Postfix) with ESMTP
	id F03BE37B6A8; Fri, 9 Feb 2001 12:34:17 -0800 (PST)
Received: from orthanc.ab.ca (localhost [127.0.0.1])
	by orthanc.ab.ca (8.11.1/8.11.1) with ESMTP id f19KYGi01493;
	Fri, 9 Feb 2001 13:34:16 -0700 (MST)
	(envelope-from lyndon@orthanc.ab.ca)
Message-Id: <200102092034.f19KYGi01493@orthanc.ab.ca>
To: Mike Smith <msmith@FreeBSD.ORG>
Cc: karsten@rohrbach.de, hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: Extremely large (70TB) File system/server planning
In-reply-to: Your message of "Fri, 09 Feb 2001 12:22:13 PST."
	<200102092022.f19KMDH00585@mass.dis.org>
Date: Fri, 09 Feb 2001 13:34:16 -0700
From: Lyndon Nerenberg <lyndon@orthanc.ab.ca>
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Another company to look at is Yottayotta (www.yottayotta.com). They
just announced their first products last November, and there isn't much
hard product info online yet. For the arena they're targeting, though,
70TB would be an entry level system.

--lyndon