From owner-freebsd-fs Sun Nov 14 8:16:20 1999 Delivered-To: freebsd-fs@freebsd.org Received: from sv01.cet.co.jp (sv01.cet.co.jp [210.171.56.2]) by hub.freebsd.org (Postfix) with ESMTP id 26FA415007; Sun, 14 Nov 1999 08:16:00 -0800 (PST) (envelope-from michaelh@cet.co.jp) Received: from localhost (michaelh@localhost) by sv01.cet.co.jp (8.9.3/8.9.3) with SMTP id QAA05129; Sun, 14 Nov 1999 16:15:56 GMT Date: Mon, 15 Nov 1999 01:15:55 +0900 (JST) From: Michael Hancock To: Eivind Eklund Cc: fs@FreeBSD.ORG Subject: Re: Killing WILLRELE In-Reply-To: <19991109224553.G256@bitbox.follo.net> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Eivind, I agree with your preferred patches. The slight performance hit for operations like mknod and symlink isn't a worry. IIRC rename was one of those operations where you have to reaquire a ref/lock before return to be consistent with the sane semantics rule. This will also add some latency, but again for an op like rename I don't think it's an issue. Mike On Tue, 9 Nov 1999, Eivind Eklund wrote: > I'm looking at removing WILLRELE from the VFS specs, in order to get > more sane semantics before introducing many more VFS consumers through > stacking layers. I'm sending this as a 'HEADS UP!', a chance for > people to object, and to give a chance at an advance view. > > Note that the present set of patches has not been tested beyond > compilation; I'm reserving testing until after I've let people have > the chance to scream at me (as I don't see a point in testing the > changes unless people agree that they are a step in the right > direction). > > There are presently three VOPs that use it: > VOP_MKNOD > Uses this for the 'vpp' parameter (should be the return vnode > for the newly created node, I believe). The value is > presently unusable; depending on which FS you call, it it is > either set to NULL, set to point to a vnode (MSDOSFS), or just > kept the way it was. (Note that MSDOSFS will leak vnodes as > of today). > > I've been tempted to remove it, but am not entirely happy > about that, as I think it might be useful for some stacked > layers. Thanks to phk, I've been able to come up with patches > to fix it - but these will increase the cost of VOP_MKNOD() > (only slightly, I think, but I am not quite certain). > > The other alternatives are to remove the parameter, or to > break the layering around ufs_mknod (basically, re-implement > parts of VFS_VGET in it, and make it assume that it is only > used with ffsspecops and ffsfifoops. This is presently > correct, but introduces risk of breakage down the road.) Both > of these alternatives are slightly more efficient than my > preferred fix. > > Patches to make VOP_MKNOD use vpp normally are > http://www.freebsd.org/~eivind/vop_mknod_fixed.patch > It is possible that the NFS vp release would have been handled > by common code if I hadn't added special code there, but I > feel too uncomfortable around the NFS code/macros to try to > find out. > > Patches to just remove the parameter are at > http://www.freebsd.org/~eivind/vop_mknod_novpp.patch > > VOP_MKNOD has 5 callers. > > VOP_SYMLINK > Same use of WILLRELE as VOP_MKNOD. > > Returns trash in some cases, OK values in others; relatively > simple to fix, with Coda as the only complication. > > Patches to fix it are at > http://www.freebsd.org/~eivind/vop_symlink_fixed.patch > These will break Coda, which I'm planning to contact rvb about > how to solve if people agree that WILLRELE should die. > > VOP_SYMLINK has 3 callers. > > VOP_RENAME > WILLRELE on a bunch of parameters. Adrian Chadd is doing > several things to VOP_RENAME which is relevant to this, so I'm > keeping my hands off it for the moment. Hopefully, patches > should be available later in the week. > > > My next step along the sane semantics road will probably be to make > freeing of cnp's reflexive - looking at the code that is there now, > there looks like there are a number of bugs related to this at the > moment, and it certainly makes the code much harder to follow. > > Eivind. > > > To Unsubscribe: send mail to majordomo@FreeBSD.org > with "unsubscribe freebsd-fs" in the body of the message > To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Mon Nov 15 8: 5:43 1999 Delivered-To: freebsd-fs@freebsd.org Received: from yana.lemis.com (yana.lemis.com [192.109.197.140]) by hub.freebsd.org (Postfix) with ESMTP id 959AA14EB8 for ; Mon, 15 Nov 1999 08:05:36 -0800 (PST) (envelope-from grog@mojave.sitaranetworks.com) Received: from mojave.sitaranetworks.com (mojave.sitaranetworks.com [199.103.141.157]) by yana.lemis.com (8.8.8/8.8.8) with ESMTP id CAA21030; Tue, 16 Nov 1999 02:35:26 +1030 (CST) (envelope-from grog@mojave.sitaranetworks.com) Message-ID: <19991113213430.48370@mojave.sitaranetworks.com> Date: Sat, 13 Nov 1999 21:34:30 -0500 From: Greg Lehey To: Bernd Walter , Mattias Pantzare Cc: freebsd-fs@FreeBSD.ORG Subject: Re: RAID-5 and failure Reply-To: Greg Lehey References: <199911061827.TAA22113@zed.ludd.luth.se> <19991106200754.A9682@cicely7.cicely.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii In-Reply-To: <19991106200754.A9682@cicely7.cicely.de>; from Bernd Walter on Sat, Nov 06, 1999 at 08:07:54PM +0100 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Saturday, 6 November 1999 at 20:07:54 +0100, Bernd Walter wrote: > On Sat, Nov 06, 1999 at 07:27:20PM +0100, Mattias Pantzare wrote: >>> If the system panics or power fails between such a write there is no way to >>> find out if the parity is broken beside verifying the complete plex after >>> reboot - the problem should be the same with all usual hard and software >>> solutions - greg already begun or finished recalculating and checking the >>> parity. >> >> This is realy a optimisation issue, if you just write without using >> two-phase commit then you have to recalculate parity after a powerfailure. >> (One might keep track of the regions of the disk that have had writes latly >> and only recalculate them) >> >> Or you do as it says under Two-phase commitment in >> http://www.sunworld.com/sunworldonline/swol-09-1995/swol-09-raid5-2.html. >> > That's exactly what vinum does at this moment but without the log. > You need persistent memory for this such as nv-memory or a log area on any disk. > nv-memory on PCs is usually to small and maybe to slow for such purposes. > I asume that a log area on any partitipating disk is not a good idea. > On a different disk it would be an option but still needs implementation. Yes, I suppose we could implement that for maximum security. I wonder if any NOVRAM boards are available. Greg -- Finger grog@lemis.com for PGP public key See complete headers for address and phone numbers To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Mon Nov 15 8: 5:49 1999 Delivered-To: freebsd-fs@freebsd.org Received: from yana.lemis.com (yana.lemis.com [192.109.197.140]) by hub.freebsd.org (Postfix) with ESMTP id AFB74150CF for ; Mon, 15 Nov 1999 08:05:41 -0800 (PST) (envelope-from grog@mojave.sitaranetworks.com) Received: from mojave.sitaranetworks.com (mojave.sitaranetworks.com [199.103.141.157]) by yana.lemis.com (8.8.8/8.8.8) with ESMTP id CAA21033; Tue, 16 Nov 1999 02:35:35 +1030 (CST) (envelope-from grog@mojave.sitaranetworks.com) Message-ID: <19991113213325.57908@mojave.sitaranetworks.com> Date: Sat, 13 Nov 1999 21:33:25 -0500 From: Greg Lehey To: Bernd Walter , Mattias Pantzare Cc: freebsd-fs@FreeBSD.ORG Subject: Re: RAID-5 and failure Reply-To: Greg Lehey References: <199911061716.SAA20783@zed.ludd.luth.se> <19991106183316.A9420@cicely7.cicely.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii In-Reply-To: <19991106183316.A9420@cicely7.cicely.de>; from Bernd Walter on Sat, Nov 06, 1999 at 06:33:16PM +0100 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Saturday, 6 November 1999 at 18:33:16 +0100, Bernd Walter wrote: > On Sat, Nov 06, 1999 at 06:16:47PM +0100, Mattias Pantzare wrote: >>> On Sat, Nov 06, 1999 at 04:58:55PM +0100, Mattias Pantzare wrote: >>>> What hapens if the data part of a write to a RAID-5 plex completes but not the >>>> parity part (or the other way)? >>>> >>> The parity is not in sync - what else? >> >> The system could detect it and recalculate the parity. Or give a warning to >> the user so the user knows that the data is not safe. > > That's not possible because you need to write more then a single > sector to keep parity in sync which is not atomic. > > In case one of the writes fail vinum will do everything needed to > work with it and to inform the user. In RAID-5, I first write the data blocks, then the parity blcoks. There are a number of scenarios here: 1. The drive containing a data or parity block goes down. In this case, the subdisks of that block will be marked 'crashed'. The subdisk to which the write went will be marked 'stale'. When the drive is brought up again (manually), the data will be recreated. I've been thinking about keeping a log somewhere of what needs to be updated, but this carries dangers of corruption. At the moment I require that the entire subdisk be rewritten. This will also recreate parity where necessary. 2. The subdisk containing a data or parity block has an unrecoverable I/O error. This is pretty much the same as the previous case, except that the other subdisks don't crash. 3. The system crashes before writing the first data block for a RAID-5 stripe. The updates are lost (obviously). When the system comes up, the data should be consistent. 4. The system crashes after writing the first data block for a RAID-5 stripe and before writing the last data block. When the system comes up, both data and parity are inconsistent. 5. The system crashes after writing the last data block for a RAID-5 stripe and before writing the last parity block. When the system comes up, data is consistent, and parity is inconsistent. There are a number of ways of dealing with situations 4 and 5. The real problem is that they only occur when the system crashes, so whatever recovery information is required must be stored in non-volatile storage. Some systems do include a NOVRAM for this kind of information, but in general purpose systems the only possibility is to write the information to disk, which would make the inherently slow RAID-5 write even slower. My attitude here is that RAID-5 writes are comparatively infrequent, and so are crashes. In the case of (5), you could rebuild parity after a crash. In the case of (4), I have no good answer. Suggestions welcome. Having said that, I probably need to revise the code which sequentializes the data and parity writes. It currently uses the B_ORDERED flag in the buffer headers, and I'm not sure that's enough. I should probably modify it to confirm that the data blocks are written before starting to write the parity blocks. > Vinum will take the subdisk down because such drives should work with > write reallocation enabled and such a disk is badly broken if you receive a > write error. > > If the system panics or power fails between such a write there is no way to > find out if the parity is broken beside verifying the complete plex after > reboot - the problem should be the same with all usual hard and software > solutions - greg already begun or finished recalculating and checking the > parity. > I asume that's the reason why some systems use 520 byte sectors - maybe they > write timestamps or generationnumbers in a single write within the sector. In fact, the 520 byte sectors are used to protect against data corruption between the disk and the controller. They won't help in this scenario. Greg -- Finger grog@lemis.com for PGP public key See complete headers for address and phone numbers To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Mon Nov 15 11:25: 9 1999 Delivered-To: freebsd-fs@freebsd.org Received: from uni4nn.gn.iaf.nl (osmium.gn.iaf.nl [193.67.144.12]) by hub.freebsd.org (Postfix) with ESMTP id C6AB414D23 for ; Mon, 15 Nov 1999 11:25:05 -0800 (PST) (envelope-from wilko@yedi.iaf.nl) Received: from yedi.iaf.nl (uucp@localhost) by uni4nn.gn.iaf.nl (8.9.2/8.9.2) with UUCP id UAA02174; Mon, 15 Nov 1999 20:00:51 +0100 (MET) Received: (from wilko@localhost) by yedi.iaf.nl (8.9.3/8.9.3) id TAA00923; Mon, 15 Nov 1999 19:24:01 +0100 (CET) (envelope-from wilko) From: Wilko Bulte Message-Id: <199911151824.TAA00923@yedi.iaf.nl> Subject: Re: RAID-5 and failure In-Reply-To: <19991113213430.48370@mojave.sitaranetworks.com> from Greg Lehey at "Nov 13, 1999 9:34:30 pm" To: grog@lemis.com Date: Mon, 15 Nov 1999 19:24:01 +0100 (CET) Cc: ticso@cicely.de, pantzer@ludd.luth.se, freebsd-fs@FreeBSD.ORG X-Organisation: Private FreeBSD site - Arnhem, The Netherlands X-pgp-info: PGP public key at 'finger wilko@freefall.freebsd.org' X-Mailer: ELM [version 2.4ME+ PL43 (25)] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org As Greg Lehey wrote ... > On Saturday, 6 November 1999 at 20:07:54 +0100, Bernd Walter wrote: ... > > That's exactly what vinum does at this moment but without the log. > > You need persistent memory for this such as nv-memory or a log area on any disk. > > nv-memory on PCs is usually to small and maybe to slow for such purposes. > > I asume that a log area on any partitipating disk is not a good idea. > > On a different disk it would be an option but still needs implementation. > > Yes, I suppose we could implement that for maximum security. I wonder > if any NOVRAM boards are available. > > Greg You might find an old Prestoserve PCI card on a yardsale. Long shot.. -- | / o / / _ Arnhem, The Netherlands - Powered by FreeBSD - |/|/ / / /( (_) Bulte WWW : http://www.tcja.nl http://www.freebsd.org To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Mon Nov 15 11:39: 3 1999 Delivered-To: freebsd-fs@freebsd.org Received: from mail.du.gtn.com (mail.du.gtn.com [194.77.9.57]) by hub.freebsd.org (Postfix) with ESMTP id CB10114A09 for ; Mon, 15 Nov 1999 11:38:58 -0800 (PST) (envelope-from ticso@mail.cicely.de) Received: from mail.cicely.de (cicely.de [194.231.9.142]) by mail.du.gtn.com (8.9.3/8.9.3) with ESMTP id UAA26480; Mon, 15 Nov 1999 20:32:01 +0100 (MET) Received: (from ticso@localhost) by mail.cicely.de (8.9.0/8.9.0) id UAA06071; Mon, 15 Nov 1999 20:38:28 +0100 (CET) Date: Mon, 15 Nov 1999 20:38:28 +0100 From: Bernd Walter To: Greg Lehey Cc: Bernd Walter , Mattias Pantzare , freebsd-fs@FreeBSD.ORG Subject: Re: RAID-5 and failure Message-ID: <19991115203828.B5417@cicely7.cicely.de> References: <199911061716.SAA20783@zed.ludd.luth.se> <19991106183316.A9420@cicely7.cicely.de> <19991113213325.57908@mojave.sitaranetworks.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0pre3i In-Reply-To: <19991113213325.57908@mojave.sitaranetworks.com> Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Sat, Nov 13, 1999 at 09:33:25PM -0500, Greg Lehey wrote: > > 4. The system crashes after writing the first data block for a RAID-5 > stripe and before writing the last data block. > > When the system comes up, both data and parity are inconsistent. > > 5. The system crashes after writing the last data block for a RAID-5 > stripe and before writing the last parity block. > > When the system comes up, data is consistent, and parity is > inconsistent. > > There are a number of ways of dealing with situations 4 and 5. The > real problem is that they only occur when the system crashes, so > whatever recovery information is required must be stored in > non-volatile storage. Some systems do include a NOVRAM for this kind > of information, but in general purpose systems the only possibility is > to write the information to disk, which would make the inherently slow > RAID-5 write even slower. My attitude here is that RAID-5 writes are > comparatively infrequent, and so are crashes. In the case of (5), you > could rebuild parity after a crash. In the case of (4), I have no > good answer. Suggestions welcome. Case 4 is not that different from case 5 as any differences should be handled by the FS using the volume. -- B.Walter COSMO-Project http://www.cosmo-project.de ticso@cicely.de Usergroup info@cosmo-project.de To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Mon Nov 15 11:42:53 1999 Delivered-To: freebsd-fs@freebsd.org Received: from mail.du.gtn.com (mail.du.gtn.com [194.77.9.57]) by hub.freebsd.org (Postfix) with ESMTP id F0AF614A09 for ; Mon, 15 Nov 1999 11:42:50 -0800 (PST) (envelope-from ticso@mail.cicely.de) Received: from mail.cicely.de (cicely.de [194.231.9.142]) by mail.du.gtn.com (8.9.3/8.9.3) with ESMTP id UAA26696; Mon, 15 Nov 1999 20:35:52 +0100 (MET) Received: (from ticso@localhost) by mail.cicely.de (8.9.0/8.9.0) id UAA06197; Mon, 15 Nov 1999 20:42:22 +0100 (CET) Date: Mon, 15 Nov 1999 20:42:22 +0100 From: Bernd Walter To: Greg Lehey Cc: Bernd Walter , Mattias Pantzare , freebsd-fs@FreeBSD.ORG Subject: Re: RAID-5 and failure Message-ID: <19991115204222.C5417@cicely7.cicely.de> References: <199911061827.TAA22113@zed.ludd.luth.se> <19991106200754.A9682@cicely7.cicely.de> <19991113213430.48370@mojave.sitaranetworks.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0pre3i In-Reply-To: <19991113213430.48370@mojave.sitaranetworks.com> Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Sat, Nov 13, 1999 at 09:34:30PM -0500, Greg Lehey wrote: > > Yes, I suppose we could implement that for maximum security. I wonder > if any NOVRAM boards are available. > Maybe the RIO project can bring in some interesting features. -- B.Walter COSMO-Project http://www.cosmo-project.de ticso@cicely.de Usergroup info@cosmo-project.de To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Mon Nov 15 11:52:54 1999 Delivered-To: freebsd-fs@freebsd.org Received: from yana.lemis.com (yana.lemis.com [192.109.197.140]) by hub.freebsd.org (Postfix) with ESMTP id 0AAC114BD5 for ; Mon, 15 Nov 1999 11:52:48 -0800 (PST) (envelope-from grog@mojave.sitaranetworks.com) Received: from mojave.sitaranetworks.com (mojave.sitaranetworks.com [199.103.141.157]) by yana.lemis.com (8.8.8/8.8.8) with ESMTP id GAA21345; Tue, 16 Nov 1999 06:22:34 +1030 (CST) (envelope-from grog@mojave.sitaranetworks.com) Message-ID: <19991115145200.09633@mojave.sitaranetworks.com> Date: Mon, 15 Nov 1999 14:52:00 -0500 From: Greg Lehey To: Bernd Walter Cc: Mattias Pantzare , freebsd-fs@FreeBSD.ORG Subject: Re: RAID-5 and failure Reply-To: Greg Lehey References: <199911061716.SAA20783@zed.ludd.luth.se> <19991106183316.A9420@cicely7.cicely.de> <19991113213325.57908@mojave.sitaranetworks.com> <19991115203828.B5417@cicely7.cicely.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii In-Reply-To: <19991115203828.B5417@cicely7.cicely.de>; from Bernd Walter on Mon, Nov 15, 1999 at 08:38:28PM +0100 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Monday, 15 November 1999 at 20:38:28 +0100, Bernd Walter wrote: > On Sat, Nov 13, 1999 at 09:33:25PM -0500, Greg Lehey wrote: >> >> 4. The system crashes after writing the first data block for a RAID-5 >> stripe and before writing the last data block. >> >> When the system comes up, both data and parity are inconsistent. >> >> 5. The system crashes after writing the last data block for a RAID-5 >> stripe and before writing the last parity block. >> >> When the system comes up, data is consistent, and parity is >> inconsistent. >> >> There are a number of ways of dealing with situations 4 and 5. The >> real problem is that they only occur when the system crashes, so >> whatever recovery information is required must be stored in >> non-volatile storage. Some systems do include a NOVRAM for this kind >> of information, but in general purpose systems the only possibility is >> to write the information to disk, which would make the inherently slow >> RAID-5 write even slower. My attitude here is that RAID-5 writes are >> comparatively infrequent, and so are crashes. In the case of (5), you >> could rebuild parity after a crash. In the case of (4), I have no >> good answer. Suggestions welcome. > > Case 4 is not that different from case 5 as any differences should be > handled by the FS using the volume. The problem is that in case 4 you don't have anything to go by. You don't know which data are inconsistent unless you keep a log. The FS using the volume has followed the kernel into the eternal bit bucket. Greg -- Finger grog@lemis.com for PGP public key See complete headers for address and phone numbers To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Mon Nov 15 12: 6:50 1999 Delivered-To: freebsd-fs@freebsd.org Received: from mail.du.gtn.com (mail.du.gtn.com [194.77.9.57]) by hub.freebsd.org (Postfix) with ESMTP id 07F1C150A7 for ; Mon, 15 Nov 1999 12:06:41 -0800 (PST) (envelope-from ticso@mail.cicely.de) Received: from mail.cicely.de (cicely.de [194.231.9.142]) by mail.du.gtn.com (8.9.3/8.9.3) with ESMTP id UAA28447; Mon, 15 Nov 1999 20:59:40 +0100 (MET) Received: (from ticso@localhost) by mail.cicely.de (8.9.0/8.9.0) id VAA06307; Mon, 15 Nov 1999 21:06:08 +0100 (CET) Date: Mon, 15 Nov 1999 21:06:08 +0100 From: Bernd Walter To: Greg Lehey Cc: Bernd Walter , Mattias Pantzare , freebsd-fs@FreeBSD.ORG Subject: Re: RAID-5 and failure Message-ID: <19991115210607.A6252@cicely7.cicely.de> References: <199911061716.SAA20783@zed.ludd.luth.se> <19991106183316.A9420@cicely7.cicely.de> <19991113213325.57908@mojave.sitaranetworks.com> <19991115203828.B5417@cicely7.cicely.de> <19991115145200.09633@mojave.sitaranetworks.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0pre3i In-Reply-To: <19991115145200.09633@mojave.sitaranetworks.com> Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Mon, Nov 15, 1999 at 02:52:00PM -0500, Greg Lehey wrote: > On Monday, 15 November 1999 at 20:38:28 +0100, Bernd Walter wrote: > > On Sat, Nov 13, 1999 at 09:33:25PM -0500, Greg Lehey wrote: > >> > >> 4. The system crashes after writing the first data block for a RAID-5 > >> stripe and before writing the last data block. > >> > >> When the system comes up, both data and parity are inconsistent. > >> > >> 5. The system crashes after writing the last data block for a RAID-5 > >> stripe and before writing the last parity block. > >> > >> When the system comes up, data is consistent, and parity is > >> inconsistent. > >> > >> There are a number of ways of dealing with situations 4 and 5. The > >> real problem is that they only occur when the system crashes, so > >> whatever recovery information is required must be stored in > >> non-volatile storage. Some systems do include a NOVRAM for this kind > >> of information, but in general purpose systems the only possibility is > >> to write the information to disk, which would make the inherently slow > >> RAID-5 write even slower. My attitude here is that RAID-5 writes are > >> comparatively infrequent, and so are crashes. In the case of (5), you > >> could rebuild parity after a crash. In the case of (4), I have no > >> good answer. Suggestions welcome. > > > > Case 4 is not that different from case 5 as any differences should be > > handled by the FS using the volume. > > The problem is that in case 4 you don't have anything to go by. You > don't know which data are inconsistent unless you keep a log. The FS > using the volume has followed the kernel into the eternal bit bucket. > Of course - but that may happen with R0 too and even it may be possible with a single disk. The FS should realy be able to handle this case as it knows that there is an outstanding write operation. -- B.Walter COSMO-Project http://www.cosmo-project.de ticso@cicely.de Usergroup info@cosmo-project.de To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Mon Nov 15 15:12:19 1999 Delivered-To: freebsd-fs@freebsd.org Received: from cs.columbia.edu (cs.columbia.edu [128.59.16.20]) by hub.freebsd.org (Postfix) with ESMTP id 26D5614A01; Mon, 15 Nov 1999 15:12:15 -0800 (PST) (envelope-from ezk@shekel.mcl.cs.columbia.edu) Received: from shekel.mcl.cs.columbia.edu (shekel.mcl.cs.columbia.edu [128.59.18.15]) by cs.columbia.edu (8.9.1/8.9.1) with ESMTP id SAA08098; Mon, 15 Nov 1999 18:12:13 -0500 (EST) Received: (from ezk@localhost) by shekel.mcl.cs.columbia.edu (8.9.1/8.9.1) id SAA21891; Mon, 15 Nov 1999 18:12:09 -0500 (EST) Date: Mon, 15 Nov 1999 18:12:09 -0500 (EST) Message-Id: <199911152312.SAA21891@shekel.mcl.cs.columbia.edu> X-Authentication-Warning: shekel.mcl.cs.columbia.edu: ezk set sender to ezk@shekel.mcl.cs.columbia.edu using -f From: Erez Zadok To: Eivind Eklund Cc: fs@FreeBSD.ORG Subject: Re: namei() and freeing componentnames In-reply-to: Your message of "Fri, 12 Nov 1999 00:03:59 +0100." <19991112000359.A256@bitbox.follo.net> Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org In message <19991112000359.A256@bitbox.follo.net>, Eivind Eklund writes: [...] > I suspect that for some filesystems (though none of the present ones), > it might be necessary to do more than a > zfree(namei_zone,cnp->cn_pnbuf) in order to free up all the relevant > data. In order to support this, we'd have to introduce a new VOP - > tentatively called VOP_RELEASEND(). Unfortunately, this comes with a > performance penalty. Will VOP_RELEASEND be able to call a filesystem-specific routine? I think it should be flexible enough. I can imagine that the VFS will call a (stackable) filesystem's vop_releasend(), and that stackable f/s can call a number of those on the lower level filesystem(s) it stacked on (there could be more than one, namely fan-out f/s). [...] > This is somewhat vile, but has the advantage of keeping the code ready > for the real VOP_RELEASEND(), and not loosing performance until we > actually get some benefit out of it. [...] > Eivind. WRT performance, I suggest that if possible, we #ifdef all of the stacking code and fixes that have a non-insignificant performance impact. Sure, performance is important, but not at the cost of functionality (IMHO). Not all users would need stacking, so they can choose not to turn on the relevant kernel #define and thus get maximum performance. Those who do want any stacking will have to pay a certain performance overhead. Of course, there's also an argument against too much #ifdef'ed code, b/c it makes maintenance more difficult. I think we should realize that there would be no way to fix the VFS w/o impacting performance. Rather than implement temporary fixes that avoid "hurting" performance, we can (1) conditionalize that code, (2) get it working *correctly* first, then (3) optimize it as needed, and (4) finally, turn it on by default, possibly removing the non-stacking code. Erez. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Nov 16 2: 1:21 1999 Delivered-To: freebsd-fs@freebsd.org Received: from akat.civ.cvut.cz (akat.civ.cvut.cz [147.32.235.105]) by hub.freebsd.org (Postfix) with SMTP id 127C114CBA for ; Tue, 16 Nov 1999 02:01:13 -0800 (PST) (envelope-from pechy@hp735.cvut.cz) Received: from localhost (pechy@localhost) by akat.civ.cvut.cz (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id LAA03026 for ; Tue, 16 Nov 1999 11:01:11 +0100 Date: Tue, 16 Nov 1999 11:01:11 +0100 From: Jan Pechanec X-Sender: pechy@akat.civ.cvut.cz To: FreeBSD FS Mailing List Subject: Copying file with not allocated blocks on disk Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Hello, please, don't you know the reason why when copying file with some blocks still not allocated on the disk (the blocks that will be returned full of zeroes when accessed), the ,,zero'' blocks are actually written? Why there is no check whether writing zero block and do not write them? I understand that this would have to be inside the implementation of particular filesystem. Ie., in general, why not have assertion: if the disk block should contain all zeroes, we needn't to alocate physical space Thank you, Jan. -- Jan PECHANEC (mailto:pechy@hp735.cvut.cz) Computing Center CTU (Zikova 4, Praha 6, 166 35, Czech Republic) www.civ.cvut.cz, pechy.civ.cvut.cz, tel: +420 2 24352969 (fax: 24310271) To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Nov 16 9:15:25 1999 Delivered-To: freebsd-fs@freebsd.org Received: from mail.tvol.com (mail.wgate.com [38.219.83.4]) by hub.freebsd.org (Postfix) with ESMTP id BC500152C0 for ; Tue, 16 Nov 1999 09:15:09 -0800 (PST) (envelope-from rjesup@wgate.com) Received: from jesup.eng.tvol.net (jesup.eng.tvol.net [10.32.2.26]) by mail.tvol.com (8.8.8/8.8.3) with ESMTP id MAA28900; Tue, 16 Nov 1999 12:11:58 -0500 (EST) Reply-To: Randell Jesup To: Greg Lehey Cc: freebsd-fs@FreeBSD.ORG Subject: Re: RAID-5 and failure References: <199911061716.SAA20783@zed.ludd.luth.se> <19991106183316.A9420@cicely7.cicely.de> <19991113213325.57908@mojave.sitaranetworks.com> From: Randell Jesup Date: 16 Nov 1999 12:15:17 -0500 In-Reply-To: Greg Lehey's message of "Sat, 13 Nov 1999 21:33:25 -0500" Message-ID: X-Mailer: Gnus v5.6.43/Emacs 20.4 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Greg Lehey writes: >In RAID-5, I first write the data blocks, then the parity blcoks. >There are a number of scenarios here: >4. The system crashes after writing the first data block for a RAID-5 > stripe and before writing the last data block. > > When the system comes up, both data and parity are inconsistent. > >5. The system crashes after writing the last data block for a RAID-5 > stripe and before writing the last parity block. > > When the system comes up, data is consistent, and parity is > inconsistent. > >There are a number of ways of dealing with situations 4 and 5. The >real problem is that they only occur when the system crashes, so >whatever recovery information is required must be stored in >non-volatile storage. Some systems do include a NOVRAM for this kind >of information, but in general purpose systems the only possibility is >to write the information to disk, which would make the inherently slow >RAID-5 write even slower. My attitude here is that RAID-5 writes are >comparatively infrequent, and so are crashes. In the case of (5), you >could rebuild parity after a crash. In the case of (4), I have no >good answer. Suggestions welcome. Well, assuming that vinum can recognize that there might have been outstanding writes (via the equivalent of a dirty flag): When the disks come back up (dirty), check all the parity. The stripe that was being written will fail to check. In case 4, the data and parity are wrong, and in case 5, just the parity, but you don't know which. If you handle case 4, you can handle case 5 the same way. Obviously you've had a write failure, but usually the FS can deal with that possibility (with the chance of lost data, true). Some form of information passed out about what sector(s) were trashed might be useful in recovery if you're not using default UFS/fsck. If it checks, then the data was all written before any crash, and all is fine. So the biggest trick here is recognizing the fact that the system crashed. You could reserve a block (or set of blocks scattered about) on each drive for dirty flags, and only mark a disk clean if it hasn't had writes in . This keeps the write overhead down without requiring NVRAM. There are other evil tricks: with SCSI, you might be able to change some innocuous mode parameter and use it as a dirty flag, though this probably has at least as much overhead as reserving a dirty-flag block. And of course if you have NVRAM, store the dirty bit there. Hmmmmm. Maybe in the PC's clock chip - they generally have several bits of NVRAM..... (On the Amiga we used those bits for storing things like SCSI Id, boot spinup delay, etc.) Alternatively, you could hide the dirty flag at a higher semantic level, by (at the OS level) recognizing a system that wasn't shut down properly and invoking the vinum re-synchronizer. So long as the sectors with problems aren't needed to boot the kernel and recognize this that will work. >> I asume that's the reason why some systems use 520 byte sectors - maybe they >> write timestamps or generationnumbers in a single write within the sector. > >In fact, the 520 byte sectors are used to protect against data >corruption between the disk and the controller. They won't help in >this scenario. At the cost of performance, you could use some bytes of each sector for generation numbers, and know in case 5 that the data is correct. Obviously case 4 will still fail. -- Randell Jesup, Worldgate Communications, ex-Scala, ex-Amiga OS team ('88-94) rjesup@wgate.com CDA II has been passed and signed, sigh. The lawsuit has been filed. Please support the organizations fighting it - ACLU, EFF, CDT, etc. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Nov 16 10:19:24 1999 Delivered-To: freebsd-fs@freebsd.org Received: from bingnet2.cc.binghamton.edu (bingnet2.cc.binghamton.edu [128.226.1.18]) by hub.freebsd.org (Postfix) with ESMTP id 6A3981533F; Tue, 16 Nov 1999 10:19:11 -0800 (PST) (envelope-from zzhang@cs.binghamton.edu) Received: from sol.cs.binghamton.edu (cs1-gw.cs.binghamton.edu [128.226.171.72]) by bingnet2.cc.binghamton.edu (8.9.3/8.9.3) with SMTP id NAA00441; Tue, 16 Nov 1999 13:19:04 -0500 (EST) Date: Tue, 16 Nov 1999 12:06:37 -0500 (EST) From: Zhihui Zhang Reply-To: Zhihui Zhang To: freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org Subject: On-the-fly defragmentation of FFS Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org After studying the code of ffs_reallocblks() for a while, it occurs to me that the on-the-fly defragmentation of a FFS file (It does this on a per file basis) only takes place at the end of a file and only when the previous logical blocks have all been laid out contiguously on the disk (see also cluster_write()). This seems to me a lot of limitations to the FFS defragger. I wonder if the file was not allocated contiguously when it was first created, how can it find contiguous space later unless we delete a lot of files in between? I hope someone can confirm or correct my understanding. It would be even better if someone can suggest a way to improve defragmentation if the FFS defragger is not very efficient. BTW, if I copy all files from a filesystem to a new filesystem, will the files be stored more contiguously? Why? Any help or suggestion is appreciated. -Zhihui To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Nov 16 12:21:31 1999 Delivered-To: freebsd-fs@freebsd.org Received: from fw.wintelcom.net (ns1.wintelcom.net [209.1.153.20]) by hub.freebsd.org (Postfix) with ESMTP id 6593E14D88 for ; Tue, 16 Nov 1999 12:21:30 -0800 (PST) (envelope-from bright@wintelcom.net) Received: from localhost (bright@localhost) by fw.wintelcom.net (8.9.3/8.9.3) with ESMTP id MAA08105; Tue, 16 Nov 1999 12:46:54 -0800 (PST) Date: Tue, 16 Nov 1999 12:46:54 -0800 (PST) From: Alfred Perlstein To: Zhihui Zhang Cc: freebsd-fs@FreeBSD.ORG Subject: Re: On-the-fly defragmentation of FFS In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Tue, 16 Nov 1999, Zhihui Zhang wrote: > > After studying the code of ffs_reallocblks() for a while, it occurs to me > that the on-the-fly defragmentation of a FFS file (It does this on a per > file basis) only takes place at the end of a file and only when the > previous logical blocks have all been laid out contiguously on the disk > (see also cluster_write()). This seems to me a lot of limitations to the > FFS defragger. I wonder if the file was not allocated contiguously > when it was first created, how can it find contiguous space later unless > we delete a lot of files in between? > > I hope someone can confirm or correct my understanding. It would be even > better if someone can suggest a way to improve defragmentation if the FFS > defragger is not very efficient. > > BTW, if I copy all files from a filesystem to a new filesystem, will the > files be stored more contiguously? Why? > > Any help or suggestion is appreciated. I think you're missing an obvious point, as the file is written out the only place where it is likely to be fragmented is the end, hence the reason for only defragging the end of the file. :) -Alfred To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Nov 16 12:50:57 1999 Delivered-To: freebsd-fs@freebsd.org Received: from alpo.whistle.com (alpo.whistle.com [207.76.204.38]) by hub.freebsd.org (Postfix) with ESMTP id AE91514F07 for ; Tue, 16 Nov 1999 12:50:55 -0800 (PST) (envelope-from julian@whistle.com) Received: from current1.whiste.com (current1.whistle.com [207.76.205.22]) by alpo.whistle.com (8.9.1a/8.9.1) with ESMTP id MAA54808; Tue, 16 Nov 1999 12:50:53 -0800 (PST) Date: Tue, 16 Nov 1999 12:50:51 -0800 (PST) From: Julian Elischer To: Alfred Perlstein Cc: Zhihui Zhang , freebsd-fs@FreeBSD.ORG Subject: Re: On-the-fly defragmentation of FFS In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org > > I think you're missing an obvious point, as the file is written out > the only place where it is likely to be fragmented is the end, hence > the reason for only defragging the end of the file. :) usually, though database files can be written randomly as they are filled in. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Nov 16 13: 1:14 1999 Delivered-To: freebsd-fs@freebsd.org Received: from bingnet2.cc.binghamton.edu (bingnet2.cc.binghamton.edu [128.226.1.18]) by hub.freebsd.org (Postfix) with ESMTP id F1F8914CD5 for ; Tue, 16 Nov 1999 13:01:09 -0800 (PST) (envelope-from zzhang@cs.binghamton.edu) Received: from sol.cs.binghamton.edu (cs1-gw.cs.binghamton.edu [128.226.171.72]) by bingnet2.cc.binghamton.edu (8.9.3/8.9.3) with SMTP id QAA06263; Tue, 16 Nov 1999 16:01:05 -0500 (EST) Date: Tue, 16 Nov 1999 14:48:36 -0500 (EST) From: Zhihui Zhang To: Alfred Perlstein Cc: freebsd-fs@FreeBSD.ORG Subject: Re: On-the-fly defragmentation of FFS In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Tue, 16 Nov 1999, Alfred Perlstein wrote: > On Tue, 16 Nov 1999, Zhihui Zhang wrote: > > > > > After studying the code of ffs_reallocblks() for a while, it occurs to me > > that the on-the-fly defragmentation of a FFS file (It does this on a per > > file basis) only takes place at the end of a file and only when the > > previous logical blocks have all been laid out contiguously on the disk > > (see also cluster_write()). This seems to me a lot of limitations to the > > FFS defragger. I wonder if the file was not allocated contiguously > > when it was first created, how can it find contiguous space later unless > > we delete a lot of files in between? > > > > I hope someone can confirm or correct my understanding. It would be even > > better if someone can suggest a way to improve defragmentation if the FFS > > defragger is not very efficient. > > > > BTW, if I copy all files from a filesystem to a new filesystem, will the > > files be stored more contiguously? Why? > > > > Any help or suggestion is appreciated. > > I think you're missing an obvious point, as the file is written out > the only place where it is likely to be fragmented is the end, hence > the reason for only defragging the end of the file. :) > Thanks. I think this defragmentation (I can not find a better word for it) means making the blocks contiguous. Consider the case which in the last eight blocks of a file, seven of them are already contiguously allocated and only the last block is not. Now if we write at the very last block, the filesystem will try to move those seven blocks and the last block together to some other place to make them all contiguous. This only happens at the end of a file. I was wondering if this can happen elsewhere or if there is a better solution for this kind of adjustment. -Zhihui To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Nov 16 13: 3: 9 1999 Delivered-To: freebsd-fs@freebsd.org Received: from fw.wintelcom.net (ns1.wintelcom.net [209.1.153.20]) by hub.freebsd.org (Postfix) with ESMTP id BA26114D98 for ; Tue, 16 Nov 1999 13:03:08 -0800 (PST) (envelope-from bright@wintelcom.net) Received: from localhost (bright@localhost) by fw.wintelcom.net (8.9.3/8.9.3) with ESMTP id NAA09314; Tue, 16 Nov 1999 13:29:03 -0800 (PST) Date: Tue, 16 Nov 1999 13:29:03 -0800 (PST) From: Alfred Perlstein To: Julian Elischer Cc: Zhihui Zhang , freebsd-fs@FreeBSD.ORG Subject: Re: On-the-fly defragmentation of FFS In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Tue, 16 Nov 1999, Julian Elischer wrote: > > > > I think you're missing an obvious point, as the file is written out > > the only place where it is likely to be fragmented is the end, hence > > the reason for only defragging the end of the file. :) > > usually, though database files can be written randomly as they are filled > in. Excellent point, however won't FFS's block placement strategy fix work around this unless the filesystem is already pretty full? Or is this one of the bad-case-scenarios for FFS? -Alfred To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Nov 16 13: 9:33 1999 Delivered-To: freebsd-fs@freebsd.org Received: from bingnet2.cc.binghamton.edu (bingnet2.cc.binghamton.edu [128.226.1.18]) by hub.freebsd.org (Postfix) with ESMTP id 22C3C14D98 for ; Tue, 16 Nov 1999 13:09:30 -0800 (PST) (envelope-from zzhang@cs.binghamton.edu) Received: from sol.cs.binghamton.edu (cs1-gw.cs.binghamton.edu [128.226.171.72]) by bingnet2.cc.binghamton.edu (8.9.3/8.9.3) with SMTP id QAA09240; Tue, 16 Nov 1999 16:09:25 -0500 (EST) Date: Tue, 16 Nov 1999 14:56:56 -0500 (EST) From: Zhihui Zhang To: Julian Elischer Cc: Alfred Perlstein , freebsd-fs@FreeBSD.ORG Subject: Re: On-the-fly defragmentation of FFS In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Tue, 16 Nov 1999, Julian Elischer wrote: > > > > I think you're missing an obvious point, as the file is written out > > the only place where it is likely to be fragmented is the end, hence > > the reason for only defragging the end of the file. :) > > usually, though database files can be written randomly as they are filled > in. > Can a database file has holes? I had some experience with Oracle. I used to create a large file for a database and assumed that all space of the database file are pre-allocated. Otherwise, the performance of the database will be poor. -Zhihui To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Nov 16 13:13:21 1999 Delivered-To: freebsd-fs@freebsd.org Received: from fw.wintelcom.net (ns1.wintelcom.net [209.1.153.20]) by hub.freebsd.org (Postfix) with ESMTP id 12E3914D98 for ; Tue, 16 Nov 1999 13:13:20 -0800 (PST) (envelope-from bright@wintelcom.net) Received: from localhost (bright@localhost) by fw.wintelcom.net (8.9.3/8.9.3) with ESMTP id NAA09589; Tue, 16 Nov 1999 13:39:14 -0800 (PST) Date: Tue, 16 Nov 1999 13:39:14 -0800 (PST) From: Alfred Perlstein To: Zhihui Zhang Cc: freebsd-fs@FreeBSD.ORG Subject: Re: On-the-fly defragmentation of FFS In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Tue, 16 Nov 1999, Zhihui Zhang wrote: > > On Tue, 16 Nov 1999, Alfred Perlstein wrote: > > > On Tue, 16 Nov 1999, Zhihui Zhang wrote: > > > > > > > > After studying the code of ffs_reallocblks() for a while, it occurs to me > > > that the on-the-fly defragmentation of a FFS file (It does this on a per > > > file basis) only takes place at the end of a file and only when the > > > previous logical blocks have all been laid out contiguously on the disk > > > (see also cluster_write()). This seems to me a lot of limitations to the > > > FFS defragger. I wonder if the file was not allocated contiguously > > > when it was first created, how can it find contiguous space later unless > > > we delete a lot of files in between? > > > > > > I hope someone can confirm or correct my understanding. It would be even > > > better if someone can suggest a way to improve defragmentation if the FFS > > > defragger is not very efficient. > > > > > > BTW, if I copy all files from a filesystem to a new filesystem, will the > > > files be stored more contiguously? Why? > > > > > > Any help or suggestion is appreciated. > > > > I think you're missing an obvious point, as the file is written out > > the only place where it is likely to be fragmented is the end, hence > > the reason for only defragging the end of the file. :) > > > > Thanks. I think this defragmentation (I can not find a better word for it) > means making the blocks contiguous. Consider the case which in the last > eight blocks of a file, seven of them are already contiguously allocated > and only the last block is not. Now if we write at the very last block, > the filesystem will try to move those seven blocks and the last block > together to some other place to make them all contiguous. This only > happens at the end of a file. I was wondering if this can happen > elsewhere or if there is a better solution for this kind of adjustment. Not to my knowledge, however if it only works on the tail end of files (which I'm 99% sure is true) then Julian's point is a problem for this algorithm, (files with holes) it may be smart to try to reallocblks on 64k cluster boundries. However this starts to get into adaptive algorithms, something that FFS already has plenty of. :) More couldn't hurt, insight, work and testing of such an algorithm would probably be very appreciated. One of the things that Kirk mused making adaptive was FFS's aggressive write-behind feature that can cause problems when the entire dataset fits into ram. It doesn't necessarily cause problems, execpt for the fact that linux has a more aggressive caching algorithm that will not write anything out until the cache is nearly full. Each approach has it's advantages and drawbacks, FreeBSD excels when the dataset is larger than ram, whereas Linux owns the show when it does fit into ram. An adaptive algorithm would be very benificial for this strategy. -Alfred > > -Zhihui > To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Tue Nov 16 13:27:49 1999 Delivered-To: freebsd-fs@freebsd.org Received: from bingnet2.cc.binghamton.edu (bingnet2.cc.binghamton.edu [128.226.1.18]) by hub.freebsd.org (Postfix) with ESMTP id 1831B14D01 for ; Tue, 16 Nov 1999 13:27:46 -0800 (PST) (envelope-from zzhang@cs.binghamton.edu) Received: from sol.cs.binghamton.edu (cs1-gw.cs.binghamton.edu [128.226.171.72]) by bingnet2.cc.binghamton.edu (8.9.3/8.9.3) with SMTP id QAA16608; Tue, 16 Nov 1999 16:27:44 -0500 (EST) Date: Tue, 16 Nov 1999 15:15:13 -0500 (EST) From: Zhihui Zhang To: Alfred Perlstein Cc: freebsd-fs@FreeBSD.ORG Subject: Re: On-the-fly defragmentation of FFS In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org > One of the things that Kirk mused making adaptive was FFS's aggressive > write-behind feature that can cause problems when the entire dataset > fits into ram. Are you talking about softupdate code? Could you explain a little more about this? It seems to me that writes will not happen unless there is no room in the cache. > It doesn't necessarily cause problems, execpt for > the fact that linux has a more aggressive caching algorithm that will > not write anything out until the cache is nearly full. Each approach > has it's advantages and drawbacks, FreeBSD excels when the dataset is > larger than ram, whereas Linux owns the show when it does fit into > ram. An adaptive algorithm would be very benificial for this strategy. Are there any references for this subject? -Zhihui To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Nov 17 7:24:37 1999 Delivered-To: freebsd-fs@freebsd.org Received: from yana.lemis.com (yana.lemis.com [192.109.197.140]) by hub.freebsd.org (Postfix) with ESMTP id 2635E14E09 for ; Wed, 17 Nov 1999 07:24:28 -0800 (PST) (envelope-from grog@mojave.sitaranetworks.com) Received: from mojave.sitaranetworks.com (mojave.sitaranetworks.com [199.103.141.157]) by yana.lemis.com (8.8.8/8.8.8) with ESMTP id BAA23656; Thu, 18 Nov 1999 01:54:22 +1030 (CST) (envelope-from grog@mojave.sitaranetworks.com) Message-ID: <19991116204916.44107@mojave.sitaranetworks.com> Date: Tue, 16 Nov 1999 20:49:16 -0500 From: Greg Lehey To: Randell Jesup Cc: freebsd-fs@FreeBSD.ORG Subject: Re: RAID-5 and failure Reply-To: Greg Lehey References: <199911061716.SAA20783@zed.ludd.luth.se> <19991106183316.A9420@cicely7.cicely.de> <19991113213325.57908@mojave.sitaranetworks.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii In-Reply-To: ; from Randell Jesup on Tue, Nov 16, 1999 at 12:15:17PM -0500 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Tuesday, 16 November 1999 at 12:15:17 -0500, Randell Jesup wrote: > Greg Lehey writes: >> In RAID-5, I first write the data blocks, then the parity blcoks. >> There are a number of scenarios here: > >> 4. The system crashes after writing the first data block for a RAID-5 >> stripe and before writing the last data block. >> >> When the system comes up, both data and parity are inconsistent. >> >> 5. The system crashes after writing the last data block for a RAID-5 >> stripe and before writing the last parity block. >> >> When the system comes up, data is consistent, and parity is >> inconsistent. >> >> There are a number of ways of dealing with situations 4 and 5. The >> real problem is that they only occur when the system crashes, so >> whatever recovery information is required must be stored in >> non-volatile storage. Some systems do include a NOVRAM for this kind >> of information, but in general purpose systems the only possibility is >> to write the information to disk, which would make the inherently slow >> RAID-5 write even slower. My attitude here is that RAID-5 writes are >> comparatively infrequent, and so are crashes. In the case of (5), you >> could rebuild parity after a crash. In the case of (4), I have no >> good answer. Suggestions welcome. > > Well, assuming that vinum can recognize that there might have been > outstanding writes (via the equivalent of a dirty flag): > > When the disks come back up (dirty), check all the parity. > The stripe that was being written will fail to check. In case 4, the data > and parity are wrong, and in case 5, just the parity, but you don't know > which. If you handle case 4, you can handle case 5 the same way. > Obviously you've had a write failure, but usually the FS can deal with > that possibility (with the chance of lost data, true). Some form of > information passed out about what sector(s) were trashed might be useful > in recovery if you're not using default UFS/fsck. Well, you're still left with the dilemma. Worse, this check makes fsck look like an instantaneous operation: you have to read the entire contents of every disk. For a 500 GB database spread across 3 LVD controllers, you're looking at several hours. > If it checks, then the data was all written before any crash, > and all is fine. That's the simple case. > So the biggest trick here is recognizing the fact that the system > crashed. You could reserve a block (or set of blocks scattered about) on > each drive for dirty flags, and only mark a disk clean if it hasn't had > writes in . This keeps the write > overhead down without requiring NVRAM. There are other evil tricks: with > SCSI, you might be able to change some innocuous mode parameter and use > it as a dirty flag, though this probably has at least as much overhead > as reserving a dirty-flag block. And of course if you have NVRAM, store > the dirty bit there. Hmmmmm. Maybe in the PC's clock chip - they > generally have several bits of NVRAM..... (On the Amiga we used those > bits for storing things like SCSI Id, boot spinup delay, etc.) > > Alternatively, you could hide the dirty flag at a higher semantic > level, by (at the OS level) recognizing a system that wasn't shut down > properly and invoking the vinum re-synchronizer. So long as the sectors > with problems aren't needed to boot the kernel and recognize this that will > work. Basically, the way I see it, we have three options: 1. Disks never crash, and anyway, we don't write to them. Ignore the problem and deal with it if it comes to bite us. 2. Get an NVRAM board and use it for this purpose. 3. Bite the bullet and write intention logs before each write. VERITAS has this as an option. These options don't have to be mutually exclusive. It's quite possible to implement both ((1) doesn't need implementation :-) and leave it to the user to decide which to use. >>> I asume that's the reason why some systems use 520 byte sectors - maybe they >>> write timestamps or generationnumbers in a single write within the sector. >> >> In fact, the 520 byte sectors are used to protect against data >> corruption between the disk and the controller. They won't help in >> this scenario. > > At the cost of performance, you could use some bytes of each sector > for generation numbers, and know in case 5 that the data is correct. > Obviously case 4 will still fail. No, the way things work, this would be very expensive. We'd have to move the data to a larger buffer and set the flags, and it would also require at least reformatting the drive, assuming it's possible to set a different sector. There are better ways to do this. Greg -- Finger grog@lemis.com for PGP public key See complete headers for address and phone numbers To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Nov 17 7:25:14 1999 Delivered-To: freebsd-fs@freebsd.org Received: from yana.lemis.com (yana.lemis.com [192.109.197.140]) by hub.freebsd.org (Postfix) with ESMTP id 031241528B for ; Wed, 17 Nov 1999 07:25:03 -0800 (PST) (envelope-from grog@mojave.sitaranetworks.com) Received: from mojave.sitaranetworks.com (mojave.sitaranetworks.com [199.103.141.157]) by yana.lemis.com (8.8.8/8.8.8) with ESMTP id BAA23662; Thu, 18 Nov 1999 01:54:44 +1030 (CST) (envelope-from grog@mojave.sitaranetworks.com) Message-ID: <19991116204101.12932@mojave.sitaranetworks.com> Date: Tue, 16 Nov 1999 20:41:01 -0500 From: Greg Lehey To: Bernd Walter Cc: Mattias Pantzare , freebsd-fs@FreeBSD.ORG Subject: Re: RAID-5 and failure Reply-To: Greg Lehey References: <199911061716.SAA20783@zed.ludd.luth.se> <19991106183316.A9420@cicely7.cicely.de> <19991113213325.57908@mojave.sitaranetworks.com> <19991115203828.B5417@cicely7.cicely.de> <19991115145200.09633@mojave.sitaranetworks.com> <19991115210607.A6252@cicely7.cicely.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii In-Reply-To: <19991115210607.A6252@cicely7.cicely.de>; from Bernd Walter on Mon, Nov 15, 1999 at 09:06:08PM +0100 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Monday, 15 November 1999 at 21:06:08 +0100, Bernd Walter wrote: > On Mon, Nov 15, 1999 at 02:52:00PM -0500, Greg Lehey wrote: >> On Monday, 15 November 1999 at 20:38:28 +0100, Bernd Walter wrote: >>> On Sat, Nov 13, 1999 at 09:33:25PM -0500, Greg Lehey wrote: >>>> >>>> 4. The system crashes after writing the first data block for a RAID-5 >>>> stripe and before writing the last data block. >>>> >>>> When the system comes up, both data and parity are inconsistent. >>>> >>>> 5. The system crashes after writing the last data block for a RAID-5 >>>> stripe and before writing the last parity block. >>>> >>>> When the system comes up, data is consistent, and parity is >>>> inconsistent. >>>> >>>> There are a number of ways of dealing with situations 4 and 5. The >>>> real problem is that they only occur when the system crashes, so >>>> whatever recovery information is required must be stored in >>>> non-volatile storage. Some systems do include a NOVRAM for this kind >>>> of information, but in general purpose systems the only possibility is >>>> to write the information to disk, which would make the inherently slow >>>> RAID-5 write even slower. My attitude here is that RAID-5 writes are >>>> comparatively infrequent, and so are crashes. In the case of (5), you >>>> could rebuild parity after a crash. In the case of (4), I have no >>>> good answer. Suggestions welcome. >>> >>> Case 4 is not that different from case 5 as any differences should be >>> handled by the FS using the volume. >> >> The problem is that in case 4 you don't have anything to go by. You >> don't know which data are inconsistent unless you keep a log. The FS >> using the volume has followed the kernel into the eternal bit bucket. > > Of course - but that may happen with R0 too and even it may be possible with > a single disk. Sure. It's not specific to RAID-5. > The FS should realy be able to handle this case as it knows that > there is an outstanding write operation. How does it know? That's the question. All state information has gone to /dev/null. The only alternative is to write this state information to some non-volatile location, which usually means disk and associated severe loss of performance. Greg -- Finger grog@lemis.com for PGP public key See complete headers for address and phone numbers To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Nov 17 9:31:41 1999 Delivered-To: freebsd-fs@freebsd.org Received: from bomber.avantgo.com (ws1.avantgo.com [207.214.200.194]) by hub.freebsd.org (Postfix) with ESMTP id D3DD814F68 for ; Wed, 17 Nov 1999 09:31:24 -0800 (PST) (envelope-from scott@avantgo.com) Received: from river ([10.0.128.30]) by bomber.avantgo.com (Netscape Messaging Server 3.5) with SMTP id 238 for ; Wed, 17 Nov 1999 09:27:00 -0800 Message-ID: <166101bf3121$76518900$1e80000a@avantgo.com> From: "Scott Hess" To: Subject: vinum, MYSQL, and small transaction sizes. Date: Wed, 17 Nov 1999 09:30:37 -0800 MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 5.00.2314.1300 X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org I've been experimenting with vinum striping as a means of improving MYSQL performance, and am having some odd results. Running a particular workload and a particular set of disks, at overload iostat shows the disk doing about 185 tps, and about 8KB/t. When I run the workload on a 256k striped volume made up of two drives, I'm finding that each drive does about 95 tps. I've also run the tests with slower drives, which do 155 tps for the single-drive test, and 80 tps for the striped test. I didn't expect to double the tps of the entire system - but getting no increase at all seems very suspect. Based on the transaction sizes iostat is reporting, I have tried restriping with 8k stripes, which gives me about 105 tps per disk, which is marginally better. Going the other direction, with 1m stripes, gave the same results as for 256k stripes. In an attempt to isolate the problem, I tried cat'ing very large files in parallel. The files were large enough to not fit in memory, and I ran four cat commands at the same time on different files. I found that running them all from a single disk gave 380tps (24M/s), running 4 on one drive and 4 on the other gave 200tps (12M/s) for each drive, 400tps (24M/s) aggregate, and running them on a 256k volume striped across the disks gave 100tps (6M/s) for each drive, 200tps (12M/s) aggregate. Given past experience with the Linux md driver, I really really really suspect I'm missing something. But I couldn't tell you what. Running under FreeBSD3.3-RELEASE. Later, scott To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Nov 17 10:19:55 1999 Delivered-To: freebsd-fs@freebsd.org Received: from mail.du.gtn.com (mail.du.gtn.com [194.77.9.57]) by hub.freebsd.org (Postfix) with ESMTP id AB93214FD6 for ; Wed, 17 Nov 1999 10:19:41 -0800 (PST) (envelope-from ticso@mail.cicely.de) Received: from mail.cicely.de (cicely.de [194.231.9.142]) by mail.du.gtn.com (8.9.3/8.9.3) with ESMTP id TAA29709; Wed, 17 Nov 1999 19:12:43 +0100 (MET) Received: (from ticso@localhost) by mail.cicely.de (8.9.0/8.9.0) id TAA13518; Wed, 17 Nov 1999 19:19:13 +0100 (CET) Date: Wed, 17 Nov 1999 19:19:13 +0100 From: Bernd Walter To: Greg Lehey Cc: Bernd Walter , Mattias Pantzare , freebsd-fs@FreeBSD.ORG Subject: Re: RAID-5 and failure Message-ID: <19991117191912.A12883@cicely7.cicely.de> References: <199911061716.SAA20783@zed.ludd.luth.se> <19991106183316.A9420@cicely7.cicely.de> <19991113213325.57908@mojave.sitaranetworks.com> <19991115203828.B5417@cicely7.cicely.de> <19991115145200.09633@mojave.sitaranetworks.com> <19991115210607.A6252@cicely7.cicely.de> <19991116204101.12932@mojave.sitaranetworks.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0pre3i In-Reply-To: <19991116204101.12932@mojave.sitaranetworks.com> Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Tue, Nov 16, 1999 at 08:41:01PM -0500, Greg Lehey wrote: > On Monday, 15 November 1999 at 21:06:08 +0100, Bernd Walter wrote: > > > The FS should realy be able to handle this case as it knows that > > there is an outstanding write operation. > > How does it know? That's the question. All state information has > gone to /dev/null. The only alternative is to write this state > information to some non-volatile location, which usually means disk > and associated severe loss of performance. The FS is dirty. The FS before the panic/powerfailure/... had known the outstanding transaction and shouldn't create a situation in which fsck can't handle such a case. It should even expect only a part to be writen as multiple sector transfers are known not to be atomic - that's why critical state information should never cross sector boundarys. I asume most modern HDDs are able to finish a single sector write in case of power failures. In case the drive simply returns a CRC error we realy have a problem because the parity might not be in sync and we can't recover this sector relyable. Nevertheless I got several powerfailures during write access and never got CRCs since ESDI because of that. In case application data was lost that's not a OS specific problem. As long as the applications did not flush the buffers and success was returned it should not be surprised if data gets lost because they could also be in some kind of writecache. -- B.Walter COSMO-Project http://www.cosmo-project.de ticso@cicely.de Usergroup info@cosmo-project.de To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Nov 17 14:29:48 1999 Delivered-To: freebsd-fs@freebsd.org Received: from yana.lemis.com (yana.lemis.com [192.109.197.140]) by hub.freebsd.org (Postfix) with ESMTP id A7E7914DF8 for ; Wed, 17 Nov 1999 14:29:39 -0800 (PST) (envelope-from grog@mojave.sitaranetworks.com) Received: from mojave.sitaranetworks.com (mojave.sitaranetworks.com [199.103.141.157]) by yana.lemis.com (8.8.8/8.8.8) with ESMTP id IAA24124; Thu, 18 Nov 1999 08:59:25 +1030 (CST) (envelope-from grog@mojave.sitaranetworks.com) Message-ID: <19991117172851.06023@mojave.sitaranetworks.com> Date: Wed, 17 Nov 1999 17:28:51 -0500 From: Greg Lehey To: Scott Hess , freebsd-fs@FreeBSD.ORG Subject: Re: vinum, MYSQL, and small transaction sizes. Reply-To: Greg Lehey References: <166101bf3121$76518900$1e80000a@avantgo.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii In-Reply-To: <166101bf3121$76518900$1e80000a@avantgo.com>; from Scott Hess on Wed, Nov 17, 1999 at 09:30:37AM -0800 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Wednesday, 17 November 1999 at 9:30:37 -0800, Scott Hess wrote: > I've been experimenting with vinum striping as a means of improving MYSQL > performance, and am having some odd results. > > Running a particular workload and a particular set of disks, at overload > iostat shows the disk doing about 185 tps, and about 8KB/t. When I run the > workload on a 256k striped volume made up of two drives, I'm finding that > each drive does about 95 tps. I've also run the tests with slower drives, > which do 155 tps for the single-drive test, and 80 tps for the striped > test. > > I didn't expect to double the tps of the entire system - but getting no > increase at all seems very suspect. It's frequently the system's way of saying "the disk is not the bottleneck". > Based on the transaction sizes iostat is reporting, I have tried > restriping with 8k stripes, which gives me about 105 tps per disk, > which is marginally better. Going the other direction, with 1m > stripes, gave the same results as for 256k stripes. I think this is probably a red herring. It's very unlikely that you'll get better performance from an 8k stripe than a 256k stripe. The fact that there's not a significant degradation with such small stripes again points to the likelihood that the disks aren't the bottleneck, though it could also indicate that the transfers are very small (as you indicate in the Subject: line). How big are the transfers? > In an attempt to isolate the problem, I tried cat'ing very large > files in parallel. The files were large enough to not fit in > memory, and I ran four cat commands at the same time on different > files. I found that running them all from a single disk gave 380tps > (24M/s), running 4 on one drive and 4 on the other gave 200tps > (12M/s) for each drive, 400tps (24M/s) aggregate, and running them > on a 256k volume striped across the disks gave 100tps (6M/s) for > each drive, 200tps (12M/s) aggregate. Hmm. The arithmetic at the end suggests that you only striped across 2 disks. What kind of disks are they? You'll run into significant contention problems with IDE, for example. Also, what version of FreeBSD? Greg -- Finger grog@lemis.com for PGP public key See complete headers for address and phone numbers To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Wed Nov 17 15:15:59 1999 Delivered-To: freebsd-fs@freebsd.org Received: from bomber.avantgo.com (ws1.avantgo.com [207.214.200.194]) by hub.freebsd.org (Postfix) with ESMTP id 4C69D14C9E for ; Wed, 17 Nov 1999 15:15:56 -0800 (PST) (envelope-from scott@avantgo.com) Received: from river ([10.0.128.30]) by bomber.avantgo.com (Netscape Messaging Server 3.5) with SMTP id 215; Wed, 17 Nov 1999 15:11:36 -0800 Message-ID: <17e101bf3151$99554ec0$1e80000a@avantgo.com> From: "Scott Hess" To: "Greg Lehey" , References: <166101bf3121$76518900$1e80000a@avantgo.com> <19991117172851.06023@mojave.sitaranetworks.com> Subject: Re: vinum, MYSQL, and small transaction sizes. Date: Wed, 17 Nov 1999 15:15:12 -0800 MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 5.00.2314.1300 X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Greg Lehey wrote: > On Wednesday, 17 November 1999 at 9:30:37 -0800, Scott Hess wrote: > > I didn't expect to double the tps of the entire system - but getting no > > increase at all seems very suspect. > > It's frequently the system's way of saying "the disk is not the > bottleneck". Memory is not an issue, CPU time is not an issue. AFAICT, the disk _is_ the bottleneck, because when I upgrade to faster disks, the tps goes up - both for the single-disk test (155->185), and for the vinum'ed test (80->95). I can't think of another way I'd see those results. > > Based on the transaction sizes iostat is reporting, I have tried > > restriping with 8k stripes, which gives me about 105 tps per disk, > > which is marginally better. Going the other direction, with 1m > > stripes, gave the same results as for 256k stripes. > > I think this is probably a red herring. It's very unlikely that > you'll get better performance from an 8k stripe than a 256k stripe. > The fact that there's not a significant degradation with such small > stripes again points to the likelihood that the disks aren't the > bottleneck, though it could also indicate that the transfers are very > small (as you indicate in the Subject: line). How big are the > transfers? iostat reports that the average transfer size is 8k. I can't tell for certain what the distribution is, but I am pretty certain it is basically everything at 8k, with a couple 16k transfers (lots of short bits of data). > > In an attempt to isolate the problem, I tried cat'ing very large > > files in parallel. The files were large enough to not fit in > > memory, and I ran four cat commands at the same time on different > > files. I found that running them all from a single disk gave 380tps > > (24M/s), running 4 on one drive and 4 on the other gave 200tps > > (12M/s) for each drive, 400tps (24M/s) aggregate, and running them > > on a 256k volume striped across the disks gave 100tps (6M/s) for > > each drive, 200tps (12M/s) aggregate. > > Hmm. The arithmetic at the end suggests that you only striped across > 2 disks. What kind of disks are they? You'll run into significant > contention problems with IDE, for example. Also, what version of > FreeBSD? 10k 18Gig Seagate disks, on an NCR 875 controller. The disks by themselves kick ass. The disks both being used at the same time kick ass. The disks when used with vinum do not kick ass. Again, I don't expect to double performance, but my experience did lead me to believe we should have added 50% or so with the second disk, perhaps more given the nature of our use. Later, scott To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Nov 18 4:47: 9 1999 Delivered-To: freebsd-fs@freebsd.org Received: from akat.civ.cvut.cz (akat.civ.cvut.cz [147.32.235.105]) by hub.freebsd.org (Postfix) with SMTP id C7A9215161 for ; Thu, 18 Nov 1999 04:46:50 -0800 (PST) (envelope-from pechy@hp735.cvut.cz) Received: from localhost (pechy@localhost) by akat.civ.cvut.cz (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id NAA10878 for ; Thu, 18 Nov 1999 13:46:49 +0100 Date: Thu, 18 Nov 1999 13:46:49 +0100 From: Jan Pechanec X-Sender: pechy@akat.civ.cvut.cz To: FreeBSD FS Mailing List Subject: Unix International Stackable Files Working Group Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Hello, in several papers on filesystems I found the reference to ${subj}. I spent quite enough time trying to find it through several www search engines, but wasn't succesful. Please, does anybody have more information on this group ? Thank you, Jan. -- Jan PECHANEC (mailto:pechy@hp735.cvut.cz) Computing Center CTU (Zikova 4, Praha 6, 166 35, Czech Republic) www.civ.cvut.cz, pechy.civ.cvut.cz, tel: +420 2 24352969 (fax: 24310271) To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Nov 18 5:18:57 1999 Delivered-To: freebsd-fs@freebsd.org Received: from mentisworks.com (valkery.mentisworks.com [207.227.89.226]) by hub.freebsd.org (Postfix) with ESMTP id 3006A150FD for ; Thu, 18 Nov 1999 05:18:48 -0800 (PST) (envelope-from nathank@mentisworks.com) Received: from [24.29.197.186] (HELO mentisworks.com) by mentisworks.com (CommuniGate Pro SMTP 3.2b5) with ESMTP id 550005; Thu, 18 Nov 1999 07:18:44 -0600 Received: from [192.168.245.111] (HELO mentisworks.com) by mentisworks.com (CommuniGate Pro SMTP 3.2b5) with ESMTP id 1320010; Thu, 18 Nov 1999 07:18:47 -0600 Message-ID: <3833FC97.3224106@mentisworks.com> Date: Thu, 18 Nov 1999 07:18:15 -0600 From: Nathan Kinsman X-Mailer: Mozilla 4.7 [en] (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: Jan Pechanec , freebsd-fs@freebsd.org Subject: Re: Unix International Stackable Files Working Group References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org I've seen a reference to this before: Unix International Stackable Files Working Group, ``Requirements for Stackable Files,'' Rev. 3.6, Feb. 1993 Unix Int'l., Parsippany, NJ. ^^^^^^^^^^ ^^^^^^^^^^^^^^ The organization is (was) a consortium including Sun, AT&T and others formed to promote an open environment based on Unix System V, including the Open Look windowing system. - Nathan Kinsman Jan Pechanec wrote: > > Hello, > > in several papers on filesystems I found the reference to > ${subj}. I spent quite enough time trying to find it through several > www search engines, but wasn't succesful. Please, does anybody have > more information on this group ? > > Thank you, Jan. > > -- > Jan PECHANEC (mailto:pechy@hp735.cvut.cz) > Computing Center CTU (Zikova 4, Praha 6, 166 35, Czech Republic) > www.civ.cvut.cz, pechy.civ.cvut.cz, tel: +420 2 24352969 (fax: 24310271) > > To Unsubscribe: send mail to majordomo@FreeBSD.org > with "unsubscribe freebsd-fs" in the body of the message To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Nov 18 6:32:30 1999 Delivered-To: freebsd-fs@freebsd.org Received: from ns1.yes.no (ns1.yes.no [195.204.136.10]) by hub.freebsd.org (Postfix) with ESMTP id 84E6E1513B for ; Thu, 18 Nov 1999 06:32:22 -0800 (PST) (envelope-from eivind@bitbox.follo.net) Received: from bitbox.follo.net (bitbox.follo.net [195.204.143.218]) by ns1.yes.no (8.9.3/8.9.3) with ESMTP id PAA05340; Thu, 18 Nov 1999 15:32:21 +0100 (CET) Received: (from eivind@localhost) by bitbox.follo.net (8.8.8/8.8.6) id PAA62682; Thu, 18 Nov 1999 15:32:20 +0100 (MET) Date: Thu, 18 Nov 1999 15:32:20 +0100 From: Eivind Eklund To: Erez Zadok Cc: fs@FreeBSD.ORG Subject: Re: namei() and freeing componentnames Message-ID: <19991118153220.E45524@bitbox.follo.net> References: <19991112000359.A256@bitbox.follo.net> <199911152312.SAA21891@shekel.mcl.cs.columbia.edu> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0i In-Reply-To: <199911152312.SAA21891@shekel.mcl.cs.columbia.edu>; from ezk@cs.columbia.edu on Mon, Nov 15, 1999 at 06:12:09PM -0500 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org [Note to impatient readers - forward view if included at the bottom of this mail] On Mon, Nov 15, 1999 at 06:12:09PM -0500, Erez Zadok wrote: > In message <19991112000359.A256@bitbox.follo.net>, Eivind Eklund writes: > [...] > > I suspect that for some filesystems (though none of the present ones), > > it might be necessary to do more than a > > zfree(namei_zone,cnp->cn_pnbuf) in order to free up all the relevant > > data. In order to support this, we'd have to introduce a new VOP - > > tentatively called VOP_RELEASEND(). Unfortunately, this comes with a > > performance penalty. > > Will VOP_RELEASEND be able to call a filesystem-specific routine? I think > it should be flexible enough. All VOPs are filesystem specific (or can be, at least). > I can imagine that the VFS will call a (stackable) filesystem's > vop_releasend(), and that stackable f/s can call a number of those > on the lower level filesystem(s) it stacked on (there could be more > than one, namely fan-out f/s). Yes, this is the intent. The problem I'm finding with VOP_RELEASEND() is that namei() can return two different vps - the dvp (directory vp) and the actual vp (inside the directory dvp points at), and that neither of these are always available. As I am writing the code right now, I am using either of these, with a preference for the dvp. I am considering splitting VOP_RELEASEND() into VOP_RELEASEND() and VOP_DRELEASEND(), which takes the different VPs as parameters - this will at least give something that is easy to search for if we need to change the behaviour somehow. > [...] > > This is somewhat vile, but has the advantage of keeping the code ready > > for the real VOP_RELEASEND(), and not loosing performance until we > > actually get some benefit out of it. > [...] > > Eivind. > > WRT performance, I suggest that if possible, we #ifdef all of the stacking > code and fixes that have a non-insignificant performance impact. Nothing I'm so far positive we will need have a significant performance impact. I'm not sure the performance impact for VOP_RELEASEND() will be significant, either - it is just that I would like to avoid having performance impact without gain, and for this particular case I'm not positive we will ever need it - but I'm not positive we won't, either. This is why I am trying to do the code in a way that let us move to having it quickly, but do not force us to live with the penalites if it turns out we do not need it. > Sure, performance is important, but not at the cost of functionality > (IMHO). Not all users would need stacking, so they can choose not > to turn on the relevant kernel #define and thus get maximum > performance. Those who do want any stacking will have to pay a > certain performance overhead. I hope to make stacking layers really light weight ("featherweight stacking"), and believe it will make sense to use it internally in the kernel organization. If this turns out to be right, everybody will have to have them. > Of course, there's also an argument against too much #ifdef'ed code, > b/c it makes maintenance more difficult. For some of the things I am doing now (e.g, the WILLRELE fixes), ifdef'ing would be a royal pain, making it extremely hard to read the code. > I think we should realize that there would be no way to fix the VFS w/o > impacting performance. Actually, I am reasonably confident that we can do the fixes without impacting performance noticably. > Rather than implement temporary fixes that avoid "hurting" > performance, we can (1) conditionalize that code, (2) get it working > *correctly* first, then (3) optimize it as needed, and (4) finally, > turn it on by default, possibly removing the non-stacking code. What I am doing now is done more or less by these principles - though instead of conditionalizing code I do not know if we will need, I make it very easy to write it if it turns out we will need it. Progress report: Based on current rate of progress, it looks like I'll be able to have patches ready for (my personal) testing sunday (or *possibly* saturday, but most likely not). Depending on how testing/debugging works out, the patches will most likely be ready for public testing sometime next week. I'll need help with NFS testing. Forward view: I'm undecided on the next step. Possibilities: (1) Change the way locking is specificied to make it feasible to test locking patches properly, and change the assertion generation to generate better assertions. This will probably require changing VOP_ISLOCKED() to be able to take a process parameter, and return different valued based on wether an exlusive lock is held by that process or by another process. The present behaviour will be available by passing NULL for this parameter. Presently, running multiple processes does not work properly, as the assertions do not really assert the right things. These changes are necessary to properly debug the use of locks, which I again believe is necessary for stacking layers (which I would like to work in 4.0, but I don't know if I will be able to have ready). (2) Change the behaviour of VOP_LOOKUP() to "eat as much as you can, and return how much that was" rather than "Eat a single path component; we have already decided what this is." This allows different types of namespaces, and it allows optimizations in VOP_LOOKUP() when several steps in the traversal is inside a single filesystem (and hey - who mounts a new filesystem on every directory they see, anyway?) This change is rather small, and it would be nice to have in 4.0 (I want the VFS differences from 4.0 to 5.0 to be as small as possible). It is pretty orthogonal to stacking layers; stacking layers gain the same capabilities as other file systems from it. Eivind. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Nov 18 9:26:20 1999 Delivered-To: freebsd-fs@freebsd.org Received: from smtp05.primenet.com (smtp05.primenet.com [206.165.6.135]) by hub.freebsd.org (Postfix) with ESMTP id 7E70F1512A for ; Thu, 18 Nov 1999 09:25:58 -0800 (PST) (envelope-from tlambert@usr02.primenet.com) Received: (from daemon@localhost) by smtp05.primenet.com (8.9.3/8.9.3) id KAA13267; Thu, 18 Nov 1999 10:25:33 -0700 (MST) Received: from usr02.primenet.com(206.165.6.202) via SMTP by smtp05.primenet.com, id smtpdAAAsEaG3z; Thu Nov 18 10:25:29 1999 Received: (from tlambert@localhost) by usr02.primenet.com (8.8.5/8.8.5) id KAA14781; Thu, 18 Nov 1999 10:25:43 -0700 (MST) From: Terry Lambert Message-Id: <199911181725.KAA14781@usr02.primenet.com> Subject: Re: Unix International Stackable Files Working Group To: pechy@hp735.cvut.cz (Jan Pechanec) Date: Thu, 18 Nov 1999 17:25:43 +0000 (GMT) Cc: freebsd-fs@FreeBSD.ORG In-Reply-To: from "Jan Pechanec" at Nov 18, 99 01:46:49 pm X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org > Hello, > > in several papers on filesystems I found the reference to > ${subj}. I spent quite enough time trying to find it through several > www search engines, but wasn't succesful. Please, does anybody have > more information on this group ? I saved nearly the entire UNIX International FTP archive when UI went out of business, including their TET, ETET, System Admin, DWARF, and Draft SPEC 1170 documents. They are currently archive at DigiBoard. Unfortunately, I didn't save everything, but I'm pretty sure that was one of the things I saved. If not, I know who had the machine in their physical posession after they went under, but I'm pretty sure it has been scrapped by now, as that person was not very much like me (I have been described as "the net.packrat"). Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Nov 18 10:27:36 1999 Delivered-To: freebsd-fs@freebsd.org Received: from cs.columbia.edu (cs.columbia.edu [128.59.16.20]) by hub.freebsd.org (Postfix) with ESMTP id F2FD415476 for ; Thu, 18 Nov 1999 10:27:27 -0800 (PST) (envelope-from ezk@shekel.mcl.cs.columbia.edu) Received: from shekel.mcl.cs.columbia.edu (shekel.mcl.cs.columbia.edu [128.59.18.15]) by cs.columbia.edu (8.9.1/8.9.1) with ESMTP id NAA25492; Thu, 18 Nov 1999 13:27:24 -0500 (EST) Received: (from ezk@localhost) by shekel.mcl.cs.columbia.edu (8.9.1/8.9.1) id NAA27811; Thu, 18 Nov 1999 13:27:24 -0500 (EST) Date: Thu, 18 Nov 1999 13:27:24 -0500 (EST) Message-Id: <199911181827.NAA27811@shekel.mcl.cs.columbia.edu> X-Authentication-Warning: shekel.mcl.cs.columbia.edu: ezk set sender to ezk@shekel.mcl.cs.columbia.edu using -f From: Erez Zadok To: Jan Pechanec Cc: FreeBSD FS Mailing List Subject: Re: Unix International Stackable Files Working Group In-reply-to: Your message of "Thu, 18 Nov 1999 13:46:49 +0100." Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org In message , Jan Pechanec writes: > > Hello, > > in several papers on filesystems I found the reference to > ${subj}. I spent quite enough time trying to find it through several > www search engines, but wasn't succesful. Please, does anybody have > more information on this group ? It's dead Jan! :-) > Thank you, Jan. I have Rosenthal's 6-page 'requirements' paper, which was produced under UI. It was difficult to get it, but eventually I got a copy from the man himself. See ftp://shekel.mcl.cs.columbia.edu/pub/ezk/requirements.ps If you're looking for other papers re: stacking, I probably have all of them. > Jan PECHANEC (mailto:pechy@hp735.cvut.cz) > Computing Center CTU (Zikova 4, Praha 6, 166 35, Czech Republic) > www.civ.cvut.cz, pechy.civ.cvut.cz, tel: +420 2 24352969 (fax: 24310271) > > > To Unsubscribe: send mail to majordomo@FreeBSD.org > with "unsubscribe freebsd-fs" in the body of the message Erez. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Nov 18 15:20:50 1999 Delivered-To: freebsd-fs@freebsd.org Received: from cs.columbia.edu (cs.columbia.edu [128.59.16.20]) by hub.freebsd.org (Postfix) with ESMTP id 8BCD81508E; Thu, 18 Nov 1999 15:20:45 -0800 (PST) (envelope-from ezk@shekel.mcl.cs.columbia.edu) Received: from shekel.mcl.cs.columbia.edu (shekel.mcl.cs.columbia.edu [128.59.18.15]) by cs.columbia.edu (8.9.1/8.9.1) with ESMTP id SAA29976; Thu, 18 Nov 1999 18:20:44 -0500 (EST) Received: (from ezk@localhost) by shekel.mcl.cs.columbia.edu (8.9.1/8.9.1) id SAA15756; Thu, 18 Nov 1999 18:20:43 -0500 (EST) Date: Thu, 18 Nov 1999 18:20:43 -0500 (EST) Message-Id: <199911182320.SAA15756@shekel.mcl.cs.columbia.edu> X-Authentication-Warning: shekel.mcl.cs.columbia.edu: ezk set sender to ezk@shekel.mcl.cs.columbia.edu using -f From: Erez Zadok To: Eivind Eklund Cc: Erez Zadok , fs@FreeBSD.ORG Subject: Re: namei() and freeing componentnames In-reply-to: Your message of "Thu, 18 Nov 1999 15:32:20 +0100." <19991118153220.E45524@bitbox.follo.net> Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org In message <19991118153220.E45524@bitbox.follo.net>, Eivind Eklund writes: > [Note to impatient readers - forward view if included at the bottom of > this mail] > > On Mon, Nov 15, 1999 at 06:12:09PM -0500, Erez Zadok wrote: > > In message <19991112000359.A256@bitbox.follo.net>, Eivind Eklund writes: [...] > The problem I'm finding with VOP_RELEASEND() is that namei() can > return two different vps - the dvp (directory vp) and the actual vp > (inside the directory dvp points at), and that neither of these are > always available. > > As I am writing the code right now, I am using either of these, with a > preference for the dvp. I am considering splitting VOP_RELEASEND() > into VOP_RELEASEND() and VOP_DRELEASEND(), which takes the different > VPs as parameters - this will at least give something that is easy to > search for if we need to change the behaviour somehow. I found similar "annoying" functionality in Solaris's open() routine. Sometimes it can return a new dvp, sometimes NULL, and sometimes a copy or reference to another vnode (I think due to dup() stuff). From my POV, after having ported stackable templates to several OSs, I found out that vnode/vfs functions that try to do too much make the life of a stackable f/s developer harder. Also, functions that behave differently under different (input) conditions also make it hard to work with. The reason is that stackable file systems have to be layer-independent. This means that they have to treat the file system on which they stacked as if they were the VFS calling that layer, and at the same time they must appear to the VFS as a low-level f/s. IOW, a stackable f/s is both a VFS and a lower-level f/s, and thus have to simulate and act as both. So whatever behavior your VFS has before it calls a VOP_* must be simulated accurately inside the stackable f/s before it calls the lower one. It is easier to achieve that when vnode/vfs functions are smaller, simpler, and behave the same always. So, I would say that if you think splitting VOP_RELEASEND in two would make things simpler, go for it here and everywhere else. The lesson learned from the Linux vfs (rapid :-) evolution is a good one: after adding more and more inode/file/dentry/super_block functions, and making them relatively small and simple, they found ways to push some of that functionality up to the VFS. [...] > Actually, I am reasonably confident that we can do the fixes without > impacting performance noticably. That's great! [...] > Forward view: I'm undecided on the next step. Possibilities: > (1) Change the way locking is specificied to make it feasible to test > locking patches properly, and change the assertion generation to > generate better assertions. This will probably require changing I'm not sure I understand what you mean by assertion generation. > VOP_ISLOCKED() to be able to take a process parameter, and return > different valued based on wether an exlusive lock is held by that > process or by another process. The present behaviour will be > available by passing NULL for this parameter. > > Presently, running multiple processes does not work properly, as > the assertions do not really assert the right things. > > These changes are necessary to properly debug the use of locks, > which I again believe is necessary for stacking layers (which I > would like to work in 4.0, but I don't know if I will be able to > have ready). Locks are probably one of the most frustrating things I've had to deal with, b/c you're rarely told whether the objects passed to you are already locked, allocated, and if their reference count has been updated, and what, if any, you have to do with all of these. FreeBSD is very nice by documenting most of these conventions in the vnode_if.src file, but Solaris and Linux don't. I've had to implement a strict un/locking order in my wrapfs templates, to avoid deadlocks. Some of that code is so hairy that I dread each time the (linux) vfs changes and I've got to touch my locking code; that's a sure way to waste several days debugging that. Deciding on proper locking is difficult. In linux, for example, they had most locking done in the VFS; sounds great at first b/c f/s code doesn't have to worry about locking objects. But they found out that to get better SMP performance, each f/s would have to do its own locking, and so they pushed some of the locking to be the f/s responsibility. Locking seems to be stuff that happens all over: part in the VFS, part in the VM/buffercache, and part inside file systems. Is there a way to make locking an explicit part of the vnode interface? Is there a way to keep locking in the VFS by default (for simplicity), but allow those f/s that want to, manage their own locks? How messy and maintainable such code would be? I guess what I'm arguing is for interface flexibility, so we don't have to revise it again any time soon. Eivind, if you haven't recently, I suggest you look at some of the stacking papers (Rosenthal's UI paper, Heidemann, Popek, Skinner/Wong, etc.) Rosenthal's "requirements" paper succinctly described several important issues, including atomicity of multi-vnode operations. Rosenthal suggested that kernels should have a full-transaction engine, which I think is eventually necessary, but it's very complex to put it. The next best thing is to do some form of safe locking. Normally each vnode/inode has its own lock. Imagine a replicated stackable f/s (replicfs) with fan-out of 3. So vnode (V0) at the level of "replicfs" would have access to three lower-vnodes (V1, V2, V3). If you want to make a change (say create a file) in V0, you have to lock V0-V4 at once. Without vfs support for this, replicfs would have to enforce ordered locking (such as I've done in wrapfs) and hope for the best. If the vfs is smarter, it can help replicfs lock all 4 vnodes at once; or the vfs can allow replicfs to control the locks below it, and all the vfs has to do is ensure that no one else can lock V1-V3. I don't have a good answer to this locking issue. The papers I've cited describe changes to the vnode interface that simplify locking. One way they do that is having only one lock per chain (or stack, or DAG) of stacked file systems. So for example, a DAG of stackable f/s is represented by one data structure that contains locks and other things that are true about the whole DAG, and then smaller data structures for each node/leaf of the DAG, containing stuff that's true about that vnode (e.g., operations vector). > (2) Change the behaviour of VOP_LOOKUP() to "eat as much as you can, > and return how much that was" rather than "Eat a single path > component; we have already decided what this is." > This allows different types of namespaces, and it allows > optimizations in VOP_LOOKUP() when several steps in the traversal > is inside a single filesystem (and hey - who mounts a > new filesystem on every directory they see, anyway?) > > This change is rather small, and it would be nice to have in 4.0 > (I want the VFS differences from 4.0 to 5.0 to be as small as > possible). > It is pretty orthogonal to stacking layers; stacking layers gain > the same capabilities as other file systems from it. Multi-component lookup has always been desirable. There's one paper by Duchamp (USENIX '94) on multi-component look in NFS. I think we should allow for multi-component lookup as well as the old style "one component at a time" lookup. I would argue that the default should still be the old style. Someone might want to write a stackable f/s that does special things as it traverses the pathname of each component. For example a general purpose unionfs (one which uses fan-out, unlike the single-stack design in bsd-4.4) might follow into different underlying directories as it looks up single components; unionfs has all kinds of interesting semantic issues that would require more flexibility at lookup time. Lookup is fairly complex as it is. If you're going to add multi-component lookup, then maybe it should be a new vop? If not a new vop, then make sure it's added to the current vop_lookup such that a f/s has enough flexibility to control the type of lookup it wants. Also, it would be nice if the type of lookup used can be controlled dynamically by the f/s itself (as opposed to, say, a mount() flag that sets the lookup type for the duration of the mount). > Eivind. Cheers, Erez. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Nov 18 15:35: 4 1999 Delivered-To: freebsd-fs@freebsd.org Received: from excalibur.lps.ens.fr (excalibur.lps.ens.fr [129.199.120.3]) by hub.freebsd.org (Postfix) with ESMTP id C95BD1553D; Thu, 18 Nov 1999 15:34:55 -0800 (PST) (envelope-from Thierry.Besancon@lps.ens.fr) Received: by excalibur.lps.ens.fr (8.9.3/jtpda-5.3.1) id AAA25614 ; Fri, 19 Nov 1999 00:34:53 +0100 (MET) Message-Id: <199911182334.AAA25614@excalibur.lps.ens.fr> From: Thierry.Besancon@lps.ens.fr (Thierry Besancon) Date: Fri, 19 Nov 1999 00:34:53 +0000 X-Mailer: Mail User's Shell (7.2.5 10/14/92) To: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org Subject: crash in ffs_vptofh on diskless workstation Cc: dillon@freebsd.org, Ollivier.Robert@eurocontrol.fr, besancon@lps.ens.fr, Joel.Marchand@polytechnique.fr, Pierre.David@prism.uvsq.fr Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Hello I'm trying to build new X terminals for my lab. To do so I use FreeBSD 3.3-RELEASE. The X terminal is a diskless PC with 64 Mo of ram. It perfectly boots and I can launch the X server perfectly. Everything just runs fine. Except for one little piece of thing. As i wanted to make use of the floppy drive, I gave a look at floppyd part of mtools package. It implements what I want. While running the daemon, I encountered a problem. So I went debugging the C code of it. And so i found a bug in FreeBSD (?!). Here's the df of the diskless X terminal (i kept the ssh port in order to remotely connect and be able to look at the problem of floppyd) : Filesystem 1K-blocks Used Avail Capacity Mounted on 129.199.120.250:/ 127023 31651 85211 27% / mfs:29 959 668 215 76% /conf/etc /conf/etc 959 668 215 76% /etc 129.199.120.250:/usr 190543 153042 22258 87% /usr 129.199.120.250:/usr/local 2846396 1958786 659899 75% /usr/local mfs:61 3935 1431 2190 40% /var /var/tmp 3935 1431 2190 40% /tmp mfs:91 1511 47 1344 3% /dev It's the classical way FreeBSD 3.3 seems to make diskless run. The root filesystem is mounted through NFS and memory filesystems are created to store the live logs of the system. The mounts are read-only. The X terminal runs without any swap. /etc/rc.sysctl confirms it as well : sysctl -w vm.swap_enabled=0 The bug is just that when launching any executable residing in my mfs /tmp, it justs hangs the kernel. # cp /bin/ls /tmp # df /tmp/. Filesystem 1K-blocks Used Avail Capacity Mounted on /var/tmp 3935 1432 2189 40% /tmp # /tmp/ls (workstation freezes) Here's the panic : Fatal trap 12 : page fault while in kernel mode fault virtual address = 0x3e fault code = supervisor read, page not present instruction pointer = 0x8:0xc022bf14 stack pointer = 0x10:0xc4546bc8 frame pointer = 0x10:0xc4546ca4 code segment = base 0x0, list 0xfffff, type 0x1b = DPL 0, pres 1, def32 1, gran 1 precessor eflags = interrupt disabled, resume, IOPL = 0 current process = 355 (csh) interrupt mask = net tty bio cam kernel : type 12 trap, code = 0 Stopped at ffs_vptofh+0xfe0: cmpw $0x2,0x3e(%edx) and the trace : db> trace ffs_vptofh(c4546d5c,c4514300,1000,0,c4546cf4) at ffs_vptofh+0xfe0 end(c4546d5c) at 0xc087c485 vnode_pager_freepage(c4559a2c,c4546db8,1,0,c4546df8) at vnode_pager_freepage+0x556 vm_pager_get_pages(c4559a2c,c4546db8,1,0,c4546f18) at vm_pager_get_pages+0x1f exec_map_first_page(c4546e94,c44c55a8,c02fe464,0,4) at exec_map_first_page+0xba execve(c44c55a0,c4546f94,80922e0,80940000,8085000) at execve+0x19e syscall(27,27,8085000,8094000,bfbffbb0) at syscall+0x187 Xint0x80_syscall() at Xint0x80_syscall+0x2c (not too deep) Given I have no swap, it is not easy to supply vmcore. But I can provide any help as I can reproduce the crash at will. If someone has a clue on how to fix that... Thierry To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Thu Nov 18 20:38:56 1999 Delivered-To: freebsd-fs@freebsd.org Received: from nomis.simon-shapiro.org (nomis.simon-shapiro.org [209.86.126.163]) by hub.freebsd.org (Postfix) with SMTP id 1ABA2155BF for ; Thu, 18 Nov 1999 20:38:53 -0800 (PST) (envelope-from shimon@simon-shapiro.org) Received: (qmail 99725 invoked from network); 19 Nov 1999 04:38:52 -0000 Received: from localhost.simon-shapiro.org (HELO simon-shapiro.org) (127.0.0.1) by localhost.simon-shapiro.org with SMTP; 19 Nov 1999 04:38:52 -0000 Message-ID: <3834D45C.1F963B3B@simon-shapiro.org> Date: Thu, 18 Nov 1999 23:38:52 -0500 From: Simon Shapiro Organization: Simon's Garage X-Mailer: Mozilla 4.6 [en] (X11; I; FreeBSD 3.3-STABLE i386) X-Accept-Language: en-US MIME-Version: 1.0 To: Bernd Walter Cc: Mattias Pantzare , freebsd-fs@FreeBSD.ORG Subject: Re: RAID-5 and failure References: <199911061716.SAA20783@zed.ludd.luth.se> <19991106183316.A9420@cicely7.cicely.de> Content-Type: text/plain; charset= Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Bernd Walter wrote: > > On Sat, Nov 06, 1999 at 06:16:47PM +0100, Mattias Pantzare wrote: > > > On Sat, Nov 06, 1999 at 04:58:55PM +0100, Mattias Pantzare wrote: > > > > What hapens if the data part of a write to a RAID-5 plex completes but not the > > > > parity part (or the other way)? > > > > > > > The parity is not in sync - what else? > > > > The system could detect it and recalculate the parity. Or give a warning to > > the user so the user knows that the data is not safe. > > That's not possible because you need to write more then a single sector to keep > parity in sync which is not atomic. > > In case one of the writes fail vinum will do everything needed to work with it > and to inform the user. > Vinum will take the subdisk down because such drives should work with > write reallocation enabled and such a disk is badly broken if you receive a > write error. > > If the system panics or power fails between such a write there is no way to > find out if the parity is broken beside verifying the complete plex after > reboot - the problem should be the same with all usual hard and software > solutions - greg already begun or finished recalculating and checking the > parity. > I asume that's the reason why some systems use 520 byte sectors - maybe they > write timestamps or generationnumbers in a single write within the sector. 528. 512 data, 16 ECC for the sector. Nothing to do with RAID. > > -- > B.Walter COSMO-Project http://www.cosmo-project.de > ticso@cicely.de Usergroup info@cosmo-project.de > > To Unsubscribe: send mail to majordomo@FreeBSD.org > with "unsubscribe freebsd-fs" in the body of the message -- Sincerely Yours, Shimon@Simon-Shapiro.ORG 404.664.6401 Simon Shapiro Unwritten code has no bugs and executes at twice the speed of mouth To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Nov 19 7:18: 2 1999 Delivered-To: freebsd-fs@freebsd.org Received: from mojave.sitaranetworks.com (mojave.sitaranetworks.com [199.103.141.157]) by hub.freebsd.org (Postfix) with ESMTP id 2E1081563C for ; Fri, 19 Nov 1999 07:17:59 -0800 (PST) (envelope-from grog@mojave.sitaranetworks.com) Message-ID: <19991119101720.35872@mojave.sitaranetworks.com> Date: Fri, 19 Nov 1999 10:17:20 -0500 From: Greg Lehey To: Simon Shapiro , Bernd Walter Cc: Mattias Pantzare , freebsd-fs@FreeBSD.ORG Subject: Re: RAID-5 and failure Reply-To: Greg Lehey References: <199911061716.SAA20783@zed.ludd.luth.se> <19991106183316.A9420@cicely7.cicely.de> <3834D45C.1F963B3B@simon-shapiro.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii In-Reply-To: <3834D45C.1F963B3B@simon-shapiro.org>; from Simon Shapiro on Thu, Nov 18, 1999 at 11:38:52PM -0500 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org On Thursday, 18 November 1999 at 23:38:52 -0500, Simon Shapiro wrote: > Bernd Walter wrote: >> >> I asume that's the reason why some systems use 520 byte sectors - maybe they >> write timestamps or generationnumbers in a single write within the sector. > > 528. 512 data, 16 ECC for the sector. Nothing to do with RAID. There are various sizes. I've had surplus disks with 516 and 520 byte sectors. But yes, they're usually under hardware control. Greg -- Finger grog@lemis.com for PGP public key See complete headers for address and phone numbers To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Nov 19 8:33:43 1999 Delivered-To: freebsd-fs@freebsd.org Received: from mail.tvol.com (mail.wgate.com [38.219.83.4]) by hub.freebsd.org (Postfix) with ESMTP id 2406D15673 for ; Fri, 19 Nov 1999 08:33:34 -0800 (PST) (envelope-from rjesup@wgate.com) Received: from jesup.eng.tvol.net (jesup.eng.tvol.net [10.32.2.26]) by mail.tvol.com (8.8.8/8.8.3) with ESMTP id LAA14056 for ; Fri, 19 Nov 1999 11:30:48 -0500 (EST) Reply-To: Randell Jesup To: freebsd-fs@FreeBSD.ORG Subject: Re: RAID-5 and failure References: <199911061716.SAA20783@zed.ludd.luth.se> <19991106183316.A9420@cicely7.cicely.de> <19991113213325.57908@mojave.sitaranetworks.com> <19991116204916.44107@mojave.sitaranetworks.com> From: Randell Jesup Date: 19 Nov 1999 11:33:58 -0500 In-Reply-To: Greg Lehey's message of "Tue, 16 Nov 1999 20:49:16 -0500" Message-ID: X-Mailer: Gnus v5.6.43/Emacs 20.4 Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Greg Lehey writes: >> When the disks come back up (dirty), check all the parity. >> The stripe that was being written will fail to check. In case 4, the data >> and parity are wrong, and in case 5, just the parity, but you don't know >> which. If you handle case 4, you can handle case 5 the same way. >> Obviously you've had a write failure, but usually the FS can deal with >> that possibility (with the chance of lost data, true). Some form of >> information passed out about what sector(s) were trashed might be useful >> in recovery if you're not using default UFS/fsck. > >Well, you're still left with the dilemma. Worse, this check makes >fsck look like an instantaneous operation: you have to read the entire >contents of every disk. For a 500 GB database spread across 3 LVD >controllers, you're looking at several hours. True. Not that it may matter, but you could have dirty flags for each cylinder group (or whatever). This both adds locality (shorter seeks) and reduces the amount needed to recheck. If an area hasn't been written to 'recently', the dirty flag for the area gets rewritten to clean. This allows you to keep the amount of the disk that needs to be reread on a crash down to a very manageable level. Tuning the size of the groups covered by a flag and the timeout to rewrite a flag to clean would take a little work. >> If it checks, then the data was all written before any crash, >> and all is fine. > >That's the simple case. That's certainly true. >> So the biggest trick here is recognizing the fact that the system >> crashed. You could reserve a block (or set of blocks scattered about) on >> each drive for dirty flags, and only mark a disk clean if it hasn't had >> writes in . This keeps the write >> overhead down without requiring NVRAM. There are other evil tricks: with >> SCSI, you might be able to change some innocuous mode parameter and use >> it as a dirty flag, though this probably has at least as much overhead >> as reserving a dirty-flag block. And of course if you have NVRAM, store >> the dirty bit there. Hmmmmm. Maybe in the PC's clock chip - they >> generally have several bits of NVRAM..... (On the Amiga we used those >> bits for storing things like SCSI Id, boot spinup delay, etc.) >> >> Alternatively, you could hide the dirty flag at a higher semantic >> level, by (at the OS level) recognizing a system that wasn't shut down >> properly and invoking the vinum re-synchronizer. So long as the sectors >> with problems aren't needed to boot the kernel and recognize this that will >> work. > >Basically, the way I see it, we have three options: > >1. Disks never crash, and anyway, we don't write to them. Ignore the > problem and deal with it if it comes to bite us. > >2. Get an NVRAM board and use it for this purpose. How much is commonly stored in nvram boards for raid? If it's merely the location of the write, _maybe_ clock-chip memory might work (if writing to it that often doesn't slow down the system - I don't remember how fast the interface is). If it's the entire sector, well then we're screwed without it or #3 - or rather we could have a corrupted stripe after a crash. Oh well. >3. Bite the bullet and write intention logs before each write. > VERITAS has this as an option. Probably worthwhile. >These options don't have to be mutually exclusive. It's quite >possible to implement both ((1) doesn't need implementation :-) and >leave it to the user to decide which to use. Quite so. BTW, I assume I'm correct in assuming that vinum normally works on drives with write-behind disabled... >> At the cost of performance, you could use some bytes of each sector >> for generation numbers, and know in case 5 that the data is correct. >> Obviously case 4 will still fail. > >No, the way things work, this would be very expensive. We'd have to >move the data to a larger buffer and set the flags, and it would also >require at least reformatting the drive, assuming it's possible to set >a different sector. There are better ways to do this. Well, I was assuming you'd use some bytes from the existing sectorsize (such as 511 bytes of user data per sector, 1 byte of generation). We're talking lots of extra CPU overhead on read or write, however, to transfer the data into alternative buffers before write and to invert that on read - not to mention that higher-level code tends to be inflexible in regard to sector sizes being powers of two (or multiples of 512 for that matter). Does vinum do any transfers of user data into alternative buffers before posting it's writes, or does it just use gather/scatter lists? -- Randell Jesup, Worldgate Communications, ex-Scala, ex-Amiga OS team ('88-94) rjesup@wgate.com CDA II has been passed and signed, sigh. The lawsuit has been filed. Please support the organizations fighting it - ACLU, EFF, CDT, etc. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Nov 19 10:10:17 1999 Delivered-To: freebsd-fs@freebsd.org Received: from uni4nn.gn.iaf.nl (osmium.gn.iaf.nl [193.67.144.12]) by hub.freebsd.org (Postfix) with ESMTP id B4E5F156F7 for ; Fri, 19 Nov 1999 10:10:01 -0800 (PST) (envelope-from wilko@yedi.iaf.nl) Received: from yedi.iaf.nl (uucp@localhost) by uni4nn.gn.iaf.nl (8.9.2/8.9.2) with UUCP id SAA32117; Fri, 19 Nov 1999 18:55:35 +0100 (MET) Received: (from wilko@localhost) by yedi.iaf.nl (8.9.3/8.9.3) id SAA54691; Fri, 19 Nov 1999 18:50:59 +0100 (CET) (envelope-from wilko) From: Wilko Bulte Message-Id: <199911191750.SAA54691@yedi.iaf.nl> Subject: Re: RAID-5 and failure In-Reply-To: from Randell Jesup at "Nov 19, 1999 11:33:58 am" To: rjesup@wgate.com Date: Fri, 19 Nov 1999 18:50:59 +0100 (CET) Cc: freebsd-fs@FreeBSD.ORG X-Organisation: Private FreeBSD site - Arnhem, The Netherlands X-pgp-info: PGP public key at 'finger wilko@freefall.freebsd.org' X-Mailer: ELM [version 2.4ME+ PL43 (25)] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org As Randell Jesup wrote ... > Greg Lehey writes: [...] > >2. Get an NVRAM board and use it for this purpose. > > How much is commonly stored in nvram boards for raid? If it's > merely the location of the write, _maybe_ clock-chip memory might work > (if writing to it that often doesn't slow down the system - I don't > remember how fast the interface is). If it's the entire sector, well then > we're screwed without it or #3 - or rather we could have a corrupted > stripe after a crash. Oh well. Well, I can tell you that the HSx DEC ^H^H^H Compaq controllers use the battery backup-ed writeback cache for this purpose. These are anything from 32 to 512Mb per controllers. Controllers generally are used in redundant pairs, each with their own cache module, each cachemodule with it's own backup battery. To avoid the potential for datacorruption when a cache module fails they can be setup to run in mirrored cache mode. Price? I'm pretty sure you don't want to know ;-) The SCSI variants work fine on FreeBSD BTW. I have yet to try the Fibrechannel boxes. I lack a host adapter that FreeBSD has a driver for. Wilko -- | / o / / _ Arnhem, The Netherlands - Powered by FreeBSD - |/|/ / / /( (_) Bulte WWW : http://www.tcja.nl http://www.freebsd.org To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Fri Nov 19 10:11: 1 1999 Delivered-To: freebsd-fs@freebsd.org Received: from uni4nn.gn.iaf.nl (osmium.gn.iaf.nl [193.67.144.12]) by hub.freebsd.org (Postfix) with ESMTP id 607D415732 for ; Fri, 19 Nov 1999 10:10:42 -0800 (PST) (envelope-from wilko@yedi.iaf.nl) Received: from yedi.iaf.nl (uucp@localhost) by uni4nn.gn.iaf.nl (8.9.2/8.9.2) with UUCP id SAA32126; Fri, 19 Nov 1999 18:55:39 +0100 (MET) Received: (from wilko@localhost) by yedi.iaf.nl (8.9.3/8.9.3) id SAA54750; Fri, 19 Nov 1999 18:56:38 +0100 (CET) (envelope-from wilko) From: Wilko Bulte Message-Id: <199911191756.SAA54750@yedi.iaf.nl> Subject: Re: RAID-5 and failure In-Reply-To: <19991119101720.35872@mojave.sitaranetworks.com> from Greg Lehey at "Nov 19, 1999 10:17:20 am" To: grog@lemis.com Date: Fri, 19 Nov 1999 18:56:38 +0100 (CET) Cc: shimon@simon-shapiro.org, ticso@cicely.de, pantzer@ludd.luth.se, freebsd-fs@FreeBSD.ORG X-Organisation: Private FreeBSD site - Arnhem, The Netherlands X-pgp-info: PGP public key at 'finger wilko@freefall.freebsd.org' X-Mailer: ELM [version 2.4ME+ PL43 (25)] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org As Greg Lehey wrote ... > On Thursday, 18 November 1999 at 23:38:52 -0500, Simon Shapiro wrote: > > Bernd Walter wrote: > >> > >> I asume that's the reason why some systems use 520 byte sectors - maybe they > >> write timestamps or generationnumbers in a single write within the sector. > > > > 528. 512 data, 16 ECC for the sector. Nothing to do with RAID. > > There are various sizes. I've had surplus disks with 516 and 520 byte > sectors. But yes, they're usually under hardware control. I've also seen 518 once. -- | / o / / _ Arnhem, The Netherlands - Powered by FreeBSD - |/|/ / / /( (_) Bulte WWW : http://www.tcja.nl http://www.freebsd.org To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message From owner-freebsd-fs Sat Nov 20 12:20: 6 1999 Delivered-To: freebsd-fs@freebsd.org Received: from europa.dreamscape.com (europa.dreamscape.com [206.64.128.147]) by hub.freebsd.org (Postfix) with ESMTP id D1BAA14C41 for ; Sat, 20 Nov 1999 12:19:40 -0800 (PST) (envelope-from krentel@dreamscape.com) Received: from dreamscape.com (sA18-p7.dreamscape.com [209.217.200.7]) by europa.dreamscape.com (8.8.5/8.8.4) with ESMTP id PAA16622 for ; Sat, 20 Nov 1999 15:19:37 -0500 (EST) X-Dreamscape-Track-A: sA18-p7.dreamscape.com [209.217.200.7] X-Dreamscape-Track-B: Sat, 20 Nov 1999 15:19:37 -0500 (EST) Received: (from krentel@localhost) by dreamscape.com (8.9.3/8.9.3) id PAA03794 for freebsd-fs@freebsd.org; Sat, 20 Nov 1999 15:17:58 -0500 (EST) (envelope-from krentel) Date: Sat, 20 Nov 1999 15:17:58 -0500 (EST) From: "Mark W. Krentel" Message-Id: <199911202017.PAA03794@dreamscape.com> To: freebsd-fs@freebsd.org Subject: running linux binaries from ext2fs partition Sender: owner-freebsd-fs@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.org Is it possible to run linux (or freebsd) binaries directly from a local ext2fs partition? My machine dual boots between Freebsd 3.3-stable (as of Nov 7) and Red Hat 6.0. I have the linux_base-6.0 port installed, and I can run linux binaries by copying them to a freebsd partition. But I tried running them directly from their ext2fs partition and I got a "page fault while in kernel mode" panic. I'm not using soft updates, if that matters. I'm guessing that this is not supported and probably has nothing to do with linux binaries. If I'm wrong and this should work, then I'll be back with more details. But I thought I should check before I run too many experiments that crash my system. :-( While we're on the subject, on what filesystem types is it ok to run binaries? Local freebsd (UFS), NFS, and cdrom should all work, right? Are there others? --Mark Krentel To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message