From owner-freebsd-fs  Sun Nov 14  8:16:20 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from sv01.cet.co.jp (sv01.cet.co.jp [210.171.56.2])
	by hub.freebsd.org (Postfix) with ESMTP
	id 26FA415007; Sun, 14 Nov 1999 08:16:00 -0800 (PST)
	(envelope-from michaelh@cet.co.jp)
Received: from localhost (michaelh@localhost)
	by sv01.cet.co.jp (8.9.3/8.9.3) with SMTP id QAA05129;
	Sun, 14 Nov 1999 16:15:56 GMT
Date: Mon, 15 Nov 1999 01:15:55 +0900 (JST)
From: Michael Hancock <michaelh@cet.co.jp>
To: Eivind Eklund <eivind@FreeBSD.ORG>
Cc: fs@FreeBSD.ORG
Subject: Re: Killing WILLRELE
In-Reply-To: <19991109224553.G256@bitbox.follo.net>
Message-ID: <Pine.BSF.3.95LJ1.1b3.991115010552.5097A-100000@sv01.cet.co.jp>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Eivind,

I agree with your preferred patches.  The slight performance hit for
operations like mknod and symlink isn't a worry.

IIRC rename was one of those operations where you have to reaquire a
ref/lock before return to be consistent with the sane semantics rule. 
This will also add some latency, but again for an op like rename I don't
think it's an issue.

Mike

On Tue, 9 Nov 1999, Eivind Eklund wrote:

> I'm looking at removing WILLRELE from the VFS specs, in order to get
> more sane semantics before introducing many more VFS consumers through
> stacking layers.  I'm sending this as a 'HEADS UP!', a chance for
> people to object, and to give a chance at an advance view.
> 
> Note that the present set of patches has not been tested beyond
> compilation; I'm reserving testing until after I've let people have
> the chance to scream at me (as I don't see a point in testing the
> changes unless people agree that they are a step in the right
> direction).
> 
> There are presently three VOPs that use it:
> VOP_MKNOD
> 	Uses this for the 'vpp' parameter (should be the return vnode
> 	for the newly created node, I believe).  The value is
> 	presently unusable; depending on which FS you call, it it is
> 	either set to NULL, set to point to a vnode (MSDOSFS), or just
> 	kept the way it was.  (Note that MSDOSFS will leak vnodes as
> 	of today).
> 
> 	I've been tempted to remove it, but am not entirely happy
> 	about that, as I think it might be useful for some stacked
> 	layers.  Thanks to phk, I've been able to come up with patches
> 	to fix it - but these will increase the cost of VOP_MKNOD()
> 	(only slightly, I think, but I am not quite certain).
> 
> 	The other alternatives are to remove the parameter, or to
> 	break the layering around ufs_mknod (basically, re-implement
> 	parts of VFS_VGET in it, and make it assume that it is only
> 	used with ffsspecops and ffsfifoops.   This is presently
> 	correct, but introduces risk of breakage down the road.)  Both
> 	of these alternatives are slightly more efficient than my
> 	preferred fix.
> 
> 	Patches to make VOP_MKNOD use vpp normally are 
> 		http://www.freebsd.org/~eivind/vop_mknod_fixed.patch
> 	It is possible that the NFS vp release would have been handled
> 	by common code if I hadn't added special code there, but I
> 	feel too uncomfortable around the NFS code/macros to try to
> 	find out.
> 
> 	Patches to just remove the parameter are at
> 		http://www.freebsd.org/~eivind/vop_mknod_novpp.patch
> 
> 	VOP_MKNOD has 5 callers.
> 
> VOP_SYMLINK
> 	Same use of WILLRELE as VOP_MKNOD.
> 
> 	Returns trash in some cases, OK values in others; relatively
> 	simple to fix, with Coda as the only complication.
> 
> 	Patches to fix it are at
> 		http://www.freebsd.org/~eivind/vop_symlink_fixed.patch
> 	These will break Coda, which I'm planning to contact rvb about
> 	how to solve if people agree that WILLRELE should die.
> 
> 	VOP_SYMLINK has 3 callers.
> 
> VOP_RENAME
> 	WILLRELE on a bunch of parameters.  Adrian Chadd is doing
> 	several things to VOP_RENAME which is relevant to this, so I'm
> 	keeping my hands off it for the moment.  Hopefully, patches
> 	should be available later in the week.
> 
> 
> My next step along the sane semantics road will probably be to make
> freeing of cnp's reflexive - looking at the code that is there now,
> there looks like there are a number of bugs related to this at the
> moment, and it certainly makes the code much harder to follow.
> 
> Eivind.
> 
> 
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-fs" in the body of the message
> 


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Mon Nov 15  8: 5:43 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from yana.lemis.com (yana.lemis.com [192.109.197.140])
	by hub.freebsd.org (Postfix) with ESMTP id 959AA14EB8
	for <freebsd-fs@FreeBSD.ORG>; Mon, 15 Nov 1999 08:05:36 -0800 (PST)
	(envelope-from grog@mojave.sitaranetworks.com)
Received: from mojave.sitaranetworks.com (mojave.sitaranetworks.com [199.103.141.157])
	by yana.lemis.com (8.8.8/8.8.8) with ESMTP id CAA21030;
	Tue, 16 Nov 1999 02:35:26 +1030 (CST)
	(envelope-from grog@mojave.sitaranetworks.com)
Message-ID: <19991113213430.48370@mojave.sitaranetworks.com>
Date: Sat, 13 Nov 1999 21:34:30 -0500
From: Greg Lehey <grog@mojave.sitaranetworks.com>
To: Bernd Walter <ticso@cicely.de>,
	Mattias Pantzare <pantzer@ludd.luth.se>
Cc: freebsd-fs@FreeBSD.ORG
Subject: Re: RAID-5 and failure
Reply-To: Greg Lehey <grog@lemis.com>
References: <ticso@cicely.de> <199911061827.TAA22113@zed.ludd.luth.se> <19991106200754.A9682@cicely7.cicely.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
In-Reply-To: <19991106200754.A9682@cicely7.cicely.de>; from Bernd Walter on Sat, Nov 06, 1999 at 08:07:54PM +0100
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Saturday,  6 November 1999 at 20:07:54 +0100, Bernd Walter wrote:
> On Sat, Nov 06, 1999 at 07:27:20PM +0100, Mattias Pantzare wrote:
>>> If the system panics or power fails between such a write there is no way to
>>> find out if the parity is broken beside verifying the complete plex after
>>> reboot - the problem should be the same with all usual hard and software
>>> solutions - greg already begun or finished recalculating and checking the
>>> parity.
>>
>> This is realy a optimisation issue, if you just write without using
>> two-phase commit then you have to recalculate parity after a powerfailure.
>> (One might keep track of the regions of the disk that have had writes latly
>> and only recalculate them)
>>
>> Or you do as it says under Two-phase commitment in
>> http://www.sunworld.com/sunworldonline/swol-09-1995/swol-09-raid5-2.html.
>>
> That's exactly what vinum does at this moment but without the log.
> You need persistent memory for this such as nv-memory or a log area on any disk.
> nv-memory on PCs is usually to small and maybe to slow for such purposes.
> I asume that a log area on any partitipating disk is not a good idea.
> On a different disk it would be an option but still needs implementation.

Yes, I suppose we could implement that for maximum security.  I wonder
if any NOVRAM boards are available.

Greg
--
Finger grog@lemis.com for PGP public key
See complete headers for address and phone numbers


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Mon Nov 15  8: 5:49 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from yana.lemis.com (yana.lemis.com [192.109.197.140])
	by hub.freebsd.org (Postfix) with ESMTP id AFB74150CF
	for <freebsd-fs@FreeBSD.ORG>; Mon, 15 Nov 1999 08:05:41 -0800 (PST)
	(envelope-from grog@mojave.sitaranetworks.com)
Received: from mojave.sitaranetworks.com (mojave.sitaranetworks.com [199.103.141.157])
	by yana.lemis.com (8.8.8/8.8.8) with ESMTP id CAA21033;
	Tue, 16 Nov 1999 02:35:35 +1030 (CST)
	(envelope-from grog@mojave.sitaranetworks.com)
Message-ID: <19991113213325.57908@mojave.sitaranetworks.com>
Date: Sat, 13 Nov 1999 21:33:25 -0500
From: Greg Lehey <grog@mojave.sitaranetworks.com>
To: Bernd Walter <ticso@cicely.de>,
	Mattias Pantzare <pantzer@ludd.luth.se>
Cc: freebsd-fs@FreeBSD.ORG
Subject: Re: RAID-5 and failure
Reply-To: Greg Lehey <grog@lemis.com>
References: <ticso@cicely.de> <199911061716.SAA20783@zed.ludd.luth.se> <19991106183316.A9420@cicely7.cicely.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
In-Reply-To: <19991106183316.A9420@cicely7.cicely.de>; from Bernd Walter on Sat, Nov 06, 1999 at 06:33:16PM +0100
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Saturday,  6 November 1999 at 18:33:16 +0100, Bernd Walter wrote:
> On Sat, Nov 06, 1999 at 06:16:47PM +0100, Mattias Pantzare wrote:
>>> On Sat, Nov 06, 1999 at 04:58:55PM +0100, Mattias Pantzare wrote:
>>>> What hapens if the data part of a write to a RAID-5 plex completes but not the
>>>> parity part (or the other way)?
>>>>
>>> The parity is not in sync - what else?
>>
>> The system could detect it and recalculate the parity. Or give a warning to
>> the user so the user knows that the data is not safe.
>
> That's not possible because you need to write more then a single
> sector to keep parity in sync which is not atomic.
>
> In case one of the writes fail vinum will do everything needed to
> work with it and to inform the user.

In RAID-5, I first write the data blocks, then the parity blcoks.
There are a number of scenarios here:

1.  The drive containing a data or parity block goes down.

    In this case, the subdisks of that block will be marked
    'crashed'.  The subdisk to which the write went will be marked
    'stale'.  When the drive is brought up again (manually), the data
    will be recreated.

    I've been thinking about keeping a log somewhere of what needs to
    be updated, but this carries dangers of corruption.  At the moment
    I require that the entire subdisk be rewritten.  This will also
    recreate parity where necessary.

2.  The subdisk containing a data or parity block has an unrecoverable
    I/O error.

    This is pretty much the same as the previous case, except that the
    other subdisks don't crash.

3.  The system crashes before writing the first data block for a
    RAID-5 stripe.

    The updates are lost (obviously).  When the system comes up, the
    data should be consistent.

4.  The system crashes after writing the first data block for a RAID-5
    stripe and before writing the last data block.

    When the system comes up, both data and parity are inconsistent.

5.  The system crashes after writing the last data block for a RAID-5
    stripe and before writing the last parity block.

    When the system comes up, data is consistent, and parity is
    inconsistent.

There are a number of ways of dealing with situations 4 and 5.  The
real problem is that they only occur when the system crashes, so
whatever recovery information is required must be stored in
non-volatile storage.  Some systems do include a NOVRAM for this kind
of information, but in general purpose systems the only possibility is
to write the information to disk, which would make the inherently slow
RAID-5 write even slower.  My attitude here is that RAID-5 writes are
comparatively infrequent, and so are crashes.  In the case of (5), you
could rebuild parity after a crash.  In the case of (4), I have no
good answer.  Suggestions welcome.

Having said that, I probably need to revise the code which
sequentializes the data and parity writes.  It currently uses the
B_ORDERED flag in the buffer headers, and I'm not sure that's enough.
I should probably modify it to confirm that the data blocks are
written before starting to write the parity blocks.

> Vinum will take the subdisk down because such drives should work with
> write reallocation enabled and such a disk is badly broken if you receive a
> write error.
>
> If the system panics or power fails between such a write there is no way to
> find out if the parity is broken beside verifying the complete plex after
> reboot - the problem should be the same with all usual hard and software
> solutions - greg already begun or finished recalculating and checking the
> parity.
> I asume that's the reason why some systems use 520 byte sectors - maybe they
> write timestamps or generationnumbers in a single write within the sector.

In fact, the 520 byte sectors are used to protect against data
corruption between the disk and the controller.  They won't help in
this scenario.

Greg
--
Finger grog@lemis.com for PGP public key
See complete headers for address and phone numbers


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Mon Nov 15 11:25: 9 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from uni4nn.gn.iaf.nl (osmium.gn.iaf.nl [193.67.144.12])
	by hub.freebsd.org (Postfix) with ESMTP id C6AB414D23
	for <freebsd-fs@FreeBSD.ORG>; Mon, 15 Nov 1999 11:25:05 -0800 (PST)
	(envelope-from wilko@yedi.iaf.nl)
Received: from yedi.iaf.nl (uucp@localhost)
	  by uni4nn.gn.iaf.nl (8.9.2/8.9.2) with UUCP id UAA02174;
	  Mon, 15 Nov 1999 20:00:51 +0100 (MET)
Received: (from wilko@localhost)
	by yedi.iaf.nl (8.9.3/8.9.3) id TAA00923;
	Mon, 15 Nov 1999 19:24:01 +0100 (CET)
	(envelope-from wilko)
From: Wilko Bulte <wilko@yedi.iaf.nl>
Message-Id: <199911151824.TAA00923@yedi.iaf.nl>
Subject: Re: RAID-5 and failure
In-Reply-To: <19991113213430.48370@mojave.sitaranetworks.com> from Greg Lehey at "Nov 13, 1999  9:34:30 pm"
To: grog@lemis.com
Date: Mon, 15 Nov 1999 19:24:01 +0100 (CET)
Cc: ticso@cicely.de, pantzer@ludd.luth.se, freebsd-fs@FreeBSD.ORG
X-Organisation: Private FreeBSD site - Arnhem, The Netherlands
X-pgp-info: PGP public key at 'finger wilko@freefall.freebsd.org'
X-Mailer: ELM [version 2.4ME+ PL43 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

As Greg Lehey wrote ...
> On Saturday,  6 November 1999 at 20:07:54 +0100, Bernd Walter wrote:
...
> > That's exactly what vinum does at this moment but without the log.
> > You need persistent memory for this such as nv-memory or a log area on any disk.
> > nv-memory on PCs is usually to small and maybe to slow for such purposes.
> > I asume that a log area on any partitipating disk is not a good idea.
> > On a different disk it would be an option but still needs implementation.
> 
> Yes, I suppose we could implement that for maximum security.  I wonder
> if any NOVRAM boards are available.
> 
> Greg

You might find an old Prestoserve PCI card on a yardsale. Long shot..

-- 
|   / o / /  _  	Arnhem, The Netherlands	  - Powered by FreeBSD -
|/|/ / / /( (_) Bulte 	WWW : http://www.tcja.nl  http://www.freebsd.org


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Mon Nov 15 11:39: 3 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from mail.du.gtn.com (mail.du.gtn.com [194.77.9.57])
	by hub.freebsd.org (Postfix) with ESMTP id CB10114A09
	for <freebsd-fs@FreeBSD.ORG>; Mon, 15 Nov 1999 11:38:58 -0800 (PST)
	(envelope-from ticso@mail.cicely.de)
Received: from mail.cicely.de (cicely.de [194.231.9.142])
	by mail.du.gtn.com (8.9.3/8.9.3) with ESMTP id UAA26480;
	Mon, 15 Nov 1999 20:32:01 +0100 (MET)
Received: (from ticso@localhost)
	by mail.cicely.de (8.9.0/8.9.0) id UAA06071;
	Mon, 15 Nov 1999 20:38:28 +0100 (CET)
Date: Mon, 15 Nov 1999 20:38:28 +0100
From: Bernd Walter <ticso@cicely.de>
To: Greg Lehey <grog@lemis.com>
Cc: Bernd Walter <ticso@cicely.de>,
	Mattias Pantzare <pantzer@ludd.luth.se>, freebsd-fs@FreeBSD.ORG
Subject: Re: RAID-5 and failure
Message-ID: <19991115203828.B5417@cicely7.cicely.de>
References: <ticso@cicely.de> <199911061716.SAA20783@zed.ludd.luth.se> <19991106183316.A9420@cicely7.cicely.de> <19991113213325.57908@mojave.sitaranetworks.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 1.0pre3i
In-Reply-To: <19991113213325.57908@mojave.sitaranetworks.com>
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Sat, Nov 13, 1999 at 09:33:25PM -0500, Greg Lehey wrote:
> 
> 4.  The system crashes after writing the first data block for a RAID-5
>     stripe and before writing the last data block.
> 
>     When the system comes up, both data and parity are inconsistent.
> 
> 5.  The system crashes after writing the last data block for a RAID-5
>     stripe and before writing the last parity block.
> 
>     When the system comes up, data is consistent, and parity is
>     inconsistent.
> 
> There are a number of ways of dealing with situations 4 and 5.  The
> real problem is that they only occur when the system crashes, so
> whatever recovery information is required must be stored in
> non-volatile storage.  Some systems do include a NOVRAM for this kind
> of information, but in general purpose systems the only possibility is
> to write the information to disk, which would make the inherently slow
> RAID-5 write even slower.  My attitude here is that RAID-5 writes are
> comparatively infrequent, and so are crashes.  In the case of (5), you
> could rebuild parity after a crash.  In the case of (4), I have no
> good answer.  Suggestions welcome.

Case 4 is not that different from case 5 as any differences should be
handled by the FS using the volume.

-- 
B.Walter                  COSMO-Project              http://www.cosmo-project.de
ticso@cicely.de             Usergroup                info@cosmo-project.de


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Mon Nov 15 11:42:53 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from mail.du.gtn.com (mail.du.gtn.com [194.77.9.57])
	by hub.freebsd.org (Postfix) with ESMTP id F0AF614A09
	for <freebsd-fs@FreeBSD.ORG>; Mon, 15 Nov 1999 11:42:50 -0800 (PST)
	(envelope-from ticso@mail.cicely.de)
Received: from mail.cicely.de (cicely.de [194.231.9.142])
	by mail.du.gtn.com (8.9.3/8.9.3) with ESMTP id UAA26696;
	Mon, 15 Nov 1999 20:35:52 +0100 (MET)
Received: (from ticso@localhost)
	by mail.cicely.de (8.9.0/8.9.0) id UAA06197;
	Mon, 15 Nov 1999 20:42:22 +0100 (CET)
Date: Mon, 15 Nov 1999 20:42:22 +0100
From: Bernd Walter <ticso@cicely.de>
To: Greg Lehey <grog@lemis.com>
Cc: Bernd Walter <ticso@cicely.de>,
	Mattias Pantzare <pantzer@ludd.luth.se>, freebsd-fs@FreeBSD.ORG
Subject: Re: RAID-5 and failure
Message-ID: <19991115204222.C5417@cicely7.cicely.de>
References: <ticso@cicely.de> <199911061827.TAA22113@zed.ludd.luth.se> <19991106200754.A9682@cicely7.cicely.de> <19991113213430.48370@mojave.sitaranetworks.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 1.0pre3i
In-Reply-To: <19991113213430.48370@mojave.sitaranetworks.com>
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Sat, Nov 13, 1999 at 09:34:30PM -0500, Greg Lehey wrote:
> 
> Yes, I suppose we could implement that for maximum security.  I wonder
> if any NOVRAM boards are available.
> 

Maybe the RIO project can bring in some interesting features.

-- 
B.Walter                  COSMO-Project              http://www.cosmo-project.de
ticso@cicely.de             Usergroup                info@cosmo-project.de


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Mon Nov 15 11:52:54 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from yana.lemis.com (yana.lemis.com [192.109.197.140])
	by hub.freebsd.org (Postfix) with ESMTP id 0AAC114BD5
	for <freebsd-fs@FreeBSD.ORG>; Mon, 15 Nov 1999 11:52:48 -0800 (PST)
	(envelope-from grog@mojave.sitaranetworks.com)
Received: from mojave.sitaranetworks.com (mojave.sitaranetworks.com [199.103.141.157])
	by yana.lemis.com (8.8.8/8.8.8) with ESMTP id GAA21345;
	Tue, 16 Nov 1999 06:22:34 +1030 (CST)
	(envelope-from grog@mojave.sitaranetworks.com)
Message-ID: <19991115145200.09633@mojave.sitaranetworks.com>
Date: Mon, 15 Nov 1999 14:52:00 -0500
From: Greg Lehey <grog@mojave.sitaranetworks.com>
To: Bernd Walter <ticso@cicely.de>
Cc: Mattias Pantzare <pantzer@ludd.luth.se>, freebsd-fs@FreeBSD.ORG
Subject: Re: RAID-5 and failure
Reply-To: Greg Lehey <grog@lemis.com>
References: <ticso@cicely.de> <199911061716.SAA20783@zed.ludd.luth.se> <19991106183316.A9420@cicely7.cicely.de> <19991113213325.57908@mojave.sitaranetworks.com> <19991115203828.B5417@cicely7.cicely.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
In-Reply-To: <19991115203828.B5417@cicely7.cicely.de>; from Bernd Walter on Mon, Nov 15, 1999 at 08:38:28PM +0100
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Monday, 15 November 1999 at 20:38:28 +0100, Bernd Walter wrote:
> On Sat, Nov 13, 1999 at 09:33:25PM -0500, Greg Lehey wrote:
>>
>> 4.  The system crashes after writing the first data block for a RAID-5
>>     stripe and before writing the last data block.
>>
>>     When the system comes up, both data and parity are inconsistent.
>>
>> 5.  The system crashes after writing the last data block for a RAID-5
>>     stripe and before writing the last parity block.
>>
>>     When the system comes up, data is consistent, and parity is
>>     inconsistent.
>>
>> There are a number of ways of dealing with situations 4 and 5.  The
>> real problem is that they only occur when the system crashes, so
>> whatever recovery information is required must be stored in
>> non-volatile storage.  Some systems do include a NOVRAM for this kind
>> of information, but in general purpose systems the only possibility is
>> to write the information to disk, which would make the inherently slow
>> RAID-5 write even slower.  My attitude here is that RAID-5 writes are
>> comparatively infrequent, and so are crashes.  In the case of (5), you
>> could rebuild parity after a crash.  In the case of (4), I have no
>> good answer.  Suggestions welcome.
>
> Case 4 is not that different from case 5 as any differences should be
> handled by the FS using the volume.

The problem is that in case 4 you don't have anything to go by.  You
don't know which data are inconsistent unless you keep a log.  The FS
using the volume has followed the kernel into the eternal bit bucket.

Greg
--
Finger grog@lemis.com for PGP public key
See complete headers for address and phone numbers


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Mon Nov 15 12: 6:50 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from mail.du.gtn.com (mail.du.gtn.com [194.77.9.57])
	by hub.freebsd.org (Postfix) with ESMTP id 07F1C150A7
	for <freebsd-fs@FreeBSD.ORG>; Mon, 15 Nov 1999 12:06:41 -0800 (PST)
	(envelope-from ticso@mail.cicely.de)
Received: from mail.cicely.de (cicely.de [194.231.9.142])
	by mail.du.gtn.com (8.9.3/8.9.3) with ESMTP id UAA28447;
	Mon, 15 Nov 1999 20:59:40 +0100 (MET)
Received: (from ticso@localhost)
	by mail.cicely.de (8.9.0/8.9.0) id VAA06307;
	Mon, 15 Nov 1999 21:06:08 +0100 (CET)
Date: Mon, 15 Nov 1999 21:06:08 +0100
From: Bernd Walter <ticso@cicely.de>
To: Greg Lehey <grog@lemis.com>
Cc: Bernd Walter <ticso@cicely.de>,
	Mattias Pantzare <pantzer@ludd.luth.se>, freebsd-fs@FreeBSD.ORG
Subject: Re: RAID-5 and failure
Message-ID: <19991115210607.A6252@cicely7.cicely.de>
References: <ticso@cicely.de> <199911061716.SAA20783@zed.ludd.luth.se> <19991106183316.A9420@cicely7.cicely.de> <19991113213325.57908@mojave.sitaranetworks.com> <19991115203828.B5417@cicely7.cicely.de> <19991115145200.09633@mojave.sitaranetworks.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 1.0pre3i
In-Reply-To: <19991115145200.09633@mojave.sitaranetworks.com>
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Mon, Nov 15, 1999 at 02:52:00PM -0500, Greg Lehey wrote:
> On Monday, 15 November 1999 at 20:38:28 +0100, Bernd Walter wrote:
> > On Sat, Nov 13, 1999 at 09:33:25PM -0500, Greg Lehey wrote:
> >>
> >> 4.  The system crashes after writing the first data block for a RAID-5
> >>     stripe and before writing the last data block.
> >>
> >>     When the system comes up, both data and parity are inconsistent.
> >>
> >> 5.  The system crashes after writing the last data block for a RAID-5
> >>     stripe and before writing the last parity block.
> >>
> >>     When the system comes up, data is consistent, and parity is
> >>     inconsistent.
> >>
> >> There are a number of ways of dealing with situations 4 and 5.  The
> >> real problem is that they only occur when the system crashes, so
> >> whatever recovery information is required must be stored in
> >> non-volatile storage.  Some systems do include a NOVRAM for this kind
> >> of information, but in general purpose systems the only possibility is
> >> to write the information to disk, which would make the inherently slow
> >> RAID-5 write even slower.  My attitude here is that RAID-5 writes are
> >> comparatively infrequent, and so are crashes.  In the case of (5), you
> >> could rebuild parity after a crash.  In the case of (4), I have no
> >> good answer.  Suggestions welcome.
> >
> > Case 4 is not that different from case 5 as any differences should be
> > handled by the FS using the volume.
> 
> The problem is that in case 4 you don't have anything to go by.  You
> don't know which data are inconsistent unless you keep a log.  The FS
> using the volume has followed the kernel into the eternal bit bucket.
> 
Of course - but that may happen with R0 too and even it may be possible with
a single disk.
The FS should realy be able to handle this case as it knows that there is an
outstanding write operation.

-- 
B.Walter                  COSMO-Project              http://www.cosmo-project.de
ticso@cicely.de             Usergroup                info@cosmo-project.de


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Mon Nov 15 15:12:19 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from cs.columbia.edu (cs.columbia.edu [128.59.16.20])
	by hub.freebsd.org (Postfix) with ESMTP
	id 26D5614A01; Mon, 15 Nov 1999 15:12:15 -0800 (PST)
	(envelope-from ezk@shekel.mcl.cs.columbia.edu)
Received: from shekel.mcl.cs.columbia.edu (shekel.mcl.cs.columbia.edu [128.59.18.15])
	by cs.columbia.edu (8.9.1/8.9.1) with ESMTP id SAA08098;
	Mon, 15 Nov 1999 18:12:13 -0500 (EST)
Received: (from ezk@localhost)
	by shekel.mcl.cs.columbia.edu (8.9.1/8.9.1) id SAA21891;
	Mon, 15 Nov 1999 18:12:09 -0500 (EST)
Date: Mon, 15 Nov 1999 18:12:09 -0500 (EST)
Message-Id: <199911152312.SAA21891@shekel.mcl.cs.columbia.edu>
X-Authentication-Warning: shekel.mcl.cs.columbia.edu: ezk set sender to ezk@shekel.mcl.cs.columbia.edu using -f
From: Erez Zadok <ezk@cs.columbia.edu>
To: Eivind Eklund <eivind@FreeBSD.ORG>
Cc: fs@FreeBSD.ORG
Subject: Re: namei() and freeing componentnames 
In-reply-to: Your message of "Fri, 12 Nov 1999 00:03:59 +0100."
             <19991112000359.A256@bitbox.follo.net> 
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

In message <19991112000359.A256@bitbox.follo.net>, Eivind Eklund writes:
[...]
> I suspect that for some filesystems (though none of the present ones),
> it might be necessary to do more than a
> zfree(namei_zone,cnp->cn_pnbuf) in order to free up all the relevant
> data.  In order to support this, we'd have to introduce a new VOP -
> tentatively called VOP_RELEASEND().  Unfortunately, this comes with a
> performance penalty.

Will VOP_RELEASEND be able to call a filesystem-specific routine?  I think
it should be flexible enough.  I can imagine that the VFS will call a
(stackable) filesystem's vop_releasend(), and that stackable f/s can call a
number of those on the lower level filesystem(s) it stacked on (there could
be more than one, namely fan-out f/s).

[...]
> This is somewhat vile, but has the advantage of keeping the code ready
> for the real VOP_RELEASEND(), and not loosing performance until we
> actually get some benefit out of it.
[...]
> Eivind.

WRT performance, I suggest that if possible, we #ifdef all of the stacking
code and fixes that have a non-insignificant performance impact.  Sure,
performance is important, but not at the cost of functionality (IMHO).  Not
all users would need stacking, so they can choose not to turn on the
relevant kernel #define and thus get maximum performance.  Those who do want
any stacking will have to pay a certain performance overhead.  Of course,
there's also an argument against too much #ifdef'ed code, b/c it makes
maintenance more difficult.

I think we should realize that there would be no way to fix the VFS w/o
impacting performance.  Rather than implement temporary fixes that avoid
"hurting" performance, we can (1) conditionalize that code, (2) get it
working *correctly* first, then (3) optimize it as needed, and (4) finally,
turn it on by default, possibly removing the non-stacking code.

Erez.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Tue Nov 16  2: 1:21 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from akat.civ.cvut.cz (akat.civ.cvut.cz [147.32.235.105])
	by hub.freebsd.org (Postfix) with SMTP id 127C114CBA
	for <freebsd-fs@FreeBSD.ORG>; Tue, 16 Nov 1999 02:01:13 -0800 (PST)
	(envelope-from pechy@hp735.cvut.cz)
Received: from localhost (pechy@localhost) by akat.civ.cvut.cz (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id LAA03026 for <freebsd-fs@FreeBSD.ORG>; Tue, 16 Nov 1999 11:01:11 +0100
Date: Tue, 16 Nov 1999 11:01:11 +0100
From: Jan Pechanec <pechy@hp735.cvut.cz>
X-Sender: pechy@akat.civ.cvut.cz
To: FreeBSD FS Mailing List <freebsd-fs@FreeBSD.ORG>
Subject: Copying file with not allocated blocks on disk
Message-ID: <Pine.SGI.4.05.9911161009250.2853-100000@akat.civ.cvut.cz>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org


	Hello,

	please, don't you know the reason why when copying file with
some blocks still not allocated on the disk (the blocks that will be
returned full of zeroes when accessed), the ,,zero'' blocks are
actually written? Why there is no check whether writing zero block and
do not write them? I understand that this would have to be inside the
implementation of particular filesystem.

	Ie., in general, why not have assertion: if the disk block
should contain all zeroes, we needn't to alocate physical space

	Thank you, Jan.

-- 
Jan PECHANEC (mailto:pechy@hp735.cvut.cz)
Computing Center CTU (Zikova 4, Praha 6, 166 35, Czech Republic)
www.civ.cvut.cz, pechy.civ.cvut.cz, tel: +420 2 24352969 (fax: 24310271)


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Tue Nov 16  9:15:25 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from mail.tvol.com (mail.wgate.com [38.219.83.4])
	by hub.freebsd.org (Postfix) with ESMTP id BC500152C0
	for <freebsd-fs@FreeBSD.ORG>; Tue, 16 Nov 1999 09:15:09 -0800 (PST)
	(envelope-from rjesup@wgate.com)
Received: from jesup.eng.tvol.net (jesup.eng.tvol.net [10.32.2.26]) by mail.tvol.com (8.8.8/8.8.3) with ESMTP id MAA28900; Tue, 16 Nov 1999 12:11:58 -0500 (EST)
Reply-To: Randell Jesup <rjesup@wgate.com>
To: Greg Lehey <grog@lemis.com>
Cc: freebsd-fs@FreeBSD.ORG
Subject: Re: RAID-5 and failure
References: <ticso@cicely.de> <199911061716.SAA20783@zed.ludd.luth.se> <19991106183316.A9420@cicely7.cicely.de> <19991113213325.57908@mojave.sitaranetworks.com>
From: Randell Jesup <rjesup@wgate.com>
Date: 16 Nov 1999 12:15:17 -0500
In-Reply-To: Greg Lehey's message of "Sat, 13 Nov 1999 21:33:25 -0500"
Message-ID: <ybuk8nis6hm.fsf@jesup.eng.tvol.net.jesup.eng.tvol.net>
X-Mailer: Gnus v5.6.43/Emacs 20.4
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Greg Lehey <grog@mojave.sitaranetworks.com> writes:
>In RAID-5, I first write the data blocks, then the parity blcoks.
>There are a number of scenarios here:

>4.  The system crashes after writing the first data block for a RAID-5
>    stripe and before writing the last data block.
>
>    When the system comes up, both data and parity are inconsistent.
>
>5.  The system crashes after writing the last data block for a RAID-5
>    stripe and before writing the last parity block.
>
>    When the system comes up, data is consistent, and parity is
>    inconsistent.
>
>There are a number of ways of dealing with situations 4 and 5.  The
>real problem is that they only occur when the system crashes, so
>whatever recovery information is required must be stored in
>non-volatile storage.  Some systems do include a NOVRAM for this kind
>of information, but in general purpose systems the only possibility is
>to write the information to disk, which would make the inherently slow
>RAID-5 write even slower.  My attitude here is that RAID-5 writes are
>comparatively infrequent, and so are crashes.  In the case of (5), you
>could rebuild parity after a crash.  In the case of (4), I have no
>good answer.  Suggestions welcome.

	Well, assuming that vinum can recognize that there might have been
outstanding writes (via the equivalent of a dirty flag):

	When the disks come back up (dirty), check all the parity.
The stripe that was being written will fail to check.  In case 4, the data
and parity are wrong, and in case 5, just the parity, but you don't know
which.  If you handle case 4, you can handle case 5 the same way.
Obviously you've had a write failure, but usually the FS can deal with
that possibility (with the chance of lost data, true).  Some form of
information passed out about what sector(s) were trashed might be useful
in recovery if you're not using default UFS/fsck.

	If it checks, then the data was all written before any crash,
and all is fine.

	So the biggest trick here is recognizing the fact that the system
crashed.  You could reserve a block (or set of blocks scattered about) on
each drive for dirty flags, and only mark a disk clean if it hasn't had
writes in <some configurable amount of time>.  This keeps the write
overhead down without requiring NVRAM.  There are other evil tricks: with
SCSI, you might be able to change some innocuous mode parameter and use
it as a dirty flag, though this probably has at least as much overhead
as reserving a dirty-flag block.  And of course if you have NVRAM, store
the dirty bit there.  Hmmmmm.  Maybe in the PC's clock chip - they
generally have several bits of NVRAM.....  (On the Amiga we used those
bits for storing things like SCSI Id, boot spinup delay, etc.)

	Alternatively, you could hide the dirty flag at a higher semantic
level, by (at the OS level) recognizing a system that wasn't shut down
properly and invoking the vinum re-synchronizer.  So long as the sectors
with problems aren't needed to boot the kernel and recognize this that will
work.

>> I asume that's the reason why some systems use 520 byte sectors - maybe they
>> write timestamps or generationnumbers in a single write within the sector.
>
>In fact, the 520 byte sectors are used to protect against data
>corruption between the disk and the controller.  They won't help in
>this scenario.

	At the cost of performance, you could use some bytes of each sector
for generation numbers, and know in case 5 that the data is correct.
Obviously case 4 will still fail.

-- 
Randell Jesup, Worldgate Communications, ex-Scala, ex-Amiga OS team ('88-94)
rjesup@wgate.com
CDA II has been passed and signed, sigh.  The lawsuit has been filed.  Please
support the organizations fighting it - ACLU, EFF, CDT, etc.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Tue Nov 16 10:19:24 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from bingnet2.cc.binghamton.edu (bingnet2.cc.binghamton.edu [128.226.1.18])
	by hub.freebsd.org (Postfix) with ESMTP
	id 6A3981533F; Tue, 16 Nov 1999 10:19:11 -0800 (PST)
	(envelope-from zzhang@cs.binghamton.edu)
Received: from sol.cs.binghamton.edu (cs1-gw.cs.binghamton.edu [128.226.171.72])
	by bingnet2.cc.binghamton.edu (8.9.3/8.9.3) with SMTP id NAA00441;
	Tue, 16 Nov 1999 13:19:04 -0500 (EST)
Date: Tue, 16 Nov 1999 12:06:37 -0500 (EST)
From: Zhihui Zhang <zzhang@cs.binghamton.edu>
Reply-To: Zhihui Zhang <zzhang@cs.binghamton.edu>
To: freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject: On-the-fly defragmentation of FFS
Message-ID: <Pine.GSO.3.96.991116111847.9599A-100000@sol.cs.binghamton.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org


After studying the code of ffs_reallocblks() for a while, it occurs to me
that the on-the-fly defragmentation of a FFS file (It does this on a per
file basis) only takes place at the end of a file and only when the
previous logical blocks have all been laid out contiguously on the disk
(see also cluster_write()).  This seems to me a lot of limitations to the
FFS defragger.  I wonder if the file was not allocated contiguously
when it was first created, how can it find contiguous space later unless
we delete a lot of files in between?

I hope someone can confirm or correct my understanding. It would be even
better if someone can suggest a way to improve defragmentation if the FFS
defragger is not very efficient.

BTW, if I copy all files from a filesystem to a new filesystem, will the
files be stored more contiguously?  Why?

Any help or suggestion is appreciated.

-Zhihui


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Tue Nov 16 12:21:31 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from fw.wintelcom.net (ns1.wintelcom.net [209.1.153.20])
	by hub.freebsd.org (Postfix) with ESMTP id 6593E14D88
	for <freebsd-fs@FreeBSD.ORG>; Tue, 16 Nov 1999 12:21:30 -0800 (PST)
	(envelope-from bright@wintelcom.net)
Received: from localhost (bright@localhost)
	by fw.wintelcom.net (8.9.3/8.9.3) with ESMTP id MAA08105;
	Tue, 16 Nov 1999 12:46:54 -0800 (PST)
Date: Tue, 16 Nov 1999 12:46:54 -0800 (PST)
From: Alfred Perlstein <bright@wintelcom.net>
To: Zhihui Zhang <zzhang@cs.binghamton.edu>
Cc: freebsd-fs@FreeBSD.ORG
Subject: Re: On-the-fly defragmentation of FFS
In-Reply-To: <Pine.GSO.3.96.991116111847.9599A-100000@sol.cs.binghamton.edu>
Message-ID: <Pine.BSF.4.05.9911161245250.12797-100000@fw.wintelcom.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Tue, 16 Nov 1999, Zhihui Zhang wrote:

> 
> After studying the code of ffs_reallocblks() for a while, it occurs to me
> that the on-the-fly defragmentation of a FFS file (It does this on a per
> file basis) only takes place at the end of a file and only when the
> previous logical blocks have all been laid out contiguously on the disk
> (see also cluster_write()).  This seems to me a lot of limitations to the
> FFS defragger.  I wonder if the file was not allocated contiguously
> when it was first created, how can it find contiguous space later unless
> we delete a lot of files in between?
> 
> I hope someone can confirm or correct my understanding. It would be even
> better if someone can suggest a way to improve defragmentation if the FFS
> defragger is not very efficient.
> 
> BTW, if I copy all files from a filesystem to a new filesystem, will the
> files be stored more contiguously?  Why?
> 
> Any help or suggestion is appreciated.

I think you're missing an obvious point, as the file is written out
the only place where it is likely to be fragmented is the end, hence
the reason for only defragging the end of the file. :)

-Alfred


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Tue Nov 16 12:50:57 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from alpo.whistle.com (alpo.whistle.com [207.76.204.38])
	by hub.freebsd.org (Postfix) with ESMTP id AE91514F07
	for <freebsd-fs@FreeBSD.ORG>; Tue, 16 Nov 1999 12:50:55 -0800 (PST)
	(envelope-from julian@whistle.com)
Received: from current1.whiste.com (current1.whistle.com [207.76.205.22])
	by alpo.whistle.com (8.9.1a/8.9.1) with ESMTP id MAA54808;
	Tue, 16 Nov 1999 12:50:53 -0800 (PST)
Date: Tue, 16 Nov 1999 12:50:51 -0800 (PST)
From: Julian Elischer <julian@whistle.com>
To: Alfred Perlstein <bright@wintelcom.net>
Cc: Zhihui Zhang <zzhang@cs.binghamton.edu>, freebsd-fs@FreeBSD.ORG
Subject: Re: On-the-fly defragmentation of FFS
In-Reply-To: <Pine.BSF.4.05.9911161245250.12797-100000@fw.wintelcom.net>
Message-ID: <Pine.BSF.4.10.9911161250040.25805-100000@current1.whistle.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> 
> I think you're missing an obvious point, as the file is written out
> the only place where it is likely to be fragmented is the end, hence
> the reason for only defragging the end of the file. :)

usually, though database files can be written randomly as they are filled
in.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Tue Nov 16 13: 1:14 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from bingnet2.cc.binghamton.edu (bingnet2.cc.binghamton.edu [128.226.1.18])
	by hub.freebsd.org (Postfix) with ESMTP id F1F8914CD5
	for <freebsd-fs@FreeBSD.ORG>; Tue, 16 Nov 1999 13:01:09 -0800 (PST)
	(envelope-from zzhang@cs.binghamton.edu)
Received: from sol.cs.binghamton.edu (cs1-gw.cs.binghamton.edu [128.226.171.72])
	by bingnet2.cc.binghamton.edu (8.9.3/8.9.3) with SMTP id QAA06263;
	Tue, 16 Nov 1999 16:01:05 -0500 (EST)
Date: Tue, 16 Nov 1999 14:48:36 -0500 (EST)
From: Zhihui Zhang <zzhang@cs.binghamton.edu>
To: Alfred Perlstein <bright@wintelcom.net>
Cc: freebsd-fs@FreeBSD.ORG
Subject: Re: On-the-fly defragmentation of FFS
In-Reply-To: <Pine.BSF.4.05.9911161245250.12797-100000@fw.wintelcom.net>
Message-ID: <Pine.GSO.3.96.991116143730.11223A-100000@sol.cs.binghamton.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org


On Tue, 16 Nov 1999, Alfred Perlstein wrote:

> On Tue, 16 Nov 1999, Zhihui Zhang wrote:
> 
> > 
> > After studying the code of ffs_reallocblks() for a while, it occurs to me
> > that the on-the-fly defragmentation of a FFS file (It does this on a per
> > file basis) only takes place at the end of a file and only when the
> > previous logical blocks have all been laid out contiguously on the disk
> > (see also cluster_write()).  This seems to me a lot of limitations to the
> > FFS defragger.  I wonder if the file was not allocated contiguously
> > when it was first created, how can it find contiguous space later unless
> > we delete a lot of files in between?
> > 
> > I hope someone can confirm or correct my understanding. It would be even
> > better if someone can suggest a way to improve defragmentation if the FFS
> > defragger is not very efficient.
> > 
> > BTW, if I copy all files from a filesystem to a new filesystem, will the
> > files be stored more contiguously?  Why?
> > 
> > Any help or suggestion is appreciated.
> 
> I think you're missing an obvious point, as the file is written out
> the only place where it is likely to be fragmented is the end, hence
> the reason for only defragging the end of the file. :)
> 

Thanks. I think this defragmentation (I can not find a better word for it) 
means making the blocks contiguous. Consider the case which in the last
eight blocks of a file, seven of them are already contiguously allocated
and only the last block is not.  Now if we write at the very last block,
the filesystem will try to move those seven blocks and the last block
together to some other place to make them all contiguous.  This only
happens at the end of a file.  I was wondering if this can happen
elsewhere or if there is a better solution for this kind of adjustment.

-Zhihui


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Tue Nov 16 13: 3: 9 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from fw.wintelcom.net (ns1.wintelcom.net [209.1.153.20])
	by hub.freebsd.org (Postfix) with ESMTP id BA26114D98
	for <freebsd-fs@FreeBSD.ORG>; Tue, 16 Nov 1999 13:03:08 -0800 (PST)
	(envelope-from bright@wintelcom.net)
Received: from localhost (bright@localhost)
	by fw.wintelcom.net (8.9.3/8.9.3) with ESMTP id NAA09314;
	Tue, 16 Nov 1999 13:29:03 -0800 (PST)
Date: Tue, 16 Nov 1999 13:29:03 -0800 (PST)
From: Alfred Perlstein <bright@wintelcom.net>
To: Julian Elischer <julian@whistle.com>
Cc: Zhihui Zhang <zzhang@cs.binghamton.edu>, freebsd-fs@FreeBSD.ORG
Subject: Re: On-the-fly defragmentation of FFS
In-Reply-To: <Pine.BSF.4.10.9911161250040.25805-100000@current1.whistle.com>
Message-ID: <Pine.BSF.4.05.9911161327080.12797-100000@fw.wintelcom.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Tue, 16 Nov 1999, Julian Elischer wrote:

> > 
> > I think you're missing an obvious point, as the file is written out
> > the only place where it is likely to be fragmented is the end, hence
> > the reason for only defragging the end of the file. :)
> 
> usually, though database files can be written randomly as they are filled
> in.

Excellent point, however won't FFS's block placement strategy fix
work around this unless the filesystem is already pretty full?

Or is this one of the bad-case-scenarios for FFS?

-Alfred


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Tue Nov 16 13: 9:33 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from bingnet2.cc.binghamton.edu (bingnet2.cc.binghamton.edu [128.226.1.18])
	by hub.freebsd.org (Postfix) with ESMTP id 22C3C14D98
	for <freebsd-fs@FreeBSD.ORG>; Tue, 16 Nov 1999 13:09:30 -0800 (PST)
	(envelope-from zzhang@cs.binghamton.edu)
Received: from sol.cs.binghamton.edu (cs1-gw.cs.binghamton.edu [128.226.171.72])
	by bingnet2.cc.binghamton.edu (8.9.3/8.9.3) with SMTP id QAA09240;
	Tue, 16 Nov 1999 16:09:25 -0500 (EST)
Date: Tue, 16 Nov 1999 14:56:56 -0500 (EST)
From: Zhihui Zhang <zzhang@cs.binghamton.edu>
To: Julian Elischer <julian@whistle.com>
Cc: Alfred Perlstein <bright@wintelcom.net>, freebsd-fs@FreeBSD.ORG
Subject: Re: On-the-fly defragmentation of FFS
In-Reply-To: <Pine.BSF.4.10.9911161250040.25805-100000@current1.whistle.com>
Message-ID: <Pine.GSO.3.96.991116145429.11223B-100000@sol.cs.binghamton.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org


On Tue, 16 Nov 1999, Julian Elischer wrote:

> > 
> > I think you're missing an obvious point, as the file is written out
> > the only place where it is likely to be fragmented is the end, hence
> > the reason for only defragging the end of the file. :)
> 
> usually, though database files can be written randomly as they are filled
> in.
> 

Can a database file has holes? I had some experience with Oracle. I used
to create a large file for a database and assumed that all space of the
database file are pre-allocated.  Otherwise, the performance of the
database will be poor.

-Zhihui


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Tue Nov 16 13:13:21 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from fw.wintelcom.net (ns1.wintelcom.net [209.1.153.20])
	by hub.freebsd.org (Postfix) with ESMTP id 12E3914D98
	for <freebsd-fs@FreeBSD.ORG>; Tue, 16 Nov 1999 13:13:20 -0800 (PST)
	(envelope-from bright@wintelcom.net)
Received: from localhost (bright@localhost)
	by fw.wintelcom.net (8.9.3/8.9.3) with ESMTP id NAA09589;
	Tue, 16 Nov 1999 13:39:14 -0800 (PST)
Date: Tue, 16 Nov 1999 13:39:14 -0800 (PST)
From: Alfred Perlstein <bright@wintelcom.net>
To: Zhihui Zhang <zzhang@cs.binghamton.edu>
Cc: freebsd-fs@FreeBSD.ORG
Subject: Re: On-the-fly defragmentation of FFS
In-Reply-To: <Pine.GSO.3.96.991116143730.11223A-100000@sol.cs.binghamton.edu>
Message-ID: <Pine.BSF.4.05.9911161331090.12797-100000@fw.wintelcom.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Tue, 16 Nov 1999, Zhihui Zhang wrote:

> 
> On Tue, 16 Nov 1999, Alfred Perlstein wrote:
> 
> > On Tue, 16 Nov 1999, Zhihui Zhang wrote:
> > 
> > > 
> > > After studying the code of ffs_reallocblks() for a while, it occurs to me
> > > that the on-the-fly defragmentation of a FFS file (It does this on a per
> > > file basis) only takes place at the end of a file and only when the
> > > previous logical blocks have all been laid out contiguously on the disk
> > > (see also cluster_write()).  This seems to me a lot of limitations to the
> > > FFS defragger.  I wonder if the file was not allocated contiguously
> > > when it was first created, how can it find contiguous space later unless
> > > we delete a lot of files in between?
> > > 
> > > I hope someone can confirm or correct my understanding. It would be even
> > > better if someone can suggest a way to improve defragmentation if the FFS
> > > defragger is not very efficient.
> > > 
> > > BTW, if I copy all files from a filesystem to a new filesystem, will the
> > > files be stored more contiguously?  Why?
> > > 
> > > Any help or suggestion is appreciated.
> > 
> > I think you're missing an obvious point, as the file is written out
> > the only place where it is likely to be fragmented is the end, hence
> > the reason for only defragging the end of the file. :)
> > 
> 
> Thanks. I think this defragmentation (I can not find a better word for it) 
> means making the blocks contiguous. Consider the case which in the last
> eight blocks of a file, seven of them are already contiguously allocated
> and only the last block is not.  Now if we write at the very last block,
> the filesystem will try to move those seven blocks and the last block
> together to some other place to make them all contiguous.  This only
> happens at the end of a file.  I was wondering if this can happen
> elsewhere or if there is a better solution for this kind of adjustment.

Not to my knowledge, however if it only works on the tail end of
files (which I'm 99% sure is true) then Julian's point is a problem
for this algorithm, (files with holes) it may be smart to try to
reallocblks on 64k cluster boundries.

However this starts to get into adaptive algorithms, something that
FFS already has plenty of. :)  More couldn't hurt, insight, work
and testing of such an algorithm would probably be very appreciated.

One of the things that Kirk mused making adaptive was FFS's aggressive
write-behind feature that can cause problems when the entire dataset
fits into ram.  It doesn't necessarily cause problems, execpt for
the fact that linux has a more aggressive caching algorithm that will
not write anything out until the cache is nearly full.  Each approach
has it's advantages and drawbacks, FreeBSD excels when the dataset is
larger than ram, whereas Linux owns the show when it does fit into 
ram.  An adaptive algorithm would be very benificial for this strategy.

-Alfred

> 
> -Zhihui
> 


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Tue Nov 16 13:27:49 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from bingnet2.cc.binghamton.edu (bingnet2.cc.binghamton.edu [128.226.1.18])
	by hub.freebsd.org (Postfix) with ESMTP id 1831B14D01
	for <freebsd-fs@FreeBSD.ORG>; Tue, 16 Nov 1999 13:27:46 -0800 (PST)
	(envelope-from zzhang@cs.binghamton.edu)
Received: from sol.cs.binghamton.edu (cs1-gw.cs.binghamton.edu [128.226.171.72])
	by bingnet2.cc.binghamton.edu (8.9.3/8.9.3) with SMTP id QAA16608;
	Tue, 16 Nov 1999 16:27:44 -0500 (EST)
Date: Tue, 16 Nov 1999 15:15:13 -0500 (EST)
From: Zhihui Zhang <zzhang@cs.binghamton.edu>
To: Alfred Perlstein <bright@wintelcom.net>
Cc: freebsd-fs@FreeBSD.ORG
Subject: Re: On-the-fly defragmentation of FFS
In-Reply-To: <Pine.BSF.4.05.9911161331090.12797-100000@fw.wintelcom.net>
Message-ID: <Pine.GSO.3.96.991116150930.11223C-100000@sol.cs.binghamton.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org


> One of the things that Kirk mused making adaptive was FFS's aggressive
> write-behind feature that can cause problems when the entire dataset
> fits into ram.  

Are you talking about softupdate code? Could you explain a little more
about this?  It seems to me that writes will not happen unless there is no
room in the cache. 

> It doesn't necessarily cause problems, execpt for
> the fact that linux has a more aggressive caching algorithm that will
> not write anything out until the cache is nearly full.  Each approach
> has it's advantages and drawbacks, FreeBSD excels when the dataset is
> larger than ram, whereas Linux owns the show when it does fit into 
> ram.  An adaptive algorithm would be very benificial for this strategy.

Are there any references for this subject?

-Zhihui


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Wed Nov 17  7:24:37 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from yana.lemis.com (yana.lemis.com [192.109.197.140])
	by hub.freebsd.org (Postfix) with ESMTP id 2635E14E09
	for <freebsd-fs@FreeBSD.ORG>; Wed, 17 Nov 1999 07:24:28 -0800 (PST)
	(envelope-from grog@mojave.sitaranetworks.com)
Received: from mojave.sitaranetworks.com (mojave.sitaranetworks.com [199.103.141.157])
	by yana.lemis.com (8.8.8/8.8.8) with ESMTP id BAA23656;
	Thu, 18 Nov 1999 01:54:22 +1030 (CST)
	(envelope-from grog@mojave.sitaranetworks.com)
Message-ID: <19991116204916.44107@mojave.sitaranetworks.com>
Date: Tue, 16 Nov 1999 20:49:16 -0500
From: Greg Lehey <grog@mojave.sitaranetworks.com>
To: Randell Jesup <rjesup@wgate.com>
Cc: freebsd-fs@FreeBSD.ORG
Subject: Re: RAID-5 and failure
Reply-To: Greg Lehey <grog@lemis.com>
References: <ticso@cicely.de> <199911061716.SAA20783@zed.ludd.luth.se> <19991106183316.A9420@cicely7.cicely.de> <19991113213325.57908@mojave.sitaranetworks.com> <ybuk8nis6hm.fsf@jesup.eng.tvol.net.jesup.eng.tvol.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
In-Reply-To: <ybuk8nis6hm.fsf@jesup.eng.tvol.net.jesup.eng.tvol.net>; from Randell Jesup on Tue, Nov 16, 1999 at 12:15:17PM -0500
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Tuesday, 16 November 1999 at 12:15:17 -0500, Randell Jesup wrote:
> Greg Lehey <grog@mojave.sitaranetworks.com> writes:
>> In RAID-5, I first write the data blocks, then the parity blcoks.
>> There are a number of scenarios here:
>
>> 4.  The system crashes after writing the first data block for a RAID-5
>>    stripe and before writing the last data block.
>>
>>    When the system comes up, both data and parity are inconsistent.
>>
>> 5.  The system crashes after writing the last data block for a RAID-5
>>    stripe and before writing the last parity block.
>>
>>    When the system comes up, data is consistent, and parity is
>>    inconsistent.
>>
>> There are a number of ways of dealing with situations 4 and 5.  The
>> real problem is that they only occur when the system crashes, so
>> whatever recovery information is required must be stored in
>> non-volatile storage.  Some systems do include a NOVRAM for this kind
>> of information, but in general purpose systems the only possibility is
>> to write the information to disk, which would make the inherently slow
>> RAID-5 write even slower.  My attitude here is that RAID-5 writes are
>> comparatively infrequent, and so are crashes.  In the case of (5), you
>> could rebuild parity after a crash.  In the case of (4), I have no
>> good answer.  Suggestions welcome.
>
> 	Well, assuming that vinum can recognize that there might have been
> outstanding writes (via the equivalent of a dirty flag):
>
> 	When the disks come back up (dirty), check all the parity.
> The stripe that was being written will fail to check.  In case 4, the data
> and parity are wrong, and in case 5, just the parity, but you don't know
> which.  If you handle case 4, you can handle case 5 the same way.
> Obviously you've had a write failure, but usually the FS can deal with
> that possibility (with the chance of lost data, true).  Some form of
> information passed out about what sector(s) were trashed might be useful
> in recovery if you're not using default UFS/fsck.

Well, you're still left with the dilemma.  Worse, this check makes
fsck look like an instantaneous operation: you have to read the entire
contents of every disk.  For a 500 GB database spread across 3 LVD
controllers, you're looking at several hours.

> 	If it checks, then the data was all written before any crash,
> and all is fine.

That's the simple case.

> 	So the biggest trick here is recognizing the fact that the system
> crashed.  You could reserve a block (or set of blocks scattered about) on
> each drive for dirty flags, and only mark a disk clean if it hasn't had
> writes in <some configurable amount of time>.  This keeps the write
> overhead down without requiring NVRAM.  There are other evil tricks: with
> SCSI, you might be able to change some innocuous mode parameter and use
> it as a dirty flag, though this probably has at least as much overhead
> as reserving a dirty-flag block.  And of course if you have NVRAM, store
> the dirty bit there.  Hmmmmm.  Maybe in the PC's clock chip - they
> generally have several bits of NVRAM.....  (On the Amiga we used those
> bits for storing things like SCSI Id, boot spinup delay, etc.)
>
> 	Alternatively, you could hide the dirty flag at a higher semantic
> level, by (at the OS level) recognizing a system that wasn't shut down
> properly and invoking the vinum re-synchronizer.  So long as the sectors
> with problems aren't needed to boot the kernel and recognize this that will
> work.

Basically, the way I see it, we have three options:

1.  Disks never crash, and anyway, we don't write to them.  Ignore the
    problem and deal with it if it comes to bite us.

2.  Get an NVRAM board and use it for this purpose.

3.  Bite the bullet and write intention logs before each write.
    VERITAS has this as an option.

These options don't have to be mutually exclusive.  It's quite
possible to implement both ((1) doesn't need implementation :-) and
leave it to the user to decide which to use.

>>> I asume that's the reason why some systems use 520 byte sectors - maybe they
>>> write timestamps or generationnumbers in a single write within the sector.
>>
>> In fact, the 520 byte sectors are used to protect against data
>> corruption between the disk and the controller.  They won't help in
>> this scenario.
>
> 	At the cost of performance, you could use some bytes of each sector
> for generation numbers, and know in case 5 that the data is correct.
> Obviously case 4 will still fail.

No, the way things work, this would be very expensive.  We'd have to
move the data to a larger buffer and set the flags, and it would also
require at least reformatting the drive, assuming it's possible to set
a different sector.  There are better ways to do this.

Greg
--
Finger grog@lemis.com for PGP public key
See complete headers for address and phone numbers


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Wed Nov 17  7:25:14 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from yana.lemis.com (yana.lemis.com [192.109.197.140])
	by hub.freebsd.org (Postfix) with ESMTP id 031241528B
	for <freebsd-fs@FreeBSD.ORG>; Wed, 17 Nov 1999 07:25:03 -0800 (PST)
	(envelope-from grog@mojave.sitaranetworks.com)
Received: from mojave.sitaranetworks.com (mojave.sitaranetworks.com [199.103.141.157])
	by yana.lemis.com (8.8.8/8.8.8) with ESMTP id BAA23662;
	Thu, 18 Nov 1999 01:54:44 +1030 (CST)
	(envelope-from grog@mojave.sitaranetworks.com)
Message-ID: <19991116204101.12932@mojave.sitaranetworks.com>
Date: Tue, 16 Nov 1999 20:41:01 -0500
From: Greg Lehey <grog@mojave.sitaranetworks.com>
To: Bernd Walter <ticso@cicely.de>
Cc: Mattias Pantzare <pantzer@ludd.luth.se>, freebsd-fs@FreeBSD.ORG
Subject: Re: RAID-5 and failure
Reply-To: Greg Lehey <grog@lemis.com>
References: <ticso@cicely.de> <199911061716.SAA20783@zed.ludd.luth.se> <19991106183316.A9420@cicely7.cicely.de> <19991113213325.57908@mojave.sitaranetworks.com> <19991115203828.B5417@cicely7.cicely.de> <19991115145200.09633@mojave.sitaranetworks.com> <19991115210607.A6252@cicely7.cicely.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
In-Reply-To: <19991115210607.A6252@cicely7.cicely.de>; from Bernd Walter on Mon, Nov 15, 1999 at 09:06:08PM +0100
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Monday, 15 November 1999 at 21:06:08 +0100, Bernd Walter wrote:
> On Mon, Nov 15, 1999 at 02:52:00PM -0500, Greg Lehey wrote:
>> On Monday, 15 November 1999 at 20:38:28 +0100, Bernd Walter wrote:
>>> On Sat, Nov 13, 1999 at 09:33:25PM -0500, Greg Lehey wrote:
>>>>
>>>> 4.  The system crashes after writing the first data block for a RAID-5
>>>>     stripe and before writing the last data block.
>>>>
>>>>     When the system comes up, both data and parity are inconsistent.
>>>>
>>>> 5.  The system crashes after writing the last data block for a RAID-5
>>>>     stripe and before writing the last parity block.
>>>>
>>>>     When the system comes up, data is consistent, and parity is
>>>>     inconsistent.
>>>>
>>>> There are a number of ways of dealing with situations 4 and 5.  The
>>>> real problem is that they only occur when the system crashes, so
>>>> whatever recovery information is required must be stored in
>>>> non-volatile storage.  Some systems do include a NOVRAM for this kind
>>>> of information, but in general purpose systems the only possibility is
>>>> to write the information to disk, which would make the inherently slow
>>>> RAID-5 write even slower.  My attitude here is that RAID-5 writes are
>>>> comparatively infrequent, and so are crashes.  In the case of (5), you
>>>> could rebuild parity after a crash.  In the case of (4), I have no
>>>> good answer.  Suggestions welcome.
>>>
>>> Case 4 is not that different from case 5 as any differences should be
>>> handled by the FS using the volume.
>>
>> The problem is that in case 4 you don't have anything to go by.  You
>> don't know which data are inconsistent unless you keep a log.  The FS
>> using the volume has followed the kernel into the eternal bit bucket.
>
> Of course - but that may happen with R0 too and even it may be possible with
> a single disk.

Sure.  It's not specific to RAID-5.

> The FS should realy be able to handle this case as it knows that
> there is an outstanding write operation.

How does it know?  That's the question.  All state information has
gone to /dev/null.  The only alternative is to write this state
information to some non-volatile location, which usually means disk
and associated severe loss of performance.

Greg
--
Finger grog@lemis.com for PGP public key
See complete headers for address and phone numbers


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Wed Nov 17  9:31:41 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from bomber.avantgo.com (ws1.avantgo.com [207.214.200.194])
	by hub.freebsd.org (Postfix) with ESMTP id D3DD814F68
	for <freebsd-fs@freebsd.org>; Wed, 17 Nov 1999 09:31:24 -0800 (PST)
	(envelope-from scott@avantgo.com)
Received: from river ([10.0.128.30]) by bomber.avantgo.com
          (Netscape Messaging Server 3.5)  with SMTP id 238
          for <freebsd-fs@freebsd.org>; Wed, 17 Nov 1999 09:27:00 -0800
Message-ID: <166101bf3121$76518900$1e80000a@avantgo.com>
From: "Scott Hess" <scott@avantgo.com>
To: <freebsd-fs@freebsd.org>
Subject: vinum, MYSQL, and small transaction sizes.
Date: Wed, 17 Nov 1999 09:30:37 -0800
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 5.00.2314.1300
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

I've been experimenting with vinum striping as a means of improving MYSQL
performance, and am having some odd results.

Running a particular workload and a particular set of disks, at overload
iostat shows the disk doing about 185 tps, and about 8KB/t.  When I run the
workload on a 256k striped volume made up of two drives, I'm finding that
each drive does about 95 tps.  I've also run the tests with slower drives,
which do 155 tps for the single-drive test, and 80 tps for the striped
test.

I didn't expect to double the tps of the entire system - but getting no
increase at all seems very suspect.  Based on the transaction sizes iostat
is reporting, I have tried restriping with 8k stripes, which gives me about
105 tps per disk, which is marginally better.  Going the other direction,
with 1m stripes, gave the same results as for 256k stripes.

In an attempt to isolate the problem, I tried cat'ing very large files in
parallel.  The files were large enough to not fit in memory, and I ran four
cat commands at the same time on different files.  I found that running
them all from a single disk gave 380tps (24M/s), running 4 on one drive and
4 on the other gave 200tps (12M/s) for each drive, 400tps (24M/s)
aggregate, and running them on a 256k volume striped across the disks gave
100tps (6M/s) for each drive, 200tps (12M/s) aggregate.

Given past experience with the Linux md driver, I really really really
suspect I'm missing something.  But I couldn't tell you what.  Running
under FreeBSD3.3-RELEASE.

Later,
scott


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Wed Nov 17 10:19:55 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from mail.du.gtn.com (mail.du.gtn.com [194.77.9.57])
	by hub.freebsd.org (Postfix) with ESMTP id AB93214FD6
	for <freebsd-fs@FreeBSD.ORG>; Wed, 17 Nov 1999 10:19:41 -0800 (PST)
	(envelope-from ticso@mail.cicely.de)
Received: from mail.cicely.de (cicely.de [194.231.9.142])
	by mail.du.gtn.com (8.9.3/8.9.3) with ESMTP id TAA29709;
	Wed, 17 Nov 1999 19:12:43 +0100 (MET)
Received: (from ticso@localhost)
	by mail.cicely.de (8.9.0/8.9.0) id TAA13518;
	Wed, 17 Nov 1999 19:19:13 +0100 (CET)
Date: Wed, 17 Nov 1999 19:19:13 +0100
From: Bernd Walter <ticso@cicely.de>
To: Greg Lehey <grog@lemis.com>
Cc: Bernd Walter <ticso@cicely.de>,
	Mattias Pantzare <pantzer@ludd.luth.se>, freebsd-fs@FreeBSD.ORG
Subject: Re: RAID-5 and failure
Message-ID: <19991117191912.A12883@cicely7.cicely.de>
References: <ticso@cicely.de> <199911061716.SAA20783@zed.ludd.luth.se> <19991106183316.A9420@cicely7.cicely.de> <19991113213325.57908@mojave.sitaranetworks.com> <19991115203828.B5417@cicely7.cicely.de> <19991115145200.09633@mojave.sitaranetworks.com> <19991115210607.A6252@cicely7.cicely.de> <19991116204101.12932@mojave.sitaranetworks.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 1.0pre3i
In-Reply-To: <19991116204101.12932@mojave.sitaranetworks.com>
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Tue, Nov 16, 1999 at 08:41:01PM -0500, Greg Lehey wrote:
> On Monday, 15 November 1999 at 21:06:08 +0100, Bernd Walter wrote:
> 
> > The FS should realy be able to handle this case as it knows that
> > there is an outstanding write operation.
> 
> How does it know?  That's the question.  All state information has
> gone to /dev/null.  The only alternative is to write this state
> information to some non-volatile location, which usually means disk
> and associated severe loss of performance.

The FS is dirty. The FS before the panic/powerfailure/... had known the
outstanding transaction and shouldn't create a situation in which fsck can't
handle such a case. It should even expect only a part to be writen as multiple
sector transfers are known not to be atomic - that's why critical state
information should never cross sector boundarys.
I asume most modern HDDs are able to finish a single sector write in case
of power failures.
In case the drive simply returns a CRC error we realy have a problem because
the parity might not be in sync and we can't recover this sector relyable.
Nevertheless I got several powerfailures during write access and never got
CRCs since ESDI because of that.

In case application data was lost that's not a OS specific problem.
As long as the applications did not flush the buffers and success was returned
it should not be surprised if data gets lost because they could also be in some
kind of writecache.

-- 
B.Walter                  COSMO-Project              http://www.cosmo-project.de
ticso@cicely.de             Usergroup                info@cosmo-project.de


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Wed Nov 17 14:29:48 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from yana.lemis.com (yana.lemis.com [192.109.197.140])
	by hub.freebsd.org (Postfix) with ESMTP id A7E7914DF8
	for <freebsd-fs@FreeBSD.ORG>; Wed, 17 Nov 1999 14:29:39 -0800 (PST)
	(envelope-from grog@mojave.sitaranetworks.com)
Received: from mojave.sitaranetworks.com (mojave.sitaranetworks.com [199.103.141.157])
	by yana.lemis.com (8.8.8/8.8.8) with ESMTP id IAA24124;
	Thu, 18 Nov 1999 08:59:25 +1030 (CST)
	(envelope-from grog@mojave.sitaranetworks.com)
Message-ID: <19991117172851.06023@mojave.sitaranetworks.com>
Date: Wed, 17 Nov 1999 17:28:51 -0500
From: Greg Lehey <grog@mojave.sitaranetworks.com>
To: Scott Hess <scott@avantgo.com>, freebsd-fs@FreeBSD.ORG
Subject: Re: vinum, MYSQL, and small transaction sizes.
Reply-To: Greg Lehey <grog@lemis.com>
References: <166101bf3121$76518900$1e80000a@avantgo.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
In-Reply-To: <166101bf3121$76518900$1e80000a@avantgo.com>; from Scott Hess on Wed, Nov 17, 1999 at 09:30:37AM -0800
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Wednesday, 17 November 1999 at  9:30:37 -0800, Scott Hess wrote:
> I've been experimenting with vinum striping as a means of improving MYSQL
> performance, and am having some odd results.
>
> Running a particular workload and a particular set of disks, at overload
> iostat shows the disk doing about 185 tps, and about 8KB/t.  When I run the
> workload on a 256k striped volume made up of two drives, I'm finding that
> each drive does about 95 tps.  I've also run the tests with slower drives,
> which do 155 tps for the single-drive test, and 80 tps for the striped
> test.
>
> I didn't expect to double the tps of the entire system - but getting no
> increase at all seems very suspect.

It's frequently the system's way of saying "the disk is not the
bottleneck".

> Based on the transaction sizes iostat is reporting, I have tried
> restriping with 8k stripes, which gives me about 105 tps per disk,
> which is marginally better.  Going the other direction, with 1m
> stripes, gave the same results as for 256k stripes.

I think this is probably a red herring.  It's very unlikely that
you'll get better performance from an 8k stripe than a 256k stripe.
The fact that there's not a significant degradation with such small
stripes again points to the likelihood that the disks aren't the
bottleneck, though it could also indicate that the transfers are very
small (as you indicate in the Subject: line).  How big are the
transfers?

> In an attempt to isolate the problem, I tried cat'ing very large
> files in parallel.  The files were large enough to not fit in
> memory, and I ran four cat commands at the same time on different
> files.  I found that running them all from a single disk gave 380tps
> (24M/s), running 4 on one drive and 4 on the other gave 200tps
> (12M/s) for each drive, 400tps (24M/s) aggregate, and running them
> on a 256k volume striped across the disks gave 100tps (6M/s) for
> each drive, 200tps (12M/s) aggregate.

Hmm.  The arithmetic at the end suggests that you only striped across
2 disks.  What kind of disks are they?  You'll run into significant
contention problems with IDE, for example.  Also, what version of
FreeBSD?

Greg
--
Finger grog@lemis.com for PGP public key
See complete headers for address and phone numbers


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Wed Nov 17 15:15:59 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from bomber.avantgo.com (ws1.avantgo.com [207.214.200.194])
	by hub.freebsd.org (Postfix) with ESMTP id 4C69D14C9E
	for <freebsd-fs@FreeBSD.ORG>; Wed, 17 Nov 1999 15:15:56 -0800 (PST)
	(envelope-from scott@avantgo.com)
Received: from river ([10.0.128.30]) by bomber.avantgo.com
          (Netscape Messaging Server 3.5)  with SMTP id 215;
          Wed, 17 Nov 1999 15:11:36 -0800
Message-ID: <17e101bf3151$99554ec0$1e80000a@avantgo.com>
From: "Scott Hess" <scott@avantgo.com>
To: "Greg Lehey" <grog@lemis.com>, <freebsd-fs@FreeBSD.ORG>
References: <166101bf3121$76518900$1e80000a@avantgo.com> <19991117172851.06023@mojave.sitaranetworks.com>
Subject: Re: vinum, MYSQL, and small transaction sizes.
Date: Wed, 17 Nov 1999 15:15:12 -0800
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 5.00.2314.1300
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Greg Lehey <grog@mojave.sitaranetworks.com> wrote:
> On Wednesday, 17 November 1999 at  9:30:37 -0800, Scott Hess wrote:
> > I didn't expect to double the tps of the entire system - but getting no
> > increase at all seems very suspect.
>
> It's frequently the system's way of saying "the disk is not the
> bottleneck".

Memory is not an issue, CPU time is not an issue.  AFAICT, the disk _is_
the bottleneck, because when I upgrade to faster disks, the tps goes up -
both for the single-disk test (155->185), and for the vinum'ed test
(80->95).  I can't think of another way I'd see those results.

> > Based on the transaction sizes iostat is reporting, I have tried
> > restriping with 8k stripes, which gives me about 105 tps per disk,
> > which is marginally better.  Going the other direction, with 1m
> > stripes, gave the same results as for 256k stripes.
>
> I think this is probably a red herring.  It's very unlikely that
> you'll get better performance from an 8k stripe than a 256k stripe.
> The fact that there's not a significant degradation with such small
> stripes again points to the likelihood that the disks aren't the
> bottleneck, though it could also indicate that the transfers are very
> small (as you indicate in the Subject: line).  How big are the
> transfers?

iostat reports that the average transfer size is 8k.  I can't tell for
certain what the distribution is, but I am pretty certain it is basically
everything at 8k, with a couple 16k transfers (lots of short bits of data).

> > In an attempt to isolate the problem, I tried cat'ing very large
> > files in parallel.  The files were large enough to not fit in
> > memory, and I ran four cat commands at the same time on different
> > files.  I found that running them all from a single disk gave 380tps
> > (24M/s), running 4 on one drive and 4 on the other gave 200tps
> > (12M/s) for each drive, 400tps (24M/s) aggregate, and running them
> > on a 256k volume striped across the disks gave 100tps (6M/s) for
> > each drive, 200tps (12M/s) aggregate.
>
> Hmm.  The arithmetic at the end suggests that you only striped across
> 2 disks.  What kind of disks are they?  You'll run into significant
> contention problems with IDE, for example.  Also, what version of
> FreeBSD?

10k 18Gig Seagate disks, on an NCR 875 controller.  The disks by themselves
kick ass.  The disks both being used at the same time kick ass.  The disks
when used with vinum do not kick ass.

Again, I don't expect to double performance, but my experience did lead me
to believe we should have added 50% or so with the second disk, perhaps
more given the nature of our use.

Later,
scott


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Thu Nov 18  4:47: 9 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from akat.civ.cvut.cz (akat.civ.cvut.cz [147.32.235.105])
	by hub.freebsd.org (Postfix) with SMTP id C7A9215161
	for <freebsd-fs@FreeBSD.ORG>; Thu, 18 Nov 1999 04:46:50 -0800 (PST)
	(envelope-from pechy@hp735.cvut.cz)
Received: from localhost (pechy@localhost) by akat.civ.cvut.cz (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id NAA10878 for <freebsd-fs@FreeBSD.ORG>; Thu, 18 Nov 1999 13:46:49 +0100
Date: Thu, 18 Nov 1999 13:46:49 +0100
From: Jan Pechanec <pechy@hp735.cvut.cz>
X-Sender: pechy@akat.civ.cvut.cz
To: FreeBSD FS Mailing List <freebsd-fs@FreeBSD.ORG>
Subject: Unix International Stackable Files Working Group
Message-ID: <Pine.SGI.4.05.9911181344290.10849-100000@akat.civ.cvut.cz>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org


	Hello,

	in several papers on filesystems I found the reference to
${subj}. I spent quite enough time trying to find it through several
www search engines, but wasn't succesful. Please, does anybody have
more information on this group ?

	Thank you, Jan.

-- 
Jan PECHANEC (mailto:pechy@hp735.cvut.cz)
Computing Center CTU (Zikova 4, Praha 6, 166 35, Czech Republic)
www.civ.cvut.cz, pechy.civ.cvut.cz, tel: +420 2 24352969 (fax: 24310271)


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Thu Nov 18  5:18:57 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from mentisworks.com (valkery.mentisworks.com [207.227.89.226])
	by hub.freebsd.org (Postfix) with ESMTP id 3006A150FD
	for <freebsd-fs@freebsd.org>; Thu, 18 Nov 1999 05:18:48 -0800 (PST)
	(envelope-from nathank@mentisworks.com)
Received: from [24.29.197.186] (HELO mentisworks.com)
  by mentisworks.com (CommuniGate Pro SMTP 3.2b5)
  with ESMTP id 550005; Thu, 18 Nov 1999 07:18:44 -0600
Received: from [192.168.245.111] (HELO mentisworks.com)
  by mentisworks.com (CommuniGate Pro SMTP 3.2b5)
  with ESMTP id 1320010; Thu, 18 Nov 1999 07:18:47 -0600
Message-ID: <3833FC97.3224106@mentisworks.com>
Date: Thu, 18 Nov 1999 07:18:15 -0600
From: Nathan Kinsman <nathank@mentisworks.com>
X-Mailer: Mozilla 4.7 [en] (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: Jan Pechanec <pechy@hp735.cvut.cz>, freebsd-fs@freebsd.org
Subject: Re: Unix International Stackable Files Working Group
References: <Pine.SGI.4.05.9911181344290.10849-100000@akat.civ.cvut.cz>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org


I've seen a reference to this before:

Unix International Stackable Files Working Group, ``Requirements for
Stackable Files,'' Rev. 3.6, Feb. 1993
Unix Int'l., Parsippany, NJ.
^^^^^^^^^^   ^^^^^^^^^^^^^^

The organization is (was) a consortium including Sun, AT&T and others
formed to promote an open environment based on Unix System V, including
the Open Look windowing system.

-
Nathan Kinsman

Jan Pechanec wrote:
> 
>         Hello,
> 
>         in several papers on filesystems I found the reference to
> ${subj}. I spent quite enough time trying to find it through several
> www search engines, but wasn't succesful. Please, does anybody have
> more information on this group ?
> 
>         Thank you, Jan.
> 
> --
> Jan PECHANEC (mailto:pechy@hp735.cvut.cz)
> Computing Center CTU (Zikova 4, Praha 6, 166 35, Czech Republic)
> www.civ.cvut.cz, pechy.civ.cvut.cz, tel: +420 2 24352969 (fax: 24310271)
> 
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-fs" in the body of the message


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Thu Nov 18  6:32:30 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from ns1.yes.no (ns1.yes.no [195.204.136.10])
	by hub.freebsd.org (Postfix) with ESMTP id 84E6E1513B
	for <fs@FreeBSD.ORG>; Thu, 18 Nov 1999 06:32:22 -0800 (PST)
	(envelope-from eivind@bitbox.follo.net)
Received: from bitbox.follo.net (bitbox.follo.net [195.204.143.218])
	by ns1.yes.no (8.9.3/8.9.3) with ESMTP id PAA05340;
	Thu, 18 Nov 1999 15:32:21 +0100 (CET)
Received: (from eivind@localhost)
	by bitbox.follo.net (8.8.8/8.8.6) id PAA62682;
	Thu, 18 Nov 1999 15:32:20 +0100 (MET)
Date: Thu, 18 Nov 1999 15:32:20 +0100
From: Eivind Eklund <eivind@FreeBSD.ORG>
To: Erez Zadok <ezk@cs.columbia.edu>
Cc: fs@FreeBSD.ORG
Subject: Re: namei() and freeing componentnames
Message-ID: <19991118153220.E45524@bitbox.follo.net>
References: <19991112000359.A256@bitbox.follo.net> <199911152312.SAA21891@shekel.mcl.cs.columbia.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 1.0i
In-Reply-To: <199911152312.SAA21891@shekel.mcl.cs.columbia.edu>; from ezk@cs.columbia.edu on Mon, Nov 15, 1999 at 06:12:09PM -0500
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

[Note to impatient readers - forward view if included at the bottom of
this mail]

On Mon, Nov 15, 1999 at 06:12:09PM -0500, Erez Zadok wrote:
> In message <19991112000359.A256@bitbox.follo.net>, Eivind Eklund writes:
> [...]
> > I suspect that for some filesystems (though none of the present ones),
> > it might be necessary to do more than a
> > zfree(namei_zone,cnp->cn_pnbuf) in order to free up all the relevant
> > data.  In order to support this, we'd have to introduce a new VOP -
> > tentatively called VOP_RELEASEND().  Unfortunately, this comes with a
> > performance penalty.
> 
> Will VOP_RELEASEND be able to call a filesystem-specific routine?  I think
> it should be flexible enough.

All VOPs are filesystem specific (or can be, at least).

>  I can imagine that the VFS will call a (stackable) filesystem's
> vop_releasend(), and that stackable f/s can call a number of those
> on the lower level filesystem(s) it stacked on (there could be more
> than one, namely fan-out f/s).

Yes, this is the intent.

The problem I'm finding with VOP_RELEASEND() is that namei() can
return two different vps - the dvp (directory vp) and the actual vp
(inside the directory dvp points at), and that neither of these are
always available.

As I am writing the code right now, I am using either of these, with a
preference for the dvp.  I am considering splitting VOP_RELEASEND()
into VOP_RELEASEND() and VOP_DRELEASEND(), which takes the different
VPs as parameters - this will at least give something that is easy to
search for if we need to change the behaviour somehow.


> [...]
> > This is somewhat vile, but has the advantage of keeping the code ready
> > for the real VOP_RELEASEND(), and not loosing performance until we
> > actually get some benefit out of it.
> [...]
> > Eivind.
> 
> WRT performance, I suggest that if possible, we #ifdef all of the stacking
> code and fixes that have a non-insignificant performance impact.

Nothing I'm so far positive we will need have a significant
performance impact.  I'm not sure the performance impact for
VOP_RELEASEND() will be significant, either - it is just that I would
like to avoid having performance impact without gain, and for this
particular case I'm not positive we will ever need it - but I'm not
positive we won't, either.  This is why I am trying to do the code in
a way that let us move to having it quickly, but do not force us to
live with the penalites if it turns out we do not need it.

> Sure, performance is important, but not at the cost of functionality
> (IMHO).  Not all users would need stacking, so they can choose not
> to turn on the relevant kernel #define and thus get maximum
> performance.  Those who do want any stacking will have to pay a
> certain performance overhead.

I hope to make stacking layers really light weight ("featherweight
stacking"), and believe it will make sense to use it internally in the
kernel organization.  If this turns out to be right, everybody will
have to have them.

> Of course, there's also an argument against too much #ifdef'ed code,
> b/c it makes maintenance more difficult.

For some of the things I am doing now (e.g, the WILLRELE fixes),
ifdef'ing would be a royal pain, making it extremely hard to read the
code.

> I think we should realize that there would be no way to fix the VFS w/o
> impacting performance.

Actually, I am reasonably confident that we can do the fixes without
impacting performance noticably.


> Rather than implement temporary fixes that avoid "hurting"
> performance, we can (1) conditionalize that code, (2) get it working
> *correctly* first, then (3) optimize it as needed, and (4) finally,
> turn it on by default, possibly removing the non-stacking code.

What I am doing now is done more or less by these principles - though
instead of conditionalizing code I do not know if we will need, I make
it very easy to write it if it turns out we will need it.


Progress report: Based on current rate of progress, it looks like I'll
be able to have patches ready for (my personal) testing sunday (or
*possibly* saturday, but most likely not).  Depending on how
testing/debugging works out, the patches will most likely be ready for
public testing sometime next week.  I'll need help with NFS testing.


Forward view: I'm undecided on the next step.  Possibilities:
(1) Change the way locking is specificied to make it feasible to test
    locking patches properly, and change the assertion generation to
    generate better assertions.  This will probably require changing
    VOP_ISLOCKED() to be able to take a process parameter, and return
    different valued based on wether an exlusive lock is held by that
    process or by another process.  The present behaviour will be
    available by passing NULL for this parameter.

    Presently, running multiple processes does not work properly, as
    the assertions do not really assert the right things.

    These changes are necessary to properly debug the use of locks,
    which I again believe is necessary for stacking layers (which I
    would like to work in 4.0, but I don't know if I will be able to
    have ready).

(2) Change the behaviour of VOP_LOOKUP() to "eat as much as you can,
    and return how much that was" rather than "Eat a single path
    component; we have already decided what this is."
    This allows different types of namespaces, and it allows
    optimizations in VOP_LOOKUP() when several steps in the traversal
    is inside a single filesystem (and hey - who mounts a
    new filesystem on every directory they see, anyway?)

    This change is rather small, and it would be nice to have in 4.0
    (I want the VFS differences from 4.0 to 5.0 to be as small as
    possible).
    It is pretty orthogonal to stacking layers; stacking layers gain
    the same capabilities as other file systems from it.

Eivind.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Thu Nov 18  9:26:20 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from smtp05.primenet.com (smtp05.primenet.com [206.165.6.135])
	by hub.freebsd.org (Postfix) with ESMTP id 7E70F1512A
	for <freebsd-fs@FreeBSD.ORG>; Thu, 18 Nov 1999 09:25:58 -0800 (PST)
	(envelope-from tlambert@usr02.primenet.com)
Received: (from daemon@localhost)
	by smtp05.primenet.com (8.9.3/8.9.3) id KAA13267;
	Thu, 18 Nov 1999 10:25:33 -0700 (MST)
Received: from usr02.primenet.com(206.165.6.202)
 via SMTP by smtp05.primenet.com, id smtpdAAAsEaG3z; Thu Nov 18 10:25:29 1999
Received: (from tlambert@localhost)
	by usr02.primenet.com (8.8.5/8.8.5) id KAA14781;
	Thu, 18 Nov 1999 10:25:43 -0700 (MST)
From: Terry Lambert <tlambert@primenet.com>
Message-Id: <199911181725.KAA14781@usr02.primenet.com>
Subject: Re: Unix International Stackable Files Working Group
To: pechy@hp735.cvut.cz (Jan Pechanec)
Date: Thu, 18 Nov 1999 17:25:43 +0000 (GMT)
Cc: freebsd-fs@FreeBSD.ORG
In-Reply-To: <Pine.SGI.4.05.9911181344290.10849-100000@akat.civ.cvut.cz> from "Jan Pechanec" at Nov 18, 99 01:46:49 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> 	Hello,
> 
> 	in several papers on filesystems I found the reference to
> ${subj}. I spent quite enough time trying to find it through several
> www search engines, but wasn't succesful. Please, does anybody have
> more information on this group ?


I saved nearly the entire UNIX International FTP archive when
UI went out of business, including their TET, ETET, System Admin,
DWARF, and Draft SPEC 1170 documents.

They are currently archive at DigiBoard.


Unfortunately, I didn't save everything, but I'm pretty sure that
was one of the things I saved.  If not, I know who had the machine
in their physical posession after they went under, but I'm pretty
sure it has been scrapped by now, as that person was not very
much like me (I have been described as "the net.packrat").


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Thu Nov 18 10:27:36 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from cs.columbia.edu (cs.columbia.edu [128.59.16.20])
	by hub.freebsd.org (Postfix) with ESMTP id F2FD415476
	for <freebsd-fs@FreeBSD.ORG>; Thu, 18 Nov 1999 10:27:27 -0800 (PST)
	(envelope-from ezk@shekel.mcl.cs.columbia.edu)
Received: from shekel.mcl.cs.columbia.edu (shekel.mcl.cs.columbia.edu [128.59.18.15])
	by cs.columbia.edu (8.9.1/8.9.1) with ESMTP id NAA25492;
	Thu, 18 Nov 1999 13:27:24 -0500 (EST)
Received: (from ezk@localhost)
	by shekel.mcl.cs.columbia.edu (8.9.1/8.9.1) id NAA27811;
	Thu, 18 Nov 1999 13:27:24 -0500 (EST)
Date: Thu, 18 Nov 1999 13:27:24 -0500 (EST)
Message-Id: <199911181827.NAA27811@shekel.mcl.cs.columbia.edu>
X-Authentication-Warning: shekel.mcl.cs.columbia.edu: ezk set sender to ezk@shekel.mcl.cs.columbia.edu using -f
From: Erez Zadok <ezk@cs.columbia.edu>
To: Jan Pechanec <pechy@hp735.cvut.cz>
Cc: FreeBSD FS Mailing List <freebsd-fs@FreeBSD.ORG>
Subject: Re: Unix International Stackable Files Working Group 
In-reply-to: Your message of "Thu, 18 Nov 1999 13:46:49 +0100."
             <Pine.SGI.4.05.9911181344290.10849-100000@akat.civ.cvut.cz> 
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

In message <Pine.SGI.4.05.9911181344290.10849-100000@akat.civ.cvut.cz>, Jan Pechanec writes:
> 
> 	Hello,
> 
> 	in several papers on filesystems I found the reference to
> ${subj}. I spent quite enough time trying to find it through several
> www search engines, but wasn't succesful. Please, does anybody have
> more information on this group ?

It's dead Jan!

:-)

> 	Thank you, Jan.

I have Rosenthal's 6-page 'requirements' paper, which was produced under UI.
It was difficult to get it, but eventually I got a copy from the man
himself.  See

	ftp://shekel.mcl.cs.columbia.edu/pub/ezk/requirements.ps

If you're looking for other papers re: stacking, I probably have all of
them.

> Jan PECHANEC (mailto:pechy@hp735.cvut.cz)
> Computing Center CTU (Zikova 4, Praha 6, 166 35, Czech Republic)
> www.civ.cvut.cz, pechy.civ.cvut.cz, tel: +420 2 24352969 (fax: 24310271)
> 
> 
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-fs" in the body of the message

Erez.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Thu Nov 18 15:20:50 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from cs.columbia.edu (cs.columbia.edu [128.59.16.20])
	by hub.freebsd.org (Postfix) with ESMTP
	id 8BCD81508E; Thu, 18 Nov 1999 15:20:45 -0800 (PST)
	(envelope-from ezk@shekel.mcl.cs.columbia.edu)
Received: from shekel.mcl.cs.columbia.edu (shekel.mcl.cs.columbia.edu [128.59.18.15])
	by cs.columbia.edu (8.9.1/8.9.1) with ESMTP id SAA29976;
	Thu, 18 Nov 1999 18:20:44 -0500 (EST)
Received: (from ezk@localhost)
	by shekel.mcl.cs.columbia.edu (8.9.1/8.9.1) id SAA15756;
	Thu, 18 Nov 1999 18:20:43 -0500 (EST)
Date: Thu, 18 Nov 1999 18:20:43 -0500 (EST)
Message-Id: <199911182320.SAA15756@shekel.mcl.cs.columbia.edu>
X-Authentication-Warning: shekel.mcl.cs.columbia.edu: ezk set sender to ezk@shekel.mcl.cs.columbia.edu using -f
From: Erez Zadok <ezk@cs.columbia.edu>
To: Eivind Eklund <eivind@FreeBSD.ORG>
Cc: Erez Zadok <ezk@cs.columbia.edu>, fs@FreeBSD.ORG
Subject: Re: namei() and freeing componentnames 
In-reply-to: Your message of "Thu, 18 Nov 1999 15:32:20 +0100."
             <19991118153220.E45524@bitbox.follo.net> 
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

In message <19991118153220.E45524@bitbox.follo.net>, Eivind Eklund writes:
> [Note to impatient readers - forward view if included at the bottom of
> this mail]
> 
> On Mon, Nov 15, 1999 at 06:12:09PM -0500, Erez Zadok wrote:
> > In message <19991112000359.A256@bitbox.follo.net>, Eivind Eklund writes:
[...]
> The problem I'm finding with VOP_RELEASEND() is that namei() can
> return two different vps - the dvp (directory vp) and the actual vp
> (inside the directory dvp points at), and that neither of these are
> always available.
> 
> As I am writing the code right now, I am using either of these, with a
> preference for the dvp.  I am considering splitting VOP_RELEASEND()
> into VOP_RELEASEND() and VOP_DRELEASEND(), which takes the different
> VPs as parameters - this will at least give something that is easy to
> search for if we need to change the behaviour somehow.

I found similar "annoying" functionality in Solaris's open() routine.
Sometimes it can return a new dvp, sometimes NULL, and sometimes a copy or
reference to another vnode (I think due to dup() stuff).

From my POV, after having ported stackable templates to several OSs, I found
out that vnode/vfs functions that try to do too much make the life of a
stackable f/s developer harder.  Also, functions that behave differently
under different (input) conditions also make it hard to work with.  The
reason is that stackable file systems have to be layer-independent.  This
means that they have to treat the file system on which they stacked as if
they were the VFS calling that layer, and at the same time they must appear
to the VFS as a low-level f/s.  IOW, a stackable f/s is both a VFS and a
lower-level f/s, and thus have to simulate and act as both.  So whatever
behavior your VFS has before it calls a VOP_* must be simulated accurately
inside the stackable f/s before it calls the lower one.  It is easier to
achieve that when vnode/vfs functions are smaller, simpler, and behave the
same always.

So, I would say that if you think splitting VOP_RELEASEND in two would make
things simpler, go for it here and everywhere else.  The lesson learned from
the Linux vfs (rapid :-) evolution is a good one: after adding more and more
inode/file/dentry/super_block functions, and making them relatively small
and simple, they found ways to push some of that functionality up to the
VFS.

[...]
> Actually, I am reasonably confident that we can do the fixes without
> impacting performance noticably.

That's great!

[...]
> Forward view: I'm undecided on the next step.  Possibilities:
> (1) Change the way locking is specificied to make it feasible to test
>     locking patches properly, and change the assertion generation to
>     generate better assertions.  This will probably require changing

I'm not sure I understand what you mean by assertion generation.

>     VOP_ISLOCKED() to be able to take a process parameter, and return
>     different valued based on wether an exlusive lock is held by that
>     process or by another process.  The present behaviour will be
>     available by passing NULL for this parameter.
> 
>     Presently, running multiple processes does not work properly, as
>     the assertions do not really assert the right things.
> 
>     These changes are necessary to properly debug the use of locks,
>     which I again believe is necessary for stacking layers (which I
>     would like to work in 4.0, but I don't know if I will be able to
>     have ready).

Locks are probably one of the most frustrating things I've had to deal with,
b/c you're rarely told whether the objects passed to you are already locked,
allocated, and if their reference count has been updated, and what, if any,
you have to do with all of these.  FreeBSD is very nice by documenting most
of these conventions in the vnode_if.src file, but Solaris and Linux don't.
I've had to implement a strict un/locking order in my wrapfs templates, to
avoid deadlocks.  Some of that code is so hairy that I dread each time the
(linux) vfs changes and I've got to touch my locking code; that's a sure way
to waste several days debugging that.

Deciding on proper locking is difficult.  In linux, for example, they had
most locking done in the VFS; sounds great at first b/c f/s code doesn't
have to worry about locking objects.  But they found out that to get better
SMP performance, each f/s would have to do its own locking, and so they
pushed some of the locking to be the f/s responsibility.

Locking seems to be stuff that happens all over: part in the VFS, part in
the VM/buffercache, and part inside file systems.  Is there a way to make
locking an explicit part of the vnode interface?  Is there a way to keep
locking in the VFS by default (for simplicity), but allow those f/s that
want to, manage their own locks?  How messy and maintainable such code would
be?

I guess what I'm arguing is for interface flexibility, so we don't have to
revise it again any time soon.

Eivind, if you haven't recently, I suggest you look at some of the stacking
papers (Rosenthal's UI paper, Heidemann, Popek, Skinner/Wong, etc.)
Rosenthal's "requirements" paper succinctly described several important
issues, including atomicity of multi-vnode operations.  Rosenthal suggested
that kernels should have a full-transaction engine, which I think is
eventually necessary, but it's very complex to put it.  The next best thing
is to do some form of safe locking.  Normally each vnode/inode has its own
lock.  Imagine a replicated stackable f/s (replicfs) with fan-out of 3.  So
vnode (V0) at the level of "replicfs" would have access to three
lower-vnodes (V1, V2, V3).  If you want to make a change (say create a file)
in V0, you have to lock V0-V4 at once.  Without vfs support for this,
replicfs would have to enforce ordered locking (such as I've done in wrapfs)
and hope for the best.  If the vfs is smarter, it can help replicfs lock all
4 vnodes at once; or the vfs can allow replicfs to control the locks below
it, and all the vfs has to do is ensure that no one else can lock V1-V3.

I don't have a good answer to this locking issue.  The papers I've cited
describe changes to the vnode interface that simplify locking.  One way they
do that is having only one lock per chain (or stack, or DAG) of stacked file
systems.  So for example, a DAG of stackable f/s is represented by one data
structure that contains locks and other things that are true about the whole
DAG, and then smaller data structures for each node/leaf of the DAG,
containing stuff that's true about that vnode (e.g., operations vector).

> (2) Change the behaviour of VOP_LOOKUP() to "eat as much as you can,
>     and return how much that was" rather than "Eat a single path
>     component; we have already decided what this is."
>     This allows different types of namespaces, and it allows
>     optimizations in VOP_LOOKUP() when several steps in the traversal
>     is inside a single filesystem (and hey - who mounts a
>     new filesystem on every directory they see, anyway?)
> 
>     This change is rather small, and it would be nice to have in 4.0
>     (I want the VFS differences from 4.0 to 5.0 to be as small as
>     possible).
>     It is pretty orthogonal to stacking layers; stacking layers gain
>     the same capabilities as other file systems from it.

Multi-component lookup has always been desirable.  There's one paper by
Duchamp (USENIX '94) on multi-component look in NFS.  I think we should
allow for multi-component lookup as well as the old style "one component at
a time" lookup.  I would argue that the default should still be the old
style.  Someone might want to write a stackable f/s that does special things
as it traverses the pathname of each component.  For example a general
purpose unionfs (one which uses fan-out, unlike the single-stack design in
bsd-4.4) might follow into different underlying directories as it looks up
single components; unionfs has all kinds of interesting semantic issues that
would require more flexibility at lookup time.

Lookup is fairly complex as it is.  If you're going to add multi-component
lookup, then maybe it should be a new vop?  If not a new vop, then make sure
it's added to the current vop_lookup such that a f/s has enough flexibility
to control the type of lookup it wants.  Also, it would be nice if the type
of lookup used can be controlled dynamically by the f/s itself (as opposed
to, say, a mount() flag that sets the lookup type for the duration of the
mount).

> Eivind.

Cheers,
Erez.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Thu Nov 18 15:35: 4 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from excalibur.lps.ens.fr (excalibur.lps.ens.fr [129.199.120.3])
	by hub.freebsd.org (Postfix) with ESMTP
	id C95BD1553D; Thu, 18 Nov 1999 15:34:55 -0800 (PST)
	(envelope-from Thierry.Besancon@lps.ens.fr)
Received: 
          by excalibur.lps.ens.fr (8.9.3/jtpda-5.3.1) id AAA25614
          ; Fri, 19 Nov 1999 00:34:53 +0100 (MET)
Message-Id: <199911182334.AAA25614@excalibur.lps.ens.fr>
From: Thierry.Besancon@lps.ens.fr (Thierry Besancon)
Date: Fri, 19 Nov 1999 00:34:53 +0000
X-Mailer: Mail User's Shell (7.2.5 10/14/92)
To: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org
Subject: crash in ffs_vptofh on diskless workstation
Cc: dillon@freebsd.org, Ollivier.Robert@eurocontrol.fr,
	besancon@lps.ens.fr, Joel.Marchand@polytechnique.fr,
	Pierre.David@prism.uvsq.fr
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

	Hello 

I'm trying to build new X terminals for my lab.
To do so I use FreeBSD 3.3-RELEASE.

The X terminal is a diskless PC with 64 Mo of ram. It perfectly boots 
and I can launch the X server perfectly. Everything just runs fine.

Except for one little piece of thing.

As i wanted to make use of the floppy drive, I gave a look at floppyd
part of mtools package. It implements what I want. While running the
daemon, I encountered a problem. So I went debugging the C code of it.
And so i found a bug in FreeBSD (?!).

Here's the df of the diskless X terminal (i kept the ssh port in order
to remotely connect and be able to look at the problem of floppyd) :

Filesystem                 1K-blocks     Used    Avail Capacity  Mounted on
129.199.120.250:/             127023    31651    85211    27%    /
mfs:29                           959      668      215    76%    /conf/etc
/conf/etc                        959      668      215    76%    /etc
129.199.120.250:/usr          190543   153042    22258    87%    /usr
129.199.120.250:/usr/local   2846396  1958786   659899    75%    /usr/local
mfs:61                          3935     1431     2190    40%    /var
/var/tmp                        3935     1431     2190    40%    /tmp
mfs:91                          1511       47     1344     3%    /dev

It's the classical way FreeBSD 3.3 seems to make diskless run.
The root filesystem is mounted through NFS and memory filesystems 
are created to store the live logs of the system.

The mounts are read-only.

The X terminal runs without any swap.
/etc/rc.sysctl confirms it as well :
	sysctl -w vm.swap_enabled=0

The bug is just that when launching any executable residing in my
mfs /tmp, it justs hangs the kernel. 

# cp /bin/ls /tmp
# df /tmp/.
Filesystem  1K-blocks     Used    Avail Capacity  Mounted on
/var/tmp         3935     1432     2189    40%    /tmp
# /tmp/ls
(workstation freezes)

Here's the panic :

Fatal trap 12 : page fault while in kernel mode
fault virtual address = 0x3e
fault code            = supervisor read, page not present
instruction pointer   = 0x8:0xc022bf14
stack pointer         = 0x10:0xc4546bc8
frame pointer         = 0x10:0xc4546ca4
code segment          = base 0x0, list 0xfffff, type 0x1b
                      = DPL 0, pres 1, def32 1, gran 1
precessor eflags      = interrupt disabled, resume, IOPL = 0
current process       = 355 (csh)
interrupt mask        = net tty bio cam
kernel : type 12 trap, code = 0
Stopped at ffs_vptofh+0xfe0: cmpw $0x2,0x3e(%edx)

and the trace :

db> trace
ffs_vptofh(c4546d5c,c4514300,1000,0,c4546cf4) at ffs_vptofh+0xfe0
end(c4546d5c) at 0xc087c485
vnode_pager_freepage(c4559a2c,c4546db8,1,0,c4546df8) at vnode_pager_freepage+0x556
vm_pager_get_pages(c4559a2c,c4546db8,1,0,c4546f18) at vm_pager_get_pages+0x1f
exec_map_first_page(c4546e94,c44c55a8,c02fe464,0,4) at exec_map_first_page+0xba
execve(c44c55a0,c4546f94,80922e0,80940000,8085000) at execve+0x19e
syscall(27,27,8085000,8094000,bfbffbb0) at syscall+0x187
Xint0x80_syscall() at Xint0x80_syscall+0x2c

(not too deep)

Given I have no swap, it is not easy to supply vmcore.
But I can provide any help as I can reproduce the crash at will.

If someone has a clue on how to fix that...

	Thierry


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Thu Nov 18 20:38:56 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from nomis.simon-shapiro.org (nomis.simon-shapiro.org [209.86.126.163])
	by hub.freebsd.org (Postfix) with SMTP id 1ABA2155BF
	for <freebsd-fs@FreeBSD.ORG>; Thu, 18 Nov 1999 20:38:53 -0800 (PST)
	(envelope-from shimon@simon-shapiro.org)
Received: (qmail 99725 invoked from network); 19 Nov 1999 04:38:52 -0000
Received: from localhost.simon-shapiro.org (HELO simon-shapiro.org) (127.0.0.1)
  by localhost.simon-shapiro.org with SMTP; 19 Nov 1999 04:38:52 -0000
Message-ID: <3834D45C.1F963B3B@simon-shapiro.org>
Date: Thu, 18 Nov 1999 23:38:52 -0500
From: Simon Shapiro <shimon@simon-shapiro.org>
Organization: Simon's Garage
X-Mailer: Mozilla 4.6 [en] (X11; I; FreeBSD 3.3-STABLE i386)
X-Accept-Language: en-US
MIME-Version: 1.0
To: Bernd Walter <ticso@cicely.de>
Cc: Mattias Pantzare <pantzer@ludd.luth.se>, freebsd-fs@FreeBSD.ORG
Subject: Re: RAID-5 and failure
References: <ticso@cicely.de> <199911061716.SAA20783@zed.ludd.luth.se> <19991106183316.A9420@cicely7.cicely.de>
Content-Type: text/plain; charset=
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Bernd Walter wrote:
> 
> On Sat, Nov 06, 1999 at 06:16:47PM +0100, Mattias Pantzare wrote:
> > > On Sat, Nov 06, 1999 at 04:58:55PM +0100, Mattias Pantzare wrote:
> > > > What hapens if the data part of a write to a RAID-5 plex completes but not the
> > > > parity part (or the other way)?
> > > >
> > > The parity is not in sync - what else?
> >
> > The system could detect it and recalculate the parity. Or give a warning to
> > the user so the user knows that the data is not safe.
> 
> That's not possible because you need to write more then a single sector to keep
> parity in sync which is not atomic.
> 
> In case one of the writes fail vinum will do everything needed to work with it
> and to inform the user.
> Vinum will take the subdisk down because such drives should work with
> write reallocation enabled and such a disk is badly broken if you receive a
> write error.
> 
> If the system panics or power fails between such a write there is no way to
> find out if the parity is broken beside verifying the complete plex after
> reboot - the problem should be the same with all usual hard and software
> solutions - greg already begun or finished recalculating and checking the
> parity.
> I asume that's the reason why some systems use 520 byte sectors - maybe they
> write timestamps or generationnumbers in a single write within the sector.

528.  512 data, 16 ECC for the sector.  Nothing to do with RAID.

> 
> --
> B.Walter                  COSMO-Project              http://www.cosmo-project.de
> ticso@cicely.de             Usergroup                info@cosmo-project.de
> 
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-fs" in the body of the message

-- 


Sincerely Yours,                 Shimon@Simon-Shapiro.ORG
                                             404.664.6401
Simon Shapiro

Unwritten code has no bugs and executes at twice the speed of mouth


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Fri Nov 19  7:18: 2 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from mojave.sitaranetworks.com (mojave.sitaranetworks.com [199.103.141.157])
	by hub.freebsd.org (Postfix) with ESMTP id 2E1081563C
	for <freebsd-fs@FreeBSD.ORG>; Fri, 19 Nov 1999 07:17:59 -0800 (PST)
	(envelope-from grog@mojave.sitaranetworks.com)
Message-ID: <19991119101720.35872@mojave.sitaranetworks.com>
Date: Fri, 19 Nov 1999 10:17:20 -0500
From: Greg Lehey <grog@mojave.sitaranetworks.com>
To: Simon Shapiro <shimon@simon-shapiro.org>,
	Bernd Walter <ticso@cicely.de>
Cc: Mattias Pantzare <pantzer@ludd.luth.se>, freebsd-fs@FreeBSD.ORG
Subject: Re: RAID-5 and failure
Reply-To: Greg Lehey <grog@lemis.com>
References: <ticso@cicely.de> <199911061716.SAA20783@zed.ludd.luth.se> <19991106183316.A9420@cicely7.cicely.de> <3834D45C.1F963B3B@simon-shapiro.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
In-Reply-To: <3834D45C.1F963B3B@simon-shapiro.org>; from Simon Shapiro on Thu, Nov 18, 1999 at 11:38:52PM -0500
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Thursday, 18 November 1999 at 23:38:52 -0500, Simon Shapiro wrote:
> Bernd Walter wrote:
>>
>> I asume that's the reason why some systems use 520 byte sectors - maybe they
>> write timestamps or generationnumbers in a single write within the sector.
>
> 528.  512 data, 16 ECC for the sector.  Nothing to do with RAID.

There are various sizes.  I've had surplus disks with 516 and 520 byte
sectors.  But yes, they're usually under hardware control.

Greg
--
Finger grog@lemis.com for PGP public key
See complete headers for address and phone numbers


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Fri Nov 19  8:33:43 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from mail.tvol.com (mail.wgate.com [38.219.83.4])
	by hub.freebsd.org (Postfix) with ESMTP id 2406D15673
	for <freebsd-fs@FreeBSD.ORG>; Fri, 19 Nov 1999 08:33:34 -0800 (PST)
	(envelope-from rjesup@wgate.com)
Received: from jesup.eng.tvol.net (jesup.eng.tvol.net [10.32.2.26]) by mail.tvol.com (8.8.8/8.8.3) with ESMTP id LAA14056 for <freebsd-fs@FreeBSD.ORG>; Fri, 19 Nov 1999 11:30:48 -0500 (EST)
Reply-To: Randell Jesup <rjesup@wgate.com>
To: freebsd-fs@FreeBSD.ORG
Subject: Re: RAID-5 and failure
References: <ticso@cicely.de> <199911061716.SAA20783@zed.ludd.luth.se> <19991106183316.A9420@cicely7.cicely.de> <19991113213325.57908@mojave.sitaranetworks.com> <ybuk8nis6hm.fsf@jesup.eng.tvol.net.jesup.eng.tvol.net> <19991116204916.44107@mojave.sitaranetworks.com>
From: Randell Jesup <rjesup@wgate.com>
Date: 19 Nov 1999 11:33:58 -0500
In-Reply-To: Greg Lehey's message of "Tue, 16 Nov 1999 20:49:16 -0500"
Message-ID: <ybuso22qw3t.fsf@jesup.eng.tvol.net.jesup.eng.tvol.net>
X-Mailer: Gnus v5.6.43/Emacs 20.4
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Greg Lehey <grog@mojave.sitaranetworks.com> writes:
>> 	When the disks come back up (dirty), check all the parity.
>> The stripe that was being written will fail to check.  In case 4, the data
>> and parity are wrong, and in case 5, just the parity, but you don't know
>> which.  If you handle case 4, you can handle case 5 the same way.
>> Obviously you've had a write failure, but usually the FS can deal with
>> that possibility (with the chance of lost data, true).  Some form of
>> information passed out about what sector(s) were trashed might be useful
>> in recovery if you're not using default UFS/fsck.
>
>Well, you're still left with the dilemma.  Worse, this check makes
>fsck look like an instantaneous operation: you have to read the entire
>contents of every disk.  For a 500 GB database spread across 3 LVD
>controllers, you're looking at several hours.

	True.  Not that it may matter, but you could have dirty flags for
each cylinder group (or whatever).  This both adds locality (shorter seeks)
and reduces the amount needed to recheck.  If an area hasn't been written
to 'recently', the dirty flag for the area gets rewritten to clean.  This
allows you to keep the amount of the disk that needs to be reread on a
crash down to a very manageable level.  Tuning the size of the groups
covered by a flag and the timeout to rewrite a flag to clean would take a
little work.

>> 	If it checks, then the data was all written before any crash,
>> and all is fine.
>
>That's the simple case.

	That's certainly true.

>> 	So the biggest trick here is recognizing the fact that the system
>> crashed.  You could reserve a block (or set of blocks scattered about) on
>> each drive for dirty flags, and only mark a disk clean if it hasn't had
>> writes in <some configurable amount of time>.  This keeps the write
>> overhead down without requiring NVRAM.  There are other evil tricks: with
>> SCSI, you might be able to change some innocuous mode parameter and use
>> it as a dirty flag, though this probably has at least as much overhead
>> as reserving a dirty-flag block.  And of course if you have NVRAM, store
>> the dirty bit there.  Hmmmmm.  Maybe in the PC's clock chip - they
>> generally have several bits of NVRAM.....  (On the Amiga we used those
>> bits for storing things like SCSI Id, boot spinup delay, etc.)
>>
>> 	Alternatively, you could hide the dirty flag at a higher semantic
>> level, by (at the OS level) recognizing a system that wasn't shut down
>> properly and invoking the vinum re-synchronizer.  So long as the sectors
>> with problems aren't needed to boot the kernel and recognize this that will
>> work.
>
>Basically, the way I see it, we have three options:
>
>1.  Disks never crash, and anyway, we don't write to them.  Ignore the
>    problem and deal with it if it comes to bite us.
>
>2.  Get an NVRAM board and use it for this purpose.

	How much is commonly stored in nvram boards for raid?  If it's
merely the location of the write, _maybe_ clock-chip memory might work
(if writing to it that often doesn't slow down the system - I don't
remember how fast the interface is).  If it's the entire sector, well then
we're screwed without it or #3 - or rather we could have a corrupted
stripe after a crash.  Oh well.

>3.  Bite the bullet and write intention logs before each write.
>    VERITAS has this as an option.

	Probably worthwhile.

>These options don't have to be mutually exclusive.  It's quite
>possible to implement both ((1) doesn't need implementation :-) and
>leave it to the user to decide which to use.

	Quite so.

	BTW, I assume I'm correct in assuming that vinum normally works on
drives with write-behind disabled...

>> 	At the cost of performance, you could use some bytes of each sector
>> for generation numbers, and know in case 5 that the data is correct.
>> Obviously case 4 will still fail.
>
>No, the way things work, this would be very expensive.  We'd have to
>move the data to a larger buffer and set the flags, and it would also
>require at least reformatting the drive, assuming it's possible to set
>a different sector.  There are better ways to do this.

	Well, I was assuming you'd use some bytes from the existing
sectorsize (such as 511 bytes of user data per sector, 1 byte of
generation).  We're talking lots of extra CPU overhead on read or write,
however, to transfer the data into alternative buffers before write and to
invert that on read - not to mention that higher-level code tends to be
inflexible in regard to sector sizes being powers of two (or multiples of
512 for that matter).  Does vinum do any transfers of user data into
alternative buffers before posting it's writes, or does it just use
gather/scatter lists?

-- 
Randell Jesup, Worldgate Communications, ex-Scala, ex-Amiga OS team ('88-94)
rjesup@wgate.com
CDA II has been passed and signed, sigh.  The lawsuit has been filed.  Please
support the organizations fighting it - ACLU, EFF, CDT, etc.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Fri Nov 19 10:10:17 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from uni4nn.gn.iaf.nl (osmium.gn.iaf.nl [193.67.144.12])
	by hub.freebsd.org (Postfix) with ESMTP id B4E5F156F7
	for <freebsd-fs@FreeBSD.ORG>; Fri, 19 Nov 1999 10:10:01 -0800 (PST)
	(envelope-from wilko@yedi.iaf.nl)
Received: from yedi.iaf.nl (uucp@localhost)
	  by uni4nn.gn.iaf.nl (8.9.2/8.9.2) with UUCP id SAA32117;
	  Fri, 19 Nov 1999 18:55:35 +0100 (MET)
Received: (from wilko@localhost)
	by yedi.iaf.nl (8.9.3/8.9.3) id SAA54691;
	Fri, 19 Nov 1999 18:50:59 +0100 (CET)
	(envelope-from wilko)
From: Wilko Bulte <wilko@yedi.iaf.nl>
Message-Id: <199911191750.SAA54691@yedi.iaf.nl>
Subject: Re: RAID-5 and failure
In-Reply-To: <ybuso22qw3t.fsf@jesup.eng.tvol.net.jesup.eng.tvol.net> from Randell Jesup at "Nov 19, 1999 11:33:58 am"
To: rjesup@wgate.com
Date: Fri, 19 Nov 1999 18:50:59 +0100 (CET)
Cc: freebsd-fs@FreeBSD.ORG
X-Organisation: Private FreeBSD site - Arnhem, The Netherlands
X-pgp-info: PGP public key at 'finger wilko@freefall.freebsd.org'
X-Mailer: ELM [version 2.4ME+ PL43 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

As Randell Jesup wrote ...
> Greg Lehey <grog@mojave.sitaranetworks.com> writes:

[...]

> >2.  Get an NVRAM board and use it for this purpose.
> 
> 	How much is commonly stored in nvram boards for raid?  If it's
> merely the location of the write, _maybe_ clock-chip memory might work
> (if writing to it that often doesn't slow down the system - I don't
> remember how fast the interface is).  If it's the entire sector, well then
> we're screwed without it or #3 - or rather we could have a corrupted
> stripe after a crash.  Oh well.

Well, I can tell you that the HSx DEC ^H^H^H Compaq controllers use the
battery backup-ed writeback cache for this purpose. These are anything from
32 to 512Mb per controllers. Controllers generally are used in redundant 
pairs, each with their own cache module, each cachemodule with it's own
backup battery. To avoid the potential for datacorruption when a cache
module fails they can be setup to run in mirrored cache mode.

Price? I'm pretty sure you don't want to know ;-) The SCSI variants work
fine on FreeBSD BTW. I have yet to try the Fibrechannel boxes. I lack
a host adapter that FreeBSD has a driver for.

Wilko
-- 
|   / o / /  _  	Arnhem, The Netherlands	  - Powered by FreeBSD -
|/|/ / / /( (_) Bulte 	WWW : http://www.tcja.nl  http://www.freebsd.org


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Fri Nov 19 10:11: 1 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from uni4nn.gn.iaf.nl (osmium.gn.iaf.nl [193.67.144.12])
	by hub.freebsd.org (Postfix) with ESMTP id 607D415732
	for <freebsd-fs@FreeBSD.ORG>; Fri, 19 Nov 1999 10:10:42 -0800 (PST)
	(envelope-from wilko@yedi.iaf.nl)
Received: from yedi.iaf.nl (uucp@localhost)
	  by uni4nn.gn.iaf.nl (8.9.2/8.9.2) with UUCP id SAA32126;
	  Fri, 19 Nov 1999 18:55:39 +0100 (MET)
Received: (from wilko@localhost)
	by yedi.iaf.nl (8.9.3/8.9.3) id SAA54750;
	Fri, 19 Nov 1999 18:56:38 +0100 (CET)
	(envelope-from wilko)
From: Wilko Bulte <wilko@yedi.iaf.nl>
Message-Id: <199911191756.SAA54750@yedi.iaf.nl>
Subject: Re: RAID-5 and failure
In-Reply-To: <19991119101720.35872@mojave.sitaranetworks.com> from Greg Lehey at "Nov 19, 1999 10:17:20 am"
To: grog@lemis.com
Date: Fri, 19 Nov 1999 18:56:38 +0100 (CET)
Cc: shimon@simon-shapiro.org, ticso@cicely.de, pantzer@ludd.luth.se,
	freebsd-fs@FreeBSD.ORG
X-Organisation: Private FreeBSD site - Arnhem, The Netherlands
X-pgp-info: PGP public key at 'finger wilko@freefall.freebsd.org'
X-Mailer: ELM [version 2.4ME+ PL43 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

As Greg Lehey wrote ...
> On Thursday, 18 November 1999 at 23:38:52 -0500, Simon Shapiro wrote:
> > Bernd Walter wrote:
> >>
> >> I asume that's the reason why some systems use 520 byte sectors - maybe they
> >> write timestamps or generationnumbers in a single write within the sector.
> >
> > 528.  512 data, 16 ECC for the sector.  Nothing to do with RAID.
> 
> There are various sizes.  I've had surplus disks with 516 and 520 byte
> sectors.  But yes, they're usually under hardware control.

I've also seen 518 once.

-- 
|   / o / /  _  	Arnhem, The Netherlands	  - Powered by FreeBSD -
|/|/ / / /( (_) Bulte 	WWW : http://www.tcja.nl  http://www.freebsd.org


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message


From owner-freebsd-fs  Sat Nov 20 12:20: 6 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from europa.dreamscape.com (europa.dreamscape.com [206.64.128.147])
	by hub.freebsd.org (Postfix) with ESMTP id D1BAA14C41
	for <freebsd-fs@freebsd.org>; Sat, 20 Nov 1999 12:19:40 -0800 (PST)
	(envelope-from krentel@dreamscape.com)
Received: from dreamscape.com (sA18-p7.dreamscape.com [209.217.200.7])
          by europa.dreamscape.com (8.8.5/8.8.4) with ESMTP
	  id PAA16622 for <freebsd-fs@freebsd.org>; Sat, 20 Nov 1999 15:19:37 -0500 (EST)
X-Dreamscape-Track-A: sA18-p7.dreamscape.com [209.217.200.7]
X-Dreamscape-Track-B: Sat, 20 Nov 1999 15:19:37 -0500 (EST)
Received: (from krentel@localhost)
	by dreamscape.com (8.9.3/8.9.3) id PAA03794
	for freebsd-fs@freebsd.org; Sat, 20 Nov 1999 15:17:58 -0500 (EST)
	(envelope-from krentel)
Date: Sat, 20 Nov 1999 15:17:58 -0500 (EST)
From: "Mark W. Krentel" <krentel@dreamscape.com>
Message-Id: <199911202017.PAA03794@dreamscape.com>
To: freebsd-fs@freebsd.org
Subject: running linux binaries from ext2fs partition
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Is it possible to run linux (or freebsd) binaries directly from a
local ext2fs partition?

My machine dual boots between Freebsd 3.3-stable (as of Nov 7) and Red
Hat 6.0.  I have the linux_base-6.0 port installed, and I can run
linux binaries by copying them to a freebsd partition.  But I tried
running them directly from their ext2fs partition and I got a "page
fault while in kernel mode" panic.  I'm not using soft updates, if
that matters.

I'm guessing that this is not supported and probably has nothing to do
with linux binaries.  If I'm wrong and this should work, then I'll be
back with more details.  But I thought I should check before I run too
many experiments that crash my system. :-(

While we're on the subject, on what filesystem types is it ok to run
binaries?  Local freebsd (UFS), NFS, and cdrom should all work, right?
Are there others?

--Mark Krentel


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message