From owner-freebsd-fs@FreeBSD.ORG  Sat Dec  1 16:07:43 2007
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 3146F16A418
	for <freebsd-fs@freebsd.org>; Sat,  1 Dec 2007 16:07:43 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail18.syd.optusnet.com.au (mail18.syd.optusnet.com.au
	[211.29.132.199])
	by mx1.freebsd.org (Postfix) with ESMTP id AE3F813C45A
	for <freebsd-fs@freebsd.org>; Sat,  1 Dec 2007 16:07:42 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from besplex.bde.org (c211-30-219-213.carlnfd3.nsw.optusnet.com.au
	[211.30.219.213])
	by mail18.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	lB1G7FvG008283
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Sun, 2 Dec 2007 03:07:24 +1100
Date: Sun, 2 Dec 2007 03:07:10 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Bruce Evans <brde@optusnet.com.au>
In-Reply-To: <20071201030305.G1170@besplex.bde.org>
Message-ID: <20071202020213.I2849@besplex.bde.org>
References: <474F4E46.8030109@nokia.com> <20071130112043.H7217@besplex.bde.org>
	<474F69A7.9090404@nokia.com> <20071130151606.F12094@delplex.bde.org>
	<20071130052840.GH83121@deviant.kiev.zoral.com.ua>
	<20071201030305.G1170@besplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-fs@freebsd.org
Subject: Re: File remove problem
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 01 Dec 2007 16:07:43 -0000

On Sat, 1 Dec 2007, Bruce Evans wrote:

> On Fri, 30 Nov 2007, Kostik Belousov wrote:

>> As a speculation, it might be that ufs_inactive() should conditionalize on
>> fs_ronly instead of MNT_RDONLY. Then, VOP_INACTIVE() would set up the
>> IN_CHANGE|IN_UPDATE and finally call the ffs_update() ?
>
> Something like that seems to be right.  The folowing hack in ufs_inactive()
> seems to fix the problem with sift updates, as does unsetting MNT_RDONLY
> for the whole VOP_SYNC() in ffs_mount().
>
> % Index: ufs_inode.c
> % ===================================================================
> % RCS file: /home/ncvs/src/sys/ufs/ufs/ufs_inode.c,v
> % retrieving revision 1.53
> % diff -u -2 -r1.53 ufs_inode.c
> % --- ufs_inode.c	7 Apr 2004 03:47:20 -0000	1.53
> % +++ ufs_inode.c	30 Nov 2007 12:58:39 -0000
> % @@ -59,4 +59,6 @@
> %  #endif
> % % +#include <ufs/ffs/fs.h>
> % +
> %  /*
> %   * Last reference to an inode.  If necessary, write or delete it.
> % @@ -118,6 +120,15 @@
> %  			ip->i_flag &= ~IN_ACCESS;
> %  		} else {
> % +			int wasro = 0;
> % +
> %  			(void) vn_write_suspend_wait(vp, NULL, V_WAIT);
> % +			if (vp->v_mount->mnt_flag & MNT_RDONLY &&
> % +			    ip->i_fs->fs_ronly == 0) {
> % +				vp->v_mount->mnt_flag &= ~MNT_RDONLY;
> % +				wasro = 1;
> % +			}
> %  			UFS_UPDATE(vp, 0);
> % +			if (wasro)
> % +				vp->v_mount->mnt_flag |= MNT_RDONLY;
> %  		}
> %  	}
>
> I didn't bother with correct locking here (only tested under UP).
>
> With this change, the VOP_SYNC() in ffs_mount() for MNT_UPDATE seems
> to flush everything in simple cases (with no open files), just like
> the VOP_SYNC() in unmount() flushes everything before ffs_unmount() is
> reached.  OTOH, without a forced flush, soft updates takes a long
> time to flush things -- more like 3 syncer periods than 1 for
> non-waitfor syncs.  With soft updates, the above is called from deep
> in VOP_SYNC().  It's strange that the above non-waitfor UFS_UPDATE()
> is used inside of waitfor syncs.  It apparently works because the
> waitfor syncs wait for it later, but only if it is non-null.

Here is a non-hackish patch which explains why ignoring MNT_RDONLY in
the above or in ffs_mount() helps.  It just fixes the confusion between
IN_MODIFIED and IN_CHANGE in critical places.

% Index: ffs_softdep.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/ufs/ffs/ffs_softdep.c,v
% retrieving revision 1.214
% diff -u -2 -r1.214 ffs_softdep.c
% --- ffs_softdep.c	9 Nov 2007 11:04:36 -0000	1.214
% +++ ffs_softdep.c	1 Dec 2007 14:25:49 -0000
% @@ -2776,5 +2776,5 @@
%  		DIP_SET(ip, i_blocks, DIP(ip, i_blocks) + \
%  		    freeblks->fb_chkcnt - blocksreleased);
% -		ip->i_flag |= IN_CHANGE;
% +		ip->i_flag |= IN_MODIFIED;
%  		vput(vp);
%  	}
% @@ -3575,5 +3575,5 @@
%  		ip->i_nlink--;
%  		DIP_SET(ip, i_nlink, ip->i_nlink);
% -		ip->i_flag |= IN_CHANGE;
% +		ip->i_flag |= IN_MODIFIED;
%  		if (ip->i_nlink < ip->i_effnlink)
%  			panic("handle_workitem_remove: bad file delta");
% @@ -3594,5 +3594,5 @@
%  	ip->i_nlink -= 2;
%  	DIP_SET(ip, i_nlink, ip->i_nlink);
% -	ip->i_flag |= IN_CHANGE;
% +	ip->i_flag |= IN_MODIFIED;
%  	if (ip->i_nlink < ip->i_effnlink)
%  		panic("handle_workitem_remove: bad dir delta");
% @@ -3635,5 +3635,5 @@
%  	WORKLIST_INSERT(&inodedep->id_inowait, &dirrem->dm_list);
%  	FREE_LOCK(&lk);
% -	ip->i_flag |= IN_CHANGE;
% +	ip->i_flag |= IN_MODIFIED;
%  	ffs_update(vp, 0);
%  	vput(vp);

Without this change, soft updates depends on IN_CHANGE being converted
to IN_MODIFIED by ufs_itimes(), but this conversion doesn't happen
when MNT_RDONLY is set.  With soft updates, changes are often delayed
until sync time, and when the sync is for mount-update it is done after
setting MNT_RDONLY so the above doesn't work.

Soft updates sometimes forces an inode update using UFS_UPDATE(...,
1) but it doesn't do this in all cases even for MNT_UPDATE syncs.  The
update is first done asynchronously (except for the bug) by setting
IN_* in the above so that ffs_update() is (supposed to be) non-null
when it is called by ufs_inactive() or directly in 1 case in the above.
Then for MNT_UPDATE cases, something should wait later.  This seems
to be broken for some cases but not for mount-update or unmount.

Setting IN_CHANGE in the above also breaks ctimes slightly.  I think
the bug is visible from userland in some cases.  Consider the operarations
rename; stat; sync; sync.  The mv sets update marks; the stat (actually
ufs_inactive() at the end of the rename if the file is not open)
converts the marks to timestamps that should be final; then the sync
runs the above code which sets the ctime mark again, and one of the
syncs (should be the first one, but that seems to be broken) converts
the ctime mark to a timestamp and thus breaks the timestamp already
reported to userland.

The problem area can be investigated without creating corrupt file systems
by using fsync(2): on a new file system /f:

 	touch a
 	while :; do
 		mv a b
 		fsync b
 		# optionally sleep a short time 0.1-0.5 seconds
 		fsck -n /f
 		mv b a
 		unmount /f; mount /f  # Don't trust sync
 	done

I watched the mv and the fsync using ddb and the fsck.  With soft
updates, (even with sync mounts) the fsync doesn't actually complete
the i/o.  Ddb shows that the work list is not completed when fsync
returns, and the fsck shows that the file system is inconsistent when
the fsck returns.  After sleeping just a fraction of a second, the
fsck shows a consistent file system.  0.1 seconds usually isn't enough,
but 0.5 seconds usually is.  I think it is softdep_waitidle() that is
waiting long enough for the unmount and remount cases to work.

Without soft updates (even with async mounts), the fsync seems to work
correctly -- no sleep is needed for the fsck to usually show a
consistent fs -- though in the async case this just because no i/o
has been done in the usual case, and in the sync case the fs is consistent
without the fsync.

Without the above patch, after "mv a b", "fsync b" is insufficient to
make "mount -u -o ro /f" not corrupt the fs (because the mv generates
3 or 4 entries in the worklist, and the fsync generates 0 or 1 more
and always ends up with 1 or 2 not handled, and then the sync for
mount-update usually doesn't make any more progress).  I didn't check
if a short delay after the fsync is sufficient, but expect that it is.
With the above patch, a loop doing "mv a b; mount -u -o ro /f; fsck
-n /f; mount -u -o rw /f; mv b a" shows no problems.  However, the
original problem was without soft updates, so there must be more bugs
here.

> BTW, *_mount() has lots of bogusness related to string options.  In
> particular, ffs_mount() for update from r/w to r/o checks the "ro"
> string option and sets MNT_RDONLY later, but MNT_RDONLY is already set
> and other things depend on it being set early.

Setting MNT_RDONLY too early can probably cause critical updates to be
lost in the non-soft updates case due to the IN_CHANGE/IN_MODIFIED
confusion, but only in cases where mount races with other operations.

While testing soft updates without any patches, I got a panic (dup
alloc?) immediately when I raced the mv+remount loop (or was it just
mv+fsync?) with another loop doing just remount.

Bruce