Skip site navigation (1)Skip section navigation (2)
Date:      Sun, 2 Dec 2007 14:57:05 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Don Lewis <truckman@FreeBSD.org>
Cc:        freebsd-fs@FreeBSD.org
Subject:   Re: File remove problem
Message-ID:  <20071202132955.M18602@delplex.bde.org>
In-Reply-To: <200712012214.lB1MEl2Q015881@gw.catspoiler.org>
References:  <200712012214.lB1MEl2Q015881@gw.catspoiler.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Sat, 1 Dec 2007, Don Lewis wrote:

> On  2 Dec, Bruce Evans wrote:
>
>> Here is a non-hackish patch which explains why ignoring MNT_RDONLY in
>> the above or in ffs_mount() helps.  It just fixes the confusion between
>> IN_MODIFIED and IN_CHANGE in critical places.
>>
>> % Index: ffs_softdep.c

[All settings here, but not yet in ufs_inactive() and ffs_truncate(),
which are critical, and in other places which might not be critical.]

>> Without this change, soft updates depends on IN_CHANGE being converted
>> to IN_MODIFIED by ufs_itimes(), but this conversion doesn't happen
>> when MNT_RDONLY is set.  With soft updates, changes are often delayed
>> until sync time, and when the sync is for mount-update it is done after
>> setting MNT_RDONLY so the above doesn't work.
>
> ufs_itimes() should probably also be looking at fs_ronly instead of
> MNT_RDONLY, *but* all the paths leading from userland to ufs_itimes()
> would need to be checked to verify that they check MNT_RDONLY to prevent
> new file system write operations from happening while the remount is in
> progress.

Yes, that is probably why MNT_RDONLY is (ab)used now.

I found old (Y2002) private mail from mckusick that explains a previous
change in this area, a change that mostly avoided the problem but has
been lost:

% RCS file: /home/ncvs/src/sys/ufs/ufs/ufs_vnops.c,v
% Working file: ufs_vnops.c
% ----------------------------
% revision 1.182
% date: 2002/01/15 07:17:12;  author: mckusick;  state: Exp;  lines: +4 -5
% When downgrading a filesystem from read-write to read-only, operations
% involving file removal or file update were not always being fully
% committed to disk. The result was lost files or corrupted file data.
% This change ensures that the filesystem is properly synced to disk
% before the filesystem is down-graded.
% 
% This delta also fixes a long standing bug in which a file open for
% reading has been unlinked. When the last open reference to the file
% is closed, the inode is reclaimed by the filesystem. Previously,
% if the filesystem had been down-graded to read-only, the inode could
% not be reclaimed, and thus was lost and had to be later recovered
% by fsck.  With this change, such files are found at the time of the
% down-grade.  Normally they will result in the filesystem down-grade
% failing with `device busy'. If a forcible down-grade is done, then
% the affected files will be revoked causing the inode to be released
% and the open file descriptors to begin failing on attempts to read.
% 
% Submitted by:	"Sam Leffler" <sam@errno.com>
% ----------------------------
% 
% Index: ufs_vnops.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/ufs/ufs/ufs_vnops.c,v
% retrieving revision 1.181
% retrieving revision 1.182
% diff -u -2 -r1.181 -r1.182
% --- ufs_vnops.c	22 Nov 2001 15:33:12 -0000	1.181
% +++ ufs_vnops.c	15 Jan 2002 07:17:12 -0000	1.182
% @@ -37,5 +37,5 @@
%   *
%   *	@(#)ufs_vnops.c	8.27 (Berkeley) 5/27/95
% - * $FreeBSD: src/sys/ufs/ufs/ufs_vnops.c,v 1.181 2001/11/22 15:33:12 guido Exp $
% + * $FreeBSD: src/sys/ufs/ufs/ufs_vnops.c,v 1.182 2002/01/15 07:17:12 mckusick Exp $
%   */
% 
% @@ -159,11 +159,10 @@
%  	if ((ip->i_flag & (IN_ACCESS | IN_CHANGE | IN_UPDATE)) == 0)
%  		return;
% +	if ((vp->v_type == VBLK || vp->v_type == VCHR) && !DOINGSOFTDEP(vp))
% +		ip->i_flag |= IN_LAZYMOD;
% +	else
% +		ip->i_flag |= IN_MODIFIED;
%  	if ((vp->v_mount->mnt_flag & MNT_RDONLY) == 0) {
%  		vfs_timestamp(&ts);
% -		if ((vp->v_type == VBLK || vp->v_type == VCHR) &&
% -		    !DOINGSOFTDEP(vp))
% -			ip->i_flag |= IN_LAZYMOD;
% -		else
% -			ip->i_flag |= IN_MODIFIED;
%  		if (ip->i_flag & IN_ACCESS) {
%  			ip->i_atime = ts.tv_sec;

This is in ufs_itimes().  Note that it moves the setting of the modified
flag before the check of MNT_RDONLY, so that when the wrong or incomplete
flags are set earlier and the wrong flags aren't converted to the
modified flag before MNT_RDONLY is set, then we only lose the timestamps
but not critical updates here.

% Index: ffs_inode.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/ufs/ffs/ffs_inode.c,v
% retrieving revision 1.73
% retrieving revision 1.74
% diff -u -2 -r1.73 -r1.74
% --- ffs_inode.c	13 Dec 2001 05:07:48 -0000	1.73
% +++ ffs_inode.c	15 Jan 2002 07:17:12 -0000	1.74
% @@ -32,5 +32,5 @@
%   *
%   *	@(#)ffs_inode.c	8.13 (Berkeley) 4/21/95
% - * $FreeBSD: src/sys/ufs/ffs/ffs_inode.c,v 1.73 2001/12/13 05:07:48 mckusick Exp $
% + * $FreeBSD: src/sys/ufs/ffs/ffs_inode.c,v 1.74 2002/01/15 07:17:12 mckusick Exp $
%   */
% 
% @@ -88,7 +88,7 @@
%  		return (0);
%  	ip->i_flag &= ~(IN_LAZYMOD | IN_MODIFIED);
% -	if (vp->v_mount->mnt_flag & MNT_RDONLY)
% -		return (0);
%  	fs = ip->i_fs;
% +	if (fs->fs_ronly)
% +		return (0);
%  	/*
%  	 * Ensure that uid and gid are correct. This is a temporary

This fixes loss of the critical updates a little later in ffs_update().

% @@ -153,4 +153,6 @@
%  	oip = VTOI(ovp);
%  	fs = oip->i_fs;
% +	if (fs->fs_ronly)
% +		panic("ffs_truncate: read-only filesystem");
%  	if (length < 0)
%  		return (EINVAL);

This is a sanity check in ffs_truncate().  I think all callers except the
one in ufs_inactive() automatically pass the check, since they are higher
level so they do a correct check of MNT_RDONLY.  The call in ufs_inactive()
used to be unconditional (except for the (i_nlink == 0) condition of course).
In -current it is conditional on MNT_ONLY.  kib's patch changes it to be
conditional on fs_ronly, but I hope it can become unconditional again --
it should be an error to reach ufs_inactive() with a partially deleted
file after syncing before changing fs_ronly to 0.

ffs_update() should panic instead of returning 0 when (fs->fs_ronly)
too, so that bugs get noticed.

mckusick's explanation says that "[fs_ronly is the only believable flag].
Thus the killing of IN_MODIFIED has to happen a few lines later [in]
ffs_update()".

It should be safe to blindly ignore all modification flags except IN_ACCESS
in ufs_itimes(), since ffs_update() will kill the completely invalid ones.

4.4BSD-Lite blindly ignored all modification flags in ITIMES(), and
checks the wrong read-only flag (MNT_RDONLY) in open code that duplicates
ITIMES() plus adds the wrong r/o check.  When I converted ITIMES() to
ufs_itimes(), I centralized this wrong r/o check.  This was mainly a
cleanup, but I think it fixes wrong setting of atimes (IN_ATIME is set
without checking any r/o flags, and IN_ATIME was sometimes converted
into a timestamp before ffs_update killed it, so applications could
see atime changing even on purely r/o mounted file systems).

The fix in ufs_vnops.c was lost relatively recently as part of related
changes to fix IN_ACCESS for non-exclusively-locked reads:

% Index: ufs_vnops.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/ufs/ufs/ufs_vnops.c,v
% retrieving revision 1.279
% retrieving revision 1.280
% diff -u -2 -r1.279 -r1.280
% --- ufs_vnops.c	2 Oct 2006 02:08:31 -0000	1.279
% +++ ufs_vnops.c	10 Oct 2006 09:20:54 -0000	1.280
% @@ -36,5 +36,5 @@
% 
%  #include <sys/cdefs.h>
% -__FBSDID("$FreeBSD: src/sys/ufs/ufs/ufs_vnops.c,v 1.279 2006/10/02 02:08:31 tegge Exp $");
% +__FBSDID("$FreeBSD: src/sys/ufs/ufs/ufs_vnops.c,v 1.280 2006/10/10 09:20:54 kib Exp $");
% 
%  #include "opt_mac.h"
% @@ -129,29 +129,47 @@
%  	struct inode *ip;
%  	struct timespec ts;
% +	int mnt_locked;
% 
%  	ip = VTOI(vp);
% +	mnt_locked = 0;
% +	if ((vp->v_mount->mnt_flag & MNT_RDONLY) != 0) {
% +		VI_LOCK(vp);
% +		goto out;
% +	}
% +	MNT_ILOCK(vp->v_mount);		/* For reading of mnt_kern_flags. */
% +	mnt_locked = 1;
% +	VI_LOCK(vp);
%  	if ((ip->i_flag & (IN_ACCESS | IN_CHANGE | IN_UPDATE)) == 0)
% -		return;
% +		goto out_unl;
% +
% 
%  		ip->i_flag |= IN_LAZYMOD;
% -	else
% +	else if (((vp->v_mount->mnt_kern_flag &
% +		    (MNTK_SUSPENDED | MNTK_SUSPEND)) == 0) ||
% +		    (ip->i_flag & (IN_CHANGE | IN_UPDATE)))
%  		ip->i_flag |= IN_MODIFIED;
% -	if ((vp->v_mount->mnt_flag & MNT_RDONLY) == 0) {
% -		vfs_timestamp(&ts);
% -		if (ip->i_flag & IN_ACCESS) {
% -			DIP_SET(ip, i_atime, ts.tv_sec);
% -			DIP_SET(ip, i_atimensec, ts.tv_nsec);
% -		}
% -		if (ip->i_flag & IN_UPDATE) {
% -			DIP_SET(ip, i_mtime, ts.tv_sec);
% -			DIP_SET(ip, i_mtimensec, ts.tv_nsec);
% -			ip->i_modrev++;
% -		}
% -		if (ip->i_flag & IN_CHANGE) {
% -			DIP_SET(ip, i_ctime, ts.tv_sec);
% -			DIP_SET(ip, i_ctimensec, ts.tv_nsec);
% -		}
% +	else if (ip->i_flag & IN_ACCESS)
% +		ip->i_flag |= IN_LAZYACCESS;
% +	vfs_timestamp(&ts);
% +	if (ip->i_flag & IN_ACCESS) {
% +		DIP_SET(ip, i_atime, ts.tv_sec);
% +		DIP_SET(ip, i_atimensec, ts.tv_nsec);
% +	}
% +	if (ip->i_flag & IN_UPDATE) {
% +		DIP_SET(ip, i_mtime, ts.tv_sec);
% +		DIP_SET(ip, i_mtimensec, ts.tv_nsec);
% +		ip->i_modrev++;
% +	}
% +	if (ip->i_flag & IN_CHANGE) {
% +		DIP_SET(ip, i_ctime, ts.tv_sec);
% +		DIP_SET(ip, i_ctimensec, ts.tv_nsec);
%  	}
% +
% + out:
%  	ip->i_flag &= ~(IN_ACCESS | IN_CHANGE | IN_UPDATE);
% + out_unl:
% +	VI_UNLOCK(vp);
% +	if (mnt_locked)
% +		MNT_IUNLOCK(vp->v_mount);
%  }
%

Now we trust MNT_RDONLY again, so that when the wrong or incomplete
flags are set earlier and the wrong flags aren't converted to the
modified flag before MNT_RDONLY is set, then we only lose both timestamps
and critical updates here.

I think the best quick fix would be to trust fs_ronly here, except for
IN_ACCESS.  Then wrong IN_CHANGE | IN_UPDATE flags would just cause
wrong updates of i_ctime and i_mtime, and an extra i/o to write these
changes, but missing IN_MODIFIED flags would be fixed up as in rev.1.182
and associated correct IN_CHANGE | IN_UPDATE flags wouldn't be incorrectly
discarded.  IN_ACCESS needs special handling even in the non-snapshot
cases so that read() doesn't race mount-update (at best, read()s might
keep dirtying inodes, so there would be a problem setting fs_ronly
atomically with completing the sync).

Most other file systems are primitive or broken in this area.  ext2fs is
at the level of ufs_vnops.c 1.182.  msdosfs is at level before 4.4BSD-Lite
(it still uses its clone of ITIMES() and looks more like the Net/2 ffs
than the 4.4BSD one).  But most other file systems are Giant-locked, so
they don't have the IN_ACCESS races.

Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20071202132955.M18602>