Date: Sun, 2 Dec 2007 14:57:05 +1100 (EST) From: Bruce Evans <brde@optusnet.com.au> To: Don Lewis <truckman@FreeBSD.org> Cc: freebsd-fs@FreeBSD.org Subject: Re: File remove problem Message-ID: <20071202132955.M18602@delplex.bde.org> In-Reply-To: <200712012214.lB1MEl2Q015881@gw.catspoiler.org> References: <200712012214.lB1MEl2Q015881@gw.catspoiler.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On Sat, 1 Dec 2007, Don Lewis wrote: > On 2 Dec, Bruce Evans wrote: > >> Here is a non-hackish patch which explains why ignoring MNT_RDONLY in >> the above or in ffs_mount() helps. It just fixes the confusion between >> IN_MODIFIED and IN_CHANGE in critical places. >> >> % Index: ffs_softdep.c [All settings here, but not yet in ufs_inactive() and ffs_truncate(), which are critical, and in other places which might not be critical.] >> Without this change, soft updates depends on IN_CHANGE being converted >> to IN_MODIFIED by ufs_itimes(), but this conversion doesn't happen >> when MNT_RDONLY is set. With soft updates, changes are often delayed >> until sync time, and when the sync is for mount-update it is done after >> setting MNT_RDONLY so the above doesn't work. > > ufs_itimes() should probably also be looking at fs_ronly instead of > MNT_RDONLY, *but* all the paths leading from userland to ufs_itimes() > would need to be checked to verify that they check MNT_RDONLY to prevent > new file system write operations from happening while the remount is in > progress. Yes, that is probably why MNT_RDONLY is (ab)used now. I found old (Y2002) private mail from mckusick that explains a previous change in this area, a change that mostly avoided the problem but has been lost: % RCS file: /home/ncvs/src/sys/ufs/ufs/ufs_vnops.c,v % Working file: ufs_vnops.c % ---------------------------- % revision 1.182 % date: 2002/01/15 07:17:12; author: mckusick; state: Exp; lines: +4 -5 % When downgrading a filesystem from read-write to read-only, operations % involving file removal or file update were not always being fully % committed to disk. The result was lost files or corrupted file data. % This change ensures that the filesystem is properly synced to disk % before the filesystem is down-graded. % % This delta also fixes a long standing bug in which a file open for % reading has been unlinked. When the last open reference to the file % is closed, the inode is reclaimed by the filesystem. Previously, % if the filesystem had been down-graded to read-only, the inode could % not be reclaimed, and thus was lost and had to be later recovered % by fsck. With this change, such files are found at the time of the % down-grade. Normally they will result in the filesystem down-grade % failing with `device busy'. If a forcible down-grade is done, then % the affected files will be revoked causing the inode to be released % and the open file descriptors to begin failing on attempts to read. % % Submitted by: "Sam Leffler" <sam@errno.com> % ---------------------------- % % Index: ufs_vnops.c % =================================================================== % RCS file: /home/ncvs/src/sys/ufs/ufs/ufs_vnops.c,v % retrieving revision 1.181 % retrieving revision 1.182 % diff -u -2 -r1.181 -r1.182 % --- ufs_vnops.c 22 Nov 2001 15:33:12 -0000 1.181 % +++ ufs_vnops.c 15 Jan 2002 07:17:12 -0000 1.182 % @@ -37,5 +37,5 @@ % * % * @(#)ufs_vnops.c 8.27 (Berkeley) 5/27/95 % - * $FreeBSD: src/sys/ufs/ufs/ufs_vnops.c,v 1.181 2001/11/22 15:33:12 guido Exp $ % + * $FreeBSD: src/sys/ufs/ufs/ufs_vnops.c,v 1.182 2002/01/15 07:17:12 mckusick Exp $ % */ % % @@ -159,11 +159,10 @@ % if ((ip->i_flag & (IN_ACCESS | IN_CHANGE | IN_UPDATE)) == 0) % return; % + if ((vp->v_type == VBLK || vp->v_type == VCHR) && !DOINGSOFTDEP(vp)) % + ip->i_flag |= IN_LAZYMOD; % + else % + ip->i_flag |= IN_MODIFIED; % if ((vp->v_mount->mnt_flag & MNT_RDONLY) == 0) { % vfs_timestamp(&ts); % - if ((vp->v_type == VBLK || vp->v_type == VCHR) && % - !DOINGSOFTDEP(vp)) % - ip->i_flag |= IN_LAZYMOD; % - else % - ip->i_flag |= IN_MODIFIED; % if (ip->i_flag & IN_ACCESS) { % ip->i_atime = ts.tv_sec; This is in ufs_itimes(). Note that it moves the setting of the modified flag before the check of MNT_RDONLY, so that when the wrong or incomplete flags are set earlier and the wrong flags aren't converted to the modified flag before MNT_RDONLY is set, then we only lose the timestamps but not critical updates here. % Index: ffs_inode.c % =================================================================== % RCS file: /home/ncvs/src/sys/ufs/ffs/ffs_inode.c,v % retrieving revision 1.73 % retrieving revision 1.74 % diff -u -2 -r1.73 -r1.74 % --- ffs_inode.c 13 Dec 2001 05:07:48 -0000 1.73 % +++ ffs_inode.c 15 Jan 2002 07:17:12 -0000 1.74 % @@ -32,5 +32,5 @@ % * % * @(#)ffs_inode.c 8.13 (Berkeley) 4/21/95 % - * $FreeBSD: src/sys/ufs/ffs/ffs_inode.c,v 1.73 2001/12/13 05:07:48 mckusick Exp $ % + * $FreeBSD: src/sys/ufs/ffs/ffs_inode.c,v 1.74 2002/01/15 07:17:12 mckusick Exp $ % */ % % @@ -88,7 +88,7 @@ % return (0); % ip->i_flag &= ~(IN_LAZYMOD | IN_MODIFIED); % - if (vp->v_mount->mnt_flag & MNT_RDONLY) % - return (0); % fs = ip->i_fs; % + if (fs->fs_ronly) % + return (0); % /* % * Ensure that uid and gid are correct. This is a temporary This fixes loss of the critical updates a little later in ffs_update(). % @@ -153,4 +153,6 @@ % oip = VTOI(ovp); % fs = oip->i_fs; % + if (fs->fs_ronly) % + panic("ffs_truncate: read-only filesystem"); % if (length < 0) % return (EINVAL); This is a sanity check in ffs_truncate(). I think all callers except the one in ufs_inactive() automatically pass the check, since they are higher level so they do a correct check of MNT_RDONLY. The call in ufs_inactive() used to be unconditional (except for the (i_nlink == 0) condition of course). In -current it is conditional on MNT_ONLY. kib's patch changes it to be conditional on fs_ronly, but I hope it can become unconditional again -- it should be an error to reach ufs_inactive() with a partially deleted file after syncing before changing fs_ronly to 0. ffs_update() should panic instead of returning 0 when (fs->fs_ronly) too, so that bugs get noticed. mckusick's explanation says that "[fs_ronly is the only believable flag]. Thus the killing of IN_MODIFIED has to happen a few lines later [in] ffs_update()". It should be safe to blindly ignore all modification flags except IN_ACCESS in ufs_itimes(), since ffs_update() will kill the completely invalid ones. 4.4BSD-Lite blindly ignored all modification flags in ITIMES(), and checks the wrong read-only flag (MNT_RDONLY) in open code that duplicates ITIMES() plus adds the wrong r/o check. When I converted ITIMES() to ufs_itimes(), I centralized this wrong r/o check. This was mainly a cleanup, but I think it fixes wrong setting of atimes (IN_ATIME is set without checking any r/o flags, and IN_ATIME was sometimes converted into a timestamp before ffs_update killed it, so applications could see atime changing even on purely r/o mounted file systems). The fix in ufs_vnops.c was lost relatively recently as part of related changes to fix IN_ACCESS for non-exclusively-locked reads: % Index: ufs_vnops.c % =================================================================== % RCS file: /home/ncvs/src/sys/ufs/ufs/ufs_vnops.c,v % retrieving revision 1.279 % retrieving revision 1.280 % diff -u -2 -r1.279 -r1.280 % --- ufs_vnops.c 2 Oct 2006 02:08:31 -0000 1.279 % +++ ufs_vnops.c 10 Oct 2006 09:20:54 -0000 1.280 % @@ -36,5 +36,5 @@ % % #include <sys/cdefs.h> % -__FBSDID("$FreeBSD: src/sys/ufs/ufs/ufs_vnops.c,v 1.279 2006/10/02 02:08:31 tegge Exp $"); % +__FBSDID("$FreeBSD: src/sys/ufs/ufs/ufs_vnops.c,v 1.280 2006/10/10 09:20:54 kib Exp $"); % % #include "opt_mac.h" % @@ -129,29 +129,47 @@ % struct inode *ip; % struct timespec ts; % + int mnt_locked; % % ip = VTOI(vp); % + mnt_locked = 0; % + if ((vp->v_mount->mnt_flag & MNT_RDONLY) != 0) { % + VI_LOCK(vp); % + goto out; % + } % + MNT_ILOCK(vp->v_mount); /* For reading of mnt_kern_flags. */ % + mnt_locked = 1; % + VI_LOCK(vp); % if ((ip->i_flag & (IN_ACCESS | IN_CHANGE | IN_UPDATE)) == 0) % - return; % + goto out_unl; % + % % ip->i_flag |= IN_LAZYMOD; % - else % + else if (((vp->v_mount->mnt_kern_flag & % + (MNTK_SUSPENDED | MNTK_SUSPEND)) == 0) || % + (ip->i_flag & (IN_CHANGE | IN_UPDATE))) % ip->i_flag |= IN_MODIFIED; % - if ((vp->v_mount->mnt_flag & MNT_RDONLY) == 0) { % - vfs_timestamp(&ts); % - if (ip->i_flag & IN_ACCESS) { % - DIP_SET(ip, i_atime, ts.tv_sec); % - DIP_SET(ip, i_atimensec, ts.tv_nsec); % - } % - if (ip->i_flag & IN_UPDATE) { % - DIP_SET(ip, i_mtime, ts.tv_sec); % - DIP_SET(ip, i_mtimensec, ts.tv_nsec); % - ip->i_modrev++; % - } % - if (ip->i_flag & IN_CHANGE) { % - DIP_SET(ip, i_ctime, ts.tv_sec); % - DIP_SET(ip, i_ctimensec, ts.tv_nsec); % - } % + else if (ip->i_flag & IN_ACCESS) % + ip->i_flag |= IN_LAZYACCESS; % + vfs_timestamp(&ts); % + if (ip->i_flag & IN_ACCESS) { % + DIP_SET(ip, i_atime, ts.tv_sec); % + DIP_SET(ip, i_atimensec, ts.tv_nsec); % + } % + if (ip->i_flag & IN_UPDATE) { % + DIP_SET(ip, i_mtime, ts.tv_sec); % + DIP_SET(ip, i_mtimensec, ts.tv_nsec); % + ip->i_modrev++; % + } % + if (ip->i_flag & IN_CHANGE) { % + DIP_SET(ip, i_ctime, ts.tv_sec); % + DIP_SET(ip, i_ctimensec, ts.tv_nsec); % } % + % + out: % ip->i_flag &= ~(IN_ACCESS | IN_CHANGE | IN_UPDATE); % + out_unl: % + VI_UNLOCK(vp); % + if (mnt_locked) % + MNT_IUNLOCK(vp->v_mount); % } % Now we trust MNT_RDONLY again, so that when the wrong or incomplete flags are set earlier and the wrong flags aren't converted to the modified flag before MNT_RDONLY is set, then we only lose both timestamps and critical updates here. I think the best quick fix would be to trust fs_ronly here, except for IN_ACCESS. Then wrong IN_CHANGE | IN_UPDATE flags would just cause wrong updates of i_ctime and i_mtime, and an extra i/o to write these changes, but missing IN_MODIFIED flags would be fixed up as in rev.1.182 and associated correct IN_CHANGE | IN_UPDATE flags wouldn't be incorrectly discarded. IN_ACCESS needs special handling even in the non-snapshot cases so that read() doesn't race mount-update (at best, read()s might keep dirtying inodes, so there would be a problem setting fs_ronly atomically with completing the sync). Most other file systems are primitive or broken in this area. ext2fs is at the level of ufs_vnops.c 1.182. msdosfs is at level before 4.4BSD-Lite (it still uses its clone of ITIMES() and looks more like the Net/2 ffs than the 4.4BSD one). But most other file systems are Giant-locked, so they don't have the IN_ACCESS races. Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20071202132955.M18602>