From owner-freebsd-fs@FreeBSD.ORG Sat Dec 1 16:07:43 2007 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 3146F16A418 for ; Sat, 1 Dec 2007 16:07:43 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail18.syd.optusnet.com.au (mail18.syd.optusnet.com.au [211.29.132.199]) by mx1.freebsd.org (Postfix) with ESMTP id AE3F813C45A for ; Sat, 1 Dec 2007 16:07:42 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c211-30-219-213.carlnfd3.nsw.optusnet.com.au [211.30.219.213]) by mail18.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id lB1G7FvG008283 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sun, 2 Dec 2007 03:07:24 +1100 Date: Sun, 2 Dec 2007 03:07:10 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Bruce Evans In-Reply-To: <20071201030305.G1170@besplex.bde.org> Message-ID: <20071202020213.I2849@besplex.bde.org> References: <474F4E46.8030109@nokia.com> <20071130112043.H7217@besplex.bde.org> <474F69A7.9090404@nokia.com> <20071130151606.F12094@delplex.bde.org> <20071130052840.GH83121@deviant.kiev.zoral.com.ua> <20071201030305.G1170@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-fs@freebsd.org Subject: Re: File remove problem X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 01 Dec 2007 16:07:43 -0000 On Sat, 1 Dec 2007, Bruce Evans wrote: > On Fri, 30 Nov 2007, Kostik Belousov wrote: >> As a speculation, it might be that ufs_inactive() should conditionalize on >> fs_ronly instead of MNT_RDONLY. Then, VOP_INACTIVE() would set up the >> IN_CHANGE|IN_UPDATE and finally call the ffs_update() ? > > Something like that seems to be right. The folowing hack in ufs_inactive() > seems to fix the problem with sift updates, as does unsetting MNT_RDONLY > for the whole VOP_SYNC() in ffs_mount(). > > % Index: ufs_inode.c > % =================================================================== > % RCS file: /home/ncvs/src/sys/ufs/ufs/ufs_inode.c,v > % retrieving revision 1.53 > % diff -u -2 -r1.53 ufs_inode.c > % --- ufs_inode.c 7 Apr 2004 03:47:20 -0000 1.53 > % +++ ufs_inode.c 30 Nov 2007 12:58:39 -0000 > % @@ -59,4 +59,6 @@ > % #endif > % % +#include > % + > % /* > % * Last reference to an inode. If necessary, write or delete it. > % @@ -118,6 +120,15 @@ > % ip->i_flag &= ~IN_ACCESS; > % } else { > % + int wasro = 0; > % + > % (void) vn_write_suspend_wait(vp, NULL, V_WAIT); > % + if (vp->v_mount->mnt_flag & MNT_RDONLY && > % + ip->i_fs->fs_ronly == 0) { > % + vp->v_mount->mnt_flag &= ~MNT_RDONLY; > % + wasro = 1; > % + } > % UFS_UPDATE(vp, 0); > % + if (wasro) > % + vp->v_mount->mnt_flag |= MNT_RDONLY; > % } > % } > > I didn't bother with correct locking here (only tested under UP). > > With this change, the VOP_SYNC() in ffs_mount() for MNT_UPDATE seems > to flush everything in simple cases (with no open files), just like > the VOP_SYNC() in unmount() flushes everything before ffs_unmount() is > reached. OTOH, without a forced flush, soft updates takes a long > time to flush things -- more like 3 syncer periods than 1 for > non-waitfor syncs. With soft updates, the above is called from deep > in VOP_SYNC(). It's strange that the above non-waitfor UFS_UPDATE() > is used inside of waitfor syncs. It apparently works because the > waitfor syncs wait for it later, but only if it is non-null. Here is a non-hackish patch which explains why ignoring MNT_RDONLY in the above or in ffs_mount() helps. It just fixes the confusion between IN_MODIFIED and IN_CHANGE in critical places. % Index: ffs_softdep.c % =================================================================== % RCS file: /home/ncvs/src/sys/ufs/ffs/ffs_softdep.c,v % retrieving revision 1.214 % diff -u -2 -r1.214 ffs_softdep.c % --- ffs_softdep.c 9 Nov 2007 11:04:36 -0000 1.214 % +++ ffs_softdep.c 1 Dec 2007 14:25:49 -0000 % @@ -2776,5 +2776,5 @@ % DIP_SET(ip, i_blocks, DIP(ip, i_blocks) + \ % freeblks->fb_chkcnt - blocksreleased); % - ip->i_flag |= IN_CHANGE; % + ip->i_flag |= IN_MODIFIED; % vput(vp); % } % @@ -3575,5 +3575,5 @@ % ip->i_nlink--; % DIP_SET(ip, i_nlink, ip->i_nlink); % - ip->i_flag |= IN_CHANGE; % + ip->i_flag |= IN_MODIFIED; % if (ip->i_nlink < ip->i_effnlink) % panic("handle_workitem_remove: bad file delta"); % @@ -3594,5 +3594,5 @@ % ip->i_nlink -= 2; % DIP_SET(ip, i_nlink, ip->i_nlink); % - ip->i_flag |= IN_CHANGE; % + ip->i_flag |= IN_MODIFIED; % if (ip->i_nlink < ip->i_effnlink) % panic("handle_workitem_remove: bad dir delta"); % @@ -3635,5 +3635,5 @@ % WORKLIST_INSERT(&inodedep->id_inowait, &dirrem->dm_list); % FREE_LOCK(&lk); % - ip->i_flag |= IN_CHANGE; % + ip->i_flag |= IN_MODIFIED; % ffs_update(vp, 0); % vput(vp); Without this change, soft updates depends on IN_CHANGE being converted to IN_MODIFIED by ufs_itimes(), but this conversion doesn't happen when MNT_RDONLY is set. With soft updates, changes are often delayed until sync time, and when the sync is for mount-update it is done after setting MNT_RDONLY so the above doesn't work. Soft updates sometimes forces an inode update using UFS_UPDATE(..., 1) but it doesn't do this in all cases even for MNT_UPDATE syncs. The update is first done asynchronously (except for the bug) by setting IN_* in the above so that ffs_update() is (supposed to be) non-null when it is called by ufs_inactive() or directly in 1 case in the above. Then for MNT_UPDATE cases, something should wait later. This seems to be broken for some cases but not for mount-update or unmount. Setting IN_CHANGE in the above also breaks ctimes slightly. I think the bug is visible from userland in some cases. Consider the operarations rename; stat; sync; sync. The mv sets update marks; the stat (actually ufs_inactive() at the end of the rename if the file is not open) converts the marks to timestamps that should be final; then the sync runs the above code which sets the ctime mark again, and one of the syncs (should be the first one, but that seems to be broken) converts the ctime mark to a timestamp and thus breaks the timestamp already reported to userland. The problem area can be investigated without creating corrupt file systems by using fsync(2): on a new file system /f: touch a while :; do mv a b fsync b # optionally sleep a short time 0.1-0.5 seconds fsck -n /f mv b a unmount /f; mount /f # Don't trust sync done I watched the mv and the fsync using ddb and the fsck. With soft updates, (even with sync mounts) the fsync doesn't actually complete the i/o. Ddb shows that the work list is not completed when fsync returns, and the fsck shows that the file system is inconsistent when the fsck returns. After sleeping just a fraction of a second, the fsck shows a consistent file system. 0.1 seconds usually isn't enough, but 0.5 seconds usually is. I think it is softdep_waitidle() that is waiting long enough for the unmount and remount cases to work. Without soft updates (even with async mounts), the fsync seems to work correctly -- no sleep is needed for the fsck to usually show a consistent fs -- though in the async case this just because no i/o has been done in the usual case, and in the sync case the fs is consistent without the fsync. Without the above patch, after "mv a b", "fsync b" is insufficient to make "mount -u -o ro /f" not corrupt the fs (because the mv generates 3 or 4 entries in the worklist, and the fsync generates 0 or 1 more and always ends up with 1 or 2 not handled, and then the sync for mount-update usually doesn't make any more progress). I didn't check if a short delay after the fsync is sufficient, but expect that it is. With the above patch, a loop doing "mv a b; mount -u -o ro /f; fsck -n /f; mount -u -o rw /f; mv b a" shows no problems. However, the original problem was without soft updates, so there must be more bugs here. > BTW, *_mount() has lots of bogusness related to string options. In > particular, ffs_mount() for update from r/w to r/o checks the "ro" > string option and sets MNT_RDONLY later, but MNT_RDONLY is already set > and other things depend on it being set early. Setting MNT_RDONLY too early can probably cause critical updates to be lost in the non-soft updates case due to the IN_CHANGE/IN_MODIFIED confusion, but only in cases where mount races with other operations. While testing soft updates without any patches, I got a panic (dup alloc?) immediately when I raced the mv+remount loop (or was it just mv+fsync?) with another loop doing just remount. Bruce