From owner-freebsd-fs@freebsd.org Tue May 17 10:26:43 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 730D8B3EF91 for ; Tue, 17 May 2016 10:26:43 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mailman.ysv.freebsd.org (unknown [127.0.1.3]) by mx1.freebsd.org (Postfix) with ESMTP id 60D271E00 for ; Tue, 17 May 2016 10:26:43 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: by mailman.ysv.freebsd.org (Postfix) id 601CDB3EF8F; Tue, 17 May 2016 10:26:43 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 5FC29B3EF8D for ; Tue, 17 May 2016 10:26:43 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail108.syd.optusnet.com.au (mail108.syd.optusnet.com.au [211.29.132.59]) by mx1.freebsd.org (Postfix) with ESMTP id 10C9E1DFF for ; Tue, 17 May 2016 10:26:42 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c122-106-149-109.carlnfd1.nsw.optusnet.com.au (c122-106-149-109.carlnfd1.nsw.optusnet.com.au [122.106.149.109]) by mail108.syd.optusnet.com.au (Postfix) with ESMTPS id A332D1A3E13; Tue, 17 May 2016 20:26:33 +1000 (AEST) Date: Tue, 17 May 2016 20:26:26 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Konstantin Belousov cc: fs@freebsd.org Subject: Re: quick fix for slow directory shrinking in ffs In-Reply-To: <20160517082050.GX89104@kib.kiev.ua> Message-ID: <20160517192933.U4573@besplex.bde.org> References: <20160517072705.F2157@besplex.bde.org> <20160517082050.GX89104@kib.kiev.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=TuMb/2jh c=1 sm=1 tr=0 a=R/f3m204ZbWUO/0rwPSMPw==:117 a=L9H7d07YOLsA:10 a=9cW_t1CCXrUA:10 a=s5jvgZ67dGcA:10 a=kj9zAlcOel0A:10 a=DFMq5MnWpUiX_LZUeQ4A:9 a=l0SOHounc31-8c1A:21 a=KhYsnkSFTUpZV8AS:21 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 10:26:43 -0000 On Tue, 17 May 2016, Konstantin Belousov wrote: > On Tue, May 17, 2016 at 07:54:27AM +1000, Bruce Evans wrote: >> ffs does very slow shrinking of directories after removing some files >> leaves unused blocks at the end, by always doing synchronous truncation. >> ... >> X Index: ufs_lookup.c >> X =================================================================== >> X --- ufs_lookup.c (revision 299263) >> X +++ ufs_lookup.c (working copy) >> X @@ -1131,9 +1131,9 @@ >> X if (tvp != NULL) >> X VOP_UNLOCK(tvp, 0); >> X error = UFS_TRUNCATE(dvp, (off_t)dp->i_endoff, >> X - IO_NORMAL | IO_SYNC, cr); >> X + IO_NORMAL | (DOINGASYNC(dvp) ? 0 : IO_SYNC), cr); >> X if (error != 0) >> X - vprint("ufs_direnter: failted to truncate", dvp); >> X + vprint("ufs_direnter: failed to truncate", dvp); >> X #ifdef UFS_DIRHASH >> X if (error == 0 && dp->i_dirhash != NULL) >> X ufsdirhash_dirtrunc(dp, dp->i_endoff); > > The IO_SYNC flag, for non-journaled SU and any kind of non-SU mounts, > only affects the new blocks allocation mode, and write-out mode for > the last fragment. The truncation itself (for -J) is performed in the > context of the truncating thread. The cg blocks, after the bits are > set to free, are marked for delayed write (with the background write > hack). The inode block is written according to the mount mode, ignoring > IO_SYNC. I don't see why you think that. ffs_truncate() clearly honors IO_SYNC, and testing shows that ffs with soft updates does precisely 7 extra sync writes for directory compaction (where some of the 7 are probably to sync previous activity). I think it would be wrong to ignore IO_SYNC and use the mount mode for inodes. Async mounts still have that bug IIRC (I fixed locally long ago). IO_SYNC is set if the file is is open with O_SYNC and mount mode must not override this. I think ffs has no way of telling that this particular IO_SYNC is not associated with O_SYNC. > That is, for always fully populated directory files, I do not see how > anything is changed by the patch. This problem affects all 512-boundaries, which are rarely block or even fragment boundaries. Test program: X mp=$(df . | grep -v Filesystem | sed 's/ .*//') X echo $mp X while :; X do X echo -n "start: $(/usr/bin/stat -f %5z .); " X mount -v | grep $mp | sed -e 's/.*writes/writes/' -e 's/, reads.*//' X X touch $(jot 41 0) # just over 512 bytes X echo -n "touch: $(/usr/bin/stat -f %5z .); " X mount -v | grep $mp | sed -e 's/.*writes/writes/' -e 's/, reads.*//' X X # X # Async mounts are still broken in -current -- these rm's (but nothing X # else here) cause sync writes (but just 1 for the 2 rm's). X # X # rm 39 40 # just under, but no truncation yet X echo -n "rm: $(/usr/bin/stat -f %5z .); " X mount -v | grep $mp | sed -e 's/.*writes/writes/' -e 's/, reads.*//' X X # X # Another bug in async mounts makes the truncate for the compaction X # triggered by this touch do an async write (with the fix to stop it X # doing a sync write. X # X touch 39 # still under; this creation does the truncation X echo -n "touch 39:$(/usr/bin/stat -f %5z .); " X mount -v | grep $mp | sed -e 's/.*writes/writes/' -e 's/, reads.*//' X X sleep 10 X echo X done I hope this uses a portable enough way to find the mount point. This must be run in an empty directory (or you have to adjust the sizes). Results: - async mount with fix: 1 sync write per iteration. A bogus one triggered by the rm. I only fixed this locally. Remove the rm line so that the size stays slightly about 1024 bytes and there are 0 sync writes. There is also 1 async write triggered by the truncate. This is another bug in async mounts which I have fixed locally. All writes for async mounts should be delayed unless IO_SYNC forces them to be sync. - soft updates: 7 sync writes per iteration all triggered by the final touch (which triggers the compaction). Remove the rm line and there are again 0 sync writes. Sometimes there are 2-5 async writes between the loop iterations. These might be for the loop too, since there are more of them than for async mounts. (I left daemons running whuile testing this on the root file system. Test on a completely idle fs to be sure.) - no soft updates and no async mount: first touch does 3 sync writes, rm does 2 sync, last touch does 4 sync; 0 async writes. The IO_SYNC for soft updates apparently turns all the previous writes for the loop into sync ones. It has to order them and wait for them and there is no better way to wait than a sync write. The ordering makes an unnecessary sync write even more expensive for soft updates than for other cases. Some relevant code in ffs_truncate: Y /* Y * Shorten the size of the file. If the file is not being Y * truncated to a block boundary, the contents of the Y * partial block following the end of the file must be Y * zero'ed in case it ever becomes accessible again because Y * of subsequent file growth. Directories however are not Y * zero'ed as they should grow back initialized to empty. Y */ Y offset = blkoff(fs, length); Y if (offset == 0) { Y ip->i_size = length; Y DIP_SET(ip, i_size, length); Y } else { Y lbn = lblkno(fs, length); Y flags |= BA_CLRBUF; Y error = UFS_BALLOC(vp, length - 1, 1, cred, flags, &bp); Y if (error) { Y return (error); Y } Y /* Y * When we are doing soft updates and the UFS_BALLOC Y * above fills in a direct block hole with a full sized Y * block that will be truncated down to a fragment below, Y * we must flush out the block dependency with an FSYNC Y * so that we do not get a soft updates inconsistency Y * when we create the fragment below. Y */ Y if (DOINGSOFTDEP(vp) && lbn < NDADDR && Y fragroundup(fs, blkoff(fs, length)) < fs->fs_bsize && Y (error = ffs_syncvnode(vp, MNT_WAIT)) != 0) Y return (error); Y ip->i_size = length; Y DIP_SET(ip, i_size, length); Y size = blksize(fs, ip, lbn); Y if (vp->v_type != VDIR) Y bzero((char *)bp->b_data + offset, Y (u_int)(size - offset)); Y /* Kirk's code has reallocbuf(bp, size, 1) here */ Y allocbuf(bp, size); Y if (bp->b_bufsize == fs->fs_bsize) Y bp->b_flags |= B_CLUSTEROK; Y if (flags & IO_SYNC) Y bwrite(bp); Y else Y bawrite(bp); Y } I think we usually arrive here and honor the IO_SYNC flag. This is correct. Otherwise, we always do an async write, but that is wrong for async mounts. Here is my old fix for this: Z diff -u2 ffs_inode.c~ ffs_inode.c Z --- ffs_inode.c~ Wed Apr 7 21:22:26 2004 Z +++ ffs_inode.c Sat Mar 23 01:23:16 2013 Z @@ -345,4 +431,6 @@ Z if (flags & IO_SYNC) Z bwrite(bp); Z + else if (DOINGASYNC(ovp)) Z + bdwrite(bp); Z else Z bawrite(bp); This fix must be sprinkled in most places where there is a bwrite()/ bawrite() decision. Bruce