Date: Thu, 21 Sep 2017 10:58:46 +0200
From: Andreas Longwitz <longwitz@incore.de>
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: Kirk McKusick <mckusick@mckusick.com>, freebsd-fs@freebsd.org
Subject: Re: fsync: giving up on dirty on ufs partitions running vfs_write_suspend()
Message-ID: <59C37F46.80509@incore.de>
In-Reply-To: <20170916183117.GF78693@kib.kiev.ua>
References: <201709110519.v8B5JVmf060773@chez.mckusick.com> <59BD0EAC.8030206@incore.de> <20170916183117.GF78693@kib.kiev.ua>
Konstantin Belousov wrote:
> On Sat, Sep 16, 2017 at 01:44:44PM +0200, Andreas Longwitz wrote:
>> Ok, I understand your thoughts about the "big loop" and I agree. On the
>> other side it is not easy to measure the progress of the dirty buffers,
>> because these buffers are created by another process at the same time we
>> loop in vop_stdfsync(). I can explain from my tests, where I use the
>> following loop on a gjournaled partition:
>>
>> while true; do
>>     cp -p bigfile bigfile.tmp
>>     rm bigfile
>>     mv bigfile.tmp bigfile
>> done
>>
>> When g_journal_switcher starts vfs_write_suspend() immediately after the
>> rm command has started to do its "rm stuff" (ufs_inactive, ffs_truncate,
>> ffs_indirtrunc at different levels, ffs_blkfree, ...), then we must loop
>> (that means wait) in vop_stdfsync() until the rm process has finished
>> its work. A lot of locking overhead is needed for coordination.
>> Returning from bufobj_wwait() we always see one leftover dirty buffer
>> (very seldom two), which is not optimal. Therefore I have tried the
>> following patch (instead of bumping maxretry):
>>
>> --- vfs_default.c.orig	2016-10-24 12:26:57.000000000 +0200
>> +++ vfs_default.c	2017-09-15 12:30:44.792274000 +0200
>> @@ -688,6 +688,8 @@
>>  			bremfree(bp);
>>  			bawrite(bp);
>>  		}
>> +		if (maxretry < 1000)
>> +			DELAY(waitns);
>>  		BO_LOCK(bo);
>>  		goto loop2;
>>  	}
>>
>> with different values for waitns. If I run the test loop 5000 times on my
>> test server, the problem is triggered roughly 10 times per run. The
>> results from several runs are given in the following table:
>>
>> waitns     max time   max loops
>> -------------------------------
>> no DELAY   0.5 sec    8650 (maxretry = 100000)
>> 1000       0.2 sec    24
>> 10000      0.8 sec    3
>> 100000     7.2 sec    3
>>
>> "time" means time spent in vop_stdfsync(), measured from entry to return
>> by a dtrace script. "loops" means the number of times "--maxretry" is
>> executed. I am not sure if DELAY() is the best way to wait or if waiting
>> has other drawbacks.
>> Anyway with DELAY() it does not take more than five
>> iterations to finish.
>
> This is not explicitly stated in your message, but I suppose that the
> vop_stdfsync() is called due to the VOP_FSYNC(devvp, MNT_SUSPEND) call in
> ffs_sync(). Am I right?

Yes, the stack trace given by the dtrace script always looks like this:

  4  22140  vop_stdfsync:entry
              kernel`devfs_fsync+0x26
              kernel`VOP_FSYNC_APV+0xa7
              kernel`ffs_sync+0x3bb
              kernel`vfs_write_suspend+0x1cd
              geom_journal.ko`g_journal_switcher+0x9a4
              kernel`fork_exit+0x9a
              kernel`0xffffffff8095502e

> If yes, then the solution is most likely to continue looping in the
> vop_stdfsync() until there are no dirty buffers or the mount point
> mnt_secondary_writes counter is zero. The pauses trick you tried might
> be still useful, e.g. after some threshold of the performed loop
> iterations.

I have checked your proposal and found that the mnt_secondary_writes
counter does indeed go to zero when the dirty buffers have reached zero.
During the loop the mnt_secondary_writes counter is always equal to one,
so there is nothing like the countdown that Kirk wanted to see.
A dtrace output (with a DELAY of 1 ms in the loop) for the biggest loop
count seen in a three-day test is this:

18  32865  kern_unlinkat:entry      path=bigfile, tid=101201, execname=rm
18  12747  ufs_remove:entry         gj=mirror/gmbkp4p5.journal, inum=11155630, blocks=22301568, size=11415525660
18  12748  ufs_remove:return        returncode=0, inum=11155630, blocks=22301568
18  18902  ffs_truncate:entry       gj=mirror/gmbkp4p5.journal, inum=11155630, size=11415525660, mnt_flag=0x12001040, mnt_kern_flag=0x40006142, blocks=22301568
 6  33304  vfs_write_suspend:entry  gj=mirror/gmbkp4p5.journal, mnt_kern_flag=0x40006142, tid=100181
 6  22140  vop_stdfsync:entry       mounted on /home, waitfor=1, numoutput=0, clean=10, dirty=6, secondary_writes=1
10  28117  bufobj_wwait:return      calls to bufobj_wait = 1, dirtycnt=2, secondary_writes=1
10  28117  bufobj_wwait:return      calls to bufobj_wait = 2, dirtycnt=1, secondary_writes=1
10  28117  bufobj_wwait:return      calls to bufobj_wait = 3, dirtycnt=1, secondary_writes=1
10  28117  bufobj_wwait:return      calls to bufobj_wait = 4, dirtycnt=3, secondary_writes=1
10  28117  bufobj_wwait:return      calls to bufobj_wait = 5, dirtycnt=2, secondary_writes=1
 6  28117  bufobj_wwait:return      calls to bufobj_wait = 6, dirtycnt=3, secondary_writes=1
 6  28117  bufobj_wwait:return      calls to bufobj_wait = 7, dirtycnt=3, secondary_writes=1
 6  28117  bufobj_wwait:return      calls to bufobj_wait = 8, dirtycnt=3, secondary_writes=1
 6  28117  bufobj_wwait:return      calls to bufobj_wait = 9, dirtycnt=3, secondary_writes=1
 6  28117  bufobj_wwait:return      calls to bufobj_wait = 10, dirtycnt=2, secondary_writes=1
 6  28117  bufobj_wwait:return      calls to bufobj_wait = 11, dirtycnt=2, secondary_writes=1
 6  28117  bufobj_wwait:return      calls to bufobj_wait = 12, dirtycnt=3, secondary_writes=1
 6  28117  bufobj_wwait:return      calls to bufobj_wait = 13, dirtycnt=3, secondary_writes=1
 6  28117  bufobj_wwait:return      calls to bufobj_wait = 14, dirtycnt=3, secondary_writes=1
 6  28117  bufobj_wwait:return      calls to bufobj_wait = 15, dirtycnt=4, secondary_writes=1
 6  28117  bufobj_wwait:return      calls to bufobj_wait = 16, dirtycnt=3, secondary_writes=1
 6  28117  bufobj_wwait:return      calls to bufobj_wait = 17, dirtycnt=3, secondary_writes=1
 2  18903  ffs_truncate:return      returncode=0, inum=11155630, blocks=0
 2  32866  kern_unlinkat:return     returncode=0, errno=0, number io's: 791/791
 6  28117  bufobj_wwait:return      calls to bufobj_wait = 18, dirtycnt=3, secondary_writes=0
 6  28117  bufobj_wwait:return      calls to bufobj_wait = 19, dirtycnt=0, secondary_writes=0
 6  22141  vop_stdfsync:return      returncode=0, pid=26, tid=100181, spent 240373850 nsecs

So the time spent in vop_stdfsync() is 0.24 sec in the worst case I found
using DELAY with 1 ms. I would prefer this solution. My first approach
(simply bumping maxretry from 1000 to 100000) is also ok, but the maximum
time spent then rises to 0.5 sec. Perhaps you would prefer something like

    if (maxretry < 1000 && maxretry % 10 == 0)
        DELAY(waitns);

That is also ok but does not make a noteworthy difference. The main
argument remains: we have to wait until there are no dirty buffers left.
For me the problem with the "giving up on dirty" message is solved.

-- 
Andreas Longwitz