Date: Fri, 22 Sep 2017 12:01:49 +0200
From: Andreas Longwitz <longwitz@incore.de>
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: Kirk McKusick <mckusick@mckusick.com>, freebsd-fs@freebsd.org
Subject: Re: fsync: giving up on dirty on ufs partitions running vfs_write_suspend()
Message-ID: <59C4DF8D.5070004@incore.de>
In-Reply-To: <20170921172902.GW78693@kib.kiev.ua>
References: <201709110519.v8B5JVmf060773@chez.mckusick.com> <59BD0EAC.8030206@incore.de> <20170916183117.GF78693@kib.kiev.ua> <59C37F46.80509@incore.de> <20170921172902.GW78693@kib.kiev.ua>

Konstantin Belousov wrote:
> On Thu, Sep 21, 2017 at 10:58:46AM +0200, Andreas Longwitz wrote:
>> Konstantin Belousov wrote:
>>> On Sat, Sep 16, 2017 at 01:44:44PM +0200, Andreas Longwitz wrote:
>>>> Ok, I understand your thoughts about the "big loop" and I agree. On
>>>> the other side it is not easy to measure the progress of the dirty
>>>> buffers, because these buffers are created by another process at the
>>>> same time we loop in vop_stdfsync(). I can explain from my tests,
>>>> where I use the following loop on a gjournaled partition:
>>>>
>>>>    while true; do
>>>>       cp -p bigfile bigfile.tmp
>>>>       rm bigfile
>>>>       mv bigfile.tmp bigfile
>>>>    done
>>>>
>>>> When g_journal_switcher starts vfs_write_suspend() immediately after
>>>> the rm command has started to do its "rm stuff" (ufs_inactive,
>>>> ffs_truncate, ffs_indirtrunc at different levels, ffs_blkfree, ...),
>>>> then we must loop (that means wait) in vop_stdfsync() until the rm
>>>> process has finished its work. A lot of locking overhead is needed
>>>> for coordination. Returning from bufobj_wwait() we always see one
>>>> dirty buffer left (very seldom two), which is not optimal. Therefore
>>>> I have tried the following patch (instead of bumping maxretry):
>>>>
>>>> --- vfs_default.c.orig	2016-10-24 12:26:57.000000000 +0200
>>>> +++ vfs_default.c	2017-09-15 12:30:44.792274000 +0200
>>>> @@ -688,6 +688,8 @@
>>>>  			bremfree(bp);
>>>>  			bawrite(bp);
>>>>  		}
>>>> +		if (maxretry < 1000)
>>>> +			DELAY(waitns);
>>>>  		BO_LOCK(bo);
>>>>  		goto loop2;
>>>>  	}
>>>>
>>>> with different values for waitns. If I run the test loop 5000 times
>>>> on my test server, the problem is always triggered roughly 10 times.
>>>> The results from several runs are given in the following table:
>>>>
>>>>    waitns     max time   max loops
>>>>    ---------------------------------
>>>>    no DELAY   0.5 sec    8650 (maxretry = 100000)
>>>>    1000       0.2 sec    24
>>>>    10000      0.8 sec    3
>>>>    100000     7.2 sec    3
>>>>
>>>> "time" means the time spent in vop_stdfsync(), measured from entry to
>>>> return by a dtrace script. "loops" means the number of times
>>>> "--maxretry" is executed. I am not sure if DELAY() is the best way to
>>>> wait or if waiting has other drawbacks. Anyway, with DELAY() it does
>>>> not take more than five iterations to finish.
>>> This is not explicitly stated in your message, but I suppose that
>>> vop_stdfsync() is called due to the VOP_FSYNC(devvp, MNT_SUSPEND) call
>>> in ffs_sync(). Am I right?
>> Yes, the stack trace given by the dtrace script always looks like this:
>>
>>    4  22140  vop_stdfsync:entry
>>               kernel`devfs_fsync+0x26
>>               kernel`VOP_FSYNC_APV+0xa7
>>               kernel`ffs_sync+0x3bb
>>               kernel`vfs_write_suspend+0x1cd
>>               geom_journal.ko`g_journal_switcher+0x9a4
>>               kernel`fork_exit+0x9a
>>               kernel`0xffffffff8095502e
>>
>>> If yes, then the solution is most likely to continue looping in
>>> vop_stdfsync() until there are no dirty buffers or the mount point's
>>> mnt_secondary_writes counter is zero. The pause trick you tried might
>>> still be useful, e.g. after some threshold of performed loop
>>> iterations.
>> I have checked your proposal and found that indeed the
>> mnt_secondary_writes counter goes to zero when the dirty count has
>> reached zero. During the loop the mnt_secondary_writes counter is
>> always equal to one, so there is nothing like the countdown that Kirk
>> wanted to see.
> This is because mnt_secondary_writes counts the number of threads which
> entered the vn_start_secondary_write() block and potentially can issue
> a write dirtying a buffer. In principle, some writer may start the
> secondary write block again even if the counter is zero, but in
> practice some primary writer must make a modification for secondary
> writers to have work.
>
> I.e., the change would not cover the problem well enough to claim it is
> completely solved, but for the current UFS code I doubt that the issue
> can be triggered.
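
If I understand the proposal correctly, the stop condition for the retry
loop would look roughly like the helper below. This is only a sketch:
the function name and the ufsmp argument are made up for illustration,
and how vop_stdfsync() would get hold of the suspended UFS mount point
is left open here.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/buf.h>
#include <sys/bufobj.h>
#include <sys/mount.h>

/*
 * Sketch only: stop retrying when no dirty buffers are left, or when no
 * secondary writer remains that could dirty more buffers.  The caller
 * is expected to hold BO_LOCK(bo).
 */
static int
flush_can_stop(struct bufobj *bo, struct mount *ufsmp)
{

	if (bo->bo_dirty.bv_cnt == 0)
		return (1);	/* nothing left to write */
	if (ufsmp != NULL && ufsmp->mnt_secondary_writes == 0)
		return (1);	/* no secondary writer is active */
	return (0);	/* keep looping, possibly with a pause */
}

With such a check the loop would only keep retrying (and pausing) while
a secondary writer is still active; in my tests the counter stays at one
for the whole loop and drops to zero together with the dirty count, as
the dtrace output below shows.
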
>> A dtrace output (with DELAY of 1 ms in the loop) for the biggest loop
>> count in a three-day test is this:
>>
>>   18  32865  kern_unlinkat:entry      path=bigfile, tid=101201, tid=101201, execname=rm
>>   18  12747  ufs_remove:entry         gj=mirror/gmbkp4p5.journal, inum=11155630, blocks=22301568, size=11415525660
>>   18  12748  ufs_remove:return        returncode=0, inum=11155630, blocks=22301568
>>   18  18902  ffs_truncate:entry       gj=mirror/gmbkp4p5.journal, inum=11155630, size=11415525660, mnt_flag=0x12001040, mnt_kern_flag=0x40006142, blocks=22301568
>>    6  33304  vfs_write_suspend:entry  gj=mirror/gmbkp4p5.journal, mnt_kern_flag=0x40006142, tid=100181
>>    6  22140  vop_stdfsync:entry       mounted on /home, waitfor=1, numoutput=0, clean=10, dirty=6, secondary_writes=1
>>   10  28117  bufobj_wwait:return      calls to bufobj_wait = 1,  dirtycnt=2, secondary_writes=1
>>   10  28117  bufobj_wwait:return      calls to bufobj_wait = 2,  dirtycnt=1, secondary_writes=1
>>   10  28117  bufobj_wwait:return      calls to bufobj_wait = 3,  dirtycnt=1, secondary_writes=1
>>   10  28117  bufobj_wwait:return      calls to bufobj_wait = 4,  dirtycnt=3, secondary_writes=1
>>   10  28117  bufobj_wwait:return      calls to bufobj_wait = 5,  dirtycnt=2, secondary_writes=1
>>    6  28117  bufobj_wwait:return      calls to bufobj_wait = 6,  dirtycnt=3, secondary_writes=1
>>    6  28117  bufobj_wwait:return      calls to bufobj_wait = 7,  dirtycnt=3, secondary_writes=1
>>    6  28117  bufobj_wwait:return      calls to bufobj_wait = 8,  dirtycnt=3, secondary_writes=1
>>    6  28117  bufobj_wwait:return      calls to bufobj_wait = 9,  dirtycnt=3, secondary_writes=1
>>    6  28117  bufobj_wwait:return      calls to bufobj_wait = 10, dirtycnt=2, secondary_writes=1
>>    6  28117  bufobj_wwait:return      calls to bufobj_wait = 11, dirtycnt=2, secondary_writes=1
>>    6  28117  bufobj_wwait:return      calls to bufobj_wait = 12, dirtycnt=3, secondary_writes=1
>>    6  28117  bufobj_wwait:return      calls to bufobj_wait = 13, dirtycnt=3, secondary_writes=1
>>    6  28117  bufobj_wwait:return      calls to bufobj_wait = 14, dirtycnt=3, secondary_writes=1
>>    6  28117  bufobj_wwait:return      calls to bufobj_wait = 15, dirtycnt=4, secondary_writes=1
>>    6  28117  bufobj_wwait:return      calls to bufobj_wait = 16, dirtycnt=3, secondary_writes=1
>>    6  28117  bufobj_wwait:return      calls to bufobj_wait = 17, dirtycnt=3, secondary_writes=1
>>    2  18903  ffs_truncate:return      returncode=0, inum=11155630, blocks=0
>>    2  32866  kern_unlinkat:return     returncode=0, errno=0, number io's: 791/791
>>    6  28117  bufobj_wwait:return      calls to bufobj_wait = 18, dirtycnt=3, secondary_writes=0
>>    6  28117  bufobj_wwait:return      calls to bufobj_wait = 19, dirtycnt=0, secondary_writes=0
>>    6  22141  vop_stdfsync:return      returncode=0, pid=26, tid=100181, spent 240373850 nsecs
>>
>> So the time spent in vop_stdfsync() is 0.24 sec in the worst case I
>> found using DELAY with 1 ms. I would prefer this solution. My first
>> approach (simply bumping maxretry from 1000 to 100000) is also ok, but
>> the maximum time spent rises up to 0.5 sec. Perhaps you would like
>> something like
>>
>>    if (maxretry < 1000 && maxretry % 10 == 0)
>>            DELAY(waitns);
>>
>> That is also ok but does not make a noteworthy difference.
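
To be explicit about the guard and the units (this is only my reading of
vfs_default.c, so take the snippet as a sketch): maxretry starts at 1000
and is counted down only when vop_stdfsync() has to retry after
bufobj_wwait() still sees dirty buffers, so the test "maxretry < 1000"
is false on the first pass and the pause costs nothing in the common
case. DELAY() busy-waits for the given number of microseconds, so 1000
gives the 1 ms pause used in the measurements above:

	/*
	 * Sketch of the throttled variant, not a verbatim diff against
	 * vfs_default.c: busy-wait for 1 ms, but only after at least one
	 * retry has happened and only when maxretry is a multiple of 10.
	 */
	if (maxretry < 1000 && maxretry % 10 == 0)
		DELAY(1000);		/* DELAY() takes microseconds */
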
>> The main argument remains: we have to wait until there are no dirty
>> buffers left.
>>
>> For me the problem with the "giving up on dirty" is solved.
> Will you provide the patch?

Patch against HEAD:

--- vfs_default.c.orig	2017-09-22 11:56:26.950084000 +0200
+++ vfs_default.c	2017-09-22 11:58:33.211196000 +0200
@@ -690,6 +690,8 @@
 			bremfree(bp);
 			bawrite(bp);
 		}
+		if (maxretry < 1000)
+			DELAY(1000);	/* 1 ms */
 		BO_LOCK(bo);
 		goto loop2;
 	}

-- 
Andreas Longwitz