Message-Id: <201709222220.v8MMKEW5085371@gw.catspoiler.org>
Date: Fri, 22 Sep 2017 15:20:14 -0700 (PDT)
From: Don Lewis <truckman@FreeBSD.org>
Subject: Re: fsync: giving up on dirty on ufs partitions running vfs_write_suspend()
To: longwitz@incore.de
Cc: kostikbel@gmail.com, mckusick@mckusick.com, freebsd-fs@freebsd.org
In-Reply-To: <59C4DF8D.5070004@incore.de>

On 22 Sep, Andreas Longwitz wrote:
> Konstantin Belousov schrieb:
>> On Thu, Sep 21, 2017 at 10:58:46AM +0200, Andreas Longwitz wrote:
>>> Konstantin Belousov wrote:
>>>> On Sat, Sep 16, 2017 at 01:44:44PM +0200, Andreas Longwitz wrote:
>>>>> Ok, I understand your thoughts about the "big loop" and I agree. On the
>>>>> other hand it is not easy to measure the progress of the dirty buffers,
>>>>> because these buffers are created by another process at the same time we
>>>>> loop in vop_stdfsync(). I can explain from my tests, where I use the
>>>>> following loop on a gjournaled partition:
>>>>>
>>>>>     while true; do
>>>>>         cp -p bigfile bigfile.tmp
>>>>>         rm bigfile
>>>>>         mv bigfile.tmp bigfile
>>>>>     done
>>>>>
>>>>> When g_journal_switcher starts vfs_write_suspend() immediately after the
>>>>> rm command has started to do its "rm stuff" (ufs_inactive, ffs_truncate,
>>>>> ffs_indirtrunc at different levels, ffs_blkfree, ...), then we must loop
>>>>> (that means wait) in vop_stdfsync() until the rm process has finished
>>>>> its work. A lot of locking overhead is needed for coordination.
>>>>> Returning from bufobj_wwait() we always see one dirty buffer left (very
>>>>> seldom two), which is not optimal. Therefore I have tried the following
>>>>> patch (instead of bumping maxretry):
>>>>>
>>>>> --- vfs_default.c.orig   2016-10-24 12:26:57.000000000 +0200
>>>>> +++ vfs_default.c        2017-09-15 12:30:44.792274000 +0200
>>>>> @@ -688,6 +688,8 @@
>>>>>                          bremfree(bp);
>>>>>                          bawrite(bp);
>>>>>                  }
>>>>> +                if (maxretry < 1000)
>>>>> +                        DELAY(waitns);
>>>>>                  BO_LOCK(bo);
>>>>>                  goto loop2;
>>>>>          }
>>>>>
>>>>> with different values for waitns. If I run the test loop 5000 times on my
>>>>> test server, the problem is always triggered roughly 10 times.
>>>>> The results from several runs are given in the following table:
>>>>>
>>>>>     waitns      max time   max loops
>>>>>     ---------------------------------
>>>>>     no DELAY    0.5 sec    8650  (maxretry = 100000)
>>>>>     1000        0.2 sec      24
>>>>>     10000       0.8 sec       3
>>>>>     100000      7.2 sec       3
>>>>>
>>>>> "time" means the time spent in vop_stdfsync(), measured from entry to
>>>>> return by a dtrace script. "loops" means the number of times "--maxretry"
>>>>> is executed. I am not sure if DELAY() is the best way to wait or whether
>>>>> waiting has other drawbacks. Anyway, with DELAY() it does not take more
>>>>> than five iterations to finish.
>>>> This is not explicitly stated in your message, but I suppose that
>>>> vop_stdfsync() is called due to the VOP_FSYNC(devvp, MNT_SUSPEND) call in
>>>> ffs_sync(). Am I right?
>>> Yes, the stack trace given by the dtrace script always looks like this:
>>>
>>>   4  22140  vop_stdfsync:entry
>>>               kernel`devfs_fsync+0x26
>>>               kernel`VOP_FSYNC_APV+0xa7
>>>               kernel`ffs_sync+0x3bb
>>>               kernel`vfs_write_suspend+0x1cd
>>>               geom_journal.ko`g_journal_switcher+0x9a4
>>>               kernel`fork_exit+0x9a
>>>               kernel`0xffffffff8095502e
>>>
>>>> If yes, then the solution is most likely to continue looping in
>>>> vop_stdfsync() until there are no dirty buffers or the mount point's
>>>> mnt_secondary_writes counter is zero. The pause trick you tried might
>>>> still be useful, e.g. after some threshold of performed loop iterations.
>>> I have checked your proposal and found that the mnt_secondary_writes
>>> counter indeed goes to zero when the dirty count reaches zero. During the
>>> loop the mnt_secondary_writes counter is always equal to one, so there is
>>> nothing like a countdown, which is what Kirk wanted to see.
>> This is because mnt_secondary_writes counts the number of threads which
>> have entered the vn_start_secondary_write() block and can potentially
>> issue a write dirtying a buffer. In principle, some writer may start the
>> secondary write block again even if the counter is zero, but in practice
>> some primary writer must make a modification for secondary writers to
>> have work.
>>
>> I.e., the change would not cover the problem well enough to claim it
>> completely solved, but for the current UFS code I doubt that the issue
>> can be triggered.
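
If I understand that suggestion correctly, the retry check after
bufobj_wwait() in vop_stdfsync() might become something like the untested
sketch below. Note that "mp" here is just a placeholder for the suspending
UFS mount, which would have to be passed in or derived somehow (devvp's own
v_mount is the devfs mount), and the unlocked read of mnt_secondary_writes
is only a heuristic:

		/*
		 * Untested sketch: retry as long as some thread is still
		 * inside a secondary-write section on the suspending
		 * mount, instead of giving up after a fixed number of
		 * passes.
		 */
		if (error == 0 &&
		    (--maxretry >= 0 || mp->mnt_secondary_writes > 0))
			goto loop1;
		error = EAGAIN;
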
>>> A dtrace output (with a DELAY of 1 ms in the loop) for the biggest loop
>>> count in a three-day test is this:
>>>
>>>  18  32865  kern_unlinkat:entry      path=bigfile, tid=101201, execname=rm
>>>  18  12747  ufs_remove:entry         gj=mirror/gmbkp4p5.journal,
>>>                                      inum=11155630, blocks=22301568,
>>>                                      size=11415525660
>>>  18  12748  ufs_remove:return        returncode=0, inum=11155630,
>>>                                      blocks=22301568
>>>  18  18902  ffs_truncate:entry       gj=mirror/gmbkp4p5.journal,
>>>                                      inum=11155630, size=11415525660,
>>>                                      mnt_flag=0x12001040,
>>>                                      mnt_kern_flag=0x40006142,
>>>                                      blocks=22301568
>>>   6  33304  vfs_write_suspend:entry  gj=mirror/gmbkp4p5.journal,
>>>                                      mnt_kern_flag=0x40006142, tid=100181
>>>   6  22140  vop_stdfsync:entry       mounted on /home, waitfor=1,
>>>                                      numoutput=0, clean=10, dirty=6,
>>>                                      secondary_writes=1
>>>  10  28117  bufobj_wwait:return      calls to bufobj_wait =  1, dirtycnt=2, secondary_writes=1
>>>  10  28117  bufobj_wwait:return      calls to bufobj_wait =  2, dirtycnt=1, secondary_writes=1
>>>  10  28117  bufobj_wwait:return      calls to bufobj_wait =  3, dirtycnt=1, secondary_writes=1
>>>  10  28117  bufobj_wwait:return      calls to bufobj_wait =  4, dirtycnt=3, secondary_writes=1
>>>  10  28117  bufobj_wwait:return      calls to bufobj_wait =  5, dirtycnt=2, secondary_writes=1
>>>   6  28117  bufobj_wwait:return      calls to bufobj_wait =  6, dirtycnt=3, secondary_writes=1
>>>   6  28117  bufobj_wwait:return      calls to bufobj_wait =  7, dirtycnt=3, secondary_writes=1
>>>   6  28117  bufobj_wwait:return      calls to bufobj_wait =  8, dirtycnt=3, secondary_writes=1
>>>   6  28117  bufobj_wwait:return      calls to bufobj_wait =  9, dirtycnt=3, secondary_writes=1
>>>   6  28117  bufobj_wwait:return      calls to bufobj_wait = 10, dirtycnt=2, secondary_writes=1
>>>   6  28117  bufobj_wwait:return      calls to bufobj_wait = 11, dirtycnt=2, secondary_writes=1
>>>   6  28117  bufobj_wwait:return      calls to bufobj_wait = 12, dirtycnt=3, secondary_writes=1
>>>   6  28117  bufobj_wwait:return      calls to bufobj_wait = 13, dirtycnt=3, secondary_writes=1
>>>   6  28117  bufobj_wwait:return      calls to bufobj_wait = 14, dirtycnt=3, secondary_writes=1
>>>   6  28117  bufobj_wwait:return      calls to bufobj_wait = 15, dirtycnt=4, secondary_writes=1
>>>   6  28117  bufobj_wwait:return      calls to bufobj_wait = 16, dirtycnt=3, secondary_writes=1
>>>   6  28117  bufobj_wwait:return      calls to bufobj_wait = 17, dirtycnt=3, secondary_writes=1
>>>   2  18903  ffs_truncate:return      returncode=0, inum=11155630, blocks=0
>>>   2  32866  kern_unlinkat:return     returncode=0, errno=0, number io's: 791/791
>>>   6  28117  bufobj_wwait:return      calls to bufobj_wait = 18, dirtycnt=3, secondary_writes=0
>>>   6  28117  bufobj_wwait:return      calls to bufobj_wait = 19, dirtycnt=0, secondary_writes=0
>>>   6  22141  vop_stdfsync:return      returncode=0, pid=26, tid=100181, spent 240373850 nsecs
>>>
>>> So the time spent in vop_stdfsync() is 0.24 sec in the worst case I
>>> found using DELAY with 1 ms. I would prefer this solution. My first
>>> approach (simply bumping maxretry from 1000 to 100000) is also ok, but
>>> the maximum time spent can rise to 0.5 sec. Perhaps you would like
>>> something like
>>>
>>>     if (maxretry < 1000 && maxretry % 10 == 0)
>>>             DELAY(waitns);
>>>
>>> That is also ok but does not make a noteworthy difference. The main
>>> argument remains: we have to wait until there are no dirty buffers left.
>>>
>>> For me the problem with the "giving up on dirty" message is solved.
>> Will you provide the patch?
>
> Patch against HEAD:
>
> --- vfs_default.c.orig   2017-09-22 11:56:26.950084000 +0200
> +++ vfs_default.c        2017-09-22 11:58:33.211196000 +0200
> @@ -690,6 +690,8 @@
>                          bremfree(bp);
>                          bawrite(bp);
>                  }
> +                if (maxretry < 1000)
> +                        DELAY(1000);    /* 1 ms */
>                  BO_LOCK(bo);
>                  goto loop2;
>          }

Do you need to use a busy loop here, or can you yield the CPU by using
something like pause(9)?
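For instance, something along these lines (just an untested sketch, not a
finished patch; "dirtyw" is an arbitrary wmesg, and one tick is roughly 1 ms
with the default hz=1000):

		if (maxretry < 1000)
			pause("dirtyw", 1);	/* sleep ~1 tick instead of spinning */
		BO_LOCK(bo);
		goto loop2;

That would give up the CPU while we wait for the other process to finish
dirtying buffers, instead of busy-waiting in DELAY().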