From: Konstantin Belousov <kostikbel@gmail.com>
To: Andreas Longwitz
Cc: Kirk McKusick, freebsd-fs@freebsd.org
Date: Thu, 21 Sep 2017 20:29:02 +0300
Subject: Re: fsync: giving up on dirty on ufs partitions running vfs_write_suspend()
Message-ID: <20170921172902.GW78693@kib.kiev.ua>
In-Reply-To: <59C37F46.80509@incore.de>
References: <201709110519.v8B5JVmf060773@chez.mckusick.com> <59BD0EAC.8030206@incore.de> <20170916183117.GF78693@kib.kiev.ua> <59C37F46.80509@incore.de>

On Thu, Sep 21, 2017 at 10:58:46AM +0200, Andreas Longwitz wrote:
> Konstantin Belousov wrote:
> > On Sat, Sep 16, 2017 at 01:44:44PM +0200, Andreas Longwitz wrote:
> >> Ok, I understand your thoughts about the "big loop" and I agree. On the
> >> other side it is not easy to measure the progress of the dirty buffers,
> >> because these buffers are created by another process at the same time
> >> as we loop in vop_stdfsync(). I can explain from my tests, where I use
> >> the following loop on a gjournaled partition:
> >>
> >>     while true; do
> >>         cp -p bigfile bigfile.tmp
> >>         rm bigfile
> >>         mv bigfile.tmp bigfile
> >>     done
> >>
> >> When g_journal_switcher starts vfs_write_suspend() immediately after
> >> the rm command has started to do its "rm stuff" (ufs_inactive,
> >> ffs_truncate, ffs_indirtrunc at different levels, ffs_blkfree, ...),
> >> then we must loop (that means wait) in vop_stdfsync() until the rm
> >> process has finished its work. A lot of locking overhead is needed
> >> for coordination. Returning from bufobj_wwait() we always see one
> >> dirty buffer left (very seldom two), which is not optimal.
> >> Therefore I have tried the following patch (instead of bumping
> >> maxretry):
> >>
> >> --- vfs_default.c.orig	2016-10-24 12:26:57.000000000 +0200
> >> +++ vfs_default.c	2017-09-15 12:30:44.792274000 +0200
> >> @@ -688,6 +688,8 @@
> >>  			bremfree(bp);
> >>  			bawrite(bp);
> >>  		}
> >> +		if (maxretry < 1000)
> >> +			DELAY(waitns);
> >>  		BO_LOCK(bo);
> >>  		goto loop2;
> >>  	}
> >>
> >> with different values for waitns. If I run the test loop 5000 times on
> >> my test server, the problem is always triggered roughly 10 times. The
> >> results from several runs are given in the following table:
> >>
> >>     waitns      max time   max loops
> >>     -----------------------------------
> >>     no DELAY    0.5 sec    8650  (with maxretry = 100000)
> >>     1000        0.2 sec    24
> >>     10000       0.8 sec    3
> >>     100000      7.2 sec    3
> >>
> >> "time" means the time spent in vop_stdfsync(), measured from entry to
> >> return by a dtrace script. "loops" means the number of times
> >> "--maxretry" is executed. I am not sure whether DELAY() is the best way
> >> to wait or whether waiting has other drawbacks. Anyway, with DELAY() it
> >> does not take more than five iterations to finish.
> > This is not explicitly stated in your message, but I suppose that
> > vop_stdfsync() is called due to the VOP_FSYNC(devvp, MNT_SUSPEND) call
> > in ffs_sync(). Am I right?
>
> Yes, the stack trace given by the dtrace script always looks like this:
>
>    4  22140  vop_stdfsync:entry
>                kernel`devfs_fsync+0x26
>                kernel`VOP_FSYNC_APV+0xa7
>                kernel`ffs_sync+0x3bb
>                kernel`vfs_write_suspend+0x1cd
>                geom_journal.ko`g_journal_switcher+0x9a4
>                kernel`fork_exit+0x9a
>                kernel`0xffffffff8095502e
>
> > If yes, then the solution is most likely to continue looping in
> > vop_stdfsync() until there are no dirty buffers or the mount point's
> > mnt_secondary_writes counter is zero. The pause trick you tried might
> > still be useful, e.g. after some threshold of performed loop
> > iterations.
>
> I have checked your proposal and found that indeed the
> mnt_secondary_writes counter goes to zero when the dirty buffers have
> reached zero. During the loop the mnt_secondary_writes counter is always
> equal to one, so there is nothing like a countdown, which is what Kirk
> wanted to see.

This is because mnt_secondary_writes counts the number of threads which
have entered the vn_start_secondary_write() block and potentially can
issue a write dirtying a buffer. In principle, some writer may start the
secondary write block again even if the counter is zero, but practically
some primary writer must make a modification for secondary writers to
have work. I.e., the change would not cover the problem well enough to
claim it completely solved, but for the current UFS code I doubt that
the issue can be triggered.
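To make this concrete, here is an untested sketch of the final wait in
vop_stdfsync(); bo, maxretry, error and the loop1 label are as in the
existing code, while handing the suspended UFS mount point down as "mp"
is an assumption (devvp belongs to devfs, so devvp->v_mount is not the
mount being suspended), and pauses/PAUSE_THRESHOLD are invented for the
sketch:

	/*
	 * Untested sketch, not a finished patch: do not consume
	 * maxretry while some thread on the suspended mount is still
	 * inside a vn_start_secondary_write() section and thus can
	 * dirty more buffers.  Pause briefly after a threshold, as in
	 * the DELAY() experiment above; drop the bufobj lock around
	 * the spin, as your patch does.
	 */
	if (waitfor == MNT_WAIT) {
		bufobj_wwait(bo, 0, 0);
		if (bo->bo_dirty.bv_cnt > 0) {
			if (mp != NULL && mp->mnt_secondary_writes > 0) {
				BO_UNLOCK(bo);
				if (++pauses > PAUSE_THRESHOLD)
					DELAY(1000);	/* 1 ms */
				BO_LOCK(bo);
				goto loop1;	/* writers active, retry */
			}
			if (--maxretry != 0)
				goto loop1;
			error = EAGAIN;
		}
	}

This way maxretry only starts counting down once no secondary writers
remain, so the retry bound applies to real flushing work and not to
waiting for the unlink to finish.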
> A dtrace output (with a DELAY of 1 ms in the loop) for the biggest
> loop count in a three-day test is this:
>
>   18  32865  kern_unlinkat:entry      path=bigfile, tid=101201, execname=rm
>   18  12747  ufs_remove:entry         gj=mirror/gmbkp4p5.journal,
>              inum=11155630, blocks=22301568, size=11415525660
>   18  12748  ufs_remove:return        returncode=0, inum=11155630,
>              blocks=22301568
>   18  18902  ffs_truncate:entry       gj=mirror/gmbkp4p5.journal,
>              inum=11155630, size=11415525660, mnt_flag=0x12001040,
>              mnt_kern_flag=0x40006142, blocks=22301568
>    6  33304  vfs_write_suspend:entry  gj=mirror/gmbkp4p5.journal,
>              mnt_kern_flag=0x40006142, tid=100181
>    6  22140  vop_stdfsync:entry       mounted on /home, waitfor=1,
>              numoutput=0, clean=10, dirty=6, secondary_writes=1
>   10  28117  bufobj_wwait:return  calls to bufobj_wait =  1, dirtycnt=2, secondary_writes=1
>   10  28117  bufobj_wwait:return  calls to bufobj_wait =  2, dirtycnt=1, secondary_writes=1
>   10  28117  bufobj_wwait:return  calls to bufobj_wait =  3, dirtycnt=1, secondary_writes=1
>   10  28117  bufobj_wwait:return  calls to bufobj_wait =  4, dirtycnt=3, secondary_writes=1
>   10  28117  bufobj_wwait:return  calls to bufobj_wait =  5, dirtycnt=2, secondary_writes=1
>    6  28117  bufobj_wwait:return  calls to bufobj_wait =  6, dirtycnt=3, secondary_writes=1
>    6  28117  bufobj_wwait:return  calls to bufobj_wait =  7, dirtycnt=3, secondary_writes=1
>    6  28117  bufobj_wwait:return  calls to bufobj_wait =  8, dirtycnt=3, secondary_writes=1
>    6  28117  bufobj_wwait:return  calls to bufobj_wait =  9, dirtycnt=3, secondary_writes=1
>    6  28117  bufobj_wwait:return  calls to bufobj_wait = 10, dirtycnt=2, secondary_writes=1
>    6  28117  bufobj_wwait:return  calls to bufobj_wait = 11, dirtycnt=2, secondary_writes=1
>    6  28117  bufobj_wwait:return  calls to bufobj_wait = 12, dirtycnt=3, secondary_writes=1
>    6  28117  bufobj_wwait:return  calls to bufobj_wait = 13, dirtycnt=3, secondary_writes=1
>    6  28117  bufobj_wwait:return  calls to bufobj_wait = 14, dirtycnt=3, secondary_writes=1
>    6  28117  bufobj_wwait:return  calls to bufobj_wait = 15, dirtycnt=4, secondary_writes=1
>    6  28117  bufobj_wwait:return  calls to bufobj_wait = 16, dirtycnt=3, secondary_writes=1
>    6  28117  bufobj_wwait:return  calls to bufobj_wait = 17, dirtycnt=3, secondary_writes=1
>    2  18903  ffs_truncate:return      returncode=0, inum=11155630, blocks=0
>    2  32866  kern_unlinkat:return     returncode=0, errno=0,
>              number io's: 791/791
>    6  28117  bufobj_wwait:return  calls to bufobj_wait = 18, dirtycnt=3, secondary_writes=0
>    6  28117  bufobj_wwait:return  calls to bufobj_wait = 19, dirtycnt=0, secondary_writes=0
>    6  22141  vop_stdfsync:return      returncode=0, pid=26, tid=100181,
>              spent 240373850 nsecs
>
> So the time spent in vop_stdfsync() is 0.24 sec in the worst case I
> found using DELAY with 1 ms. I would prefer this solution. My first
> approach (simply bumping maxretry from 1000 to 100000) is also ok, but
> the maximum time spent then rises to 0.5 sec. Perhaps you would like
> something like
>
>     if (maxretry < 1000 && maxretry % 10 == 0)
>             DELAY(waitns);
>
> That is also ok but does not make a noteworthy difference. The main
> argument remains: we have to wait until there are no dirty buffers left.
>
> For me the problem with the "giving up on dirty" is solved.

Will you provide the patch?
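For reference, entry-to-return timings like the "spent 240373850 nsecs"
line above can be captured with a D script as small as the following
sketch (assuming the fbt provider; the script actually used also printed
the buffer counts and secondary_writes):

	#pragma D option quiet

	/* Sketch only: time every vop_stdfsync() call, entry to return. */
	fbt::vop_stdfsync:entry
	{
		self->ts = timestamp;
	}

	fbt::vop_stdfsync:return
	/self->ts/
	{
		printf("spent %d nsecs in vop_stdfsync()\n",
		    timestamp - self->ts);
		self->ts = 0;
	}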