From: Konstantin Belousov <kostikbel@gmail.com>
To: Andreas Longwitz
Cc: Kirk McKusick, freebsd-fs@freebsd.org
Date: Thu, 21 Sep 2017 20:29:02 +0300
Subject: Re: fsync: giving up on dirty on ufs partitions running vfs_write_suspend()
Message-ID: <20170921172902.GW78693@kib.kiev.ua>
In-Reply-To: <59C37F46.80509@incore.de>
References: <201709110519.v8B5JVmf060773@chez.mckusick.com> <59BD0EAC.8030206@incore.de> <20170916183117.GF78693@kib.kiev.ua> <59C37F46.80509@incore.de>

On Thu, Sep 21, 2017 at 10:58:46AM +0200, Andreas Longwitz wrote:
> Konstantin Belousov wrote:
> > On Sat, Sep 16, 2017 at 01:44:44PM +0200, Andreas Longwitz wrote:
> >> Ok, I understand your thoughts about the "big loop" and I agree. On the
> >> other side it is not easy to measure the progress of the dirty buffers,
> >> because these buffers are created by another process at the same time
> >> as we loop in vop_stdfsync(). I can explain from my tests, where I use
> >> the following loop on a gjournaled partition:
> >>
> >>     while true; do
> >>         cp -p bigfile bigfile.tmp
> >>         rm bigfile
> >>         mv bigfile.tmp bigfile
> >>     done
> >>
> >> When g_journal_switcher starts vfs_write_suspend() immediately after
> >> the rm command has started to do its "rm stuff" (ufs_inactive,
> >> ffs_truncate, ffs_indirtrunc at different levels, ffs_blkfree, ...),
> >> then we must loop (that means wait) in vop_stdfsync() until the rm
> >> process has finished its work. A lot of locking overhead is needed
> >> for coordination. Returning from bufobj_wwait() we always see one
> >> dirty buffer left (very seldom two), which is not optimal.
> >> Therefore I have tried the following patch (instead of bumping
> >> maxretry):
> >>
> >> --- vfs_default.c.orig	2016-10-24 12:26:57.000000000 +0200
> >> +++ vfs_default.c	2017-09-15 12:30:44.792274000 +0200
> >> @@ -688,6 +688,8 @@
> >>  			bremfree(bp);
> >>  			bawrite(bp);
> >>  		}
> >> +		if (maxretry < 1000)
> >> +			DELAY(waitns);
> >>  		BO_LOCK(bo);
> >>  		goto loop2;
> >>  	}
> >>
> >> with different values for waitns. If I run the test loop 5000 times on
> >> my test server, the problem is always triggered roughly 10 times. The
> >> results from several runs are given in the following table:
> >>
> >>     waitns      max time   max loops
> >>     -----------------------------------
> >>     no DELAY    0.5 sec    8650  (with maxretry = 100000)
> >>     1000        0.2 sec    24
> >>     10000       0.8 sec    3
> >>     100000      7.2 sec    3
> >>
> >> "time" means the time spent in vop_stdfsync(), measured from entry to
> >> return by a dtrace script. "loops" means the number of times
> >> "--maxretry" is executed. I am not sure whether DELAY() is the best way
> >> to wait or whether waiting has other drawbacks. Anyway, with DELAY() it
> >> does not take more than five iterations to finish.
> > This is not explicitly stated in your message, but I suppose that
> > vop_stdfsync() is called due to the VOP_FSYNC(devvp, MNT_SUSPEND) call
> > in ffs_sync(). Am I right?
>
> Yes, the stack trace given by the dtrace script always looks like this:
>
>    4  22140  vop_stdfsync:entry
>                kernel`devfs_fsync+0x26
>                kernel`VOP_FSYNC_APV+0xa7
>                kernel`ffs_sync+0x3bb
>                kernel`vfs_write_suspend+0x1cd
>                geom_journal.ko`g_journal_switcher+0x9a4
>                kernel`fork_exit+0x9a
>                kernel`0xffffffff8095502e
>
> > If yes, then the solution is most likely to continue looping in
> > vop_stdfsync() until there are no dirty buffers or the mount point's
> > mnt_secondary_writes counter is zero. The pause trick you tried might
> > still be useful, e.g. after some threshold of performed loop
> > iterations.
>
> I have checked your proposal and found that indeed the
> mnt_secondary_writes counter goes to zero when the dirty buffers have
> reached zero. During the loop the mnt_secondary_writes counter is always
> equal to one, so there is nothing like a countdown, which is what Kirk
> wanted to see.

This is because mnt_secondary_writes counts the number of threads which
have entered the vn_start_secondary_write() block and potentially can
issue a write dirtying a buffer. In principle, some writer may start the
secondary write block again even if the counter is zero, but practically
some primary writer must make a modification for secondary writers to
have work. I.e., the change would not cover the problem well enough to
claim it completely solved, but for the current UFS code I doubt that
the issue can be triggered.
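To make this concrete, here is an untested sketch of the final wait in
vop_stdfsync(); bo, maxretry, error and the loop1 label are as in the
existing code, while handing the suspended UFS mount point down as "mp"
is an assumption (devvp belongs to devfs, so devvp->v_mount is not the
mount being suspended), and pauses/PAUSE_THRESHOLD are invented for the
sketch:

	/*
	 * Untested sketch, not a finished patch: do not consume
	 * maxretry while some thread on the suspended mount is still
	 * inside a vn_start_secondary_write() section and thus can
	 * dirty more buffers.  Pause briefly after a threshold, as in
	 * the DELAY() experiment above; drop the bufobj lock around
	 * the spin, as your patch does.
	 */
	if (waitfor == MNT_WAIT) {
		bufobj_wwait(bo, 0, 0);
		if (bo->bo_dirty.bv_cnt > 0) {
			if (mp != NULL && mp->mnt_secondary_writes > 0) {
				BO_UNLOCK(bo);
				if (++pauses > PAUSE_THRESHOLD)
					DELAY(1000);	/* 1 ms */
				BO_LOCK(bo);
				goto loop1;	/* writers active, retry */
			}
			if (--maxretry != 0)
				goto loop1;
			error = EAGAIN;
		}
	}

This way maxretry only starts counting down once no secondary writers
remain, so the retry bound applies to real flushing work and not to
waiting for the unlink to finish.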
> A dtrace output (with a DELAY of 1 ms in the loop) for the biggest
> loop count in a three-day test is this:
>
>   18  32865  kern_unlinkat:entry      path=bigfile, tid=101201, execname=rm
>   18  12747  ufs_remove:entry         gj=mirror/gmbkp4p5.journal,
>              inum=11155630, blocks=22301568, size=11415525660
>   18  12748  ufs_remove:return        returncode=0, inum=11155630,
>              blocks=22301568
>   18  18902  ffs_truncate:entry       gj=mirror/gmbkp4p5.journal,
>              inum=11155630, size=11415525660, mnt_flag=0x12001040,
>              mnt_kern_flag=0x40006142, blocks=22301568
>    6  33304  vfs_write_suspend:entry  gj=mirror/gmbkp4p5.journal,
>              mnt_kern_flag=0x40006142, tid=100181
>    6  22140  vop_stdfsync:entry       mounted on /home, waitfor=1,
>              numoutput=0, clean=10, dirty=6, secondary_writes=1
>   10  28117  bufobj_wwait:return  calls to bufobj_wait =  1, dirtycnt=2, secondary_writes=1
>   10  28117  bufobj_wwait:return  calls to bufobj_wait =  2, dirtycnt=1, secondary_writes=1
>   10  28117  bufobj_wwait:return  calls to bufobj_wait =  3, dirtycnt=1, secondary_writes=1
>   10  28117  bufobj_wwait:return  calls to bufobj_wait =  4, dirtycnt=3, secondary_writes=1
>   10  28117  bufobj_wwait:return  calls to bufobj_wait =  5, dirtycnt=2, secondary_writes=1
>    6  28117  bufobj_wwait:return  calls to bufobj_wait =  6, dirtycnt=3, secondary_writes=1
>    6  28117  bufobj_wwait:return  calls to bufobj_wait =  7, dirtycnt=3, secondary_writes=1
>    6  28117  bufobj_wwait:return  calls to bufobj_wait =  8, dirtycnt=3, secondary_writes=1
>    6  28117  bufobj_wwait:return  calls to bufobj_wait =  9, dirtycnt=3, secondary_writes=1
>    6  28117  bufobj_wwait:return  calls to bufobj_wait = 10, dirtycnt=2, secondary_writes=1
>    6  28117  bufobj_wwait:return  calls to bufobj_wait = 11, dirtycnt=2, secondary_writes=1
>    6  28117  bufobj_wwait:return  calls to bufobj_wait = 12, dirtycnt=3, secondary_writes=1
>    6  28117  bufobj_wwait:return  calls to bufobj_wait = 13, dirtycnt=3, secondary_writes=1
>    6  28117  bufobj_wwait:return  calls to bufobj_wait = 14, dirtycnt=3, secondary_writes=1
>    6  28117  bufobj_wwait:return  calls to bufobj_wait = 15, dirtycnt=4, secondary_writes=1
>    6  28117  bufobj_wwait:return  calls to bufobj_wait = 16, dirtycnt=3, secondary_writes=1
>    6  28117  bufobj_wwait:return  calls to bufobj_wait = 17, dirtycnt=3, secondary_writes=1
>    2  18903  ffs_truncate:return      returncode=0, inum=11155630, blocks=0
>    2  32866  kern_unlinkat:return     returncode=0, errno=0,
>              number io's: 791/791
>    6  28117  bufobj_wwait:return  calls to bufobj_wait = 18, dirtycnt=3, secondary_writes=0
>    6  28117  bufobj_wwait:return  calls to bufobj_wait = 19, dirtycnt=0, secondary_writes=0
>    6  22141  vop_stdfsync:return      returncode=0, pid=26, tid=100181,
>              spent 240373850 nsecs
>
> So the time spent in vop_stdfsync() is 0.24 sec in the worst case I
> found using DELAY with 1 ms. I would prefer this solution. My first
> approach (simply bumping maxretry from 1000 to 100000) is also ok, but
> the maximum time spent then rises to 0.5 sec. Perhaps you would like
> something like
>
>     if (maxretry < 1000 && maxretry % 10 == 0)
>             DELAY(waitns);
>
> That is also ok but does not make a noteworthy difference. The main
> argument remains: we have to wait until there are no dirty buffers left.
>
> For me the problem with the "giving up on dirty" is solved.

Will you provide the patch?
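For reference, entry-to-return timings like the "spent 240373850 nsecs"
line above can be captured with a D script as small as the following
sketch (assuming the fbt provider; the script actually used also printed
the buffer counts and secondary_writes):

	#pragma D option quiet

	/* Sketch only: time every vop_stdfsync() call, entry to return. */
	fbt::vop_stdfsync:entry
	{
		self->ts = timestamp;
	}

	fbt::vop_stdfsync:return
	/self->ts/
	{
		printf("spent %d nsecs in vop_stdfsync()\n",
		    timestamp - self->ts);
		self->ts = 0;
	}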