Date: Sat, 16 Sep 2017 13:44:44 +0200
From: Andreas Longwitz <longwitz@incore.de>
To: Kirk McKusick <mckusick@mckusick.com>
Cc: freebsd-fs@freebsd.org
Subject: Re: fsync: giving up on dirty on ufs partitions running vfs_write_suspend()
Message-ID: <59BD0EAC.8030206@incore.de>
In-Reply-To: <201709110519.v8B5JVmf060773@chez.mckusick.com>
References: <201709110519.v8B5JVmf060773@chez.mckusick.com>
Hello Kirk,

>> Second I found that the "dirty" situation during vfs_write_suspend()
>> only occurs when a big file (more than 10G on a partition of 116G) is
>> removed. If vfs_write_suspend() is called immediately after "rm
>> bigfile", then in vop_stdfsync() 1000 tries (maxretry) are done to wait
>> for the "rm bigfile" to complete. Because a lot of bitmap writes must be
>> done, the value 1000 is not sufficient on my servers. I have increased
>> maxretry and in the worst case I saw 8650 tries to complete without
>> "dirty". In this case the time spent in vop_stdfsync() was about 0.5
>> seconds. The following patch solves the "dirty problem" for me:
>>
>> --- vfs_default.c.orig  2016-10-24 12:26:57.000000000 +0200
>> +++ vfs_default.c       2017-09-08 12:49:18.059970000 +0200
>> @@ -644,7 +644,7 @@
>>         struct bufobj *bo;
>>         struct buf *nbp;
>>         int error = 0;
>> -       int maxretry = 1000;     /* large, arbitrarily chosen */
>> +       int maxretry = 100000;   /* large, arbitrarily chosen */
>>
>>         bo = &vp->v_bufobj;
>>         BO_LOCK(bo);
>
> This message has plagued me for years. It started out as a panic,
> then got changed to a printf because I could not get rid of it. I
> was never able to figure out why it should take more than five
> iterations to finish, but obviously it takes more. The 1000 number
> was picked because that just seemed insanely large and I did not
> want to iterate forever. I have no problem with bumping up the
> iteration count if there is some way to figure out that each iteration
> is making forward progress (so we know that we are not in an infinite
> loop). Can you come up with a scheme that can measure forward progress?
> I would much prefer that to just making this number ever bigger.
>
>       Kirk McKusick

Ok, I understand your thoughts about the "big loop" and I agree. On the
other side it is not easy to measure the progress of the dirty buffers,
because these buffers are created by another process at the same time
we loop in vop_stdfsync(). I can explain from my tests, where I use the
following loop on a gjournaled partition:

   while true; do
      cp -p bigfile bigfile.tmp
      rm bigfile
      mv bigfile.tmp bigfile
   done

When g_journal_switcher starts vfs_write_suspend() immediately after
the rm command has started to do its "rm stuff" (ufs_inactive,
ffs_truncate, ffs_indirtrunc at different levels, ffs_blkfree, ...),
then we must loop (that means wait) in vop_stdfsync() until the rm
process has finished its work. A lot of locking overhead is needed for
coordination. Returning from bufobj_wwait() we always see one dirty
buffer left (very seldom two); that is not optimal. Therefore I have
tried the following patch (instead of bumping maxretry):

--- vfs_default.c.orig  2016-10-24 12:26:57.000000000 +0200
+++ vfs_default.c       2017-09-15 12:30:44.792274000 +0200
@@ -688,6 +688,8 @@
                        bremfree(bp);
                        bawrite(bp);
                }
+               if (maxretry < 1000)
+                       DELAY(waitns);
                BO_LOCK(bo);
                goto loop2;
        }

with different values for waitns. If I run the test loop 5000 times on
my test server, the problem is triggered always round about 10 times.
The results from several runs are given in the following table:

   waitns     max time   max loops
   -------------------------------
   no DELAY   0.5 sec    8650 (maxretry = 100000)
   1000       0.2 sec    24
   10000      0.8 sec    3
   100000     7.2 sec    3

"time" means the time spent in vop_stdfsync() measured from entry to
return by a dtrace script. "loops" means the number of times
"--maxretry" is executed. I am not sure if DELAY() is the best way to
wait, or if waiting has other drawbacks.
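To make your idea of measuring forward progress concrete, here is a
minimal userland sketch of such a retry loop, where the budget is only
charged on passes that make no progress. It is only an illustration:
dirty_count() and flush_one_pass() are invented stand-ins for
bo->bo_dirty.bv_cnt and one pass of vop_stdfsync(), not kernel API.

#include <stdio.h>
#include <stdlib.h>

static int dirty = 25;          /* simulated dirty buffer count */

static int
dirty_count(void)
{
        return (dirty);
}

static void
flush_one_pass(void)
{
        /* Simulate a pass that usually, but not always, makes progress. */
        if (rand() % 4 != 0 && dirty > 0)
                dirty--;
}

int
main(void)
{
        int maxretry = 1000;    /* charged only when no progress is made */
        int lastcnt = dirty_count();

        while (dirty_count() > 0) {
                flush_one_pass();
                if (dirty_count() >= lastcnt) {
                        /* no forward progress: spend one retry */
                        if (--maxretry == 0) {
                                printf("fsync: giving up on dirty\n");
                                return (1);
                        }
                }
                /* on progress the pass is free, budget untouched */
                lastcnt = dirty_count();
        }
        printf("all buffers clean, %d retries left\n", maxretry);
        return (0);
}

With a rule like this, maxretry only bounds the number of passes that
make no progress, so the loop cannot spin forever, while a slow but
steadily progressing truncate like my rm test gets as many passes as
it needs.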
Anyway, with DELAY() it does not take more than five iterations to
finish.

--
Andreas Longwitz