Date: Sun, 21 Jul 2013 18:00:01 GMT
To: freebsd-bugs@FreeBSD.org
From: Klaus Weber
Subject: Re: kern/178997: Heavy disk I/O may hang system

The following reply was made to PR kern/178997; it has been noted by GNATS.

From: Klaus Weber
To: Bruce Evans
Cc: Klaus Weber, freebsd-gnats-submit@FreeBSD.org
Subject: Re: kern/178997: Heavy disk I/O may hang system
Date: Sun, 21 Jul 2013 19:42:15 +0200

Sorry for the late reply; I did a lot of testing. I believe I have
identified the root cause of the buffer depletion and the subsequent
I/O hangs.

Have a look at the beginning of ffs_bdflush() in ffs_snapshot.c:

	if (bo->bo_dirty.bv_cnt <= dirtybufthresh)
		return;

This short-circuits the entire flushing code if _this bufobj_ is using
fewer than vfs.dirtybufthresh buffers (by default, vfs.dirtybufthresh
is 90% of vfs.hidirtybuffers). In other words, a single bufobj may hog
up to 90% of the dirty-buffer limit.

This works well as long as only one file is being re-written. As
documented up-thread, with just one file vfs.numdirtybuffers rises
until it reaches vfs.dirtybufthresh, then remains relatively stable at
that point, and the system does not hang.

With two (or more) files being re-written, the code above lets _each_
file's bufobj accumulate up to that 90% share, which drives
vfs.numdirtybuffers all the way up to vfs.hidirtybuffers. Even worse,
with two processes competing for buffer space, both often stay below
the magic 90% mark, which disables this flushing path entirely. At
that point, buffers can only be freed by flushes in other code paths,
or by processes not involved in this competition releasing theirs.
Sooner or later the system reaches a state where no bufobj holds more
than 90% of the buffers and none of the other (normal) processes can
release any buffers. The only way out then is a hard reset.
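(Illustration added for clarity, not part of the original test setup:
a minimal userland sketch of the kind of re-write workload described
above, i.e. reading a block and writing it back in place through the
same file descriptor. The 64 kB block size and the command-line
interface are arbitrary choices. Running two instances on two large,
pre-existing files on the same file system is the two-writer case that
drives vfs.numdirtybuffers towards vfs.hidirtybuffers.)

	/*
	 * rewrite.c -- hypothetical reproduction sketch: "re-write" an
	 * existing file in place through a single file descriptor by
	 * reading a block, seeking back over it, and writing it back.
	 */
	#include <err.h>
	#include <fcntl.h>
	#include <unistd.h>

	int
	main(int argc, char **argv)
	{
		char buf[64 * 1024];
		ssize_t n;
		int fd;

		if (argc != 2)
			errx(1, "usage: rewrite <large-existing-file>");
		if ((fd = open(argv[1], O_RDWR)) == -1)
			err(1, "open %s", argv[1]);

		/* Read a block, seek back over it, write it back in place. */
		while ((n = read(fd, buf, sizeof(buf))) > 0) {
			if (lseek(fd, -n, SEEK_CUR) == -1)
				err(1, "lseek");
			if (write(fd, buf, n) != n)
				err(1, "write");
		}
		if (n == -1)
			err(1, "read");
		close(fd);
		return (0);
	}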
With the following patch (also attached), the system performs
reasonably well and does not hang:

Index: sys/kern/vfs_bio.c
===================================================================
--- sys/kern/vfs_bio.c	(revision 253339)
+++ sys/kern/vfs_bio.c	(working copy)
@@ -1146,7 +1146,7 @@
 	if (bo->bo_dirty.bv_cnt > dirtybufthresh + 10) {
 		(void) VOP_FSYNC(bp->b_vp, MNT_NOWAIT, curthread);
 		altbufferflushes++;
-	} else if (bo->bo_dirty.bv_cnt > dirtybufthresh) {
+	} else if (bdflush_required(bo)) {
 		BO_LOCK(bo);
 		/*
 		 * Try to find a buffer to flush.
@@ -1438,6 +1438,28 @@
 }
 
 /*
+ * Flushing is deemed necessary if at least one of the following is true:
+ * - bo is hogging more than dirtybufthresh buffers (previous behavior)
+ * - numdirtybuffers exceeds dirtybufthresh
+ * - we are bufdaemon, or act on its behalf (TDP_NORUNNINGBUF)
+ *
+ * Used by bufbdflush (above) and ffs_bdflush (in ffs_snapshot.c).
+ */
+int
+bdflush_required(struct bufobj *bo)
+{
+	struct thread *td = curthread;
+
+	KASSERT(bo != NULL, ("bdflush_required: NULL bo"));
+
+	if (bo->bo_dirty.bv_cnt > dirtybufthresh ||
+	    numdirtybuffers > dirtybufthresh ||
+	    (td != NULL && (td->td_pflags & TDP_NORUNNINGBUF)))
+		return (1);
+	return (0);
+}
+
+/*
  * brelse:
  *
  *	Release a busy buffer and, if requested, free its resources.  The

Index: sys/sys/bio.h
===================================================================
--- sys/sys/bio.h	(revision 253339)
+++ sys/sys/bio.h	(working copy)
@@ -151,6 +151,10 @@
 #define physread physio
 #define physwrite physio
 
+struct bufobj;
+int bdflush_required(struct bufobj *bo);
+
+
 #endif /* _KERNEL */
 
 #endif /* !_SYS_BIO_H_ */

Index: sys/ufs/ffs/ffs_snapshot.c
===================================================================
--- sys/ufs/ffs/ffs_snapshot.c	(revision 253339)
+++ sys/ufs/ffs/ffs_snapshot.c	(working copy)
@@ -2161,7 +2161,7 @@
 	struct buf *nbp;
 	int bp_bdskip;
 
-	if (bo->bo_dirty.bv_cnt <= dirtybufthresh)
+	if (!bdflush_required(bo))
 		return;
 
 	td = curthread;

I have modified ffs_bdflush() in ffs_snapshot.c and the generic
bufbdflush() in vfs_bio.c, which contains similar logic. (I believe a
kernel of 5.2 vintage will still have this code in bdwrite().)

While the patch makes re-writing work reliably, it may need some
further refinement:

- Right now, it limits numdirtybuffers to dirtybufthresh. It might be
  better to use hidirtybuffers as the limit instead?

- The set of conditions under which the flush is performed needs to be
  reviewed by someone who understands this code better than I do.
  Limiting numdirtybuffers is necessary to fix the hangs. Limiting
  bo->bo_dirty.bv_cnt is what the code did before, so I left that in
  as well. Finally, I test for TDP_NORUNNINGBUF based on the idea that
  if bufdaemon, the syncer, or any process acting on their behalf sees
  the need to flush buffers, then it should not be up to *bdflush() to
  second-guess their intention.

- The patch may have style issues.

Regarding the threshold and conditions: I am still in the process of
doing some basic benchmarks with and without the patch. However, it
seems that the patch causes a slight performance decrease in the cases
that already worked well without it (i.e. only one process
re-writing). My guess is that it now flushes more often than
absolutely necessary. Performance is unchanged for cases that do not
cause a huge buffer buildup (reading and writing via separate file
descriptors). Of course, in the cases that used to hang and now do
not, the patch is a clear win.
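(Also added for illustration, not how the numbers above were gathered:
a small userland sketch that samples the three sysctls mentioned in
this report once per second, so the dirty-buffer buildup can be
watched while a test is running.)

	/*
	 * dirtywatch.c -- hypothetical monitoring sketch: print
	 * vfs.numdirtybuffers next to the two thresholds every second.
	 */
	#include <sys/types.h>
	#include <sys/sysctl.h>
	#include <err.h>
	#include <stdio.h>
	#include <unistd.h>

	static int
	get_int(const char *name)
	{
		int val;
		size_t len = sizeof(val);

		if (sysctlbyname(name, &val, &len, NULL, 0) == -1)
			err(1, "sysctlbyname(%s)", name);
		return (val);
	}

	int
	main(void)
	{
		for (;;) {
			printf("numdirtybuffers=%d dirtybufthresh=%d "
			    "hidirtybuffers=%d\n",
			    get_int("vfs.numdirtybuffers"),
			    get_int("vfs.dirtybufthresh"),
			    get_int("vfs.hidirtybuffers"));
			sleep(1);
		}
	}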
I still believe there are some bugs in the clustering code that affect
this case, making the re-writing less efficient than it could be. I
wish I had the time to do more testing, but I need to bring the server
into production soon. I am putting down some notes here for things I
intended to test, but probably will not have the time to:

- In vfs_cluster.c, cluster_write_gb(), where it says "Change to
  algorithm...": also push the previous cluster when it has reached
  its maximum size, not only when seqcount > 0. There is no point in
  waiting for buf_daemon when the cluster cannot grow any further.
  Verify that the condition "vp->v_clen <= cursize" really tests
  "reached maximum size" (XXX or should that be "maxclen <= cursize"?).
  Add rate-limited printf()s.

- Create a new cluster_awrite(), as in DragonFly BSD:
  http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/9d4e78c77684f1de021eb8ae17767c210205c3c3
  (plug-compatible with bawrite(), but with clustering). Do not replace
  vfs_bio_awrite() (yet). Then selectively replace some bawrite() calls
  with calls to cluster_awrite() to identify where the current
  bawrite() calls are inefficient. (XXX requires modifications to
  cluster_wbuild().)

If anyone wants me to test any additional things, or variants of the
patch above, I can still do that next weekend.

Since the system no longer hangs with the patch, I think this bug can
now be closed.

Klaus