From owner-svn-src-all@FreeBSD.ORG Wed Oct 2 06:00:35 2013 Return-Path: Delivered-To: svn-src-all@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id 9E2F2FE; Wed, 2 Oct 2013 06:00:35 +0000 (UTC) (envelope-from kib@FreeBSD.org) Received: from svn.freebsd.org (svn.freebsd.org [IPv6:2001:1900:2254:2068::e6a:0]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 7BC562E59; Wed, 2 Oct 2013 06:00:35 +0000 (UTC) Received: from svn.freebsd.org ([127.0.1.70]) by svn.freebsd.org (8.14.7/8.14.7) with ESMTP id r9260ZFh002820; Wed, 2 Oct 2013 06:00:35 GMT (envelope-from kib@svn.freebsd.org) Received: (from kib@localhost) by svn.freebsd.org (8.14.7/8.14.5/Submit) id r9260Z6g002791; Wed, 2 Oct 2013 06:00:35 GMT (envelope-from kib@svn.freebsd.org) Message-Id: <201310020600.r9260Z6g002791@svn.freebsd.org> From: Konstantin Belousov Date: Wed, 2 Oct 2013 06:00:35 +0000 (UTC) To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org Subject: svn commit: r255986 - head/sys/kern X-SVN-Group: head MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-BeenThere: svn-src-all@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: "SVN commit messages for the entire src tree \(except for " user" and " projects" \)" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 02 Oct 2013 06:00:35 -0000 Author: kib Date: Wed Oct 2 06:00:34 2013 New Revision: 255986 URL: http://svnweb.freebsd.org/changeset/base/255986 Log: When helping the bufdaemon from the buffer allocation context, there is no sense to walk the whole dirty buffer queue. We are only interested in, and can operate on, the buffers owned by the current vnode [1]. Instead of calling generic queue flush routine, do VOP_FSYNC() if possible. Holding the dirty buffer queue lock in the bufdaemon, without dropping it, can cause starvation of buffer writes from other threads. This is esp. easy to reproduce on the big memory machines, where large files are written, causing almost all dirty buffers accumulating in several big files, which vnodes are locked by writers. Bufdaemon cannot flush any buffer, but is iterating over the whole dirty queue continuously. Since dirty queue mutex is not dropped, bufdone() in g_up thread is starved, usually deadlocking the machine [2]. Mitigate this by dropping the queue lock after the vnode is locked, allowing other queue lock contenders to make a progress. Discussed with: Jeff [1] Reported by: pho [2] Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Approved by: re (hrs) Modified: head/sys/kern/vfs_bio.c Modified: head/sys/kern/vfs_bio.c ============================================================================== --- head/sys/kern/vfs_bio.c Wed Oct 2 04:40:46 2013 (r255985) +++ head/sys/kern/vfs_bio.c Wed Oct 2 06:00:34 2013 (r255986) @@ -113,8 +113,8 @@ static void vfs_setdirty_locked_object(s static void vfs_vmio_release(struct buf *bp); static int vfs_bio_clcheck(struct vnode *vp, int size, daddr_t lblkno, daddr_t blkno); -static int buf_flush(struct vnode *vp, int); -static int flushbufqueues(struct vnode *, int, int); +static int buf_flush(int); +static int flushbufqueues(int, int); static void buf_daemon(void); static void bremfreel(struct buf *bp); static __inline void bd_wakeup(void); @@ -2048,7 +2048,7 @@ getnewbuf_bufd_help(struct vnode *vp, in { struct thread *td; char *waitmsg; - int fl, flags, norunbuf; + int cnt, error, flags, norunbuf, wait; mtx_assert(&bqclean, MA_OWNED); @@ -2072,10 +2072,13 @@ getnewbuf_bufd_help(struct vnode *vp, in return; td = curthread; + cnt = 0; + wait = MNT_NOWAIT; mtx_lock(&nblock); while (needsbuffer & flags) { if (vp != NULL && (td->td_pflags & TDP_BUFNEED) == 0) { mtx_unlock(&nblock); + /* * getblk() is called with a vnode locked, and * some majority of the dirty buffers may as @@ -2084,15 +2087,20 @@ getnewbuf_bufd_help(struct vnode *vp, in * cannot be achieved by the buf_daemon, that * cannot lock the vnode. */ - norunbuf = ~(TDP_BUFNEED | TDP_NORUNNINGBUF) | - (td->td_pflags & TDP_NORUNNINGBUF); - /* play bufdaemon */ - td->td_pflags |= TDP_BUFNEED | TDP_NORUNNINGBUF; - fl = buf_flush(vp, flushbufqtarget); - td->td_pflags &= norunbuf; + if (cnt++ > 2) + wait = MNT_WAIT; + ASSERT_VOP_LOCKED(vp, "bufd_helper"); + error = VOP_ISLOCKED(vp) == LK_EXCLUSIVE ? 0 : + vn_lock(vp, LK_TRYUPGRADE); + if (error == 0) { + /* play bufdaemon */ + norunbuf = curthread_pflags_set(TDP_BUFNEED | + TDP_NORUNNINGBUF); + VOP_FSYNC(vp, wait, td); + atomic_add_long(¬bufdflushes, 1); + curthread_pflags_restore(norunbuf); + } mtx_lock(&nblock); - if (fl != 0) - continue; if ((needsbuffer & flags) == 0) break; } @@ -2510,20 +2518,18 @@ static struct kproc_desc buf_kp = { SYSINIT(bufdaemon, SI_SUB_KTHREAD_BUF, SI_ORDER_FIRST, kproc_start, &buf_kp); static int -buf_flush(struct vnode *vp, int target) +buf_flush(int target) { int flushed; - flushed = flushbufqueues(vp, target, 0); + flushed = flushbufqueues(target, 0); if (flushed == 0) { /* * Could not find any buffers without rollback * dependencies, so just write the first one * in the hopes of eventually making progress. */ - if (vp != NULL && target > 2) - target /= 2; - flushbufqueues(vp, target, 1); + flushed = flushbufqueues(target, 1); } return (flushed); } @@ -2560,7 +2566,7 @@ buf_daemon() * the I/O system. */ while (numdirtybuffers > lodirty) { - if (buf_flush(NULL, numdirtybuffers - lodirty) == 0) + if (buf_flush(numdirtybuffers - lodirty) == 0) break; kern_yield(PRI_USER); } @@ -2615,7 +2621,7 @@ SYSCTL_INT(_vfs, OID_AUTO, flushwithdeps 0, "Number of buffers flushed with dependecies that require rollbacks"); static int -flushbufqueues(struct vnode *lvp, int target, int flushdeps) +flushbufqueues(int target, int flushdeps) { struct buf *sentinel; struct vnode *vp; @@ -2625,7 +2631,6 @@ flushbufqueues(struct vnode *lvp, int ta int flushed; int queue; int error; - bool unlock; flushed = 0; queue = QUEUE_DIRTY; @@ -2634,27 +2639,24 @@ flushbufqueues(struct vnode *lvp, int ta sentinel->b_qindex = QUEUE_SENTINEL; mtx_lock(&bqdirty); TAILQ_INSERT_HEAD(&bufqueues[queue], sentinel, b_freelist); + mtx_unlock(&bqdirty); while (flushed != target) { + maybe_yield(); + mtx_lock(&bqdirty); bp = TAILQ_NEXT(sentinel, b_freelist); if (bp != NULL) { TAILQ_REMOVE(&bufqueues[queue], sentinel, b_freelist); TAILQ_INSERT_AFTER(&bufqueues[queue], bp, sentinel, b_freelist); - } else + } else { + mtx_unlock(&bqdirty); break; - /* - * Skip sentinels inserted by other invocations of the - * flushbufqueues(), taking care to not reorder them. - */ - if (bp->b_qindex == QUEUE_SENTINEL) - continue; - /* - * Only flush the buffers that belong to the - * vnode locked by the curthread. - */ - if (lvp != NULL && bp->b_vp != lvp) - continue; - if (BUF_LOCK(bp, LK_EXCLUSIVE | LK_NOWAIT, NULL) != 0) + } + KASSERT(bp->b_qindex != QUEUE_SENTINEL, + ("parallel calls to flushbufqueues() bp %p", bp)); + error = BUF_LOCK(bp, LK_EXCLUSIVE | LK_NOWAIT, NULL); + mtx_unlock(&bqdirty); + if (error != 0) continue; if (bp->b_pin_count > 0) { BUF_UNLOCK(bp); @@ -2670,11 +2672,9 @@ flushbufqueues(struct vnode *lvp, int ta continue; } if (bp->b_flags & B_INVAL) { - bremfreel(bp); - mtx_unlock(&bqdirty); + bremfreef(bp); brelse(bp); flushed++; - mtx_lock(&bqdirty); continue; } @@ -2701,45 +2701,23 @@ flushbufqueues(struct vnode *lvp, int ta BUF_UNLOCK(bp); continue; } - if (lvp == NULL) { - unlock = true; - error = vn_lock(vp, LK_EXCLUSIVE | LK_NOWAIT); - } else { - ASSERT_VOP_LOCKED(vp, "getbuf"); - unlock = false; - error = VOP_ISLOCKED(vp) == LK_EXCLUSIVE ? 0 : - vn_lock(vp, LK_TRYUPGRADE); - } + error = vn_lock(vp, LK_EXCLUSIVE | LK_NOWAIT); if (error == 0) { - mtx_unlock(&bqdirty); CTR3(KTR_BUF, "flushbufqueue(%p) vp %p flags %X", bp, bp->b_vp, bp->b_flags); - if (curproc == bufdaemonproc) - vfs_bio_awrite(bp); - else { - bremfree(bp); - bwrite(bp); - notbufdflushes++; - } + vfs_bio_awrite(bp); vn_finished_write(mp); - if (unlock) - VOP_UNLOCK(vp, 0); + VOP_UNLOCK(vp, 0); flushwithdeps += hasdeps; flushed++; - - /* - * Sleeping on runningbufspace while holding - * vnode lock leads to deadlock. - */ - if (curproc == bufdaemonproc && - runningbufspace > hirunningspace) + if (runningbufspace > hirunningspace) waitrunningbufspace(); - mtx_lock(&bqdirty); continue; } vn_finished_write(mp); BUF_UNLOCK(bp); } + mtx_lock(&bqdirty); TAILQ_REMOVE(&bufqueues[queue], sentinel, b_freelist); mtx_unlock(&bqdirty); free(sentinel, M_TEMP);