Date: Fri, 7 Jun 2013 05:28:11 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: fs@freebsd.org
Subject: missed clustering for small block sizes in cluster_wbuild()
Message-ID: <20130607044845.O24441@besplex.bde.org>
Would someone please fix this fairly small bug in cluster_wbuild().
When the fs block size is 512 bytes and everything is contiguous,
cluster_write() prepares a complete cluster but cluster_wbuild()
splits it into two, with a runt of size PAGE_SIZE - 512 bytes for
the second part.
From vfs_cluster.c:
@ /*
@ * From this location in the file, scan forward to see
@ * if there are buffers with adjacent data that need to
@ * be written as well.
@ */
@ for (i = 0; i < len; ++i, ++start_lbn) {
@ if (i != 0) { /* If not the first buffer */
@ s = splbio();
@ /*
@ * If the adjacent data is not even in core it
@ * can't need to be written.
@ */
@ VI_LOCK(vp);
@ if ((tbp = gbincore(vp, start_lbn)) == NULL ||
@ (tbp->b_vflags & BV_BKGRDINPROG)) {
@ VI_UNLOCK(vp);
@ splx(s);
@ break;
@ }
@
@ /*
@ * If it IS in core, but has different
@ * characteristics, or is locked (which
@ * means it could be undergoing a background
@ * I/O or be in a weird state), then don't
@ * cluster with it.
@ */
@ if (BUF_LOCK(tbp,
@ LK_EXCLUSIVE | LK_NOWAIT | LK_INTERLOCK,
@ VI_MTX(vp))) {
@ splx(s);
@ break;
@ }
@
@ if ((tbp->b_flags & (B_VMIO | B_CLUSTEROK |
@ B_INVAL | B_DELWRI | B_NEEDCOMMIT))
@ != (B_DELWRI | B_CLUSTEROK |
@ (bp->b_flags & (B_VMIO | B_NEEDCOMMIT))) ||
@ tbp->b_wcred != bp->b_wcred) {
@ BUF_UNLOCK(tbp);
@ splx(s);
@ break;
@ }
@
@ /*
@ * Check that the combined cluster
@ * would make sense with regard to pages
@ * and would not be too large
@ */
@ if ((tbp->b_bcount != size) ||
@ ((bp->b_blkno + (dbsize * i)) !=
@ tbp->b_blkno) ||
@ ((tbp->b_npages + bp->b_npages) >
@ (vp->v_mount->mnt_iosize_max / PAGE_SIZE))) {
This page count check is wrong. Usually mnt_iosize_max is 128K and PAGE_SIZE
is 4K. This gives a limit of 32 pages, which is normally enough to hold
256 512-blocks (very sparsely and wastefully mapped, but that is another,
much larger bug). The pages up to the 31st are built up right and hold
248 512-blocks. Then bp->b_npages is 31 and tbp->b_npages is 1. The sum is
32, so the 32nd page is started right, and the 249th 512-block is allocated
in it. Then for the next 512-block, bp->b_npages is 32, so the sum is 33
and we break out of the loop prematurely. The buffer containing 249
512-blocks is written out. cluster_wbuild() continues and builds and
writes out a runt buffer containing the remaining 7 512-blocks.
For a quick fix, I just removed this check. cluster_write() should
never prepare a cluster too large to write out in 1 step. It uses
mnt_stat.f_iosize instead of mnt_iosize_max for the limit, and that
should not be larger. But there may be a special case where the
buffer contents are not aligned. Then an extra page might be needed.
@ BUF_UNLOCK(tbp);
@ splx(s);
@ break;
@ }
@ /*
@ * Ok, it's passed all the tests,
@ * so remove it from the free list
@ * and mark it busy. We will use it.
@ */
@ bremfree(tbp);
@ tbp->b_flags &= ~B_DONE;
@ splx(s);
@ } /* end of code for non-first buffers only */
@ /* check for latent dependencies to be handled */
@ if ((LIST_FIRST(&tbp->b_dep)) != NULL) {
@ tbp->b_iocmd = BIO_WRITE;
@ buf_start(tbp);
@ }
@ /*
@ * If the IO is via the VM then we do some
@ * special VM hackery (yuck). Since the buffer's
@ * block size may not be page-aligned it is possible
@ * for a page to be shared between two buffers. We
@ * have to get rid of the duplication when building
@ * the cluster.
@ */
@ if (tbp->b_flags & B_VMIO) {
@ vm_page_t m;
@
@ if (i != 0) { /* if not first buffer */
@ for (j = 0; j < tbp->b_npages; j += 1) {
@ m = tbp->b_pages[j];
@ if (m->flags & PG_BUSY) {
@ bqrelse(tbp);
@ goto finishcluster;
@ }
@ }
@ }
@ if (tbp->b_object != NULL)
@ VM_OBJECT_LOCK(tbp->b_object);
@ vm_page_lock_queues();
@ for (j = 0; j < tbp->b_npages; j += 1) {
@ m = tbp->b_pages[j];
@ vm_page_io_start(m);
@ vm_object_pip_add(m->object, 1);
@ if ((bp->b_npages == 0) ||
@ (bp->b_pages[bp->b_npages - 1] != m)) {
@ bp->b_pages[bp->b_npages] = m;
@ bp->b_npages++;
The number of new pages needed is the number of increments here, which I
think is 1 less than tbp->b_npages in most cases where the runt buffer is
built.
I think this is best fixed by removing the check above and doing the
check here instead; when the check fails, back out the changes already
made. I don't know this code well enough to write the backing-out
easily.
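As a sketch of what "checking here" could mean (illustration only;
pages are plain integers and the helper is made up): count only the
increments, i.e. the pages of tbp that are not already the cluster's
last page, and test that count against the limit.

```c
#include <assert.h>

/*
 * Count how many of tbp's pages are genuinely new to the cluster,
 * i.e. how many times the merging loop above would increment
 * bp->b_npages.  Pages are integer identifiers for illustration.
 */
static int
pages_added(const int *bp_pages, int bp_npages,
    const int *tbp_pages, int tbp_npages)
{
	int added = 0, j;
	int last = (bp_npages > 0) ? bp_pages[bp_npages - 1] : -1;

	for (j = 0; j < tbp_npages; j++) {
		if (tbp_pages[j] != last) {
			added++;
			last = tbp_pages[j];
		}
	}
	return (added);
}
```

In the runt case above, tbp's only page is the cluster's 32nd page, so
pages_added() is 0 and the combined cluster is still within the
32-page limit.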
@ }
@ }
@ vm_page_unlock_queues();
@ if (tbp->b_object != NULL)
@ VM_OBJECT_UNLOCK(tbp->b_object);
@ }
@ bp->b_bcount += size;
@ bp->b_bufsize += size;
@
@ s = splbio();
@ bundirty(tbp);
@ tbp->b_flags &= ~B_DONE;
@ tbp->b_ioflags &= ~BIO_ERROR;
@ tbp->b_flags |= B_ASYNC;
@ tbp->b_iocmd = BIO_WRITE;
@ reassignbuf(tbp, tbp->b_vp); /* put on clean list */
@ VI_LOCK(tbp->b_vp);
@ ++tbp->b_vp->v_numoutput;
@ VI_UNLOCK(tbp->b_vp);
@ splx(s);
@ BUF_KERNPROC(tbp);
@ TAILQ_INSERT_TAIL(&bp->b_cluster.cluster_head,
@ tbp, b_cluster.cluster_entry);
@ }
The loss of i/o performance from this wasn't too bad on my disks. The
bug mainly made it harder to see the size of other clustering bugs by
watching iostat output.
Bruce
