Date: Fri, 29 Oct 2004 10:49:24 -0600 (MDT)
From: Siddharth Aggarwal <saggarwa@cs.utah.edu>
To: freebsd-hackers@freebsd.org
Message-ID: <Pine.GSO.4.50L0.0410291033130.25989-100000@faith.cs.utah.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Subject: flushing disk buffer cache


Hi,

I am writing a pseudo disk driver for disk checkpointing. It intercepts
write requests to the disk (ad0s1) and copies the old contents to another
partition (ad0s4), copy-on-write style, before writing out the new
contents. The filesystems are mounted through the driver (called shd) as

/dev/shd0a on /
/dev/shd0f on /usr


Each time the user creates a new checkpoint (which basically initializes
new in-memory data structures for it), the driver first calls sync()
explicitly to flush the disk buffer cache, so that the on-disk state is
consistent at the moment the checkpoint is taken.

Then, I have hacked the reboot system call to revert to a previous
checkpoint after unmounting all the filesystems but before halting the
system. This revert basically involves copying some blocks from ad0s4 to
ad0s1.
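
To make the scheme concrete, here is a minimal userland sketch of the
copy-on-write and revert steps described above. It is not the shd driver
code: the arrays primary and shadow merely stand in for ad0s1 and ad0s4,
and every name in it (shd_checkpoint, shd_write, shd_revert, NBLOCKS,
BLKSZ) is made up for illustration.

```c
#include <stdbool.h>
#include <string.h>

#define NBLOCKS 8   /* hypothetical tiny "disk" */
#define BLKSZ   16  /* hypothetical block size in bytes */

static char primary[NBLOCKS][BLKSZ]; /* stands in for ad0s1 */
static char shadow[NBLOCKS][BLKSZ];  /* stands in for ad0s4 */
static bool saved[NBLOCKS];          /* old contents already copied? */

/* Start a new checkpoint: no block has been preserved yet. */
static void shd_checkpoint(void)
{
    memset(saved, 0, sizeof(saved));
}

/*
 * Intercepted write. The first write to a block after a checkpoint
 * copies the old contents to the shadow area; later writes to the
 * same block must not overwrite that saved copy.
 * data must point to BLKSZ bytes.
 */
static void shd_write(int blk, const char *data)
{
    if (!saved[blk]) {
        memcpy(shadow[blk], primary[blk], BLKSZ);
        saved[blk] = true;
    }
    memcpy(primary[blk], data, BLKSZ);
}

/* Revert to the checkpoint: copy every preserved block back. */
static void shd_revert(void)
{
    for (int blk = 0; blk < NBLOCKS; blk++) {
        if (saved[blk]) {
            memcpy(primary[blk], shadow[blk], BLKSZ);
            saved[blk] = false;
        }
    }
}
```

The sketch only restores the right data if everything written before
shd_checkpoint() has really reached the primary area, which is exactly
the part that is hard to guarantee with a live buffer cache.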

However, when the system comes back up, fsck reports inconsistencies in
the filesystem and has to be run manually.

I suspect the reason is that the filesystem on ad0s1 is active when a
checkpoint is taken and more writes keep coming in, i.e. the filesystem on
ad0s1 is still dirty. That is why I call sync() explicitly before
returning from the checkpoint command, but I think sync() doesn't
guarantee that everything has actually been flushed by the time it
returns. So I implemented a more forceful way of syncing by borrowing part
of the code from boot(), the kernel routine behind the reboot system call.
The code is below; it is called whenever a checkpoint command is issued.

Does anyone know whether this is the right way to flush the cache? Is
there anything else I can do to ensure the filesystem is consistent at
reboot? I don't think the problem is in the driver code itself: when I
created a new filesystem on ad0s3 and shadowed that with the driver,
everything ran perfectly fine. The difference was that I could unmount
that filesystem before "restoring the checkpoint", so the restore did not
have to happen at reboot time.


void sync_before_checkpoint (void)
{
    register struct buf *bp;
    int iter, nbusy, pbusy;

    waittime = 0;
    sync(&proc0, NULL);

    /*
     * With soft updates, some buffers that are
     * written will be remarked as dirty until other
     * buffers are written.
     */
    for (iter = pbusy = 0; iter < 20; iter++) {
        nbusy = 0;
        for (bp = &buf[nbuf]; --bp >= buf; ) {
            if ((bp->b_flags & B_INVAL) == 0 &&
                BUF_REFCNT(bp) > 0) {
                nbusy++;
            } else if ((bp->b_flags & (B_DELWRI | B_INVAL))
                       == B_DELWRI) {
                /* bawrite(bp);*/
                nbusy++;
            }
        }
        if (nbusy == 0)
            break;
        printf("%d ", nbusy);
        if (nbusy < pbusy)
            iter = 0;
        pbusy = nbusy;
        if (iter > 5 && bioops.io_sync)
            (*bioops.io_sync)(NULL);
        sync(&proc0, NULL);
        DELAY(50000 * iter);
    }

    /*
     * Count only busy local buffers to prevent forcing
     * a fsck if we're just a client of a wedged NFS server
     */
    nbusy = 0;
    for (bp = &buf[nbuf]; --bp >= buf; ) {
        if (((bp->b_flags & B_INVAL) == 0 && BUF_REFCNT(bp)) ||
            ((bp->b_flags & (B_DELWRI | B_INVAL)) == B_DELWRI)) {
            if (bp->b_dev == NODEV) {
                TAILQ_REMOVE(&mountlist,
                    bp->b_vp->v_mount, mnt_list);
                continue;
            }
            nbusy++;
        }
    }
    if (nbusy) {
        /*
         * Failed to sync all blocks. Indicate this and don't
         * unmount filesystems (thus forcing an fsck on reboot).
         */
        printf("giving up on %d buffers\n", nbusy);
        DELAY(5000000); /* 5 seconds */
    }
}