From: Siddharth Aggarwal
To: Allan Fields
Cc: freebsd-fs@freebsd.org
Date: Mon, 3 May 2004 13:56:25 -0600 (MDT)
Subject: Re: Debugging pseudo-disk driver on FreeBSD
In-Reply-To: <20040502222558.GB31553@afields.ca>
References: <20040502222558.GB31553@afields.ca>

On Sun, 2 May 2004, Allan Fields wrote:

> On Sun, May 02, 2004 at 12:41:56AM -0600, Siddharth Aggarwal wrote:
> >
> > Hi,
> >
> > I am working on a copy-on-write disk driver on FreeBSD where I try to
> > save the state of a filesystem (/dev/ad0s3) to another device
> > (/dev/ad0s4) by making a virtual device that sits on top of these two
> > (/dev/shd0).
> >
> > 1. In the strategy routine, I get the block read/write calls to
> >    /dev/shd0.
> > 2. For a write operation, I copy the previous contents of the block
> >    (number corresponding to /dev/ad0s3) onto a free block on /dev/ad0s4.
> > 3. To restore the previous contents of the disk, I read the allocated
> >    free block from /dev/ad0s4 and write it back to the original block
> >    number on /dev/ad0s3.
> >
> > The virtual device /dev/shd0 is mounted on /mnt.
> >
> > To test it out, my /dev/ad0s3 originally had a file "old1" of 13685
> > bytes containing a repeating string pattern (OLDOLD).
> > I then copied a file "new1" of 8211 bytes having the repeating pattern
> > (NEWNEW) to overwrite the old one,
> > i.e. cp new1 /mnt/old1
> >
> > A hexdump shows that a block of 8192 bytes containing "OLDOLD" was
> > copied over to /dev/ad0s4 and its place was taken by "NEWNEW" in
> > /dev/ad0s3. Also, the remaining bytes (beyond the first 8192) are still
> > in /dev/ad0s3. So this shows that the copy on write was done correctly.
> > And I correctly see 8211 bytes of "NEWNEW" in /mnt/old1
> > (ls -l /mnt/old1).
>
> On closer read, I see the advantage of your approach here: where the
> originating device always has the latest changes but old data is
> still stored on another device. (But for how long.. until next
> overwrite. Revisioning possibilities?) This means that the original

Yes, I am doing some kind of versioning for these blocks, which are stored
away on the shadow device.

> disk is always consistent with the most recent changes but has a
> sort of log of old blocks?
>
> This is conceptually the opposite approach to the union filesystem,
> which traditionally keeps new changes to files on another filesystem
> (the overlay) and preserves the underlying filesystem contents.
>
> Your facility also allows devices containing arbitrary data, which
> could be, for example, raw data streams as opposed to a filesystem
> which is accessible through the VFS. But this carries with it the
> implications of device-level block I/O. Restoring any given file
> would involve translating the inode to physical blocks and restoring
> only those portions which were changed by the operation. I'm unclear
> how this works. Take undeleting a file: wouldn't you need to
> restore the inode, the direct blocks, any indirect blocks and
> dirents by referencing these blocks? How do you know how to do
> this (at file granularity) at the device level in a filesystem-
> agnostic way? (Could writes be processed atomically?)
>

Actually, the use case for this driver doesn't involve much rolling back
to a previous state. Instead, the idea is to get a fresh disk image on
another machine and then apply these log entries to the new disk in
chronological order to reach a similar state on the new machine. So some
of the concerns you expressed above may not apply.

> Alternatively, you can implement this copy-on-write scheme at the
> vnode layer.
>
> > I then send an IOCTL to my driver to restore to the previous state
> > (expecting it to give me 13685 bytes of "OLDOLD" back in /mnt/old1).
>
> So this is like a snapshot of the original state of the filesystem
> on the device in its entirety (sort of like snapshots but at the
> device level vs. filesystem level)? How do you ensure it's consistent,
> especially when the device backing the storage of old blocks becomes
> full? Which do you turf first? (The problem is less significant if you
> have a 1:1 mapping of blocks, like a RAID mirror w/ same partition size.)
>
> > After unmounting and remounting, I see that the contents of /mnt/old1
> > have become OLDOLD, but there are only 8211 bytes instead of 13685. A
> > hexdump of /dev/ad0s3, however, shows that there are indeed 13685
> > consecutive bytes of OLDOLD lying there.
> >
> > This has led me to believe that the inode of /mnt/old1 is not being
> > refreshed (or it was never saved off to /dev/ad0s4 in the first
> > place). Do inode reads/writes go through the strategy routine in the
> > first place?
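
The hypothesis in the quoted question (file data is restored, but the inode
was never saved to the shadow device) can be modelled in a few lines. This
is only a user-space sketch, not the driver: ShadowDisk, write_bypassing_cow,
pattern and run are hypothetical names, the "inode" is reduced to a single
size field in block 0, and only the first data block of the file is
modelled. It shows that if the metadata write never reaches the
save-before-write step, a restore yields old data with the new size, which
matches the symptom described, without proving that this is the actual cause.

```python
BLOCK = 8192  # block size seen in the hexdump described above


class ShadowDisk:
    """Toy model of the copy-on-write layer (all names are hypothetical)."""

    def __init__(self, nblocks):
        self.primary = [bytes(BLOCK)] * nblocks  # stands in for /dev/ad0s3
        self.shadow = []  # stands in for /dev/ad0s4: (blkno, old contents)

    def write(self, blkno, data):
        # Save the previous contents before overwriting the block.
        self.shadow.append((blkno, self.primary[blkno]))
        self.primary[blkno] = data

    def write_bypassing_cow(self, blkno, data):
        # Assumed faulty path: the old contents are never saved.
        self.primary[blkno] = data

    def restore(self):
        # Undo newest-first so each block ends up with its oldest contents.
        while self.shadow:
            blkno, old = self.shadow.pop()
            self.primary[blkno] = old


def pattern(word, n):
    """Return n bytes of a repeating pattern such as b'OLD'."""
    return (word * (n // len(word) + 1))[:n]


def run(inode_goes_through_cow):
    INODE, DATA = 0, 1  # block 0: file size only; block 1: first data block
    disk = ShadowDisk(2)
    disk.primary[INODE] = (13685).to_bytes(8, "little")
    disk.primary[DATA] = pattern(b"OLD", BLOCK)

    # Equivalent of "cp new1 /mnt/old1": new data, then new size.
    disk.write(DATA, pattern(b"NEW", BLOCK))
    new_inode = (8211).to_bytes(8, "little")
    if inode_goes_through_cow:
        disk.write(INODE, new_inode)
    else:
        disk.write_bypassing_cow(INODE, new_inode)

    disk.restore()
    size = int.from_bytes(disk.primary[INODE], "little")
    return size, disk.primary[DATA][:6]


print(run(True))   # size and data both restored
print(run(False))  # data restored, size still the new one
```

In this model a restore is only complete if every block the filesystem
touched, metadata included, passed through write().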
>
> Can you reboot the machine and see the same effects? I know that
> sounds like an extreme measure, but that's a way to determine for
> sure if it's a caching issue. You could also try doing a few large
> dd's from another filesystem between dis/remount.
>

I tried the reboot option too, but no success :(. One thing, though: if
the old1 and new1 files are the same size, i.e. 8211 bytes, I do get the
correct behavior :). But obviously that is too ideal a case, and I guess it
works because the filesystem metadata (particularly the inode) is not in
question here.

> > Any idea what could be going wrong?
>
> No clue. ;)
>
> --
> Allan Fields
> AFRSL - http://afields.ca
> BSDCan: May 2004, Ottawa - http://www.bsdcan.org
>
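
The replay use case mentioned earlier in this message (applying log entries
to a fresh image in chronological order) might look roughly like the sketch
below. It rests on assumptions that the message does not confirm: each log
entry is assumed to carry a sequence number and the contents that were
written to the block, whereas the driver as described saves the previous
contents. LogEntry, seq and replay are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    seq: int     # assumed monotonically increasing write counter
    blkno: int   # block number on the primary device
    data: bytes  # contents written to that block (assumed to be logged)


def replay(image, log):
    """Apply log entries to a copy of a fresh image, oldest entry first."""
    disk = dict(image)  # block number -> contents; the input is not modified
    for entry in sorted(log, key=lambda e: e.seq):
        disk[entry.blkno] = entry.data
    return disk


fresh = {0: b"base0", 1: b"base1"}
# Entries may arrive out of order; sorting by seq restores chronology.
log = [
    LogEntry(seq=2, blkno=0, data=b"v2"),
    LogEntry(seq=1, blkno=0, data=b"v1"),
    LogEntry(seq=3, blkno=1, data=b"x"),
]
result = replay(fresh, log)
print(result)
```

Ordering matters only for blocks written more than once: block 0 must end
up with the contents of the later entry, not the earlier one.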