From: "David E. Cross"
To: wronkm@cs.rpi.edu, crossd@cs.rpi.edu, moorthy@cs.rpi.edu, freebsd-fs@freebsd.org
Date: 10 Mar 2004 18:21:52 -0500
Message-Id: <1078960907.4345.20.camel@kiki.cs.rpi.edu>
Subject: JUFS update, and questions.

Journaled UFS Technology Description

As many are aware, we have been keenly interested in journaling for the
UFS filesystem. This message is intended to bring people up to date on
the design decisions we have made and on our progress, and to solicit
help with the problems that we are facing.

In the design of this system we consulted many different implementations
of journaled filesystems, including ext3fs, reiser, XFS, and JFS. We
also received an incomplete but highly functional journaled UFS
implementation. From these we have attempted to construct a
"best-of-breed" solution.

From our review we selected methods based on those used by JFS and XFS,
due to their relative simplicity and performance and their similarity to
the journaled UFS implementation that we have. A brief description is as
follows:

There exists on disk, in the root of each filesystem, a file called
.journal. Upon r/w mount this file is verified to have the following
characteristics:

  1) mode: -r--------
  2) user: 0, group: 0
  3) flags: noschg
  4) all blocks allocated (no sparse blocks, no frags) (1)
  5) the journal is empty, and the first entry is a checkpoint
     ("empty" meaning NOT null, but containing no operations that
     need to be committed)

The system then saves the vnode/inode reference to this file and has a
hook in chflags that prevents modification of that vnode/inode during
operation. The code prevents r/w mounting of the filesystem unless the
above conditions are met.

The format of the journal is roughly as follows. Each block (FS
blocksize) has this format:

  Block {
      Header {
          Magic Number
          Version
          Transaction ID of this block
          Last transaction ID committed
          Length of Header
          # of transactions in block
          Options Field
          Checksum
      }
      Transaction {
          Opcode
          operand
      } (repeat for number in header)
  }

In addition to this on-disk representation the system maintains an
in-core journal. The in-core journal provides a buffer mechanism so that
each operation does not force a sync write. The format of the in-core
journal is roughly as follows:

  journal {
      current transaction ID
      Last Transaction committed ID
      first in-core entry
      last in-core entry
      first on-disk entry
      last on-disk entry
      mutex-pointer
      buffer
  }

Every operation then has its information placed in the buffer. When the
buffer becomes full it is flushed to the on-disk journal; when the
on-disk journal is full it is read back and committed. During periods of
light disk IO a heartbeat kernel process will periodically force commits
of all buffered data, both on disk and in core.

One of the opcodes defined is the NOP.
Its format is:

  Opcode    Operand
  0x0000    length (16 bit), data (arbitrary)

Aside from debugging, this is used as a checkpoint function: after a
commit the journal will write a blank journal entry stating that this is
transaction "N" and that transaction "N-1" was the last committed. This
is also done on umount.

Journaling will be a mount option, and has so far been defined as

  MNT_JOURNAL 0x00800000 (2)

This flag will trigger the checks mentioned at the beginning.

The kernel will _not_ replay the journal in the event of an unclean
mount. This will be handled by fsck, for at least the following
situation: moving between the journaled and non-journaled options, due
to either (lack of) specifying mount flags or different compiled
options. For example: the admin mounts /usr/home with "-o journal", the
system crashes, and the system comes back up with an /etc/fstab that has
not been updated to include the "journal" flag. The admin later realizes
this and remounts /usr/home with the appropriate flag. If fsck did not
handle the journal syncing, then the FS would be "repaired" by fsck on
the reboot after the crash, and the kernel would then attempt to
re-repair the data from the journal log while referencing a potentially
MUCH older version of the filesystem database. (3)

fsck will also ensure that the journal file meets the requirements
listed above; specifically, it will update the journal file itself to
include the checkpoint if needed.

fsck's operation in brief will be as follows:

  1) Scan the journal file for the highest numbered transaction ID.
  2) Read the number of the last completed transaction from that block.
  3) Rescan the journal for the lowest transaction ID after that one.
  4) Begin replaying in order until the highest transaction ID is
     reached.
  5) Write the checkpoint transaction and mark the filesystem clean.

Unmounting of the filesystem will include a full commit of the journal
(in-core and on-disk), and a write of the checkpoint opcode to the first
journal block.
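
To make fsck steps 1-3 above a bit more concrete, here is a minimal
sketch in C of how the replay range might be picked from the block
headers. This is only an illustration, not our actual code: the field
widths, the magic value, and the names jufs_blk_hdr, jufs_replay_range
and JUFS_MAGIC are all made up for this example. It assumes the headers
have already been read into an array, and it ignores checksum
verification and transaction ID wraparound.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JUFS_MAGIC 0x4a554653u	/* placeholder value, not a real constant */

/* Block header fields as listed above; the widths are guesses. */
struct jufs_blk_hdr {
	uint32_t magic;
	uint32_t version;
	uint64_t tid;			/* transaction ID of this block */
	uint64_t last_committed;	/* last transaction ID committed */
	uint16_t hdr_len;
	uint16_t ntrans;
	uint32_t options;
	uint32_t checksum;
};

struct jufs_replay_range {
	size_t   newest;		/* block with the highest TID (step 1) */
	uint64_t last_committed;	/* read from that block (step 2) */
	size_t   first;			/* first block to replay (step 3) */
};

/*
 * Returns 0 if a replay range was found, 1 if nothing needs replaying,
 * and -1 if no block carries a valid magic number.
 */
static int
jufs_replay_range(const struct jufs_blk_hdr *hdrs, size_t n,
    struct jufs_replay_range *out)
{
	size_t i, newest = n, first = n;

	assert(hdrs != NULL && out != NULL);

	/* Step 1: find the block with the highest transaction ID. */
	for (i = 0; i < n; i++) {
		if (hdrs[i].magic != JUFS_MAGIC)
			continue;
		if (newest == n || hdrs[i].tid > hdrs[newest].tid)
			newest = i;
	}
	if (newest == n)
		return (-1);

	/* Step 2: that block records the last committed transaction. */
	out->newest = newest;
	out->last_committed = hdrs[newest].last_committed;

	/* Step 3: find the lowest transaction ID after the committed one. */
	for (i = 0; i < n; i++) {
		if (hdrs[i].magic != JUFS_MAGIC ||
		    hdrs[i].tid <= out->last_committed)
			continue;
		if (first == n || hdrs[i].tid < hdrs[first].tid)
			first = i;
	}
	out->first = first;
	return (first == n ? 1 : 0);
}
```

Steps 4 and 5 would then replay blocks in transaction ID order from
"first" through "newest" and write the checkpoint. Note that on a
cleanly checkpointed journal this sketch selects the checkpoint block
itself, which is harmless because it only holds a NOP.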

Given the nature of what we are doing (and how), it is incompatible to
mount a filesystem both journaled and softdep-ed. Our code will prevent
an admin/user from trying to do both at once with a denial message; it
will not just silently fail.

Issues that we are having now include how and when to increment the
transaction ID. The transaction IDs are used to group operations
together so that related operations are completed together, and to
guarantee replay-safeness. For example, a rename(2) is a combination of
a link and an unlink, so it works something like this:

  TID=5
  rename(2) call made
  TID++
  link    (opcode tagged with TID 6)
  unlink  (opcode tagged with TID 6)
  TID=6

Later, when this is flushed to disk, the system will make sure that all
opcodes with the same TID are written together and not split across
blocks. The TID in the header of the block will be the TID of the last
opcode in that block, so the block becomes a super-transaction of all of
them (potentially thousands of smaller transactions).

An unlink would be similar to this (assuming no processes are holding
the file open, and a link count of 1):

  TID=6
  unlink(2) call made
  TID++
  unlink
  inode update (link_cnt--)
  inode update (free)
  truncate
  TID=7

Assuming a flush to disk now, the block would contain the following:

  Header { TID = 7, count = 6, lastTID = 5 }
  opcodes {
      link
      unlink          <--- these were the rename(2)
      unlink
      inode update
      inode update
      truncate
  }

This block could then be safely replayed multiple times. (Think of a
crash where this block had been committed but the checkpoint not
written; fsck would then replay it, since it could not know that it was
already done.)

These examples are relatively easy. What we are running into problems
with is things that bypass the vfs layer. An example is mmap()ing a
sparse file: a write access to the middle of the file could trigger a
large number of updates, including inode changes, direct block
allocations, indirect block allocations, and fragment promotions.
In this situation, and in our model, how and where would we increment
the transaction ID?

Notes:

(1) I do not know how to actually do this within the kernel; pointers
    here would be appreciated.

(2) This currently conflicts with MNT_IGNORE. Is this a problem? What
    should we use?

(3) There is another problem here: files that were held open when the
    system crashed. They could have a reference count of zero, but
    still have allocated data. It seems that an fsck would still be
    required to walk the inode tables and put these files "somewhere",
    or just free the blocks they were using. Can anyone think of a
    better way to do this?

--
David E. Cross