Date:      Tue, 6 Mar 2007 11:18:34 -0500
From:      John Nielsen <lists@jnielsen.net>
To:        freebsd-questions@freebsd.org
Subject:   Re: gjournal and zfs, questions
Message-ID:  <200703061118.34321.lists@jnielsen.net>
In-Reply-To: <45ED7CAB.3010608@unsane.co.uk>
References:  <4F9C9299A10AE74E89EA580D14AA10A6028719@royal64.emp.zapto.org> <esjr0b$uvn$1@sea.gmane.org> <45ED7CAB.3010608@unsane.co.uk>

--Boundary-00=_aRZ7FUS0QLmc3St
Content-Type: text/plain;
  charset="utf-8"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

On Tuesday 06 March 2007 09:37, Vince wrote:
> Ivan Voras wrote:
> > Daniel Eriksson wrote:
> >> When will gjournal and zfs be committed to the CVS tree? Will either of
> >> them be merged to STABLE?
> >
> > GJournal already is:
> > http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/geom/journal/g_journal.c?rev=1.9&content-type=text/x-cvsweb-markup
> >
> > It's unlikely they will be merged to 6-STABLE because they introduce
> > lots of changes.
>
> That said, there are patches for -STABLE, although they are a little
> stale now because changes have been made to stable. Last time I tried,
> only cosmetic changes were needed: the patches wouldn't apply cleanly
> because the file(s) had changed, but applying the rejected hunks by
> hand worked fine. Just look for rejects and see why each was rejected;
> usually a line has been added to or removed from the file, so the line
> number in the patch is wrong.
> See Pawel's patches at
> http://people.freebsd.org/~pjd/patches/
> and in particular
> http://people.freebsd.org/~pjd/patches/gjournal6_20061024.patch
> Also search the freebsd-geom and freebsd-current mailing lists
> for how to apply them and build with gjournal support. For a start,
> before applying the patch, create the following directories in your
> source directory:
>
> sbin/geom/class/journal
> sys/geom/journal
> sys/modules/geom/geom_journal
>
> That said, I've a 6.2-RELEASE box that's got /var and /usr journaled and
> it's been running happily for a month and a half. It gives a
> scary-sounding message in dmesg:
> GEOM_JOURNAL: Cannot suspend file system /usr (error=35).
> but this post
> http://lists.freebsd.org/pipermail/freebsd-hackers/2006-August/017894.html
> indicates it's harmless and I haven't seen any problems.

I'm running gjournal with -STABLE here as well. The box is my work desktop and 
pseudo-server with a 450GB gmirror'ed /usr. The one problem I have is a 
consistent crash at shutdown. I suspect it's a result of poor interaction 
between gjournal and gmirror. Fortunately, it only happens after the 
filesystems are unmounted so the only real downside is being unable to reboot 
remotely (manual power cycle required). I haven't taken the time to 
document/report this yet but hopefully I'll get around to it sooner or later.

I'm attaching the patch I use. It applied cleanly last time I updated a couple 
weeks ago. It's the same as the other one available from the mailing lists 
with one change so it doesn't choke on sys/sys/vnode.h.
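
In case it no longer applies cleanly against your tree, here is a rough
sketch of how the leftovers could be found after running patch. The function
name list_rejects and the /usr/src default are placeholders of mine, not
anything from the patch; the only fact relied on is that patch(1) saves hunks
it could not apply in files ending in .rej next to the target file.

```shell
#!/bin/sh
# Hypothetical helper: list reject files left behind by patch(1) under a
# source tree. patch saves hunks it could not apply as <file>.rej.
list_rejects() {
    # $1: root of the source tree (defaults to /usr/src, an assumption)
    find "${1:-/usr/src}" -type f -name '*.rej' -print
}

# Example (not run here): list_rejects /usr/src
```

Each file it prints holds the hunks to merge by hand; an empty listing should
mean everything applied.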

If you plan to use it refer to the original instructions here:

http://lists.freebsd.org/pipermail/freebsd-fs/2006-June/001962.html

and also look at the manpage (from -CURRENT):

http://www.freebsd.org/cgi/man.cgi?query=gjournal&apropos=0&sektion=0&manpath=FreeBSD+7-current&format=html

In brief, you need to make the new directories in the source tree, apply the 
patch, add the gjournal option to your kernel, rebuild kernel and world and 
reinstall, create a journalled volume, remember to use the -J flag with newfs 
(or with tunefs if you forgot), mount it async, and if you're using gmirror 
do a "gmirror configure -n".
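
Spelled out as commands, the sequence would look roughly like the sketch
below. It only prints the commands and does not run anything. The paths, the
kernel config name MYKERNEL, the mirror name gm0, and the assumption that the
journal sits on top of the mirror are placeholders of mine; the exact kernel
option line is left to the original instructions linked above.

```shell
#!/bin/sh
# Prints the command sequence as a checklist; nothing is executed.
# Every value below is a placeholder, adjust to your own system.
SRC=/usr/src                             # source tree
PATCH=/tmp/gjournal6_20061030_1.patch    # wherever you saved the attachment
KERNCONF=MYKERNEL                        # your kernel config name
MIRROR=gm0                               # gmirror device name (assumed)
DEV=/dev/mirror/$MIRROR                  # provider to put the journal on
MNT=/usr                                 # mount point

show() {
    echo "$*"
}

gjournal_steps() {
    # 1. New directories the patch expects in the source tree.
    show mkdir -p "$SRC/sbin/geom/class/journal" "$SRC/sys/geom/journal" \
        "$SRC/sys/modules/geom/geom_journal"
    # 2. Apply the patch from the top of the tree (its paths are relative).
    show patch -p0 -d "$SRC" -i "$PATCH"
    # 3. Add the gjournal option to the kernel config by hand, then rebuild
    #    and reinstall. Reboot into the new kernel before going on to step 4.
    show make -C "$SRC" buildworld
    show make -C "$SRC" buildkernel "KERNCONF=$KERNCONF"
    show make -C "$SRC" installkernel "KERNCONF=$KERNCONF"
    show make -C "$SRC" installworld
    # 4. Create the journaled provider; it shows up as $DEV.journal.
    show gjournal label "$DEV"
    # 5. New file system with the gjournal flag. newfs destroys existing
    #    data; on an existing file system tunefs -J enable is the alternative.
    show newfs -J "$DEV.journal"
    # 6. Mount async and turn off gmirror autosynchronization.
    show mount -o async "$DEV.journal" "$MNT"
    show gmirror configure -n "$MIRROR"
}

gjournal_steps
```

Printing rather than executing is deliberate: newfs is destructive and the
reboot in the middle means the steps cannot sensibly run in one go anyway.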

JN

--Boundary-00=_aRZ7FUS0QLmc3St
Content-Type: text/x-diff; charset="utf-8"; name="gjournal6_20061030_1.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
	filename="gjournal6_20061030_1.patch"

---  etc/mtree/BSD.include.dist.orig
+++  etc/mtree/BSD.include.dist
@@ -102,6 +102,8 @@
         ..
         gate
         ..
+        journal
+        ..
         label
         ..
         mirror
---  include/Makefile.orig
+++  include/Makefile
@@ -42,8 +42,8 @@
 	fs/devfs fs/fdescfs fs/fifofs fs/msdosfs fs/ntfs fs/nullfs \
 	fs/nwfs fs/portalfs fs/procfs fs/smbfs fs/udf fs/umapfs \
 	fs/unionfs \
-	geom/concat geom/eli geom/gate geom/label geom/mirror geom/nop \
-	geom/raid3 geom/shsec geom/stripe \
+	geom/concat geom/eli geom/gate geom/journal geom/label geom/mirror \
+	geom/nop geom/raid3 geom/shsec geom/stripe \
 	isofs/cd9660 \
 	netatm/ipatm netatm/sigpvc netatm/spans netatm/uni \
 	netgraph/atm netgraph/netflow \
---  lib/libufs/Makefile.orig
+++  lib/libufs/Makefile
@@ -7,6 +7,7 @@
 MAN=	bread.3 cgread.3 libufs.3 sbread.3 ufs_disk_close.3
 MLINKS+= bread.3 bwrite.3
 MLINKS+= cgread.3 cgread1.3
+MLINKS+= cgread.3 cgwrite1.3
 MLINKS+= sbread.3 sbwrite.3
 MLINKS+= ufs_disk_close.3 ufs_disk_fillout.3
 MLINKS+= ufs_disk_close.3 ufs_disk_fillout_blank.3
---  lib/libufs/cgread.3.orig
+++  lib/libufs/cgread.3
@@ -4,6 +4,7 @@
 .\" 	Manual page for libufs functions:
 .\"		cgread(3)
 .\"		cgread1(3)
+.\"		cgwrite1(3)
 .\"
 .\" This file is in the public domain.
 .\"
@@ -13,8 +14,8 @@
 .Dt CGREAD 3
 .Os
 .Sh NAME
-.Nm cgread , cgread1
-.Nd read cylinder groups of UFS disks
+.Nm cgread , cgread1, cgwrite1
+.Nd read/write cylinder groups of UFS disks
 .Sh LIBRARY
 .Lb libufs
 .Sh SYNOPSIS
@@ -28,6 +29,8 @@
 .Fn cgread "struct uufsd *disk"
 .Ft int
 .Fn cgread1 "struct uufsd *disk" "int c"
+.Ft int
+.Fn cgwrite1 "struct uufsd *disk" "int c"
 .Sh DESCRIPTION
 The
 .Fn cgread
@@ -60,6 +63,14 @@
 field, and then incrementing the
 .Va d_ccg
 field.
+.Pp
+The
+.Fn cgwrite1
+function stores cylinder group specified by
+.Fa c
+from
+.Va d_cg
+field of a userland UFS disk structure on disk.
 .Sh RETURN VALUES
 Both functions return 0 if there are no more cylinder groups to read,
 1 if there are more cylinder groups, and \-1 on error.
@@ -75,8 +86,16 @@
 .Fn cgread1
 has semantically identical failure conditions to those of
 .Fn cgread .
+.Pp
+The function
+.Fn cgwrite1
+may fail and set
+.Va errno
+for any of the errors specified for the library function
+.Xr bwrite 3 .
 .Sh SEE ALSO
 .Xr bread 3 ,
+.Xr bwrite 3 ,
 .Xr libufs 3
 .Sh HISTORY
 These functions first appeared as part of
---  lib/libufs/cgroup.c.orig
+++  lib/libufs/cgroup.c
@@ -71,3 +71,17 @@
 	disk->d_lcg = c;
 	return (1);
 }
+
+int
+cgwrite1(struct uufsd *disk, int c)
+{
+	struct fs *fs;
+
+	fs = &disk->d_fs;
+	if (bwrite(disk, fsbtodb(fs, cgtod(fs, c)),
+	    disk->d_cgunion.d_buf, fs->fs_bsize) == -1) {
+		ERROR(disk, "unable to write cylinder group");
+		return (-1);
+	}
+	return (0);
+}
---  lib/libufs/libufs.3.orig
+++  lib/libufs/libufs.3
@@ -57,6 +57,7 @@
 .Xr bwrite 3 ,
 .Xr cgread 3 ,
 .Xr cgread1 3 ,
+.Xr cgwrite1 3 ,
 .Xr sbread 3 ,
 .Xr sbwrite 3 ,
 .Xr ufs_disk_close 3 ,
---  lib/libufs/libufs.h.orig
+++  lib/libufs/libufs.h
@@ -110,6 +110,7 @@
  */
 int cgread(struct uufsd *);
 int cgread1(struct uufsd *, int);
+int cgwrite1(struct uufsd *, int);
 
 /*
  * inode.c
---  sbin/dumpfs/dumpfs.c.orig
+++  sbin/dumpfs/dumpfs.c
@@ -168,8 +168,9 @@
 		    (intmax_t)afs.fs_cstotal.cs_ndir,
 		    (intmax_t)afs.fs_cstotal.cs_nifree, 
 		    (intmax_t)afs.fs_cstotal.cs_nffree);
-		printf("bpg\t%d\tfpg\t%d\tipg\t%d\n",
-		    afs.fs_fpg / afs.fs_frag, afs.fs_fpg, afs.fs_ipg);
+		printf("bpg\t%d\tfpg\t%d\tipg\t%d\tunrefs\t%jd\n",
+		    afs.fs_fpg / afs.fs_frag, afs.fs_fpg, afs.fs_ipg,
+		    (intmax_t)afs.fs_unrefs);
 		printf("nindir\t%d\tinopb\t%d\tmaxfilesize\t%ju\n",
 		    afs.fs_nindir, afs.fs_inopb, 
 		    (uintmax_t)afs.fs_maxfilesize);
@@ -228,10 +229,12 @@
 		printf("acls ");
 	if (fsflags & FS_MULTILABEL)
 		printf("multilabel ");
+	if (fsflags & FS_GJOURNAL)
+		printf("gjournal ");
 	if (fsflags & FS_FLAGS_UPDATED)
 		printf("fs_flags expanded ");
 	fsflags &= ~(FS_UNCLEAN | FS_DOSOFTDEP | FS_NEEDSFSCK | FS_INDEXDIRS |
-		     FS_ACLS | FS_MULTILABEL | FS_FLAGS_UPDATED);
+		     FS_ACLS | FS_MULTILABEL | FS_GJOURNAL | FS_FLAGS_UPDATED);
 	if (fsflags != 0)
 		printf("unknown flags (%#x)", fsflags);
 	putchar('\n');
@@ -282,8 +285,9 @@
 		cgtime = acg.cg_time;
 		printf("magic\t%x\ttell\t%jx\ttime\t%s",
 		    acg.cg_magic, (intmax_t)cur, ctime(&cgtime));
-		printf("cgx\t%d\tndblk\t%d\tniblk\t%d\tinitiblk %d\n",
-		    acg.cg_cgx, acg.cg_ndblk, acg.cg_niblk, acg.cg_initediblk);
+		printf("cgx\t%d\tndblk\t%d\tniblk\t%d\tinitiblk %d\tunrefs %d\n",
+		    acg.cg_cgx, acg.cg_ndblk, acg.cg_niblk, acg.cg_initediblk,
+		    acg.cg_unrefs);
 		break;
 	case 1:
 		cgtime = acg.cg_old_time;
---  sbin/fsck_ffs/Makefile.orig
+++  sbin/fsck_ffs/Makefile
@@ -7,7 +7,9 @@
 MAN=	fsck_ffs.8
 MLINKS=	fsck_ffs.8 fsck_ufs.8 fsck_ffs.8 fsck_4.2bsd.8
 SRCS=	dir.c ea.c fsutil.c inode.c main.c pass1.c pass1b.c pass2.c pass3.c \
-	pass4.c pass5.c setup.c utilities.c ffs_subr.c ffs_tables.c
+	pass4.c pass5.c setup.c utilities.c ffs_subr.c ffs_tables.c gjournal.c
+DPADD=	${LIBUFS}
+LDADD=	-lufs
 WARNS?=	2
 CFLAGS+= -I${.CURDIR}
 
---  sbin/fsck_ffs/fsck.h.orig
+++  sbin/fsck_ffs/fsck.h
@@ -328,9 +328,9 @@
 ino_t		allocino(ino_t request, int type);
 void		blkerror(ino_t ino, const char *type, ufs2_daddr_t blk);
 char	       *blockcheck(char *name);
-int		bread(int fd, char *buf, ufs2_daddr_t blk, long size);
+int		blread(int fd, char *buf, ufs2_daddr_t blk, long size);
 void		bufinit(void);
-void		bwrite(int fd, char *buf, ufs2_daddr_t blk, long size);
+void		blwrite(int fd, char *buf, ufs2_daddr_t blk, long size);
 void		cacheino(union dinode *dp, ino_t inumber);
 void		catch(int);
 void		catchquit(int);
@@ -388,3 +388,4 @@
 void		sblock_init(void);
 void		setinodebuf(ino_t);
 int		setup(char *dev);
+void		gjournal_check(const char *filesys);
---  sbin/fsck_ffs/fsutil.c.orig
+++  sbin/fsck_ffs/fsutil.c
@@ -221,7 +221,7 @@
 	if (bp->b_bno != dblk) {
 		flush(fswritefd, bp);
 		diskreads++;
-		bp->b_errs = bread(fsreadfd, bp->b_un.b_buf, dblk, size);
+		bp->b_errs = blread(fsreadfd, bp->b_un.b_buf, dblk, size);
 		bp->b_bno = dblk;
 		bp->b_size = size;
 	}
@@ -244,11 +244,11 @@
 		    (bp->b_errs == bp->b_size / dev_bsize) ? "" : "PARTIALLY ",
 		    (long long)bp->b_bno);
 	bp->b_errs = 0;
-	bwrite(fd, bp->b_un.b_buf, bp->b_bno, (long)bp->b_size);
+	blwrite(fd, bp->b_un.b_buf, bp->b_bno, (long)bp->b_size);
 	if (bp != &sblk)
 		return;
 	for (i = 0, j = 0; i < sblock.fs_cssize; i += sblock.fs_bsize, j++) {
-		bwrite(fswritefd, (char *)sblock.fs_csp + i,
+		blwrite(fswritefd, (char *)sblock.fs_csp + i,
 		    fsbtodb(&sblock, sblock.fs_csaddr + j * sblock.fs_frag),
 		    sblock.fs_cssize - i < sblock.fs_bsize ?
 		    sblock.fs_cssize - i : sblock.fs_bsize);
@@ -345,7 +345,7 @@
 }
 
 int
-bread(int fd, char *buf, ufs2_daddr_t blk, long size)
+blread(int fd, char *buf, ufs2_daddr_t blk, long size)
 {
 	char *cp;
 	int i, errs;
@@ -387,7 +387,7 @@
 }
 
 void
-bwrite(int fd, char *buf, ufs2_daddr_t blk, long size)
+blwrite(int fd, char *buf, ufs2_daddr_t blk, long size)
 {
 	int i;
 	char *cp;
--- /dev/null	Tue Oct 24 16:33:50 2006
+++ sbin/fsck_ffs/gjournal.c	Tue Oct 24 16:33:58 2006
@@ -0,0 +1,774 @@
+/*-
+ * Copyright (c) 2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ *
+ * Copyright (c) 1982, 1986, 1989, 1993
+ *	The Regents of the University of California.  All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 4. Neither the name of the University nor the names of its contributors
+ *    may be used to endorse or promote products derived from this software
+ *    without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include <sys/cdefs.h>
+__FBSDID("$FreeBSD$");
+
+#include <sys/param.h>
+#include <sys/disklabel.h>
+#include <sys/mount.h>
+#include <sys/stat.h>
+
+#include <ufs/ufs/ufsmount.h>
+#include <ufs/ufs/dinode.h>
+#include <ufs/ffs/fs.h>
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <libufs.h>
+#include <strings.h>
+#include <err.h>
+#include <assert.h>
+
+#include "fsck.h"
+
+struct cgchain {
+	union {
+		struct cg cgcu_cg;
+		char cgcu_buf[MAXBSIZE];
+	} cgc_union;
+	int	cgc_busy;
+	int	cgc_dirty;
+	LIST_ENTRY(cgchain) cgc_next;
+};
+#define cgc_cg	cgc_union.cgcu_cg
+
+#define	MAX_CACHED_CGS	1024
+static unsigned ncgs = 0;
+static LIST_HEAD(, cgchain) cglist = LIST_HEAD_INITIALIZER(&cglist);
+
+static const char *devnam;
+static struct uufsd *disk = NULL;
+static struct fs *fs = NULL;
+struct ufs2_dinode ufs2_zino;
+
+static void putcgs(void);
+
+/*
+ * Write current block of inodes.
+ */
+static int
+putino(struct uufsd *disk, ino_t inode)
+{
+	caddr_t inoblock;
+	struct fs *fs;
+	ssize_t ret;
+
+	fs = &disk->d_fs;
+	inoblock = disk->d_inoblock;
+
+	assert(inoblock != NULL);
+	assert(inode >= disk->d_inomin && inode <= disk->d_inomax);
+	ret = bwrite(disk, fsbtodb(fs, ino_to_fsba(fs, inode)), inoblock,
+	    fs->fs_bsize);
+
+	return (ret == -1 ? -1 : 0);
+}
+
+/*
+ * Return cylinder group from the cache or load it if it is not in the
+ * cache yet.
+ * Don't cache more than MAX_CACHED_CGS cylinder groups.
+ */
+static struct cgchain *
+getcg(int cg)
+{
+	struct cgchain *cgc;
+
+	assert(disk != NULL && fs != NULL);
+	LIST_FOREACH(cgc, &cglist, cgc_next) {
+		if (cgc->cgc_cg.cg_cgx == cg) {
+			//printf("%s: Found cg=%d\n", __func__, cg);
+			return (cgc);
+		}
+	}
+	/*
+	 * Our cache is full? Let's clean it up.
+	 */
+	if (ncgs >= MAX_CACHED_CGS) {
+		//printf("%s: Flushing CGs.\n", __func__);
+		putcgs();
+	}
+	cgc = malloc(sizeof(*cgc));
+	if (cgc == NULL) {
+		/*
+		 * Cannot allocate memory?
+		 * Let's put all currently loaded and not busy cylinder groups
+		 * on disk and try again.
+		 */
+		//printf("%s: No memory, flushing CGs.\n", __func__);
+		putcgs();
+		cgc = malloc(sizeof(*cgc));
+		if (cgc == NULL)
+			err(1, "malloc(%zu)", sizeof(*cgc));
+	}
+	if (cgread1(disk, cg) == -1)
+		err(1, "cgread1(%d)", cg);
+	bcopy(&disk->d_cg, &cgc->cgc_cg, sizeof(cgc->cgc_union));
+	cgc->cgc_busy = 0;
+	cgc->cgc_dirty = 0;
+	LIST_INSERT_HEAD(&cglist, cgc, cgc_next);
+	ncgs++;
+	//printf("%s: Read cg=%d\n", __func__, cg);
+	return (cgc);
+}
+
+/*
+ * Mark cylinder group as dirty - it will be written back on putcgs().
+ */
+static void
+dirtycg(struct cgchain *cgc)
+{
+
+	cgc->cgc_dirty = 1;
+}
+
+/*
+ * Mark cylinder group as busy - it will not be freed on putcgs().
+ */
+static void
+busycg(struct cgchain *cgc)
+{
+
+	cgc->cgc_busy = 1;
+}
+
+/*
+ * Unmark the given cylinder group as busy.
+ */
+static void
+unbusycg(struct cgchain *cgc)
+{
+
+	cgc->cgc_busy = 0;
+}
+
+/*
+ * Write back all dirty cylinder groups.
+ * Free all non-busy cylinder groups.
+ */
+static void
+putcgs(void)
+{
+	struct cgchain *cgc, *cgc2;
+
+	assert(disk != NULL && fs != NULL);
+	LIST_FOREACH_SAFE(cgc, &cglist, cgc_next, cgc2) {
+		if (cgc->cgc_busy)
+			continue;
+		LIST_REMOVE(cgc, cgc_next);
+		ncgs--;
+		if (cgc->cgc_dirty) {
+			bcopy(&cgc->cgc_cg, &disk->d_cg,
+			    sizeof(cgc->cgc_union));
+			if (cgwrite1(disk, cgc->cgc_cg.cg_cgx) == -1)
+				err(1, "cgwrite1(%d)", cgc->cgc_cg.cg_cgx);
+			//printf("%s: Wrote cg=%d\n", __func__,
+			//    cgc->cgc_cg.cg_cgx);
+		}
+		free(cgc);
+	}
+}
+
+#if 0
+/*
+ * Free all non-busy cylinder groups without storing the dirty ones.
+ */
+static void
+cancelcgs(void)
+{
+	struct cgchain *cgc;
+
+	assert(disk != NULL && fs != NULL);
+	while ((cgc = LIST_FIRST(&cglist)) != NULL) {
+		if (cgc->cgc_busy)
+			continue;
+		LIST_REMOVE(cgc, cgc_next);
+		//printf("%s: Canceled cg=%d\n", __func__, cgc->cgc_cg.cg_cgx);
+		free(cgc);
+	}
+}
+#endif
+
+/*
+ * Open the given provider, load statistics.
+ */
+static void
+getdisk(void)
+{
+	int i;
+
+	if (disk != NULL)
+		return;
+	disk = malloc(sizeof(*disk));
+	if (disk == NULL)
+		err(1, "malloc(%zu)", sizeof(*disk));
+	if (ufs_disk_fillout(disk, devnam) == -1) {
+		err(1, "ufs_disk_fillout(%s) failed: %s", devnam,
+		    disk->d_error);
+	}
+	fs = &disk->d_fs;
+	fs->fs_csp = malloc((size_t)fs->fs_cssize);
+	if (fs->fs_csp == NULL)
+		err(1, "malloc(%zu)", (size_t)fs->fs_cssize);
+	bzero(fs->fs_csp, (size_t)fs->fs_cssize);
+	for (i = 0; i < fs->fs_cssize; i += fs->fs_bsize) {
+		if (bread(disk, fsbtodb(fs, fs->fs_csaddr + numfrags(fs, i)),
+		    (void *)(((char *)fs->fs_csp) + i),
+		    (size_t)(fs->fs_cssize - i < fs->fs_bsize ? fs->fs_cssize - i : fs->fs_bsize)) == -1) {
+			err(1, "bread: %s", disk->d_error);
+		}
+	}
+	if (fs->fs_contigsumsize > 0) {
+		fs->fs_maxcluster = malloc(fs->fs_ncg * sizeof(int32_t));
+		if (fs->fs_maxcluster == NULL)
+			err(1, "malloc(%zu)", fs->fs_ncg * sizeof(int32_t));
+		for (i = 0; i < fs->fs_ncg; i++)
+			fs->fs_maxcluster[i] = fs->fs_contigsumsize;
+	}
+}
+
+/*
+ * Mark file system as clean, write the super-block back, close the disk.
+ */
+static void
+closedisk(void)
+{
+
+	free(fs->fs_csp);
+	if (fs->fs_contigsumsize > 0) {
+		free(fs->fs_maxcluster);
+		fs->fs_maxcluster = NULL;
+	}
+	fs->fs_clean = 1;
+	if (sbwrite(disk, 0) == -1)
+		err(1, "sbwrite(%s)", devnam);
+	if (ufs_disk_close(disk) == -1)
+		err(1, "ufs_disk_close(%s)", devnam);
+	free(disk);
+	disk = NULL;
+	fs = NULL;
+}
+
+/*
+ * Write the statistics back, call closedisk().
+ */
+static void
+putdisk(void)
+{
+	int i;
+
+	assert(disk != NULL && fs != NULL);
+	for (i = 0; i < fs->fs_cssize; i += fs->fs_bsize) {
+		if (bwrite(disk, fsbtodb(fs, fs->fs_csaddr + numfrags(fs, i)),
+		    (void *)(((char *)fs->fs_csp) + i),
+		    (size_t)(fs->fs_cssize - i < fs->fs_bsize ? fs->fs_cssize - i : fs->fs_bsize)) == -1) {
+			err(1, "bwrite: %s", disk->d_error);
+		}
+	}
+	closedisk();
+}
+
+#if 0
+/*
+ * Free memory, close the disk, but don't write anything back.
+ */
+static void
+canceldisk(void)
+{
+	int i;
+
+	assert(disk != NULL && fs != NULL);
+	free(fs->fs_csp);
+	if (fs->fs_contigsumsize > 0)
+		free(fs->fs_maxcluster);
+	if (ufs_disk_close(disk) == -1)
+		err(1, "ufs_disk_close(%s)", devnam);
+	free(disk);
+	disk = NULL;
+	fs = NULL;
+}
+#endif
+
+static int
+isblock(unsigned char *cp, ufs1_daddr_t h)
+{
+	unsigned char mask;
+
+	switch ((int)fs->fs_frag) {
+	case 8:
+		return (cp[h] == 0xff);
+	case 4:
+		mask = 0x0f << ((h & 0x1) << 2);
+		return ((cp[h >> 1] & mask) == mask);
+	case 2:
+		mask = 0x03 << ((h & 0x3) << 1);
+		return ((cp[h >> 2] & mask) == mask);
+	case 1:
+		mask = 0x01 << (h & 0x7);
+		return ((cp[h >> 3] & mask) == mask);
+	default:
+		assert(!"isblock: invalid number of fragments");
+	}
+	return (0);
+}
+
+/*
+ * put a block into the map
+ */
+static void
+setblock(unsigned char *cp, ufs1_daddr_t h)
+{
+
+	switch ((int)fs->fs_frag) {
+	case 8:
+		cp[h] = 0xff;
+		return;
+	case 4:
+		cp[h >> 1] |= (0x0f << ((h & 0x1) << 2));
+		return;
+	case 2:
+		cp[h >> 2] |= (0x03 << ((h & 0x3) << 1));
+		return;
+	case 1:
+		cp[h >> 3] |= (0x01 << (h & 0x7));
+		return;
+	default:
+		assert(!"setblock: invalid number of fragments");
+	}
+}
+
+/*
+ * check if a block is free
+ */
+static int
+isfreeblock(u_char *cp, ufs1_daddr_t h)
+{
+
+	switch ((int)fs->fs_frag) {
+	case 8:
+		return (cp[h] == 0);
+	case 4:
+		return ((cp[h >> 1] & (0x0f << ((h & 0x1) << 2))) == 0);
+	case 2:
+		return ((cp[h >> 2] & (0x03 << ((h & 0x3) << 1))) == 0);
+	case 1:
+		return ((cp[h >> 3] & (0x01 << (h & 0x7))) == 0);
+	default:
+		assert(!"isfreeblock: invalid number of fragments");
+	}
+	return (0);
+}
+
+/*
+ * Update the frsum fields to reflect addition or deletion
+ * of some frags.
+ */
+void
+fragacct(int fragmap, int32_t fraglist[], int cnt)
+{
+	int inblk;
+	int field, subfield;
+	int siz, pos;
+
+	inblk = (int)(fragtbl[fs->fs_frag][fragmap]) << 1;
+	fragmap <<= 1;
+	for (siz = 1; siz < fs->fs_frag; siz++) {
+		if ((inblk & (1 << (siz + (fs->fs_frag % NBBY)))) == 0)
+			continue;
+		field = around[siz];
+		subfield = inside[siz];
+		for (pos = siz; pos <= fs->fs_frag; pos++) {
+			if ((fragmap & field) == subfield) {
+				fraglist[siz] += cnt;
+				pos += siz;
+				field <<= siz;
+				subfield <<= siz;
+			}
+			field <<= 1;
+			subfield <<= 1;
+		}
+	}
+}
+
+static void
+clusteracct(struct cg *cgp, ufs1_daddr_t blkno)
+{
+	int32_t *sump;
+	int32_t *lp;
+	u_char *freemapp, *mapp;
+	int i, start, end, forw, back, map, bit;
+
+	if (fs->fs_contigsumsize <= 0)
+		return;
+	freemapp = cg_clustersfree(cgp);
+	sump = cg_clustersum(cgp);
+	/*
+	 * Clear the actual block.
+	 */
+	setbit(freemapp, blkno);
+	/*
+	 * Find the size of the cluster going forward.
+	 */
+	start = blkno + 1;
+	end = start + fs->fs_contigsumsize;
+	if (end >= cgp->cg_nclusterblks)
+		end = cgp->cg_nclusterblks;
+	mapp = &freemapp[start / NBBY];
+	map = *mapp++;
+	bit = 1 << (start % NBBY);
+	for (i = start; i < end; i++) {
+		if ((map & bit) == 0)
+			break;
+		if ((i & (NBBY - 1)) != (NBBY - 1)) {
+			bit <<= 1;
+		} else {
+			map = *mapp++;
+			bit = 1;
+		}
+	}
+	forw = i - start;
+	/*
+	 * Find the size of the cluster going backward.
+	 */
+	start = blkno - 1;
+	end = start - fs->fs_contigsumsize;
+	if (end < 0)
+		end = -1;
+	mapp = &freemapp[start / NBBY];
+	map = *mapp--;
+	bit = 1 << (start % NBBY);
+	for (i = start; i > end; i--) {
+		if ((map & bit) == 0)
+			break;
+		if ((i & (NBBY - 1)) != 0) {
+			bit >>= 1;
+		} else {
+			map = *mapp--;
+			bit = 1 << (NBBY - 1);
+		}
+	}
+	back = start - i;
+	/*
+	 * Account for old cluster and the possibly new forward and
+	 * back clusters.
+	 */
+	i = back + forw + 1;
+	if (i > fs->fs_contigsumsize)
+		i = fs->fs_contigsumsize;
+	sump[i]++;
+	if (back > 0)
+		sump[back]--;
+	if (forw > 0)
+		sump[forw]--;
+	/*
+	 * Update cluster summary information.
+	 */
+	lp = &sump[fs->fs_contigsumsize];
+	for (i = fs->fs_contigsumsize; i > 0; i--)
+		if (*lp-- > 0)
+			break;
+	fs->fs_maxcluster[cgp->cg_cgx] = i;
+}
+
+static void
+blkfree(ufs2_daddr_t bno, long size)
+{
+	struct cgchain *cgc;
+	struct cg *cgp;
+	ufs1_daddr_t fragno, cgbno;
+	int i, cg, blk, frags, bbase;
+	u_int8_t *blksfree;
+
+	cg = dtog(fs, bno);
+	cgc = getcg(cg);
+	dirtycg(cgc);
+	cgp = &cgc->cgc_cg;
+	cgbno = dtogd(fs, bno);
+	blksfree = cg_blksfree(cgp);
+	if (size == fs->fs_bsize) {
+		fragno = fragstoblks(fs, cgbno);
+		if (!isfreeblock(blksfree, fragno))
+			assert(!"blkfree: freeing free block");
+		setblock(blksfree, fragno);
+		clusteracct(cgp, fragno);
+		cgp->cg_cs.cs_nbfree++;
+		fs->fs_cstotal.cs_nbfree++;
+		fs->fs_cs(fs, cg).cs_nbfree++;
+	} else {
+		bbase = cgbno - fragnum(fs, cgbno);
+		/*
+		 * decrement the counts associated with the old frags
+		 */
+		blk = blkmap(fs, blksfree, bbase);
+		fragacct(blk, cgp->cg_frsum, -1);
+		/*
+		 * deallocate the fragment
+		 */
+		frags = numfrags(fs, size);
+		for (i = 0; i < frags; i++) {
+			if (isset(blksfree, cgbno + i))
+				assert(!"blkfree: freeing free frag");
+			setbit(blksfree, cgbno + i);
+		}
+		cgp->cg_cs.cs_nffree += i;
+		fs->fs_cstotal.cs_nffree += i;
+		fs->fs_cs(fs, cg).cs_nffree += i;
+		/*
+		 * add back in counts associated with the new frags
+		 */
+		blk = blkmap(fs, blksfree, bbase);
+		fragacct(blk, cgp->cg_frsum, 1);
+		/*
+		 * if a complete block has been reassembled, account for it
+		 */
+		fragno = fragstoblks(fs, bbase);
+		if (isblock(blksfree, fragno)) {
+			cgp->cg_cs.cs_nffree -= fs->fs_frag;
+			fs->fs_cstotal.cs_nffree -= fs->fs_frag;
+			fs->fs_cs(fs, cg).cs_nffree -= fs->fs_frag;
+			clusteracct(cgp, fragno);
+			cgp->cg_cs.cs_nbfree++;
+			fs->fs_cstotal.cs_nbfree++;
+			fs->fs_cs(fs, cg).cs_nbfree++;
+		}
+	}
+}
+
+/*
+ * Recursively free all indirect blocks.
+ */
+static void
+freeindir(ufs2_daddr_t blk, int level)
+{
+	char sblks[MAXBSIZE];
+	ufs2_daddr_t *blks;
+	int i;
+
+	if (bread(disk, fsbtodb(fs, blk), (void *)&sblks, (size_t)fs->fs_bsize) == -1)
+		err(1, "bread: %s", disk->d_error);
+	blks = (ufs2_daddr_t *)&sblks;
+	for (i = 0; i < howmany(fs->fs_bsize, sizeof(ufs2_daddr_t)); i++) {
+		if (blks[i] == 0)
+			break;
+		if (level == 0)
+			blkfree(blks[i], fs->fs_bsize);
+		else
+			freeindir(blks[i], level - 1);
+	}
+	blkfree(blk, fs->fs_bsize);
+}
+
+#define	dblksize(fs, dino, lbn) \
+	((dino)->di_size >= smalllblktosize(fs, (lbn) + 1) \
+	    ? (fs)->fs_bsize \
+	    : fragroundup(fs, blkoff(fs, (dino)->di_size)))
+
+/*
+ * Free all blocks associated with the given inode.
+ */
+static void
+clear_inode(struct ufs2_dinode *dino)
+{
+	ufs2_daddr_t bn;
+	int extblocks, i, level;
+	off_t osize;
+	long bsize;
+
+	extblocks = 0;
+	if (fs->fs_magic == FS_UFS2_MAGIC && dino->di_extsize > 0)
+		extblocks = btodb(fragroundup(fs, dino->di_extsize));
+	/* deallocate external attributes blocks */
+	if (extblocks > 0) {
+		osize = dino->di_extsize;
+		dino->di_blocks -= extblocks;
+		dino->di_extsize = 0;
+		for (i = 0; i < NXADDR; i++) {
+			if (dino->di_extb[i] == 0)
+				continue;
+			blkfree(dino->di_extb[i], sblksize(fs, osize, i));
+		}
+	}
+#define	SINGLE	0	/* index of single indirect block */
+#define	DOUBLE	1	/* index of double indirect block */
+#define	TRIPLE	2	/* index of triple indirect block */
+	/* deallocate indirect blocks */
+	for (level = SINGLE; level <= TRIPLE; level++) {
+		if (dino->di_ib[level] == 0)
+			break;
+		freeindir(dino->di_ib[level], level);
+	}
+	/* deallocate direct blocks and fragments */
+	for (i = 0; i < NDADDR; i++) {
+		bn = dino->di_db[i];
+		if (bn == 0)
+			continue;
+		bsize = dblksize(fs, dino, i);
+		blkfree(bn, bsize);
+	}
+}
+
+void
+gjournal_check(const char *filesys)
+{
+	struct ufs2_dinode *dino;
+	struct cgchain *cgc;
+	struct cg *cgp;
+	uint8_t *inosused, *blksfree;
+	ino_t cino, ino;
+	int cg, mode;
+
+	devnam = filesys;
+	getdisk();
+	/* Are there any unreferenced inodes in this file system? */
+	if (fs->fs_unrefs == 0) {
+		//printf("No unreferenced inodes.\n");
+		closedisk();
+		return;
+	}
+
+	for (cg = 0; cg < fs->fs_ncg; cg++) {
+		/* Show progress if requested. */
+		if (got_siginfo) {
+			printf("%s: phase j: cyl group %d of %d (%d%%)\n",
+			    cdevname, cg, fs->fs_ncg, cg * 100 / fs->fs_ncg);
+			got_siginfo = 0;
+		}
+		if (got_sigalarm) {
+			setproctitle("%s pj %d%%", cdevname,
+			     cg * 100 / fs->fs_ncg);
+			got_sigalarm = 0;
+		}
+		cgc = getcg(cg);
+		cgp = &cgc->cgc_cg;
+		/* Are there any unreferenced inodes in this cylinder group? */
+		if (cgp->cg_unrefs == 0)
+			continue;
+		//printf("Analizing cylinder group %d (count=%d)\n", cg, cgp->cg_unrefs);
+		/*
+		 * We are going to modify this cylinder group, so we want it to
+		 * be written back.
+		 */
+		dirtycg(cgc);
+		/* We don't want it to be freed in the meantime. */
+		busycg(cgc);
+		inosused = cg_inosused(cgp);
+		blksfree = cg_blksfree(cgp);
+		/*
+		 * Now go through the list of all inodes in this cylinder group
+		 * to find unreferenced ones.
+		 */
+		for (cino = 0; cino < fs->fs_ipg; cino++) {
+			ino = fs->fs_ipg * cg + cino;
+			/* Unallocated? Skip it. */
+			if (isclr(inosused, cino))
+				continue;
+			if (getino(disk, (void **)&dino, ino, &mode) == -1)
+				err(1, "getino(cg=%d ino=%d)", cg, ino);
+			/* Not a regular file nor directory? Skip it. */
+			if (!S_ISREG(dino->di_mode) && !S_ISDIR(dino->di_mode))
+				continue;
+			/* Has reference(s)? Skip it. */
+			if (dino->di_nlink > 0)
+				continue;
+			//printf("Clearing inode=%d (size=%jd)\n", ino, (intmax_t)dino->di_size);
+			/* Free inode's blocks. */
+			clear_inode(dino);
+			/* Deallocate it. */
+			clrbit(inosused, cino);
+			/* Update position of last used inode. */
+			if (ino < cgp->cg_irotor)
+				cgp->cg_irotor = ino;
+			/* Update statistics. */
+			cgp->cg_cs.cs_nifree++;
+			fs->fs_cs(fs, cg).cs_nifree++;
+			fs->fs_cstotal.cs_nifree++;
+			cgp->cg_unrefs--;
+			fs->fs_unrefs--;
+			/* If this is directory, update related statistics. */
+			if (S_ISDIR(dino->di_mode)) {
+				cgp->cg_cs.cs_ndir--;
+				fs->fs_cs(fs, cg).cs_ndir--;
+				fs->fs_cstotal.cs_ndir--;
+			}
+			/* Zero-fill the inode. */
+			*dino = ufs2_zino;
+			/* Write the inode back. */
+			if (putino(disk, ino) == -1)
+				err(1, "putino(cg=%d ino=%d)", cg, ino);
+			if (cgp->cg_unrefs == 0) {
+				//printf("No more unreferenced inodes in cg=%d.\n", cg);
+				break;
+			}
+		}
+		/*
+		 * We don't need this cylinder group anymore, so feel free to
+		 * free it if needed.
+		 */
+		unbusycg(cgc);
+		/*
+		 * If there are no more unreferenced inodes, there is no need to
+		 * check other cylinder groups.
+		 */
+		if (fs->fs_unrefs == 0) {
+			//printf("No more unreferenced inodes (cg=%d/%d).\n", cg,
+			//    fs->fs_ncg);
+			break;
+		}
+	}
+	/* Write back modified cylinder groups. */
+	putcgs();
+	/* Write back updated statistics and super-block. */
+	putdisk();
+}
--- sbin/fsck_ffs/inode.c.orig
+++ sbin/fsck_ffs/inode.c
@@ -329,10 +329,10 @@
 			lastinum += fullcnt;
 		}
 		/*
-		 * If bread returns an error, it will already have zeroed
+		 * If blread returns an error, it will already have zeroed
 		 * out the buffer, so we do not need to do so here.
 		 */
-		(void)bread(fsreadfd, inodebuf, dblk, size);
+		(void)blread(fsreadfd, inodebuf, dblk, size);
 		nextinop = inodebuf;
 	}
 	dp = (union dinode *)nextinop;
--- sbin/fsck_ffs/main.c.orig
+++ sbin/fsck_ffs/main.c
@@ -237,6 +237,29 @@
 			exit(7);	/* Filesystem clean, report it now */
 		exit(0);
 	}
+	if (preen && skipclean) {
+		/*
+		 * If file system is gjournaled, check it here.
+		 */
+		if ((fsreadfd = open(filesys, O_RDONLY)) < 0 || readsb(0) == 0)
+			exit(3);	/* Cannot read superblock */
+		close(fsreadfd);
+		if ((sblock.fs_flags & FS_GJOURNAL) != 0) {
+			//printf("GJournaled file system detected on %s.\n",
+			//    filesys);
+			if (sblock.fs_clean == 1) {
+				pwarn("FILE SYSTEM CLEAN; SKIPPING CHECKS\n");
+				exit(0);
+			}
+			if ((sblock.fs_flags & (FS_UNCLEAN | FS_NEEDSFSCK)) == 0) {
+				gjournal_check(filesys);
+				exit(0);
+			} else {
+				pfatal("UNEXPECTED INCONSISTENCY, %s\n",
+				    "CANNOT RUN FAST FSCK\n");
+			}
+		}
+	}
 	/*
 	 * If we are to do a background check:
 	 *	Get the mount point information of the file system
@@ -437,7 +460,7 @@
 		 * Write out the duplicate super blocks
 		 */
 		for (cylno = 0; cylno < sblock.fs_ncg; cylno++)
-			bwrite(fswritefd, (char *)&sblock,
+			blwrite(fswritefd, (char *)&sblock,
 			    fsbtodb(&sblock, cgsblock(&sblock, cylno)),
 			    SBLOCKSIZE);
 	}
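
Note on the main.c hunk above: the preen-mode branch amounts to a four-way
decision on the superblock flags. The following is only a minimal sketch of
that decision, not code from the patch. The DEMO_* bit values, the enum and
the function name preen_decide() are hypothetical placeholders chosen for
illustration; the real flag definitions live in <ufs/ffs/fs.h>.

```c
#include <assert.h>

/* Illustrative placeholder bits, not the values from <ufs/ffs/fs.h>. */
#define DEMO_FS_UNCLEAN		0x01
#define DEMO_FS_NEEDSFSCK	0x04
#define DEMO_FS_GJOURNAL	0x40

enum preen_action {
	PREEN_FULL_FSCK,	/* not gjournaled: continue with normal fsck */
	PREEN_SKIP,		/* gjournaled and marked clean */
	PREEN_FAST_CHECK,	/* gjournaled: only reap unreferenced inodes */
	PREEN_FATAL		/* gjournaled but flagged inconsistent */
};

/* Mirror the order of the tests in the hunk: flag, clean bit, error bits. */
enum preen_action
preen_decide(int fs_flags, int fs_clean)
{
	assert(fs_clean == 0 || fs_clean == 1);
	if ((fs_flags & DEMO_FS_GJOURNAL) == 0)
		return (PREEN_FULL_FSCK);
	if (fs_clean == 1)
		return (PREEN_SKIP);
	if ((fs_flags & (DEMO_FS_UNCLEAN | DEMO_FS_NEEDSFSCK)) == 0)
		return (PREEN_FAST_CHECK);
	return (PREEN_FATAL);
}
```

In the patch itself, PREEN_FAST_CHECK corresponds to the call to
gjournal_check() and PREEN_FATAL to the pfatal() call.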
--- sbin/fsck_ffs/pass5.c.orig
+++ sbin/fsck_ffs/pass5.c
@@ -164,6 +164,7 @@
 			pfatal("CG %d: BAD MAGIC NUMBER\n", c);
 		newcg->cg_time = cg->cg_time;
 		newcg->cg_old_time = cg->cg_old_time;
+		newcg->cg_unrefs = cg->cg_unrefs;
 		newcg->cg_cgx = c;
 		dbase = cgbase(fs, c);
 		dmax = dbase + fs->fs_fpg;
--- sbin/fsck_ffs/setup.c.orig
+++ sbin/fsck_ffs/setup.c
@@ -249,7 +249,7 @@
 	for (i = 0, j = 0; i < sblock.fs_cssize; i += sblock.fs_bsize, j++) {
 		size = sblock.fs_cssize - i < sblock.fs_bsize ?
 		    sblock.fs_cssize - i : sblock.fs_bsize;
-		if (bread(fsreadfd, (char *)sblock.fs_csp + i,
+		if (blread(fsreadfd, (char *)sblock.fs_csp + i,
 		    fsbtodb(&sblock, sblock.fs_csaddr + j * sblock.fs_frag),
 		    size) != 0 && !asked) {
 			pfatal("BAD SUMMARY INFORMATION");
@@ -322,7 +322,7 @@
 
 	if (bflag) {
 		super = bflag;
-		if ((bread(fsreadfd, (char *)&sblock, super, (long)SBLOCKSIZE)))
+		if ((blread(fsreadfd, (char *)&sblock, super, (long)SBLOCKSIZE)))
 			return (0);
 		if (sblock.fs_magic == FS_BAD_MAGIC) {
 			fprintf(stderr, BAD_MAGIC_MSG);
@@ -337,7 +337,7 @@
 	} else {
 		for (i = 0; sblock_try[i] != -1; i++) {
 			super = sblock_try[i] / dev_bsize;
-			if ((bread(fsreadfd, (char *)&sblock, super,
+			if ((blread(fsreadfd, (char *)&sblock, super,
 			    (long)SBLOCKSIZE)))
 				return (0);
 			if (sblock.fs_magic == FS_BAD_MAGIC) {
--- sbin/fsdb/fsdb.c.orig
+++ sbin/fsdb/fsdb.c
@@ -620,7 +620,7 @@
     uint32_t idblk[MAXNINDIR];
     int i;
 
-    bread(fsreadfd, (char *)idblk, fsbtodb(&sblock, blk), (int)sblock.fs_bsize);
+    blread(fsreadfd, (char *)idblk, fsbtodb(&sblock, blk), (int)sblock.fs_bsize);
     if (ind_level <= 0) {
 	if (find_blks32(idblk, sblock.fs_bsize / sizeof(uint32_t), wantedblk))
 	    return 1;
@@ -662,7 +662,7 @@
     uint64_t idblk[MAXNINDIR];
     int i;
 
-    bread(fsreadfd, (char *)idblk, fsbtodb(&sblock, blk), (int)sblock.fs_bsize);
+    blread(fsreadfd, (char *)idblk, fsbtodb(&sblock, blk), (int)sblock.fs_bsize);
     if (ind_level <= 0) {
 	if (find_blks64(idblk, sblock.fs_bsize / sizeof(uint64_t), wantedblk))
 	    return 1;
--- sbin/fsdb/fsdb.h.orig
+++ sbin/fsdb/fsdb.h
@@ -30,8 +30,7 @@
  * $FreeBSD: src/sbin/fsdb/fsdb.h,v 1.10.14.1 2006/05/21 09:04:31 maxim Exp $
  */
 
-extern int bread(int fd, char *buf, ufs2_daddr_t blk, long size);
-extern void bwrite(int fd, char *buf, ufs2_daddr_t blk, long size);
+extern int blread(int fd, char *buf, ufs2_daddr_t blk, long size);
 extern void rwerror(const char *mesg, ufs2_daddr_t blk);
 extern int reply(const char *question);
 
--- sbin/geom/class/Makefile.orig
+++ sbin/geom/class/Makefile
@@ -4,6 +4,7 @@
 .if !defined(NO_CRYPT) && !defined(NO_OPENSSL)
 SUBDIR+=eli
 .endif
+SUBDIR+=journal
 SUBDIR+=label
 SUBDIR+=mirror
 SUBDIR+=nop
--- /dev/null	Tue Oct 24 16:33:50 2006
+++ sbin/geom/class/journal/Makefile	Tue Oct 24 16:34:01 2006
@@ -0,0 +1,14 @@
+# $FreeBSD$
+
+.PATH:	${.CURDIR}/../../misc
+
+CLASS=	journal
+SRCS+=	geom_journal_ufs.c
+
+DPADD=	${LIBMD} ${LIBUFS}
+LDADD=	-lmd -lufs
+
+NO_MAN=
+CFLAGS+=-I${.CURDIR}/../../../../sys
+
+.include <bsd.lib.mk>
--- /dev/null	Tue Oct 24 16:33:50 2006
+++ sbin/geom/class/journal/geom_journal.c	Tue Oct 24 16:34:04 2006
@@ -0,0 +1,340 @@
+/*-
+ * Copyright (c) 2005-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include <sys/cdefs.h>
+__FBSDID("$FreeBSD$");
+
+#include <sys/types.h>
+#include <errno.h>
+#include <paths.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <strings.h>
+#include <assert.h>
+#include <libgeom.h>
+#include <geom/journal/g_journal.h>
+#include <core/geom.h>
+#include <misc/subr.h>
+
+#include "geom_journal.h"
+
+
+uint32_t lib_version = G_LIB_VERSION;
+uint32_t version = G_JOURNAL_VERSION;
+
+static intmax_t default_jsize = -1;
+
+static void journal_main(struct gctl_req *req, unsigned flags);
+static void journal_clear(struct gctl_req *req);
+static void journal_dump(struct gctl_req *req);
+static void journal_label(struct gctl_req *req);
+
+struct g_command class_commands[] = {
+	{ "clear", G_FLAG_VERBOSE, journal_main, G_NULL_OPTS,
+	    "[-v] prov ..."
+	},
+	{ "dump", 0, journal_main, G_NULL_OPTS,
+	    "prov ..."
+	},
+	{ "label", G_FLAG_VERBOSE, journal_main,
+	    {
+		{ 'c', "checksum", NULL, G_TYPE_NONE },
+		{ 'f', "force", NULL, G_TYPE_NONE },
+		{ 'h', "hardcode", NULL, G_TYPE_NONE },
+		{ 's', "jsize", &default_jsize, G_TYPE_NUMBER },
+		G_OPT_SENTINEL
+	    },
+	    "[-cfhv] [-s jsize] dataprov [jprov]"
+	},
+	{ "stop", G_FLAG_VERBOSE, NULL,
+	    {
+		{ 'f', "force", NULL, G_TYPE_NONE },
+		G_OPT_SENTINEL
+	    },
+	    "[-fv] name ..."
+	},
+	{ "sync", G_FLAG_VERBOSE, NULL, G_NULL_OPTS,
+	    "[-v]"
+	},
+	G_CMD_SENTINEL
+};
+
+static int verbose = 0;
+
+static void
+journal_main(struct gctl_req *req, unsigned flags)
+{
+	const char *name;
+
+	if ((flags & G_FLAG_VERBOSE) != 0)
+		verbose = 1;
+
+	name = gctl_get_ascii(req, "verb");
+	if (name == NULL) {
+		gctl_error(req, "No '%s' argument.", "verb");
+		return;
+	}
+	if (strcmp(name, "label") == 0)
+		journal_label(req);
+	else if (strcmp(name, "clear") == 0)
+		journal_clear(req);
+	else if (strcmp(name, "dump") == 0)
+		journal_dump(req);
+	else
+		gctl_error(req, "Unknown command: %s.", name);
+}
+
+static int
+g_journal_fs_exists(const char *prov)
+{
+
+	if (g_journal_ufs_exists(prov))
+		return (1);
+#if 0
+	if (g_journal_otherfs_exists(prov))
+		return (1);
+#endif
+	return (0);
+}
+
+static int
+g_journal_fs_using_last_sector(const char *prov)
+{
+
+	if (g_journal_ufs_using_last_sector(prov))
+		return (1);
+#if 0
+	if (g_journal_otherfs_using_last_sector(prov))
+		return (1);
+#endif
+	return (0);
+}
+
+static void
+journal_label(struct gctl_req *req)
+{
+	struct g_journal_metadata md;
+	const char *data, *journal, *str;
+	u_char sector[512];
+	intmax_t jsize, msize, ssize;
+	int error, force, i, nargs, checksum, hardcode;
+
+	nargs = gctl_get_int(req, "nargs");
+
+	strlcpy(md.md_magic, G_JOURNAL_MAGIC, sizeof(md.md_magic));
+	md.md_version = G_JOURNAL_VERSION;
+	md.md_id = arc4random();
+	md.md_joffset = 0;
+	md.md_jid = 0;
+	md.md_flags = GJ_FLAG_CLEAN;
+	checksum = gctl_get_int(req, "checksum");
+	if (checksum)
+		md.md_flags |= GJ_FLAG_CHECKSUM;
+	force = gctl_get_int(req, "force");
+	hardcode = gctl_get_int(req, "hardcode");
+
+	if (nargs != 1 && nargs != 2) {
+		gctl_error(req, "Invalid number of arguments.");
+		return;
+	}
+
+	/* Verify the given providers. */
+	for (i = 0; i < nargs; i++) {
+		str = gctl_get_ascii(req, "arg%d", i);
+		if (g_get_mediasize(str) == 0) {
+			gctl_error(req, "Invalid provider %s.", str);
+			return;
+		}
+	}
+
+	data = gctl_get_ascii(req, "arg0");
+	jsize = gctl_get_intmax(req, "jsize");
+	journal = NULL;
+	switch (nargs) {
+	case 1:
+		if (!force && g_journal_fs_exists(data)) {
+			gctl_error(req, "File system exists on %s and this "
+			    "operation is going to destroy it. Use -f if you "
+			    "really want to do it.", data);
+			return;
+		}
+		journal = data;
+		if (jsize == -1) {
+			/*
+			 * No journal size specified. 1GB should be safe
+			 * default.
+			 */
+			jsize = 1073741824ULL;
+		}
+		msize = g_get_mediasize(data);
+		ssize = g_get_sectorsize(data);
+		if (jsize + ssize >= msize) {
+			gctl_error(req, "Provider too small for journalling. "
+			    "You can try smaller jsize (default is %jd).",
+			    jsize);
+			return;
+		}
+		md.md_jstart = msize - ssize - jsize;
+		md.md_jend = msize - ssize;
+		break;
+	case 2:
+		if (!force && g_journal_fs_using_last_sector(data)) {
+			gctl_error(req, "File system on %s is using the last "
+			    "sector and this operation is going to overwrite "
+			    "it. Use -f if you really want to do it.", data);
+			return;
+		}
+		journal = gctl_get_ascii(req, "arg1");
+		if (jsize != -1) {
+			gctl_error(req, "jsize argument is valid only for "
+			    "all-in-one configuration.");
+			return;
+		}
+		msize = g_get_mediasize(journal);
+		ssize = g_get_sectorsize(journal);
+		md.md_jstart = 0;
+		md.md_jend = msize - ssize;
+		break;
+	}
+
+	if (g_get_sectorsize(data) != g_get_sectorsize(journal)) {
+		gctl_error(req, "Not equal sector sizes.");
+		return;
+	}
+
+	/*
+	 * Clear last sector first, to spoil all components if device exists.
+	 */
+	for (i = 0; i < nargs; i++) {
+		str = gctl_get_ascii(req, "arg%d", i);
+		error = g_metadata_clear(str, NULL);
+		if (error != 0) {
+			gctl_error(req, "Cannot clear metadata on %s: %s.", str,
+			    strerror(error));
+			return;
+		}
+	}
+
+	/*
+	 * Ok, store metadata.
+	 */
+	for (i = 0; i < nargs; i++) {
+		switch (i) {
+		case 0:
+			str = data;
+			md.md_type = GJ_TYPE_DATA;
+			if (nargs == 1)
+				md.md_type |= GJ_TYPE_JOURNAL;
+			break;
+		case 1:
+			str = journal;
+			md.md_type = GJ_TYPE_JOURNAL;
+			break;
+		}
+		md.md_provsize = g_get_mediasize(str);
+		assert(md.md_provsize != 0);
+		if (!hardcode)
+			bzero(md.md_provider, sizeof(md.md_provider));
+		else {
+			if (strncmp(str, _PATH_DEV, strlen(_PATH_DEV)) == 0)
+				str += strlen(_PATH_DEV);
+			strlcpy(md.md_provider, str, sizeof(md.md_provider));
+		}
+		journal_metadata_encode(&md, sector);
+		error = g_metadata_store(str, sector, sizeof(sector));
+		if (error != 0) {
+			fprintf(stderr, "Cannot store metadata on %s: %s.\n",
+			    str, strerror(error));
+			gctl_error(req, "Not fully done.");
+			continue;
+		}
+		if (verbose)
+			printf("Metadata value stored on %s.\n", str);
+	}
+}
+
+static void
+journal_clear(struct gctl_req *req)
+{
+	const char *name;
+	int error, i, nargs;
+
+	nargs = gctl_get_int(req, "nargs");
+	if (nargs < 1) {
+		gctl_error(req, "Too few arguments.");
+		return;
+	}
+
+	for (i = 0; i < nargs; i++) {
+		name = gctl_get_ascii(req, "arg%d", i);
+		error = g_metadata_clear(name, G_JOURNAL_MAGIC);
+		if (error != 0) {
+			fprintf(stderr, "Cannot clear metadata on %s: %s.\n",
+			    name, strerror(error));
+			gctl_error(req, "Not fully done.");
+			continue;
+		}
+		if (verbose)
+			printf("Metadata cleared on %s.\n", name);
+	}
+}
+
+static void
+journal_dump(struct gctl_req *req)
+{
+	struct g_journal_metadata md, tmpmd;
+	const char *name;
+	int error, i, nargs;
+
+	nargs = gctl_get_int(req, "nargs");
+	if (nargs < 1) {
+		gctl_error(req, "Too few arguments.");
+		return;
+	}
+
+	for (i = 0; i < nargs; i++) {
+		name = gctl_get_ascii(req, "arg%d", i);
+		error = g_metadata_read(name, (u_char *)&tmpmd, sizeof(tmpmd),
+		    G_JOURNAL_MAGIC);
+		if (error != 0) {
+			fprintf(stderr, "Cannot read metadata from %s: %s.\n",
+			    name, strerror(error));
+			gctl_error(req, "Not fully done.");
+			continue;
+		}
+		if (journal_metadata_decode((u_char *)&tmpmd, &md) != 0) {
+			fprintf(stderr, "MD5 hash mismatch for %s, skipping.\n",
+			    name);
+			gctl_error(req, "Not fully done.");
+			continue;
+		}
+		printf("Metadata on %s:\n", name);
+		journal_metadata_dump(&md);
+		printf("\n");
+	}
+}
--- /dev/null	Tue Oct 24 16:33:50 2006
+++ sbin/geom/class/journal/geom_journal.h	Tue Oct 24 16:34:07 2006
@@ -0,0 +1,33 @@
+/*-
+ * Copyright (c) 2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ *
+ * $FreeBSD$
+ */
+
+#ifndef	_GEOM_JOURNAL_H_
+#define	_GEOM_JOURNAL_H_
+int g_journal_ufs_exists(const char *prov);
+int g_journal_ufs_using_last_sector(const char *prov);
+#endif	/* !_GEOM_JOURNAL_H_ */
--- /dev/null	Tue Oct 24 16:33:50 2006
+++ sbin/geom/class/journal/geom_journal_ufs.c	Tue Oct 24 16:34:09 2006
@@ -0,0 +1,78 @@
+/*-
+ * Copyright (c) 2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include <sys/cdefs.h>
+__FBSDID("$FreeBSD$");
+
+#include <sys/param.h>
+#include <sys/disklabel.h>
+#include <sys/mount.h>
+
+#include <ufs/ufs/dinode.h>
+#include <ufs/ffs/fs.h>
+
+#include <libufs.h>
+#include <libgeom.h>
+#include <core/geom.h>
+#include <misc/subr.h>
+
+#include "geom_journal.h"
+
+static struct fs *
+read_superblock(const char *prov)
+{
+	static struct uufsd disk;
+	struct fs *fs;
+
+	if (ufs_disk_fillout(&disk, prov) == -1)
+		return (NULL);
+	fs = &disk.d_fs;
+	ufs_disk_close(&disk);
+	return (fs);
+}
+
+int
+g_journal_ufs_exists(const char *prov)
+{
+
+	return (read_superblock(prov) != NULL);
+}
+
+int
+g_journal_ufs_using_last_sector(const char *prov)
+{
+	struct fs *fs;
+	off_t psize, fssize;
+
+	fs = read_superblock(prov);
+	if (fs == NULL)
+		return (0);
+	/* Provider size in 512-byte blocks. */
+	psize = g_get_mediasize(prov) / DEV_BSIZE;
+	/* File system size in 512-byte blocks. */
+	fssize = fsbtodb(fs, dbtofsb(fs, psize));
+	return (psize == fssize);
+}
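
Note on g_journal_ufs_using_last_sector() above: the fsbtodb(dbtofsb())
round trip rounds the provider size down to a whole number of file system
fragments, so the test is true exactly when no spare sector is left over at
the end of the provider. The sketch below assumes both macros are plain
shifts by the superblock's fs_fsbtodb value. The function name
demo_using_last_sector() and its frag_shift parameter are hypothetical and
stand in for that field.

```c
#include <assert.h>
#include <stdint.h>

/*
 * psize is the provider size in 512-byte blocks; frag_shift is log2 of
 * the number of 512-byte blocks per fragment.
 * Returns 1 if fragments can cover the last block, 0 otherwise.
 */
int
demo_using_last_sector(int64_t psize, unsigned frag_shift)
{
	int64_t rounded;

	assert(psize >= 0 && frag_shift < 32);
	/* Round down to a multiple of the fragment size. */
	rounded = (psize >> frag_shift) << frag_shift;
	return (psize == rounded);
}
```

With 2048-byte fragments (frag_shift = 2), a provider of 1000 blocks has no
slack and the function returns 1, while a provider of 1001 blocks leaves
one unused block at the end and the function returns 0.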
--- sbin/growfs/debug.c.orig
+++ sbin/growfs/debug.c
@@ -281,6 +281,8 @@
  */
 	fprintf(dbg_log, "maxbsize          int32_t          0x%08x\n",
 	    sb->fs_maxbsize);
+	fprintf(dbg_log, "unrefs            int64_t          0x%016llx\n",
+	    (unsigned long long)sb->fs_unrefs);
 	fprintf(dbg_log, "sblockloc         int64_t          0x%08x%08x\n",
 		((unsigned int *)&(sb->fs_sblockloc))[1],
 		((unsigned int *)&(sb->fs_sblockloc))[0]);
@@ -399,6 +401,7 @@
 	    cgr->cg_nclusterblks);
 	fprintf(dbg_log, "niblk         int32_t    0x%08x\n", cgr->cg_niblk);
 	fprintf(dbg_log, "initediblk    int32_t    0x%08x\n", cgr->cg_initediblk);
+	fprintf(dbg_log, "unrefs        int32_t    0x%08x\n", cgr->cg_unrefs);
 	fprintf(dbg_log, "time          ufs_time_t %10u\n", 
 		(unsigned int)cgr->cg_initediblk);
 
--- sbin/mount/mount.c.orig
+++ sbin/mount/mount.c
@@ -106,6 +106,7 @@
 	{ MNT_SOFTDEP,		"soft-updates" },
 	{ MNT_MULTILABEL,	"multilabel" },
 	{ MNT_ACLS,		"acls" },
+	{ MNT_GJOURNAL,		"gjournal" },
 	{ 0, NULL }
 };
 
@@ -773,6 +774,7 @@
 	if (flags & MNT_SUIDDIR)	res = catopt(res, "suiddir");
 	if (flags & MNT_MULTILABEL)	res = catopt(res, "multilabel");
 	if (flags & MNT_ACLS)		res = catopt(res, "acls");
+	if (flags & MNT_GJOURNAL)	res = catopt(res, "gjournal");
 
 	return res;
 }
--- sbin/newfs/mkfs.c.orig
+++ sbin/newfs/mkfs.c
@@ -135,6 +135,8 @@
 		sblock.fs_flags |= FS_DOSOFTDEP;
 	if (Lflag)
 		strlcpy(sblock.fs_volname, volumelabel, MAXVOLLEN);
+	if (Jflag)
+		sblock.fs_flags |= FS_GJOURNAL;
 	if (lflag)
 		sblock.fs_flags |= FS_MULTILABEL;
 	/*
--- sbin/newfs/newfs.8.orig
+++ sbin/newfs/newfs.8
@@ -36,7 +36,7 @@
 .Nd construct a new UFS1/UFS2 file system
 .Sh SYNOPSIS
 .Nm
-.Op Fl NUln
+.Op Fl JNUln
 .Op Fl L Ar volname
 .Op Fl O Ar filesystem-type
 .Op Fl S Ar sector-size
@@ -77,6 +77,8 @@
 .Pp
 The following options define the general layout policies:
 .Bl -tag -width indent
+.It Fl J
+Enable journaling on the new file system via gjournal.
 .It Fl L Ar volname
 Add a volume label to the new file system.
 .It Fl N
--- sbin/newfs/newfs.c.orig
+++ sbin/newfs/newfs.c
@@ -117,6 +117,7 @@
 int	Rflag;			/* regression test */
 int	Uflag;			/* enable soft updates for file system */
 int	Eflag = 0;		/* exit in middle of newfs for testing */
+int	Jflag;			/* enable gjournal for file system */
 int	lflag;			/* enable multilabel for file system */
 int	nflag;			/* do not create .snap directory */
 quad_t	fssize;			/* file system size */
@@ -156,11 +157,14 @@
 	off_t mediasize;
 
 	while ((ch = getopt(argc, argv,
-	    "EL:NO:RS:T:Ua:b:c:d:e:f:g:h:i:lm:no:s:")) != -1)
+	    "EJL:NO:RS:T:Ua:b:c:d:e:f:g:h:i:lm:no:s:")) != -1)
 		switch (ch) {
 		case 'E':
 			Eflag++;
 			break;
+		case 'J':
+			Jflag = 1;
+			break;
 		case 'L':
 			volumelabel = optarg;
 			i = -1;
--- sbin/newfs/newfs.h.orig
+++ sbin/newfs/newfs.h
@@ -49,6 +49,7 @@
 extern int	Rflag;		/* regression test */
 extern int	Uflag;		/* enable soft updates for file system */
 extern int	Eflag;		/* exit as if error, for testing */
+extern int	Jflag;		/* enable gjournal for file system */
 extern int	lflag;		/* enable multilabel MAC for file system */
 extern int	nflag;		/* do not create .snap directory */
 extern quad_t	fssize;		/* file system size */
--- sbin/tunefs/tunefs.8.orig
+++ sbin/tunefs/tunefs.8
@@ -40,6 +40,7 @@
 .Op Fl a Cm enable | disable
 .Op Fl e Ar maxbpg
 .Op Fl f Ar avgfilesize
+.Op Fl J Cm enable | disable
 .Op Fl L Ar volname
 .Op Fl l Cm enable | disable
 .Op Fl m Ar minfree
@@ -87,6 +88,8 @@
 this parameter should be set higher.
 .It Fl f Ar avgfilesize
 Specify the expected average file size.
+.It Fl J Cm enable | disable
+Turn the GJournal flag on or off.
 .It Fl L Ar volname
 Add/modify an optional file system volume label.
 .It Fl l Cm enable | disable
--- sbin/tunefs/tunefs.c.orig
+++ sbin/tunefs/tunefs.c
@@ -76,11 +76,11 @@
 int
 main(int argc, char *argv[])
 {
-	char *avalue, *Lvalue, *lvalue, *nvalue;
+	char *avalue, *Jvalue, *Lvalue, *lvalue, *nvalue;
 	const char *special, *on;
 	const char *name;
 	int active;
-	int Aflag, aflag, eflag, evalue, fflag, fvalue, Lflag, lflag;
+	int Aflag, aflag, eflag, evalue, fflag, fvalue, Jflag, Lflag, lflag;
 	int mflag, mvalue, nflag, oflag, ovalue, pflag, sflag, svalue;
 	int ch, found_arg, i;
 	const char *chg[2];
@@ -89,13 +89,13 @@
 
 	if (argc < 3)
 		usage();
-	Aflag = aflag = eflag = fflag = Lflag = lflag = mflag = 0;
+	Aflag = aflag = eflag = fflag = Jflag = Lflag = lflag = mflag = 0;
 	nflag = oflag = pflag = sflag = 0;
-	avalue = Lvalue = lvalue = nvalue = NULL;
+	avalue = Jvalue = Lvalue = lvalue = nvalue = NULL;
 	evalue = fvalue = mvalue = ovalue = svalue = 0;
 	active = 0;
 	found_arg = 0;		/* At least one arg is required. */
-	while ((ch = getopt(argc, argv, "Aa:e:f:L:l:m:n:o:ps:")) != -1)
+	while ((ch = getopt(argc, argv, "Aa:e:f:J:L:l:m:n:o:ps:")) != -1)
 		switch (ch) {
 
 		case 'A':
@@ -135,6 +135,19 @@
 			fflag = 1;
 			break;
 
+		case 'J':
+			found_arg = 1;
+			name = "gjournaled file system";
+			Jvalue = optarg;
+			if (strcmp(Jvalue, "enable") &&
+			    strcmp(Jvalue, "disable")) {
+				errx(10, "bad %s (options are %s)",
+				    name, "`enable' or `disable'");
+			}
+			Jflag = 1;
+			break;
+
+
 		case 'L':
 			found_arg = 1;
 			name = "volume label";
@@ -282,6 +295,26 @@
 			sblock.fs_avgfilesize = fvalue;
 		}
 	}
+	if (Jflag) {
+		name = "gjournal";
+		if (strcmp(Jvalue, "enable") == 0) {
+			if (sblock.fs_flags & FS_GJOURNAL) {
+				warnx("%s remains unchanged as enabled", name);
+			} else {
+				sblock.fs_flags |= FS_GJOURNAL;
+				warnx("%s set", name);
+			}
+		} else if (strcmp(Jvalue, "disable") == 0) {
+			if ((~sblock.fs_flags & FS_GJOURNAL) ==
+			    FS_GJOURNAL) {
+				warnx("%s remains unchanged as disabled",
+				    name);
+			} else {
+				sblock.fs_flags &= ~FS_GJOURNAL;
+				warnx("%s cleared", name);
+			}
+		}
+	}
 	if (lflag) {
 		name = "multilabel";
 		if (strcmp(lvalue, "enable") == 0) {
@@ -389,8 +422,8 @@
 {
 	fprintf(stderr, "%s\n%s\n%s\n%s\n",
 "usage: tunefs [-A] [-a enable | disable] [-e maxbpg] [-f avgfilesize]",
-"              [-L volname] [-l enable | disable] [-m minfree]",
-"              [-n enable | disable] [-o space | time] [-p]",
+"              [-J enable | disable] [-L volname] [-l enable | disable]",
+"              [-m minfree] [-n enable | disable] [-o space | time] [-p]",
 "              [-s avgfpdir] special | filesystem");
 	exit(2);
 }
@@ -404,6 +437,8 @@
 		(sblock.fs_flags & FS_MULTILABEL)? "enabled" : "disabled");
 	warnx("soft updates: (-n)                                 %s", 
 		(sblock.fs_flags & FS_DOSOFTDEP)? "enabled" : "disabled");
+	warnx("gjournal: (-J)                                     %s",
+		(sblock.fs_flags & FS_GJOURNAL)? "enabled" : "disabled");
 	warnx("maximum blocks per file in a cylinder group: (-e)  %d",
 	      sblock.fs_maxbpg);
 	warnx("average file size: (-f)                            %d",
--- share/man/man5/fs.5.orig
+++ share/man/man5/fs.5
@@ -153,7 +153,8 @@
 	u_int	*fs_active;        /* used by snapshots to track fs */
 	int32_t	 fs_old_cpc;       /* cyl per cycle in postbl */
 	int32_t	 fs_maxbsize;      /* maximum blocking factor permitted */
-	int64_t	 fs_sparecon64[17]; /* old rotation block list head */
+	int64_t	 fs_unrefs;        /* number of unreferenced inodes */
+	int64_t	 fs_sparecon64[16]; /* old rotation block list head */
 	int64_t	 fs_sblockloc;     /* byte offset of standard superblock */
 	struct	csum_total fs_cstotal;  /* cylinder summary information */
 	ufs_time_t fs_time;        /* last time written */
--- sys/cam/scsi/scsi_da.c.orig
+++ sys/cam/scsi/scsi_da.c
@@ -1163,6 +1163,8 @@
 	softc->disk->d_maxsize = DFLTPHYS; /* XXX: probably not arbitrary */
 	softc->disk->d_unit = periph->unit_number;
 	softc->disk->d_flags = DISKFLAG_NEEDSGIANT;
+	if ((softc->quirks & DA_Q_NO_SYNC_CACHE) == 0)
+		softc->disk->d_flags |= DISKFLAG_CANFLUSHCACHE;
 	disk_create(softc->disk, DISK_VERSION);
 
 	/*
@@ -1234,20 +1236,35 @@
 			} else {
 				tag_code = MSG_SIMPLE_Q_TAG;
 			}
-			scsi_read_write(&start_ccb->csio,
-					/*retries*/da_retry_count,
-					/*cbfcnp*/dadone,
-					/*tag_action*/tag_code,
-					/*read_op*/bp->bio_cmd == BIO_READ,
-					/*byte2*/0,
-					softc->minimum_cmd_size,
-					/*lba*/bp->bio_pblkno,
-					/*block_count*/bp->bio_bcount /
-					softc->params.secsize,
-					/*data_ptr*/ bp->bio_data,
-					/*dxfer_len*/ bp->bio_bcount,
-					/*sense_len*/SSD_FULL_SIZE,
-					/*timeout*/da_default_timeout*1000);
+			switch (bp->bio_cmd) {
+			case BIO_READ:
+			case BIO_WRITE:
+				scsi_read_write(&start_ccb->csio,
+						/*retries*/da_retry_count,
+						/*cbfcnp*/dadone,
+						/*tag_action*/tag_code,
+						/*read_op*/bp->bio_cmd == BIO_READ,
+						/*byte2*/0,
+						softc->minimum_cmd_size,
+						/*lba*/bp->bio_pblkno,
+						/*block_count*/bp->bio_bcount /
+						softc->params.secsize,
+						/*data_ptr*/ bp->bio_data,
+						/*dxfer_len*/ bp->bio_bcount,
+						/*sense_len*/SSD_FULL_SIZE,
+						/*timeout*/da_default_timeout*1000);
+				break;
+			case BIO_FLUSH:
+				scsi_synchronize_cache(&start_ccb->csio,
+						       /*retries*/1,
+						       /*cbfcnp*/dadone,
+						       MSG_SIMPLE_Q_TAG,
+						       /*begin_lba*/0,/* Cover the whole disk */
+						       /*lb_count*/0,
+						       SSD_FULL_SIZE,
+						       /*timeout*/da_default_timeout*1000);
+				break;
+			}
 			start_ccb->ccb_h.ccb_state = DA_CCB_BUFFER_IO;
 
 			/*
--- sys/conf/NOTES.orig
+++ sys/conf/NOTES
@@ -135,6 +135,7 @@
 options 	GEOM_FOX		# Redundant path mitigation
 options 	GEOM_GATE		# Userland services.
 options 	GEOM_GPT		# GPT partitioning
+options 	GEOM_JOURNAL		# Journaling.
 options 	GEOM_LABEL		# Providers labelization.
 options 	GEOM_MBR		# DOS/MBR partitioning
 options 	GEOM_MIRROR		# Disk mirroring.
@@ -877,6 +878,9 @@
 # directories at the expense of some memory.
 options 	UFS_DIRHASH
 
+# Gjournal-based UFS journaling support.
+options 	UFS_GJOURNAL
+
 # Make space in the kernel for a root filesystem on a md device.
 # Define to the number of kilobytes to reserve for the filesystem.
 options 	MD_ROOT_SIZE=10
--- sys/conf/files.orig
+++ sys/conf/files
@@ -1125,6 +1125,8 @@
 geom/geom_sunlabel_enc.c	optional geom_sunlabel
 geom/geom_vfs.c			standard
 geom/geom_vol_ffs.c		optional geom_vol
+geom/journal/g_journal.c	optional geom_journal
+geom/journal/g_journal_ufs.c	optional geom_journal
 geom/label/g_label.c		optional geom_label
 geom/label/g_label_ext2fs.c	optional geom_label
 geom/label/g_label_iso9660.c	optional geom_label
@@ -1890,6 +1892,7 @@
 ufs/ufs/ufs_bmap.c		optional ffs
 ufs/ufs/ufs_dirhash.c		optional ffs
 ufs/ufs/ufs_extattr.c		optional ffs
+ufs/ufs/ufs_gjournal.c		optional ffs
 ufs/ufs/ufs_inode.c		optional ffs
 ufs/ufs/ufs_lookup.c		optional ffs
 ufs/ufs/ufs_quota.c		optional ffs
--- sys/conf/options.orig
+++ sys/conf/options
@@ -83,6 +83,7 @@
 GEOM_FOX	opt_geom.h
 GEOM_GATE	opt_geom.h
 GEOM_GPT	opt_geom.h
+GEOM_JOURNAL	opt_geom.h
 GEOM_LABEL	opt_geom.h
 GEOM_MBR	opt_geom.h
 GEOM_MIRROR	opt_geom.h
@@ -235,6 +236,9 @@
 # Enable fast hash lookups for large directories on UFS-based filesystems.
 UFS_DIRHASH	opt_ufs.h
 
+# Enable gjournal-based UFS journal.
+UFS_GJOURNAL	opt_ufs.h
+
 # The below sentence is not in English, and neither is this one.
 # We plan to remove the static dependences above, with a
 # <filesystem>_ROOT option to control if it usable as root.  This list
--- sys/dev/amr/amr.c.orig
+++ sys/dev/amr/amr.c
@@ -1287,7 +1287,7 @@
     int			driveno;
     int			cmd;
 
-    ac = NULL;
+    *acp = NULL;
     error = 0;
 
     /* get a command */
@@ -1305,39 +1305,50 @@
     ac->ac_bio = bio;
     ac->ac_data = bio->bio_data;
     ac->ac_length = bio->bio_bcount;
-    if (bio->bio_cmd == BIO_READ) {
+    cmd = 0;
+    switch (bio->bio_cmd) {
+    case BIO_READ:
 	ac->ac_flags |= AMR_CMD_DATAIN;
 	if (AMR_IS_SG64(sc)) {
 	    cmd = AMR_CMD_LREAD64;
 	    ac->ac_flags |= AMR_CMD_SG64;
 	} else
 	    cmd = AMR_CMD_LREAD;
-    } else {
+	break;
+    case BIO_WRITE:
 	ac->ac_flags |= AMR_CMD_DATAOUT;
 	if (AMR_IS_SG64(sc)) {
 	    cmd = AMR_CMD_LWRITE64;
 	    ac->ac_flags |= AMR_CMD_SG64;
 	} else
 	    cmd = AMR_CMD_LWRITE;
+	break;
+    case BIO_FLUSH:
+	ac->ac_flags |= AMR_CMD_PRIORITY | AMR_CMD_DATAOUT;
+	cmd = AMR_CMD_FLUSH;
+	break;
     }
     amrd = (struct amrd_softc *)bio->bio_disk->d_drv1;
     driveno = amrd->amrd_drive - sc->amr_drive;
     blkcount = (bio->bio_bcount + AMR_BLKSIZE - 1) / AMR_BLKSIZE;
 
     ac->ac_mailbox.mb_command = cmd;
-    ac->ac_mailbox.mb_blkcount = blkcount;
-    ac->ac_mailbox.mb_lba = bio->bio_pblkno;
+    if (bio->bio_cmd & (BIO_READ|BIO_WRITE)) {
+	ac->ac_mailbox.mb_blkcount = blkcount;
+	ac->ac_mailbox.mb_lba = bio->bio_pblkno;
+	if ((bio->bio_pblkno + blkcount) > sc->amr_drive[driveno].al_size) {
+	    device_printf(sc->amr_dev,
+			  "I/O beyond end of unit (%lld,%d > %lu)\n", 
+			  (long long)bio->bio_pblkno, blkcount,
+			  (u_long)sc->amr_drive[driveno].al_size);
+	}
+    }
     ac->ac_mailbox.mb_drive = driveno;
     if (sc->amr_state & AMR_STATE_REMAP_LD)
 	ac->ac_mailbox.mb_drive |= 0x80;
 
     /* we fill in the s/g related data when the command is mapped */
 
-    if ((bio->bio_pblkno + blkcount) > sc->amr_drive[driveno].al_size)
-	device_printf(sc->amr_dev, "I/O beyond end of unit (%lld,%d > %lu)\n", 
-		      (long long)bio->bio_pblkno, blkcount,
-		      (u_long)sc->amr_drive[driveno].al_size);
-
     *acp = ac;
     return(error);
 }
--- sys/dev/amr/amr_disk.c.orig
+++ sys/dev/amr/amr_disk.c
@@ -236,7 +236,7 @@
     sc->amrd_disk->d_name = "amrd";
     sc->amrd_disk->d_dump = (dumper_t *)amrd_dump;
     sc->amrd_disk->d_unit = sc->amrd_unit;
-    sc->amrd_disk->d_flags = 0;
+    sc->amrd_disk->d_flags = DISKFLAG_CANFLUSHCACHE;
     sc->amrd_disk->d_sectorsize = AMR_BLKSIZE;
     sc->amrd_disk->d_mediasize = (off_t)sc->amrd_drive->al_size * AMR_BLKSIZE;
     sc->amrd_disk->d_fwsectors = sc->amrd_drive->al_sectors;
--- sys/dev/ata/ata-disk.c.orig
+++ sys/dev/ata/ata-disk.c
@@ -151,6 +151,8 @@
     adp->disk->d_fwsectors = adp->sectors;
     adp->disk->d_fwheads = adp->heads;
     adp->disk->d_unit = device_get_unit(dev);
+    if (atadev->param.support.command2 & ATA_SUPPORT_FLUSHCACHE)
+	adp->disk->d_flags = DISKFLAG_CANFLUSHCACHE;
     disk_create(adp->disk, DISK_VERSION);
     device_add_child(dev, "subdisk", device_get_unit(dev));
     bus_generic_attach(dev);
@@ -260,6 +262,17 @@
 	else
 	    request->u.ata.command = ATA_WRITE;
 	break;
+    case BIO_FLUSH:
+	request->u.ata.lba = 0;
+	request->u.ata.count = 0;
+	request->u.ata.feature = 0;
+	request->bytecount = 0;
+	request->transfersize = 0;
+	request->timeout = 1;
+	request->retries = 0;
+	request->flags = ATA_R_CONTROL;
+	request->u.ata.command = ATA_FLUSHCACHE;
+	break;
     default:
 	device_printf(dev, "FAILURE - unknown BIO operation\n");
 	ata_free_request(request);
--- sys/dev/ata/ata-raid.c.orig
+++ sys/dev/ata/ata-raid.c
@@ -146,6 +146,21 @@
     rdp->disk->d_maxsize = 128 * DEV_BSIZE;
     rdp->disk->d_drv1 = rdp;
     rdp->disk->d_unit = rdp->lun;
+    /* we support flushing cache if all components support it */
+    /* XXX: not all components can be connected at this point */
+    rdp->disk->d_flags = DISKFLAG_CANFLUSHCACHE;
+    for (disk = 0; disk < rdp->total_disks; disk++) {
+	struct ata_device *atadev;
+
+	if (rdp->disks[disk].dev == NULL)
+	    continue;
+	if ((atadev = device_get_softc(rdp->disks[disk].dev)) == NULL)
+	    continue;
+	if (atadev->param.support.command2 & ATA_SUPPORT_FLUSHCACHE)
+	    continue;
+	rdp->disk->d_flags = 0;
+	break;
+    }
     disk_create(rdp->disk, DISK_VERSION);
 
     printf("ar%d: %juMB <%s %s%s> status: %s\n", rdp->lun,
@@ -229,6 +244,39 @@
     return error;
 }
 
+static int
+ata_raid_flush(struct bio *bp)
+{
+    struct ar_softc *rdp = bp->bio_disk->d_drv1;
+    struct ata_request *request;
+    device_t dev;
+    int disk, error;
+
+    error = 0;
+    bp->bio_pflags = 0;
+
+    for (disk = 0; disk < rdp->total_disks; disk++) {
+	if ((dev = rdp->disks[disk].dev) != NULL)
+	    bp->bio_pflags++;
+    }
+    for (disk = 0; disk < rdp->total_disks; disk++) {
+	if ((dev = rdp->disks[disk].dev) == NULL)
+	    continue;
+	if (!(request = ata_raid_init_request(rdp, bp)))
+	    return ENOMEM;
+	request->dev = dev;
+	request->u.ata.command = ATA_FLUSHCACHE;
+	request->u.ata.lba = 0;
+	request->u.ata.count = 0;
+	request->u.ata.feature = 0;
+	request->timeout = 1;
+	request->retries = 0;
+	request->flags |= ATA_R_ORDERED | ATA_R_DIRECT;
+	ata_queue_request(request);
+    }
+    return 0;
+}
+
 static void
 ata_raid_strategy(struct bio *bp)
 {
@@ -238,6 +286,15 @@
     u_int64_t blkno, lba, blk = 0;
     int count, chunk, drv, par = 0, change = 0;
 
+    if (bp->bio_cmd == BIO_FLUSH) {
+	int error;
+
+	error = ata_raid_flush(bp);
+	if (error != 0)
+		biofinish(bp, NULL, error);
+	return;
+    }
+
     if (!(rdp->status & AR_S_READY) ||
 	(bp->bio_cmd != BIO_READ && bp->bio_cmd != BIO_WRITE)) {
 	biofinish(bp, NULL, EIO);
@@ -554,6 +611,15 @@
     struct bio *bp = request->bio;
     int i, mirror, finished = 0;
 
+    if (bp->bio_cmd == BIO_FLUSH) {
+	if (bp->bio_error == 0)
+	    bp->bio_error = request->result;
+	ata_free_request(request);
+	if (--bp->bio_pflags == 0)
+	    biodone(bp);
+	return;
+    }
+
     switch (rdp->type) {
     case AR_T_JBOD:
     case AR_T_SPAN:
@@ -3957,6 +4023,9 @@
     case BIO_WRITE:
 	request->flags = ATA_R_WRITE;
 	break;
+    case BIO_FLUSH:
+	request->flags = ATA_R_CONTROL;
+	break;
     }
     return request;
 }
--- sys/geom/concat/g_concat.c.orig
+++ sys/geom/concat/g_concat.c
@@ -212,6 +212,42 @@
 }
 
 static void
+g_concat_flush(struct g_concat_softc *sc, struct bio *bp)
+{
+	struct bio_queue_head queue;
+	struct g_consumer *cp;
+	struct bio *cbp;
+	u_int no;
+
+	bioq_init(&queue);
+	for (no = 0; no < sc->sc_ndisks; no++) {
+		cbp = g_clone_bio(bp);
+		if (cbp == NULL) {
+			for (cbp = bioq_first(&queue); cbp != NULL;
+			    cbp = bioq_first(&queue)) {
+				bioq_remove(&queue, cbp);
+				g_destroy_bio(cbp);
+			}
+			if (bp->bio_error == 0)
+				bp->bio_error = ENOMEM;
+			g_io_deliver(bp, bp->bio_error);
+			return;
+		}
+		bioq_insert_tail(&queue, cbp);
+		cbp->bio_done = g_std_done;
+		cbp->bio_caller1 = sc->sc_disks[no].d_consumer;
+		cbp->bio_to = sc->sc_disks[no].d_consumer->provider;
+	}
+	for (cbp = bioq_first(&queue); cbp != NULL; cbp = bioq_first(&queue)) {
+		bioq_remove(&queue, cbp);
+		G_CONCAT_LOGREQ(cbp, "Sending request.");
+		cp = cbp->bio_caller1;
+		cbp->bio_caller1 = NULL;
+		g_io_request(cbp, cp);
+	}
+}
+
+static void
 g_concat_start(struct bio *bp)
 {
 	struct bio_queue_head queue;
@@ -240,6 +276,9 @@
 	case BIO_WRITE:
 	case BIO_DELETE:
 		break;
+	case BIO_FLUSH:
+		g_concat_flush(sc, bp);
+		return;
 	case BIO_GETATTR:
 		/* To which provider it should be delivered? */
 	default:
--- sys/geom/eli/g_eli.c.orig
+++ sys/geom/eli/g_eli.c
@@ -258,6 +258,7 @@
 	case BIO_READ:
 	case BIO_WRITE:
 	case BIO_GETATTR:
+	case BIO_FLUSH:
 		break;
 	case BIO_DELETE:
 		/*
@@ -298,6 +299,7 @@
 		wakeup(sc);
 		break;
 	case BIO_GETATTR:
+	case BIO_FLUSH:
 		cbp->bio_done = g_std_done;
 		cp = LIST_FIRST(&sc->sc_geom->consumer);
 		cbp->bio_to = cp->provider;
--- sys/geom/geom.h.orig
+++ sys/geom/geom.h
@@ -265,6 +265,7 @@
 void g_destroy_bio(struct bio *);
 void g_io_deliver(struct bio *bp, int error);
 int g_io_getattr(const char *attr, struct g_consumer *cp, int *len, void *ptr);
+int g_io_flush(struct g_consumer *cp);
 void g_io_request(struct bio *bp, struct g_consumer *cp);
 struct bio *g_new_bio(void);
 struct bio *g_alloc_bio(void);
--- sys/geom/geom_disk.c.orig
+++ sys/geom/geom_disk.c
@@ -206,8 +206,10 @@
 	bp2->bio_inbed++;
 	if (bp2->bio_children == bp2->bio_inbed) {
 		bp2->bio_resid = bp2->bio_bcount - bp2->bio_completed;
-		if ((dp = bp2->bio_to->geom->softc))
+		if ((bp2->bio_cmd & (BIO_READ|BIO_WRITE|BIO_DELETE)) &&
+		    (dp = bp2->bio_to->geom->softc)) {
 			devstat_end_transaction_bio(dp->d_devstat, bp2);
+		}
 		g_io_deliver(bp2, bp2->bio_error);
 	}
 	mtx_unlock(&g_disk_done_mtx);
@@ -304,6 +306,24 @@
 		else 
 			error = ENOIOCTL;
 		break;
+	case BIO_FLUSH:
+		g_trace(G_T_TOPOLOGY, "g_disk_flushcache(%s)",
+		    bp->bio_to->name);
+		if (!(dp->d_flags & DISKFLAG_CANFLUSHCACHE)) {
+			g_io_deliver(bp, ENODEV);
+			return;
+		}
+		bp2 = g_clone_bio(bp);
+		if (bp2 == NULL) {
+			g_io_deliver(bp, ENOMEM);
+			return;
+		}
+		bp2->bio_done = g_disk_done;
+		bp2->bio_disk = dp;
+		g_disk_lock_giant(dp);
+		dp->d_strategy(bp2);
+		g_disk_unlock_giant(dp);
+		break;
 	default:
 		error = EOPNOTSUPP;
 		break;
--- sys/geom/geom_disk.h.orig
+++ sys/geom/geom_disk.h
@@ -91,6 +91,7 @@
 #define DISKFLAG_NEEDSGIANT	0x1
 #define DISKFLAG_OPEN		0x2
 #define DISKFLAG_CANDELETE	0x4
+#define DISKFLAG_CANFLUSHCACHE	0x8
 
 struct disk *disk_alloc(void);
 void disk_create(struct disk *disk, int version);
--- sys/geom/geom_io.c.orig
+++ sys/geom/geom_io.c
@@ -199,6 +199,26 @@
 	return (error);
 }
 
+int
+g_io_flush(struct g_consumer *cp)
+{
+	struct bio *bp;
+	int error;
+
+	g_trace(G_T_BIO, "bio_flush(%s)", cp->provider->name);
+	bp = g_alloc_bio();
+	bp->bio_cmd = BIO_FLUSH;
+	bp->bio_done = NULL;
+	bp->bio_attribute = NULL;
+	bp->bio_offset = cp->provider->mediasize;
+	bp->bio_length = 0;
+	bp->bio_data = NULL;
+	g_io_request(bp, cp);
+	error = biowait(bp, "gflush");
+	g_destroy_bio(bp);
+	return (error);
+}
+
 static int
 g_io_check(struct bio *bp)
 {
@@ -217,6 +237,7 @@
 		break;
 	case BIO_WRITE:
 	case BIO_DELETE:
+	case BIO_FLUSH:
 		if (cp->acw == 0)
 			return (EPERM);
 		break;
@@ -259,10 +280,13 @@
 
 	KASSERT(cp != NULL, ("NULL cp in g_io_request"));
 	KASSERT(bp != NULL, ("NULL bp in g_io_request"));
-	KASSERT(bp->bio_data != NULL, ("NULL bp->data in g_io_request"));
 	pp = cp->provider;
 	KASSERT(pp != NULL, ("consumer not attached in g_io_request"));
 
+	if (bp->bio_cmd & (BIO_READ|BIO_WRITE|BIO_DELETE|BIO_GETATTR)) {
+		KASSERT(bp->bio_data != NULL,
+		    ("NULL bp->data in g_io_request"));
+	}
 	if (bp->bio_cmd & (BIO_READ|BIO_WRITE|BIO_DELETE)) {
 		KASSERT(bp->bio_offset % cp->provider->sectorsize == 0,
 		    ("wrong offset %jd for sectorsize %u",
@@ -564,6 +588,10 @@
 		cmd = "GETATTR";
 		printf("%s[%s(attr=%s)]", pname, cmd, bp->bio_attribute);
 		return;
+	case BIO_FLUSH:
+		cmd = "FLUSH";
+		printf("%s[%s]", pname, cmd);
+		return;
 	case BIO_READ:
 		cmd = "READ";
 	case BIO_WRITE:
--- sys/geom/geom_slice.c.orig
+++ sys/geom/geom_slice.c
@@ -260,6 +260,8 @@
 				gkd->length = gsp->slices[idx].length;
 			/* now, pass it on downwards... */
 		}
+		/* FALLTHROUGH */
+	case BIO_FLUSH:
 		bp2 = g_clone_bio(bp);
 		if (bp2 == NULL) {
 			g_io_deliver(bp, ENOMEM);
--- /dev/null	Tue Oct 24 16:34:10 2006
+++ sys/geom/journal/g_journal.c	Tue Oct 24 16:34:13 2006
@@ -0,0 +1,3024 @@
+/*-
+ * Copyright (c) 2005-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include <sys/cdefs.h>
+__FBSDID("$FreeBSD$");
+
+#include <sys/param.h>
+#include <sys/systm.h>
+#include <sys/kernel.h>
+#include <sys/module.h>
+#include <sys/limits.h>
+#include <sys/lock.h>
+#include <sys/mutex.h>
+#include <sys/bio.h>
+#include <sys/sysctl.h>
+#include <sys/malloc.h>
+#include <sys/mount.h>
+#include <sys/eventhandler.h>
+#include <sys/proc.h>
+#include <sys/kthread.h>
+#include <sys/sched.h>
+#include <sys/taskqueue.h>
+#include <sys/vnode.h>
+#ifdef GJ_MEMDEBUG
+#include <sys/stack.h>
+#include <sys/kdb.h>
+#endif
+#include <vm/vm.h>
+#include <vm/vm_kern.h>
+#include <geom/geom.h>
+
+#include <geom/journal/g_journal.h>
+
+
+/*
+ * On-disk journal format:
+ *
+ * JH - Journal header
+ * RH - Record header
+ *
+ * %%%%%% ****** +------+ +------+     ****** +------+     %%%%%%
+ * % JH % * RH * | Data | | Data | ... * RH * | Data | ... % JH % ...
+ * %%%%%% ****** +------+ +------+     ****** +------+     %%%%%%
+ *
+ */
+
+CTASSERT(sizeof(struct g_journal_header) <= 512);
+CTASSERT(sizeof(struct g_journal_record_header) <= 512);
+
+static MALLOC_DEFINE(M_JOURNAL, "journal_data", "GEOM_JOURNAL Data");
+static struct mtx g_journal_cache_mtx;
+MTX_SYSINIT(g_journal_cache, &g_journal_cache_mtx, "cache usage", MTX_DEF);
+
+const struct g_journal_desc *g_journal_filesystems[] = {
+	&g_journal_ufs,
+	NULL
+};
+
+SYSCTL_DECL(_kern_geom);
+
+int g_journal_debug = 0;
+TUNABLE_INT("kern.geom.journal.debug", &g_journal_debug);
+static u_int g_journal_switch_time = 10;
+static u_int g_journal_force_switch = 70;
+static u_int g_journal_parallel_flushes = 16;
+static u_int g_journal_parallel_copies = 16;
+static u_int g_journal_accept_immediately = 64;
+static u_int g_journal_record_entries = GJ_RECORD_HEADER_NENTRIES;
+static u_int g_journal_do_optimize = 1;
+
+SYSCTL_NODE(_kern_geom, OID_AUTO, journal, CTLFLAG_RW, 0, "GEOM_JOURNAL stuff");
+SYSCTL_INT(_kern_geom_journal, OID_AUTO, debug, CTLFLAG_RW, &g_journal_debug, 0,
+    "Debug level");
+SYSCTL_UINT(_kern_geom_journal, OID_AUTO, switch_time, CTLFLAG_RW,
+    &g_journal_switch_time, 0, "Switch journals every N seconds");
+SYSCTL_UINT(_kern_geom_journal, OID_AUTO, force_switch, CTLFLAG_RW,
+    &g_journal_force_switch, 0, "Force switch when journal is N%% full");
+SYSCTL_UINT(_kern_geom_journal, OID_AUTO, parallel_flushes, CTLFLAG_RW,
+    &g_journal_parallel_flushes, 0,
+    "Number of flush I/O requests sent in parallel");
+SYSCTL_UINT(_kern_geom_journal, OID_AUTO, accept_immediately, CTLFLAG_RW,
+    &g_journal_accept_immediately, 0,
+    "Number of I/O requests accepted immediately");
+SYSCTL_UINT(_kern_geom_journal, OID_AUTO, parallel_copies, CTLFLAG_RW,
+    &g_journal_parallel_copies, 0,
+    "Number of copy I/O requests sent in parallel");
+static int
+g_journal_record_entries_sysctl(SYSCTL_HANDLER_ARGS)
+{
+	u_int entries;
+	int error;
+
+	entries = g_journal_record_entries;
+	error = sysctl_handle_int(oidp, &entries, sizeof(entries), req);
+	if (error != 0 || req->newptr == NULL)
+		return (error);
+	if (entries < 1 || entries > GJ_RECORD_HEADER_NENTRIES)
+		return (EINVAL);
+	g_journal_record_entries = entries;
+	return (0);
+}
+SYSCTL_PROC(_kern_geom_journal, OID_AUTO, record_entries,
+    CTLTYPE_UINT | CTLFLAG_RW, NULL, 0, g_journal_record_entries_sysctl, "I",
+    "Maximum number of entries in one journal record");
+SYSCTL_UINT(_kern_geom_journal, OID_AUTO, optimize, CTLFLAG_RW,
+    &g_journal_do_optimize, 0, "Try to combine bios on flush and copy");
+
+static u_int g_journal_cache_used = 0;
+static u_int g_journal_cache_limit = 64 * 1024 * 1024;
+TUNABLE_INT("kern.geom.journal.cache.limit", &g_journal_cache_limit);
+static u_int g_journal_cache_divisor = 2;
+TUNABLE_INT("kern.geom.journal.cache.divisor", &g_journal_cache_divisor);
+static u_int g_journal_cache_switch = 90;
+static u_int g_journal_cache_misses = 0;
+static u_int g_journal_cache_alloc_failures = 0;
+static u_int g_journal_cache_low = 0;
+
+SYSCTL_NODE(_kern_geom_journal, OID_AUTO, cache, CTLFLAG_RW, 0,
+    "GEOM_JOURNAL cache");
+SYSCTL_UINT(_kern_geom_journal_cache, OID_AUTO, used, CTLFLAG_RD,
+    &g_journal_cache_used, 0, "Number of allocated bytes");
+static int
+g_journal_cache_limit_sysctl(SYSCTL_HANDLER_ARGS)
+{
+	u_int limit;
+	int error;
+
+	limit = g_journal_cache_limit;
+	error = sysctl_handle_int(oidp, &limit, sizeof(limit), req);
+	if (error != 0 || req->newptr == NULL)
+		return (error);
+	g_journal_cache_limit = limit;
+	g_journal_cache_low = (limit / 100) * g_journal_cache_switch;
+	return (0);
+}
+SYSCTL_PROC(_kern_geom_journal_cache, OID_AUTO, limit,
+    CTLTYPE_UINT | CTLFLAG_RW, NULL, 0, g_journal_cache_limit_sysctl, "I",
+    "Maximum number of allocated bytes");
+SYSCTL_UINT(_kern_geom_journal_cache, OID_AUTO, divisor, CTLFLAG_RDTUN,
+    &g_journal_cache_divisor, 0,
+    "(kmem_size / kern.geom.journal.cache.divisor) == cache size");
+static int
+g_journal_cache_switch_sysctl(SYSCTL_HANDLER_ARGS)
+{
+	u_int cswitch;
+	int error;
+
+	cswitch = g_journal_cache_switch;
+	error = sysctl_handle_int(oidp, &cswitch, sizeof(cswitch), req);
+	if (error != 0 || req->newptr == NULL)
+		return (error);
+	if (cswitch < 0 || cswitch > 100)
+		return (EINVAL);
+	g_journal_cache_switch = cswitch;
+	g_journal_cache_low = (g_journal_cache_limit / 100) * cswitch;
+	return (0);
+}
+SYSCTL_PROC(_kern_geom_journal_cache, OID_AUTO, switch,
+    CTLTYPE_UINT | CTLFLAG_RW, NULL, 0, g_journal_cache_switch_sysctl, "I",
+    "Force switch when we hit this percent of cache use");
+SYSCTL_UINT(_kern_geom_journal_cache, OID_AUTO, misses, CTLFLAG_RW,
+    &g_journal_cache_misses, 0, "Number of cache misses");
+SYSCTL_UINT(_kern_geom_journal_cache, OID_AUTO, alloc_failures, CTLFLAG_RW,
+    &g_journal_cache_alloc_failures, 0, "Memory allocation failures");
+
+static u_long g_journal_stats_bytes_skipped = 0;
+static u_long g_journal_stats_combined_ios = 0;
+static u_long g_journal_stats_switches = 0;
+static u_long g_journal_stats_wait_for_copy = 0;
+static u_long g_journal_stats_journal_full = 0;
+static u_long g_journal_stats_low_mem = 0;
+
+SYSCTL_NODE(_kern_geom_journal, OID_AUTO, stats, CTLFLAG_RW, 0,
+    "GEOM_JOURNAL statistics");
+SYSCTL_ULONG(_kern_geom_journal_stats, OID_AUTO, skipped_bytes, CTLFLAG_RW,
+    &g_journal_stats_bytes_skipped, 0, "Number of skipped bytes");
+SYSCTL_ULONG(_kern_geom_journal_stats, OID_AUTO, combined_ios, CTLFLAG_RW,
+    &g_journal_stats_combined_ios, 0, "Number of combined I/O requests");
+SYSCTL_ULONG(_kern_geom_journal_stats, OID_AUTO, switches, CTLFLAG_RW,
+    &g_journal_stats_switches, 0, "Number of journal switches");
+SYSCTL_ULONG(_kern_geom_journal_stats, OID_AUTO, wait_for_copy, CTLFLAG_RW,
+    &g_journal_stats_wait_for_copy, 0, "Wait for journal copy on switch");
+SYSCTL_ULONG(_kern_geom_journal_stats, OID_AUTO, journal_full, CTLFLAG_RW,
+    &g_journal_stats_journal_full, 0,
+    "Number of times journal was almost full.");
+SYSCTL_ULONG(_kern_geom_journal_stats, OID_AUTO, low_mem, CTLFLAG_RW,
+    &g_journal_stats_low_mem, 0, "Number of times low_mem hook was called.");
+
+static g_taste_t g_journal_taste;
+static g_ctl_req_t g_journal_config;
+static g_dumpconf_t g_journal_dumpconf;
+static g_init_t g_journal_init;
+static g_fini_t g_journal_fini;
+
+struct g_class g_journal_class = {
+	.name = G_JOURNAL_CLASS_NAME,
+	.version = G_VERSION,
+	.taste = g_journal_taste,
+	.ctlreq = g_journal_config,
+	.dumpconf = g_journal_dumpconf,
+	.init = g_journal_init,
+	.fini = g_journal_fini
+};
+
+static int g_journal_destroy(struct g_journal_softc *sc);
+static void g_journal_metadata_update(struct g_journal_softc *sc);
+static void g_journal_switch_wait(struct g_journal_softc *sc);
+
+#define	GJ_SWITCHER_WORKING	0
+#define	GJ_SWITCHER_DIE		1
+#define	GJ_SWITCHER_DIED	2
+static int g_journal_switcher_state = GJ_SWITCHER_WORKING;
+static int g_journal_switcher_wokenup = 0;
+static int g_journal_sync_requested = 0;
+
+#ifdef GJ_MEMDEBUG
+struct meminfo {
+	size_t		mi_size;
+	struct stack	mi_stack;
+};
+#endif
+
+/*
+ * We use our own malloc/realloc/free functions, so we can collect statistics
+ * and force journal switch when we're running out of cache.
+ */
+static void *
+gj_malloc(size_t size, int flags)
+{
+	void *p;
+#ifdef GJ_MEMDEBUG
+	struct meminfo *mi;
+#endif
+
+	mtx_lock(&g_journal_cache_mtx);
+	if (g_journal_cache_limit > 0 && !g_journal_switcher_wokenup &&
+	    g_journal_cache_used + size > g_journal_cache_low) {
+		GJ_DEBUG(1, "No cache, waking up the switcher.");
+		g_journal_switcher_wokenup = 1;
+		wakeup(&g_journal_switcher_state);
+	}
+	if ((flags & M_NOWAIT) && g_journal_cache_limit > 0 &&
+	    g_journal_cache_used + size > g_journal_cache_limit) {
+		mtx_unlock(&g_journal_cache_mtx);
+		g_journal_cache_alloc_failures++;
+		return (NULL);
+	}
+	g_journal_cache_used += size;
+	mtx_unlock(&g_journal_cache_mtx);
+	flags &= ~M_NOWAIT;
+#ifndef GJ_MEMDEBUG
+	p = malloc(size, M_JOURNAL, flags | M_WAITOK);
+#else
+	mi = malloc(sizeof(*mi) + size, M_JOURNAL, flags | M_WAITOK);
+	p = (u_char *)mi + sizeof(*mi);
+	mi->mi_size = size;
+	stack_save(&mi->mi_stack);
+#endif
+	return (p);
+}
+
+static void
+gj_free(void *p, size_t size)
+{
+#ifdef GJ_MEMDEBUG
+	struct meminfo *mi;
+#endif
+
+	KASSERT(p != NULL, ("p=NULL"));
+	KASSERT(size > 0, ("size=0"));
+	mtx_lock(&g_journal_cache_mtx);
+	KASSERT(g_journal_cache_used >= size, ("Freeing too much?"));
+	g_journal_cache_used -= size;
+	mtx_unlock(&g_journal_cache_mtx);
+#ifdef GJ_MEMDEBUG
+	mi = p = (void *)((u_char *)p - sizeof(*mi));
+	if (mi->mi_size != size) {
+		printf("GJOURNAL: Size mismatch! %zu != %zu\n", size,
+		    mi->mi_size);
+		printf("GJOURNAL: Alloc backtrace:\n");
+		stack_print(&mi->mi_stack);
+		printf("GJOURNAL: Free backtrace:\n");
+		kdb_backtrace();
+	}
+#endif
+	free(p, M_JOURNAL);
+}
+
+static void *
+gj_realloc(void *p, size_t size, size_t oldsize)
+{
+	void *np;
+
+#ifndef GJ_MEMDEBUG
+	mtx_lock(&g_journal_cache_mtx);
+	g_journal_cache_used -= oldsize;
+	g_journal_cache_used += size;
+	mtx_unlock(&g_journal_cache_mtx);
+	np = realloc(p, size, M_JOURNAL, M_WAITOK);
+#else
+	np = gj_malloc(size, M_WAITOK);
+	bcopy(p, np, MIN(oldsize, size));
+	gj_free(p, oldsize);
+#endif
+	return (np);
+}
+
+static void
+g_journal_check_overflow(struct g_journal_softc *sc)
+{
+	off_t length, used;
+
+	if ((sc->sc_active.jj_offset < sc->sc_inactive.jj_offset &&
+	     sc->sc_journal_offset >= sc->sc_inactive.jj_offset) ||
+	    (sc->sc_active.jj_offset > sc->sc_inactive.jj_offset &&
+	     sc->sc_journal_offset >= sc->sc_inactive.jj_offset &&
+	     sc->sc_journal_offset < sc->sc_active.jj_offset)) {
+		panic("Journal overflow (joffset=%jd active=%jd inactive=%jd)",
+		    (intmax_t)sc->sc_journal_offset,
+		    (intmax_t)sc->sc_active.jj_offset,
+		    (intmax_t)sc->sc_inactive.jj_offset);
+	}
+	if (sc->sc_active.jj_offset < sc->sc_inactive.jj_offset) {
+		length = sc->sc_inactive.jj_offset - sc->sc_active.jj_offset;
+		used = sc->sc_journal_offset - sc->sc_active.jj_offset;
+	} else {
+		length = sc->sc_jend - sc->sc_active.jj_offset;
+		length += sc->sc_inactive.jj_offset - sc->sc_jstart;
+		if (sc->sc_journal_offset >= sc->sc_active.jj_offset)
+			used = sc->sc_journal_offset - sc->sc_active.jj_offset;
+		else {
+			used = sc->sc_jend - sc->sc_active.jj_offset;
+			used += sc->sc_journal_offset - sc->sc_jstart;
+		}
+	}
+	/* Already woken up? */
+	if (g_journal_switcher_wokenup)
+		return;
+	/*
+	 * If the active journal takes more than g_journal_force_switch percent
+	 * of free journal space, we force journal switch.
+	 */
+	KASSERT(length > 0,
+	    ("length=%jd used=%jd active=%jd inactive=%jd joffset=%jd",
+	    (intmax_t)length, (intmax_t)used,
+	    (intmax_t)sc->sc_active.jj_offset,
+	    (intmax_t)sc->sc_inactive.jj_offset,
+	    (intmax_t)sc->sc_journal_offset));
+	if ((used * 100) / length > g_journal_force_switch) {
+		g_journal_stats_journal_full++;
+		GJ_DEBUG(1, "Journal %s %jd%% full, forcing journal switch.",
+		    sc->sc_name, (used * 100) / length);
+		mtx_lock(&g_journal_cache_mtx);
+		g_journal_switcher_wokenup = 1;
+		wakeup(&g_journal_switcher_state);
+		mtx_unlock(&g_journal_cache_mtx);
+	}
+}
+
+static void
+g_journal_orphan(struct g_consumer *cp)
+{
+	struct g_journal_softc *sc;
+	char name[256];
+	int error;
+
+	g_topology_assert();
+	sc = cp->geom->softc;
+	GJ_DEBUG(0, "Lost provider %s (journal=%s).", cp->provider->name,
+	    sc->sc_name);
+	strlcpy(name, sc->sc_name, sizeof(name));
+	error = g_journal_destroy(sc);
+	if (error == 0)
+		GJ_DEBUG(0, "Journal %s destroyed.", name);
+	else {
+		GJ_DEBUG(0, "Cannot destroy journal %s (error=%d). "
+		    "Destroy it manually after last close.", sc->sc_name,
+		    error);
+	}
+}
+
+static int
+g_journal_access(struct g_provider *pp, int acr, int acw, int ace)
+{
+	struct g_journal_softc *sc;
+	int dcr, dcw, dce;
+
+	g_topology_assert();
+	GJ_DEBUG(2, "Access request for %s: r%dw%de%d.", pp->name,
+	    acr, acw, ace);
+
+	dcr = pp->acr + acr;
+	dcw = pp->acw + acw;
+	dce = pp->ace + ace;
+
+	sc = pp->geom->softc;
+	if (sc == NULL || (sc->sc_flags & GJF_DEVICE_DESTROY)) {
+		if (acr <= 0 && acw <= 0 && ace <= 0)
+			return (0);
+		else
+			return (ENXIO);
+	}
+	if (pp->acw == 0 && dcw > 0) {
+		GJ_DEBUG(1, "Marking %s as dirty.", sc->sc_name);
+		sc->sc_flags &= ~GJF_DEVICE_CLEAN;
+		g_topology_unlock();
+		g_journal_metadata_update(sc);
+		g_topology_lock();
+	} /* else if (pp->acw == 0 && dcw > 0 && JEMPTY(sc)) {
+		GJ_DEBUG(1, "Marking %s as clean.", sc->sc_name);
+		sc->sc_flags |= GJF_DEVICE_CLEAN;
+		g_topology_unlock();
+		g_journal_metadata_update(sc);
+		g_topology_lock();
+	} */
+	return (0);
+}
+
+static void
+g_journal_header_encode(struct g_journal_header *hdr, u_char *data)
+{
+
+	bcopy(GJ_HEADER_MAGIC, data, sizeof(GJ_HEADER_MAGIC));
+	data += sizeof(GJ_HEADER_MAGIC);
+	le32enc(data, hdr->jh_journal_id);
+	data += 4;
+	le32enc(data, hdr->jh_journal_next_id);
+}
+
+static int
+g_journal_header_decode(const u_char *data, struct g_journal_header *hdr)
+{
+
+	bcopy(data, hdr->jh_magic, sizeof(hdr->jh_magic));
+	data += sizeof(hdr->jh_magic);
+	if (bcmp(hdr->jh_magic, GJ_HEADER_MAGIC, sizeof(GJ_HEADER_MAGIC)) != 0)
+		return (EINVAL);
+	hdr->jh_journal_id = le32dec(data);
+	data += 4;
+	hdr->jh_journal_next_id = le32dec(data);
+	return (0);
+}
+
+static void
+g_journal_flush_cache(struct g_journal_softc *sc)
+{
+	struct bintime bt;
+	int error;
+
+	if (sc->sc_bio_flush == 0)
+		return;
+	GJ_TIMER_START(1, &bt);
+	if (sc->sc_bio_flush & GJ_FLUSH_JOURNAL) {
+		error = g_io_flush(sc->sc_jconsumer);
+		GJ_DEBUG(error == 0 ? 2 : 0, "Flush cache of %s: error=%d.",
+		    sc->sc_jconsumer->provider->name, error);
+	}
+	if (sc->sc_bio_flush & GJ_FLUSH_DATA) {
+		/*
+		 * TODO: This could be called in parallel with the
+		 *       previous call.
+		 */
+		error = g_io_flush(sc->sc_dconsumer);
+		GJ_DEBUG(error == 0 ? 2 : 0, "Flush cache of %s: error=%d.",
+		    sc->sc_dconsumer->provider->name, error);
+	}
+	GJ_TIMER_STOP(1, &bt, "Cache flush time");
+}
+
+static int
+g_journal_write_header(struct g_journal_softc *sc)
+{
+	struct g_journal_header hdr;
+	struct g_consumer *cp;
+	u_char *buf;
+	int error;
+
+	cp = sc->sc_jconsumer;
+	buf = gj_malloc(cp->provider->sectorsize, M_WAITOK);
+
+	strlcpy(hdr.jh_magic, GJ_HEADER_MAGIC, sizeof(hdr.jh_magic));
+	hdr.jh_journal_id = sc->sc_journal_id;
+	hdr.jh_journal_next_id = sc->sc_journal_next_id;
+	g_journal_header_encode(&hdr, buf);
+	error = g_write_data(cp, sc->sc_journal_offset, buf,
+	    cp->provider->sectorsize);
+	/* if (error == 0) */
+	sc->sc_journal_offset += cp->provider->sectorsize;
+
+	gj_free(buf, cp->provider->sectorsize);
+	return (error);
+}
+
+/*
+ * Every journal record has a header and data following it.
+ * The functions below encode the header to little endian before storing
+ * it and decode it to system endianness after reading.
+ */
+static void
+g_journal_record_header_encode(struct g_journal_record_header *hdr,
+    u_char *data)
+{
+	struct g_journal_entry *ent;
+	u_int i;
+
+	bcopy(GJ_RECORD_HEADER_MAGIC, data, sizeof(GJ_RECORD_HEADER_MAGIC));
+	data += sizeof(GJ_RECORD_HEADER_MAGIC);
+	le32enc(data, hdr->jrh_journal_id);
+	data += 8;
+	le16enc(data, hdr->jrh_nentries);
+	data += 2;
+	bcopy(hdr->jrh_sum, data, sizeof(hdr->jrh_sum));
+	data += 8;
+	for (i = 0; i < hdr->jrh_nentries; i++) {
+		ent = &hdr->jrh_entries[i];
+		le64enc(data, ent->je_joffset);
+		data += 8;
+		le64enc(data, ent->je_offset);
+		data += 8;
+		le64enc(data, ent->je_length);
+		data += 8;
+	}
+}
+
+static int
+g_journal_record_header_decode(const u_char *data,
+    struct g_journal_record_header *hdr)
+{
+	struct g_journal_entry *ent;
+	u_int i;
+
+	bcopy(data, hdr->jrh_magic, sizeof(hdr->jrh_magic));
+	data += sizeof(hdr->jrh_magic);
+	if (strcmp(hdr->jrh_magic, GJ_RECORD_HEADER_MAGIC) != 0)
+		return (EINVAL);
+	hdr->jrh_journal_id = le32dec(data);
+	data += 8;
+	hdr->jrh_nentries = le16dec(data);
+	data += 2;
+	if (hdr->jrh_nentries > GJ_RECORD_HEADER_NENTRIES)
+		return (EINVAL);
+	bcopy(data, hdr->jrh_sum, sizeof(hdr->jrh_sum));
+	data += 8;
+	for (i = 0; i < hdr->jrh_nentries; i++) {
+		ent = &hdr->jrh_entries[i];
+		ent->je_joffset = le64dec(data);
+		data += 8;
+		ent->je_offset = le64dec(data);
+		data += 8;
+		ent->je_length = le64dec(data);
+		data += 8;
+	}
+	return (0);
+}
+
+/*
+ * Read metadata from a provider (via the given consumer), decode it to
+ * system endianness and verify its correctness.
+ */
+static int
+g_journal_metadata_read(struct g_consumer *cp, struct g_journal_metadata *md)
+{
+	struct g_provider *pp;
+	u_char *buf;
+	int error;
+
+	g_topology_assert();
+
+	error = g_access(cp, 1, 0, 0);
+	if (error != 0)
+		return (error);
+	pp = cp->provider;
+	g_topology_unlock();
+	/* Metadata is stored in last sector. */
+	buf = g_read_data(cp, pp->mediasize - pp->sectorsize, pp->sectorsize,
+	    &error);
+	g_topology_lock();
+	g_access(cp, -1, 0, 0);
+	if (error != 0) {
+		GJ_DEBUG(1, "Cannot read metadata from %s (error=%d).",
+		    cp->provider->name, error);
+		if (buf != NULL)
+			g_free(buf);
+		return (error);
+	}
+
+	/* Decode metadata. */
+	error = journal_metadata_decode(buf, md);
+	g_free(buf);
+	/* Is this a gjournal provider at all? */
+	if (strcmp(md->md_magic, G_JOURNAL_MAGIC) != 0)
+		return (EINVAL);
+	/*
+	 * Are we able to handle this version of metadata?
+	 * We only maintain backward compatibility.
+	 */
+	if (md->md_version > G_JOURNAL_VERSION) {
+		GJ_DEBUG(0,
+		    "Kernel module is too old to handle metadata from %s.",
+		    cp->provider->name);
+		return (EINVAL);
+	}
+	/* Is checksum correct? */
+	if (error != 0) {
+		GJ_DEBUG(0, "MD5 metadata hash mismatch for provider %s.",
+		    cp->provider->name);
+		return (error);
+	}
+	return (0);
+}
+
+/*
+ * The two functions below are responsible for updating metadata.
+ * Only metadata on the data provider is updated (we need to update
+ * information about active journal in there).
+ */
+static void
+g_journal_metadata_done(struct bio *bp)
+{
+
+	/*
+	 * There is not much we can do on error except informing about it.
+	 */
+	if (bp->bio_error != 0) {
+		GJ_LOGREQ(0, bp, "Cannot update metadata (error=%d).",
+		    bp->bio_error);
+	} else {
+		GJ_LOGREQ(2, bp, "Metadata updated.");
+	}
+	gj_free(bp->bio_data, bp->bio_length);
+	g_destroy_bio(bp);
+}
+
+static void
+g_journal_metadata_update(struct g_journal_softc *sc)
+{
+	struct g_journal_metadata md;
+	struct g_consumer *cp;
+	struct bio *bp;
+	u_char *sector;
+
+	cp = sc->sc_dconsumer;
+	sector = gj_malloc(cp->provider->sectorsize, M_WAITOK);
+	strlcpy(md.md_magic, G_JOURNAL_MAGIC, sizeof(md.md_magic));
+	md.md_version = G_JOURNAL_VERSION;
+	md.md_id = sc->sc_id;
+	md.md_type = sc->sc_orig_type;
+	md.md_jstart = sc->sc_jstart;
+	md.md_jend = sc->sc_jend;
+	md.md_joffset = sc->sc_inactive.jj_offset;
+	md.md_jid = sc->sc_journal_previous_id;
+	md.md_flags = 0;
+	if (sc->sc_flags & GJF_DEVICE_CLEAN)
+		md.md_flags |= GJ_FLAG_CLEAN;
+
+	if (sc->sc_flags & GJF_DEVICE_HARDCODED)
+		strlcpy(md.md_provider, sc->sc_name, sizeof(md.md_provider));
+	else
+		bzero(md.md_provider, sizeof(md.md_provider));
+	md.md_provsize = cp->provider->mediasize;
+	journal_metadata_encode(&md, sector);
+
+	/*
+	 * Flush the cache, so we know all data are on disk.
+	 * We write information such as "journal is consistent" here, so we
+	 * need to be sure it is. Without BIO_FLUSH here, we can end up in a
+	 * situation where metadata is stored on disk, but not all data.
+	 */
+	g_journal_flush_cache(sc);
+
+	bp = g_alloc_bio();
+	bp->bio_offset = cp->provider->mediasize - cp->provider->sectorsize;
+	bp->bio_length = cp->provider->sectorsize;
+	bp->bio_data = sector;
+	bp->bio_cmd = BIO_WRITE;
+	if (!(sc->sc_flags & GJF_DEVICE_DESTROY)) {
+		bp->bio_done = g_journal_metadata_done;
+		g_io_request(bp, cp);
+	} else {
+		bp->bio_done = NULL;
+		g_io_request(bp, cp);
+		biowait(bp, "gjmdu");
+		g_journal_metadata_done(bp);
+	}
+
+	/*
+	 * Be sure metadata reached the disk.
+	 */
+	g_journal_flush_cache(sc);
+}
+
+/*
+ * This is where I/O requests arrive from GEOM.
+ */
+static void
+g_journal_start(struct bio *bp)
+{
+	struct g_journal_softc *sc;
+
+	sc = bp->bio_to->geom->softc;
+	GJ_LOGREQ(3, bp, "Request received.");
+
+	switch (bp->bio_cmd) {
+	case BIO_READ:
+	case BIO_WRITE:
+		mtx_lock(&sc->sc_mtx);
+		bioq_insert_tail(&sc->sc_regular_queue, bp);
+		wakeup(sc);
+		mtx_unlock(&sc->sc_mtx);
+		return;
+	case BIO_GETATTR:
+		if (strcmp(bp->bio_attribute, "GJOURNAL::provider") == 0) {
+			strlcpy(bp->bio_data, bp->bio_to->name, bp->bio_length);
+			bp->bio_completed = strlen(bp->bio_to->name) + 1;
+			g_io_deliver(bp, 0);
+			return;
+		}
+		/* FALLTHROUGH */
+	case BIO_DELETE:
+	default:
+		g_io_deliver(bp, EOPNOTSUPP);
+		return;
+	}
+}
+
+static void
+g_journal_std_done(struct bio *bp)
+{
+	struct g_journal_softc *sc;
+
+	sc = bp->bio_from->geom->softc;
+	mtx_lock(&sc->sc_mtx);
+	bioq_insert_tail(&sc->sc_back_queue, bp);
+	wakeup(sc);
+	mtx_unlock(&sc->sc_mtx);
+}
+
+static struct bio *
+g_journal_new_bio(off_t start, off_t end, off_t joffset, u_char *data,
+    int flags)
+{
+	struct bio *bp;
+
+	bp = g_alloc_bio();
+	bp->bio_offset = start;
+	bp->bio_joffset = joffset;
+	bp->bio_length = end - start;
+	bp->bio_cmd = BIO_WRITE;
+	bp->bio_done = g_journal_std_done;
+	if (data == NULL)
+		bp->bio_data = NULL;
+	else {
+		bp->bio_data = gj_malloc(bp->bio_length, flags);
+		if (bp->bio_data != NULL)
+			bcopy(data, bp->bio_data, bp->bio_length);
+	}
+	return (bp);
+}
+
+#define	g_journal_insert_bio(head, bp, flags)				\
+	g_journal_insert((head), (bp)->bio_offset,			\
+		(bp)->bio_offset + (bp)->bio_length, (bp)->bio_joffset,	\
+		(bp)->bio_data, flags)
+/*
+ * The function below does a lot more than just insert a bio into the queue.
+ * It keeps the queue sorted by offset and ensures that no data is duplicated
+ * (it combines bios whose ranges overlap).
+ *
+ * The function returns the number of bios inserted (as a bio can be split).
+ */
+static int
+g_journal_insert(struct bio **head, off_t nstart, off_t nend, off_t joffset,
+    u_char *data, int flags)
+{
+	struct bio *nbp, *cbp, *pbp;
+	off_t cstart, cend;
+	u_char *tmpdata;
+	int n;
+
+	GJ_DEBUG(3, "INSERT(%p): (%jd, %jd, %jd)", *head, nstart, nend,
+	    joffset);
+	n = 0;
+	pbp = NULL;
+	GJQ_FOREACH(*head, cbp) {
+		cstart = cbp->bio_offset;
+		cend = cbp->bio_offset + cbp->bio_length;
+
+		if (nstart >= cend) {
+			/*
+			 *  +-------------+
+			 *  |             |
+			 *  |   current   |  +-------------+
+			 *  |     bio     |  |             |
+			 *  |             |  |     new     |
+			 *  +-------------+  |     bio     |
+			 *                   |             |
+			 *                   +-------------+
+			 */
+			GJ_DEBUG(3, "INSERT(%p): 1", *head);
+		} else if (nend <= cstart) {
+			/*
+			 *                   +-------------+
+			 *                   |             |
+			 *  +-------------+  |   current   |
+			 *  |             |  |     bio     |
+			 *  |     new     |  |             |
+			 *  |     bio     |  +-------------+
+			 *  |             |
+			 *  +-------------+
+			 */
+			nbp = g_journal_new_bio(nstart, nend, joffset, data,
+			    flags);
+			if (pbp == NULL)
+				*head = nbp;
+			else
+				pbp->bio_next = nbp;
+			nbp->bio_next = cbp;
+			n++;
+			GJ_DEBUG(3, "INSERT(%p): 2 (nbp=%p pbp=%p)", *head, nbp,
+			    pbp);
+			goto end;
+		} else if (nstart <= cstart && nend >= cend) {
+			/*
+			 *      +-------------+      +-------------+
+			 *      | current bio |      | current bio |
+			 *  +---+-------------+---+  +-------------+---+
+			 *  |   |             |   |  |             |   |
+			 *  |   |             |   |  |             |   |
+			 *  |   +-------------+   |  +-------------+   |
+			 *  |       new bio       |  |     new bio     |
+			 *  +---------------------+  +-----------------+
+			 *
+			 *      +-------------+  +-------------+
+			 *      | current bio |  | current bio |
+			 *  +---+-------------+  +-------------+
+			 *  |   |             |  |             |
+			 *  |   |             |  |             |
+			 *  |   +-------------+  +-------------+
+			 *  |     new bio     |  |   new bio   |
+			 *  +-----------------+  +-------------+
+			 */
+			g_journal_stats_bytes_skipped += cbp->bio_length;
+			cbp->bio_offset = nstart;
+			cbp->bio_joffset = joffset;
+			cbp->bio_length = cend - nstart;
+			if (cbp->bio_data != NULL) {
+				gj_free(cbp->bio_data, cend - cstart);
+				cbp->bio_data = NULL;
+			}
+			if (data != NULL) {
+				cbp->bio_data = gj_malloc(cbp->bio_length,
+				    flags);
+				if (cbp->bio_data != NULL) {
+					bcopy(data, cbp->bio_data,
+					    cbp->bio_length);
+				}
+				data += cend - nstart;
+			}
+			joffset += cend - nstart;
+			nstart = cend;
+			GJ_DEBUG(3, "INSERT(%p): 3 (cbp=%p)", *head, cbp);
+		} else if (nstart > cstart && nend >= cend) {
+			/*
+			 *  +-----------------+  +-------------+
+			 *  |   current bio   |  | current bio |
+			 *  |   +-------------+  |   +---------+---+
+			 *  |   |             |  |   |         |   |
+			 *  |   |             |  |   |         |   |
+			 *  +---+-------------+  +---+---------+   |
+			 *      |   new bio   |      |   new bio   |
+			 *      +-------------+      +-------------+
+			 */
+			g_journal_stats_bytes_skipped += cend - nstart;
+			nbp = g_journal_new_bio(nstart, cend, joffset, data,
+			    flags);
+			nbp->bio_next = cbp->bio_next;
+			cbp->bio_next = nbp;
+			cbp->bio_length = nstart - cstart;
+			if (cbp->bio_data != NULL) {
+				cbp->bio_data = gj_realloc(cbp->bio_data,
+				    cbp->bio_length, cend - cstart);
+			}
+			if (data != NULL)
+				data += cend - nstart;
+			joffset += cend - nstart;
+			nstart = cend;
+			n++;
+			GJ_DEBUG(3, "INSERT(%p): 4 (cbp=%p)", *head, cbp);
+		} else if (nstart > cstart && nend < cend) {
+			/*
+			 *  +---------------------+
+			 *  |     current bio     |
+			 *  |   +-------------+   |
+			 *  |   |             |   |
+			 *  |   |             |   |
+			 *  +---+-------------+---+
+			 *      |   new bio   |
+			 *      +-------------+
+			 */
+			g_journal_stats_bytes_skipped += nend - nstart;
+			nbp = g_journal_new_bio(nstart, nend, joffset, data,
+			    flags);
+			nbp->bio_next = cbp->bio_next;
+			cbp->bio_next = nbp;
+			if (cbp->bio_data == NULL)
+				tmpdata = NULL;
+			else
+				tmpdata = cbp->bio_data + nend - cstart;
+			nbp = g_journal_new_bio(nend, cend,
+			    cbp->bio_joffset + nend - cstart, tmpdata, flags);
+			nbp->bio_next = ((struct bio *)cbp->bio_next)->bio_next;
+			((struct bio *)cbp->bio_next)->bio_next = nbp;
+			cbp->bio_length = nstart - cstart;
+			if (cbp->bio_data != NULL) {
+				cbp->bio_data = gj_realloc(cbp->bio_data,
+				    cbp->bio_length, cend - cstart);
+			}
+			n += 2;
+			GJ_DEBUG(3, "INSERT(%p): 5 (cbp=%p)", *head, cbp);
+			goto end;
+		} else if (nstart <= cstart && nend < cend) {
+			/*
+			 *  +-----------------+      +-------------+
+			 *  |   current bio   |      | current bio |
+			 *  +-------------+   |  +---+---------+   |
+			 *  |             |   |  |   |         |   |
+			 *  |             |   |  |   |         |   |
+			 *  +-------------+---+  |   +---------+---+
+			 *  |   new bio   |      |   new bio   |
+			 *  +-------------+      +-------------+
+			 */
+			g_journal_stats_bytes_skipped += nend - nstart;
+			nbp = g_journal_new_bio(nstart, nend, joffset, data,
+			    flags);
+			if (pbp == NULL)
+				*head = nbp;
+			else
+				pbp->bio_next = nbp;
+			nbp->bio_next = cbp;
+			cbp->bio_offset = nend;
+			cbp->bio_length = cend - nend;
+			cbp->bio_joffset += nend - cstart;
+			tmpdata = cbp->bio_data;
+			if (tmpdata != NULL) {
+				cbp->bio_data = gj_malloc(cbp->bio_length,
+				    flags);
+				if (cbp->bio_data != NULL) {
+					bcopy(tmpdata + nend - cstart,
+					    cbp->bio_data, cbp->bio_length);
+				}
+				gj_free(tmpdata, cend - cstart);
+			}
+			n++;
+			GJ_DEBUG(3, "INSERT(%p): 6 (cbp=%p)", *head, cbp);
+			goto end;
+		}
+		if (nstart == nend)
+			goto end;
+		pbp = cbp;
+	}
+	nbp = g_journal_new_bio(nstart, nend, joffset, data, flags);
+	if (pbp == NULL)
+		*head = nbp;
+	else
+		pbp->bio_next = nbp;
+	nbp->bio_next = NULL;
+	n++;
+	GJ_DEBUG(3, "INSERT(%p): 8 (nbp=%p pbp=%p)", *head, nbp, pbp);
+end:
+	if (g_journal_debug >= 3) {
+		GJQ_FOREACH(*head, cbp) {
+			GJ_DEBUG(3, "ELEMENT: %p (%jd, %jd, %jd, %p)", cbp,
+			    (intmax_t)cbp->bio_offset,
+			    (intmax_t)cbp->bio_length,
+			    (intmax_t)cbp->bio_joffset, cbp->bio_data);
+		}
+		GJ_DEBUG(3, "INSERT(%p): DONE %d", *head, n);
+	}
+	return (n);
+}
+
+/*
+ * The function combines neighbour bios trying to squeeze as much data as
+ * possible into one bio.
+ *
+ * The function returns the number of bios combined (negative value).
+ */
+static int
+g_journal_optimize(struct bio *head)
+{
+	struct bio *cbp, *pbp;
+	int n;
+
+	n = 0;
+	pbp = NULL;
+	GJQ_FOREACH(head, cbp) {
+		/* Skip bios which have to be read first. */
+		if (cbp->bio_data == NULL) {
+			pbp = NULL;
+			continue;
+		}
+		/* There is no previous bio yet. */
+		if (pbp == NULL) {
+			pbp = cbp;
+			continue;
+		}
+		/* Is this a neighbour bio? */
+		if (pbp->bio_offset + pbp->bio_length != cbp->bio_offset) {
+			/* Be sure that bios queue is sorted. */
+			KASSERT(pbp->bio_offset + pbp->bio_length < cbp->bio_offset,
+			    ("poffset=%jd plength=%jd coffset=%jd",
+			    (intmax_t)pbp->bio_offset,
+			    (intmax_t)pbp->bio_length,
+			    (intmax_t)cbp->bio_offset));
+			pbp = cbp;
+			continue;
+		}
+		/* Be sure we don't end up with a bio that is too big. */
+		if (pbp->bio_length + cbp->bio_length > MAXPHYS) {
+			pbp = cbp;
+			continue;
+		}
+		/* Ok, we can join bios. */
+		GJ_LOGREQ(4, pbp, "Join: ");
+		GJ_LOGREQ(4, cbp, "and: ");
+		pbp->bio_data = gj_realloc(pbp->bio_data,
+		    pbp->bio_length + cbp->bio_length, pbp->bio_length);
+		bcopy(cbp->bio_data, pbp->bio_data + pbp->bio_length,
+		    cbp->bio_length);
+		gj_free(cbp->bio_data, cbp->bio_length);
+		pbp->bio_length += cbp->bio_length;
+		pbp->bio_next = cbp->bio_next;
+		g_destroy_bio(cbp);
+		cbp = pbp;
+		g_journal_stats_combined_ios++;
+		n--;
+		GJ_LOGREQ(4, pbp, "Got: ");
+	}
+	return (n);
+}
+
+/*
+ * TODO: Update comment.
+ * These are functions responsible for copying one portion of data from journal
+ * to the destination provider.
+ * The order goes like this:
+ * 1. Read the header, which contains information about the data blocks
+ *    following it.
+ * 2. Read the data blocks from the journal.
+ * 3. Write the data blocks on the data provider.
+ *
+ * g_journal_copy_start()
+ * g_journal_copy_done() - got finished write request, logs potential errors.
+ */
+
+/*
+ * When there is no data in cache, this function is used to read it.
+ */
+static void
+g_journal_read_first(struct g_journal_softc *sc, struct bio *bp)
+{
+	struct bio *cbp;
+
+	/*
+	 * We were short on memory, so the data was freed.
+	 * In that case we need to read it back from journal.
+	 */
+	cbp = g_alloc_bio();
+	cbp->bio_cflags = bp->bio_cflags;
+	cbp->bio_parent = bp;
+	cbp->bio_offset = bp->bio_joffset;
+	cbp->bio_length = bp->bio_length;
+	cbp->bio_data = gj_malloc(bp->bio_length, M_WAITOK);
+	cbp->bio_cmd = BIO_READ;
+	cbp->bio_done = g_journal_std_done;
+	GJ_LOGREQ(4, cbp, "READ FIRST");
+	g_io_request(cbp, sc->sc_jconsumer);
+	g_journal_cache_misses++;
+}
+
+static void
+g_journal_copy_send(struct g_journal_softc *sc)
+{
+	struct bio *bioq, *bp, *lbp;
+
+	bioq = lbp = NULL;
+	mtx_lock(&sc->sc_mtx);
+	for (; sc->sc_copy_in_progress < g_journal_parallel_copies;) {
+		bp = GJQ_FIRST(sc->sc_inactive.jj_queue);
+		if (bp == NULL)
+			break;
+		GJQ_REMOVE(sc->sc_inactive.jj_queue, bp);
+		sc->sc_copy_in_progress++;
+		GJQ_INSERT_AFTER(bioq, bp, lbp);
+		lbp = bp;
+	}
+	mtx_unlock(&sc->sc_mtx);
+	if (g_journal_do_optimize)
+		sc->sc_copy_in_progress += g_journal_optimize(bioq);
+	while ((bp = GJQ_FIRST(bioq)) != NULL) {
+		GJQ_REMOVE(bioq, bp);
+		GJQ_INSERT_HEAD(sc->sc_copy_queue, bp);
+		bp->bio_cflags = GJ_BIO_COPY;
+		if (bp->bio_data == NULL)
+			g_journal_read_first(sc, bp);
+		else {
+			bp->bio_joffset = 0;
+			GJ_LOGREQ(4, bp, "SEND");
+			g_io_request(bp, sc->sc_dconsumer);
+		}
+	}
+}
+
+static void
+g_journal_copy_start(struct g_journal_softc *sc)
+{
+
+	/*
+	 * Remember in metadata that we're starting to copy journaled data
+	 * to the data provider.
+	 * In case of power failure, we will copy these data once again on boot.
+	 */
+	if (!sc->sc_journal_copying) {
+		sc->sc_journal_copying = 1;
+		GJ_DEBUG(1, "Starting copy of journal.");
+		g_journal_metadata_update(sc);
+	}
+	g_journal_copy_send(sc);
+}
+
+/*
+ * Data block has been read from the journal provider.
+ */
+static int
+g_journal_copy_read_done(struct bio *bp)
+{
+	struct g_journal_softc *sc;
+	struct g_consumer *cp;
+	struct bio *pbp;
+
+	KASSERT(bp->bio_cflags == GJ_BIO_COPY,
+	    ("Invalid bio (%d != %d).", bp->bio_cflags, GJ_BIO_COPY));
+
+	sc = bp->bio_from->geom->softc;
+	pbp = bp->bio_parent;
+
+	if (bp->bio_error != 0) {
+		GJ_DEBUG(0, "Error while reading data from %s (error=%d).",
+		    bp->bio_to->name, bp->bio_error);
+		/*
+		 * We will not be able to deliver WRITE request as well.
+		 */
+		gj_free(bp->bio_data, bp->bio_length);
+		g_destroy_bio(pbp);
+		g_destroy_bio(bp);
+		sc->sc_copy_in_progress--;
+		return (1);
+	}
+	pbp->bio_data = bp->bio_data;
+	cp = sc->sc_dconsumer;
+	g_io_request(pbp, cp);
+	GJ_LOGREQ(4, bp, "READ DONE");
+	g_destroy_bio(bp);
+	return (0);
+}
+
+/*
+ * Data block has been written to the data provider.
+ */
+static void
+g_journal_copy_write_done(struct bio *bp)
+{
+	struct g_journal_softc *sc;
+
+	KASSERT(bp->bio_cflags == GJ_BIO_COPY,
+	    ("Invalid bio (%d != %d).", bp->bio_cflags, GJ_BIO_COPY));
+
+	sc = bp->bio_from->geom->softc;
+	sc->sc_copy_in_progress--;
+
+	if (bp->bio_error != 0) {
+		GJ_LOGREQ(0, bp, "[copy] Error while writing data (error=%d)",
+		    bp->bio_error);
+	}
+	GJQ_REMOVE(sc->sc_copy_queue, bp);
+	gj_free(bp->bio_data, bp->bio_length);
+	GJ_LOGREQ(4, bp, "DONE");
+	g_destroy_bio(bp);
+
+	if (sc->sc_copy_in_progress == 0) {
+		/*
+		 * This was the last write request for this journal.
+		 */
+		GJ_DEBUG(1, "Data has been copied.");
+		sc->sc_journal_copying = 0;
+	}
+}
+
+static void g_journal_flush_done(struct bio *bp);
+
+/*
+ * Flush one record onto active journal provider.
+ */
+static void
+g_journal_flush(struct g_journal_softc *sc)
+{
+	struct g_journal_record_header hdr;
+	struct g_journal_entry *ent;
+	struct g_provider *pp;
+	struct bio **bioq;
+	struct bio *bp, *fbp, *pbp;
+	off_t joffset, size;
+	u_char *data, hash[16];
+	MD5_CTX ctx;
+	u_int i;
+
+	if (sc->sc_current_count == 0)
+		return;
+
+	size = 0;
+	pp = sc->sc_jprovider;
+	GJ_VALIDATE_OFFSET(sc->sc_journal_offset, sc);
+	joffset = sc->sc_journal_offset;
+
+	GJ_DEBUG(2, "Storing %d journal entries on %s at %jd.",
+	    sc->sc_current_count, pp->name, (intmax_t)joffset);
+
+	/*
+	 * Store 'journal id', so we know to which journal this record belongs.
+	 */
+	hdr.jrh_journal_id = sc->sc_journal_id;
+	/* Could be less than g_journal_record_entries if called on timeout. */
+	hdr.jrh_nentries = MIN(sc->sc_current_count, g_journal_record_entries);
+	strlcpy(hdr.jrh_magic, GJ_RECORD_HEADER_MAGIC, sizeof(hdr.jrh_magic));
+
+	bioq = &sc->sc_active.jj_queue;
+	pbp = sc->sc_flush_queue;
+
+	fbp = g_alloc_bio();
+	fbp->bio_parent = NULL;
+	fbp->bio_cflags = GJ_BIO_JOURNAL;
+	fbp->bio_offset = -1;
+	fbp->bio_joffset = joffset;
+	fbp->bio_length = pp->sectorsize;
+	fbp->bio_cmd = BIO_WRITE;
+	fbp->bio_done = g_journal_std_done;
+	GJQ_INSERT_AFTER(sc->sc_flush_queue, fbp, pbp);
+	pbp = fbp;
+	fbp->bio_to = pp;
+	GJ_LOGREQ(4, fbp, "FLUSH_OUT");
+	joffset += pp->sectorsize;
+	sc->sc_flush_count++;
+	if (sc->sc_flags & GJF_DEVICE_CHECKSUM)
+		MD5Init(&ctx);
+
+	for (i = 0; i < hdr.jrh_nentries; i++) {
+		bp = sc->sc_current_queue;
+		KASSERT(bp != NULL, ("NULL bp"));
+		bp->bio_to = pp;
+		GJ_LOGREQ(4, bp, "FLUSHED");
+		sc->sc_current_queue = bp->bio_next;
+		bp->bio_next = NULL;
+		sc->sc_current_count--;
+
+		/* Add to the header. */
+		ent = &hdr.jrh_entries[i];
+		ent->je_offset = bp->bio_offset;
+		ent->je_joffset = joffset;
+		ent->je_length = bp->bio_length;
+		size += ent->je_length;
+
+		data = bp->bio_data;
+		if (sc->sc_flags & GJF_DEVICE_CHECKSUM)
+			MD5Update(&ctx, data, ent->je_length);
+		bzero(bp, sizeof(*bp));
+		bp->bio_cflags = GJ_BIO_JOURNAL;
+		bp->bio_offset = ent->je_offset;
+		bp->bio_joffset = ent->je_joffset;
+		bp->bio_length = ent->je_length;
+		bp->bio_data = data;
+		bp->bio_cmd = BIO_WRITE;
+		bp->bio_done = g_journal_std_done;
+		GJQ_INSERT_AFTER(sc->sc_flush_queue, bp, pbp);
+		pbp = bp;
+		bp->bio_to = pp;
+		GJ_LOGREQ(4, bp, "FLUSH_OUT");
+		joffset += bp->bio_length;
+		sc->sc_flush_count++;
+
+		/*
+		 * Add request to the active sc_journal_queue queue.
+		 * This is our cache. After journal switch we don't have to
+		 * read the data from the inactive journal, because we keep
+		 * it in memory.
+		 */
+		g_journal_insert(bioq, ent->je_offset,
+		    ent->je_offset + ent->je_length, ent->je_joffset, data,
+		    M_NOWAIT);
+	}
+
+	/*
+	 * After all requests, store valid header.
+	 */
+	data = gj_malloc(pp->sectorsize, M_WAITOK);
+	if (sc->sc_flags & GJF_DEVICE_CHECKSUM) {
+		MD5Final(hash, &ctx);
+		bcopy(hash, hdr.jrh_sum, sizeof(hdr.jrh_sum));
+	}
+	g_journal_record_header_encode(&hdr, data);
+	fbp->bio_data = data;
+
+	sc->sc_journal_offset = joffset;
+
+	g_journal_check_overflow(sc);
+}
+
+/*
+ * Flush request finished.
+ */
+static void
+g_journal_flush_done(struct bio *bp)
+{
+	struct g_journal_softc *sc;
+	struct g_consumer *cp;
+
+	KASSERT((bp->bio_cflags & GJ_BIO_MASK) == GJ_BIO_JOURNAL,
+	    ("Invalid bio (%d != %d).", bp->bio_cflags, GJ_BIO_JOURNAL));
+
+	cp = bp->bio_from;
+	sc = cp->geom->softc;
+	sc->sc_flush_in_progress--;
+
+	if (bp->bio_error != 0) {
+		GJ_LOGREQ(0, bp, "[flush] Error while writing data (error=%d)",
+		    bp->bio_error);
+	}
+	gj_free(bp->bio_data, bp->bio_length);
+	GJ_LOGREQ(4, bp, "DONE");
+	g_destroy_bio(bp);
+}
+
+static void g_journal_release_delayed(struct g_journal_softc *sc);
+
+static void
+g_journal_flush_send(struct g_journal_softc *sc)
+{
+	struct g_consumer *cp;
+	struct bio *bioq, *bp, *lbp;
+
+	cp = sc->sc_jconsumer;
+	bioq = lbp = NULL;
+	while (sc->sc_flush_in_progress < g_journal_parallel_flushes) {
+		/* Send one flush request to the active journal. */
+		bp = GJQ_FIRST(sc->sc_flush_queue);
+		if (bp != NULL) {
+			GJQ_REMOVE(sc->sc_flush_queue, bp);
+			sc->sc_flush_count--;
+			bp->bio_offset = bp->bio_joffset;
+			bp->bio_joffset = 0;
+			sc->sc_flush_in_progress++;
+			GJQ_INSERT_AFTER(bioq, bp, lbp);
+			lbp = bp;
+		}
+		/* Try to release delayed requests. */
+		g_journal_release_delayed(sc);
+		/* If there are no requests to flush, leave. */
+		if (GJQ_FIRST(sc->sc_flush_queue) == NULL)
+			break;
+	}
+	if (g_journal_do_optimize)
+		sc->sc_flush_in_progress += g_journal_optimize(bioq);
+	while ((bp = GJQ_FIRST(bioq)) != NULL) {
+		GJQ_REMOVE(bioq, bp);
+		GJ_LOGREQ(3, bp, "Flush request sent");
+		g_io_request(bp, cp);
+	}
+}
+
+static void
+g_journal_add_current(struct g_journal_softc *sc, struct bio *bp)
+{
+	int n;
+
+	GJ_LOGREQ(4, bp, "CURRENT %d", sc->sc_current_count);
+	n = g_journal_insert_bio(&sc->sc_current_queue, bp, M_WAITOK);
+	sc->sc_current_count += n;
+	n = g_journal_optimize(sc->sc_current_queue);
+	sc->sc_current_count += n;
+	/*
+	 * For requests which are added to the current queue we deliver
+	 * response immediately.
+	 */
+	bp->bio_completed = bp->bio_length;
+	g_io_deliver(bp, 0);
+	if (sc->sc_current_count >= g_journal_record_entries) {
+		/*
+		 * Let's flush one record onto active journal provider.
+		 */
+		g_journal_flush(sc);
+	}
+}
+
+static void
+g_journal_release_delayed(struct g_journal_softc *sc)
+{
+	struct bio *bp;
+
+	for (;;) {
+		/* The flush queue is full, exit. */
+		if (sc->sc_flush_count >= g_journal_accept_immediately)
+			return;
+		bp = bioq_takefirst(&sc->sc_delayed_queue);
+		if (bp == NULL)
+			return;
+		sc->sc_delayed_count--;
+		g_journal_add_current(sc, bp);
+	}
+}
+
+/*
+ * Add I/O request to the current queue. If we have enough requests for one
+ * journal record we flush them onto active journal provider.
+ */
+static void
+g_journal_add_request(struct g_journal_softc *sc, struct bio *bp)
+{
+
+	/*
+	 * The flush queue is full, we need to delay the request.
+	 */
+	if (sc->sc_delayed_count > 0 ||
+	    sc->sc_flush_count >= g_journal_accept_immediately) {
+		GJ_LOGREQ(4, bp, "DELAYED");
+		bioq_insert_tail(&sc->sc_delayed_queue, bp);
+		sc->sc_delayed_count++;
+		return;
+	}
+
+	KASSERT(TAILQ_EMPTY(&sc->sc_delayed_queue.queue),
+	    ("DELAYED queue not empty."));
+	g_journal_add_current(sc, bp);
+}
+
+static void g_journal_read_done(struct bio *bp);
+
+/*
+ * Try to find requested data in cache.
+ */
+static struct bio *
+g_journal_read_find(struct bio *head, int sorted, struct bio *pbp, off_t ostart,
+    off_t oend)
+{
+	off_t cstart, cend;
+	struct bio *bp;
+
+	GJQ_FOREACH(head, bp) {
+		if (bp->bio_offset == -1)
+			continue;
+		cstart = MAX(ostart, bp->bio_offset);
+		cend = MIN(oend, bp->bio_offset + bp->bio_length);
+		if (cend <= ostart)
+			continue;
+		else if (cstart >= oend) {
+			if (!sorted)
+				continue;
+			else {
+				bp = NULL;
+				break;
+			}
+		}
+		if (bp->bio_data == NULL)
+			break;
+		GJ_DEBUG(3, "READ(%p): (%jd, %jd) (bp=%p)", head, cstart, cend,
+		    bp);
+		bcopy(bp->bio_data + cstart - bp->bio_offset,
+		    pbp->bio_data + cstart - pbp->bio_offset, cend - cstart);
+		pbp->bio_completed += cend - cstart;
+		if (pbp->bio_completed == pbp->bio_length) {
+			/*
+			 * Cool, the whole request was in cache, deliver happy
+			 * message.
+			 */
+			g_io_deliver(pbp, 0);
+			return (pbp);
+		}
+		break;
+	}
+	return (bp);
+}
+
+/*
+ * Try to find requested data in cache.
+ */
+static struct bio *
+g_journal_read_queue_find(struct bio_queue *head, struct bio *pbp, off_t ostart,
+    off_t oend)
+{
+	off_t cstart, cend;
+	struct bio *bp;
+
+	TAILQ_FOREACH(bp, head, bio_queue) {
+		cstart = MAX(ostart, bp->bio_offset);
+		cend = MIN(oend, bp->bio_offset + bp->bio_length);
+		if (cend <= ostart)
+			continue;
+		else if (cstart >= oend)
+			continue;
+		KASSERT(bp->bio_data != NULL,
+		    ("%s: bio_data == NULL", __func__));
+		GJ_DEBUG(3, "READ(%p): (%jd, %jd) (bp=%p)", head, cstart, cend,
+		    bp);
+		bcopy(bp->bio_data + cstart - bp->bio_offset,
+		    pbp->bio_data + cstart - pbp->bio_offset, cend - cstart);
+		pbp->bio_completed += cend - cstart;
+		if (pbp->bio_completed == pbp->bio_length) {
+			/*
+			 * Cool, the whole request was in cache, deliver happy
+			 * message.
+			 */
+			g_io_deliver(pbp, 0);
+			return (pbp);
+		}
+		break;
+	}
+	return (bp);
+}
+
+/*
+ * This function is used for collecting data on read.
+ * The complexity is because parts of the data can be stored in six different
+ * places:
+ * - in delayed requests
+ * - in memory - the data not yet sent to the active journal provider
+ * - in requests which are going to be sent to the active journal
+ * - in the active journal
+ * - in the inactive journal
+ * - in the data provider
+ */
+static void
+g_journal_read(struct g_journal_softc *sc, struct bio *pbp, off_t ostart,
+    off_t oend)
+{
+	struct bio *bp, *nbp, *head;
+	off_t cstart, cend;
+	u_int i, sorted = 0;
+
+	GJ_DEBUG(3, "READ: (%jd, %jd)", ostart, oend);
+
+	cstart = cend = -1;
+	bp = NULL;
+	head = NULL;
+	for (i = 0; i <= 5; i++) {
+		switch (i) {
+		case 0:	/* Delayed requests. */
+			head = NULL;
+			sorted = 0;
+			break;
+		case 1:	/* Not-yet-sent data. */
+			head = sc->sc_current_queue;
+			sorted = 1;
+			break;
+		case 2:	/* In-flight to the active journal. */
+			head = sc->sc_flush_queue;
+			sorted = 0;
+			break;
+		case 3:	/* Active journal. */
+			head = sc->sc_active.jj_queue;
+			sorted = 1;
+			break;
+		case 4:	/* Inactive journal. */
+			/*
+			 * XXX: Here could be a race with g_journal_lowmem().
+			 */
+			head = sc->sc_inactive.jj_queue;
+			sorted = 1;
+			break;
+		case 5:	/* In-flight to the data provider. */
+			head = sc->sc_copy_queue;
+			sorted = 0;
+			break;
+		default:
+			panic("gjournal %s: i=%d", __func__, i);
+		}
+		if (i == 0)
+			bp = g_journal_read_queue_find(&sc->sc_delayed_queue.queue, pbp, ostart, oend);
+		else
+			bp = g_journal_read_find(head, sorted, pbp, ostart, oend);
+		if (bp == pbp) { /* Got the whole request. */
+			GJ_DEBUG(2, "Got the whole request from %u.", i);
+			return;
+		} else if (bp != NULL) {
+			cstart = MAX(ostart, bp->bio_offset);
+			cend = MIN(oend, bp->bio_offset + bp->bio_length);
+			GJ_DEBUG(2, "Got part of the request from %u (%jd-%jd).",
+			    i, (intmax_t)cstart, (intmax_t)cend);
+			break;
+		}
+	}
+	if (bp != NULL) {
+		if (bp->bio_data == NULL) {
+			nbp = g_clone_bio(pbp);
+			nbp->bio_cflags = GJ_BIO_READ;
+			nbp->bio_data =
+			    pbp->bio_data + cstart - pbp->bio_offset;
+			nbp->bio_offset =
+			    bp->bio_joffset + cstart - bp->bio_offset;
+			nbp->bio_length = cend - cstart;
+			nbp->bio_done = g_journal_read_done;
+			g_io_request(nbp, sc->sc_jconsumer);
+		}
+		/*
+		 * If we don't have the whole request yet, call g_journal_read()
+		 * recursively.
+		 */
+		if (ostart < cstart)
+			g_journal_read(sc, pbp, ostart, cstart);
+		if (oend > cend)
+			g_journal_read(sc, pbp, cend, oend);
+	} else {
+		/*
+		 * No data in memory, no data in journal.
+		 * It's time to ask the data provider.
+		 */
+		GJ_DEBUG(3, "READ(data): (%jd, %jd)", ostart, oend);
+		nbp = g_clone_bio(pbp);
+		nbp->bio_cflags = GJ_BIO_READ;
+		nbp->bio_data = pbp->bio_data + ostart - pbp->bio_offset;
+		nbp->bio_offset = ostart;
+		nbp->bio_length = oend - ostart;
+		nbp->bio_done = g_journal_read_done;
+		g_io_request(nbp, sc->sc_dconsumer);
+		/* We have the whole request, return here. */
+		return;
+	}
+}
+
+/*
+ * Function responsible for handling finished READ requests.
+ * Actually, g_std_done() could be used here, the only difference is that we
+ * log error.
+ */
+static void
+g_journal_read_done(struct bio *bp)
+{
+	struct bio *pbp;
+
+	KASSERT(bp->bio_cflags == GJ_BIO_READ,
+	    ("Invalid bio (%d != %d).", bp->bio_cflags, GJ_BIO_READ));
+
+	pbp = bp->bio_parent;
+	pbp->bio_inbed++;
+	pbp->bio_completed += bp->bio_length;
+
+	if (bp->bio_error != 0) {
+		if (pbp->bio_error == 0)
+			pbp->bio_error = bp->bio_error;
+		GJ_DEBUG(0, "Error while reading data from %s (error=%d).",
+		    bp->bio_to->name, bp->bio_error);
+	}
+	g_destroy_bio(bp);
+	if (pbp->bio_children == pbp->bio_inbed &&
+	    pbp->bio_completed == pbp->bio_length) {
+		/* We're done. */
+		g_io_deliver(pbp, pbp->bio_error);
+	}
+}
+
+/*
+ * Deactivate the current journal and activate the next one.
+ */
+static void
+g_journal_switch(struct g_journal_softc *sc)
+{
+	struct g_provider *pp;
+
+	if (JEMPTY(sc)) {
+		GJ_DEBUG(3, "No need for %s switch.", sc->sc_name);
+		pp = LIST_FIRST(&sc->sc_geom->provider);
+		if (!(sc->sc_flags & GJF_DEVICE_CLEAN) && pp->acw == 0) {
+			sc->sc_flags |= GJF_DEVICE_CLEAN;
+			GJ_DEBUG(1, "Marking %s as clean.", sc->sc_name);
+			g_journal_metadata_update(sc);
+		}
+	} else {
+		GJ_DEBUG(3, "Switching journal %s.", sc->sc_geom->name);
+
+		pp = sc->sc_jprovider;
+
+		sc->sc_journal_previous_id = sc->sc_journal_id;
+
+		sc->sc_journal_id = sc->sc_journal_next_id;
+		sc->sc_journal_next_id = arc4random();
+
+		GJ_VALIDATE_OFFSET(sc->sc_journal_offset, sc);
+
+		g_journal_write_header(sc);
+
+		sc->sc_inactive.jj_offset = sc->sc_active.jj_offset;
+		sc->sc_inactive.jj_queue = sc->sc_active.jj_queue;
+
+		sc->sc_active.jj_offset =
+		    sc->sc_journal_offset - pp->sectorsize;
+		sc->sc_active.jj_queue = NULL;
+
+		/*
+		 * Switch is done, start copying data from the (now) inactive
+		 * journal to the data provider.
+		 */
+		g_journal_copy_start(sc);
+	}
+	mtx_lock(&sc->sc_mtx);
+	sc->sc_flags &= ~GJF_DEVICE_SWITCH;
+	mtx_unlock(&sc->sc_mtx);
+}
+
+static void
+g_journal_initialize(struct g_journal_softc *sc)
+{
+
+	sc->sc_journal_id = arc4random();
+	sc->sc_journal_next_id = arc4random();
+	sc->sc_journal_previous_id = sc->sc_journal_id;
+	sc->sc_journal_offset = sc->sc_jstart;
+	sc->sc_inactive.jj_offset = sc->sc_jstart;
+	g_journal_write_header(sc);
+	sc->sc_active.jj_offset = sc->sc_jstart;
+}
+
+static void
+g_journal_mark_as_dirty(struct g_journal_softc *sc)
+{
+	const struct g_journal_desc *desc;
+	int i;
+
+	GJ_DEBUG(1, "Marking file system %s as dirty.", sc->sc_name);
+	for (i = 0; (desc = g_journal_filesystems[i]) != NULL; i++)
+		desc->jd_dirty(sc->sc_dconsumer);
+}
+
+/*
+ * Function reads a record header from the given journal.
+ * It is very similar to g_read_data(9), but it doesn't allocate memory for bio
+ * and data on every call.
+ */
+static int
+g_journal_sync_read(struct g_consumer *cp, struct bio *bp, off_t offset,
+    void *data)
+{
+	int error;
+
+	bzero(bp, sizeof(*bp));
+	bp->bio_cmd = BIO_READ;
+	bp->bio_done = NULL;
+	bp->bio_offset = offset;
+	bp->bio_length = cp->provider->sectorsize;
+	bp->bio_data = data;
+	g_io_request(bp, cp);
+	error = biowait(bp, "gjs_read");
+	return (error);
+}
+
+#if 0
+/*
+ * Function is called when we start the journal device and we detect that
+ * one of the journals was not fully copied.
+ * The purpose of this function is to read all record headers from the journal
+ * and place them in the inactive queue, so we can start journal
+ * synchronization process and the journal provider itself.
+ * Design decision was taken to not synchronize the whole journal here as it
+ * can take too much time. Reading headers only and delaying synchronization
+ * process until after journal provider is started should be the best choice.
+ */
+#endif
+
+static void
+g_journal_sync(struct g_journal_softc *sc)
+{
+	struct g_journal_record_header rhdr;
+	struct g_journal_entry *ent;
+	struct g_journal_header jhdr;
+	struct g_consumer *cp;
+	struct bio *bp, *fbp, *tbp;
+	off_t joffset, offset;
+	u_char *buf, sum[16];
+	uint64_t id;
+	MD5_CTX ctx;
+	int error, found, i;
+
+	found = 0;
+	fbp = NULL;
+	cp = sc->sc_jconsumer;
+	bp = g_alloc_bio();
+	buf = gj_malloc(cp->provider->sectorsize, M_WAITOK);
+	offset = joffset = sc->sc_inactive.jj_offset = sc->sc_journal_offset;
+
+	GJ_DEBUG(2, "Looking for termination at %jd.", (intmax_t)joffset);
+
+	/*
+	 * Read and decode first journal header.
+	 */
+	error = g_journal_sync_read(cp, bp, offset, buf);
+	if (error != 0) {
+		GJ_DEBUG(0, "Error while reading journal header from %s.",
+		    cp->provider->name);
+		goto end;
+	}
+	error = g_journal_header_decode(buf, &jhdr);
+	if (error != 0) {
+		GJ_DEBUG(0, "Cannot decode journal header from %s.",
+		    cp->provider->name);
+		goto end;
+	}
+	id = sc->sc_journal_id;
+	if (jhdr.jh_journal_id != sc->sc_journal_id) {
+		GJ_DEBUG(1, "Journal ID mismatch at %jd (0x%08x != 0x%08x).",
+		    (intmax_t)offset, (u_int)jhdr.jh_journal_id, (u_int)id);
+		goto end;
+	}
+	offset += cp->provider->sectorsize;
+	id = sc->sc_journal_next_id = jhdr.jh_journal_next_id;
+
+	for (;;) {
+		/*
+		 * If the biggest record won't fit, look for a record header or
+		 * journal header from the beginning.
+		 */
+		GJ_VALIDATE_OFFSET(offset, sc);
+		error = g_journal_sync_read(cp, bp, offset, buf);
+		if (error != 0) {
+			/*
+			 * Not good. Having an error while reading header
+			 * means, that we cannot read next headers and in
+			 * consequence we cannot find termination.
+			 */
+			GJ_DEBUG(0,
+			    "Error while reading record header from %s.",
+			    cp->provider->name);
+			break;
+		}
+
+		error = g_journal_record_header_decode(buf, &rhdr);
+		if (error != 0) {
+			GJ_DEBUG(2, "Not a record header at %jd (error=%d).",
+			    (intmax_t)offset, error);
+			/*
+			 * This is not a record header.
+			 * If we are lucky, this is next journal header.
+			 */
+			error = g_journal_header_decode(buf, &jhdr);
+			if (error != 0) {
+				GJ_DEBUG(1, "Not a journal header at %jd (error=%d).",
+				    (intmax_t)offset, error);
+				/*
+				 * Nope, this is not journal header, which
+				 * basically means that journal is not
+				 * terminated properly.
+				 */
+				error = ENOENT;
+				break;
+			}
+			/*
+			 * Ok. This is header of _some_ journal. Now we need to
+			 * verify if this is header of the _next_ journal.
+			 */
+			if (jhdr.jh_journal_id != id) {
+				GJ_DEBUG(1, "Journal ID mismatch at %jd "
+				    "(0x%08x != 0x%08x).", (intmax_t)offset,
+				    (u_int)jhdr.jh_journal_id, (u_int)id);
+				error = ENOENT;
+				break;
+			}
+
+			/* Found termination. */
+			found++;
+			GJ_DEBUG(1, "Found termination at %jd (id=0x%08x).",
+			    (intmax_t)offset, (u_int)id);
+			sc->sc_active.jj_offset = offset;
+			sc->sc_journal_offset =
+			    offset + cp->provider->sectorsize;
+			sc->sc_journal_id = id;
+			id = sc->sc_journal_next_id = jhdr.jh_journal_next_id;
+
+			while ((tbp = fbp) != NULL) {
+				fbp = tbp->bio_next;
+				GJ_LOGREQ(3, tbp, "Adding request.");
+				g_journal_insert_bio(&sc->sc_inactive.jj_queue,
+				    tbp, M_WAITOK);
+			}
+
+			/* Skip journal's header. */
+			offset += cp->provider->sectorsize;
+			continue;
+		}
+
+		/* Skip record's header. */
+		offset += cp->provider->sectorsize;
+
+		/*
+		 * Add information about every record entry to the inactive
+		 * queue.
+		 */
+		if (sc->sc_flags & GJF_DEVICE_CHECKSUM)
+			MD5Init(&ctx);
+		for (i = 0; i < rhdr.jrh_nentries; i++) {
+			ent = &rhdr.jrh_entries[i];
+			GJ_DEBUG(3, "Insert entry: %jd %jd.",
+			    (intmax_t)ent->je_offset, (intmax_t)ent->je_length);
+			g_journal_insert(&fbp, ent->je_offset,
+			    ent->je_offset + ent->je_length, ent->je_joffset,
+			    NULL, M_WAITOK);
+			if (sc->sc_flags & GJF_DEVICE_CHECKSUM) {
+				u_char *buf2;
+
+				/*
+				 * TODO: Should use faster function (like
+				 *       g_journal_sync_read()).
+				 */
+				buf2 = g_read_data(cp, offset, ent->je_length,
+				    NULL);
+				if (buf2 == NULL)
+					GJ_DEBUG(0, "Cannot read data at %jd.",
+					    (intmax_t)offset);
+				else {
+					MD5Update(&ctx, buf2, ent->je_length);
+					g_free(buf2);
+				}
+			}
+			/* Skip entry's data. */
+			offset += ent->je_length;
+		}
+		if (sc->sc_flags & GJF_DEVICE_CHECKSUM) {
+			MD5Final(sum, &ctx);
+			if (bcmp(sum, rhdr.jrh_sum, sizeof(rhdr.jrh_sum)) != 0) {
+				GJ_DEBUG(0, "MD5 hash mismatch at %jd!",
+				    (intmax_t)offset);
+			}
+		}
+	}
+end:
+	gj_free(bp->bio_data, cp->provider->sectorsize);
+	g_destroy_bio(bp);
+
+	/* Remove bios from unterminated journal. */
+	while ((tbp = fbp) != NULL) {
+		fbp = tbp->bio_next;
+		g_destroy_bio(tbp);
+	}
+
+	if (found < 1 && joffset > 0) {
+		GJ_DEBUG(0, "Journal on %s is broken/corrupted. Initializing.",
+		    sc->sc_name);
+		while ((tbp = sc->sc_inactive.jj_queue) != NULL) {
+			sc->sc_inactive.jj_queue = tbp->bio_next;
+			g_destroy_bio(tbp);
+		}
+		g_journal_initialize(sc);
+		g_journal_mark_as_dirty(sc);
+	} else {
+		GJ_DEBUG(0, "Journal %s consistent.", sc->sc_name);
+		g_journal_copy_start(sc);
+	}
+}
+
+/*
+ * Wait for requests.
+ * If we have requests in the current queue, flush them 3 seconds after the
+ * last flush. This way we don't wait forever (or for a journal switch) before
+ * storing partially filled records in the journal.
+ */
+static void
+g_journal_wait(struct g_journal_softc *sc, time_t last_write)
+{
+	int error, timeout;
+
+	GJ_DEBUG(3, "%s: enter", __func__);
+	if (sc->sc_current_count == 0) {
+		if (g_journal_debug < 2)
+			msleep(sc, &sc->sc_mtx, PRIBIO | PDROP, "gj:work", 0);
+		else {
+			/*
+			 * If we have debug turned on, show number of elements
+			 * in various queues.
+			 */
+			for (;;) {
+				error = msleep(sc, &sc->sc_mtx, PRIBIO,
+				    "gj:work", hz * 3);
+				if (error == 0) {
+					mtx_unlock(&sc->sc_mtx);
+					break;
+				}
+				GJ_DEBUG(3, "Report: current count=%d",
+				    sc->sc_current_count);
+				GJ_DEBUG(3, "Report: flush count=%d",
+				    sc->sc_flush_count);
+				GJ_DEBUG(3, "Report: flush in progress=%d",
+				    sc->sc_flush_in_progress);
+				GJ_DEBUG(3, "Report: copy in progress=%d",
+				    sc->sc_copy_in_progress);
+				GJ_DEBUG(3, "Report: delayed=%d",
+				    sc->sc_delayed_count);
+			}
+		}
+		GJ_DEBUG(3, "%s: exit 1", __func__);
+		return;
+	}
+
+	/*
+	 * Flush even partially filled records every 3 seconds.
+	 */
+	timeout = (last_write + 3 - time_second) * hz;
+	if (timeout <= 0) {
+		mtx_unlock(&sc->sc_mtx);
+		g_journal_flush(sc);
+		g_journal_flush_send(sc);
+		GJ_DEBUG(3, "%s: exit 2", __func__);
+		return;
+	}
+	error = msleep(sc, &sc->sc_mtx, PRIBIO | PDROP, "gj:work", timeout);
+	if (error == EWOULDBLOCK)
+		g_journal_flush_send(sc);
+	GJ_DEBUG(3, "%s: exit 3", __func__);
+}
+
+/*
+ * Worker thread.
+ */
+static void
+g_journal_worker(void *arg)
+{
+	struct g_journal_softc *sc;
+	struct g_geom *gp;
+	struct g_provider *pp;
+	struct bio *bp;
+	time_t last_write;
+	int type;
+
+	mtx_lock_spin(&sched_lock);
+	sched_prio(curthread, PRIBIO);
+	mtx_unlock_spin(&sched_lock);
+
+	sc = arg;
+
+	if (sc->sc_flags & GJF_DEVICE_CLEAN) {
+		GJ_DEBUG(0, "Journal %s clean.", sc->sc_name);
+		g_journal_initialize(sc);
+	} else {
+		g_journal_sync(sc);
+	}
+	/*
+	 * Check if we can use BIO_FLUSH.
+	 */
+	sc->sc_bio_flush = 0;
+	if (g_io_flush(sc->sc_jconsumer) == 0) {
+		sc->sc_bio_flush |= GJ_FLUSH_JOURNAL;
+		GJ_DEBUG(1, "BIO_FLUSH supported by %s.",
+		    sc->sc_jconsumer->provider->name);
+	} else {
+		GJ_DEBUG(0, "BIO_FLUSH not supported by %s.",
+		    sc->sc_jconsumer->provider->name);
+	}
+	if (sc->sc_jconsumer != sc->sc_dconsumer) {
+		if (g_io_flush(sc->sc_dconsumer) == 0) {
+			sc->sc_bio_flush |= GJ_FLUSH_DATA;
+			GJ_DEBUG(1, "BIO_FLUSH supported by %s.",
+			    sc->sc_dconsumer->provider->name);
+		} else {
+			GJ_DEBUG(0, "BIO_FLUSH not supported by %s.",
+			    sc->sc_dconsumer->provider->name);
+		}
+	}
+
+	gp = sc->sc_geom;
+	g_topology_lock();
+	pp = g_new_providerf(gp, "%s.journal", sc->sc_name);
+	KASSERT(pp != NULL, ("Cannot create %s.journal.", sc->sc_name));
+	pp->mediasize = sc->sc_mediasize;
+	/*
+	 * There could be a problem when the data and journal providers
+	 * have different sector sizes, but such a scenario is prevented at
+	 * journal creation.
+	 */
+	pp->sectorsize = sc->sc_sectorsize;
+	g_error_provider(pp, 0);
+	g_topology_unlock();
+	last_write = time_second;
+
+	for (;;) {
+		/* Get first request from the queue. */
+		mtx_lock(&sc->sc_mtx);
+		bp = bioq_first(&sc->sc_back_queue);
+		if (bp != NULL)
+			type = (bp->bio_cflags & GJ_BIO_MASK);
+		if (bp == NULL) {
+			bp = bioq_first(&sc->sc_regular_queue);
+			if (bp != NULL)
+				type = GJ_BIO_REGULAR;
+		}
+		if (bp == NULL) {
+try_switch:
+			if ((sc->sc_flags & GJF_DEVICE_SWITCH) ||
+			    (sc->sc_flags & GJF_DEVICE_DESTROY)) {
+				if (sc->sc_current_count > 0) {
+					mtx_unlock(&sc->sc_mtx);
+					g_journal_flush(sc);
+					g_journal_flush_send(sc);
+					continue;
+				}
+				if (sc->sc_flush_in_progress > 0)
+					goto sleep;
+				if (sc->sc_copy_in_progress > 0)
+					goto sleep;
+			}
+			if (sc->sc_flags & GJF_DEVICE_SWITCH) {
+				mtx_unlock(&sc->sc_mtx);
+				g_journal_switch(sc);
+				wakeup(&sc->sc_journal_copying);
+				continue;
+			}
+			if (sc->sc_flags & GJF_DEVICE_DESTROY) {
+				GJ_DEBUG(1, "Shutting down worker "
+				    "thread for %s.", gp->name);
+				sc->sc_worker = NULL;
+				wakeup(&sc->sc_worker);
+				mtx_unlock(&sc->sc_mtx);
+				kthread_exit(0);
+			}
+sleep:
+			g_journal_wait(sc, last_write);
+			continue;
+		}
+		/*
+		 * If we're in the middle of a switch, we need to delay all new
+		 * write requests until it's done.
+		 */
+		if ((sc->sc_flags & GJF_DEVICE_SWITCH) &&
+		    type == GJ_BIO_REGULAR && bp->bio_cmd == BIO_WRITE) {
+			GJ_LOGREQ(2, bp, "WRITE on SWITCH");
+			goto try_switch;
+		}
+		if (type == GJ_BIO_REGULAR)
+			bioq_remove(&sc->sc_regular_queue, bp);
+		else
+			bioq_remove(&sc->sc_back_queue, bp);
+		mtx_unlock(&sc->sc_mtx);
+		switch (type) {
+		case GJ_BIO_REGULAR:
+			/* Regular request. */
+			switch (bp->bio_cmd) {
+			case BIO_READ:
+				g_journal_read(sc, bp, bp->bio_offset,
+				    bp->bio_offset + bp->bio_length);
+				break;
+			case BIO_WRITE:
+				last_write = time_second;
+				g_journal_add_request(sc, bp);
+				g_journal_flush_send(sc);
+				break;
+			default:
+				panic("Invalid bio_cmd (%d).", bp->bio_cmd);
+			}
+			break;
+		case GJ_BIO_COPY:
+			switch (bp->bio_cmd) {
+			case BIO_READ:
+				if (g_journal_copy_read_done(bp))
+					g_journal_copy_send(sc);
+				break;
+			case BIO_WRITE:
+				g_journal_copy_write_done(bp);
+				g_journal_copy_send(sc);
+				break;
+			default:
+				panic("Invalid bio_cmd (%d).", bp->bio_cmd);
+			}
+			break;
+		case GJ_BIO_JOURNAL:
+			g_journal_flush_done(bp);
+			g_journal_flush_send(sc);
+			break;
+		case GJ_BIO_READ:
+		default:
+			panic("Invalid bio (%d).", type);
+		}
+	}
+}
+
+static void
+g_journal_destroy_event(void *arg, int flags __unused)
+{
+	struct g_journal_softc *sc;
+
+	g_topology_assert();
+	sc = arg;
+	g_journal_destroy(sc);
+}
+
+static void
+g_journal_timeout(void *arg)
+{
+	struct g_journal_softc *sc;
+
+	sc = arg;
+	GJ_DEBUG(0, "Timeout. Journal %s cannot be completed.",
+	    sc->sc_geom->name);
+	g_post_event(g_journal_destroy_event, sc, M_NOWAIT, NULL);
+}
+
+static struct g_geom *
+g_journal_create(struct g_class *mp, struct g_provider *pp,
+    const struct g_journal_metadata *md)
+{
+	struct g_journal_softc *sc;
+	struct g_geom *gp;
+	struct g_consumer *cp;
+	int error;
+
+	g_topology_assert();
+	/*
+	 * There are two possibilities:
+	 * 1. Data and both journals are on the same provider.
+	 * 2. Data and journals are all on separate providers.
+	 */
+	/* Look for journal device with the same ID. */
+	LIST_FOREACH(gp, &mp->geom, geom) {
+		sc = gp->softc;
+		if (sc == NULL)
+			continue;
+		if (sc->sc_id == md->md_id)
+			break;
+	}
+	if (gp == NULL)
+		sc = NULL;
+	else if (sc != NULL && (sc->sc_type & md->md_type) != 0) {
+		GJ_DEBUG(1, "Journal device %u already configured.", sc->sc_id);
+		return (NULL);
+	}
+	if (md->md_type == 0 || (md->md_type & ~GJ_TYPE_COMPLETE) != 0) {
+		GJ_DEBUG(0, "Invalid type on %s.", pp->name);
+		return (NULL);
+	}
+	if (md->md_type & GJ_TYPE_DATA) {
+		GJ_DEBUG(0, "Journal %u: %s contains data.", md->md_id,
+		    pp->name);
+	}
+	if (md->md_type & GJ_TYPE_JOURNAL) {
+		GJ_DEBUG(0, "Journal %u: %s contains journal.", md->md_id,
+		    pp->name);
+	}
+
+	if (sc == NULL) {
+		/* Action geom. */
+		sc = malloc(sizeof(*sc), M_JOURNAL, M_WAITOK | M_ZERO);
+		sc->sc_id = md->md_id;
+		sc->sc_type = 0;
+		sc->sc_flags = 0;
+		sc->sc_worker = NULL;
+
+		gp = g_new_geomf(mp, "gjournal %u", sc->sc_id);
+		gp->start = g_journal_start;
+		gp->orphan = g_journal_orphan;
+		gp->access = g_journal_access;
+		gp->softc = sc;
+		sc->sc_geom = gp;
+
+		mtx_init(&sc->sc_mtx, "gjournal", NULL, MTX_DEF);
+
+		bioq_init(&sc->sc_back_queue);
+		bioq_init(&sc->sc_regular_queue);
+		bioq_init(&sc->sc_delayed_queue);
+		sc->sc_delayed_count = 0;
+		sc->sc_current_queue = NULL;
+		sc->sc_current_count = 0;
+		sc->sc_flush_queue = NULL;
+		sc->sc_flush_count = 0;
+		sc->sc_flush_in_progress = 0;
+		sc->sc_copy_queue = NULL;
+		sc->sc_copy_in_progress = 0;
+		sc->sc_inactive.jj_queue = NULL;
+		sc->sc_active.jj_queue = NULL;
+
+		callout_init(&sc->sc_callout, CALLOUT_MPSAFE);
+		if (md->md_type != GJ_TYPE_COMPLETE) {
+			/*
+			 * Journal and data are on separate providers.
+			 * At this point we have only one of them.
+			 * We set up a timeout in case the other part does not
+			 * appear, so we won't wait forever.
+			 */
+			callout_reset(&sc->sc_callout, 5 * hz,
+			    g_journal_timeout, sc);
+		}
+	}
+
+	/* Remember type of the data provider. */
+	if (md->md_type & GJ_TYPE_DATA)
+		sc->sc_orig_type = md->md_type;
+	sc->sc_type |= md->md_type;
+	cp = NULL;
+
+	if (md->md_type & GJ_TYPE_DATA) {
+		if (md->md_flags & GJ_FLAG_CLEAN)
+			sc->sc_flags |= GJF_DEVICE_CLEAN;
+		if (md->md_flags & GJ_FLAG_CHECKSUM)
+			sc->sc_flags |= GJF_DEVICE_CHECKSUM;
+		cp = g_new_consumer(gp);
+		error = g_attach(cp, pp);
+		KASSERT(error == 0, ("Cannot attach to %s (error=%d).",
+		    pp->name, error));
+		error = g_access(cp, 1, 1, 1);
+		if (error != 0) {
+			GJ_DEBUG(0, "Cannot access %s (error=%d).", pp->name,
+			    error);
+			g_journal_destroy(sc);
+			return (NULL);
+		}
+		sc->sc_dconsumer = cp;
+		sc->sc_mediasize = pp->mediasize - pp->sectorsize;
+		sc->sc_sectorsize = pp->sectorsize;
+		sc->sc_jstart = md->md_jstart;
+		sc->sc_jend = md->md_jend;
+		if (md->md_provider[0] != '\0')
+			sc->sc_flags |= GJF_DEVICE_HARDCODED;
+		sc->sc_journal_offset = md->md_joffset;
+		sc->sc_journal_id = md->md_jid;
+		sc->sc_journal_previous_id = md->md_jid;
+	}
+	if (md->md_type & GJ_TYPE_JOURNAL) {
+		if (cp == NULL) {
+			cp = g_new_consumer(gp);
+			error = g_attach(cp, pp);
+			KASSERT(error == 0, ("Cannot attach to %s (error=%d).",
+			    pp->name, error));
+			error = g_access(cp, 1, 1, 1);
+			if (error != 0) {
+				GJ_DEBUG(0, "Cannot access %s (error=%d).",
+				    pp->name, error);
+				g_journal_destroy(sc);
+				return (NULL);
+			}
+		} else {
+			/*
+			 * Journal is on the same provider as data, which means
+			 * that data provider ends where journal starts.
+			 */
+			sc->sc_mediasize = md->md_jstart;
+		}
+		sc->sc_jconsumer = cp;
+	}
+
+	if ((sc->sc_type & GJ_TYPE_COMPLETE) != GJ_TYPE_COMPLETE) {
+		/* Journal is not complete yet. */
+		return (gp);
+	} else {
+		/* Journal complete, cancel timeout. */
+		callout_drain(&sc->sc_callout);
+	}
+
+	error = kthread_create(g_journal_worker, sc, &sc->sc_worker, 0, 0,
+	    "g_journal %s", sc->sc_name);
+	if (error != 0) {
+		GJ_DEBUG(0, "Cannot create worker thread for %s.journal.",
+		    sc->sc_name);
+		g_journal_destroy(sc);
+		return (NULL);
+	}
+
+	return (gp);
+}
+
+static void
+g_journal_destroy_consumer(void *arg, int flags __unused)
+{
+	struct g_consumer *cp;
+
+	g_topology_assert();
+	cp = arg;
+	g_detach(cp);
+	g_destroy_consumer(cp);
+}
+
+static int
+g_journal_destroy(struct g_journal_softc *sc)
+{
+	struct g_geom *gp;
+	struct g_provider *pp;
+	struct g_consumer *cp;
+
+	g_topology_assert();
+
+	if (sc == NULL)
+		return (ENXIO);
+
+	gp = sc->sc_geom;
+	pp = LIST_FIRST(&gp->provider);
+	if (pp != NULL) {
+		if (pp->acr != 0 || pp->acw != 0 || pp->ace != 0) {
+			GJ_DEBUG(1, "Device %s is still open (r%dw%de%d).",
+			    pp->name, pp->acr, pp->acw, pp->ace);
+			return (EBUSY);
+		}
+		g_error_provider(pp, ENXIO);
+
+		g_journal_flush(sc);
+		g_journal_flush_send(sc);
+		g_journal_switch(sc);
+	}
+
+	sc->sc_flags |= (GJF_DEVICE_DESTROY | GJF_DEVICE_CLEAN);
+
+	g_topology_unlock();
+	callout_drain(&sc->sc_callout);
+	mtx_lock(&sc->sc_mtx);
+	wakeup(sc);
+	while (sc->sc_worker != NULL)
+		msleep(&sc->sc_worker, &sc->sc_mtx, PRIBIO, "gj:destroy", 0);
+	mtx_unlock(&sc->sc_mtx);
+
+	if (pp != NULL) {
+		GJ_DEBUG(1, "Marking %s as clean.", sc->sc_name);
+		g_journal_metadata_update(sc);
+		g_topology_lock();
+		pp->flags |= G_PF_WITHER;
+		g_orphan_provider(pp, ENXIO);
+	} else {
+		g_topology_lock();
+	}
+	mtx_destroy(&sc->sc_mtx);
+
+	if (sc->sc_current_count != 0) {
+		GJ_DEBUG(0, "Warning! Number of current requests %d.",
+		    sc->sc_current_count);
+	}
+
+	LIST_FOREACH(cp, &gp->consumer, consumer) {
+		if (cp->acr + cp->acw + cp->ace > 0)
+			g_access(cp, -1, -1, -1);
+		/*
+		 * We keep all consumers open for writing, so if I detach
+		 * and destroy the consumer here, I'll get providers for
+		 * taste, so the journal will be started again.
+		 * Sending an event here prevents this from happening.
+		 */
+		g_post_event(g_journal_destroy_consumer, cp, M_WAITOK, NULL);
+	}
+	gp->softc = NULL;
+	g_wither_geom(gp, ENXIO);
+	free(sc, M_JOURNAL);
+	return (0);
+}
+
+static void
+g_journal_taste_orphan(struct g_consumer *cp)
+{
+
+	KASSERT(1 == 0, ("%s called while tasting %s.", __func__,
+	    cp->provider->name));
+}
+
+static struct g_geom *
+g_journal_taste(struct g_class *mp, struct g_provider *pp, int flags __unused)
+{
+	struct g_journal_metadata md;
+	struct g_consumer *cp;
+	struct g_geom *gp;
+	int error;
+
+	g_topology_assert();
+	g_trace(G_T_TOPOLOGY, "%s(%s, %s)", __func__, mp->name, pp->name);
+	GJ_DEBUG(2, "Tasting %s.", pp->name);
+	if (pp->geom->class == mp)
+		return (NULL);
+
+	gp = g_new_geomf(mp, "journal:taste");
+	/* This orphan function should never be called. */
+	gp->orphan = g_journal_taste_orphan;
+	cp = g_new_consumer(gp);
+	g_attach(cp, pp);
+	error = g_journal_metadata_read(cp, &md);
+	g_detach(cp);
+	g_destroy_consumer(cp);
+	g_destroy_geom(gp);
+	if (error != 0)
+		return (NULL);
+	gp = NULL;
+
+	if (md.md_provider[0] != '\0' && strcmp(md.md_provider, pp->name) != 0)
+		return (NULL);
+	if (md.md_provsize != 0 && md.md_provsize != pp->mediasize)
+		return (NULL);
+	if (g_journal_debug >= 2)
+		journal_metadata_dump(&md);
+
+	gp = g_journal_create(mp, pp, &md);
+	return (gp);
+}
+
+static struct g_journal_softc *
+g_journal_find_device(struct g_class *mp, const char *name)
+{
+	struct g_journal_softc *sc;
+	struct g_geom *gp;
+	struct g_provider *pp;
+
+	if (strncmp(name, "/dev/", 5) == 0)
+		name += 5;
+	LIST_FOREACH(gp, &mp->geom, geom) {
+		sc = gp->softc;
+		if (sc == NULL)
+			continue;
+		if (sc->sc_flags & GJF_DEVICE_DESTROY)
+			continue;
+		if ((sc->sc_type & GJ_TYPE_COMPLETE) != GJ_TYPE_COMPLETE)
+			continue;
+		pp = LIST_FIRST(&gp->provider);
+		if (strcmp(sc->sc_name, name) == 0)
+			return (sc);
+		if (pp != NULL && strcmp(pp->name, name) == 0)
+			return (sc);
+	}
+	return (NULL);
+}
+
+static void
+g_journal_ctl_destroy(struct gctl_req *req, struct g_class *mp)
+{
+	struct g_journal_softc *sc;
+	const char *name;
+	char param[16];
+	int *nargs;
+	int error, i;
+
+	g_topology_assert();
+
+	nargs = gctl_get_paraml(req, "nargs", sizeof(*nargs));
+	if (nargs == NULL) {
+		gctl_error(req, "No '%s' argument.", "nargs");
+		return;
+	}
+	if (*nargs <= 0) {
+		gctl_error(req, "Missing device(s).");
+		return;
+	}
+
+	for (i = 0; i < *nargs; i++) {
+		snprintf(param, sizeof(param), "arg%d", i);
+		name = gctl_get_asciiparam(req, param);
+		if (name == NULL) {
+			gctl_error(req, "No 'arg%d' argument.", i);
+			return;
+		}
+		sc = g_journal_find_device(mp, name);
+		if (sc == NULL) {
+			gctl_error(req, "No such device: %s.", name);
+			return;
+		}
+		error = g_journal_destroy(sc);
+		if (error != 0) {
+			gctl_error(req, "Cannot destroy device %s (error=%d).",
+			    LIST_FIRST(&sc->sc_geom->provider)->name, error);
+			return;
+		}
+	}
+}
+
+static void
+g_journal_ctl_sync(struct gctl_req *req __unused, struct g_class *mp __unused)
+{
+
+	g_topology_assert();
+	g_topology_unlock();
+	g_journal_sync_requested++;
+	wakeup(&g_journal_switcher_state);
+	while (g_journal_sync_requested > 0)
+		tsleep(&g_journal_sync_requested, PRIBIO, "j:sreq", hz / 2);
+	g_topology_lock();
+}
+
+static void
+g_journal_config(struct gctl_req *req, struct g_class *mp, const char *verb)
+{
+	uint32_t *version;
+
+	g_topology_assert();
+
+	version = gctl_get_paraml(req, "version", sizeof(*version));
+	if (version == NULL) {
+		gctl_error(req, "No '%s' argument.", "version");
+		return;
+	}
+	if (*version != G_JOURNAL_VERSION) {
+		gctl_error(req, "Userland and kernel parts are out of sync.");
+		return;
+	}
+
+	if (strcmp(verb, "destroy") == 0 || strcmp(verb, "stop") == 0) {
+		g_journal_ctl_destroy(req, mp);
+		return;
+	} else if (strcmp(verb, "sync") == 0) {
+		g_journal_ctl_sync(req, mp);
+		return;
+	}
+
+	gctl_error(req, "Unknown verb.");
+}
+
+static void
+g_journal_dumpconf(struct sbuf *sb, const char *indent, struct g_geom *gp,
+    struct g_consumer *cp, struct g_provider *pp)
+{
+	struct g_journal_softc *sc;
+
+	g_topology_assert();
+
+	sc = gp->softc;
+	if (sc == NULL)
+		return;
+	if (pp != NULL) {
+		/* Nothing here. */
+	} else if (cp != NULL) {
+		int first = 1;
+
+		sbuf_printf(sb, "%s<Role>", indent);
+		if (cp == sc->sc_dconsumer) {
+			sbuf_printf(sb, "Data");
+			first = 0;
+		}
+		if (cp == sc->sc_jconsumer) {
+			if (!first)
+				sbuf_printf(sb, ",");
+			sbuf_printf(sb, "Journal");
+		}
+		sbuf_printf(sb, "</Role>\n");
+		if (cp == sc->sc_jconsumer) {
+			sbuf_printf(sb, "<Jstart>%jd</Jstart>",
+			    (intmax_t)sc->sc_jstart);
+			sbuf_printf(sb, "<Jend>%jd</Jend>",
+			    (intmax_t)sc->sc_jend);
+		}
+	} else {
+		sbuf_printf(sb, "%s<ID>%u</ID>\n", indent, (u_int)sc->sc_id);
+	}
+}
+
+static eventhandler_tag g_journal_event_shutdown = NULL;
+static eventhandler_tag g_journal_event_lowmem = NULL;
+
+static void
+g_journal_shutdown(void *arg, int howto __unused)
+{
+	struct g_class *mp;
+	struct g_geom *gp, *gp2;
+
+	if (panicstr != NULL)
+		return;
+	mp = arg;
+	DROP_GIANT();
+	g_topology_lock();
+	LIST_FOREACH_SAFE(gp, &mp->geom, geom, gp2) {
+		if (gp->softc == NULL)
+			continue;
+		GJ_DEBUG(0, "Shutting down geom %s.", gp->name);
+		g_journal_destroy(gp->softc);
+	}
+	g_topology_unlock();
+	PICKUP_GIANT();
+}
+
+/*
+ * Free cached requests from inactive queue in case of low memory.
+ * We free GJ_FREE_AT_ONCE elements at once.
+ */
+#define	GJ_FREE_AT_ONCE	4
+static void
+g_journal_lowmem(void *arg, int howto __unused)
+{
+	struct g_journal_softc *sc;
+	struct g_class *mp;
+	struct g_geom *gp;
+	struct bio *bp;
+	u_int nfree = GJ_FREE_AT_ONCE;
+
+	g_journal_stats_low_mem++;
+	mp = arg;
+	DROP_GIANT();
+	g_topology_lock();
+	LIST_FOREACH(gp, &mp->geom, geom) {
+		sc = gp->softc;
+		if (sc == NULL || (sc->sc_flags & GJF_DEVICE_DESTROY))
+			continue;
+		mtx_lock(&sc->sc_mtx);
+		for (bp = sc->sc_inactive.jj_queue; nfree > 0 && bp != NULL;
+		    nfree--, bp = bp->bio_next) {
+			/*
+			 * It is safe to free the bio_data, because:
+			 * 1. If bio_data is NULL it will be read from the
+			 *    inactive journal.
+			 * 2. If bp is sent down, it is first removed from the
+			 *    inactive queue, so it's impossible to free the
+			 *    data from under an in-flight bio.
+			 * On the other hand, freeing elements from the active
+			 * queue is not safe.
+			 */
+			if (bp->bio_data != NULL) {
+				GJ_DEBUG(2, "Freeing data from %s.",
+				    sc->sc_name);
+				gj_free(bp->bio_data, bp->bio_length);
+				bp->bio_data = NULL;
+			}
+		}
+		mtx_unlock(&sc->sc_mtx);
+		if (nfree == 0)
+			break;
+	}
+	g_topology_unlock();
+	PICKUP_GIANT();
+}
+
+static void g_journal_switcher(void *arg);
+
+static void
+g_journal_init(struct g_class *mp)
+{
+	int error;
+
+	/* Pick a conservative value if the provided value is unusable. */
+	if (g_journal_cache_divisor <= 0 ||
+	    (vm_kmem_size / g_journal_cache_divisor == 0)) {
+		g_journal_cache_divisor = 5;
+	}
+	if (g_journal_cache_limit > 0) {
+		g_journal_cache_limit = vm_kmem_size / g_journal_cache_divisor;
+		g_journal_cache_low =
+		    (g_journal_cache_limit / 100) * g_journal_cache_switch;
+	}
+	g_journal_event_shutdown = EVENTHANDLER_REGISTER(shutdown_post_sync,
+	    g_journal_shutdown, mp, EVENTHANDLER_PRI_FIRST);
+	if (g_journal_event_shutdown == NULL)
+		GJ_DEBUG(0, "Warning! Cannot register shutdown event.");
+	g_journal_event_lowmem = EVENTHANDLER_REGISTER(vm_lowmem,
+	    g_journal_lowmem, mp, EVENTHANDLER_PRI_FIRST);
+	if (g_journal_event_lowmem == NULL)
+		GJ_DEBUG(0, "Warning! Cannot register lowmem event.");
+	error = kthread_create(g_journal_switcher, mp, NULL, 0, 0,
+	    "g_journal switcher");
+	KASSERT(error == 0, ("Cannot create switcher thread."));
+}
+
+static void
+g_journal_fini(struct g_class *mp)
+{
+
+	if (g_journal_event_shutdown != NULL) {
+		EVENTHANDLER_DEREGISTER(shutdown_post_sync,
+		    g_journal_event_shutdown);
+	}
+	if (g_journal_event_lowmem != NULL)
+		EVENTHANDLER_DEREGISTER(vm_lowmem, g_journal_event_lowmem);
+	g_journal_switcher_state = GJ_SWITCHER_DIE;
+	wakeup(&g_journal_switcher_state);
+	while (g_journal_switcher_state != GJ_SWITCHER_DIED)
+		tsleep(&g_journal_switcher_state, PRIBIO, "jfini:wait", hz / 5);
+	GJ_DEBUG(1, "Switcher died.");
+}
+
+DECLARE_GEOM_CLASS(g_journal_class, g_journal);
+
+static const struct g_journal_desc *
+g_journal_find_desc(const char *fstype)
+{
+	const struct g_journal_desc *desc;
+	int i;
+
+	for (desc = g_journal_filesystems[i = 0]; desc != NULL;
+	     desc = g_journal_filesystems[++i]) {
+		if (strcmp(desc->jd_fstype, fstype) == 0)
+			break;
+	}
+	return (desc);
+}
+
+static void
+g_journal_switch_wait(struct g_journal_softc *sc)
+{
+	struct bintime bt;
+
+	mtx_assert(&sc->sc_mtx, MA_OWNED);
+	if (g_journal_debug >= 2) {
+		if (sc->sc_flush_in_progress > 0) {
+			GJ_DEBUG(2, "%d requests flushing.",
+			    sc->sc_flush_in_progress);
+		}
+		if (sc->sc_copy_in_progress > 0) {
+			GJ_DEBUG(2, "%d requests copying.",
+			    sc->sc_copy_in_progress);
+		}
+		if (sc->sc_flush_count > 0) {
+			GJ_DEBUG(2, "%d requests to flush.",
+			    sc->sc_flush_count);
+		}
+		if (sc->sc_delayed_count > 0) {
+			GJ_DEBUG(2, "%d requests delayed.",
+			    sc->sc_delayed_count);
+		}
+	}
+	g_journal_stats_switches++;
+	if (sc->sc_copy_in_progress > 0)
+		g_journal_stats_wait_for_copy++;
+	GJ_TIMER_START(1, &bt);
+	sc->sc_flags &= ~GJF_DEVICE_BEFORE_SWITCH;
+	sc->sc_flags |= GJF_DEVICE_SWITCH;
+	wakeup(sc);
+	while (sc->sc_flags & GJF_DEVICE_SWITCH) {
+		msleep(&sc->sc_journal_copying, &sc->sc_mtx, PRIBIO,
+		    "gj:switch", 0);
+	}
+	GJ_TIMER_STOP(1, &bt, "Switch time of %s", sc->sc_name);
+}
+
+static void
+g_journal_do_switch(struct g_class *classp, struct thread *td)
+{
+	struct g_journal_softc *sc;
+	const struct g_journal_desc *desc;
+	struct g_geom *gp;
+	struct mount *mp;
+	struct bintime bt;
+	char *mountpoint;
+	int asyncflag, error, vfslocked;
+
+	DROP_GIANT();
+	g_topology_lock();
+	LIST_FOREACH(gp, &classp->geom, geom) {
+		sc = gp->softc;
+		if (sc == NULL)
+			continue;
+		if (sc->sc_flags & GJF_DEVICE_DESTROY)
+			continue;
+		if ((sc->sc_type & GJ_TYPE_COMPLETE) != GJ_TYPE_COMPLETE)
+			continue;
+		mtx_lock(&sc->sc_mtx);
+		sc->sc_flags |= GJF_DEVICE_BEFORE_SWITCH;
+		mtx_unlock(&sc->sc_mtx);
+	}
+	g_topology_unlock();
+	PICKUP_GIANT();
+
+	mtx_lock(&mountlist_mtx);
+	TAILQ_FOREACH(mp, &mountlist, mnt_list) {
+		if (mp->mnt_gjprovider == NULL)
+			continue;
+		if (mp->mnt_flag & MNT_RDONLY)
+			continue;
+		desc = g_journal_find_desc(mp->mnt_stat.f_fstypename);
+		if (desc == NULL)
+			continue;
+		if (vfs_busy(mp, LK_NOWAIT, &mountlist_mtx, td))
+			continue;
+		/* mtx_unlock(&mountlist_mtx) was done inside vfs_busy() */
+
+		DROP_GIANT();
+		g_topology_lock();
+		sc = g_journal_find_device(classp, mp->mnt_gjprovider);
+		g_topology_unlock();
+		PICKUP_GIANT();
+
+		if (sc == NULL) {
+			GJ_DEBUG(0, "Cannot find journal geom for %s.",
+			    mp->mnt_gjprovider);
+			goto next;
+		} else if (JEMPTY(sc)) {
+			mtx_lock(&sc->sc_mtx);
+			sc->sc_flags &= ~GJF_DEVICE_BEFORE_SWITCH;
+			mtx_unlock(&sc->sc_mtx);
+			GJ_DEBUG(3, "No need for %s switch.", sc->sc_name);
+			goto next;
+		}
+
+		mountpoint = mp->mnt_stat.f_mntonname;
+
+		vfslocked = VFS_LOCK_GIANT(mp);
+
+		error = vn_start_write(NULL, &mp, V_WAIT);
+		if (error != 0) {
+			VFS_UNLOCK_GIANT(vfslocked);
+			GJ_DEBUG(0, "vn_start_write(%s) failed (error=%d).",
+			    mountpoint, error);
+			goto next;
+		}
+		asyncflag = mp->mnt_flag & MNT_ASYNC;
+		mp->mnt_flag &= ~MNT_ASYNC;
+
+		GJ_TIMER_START(1, &bt);
+		vfs_msync(mp, MNT_NOWAIT);
+		GJ_TIMER_STOP(1, &bt, "Msync time of %s", mountpoint);
+
+		GJ_TIMER_START(1, &bt);
+		error = VFS_SYNC(mp, MNT_NOWAIT, curthread);
+		if (error == 0)
+			GJ_TIMER_STOP(1, &bt, "Sync time of %s", mountpoint);
+		else {
+			GJ_DEBUG(0, "Cannot sync file system %s (error=%d).",
+			    mountpoint, error);
+		}
+		mp->mnt_flag |= asyncflag;
+
+		vn_finished_write(mp);
+
+		if (error != 0) {
+			VFS_UNLOCK_GIANT(vfslocked);
+			goto next;
+		}
+
+		GJ_TIMER_START(1, &bt);
+		error = vfs_write_suspend(mp);
+		VFS_UNLOCK_GIANT(vfslocked);
+		GJ_TIMER_STOP(1, &bt, "Suspend time of %s", mountpoint);
+		if (error != 0) {
+			GJ_DEBUG(0, "Cannot suspend file system %s (error=%d).",
+			    mountpoint, error);
+			goto next;
+		}
+
+		error = desc->jd_clean(mp);
+		if (error != 0)
+			goto next;
+
+		mtx_lock(&sc->sc_mtx);
+		g_journal_switch_wait(sc);
+		mtx_unlock(&sc->sc_mtx);
+
+		vfs_write_resume(mp);
+next:
+		mtx_lock(&mountlist_mtx);
+		vfs_unbusy(mp, td);
+	}
+	mtx_unlock(&mountlist_mtx);
+
+	sc = NULL;
+	for (;;) {
+		DROP_GIANT();
+		g_topology_lock();
+		LIST_FOREACH(gp, &g_journal_class.geom, geom) {
+			sc = gp->softc;
+			if (sc == NULL)
+				continue;
+			mtx_lock(&sc->sc_mtx);
+			if ((sc->sc_type & GJ_TYPE_COMPLETE) == GJ_TYPE_COMPLETE &&
+			    !(sc->sc_flags & GJF_DEVICE_DESTROY) &&
+			    (sc->sc_flags & GJF_DEVICE_BEFORE_SWITCH)) {
+				break;
+			}
+			mtx_unlock(&sc->sc_mtx);
+			sc = NULL;
+		}
+		g_topology_unlock();
+		PICKUP_GIANT();
+		if (sc == NULL)
+			break;
+		mtx_assert(&sc->sc_mtx, MA_OWNED);
+		g_journal_switch_wait(sc);
+		mtx_unlock(&sc->sc_mtx);
+	}
+}
+
+/*
+ * TODO: Switcher thread should be started on first geom creation and killed on
+ * last geom destruction.
+ */
+static void
+g_journal_switcher(void *arg)
+{
+	struct thread *td = curthread;
+	struct g_class *mp;
+	struct bintime bt;
+	int error;
+
+	mp = arg;
+	for (;;) {
+		g_journal_switcher_wokenup = 0;
+		error = tsleep(&g_journal_switcher_state, PRIBIO, "jsw:wait",
+		    g_journal_switch_time * hz);
+		if (g_journal_switcher_state == GJ_SWITCHER_DIE) {
+			g_journal_switcher_state = GJ_SWITCHER_DIED;
+			GJ_DEBUG(1, "Switcher exiting.");
+			wakeup(&g_journal_switcher_state);
+			kthread_exit(0);
+		}
+		if (error == 0 && g_journal_sync_requested == 0) {
+			GJ_DEBUG(1, "Out of cache, force switch (used=%u "
+			    "limit=%u).", g_journal_cache_used,
+			    g_journal_cache_limit);
+		}
+		GJ_TIMER_START(1, &bt);
+		g_journal_do_switch(mp, td);
+		GJ_TIMER_STOP(1, &bt, "Entire switch time");
+		if (g_journal_sync_requested > 0) {
+			g_journal_sync_requested = 0;
+			wakeup(&g_journal_sync_requested);
+		}
+	}
+}
--- /dev/null	Tue Oct 24 16:34:10 2006
+++ sys/geom/journal/g_journal.h	Tue Oct 24 16:34:16 2006
@@ -0,0 +1,379 @@
+/*-
+ * Copyright (c) 2005-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ *
+ * $FreeBSD$
+ */
+
+#ifndef	_G_JOURNAL_H_
+#define	_G_JOURNAL_H_
+
+#include <sys/endian.h>
+#include <sys/md5.h>
+#ifdef _KERNEL
+#include <sys/bio.h>
+#endif
+
+#define	G_JOURNAL_CLASS_NAME	"JOURNAL"
+
+#define	G_JOURNAL_MAGIC		"GEOM::JOURNAL"
+/*
+ * Version history:
+ * 0 - Initial version number.
+ */
+#define	G_JOURNAL_VERSION	0
+
+#ifdef _KERNEL
+extern int g_journal_debug;
+
+#define	GJ_DEBUG(lvl, ...)	do {					\
+	if (g_journal_debug >= (lvl)) {					\
+		printf("GEOM_JOURNAL");					\
+		if (g_journal_debug > 0)				\
+			printf("[%u]", lvl);				\
+		printf(": ");						\
+		printf(__VA_ARGS__);					\
+		printf("\n");						\
+	}								\
+} while (0)
+#define	GJ_LOGREQ(lvl, bp, ...)	do {					\
+	if (g_journal_debug >= (lvl)) {					\
+		printf("GEOM_JOURNAL");					\
+		if (g_journal_debug > 0)				\
+			printf("[%u]", lvl);				\
+		printf(": ");						\
+		printf(__VA_ARGS__);					\
+		printf(" ");						\
+		g_print_bio(bp);					\
+		printf("\n");						\
+	}								\
+} while (0)
+
+#define	JEMPTY(sc)	((sc)->sc_journal_offset -			\
+			 (sc)->sc_jprovider->sectorsize ==		\
+			 (sc)->sc_active.jj_offset &&			\
+			 (sc)->sc_current_count == 0)
+
+#define	GJ_BIO_REGULAR		0x00
+#define	GJ_BIO_READ		0x01
+#define	GJ_BIO_JOURNAL		0x02
+#define	GJ_BIO_COPY		0x03
+#define	GJ_BIO_MASK		0x0f
+
+#if 0
+#define	GJF_BIO_DONT_FREE	0x10
+#define	GJF_BIO_MASK		0xf0
+#endif
+
+#define	GJF_DEVICE_HARDCODED		0x0001
+#define	GJF_DEVICE_DESTROY		0x0010
+#define	GJF_DEVICE_SWITCH		0x0020
+#define	GJF_DEVICE_BEFORE_SWITCH	0x0040
+#define	GJF_DEVICE_CLEAN		0x0080
+#define	GJF_DEVICE_CHECKSUM		0x0100
+
+#define	GJ_HARD_LIMIT		64
+
+/*
+ * We keep pointers to journaled data in bio structure and because we
+ * need to store two off_t values (offset in data provider and offset in
+ * journal), we have to borrow bio_completed field for this.
+ */
+#define	bio_joffset	bio_completed
+/*
+ * Use bio_caller1 field as a pointer in queue.
+ */
+#define	bio_next	bio_caller1
+
+/*
+ * There are two such structures maintained inside each journaled device.
+ * One describes active part of the journal, where recent requests are stored.
+ * The second describes the last consistent part of the journal with requests
+ * that are copied to the destination provider.
+ */
+struct g_journal_journal {
+	struct bio	*jj_queue;	/* Cached journal entries. */
+	off_t		 jj_offset;	/* Journal's start offset. */
+};
+
+struct g_journal_softc {
+	uint32_t	 sc_id;
+	uint8_t		 sc_type;
+	uint8_t		 sc_orig_type;
+	struct g_geom	*sc_geom;
+	u_int		 sc_flags;
+	struct mtx	 sc_mtx;
+	off_t		 sc_mediasize;
+	u_int		 sc_sectorsize;
+#define	GJ_FLUSH_DATA		0x01
+#define	GJ_FLUSH_JOURNAL	0x02
+	u_int		 sc_bio_flush;
+
+	uint32_t	 sc_journal_id;
+	uint32_t	 sc_journal_next_id;
+	int		 sc_journal_copying;
+	off_t		 sc_journal_offset;
+	off_t		 sc_journal_previous_id;
+
+	struct bio_queue_head sc_back_queue;
+	struct bio_queue_head sc_regular_queue;
+
+	struct bio_queue_head sc_delayed_queue;
+	int		 sc_delayed_count;
+
+	struct bio	*sc_current_queue;
+	int		 sc_current_count;
+
+	struct bio	*sc_flush_queue;
+	int		 sc_flush_count;
+	int		 sc_flush_in_progress;
+
+	struct bio	*sc_copy_queue;
+	int		 sc_copy_in_progress;
+
+	struct g_consumer *sc_dconsumer;
+	struct g_consumer *sc_jconsumer;
+
+	struct g_journal_journal sc_inactive;
+	struct g_journal_journal sc_active;
+
+	off_t		 sc_jstart;	/* Journal space start offset. */
+	off_t		 sc_jend;	/* Journal space end offset. */
+
+	struct callout	 sc_callout;
+	struct proc	*sc_worker;
+};
+#define	sc_dprovider	sc_dconsumer->provider
+#define	sc_jprovider	sc_jconsumer->provider
+#define	sc_name		sc_dprovider->name
+
+#define	GJQ_INSERT_HEAD(head, bp)	do {				\
+	(bp)->bio_next = (head);					\
+	(head) = (bp);							\
+} while (0)
+#define	GJQ_INSERT_AFTER(head, bp, pbp)	do {				\
+	if ((pbp) == NULL)						\
+		GJQ_INSERT_HEAD(head, bp);				\
+	else {								\
+		(bp)->bio_next = (pbp)->bio_next;			\
+		(pbp)->bio_next = (bp);					\
+	}								\
+} while (0)
+#define	GJQ_FIRST(head)	(head)
+#define	GJQ_REMOVE(head, bp)	do {					\
+	struct bio *_bp;						\
+									\
+	if ((head) == (bp)) {						\
+		(head) = (bp)->bio_next;				\
+		(bp)->bio_next = NULL;					\
+		break;							\
+	}								\
+	for (_bp = (head); _bp->bio_next != NULL; _bp = _bp->bio_next) {\
+		if (_bp->bio_next == (bp))				\
+			break;						\
+	}								\
+	KASSERT(_bp->bio_next != NULL, ("NULL bio_next"));		\
+	KASSERT(_bp->bio_next == (bp), ("bio_next != bp"));		\
+	_bp->bio_next = (bp)->bio_next;					\
+	(bp)->bio_next = NULL;						\
+} while (0)
+#define GJQ_FOREACH(head, bp)						\
+	for ((bp) = (head); (bp) != NULL; (bp) = (bp)->bio_next)
+
+#define	GJ_HEADER_MAGIC	"GJHDR"
+
+struct g_journal_header {
+	char		jh_magic[sizeof(GJ_HEADER_MAGIC)];
+	uint32_t	jh_journal_id;
+	uint32_t	jh_journal_next_id;
+} __packed;
+
+struct g_journal_entry {
+	uint64_t	je_joffset;
+	uint64_t	je_offset;
+	uint64_t	je_length;
+} __packed;
+
+#define	GJ_RECORD_HEADER_MAGIC		"GJRHDR"
+#define	GJ_RECORD_HEADER_NENTRIES	(20)
+#define	GJ_RECORD_MAX_SIZE(sc)	\
+	((sc)->sc_jprovider->sectorsize + GJ_RECORD_HEADER_NENTRIES * MAXPHYS)
+#define	GJ_VALIDATE_OFFSET(offset, sc)	do {				\
+	if ((offset) + GJ_RECORD_MAX_SIZE(sc) >= (sc)->sc_jend) {	\
+		(offset) = (sc)->sc_jstart;				\
+		GJ_DEBUG(2, "Starting from the beginning (%s).",	\
+		    (sc)->sc_name);					\
+	}								\
+} while (0)
+
+struct g_journal_record_header {
+	char		jrh_magic[sizeof(GJ_RECORD_HEADER_MAGIC)];
+	uint32_t	jrh_journal_id;
+	uint16_t	jrh_nentries;
+	u_char		jrh_sum[8];
+	struct g_journal_entry jrh_entries[GJ_RECORD_HEADER_NENTRIES];
+} __packed;
+
+typedef int (g_journal_clean_t)(struct mount *mp);
+typedef void (g_journal_dirty_t)(struct g_consumer *cp);
+
+struct g_journal_desc {
+	const char		*jd_fstype;
+	g_journal_clean_t	*jd_clean;
+	g_journal_dirty_t	*jd_dirty;
+};
+
+/* Supported file systems. */
+extern const struct g_journal_desc g_journal_ufs;
+
+#define	GJ_TIMER_START(lvl, bt)	do {					\
+	if (g_journal_debug >= (lvl))					\
+		binuptime(bt);						\
+} while (0)
+#define	GJ_TIMER_STOP(lvl, bt, ...)	do {				\
+	if (g_journal_debug >= (lvl)) {					\
+		struct bintime _bt2;					\
+		struct timeval _tv;					\
+									\
+		binuptime(&_bt2);					\
+		bintime_sub(&_bt2, bt);					\
+		bintime2timeval(&_bt2, &_tv);				\
+		printf("GEOM_JOURNAL");					\
+		if (g_journal_debug > 0)				\
+			printf("[%u]", lvl);				\
+		printf(": ");						\
+		printf(__VA_ARGS__);					\
+		printf(": %jd.%06jds\n", (intmax_t)_tv.tv_sec,		\
+		    (intmax_t)_tv.tv_usec);				\
+	}								\
+} while (0)
+#endif	/* _KERNEL */
+
+#define	GJ_TYPE_DATA		0x01
+#define	GJ_TYPE_JOURNAL		0x02
+#define	GJ_TYPE_COMPLETE	(GJ_TYPE_DATA|GJ_TYPE_JOURNAL)
+
+#define	GJ_FLAG_CLEAN		0x01
+#define	GJ_FLAG_CHECKSUM	0x02
+
+struct g_journal_metadata {
+	char		md_magic[16];	/* Magic value. */
+	uint32_t	md_version;	/* Version number. */
+	uint32_t	md_id;		/* Journal unique ID. */
+	uint8_t		md_type;	/* Provider type. */
+	uint64_t	md_jstart;	/* Journal space start offset. */
+	uint64_t	md_jend;	/* Journal space end offset. */
+	uint64_t	md_joffset;	/* Last known consistent journal offset. */
+	uint32_t	md_jid;		/* Last known consistent journal ID. */
+	uint64_t	md_flags;	/* Journal flags. */
+	char		md_provider[16]; /* Hardcoded provider. */
+	uint64_t	md_provsize;	/* Provider's size. */
+	u_char		md_hash[16];	/* MD5 hash. */
+};
+static __inline void
+journal_metadata_encode(struct g_journal_metadata *md, u_char *data)
+{
+	MD5_CTX ctx;
+
+	bcopy(md->md_magic, data, 16);
+	le32enc(data + 16, md->md_version);
+	le32enc(data + 20, md->md_id);
+	*(data + 24) = md->md_type;
+	le64enc(data + 25, md->md_jstart);
+	le64enc(data + 33, md->md_jend);
+	le64enc(data + 41, md->md_joffset);
+	le32enc(data + 49, md->md_jid);
+	le64enc(data + 53, md->md_flags);
+	bcopy(md->md_provider, data + 61, 16);
+	le64enc(data + 77, md->md_provsize);
+	MD5Init(&ctx);
+	MD5Update(&ctx, data, 85);
+	MD5Final(md->md_hash, &ctx);
+	bcopy(md->md_hash, data + 85, 16);
+}
+static __inline int
+journal_metadata_decode_v0(const u_char *data, struct g_journal_metadata *md)
+{
+	MD5_CTX ctx;
+
+	md->md_id = le32dec(data + 20);
+	md->md_type = *(data + 24);
+	md->md_jstart = le64dec(data + 25);
+	md->md_jend = le64dec(data + 33);
+	md->md_joffset = le64dec(data + 41);
+	md->md_jid = le32dec(data + 49);
+	md->md_flags = le64dec(data + 53);
+	bcopy(data + 61, md->md_provider, 16);
+	md->md_provsize = le64dec(data + 77);
+	MD5Init(&ctx);
+	MD5Update(&ctx, data, 85);
+	MD5Final(md->md_hash, &ctx);
+	if (bcmp(md->md_hash, data + 85, 16) != 0)
+		return (EINVAL);
+	return (0);
+}
+static __inline int
+journal_metadata_decode(const u_char *data, struct g_journal_metadata *md)
+{
+	int error;
+
+	bcopy(data, md->md_magic, 16);
+	md->md_version = le32dec(data + 16);
+	switch (md->md_version) {
+	case 0:
+		error = journal_metadata_decode_v0(data, md);
+		break;
+	default:
+		error = EINVAL;
+		break;
+	}
+	return (error);
+}
+
+static __inline void
+journal_metadata_dump(const struct g_journal_metadata *md)
+{
+	static const char hex[] = "0123456789abcdef";
+	char hash[16 * 2 + 1];
+	u_int i;
+
+	printf("     magic: %s\n", md->md_magic);
+	printf("   version: %u\n", (u_int)md->md_version);
+	printf("        id: %u\n", (u_int)md->md_id);
+	printf("      type: %u\n", (u_int)md->md_type);
+	printf("     start: %ju\n", (uintmax_t)md->md_jstart);
+	printf("       end: %ju\n", (uintmax_t)md->md_jend);
+	printf("   joffset: %ju\n", (uintmax_t)md->md_joffset);
+	printf("       jid: %u\n", (u_int)md->md_jid);
+	printf("     flags: %ju\n", (uintmax_t)md->md_flags);
+	printf("hcprovider: %s\n", md->md_provider);
+	printf("  provsize: %ju\n", (uintmax_t)md->md_provsize);
+	bzero(hash, sizeof(hash));
+	for (i = 0; i < 16; i++) {
+		hash[i * 2] = hex[md->md_hash[i] >> 4];
+		hash[i * 2 + 1] = hex[md->md_hash[i] & 0x0f];
+	}
+	printf("  MD5 hash: %s\n", hash);
+}
+#endif	/* !_G_JOURNAL_H_ */
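
The metadata layout used by journal_metadata_encode() above can be restated as a table of byte offsets. The following is only a userland sketch under stated assumptions: the offsets are copied from the encode function, but put_le32(), put_le64(), get_le32(), get_le64() and sketch_md_fill() are hypothetical names introduced here as portable stand-ins for le32enc() and friends from <sys/endian.h>, only a few fields are filled in, and the MD5 hash at offset 85 is left zeroed rather than computed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Byte offsets copied from journal_metadata_encode() in the patch. */
enum {
	MD_OFF_MAGIC	= 0,	/* char[16] */
	MD_OFF_VERSION	= 16,	/* uint32_t */
	MD_OFF_ID	= 20,	/* uint32_t */
	MD_OFF_TYPE	= 24,	/* uint8_t */
	MD_OFF_JSTART	= 25,	/* uint64_t */
	MD_OFF_JEND	= 33,	/* uint64_t */
	MD_OFF_JOFFSET	= 41,	/* uint64_t */
	MD_OFF_JID	= 49,	/* uint32_t */
	MD_OFF_FLAGS	= 53,	/* uint64_t */
	MD_OFF_PROVIDER	= 61,	/* char[16] */
	MD_OFF_PROVSIZE	= 77,	/* uint64_t */
	MD_OFF_HASH	= 85,	/* MD5 over bytes 0..84 */
	MD_ENCODED_SIZE	= 101
};

/* Hypothetical portable stand-ins for le32enc()/le64enc(). */
static void
put_le32(unsigned char *p, uint32_t v)
{
	int i;

	for (i = 0; i < 4; i++)
		p[i] = (unsigned char)(v >> (8 * i));
}

static void
put_le64(unsigned char *p, uint64_t v)
{
	int i;

	for (i = 0; i < 8; i++)
		p[i] = (unsigned char)(v >> (8 * i));
}

/* Hypothetical portable stand-ins for le32dec()/le64dec(). */
static uint32_t
get_le32(const unsigned char *p)
{
	uint32_t v = 0;
	int i;

	for (i = 3; i >= 0; i--)
		v = (v << 8) | p[i];
	return (v);
}

static uint64_t
get_le64(const unsigned char *p)
{
	uint64_t v = 0;
	int i;

	for (i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return (v);
}

/*
 * Fill a version 0 metadata buffer with a subset of the fields.
 * The remaining fields and the hash stay zero in this sketch.
 */
static void
sketch_md_fill(unsigned char *buf, uint32_t id, uint8_t type,
    uint64_t jstart, uint64_t jend)
{
	static const char magic[] = "GEOM::JOURNAL";

	assert(jstart <= jend);
	memset(buf, 0, MD_ENCODED_SIZE);
	memcpy(buf + MD_OFF_MAGIC, magic, sizeof(magic));
	put_le32(buf + MD_OFF_VERSION, 0);
	put_le32(buf + MD_OFF_ID, id);
	buf[MD_OFF_TYPE] = type;
	put_le64(buf + MD_OFF_JSTART, jstart);
	put_le64(buf + MD_OFF_JEND, jend);
}
```

Because every multi-byte field is written byte by byte in little-endian order, the encoded buffer does not depend on host byte order or on struct padding, which appears to be why the header encodes fields individually instead of copying the struct.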
--- /dev/null	Tue Oct 24 16:34:10 2006
+++ sys/geom/journal/g_journal_ufs.c	Tue Oct 24 16:34:19 2006
@@ -0,0 +1,107 @@
+/*-
+ * Copyright (c) 2005-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include <sys/cdefs.h>
+__FBSDID("$FreeBSD$");
+
+#include <sys/param.h>
+#include <sys/systm.h>
+#include <sys/vnode.h>
+#include <sys/mount.h>
+
+#include <ufs/ufs/extattr.h>
+#include <ufs/ufs/quota.h>
+#include <ufs/ufs/inode.h>
+#include <ufs/ufs/ufs_extern.h>
+#include <ufs/ufs/ufsmount.h>
+
+#include <ufs/ffs/fs.h>
+#include <ufs/ffs/ffs_extern.h>
+
+#include <geom/geom.h>
+#include <geom/journal/g_journal.h>
+
+static const int superblocks[] = SBLOCKSEARCH;
+
+static int
+g_journal_ufs_clean(struct mount *mp)
+{
+	struct ufsmount *ump;
+	struct fs *fs;
+	int flags;
+
+	ump = VFSTOUFS(mp);
+	fs = ump->um_fs;
+
+	flags = fs->fs_flags;
+	fs->fs_flags &= ~(FS_UNCLEAN | FS_NEEDSFSCK);
+	ffs_sbupdate(ump, MNT_WAIT, 1);
+	fs->fs_flags = flags;
+
+	return (0);
+}
+
+static void
+g_journal_ufs_dirty(struct g_consumer *cp)
+{
+	struct fs *fs;
+	int error, i, sb;
+
+	if (SBLOCKSIZE % cp->provider->sectorsize != 0)
+		return;
+	for (i = 0; (sb = superblocks[i]) != -1; i++) {
+		if (sb % cp->provider->sectorsize != 0)
+			continue;
+		fs = g_read_data(cp, sb, SBLOCKSIZE, NULL);
+		if (fs == NULL)
+			continue;
+		if (fs->fs_magic != FS_UFS1_MAGIC &&
+		    fs->fs_magic != FS_UFS2_MAGIC) {
+			g_free(fs);
+			continue;
+		}
+		GJ_DEBUG(0, "clean=%d flags=0x%x", fs->fs_clean, fs->fs_flags);
+		fs->fs_clean = 0;
+		fs->fs_flags |= FS_NEEDSFSCK | FS_UNCLEAN;
+		error = g_write_data(cp, sb, fs, SBLOCKSIZE);
+		g_free(fs);
+		if (error != 0) {
+			GJ_DEBUG(0, "Cannot mark file system %s as dirty "
+			    "(error=%d).", cp->provider->name, error);
+		} else {
+			GJ_DEBUG(0, "File system %s marked as dirty.",
+			    cp->provider->name);
+		}
+	}
+}
+
+const struct g_journal_desc g_journal_ufs = {
+	.jd_fstype = "ufs",
+	.jd_clean = g_journal_ufs_clean,
+	.jd_dirty = g_journal_ufs_dirty
+};
+
+MODULE_DEPEND(g_journal, ufs, 1, 1, 1);
--- sys/geom/mirror/g_mirror.c.orig
+++ sys/geom/mirror/g_mirror.c
@@ -1042,6 +1042,48 @@
 }
 
 static void
+g_mirror_flush(struct g_mirror_softc *sc, struct bio *bp)
+{
+	struct bio_queue_head queue;
+	struct g_mirror_disk *disk;
+	struct g_consumer *cp;
+	struct bio *cbp;
+
+	bioq_init(&queue);
+	LIST_FOREACH(disk, &sc->sc_disks, d_next) {
+		if (disk->d_state != G_MIRROR_DISK_STATE_ACTIVE)
+			continue;
+		cbp = g_clone_bio(bp);
+		if (cbp == NULL) {
+			for (cbp = bioq_first(&queue); cbp != NULL;
+			    cbp = bioq_first(&queue)) {
+				bioq_remove(&queue, cbp);
+				g_destroy_bio(cbp);
+			}
+			if (bp->bio_error == 0)
+				bp->bio_error = ENOMEM;
+			g_io_deliver(bp, bp->bio_error);
+			return;
+		}
+		bioq_insert_tail(&queue, cbp);
+		cbp->bio_done = g_std_done;
+		cbp->bio_caller1 = disk;
+		cbp->bio_to = disk->d_consumer->provider;
+	}
+	for (cbp = bioq_first(&queue); cbp != NULL; cbp = bioq_first(&queue)) {
+		bioq_remove(&queue, cbp);
+		G_MIRROR_LOGREQ(3, cbp, "Sending request.");
+		disk = cbp->bio_caller1;
+		cbp->bio_caller1 = NULL;
+		cp = disk->d_consumer;
+		KASSERT(cp->acr >= 1 && cp->acw >= 1 && cp->ace >= 1,
+		    ("Consumer %s not opened (r%dw%de%d).", cp->provider->name,
+		    cp->acr, cp->acw, cp->ace));
+		g_io_request(cbp, disk->d_consumer);
+	}
+}
+
+static void
 g_mirror_start(struct bio *bp)
 {
 	struct g_mirror_softc *sc;
@@ -1061,6 +1103,9 @@
 	case BIO_WRITE:
 	case BIO_DELETE:
 		break;
+	case BIO_FLUSH:
+		g_mirror_flush(sc, bp);
+		return;
 	case BIO_GETATTR:
 		if (strcmp("GEOM::kerneldump", bp->bio_attribute) == 0) {
 			g_mirror_kernel_dump(bp);
--- sys/geom/raid3/g_raid3.c.orig
+++ sys/geom/raid3/g_raid3.c
@@ -25,7 +25,7 @@
  */
 
 #include <sys/cdefs.h>
-__FBSDID("$FreeBSD: src/sys/geom/raid3/g_raid3.c,v 1.40.2.16 2006/10/21 07:16:41 pjd Exp $");
+__FBSDID("$FreeBSD: src/sys/geom/raid3/g_raid3.c,v 1.40.2.15 2006/09/19 11:16:14 pjd Exp $");
 
 #include <sys/param.h>
 #include <sys/systm.h>
@@ -1370,6 +1370,50 @@
 }
 
 static void
+g_raid3_flush(struct g_raid3_softc *sc, struct bio *bp)
+{
+	struct bio_queue_head queue;
+	struct g_raid3_disk *disk;
+	struct g_consumer *cp;
+	struct bio *cbp;
+	u_int i;
+
+	bioq_init(&queue);
+	for (i = 0; i < sc->sc_ndisks; i++) {
+		disk = &sc->sc_disks[i];
+		if (disk->d_state != G_RAID3_DISK_STATE_ACTIVE)
+			continue;
+		cbp = g_clone_bio(bp);
+		if (cbp == NULL) {
+			for (cbp = bioq_first(&queue); cbp != NULL;
+			    cbp = bioq_first(&queue)) {
+				bioq_remove(&queue, cbp);
+				g_destroy_bio(cbp);
+			}
+			if (bp->bio_error == 0)
+				bp->bio_error = ENOMEM;
+			g_io_deliver(bp, bp->bio_error);
+			return;
+		}
+		bioq_insert_tail(&queue, cbp);
+		cbp->bio_done = g_std_done;
+		cbp->bio_caller1 = disk;
+		cbp->bio_to = disk->d_consumer->provider;
+	}
+	for (cbp = bioq_first(&queue); cbp != NULL; cbp = bioq_first(&queue)) {
+		bioq_remove(&queue, cbp);
+		G_RAID3_LOGREQ(3, cbp, "Sending request.");
+		disk = cbp->bio_caller1;
+		cbp->bio_caller1 = NULL;
+		cp = disk->d_consumer;
+		KASSERT(cp->acr >= 1 && cp->acw >= 1 && cp->ace >= 1,
+		    ("Consumer %s not opened (r%dw%de%d).", cp->provider->name,
+		    cp->acr, cp->acw, cp->ace));
+		g_io_request(cbp, disk->d_consumer);
+	}
+}
+
+static void
 g_raid3_start(struct bio *bp)
 {
 	struct g_raid3_softc *sc;
@@ -1390,6 +1434,9 @@
 	case BIO_WRITE:
 	case BIO_DELETE:
 		break;
+	case BIO_FLUSH:
+		g_raid3_flush(sc, bp);
+		return;
 	case BIO_GETATTR:
 	default:
 		g_io_deliver(bp, EOPNOTSUPP);
--- sys/geom/stripe/g_stripe.c.orig
+++ sys/geom/stripe/g_stripe.c
@@ -520,6 +520,42 @@
 }
 
 static void
+g_stripe_flush(struct g_stripe_softc *sc, struct bio *bp)
+{
+	struct bio_queue_head queue;
+	struct g_consumer *cp;
+	struct bio *cbp;
+	u_int no;
+
+	bioq_init(&queue);
+	for (no = 0; no < sc->sc_ndisks; no++) {
+		cbp = g_clone_bio(bp);
+		if (cbp == NULL) {
+			for (cbp = bioq_first(&queue); cbp != NULL;
+			    cbp = bioq_first(&queue)) {
+				bioq_remove(&queue, cbp);
+				g_destroy_bio(cbp);
+			}
+			if (bp->bio_error == 0)
+				bp->bio_error = ENOMEM;
+			g_io_deliver(bp, bp->bio_error);
+			return;
+		}
+		bioq_insert_tail(&queue, cbp);
+		cbp->bio_done = g_std_done;
+		cbp->bio_caller1 = sc->sc_disks[no];
+		cbp->bio_to = sc->sc_disks[no]->provider;
+	}
+	for (cbp = bioq_first(&queue); cbp != NULL; cbp = bioq_first(&queue)) {
+		bioq_remove(&queue, cbp);
+		G_STRIPE_LOGREQ(cbp, "Sending request.");
+		cp = cbp->bio_caller1;
+		cbp->bio_caller1 = NULL;
+		g_io_request(cbp, cp);
+	}
+}
+
+static void
 g_stripe_start(struct bio *bp)
 {
 	off_t offset, start, length, nstripe;
@@ -542,10 +578,10 @@
 	case BIO_READ:
 	case BIO_WRITE:
 	case BIO_DELETE:
-		/*
-		 * Only those requests are supported.
-		 */
 		break;
+	case BIO_FLUSH:
+		g_stripe_flush(sc, bp);
+		return;
 	case BIO_GETATTR:
 		/* To which provider it should be delivered? */
 	default:
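
The three flush handlers in this patch (g_mirror_flush, g_raid3_flush, g_stripe_flush) share one shape: clone the request once per component, queue the clones, and only start dispatching after every clone has been allocated. If any allocation fails, the partial queue is torn down and nothing is sent. The following is a minimal userland sketch of that all-or-nothing pattern, not the kernel code: struct sketch_req, sketch_fanout() and sketch_free_all() are hypothetical names, malloc() stands in for g_clone_bio(), and the fail_at parameter exists only so that an allocation failure can be simulated.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a cloned bio queued for one component. */
struct sketch_req {
	struct sketch_req	*next;
	int			 target;	/* component index */
};

/* Free every request on the list. */
static void
sketch_free_all(struct sketch_req *head)
{
	struct sketch_req *r;

	while ((r = head) != NULL) {
		head = r->next;
		free(r);
	}
}

/*
 * Build one request per target, in target order.  Returns the list
 * head, or NULL after releasing all partial work if an allocation
 * fails.  fail_at is a test hook: the allocation for that index is
 * treated as failed; pass -1 to never simulate a failure.
 */
static struct sketch_req *
sketch_fanout(int ntargets, int fail_at)
{
	struct sketch_req *head = NULL, **tailp = &head, *r;
	int i;

	assert(ntargets > 0);
	for (i = 0; i < ntargets; i++) {
		r = (i == fail_at) ? NULL : malloc(sizeof(*r));
		if (r == NULL) {
			/* Roll back: nothing has been dispatched yet. */
			sketch_free_all(head);
			return (NULL);
		}
		r->target = i;
		r->next = NULL;
		*tailp = r;
		tailp = &r->next;
	}
	return (head);
}
```

Separating allocation from dispatch means a failure never leaves some components flushed and others not on account of memory pressure alone; the caller either gets a full list to send or an error to report.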
--- sys/kern/subr_disk.c.orig
+++ sys/kern/subr_disk.c
@@ -43,6 +43,7 @@
 	case BIO_WRITE:		printf("cmd=write "); break;
 	case BIO_DELETE:	printf("cmd=delete "); break;
 	case BIO_GETATTR:	printf("cmd=getattr "); break;
+	case BIO_FLUSH:		printf("cmd=flush "); break;
 	default:		printf("cmd=%x ", bp->bio_cmd); break;
 	}
 	sn = bp->bio_pblkno;
--- sys/kern/vfs_subr.c.orig
+++ sys/kern/vfs_subr.c
@@ -2545,6 +2545,8 @@
 		strcat(buf, "|VV_TEXT");
 	if (vp->v_vflag & VV_SYSTEM)
 		strcat(buf, "|VV_SYSTEM");
+	if (vp->v_vflag & VV_DELETED)
+		strcat(buf, "|VV_DELETED");
 	if (vp->v_iflag & VI_DOOMED)
 		strcat(buf, "|VI_DOOMED");
 	if (vp->v_iflag & VI_FREE)
--- sys/modules/geom/Makefile.orig
+++ sys/modules/geom/Makefile
@@ -9,6 +9,7 @@
 	geom_fox \
 	geom_gate \
 	geom_gpt \
+	geom_journal \
 	geom_label \
 	geom_mbr \
 	geom_mirror \
--- /dev/null	Tue Oct 24 16:34:10 2006
+++ sys/modules/geom/geom_journal/Makefile	Tue Oct 24 16:34:22 2006
@@ -0,0 +1,10 @@
+# $FreeBSD$
+
+.PATH: ${.CURDIR}/../../../geom/journal
+
+KMOD=	geom_journal
+SRCS=	g_journal.c
+SRCS+=	g_journal_ufs.c
+SRCS+=	vnode_if.h
+
+.include <bsd.kmod.mk>
--- sys/modules/ufs/Makefile.orig
+++ sys/modules/ufs/Makefile
@@ -6,7 +6,7 @@
 SRCS=	opt_ddb.h opt_directio.h opt_ffs.h opt_ffs_broken_fixme.h opt_mac.h \
 	opt_quota.h opt_suiddir.h opt_ufs.h \
 	vnode_if.h ufs_acl.c ufs_bmap.c ufs_dirhash.c ufs_extattr.c \
-	ufs_inode.c ufs_lookup.c ufs_quota.c ufs_vfsops.c \
+	ufs_gjournal.c ufs_inode.c ufs_lookup.c ufs_quota.c ufs_vfsops.c \
 	ufs_vnops.c ffs_alloc.c ffs_balloc.c ffs_inode.c ffs_snapshot.c \
 	ffs_softdep.c ffs_subr.c ffs_tables.c ffs_vfsops.c ffs_vnops.c
 
--- sys/sys/bio.h.orig
+++ sys/sys/bio.h
@@ -88,6 +88,7 @@
 #define BIO_WRITE	0x02
 #define BIO_DELETE	0x04
 #define BIO_GETATTR	0x08
+#define BIO_FLUSH	0x10
 #define BIO_CMD0	0x20	/* Available for local hacks */
 #define BIO_CMD1	0x40	/* Available for local hacks */
 #define BIO_CMD2	0x80	/* Available for local hacks */
--- sys/sys/mount.h.orig
+++ sys/sys/mount.h
@@ -178,6 +178,7 @@
 	int		mnt_secondary_accwrites;/* (i) secondary wr. starts */
 	int		mnt_ref;		/* (i) Reference count */
 	int		mnt_gen;		/* struct mount generation */
+	char		*mnt_gjprovider;	/* gjournal provider name */
 };
 
 struct vnode *__mnt_vnode_next(struct vnode **mvp, struct mount *mp);
@@ -224,7 +225,7 @@
 #define	MNT_SUIDDIR	0x00100000	/* special handling of SUID on dirs */
 #define	MNT_SOFTDEP	0x00200000	/* soft updates being done */
 #define	MNT_NOSYMFOLLOW	0x00400000	/* do not follow symlinks */
-#define	MNT_JAILDEVFS	0x02000000	/* jail-friendly DEVFS behaviour */
+#define	MNT_GJOURNAL	0x02000000	/* GEOM journal support enabled */
 #define	MNT_MULTILABEL	0x04000000	/* MAC support for individual objects */
 #define	MNT_ACLS	0x08000000	/* ACL support enabled */
 #define	MNT_NOATIME	0x10000000	/* disable update of file access time */
@@ -265,13 +266,13 @@
 			MNT_ROOTFS	| MNT_NOATIME	| MNT_NOCLUSTERR| \
 			MNT_NOCLUSTERW	| MNT_SUIDDIR	| MNT_SOFTDEP	| \
 			MNT_IGNORE	| MNT_EXPUBLIC	| MNT_NOSYMFOLLOW | \
-			MNT_JAILDEVFS	| MNT_MULTILABEL | MNT_ACLS)
+			MNT_GJOURNAL	| MNT_MULTILABEL | MNT_ACLS)
 
 /* Mask of flags that can be updated. */
 #define	MNT_UPDATEMASK (MNT_NOSUID	| MNT_NOEXEC	| \
 			MNT_SYNCHRONOUS	| MNT_UNION	| MNT_ASYNC	| \
 			MNT_NOATIME | \
-			MNT_NOSYMFOLLOW	| MNT_IGNORE	| MNT_JAILDEVFS	| \
+			MNT_NOSYMFOLLOW	| MNT_IGNORE	| \
 			MNT_NOCLUSTERR	| MNT_NOCLUSTERW | MNT_SUIDDIR	| \
 			MNT_ACLS	| MNT_USER)
 
--- sys/ufs/ffs/ffs_extern.h.orig
+++ sys/ufs/ffs/ffs_extern.h
@@ -72,6 +72,7 @@
 int	ffs_reallocblks(struct vop_reallocblks_args *);
 int	ffs_realloccg(struct inode *, ufs2_daddr_t, ufs2_daddr_t,
 	    ufs2_daddr_t, int, int, struct ucred *, struct buf **);
+int	ffs_sbupdate(struct ufsmount *, int, int);
 void	ffs_setblock(struct fs *, u_char *, ufs1_daddr_t);
 int	ffs_snapblkfree(struct fs *, struct vnode *, ufs2_daddr_t, long, ino_t);
 void	ffs_snapremove(struct vnode *vp);
--- sys/ufs/ffs/ffs_vfsops.c.orig
+++ sys/ufs/ffs/ffs_vfsops.c
@@ -53,6 +53,7 @@
 #include <sys/mutex.h>
 
 #include <ufs/ufs/extattr.h>
+#include <ufs/ufs/gjournal.h>
 #include <ufs/ufs/quota.h>
 #include <ufs/ufs/ufsmount.h>
 #include <ufs/ufs/inode.h>
@@ -70,7 +71,6 @@
 
 static uma_zone_t uma_inode, uma_ufs1, uma_ufs2;
 
-static int	ffs_sbupdate(struct ufsmount *, int, int);
 static int	ffs_reload(struct mount *, struct thread *);
 static int	ffs_mountfs(struct vnode *, struct mount *, struct thread *);
 static void	ffs_oldfscompat_read(struct fs *, struct ufsmount *,
@@ -661,6 +661,35 @@
 		fs->fs_pendingblocks = 0;
 		fs->fs_pendinginodes = 0;
 	}
+	if ((fs->fs_flags & FS_GJOURNAL) != 0) {
+#ifdef UFS_GJOURNAL
+		/*
+		 * Get journal provider name.
+		 */
+		size = 1024;
+		mp->mnt_gjprovider = malloc(size, M_UFSMNT, M_WAITOK);
+		if (g_io_getattr("GJOURNAL::provider", cp, &size,
+		    mp->mnt_gjprovider) == 0) {
+			mp->mnt_gjprovider = realloc(mp->mnt_gjprovider, size,
+			    M_UFSMNT, M_WAITOK);
+			MNT_ILOCK(mp);
+			mp->mnt_flag |= MNT_GJOURNAL;
+			MNT_IUNLOCK(mp);
+		} else {
+			printf(
+"WARNING: %s: GJOURNAL flag on fs but no gjournal provider below\n",
+			    mp->mnt_stat.f_mntonname);
+			free(mp->mnt_gjprovider, M_UFSMNT);
+			mp->mnt_gjprovider = NULL;
+		}
+#else
+		printf(
+"WARNING: %s: GJOURNAL flag on fs but no UFS_GJOURNAL support\n",
+		    mp->mnt_stat.f_mntonname);
+#endif
+	} else {
+		mp->mnt_gjprovider = NULL;
+	}
 	ump = malloc(sizeof *ump, M_UFSMNT, M_WAITOK | M_ZERO);
 	ump->um_cp = cp;
 	ump->um_bo = &devvp->v_bufobj;
@@ -828,6 +857,10 @@
 	}
 	if (ump) {
 		mtx_destroy(UFS_MTX(ump));
+		if (mp->mnt_gjprovider != NULL) {
+			free(mp->mnt_gjprovider, M_UFSMNT);
+			mp->mnt_gjprovider = NULL;
+		}
 		free(ump->um_fs, M_UFSMNT);
 		free(ump, M_UFSMNT);
 		mp->mnt_data = (qaddr_t)0;
@@ -987,6 +1020,10 @@
 	PICKUP_GIANT();
 	vrele(ump->um_devvp);
 	mtx_destroy(UFS_MTX(ump));
+	if (mp->mnt_gjprovider != NULL) {
+		free(mp->mnt_gjprovider, M_UFSMNT);
+		mp->mnt_gjprovider = NULL;
+	}
 	free(fs->fs_csp, M_UFSMNT);
 	free(fs, M_UFSMNT);
 	free(ump, M_UFSMNT);
@@ -1471,7 +1508,7 @@
 /*
  * Write a superblock and associated information back to disk.
  */
-static int
+int
 ffs_sbupdate(mp, waitfor, suspended)
 	struct ufsmount *mp;
 	int waitfor;
--- sys/ufs/ffs/fs.h.orig
+++ sys/ufs/ffs/fs.h
@@ -323,7 +323,8 @@
 	u_int	*fs_active;		/* (u) used by snapshots to track fs */
 	int32_t	 fs_old_cpc;		/* cyl per cycle in postbl */
 	int32_t	 fs_maxbsize;		/* maximum blocking factor permitted */
-	int64_t	 fs_sparecon64[17];	/* old rotation block list head */
+	int64_t	 fs_unrefs;		/* number of unreferenced inodes */
+	int64_t	 fs_sparecon64[16];	/* old rotation block list head */
 	int64_t	 fs_sblockloc;		/* byte offset of standard superblock */
 	struct	csum_total fs_cstotal;	/* (u) cylinder summary information */
 	ufs_time_t fs_time;		/* last time written */
@@ -406,6 +407,7 @@
 #define FS_INDEXDIRS  0x08	/* kernel supports indexed directories */
 #define FS_ACLS       0x10	/* file system has ACLs enabled */
 #define FS_MULTILABEL 0x20	/* file system is MAC multi-label */
+#define FS_GJOURNAL   0x40	/* gjournaled file system */
 #define FS_FLAGS_UPDATED 0x80	/* flags have been moved to new location */
 
 /*
@@ -475,7 +477,8 @@
 	int32_t	 cg_nclusterblks;	/* number of clusters this cg */
 	int32_t  cg_niblk;		/* number of inode blocks this cg */
 	int32_t	 cg_initediblk;		/* last initialized inode */
-	int32_t	 cg_sparecon32[3];	/* reserved for future use */
+	int32_t	 cg_unrefs;		/* number of unreferenced inodes */
+	int32_t	 cg_sparecon32[2];	/* reserved for future use */
 	ufs_time_t cg_time;		/* time last written */
 	int64_t	 cg_sparecon64[3];	/* reserved for future use */
 	u_int8_t cg_space[1];		/* space for cylinder group maps */
--- /dev/null	Tue Oct 24 16:34:10 2006
+++ sys/ufs/ufs/gjournal.h	Tue Oct 24 16:34:25 2006
@@ -0,0 +1,37 @@
+/*-
+ * Copyright (c) 2005-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ *
+ * $FreeBSD$
+ */
+
+#ifndef _UFS_UFS_GJOURNAL_H_
+#define _UFS_UFS_GJOURNAL_H_
+
+/*
+ * GEOM journal function prototypes.
+ */
+void	ufs_gjournal_orphan(struct vnode *fvp);
+void	ufs_gjournal_close(struct vnode *vp);
+#endif /* !_UFS_UFS_GJOURNAL_H_ */
--- /dev/null	Tue Oct 24 16:34:10 2006
+++ sys/ufs/ufs/ufs_gjournal.c	Tue Oct 24 16:34:27 2006
@@ -0,0 +1,150 @@
+/*-
+ * Copyright (c) 2005-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include <sys/cdefs.h>
+__FBSDID("$FreeBSD$");
+
+#include "opt_ufs.h"
+
+#ifdef UFS_GJOURNAL
+
+#include <sys/param.h>
+#include <sys/systm.h>
+#include <sys/kernel.h>
+#include <sys/vnode.h>
+#include <sys/lock.h>
+#include <sys/mount.h>
+#include <sys/mutex.h>
+
+#include <ufs/ufs/extattr.h>
+#include <ufs/ufs/quota.h>
+#include <ufs/ufs/inode.h>
+#include <ufs/ufs/ufsmount.h>
+#include <ufs/ufs/gjournal.h>
+
+#include <ufs/ffs/fs.h>
+#include <ufs/ffs/ffs_extern.h>
+
+/*
+ * Change the number of unreferenced inodes.
+ */
+static int
+ufs_gjournal_modref(struct vnode *vp, int count)
+{
+	struct cg *cgp;
+	struct buf *bp;
+	ufs2_daddr_t cgbno;
+	int error, cg;
+	struct cdev *dev;
+	struct inode *ip;
+	struct ufsmount *ump;
+	struct fs *fs;
+	struct vnode *devvp;
+	ino_t ino;
+
+	ip = VTOI(vp);
+	ump = ip->i_ump;
+	fs = ip->i_fs;
+	devvp = ip->i_devvp;
+	ino = ip->i_number;
+
+	cg = ino_to_cg(fs, ino);
+	if (devvp->v_type != VCHR) {
+		/* devvp is a snapshot */
+		dev = VTOI(devvp)->i_devvp->v_rdev;
+		cgbno = fragstoblks(fs, cgtod(fs, cg));
+	} else {
+		/* devvp is a normal disk device */
+		dev = devvp->v_rdev;
+		cgbno = fsbtodb(fs, cgtod(fs, cg));
+	}
+	if ((u_int)ino >= fs->fs_ipg * fs->fs_ncg)
+		panic("ufs_gjournal_modref: range: dev = %s, ino = %lu, fs = %s",
+		    devtoname(dev), (u_long)ino, fs->fs_fsmnt);
+	if ((error = bread(devvp, cgbno, (int)fs->fs_cgsize, NOCRED, &bp))) {
+		brelse(bp);
+		return (error);
+	}
+	cgp = (struct cg *)bp->b_data;
+	if (!cg_chkmagic(cgp)) {
+		brelse(bp);
+		return (0);
+	}
+	bp->b_xflags |= BX_BKGRDWRITE;
+	cgp->cg_unrefs += count;
+	UFS_LOCK(ump);
+	fs->fs_unrefs += count;
+	fs->fs_fmod = 1;
+	ACTIVECLEAR(fs, cg);
+	UFS_UNLOCK(ump);
+	bdwrite(bp);
+	return (0);
+}
+
+void
+ufs_gjournal_orphan(struct vnode *fvp)
+{
+	struct mount *mp;
+	struct inode *ip;
+
+	mp = fvp->v_mount;
+	if (mp->mnt_gjprovider == NULL)
+		return;
+	VI_LOCK(fvp);
+	if (fvp->v_usecount < 2 || (fvp->v_vflag & VV_DELETED)) {
+		VI_UNLOCK(fvp);
+		return;
+	}
+	ip = VTOI(fvp);
+	if ((fvp->v_type == VDIR && ip->i_nlink > 2) ||
+	    (fvp->v_type != VDIR && ip->i_nlink > 1)) {
+		VI_UNLOCK(fvp);
+		return;
+	}
+	fvp->v_vflag |= VV_DELETED;
+	VI_UNLOCK(fvp);
+
+	ufs_gjournal_modref(fvp, 1);
+}
+
+void
+ufs_gjournal_close(struct vnode *vp)
+{
+	struct mount *mp;
+	struct inode *ip;
+
+	mp = vp->v_mount;
+	if (mp->mnt_gjprovider == NULL)
+		return;
+	if (!(vp->v_vflag & VV_DELETED))
+		return;
+	ip = VTOI(vp);
+	if (ip->i_nlink > 0)
+		return;
+	ufs_gjournal_modref(vp, -1);
+}
+
+#endif /* UFS_GJOURNAL */
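
The checks at the top of ufs_gjournal_orphan() above can be read as a single predicate. As I read the code (this is an interpretation, not something the patch states), a vnode is counted as unreferenced when someone besides the caller still holds it, it has not been marked already, and the pending removal would drop its last link. The sketch below restates only that decision in userland C; sketch_should_mark_orphan() and its parameters are hypothetical names, and locking and the cylinder group update are left out.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Mirror of the early-return tests in ufs_gjournal_orphan():
 *  - usecount < 2 or already marked: do nothing;
 *  - directory with more than 2 links, or other file with more
 *    than 1 link: do nothing;
 *  - otherwise the vnode would be flagged VV_DELETED and the
 *    unreferenced-inode counters incremented.
 */
static bool
sketch_should_mark_orphan(bool is_dir, int nlink, int usecount,
    bool already_marked)
{

	assert(nlink >= 0 && usecount >= 0);
	if (usecount < 2 || already_marked)
		return (false);
	if ((is_dir && nlink > 2) || (!is_dir && nlink > 1))
		return (false);
	return (true);
}
```

For example, a regular file with one link that is still open in another process gives true, while the same file with a second hard link gives false.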
--- sys/ufs/ufs/ufs_inode.c.orig
+++ sys/ufs/ufs/ufs_inode.c
@@ -57,6 +57,9 @@
 #include <ufs/ufs/dir.h>
 #include <ufs/ufs/dirhash.h>
 #endif
+#ifdef UFS_GJOURNAL
+#include <ufs/ufs/gjournal.h>
+#endif
 
 /*
  * Last reference to an inode.  If necessary, write or delete it.
@@ -83,6 +86,9 @@
 	 */
 	if (ip->i_mode == 0)
 		goto out;
+#ifdef UFS_GJOURNAL
+	ufs_gjournal_close(vp);
+#endif
 	if ((ip->i_effnlink == 0 && DOINGSOFTDEP(vp)) ||
 	    (ip->i_nlink <= 0 &&
 	     (vp->v_mount->mnt_flag & MNT_RDONLY) == 0)) {
--- sys/ufs/ufs/ufs_vnops.c.orig
+++ sys/ufs/ufs/ufs_vnops.c
@@ -81,6 +81,9 @@
 #ifdef UFS_DIRHASH
 #include <ufs/ufs/dirhash.h>
 #endif
+#ifdef UFS_GJOURNAL
+#include <ufs/ufs/gjournal.h>
+#endif
 
 #include <ufs/ffs/ffs_extern.h>
 
@@ -777,6 +780,9 @@
 		error = EPERM;
 		goto out;
 	}
+#ifdef UFS_GJOURNAL
+	ufs_gjournal_orphan(vp);
+#endif
 	error = ufs_dirremove(dvp, ip, ap->a_cnp->cn_flags, 0);
 	if (ip->i_nlink <= 0)
 		vp->v_vflag |= VV_NOSYNC;
@@ -1683,6 +1689,9 @@
 		error = EINVAL;
 		goto out;
 	}
+#ifdef UFS_GJOURNAL
+	ufs_gjournal_orphan(vp);
+#endif
 	/*
 	 * Delete reference to directory before purging
 	 * inode.  If we crash in between, the directory
--- sys/sys/vnode.h.orig	Fri Jan  5 04:51:14 2007
+++ sys/sys/vnode.h	Tue Jan 23 12:41:12 2007
@@ -253,6 +253,7 @@
 #define	VV_SYSTEM	0x0080	/* vnode being used by kernel */
 #define	VV_PROCDEP	0x0100	/* vnode is process dependent */
 #define	VV_NOKNOTE	0x0200	/* don't activate knotes on this vnode */
+#define	VV_DELETED	0x0400	/* should be removed */
 #define	VV_MD		0x0800	/* vnode backs the md device */
 
 /*

--Boundary-00=_aRZ7FUS0QLmc3St--


