From owner-freebsd-stable@FreeBSD.ORG  Thu Dec 13 12:12:25 2012
Date: Thu, 13 Dec 2012 13:05:32 +0100
From: Victor Balada Diaz <victor@bsdes.net>
To: stable@freebsd.org
Subject: gjournal + HAST data lost
Message-ID: <20121213120532.GW1414@equilibrium.bsdes.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
User-Agent: Mutt/1.5.21 (2010-09-15)

Hello,

We've experienced a weird data "rollback" on our NFS servers.

We have two NFS servers, both running 8.3-RELEASE-p3. We've
set up HAST for one partition between them. To be able to
fail over quickly we configured gjournal on top of HAST. At the
time there was no UFS+J.

Yesterday one of the servers crashed and CARP promoted the
slave to master. During the failover we got the following errors:

GEOM_JOURNAL: Journal 2180207123: hast/shared contains data.
GEOM_JOURNAL: Journal 2180207123: hast/shared contains journal.
GEOM_JOURNAL: Cannot decode journal header from hast/shared.
GEOM_JOURNAL: Journal on hast/shared is broken/corrupted. Initializing.
GEOM_JOURNAL: clean=1 flags=0x40
GEOM_JOURNAL: File system hast/shared marked as dirty.

We ran a full fsck and no errors were detected. The filesystem was
working again.

After looking at the data we saw that all the files from the last few
days were missing, as if the two servers had been disconnected from each
other, but that didn't happen. Even more: after our first NFS server came
back up, no split-brain condition was detected.

We're sure the first NFS server was working because all of the data is on
the backup servers. So it's not like the data never got written.

What could explain that data rollback? If gjournal's journal is lost, is
it possible to lose the last few days of data? Is it not recommended to
use gjournal with HAST?

Thanks a lot.
Regards.
Victor.

hast.conf:

replication fullsync
#compression lzf
#checksum sha256

on nfs01 {
        listen 192.168.23.81
}

on nfs02 {
        listen 192.168.23.82
}
resource shared {
        name shared
        local /dev/mirror/oss1g

        on nfs01 {
                remote 192.168.23.82
        }
        on nfs02 {
                remote 192.168.23.81
        }
}
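
In case it helps with diagnosis, this is roughly how the state of both
layers could be inspected on each node. It is only a sketch: it assumes
the resource name "shared" from the hast.conf above and the provider
hast/shared from the kernel messages, and run_if_present is a
hypothetical helper (not part of FreeBSD) that skips tools that are not
installed. No output is shown because it was not captured.

```shell
#!/bin/sh
# run_if_present: hypothetical helper, not a FreeBSD tool.
# Runs the given command only if its binary can be found; otherwise
# prints a "skipped" line so the script is harmless on other systems.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $*"
        "$@"
    else
        echo "skipped: $1 not found"
    fi
}

# HAST view: role and replication state of the "shared" resource.
run_if_present hastctl status shared

# gjournal view: status of configured journals.
run_if_present gjournal status

# Metadata stored on the provider under the journal (assumed to be
# hast/shared, as in the GEOM_JOURNAL messages).
run_if_present gjournal dump hast/shared
```

Running it on both nfs01 and nfs02 and comparing the output would show
whether the two nodes agree on the resource state.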

-- 
The most convincing proof that intelligent life exists on other
planets is that they have not tried to contact us.