From owner-freebsd-current@FreeBSD.ORG  Wed Apr 30 07:14:26 2003
Date: Wed, 30 Apr 2003 16:14:20 +0200 (CEST)
From: Heiko Schaefer <hschaefer@fto.de>
X-X-Sender: heiko@daneel.foundation.hs
To: Poul-Henning Kamp <phk@phk.freebsd.dk>
In-Reply-To: <8677.1051710679@critter.freebsd.dk>
Message-ID: <20030430155816.U27116@daneel.foundation.hs>
References: <8677.1051710679@critter.freebsd.dk>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Virus-Scanned: by amavisd-new at fto.de
cc: freebsd-current@freebsd.org
Subject: Re: still: Re: gbde data corruption? 

Hello Poul,

> >the broken version of the file contains lots of 0-bytes (instead of high
> >entropy values in the original file). seems by the output of cmp that
> >every damaged value is replaced by 0.
>
> Zero bytes is the absolutely last thing I would expect...
>
> How long are the sequences of zero bytes, and do they start at
> sector boundaries ?

it seems that the (one and only) sequence is exactly 32k long and starts
nicely aligned (aligned to 1024*16, even).
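
for anyone who wants to repeat the measurement: a minimal sketch of how the
zero run and its alignment could be located. this is not the exact procedure
used above (that was based on cmp output); the function names and the 512-byte
minimum run length are assumptions made up for illustration.

```python
def find_zero_runs(data: bytes, min_len: int = 512):
    """Return (offset, length) for every run of zero bytes of at least min_len."""
    runs = []
    i = 0
    n = len(data)
    while i < n:
        if data[i] == 0:
            # Extend the run as far as the zero bytes go.
            j = i
            while j < n and data[j] == 0:
                j += 1
            if j - i >= min_len:
                runs.append((i, j - i))
            i = j
        else:
            i += 1
    return runs


def describe_run(offset: int, length: int, boundary: int = 16 * 1024):
    """Report whether a run starts on a multiple of boundary (16k by default)."""
    return {
        "offset": offset,
        "length": length,
        "aligned": offset % boundary == 0,
    }


if __name__ == "__main__":
    import sys

    # Reads the whole file into memory, which is fine for files of 1-10 MB.
    with open(sys.argv[1], "rb") as f:
        contents = f.read()
    for off, length in find_zero_runs(contents):
        print(describe_run(off, length))
```

the scan is byte-by-byte and therefore slow on large files, but for the file
sizes in question here that should not matter.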

> Do you also see this on the client ?  (Ie: could it be that data is
> still cached on the client and not flushed ?)

i see the broken variant of the file both locally and via my nfs client.
which is to be expected - i'm moving rather large amounts of data...

the thing that i am doing (over and over again) is completely filling one
30gb and one 60gb filesystem.

> What is the approximate error-rate ?  1 file in 10 ? 1 file in 100 ?
> How long are the files ?

this last error i observed is one file on a 30gb filesystem that is
completely filled with files of (mostly) between 1mb and 10mb each. so
i'm talking about roughly 1 in 10000, in this case.
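
the "1 in 10000" figure is only an order-of-magnitude estimate. a small
sketch of the arithmetic, assuming decimal units (1gb = 10^9 bytes, 1mb = 10^6
bytes) and a hypothetical helper name:

```python
GB = 10**9  # assumption: decimal gigabytes
MB = 10**6  # assumption: decimal megabytes


def file_count_range(fs_bytes: int, min_file: int, max_file: int):
    """Bounds on how many files fit if all are between min_file and max_file."""
    fewest = fs_bytes // max_file   # every file as large as possible
    most = fs_bytes // min_file     # every file as small as possible
    return fewest, most


fewest, most = file_count_range(30 * GB, 1 * MB, 10 * MB)
print(fewest, most)
```

so one damaged file on the full 30gb filesystem corresponds to somewhere
between 1 in 3000 and 1 in 30000 files, depending on the actual size mix.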

> >another thing i just notice: /var/log/messages contains lots of
> >
> >[...]
> >Apr 30 15:24:55 zoidberg kernel: ENOMEM 0xc4c62100 on 0xc45c6c80(ad2s1e.bde)
> >Apr 30 15:25:19 zoidberg kernel: ENOMEM 0xc3fa5000 on 0xc45c6c80(ad2s1e.bde)
> >Apr 30 15:25:57 zoidberg kernel: ENOMEM 0xc4b46100 on 0xc45c6c80(ad2s1e.bde)
> >Apr 30 15:25:57 zoidberg kernel: ENOMEM 0xc4364500 on 0xc45c6c80(ad2s1e.bde)
> >[...]
>
> This means that the kernel ran out of ram and the operation was retried,
> it should not result in data corruption but it may reorder bio requests
> significantly.  I must admit that I have not bashed NFS to see that it
> copes.

that sounds moderately suspicious to me. i could try to physically move
another disc with lots of unencrypted data into the fileserver and try
copying onto gbde without nfs - but only later today, when i get home.

> >if you have no other things i could report or try, i might just throw away
> >the gbde volumes and try the same copying with non-gbde partitions, just
> >to be sure.
>
> That would be a good first step, but we need to do it controlled to make
> sure we know what we prove, so please try it this way:
>
> add
> 	option          MALLOC_MAKE_FAILURES
> to your kernel.
>
> Build filesystem without GBDE, run test, check for corruption.

well, i think i'll just try copying (over nfs) onto unencrypted
filesystems without any further changes first. one of these copy- and
checksum cycles takes quite a few hours ... if that test results in
errors, then i will instantly throw myself into the dust before you and
apologize :) if not, i'll try to stress my box some more (including malloc
failures if nothing else helps/hurts).
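
for reference, one such copy-and-checksum cycle could be scripted roughly
like this. it is a sketch under assumptions, not the procedure actually used:
the function names are invented, and sha256 is just one possible choice of
checksum.

```python
import hashlib
import os


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Checksum one file, reading it in chunks to keep memory use small."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def checksum_tree(root: str) -> dict:
    """Map each file's path (relative to root) to its checksum."""
    sums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            sums[os.path.relpath(full, root)] = file_digest(full)
    return sums


def differing_files(src_root: str, dst_root: str) -> list:
    """List source files that are missing or different in the copy."""
    src = checksum_tree(src_root)
    dst = checksum_tree(dst_root)
    return sorted(p for p in src if dst.get(p) != src[p])


if __name__ == "__main__":
    import sys

    # usage: script.py <original tree> <copied tree>
    for path in differing_files(sys.argv[1], sys.argv[2]):
        print("MISMATCH", path)
```

running the same comparison against a gbde and a non-gbde target would show
whether the damaged files appear only in the encrypted case.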

thanks, regards,

Heiko