Date: Wed, 30 Apr 2003 16:14:20 +0200 (CEST)
From: Heiko Schaefer <hschaefer@fto.de>
To: Poul-Henning Kamp <phk@phk.freebsd.dk>
Cc: freebsd-current@freebsd.org
Subject: Re: still: Re: gbde data corruption?
Message-ID: <20030430155816.U27116@daneel.foundation.hs>
In-Reply-To: <8677.1051710679@critter.freebsd.dk>
References: <8677.1051710679@critter.freebsd.dk>

Hello Poul,

> >the broken version of the file contains lots of 0-bytes (instead of high
> >entropy values in the original file). seems by the output of cmp that
> >every damaged value is replaced by 0.
>
> Zero bytes is the absolutely last thing I would expect...
>
> How long are the sequences of zero bytes, and do they start at
> sector boundaries ?

it seems that the (one and only) sequence is exactly 32k long and starts
nicely aligned (aligned to 1024*16, even).

> Do you also see this on the client ? (Ie: could it be that data is
> still cached on the client and not flushed ?)

i see the broken variant of the file both locally and via my nfs client.
which is to be expected - i'm moving rather large amounts of data... the
thing that i am doing (over and over again) is completely filling one 30gb
and one 60gb filesystem.

> What is the approximate error-rate ? 1 file in 10 ? 1 file in 100 ?
> How long are the files ?

this last error i observe is one file on a 30gb filesystem that is filled
fully with files that are between 1mb and 10mb or so (most of them, at
least). so i'm talking about 1 in 10000, in this case.

> >another thing i just notice: /var/log/messages contains lots of
> >
> >[...]
> >Apr 30 15:24:55 zoidberg kernel: ENOMEM 0xc4c62100 on 0xc45c6c80(ad2s1e.bde)
> >Apr 30 15:25:19 zoidberg kernel: ENOMEM 0xc3fa5000 on 0xc45c6c80(ad2s1e.bde)
> >Apr 30 15:25:57 zoidberg kernel: ENOMEM 0xc4b46100 on 0xc45c6c80(ad2s1e.bde)
> >Apr 30 15:25:57 zoidberg kernel: ENOMEM 0xc4364500 on 0xc45c6c80(ad2s1e.bde)
> >[...]
>
> This means that the kernel ran out of ram and the operation was retried,
> it should not result in data corruption but it may reorder bio requests
> significantly. I must admit that I have not bashed NFS to see that it
> copes.

that sounds moderately suspicious to me. i could try to physically move
another disc with lots of unencrypted data into the fileserver and try
copying onto gbde without nfs - but only later today, when i get home.

> >if you have no other things i could report or try, i might just throw away
> >the gbde volumes and try the same copying with non-gbde partitions, just
> >to be sure.
>
> That would be a good first step, but we need to do it controlled to make
> sure we know what we prove, so please try it this way:
>
> add
>         option MALLOC_MAKE_FAILURES
> to your kernel.
>
> Build filesystem without GBDE, run test, check for corruption.

well, i think i'll just try copying (over nfs) onto unencrypted filesystems
without any further changes first. one of these copy- and checksum cycles
takes quite a few hours ...

if that test results in errors, then i will instantly throw myself into the
dust before you and apologize :)

if not, i'll try to stress my box some more (including malloc failures if
nothing else helps/hurts).

thanks, regards,

Heiko
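
The question of how long the zero-byte sequences are, and whether they start on sector boundaries, can be checked mechanically rather than by reading cmp output. The following is only a rough sketch, not a tool used in this thread: the function names `zeroed_runs` and `alignment` are made up for illustration, the 512-byte sector size and the 512-byte minimum run length are assumptions, and the 16 KiB boundary is simply the 1024*16 figure mentioned above. It reads both files fully into memory, which should be acceptable for files in the 1mb to 10mb range.

```python
import re
import sys


def zeroed_runs(good: bytes, bad: bytes, min_len: int = 512):
    """Return (offset, length) for each maximal run of zero bytes in `bad`
    that is at least `min_len` bytes long and differs from `good` there.

    min_len=512 is an assumed threshold (one traditional disk sector).
    """
    pattern = re.compile(rb"\x00{%d,}" % min_len)
    runs = []
    for match in pattern.finditer(bad):
        start, end = match.span()
        # Skip runs that are also present in the known-good copy.
        if good[start:end] != bad[start:end]:
            runs.append((start, end - start))
    return runs


def alignment(offset: int, boundaries=(512, 16 * 1024)):
    """Map each boundary size to whether `offset` is a multiple of it.

    512 is an assumed sector size; 16 KiB is the 1024*16 from the report.
    """
    return {b: offset % b == 0 for b in boundaries}


# Usage (hypothetical script name): python zero_runs.py good_file bad_file
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1], "rb") as f:
        good_data = f.read()
    with open(sys.argv[2], "rb") as f:
        bad_data = f.read()
    for offset, length in zeroed_runs(good_data, bad_data):
        print(offset, length, alignment(offset))
```

Searching for long zero runs first and only then comparing against the original avoids splitting a run wherever the high-entropy original happens to contain a zero byte of its own.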