From: "Poul-Henning Kamp"
To: Heiko Schaefer
cc: freebsd-current@freebsd.org
Subject: Re: still: Re: gbde data corruption?
In-Reply-To: Your message of "Wed, 30 Apr 2003 15:29:28 +0200." <20030430151514.X27116@daneel.foundation.hs>
Date: Wed, 30 Apr 2003 15:51:19 +0200
Message-ID: <8677.1051710679@critter.freebsd.dk>

In message <20030430151514.X27116@daneel.foundation.hs>, Heiko Schaefer writes:
>Hi Poul,
>the broken version of the file contains lots of 0-bytes (instead of high
>entropy values in the original file). seems by the output of cmp that
>every damaged value is replaced by 0.

Zero bytes are the absolutely last thing I would expect...

How long are the sequences of zero bytes, and do they start at sector
boundaries?

Do you also see this on the client? (I.e., could it be that data is
still cached on the client and not flushed?)

What is the approximate error rate? 1 file in 10? 1 file in 100?

How long are the files?

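One way to answer the zero-run questions above is to scan a damaged copy for runs of 0-bytes and see whether they line up with sector boundaries. What follows is only a minimal sketch in Python: the function names, the 16-byte minimum run length and the default sector size are assumptions made for illustration (diskinfo reports 512 for the raw partition and 4096 for the .bde device, so both values are worth trying).

```python
def zero_runs(data, min_len=1):
    """Return a list of (offset, length) for each run of zero bytes
    that is at least min_len bytes long."""
    runs = []
    start = None
    for i, b in enumerate(data):
        if b == 0:
            if start is None:
                start = i          # a new run of zeros begins here
        elif start is not None:
            if i - start >= min_len:
                runs.append((start, i - start))
            start = None
    # Handle a run that extends to the end of the data.
    if start is not None and len(data) - start >= min_len:
        runs.append((start, len(data) - start))
    return runs


def describe_runs(data, sector_size=512, min_len=16):
    """For each zero run, report whether its start offset and its length
    are multiples of sector_size (hypothetical default: 512)."""
    return [
        (off, length, off % sector_size == 0, length % sector_size == 0)
        for off, length in zero_runs(data, min_len)
    ]
```

Reading the damaged file with open(path, "rb").read() and passing the result to describe_runs() gives one tuple per run. Runs that start on a sector boundary and are whole sectors long would point at block-level trouble rather than damage at arbitrary offsets.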
>zoidberg# diskinfo /dev/ad0s1e
>/dev/ad0s1e 512 29051207680 56740640 56290 16 63
>zoidberg# diskinfo /dev/ad0s1e.bde
>/dev/ad0s1e.bde 4096 28937551872 7064832

This looks ok.

>another thing i just notice: /var/log/messages contains lots of
>
>[...]
>Apr 30 15:24:55 zoidberg kernel: ENOMEM 0xc4c62100 on 0xc45c6c80(ad2s1e.bde)
>Apr 30 15:25:19 zoidberg kernel: ENOMEM 0xc3fa5000 on 0xc45c6c80(ad2s1e.bde)
>Apr 30 15:25:57 zoidberg kernel: ENOMEM 0xc4b46100 on 0xc45c6c80(ad2s1e.bde)
>Apr 30 15:25:57 zoidberg kernel: ENOMEM 0xc4364500 on 0xc45c6c80(ad2s1e.bde)
>[...]

This means that the kernel ran out of RAM and the operation was retried.
It should not result in data corruption, but it may reorder bio requests
significantly. I must admit that I have not bashed NFS to see that it copes.

>i feel that the issue i see is outside the realm of 'should' - so i try to
>give any information i can think of. even useless information :)

Ohh, you're _WAY_ out of "should", you're with your feet deep into
"should certainly NOT", right next to "NEVER EVER!" :-)

>also, i have the unpleasant feeling that i might be making some stupid
>mistake, and waste your time by looking entirely in the wrong direction.
>
>...for all i know the hardware i use on the server-side (or the drivers
>for it ... for some reason the sis-based onboard nic comes to my mind,
>just now) could be subtly broken :/
>if you have no other things i could report or try, i might just throw away
>the gbde volumes and try the same copying with non-gbde partitions, just
>to be sure.

That would be a good first step, but we need to do it controlled to make
sure we know what we prove, so please try it this way:

add option MALLOC_MAKE_FAILURES to your kernel.

Build filesystem without GBDE, run test, check for corruption.

if no corruption run:
	sysctl debug.malloc.failure_rate=9013
and then rebuild filesystem without GBDE, run test, check for corruption.
If you get no corruption in either case GBDE is clearly to blame, and
I get to lose more hair while I chase that bug..

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
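For the "check for corruption" step in the procedure above, here is a rough Python sketch that compares every file in a source tree against its copy by SHA-256 and counts the mismatches, which also gives an approximate error rate. The function names and the two-directory layout are hypothetical, not part of the original test setup. The comparison is best run once the copy has been flushed, since data still cached on the client could mask a difference.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def corrupted_files(src_root, dst_root):
    """Compare each file under src_root with the file at the same
    relative path under dst_root.

    Returns (bad, total): a sorted list of relative paths that are
    missing or differ in the copy, and the number of files checked.
    """
    bad = []
    total = 0
    for dirpath, _dirs, files in os.walk(src_root):
        for name in files:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            total += 1
            if not os.path.isfile(dst) or file_digest(src) != file_digest(dst):
                bad.append(rel)
    return sorted(bad), total
```

Running corrupted_files() once per test configuration (with and without malloc failures enabled) and noting len(bad) against total for each run keeps the results comparable.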