From: "Arno J. Klaassen"
To: Mike Tancsa
Date: 01 May 2008 20:55:08 +0200
Cc: stable@freebsd.org, pluknet@gmail.com
Subject: Re: nfs-server silent data corruption

Hello,

> [ .. stuff deleted .. ]

> > I have recompiled the kernel with ULE, and it seems fine as well. I
> > ran 160 iterations of a 300MB file and there was no corruption. Same
> > process - copy a junk random file over nfs mount, unmount the nfs
> > mount, remount it copy it back, compare the files.
>
>
> Let me summarise my investigations till now :

> [ .. more stuff deleted .. ]

>  - it does *not* seem to depend on :
>
>     - the interface : I could produce it using nfe0, nfe1 and
>       re0 using some netgear pci-card
>
>     - the distribution of the 4Gig memory : installing 4G at
>       CPU1 or 1G at CPU1 and 2G at CPU2 produces same results
>       (NB, all memory passed memtest.iso in both situations
>       for complete run)
>
>     - the frequency control method : easier to produce with
>       cpufreq/powerd, but finally I can reproduce the corruption
>       as well using acpi_ppc
>
>     - the nfs-client and options (not exhaustively tested, but different
>       tests include i386-releng6, amd64-releng6 and linux, and quite
>       a set of different try-and-see mount_nfs options)
>
>  I am testing right now with a fixed frequency of 1Ghz.

With cpufreq loaded but the cpu-frequency fixed, I cannot reproduce
the corruption : my test ran for three days without a problem, where
normally a couple of hours was enough.
But I looked again at the corrupted copies :

# for i in raid5/xps/SAVE/1 raid5/pxe/SAVE/1 raid5/pxe/SAVE/2 \
    raid5/pxe/SAVE/3 raid5/blockhead/SAVE/1 scsi/pxe/SAVE/1 \
    scsi/blockhead/SAVE/1 scsi/blockhead/SAVE/2 scsi/blockhead/SAVE/3 \
    scsi/blockhead/SAVE/4; do ls -l $i/BIG; cmp -x $i/BIG $i/BIG2; echo; done

-rw-r--r--  1 root  wheel  144703488 Apr 26 16:06 raid5/xps/SAVE/1/BIG
004fd908 18 00
02c9e6c8 11 00
034ab6c8 90 00
037e4648 09 00
039e85c8 91 01
04484408 00 09
06115cc8 00 81
06e5d148 01 91
07016048 18 00
074307c8 08 19
07aa45c8 29 20
080bfb88 00 11

-rw-r--r--  1 root  wheel  144703488 Apr 20 14:07 raid5/pxe/SAVE/1/BIG
03869a48 09 00

-rw-r--r--  1 root  wheel  144703488 Apr 20 14:47 raid5/pxe/SAVE/2/BIG
05209d88 09 00

-rw-r--r--  1 root  wheel  39845888 Apr 20 15:17 raid5/pxe/SAVE/3/BIG
01777148 09 00

-rw-r--r--  1 root  wheel  144703488 Apr 20 14:54 raid5/blockhead/SAVE/1/BIG
00f10f88 09 00

-rw-r--r--  1 root  wheel  39845888 Apr 20 16:08 scsi/pxe/SAVE/1/BIG
01f4c4c8 11 00

-rw-r--r--  1 root  wheel  144703488 Apr 20 15:38 scsi/blockhead/SAVE/1/BIG
06c3d6c8 11 00

-rw-r--r--  1 root  wheel  144703488 Apr 20 16:11 scsi/blockhead/SAVE/2/BIG
0725ca48 18 00

-rw-r--r--  1 root  wheel  144703488 Apr 20 17:32 scsi/blockhead/SAVE/3/BIG
01608008 09 00

-rw-r--r--  1 root  wheel  144703488 Apr 23 19:26 scsi/blockhead/SAVE/4/BIG
00f3b888 18 00

The output from raid5/xps/SAVE/1/BIG was produced after installing at a
lab with, without doubt, more sophisticated switches than I use. It is
the first case I was able to produce with more than just one byte
corrupted, but it still shows the same pattern : it looks like the
position always is 2^3 * 'something without a further power of two'
(i.e. eight times an odd number), e.g.

  factor(hex2dec('00f10f88')) = 2 2 2 809 2441
  factor(hex2dec('01f4c4c8')) = 2 2 2 317 12941

and the corruption is one out of the following half-byte transitions :

  1 -> 0
  8 -> 0
  9 -> 0
  0 -> 1
  0 -> 8
  0 -> 9
  8 -> 9

Maybe this gives a hint to someone ...

Best,

Arno
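
The pattern described above can be re-checked mechanically. The following is a minimal sketch in Python, not the method used in the message (which relied on factor(hex2dec(...))). The names DIFFS, parse, split_power_of_two and nibble_transitions are made up for illustration, and the data is copied by hand from the cmp -x output shown above, reading the first byte column as BIG and the second as BIG2.

```python
# Rows copied from the cmp -x output above:
# offset (hex), byte in BIG (hex), byte in BIG2 (hex).
DIFFS = """
004fd908 18 00
02c9e6c8 11 00
034ab6c8 90 00
037e4648 09 00
039e85c8 91 01
04484408 00 09
06115cc8 00 81
06e5d148 01 91
07016048 18 00
074307c8 08 19
07aa45c8 29 20
080bfb88 00 11
03869a48 09 00
05209d88 09 00
01777148 09 00
00f10f88 09 00
01f4c4c8 11 00
06c3d6c8 11 00
0725ca48 18 00
01608008 09 00
00f3b888 18 00
"""


def parse(text):
    """Turn 'offset byte1 byte2' lines (all hex) into integer triples."""
    rows = []
    for line in text.split("\n"):
        parts = line.split()
        if len(parts) == 3:
            rows.append(tuple(int(p, 16) for p in parts))
    return rows


def split_power_of_two(n):
    """Return (k, odd) such that n == 2**k * odd, for n > 0."""
    k = 0
    while n and n % 2 == 0:
        n //= 2
        k += 1
    return k, n


def nibble_transitions(a, b):
    """Half-byte changes between bytes a and b as (from, to) pairs."""
    pairs = [(a >> 4, b >> 4), (a & 0xF, b & 0xF)]
    return [(x, y) for x, y in pairs if x != y]


rows = parse(DIFFS)

# Exponent of two in every corrupted offset.
exponents = {split_power_of_two(off)[0] for off, _, _ in rows}

# Distinct half-byte transitions over all corrupted bytes.
transitions = sorted({t for _, a, b in rows for t in nibble_transitions(a, b)})

print("powers of two in offsets:", sorted(exponents))
print("nibble transitions:",
      ", ".join(f"{x:x} -> {y:x}" for x, y in transitions))
```

On the 21 rows listed, this prints [3] for the powers of two, so every offset is exactly 8 times an odd number. The seven transitions it prints are the same seven listed in the message (in sorted order: 0 -> 1, 0 -> 8, 0 -> 9, 1 -> 0, 8 -> 0, 8 -> 9, 9 -> 0). Each of them changes only the lowest and/or highest bit of a half-byte.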