From owner-freebsd-current@FreeBSD.ORG Sat May 10 11:44:28 2003
Return-Path:
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 04E5D37B401
	for ; Sat, 10 May 2003 11:44:28 -0700 (PDT)
Received: from sauron.fto.de (p15106025.pureserver.info [217.160.140.13])
	by mx1.FreeBSD.org (Postfix) with ESMTP id EFD8C43F3F
	for ; Sat, 10 May 2003 11:44:26 -0700 (PDT)
	(envelope-from hschaefer@fto.de)
Received: from localhost (localhost.fto.de [127.0.0.1])
	by sauron.fto.de (Postfix) with ESMTP
	id 2961925C0F6; Sat, 10 May 2003 20:44:26 +0200 (CEST)
Received: from sauron.fto.de ([127.0.0.1])
	by localhost (sauron [127.0.0.1]) (amavisd-new, port 10024) with ESMTP
	id 19060-09; Sat, 10 May 2003 20:44:24 +0200 (CEST)
Received: from giskard.foundation.hs (p50919FD6.dip.t-dialin.net [80.145.159.214])
	by sauron.fto.de (Postfix) with ESMTP
	id B6AD725C0C2; Sat, 10 May 2003 20:44:23 +0200 (CEST)
Received: from daneel.foundation.hs (daneel.foundation.hs [192.168.20.2])
	by giskard.foundation.hs (8.9.3/8.9.3) with ESMTP id UAA89137;
	Sat, 10 May 2003 20:44:22 +0200 (CEST)
	(envelope-from hschaefer@fto.de)
Date: Sat, 10 May 2003 20:44:22 +0200 (CEST)
From: Heiko Schaefer
X-X-Sender: heiko@daneel.foundation.hs
To: Terry Lambert
In-Reply-To: <3EBD3EB0.F5F8ADF7@mindspring.com>
Message-ID: <20030510203854.E93229@daneel.foundation.hs>
References: <3EBC6C6A.1040602@myrealbox.com>
	<20030510130934.R93229@daneel.foundation.hs>
	<3EBD3EB0.F5F8ADF7@mindspring.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Virus-Scanned: by amavisd-new at fto.de
cc: freebsd-current@freebsd.org
Subject: Re: data corruption with current (maybe sis chipset related?)
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sat, 10 May 2003 18:44:28 -0000

Hey Terry,

> > > walt wrote:
> > > > Do I recall from some months ago that this bug would not
> > > > affect machines with less than a gig of RAM?
> > >
> > > The amount of memory at which you see it depends on the processor
> > > features. Now that autotuning is in, there's a stair-step for
> > > how much the system uses for each resource pool, based on how
> > > much RAM is in the system. It's quite unpredictable where it will
> > > show up in -current, because of this (and the new memory allocator).
> > >
> > > Basically, the problem will show wherever the memory size vs.
> > > memory utilization tickles it (that's why upping maxfiles was
> > > enough to scare it off, before the tuning/allocator changes
> > > went in).
> > - i still have an issue with the system because of which i started this
> > thread:
> >
> > originally, i bought a 512mb ddr ram for it (not the cheapest kind, but
> > also nothing fancy - the chips say infineon). with that ram i still
> > experience data corruption.
> >
> > while i reported that the problem disappeared, i was running off a sdr
> > pc133 ram which is only 256mb.
> >
> > what i wonder now: is the physical 512mb ram possibly damaged (or not
> > interacting well with the board or bios), or could that yet again be a
> > general (software-solvable) issue (which i would likely experience
> > whenever i have 512mb of ram in that machine, regardless of make) ?
>
> It's possible that the RAM was damaged, but unlikely.
>
> If you revert to a DP2 kernel (or any kernel before Jeff's
> allocator changes AND Matt's autotuning changes), you should
> be able to trigger this problem fairly easily with anything
> that causes a lot of page thrashing right after system boot,
> as long as you pick the right amount of RAM to install for
> the CPU features of the CPU you are using.
>
> > if the problem is likely to go away with another 512mb ram, i will go to
> > get the ram changed on monday - otherwise, i'd like to spare myself and
> > the vendor the trouble :) ... especially myself *g*
>
> It might. It might not. When I first saw the problem, it
> didn't occur on 512M, and it didn't occur on 2G, but it did
> occur on 1G. This was a SuperMicro running a PIII. The
> behaviour's going to be different for different CPU features,
> unfortunately.

i'm sorry, my mail was probably a bit confusing.

since it has been pointed out to me, i am running -current kernels with

options		DISABLE_PSE
options		DISABLE_PG_G

enabled.

what i am asking myself: is there any chance that i still get any data
corruption because of the issues that you write about in some
configuration ?!

because with the 512mb (ddr) ram (which might or might not be defective) i
get data corruption, while with another 256mb (sdr) ram, i apparently
don't. so far i had the impression that my test (copying >30gb of
checksummed data between disks) shows these problems rather reliably.

> Alternately, disable auto-tuning by setting MAXUSERS to some
> value (preferably equal to or larger than the pre-auto-tune
> value), and then set maxfiles to 50000 or more. This should
> also mask the problem (though I don't know this for sure,
> given Jeff's allocator changes not preallocating the page
> maps for things which used to be allocated via zalloci()).

masking sounds scary to me - i don't really want to make the problem less
likely by, say 1 : 10^3 or so :) i would much rather not have any data
corrupted at all.
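for reference, the kind of test mentioned above (copying checksummed data
between disks and checking it afterwards) could be sketched roughly as
follows. this is only an illustration, not the actual procedure used in this
thread: the function names sha256_of and copy_and_verify, the use of SHA-256
and the 1 MB chunk size are all assumptions made for the example.

```python
import hashlib
import os
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src_dir, dst_dir):
    """Copy every file under src_dir into dst_dir.

    Returns the relative paths of files whose checksum after the copy
    differs from the checksum taken before it (hypothetical helper).
    """
    mismatches = []
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            src = os.path.join(root, name)
            rel = os.path.relpath(src, src_dir)
            dst = os.path.join(dst_dir, rel)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            expected = sha256_of(src)      # checksum before the copy
            shutil.copyfile(src, dst)
            if sha256_of(dst) != expected:  # checksum after the copy
                mismatches.append(rel)
    return mismatches
```

an empty list from copy_and_verify("/disk1/data", "/disk2/data") (both paths
are placeholders) would mean no mismatch was seen in that run. a re-read right
after the copy may be served from the buffer cache, so a data set much larger
than the installed RAM is the more meaningful test.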
> > does it make sense for me to try bosko's patch ?
>
> Yes. It fixes the problem, according to his testing. He
> posted the URL for it a while back, or you can contact him
> directly.

ok, i'll find it - what i wanted to ask is, if that patch is likely to
make _more_ problems go away than those two kernel options.

> > can i hope for any better results (i don't really care about
> > performance, only data integrity) with it than with those
> > two kernel options ?!
>
> Yes, if that's the source of your problems. As you pointed
> out, there's a small but finite chance it's bad RAM, or a
> problem with the motherboard, etc. The way to find out is
> to try the offending RAM again, with a kernel with those
> options, and see if it happens (this assumes that you were
> able to trigger it fairly reliably before; negative evidence
> is really only anecdotal, without a regression test case, so
> if it only happened once in a great while, it not happening in
> a week or a month would prove nothing).

i guess i can manage to get another 256mb sdr ram into that box
temporarily by next week, if nothing better comes up - just to check.

thanks,

regards,

Heiko

--
Free Software. Why put up with inferior code and antisocial
corporations? http://www.gnu.org/philosophy/why-free.html