From owner-freebsd-current@FreeBSD.ORG  Sat May 10 11:44:28 2003
Date: Sat, 10 May 2003 20:44:22 +0200 (CEST)
From: Heiko Schaefer <hschaefer@fto.de>
X-X-Sender: heiko@daneel.foundation.hs
To: Terry Lambert <tlambert2@mindspring.com>
In-Reply-To: <3EBD3EB0.F5F8ADF7@mindspring.com>
Message-ID: <20030510203854.E93229@daneel.foundation.hs>
References: <b9falv$209l$1@FreeBSD.csie.NCTU.edu.tw>
	<3EBC6C6A.1040602@myrealbox.com>
	<20030510130934.R93229@daneel.foundation.hs>
	<3EBD3EB0.F5F8ADF7@mindspring.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Virus-Scanned: by amavisd-new at fto.de
cc: freebsd-current@freebsd.org
Subject: Re: data corruption with current (maybe sis chipset related?)

Hey Terry,

> > > walt wrote:
> > > > Do I recall from some months ago that this bug would not
> > > > affect machines with less than a gig of RAM?
> > >
> > > The amount of memory at which you see it depends on the processor
> > > features.  Now that autotuning is in, there's a stair-step for
> > > how much the system uses for each resource pool, based on how
> > > much RAM is in the system.  It's quite unpredictable where it will
> > > show up in -current, because of this (and the new memory allocator).
> > >
> > > Basically, the problem will show wherever the memory size vs.
> > > memory utilization tickles it (that's why upping maxfiles was
> > > enough to scare it off, before the tuning/allocator changes
> > > went in).

> > - i still have an issue with the system because of which i started this
> > thread:
> >
> > originally, i bought a 512mb ddr ram for it (not the cheapest kind, but
> > also nothing fancy - the chips say infineon). with that ram i still
> > experience data corruption.
> >
> > when i reported that the problem had disappeared, i was running off an
> > sdr pc133 ram which is only 256mb.
> >
> > what i wonder now: is the physical 512mb ram possibly damaged (or not
> > interacting well with the board or bios), or could that yet again be a
> > general (software-solvable) issue (which i would likely experience
> > whenever i have 512mb of ram in that machine. regardless of make) ?
>
> It's possible that the RAM was damaged, but unlikely.
>
> If you revert to a DP2 kernel (or any kernel before Jeff's
> allocator changes AND Matt's autotuning changes), you should
> be able to trigger this problem fairly easily with anything
> that causes a lot of page thrashing right after system boot,
> as long as you pick the right amount of RAM to install for
> the CPU features of the CPU you are using.
>
> > if the problem is likely to go away with another 512mb ram, i will go to
> > get the ram changed on monday - otherwise, i'd like to spare myself and
> > the vendor the trouble :) ... especially myself *g*
>
> It might.  It might not.  When I first saw the problem, it
> didn't occur on 512M, and it didn't occur on 2G, but it did
> occur on 1G.  This was a SuperMicro running a PIII.  The
> behaviour's going to be different for different CPU features,
> unfortunately.

i'm sorry, my mail was probably a bit confusing.
since it has been pointed out to me, i am running -current kernels with

options               DISABLE_PSE
options               DISABLE_PG_G

enabled.

what i am asking myself:
is there any chance that, even with these options, i could still get data
corruption in some configuration because of the issues you describe ?!

because with the 512mb (ddr) ram (which might or might not be defective) i
get data corruption, while with another 256mb (sdr) ram, i apparently
don't.

so far i had the impression that my test (copying >30gb of checksummed
data between disks) shows these problems rather reliably.
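
A minimal sketch of a copy-and-verify test of this kind. The mail does not say
which tool or checksum was actually used, so SHA-256 and the helper names
`file_digest` and `copy_and_verify` are assumptions made for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def copy_and_verify(src_dir, dst_dir):
    """Copy every file under src_dir to dst_dir and re-hash each copy.

    Returns the relative paths whose copy does not match the source digest.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    mismatches = []
    for src in sorted(p for p in src_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(src_dir)
        dst = dst_dir / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        expected = file_digest(src)      # checksum before the copy
        shutil.copyfile(src, dst)
        if file_digest(dst) != expected:  # checksum after the copy
            mismatches.append(rel)
    return mismatches
```

An empty list from `copy_and_verify("/disk1/data", "/disk2/data")` (both paths
hypothetical) means no mismatch was seen in that run. With small data sets the
re-read of the copy may be served from the buffer cache instead of the disk, so
a test like this probably only says much when the data is far larger than RAM.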

> Alternately, disable auto-tuning by setting MAXUSERS to some
> value (preferably equal to or larger than the pre-auto-tune
> value), and then set maxfiles to 50000 or more.  This should
> also mask the problem (though I don't know this for sure,
> given Jeff's allocator changes not preallocating the page
> maps for things which used to be allocated via zalloci()).

masking sounds scary to me - i don't really want to just make the problem
less likely, by a factor of, say, 10^3 or so :)
i would much rather not have any data corrupted at all.

> > does it make sense for me to try bosko's patch ?
>
> Yes.  It fixes the problem, according to his testing.  He
> posted the URL for it a while back, or you can contact him
> directly.

ok, i'll find it - what i wanted to ask is whether that patch is likely to
make _more_ problems go away than those two kernel options.

> > can i hope for any better results (i don't really care about
> > performance, only data integrity) with it than with those
> > two kernel options ?!
>
> Yes, if that's the source of your problems.  As you pointed
> out, there's a small but finite chance it's bad RAM, or a
> problem with the motherboard, etc..  The way to find out is
> to try the offending RAM again, with a kernel with those
> options, and see if it happens (this assumes that you were
> able to trigger it fairly reliably before; negative evidence
> is really only anecdotal, without a regression test case, so
> if it only happened once in a great while, it not happening in
> a week or a month would prove nothing).
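
The point about negative evidence can be made concrete with a small
calculation. This is only a sketch under a strong assumption that is not in
the mail: each test run triggers the corruption independently with the same
probability `p_fail`. The function names are hypothetical.

```python
import math


def clean_run_probability(p_fail, runs):
    """Chance that `runs` independent runs all pass although the bug exists."""
    return (1.0 - p_fail) ** runs


def runs_needed(p_fail, confidence=0.95):
    """Smallest number of clean runs after which a still-present bug would
    have shown up at least once with probability >= confidence."""
    if not 0.0 < p_fail < 1.0:
        raise ValueError("p_fail must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))
```

Under that assumption, a bug that fires in half of all runs needs only a
handful of clean runs to be ruled out with some confidence, while one that
fires rarely needs so many runs that a quiet week says very little.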

i guess i can manage to get another 256mb sdr ram into that box
temporarily by next week, if nothing better comes up - just to check.

thanks, regards,

Heiko

-- 
Free Software. Why put up with inferior code and antisocial corporations?
http://www.gnu.org/philosophy/why-free.html