From owner-freebsd-current@FreeBSD.ORG  Fri Dec 12 02:10:52 2003
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 1C0A316A4CE
	for <current@FreeBSD.org>; Fri, 12 Dec 2003 02:10:52 -0800 (PST)
Received: from email01.aon.at (WARSL402PIP8.highway.telekom.at [195.3.96.97])
	by mx1.FreeBSD.org (Postfix) with SMTP id A8B3C43D09
	for <current@FreeBSD.org>; Fri, 12 Dec 2003 02:10:48 -0800 (PST)
	(envelope-from shoesoft@gmx.net)
Received: (qmail 36372 invoked from network); 12 Dec 2003 10:10:47 -0000
Received: from m118p012.dipool.highway.telekom.at (HELO ?62.46.4.172?)
	([62.46.4.172]) (envelope-sender <shoesoft@gmx.net>)
	by qmail1rs.highway.telekom.at (qmail-ldap-1.03) with SMTP
	for <truckman@FreeBSD.org>; 12 Dec 2003 10:10:47 -0000
From: Stefan Ehmann <shoesoft@gmx.net>
To: Don Lewis <truckman@FreeBSD.org>
In-Reply-To: <200312110649.hBB6nDeF054514@gw.catspoiler.org>
References: <200312110649.hBB6nDeF054514@gw.catspoiler.org>
Content-Type: text/plain
Message-Id: <1071223849.1494.21.camel@shoeserv.freebsd>
Mime-Version: 1.0
X-Mailer: Ximian Evolution 1.4.5 
Date: Fri, 12 Dec 2003 11:10:50 +0100
Content-Transfer-Encoding: 7bit
cc: current@FreeBSD.org
Subject: Re: kernel pointer polka, possibly by mount_nfs

On Thu, 2003-12-11 at 07:49, Don Lewis wrote:
> On 10 Dec, Poul-Henning Kamp wrote:
> > 
> > I have a 100% reproducible case here where it looks like mount_nfs
> > tramples on the softc of a led(4) device.
> > 
> > Stock -current kernel, HZ=1000, I've added a couple of sanity-checks
> > in the timeout routine of led(4) and they trigger reliably on a
> > byte which should not have been zero.
> > 
> > In all cases so far, the currently running program is mount_nfs run
> > from /etc/rc.mumble somewhere.
> > 
> > The machine is a Soekris 4501 booting diskless.
> > 
> > I have also seen a reproducible page fault panic in in_pcbremlist()
> > if I put "set -x" as the second line in /etc/rc on the same machine,
> > it smells the same to me.
> > 
> > This problem likely affects 5.2-WHATEVER as well, and could be
> > responsible for other Heisenbugs, and could be considered a
> > showstopper.
> 
> That sounds somewhat like the Heisenbug I've been on the hunt for in
> the last few weeks.  This one liked to munch some file system's struct
> mount, or whatever structure that mnt_data was pointing to.  The system
> in question typically blew up when attempting to lock mnt_lock in
> vfs_busy().  The trigger appeared to be the use of read-only ext2fs. The
> user who reported this problem said that the system would panic after a
> few hours.  After getting the user to sprinkle KASSERT()s around, I've
> pretty much come to the conclusion that the bug is not in the code for the
> vfs top half.  Another bit of data is that the struct mount getting
> nuked doesn't appear to belong to ext2fs.  It's hard to tell whose it is
> though because it gets zeroed.
> 
> I use NFS on my two -CURRENT boxes and haven't run into any problems,
> and I also haven't been able to reproduce any panics with ext2fs, though
> I haven't exercised that nearly as much.

I guess you are talking about my panics. Since we don't seem to be making
any progress, would it help to find out when the change that causes the
problem was made?

I ran an end-of-September kernel for nearly two months without these
panics, which now happen about 3 times a day. The Nov 23 kernel has the
problem, so the change that introduced it should be somewhere in those
two months.

Since this may take quite some time (and a lot of kernel and world
builds), I'll only do it if there is a good chance that it will reveal
the source of the problem.
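
To give a rough idea of the effort: the search could be done as a binary
search over source dates. Below is a minimal sketch in Python, not a real
tool. The start date of 2003-09-30 is an assumption (I only know it was an
end-of-September kernel), the "culprit" date is made up for the demo, and
is_bad() is a hypothetical stand-in for "check out sources as of that date,
build kernel and world, boot, and run long enough to see whether it
panics". It also assumes the problem never goes away again once it has
been introduced.

```python
from datetime import date, timedelta


def bisect_dates(good, bad, is_bad):
    """Find the first date whose kernel shows the problem.

    Assumes is_bad(good) is False, is_bad(bad) is True, and that every
    date after the first bad one is bad as well.
    Returns (first_bad_date, list_of_dates_tested).
    """
    tested = []
    while (bad - good).days > 1:
        # Test the date halfway between the known-good and known-bad ones.
        mid = good + timedelta(days=(bad - good).days // 2)
        tested.append(mid)
        if is_bad(mid):
            bad = mid      # problem already present: look earlier
        else:
            good = mid     # still fine: look later
    return bad, tested


GOOD = date(2003, 9, 30)      # assumed "end of September" kernel
BAD = date(2003, 11, 23)      # kernel known to panic
CULPRIT = date(2003, 10, 20)  # made-up date, only for this demo

first_bad, tested = bisect_dates(GOOD, BAD, lambda d: d >= CULPRIT)
print(first_bad, len(tested))
```

With a window of 54 days this needs at most 6 builds to get down to a
single day, but each step also means running that kernel long enough to be
reasonably sure it does not panic.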