From: Don Lewis
To: shoesoft@gmx.net
cc: current@FreeBSD.org
Date: Fri, 12 Dec 2003 10:32:03 -0800 (PST)
Subject: Re: kernel pointer polka, possibly by mount_nfs
Message-Id: <200312121832.hBCIW3eF058641@gw.catspoiler.org>
In-Reply-To: <1071223849.1494.21.camel@shoeserv.freebsd>

On 12 Dec, Stefan Ehmann wrote:
> On Thu, 2003-12-11 at 07:49, Don Lewis wrote:
>>
>> That sounds somewhat like the Heisenbug I've been on the hunt for in
>> the last few weeks.  This one liked to munch some file system's struct
>> mount, or whatever structure that mnt_data was pointing to.  The system
>> in question typically blew up when attempting to lock mnt_lock in
>> vfs_busy().  The trigger appeared to be the use of read-only ext2fs.  The
>> user who reported this problem said that the system would panic after a
>> few hours.  After getting the user to sprinkle KASSERT()s around, I've
>> pretty much come to the conclusion that the bug is not in the code for
>> the vfs top half.  Another bit of data is that the struct mount getting
>> nuked doesn't appear to belong to ext2fs.  It's hard to tell whose it is
>> though because it gets zeroed.
>>
>> I use NFS on my two -CURRENT boxes and haven't run into any problems,
>> and I also haven't been able to reproduce any panics with ext2fs, though
>> I haven't exercised that nearly as much.
>
> I guess you are talking about my panics. Since we don't seem to make any
> progress - would it help to find out when the change that causes the
> problem was made?
>
> I was running an end-of-September kernel for nearly two months without
> having panics 3 times a day. The kernel of Nov 23 had these problems. So
> the problem should be located somewhere in these two months.
>
> Since this may take quite some time (and a lot of kernel and
> worldbuilds), I'll only take it into account if there is a good chance
> that this will reveal the source of the problem.

Unfortunately, that may be the fastest way to track down the culprit.
The only other way would be to write a more aggressive assertion checker
function that validates the integrity of all the mount structures, and
then sprinkle lots of calls to this function around the kernel.

I also diff'ed the 2003/09/23 and 2003/11/23 versions of the ext2fs code
and didn't see anything suspicious.  That means that either the culprit
change is something subtle in ext2fs, or it is elsewhere in the kernel.
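
To make the assertion-checker idea above a little more concrete, here is a
rough user-space sketch.  It is not kernel code and not an existing
FreeBSD facility: struct sketch_mount, its magic canary field, and the
SKETCH_MOUNT_CHECK()/SKETCH_MOUNT_ASSERT() macros are all hypothetical
stand-ins, modelling only a list linkage plus stand-ins for mnt_op and
mnt_data.  The point is just the shape: one function walks every mount
entry, flags anything that looks zeroed, and reports the call site so the
first failing call brackets where the damage happened.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for the kernel's struct mount.  Only the fields
 * this sketch inspects are modelled.
 */
struct sketch_mount {
	struct sketch_mount *next;	/* stand-in for the mount list linkage */
	const void *ops;		/* stand-in for mnt_op */
	void *data;			/* stand-in for mnt_data */
	unsigned magic;			/* hypothetical canary, not a real field */
};

#define SKETCH_MOUNT_MAGIC 0x4d4e5421u

/*
 * Walk the list and count entries that look damaged.  A zeroed entry
 * also has a NULL next pointer, so the walk ends there, but the entry
 * itself is still counted and reported.
 */
static int
sketch_mount_check(const struct sketch_mount *head, const char *file, int line)
{
	const struct sketch_mount *mp;
	int bad = 0;

	for (mp = head; mp != NULL; mp = mp->next) {
		if (mp->magic != SKETCH_MOUNT_MAGIC ||
		    mp->ops == NULL || mp->data == NULL) {
			fprintf(stderr, "%s:%d: mount %p looks corrupted\n",
			    file, line, (const void *)mp);
			bad++;
		}
	}
	return (bad);
}

/* Record the call site automatically so each sprinkled call is cheap to add. */
#define SKETCH_MOUNT_CHECK(head) \
	sketch_mount_check((head), __FILE__, __LINE__)

/* Stop at the call site, in the spirit of KASSERT(). */
#define SKETCH_MOUNT_ASSERT(head) \
	assert(SKETCH_MOUNT_CHECK(head) == 0)
```

A real version would have to walk the kernel's actual mount list with the
appropriate locking and check the real struct mount fields.  The canary is
only there because a zeroed structure is otherwise hard to tell apart from
a freshly allocated one.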