Date: Mon, 10 Mar 2003 23:51:52 +0100
From: Thomas Moestl
To: Tim Robbins
Cc: John Baldwin, Kris Kennaway, alfred@FreeBSD.org, current@FreeBSD.org, Poul-Henning Kamp
Subject: Re: NULL pointer problem in pid selection ?
Message-ID: <20030310225151.GA2803@crow.dom2ip.de>
References: <20030308213535.GE56020@rot13.obsecurity.org> <20030311084346.A63542@dilbert.robbins.dropbear.id.au>
In-Reply-To: <20030311084346.A63542@dilbert.robbins.dropbear.id.au>

On Tue, 2003/03/11 at 08:43:46 +1100, Tim Robbins wrote:
> On Mon, Mar 10, 2003 at 01:00:15PM -0500, John Baldwin wrote:
> >
> > On 08-Mar-2003 Kris Kennaway wrote:
> > > On Sat, Mar 08, 2003 at 11:46:34AM +0100, Poul-Henning Kamp wrote:
> > >>
> > >> Just got this crash on -current, and I believe I have seen similar
> > >> before.
> > >> addr2line(1) reports the faulting address to be
> > >>     ../../../kern/kern_fork.c:395
> > >> which is in the inner loop of pid collision avoidance.
> > >
> > > I've been running this patch from Alfred for the past month or so on
> > > bento, which has fixed a similar panic I was seeing regularly.
> >
> > Using just a shared lock instead of an xlock should be ok there. You
> > aren't modifying the process tree, just looking at it. OTOH, the
> > proc lock is supposed to protect p_pgrp and p_session, so they shouldn't
> > be NULL. :(
>
> I have a suspicion that the bug is actually in wait1():
>
>     sx_xlock(&proctree_lock);
>     [...]
>         /*
>          * Remove other references to this process to ensure
>          * we have an exclusive reference.
>          */
>         leavepgrp(p);
>
>         sx_xlock(&allproc_lock);
>         LIST_REMOVE(p, p_list);    /* off zombproc */
>         sx_xunlock(&allproc_lock);
>
>         LIST_REMOVE(p, p_sibling);
>         sx_xunlock(&proctree_lock);
>
>
> Shouldn't we be removing the process from zombproc before setting
> p_pgrp to NULL via leavepgrp()? Does this even matter at all when both
> fork1() and wait1() are still protected by Giant?

Hmmm, I think you're right; if allproc_lock happens to be contested in
fork1() (which can happen because it is locked without Giant held in
some places, and because sleeping with an sx lock held is allowed), we'll
go to sleep there, dropping Giant.
This opens up a race, since wait1() can now proceed until after the
leavepgrp() before blocking; when allproc_lock is released, fork1() will
be the first to pick it up and will find a process that is still on
zombproc but whose p_pgrp is already NULL, so this panic will happen.

Seems that I relied on Giant too much when I first took a look into
that code :)

    - Thomas

--
Thomas Moestl   http://www.tu-bs.de/~y0015675/
                http://people.FreeBSD.org/~tmm/
PGP fingerprint: 1C97 A604 2BD0 E492 51D0 9C0F 1FE6 4F1D 419C 776C
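
The ordering problem discussed in this message can be modelled in userland. What follows is a minimal sketch, not kernel code: the struct layouts, the on_list flag, and the functions scan_for_collisions() and reap_with_scan_between() are all hypothetical stand-ins invented for illustration. It assumes the fork1() scan dereferences p_pgrp for every process still on a list, and it places one scan in the window between the two teardown steps, which is where fork1() is described as winning allproc_lock.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userland stand-ins; NOT the real kernel definitions. */
struct pgrp {
    int pg_id;
};

struct proc {
    int          p_pid;
    struct pgrp *p_pgrp;   /* set to NULL by the leavepgrp() stand-in */
    bool         on_list;  /* models membership in allproc/zombproc */
};

enum teardown_order {
    LEAVE_PGRP_FIRST,  /* order quoted from wait1() in the thread */
    UNLIST_FIRST       /* order suggested in the thread */
};

/*
 * Models a pid-collision scan: visit every listed process and look at
 * p_pgrp->pg_id. Returns the number of collisions with trypid, or -1
 * where a kernel would dereference a NULL p_pgrp.
 */
static int
scan_for_collisions(const struct proc *procs, size_t n, int trypid)
{
    int hits = 0;

    for (size_t i = 0; i < n; i++) {
        const struct proc *p = &procs[i];

        if (!p->on_list)
            continue;
        if (p->p_pgrp == NULL)
            return -1;
        if (p->p_pid == trypid || p->p_pgrp->pg_id == trypid)
            hits++;
    }
    return hits;
}

/*
 * Performs the two teardown steps on procs[victim] in the given order
 * and runs one scan between them. Returns the scan result.
 */
static int
reap_with_scan_between(struct proc *procs, size_t n, size_t victim,
    enum teardown_order order, int trypid)
{
    struct proc *p = &procs[victim];
    int result;

    if (order == LEAVE_PGRP_FIRST)
        p->p_pgrp = NULL;      /* leavepgrp(p) stand-in */
    else
        p->on_list = false;    /* LIST_REMOVE(p, p_list) stand-in */

    result = scan_for_collisions(procs, n, trypid);

    if (order == LEAVE_PGRP_FIRST)
        p->on_list = false;
    else
        p->p_pgrp = NULL;

    return result;
}
```

With LEAVE_PGRP_FIRST the scan reaches a listed process whose p_pgrp is already NULL and the model reports -1; with UNLIST_FIRST the victim is no longer visible to the scan when its p_pgrp is cleared, so the scan completes. The model only illustrates the ordering argument and says nothing about which locks the real fix should take.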