From owner-freebsd-arch Mon Oct 7 14:23:34 2002 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 5FEBD37B404; Mon, 7 Oct 2002 14:23:32 -0700 (PDT) Received: from gw.catspoiler.org (217-ip-163.nccn.net [209.79.217.163]) by mx1.FreeBSD.org (Postfix) with ESMTP id 2EB0D43EA9; Mon, 7 Oct 2002 14:23:31 -0700 (PDT) (envelope-from dl-freebsd@catspoiler.org) Received: from mousie.catspoiler.org (mousie.catspoiler.org [192.168.101.2]) by gw.catspoiler.org (8.12.5/8.12.5) with ESMTP id g97LNGvU033246; Mon, 7 Oct 2002 14:23:20 -0700 (PDT) (envelope-from dl-freebsd@catspoiler.org) Message-Id: <200210072123.g97LNGvU033246@gw.catspoiler.org> Date: Mon, 7 Oct 2002 14:23:16 -0700 (PDT) From: Don Lewis Subject: Re: [jmallett@FreeBSD.org: [PATCH] Reliable signal queues, etc., To: jhb@FreeBSD.ORG Cc: arch@FreeBSD.ORG, jmallett@FreeBSD.ORG In-Reply-To: MIME-Version: 1.0 Content-Type: TEXT/plain; charset=us-ascii Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On 7 Oct, John Baldwin wrote: > > On 05-Oct-2002 Don Lewis wrote: >> On 5 Oct, Juli Mallett wrote: >> >>> diff -Nrdu -x *CVS* -x *dev* sys/kern/kern_exit.c kernel/kern/kern_exit.c >>> --- sys/kern/kern_exit.c Tue Oct 1 12:15:51 2002 >>> +++ kernel/kern/kern_exit.c Sat Oct 5 01:20:57 2002 >> >>> @@ -209,12 +210,12 @@ >>> PROC_LOCK(p); >>> if (p == p->p_leader) { >>> q = p->p_peers; >>> + PROC_UNLOCK(p); >>> while (q != NULL) { >>> - PROC_LOCK(q); >>> psignal(q, SIGKILL); >>> - PROC_UNLOCK(q); >>> q = q->p_peers; >>> } >>> + PROC_LOCK(p); >>> while (p->p_peers) >>> msleep(p, &p->p_mtx, PWAIT, "exit1", 0); >>> } >> >> This scary looking fragment of code in exit1() is relying on the lock on >> p->p_leader being continuously held to prevent the p_peers list from >> changing while the list traversal is in progress. The code in >> kern_fork.c and elsewhere in kern_exit.c holds a lock on p_leader while >> the list modifications are done. >> >> The existing code looks like it could deadlock if q is locked because it >> is in fork() or exit(). Process p will block when it tries to lock q, >> and q will block when it tries to lock its p_leader, which happens to be >> p. > > Ugh. Probably the code should be changed to do something like this: > > --- kern_exit.c 2 Oct 2002 23:12:01 -0000 1.181 > +++ kern_exit.c 7 Oct 2002 18:48:18 -0000 > @@ -203,17 +203,18 @@ > */ > > p->p_flag |= P_WEXIT; > - PROC_UNLOCK(p); > > /* Are we a task leader? */ > - PROC_LOCK(p); > if (p == p->p_leader) { > q = p->p_peers; > while (q != NULL) { > + nq = q->p_peers; > + PROC_UNLOCK(p); > PROC_LOCK(q); > psignal(q, SIGKILL); > PROC_UNLOCK(q); > - q = q->p_peers; > + PROC_LOCK(p); > + q = nq; > } > while (p->p_peers) > msleep(p, &p->p_mtx, PWAIT, "exit1", 0); It's not obvious to me that your alternative is safe. It avoids the deadlock problem, but what keeps the list from changing while it is being traversed, especially while we're waiting for PROC_LOCK(q)? It separate lock for the peer list (instead of using PROC_LOCK(p_leader)) looks like the obvious fix. Grabbing the peer list lock after unlocking P would avoid the deadlock and allow us to do whatever locking is needed for psignal(). > Also, we might should check P_WEXIT and abort in fork1() if it is > set. (We don't appear to do that presently.) > Probably, but the list is also modified in the exit code. All those processes that we are sending SIGKILL to are removing themselves from the list. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message