From: Dan Nelson
To: Jason Barnes
cc: questions@freebsd.org
Date: Thu, 30 Sep 2004 20:29:32 -0500
Subject: Re: process will not die.

In the last episode (Sep 30), Jason Barnes said:
> While running an mpirun job on my dual-processor SMP system
> (FreeBSD 4-STABLE from August 28), my program (initiated with the
> command line 'mpirun -np 2 ../sphagr') periodically dies, leaving a
> process that I can't kill -9. Here's the top:
>
> here's ps -auxw | grep sph:
>
> jbarnes 549 0.0 8.7 410076 90744 p2 R 3:39PM 3:01.97 sphagr -p4pg /usr/home/
> jbarnes 550 0.0 0.0 0 0 p2 Z 3:39PM 0:00.00 (sphagr)
>
> The 550 process I kill -9ed, but its still there, and now when I
> try to kill it it says 'no such process'.

Processes in the Z state have already exited, which is why kill -9 has no effect on them, but their parent process has not retrieved their status with one of the wait*() functions. The entry in the process table will stay until that happens. You can run "ps axlp 550" and look at the PPID column to determine the parent's pid.

The parent code needs to either wait() for the child status or, if it doesn't need to know when the child exits, ignore SIGCHLD or set the SA_NOCLDWAIT flag with sigaction().

-- 
Dan Nelson
dnelson@allantgroup.com