Date: Thu, 30 Mar 2000 07:39:03 -0800 (PST)
From: John Polstra <jdp@polstra.com>
To: ben@scientia.demon.co.uk
Cc: stable@freebsd.org, gerti-freebsds@bitart.com
Subject: Re: Random signal 9 (SIGKILL), please help!
Message-ID: <200003301539.HAA07006@vashon.polstra.com>
In-Reply-To: <20000329133710.E96553@strontium.scientia.demon.co.uk>
References: <20000329041104.3028.qmail@camelot.bitart.com> <20000328213754.L21029@fw.wintelcom.net> <20000329102024.3950.qmail@camelot.bitart.com> <20000329133710.E96553@strontium.scientia.demon.co.uk>
In article <20000329133710.E96553@strontium.scientia.demon.co.uk>,
Ben Smithurst <ben@scientia.demon.co.uk> wrote:
> Gerd Knops wrote:
>
> > I think I found a correlation between pid roll over (from 99999
> > to 0) and the spurious signals. Some program seems to keep
> > tabs on pids that already went away, and when they 'come back' they
> > are killed again. I am suspicious of syslogd at the moment (I pipe
> > syslog output through a filter), one of the very few programs in the
> > base system that are running on those systems and that uses SIGKILL.
> >
> > However it will probably take some time before I can wrap my head
> > around that code, it's not exactly heavily commented... If anyone
> > with more intimate knowledge could have a look I'd appreciate that.
>
> Apparently syslogd could do this, I think a fix was committed recently,
> before RELENG_4 was branched, and it was fixed in RELENG_3 as well
> (possibly RELENG_2_2 even).

Yes, I bet Ben is right.  Here's part of the log message for the fix
("src/usr.sbin/syslogd/syslogd.c"):

revision 1.58
date: 2000/02/28 17:49:43;  author: joerg;  state: Exp;  lines: +97 -36
Fix a serious bug in syslogd regarding the handling of pipes.  The bug
would cause syslogd to eventually kill innocent processes in the
system over time (note: not `could' but `would').

Many thanks to my colleague Mirko for digging into the kernel
structures and providing me with the debugging framework to find out
about the nature of this bug (and to isolate that syslogd was the
culprit) in a rather large set of distributed machines at client sites
where this happened occasionally.

Whenever a child process was no longer responsive, or when syslogd
receives a SIGHUP so it closes all its logging file descriptors, for
any descriptor that refers to a pipe syslogd enters the data about the
old logging child process into a `dead queue', where it is being
removed from (and the status of the dead kitten being fetched) upon
receipt of a SIGCHLD.
However, there's a high probability that the SIGCHLD
already arrives before the child's data are actually entered into the
dead queue inside the SIGHUP handler, so the SIGCHLD handler has
nothing to fetch and remove and simply continues.  Whenever this
happens, the process's data remain on the dead queue forever, and
since domark() tried to get rid of totally unresponsive children by
first sending a SIGTERM and later a SIGKILL, it was only a matter of
time until the system had recycled enough PIDs so an innocent process
got shot to death.

Fix the race by masking SIGHUP and SIGCHLD from both handlers mutually.

Add additional bandaids ``just in case'', i.e. don't enter a process
into the dead queue if we can't signal it (this should only happen in
case it is already dead by that time so we can fetch the status
immediately instead of deferring this to the SIGCHLD handler); for the
kill(2) inside domark(), check for an error status (/* Can't happen */ :)
and remove it from the dead queue in this case (which if it would have
been there in the first place would have reduced the problem to a
statistically minimal likelihood so I certainly would never have
noticed the bug at all :).

John
--
  John Polstra                                               jdp@polstra.com
  John D. Polstra & Co., Inc.                        Seattle, Washington USA
  "Disappointment is a good sign of basic intelligence."  -- Chögyam Trungpa

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-stable" in the body of the message