From: Bryan Drewery
To: Konstantin Belousov
Cc: current@FreeBSD.org
Subject: Re: r314708: panic: tdsendsignal: ksi on queue
Date: Thu, 9 Mar 2017 15:47:32 -0800
Organization: FreeBSD
Message-ID: <5e56d3d6-9e92-1b50-f720-ff16c58b74dd@FreeBSD.org>
In-Reply-To: <20170309231151.GA49720@stack.nl>
References: <20170309144646.GB16105@kib.kiev.ua> <20170309231151.GA49720@stack.nl>
List-Id: Discussions about the use of FreeBSD-current

On 3/9/2017 3:11 PM, Jilles Tjoelker wrote:
> On Thu, Mar 09, 2017 at 04:46:46PM +0200, Konstantin Belousov wrote:
>> Yes, there is a race, apparently, with the child zombie still not finishing
>> sending the SIGCHLD to the parent and parent exiting.
>> The following should fix the issue, but I do not think that reproducing
>> the problem is easy.
>
>> diff --git a/sys/kern/kern_exit.c b/sys/kern/kern_exit.c
>> index c524fe5df37..ba5ff84e9de 100644
>> --- a/sys/kern/kern_exit.c
>> +++ b/sys/kern/kern_exit.c
>> @@ -189,6 +189,7 @@ exit1(struct thread *td, int rval, int signo)
>>  {
>>  	struct proc *p, *nq, *q, *t;
>>  	struct thread *tdt;
>> +	ksiginfo_t ksi;
>>  
>>  	mtx_assert(&Giant, MA_NOTOWNED);
>>  	KASSERT(rval == 0 || signo == 0, ("exit1 rv %d sig %d", rval, signo));
>> @@ -456,7 +457,12 @@ exit1(struct thread *td, int rval, int signo)
>>  			proc_reparent(q, q->p_reaper);
>>  			if (q->p_state == PRS_ZOMBIE) {
>>  				PROC_LOCK(q->p_reaper);
>> -				pksignal(q->p_reaper, SIGCHLD, q->p_ksi);
>> +				if (q->p_ksi != NULL) {
>> +					ksiginfo_init(&ksi);
>> +					ksiginfo_copy(q->p_ksi, &ksi);
>> +				}
>> +				pksignal(q->p_reaper, SIGCHLD, q->p_ksi !=
>> +				    NULL ? &ksi : NULL);
>>  				PROC_UNLOCK(q->p_reaper);
>>  			}
>>  		} else {

I just got something weird with this patch that wasn't happening before:

/usr/bin/time -l src/bin/poudriere -e /usr/local/etc testport -j exp-10amd64 -p commit -z test devel/ccache
[poudriere runs and completes with exit status 0]
> time: command terminated abnormally
>         28.08 real          9.92 user         10.38 sys
>      23464  maximum resident set size
>       4996  average shared memory size
>         88  average unshared data size
>        127  average unshared stack size
>     282705  page reclaims
>       5623  page faults
>          0  swaps
>       2673  block input operations
>       4836  block output operations
>         33  messages sent
>          0  messages received
>         37  signals received
>      11226  voluntary context switches
>        780  involuntary context switches
> zsh: alarm      /usr/bin/time -l src/bin/poudriere -e /usr/local/etc testport -j exp-10amd64

exit status: 142 (SIGALRM).

I don't see time(1) using SIGALRM or proc reaper at all.
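The 142 reported by zsh follows the usual shell convention of reporting death-by-signal as 128 plus the signal number; SIGALRM is 14 on FreeBSD (and Linux), so 128 + 14 = 142. A minimal userland sketch of that decoding, using only POSIX fork(2)/waitpid(2) and the W* macros. The helper names `shell_status_for_signal` and `child_termsig` are hypothetical and are not part of time(1), zsh or poudriere:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: the exit status a POSIX-style shell shows for a signal death. */
static int
shell_status_for_signal(int sig)
{
	assert(sig > 0 && sig < 128);
	return (128 + sig);
}

/*
 * Hypothetical helper: fork a child that kills itself with `sig`, then
 * return the signal number the parent decodes from waitpid(2), or -1.
 */
static int
child_termsig(int sig)
{
	sigset_t set;
	pid_t pid;
	int status;

	pid = fork();
	if (pid == -1)
		return (-1);
	if (pid == 0) {
		/* Child: default action (terminate), make sure sig is not blocked. */
		signal(sig, SIG_DFL);
		sigemptyset(&set);
		sigaddset(&set, sig);
		sigprocmask(SIG_UNBLOCK, &set, NULL);
		raise(sig);
		_exit(0);	/* Not reached for a terminating signal. */
	}
	if (waitpid(pid, &status, 0) != pid)
		return (-1);
	/* The kernel only reports the raw signal number; 128+n is the shell's doing. */
	return (WIFSIGNALED(status) ? WTERMSIG(status) : -1);
}
```

For example, `shell_status_for_signal(child_termsig(SIGALRM))` evaluates to 142 wherever SIGALRM is 14, matching the status zsh printed. The 128 offset is a shell convention, not something waitpid(2) hands back.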
Rerunning it, and trying other simpler test cases, does not produce the
same result. It may be some race unrelated to this patch, dunno.

>
> This patch introduces a subtle correctness bug. A real SIGCHLD ksiginfo
> should always be the zombie's p_ksi; otherwise, the siginfo may be lost
> if there are too many signals pending for the target process or in the
> system. If the siginfo is lost and the reaper normally passes si_pid to
> waitpid() or similar (instead of passing WAIT_ANY or P_ALL), a zombie
> will remain until the reaper terminates.
>
> Conceptually the siginfo is sent to one process at a time only, so the
> bug is an artifact of the implementation. Perhaps the piece of code
> added in r309886 can be moved or the ksiginfo can be removed from the
> parent's queue.
>
> If such a fix is not possible, it may be better to send a bare SIGCHLD
> (si_code is SI_KERNEL or 0, depending on how many signals are pending)
> in this situation and document that reapers must use WAIT_ANY or P_ALL.
> (However, compared to the pre-r309886 situation they can still use
> SIGCHLD to get notified when to call waitpid() or similar.)
>

-- 
Regards,
Bryan Drewery
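Jilles's fallback amounts to reapers not trusting si_pid from SIGCHLD and instead draining every ready child with WAIT_ANY (or P_ALL via waitid(2)) whenever SIGCHLD shows up. A minimal sketch of such a drain loop, assuming an ordinary parent process: acquiring reaper status with FreeBSD's procctl(2) PROC_REAP_ACQUIRE is deliberately left out so the sketch builds on any POSIX system, and `reap_children`/`spawn_children` are hypothetical names:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Reap every child that is ready, without trusting si_pid from SIGCHLD.
 * A pid of -1 is what <sys/wait.h> names WAIT_ANY.  Pass WNOHANG after
 * noticing SIGCHLD (returns once no further zombies are ready); pass 0
 * to block until no children remain.  Returns the number reaped.
 */
static int
reap_children(int options)
{
	int n = 0, status;
	pid_t pid;

	for (;;) {
		pid = waitpid(-1, &status, options);
		if (pid > 0) {
			n++;
			continue;
		}
		if (pid == -1 && errno == EINTR)
			continue;	/* Interrupted by a signal: retry. */
		break;	/* 0: none ready (WNOHANG); -1/ECHILD: none left. */
	}
	return (n);
}

/* Hypothetical demo helper: fork `count` children that exit immediately. */
static int
spawn_children(int count)
{
	int i;
	pid_t pid;

	assert(count >= 0);
	for (i = 0; i < count; i++) {
		pid = fork();
		if (pid == -1)
			return (i);
		if (pid == 0)
			_exit(0);
	}
	return (count);
}
```

Because SIGCHLD deliveries can coalesce (and, per the discussion above, the siginfo itself can be lost), the count of signals need not match the count of exited children; looping until waitpid() reports nothing left is what keeps zombies from lingering. In a real reaper the WNOHANG form would typically run from the main loop after a flag set by the handler, or after a kqueue EVFILT_SIGNAL event, rather than trusting any single siginfo.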