Date: Thu, 9 Mar 2017 15:57:06 -0800 From: Bryan Drewery <bdrewery@FreeBSD.org> To: Konstantin Belousov <kostikbel@gmail.com> Cc: current@FreeBSD.org Subject: Re: r314708: panic: tdsendsignal: ksi on queue Message-ID: <b7068678-e593-f4ed-647c-036ff0bc0576@FreeBSD.org> In-Reply-To: <5e56d3d6-9e92-1b50-f720-ff16c58b74dd@FreeBSD.org> References: <d510a9da-8293-ba22-a1e6-75b3ea7ffa1d@FreeBSD.org> <20170309144646.GB16105@kib.kiev.ua> <20170309231151.GA49720@stack.nl> <5e56d3d6-9e92-1b50-f720-ff16c58b74dd@FreeBSD.org>
next in thread | previous in thread | raw e-mail | index | archive | help
This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --6oKB8GMj0DjIucTvXVtofbvdNFRn8VAKd Content-Type: multipart/mixed; boundary="N5VRPxGWQn1N9mKI3n26Sr4waQKxOLkPd"; protected-headers="v1" From: Bryan Drewery <bdrewery@FreeBSD.org> To: Konstantin Belousov <kostikbel@gmail.com> Cc: current@FreeBSD.org Message-ID: <b7068678-e593-f4ed-647c-036ff0bc0576@FreeBSD.org> Subject: Re: r314708: panic: tdsendsignal: ksi on queue References: <d510a9da-8293-ba22-a1e6-75b3ea7ffa1d@FreeBSD.org> <20170309144646.GB16105@kib.kiev.ua> <20170309231151.GA49720@stack.nl> <5e56d3d6-9e92-1b50-f720-ff16c58b74dd@FreeBSD.org> In-Reply-To: <5e56d3d6-9e92-1b50-f720-ff16c58b74dd@FreeBSD.org> --N5VRPxGWQn1N9mKI3n26Sr4waQKxOLkPd Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: quoted-printable On 3/9/2017 3:47 PM, Bryan Drewery wrote: > On 3/9/2017 3:11 PM, Jilles Tjoelker wrote: >> On Thu, Mar 09, 2017 at 04:46:46PM +0200, Konstantin Belousov wrote: >>> Yes, there is a race, apparently, with the child zombie still not fin= ishing >>> sending the SIGCHLD to the parent and parent exiting. The following = should >>> fix the issue, but I do not think that reproducing the problem is eas= y. >> >>> diff --git a/sys/kern/kern_exit.c b/sys/kern/kern_exit.c >>> index c524fe5df37..ba5ff84e9de 100644 >>> --- a/sys/kern/kern_exit.c >>> +++ b/sys/kern/kern_exit.c >>> @@ -189,6 +189,7 @@ exit1(struct thread *td, int rval, int signo) >>> { >>> struct proc *p, *nq, *q, *t; >>> struct thread *tdt; >>> + ksiginfo_t ksi; >>> =20 >>> mtx_assert(&Giant, MA_NOTOWNED); >>> KASSERT(rval =3D=3D 0 || signo =3D=3D 0, ("exit1 rv %d sig %d", rva= l, signo)); >>> @@ -456,7 +457,12 @@ exit1(struct thread *td, int rval, int signo) >>> proc_reparent(q, q->p_reaper); >>> if (q->p_state =3D=3D PRS_ZOMBIE) { >>> PROC_LOCK(q->p_reaper); >>> - pksignal(q->p_reaper, SIGCHLD, q->p_ksi); >>> + if (q->p_ksi !=3D NULL) { >>> + ksiginfo_init(&ksi); >>> + ksiginfo_copy(q->p_ksi, &ksi); >>> + } >>> + pksignal(q->p_reaper, SIGCHLD, q->p_ksi !=3D >>> + NULL ? &ksi : NULL); >>> PROC_UNLOCK(q->p_reaper); >>> } >>> } else { >=20 > I just got something weird with this patch that wasn't happening before= : >=20 > /usr/bin/time -l src/bin/poudriere -e /usr/local/etc testport -j > exp-10amd64 -p commit -z test devel/ccache > [poudriere runs and completes with exit status 0] >> time: command terminated abnormally = =20 >> 28.08 real 9.92 user 10.38 sys = =20 >> 23464 maximum resident set size = =20 >> 4996 average shared memory size = =20 >> 88 average unshared data size = =20 >> 127 average unshared stack size = =20 >> 282705 page reclaims = =20 >> 5623 page faults = =20 >> 0 swaps = =20 >> 2673 block input operations = =20 >> 4836 block output operations = =20 >> 33 messages sent = =20 >> 0 messages received = =20 >> 37 signals received = =20 >> 11226 voluntary context switches = =20 >> 780 involuntary context switches = =20 >> zsh: alarm /usr/bin/time -l src/bin/poudriere -e /usr/local/etc t= estport -j exp-10amd64 > exit status: 142 (SIGALRM). >=20 > I don't see time(1) using SIGALRM or proc reaper at all. >=20 > Rerunning it, and trying other simpler test cases, does not produce the= > same result. It may be some race unrelated to this patch, dunno. >=20 I'm consistently getting foreground processes getting the wrong signals now. I'm removing this patch for now. >=20 >> >> This patch introduces a subtle correctness bug. A real SIGCHLD ksiginf= o >> should always be the zombie's p_ksi; otherwise, the siginfo may be los= t >> if there are too many signals pending for the target process or in the= >> system. If the siginfo is lost and the reaper normally passes si_pid t= o >> waitpid() or similar (instead of passing WAIT_ANY or P_ALL), a zombie >> will remain until the reaper terminates. >> >> Conceptually the siginfo is sent to one process at a time only, so the= >> bug is an artifact of the implementation. Perhaps the piece of code >> added in r309886 can be moved or the ksiginfo can be removed from the >> parent's queue. >> >> If such a fix is not possible, it may be better to send a bare SIGCHLD= >> (si_code is SI_KERNEL or 0, depending on how many signals are pending)= >> in this situation and document that reapers must use WAIT_ANY or P_ALL= =2E >> (However, compared to the pre-r309886 situation they can still use >> SIGCHLD to get notified when to call waitpid() or similar.) >> >=20 >=20 --=20 Regards, Bryan Drewery --N5VRPxGWQn1N9mKI3n26Sr4waQKxOLkPd-- --6oKB8GMj0DjIucTvXVtofbvdNFRn8VAKd Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQEcBAEBAgAGBQJYwevSAAoJEDXXcbtuRpfPOMAH/ihoupaKtVFBnbUb+BNBmeq6 cfYTx2JXxnoK6Hlok/pIUIwrpstUUzIGPfVpBQE8j/JM7tnYkDdsw3KNtdF5BAsu zx/oIPgv3FH3U0IkT1G+/MXUFXlb4pC+qhM47SJkm/2KfkZGRwgB9EwXVGovDpAj yHg4IwHfhCrYXXWjr5KSU3aqBkArFPrjG7NtA+yJ83DQjO+GhLvJcCrNwi0FJd5G fq4F4CwYWCz0klnPv9jU1BsO28P+b/U5iov0mYFG6g1cnYY0bltmrfme5tmnRUrv KSHGpWPlIQ7qO/v66e18YioGY8BGCITm3W/F90FwFZUrEqpaNxQngVSOEBHYmnc= =o3Z9 -----END PGP SIGNATURE----- --6oKB8GMj0DjIucTvXVtofbvdNFRn8VAKd--
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?b7068678-e593-f4ed-647c-036ff0bc0576>