Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 17 Mar 2017 11:27:52 +0000
From:      Steven Hartland <killing@multiplay.co.uk>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        "K. Macy" <kmacy@freebsd.org>, "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>
Subject:   Re: Help needed to identify golang fork / memory corruption issue on FreeBSD
Message-ID:  <180a601b-5481-bb41-f7fc-67976aabe451@multiplay.co.uk>
In-Reply-To: <20170317082333.GP16105@kib.kiev.ua>
References:  <27e1a828-5cd9-0755-50ca-d7143e7df117@multiplay.co.uk> <20161206125919.GQ54029@kib.kiev.ua> <8b502580-4d2d-1e1f-9e05-61d46d5ac3b1@multiplay.co.uk> <20161206143532.GR54029@kib.kiev.ua> <e160381c-9935-6edf-04a9-1ff78e95d818@multiplay.co.uk> <CAHM0Q_Mg662u9D0KJ9knEWWqi9Ydy38qKDnjLt6XaS0ks%2B9-iw@mail.gmail.com> <18b40a69-4460-faf2-c0ce-7491eca92782@multiplay.co.uk> <20170317082333.GP16105@kib.kiev.ua>

next in thread | previous in thread | raw e-mail | index | archive | help
On 17/03/2017 08:23, Konstantin Belousov wrote:
> On Fri, Mar 17, 2017 at 06:30:49AM +0000, Steven Hartland wrote:
>> Ok I think I've identified the cause.
>>
>> If an alternative signal stack is applied to a non-main thread and that
>> thread calls execve then the signal stack is not cleared.
>>
>> This results in all sorts of badness.
>>
>> Full details, including a small C reproduction case can be found here:
>> https://github.com/golang/go/issues/15658#issuecomment-287276856
>>
>> So looks like its kernel bug. If anyone has an ideas about that before I
>> look tomorrow that would be appreciated.
> Yes, there is definitely a kernel bug, which should be fixed by the patch
> below.
>
> Still, what I saw when I looked at the issue, is not quite resembling
> potential consequences of the bug.  Using wrong memory for signal stack
> would result either in much more significant memory corruption if the
> alt stack range is mapped and used for something unrelated, or in killed
> process on signal delivery, if the range is not mapped.  While I saw a
> systematic 'off by 0x10' in some gc structures.
>
> Anyway, patch for the issue you identified:
>
> diff --git a/sys/kern/kern_sig.c b/sys/kern/kern_sig.c
> index 29d5dd4b132..9bf3ba66f5c 100644
> --- a/sys/kern/kern_sig.c
> +++ b/sys/kern/kern_sig.c
> @@ -976,7 +976,6 @@ execsigs(struct proc *p)
>   	 * and are now ignored by default).
>   	 */
>   	PROC_LOCK_ASSERT(p, MA_OWNED);
> -	td = FIRST_THREAD_IN_PROC(p);
>   	ps = p->p_sigacts;
>   	mtx_lock(&ps->ps_mtx);
>   	while (SIGNOTEMPTY(ps->ps_sigcatch)) {
> @@ -1007,6 +1006,8 @@ execsigs(struct proc *p)
>   	 * Reset stack state to the user stack.
>   	 * Clear set of signals caught on the signal stack.
>   	 */
> +	td = curthread;
> +	MPASS(td->td_proc == p);
>   	td->td_sigstk.ss_flags = SS_DISABLE;
>   	td->td_sigstk.ss_size = 0;
>   	td->td_sigstk.ss_sp = 0;
Thanks Kostik, pretty obvious now looking at :)

Testing here we've seen all sorts of corruption looking things, mainly 
around random signals from SIGILL to SIGSEGV but also random kernel 
messages including:
pid 4603 (test): sigreturn copying xfpustate failed
pid 5013 (test): sigreturn xfpusave_len = 0x44d9bb

I'm currently running a test, but its looking good as the test case 
usually crashes in a matter of seconds.

Would you mind if I committed it?

I'm guessing given its nature this is something we'd want MFC'ed and 
Errata's issued for all supported versions?

     Regards
     Steve



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?180a601b-5481-bb41-f7fc-67976aabe451>