Date:      Wed, 7 Dec 2016 14:14:49 +0200
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Steven Hartland <killing@multiplay.co.uk>
Cc:        "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>
Subject:   Re: Help needed to identify golang fork / memory corruption issue on FreeBSD
Message-ID:  <20161207121449.GV54029@kib.kiev.ua>
In-Reply-To: <9b40c93a-871f-bb32-668c-39bc3e31e385@multiplay.co.uk>
References:  <27e1a828-5cd9-0755-50ca-d7143e7df117@multiplay.co.uk> <20161206125919.GQ54029@kib.kiev.ua> <8b502580-4d2d-1e1f-9e05-61d46d5ac3b1@multiplay.co.uk> <20161206143532.GR54029@kib.kiev.ua> <9b40c93a-871f-bb32-668c-39bc3e31e385@multiplay.co.uk>

On Tue, Dec 06, 2016 at 08:35:04PM +0000, Steven Hartland wrote:
> On 06/12/2016 14:35, Konstantin Belousov wrote:
> > On Tue, Dec 06, 2016 at 01:53:52PM +0000, Steven Hartland wrote:
> >> On 06/12/2016 12:59, Konstantin Belousov wrote:
> >>> On Tue, Dec 06, 2016 at 12:31:47PM +0000, Steven Hartland wrote:
> >>>> Hi guys, I'm trying to help identify / fix an issue with golang whereby
> >>>> fork results in memory corruption.
> >>>>
> >>>> Details of the issue can be found here:
> >>>> https://github.com/golang/go/issues/15658
> >>>>
> >>>> In summary, when a fork is done in golang it has a chance of causing
> >>>> memory corruption in the parent, resulting in a process crash once detected.
> >>>>
> >>>> It's believed that this only affects FreeBSD.
> >>>>
> >>>> This has similarities to other reported issues such as this one which
> >>>> impacted perl during 10.x:
> >>>> https://rt.perl.org/Public/Bug/Display.html?id=122199
> >>> I cannot judge any similarities when the only description provided
> >>> is 'memory corruption'. BTW, the perl issue described, where child segfaults
> >>> after the fork, is more likely to be caused by the set of problems referenced
> >>> in the FreeBSD-EN-16:17.vm.
> >>>
> >>>> And more recently the issue with nginx on 11.x:
> >>>> https://lists.freebsd.org/pipermail/freebsd-stable/2016-September/085540.html
> >>> Which does not affect anything unless aio is used on Sandy/Ivy.
> >>>
> >>>> It's possible, some believe likely, that this is a kernel bug around fork
> >>>> / vm that golang stresses, but I've not been able to confirm.
> >>>>
> >>>> I can reproduce the issue at will, takes between 5mins and 1hour using
> >>>> 16 threads, and it definitely seems like an interaction between fork and
> >>>> other memory operations.
> >>> Which arch is the kernel and the process which demonstrates the behaviour  ?
> >>> I mean i386/amd64.
> >> amd64
> > How large is the machine, how many cores, what is the physical memory size ?
I was able to reproduce it as well, reliably, on two desktop-size
machines. One is Sandy Bridge, the same core microarchitecture as your
crashbox; the other is Haswell. I see the error with PCID both enabled
and disabled on both machines (Haswell does implement INVPCID, so the
original aio/PCID bug never affected this microarchitecture).

I believe this clears the PCID changes of suspicion.

> >
> >>>> I've tried reproducing the issue in C but also no joy (captured in the bug).
> >>>>
> >>>> For reference I'm currently testing on 11.0-RELEASE-p3 + kibs PCID fix
> >>>> (#306350).
> >>> Switch to a HEAD kernel, for a start.
> >>> Show the memory map of the failed process.
> No sign of zeroed memory that I can tell.
> 
> This error was caused by hitting the following validation in gc:
> func (list *mSpanList) remove(span *mspan) {
>          if span.prev == nil || span.list != list {
>                  println("runtime: failed MSpanList_Remove", span, span.prev, span.list, list)
>                  throw("MSpanList_Remove")
>          }
> 
> runtime: failed MSpanList_Remove 0x80052e580 0x80052e300 0x53e9c0 0x53e9b0
> fatal error: MSpanList_Remove
> 
> (gdb) print list
> $4 = (runtime.mSpanList *) 0x53e9b0 <runtime.mheap_+4944>
> (gdb) print span.list
> $5 = (runtime.mSpanList *) 0x53e9c0 <runtime.mheap_+4960>
The difference that triggered the exception is quite curious:
list is 0x53e9b0, and span.list == list + 0x10.  Moreover, this is not
a single-bit error: the bit pattern is 1011 for 0xb and 1100 for 0xc.
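
To make the bit arithmetic explicit, here is a minimal sketch in Go that
XORs the two pointer values from the gdb output and counts the differing
bits. The helper name diffBits is hypothetical and not part of the Go
runtime; only the two addresses come from the report.

```go
package main

import (
	"fmt"
	"math/bits"
)

// diffBits (hypothetical helper) returns the XOR of two addresses and
// the number of bit positions in which they differ.
func diffBits(a, b uint64) (uint64, int) {
	x := a ^ b
	return x, bits.OnesCount64(x)
}

func main() {
	list := uint64(0x53e9b0)     // value of `list` from gdb
	spanList := uint64(0x53e9c0) // value of `span.list` from gdb

	x, n := diffBits(list, spanList)
	// Prints: delta=0x10 xor=0x70 differing bits=3
	fmt.Printf("delta=%#x xor=%#x differing bits=%d\n", spanList-list, x, n)
}
```

Three bits differ, so a single flipped bit cannot explain the mismatch,
while the arithmetic difference is exactly 0x10.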

It is highly unlikely that the cause is memory corruption due to the
OS mismanaging pages or the TLB.  Typically, that gives you a page or
cache line of complete garbage, rather than almost identical but
slightly modified data.

> (gdb) print span.prev
> $6 = (struct runtime.mspan **) 0x80052e300
> (gdb) print *list
> $7 = {first = 0x80052e580, last = 0x8008aa180}
> (gdb) print *span.list
> $8 = {first = 0x8007ea7e0, last = 0x80052e580}
> 
> procstat -v test.core.1481054183
>    PID              START                END PRT  RES PRES REF SHD FLAG TP PATH
>   1178           0x400000           0x49b000 r-x  115  223 3   1 CN-- vn /root/test
>   1178           0x49b000           0x528000 r--   97  223 3   1 CN-- vn /root/test
>   1178           0x528000           0x539000 rw-   10    0 1   0 C--- vn /root/test
>   1178           0x539000           0x55a000 rw-   16   16 1   0 C--- df
>   1178        0x800528000        0x800a28000 rw-  118  118 1   0 C--- df
>   1178        0x800a28000        0x800a68000 rw-    1    1 1   0 CN-- df
>   1178        0x800a68000        0x800aa8000 rw-    2    2 1   0 CN-- df
>   1178        0x800aa8000        0x800c08000 rw-   50   50 1   0 CN-- df
>   1178        0x800c08000        0x800c48000 rw-    2    2 1   0 CN-- df
>   1178        0x800c48000        0x800c88000 rw-    1    1 1   0 CN-- df
>   1178        0x800c88000        0x800cc8000 rw-    1    1 1   0 CN-- df
>   1178       0xc000000000       0xc000001000 rw-    1    1 1   0 CN-- df
>   1178       0xc41ffe0000       0xc41ffe8000 rw-    8    8 1   0 CN-- df
>   1178       0xc41ffe8000       0xc41fff0000 rw-    8    8 1   0 CN-- df
>   1178       0xc41fff0000       0xc41fff8000 rw-    8    8 1   0 C--- df
>   1178       0xc41fff8000       0xc420300000 rw-  553  553 1   0 C--- df
>   1178       0xc420300000       0xc420400000 rw-  234  234 1   0 C--- df
>   1178     0x7ffffffdf000     0x7ffffffff000 rwx    2    2 1   0 C--D df
>   1178     0x7ffffffff000     0x800000000000 r-x    1    1 33   0 ---- ph
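
As a cross-check of the pointers against this map, a minimal sketch in
Go that looks up which mapping contains a given address. The mapping
type and findMapping helper are hypothetical; only the two ranges and
the four addresses are taken from the procstat and gdb output above.

```go
package main

import "fmt"

// mapping is a hypothetical, simplified view of one procstat -v row.
type mapping struct {
	start, end uint64 // half-open range [start, end)
	prot       string
}

// findMapping returns the first mapping that contains addr, if any.
func findMapping(maps []mapping, addr uint64) (mapping, bool) {
	for _, m := range maps {
		if addr >= m.start && addr < m.end {
			return m, true
		}
	}
	return mapping{}, false
}

func main() {
	// Two rows copied from the procstat output.
	maps := []mapping{
		{0x539000, 0x55a000, "rw-"},
		{0x800528000, 0x800a28000, "rw-"},
	}
	// list, span.list, span and span.prev from the gdb session.
	for _, a := range []uint64{0x53e9b0, 0x53e9c0, 0x80052e580, 0x80052e300} {
		m, ok := findMapping(maps, a)
		fmt.Printf("%#x mapped=%v range=[%#x,%#x) %s\n", a, ok, m.start, m.end, m.prot)
	}
}
```

All four values fall inside writable mappings listed above, so none of
them is a wild pointer into unmapped space.
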
> 
> This is from FreeBSD 12.0-CURRENT #36 r309618M
> 
> ktrace on 11.0-RELEASE is still running, 6 hours so far.
> 
>      Regards
>      Steve
> 


