Date: Wed, 7 Dec 2016 16:30:51 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Steven Hartland <killing@multiplay.co.uk>
Cc: "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>
Subject: Re: Help needed to identify golang fork / memory corruption issue on FreeBSD
Message-ID: <20161207143051.GX54029@kib.kiev.ua>
In-Reply-To: <20161207121449.GV54029@kib.kiev.ua>
References: <27e1a828-5cd9-0755-50ca-d7143e7df117@multiplay.co.uk>
 <20161206125919.GQ54029@kib.kiev.ua>
 <8b502580-4d2d-1e1f-9e05-61d46d5ac3b1@multiplay.co.uk>
 <20161206143532.GR54029@kib.kiev.ua>
 <9b40c93a-871f-bb32-668c-39bc3e31e385@multiplay.co.uk>
 <20161207121449.GV54029@kib.kiev.ua>
On Wed, Dec 07, 2016 at 02:14:49PM +0200, Konstantin Belousov wrote:
> On Tue, Dec 06, 2016 at 08:35:04PM +0000, Steven Hartland wrote:
> > On 06/12/2016 14:35, Konstantin Belousov wrote:
> > > On Tue, Dec 06, 2016 at 01:53:52PM +0000, Steven Hartland wrote:
> > >> On 06/12/2016 12:59, Konstantin Belousov wrote:
> > >>> On Tue, Dec 06, 2016 at 12:31:47PM +0000, Steven Hartland wrote:
> > >>>> Hi guys, I'm trying to help identify / fix an issue with golang whereby
> > >>>> fork results in memory corruption.
> > >>>>
> > >>>> Details of the issue can be found here:
> > >>>> https://github.com/golang/go/issues/15658
> > >>>>
> > >>>> In summary, when a fork is done in golang it has a chance of causing
> > >>>> memory corruption in the parent, resulting in a process crash once the
> > >>>> corruption is detected.
> > >>>>
> > >>>> It's believed that this only affects FreeBSD.
> > >>>>
> > >>>> This has similarities to other reported issues, such as this one which
> > >>>> impacted perl during 10.x:
> > >>>> https://rt.perl.org/Public/Bug/Display.html?id=122199
> > >>> I cannot judge about any similarities when all the description provided
> > >>> is 'memory corruption'. BTW, the perl issue described, where the child
> > >>> segfaults after the fork, is more likely to be caused by the set of
> > >>> problems referenced in FreeBSD-EN-16:17.vm.
> > >>>
> > >>>> And more recently the issue with nginx on 11.x:
> > >>>> https://lists.freebsd.org/pipermail/freebsd-stable/2016-September/085540.html
> > >>> Which does not affect anything unless aio is used on Sandy/Ivy.
> > >>>
> > >>>> It's possible, some believe likely, that this is a kernel bug around fork
> > >>>> / vm that golang stresses, but I've not been able to confirm that.
> > >>>>
> > >>>> I can reproduce the issue at will; it takes between 5 minutes and 1 hour
> > >>>> using 16 threads, and it definitely seems like an interaction between
> > >>>> fork and other memory operations.
> > >>> Which arch are the kernel and the process which demonstrate the behaviour?
> > >>> I mean i386/amd64.
> > >> amd64
> > > How large is the machine, how many cores, and what is the physical memory size?
> I was able to reproduce that as well, reliably, on two desktop-size
> machines. One is Sandy Bridge, the same core microarchitecture as your
> crashbox; the other is Haswell. I see the error both with PCID enabled
> and disabled on both machines (Haswell does implement INVPCID, so the
> original aio/PCID bug never affected this microarchitecture).
>
> I believe this clears the PCID changes from the accusations.
>
> >
> > >>>> I've tried reproducing the issue in C but also no joy (captured in the bug).
> > >>>>
> > >>>> For reference, I'm currently testing on 11.0-RELEASE-p3 + kib's PCID fix
> > >>>> (#306350).
> > >>> Switch to a HEAD kernel, for a start.
> > >>> Show the memory map of the failed process.
> > No sign of zeroed memory that I can tell.
> >
> > This error was caused by hitting the following validation in gc:
> > func (list *mSpanList) remove(span *mspan) {
> >         if span.prev == nil || span.list != list {
> >                 println("runtime: failed MSpanList_Remove", span,
> >                         span.prev, span.list, list)
> >                 throw("MSpanList_Remove")
> >         }
> >
> > runtime: failed MSpanList_Remove 0x80052e580 0x80052e300 0x53e9c0 0x53e9b0
> > fatal error: MSpanList_Remove
> >
> > (gdb) print list
> > $4 = (runtime.mSpanList *) 0x53e9b0 <runtime.mheap_+4944>
> > (gdb) print span.list
> > $5 = (runtime.mSpanList *) 0x53e9c0 <runtime.mheap_+4960>
> The difference which triggered the exception is quite curious:
> list is 0x53e9b0, and span.list == list + 0x10. Moreover, this is not
> a single-bit error: the bit pattern is 1011 for 0xb and 1100 for 0xc.
>
> It is highly unlikely that the cause is memory corruption due to the
> OS mis-managing pages or the TLB. Typically you get either a page or a
> cache line of complete garbage, instead of almost identical but slightly
> modified data.
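To make the arithmetic above explicit: the two pointers differ in three bits
and by exactly 0x10 bytes, i.e. one 16-byte list header further into
runtime.mheap_ (offset 4944 vs 4960), so span.list points at an adjacent list
header rather than at random garbage. A trivial standalone check of just the
arithmetic (not part of the runtime; the constants are simply the two
addresses from the gdb output above):

package main

import "fmt"

func main() {
        const (
                list     = uint64(0x53e9b0) // runtime.mheap_+4944, the list remove() was called on
                spanlist = uint64(0x53e9c0) // runtime.mheap_+4960, the list the span claims to be on
        )
        x := list ^ spanlist
        n := 0
        for b := x; b != 0; b &= b - 1 { // count the differing bits
                n++
        }
        fmt.Printf("xor=%#x differing bits=%d delta=%#x\n", x, n, spanlist-list)
        // prints: xor=0x70 differing bits=3 delta=0x10
}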
> > (gdb) print span.prev
> > $6 = (struct runtime.mspan **) 0x80052e300
> > (gdb) print *list
> > $7 = {first = 0x80052e580, last = 0x8008aa180}
> > (gdb) print *span.list
> > $8 = {first = 0x8007ea7e0, last = 0x80052e580}
> >
> > procstat -v test.core.1481054183
> >   PID          START              END PRT  RES PRES REF SHD FLAG TP PATH
> >  1178       0x400000         0x49b000 r-x  115  223   3   1 CN-- vn /root/test
> >  1178       0x49b000         0x528000 r--   97  223   3   1 CN-- vn /root/test
> >  1178       0x528000         0x539000 rw-   10    0   1   0 C--- vn /root/test
> >  1178       0x539000         0x55a000 rw-   16   16   1   0 C--- df
> >  1178    0x800528000      0x800a28000 rw-  118  118   1   0 C--- df
> >  1178    0x800a28000      0x800a68000 rw-    1    1   1   0 CN-- df
> >  1178    0x800a68000      0x800aa8000 rw-    2    2   1   0 CN-- df
> >  1178    0x800aa8000      0x800c08000 rw-   50   50   1   0 CN-- df
> >  1178    0x800c08000      0x800c48000 rw-    2    2   1   0 CN-- df
> >  1178    0x800c48000      0x800c88000 rw-    1    1   1   0 CN-- df
> >  1178    0x800c88000      0x800cc8000 rw-    1    1   1   0 CN-- df
> >  1178   0xc000000000     0xc000001000 rw-    1    1   1   0 CN-- df
> >  1178   0xc41ffe0000     0xc41ffe8000 rw-    8    8   1   0 CN-- df
> >  1178   0xc41ffe8000     0xc41fff0000 rw-    8    8   1   0 CN-- df
> >  1178   0xc41fff0000     0xc41fff8000 rw-    8    8   1   0 C--- df
> >  1178   0xc41fff8000     0xc420300000 rw-  553  553   1   0 C--- df
> >  1178   0xc420300000     0xc420400000 rw-  234  234   1   0 C--- df
> >  1178 0x7ffffffdf000   0x7ffffffff000 rwx    2    2   1   0 C--D df
> >  1178 0x7ffffffff000   0x800000000000 r-x    1    1  33   0 ---- ph
> >
> > This is from FreeBSD 12.0-CURRENT #36 r309618M
> >
> > ktrace on 11.0-RELEASE is still running, 6 hours so far.

One thing that I noted: in my later attempt to reproduce the issue, I got
the following output:

sandy% GOGC=2 ./1.go /mnt/1
Starting 16 forking goroutines...
GOMAXPROCS: 8
runtime: failed MSpanList_Remove 0x8006a9d60 0x53f2e0 0x53f2f0 0x53f2e0
fatal error: MSpanList_Remove

runtime stack:
runtime.throw(0x4cca4d, 0x10)
        /usr/local/go/src/runtime/panic.go:566 +0x95
runtime.(*mSpanList).remove(0x53f2e0, 0x8006a9d60)
        /usr/local/go/src/runtime/mheap.go:1001 +0x19d
runtime.(*mcentral).cacheSpan(0x53f2d0, 0x44abfb)
        /usr/local/go/src/runtime/mcentral.go:55 +0x3d0
runtime.(*mcache).refill(0x80052a0d0, 0xc400000016, 0xc420200d38)
        /usr/local/go/src/runtime/mcache.go:121 +0xae
runtime.(*mcache).nextFree.func1()
        /usr/local/go/src/runtime/malloc.go:505 +0x33
runtime.systemstack(0xc420018000)
        /usr/local/go/src/runtime/asm_amd64.s:298 +0x79
runtime.mstart()
        /usr/local/go/src/runtime/proc.go:1079

goroutine 8797810 [running]:
runtime.systemstack_switch()
        /usr/local/go/src/runtime/asm_amd64.s:252 fp=0xc420200be0 sp=0xc420200bd8
runtime.(*mcache).nextFree(0x80052a0d0, 0xc420200c16, 0x40ec95, 0xc42011c050, 0x10)
        /usr/local/go/src/runtime/malloc.go:506 +0xb2 fp=0xc420200c38 sp=0xc420200be0
runtime.mallocgc(0x1a0, 0x4aaf60, 0xc420200d01, 0x44ab40)
        /usr/local/go/src/runtime/malloc.go:658 +0x809 fp=0xc420200cd8 sp=0xc420200c38
runtime.makeslice(0x4aaf60, 0x0, 0x19, 0xc4200b8688, 0x8, 0x8)
        /usr/local/go/src/runtime/slice.go:57 +0x7b fp=0xc420200d30 sp=0xc420200cd8
syscall.Environ(0x0, 0x0, 0x0)
        /usr/local/go/src/syscall/env_unix.go:142 +0xd0 fp=0xc420200dc0 sp=0xc420200d30
os.Environ(0x0, 0x0, 0x0)
        /usr/local/go/src/os/env.go:116 +0x22 fp=0xc420200de8 sp=0xc420200dc0
os/exec.(*Cmd).envv(0xc4201422c0, 0xc4200b8680, 0x0, 0x1)
        /usr/local/go/src/os/exec/exec.go:171 +0x38 fp=0xc420200e10 sp=0xc420200de8
os/exec.(*Cmd).Start(0xc4201422c0, 0x6, 0x0)

There are two traces for goroutines, and note that both the first and the
second are in malloc. Is go malloc fine-grained locked? E.g., I know that
the JVM uses per-thread arenas. In fact, are there stress tests for the go
mutual exclusion primitives? The runtime seems to use the thr and umtx
syscalls directly, which could be the source of bugs.
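For reference, the load that produced the output above is roughly of this
shape. This is only a minimal sketch of that kind of reproducer, not the
exact 1.go attached to golang issue 15658; the child command and the error
handling are placeholders. Several goroutines fork/exec a trivial child in
a tight loop while a low GOGC keeps the collector busy in the parent:

package main

import (
        "fmt"
        "log"
        "os/exec"
        "runtime"
        "sync"
)

func main() {
        const workers = 16
        fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
        fmt.Printf("Starting %d forking goroutines...\n", workers)

        var wg sync.WaitGroup
        for i := 0; i < workers; i++ {
                wg.Add(1)
                go func() {
                        defer wg.Done()
                        for {
                                // fork+exec a trivial child; the parent keeps
                                // allocating around the fork (os.Environ etc.,
                                // as seen in the trace above).
                                if err := exec.Command("/usr/bin/true").Run(); err != nil {
                                        log.Fatal(err)
                                }
                        }
                }()
        }
        wg.Wait() // the workers loop forever; leave it running until it trips
}

Running it with a low GOGC, as in the GOGC=2 invocation above, makes the
collector manipulate the span lists continuously while the forks are in
flight, which is where the failure shows up.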