Date: Wed, 7 Dec 2016 23:49:50 +0000
From: Steven Hartland <killing@multiplay.co.uk>
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>
Subject: Re: Help needed to identify golang fork / memory corruption issue on FreeBSD
Message-ID: <37d0e944-5104-3db0-2884-be6fa80bc95d@multiplay.co.uk>
In-Reply-To: <20161207143051.GX54029@kib.kiev.ua>
References: <27e1a828-5cd9-0755-50ca-d7143e7df117@multiplay.co.uk>
 <20161206125919.GQ54029@kib.kiev.ua>
 <8b502580-4d2d-1e1f-9e05-61d46d5ac3b1@multiplay.co.uk>
 <20161206143532.GR54029@kib.kiev.ua>
 <9b40c93a-871f-bb32-668c-39bc3e31e385@multiplay.co.uk>
 <20161207121449.GV54029@kib.kiev.ua>
 <20161207143051.GX54029@kib.kiev.ua>
On 07/12/2016 14:30, Konstantin Belousov wrote:
> On Wed, Dec 07, 2016 at 02:14:49PM +0200, Konstantin Belousov wrote:
>> On Tue, Dec 06, 2016 at 08:35:04PM +0000, Steven Hartland wrote:
>>> On 06/12/2016 14:35, Konstantin Belousov wrote:
>>>> On Tue, Dec 06, 2016 at 01:53:52PM +0000, Steven Hartland wrote:
>>>>> On 06/12/2016 12:59, Konstantin Belousov wrote:
>>>>>> On Tue, Dec 06, 2016 at 12:31:47PM +0000, Steven Hartland wrote:
>>>>>>> Hi guys, I'm trying to help identify / fix an issue with golang
>>>>>>> whereby fork results in memory corruption.
>>>>>>>
>>>>>>> Details of the issue can be found here:
>>>>>>> https://github.com/golang/go/issues/15658
>>>>>>>
>>>>>>> In summary, when a fork is done in golang it has a chance of causing
>>>>>>> memory corruption in the parent, resulting in a process crash once
>>>>>>> the corruption is detected.
>>>>>>>
>>>>>>> It's believed that this only affects FreeBSD.
>>>>>>>
>>>>>>> This has similarities to other reported issues, such as this one
>>>>>>> which impacted perl during 10.x:
>>>>>>> https://rt.perl.org/Public/Bug/Display.html?id=122199
>>>>>> I cannot judge about any similarities when all the description
>>>>>> provided is 'memory corruption'. BTW, the perl issue described, where
>>>>>> the child segfaults after the fork, is more likely to be caused by
>>>>>> the set of problems referenced in FreeBSD-EN-16:17.vm.
>>>>>>
>>>>>>> And more recently the issue with nginx on 11.x:
>>>>>>> https://lists.freebsd.org/pipermail/freebsd-stable/2016-September/085540.html
>>>>>> Which does not affect anything unless aio is used on Sandy/Ivy.
>>>>>>
>>>>>>> It's possible, some believe likely, that this is a kernel bug around
>>>>>>> fork / vm that golang stresses, but I've not been able to confirm.
>>>>>>>
>>>>>>> I can reproduce the issue at will; it takes between 5 minutes and
>>>>>>> 1 hour using 16 threads, and it definitely seems like an interaction
>>>>>>> between fork and other memory operations.
>>>>>> Which arch is the kernel and the process which demonstrates the
>>>>>> behaviour? I mean i386/amd64.
>>>>> amd64
>>>> How large is the machine, how many cores, and what is the physical
>>>> memory size?
>> I was able to reproduce that as well, reliably, on two desktop-size
>> machines. One is Sandy Bridge, the same core microarchitecture as your
>> crashbox; the other is Haswell. I see the error both with PCID enabled
>> and disabled on both machines (Haswell does implement INVPCID, so the
>> original aio/PCID bug never affected this microarchitecture).
>>
>> I believe this clears the PCID changes from the accusations.
>>
>>>>>>> I've tried reproducing the issue in C as well, but no joy (captured
>>>>>>> in the bug).
>>>>>>>
>>>>>>> For reference, I'm currently testing on 11.0-RELEASE-p3 + kib's PCID
>>>>>>> fix (#306350).
>>>>>> Switch to a HEAD kernel, for a start.
>>>>>> Show the memory map of the failed process.
>>> No sign of zeroed memory that I can tell.
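For anyone new to the thread, the reproducer is along these lines: 16
goroutines that fork/exec a trivial command in a tight loop while the
collector is kept busy (run with a low GOGC, e.g. GOGC=2). This is only a
rough sketch; the goroutine count and command path are illustrative, and
the exact program is attached to the golang issue above:

    package main

    import (
        "fmt"
        "os/exec"
        "runtime"
    )

    const goroutines = 16

    func main() {
        fmt.Printf("Starting %d forking goroutines...\n", goroutines)
        fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
        for i := 0; i < goroutines; i++ {
            go func() {
                for {
                    // Each Run() forks and execs; allocations made here
                    // and in the other goroutines race with the fork if
                    // the kernel or runtime mismanages the address space.
                    if err := exec.Command("/usr/bin/true").Run(); err != nil {
                        panic(err)
                    }
                }
            }()
        }
        select {} // block forever; the runtime aborts if corruption is detected
    }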
>>>
>>> This error was caused by hitting the following validation in gc:
>>>
>>> func (list *mSpanList) remove(span *mspan) {
>>>     if span.prev == nil || span.list != list {
>>>         println("runtime: failed MSpanList_Remove", span,
>>>             span.prev, span.list, list)
>>>         throw("MSpanList_Remove")
>>>     }
>>>
>>> runtime: failed MSpanList_Remove 0x80052e580 0x80052e300 0x53e9c0 0x53e9b0
>>> fatal error: MSpanList_Remove
>>>
>>> (gdb) print list
>>> $4 = (runtime.mSpanList *) 0x53e9b0 <runtime.mheap_+4944>
>>> (gdb) print span.list
>>> $5 = (runtime.mSpanList *) 0x53e9c0 <runtime.mheap_+4960>
>> The difference, which triggered the exception, is quite curious:
>> list is 0x53e9b0, and span.list == list + 0x10. More, this is not
>> a single-bit error: the bit pattern is 1011 for 0xb and 1100 for 0xc.
>>
>> It is highly unlikely that the cause is memory corruption due to the
>> OS mis-managing pages or the TLB. Typically, you get either a page or
>> a cache line of complete garbage, instead of almost identical but
>> slightly modified data.
>>
>>> (gdb) print span.prev
>>> $6 = (struct runtime.mspan **) 0x80052e300
>>> (gdb) print *list
>>> $7 = {first = 0x80052e580, last = 0x8008aa180}
>>> (gdb) print *span.list
>>> $8 = {first = 0x8007ea7e0, last = 0x80052e580}
>>>
>>> procstat -v test.core.1481054183
>>>   PID            START              END PRT  RES PRES REF SHD FLAG TP PATH
>>>  1178         0x400000         0x49b000 r-x  115  223   3   1 CN-- vn /root/test
>>>  1178         0x49b000         0x528000 r--   97  223   3   1 CN-- vn /root/test
>>>  1178         0x528000         0x539000 rw-   10    0   1   0 C--- vn /root/test
>>>  1178         0x539000         0x55a000 rw-   16   16   1   0 C--- df
>>>  1178      0x800528000      0x800a28000 rw-  118  118   1   0 C--- df
>>>  1178      0x800a28000      0x800a68000 rw-    1    1   1   0 CN-- df
>>>  1178      0x800a68000      0x800aa8000 rw-    2    2   1   0 CN-- df
>>>  1178      0x800aa8000      0x800c08000 rw-   50   50   1   0 CN-- df
>>>  1178      0x800c08000      0x800c48000 rw-    2    2   1   0 CN-- df
>>>  1178      0x800c48000      0x800c88000 rw-    1    1   1   0 CN-- df
>>>  1178      0x800c88000      0x800cc8000 rw-    1    1   1   0 CN-- df
>>>  1178     0xc000000000     0xc000001000 rw-    1    1   1   0 CN-- df
>>>  1178     0xc41ffe0000     0xc41ffe8000 rw-    8    8   1   0 CN-- df
>>>  1178     0xc41ffe8000     0xc41fff0000 rw-    8    8   1   0 CN-- df
>>>  1178     0xc41fff0000     0xc41fff8000 rw-    8    8   1   0 C--- df
>>>  1178     0xc41fff8000     0xc420300000 rw-  553  553   1   0 C--- df
>>>  1178     0xc420300000     0xc420400000 rw-  234  234   1   0 C--- df
>>>  1178   0x7ffffffdf000   0x7ffffffff000 rwx    2    2   1   0 C--D df
>>>  1178   0x7ffffffff000   0x800000000000 r-x    1    1  33   0 ---- ph
>>>
>>> This is from FreeBSD 12.0-CURRENT #36 r309618M
>>>
>>> ktrace on 11.0-RELEASE is still running, 6 hours so far.
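To spell out the "not a single-bit error" point above for anyone following
along: 0x53e9b0 and 0x53e9c0 differ in three bits at once, which is easy to
confirm (a throwaway snippet, purely illustrative):

    package main

    import "fmt"

    // popcount counts set bits using Kernighan's method.
    func popcount(x uint64) int {
        n := 0
        for ; x != 0; x &= x - 1 {
            n++
        }
        return n
    }

    func main() {
        list := uint64(0x53e9b0)     // the list gc was removing from
        spanList := uint64(0x53e9c0) // the list the span claimed to be on
        diff := list ^ spanList
        // 0x53e9b0 ^ 0x53e9c0 = 0x70: three differing bits, so a single
        // flipped bit in memory cannot explain the mismatch.
        fmt.Printf("xor = %#x, differing bits = %d\n", diff, popcount(diff))
    }

A random flip would also be unlikely to land exactly one mSpanList (0x10
bytes, two pointers) further into mheap_, which is what makes the corruption
look structured rather than random.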
> One thing that I noted. In my later attempt to reproduce the issue, I
> got the following output:
>
> sandy% GOGC=2 ./1.go /mnt/1
> Starting 16 forking goroutines...
> GOMAXPROCS: 8
> runtime: failed MSpanList_Remove 0x8006a9d60 0x53f2e0 0x53f2f0 0x53f2e0
> fatal error: MSpanList_Remove
>
> runtime stack:
> runtime.throw(0x4cca4d, 0x10)
>     /usr/local/go/src/runtime/panic.go:566 +0x95
> runtime.(*mSpanList).remove(0x53f2e0, 0x8006a9d60)
>     /usr/local/go/src/runtime/mheap.go:1001 +0x19d
> runtime.(*mcentral).cacheSpan(0x53f2d0, 0x44abfb)
>     /usr/local/go/src/runtime/mcentral.go:55 +0x3d0
> runtime.(*mcache).refill(0x80052a0d0, 0xc400000016, 0xc420200d38)
>     /usr/local/go/src/runtime/mcache.go:121 +0xae
> runtime.(*mcache).nextFree.func1()
>     /usr/local/go/src/runtime/malloc.go:505 +0x33
> runtime.systemstack(0xc420018000)
>     /usr/local/go/src/runtime/asm_amd64.s:298 +0x79
> runtime.mstart()
>     /usr/local/go/src/runtime/proc.go:1079
>
> goroutine 8797810 [running]:
> runtime.systemstack_switch()
>     /usr/local/go/src/runtime/asm_amd64.s:252 fp=0xc420200be0 sp=0xc420200bd8
> runtime.(*mcache).nextFree(0x80052a0d0, 0xc420200c16, 0x40ec95, 0xc42011c050, 0x10)
>     /usr/local/go/src/runtime/malloc.go:506 +0xb2 fp=0xc420200c38 sp=0xc420200be0
> runtime.mallocgc(0x1a0, 0x4aaf60, 0xc420200d01, 0x44ab40)
>     /usr/local/go/src/runtime/malloc.go:658 +0x809 fp=0xc420200cd8 sp=0xc420200c38
> runtime.makeslice(0x4aaf60, 0x0, 0x19, 0xc4200b8688, 0x8, 0x8)
>     /usr/local/go/src/runtime/slice.go:57 +0x7b fp=0xc420200d30 sp=0xc420200cd8
> syscall.Environ(0x0, 0x0, 0x0)
>     /usr/local/go/src/syscall/env_unix.go:142 +0xd0 fp=0xc420200dc0 sp=0xc420200d30
> os.Environ(0x0, 0x0, 0x0)
>     /usr/local/go/src/os/env.go:116 +0x22 fp=0xc420200de8 sp=0xc420200dc0
> os/exec.(*Cmd).envv(0xc4201422c0, 0xc4200b8680, 0x0, 0x1)
>     /usr/local/go/src/os/exec/exec.go:171 +0x38 fp=0xc420200e10 sp=0xc420200de8
> os/exec.(*Cmd).Start(0xc4201422c0, 0x6, 0x0)
>
> There are two traces for goroutines, and note that both the first and the
> second are in malloc. Is go malloc fine-grain locked? E.g. I know that
> the JVM uses per-thread arenas.

I'm not particularly familiar with the golang memory allocator, as I only
started digging into this a few days ago, but malloc is pool based. One
article I just found which covers some details of the allocator's internals
is:
https://blog.altoros.com/golang-internals-part-6-bootstrapping-and-memory-allocator-initialization.html

> In fact, are there stress-tests for the go mutual exclusion primitives?

go has a full test suite built in, which is run by default when go itself
is built; components of it also include benchmarks.

You can run the tests for the current directory with verbose output using:
go test -v

or for all directories below the current one with:
go test -v ./...

To also run all benchmarks, add: -bench=.

For more information about go test, which has loads of nice options
including profiling, run: go test -h
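As a concrete example of those conventions: dropping a file such as
adder_test.go (all names here are made up for illustration) into a
directory and running "go test -v -bench=." there runs both the test and
the benchmark:

    package adder

    import "testing"

    // Add is a trivial function to exercise, purely illustrative.
    func Add(a, b int) int { return a + b }

    // Functions named TestXxx taking *testing.T are run by "go test".
    func TestAdd(t *testing.T) {
        if got := Add(2, 3); got != 5 {
            t.Fatalf("Add(2, 3) = %d, want 5", got)
        }
    }

    // Functions named BenchmarkXxx loop b.N times; enabled by -bench.
    func BenchmarkAdd(b *testing.B) {
        for i := 0; i < b.N; i++ {
            Add(2, 3)
        }
    }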
> The runtime seems to try to use the thr and umtx syscalls directly, which
> could be the source of bugs.

Yes, it could well be.

Looking through the runtime code I spotted that it uses the sigprocmask
syscall directly in src/runtime/os_freebsd.go, which may also have the
potential to cause problems, given that libthr does additional work on top
of the syscall and sigprocmask(2) explicitly warns against using
sigprocmask in threaded applications:
https://github.com/golang/go/blob/master/src/runtime/os_freebsd.go#L135

I've now got a failure with ktrace running. It took over 24 hours, so the
trace is quite large; I've uploaded it alongside the log and the exact
binary build to:
https://blog.multiplay.co.uk/dropzone/freebsd/golang-panic.tar.gz

The crash was the same as before:

runtime: failed MSpanList_Remove 0x80052d860 0x8007abbc0 0x53ea30 0x53ea20
fatal error: MSpanList_Remove

Given the offset is the same again, this doesn't really feel like random
corruption.

Regards
Steve
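P.S. To illustrate what "uses the sigprocmask syscall directly" means
above: the runtime issues the raw syscall rather than going through
libc/libthr. The FreeBSD-only sketch below mimics that style from ordinary
Go code; the sigset layout mirrors FreeBSD's sigset_t and SIG_BLOCK is 1 on
FreeBSD, but this is an illustration, not the runtime's actual code:

    package main

    import (
        "fmt"
        "syscall"
        "unsafe"
    )

    // sigset mirrors FreeBSD's sigset_t: four 32-bit words.
    type sigset struct {
        bits [4]uint32
    }

    const sigBlock = 1 // SIG_BLOCK from FreeBSD's <signal.h>

    func main() {
        var old sigset
        // Raw sigprocmask(2), bypassing the wrapper that a threaded C
        // program linked against libthr would get. Passing a nil new
        // mask means this only queries the current mask, which keeps
        // the example harmless.
        _, _, errno := syscall.RawSyscall(syscall.SYS_SIGPROCMASK,
            sigBlock, 0, uintptr(unsafe.Pointer(&old)))
        if errno != 0 {
            fmt.Println("sigprocmask:", errno)
            return
        }
        fmt.Printf("current signal mask words: %v\n", old.bits)
    }

Since libthr does its own per-thread signal mask bookkeeping on top of the
syscall, as noted above, mixing raw syscalls with libthr-managed threads is
exactly the kind of thing that could go subtly wrong.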