Date: Tue, 28 Mar 2017 23:50:37 +0100 From: Steven Hartland <killing@multiplay.co.uk> To: Konstantin Belousov <kostikbel@gmail.com> Cc: "K. Macy" <kmacy@freebsd.org>, "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org> Subject: Re: Help needed to identify golang fork / memory corruption issue on FreeBSD Message-ID: <2f5952b5-104d-4e82-5ebe-8cb9caabedae@multiplay.co.uk> In-Reply-To: <20170328113859.GS43712@kib.kiev.ua> References: <20170317124437.GR16105@kib.kiev.ua> <5ba92447-945e-6fea-ad4f-f58ac2a0012e@multiplay.co.uk> <20170327161833.GL43712@kib.kiev.ua> <3ec35a46-ae70-35cd-29f8-82e7cebb0eb6@multiplay.co.uk> <20170327164905.GN43712@kib.kiev.ua> <17f29342-f3c0-5940-d012-1a698e59a384@multiplay.co.uk> <20170328075859.GQ43712@kib.kiev.ua> <85f86a20-948f-025a-0d09-92ee2a815136@multiplay.co.uk> <20170328083810.GR43712@kib.kiev.ua> <5aa653ba-30e1-c9de-46ce-bad74d78c40c@multiplay.co.uk> <20170328113859.GS43712@kib.kiev.ua>
next in thread | previous in thread | raw e-mail | index | archive | help
On 28/03/2017 12:38, Konstantin Belousov wrote: > On Tue, Mar 28, 2017 at 09:48:23AM +0100, Steven Hartland wrote: >> On 28/03/2017 09:38, Konstantin Belousov wrote: >>> On Tue, Mar 28, 2017 at 09:23:24AM +0100, Steven Hartland wrote: >>>> As I stopped the panic before that I couldn't tell so I've re-run with >>>> some debug added just before the panic to capture the addresses of the >>>> workbuf structure that the issue was detected in, here goes (parent: >>>> 62620, child: 98756): >>>> >>>> workbuf: 0x800b51800 >>>> fatal error: workbuf is not empty >>>> workbuf: 0x800a72000 >>>> fatal error: workbuf is empty >>>> workbuf: 0x800a72000 >>>> fatal error: workbuf is not empty >>> I do not understand. Why do you show several addresses ? Wouldn't the >>> runtime panic after detecting the discrepancy, so there could be only one >>> address ? >> There are several goroutines (threads) running each detected an error, >> as I'm blocking the panic by entering a sleep in the faulting goroutine >> to enable the capture of procstat, other routines continue and detect an >> error too. > Ok. > > So I tried to simulate the load with an isolated test. Code below is > naive, but it should illustrate the idea. Parent allocates some > number of private-mapped areas, then runs threads which write bytes into > the areas. Simultaneously parent forks children which write distinct > byte into the same anonymous memory. > > Parent checks that it cannot see a byte written by children. > > So far it did not tripped on my test machine. Feel free to play with it, > if you have more insights what go runtime does, modify the code to simulate > the failing test more accurately. I've updated to it to be more like the go, so single forking thread (non-main), ancillary threads mainly idle until triggered by forking thread to perform a check, and still no failure. What's curious is why I don't get the issue if either: 1. The machine has just a single core. 2. The work (GC) is moved after the child wait. Given the above I added some debug: func (b *workbuf) checknonempty() { if b.nobj == 0 { print("workbuf is empty: b: ", b, ", nobj: ", b.nobj, ", nobj2: ", b.nobj2, ", pushcnt: ", b.node.pushcnt, "\n") throw("workbuf is empty") } } func (b *workbuf) checkempty() { if b.nobj != 0 { print("workbuf is not empty: b: ", b, ", nobj: ", b.nobj, ", nobj2: ", b.nobj2, ", pushcnt: ", b.node.pushcnt, "\n") throw("workbuf is not empty") } } Here's the output: workbuf is not empty: b: 0x800c51000, nobj: 4, nobj2: -2, pushcnt: 104881 fatal error: workbuf is not empty Nothing strange, but now lets have a look using gdb after the parent has exited: (gdb) frame 8 #8 0x000000000041f1e8 in runtime.(*workbuf).checkempty (b=0x800c51000) at /usr/local/go/src/runtime/mgcwork.go:328 328 throw("workbuf is not empty") (gdb) print b $3 = (struct runtime.workbuf *) 0x800c51000 (gdb) print *b $4 = {runtime.workbufhdr = {node = {next = 0, pushcnt = 104881}, nobj = 0, nobj2 = -8},.... So after the error was printed the value for nobj was some how corrected, however nobj2 being -8 indicates the last call which altered nobj was func (w *gcWork) get() uintptr where as the -2 indicates it was a putfull which is very muddled up. I was curious what the child had at 0x800c51000 but couldn't persuade gdb to cast and output it as a struct runtime.workbuf. Regards Steve
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?2f5952b5-104d-4e82-5ebe-8cb9caabedae>