Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 28 Mar 2017 23:50:37 +0100
From:      Steven Hartland <killing@multiplay.co.uk>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        "K. Macy" <kmacy@freebsd.org>, "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>
Subject:   Re: Help needed to identify golang fork / memory corruption issue on FreeBSD
Message-ID:  <2f5952b5-104d-4e82-5ebe-8cb9caabedae@multiplay.co.uk>
In-Reply-To: <20170328113859.GS43712@kib.kiev.ua>
References:  <20170317124437.GR16105@kib.kiev.ua> <5ba92447-945e-6fea-ad4f-f58ac2a0012e@multiplay.co.uk> <20170327161833.GL43712@kib.kiev.ua> <3ec35a46-ae70-35cd-29f8-82e7cebb0eb6@multiplay.co.uk> <20170327164905.GN43712@kib.kiev.ua> <17f29342-f3c0-5940-d012-1a698e59a384@multiplay.co.uk> <20170328075859.GQ43712@kib.kiev.ua> <85f86a20-948f-025a-0d09-92ee2a815136@multiplay.co.uk> <20170328083810.GR43712@kib.kiev.ua> <5aa653ba-30e1-c9de-46ce-bad74d78c40c@multiplay.co.uk> <20170328113859.GS43712@kib.kiev.ua>

next in thread | previous in thread | raw e-mail | index | archive | help
On 28/03/2017 12:38, Konstantin Belousov wrote:
> On Tue, Mar 28, 2017 at 09:48:23AM +0100, Steven Hartland wrote:
>> On 28/03/2017 09:38, Konstantin Belousov wrote:
>>> On Tue, Mar 28, 2017 at 09:23:24AM +0100, Steven Hartland wrote:
>>>> As I stopped the panic before that I couldn't tell so I've re-run with
>>>> some debug added just before the panic to capture the addresses of the
>>>> workbuf structure that the issue was detected in, here goes (parent:
>>>> 62620, child: 98756):
>>>>
>>>> workbuf: 0x800b51800
>>>> fatal error: workbuf is not empty
>>>> workbuf: 0x800a72000
>>>> fatal error: workbuf is empty
>>>> workbuf: 0x800a72000
>>>> fatal error: workbuf is not empty
>>> I do not understand.  Why do you show several addresses ?  Wouldn't the
>>> runtime panic after detecting the discrepancy, so there could be only one
>>> address ?
>> There are several goroutines (threads) running each detected an error,
>> as I'm blocking the panic by entering a sleep in the faulting goroutine
>> to enable the capture of procstat, other routines continue and detect an
>> error too.
> Ok.
>
> So I tried to simulate the load with an isolated test. Code below is
> naive, but it should illustrate the idea. Parent allocates some
> number of private-mapped areas, then runs threads which write bytes into
> the areas. Simultaneously parent forks children which write distinct
> byte into the same anonymous memory.
>
> Parent checks that it cannot see a byte written by children.
>
> So far it did not tripped on my test machine.  Feel free to play with it,
> if you have more insights what go runtime does, modify the code to simulate
> the failing test more accurately.
I've updated to it to be more like the go, so single forking thread 
(non-main), ancillary threads mainly idle until triggered by forking 
thread to perform a check, and still no failure.

What's curious is why I don't get the issue if either:
1. The machine has just a single core.
2. The work (GC) is moved after the child wait.

Given the above I added some debug:
func (b *workbuf) checknonempty() {
         if b.nobj == 0 {
                 print("workbuf is empty: b: ", b, ", nobj: ", b.nobj, 
", nobj2: ", b.nobj2, ", pushcnt: ", b.node.pushcnt, "\n")
                 throw("workbuf is empty")
         }
}

func (b *workbuf) checkempty() {
         if b.nobj != 0 {
                 print("workbuf is not empty: b: ", b, ", nobj: ", 
b.nobj, ", nobj2: ", b.nobj2, ", pushcnt: ", b.node.pushcnt, "\n")
                 throw("workbuf is not empty")
         }
}

Here's the output:
workbuf is not empty: b: 0x800c51000, nobj: 4, nobj2: -2, pushcnt: 104881
fatal error: workbuf is not empty

Nothing strange, but now lets have a look using gdb after the parent has 
exited:
(gdb) frame 8
#8  0x000000000041f1e8 in runtime.(*workbuf).checkempty (b=0x800c51000) 
at /usr/local/go/src/runtime/mgcwork.go:328
328                     throw("workbuf is not empty")
(gdb) print b
$3 = (struct runtime.workbuf *) 0x800c51000
(gdb) print *b
$4 = {runtime.workbufhdr = {node = {next = 0, pushcnt = 104881}, nobj = 
0, nobj2 = -8},....

So after the error was printed the value for nobj was some how 
corrected, however nobj2 being -8 indicates the last call which altered 
nobj was func (w *gcWork) get() uintptr where as the -2 indicates it was 
a putfull which is very muddled up.

I was curious what the child had at 0x800c51000 but couldn't persuade 
gdb to cast and output it as a
struct runtime.workbuf.


     Regards
     Steve





Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?2f5952b5-104d-4e82-5ebe-8cb9caabedae>