Date:      Tue, 6 Dec 2016 16:35:32 +0200
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Steven Hartland <killing@multiplay.co.uk>
Cc:        "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>
Subject:   Re: Help needed to identify golang fork / memory corruption issue on FreeBSD
Message-ID:  <20161206143532.GR54029@kib.kiev.ua>
In-Reply-To: <8b502580-4d2d-1e1f-9e05-61d46d5ac3b1@multiplay.co.uk>
References:  <27e1a828-5cd9-0755-50ca-d7143e7df117@multiplay.co.uk> <20161206125919.GQ54029@kib.kiev.ua> <8b502580-4d2d-1e1f-9e05-61d46d5ac3b1@multiplay.co.uk>

On Tue, Dec 06, 2016 at 01:53:52PM +0000, Steven Hartland wrote:
> On 06/12/2016 12:59, Konstantin Belousov wrote:
> > On Tue, Dec 06, 2016 at 12:31:47PM +0000, Steven Hartland wrote:
> >> Hi guys, I'm trying to help identify / fix an issue with golang whereby
> >> fork results in memory corruption.
> >>
> >> Details of the issue can be found here:
> >> https://github.com/golang/go/issues/15658
> >>
> >> In summary, when a fork is done in golang it has a chance of causing
> >> memory corruption in the parent, resulting in a process crash once detected.
> >>
> >> It's believed that this only affects FreeBSD.
> >>
> >> This has similarities to other reported issues such as this one which
> >> impacted perl during 10.x:
> >> https://rt.perl.org/Public/Bug/Display.html?id=122199
> > I cannot judge any similarities when the only description provided
> > is 'memory corruption'. BTW, the perl issue described, where the child segfaults
> > after the fork, is more likely to be caused by the set of problems referenced
> > in FreeBSD-EN-16:17.vm.
> >
> >> And more recently the issue with nginx on 11.x:
> >> https://lists.freebsd.org/pipermail/freebsd-stable/2016-September/085540.html
> > Which does not affect anything unless aio is used on Sandy/Ivy.
> >
> >> It's possible, some believe likely, that this is a kernel bug around fork
> >> / vm that golang stresses, but I've not been able to confirm.
> >>
> >> I can reproduce the issue at will, takes between 5mins and 1hour using
> >> 16 threads, and it definitely seems like an interaction between fork and
> >> other memory operations.
> > Which arch are the kernel and the process that demonstrate the behaviour?
> > I mean i386/amd64.
> amd64
How large is the machine: how many cores, and what is the physical memory size?

> >
> >> I've tried reproducing the issue in C but also no joy (captured in the bug).
> >>
> >> For reference I'm currently testing on 11.0-RELEASE-p3 + kibs PCID fix
> >> (#306350).
> > Switch to a HEAD kernel, for a start.
> > Show the memory map of the failed process.
> > Are you able to take a ktrace of the process while still producing the bug?
> Whenever I've tried ktrace the issue doesn't present itself.
> 
> I can try and run it for an extended period to see if it does eventually 
> but I did run it for a few hours without any joy.
> 
> I'm currently testing with an 11.0-RELEASE debug kernel (witness,
> invariants etc.) to see if that would detect anything; however, so far it's
> taking longer than usual to reproduce, so it may simply not occur with a
> debug kernel.
> 
> > Where does the memory corruption happen? Is it in go runtime structures,
> > or in the application data?
> It's usually detected by the runtime GC, which panics with a number of
> errors, e.g.:
> fatal error: all goroutines are asleep - deadlock!
> 
> fatal error: workbuf is empty
> 
> runtime: nelems=256 nfree=233 nalloc=23 previous allocCount=18 nfreed=65531
> fatal error: sweep increased allocation count
> 
> runtime: failed MSpanList_Remove 0x800698500 0x800b46d40 0x53adb0 0x53ada0
> fatal error: MSpanList_Remove
> 
> As the test is very basic, it's unlikely to see an issue in the
> application data.
> 
> > Can somebody knowledgeable about either the go runtime or the app
> > try to identify the initial corrupted userspace data?
> The golang developers have looked but were unable to reproduce on the
> freebsd-amd64-gce101 gomote running FreeBSD 10.1. This could be a factor
> of the VM; it's unclear.
This is not what I asked.  I am asking whether it is possible to make an educated
guess at what the initial corruption could be to cause the outcome.  Like,
if this variable suddenly becomes zero, we get the errors.

Does the go runtime use the FreeBSD libc and threading library?

> 
> The app is a tiny test binary which I'm currently running with GOGC=2:
> package main
> 
> import (
>          "fmt"
>          "os/exec"
>          "runtime"
>          "time"
> )
> 
> var (
>          gcPeriod     = time.Second * 10
>          forkRoutines = 16
> )
> 
> func run(done chan struct{}) {
>          cmd := exec.Command("/usr/bin/true")
>          cmd.Start()
>          cmd.Wait()
> 
>          done <- struct{}{}
> }
> 
> func main() {
>          fmt.Printf("Starting %v forking goroutines...\n", forkRoutines)
>          fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
> 
>          done := make(chan struct{}, forkRoutines*2)
> 
>          for i := 0; i < forkRoutines; i++ {
>                  go run(done)
>          }
> 
>          for {
>                  start := time.Now()
>                  active := forkRoutines
>          forking:
>                  for range done {
>                          if time.Since(start) > gcPeriod {
>                                  active--
>                                  if active == 0 {
>                                          break forking
>                                  }
>                          } else {
>                                  go run(done)
>                          }
>                  }
> 
>                  runtime.GC()
> 
>                  for i := 0; i < forkRoutines; i++ {
>                          go run(done)
>                  }
>          }
> }
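
The run() function in the quoted test discards the errors returned by
cmd.Start() and cmd.Wait(), so a failed fork/exec would pass silently.
A minimal sketch of a variant that reports them follows. The name
runChecked is hypothetical and not part of the original test, and
checking these errors is not claimed to affect the corruption itself;
it only helps separate a plain fork/exec failure from a runtime panic.

```go
package main

import (
	"fmt"
	"os/exec"
)

// runChecked is a hypothetical variant of run() from the quoted test.
// It starts the given binary, waits for it to exit, and returns any
// error from either step instead of discarding it.
func runChecked(path string) error {
	cmd := exec.Command(path)
	if err := cmd.Start(); err != nil {
		return fmt.Errorf("start %s: %w", path, err)
	}
	if err := cmd.Wait(); err != nil {
		return fmt.Errorf("wait %s: %w", path, err)
	}
	return nil
}

func main() {
	// Same child binary as the quoted test.
	if err := runChecked("/usr/bin/true"); err != nil {
		fmt.Println("fork/exec failed:", err)
		return
	}
	fmt.Println("child exited cleanly")
}
```

In the quoted loop, each "go run(done)" could call such a function and
log a non-nil result before sending on the done channel.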


