From owner-freebsd-hackers@freebsd.org Tue Dec 6 13:53:25 2016
From: Steven Hartland
To: Konstantin Belousov
Cc: "freebsd-hackers@freebsd.org"
Subject: Re: Help needed to identify golang fork / memory corruption issue on FreeBSD
Date: Tue, 6 Dec 2016 13:53:52 +0000
Message-ID: <8b502580-4d2d-1e1f-9e05-61d46d5ac3b1@multiplay.co.uk>
In-Reply-To: <20161206125919.GQ54029@kib.kiev.ua>
References: <27e1a828-5cd9-0755-50ca-d7143e7df117@multiplay.co.uk> <20161206125919.GQ54029@kib.kiev.ua>

On 06/12/2016 12:59, Konstantin Belousov wrote:
> On Tue, Dec 06, 2016 at 12:31:47PM +0000, Steven Hartland wrote:
>> Hi guys, I'm trying to help identify / fix an issue with golang whereby
>> fork results in memory corruption.
>>
>> Details of the issue can be found here:
>> https://github.com/golang/go/issues/15658
>>
>> In summary, when a fork is done in golang it has a chance of causing
>> memory corruption in the parent, resulting in a process crash once detected.
>>
>> It's believed that this only affects FreeBSD.
>>
>> This has similarities to other reported issues such as this one which
>> impacted perl during 10.x:
>> https://rt.perl.org/Public/Bug/Display.html?id=122199
> I cannot judge any similarities when all the description provided
> is 'memory corruption'. BTW, the perl issue described, where the child
> segfaults after the fork, is more likely to be caused by the set of
> problems referenced in FreeBSD-EN-16:17.vm.
>
>> And more recently the issue with nginx on 11.x:
>> https://lists.freebsd.org/pipermail/freebsd-stable/2016-September/085540.html
> Which does not affect anything unless aio is used on Sandy/Ivy.
>
>> It's possible, some believe likely, that this is a kernel bug around fork
>> / vm that golang stresses, but I've not been able to confirm.
>>
>> I can reproduce the issue at will; it takes between 5 minutes and 1 hour
>> using 16 threads, and it definitely seems like an interaction between
>> fork and other memory operations.
> Which arch is the kernel and the process which demonstrates the behaviour?
> I mean i386/amd64.

amd64

>
>> I've tried reproducing the issue in C but also no joy (captured in the bug).
>>
>> For reference I'm currently testing on 11.0-RELEASE-p3 + kib's PCID fix
>> (#306350).
> Switch to HEAD kernel, for a start.
> Show the memory map of the failed process.
> Are you able to take a ktrace of the process while still producing the bug?

Whenever I've tried ktrace the issue doesn't present itself. I can try
running it for an extended period to see if it does eventually, but I did
run it for a few hours without any joy.

I'm currently testing with an 11.0-RELEASE debug kernel (witness,
invariants etc.) to see if that would detect anything; however, so far
it's taking longer than usual to reproduce, so it may simply not occur
with a debug kernel.

> Where does the memory corruption happen? Is it in go runtime structures,
> or in the application data?

It's usually detected by the runtime GC, which panics with a number of
errors, e.g.

fatal error: all goroutines are asleep - deadlock!

fatal error: workbuf is empty

runtime: nelems=256 nfree=233 nalloc=23 previous allocCount=18 nfreed=65531
fatal error: sweep increased allocation count

runtime: failed MSpanList_Remove 0x800698500 0x800b46d40 0x53adb0 0x53ada0
fatal error: MSpanList_Remove

As the test is very basic, it's unlikely to see an issue in the
application data.

> Can somebody knowledgeable of either the go runtime or the app
> try to identify the initial corrupted userspace data?

The golang developers have looked but were unable to reproduce on the
freebsd-amd64-gce101 gomote running FreeBSD 10.1. This could be a factor
of the VM; it's unclear.

The app is a tiny test binary which I'm currently running with GOGC=2:

package main

import (
	"fmt"
	"os/exec"
	"runtime"
	"time"
)

var (
	gcPeriod     = time.Second * 10
	forkRoutines = 16
)

// run fork/execs /usr/bin/true, waits for it and signals completion.
func run(done chan struct{}) {
	cmd := exec.Command("/usr/bin/true")
	cmd.Start()
	cmd.Wait()

	done <- struct{}{}
}

func main() {
	fmt.Printf("Starting %v forking goroutines...\n", forkRoutines)
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))

	done := make(chan struct{}, forkRoutines*2)

	for i := 0; i < forkRoutines; i++ {
		go run(done)
	}

	for {
		start := time.Now()
		active := forkRoutines
	forking:
		for range done {
			if time.Since(start) > gcPeriod {
				// Period over: let the outstanding forks drain.
				active--
				if active == 0 {
					break forking
				}
			} else {
				// Keep forkRoutines forks in flight.
				go run(done)
			}
		}

		// Force a collection with no forks running, then start again.
		runtime.GC()

		for i := 0; i < forkRoutines; i++ {
			go run(done)
		}
	}
}
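
The test above ignores the errors from cmd.Start and cmd.Wait, so a failed
exec would be indistinguishable from a successful one. A bounded variant
that counts them might look something like the following. This is only a
sketch and not what produced the errors above: forkBatch and its
parameters are hypothetical names, and it uses cmd.Run (Start followed by
Wait) so that the error can be inspected.

```go
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

// forkBatch is a hypothetical helper, not part of the original test.
// It starts `workers` goroutines, each of which fork/execs `path`
// `perWorker` times, and returns how many runs succeeded and failed.
func forkBatch(path string, workers, perWorker int) (ok, failed int) {
	var (
		mu sync.Mutex
		wg sync.WaitGroup
	)

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				// Run is Start followed by Wait; the error is
				// counted here instead of being dropped.
				err := exec.Command(path).Run()

				mu.Lock()
				if err != nil {
					failed++
				} else {
					ok++
				}
				mu.Unlock()
			}
		}()
	}

	// All counter updates happen before Wait returns.
	wg.Wait()
	return ok, failed
}

func main() {
	ok, failed := forkBatch("/usr/bin/true", 16, 100)
	fmt.Printf("ok=%d failed=%d\n", ok, failed)
}
```

A non-zero failed count would point at exec itself failing rather than at
the GC noticing corrupted runtime structures, which are separate symptoms.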