From owner-freebsd-hackers@freebsd.org  Mon Mar 27 23:32:39 2017
Return-Path: <owner-freebsd-hackers@freebsd.org>
Delivered-To: freebsd-hackers@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 3BFA3D20688
 for <freebsd-hackers@mailman.ysv.freebsd.org>;
 Mon, 27 Mar 2017 23:32:39 +0000 (UTC)
 (envelope-from killing@multiplay.co.uk)
Received: from mail-wr0-x22e.google.com (mail-wr0-x22e.google.com
 [IPv6:2a00:1450:400c:c0c::22e])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id B948DDB6
 for <freebsd-hackers@freebsd.org>; Mon, 27 Mar 2017 23:32:38 +0000 (UTC)
 (envelope-from killing@multiplay.co.uk)
Received: by mail-wr0-x22e.google.com with SMTP id u1so79470916wra.2
 for <freebsd-hackers@freebsd.org>; Mon, 27 Mar 2017 16:32:38 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=multiplay-co-uk.20150623.gappssmtp.com; s=20150623;
 h=subject:to:references:cc:from:message-id:date:user-agent
 :mime-version:in-reply-to;
 bh=1+gsSINhaQYlKt57S86iI7D2L0GICoBqhcOWc7J699s=;
 b=hrFRiW0kyMpDyE4EgIxfrRbTGD67em3D05oCsGJzDrtSEckd8DN7C0wWHY6RZ1Kbx8
 1CDgpN7QBP26+STXdvex/pAWekGPV8I/CbbDl2ejrVQ/g+ZdlfrVYD8vCMIYyXO//tnA
 5FrfG899sIP30ChjQXbScli8Nw/l3fbz4SUkuEuVZrHBNdkIWYPEbKqOBdfLj5hrgHfS
 98UXiB/e48itdEGGy6nXp8rKlJFTMgKbS17KcW7V/ldr4QHEed789sASooD4f5BYt6Mq
 i89NeyqThySd9U18PudpwkR3MqJd7PYH7StNhw0SszlH89i2iJqvHjPkA6AW/8Cximf+
 veXA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:subject:to:references:cc:from:message-id:date
 :user-agent:mime-version:in-reply-to;
 bh=1+gsSINhaQYlKt57S86iI7D2L0GICoBqhcOWc7J699s=;
 b=ilUfDxFtOo76ptKkL2TrAvXQ8F9DxuCkIfnqvGK0+IejFVVKgk+Ai+bIKjFvyB0CpN
 t2/z7FCoUyEiXsEhjYjPreVFBTlHsZeOco+f0CckDjvb1WGIQ9ID6j1202WOy4beacyI
 QF7ntLSU6pl+mkmdrWW5sNQWJScc17taHVKedL5yh7a5Dc2oq4KQVyS/fLdss5xapSOf
 cHsDKH6ezd7XHPSBcSmP55wMPBG+924bxI88+K0cICUHp9hSXwYXzW5MinYsvQ/lA2QW
 4zCUIUMhXDiNkqpxVy1ffnnm7cDvC9NXRn3cZjXrrmVav2H68TEef/4/Ft5KK5F7JEu/
 3A9A==
X-Gm-Message-State: AFeK/H1qLj6G5YjifCsa7tBRd8FcpvIlVdhrDvzxlVjsFI11Lm7ArLs3t/aeH1m2gElIYYQW
X-Received: by 10.223.156.2 with SMTP id f2mr22096235wrc.176.1490657556797;
 Mon, 27 Mar 2017 16:32:36 -0700 (PDT)
Received: from [10.10.1.58] ([185.97.61.26])
 by smtp.gmail.com with ESMTPSA id l90sm1234233wmi.25.2017.03.27.16.32.35
 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
 Mon, 27 Mar 2017 16:32:35 -0700 (PDT)
Subject: Re: Help needed to identify golang fork / memory corruption issue on
 FreeBSD
To: Konstantin Belousov <kostikbel@gmail.com>
References: <20161206143532.GR54029@kib.kiev.ua>
 <e160381c-9935-6edf-04a9-1ff78e95d818@multiplay.co.uk>
 <CAHM0Q_Mg662u9D0KJ9knEWWqi9Ydy38qKDnjLt6XaS0ks+9-iw@mail.gmail.com>
 <18b40a69-4460-faf2-c0ce-7491eca92782@multiplay.co.uk>
 <20170317082333.GP16105@kib.kiev.ua>
 <180a601b-5481-bb41-f7fc-67976aabe451@multiplay.co.uk>
 <20170317124437.GR16105@kib.kiev.ua>
 <5ba92447-945e-6fea-ad4f-f58ac2a0012e@multiplay.co.uk>
 <20170327161833.GL43712@kib.kiev.ua>
 <3ec35a46-ae70-35cd-29f8-82e7cebb0eb6@multiplay.co.uk>
 <20170327164905.GN43712@kib.kiev.ua>
Cc: "K. Macy" <kmacy@freebsd.org>,
 "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>
From: Steven Hartland <killing@multiplay.co.uk>
Message-ID: <17f29342-f3c0-5940-d012-1a698e59a384@multiplay.co.uk>
Date: Tue, 28 Mar 2017 00:32:37 +0100
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101
 Thunderbird/45.8.0
MIME-Version: 1.0
In-Reply-To: <20170327164905.GN43712@kib.kiev.ua>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
X-Content-Filtered-By: Mailman/MimeDel 2.1.23
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
 <freebsd-hackers.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-hackers>, 
 <mailto:freebsd-hackers-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-hackers/>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Help: <mailto:freebsd-hackers-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-hackers>, 
 <mailto:freebsd-hackers-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 27 Mar 2017 23:32:39 -0000

On 27/03/2017 17:49, Konstantin Belousov wrote:
> On Mon, Mar 27, 2017 at 05:33:49PM +0100, Steven Hartland wrote:
>> On 27/03/2017 17:18, Konstantin Belousov wrote:
>>> On Mon, Mar 27, 2017 at 12:47:11PM +0100, Steven Hartland wrote:
>>>> OK now the similar but unrelated issue with signal stacks is solved I've
>>>> moved back to the initial issue.
>>>>
>>>> I've made some progress with a reproduction case as detailed here:
>>>> https://github.com/golang/go/issues/15658#issuecomment-288747812
>>>>
>>>> In short it seems that having a running child, while the parent runs GC,
>>>> is some how responsible for memory corruption in the parent.
>>>>
>>>> The reason I believe this is if I run the same GC in the parent after
>>>> the child exits instead of while its running, I've been unable to
>>>> reproduce the issue.
>>>>
>>>> As the memory segments are COW then the issue might be in VM subsystem.
>>> Well, it might be, but it is a strange corruption mode to believe.
>> Indeed, but would you agree the evidence seems to indicate that this may
>> be the case, as otherwise I would have expected that running the GC
>> after the child process has exited would have zero impact on the issue.
>>>> In order to confirm / deny this I was wondering if there was a way to
>>>> force a full copy of all segments for the child instead of using the COW
>>>> optimisation.
>>> No, there is no. By design, copying only occurs on faults, when VM
>>> detects that the map entry needs copying. Doing the actual copy at fork
>>> time would require writing a lot of new code.
>> I noticed in vm_map_copy_entry the following:
>>                   /*
>>                    * We don't want to make writeable wired pages
>> copy-on-write.
>>                    * Immediately copy these pages into the new map by
>> simulating
>>                    * page faults.  The new pages are pageable.
>>                    */
>>                   vm_fault_copy_entry(dst_map, src_map, dst_entry, src_entry,
>>                       fork_charge);
>>
>> I wondered if I could use vm_fault_copy_entry to force the copy on fork?
> No, the vm_fault_copy_entry() only works with wired entries, e.g. it cannot
> page in not yet touched page, and the result is also wired.
>
>>> Does go have FreeBSD/i386 port ?  If yes, is the issue reproducable there ?
>> Yes it does, I don't currently have i386 machine to test with, I'm
>> assuming testing i386 on amd64 kernel, would likely not have any effect.
> Only if the bug is in kernel and not in the go runtime.  I am still not
> convinced that the kernel is the culprit.
>
>>> Another blind experiment to try is to comment out call to
>>> vm_object_collapse() in sys/vm/vm_map.c:vm_map_copy_entry() and see if
>>> it changes anything.
>> I'll do that shortly.
>>> What could be quite interesting is to look at the parent and possibly
>>> child address map after the error occured, using procstat -v. At
>>> least for parent, this should be relatively easy to set up, just make
>>> go runtime spin or pause on panic, instead of exiting, and then use
>>> procstat.
>>
Here's both parent and child after a failure in the parent, which I 
obtained by putting the child in a nanosleep loop and only after 
successful GC call I send SIGTERM the child and reap it.

procstat -v 53832 61121
   PID              START                END PRT  RES PRES REF SHD FLAG 
TP PATH
53832           0x400000           0x70e000 r-x  308  601 5   1 CN-- vn 
/root/golang/src/test5/test5
53832           0x70e000           0x951000 r--  261  601 5   1 CN-- vn 
/root/golang/src/test5/test5
53832           0x951000           0x988000 rw-   31    0 1   0 C--- vn 
/root/golang/src/test5/test5
53832           0x988000           0x9ab000 rw-   18    0 1   0 C--- df
53832        0x800951000        0x800b51000 rw-   41    0 1   0 C--- df
53832        0x800b51000        0x800c21000 rw-   26    0 1   0 C--- df
53832        0x800c21000        0x800c71000 rw-   18    0 1   0 C--- df
53832        0x800c71000        0x800cb1000 rw-    1    0 1   0 C--- df
53832        0x800cb1000        0x800cf1000 rw-    2    0 1   0 C--- df
53832        0x800cf1000        0x800d71000 rw-    3    0 1   0 C--- df
53832        0x800d71000        0x800db1000 rw-    1    0 1   0 C--- df
53832        0x800db1000        0x800e71000 rw-    3    0 1   0 C--- df
53832        0x800e71000        0x800eb1000 rw-    1    1 1   0 ---- df
53832       0xc000000000       0xc000001000 rw-    1    1 2   0 CN-- df
53832       0xc41fff0000       0xc41fff8000 rw-    3    3 2   0 CN-- df
53832       0xc41fff8000       0xc420200000 rw-  251    0 1   0 C--- df
53832     0x7ffffffdf000     0x7ffffffff000 rwx    2    0 1   0 C--D df
53832     0x7ffffffff000     0x800000000000 r-x    1    1 28   0 ---- ph
61121           0x400000           0x70e000 r-x  308  601 5   1 CN-- vn 
/root/golang/src/test5/test5
61121           0x70e000           0x951000 r--  261  601 5   1 CN-- vn 
/root/golang/src/test5/test5
61121           0x951000           0x988000 rw-   31    0 2   1 CN-- vn 
/root/golang/src/test5/test5
61121           0x988000           0x9ab000 rw-   18   18 2   1 CN-- df
61121        0x800951000        0x800b51000 rw-   41   41 2   1 CN-- df
61121        0x800b51000        0x800c21000 rw-   26   26 2   1 CN-- df
61121        0x800c21000        0x800c71000 rw-   18   18 2   1 CN-- df
61121        0x800c71000        0x800cb1000 rw-    1    1 2   1 CN-- df
61121        0x800cb1000        0x800cf1000 rw-    2    2 2   1 CN-- df
61121        0x800cf1000        0x800d71000 rw-    3    3 2   1 CN-- df
61121        0x800d71000        0x800db1000 rw-    1    1 2   1 CN-- df
61121        0x800db1000        0x800e71000 rw-    3    3 2   1 CN-- df
61121       0xc000000000       0xc000001000 rw-    1    1 2   0 CN-- df
61121       0xc41fff0000       0xc41fff8000 rw-    3    3 2   0 CN-- df
61121       0xc41fff8000       0xc420200000 rw-  251    0 1   0 C--- df
61121     0x7ffffffdf000     0x7ffffffff000 rwx    2    2 2   1 CN-D df
61121     0x7ffffffff000     0x800000000000 r-x    1    1 28   0 ---- ph

Should the parent have lost the COW flag to the region starting at 
0x800e71000?

     Regards
     Steve

     Regards
     Steve