Date: Tue, 28 Mar 2017 10:58:59 +0300 From: Konstantin Belousov <kostikbel@gmail.com> To: Steven Hartland <killing@multiplay.co.uk> Cc: "K. Macy" <kmacy@freebsd.org>, "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org> Subject: Re: Help needed to identify golang fork / memory corruption issue on FreeBSD Message-ID: <20170328075859.GQ43712@kib.kiev.ua> In-Reply-To: <17f29342-f3c0-5940-d012-1a698e59a384@multiplay.co.uk> References: <CAHM0Q_Mg662u9D0KJ9knEWWqi9Ydy38qKDnjLt6XaS0ks%2B9-iw@mail.gmail.com> <18b40a69-4460-faf2-c0ce-7491eca92782@multiplay.co.uk> <20170317082333.GP16105@kib.kiev.ua> <180a601b-5481-bb41-f7fc-67976aabe451@multiplay.co.uk> <20170317124437.GR16105@kib.kiev.ua> <5ba92447-945e-6fea-ad4f-f58ac2a0012e@multiplay.co.uk> <20170327161833.GL43712@kib.kiev.ua> <3ec35a46-ae70-35cd-29f8-82e7cebb0eb6@multiplay.co.uk> <20170327164905.GN43712@kib.kiev.ua> <17f29342-f3c0-5940-d012-1a698e59a384@multiplay.co.uk>
next in thread | previous in thread | raw e-mail | index | archive | help
On Tue, Mar 28, 2017 at 12:32:37AM +0100, Steven Hartland wrote: > On 27/03/2017 17:49, Konstantin Belousov wrote: > > On Mon, Mar 27, 2017 at 05:33:49PM +0100, Steven Hartland wrote: > >> On 27/03/2017 17:18, Konstantin Belousov wrote: > >>> On Mon, Mar 27, 2017 at 12:47:11PM +0100, Steven Hartland wrote: > >>>> OK now the similar but unrelated issue with signal stacks is solved I've > >>>> moved back to the initial issue. > >>>> > >>>> I've made some progress with a reproduction case as detailed here: > >>>> https://github.com/golang/go/issues/15658#issuecomment-288747812 > >>>> > >>>> In short it seems that having a running child, while the parent runs GC, > >>>> is some how responsible for memory corruption in the parent. > >>>> > >>>> The reason I believe this is if I run the same GC in the parent after > >>>> the child exits instead of while its running, I've been unable to > >>>> reproduce the issue. > >>>> > >>>> As the memory segments are COW then the issue might be in VM subsystem. > >>> Well, it might be, but it is a strange corruption mode to believe. > >> Indeed, but would you agree the evidence seems to indicate that this may > >> be the case, as otherwise I would have expected that running the GC > >> after the child process has exited would have zero impact on the issue. > >>>> In order to confirm / deny this I was wondering if there was a way to > >>>> force a full copy of all segments for the child instead of using the COW > >>>> optimisation. > >>> No, there is no. By design, copying only occurs on faults, when VM > >>> detects that the map entry needs copying. Doing the actual copy at fork > >>> time would require writing a lot of new code. > >> I noticed in vm_map_copy_entry the following: > >> /* > >> * We don't want to make writeable wired pages > >> copy-on-write. > >> * Immediately copy these pages into the new map by > >> simulating > >> * page faults. The new pages are pageable. > >> */ > >> vm_fault_copy_entry(dst_map, src_map, dst_entry, src_entry, > >> fork_charge); > >> > >> I wondered if I could use vm_fault_copy_entry to force the copy on fork? > > No, the vm_fault_copy_entry() only works with wired entries, e.g. it cannot > > page in not yet touched page, and the result is also wired. > > > >>> Does go have FreeBSD/i386 port ? If yes, is the issue reproducable there ? > >> Yes it does, I don't currently have i386 machine to test with, I'm > >> assuming testing i386 on amd64 kernel, would likely not have any effect. > > Only if the bug is in kernel and not in the go runtime. I am still not > > convinced that the kernel is the culprit. > > > >>> Another blind experiment to try is to comment out call to > >>> vm_object_collapse() in sys/vm/vm_map.c:vm_map_copy_entry() and see if > >>> it changes anything. > >> I'll do that shortly. > >>> What could be quite interesting is to look at the parent and possibly > >>> child address map after the error occured, using procstat -v. At > >>> least for parent, this should be relatively easy to set up, just make > >>> go runtime spin or pause on panic, instead of exiting, and then use > >>> procstat. > >> > Here's both parent and child after a failure in the parent, which I > obtained by putting the child in a nanosleep loop and only after > successful GC call I send SIGTERM the child and reap it. > > procstat -v 53832 61121 > PID START END PRT RES PRES REF SHD FLAG > TP PATH > 53832 0x400000 0x70e000 r-x 308 601 5 1 CN-- vn > /root/golang/src/test5/test5 > 53832 0x70e000 0x951000 r-- 261 601 5 1 CN-- vn > /root/golang/src/test5/test5 > 53832 0x951000 0x988000 rw- 31 0 1 0 C--- vn > /root/golang/src/test5/test5 > 53832 0x988000 0x9ab000 rw- 18 0 1 0 C--- df > 53832 0x800951000 0x800b51000 rw- 41 0 1 0 C--- df > 53832 0x800b51000 0x800c21000 rw- 26 0 1 0 C--- df > 53832 0x800c21000 0x800c71000 rw- 18 0 1 0 C--- df > 53832 0x800c71000 0x800cb1000 rw- 1 0 1 0 C--- df > 53832 0x800cb1000 0x800cf1000 rw- 2 0 1 0 C--- df > 53832 0x800cf1000 0x800d71000 rw- 3 0 1 0 C--- df > 53832 0x800d71000 0x800db1000 rw- 1 0 1 0 C--- df > 53832 0x800db1000 0x800e71000 rw- 3 0 1 0 C--- df > 53832 0x800e71000 0x800eb1000 rw- 1 1 1 0 ---- df > 53832 0xc000000000 0xc000001000 rw- 1 1 2 0 CN-- df > 53832 0xc41fff0000 0xc41fff8000 rw- 3 3 2 0 CN-- df > 53832 0xc41fff8000 0xc420200000 rw- 251 0 1 0 C--- df > 53832 0x7ffffffdf000 0x7ffffffff000 rwx 2 0 1 0 C--D df > 53832 0x7ffffffff000 0x800000000000 r-x 1 1 28 0 ---- ph > 61121 0x400000 0x70e000 r-x 308 601 5 1 CN-- vn > /root/golang/src/test5/test5 > 61121 0x70e000 0x951000 r-- 261 601 5 1 CN-- vn > /root/golang/src/test5/test5 > 61121 0x951000 0x988000 rw- 31 0 2 1 CN-- vn > /root/golang/src/test5/test5 > 61121 0x988000 0x9ab000 rw- 18 18 2 1 CN-- df > 61121 0x800951000 0x800b51000 rw- 41 41 2 1 CN-- df > 61121 0x800b51000 0x800c21000 rw- 26 26 2 1 CN-- df > 61121 0x800c21000 0x800c71000 rw- 18 18 2 1 CN-- df > 61121 0x800c71000 0x800cb1000 rw- 1 1 2 1 CN-- df > 61121 0x800cb1000 0x800cf1000 rw- 2 2 2 1 CN-- df > 61121 0x800cf1000 0x800d71000 rw- 3 3 2 1 CN-- df > 61121 0x800d71000 0x800db1000 rw- 1 1 2 1 CN-- df > 61121 0x800db1000 0x800e71000 rw- 3 3 2 1 CN-- df > 61121 0xc000000000 0xc000001000 rw- 1 1 2 0 CN-- df > 61121 0xc41fff0000 0xc41fff8000 rw- 3 3 2 0 CN-- df > 61121 0xc41fff8000 0xc420200000 rw- 251 0 1 0 C--- df > 61121 0x7ffffffdf000 0x7ffffffff000 rwx 2 2 2 1 CN-D df > 61121 0x7ffffffff000 0x800000000000 r-x 1 1 28 0 ---- ph > > Should the parent have lost the COW flag to the region starting at > 0x800e71000? Note that the region is absent in the child, so most likely this is a new mmaping occured in the parent after fork. Again, what is the address where an invalid value was detected ?
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20170328075859.GQ43712>