From owner-freebsd-current@freebsd.org Tue Feb 21 05:39:36 2017 Return-Path: Delivered-To: freebsd-current@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id B0D75CE678F for ; Tue, 21 Feb 2017 05:39:36 +0000 (UTC) (envelope-from markmi@dsl-only.net) Received: from asp.reflexion.net (outbound-mail-210-70.reflexion.net [208.70.210.70]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 7132610BC for ; Tue, 21 Feb 2017 05:39:35 +0000 (UTC) (envelope-from markmi@dsl-only.net) Received: (qmail 18996 invoked from network); 21 Feb 2017 05:41:40 -0000 Received: from unknown (HELO mail-cs-02.app.dca.reflexion.local) (10.81.19.2) by 0 (rfx-qmail) with SMTP; 21 Feb 2017 05:41:40 -0000 Received: by mail-cs-02.app.dca.reflexion.local (Reflexion email security v8.30.0) with SMTP; Tue, 21 Feb 2017 00:39:34 -0500 (EST) Received: (qmail 27892 invoked from network); 21 Feb 2017 05:39:34 -0000 Received: from unknown (HELO iron2.pdx.net) (69.64.224.71) by 0 (rfx-qmail) with (AES256-SHA encrypted) SMTP; 21 Feb 2017 05:39:34 -0000 Received: from [192.168.1.111] (c-67-170-167-181.hsd1.or.comcast.net [67.170.167.181]) by iron2.pdx.net (Postfix) with ESMTPSA id 386FDEC7E18; Mon, 20 Feb 2017 21:39:33 -0800 (PST) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 10.2 \(3259\)) Subject: Re: svn commit: r313268 - head/sys/kern [through -r313271 for atomic_fcmpset use and later: fails on PowerMac G5 "Quad Core"; -r313266 works] From: Mark Millard In-Reply-To: <26B93201-D3E9-4443-A60F-FC871EC5C96C@dsl-only.net> Date: Mon, 20 Feb 2017 21:39:32 -0800 Cc: svn-src-head@freebsd.org, FreeBSD Current , Justin Hibbits , mjg@freebsd.org, FreeBSD PowerPC ML Content-Transfer-Encoding: quoted-printable Message-Id: <7089D194-8BB3-4D40-8314-4A15B41986DA@dsl-only.net> References: <2FD12B8F-2255-470A-98D4-2DCE9C7495F5@dsl-only.net> <20170220191044.GA8526@dft-labs.eu> <83428304-87BE-413C-BAB9-8FF218E7661C@dsl-only.net> <20170220233537.GB26759@dft-labs.eu> <26B93201-D3E9-4443-A60F-FC871EC5C96C@dsl-only.net> To: Mateusz Guzik X-Mailer: Apple Mail (2.3259) X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 21 Feb 2017 05:39:36 -0000 On 2017-Feb-20, at 6:36 PM, Mark Millard wrote: > On 2017-Feb-20, at 3:35 PM, Mateusz Guzik = wrote: >=20 >> On Mon, Feb 20, 2017 at 03:10:44PM -0800, Mark Millard wrote: >>> On 2017-Feb-20, at 2:58 PM, Mark Millard = wrote: >>>=20 >>>> On 2017-Feb-20, at 11:10 AM, Mateusz Guzik = wrote: >>>>=20 >>>>> On Sat, Feb 18, 2017 at 04:18:05AM -0800, Mark Millard wrote: >>>>>> [Note: I experiment with clang based powerpc64 builds, >>>>>> reporting problems that I find. Justin is familiar >>>>>> with this, as is Nathan.] >>>>>>=20 >>>>>> I tried to update the PowerMac G5 (a so-called "Quad Core") >>>>>> that I have access to from head -r312761 to -r313864 and >>>>>> ended up with random panics and hang ups in fairly short >>>>>> order after booting. >>>>>>=20 >>>>>> Some approximate bisecting for the kernel lead to: >>>>>> (sometimes getting part way into a buildkernel attempt >>>>>> for a different version before a failure happens) >>>>>>=20 >>>>>> -r313266: works (just before use of atomic_fcmpset) >>>>>> vs. >>>>>> -r313271: fails (last of the "use atomic_fcmpset" check-ins) >>>>>>=20 >>>>>> (I did not try -r313268 through -r313270 as the use was >>>>>> gradually added.) >>>>>>=20 >>>>>> So I'm currently running a -r313864 world with a -r313266 >>>>>> kernel. >>>>>>=20 >>>>>> No kernel that I tried that was from before -r313266 had the >>>>>> problems. >>>>>>=20 >>>>>> Any kernel that I tried that was from after -r313271 had the >>>>>> problems. >>>>>>=20 >>>>>> Of course I did not try them all in other direction. :) >>>>>>=20 >>>>>=20 >>>>> I found that spin mutexes were not properly handling this, fixed = in >>>>> r313996. >>>>>=20 >>>>> Locally I added a if (cpu_tick() % 2) return (0); snipped to amd64 >>>>> fcmpset to simulate failures. Everything works, while it would = easily >>>>> fail without the patch. >>>>>=20 >>>>> That said, I hope this concludes the 'missing check for not-reread = value >>>>> of failed fcmpset' saga. >>>>>=20 >>>>> --=20 >>>>> Mateusz Guzik >>>>=20 >>>> I tried to update from -r313864 to -r313999 in my amd64 context >>>> (a VirtualBox machine under macOS) but it now crashes late in >>>> the boot sequence (after it processes a dump if I make one but >>>> before I can log in). >>>>=20 >>>> This update was via my usual explicit svnlite update; buildworld >>>> buildkernel; etc. production style build of world and kernel, >>>> including use of MALLOC_PRODUCTION. >>>>=20 >>>> The window shows: >>>>=20 >>>> _vm_map_lock+0xf >>>> vm_map_wire+0x32 >>>> rtROMemObjNativeLockInMap+0x8c >>>> rtROMemObjNativeLockUser+0x51 >>>> RTR0MemObjLockUserTag+0x231 >>>> vbglR0HGCMInternalPreprocessCall+0x65d >>>> vbglR0HGCMInternalCall+0x17c >>>> vgdrvIoCtl_HGCMCall+0x43f >>>> VGDrvCommonIoCtl+0x261 >>>> vgdrvFreeBSDIOCtl+0x2cd >>>> devfs_ioctl+0xae >>>> VOP_IOCTL_APV+0x88 >>>> vn_ioctl+0x161 >>>> devfs_ioctl_f+0x1f >>>> kern_ioctl+0x280 >>>> sys_ioctl+0x13f >>>> amd64_syscall+0x397 >>>> Xfast_syscall+0xfb >>>=20 >>> More detail from booting with the -r313864 kernel.old >>> and using kgdb on what the dump produced: >>>=20 >>> # kgdb kernel.debug /var/crash/vmcore. >>> /var/crash/vmcore.0 /var/crash/vmcore.last >>> # kgdb kernel.debug /var/crash/vmcore.0 >>> GNU gdb 6.1.1 [FreeBSD] >>> Copyright 2004 Free Software Foundation, Inc. >>> GDB is free software, covered by the GNU General Public License, and = you are >>> welcome to change it and/or distribute copies of it under certain = conditions. >>> Type "show copying" to see the conditions. >>> There is absolutely no warranty for GDB. Type "show warranty" for = details. >>> This GDB was configured as "amd64-marcel-freebsd"... >>>=20 >>> Unread portion of the kernel message buffer: >>> <118>Starting vboxservice. >>> <118>VBoxService 5.1.14 r112924 (verbosity: 0) freebsd.amd64 (Jan 20 = 2017 18:37:45) release log >>> <118>00:00:00.000120 main Log opened = 2017-02-20T22:38:46.348080000Z >>> <118>00:00:00.000162 main OS Product: FreeBSD >>> <118>00:00:00.000171 main OS Release: 12.0-CURRENT >>> <118>00:00:00.000180 main OS Version: FreeBSD 12.0-CURRENT = r313999M >>> <118>00:00:00.000192 main Executable: = /usr/local/sbin/VBoxService >>> <118>00:00:00.000194 main Process ID: 609 >>> <118>00:00:00.000196 main Package type: BSD_64BITS_GENERIC (OSE) >>>=20 >>>=20 >>> Fatal trap 12: page fault while in kernel mode >>> cpuid =3D 2; apic id =3D 02 >>> fault virtual address =3D 0xd6 >>> fault code =3D supervisor read data, page not present >>> instruction pointer =3D 0x20:0xffffffff80d4ebaf >>> stack pointer =3D 0x28:0xfffffe0122e2bef0 >>> frame pointer =3D 0x28:0xfffffe0122e2bf00 >>> code segment =3D base 0x0, limit 0xfffff, type 0x1b >>> =3D DPL 0, pres 1, long 1, def32 0, gran 1 >>> processor eflags =3D interrupt enabled, resume, IOPL =3D 0 >>> current process =3D 609 (VBoxService) >>>=20 >>=20 >>=20 >>=20 >>> #9 0xffffffff80eb6be1 in calltrap () at = /usr/src/sys/amd64/amd64/exception.S:236 >>> #10 0xffffffff80d4ebaf in _vm_map_lock (map=3D0x1, file=3D0x0, = line=3D0) at /usr/src/sys/vm/vm_map.c:501 >>=20 >> The function is: >> void >> _vm_map_lock(vm_map_t map, const char *file, int line) >> { >>=20 >> if (map->system_map) >> mtx_lock_flags_(&map->system_mtx, 0, file, line); >> else >> sx_xlock_(&map->lock, file, line); >> map->timestamp++; >> } >>=20 >> system_map is at offset 0xd5, thus the faulting address of 0xd6 with = map >> address of 1 looks like the backtrace is corect. But this suggests = the >> bug is unrelated to my changes and there is a chance there is no bug = in >> the first place. >>=20 >> Please make sure that the virtualbox module is recompiled against = proper >> source tree. If the problem persists, please bisect. The range is not >> big. >>=20 >> Off hand I don't see what can cause the failure in question (and = chances >> are there is no bug if kbi changed and the module was not = recompiled). >>=20 >>> #11 0xffffffff80d51ea2 in vm_map_wire (map=3D, = start=3D4534272, end=3D4538368, flags=3D1) at = /usr/src/sys/vm/vm_map.c:2534 >>> #12 0xffffffff8265291c in rtR0MemObjNativeLockInMap () from = /boot/modules/vboxguest.ko >>> #13 0xffffffff82652881 in rtR0MemObjNativeLockUser () from = /boot/modules/vboxguest.ko >>> #14 0xffffffff8264ec01 in RTR0MemObjLockUserTag () from = /boot/modules/vboxguest.ko >>> #15 0xffffffff82624afd in vbglR0HGCMInternalPreprocessCall () from = /boot/modules/vboxguest.ko >>> #16 0xffffffff8262411a in VbglR0HGCMInternalCall () from = /boot/modules/vboxguest.ko >>> #17 0xffffffff8261ec4f in vgdrvIoCtl_HGCMCall () from = /boot/modules/vboxguest.ko >>> #18 0xffffffff8261d221 in VGDrvCommonIoCtl () from = /boot/modules/vboxguest.ko >=20 > I do not expect that the kernel binary interface deliberately changed > between -r313864 and -r313999. Until the attempted update of amd64 > (which I always do first) the amd64 and arm64 were running: >=20 > . . . 12.0-CURRENT FreeBSD 12.0-CURRENT r313864M . . . 1200021 = 1200021 >=20 > I've not noticed an update to 1200022 yet. >=20 > [It turned out that for powerpc64 I had to use -r313266 for the > kernel when I tried to update to -r313864. This does mix 1200020 > and 1200021. But 1200021 was removal of support for things I do > not have involved --and the combination has seemed okay so far.] >=20 > I've decided to do a round of port upgrades (to -r434493), > although virtualbox client has not been updated. I'll force a > rebuild before I'm done. >=20 > It turns out that llvm39 is now required for what I choose to > have and its build ran out of RAM/swap as I had things configured. > So I've adjusted to have the VM have more RAM assigned and I'm not > starting lumina but just using the console for now. We will see. >=20 > Note: I always manually start lumina and so it was not > involved in the boot problem: it was just a basic console > style context at all times for the boot crash. >=20 > Overall it will be a while before I have a works vs. fails > pair that are significantly closer together. Looks like some kernel binary interface (as seen by emulators/virtualbox-ose-addition ) has changed: rebuilding emulators/virtualbox-ose-addition removed the booting crash but uname -apKU still lists 1200021 and 2100021 for the kernel and world for -r313999, just like for -r313864. [Unfortunately lumina no longer processes my mouse clicks so now I've got something else to figure out.] =3D=3D=3D Mark Millard markmi at dsl-only.net