Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 20 Feb 2017 21:39:32 -0800
From:      Mark Millard <markmi@dsl-only.net>
To:        Mateusz Guzik <mjguzik@gmail.com>
Cc:        svn-src-head@freebsd.org, FreeBSD Current <freebsd-current@freebsd.org>, Justin Hibbits <chmeeedalf@gmail.com>, mjg@freebsd.org, FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>
Subject:   Re: svn commit: r313268 - head/sys/kern [through -r313271 for atomic_fcmpset use and later: fails on PowerMac G5 "Quad Core"; -r313266 works]
Message-ID:  <7089D194-8BB3-4D40-8314-4A15B41986DA@dsl-only.net>
In-Reply-To: <26B93201-D3E9-4443-A60F-FC871EC5C96C@dsl-only.net>
References:  <2FD12B8F-2255-470A-98D4-2DCE9C7495F5@dsl-only.net> <20170220191044.GA8526@dft-labs.eu> <83428304-87BE-413C-BAB9-8FF218E7661C@dsl-only.net> <CD8044D2-C11A-4ABE-B72D-62BDDE302C7C@dsl-only.net> <20170220233537.GB26759@dft-labs.eu> <26B93201-D3E9-4443-A60F-FC871EC5C96C@dsl-only.net>

next in thread | previous in thread | raw e-mail | index | archive | help
On 2017-Feb-20, at 6:36 PM, Mark Millard <markmi at dsl-only.net> wrote:

> On 2017-Feb-20, at 3:35 PM, Mateusz Guzik <mjguzik at gmail.com> =
wrote:
>=20
>> On Mon, Feb 20, 2017 at 03:10:44PM -0800, Mark Millard wrote:
>>> On 2017-Feb-20, at 2:58 PM, Mark Millard <markmi at dsl-only.net> =
wrote:
>>>=20
>>>> On 2017-Feb-20, at 11:10 AM, Mateusz Guzik <mjguzik at gmail.com> =
wrote:
>>>>=20
>>>>> On Sat, Feb 18, 2017 at 04:18:05AM -0800, Mark Millard wrote:
>>>>>> [Note: I experiment with clang based powerpc64 builds,
>>>>>> reporting problems that I find. Justin is familiar
>>>>>> with this, as is Nathan.]
>>>>>>=20
>>>>>> I tried to update the PowerMac G5 (a so-called "Quad Core")
>>>>>> that I have access to from head -r312761 to -r313864 and
>>>>>> ended up with random panics and hang ups in fairly short
>>>>>> order after booting.
>>>>>>=20
>>>>>> Some approximate bisecting for the kernel lead to:
>>>>>> (sometimes getting part way into a buildkernel attempt
>>>>>> for a different version before a failure happens)
>>>>>>=20
>>>>>> -r313266: works (just before use of atomic_fcmpset)
>>>>>> vs.
>>>>>> -r313271: fails (last of the "use atomic_fcmpset" check-ins)
>>>>>>=20
>>>>>> (I did not try -r313268 through -r313270 as the use was
>>>>>> gradually added.)
>>>>>>=20
>>>>>> So I'm currently running a -r313864 world with a -r313266
>>>>>> kernel.
>>>>>>=20
>>>>>> No kernel that I tried that was from before -r313266 had the
>>>>>> problems.
>>>>>>=20
>>>>>> Any kernel that I tried that was from after -r313271 had the
>>>>>> problems.
>>>>>>=20
>>>>>> Of course I did not try them all in other direction. :)
>>>>>>=20
>>>>>=20
>>>>> I found that spin mutexes were not properly handling this, fixed =
in
>>>>> r313996.
>>>>>=20
>>>>> Locally I added a if (cpu_tick() % 2) return (0); snipped to amd64
>>>>> fcmpset to simulate failures. Everything works, while it would =
easily
>>>>> fail without the patch.
>>>>>=20
>>>>> That said, I hope this concludes the 'missing check for not-reread =
value
>>>>> of failed fcmpset' saga.
>>>>>=20
>>>>> --=20
>>>>> Mateusz Guzik <mjguzik gmail.com>
>>>>=20
>>>> I tried to update from -r313864 to -r313999 in my amd64 context
>>>> (a VirtualBox machine under macOS) but it now crashes late in
>>>> the boot sequence (after it processes a dump if I make one but
>>>> before I can log in).
>>>>=20
>>>> This update was via my usual explicit svnlite update; buildworld
>>>> buildkernel; etc. production style build of world and kernel,
>>>> including use of MALLOC_PRODUCTION.
>>>>=20
>>>> The window shows:
>>>>=20
>>>> _vm_map_lock+0xf
>>>> vm_map_wire+0x32
>>>> rtROMemObjNativeLockInMap+0x8c
>>>> rtROMemObjNativeLockUser+0x51
>>>> RTR0MemObjLockUserTag+0x231
>>>> vbglR0HGCMInternalPreprocessCall+0x65d
>>>> vbglR0HGCMInternalCall+0x17c
>>>> vgdrvIoCtl_HGCMCall+0x43f
>>>> VGDrvCommonIoCtl+0x261
>>>> vgdrvFreeBSDIOCtl+0x2cd
>>>> devfs_ioctl+0xae
>>>> VOP_IOCTL_APV+0x88
>>>> vn_ioctl+0x161
>>>> devfs_ioctl_f+0x1f
>>>> kern_ioctl+0x280
>>>> sys_ioctl+0x13f
>>>> amd64_syscall+0x397
>>>> Xfast_syscall+0xfb
>>>=20
>>> More detail from booting with the -r313864 kernel.old
>>> and using kgdb on what the dump produced:
>>>=20
>>> # kgdb kernel.debug /var/crash/vmcore.
>>> /var/crash/vmcore.0    /var/crash/vmcore.last
>>> # kgdb kernel.debug /var/crash/vmcore.0
>>> GNU gdb 6.1.1 [FreeBSD]
>>> Copyright 2004 Free Software Foundation, Inc.
>>> GDB is free software, covered by the GNU General Public License, and =
you are
>>> welcome to change it and/or distribute copies of it under certain =
conditions.
>>> Type "show copying" to see the conditions.
>>> There is absolutely no warranty for GDB.  Type "show warranty" for =
details.
>>> This GDB was configured as "amd64-marcel-freebsd"...
>>>=20
>>> Unread portion of the kernel message buffer:
>>> <118>Starting vboxservice.
>>> <118>VBoxService 5.1.14 r112924 (verbosity: 0) freebsd.amd64 (Jan 20 =
2017 18:37:45) release log
>>> <118>00:00:00.000120 main     Log opened =
2017-02-20T22:38:46.348080000Z
>>> <118>00:00:00.000162 main     OS Product: FreeBSD
>>> <118>00:00:00.000171 main     OS Release: 12.0-CURRENT
>>> <118>00:00:00.000180 main     OS Version: FreeBSD 12.0-CURRENT  =
r313999M
>>> <118>00:00:00.000192 main     Executable: =
/usr/local/sbin/VBoxService
>>> <118>00:00:00.000194 main     Process ID: 609
>>> <118>00:00:00.000196 main     Package type: BSD_64BITS_GENERIC (OSE)
>>>=20
>>>=20
>>> Fatal trap 12: page fault while in kernel mode
>>> cpuid =3D 2; apic id =3D 02
>>> fault virtual address   =3D 0xd6
>>> fault code              =3D supervisor read data, page not present
>>> instruction pointer     =3D 0x20:0xffffffff80d4ebaf
>>> stack pointer           =3D 0x28:0xfffffe0122e2bef0
>>> frame pointer           =3D 0x28:0xfffffe0122e2bf00
>>> code segment            =3D base 0x0, limit 0xfffff, type 0x1b
>>>                       =3D DPL 0, pres 1, long 1, def32 0, gran 1
>>> processor eflags        =3D interrupt enabled, resume, IOPL =3D 0
>>> current process         =3D 609 (VBoxService)
>>>=20
>>=20
>>=20
>>=20
>>> #9  0xffffffff80eb6be1 in calltrap () at =
/usr/src/sys/amd64/amd64/exception.S:236
>>> #10 0xffffffff80d4ebaf in _vm_map_lock (map=3D0x1, file=3D0x0, =
line=3D0) at /usr/src/sys/vm/vm_map.c:501
>>=20
>> The function is:
>> void
>> _vm_map_lock(vm_map_t map, const char *file, int line)
>> {
>>=20
>>       if (map->system_map)
>>               mtx_lock_flags_(&map->system_mtx, 0, file, line);
>>       else
>>               sx_xlock_(&map->lock, file, line);
>>       map->timestamp++;
>> }
>>=20
>> system_map is at offset 0xd5, thus the faulting address of 0xd6 with =
map
>> address of 1 looks like the backtrace is corect. But this suggests =
the
>> bug is unrelated to my changes and there is a chance there is no bug =
in
>> the first place.
>>=20
>> Please make sure that the virtualbox module is recompiled against =
proper
>> source tree. If the problem persists, please bisect. The range is not
>> big.
>>=20
>> Off hand I don't see what can cause the failure in question (and =
chances
>> are there is no bug if kbi changed and the module was not =
recompiled).
>>=20
>>> #11 0xffffffff80d51ea2 in vm_map_wire (map=3D<value optimized out>, =
start=3D4534272, end=3D4538368, flags=3D1) at =
/usr/src/sys/vm/vm_map.c:2534
>>> #12 0xffffffff8265291c in rtR0MemObjNativeLockInMap () from =
/boot/modules/vboxguest.ko
>>> #13 0xffffffff82652881 in rtR0MemObjNativeLockUser () from =
/boot/modules/vboxguest.ko
>>> #14 0xffffffff8264ec01 in RTR0MemObjLockUserTag () from =
/boot/modules/vboxguest.ko
>>> #15 0xffffffff82624afd in vbglR0HGCMInternalPreprocessCall () from =
/boot/modules/vboxguest.ko
>>> #16 0xffffffff8262411a in VbglR0HGCMInternalCall () from =
/boot/modules/vboxguest.ko
>>> #17 0xffffffff8261ec4f in vgdrvIoCtl_HGCMCall () from =
/boot/modules/vboxguest.ko
>>> #18 0xffffffff8261d221 in VGDrvCommonIoCtl () from =
/boot/modules/vboxguest.ko
>=20
> I do not expect that the kernel binary interface deliberately changed
> between -r313864 and -r313999. Until the attempted update of amd64
> (which I always do first) the amd64 and arm64 were running:
>=20
> . . . 12.0-CURRENT FreeBSD 12.0-CURRENT  r313864M  . . . 1200021 =
1200021
>=20
> I've not noticed an update to 1200022 yet.
>=20
> [It turned out that for powerpc64 I had to use -r313266 for the
> kernel when I tried to update to -r313864. This does mix 1200020
> and 1200021. But 1200021 was removal of support for things I do
> not have involved --and the combination has seemed okay so far.]
>=20
> I've decided to do a round of port upgrades (to -r434493),
> although virtualbox client has not been updated. I'll force a
> rebuild before I'm done.
>=20
> It turns out that llvm39 is now required for what I choose to
> have and its build ran out of RAM/swap as I had things configured.
> So I've adjusted to have the VM have more RAM assigned and I'm not
> starting lumina but just using the console for now. We will see.
>=20
> Note: I always manually start lumina and so it was not
> involved in the boot problem: it was just a basic console
> style context at all times for the boot crash.
>=20
> Overall it will be a while before I have a works vs. fails
> pair that are significantly closer together.

Looks like some kernel binary interface (as seen by
emulators/virtualbox-ose-addition ) has changed:
rebuilding emulators/virtualbox-ose-addition removed
the booting crash but uname -apKU still lists 1200021
and 2100021 for the kernel and world for -r313999,
just like for -r313864.

[Unfortunately lumina no longer processes my mouse clicks
so now I've got something else to figure out.]

=3D=3D=3D
Mark Millard
markmi at dsl-only.net




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?7089D194-8BB3-4D40-8314-4A15B41986DA>