Date:      Thu, 21 Jul 2022 08:31:38 -0700
From:      Chuck Tuffli <ctuffli@gmail.com>
To:        FreeBSD-Current <freebsd-current@freebsd.org>
Subject:   bhyve core dump related to llvm 14
Message-ID:  <CAKAYmML=qRvGQdbW7cZP_vRBDP99ZkQk9yA9H2N%2BagfcK-RN1A@mail.gmail.com>

I have a virtual machine used to test the NVMe emulation in bhyve. All
of the tests in the VM pass running under FreeBSD 13.1-R, but the same
VM running under -current causes bhyve(8) to dump core because of a
segmentation fault.

git bisect identified the last "good" commit on main as
    cb2ae6163174 sysvsem: Fix a typo
After this commit, there are a half-dozen commits related to merging
the llvm-project release/14.x branch.

The core dump is repeatable and consistent. Backtraces under lldb
look similar to this:

* thread #22, name = 'vcpu 2', stop reason = signal SIGSEGV: invalid address (fault address: 0xb8)
  * frame #0: 0x0000383eb9fc916b
        bhyve`pci_nvme_read(ctx=0x000038483ad2d700, vcpu=0,
            pi=0x0000000000000000, baridx=-188391150, offset=0, size=0)
            at pci_nvme.c:3035:34
    frame #1: 0x0000384834616280
    frame #2: 0x0000383eb9fc1f7a
        bhyve`pci_emul_mem_handler(ctx=<unavailable>, vcpu=<unavailable>,
            dir=<unavailable>, addr=<unavailable>, size=<unavailable>,
            val=<unavailable>, arg1=0x00003846e5b71600, arg2=0)
            at pci_emul.c:498:4

In frame 0, pi being NULL is what causes the core dump, but most of the
other arguments are garbage as well. Judging by the frames further up
the stack, the vcpu value should be 2, the ctx pointer doesn't match,
and the value passed for pi isn't NULL.

Poking around in frame 2, I can see that the "direction" is a memory
write (dir == MEM_F_WRITE) and the statement being executed is this:
    (*pe->pe_barwrite)(ctx, vcpu, pdi, bidx, offset, size, *val);

Confusingly, the function pointer pe_barwrite points to pci_nvme_write()
and not to pci_nvme_read(), where the crash occurs. I've confirmed the
fault is in pci_nvme_read() by adding an assert for pi != NULL. This is
especially odd because pci_emul_mem_handler() calls pci_nvme_read() and
pci_nvme_write() with no intermediate function in between. So why does
frame 1 exist at all?
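For readers who don't have the source handy, the dispatch pattern being described can be sketched roughly as below. This is only an illustration under stated assumptions, not the bhyve code: the struct layouts, the demo_read/demo_write handlers, the mem_handler signature, and the MEM_F_* values are hypothetical stand-ins, and only the names pe_barwrite, MEM_F_WRITE, and the pi != NULL assert come from the message above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; values and layouts are assumptions. */
enum { MEM_F_READ = 1, MEM_F_WRITE = 2 };

struct dev_inst {
	uint64_t last_write;	/* last value written to the BAR */
	int	 reads;		/* number of read callbacks seen */
	int	 writes;	/* number of write callbacks seen */
};

/* Per-device table of BAR callbacks, reached only via pointers. */
struct dev_emu {
	uint64_t (*pe_barread)(struct dev_inst *pi, int baridx,
	    uint64_t offset, int size);
	void	 (*pe_barwrite)(struct dev_inst *pi, int baridx,
	    uint64_t offset, int size, uint64_t value);
};

static uint64_t
demo_read(struct dev_inst *pi, int baridx, uint64_t offset, int size)
{
	/* Same kind of check as the assert mentioned in the message. */
	assert(pi != NULL);
	(void)baridx; (void)offset; (void)size;
	pi->reads++;
	return (pi->last_write);
}

static void
demo_write(struct dev_inst *pi, int baridx, uint64_t offset, int size,
    uint64_t value)
{
	assert(pi != NULL);
	(void)baridx; (void)offset; (void)size;
	pi->writes++;
	pi->last_write = value;
}

static const struct dev_emu demo_emu = {
	.pe_barread = demo_read,
	.pe_barwrite = demo_write,
};

/* The handler picks exactly one callback based on the direction. */
static int
mem_handler(const struct dev_emu *pe, struct dev_inst *pi, int dir,
    int baridx, uint64_t offset, int size, uint64_t *val)
{
	if (dir == MEM_F_WRITE)
		(*pe->pe_barwrite)(pi, baridx, offset, size, *val);
	else
		*val = (*pe->pe_barread)(pi, baridx, offset, size);
	return (0);
}
```

In a sketch like this, a write request can only ever reach the write callback, and the callback's frame sits immediately below the handler's frame. That is why a read handler appearing under a write dispatch, separated by an unnamed frame, is so surprising.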

Using gdb, the backtraces either don't decode at all or look similar to this:
(gdb) bt
#0  pci_nvme_read (ctx=0x944c1168700, vcpu=0, pi=0x0, baridx=-1835053270,
    offset=0, size=0)
    at /poudriere/jails/14-current-amd64/usr/src/usr.sbin/bhyve/pci_nvme.c:3035
#1  0x000009436891d8e8 in _CurrentRuneLocale () from /lib/libc.so.7
#2  0x000009436a73ca28 in ?? ()
#3  0x000009436a73e1c0 in ?? ()
...
#34 0x000009436a747600 in ?? ()
#35 0x0000093b3e76b088 in pci_de_lpc ()
#36 0x000009436a716500 in ?? ()
#37 0x00000944c3196d10 in ?? ()
#38 0x0000093b3e74501a in pci_emul_mem_handler (ctx=0x9436a7bd670, vcpu=0,
    dir=<optimized out>, addr=<optimized out>, size=0,
    val=0x646165725f657469, arg1=0x1, arg2=10185153275136)
    at /poudriere/jails/14-current-amd64/usr/src/usr.sbin/bhyve/pci_emul.c:498

Other observations:
 - disabling compiler optimization (i.e. -O0) for the two files in
   question (pci_nvme.c and pci_emul.c) makes the core dump go away
 - using the default optimization level but generously sprinkling
   debug printf calls everywhere also makes the core dump go away
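Since -O0 on whole files hides the problem, one possible way to narrow it down is to disable optimization one function at a time. The sketch below assumes clang, whose optnone function attribute turns off optimization for a single function. The NO_OPT macro and the suspect_read/caller functions are hypothetical names made up for illustration and are not part of bhyve.

```c
#include <stdint.h>

/*
 * Assumption: building with clang, which supports the optnone
 * attribute. On other compilers the macro expands to nothing.
 */
#if defined(__clang__)
#define NO_OPT __attribute__((optnone, noinline))
#else
#define NO_OPT
#endif

/* Hypothetical suspect: compiled without optimization under clang. */
NO_OPT static uint64_t
suspect_read(const uint64_t *regs, uint64_t offset)
{
	/* Treat offset as a byte offset into an array of 64-bit registers. */
	return (regs[offset / 8]);
}

/* The caller is still compiled at the file's normal optimization level. */
static uint64_t
caller(const uint64_t *regs, uint64_t offset)
{
	return (suspect_read(regs, offset));
}
```

Moving the attribute from function to function while keeping the default optimization level elsewhere might show which function's optimized code is involved in the crash.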

I'm not sure where to go from here and could use some help.

--chuck


