Date:      Tue, 01 Mar 2022 09:46:26 +0000
From:      bugzilla-noreply@freebsd.org
To:        bugs@FreeBSD.org
Subject:   [Bug 261059] Kernel panic XEN + ZFS volume.
Message-ID:  <bug-261059-227-mH6oWtD6ln@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-261059-227@https.bugs.freebsd.org/bugzilla/>
References:  <bug-261059-227@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=261059

--- Comment #1 from Janis <zedupsys@gmail.com> ---
I've been digging further into this bug. I found one ZFS problem which I can
reproduce 100% of the time; it is reported as bug #262189. To me, though, it
seems that these two bugs might not be related.

What I have found, though, is that giving Dom0 less RAM makes the system panic
sooner. So I assume the ZFS stress script simply helped to fill memory faster.

Another thing I did was install FreeBSD on UFS on a separate disk, with the ZFS
pool on the other disk. The system still crashes, but it is easier to try out
different combinations. My latest Xen command line parameters are:

xen_cmdline="dom0_mem=2048M cpufreq=dom0-kernel dom0_max_vcpus=2 dom0=pvh,verbose=1 console=vga,com1 com1=9600,8n1 guest_loglvl=all loglvl=all sync_console=1 reboot=no"
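
(For completeness: this is the xen_cmdline loader tunable, i.e. a /boot/loader.conf
entry. Following the handbook's style it can be set, for example, with:

sysrc -f /boot/loader.conf xen_cmdline="dom0_mem=2048M cpufreq=dom0-kernel dom0_max_vcpus=2 dom0=pvh,verbose=1 console=vga,com1 com1=9600,8n1 guest_loglvl=all loglvl=all sync_console=1 reboot=no"

and takes effect on the next boot.)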

So now it seems that I can see more verbose panic messages on the serial output.

While investigating, I have noticed a few things; they have given me a suspicion
that there is not just a single bug but several, which trigger at different
times. Sometimes, when the system does not crash while I create the DomU
instances, it crashes when I destroy them all; sometimes, after all DomUs have
been destroyed, the system crashes on the init 0 call.


1. While stressing ZFS, at some point I get messages like these on the console:
xnb(xnb_frontend_changed:1391): frontend_state=Connected, xnb_state=InitWait
xnb(xnb_connect_comms:787): rings connected!
(XEN) d2v0: upcall vector 93
xbbd2: Error 12 Unable to allocate request bounce buffers
xbbd2: Fatal error. Transitioning to Closing State
xbbd5: Error 12 Unable to allocate request bounce buffers
xbbd5: Fatal error. Transitioning to Closing State
xnb(xnb_frontend_changed:1391): frontend_state=Connected, xnb_state=InitWait
xnb(xnb_connect_comms:787): rings connected!
Mar  1 10:31:55 lab-01 kernel: pid 1117 (qemu-system-i386), jid 0, uid 0, was killed: out of swap space
Mar  1 10:32:59 lab-01 kernel: pid 1264 (qemu-system-i386), jid 0, uid 0, was killed: out of swap space
Mar  1 10:33:06 lab-01 kernel: pid 1060 (zsh), jid 0, uid 0, was killed: out of swap space
Mar  1 10:33:11 lab-01 kernel: pid 1053 (zsh), jid 0, uid 0, was killed: out of swap space


To me this seems somewhat weird; could it be a sign of a memory leak, i.e. that
some resources are not cleaned up after a DomU is destroyed? All that the
scripts do is start a DomU, write some data to its disk, and stop the DomU
(roughly as sketched below).
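
A simplified sketch of that loop (the config path matches the one in the log
further below, the guest name is just a placeholder, and the real scripts do a
bit more):

#!/bin/sh
# create a ZVOL-backed DomU, let it write some data, then destroy it, repeatedly
CFG=/service/crash/cfg/xen-vm2-zvol-5.conf
GUEST=xen-vm2-zvol-5

while true; do
    xl create "$CFG"        # start the DomU
    sleep 60                # guest writes some data to its ZVOL-backed disk
    xl destroy "$GUEST"     # stop the DomU again
done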


2. In the domain creation phase, I sometimes get an error like this:
Parsing config from /service/crash/cfg/xen-vm2-zvol-5.conf
libxl: error: libxl_device.c:1111:device_backend_callback: Domain 9:unable to add device with path /local/domain/0/backend/vbd/9/51712
libxl: error: libxl_device.c:1111:device_backend_callback: Domain 9:unable to add device with path /local/domain/0/backend/vbd/9/51728
libxl: error: libxl_device.c:1111:device_backend_callback: Domain 9:unable to add device with path /local/domain/0/backend/vbd/9/51744
libxl: error: libxl_device.c:1111:device_backend_callback: Domain 9:unable to add device with path /local/domain/0/backend/vbd/9/51760
libxl: error: libxl_device.c:1111:device_backend_callback: Domain 9:unable to add device with path /local/domain/0/backend/vbd/9/51776
libxl: error: libxl_create.c:1613:domcreate_launch_dm: Domain 9:unable to add disk devices
libxl: error: libxl_domain.c:1182:libxl__destroy_domid: Domain 9:Non-existant domain
libxl: error: libxl_domain.c:1136:domain_destroy_callback: Domain 9:Unable to destroy guest
libxl: error: libxl_domain.c:1063:domain_destroy_cb: Domain 9:Destruction of domain failed


Is it possible to get more information about why Dom0 was "unable to add device
with path"? Is there more verbosity available? Was it that ZFS held some locks,
or that a previous DomU was still holding the same ZVOL?
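
(For more detail, what I would try next is raising xl's own verbosity and
inspecting the vbd backend nodes in xenstore, e.g. something like:

xl -vvv create /service/crash/cfg/xen-vm2-zvol-5.conf
xenstore-ls /local/domain/0/backend/vbd

but I am not sure whether that surfaces the actual reason.)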


3. Since I am following the information at
https://docs.freebsd.org/en/books/handbook/virtualization/#virtualization-host-xen,
it seems that the command:

echo 'vm.max_wired=-1' >> /etc/sysctl.conf

is obsolete, because in FreeBSD 13.0 there is no such sysctl knob ("sysctl:
unknown oid 'vm.max_wired'"). I do not know which knob is its equivalent. I
found "vm.max_user_wired=-1"; is it the same? Maybe the handbook should be
updated. Even if I set this to -1, qemu-system is still killed with the "out of
swap space" error. Maybe there is a different sysctl for that purpose now?
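
For reference, this is roughly what I mean by setting the new knob, assuming
vm.max_user_wired really is the replacement (the last command just shows the
resulting limit together with the current wired page count):

echo 'vm.max_user_wired=-1' >> /etc/sysctl.conf
sysctl vm.max_user_wired=-1
sysctl vm.max_user_wired vm.stats.vm.v_wire_count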


At one point I got an error I had not seen before, but I do not remember what
the system state was or what I did. It is as follows:
xnb(xnb_rxpkt2rsp:2059): Got error -1 for hypervisor gnttab_copy status
xnb(xnb_ring2pkt:1526): Unknown extra info type 255.  Discarding packet
xnb(xnb_dump_txreq:299): netif_tx_request index =0
xnb(xnb_dump_txreq:300): netif_tx_request.gref  =0
xnb(xnb_dump_txreq:301): netif_tx_request.offset=0
xnb(xnb_dump_txreq:302): netif_tx_request.flags =8
xnb(xnb_dump_txreq:303): netif_tx_request.id    =69
xnb(xnb_dump_txreq:304): netif_tx_request.size  =1000
xnb(xnb_dump_txreq:299): netif_tx_request index =1
xnb(xnb_dump_txreq:300): netif_tx_request.gref  =255
xnb(xnb_dump_txreq:301): netif_tx_request.offset=0
xnb(xnb_dump_txreq:302): netif_tx_request.flags =0
xnb(xnb_dump_txreq:303): netif_tx_request.id    =0
xnb(xnb_dump_txreq:304): netif_tx_request.size  =0
xnb(xnb_rxpkt2rsp:2059): Got error -1 for hypervisor gnttab_copy status
xnb(xnb_ring2pkt:1526): Unknown extra info type 255.  Discarding packet
xnb(xnb_dump_txreq:299): netif_tx_request index =0
xnb(xnb_dump_txreq:300): netif_tx_request.gref  =0
xnb(xnb_dump_txreq:301): netif_tx_request.offset=0
xnb(xnb_dump_txreq:302): netif_tx_request.flags =8
xnb(xnb_dump_txreq:303): netif_tx_request.id    =69
xnb(xnb_dump_txreq:304): netif_tx_request.size  =1000
xnb(xnb_dump_txreq:299): netif_tx_request index =1
xnb(xnb_dump_txreq:300): netif_tx_request.gref  =255
xnb(xnb_dump_txreq:301): netif_tx_request.offset=0
xnb(xnb_dump_txreq:302): netif_tx_request.flags =0
xnb(xnb_dump_txreq:303): netif_tx_request.id    =0
xnb(xnb_dump_txreq:304): netif_tx_request.size  =0
xnb(xnb_rxpkt2rsp:2059): Got error -1 for hypervisor gnttab_copy status


4. Finally, thanks to the better Xen flags, I get full output for the panics:
(XEN) d1v0: upcall vector 93
xnb(xnb_frontend_changed:1391): frontend_state=Connected, xnb_state=InitWait
xnb(xnb_connect_comms:787): rings connected!
(XEN) d2v0: upcall vector 93
xbbd2: Error 12 Unable to allocate request bounce buffers
xbbd2: Fatal error. Transitioning to Closing State
xbbd5: Error 12 Unable to allocate request bounce buffers
xbbd5: Fatal error. Transitioning to Closing State
xnb(xnb_frontend_changed:1391): frontend_state=Connected, xnb_state=InitWait
xnb(xnb_connect_comms:787): rings connected!
panic: pmap_growkernel: no memory to grow kernel
cpuid = 0
time = 1646123072
KDB: stack backtrace:
#0 0xffffffff80c57525 at kdb_backtrace+0x65
#1 0xffffffff80c09f01 at vpanic+0x181
#2 0xffffffff80c09d73 at panic+0x43
#3 0xffffffff81073eed at pmap_growkernel+0x27d
#4 0xffffffff80f2dae8 at vm_map_insert+0x248
#5 0xffffffff80f30249 at vm_map_find+0x549
#6 0xffffffff80f2bf76 at kmem_init+0x226
#7 0xffffffff80c73341 at vmem_xalloc+0xcb1
#8 0xffffffff80c72c3b at vmem_xalloc+0x5ab
#9 0xffffffff80f2bfce at kmem_init+0x27e
#10 0xffffffff80c73341 at vmem_xalloc+0xcb1
#11 0xffffffff80c72c3b at vmem_xalloc+0x5ab
#12 0xffffffff80c72646 at vmem_alloc+0x46
#13 0xffffffff80f2b616 at kmem_malloc_domainset+0x96
#14 0xffffffff80f21a2a at uma_prealloc+0x23a
#15 0xffffffff80f235de at sysctl_handle_uma_zone_cur+0xe2e
#16 0xffffffff80f1f6af at uma_set_align+0x8f
#17 0xffffffff82463362 at abd_borrow_buf_copy+0x22
Uptime: 4m9s


Here I do not quite understand why pmap_growkernel should panic when it is out
of memory. Couldn't it simply report that the DomU could not be created because
memory ran out? I do not know the internals, so forgive me if this question
seems foolish.


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x22710028
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80c45892
stack pointer           = 0x28:0xfffffe0096600930
frame pointer           = 0x28:0xfffffe0096600930
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 1496 (devmatch)
trap number             = 12
panic: page fault
cpuid = 0
time = 1646123791
KDB: stack backtrace:
#0 0xffffffff80c57525 at kdb_backtrace+0x65
#1 0xffffffff80c09f01 at vpanic+0x181
#2 0xffffffff80c09d73 at panic+0x43
#3 0xffffffff8108b1a7 at trap+0xbc7
#4 0xffffffff8108b1ff at trap+0xc1f
#5 0xffffffff8108a85d at trap+0x27d
#6 0xffffffff81061b18 at calltrap+0x8
#7 0xffffffff80c62011 at rman_is_region_manager+0x241
#8 0xffffffff80c1a051 at sbuf_new_for_sysctl+0x101
#9 0xffffffff80c1949c at kernel_sysctl+0x43c
#10 0xffffffff80c19b13 at userland_sysctl+0x173
#11 0xffffffff80c1995f at sys___sysctl+0x5f
#12 0xffffffff8108baac at amd64_syscall+0x10c
#13 0xffffffff8106243e at Xfast_syscall+0xfe


Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 02
fault virtual address   = 0x68
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff824a599d
stack pointer           = 0x28:0xfffffe00b1e27910
frame pointer           = 0x28:0xfffffe00b1e279b0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 0 (xbbd7 taskq)
trap number             = 12
panic: page fault
cpuid = 1
time = 1646122723
KDB: stack backtrace:
#0 0xffffffff80c57525 at kdb_backtrace+0x65
#1 0xffffffff80c09f01 at vpanic+0x181
#2 0xffffffff80c09d73 at panic+0x43
#3 0xffffffff8108b1a7 at trap+0xbc7
#4 0xffffffff8108b1ff at trap+0xc1f
#5 0xffffffff8108a85d at trap+0x27d
#6 0xffffffff81061b18 at calltrap+0x8
#7 0xffffffff8248935a at dmu_read+0x2a
#8 0xffffffff82456a3a at zvol_geom_bio_strategy+0x2aa
#9 0xffffffff80a7f214 at xbd_instance_create+0xa394
#10 0xffffffff80a7b1ea at xbd_instance_create+0x636a
#11 0xffffffff80c6b1c1 at taskqueue_run+0x2a1
#12 0xffffffff80c6c4dc at taskqueue_thread_loop+0xac
#13 0xffffffff80bc7e3e at fork_exit+0x7e
#14 0xffffffff81062b9e at fork_trampoline+0xe
Uptime: 1h44m10s


One of those panics happened on init 0 at some point (after all DomUs had been
destroyed); unfortunately, I did not note down which one. The version is still
13.0-RELEASE-p7.

-- 
You are receiving this mail because:
You are the assignee for the bug.


