Date: Tue, 01 Mar 2022 09:46:26 +0000
From: bugzilla-noreply@freebsd.org
To: bugs@FreeBSD.org
Subject: [Bug 261059] Kernel panic XEN + ZFS volume.
Message-ID: <bug-261059-227-mH6oWtD6ln@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-261059-227@https.bugs.freebsd.org/bugzilla/>
References: <bug-261059-227@https.bugs.freebsd.org/bugzilla/>
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=261059

--- Comment #1 from Janis <zedupsys@gmail.com> ---
I've been digging further into this bug. I found one ZFS problem that I can
reproduce 100% of the time, reported as bug #262189; to me, though, it seems
that these two bugs might not be related. What I have found is that giving
Dom0 less RAM makes the system panic sooner, so I assume the ZFS stress
script simply helped fill memory faster.

Another thing I did was install FreeBSD on UFS on a separate disk, with the
ZFS pool on the other disk. The system still crashes, but this makes it
easier to try out different combinations.

My latest Xen command-line parameters are:

xen_cmdline="dom0_mem=2048M cpufreq=dom0-kernel dom0_max_vcpus=2
dom0=pvh,verbose=1 console=vga,com1 com1=9600,8n1 guest_loglvl=all
loglvl=all sync_console=1 reboot=no"

With these I can now see more verbose panic messages on the serial output.

While investigating I have noticed a few things; they have given me the
suspicion that there is not just a single bug but several, which trigger at
different times. Sometimes, when the system does not crash while creating
DomU instances, it crashes when I destroy them all; sometimes, after all
DomUs have been destroyed, the system crashes on the "init 0" call.

1. While stressing ZFS, at some point I get messages like these on the
console:

xnb(xnb_frontend_changed:1391): frontend_state=Connected, xnb_state=InitWait
xnb(xnb_connect_comms:787): rings connected!
(XEN) d2v0: upcall vector 93
xbbd2: Error 12 Unable to allocate request bounce buffers
xbbd2: Fatal error. Transitioning to Closing State
xbbd5: Error 12 Unable to allocate request bounce buffers
xbbd5: Fatal error. Transitioning to Closing State
xnb(xnb_frontend_changed:1391): frontend_state=Connected, xnb_state=InitWait
xnb(xnb_connect_comms:787): rings connected!
Mar 1 10:31:55 lab-01 kernel: pid 1117 (qemu-system-i386), jid 0, uid 0, was killed: out of swap space
Mar 1 10:32:59 lab-01 kernel: pid 1264 (qemu-system-i386), jid 0, uid 0, was killed: out of swap space
Mar 1 10:33:06 lab-01 kernel: pid 1060 (zsh), jid 0, uid 0, was killed: out of swap space
Mar 1 10:33:11 lab-01 kernel: pid 1053 (zsh), jid 0, uid 0, was killed: out of swap space

This seems odd to me; could it be a sign of a memory leak, i.e. that some
resources are not cleaned up after a DomU is destroyed? All the scripts do
is start a DomU, write some data to its disk, and stop the DomU.
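To make that concrete, the loop is roughly equivalent to the sketch below.
The config path is the one from my setup, but the domain name "vm2", the
sleep time and the iteration count are only placeholders, not my exact
script:

#!/bin/sh
# Sketch only: repeatedly create a ZVOL-backed DomU, give it time to boot
# and write some data to its disk, then destroy it again.
i=0
while [ "$i" -lt 100 ]; do
    xl create /service/crash/cfg/xen-vm2-zvol-5.conf
    sleep 120               # guest boots and writes to its ZVOL-backed disk
    xl destroy vm2          # "vm2" stands in for the guest's actual name
    i=$((i + 1))
done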
2. On the domain creation side, I sometimes get an error like this:

Parsing config from /service/crash/cfg/xen-vm2-zvol-5.conf
libxl: error: libxl_device.c:1111:device_backend_callback: Domain 9:unable to add device with path /local/domain/0/backend/vbd/9/51712
libxl: error: libxl_device.c:1111:device_backend_callback: Domain 9:unable to add device with path /local/domain/0/backend/vbd/9/51728
libxl: error: libxl_device.c:1111:device_backend_callback: Domain 9:unable to add device with path /local/domain/0/backend/vbd/9/51744
libxl: error: libxl_device.c:1111:device_backend_callback: Domain 9:unable to add device with path /local/domain/0/backend/vbd/9/51760
libxl: error: libxl_device.c:1111:device_backend_callback: Domain 9:unable to add device with path /local/domain/0/backend/vbd/9/51776
libxl: error: libxl_create.c:1613:domcreate_launch_dm: Domain 9:unable to add disk devices
libxl: error: libxl_domain.c:1182:libxl__destroy_domid: Domain 9:Non-existant domain
libxl: error: libxl_domain.c:1136:domain_destroy_callback: Domain 9:Unable to destroy guest
libxl: error: libxl_domain.c:1063:domain_destroy_cb: Domain 9:Destruction of domain failed

Is it possible to find out more about why Dom0 was "unable to add device
with path"? Is there a way to get more verbosity? Was it that ZFS held some
locks, or that a previous DomU was still holding the same ZVOL?
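One thing I still plan to try is raising the toolstack verbosity when
creating the failing guest, roughly as sketched below (I am assuming here
that xl's repeatable -v flag and the usual /var/log/xen log location behave
the same on FreeBSD):

# More verbose xl/libxl output while creating the failing guest.
xl -vvv create /service/crash/cfg/xen-vm2-zvol-5.conf
# Per-domain toolstack logs should also end up under /var/log/xen/.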
3. Since I am following the instructions at
https://docs.freebsd.org/en/books/handbook/virtualization/#virtualization-host-xen,
it seems that the command:

echo 'vm.max_wired=-1' >> /etc/sysctl.conf

is obsolete, because FreeBSD 13.0 has no such sysctl knob ("sysctl: unknown
oid 'vm.max_wired'"). I do not know which knob is the equivalent. I found
"vm.max_user_wired=-1"; is it the same? Maybe the Handbook should be
updated. Even with this set to -1, qemu-system is still killed with the
"out of swap space" error. Maybe there is a different sysctl for that
purpose now?
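For the record, what I set (on the assumption that vm.max_user_wired is
indeed the renamed knob) was roughly:

# Assuming vm.max_user_wired replaced the old vm.max_wired knob.
sysctl vm.max_user_wired=-1                        # apply immediately
echo 'vm.max_user_wired=-1' >> /etc/sysctl.conf    # persist across reboots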
At one point I got an error I had not seen before, but I do not remember
what the system state was or what I was doing at the time. It is as follows:

xnb(xnb_rxpkt2rsp:2059): Got error -1 for hypervisor gnttab_copy status
xnb(xnb_ring2pkt:1526): Unknown extra info type 255.  Discarding packet
xnb(xnb_dump_txreq:299): netif_tx_request index =0
xnb(xnb_dump_txreq:300): netif_tx_request.gref  =0
xnb(xnb_dump_txreq:301): netif_tx_request.offset=0
xnb(xnb_dump_txreq:302): netif_tx_request.flags =8
xnb(xnb_dump_txreq:303): netif_tx_request.id    =69
xnb(xnb_dump_txreq:304): netif_tx_request.size  =1000
xnb(xnb_dump_txreq:299): netif_tx_request index =1
xnb(xnb_dump_txreq:300): netif_tx_request.gref  =255
xnb(xnb_dump_txreq:301): netif_tx_request.offset=0
xnb(xnb_dump_txreq:302): netif_tx_request.flags =0
xnb(xnb_dump_txreq:303): netif_tx_request.id    =0
xnb(xnb_dump_txreq:304): netif_tx_request.size  =0
xnb(xnb_rxpkt2rsp:2059): Got error -1 for hypervisor gnttab_copy status
xnb(xnb_ring2pkt:1526): Unknown extra info type 255.  Discarding packet
xnb(xnb_dump_txreq:299): netif_tx_request index =0
xnb(xnb_dump_txreq:300): netif_tx_request.gref  =0
xnb(xnb_dump_txreq:301): netif_tx_request.offset=0
xnb(xnb_dump_txreq:302): netif_tx_request.flags =8
xnb(xnb_dump_txreq:303): netif_tx_request.id    =69
xnb(xnb_dump_txreq:304): netif_tx_request.size  =1000
xnb(xnb_dump_txreq:299): netif_tx_request index =1
xnb(xnb_dump_txreq:300): netif_tx_request.gref  =255
xnb(xnb_dump_txreq:301): netif_tx_request.offset=0
xnb(xnb_dump_txreq:302): netif_tx_request.flags =0
xnb(xnb_dump_txreq:303): netif_tx_request.id    =0
xnb(xnb_dump_txreq:304): netif_tx_request.size  =0
xnb(xnb_rxpkt2rsp:2059): Got error -1 for hypervisor gnttab_copy status

4. Finally, thanks to the better Xen flags, I now get the full output for
the panics:

(XEN) d1v0: upcall vector 93
xnb(xnb_frontend_changed:1391): frontend_state=Connected, xnb_state=InitWait
xnb(xnb_connect_comms:787): rings connected!
(XEN) d2v0: upcall vector 93
xbbd2: Error 12 Unable to allocate request bounce buffers
xbbd2: Fatal error. Transitioning to Closing State
xbbd5: Error 12 Unable to allocate request bounce buffers
xbbd5: Fatal error. Transitioning to Closing State
xnb(xnb_frontend_changed:1391): frontend_state=Connected, xnb_state=InitWait
xnb(xnb_connect_comms:787): rings connected!
panic: pmap_growkernel: no memory to grow kernel
cpuid = 0
time = 1646123072
KDB: stack backtrace:
#0 0xffffffff80c57525 at kdb_backtrace+0x65
#1 0xffffffff80c09f01 at vpanic+0x181
#2 0xffffffff80c09d73 at panic+0x43
#3 0xffffffff81073eed at pmap_growkernel+0x27d
#4 0xffffffff80f2dae8 at vm_map_insert+0x248
#5 0xffffffff80f30249 at vm_map_find+0x549
#6 0xffffffff80f2bf76 at kmem_init+0x226
#7 0xffffffff80c73341 at vmem_xalloc+0xcb1
#8 0xffffffff80c72c3b at vmem_xalloc+0x5ab
#9 0xffffffff80f2bfce at kmem_init+0x27e
#10 0xffffffff80c73341 at vmem_xalloc+0xcb1
#11 0xffffffff80c72c3b at vmem_xalloc+0x5ab
#12 0xffffffff80c72646 at vmem_alloc+0x46
#13 0xffffffff80f2b616 at kmem_malloc_domainset+0x96
#14 0xffffffff80f21a2a at uma_prealloc+0x23a
#15 0xffffffff80f235de at sysctl_handle_uma_zone_cur+0xe2e
#16 0xffffffff80f1f6af at uma_set_align+0x8f
#17 0xffffffff82463362 at abd_borrow_buf_copy+0x22
Uptime: 4m9s

I do not quite understand why pmap_growkernel should panic when it runs out
of memory. Couldn't it simply fail, reporting that the DomU could not be
created because the system is out of memory? I do not know the internals,
so forgive me if this question seems foolish.

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x22710028
fault code            = supervisor read data, page not present
instruction pointer   = 0x20:0xffffffff80c45892
stack pointer         = 0x28:0xfffffe0096600930
frame pointer         = 0x28:0xfffffe0096600930
code segment          = base 0x0, limit 0xfffff, type 0x1b
                      = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags      = interrupt enabled, resume, IOPL = 0
current process       = 1496 (devmatch)
trap number           = 12
panic: page fault
cpuid = 0
time = 1646123791
KDB: stack backtrace:
#0 0xffffffff80c57525 at kdb_backtrace+0x65
#1 0xffffffff80c09f01 at vpanic+0x181
#2 0xffffffff80c09d73 at panic+0x43
#3 0xffffffff8108b1a7 at trap+0xbc7
#4 0xffffffff8108b1ff at trap+0xc1f
#5 0xffffffff8108a85d at trap+0x27d
#6 0xffffffff81061b18 at calltrap+0x8
#7 0xffffffff80c62011 at rman_is_region_manager+0x241
#8 0xffffffff80c1a051 at sbuf_new_for_sysctl+0x101
#9 0xffffffff80c1949c at kernel_sysctl+0x43c
#10 0xffffffff80c19b13 at userland_sysctl+0x173
#11 0xffffffff80c1995f at sys___sysctl+0x5f
#12 0xffffffff8108baac at amd64_syscall+0x10c
#13 0xffffffff8106243e at Xfast_syscall+0xfe

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 02
fault virtual address = 0x68
fault code            = supervisor read data, page not present
instruction pointer   = 0x20:0xffffffff824a599d
stack pointer         = 0x28:0xfffffe00b1e27910
frame pointer         = 0x28:0xfffffe00b1e279b0
code segment          = base 0x0, limit 0xfffff, type 0x1b
                      = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags      = interrupt enabled, resume, IOPL = 0
current process       = 0 (xbbd7 taskq)
trap number           = 12
panic: page fault
cpuid = 1
time = 1646122723
KDB: stack backtrace:
#0 0xffffffff80c57525 at kdb_backtrace+0x65
#1 0xffffffff80c09f01 at vpanic+0x181
#2 0xffffffff80c09d73 at panic+0x43
#3 0xffffffff8108b1a7 at trap+0xbc7
#4 0xffffffff8108b1ff at trap+0xc1f
#5 0xffffffff8108a85d at trap+0x27d
#6 0xffffffff81061b18 at calltrap+0x8
#7 0xffffffff8248935a at dmu_read+0x2a
#8 0xffffffff82456a3a at zvol_geom_bio_strategy+0x2aa
#9 0xffffffff80a7f214 at xbd_instance_create+0xa394
#10 0xffffffff80a7b1ea at xbd_instance_create+0x636a
#11 0xffffffff80c6b1c1 at taskqueue_run+0x2a1
#12 0xffffffff80c6c4dc at taskqueue_thread_loop+0xac
#13 0xffffffff80bc7e3e at fork_exit+0x7e
#14 0xffffffff81062b9e at fork_trampoline+0xe
Uptime: 1h44m10s

One of those panics happened during "init 0" at some point (after all DomUs
had been destroyed); unfortunately I did not note down which one. The
version is still 13.0-RELEASE-p7.