Date: Sat, 26 Mar 2022 16:56:21 +0100
From: Roger Pau Monné <royger@gmail.com>
To: Ze Dupsys <zedupsys@gmail.com>
Cc: freebsd-xen@freebsd.org, Brian Buhrow <buhrow@nfbcal.org>
Subject: Re: ZFS + FreeBSD XEN dom0 panic
Message-ID: <CAPLaKK7dszTy_6rcKcWZ0vK_E0ZWQ3QfuiHwvzSN_YWN_Gr9AA@mail.gmail.com>
In-Reply-To: <Yj8lZWqeHbD%2BkfOQ@Air-de-Roger>
References: <YjipQwBQ/JTo4S6G@Air-de-Roger>
    <Yji8NZePmovLFhk2@Air-de-Roger>
    <YjxuPF80Z8z0V58t@Air-de-Roger>
    <abcdae23-eea9-93c3-04da-61b7f79a99e9@gmail.com>
    <YjybrgeORadwBmjP@Air-de-Roger>
    <088c8222-063a-1db5-da83-a5a0168d66c6@gmail.com>
    <Yj16hdrxawD61mAL@Air-de-Roger>
    <639f7ce0-8a07-884c-c1cf-8257b9f3d9e8@gmail.com>
    <Yj7YrW9CG2aXT%2BiC@Air-de-Roger>
    <4da2302b-0745-ea1d-c868-5a8a5fc66b18@gmail.com>
    <Yj8lZWqeHbD%2BkfOQ@Air-de-Roger>

On Sat, 26 Mar 2022, 15:39, Roger Pau Monné <roger.pau@citrix.com> wrote:

> On Sat, Mar 26, 2022 at 02:08:06PM +0200, Ze Dupsys wrote:
> > On 2022.03.26. 11:11, Roger Pau Monné wrote:
> > >
> > > Hm, do you think you could upload (or attach) your
> > > /usr/lib/debug/boot/kernel/kernel.debug and provide an updated panic
> > > trace using that same exact kernel?
> >
> > Yes, it is just too big for email attachment.
> > Uploaded at: https://files.fm/f/mp3v3qa22
> >
> > This time I starved Dom0 of RAM (2G) to speed the panic up. The panic
> > trace is the same.
> >
> > Trace:
> > Fatal trap 12: page fault while in kernel mode
> > cpuid = 2; apic id = 04
> > fault virtual address   = 0x22710028
> > fault code              = supervisor read data, page not present
> > instruction pointer     = 0x20:0xffffffff80c6a2b2
> > stack pointer           = 0x28:0xfffffe009e486b30
> > frame pointer           = 0x28:0xfffffe009e486b30
> > code segment            = base 0x0, limit 0xfffff, type 0x1b
> >                         = DPL 0, pres 1, long 1, def32 0, gran 1
> > processor eflags        = interrupt enabled, resume, IOPL = 0
> > current process         = 3995 (devmatch)
> > trap number             = 12
> > panic: page fault
> > cpuid = 2
> > time = 1648293768
> > KDB: stack backtrace:
> > #0 0xffffffff80c7c285 at kdb_backtrace+0x65
> > #1 0xffffffff80c2e2e1 at vpanic+0x181
> > #2 0xffffffff80c2e153 at panic+0x43
> > #3 0xffffffff810c8b97 at trap+0xba7
> > #4 0xffffffff810c8bef at trap+0xbff
> > #5 0xffffffff810c8243 at trap+0x253
> > #6 0xffffffff810a0848 at calltrap+0x8
> > #7 0xffffffff80c86ed1 at rman_is_region_manager+0x241
> > #8 0xffffffff80c3eb41 at sbuf_new_for_sysctl+0x101
> > #9 0xffffffff80c3df8c at kernel_sysctl+0x3ec
> > #10 0xffffffff80c3e603 at userland_sysctl+0x173
> > #11 0xffffffff80c3e44f at sys___sysctl+0x5f
> > #12 0xffffffff810c949c at amd64_syscall+0x10c
> > #13 0xffffffff810a115b at Xfast_syscall+0xfb
> > Uptime: 10m19s
>
> It's weird, because here you get a page fault, but there are also
> traces with:
>
> general protection fault while in kernel mode
> cpuid = 3; a(d8) Scan for VGA option rom
> pic id = 06
> instruction pointer     = 0x20:0xffffffff810c5d64
> stack pointer           = 0x28:0xfffffe00a20fe990
> frame pointer           = 0x28:0xfffffe00a20fe990
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 8998 (devmatch)
> trap number             = 9
> panic: general protection fault
> cpuid = 3
> time = 1647416577
> KDB: stack backtrace:
> #0 0xffffffff80c7ca05 at kdb_backtrace+0x65
> #1 0xffffffff80c2ea11 at vpanic+0x181
> #2 0xffffffff80c2e883 at panic+0x43
> #3 0xffffffff810c9b97 at trap+0xba7
> #4 0xffffffff810c907b at trap+0x8b
> #5 0xffffffff810a0dc8 at calltrap+0x8
> #6 0xffffffff80c83067 at kvprintf+0x1007
> #7 0xffffffff80c83df9 at snprintf+0x59
> #8 0xffffffff80c8768b at rman_is_region_manager+0x27b
> #9 0xffffffff80c3f271 at sbuf_new_for_sysctl+0x101
> #10 0xffffffff80c3e6bc at kernel_sysctl+0x3ec
> #11 0xffffffff80c3ed33 at userland_sysctl+0x173
> #12 0xffffffff80c3eb7f at sys___sysctl+0x5f
> #13 0xffffffff810ca49c at amd64_syscall+0x10c
> #14 0xffffffff810a16db at Xfast_syscall+0xfb
>
> That shows a general protection fault instead of a page fault.
>
> I've built a hypervisor with debug enabled for you; it's at:
>
> https://people.freebsd.org/~royger/xen-debug
>
> This is the same as the one in ports, just built with debug=y. If you
> can place it in /boot/ and change your xen_kernel to:
>
> xen_kernel="/boot/xen-debug"
>
> It might provide some additional info.
>
> I've also noticed that the process that triggers the panic always
> seems to be 'devmatch'.
>
> >
> > cat /tmp/panic.log| sed -Ee 's/^#[0-9]* //' -e 's/ .*//' | xargs addr2line \
> >     -e /usr/lib/debug/boot/kernel/kernel.debug
> > /usr/src/sys/kern/subr_kdb.c:443
> > /usr/src/sys/kern/kern_shutdown.c:0
> > /usr/src/sys/kern/kern_shutdown.c:844
> > /usr/src/sys/amd64/amd64/trap.c:944
> > /usr/src/sys/amd64/amd64/trap.c:0
> > /usr/src/sys/amd64/amd64/trap.c:0
> > /usr/src/sys/amd64/amd64/exception.S:292
> > /usr/src/sys/kern/subr_rman.c:0
>
> I've been able to get a better trace with gdb and your debug symbols,
> and this is:
>
> (gdb) info line *0xffffffff80c6a2b2
> Line 1386 of "/usr/src/sys/kern/subr_bus.c" starts at address 0xffffffff80c6a2b2 <device_get_name+18>
>    and ends at 0xffffffff80c6a2b6 <device_get_name+22>.
> (gdb) info line *0xffffffff80c86ed1
> Line 1052 of "/usr/src/sys/kern/subr_rman.c" starts at address 0xffffffff80c86ecc <sysctl_rman+540>
>    and ends at 0xffffffff80c86ed5 <sysctl_rman+549>.
>
> The page fault happens exactly at:
>
> https://cgit.freebsd.org/src/tree/sys/kern/subr_bus.c?h=stable/13#n1386
>
> Which is called from
>
> https://cgit.freebsd.org/src/tree/sys/kern/subr_rman.c?h=stable/13#n1052
>
> I'm trying to figure out how the device could be removed or
> disconnected from the rman. I will try to create a patch to catch the
> device that leaves rman regions when destroyed/removed.

Replying from my phone so the format will likely be mangled.

I think I've found at least one issue with blkback leaking resources on
destroy if the ring was not connected. Could you give the following patch
a try? I've only build-tested it, so I can't guarantee it will work.

Thanks, Roger.

Attachment: blkback.patch

diff --git a/sys/dev/xen/blkback/blkback.c b/sys/dev/xen/blkback/blkback.c
index 33414295bf5..664f52a74e7 100644
--- a/sys/dev/xen/blkback/blkback.c
+++ b/sys/dev/xen/blkback/blkback.c
@@ -2781,9 +2781,6 @@ xbb_disconnect(struct xbb_softc *xbb)
 
 	DPRINTF("\n");
 
-	if ((xbb->flags & XBBF_RING_CONNECTED) == 0)
-		return (0);
-
 	mtx_unlock(&xbb->lock);
 	xen_intr_unbind(&xbb->xen_intr_handle);
 	taskqueue_drain(xbb->io_taskqueue, &xbb->io_task); 