Date:      Sat, 26 Mar 2022 16:56:21 +0100
From:      Roger Pau Monné <royger@gmail.com>
To:        Ze Dupsys <zedupsys@gmail.com>
Cc:        freebsd-xen@freebsd.org, Brian Buhrow <buhrow@nfbcal.org>
Subject:   Re: ZFS + FreeBSD XEN dom0 panic
Message-ID:  <CAPLaKK7dszTy_6rcKcWZ0vK_E0ZWQ3QfuiHwvzSN_YWN_Gr9AA@mail.gmail.com>
In-Reply-To: <Yj8lZWqeHbD%2BkfOQ@Air-de-Roger>
References:  <YjipQwBQ/JTo4S6G@Air-de-Roger> <Yji8NZePmovLFhk2@Air-de-Roger> <YjxuPF80Z8z0V58t@Air-de-Roger> <abcdae23-eea9-93c3-04da-61b7f79a99e9@gmail.com> <YjybrgeORadwBmjP@Air-de-Roger> <088c8222-063a-1db5-da83-a5a0168d66c6@gmail.com> <Yj16hdrxawD61mAL@Air-de-Roger> <639f7ce0-8a07-884c-c1cf-8257b9f3d9e8@gmail.com> <Yj7YrW9CG2aXT%2BiC@Air-de-Roger> <4da2302b-0745-ea1d-c868-5a8a5fc66b18@gmail.com> <Yj8lZWqeHbD%2BkfOQ@Air-de-Roger>

--00000000000097864305db211f9b
Content-Type: multipart/alternative; boundary="00000000000097864105db211f99"

--00000000000097864105db211f99
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Sat, 26 Mar 2022 at 15:39, Roger Pau Monné <roger.pau@citrix.com> wrote:

> On Sat, Mar 26, 2022 at 02:08:06PM +0200, Ze Dupsys wrote:
> > On 2022.03.26. 11:11, Roger Pau Monné wrote:
> > >
> > > Hm, do you think you could upload (or attach) your
> > > /usr/lib/debug/boot/kernel/kernel.debug and provide an updated panic
> > > trace using that same exact kernel?
> >
> > Yes, it is just too big for email attachment.
> > Uploaded at: https://files.fm/f/mp3v3qa22
> >
> > This time I starved Dom0 of RAM (2G) to speed the panic up. The panic
> > trace is the same.
> >
> > Trace:
> > Fatal trap 12: page fault while in kernel mode
> > cpuid = 2; apic id = 04
> > fault virtual address = 0x22710028
> > fault code            = supervisor read data, page not present
> > instruction pointer   = 0x20:0xffffffff80c6a2b2
> > stack pointer         = 0x28:0xfffffe009e486b30
> > frame pointer         = 0x28:0xfffffe009e486b30
> > code segment          = base 0x0, limit 0xfffff, type 0x1b
> >                       = DPL 0, pres 1, long 1, def32 0, gran 1
> > processor eflags      = interrupt enabled, resume, IOPL = 0
> > current process               = 3995 (devmatch)
> > trap number           = 12
> > panic: page fault
> > cpuid = 2
> > time = 1648293768
> > KDB: stack backtrace:
> > #0 0xffffffff80c7c285 at kdb_backtrace+0x65
> > #1 0xffffffff80c2e2e1 at vpanic+0x181
> > #2 0xffffffff80c2e153 at panic+0x43
> > #3 0xffffffff810c8b97 at trap+0xba7
> > #4 0xffffffff810c8bef at trap+0xbff
> > #5 0xffffffff810c8243 at trap+0x253
> > #6 0xffffffff810a0848 at calltrap+0x8
> > #7 0xffffffff80c86ed1 at rman_is_region_manager+0x241
> > #8 0xffffffff80c3eb41 at sbuf_new_for_sysctl+0x101
> > #9 0xffffffff80c3df8c at kernel_sysctl+0x3ec
> > #10 0xffffffff80c3e603 at userland_sysctl+0x173
> > #11 0xffffffff80c3e44f at sys___sysctl+0x5f
> > #12 0xffffffff810c949c at amd64_syscall+0x10c
> > #13 0xffffffff810a115b at Xfast_syscall+0xfb
> > Uptime: 10m19s
>
> It's weird, because here you get a page fault, but there are also
> traces with:
>
> general protection fault while in kernel mode
> cpuid = 3; a(d8) Scan for VGA option rom
> pic id = 06
> instruction pointer     = 0x20:0xffffffff810c5d64
> stack pointer           = 0x28:0xfffffe00a20fe990
> frame pointer           = 0x28:0xfffffe00a20fe990
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 8998 (devmatch)
> trap number             = 9
> panic: general protection fault
> cpuid = 3
> time = 1647416577
> KDB: stack backtrace:
> #0 0xffffffff80c7ca05 at kdb_backtrace+0x65
> #1 0xffffffff80c2ea11 at vpanic+0x181
> #2 0xffffffff80c2e883 at panic+0x43
> #3 0xffffffff810c9b97 at trap+0xba7
> #4 0xffffffff810c907b at trap+0x8b
> #5 0xffffffff810a0dc8 at calltrap+0x8
> #6 0xffffffff80c83067 at kvprintf+0x1007
> #7 0xffffffff80c83df9 at snprintf+0x59
> #8 0xffffffff80c8768b at rman_is_region_manager+0x27b
> #9 0xffffffff80c3f271 at sbuf_new_for_sysctl+0x101
> #10 0xffffffff80c3e6bc at kernel_sysctl+0x3ec
> #11 0xffffffff80c3ed33 at userland_sysctl+0x173
> #12 0xffffffff80c3eb7f at sys___sysctl+0x5f
> #13 0xffffffff810ca49c at amd64_syscall+0x10c
> #14 0xffffffff810a16db at Xfast_syscall+0xfb
>
> That shows a general protection fault instead of a page fault.
>
> I've built a hypervisor with debug enabled for you, it's at:
>
> https://people.freebsd.org/~royger/xen-debug
>
> This is the same as the one in ports, just built with debug=y. If you
> can place it in /boot/ and change your xen_kernel to:
>
> xen_kernel="/boot/xen-debug"
>
> It might provide some additional info.
>
> I've also noticed that the process that triggers the panic always
> seems to be 'devmatch'.
>
> >
> > cat /tmp/panic.log | sed -Ee 's/^#[0-9]* //' -e 's/ .*//' | xargs addr2line
> > -e /usr/lib/debug/boot/kernel/kernel.debug
> > /usr/src/sys/kern/subr_kdb.c:443
> > /usr/src/sys/kern/kern_shutdown.c:0
> > /usr/src/sys/kern/kern_shutdown.c:844
> > /usr/src/sys/amd64/amd64/trap.c:944
> > /usr/src/sys/amd64/amd64/trap.c:0
> > /usr/src/sys/amd64/amd64/trap.c:0
> > /usr/src/sys/amd64/amd64/exception.S:292
> > /usr/src/sys/kern/subr_rman.c:0
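
The sed stage in the pipeline quoted above only strips each KDB frame down to its address. As a rough illustration of that step, here is a minimal Python sketch. The function name parse_backtrace and the sample text are my own and not part of any FreeBSD tool, and the regular expression assumes frames look exactly like the ones in this thread.

```python
import re

# Matches KDB backtrace frames such as
# "#7 0xffffffff80c86ed1 at rman_is_region_manager+0x241".
# search() is used so that e-mail quote prefixes ("> > ") are tolerated.
FRAME_RE = re.compile(
    r"#(\d+)\s+(0x[0-9a-fA-F]+)\s+at\s+(\S+)\+(0x[0-9a-fA-F]+)"
)


def parse_backtrace(text):
    """Return (frame, address, symbol, offset) tuples found in a panic log."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.search(line)
        if m:
            frames.append(
                (int(m.group(1)), int(m.group(2), 16),
                 m.group(3), int(m.group(4), 16))
            )
    return frames


# Three frames copied from the page fault trace in this thread.
sample = """\
> > #6 0xffffffff810a0848 at calltrap+0x8
> > #7 0xffffffff80c86ed1 at rman_is_region_manager+0x241
> > #8 0xffffffff80c3eb41 at sbuf_new_for_sysctl+0x101
"""

frames = parse_backtrace(sample)

# Print the bare addresses, one per line, in the form addr2line accepts.
for _, addr, _, _ in frames:
    print(hex(addr))
```

The printed addresses can be passed to addr2line -e with the matching kernel.debug, exactly as in the quoted command.
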
>
> I've been able to get a better trace with gdb and your debug symbols,
> and this is:
>
> (gdb) info line *0xffffffff80c6a2b2
> Line 1386 of "/usr/src/sys/kern/subr_bus.c" starts at address
> 0xffffffff80c6a2b2 <device_get_name+18>
>    and ends at 0xffffffff80c6a2b6 <device_get_name+22>.
> (gdb) info line *0xffffffff80c86ed1
> Line 1052 of "/usr/src/sys/kern/subr_rman.c" starts at address
> 0xffffffff80c86ecc <sysctl_rman+540>
>    and ends at 0xffffffff80c86ed5 <sysctl_rman+549>.
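
A side note on why the KDB backtrace names rman_is_region_manager while gdb names sysctl_rman for the same address: the arithmetic below, using only numbers from the two outputs above, suggests the two symbols start 32 bytes apart and the return address lies inside sysctl_rman. My guess, and it is only a guess, is that the in-kernel lookup resolved to the nearest preceding symbol it knew about. The helper symbol_base is a name I made up for this sketch.

```python
# Values copied from the KDB backtrace and the gdb session above.
RETURN_ADDR = 0xffffffff80c86ed1     # frame #7 in the page fault trace
KDB_OFFSET = 0x241                   # "rman_is_region_manager+0x241"
GDB_LINE_START = 0xffffffff80c86ecc  # "<sysctl_rman+540>"
GDB_LINE_START_OFFSET = 540
GDB_LINE_END_OFFSET = 549            # "<sysctl_rman+549>"


def symbol_base(address, offset):
    """Start address of a symbol, given an address inside it and its offset."""
    return address - offset


kdb_base = symbol_base(RETURN_ADDR, KDB_OFFSET)
gdb_base = symbol_base(GDB_LINE_START, GDB_LINE_START_OFFSET)

print(hex(kdb_base))           # start of rman_is_region_manager per KDB
print(hex(gdb_base))           # start of sysctl_rman per gdb
print(gdb_base - kdb_base)     # distance between the two symbol starts
print(RETURN_ADDR - gdb_base)  # offset of the return address in sysctl_rman
```

The last value falls between 540 and 549, which is consistent with gdb placing the address on line 1052 of subr_rman.c.
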
>
> The page fault happens exactly at:
>
> https://cgit.freebsd.org/src/tree/sys/kern/subr_bus.c?h=stable/13#n1386
>
> Which is called from
>
> https://cgit.freebsd.org/src/tree/sys/kern/subr_rman.c?h=stable/13#n1052
>
> I'm trying to figure out how the device could be removed or
> disconnected from the rman. I will try to create a patch to catch the
> device that leaves rman regions when destroyed/removed.


Replying from my phone so the format will likely be mangled.

I think I've found at least one issue with blkback leaking resources on
destroy if the ring was not connected. Could you give the attached patch a
try? I've only build-tested it, so I can't guarantee it will work.
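
To illustrate the shape of the problem: the attached patch removes an early return from xbb_disconnect() that was taken when XBBF_RING_CONNECTED was not set. The toy Python model below shows why such a check can leak. Everything in it (ToyBackend, the resource names, the method names) is hypothetical and stands in for the real driver state, which is considerably more involved.

```python
class ToyBackend:
    """Hypothetical stand-in for a backend instance; not the real xbb_softc."""

    def __init__(self, ring_connected):
        self.ring_connected = ring_connected
        # Pretend these were set up before the ring was ever connected.
        self.resources = {"resource-a", "resource-b"}

    def disconnect_early_return(self):
        # Shape of the code the patch removes: bail out if there is no ring,
        # which skips the teardown below.
        if not self.ring_connected:
            return
        self.resources.clear()

    def disconnect_always(self):
        # Shape after the patch: the teardown path always runs.
        self.resources.clear()


# Destroy a backend whose ring never got connected, both ways.
before = ToyBackend(ring_connected=False)
before.disconnect_early_return()

after = ToyBackend(ring_connected=False)
after.disconnect_always()

print(len(before.resources), len(after.resources))
```

In the model, the early-return variant still holds both resources after disconnect, while the always-teardown variant holds none. Whether the real teardown path is safe to run without a connected ring is exactly what needs testing.
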

Thanks, Roger.

--00000000000097864105db211f99--
--00000000000097864305db211f9b
Content-Type: application/x-patch; name="blkback.patch"
Content-Disposition: attachment; filename="blkback.patch"
Content-Transfer-Encoding: 7bit
Content-ID: <17fc6eb5f8e17e280a21>
X-Attachment-Id: 17fc6eb5f8e17e280a21

diff --git a/sys/dev/xen/blkback/blkback.c b/sys/dev/xen/blkback/blkback.c
index 33414295bf5..664f52a74e7 100644
--- a/sys/dev/xen/blkback/blkback.c
+++ b/sys/dev/xen/blkback/blkback.c
@@ -2781,9 +2781,6 @@ xbb_disconnect(struct xbb_softc *xbb)
 
 	DPRINTF("\n");
 
-	if ((xbb->flags & XBBF_RING_CONNECTED) == 0)
-		return (0);
-
 	mtx_unlock(&xbb->lock);
 	xen_intr_unbind(&xbb->xen_intr_handle);
 	taskqueue_drain(xbb->io_taskqueue, &xbb->io_task); 
--00000000000097864305db211f9b--


