Date:      Thu, 31 Aug 2023 17:37:45 -0600
From:      Warner Losh <imp@bsdimp.com>
To:        Cy Schubert <Cy.Schubert@cschubert.com>
Cc:        Alexander Motin <mav@freebsd.org>, Gleb Smirnoff <glebius@freebsd.org>,  Drew Gallatin <gallatin@freebsd.org>, Martin Matuska <mm@freebsd.org>,  src-committers <src-committers@freebsd.org>,  "<dev-commits-src-all@freebsd.org>" <dev-commits-src-all@freebsd.org>,  "<dev-commits-src-main@freebsd.org>" <dev-commits-src-main@freebsd.org>
Subject:   Re: git: 315ee00fa961 - main - zfs: merge openzfs/zfs@804414aad
Message-ID:  <CANCZdfqLWoQnLkKcLYLa73WOKDOAEfXB2rQX869Qaaqv6z=gKA@mail.gmail.com>
In-Reply-To: <20230831233228.9935BA8@slippy.cwsent.com>
References:  <202308270509.37R596B5048298@gitrepo.freebsd.org> <ZO_aOaf-eGiCMCKy@cell.glebi.us> <c09c92df-90f5-8c94-4125-9e33262bc686@FreeBSD.org> <07faf861-9186-47d1-992a-91d483ea4e9c@app.fastmail.com> <1db726d4-32c9-e1b8-51d6-981aa51b7825@FreeBSD.org> <20230831175350.981F1D5@slippy.cwsent.com> <a5c51f3f-8c7f-8bd5-f718-72bc33fe22ed@FreeBSD.org> <20230831223526.DCB701A1@slippy.cwsent.com> <20230831233228.9935BA8@slippy.cwsent.com>

On Thu, Aug 31, 2023, 5:32 PM Cy Schubert <Cy.Schubert@cschubert.com> wrote:

> In message <20230831223526.DCB701A1@slippy.cwsent.com>, Cy Schubert writes:
> > In message <a5c51f3f-8c7f-8bd5-f718-72bc33fe22ed@FreeBSD.org>, Alexander Motin writes:
> > > On 31.08.2023 13:53, Cy Schubert wrote:
> > > > One thing that circumvents my two problems is reducing poudriere bulk
> > > > jobs from 8 to 5 on my 4 core machines.
> > >
> > > Cy, I have no real evidence that it is related, other than that your
> > > panics look like memory corruption, but could you try this patch:
> > > https://github.com/openzfs/zfs/pull/15228 . If it doesn't do the trick,
> > > then I am out of ideas without additional input.
> >
> > So far so good. Poudriere has been running with a decent -J jobs on both
> > machines for over an hour. I'll let you know if they survive the night. It
> > can take some time before the panics happen though.
> >
> > The problem is more likely to occur when there are a lot of small package
> > builds than large long running jobs, probably because of the parallel ZFS
> > dataset creations, deletions, and rollbacks.
> >
> > >
> > > Gleb, you may try to add this too, just as a choice between impossible
> > > and improbable.
> > >
> > > --
> > > Alexander Motin
> >
> >
> > --
> > Cheers,
> > Cy Schubert <Cy.Schubert@cschubert.com>
> > FreeBSD UNIX:  <cy@FreeBSD.org>   Web:  https://FreeBSD.org
> > NTP:           <cy@nwtime.org>    Web:  https://nwtime.org
> >
> >                       e^(i*pi)+1=0
> >
> >
>
> One of the two machines is hung.
>
> cwfw# ping bob
> PING bob (10.1.1.7): 56 data bytes
> ^C
> --- bob ping statistics ---
> 2 packets transmitted, 0 packets received, 100.0% packet loss
> cwfw# console bob
> [Enter `^Ec?' for help]
> [halt sent]
> KDB: enter: Break to debugger
> [ thread pid 31259 tid 100913 ]
> Stopped at      kdb_break+0x48: movq    $0,0xa1069d(%rip)
> db> bt
> Tracing pid 31259 tid 100913 td 0xfffffe00c4eca000
> kdb_break() at kdb_break+0x48/frame 0xfffffe00c53ef2d0
> uart_intr() at uart_intr+0xf7/frame 0xfffffe00c53ef310
> intr_event_handle() at intr_event_handle+0x12b/frame 0xfffffe00c53ef380
> intr_execute_handlers() at intr_execute_handlers+0x63/frame 0xfffffe00c53ef3b0
> Xapic_isr1() at Xapic_isr1+0xdc/frame 0xfffffe00c53ef3b0
> --- interrupt, rip = 0xffffffff806d5c70, rsp = 0xfffffe00c53ef480, rbp = 0xfffffe00c53ef480 ---
> getbinuptime() at getbinuptime+0x30/frame 0xfffffe00c53ef480
> arc_access() at arc_access+0x250/frame 0xfffffe00c53ef4d0
> arc_buf_access() at arc_buf_access+0xd0/frame 0xfffffe00c53ef4f0
> dbuf_hold_impl() at dbuf_hold_impl+0xf3/frame 0xfffffe00c53ef580
> dbuf_hold() at dbuf_hold+0x25/frame 0xfffffe00c53ef5b0
> dnode_hold_impl() at dnode_hold_impl+0x194/frame 0xfffffe00c53ef670
> dmu_bonus_hold() at dmu_bonus_hold+0x20/frame 0xfffffe00c53ef6a0
> zfs_zget() at zfs_zget+0x20d/frame 0xfffffe00c53ef750
> zfs_dirent_lookup() at zfs_dirent_lookup+0x16d/frame 0xfffffe00c53ef7a0
> zfs_dirlook() at zfs_dirlook+0x7f/frame 0xfffffe00c53ef7d0
> zfs_lookup() at zfs_lookup+0x3c0/frame 0xfffffe00c53ef8a0
> zfs_freebsd_cachedlookup() at zfs_freebsd_cachedlookup+0x67/frame 0xfffffe00c53ef9e0
> vfs_cache_lookup() at vfs_cache_lookup+0xa6/frame 0xfffffe00c53efa30
> vfs_lookup() at vfs_lookup+0x457/frame 0xfffffe00c53efac0
> namei() at namei+0x2e1/frame 0xfffffe00c53efb20
> vn_open_cred() at vn_open_cred+0x505/frame 0xfffffe00c53efca0
> kern_openat() at kern_openat+0x287/frame 0xfffffe00c53efdf0
> ia32_syscall() at ia32_syscall+0x156/frame 0xfffffe00c53eff30
> int0x80_syscall_common() at int0x80_syscall_common+0x9c/frame 0xffff89dc
> db>
>
> I'll let it continue. Hopefully the watchdog timer will pop and we get a
> dump.
>


It might also be interesting to see whether the backtrace moves around
between breaks or the thread is really hung getting the time. I suspect
a livelock given this traceback.
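
One rough way to check for movement is to break into ddb twice, capture the `bt` output each time, and compare the frames below the `--- interrupt` marker. The Python sketch below does that comparison. The helper names (`interrupted_frames`, `classify`) and the result labels are made up for illustration and are not part of any FreeBSD tool. The regex assumes only the `func() at func+0xNN/frame` layout seen in the trace quoted above.

```python
import re

# Matches ddb frame lines such as
#   "getbinuptime() at getbinuptime+0x30/frame 0xfffffe00c53ef480"
# Leading "> " mail-quote markers and whitespace are tolerated.
FRAME_RE = re.compile(r"^[>\s]*(\w+)\(\) at \1\+(0x[0-9a-fA-F]+)/frame")


def interrupted_frames(bt_text):
    """Return [(function, offset), ...] for the interrupted code only.

    Frames printed before a "--- interrupt" line belong to the
    debugger/interrupt-handler path, so they are discarded.
    """
    frames = []
    for line in bt_text.splitlines():
        if "--- interrupt" in line:
            frames = []  # drop kdb_break/uart_intr/... seen so far
            continue
        m = FRAME_RE.match(line)
        if m:
            frames.append((m.group(1), int(m.group(2), 16)))
    return frames


def classify(bt_first, bt_second):
    """Compare two captures of the same thread (hypothetical labels)."""
    a = interrupted_frames(bt_first)
    b = interrupted_frames(bt_second)
    if not a or not b:
        return "unparsed"
    if a == b:
        return "identical"        # same functions and same offsets
    if [f for f, _ in a] == [f for f, _ in b]:
        return "same-functions"   # same call path, different offsets
    return "different"            # the call path itself changed
```

If two captures taken some seconds apart come back `identical` or `same-functions`, that is consistent with a thread spinning in one place. A result of `different` suggests it is still moving. Neither result is proof on its own.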

Warner

> --
> Cheers,
> Cy Schubert <Cy.Schubert@cschubert.com>
> FreeBSD UNIX:  <cy@FreeBSD.org>   Web:  https://FreeBSD.org
> NTP:           <cy@nwtime.org>    Web:  https://nwtime.org
>
>                         e^(i*pi)+1=0
>

