Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 31 Aug 2023 17:37:45 -0600
From:      Warner Losh <imp@bsdimp.com>
To:        Cy Schubert <Cy.Schubert@cschubert.com>
Cc:        Alexander Motin <mav@freebsd.org>, Gleb Smirnoff <glebius@freebsd.org>,  Drew Gallatin <gallatin@freebsd.org>, Martin Matuska <mm@freebsd.org>,  src-committers <src-committers@freebsd.org>,  "<dev-commits-src-all@freebsd.org>" <dev-commits-src-all@freebsd.org>,  "<dev-commits-src-main@freebsd.org>" <dev-commits-src-main@freebsd.org>
Subject:   Re: git: 315ee00fa961 - main - zfs: merge openzfs/zfs@804414aad
Message-ID:  <CANCZdfqLWoQnLkKcLYLa73WOKDOAEfXB2rQX869Qaaqv6z=gKA@mail.gmail.com>
In-Reply-To: <20230831233228.9935BA8@slippy.cwsent.com>
References:  <202308270509.37R596B5048298@gitrepo.freebsd.org> <ZO_aOaf-eGiCMCKy@cell.glebi.us> <c09c92df-90f5-8c94-4125-9e33262bc686@FreeBSD.org> <07faf861-9186-47d1-992a-91d483ea4e9c@app.fastmail.com> <1db726d4-32c9-e1b8-51d6-981aa51b7825@FreeBSD.org> <20230831175350.981F1D5@slippy.cwsent.com> <a5c51f3f-8c7f-8bd5-f718-72bc33fe22ed@FreeBSD.org> <20230831223526.DCB701A1@slippy.cwsent.com> <20230831233228.9935BA8@slippy.cwsent.com>

next in thread | previous in thread | raw e-mail | index | archive | help

[-- Attachment #1 --]
On Thu, Aug 31, 2023, 5:32 PM Cy Schubert <Cy.Schubert@cschubert.com> wrote:

> In message <20230831223526.DCB701A1@slippy.cwsent.com>, Cy Schubert
> writes:
> > In message <a5c51f3f-8c7f-8bd5-f718-72bc33fe22ed@FreeBSD.org>,
> Alexander
> > Motin
> > writes:
> > > On 31.08.2023 13:53, Cy Schubert wrote:
> > > > One thing that circumvents my two problems is reducing poudriere
> bulk job
> > s
> > > > from 8 to 5 on my 4 core machines.
> > >
> > > Cy, I have no real evidences to think it is related, other than your
> > > panics look like some memory corruptions, but could you try is patch:
> > > https://github.com/openzfs/zfs/pull/15228 .  If it won't do the
> trick,
> > > then I am out of ideas without additional input.
> >
> > So far so good. Poudriere has been running with a decent -J jobs on both
> > machines for over an hour. I'll let you know if they survive the night.
> It
> > can take some time before the panics happen though.
> >
> > The problem is more likely to occur when there are a lot of small
> package
> > builds than large long running jobs, probably because of the parallel
> ZFS
> > dataset creations, deletions, and rollbacks.
> >
> > >
> > > Gleb, you may try to add this too, just as a choice between impossible
> > > and improbable.
> > >
> > > --
> > > Alexander Motin
> >
> >
> > --
> > Cheers,
> > Cy Schubert <Cy.Schubert@cschubert.com>
> > FreeBSD UNIX:  <cy@FreeBSD.org>   Web:  https://FreeBSD.org
> > NTP:           <cy@nwtime.org>    Web:  https://nwtime.org
> >
> >                       e^(i*pi)+1=0
> >
> >
>
> One of the two machines is hung.
>
> cwfw# ping bob
> PING bob (10.1.1.7): 56 data bytes
> ^C
> --- bob ping statistics ---
> 2 packets transmitted, 0 packets received, 100.0% packet loss
> cwfw# console bob
> [Enter `^Ec?' for help]
> [halt sent]
> KDB: enter: Break to debugger
> [ thread pid 31259 tid 100913 ]
> Stopped at      kdb_break+0x48: movq    $0,0xa1069d(%rip)
> db> bt
> Tracing pid 31259 tid 100913 td 0xfffffe00c4eca000
> kdb_break() at kdb_break+0x48/frame 0xfffffe00c53ef2d0
> uart_intr() at uart_intr+0xf7/frame 0xfffffe00c53ef310
> intr_event_handle() at intr_event_handle+0x12b/frame 0xfffffe00c53ef380
> intr_execute_handlers() at intr_execute_handlers+0x63/frame
> 0xfffffe00c53ef3b0
> Xapic_isr1() at Xapic_isr1+0xdc/frame 0xfffffe00c53ef3b0
> --- interrupt, rip = 0xffffffff806d5c70, rsp = 0xfffffe00c53ef480, rbp =
> 0xfffffe00c53ef480 ---
> getbinuptime() at getbinuptime+0x30/frame 0xfffffe00c53ef480
> arc_access() at arc_access+0x250/frame 0xfffffe00c53ef4d0
> arc_buf_access() at arc_buf_access+0xd0/frame 0xfffffe00c53ef4f0
> dbuf_hold_impl() at dbuf_hold_impl+0xf3/frame 0xfffffe00c53ef580
> dbuf_hold() at dbuf_hold+0x25/frame 0xfffffe00c53ef5b0
> dnode_hold_impl() at dnode_hold_impl+0x194/frame 0xfffffe00c53ef670
> dmu_bonus_hold() at dmu_bonus_hold+0x20/frame 0xfffffe00c53ef6a0
> zfs_zget() at zfs_zget+0x20d/frame 0xfffffe00c53ef750
> zfs_dirent_lookup() at zfs_dirent_lookup+0x16d/frame 0xfffffe00c53ef7a0
> zfs_dirlook() at zfs_dirlook+0x7f/frame 0xfffffe00c53ef7d0
> zfs_lookup() at zfs_lookup+0x3c0/frame 0xfffffe00c53ef8a0
> zfs_freebsd_cachedlookup() at zfs_freebsd_cachedlookup+0x67/frame
> 0xfffffe00c53ef9e0
> vfs_cache_lookup() at vfs_cache_lookup+0xa6/frame 0xfffffe00c53efa30
> vfs_lookup() at vfs_lookup+0x457/frame 0xfffffe00c53efac0
> namei() at namei+0x2e1/frame 0xfffffe00c53efb20
> vn_open_cred() at vn_open_cred+0x505/frame 0xfffffe00c53efca0
> kern_openat() at kern_openat+0x287/frame 0xfffffe00c53efdf0
> ia32_syscall() at ia32_syscall+0x156/frame 0xfffffe00c53eff30
> int0x80_syscall_common() at int0x80_syscall_common+0x9c/frame 0xffff89dc
> db>
>
> I'll let it continue. Hopefully the watchdog timer will pop and we get a
> dump.
>


Might also be interesting to see if this moves around or is really hung
getting the time. I suspect it's live lock given this traceback.

Warner

-- 
> Cheers,
> Cy Schubert <Cy.Schubert@cschubert.com>
> FreeBSD UNIX:  <cy@FreeBSD.org>   Web:  https://FreeBSD.org
> NTP:           <cy@nwtime.org>    Web:  https://nwtime.org
>
>                         e^(i*pi)+1=0
>
>
>  Jミ 、   �
>

[-- Attachment #2 --]
<div dir="auto"><div><br><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Aug 31, 2023, 5:32 PM Cy Schubert &lt;<a href="mailto:Cy.Schubert@cschubert.com">Cy.Schubert@cschubert.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">In message &lt;<a href="mailto:20230831223526.DCB701A1@slippy.cwsent.com" target="_blank" rel="noreferrer">20230831223526.DCB701A1@slippy.cwsent.com</a>&gt;, Cy Schubert writes:<br>
&gt; In message &lt;a5c51f3f-8c7f-8bd5-f718-72bc33fe22ed@FreeBSD.org&gt;, Alexander <br>
&gt; Motin<br>
&gt; writes:<br>
&gt; &gt; On 31.08.2023 13:53, Cy Schubert wrote:<br>
&gt; &gt; &gt; One thing that circumvents my two problems is reducing poudriere bulk job<br>
&gt; s<br>
&gt; &gt; &gt; from 8 to 5 on my 4 core machines.<br>
&gt; &gt;<br>
&gt; &gt; Cy, I have no real evidences to think it is related, other than your <br>
&gt; &gt; panics look like some memory corruptions, but could you try is patch: <br>
&gt; &gt; <a href="https://github.com/openzfs/zfs/pull/15228" rel="noreferrer noreferrer" target="_blank">https://github.com/openzfs/zfs/pull/15228</a>; .  If it won&#39;t do the trick, <br>
&gt; &gt; then I am out of ideas without additional input.<br>
&gt;<br>
&gt; So far so good. Poudriere has been running with a decent -J jobs on both <br>
&gt; machines for over an hour. I&#39;ll let you know if they survive the night. It <br>
&gt; can take some time before the panics happen though.<br>
&gt;<br>
&gt; The problem is more likely to occur when there are a lot of small package <br>
&gt; builds than large long running jobs, probably because of the parallel ZFS <br>
&gt; dataset creations, deletions, and rollbacks.<br>
&gt;<br>
&gt; &gt;<br>
&gt; &gt; Gleb, you may try to add this too, just as a choice between impossible <br>
&gt; &gt; and improbable.<br>
&gt; &gt;<br>
&gt; &gt; -- <br>
&gt; &gt; Alexander Motin<br>
&gt;<br>
&gt;<br>
&gt; -- <br>
&gt; Cheers,<br>
&gt; Cy Schubert &lt;<a href="mailto:Cy.Schubert@cschubert.com" target="_blank" rel="noreferrer">Cy.Schubert@cschubert.com</a>&gt;<br>
&gt; FreeBSD UNIX:  &lt;cy@FreeBSD.org&gt;   Web:  <a href="https://FreeBSD.org" rel="noreferrer noreferrer" target="_blank">https://FreeBSD.org</a><br>;
&gt; NTP:           &lt;<a href="mailto:cy@nwtime.org" target="_blank" rel="noreferrer">cy@nwtime.org</a>&gt;    Web:  <a href="https://nwtime.org" rel="noreferrer noreferrer" target="_blank">https://nwtime.org</a><br>;
&gt;<br>
&gt;                       e^(i*pi)+1=0<br>
&gt;<br>
&gt;<br>
<br>
One of the two machines is hung. <br>
<br>
cwfw# ping bob<br>
PING bob (10.1.1.7): 56 data bytes<br>
^C<br>
--- bob ping statistics ---<br>
2 packets transmitted, 0 packets received, 100.0% packet loss<br>
cwfw# console bob<br>
[Enter `^Ec?&#39; for help]<br>
[halt sent]<br>
KDB: enter: Break to debugger<br>
[ thread pid 31259 tid 100913 ]<br>
Stopped at      kdb_break+0x48: movq    $0,0xa1069d(%rip)<br>
db&gt; bt<br>
Tracing pid 31259 tid 100913 td 0xfffffe00c4eca000<br>
kdb_break() at kdb_break+0x48/frame 0xfffffe00c53ef2d0<br>
uart_intr() at uart_intr+0xf7/frame 0xfffffe00c53ef310<br>
intr_event_handle() at intr_event_handle+0x12b/frame 0xfffffe00c53ef380<br>
intr_execute_handlers() at intr_execute_handlers+0x63/frame <br>
0xfffffe00c53ef3b0<br>
Xapic_isr1() at Xapic_isr1+0xdc/frame 0xfffffe00c53ef3b0<br>
--- interrupt, rip = 0xffffffff806d5c70, rsp = 0xfffffe00c53ef480, rbp = <br>
0xfffffe00c53ef480 ---<br>
getbinuptime() at getbinuptime+0x30/frame 0xfffffe00c53ef480<br>
arc_access() at arc_access+0x250/frame 0xfffffe00c53ef4d0<br>
arc_buf_access() at arc_buf_access+0xd0/frame 0xfffffe00c53ef4f0<br>
dbuf_hold_impl() at dbuf_hold_impl+0xf3/frame 0xfffffe00c53ef580<br>
dbuf_hold() at dbuf_hold+0x25/frame 0xfffffe00c53ef5b0<br>
dnode_hold_impl() at dnode_hold_impl+0x194/frame 0xfffffe00c53ef670<br>
dmu_bonus_hold() at dmu_bonus_hold+0x20/frame 0xfffffe00c53ef6a0<br>
zfs_zget() at zfs_zget+0x20d/frame 0xfffffe00c53ef750<br>
zfs_dirent_lookup() at zfs_dirent_lookup+0x16d/frame 0xfffffe00c53ef7a0<br>
zfs_dirlook() at zfs_dirlook+0x7f/frame 0xfffffe00c53ef7d0<br>
zfs_lookup() at zfs_lookup+0x3c0/frame 0xfffffe00c53ef8a0<br>
zfs_freebsd_cachedlookup() at zfs_freebsd_cachedlookup+0x67/frame <br>
0xfffffe00c53ef9e0<br>
vfs_cache_lookup() at vfs_cache_lookup+0xa6/frame 0xfffffe00c53efa30<br>
vfs_lookup() at vfs_lookup+0x457/frame 0xfffffe00c53efac0<br>
namei() at namei+0x2e1/frame 0xfffffe00c53efb20<br>
vn_open_cred() at vn_open_cred+0x505/frame 0xfffffe00c53efca0<br>
kern_openat() at kern_openat+0x287/frame 0xfffffe00c53efdf0<br>
ia32_syscall() at ia32_syscall+0x156/frame 0xfffffe00c53eff30<br>
int0x80_syscall_common() at int0x80_syscall_common+0x9c/frame 0xffff89dc<br>
db&gt; <br>
<br>
I&#39;ll let it continue. Hopefully the watchdog timer will pop and we get a <br>
dump.<br></blockquote></div></div><div dir="auto"><br></div><div dir="auto"><br></div><div dir="auto">Might also be interesting to see if this moves around or is really hung getting the time. I suspect it&#39;s live lock given this traceback.</div><div dir="auto"><br></div><div dir="auto">Warner</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
-- <br>
Cheers,<br>
Cy Schubert &lt;<a href="mailto:Cy.Schubert@cschubert.com" target="_blank" rel="noreferrer">Cy.Schubert@cschubert.com</a>&gt;<br>
FreeBSD UNIX:  &lt;cy@FreeBSD.org&gt;   Web:  <a href="https://FreeBSD.org" rel="noreferrer noreferrer" target="_blank">https://FreeBSD.org</a><br>;
NTP:           &lt;<a href="mailto:cy@nwtime.org" target="_blank" rel="noreferrer">cy@nwtime.org</a>&gt;    Web:  <a href="https://nwtime.org" rel="noreferrer noreferrer" target="_blank">https://nwtime.org</a><br>;
<br>
                        e^(i*pi)+1=0<br>
<br>
<br>
 Jミ 、   �<br>
</blockquote></div></div></div>

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CANCZdfqLWoQnLkKcLYLa73WOKDOAEfXB2rQX869Qaaqv6z=gKA>