Date: Thu, 31 Aug 2023 17:37:45 -0600
From: Warner Losh <imp@bsdimp.com>
To: Cy Schubert <Cy.Schubert@cschubert.com>
Cc: Alexander Motin <mav@freebsd.org>, Gleb Smirnoff <glebius@freebsd.org>, Drew Gallatin <gallatin@freebsd.org>, Martin Matuska <mm@freebsd.org>, src-committers <src-committers@freebsd.org>, "<dev-commits-src-all@freebsd.org>" <dev-commits-src-all@freebsd.org>, "<dev-commits-src-main@freebsd.org>" <dev-commits-src-main@freebsd.org>
Subject: Re: git: 315ee00fa961 - main - zfs: merge openzfs/zfs@804414aad
Message-ID: <CANCZdfqLWoQnLkKcLYLa73WOKDOAEfXB2rQX869Qaaqv6z=gKA@mail.gmail.com>
In-Reply-To: <20230831233228.9935BA8@slippy.cwsent.com>
References: <202308270509.37R596B5048298@gitrepo.freebsd.org> <ZO_aOaf-eGiCMCKy@cell.glebi.us> <c09c92df-90f5-8c94-4125-9e33262bc686@FreeBSD.org> <07faf861-9186-47d1-992a-91d483ea4e9c@app.fastmail.com> <1db726d4-32c9-e1b8-51d6-981aa51b7825@FreeBSD.org> <20230831175350.981F1D5@slippy.cwsent.com> <a5c51f3f-8c7f-8bd5-f718-72bc33fe22ed@FreeBSD.org> <20230831223526.DCB701A1@slippy.cwsent.com> <20230831233228.9935BA8@slippy.cwsent.com>
On Thu, Aug 31, 2023, 5:32 PM Cy Schubert <Cy.Schubert@cschubert.com> wrote:

> In message <20230831223526.DCB701A1@slippy.cwsent.com>, Cy Schubert writes:
> > In message <a5c51f3f-8c7f-8bd5-f718-72bc33fe22ed@FreeBSD.org>, Alexander Motin writes:
> > > On 31.08.2023 13:53, Cy Schubert wrote:
> > > > One thing that circumvents my two problems is reducing poudriere bulk jobs
> > > > from 8 to 5 on my 4 core machines.
> > >
> > > Cy, I have no real evidences to think it is related, other than your
> > > panics look like some memory corruptions, but could you try is patch:
> > > https://github.com/openzfs/zfs/pull/15228 . If it won't do the trick,
> > > then I am out of ideas without additional input.
> >
> > So far so good. Poudriere has been running with a decent -J jobs on both
> > machines for over an hour. I'll let you know if they survive the night. It
> > can take some time before the panics happen though.
> >
> > The problem is more likely to occur when there are a lot of small package
> > builds than large long running jobs, probably because of the parallel ZFS
> > dataset creations, deletions, and rollbacks.
> >
> > > Gleb, you may try to add this too, just as a choice between impossible
> > > and improbable.
> > >
> > > --
> > > Alexander Motin
> >
> > --
> > Cheers,
> > Cy Schubert <Cy.Schubert@cschubert.com>
> > FreeBSD UNIX: <cy@FreeBSD.org> Web: https://FreeBSD.org
> > NTP: <cy@nwtime.org> Web: https://nwtime.org
> >
> > e^(i*pi)+1=0
>
> One of the two machines is hung.
>
> cwfw# ping bob
> PING bob (10.1.1.7): 56 data bytes
> ^C
> --- bob ping statistics ---
> 2 packets transmitted, 0 packets received, 100.0% packet loss
> cwfw# console bob
> [Enter `^Ec?' for help]
> [halt sent]
> KDB: enter: Break to debugger
> [ thread pid 31259 tid 100913 ]
> Stopped at kdb_break+0x48: movq $0,0xa1069d(%rip)
> db> bt
> Tracing pid 31259 tid 100913 td 0xfffffe00c4eca000
> kdb_break() at kdb_break+0x48/frame 0xfffffe00c53ef2d0
> uart_intr() at uart_intr+0xf7/frame 0xfffffe00c53ef310
> intr_event_handle() at intr_event_handle+0x12b/frame 0xfffffe00c53ef380
> intr_execute_handlers() at intr_execute_handlers+0x63/frame 0xfffffe00c53ef3b0
> Xapic_isr1() at Xapic_isr1+0xdc/frame 0xfffffe00c53ef3b0
> --- interrupt, rip = 0xffffffff806d5c70, rsp = 0xfffffe00c53ef480, rbp = 0xfffffe00c53ef480 ---
> getbinuptime() at getbinuptime+0x30/frame 0xfffffe00c53ef480
> arc_access() at arc_access+0x250/frame 0xfffffe00c53ef4d0
> arc_buf_access() at arc_buf_access+0xd0/frame 0xfffffe00c53ef4f0
> dbuf_hold_impl() at dbuf_hold_impl+0xf3/frame 0xfffffe00c53ef580
> dbuf_hold() at dbuf_hold+0x25/frame 0xfffffe00c53ef5b0
> dnode_hold_impl() at dnode_hold_impl+0x194/frame 0xfffffe00c53ef670
> dmu_bonus_hold() at dmu_bonus_hold+0x20/frame 0xfffffe00c53ef6a0
> zfs_zget() at zfs_zget+0x20d/frame 0xfffffe00c53ef750
> zfs_dirent_lookup() at zfs_dirent_lookup+0x16d/frame 0xfffffe00c53ef7a0
> zfs_dirlook() at zfs_dirlook+0x7f/frame 0xfffffe00c53ef7d0
> zfs_lookup() at zfs_lookup+0x3c0/frame 0xfffffe00c53ef8a0
> zfs_freebsd_cachedlookup() at zfs_freebsd_cachedlookup+0x67/frame 0xfffffe00c53ef9e0
> vfs_cache_lookup() at vfs_cache_lookup+0xa6/frame 0xfffffe00c53efa30
> vfs_lookup() at vfs_lookup+0x457/frame 0xfffffe00c53efac0
> namei() at namei+0x2e1/frame 0xfffffe00c53efb20
> vn_open_cred() at vn_open_cred+0x505/frame 0xfffffe00c53efca0
> kern_openat() at kern_openat+0x287/frame 0xfffffe00c53efdf0
> ia32_syscall() at ia32_syscall+0x156/frame 0xfffffe00c53eff30
> int0x80_syscall_common() at int0x80_syscall_common+0x9c/frame 0xffff89dc
> db>
>
> I'll let it continue. Hopefully the watchdog timer will pop and we get a
> dump.

Might also be interesting to see if this moves around or is really hung
getting the time. I suspect it's live lock given this traceback.

Warner

> --
> Cheers,
> Cy Schubert <Cy.Schubert@cschubert.com>
> FreeBSD UNIX: <cy@FreeBSD.org> Web: https://FreeBSD.org
> NTP: <cy@nwtime.org> Web: https://nwtime.org
>
> e^(i*pi)+1=0
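One way to check whether the thread "moves around", as suggested in the reply above, is to break into the debugger a few times and compare the `bt` output from each stop. The following is only a rough sketch in Python, assuming the backtraces were captured as text from the serial console. `frames` and `compare` are hypothetical helper names, not part of any FreeBSD tool, and the interpretation of the result is a heuristic rather than a diagnosis.

```python
import re

# Matches ddb frame lines, optionally prefixed by mail quoting ("> "), e.g.
#   arc_access() at arc_access+0x250/frame 0xfffffe00c53ef4d0
FRAME_RE = re.compile(r"^[>\s]*(\w+)\(\) at (\w+)\+(0x[0-9a-f]+)/frame", re.M)

def frames(bt_text):
    """Return [(function, offset), ...] for the interrupted thread.

    Frames before the "--- interrupt" marker belong to the debugger entry
    path (kdb_break, uart_intr, ...) and are skipped when the marker exists.
    """
    _, _, tail = bt_text.rpartition("--- interrupt")
    return [(m.group(2), int(m.group(3), 16)) for m in FRAME_RE.finditer(tail)]

def compare(bt_a, bt_b):
    """Classify how two backtraces of the same thread differ."""
    a, b = frames(bt_a), frames(bt_b)
    if a == b:
        return "identical"        # same functions and same offsets
    if [name for name, _ in a] == [name for name, _ in b]:
        return "same-functions"   # same call chain, different offsets
    return "different"            # the call chain itself changed
```

If repeated samples keep coming back as "same-functions" or "different", the thread is still executing, which would be consistent with a livelock; samples that are always "identical" point more toward a thread stuck at one spot. A handful of samples cannot prove either case.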