Date:      Thu, 31 Aug 2023 17:04:29 -0700
From:      Cy Schubert <Cy.Schubert@cschubert.com>
To:        Cy Schubert <Cy.Schubert@cschubert.com>
Cc:        Warner Losh <imp@bsdimp.com>, Alexander Motin <mav@freebsd.org>, Gleb Smirnoff <glebius@freebsd.org>, Drew Gallatin <gallatin@freebsd.org>, Martin Matuska <mm@freebsd.org>, src-committers <src-committers@freebsd.org>, "<dev-commits-src-all@freebsd.org>" <dev-commits-src-all@freebsd.org>, "<dev-commits-src-main@freebsd.org>" <dev-commits-src-main@freebsd.org>
Subject:   Re: git: 315ee00fa961 - main - zfs: merge openzfs/zfs@804414aad
Message-ID:  <20230901000429.B83BE1C0@slippy.cwsent.com>
In-Reply-To: <20230831234437.0EB38133@slippy.cwsent.com>
References:  <202308270509.37R596B5048298@gitrepo.freebsd.org>  <ZO_aOaf-eGiCMCKy@cell.glebi.us> <c09c92df-90f5-8c94-4125-9e33262bc686@FreeBSD.org> <07faf861-9186-47d1-992a-91d483ea4e9c@app.fastmail.com> <1db726d4-32c9-e1b8-51d6-981aa51b7825@FreeBSD.org> <20230831175350.981F1D5@slippy.cwsent.com> <a5c51f3f-8c7f-8bd5-f718-72bc33fe22ed@FreeBSD.org> <20230831223526.DCB701A1@slippy.cwsent.com> <20230831233228.9935BA8@slippy.cwsent.com> <CANCZdfqLWoQnLkKcLYLa73WOKDOAEfXB2rQX869Qaaqv6z=gKA@mail.gmail.com> <20230831234023.917DFAB@slippy.cwsent.com> <20230831234437.0EB38133@slippy.cwsent.com>

In message <20230831234437.0EB38133@slippy.cwsent.com>, Cy Schubert writes:
> In message <20230831234023.917DFAB@slippy.cwsent.com>, Cy Schubert writes:
> > In message 
> > <CANCZdfqLWoQnLkKcLYLa73WOKDOAEfXB2rQX869Qaaqv6z=gKA@mail.gmail.com>, 
> > Warner Losh writes:
> > > On Thu, Aug 31, 2023, 5:32 PM Cy Schubert <Cy.Schubert@cschubert.com> 
> > > wrote:
> > >
> > > > In message <20230831223526.DCB701A1@slippy.cwsent.com>, Cy Schubert
> > > > writes:
> > > > > In message <a5c51f3f-8c7f-8bd5-f718-72bc33fe22ed@FreeBSD.org>, 
> > > > > Alexander Motin writes:
> > > > > > On 31.08.2023 13:53, Cy Schubert wrote:
> > > > > > > One thing that circumvents my two problems is reducing 
> > > > > > > poudriere bulk jobs from 8 to 5 on my 4 core machines.
> > > > > >
> > > > > > Cy, I have no real evidence to think it is related, other than 
> > > > > > your panics look like some memory corruption, but could you try 
> > > > > > this patch: https://github.com/openzfs/zfs/pull/15228 .  If it 
> > > > > > won't do the trick, then I am out of ideas without additional 
> > > > > > input.
> > > > >
> > > > > So far so good. Poudriere has been running with a decent -J jobs 
> > > > > on both machines for over an hour. I'll let you know if they 
> > > > > survive the night. It can take some time before the panics happen 
> > > > > though.
> > > > >
> > > > > The problem is more likely to occur when there are a lot of small 
> > > > > package builds than large long running jobs, probably because of 
> > > > > the parallel ZFS dataset creations, deletions, and rollbacks.
> > > > >
> > > > > >
> > > > > > Gleb, you may try to add this too, just as a choice between 
> > > > > > impossible and improbable.
> > > > > >
> > > > > > --
> > > > > > Alexander Motin
> > > > >
> > > > >
> > > > > --
> > > > > Cheers,
> > > > > Cy Schubert <Cy.Schubert@cschubert.com>
> > > > > FreeBSD UNIX:  <cy@FreeBSD.org>   Web:  https://FreeBSD.org
> > > > > NTP:           <cy@nwtime.org>    Web:  https://nwtime.org
> > > > >
> > > > >                       e^(i*pi)+1=0
> > > > >
> > > > >
> > > >
> > > > One of the two machines is hung.
> > > >
> > > > cwfw# ping bob
> > > > PING bob (10.1.1.7): 56 data bytes
> > > > ^C
> > > > --- bob ping statistics ---
> > > > 2 packets transmitted, 0 packets received, 100.0% packet loss
> > > > cwfw# console bob
> > > > [Enter `^Ec?' for help]
> > > > [halt sent]
> > > > KDB: enter: Break to debugger
> > > > [ thread pid 31259 tid 100913 ]
> > > > Stopped at      kdb_break+0x48: movq    $0,0xa1069d(%rip)
> > > > db> bt
> > > > Tracing pid 31259 tid 100913 td 0xfffffe00c4eca000
> > > > kdb_break() at kdb_break+0x48/frame 0xfffffe00c53ef2d0
> > > > uart_intr() at uart_intr+0xf7/frame 0xfffffe00c53ef310
> > > > intr_event_handle() at intr_event_handle+0x12b/frame 0xfffffe00c53ef380
> > > > intr_execute_handlers() at intr_execute_handlers+0x63/frame
> > > > 0xfffffe00c53ef3b0
> > > > Xapic_isr1() at Xapic_isr1+0xdc/frame 0xfffffe00c53ef3b0
> > > > --- interrupt, rip = 0xffffffff806d5c70, rsp = 0xfffffe00c53ef480, 
> > > > rbp = 0xfffffe00c53ef480 ---
> > > > getbinuptime() at getbinuptime+0x30/frame 0xfffffe00c53ef480
> > > > arc_access() at arc_access+0x250/frame 0xfffffe00c53ef4d0
> > > > arc_buf_access() at arc_buf_access+0xd0/frame 0xfffffe00c53ef4f0
> > > > dbuf_hold_impl() at dbuf_hold_impl+0xf3/frame 0xfffffe00c53ef580
> > > > dbuf_hold() at dbuf_hold+0x25/frame 0xfffffe00c53ef5b0
> > > > dnode_hold_impl() at dnode_hold_impl+0x194/frame 0xfffffe00c53ef670
> > > > dmu_bonus_hold() at dmu_bonus_hold+0x20/frame 0xfffffe00c53ef6a0
> > > > zfs_zget() at zfs_zget+0x20d/frame 0xfffffe00c53ef750
> > > > zfs_dirent_lookup() at zfs_dirent_lookup+0x16d/frame 0xfffffe00c53ef7a0
> > > > zfs_dirlook() at zfs_dirlook+0x7f/frame 0xfffffe00c53ef7d0
> > > > zfs_lookup() at zfs_lookup+0x3c0/frame 0xfffffe00c53ef8a0
> > > > zfs_freebsd_cachedlookup() at zfs_freebsd_cachedlookup+0x67/frame
> > > > 0xfffffe00c53ef9e0
> > > > vfs_cache_lookup() at vfs_cache_lookup+0xa6/frame 0xfffffe00c53efa30
> > > > vfs_lookup() at vfs_lookup+0x457/frame 0xfffffe00c53efac0
> > > > namei() at namei+0x2e1/frame 0xfffffe00c53efb20
> > > > vn_open_cred() at vn_open_cred+0x505/frame 0xfffffe00c53efca0
> > > > kern_openat() at kern_openat+0x287/frame 0xfffffe00c53efdf0
> > > > ia32_syscall() at ia32_syscall+0x156/frame 0xfffffe00c53eff30
> > > > int0x80_syscall_common() at int0x80_syscall_common+0x9c/frame 0xffff89dc
> > > > db>
> > > >
> > > > I'll let it continue. Hopefully the watchdog timer will pop and we 
> > > > get a dump.
> > > >
> > >
> > >
> > > Might also be interesting to see if this moves around or is really hung
> > > getting the time. I suspect it's live lock given this traceback.
> >
> > It's moving around.
> >
> >
> > db> c
> > [halt sent]
> > KDB: enter: Break to debugger
> > [ thread pid 31259 tid 100913 ]
> > Stopped at      kdb_break+0x48: movq    $0,0xa1069d(%rip)
> > db> bt
> > Tracing pid 31259 tid 100913 td 0xfffffe00c4eca000
> > kdb_break() at kdb_break+0x48/frame 0xfffffe00c53ef410
> > uart_intr() at uart_intr+0xf7/frame 0xfffffe00c53ef450
> > intr_event_handle() at intr_event_handle+0x12b/frame 0xfffffe00c53ef4c0
> > intr_execute_handlers() at intr_execute_handlers+0x63/frame 
> > 0xfffffe00c53ef4f0
> > Xapic_isr1() at Xapic_isr1+0xdc/frame 0xfffffe00c53ef4f0
> > --- interrupt, rip = 0xffffffff807117c4, rsp = 0xfffffe00c53ef5c0, rbp = 
> > 0xfffffe00c53ef5c0 ---
> > lock_delay() at lock_delay+0x14/frame 0xfffffe00c53ef5c0
> > _sx_xlock_hard() at _sx_xlock_hard+0x1a9/frame 0xfffffe00c53ef660
> > _sx_xlock() at _sx_xlock+0xb0/frame 0xfffffe00c53ef6a0
> > zfs_zget() at zfs_zget+0x1fa/frame 0xfffffe00c53ef750
> > zfs_dirent_lookup() at zfs_dirent_lookup+0x16d/frame 0xfffffe00c53ef7a0
> > zfs_dirlook() at zfs_dirlook+0x7f/frame 0xfffffe00c53ef7d0
> > zfs_lookup() at zfs_lookup+0x3c0/frame 0xfffffe00c53ef8a0
> > zfs_freebsd_cachedlookup() at zfs_freebsd_cachedlookup+0x67/frame 
> > 0xfffffe00c53ef9e0
> > vfs_cache_lookup() at vfs_cache_lookup+0xa6/frame 0xfffffe00c53efa30
> > vfs_lookup() at vfs_lookup+0x457/frame 0xfffffe00c53efac0
> > namei() at namei+0x2e1/frame 0xfffffe00c53efb20
> > vn_open_cred() at vn_open_cred+0x505/frame 0xfffffe00c53efca0
> > kern_openat() at kern_openat+0x287/frame 0xfffffe00c53efdf0
> > ia32_syscall() at ia32_syscall+0x156/frame 0xfffffe00c53eff30
> > int0x80_syscall_common() at int0x80_syscall_common+0x9c/frame 0xffff89dc
> > db> c
>
> It seems to have settled 
I manually panicked the machine. It captured a good dump of the above 
backtrace.
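For anyone following along, the dump was captured with the usual FreeBSD 
crash-dump machinery. The commands below are a generic sketch, not the 
exact commands from this session; the dump device name is a placeholder:

```
# One-time setup: point crash dumps at a swap device (dumpdev="AUTO" in
# rc.conf does the same at boot). /dev/ada0p2 is a placeholder.
dumpon /dev/ada0p2

# From the ddb prompt on the console, force a panic; with a dump device
# configured, the kernel writes a vmcore and reboots:
#   db> panic
# On the next boot, savecore(8) moves the dump into /var/crash:
savecore /var/crash /dev/ada0p2
```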

This is the amd64 machine that was building i386 packages under poudriere. 
I purposely set -J high to stress it enough to panic the machine.

This kernel includes amotin's patch from 
https://github.com/openzfs/zfs/pull/15228. This manually triggered panic is 
different from the one it suffered two days ago.

The machine building amd64 packages is still running nicely, but it's 
building some large packages, so there's not as much stress on ZFS. The 
panicked machine was building a lot of small packages. As such it was doing 
a lot of ZFS clones, destroys, and rollbacks.
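For context, each small-package build in poudriere churns a jail dataset 
roughly like the following. The pool and dataset names here are 
hypothetical placeholders, not taken from the machines above:

```
# Snapshot a clean jail once, then per builder: clone, build, roll
# back, destroy. Names (zroot/poudriere/...) are illustrative only.
zfs snapshot zroot/poudriere/jails/13i386@clean
zfs clone zroot/poudriere/jails/13i386@clean zroot/poudriere/.m/b01
zfs snapshot zroot/poudriere/.m/b01@prepkg
# ... one package builds inside the clone ...
zfs rollback -r zroot/poudriere/.m/b01@prepkg   # reset for the next port
zfs destroy -r zroot/poudriere/.m/b01           # tear down when done
```

With many small packages and a high -J, dozens of these 
clone/rollback/destroy cycles run in parallel, which is the pattern that 
seems to trigger the problem.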

I'll take a look at the dump after dinner tonight.
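A first pass over the vmcore would typically look something like this; the 
paths are the usual defaults and may differ on this machine:

```
# Open the crash dump against the matching kernel with symbols:
kgdb /boot/kernel/kernel /var/crash/vmcore.last

# Inside kgdb:
#   (kgdb) bt              # backtrace of the panicking thread
#   (kgdb) frame 5         # inspect a specific frame
#   (kgdb) info threads    # check other threads, e.g. for livelock
```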


-- 
Cheers,
Cy Schubert <Cy.Schubert@cschubert.com>
FreeBSD UNIX:  <cy@FreeBSD.org>   Web:  https://FreeBSD.org
NTP:           <cy@nwtime.org>    Web:  https://nwtime.org

			e^(i*pi)+1=0






