Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 31 Aug 2023 17:19:59 -0700
From:      Cy Schubert <Cy.Schubert@cschubert.com>
To:        Cy Schubert <Cy.Schubert@cschubert.com>
Cc:        Warner Losh <imp@bsdimp.com>, Alexander Motin <mav@freebsd.org>, Gleb Smirnoff <glebius@freebsd.org>, Drew Gallatin <gallatin@freebsd.org>, Martin Matuska <mm@freebsd.org>, src-committers <src-committers@freebsd.org>, "<dev-commits-src-all@freebsd.org>" <dev-commits-src-all@freebsd.org>, "<dev-commits-src-main@freebsd.org>" <dev-commits-src-main@freebsd.org>
Subject:   Re: git: 315ee00fa961 - main - zfs: merge openzfs/zfs@804414aad
Message-ID:  <20230901001959.E463A1C3@slippy.cwsent.com>

next in thread | raw e-mail | index | archive | help
Cy Schubert writes:
> In message <20230831234437.0EB38133@slippy.cwsent.com>, Cy Schubert writes:
> > In message <20230831234023.917DFAB@slippy.cwsent.com>, Cy Schubert writes:
> > > In message <CANCZdfqLWoQnLkKcLYLa73WOKDOAEfXB2rQX869Qaaqv6z=gKA@mail.gmai
> l.
> > c
> > > om>
> > > , Warner Losh writes:
> > > > On Thu, Aug 31, 2023, 5:32 PM Cy Schubert <Cy.Schubert@cschubert.com> w
> ro
> > te
> > > =
> > > > :
> > > >
> > > > > In message <20230831223526.DCB701A1@slippy.cwsent.com>, Cy Schubert
> > > > > writes:
> > > > > > In message <a5c51f3f-8c7f-8bd5-f718-72bc33fe22ed@FreeBSD.org>,
> > > > > Alexander
> > > > > > Motin
> > > > > > writes:
> > > > > > > On 31.08.2023 13:53, Cy Schubert wrote:
> > > > > > > > One thing that circumvents my two problems is reducing poudrier
> e
> > > > > bulk job
> > > > > > s
> > > > > > > > from 8 to 5 on my 4 core machines.
> > > > > > >
> > > > > > > Cy, I have no real evidences to think it is related, other than y
> ou
> > r
> > > > > > > panics look like some memory corruptions, but could you try is pa
> tc
> > h:
> > > > > > > https://github.com/openzfs/zfs/pull/15228 .  If it won't do the
> > > > > trick,
> > > > > > > then I am out of ideas without additional input.
> > > > > >
> > > > > > So far so good. Poudriere has been running with a decent -J jobs on
>  b
> > ot
> > > =
> > > > h
> > > > > > machines for over an hour. I'll let you know if they survive the ni
> gh
> > t.
> > > > > It
> > > > > > can take some time before the panics happen though.
> > > > > >
> > > > > > The problem is more likely to occur when there are a lot of small
> > > > > package
> > > > > > builds than large long running jobs, probably because of the parall
> el
> > > > > ZFS
> > > > > > dataset creations, deletions, and rollbacks.
> > > > > >
> > > > > > >
> > > > > > > Gleb, you may try to add this too, just as a choice between impos
> si
> > bl
> > > =
> > > > e
> > > > > > > and improbable.
> > > > > > >
> > > > > > > --
> > > > > > > Alexander Motin
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Cheers,
> > > > > > Cy Schubert <Cy.Schubert@cschubert.com>
> > > > > > FreeBSD UNIX:  <cy@FreeBSD.org>   Web:  https://FreeBSD.org
> > > > > > NTP:           <cy@nwtime.org>    Web:  https://nwtime.org
> > > > > >
> > > > > >                       e^(i*pi)+1=3D0
> > > > > >
> > > > > >
> > > > >
> > > > > One of the two machines is hung.
> > > > >
> > > > > cwfw# ping bob
> > > > > PING bob (10.1.1.7): 56 data bytes
> > > > > ^C
> > > > > --- bob ping statistics ---
> > > > > 2 packets transmitted, 0 packets received, 100.0% packet loss
> > > > > cwfw# console bob
> > > > > [Enter `^Ec?' for help]
> > > > > [halt sent]
> > > > > KDB: enter: Break to debugger
> > > > > [ thread pid 31259 tid 100913 ]
> > > > > Stopped at      kdb_break+0x48: movq    $0,0xa1069d(%rip)
> > > > > db> bt
> > > > > Tracing pid 31259 tid 100913 td 0xfffffe00c4eca000
> > > > > kdb_break() at kdb_break+0x48/frame 0xfffffe00c53ef2d0
> > > > > uart_intr() at uart_intr+0xf7/frame 0xfffffe00c53ef310
> > > > > intr_event_handle() at intr_event_handle+0x12b/frame 0xfffffe00c53ef3
> 80
> > > > > intr_execute_handlers() at intr_execute_handlers+0x63/frame
> > > > > 0xfffffe00c53ef3b0
> > > > > Xapic_isr1() at Xapic_isr1+0xdc/frame 0xfffffe00c53ef3b0
> > > > > --- interrupt, rip =3D 0xffffffff806d5c70, rsp =3D 0xfffffe00c53ef480
> , 
> > rb
> > > =
> > > > p =3D
> > > > > 0xfffffe00c53ef480 ---
> > > > > getbinuptime() at getbinuptime+0x30/frame 0xfffffe00c53ef480
> > > > > arc_access() at arc_access+0x250/frame 0xfffffe00c53ef4d0
> > > > > arc_buf_access() at arc_buf_access+0xd0/frame 0xfffffe00c53ef4f0
> > > > > dbuf_hold_impl() at dbuf_hold_impl+0xf3/frame 0xfffffe00c53ef580
> > > > > dbuf_hold() at dbuf_hold+0x25/frame 0xfffffe00c53ef5b0
> > > > > dnode_hold_impl() at dnode_hold_impl+0x194/frame 0xfffffe00c53ef670
> > > > > dmu_bonus_hold() at dmu_bonus_hold+0x20/frame 0xfffffe00c53ef6a0
> > > > > zfs_zget() at zfs_zget+0x20d/frame 0xfffffe00c53ef750
> > > > > zfs_dirent_lookup() at zfs_dirent_lookup+0x16d/frame 0xfffffe00c53ef7
> a0
> > > > > zfs_dirlook() at zfs_dirlook+0x7f/frame 0xfffffe00c53ef7d0
> > > > > zfs_lookup() at zfs_lookup+0x3c0/frame 0xfffffe00c53ef8a0
> > > > > zfs_freebsd_cachedlookup() at zfs_freebsd_cachedlookup+0x67/frame
> > > > > 0xfffffe00c53ef9e0
> > > > > vfs_cache_lookup() at vfs_cache_lookup+0xa6/frame 0xfffffe00c53efa30
> > > > > vfs_lookup() at vfs_lookup+0x457/frame 0xfffffe00c53efac0
> > > > > namei() at namei+0x2e1/frame 0xfffffe00c53efb20
> > > > > vn_open_cred() at vn_open_cred+0x505/frame 0xfffffe00c53efca0
> > > > > kern_openat() at kern_openat+0x287/frame 0xfffffe00c53efdf0
> > > > > ia32_syscall() at ia32_syscall+0x156/frame 0xfffffe00c53eff30
> > > > > int0x80_syscall_common() at int0x80_syscall_common+0x9c/frame 0xffff8
> 9d
> > c
> > > > > db>
> > > > >
> > > > > I'll let it continue. Hopefully the watchdog timer will pop and we ge
> t 
> > a
> > > > > dump.
> > > > >
> > > >
> > > >
> > > > Might also be interesting to see if this moves around or is really hung
> > > > getting the time. I suspect it's live lock given this traceback.
> > >
> > > It's moving around.
> > >
> > >
> > > db> c
> > > [halt sent]
> > > KDB: enter: Break to debugger
> > > [ thread pid 31259 tid 100913 ]
> > > Stopped at      kdb_break+0x48: movq    $0,0xa1069d(%rip)
> > > db> bt
> > > Tracing pid 31259 tid 100913 td 0xfffffe00c4eca000
> > > kdb_break() at kdb_break+0x48/frame 0xfffffe00c53ef410
> > > uart_intr() at uart_intr+0xf7/frame 0xfffffe00c53ef450
> > > intr_event_handle() at intr_event_handle+0x12b/frame 0xfffffe00c53ef4c0
> > > intr_execute_handlers() at intr_execute_handlers+0x63/frame 
> > > 0xfffffe00c53ef4f0
> > > Xapic_isr1() at Xapic_isr1+0xdc/frame 0xfffffe00c53ef4f0
> > > --- interrupt, rip = 0xffffffff807117c4, rsp = 0xfffffe00c53ef5c0, rbp = 
> > > 0xfffffe00c53ef5c0 ---
> > > lock_delay() at lock_delay+0x14/frame 0xfffffe00c53ef5c0
> > > _sx_xlock_hard() at _sx_xlock_hard+0x1a9/frame 0xfffffe00c53ef660
> > > _sx_xlock() at _sx_xlock+0xb0/frame 0xfffffe00c53ef6a0
> > > zfs_zget() at zfs_zget+0x1fa/frame 0xfffffe00c53ef750
> > > zfs_dirent_lookup() at zfs_dirent_lookup+0x16d/frame 0xfffffe00c53ef7a0
> > > zfs_dirlook() at zfs_dirlook+0x7f/frame 0xfffffe00c53ef7d0
> > > zfs_lookup() at zfs_lookup+0x3c0/frame 0xfffffe00c53ef8a0
> > > zfs_freebsd_cachedlookup() at zfs_freebsd_cachedlookup+0x67/frame 
> > > 0xfffffe00c53ef9e0
> > > vfs_cache_lookup() at vfs_cache_lookup+0xa6/frame 0xfffffe00c53efa30
> > > vfs_lookup() at vfs_lookup+0x457/frame 0xfffffe00c53efac0
> > > namei() at namei+0x2e1/frame 0xfffffe00c53efb20
> > > vn_open_cred() at vn_open_cred+0x505/frame 0xfffffe00c53efca0
> > > kern_openat() at kern_openat+0x287/frame 0xfffffe00c53efdf0
> > > ia32_syscall() at ia32_syscall+0x156/frame 0xfffffe00c53eff30
> > > int0x80_syscall_common() at int0x80_syscall_common+0x9c/frame 0xffff89dc
> > > db> c
> >
> > It seems to have settled there for now.
>
> I manually panicked the machine. It captured a good dump of the above 
> backtrace.
>
> This is the amd64 machine that was building i386 packages under poudriere. 
> I purposely set -J high to stress it enough to panic the machine.
>
> This ker> /15228. This manually triggered panic is different from the one it suffered 
> two days ago.
>
> The machine building amd64 packages is still running nicely but it's 
> building some large packages. Because of this there's not as much stress on 
> zfs. The panicked machine was building a lot of small packages. As such it 
> was doing a lot of zfs clones, destroys, and rollbacks.

An initial look at the dump:

(kgdb) bt
#0  __curthread () at /opt/src/git-src/sys/amd64/include/pcpu_aux.h:57
#1  doadump (textdump=textdump@entry=1)
    at /opt/src/git-src/sys/kern/kern_shutdown.c:405
#2  0xffffffff806c1b30 in kern_reboot (howto=260)
    at /opt/src/git-src/sys/kern/kern_shutdown.c:526
#3  0xffffffff806c202f in vpanic (fmt=0xffffffff80b8653c "from debugger", 
    ap=ap@entry=0xfffffe00c53eef30)
    at /opt/src/git-src/sys/kern/kern_shutdown.c:970
#4  0xffffffff806c1dd3 in panic (fmt=<unavailable>)
    at /opt/src/git-src/sys/kern/kern_shutdown.c:894
#5  0xffffffff80413937 in db_panic (addr=<optimized out>, 
    have_addr=<optimized out>, count=<optimized out>, modif=<optimized out>)
    at /opt/src/git-src/sys/ddb/db_command.c:531
#6  0xffffffff80412ddd in db_command (last_cmdp=<optimized out>, 
    cmd_table=<optimized out>, dopager=true)
    at /opt/src/git-src/sys/ddb/db_command.c:504
#7  0xffffffff80412a9d in db_command_loop ()
    at /opt/src/git-src/sys/ddb/db_command.c:551
#8  0xffffffff80416176 in db_trap (type=<optimized out>, code=<optimized 
out>)
    at /opt/src/git-src/sys/ddb/db_main.c:268
#9  0xffffffff80710993 in kdb_trap (type=type@entry=3, code=code@entry=0, 
    tf=tf@entry=0xfffffe00c53ef340) at /opt/src/git-src/sys/kern/subr_kdb.c:
790
#10 0xffffffff80a72cb9 in trap (frame=0xfffffe00c53ef340)
    at /opt/src/git-src/sys/amd64/amd64/trap.c:608
#11 <signal handler called>
#12 kdb_enter (why=<optimized out>, msg=<optimized out>)
    at /opt/src/git-src/sys/kern/subr_kdb.c:556
#13 kdb_break () at /opt/src/git-src/sys/kern/subr_kdb.c:346
#14 0xffffffff804dc7f7 in uart_intr_break (arg=0xfffff80006b84400)
    at /opt/src/git-src/sys/dev/uart/uart_core.c:298
#15 uart_intr (arg=0xfffff80006b84400)
    at /opt/src/git-src/sys/dev/uart/uart_core.c:439
#16 0xffffffff8067ae8b in intr_event_handle (ie=ie@entry=0xfffff80005814c00,

    frame=frame@entry=0xfffffe00c53ef500)
    at /opt/src/git-src/sys/kern/kern_intr.c:1410
#17 0xffffffff80a37653 in intr_execute_handlers (isrc=0xfffff8000580c9c8, 
--Type <RET> for more, q to quit, c to continue without paging--c
    frame=0xfffffe00c53ef500)
    at /opt/src/git-src/sys/x86/x86/intr_machdep.c:354
#18 <signal handler called>
#19 lock_delay (la=la@entry=0xfffffe00c53ef600)
    at /opt/src/git-src/sys/kern/subr_lock.c:124
#20 0xffffffff806ccfb9 in _sx_xlock_hard (sx=sx@entry=0xfffff80123712340, 
    x=<optimized out>, opts=opts@entry=0, 
    file=file@entry=0xffffffff81910597 "/opt/src/git-src/sys/contrib/openzfs
/module/os/freebsd/zfs/zfs_znode.c", line=line@entry=949)
    at /opt/src/git-src/sys/kern/kern_sx.c:693
#21 0xffffffff806ccdb0 in _sx_xlock (sx=sx@entry=0xfffff80123712340, 
    opts=opts@entry=0, 
    file=0xffffffff81910597 "/opt/src/git-src/sys/contrib/openzfs/module/os/
freebsd/zfs/zfs_znode.c", line=line@entry=949)
    at /opt/src/git-src/sys/kern/kern_sx.c:330
#22 0xffffffff816b4ffa in zfs_zget (zfsvfs=zfsvfs@entry=0xfffff80123712000, 
    obj_num=80130, zpp=zpp@entry=0xfffffe00c53ef760)
    at /opt/src/git-src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_znode.
c:949
#23 0xffffffff816a018d in zfs_dirent_lookup (dzp=dzp@entry=0xfffff80131e6dae
0,
    name=0xfffffe00c53ef8c0 "xlocale", zpp=zpp@entry=0xfffffe00c53ef7b0, 
    flag=flag@entry=2)
    at /opt/src/git-src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_dir.c:
190
#24 0xffffffff816a028f in zfs_dirlook (dzp=dzp@entry=0xfffff80131e6dae0, 
    name=0xfffffe00c4bbdc80 "", name@entry=0xfffffe00c53ef8c0 "xlocale", 
    zpp=zpp@entry=0xfffffe00c53ef860)
    at /opt/src/git-src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_dir.c:
246
#25 0xffffffff816b0d20 in zfs_lookup (dvp=0xfffff800c2da9380, 
    nm=nm@entry=0xfffffe00c53ef8c0 "xlocale", vpp=0xfffffe00c53efd10, 
    cnp=cnp@entry=0xfffffe00c53efd38, nameiop=0, cr=<optimized out>, 
flags=0,
    cached=1)
    at /opt/src/git-src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_
os.c:923
#26 0xffffffff816ab907 in zfs_freebsd_lookup (ap=0xfffffe00c53ef9f0, 
cached=1)
    at /opt/src/git-src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_
os.c:4558
#27 zfs_freebsd_cachedlookup (ap=0xfffffe00c53ef9f0)
    at /opt/src/git-src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_
os.c:4566
#28 0xffffffff80797d86 in VOP_CACHEDLOOKUP (dvp=0xfffff800c2da9380, 
    vpp=0xfffffe00c53efd10, cnp=0xfffffe00c53efd38) at ./vnode_if.h:101
#29 vfs_cache_lookup (ap=<optimized out>)
    at /opt/src/git-src/sys/kern/vfs_cache.c:3068
#30 0xffffffff807a9487 in VOP_LOOKUP (dvp=0xfffff800c2da9380, 
    vpp=0xfffffe00c53efd10, cnp=0xfffffe00c53efd38) at ./vnode_if.h:67
#31 vfs_lookup (ndp=ndp@entry=0xfffffe00c53efcb8)
    at /opt/src/git-src/sys/kern/vfs_lookup.c:1252
#32 0xffffffff807a8531 in namei (ndp=ndp@entry=0xfffffe00c53efcb8)
    at /opt/src/git-src/sys/kern/vfs_lookup.c:684
#33 0xffffffff807d0095 in vn_open_cred (ndp=ndp@entry=0xfffffe00c53efcb8, 
    flagp=flagp@entry=0xfffffe00c53efdc4, cmode=cmode@entry=0, 
    vn_open_flags=vn_open_flags@entry=16, cred=0xfffff801ea492800, 
    fp=0xfffff801b9a54690) at /opt/src/git-src/sys/kern/vfs_vnops.c:331
#34 0xffffffff807c61f7 in kern_openat (td=0xfffffe00c4eca000, fd=-100, 
    path=0xffff8a24 <error: Cannot access memory at address 0xffff8a24>, 
    pathseg=UIO_USERSPACE, flags=1048577, mode=<optimized out>)
    at /opt/src/git-src/sys/kern/vfs_syscalls.c:1165
#35 0xffffffff80a993a6 in syscallenter (td=0xfffffe00c4eca000)
    at /opt/src/git-src/sys/amd64/ia32/../../kern/subr_syscall.c:187
#36 ia32_syscall (frame=0xfffffe00c53eff40)
    at /opt/src/git-src/sys/amd64/ia32/ia32_syscall.c:222
#37 0xffffffff80a47f8f in int0x80_syscall_common ()
    at /opt/src/git-src/sys/amd64/ia32/ia32_exception.S:75
#38 0x0000000026a12000 in ?? ()
#39 0x0000000000000000 in ?? ()
(kgdb) 

>
> I'll take a look at the dump after dinner tonight.


-- 
Cheers,
Cy Schubert <Cy.Schubert@cschubert.com>
FreeBSD UNIX:  <cy@FreeBSD.org>   Web:  https://FreeBSD.org
NTP:           <cy@nwtime.org>    Web:  https://nwtime.org

			e^(i*pi)+1=0





Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20230901001959.E463A1C3>