Date:      Thu, 31 Aug 2023 19:46:08 -0700
From:      Cy Schubert <Cy.Schubert@cschubert.com>
To:        Cy Schubert <Cy.Schubert@cschubert.com>
Cc:        Warner Losh <imp@bsdimp.com>, Alexander Motin <mav@freebsd.org>, Gleb Smirnoff <glebius@freebsd.org>, Drew Gallatin <gallatin@freebsd.org>, Martin Matuska <mm@freebsd.org>, src-committers <src-committers@freebsd.org>, "<dev-commits-src-all@freebsd.org>" <dev-commits-src-all@freebsd.org>, "<dev-commits-src-main@freebsd.org>" <dev-commits-src-main@freebsd.org>
Subject:   Re: git: 315ee00fa961 - main - zfs: merge openzfs/zfs@804414aad
Message-ID:  <20230831194608.2f20d74f@slippy>

Resending this again. The email that was returned from the ML back to
me was corrupted. I'm not sure whether exmh was at fault, or zfs (or
postfix) on my laptop, or zfs (or postfix) on my gateway. The last time
we had significant ZFS issues my emails were also truncated or
corrupted. It's likely something on the laptop, as one email in my note
log contained a few bytes of binary zeros.

It's just interesting that I'm seeing email corruption similar to last
time.

Hopefully the full email makes it to the ML so people can view the full
backtrace. (I'm also using a different MH-based MUA, in case that makes
a difference.)

-- 
Cheers,
Cy Schubert <Cy.Schubert@cschubert.com>
FreeBSD UNIX:  <cy@FreeBSD.org>   Web:  https://FreeBSD.org
NTP:           <cy@nwtime.org>    Web:  https://nwtime.org

			e^(i*pi)+1=0


On Thu, 31 Aug 2023 17:19:59 -0700
Cy Schubert <Cy.Schubert@cschubert.com> wrote:

> Cy Schubert writes:
> > In message <20230831234437.0EB38133@slippy.cwsent.com>, Cy Schubert writes:  
> > > In message <20230831234023.917DFAB@slippy.cwsent.com>, Cy Schubert writes:  
> > > > In message <CANCZdfqLWoQnLkKcLYLa73WOKDOAEfXB2rQX869Qaaqv6z=gKA@mail.gmail.com>, Warner Losh writes:  
> > > > > On Thu, Aug 31, 2023, 5:32 PM Cy Schubert <Cy.Schubert@cschubert.com> wrote:
> > > > >  
> > > > > > In message <20230831223526.DCB701A1@slippy.cwsent.com>, Cy Schubert
> > > > > > writes:  
> > > > > > > In message <a5c51f3f-8c7f-8bd5-f718-72bc33fe22ed@FreeBSD.org>, Alexander
> > > > > > > Motin writes:  
> > > > > > > > On 31.08.2023 13:53, Cy Schubert wrote:  
> > > > > > > > > One thing that circumvents my two problems is reducing poudriere
> > > > > > > > > bulk jobs from 8 to 5 on my 4 core machines.  
> > > > > > > >
> > > > > > > > Cy, I have no real evidence to think it is related, other than your
> > > > > > > > panics look like some memory corruption, but could you try this patch:
> > > > > > > > https://github.com/openzfs/zfs/pull/15228 .  If it won't do the trick,
> > > > > > > > then I am out of ideas without additional input.  
> > > > > > >
> > > > > > > So far so good. Poudriere has been running with a decent -J job count
> > > > > > > on both machines for over an hour. I'll let you know if they survive the
> > > > > > > night. It can take some time before the panics happen though.
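> > > > > > >
> > > > > > > (For anyone else who wants to try the same change: something along these
> > > > > > > lines should be all that's needed, though the local paths here are only
> > > > > > > examples from my tree:
> > > > > > >
> > > > > > >     fetch -o /tmp/15228.diff https://github.com/openzfs/zfs/pull/15228.diff
> > > > > > >     cd /opt/src/git-src/sys/contrib/openzfs && git apply /tmp/15228.diff
> > > > > > >
> > > > > > > then rebuild and reinstall the kernel as usual.)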
> > > > > > >
> > > > > > > The problem is more likely to occur when there are a lot of small package
> > > > > > > builds than large long-running jobs, probably because of the parallel ZFS
> > > > > > > dataset creations, deletions, and rollbacks.
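> > > > > > >
> > > > > > > (For context, the bulk runs here are invocations along these lines, where
> > > > > > > the jail and ports tree names are just placeholders for my setup:
> > > > > > >
> > > > > > >     poudriere bulk -j 13i386 -p default -J 5 -a
> > > > > > >
> > > > > > > with -J setting the number of parallel builders.)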
> > > > > > >  
> > > > > > > >
> > > > > > > > Gleb, you may try to add this too, just as a choice between impossible
> > > > > > > > and improbable.
> > > > > > > >
> > > > > > > > --
> > > > > > > > Alexander Motin  
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Cheers,
> > > > > > > Cy Schubert <Cy.Schubert@cschubert.com>
> > > > > > > FreeBSD UNIX:  <cy@FreeBSD.org>   Web:  https://FreeBSD.org
> > > > > > > NTP:           <cy@nwtime.org>    Web:  https://nwtime.org
> > > > > > >
> > > > > > >                       e^(i*pi)+1=0
> > > > > > >
> > > > > > >  
> > > > > >
> > > > > > One of the two machines is hung.
> > > > > >
> > > > > > cwfw# ping bob
> > > > > > PING bob (10.1.1.7): 56 data bytes
> > > > > > ^C
> > > > > > --- bob ping statistics ---
> > > > > > 2 packets transmitted, 0 packets received, 100.0% packet loss
> > > > > > cwfw# console bob
> > > > > > [Enter `^Ec?' for help]
> > > > > > [halt sent]
> > > > > > KDB: enter: Break to debugger
> > > > > > [ thread pid 31259 tid 100913 ]
> > > > > > Stopped at      kdb_break+0x48: movq    $0,0xa1069d(%rip)  
> > > > > > db> bt  
> > > > > > Tracing pid 31259 tid 100913 td 0xfffffe00c4eca000
> > > > > > kdb_break() at kdb_break+0x48/frame 0xfffffe00c53ef2d0
> > > > > > uart_intr() at uart_intr+0xf7/frame 0xfffffe00c53ef310
> > > > > > intr_event_handle() at intr_event_handle+0x12b/frame 0xfffffe00c53ef380
> > > > > > intr_execute_handlers() at intr_execute_handlers+0x63/frame
> > > > > > 0xfffffe00c53ef3b0
> > > > > > Xapic_isr1() at Xapic_isr1+0xdc/frame 0xfffffe00c53ef3b0
> > > > > > --- interrupt, rip = 0xffffffff806d5c70, rsp = 0xfffffe00c53ef480, rbp =
> > > > > > 0xfffffe00c53ef480 ---
> > > > > > getbinuptime() at getbinuptime+0x30/frame 0xfffffe00c53ef480
> > > > > > arc_access() at arc_access+0x250/frame 0xfffffe00c53ef4d0
> > > > > > arc_buf_access() at arc_buf_access+0xd0/frame 0xfffffe00c53ef4f0
> > > > > > dbuf_hold_impl() at dbuf_hold_impl+0xf3/frame 0xfffffe00c53ef580
> > > > > > dbuf_hold() at dbuf_hold+0x25/frame 0xfffffe00c53ef5b0
> > > > > > dnode_hold_impl() at dnode_hold_impl+0x194/frame 0xfffffe00c53ef670
> > > > > > dmu_bonus_hold() at dmu_bonus_hold+0x20/frame 0xfffffe00c53ef6a0
> > > > > > zfs_zget() at zfs_zget+0x20d/frame 0xfffffe00c53ef750
> > > > > > zfs_dirent_lookup() at zfs_dirent_lookup+0x16d/frame 0xfffffe00c53ef7a0
> > > > > > zfs_dirlook() at zfs_dirlook+0x7f/frame 0xfffffe00c53ef7d0
> > > > > > zfs_lookup() at zfs_lookup+0x3c0/frame 0xfffffe00c53ef8a0
> > > > > > zfs_freebsd_cachedlookup() at zfs_freebsd_cachedlookup+0x67/frame
> > > > > > 0xfffffe00c53ef9e0
> > > > > > vfs_cache_lookup() at vfs_cache_lookup+0xa6/frame 0xfffffe00c53efa30
> > > > > > vfs_lookup() at vfs_lookup+0x457/frame 0xfffffe00c53efac0
> > > > > > namei() at namei+0x2e1/frame 0xfffffe00c53efb20
> > > > > > vn_open_cred() at vn_open_cred+0x505/frame 0xfffffe00c53efca0
> > > > > > kern_openat() at kern_openat+0x287/frame 0xfffffe00c53efdf0
> > > > > > ia32_syscall() at ia32_syscall+0x156/frame 0xfffffe00c53eff30
> > > > > > int0x80_syscall_common() at int0x80_syscall_common+0x9c/frame 0xffff89dc
> > > > > > db>  
> > > > > >
> > > > > > I'll let it continue. Hopefully the watchdog timer will pop and we get a
> > > > > > dump.
> > > > > >  
> > > > >
> > > > >
> > > > > Might also be interesting to see if this moves around or is really hung
> > > > > getting the time. I suspect it's live lock given this traceback.  
> > > >
> > > > It's moving around.
> > > >
> > > >  
> > > > db> c  
> > > > [halt sent]
> > > > KDB: enter: Break to debugger
> > > > [ thread pid 31259 tid 100913 ]
> > > > Stopped at      kdb_break+0x48: movq    $0,0xa1069d(%rip)  
> > > > db> bt  
> > > > Tracing pid 31259 tid 100913 td 0xfffffe00c4eca000
> > > > kdb_break() at kdb_break+0x48/frame 0xfffffe00c53ef410
> > > > uart_intr() at uart_intr+0xf7/frame 0xfffffe00c53ef450
> > > > intr_event_handle() at intr_event_handle+0x12b/frame 0xfffffe00c53ef4c0
> > > > intr_execute_handlers() at intr_execute_handlers+0x63/frame 
> > > > 0xfffffe00c53ef4f0
> > > > Xapic_isr1() at Xapic_isr1+0xdc/frame 0xfffffe00c53ef4f0
> > > > --- interrupt, rip = 0xffffffff807117c4, rsp = 0xfffffe00c53ef5c0, rbp = 
> > > > 0xfffffe00c53ef5c0 ---
> > > > lock_delay() at lock_delay+0x14/frame 0xfffffe00c53ef5c0
> > > > _sx_xlock_hard() at _sx_xlock_hard+0x1a9/frame 0xfffffe00c53ef660
> > > > _sx_xlock() at _sx_xlock+0xb0/frame 0xfffffe00c53ef6a0
> > > > zfs_zget() at zfs_zget+0x1fa/frame 0xfffffe00c53ef750
> > > > zfs_dirent_lookup() at zfs_dirent_lookup+0x16d/frame 0xfffffe00c53ef7a0
> > > > zfs_dirlook() at zfs_dirlook+0x7f/frame 0xfffffe00c53ef7d0
> > > > zfs_lookup() at zfs_lookup+0x3c0/frame 0xfffffe00c53ef8a0
> > > > zfs_freebsd_cachedlookup() at zfs_freebsd_cachedlookup+0x67/frame 
> > > > 0xfffffe00c53ef9e0
> > > > vfs_cache_lookup() at vfs_cache_lookup+0xa6/frame 0xfffffe00c53efa30
> > > > vfs_lookup() at vfs_lookup+0x457/frame 0xfffffe00c53efac0
> > > > namei() at namei+0x2e1/frame 0xfffffe00c53efb20
> > > > vn_open_cred() at vn_open_cred+0x505/frame 0xfffffe00c53efca0
> > > > kern_openat() at kern_openat+0x287/frame 0xfffffe00c53efdf0
> > > > ia32_syscall() at ia32_syscall+0x156/frame 0xfffffe00c53eff30
> > > > int0x80_syscall_common() at int0x80_syscall_common+0x9c/frame 0xffff89dc  
> > > > db> c  
> > >
> > > It seems to have settled there for now.  
> >
> > I manually panicked the machine. It captured a good dump of the above 
> > backtrace.
> >
> > This is the amd64 machine that was building i386 packages under poudriere. 
> > I purposely set -J high to stress it enough to panic the machine.
> >
> > This kernel has the patch from https://github.com/openzfs/zfs/pull/15228 
> > applied. This manually triggered panic is different from the one it suffered 
> > two days ago.
> >
> > The machine building amd64 packages is still running nicely but it's 
> > building some large packages. Because of this there's not as much stress on 
> > zfs. The panicked machine was building a lot of small packages. As such it 
> > was doing a lot of zfs clones, destroys, and rollbacks.  
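> > 
> > (Roughly speaking, each poudriere builder jail is a clone of a snapshot of a
> > reference dataset that gets rolled back or destroyed between packages, i.e.
> > operations of the form below, with made-up dataset and snapshot names:
> > 
> >     zfs clone <pool>/poudriere/jails/13i386@clean <pool>/poudriere/<builder>
> >     zfs rollback <pool>/poudriere/<builder>@prepkg
> >     zfs destroy -r <pool>/poudriere/<builder>
> > 
> > so a high -J means many of these run concurrently.)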
> 
> An initial look at the dump:
> 
> (kgdb) bt
> #0  __curthread () at /opt/src/git-src/sys/amd64/include/pcpu_aux.h:57
> #1  doadump (textdump=textdump@entry=1)
>     at /opt/src/git-src/sys/kern/kern_shutdown.c:405
> #2  0xffffffff806c1b30 in kern_reboot (howto=260)
>     at /opt/src/git-src/sys/kern/kern_shutdown.c:526
> #3  0xffffffff806c202f in vpanic (fmt=0xffffffff80b8653c "from debugger", 
>     ap=ap@entry=0xfffffe00c53eef30)
>     at /opt/src/git-src/sys/kern/kern_shutdown.c:970
> #4  0xffffffff806c1dd3 in panic (fmt=<unavailable>)
>     at /opt/src/git-src/sys/kern/kern_shutdown.c:894
> #5  0xffffffff80413937 in db_panic (addr=<optimized out>, 
>     have_addr=<optimized out>, count=<optimized out>, modif=<optimized out>)
>     at /opt/src/git-src/sys/ddb/db_command.c:531
> #6  0xffffffff80412ddd in db_command (last_cmdp=<optimized out>, 
>     cmd_table=<optimized out>, dopager=true)
>     at /opt/src/git-src/sys/ddb/db_command.c:504
> #7  0xffffffff80412a9d in db_command_loop ()
>     at /opt/src/git-src/sys/ddb/db_command.c:551
> #8  0xffffffff80416176 in db_trap (type=<optimized out>, code=<optimized out>)
>     at /opt/src/git-src/sys/ddb/db_main.c:268
> #9  0xffffffff80710993 in kdb_trap (type=type@entry=3, code=code@entry=0, 
>     tf=tf@entry=0xfffffe00c53ef340) at /opt/src/git-src/sys/kern/subr_kdb.c:790
> #10 0xffffffff80a72cb9 in trap (frame=0xfffffe00c53ef340)
>     at /opt/src/git-src/sys/amd64/amd64/trap.c:608
> #11 <signal handler called>
> #12 kdb_enter (why=<optimized out>, msg=<optimized out>)
>     at /opt/src/git-src/sys/kern/subr_kdb.c:556
> #13 kdb_break () at /opt/src/git-src/sys/kern/subr_kdb.c:346
> #14 0xffffffff804dc7f7 in uart_intr_break (arg=0xfffff80006b84400)
>     at /opt/src/git-src/sys/dev/uart/uart_core.c:298
> #15 uart_intr (arg=0xfffff80006b84400)
>     at /opt/src/git-src/sys/dev/uart/uart_core.c:439
> #16 0xffffffff8067ae8b in intr_event_handle (ie=ie@entry=0xfffff80005814c00,
>     frame=frame@entry=0xfffffe00c53ef500)
>     at /opt/src/git-src/sys/kern/kern_intr.c:1410
> #17 0xffffffff80a37653 in intr_execute_handlers (isrc=0xfffff8000580c9c8, 
>     frame=0xfffffe00c53ef500)
>     at /opt/src/git-src/sys/x86/x86/intr_machdep.c:354
> #18 <signal handler called>
> #19 lock_delay (la=la@entry=0xfffffe00c53ef600)
>     at /opt/src/git-src/sys/kern/subr_lock.c:124
> #20 0xffffffff806ccfb9 in _sx_xlock_hard (sx=sx@entry=0xfffff80123712340, 
>     x=<optimized out>, opts=opts@entry=0, 
>     file=file@entry=0xffffffff81910597 "/opt/src/git-src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_znode.c", line=line@entry=949)
>     at /opt/src/git-src/sys/kern/kern_sx.c:693
> #21 0xffffffff806ccdb0 in _sx_xlock (sx=sx@entry=0xfffff80123712340, 
>     opts=opts@entry=0, 
>     file=0xffffffff81910597 "/opt/src/git-src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_znode.c", line=line@entry=949)
>     at /opt/src/git-src/sys/kern/kern_sx.c:330
> #22 0xffffffff816b4ffa in zfs_zget (zfsvfs=zfsvfs@entry=0xfffff80123712000, 
>     obj_num=80130, zpp=zpp@entry=0xfffffe00c53ef760)
>     at /opt/src/git-src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_znode.c:949
> #23 0xffffffff816a018d in zfs_dirent_lookup (dzp=dzp@entry=0xfffff80131e6dae0,
>     name=0xfffffe00c53ef8c0 "xlocale", zpp=zpp@entry=0xfffffe00c53ef7b0, 
>     flag=flag@entry=2)
>     at /opt/src/git-src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_dir.c:190
> #24 0xffffffff816a028f in zfs_dirlook (dzp=dzp@entry=0xfffff80131e6dae0, 
>     name=0xfffffe00c4bbdc80 "", name@entry=0xfffffe00c53ef8c0 "xlocale", 
>     zpp=zpp@entry=0xfffffe00c53ef860)
>     at /opt/src/git-src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_dir.c:246
> #25 0xffffffff816b0d20 in zfs_lookup (dvp=0xfffff800c2da9380, 
>     nm=nm@entry=0xfffffe00c53ef8c0 "xlocale", vpp=0xfffffe00c53efd10, 
>     cnp=cnp@entry=0xfffffe00c53efd38, nameiop=0, cr=<optimized out>, flags=0,
>     cached=1)
>     at /opt/src/git-src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c:923
> #26 0xffffffff816ab907 in zfs_freebsd_lookup (ap=0xfffffe00c53ef9f0, cached=1)
>     at /opt/src/git-src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c:4558
> #27 zfs_freebsd_cachedlookup (ap=0xfffffe00c53ef9f0)
>     at /opt/src/git-src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c:4566
> #28 0xffffffff80797d86 in VOP_CACHEDLOOKUP (dvp=0xfffff800c2da9380, 
>     vpp=0xfffffe00c53efd10, cnp=0xfffffe00c53efd38) at ./vnode_if.h:101
> #29 vfs_cache_lookup (ap=<optimized out>)
>     at /opt/src/git-src/sys/kern/vfs_cache.c:3068
> #30 0xffffffff807a9487 in VOP_LOOKUP (dvp=0xfffff800c2da9380, 
>     vpp=0xfffffe00c53efd10, cnp=0xfffffe00c53efd38) at ./vnode_if.h:67
> #31 vfs_lookup (ndp=ndp@entry=0xfffffe00c53efcb8)
>     at /opt/src/git-src/sys/kern/vfs_lookup.c:1252
> #32 0xffffffff807a8531 in namei (ndp=ndp@entry=0xfffffe00c53efcb8)
>     at /opt/src/git-src/sys/kern/vfs_lookup.c:684
> #33 0xffffffff807d0095 in vn_open_cred (ndp=ndp@entry=0xfffffe00c53efcb8, 
>     flagp=flagp@entry=0xfffffe00c53efdc4, cmode=cmode@entry=0, 
>     vn_open_flags=vn_open_flags@entry=16, cred=0xfffff801ea492800, 
>     fp=0xfffff801b9a54690) at /opt/src/git-src/sys/kern/vfs_vnops.c:331
> #34 0xffffffff807c61f7 in kern_openat (td=0xfffffe00c4eca000, fd=-100, 
>     path=0xffff8a24 <error: Cannot access memory at address 0xffff8a24>, 
>     pathseg=UIO_USERSPACE, flags=1048577, mode=<optimized out>)
>     at /opt/src/git-src/sys/kern/vfs_syscalls.c:1165
> #35 0xffffffff80a993a6 in syscallenter (td=0xfffffe00c4eca000)
>     at /opt/src/git-src/sys/amd64/ia32/../../kern/subr_syscall.c:187
> #36 ia32_syscall (frame=0xfffffe00c53eff40)
>     at /opt/src/git-src/sys/amd64/ia32/ia32_syscall.c:222
> #37 0xffffffff80a47f8f in int0x80_syscall_common ()
>     at /opt/src/git-src/sys/amd64/ia32/ia32_exception.S:75
> #38 0x0000000026a12000 in ?? ()
> #39 0x0000000000000000 in ?? ()
> (kgdb) 
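> 
> (For anyone wanting to poke at it too, the dump can be opened with the kgdb
> that comes with the gdb package, roughly:
> 
>     kgdb /usr/lib/debug/boot/kernel/kernel.debug /var/crash/vmcore.1
> 
> adjusting the kernel debug file and vmcore number for the machine.)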
> 
> >
> > I'll take a look at the dump after dinner tonight.  
> 
> 



