From nobody Fri Sep 1 00:04:29 2023 X-Original-To: dev-commits-src-all@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4RcJDc3kCJz4rnKh; Fri, 1 Sep 2023 00:04:32 +0000 (UTC) (envelope-from cy.schubert@cschubert.com) Received: from omta002.cacentral1.a.cloudfilter.net (omta002.cacentral1.a.cloudfilter.net [3.97.99.33]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "Client", Issuer "CA" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 4RcJDc2btVz3bCW; Fri, 1 Sep 2023 00:04:32 +0000 (UTC) (envelope-from cy.schubert@cschubert.com) Authentication-Results: mx1.freebsd.org; none Received: from shw-obgw-4002a.ext.cloudfilter.net ([10.228.9.250]) by cmsmtp with ESMTP id biTCqEv6b6NwhbreJqBzMN; Fri, 01 Sep 2023 00:04:31 +0000 Received: from spqr.komquats.com ([70.66.152.170]) by cmsmtp with ESMTPA id breIqzPupyAOebreIqV1Et; Fri, 01 Sep 2023 00:04:31 +0000 X-Authority-Analysis: v=2.4 cv=e5oV9Il/ c=1 sm=1 tr=0 ts=64f12a8f a=y8EK/9tc/U6QY+pUhnbtgQ==:117 a=y8EK/9tc/U6QY+pUhnbtgQ==:17 a=kj9zAlcOel0A:10 a=UttIx32zK-AA:10 a=VxmjJ2MpAAAA:8 a=YxBL1-UpAAAA:8 a=6I5d2MoRAAAA:8 a=NEAV23lmAAAA:8 a=EkcXrb_YAAAA:8 a=MvHaf23w45GdYnbxqqAA:9 a=LhhIYMOyE9AJFHmb:21 a=CjuIK1q_8ugA:10 a=2tJrspVSXVUA:10 a=gvFz51Dc2IYA:10 a=7gXAzLPJhVmCkEl4_tsf:22 a=Ia-lj3WSrqcvXOmTRaiG:22 a=IjZwj45LgO3ly-622nXo:22 a=LK5xJRSDVpKd5WXXoEvA:22 Received: from slippy.cwsent.com (slippy [10.1.1.91]) by spqr.komquats.com (Postfix) with ESMTP id C2B753B5; Thu, 31 Aug 2023 17:04:29 -0700 (PDT) Received: by slippy.cwsent.com (Postfix, from userid 1000) id B83BE1C0; Thu, 31 Aug 2023 17:04:29 -0700 (PDT) X-Mailer: exmh version 2.9.0 11/07/2018 with nmh-1.8+dev Reply-to: Cy Schubert From: Cy Schubert X-os: FreeBSD X-Sender: cy@cwsent.com X-URL: http://www.cschubert.com/ To: Cy Schubert cc: Warner Losh , Alexander Motin , Gleb Smirnoff , Drew Gallatin , Martin Matuska , src-committers , "" , "" Subject: Re: git: 315ee00fa961 - main - zfs: merge openzfs/zfs@804414aad In-reply-to: <20230831234437.0EB38133@slippy.cwsent.com> References: <202308270509.37R596B5048298@gitrepo.freebsd.org> <07faf861-9186-47d1-992a-91d483ea4e9c@app.fastmail.com> <1db726d4-32c9-e1b8-51d6-981aa51b7825@FreeBSD.org> <20230831175350.981F1D5@slippy.cwsent.com> <20230831223526.DCB701A1@slippy.cwsent.com> <20230831233228.9935BA8@slippy.cwsent.com> <20230831234023.917DFAB@slippy.cwsent.com> <20230831234437.0EB38133@slippy.cwsent.com> Comments: In-reply-to Cy Schubert message dated "Thu, 31 Aug 2023 16:44:36 -0700." List-Id: Commit messages for all branches of the src repository List-Archive: https://lists.freebsd.org/archives/dev-commits-src-all List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-dev-commits-src-all@freebsd.org X-BeenThere: dev-commits-src-all@freebsd.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Date: Thu, 31 Aug 2023 17:04:29 -0700 Message-Id: <20230901000429.B83BE1C0@slippy.cwsent.com> X-CMAE-Envelope: MS4xfN6lseVqb6SFlykTn2PvKGeqEDaYjt9O58QUuOodYxV5zHpXU3zR2/bXiPse1Xo8WxkTKO1T+UbGRtJJYWCF8qG2OsO66CgvSns0X0+NO0I//YzVBL0W QnZ8y5RI+/vbItrmw8q9PAC+yfco5ATuYEom+KcV2q1hdorAOKoiEYYbrk7mNd9/SlQaNr+hDog9/ezyKpJkzFpryIqH6WXm6SeKdxMCY7bi7uXaEqa3RIWf +OKYP34Hx48Ln66igDZHkTCFjqSJn7EKk9/Jt67JLNYSVHvg882JjvkSI/FooPjXkBwlNd0Xf4u1GqtVYnmeDPdQ03IwWsqWDa942p774rURI+PelaGU+szw jEGLk2mMeR48iafUQFewoL/5J7qDrxshnF5eCSvo4NezAP1Dbn65uYWLYtfTrWh4S+7hQzCHy35ipuCp7p+LqewOFUhqyw== X-Spamd-Bar: ---- X-Rspamd-Pre-Result: action=no action; module=replies; Message is reply to one we originated X-Spamd-Result: default: False [-4.00 / 15.00]; REPLY(-4.00)[]; ASN(0.00)[asn:16509, ipnet:3.96.0.0/15, country:US] X-Rspamd-Queue-Id: 4RcJDc2btVz3bCW In message <20230831234437.0EB38133@slippy.cwsent.com>, Cy Schubert writes: > In message <20230831234023.917DFAB@slippy.cwsent.com>, Cy Schubert writes: > > In message c > > om> > > , Warner Losh writes: > > > On Thu, Aug 31, 2023, 5:32 PM Cy Schubert wro > te > > = > > > : > > > > > > > In message <20230831223526.DCB701A1@slippy.cwsent.com>, Cy Schubert > > > > writes: > > > > > In message , > > > > Alexander > > > > > Motin > > > > > writes: > > > > > > On 31.08.2023 13:53, Cy Schubert wrote: > > > > > > > One thing that circumvents my two problems is reducing poudriere > > > > bulk job > > > > > s > > > > > > > from 8 to 5 on my 4 core machines. > > > > > > > > > > > > Cy, I have no real evidences to think it is related, other than you > r > > > > > > panics look like some memory corruptions, but could you try is patc > h: > > > > > > https://github.com/openzfs/zfs/pull/15228 . If it won't do the > > > > trick, > > > > > > then I am out of ideas without additional input. > > > > > > > > > > So far so good. Poudriere has been running with a decent -J jobs on b > ot > > = > > > h > > > > > machines for over an hour. I'll let you know if they survive the nigh > t. > > > > It > > > > > can take some time before the panics happen though. > > > > > > > > > > The problem is more likely to occur when there are a lot of small > > > > package > > > > > builds than large long running jobs, probably because of the parallel > > > > ZFS > > > > > dataset creations, deletions, and rollbacks. > > > > > > > > > > > > > > > > > Gleb, you may try to add this too, just as a choice between impossi > bl > > = > > > e > > > > > > and improbable. > > > > > > > > > > > > -- > > > > > > Alexander Motin > > > > > > > > > > > > > > > -- > > > > > Cheers, > > > > > Cy Schubert > > > > > FreeBSD UNIX: Web: https://FreeBSD.org > > > > > NTP: Web: https://nwtime.org > > > > > > > > > > e^(i*pi)+1=3D0 > > > > > > > > > > > > > > > > > > One of the two machines is hung. > > > > > > > > cwfw# ping bob > > > > PING bob (10.1.1.7): 56 data bytes > > > > ^C > > > > --- bob ping statistics --- > > > > 2 packets transmitted, 0 packets received, 100.0% packet loss > > > > cwfw# console bob > > > > [Enter `^Ec?' for help] > > > > [halt sent] > > > > KDB: enter: Break to debugger > > > > [ thread pid 31259 tid 100913 ] > > > > Stopped at kdb_break+0x48: movq $0,0xa1069d(%rip) > > > > db> bt > > > > Tracing pid 31259 tid 100913 td 0xfffffe00c4eca000 > > > > kdb_break() at kdb_break+0x48/frame 0xfffffe00c53ef2d0 > > > > uart_intr() at uart_intr+0xf7/frame 0xfffffe00c53ef310 > > > > intr_event_handle() at intr_event_handle+0x12b/frame 0xfffffe00c53ef380 > > > > intr_execute_handlers() at intr_execute_handlers+0x63/frame > > > > 0xfffffe00c53ef3b0 > > > > Xapic_isr1() at Xapic_isr1+0xdc/frame 0xfffffe00c53ef3b0 > > > > --- interrupt, rip =3D 0xffffffff806d5c70, rsp =3D 0xfffffe00c53ef480, > rb > > = > > > p =3D > > > > 0xfffffe00c53ef480 --- > > > > getbinuptime() at getbinuptime+0x30/frame 0xfffffe00c53ef480 > > > > arc_access() at arc_access+0x250/frame 0xfffffe00c53ef4d0 > > > > arc_buf_access() at arc_buf_access+0xd0/frame 0xfffffe00c53ef4f0 > > > > dbuf_hold_impl() at dbuf_hold_impl+0xf3/frame 0xfffffe00c53ef580 > > > > dbuf_hold() at dbuf_hold+0x25/frame 0xfffffe00c53ef5b0 > > > > dnode_hold_impl() at dnode_hold_impl+0x194/frame 0xfffffe00c53ef670 > > > > dmu_bonus_hold() at dmu_bonus_hold+0x20/frame 0xfffffe00c53ef6a0 > > > > zfs_zget() at zfs_zget+0x20d/frame 0xfffffe00c53ef750 > > > > zfs_dirent_lookup() at zfs_dirent_lookup+0x16d/frame 0xfffffe00c53ef7a0 > > > > zfs_dirlook() at zfs_dirlook+0x7f/frame 0xfffffe00c53ef7d0 > > > > zfs_lookup() at zfs_lookup+0x3c0/frame 0xfffffe00c53ef8a0 > > > > zfs_freebsd_cachedlookup() at zfs_freebsd_cachedlookup+0x67/frame > > > > 0xfffffe00c53ef9e0 > > > > vfs_cache_lookup() at vfs_cache_lookup+0xa6/frame 0xfffffe00c53efa30 > > > > vfs_lookup() at vfs_lookup+0x457/frame 0xfffffe00c53efac0 > > > > namei() at namei+0x2e1/frame 0xfffffe00c53efb20 > > > > vn_open_cred() at vn_open_cred+0x505/frame 0xfffffe00c53efca0 > > > > kern_openat() at kern_openat+0x287/frame 0xfffffe00c53efdf0 > > > > ia32_syscall() at ia32_syscall+0x156/frame 0xfffffe00c53eff30 > > > > int0x80_syscall_common() at int0x80_syscall_common+0x9c/frame 0xffff89d > c > > > > db> > > > > > > > > I'll let it continue. Hopefully the watchdog timer will pop and we get > a > > > > dump. > > > > > > > > > > > > > Might also be interesting to see if this moves around or is really hung > > > getting the time. I suspect it's live lock given this traceback. > > > > It's moving around. > > > > > > db> c > > [halt sent] > > KDB: enter: Break to debugger > > [ thread pid 31259 tid 100913 ] > > Stopped at kdb_break+0x48: movq $0,0xa1069d(%rip) > > db> bt > > Tracing pid 31259 tid 100913 td 0xfffffe00c4eca000 > > kdb_break() at kdb_break+0x48/frame 0xfffffe00c53ef410 > > uart_intr() at uart_intr+0xf7/frame 0xfffffe00c53ef450 > > intr_event_handle() at intr_event_handle+0x12b/frame 0xfffffe00c53ef4c0 > > intr_execute_handlers() at intr_execute_handlers+0x63/frame > > 0xfffffe00c53ef4f0 > > Xapic_isr1() at Xapic_isr1+0xdc/frame 0xfffffe00c53ef4f0 > > --- interrupt, rip = 0xffffffff807117c4, rsp = 0xfffffe00c53ef5c0, rbp = > > 0xfffffe00c53ef5c0 --- > > lock_delay() at lock_delay+0x14/frame 0xfffffe00c53ef5c0 > > _sx_xlock_hard() at _sx_xlock_hard+0x1a9/frame 0xfffffe00c53ef660 > > _sx_xlock() at _sx_xlock+0xb0/frame 0xfffffe00c53ef6a0 > > zfs_zget() at zfs_zget+0x1fa/frame 0xfffffe00c53ef750 > > zfs_dirent_lookup() at zfs_dirent_lookup+0x16d/frame 0xfffffe00c53ef7a0 > > zfs_dirlook() at zfs_dirlook+0x7f/frame 0xfffffe00c53ef7d0 > > zfs_lookup() at zfs_lookup+0x3c0/frame 0xfffffe00c53ef8a0 > > zfs_freebsd_cachedlookup() at zfs_freebsd_cachedlookup+0x67/frame > > 0xfffffe00c53ef9e0 > > vfs_cache_lookup() at vfs_cache_lookup+0xa6/frame 0xfffffe00c53efa30 > > vfs_lookup() at vfs_lookup+0x457/frame 0xfffffe00c53efac0 > > namei() at namei+0x2e1/frame 0xfffffe00c53efb20 > > vn_open_cred() at vn_open_cred+0x505/frame 0xfffffe00c53efca0 > > kern_openat() at kern_openat+0x287/frame 0xfffffe00c53efdf0 > > ia32_syscall() at ia32_syscall+0x156/frame 0xfffffe00c53eff30 > > int0x80_syscall_common() at int0x80_syscall_common+0x9c/frame 0xffff89dc > > db> c > > It seems to have settled I manually panicked the machine. It captured a good dump of the above backtrace. This is the amd64 machine that was building i386 packages under poudriere. I purposely set -J high to stress it enough to panic the machine. This kernel includes amotin's patch from https://github.com/openzfs/zfs/pull /15228. This manually triggered panic is different from the one it suffered two days ago. The machine building amd64 packages is still running nicely but it's building some large packages. Because of this there's not as much stress on zfs. The panicked machine was building a lot of small packages. As such it was doing a lot of zfs clones, destroys, and rollbacks. I'll take a look at the dump after dinner tonight. -- Cheers, Cy Schubert FreeBSD UNIX: Web: https://FreeBSD.org NTP: Web: https://nwtime.org e^(i*pi)+1=0