Date:      Thu, 31 Aug 2023 12:30:25 -0700
From:      Mark Millard <marklmi@yahoo.com>
To:        Cy Schubert <Cy.Schubert@cschubert.com>, dev-commits-src-main@freebsd.org
Cc:        Alexander Motin <mav@FreeBSD.org>
Subject:   Re: git: 315ee00fa961 - main - zfs: merge openzfs/zfs@804414aad
Message-ID:  <BBF3BC24-80AA-4210-9775-8B89805105C9@yahoo.com>
References:  <BBF3BC24-80AA-4210-9775-8B89805105C9.ref@yahoo.com>

Cy Schubert <Cy.Schubert@cschubert.com> wrote on
Date: Thu, 31 Aug 2023 17:53:50 UTC :

> In message <1db726d4-32c9-e1b8-51d6-981aa51b7825@FreeBSD.org>, Alexander
> Motin writes:
> > On 31.08.2023 08:45, Drew Gallatin wrote:
> > > On Wed, Aug 30, 2023, at 8:01 PM, Alexander Motin wrote:
> > >> It is the first time I see a panic like this.  I'll think about it
> > >> tomorrow.  But I'd appreciate any information on what your workload
> > >> is and what you are doing related to ZIL (O_SYNC, fsync(), sync=always,
> > >> etc.) to trigger it?  What is your pool configuration?
> > >
> > > I'm not Gleb, but this was something at $WORK, so I can perhaps help.
> > > I've included the output of zpool status, and all non-default settings
> > > in the zpool.  Note that we don't use a ZIL device.
> >
> > You don't use a SLOG device.  The ZIL is always with you, just embedded
> > in this case.
> >
> > I tried to think about this for a couple of hours and still can't see
> > how this can happen.  zil_sync() should not call zil_free_lwb() unless
> > the lwb is in LWB_STATE_FLUSH_DONE.  To get into LWB_STATE_FLUSH_DONE,
> > the lwb should first delete all lwb_vdev_tree entries in
> > zil_lwb_write_done().  And no new entries should be added during/after
> > zil_lwb_write_done() due to the zio dependencies that are set.
> >
> > I've made a patch tuning some assertions for this context:
> > https://github.com/openzfs/zfs/pull/15227 .  If the issue is
> > reproducible, could you please apply it and try again?  Maybe it will
> > give us some more clues.
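
As an aside, for anyone trying to follow the ordering Alexander describes
above: below is a small self-contained toy model of that invariant. It is
not OpenZFS code and not the PR 15227 diff; the names merely mirror the
real ZIL symbols (lwb_state, lwb_vdev_tree, zil_lwb_write_done(),
zil_free_lwb()) and the types are simplified stand-ins.

/*
 * Toy model (NOT OpenZFS code, and not the PR 15227 diff): write_done()
 * empties the vdev tree, only then can the lwb reach FLUSH_DONE, and
 * only a FLUSH_DONE lwb with an empty tree may be freed.
 */
#include <assert.h>
#include <stdio.h>

typedef enum {
	LWB_STATE_ISSUED,
	LWB_STATE_WRITE_DONE,
	LWB_STATE_FLUSH_DONE
} lwb_state_t;

typedef struct {
	lwb_state_t	lwb_state;
	int		lwb_vdev_tree_entries;	/* stand-in for the AVL tree */
} lwb_t;

/* Models zil_lwb_write_done(): drain the vdev tree, then advance state. */
static void
lwb_write_done(lwb_t *lwb)
{
	lwb->lwb_vdev_tree_entries = 0;
	lwb->lwb_state = LWB_STATE_WRITE_DONE;
}

/* Models the flush completion: only legal after write_done. */
static void
lwb_flush_done(lwb_t *lwb)
{
	assert(lwb->lwb_state == LWB_STATE_WRITE_DONE);
	lwb->lwb_state = LWB_STATE_FLUSH_DONE;
}

/* Models zil_free_lwb() as called from zil_sync(): the invariant at issue. */
static void
lwb_free(lwb_t *lwb)
{
	assert(lwb->lwb_state == LWB_STATE_FLUSH_DONE);
	assert(lwb->lwb_vdev_tree_entries == 0);
}

int
main(void)
{
	lwb_t lwb = { LWB_STATE_ISSUED, 3 };

	lwb_write_done(&lwb);
	lwb_flush_done(&lwb);
	lwb_free(&lwb);		/* would assert if the ordering were violated */
	printf("ordering holds\n");
	return (0);
}

Run as written it just confirms that the legal ordering passes the asserts;
reordering the calls (say, freeing before the flush completes) trips them,
which is the kind of situation the reported panic seems to imply.
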
>
> One thing that circumvents my two problems is reducing poudriere bulk
> jobs from 8 to 5 on my 4-core machines.

What about your ALLOW_MAKE_JOBS setting, or other constraints
(such as MAKE_JOBS_NUMBER) on the parallelism internal to each
builder?
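
For concreteness, the sort of knobs I mean usually look like the
following; the values here are only illustrative assumptions, not
your configuration or mine:

# /usr/local/etc/poudriere.conf
PARALLEL_JOBS=32         # number of parallel builders

# /usr/local/etc/poudriere.d/make.conf (or a jail/set-specific variant)
ALLOW_MAKE_JOBS=yes      # allow parallel make jobs inside each builder
MAKE_JOBS_NUMBER=32      # cap on those make jobs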

My earlier high load average test that did not reproduce the
problem was allowed to use 32 builders, each allowed to use
32 make jobs. This was for 32 hardware threads (ThreadRipper
1950X). The maximums observed for each of the 3 load averages were
349.68, 264.30, and 243.16 (not simultaneously). (I use top patched
to monitor and report such maximums for what it observes.)
In general the 3 reported load averages were each over 100
for a very long time.

Since I did not get the problem despite the high sustained load
averages (but with no "extra" builders involved), while you got
what you report, my test does support the idea that it is the
number of builders being large relative to the number of hardware
threads that matters for reproducing this via poudriere.

A weakness in that evidence is that my test predates:

Sun, 27 Aug 2023
    • git: 315ee00fa961 - main - zfs: merge openzfs/zfs@804414aad Martin Matuska

and so was for one stage earlier relative to main's openzfs updates. It had:

Thu, 10 Aug 2023
    . . .
    • git: cd25b0f740f8 - main - zfs: cherry-pick fix from openzfs Martin Matuska
    • git: 28d2e3b5dedf - main - zfs: cherry-pick fix from openzfs Martin Matuska

and no uncommitted openzfs patches.

===
Mark Millard
marklmi at yahoo.com



