Date:      Thu, 13 Apr 2023 06:56:35 -0700
From:      Cy Schubert <Cy.Schubert@cschubert.com>
To:        Mateusz Guzik <mjguzik@gmail.com>
Cc:        Cy Schubert <Cy.Schubert@cschubert.com>, "Paweł Jakub Dawidek" <pawel@dawidek.net>, Mark Millard <marklmi@yahoo.com>, vishwin@freebsd.org, dev-commits-src-main@freebsd.org, Current FreeBSD <freebsd-current@freebsd.org>, pjd@freebsd.org
Subject:   Re: git: 2a58b312b62f - main - zfs: merge openzfs/zfs@431083f75
Message-ID:  <20230413135635.6B62F354@slippy.cwsent.com>
In-Reply-To: <CAGudoHG3rCx93gyJTmzTBnSe4fQ9=m4mBESWbKVWtAGRxen_4w@mail.gmail.com>
References:  <20230413071032.18BFF31F@slippy.cwsent.com>  <D0D9BD06-C321-454C-A038-C55C63E0DD6B@dawidek.net>  <20230413063321.60344b1f@cschubert.com> <CAGudoHG3rCx93gyJTmzTBnSe4fQ9=m4mBESWbKVWtAGRxen_4w@mail.gmail.com>

In message <CAGudoHG3rCx93gyJTmzTBnSe4fQ9=m4mBESWbKVWtAGRxen_4w@mail.gmail.com>, Mateusz Guzik writes:
> On 4/13/23, Cy Schubert <Cy.Schubert@cschubert.com> wrote:
> > On Thu, 13 Apr 2023 19:54:42 +0900
> > Paweł Jakub Dawidek <pawel@dawidek.net> wrote:
> >
> >> On Apr 13, 2023, at 16:10, Cy Schubert <Cy.Schubert@cschubert.com> wrote:
> >> >
> >> > In message <20230413070426.8A54F25A@slippy.cwsent.com>, Cy Schubert
> >> > writes:
> >> >> In message <20230413064252.1E5C1318@slippy.cwsent.com>, Cy Schubert
> >> >> writes:
> >> >>> In message <A291C24C-9D7C-4E79-AD03-68ED910FC2DE@yahoo.com>, Mark
> >> >>> Millard writes:
> >> >>>> [This just puts my prior reply's material into Cy's
> >> >>>> adjusted resend of the original. The To/Cc should
> >> >>>> be complete this time.]
> >> >>>>
> >> >>>> On Apr 12, 2023, at 22:52, Cy Schubert <Cy.Schubert@cschubert.com> wrote:
> >> >>>>
> >> >>>>> In message <C8E4A43B-9FC8-456E-ADB3-13E7F40B2B04@yahoo.com>, Mark
> >> >>>>> Millard writes:
> >> >>>>>> From: Charlie Li <vishwin_at_freebsd.org> wrote on
> >> >>>>>> Date: Wed, 12 Apr 2023 20:11:16 UTC :
> >> >>>>>>
> >> >>>>>>> Charlie Li wrote:
> >> >>>>>>>> Mateusz Guzik wrote:
> >> >>>>>>>>> can you please test poudriere with
> >> >>>>>>>>> https://github.com/openzfs/zfs/pull/14739/files
> >> >>>>>>>>>
> >> >>>>>>>> After applying, on the md(4)-backed pool regardless of
> >> >>>>>>>> block_cloning, the cy@ `cp -R` test reports no differing (i.e.
> >> >>>>>>>> corrupted) files. Will report back on poudriere results (no
> >> >>>>>>>> block_cloning).
> >> >>>>>>>>
> >> >>>>>>> As for poudriere, build failures are still rolling in. These are
> >> >>>>>>> (and have been) entirely random on every run. Some examples from
> >> >>>>>>> this run:
> >> >>>>>>>
> >> >>>>>>> lang/php81:
> >> >>>>>>> - post-install: @${INSTALL_DATA} ${WRKSRC}/php.ini-development
> >> >>>>>>>   ${WRKSRC}/php.ini-production ${WRKDIR}/php.conf
> >> >>>>>>>   ${STAGEDIR}/${PREFIX}/etc
> >> >>>>>>> - consumers fail to build due to corrupted php.conf packaged
> >> >>>>>>>
> >> >>>>>>> devel/ninja:
> >> >>>>>>> - phase: stage
> >> >>>>>>> - install -s -m 555 /wrkdirs/usr/ports/devel/ninja/work/ninja-1.11.1/ninja
> >> >>>>>>>   /wrkdirs/usr/ports/devel/ninja/work/stage/usr/local/bin
> >> >>>>>>> - consumers fail to build due to corrupted bin/ninja packaged
> >> >>>>>>>
> >> >>>>>>> devel/netsurf-buildsystem:
> >> >>>>>>> - phase: stage
> >> >>>>>>> - mkdir -p
> >> >>>>>>>   /wrkdirs/usr/ports/devel/netsurf-buildsystem/work/stage/usr/local/share/netsurf-buildsystem/makefiles
> >> >>>>>>>   /wrkdirs/usr/ports/devel/netsurf-buildsystem/work/stage/usr/local/share/netsurf-buildsystem/testtools
> >> >>>>>>>   for M in Makefile.top Makefile.tools Makefile.subdir Makefile.pkgconfig
> >> >>>>>>>   Makefile.clang Makefile.gcc Makefile.norcroft Makefile.open64; do \
> >> >>>>>>>   cp makefiles/$M
> >> >>>>>>>   /wrkdirs/usr/ports/devel/netsurf-buildsystem/work/stage/usr/local/share/netsurf-buildsystem/makefiles/; \
> >> >>>>>>>   done
> >> >>>>>>> - graphics/libnsgif fails to build due to NUL characters in
> >> >>>>>>>   Makefile.{clang,subdir}, causing nothing to link
> >> >>>>>>
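The cy@ `cp -R` test is only named in the thread, not spelled out. Assuming it means copying a directory tree with `cp -R` and then checking each copied file against its original, the comparison step could be sketched in Python as below. `compare_trees` is a hypothetical helper, not part of any tool mentioned here, and the copy itself is deliberately left to the system's own `cp -R`, since a Python-level copy may not exercise the same I/O path:

```python
import filecmp
import os


def compare_trees(src: str, dst: str) -> list[str]:
    """Return relative paths of regular files under src whose counterpart
    under dst is missing or differs byte-for-byte (hypothetical helper)."""
    differing = []
    for root, _dirs, files in os.walk(src):
        for name in files:
            orig = os.path.join(root, name)
            rel = os.path.relpath(orig, src)
            copy = os.path.join(dst, rel)
            # shallow=False forces a content comparison; the default would
            # trust matching size and mtime, which a metadata-preserving
            # copy keeps even when the data differs.
            if not os.path.isfile(copy) or not filecmp.cmp(orig, copy, shallow=False):
                differing.append(rel)
    return sorted(differing)
```

For example, after `subprocess.run(["cp", "-R", src, dst], check=True)` with a `dst` that does not exist yet, an empty list from `compare_trees(src, dst)` would correspond to the "no differing files" result reported above.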
> >> >>>>>> Summary: I have problems building ports into packages
> >> >>>>>> via poudriere-devel use despite being fully updated/patched
> >> >>>>>> (as of when I started the experiment), never having enabled
> >> >>>>>> block_cloning (still using openzfs-2.1-freebsd).
> >> >>>>>>
> >> >>>>>> In other words, I can confirm other reports that have
> >> >>>>>> been made.
> >> >>>>>>
> >> >>>>>> The details follow.
> >> >>>>>>
> >> >>>>>> [Written as I was working on setting up for the experiments
> >> >>>>>> and then executing those experiments, adjusting as I went
> >> >>>>>> along.]
> >> >>>>>>
> >> >>>>>> I've run my own tests in a context that has never had the
> >> >>>>>> zpool upgrade and that jumps from before the openzfs import to
> >> >>>>>> after the existing commits for trying to fix openzfs on
> >> >>>>>> FreeBSD. I report on the sequence of activities getting to
> >> >>>>>> the point of testing as well.
> >> >>>>>>
> >> >>>>>> By personal policy I keep my (non-temporary) pools compatible
> >> >>>>>> with what the most recent ??.?-RELEASE supports, using
> >> >>>>>> openzfs-2.1-freebsd for now. The pools involved below have
> >> >>>>>> never had a zpool upgrade from where they started. (I've no
> >> >>>>>> pools that have ever had a zpool upgrade.)
> >> >>>>>>
> >> >>>>>> (Temporary pools, such as the one for this investigation, are
> >> >>>>>> rare for me. But I'm not testing block_cloning or anything new
> >> >>>>>> this time.)
> >> >>>>>>
> >> >>>>>> I'll note that I use zfs for bectl, not for redundancy. So
> >> >>>>>> my evidence is more limited in that respect.
> >> >>>>>>
> >> >>>>>> The activities were done on a HoneyComb (16 Cortex-A72 cores).
> >> >>>>>> The system has and supports ECC RAM; 64 GiBytes of RAM are
> >> >>>>>> present.
> >> >>>>>>
> >> >>>>>> I started by duplicating my normal zfs environment to an
> >> >>>>>> external USB3 NVMe drive and adjusting the host name and such
> >> >>>>>> to produce the below. (Non-debug, although I do not strip
> >> >>>>>> symbols.):
> >> >>>>>>
> >> >>>>>> # uname -apKU
> >> >>>>>> FreeBSD CA72_4c8G_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #90
> >> >>>>>> main-n261544-cee09bda03c8-dirty: Wed Mar 15 20:25:49 PDT 2023
> >> >>>>>> root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72
> >> >>>>>> arm64 aarch64 1400082 1400082
> >> >>>>>>
> >> >>>>>> I then did: git fetch, stash push ., merge --ff-only, stash apply .
> >> >>>>>> : my normal procedure. I then also applied the patch from:
> >> >>>>>>
> >> >>>>>> https://github.com/openzfs/zfs/pull/14739/files
> >> >>>>>>
> >> >>>>>> Then I did: buildworld buildkernel, install them, and rebooted.
> >> >>>>>>
> >> >>>>>> The result was:
> >> >>>>>>
> >> >>>>>> # uname -apKU
> >> >>>>>> FreeBSD CA72_4c8G_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #91
> >> >>>>>> main-n262122-2ef2c26f3f13-dirty: Wed Apr 12 19:23:35 PDT 2023
> >> >>>>>> root@CA72_4c8G_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72
> >> >>>>>> arm64 aarc
> >> >>>>>>
> >> >>>>>> The later poudriere-devel based build of packages from ports is
> >> >>>>>> based on:
> >> >>>>>>
> >> >>>>>> # ~/fbsd-based-on-what-commit.sh -C /usr/ports
> >> >>>>>> 4e94ac9eb97f (HEAD -> main, freebsd/main, freebsd/HEAD)
> >> >>>>>> devel/freebsd-gcc12: Bump to 12.2.0.
> >> >>>>>> Author:     John Baldwin <jhb@FreeBSD.org>
> >> >>>>>> Commit:     John Baldwin <jhb@FreeBSD.org>
> >> >>>>>> CommitDate: 2023-03-25 00:06:40 +0000
> >> >>>>>> branch: main
> >> >>>>>> merge-base: 4e94ac9eb97fab16510b74ebcaa9316613182a72
> >> >>>>>> merge-base: CommitDate: 2023-03-25 00:06:40 +0000
> >> >>>>>> n613214 (--first-parent --count for merge-base)
> >> >>>>>>
> >> >>>>>> poudriere attempted to build 476 packages, starting
> >> >>>>>> with pkg (in order to build the 56 that I explicitly
> >> >>>>>> indicate that I want). It is my normal set of ports.
> >> >>>>>> The form of building is biased to allowing a high
> >> >>>>>> load average compared to the number of hardware
> >> >>>>>> threads (same as cores here): each builder is allowed
> >> >>>>>> to use the full count of hardware threads. The build
> >> >>>>>> used USE_TMPFS="data" instead of the USE_TMPFS=all I
> >> >>>>>> normally use on the build machine involved.
> >> >>>>>>
> >> >>>>>> And it produced some random errors during the attempted
> >> >>>>>> builds. A type of example that is easy to interpret
> >> >>>>>> without further exploration is:
> >> >>>>>>
> >> >>>>>> pkg_resources.extern.packaging.requirements.InvalidRequirement:
> >> >>>>>> Parse error at "'\x00\x00\x00\x00\x00\x00\x00\x00'": Expected
> >> >>>>>> W:(0-9A-Za-z)
> >> >>>>>>     0
> >> >>>>>>         da0p8     ONLINE       0     0     0
> >> >>>>>>
> >> >>>>>> errors: No known data errors
> >> >>>>>>
> >> >>>>>> ===
> >> >>>>>> Mark Millard
> >> >>>>>> marklmi at yahoo.com
> >> >>>>>>
> >> >>>>>
> >> >>>>> Let's try this again. Claws-mail didn't include the list address in
> >> >>>>> the header. Trying to reply, again, using exmh instead.
> >> >>>>>
> >> >>>>> Did your pools suffer the EXDEV problem? The EXDEV also corrupted
> >> >>>>> files.
> >> >>>>
> >> >>>> As I reported, this was a jump from before the import
> >> >>>> to as things are tonight (here). So: NO, unless the
> >> >>>> existing code as of tonight still has the EXDEV problem!
> >> >>>>
> >> >>>> Prior to this experiment I'd not progressed any media
> >> >>>> beyond: main-n261544-cee09bda03c8-dirty Wed Mar 15 20:25:49.
> >> >>>>
> >> >>>>> I think, without sufficient investigation, we risk jumping to
> >> >>>>> conclusions. I've taken an extremely cautious approach, rolling
> >> >>>>> back snapshots (as much as possible, i.e. poudriere datasets) when
> >> >>>>> EXDEV corruption was encountered.
> >> >>>>>
> >> >>>> Again: nothing between main-n261544-cee09bda03c8-dirty and
> >> >>>> main-n262122-2ef2c26f3f13-dirty was involved at any stage.
> >> >>>>
> >> >>>>>
> >> >>>>> I did not roll back any snapshots in my MH mail directory. Rolling
> >> >>>>> back snapshots of my MH maildir would result in loss of email. I
> >> >>>>> have to live with that corruption. Corrupted files in my outgoing
> >> >>>>> sent email directory remain:
> >> >>>>>
> >> >>>>> slippy$ ugrep -cPa '\x00' ~/.Mail/note | grep -c :1
> >> >>>>> 53
> >> >>>>> slippy$
> >> >>>>>
> >> >>>>> There are 53 corrupted files in my note log of 9913 emails. Those
> >> >>>>> files will never be fixed. They were corrupted by the EXDEV bug. Any
> >> >>>>> new ZFS or ZFS patches cannot retroactively remove the corruption
> >> >>>>> from those files.
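The pipeline above keys on ugrep's per-file match counts (assuming `-c` prints one `path:count` line per file, as grep does), so its total is not necessarily the number of files that contain any NUL byte: a count such as `:12` also matches `:1`, while `:2` does not. A rough Python sketch of the simpler question, "how many files under this directory contain at least one NUL byte?", might look like this (`count_files_with_nul` is a hypothetical name):

```python
import os


def count_files_with_nul(directory: str) -> int:
    """Count regular files under directory containing at least one NUL byte."""
    count = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            if not os.path.isfile(path):
                continue  # skip broken symlinks, sockets and the like
            with open(path, "rb") as f:
                # Read in 64 KiB chunks so large files are not loaded whole;
                # stop at the first chunk that contains a NUL byte.
                while chunk := f.read(1 << 16):
                    if b"\x00" in chunk:
                        count += 1
                        break
    return count
```

Calling `count_files_with_nul(os.path.expanduser("~/.Mail/note"))` would then give that count directly.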
> >> >>>>>
> >> >>>>> But my poudriere files, because the snapshots were rolled back,
> >> >>>>> were "repaired" by the rolled back snapshots.
> >> >>>>>
> >> >>>>> I'm not convinced that there is presently active corruption since
> >> >>>>> the problem has been fixed. I am convinced that whatever corruption
> >> >>>>> was written at the time will remain forever or until those files
> >> >>>>> are deleted or replaced -- just like my email files written to disk
> >> >>>>> at the time.
> >> >>>>>
> >> >>>> My test results and procedure just do not fit your conclusion
> >> >>>> that things are okay now if block_cloning is completely avoided.
> >> >>>>
> >> >>> Admitting I'm wrong: sending copies of my last reply to you back to
> >> >>> myself, again and again, three times, I've managed to reproduce the
> >> >>> corruption you are talking about.
> >> >>>
> >> >> This email itself was also corrupted. Below is what was sent. Good
> >> >> thing multiple copies are saved by exmh.
> >> >>
> >> >> Admitting I'm wrong: sending copies of my last reply to you back to
> >> >> myself, again and again, three times, I've managed to reproduce the
> >> >> corruption you are talking about.
> >> >>
> >> > This email itself was also corrupted. Below is what was sent. Good
> >> > thing multiple copies are saved by exmh.
> >> >
> >> > Admitting I'm wrong: sending copies of my last reply to you back to
> >> > myself, again and again, three times, I've managed to reproduce the
> >> > corruption you are talking about.
> >> >
> >> > From my previous email to you.
> >> >
> >> > header. Trying to reply:::::::::, again, using exmh instead.
> >> >                       ^^^^^^^^^
> >> > Here it is, nine additional bytes of garbage. I've replaced the garbage
> >> > with colons because nulls mess up a lot of things, including cut&paste.
> >> >
> >> > In another instance about 500 bytes were removed. I can reproduce the
> >> > corruption at will now.
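Cy's remark about nulls suggests the nine garbage bytes were NUL characters, which he swapped for colons by hand before pasting. Assuming the corrupted message is available as raw bytes, a small Python sketch of that inspection could report where each run of NULs starts and how long it is, and make the same colon substitution (`nul_runs` and `mark_nuls` are illustrative names only):

```python
def nul_runs(data: bytes) -> list[tuple[int, int]]:
    """Return (offset, length) for each run of consecutive NUL bytes."""
    runs = []
    start = None
    for i, byte in enumerate(data):  # iterating bytes yields ints
        if byte == 0:
            if start is None:
                start = i  # a new run begins here
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:  # run that extends to the end of the data
        runs.append((start, len(data) - start))
    return runs


def mark_nuls(data: bytes, marker: bytes = b":") -> bytes:
    """Replace every NUL byte with marker so the text can be pasted safely."""
    return data.replace(b"\x00", marker)
```

Applied to the sample line with nine NUL bytes where the colons are, `nul_runs` reports a single run `(23, 9)`, i.e. nine bytes starting right after "reply", and `mark_nuls` reproduces the colon version quoted above.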
> >> >
> >> > The EXDEV patch is applied. Block_cloning is disabled.
> >> >
> >> > Somehow nulls and other garbage are inserted in the middle of emails
> >> > after
> >> > the ZFS upgrade.
> >> >
> >> Can you please try this patch:
> >>
> >> github.com
> >
> > The patch was applied yesterday at noon (PDT).
> >
> >>
> >>
> >>
> >> Unfortunately I don't see how this can happen with block cloning
> >> disabled.
> >
> > It does and it's reproducible.
> >
>
> There is corruption with the recent import, with the
> https://github.com/openzfs/zfs/pull/14739/files patch applied and
> block cloning disabled on the pool.

Same here.

>
> There is no corruption with the top of main with the zfs merge reverted altogether.

I'm in the process of building a branch reverting the merge altogether and 
will test it on my sandbox machine later today.

>
> Which commit results in said corruption remains to be seen; a variant
> of the tree with just block cloning support reverted, for testing
> purposes only, is about to be evaluated.
>
> --
> Mateusz Guzik <mjguzik gmail.com>


-- 
Cheers,
Cy Schubert <Cy.Schubert@cschubert.com>
FreeBSD UNIX:  <cy@FreeBSD.org>   Web:  https://FreeBSD.org
NTP:           <cy@nwtime.org>    Web:  https://nwtime.org

			e^(i*pi)+1=0




