Date:      Wed, 12 Apr 2023 23:21:19 -0700
From:      Mark Millard <marklmi@yahoo.com>
To:        Cy Schubert <Cy.Schubert@cschubert.com>
Cc:        Mateusz Guzik <mjguzik@gmail.com>, vishwin@freebsd.org, dev-commits-src-main@freebsd.org, Current FreeBSD <freebsd-current@freebsd.org>
Subject:   Re: git: 2a58b312b62f - main - zfs: merge openzfs/zfs@431083f75
Message-ID:  <A291C24C-9D7C-4E79-AD03-68ED910FC2DE@yahoo.com>
In-Reply-To: <20230413055221.E8B211F0@slippy.cwsent.com>
References:  <C8E4A43B-9FC8-456E-ADB3-13E7F40B2B04.ref@yahoo.com> <C8E4A43B-9FC8-456E-ADB3-13E7F40B2B04@yahoo.com> <20230413055221.E8B211F0@slippy.cwsent.com>

[This just puts my prior reply's material into Cy's
adjusted resend of the original. The To/Cc should
be complete this time.]

On Apr 12, 2023, at 22:52, Cy Schubert <Cy.Schubert@cschubert.com> wrote:

> In message <C8E4A43B-9FC8-456E-ADB3-13E7F40B2B04@yahoo.com>, Mark Millard
> writes:
>> From: Charlie Li <vishwin_at_freebsd.org> wrote on
>> Date: Wed, 12 Apr 2023 20:11:16 UTC :
>>
>>> Charlie Li wrote:
>>>> Mateusz Guzik wrote:
>>>>> can you please test poudriere with
>>>>> https://github.com/openzfs/zfs/pull/14739/files
>>>>>
>>>> After applying, on the md(4)-backed pool regardless of block_cloning,
>>>> the cy@ `cp -R` test reports no differing (ie corrupted) files. Will
>>>> report back on poudriere results (no block_cloning).
>>>>
>>> As for poudriere, build failures are still rolling in. These are (and
>>> have been) entirely random on every run. Some examples from this run:
>>>
>>> lang/php81:
>>> - post-install: @${INSTALL_DATA} ${WRKSRC}/php.ini-development
>>> ${WRKSRC}/php.ini-production ${WRKDIR}/php.conf ${STAGEDIR}/${PREFIX}/etc
>>> - consumers fail to build due to corrupted php.conf packaged
>>>
>>> devel/ninja:
>>> - phase: stage
>>> - install -s -m 555
>>> /wrkdirs/usr/ports/devel/ninja/work/ninja-1.11.1/ninja
>>> /wrkdirs/usr/ports/devel/ninja/work/stage/usr/local/bin
>>> - consumers fail to build due to corrupted bin/ninja packaged
>>>
>>> devel/netsurf-buildsystem:
>>> - phase: stage
>>> - mkdir -p
>>> /wrkdirs/usr/ports/devel/netsurf-buildsystem/work/stage/usr/local/share/netsurf-buildsystem/makefiles
>>> /wrkdirs/usr/ports/devel/netsurf-buildsystem/work/stage/usr/local/share/netsurf-buildsystem/testtools
>>> for M in Makefile.top Makefile.tools Makefile.subdir Makefile.pkgconfig
>>> Makefile.clang Makefile.gcc Makefile.norcroft Makefile.open64; do \
>>> cp makefiles/$M
>>> /wrkdirs/usr/ports/devel/netsurf-buildsystem/work/stage/usr/local/share/netsurf-buildsystem/makefiles/;
>>> \
>>> done
>>> - graphics/libnsgif fails to build due to NUL characters in
>>> Makefile.{clang,subdir}, causing nothing to link
>>
>> Summary: I have problems building ports into packages
>> via poudriere-devel use despite being fully updated/patched
>> (as of when I started the experiment), never having enabled
>> block_cloning ( still using openzfs-2.1-freebsd ).
>>
>> In other words, I can confirm other reports that have
>> been made.
>>
>> The details follow.
>>
>>
>> [Written as I was working on setting up for the experiments
>> and then executing those experiments, adjusting as I went
>> along.]
>>
>> I've run my own tests in a context that has never had the
>> zpool upgrade and that jumped from before the openzfs import to
>> after the existing commits for trying to fix openzfs on
>> FreeBSD. I report on the sequence of activities getting to
>> the point of testing as well.
>>
>> By personal policy I keep my (non-temporary) pools compatible
>> with what the most recent ??.?-RELEASE supports, using
>> openzfs-2.1-freebsd for now. The pools involved below have
>> never had a zpool upgrade from where they started. (I've no
>> pools that have ever had a zpool upgrade.)
>>
>> (Temporary pools are rare for me, such as this investigation.
>> But I'm not testing block_cloning or anything new this time.)
>>
>> I'll note that I use zfs for bectl, not for redundancy. So
>> my evidence is more limited in that respect.
>>
>> The activities were done on a HoneyComb (16 Cortex-A72 cores).
>> The system has and supports ECC RAM; 64 GiBytes of RAM are
>> present.
>>
>> I started by duplicating my normal zfs environment to an
>> external USB3 NVMe drive and adjusting the host name and such
>> to produce the below. (Non-debug, although I do not strip
>> symbols.) :
>>
>> # uname -apKU
>> FreeBSD CA72_4c8G_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #90 main-n261544-cee09bda03c8-dirty: Wed Mar 15 20:25:49 PDT 2023     root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72 arm64 aarch64 1400082 1400082
>>
>> I then did: git fetch, stash push ., merge --ff-only, stash apply . :
>> my normal procedure. I then also applied the patch from:
>>
>> https://github.com/openzfs/zfs/pull/14739/files
>>
>> Then I did: buildworld buildkernel, install them, and rebooted.
>>
>> The result was:
>>
>> # uname -apKU
>> FreeBSD CA72_4c8G_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #91 main-n262122-2ef2c26f3f13-dirty: Wed Apr 12 19:23:35 PDT 2023     root@CA72_4c8G_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72 arm64 aarch64 1400086 1400086
>>
>> The later poudriere-devel based build of packages from ports is
>> based on:
>>
>> # ~/fbsd-based-on-what-commit.sh -C /usr/ports
>> 4e94ac9eb97f (HEAD -> main, freebsd/main, freebsd/HEAD) devel/freebsd-gcc12: Bump to 12.2.0.
>> Author:     John Baldwin <jhb@FreeBSD.org>
>> Commit:     John Baldwin <jhb@FreeBSD.org>
>> CommitDate: 2023-03-25 00:06:40 +0000
>> branch: main
>> merge-base: 4e94ac9eb97fab16510b74ebcaa9316613182a72
>> merge-base: CommitDate: 2023-03-25 00:06:40 +0000
>> n613214 (--first-parent --count for merge-base)
>>
>> poudriere attempted to build 476 packages, starting
>> with pkg (in order to build the 56 that I explicitly
>> indicate that I want). It is my normal set of ports.
>> The form of building is biased to allowing a high
>> load average compared to the number of hardware
>> threads (same as cores here): each builder is allowed
>> to use the full count of hardware threads. The build
>> used USE_TMPFS="data" instead of the USE_TMPFS=all I
>> normally use on the build machine involved.
>>
>> And it produced some random errors during the attempted
>> builds. A type of example that is easy to interpret
>> without further exploration is:
>>
>> pkg_resources.extern.packaging.requirements.InvalidRequirement: Parse error at "'\x00\x00\x00\x00\x00\x00\x00\x00'": Expected W:(0-9A-Za-z)
>>
>> A fair number of errors are of the form: the build
>> installs a previously built package for use in the
>> builder, but later the builder cannot find some file
>> from the package's installation.
>>
>> Another error reported was:
>>
>> ld: error: /usr/local/lib/libblkid.a: unknown file type
>>
>> For reference:
>>
>> [main-CA72-bulk_a-default] [2023-04-12_20h45m32s] [committing:] Queued: 476 Built: 252 Failed: 11  Skipped: 213 Ignored: 0   Fetched: 0   Tobuild: 0    Time: 00:37:52
>>
>> I started another build that tried to build 224 packages:
>> the 11 failed and 213 skipped.
>>
>> Just 1 package built that failed before:
>>
>> [00:04:58] [09] [00:04:15] Finished databases/sqlite3@default | sqlite3-3.41.0_1,1: Success
>>
>> It seems to be the only one where the original failure was not
>> an example of complaining about the missing/corrupted content
>> of a package install used for building. So it is an example
>> of randomly varying behavior.
>>
>> That, in turn, allowed:
>>
>> [00:04:58] [01] [00:00:00] Building security/nss | nss-3.89
>>
>> to build but everything else failed or was skipped.
>>
>> The sqlite3 vs. other failure difference suggests that writes
>> have random problems but later reads reliably see the problem
>> that resulted (before the content is deleted).
>>
>>
>> After the above:
>>
>> # zpool status
>>  pool: zroot
>> state: ONLINE
>> config:
>>
>>        NAME        STATE     READ WRITE CKSUM
>>        zroot       ONLINE       0     0     0
>>          da0p8     ONLINE       0     0     0
>>
>> errors: No known data errors
>>
>> # zpool scrub zroot
>> # zpool status
>>  pool: zroot
>> state: ONLINE
>>  scan: scrub repaired 0B in 00:16:25 with 0 errors on Wed Apr 12 22:15:39 2023
>> config:
>>
>>        NAME        STATE     READ WRITE CKSUM
>>        zroot       ONLINE       0     0     0
>>          da0p8     ONLINE       0     0     0
>>
>> errors: No known data errors
>>
>>
>> ===
>> Mark Millard
>> marklmi at yahoo.com
>
>
> Let's try this again. Claws-mail didn't include the list address in the
> header. Trying to reply, again, using exmh instead.
>
>
> Did your pools suffer the EXDEV problem? The EXDEV also corrupted files.

As I reported, this was a jump from before the import
to as things are tonight (here). So: NO, unless the
existing code as of tonight still has the EXDEV problem!

Prior to this experiment I'd not progressed any media
beyond: main-n261544-cee09bda03c8-dirty Wed Mar 15 20:25:49.

> I think, without sufficient investigation we risk jumping to
> conclusions. I've taken an extremely cautious approach, rolling back
> snapshots (as much as possible, i.e. poudriere datasets) when EXDEV
> corruption was encountered.

Again: nothing between main-n261544-cee09bda03c8-dirty and
main-n262122-2ef2c26f3f13-dirty was involved at any stage.

>
> I did not rollback any snapshots in my MH mail directory. Rolling back
> snapshots of my MH maildir would result in loss of email. I have to
> live with that corruption. Corrupted files in my outgoing sent email
> directory remain:
>
> slippy$ ugrep -cPa '\x00' ~/.Mail/note | grep -c :1
> 53
> slippy$
>
> There are 53 corrupted files in my note log of 9913 emails. Those files
> will never be fixed. They were corrupted by the EXDEV bug. Any new ZFS
> or ZFS patches cannot retroactively remove the corruption from those
> files.
>
> But my poudriere files, because the snapshots were rolled back, were
> "repaired" by the rolled back snapshots.
>
> I'm not convinced that there is presently active corruption since
> the problem has been fixed. I am convinced that whatever corruption
> that was written at the time will remain forever or until those files
> are deleted or replaced -- just like my email files written to disk at
> the time.

My test results and procedure just do not fit your conclusion
that things are okay now if block_cloning is completely avoided.
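
For anyone wanting to check a tree for the kind of damage discussed
in this thread (runs of NUL bytes in files that should be text), here
is a minimal Python sketch. It was not part of the testing reported
above. The names has_nul_run and scan_tree are hypothetical, and the
8-byte threshold is an assumption taken from the 8-NUL run in the
quoted parse error. Binaries legitimately contain NUL runs, so hits
only mean something for files expected to be text (Makefiles,
php.conf, mail files and the like).

```python
import os
import sys

# Assumed threshold: 8 consecutive NUL bytes, as in the quoted parse error.
NUL_RUN = b"\x00" * 8


def has_nul_run(path, run=NUL_RUN, chunk_size=1 << 20):
    """Return True if the file at `path` contains the byte string `run`."""
    keep = len(run) - 1  # bytes to carry over between chunks
    tail = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return False
            data = tail + chunk
            if run in data:
                return True
            # Carry the end of this chunk forward so that a run that
            # straddles two chunks is still found.
            tail = data[-keep:] if keep else b""


def scan_tree(root):
    """Yield regular files under `root` that contain a NUL run."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                if has_nul_run(path):
                    yield path
            except OSError:
                # Unreadable file: skip it rather than abort the scan.
                continue


if __name__ == "__main__":
    # Each command-line argument is a directory to scan.
    for root in sys.argv[1:]:
        for hit in scan_tree(root):
            print(hit)
```

Saved under a name of your choosing, it takes one or more directories
as arguments and prints the path of each file with a NUL run. It only
reads files, so it can show what is on disk now but not when the
damage was written.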


===
Mark Millard
marklmi at yahoo.com



