Date:      Sat, 20 Nov 2021 11:54:09 -0800
From:      Mark Millard via freebsd-current <freebsd-current@freebsd.org>
To:        freebsd-current <freebsd-current@freebsd.org>, "freebsd-arm@freebsd.org" <arm@freebsd.org>
Subject:   Re: aarch64(?) poudiere-devel based builds seem to get fairly-rare corrupted files after recent system update(s?)
Message-ID:  <E52FCE89-81CB-4F8D-869F-D32C883F14A3@yahoo.com>
In-Reply-To: <0006EB30-B9F9-465A-8B9A-A0C03899CEFC@yahoo.com>
References:  <2CA61249-321C-45AA-9755-597146AB8E9F@yahoo.com> <65AA4BCD-EC4B-4A19-B750-C7FC6E5ADDF5@yahoo.com> <E7C678B0-B0E1-4802-9362-9C2C92558202@yahoo.com> <9BF4F65B-6437-4D88-AF34-9BCFBF90D6F3@yahoo.com> <F2DCCBC2-12A7-48C7-A6D0-1BD626B87890@yahoo.com> <4B591638-4693-4403-8549-88D7A1D9D669@yahoo.com> <0006EB30-B9F9-465A-8B9A-A0C03899CEFC@yahoo.com>

On 2021-Nov-19, at 22:20, Mark Millard <marklmi@yahoo.com> wrote:

> On 2021-Nov-18, at 12:15, Mark Millard <marklmi@yahoo.com> wrote:
>
>> On 2021-Nov-17, at 11:17, Mark Millard <marklmi@yahoo.com> wrote:
>>
>>> On 2021-Nov-15, at 15:43, Mark Millard <marklmi@yahoo.com> wrote:
>>>
>>>> On 2021-Nov-15, at 13:13, Mark Millard <marklmi@yahoo.com> wrote:
>>>>
>>>>> On 2021-Nov-15, at 12:51, Mark Millard <marklmi@yahoo.com> wrote:
>>>>>
>>>>>> On 2021-Nov-15, at 11:31, Mark Millard <marklmi@yahoo.com> wrote:
>>>>>>
>>>>>>> I updated from (shown a system that I've not updated yet):
>>>>>>>
>>>>>>> # uname -apKU
>>>>>>> FreeBSD CA72_4c8G_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #18 main-n250455-890cae197737-dirty: Thu Nov  4 13:43:17 PDT 2021     root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72  arm64 aarch64 1400040 1400040
>>>>>>>
>>>>>>> to:
>>>>>>>
>>>>>>> # uname -apKU
>>>>>>> FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #19 main-n250667-20aa359773be-dirty: Sun Nov 14 02:57:32 PST 2021     root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72  arm64 aarch64 1400042 1400042
>>>>>>>
>>>>>>> and then updated /usr/ports/ and started poudriere-devel based builds of
>>>>>>> the ports I'd set up to use. However, my last round of port builds from
>>>>>>> a general update of /usr/ports/ was on 2021-10-23, before either of the
>>>>>>> above.
>>>>>>>
>>>>>>> I've had at least two files that seem to be corrupted, where a later part
>>>>>>> of the build hits problematical file(s) from earlier build activity. For
>>>>>>> example:
>>>>>>>
>>>>>>> /usr/local/include/X11/extensions/XvMC.h:1:1: warning: null character ignored [-Wnull-character]
>>>>>>> <U+0000>
>>>>>>> ^
>>>>>>> /usr/local/include/X11/extensions/XvMC.h:1:2: warning: null character ignored [-Wnull-character]
>>>>>>> <U+0000><U+0000>
>>>>>>>  ^
>>>>>>> /usr/local/include/X11/extensions/XvMC.h:1:3: warning: null character ignored [-Wnull-character]
>>>>>>> <U+0000><U+0000><U+0000>
>>>>>>>          ^
>>>>>>> /usr/local/include/X11/extensions/XvMC.h:1:4: warning: null character ignored [-Wnull-character]
>>>>>>> <U+0000><U+0000><U+0000><U+0000>
>>>>>>>                  ^
>>>>>>> . . .
>>>>>>>
>>>>>>> Removing the xorgproto-2021.4 package and rebuilding via
>>>>>>> poudriere-devel did not get a failure of any ports dependent
>>>>>>> on it.
>>>>>>>
>>>>>>> This was from a use of:
>>>>>>>
>>>>>>> # poudriere jail -j13_0R-CA7 -i
>>>>>>> Jail name:         13_0R-CA7
>>>>>>> Jail version:      13.0-RELEASE-p5
>>>>>>> Jail arch:         arm.armv7
>>>>>>> Jail method:       null
>>>>>>> Jail mount:        /usr/obj/DESTDIRs/13_0R-CA7-poud
>>>>>>> Jail fs:
>>>>>>> Jail updated:      2021-11-04 01:48:49
>>>>>>> Jail pkgbase:      disabled
>>>>>>>
>>>>>>> but another not-investigated example was from:
>>>>>>>
>>>>>>> # poudriere jail -j13_0R-CA72 -i
>>>>>>> Jail name:         13_0R-CA72
>>>>>>> Jail version:      13.0-RELEASE-p5
>>>>>>> Jail arch:         arm64.aarch64
>>>>>>> Jail method:       null
>>>>>>> Jail mount:        /usr/obj/DESTDIRs/13_0R-CA72-poud
>>>>>>> Jail fs:
>>>>>>> Jail updated:      2021-11-04 01:48:01
>>>>>>> Jail pkgbase:      disabled
>>>>>>>
>>>>>>> (so no 32-bit COMPAT involved). The apparent corruption
>>>>>>> was in a different port (autoconf, noticed by the
>>>>>>> build of automake failing via config reporting
>>>>>>> /usr/local/share/autoconf-2.69/autoconf/autoconf.m4f
>>>>>>> being rejected).
>>>>>>>
>>>>>>> /usr/obj/DESTDIRs/13_0R-CA7-poud/ and
>>>>>>> /usr/obj/DESTDIRs/13_0R-CA72-poud/ and the like track the
>>>>>>> system versions.
>>>>>>>
>>>>>>> The media is an Optane 960 in the PCIe slot of a HoneyComb
>>>>>>> (16 Cortex-A72's). The context is a root on ZFS one, ZFS
>>>>>>> used in order to have bectl, not redundancy.
>>>>>>>
>>>>>>> The ThreadRipper 1950X (so amd64) port builds did not give
>>>>>>> evidence of such problems based on the updated system. (Also
>>>>>>> Optane media in a PCIe slot, also root on ZFS.) But the
>>>>>>> errors seem rare enough to not be able to conclude much.
>>>>>>
>>>>>> For aarch64 targeting aarch64 there was also this
>>>>>> explicit corruption notice during the poudriere(-devel)
>>>>>> bulk build:
>>>>>>
>>>>>> . . .
>>>>>> [CA72_ZFS] Extracting arm-none-eabi-gcc-8.4.0_3: .........
>>>>>> pkg-static: Fail to extract /usr/local/libexec/gcc/arm-none-eabi/8.4.0/lto1 from package: Lzma library error: Corrupted input data
>>>>>> [CA72_ZFS] Extracting arm-none-eabi-gcc-8.4.0_3... done
>>>>>>
>>>>>> Failed to install the following 1 package(s): /packages/All/arm-none-eabi-gcc-8.4.0_3.pkg
>>>>>> *** Error code 1
>>>>>> Stop.
>>>>>> make: stopped in /usr/ports/sysutils/u-boot-orangepi-plus-2e
>>>>>>
>>>>>> I'm not yet to the point of retrying after removing
>>>>>> arm-none-eabi-gcc-8.4.0_3 : other things are being built.
>>>>>
>>>>>
>>>>> Another point of context for my prior general update of /usr/ports/
>>>>> and the matching port builds: Back then I used USE_TMPFS=all
>>>>> but the failures are with USE_TMPFS="data" in use instead. So:
>>>>> lots more I/O.
>>>>>
>>>>
>>>> None of the 3 corruptions repeated during bulk builds that
>>>> retried the builds that generated the files. All of the
>>>> ports that failed by hitting the corruptions in what they
>>>> depended on built fine in the retries.
>>>>
>>>> For reference:
>>>>
>>>> I'll note that, back when I was using USE_TMPFS=all , I also
>>>> did some separate bulk -a test runs, both aarch64 (Cortex-A72)
>>>> native and Cortex-A72 targeting Cortex-A7 (armv7). None of
>>>> those showed evidence of file corruptions. In general I've
>>>> not had previous file corruptions with this system. (There
>>>> was a little more than 245 GiBytes swap, which covered the
>>>> tmpfs needs when they were large.)
>>>
>>>
>>> I set up a contrasting test context and got no evidence of
>>> corruptions in that context. (Note: the 3 bulk builds
>>> total to around 24 hrs of activity for the 3 examples
>>> of 460+ ports building.) So, for the Cortex-A72 system,
>>
>> I set up a UFS on Optane (U.2 via M.2 adapter) context and
>> also got no evidence of corruptions in that context (same
>> hardware and a copy of the USB3 SSD based system). The
>> sequence of 3 bulks took somewhat over 18 hrs using the
>> Optane.
>>
>>> root on UFS on portable USB3 SSD:   no evidence of corruptions
>> Also:
>> root on UFS on Optane U.2 via M.2:  no evidence of corruptions
>>> vs.:
>>> root on ZFS on Optane in PCIe slot: solid evidence of 3 known corruptions
>>>
>>> Both had USE_TMPFS="data" in use. The same system build
>>> had been installed and booted for both tests.
>>>
>>> The evidence of corruptions is rare enough for this not to
>>> be determinative, but it is suggestive.
>>>
>>> Unfortunately, ZFS vs. UFS and Optane-in-PCIe vs. USB3 are
>>> not differentiated by this test result.
>>>
>>> There is also the result that I've not seen evidence of
>>> corruptions on the ThreadRipper 1950X (amd64) system.
>>> Again, not determinative, but suggestive, given how rare
>>> the corruptions seem to be.
>>
>> So far the only things unique to the observed corruptions are:
>>
>> root on ZFS context (vs. root on UFS)
>> and:
>> Optane in a PCIe slot (but no contrasting ZFS case tested)
>>
>> The PCIe slot does not seem to me to be likely to be contributing.
>> So this seems to be suggestive of a ZFS problem.
>>
>> A contributing point might be that the main [so: 14] system was
>> built via -mcpu=cortex-a72 for execution on a Cortex-A72 system.
>>
>> [I previously ran into a USB subsystem mishandling of keeping
>> things coherent for the weak memory ordering in this sort of
>> context. That issue was fixed. But back then I was lucky enough
>> to be able to demonstrate fails vs. works by adding an
>> appropriate instruction to FreeBSD in a few specific places
>> (more than necessary as it turned out). Someone else determined
>> where the actual mishandling was that covered all required
>> places. My generating that much information in this context
>> seems unlikely.]
>
>
> I started a retry of root-on-ZFS with the Optane-in-PCIe-slot media
> and it got its first corruption (in a different place, 2nd bulk
> build this time). The use of the corrupted file reports:
>
> configure:13269: cc -o conftest -Wall -Wextra -fsigned-char -Wdeclaration-after-statement -O2 -pipe -mcpu=cortex-a53  -g -fstack-protector-strong -fno-strict-aliasing  -DUSE_MEMORY_H -I/usr/local/include -mcpu=cortex-a53  -fstack-protector-strong  conftest.c  -L/usr/local/lib -logg >&5
> In file included from conftest.c:27:
> In file included from /usr/local/include/ogg/ogg.h:24:
> In file included from /usr/local/include/ogg/os_types.h:154:
> /usr/local/include/ogg/config_types.h:1:1: warning: null character ignored [-Wnull-character]
> <U+0000>
> ^
> /usr/local/include/ogg/config_types.h:1:2: warning: null character ignored [-Wnull-character]
> <U+0000><U+0000>
>        ^
> /usr/local/include/ogg/config_types.h:1:3: warning: null character ignored [-Wnull-character]
> <U+0000><U+0000><U+0000>
>                ^
> . . .
> /usr/local/include/ogg/config_types.h:1:538: warning: null character ignored [-Wnull-character]
> . . . (nulls) . . .
>
> So: 538 such null bytes.
>
> Thus, another example of something like a page of nulls being
> written out when ZFS is in use.
>
> audio/gstreamer1-plugins-ogg also failed via referencing the file
> during its build.
>
> (The bulk run is still going and there is one more bulk run to go.)
>

Well, 538 happens to be the size of config_types.h, and also the size
of the config_types.h from a build that did not get the corruption
there. (So the whole file was nulls, not just part of it.)
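
For reference, comparing a file's count of leading null bytes with its
size can be scripted. The following is only a minimal sketch of such a
check, not what was actually run here; the helper names
count_leading_nulls and is_entirely_null are hypothetical.

```python
import os


def count_leading_nulls(path, chunk_size=65536):
    """Return how many consecutive 0x00 bytes the file starts with."""
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                # Reached end of file without seeing a non-null byte.
                return count
            stripped = chunk.lstrip(b"\x00")
            count += len(chunk) - len(stripped)
            if stripped:
                # A non-null byte was found; stop counting.
                return count


def is_entirely_null(path):
    """True if the file is non-empty and every byte in it is 0x00."""
    size = os.path.getsize(path)
    return size > 0 and count_leading_nulls(path) == size
```

If the leading-null count equals the file size, the file has the right
length but none of its content.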

So looking at the other (later) corruption, which was a bigger file
(looking via bulk -i and installing what contained the file but
looking from outside the jail):

# find /usr/local/ -name "libtextstyle.so*" -exec ls -Tld {} \;
-rwxr-xr-x  1 root  wheel  2339104 Nov 20 01:05:05 2021 /usr/local/poudriere/data/.m/13_0R-CA7-default/ref/usr/local/lib/libtextstyle.so.0.1.1
lrwxr-xr-x  1 root  wheel  21 Nov 20 01:05:05 2021 /usr/local/poudriere/data/.m/13_0R-CA7-default/ref/usr/local/lib/libtextstyle.so.0 -> libtextstyle.so.0.1.1
lrwxr-xr-x  1 root  wheel  21 Nov 20 01:05:05 2021 /usr/local/poudriere/data/.m/13_0R-CA7-default/ref/usr/local/lib/libtextstyle.so -> libtextstyle.so.0.1.1

# hd /usr/local/poudriere/data/.m/13_0R-CA7-default/ref/usr/local/lib/libtextstyle.so.0.1.1 | more
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
0023b120

So the whole file, over 2 MiBytes of it, ended up with just null bytes.
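
The last offset that hd prints (0023b120) is the file length in
hexadecimal, so it can be cross-checked against the size that ls
reported (2339104). A small sketch of that arithmetic, with variable
names that are just illustrative:

```python
# Final offset from the hd output, in hexadecimal.
hd_end_offset = int("0023b120", 16)

# Size in bytes from the ls -Tld output.
ls_size = 2339104

# True when the zero run reported by hd spans the whole file.
size_matches = hd_end_offset == ls_size
print(size_matches)

# The same length expressed in MiBytes.
print(hd_end_offset / 2**20)
```

The two values agree, so the run of zeros covers the file's full
length.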

To cross check on live system caching vs. on disk, I rebooted and redid
the bulk -i based install of libtextstyle and looked at
libtextstyle.so.0.1.1 : still all zeros.
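
Since corruption of this kind only shows up when something later reads
the file, one way to look for other instances is to walk an installed
tree and list regular files that are non-empty but contain only null
bytes. This is a minimal sketch under that assumption; the function
names and the idea of pointing it at a poudriere reference jail's
usr/local are illustrative, not something done in this thread. A file
that is only partly zeroed would not be reported.

```python
import os
import stat


def is_all_zero(path, chunk_size=1 << 20):
    """True if the file has at least one byte and all bytes are 0x00."""
    saw_data = False
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return saw_data
            saw_data = True
            if chunk.count(0) != len(chunk):
                return False


def find_all_zero_files(root):
    """Return a sorted list of non-empty regular files under root that are all zeros."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
                if (stat.S_ISREG(st.st_mode) and st.st_size > 0
                        and is_all_zero(path)):
                    hits.append(path)
            except OSError:
                # Unreadable or vanished file: skip it.
                continue
    return sorted(hits)
```

For example, find_all_zero_files("/some/jail/usr/local") (hypothetical
path) would return the paths to inspect further with hd.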

For reference, zpool scrub afterward resulted in:

# zpool status
  pool: zopt0
 state: ONLINE
  scan: scrub repaired 0B in 00:01:49 with 0 errors on Sat Nov 20 11:47:31 2021
config:

        NAME        STATE     READ WRITE CKSUM
        zopt0       ONLINE       0     0     0
          nda1p3    ONLINE       0     0     0

But this is not a ZFS redundancy context: ZFS is used just in order to
have bectl.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)
