Date: Sat, 20 Nov 2021 11:54:09 -0800
From: Mark Millard via freebsd-current <freebsd-current@freebsd.org>
To: freebsd-current <freebsd-current@freebsd.org>, "freebsd-arm@freebsd.org" <arm@freebsd.org>
Subject: Re: aarch64(?) poudiere-devel based builds seem to get fairly-rare corrupted files after recent system update(s?)
Message-ID: <E52FCE89-81CB-4F8D-869F-D32C883F14A3@yahoo.com>
In-Reply-To: <0006EB30-B9F9-465A-8B9A-A0C03899CEFC@yahoo.com>
References: <2CA61249-321C-45AA-9755-597146AB8E9F@yahoo.com> <65AA4BCD-EC4B-4A19-B750-C7FC6E5ADDF5@yahoo.com> <E7C678B0-B0E1-4802-9362-9C2C92558202@yahoo.com> <9BF4F65B-6437-4D88-AF34-9BCFBF90D6F3@yahoo.com> <F2DCCBC2-12A7-48C7-A6D0-1BD626B87890@yahoo.com> <4B591638-4693-4403-8549-88D7A1D9D669@yahoo.com> <0006EB30-B9F9-465A-8B9A-A0C03899CEFC@yahoo.com>
On 2021-Nov-19, at 22:20, Mark Millard <marklmi@yahoo.com> wrote:

> On 2021-Nov-18, at 12:15, Mark Millard <marklmi@yahoo.com> wrote:
>
>> On 2021-Nov-17, at 11:17, Mark Millard <marklmi@yahoo.com> wrote:
>>
>>> On 2021-Nov-15, at 15:43, Mark Millard <marklmi@yahoo.com> wrote:
>>>
>>>> On 2021-Nov-15, at 13:13, Mark Millard <marklmi@yahoo.com> wrote:
>>>>
>>>>> On 2021-Nov-15, at 12:51, Mark Millard <marklmi@yahoo.com> wrote:
>>>>>
>>>>>> On 2021-Nov-15, at 11:31, Mark Millard <marklmi@yahoo.com> wrote:
>>>>>>
>>>>>>> I updated from (shown on a system that I've not updated yet):
>>>>>>>
>>>>>>> # uname -apKU
>>>>>>> FreeBSD CA72_4c8G_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #18 main-n250455-890cae197737-dirty: Thu Nov 4 13:43:17 PDT 2021 root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72 arm64 aarch64 1400040 1400040
>>>>>>>
>>>>>>> to:
>>>>>>>
>>>>>>> # uname -apKU
>>>>>>> FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #19 main-n250667-20aa359773be-dirty: Sun Nov 14 02:57:32 PST 2021 root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72 arm64 aarch64 1400042 1400042
>>>>>>>
>>>>>>> and then updated /usr/ports/ and started poudriere-devel based builds of
>>>>>>> the ports I'd set up to use. However my last round of port builds from
>>>>>>> a general update of /usr/ports/ was on 2021-10-23, before either of the
>>>>>>> above.
>>>>>>>
>>>>>>> I've had at least two files that seem to be corrupted, where a later part
>>>>>>> of the build hits problematical file(s) from earlier build activity.
>>>>>>> For example:
>>>>>>>
>>>>>>> /usr/local/include/X11/extensions/XvMC.h:1:1: warning: null character ignored [-Wnull-character]
>>>>>>> <U+0000>
>>>>>>> ^
>>>>>>> /usr/local/include/X11/extensions/XvMC.h:1:2: warning: null character ignored [-Wnull-character]
>>>>>>> <U+0000><U+0000>
>>>>>>> ^
>>>>>>> /usr/local/include/X11/extensions/XvMC.h:1:3: warning: null character ignored [-Wnull-character]
>>>>>>> <U+0000><U+0000><U+0000>
>>>>>>> ^
>>>>>>> /usr/local/include/X11/extensions/XvMC.h:1:4: warning: null character ignored [-Wnull-character]
>>>>>>> <U+0000><U+0000><U+0000><U+0000>
>>>>>>> ^
>>>>>>> . . .
>>>>>>>
>>>>>>> Removing the xorgproto-2021.4 package and rebuilding via
>>>>>>> poudriere-devel did not get a failure of any ports dependent
>>>>>>> on it.
>>>>>>>
>>>>>>> This was from a use of:
>>>>>>>
>>>>>>> # poudriere jail -j13_0R-CA7 -i
>>>>>>> Jail name: 13_0R-CA7
>>>>>>> Jail version: 13.0-RELEASE-p5
>>>>>>> Jail arch: arm.armv7
>>>>>>> Jail method: null
>>>>>>> Jail mount: /usr/obj/DESTDIRs/13_0R-CA7-poud
>>>>>>> Jail fs:
>>>>>>> Jail updated: 2021-11-04 01:48:49
>>>>>>> Jail pkgbase: disabled
>>>>>>>
>>>>>>> but another not-investigated example was from:
>>>>>>>
>>>>>>> # poudriere jail -j13_0R-CA72 -i
>>>>>>> Jail name: 13_0R-CA72
>>>>>>> Jail version: 13.0-RELEASE-p5
>>>>>>> Jail arch: arm64.aarch64
>>>>>>> Jail method: null
>>>>>>> Jail mount: /usr/obj/DESTDIRs/13_0R-CA72-poud
>>>>>>> Jail fs:
>>>>>>> Jail updated: 2021-11-04 01:48:01
>>>>>>> Jail pkgbase: disabled
>>>>>>>
>>>>>>> (so no 32-bit COMPAT involved). The apparent corruption
>>>>>>> was in a different port (autoconf, noticed by the
>>>>>>> build of automake failing via config reporting
>>>>>>> /usr/local/share/autoconf-2.69/autoconf/autoconf.m4f
>>>>>>> being rejected).
>>>>>>>
>>>>>>> /usr/obj/DESTDIRs/13_0R-CA7-poud/ and
>>>>>>> /usr/obj/DESTDIRs/13_0R-CA72-poud/ and the like track the
>>>>>>> system versions.
>>>>>>>
>>>>>>> The media is an Optane 960 in the PCIe slot of a HoneyComb
>>>>>>> (16 Cortex-A72's). The context is a root on ZFS one, ZFS
>>>>>>> used in order to have bectl, not redundancy.
>>>>>>>
>>>>>>> The ThreadRipper 1950X (so amd64) port builds did not give
>>>>>>> evidence of such problems based on the updated system. (Also
>>>>>>> Optane media in a PCIe slot, also root on ZFS.) But the
>>>>>>> errors seem rare enough to not be able to conclude much.
>>>>>>
>>>>>> For aarch64 targeting aarch64 there was also this
>>>>>> explicit corruption notice during the poudriere(-devel)
>>>>>> bulk build:
>>>>>>
>>>>>> . . .
>>>>>> [CA72_ZFS] Extracting arm-none-eabi-gcc-8.4.0_3: .........
>>>>>> pkg-static: Fail to extract /usr/local/libexec/gcc/arm-none-eabi/8.4.0/lto1 from package: Lzma library error: Corrupted input data
>>>>>> [CA72_ZFS] Extracting arm-none-eabi-gcc-8.4.0_3... done
>>>>>>
>>>>>> Failed to install the following 1 package(s): /packages/All/arm-none-eabi-gcc-8.4.0_3.pkg
>>>>>> *** Error code 1
>>>>>> Stop.
>>>>>> make: stopped in /usr/ports/sysutils/u-boot-orangepi-plus-2e
>>>>>>
>>>>>> I'm not yet to the point of retrying after removing
>>>>>> arm-none-eabi-gcc-8.4.0_3 : other things are being built.
>>>>>
>>>>>
>>>>> Another context with my prior general update of /usr/ports/
>>>>> and the matching port builds: Back then I used USE_TMPFS=all
>>>>> but the failure is based on USE_TMPFS="data" instead. So:
>>>>> lots more I/O.
>>>>>
>>>>
>>>> None of the 3 corruptions repeated during bulk builds that
>>>> retried the builds that generated the files. All of the
>>>> ports that failed by hitting the corruptions in what they
>>>> depended on built fine in the retries.
>>>>
>>>> For reference:
>>>>
>>>> I'll note that, back when I was using USE_TMPFS=all , I also
>>>> did some separate bulk -a test runs, both aarch64 (Cortex-A72)
>>>> native and Cortex-A72 targeting Cortex-A7 (armv7).
>>>> None of those showed evidence of file corruptions. In general I've
>>>> not had previous file corruptions with this system. (There
>>>> was a little more than 245 GiBytes swap, which covered the
>>>> tmpfs needs when they were large.)
>>>
>>>
>>> I set up a contrasting test context and got no evidence of
>>> corruptions in that context. (Note: the 3 bulk builds
>>> total to around 24 hrs of activity for the 3 examples
>>> of 460+ ports building.) So, for the Cortex-A72 system,
>>
>> I set up a UFS on Optane (U.2 via M.2 adapter) context and
>> also got no evidence of corruptions in that context (same
>> hardware and a copy of the USB3 SSD based system). The
>> sequence of 3 bulks took somewhat over 18 hrs using the
>> Optane.
>>
>>> root on UFS on portable USB3 SSD: no evidence of corruptions
>> Also:
>> root on UFS on Optane U.2 via M.2: no evidence of corruptions
>>> vs.:
>>> root on ZFS on Optane in PCIe slot: solid evidence of 3 known corruptions
>>>
>>> Both had USE_TMPFS="data" in use. The same system build
>>> had been installed and booted for both tests.
>>>
>>> The evidence of corruptions is rare enough for this not to
>>> be determinative, but it is suggestive.
>>>
>>> Unfortunately, ZFS vs. UFS and Optane-in-PCIe vs. USB3 are
>>> not differentiated by this test result.
>>>
>>> There is also the result that I've not seen evidence of
>>> corruptions on the ThreadRipper 1950X (amd64) system.
>>> Again, not determinative, but suggestive, given how rare
>>> the corruptions seem to be.
>>
>> So far the only things unique to the observed corruptions are:
>>
>> root on ZFS context (vs. root on UFS)
>> and:
>> Optane in a PCIe slot (but no contrasting ZFS case tested)
>>
>> The PCIe slot does not seem to me to be likely to be contributing.
>> So this seems to be suggestive of a ZFS problem.
>>
>> A contributing point might be that the main [so: 14] system was
>> built via -mcpu=cortex-a72 for execution on a Cortex-A72 system.
>>
>> [I previously ran into a USB subsystem mishandling of keeping
>> things coherent for the weak memory ordering in this sort of
>> context. That issue was fixed. But back then I was lucky enough
>> to be able to demonstrate fails vs. works by adding an
>> appropriate instruction to FreeBSD in a few specific places
>> (more than necessary as it turned out). Someone else determined
>> where the actual mishandling was that covered all required
>> places. My generating that much information in this context
>> seems unlikely.]
>
>
> I started a retry of root-on-ZFS with the Optane-in-PCIe-slot media
> and it got its first corruption (in a different place, 2nd bulk
> build this time). The use of the corrupted file reports:
>
> configure:13269: cc -o conftest -Wall -Wextra -fsigned-char -Wdeclaration-after-statement -O2 -pipe -mcpu=cortex-a53 -g -fstack-protector-strong -fno-strict-aliasing -DUSE_MEMORY_H -I/usr/local/include -mcpu=cortex-a53 -fstack-protector-strong conftest.c -L/usr/local/lib -logg >&5
> In file included from conftest.c:27:
> In file included from /usr/local/include/ogg/ogg.h:24:
> In file included from /usr/local/include/ogg/os_types.h:154:
> /usr/local/include/ogg/config_types.h:1:1: warning: null character ignored [-Wnull-character]
> <U+0000>
> ^
> /usr/local/include/ogg/config_types.h:1:2: warning: null character ignored [-Wnull-character]
> <U+0000><U+0000>
> ^
> /usr/local/include/ogg/config_types.h:1:3: warning: null character ignored [-Wnull-character]
> <U+0000><U+0000><U+0000>
> ^
> . . .
> /usr/local/include/ogg/config_types.h:1:538: warning: null character ignored [-Wnull-character]
> . . . (nulls) . . .
>
> So: 538 such null bytes.
>
> Thus, another example of something like a page of nulls being
> written out when ZFS is in use.
>
> audio/gstreamer1-plugins-ogg also failed via referencing the file
> during its build.
>
> (The bulk run is still going and there is one more bulk run to go.)
>

Well, 538 happened to be the size of config_types.h, and also of the
config_types.h from a build that did not get the corruption there. So
that small file does not show how large a run of nulls can get.

So looking at the other (later) corruption, which was a bigger file
(looking via bulk -i and installing what contained the file but
looking from outside the jail):

# find /usr/local/ -name "libtextstyle.so*" -exec ls -Tld {} \;
-rwxr-xr-x 1 root wheel 2339104 Nov 20 01:05:05 2021 /usr/local/poudriere/data/.m/13_0R-CA7-default/ref/usr/local/lib/libtextstyle.so.0.1.1
lrwxr-xr-x 1 root wheel 21 Nov 20 01:05:05 2021 /usr/local/poudriere/data/.m/13_0R-CA7-default/ref/usr/local/lib/libtextstyle.so.0 -> libtextstyle.so.0.1.1
lrwxr-xr-x 1 root wheel 21 Nov 20 01:05:05 2021 /usr/local/poudriere/data/.m/13_0R-CA7-default/ref/usr/local/lib/libtextstyle.so -> libtextstyle.so.0.1.1

# hd /usr/local/poudriere/data/.m/13_0R-CA7-default/ref/usr/local/lib/libtextstyle.so.0.1.1 | more
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
0023b120

So the whole file, over 2 MiByte, ended up with just null bytes
(0x23b120 is 2339104, the size that ls reports).

To cross check on live system caching vs. on disk, I rebooted and redid
the bulk -i based install of libtextstyle and looked at
libtextstyle.so.0.1.1 : still all zeros.

For reference, zpool scrub afterward resulted in:

# zpool status
  pool: zopt0
 state: ONLINE
  scan: scrub repaired 0B in 00:01:49 with 0 errors on Sat Nov 20 11:47:31 2021
config:

        NAME        STATE     READ WRITE CKSUM
        zopt0       ONLINE       0     0     0
          nda1p3    ONLINE       0     0     0

But it is not a ZFS redundancy context: ZFS is used just to have bectl .

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)
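
For anyone wanting to repeat the all-zeros check on more than one file,
the hd inspection of libtextstyle.so.0.1.1 can be approximated with a
small sh sketch. This is only an illustration, not something that was
run for this report: the function names nul_stats, is_all_nul and
find_all_nul are hypothetical, and it assumes nothing beyond standard
wc, tr and find. File names containing newlines are not handled.

```sh
#!/bin/sh
# Hypothetical helpers for spotting files that consist only of NUL bytes.

# Print "<size> <nul_count>" for one file.
nul_stats() {
    size=$(wc -c < "$1")
    # Bytes left after deleting every NUL byte.
    non_nul=$(tr -d '\000' < "$1" | wc -c)
    # $(( )) also strips the leading blanks that BSD wc emits.
    echo "$(($size)) $(($size - $non_nul))"
}

# Succeed only for a non-empty file whose every byte is NUL.
is_all_nul() {
    [ -s "$1" ] || return 1
    set -- $(nul_stats "$1")
    [ "$1" -eq "$2" ]
}

# List every regular file under a directory that is entirely NUL bytes.
find_all_nul() {
    find "$1" -type f | while IFS= read -r f; do
        if is_all_nul "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

With the functions loaded, something like
find_all_nul /usr/local/poudriere/data/.m/13_0R-CA7-default/ref/usr/local
would list candidate files after a bulk -i install. It reads every file
in full, so it is slow on large trees, and it says nothing about files
that are only partly zeroed; nul_stats on a single file gives the size
and NUL count for that case.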