Date: Sat, 15 Apr 2023 15:33:28 -0700
From: Mark Millard <marklmi@yahoo.com>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: FreeBSD User <freebsd@walstatt-de.de>, Cy Schubert <Cy.Schubert@cschubert.com>, Charlie Li <vishwin@freebsd.org>, Pawel Jakub Dawidek <pjd@freebsd.org>, dev-commits-src-main@freebsd.org, Current FreeBSD <freebsd-current@freebsd.org>
Subject: Re: git: 2a58b312b62f - main - zfs: merge openzfs/zfs@431083f75
Message-ID: <62F8DD62-8E09-4A07-926B-5DE3DB399609@yahoo.com>
In-Reply-To: <CAGudoHHnimmsgXTBjrcY=FiYnQCoh7m8zhBM4BPYHoFy+ihUxQ@mail.gmail.com>
References: <20230413071032.18BFF31F@slippy.cwsent.com> <D0D9BD06-C321-454C-A038-C55C63E0DD6B@dawidek.net> <20230413063321.60344b1f@cschubert.com> <CAGudoHG3rCx93gyJTmzTBnSe4fQ9=m4mBESWbKVWtAGRxen_4w@mail.gmail.com> <20230413135635.6B62F354@slippy.cwsent.com> <c41f9ed6-e557-9255-5a46-1a22d4b32d66@dawidek.net> <319a267e-3f76-3647-954a-02178c260cea@dawidek.net> <b60807e9-f393-6e6d-3336-042652ddd03c@freebsd.org> <441db213-2abb-b37e-e5b3-481ed3e00f96@dawidek.net> <5ce72375-90db-6d30-9f3b-a741c320b1bf@freebsd.org> <99382FF7-765C-455F-A082-C47DB4D5E2C1@yahoo.com> <32cad878-726c-4562-0971-20d5049c28ad@freebsd.org> <ABC9F3DB-289E-455E-AF43-B3C13525CB2C@yahoo.com> <20230415115452.08911bb7@thor.intern.walstatt.dynvpn.de> <20230415143625.99388387@slippy.cwsent.com> <20230415175218.777d0a97@thor.intern.walstatt.dynvpn.de> <CAGudoHHnimmsgXTBjrcY=FiYnQCoh7m8zhBM4BPYHoFy+ihUxQ@mail.gmail.com>
On Apr 15, 2023, at 13:30, Mateusz Guzik <mjguzik@gmail.com> wrote:

> On 4/15/23, FreeBSD User <freebsd@walstatt-de.de> wrote:
>> On Sat, 15 Apr 2023 07:36:25 -0700, Cy Schubert
>> <Cy.Schubert@cschubert.com> wrote:
>>
>>> In message <20230415115452.08911bb7@thor.intern.walstatt.dynvpn.de>,
>>> FreeBSD User writes:
>>>> On Thu, 13 Apr 2023 22:18:04 -0700, Mark Millard
>>>> <marklmi@yahoo.com> wrote:
>>>>
>>>>> On Apr 13, 2023, at 21:44, Charlie Li <vishwin@freebsd.org> wrote:
>>>>>
>>>>>> Mark Millard wrote:
>>>>>>> FYI: in my original report for a context that has never had
>>>>>>> block_cloning enabled, I reported BOTH missing files and
>>>>>>> file content corruption in the poudriere-devel bulk build
>>>>>>> testing. This predates:
>>>>>>> https://people.freebsd.org/~pjd/patches/brt_revert.patch
>>>>>>> but had the changes from:
>>>>>>> https://github.com/openzfs/zfs/pull/14739/files
>>>>>>> The files were missing from packages installed to be used
>>>>>>> during a port's build. No other types of examples of missing
>>>>>>> files happened. (But only 11 ports failed.)
>>>>>> I also don't have block_cloning enabled. "Missing files" prior to
>>>>>> brt_revert may actually be present, but as the corruption also
>>>>>> messes with the file(1) signature, some tools like ldconfig
>>>>>> report them as missing.
>>>>>
>>>>> For reference, the specific messages that were not explicit
>>>>> null-byte complaints were (some shown with a little context):
>>>>>
>>>>> ===> py39-lxml-4.9.2 depends on shared library: libxml2.so - not found
>>>>> ===> Installing existing package /packages/All/libxml2-2.10.3_1.pkg
>>>>> [CA72_ZFS] Installing libxml2-2.10.3_1...
>>>>> [CA72_ZFS] Extracting libxml2-2.10.3_1: .......... done
>>>>> ===> py39-lxml-4.9.2 depends on shared library: libxml2.so - found
>>>>> (/usr/local/lib/libxml2.so)
>>>>> . . .
>>>>> [CA72_ZFS] Extracting libxslt-1.1.37: .......... done
>>>>> ===> py39-lxml-4.9.2 depends on shared library: libxslt.so - found
>>>>> (/usr/local/lib/libxslt.so)
>>>>> ===> Returning to build of py39-lxml-4.9.2
>>>>> . . .
>>>>> ===> Configuring for py39-lxml-4.9.2
>>>>> Building lxml version 4.9.2.
>>>>> Building with Cython 0.29.33.
>>>>> Error: Please make sure the libxml2 and libxslt development
>>>>> packages are installed.
>>>>>
>>>>> [CA72_ZFS] Extracting libunistring-1.1: .......... done
>>>>> ===> libidn2-2.3.4 depends on shared library: libunistring.so - not found
>>>>>
>>>>> [CA72_ZFS] Extracting gmp-6.2.1: .......... done
>>>>> ===> mpfr-4.2.0,1 depends on shared library: libgmp.so - not found
>>>>>
>>>>> ===> nettle-3.8.1 depends on shared library: libgmp.so - not found
>>>>> ===> Installing existing package /packages/All/gmp-6.2.1.pkg
>>>>> [CA72_ZFS] Installing gmp-6.2.1...
>>>>> the most recent version of gmp-6.2.1 is already installed
>>>>> ===> nettle-3.8.1 depends on shared library: libgmp.so - not found
>>>>>
>>>>> *** Error code 1
>>>>>
>>>>> autom4te: error: need GNU m4 1.4 or later: /usr/local/bin/gm4
>>>>>
>>>>> checking for GNU M4 that supports accurate traces... configure:
>>>>> error: no acceptable m4 could be found in $PATH. GNU M4 1.4.6 or
>>>>> later is required; 1.4.16 or newer is recommended. GNU M4 1.4.15
>>>>> uses a buggy replacement strstr on some systems. Glibc 2.9 - 2.12
>>>>> and GNU M4 1.4.11 - 1.4.15 have another strstr bug.
>>>>>
>>>>> ld: error: /usr/local/lib/libblkid.a: unknown file type
>>>>>
>>>>> ===
>>>>> Mark Millard
>>>>> marklmi at yahoo.com
>>>>
>>>> Hello,
>>>>
>>>> what is the current status of fixing/mitigating this disastrous
>>>> bug, especially for those with the new option enabled on their ZFS
>>>> pools? Any advice?
>>>>
>>>> As a precaution (or call it panic) I shut down several servers to
>>>> prevent irreversible damage to databases and data storage. On one
>>>> host with /usr/ports residing on ZFS, we always see errors on the
>>>> same files created while staging (using portmaster, which leaves
>>>> the system with uninstalled software, i.e. www/apache24 in our
>>>> case). Deleting the work folder doesn't seem to change anything,
>>>> even after scrubbing the entire pool (a RAIDZ1 pool). The cause is
>>>> unknown, as is why it is always the same files that end up
>>>> corrupted. Same with devel/ruby-gems.
>>>>
>>>> Poudriere has been shut down for the time being to avoid further
>>>> issues.
>>>>
>>>> Is there any advice on how to proceed, apart from conserving the
>>>> boxes via shutdown?
>>>>
>>>> Thank you ;-)
>>>> oh
>>>>
>>>> --
>>>> O. Hartmann
>>>
>>> With an up-to-date tree + pjd@'s "Fix data corruption when cloning
>>> embedded blocks. #14739" patch I didn't have any issues, except for
>>> email messages with corruption in my sent directory, nowhere else.
>>> I'm still investigating the email messages issue. IMO one is
>>> generally safe to run poudriere on the latest ZFS with the
>>> additional patch.
>>>
>>> My tests of the additional patch concluded that it resolved my last
>>> problems, except for the sent email problem I'm still
>>> investigating. I'm sure there's a simple explanation for it, i.e.
>>> the email thread was corrupted by the EXDEV regression, which
>>> cannot be fixed by anything, even reverting to the previous ZFS --
>>> the data in those files will remain damaged regardless.
>>>
>>> I cannot speak to the others who have had poudriere and other
>>> issues. I never had any problems with poudriere on top of the new
>>> ZFS.
>>>
>>> WRT reverting block_cloning pools to without, your only option is
>>> to back up your pool and recreate it without block_cloning. Then
>>> restore your data.
>>
>> All right, I interpret the answer to mean that I need a most recent
>> source tree (and an OS built and installed accordingly) AND a patch
>> that isn't officially committed?
>>
>> On one box I'm running:
>>
>> FreeBSD 14.0-CURRENT #8 main-n262175-5ee1c90e50ce: Sat Apr 15
>> 07:57:16 CEST 2023 amd64
>>
>> The box is crashing while trying to update ports, with the
>> well-known issue:
>>
>> Panic String: VERIFY(!zil_replaying(zilog, tx)) failed
>>
>> At the moment all alarm bells are ringing and I have lost track of
>> what has been patched and already committed, and what is still only
>> available as a patch under evaluation or posted unofficially here.
>>
>> Regarding the EXDEV issue: in cases of poudriere or ports trees on
>> ZFS, what do I have to do to ensure that those datasets are clean?
>> The OS should detect file corruption, but in my case the box is
>> crashing :-(
>>
>> I scrubbed several times, but this seems to be the action of a
>> helpless and desperate man ... ;-/
>>
>> Greetings
>>
> Using block cloning is still not safe, but somewhere in this thread
> pjd had a patch to keep it operational for already cloned files
> without adding new ones.
>
> Anyhow, as was indicated by vishwin@ there was data corruption
> *unrelated* to block cloning which also came with the import; I
> narrowed it down: https://github.com/openzfs/zfs/issues/14753
>
> That said, I am now testing a kernel which does not do block cloning
> and does not have the other problematic commit; we will see if
> things work.

(Mostly written as I progressed, but some material was later inserted
into/around previously written material.)
Summary: As it stands, it looks like reverting the dnode_is_dirty code
is what fixes the corruptions that my type of test context produced
via poudriere bulk activity.

The details that lead to that summary . . .

I used my build environment for updating my temporary, experimental
context, an environment running a world and kernel that predate the
import:

# uname -apKU
FreeBSD CA72_4c8G_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #90 main-n261544-cee09bda03c8-dirty: Wed Mar 15 20:25:49 PDT 2023     root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72 arm64 aarch64 1400082 1400082

(Note the "nodbg": I normally run non-debug main builds, but with
symbols not stripped.)

The kernel and world for this are what is in old-main-CA72:

# bectl list
BE            Active Mountpoint Space Created
main-CA72     R      -          3.98G 2023-04-12 20:29
old-main-CA72 N      /          1.08M 2023-02-06 19:44

(Most everything else is outside the BEs and so is shared across the
BEs.)

I updated to also have (whitespace details likely not preserved in
this note):

# git -C /usr/main-src/ diff /usr/main-src/sys/contrib/openzfs/module/zfs/dnode.c
diff --git a/sys/contrib/openzfs/module/zfs/dnode.c b/sys/contrib/openzfs/module/zfs/dnode.c
index 367bfaa80726..49a7f59c0da4 100644
--- a/sys/contrib/openzfs/module/zfs/dnode.c
+++ b/sys/contrib/openzfs/module/zfs/dnode.c
@@ -1772,17 +1772,7 @@ dnode_is_dirty(dnode_t *dn)
 {
 	mutex_enter(&dn->dn_mtx);
 	for (int i = 0; i < TXG_SIZE; i++) {
-		list_t *list = &dn->dn_dirty_records[i];
-		for (dbuf_dirty_record_t *dr = list_head(list);
-		    dr != NULL; dr = list_next(list, dr)) {
-			if (dr->dr_dbuf == NULL ||
-			    (dr->dr_dbuf->db_blkid != DMU_BONUS_BLKID &&
-			    dr->dr_dbuf->db_blkid != DMU_SPILL_BLKID)) {
-				mutex_exit(&dn->dn_mtx);
-				return (B_TRUE);
-			}
-		}
-		if (dn->dn_free_ranges[i] != NULL) {
+		if (multilist_link_active(&dn->dn_dirty_link[i])) {
 			mutex_exit(&dn->dn_mtx);
 			return (B_TRUE);
 		}

I did my usual buildworld buildkernel sequence and then one of my
normal install sequences into main-CA72 to update it to have the
change, as well as the prior material involved in my first experiment
that I'd reported on.

I cleared the content of the jail that I use for temporary
experiments, such as the prior testing that got the 11 builder
failures:

# poudriere pkgclean -jmain-CA72-bulk_a -A

I then rebooted using the updated main-CA72 BE. Then I started the:

# poudriere bulk -jmain-CA72-bulk_a -w -f ~/origins/CA72-origins.txt
. . .
[00:00:37] Building 476 packages using up to 16 builders
[00:00:37] Hit CTRL+t at any time to see build progress and stats
[00:00:38] [01] [00:00:00] Builder starting
[00:00:40] [01] [00:00:02] Builder started
[00:00:40] [01] [00:00:00] Building ports-mgmt/pkg | pkg-1.19.1_1

In the prior experiment it got: 476 = 252 success + 11 failed + 213
skipped, and it reported the time for that as 00:37:52. A normal
from-scratch build takes many hours (multiple compiler toolchains and
such), so my first report after this point will be for one of:

A) It got to, say, 00:40:00 or beyond, with or without failures.
vs.
B) It got failures and stopped before that.

. . . TIME GOES BY . . .

At about 00:40:00 the status was:

[00:40:00] [06] [00:00:00] Building x11/libXv | libXv-1.0.12,1
load: 30.73  cmd: sh 1508 [nanslp] 2400.88r 6.69u 11.90s 0% 3960k
[main-CA72-bulk_a-default] [2023-04-15_14h47m19s] [parallel_build:] Queued: 476 Built: 235 Failed: 0 Skipped: 0 Ignored: 0 Fetched: 0 Tobuild: 241 Time: 00:40:01
 ID   TOTAL    ORIGIN                   PKGNAME                    PHASE          PHASE    TMPFS      CPU%   MEM%
[15]  00:07:44 devel/py-lxml@py39     | py39-lxml-4.9.2            stage          00:00:08 40.00 KiB  0%     0%
[01]  00:00:34 x11/libXxf86vm         | libXxf86vm-1.1.4_3         build-depends  00:00:03 56.00 KiB  2.3%   0%
[16]  00:01:59 x11-toolkits/libXt     | libXt-1.2.1,1              configure      00:00:52 40.00 KiB  0.3%   0%
[02]  00:01:40 devel/dbus             | dbus-1.14.6,1              configure      00:00:05 36.00 KiB  0.5%   0%
[03]  00:02:20 x11/libXrender         | libXrender-0.9.10_2        configure      00:01:27 40.00 KiB  0%     0%
[04]  00:00:44 graphics/libglvnd      | libglvnd-1.6.0             build-depends  00:00:13 52.00 KiB  20.3%  0.1%
[05]  00:01:39 x11/xprop              | xprop-1.2.6                configure      00:00:45 56.00 KiB  0.7%   0.2%
[06]  00:00:14 x11/libXv              | libXv-1.0.12,1             pkg-depends    00:00:03 52.00 KiB  3.6%   0%
[07]  00:01:57 x11/libXfixes          | libXfixes-6.0.0            configure      00:00:42 40.00 KiB  0.1%   0%
[08]  00:03:01 devel/glib20           | glib-2.76.1,2              configure      00:01:26 40.00 KiB  4.3%   0.1%
[09]  00:01:21 shells/bash-completion | bash-completion-2.11_2,2   configure      00:00:13 32.00 KiB  5.7%   0%
[10]  00:06:26 devel/qt5-buildtools   | qt5-buildtools-5.15.8p157  package        00:01:57 44.00 KiB  76.1%  0.1%
[11]  00:01:20 print/psutils          | psutils-1.17_5             stage          00:00:03 40.00 KiB  0.2%   0%
[12]  00:02:09 x11/libxkbfile         | libxkbfile-1.1.0           configure      00:01:22 44.00 KiB  0.1%   0%
[13]  00:08:54 devel/cmake-core       | cmake-core-3.25.1          build          00:01:43 36.00 KiB  694.9% 5.2%
[14]  00:01:20 x11/xcb-util-image     | xcb-util-image-0.4.0_1     configure      00:00:14 48.00 KiB  0%     0%
[00:40:14] Logs: /usr/local/poudriere/data/logs/bulk/main-CA72-bulk_a-default/2023-04-15_14h47m19s
[00:40:22] [11] [00:01:28] Finished print/psutils | psutils-1.17_5: Success

So no failures so far: 235 ports built and a bunch in process.

Note: maximum observed ("MaxObs") load averages so far are shown below
(for there being only 16 cores, i.e. 16 Cortex-A72 threads, one per
core):

load averages: 43.75, 34.32, 27.40  MaxObs: 45.03, 34.32, 27.40

I'll report again if it gets a corruption failure. Otherwise it will
be many hours before it would complete without such failures (multiple
compiler toolchain builds and such).

===
Mark Millard
marklmi at yahoo.com
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?62F8DD62-8E09-4A07-926B-5DE3DB399609>