Date: Sun, 23 May 2021 01:27:47 -0700
From: Mark Millard <marklmi@yahoo.com>
To: Rick Macklem <rmacklem@uoguelph.ca>
Cc: FreeBSD-STABLE Mailing List <freebsd-stable@freebsd.org>
Subject: Re: releng/13 release/13.0.0 : odd/incorrect diff result over nfs (in a zfs file systems context)
Message-ID: <6F0F0719-F029-4DE9-AEB8-5A9FF8303C6F@yahoo.com>
In-Reply-To: <47AE7DDF-F4BA-4632-BDCC-FB1F1AE30810@yahoo.com>
References: <623369D9-5EE5-4FEF-B9AD-56499E8F1C09.ref@yahoo.com>
 <623369D9-5EE5-4FEF-B9AD-56499E8F1C09@yahoo.com>
 <YQXPR0101MB0968B29934D7BD73FCA73907DD299@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM>
 <YTOPR0101MB0970A1257E4DD37335D5B52EDD299@YTOPR0101MB0970.CANPRD01.PROD.OUTLOOK.COM>
 <04D7264A-206B-4281-B452-779B01EA3327@yahoo.com>
 <34E915B3-30DF-408C-A931-C39188F3EB0F@yahoo.com>
 <E938DB30-22C9-4765-9E01-601D80B36910@yahoo.com>
 <YQXPR0101MB0968EA2F32C1EEB8CC8CAD9FDD299@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM>
 <D6842A56-95EC-4A2D-99E3-3DCF95C50F68@yahoo.com>
 <YQXPR0101MB096874849E9749010F4BDD5CDD299@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM>
 <508C3B05-79E5-49ED-8032-DA7DF249E154@yahoo.com>
 <YQXPR0101MB09683A5BE725EF50E590A391DD289@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM>
 <47AE7DDF-F4BA-4632-BDCC-FB1F1AE30810@yahoo.com>

On 2021-May-23, at 00:44, Mark Millard <marklmi at yahoo.com> wrote:

> On 2021-May-21, at 17:56, Rick Macklem <rmacklem at uoguelph.ca> wrote:
>
>> Mark Millard wrote:
>> [stuff snipped]
>>> Well, why is it that ls -R, find, and diff -r all get file
>>> name problems via genet0 but diff -r gets no problems
>>> comparing the content of files that it does match up (the
>>> vast majority)? Any clue how could the problems possibly
>>> be unique to the handling of file names/paths? Does it
>>> suggest anything else to look into for getting some more
>>> potentially useful evidence?
>> Well, all I can do is describe the most common TSO related
>> failure:
>> - When a read RPC reply (including NFS/RPC/TCP/IP headers)
>>   is slightly less than 64K bytes (many TSO implementations are
>>   limited to 64K or 32 discontiguous segments, think 32 2K
>>   mbuf clusters), the driver decides it is ok, but when the MAC
>>   header is added it exceeds what the hardware can handle correctly...
>> --> This will happen when reading a regular file that is slightly less
>>     than a multiple of 64K in size.
>> or
>> --> This will happen when reading just about any large directory,
>>     since the directory reply for a 64K request is converted to Sun XDR
>>     format and clipped at the last full directory entry that will fit within 64K.
>> For ports, where most files are small, I think you can tell which is more
>> likely to happen.
>> --> If TSO is disabled, I have no idea how this might matter, but??
>>
>>> I'll note that netstat -I ue0 -d and netstat -I genet0 -d
>>> do not report changes in Ierrs or Idrop in a before vs.
>>> after failures comparison. (There may be better figures
>>> to look at for all I know.)
>>>
>>> I tried "ifconfig genet0 -rxcsum -rxcsum -rxcsum6 -txcsum6"
>>> and got no obvious change in behavior.
>> All we know is that the data is getting corrupted somehow.
>>
>> NFS traffic looks very different than typical TCP traffic.
>> It is mostly small messages travelling in both directions concurrently,
>> with some large messages thrown in the mix.
>> All I'm saying is that, testing a net interface with something like
>> bulk data transfer in one direction doesn't verify it works for NFS
>> traffic.
>>
>> Also, the large RPC messages are a chain of about 33 mbufs of
>> various lengths, including a mix of partial clusters and regular
>> data mbufs, whereas a bulk send on a socket will typically
>> result in an mbuf chain of a lot of full 2K clusters.
>> --> As such, NFS can be good at tickling subtle bugs in the
>>     net driver related to mbuf handling.
>>
>> rick
>>
>>>> W.r.t. reverting r367492...the patch to replace r367492 was just
>>>> committed to "main" by rscheff@ with a two week MFC, so it
>>>> should be in stable/13 soon. Not sure if an errata can be done
>>>> for it for releng13.0?
>>>
>>> That update is reported to be causing "rack" related panics:
>>>
>>> https://lists.freebsd.org/pipermail/dev-commits-src-main/2021-May/004440.html
>>>
>>> reports (via links):
>>>
>>> panic: _mtx_lock_sleep: recursed on non-recursive mutex so_snd @ /syzkaller/managers/i386/kernel/sys/modules/tcp/rack/../../../netinet/tcp_stacks/rack.c:10632
>>>
>>> Still, I have a non-debug update to main building and will
>>> likely do a debug build as well. llvm is rebuilding, so
>>> the builds will take a notable time.
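
For illustration only: the TSO failure mode Rick describes in the quoted text above can be sketched as simple arithmetic. This is not code from this thread or from any FreeBSD driver. The byte limit, the segment limit, and the 14-byte Ethernet header length are assumptions chosen to match his "64K or 32 discontiguous segments" wording, and `tso_check` is a hypothetical helper name.

```python
# Illustrative sketch only. All constants are assumptions, not values
# taken from genet(4) or any other driver.
TSO_MAX_BYTES = 65535     # assumed "64K" limit on one TSO frame
TSO_MAX_SEGMENTS = 32     # assumed limit on discontiguous buffers
ETHER_HDR_LEN = 14        # Ethernet (MAC) header, no VLAN tag


def tso_check(ip_len, mbuf_count):
    """Return the reasons a frame could exceed the assumed TSO limits.

    ip_len     -- length in bytes before the MAC header is prepended
    mbuf_count -- number of buffers in the (hypothetical) mbuf chain
    """
    problems = []
    # The length looks fine to a check done before the MAC header is
    # added, but the frame is too long once the header is prepended.
    if ip_len <= TSO_MAX_BYTES < ip_len + ETHER_HDR_LEN:
        problems.append("too long after MAC header")
    # A chain of "about 33 mbufs" is one more than a 32-segment limit.
    if mbuf_count > TSO_MAX_SEGMENTS:
        problems.append("too many segments")
    return problems
```

Under these assumptions, a reply just under the byte limit carried in a 33-buffer chain trips both checks, while a mid-sized reply in 16 buffers trips neither.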
>
> I got the following built and installed on the two
> machines:
>
> # uname -apKU
> FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #1 main-n246854-03b0505b8fe8-dirty: Sat May 22 16:25:04 PDT 2021 root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-dbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-DBG-CA72 arm64 aarch64 1400013 1400013
>
> # uname -apKU
> FreeBSD CA72_4c8G_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #1 main-n246854-03b0505b8fe8-dirty: Sat May 22 16:25:04 PDT 2021 root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-dbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-DBG-CA72 arm64 aarch64 1400013 1400013
>
> Note that both are booted with debug builds of main.
>
> Using the context with the alternate EtherNet device that has not
> had an associated diff -r, find, or ls -R failure yet, I
> got a panic that looks likely to be unrelated:
>
> # mount -onoatime 192.168.1.187:/usr/ports/ /mnt/
> # diff -r /usr/ports/ /mnt/ | more
> nvme0: cpl does not map to outstanding cmd
> cdw0:00000000 sqhd:0020 sqid:0003 cid:007e p:1 sc:00 sct:0 m:0 dnr:0
> panic: received completion for unknown cmd
> cpuid = 3
> time = 1621743752
> KDB: stack backtrace:
> db_trace_self() at db_trace_self
> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
> vpanic() at vpanic+0x188
> panic() at panic+0x44
> nvme_qpair_process_completions() at nvme_qpair_process_completions+0x1fc
> nvme_timeout() at nvme_timeout+0x3c
> softclock_call_cc() at softclock_call_cc+0x124
> softclock() at softclock+0x60
> ithread_loop() at ithread_loop+0x2a8
> fork_exit() at fork_exit+0x74
> fork_trampoline() at fork_trampoline+0x14
> KDB: enter: panic
> [ thread pid 12 tid 100028 ]
> Stopped at kdb_enter+0x48: undefined f904411f
> db>
>
> Based on the "nvme" references, I expect this is tied to
> handling the Optane 480 GiByte that is in the PCIe slot
> and is the boot/only media for the machine doing the diff.
>
> "db> dump" seems to have worked.
>
> After reboot, zpool scrub found no errors.
>
> So, trying again . . .
>
> I got some "Expensive timeout(9) function" notices:
>
> Expensive timeout(9) function: 0xffff000000717b64(0) 1.210285924 s
> Expensive timeout(9) function: 0xffff000000717b64(0) 4.001010935 s
>
> 0xffff000000717b64 looks to be uma_timeout:
>
> ffff000000717b60 <uma_startup3+0x118> b ffff000000717b3c <uma_startup3+0xf4>
> ffff000000717b64 <uma_timeout> stp x29, x30, [sp, #-32]!
> ffff000000717b68 <uma_timeout+0x4> stp x20, x19, [sp, #16]
> . . .
>
> . . . Hmm. The debug kernel test context seems to take a
> very long time. It has not failed so far but is still
> going.
>
> So I stopped it and switched to testing with the genet0 device
> that was involved for the earlier failures. . . .
>
> It did not fail. Nor did the debug kernel report anything
> beyond:
>
> if_delmulti_locked: detaching ifnet instance 0xffffa00000fc8000
> if_delmulti_locked: detaching ifnet instance 0xffffa00000fc8000
> Expensive timeout(9) function: 0xffff00000050c088(0) 6.318652023 s
>
> on one machine and:
>
> if_delmulti_locked: detaching ifnet instance 0xffffa0000b56b800
>
> on the other.
>
> So I may reboot into the also-updated non-debug builds on both
> machines and try in that context.
>

The non-debug build pair of machines got the problem:

# diff -r /usr/ports/ /mnt/ | more
Only in /mnt/devel/electron12/files: 
Only in /usr/ports/devel/electron12/files: patch-chrome_browser_media_webrtc_webrtc__logging__controller.cc
Only in /usr/ports/devel/electron12/files: patch-components_previews_core_previews__features.cc
Only in /mnt/devel/electron12/files: <A0><CE><C8>=D6=8F<DC>=DC=A62<B2><E2><AA>^H
Only in /mnt/www/chromium/files: patch-chrome_browser_chrome__browser
Only in /usr/ports/www/chromium/files: patch-chrome_browser_chrome__browser__main__posix.cc

I'll note that it turns out that the debug build had more
than is typical enabled: DIAGNOSTICS, BUF_TRACKING, and
FULL_BUF_TRACKING were also enabled. I'd forgotten that I'd
previously had a reason to add those to what my debug builds
included (for a prior problem investigation). I'd not done
debug builds in some time.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)
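
Since the failures reported here show up only in file names and not in file contents, a names-only comparison of the two trees may be a quicker way to reproduce them than a full diff -r. The following is a minimal sketch under that assumption, not something that was run in this thread. The helper names (`list_names`, `looks_garbled`, `compare_trees`) are hypothetical, and the "garbled" test is a crude heuristic that flags any byte outside printable ASCII.

```python
import os


def list_names(root):
    """Return the set of paths under root, relative to root, as bytes."""
    root_b = os.fsencode(root)
    names = set()
    # Walking a bytes path yields raw bytes names, so corrupted names
    # are kept exactly as the file system returned them.
    for dirpath, dirnames, filenames in os.walk(root_b):
        rel_dir = os.path.relpath(dirpath, root_b)
        for name in dirnames + filenames:
            names.add(os.path.normpath(os.path.join(rel_dir, name)))
    return names


def looks_garbled(path_b):
    """Heuristic: True if any byte is outside printable ASCII."""
    return any(b < 0x20 or b > 0x7E for b in path_b)


def compare_trees(local_root, nfs_root):
    """Return (only_local, only_remote, garbled) as sorted lists of bytes."""
    local = list_names(local_root)
    remote = list_names(nfs_root)
    only_local = sorted(local - remote)
    only_remote = sorted(remote - local)
    garbled = [p for p in only_remote if looks_garbled(p)]
    return only_local, only_remote, garbled
```

With the mount shown earlier, this would be called as `compare_trees("/usr/ports", "/mnt")`. Unlike diff -r, it also lists everything below a directory that is missing on one side, and it never reads file contents.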