Date: Wed, 15 Jul 2020 03:35:41 -0700
From: Mark Millard <marklmi@yahoo.com>
To: freebsd-arm <freebsd-arm@freebsd.org>
Subject: Re: USB [USB3 and USB2] problems when using UEFi v1.16 to boot RPi4: Evidence of a read-time problem being involved (contexts that avoids the issue)
Message-ID: <19F98671-4B69-44A6-8254-B186F0ED995F@yahoo.com>
In-Reply-To: <88B0E169-C42F-42D6-B2BA-957EAEC7DB8C@yahoo.com>
References: <476DD0F0-2286-4B2C-8E44-4404AF17F5A8@yahoo.com> <B1FF8DD3-DFD1-4973-B0D2-6AC33BCAA59C@yahoo.com> <CF81584E-75CE-4BFC-8ACC-AB95E561B28D@yahoo.com> <F426CFE6-F619-4B3C-9260-07E72BC709AF@yahoo.com> <ED69F8C1-C042-43C6-941A-E154229E4623@googlemail.com> <F7BDD05D-C803-4ACB-9C48-6CBEC277F464@yahoo.com> <88B0E169-C42F-42D6-B2BA-957EAEC7DB8C@yahoo.com>

On 2020-Jun-25, at 20:40, Mark Millard <marklmi at yahoo.com> wrote:

> [Looks like it is a read-time failure in some
> new testing.]
>
> On 2020-Jun-25, at 17:52, Mark Millard <marklmi at yahoo.com> wrote:
>>
>> On 2020-Jun-25, at 15:40, Klaus Küchemann <maciphone2 at googlemail.com> wrote:
>>
>>> On 25.06.2020 at 21:29, Mark Millard via freebsd-arm <freebsd-arm@freebsd.org> wrote:
>>>> …
>>>> .
>>>> The test still failed to produce an accurate file copy
>>>> but the kernel did not report anything either. I'm
>>>> unsure how to get evidence of the context for the bad 4K
>>>> chunks.
>>>>
>>> No clue if it has effects but maybe : dd if=xxx of=xxx bs=4k ?
>>
>> Something interesting does result from dd testing,
>> even though doing file copies that way still gets
>> the problem. In fact a couple of interesting points
>> show up.
>>
>> Using dd to copy large files still gets corrupted copies.
>> (Large files are used only because the corruptions are not
>> frequent in the files but a sufficiently large file
>> seems to always have some corruption.)
>>
>> Interestingly, dd if=/dev/zero based large file
>> generation has produced good files from what I
>> can tell. (Generate separate files and diff them
>> after a reboot.)
>>
>> The problem was originally discovered copying
>> from another machine to a RPi4. But the Ethernet
>> use involved USB in providing data (but not a
>> local USB drive) --while /dev/zero does not
>> involve USB as a data source and copies of
>> data in memory via file content buffering. So
>> the contrasting dd if=/dev/zero results may be
>> indicating something.
>>
>> Another interesting point is that the following
>> sequence seems repeatable for step (E)'s resultant
>> property below:
>>
>> A) first do a couple of large dd if=/dev/zero file generations
>> B) then do a (non-zero) large file copy (dd based or cp based)
>> C) reboot
>> D) diff the 2 files generated in (A): no differences
>> E) diff the original large file and the temporary copy
>>    from (B): there are differences and the temporary copy
>>    has zero in every byte that is different.
>>
>> (E) suggests that the bad file copies via cp or
>> via dd are picking up data from the wrong memory
>> pages sometimes, (A) just made large numbers of
>> pages zero, making it more likely a zero page
>> would be used if the wrong page was referenced.
>>
>> An example of checking for (E) was:
>>
>> # diff clang-cortexA53-installworld-poud.tar mmjnk.other
>> Binary files clang-cortexA53-installworld-poud.tar and mmjnk.other differ
>>
>> # cmp -l clang-cortexA53-installworld-poud.tar mmjnk.other | grep -v " 0$" | more
>> --More--(END)
>>
>>
>> Note about my example "large file" sizes:
>>
>> -rw-r--r--  1 root  wheel  4011026432 Apr 25 21:04:42 2020 clang-cortexA53-installworld-poud.tar
>>
>> and I've been mostly using 4 GiByte for the resultant size
>> of large files generated via dd.
>>
>> I have not tried to find a minimum size for reliably
>> getting corrupted file copies.
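
[Editorial sketch, not part of the original message.] The check for (E) above pages through `cmp -l` output after filtering lines that end in " 0". A minimal variant that yields a single number instead is shown here, assuming a POSIX sh with `cmp` and `awk` available. The helper name `nonzero_diff_count` is hypothetical; it counts differing bytes whose value in the second file is not zero, so a result of 0 means every differing byte in the copy is zero (or that the files are identical).

```shell
# Hypothetical helper: count differing bytes whose value in the copy is NOT zero.
# Usage: nonzero_diff_count ORIGINAL COPY
nonzero_diff_count() {
    # cmp -l prints one line per differing byte:
    #   <1-based byte number> <octal value in file 1> <octal value in file 2>
    # awk keeps only lines whose third field (the copy's byte) is non-zero
    # and prints how many there were ("n + 0" prints 0 when none matched).
    cmp -l "$1" "$2" | awk '$3 != 0 { n++ } END { print n + 0 }'
}

# Example invocation with the file names from the message (left commented
# out because those files exist only on the original test system):
# nonzero_diff_count clang-cortexA53-installworld-poud.tar mmjnk.other
```

Comparing the third field numerically avoids depending on the exact column spacing that `cmp -l` uses, which the `grep -v " 0$"` pattern relies on.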
>
> I continued after the above with (no additional reboot):
>
> # cpuset -l0 cp -aRx clang-cortexA53-installworld-poud.tar mmjnk.other2
>
> # diff clang-cortexA53-installworld-poud.tar mmjnk.other2
> Binary files clang-cortexA53-installworld-poud.tar and mmjnk.other2 differ
>
> # cpuset -l2 diff clang-cortexA53-installworld-poud.tar mmjnk.other2
> Binary files clang-cortexA53-installworld-poud.tar and mmjnk.other2 differ
>
> # cpuset -l3 cp -aRx clang-cortexA53-installworld-poud.tar mmjnk.other3
>
> # cpuset -l3 diff clang-cortexA53-installworld-poud.tar mmjnk.other3
> Binary files clang-cortexA53-installworld-poud.tar and mmjnk.other3 differ
>
> Note that the final mmjnk.other2 was via cpu 2.
> Note that the mmjnk.other3 was via cpu 3.
> Note that the original mmjnk.other was without limiting the cpu usage.
>
> Then I went back and did a compare of files not written since
> the reboot and showing zeros earlier above. First I show some
> of the output of a prior zeros-producing compare:
>
> # cmp -l clang-cortexA53-installworld-poud.tar mmjnk.other | more
> 1795768321 264   0
> 1795768322 167   0
> 1795768323 272   0
> 1795768324   6   0
> 1795768325   3   0
> 1795768326 370   0
> 1795768327  10   0
> 1795768328 112   0
> . . .
>
> (Yes, I did not lock down what cpu was to be used for the cmp -l
> usage in this activity. In the future I probably should experiment
> with that too.)
>
> The new comparison looked like:
>
> # cmp -l clang-cortexA53-installworld-poud.tar mmjnk.other | more
> 1442340865  15   0
> 1442340866 245   0
> 1442340867   1  30
> 1442340868   1 353
> 1442340869   0  11
> 1442340870 100  17
> 1442340871 226 271
> 1442340872  31 125
> . . .
>
> Not all-zeros being presented on the right any more! And not
> the same offset either (so different left hand side data).
> (Some bytes are a match to the left side and so do not show a
> line overall.)
>
> So I looked at the new copy made under cpuset -l2 :
>
> # cmp -l clang-cortexA53-installworld-poud.tar mmjnk.other2 | more
> 1442340865  15   0
> 1442340866 245   0
> 1442340867   1  30
> 1442340868   1 353
> 1442340869   0  11
> 1442340870 100  17
> 1442340871 226 271
> 1442340872  31 125
> . . .
>
> Same offset in this file and *same* values on the left and right.
> (Not just those shown above.)
>
> So I looked at the new copy made under cpuset -l3 :
>
> # cmp -l clang-cortexA53-installworld-poud.tar mmjnk.other3 | more
>  981008385  62   0
>  981008386 111   0
>  981008387 157  30
>  981008388  65 353
>  981008389 123  11
>  981008390 145  17
>  981008391 164 271
>  981008393 160   0
> . . .
>
> Different offset in this file but the *same* values on the right.
> (Not just those shown above.) The left values are different,
> matching up with the offset difference.
>
> (Some bytes are a match to the different data on the left and so
> do not show a line but the right side values appear to match the
> prior 2 examples even where lines disappear differently because
> of left-side content.)
>
> So, apparently, the same page of content used for the right
> side material but at a different point in the diff. (Lack
> of controlling the cpu used for cmp -l might be contributing?)
>
> Note: 1795768321 % 4096 == 1
> Note: 1442340865 % 4096 == 1
> Note:  981008385 % 4096 == 1
>
> cmp -l numbers bytes starting from 1, so the above all align
> at 4096-byte boundaries.
>
>
> Overall this indicates that an unmodified file can have
> its content appear to change and that multiple files
> got the same block of bad data showing up in their
> respective comparisons, just not always at the same
> offset in the files.
>
> I've no clue if the roles of "left" and "right" could
> swap. So far the right seems to be the one that gets
> the bad data.
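
[Editorial sketch, not part of the original message.] The alignment notes above can be reproduced with shell arithmetic. `cmp -l` reports 1-based byte numbers, so subtracting 1 gives the 0-based file offset, and a remainder of 0 modulo 4096 means the first differing byte sits at the start of a 4 KiB block. The function name `page_offset` is hypothetical, and 4096 is assumed as the block size because that is the value the message uses.

```shell
# Hypothetical helper: position of a cmp -l byte number within its 4096-byte block.
page_offset() {
    # Convert the 1-based cmp -l byte number to a 0-based offset,
    # then take the remainder modulo the assumed 4096-byte block size.
    echo $(( ($1 - 1) % 4096 ))
}

# First differing byte numbers quoted in the message for the three comparisons.
for off in 1795768321 1442340865 981008385; do
    echo "$off -> $(page_offset "$off")"
done
```

All three byte numbers give 0, which is the same statement as the "% 4096 == 1" notes once the 1-based numbering is taken into account.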

Turns out that the combination of enabling the 3 GiByte
limitation in uefi and not having D25219 applied in the
kernel avoids the problem. I only used this combination
in order to use artifacts.ci.freebsd.org kernels (that do
not have D25219) in some other testing.

So, putting back my non-debug kernel that has D25219 in it
but leaving the 3 GiByte limit in place in uefi . . .

Turns out that also avoids the problem.

This suggests that maybe D25219 by itself is not keeping
everything in the memory range(s) that the uefi 3 GiByte
limitation enforces internally: With the limitation
enforced, the problem disappears.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?19F98671-4B69-44A6-8254-B186F0ED995F>