Date:      Wed, 15 Jul 2020 03:35:41 -0700
From:      Mark Millard <marklmi@yahoo.com>
To:        freebsd-arm <freebsd-arm@freebsd.org>
Subject:   Re: USB [USB3 and USB2] problems when using UEFi v1.16 to boot RPi4: Evidence of a read-time problem being involved (contexts that avoids the issue)
Message-ID:  <19F98671-4B69-44A6-8254-B186F0ED995F@yahoo.com>
In-Reply-To: <88B0E169-C42F-42D6-B2BA-957EAEC7DB8C@yahoo.com>
References:  <476DD0F0-2286-4B2C-8E44-4404AF17F5A8@yahoo.com> <B1FF8DD3-DFD1-4973-B0D2-6AC33BCAA59C@yahoo.com> <CF81584E-75CE-4BFC-8ACC-AB95E561B28D@yahoo.com> <F426CFE6-F619-4B3C-9260-07E72BC709AF@yahoo.com> <ED69F8C1-C042-43C6-941A-E154229E4623@googlemail.com> <F7BDD05D-C803-4ACB-9C48-6CBEC277F464@yahoo.com> <88B0E169-C42F-42D6-B2BA-957EAEC7DB8C@yahoo.com>

On 2020-Jun-25, at 20:40, Mark Millard <marklmi at yahoo.com> wrote:

> [Looks like it is a read-time failure in some
> new testing.]
>
> On 2020-Jun-25, at 17:52, Mark Millard <marklmi at yahoo.com> wrote:
>>
>> On 2020-Jun-25, at 15:40, Klaus Küchemann <maciphone2 at googlemail.com> wrote:
>>
>>> On 25.06.2020 at 21:29, Mark Millard via freebsd-arm <freebsd-arm@freebsd.org> wrote:
>>>> …
>>>> .
>>>> The test still failed to produce an accurate file copy
>>>> but the kernel did not report anything either. I'm
>>>> unsure how to get evidence of the context for the bad 4K
>>>> chunks.
>>>>
>>> No clue if it has effects but maybe : dd if=xxx of=xxx bs=4k ?
>>
>> Something interesting does result from dd testing,
>> even though doing file copies that way still gets
>> the problem. In fact a couple of interesting points
>> show up.
>>
>> Using dd to copy large files still gets corrupted copies.
>> (Large files are used only because the corruptions are not
>> frequent within a file, but a sufficiently large file
>> seems to always have some corruption.)
>>
>> Interestingly, dd if=/dev/zero based large file
>> generation has produced good files from what I
>> can tell. (Generate separate files and diff them
>> after a reboot.)
>>
>> The problem was originally discovered copying
>> from another machine to a RPi4. But the Ethernet
>> use involved USB in providing data (but not a
>> local USB drive), while /dev/zero does not
>> involve USB as a data source and involves copies
>> of data in memory via file content buffering. So
>> the contrasting dd if=/dev/zero results may be
>> indicating something.
>>
>> Another interesting point is that the following
>> sequence seems repeatable for step (E)'s resultant
>> property below:
>>
>> A) first do a couple of large dd if=/dev/zero file generations
>> B) then do a (non-zero) large file copy (dd based or cp based)
>> C) reboot
>> D) diff the 2 files generated in (A): no differences
>> E) diff the original large file and the temporary copy
>>  from (B): there are differences and the temporary copy
>>  has zero in every byte that is different.
>>
>> (E) suggests that the bad file copies via cp or
>> via dd are sometimes picking up data from the wrong
>> memory pages. (A) just made large numbers of
>> pages zero, making it more likely a zero page
>> would be used if the wrong page was referenced.
>>
>> An example of checking for (E) was:
>>
>> # diff clang-cortexA53-installworld-poud.tar mmjnk.other
>> Binary files clang-cortexA53-installworld-poud.tar and mmjnk.other differ
>>
>> # cmp -l clang-cortexA53-installworld-poud.tar mmjnk.other | grep -v " 0$" | more
>> --More--(END)
>>
>>
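The grep -v " 0$" check above can also be expressed as a count, which is
easier to script across several copies. A minimal sketch, assuming only
standard cmp and awk; the function name count_nonzero_diffs is hypothetical
and not part of the original testing:

```shell
#!/bin/sh
# Hypothetical helper: count the differing bytes whose value in the
# second file is NOT zero.
# cmp -l prints one line per differing byte: the byte number (decimal,
# counted from 1) followed by the two byte values in octal.
count_nonzero_diffs() {
    # $1 = original file, $2 = copy to check
    cmp -l "$1" "$2" 2>/dev/null |
        awk '$3 != 0 { n++ } END { print n + 0 }'
}

# Example use (file names taken from the commands above):
#   count_nonzero_diffs clang-cortexA53-installworld-poud.tar mmjnk.other
```

A result of 0 for a copy that diff reports as different corresponds to the
empty grep output shown above: every differing byte in the copy is zero.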
>> Note about my example "large file" sizes:
>>
>> -rw-r--r--   1 root  wheel  4011026432 Apr 25 21:04:42 2020 clang-cortexA53-installworld-poud.tar
>>
>> and I've been mostly using 4 GiByte for the resultant size
>> of large files generated via dd.
>>
>> I have not tried to find a minimum size for reliably
>> getting corrupted file copies.
>>
>
> I continued after the above with (no additional reboot):
>
> # cpuset -l0 cp -aRx clang-cortexA53-installworld-poud.tar mmjnk.other2
>
> # diff clang-cortexA53-installworld-poud.tar mmjnk.other2
> Binary files clang-cortexA53-installworld-poud.tar and mmjnk.other2 differ
>
> # cpuset -l2 diff clang-cortexA53-installworld-poud.tar mmjnk.other2
> Binary files clang-cortexA53-installworld-poud.tar and mmjnk.other2 differ
>
> # cpuset -l3 cp -aRx clang-cortexA53-installworld-poud.tar mmjnk.other3
>
> # cpuset -l3 diff clang-cortexA53-installworld-poud.tar mmjnk.other3
> Binary files clang-cortexA53-installworld-poud.tar and mmjnk.other3 differ
>
> Note that the final mmjnk.other2 was via cpu 2.
> Note that the mmjnk.other3       was via cpu 3.
> Note that the original mmjnk.other was without limiting the cpu usage.
>
> Then I went back and did a compare of files not written since
> the reboot and showing zeros earlier above. First I show some
> of the output of a prior zeros-producing compare:
>
> # cmp -l clang-cortexA53-installworld-poud.tar mmjnk.other | more
> 1795768321 264   0
> 1795768322 167   0
> 1795768323 272   0
> 1795768324   6   0
> 1795768325   3   0
> 1795768326 370   0
> 1795768327  10   0
> 1795768328 112   0
> . . .
>
> (Yes, I did not lock down what cpu was to be used for the cmp -l
> usage in this activity. In the future I probably should experiment
> with that too.)
>
> The new comparison looked like:
>
> # cmp -l clang-cortexA53-installworld-poud.tar  mmjnk.other | more
> 1442340865  15   0
> 1442340866 245   0
> 1442340867   1  30
> 1442340868   1 353
> 1442340869   0  11
> 1442340870 100  17
> 1442340871 226 271
> 1442340872  31 125
> . . .
>
> Not all-zeros being presented on the right any more! And not
> the same offset either (so different left hand side data).
> (Some bytes are a match to the left side and so do not show a
> line overall.)
>
> So I looked at the new copy made under cpuset -l2 :
>
> # cmp -l clang-cortexA53-installworld-poud.tar  mmjnk.other2 | more
> 1442340865  15   0
> 1442340866 245   0
> 1442340867   1  30
> 1442340868   1 353
> 1442340869   0  11
> 1442340870 100  17
> 1442340871 226 271
> 1442340872  31 125
> . . .
>
> Same offset in this file and *same* values on the left and right.
> (Not just those shown above.)
>
> So I looked at the new copy made under cpuset -l3 :
>
> # cmp -l clang-cortexA53-installworld-poud.tar  mmjnk.other3 | more
> 981008385  62   0
> 981008386 111   0
> 981008387 157  30
> 981008388  65 353
> 981008389 123  11
> 981008390 145  17
> 981008391 164 271
> 981008393 160   0
> . . .
>
> Different offset in this file but the *same* values on the right.
> (Not just those shown above.) The left values are different,
> matching up with the offset difference.
>
> (Some bytes are a match to the different data on the left and so
> do not show a line but the right side values appear to match the
> prior 2 examples even where lines disappear differently because
> of left-side content.)
>
> So, apparently, the same page of content was used for the right
> side material but at a different point in the diff. (Lack
> of controlling the cpu used for cmp -l might be contributing?)
>
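One way to check the "same values on the right" observation mechanically
would be to save each cmp -l listing to a file and compare the right-hand
column by position within a 4096-byte page. A rough sketch, assuming the
first differing page in each listing is the one of interest; the name
same_right_side and the listing file names are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: compare the right-hand (third) column of two saved
# "cmp -l" listings, matching lines by their offset within a 4096-byte
# page. Only the first differing page of each listing is examined.
# Prints "consistent N" or "mismatch N", where N is the number of
# in-page offsets that appear in both listings with equal values.
same_right_side() {
    # $1, $2 = files holding saved "cmp -l" output
    awk -v ps=4096 '
        FNR == 1 { file++; first_page = int(($1 - 1) / ps) }
        int(($1 - 1) / ps) != first_page { next }   # skip later pages
        {
            off = ($1 - 1) % ps                      # 0-based in-page offset
            if (file == 1)            right[off] = $3
            else if (!(off in right)) next           # no line to compare with
            else if (right[off] != $3) bad++
            else                       shared++
        }
        END {
            verdict = bad ? "mismatch" : "consistent"
            print verdict, shared + 0
        }
    ' "$1" "$2"
}

# Example use (listing names are made up):
#   cmp -l clang-cortexA53-installworld-poud.tar mmjnk.other2 > other2.lst
#   cmp -l clang-cortexA53-installworld-poud.tar mmjnk.other3 > other3.lst
#   same_right_side other2.lst other3.lst
```

Fed only the eight lines shown above for mmjnk.other2 and mmjnk.other3, this
prints "consistent 7": the seven in-page offsets present in both listings
carry identical right-hand values. Offsets that appear in only one listing
(because the left side happened to match there) are skipped.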
> Note: 1795768321 % 4096 == 1
> Note: 1442340865 % 4096 == 1
> Note:  981008385 % 4096 == 1
>
> cmp -l numbers bytes starting from 1, so the above all align
> at 4096-byte boundaries.
>
>
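The alignment arithmetic can be written out as a small shell function. This
is only an illustrative sketch; page_of_byte is a hypothetical name, and the
4096-byte page size is the assumption used in the notes above:

```shell
#!/bin/sh
# Hypothetical helper: convert a 1-based byte number, as printed by
# "cmp -l", into a 0-based page index and an offset within that page.
page_of_byte() {
    # $1 = byte number from cmp -l, $2 = page size (default 4096)
    byte=$1
    page_size=${2:-4096}
    offset=$((byte - 1))     # cmp -l counts from 1; file offsets from 0
    echo "$((offset / page_size)) $((offset % page_size))"
}

# Example use with the three byte numbers from the listings:
#   page_of_byte 1795768321    # prints: 438420 0
#   page_of_byte 1442340865    # prints: 352134 0
#   page_of_byte  981008385    # prints: 239504 0
```

An in-page offset of 0 for each first differing byte is what "aligned at a
4096-byte boundary" means here.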
> Overall this indicates that an unmodified file can have
> its content appear to change and that multiple files
> got the same block of bad data showing up in their
> respective comparisons, just not always at the same
> offset in the files.
>
> I've no clue if the roles of "left" and "right" could
> swap. So far the right seems to be the one that gets
> the bad data.
>
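To see which 4K chunks are affected across a whole file, rather than paging
through cmp -l output, the listing can be grouped by page. A sketch assuming
4096-byte pages; summarize_pages is a hypothetical name, not something used
in the testing above:

```shell
#!/bin/sh
# Hypothetical helper: read "cmp -l" output on stdin and print one line
# per affected 4096-byte page:
#   <0-based page index> <differing bytes> <differing bytes that are zero
#                                           in the second file>
summarize_pages() {
    awk -v ps=4096 '
        {
            p = int(($1 - 1) / ps)   # cmp -l byte numbers start at 1
            n[p]++
            if ($3 == 0) z[p]++
        }
        END { for (p in n) printf "%d %d %d\n", p, n[p], z[p] + 0 }
    ' | sort -n
}

# Example use (file names taken from the commands above):
#   cmp -l clang-cortexA53-installworld-poud.tar mmjnk.other | summarize_pages
```

If the second and third columns are equal on a line, every differing byte in
that page is zero in the copy, as in the first comparison after the reboot.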

Turns out that the combination of enabling the 3 GiByte
limitation in uefi and not having D25219 applied in
the kernel avoids the problem.

I only used this combination in order to use
artifacts.ci.freebsd.org kernels (that do not have
D25219) in some other testing.

So, putting back my non-debug kernel that has
D25219 in it but leaving the 3 GiByte limit
in place in uefi . . . Turns out that also
avoids the problem.

This suggests that maybe D25219 by itself is not
keeping everything in the memory range(s) that the
uefi 3 GiByte limitation enforces internally: with
the limitation enforced, the problem disappears.


===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)



