Content-Type: text/plain; charset=us-ascii
Mime-Version: 1.0 (Mac OS X Mail 10.2 \(3259\))
Subject: Re: arm64 fork/swap data corruptions: A ~110 line C program
 demonstrating an example (Pine64+ 2GB context) [Corrected subject: arm64!]
From: Mark Millard <markmi@dsl-only.net>
In-Reply-To: <1019DBB4-5A92-41FE-90B5-63F3F658CF3D@dsl-only.net>
Date: Sat, 18 Mar 2017 06:26:48 -0700
Cc: freebsd-arm <freebsd-arm@freebsd.org>,
 FreeBSD Current <freebsd-current@freebsd.org>,
 FreeBSD-STABLE Mailing List <freebsd-stable@freebsd.org>
Content-Transfer-Encoding: quoted-printable
Message-Id: <826D525A-BDAF-4352-AD9F-A238B797BFAF@dsl-only.net>
References: <mailman.15.1489579200.37820.freebsd-stable@freebsd.org>
 <201703151315.v2FDFWOr028842@sdf.org>
 <345EE889-A429-4C13-9B08-B762DA3F4D71@dsl-only.net>
 <FC7930F8-B9CC-429B-9618-FB50F1FE685F@dsl-only.net>
 <201703160607.v2G67Vwe023153@sdf.org>
 <1019DBB4-5A92-41FE-90B5-63F3F658CF3D@dsl-only.net>
To: Scott Bennett <bennett@sdf.org>
X-Mailer: Apple Mail (2.3259)

[Summary: I've now tested on a rpi3 in addition to a
pine64+ 2GB. Both contexts show the problem.]

On 2017-Mar-16, at 2:07 AM, Mark Millard <markmi at dsl-only.net> wrote:

> On 2017-Mar-15, at 11:07 PM, Scott Bennett <bennett at sdf.org> wrote:
>
>> Mark Millard <markmi at dsl-only.net> wrote:
>>
>>> [Something strange happened to the automatic CC: fill-in for my original
>>> reply. Also I should have mentioned that for my test program if a
>>> variant is made that does not fork the swapping works fine.]
>>>
>>> On 2017-Mar-15, at 9:37 AM, Mark Millard <markmi at dsl-only.net> wrote:
>>>
>>>> On 2017-Mar-15, at 6:15 AM, Scott Bennett <bennett at sdf.org> wrote:
>>>>
>>>>>  On Tue, 14 Mar 2017 18:18:56 -0700 Mark Millard
>>>>> <markmi at dsl-only.net> wrote:
>>>>>> On 2017-Mar-14, at 4:44 PM, Bernd Walter <ticso@cicely7.cicely.de> wrote:
>>>>>>
>>>>>>> On Tue, Mar 14, 2017 at 03:28:53PM -0700, Mark Millard wrote:
>>>>>>>> [test_check() between the fork and the wait/sleep prevents the
>>>>>>>> failure from occurring. Even a small access to the memory at
>>>>>>>> that stage prevents the failure. Details follow.]
>>>>>>>
>>>>>>> Maybe a stupid question, since you might have written it somewhere.
>>>>>>> What medium do you swap to?
>>>>>>> I've seen broken firmware on microSD cards doing silent data
>>>>>>> corruption for some access patterns.
>>>>>>
>>>>>> The root filesystem is on a USB SSD on a powered hub.
>>>>>>
>>>>>> Only the kernel is from the microSD card.
>>>>>>
>>>>>> I have several examples of the USB SSD model and have
>>>>>> never observed such problems in any other context.
>>>>>>=20
>>>>>> [remainder of irrelevant material deleted  --SB]
>>>>>
>>>>>  You gave a very long-winded non-answer to Bernd's question, so I'll
>>>>> repeat it here.  What medium do you swap to?
>>>>
>>>> My wording of:
>>>>
>>>> The root filesystem is on a USB SSD on a powered hub.
>>>>
>>>> was definitely poor. It should have explicitly mentioned the
>>>> swap partition too:
>>>>
>>>> The root filesystem and swap partition are both on the same
>>>> USB SSD on a powered hub.
>>>>
>>>> More detail from dmesg -a for usb:
>>>>
>>>> usbus0: 12Mbps Full Speed USB v1.0
>>>> usbus1: 480Mbps High Speed USB v2.0
>>>> usbus2: 12Mbps Full Speed USB v1.0
>>>> usbus3: 480Mbps High Speed USB v2.0
>>>> ugen0.1: <Generic OHCI root HUB> at usbus0
>>>> uhub0: <Generic OHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus0
>>>> ugen1.1: <Allwinner EHCI root HUB> at usbus1
>>>> uhub1: <Allwinner EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus1
>>>> ugen2.1: <Generic OHCI root HUB> at usbus2
>>>> uhub2: <Generic OHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus2
>>>> ugen3.1: <Allwinner EHCI root HUB> at usbus3
>>>> uhub3: <Allwinner EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus3
>>>> . . .
>>>> uhub0: 1 port with 1 removable, self powered
>>>> uhub2: 1 port with 1 removable, self powered
>>>> uhub1: 1 port with 1 removable, self powered
>>>> uhub3: 1 port with 1 removable, self powered
>>>> ugen3.2: <GenesysLogic USB2.0 Hub> at usbus3
>>>> uhub4 on uhub3
>>>> uhub4: <GenesysLogic USB2.0 Hub, class 9/0, rev 2.00/90.20, addr 2> on usbus3
>>>> uhub4: MTT enabled
>>>> uhub4: 4 ports with 4 removable, self powered
>>>> ugen3.3: <OWC Envoy Pro mini> at usbus3
>>>> umass0 on uhub4
>>>> umass0: <OWC Envoy Pro mini, class 0/0, rev 2.10/1.00, addr 3> on usbus3
>>>> umass0:  SCSI over Bulk-Only; quirks = 0x0100
>>>> umass0:0:0: Attached to scbus0
>>>> . . .
>>>> da0 at umass-sim0 bus 0 scbus0 target 0 lun 0
>>>> da0: <OWC Envoy Pro mini 0> Fixed Direct Access SPC-4 SCSI device
>>>> da0: Serial Number <REPLACED>
>>>> da0: 40.000MB/s transfers
>>>>
>>>> (Edited a bit because there is other material interlaced, even
>>>> internal to some lines. Also: I removed the serial number of the
>>>> specific example device.)
>>
>>    Thank you.  That presents a much clearer picture.
>>>>
>>>>>  I will further note that any kind of USB device cannot automatically
>>>>> be trusted to behave properly.  USB devices are notorious, for example,
>>>>>
>>>>> [reasons why deleted  --SB]
>>>>>
>>>>>  You should identify where you page/swap to and then try substituting
>>>>> a different device for that function as a test to eliminate the possibility
>>>>> of a bad storage device/controller.  If the problem still occurs, that
>>>>> means there still remains the possibility that another controller or its
>>>>> firmware is defective instead.  It could be a kernel bug, it is true, but
>>>>> making sure there is no hardware or firmware error occurring is important,
>>>>> and as I say, USB devices should always be considered suspect unless and
>>>>> until proven innocent.
>>>>
>>>> [FYI: This is a ufs context, not a zfs one.]
>>
>>    Right.  It's only a Pi, after all. :-)
>
> It is a Pine64+ 2GB, not an rpi3.
>
>>>>
>>>> I'm aware of such things. No evidence so far suggests that the USB
>>>> devices that I can replace are a problem. Otherwise I'd not be going
>>>> down this path. I only have access to the one arm64 device (a
>>>> Pine64+ 2GB) so I've no ability to substitution-test what is on
>>>> that board.
>>
>>    There isn't even one open port on that hub that you could plug a
>> flash drive into temporarily to be the paging device?
>
> Why do you think that I've never tried alternative devices? It
> is just that the result was no evidence that my usually-in-use
> SSD is having a special/local problem: the behavior continues
> across all such contexts when the Pine64+ 2GB is involved. (Again
> I have not had access to an alternate to the one arm64 board.
> That limits my substitution testing possibilities.)
>
> Why would you expect a Flash drive to be better than another SSD
> for such testing? (The SSD that I usually use even happens to be
> a USB 3.0 SSD, capable of USB 3.0 speeds in USB 3.0 contexts. So
> is the hub that I usually use for that matter.)

FYI: I now have access to a rpi3 in addition to a pine64+ 2GB.

I've tested on the rpi3 using a different USB hub and a different
SSD: no hardware device in common with the recent Pine64+ 2GB
tests (other than console cabling and what handles the serial
console).

The fork-then-swap-out-then-swap-in failure happens in the
rpi3 context as well.

Because the rpi3 has only 1 GiByte of RAM the stress commands
that I used were more like:

stress -m 1 --vm-bytes 1000M

to get zero RES(ident memory) for the two processes from my
test program after it forks while they are waiting/sleeping.
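
For anyone who wants the shape of the test without digging up the
earlier posting, the fork-then-sleep-then-check sequence can be
sketched roughly as below. This is only an illustrative sketch, not
the ~110 line program from this thread: the names pattern_at,
count_mismatches and run_fork_check, the byte pattern, the 256 MiByte
size and the FORK_CHECK_STANDALONE macro are all assumptions made up
for the example. Only the 60s sleep matches the excerpt quoted
further down.

```c
/* Illustrative sketch only; NOT the original test program. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Byte expected at offset i. Never zero, so a zeroed page is detectable. */
static unsigned char pattern_at(size_t i)
{
    return (unsigned char)(i % 255u + 1u);
}

/* Count the bytes that no longer hold the expected pattern. */
static size_t count_mismatches(const unsigned char *region, size_t bytes)
{
    size_t bad = 0;
    for (size_t i = 0; i < bytes; i++)
        if (region[i] != pattern_at(i))
            bad++;
    return bad;
}

/*
 * Fill a region, fork, let parent and child both sleep (the window in
 * which the processes are forced out to swap by hand), then verify
 * the region in each process.
 * Returns the number of processes (0, 1 or 2) that saw a mismatch,
 * or -1 on a setup error.
 */
static int run_fork_check(size_t bytes, unsigned sleep_seconds)
{
    unsigned char *region = malloc(bytes);
    if (region == NULL)
        return -1;
    for (size_t i = 0; i < bytes; i++)
        region[i] = pattern_at(i);   /* write every page before the fork */

    pid_t pid = fork();
    if (pid < 0) {
        free(region);
        return -1;
    }

    sleep(sleep_seconds);            /* run "stress" in another shell now */

    size_t bad = count_mismatches(region, bytes);
    if (pid == 0)
        _exit(bad != 0 ? 1 : 0);     /* child reports via its exit status */

    int failures = (bad != 0) ? 1 : 0;
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) {
        free(region);
        return -1;
    }
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        failures++;
    free(region);
    return failures;
}

#ifdef FORK_CHECK_STANDALONE
int main(void)
{
    /* The 256 MiByte size is an arbitrary illustrative value. */
    int failures = run_fork_check((size_t)256 * 1024 * 1024, 60);
    printf("processes that saw corrupted memory: %d\n", failures);
    return failures == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
#endif
```

Compiled with -DFORK_CHECK_STANDALONE, the idea would be to start it,
run the stress command above in another shell until top shows zero
RES for both processes, ^C the stress run, and then see what is
reported once the sleep ends.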


>> You could then
>> try your tests before returning to the normal configuration.  If there
>> isn't an open port, then how about plugging a second hub into one of
>> the first hub's ports and moving the displaced device to the second
>> hub?  A flash drive could then be plugged in.  That kind of configuration
>> is obviously a bad idea for the long run, but just to try your tests it
>> ought to work well enough.
>
> I have access to more SSDs that I can use than I do to Flash drives. I
> see no reason to specifically use a Flash drive.
>
>> (BTW, if a USB storage device containing a
>> paging area drops off-line even momentarily and the system needs to use
>> it, that is the beginning of the end, even though it may take up to a few
>> minutes for everything to lock up.
>
> The system does not lock up, even days or weeks later, after having done
> dozens of experiments that show memory corruption failures over those
> days. The only processes showing memory corruption so far are those
> that were the parent or child for a fork that were later swapped out
> to have zero RES(ident memory) and then even later swapped back in.
>
> The context has no such issues. You are inventing problems that do
> not exist in my context. That is why none of my list submittals
> mention such problems: they did not occur.
>
>> You probably won't be able to do an
>> orderly shutdown, but will instead have to crash it with the power switch.
>> In the case of something like a Pi, this is an unpleasant fact of life,
>> to be sure.)
>
> Such things did not occur and have nothing to do with my context so far.
>
>>    I think I buy your arguments, given the evidence you've collected
>> thus far, including what you've added below.  I just like to eliminate
>> possibilities that are much simpler to deal with before facing nastinesses
>> like bugs in the VM subsystem. :-)
>
> When I started this I found no evidence of device-specific problems.
> My investigation activity goes back to long before my list submittals.
>
> And I repeat: Other people have reported the symptoms that started
> this investigation. They did so before I ever started my activities.
> They were using none of the specific devices that I have access to.
> Likely the types of devices were frequently even different, such as
> a rpi3 instead of a Pine64+ 2GB or a different USB drive. I was able
> to get the symptoms that they reported.
>
>>>> It would be neat if some folks used my code to test other arm64
>>>> contexts and reported the results. I'd be very interested.
>>>> (This is easier to do on devices that do not have massive
>>>> amounts of RAM, which may limit the range of devices or
>>>> device configurations that are reasonable to test.)
>>>>
>>>> Other people using other devices have reported
>>>> the behavior that started this investigation. I can produce the
>>>> behavior that they reported, although I've not seen anyone else
>>>> listing specific steps that lead to the problem or ways to tell
>>>> if the symptom is going to happen before it actually does. Nor
>>>> have I seen any other core dump analysis. (I have bugzilla
>>>> submittals 217138 and 217239 tied to symptoms others have
>>>> reported as well as this test program material.)
>>>>
>>>> Also, considering that for my test program I can control which pages
>>>> get the zeroed-problem by read-accessing even one byte of any 4K
>>>> Byte page that I want to make work normally, doing so in the child
>>>> process of the fork, between the fork and the sleep/swap-out, it does
>>>> not suggest USB-device-specific behavior. The read-access is changing
>>>> the status of the page in some way as far as I can tell.
>>>>
>>>> (Such read-accesses in the parent process make no difference to the
>>>> behavior.)
>>>
>>> I should have noted another comparison/contrast between
>>> having memory corruption and not in my context:
>>>
>>> I've tried variants of my test program that do not fork but
>>> just sleep for 60s to allow me to force the swap-out. I
>>> did this before adding fork and before using
>>> parital_test_check, for example. I gradually added things
>>> apparently involved in the reports others had made
>>> until I found a combination that produced a memory
>>> corruption test failure.
>>>
>>> These tests without fork involved find no problems with
>>> the memory content after the swap-in.
>>>
>>> For my test program it appears that fork-before-swap-out
>>> or the like is essential to having the problem occur.
>>>
>>    A comment about terminology seems in order here.  It bothers
>> me considerably to see you writing "swap out" or "swapping" where
>> it seems like you mean to write "page out" or "paging".  A BSD
>> system whose swapping mechanism gets activated has already waded
>> very deeply into the quicksand and frequently cannot be gotten out
>> in a reasonable amount of time even with manual assistance.  It is
>> often quicker to crash it, reboot, and wait for the fsck(8) cleanups
>> to complete.  Orderly shutdowns, even of the kind that results from
>> a quick poke to the power button, typically get mired in the same
>> mess that already has the system in knots.  Also, BSD systems since
>> 3.0BSD, unlike older AT&T (pre-SysVR2.3) systems, do not swap in,
>> just out.  A swapped out process, once the system determines that it
>> has adequate resources again to attempt to run the process, will have
>> the interrupted text page paged in and the rest will be paged in by
>> the normal mechanism of page faults and page-in operations.  I assume
>> you must already know all this, which is a large part of why it grates
>> on me that you appear to be using the wrong terms.
>
> You apparently did not read any of the material about how the test
> is done or are unfamiliar with what "stress -m 1 --vm-bytes 1800M"
> does when there is only 2GB of RAM. I am deliberately inducing
> swapping in other processes, including the 2 from my test program
> (after the fork), not just paging. (stress is a port, not part of
> the base system.)
>
> When I say swap-out and swap-in I mean it.
>
> From the source code of my test program:
>
>            sleep(60);
>
>            // During this manually force this process to
>            // swap out. I use something like:
>
>            // stress -m 1 --vm-bytes 1800M
>
>            // in another shell and ^C'ing it after top
>            // shows the swapped status desired. 1800M
>            // just happened to work on the Pine64+ 2GB
>            // that I was using. I watch with top -PCwaopid .
>
> That type of stress run uses about 1.8 GiBytes after a bit,
> which is enough to cause the swapping of other processes,
> including the two that I am testing (post-fork). (Some RAM
> is in use already before the stress run, which explains not
> needing 2 GiBytes to be in use by stress.)
>
> Look at a "top -PCwaopid" display: there are columns for
> RES(ident memory) and SWAP. I cause my 2 test processes to
> show zero RES and everything under SWAP, starting sometime
> during the 60s sleep/wait.
>
> Why would I cause swapping? Because buildworld causes such
> swap-outs at times when there is only 2GBytes of RAM,
> including processes that forked earlier, and as a result
> the corrupted memory problems show up later in some processes
> that were swapped out at the time. The build eventually
> stops for process failures tied to the corruptions of memory
> in the failing processes. (At least that is what my testing
> strongly suggests.)
>
> But that is a very complicated context to use for analysis or
> testing of the problem. My test program is vastly simpler
> and easier/quicker to set up and test when used with stress
> as well. Such was the kind of thing I was trying to find.
>
> I want the Pine64+ 2GB to work well enough to be able to have
> buildworld (-j 4) complete correctly without having to restart
> the build --even when everything has to be rebuilt. So I'm
> trying to find and provide enough evidence to help someone fix
> the problems that are observed to block such buildworld
> activity.
>
> Again: others have reported such arm64 problems on the lists
> before I ever got into this activity. The evidence is that
> the issues are not a local property of my environment.
>
> Swapping is supposed to work. I can do buildworld (-j 4)
> on armv6 (really -mcpu=cortex-a7 so armv7-a) and the
> swapping it causes works fine. This is true for both a
> bpim3 (2 GiBytes of RAM) and a rpi2 (1 GiByte of RAM
> so even more swapping). On a powerpc64 with 16 GiBytes
> I've built things that caused 26 GiBytes of swap to be
> in use some of the time (during 4 ld's running in
> parallel), with lots of processes having zero for
> RES(ident memory) and all their space listed under SWAP
> in a "top -PCwaopid" display. This too has no problems
> with swapping of previously forked processes (or of any
> other processes).
>
> For the likes of a Pine64+ 2GB to be "self hosted"
> for source-code based updates, swapping of previously
> forked processes must work and currently such
> swapping is unreliable.

===
Mark Millard
markmi at dsl-only.net