Date: Thu, 16 Mar 2017 02:07:23 -0700
From: Mark Millard <markmi@dsl-only.net>
To: Scott Bennett <bennett@sdf.org>
Cc: FreeBSD Current <freebsd-current@freebsd.org>, freebsd-arm@freebsd.org, freebsd-stable@freebsd.org
Subject: Re: arm64 fork/swap data corruptions: A ~110 line C program demonstrating an example (Pine64+ 2GB context) [Corrected subject: arm64!]
Message-ID: <1019DBB4-5A92-41FE-90B5-63F3F658CF3D@dsl-only.net>
In-Reply-To: <201703160607.v2G67Vwe023153@sdf.org>
References: <mailman.15.1489579200.37820.freebsd-stable@freebsd.org> <201703151315.v2FDFWOr028842@sdf.org> <345EE889-A429-4C13-9B08-B762DA3F4D71@dsl-only.net> <FC7930F8-B9CC-429B-9618-FB50F1FE685F@dsl-only.net> <201703160607.v2G67Vwe023153@sdf.org>
On 2017-Mar-15, at 11:07 PM, Scott Bennett <bennett at sdf.org> wrote:

> Mark Millard <markmi ta dsl-only.net> wrote:
>
>> [Something strange happened to the automatic CC: fill-in for my original
>> reply. Also I should have mentioned that for my test program if a
>> variant is made that does not fork the swapping works fine.]
>>
>> On 2017-Mar-15, at 9:37 AM, Mark Millard <markmi at dsl-only.net> wrote:
>>
>>> On 2017-Mar-15, at 6:15 AM, Scott Bennett <bennett at sdf.org> wrote:
>>>
>>>> On Tue, 14 Mar 2017 18:18:56 -0700 Mark Millard
>>>> <markmi at dsl-only.net> wrote:
>>>>> On 2017-Mar-14, at 4:44 PM, Bernd Walter <ticso@cicely7.cicely.de> wrote:
>>>>>
>>>>>> On Tue, Mar 14, 2017 at 03:28:53PM -0700, Mark Millard wrote:
>>>>>>> [test_check() between the fork and the wait/sleep prevents the
>>>>>>> failure from occurring. Even a small access to the memory at
>>>>>>> that stage prevents the failure. Details follow.]
>>>>>>
>>>>>> Maybe a stupid question, since you might have written it somewhere.
>>>>>> What medium do you swap to?
>>>>>> I've seen broken firmware on microSD cards doing silent data
>>>>>> corruption for some access patterns.
>>>>>
>>>>> The root filesystem is on a USB SSD on a powered hub.
>>>>>
>>>>> Only the kernel is from the microSD card.
>>>>>
>>>>> I have several examples of the USB SSD model and have
>>>>> never observed such problems in any other context.
>>>>>
>>>>> [remainder of irrelevant material deleted --SB]
>>>>
>>>> You gave a very long-winded non-answer to Bernd's question, so I'll
>>>> repeat it here. What medium do you swap to?
>>>
>>> My wording of:
>>>
>>> The root filesystem is on a USB SSD on a powered hub.
>>>
>>> was definitely poor. It should have explicitly mentioned the
>>> swap partition too:
>>>
>>> The root filesystem and swap partition are both on the same
>>> USB SSD on a powered hub.
>>>
>>> More detail from dmesg -a for usb:
>>>
>>> usbus0: 12Mbps Full Speed USB v1.0
>>> usbus1: 480Mbps High Speed USB v2.0
>>> usbus2: 12Mbps Full Speed USB v1.0
>>> usbus3: 480Mbps High Speed USB v2.0
>>> ugen0.1: <Generic OHCI root HUB> at usbus0
>>> uhub0: <Generic OHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus0
>>> ugen1.1: <Allwinner EHCI root HUB> at usbus1
>>> uhub1: <Allwinner EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus1
>>> ugen2.1: <Generic OHCI root HUB> at usbus2
>>> uhub2: <Generic OHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus2
>>> ugen3.1: <Allwinner EHCI root HUB> at usbus3
>>> uhub3: <Allwinner EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus3
>>> . . .
>>> uhub0: 1 port with 1 removable, self powered
>>> uhub2: 1 port with 1 removable, self powered
>>> uhub1: 1 port with 1 removable, self powered
>>> uhub3: 1 port with 1 removable, self powered
>>> ugen3.2: <GenesysLogic USB2.0 Hub> at usbus3
>>> uhub4 on uhub3
>>> uhub4: <GenesysLogic USB2.0 Hub, class 9/0, rev 2.00/90.20, addr 2> on usbus3
>>> uhub4: MTT enabled
>>> uhub4: 4 ports with 4 removable, self powered
>>> ugen3.3: <OWC Envoy Pro mini> at usbus3
>>> umass0 on uhub4
>>> umass0: <OWC Envoy Pro mini, class 0/0, rev 2.10/1.00, addr 3> on usbus3
>>> umass0: SCSI over Bulk-Only; quirks = 0x0100
>>> umass0:0:0: Attached to scbus0
>>> . . .
>>> da0 at umass-sim0 bus 0 scbus0 target 0 lun 0
>>> da0: <OWC Envoy Pro mini 0> Fixed Direct Access SPC-4 SCSI device
>>> da0: Serial Number <REPLACED>
>>> da0: 40.000MB/s transfers
>>>
>>> (Edited a bit because there is other material interlaced, even
>>> internal to some lines. Also: I removed the serial number of the
>>> specific example device.)
>
> Thank you. That presents a much clearer picture.
>>>
>>>> I will further note that any kind of USB device cannot automatically
>>>> be trusted to behave properly.
>>>> USB devices are notorious, for example,
>>>>
>>>> [reasons why deleted --SB]
>>>>
>>>> You should identify where you page/swap to and then try substituting
>>>> a different device for that function as a test to eliminate the possibility
>>>> of a bad storage device/controller. If the problem still occurs, that
>>>> means there still remains the possibility that another controller or its
>>>> firmware is defective instead. It could be a kernel bug, it is true, but
>>>> making sure there is no hardware or firmware error occurring is important,
>>>> and as I say, USB devices should always be considered suspect unless and
>>>> until proven innocent.
>>>
>>> [FYI: This is a ufs context, not a zfs one.]
>
> Right. It's only a Pi, after all. :-)

It is a Pine64+ 2GB, not an rpi3.

>>>
>>> I'm aware of such things. There is no evidence that has resulted in
>>> suggesting the USB devices that I can replace are a problem. Otherwise
>>> I'd not be going down this path. I only have access to the one arm64
>>> device (a Pine64+ 2GB) so I've no ability to substitution-test what
>>> is on that board.
>
> There isn't even one open port on that hub that you could plug a
> flash drive into temporarily to be the paging device?

Why do you think that I've never tried alternative devices? It
is just that the result was no evidence that my usually-in-use
SSD is having a special/local problem: the behavior continues across
all such contexts when the Pine64+ 2GB is involved. (Again I have
not had access to an alternate to the one arm64 board. That limits
my substitution testing possibilities.)

Why would you expect a Flash drive to be better than another SSD
for such testing? (The SSD that I usually use even happens to be
a USB 3.0 SSD, capable of USB 3.0 speeds in USB 3.0 contexts. So
is the hub that I usually use for that matter.)

> You could then
> try your tests before returning to the normal configuration.
> If there
> isn't an open port, then how about plugging a second hub into one of
> the first hub's ports and moving the displaced device to the second
> hub? A flash drive could then be plugged in. That kind of configuration
> is obviously a bad idea for the long run, but just to try your tests it
> ought to work well enough.

I have access to more SSDs that I can use than I do to Flash drives.
I see no reason to specifically use a Flash drive.

> (BTW, if a USB storage device containing a
> paging area drops off-line even momentarily and the system needs to use
> it, that is the beginning of the end, even though it may take up to a few
> minutes for everything to lock up.

The system does not lock up, even days or weeks later, with having
done dozens of experiments that show memory corruption failures
over those days.

The only processes showing memory corruption so far are those
that were the parent or child for a fork that were later swapped out
to have zero RES(ident memory) and then even later swapped back in.

The context has no such issues.

You are inventing problems that do not exist in my context. That
is why none of my list submittals mention such problems: they did
not occur.

> You probably won't be able to do an
> orderly shutdown, but will instead have to crash it with the power switch.
> In the case of something like a Pi, this is an unpleasant fact of life,
> to be sure.)

Such things did not occur and have nothing to do with my context so far.

> I think I buy your arguments, given the evidence you've collected
> thus far, including what you've added below. I just like to eliminate
> possibilities that are much simpler to deal with before facing nastinesses
> like bugs in the VM subsystem. :-)

When I started this I found no evidence of device-specific problems.
My investigation activity goes back to long before my list submittals.

And I repeat: Other people have reported the symptoms that started
this investigation.
They did so before I ever started my activities.
They were using none of the specific devices that I have access to.
Likely the types of devices were frequently even different, such
as a rpi3 instead of a Pine64+ 2GB or a different USB drive. I was
able to get the symptoms that they reported.

>>> It would be neat if some folks used my code to test other arm64
>>> contexts and reported the results. I'd be very interested.
>>> (This is easier to do on devices that do not have massive
>>> amounts of RAM, which may limit the range of devices or
>>> device configurations that are reasonable to test.)
>>>
>>> There is that other people using other devices have reported
>>> the behavior that started this investigation. I can produce the
>>> behavior that they reported, although I've not seen anyone else
>>> listing specific steps that lead to the problem or ways to tell
>>> if the symptom is going to happen before it actually does. Nor
>>> have I seen any other core dump analysis. (I have bugzilla
>>> submittals 217138 and 217239 tied to symptoms others have
>>> reported as well as this test program material.)
>>>
>>> Also, considering that for my test program I can control which pages
>>> get the zeroed-problem by read-accessing even one byte of any 4K
>>> Byte page that I want to make work normally, doing so in the child
>>> process of the fork, between the fork and the sleep/swap-out, it does
>>> not suggest USB-device-specific behavior. The read-access is changing
>>> the status of the page in some way as far as I can tell.
>>>
>>> (Such read-accesses in the parent process make no difference to the
>>> behavior.)
>>
>> I should have noted another comparison/contrast between
>> having memory corruption and not in my context:
>>
>> I've tried variants of my test program that do not fork but
>> just sleep for 60s to allow me to force the swap-out. I
>> did this before adding fork and before using
>> parital_test_check, for example.
>> I gradually added things
>> apparently involved in the reports others had made
>> until I found a combination that produced a memory
>> corruption test failure.
>>
>> These tests without fork involved find no problems with
>> the memory content after the swap-in.
>>
>> For my test program it appears that fork-before-swap-out
>> or the like is essential to having the problem occur.
>>
> A comment about terminology seems in order here. It bothers
> me considerably to see you writing "swap out" or "swapping" where
> it seems like you mean to write "page out" or "paging". A BSD
> system whose swapping mechanism gets activated has already waded
> very deeply into the quicksand and frequently cannot be gotten out
> in a reasonable amount of time even with manual assistance. It is
> often quicker to crash it, reboot, and wait for the fsck(8) cleanups
> to complete. Orderly shutdowns, even of the kind that results from
> a quick poke to the power button, typically get mired in the same
> mess that already has the system in knots. Also, BSD systems since
> 3.0BSD, unlike older AT&T (pre-SysVR2.3) systems, do not swap in,
> just out. A swapped out process, once the system determines that it
> has adequate resources again to attempt to run the process, will have
> the interrupted text page paged in and the rest will be paged in by
> the normal mechanism of page faults and page-in operations. I assume
> you must already know all this, which is a large part of why it grates
> on me that you appear to be using the wrong terms.

You apparently did not read any of the material about how the
test is done or are unfamiliar with what "stress -m 1 --vm-bytes 1800M"
does when there is only 2GB of RAM. I am deliberately inducing
swapping in other processes, including the 2 from my test program
(after the fork), not just paging. (stress is a port, not part of
the base system.)

When I say swap-out and swap-in I mean it.
From the source code of my test program:

    sleep(60); // During this manually force this process to
               // swap out. I use something like:
               //    stress -m 1 --vm-bytes 1800M
               // in another shell and ^C'ing it after top
               // shows the swapped status desired. 1800M
               // just happened to work on the Pine64+ 2GB
               // that I was using. I watch with top -PCwaopid .

That type of stress run uses about 1.8 GiBytes after a bit,
which is enough to cause the swapping of other processes,
including the two that I am testing (post-fork). (Some RAM
is in use already before the stress run, which explains
not needing 2 GiBytes to be in use by stress.)

Look at a "top -PCwaopid" display: there are columns for
RES(ident memory) and SWAP. I cause my 2 test processes to
show zero RES and everything under SWAP, starting sometime
during the 60s sleep/wait.

Why would I cause swapping? Because buildworld causes such
swap-outs at times when there is only 2GBytes of RAM,
including processes that forked earlier, and as a result
the corrupted memory problems show up later in some processes
that were swapped out at the time. The build eventually stops
for process failures tied to the corruptions of memory in
the failing processes. (At least that is what my testing
strongly suggests.)

But that is a very complicated context to use for analysis
or testing of the problem. My test program is vastly simpler
and easier/quicker to set up and test when used with stress
as well. Such was the kind of thing I was trying to find.

I want the Pine64+ 2GB to work well enough to be able to have
buildworld (-j 4) complete correctly without having to restart
the build --even when everything has to be rebuilt. So I'm
trying to find and provide enough evidence to help someone fix
the problems that are observed to block such buildworld activity.

Again: others have reported such arm64 problems on the lists
before I ever got into this activity. The evidence is that the
issues are not a local property of my environment.
Swapping is supposed to work. I can do buildworld (-j 4) on
armv6 (really -mcpu=cortex-a7 so armv7-a) and the swapping it
causes works fine. This is true for both a bpim3 (2 GiBytes of
RAM) and a rpi2 (1 GiByte of RAM so even more swapping).

On a powerpc64 with 16 GiBytes I've built things that caused
26 GiBytes of swap to be in use some of the time (during 4 ld's
running in parallel), with lots of processes having zero for
RES(ident memory) and all their space listed under SWAP in a
"top -PCwaopid" display. This too has no problems with swapping
of previously forked processes (or of any other processes).

For the likes of a Pine64+ 2GB to be "self hosted" for
source-code based updates, swapping of previously forked
processes must work and currently such swapping is unreliable.

===
Mark Millard
markmi at dsl-only.net
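
For readers who want the general shape of the test discussed in this message, the following is a minimal sketch only. It is not the ~110 line program from the thread, which is not reproduced in this message. It assumes a heap region filled with a nonzero pattern, a fork, a sleep during which the processes can be forced to swap out, and a check afterward in both parent and child. All function names, the pattern, and the return-code convention are hypothetical. The optional per-page read in the child corresponds to the reported observation that read-accessing a page in the child between the fork and the sleep avoided the zeroed-page symptom for that page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical pattern: nonzero and position-dependent, so a page
 * that comes back zeroed is detectable. */
static unsigned char pattern_at(size_t i)
{
    return (unsigned char)((i % 251u) + 1u);
}

static void fill_region(volatile unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = pattern_at(i);
}

/* Count bytes that no longer hold the expected pattern value. */
static size_t count_mismatches(const volatile unsigned char *buf, size_t len)
{
    size_t bad = 0;
    for (size_t i = 0; i < len; i++)
        if (buf[i] != pattern_at(i))
            bad++;
    return bad;
}

/* Read one byte from each page of the region. */
static void touch_each_page(const volatile unsigned char *buf, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t step = page > 0 ? (size_t)page : 4096u;
    for (size_t i = 0; i < len; i += step)
        (void)buf[i];
}

/* Returns -1 on setup failure; otherwise a bit mask:
 * bit 0 set if the child saw mismatches (or died abnormally),
 * bit 1 set if the parent saw mismatches. 0 means both were intact. */
int run_fork_test(size_t len, unsigned int sleep_seconds,
                  int child_touches_pages)
{
    assert(len > 0);

    unsigned char *buf = malloc(len);
    if (buf == NULL)
        return -1;
    fill_region(buf, len);

    pid_t pid = fork();
    if (pid < 0) {
        free(buf);
        return -1;
    }

    if (pid == 0) {
        /* Child: optionally read-access every page before sleeping. */
        if (child_touches_pages)
            touch_each_page(buf, len);
        /* Window in which the process can be forced to swap out. */
        sleep(sleep_seconds);
        _exit(count_mismatches(buf, len) == 0 ? 0 : 1);
    }

    /* Parent: blocked here while the child sleeps, so it can be
     * swapped out during the same window. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        free(buf);
        return -1;
    }

    int result = 0;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        result |= 1;
    if (count_mismatches(buf, len) != 0)
        result |= 2;

    free(buf);
    return result;
}
```

Under these assumptions, a call such as run_fork_test(len, 60, 0) with "stress -m 1 --vm-bytes 1800M" run in another shell during the sleep would mirror the procedure described above. The region size needed, and whether this sketch reproduces the symptom at all, are not established here.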