Subject: Re: arm64 fork/swap data corruptions: A ~110 line C program demonstrating an example (Pine64+ 2GB context) [Corrected subject: arm64!]
From: Mark Millard
Date: Sat, 18 Mar 2017 06:26:48 -0700
To: Scott Bennett
Cc: freebsd-arm, FreeBSD Current, FreeBSD-STABLE Mailing List

[Summary: I've now tested on a rpi3 in addition to a pine64+ 2GB.
Both contexts show the problem.]

On 2017-Mar-16, at 2:07 AM, Mark Millard wrote:

> On 2017-Mar-15, at 11:07 PM, Scott Bennett wrote:
>
>> Mark Millard wrote:
>>
>>> [Something strange happened to the automatic CC: fill-in for my original
>>> reply. Also I should have mentioned that for my test program if a
>>> variant is made that does not fork the swapping works fine.]
>>>
>>> On 2017-Mar-15, at 9:37 AM, Mark Millard wrote:
>>>
>>>> On 2017-Mar-15, at 6:15 AM, Scott Bennett wrote:
>>>>
>>>>> On Tue, 14 Mar 2017 18:18:56 -0700 Mark Millard wrote:
>>>>>> On 2017-Mar-14, at 4:44 PM, Bernd Walter wrote:
>>>>>>
>>>>>>> On Tue, Mar 14, 2017 at 03:28:53PM -0700, Mark Millard wrote:
>>>>>>>> [test_check() between the fork and the wait/sleep prevents the
>>>>>>>> failure from occurring. Even a small access to the memory at
>>>>>>>> that stage prevents the failure. Details follow.]
>>>>>>>
>>>>>>> Maybe a stupid question, since you might have written it somewhere.
>>>>>>> What medium do you swap to?
>>>>>>> I've seen broken firmware on microSD cards doing silent data
>>>>>>> corruption for some access patterns.
>>>>>>
>>>>>> The root filesystem is on a USB SSD on a powered hub.
>>>>>>
>>>>>> Only the kernel is from the microSD card.
>>>>>>
>>>>>> I have several examples of the USB SSD model and have
>>>>>> never observed such problems in any other context.
>>>>>>
>>>>>> [remainder of irrelevant material deleted --SB]
>>>>>
>>>>> You gave a very long-winded non-answer to Bernd's question, so I'll
>>>>> repeat it here. What medium do you swap to?
>>>>
>>>> My wording of:
>>>>
>>>> The root filesystem is on a USB SSD on a powered hub.
>>>>
>>>> was definitely poor. It should have explicitly mentioned the
>>>> swap partition too:
>>>>
>>>> The root filesystem and swap partition are both on the same
>>>> USB SSD on a powered hub.
>>>>
>>>> More detail from dmesg -a for usb:
>>>>
>>>> usbus0: 12Mbps Full Speed USB v1.0
>>>> usbus1: 480Mbps High Speed USB v2.0
>>>> usbus2: 12Mbps Full Speed USB v1.0
>>>> usbus3: 480Mbps High Speed USB v2.0
>>>> ugen0.1: at usbus0
>>>> uhub0: on usbus0
>>>> ugen1.1: at usbus1
>>>> uhub1: on usbus1
>>>> ugen2.1: at usbus2
>>>> uhub2: on usbus2
>>>> ugen3.1: at usbus3
>>>> uhub3: on usbus3
>>>> . . .
>>>> uhub0: 1 port with 1 removable, self powered
>>>> uhub2: 1 port with 1 removable, self powered
>>>> uhub1: 1 port with 1 removable, self powered
>>>> uhub3: 1 port with 1 removable, self powered
>>>> ugen3.2: at usbus3
>>>> uhub4 on uhub3
>>>> uhub4: on usbus3
>>>> uhub4: MTT enabled
>>>> uhub4: 4 ports with 4 removable, self powered
>>>> ugen3.3: at usbus3
>>>> umass0 on uhub4
>>>> umass0: on usbus3
>>>> umass0: SCSI over Bulk-Only; quirks = 0x0100
>>>> umass0:0:0: Attached to scbus0
>>>> . . .
>>>> da0 at umass-sim0 bus 0 scbus0 target 0 lun 0
>>>> da0: Fixed Direct Access SPC-4 SCSI device
>>>> da0: Serial Number
>>>> da0: 40.000MB/s transfers
>>>>
>>>> (Edited a bit because there is other material interlaced, even
>>>> internal to some lines. Also: I removed the serial number of the
>>>> specific example device.)
>>
>> Thank you. That presents a much clearer picture.
>>>>
>>>>> I will further note that any kind of USB device cannot automatically
>>>>> be trusted to behave properly. USB devices are notorious, for example,
>>>>>
>>>>> [reasons why deleted --SB]
>>>>>
>>>>> You should identify where you page/swap to and then try substituting
>>>>> a different device for that function as a test to eliminate the possibility
>>>>> of a bad storage device/controller. If the problem still occurs, that
>>>>> means there still remains the possibility that another controller or its
>>>>> firmware is defective instead. It could be a kernel bug, it is true, but
>>>>> making sure there is no hardware or firmware error occurring is important,
>>>>> and as I say, USB devices should always be considered suspect unless and
>>>>> until proven innocent.
>>>>
>>>> [FYI: This is a ufs context, not a zfs one.]
>>
>> Right. It's only a Pi, after all. :-)
>
> It is a Pine64+ 2GB, not an rpi3.
>
>>>>
>>>> I'm aware of such things. There is no evidence that has resulted in
>>>> suggesting the USB devices that I can replace are a problem. Otherwise
>>>> I'd not be going down this path. I only have access to the one arm64
>>>> device (a Pine64+ 2GB) so I've no ability to substitution-test what
>>>> is on that board.
>>
>> There isn't even one open port on that hub that you could plug a
>> flash drive into temporarily to be the paging device?
>
> Why do you think that I've never tried alternative devices?
> It is just that the result was no evidence that my usually-in-use
> SSD is having a special/local problem: the behavior continues
> across all such contexts when the Pine64+ 2GB is involved. (Again
> I have not had access to an alternate to the one arm64 board.
> That limits my substitution testing possibilities.)
>
> Why would you expect a Flash drive to be better than another SSD
> for such testing? (The SSD that I usually use even happens to be
> a USB 3.0 SSD, capable of USB 3.0 speeds in USB 3.0 contexts. So
> is the hub that I usually use for that matter.)

FYI: I now have access to a rpi3 in addition to a pine64+ 2GB. I've
tested on the rpi3 using a different USB hub and a different SSD: no
hardware device in common with the recent Pine64+ 2GB tests (other
than console cabling and what handles the serial console).

The fork-then-swap-out-then-swap-in failure happens in the rpi3
context as well.

Because the rpi3 has only 1 GiByte of RAM the stress commands that
I used were more like:

    stress -m 1 --vm-bytes 1000M

to get zero RES(ident memory) for the two processes from my test
program after it forks while they are waiting/sleeping.

>> You could then
>> try your tests before returning to the normal configuration. If there
>> isn't an open port, then how about plugging a second hub into one of
>> the first hub's ports and moving the displaced device to the second
>> hub? A flash drive could then be plugged in. That kind of configuration
>> is obviously a bad idea for the long run, but just to try your tests it
>> ought to work well enough.
>
> I have access to more SSDs that I can use than I do to Flash drives. I
> see no reason to specifically use a Flash drive.
>
>> (BTW, if a USB storage device containing a
>> paging area drops off-line even momentarily and the system needs to use
>> it, that is the beginning of the end, even though it may take up to a few
>> minutes for everything to lock up.
>
> The system does not lock up, even days or weeks later, after having done
> dozens of experiments that show memory corruption failures over those
> days. The only processes showing memory corruption so far are those
> that were the parent or child for a fork that were later swapped out
> to have zero RES(ident memory) and then even later swapped back in.
>
> The context has no such issues. You are inventing problems that do
> not exist in my context. That is why none of my list submittals
> mention such problems: they did not occur.
>
>> You probably won't be able to do an
>> orderly shutdown, but will instead have to crash it with the power switch.
>> In the case of something like a Pi, this is an unpleasant fact of life,
>> to be sure.)
>
> Such things did not occur and have nothing to do with my context so far.
>
>> I think I buy your arguments, given the evidence you've collected
>> thus far, including what you've added below. I just like to eliminate
>> possibilities that are much simpler to deal with before facing nastinesses
>> like bugs in the VM subsystem. :-)
>
> When I started this I found no evidence of device-specific problems.
> My investigation activity goes back to long before my list submittals.
>
> And I repeat: Other people have reported the symptoms that started
> this investigation. They did so before I ever started my activities.
> They were using none of the specific devices that I have access to.
> Likely the types of devices were frequently even different, such as
> a rpi3 instead of a Pine64+ 2GB or a different USB drive. I was able
> to get the symptoms that they reported.
>
>>>> It would be neat if some folks used my code to test other arm64
>>>> contexts and reported the results. I'd be very interested.
>>>> (This is easier to do on devices that do not have massive
>>>> amounts of RAM, which may limit the range of devices or
>>>> device configurations that are reasonable to test.)
>>>>
>>>> There is also the fact that other people using other devices have
>>>> reported the behavior that started this investigation. I can produce the
>>>> behavior that they reported, although I've not seen anyone else
>>>> listing specific steps that lead to the problem or ways to tell
>>>> if the symptom is going to happen before it actually does. Nor
>>>> have I seen any other core dump analysis. (I have bugzilla
>>>> submittals 217138 and 217239 tied to symptoms others have
>>>> reported as well as this test program material.)
>>>>
>>>> Also, considering that for my test program I can control which pages
>>>> get the zeroed-problem by read-accessing even one byte of any 4K
>>>> Byte page that I want to make work normally, doing so in the child
>>>> process of the fork, between the fork and the sleep/swap-out, it does
>>>> not suggest USB-device-specific behavior. The read-access is changing
>>>> the status of the page in some way as far as I can tell.
>>>>
>>>> (Such read-accesses in the parent process make no difference to the
>>>> behavior.)
>>>
>>> I should have noted another comparison/contrast between
>>> having memory corruption and not in my context:
>>>
>>> I've tried variants of my test program that do not fork but
>>> just sleep for 60s to allow me to force the swap-out. I
>>> did this before adding fork and before using
>>> parital_test_check, for example. I gradually added things
>>> apparently involved in the reports others had made
>>> until I found a combination that produced a memory
>>> corruption test failure.
>>>
>>> These tests without fork involved find no problems with
>>> the memory content after the swap-in.
>>>
>>> For my test program it appears that fork-before-swap-out
>>> or the like is essential to having the problem occur.
>>>
>> A comment about terminology seems in order here.
>> It bothers me considerably to see you writing "swap out" or "swapping"
>> where it seems like you mean to write "page out" or "paging". A BSD
>> system whose swapping mechanism gets activated has already waded
>> very deeply into the quicksand and frequently cannot be gotten out
>> in a reasonable amount of time even with manual assistance. It is
>> often quicker to crash it, reboot, and wait for the fsck(8) cleanups
>> to complete. Orderly shutdowns, even of the kind that results from
>> a quick poke to the power button, typically get mired in the same
>> mess that already has the system in knots. Also, BSD systems since
>> 3.0BSD, unlike older AT&T (pre-SysVR2.3) systems, do not swap in,
>> just out. A swapped out process, once the system determines that it
>> has adequate resources again to attempt to run the process, will have
>> the interrupted text page paged in and the rest will be paged in by
>> the normal mechanism of page faults and page-in operations. I assume
>> you must already know all this, which is a large part of why it grates
>> on me that you appear to be using the wrong terms.
>
> You apparently did not read any of the material about how the test
> is done or are unfamiliar with what "stress -m 1 --vm-bytes 1800M"
> does when there is only 2GB of RAM. I am deliberately inducing
> swapping in other processes, including the 2 from my test program
> (after the fork), not just paging. (stress is a port, not part of
> the base system.)
>
> When I say swap-out and swap-in I mean it.
>
> From the source code of my test program:
>
>     sleep(60);
>
>     // During this manually force this process to
>     // swap out. I use something like:
>
>     // stress -m 1 --vm-bytes 1800M
>
>     // in another shell and ^C'ing it after top
>     // shows the swapped status desired. 1800M
>     // just happened to work on the Pine64+ 2GB
>     // that I was using. I watch with top -PCwaopid .
>
> That type of stress run uses about 1.8 GiBytes after a bit,
> which is enough to cause the swapping of other processes,
> including the two that I am testing (post-fork). (Some RAM
> is in use already before the stress run, which explains not
> needing 2 GiBytes to be in use by stress.)
>
> Look at a "top -PCwaopid" display: there are columns for
> RES(ident memory) and SWAP. I cause my 2 test processes to
> show zero RES and everything under SWAP, starting sometime
> during the 60s sleep/wait.
>
> Why would I cause swapping? Because buildworld causes such
> swap-outs at times when there is only 2GBytes of RAM,
> including processes that forked earlier, and as a result
> the corrupted memory problems show up later in some processes
> that were swapped out at the time. The build eventually
> stops for process failures tied to the corruptions of memory
> in the failing processes. (At least that is what my testing
> strongly suggests.)
>
> But that is a very complicated context to use for analysis or
> testing of the problem. My test program is vastly simpler
> and easier/quicker to set up and test when used with stress
> as well. Such was the kind of thing I was trying to find.
>
> I want the Pine64+ 2GB to work well enough to be able to have
> buildworld (-j 4) complete correctly without having to restart
> the build --even when everything has to be rebuilt. So I'm
> trying to find and provide enough evidence to help someone fix
> the problems that are observed to block such buildworld
> activity.
>
> Again: others have reported such arm64 problems on the lists
> before I ever got into this activity. The evidence is that
> the issues are not a local property of my environment.
>
> Swapping is supposed to work. I can do buildworld (-j 4)
> on armv6 (really -mcpu=cortex-a7 so armv7-a) and the
> swapping it causes works fine. This is true for both a
> bpim3 (2 GiBytes of RAM) and a rpi2 (1 GiByte of RAM
> so even more swapping).
> On a powerpc64 with 16 GiBytes I've built things that caused
> 26 GiBytes of swap to be in use some of the time (during 4 ld's
> running in parallel), with lots of processes having zero for
> RES(ident memory) and all their space listed under SWAP
> in a "top -PCwaopid" display. This too has no problems
> with swapping of previously forked processes (or of any
> other processes).
>
> For the likes of a Pine64+ 2GB to be "self hosted"
> for source-code based updates, swapping of previously
> forked processes must work and currently such
> swapping is unreliable.

===
Mark Millard
markmi at dsl-only.net
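
[Illustrative sketch, not part of the original message.]

For readers who want to try the fork-then-swap-out-then-swap-in
sequence described in this thread, the following is a minimal sketch
of that kind of test. It is not the ~110 line program named in the
subject line. The function names (fill_pattern, count_mismatches,
fork_sleep_check), the malloc-based region, the PAGE_BYTES constant
and the return-code convention are all assumptions made for
illustration. The child_touches_pages flag models the reported
observation that a read-access in the child, between the fork and the
sleep, avoids the failure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumed page size, matching the "4K Byte page" wording in the thread. */
#define PAGE_BYTES 4096u

/* Fill the region with a pattern that never contains a zero byte,
 * so that pages that come back zeroed are detectable. */
static void fill_pattern(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = (unsigned char)((i % 255u) + 1u);
}

/* Count the bytes that no longer match the pattern. */
static size_t count_mismatches(const unsigned char *buf, size_t len)
{
    size_t bad = 0;
    for (size_t i = 0; i < len; i++)
        if (buf[i] != (unsigned char)((i % 255u) + 1u))
            bad++;
    return bad;
}

/* Hypothetical driver: fill, fork, sleep in both processes, then check.
 * Returns 0 if both processes saw intact memory, bit 0 set if the
 * parent saw a mismatch, bit 1 set if the child did, -1 on error. */
static int fork_sleep_check(size_t len, unsigned sleep_seconds,
                            int child_touches_pages)
{
    assert(len > 0);
    unsigned char *buf = malloc(len);
    if (buf == NULL)
        return -1;
    fill_pattern(buf, len);

    pid_t pid = fork();
    if (pid < 0) {
        free(buf);
        return -1;
    }

    if (pid == 0) {
        if (child_touches_pages) {
            /* Read one byte per page before the sleep. */
            volatile unsigned char sink = 0;
            for (size_t off = 0; off < len; off += PAGE_BYTES)
                sink ^= buf[off];
            (void)sink;
        }
        sleep(sleep_seconds); /* window in which to force the swap-out */
        _exit(count_mismatches(buf, len) != 0 ? 1 : 0);
    }

    sleep(sleep_seconds);     /* same window for the parent */
    int result = count_mismatches(buf, len) != 0 ? 1 : 0;

    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) {
        free(buf);
        return -1;
    }
    if (WEXITSTATUS(status) != 0)
        result |= 2;

    free(buf);
    return result;
}
```

A small main() could call fork_sleep_check(len, 60, 0) and, during the
sleep, run stress in another shell until top shows zero RES for both
processes. A nonzero result would then indicate that the pattern was
not intact after swap-in. Passing 1 as the last argument corresponds
to the variant in which the child reads one byte per page before
sleeping.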