Date: Tue, 14 Mar 2017 21:33:08 -0700
From: Mark Millard <markmi@dsl-only.net>
To: Andrew Turner <andrew@fubar.geek.nz>, freebsd-arm <freebsd-arm@freebsd.org>, FreeBSD Current <freebsd-current@freebsd.org>
Cc: FreeBSD-STABLE Mailing List <freebsd-stable@freebsd.org>
Subject: Re: arm64 fork/swap data corruptions: A ~110 line C program demonstrating an example (Pine64+ 2GB context) [Corrected subject: arm64!]
Message-ID: <10F50F1C-FD26-4142-9350-966312822438@dsl-only.net>
In-Reply-To: <AE06FE24-60A9-4B84-B4DE-B780F83309B3@dsl-only.net>
References: <01735A68-FED6-4E63-964F-0820FE5C446C@dsl-only.net>
 <A82D1406-DB53-42CE-A41C-D984C9F5A1C9@dsl-only.net>
 <16B3D614-62E1-4E58-B409-8DB9DBB35BCB@dsl-only.net>
 <5BEAFC6C-DA80-4D7B-AB55-977E585D1ACC@dsl-only.net>
 <AE06FE24-60A9-4B84-B4DE-B780F83309B3@dsl-only.net>

A single Byte access to a 4K Byte aligned region between
the fork and wait/sleep/swap-out prevents that specific
4K Byte region from having the (bad) zeros.

Sounds like a page sized unit of behavior to me.

Details follow.

On 2017-Mar-14, at 3:28 PM, Mark Millard <markmi@dsl-only.net> wrote:

> [test_check() between the fork and the wait/sleep prevents the
> failure from occurring. Even a small access to the memory at
> that stage prevents the failure. Details follow.]
>
> On 2017-Mar-14, at 11:07 AM, Mark Millard <markmi@dsl-only.net> wrote:
>
>> [This is just a correction to the subject-line text to say arm64
>> instead of amd64.]
>>
>> On 2017-Mar-14, at 12:58 AM, Mark Millard <markmi@dsl-only.net> wrote:
>>
>> [Another correction I'm afraid --about alternative program variations
>> this time.]
>>
>> On 2017-Mar-13, at 11:52 PM, Mark Millard <markmi@dsl-only.net> wrote:
>>
>>> I'm still at a loss about how to figure out what stages are messed
>>> up. (Memory coherency? Some memory not swapped out? Bad data swapped
>>> out? Wrong data swapped in?)
>>>
>>> But at least I've found a much smaller/simpler example to demonstrate
>>> some problem with in my Pine64+ 2GB context.
>>>
>>> The Pine64+ 2GB is the only amd64 context that I have access to.
>>
>> Someday I'll learn to type arm64 the first time instead of amd64.
>>
>>> The following program fails its check for data
>>> having its expected byte pattern in dynamically
>>> allocated memory after a fork/swap-out/swap-in
>>> sequence.
>>>
>>> I'll note that the program sleeps for 60s after
>>> forking to give time to do something else to
>>> cause the parent and child processes to swap
>>> out (RES=0 as seen in top).
>>
>> The following about the extra test_check() was
>> wrong.
>>
>>> Note the source code line:
>>>
>>>        // test_check(); // Adding this line prevents failure.
>>>
>>> It seem that accessing the region contents before forking
>>> and swapping avoids the problem. But there is a problem
>>> if the region was only written-to before the fork/swap.
>
> There is a place that if a test_check call is put then the
> problem does not happen at any stage: I tried putting a
> call between the fork and the later wait/sleep code:

I changed the byte sequence patterns to avoid
zero values since the bad values are zeros:

static value_type value(size_t v) { return (value_type)((v&0xFEu)|0x1u); }
                  // value now avoids the zero value since the failures
                  // are zeros.

With that I can then test accurately which bytes have
bad values and which do not. I also changed to:

void partial_test_check(void) {
    if (value(0u)!=gbl_region.array[0])    raise(SIGABRT);
    if (value(0u)!=(*dyn_region).array[0]) raise(SIGABRT);
}

since previously [0] had a zero value and so I'd used [1].

On this basis I'm now using the below. See the comments tied
to partial_test_check() calls:

extern void test_setup(void);         // Sets up the memory byte patterns.
extern void test_check(void);         // Tests the memory byte patterns.
extern void partial_test_check(void); // Tests just [0] of each region
                                      // (gbl_region and dyn_region).

int main(void)
{
    test_setup();
    test_check(); // Before fork() [passes]

    pid_t pid = fork();
    int wait_status = 0;

    // After fork; before wait/sleep/swap-out.

    if (0==pid) partial_test_check();
                       // Even the above is sufficient by
                       // itself to prevent failure for
                       // region_size 1u through
                       // 4u*1024u!
                       // But 4u*1024u+1u and above fail
                       // with this access to memory.
                       // The failing test is of
                       // (*dyn_region).array[4096u].
                       // This test never fails here.

    if (0<pid) partial_test_check(); // This never prevents
                                     // later failures (and
                                     // never fails here).

    if (0<pid) { wait(&wait_status); }

    if (-1!=wait_status && 0<=pid)
    {
        if (0==pid)
        {
            sleep(60);

            // During this manually force this process to
            // swap out. I use something like:

            // stress -m 1 --vm-bytes 1800M

            // in another shell and ^C'ing it after top
            // shows the swapped status desired. 1800M
            // just happened to work on the Pine64+ 2GB
            // that I was using. I watch with top -PCwaopid .
        }

        test_check(); // After wait/sleep [fails for small-enough region_sizes]
    }
}

> This suggests to me that the small access is forcing one or more things to
> be initialized for memory access that fork is not establishing of itself.
> It appears that if established correctly then the swap-out/swap-in
> sequence would work okay without needing the manual access to the memory.
>
>
> So far via this test I've not seen any evidence of problems with the global
> region but only the dynamically allocated region.
>
> However, the symptoms that started this investigation in a much more
> complicated context had an area of global memory from a .so that ended
> up being zero.
>
> I think that things should be fixed for this simpler context first and
> that further investigation of the sh/su related should wait to see what
> things are like after this test case works.

===
Mark Millard
markmi at dsl-only.net
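
One way to check the page-sized-unit reading above is to tally the bad
bytes per 4 KiB chunk of a region. The sketch below is not part of the
original test program. Only value() is taken from the message;
PAGE_BYTES, fill_pattern() and count_bad_pages() are hypothetical names,
value_type is assumed to be unsigned char, and the region is assumed to
start on a page boundary so that chunk offsets line up with pages.

```c
#include <stddef.h>

/* Assumption: 4 KiB pages, matching the 4u*1024u boundary in the message. */
#define PAGE_BYTES 4096u

/* Assumption: the regions hold byte-sized elements. */
typedef unsigned char value_type;

/* Pattern from the message: the low bit is forced on, so it is never zero. */
static value_type value(size_t v) { return (value_type)((v & 0xFEu) | 0x1u); }

/* Hypothetical helper: write the expected pattern into a region. */
static void fill_pattern(value_type *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        p[i] = value(i);
}

/*
 * Hypothetical helper: count mismatching bytes in each PAGE_BYTES chunk.
 * bad_per_page[k] receives the number of bad bytes in chunk k, for
 * k < max_pages. Returns how many of those chunks contain a bad byte.
 * Chunks at or beyond max_pages are not recorded or counted.
 */
static size_t count_bad_pages(const value_type *p, size_t n,
                              size_t *bad_per_page, size_t max_pages)
{
    size_t pages_with_bad = 0;

    for (size_t k = 0; k < max_pages; k++)
        bad_per_page[k] = 0;

    for (size_t i = 0; i < n; i++) {
        if (p[i] != value(i)) {
            size_t page = i / PAGE_BYTES;
            if (page < max_pages && bad_per_page[page]++ == 0)
                pages_with_bad++;
        }
    }
    return pages_with_bad;
}
```

Calling count_bad_pages() on (*dyn_region).array after the wait/sleep,
instead of aborting on the first mismatch, would show directly whether
the chunk touched by partial_test_check() stays clean while the other
chunks come back zeroed.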