Date: Tue, 14 Mar 2017 15:28:53 -0700
From: Mark Millard <markmi@dsl-only.net>
To: Andrew Turner <andrew@fubar.geek.nz>, freebsd-arm <freebsd-arm@freebsd.org>, FreeBSD Current <freebsd-current@freebsd.org>, FreeBSD-STABLE Mailing List <freebsd-stable@freebsd.org>
Subject: Re: arm64 fork/swap data corruptions: A ~110 line C program demonstrating an example (Pine64+ 2GB context) [Corrected subject: arm64!]
Message-ID: <AE06FE24-60A9-4B84-B4DE-B780F83309B3@dsl-only.net>
In-Reply-To: <5BEAFC6C-DA80-4D7B-AB55-977E585D1ACC@dsl-only.net>
References: <01735A68-FED6-4E63-964F-0820FE5C446C@dsl-only.net> <A82D1406-DB53-42CE-A41C-D984C9F5A1C9@dsl-only.net> <16B3D614-62E1-4E58-B409-8DB9DBB35BCB@dsl-only.net> <5BEAFC6C-DA80-4D7B-AB55-977E585D1ACC@dsl-only.net>

[test_check() between the fork and the wait/sleep prevents the
failure from occurring. Even a small access to the memory at
that stage prevents the failure. Details follow.]

On 2017-Mar-14, at 11:07 AM, Mark Millard <markmi@dsl-only.net> wrote:

> [This is just a correction to the subject-line text to say arm64
> instead of amd64.]
>
> On 2017-Mar-14, at 12:58 AM, Mark Millard <markmi@dsl-only.net> wrote:
>
> [Another correction I'm afraid --about alternative program variations
> this time.]
>
> On 2017-Mar-13, at 11:52 PM, Mark Millard <markmi@dsl-only.net> wrote:
>
>> I'm still at a loss about how to figure out what stages are messed
>> up. (Memory coherency? Some memory not swapped out? Bad data swapped
>> out? Wrong data swapped in?)
>>
>> But at least I've found a much smaller/simpler example to demonstrate
>> some problem with in my Pine64+_ 2GB context.
>>
>> The Pine64+ 2GB is the only amd64 context that I have access to.
>
> Someday I'll learn to type arm64 the first time instead of amd64.
>
>> The following program fails its check for data
>> having its expected byte pattern in dynamically
>> allocated memory after a fork/swap-out/swap-in
>> sequence.
>>
>> I'll note that the program sleeps for 60s after
>> forking to give time to do something else to
>> cause the parent and child processes to swap
>> out (RES=0 as seen in top).
>
> The following about the extra test_check() was
> wrong.
>
>> Note the source code line:
>>
>> // test_check(); // Adding this line prevents failure.
>>
>> It seem that accessing the region contents before forking
>> and swapping avoids the problem. But there is a problem
>> if the region was only written-to before the fork/swap.

There is a place where putting a test_check call keeps the problem
from happening at any stage: I tried putting a call between the
fork and the later wait/sleep code:

int main(void) {
    test_setup();
    test_check(); // Before fork() [passes]

    pid_t pid = fork();
    int wait_status = 0;

    // test_check(); // After fork(); before wait/sleep.
                     // If used it prevents failure later!

    if (0<pid) { wait(&wait_status); }

    if (-1!=wait_status && 0<=pid) {
        if (0==pid) {
            sleep(60);

            // During this manually force this process to
            // swap out. I use something like:

            // stress -m 1 --vm-bytes 1800M

            // in another shell and ^C'ing it after top
            // shows the swapped status desired. 1800M
            // just happened to work on the Pine64+ 2GB
            // that I was using. I watch with top -PCwaopid .
        }

        test_check(); // After wait/sleep [fails for small-enough region_sizes]
    }
}

My guess is that the forced access attempt when the line is
uncommented causes some sort of local status/caching update for
that memory, and with that in place swap-out gets the right
information swapped out and then later that information is
swapped back in.

But an interesting point is that the failing case fails in both
the parent process of the fork and the child process, both seeing
an all-zero pattern for the dynamically allocated region.

Even using:

void partial_test_check(void) {
    if (1u!=gbl_region.array[1])    raise(SIGABRT);
    if (1u!=(*dyn_region).array[1]) raise(SIGABRT);
}

instead of test_check as the call between the fork and the
wait/sleep, the following no longer gets the problem at any stage:

extern void partial_test_check(void); // Tests some of the memory byte pattern
                                      // but not all of it.

int main(void) {
    test_setup();
    test_check(); // Before fork() [passes]

    pid_t pid = fork();
    int wait_status = 0;

    // test_check(); // After fork(); before wait/sleep.
                     // If used it prevents failure later!

    partial_test_check(); // Does a small access do such?

    if (0<pid) { wait(&wait_status); }

    if (-1!=wait_status && 0<=pid) {
        if (0==pid) {
            sleep(60);

            // During this manually force this process to
            // swap out. I use something like:

            // stress -m 1 --vm-bytes 1800M

            // in another shell and ^C'ing it after top
            // shows the swapped status desired. 1800M
            // just happened to work on the Pine64+ 2GB
            // that I was using. I watch with top -PCwaopid .
        }

        test_check(); // After wait/sleep [fails for small-enough region_sizes]
    }
}

This suggests to me that the small access is forcing one or more
things to be initialized for memory access that fork is not
establishing of itself. It appears that if these were established
correctly then the swap-out/swap-in sequence would work okay
without needing the manual access to the memory.

So far via this test I've not seen any evidence of problems with
the global region but only the dynamically allocated region.

However, the symptoms that started this investigation in a much
more complicated context had an area of global memory from a .so
that ended up being zero.

I think that things should be fixed for this simpler context
first and that further investigation of the sh/su related
symptoms should wait to see what things are like after this
test case works.

Side note: The "extern"s are from a stage where I was
investigating having a .so involved, but it turned out that no
shared library had to be involved for what I ran into during
this.

===
Mark Millard
markmi at dsl-only.net
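
The main() variants above call test_setup() and test_check() and read
gbl_region and dyn_region, none of which are defined in this message.
The following is only a hypothetical sketch of what such helpers might
look like, not the ~110 line program from the earlier message.
REGION_SIZE, expected_byte(), fill_region(), and region_matches() are
names invented here, and the byte pattern (byte i holds i modulo 256)
is an assumption chosen only because it agrees with the
1u == array[1] test in partial_test_check().

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical size, picked arbitrarily; the original value is not given here. */
#define REGION_SIZE ((size_t)(14u * 1024u))

typedef struct region { unsigned char array[REGION_SIZE]; } region;

region  gbl_region;        /* global (static storage) region      */
region *dyn_region = NULL; /* dynamically allocated region        */

/* Assumed pattern: byte i holds i modulo 256, so array[1] == 1u. */
static unsigned char expected_byte(size_t i) {
    return (unsigned char)(i & 0xFFu);
}

/* Write the assumed pattern over the whole region. */
static void fill_region(region *r) {
    for (size_t i = 0; i < REGION_SIZE; i++) r->array[i] = expected_byte(i);
}

/* Return 1 if every byte still holds the assumed pattern, else 0. */
static int region_matches(const region *r) {
    for (size_t i = 0; i < REGION_SIZE; i++) {
        if (r->array[i] != expected_byte(i)) return 0;
    }
    return 1;
}

/* Allocate the dynamic region and write the pattern to both regions. */
void test_setup(void) {
    dyn_region = malloc(sizeof *dyn_region);
    assert(dyn_region != NULL);
    fill_region(&gbl_region);
    fill_region(dyn_region);
}

/* Read back both regions in full; abort on the first mismatch. */
void test_check(void) {
    if (!region_matches(&gbl_region)) raise(SIGABRT);
    if (!region_matches(dyn_region))  raise(SIGABRT);
}
```

With definitions along these lines, the main() variants above should
compile once <sys/types.h>, <sys/wait.h>, and <unistd.h> are included
for pid_t, wait(), fork(), and sleep().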