Date: Fri, 7 Apr 2017 01:16:12 -0700
From: Mark Millard <markmi@dsl-only.net>
To: freebsd-arm <freebsd-arm@freebsd.org>, freebsd-hackers@freebsd.org
Cc: andrew@freebsd.org
Subject: Re: The arm64 fork-then-swap-out-then-swap-in failures: a program source for exploring them
Message-ID: <163B37B0-55D6-498E-8F52-9A95C036CDFA@dsl-only.net>
In-Reply-To: <4DEA2D76-9F27-426D-A8D2-F07B16575FB9@dsl-only.net>
References: <4DEA2D76-9F27-426D-A8D2-F07B16575FB9@dsl-only.net>
[I now can: (A) crudely control the number of allocated pages that get zeros (that should not), and (B) watch a "top -PCwaopid" display and predict, before the fork() or swap-out happens, whether the test architecture will fail.]

On 2017-Apr-4, at 8:00 PM, Mark Millard <markmi@dsl-only.net> wrote:

> Uncommenting/commenting parts of the below program allows
> exploring the problems with fork-then-swap-out-then-in on
> arm64.
>
> Note: By swap-out I mean that zero RES(ident memory) results,
> for the process(es) of interest, as shown by
> "top -PCwaopid".
>
> I discovered recently that swapping out just before the
> fork() prevents the failure from the swapping after the
> fork().
>
> Note:
> Without the fork() no problem happens. Without the later
> swap-out no problem happens. Both are required. But some
> activities before the fork(), or between the fork() and the
> swap-out, prevent the failures.
>
> Some of the comments are based on a pine64+ 2GB context.
> I use stress to force swap-outs during some sleeps in
> the program. See also Bugzilla 217239 and 217138. (I now
> expect that they have the same cause.)
>
> In my environment I've seen the fork-then-swap-out/swap-in
> failures on a pine64+ 2GB and an rpi3. They are repeatable
> on both. I do not have access to server-class machines, or
> any other arm64 machines.
>
>
> // swap_testing5.c
>
> // Built via (cc was clang 4.0 in my case):
> //
> //     cc -g -std=c11 -Wpedantic -o swaptesting5 swap_testing5.c
> //
> // -O0 and -O2 also get the problem.
>
> // Note: jemalloc's tcache needs to be enabled to get the failure.
> // But FreeBSD can get into a state where /etc/malloc.conf
> // -> 'tcache:false' is ineffective. Also: the allocation
> // size needs to be sufficiently small (<= SMALL_MAXCLASS)
> // to see the problem. Other comments are based on a specific
> // context (pine64+ 2GB).
>
> #include <signal.h>    // for raise(.), SIGABRT (induce core dump)
> #include <unistd.h>    // for fork(), sleep(.)
> #include <sys/types.h> // for pid_t
> #include <sys/wait.h>  // for wait(.)
>
> extern void test_setup(void);      // Sets up the memory byte patterns.
> extern void test_check(void);      // Tests the memory byte patterns.
> extern void memory_willneed(void); // For seeing if
>                                    // posix_madvise(.,.,POSIX_MADV_WILLNEED)
>                                    // makes a difference.
>
> int main(void) {
>     sleep(30); // Potentially force swap-out here.
>                // [Swap-out here does not avoid later failures.]
>
>     test_setup();
>     test_check(); // Before potential sleep(.)/swap-out or fork(.) [passes]
>
>     sleep(30); // Potentially force swap-out here.
>                // [Everything below passes if swapped out here,
>                // no matter if there are later swap-outs
>                // or not.]
>
>     pid_t pid = fork(); // To test no-fork use: = 0; no-fork does not fail.
>     int wait_status = 0;
>
>     // HERE: After fork; before sleep/swap-out/wait.
>
>     // if (0 < pid) memory_willneed();  // Does not prevent either parent or
>                                         // child failure if enabled.
>
>     // if (0 == pid) memory_willneed(); // Prevents both the parent and the
>                                         // child failure. Disable to see
>                                         // failure of both parent and child.
>                                         // [Presuming no prior swap-out: that
>                                         // would make everything pass.]
>
>     // During sleep/wait: manually force this process to
>     // swap out. I use something like:
>     //     stress -m 1 --vm-bytes 1800M
>     // in another shell and ^C it after top shows the
>     // swapped status desired. 1800M just happened to work
>     // on the Pine64+ 2GB that I was using. I watch with
>     // top -PCwaopid [checking for zero RES(ident memory)].
>
>     if (0 < pid) {
>         sleep(30); // Intend to swap out during sleep.
>         // test_check(); // Test in parent before child runs (longer sleep).
>                          // This test fails if run for a failing region_size
>                          // unless earlier preventing-activity happened.
>         wait(&wait_status); // Only if test_check above passes or is
>                             // disabled above.
>     }
>     if (-1 != wait_status && 0 <= pid) {
>         if (0 == pid) { sleep(90); } // Intend to swap out during sleep.
>         test_check(); // Fails for small-enough region_size, both
>                       // parent and child processes, unless earlier
>                       // preventing-activity happened.
>     }
> }
>
> // The memory and test code follows.
>
> #include <stddef.h>   // for size_t, NULL
> #include <stdlib.h>   // for malloc(.), free(.)
> #include <sys/mman.h> // for POSIX_MADV_WILLNEED, posix_madvise(.,.,.)
>
> #define region_size (14u*1024u)
>     // Bad dyn_region pattern, parent and child processes, examples:
>     //     256u, 2u*1024u, 4u*1024u, 8u*1024u, 9u*1024u, 12u*1024u, 14u*1024u
>     // No-failure examples:
>     //     14u*1024u+1u, 15u*1024u, 16u*1024u, 32u*1024u, 256u*1024u*1024u
> #define num_regions (256u*1024u*1024u/region_size)
>
> typedef volatile unsigned char value_type;
> struct region_struct { value_type array[region_size]; };
> typedef struct region_struct region;
> static region * volatile dyn_regions[num_regions] = {NULL,};
>
> static value_type value(size_t v) { return (value_type)((v&0xFEu)|0x1u); }
>     // value avoids zero values: the bad values are zeros.
>
> void test_setup(void) {
>     for(size_t i=0u; i<num_regions; i++) {
>         dyn_regions[i] = malloc(sizeof(region));
>         if (!dyn_regions[i]) raise(SIGABRT);
>
>         for(size_t j=0u; j<region_size; j++) {
>             (*dyn_regions[i]).array[j] = value(j);
>         }
>     }
> }
>
> void memory_willneed(void) {
>     for(size_t i=0u; i<num_regions; i++) {
>         (void) posix_madvise(dyn_regions[i], region_size, POSIX_MADV_WILLNEED);
>     }
> }
>
> static volatile size_t first_failure_idx = 0u; // dyn_regions index
> static volatile size_t first_failure_pos = 0u; // sub-array index
> static volatile size_t after_bad_idx     = 0u; // dyn_regions index
> static volatile size_t after_bad_pos     = 0u; // sub-array index
> static volatile size_t after_good_idx    = 0u; // dyn_regions index
> static volatile size_t after_good_pos    = 0u; // sub-array index
>
> // Note: Some failing cases get (conjunctive notation):
> //
> //     0 == first_failure_idx < after_bad_idx < after_good_idx == num_regions
> //     && 0 == first_failure_pos && 0 <= after_bad_pos <= region_size && after_good_pos == 0
> //     && (after_bad_pos is a multiple of the page size in Bytes, here:
> //         after_bad_pos == N*4096 for some non-negative integral value N)
> //
> // other failing cases instead fail with:
> //
> //     0 == first_failure_idx && num_regions == after_bad_idx == after_good_idx
> //     && 0 == first_failure_pos == after_bad_pos == after_good_pos
> //
> // after_bad_idx strongly tends to vary from failing run to failing run,
> // as does after_bad_pos.
>
> // Note: The working cases get:
> //
> //     num_regions == first_failure_idx == after_bad_idx == after_good_idx
> //     && 0 == first_failure_pos == after_bad_pos == after_good_pos
>
> void test_check(void) {
>     first_failure_idx = first_failure_pos = 0u;
>
>     while (first_failure_idx < num_regions) {
>         while ( first_failure_pos < region_size
>                 && ( value(first_failure_pos)
>                      == (*dyn_regions[first_failure_idx]).array[first_failure_pos]
>                    )
>               ) {
>             first_failure_pos++;
>         }
>
>         if (region_size != first_failure_pos) break;
>
>         first_failure_idx++;
>         first_failure_pos = 0u;
>     }
>
>     after_bad_idx = first_failure_idx;
>     after_bad_pos = first_failure_pos;
>
>     while (after_bad_idx < num_regions) {
>         while ( after_bad_pos < region_size
>                 && ( value(after_bad_pos)
>                      != (*dyn_regions[after_bad_idx]).array[after_bad_pos]
>                    )
>               ) {
>             after_bad_pos++;
>         }
>
>         if (region_size != after_bad_pos) break;
>
>         after_bad_idx++;
>         after_bad_pos = 0u;
>     }
>
>     after_good_idx = after_bad_idx;
>     after_good_pos = after_bad_pos;
>
>     while (after_good_idx < num_regions) {
>         while ( after_good_pos < region_size
>                 && ( value(after_good_pos)
>                      == (*dyn_regions[after_good_idx]).array[after_good_pos]
>                    )
>               ) {
>             after_good_pos++;
>         }
>
>         if (region_size != after_good_pos) break;
>
>         after_good_idx++;
>         after_good_pos = 0u;
>     }
>
>     if (num_regions != first_failure_idx) raise(SIGABRT);
> }

I've found that for the above swap_testing5.c I can make variations that change how much of the allocated-region prefix ends up zero vs. stays good. I vary the sleep time between testing the initialized allocations and doing the fork.
The longer the sleep, the more zero pages show up (be sure to read the comments):

# diff swap_testing[56].c
1c1
< // swap_testing5.c
---
> // swap_testing6.c
5c5
< // cc -g -std=c11 -Wpedantic -o swaptesting5 swap_testing5.c
---
> // cc -g -std=c11 -Wpedantic -o swaptesting5 swap_testing6.c
33c33
< sleep(30); // Potentially force swap-out here.
---
> sleep(150); // Potentially force swap-out here.
37a38,48
> // For no-swap-out-here cases:
> //
> // The longer the sleep here, the more allocations
> // end up as zero.
> //
> // top's Mem Active, Inact, Wired, Buf, Free and
> // Swap Total, Used, and Free stay unchanged.
> // What does change is that the process RES decreases
> // while the process SIZE and SWAP stay unchanged
> // during this sleep.
>

NOTE: On other architectures that I've tried (such as armv6/armv7), RES does not decrease during the sleep -- and the problem does not happen even for sleeps as long as I've tried. (I use "stress -m 2 --vm-bytes 900M" on armv6/armv7 instead of -m 1 --vm-bytes 1800M because an allocation that large in one process is not allowed there.)

So watching top's RES during the sleep (longer than a few seconds) just before the fork() predicts the later fails-vs.-not status: if RES decreases (while the other figures associated with the process stay the same) then there will be a failure.

At this point I've no clue why the sleeping process has a decreasing RES(ident memory) size. I infer that without the sleep there is still a small loss of RES, but on too short a timescale to observe in "top -PCwaopid" or the like: in other words, the same behavior is causing the failure then as well, possibly with a loss of only one page of RES.

===
Mark Millard
markmi at dsl-only.net