From: Mark Millard
Date: Sat, 8 Apr 2017 18:02:00 -0700
Subject: Re: The arm64 fork-then-swap-out-then-swap-in failures: a program source for exploring them
Cc: andrew@freebsd.org, Konstantin Belousov
To: freebsd-arm,
    freebsd-hackers@freebsd.org

[I've identified the code path involved in the arm64 small
allocations turning into zeros for a later
fork-then-swapout-then-back-in: specifically, the path behind
the ongoing RES(ident memory) size decrease that
"top -PCwaopid" shows before the fork/swap sequence.
Hopefully I've also exposed enough related information for
someone who knows what they are doing to get started with a
specific investigation, looking for a fix. I'd like a
pine64+ 2GB to have buildworld complete despite the forking
and swapping involved (yes: for a time, zero RES(ident
memory) for some processes involved in the build).]

On 2017-Apr-7, at 1:16 AM, Mark Millard wrote:

> [I now can: (A) crudely control the number of allocated
> pages that get zeros (that should not). (B) Watch a
> "top -PCwaopid" display and predict if the
> test-architecture will fail or not before the fork()
> or swap-out happens.]
>
> On 2017-Apr-4, at 8:00 PM, Mark Millard wrote:
>
>> Uncommenting/commenting parts of the below program allows
>> exploring the problems with fork-then-swap-out-then-in on
>> arm64.
>>
>> Note: By swap-out I mean that zero RES(ident memory) results,
>> for the process(es) of interest, as shown by
>> "top -PCwaopid" .
>>
>> I discovered recently that swapping-out just before the
>> fork() prevents the failure from the swapping after the
>> fork().
>>
>> Note:
>> Without the fork() no problem happens. Without the later
>> swap-out no problem happens. Both are required. But some
>> activities before the fork() or between fork() and the
>> swap-out prevent the failures.
>>
>> Some of the comments are based on a pine64+ 2GB context.
>> I use stress to force swap-outs during some sleeps in
>> the program. See also Bugzilla 217239 and 217138. (I now
>> expect that they have the same cause.)
>>
>> In my environment I've seen the fork-then-swap-out/swap-in
>> failures on a pine64+ 2GB and a rpi3. They are repeatable
>> on both. I do not have access to server-class machines, or
>> any other arm64 machines.
>>
>>
>> // swap_testing5.c
>>
>> // Built via (cc was clang 4.0 in my case):
>> //
>> // cc -g -std=c11 -Wpedantic -o swaptesting5 swap_testing5.c
>> // -O0 and -O2 also get the problem.
>>
>> // Note: jemalloc's tcache needs to be enabled to get the failure.
>> //       But FreeBSD can get into a state where /etc/malloc.conf
>> //       -> 'tcache:false' is ineffective. Also: the allocation
>> //       size needs to be sufficiently small (<= SMALL_MAXCLASS)
>> //       to see the problem. Other comments are based on a specific
>> //       context (pine64+ 2GB).
>>
>> #include <signal.h>    // for raise(.), SIGABRT (induce core dump)
>> #include <unistd.h>    // for fork(), sleep(.)
>> #include <sys/types.h> // for pid_t
>> #include <sys/wait.h>  // for wait(.)
>>
>> extern void test_setup(void);      // Sets up the memory byte patterns.
>> extern void test_check(void);      // Tests the memory byte patterns.
>> extern void memory_willneed(void); // For seeing if
>>                                    // posix_madvise(.,.,POSIX_MADV_WILLNEED)
>>                                    // makes a difference.
>>
>> int main(void) {
>>     sleep(30); // Potentially force swap-out here.
>>                // [Swap-out here does not avoid later failures.]
>>
>>     test_setup();
>>     test_check(); // Before potential sleep(.)/swap-out or fork(.) [passes]
>>
>>     sleep(30); // Potentially force swap-out here.
>>                // [Everything below passes if swapped-out here,
>>                // no matter if there are later swap-outs
>>                // or not.]
>>
>>     pid_t pid = fork(); // To test no-fork use: = 0; no-fork does not fail.
>>     int wait_status = 0;
>>
>>     // HERE: After fork; before sleep/swap-out/wait.
>>
>>     // if (0 < pid) memory_willneed(); // Does not prevent either parent or
>>                                        // child failure if enabled.
>>
>>     // if (0 == pid) memory_willneed(); // Prevents both the parent and the
>>                                         // child failure. Disable to see
>>                                         // failure of both parent and child.
>>                                         // [Presuming no prior swap-out: that
>>                                         // would make everything pass.]
>>
>>     // During sleep/wait: manually force this process to
>>     // swap out. I use something like:
>>     //     stress -m 1 --vm-bytes 1800M
>>     // in another shell and ^C'ing it after top shows the
>>     // swapped status desired. 1800M just happened to work
>>     // on the Pine64+ 2GB that I was using. I watch with
>>     // top -PCwaopid [checking for zero RES(ident memory)].
>>
>>     if (0 < pid) {
>>         sleep(30); // Intend to swap-out during sleep.
>>         // test_check(); // Test in parent before child runs (longer sleep).
>>                          // This test fails if run for a failing region_size
>>                          // unless earlier preventing-activity happened.
>>         wait(&wait_status); // Only if test_check above passes or is
>>                             // disabled above.
>>     }
>>     if (-1 != wait_status && 0 <= pid) {
>>         if (0 == pid) { sleep(90); } // Intend to swap-out during sleep.
>>         test_check(); // Fails for small-enough region_size, both
>>                       // parent and child processes, unless earlier
>>                       // preventing-activity happened.
>>     }
>> }
>>
>> // The memory and test code follows.
>>
>> #include <stddef.h>   // for size_t, NULL
>> #include <stdlib.h>   // for malloc(.), free(.)
>> #include <sys/mman.h> // for POSIX_MADV_WILLNEED, posix_madvise(.,.,.)
>>
>> #define region_size (14u*1024u)
>>         // Bad dyn_region pattern, parent and child processes examples:
>>         //    256u, 2u*1024u, 4u*1024u, 8u*1024u, 9u*1024u, 12u*1024u, 14u*1024u
>>         // No failure examples:
>>         //    14u*1024u+1u, 15u*1024u, 16u*1024u, 32u*1024u, 256u*1024u*1024u
>> #define num_regions (256u*1024u*1024u/region_size)
>>
>> typedef volatile unsigned char value_type;
>> struct region_struct { value_type array[region_size]; };
>> typedef struct region_struct region;
>> static region * volatile dyn_regions[num_regions] = {NULL,};
>>
>> static value_type value(size_t v) { return (value_type)((v&0xFEu)|0x1u); }
>>         // value avoids zero values: the bad values are zeros.
>>
>> void test_setup(void) {
>>     for(size_t i=0u; i<num_regions; i++) {
>>         dyn_regions[i] = malloc(sizeof(region));
>>         if (!dyn_regions[i]) raise(SIGABRT);
>>
>>         for(size_t j=0u; j<region_size; j++) {
>>             (*dyn_regions[i]).array[j] = value(j);
>>         }
>>     }
>> }
>>
>> void memory_willneed(void) {
>>     for(size_t i=0u; i<num_regions; i++) {
>>         (void) posix_madvise(dyn_regions[i], region_size, POSIX_MADV_WILLNEED);
>>     }
>> }
>>
>> static volatile size_t first_failure_idx = 0u; // dyn_regions index
>> static volatile size_t first_failure_pos = 0u; // sub-array index
>> static volatile size_t after_bad_idx     = 0u; // dyn_regions index
>> static volatile size_t after_bad_pos     = 0u; // sub-array index
>> static volatile size_t after_good_idx    = 0u; // dyn_regions index
>> static volatile size_t after_good_pos    = 0u; // sub-array index
>>
>> // Note: Some failing cases get (conjunctive notation):
>> //
>> //    0 == first_failure_idx < after_bad_idx < after_good_idx == num_regions
>> //    && 0 == first_failure_pos && 0<=after_bad_pos<=region_size && after_good_idx==0
>> //    && (after_bad_pos is a multiple of the page size in Bytes, here:
>> //        after_bad_pos==N*4096 for some non-negative integral value N)
>> //
>> // other failing cases instead fail with:
>> //
>> //    0 == first_failure &&
>> //    num_regions == after_bad_idx == after_good_idx
>> //    && 0 == first_failure_pos == after_bad_pos == after_good_idx
>> //
>> // after_bad_idx strongly tends to vary from failing run to failing run
>> // as does after_bad_pos.
>>
>> // Note: The working cases get:
>> //
>> //    num_regions == first_failure == after_bad_idx == after_good_idx
>> //    && 0 == first_failure_pos == after_bad_pos == after_good_idx
>>
>> void test_check(void) {
>>     first_failure_idx = first_failure_pos = 0u;
>>
>>     while (first_failure_idx < num_regions) {
>>         while (   first_failure_pos < region_size
>>                && (  value(first_failure_pos)
>>                   == (*dyn_regions[first_failure_idx]).array[first_failure_pos]
>>                   )
>>               ) {
>>             first_failure_pos++;
>>         }
>>
>>         if (region_size != first_failure_pos) break;
>>
>>         first_failure_idx++;
>>         first_failure_pos = 0u;
>>     }
>>
>>     after_bad_idx = first_failure_idx;
>>     after_bad_pos = first_failure_pos;
>>
>>     while (after_bad_idx < num_regions) {
>>         while (   after_bad_pos < region_size
>>                && (  value(after_bad_pos)
>>                   != (*dyn_regions[after_bad_idx]).array[after_bad_pos]
>>                   )
>>               ) {
>>             after_bad_pos++;
>>         }
>>
>>         if(region_size != after_bad_pos) break;
>>
>>         after_bad_idx++;
>>         after_bad_pos = 0u;
>>     }
>>
>>     after_good_idx = after_bad_idx;
>>     after_good_pos = after_bad_pos;
>>
>>     while (after_good_idx < num_regions) {
>>         while (   after_good_pos < region_size
>>                && (  value(after_good_pos)
>>                   == (*dyn_regions[after_good_idx]).array[after_good_pos]
>>                   )
>>               ) {
>>             after_good_pos++;
>>         }
>>
>>         if(region_size != after_good_pos) break;
>>
>>         after_good_idx++;
>>         after_good_pos = 0u;
>>     }
>>
>>     if (num_regions != first_failure_idx) raise(SIGABRT);
>> }
>
>
> I've found that for the above swap_testing5.c
> I can make variations that change how much of the
> allocated region prefix ends up zero vs. stays good.
>
> I vary the sleep time between testing the initialized
> allocations and doing the fork. The longer the sleep
> the more zero pages show up (be sure to read the
> comments):
>
> # diff swap_testing[56].c
> 1c1
> < // swap_testing5.c
> ---
> > // swap_testing6.c
> 5c5
> < // cc -g -std=c11 -Wpedantic -o swaptesting5 swap_testing5.c
> ---
> > // cc -g -std=c11 -Wpedantic -o swaptesting5 swap_testing6.c
> 33c33
> < sleep(30); // Potentially force swap-out here.
> ---
> > sleep(150); // Potentially force swap-out here.
> 37a38,48
> > // For no-swap-out here cases:
> > //
> > // The longer the sleep here the more allocations
> > // that end up as zero.
> > //
> > // top's Mem Active, Inact, Wired, Buf, Free and
> > // Swap Total, Used, and Free stay unchanged.
> > // What does change is the process RES decreases
> > // while the process SIZE and SWAP stay unchanged
> > // during this sleep.
> >
>
> NOTE: On other architectures that I've tried (such as armv6/v7)
> RES does not decrease during the sleep, and the problem
> does not happen even for the longest sleeps that I've tried.
>
> (I use "stress -m 2 --vm-bytes 900M" on armv6/v7 instead
> of -m 1 --vm-bytes 1800M because an allocation that large
> in one process is not allowed there.)
>
> So watching top's RES during the sleep (longer than a few
> seconds) just before the fork() predicts the later
> fails-vs.-not status: If RES decreases (while other things
> associated with the process status stay the same) then
> there will be a failure.
>
> At this point I've no clue why the sleeping process has
> a decreasing RES(ident memory) size.
>
> I infer that without the sleep there still is a small
> amount of loss of RES but on too short of a timescale
> to observe in a "top -PCwaopid" or other such: in other
> words that the same behavior is causing the failure then
> as well, possibly for a loss of only one page of RES.
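
To put a number on how much of a region reads back as zeros after
the fork/swap sequence, a helper along the following lines could be
called next to test_check. This is only a sketch: count_zero_pages
and PAGE_BYTES are hypothetical names that are not part of
swap_testing5.c, and the 4096-byte page size is taken from the
N*4096 note in the program's comments. Chunks are counted from the
start of the buffer, so they line up with real pages only when the
buffer happens to be page-aligned, which malloc does not promise.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_BYTES 4096u /* assumption: 4 KiB pages, per the N*4096 note */

/*
 * Hypothetical helper (not in swap_testing5.c): count the whole
 * PAGE_BYTES-sized chunks of buf that contain only zero bytes.
 * A trailing partial chunk is ignored. Because value() in the test
 * program never produces zero, an all-zero chunk cannot be valid data.
 */
static size_t count_zero_pages(const volatile unsigned char *buf, size_t len) {
    assert(buf != NULL || len == 0u);

    size_t zero_pages = 0u;
    for (size_t off = 0u; off + PAGE_BYTES <= len; off += PAGE_BYTES) {
        size_t i = 0u;
        while (i < PAGE_BYTES && 0u == buf[off + i]) i++;
        if (PAGE_BYTES == i) zero_pages++; /* reached the end: all zeros */
    }
    return zero_pages;
}
```

For one region of the test program the call would look like
count_zero_pages(dyn_regions[i]->array, region_size); summing over
all i gives a rough total to compare between sleep lengths.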

I've been able to identify what code sequence is gradually
removing the "small_mappings" via some breakpointing in the
kernel after reaching the "should be just sleeping" status.

Specifically I started with breakpointing when
pmap_resident_count_dec was on the call stack in order to
see the call chain(s) that lead to it being called while
RES(ident memory) is gradually decreasing during the sleep
that is just before forking.

(tid 100067 is [pagedaemon{pagedaemon}], which is in
vm_pageout_worker. bt does not show inlined layers.)

[ thread pid 17 tid 100067 ]
Breakpoint at   $x.1:   undefined       d65f03c0
db> bt
Tracing pid 17 tid 100067 td 0xfffffd0001c4aa00
. . .
handle_el1h_sync() at pmap_remove_l3+0xdc
         pc = 0xffff000000604870  lr = 0xffff000000611158
         sp = 0xffff000083a49980  fp = 0xffff000083a49a40

pmap_remove_l3() at pmap_ts_referenced+0x580
         pc = 0xffff000000611158  lr = 0xffff000000615c50
         sp = 0xffff000083a49a50  fp = 0xffff000083a49ac0

pmap_ts_referenced() at vm_pageout+0xe60
         pc = 0xffff000000615c50  lr = 0xffff0000005d1f74
         sp = 0xffff000083a49ad0  fp = 0xffff000083a49b50

vm_pageout() at fork_exit+0x94
         pc = 0xffff0000005d1f74  lr = 0xffff0000002e01c0
         sp = 0xffff000083a49b60  fp = 0xffff000083a49b90

fork_exit() at fork_trampoline+0x10
         pc = 0xffff0000002e01c0  lr = 0xffff0000006177b4
         sp = 0xffff000083a49ba0  fp = 0x0000000000000000

It turns out that pmap_ts_referenced is on its:

small_mappings:
. . .

path for the above so the pmap_remove_l3 call is
the one from that execution path. (Found by more
breakpointing after enabling such on the paths.)
So this is the path with: (breakpoint hook not
shown)

                                /*
                                 * Wired pages cannot be paged out so
                                 * doing accessed bit emulation for
                                 * them is wasted effort. We do the
                                 * hard work for unwired pages only.
                                 */
                                pmap_remove_l3(pmap, pte, pv->pv_va, tpde,
                                    &free, &lock);
                                pmap_invalidate_page(pmap, pv->pv_va);
                                cleared++;
                                if (pvf == pv)
                                        pvf = NULL;
                                pv = NULL;
. . .

pmap_remove_l3 decrements the resident_count
in this sequence.

From what I can tell this code is eliminating the
content of pages that, in the failing tests, have
no backing store yet (not swapped-out yet by
test design).

The observed behavior is that the pages that have the
above happen end up as zero pages after the later
fork-then-swapout-then-back-in .

I do not see anything putting the pages that this
happens to into any other lists to keep track of
the page content. The swap-out and swap-in seem to
have ignored these pages and to have been based on
automatically zeroed pages instead.

Note that the (or a) question might be if these pages
should have ever gotten to this code at all. (I'm
no expert overall.) But that might get into why
POSIX_MADV_WILLNEED spanning each page is sufficient
to avoid the zeros issue for
fork-then-swapout-and-back-in.

I'll only write here about what the backtrace code
seems to be doing, if I'm interpreting it correctly.

One oddity here is that pmap_remove_l3 does its own
pmap_invalidate_page to invalidate the same tlb entry
as the above pmap_invalidate_page, so a double-invalidate.
(I've no clue if such is just suboptimal vs. a form of
error.)

pmap_remove_l3 here does things that the analogous
sys/arm/arm/pmap-v6.c's pmap_ts_referenced does not do,
and pmap-v6 does something this code does not.

arm64's pmap_remove_l3 does (in summary):

   pmap_invalidate_page
   decrements the resident_count
   pmap_unwire_l3
   (then pmap_ts_referenced's small_mappings code
    does another pmap_invalidate_page for the
    same argument values)

arm pmap-v6's pmap_ts_referenced's small_mappings code
does:

   conditional vm_page_dirty
   pte2_clear_bit for PTE2_A
   pmap_tlb_flush

There is, for example, no decrement of the resident_count
involved (that I found anyway).

But I've no clue just what should be analogous vs. what
should not between pmap-v6 and arm64's pmap code in this
area.
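
The difference between the two small_mappings behaviors summarized
above can be pictured with a toy user-space model. This is a sketch
only, not kernel code: toy_pte, toy_pmap, toy_clear_accessed and
toy_remove_mapping are hypothetical names, TLB invalidation and the
dirty-bit handling are left out, and the rule that only valid,
referenced mappings are touched is an assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; none of these types exist in the FreeBSD kernel. */
struct toy_pte  { bool valid; bool accessed; };
struct toy_pmap { size_t resident_count; };

/*
 * Models the pmap-v6 style summarized above: only the accessed flag
 * is cleared. The mapping stays valid and resident_count is untouched.
 * Returns 1 if a reference was cleared, 0 otherwise.
 */
static int toy_clear_accessed(struct toy_pmap *pmap, struct toy_pte *pte) {
    (void)pmap; /* unused: nothing in the pmap changes */
    if (!pte->valid || !pte->accessed) return 0;
    pte->accessed = false;
    return 1;
}

/*
 * Models the arm64 small_mappings style summarized above: the whole
 * mapping is dropped, so resident_count goes down by one.
 * Returns 1 if a reference was cleared, 0 otherwise.
 */
static int toy_remove_mapping(struct toy_pmap *pmap, struct toy_pte *pte) {
    if (!pte->valid || !pte->accessed) return 0;
    assert(pmap->resident_count > 0u);
    pte->valid = false;
    pte->accessed = false;
    pmap->resident_count--;
    return 1;
}
```

In this model both functions report one cleared reference, but only
toy_remove_mapping lowers resident_count, which is the kind of
counter that a gradually shrinking RES would reflect.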

I'll also note that the code before the arm64 small_mappings
code also uses pmap_remove_l3 but does not do the decrement
nor the extra pmap_invalidate_page (for example). But again
I do not know how analogous the two paths should be.

Only the small_mappings path seems to have the
end-up-with-zeros problem for the later
fork-then-swap-out and then swap-back-in context.

===
Mark Millard
markmi at dsl-only.net