Subject: Re: arm64 fork/swap data corruptions: A ~110 line C program demonstrating an example (Pine64+ 2GB context) [Corrected subject: arm64!]
From: Mark Millard
Date: Tue, 14 Mar 2017 21:33:08 -0700
To: Andrew Turner, freebsd-arm, FreeBSD Current
Cc: FreeBSD-STABLE Mailing List
Message-Id: <10F50F1C-FD26-4142-9350-966312822438@dsl-only.net>

A single Byte access to a 4K Byte aligned region between
the fork and wait/sleep/swap-out prevents that specific
4K Byte region from having the (bad) zeros.

Sounds like a page sized unit of behavior to me.

Details follow.

On 2017-Mar-14, at 3:28 PM, Mark Millard wrote:

> [test_check() between the fork and the wait/sleep prevents the
> failure from occurring. Even a small access to the memory at
> that stage prevents the failure. Details follow.]
>
> On 2017-Mar-14, at 11:07 AM, Mark Millard wrote:
>
>> [This is just a correction to the subject-line text to say arm64
>> instead of amd64.]
>>
>> On 2017-Mar-14, at 12:58 AM, Mark Millard wrote:
>>
>> [Another correction I'm afraid --about alternative program variations
>> this time.]
>>
>> On 2017-Mar-13, at 11:52 PM, Mark Millard wrote:
>>
>>> I'm still at a loss about how to figure out what stages are messed
>>> up. (Memory coherency? Some memory not swapped out? Bad data swapped
>>> out? Wrong data swapped in?)
>>>
>>> But at least I've found a much smaller/simpler example to demonstrate
>>> some problem with in my Pine64+_ 2GB context.
>>>
>>> The Pine64+ 2GB is the only amd64 context that I have access to.
>>
>> Someday I'll learn to type arm64 the first time instead of amd64.
>>
>>> The following program fails its check for data
>>> having its expected byte pattern in dynamically
>>> allocated memory after a fork/swap-out/swap-in
>>> sequence.
>>>
>>> I'll note that the program sleeps for 60s after
>>> forking to give time to do something else to
>>> cause the parent and child processes to swap
>>> out (RES=0 as seen in top).
>>
>> The following about the extra test_check() was
>> wrong.
>>
>>> Note the source code line:
>>>
>>> // test_check(); // Adding this line prevents failure.
>>>
>>> It seem that accessing the region contents before forking
>>> and swapping avoids the problem. But there is a problem
>>> if the region was only written-to before the fork/swap.
>
> There is a place that if a test_check call is put then the
> problem does not happen at any stage: I tried putting a
> call between the fork and the later wait/sleep code:

I changed the byte sequence patterns to avoid
zero values since the bad values are zeros:

    static value_type value(size_t v) { return (value_type)((v&0xFEu)|0x1u); }
                      // value now avoids the zero value since the failures
                      // are zeros.

With that I can then test accurately which bytes have
bad values and which do not. I also changed to:

    void partial_test_check(void) {
        if (value(0u)!=gbl_region.array[0])    raise(SIGABRT);
        if (value(0u)!=(*dyn_region).array[0]) raise(SIGABRT);
    }

since previously [0] had a zero value and so I'd used [1].

On this basis I'm now using the below. See the comments tied to
partial_test_check() calls:

    extern void test_setup(void);         // Sets up the memory byte patterns.
    extern void test_check(void);         // Tests the memory byte patterns.
    extern void partial_test_check(void); // Tests just [0] of each region
                                          // (gbl_region and dyn_region).

    int main(void) {
        test_setup();
        test_check(); // Before fork() [passes]

        pid_t pid = fork();
        int wait_status = 0;

        // After fork; before wait/sleep/swap-out.

        if (0==pid) partial_test_check();
                        // Even the above is sufficient by
                        // itself to prevent failure for
                        // region_size 1u through
                        // 4u*1024u!
                        // But 4u*1024u+1u and above fail
                        // with this access to memory.
                        // The failing test is of
                        // (*dyn_region).array[4096u].
                        // This test never fails here.

        if (0

> This suggests to me that the small access is forcing one or more things to
> be initialized for memory access that fork is not establishing of itself.
> It appears that if established correctly then the swap-out/swap-in
> sequence would work okay without needing the manual access to the memory.
>
>
> So far via this test I've not seen any evidence of problems with the global
> region but only the dynamically allocated region.
>
> However, the symptoms that started this investigation in a much more
> complicated context had an area of global memory from a .so that ended
> up being zero.
>
> I think that things should be fixed for this simpler context first and
> that further investigation of the sh/su related should wait to see what
> things are like after this test case works.

===
Mark Millard
markmi at dsl-only.net
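
The page-granular pattern described above (region_size up to 4u*1024u
passes, and the first failing byte is array[4096u]) can be checked
mechanically. The following is only a minimal sketch, not the program
used for the results in this message: fill_region(), first_bad_offset(),
count_bad_pages() and PAGE_BYTES are hypothetical names, the 4096 Byte
page size is an assumption, and chunks are counted by offset from the
start of the region rather than by actual address alignment.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char value_type;

enum { PAGE_BYTES = 4096 }; /* Assumption: 4K Byte pages. */

/* Same pattern as in the message: bit 0 is always set, so no
 * expected byte is ever zero. */
static value_type value(size_t v) { return (value_type)((v & 0xFEu) | 0x1u); }

/* Hypothetical helper: writes the expected pattern into the region. */
static void fill_region(value_type *region, size_t size)
{
    assert(region != NULL);
    for (size_t i = 0; i < size; i++) region[i] = value(i);
}

/* Hypothetical helper: returns the offset of the first byte that does
 * not match the expected pattern, or size when every byte matches. */
static size_t first_bad_offset(const value_type *region, size_t size)
{
    assert(region != NULL);
    for (size_t i = 0; i < size; i++)
        if (region[i] != value(i)) return i;
    return size;
}

/* Hypothetical helper: counts the PAGE_BYTES-sized chunks (measured from
 * the start of the region) that contain at least one mismatching byte. */
static size_t count_bad_pages(const value_type *region, size_t size)
{
    size_t bad = 0;
    assert(region != NULL);
    for (size_t start = 0; start < size; start += PAGE_BYTES) {
        size_t end = (size - start > PAGE_BYTES) ? start + PAGE_BYTES : size;
        for (size_t i = start; i < end; i++) {
            if (region[i] != value(i)) { bad++; break; }
        }
    }
    return bad;
}
```

Called on the dynamically allocated region after the swap-in, in place
of raising SIGABRT at the first mismatch, first_bad_offset() would be
expected to report 4096 for the failure described above, and
count_bad_pages() would show how many 4K Byte chunks ended up with bad
values.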