Date:      Sun, 9 Apr 2017 13:25:16 -0700
From:      Mark Millard <markmi@dsl-only.net>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        andrew@freebsd.org, freebsd-hackers@freebsd.org, freebsd-arm <freebsd-arm@freebsd.org>
Subject:   Re: The arm64 fork-then-swap-out-then-swap-in failures: a program source for exploring them
Message-ID:  <8FFE95AA-DB40-4D1E-A103-4BA9FCC6EDEE@dsl-only.net>
In-Reply-To: <9DCAF95B-39A5-4346-88FC-6AFDEE8CF9BB@dsl-only.net>
References:  <4DEA2D76-9F27-426D-A8D2-F07B16575FB9@dsl-only.net> <163B37B0-55D6-498E-8F52-9A95C036CDFA@dsl-only.net> <08E7A5B0-8707-4479-9D7A-272C427FF643@dsl-only.net> <20170409122715.GF1788@kib.kiev.ua> <9D152170-5F19-47A2-A06A-66F83CA88A09@dsl-only.net> <9DCAF95B-39A5-4346-88FC-6AFDEE8CF9BB@dsl-only.net>

[I've not tried building the kernel with
your patch yet.]

Top post of new, independent information.

Jordan Gordeev made a testing suggestion that got me to look
at kdumps of runs with jemalloc allocation sizes that fail
(14*1024) vs. work (14*1024+1).

Example comparison:

 2258 swaptesting6 0.000169 CALL  mmap(0,0x200000,0x3<PROT_READ|PROT_WRITE>,0x1002<MAP_PRIVATE|MAP_ANON>,0xffffffff,0)
 2258 swaptesting6 0.000047 RET   mmap 1080033280/0x40600000
vs.
 2325 swaptesting7 0.000091 CALL  mmap(0,0x200000,0x3<PROT_READ|PROT_WRITE>,0x1002<MAP_PRIVATE|MAP_ANON>,0xffffffff,0)
 2325 swaptesting7 0.000024 RET   mmap 1080033280/0x40600000

No difference. And so it goes.

What varies is the number of mmap's: the larger jemalloc allocation size
gets more mmap's for the same number of jemalloc allocations. (All the
mmap's from my program's explicit allocations are together, back-to-back,
with no other traced activity between.)

But varying the number of jemalloc allocations in the program varies the
number of mmap calls, yet the size of the individual jemalloc allocations
still makes the difference between failure (zeroed pages after
fork-then-swap) and success.

This problem is a complicated one to classify/isolate.

After the allocations there is not much activity visible in the
kdump output. I traced with "-t +" and so avoided page fault
tracing but got nearly everything else.

I may have to ktrace the page faults for the two jemalloc
allocation sizes and see if anything stands out.
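
For anyone who wants to try to reproduce this: below is a minimal
sketch of the A) through E) sequence described in the quoted material
later in this message. It is not my actual swaptesting program; the
counts, sleep times, and pattern byte are only illustrative.

#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define ALLOC_SIZE   (14 * 1024)   /* <= 14 KiB fails; 14 KiB + 1 works */
#define ALLOC_COUNT  (18 * 1024)   /* enough for a couple hundred MiB RES */

int
main(void)
{
	static char *blocks[ALLOC_COUNT];
	size_t i, j, bad = 0;
	pid_t pid;

	/* (A) allocate and fill with a non-zero pattern */
	for (i = 0; i < ALLOC_COUNT; i++) {
		blocks[i] = malloc(ALLOC_SIZE);
		if (blocks[i] == NULL)
			err(1, "malloc");
		memset(blocks[i], 0x5a, ALLOC_SIZE);
	}

	sleep(60);		/* (B) RES already shrinks here on arm64 */

	pid = fork();		/* (C) */
	if (pid == -1)
		err(1, "fork");

	sleep(120);		/* (D) force swapping externally meanwhile,
				   e.g. run stress by hand */

	/* (E) check the pattern in both the parent and the child */
	for (i = 0; i < ALLOC_COUNT; i++)
		for (j = 0; j < ALLOC_SIZE; j++)
			if (blocks[i][j] != 0x5a) {
				bad++;
				break;
			}
	printf("%s: %zu of %d blocks have zeroed/changed bytes\n",
	    pid == 0 ? "child" : "parent", bad, ALLOC_COUNT);

	if (pid != 0)
		wait(NULL);
	return (bad != 0);
}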

On 2017-Apr-9, at 11:24 AM, Mark Millard <markmi at dsl-only.net> wrote:

> On 2017-Apr-9, at 10:24 AM, Mark Millard <markmi at dsl-only.net> wrote:
> 
>> On 2017-Apr-9, at 5:27 AM, Konstantin Belousov <kostikbel@gmail.com> wrote:
>> 
>>> On Sat, Apr 08, 2017 at 06:02:00PM -0700, Mark Millard wrote:
>>>> [I've identified the code path involved in the arm64 small allocations
>>>> turning into zeros after a later fork-then-swap-out-then-swap-back-in:
>>>> specifically the ongoing RES(ident memory) size decrease that
>>>> "top -PCwaopid" shows before the fork/swap sequence. Hopefully
>>>> I've also exposed enough related information for someone who
>>>> knows what they are doing to get started on a specific
>>>> investigation, looking for a fix. I'd like a pine64+
>>>> 2GB to have buildworld complete despite the forking and
>>>> swapping involved (yep: for a time, zero RES(ident memory) for
>>>> some processes involved in the build).]
>>> 
>>> I was not able to follow the walls of text, but do not think that
>>> pmap_ts_reference() is the real culprit there.
>>> 
>>> Is my impression right that the issue occurs on fork, and looks like
>>> memory corruption, where some page suddenly becomes zero-filled?
>>> And swapping seems to be involved?  It is somewhat interesting to see
>>> if the problem is reproducible on non-arm64 machines, e.g. armv7 or amd64.
>> 
>> Yes, yes; everything non-arm64 that I've tried works.
>> 
>> But I think that the following extra detail may be of use: what top
>> shows for RES over time is also odd on arm64 (only), and the number
>> of pages that are zeroed is proportional to the decrease in RES.
>> 
>> In the test sequence:
>> 
>> A) Allocate lots of 14 KiByte allocations and initialize the content of each
>> to non-zero. The example ends up with RES of about 265M.
> 
> I did forget to list one important property: why I picked 14 KiBytes.
> 
> A) Any allocation size <= 14 KiBytes that I've tried
>   gets the zeros problem in my arm64 contexts (bpim3 and rpi3).
> 
> B) Any allocation size >= 14 KiBytes + 1 Byte that I've
>   tried works in those contexts.
> 
> For the arm64 contexts that I use this happens to match
> the jemalloc SMALL_MAXCLASS size boundary. When I looked, it
> appeared that 14 Ki was the smallest SMALL_MAXCLASS value
> in jemalloc, so 14 KiBytes would always land in the small category.
> 
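
If I am reading jemalloc(3) correctly, the small/large boundary can be
double-checked on a given system with mallctl; a small sketch (the
mallctl names are from the manual, everything else here is just
illustrative):

#include <malloc_np.h>
#include <stdio.h>

int
main(void)
{
	unsigned nbins;
	size_t sz, len;
	char name[64];

	/* Number of small size-class bins, then the size of the last
	 * (largest) bin: that largest size is SMALL_MAXCLASS. */
	len = sizeof(nbins);
	if (mallctl("arenas.nbins", &nbins, &len, NULL, 0) != 0)
		return (1);
	snprintf(name, sizeof(name), "arenas.bin.%u.size", nbins - 1);
	len = sizeof(sz);
	if (mallctl(name, &sz, &len, NULL, 0) != 0)
		return (1);
	printf("largest small size class: %zu bytes\n", sz);
	return (0);
}
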
>> B) sleep some amount of time; I've been using well over 30 seconds here.
>> 
>> C) fork
>> 
>> D) sleep again (parent and child), also forcing swapping during the sleep
>>  (I used stress, manually run.)
>> 
>> E) Test the memory pattern in the parent and child processes, passing over
>>  all the bytes, failed and good.
>> 
>> Both the parent and the child in (E) see the first pages allocated as zero,
>> with the number of zeroed pages increasing as the sleep time in (B)
>> increases (as long as the sleep is over 30 sec or so). The parent and child
>> match for which pages are zero vs. not.
>> 
>> It fails with (B) being a no-op as well. But the proportionality with
>> the time for the sleep is interesting.
>> 
>> During (B) "top -PCwaopid" shows RES decreasing, starting after 30 sec
>> or so. The fork in (C) produces a child that does not have the same RES
>> as the parent but instead a tiny RES (80K as I remember). During (E)
>> the child's RES increases to full size.
>> 
>> My powerpc64, armv7, and amd64 tests of such do not fail, nor does RES
>> decrease during (B). The child process gets the same RES as the parent
>> as well, unlike for arm64.
>> 
>> In the failing context (arm64) RES in the parent decreases during (D)
>> before the swap-out as well.
>> 
>>> If the answers to my two questions are yes, there is probably some bug with
>>> arm64 pmap handling of the dirty bit emulation.  ARMv8.0 does not provide
>>> a hardware dirty bit, and pmap interprets an accessed writeable page as
>>> unconditionally dirty.  More, the accessed bit is also not maintained by
>>> hardware; instead it should be set by pmap.  And the arm64 pmap sets the
>>> AF bit unconditionally when creating a valid pte.
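
[Checking my understanding of that description by writing it out as
code; the names and bit positions below are only illustrative
placeholders (HYP_*), not the actual arm64 pte.h/pmap.c definitions:

#include <stdbool.h>
#include <stdint.h>

#define	HYP_ATTR_AF	(1ULL << 10)	/* software-maintained "accessed" flag */
#define	HYP_ATTR_AP_RO	(1ULL << 7)	/* access permission: read-only */

/*
 * ARMv8.0 maintains neither an accessed nor a dirty bit in hardware,
 * so an accessed, writeable mapping has to be assumed dirty.
 */
static inline bool
hyp_pte_assumed_dirty(uint64_t pte)
{
	return ((pte & HYP_ATTR_AF) != 0 &&
	    (pte & HYP_ATTR_AP_RO) == 0);
}

If I follow, the patch below then has pmap_protect() record that
assumed-dirty state via vm_page_dirty() before the mapping is
downgraded to read-only.]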
>> 
>> fork-then-swap-out/in is required to see the problem. Neither fork
>> by itself nor swapping (zero RES as shown in top) by itself has
>> shown the problem so far.
>> 
>>> Hmm, could you try the following patch? I did not even compile it.
>> 
>> I'll try it later today.
>> 
>>> diff --git a/sys/arm64/arm64/pmap.c b/sys/arm64/arm64/pmap.c
>>> index 3d5756ba891..55aa402eb1c 100644
>>> --- a/sys/arm64/arm64/pmap.c
>>> +++ b/sys/arm64/arm64/pmap.c
>>> @@ -2481,6 +2481,11 @@ pmap_protect(pmap_t pmap, vm_offset_t sva, vm_offset_t eva, vm_prot_t prot)
>>> 		    sva += L3_SIZE) {
>>> 			l3 = pmap_load(l3p);
>>> 			if (pmap_l3_valid(l3)) {
>>> +				if ((l3 & ATTR_SW_MANAGED) &&
>>> +				    pmap_page_dirty(l3)) {
>>> +					vm_page_dirty(PHYS_TO_VM_PAGE(l3 &
>>> +					    ~ATTR_MASK));
>>> +				}
>>> 				pmap_set(l3p, ATTR_AP(ATTR_AP_RO));
>>> 				PTE_SYNC(l3p);
>>> 				/* XXX: Use pmap_invalidate_range */

===
Mark Millard
markmi at dsl-only.net



