Date:      Wed, 6 May 2020 15:46:04 -0700
From:      Mark Millard <marklmi@yahoo.com>
To:        Justin Hibbits <chmeeedalf@gmail.com>
Cc:        Brandon Bergren <bdragon@FreeBSD.org>, FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>
Subject:   Re: svn commit: r360233 - in head: contrib/jemalloc . . . : This partially breaks a 2-socket 32-bit powerpc (old PowerMac G4) based on head -r360311
Message-ID:  <012EC6DD-AF2F-40EE-A9E2-A74ACE28E7A3@yahoo.com>
In-Reply-To: <20200506120215.2615b439@titan.knownspace>
References:  <C24EE1A1-FAED-42C2-8204-CA7B1D20A369.ref@yahoo.com> <C24EE1A1-FAED-42C2-8204-CA7B1D20A369@yahoo.com> <1588493689.54538000.et1xl2l8@frv55.fwdcdn.com> <922FBA7C-039D-4852-AC8F-E85A221C2559@yahoo.com> <b7297680-2f4e-4b75-9303-274f4461a0b6@www.fastmail.com> <20200506120215.2615b439@titan.knownspace>



On 2020-May-6, at 10:02, Justin Hibbits <chmeeedalf@gmail.com> wrote:

> On Sun, 03 May 2020 09:56:02 -0500
> "Brandon Bergren" <bdragon@FreeBSD.org> wrote:
> 
>> On Sun, May 3, 2020, at 9:38 AM, Mark Millard via freebsd-ppc wrote:
>>> 
>>> Observing and reporting the reverting result is an initial
>>> part of problem isolation. I made no request for FreeBSD
>>> to give up on using the updated jemalloc. (Unfortunately,
>>> I'm not sure what a good next step of problem isolation
>>> might be for the dual-socket PowerMac G4 context.)  
>> 
>> I appreciate this testing btw. The only dual-socket G4 I have (my
>> xserve g4) does not have the second socket populated, so I am
>> currently unable to test two-socket ppc32.
>> 
>>> Other than reverting, no patch is known for the issue at
>>> this point. More problem isolation is needed first.
>>> 
>>> While I do not have access, https://wiki.freebsd.org/powerpc
>>> lists more modern 32-bit powerpc hardware as supported:
>>> MPC85XX evaluation boards and AmigaOne A1222 (powerpcspe).
>>> (The AmigaOne A1222 seems to be dual-core/single-socket.)  
>> 
>> jhibbits has an A1222 that is used as an actual primary desktop, and
>> I will hopefully have one at the end of the year. And I have an RB800
>> that I use for testing.
>> 
>> powerpcspe is really a different beast than aim32 though. I have been
>> mainly working on aim32 on g4 laptops, although I do have an xserve.
>> 
>>> 
>>> So folks with access to one of those may want to see
>>> if they also see the problem(s) with head -r360233 or
>>> later.  
>> 
>> Frankly, I wouldn't be surprised if this continues to be down to the
>> timebase skew somehow. I know that jemalloc tends to be sensitive to
>> time problems.
>> 
>>> 
>>> Another interesting context to test could be single-socket
>>> with just one core. (I might be able to do that on another
>>> old PowerMac, booting the same media after moving the
>>> media.)  
>> 
>> That's my primary aim32 testing platform. I have a stack of g4
>> laptops that I test on, and a magically working USB stick (ADATA
>> C008 / 8GB model; for some reason it just works, I've never seen
>> another stick actually work).
>> 
>>> 
>>> If I understand right, the most common 32-bit powerpc
>>> tier 2 hardware platforms may still be old PowerMac's.
>>> They are considered supported and "mature", instead of
>>> just "stable". See https://wiki.freebsd.org/powerpc .
>>> However, the reality is that there are various problems
>>> for old PowerMacs (32-bit and 64-bit, at least when
>>> there is more than one socket present). The wiki page
>>> does not hint at such. (I'm not sure about
>>> single socket/multi-core PowerMacs: no access to
>>> such.)  
>> 
>> Yes, neither I nor jhibbits have multiple socket g4 hardware at the
>> moment, and I additionally don't have multiple socket g5 either.
>> 
>>> 
>>> It is certainly possible for some problem to happen
>>> that would lead to dropping the supported-status
>>> for some or all old 32-bit PowerMacs, even as tier 2.
>>> But that has not happened yet and I'd have no say in
>>> such a choice.  
>> 
>> From a kernel standpoint, I for one have no intention of dropping
>> 32-bit support in the foreseeable future. In fact, I've been putting
>> more work into 32-bit than 64-bit recently.
>> 
> 
> I currently have FreeBSD HEAD from late last week running on a dual G4
> MDD (WITNESS kernel), and no segmentation faults from dhclient.  I'm
> using the following patch against jemalloc.  Brandon has reported other
> results with that patch to me, so I'm not sure it's a correct patch.
> 
> - Justin

Thanks.

The status of trying to track this down . . .

I normally use MALLOC_PRODUCTION= in my normally non-debug
builds, so: no jemalloc asserts. I therefore tried a "debug"
build without MALLOC_PRODUCTION= , and so far I've had no
failures after booting with that world-build, nor have any
asserts failed. It has been longer than usual, but it would
probably be a few days before I concluded much. (At some
point I'll reboot just to change the conditions some and
then give it more time as well.)
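
(For reference, that knob is just a make.conf line; the exact
spelling below is from my own setup, so treat it as illustrative:)

```
# /etc/make.conf
# Defining this builds jemalloc without its debug asserts;
# commenting it out gives the "debug" jemalloc described above.
MALLOC_PRODUCTION=
```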

I had hoped this type of build would detect a problem earlier,
soon after things start going bad internally.

I've still no means of directly causing the problem. I've
still only seen the odd SIGSEGV's in dhclient, rpcbind,
mountd, nfsd, and sendmail.

I've really only learned:

A) Either messed-up memory contents are involved,
   or addresses in registers were pointing to
   the wrong place. (I know not which for sure.)

B) When it starts seems to be probabilistic in
   each of the 5 types of context. (Possibly some
   data race involved?)

C) The programs do not all fail together but over time
   more than one type can get failures.

D) Once sendmail's quickly executing subprocess starts
   having the problem during its exit, later instances
   seem to have it as well. (Inheriting bad memory
   content via a fork-ish operation that creates the
   subprocess?)

E) I do have the example failure of one of the contexts
   with the prior jemalloc code. (It was a
   MALLOC_PRODUCTION= style build.) (I reverted to the
   modern jemalloc that seemed to expose the problem
   more.)
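
The fork-inheritance guess in (D) is at least mechanically plausible:
fork(2) gives the child a copy of the parent's address space, so heap
state already corrupted in the parent would be reproduced in every
child. A trivial illustration (my own sketch, nothing jemalloc-specific):

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a forked child observes the same byte the parent
 * wrote before fork(), 0 if not, -1 on error. */
static int
child_inherits_byte(void) {
	unsigned char *buf = malloc(16);
	if (buf == NULL)
		return -1;
	memset(buf, 0, 16);
	buf[5] = 0xAA;	/* stand-in for silently-corrupted allocator state */

	pid_t pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		/* The child's copy of the heap has the same contents. */
		_exit(buf[5] == 0xAA ? 0 : 1);
	}
	int status;
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	free(buf);
	return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 1 : 0;
}
```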

So far I've made no progress isolating the context
where the problem starts. I've no clue how much is
messed up, or for how long it has been messed up, by
the time a notice is reported.

I still do not blame jemalloc: as far as I know it could
be just contributing to exposing problem(s) from other
code instead of having problems of its own. Some of the
SIGSEGVs are not in jemalloc code at the time of the
SIGSEGV.


> diff --git a/contrib/jemalloc/include/jemalloc/internal/cache_bin.h b/contrib/jemalloc/include/jemalloc/internal/cache_bin.h
> index d14556a3da8..728959a448e 100644
> --- a/contrib/jemalloc/include/jemalloc/internal/cache_bin.h
> +++ b/contrib/jemalloc/include/jemalloc/internal/cache_bin.h
> @@ -88,7 +88,7 @@ JEMALLOC_ALWAYS_INLINE void *
>  cache_bin_alloc_easy(cache_bin_t *bin, bool *success) {
>  	void *ret;
>  
> -	bin->ncached--;
> +	cache_bin_sz_t cached = --bin->ncached;
>  
>  	/*
>  	 * Check for both bin->ncached == 0 and ncached < low_water
> @@ -111,7 +111,7 @@ cache_bin_alloc_easy(cache_bin_t *bin, bool *success) {
>  	 * cacheline).
>  	 */
>  	*success = true;
> -	ret = *(bin->avail - (bin->ncached + 1));
> +	ret = *(bin->avail - (cached + 1));
>  
>  	return ret;
>  }

As it stands, it is messy trying to conclude whether something
helps vs. hurts vs. makes little difference. So I'm not
sure how or when I'll try the above. So far I've focused
on reproducing the problem, possibly in a way that
gives better (earlier) information.
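
For what it's worth, here is a simplified standalone model of what the
quoted patch changes (my own sketch, not jemalloc's actual code; the
low_water handling and real types are elided). The point of the patch
is that the index is computed from a local copy ("cached") of the
decremented counter, so a later re-read of bin->ncached from memory,
which something could change in between, can no longer skew which slot
gets returned:

```c
#include <stddef.h>

typedef int cache_bin_sz_t;

typedef struct {
	cache_bin_sz_t ncached;	/* number of cached items */
	void **avail;		/* items live at avail[-1] .. avail[-ncached] */
} cache_bin_t;

/* Patched-style variant: the index comes from the saved local value,
 * not from a second read of bin->ncached. */
static void *
alloc_easy_patched(cache_bin_t *bin, int *success) {
	cache_bin_sz_t cached = --bin->ncached;

	if (cached < 0) {	/* bin was empty */
		bin->ncached = 0;
		*success = 0;
		return NULL;
	}
	*success = 1;
	/* Same expression as the patched line in the diff. */
	return *(bin->avail - (cached + 1));
}
```

The pre-patch code instead evaluated *(bin->avail - (bin->ncached + 1)),
re-reading the field after the decrement; the two are equivalent only
if nothing touches bin->ncached in between.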

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)



