Date:      Tue, 25 Jan 2022 16:08:15 -0800
From:      Mark Millard <marklmi@yahoo.com>
To:        bob prohaska <fbsd@www.zefox.net>
Cc:        Free BSD <freebsd-arm@freebsd.org>
Subject:   Re: Troubles building world on stable/13
Message-ID:  <58DF1E04-98F4-496C-AFEC-B80EADFF8A74@yahoo.com>
In-Reply-To: <20220125221753.GA44654@www.zefox.net>
References:  <FA290367-D4B6-463D-AC67-64F224B3C227@yahoo.com> <FBD31544-6D8F-40DB-BC36-F0B2BBA78A14@yahoo.com> <8595CFBD-DC65-4472-A0A1-8A7BE1C031D6@yahoo.com> <20220124165449.GA39982@www.zefox.net> <5FAC2B2C-7740-435E-A183-FB3EF1FCE7F9@yahoo.com> <1CB4EDCD-0998-4363-8CEA-14854EB76FA3@yahoo.com> <20220125162245.GA43635@www.zefox.net> <61A3CF79-552C-4884-A8EA-85003B249856@yahoo.com> <20220125180823.GB43635@www.zefox.net> <35046946-7FE4-4E44-950F-BF9CCA72D8F0@yahoo.com> <20220125221753.GA44654@www.zefox.net>



On 2022-Jan-25, at 14:17, bob prohaska <fbsd@www.zefox.net> wrote:

> On Tue, Jan 25, 2022 at 12:49:02PM -0800, Mark Millard wrote:
>> On 2022-Jan-25, at 10:08, bob prohaska <fbsd@www.zefox.net> wrote:
>>
>>> On Tue, Jan 25, 2022 at 09:13:08AM -0800, Mark Millard wrote:
>>>>
>>>> -DBATCH ? I'm not aware of there being any use of that symbol.
>>>> Do you have a documentation reference for it so that I could
>>>> read about it?
>>>>
>>> It's a switch to turn off dialog4ports. I can't find the reference
>>> now. Perhaps it's been deprecated? A name like -DUSE_DEFAULTS would
>>> be easier to understand anyway.
>>
>> I've never had buildworld buildkernel or the like try to use
>> dialog4ports. I've only had port building use it. buildworld
>> and buildkernel can be done with no ports installed at all.
>> dialog4ports is a port.
>>
>
> The attempt to build devel/llvm13 under stable/13 was done under ports.
> Thus the -DBATCH, to avoid manual intervention.

I missed the later reference to devel/llvm13 as applying
to the above and then later confused the contexts,
effectively ignoring devel/llvm13 completely. Sorry.

>> I think -DBATCH was ignored for the activity at hand.
>>
>>> On a whim, I tried building devel/llvm13 on a Pi4 running -current with
>>> 8 GB of RAM and 8 GB of swap. To my surprise, that stopped with:
>>> nemesis.zefox.com kernel log messages:
>>> +FreeBSD 14.0-CURRENT #26 main-5025e85013: Sun Jan 23 17:25:31 PST 2022
>>> +swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1873450, size: 4096
>>> +swap_pager: indefinite wait buffer: bufobj: 0, blkno: 521393, size: 4096
>>> +swap_pager: indefinite wait buffer: bufobj: 0, blkno: 209826, size: 12288
>>> +swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1717218, size: 24576
>>> +pid 56508 (c++), jid 0, uid 0, was killed: failed to reclaim memory
>>>
>>> On an 8GB machine, that seems strange.
>>
>> -j<What?> build? -j4 ?
>>
> Since this too was a port build, I let ports decide. It settled on 4.
>
>> Were you watching the swap usage in top (or some such)?
>>
>
> Top was running but the failure happened overnight. Not expecting
> it to fail, I didn't keep a log of swapping activity. The message
> above was in the next morning's log email.
>
>> Note: The "was killed" related notices have been improved
>> in main, but there is a misnomer case about "out of swap"
>> (last I checked).
>>
>
>> An environment that gets "swap_pager: indefinite wait buffer"
>> notices is problematical and the I/O delays for the virtual
>> memory subsystem can lead to kills, if I understand right.
>>
>> But, if I remember right, the actual message for a directly
>> I/O related kill is now different.
>>
>
> In this case the message was "failed to reclaim memory", a
> message I've not seen before.

Yes, it is one of the more accurate wordings that replaced the old
"out of swap" notices, probably the one covering most occurrences.
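
A small filter can pull these events out of a saved log: it counts the
"swap_pager: indefinite wait buffer" notices and prints every
"was killed:" line, whatever reason follows it. This is only a sketch;
the function name is hypothetical and the patterns are taken from the
messages quoted in this thread, so other kernels may need more.

```shell
#!/bin/sh
# Hypothetical helper: summarize VM-pressure events in a kernel log
# read on stdin (for example /var/log/messages or a saved log email).
summarize_vm_log() {
    awk '
        # Count each "indefinite wait buffer" notice from the swap pager.
        /swap_pager: indefinite wait buffer/ { waits++ }
        # Keep every process-kill notice, whatever the stated reason.
        /was killed:/ { kills[++n] = $0 }
        END {
            printf "indefinite_wait=%d\n", waits
            for (i = 1; i <= n; i++) print kills[i]
        }'
}
```

Usage would be along the lines of `summarize_vm_log < /var/log/messages`.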

>> I think that being able to reproduce this case could be
>> important. I probably can not because I'd not get the
>> "swap_pager: indefinite wait buffer" in my hardware
>> context.

I was thinking buildworld buildkernel here. I got the context
wrong.

I'll eventually do a devel/llvm13 build on the 8 GiByte RPi4B
with my patched top monitoring various "maximum observed"
figures.

> If it's relevant, the case of /usr/ports/devel/llvm13 seems like
> the most expedient test, since it did fail with realistic amounts
> of memory and swap. I gather that there's a certain amount of
> self-recompilation in buildworld, is that true of the port version?
> Does it matter?
>
>>> Per the failure message I restarted the build of devel/llvm13 with
>>> make -DBATCH MAKE_JOBS_UNSAFE=YES > make.log &
>>
>> Just like -DBATCH is for ports, not buildworld buildkernel,
>> MAKE_JOBS_UNSAFE= is for ports, not buildworld buildkernel,
>> at least if I understand right.
>>
> This was a ports build on the Pi4. The restart is running single-thread
> and quite slow, I'm tempted to stop it unless a failure would be useful.

Again an example of my not switching context correctly. Sorry.
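
The restart command quoted above redirects only standard output, so
compiler errors written to standard error would not end up in
make.log. A minimal sketch of an unattended restart that captures
both, with the job count set explicitly, follows. BATCH and
MAKE_JOBS_NUMBER belong to the ports framework, not to buildworld or
buildkernel; the function name, argument order and defaults are
hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper for an unattended port build with a full log.
# Arguments: port directory, job count (default 1), log file (default
# make.log, relative to the port directory because of the cd below).
build_port_logged() {
    port_dir=$1
    jobs=${2:-1}
    log=${3:-make.log}
    (
        cd "$port_dir" || exit 1
        # -DBATCH: skip interactive prompts such as the dialog4ports
        # options screen. MAKE_JOBS_NUMBER: parallel make jobs for the
        # port (1 gives a serial build, roughly like MAKE_JOBS_UNSAFE=yes).
        # 2>&1: send stderr (compiler errors) to the log as well.
        make -DBATCH MAKE_JOBS_NUMBER="$jobs" > "$log" 2>&1
    )
}
```

For example, `build_port_logged /usr/ports/devel/llvm13 2 &` would run
a two-job build and leave everything in that port's make.log.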

>>>>> However, restarting buildworld using -j1 appears to have worked past
>>>>> the former point of failure.
>>>>
> [this on stable/13 pi3]
>>>> Hmm. That usually means one (or both) of two things was involved
>>>> in the failure:
>>>>
>>>> A) a build race where something is not (fully) ready when
>>>>  it is used
>>>>
>>>> B) running out of resources, such as RAM+SWAP
>>>>
>>>
>>> The stable/13 machine is short of swap; it has only 2 GB, which
>>> used to be enough.
>>
>> So RAM+SWAP is 1 GiByte + 2 GiByte, so 3 GiByte on that
>> RPi3*? (That would have been good to know earlier, such
>> as for my attempts at reproduction.)
>>
> Correct, 3GB RAM+swap. Didn't realize it would turn out to
> be important, sorry!

Do not know yet if it would have helped reproduction of
the problem. But I now know that I should try for something
that would give evidence about getting near or over 3
GiBytes.
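
One rough way to collect that evidence without a patched top is to
sample swapinfo periodically and log each new peak of swap use next
to the RAM+SWAP ceiling. This sketch tracks swap only (not RAM in
use), assumes FreeBSD's `swapinfo -k` column layout (plus a "Total"
line when several swap devices exist) and the hw.physmem sysctl in
bytes, and uses hypothetical function names.

```shell
#!/bin/sh
# Sketch: watch how close a build gets to the RAM+SWAP ceiling.

# Sum column $1 of `swapinfo -k` output read on stdin
# (2 = 1K-blocks, 3 = Used), skipping the header and any Total line
# so that multi-device setups are not counted twice.
swap_column_kib() {
    awk -v col="$1" '$1 != "Device" && $1 != "Total" { sum += $col }
        END { print sum + 0 }'
}

# Poll every $1 seconds (default 10): print the RAM+SWAP ceiling once,
# then a timestamped line whenever swap use reaches a new maximum.
track_peak_swap() {
    interval=${1:-10}
    ram_kib=$(( $(sysctl -n hw.physmem) / 1024 ))
    swap_kib=$(swapinfo -k | swap_column_kib 2)
    echo "RAM+SWAP ceiling: $(( ram_kib + swap_kib )) KiB"
    peak=0
    while :; do
        used=$(swapinfo -k | swap_column_kib 3)
        if [ "$used" -gt "$peak" ]; then
            peak=$used
            echo "$(date '+%Y-%m-%d %H:%M:%S') new peak swap use: ${peak} KiB"
        fi
        sleep "$interval"
    done
}
```

Started as `track_peak_swap 10 > peak-swap.log &` before the test, the
last line of the log would give the highest swap use seen.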

>> -j<What?> for the RPi3* when it was failing?
>>
> -j4, but I think it also failed at -j2.
>> Did you have failures with the .cpp and .sh (so no
>> make use involved) in the RAM+SWAP context?
>>
> Using the .cpp and .sh file on a Pi3 with 2 GB swap
> running stable/13 there was a consistent failure.

Ahh, a simpler, quicker test context/case. So that
is likely what I'd look into.

> Using the .cpp and .sh files on a Pi3 with 7GB swap
> there was no failure.
>
> Using a build of /usr/ports/devel/llvm13 as a test the
> build failed even with 8 GB of RAM and 8 GB of swap.
>
>>> Maybe that's the problem, but having an error
>>> report that says it's a segfault is a confusing diagnostic.
>>>
>>>> But, as I understand, you were able to use a .cpp and
>>>> .sh file pair that had been produced to repeat the
>>>> problem on the RPi3B --and that would not have been a
>>>> parallel-activity context.
>>>>
>>>
>>> To be clear, the reproduction was on the same stable/13 that
>>> reported the original failure. An attempt at reproduction
>>> on a different Pi3 running -current ran without any errors.
>>> Come to think of it, that machine had more swap, too.
>>
>> How much swap?
>>
> Two swap partitions, 3.6 GB and 4 GB, both in use.

So that is the devel/llvm13 example, not buildworld
buildkernel, not the .cpp and .sh combination.

>>
>> At this point, I expect that the failure was tied to the
>> RAM+SWAP totaling to 3 GiBytes.
>>
>
> That seems likely, or at least a reasonable suspicion.
>
>> Knowing that context we might have a reproducible report
>> that can be made based on the .cpp and .sh files, where
>> restricting the RAM+SWAP use allowed is part of the
>> report.
>>
>
> There seem to be some other reports of clang using unreasonable
> amounts of memory, for example
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=261341
>
> A much older report that looks vaguely similar (out of memory
> reported as segfault)
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=172576
> It's not arm-related and dates from 2012 but is still open.
>
> I'll try to repeat some of the tests using the logging script
> used previously. Right now it contains:
>
> #!/bin/sh
> while true
> sysctl hw.regulator.5v0.min_uvolt ; do vmstat ; gstat -abd -I 10s ; date ; swapinfo ; tail \
> -n 2 /var/log/messages ; netstat -m | grep "mbuf clusters" ; ps -auxd -w -w
> done
>
> Changes to the script are welcome, the output is voluminous.
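
One possible restructuring of the script above, offered as a sketch
rather than a tested replacement. In the quoted version, `while true`
followed by a newline puts both `true` and the sysctl command in the
loop's condition list, so the loop keeps going only while that sysctl
succeeds; on a kernel without hw.regulator.5v0.min_uvolt it would end
without running the rest of the commands. The variant below moves the
sysctl into the body, timestamps each sample, appends to a log file,
and trims the ps output to the largest memory users to cut volume
(FreeBSD ps `-m` sorts by memory use). The LOG path and function names
are hypothetical.

```shell
#!/bin/sh
# Sketch of a restructured build-monitoring loop (not the original
# script). Assumes FreeBSD sysctl(8), vmstat(8), gstat(8), swapinfo(8),
# netstat(1) and ps(1).

LOG=${LOG:-/var/tmp/buildwatch.log}

# Emit one timestamped sample on stdout.
log_sample() {
    echo "=== $(date '+%Y-%m-%d %H:%M:%S')"
    # Only some RPi kernels expose this sysctl; a failure is noted,
    # not treated as a reason to stop monitoring.
    sysctl hw.regulator.5v0.min_uvolt || echo "(regulator sysctl unavailable)"
    vmstat
    gstat -abd -I 10s        # same options as the quoted script (-b: batch mode)
    swapinfo -k
    tail -n 2 /var/log/messages
    netstat -m | grep "mbuf clusters"
    # Header plus the 15 largest memory users instead of ps -auxd -w -w.
    ps -ax -o pid,rss,vsz,comm -m | head -n 16
}

# Append samples to $LOG until interrupted.
watch_loop() {
    while :; do
        log_sample >> "$LOG" 2>&1
    done
}
```

Sourced with `. ./buildwatch.sh`, it could be started as `watch_loop &`
before the build and the log read afterwards.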

I'll probably not get to experimenting with this for
some time.


===
Mark Millard
marklmi at yahoo.com



