Date:      Wed, 24 Feb 2021 15:53:37 -0800
From:      Mark Millard <marklmi@yahoo.com>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        Alan Somers <asomers@freebsd.org>, FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject:   Re: The out-of-swap killer makes poor choices
Message-ID:  <90EC4887-A29A-4829-B75B-1D88303791A4@yahoo.com>
In-Reply-To: <EA37F4D3-BCED-405B-BF70-2A97B19A9444@yahoo.com>
References:  <CAOtMX2jYmrK7ftx62_NEfNCWS7O=giHKL1p9kXCqq1t5E1arxA@mail.gmail.com> <CAOtMX2i3Njo=KBP=99_G0%2BKuSa00CVgNvacmzhTaoZUYEhwPPA@mail.gmail.com> <YDYyQ1V/hEAGV%2ByJ@kib.kiev.ua> <1984125.0OzZcVfBr4@ravel> <CAOtMX2iYr4NDYE0xHSa_w1hA5XQ2m9cA28NzPoGbfzAKKox9aQ@mail.gmail.com> <YDacl5/AFzFA4nkg@kib.kiev.ua> <EA37F4D3-BCED-405B-BF70-2A97B19A9444@yahoo.com>

On 2021-Feb-24, at 11:59, Mark Millard <marklmi at yahoo.com> wrote:

> On 2021-Feb-24, at 10:36, Konstantin Belousov <kostikbel T gmail.com> wrote:
>
>> On Wed, Feb 24, 2021 at 10:34:23AM -0700, Alan Somers wrote:
>>> There's another silly problem that I didn't mention in my original post.
>>> The old rule of thumb is that the swap partition's size should be twice as
>>> large as the amount of RAM.  However, that's no longer possible in many
>>> cases.  The kernel imposes a hard limit of 64 GiB (on amd64 at least) on
>>> the usable size of any swap partition, and many servers now have far more
>>> than 64 GiB of RAM.  So the advice needs to change with the times.  I don't
>> I do not think so. The usable size of the swap is determined by the
>> amount of swap metadata we pre-configure at boot time. Usually it is
>> sized proportionally to the available physical memory, but you can
>> override swap zones size manually with the knob.
>
> There was a period of time when the 128 GiByte RAM ThreadRipper
> had its previous 192 GiByte swap partition use rejected and I
> had to split it into 3 64 GiByte ones. Later I saw a checkin that
> was a correction to some calculation (vague memory) and I retried
> having one 192 GiByte swap partition and it was again allowed.
>
> The ability to dump to a swap partition when there was a
> 64 GiByte limitation with 128 GiByte of RAM had implications
> for the configuration. I actually arranged having a partition
> that was only used for dump's potential use. That took some
> rearrangement to form a large enough space, making other
> tradeoffs to do so.
>
>
> (I'm not sure if I can find the commit that led to me switching
> back to more than 64 GiByte for a swap file on the large memory
> machine. I do not remember details any more.)

The 64 GiByte size limit (as seen in my environment) was
removed by the following commit:

https://cgit.freebsd.org/src/commit/sys/vm/swap_pager.c?id=00fd73d2dabdee2638203dd1145f007787f05be9
a.k.a.:
https://svnweb.freebsd.org/base?view=revision&revision=363532

QUOTE
author	Doug Moore <dougm@FreeBSD.org>	2020-07-25 18:29:10 +0000
committer	Doug Moore <dougm@FreeBSD.org>	2020-07-25 18:29:10 +0000
. . .

Fix an overflow bug in the blist allocator that needlessly capped max
swap size by dividing a value, which was always a multiple of 64, by
64.  Remove the code that reduced max swap size down to that cap.

Eliminate the distinction between BLIST_BMAP_RADIX and
BLIST_META_RADIX.  Call them both BLIST_RADIX.

Make improvments to the blist self-test code to silence compiler
warnings and to test larger blists.

Reported by:	jmallett
Reviewed by:	alc
Discussed with:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D25736

Notes:
    svn path=/head/; revision=363532
END QUOTE



The sequence of evidence that led me there:

I established a large swap partition on a device holding
an old snapshot of my ThreadRipper environment,
resulting in:

# gpart show -pl nvd1
=>       40  937703008    nvd1  GPT  (447G)
         40       1024  nvd1p1  FBSDFSSDboot  (512K)
       1064  746586112  nvd1p2  FBSDFSSDroot  (356G)
  746587176  191115872  nvd1p3  FBSDFSSDswap  (91G)

I got a kernel from the ci.freebsd.org artifacts and put
it in place on the old snapshot of my ThreadRipper environment
(which could no longer even boot because of ACPI incompatibilities),
thus updating the old failing kernel but leaving the rest unchanged:

# uname -apKU
FreeBSD FBSDFSSD 13.0-CURRENT FreeBSD 13.0-CURRENT #0 r358314: Tue Feb 25 18:08:20 UTC 2020     root@FreeBSD-head-amd64-build.jail.ci.FreeBSD.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC  amd64 amd64 1300081 1300037

So: old head (13) environment booted on the 128 GiByte
ThreadRipper:

From /var/log/messages:

WARNING: reducing swap size to maximum of 65536MB per unit

# swapinfo
Device          1K-blocks     Used    Avail Capacity
/dev/gpt/FBSDFSSDswap  67108864        0 67108864     0%

The code that produced the message and limited
the size was in sys/vm/swap_pager.c back in that
time frame:

static void
swaponsomething(struct vnode *vp, void *id, u_long nblks,
    sw_strategy_t *strategy, sw_close_t *close, dev_t dev, int flags)
{
        struct swdevt *sp, *tsp;
        swblk_t dvbase;
        u_long mblocks;

        /*
         * nblks is in DEV_BSIZE'd chunks, convert to PAGE_SIZE'd chunks.
         * First chop nblks off to page-align it, then convert.
         *
         * sw->sw_nblks is in page-sized chunks now too.
         */
        nblks &= ~(ctodb(1) - 1);
        nblks = dbtoc(nblks);

        /*
         * If we go beyond this, we get overflows in the radix
         * tree bitmap code.
         */
        mblocks = 0x40000000 / BLIST_META_RADIX;
        if (nblks > mblocks) {
                printf(
    "WARNING: reducing swap size to maximum of %luMB per unit\n",
                    mblocks / 1024 / 1024 * PAGE_SIZE);
                nblks = mblocks;
        }
. . .
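
As a cross-check, the arithmetic in that excerpt can be replayed in
user space. The following is only a sketch, not the kernel code: it
assumes 512-byte disk blocks, 4096-byte pages (amd64), and a
BLIST_META_RADIX of 64 (inferred from the "by 64" in the commit
message and from the logged 65536MB). The EX_ constants and ex_
helper names are hypothetical.

```c
/*
 * Illustration only, not the kernel code.  The three constants are
 * assumptions: 512-byte disk blocks, 4096-byte pages (amd64), and a
 * radix of 64 (inferred from "by 64" in the commit message and from
 * the logged 65536MB).
 */
#define EX_DEV_BSIZE		512UL
#define EX_PAGE_SIZE		4096UL
#define EX_BLIST_META_RADIX	64UL

/* Hypothetical helper: the per-unit cap, in pages (the old mblocks). */
unsigned long
ex_swap_cap_pages(void)
{
	return (0x40000000UL / EX_BLIST_META_RADIX);
}

/* Hypothetical helper: the MB figure as the old printf() computed it. */
unsigned long
ex_swap_cap_mb(void)
{
	return (ex_swap_cap_pages() / 1024 / 1024 * EX_PAGE_SIZE);
}

/*
 * Hypothetical helper: page-align a size given in disk blocks, convert
 * it to pages, and apply the cap, following the quoted excerpt.
 */
unsigned long
ex_usable_swap_pages(unsigned long nblks)
{
	unsigned long blks_per_page = EX_PAGE_SIZE / EX_DEV_BSIZE;
	unsigned long mblocks = ex_swap_cap_pages();

	nblks &= ~(blks_per_page - 1);	/* stands in for ~(ctodb(1) - 1) */
	nblks /= blks_per_page;		/* stands in for dbtoc(nblks) */
	return (nblks > mblocks ? mblocks : nblks);
}

/* Hypothetical helper: pages to the 1K-blocks unit that swapinfo prints. */
unsigned long
ex_pages_to_kblocks(unsigned long pages)
{
	return (pages * (EX_PAGE_SIZE / 1024));
}
```

Under those assumptions, feeding in the 191115872 blocks of nvd1p3
from the gpart output gives 16777216 usable pages, ex_swap_cap_mb()
gives 65536, and ex_pages_to_kblocks() of the usable pages gives
67108864, which matches both the WARNING line and the swapinfo
figure above.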

Then I used blame to find the fix in git by looking at:

https://cgit.freebsd.org/src/blame/sys/vm/swap_pager.c


>>> know what the best size would be for a modern server, but I would guess
>>> that it must be at least several times the RSS of your largest process, and
>>> also at least one tenth of RAM (for use as a dump device with compressed
>>> core dumps).


===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)



