From: Matthew Rezny <matthew@reztek.cz>
To: freebsd-stable@freebsd.org
Date: Sat, 1 Feb 2014 07:09:12 +0100
Subject: Tuning kern.maxswzone

What should be a simple adjustment has become a real head-scratcher. This mysterious tunable is barely documented, and the results I'm seeing while experimenting only add to the confusion. Apologies for the length; I just spent half the evening trying to tune this value.

I have two low-end boxes that I recently put 10-RC1 on and wanted to update to 10-STABLE. The hardware is rather dated, and by modern standards one could call it pathetic, but it's still way more than what I started on with FreeBSD 3.x, so I expect to still be able to build world; it just might take a while. I started having hangs just trying to do an svn checkout on one of the two boxes. The only difference between the boxes is the amount of RAM, so I figured a low-memory problem was wedging something, and the one with a little less memory kept spraying a message about increasing kern.maxswzone, so I figured I should look into it. And thus I opened the can of worms.

The common wisdom is that minimum swap should be twice RAM, until you get to several GB, at which point swap merely equal to RAM may do. Using a little extra if you can spare the disk space was generally considered cheap insurance, and a heavily loaded server might have 4x as much swap as RAM. tuning(7) almost says the more the merrier. I was not expecting to hit a maximum, though it makes sense that there must be one of some sort.

These two boxes have VIA C3-800 CPUs with room for 2 PC133 DIMMs. Using the compatible memory I have on hand (the only larger sticks are ECC, but the chipset rejects those) puts one box at 256MB and the other at 384MB, which should not be all that bad. The hard drives are marketed as 36GB, with a real capacity of 34GB, so I assigned an even 2GB to swap and 32GB to UFS volumes. I was going to do 1GB of swap as more than sufficient, but 2GB made nice even numbers, and it gives more room for the tmpfs-mounted /tmp to spill into swap in case I toss some rather large file there.
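For reference, all the numbers I quote below come straight from the base system; I believe vm.stats.vm.v_page_count is the page count behind the "avail memory" line at boot, though that correspondence is my assumption:

    # physical vs. available memory, and the configured swap devices
    sysctl hw.realmem hw.physmem
    sysctl -n vm.stats.vm.v_page_count    # available memory, in 4KB pages (assumption above)
    swapinfo -k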
According to what I've read, the default limit on swap is 8x the RAM capacity. The kernel structures that track swap pages are apparently not dynamically allocated when swap is mounted, so the limit is a kernel tunable. 2GB is exactly 8x 256MB, so I should be just fine, except I keep seeing messages at boot that state "warning: total configured swap (524287 pages) exceeds maximum recommended amount (481376 pages). warning: increase kern.maxswzone or reduce amount of swap."

Well, it turns out the calculation is based not on physical memory but on available memory. 8MB of RAM is allocated as shared (UMA) video memory, so memtest86 shows only 248MB rather than 256MB. The kernel image also takes some RAM that is considered not available. The boot messages state physical RAM is 256MB and available RAM is 233MB. 233MB/256MB = 91%, and of course 481376 pages / 524287 pages = 91.8%, confirming that the available number is used rather than the physical one. That makes sense where there is a substantial difference, but in the general case, where the difference is small, it just produces a headache when the admin multiplies physical RAM by 8 and arrives at a swap size that is just large enough to trigger a warning during boot.

The obvious course of action is to increase the tunable; I just need to know what value to use. A quick search finds no documentation on its meaning (no mention in man pages, nothing found searching the wiki and website), only a few old forum and mailing list posts asking about it, with replies that essentially say to turn it up until the message goes away, starting by adding the difference. No exact math is available, just rough estimates of how much RAM should be needed for some amount of swap, so my first guess was low by a factor of 10, and the number of usable swap pages went way down. I tried the other direction, multiplying the default value by 1.09 to add the 9% it's short. I put that value in, but there was no change to the boot messages. I tried increasing it again; no change. I tried setting something silly high, and then it panicked on boot.

So I figured maybe I should take the number from the other box, the one with 384MB physical RAM, and just try using that. With 358MB available, it should have sufficient kernel resources to track 2.8GB of swap space with default settings. I looked at the number it reports for kern.maxswzone, expecting to find something about 50% larger than what the box with less RAM had defaulted to. Surprise: the numbers are exactly the same! I looked not twice but thrice, with a fresh reboot each time, to confirm the defaults are indeed identical.

What does kern.maxswzone actually do, and what does its number even mean? Supposedly it's the size in bytes of some memory structure used to track swap pages. What little I could find on the topic mentions that there should be enough for twice as many pages of swap as will actually be used, but it's not clear whether that doubling is accounted for in the warning message. One old mailing list message mentions setting this up to 32MB for 20GB of swap, which works out to about 1.6MB of memory per GB of swap. If that number were right, my 2GB of swap would need a bit over 3MB of kernel memory, which seems fine. The default value of kern.maxswzone on both boxes is 36175872, which is 34.5MB! That seems like an awful lot: 15% of the available memory would be used to track what is swapped out to disk. Aside from the question of why the default is an order of magnitude larger than that estimate, why am I getting the message about not having enough when the allocation appears ample?
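If my 8x-available theory is right, a quick check like the following should land near the "maximum recommended" figure from the warning (assuming, and this is only my guess, that the cap is derived from vm.stats.vm.v_page_count):

    # assumption: the warning's recommended max is 8 * available page count
    pagecnt=$(sysctl -n vm.stats.vm.v_page_count)
    echo "$(( pagecnt * 8 )) pages"    # compare to the warning's recommended max

For anyone else experimenting: kern.maxswzone is a boot-time tunable (read-only once the kernel is up), so it has to go in /boot/loader.conf, e.g.:

    kern.maxswzone="36175872"    # example value only; sysctl -w after boot has no effect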
I started cutting the number down, halving it until I saw it actually do something. Only when I finally got down to about 5MB did I see some impact (though that's still double what the estimates suggest, so maybe the warning does account for the doubling for safety). With an actual number of usable swap pages to work from, I calculated approximately 17.25 bytes per page, from which it should take slightly over 8MB to cover the 481376 pages. I started trying numbers around there, quickly established endpoints, and bisected the interval. Booting with kern.maxswzone=8302000 showed 480928 pages of usable swap and kern.maxswzone=8303000 showed 481376 pages, so it moves in chunks, and the threshold should be between those points. While trying to bisect to an exact number, I wound up all the way back at 8302000 with 481376 pages, so I stopped there, as it's not entirely consistent. Looking back through a month of logs, I can see the number varied a little across boots with no changes to configuration. It's probably on the edge, and something as simple as a difference in the state left by the BIOS and boot loader, depending on whether it's a cold or warm boot, soft or hard reset, etc., can knock it down a chunk seemingly at random.

So, as best I can tell, the actual effective value used for kern.maxswzone corresponds to approximately 8x available RAM. If there is some need to turn it down (when using substantially less swap), that is possible, but turning it up (as suggested by the warning message) is not. Setting any higher value does not appear to actually increase the number of swap pages that can be used. Worse, it might actually be increasing some allocation in the kernel without the additional memory being usable. If values over the 8x RAM limit were simply ignored, I don't think I would have gotten the panic when I set maxswzone to something over 40MB (which isn't that much above the default). This seems so strange that I really hope I'm somehow completely off base, and I do hope someone can shed light on this situation and reveal some glaring mistake I've made.

At first I ignored the warning (too much swap, bah), but I had to start digging when I hit strange problems I couldn't ignore on the box that differs from its otherwise identical twin only in amount of RAM, and thus in swap-to-RAM ratio. To cut a longer story short, I switched from cvsup to svnup to svn while going from 9-STABLE to 10. I accidentally made a mess: I forgot to clear /usr/src before doing the svn checkout. Unlike sup, svn won't just overwrite; it tracks all those existing files as conflicts to resolve after it has fetched everything. Rather than answer a few thousand prompts, I decided the prudent thing was rm -r * and a fresh checkout. Imagine my surprise when rm hung on the box with less RAM. Ctrl-C stopped rm about five minutes later, with no disk activity, but then I couldn't execute commands, so I rebooted. Trying again, I saw it delete a few files (with disk activity), then pass over more (with no disk activity) and apparently get stuck. Once that happens, I can't log in on another console. I booted an mfsbsd CD to poke at it from another angle and found I could delete everything rather rapidly, though rm did seem to use a lot of memory. Several times rm blew up with out-of-memory errors, sometimes taking sh or even getty with it, so I had to log in again several times, and even reboot once (after init blew out), to get all of /usr/src deleted.
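For anyone checking my math, here are the two calculations, assuming the per-page cost is linear (the chunking above means it's only approximately so):

    echo "scale=2; 8303000 / 481376" | bc    # ~17.25 bytes per usable swap page
    echo "524287 * 17.25" | bc               # ~9043951 bytes to cover the full 2GB of swap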
Booting up mfsbsd, with its monster md, leaves about 30MB free on this system, which is relatively tiny, but in the 4.x days I was building world on a P60 with 32MB RAM, so I can't imagine how rm could need more than that. So why would the mfsbsd instance, with far less free memory, be able to delete the files? The key difference that comes to mind is that mfsbsd is not trying to use swap; if excess swap can make allocations hang for minutes rather than fail instantly, the normal system would look hung under memory pressure.

With /usr/src cleared (and after running fsck), I booted back into 10-PRERELEASE to try to fetch the 10-STABLE sources again. I started svnlite co and found it hung shortly thereafter. I tried a few times, but each time it does a couple hundred files at best and just stops. When it stops, I can't log in on another terminal, and if I have a spare console already logged in, I can't run anything there. After a few tries, I managed to catch it with top running in one VT: I started the checkout and switched back just in time. (I never could get top up with rm running; it probably blows past some limit faster.) When the checkout hangs, the state of svnlite is "kmem a" according to top. I can only guess that's short for kmem alloc; presumably some syscall is waiting on an allocation that will never happen, because something is already using, or has leaked, everything that could satisfy it. It looks like activity on too many files within a short period runs something out.

Now things get really messy, because, completely separately from this, I just started seeing hangs in the "kmem a" state at 10-RELEASE on a few boxes with much more memory (RAM greater than or equal to the swap space on these two). I'm going to write up the "kmem a" hangs in a separate email, as I believe that to be a separate issue; it made me look closer at this warning, which then ended up raising more questions. This kern.maxswzone warning seems unsolvable, because the tunable can't effectively be turned up, it seems to have some non-useful side effect, and it may be exacerbating another issue, making the "kmem a" hang more prevalent (to the point that I wonder whether I could make it through another buildworld once I somehow get sources onto the drive).
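In case anyone wants to poke at one of these hangs alongside me, two base-system commands show more than top's truncated state column (the PID below is just a placeholder):

    ps -axl | grep svnlite    # the MWCHAN column shows the full wait channel name
    procstat -kk 1234         # kernel stack of the sleeping thread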