From: Matthew Rezny <matthew@reztek.cz>
To: freebsd-stable@freebsd.org
Date: Sat, 1 Feb 2014 07:09:12 +0100
Subject: Tuning kern.maxswzone

What should be a simple adjustment has become a real head-scratcher. This mysterious tunable is barely documented, and the results I'm seeing while experimenting only add to the confusion. Apologies for the length; I just spent half the evening trying to tune this value.

I have two low-end boxes that I recently put 10-RC1 on and wanted to update to 10-STABLE. The hardware is rather dated, and by modern standards one could call it pathetic, but it's still way more than what I started on with FreeBSD 3.x, so I expect to still be able to build world; it just might take a while. I started having hangs just trying to do an svn checkout on one of the two boxes. The only difference between the boxes is the amount of RAM, so I figured a low-memory problem was wedging something, and the one with a little less memory kept spraying a message about increasing kern.maxswzone, so I figured I should look into it. And thus I opened the can of worms.

The common wisdom is that minimum swap should be twice RAM, until you get to several GB, at which point swap merely equal to RAM may do. Using a little extra if you can spare the disk space was generally considered cheap insurance, and a heavily loaded server might have 4x as much swap as RAM. tuning(7) almost says the more the merrier. I was not expecting to hit a maximum, though it makes sense that there must be one of some sort.

These two boxes have VIA C3-800 CPUs with room for 2 PC133 DIMMs. Using the compatible memory I have on hand (the only larger sticks are ECC, but the chipset rejects those) puts one box at 256MB and the other at 384MB, which should not be all that bad. The hard drives are marketed as 36GB, with a real capacity of 34GB, so I assigned an even 2GB to swap and 32GB to UFS volumes. I was going to do 1GB of swap as more than sufficient, but 2GB made nice even numbers, and it gives more room for the tmpfs-mounted /tmp to spill into swap in case I toss some rather large file there.
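For reference, all the numbers I quote below come straight from the base system; I believe vm.stats.vm.v_page_count is the page count behind the "avail memory" line at boot, though that correspondence is my assumption:

    # physical vs. available memory, and the configured swap devices
    sysctl hw.realmem hw.physmem
    sysctl -n vm.stats.vm.v_page_count    # available memory, in 4KB pages (assumption above)
    swapinfo -k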
According to what I've read, the default limit on swap is 8x the RAM capacity. The kernel structures that track swap pages are apparently not dynamically allocated when swap is mounted, so the limit is a kernel tunable. 2GB is exactly 8x 256MB, so I should be just fine, except I keep seeing messages at boot that state "warning: total configured swap (524287 pages) exceeds maximum recommended amount (481376 pages). warning: increase kern.maxswzone or reduce amount of swap."

Well, it turns out the calculation is based not on physical memory but on available memory. 8MB of RAM is allocated as shared (UMA) video memory, so memtest86 shows only 248MB rather than 256MB. The kernel image also takes some RAM that is considered not available. The boot messages state physical RAM is 256MB and available RAM is 233MB. 233MB/256MB = 91%, and of course 481376 pages / 524287 pages = 91.8%, confirming that the available number is used rather than the physical one. That makes sense where there is a substantial difference, but in the general case, where the difference is small, it just produces a headache when the admin multiplies physical RAM by 8 and arrives at a swap size that is just large enough to trigger a warning during boot.

The obvious course of action is to increase the tunable; I just need to know what value to use. A quick search finds no documentation on its meaning (no mention in man pages, nothing found searching the wiki and website), only a few old forum and mailing list posts asking about it, with replies that essentially say to turn it up until the message goes away, starting by adding the difference. No exact math is available, just rough estimates of how much RAM should be needed for some amount of swap, so my first guess was low by a factor of 10, and the number of usable swap pages went way down. I tried the other direction, multiplying the default value by 1.09 to add the 9% it's short. I put that value in, but there was no change to the boot messages. I tried increasing it again; no change. I tried setting something silly high, and then it panicked on boot.

So I figured maybe I should take the number from the other box, the one with 384MB physical RAM, and just try using that. With 358MB available, it should have sufficient kernel resources to track 2.8GB of swap space with default settings. I looked at the number it reports for kern.maxswzone, expecting to find something about 50% larger than what the box with less RAM had defaulted to. Surprise: the numbers are exactly the same! I looked not twice but thrice, with a fresh reboot each time, to confirm the defaults are indeed identical.

What does kern.maxswzone actually do, and what does its number even mean? Supposedly it's the size in bytes of some memory structure used to track swap pages. What little I could find on the topic mentions that there should be enough for twice as many pages of swap as will actually be used, but it's not clear whether that doubling is accounted for in the warning message. One old mailing list message mentions setting this up to 32MB for 20GB of swap, which works out to about 1.6MB of memory per GB of swap. If that number were right, my 2GB of swap would need a bit over 3MB of kernel memory, which seems fine. The default value of kern.maxswzone on both boxes is 36175872, which is 34.5MB! That seems like an awful lot: 15% of the available memory would be used to track what is swapped out to disk. Aside from the question of why the default is an order of magnitude larger than that estimate, why am I getting the message about not having enough when the allocation appears ample?
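If my 8x-available theory is right, a quick check like the following should land near the "maximum recommended" figure from the warning (assuming, and this is only my guess, that the cap is derived from vm.stats.vm.v_page_count):

    # assumption: the warning's recommended max is 8 * available page count
    pagecnt=$(sysctl -n vm.stats.vm.v_page_count)
    echo "$(( pagecnt * 8 )) pages"    # compare to the warning's recommended max

For anyone else experimenting: kern.maxswzone is a boot-time tunable (read-only once the kernel is up), so it has to go in /boot/loader.conf, e.g.:

    kern.maxswzone="36175872"    # example value only; sysctl -w after boot has no effect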
I started cutting the number down, halving it until I saw it actually do something. Only when I finally got down to about 5MB did I see some impact (though that's still double what the estimates suggest, so maybe the warning does account for the doubling for safety). With an actual number of usable swap pages to work from, I calculated approximately 17.25 bytes per page, from which it should take slightly over 8MB to cover the 481376 pages. I started trying numbers around there, quickly established endpoints, and bisected the interval. Booting with kern.maxswzone=8302000 showed 480928 pages of usable swap and kern.maxswzone=8303000 showed 481376 pages, so it moves in chunks, and the threshold should be between those points. While trying to bisect to an exact number, I wound up all the way back at 8302000 with 481376 pages, so I stopped there, as it's not entirely consistent. Looking back through a month of logs, I can see the number varied a little across boots with no changes to configuration. It's probably on the edge, and something as simple as a difference in the state left by the BIOS and boot loader, depending on whether it's a cold or warm boot, soft or hard reset, etc., can knock it down a chunk seemingly at random.

So, as best I can tell, the actual effective value used for kern.maxswzone corresponds to approximately 8x available RAM. If there is some need to turn it down (when using substantially less swap), that is possible, but turning it up (as suggested by the warning message) is not. Setting any higher value does not appear to actually increase the number of swap pages that can be used. Worse, it might actually be increasing some allocation in the kernel without the additional memory being usable. If values over the 8x RAM limit were simply ignored, I don't think I would have gotten the panic when I set maxswzone to something over 40MB (which isn't that much above the default). This seems so strange that I really hope I'm somehow completely off base, and I do hope someone can shed light on this situation and reveal some glaring mistake I've made.

At first I ignored the warning (too much swap, bah), but I had to start digging when I hit strange problems I couldn't ignore on the box that differs from its otherwise identical twin only in amount of RAM, and thus in swap-to-RAM ratio. To cut a longer story short, I switched from cvsup to svnup to svn while going from 9-STABLE to 10. I accidentally made a mess: I forgot to clear /usr/src before doing the svn checkout. Unlike sup, svn won't just overwrite; it tracks all those existing files as conflicts to resolve after it has fetched everything. Rather than answer a few thousand prompts, I decided the prudent thing was rm -r * and a fresh checkout. Imagine my surprise when rm hung on the box with less RAM. Ctrl-C stopped rm about five minutes later, with no disk activity, but then I couldn't execute commands, so I rebooted. Trying again, I saw it delete a few files (with disk activity), then pass over more (with no disk activity) and apparently get stuck. Once that happens, I can't log in on another console. I booted an mfsbsd CD to poke at it from another angle and found I could delete everything rather rapidly, though rm did seem to use a lot of memory. Several times rm blew up with out-of-memory errors, sometimes taking sh or even getty with it, so I had to log in again several times, and even reboot once (after init blew out), to get all of /usr/src deleted.
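For anyone checking my math, here are the two calculations, assuming the per-page cost is linear (the chunking above means it's only approximately so):

    echo "scale=2; 8303000 / 481376" | bc    # ~17.25 bytes per usable swap page
    echo "524287 * 17.25" | bc               # ~9043951 bytes to cover the full 2GB of swap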
Booting up mfsbsd, with its monster md, leaves about 30MB free on this system, which is relatively tiny, but in the 4.x days I was building world on a P60 with 32MB RAM, so I can't imagine how rm could need more than that. So why would the mfsbsd instance, with far less free memory, be able to delete the files? The key difference that comes to mind is that mfsbsd is not trying to use swap; if excess swap can make allocations hang for minutes rather than fail instantly, the normal system would look hung under memory pressure.

With /usr/src cleared (and after running fsck), I booted back into 10-PRERELEASE to try to fetch the 10-STABLE sources again. I started svnlite co and found it hung shortly thereafter. I tried a few times, but each time it does a couple hundred files at best and just stops. When it stops, I can't log in on another terminal, and if I have a spare console already logged in, I can't run anything there. After a few tries, I managed to catch it with top running in one VT: I started the checkout and switched back just in time. (I never could get top up with rm running; it probably blows past some limit faster.) When the checkout hangs, the state of svnlite is "kmem a" according to top. I can only guess that's short for kmem alloc; presumably some syscall is waiting on an allocation that will never happen, because something is already using, or has leaked, everything that could satisfy it. It looks like activity on too many files within a short period runs something out.

Now things get really messy, because, completely separately from this, I just started seeing hangs in the "kmem a" state at 10-RELEASE on a few boxes with much more memory (RAM greater than or equal to the swap space on these two). I'm going to write up the "kmem a" hangs in a separate email, as I believe that to be a separate issue; it made me look closer at this warning, which then ended up raising more questions. This kern.maxswzone warning seems unsolvable, because the tunable can't effectively be turned up, it seems to have some non-useful side effect, and it may be exacerbating another issue, making the "kmem a" hang more prevalent (to the point that I wonder whether I could make it through another buildworld once I somehow get sources onto the drive).
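In case anyone wants to poke at one of these hangs alongside me, two base-system commands show more than top's truncated state column (the PID below is just a placeholder):

    ps -axl | grep svnlite    # the MWCHAN column shows the full wait channel name
    procstat -kk 1234         # kernel stack of the sleeping thread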