Subject: Re: watchdogd, jemalloc, and mlockall
From: Ian Lepore
To: freebsd-embedded@freebsd.org, freebsd-hackers@freebsd.org
Cc: Jason Evans
Date: Sat, 10 Nov 2012 13:07:05 -0700

On Sat, 2012-11-03 at 12:50 -0600, Ian Lepore wrote:
> On Sat, 2012-11-03 at 20:41 +0200, Konstantin Belousov wrote:
> > On Sat, Nov 03, 2012 at 12:38:39PM -0600, Ian Lepore wrote:
> > > In an
> > > attempt to un-hijack the thread about memory usage increase
> > > between 6.4 and 9.x, I'm starting a new thread here related to my recent
> > > discovery that watchdogd uses a lot more memory since it began using
> > > mlockall(2).
> > >
> > > I tried statically linking watchdogd and it made a small difference in
> > > RSS, presumably because it doesn't wire down all of libc and libm.
> > >
> > >   VSZ   RSS
> > > 10236 10164  Dynamic
> > >  8624  8636  Static
> > >
> > > Those numbers are from ps -u on an arm platform. I just updated the PR
> > > (bin/173332) with some procstat -v output comparing with/without
> > > mlockall().
> > >
> > > It appears that the bulk of the new RSS bloat comes from jemalloc
> > > allocating vmspace in 8MB chunks. With mlockall(MCL_FUTURE) in effect
> > > that leads to wiring 8MB to satisfy what probably amounts to a few
> > > hundred bytes of malloc'd memory.
> > >
> > > It would probably also be a good idea to remove the floating point from
> > > watchdogd to avoid wiring all of libm. The floating point is used just
> > > to turn the timeout-in-seconds into a power-of-two-nanoseconds value.
> > > There's probably a reasonably efficient way to do that without calling
> > > log(), considering that it only happens once at program startup.
> >
> > No, I propose to add a switch to turn on/off the mlockall() call.
> > I have no opinion on the default value of the suggested switch.
>
> In a patch I submitted along with the PR, I added code to query the
> vm.swap_enabled sysctl and only call mlockall() when swapping is
> enabled.
>
> Nobody yet has said anything about what seems to me to be the real
> problem here: jemalloc grabs 8MB at a time even if you only need to
> malloc a few bytes, and there appears to be no way to control that
> behavior. Or maybe there's a knob in there that didn't jump out at me
> on a quick glance through the header files.

I finally found some time to pursue this further.
A small correction to what I said earlier: it appears that jemalloc
allocates chunks of 4MB at a time, not 8, but it also appears that it
allocates at least 2 chunks so the net effect is an 8MB default minimum
allocation.

I played with the jemalloc tuning option lg_chunk and with static versus
dynamic linking, and came up with the numbers below, which were
generated by ps -u on an ARM-based system with 64MB running -current
from a couple weeks ago, but with the recent patch to watchdogd to
eliminate the need for libm. I used "lg_chunk:14" (16K chunks), the
smallest value it would allow on this platform. For comparison I also
include the numbers from a FreeBSD 8.2 ARM system (which would be
dynamic linked and untuned, and also without any mlockall() calls).

Link     malloc    %MEM    VSZ   RSS
-------------------------------------
dynamic  untuned   15.3  10040  9996
static   untuned   13.2   8624  8636
dynamic  tuned      2.8   1880  1836
static   tuned      0.8    480   492
[ freebsd 8.2 ]     1.1   1752   748

So it appears that using jemalloc's tuning in a daemon that uses
mlockall(2) is a big win, especially if the daemon doesn't do much
memory allocation (watchdogd allocates 2 things, 4k and 1280 bytes; if
you use -e it also strdup()s the command string). It also seems that
providing a build-time knob to control static linking would be valuable
on platforms that are very memory limited and can't benefit from having
all of libc wired.

I haven't attached a patch because there appears to be no good way to
actually achieve this in a platform-agnostic way. The jemalloc code
enforces the lower range of the lg_chunk tuning value to be tied to the
page size of the platform, and it rejects out of range values without
changing the tuning. The code that works on an ARM with 4K page size,

    const char *malloc_conf = "lg_chunk:14";

would fail on a system that had bigger pages.
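
To make the page-size dependence concrete, here is a minimal sketch of the kind of range check described above. The offset of 2 above lg(page size) is an assumption, picked only because it reproduces the minimum of 14 observed on 4K pages; the real lower bound is computed inside jemalloc and may depend on other settings. All function names are hypothetical and are not part of jemalloc's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * ASSUMPTION: the smallest legal lg_chunk is lg(page size) plus this
 * offset. The value 2 is chosen only to match the 14 seen on 4K pages;
 * it is not taken from the jemalloc sources.
 */
#define ASSUMED_LG_CHUNK_OFFSET 2

/* Floor of log2(n) for n > 0, using integer arithmetic only. */
static unsigned
lg_floor(size_t n)
{
    unsigned lg = 0;

    assert(n > 0);
    while ((n >>= 1) != 0)
        lg++;
    return (lg);
}

/* Smallest lg_chunk accepted under the assumed rule (hypothetical). */
static unsigned
min_lg_chunk(size_t page_size)
{
    return (lg_floor(page_size) + ASSUMED_LG_CHUNK_OFFSET);
}

/* True if a requested lg_chunk would be accepted rather than rejected. */
static bool
lg_chunk_accepted(unsigned requested, size_t page_size)
{
    return (requested >= min_lg_chunk(page_size));
}
```

Under that assumed rule, lg_chunk_accepted(14, 4096) is true while lg_chunk_accepted(14, 8192) is false, which is why a single hard-coded string cannot be right for every platform.
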
The tuning must be specified with a compile-time constant like the
malloc_conf line above, because it has to be tuned before the first
allocation, which apparently happens before main() is entered. It would
be nice if jemalloc would clip the tuning to the lowest legal value
instead of rejecting it, especially since the lowest legal value is
calculated based not only on page size but also on other configurable
values.

There's another potential solution, but it strikes me as rather
inelegant... jemalloc can also be tuned with the MALLOC_CONF env var.
With the right rc-fu we could provide something like a
watchdogd_memtune variable that you could set and watchdogd would be
invoked with MALLOC_CONF set to that in the environment. It still
couldn't be set to a default value that was good for all platforms. It
would also get passed through environment inheritance to any "-e
whatever" command run by watchdogd, which isn't necessarily appropriate.

I'm cc'ing Jason in case he can offer some advice about a better way to
achieve this tuning, because I'm sure my quick read-through of the
manpage this morning missed important details and implications.

-- Ian
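
The environment-inheritance concern with MALLOC_CONF could in principle be handled by removing the variable in the child before the command is executed. The sketch below illustrates that idea only; run_without_malloc_conf() is a hypothetical helper, it is not how watchdogd actually launches its "-e" command, and it assumes /bin/sh is available.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run cmd via /bin/sh in a child process whose
 * environment has MALLOC_CONF removed, so the command does not inherit
 * the daemon's allocator tuning. The parent's environment is untouched.
 * Returns the command's exit status, or -1 on failure.
 */
static int
run_without_malloc_conf(const char *cmd)
{
    pid_t pid;
    int status;

    assert(cmd != NULL);

    pid = fork();
    if (pid == -1)
        return (-1);

    if (pid == 0) {
        /* Child: drop the tuning variable, then exec the command. */
        unsetenv("MALLOC_CONF");
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);    /* Only reached if exec failed. */
    }

    /* Parent: wait for the child and report its exit status. */
    if (waitpid(pid, &status, 0) == -1)
        return (-1);
    return (WIFEXITED(status) ? WEXITSTATUS(status) : -1);
}
```

Unsetting the variable only in the child keeps the rc-supplied value visible in the daemon itself; this does not address the other drawback noted above, that no single default is good for all platforms.
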