From owner-freebsd-hackers@FreeBSD.ORG  Sat Nov 10 20:07:25 2012
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Delivered-To: freebsd-hackers@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
 by hub.freebsd.org (Postfix) with ESMTP id 6D83C6BC;
 Sat, 10 Nov 2012 20:07:25 +0000 (UTC)
 (envelope-from freebsd@damnhippie.dyndns.org)
Received: from duck.symmetricom.us (duck.symmetricom.us [206.168.13.214])
 by mx1.freebsd.org (Postfix) with ESMTP id 3F0B28FC08;
 Sat, 10 Nov 2012 20:07:24 +0000 (UTC)
Received: from damnhippie.dyndns.org (daffy.symmetricom.us [206.168.13.218])
 by duck.symmetricom.us (8.14.5/8.14.5) with ESMTP id qAAK7Hh4037959;
 Sat, 10 Nov 2012 13:07:24 -0700 (MST)
 (envelope-from freebsd@damnhippie.dyndns.org)
Received: from [172.22.42.240] (revolution.hippie.lan [172.22.42.240])
 by damnhippie.dyndns.org (8.14.3/8.14.3) with ESMTP id qAAK75gq019245;
 Sat, 10 Nov 2012 13:07:05 -0700 (MST)
 (envelope-from freebsd@damnhippie.dyndns.org)
Subject: Re: watchdogd, jemalloc, and mlockall
From: Ian Lepore <freebsd@damnhippie.dyndns.org>
To: freebsd-embedded@freebsd.org, freebsd-hackers@freebsd.org
In-Reply-To: <1351968635.1120.110.camel@revolution.hippie.lan>
References: <1351967919.1120.102.camel@revolution.hippie.lan>
 <20121103184143.GC73505@kib.kiev.ua>
 <1351968635.1120.110.camel@revolution.hippie.lan>
Content-Type: text/plain; charset="us-ascii"
Date: Sat, 10 Nov 2012 13:07:05 -0700
Message-ID: <1352578025.17290.123.camel@revolution.hippie.lan>
Mime-Version: 1.0
X-Mailer: Evolution 2.32.1 FreeBSD GNOME Team Port 
Content-Transfer-Encoding: 7bit
Cc: Jason Evans <jasone@freebsd.org>
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
 <freebsd-hackers.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-hackers>, 
 <mailto:freebsd-hackers-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-hackers>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Help: <mailto:freebsd-hackers-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>,
 <mailto:freebsd-hackers-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 10 Nov 2012 20:07:25 -0000

On Sat, 2012-11-03 at 12:50 -0600, Ian Lepore wrote:
> On Sat, 2012-11-03 at 20:41 +0200, Konstantin Belousov wrote:
> > On Sat, Nov 03, 2012 at 12:38:39PM -0600, Ian Lepore wrote:
> > > In an attempt to un-hijack the thread about memory usage increase
> > > between 6.4 and 9.x, I'm starting a new thread here related to my recent
> > > discovery that watchdogd uses a lot more memory since it began using
> > > mlockall(2).
> > > 
> > > I tried statically linking watchdogd and it made a small difference in
> > > RSS, presumably because it doesn't wire down all of libc and libm.
> > > 
> > >  VSZ   RSS
> > > 10236 10164  Dynamic
> > >  8624  8636  Static
> > > 
> > > Those numbers are from ps -u on an arm platform.  I just updated the PR
> > > (bin/173332) with some procstat -v output comparing with/without
> > > mlockall().
> > > 
> > > It appears that the bulk of the new RSS bloat comes from jemalloc
> > > allocating vmspace in 8MB chunks.  With mlockall(MCL_FUTURE) in effect
> > > that leads to wiring 8MB to satisfy what probably amounts to a few
> > > hundred bytes of malloc'd memory.
> > > 
> > > It would probably also be a good idea to remove the floating point from
> > > watchdogd to avoid wiring all of libm.  The floating point is used just
> > > to turn the timeout-in-seconds into a power-of-two-nanoseconds value.
> > > There's probably a reasonably efficient way to do that without calling
> > > log(), considering that it only happens once at program startup.
> > 
> > No, I propose to add a switch to turn on/off the mlockall() call.
> > I have no opinion on the default value of the suggested switch.
> 
> In a patch I submitted along with the PR, I added code to query the
> vm.swap_enabled sysctl and only call mlockall() when swapping is
> enabled.  
> 
> Nobody yet has said anything about what seems to me to be the real
> problem here:  jemalloc grabs 8MB at a time even if you only need to
> malloc a few bytes, and there appears to be no way to control that
> behavior.  Or maybe there's a knob in there that didn't jump out at me
> on a quick glance through the header files.

I finally found some time to pursue this further.  A small correction to
what I said earlier: it appears that jemalloc allocates chunks of 4MB at
a time, not 8MB, but it also appears to allocate at least two chunks, so
the net effect is an 8MB default minimum allocation.

I played with the jemalloc tuning option lg_chunk and with static versus
dynamic linking, and came up with the numbers below.  They were
generated by ps -u on an ARM-based system with 64MB of RAM running
-current from a couple of weeks ago, but with the recent patch to
watchdogd to eliminate the need for libm.  For the tuned runs I used
"lg_chunk:14" (16K chunks), the smallest value it would allow on this
platform.  For comparison I also include the numbers from a FreeBSD 8.2
ARM system (which would be dynamically linked and untuned, and also
without any mlockall() calls).

         Link     malloc    %MEM  VSZ(KB)  RSS(KB)
        -------------------------------------------
        dynamic   untuned   15.3    10040     9996
        static    untuned   13.2     8624     8636
        dynamic   tuned      2.8     1880     1836
        static    tuned      0.8      480      492

        [ freebsd 8.2 ]      1.1     1752      748
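
For reference, the libm-free conversion mentioned above (timeout in
seconds to a power-of-two-nanoseconds exponent) needs only integer
arithmetic.  The following is a minimal sketch, not the patch that went
into watchdogd; the name timeout_to_pow2ns is hypothetical, and it
assumes the timeout is small enough that seconds * 10^9 does not exceed
2^63.

```c
#include <assert.h>
#include <stdint.h>

#define NS_PER_SEC UINT64_C(1000000000)

/*
 * Hypothetical helper (illustration only): return the smallest n such
 * that 2^n nanoseconds is at least `seconds` seconds.  No log() needed.
 */
static int
timeout_to_pow2ns(uint64_t seconds)
{
	uint64_t ns;
	int n = 0;

	/* Keep ns <= 2^63 so the shift below never reaches 64 bits. */
	assert(seconds <= (UINT64_C(1) << 63) / NS_PER_SEC);
	ns = seconds * NS_PER_SEC;

	while ((UINT64_C(1) << n) < ns)
		n++;
	return (n);
}
```

Since this runs once at startup, the simple loop (at most 63
iterations) is cheap enough that nothing cleverer is needed.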

So it appears that using jemalloc's tuning in a daemon that uses
mlockall(2) is a big win, especially if the daemon doesn't do much
memory allocation (watchdogd makes two allocations, of 4K and 1280
bytes; if you use -e it also strdup()s the command string).  It also
seems that a build-time knob to control static linking would be valuable
on platforms that are very memory-limited and can't benefit from having
all of libc wired.

I haven't attached a patch because there appears to be no good way to
achieve this in a platform-agnostic way.  The jemalloc code ties the
lower bound of the lg_chunk tuning value to the page size of the
platform, and it rejects out-of-range values without changing the
tuning.  The code that works on an ARM with a 4K page size,

    const char *malloc_conf = "lg_chunk:14";

would fail on a system that had bigger pages.  The tuning must be
specified with a compile-time constant like that, because it has to be
in place before the first allocation, which apparently happens before
main() is entered.  It would be nice if jemalloc would clip the tuning
to the lowest legal value instead of rejecting it, especially since the
lowest legal value is calculated based not only on page size but also on
other configurable values.
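
To make the "clip instead of reject" idea concrete, here is a minimal
sketch of the behavior being asked for.  It is not jemalloc code: the
function name clip_lg_chunk is hypothetical, and the legal bounds are
passed in as parameters precisely because the real lower bound depends
on page size and other settings that are not reproduced here.

```c
#include <assert.h>

/*
 * Hypothetical helper (illustration only, not part of jemalloc):
 * clamp a requested lg_chunk into [lg_min, lg_max] rather than
 * discarding a request that falls outside the range.
 */
static unsigned
clip_lg_chunk(unsigned requested, unsigned lg_min, unsigned lg_max)
{
	assert(lg_min <= lg_max);

	if (requested < lg_min)
		return (lg_min);	/* too small: use the smallest legal value */
	if (requested > lg_max)
		return (lg_max);	/* too large: use the largest legal value */
	return (requested);
}
```

With clipping like this, a request of 14 on a platform whose minimum
happened to be, say, 18 would end up as 18 rather than falling back to
the 4MB default.  (The 18 is a made-up number for illustration.)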

There's another potential solution, but it strikes me as rather
inelegant... jemalloc can also be tuned with the MALLOC_CONF env var.
With the right rc-fu we could provide something like a watchdogd_memtune
variable; if you set it, watchdogd would be invoked with MALLOC_CONF set
to that value in the environment.  It still couldn't be given a default
value that was good for all platforms.  It would also get passed through
environment inheritance to any "-e whatever" command run by watchdogd,
which isn't necessarily appropriate.
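
The inheritance problem could in principle be sidestepped by removing
the variable from the daemon's own environment before running the
command.  The sketch below assumes the allocator has already read
MALLOC_CONF during its own initialization (consistent with the tuning
taking effect before main()), and that the command is launched with
system(3); both are assumptions, and run_command_scrubbed is a
hypothetical name, not a function in watchdogd.

```c
#define _POSIX_C_SOURCE 200809L	/* for unsetenv(3) under strict modes */
#include <stdlib.h>

/*
 * Hypothetical helper (illustration only): run `cmd` through the shell
 * with MALLOC_CONF removed from the environment, so the child does not
 * inherit the daemon's allocator tuning.  Returns -1 if the variable
 * could not be removed, otherwise the status from system(3).
 */
static int
run_command_scrubbed(const char *cmd)
{
	if (unsetenv("MALLOC_CONF") != 0)
		return (-1);
	return (system(cmd));
}
```

The variable is not restored afterward, on the assumption that the
daemon has no further use for it once the allocator is initialized.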

I'm cc'ing Jason in case he can offer some advice about a better way to
achieve this tuning, because I'm sure my quick read-through of the
manpage this morning missed important details and implications.

-- Ian