From: Robert Watson <rwatson@FreeBSD.org>
Date: Mon, 7 Jan 2008 15:39:57 +0000 (GMT)
To: Vadim Goncharov
Cc: freebsd-current@freebsd.org
Subject: Re: When will ZFS become stable?

On Mon, 7 Jan 2008, Vadim Goncharov wrote:

> Yes, in-kernel libalias is "leaking" in the sense that it grows unbounded,
> and uses malloc(9) instead of its own UMA zone with settable limits (it
> frees all used memory, however, on shutting down ng_nat, so I've done a
> workaround of restarting the ng_nat nodes once a month). But as I see the
> panic string:

Did you have any luck raising interest from Paulo regarding this problem?
Is there a PR I can take a look at?  I'm not really familiar with the code,
so I'd prefer that someone a bit more familiar with it look after it, but I
can certainly take a glance.

> panic: kmem_malloc(16384): kmem_map too small: 83415040 total allocated
>
> and memory usage in the crash dump:
>
> router:~# vmstat -m -M /var/crash/vmcore.32 | grep alias
>       libalias 241127 30161K       -  460568995  128
> router:~# vmstat -m -M /var/crash/vmcore.32 | awk '{sum+=$3} END {print sum}'
> 50407
>
> ...so why were only 50 MB of the 80 in use at the moment of the panic?

This is a bit complicated to answer, but I'll try to capture the gist in a
short space.

The kernel memory map is an address space in which pages can be placed for
use by the kernel.  Those pages are usually allocated with one of two kernel
allocators: malloc(9), which does variable-sized memory allocations, and
uma(9), which is a slab allocator and supports caching of complex but
fixed-size objects.  Temporary buffers of variable size or infrequently
allocated objects will use malloc, but frequently allocated objects of fixed
size (vnodes, mbufs, ...) will use uma.  "vmstat -m" prints out information
on malloc allocations, and "vmstat -z" prints out information on uma
allocations.
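
To make the split concrete, here is a minimal sketch of a hypothetical
kernel module that allocates both ways.  The module name "demo_kmem", the
"demo_link" structure and the 100000-item cap are made up for illustration,
but it shows which allocations appear under "vmstat -m" versus "vmstat -z",
and how uma_zone_set_max() gives a zone the kind of settable limit that
libalias currently lacks:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/kernel.h>
    #include <sys/module.h>
    #include <sys/malloc.h>
    #include <vm/uma.h>

    /* A hypothetical fixed-size record, standing in for a libalias entry. */
    struct demo_link {
            uint32_t src;
            uint32_t dst;
            uint16_t sport, dport;
    };

    /* malloc(9) path: variable-size or infrequent allocations. */
    static MALLOC_DEFINE(M_DEMO, "demo", "demo variable-size allocations");

    /* uma(9) path: frequent fixed-size allocations, with a settable cap. */
    static uma_zone_t demo_zone;

    static int
    demo_modevent(module_t mod, int type, void *data)
    {
            struct demo_link *lnk;
            void *buf;

            switch (type) {
            case MOD_LOAD:
                    /*
                     * Fixed-size zone, capped at 100000 items (an arbitrary
                     * number) so it cannot grow until kmem_map is exhausted.
                     */
                    demo_zone = uma_zcreate("demo_link",
                        sizeof(struct demo_link), NULL, NULL, NULL, NULL,
                        UMA_ALIGN_PTR, 0);
                    uma_zone_set_max(demo_zone, 100000);

                    /*
                     * Variable-size allocation through malloc(9); shows up
                     * under the "demo" type in "vmstat -m".
                     */
                    buf = malloc(16384, M_DEMO, M_WAITOK | M_ZERO);

                    /*
                     * Fixed-size allocation from the zone; shows up under
                     * the "demo_link" zone in "vmstat -z".
                     */
                    lnk = uma_zalloc(demo_zone, M_NOWAIT);
                    if (lnk != NULL)
                            uma_zfree(demo_zone, lnk);
                    free(buf, M_DEMO);
                    return (0);
            case MOD_UNLOAD:
                    uma_zdestroy(demo_zone);
                    return (0);
            default:
                    return (EOPNOTSUPP);
            }
    }

    static moduledata_t demo_mod = {
            "demo_kmem",
            demo_modevent,
            NULL
    };
    DECLARE_MODULE(demo_kmem, demo_mod, SI_SUB_PSEUDO, SI_ORDER_ANY);

With the cap in place, uma_zalloc() with M_NOWAIT simply returns NULL once
the zone is full, rather than the zone growing until kmem_map runs out.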
To make life slightly more complicated, small malloc allocations are
actually implemented using uma -- there are a small number of small-object
zones reserved for this purpose, and malloc just rounds up to the next such
bucket size and allocates from that bucket.  For larger sizes, malloc still
goes through uma, but passes pretty much directly through to VM, which makes
pages available directly.  So when you look at "vmstat -z" output, be aware
that some of the zones presented there (the ones named things like "128",
"256", etc.) are actually the pools from which malloc allocations come, so
there's double-counting.

There are also other ways to get memory into the kernel map, such as
directly inserting pages from user memory into the kernel address space in
order to implement zero-copy.  This is done, for example, when zero-copy
sockets are used.

To make life just slightly more complicated still, there are things called
"submaps" in the kernel memory map, which have special properties.  One of
these is used for mapping the buffer cache.  Another is used for mapping
pageable memory used as part of the copy reduction in the pipe(2) code:
rather than copying twice (into the kernel and out again), for large pipe
I/O we will borrow the user pages from the sending process, mapping them
into the kernel and hooking them up to the pipe.

> BTW, current memory usage (April 6.2S, ipfw + 2 ng_nat's) a week after
> restart is low:
>
> vadim@router:~>vmstat -m | grep alias
>      libalias 79542 9983K       - 179493840  128
> vadim@router:~>vmstat -m | awk '{sum+=$3} END {print sum}'
> 28124
>
>> Actually, with mbuma, this has changed -- mbufs are now allocated from
>> the general kernel map.  Pipe buffer memory and a few other things are
>> still allocated from separate maps, however.  In fact, this was one of
>> the known issues with the introduction of large cluster sizes without
>> resource limits: address space and memory use were potentially unbounded,
>> so Randall recently properly implemented the resource limits on mbuf
>> clusters of large sizes.
>
> I still don't understand what those numbers from the sysctl above actually
> mean -- sysctl -d for them is obscure.  How much memory does the kernel
> use in RAM, and for what purposes?  Is that limit constant?  Does the
> kernel swap out parts of it, and if so, how much?

The concept of kernel memory, as seen above, is a bit convoluted.  Simple
memory allocated by the kernel for its internal data structures, such as
vnodes, sockets, mbufs, etc., is almost never pageable, as it may be
accessed from contexts where blocking on I/O is not permitted (for example,
in interrupt threads or with critical mutexes held).  However, other memory
in the kernel map may well be pageable, such as kernel thread stacks for
sleeping user threads (which can be swapped out under heavy memory load),
pipe buffers, and general cached data for the buffer cache / file system,
which will be paged out or discarded when memory pressure goes up.

When debugging a kernel memory leak in the network stack, the usual starting
point is to look at vmstat -m and vmstat -z to see what type of memory is
being leaked.  The really big, monotonically growing type is usually the one
at fault.  Often it's the one being allocated when the system runs out of
address space or memory, so sometimes even a simple backtrace will identify
the culprit.

Robert N M Watson
Computer Laboratory
University of Cambridge
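
A brief addendum to the leak-hunting recipe above: the counters that
"vmstat -m" prints are exported through libmemstat(3), so the periodic check
is easy to automate in C.  The following is only a sketch under that
assumption; the program name, column layout and the idea of diffing
successive runs are illustrative, not anything from this thread.  Build it
with something like "cc -o malloctop malloctop.c -lmemstat":

    #include <sys/types.h>

    #include <memstat.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
            struct memory_type_list *mtlp;
            struct memory_type *mtp;

            mtlp = memstat_mtl_alloc();
            if (mtlp == NULL) {
                    perror("memstat_mtl_alloc");
                    exit(1);
            }

            /* Pull the live malloc(9) statistics, as "vmstat -m" does. */
            if (memstat_sysctl_malloc(mtlp, 0) < 0) {
                    fprintf(stderr, "memstat_sysctl_malloc: %s\n",
                        memstat_strerror(memstat_mtl_geterror(mtlp)));
                    exit(1);
            }

            /* Dump each malloc type's current bytes and allocation count. */
            printf("%-16s %12s %12s\n", "type", "bytes", "allocs");
            for (mtp = memstat_mtl_first(mtlp); mtp != NULL;
                mtp = memstat_mtl_next(mtp)) {
                    printf("%-16s %12ju %12ju\n", memstat_get_name(mtp),
                        (uintmax_t)memstat_get_bytes(mtp),
                        (uintmax_t)memstat_get_numallocs(mtp));
            }
            memstat_mtl_free(mtlp);
            return (0);
    }

Run periodically (from cron, say) and diff successive outputs: the type
whose byte count only ever grows stands out in the same way the libalias
line does in the vmstat output quoted above.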