From: Robert Watson <rwatson@FreeBSD.org>
Date: Mon, 7 Jan 2008 15:39:57 +0000 (GMT)
To: Vadim Goncharov
Cc: freebsd-current@freebsd.org
Subject: Re: When will ZFS become stable?

On Mon, 7 Jan 2008, Vadim Goncharov wrote:

> Yes, in-kernel libalias is "leaking" in the sense that it grows unbounded,
> and uses malloc(9) instead of its own UMA zone with settable limits (it
> frees all used memory, however, on shutting down ng_nat, so I've done a
> workaround of restarting the ng_nat nodes once a month). But as I see the
> panic string:

Did you have any luck raising interest from Paulo regarding this problem?
Is there a PR I can take a look at?  I'm not really familiar with the code,
so I'd prefer that someone a bit more familiar with it look after it, but I
can certainly take a glance.

> panic: kmem_malloc(16384): kmem_map too small: 83415040 total allocated
>
> and memory usage in the crash dump:
>
> router:~# vmstat -m -M /var/crash/vmcore.32 | grep alias
>       libalias 241127 30161K       -  460568995  128
> router:~# vmstat -m -M /var/crash/vmcore.32 | awk '{sum+=$3} END {print sum}'
> 50407
>
> ...so why were only 50 MB of the 80 in use at the moment of the panic?

This is a bit complicated to answer, but I'll try to capture the gist in a
short space.

The kernel memory map is an address space in which pages can be placed for
use by the kernel.  Those pages are usually allocated with one of two kernel
allocators: malloc(9), which does variable-sized memory allocations, and
uma(9), which is a slab allocator and supports caching of complex but
fixed-size objects.  Temporary buffers of variable size or infrequently
allocated objects will use malloc, but frequently allocated objects of fixed
size (vnodes, mbufs, ...) will use uma.  "vmstat -m" prints out information
on malloc allocations, and "vmstat -z" prints out information on uma
allocations.
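
To make the split concrete, here is a minimal sketch of a hypothetical
kernel module that allocates both ways.  The module name "demo_kmem", the
"demo_link" structure and the 100000-item cap are made up for illustration,
but it shows which allocations appear under "vmstat -m" versus "vmstat -z",
and how uma_zone_set_max() gives a zone the kind of settable limit that
libalias currently lacks:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/kernel.h>
    #include <sys/module.h>
    #include <sys/malloc.h>
    #include <vm/uma.h>

    /* A hypothetical fixed-size record, standing in for a libalias entry. */
    struct demo_link {
            uint32_t src;
            uint32_t dst;
            uint16_t sport, dport;
    };

    /* malloc(9) path: variable-size or infrequent allocations. */
    static MALLOC_DEFINE(M_DEMO, "demo", "demo variable-size allocations");

    /* uma(9) path: frequent fixed-size allocations, with a settable cap. */
    static uma_zone_t demo_zone;

    static int
    demo_modevent(module_t mod, int type, void *data)
    {
            struct demo_link *lnk;
            void *buf;

            switch (type) {
            case MOD_LOAD:
                    /*
                     * Fixed-size zone, capped at 100000 items (an arbitrary
                     * number) so it cannot grow until kmem_map is exhausted.
                     */
                    demo_zone = uma_zcreate("demo_link",
                        sizeof(struct demo_link), NULL, NULL, NULL, NULL,
                        UMA_ALIGN_PTR, 0);
                    uma_zone_set_max(demo_zone, 100000);

                    /*
                     * Variable-size allocation through malloc(9); shows up
                     * under the "demo" type in "vmstat -m".
                     */
                    buf = malloc(16384, M_DEMO, M_WAITOK | M_ZERO);

                    /*
                     * Fixed-size allocation from the zone; shows up under
                     * the "demo_link" zone in "vmstat -z".
                     */
                    lnk = uma_zalloc(demo_zone, M_NOWAIT);
                    if (lnk != NULL)
                            uma_zfree(demo_zone, lnk);
                    free(buf, M_DEMO);
                    return (0);
            case MOD_UNLOAD:
                    uma_zdestroy(demo_zone);
                    return (0);
            default:
                    return (EOPNOTSUPP);
            }
    }

    static moduledata_t demo_mod = {
            "demo_kmem",
            demo_modevent,
            NULL
    };
    DECLARE_MODULE(demo_kmem, demo_mod, SI_SUB_PSEUDO, SI_ORDER_ANY);

With the cap in place, uma_zalloc() with M_NOWAIT simply returns NULL once
the zone is full, rather than the zone growing until kmem_map runs out.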
To make life slightly more complicated, small malloc allocations are
actually implemented using uma -- there are a small number of small-object
zones reserved for this purpose, and malloc just rounds up to the next such
bucket size and allocates from that bucket.  For larger sizes, malloc still
goes through uma, but passes pretty much directly through to VM, which makes
pages available directly.  So when you look at "vmstat -z" output, be aware
that some of the zones presented there (the ones named things like "128",
"256", etc.) are actually the pools from which malloc allocations come, so
there's double-counting.

There are also other ways to get memory into the kernel map, such as
directly inserting pages from user memory into the kernel address space in
order to implement zero-copy.  This is done, for example, when zero-copy
sockets are used.

To make life just slightly more complicated still, there are things called
"submaps" in the kernel memory map, which have special properties.  One of
these is used for mapping the buffer cache.  Another is used for mapping
pageable memory used as part of the copy reduction in the pipe(2) code:
rather than copying twice (into the kernel and out again), for large pipe
I/O we will borrow the user pages from the sending process, mapping them
into the kernel and hooking them up to the pipe.

> BTW, current memory usage (April 6.2S, ipfw + 2 ng_nat's) a week after
> restart is low:
>
> vadim@router:~>vmstat -m | grep alias
>      libalias 79542 9983K       - 179493840  128
> vadim@router:~>vmstat -m | awk '{sum+=$3} END {print sum}'
> 28124
>
>> Actually, with mbuma, this has changed -- mbufs are now allocated from
>> the general kernel map.  Pipe buffer memory and a few other things are
>> still allocated from separate maps, however.  In fact, this was one of
>> the known issues with the introduction of large cluster sizes without
>> resource limits: address space and memory use were potentially unbounded,
>> so Randall recently properly implemented the resource limits on mbuf
>> clusters of large sizes.
>
> I still don't understand what those numbers from the sysctl above actually
> mean -- sysctl -d for them is obscure.  How much memory does the kernel
> use in RAM, and for what purposes?  Is that limit constant?  Does the
> kernel swap out parts of it, and if so, how much?

The concept of kernel memory, as seen above, is a bit convoluted.  Simple
memory allocated by the kernel for its internal data structures, such as
vnodes, sockets, mbufs, etc., is almost never pageable, as it may be
accessed from contexts where blocking on I/O is not permitted (for example,
in interrupt threads or with critical mutexes held).  However, other memory
in the kernel map may well be pageable, such as kernel thread stacks for
sleeping user threads (which can be swapped out under heavy memory load),
pipe buffers, and general cached data for the buffer cache / file system,
which will be paged out or discarded when memory pressure goes up.

When debugging a kernel memory leak in the network stack, the usual starting
point is to look at vmstat -m and vmstat -z to see what type of memory is
being leaked.  The really big, monotonically growing type is usually the one
at fault.  Often it's the one being allocated when the system runs out of
address space or memory, so sometimes even a simple backtrace will identify
the culprit.

Robert N M Watson
Computer Laboratory
University of Cambridge
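
A brief addendum to the leak-hunting recipe above: the counters that
"vmstat -m" prints are exported through libmemstat(3), so the periodic check
is easy to automate in C.  The following is only a sketch under that
assumption; the program name, column layout and the idea of diffing
successive runs are illustrative, not anything from this thread.  Build it
with something like "cc -o malloctop malloctop.c -lmemstat":

    #include <sys/types.h>

    #include <memstat.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
            struct memory_type_list *mtlp;
            struct memory_type *mtp;

            mtlp = memstat_mtl_alloc();
            if (mtlp == NULL) {
                    perror("memstat_mtl_alloc");
                    exit(1);
            }

            /* Pull the live malloc(9) statistics, as "vmstat -m" does. */
            if (memstat_sysctl_malloc(mtlp, 0) < 0) {
                    fprintf(stderr, "memstat_sysctl_malloc: %s\n",
                        memstat_strerror(memstat_mtl_geterror(mtlp)));
                    exit(1);
            }

            /* Dump each malloc type's current bytes and allocation count. */
            printf("%-16s %12s %12s\n", "type", "bytes", "allocs");
            for (mtp = memstat_mtl_first(mtlp); mtp != NULL;
                mtp = memstat_mtl_next(mtp)) {
                    printf("%-16s %12ju %12ju\n", memstat_get_name(mtp),
                        (uintmax_t)memstat_get_bytes(mtp),
                        (uintmax_t)memstat_get_numallocs(mtp));
            }
            memstat_mtl_free(mtlp);
            return (0);
    }

Run periodically (from cron, say) and diff successive outputs: the type
whose byte count only ever grows stands out in the same way the libalias
line does in the vmstat output quoted above.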