Date: Tue, 08 Jan 2008 05:28:28 +0600
From: "Vadim Goncharov" <vadim_nuclight@mail.ru>
Organization: AVTF TPU Hostel
To: "Robert Watson", "Paolo Pisati"
Cc: freebsd-current@freebsd.org
Subject: Re: When will ZFS become stable?
In-Reply-To: <20080107152305.A19068@fledge.watson.org>

07.01.08 @ 21:39 Robert Watson wrote:

> On Mon, 7 Jan 2008, Vadim Goncharov wrote:
>
>> Yes, in-kernel libalias is "leaking" in the sense that it grows
>> unbounded and uses malloc(9) instead of its own UMA zone with settable
>> limits (it does free all of its memory on shutting down ng_nat,
>> however, so I've done a workaround of restarting the ng_nat nodes once
>> a month). But as I see the panic string:
>
> Did you have any luck raising interest from Paolo regarding this
> problem?  Is there a PR I can take a look at?  I'm not really familiar
> with the code, so I'd prefer that someone who is a bit more familiar
> with it looked after it, but I can certainly take a glance.

No, I didn't do that yet. A brief search, however, turns up kern/118432,
though it is not directly a kmem issue, and also this thread:

http://209.85.135.104/search?q=cache:lpXLlrtojg8J:archive.netbsd.se/%3Fml%3Dfreebsd-net%26a%3D2006-10%26t%3D2449333+ng_nat+panic+memory&hl=ru&ct=clnk&cd=9&client=opera

in which a memory-exhaustion problem was predicted. Also, I've heard some
rumors about ng_nat memory panics under very heavy load, but a man with a
300 Mbps router running several ng_nat's said his router has been rock
stable for half a year - though his router has 1 GB of RAM and mine only
256 MB (BTW, it's his system that recently crashed with kern/118993, but
I don't think that one is an ng_nat kmem issue).
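For illustration, the fix implied above - giving such a subsystem its own
UMA zone with a settable limit instead of plain malloc(9) - would look
roughly like the sketch below. This is a minimal sketch, not the actual
libalias code: the names struct my_link, my_link_zone and the cap of
10000 items are invented for the example.

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/malloc.h>
#include <vm/uma.h>

/* Hypothetical fixed-size state element, standing in for the internal
 * structures that libalias really allocates with malloc(9). */
struct my_link {
	u_int	lnk_src;
	u_int	lnk_dst;
};

static uma_zone_t my_link_zone;

static void
my_zone_init(void)
{
	/* Fixed-size objects get their own slab zone... */
	my_link_zone = uma_zcreate("my_links", sizeof(struct my_link),
	    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
	/* ...and, unlike malloc(9) types, a zone can be capped. */
	uma_zone_set_max(my_link_zone, 10000);
}

static struct my_link *
my_link_alloc(void)
{
	/* Fails with NULL at the cap instead of pushing kmem_map
	 * toward exhaustion and a panic. */
	return (uma_zalloc(my_link_zone, M_NOWAIT | M_ZERO));
}

static void
my_link_free(struct my_link *lnk)
{
	uma_zfree(my_link_zone, lnk);
}

The cap would then also show up in the LIMIT column of "vmstat -z", and
allocations refused past it would be counted in the FAILURES column.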
>> panic: kmem_malloc(16384): kmem_map too small: 83415040 total allocated
>>
>> and memory usage in the crash dump:
>>
>> router:~# vmstat -m -M /var/crash/vmcore.32 | grep alias
>> libalias 241127 30161K - 460568995 128
>> router:~# vmstat -m -M /var/crash/vmcore.32 | awk '{sum+=$3} END {print sum}'
>> 50407
>>
>> ...so why were only 50 MB out of 80 in use at the moment of the panic?
>
> This is a bit complicated to answer, but I'll try to capture the gist in
> a short space.
>
> The kernel memory map is an address space in which pages can be placed
> to be used by the kernel.  Those pages are often allocated using one of
> two kernel allocators: malloc(9), which does variable-sized memory
> allocations, and uma(9), which is a slab allocator and supports caching
> of complex but fixed-size objects.  Temporary buffers of variable size
> or infrequently allocated objects will use malloc, but frequently
> allocated objects of fixed size (vnodes, mbufs, ...) will use uma.
> "vmstat -m" prints out information on malloc allocations, and "vmstat
> -z" prints out information on uma allocations.
>
> To make life slightly more complicated, small malloc allocations are
> actually implemented using uma -- there are a small number of small
> object size zones reserved for this purpose, and malloc just rounds up
> to the next such bucket size and allocates from that bucket.  For
> larger sizes, malloc goes past uma pretty much directly to VM, which
> makes pages available directly.  So when you look at "vmstat -z"
> output, be aware that some of the information presented there (zones
> named things like "128", "256", etc.) are actually the pools from which
> malloc allocations come, so there's double-counting.

Yes, I knew that, but I didn't know what exactly the column names mean.
REQUESTS and FAILURES, I guess, are pure statistics, and SIZE is the size
of one element, but why is USED + FREE != LIMIT (in the rows where the
limit is non-zero)?

> There are also other ways to get memory into the kernel map, such as
> directly inserting pages from user memory into the kernel address space
> in order to implement zero-copy.  This is done, for example, when
> zero-copy sockets are used.

The last time I tried that, on 5.4, it caused panics every few hours on
my file server, so I concluded this feature is not widely used...

> To make life just very slightly more complicated even, I'll tell you
> that there are things called "submaps" in the kernel memory map, which
> have special properties.  One of these is used for mapping the buffer
> cache.  Another is used for mapping pageable memory used as part of
> copy-reduction in the pipe(2) code.  Rather than copying twice (into
> the kernel and out again) in the pipe code, for large pipe I/O we will
> borrow the user pages from the sending process, mapping them into the
> kernel and hooking them up to the pipe.

So, is the kernel memory map a single global thing covering the entire
kernel, or are there several maps in the kernel - say, one for malloc(),
one for the rest of UMA, etc.? Recalling the sysctl values from my
previous message:

vm.kmem_size: 83415040
vm.kmem_size_max: 335544320
vm.kmem_size_scale: 3
vm.kvm_size: 1073737728
vm.kvm_free: 704638976

So kvm_size looks like the amount set by KVA_PAGES, covering the entire
kernel address space that is plugged into every process's address space.
But more than 300 MB of it are used, while the machine has only 256 MB
of RAM.
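To make the malloc(9)-versus-uma(9) distinction described above
concrete, here is a minimal sketch; the malloc type M_MYDATA and the
request sizes are invented for illustration.

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/malloc.h>

/* Declares a malloc type; it appears as the row "mydata" in vmstat -m. */
static MALLOC_DEFINE(M_MYDATA, "mydata", "example variable-size data");

static void
example(void)
{
	void *small, *large;

	/* 100 bytes is rounded up to the next small-object bucket, so
	 * this allocation is served by - and also counted in - the
	 * "128" zone of vmstat -z. */
	small = malloc(100, M_MYDATA, M_WAITOK);

	/* A large request skips the bucket zones and gets its pages
	 * pretty much directly from the kernel map. */
	large = malloc(65536, M_MYDATA, M_WAITOK);

	free(small, M_MYDATA);
	free(large, M_MYDATA);
}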
I see this line in top:

Mem: 41M Active, 1268K Inact, 102M Wired, 34M Buf, 94M Free

I guess the 34M of buffer cache is entirely in-kernel memory; is this
part of kmem_size or another part of kernel space? What do kmem_size_max
and kmem_size_scale do - can kmem grow dynamically? Does a kmem_size of
about 80 MB mean that 80 MB of RAM is constantly used by the kernel for
its needs, including the buffer cache, while the other 176 MB are spent
on process RSS, or is the relation more complicated?

>> BTW, current memory usage (April 6.2S, ipfw + 2 ng_nat's) a week after
>> restart is low:
>>
>> vadim@router:~>vmstat -m | grep alias
>> libalias 79542 9983K - 179493840 128
>> vadim@router:~>vmstat -m | awk '{sum+=$3} END {print sum}'
>> 28124
>>
>>> Actually, with mbuma, this has changed -- mbufs are now allocated from
>>> the general kernel map.  Pipe buffer memory and a few other things are
>>> still allocated from separate maps, however.  In fact, this was one of
>>> the known issues with the introduction of large cluster sizes without
>>> resource limits: address space and memory use were potentially
>>> unbounded, so Randall recently properly implemented the resource
>>> limits on mbuf clusters of large sizes.
>>
>> I still don't understand what exactly the sysctl numbers above mean -
>> sysctl -d for them is obscure. How much memory does the kernel use in
>> RAM, and for which purposes? Is that limit constant? Does the kernel
>> swap out parts of it, and if yes, how much?
>
> The concept of kernel memory, as seen above, is a bit convoluted.
> Simple memory allocated by the kernel for its internal data structures,
> such as vnodes, sockets, mbufs, etc., is almost always not something
> that can be paged, as it may be accessed from contexts where blocking
> on I/O is not permitted (for example, in interrupt threads or with
> critical mutexes held).  However, other memory in the kernel map may
> well be pageable, such as kernel thread stacks for sleeping user threads

We can assume for simplicity that their memory is not-so-kernel but
rather part of the process address space :)

> (which can be swapped out under heavy memory load), pipe buffers, and
> general cached data for the buffer cache / file system, which will be
> paged out or discarded when memory pressure goes up.

Umm. I think there is no point in swapping out disk cache that can simply
be discarded, so the main part of kernel memory that is actually
swappable is the anonymous pipe(2) buffers?

> When debugging a kernel memory leak in the network stack, the usual
> starting point is to look at vmstat -m and vmstat -z to see what type of
> memory is being leaked.  The really big monotonically growing type is
> usually the one that's at fault.  Often it's the one being allocated
> when the system runs out of address space or memory, so sometimes even a
> simple backtrace will identify the culprit.
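As an illustration of that approach: the kind of bug being hunted usually
looks like the sketch below (all names invented) - an allocation on a
path that is missing its matching free, so the malloc type grows
monotonically in "vmstat -m" until kmem_map is exhausted.

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/errno.h>
#include <sys/malloc.h>

static MALLOC_DEFINE(M_LEAKY, "leaky", "demonstration of a kmem leak");

struct request {
	int	r_id;
	char	r_buf[112];	/* lands in the "128" bucket zone */
};

/* Imagine this runs for every packet or event. */
static int
handle_request(int id)
{
	struct request *rq;

	rq = malloc(sizeof(*rq), M_LEAKY, M_NOWAIT);
	if (rq == NULL)
		return (ENOMEM);
	rq->r_id = id;

	if (id < 0)
		return (EINVAL);	/* BUG: early return leaks rq */

	free(rq, M_LEAKY);
	return (0);
}

With such a bug, "vmstat -m | grep leaky" shows InUse and MemUse growing
without bound while everything else stays flat, which is exactly the
pattern to look for in the crash dump.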
OK, here is the zone state from the crash dump:

router:~# vmstat -z -M /var/crash/vmcore.32
ITEM SIZE LIMIT USED FREE REQUESTS FAILURES
UMA Kegs: 140, 0, 88, 8, 88, 0
UMA Zones: 120, 0, 88, 2, 88, 0
UMA Slabs: 64, 0, 5020, 54, 15454953, 0
UMA RCntSlabs: 104, 0, 1500, 165, 1443452, 0
UMA Hash: 128, 0, 3, 27, 6, 0
16 Bucket: 76, 0, 19, 31, 34, 0
32 Bucket: 140, 0, 24, 4, 58, 0
64 Bucket: 268, 0, 14, 28, 125, 177
128 Bucket: 524, 0, 449, 97, 415988, 109049
VM OBJECT: 132, 0, 2124, 13217, 37014938, 0
MAP: 192, 0, 7, 33, 7, 0
KMAP ENTRY: 68, 15512, 24, 2440, 67460011, 0
MAP ENTRY: 68, 0, 1141, 483, 67039931, 0
PV ENTRY: 24, 452400, 25801, 23499, 784683549, 0
DP fakepg: 72, 0, 0, 0, 0, 0
mt_zone: 64, 0, 237, 58, 237, 0
16: 16, 0, 2691, 354, 21894973014, 0
32: 32, 0, 2281, 318, 35838274034, 0
64: 64, 0, 6098, 1454, 172769061, 0
128: 128, 0, 243914, 16846, 637135440, 4
256: 256, 0, 978, 222, 134799637, 0
512: 512, 0, 196, 116, 3216246, 0
1024: 1024, 0, 67, 73, 366070, 0
2048: 2048, 0, 8988, 46, 69855367, 7
4096: 4096, 0, 155, 29, 1894695, 0
Files: 72, 0, 270, 207, 31790371, 0
PROC: 536, 0, 96, 37, 1567418, 0
THREAD: 376, 0, 142, 8, 14326845, 0
KSEGRP: 88, 0, 137, 63, 662, 0
UPCALL: 44, 0, 6, 150, 536, 0
VMSPACE: 296, 0, 48, 56, 1567372, 0
audit_record: 828, 0, 0, 0, 0, 0
mbuf_packet: 256, 0, 591, 121, 208413611538, 0
mbuf: 256, 0, 1902, 1226, 202203273445, 0
mbuf_cluster: 2048, 8768, 2537, 463, 5247493815, 2
mbuf_jumbo_pagesize: 4096, 0, 0, 0, 0, 0
mbuf_jumbo_9k: 9216, 0, 0, 0, 0, 0
mbuf_jumbo_16k: 16384, 0, 0, 0, 0, 0
ACL UMA zone: 388, 0, 0, 0, 0, 0
NetGraph items: 36, 546, 0, 546, 251943928450, 1170428
g_bio: 132, 0, 1, 231, 336628343, 0
ata_request: 204, 0, 1, 316, 82269680, 0
ata_composite: 196, 0, 0, 0, 0, 0
VNODE: 272, 0, 2039, 14523, 40154724, 0
VNODEPOLL: 76, 0, 0, 50, 1, 0
S VFS Cache: 68, 0, 2247, 12929, 41383752, 0
L VFS Cache: 291, 0, 0, 364, 536802, 0
NAMEI: 1024, 0, 372, 12, 126634007, 0
NFSMOUNT: 480, 0, 0, 0, 0, 0
NFSNODE: 460, 0, 0, 0, 0, 0
DIRHASH: 1024, 0, 156, 184, 131252, 0
PIPE: 408, 0, 24, 30, 822603, 0
KNOTE: 68, 0, 0, 112, 249530, 0
bridge_rtnode: 32, 0, 0, 0, 0, 0
socket: 356, 8778, 75, 35, 1488596, 0
ipq: 32, 339, 0, 226, 58472202, 0
udpcb: 180, 8778, 17, 49, 239035, 0
inpcb: 180, 8778, 23, 109, 676919, 0
tcpcb: 464, 8768, 22, 34, 676919, 0
tcptw: 48, 1794, 1, 233, 177851, 0
syncache: 100, 15366, 0, 78, 610893, 0
hostcache: 76, 15400, 78, 72, 13137, 0
tcpreass: 20, 676, 0, 169, 48826, 0
sackhole: 20, 0, 0, 169, 194, 0
ripcb: 180, 8778, 4, 40, 142316, 0
unpcb: 144, 8775, 19, 62, 393432, 0
rtentry: 132, 0, 480, 187, 448160, 0
pfsrctrpl: 100, 0, 0, 0, 0, 0
pfrulepl: 604, 0, 0, 0, 0, 0
pfstatepl: 260, 10005, 0, 0, 0, 0
pfaltqpl: 128, 0, 0, 0, 0, 0
pfpooladdrpl: 68, 0, 0, 0, 0, 0
pfrktable: 1240, 0, 0, 0, 0, 0
pfrkentry: 156, 0, 0, 0, 0, 0
pfrkentry2: 156, 0, 0, 0, 0, 0
pffrent: 16, 5075, 0, 0, 0, 0
pffrag: 48, 0, 0, 0, 0, 0
pffrcache: 48, 10062, 0, 0, 0, 0
pffrcent: 12, 50141, 0, 0, 0, 0
pfstatescrub: 28, 0, 0, 0, 0, 0
pfiaddrpl: 92, 0, 0, 0, 0, 0
pfospfen: 108, 0, 0, 0, 0, 0
pfosfp: 28, 0, 0, 0, 0, 0
IPFW dynamic rule zone: 108, 0, 147, 393, 20301589, 0
divcb: 180, 8778, 2, 42, 45, 0
SWAPMETA: 276, 30548, 2257, 473, 348836, 0
Mountpoints: 664, 0, 8, 10, 100, 0
FFS inode: 132, 0, 2000, 6468, 40152792, 0
FFS1 dinode: 128, 0, 0, 0, 0, 0
FFS2 dinode: 256, 0, 2000, 3730, 40152792, 0

-- 
WBR, Vadim Goncharov