Date: Tue, 08 Jan 2008 05:28:28 +0600
From: "Vadim Goncharov" <vadim_nuclight@mail.ru>
Organization: AVTF TPU Hostel
To: "Robert Watson", "Paolo Pisati"
Cc: freebsd-current@freebsd.org
Subject: Re: When will ZFS become stable?
In-Reply-To: <20080107152305.A19068@fledge.watson.org>

07.01.08 @ 21:39 Robert Watson wrote:

> On Mon, 7 Jan 2008, Vadim Goncharov wrote:
>
>> Yes, in-kernel libalias is "leaking" in the sense that it grows
>> unbounded and uses malloc(9) instead of its own UMA zone with settable
>> limits (it does free all of its memory on shutting down ng_nat,
>> however, so I've done a workaround of restarting the ng_nat nodes once
>> a month). But as I see the panic string:
>
> Did you have any luck raising interest from Paolo regarding this
> problem?  Is there a PR I can take a look at?  I'm not really familiar
> with the code, so I'd prefer that someone who is a bit more familiar
> with it looked after it, but I can certainly take a glance.

No, I didn't do that yet. A brief search, however, turns up kern/118432,
though it is not directly a kmem issue, and also this thread:

http://209.85.135.104/search?q=cache:lpXLlrtojg8J:archive.netbsd.se/%3Fml%3Dfreebsd-net%26a%3D2006-10%26t%3D2449333+ng_nat+panic+memory&hl=ru&ct=clnk&cd=9&client=opera

in which a memory-exhaustion problem was predicted. Also, I've heard some
rumors about ng_nat memory panics under very heavy load, but a man with a
300 Mbps router running several ng_nat's said his router has been rock
stable for half a year - though his router has 1 GB of RAM and mine only
256 MB (BTW, it's his system that recently crashed with kern/118993, but
I don't think that one is an ng_nat kmem issue).
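For illustration, the fix implied above - giving such a subsystem its own
UMA zone with a settable limit instead of plain malloc(9) - would look
roughly like the sketch below. This is a minimal sketch, not the actual
libalias code: the names struct my_link, my_link_zone and the cap of
10000 items are invented for the example.

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/malloc.h>
#include <vm/uma.h>

/* Hypothetical fixed-size state element, standing in for the internal
 * structures that libalias really allocates with malloc(9). */
struct my_link {
	u_int	lnk_src;
	u_int	lnk_dst;
};

static uma_zone_t my_link_zone;

static void
my_zone_init(void)
{
	/* Fixed-size objects get their own slab zone... */
	my_link_zone = uma_zcreate("my_links", sizeof(struct my_link),
	    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
	/* ...and, unlike malloc(9) types, a zone can be capped. */
	uma_zone_set_max(my_link_zone, 10000);
}

static struct my_link *
my_link_alloc(void)
{
	/* Fails with NULL at the cap instead of pushing kmem_map
	 * toward exhaustion and a panic. */
	return (uma_zalloc(my_link_zone, M_NOWAIT | M_ZERO));
}

static void
my_link_free(struct my_link *lnk)
{
	uma_zfree(my_link_zone, lnk);
}

The cap would then also show up in the LIMIT column of "vmstat -z", and
allocations refused past it would be counted in the FAILURES column.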
>> panic: kmem_malloc(16384): kmem_map too small: 83415040 total allocated
>>
>> and memory usage in the crash dump:
>>
>> router:~# vmstat -m -M /var/crash/vmcore.32 | grep alias
>> libalias 241127 30161K - 460568995 128
>> router:~# vmstat -m -M /var/crash/vmcore.32 | awk '{sum+=$3} END {print sum}'
>> 50407
>>
>> ...so why were only 50 MB out of 80 in use at the moment of the panic?
>
> This is a bit complicated to answer, but I'll try to capture the gist in
> a short space.
>
> The kernel memory map is an address space in which pages can be placed
> to be used by the kernel.  Those pages are often allocated using one of
> two kernel allocators: malloc(9), which does variable-sized memory
> allocations, and uma(9), which is a slab allocator and supports caching
> of complex but fixed-size objects.  Temporary buffers of variable size
> or infrequently allocated objects will use malloc, but frequently
> allocated objects of fixed size (vnodes, mbufs, ...) will use uma.
> "vmstat -m" prints out information on malloc allocations, and "vmstat
> -z" prints out information on uma allocations.
>
> To make life slightly more complicated, small malloc allocations are
> actually implemented using uma -- there are a small number of small
> object size zones reserved for this purpose, and malloc just rounds up
> to the next such bucket size and allocates from that bucket.  For
> larger sizes, malloc goes past uma pretty much directly to VM, which
> makes pages available directly.  So when you look at "vmstat -z"
> output, be aware that some of the information presented there (zones
> named things like "128", "256", etc.) are actually the pools from which
> malloc allocations come, so there's double-counting.

Yes, I knew that, but I didn't know what exactly the column names mean.
REQUESTS and FAILURES, I guess, are pure statistics, and SIZE is the size
of one element, but why is USED + FREE != LIMIT (in the rows where the
limit is non-zero)?

> There are also other ways to get memory into the kernel map, such as
> directly inserting pages from user memory into the kernel address space
> in order to implement zero-copy.  This is done, for example, when
> zero-copy sockets are used.

The last time I tried that, on 5.4, it caused panics every few hours on
my file server, so I concluded this feature is not widely used...

> To make life just very slightly more complicated even, I'll tell you
> that there are things called "submaps" in the kernel memory map, which
> have special properties.  One of these is used for mapping the buffer
> cache.  Another is used for mapping pageable memory used as part of
> copy-reduction in the pipe(2) code.  Rather than copying twice (into
> the kernel and out again) in the pipe code, for large pipe I/O we will
> borrow the user pages from the sending process, mapping them into the
> kernel and hooking them up to the pipe.

So, is the kernel memory map a single global thing covering the entire
kernel, or are there several maps in the kernel - say, one for malloc(),
one for the rest of UMA, etc.? Recalling the sysctl values from my
previous message:

vm.kmem_size: 83415040
vm.kmem_size_max: 335544320
vm.kmem_size_scale: 3
vm.kvm_size: 1073737728
vm.kvm_free: 704638976

So kvm_size looks like the amount set by KVA_PAGES, covering the entire
kernel address space that is plugged into every process's address space.
But more than 300 MB of it are used, while the machine has only 256 MB
of RAM.
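To make the malloc(9)-versus-uma(9) distinction described above
concrete, here is a minimal sketch; the malloc type M_MYDATA and the
request sizes are invented for illustration.

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/malloc.h>

/* Declares a malloc type; it appears as the row "mydata" in vmstat -m. */
static MALLOC_DEFINE(M_MYDATA, "mydata", "example variable-size data");

static void
example(void)
{
	void *small, *large;

	/* 100 bytes is rounded up to the next small-object bucket, so
	 * this allocation is served by - and also counted in - the
	 * "128" zone of vmstat -z. */
	small = malloc(100, M_MYDATA, M_WAITOK);

	/* A large request skips the bucket zones and gets its pages
	 * pretty much directly from the kernel map. */
	large = malloc(65536, M_MYDATA, M_WAITOK);

	free(small, M_MYDATA);
	free(large, M_MYDATA);
}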
I see this line in top:

Mem: 41M Active, 1268K Inact, 102M Wired, 34M Buf, 94M Free

I guess the 34M of buffer cache is entirely in-kernel memory; is this
part of kmem_size or another part of kernel space? What do kmem_size_max
and kmem_size_scale do - can kmem grow dynamically? Does a kmem_size of
about 80 MB mean that 80 MB of RAM is constantly used by the kernel for
its needs, including the buffer cache, while the other 176 MB are spent
on process RSS, or is the relation more complicated?

>> BTW, current memory usage (April 6.2S, ipfw + 2 ng_nat's) a week after
>> restart is low:
>>
>> vadim@router:~>vmstat -m | grep alias
>> libalias 79542 9983K - 179493840 128
>> vadim@router:~>vmstat -m | awk '{sum+=$3} END {print sum}'
>> 28124
>>
>>> Actually, with mbuma, this has changed -- mbufs are now allocated from
>>> the general kernel map.  Pipe buffer memory and a few other things are
>>> still allocated from separate maps, however.  In fact, this was one of
>>> the known issues with the introduction of large cluster sizes without
>>> resource limits: address space and memory use were potentially
>>> unbounded, so Randall recently properly implemented the resource
>>> limits on mbuf clusters of large sizes.
>>
>> I still don't understand what exactly the sysctl numbers above mean -
>> sysctl -d for them is obscure. How much memory does the kernel use in
>> RAM, and for which purposes? Is that limit constant? Does the kernel
>> swap out parts of it, and if yes, how much?
>
> The concept of kernel memory, as seen above, is a bit convoluted.
> Simple memory allocated by the kernel for its internal data structures,
> such as vnodes, sockets, mbufs, etc., is almost always not something
> that can be paged, as it may be accessed from contexts where blocking
> on I/O is not permitted (for example, in interrupt threads or with
> critical mutexes held).  However, other memory in the kernel map may
> well be pageable, such as kernel thread stacks for sleeping user threads

We can assume for simplicity that their memory is not-so-kernel but
rather part of the process address space :)

> (which can be swapped out under heavy memory load), pipe buffers, and
> general cached data for the buffer cache / file system, which will be
> paged out or discarded when memory pressure goes up.

Umm. I think there is no point in swapping out disk cache that can simply
be discarded, so the main part of kernel memory that is actually
swappable is the anonymous pipe(2) buffers?

> When debugging a kernel memory leak in the network stack, the usual
> starting point is to look at vmstat -m and vmstat -z to see what type of
> memory is being leaked.  The really big monotonically growing type is
> usually the one that's at fault.  Often it's the one being allocated
> when the system runs out of address space or memory, so sometimes even a
> simple backtrace will identify the culprit.
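As an illustration of that approach: the kind of bug being hunted usually
looks like the sketch below (all names invented) - an allocation on a
path that is missing its matching free, so the malloc type grows
monotonically in "vmstat -m" until kmem_map is exhausted.

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/errno.h>
#include <sys/malloc.h>

static MALLOC_DEFINE(M_LEAKY, "leaky", "demonstration of a kmem leak");

struct request {
	int	r_id;
	char	r_buf[112];	/* lands in the "128" bucket zone */
};

/* Imagine this runs for every packet or event. */
static int
handle_request(int id)
{
	struct request *rq;

	rq = malloc(sizeof(*rq), M_LEAKY, M_NOWAIT);
	if (rq == NULL)
		return (ENOMEM);
	rq->r_id = id;

	if (id < 0)
		return (EINVAL);	/* BUG: early return leaks rq */

	free(rq, M_LEAKY);
	return (0);
}

With such a bug, "vmstat -m | grep leaky" shows InUse and MemUse growing
without bound while everything else stays flat, which is exactly the
pattern to look for in the crash dump.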
OK, here is the zone state from the crash dump:

router:~# vmstat -z -M /var/crash/vmcore.32
ITEM SIZE LIMIT USED FREE REQUESTS FAILURES
UMA Kegs: 140, 0, 88, 8, 88, 0
UMA Zones: 120, 0, 88, 2, 88, 0
UMA Slabs: 64, 0, 5020, 54, 15454953, 0
UMA RCntSlabs: 104, 0, 1500, 165, 1443452, 0
UMA Hash: 128, 0, 3, 27, 6, 0
16 Bucket: 76, 0, 19, 31, 34, 0
32 Bucket: 140, 0, 24, 4, 58, 0
64 Bucket: 268, 0, 14, 28, 125, 177
128 Bucket: 524, 0, 449, 97, 415988, 109049
VM OBJECT: 132, 0, 2124, 13217, 37014938, 0
MAP: 192, 0, 7, 33, 7, 0
KMAP ENTRY: 68, 15512, 24, 2440, 67460011, 0
MAP ENTRY: 68, 0, 1141, 483, 67039931, 0
PV ENTRY: 24, 452400, 25801, 23499, 784683549, 0
DP fakepg: 72, 0, 0, 0, 0, 0
mt_zone: 64, 0, 237, 58, 237, 0
16: 16, 0, 2691, 354, 21894973014, 0
32: 32, 0, 2281, 318, 35838274034, 0
64: 64, 0, 6098, 1454, 172769061, 0
128: 128, 0, 243914, 16846, 637135440, 4
256: 256, 0, 978, 222, 134799637, 0
512: 512, 0, 196, 116, 3216246, 0
1024: 1024, 0, 67, 73, 366070, 0
2048: 2048, 0, 8988, 46, 69855367, 7
4096: 4096, 0, 155, 29, 1894695, 0
Files: 72, 0, 270, 207, 31790371, 0
PROC: 536, 0, 96, 37, 1567418, 0
THREAD: 376, 0, 142, 8, 14326845, 0
KSEGRP: 88, 0, 137, 63, 662, 0
UPCALL: 44, 0, 6, 150, 536, 0
VMSPACE: 296, 0, 48, 56, 1567372, 0
audit_record: 828, 0, 0, 0, 0, 0
mbuf_packet: 256, 0, 591, 121, 208413611538, 0
mbuf: 256, 0, 1902, 1226, 202203273445, 0
mbuf_cluster: 2048, 8768, 2537, 463, 5247493815, 2
mbuf_jumbo_pagesize: 4096, 0, 0, 0, 0, 0
mbuf_jumbo_9k: 9216, 0, 0, 0, 0, 0
mbuf_jumbo_16k: 16384, 0, 0, 0, 0, 0
ACL UMA zone: 388, 0, 0, 0, 0, 0
NetGraph items: 36, 546, 0, 546, 251943928450, 1170428
g_bio: 132, 0, 1, 231, 336628343, 0
ata_request: 204, 0, 1, 316, 82269680, 0
ata_composite: 196, 0, 0, 0, 0, 0
VNODE: 272, 0, 2039, 14523, 40154724, 0
VNODEPOLL: 76, 0, 0, 50, 1, 0
S VFS Cache: 68, 0, 2247, 12929, 41383752, 0
L VFS Cache: 291, 0, 0, 364, 536802, 0
NAMEI: 1024, 0, 372, 12, 126634007, 0
NFSMOUNT: 480, 0, 0, 0, 0, 0
NFSNODE: 460, 0, 0, 0, 0, 0
DIRHASH: 1024, 0, 156, 184, 131252, 0
PIPE: 408, 0, 24, 30, 822603, 0
KNOTE: 68, 0, 0, 112, 249530, 0
bridge_rtnode: 32, 0, 0, 0, 0, 0
socket: 356, 8778, 75, 35, 1488596, 0
ipq: 32, 339, 0, 226, 58472202, 0
udpcb: 180, 8778, 17, 49, 239035, 0
inpcb: 180, 8778, 23, 109, 676919, 0
tcpcb: 464, 8768, 22, 34, 676919, 0
tcptw: 48, 1794, 1, 233, 177851, 0
syncache: 100, 15366, 0, 78, 610893, 0
hostcache: 76, 15400, 78, 72, 13137, 0
tcpreass: 20, 676, 0, 169, 48826, 0
sackhole: 20, 0, 0, 169, 194, 0
ripcb: 180, 8778, 4, 40, 142316, 0
unpcb: 144, 8775, 19, 62, 393432, 0
rtentry: 132, 0, 480, 187, 448160, 0
pfsrctrpl: 100, 0, 0, 0, 0, 0
pfrulepl: 604, 0, 0, 0, 0, 0
pfstatepl: 260, 10005, 0, 0, 0, 0
pfaltqpl: 128, 0, 0, 0, 0, 0
pfpooladdrpl: 68, 0, 0, 0, 0, 0
pfrktable: 1240, 0, 0, 0, 0, 0
pfrkentry: 156, 0, 0, 0, 0, 0
pfrkentry2: 156, 0, 0, 0, 0, 0
pffrent: 16, 5075, 0, 0, 0, 0
pffrag: 48, 0, 0, 0, 0, 0
pffrcache: 48, 10062, 0, 0, 0, 0
pffrcent: 12, 50141, 0, 0, 0, 0
pfstatescrub: 28, 0, 0, 0, 0, 0
pfiaddrpl: 92, 0, 0, 0, 0, 0
pfospfen: 108, 0, 0, 0, 0, 0
pfosfp: 28, 0, 0, 0, 0, 0
IPFW dynamic rule zone: 108, 0, 147, 393, 20301589, 0
divcb: 180, 8778, 2, 42, 45, 0
SWAPMETA: 276, 30548, 2257, 473, 348836, 0
Mountpoints: 664, 0, 8, 10, 100, 0
FFS inode: 132, 0, 2000, 6468, 40152792, 0
FFS1 dinode: 128, 0, 0, 0, 0, 0
FFS2 dinode: 256, 0, 2000, 3730, 40152792, 0

-- 
WBR, Vadim Goncharov