From: Andriy Gapon <avg@icyb.net.ua>
To: freebsd-hackers@FreeBSD.org
Cc: Jeff Roberson, "Robert N. M. Watson"
Date: Sun, 22 Aug 2010 23:45:20 +0300
Subject: uma: zone fragmentation
Message-ID: <4C718C60.2010205@icyb.net.ua>

It seems that with the inclusion of ZFS, which is a significant UMA user even
when it is not used for the ARC, zone fragmentation becomes an issue.  For
example, on my systems with 4GB of RAM I routinely observe several hundred
megabytes in free items after zone draining (via the lowmem event).
I wrote a one-liner (quite a long line, though) for post-processing vmstat -z
output, and here's an example:

$ vmstat -z | sed -e 's/ /_/' -e 's/:_* / /' -e 's/,//g' | tail +3 | \
  awk 'BEGIN { total = 0; } { total += $2 * $5; print $2 * $5, $1, $4, $5, $2; } \
  END { print total, "total"; }' | sort -n | tail -10
6771456 256 7749 26451 256
10710144 128 173499 83673 128
13400424 VM_OBJECT 33055 62039 216
17189568 zfs_znode_cache 33259 48834 352
19983840 VNODE 33455 41633 480
30936464 arc_buf_hdr_t 145387 148733 208
57030400 dmu_buf_impl_t 82816 254600 224
57619296 dnode_t 78811 73494 784
62067712 512 71050 121226 512
302164776 total

(The columns are: wasted bytes, zone name, used items, free items, item size.)

When UMA is used for the ARC, the "wasted" memory grows above 1GB, effectively
making that setup unusable for me.

I see that in OpenSolaris they developed a few measures to (try to) prevent
fragmentation and to perform defragmentation.  First, they keep their
equivalent of the partial slab list sorted by the number of used items, thus
trying to fill up the most used slab.  Second, they allow setting a 'move'
callback for a zone and have a special monitoring thread that tries to compact
slabs when zone fragmentation goes above a certain limit.  The details can be
found here (in the lengthy comment at the beginning and the links in it):
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/kmem.c

I am not sure whether we would want to implement anything like that or some
alternative, but zone fragmentation seems to have become an issue, at least
for ZFS.

I am testing the following primitive patch that tries to "lazily sort" (or
pseudo-sort) the slab partial list.  A linked list is not the kind of data
structure that is easy to keep sorted in an efficient manner.

diff --git a/sys/vm/uma_core.c b/sys/vm/uma_core.c
index 2dcd14f..ed07ecb 100644
--- a/sys/vm/uma_core.c
+++ b/sys/vm/uma_core.c
@@ -2727,14 +2727,26 @@ zone_free_item(uma_zone_t zone, void *item, void *udata,
 	}
 	MPASS(keg == slab->us_keg);
 
-	/* Do we need to remove from any lists? */
+	/* Move to the appropriate list or re-queue further from the head. */
 	if (slab->us_freecount+1 == keg->uk_ipers) {
+		/* Partial -> free. */
 		LIST_REMOVE(slab, us_link);
 		LIST_INSERT_HEAD(&keg->uk_free_slab, slab, us_link);
 	} else if (slab->us_freecount == 0) {
+		/* Full -> partial. */
 		LIST_REMOVE(slab, us_link);
 		LIST_INSERT_HEAD(&keg->uk_part_slab, slab, us_link);
 	}
+	else {
+		/* Partial -> partial. */
+		uma_slab_t tmp;
+
+		tmp = LIST_NEXT(slab, us_link);
+		if (tmp != NULL && slab->us_freecount > tmp->us_freecount) {
+			LIST_REMOVE(slab, us_link);
+			LIST_INSERT_AFTER(tmp, slab, us_link);
+		}
+	}
 
 	/* Slab management stuff */
 	freei = ((unsigned long)item - (unsigned long)slab->us_data)

Unfortunately, I don't have any conclusive results to report.  The numbers
seem to be better with the patch, but they change all the time depending on
system usage.  I couldn't think of any good test that would reflect real-world
usage patterns, which I believe are not entirely random.

-- 
Andriy Gapon