From: Andriy Gapon <avg@icyb.net.ua>
To: freebsd-hackers@FreeBSD.org
Cc: Jeff Roberson, "Robert N. M. Watson"
Date: Sun, 22 Aug 2010 23:45:20 +0300
Subject: uma: zone fragmentation
Message-ID: <4C718C60.2010205@icyb.net.ua>

It seems that with the inclusion of ZFS, which is a significant UMA user even
when it is not used for the ARC, zone fragmentation becomes an issue.  For
example, on my systems with 4GB of RAM I routinely observe several hundred
megabytes in free items after zone draining (via the lowmem event).
I wrote a one-liner (quite a long line, though) for post-processing vmstat -z
output, and here's an example:

$ vmstat -z | sed -e 's/ /_/' -e 's/:_* / /' -e 's/,//g' | tail +3 | \
  awk 'BEGIN { total = 0; } { total += $2 * $5; print $2 * $5, $1, $4, $5, $2; } \
  END { print total, "total"; }' | sort -n | tail -10
6771456 256 7749 26451 256
10710144 128 173499 83673 128
13400424 VM_OBJECT 33055 62039 216
17189568 zfs_znode_cache 33259 48834 352
19983840 VNODE 33455 41633 480
30936464 arc_buf_hdr_t 145387 148733 208
57030400 dmu_buf_impl_t 82816 254600 224
57619296 dnode_t 78811 73494 784
62067712 512 71050 121226 512
302164776 total

(The columns are: wasted bytes, zone name, used items, free items, item size.)

When UMA is used for the ARC, the "wasted" memory grows above 1GB, effectively
making that setup unusable for me.

I see that in OpenSolaris they developed a few measures to (try to) prevent
fragmentation and to perform defragmentation.  First, they keep their
equivalent of the partial slab list sorted by the number of used items, thus
trying to fill up the most used slab.  Second, they allow setting a 'move'
callback for a zone and have a special monitoring thread that tries to compact
slabs when zone fragmentation goes above a certain limit.  The details can be
found here (in the lengthy comment at the beginning and the links in it):
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/kmem.c

I am not sure whether we would want to implement anything like that or some
alternative, but zone fragmentation seems to have become an issue, at least
for ZFS.

I am testing the following primitive patch that tries to "lazily sort" (or
pseudo-sort) the slab partial list.  A linked list is not the kind of data
structure that is easy to keep sorted in an efficient manner.

diff --git a/sys/vm/uma_core.c b/sys/vm/uma_core.c
index 2dcd14f..ed07ecb 100644
--- a/sys/vm/uma_core.c
+++ b/sys/vm/uma_core.c
@@ -2727,14 +2727,26 @@ zone_free_item(uma_zone_t zone, void *item, void *udata,
 	}
 	MPASS(keg == slab->us_keg);
 
-	/* Do we need to remove from any lists? */
+	/* Move to the appropriate list or re-queue further from the head. */
 	if (slab->us_freecount+1 == keg->uk_ipers) {
+		/* Partial -> free. */
 		LIST_REMOVE(slab, us_link);
 		LIST_INSERT_HEAD(&keg->uk_free_slab, slab, us_link);
 	} else if (slab->us_freecount == 0) {
+		/* Full -> partial. */
 		LIST_REMOVE(slab, us_link);
 		LIST_INSERT_HEAD(&keg->uk_part_slab, slab, us_link);
 	}
+	else {
+		/* Partial -> partial. */
+		uma_slab_t tmp;
+
+		tmp = LIST_NEXT(slab, us_link);
+		if (tmp != NULL && slab->us_freecount > tmp->us_freecount) {
+			LIST_REMOVE(slab, us_link);
+			LIST_INSERT_AFTER(tmp, slab, us_link);
+		}
+	}
 
 	/* Slab management stuff */
 	freei = ((unsigned long)item - (unsigned long)slab->us_data)

Unfortunately, I don't have any conclusive results to report.  The numbers
seem to be better with the patch, but they change all the time depending on
system usage.  I couldn't think of any good test that would reflect real-world
usage patterns, which I believe are not entirely random.

-- 
Andriy Gapon