From: Jeff Roberson <jroberson@jroberson.net>
Date: Sat, 18 Sep 2010 12:16:49 -1000 (HST)
To: Robert Watson
Cc: freebsd-hackers@freebsd.org, Jeff Roberson, Andre Oppermann, Andriy Gapon
Subject: Re: zfs + uma

On Sat, 18 Sep 2010, Robert Watson wrote:

>
> On Fri, 17 Sep 2010, Andre Oppermann wrote:
>
>>> Although keeping free items around improves performance, it does
>>> consume memory too.  And the fact that that memory is not freed on a
>>> lowmem condition makes the situation worse.
>>
>> Interesting.  We may run into related issues with excessive mbuf
>> (cluster) caching in the per-cpu buckets as well.
>>
>> Having a general solution for that would be appreciated.  Maybe the
>> size of the free per-cpu buckets should be specified when setting up
>> the UMA zone.  Of certain frequently re-used elements we may want to
>> cache more, of others less.
>
> I've been keeping a vague eye out for this over the last few years, and
> haven't spotted many problems in production machines I've inspected.
> You can use the umastat tool in the tools tree to look at the
> distribution of memory over buckets (etc.) in UMA manually.  It would
> be nice if it had some automated statistics on fragmentation, however.
> Short-lived fragmentation is likely, and isn't an issue, so what you
> want is a tool that monitors over time and reports on longer-lived
> fragmentation.

Not specifically in reaction to Robert's comment, but I would like to add my thoughts to this notion of resource balancing in buckets.

I really prefer not to do any specific per-zone tuning except in extreme cases, because quite often the decisions we make don't apply to some class of machines or workloads.  I would instead prefer to keep the algorithm adaptable.

I like the idea of weighting the bucket decisions by the size of the item.  Obviously this has some flaws with compound objects, but in the general case it is good.  We should consider increasing the cost of bucket expansion based on the size of the item; right now buckets are expanded fairly readily.
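As a rough illustration of what I mean (field names are from memory, uz_expand is a made-up counter, and the divisor is an arbitrary placeholder, so treat this as a sketch rather than a patch), the spot where we bump the bucket size after a trip to the slab layer could become something like:

/*
 * Sketch only: rather than growing the bucket on every miss, require
 * more consecutive misses before growing for larger items, so zones
 * with big items keep fewer free items cached per cpu.
 */
static void
zone_count_expand(uma_zone_t zone)
{
	int cost;

	/* One extra miss per 128 bytes of item size before we grow. */
	cost = 1 + zone->uz_size / 128;
	if (++zone->uz_expand < cost)
		return;
	zone->uz_expand = 0;
	if (zone->uz_count < BUCKET_MAX)
		zone->uz_count++;
}

The same weighting could then be applied in reverse when shrinking, so that large-item zones give memory back first.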
We could also consider decreasing the default bucket size for a zone based on vm pressure and use.  Right now there is no downward pressure on bucket size, only upward pressure based on trips to the slab layer.

Additionally, we could add a last-ditch flush mechanism that runs on each cpu in turn and flushes some or all of the buckets in the per-cpu caches.  Presently that is not done because of synchronization issues; it can't be done from a central place.  It could be done with a callout mechanism or with a loop that binds to each core in succession (a rough sketch of the latter is appended at the end of this mail).

I believe the combination of these approaches would go a long way toward solving the problem and should require relatively little new code.  It should also preserve the adaptable nature of the system without penalizing resource-heavy systems.  I would be happy to review patches from anyone who wishes to undertake it.

>
> The main fragmentation issue we've had in the past has been due to
> mbuf+cluster caching, which prevented mbufs from being freed usefully
> in some cases.  Jeff's ongoing work on variable-sized mbufs would
> entirely eliminate that problem...

I'm going to get back to this as soon as infiniband gets to a useful state for doing high-performance network testing.  This is only because I have no 10gigE but do have ib, and I have funding to cover working on it.  I hope to have some results and activity on this front by the end of the year.  I know it has been a long time coming.

Thanks,
Jeff

>
> Robert
>
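P.S. For the per-cpu flush, I am imagining something along these lines.  This is only a sketch: cache_bucket_drain() is a placeholder for whatever actually pushes the cached buckets back to the zone, and the real version probably wants to unlink the buckets inside the critical section and free them outside of it.

static void
zone_drain_pcpu(uma_zone_t zone)
{
	uma_cache_t cache;
	int cpu;

	CPU_FOREACH(cpu) {
		/* Migrate onto 'cpu' so its cache is local to us. */
		thread_lock(curthread);
		sched_bind(curthread, cpu);
		thread_unlock(curthread);

		/*
		 * Per-cpu caches are only touched from their own cpu
		 * inside a critical section, so this should be safe.
		 */
		critical_enter();
		cache = &zone->uz_cpu[curcpu];
		cache_bucket_drain(zone, cache);	/* placeholder */
		critical_exit();
	}

	thread_lock(curthread);
	sched_unbind(curthread);
	thread_unlock(curthread);
}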