From owner-freebsd-hackers@FreeBSD.ORG Sun Sep 19 08:26:37 2010
Date: Sat, 18 Sep 2010 22:27:42 -1000 (HST)
From: Jeff Roberson <jroberson@jroberson.net>
To: Andriy Gapon
Cc: Andre Oppermann, Jeff Roberson, Robert Watson, freebsd-hackers@freebsd.org
Subject: Re: zfs + uma
In-Reply-To: <4C95C804.1010701@freebsd.org>
References: <4C93236B.4050906@freebsd.org> <4C935F56.4030903@freebsd.org>
 <4C95C804.1010701@freebsd.org>

On Sun, 19 Sep 2010, Andriy Gapon wrote:

> on 19/09/2010 01:16 Jeff Roberson said the following:
>> Not specifically in reaction to Robert's comment, but I would like to add
>> my thoughts to this notion of resource balancing in buckets.  I really
>> prefer not to do any specific per-zone tuning except in extreme cases,
>> because quite often the decisions we make don't apply to some class of
>> machines or workloads.  I would instead prefer to keep the algorithm
>> adaptable.
>
> Agree.
>
>> I like the idea of weighting the bucket decisions by the size of the item.
>> Obviously this has some flaws with compound objects, but in the general
>> case it is good.  We should consider increasing the cost of bucket
>> expansion based on the size of the item.  Right now buckets are expanded
>> fairly readily.
>>
>> We could also consider decreasing the default bucket size for a zone based
>> on vm pressure and use.  Right now there is no downward pressure on bucket
>> size, only upward pressure based on trips to the slab layer.
>>
>> Additionally, we could make a last-ditch flush mechanism that runs on each
>> cpu in turn and flushes some or all of the buckets in the per-cpu caches.
>> Presently that is not done due to synchronization issues; it can't be done
>> from a central place.  It could be done with a callout mechanism or with a
>> for loop that binds to each core in succession.
>
> I like all three of the approaches above.
> The last one is a bit hard to implement; the first two seem easier.

All the last one requires is a loop calling sched_bind() on each available
cpu.
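For illustration only, a minimal sketch of that loop, assuming a hypothetical
uma_cache_drain_cpu() helper that empties the local buckets once the thread
is running on the target CPU (this is not existing UMA code):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/proc.h>
#include <sys/sched.h>
#include <sys/smp.h>
#include <vm/uma.h>

static void
uma_reclaim_pcpu(uma_zone_t zone)
{
	int cpu;

	CPU_FOREACH(cpu) {
		/* Migrate this thread to the target CPU. */
		thread_lock(curthread);
		sched_bind(curthread, cpu);
		thread_unlock(curthread);

		/*
		 * Now executing on 'cpu', so its cache can be drained
		 * without cross-CPU synchronization.
		 */
		uma_cache_drain_cpu(zone, cpu);	/* hypothetical helper */
	}
	thread_lock(curthread);
	sched_unbind(curthread);
	thread_unlock(curthread);
}
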
>
>> I believe the combination of these approaches would significantly solve the
>> problem and should require relatively little new code.  It should also
>> preserve the adaptable nature of the system without penalizing
>> resource-heavy systems.  I would be happy to review patches from anyone who
>> wishes to undertake it.
>
> FWIW, the approach of simply limiting the maximum bucket size based on item
> size seems to work rather well too, as my testing with zfs + uma shows.
> I will also try to add code to completely bypass the per-cpu cache for
> "really huge" items.

I don't like this, because even with very large buffers you can still have
high enough turnover to require per-cpu caching.  Kip specifically added UMA
support in zfs to address this issue.  If you have allocations which don't
require per-cpu caching and are very large, why even use UMA?

One thing that would be nice, if we are frequently making page-size
allocations, is to eliminate the requirement for a slab header for each page.
It seems unnecessary for any zone where the number of items per slab is 1,
but it would require careful modification to support properly.

Thanks,
Jeff

>
> --
> Andriy Gapon
>
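For illustration only, a sketch of the size-based bucket cap discussed above;
the 64 KB per-bucket memory target, the BUCKET_MAX bound, and the
bucket_size_for_item() name are placeholder choices, not what UMA actually
implements:

#include <sys/param.h>

#define	BUCKET_MAX	128	/* assumed upper bound on entries per bucket */

/*
 * Pick a maximum bucket entry count for a zone so that one full per-cpu
 * bucket holds roughly 64 KB regardless of item size.  Assumes
 * item_size > 0.
 */
static int
bucket_size_for_item(size_t item_size)
{
	size_t entries;

	entries = (64 * 1024) / item_size;
	if (entries < 1)
		entries = 1;
	if (entries > BUCKET_MAX)
		entries = BUCKET_MAX;
	return ((int)entries);
}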