From: Ben Kelly <ben@wanderview.com>
To: current@freebsd.org
Date: Sat, 2 May 2009 00:49:39 -0400
Subject: [patch] zfs kmem fragmentation

Hello all,

Lately I've been looking into the "kmem too small" panics that often
occur with zfs if you don't restrict the arc.  What I found in my test
environment was that everything works well until kmem usage hits the
75% limit set in arc.c.  At that point the arc is shrunk and slabs are
reclaimed from uma.  Unfortunately, every time this reclamation process
runs, the kmem space becomes more fragmented.  The vast majority of the
time my machine hits the "kmem too small" panic it has over 200MB of
kmem space available, but the largest fragment is less than 128KB.

Ideally things would be arranged to free memory without fragmentation.
I have tried a few things along those lines, but none of them have been
successful so far.  I'm going to continue that work, but in the meantime
I've put together a patch that tries to avoid fragmentation by slowing
kmem growth before the aggressive reclamation process is required:

  http://www.wanderview.com/svn/public/misc/zfs/zfs_kmem_limit.diff

It uses the following heuristics to do this:

- Start arc_c at arc_c_min instead of arc_c_max.  This causes the
  system to warm up more slowly.
- Halve the rate at which arc_c grows when kmem usage exceeds
  kmem_slow_growth_thresh.
- Stop arc_c growth when kmem usage exceeds kmem_target.
- Evict arc data when kmem usage exceeds kmem_target.
- If kmem usage exceeds kmem_target, ask the pagedaemon to reclaim
  pages.
- If the largest kmem fragment is less than kmem_fragment_target, ask
  the pagedaemon to reclaim pages.
- If the largest kmem fragment is less than kmem_fragment_thresh, force
  the aggressive kmem/arc reclamation process.

The defaults for the various targets and thresholds are:

  kmem_reclaim_threshold = 7/8 kmem
  kmem_target = 3/4 kmem
  kmem_slow_growth_threshold = 5/8 kmem
  kmem_fragment_target = 1/8 kmem
  kmem_fragment_thresh = 1/16 kmem

With this patch I've been able to run my load tests with the default
arc size and kmem values of 512MB to 700MB.  I tried one loaded run
with a 300MB kmem, but it panicked due to legitimate, non-fragmented
kmem exhaustion.
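
To make the threshold arithmetic concrete, here is a small stand-alone
sketch of the checks listed above.  It is not code from the patch: the
variable names are placeholders, the checks are shown as independent
tests for readability, and only the thresholds whose actions are spelled
out above are included.

/*
 * Stand-alone sketch of the heuristics described above; not code from
 * zfs_kmem_limit.diff.  Plug in a kmem size, current usage, and largest
 * free fragment and it prints which actions would fire.
 */
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t kmem_size    = 512ULL << 20;	/* total kmem map: 512MB */
	uint64_t kmem_used    = 400ULL << 20;	/* current kmem usage */
	uint64_t largest_frag =  48ULL << 20;	/* largest free fragment */

	/* Default thresholds, expressed as fractions of kmem. */
	uint64_t slow_growth_thresh = kmem_size / 8 * 5;	/* 5/8 kmem */
	uint64_t target             = kmem_size / 4 * 3;	/* 3/4 kmem */
	uint64_t frag_target        = kmem_size / 8;		/* 1/8 kmem */
	uint64_t frag_thresh        = kmem_size / 16;		/* 1/16 kmem */

	if (kmem_used > slow_growth_thresh)
		printf("halve the rate at which arc_c grows\n");
	if (kmem_used > target)
		printf("stop arc_c growth, evict arc data, "
		    "wake the pagedaemon\n");
	if (largest_frag < frag_target)
		printf("ask the pagedaemon to reclaim pages\n");
	if (largest_frag < frag_thresh)
		printf("force the aggressive kmem/arc reclamation\n");

	return (0);
}

With the sample numbers above (400MB used out of 512MB, largest
fragment 48MB), growth is halved and then stopped and the pagedaemon is
woken for fragmentation, but the aggressive reclamation is not yet
forced.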
Please note that you may still encounter some fragmentation.  It's
possible for the system to get stuck in a degraded state where it's
constantly trying to free pages and memory in an attempt to fix the
fragmentation.  If the system is in this state, the
kstat.zfs.misc.arcstats.fragmented_kmem_count sysctl will be increasing
at a fairly rapid rate.  (A small sketch for watching that counter
while testing is at the end of this message.)

Anyway, I just thought I would put this out there in case anyone wanted
to try testing with it.  I've mainly been loading it using rsync
between two pools on a non-SMP i386 machine with 2GB of memory.

Also, if anyone is interested in helping with the fragmentation problem,
please let me know.  At this point I think the best odds are to modify
UMA to allow some zones to use a custom slab size of 128KB (the max zfs
buffer size) so that most of the allocations from kmem are the same
size.  It also occurred to me that much of this mess would be simpler
if kmem information were passed up through the vnode so that top-layer
entities like the pagedaemon could make better choices for the overall
memory usage of the system.  Right now we have a subsystem two or three
layers down making decisions for everyone.

Anyway, suggestions and insights are more than welcome.  Thanks!

- Ben
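
P.S.  For anyone who wants to watch for the degraded fragmentation
state while testing, here is a minimal user-space sketch (not part of
the patch) that polls the new counter via sysctlbyname(3) once a second
and reports how fast it is climbing.  It assumes the counter is
exported as a 64-bit integer like the other arcstats.

/* Minimal monitor sketch; assumes the patched kernel exports the
 * counter as a 64-bit integer. */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	const char *name = "kstat.zfs.misc.arcstats.fragmented_kmem_count";
	uint64_t prev, cur;
	size_t len;

	len = sizeof(prev);
	if (sysctlbyname(name, &prev, &len, NULL, 0) == -1) {
		perror("sysctlbyname");
		return (1);
	}
	for (;;) {
		sleep(1);
		len = sizeof(cur);
		if (sysctlbyname(name, &cur, &len, NULL, 0) == -1) {
			perror("sysctlbyname");
			return (1);
		}
		if (cur != prev)
			printf("fragmented_kmem_count = %ju (+%ju/sec)\n",
			    (uintmax_t)cur, (uintmax_t)(cur - prev));
		prev = cur;
	}
}

If the counter keeps climbing steadily, the system is likely stuck in
the degraded free-pages-and-evict loop described above.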