From: Ben Kelly <ben@wanderview.com>
To: Jeff Roberson
Cc: current@freebsd.org
Date: Tue, 5 May 2009 09:48:44 -0400
Subject: Re: [patch] zfs kmem fragmentation

On May 4, 2009, at 6:17 PM, Jeff Roberson wrote:
> On Sat, 2 May 2009, Ben Kelly wrote:
>> Hello all,
>>
>> Lately I've been looking into the "kmem too small" panics that often
>> occur with zfs if you don't restrict the arc.  What I found in my
>> test environment was that everything works well until the kmem usage
>> hits the 75% limit set in arc.c.  At this point the arc is shrunk and
>> slabs are reclaimed from uma.  Unfortunately, every time this
>> reclamation process runs, the kmem space becomes more fragmented.
>> The vast majority of the time my machine hits the "kmem too small"
>> panic it has over 200MB of kmem space available, but the largest
>> fragment is less than 128KB.
>
> What consumers make requests of kmem for 128KB and over?  What
> ultimately trips the panic?

ZFS buffers range from 512 bytes to 128KB.  I don't know of any
allocations above 128KB at the moment.  In my workload the panic is
usually caused by zfs attempting to allocate a 128KB buffer, although
sometimes it's only a 64KB buffer.

At one point I hacked in some instrumentation to print the kmem_map
vm_map_entry list when I touched a sysctl MIB.  Here's a capture I made
during my load test as the fragmentation was occurring:

  http://www.wanderview.com/svn/public/misc/zfs/fragmentation.txt

I also added some debugging later to show the consumers of the
allocations.  The vast majority of them were from the opensolaris
subsystem.  Unfortunately I don't have a capture of that output handy.
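In case it is useful to anyone else, the instrumentation was roughly
along the lines of the sketch below: a read-only sysctl that walks the
kmem_map entry list.  This variant just reports the largest free gap
rather than printing every entry, and it is reconstructed from memory
rather than being the exact code I had, so the vm_map field names
(header.next, min_offset, max_offset) may need small adjustments for
your tree.

/*
 * Rough sketch (reconstructed, untested): report the largest free
 * fragment in kmem_map through a read-only sysctl.  Walks the sorted
 * entry list and measures the gaps between allocated entries.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>

#include <vm/vm.h>
#include <vm/vm_kern.h>
#include <vm/vm_map.h>

static int
sysctl_kmem_largest_frag(SYSCTL_HANDLER_ARGS)
{
    vm_map_t map = kmem_map;
    vm_map_entry_t entry;
    u_long prev_end, gap, largest;

    largest = 0;
    vm_map_lock_read(map);
    prev_end = map->min_offset;
    for (entry = map->header.next; entry != &map->header;
        entry = entry->next) {
        if (entry->start > prev_end) {
            gap = entry->start - prev_end;
            if (gap > largest)
                largest = gap;
        }
        prev_end = entry->end;
    }
    if (map->max_offset > prev_end &&
        map->max_offset - prev_end > largest)
        largest = map->max_offset - prev_end;
    vm_map_unlock_read(map);

    return (sysctl_handle_long(oidp, &largest, 0, req));
}

SYSCTL_PROC(_debug, OID_AUTO, kmem_largest_frag,
    CTLTYPE_ULONG | CTLFLAG_RD, NULL, 0,
    sysctl_kmem_largest_frag, "LU",
    "Largest contiguous free range in kmem_map");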
>> Ideally things would be arranged to free memory without
>> fragmentation.  I have tried a few things along those lines, but
>> none of them have been successful so far.  I'm going to continue
>> that work, but in the meantime I've put together a patch that tries
>> to avoid fragmentation by slowing kmem growth before the aggressive
>> reclamation process is required:
>>
>>   http://www.wanderview.com/svn/public/misc/zfs/zfs_kmem_limit.diff
>>
>> It uses the following heuristics to do this:
>>
>>  - Start arc_c at arc_c_min instead of arc_c_max.  This causes the
>>    system to warm up more slowly.
>>  - Halve the rate at which arc_c grows when kmem usage exceeds
>>    kmem_slow_growth_thresh.
>>  - Stop arc_c growth when kmem usage exceeds kmem_target.
>>  - Evict arc data when kmem usage exceeds kmem_target.
>>  - If kmem usage exceeds kmem_target, ask the pagedaemon to reclaim
>>    pages.
>>  - If the largest kmem fragment is smaller than kmem_fragment_target,
>>    ask the pagedaemon to reclaim pages.
>>  - If the largest kmem fragment is smaller than kmem_fragment_thresh,
>>    force the aggressive kmem/arc reclamation process.
>>
>> The defaults for the various targets and thresholds are:
>>
>>   kmem_reclaim_threshold     = 7/8 kmem
>>   kmem_target                = 3/4 kmem
>>   kmem_slow_growth_threshold = 5/8 kmem
>>   kmem_fragment_target       = 1/8 kmem
>>   kmem_fragment_thresh       = 1/16 kmem
>>
>> With this patch I've been able to run my load tests with the default
>> arc size and kmem values of 512MB to 700MB.  I tried one loaded run
>> with a 300MB kmem, but it panicked due to legitimate, non-fragmented
>> kmem exhaustion.
>
> May I suggest an alternate approach: have you considered placing zfs
> in its own kernel submap?  If all of its allocations are of a like
> size, fragmentation won't be an issue and it can be constrained to a
> fixed size without placing pressure on other kmem_map consumers.
> This is the approach taken for the buffer cache.  It makes a good
> deal of sense.  If arc can be taught to handle allocation failures,
> we could eliminate the panic entirely by simply causing arc to run
> out of space and flush more buffers.
>
> Do you believe this would also address the problem?

Using a separate submap might help.  It seems that the fragmentation
is occurring due to the interaction of the smaller and larger buffers
within zfs.  I believe that in opensolaris, data buffers and meta-data
buffers are allocated from separate arenas.  We don't do this
currently, and it may be the cause of some of the fragmentation.  It
also occurred to me that it might be nice if the arc could somehow
share the buffer cache directly.

Unfortunately I am moving this Friday and will probably be unable to
really look at this for the next couple of weeks.

Thanks.

- Ben
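P.S.  For anyone who would rather not read the whole diff, the growth
throttling at the heart of the patch boils down to roughly the sketch
below.  It is a simplification, not the literal diff: the function name
is only for illustration, the eviction and pagedaemon paths live
elsewhere, and the fractions match the defaults listed above.
kmem_size(), kmem_used(), arc_c, arc_c_min, and arc_c_max are the
existing symbols from the FreeBSD arc.c / opensolaris compat code.

/*
 * Simplified sketch of the arc_c growth throttling described above
 * (illustrative only, not the literal patch).
 */
static void
arc_grow_throttled(uint64_t bytes)
{
    uint64_t size = kmem_size();    /* total kmem_map size */
    uint64_t used = kmem_used();    /* current kmem_map usage */

    /* kmem_target = 3/4 of kmem: stop growing arc_c entirely. */
    if (used > size / 4 * 3)
        return;

    /* kmem_slow_growth_thresh = 5/8 of kmem: grow at half speed. */
    if (used > size / 8 * 5)
        bytes /= 2;

    atomic_add_64(&arc_c, bytes);
    if (arc_c > arc_c_max)
        arc_c = arc_c_max;
    if (arc_c < arc_c_min)
        arc_c = arc_c_min;
}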