From: Ben Kelly <ben@wanderview.com>
To: current@freebsd.org
Date: Sat, 2 May 2009 00:49:39 -0400
Subject: [patch] zfs kmem fragmentation

Hello all,

Lately I've been looking into the "kmem too small" panics that often
occur with zfs if you don't restrict the arc.  What I found in my test
environment was that everything works well until kmem usage hits the
75% limit set in arc.c.  At that point the arc is shrunk and slabs are
reclaimed from uma.  Unfortunately, every time this reclamation process
runs, the kmem space becomes more fragmented.  The vast majority of the
time my machine hits the "kmem too small" panic it has over 200MB of
kmem space available, but the largest fragment is less than 128KB.

Ideally things would be arranged to free memory without fragmentation.
I have tried a few things along those lines, but none of them have been
successful so far.  I'm going to continue that work, but in the meantime
I've put together a patch that tries to avoid fragmentation by slowing
kmem growth before the aggressive reclamation process is required:

  http://www.wanderview.com/svn/public/misc/zfs/zfs_kmem_limit.diff

It uses the following heuristics to do this:

- Start arc_c at arc_c_min instead of arc_c_max.  This causes the
  system to warm up more slowly.
- Halve the rate at which arc_c grows when kmem usage exceeds
  kmem_slow_growth_thresh.
- Stop arc_c growth when kmem usage exceeds kmem_target.
- Evict arc data when kmem usage exceeds kmem_target.
- If kmem usage exceeds kmem_target, ask the pagedaemon to reclaim
  pages.
- If the largest kmem fragment is less than kmem_fragment_target, ask
  the pagedaemon to reclaim pages.
- If the largest kmem fragment is less than kmem_fragment_thresh, force
  the aggressive kmem/arc reclamation process.

The defaults for the various targets and thresholds are:

  kmem_reclaim_threshold = 7/8 kmem
  kmem_target = 3/4 kmem
  kmem_slow_growth_threshold = 5/8 kmem
  kmem_fragment_target = 1/8 kmem
  kmem_fragment_thresh = 1/16 kmem

With this patch I've been able to run my load tests with the default
arc size and kmem values of 512MB to 700MB.  I tried one loaded run
with a 300MB kmem, but it panicked due to legitimate, non-fragmented
kmem exhaustion.
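
To make the threshold arithmetic concrete, here is a small stand-alone
sketch of the checks listed above.  It is not code from the patch: the
variable names are placeholders, the checks are shown as independent
tests for readability, and only the thresholds whose actions are spelled
out above are included.

/*
 * Stand-alone sketch of the heuristics described above; not code from
 * zfs_kmem_limit.diff.  Plug in a kmem size, current usage, and largest
 * free fragment and it prints which actions would fire.
 */
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t kmem_size    = 512ULL << 20;	/* total kmem map: 512MB */
	uint64_t kmem_used    = 400ULL << 20;	/* current kmem usage */
	uint64_t largest_frag =  48ULL << 20;	/* largest free fragment */

	/* Default thresholds, expressed as fractions of kmem. */
	uint64_t slow_growth_thresh = kmem_size / 8 * 5;	/* 5/8 kmem */
	uint64_t target             = kmem_size / 4 * 3;	/* 3/4 kmem */
	uint64_t frag_target        = kmem_size / 8;		/* 1/8 kmem */
	uint64_t frag_thresh        = kmem_size / 16;		/* 1/16 kmem */

	if (kmem_used > slow_growth_thresh)
		printf("halve the rate at which arc_c grows\n");
	if (kmem_used > target)
		printf("stop arc_c growth, evict arc data, "
		    "wake the pagedaemon\n");
	if (largest_frag < frag_target)
		printf("ask the pagedaemon to reclaim pages\n");
	if (largest_frag < frag_thresh)
		printf("force the aggressive kmem/arc reclamation\n");

	return (0);
}

With the sample numbers above (400MB used out of 512MB, largest
fragment 48MB), growth is halved and then stopped and the pagedaemon is
woken for fragmentation, but the aggressive reclamation is not yet
forced.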
Please note that you may still encounter some fragmentation.  It's
possible for the system to get stuck in a degraded state where it's
constantly trying to free pages and memory in an attempt to fix the
fragmentation.  If the system is in this state, the
kstat.zfs.misc.arcstats.fragmented_kmem_count sysctl will be increasing
at a fairly rapid rate.  (A small sketch for watching that counter
while testing is at the end of this message.)

Anyway, I just thought I would put this out there in case anyone wanted
to try testing with it.  I've mainly been loading it using rsync
between two pools on a non-SMP i386 machine with 2GB of memory.

Also, if anyone is interested in helping with the fragmentation problem,
please let me know.  At this point I think the best odds are to modify
UMA to allow some zones to use a custom slab size of 128KB (the max zfs
buffer size) so that most of the allocations from kmem are the same
size.  It also occurred to me that much of this mess would be simpler
if kmem information were passed up through the vnode so that top-layer
entities like the pagedaemon could make better choices for the overall
memory usage of the system.  Right now we have a subsystem two or three
layers down making decisions for everyone.

Anyway, suggestions and insights are more than welcome.  Thanks!

- Ben
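
P.S.  For anyone who wants to watch for the degraded fragmentation
state while testing, here is a minimal user-space sketch (not part of
the patch) that polls the new counter via sysctlbyname(3) once a second
and reports how fast it is climbing.  It assumes the counter is
exported as a 64-bit integer like the other arcstats.

/* Minimal monitor sketch; assumes the patched kernel exports the
 * counter as a 64-bit integer. */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	const char *name = "kstat.zfs.misc.arcstats.fragmented_kmem_count";
	uint64_t prev, cur;
	size_t len;

	len = sizeof(prev);
	if (sysctlbyname(name, &prev, &len, NULL, 0) == -1) {
		perror("sysctlbyname");
		return (1);
	}
	for (;;) {
		sleep(1);
		len = sizeof(cur);
		if (sysctlbyname(name, &cur, &len, NULL, 0) == -1) {
			perror("sysctlbyname");
			return (1);
		}
		if (cur != prev)
			printf("fragmented_kmem_count = %ju (+%ju/sec)\n",
			    (uintmax_t)cur, (uintmax_t)(cur - prev));
		prev = cur;
	}
}

If the counter keeps climbing steadily, the system is likely stuck in
the degraded free-pages-and-evict loop described above.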