From owner-freebsd-performance@FreeBSD.ORG Mon Apr 25 17:10:05 2005
Date: Mon, 25 Apr 2005 18:12:12 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
To: performance@FreeBSD.org
In-Reply-To: <20050425114546.O74930@fledge.watson.org>
Message-ID: <20050425181101.Y74930@fledge.watson.org>
References: <20050417134448.L85588@fledge.watson.org>
    <20050425114546.O74930@fledge.watson.org>
Subject: Re: Memory allocation performance/statistics patches
List-Id: Performance/tuning

On Mon, 25 Apr 2005, Robert Watson wrote:

> I now have updated versions of these patches, which correct some
> inconsistencies in approach (universal use of curcpu now, for example),
> remove some debugging code, etc.  I've received relatively little
> performance feedback on them, and would appreciate it if I could get
> some. :-)  Especially as to whether these impact disk I/O related
> workloads, useful macrobenchmarks, etc.
> The latest patch is at:
>
> http://www.watson.org/~robert/freebsd/netperf/20050425-uma-mbuf-malloc-critical.diff

FYI: for those set up to track Perforce, you can find the contents of this
patch in:

  //depot/user/rwatson/percpu/...

In addition, that branch also contains diagnostic micro-benchmarks in the
kernel to measure the cost of various synchronization operations, memory
allocation operations, etc., which can be queried using "sysctl test".

Robert N M Watson

> The changes in the following files in the combined patch are intended to
> be broken out into separate patches, as desired, as follows:
>
>   kern_malloc.c     malloc.diff
>   kern_mbuf.c       mbuf.diff
>   uipc_mbuf.c       mbuf.diff
>   uipc_syscalls.c   mbuf.diff
>   malloc.h          malloc.diff
>   mbuf.h            mbuf.diff
>   pcpu.h            malloc.diff, mbuf.diff, uma.diff
>   uma_core.c        uma.diff
>   uma_int.h         uma.diff
>
> I.e., the pcpu.h changes are a dependency for all of the remaining
> changes.  As before, I'm interested in both the impact of the individual
> patches and the net effect of the total change with all patches applied.
>
> Because this diff was generated by p4, patch may need some help in
> identifying the targets of each part of the diff.
>
> Robert N M Watson
>
> On Sun, 17 Apr 2005, Robert Watson wrote:
>
>> Attached please find three patches:
>>
>> (1) uma.diff, which modifies the UMA slab allocator to use critical
>>     sections instead of mutexes to protect per-CPU caches.
>>
>> (2) malloc.diff, which modifies the malloc memory allocator to use
>>     critical sections and per-CPU data instead of mutexes to store
>>     per-malloc-type statistics, coalescing them for the purposes of the
>>     sysctl used to generate vmstat -m output.
>>
>> (3) mbuf.diff, which modifies the mbuf allocator to use per-CPU data
>>     and critical sections for statistics, instead of
>>     synchronization-free statistics, which could result in substantial
>>     inconsistency on SMP systems.
>>
>> These changes are facilitated by John Baldwin's recent re-introduction
>> of critical section optimizations that permit critical sections to be
>> implemented "in software", rather than using the hardware interrupt
>> disable mechanism, which is quite expensive on modern processors
>> (especially Xeon P4 CPUs).  While not identical, this is similar to
>> the softspl behavior in 4.x, and to Linux's preemption disable
>> mechanisms (and those of various other post-VAX systems :-)).
>>
>> The reason this is interesting is that it allows synchronization of
>> per-CPU data to be performed at a much lower cost than previously, and
>> consistently across UP and SMP systems.  Prior to these changes, using
>> critical sections and per-CPU data as an alternative to mutexes would
>> lead to an improvement on SMP, but not on UP.  So, that said, here's
>> what I'd like us to look at:
>>
>> - Patches (1) and (2) are intended to improve performance by reducing
>>   the overhead of maintaining cache consistency and statistics for UMA
>>   and malloc(9), and may universally impact performance (in a small
>>   way) due to the breadth of their use throughout the kernel.
>>
>> - Patch (3) is intended to restore consistency to statistics in the
>>   presence of SMP and preemption, at the possible cost of some
>>   performance.
>>
>> I'd like to confirm that, for the first two patches, performance on
>> interesting workloads generally improves and that stability doesn't
>> degrade.  For the third patch, I'd like to quantify the cost of the
>> changes for interesting workloads, and likewise confirm no loss of
>> stability.
>>
>> Because these changes will have a relatively small impact, a fair
>> amount of caution is required in testing.  We may be talking about a
>> percent or two, maybe four, difference in benchmark performance, and
>> many benchmarks have a higher variance than that.
>>
>> A couple of observations for those interested:
>>
>> - The INVARIANTS panic with UMA seen in some earlier patch versions is
>>   believed to be corrected.
>>
>> - Right now, because I use arrays of foo[MAXCPUS], I'm concerned that
>>   different CPUs will be writing to the same cache line, as the entries
>>   are adjacent in memory.  Moving to per-CPU chunks of memory to hold
>>   this stuff is desirable, but I think we first need to identify a
>>   model by which to do that cleanly.  I'm not currently enamored of the
>>   'struct pcpu' model, since it makes us very sensitive to ABI changes
>>   and doesn't offer a clean way for modules to register new per-CPU
>>   data.  I'm also inconsistent about how I dereference into the
>>   arrays, and intend to move to using 'curcpu' throughout.
>>
>> - Because mutexes are no longer used in UMA, nor for the others,
>>   stats that are read across different CPUs and then coalesced may be
>>   slightly inconsistent.  I'm not all that concerned about it, but
>>   it's worth thinking on.
>>
>> - Malloc stats for realloc() are still broken if you apply this patch.
>>
>> - High watermarks are no longer maintained for malloc, since they
>>   require a global notion of "high" that is tracked continuously
>>   (i.e., at each change), and there's no longer a global view except
>>   when the observer kicks in (sysctl).  You can imagine various models
>>   to restore some notion of a high watermark, but I'm not currently
>>   sure which is best.  The high watermark notion is desirable, though.
>>
>> So this is a request for:
>>
>> (1) Stability testing of these patches.  Put them on a machine, make
>>     them hurt.  If things go south, try applying the patches one by
>>     one until it's clear which is the source.
>>
>> (2) Performance testing of these patches, subject to the challenges
>>     in testing them.  If you are interested, please test each patch
>>     separately to evaluate its impact on your system, then apply them
>>     all together and see how it evens out.
>>     You may find that the cost of the mbuf allocator patch outweighs
>>     the benefits of the other two patches; if so, that is interesting
>>     and something to work on!
>>
>> I've done some micro-benchmarking using tools like netblast,
>> syscall_timing, etc., but I'm particularly interested in the impact on
>> macrobenchmarks.
>>
>> Thanks!
>>
>> Robert N M Watson

> _______________________________________________
> freebsd-performance@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-performance
> To unsubscribe, send any mail to
> "freebsd-performance-unsubscribe@freebsd.org"