Date: Mon, 25 Apr 2005 18:12:12 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
To: performance@FreeBSD.org
Subject: Re: Memory allocation performance/statistics patches
Message-ID: <20050425181101.Y74930@fledge.watson.org>
In-Reply-To: <20050425114546.O74930@fledge.watson.org>
References: <20050417134448.L85588@fledge.watson.org> <20050425114546.O74930@fledge.watson.org>
On Mon, 25 Apr 2005, Robert Watson wrote:

> I now have updated versions of these patches, which correct some
> inconsistencies in approach (universal use of curcpu now, for example),
> remove some debugging code, etc.  I've received relatively little
> performance feedback on them, and would appreciate it if I could get
> some. :-)  Especially as to whether these impact disk I/O related
> workloads, useful macrobenchmarks, etc.  The latest patch is at:
>
>   http://www.watson.org/~robert/freebsd/netperf/20050425-uma-mbuf-malloc-critical.diff

FYI: For those set up to track perforce, you can find the contents of
this patch in:

  //depot/user/rwatson/percpu/...

In addition, that branch also contains diagnostic micro-benchmarks in the
kernel to measure the cost of various synchronization operations, memory
allocation operations, etc., which can be queried using "sysctl test".

Robert N M Watson

> The changes in the following files in the combined patch are intended
> to be broken out into separate patches, as desired, as follows:
>
>   kern_malloc.c     malloc.diff
>   kern_mbuf.c       mbuf.diff
>   uipc_mbuf.c       mbuf.diff
>   uipc_syscalls.c   mbuf.diff
>   malloc.h          malloc.diff
>   mbuf.h            mbuf.diff
>   pcpu.h            malloc.diff, mbuf.diff, uma.diff
>   uma_core.c        uma.diff
>   uma_int.h         uma.diff
>
> I.e., the pcpu.h changes are a dependency for all of the remaining
> changes.  As before, I'm interested in both the impact of individual
> patches and the net effect of the total change associated with all
> patches applied.
>
> Because this diff was generated by p4, patch may need some help in
> identifying the targets of each part of the diff.
>
> Robert N M Watson
>
> On Sun, 17 Apr 2005, Robert Watson wrote:
>
>> Attached please find three patches:
>>
>> (1) uma.diff, which modifies the UMA slab allocator to use critical
>>     sections instead of mutexes to protect per-CPU caches.
>>
>> (2) malloc.diff, which modifies the malloc memory allocator to use
>>     critical sections and per-CPU data instead of mutexes to store
>>     per-malloc-type statistics, coalescing for the purposes of the
>>     sysctl used to generate vmstat -m output.
>>
>> (3) mbuf.diff, which modifies the mbuf allocator to use per-CPU data
>>     and critical sections for statistics, instead of
>>     synchronization-free statistics, which could result in substantial
>>     inconsistency on SMP systems.
>>
>> These changes are facilitated by John Baldwin's recent re-introduction
>> of critical section optimizations that permit critical sections to be
>> implemented "in software", rather than using the hardware interrupt
>> disable mechanism, which is quite expensive on modern processors
>> (especially Xeon P4 CPUs).  While not identical, this is similar to
>> the softspl behavior in 4.x, and Linux's preemption disable mechanisms
>> (and various other post-Vax systems :-)).
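To illustrate the pattern shared by all three patches, here is a minimal
sketch of a per-CPU statistic protected by a critical section rather than
a mutex.  The "foo" names are hypothetical and do not appear in the
patches; critical_enter(), critical_exit(), curcpu, and MAXCPU are stock
kernel interfaces.

    /*
     * Minimal sketch only: update a per-CPU counter under a critical
     * section.  critical_enter() disables preemption, so no other
     * thread can run on this CPU until critical_exit().
     */
    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/pcpu.h>

    static u_long foo_allocs[MAXCPU];   /* hypothetical per-CPU statistic */

    static void
    foo_count_alloc(void)
    {

        critical_enter();               /* pin to this CPU; no preemption */
        foo_allocs[curcpu]++;           /* safe: we own this CPU's slot */
        critical_exit();
    }

Readers of such counters (for example, a sysctl handler) take no lock at
all, which is why coalesced statistics can be slightly inconsistent, as
noted below.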
>> The reason this is interesting is that it allows synchronization of
>> per-CPU data to be performed at a much lower cost than previously, and
>> consistently across UP and SMP systems.  Prior to these changes, the
>> use of critical sections and per-CPU data as an alternative to mutexes
>> would lead to an improvement on SMP, but not on UP.  So, that said,
>> here's what I'd like us to look at:
>>
>> - Patches (1) and (2) are intended to improve performance by reducing
>>   the overhead of maintaining cache consistency and statistics for UMA
>>   and malloc(9), and may universally impact performance (in a small
>>   way) due to the breadth of their use through the kernel.
>>
>> - Patch (3) is intended to restore consistency to statistics in the
>>   presence of SMP and preemption, at the possible cost of some
>>   performance.
>>
>> I'd like to confirm that for the first two patches, for interesting
>> workloads, performance generally improves, and that stability doesn't
>> degrade.  For the third patch, I'd like to quantify the cost of the
>> changes for interesting workloads, and likewise confirm no loss of
>> stability.
>>
>> Because these will have a relatively small impact, a fair amount of
>> caution is required in testing.  We may be talking about a percent or
>> two, maybe four, difference in benchmark performance, and many
>> benchmarks have a higher variance than that.
>>
>> A couple of observations for those interested:
>>
>> - The INVARIANTS panic with UMA seen in some earlier patch versions is
>>   believed to be corrected.
>>
>> - Right now, because I use arrays of foo[MAXCPUS], I'm concerned that
>>   different CPUs will be writing to the same cache line, as they're
>>   adjacent in memory.  Moving to per-CPU chunks of memory to hold this
>>   stuff is desirable, but I think first we need to identify a model by
>>   which to do that cleanly.  I'm not currently enamored of the 'struct
>>   pcpu' model, since it makes us very sensitive to ABI changes, as
>>   well as not offering a model by which modules can register new
>>   per-CPU data cleanly.  I'm also inconsistent about how I dereference
>>   into the arrays, and intend to move to using 'curcpu' throughout.
>>   [A padded per-CPU layout is sketched below, after the quoted text.]
>>
>> - Because mutexes are no longer used in UMA, and not for the others
>>   either, stats read across different CPUs that are coalesced may be
>>   slightly inconsistent.  I'm not all that concerned about it, but
>>   it's worth thinking on.
>>
>> - Malloc stats for realloc() are still broken if you apply this patch.
>>
>> - High watermarks are no longer maintained for malloc, since they
>>   require a global notion of "high" that is tracked continuously
>>   (i.e., at each change), and there's no longer a global view except
>>   when the observer kicks in (sysctl).  You can imagine various models
>>   to restore some notion of a high watermark, but I'm not currently
>>   sure which is the best.  The high watermark notion is desirable,
>>   though.
>>
>> So this is a request for:
>>
>> (1) Stability testing of these patches.  Put them on a machine, make
>>     them hurt.  If things go south, try applying the patches one by
>>     one until it's clear which is the source.
>>
>> (2) Performance testing of these patches.  Subject to the challenges
>>     in testing them.  If you are interested, please test each patch
>>     separately to evaluate its impact on your system.  Then apply all
>>     together and see how it evens out.  You may find that the mbuf
>>     allocator patch outweighs the benefits of the other two patches;
>>     if so, that is interesting and something to work on!
>>
>> I've done some micro-benchmarking using tools like netblast,
>> syscall_timing, etc., but I'm interested particularly in the impact on
>> macrobenchmarks.
>>
>> Thanks!
>>
>> Robert N M Watson
>
> _______________________________________________
> freebsd-performance@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-performance
> To unsubscribe, send any mail to
> "freebsd-performance-unsubscribe@freebsd.org"
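To illustrate the two concerns above (adjacent CPUs writing to the same
cache line in a foo[MAXCPUS]-style array, and coalescing per-CPU values
only at read time), here is a hypothetical sketch.  The structure, the
"foo" names, and the assumed 128-byte line size are illustrative only and
do not appear in the patches.

    /*
     * Hypothetical sketch: pad each CPU's statistics out to a cache
     * line so adjacent CPUs do not write to the same line, and sum
     * the per-CPU values only when an observer (e.g., sysctl) asks.
     */
    #include <sys/param.h>

    #define FOO_CACHE_LINE  128         /* assumed cache line size */

    struct foo_pcpu {
        u_long  fp_allocs;              /* allocations on this CPU */
        u_long  fp_frees;               /* frees on this CPU */
    } __aligned(FOO_CACHE_LINE);        /* one CPU per cache line */

    static struct foo_pcpu foo_stats[MAXCPU];

    /*
     * Coalesce at read time, without locking: the total may be
     * slightly stale while other CPUs update their slots, which is
     * the consistency trade-off described in the message above.
     */
    static u_long
    foo_total_allocs(void)
    {
        u_long total;
        int cpu;

        total = 0;
        for (cpu = 0; cpu < MAXCPU; cpu++)
            total += foo_stats[cpu].fp_allocs;
        return (total);
    }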