Date: Tue, 26 Jun 2001 00:12:43 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Matt Dillon <dillon@earth.backplane.com>
Cc: Peter Wemm <peter@wemm.org>, Mikhail Teterin <mi@aldan.algebra.com>, jlemon@FreeBSD.org, cvs-committers@FreeBSD.org, cvs-all@FreeBSD.org
Subject: Re: kernel size w/ optimized bzero() & patch set (was Re: Inline optimized bzero (was Re: cvs commit: src/sys/netinettcp_subr.c))
Message-ID: <Pine.BSF.4.21.0106252337370.7918-100000@besplex.bde.org>
In-Reply-To: <200106250134.f5P1YsN01440@earth.backplane.com>

On Sun, 24 Jun 2001, Matt Dillon wrote:

[Peter Wemm wrote]
> :Just think..  This new ``improved'' bzero code can now fill up all 4K of L1
> :instruction cache on most of my systems, and most of my 8K L1 instruction
> :cache on >= coppermine cpus.  I'm impressed.  Those microbenchmarks had
>
>     Huh?  Peter, you obviously haven't been listening.  I strongly recommend
>     that you review the last few postings I've made.  The suggested bzero
>     code certainly does NOT in any way blow up the L1 cache, and I think
>     I'm pretty clear on that.  I wouldn't be doing it if it did.

It was an intermediate version that blew up the cache.

I have been trying slightly different versions, and found that gcc's
builtin version doesn't make all that much difference in the code size,
either up or down.  With the following version of bzero:

#define bzero(p, n) ({						\
	if (__builtin_constant_p(n) && (n) <= X)		\
		__builtin_memset((p), 0, (n));			\
	else							\
		(bzero)((p), (n));				\
})

for X = 0, 4, 8, 12, 16, 20, 32 and "infinity", the kernel sizes were:

   text    data     bss     dec     hex filename
1962434  151436  349824 2463694  2597ce kernel.4
1962442  151436  349824 2463702  2597d6 kernel.8
1962446  151436  349824 2463706  2597da kernel.12
1962466  151436  349824 2463726  2597ee kernel.0
1962802  151436  349824 2464062  25993e kernel.16
1962866  151436  349824 2464126  25997e kernel.20
1963538  151436  349824 2464798  259c1e kernel.32
1964098  151436  349824 2465358  259e4e kernel.infinity

Summary: it's hard for the inline version to be smaller; even when it
only needs to do one store-immediate operation (kernel.4), the kernel is
only 32 bytes smaller than the one using function calls (kernel.0), which
have to push 2 args, do the call, and clean up.  This is presumably due to
increased register pressure for the inlined versions.

OTOH, the recent uninlining of the mbuf macros somehow reduced the size
of my standard kernel by more than 5% (more than 100K).  It also reduced
the compilation time by more than 10%.
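
For readers who want to see which path a given call takes, here is a minimal userland sketch of the same dispatch. It is not the kernel code: BZERO, bzero_call, INLINE_MAX and bzero_demo are hypothetical names invented for the illustration, INLINE_MAX plays the role of X, and a do/while wrapper replaces the statement expression. It assumes a GCC-compatible compiler, since it relies on __builtin_constant_p and __builtin_memset.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical threshold, playing the role of X in the macro above. */
#define INLINE_MAX 16

/* Counts how often the out-of-line path is taken (illustration only). */
static unsigned long out_of_line_calls;

/* Hypothetical stand-in for the kernel's out-of-line bzero(). */
static void
bzero_call(void *p, size_t n)
{
	out_of_line_calls++;
	memset(p, 0, n);
}

/*
 * Same shape as the macro in the message: small compile-time-constant
 * sizes go to the compiler builtin, everything else is a function call.
 */
#define BZERO(p, n) do {					\
	if (__builtin_constant_p(n) && (n) <= INLINE_MAX)	\
		__builtin_memset((p), 0, (n));			\
	else							\
		bzero_call((p), (n));				\
} while (0)

/* Zeroes three buffers and returns how many went through bzero_call(). */
unsigned long
bzero_demo(void)
{
	char small[8], large[64], dyn[8];
	/* volatile, so the size is not a compile-time constant */
	volatile size_t n = sizeof(dyn);

	memset(small, 0xff, sizeof(small));
	memset(large, 0xff, sizeof(large));
	memset(dyn, 0xff, sizeof(dyn));

	out_of_line_calls = 0;
	BZERO(small, sizeof(small));	/* constant, 8 <= 16: builtin path */
	BZERO(large, sizeof(large));	/* constant, 64 > 16: function call */
	BZERO(dyn, n);			/* not constant: function call */

	/* All three buffers must be zeroed whichever path was taken. */
	assert(small[7] == 0 && large[63] == 0 && dyn[7] == 0);
	return (out_of_line_calls);
}
```

Calling bzero_demo() should report two out-of-line calls: only the small constant-size case takes the builtin path. Whether that path then expands to a single store or to something larger is up to the compiler and its flags, which is the code-size effect the table above measures.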
Kernel compilation times are still 65% larger than in RELENG_3 for
kernels with essentially the same options (this is using -current's
compiler; they are 85% larger using RELENG_3's compiler).

> :better be damn good, because it may end up the only thing that the system
> :will do well now since all this excessive inlining looks like it is blowing
> :the L1 cache out the door.
> :
> :(I also apply the same complaint to the vm/* inlines).
>
>     And you are just as wrong.  The few functions inlined in vm/* are inlined
>     mainly because (A) they are called with constant arguments, which means

Some seem to have rotted a bit.  E.g., _vm_map_lock_upgrade() (adding an
mtx_lock() to anything will bloat it in both space and time).

Bruce

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe cvs-all" in the body of the message