Date: Sun, 24 Jun 2001 13:25:19 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Matt Dillon <dillon@earth.backplane.com>
Cc: Mikhail Teterin <mi@aldan.algebra.com>, jlemon@FreeBSD.org,
    cvs-committers@FreeBSD.org, cvs-all@FreeBSD.org
Subject: Re: Inline optimized bzero (was Re: cvs commit: src/sys/netinet tcp_subr.c)
Message-ID: <Pine.BSF.4.21.0106241223450.52918-100000@besplex.bde.org>
In-Reply-To: <200106232102.f5NL2fY73920@earth.backplane.com>
On Sat, 23 Jun 2001, Matt Dillon wrote:

> I would propose adding a new kernel bzero() function, called bzerol(),
> which is an inline integer-aligned implementation.

I don't think this would be very useful, but if it exists then it should
be called bzero().  We've already made the mistake of having 2 functions
for bcopy() (callers are supposed to use memcpy() for non-overlapping
copies with small constant sizes and bcopy() for all other cases, but
many callers aren't disciplined enough to do this).

> /*
>  * bzerol() - aligned bzero.  The buffer must be integer aligned and sized.
>  *
>  * This routine should only be called with constant sizes, so GCC can
>  * optimize it.  This routine typically optimizes down to just a few
>  * instructions.
>  */
> static __inline void
> bzerol(void *s, int bytes)
> {
>         assert((bytes & (sizeof(int) - 1)) == 0);
>
>         switch(bytes) {
>         case sizeof(int) * 5:
>                 *((int *)s + 4) = 0;
>                 /* fall through */
>         case sizeof(int) * 4:
>                 *((int *)s + 3) = 0;
>                 /* fall through */
>         case sizeof(int) * 3:
>                 *((int *)s + 2) = 0;
>                 /* fall through */
>         case sizeof(int) * 2:
>                 *((int *)s + 1) = 0;
>                 /* fall through */
>         case sizeof(int) * 1:
>                 *(int *)s = 0;
>                 /* fall through */
>         case 0:
>                 return;
>         default:
>                 if (bytes >= sizeof(int) * 8) {
>                         while (bytes >= sizeof(int) * 4) {
>                                 *(int *)((char *)s + 0 * sizeof(int)) = 0;
>                                 *(int *)((char *)s + 1 * sizeof(int)) = 0;
>                                 *(int *)((char *)s + 2 * sizeof(int)) = 0;
>                                 *(int *)((char *)s + 3 * sizeof(int)) = 0;
>                                 s = (char *)s + sizeof(int) * 4;
>                                 bytes -= sizeof(int) * 4;
>                         }
>                 }
>                 while (bytes > 0) {
>                         bytes -= sizeof(int);
>                         *(int *)((char *)s + bytes) = 0;
>                 }
>         }
> }

I just found that gcc already has essentially this optimization, at
least on i386's, provided bzero() is spelled using memset() (I thought
that gcc only had the corresponding optimization for memcpy()).
"memset(p, 0, n)" generates stores of 0 for n <= 16 ("movl $0, addr"
if n is a multiple of 4).  For n >= 17, and for certain n < 16, it
generates not-so-optimal inline code using stos[bwl].  This is a
significant pessimization if n is very large and the library bzero is
significantly optimized (e.g., if the library bzero is i586_bzero).

To use the builtin memset except for the parts of it that we don't
like, I suggest using code like:

#if defined(__GNUC__) && defined(_HAVE_GOOD_BUILTIN_MEMSET)
#define bzero(p, n) do {                                                \
        if (__builtin_constant_p(n) && (n) < LARGE_MD_VALUE &&         \
            !__any_other_cases_that_we_dont_like(n))                    \
                __builtin_memset((p), 0, (n));                          \
        else                                                            \
                (bzero)((p), (n));                                      \
} while (0)
#endif

Similarly for bcopy()/memcpy() (the condition for not liking
__builtin_memcpy() is currently `if (1)').

Many bzero()s are now done in malloc(), so the above optimizations are
even less useful than they used to be :-).

Bruce
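A quick userland sanity check of the bzerol() switch quoted above.  This
is a sketch only: it must be compiled together with the bzerol()
definition from the message (paste it in place of the comment below),
assert() here is the libc one standing in for the kernel's, and the 0xff
guard bytes exist only to catch overruns and missed words:

/*
 * Userland sanity check for bzerol().  Paste the bzerol() definition
 * from the message where indicated before compiling.
 */
#include <assert.h>
#include <string.h>

/* ... the bzerol() definition from the message goes here ... */

static void
check(int bytes)
{
        int ibuf[80];                   /* int-aligned backing store */
        char *buf = (char *)ibuf;
        int i;

        memset(buf, 0xff, sizeof(ibuf));
        bzerol(buf + sizeof(int), bytes);
        for (i = 0; i < (int)sizeof(int); i++)
                assert(buf[i] == (char)0xff);           /* leading guard */
        for (i = 0; i < bytes; i++)
                assert(buf[sizeof(int) + i] == 0);      /* zeroed region */
        assert(buf[sizeof(int) + bytes] == (char)0xff); /* trailing guard */
}

int
main(void)
{
        check(0);
        check(sizeof(int) * 1);
        check(sizeof(int) * 5);         /* the largest unrolled case */
        check(sizeof(int) * 6);         /* falls into the default case */
        check(sizeof(int) * 32);        /* exercises the 4-word loop */
        return (0);
}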
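The builtin memset() behaviour described above can be inspected
directly: compile something like the following with "cc -O -S" and read
the .s output.  The function and struct names here are mine, and the
exact code generated depends on the gcc version and target:

/*
 * For the small constant size gcc emits discrete "movl $0, ..."
 * stores; for the large one it emits stos[bwl]/rep code or a call to
 * memset(), depending on the compiler version and target.
 */
#include <string.h>

struct hdr { int f[3]; };               /* 12 bytes: a multiple of 4 */

void
clear_small(struct hdr *h)
{
        memset(h, 0, sizeof(*h));       /* n <= 16: inlined stores */
}

void
clear_big(char *buf)
{
        memset(buf, 0, 4096);           /* n >= 17: stos/rep or a call */
}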
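And a self-contained sketch of the suggested bzero() macro, with the
placeholders filled in purely for illustration: the 64-byte cutoff and
the always-false __any_other_cases_that_we_dont_like() are assumptions,
not real definitions, and LARGE_MD_VALUE would really be
machine-dependent:

#include <stdio.h>
#include <string.h>
#include <strings.h>    /* the library bzero() */

#define LARGE_MD_VALUE  64      /* assumed machine-dependent cutoff */
#define __any_other_cases_that_we_dont_like(n)  0 /* assumed: none */

#if defined(__GNUC__)
#define bzero(p, n) do {                                                \
        if (__builtin_constant_p(n) && (n) < LARGE_MD_VALUE &&         \
            !__any_other_cases_that_we_dont_like(n))                    \
                __builtin_memset((p), 0, (n));  /* inline stores */     \
        else                                                            \
                (bzero)((p), (n));              /* library bzero() */   \
} while (0)
#endif

int
main(void)
{
        char small[16], big[4096];

        memset(small, 0xff, sizeof(small));
        memset(big, 0xff, sizeof(big));
        bzero(small, sizeof(small));    /* small constant: inlined */
        bzero(big, sizeof(big));       /* large constant: library call */
        printf("%d %d\n", small[0], big[sizeof(big) - 1]); /* expect: 0 0 */
        return (0);
}

The parenthesized (bzero) in the else branch keeps the function-like
macro from expanding recursively, so non-constant and large sizes still
reach the (possibly optimized) library bzero().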