Date: Mon, 27 May 2013 11:26:14 -0700 From: Tim Kientzle <kientzle@freebsd.org> To: Lee Thomas <lee_thomas@aslantools.com> Cc: freebsd-hackers@freebsd.org Subject: Re: Performance improvement to strnlen(). Message-ID: <E8F8611E-2E53-4B89-A56F-A7A014E93995@freebsd.org> In-Reply-To: <afa77dcc2e1cb351cfaded708acbdae0@lthomas.net> References: <afa77dcc2e1cb351cfaded708acbdae0@lthomas.net>
next in thread | previous in thread | raw e-mail | index | archive | help
>=20 > Index: strnlen.c > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > diff --git a/head/lib/libc/string/strnlen.c = b/head/lib/libc/string/strnlen.c > --- a/head/lib/libc/string/strnlen.c (revision 250951) > +++ b/head/lib/libc/string/strnlen.c (working copy) > @@ -1,5 +1,6 @@ > /*- > - * Copyright (c) 2009 David Schultz <das@FreeBSD.org> > + * Copyright (c) 2009, 2010 Xin LI <delphij@FreeBSD.org> > + * Copyright (c) 2013 Lee Thomas <lee_thomas@AslanTools.com> > * All rights reserved. > * > * Redistribution and use in source and binary forms, with or without > @@ -27,16 +28,91 @@ > #include <sys/cdefs.h> > __FBSDID("$FreeBSD$"); > =20 > +#include <sys/limits.h> > +#include <sys/types.h> > #include <string.h> > =20 > +/* > + * Portable strnlen() for 32-bit and 64-bit systems. > + * > + * Rationale: it is generally much more efficient to do word length > + * operations and avoid branches on modern computer systems, as > + * compared to byte-length operations with a lot of branches. > + * > + * The expression: > + * > + * ((x - 0x01....01) & ~x & 0x80....80) > + * > + * would evaluate to a non-zero value iff any of the bytes in the > + * original word is zero. > + * > + * On multi-issue processors, we can divide the above expression = into: > + * a) (x - 0x01....01) > + * b) (~x & 0x80....80) > + * c) a & b > + * > + * Where, a) and b) can be partially computed in parallel. > + * > + * The algorithm above is found on "Hacker's Delight" by > + * Henry S. Warren, Jr. > + */ > + > +/* Magic numbers for the algorithm */ > +#if LONG_BIT =3D=3D 32 > +static const unsigned long mask01 =3D 0x01010101; > +static const unsigned long mask80 =3D 0x80808080; > +#elif LONG_BIT =3D=3D 64 > +static const unsigned long mask01 =3D 0x0101010101010101; > +static const unsigned long mask80 =3D 0x8080808080808080; > +#else > +#error Unsupported word size > +#endif > + > +#define LONGPTR_MASK (sizeof(long) - 1) I would not use this at all; (sizeof(long) - 1) is a common expression that your audience probably understands more quickly than "LONGPTR_MASK". > + > size_t > -strnlen(const char *s, size_t maxlen) > +strnlen(const char *str, size_t maxlen) > { > - size_t len; > + const char *stop, *short_stop; > + const char *p; > + const unsigned long *lp; > + long va, vb; > =20 > - for (len =3D 0; len < maxlen; len++, s++) { > - if (!*s) > - break; > + if (maxlen=3D=3D0) return 0; > + > + stop=3Dstr+maxlen; > + /* > + * Before trying the hard (unaligned byte-by-byte access) way > + * to figure out whether there is a nul character, try to see > + * if there is a nul character is within this accessible word > + * first. > + * > + * p and (p & ~LONGPTR_MASK) must be equally accessible since > + * they always fall in the same memory page, as long as page > + * boundaries is integral multiple of word size. > + */ > + lp =3D (const unsigned long *)((uintptr_t)str & ~LONGPTR_MASK); > + lp++; Have you tested to see whether the extra work you're doing for the initial segment really gains you much? I would have used a simpler byte-by-byte check as follows: p =3D s; while (maxlen-- > 0 && p < (const char *)lp) { if (*p =3D=3D '\0') return (p - s); } while (maxlen > 0) { =85 complicated test of *lp =85 maxlen -=3D sizeof(*lp); } Few points here: * This form of the initial segment test doesn't get surprised by a NUL byte just before the string. Yours does a bunch of extra work in that case. * Duff's device might help unroll the first loop; would require testing to see if it was faster. For something this simple, it might not be. * Your version tests the first word twice (once in the initial check and then again at the start of the word-sized pass). This version doesn't. * Comparison with zero is often free, so a countdown to zero is often faster than a count up to a computed limit. This assumes that 'lp' is pointing to the first aligned word at or after the beginning of the string. Your version does not leave lp pointing to the beginning of the string when the string is initially already aligned. If the string is already fully aligned, my version avoids the initial check entirely. To compute 'lp' as a pointer to the first full word at or after the beginning of the string: lp =3D (const unsigned long *) ( ( (uintptr_t)str + sizeof(*lp) - 1 ) & ( sizeof(*lp) - 1 ) ); I've broken this onto multiple lines to illustrate the structure; the final code would of course be a little more compact. Also note the use of sizeof(*lp) rather than sizeof(long) or sizeof(unsigned long); this way, you guarantee that the sizeof() matches lp even if someone comes along later and changes the declaration of lp. > + va =3D (*lp - mask01); > + vb =3D ((~*lp) & mask80); > + if (va & vb) { > + /* Check if we have \0 in the first part */ > + short_stop=3D(const char *)lp; > + if (stop<short_stop) short_stop=3Dstop; > + for (p=3Dstr; p !=3D short_stop; p++) > + if (*p =3D=3D '\0') > + return (p-str); > } > - return (len); > + /* Scan the rest of the string using word sized operation */ > + for (; (const char *)lp < stop; lp++) { > + va =3D (*lp - mask01); > + vb =3D ((~*lp) & mask80); > + if (va & vb) { I don't think the extra variables are helpful. Just write it directly: if ((*lp - mask01) & ~*lp & mask80) { That matches the comment you put at the beginning. > + for (p=3D(const char *)lp; p !=3D stop; p++) > + if (*p =3D=3D '\0') > + break; > + return (p-str); > + } > + } > + return (maxlen); > } You should, of course, compare my suggestions above against your version to see which is faster. (Preferably run such tests with both clang and gcc, on at least two architectures, and with a couple of different optimization flags.) Tim
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?E8F8611E-2E53-4B89-A56F-A7A014E93995>