Date: Mon, 27 May 2013 16:01:51 -0400
From: Lee Thomas <lee_thomas@aslantools.com>
To: <freebsd-hackers@freebsd.org>
Subject: Re: Performance improvement to strnlen().
Message-ID: <361355916ec746edbf91c3f258f2416e@lthomas.net>
In-Reply-To: <E8F8611E-2E53-4B89-A56F-A7A014E93995@freebsd.org>
References: <afa77dcc2e1cb351cfaded708acbdae0@lthomas.net> <E8F8611E-2E53-4B89-A56F-A7A014E93995@freebsd.org>
Hello Tim,

Thank you for the review; replies inline. Note that all of my performance
discussion is for amd64, with a few tests on x86, and it's all on the
machine described in my initial email (a Lynnfield Xeon), as I don't have
any other FreeBSD platform to test on. Testing and performance
measurement on other platforms would be appreciated.

Thanks,
Lee

On 2013-05-27 14:26, Tim Kientzle wrote:
>>
>> Index: strnlen.c
>> ===================================================================
>> diff --git a/head/lib/libc/string/strnlen.c b/head/lib/libc/string/strnlen.c
>> --- a/head/lib/libc/string/strnlen.c (revision 250951)
>> +++ b/head/lib/libc/string/strnlen.c (working copy)
>> @@ -1,5 +1,6 @@
>>  /*-
>> - * Copyright (c) 2009 David Schultz <das@FreeBSD.org>
>> + * Copyright (c) 2009, 2010 Xin LI <delphij@FreeBSD.org>
>> + * Copyright (c) 2013 Lee Thomas <lee_thomas@AslanTools.com>
>>   * All rights reserved.
>>   *
>>   * Redistribution and use in source and binary forms, with or without
>> @@ -27,16 +28,91 @@
>>  #include <sys/cdefs.h>
>>  __FBSDID("$FreeBSD$");
>>
>> +#include <sys/limits.h>
>> +#include <sys/types.h>
>>  #include <string.h>
>>
>> +/*
>> + * Portable strnlen() for 32-bit and 64-bit systems.
>> + *
>> + * Rationale: it is generally much more efficient to do word-length
>> + * operations and avoid branches on modern computer systems, as
>> + * compared to byte-length operations with a lot of branches.
>> + *
>> + * The expression:
>> + *
>> + *	((x - 0x01....01) & ~x & 0x80....80)
>> + *
>> + * evaluates to a non-zero value iff any of the bytes in the
>> + * original word is zero.
>> + *
>> + * On multi-issue processors, we can divide the above expression into:
>> + *	a) (x - 0x01....01)
>> + *	b) (~x & 0x80....80)
>> + *	c) a & b
>> + *
>> + * where a) and b) can be partially computed in parallel.
>> + *
>> + * The algorithm above is found in "Hacker's Delight" by
>> + * Henry S. Warren, Jr.
>> + */
>> +
>> +/* Magic numbers for the algorithm */
>> +#if LONG_BIT == 32
>> +static const unsigned long mask01 = 0x01010101;
>> +static const unsigned long mask80 = 0x80808080;
>> +#elif LONG_BIT == 64
>> +static const unsigned long mask01 = 0x0101010101010101;
>> +static const unsigned long mask80 = 0x8080808080808080;
>> +#else
>> +#error Unsupported word size
>> +#endif
>> +
>> +#define LONGPTR_MASK (sizeof(long) - 1)
>
> I would not use this at all; (sizeof(long) - 1) is a common
> expression that your audience probably understands
> more quickly than "LONGPTR_MASK".
>

This is simply copy-pasted from our strlen implementation; I wanted to
keep it as close to that implementation as possible rather than rewrite
the whole thing. I could refactor the common pieces out of both, though
that would require a shared implementation header (the style guidelines
for which I don't know), and I'm not sure it would benefit code
legibility anyway. Would it be worth cleaning strnlen.c up, at the cost
of an increased diff with strlen.c? Maybe both should be cleaned up in
the same ways? I'm willing to do any of these if you think it would be
worth the code churn.

>> +
>>  size_t
>> -strnlen(const char *s, size_t maxlen)
>> +strnlen(const char *str, size_t maxlen)
>>  {
>> -	size_t len;
>> +	const char *stop, *short_stop;
>> +	const char *p;
>> +	const unsigned long *lp;
>> +	long va, vb;
>>
>> -	for (len = 0; len < maxlen; len++, s++) {
>> -		if (!*s)
>> -			break;
>> +	if (maxlen==0) return 0;
>> +
>> +	stop=str+maxlen;
>> +	/*
>> +	 * Before trying the hard (unaligned byte-by-byte access) way
>> +	 * to figure out whether there is a nul character, try to see
>> +	 * if there is a nul character within this accessible word
>> +	 * first.
>> +	 *
>> +	 * p and (p & ~LONGPTR_MASK) must be equally accessible since
>> +	 * they always fall in the same memory page, as long as page
>> +	 * boundaries are integral multiples of the word size.
>> +	 */
>> +	lp = (const unsigned long *)((uintptr_t)str & ~LONGPTR_MASK);
>> +	lp++;
>
> Have you tested to see whether the extra work you're
> doing for the initial segment really gains you much? I
> would have used a simpler byte-by-byte check as
> follows:

Yes, it is worth it, with a pretty good margin for small strings, even
under pretty generous assumptions about how often a zero appears in the
data just before the start of the string. I tried various things, and
this is the fastest I found.

>
>     p = s;
>     while (maxlen-- > 0 && p < (const char *)lp) {
>         if (*p == '\0')
>             return (p - s);
>         p++;
>     }
>
>     while (maxlen > 0) {
>         … complicated test of *lp …
>         maxlen -= sizeof(*lp);
>     }
>
> A few points here:
>
> * This form of the initial-segment test doesn't get surprised
>   by a NUL byte just before the string. Yours does a bunch of
>   extra work in that case.
>
> * Duff's device might help unroll the first loop; that would
>   require testing to see if it was faster. For something this
>   simple, it might not be.
>
> * Your version tests the first word twice (once in the
>   initial check and then again at the start of the word-sized
>   pass). This version doesn't.

Actually, it doesn't. Note the 'lp++' on line 97, after computing va and
vb, but before the if. It's before the if because it's used in computing
short_stop. I tried it after the if, because it would be cleaner there
(as your missing it clearly shows), but one of either gcc or clang, I
forget which, didn't manage the common-subexpression elimination with
the computation inside the if (va & vb) {} block, leading to a slowdown
in the case where one of the junk bytes before the string is zero.

> * Comparison with zero is often free, so a countdown to zero
>   is often faster than a count up to a computed limit.
>
> This assumes that 'lp' is pointing to the first aligned word
> at or after the beginning of the string.
> Your version does not leave lp pointing to
> the beginning of the string when the string is initially
> already aligned.
If the string is already fully aligned, my
> version avoids the initial check entirely.
>
> To compute 'lp' as a pointer to the first full word
> at or after the beginning of the string:
>
>     lp = (const unsigned long *)
>         (
>             (
>                 (uintptr_t)str + sizeof(*lp) - 1
>             )
>             &
>             ~(
>                 sizeof(*lp) - 1
>             )
>         );
>
> I've broken this onto multiple lines to illustrate the structure;
> the final code would of course be a little more compact.
>
> Also note the use of sizeof(*lp) rather than
> sizeof(long) or sizeof(unsigned long); this way,
> you guarantee that the sizeof() matches lp even if someone
> comes along later and changes the declaration of lp.

>> +	va = (*lp - mask01);
>> +	vb = ((~*lp) & mask80);
>> +	if (va & vb) {
>> +		/* Check if we have \0 in the first part */
>> +		short_stop=(const char *)lp;
>> +		if (stop<short_stop) short_stop=stop;
>> +		for (p=str; p != short_stop; p++)
>> +			if (*p == '\0')
>> +				return (p-str);
>>  	}
>> -	return (len);
>> +	/* Scan the rest of the string using word-sized operations */
>> +	for (; (const char *)lp < stop; lp++) {
>> +		va = (*lp - mask01);
>> +		vb = ((~*lp) & mask80);
>> +		if (va & vb) {
>
> I don't think the extra variables are helpful. Just write it
> directly:
>
>     if ((*lp - mask01) & ~*lp & mask80) {
>
> That matches the comment you put at the beginning.
>

Again, this is directly from the strlen code. As before, I'm willing to
rewrite one or both if you think it'd be worth it.

>> +			for (p=(const char *)lp; p != stop; p++)
>> +				if (*p == '\0')
>> +					break;
>> +			return (p-str);
>> +		}
>> +	}
>> +	return (maxlen);
>>  }
>
> You should, of course, compare my suggestions above
> against your version to see which is faster. (Preferably run
> such tests with both clang and gcc, on at least two
> architectures, and with a couple of different optimization
> flags.)
>
> Tim

I think my email did say a bit about my test methods, but I'll go into a
little more detail. Let me know if you want the actual code, but be
aware it's ugly-as-heck C++.
I tested with both the gcc and the clang in stable/9, though I think I
forgot to mention in my email the optimization flags I used, which were
"-O2 -fno-strict-aliasing", "-O2", and "-O3". None of them seemed to
make much difference with this code, and gcc and clang both performed
about equivalently.

I tried many variants, testing them against an artificial dataset of
randomly generated strings, 50% with lengths between 0 and 20 and 50%
with lengths between 21 and 1000, where the max_len value was
effectively infinity (assuming that most of the time the string fits in
the buffer). The dataset used randomly generated overread bytes on both
sides of the string, which were 50% NUL and 50% anything else. I used
ministat for comparison, which, by the way, is the best tool I have
ever encountered; we use it for 1000 things I had always used a
spreadsheet for in the past.

The code as posted is the fastest I could find. For instance, the
strlen code unrolls the main loop; this is faster for strlen, but the
extra length check makes the non-unrolled variant faster for strnlen. I
think further significant performance improvements are likely to come
only through machine-dependent assembly versions, but I'm not going to
write or maintain those :-). Linux is a little faster for ASCII (01-7F)
strings, but quite a bit slower for general (01-FF) strings. I could do
that, but FreeBSD had such a variant several years ago and moved away
from it, which I personally think was wise.

Thanks again for the code review,
Lee