Date: Thu, 26 Sep 2019 01:46:14 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Ed Maste <emaste@freebsd.org>
Cc: Bruce Evans <brde@optusnet.com.au>, src-committers <src-committers@freebsd.org>, svn-src-all <svn-src-all@freebsd.org>, svn-src-head <svn-src-head@freebsd.org>
Subject: Re: svn commit: r351700 - head/lib/libc/string
Message-ID: <20190926002410.M1471@besplex.bde.org>
In-Reply-To: <CAPyFy2B+7j6qthxc37PX0Yq_s=KNWjJ1L3=y0=iPX2eMVAcjQQ@mail.gmail.com>
References: <201909021356.x82Dui6v077765@repo.freebsd.org> <20190920212702.N1017@besplex.bde.org> <CAPyFy2B+7j6qthxc37PX0Yq_s=KNWjJ1L3=y0=iPX2eMVAcjQQ@mail.gmail.com>
On Wed, 25 Sep 2019, Ed Maste wrote:

> On Fri, 20 Sep 2019 at 08:14, Bruce Evans <brde@optusnet.com.au> wrote:
>>
>> Optimizing this function [memchr] is especially unimportant,
>
> Why?

Because it is almost never used, and most of its uses are unimportant.
It is used a whole 26 times in all of /usr/src/*bin:
- sh/exec.c: 1 especially unimportant use after execve() fails
- dmesg/dmesg.c: 2 uses; speed unimportant in utilities like dmesg that
  parse small data
- sort/sort.c: 2 uses for reading lines; speed possibly important
- sed/process.c: 2 uses in options parsing; speed unimportant
- grep/file.c: 3 uses for reading lines; speed possibly important, but
  if you want speed in a utility that parses large data then you should
  use a special lexer and not general functions even to find newlines
- gcore/elfcore.c: 1 use; speed unimportant
- diff/diffreg.c: 1 use for detecting binaries; speed probably unimportant
- mkimg/mkimg.c: 2 uses for writes; speed unimportant since i/o-bound
- truss/syscalls.c: 1 use; speed unimportant
- indent/pr_comment.c: 1 use; speed unimportant
- ppp/*.c: 4 uses; speed unimportant
- timed/readmsg.c: 1 use; speed unimportant
- inetd/inetd.c: 1 use; speed unimportant
- jail/config.c: 1 use; speed unimportant
- daemon/daemon.c: 1 use; speed unimportant
- bhyve/gdb.c: 2 uses; speed unimportant

> Really, we should provide optimized assembly implementations of string
> functions deemed important, but it will take time for that to happen
> on all architectures.

Really, we should provide optimized (usually non-assembly)
implementations of no more than copying memory, and not deem any other
string function as important (since none are).  Even optimizing copying
memory gives speed improvements of less than 1% in most programs.
Assembly implementations are usually pessimal, and become more so with
time.
E.g., if you write memcpy() as something like *dst++ = *src++, then if
the size is known at compile time, clang will optimize to use AVXn in
some cases.  Handling the same number of cases in assembly is
impractical.  My experiments to optimize copying memory only handle
about 30 cases, but I only tried to optimize for large copies and
stopped adding cases 15 years ago when it became clear that too many
cases would be needed.

clang knows that it doesn't understand memory, so it doesn't do
optimizations like this if the size is not known at compile time.  It
calls the library, which in FreeBSD doesn't even know that it doesn't
understand memory.

>> and this version has mounds of style bugs.
>
> Yes, I've wondered about applying style(9) to the handful of imported
> string routines. In general we don't really expect ongoing updates, so
> it's probably no concern from a maintenance perspective.
>
>>> -	do {
>>> -		if (*p++ == (unsigned char)c)
>>> -			return ((void *)(p - 1));
>>> -	} while (--n != 0);
>>
>> Old KNF code.  In the !__GNUC__ #else clause, this is replaced by
>> equivalent code with a style that is very far from KNF.
>
> Functionally equivalent, although I compared the compiled output of
> both cases and what's currently there is somewhat smaller. Note that
> it's not an #else case, the equivalent loop is used in both cases -
> handling the non-word-size residue in the __GNUC__ case.
>
>>> +void *memchr(const void *src, int c, size_t n)
>>> +{
>>> +	const unsigned char *s = src;
>>> +	c = (unsigned char)c;
>>> +#ifdef __GNUC__
>>> +	for (; ((uintptr_t)s & ALIGN) && n && *s != c; s++, n--);
>>> +	if (n && *s != c) {
>>> +		typedef size_t __attribute__((__may_alias__)) word;
>>
>> This seems to have no dependencies on __GNUC__ except the use of
>> anti-aliasing,
>
> Yes, that is the reason.
>
>> I checked that this optimization actually works.  For long strings,
>> it is almost sizeof(size_t) times faster, by accessing memory with a
>> size sizeof(size_t).
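[The word-at-a-time scan under discussion can be sketched as below.
This is my minimal illustration using the classic has-zero-byte bit
trick, not the code actually committed; my_memchr and the macro names
are mine, and a real version wants the __may_alias__ typedef quoted
above rather than the plain cast used here.]

```c
#include <stddef.h>
#include <stdint.h>

#define ONES	((size_t)-1 / 0xff)	/* 0x0101...01 */
#define HIGHS	(ONES * 0x80)		/* 0x8080...80 */
/* Nonzero iff some byte of x is zero (Bit Twiddling Hacks). */
#define HASZERO(x) (((x) - ONES) & ~(x) & HIGHS)

static void *
my_memchr(const void *src, int c, size_t n)
{
	const unsigned char *s = src;
	unsigned char uc = (unsigned char)c;

	/* Check bytes until the pointer is word-aligned. */
	for (; ((uintptr_t)s & (sizeof(size_t) - 1)) && n; s++, n--)
		if (*s == uc)
			return ((void *)(uintptr_t)s);
	if (n >= sizeof(size_t)) {
		size_t mask = ONES * uc;	/* byte c replicated */
		/* Formally an aliasing violation; see __may_alias__. */
		const size_t *w = (const size_t *)s;

		/* A word XORed with mask has a zero byte iff it
		 * contains byte c; scan sizeof(size_t) bytes at a time. */
		for (; n >= sizeof(size_t) && !HASZERO(*w ^ mask);
		    w++, n -= sizeof(size_t))
			;
		s = (const unsigned char *)w;
	}
	/* Residue: locate the exact byte within the last word. */
	for (; n; s++, n--)
		if (*s == uc)
			return ((void *)(uintptr_t)s);
	return (NULL);
}
```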
size_t is a not very good way of hard-coding the word size.

> Indeed - I wouldn't have imported this change if I didn't observe a
> reasonable benefit when trying it out. As for word size it could be
> replaced using LONG_BIT I suppose (as strspn.c, strlen.c), if it's
> getting reworked anyway.

But how do you know that long strings are most common?  For short
strings, it is better to have low overhead.

LONG_BIT is another bad way of encoding the word size.  On arches with
correctly sized longs, long is twice as wide as either a machine
register or the maximum memory access width.  So it is usually twice
the word size if correct.  register_t is closer to being correct.

>> Related silly optimizations:
>> - i386 has memchr() in asm.  This uses scasb, which was last optimal
>>   on the 8088, or perhaps the original i386.  On freefall, it is
>>   several times slower than the naive translation of the naive C code.
>
> I posted a review (https://reviews.freebsd.org/D21785) to remove the
> scasb implementation. I haven't investigated wmemchr.
>
>> - i386 pessimizes several old string functions using scasb.
>
> The only other one I see is in strcat.S, which should probably be
> retired as well. Are there others?

No others.  It also uses cmps[lb] in bcmp.S and memcmp.S.  cmpsl is at
least word-sized, so it is not very slow.  But all string functions are
slow.  Even with ERMS, movs[ql] is too slow to use for memcpy() for
sizes less than about 1K, since it typically has a startup overhead of
about 25 (or is it 15?) cycles, in which time 858 bytes can be copied
if the memory bandwidth is 128 GB/sec and the CPU speed is 4 GHz.
128 GB/sec is a typical L1 cache speed for a not very new x86.  If the
memcpy() size is much larger than 1K, then copying busts L1 and
everything is slower and speeds are more limited by memory than by the
implementation.  *dst++ = *src++ is best if it can keep up with memory,
which is not so easy if (main) memory is fast and memcpy() uses 32-bit
integer accesses.
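[The 858-byte figure is simple arithmetic; a sketch under the stated
assumptions (25-cycle startup, 4 GHz CPU, 128 GB/sec L1 bandwidth,
taken as binary gigabytes).  movs_breakeven_bytes is my name, not a
real API.]

```c
/*
 * Bytes that the copy engine could have moved at full bandwidth
 * during the fixed startup overhead of movs: a copy must be roughly
 * this large before movs can amortize its startup cost.
 */
static double
movs_breakeven_bytes(double startup_cycles, double cpu_hz,
    double bw_bytes_per_sec)
{
	/* startup time in seconds, times sustained bandwidth */
	return (bw_bytes_per_sec * startup_cycles / cpu_hz);
}
```

With 25 cycles at 4 GHz and 128 * 2^30 bytes/sec this gives about 859
bytes, matching the figure above; with a 15-cycle startup it drops to
about 515 bytes.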
>> - i386 also does dubious alignment optimizations
>
> I don't think there's much value in putting a lot of time into i386,
> but am happy to address straightforward issues (such as removing
> pessimal MD implementations). I haven't looked at these alignment
> optimizations.
>
>> - amd64 pessimizes strcpy() by implementing it as a C wrapper for
>>   stpcpy().  The wrapper code is MI, but is only used for amd64.
>
> And it appears the MD one is selected by Makefile .PATH ordering,
> which seems fragile.

I thought that it was by include ordering, but .PATH has to give order
too.

Bruce