Date: Thu, 1 Nov 2007 03:23:56 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Christoph Mallon <christoph.mallon@gmx.de>
Cc: cvs-src@FreeBSD.org, src-committers@FreeBSD.org, "Andrey A. Chernov" <ache@FreeBSD.org>, cvs-all@FreeBSD.org
Subject: Re: cvs commit: src/include _ctype.h
Message-ID: <20071101024451.T4289@delplex.bde.org>
In-Reply-To: <47264710.2000500@gmx.de>
References: <200710272232.l9RMWSbK072082@repoman.freebsd.org> <47264710.2000500@gmx.de>
On Mon, 29 Oct 2007, Christoph Mallon wrote:

> Andrey A. Chernov wrote:
>> ache 2007-10-27 22:32:28 UTC
>>
>> FreeBSD src repository
>>
>> Modified files:
>> include _ctype.h
>> Log:
>> Micro-optimization of prev. commit, change
>> (_c < 0 || _c >= 128) to (_c & ~0x7F)
>>
>> Revision Changes Path
>> 1.33 +1 -1 src/include/_ctype.h
>
> Actually this is rather a micro-pessimisation. Every compiler worth its money
> transforms the range check into single unsigned comparison. The latter test
> on the other hand on x86 gets probably transformed into a test instruction.
> This instruction has no form with sign extended 8bit immediate, but only with
> 32bit immediate. This results in a significantly longer opcode (three bytes
> more) than a single (unsigned)_c > 127, which a sane compiler produces. I
> suspect some RISC machines need one more instruction for the
> "micro-optimised" code, too.
> In theory GCC could transform the _c & ~0x7F back into a (unsigned)_c > 127,
> but it does not do this (the only compiler I found, which does this
> transformation, is LLVM).
> Further IMO it is hard to decipher what _c & ~0x7F is supposed to do.

Indeed. In fact, one of the cleanups/optimizations in rev.1.5 and 1.6 by
ache and me was to get rid of the mask. There was already a check for
_c < 0, so the mask cost even more. The top limit was 256 instead of 128,
so the point about 8bit immediates didn't apply, but I don't know of any
machines where the mask is faster (didn't look hard :-).

OTOH, _c is often a char or a u_char (it is declared as mumble_rune_t, but
the functions are inline so the compiler can see the original type). If
_c is u_char and u_char is uint8_t, then (_c < 0 || _c >= 256) is always
false, so the compiler should generate no code for it. The top limit of
256 was preferred so that this optimization is possible. A top limit of
128 doesn't work so well.

I would have worried about the 1's complement case.
I think a mask without a check for _c < 0 is plain broken in the 1's
complement case, but this case is too hard to think about -- just do a
range comparison which will always work, and let the compiler reduce it
using 2's complement or 1's complement tricks if possible, but since 1's
complement machines are rare, write the code so that it is easier for the
compiler to optimize in the 2's complement case.

Pipelining might make the old optimizations in ctype uninteresting. Maybe
everything is almost free except for the table lookup (although that is
cached, it will sometimes miss). I haven't timed this lately.

Bruce
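
The three forms of the test discussed in this thread can be compared side by
side. The following is only a minimal sketch, not code from _ctype.h: every
function name in it is hypothetical, and it assumes a 2's complement machine
with 8-bit chars (on a 1's complement machine the mask form is exactly the one
the message above warns about). It checks that the range comparison, the mask,
and the single unsigned comparison agree, and that the 256-limit check can
never fire for a value that started out as an unsigned char.

```c
#include <assert.h>
#include <limits.h>

/* All names below are hypothetical; none of them appear in _ctype.h. */

/* The range comparison, as written before rev. 1.33. */
static inline int out_of_ascii_range(int c)
{
    return c < 0 || c >= 128;
}

/* The mask form from rev. 1.33, normalized to 0 or 1. */
static inline int out_of_ascii_mask(int c)
{
    return (c & ~0x7F) != 0;
}

/* The single unsigned comparison a compiler may reduce the range check to. */
static inline int out_of_ascii_unsigned(int c)
{
    return (unsigned)c > 127;
}

/* The 256-limit check applied to a value that started as an unsigned char. */
static inline int out_of_table_uchar(unsigned char c)
{
    int v = c;                  /* integer promotion: 0..UCHAR_MAX */
    return v < 0 || v >= 256;
}

/* Returns 1 if all three ASCII tests agree for every c in [lo, hi]. */
static inline int forms_agree(int lo, int hi)
{
    for (long long c = lo; c <= hi; c++) {
        int r = out_of_ascii_range((int)c);
        if (out_of_ascii_mask((int)c) != r ||
            out_of_ascii_unsigned((int)c) != r)
            return 0;
    }
    return 1;
}

/* Returns 1 if the 256-limit check is false for every unsigned char value
 * (assumes UCHAR_MAX == 255, i.e. 8-bit chars). */
static inline int uchar_check_never_fires(void)
{
    for (int v = 0; v <= UCHAR_MAX; v++)
        if (out_of_table_uchar((unsigned char)v))
            return 0;
    return 1;
}
```

Calling forms_agree(-100000, 100000) and uchar_check_never_fires() exercises
both claims. Agreement of the results says nothing about which form a given
compiler turns into shorter or faster code; that still has to be read off the
generated assembly or timed.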