Date: Wed, 24 Dec 2003 02:20:21 -0800 (PST)
From: Bruce Evans <bde@zeta.org.au>
To: freebsd-bugs@FreeBSD.org
Subject: Re: ports/47061: Conflicting system headers by build of graphics/cqcam
Message-ID: <200312241020.hBOAKLaw095291@freefall.freebsd.org>

The following reply was made to PR kern/47061; it has been noted by GNATS.

From: Bruce Evans <bde@zeta.org.au>
To: Mark Linimon <linimon@lonesome.com>
Cc: freebsd-gnats-submit@freebsd.org
Subject: Re: ports/47061: Conflicting system headers by build of graphics/cqcam
Date: Wed, 24 Dec 2003 21:14:55 +1100 (EST)

On Tue, 23 Dec 2003, Mark Linimon wrote:

> This is really a kernel problem. I am going to go ahead and commit a
> workaround for this and the one or two other ports with this problem --
> but the workaround is basically unacceptable.

Er, this is really a port[s] problem. <machine/cpufunc.h> is not
intended to be included by applications. There was never any conflict
with <string.h> in the kernel because the kernel never included
<string.h>, and the kernel now avoids bogus conflicts, if any, with
gcc's builtin ffs() using -fno-builtin.

> The underlying problem is that machine/cpufunc.h for i386 has had
> a definition for a machine function 'ffs' for, oh, say, about 9 years
> now. However, man ffs will show you that there is an ffs(3) function
> as well. Even after reading the source it's not clear to me if these
> are supposed to have the same purpose -- someone with a more intimate
> knowledge of i386 arch is going to have to rule for certain.

They are the same. Last time I checked (less than a year ago), the gcc
builtin was still slower than the kernel inline except possibly when the
latter can use non-base-arch instructions like cmov. amd64's always have
cmov and always use the builtin.

... I checked again. With the following slightly too simple test:

%%%
#include <sys/types.h>
#include <machine/cpufunc.h>

int z[4096];

main()
{
	volatile int v;
	int i, j;

	for (i = 0; i < 4096; i++)
		z[i] = 1 << rand();	/* Yes, this is sloppy. */
	for (j = 0; j < 100000; j++)
		for (i = 0; i < 4096; i++)
#ifdef NOBUILTIN
			v = ffs(z[i]);
#else
			v = __builtin_ffs(z[i]);
#endif
}
%%%

Times on an Athlon XP1600 overclocked by 146/133:

    cc -O -mcpu=pentiumpro -o foo foo.c
    (default from bsd.cpu.mk)
        3.49 real 3.47 user 0.00 sys

    cc -O -mcpu=pentiumpro -DNOBUILTIN -o foo foo.c
    (default + kernel ffs())
        3.21 real 3.21 user 0.00 sys

    cc -O -march=pentiumpro -o foo foo.c
    (gives cmov and works on Athlon XP too):
        3.21 real 3.21 user 0.00 sys

Here using cmov[e] gives the same amount of optimization as the kernel
ffs() gets by using a simple conditional branch instead of a slow
instruction sequence starting with "set"[e]. Mispredicted branches are
expensive on some arches, but apparently they aren't on Athlons.

The rand() in the test was intended to cause mispredicted branches as
well as lengthy searches, but it doesn't actually. The branch is never
taken since z[i] is never 0. On changing the initialization of z[i] so
that the branch is taken every second time:

	if (i & 1)
		z[i] = 1 << rand();

the kernel version becomes much faster:

        2.01 real 2.00 user 0.00 sys

and the other times don't change significantly. This is presumably
because the Athlon predicts taking the branch every second time
perfectly. The bit-search instruction is very expensive (and always
takes the same time??) and by branching over it every second time the
cost per iteration is almost halved. A better benchmark might randomize
the branches, but this might be even further from real applications
since an arg of 0 may be very unlikely (or very likely).
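
One way to randomize the branches could look like the sketch below. It
is not the benchmark that produced the timings in this message, and it
is not the code from <machine/cpufunc.h> or from gcc: ffs_branch(),
ffs_select() and fill_args() are hypothetical names, __builtin_ctz()
(gcc or clang assumed) merely stands in for the bit-search instruction,
and the shift count is clamped so that the sloppy `1 << rand()` is
avoided.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for the branching style: test for 0 first and
 * skip the bit search entirely when the argument is 0.
 */
static int
ffs_branch(int mask)
{
	if (mask == 0)
		return (0);
	return (__builtin_ctz((unsigned)mask) + 1);
}

/*
 * Hypothetical stand-in for the branchless style: always perform the
 * bit search, then select the result.  Whether the compiler emits
 * cmov, a set* sequence or a branch for the selection depends on the
 * compiler and its flags.
 */
static int
ffs_select(int mask)
{
	unsigned m = (unsigned)mask;
	/* OR in 1 when m is 0, since __builtin_ctz(0) is undefined. */
	int bit = __builtin_ctz(m | (m == 0));

	return (m == 0 ? 0 : bit + 1);
}

/*
 * Fill z[0..n-1] so that about zero_percent of the entries are 0, at
 * pseudo-random positions; the rest have a single bit set in 0..30.
 */
static void
fill_args(int *z, size_t n, int zero_percent, unsigned seed)
{
	size_t i;

	assert(zero_percent >= 0 && zero_percent <= 100);
	srand(seed);
	for (i = 0; i < n; i++) {
		if (rand() % 100 < zero_percent)
			z[i] = 0;
		else
			z[i] = (int)(1u << (rand() % 31));
	}
}
```

Calling fill_args(z, 4096, 50, 1) would give roughly the same
proportion of zeros as the alternating test, but in an order that a
branch predictor should find much harder to learn. Varying zero_percent
would show how the comparison depends on how often the arg is 0.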

Times on a Celeron 366:

    gcc builtin without cmov (very slow!):
        15.78 real 15.68 user 0.00 sys
    gcc builtin with cmov:
        5.64 real 5.61 user 0.00 sys
    kernel ffs():
        5.85 real 5.81 user 0.00 sys
    kernel ffs() with alternating 0's
    (again, others not affected by alternating):
        5.62 real 5.58 user 0.00 sys

Times on an amd64 (sledge = Opteron 244 1804 MHz):

    gcc builtin with cmov:
        2.73 real 2.72 user 0.00 sys
    old kernel ffs():
        3.42 real 3.39 user 0.01 sys
    kernel ffs() with alternating 0's
    (again, builtin affected by alternating):
        1.82 real 1.82 user 0.00 sys

So using cmov is actually significantly better than a simple branch on
amd64's, but only if the arg isn't often 0.

> In the meantime, I'm going to hold my nose and commit an include
> file to the port that is merely the inb/outb functions. This is
> clearly a hack that should go away once a "correct" solution is found.

This is approximately correct, not a hack. The system could provide a
header that implements inb() and outb() functions for userland (*), but
<machine/cpufunc.h> is not this header. It's just a bit much for
multiple applications to have to duplicate these interfaces.

(*) They shouldn't exist in the kernel. Bus-space should be used.

Bruce