Date: Mon, 2 Feb 2004 22:24:49 +1100 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Andy Farkas
cc: src-committers@FreeBSD.org
cc: cvs-all@FreeBSD.org
cc: John Baldwin
Subject: Re: cvs commit: src/sys/i386/i386 apic_vector.s src/sys/i386/isa atpic_vector.s
Message-ID: <20040202215735.O2387@gamplex.bde.org>
In-Reply-To: <20040202175017.W1579@gamplex.bde.org>
References: <200401282044.i0SKi8Y6063747@repoman.freebsd.org> <20040202175017.W1579@gamplex.bde.org>

On Mon, 2 Feb 2004, Bruce Evans wrote:

> On Mon, 2 Feb 2004, Andy Farkas wrote:
>
> > On Wed, 28 Jan 2004, John Baldwin wrote:
> >
> > >   Modified files:
> > >     sys/i386/i386        apic_vector.s
> > >     sys/i386/isa         atpic_vector.s
> > >   Log:
> > >   Optimize the i386 interrupt entry code to not reload the segment
> > >   registers if they already contain the correct kernel selectors.
> >
> > What effect on performance does this change have?  It seems to be a
> > rather significant change to an important code path, or am I totally
> > confused..?
>
> I measured it in userland and saw about -1 cycles/interrupt on an
> AthlonXP and about -22 cycles/interrupt on an old Celeron (negative
> means a pessimization).

Bah, the benchmark that gave these results was very buggy.  Here is a
better version.  Results first (for a random pattern to simulate 75% of
interrupts coming from userland):

Celeron 366:  -20 cycles/interrupt optimization
Celeron 366 with partial register stall pessimization backed out:
              no significant change (measured -0.05 cycles/interrupt less)
P2 (or whatever freefall is):
              similar to Celeron; machine too loaded for an accurate test
AthlonXP:     +4 cycles/interrupt

Best cases for this and some other percentages:

100% user:  Celeron  0-   AthlonXP  -1
 75% user:  Celeron  0+   AthlonXP  +4
 50% user:  Celeron  +2   AthlonXP  +9
 25% user:  Celeron  +9   AthlonXP +15
  0% user:  Celeron +20   AthlonXP +21

The branch prediction seems to be too good to be true.
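(For reference: each timed loop below prints a raw TSC total for SIZE
calls, so a cycles/interrupt delta is just the difference of two printed
totals divided by SIZE; the loop and randfs() overhead cancels in the
difference.  A minimal sketch of that arithmetic -- the two totals here
are made up, not measured values:)

%%%
#include <stdio.h>

#define SIZE	1000000		/* iterations per timed loop, as below */

int
main(void)
{
	/* Made-up totals; substitute two of the benchmark's outputs. */
	unsigned long long newway_total = 152000000ULL;
	unsigned long long oldway_total = 132000000ULL;

	/* Positive here just means the first loop used more cycles. */
	printf("%+.2f cycles/interrupt\n",
	    ((double)newway_total - (double)oldway_total) / SIZE);
	return (0);
}
%%%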
%%%
#include <sys/types.h>
#include <machine/cpufunc.h>
#include <stdio.h>
#include <stdlib.h>

static void
testnewway(void)
{
	__asm(" \n\
KDSEL = 0x2f \n\
KPSEL = 0x2f \n\
	pushal \n\
	pushl %ds \n\
	pushl %es \n\
	pushl %fs \n\
	mov %fs,%ax \n\
	cmp $KPSEL,%ax \n\
	je 1f \n\
	mov $KDSEL,%ax \n\
	mov %ax,%ds \n\
	mov %ax,%es \n\
	mov $KPSEL,%ax \n\
	mov %ax,%fs \n\
1: \n\
	popl %fs \n\
	popl %es \n\
	popl %ds \n\
	popal \n\
	");
}

static void
testoldway(void)
{
	__asm(" \n\
KDSEL = 0x2f \n\
KPSEL = 0x2f \n\
	pushal \n\
	pushl %ds \n\
	pushl %es \n\
	pushl %fs \n\
	movl $KDSEL,%eax \n\
	mov %ax,%ds \n\
	mov %ax,%es \n\
	movl $KPSEL,%eax \n\
	mov %ax,%fs \n\
1: \n\
	popl %fs \n\
	popl %es \n\
	popl %ds \n\
	popal \n\
	");
}

static void
testnewwayfix1(void)
{
	__asm(" \n\
KDSEL = 0x2f \n\
KPSEL = 0x2f \n\
	pushal \n\
	pushl %ds \n\
	pushl %es \n\
	pushl %fs \n\
	mov %fs,%ax \n\
	cmp $KPSEL,%ax \n\
	je 1f \n\
	movl $KDSEL,%eax \n\
	mov %ax,%ds \n\
	mov %ax,%es \n\
	movl $KPSEL,%eax \n\
	mov %ax,%fs \n\
1: \n\
	popl %fs \n\
	popl %es \n\
	popl %ds \n\
	popal \n\
	");
}

static void
testnewwayfix2(void)
{
	/*
	 * New way with a different way of avoiding the partial register
	 * stores (use lots of operand size prefixes).
	 *
	 * It seems that gcc now understands "mov %ax,%ds" like I did in
	 * 1986, but that understanding is wrong :-).  gcc leaves out the
	 * operand size prefix, but the prefix is apparently needed to
	 * actually mov from %ax.  Without it the mov is apparently from
	 * %eax and there is a partial register store if we only load %ax.
	 * gcc refuses to produce the prefix for "mov %eax,%ds".
	 *
	 * gcc understands mov's from segment registers to general
	 * registers better.  Then it is clear that the target may be
	 * either 16 or 32 bits.  Intel now documents what happens in at
	 * least this case.  IIRC, it says that the operand size prefix
	 * works as should be expected, and the top 16 bits of the target
	 * are set to an indeterminate value.  See the commit logs for
	 * where I shot down a pessimization of this (we avoid using the
	 * operand size prefix, so we get garbage in the top bits).
	 *
	 * This subset of the benchmark is just to demonstrate that using
	 * the operand size prefix to force 16-bit operations is just a
	 * pessimization.  It is only a small pessimization though.  Until
	 * today (2004/01/04) I didn't understand why getting the prefix
	 * wrong was such a large pessimization.  It was due to partial
	 * register stalls more than the prefix.
	 */
	__asm(" \n\
KDSEL = 0x2f \n\
KPSEL = 0x2f \n\
	pushal \n\
	pushl %ds \n\
	pushl %es \n\
	pushl %fs \n\
	mov %fs,%ax \n\
	cmp $KPSEL,%ax \n\
	je 1f \n\
	mov $KDSEL,%ax \n\
	.byte 0x66 \n\
	mov %ax,%ds \n\
	.byte 0x66 \n\
	mov %ax,%es \n\
	mov $KPSEL,%ax \n\
	.byte 0x66 \n\
	mov %ax,%fs \n\
1: \n\
	popl %fs \n\
	popl %es \n\
	popl %ds \n\
	popal \n\
	");
}

#define SIZE 1000000

unsigned char state[SIZE];

static void
randfs(int i)
{
	if (state[i] != 0)
		__asm("pushl %cs; popl %fs");
	else
		__asm("pushl %ds; popl %fs");
}

int
main(void)
{
	double statetot;
	unsigned long long start;
	int i;

	statetot = 0;
	for (i = 0; i < SIZE; i++) {
#if 0
		state[i] = (random() >> 30) & 1;
#else
		/*
		 * Bias the state to simulate that the system probably
		 * spends most of its time in user mode (state[i] != 0).
		 * Guess 75% in user mode.
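		 * ((random() >> 28) & 3) keeps two roughly uniform random
		 * bits, which are nonzero 3 times out of 4, so state[i]
		 * ends up 1 about 75% of the time.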
		 */
		state[i] = (((random() >> 28) & 3) != 0);
#endif
		statetot += state[i];
	}
	start = rdtsc();
	for (i = 0; i < SIZE; i++) {
		randfs(i);
		testnewway();
	}
	printf("%llu\n", rdtsc() - start);
	start = rdtsc();
	for (i = 0; i < SIZE; i++) {
		randfs(i);
		testoldway();
	}
	printf("%llu\n", rdtsc() - start);
	start = rdtsc();
	for (i = 0; i < SIZE; i++) {
		randfs(i);
		testnewwayfix1();
	}
	printf("%llu\n", rdtsc() - start);
	start = rdtsc();
	for (i = 0; i < SIZE; i++) {
		randfs(i);
		testnewwayfix2();
	}
	printf("%llu\n", rdtsc() - start);
	printf("state average %.3f\n", statetot / SIZE);
	return (0);
}
%%%

Bruce