From owner-svn-src-all@FreeBSD.ORG Thu Feb 17 10:03:49 2011 Return-Path: Delivered-To: svn-src-all@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id D0C9F106566B; Thu, 17 Feb 2011 10:03:49 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail08.syd.optusnet.com.au (mail08.syd.optusnet.com.au [211.29.132.189]) by mx1.freebsd.org (Postfix) with ESMTP id AC0798FC0C; Thu, 17 Feb 2011 10:03:48 +0000 (UTC) Received: from c122-107-114-89.carlnfd1.nsw.optusnet.com.au (c122-107-114-89.carlnfd1.nsw.optusnet.com.au [122.107.114.89]) by mail08.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id p1HA3dL5026916 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 17 Feb 2011 21:03:43 +1100 Date: Thu, 17 Feb 2011 21:03:39 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Alexander Best In-Reply-To: <20110217010953.GA65017@freebsd.org> Message-ID: <20110217182526.G1149@besplex.bde.org> References: <201102161805.p1GI5ABX078768@svn.freebsd.org> <20110216221014.GA43296@freebsd.org> <20110216221712.GA44796@freebsd.org> <20110216224126.GA47777@freebsd.org> <20110217010953.GA65017@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org, Warner Losh Subject: Re: svn commit: r218745 - head/sys/boot/i386/boot2 X-BeenThere: svn-src-all@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: "SVN commit messages for the entire src tree \(except for " user" and " projects" \)" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 17 Feb 2011 10:03:49 -0000 On Thu, 17 Feb 2011, Alexander Best wrote: > On Wed Feb 16 11, Alexander Best wrote: >> On Wed Feb 16 11, Alexander Best wrote: >>> On Wed Feb 16 11, Alexander Best wrote: >>>> On Wed Feb 16 11, Warner Losh wrote: >>>> i think without this code uint32_t x serves no purpose any longer: >>>> >>>> /usr/git-freebsd-head/sys/boot/i386/boot2/boot2.c:322:20: warning: unused variable 'x' [-Wunused-variable] >>>> uint32_t addr, x; >>>> ^ Removing it won't save any space, however. >>> also due to >>> >>> /usr/git-freebsd-head/sys/boot/i386/boot2/boot2.c:631:8: warning: cast from 'caddr_t' (aka 'char *') to 'uint32_t *' (aka 'unsigned int *') increases required alignment from 1 to 4 [-Wcast-align] >>> t1 = *(uint32_t *)PTOV(0x46c); >>> ^~~~~~~~~~~~~~~~~~~~~~~ >>> >>> i think t0 and t1 can be turned into uint8_t's and PTOV(0x46c); can be casted >>> to (uint8_t *), instead of (uint32_t *). Maybe, but further changes would be required. The variable is a 32-bit tick counter. I think the tick frequency is 18.2 Hz. This will wrap after some time, and the code is already broken when it wraps. This brokenness is not too important since this code is normally used soon after booting and the wrap occurs after 7.25 years, but if you reduce the counts to uint8_t's then the wrap occurs after only 14 seconds, so you should be more careful. From an old version of boot2.c: % static int % keyhit(unsigned ticks) Ticks might also be best as uint8_t, especially if t0 and t1 are changed to uint8_t. It always has the value 3*SECOND or 5*SECOND, where SECOND is the tick frequency 18.2 rounded to 18 (rounding gives an error of 1 tick after 5 seconds; this can be fixed using floating point evaluated at compile time). I think the compiler is prevented from reducing it to its constant value by lack of (implicit) inlining. % { % uint32_t t0, t1; % % if (opts & 1 << RBX_NOINTR) % return 0; The only change in the current verion is to obfuscate the bit test using a macro. Such optimizations make it hard to see where the bloat is. % t0 = 0; % for (;;) { % if (xgetc(1)) % return 1; % t1 = *(uint32_t *)PTOV(0x46c); % if (!t0) % t0 = t1; Might be able to rewrite this so it is auto-initializing without a special case. Try putting 't0 = *(uint32_t *)PTOV(0x46c);' outside of the loop. I think it is intentionally not done this way since PTOV(0x46c) involves the large address constant 0x46c (which tends to generate large code (see below)) and the PTOV() macro makes this worse due to the bad implementation of this macro. (PTOV(pa) is ((caddr_t)(pa) - __base). Subtracting the base makes no sense, and defeats the normal address mode base+offset. However, here the address of PTOV(0x46C) is best calculated (with the bad PTOV()) by loading 0x46C into a register and subtracting __base. Then we have a pointer to it and can load t0 initially by just dereferencing this pointer. With the better PTOV(pa) of (__actual_base_with_the_correct_type + (pa)) the best calculation of the address is to load the actual base into a register and not add 0x46C to that register if we are only using PTOV(0x46c) once, but when PTOV(046c) is used twice the best code is less clear.) % if (t1 < t0 || t1 >= t0 + ticks) % return 0; Oops, this does handle wrap, but in a slow way and sloppy way. t1 < t0 means that the counter has wrapped. We then return immediately. Not too bad if there is 1 error 7.25 years after booting. With the counter reduced to 8 bits, this error after 14 seconds would be annoying. But the usual method of handling cyclic counters takes less code and is less sloppy: if ((counter_size_t)(t1 - t0) >= ticks) return 0; We must be careful with the cast. Then this works with a counter_size_t of uint8_t, since the maximum value for `ticks' is 70 which fits in 8 bits. Also, we can almost guarantee that this code isn't interrupted, so we won't miss a single change of the counter. With interruption, we would have to be very unlucky to miss a counter rollover after 7.25 7.25 years or even after 14 seconds, and even if we miss we would normally only wait at most `ticks' ticks extra; then we do a less-wrong thing than returning early. % } % } > this will actually increase the size. Maybe. Smaller types often give larger object code due to extra prefix bytes (in i386 code) and extra instructions to promote the types before using them. 16-bit types in 32-bit i386 code are especially bad, since they require prefixes. In the above, t1 < t0 gives the same size object code with either uint8_t or unisigned. t1 >= t0 + ticks gives larger object code with uint8_t. At least 1 movl will have to become movzbl to promote a uint8_t. But with everything uint8_t, only 8-bit instructions should be needed. >> with this additional change the code fits when compiled with clang: >> >> diff --git a/sys/boot/i386/boot2/sio.S b/sys/boot/i386/boot2/sio.S >> index 7b8e9eb..d745129 100644 >> --- a/sys/boot/i386/boot2/sio.S >> +++ b/sys/boot/i386/boot2/sio.S >> @@ -29,11 +29,11 @@ >> sio_init: movw $SIO_PRT+0x3,%dx # Data format reg >> movb $SIO_FMT|0x80,%al # Set format >> outb %al,(%dx) # and DLAB >> - pushl %edx # Save >> + pushb %dl # Save >> subb $0x3,%dl # Divisor latch reg >> movl 0x8(%esp),%eax # Set >> outw %ax,(%dx) # BPS >> - popl %edx # Restore >> + popb %dl # Restore >> movb $SIO_FMT,%al # Clear >> outb %al,(%dx) # DLAB >> incl %edx # Modem control reg > > cannot push a single byte onto the stack on i386 32 bit. Indeed. It could only save a space by not generating any code for pushb. Compiling with -Wa,-al shows the null code, but also shows errors. push/pop of a single byte only works on i386 for pushing a single _immediate_ byte. The result is still to push a 16- ot 32-bit word, with the immediate byte sign extended, but when you have a constant value that is between -128 and 127, "push[lw] $value" is by far the shortest way to put this value in memory (2 bytes, 1 for the opcode and 1 for the value). In contrast, storing a general constant value to a general address often takes 10 bytes (2 bytes for the opcode, 4 for the memory address, and 4 for the value; with a reduction of only the last from 4 to 1 for a small immediate value). >> ...since we're only modifying %dl in subb $0x3,%dl, we don't need to push/pop >> a 32 bit value, but only 8 bits. gcc also has little idea of how to generate small code for this. Use of any byte register except %al should be avoided, since for example subb $3,%dl takes 3 bytes while subb $3,%al takes 2 bytes. However in sio.S, the result is needed in %dx. Operating on %dl is already a trick to save space, but is is far from best: unlike operations, movb to %al is no more efficient than movb to any byte register. subb $3,%dl should therefore be written as movb $mumble,%dl to save 1 byte. $mumble is the device's register set address, plus the device register offset, all mod 256, and the everything must be known at compile time not cross a 256-byte boundary for this to work. The source code is then more readable (no magic $3, which is the difference between the offset of the device register previously used (very context dependent) and the device register to be used next; instead, just the offset of the device reguster to be used next combined with the base). ISTR seeing gcc doing this optimization starting from C code, but it rarely gets a chance since the base address is rarely known at compile time. Here we direct the code generated precisely and don't do it as well as a compiler could starting from C code -- sio.S never needed to be in asm. In sio.S, the base is known at compile time (as SIOPRT), and xx50 ports are always aligned at 8-byte boundaries and have 8 internal registers so there is no problem with 256-byte boundaries. Here is serial.S from the old and better boot loader i386/boot/biosboot which I use, translated into C according to the above ideas. When compiled with -Os -fomit-frame-pointer, it has size 112 bytes, but serial.S gives size 105 bytes so this doesn't quite prove my case that C is better even for things than are carefully optimized in asm. gcc-3.3.3 gives size 128 bytes, so gcc is actually getting closer to being competitive with asm programmers. serial.c: % /* % * The serial port interface routines implement a simple polled i/o % * interface to a standard serial port. Due to the space restrictions % * for the boot blocks, no BIOS support is used (since BIOS requires % * expensive real/protected mode switches), instead the rudimentary % * BIOS support is duplicated here. % * % * XXX modes switches shouldn't be expensive. An int86() function % * could be used, so that new BIOS calls could be written in C. This % * would probably take more space, however. The switch could be done % * automagically by trapping interrupts. This would allow space-efficient % * calls of the form asm("int nn" : "outputs" : "inputs"). % * % * The base address and speed for the i/o port are passed from the % * Makefile in the COMCONSOLE and CONSPEED preprocessor macros. The % * line control parameters are currently hard-coded to 8 bits, no % * parity, 1 stop bit (8N1). This can be changed in init_serial(). % * % * XXX FIXME we now use the kernel's default 8N1. % */ % % #ifdef broken % #include "boot.h" % #endif % % #include % #include % % #include % % #include % #include /* XXX incompletely globalized defs */ % #define COMCONSOLE 0x3f8 /* XXX another; normally in makefile */ % % #define COMBRD(x) (1843200 / (16 * (x))) % #define CONADDR COMCONSOLE % % /* % * Write a character to the serial port. % */ % /* % * XXX should be named serial_putchar(). % */ % void % serial_putc(ch) % int ch; % { % int timeout; % % for (timeout = 10000; --timeout != 0;) % if ((inb(CONADDR + com_lsr) & LSR_TXRDY)) % break; % outb(CONADDR + com_data, ch); % } % % /* % * Read a character from the serial port. % */ % /* % * XXX should be named serial_getchar(). % */ % int % serial_getc() % { % unsigned char ch; % % while (!(inb(CONADDR + com_lsr) & LSR_RXRDY)) % ; % ch = inb(CONADDR + com_data); % % /* Remove any parity bit. */ % ch &= 0x7f; % % /* Convert DEL to BACKSPACE. */ % if (ch == 0x7f) % ch = 8; % % return (ch); % } % % /* % * Return nonzero if getc() wouldn't block. % */ % /* % * XXX should be named serial_ischar(). % * XXX FIXME this should be static inline to save space. % */ % int % serial_ischar() % { % return (inb(CONADDR + com_lsr) & LSR_RXRDY); % } % % /* % * Initialize port CONADDR to speed CONSPEED, line settings 8N1. % */ % /* % * XXX should be named serial_init(). % * XXX FIXME his should be static inline to save space. % */ % void % init_serial() % { % outb(CONADDR + com_cfcr, CFCR_DLAB); % outb(CONADDR + com_dlbl, COMBRD(CONSPEED)); % outb(CONADDR + com_dlbh, COMBRD(CONSPEED) >> 8); % % /* Disable fifo to reduce worst-case busy-wait. */ % outb(CONADDR + com_fcr, 0); % % outb(CONADDR + com_cfcr, CFCR_8BITS); /* 8N1 */ % outb(CONADDR + com_mcr, MCR_DTR | MCR_RTS); % % /* Flush (unconditionally the first time) all input. */ % do % inb(CONADDR + com_data); % while (inb(CONADDR + com_lsr) & LSR_RXRDY); % } Code generated from the above by gcc-4.2.1 -Os -fomit-frame-pointer -Wa,-al: % GAS LISTING /var/tmp//cc2FFF3R.s page 1 % % % 1 .file "w.c" % 2 .text % 3 .globl serial_getc % 4 .type serial_getc, @function % 5 serial_getc: % 6 .L3: % 7 0000 BAFD0300 movl $1021, %edx % 7 00 This takes 5 bytes (1 byte opcode and 4-byte immediate value), but should take 4 bytes (1 byte operand size prefix, 1 byte opcode and 2-byte immediate value. My old asms in cpufunc.h are probably at fault here. Those intentionally use "u_int port" to avoid the prefix. This optimizes for time instead of space (the prefix costs a cycle on old x86's). The value is 0x3f8+3, obfuscated in the usual way by gcc (gcc generates decimal constants even when the source code uses hex constants :-(). % 8 #APP % 9 0005 EC inb %dx,%al % 10 #NO_APP Only the low 16 bits of %edx are used. This allows a 16-bit load to work. But the above did a 32-bit load to optimize for time. % 11 0006 A801 testb $1, %al % 12 0008 74F6 je .L3 % 13 000a B2F8 movb $-8, %dl Here we see gcc perfectly optimizing the adjustment from 0x3f8+3 to 0x3f8 -- it doesn't subtract 3, but loads 0xf8. The obfuscation of the value is even better here -- 0xf8 technically can't be loaded since the immediate byte is signed, but that is how the value should be written. % 14 #APP % 15 000c EC inb %dx,%al % 16 #NO_APP % 17 000d 83E07F andl $127, %eax It could use andb to save a byte, but this prepares a good return value at a cost of only 1 byte. % 18 0010 3C7F cmpb $127, %al % 19 0012 7502 jne .L5 % 20 0014 B008 movb $8, %al % 21 .L5: % 22 0016 0FB6C0 movzbl %al, %eax This wastes 3 bytes. We already zero-extended using the andl. Perhaps the source code can be tweaked to make this clear to the compiler. % 23 0019 C3 ret I think the interface is intentionally pessimized to return a 32 bit value and thus require wasting 1 byte for the above andl, since if we don't extend once here then the callers will tend to extend more than once later. % 24 .size serial_getc, .-serial_getc % 25 .globl serial_putc % 26 .type serial_putc, @function % 27 serial_putc: % 28 001a B9102700 movl $10000, %ecx % 28 00 % 29 001f EB0A jmp .L11 % 30 .L12: % 31 0021 BAFD0300 movl $1021, %edx % 31 00 % 32 #APP % 33 0026 EC inb %dx,%al % 34 #NO_APP % 35 0027 A820 testb $32, %al % 36 0029 7503 jne .L13 % 37 .L11: % 38 002b 49 decl %ecx % 39 002c 75F3 jne .L12 gcc apparently doesn't know about the `loop' instruction which does this in 2 bytes. The `loop*' instructions are useless and normally slower except: - they save space - on 8088's, the instruction fetcher was so weak that saving space in instruction space also saved time in just about all cases. % 40 .L13: % 41 002e BAF80300 movl $1016, %edx % 41 00 This loads the full register (32 bits, though 16 would be enough) since it is jumped to. Better generated code would load CONADDR into %[e]dx initially so that it is available for low-byte-only modifications throughout. % 42 0033 8A442404 movb 4(%esp), %al This function returns void so it automatically avoids extra code to extend the return value, but its parameter may have been excessively extended when passed here. % 43 #APP % 44 0037 EE outb %al,%dx % 45 #NO_APP % 46 0038 C3 ret % 47 .size serial_putc, .-serial_putc % 48 .globl init_serial % 49 .type init_serial, @function % 50 init_serial: % 51 0039 B9FB0300 movl $1019, %ecx % 51 00 % 52 003e B080 movb $-128, %al % GAS LISTING /var/tmp//cc2FFF3R.s page 2 % % % 53 0040 89CA movl %ecx, %edx % 54 #APP % 55 0042 EE outb %al,%dx % 56 #NO_APP % 57 0043 B00C movb $12, %al % 58 0045 B2F8 movb $-8, %dl % 59 #APP % 60 0047 EE outb %al,%dx % 61 #NO_APP % 62 0048 31C0 xorl %eax, %eax % 63 004a B2F9 movb $-7, %dl % 64 #APP % 65 004c EE outb %al,%dx % 66 #NO_APP % 67 004d B2FA movb $-6, %dl % 68 #APP % 69 004f EE outb %al,%dx % 70 #NO_APP % 71 0050 B003 movb $3, %al % 72 0052 89CA movl %ecx, %edx % 73 #APP % 74 0054 EE outb %al,%dx % 75 #NO_APP % 76 0055 B2FC movb $-4, %dl % 77 #APP % 78 0057 EE outb %al,%dx % 79 #NO_APP % 80 .L16: % 81 0058 BAF80300 movl $1016, %edx % 81 00 % 82 #APP % 83 005d EC inb %dx,%al % 84 #NO_APP % 85 005e B2FD movb $-3, %dl % 86 #APP % 87 0060 EC inb %dx,%al % 88 #NO_APP % 89 0061 A801 testb $1, %al % 90 0063 75F3 jne .L16 % 91 0065 C3 ret % 92 .size init_serial, .-init_serial I didn't try hard to optimize the initialization function for use here, but may have written parts of the asm that it is based on and wrote them for small size, and tried hard to make the cpufuncs not turn small code into large code. The results are not quite perfect -- the initial load of is into %ecx when it should be direct into %dx, and there is another load into %edx. % 93 .globl serial_ischar % 94 .type serial_ischar, @function % 95 serial_ischar: % 96 0066 BAFD0300 movl $1021, %edx % 96 00 % 97 #APP % 98 006b EC inb %dx,%al % 99 #NO_APP % 100 006c 83E001 andl $1, %eax andb and a byte return should be enough. % 101 006f C3 ret % 102 .size serial_ischar, .-serial_ischar Inlining might reduce the loading of %dx even more. This function is called twice, so it is not quite small enough to always inline. % 103 .ident "GCC: (GNU) 4.2.1 20070719 [FreeBSD]" Bruce