Date: Fri, 24 Aug 2012 01:08:37 GMT From: Bruce Evans <bde@FreeBSD.org> To: jhb@FreeBSD.org, rizzo@iet.unipi.it Cc: freebsd-hackers@FreeBSD.org, adrian@FreeBSD.org, mitya@cabletv.dp.ua, wojtek@wojtek.tensor.gdynia.pl, freebsd-net@FreeBSD.org Subject: Re: Replace bcopy() to update ether_addr Message-ID: <201208240108.q7O18bej009632@ref10-i386.freebsd.org> In-Reply-To: <20120822215244.GC67796@onelab2.iet.unipi.it>
next in thread | previous in thread | raw e-mail | index | archive | help
luigi wrote: > On Wed, Aug 22, 2012 at 03:21:06PM -0400, John Baldwin wrote: > > On Wednesday, August 22, 2012 2:54:07 pm Adrian Chadd wrote: > > > On 22 August 2012 05:02, John Baldwin <jhb@freebsd.org> wrote: > > > > On Tuesday, August 21, 2012 12:34:42 pm Adrian Chadd wrote: > > > >> Hi, > > > >> > > > >> What about just creating an ETHER_ADDR_COPY(dst, src) and putting that > > > >> in a relevant include file, then hide the ugliness there? Good for source code ugliness and bloat (not just in the macro). > > > >> The same benefits will likely appear when copying wifi MAC addresses > > > >> to/from headers. Why stop there? Change all assignments to use the ASSIGN() macro. The compiler doesn't understand assignment, but we do ;-). > > > >> Thanks, I'm glad someone noticed this. > > > > > > > > I doubt we even _need_ the ugliness. We should just use *dst = *src > > > > unless there is a compelling reason not to. > > > > > > Because it's not very clear? :-) I'd much prefer my array-of-things > > > copies to be explicit. > > > > Eh? 'struct foo *src, *dst; *dst = *src' is pretty bog-standard C. That > > isn't really all that obtuse. (Builtin) memcpy is better for this in general (no sarcasm now). It reduces to a struct copy if the object being copy is a struct, and is not limited to structs. However, ethernet address types are now fully pessimized, so compilers are almost forced to assume that they are fully misaligned. From net/ethernet.h: % /* % * Structure of a 10Mb/s Ethernet header. % */ Wow, 10Mb/s. % struct ether_header { % u_char ether_dhost[ETHER_ADDR_LEN]; % u_char ether_shost[ETHER_ADDR_LEN]; % u_short ether_type; % } __packed; This used to be not quite guaranteed to be 16-bit aligned by the short in it. Perhaps it is still aligned in practice, But in 2006, it was fully pessimized by adding __packed without adding __aligned(2). This __packed prevents the struct being padded to a multiple of 32 bits in size on arm. It has the side effect of allowing the compiler to not add padding for alignment before instances of this struct, and allowing embedded short to be misaligned, so all accesses to this short might require 2 1-byte load/store instructions and more instructions to combine the bytes. On x86, the accesses can be a single load/store (or sometimes an arithmetic operation), with the hardware doing the combination, possibly more slowly than for aligned accesses. The pessimization is better on other arches. % /* % * Structure of a 48-bit Ethernet address. % */ % struct ether_addr { % u_char octet[ETHER_ADDR_LEN]; % } __packed; This was always just an array of bytes, so it was nevever required to have more than byte alignment. However, the compiler was permitted to add padding before and after it to align it and the thing after it. __packed on it may prevent this. I hope that it actually takes a __packed on the containing struct or on the struct member with this type to prevent external padding, but __packed on everything may be needed anyway for __packed on this to have much effect. An ETHER_ADDR_COPY() macro can only do worse than the compiler here, since the object is nothing more than an array of bytes after forgetting the extra packing info that may be known to the compiler. On x86, the macro can use 2 possibly-misaligned load/stores, but so can the compiler, or the macro can use a dynamic comparison of the address to find the alignment point and then do 1, 2 or 3 aligned load/store pairs, but so can the compiler, or the macro can use "rep movsb", but so can the compiler... The best choice depends on CFLAGS (a static test that is easiest for the compiler) or on the CPU. For a complete implementation, the macro must of course patch the instructions used to best suit the current CPU. This would be a lot of work for an unimportant (~0.1% max) optimization. On non-x86 or any arch with strict alignment requirements, the only obviously correct implementions of the macro are 6 1-byte load/store pairs, or memcpy(), or bcopy(). The compiler can easily beat all of these. > the thread has probably forked causing people to miss the explanation > that Bruce gave: quite often the function is called by casting > arbitrary pointers into 'struct foo *', so the compiler's expectations > about alignment do not necessarily match the user's lies. Except for struct ether_addr, the cast would actually reduce alignment (unless the compiler is smart enough to not do it literally). > Unfortunately we are building kernels with many compiler checks > disabled, so there is a fair chance that the compiler will not > detect such invalid casts. Er, we check more in the kernel than almost everywhere else. > Probably addresses are aligned to 2-byte boundaries, but certainly > not on a 4-byte, which means that a safe copy might require 3 > instructions, even though a compiler could otherwise decide to align > all non-packed 'struct foo' to a 4- or 8-byte boundary and possibly > do the copy with 2 or even 1 instruction. The code actually asks for 1-byte alignment. From if_ethersubr.c: % int % ether_output(struct ifnet *ifp, struct mbuf *m, % struct sockaddr *dst, struct route *ro) % { % short type; % int error = 0, hdrcmplt = 0; % u_char esrc[ETHER_ADDR_LEN], edst[ETHER_ADDR_LEN]; This asks for raw byte arrays. gcc tends to align even arrays of char more than asked for but the code asks for minimal alignment. % struct llentry *lle = NULL; % struct rtentry *rt0 = NULL; % struct ether_header *eh; This points into an array of data. Such pointers may lose extra alignment info that may be present in the original type. % struct pf_mtag *t; % int loop_copy = 1; % int hlen; /* link layer header length */ Another pessimization technique used in this code is to initialize many variables in their declarations, despite unbroken versions of style(9) forbidding this. Out of order execution and x86 speed hides the slowness from this about as well as it hides the slowness from extra memcpy()'s. (gcc likes to do all of these inititializations "early" in program order, but with out of order execution there is no strict order and you can't do much better by delaying or avoiding the initializations.) Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?201208240108.q7O18bej009632>