Date: Fri, 4 Jun 2004 12:55:57 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
To: Alexander Leidinger <Alexander@Leidinger.net>
Cc: freebsd-arch@freebsd.org
Subject: New TCP/IP checksum code and a HOWTO on how to modernize and fix FreeBSD's FP-unit use in the kernel (was Re: ether_crc32_[bl]e())
Message-ID: <200406041955.i54Jtv0V053964@apollo.backplane.com>
References: <c9d9u3$o6k$1@kemoauc.mips.inka.de> <20040531164514.GA7776@green.homeunix.org> <20040601113825.54e5b57b@Magellan.Leidinger.net>

:Is someone interested in improving our IP checksum code too?
:
:On i386 it uses assembly language which "works ok" with gcc 3.x (so
:far), but it isn't guaranteed it will work with future versions of gcc.
:Intels C compiler already has problems with it (and it's verified it's
:because of bugs in the asm code), so in case of the use of icc a C
:version is used.
:
:All other architectures use a C version ("MD" code, even if it could be
:made MI... at least it could be shared between the big and little endian
:architectures).
:
:Matthew Dillon rewrote the IP checksum code in dragonfly:
:---snip---
:  Modified files:
:    sys/conf             files.i386
:    sys/i386/i386        in_cksum.c
:    sys/i386/include     in_cksum.h
:    sys/netinet          igmp.c in.h ip_icmp.c
:  Added files:
:    sys/i386/i386        in_cksum2.s
:  Log:
:  Rewrite the IP checksum code.  Get rid of all the inline assembly garbage,
:  get rid of old APIs that are no longer used, and build a new 'core' checksum
:...

    I should add that I finally got tired of the original checksum code
    failing at high -On optimization levels with GCC2 and GCC3, not to
    mention the ridiculously unreadable source that tried to optimize down
    to the byte, which is just dumb.  That is why I rewrote it.

    At this point the code has been in our tree for quite a while, with no
    complaints, so I would regard it as 'extremely well tested'.  I strongly
    recommend that FreeBSD adopt either this code or the core of this code
    and scrap that awful C-hybrid-inline-assembly junk.

    Also, final note: remember that IP/TCP checksums are 1's complement
    checksums, which means that they are byte-order-agnostic (the byte order
    of the result will be the same as the byte order of the data, and the
    result will be correct, so if the original data is in network byte
    order, the resulting 1's complement checksum using normal
    (non-translating) instructions will also be in network byte order).
    This is why one can simply use adcl instructions on an Intel/AMD cpu.
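
    The byte-order point can be checked with a small user-space sketch.
    This is not the DragonFly in_cksum code; the function names
    (cksum_native, cksum_network, cksum_bytes_match) are made up for
    illustration, and the 32-bit accumulator assumes an even-length buffer
    of at most 128 KiB.  One routine adds 16-bit words exactly as they sit
    in memory, the other decodes each word as big-endian, and the last one
    compares the bytes each result would put on the wire.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold a 32-bit accumulator to 16 bits, adding each carry back into the
 * low word (end-around carry).  This is the 1's complement part. */
static uint16_t fold16(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}

/* Checksum over 16-bit words loaded in host byte order, no swapping. */
uint16_t cksum_native(const unsigned char *p, size_t len)
{
    uint32_t sum = 0;

    assert(len % 2 == 0 && len <= 131072);
    for (size_t i = 0; i < len; i += 2) {
        uint16_t w;
        memcpy(&w, p + i, sizeof(w));   /* alignment-safe native load */
        sum += w;
    }
    return (uint16_t)~fold16(sum);
}

/* Reference version: decode every word as big-endian (network order). */
uint16_t cksum_network(const unsigned char *p, size_t len)
{
    uint32_t sum = 0;

    assert(len % 2 == 0 && len <= 131072);
    for (size_t i = 0; i < len; i += 2)
        sum += ((uint32_t)p[i] << 8) | p[i + 1];
    return (uint16_t)~fold16(sum);
}

/* Returns 1 if storing the native result to memory yields the same two
 * bytes as the reference value written in network byte order. */
int cksum_bytes_match(const unsigned char *p, size_t len)
{
    uint16_t native = cksum_native(p, len);
    uint16_t ref = cksum_network(p, len);
    unsigned char got[2], want[2];

    memcpy(got, &native, sizeof(got));
    want[0] = (unsigned char)(ref >> 8);
    want[1] = (unsigned char)(ref & 0xff);
    return memcmp(got, want, sizeof(got)) == 0;
}
```

    On a little-endian host cksum_native() returns a byte-swapped number
    compared with cksum_network(), yet the bytes in memory are identical,
    which is the property that lets a plain add-with-carry loop work on
    packet data without any translation.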
    In fact, it might be possible to use 128-bit media instructions but
    that's probably overkill.

    --

    You might also want to look at our new MMX/XMM optimized
    bcopy/copyin/copyout.  That was a lot harder to get right (and, most
    especially, it was a lot harder to make the FP state in kernel mode be
    properly saved and restored).  I lost a few filesystems on my test box
    getting the code right :-).  You would not be able to copy it directly
    since our FP state handling is very different from FreeBSD's now (which
    I will describe below), but you ought to be able to use the core MMX
    code.

    Right now FreeBSD is using old FP-stack instructions.  This runs about
    as fast as MMX on an Athlon but, generally speaking, it is a decrepit,
    obsolete use of the FP unit.  On DFly I made the following changes:

    * I implemented the comment in the old FBSD code (I think DG or Dyson
      made the comment) about having a separate FP save area pointer.  This
      allows the kernel to use the FP unit trivially rather than having to
      copy/restore the user process's FP save area.  This saves an ungodly
      number of cycles in the copy path and greatly simplifies the ability
      of the kernel to use the FP unit.  Our FP copy code's overhead is now
      low enough that we can use the FP unit for copies half the size of
      those on FreeBSD and still beat an integer copy, and XMM copies are
      much, much faster (esp on Athlon64's and Opterons) versus the old
      FBSD code.

    * DFly guarantees that if the FP unit is marked unused, the FP state is
      such that no fninit is required prior to using FP instructions.  This
      saves ~50+ cycles in the best-case copy path.

    * FP-in-use-by-kernel is a per-cpu bit.

    * DFly does not try to optimize copies on pre-fxsave machines.  The
      minimum required support is FXSAVE + MMX now.  If XMM is available
      (SSE2), then 128 bit media instructions will be used.  I saw no point
      in retaining code that was only just a bit faster than the integer
      code on old machines.

    * I scrapped the old integer copy code as being too complex and rewrote
      it using a middle-of-the-road integer copy (rather than having
      umpteen versions of integer copy).

    * If you attempt to use more of our code, remember that DFly does not
      preemptively migrate threads across cpus so our code doesn't have to
      worry about that.

    * I scrapped the separately-optimized copyin/copyout code and wrote a
      more generic pcb_oncall capability that allows the copy routine
      itself to push the restoration function on its stack, so the same
      optimized copy code is now used for ALL copies (memcpy, bcopy,
      copyin, copyout).

    Caveats on doing any of this for FreeBSD:  The FP code in FreeBSD is
    extremely fragile, as I found to my horror when I first tried modifying
    it.  In fact, I think there may still be interrupt and/or cpu migration
    races in the current FreeBSD FP borrowing.

    If anyone in FBSDland intends to make these changes, I recommend doing
    it one piece at a time, one commit per week:

    - week1: Redo the onfault API to allow the individual copy routines to
      push their own restore function in a stackable manner.  That way
      copyin/copyout can push its restore function, then call a general
      optimized bcopy routine which pushes ITS restore function (to clean
      up the FP unit).  i.e. onfault failure handling can now be stacked in
      DFly, which allows us to use the FP optimized bcopy code for
      copyin/copyout.  (refer to the DFly codebase for how to do this).

    - week2: make the save area a pointer instead of fixed in pcb.  Just
      point it at the PCB for this commit.  I recommend putting the save
      pointer in the machine dependent thread (per-thread) structure and
      not embedding it in the PCB.  Make sure fork() does the right thing.

    - week3: change the existing FP optimized code to use the new pointer
      method instead of the exchange-save-area method (create a fixed save
      area in the per-cpu data structure, do not allocate the 512 bytes
      required for fxsave on the stack).
      Keep the global kernel-is-using-fp bit (make it per-cpu), and pin the
      cpu for the duration of the FP copy.

    - week4: (rest) give the last set of changes 2 weeks to settle and do
      intensive testing to make sure there aren't any leaks, because a
      mistake here can cause filesystem corruption.

    - week5: change the FP copy requirements to require FXSAVE/FXRSTOR and
      adjust the existing FP copy code to use FXSAVE/FXRSTOR instead of
      fnsave/frstor.

    - week6: rip out the old FP copy code and replace it with the new
      MMX/XMM code.  Rip out the old integer copy code and just use a good,
      solid integer copy algorithm, like the one we have, that works well
      with 586 and later cpus.  (import the DFLY FP copy core here.  It
      would be the absolute last step).

    p.s. and if you need a kick in the pants, our PIPE code and any
    medium-sized block copies from the filesystem cache (e.g. using dd),
    which basically just test copyin/copyout/bcopy performance, beat the
    crap out of FreeBSD-5 now on P4's and Athlon64/Opterons.  Both are able
    to take advantage of the MMX/XMM optimized copies, especially due to the
    far lower FP setup overhead our kernel has now due to the pointer save
    area change and other things.

    (after a few runs to pre-cache)

    dhcp62# dd if=test.dat bs=32k | cat > /dev/null
    335544320 bytes transferred in 0.570208 secs (588459681 bytes/sec)  (DRAGONFLY)

    dhcp61# dd if=test.dat bs=32k | cat > /dev/null
    335544320 bytes transferred in 0.901231 secs (372317753 bytes/sec)  (FREEBSD-5)

    dhcp62# dd if=test.dat of=/dev/null bs=32k
    335544320 bytes transferred in 0.283803 secs (1182313275 bytes/sec) (DRAGONFLY)

    dhcp61# dd if=test.dat of=/dev/null bs=32k
    335544320 bytes transferred in 0.378349 secs (886864966 bytes/sec)  (FREEBSD-5)

    (with witness turned off in FreeBSD-5, so you don't get that cop-out.
    with witness turned on the results are so horrible I won't even bother
    pasting them in, to save you guys the embarrassment).

    That is what being able to use an XMM based copy for copyin/copyout
    gives you.
    I think it is well worth the effort, but a *lot* of effort is required
    if FreeBSD wants to do it right.  It took me three weeks to get it right
    in DragonFly working nearly full time, but you would have the advantage
    of learning from all my mistakes :-).

                                        -Matt
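
    The stackable onfault idea from the week1 step can be sketched in user
    space as follows.  This is a toy model under stated assumptions, not
    the DragonFly pcb_oncall code: every name here (onfault_push,
    onfault_pop, onfault_unwind, simulated_copyin) is hypothetical, the
    fault is simulated by a flag rather than a real page fault, and the
    restore functions only record a letter where a kernel would return
    EFAULT or release the FP unit.

```c
#include <assert.h>
#include <stddef.h>

#define ONFAULT_MAX 8   /* arbitrary depth limit for this sketch */

typedef void (*restore_fn)(void *arg);

struct onfault_entry { restore_fn fn; void *arg; };
struct onfault_stack { struct onfault_entry ent[ONFAULT_MAX]; int depth; };

/* Register a restore function; the most recent push runs first on fault. */
void onfault_push(struct onfault_stack *s, restore_fn fn, void *arg)
{
    assert(s->depth < ONFAULT_MAX);
    s->ent[s->depth].fn = fn;
    s->ent[s->depth].arg = arg;
    s->depth++;
}

/* Normal completion: drop the newest handler without running it. */
void onfault_pop(struct onfault_stack *s)
{
    assert(s->depth > 0);
    s->depth--;
}

/* Fault path: run every handler, innermost first.  Returns how many ran. */
int onfault_unwind(struct onfault_stack *s)
{
    int ran = 0;

    while (s->depth > 0) {
        struct onfault_entry *e = &s->ent[--s->depth];
        e->fn(e->arg);
        ran++;
    }
    return ran;
}

/* Records the order in which the restore functions ran. */
struct trace { char buf[8]; int n; };

static void note(struct trace *t, char c)
{
    if (t->n < 7) {
        t->buf[t->n++] = c;
        t->buf[t->n] = '\0';
    }
}

static void fp_restore(void *arg)     { note(arg, 'F'); } /* inner: FP cleanup */
static void copyin_restore(void *arg) { note(arg, 'C'); } /* outer: copyin     */

/* copyin pushes its handler, then the shared copy routine pushes its own. */
int simulated_copyin(struct trace *t, int fault)
{
    struct onfault_stack s = { .depth = 0 };

    t->n = 0;
    t->buf[0] = '\0';
    onfault_push(&s, copyin_restore, t);
    onfault_push(&s, fp_restore, t);
    if (fault)
        return onfault_unwind(&s);
    onfault_pop(&s);    /* copy finished: FP handler no longer needed */
    onfault_pop(&s);    /* copyin returning normally */
    return 0;
}
```

    With a simulated fault the FP cleanup runs before the copyin handler,
    so the inner routine can undo its own state without the outer caller
    knowing anything about it.  That layering is what lets one optimized
    copy routine sit underneath memcpy, bcopy, copyin and copyout alike.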