Date: Thu, 27 Mar 2003 19:07:15 +1100 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Mike Silbersack <silby@silby.com>
Cc: Nate Lawson <nate@root.org>
Subject: Re: Checksum/copy (was: Re: cvs commit: src/sys/netinet ip_output.c)
Message-ID: <20030327180247.D1825@gamplex.bde.org>
In-Reply-To: <20030326225530.G2075@odysseus.silby.com>
References: <Pine.BSF.4.21.0303260956250.27748-100000@root.org> <20030326225530.G2075@odysseus.silby.com>

On Wed, 26 Mar 2003, Mike Silbersack wrote:

> On Wed, 26 Mar 2003, Nate Lawson wrote:
>
> > I don't want to hijack the thread too much, but has thought gone into a
> > combined checksum and copy function?  The first mention I can remember of
> > this is in RFC 817 p. 19-20.

Is this RFC old?  Combined checksum and copy hasn't been a large
optimization since L1 caches became large enough.  To a first
approximation, everything is dominated by memory bandwidth, and another
pass to calculate the checksum is free because copying left all the data
in the L1 cache.

> Heh, I don't think anyone has.  What actually would make sense is for
> someone who feels like doing ASM timing to look at our bcopy routines /
> etc.

I spent a lot of time on this about 7 years ago.  See ~bde/cache on
freefall for old versions of programs that try lots of different
copy/read/write/checksum methods.  Better hardware made the differences
between the various methods relatively small.  One can probably do better
(50%?) for largish (1K+ ?) buffers using SSE instructions on i386's now.

> On my Mobile Celeron, a for (i = 0; i < max; i++) array[i]=0 runs
> faster than bzero. :(

Saved data from my benchmarks show that bzero (stosl) was OK on 486's,
poor on original Pentiums, OK on K6-1's, best by far on second generation
Celerons (ones like PII) and poor on Athlon XP's (but not as relatively
bad as on original Pentiums).  The C loop could easily be competitive
with hand-unrolled asm that uses the same instruction to access memory
(no SSE etc.) for large buffers, but I would expect it to be slower for
small buffers, since it executes an unnecessarily large number of
instructions per memory access.  But maybe these get pipelined perfectly
so that everything is limited by memory, while stosl has extra limits.

Bruce
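
For reference, the combined checksum and copy being discussed can be
sketched roughly as below.  This is only a minimal portable C
illustration, not the FreeBSD kernel code and not one of the programs in
~bde/cache.  The function name copy_and_cksum is hypothetical, and the
checksum is assumed to be the 16-bit one's-complement Internet checksum
described in RFC 1071, with the data read as big-endian 16-bit words.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: copy len bytes from src to dst and, in the same
 * pass, compute the one's-complement Internet checksum of the data.
 * Each byte is read once and used for both the store and the sum.
 * The result is the numeric value of the checksum word; a caller would
 * store it in network byte order.
 */
uint16_t
copy_and_cksum(void *dst, const void *src, size_t len)
{
	const uint8_t *s = src;
	uint8_t *d = dst;
	uint64_t sum = 0;	/* wide accumulator; carries are folded at the end */

	while (len >= 2) {
		d[0] = s[0];
		d[1] = s[1];
		sum += ((uint32_t)s[0] << 8) | s[1];
		s += 2;
		d += 2;
		len -= 2;
	}
	if (len == 1) {
		/* An odd trailing byte is padded with a zero low byte. */
		d[0] = s[0];
		sum += (uint32_t)s[0] << 8;
	}

	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

A byte-at-a-time loop like this is only meant to show the structure.
Whether merging the two passes gains anything over bcopy followed by a
separate checksum pass depends on the CPU and the buffer size, as
described above, and would have to be measured.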