Date: Fri, 28 Mar 2003 17:20:43 +1100 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Nate Lawson <nate@root.org>
Cc: cvs-all@FreeBSD.org
Subject: Re: Checksum/copy (was: Re: cvs commit: src/sys/netinet ip_output.c)
Message-ID: <20030328170704.C6082@gamplex.bde.org>
In-Reply-To: <Pine.BSF.4.21.0303270954070.29744-100000@root.org>
References: <Pine.BSF.4.21.0303270954070.29744-100000@root.org>

On Thu, 27 Mar 2003, Nate Lawson wrote:

> On Thu, 27 Mar 2003, Bruce Evans wrote:
> > On Wed, 26 Mar 2003, Mike Silbersack wrote:
> > > On Wed, 26 Mar 2003, Nate Lawson wrote:
> > > > I don't want to hijack the thread too much, but has thought gone into a
> > > > combined checksum and copy function? The first mention I can remember of
> > > > this is in RFC 817 p. 19-20.
> >
> > Is this RFC old? Combined checksum and copy hasn't been a larger
> > optimization since L1 caches became large enough, since to a first
> > approximation, everything is dominated by memory bandwidth and another
> > pass to calculate the checksum is free because copying left all the
> > data in the L1 cache.
>
> Yes, the RFC is old. However, there still may be performance advantages
> in ILP because while the data is being fetched the first time (for the
> copy), idle slots can be filled with an incremental checksum update.

I'm sure there are some advantages on some CPUs but doubt that they are
significant. I'll send some old code for filling pipelines in in_cksum()
on Pentium I's to a trimmed Cc list in separate mail. I never committed
this because the improvement was marginal on Pentium I's, and memory has
become slower relative to CPUs since Pentium I's were new.

> > > Heh, I don't think anyone has. What actually would make sense is for
> > > someone who feels like doing ASM timing to look at our bcopy routines /
> > > etc.
> >
> > I spent a lot of time on this about 7 years ago. See ~bde/cache on
> > freefall for old versions of programs that try lots of different
> > copy/read/write checksum methods. Better hardware made the differences
> > between various methods relatively small. One can probably do better
> > (50%?) for largish (1K+ ?) buffers using SSE instructions on i386's
> > now.
>
> We definitely should have an SSE version for P3+. The 128 bit
> instructions make a big difference. And for checksumming, you can do 8
> packed adds at once.

Is it 8 * 128 bits at once? 8-way superscalar must be on the horizon if
not routine now. What is the state of the art for keeping 8 ALUs fed
with data (assuming that all the data is in the cache)?

Bruce
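
For anyone following the thread who wants to see the two approaches side
by side, here is a rough sketch in portable C. It is not the in_cksum()
code mentioned above and not anything from ~bde/cache; the function names
(cksum_fold, copy_then_cksum, copy_and_cksum) are made up for
illustration. It assumes the usual 16-bit ones-complement Internet
checksum, reads the data as big-endian byte pairs so the result does not
depend on host byte order, and works a byte pair at a time, so it says
nothing about which variant is faster on any particular CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold a wide ones-complement accumulator down to 16 bits. */
static uint16_t
cksum_fold(uint64_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return ((uint16_t)sum);
}

/* Two passes: copy first, then checksum the freshly written copy. */
static uint16_t
copy_then_cksum(void *dst, const void *src, size_t len)
{
	const uint8_t *p;
	uint64_t sum = 0;
	size_t i;

	assert(dst != NULL && src != NULL);
	memcpy(dst, src, len);
	p = dst;
	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)p[i] << 8) | p[i + 1];
	if (i < len)			/* odd trailing byte, zero padded */
		sum += (uint32_t)p[i] << 8;
	return ((uint16_t)~cksum_fold(sum));
}

/* One pass: each byte pair is stored and added as soon as it is loaded. */
static uint16_t
copy_and_cksum(void *dst, const void *src, size_t len)
{
	const uint8_t *s = src;
	uint8_t *d = dst;
	uint64_t sum = 0;
	size_t i;

	assert(dst != NULL && src != NULL);
	for (i = 0; i + 1 < len; i += 2) {
		d[i] = s[i];
		d[i + 1] = s[i + 1];
		sum += ((uint32_t)s[i] << 8) | s[i + 1];
	}
	if (i < len) {			/* odd trailing byte, zero padded */
		d[i] = s[i];
		sum += (uint32_t)s[i] << 8;
	}
	return ((uint16_t)~cksum_fold(sum));
}
```

Both functions return the same value for the same input; the only
difference is whether the data is walked once or twice. Timing the two
on buffers that do and do not fit in the L1 cache is one way to check
the memory-bandwidth argument on a given machine.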