Date:      Thu, 27 Mar 2003 19:07:15 +1100 (EST)
From:      Bruce Evans <bde@zeta.org.au>
To:        Mike Silbersack <silby@silby.com>
Cc:        Nate Lawson <nate@root.org>
Subject:    Re: Checksum/copy (was: Re: cvs commit: src/sys/netinet ip_output.c)
Message-ID:  <20030327180247.D1825@gamplex.bde.org>
In-Reply-To: <20030326225530.G2075@odysseus.silby.com>
References:  <Pine.BSF.4.21.0303260956250.27748-100000@root.org> <20030326225530.G2075@odysseus.silby.com>

On Wed, 26 Mar 2003, Mike Silbersack wrote:

> On Wed, 26 Mar 2003, Nate Lawson wrote:
>
> > I don't want to hijack the thread too much, but has thought gone into a
> > combined checksum and copy function?  The first mention I can remember of
> > this is in RFC 817 p. 19-20.

Is this RFC old?  Combined checksum and copy hasn't been a large
optimization since L1 caches became large enough: to a first
approximation, everything is dominated by memory bandwidth, and another
pass to calculate the checksum is free because copying left all the
data in the L1 cache.
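
The two approaches being compared can be sketched in C as follows.  This
is only an illustration, not the FreeBSD in_cksum or bcopy code: the
names fold16, copy_then_sum and copy_and_sum are hypothetical, and the
value computed is the 16-bit ones'-complement sum described in RFC 1071,
left uncomplemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold a wide accumulator down to a 16-bit ones'-complement sum. */
static uint16_t fold16(uint64_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}

/*
 * Two passes: copy first, then sum the destination.  For a buffer
 * that fits in the L1 cache the second pass reads cached data.
 */
static uint16_t copy_then_sum(void *dst, const void *src, size_t len)
{
    const uint8_t *p;
    uint64_t sum = 0;
    size_t i;

    assert(dst != NULL && src != NULL);
    memcpy(dst, src, len);
    p = dst;
    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)p[i] << 8) | p[i + 1];
    if (i < len)                /* odd trailing byte, padded with zero */
        sum += (uint32_t)p[i] << 8;
    return fold16(sum);
}

/* One pass: each 16-bit word is copied and added as it is read. */
static uint16_t copy_and_sum(void *dst, const void *src, size_t len)
{
    uint8_t *d = dst;
    const uint8_t *s = src;
    uint64_t sum = 0;
    size_t i;

    assert(dst != NULL && src != NULL);
    for (i = 0; i + 1 < len; i += 2) {
        d[i] = s[i];
        d[i + 1] = s[i + 1];
        sum += ((uint32_t)s[i] << 8) | s[i + 1];
    }
    if (i < len) {              /* odd trailing byte, padded with zero */
        d[i] = s[i];
        sum += (uint32_t)s[i] << 8;
    }
    return fold16(sum);
}
```

Both functions return the same sum and leave the same bytes in dst; the
only difference is whether the source is traversed once or the
destination is traversed a second time.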

> Heh, I don't think anyone has.  What actually would make sense is for
> someone who feels like doing ASM timing to look at our bcopy routines /
> etc.

I spent a lot of time on this about 7 years ago.  See ~bde/cache on
freefall for old versions of programs that try lots of different
copy/read/write checksum methods.  Better hardware made the differences
between various methods relatively small.  One can probably do better
(50%?) for largish (1K+ ?) buffers using SSE instructions on i386's
now.

> On my Mobile Celeron, a for (i = 0; i < max; i++) array[i]=0 runs
> faster than bzero.  :(

Saved data from my benchmarks show that bzero (stosl) was OK on 486's,
poor on original Pentiums, OK on K6-1's, best by far on second generation
Celerons (ones like PII) and poor on Athlon XP's (but not as relatively
bad as on original Pentiums).  The C loop could easily be competitive
with hand-unrolled asm that uses the same instruction to access memory
(no SSE etc) for large buffers, but I would expect it to be slower for
small buffers since it does an unnecessarily large number of instructions
per memory access.  But maybe these get pipelined perfectly so that
everything is limited by memory, while stosl has extra limits.
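
A rough way to repeat this kind of comparison is sketched below.  It is
an assumption-laden illustration, not the benchmark code from
~bde/cache: zero_loop, zero_memset and time_zero are hypothetical names,
memset() stands in for bzero() (whether it uses stosl depends on the C
library), and clock() is a coarse timer.  Note that an optimizing
compiler may turn the C loop into a memset() call, so the generated
code should be checked before trusting any numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

typedef void (*zero_fn)(uint32_t *array, size_t max);

/* The plain C loop from the quoted message. */
static void zero_loop(uint32_t *array, size_t max)
{
    size_t i;

    for (i = 0; i < max; i++)
        array[i] = 0;
}

/* Library routine, used here as a stand-in for bzero(). */
static void zero_memset(uint32_t *array, size_t max)
{
    memset(array, 0, max * sizeof(array[0]));
}

/* Return the CPU time in seconds taken by reps calls of fn. */
static double time_zero(zero_fn fn, uint32_t *array, size_t max,
    unsigned long reps)
{
    clock_t start;
    unsigned long r;

    assert(fn != NULL && array != NULL);
    start = clock();
    for (r = 0; r < reps; r++)
        fn(array, max);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_zero() with both functions over a range of buffer sizes
(from well inside the L1 cache to well beyond it) should show whether
the difference is confined to small buffers or persists when memory
bandwidth dominates.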

Bruce


