Date: Mon, 25 Jun 2001 03:54:53 -0700 (PDT)
From: Matt Dillon <dillon@earth.backplane.com>
To: Bruce Evans <bde@zeta.org.au>
Cc: Mikhail Teterin <mi@aldan.algebra.com>, jlemon@FreeBSD.ORG, cvs-committers@FreeBSD.ORG, cvs-all@FreeBSD.ORG
Subject: Re: Inline optimized bzero (was Re: cvs commit: src/sys/netinet tcp_subr.c)
Message-ID: <200106251054.f5PAsrp04325@earth.backplane.com>
References: <Pine.BSF.4.21.0106260024430.8175-100000@besplex.bde.org>

:I would expect the opposite. If the bzero's in the networking code don't
:show up in the network latency benchmarks, where would they show up? ISTR
:a Linux hacker who made lmbench1 go faster for Linux saying that the
:bzero() at the start of the FreeBSD tcp_input() is a really stupid thing
:to do. But I think even completely eliminating it would be just another
:micro-optimization, worth 1% in favourable cases, so you need 10 more like
:it to give a useful speedup.
I wouldn't expect any incremental change to have a noticeable effect
on something like lmbench. From my perusal of the code, the few
bzero's in tcp/ip's critical path are only likely to save a few
hundred nanoseconds per packet, so any noticeable effect would
tend to occur in a system handling lots of simultaneous connections
and lots of smaller packets. Even then I wouldn't expect much of
an effect in a single subsystem. The other effects are going to be
scattered. In syscalls, getfh() will be 100ns faster. In
kern_descrip.c, falloc() and fdinit() will be faster because
the structures being bzero'd are tiny. There are a bunch of places
in netinet where small bzero()'s are in the critical path - not just
for TCP - where exercising that particular subsystem should yield
a benefit.
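The trade-off described above, where small clears whose size is known at compile time are expanded inline and everything else still goes to the out-of-line routine, can be sketched in C for GCC or Clang. This is not the code from the actual commit: `fast_bzero`, `slow_bzero` and the 64-byte `INLINE_BZERO_MAX` cutoff are hypothetical names and values chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cutoff: constant sizes at or below this are expanded inline. */
#define INLINE_BZERO_MAX 64

/* Hypothetical out-of-line routine standing in for the kernel's bzero(). */
static void slow_bzero(void *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    memset(buf, 0, len);
}

/*
 * If len is a compile-time constant and small, let the compiler emit a few
 * inline stores via __builtin_memset(); otherwise call the out-of-line
 * routine. __builtin_constant_p() does not evaluate its argument.
 */
#define fast_bzero(buf, len)                                       \
    ((__builtin_constant_p(len) && (len) <= INLINE_BZERO_MAX)      \
        ? (void)__builtin_memset((buf), 0, (len))                  \
        : slow_bzero((buf), (len)))
```

When `len` is constant, the macro evaluates it twice (once in the comparison and once in the chosen branch), so it should not be handed an expression with side effects. A real kernel version would also have to keep the inline expansion small enough that the per-call size change stays near zero.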
The main point is that the effect can only be positive. I can try to
work the kernel size down so there is no bloat at all, but right now
the average change is less than one byte per bzero call.
-Matt
:...
:> it added 6ns to the loop, which is fine, but it blew up the constant
:> optimization and wound up adding a switch table and a dozen
:> instructions inline (hundreds of bytes!).
:
:Yes, it's clear that alignment is not worth doing in the kernel. Userland
:is different -- the application might have turned on alignment checking,
:or it might be poorly behaved and pass a lot of unaligned buffers. gcc
:is primarily a userland compiler, so it's a little surprising that its
:builtins don't worry about alignment.
:
:> I added alignment checks to i586_bzero but it ate 20ns. Also,
:> it should be noted that i586_bzero() as it currently stands does not
:> do any alignment checks either - it checks only the size argument,
:> it doesn't check the base pointer.
:
:Neither does generic_bzero(). i586_bzero() just turns itself into
:generic_bzero() for small sizes. I'm fairly sure that I benchmarked
:this, and came to the conclusion that there is nothing significantly
:better than "rep movsl" when the size isn't known at compile time. In
:particular, lots of jumps as in i486_bzero are actively bad. This may
:be P5-specific (branch prediction is not very good on original Pentiums).
:
:Bruce
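The behaviour described in the quoted exchange, a size-only dispatch where small requests fall straight through to a simple generic loop and the base pointer's alignment is never examined, can be sketched in portable C. The function names and the 256-byte `BIG_BZERO_MIN` threshold are hypothetical, and the large-block path is just `memset()` here rather than a model of what i586_bzero() actually does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical threshold; the real cutoff used by i586_bzero() is not given here. */
#define BIG_BZERO_MIN 256

/*
 * Clear 4-byte words first, then the remaining tail bytes, in the spirit of
 * a "rep" store loop. Only the length is looked at; buf's alignment is not.
 */
static void generic_bzero_sketch(void *buf, size_t len)
{
    unsigned char *p = buf;
    const uint32_t zero = 0;

    while (len >= sizeof(zero)) {
        memcpy(p, &zero, sizeof(zero));  /* word store, safe even if unaligned */
        p += sizeof(zero);
        len -= sizeof(zero);
    }
    while (len > 0) {
        *p++ = 0;
        len--;
    }
}

/* Placeholder for the large-block path (not modeled here). */
static void big_bzero_sketch(void *buf, size_t len)
{
    memset(buf, 0, len);
}

/* Size-only dispatch: small requests go straight to the generic routine. */
static void i586_bzero_sketch(void *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (len < BIG_BZERO_MIN)
        generic_bzero_sketch(buf, len);
    else
        big_bzero_sketch(buf, len);
}
```

Adding a base-pointer alignment check would mean one more test and branch on `(uintptr_t)buf & 3` before the word loop; that is the extra work the thread measured as costing more than it saves for small kernel clears.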
