Date:      Mon, 25 Jun 2001 03:54:53 -0700 (PDT)
From:      Matt Dillon <dillon@earth.backplane.com>
To:        Bruce Evans <bde@zeta.org.au>
Cc:        Mikhail Teterin <mi@aldan.algebra.com>, jlemon@FreeBSD.ORG, cvs-committers@FreeBSD.ORG, cvs-all@FreeBSD.ORG
Subject:   Re: Inline optimized bzero (was Re: cvs commit: src/sys/netinet tcp_subr.c)
Message-ID:  <200106251054.f5PAsrp04325@earth.backplane.com>
References:   <Pine.BSF.4.21.0106260024430.8175-100000@besplex.bde.org>

:I would expect the opposite.  If the bzero's in the networking code don't
:show up in the network latency benchmarks, where would they show up?  ISTR
:that a Linux hacker who made lmbench1 go faster for Linux saying that the
:bzero() at the start of the FreeBSD tcp_input() is a really stupid thing
:to do.  But I think even completely eliminating it would be just another
:micro-optimization, worth 1% in favourable cases, so you need 10 more like
:it to give a useful speedup.

    I wouldn't expect any incremental change to have a noticeable effect
    on something like lmbench.   From my perusal of the code, the few
    bzero's in tcp/ip's critical path are only likely to save a few 
    hundred nanoseconds per packet, so any noticeable effect would 
    tend to occur in a system handling lots of simultaneous connections
    and lots of smaller packets.  Even then I wouldn't expect much of
    an effect in a single subsystem.  The other effects are going to be
    scattered.  In syscalls, getfh() will be 100ns faster.  In
    kern_descrip.c, falloc() and fdinit() will be faster because 
    the structures being bzero'd are tiny.  There are a bunch of places
    in netinet where small bzero()'s are in the critical path - not just
    for TCP - where exercising that particular subsystem should yield
    a benefit.
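
    As a rough illustration of the kind of inline, constant-size bzero
    being discussed (a minimal sketch, not the committed code): the names
    sketch_bzero, out_of_line_bzero and SKETCH_INLINE_MAX are hypothetical,
    the threshold is arbitrary, and __builtin_constant_p is a GCC/Clang
    extension.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cutoff; the real threshold would come from benchmarking. */
#define SKETCH_INLINE_MAX 32

/* Stand-in for the ordinary out-of-line bzero() call. */
static void
out_of_line_bzero(void *buf, size_t len)
{
	memset(buf, 0, len);
}

/*
 * Sketch only: when the length is a small compile-time constant, let the
 * compiler expand the clear into a few stores at the call site; otherwise
 * fall back to the out-of-line routine.  Without optimization
 * __builtin_constant_p() evaluates to 0 and every call takes the fallback,
 * which is still correct.
 */
static inline void
sketch_bzero(void *buf, size_t len)
{
	if (__builtin_constant_p(len) && len <= SKETCH_INLINE_MAX)
		__builtin_memset(buf, 0, len);
	else
		out_of_line_bzero(buf, len);
}
```

    A call such as sketch_bzero(&st, sizeof(st)) on a tiny structure is the
    case that can collapse to a handful of stores; a call with a run-time
    length simply goes through the function call as before.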

    The main point is that the effect can only be better.  I can try to
    work the kernel size down so there is no bloat at all, but right now
    the average change is less than one byte per bzero call.

						-Matt

:...
:>     it added 6ns to the loop, which is fine, but it blew up the constant
:>     optimization and wound up adding a switch table and a dozen
:>     instructions inline (hundreds of bytes!).
:
:Yes, it's clear that alignment is not worth doing in the kernel.  Userland
:is different -- the application might have turned on alignment checking,
:or it might be poorly behaved and pass a lot of unaligned buffers.  gcc
:is primarily a userland compiler, so it's a little surprising that its
:builtins don't worry about alignment.
:
:>     I added alignment checks to i586_bzero but it ate 20ns.  Also,
:>     it should be noted that i586_bzero() as it currently stands does not
:>     do any alignment checks either - it checks only the size argument,
:>     it doesn't check the base pointer.
:
:Neither does generic_bzero().  i586_bzero() just turns itself into
:generic_bzero() for small sizes.  I'm fairly sure that I benchmarked
:this, and came to the conclusion that there is nothing significantly
:better than "rep movsl" when the size isn't known at compile time.  In
:particular, lots of jumps as in i486_bzero are actively bad.  This may
:be P5-specific (branch prediction is not very good on original Pentiums).
:
:Bruce
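
The difference between checking only the size argument and also checking
the base pointer, as discussed in the quoted text above, can be sketched
roughly as follows.  This is an illustration under stated assumptions, not
the i586_bzero logic: the function names and the SKETCH_FAST_MIN threshold
are hypothetical, and word alignment is taken to mean 4-byte alignment.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical size cutoff below which the plain routine is used. */
#define SKETCH_FAST_MIN 256

/* Size-only decision: looks at the length, never at the pointer. */
static inline int
fast_path_size_only(size_t len)
{
	return len >= SKETCH_FAST_MIN;
}

/*
 * Size-plus-alignment decision: additionally requires the base pointer
 * to sit on a 4-byte boundary.  This extra test is the sort of per-call
 * cost that the thread reports as measurable.
 */
static inline int
fast_path_size_and_alignment(const void *buf, size_t len)
{
	return len >= SKETCH_FAST_MIN &&
	    ((uintptr_t)buf & (sizeof(uint32_t) - 1)) == 0;
}
```

With the size-only test an unaligned buffer of sufficient length still takes
the fast path; with the second test it would be routed to the fallback.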




To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe cvs-all" in the body of the message



