Date: Mon, 25 Jun 2001 03:54:53 -0700 (PDT)
From: Matt Dillon <dillon@earth.backplane.com>
To: Bruce Evans <bde@zeta.org.au>
Cc: Mikhail Teterin <mi@aldan.algebra.com>, jlemon@FreeBSD.ORG, cvs-committers@FreeBSD.ORG, cvs-all@FreeBSD.ORG
Subject: Re: Inline optimized bzero (was Re: cvs commit: src/sys/netinet tcp_subr.c)
Message-ID: <200106251054.f5PAsrp04325@earth.backplane.com>
References: <Pine.BSF.4.21.0106260024430.8175-100000@besplex.bde.org>

:I would expect the opposite.  If the bzero's in the networking code don't
:show up in the network latency benchmarks, where would they show up?  ISTR
:that a Linux hacker who made lmbench1 go faster for Linux saying that the
:bzero() at the start of the FreeBSD tcp_input() is a really stupid thing
:to do.  But I think even completely eliminating it would be just another
:micro-optimization, worth 1% in favourable cases, so you need 10 more like
:it to give a useful speedup.

I wouldn't expect any incremental change to have a noticeable effect on
something like lmbench.  From my perusal of the code, the few bzero's in
tcp/ip's critical path are only likely to save a few hundred nanoseconds
per packet, so any noticeable effect would tend to occur in a system
handling lots of simultaneous connections and lots of smaller packets.
Even then I wouldn't expect much of an effect in a single subsystem.

The other effects are going to be scattered.  In syscalls, getfh() will
be 100ns faster.  In kern_descrip.c, falloc() and fdinit() will be faster
because the structures being bzero'd are tiny.  There are a bunch of
places in netinet where small bzero()'s are in the critical path - not
just for TCP - where exercising that particular subsystem should yield
a benefit.

The main point is that the effect can only be better.  I can try to work
the kernel size down so there is no bloat at all, but right now the
average change is less than one byte per bzero call.

                                        -Matt

:...
:>    it added 6ns to the loop, which is fine, but it blew up the constant
:>    optimization and wound up adding a switch table and a dozen
:>    instructions inline (hundreds of bytes!).
:
:Yes, it's clear that alignment is not worth doing in the kernel.  Userland
:is different -- the application might have turned on alignment checking,
:or it might be poorly behaved and pass a lot of unaligned buffers.  gcc
:is primarily a userland compiler, so it's a little surprising that its
:builtins don't worry about alignment.
:
:>    I added alignment checks to i586_bzero but it ate 20ns.  Also,
:>    it should be noted that i586_bzero() as it currently stands does not
:>    do any alignment checks either - it checks only the size argument,
:>    it doesn't check the base pointer.
:
:Neither does generic_bzero().  i586_bzero() just turns itself into
:generic_bzero() for small sizes.  I'm fairly sure that I benchmarked
:this, and came to the conclusion that there is nothing significantly
:better than "rep movsl" when the size isn't known at compile time.  In
:particular, lots of jumps as in i486_bzero are actively bad.  This may
:be P5-specific (branch prediction is not very good on original Pentiums).
:
:Bruce

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe cvs-all" in the body of the message
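
For readers following the thread, the trade-off discussed above (expand small clears inline, hand everything else to the out-of-line routine, and skip alignment checks on the base pointer) can be sketched roughly as follows.  This is an illustration only, not the code from the commit under discussion: sketch_bzero(), generic_bzero_sketch(), sketch_selfcheck() and the SKETCH_INLINE_MAX cutoff of 32 bytes are all hypothetical names and values chosen for the example, and memset() stands in for the kernel's real out-of-line bzero.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cutoff: sizes at or below this are cleared inline. */
#define SKETCH_INLINE_MAX 32

/* Stand-in for the out-of-line routine; the real kernel code differs. */
void
generic_bzero_sketch(void *buf, size_t len)
{
	memset(buf, 0, len);
}

/*
 * Inline front end.  When len is a small compile-time constant, an
 * optimizing compiler may reduce the loop to a few plain stores.
 * No alignment check is made on buf; byte stores work at any address.
 */
static inline void
sketch_bzero(void *buf, size_t len)
{
	if (len <= SKETCH_INLINE_MAX) {
		unsigned char *p = buf;
		size_t i;

		for (i = 0; i < len; i++)
			p[i] = 0;
	} else {
		generic_bzero_sketch(buf, len);
	}
}

/* Exercises both the inline path and the out-of-line path. */
void
sketch_selfcheck(void)
{
	struct { int a; char b[6]; } small;
	unsigned char big[256];
	const unsigned char *sp = (const unsigned char *)&small;
	size_t i;

	memset(&small, 0xff, sizeof(small));
	memset(big, 0xff, sizeof(big));
	sketch_bzero(&small, sizeof(small));	/* small constant size */
	sketch_bzero(big, sizeof(big));		/* larger than the cutoff */
	for (i = 0; i < sizeof(small); i++)
		assert(sp[i] == 0);
	for (i = 0; i < sizeof(big); i++)
		assert(big[i] == 0);
}
```

Whether the inline path actually shrinks to a handful of stores depends on the compiler and optimization level, so any size or timing claim would need to be measured on the target rather than assumed from the sketch.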