Date: Wed, 15 Oct 2008 04:08:24 -0600
From: Scott Long <scottl@samsco.org>
To: Bruce Evans <brde@optusnet.com.au>
Cc: svn-src-head@FreeBSD.org, Marius Strobl <marius@FreeBSD.org>,
    src-committers@FreeBSD.org, svn-src-all@FreeBSD.org
Subject: Re: svn commit: r183896 - head/sys/dev/bge
Message-ID: <48F5C118.3040209@samsco.org>
In-Reply-To: <20081015184833.N43215@delplex.bde.org>
References: <200810142028.m9EKShoL015514@svn.freebsd.org>
    <48F5053D.7070705@samsco.org> <20081015184833.N43215@delplex.bde.org>
Bruce Evans wrote:
> On Tue, 14 Oct 2008, Scott Long wrote:
>
>> Marius Strobl wrote:
>>> Author: marius
>>> Date: Tue Oct 14 20:28:42 2008
>>> New Revision: 183896
>>> URL: http://svn.freebsd.org/changeset/base/183896
>>>
>>> Log:
>>>   Use bus_{read,write}_4(9) instead of bus_space_{read,write}_4(9)
>>>   in order to get rid of the bus space handle and tag in the softc.
>>
>> Has anyone looked at the generated code from this interface switch
>> and compared it to what was getting generated previously?  Way back
>> when,
>
> I just looked at the source code.  This seems to be only a small
> pessimization since it involves just one extra indirection and
> related loss of optimization possibilities: (sc->sc_bt, sc->sc_bh)
> becomes r = sc->sc_br; (r->r_bustag, r->r_bushandle).  In theory, the
> compiler could optimize by caching the tag and handle in registers in
> either case so that only 1 extra indirection is needed per function,
> but I've never seen that being done enough to make much difference.
>
> However, some drivers (e.g., ata) were too stupid to cache sc_bt and
> sc_bh, and converting these to use bus_nonspace*() was a relatively
> large optimization: r = sc->sc_br; (rman_get_bustag(r),
> rman_get_bushandle(r)) became r = sc->sc_br; (r->r_bustag,
> r->r_bushandle).  Since rman_get_bustag() and rman_get_bushandle()
> have never been inline, calling both of them on every bus space
> access gave enormous code space and instruction count bloat and
> corresponding (relatively tiny) runtime bloat.  The instructions
> normally run so much faster than the i/o that you can do hundreds or
> thousands of them per i/o before noticing the runtime bloat, and the
> instruction count bloat here is only about a factor of 20.

It'll be a long time before disk i/o is fast enough to saturate the CPU
to the point where this issue becomes important.  Even the fastest SAS
hardware that I've written a driver for struggles to get over 100,000
transactions/sec before saturating the attached disks, leaving plenty
of CPU remaining.

>> including <machine/bus_memio.h> made bus_space_read|write_4()
>> compile into a direct memory access on machines that supported it.
>> The dubious removal of bus_memio.h and bus_pio.h took away that
>> benefit, and I'm afraid that it's only getting worse now.  Bus
>> writes to card memory are still very important to high-performance
>> devices and shouldn't be pessimized in the name of simpler-looking
>> C code.
>
> I hate the bloat too, but it rarely matters (see above).  I mainly
> noticed it using ddb.  Where nice drivers are optimized to use a
> single instruction per i/o (or perhaps 2 with minimal bus space
> bloat), the ones that use rman_get*() on every i/o take 20+, and with
> a primitive debugger like ddb it is painful to even skip over all the
> function calls.

It's really the opportunity for gratuitous D-cache line misses that
worries me, and to a lesser extent the I-cache bloat, though that's
probably minimal.  In the past I've been able to demonstrate
considerably better micro-performance, and modestly better
macro-performance, in the bge and em drivers from optimizing cache
efficiency in busdma.
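To make the comparison concrete, the three access styles under
discussion look roughly like this (a sketch only; foo_softc and its
fields are made-up names, but bus_space_read_4(9), rman_get_bustag(9),
rman_get_bushandle(9), and the new bus_read_4(9) are the real KPIs):

  #include <sys/param.h>
  #include <sys/systm.h>
  #include <machine/bus.h>
  #include <machine/resource.h>
  #include <sys/bus.h>
  #include <sys/rman.h>

  struct foo_softc {
          struct resource    *sc_res; /* from bus_alloc_resource_any() */
          bus_space_tag_t     sc_bt;  /* cached at attach time */
          bus_space_handle_t  sc_bh;
  };

  /* Cached tag/handle: one indirection per access. */
  static __inline uint32_t
  foo_read_cached(struct foo_softc *sc, bus_size_t off)
  {
          return (bus_space_read_4(sc->sc_bt, sc->sc_bh, off));
  }

  /* rman_get_*() on every access: two extra function calls each time. */
  static __inline uint32_t
  foo_read_rman(struct foo_softc *sc, bus_size_t off)
  {
          return (bus_space_read_4(rman_get_bustag(sc->sc_res),
              rman_get_bushandle(sc->sc_res), off));
  }

  /* bus_read_4(9): tag and handle fetched from the struct resource. */
  static __inline uint32_t
  foo_read_new(struct foo_softc *sc, bus_size_t off)
  {
          return (bus_read_4(sc->sc_res, off));
  }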
> Which devices have fast enough i/o for the extra indirection to
> matter?  bge tries hard to avoid all PCI access in time-critical
> code.  I think we reduced the PCI accesses to 1 PCI write per
> interrupt.  Most device accesses for bge use host memory which is
> DMAed to/from by the hardware, so bus space doesn't apply.  This type
> of access seems to be the best or only way for a PCI device to go
> fast enough (though it is still too slow on at least i386 since the
> DMA always causes cache misses).
>
> Bruce

I'm thinking of 10Gb ethernet drivers, which have already been shown to
fully saturate the host CPU long before wire bandwidth is filled.  I'm
only familiar with cxgb and mxge; cxgb does a single bus write per
transaction, while mxge cheats and aliases the card memory registers
via rman_get_virtual(), making the single-instruction access automatic
=-)

Scott
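P.S.  For the archives, the mxge-style aliasing trick looks roughly
like this (a sketch under my own naming; foo_attach() and FOO_DOORBELL
are made up, but bus_alloc_resource_any(9) and rman_get_virtual(9) are
real).  The catch is that a plain store bypasses the bus_space barrier
and byte-swapping machinery, so the driver has to get ordering and
endianness right itself:

  #include <sys/param.h>
  #include <sys/systm.h>
  #include <sys/endian.h>
  #include <machine/bus.h>
  #include <machine/resource.h>
  #include <sys/bus.h>
  #include <sys/rman.h>
  #include <dev/pci/pcireg.h>

  #define FOO_DOORBELL    0x40    /* hypothetical register offset */

  static int
  foo_attach(device_t dev)
  {
          struct resource *res;
          volatile uint32_t *regs;
          int rid;

          /* Map the memory BAR as usual. */
          rid = PCIR_BAR(0);
          res = bus_alloc_resource_any(dev, SYS_RES_MEMORY, &rid,
              RF_ACTIVE);
          if (res == NULL)
                  return (ENXIO);

          /*
           * Alias the mapped BAR through rman_get_virtual() so a
           * register access compiles down to one load or store.
           */
          regs = (volatile uint32_t *)rman_get_virtual(res);

          /* A doorbell write is now a single store instruction. */
          regs[FOO_DOORBELL / sizeof(uint32_t)] = htole32(1);

          return (0);
  }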