Date: Wed, 17 Feb 1999 11:51:18 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
To: Kevin Day <toasty@home.dragondata.com>
Cc: dyson@iquest.net, tlambert@primenet.com, mike@smith.net.au, hackers@FreeBSD.ORG
Subject: Re: vm_page_zero_fill
Message-ID: <199902171951.LAA10456@apollo.backplane.com>
References: <199902171902.NAA25290@home.dragondata.com>

:The system I'm working on is an embedded, highly graphical 2D/3D product.
:These systems will not be connected to the internet, nor will anyone have
:keyboard/telnet/terminal/whatever access to them. They're about as secure as
:they're going to get, so my concerns are mostly speed over security.
:
:In looking with some logic analyzers, we're seeing that we're nearly out of
:PCI bandwidth, and we're hitting the memory very hard too. 99% of our run
:time is spent ferrying data from ram into the graphic device.
: 
:Because of the nature of the product, we're needing more and more
:'real-time' like operation. The delay from when a user does something, until
:...
:
:After things still being slower than I wanted, I pulled out a logic
:analyzer. In watching memory accesses on the analyzer, we saw a lot of
:zeroing going on, especially after exec()'ing another application. (This
:...
:
:Currently, the time spent loading/preparing the new application is a bit
:long, so I was looking at ways to shrink that down. That's where this
     Ahh.  A couple of things.  First, I presume that the amount of memory
     in the machine is not an issue... that you have enough to hold all
     the programs pretty much resident.
     In that case, simply preload the executables.  That is, rather than
     take the latency hit when the user hits a button, take the latency
     hit when the user is idle and just tell the program to 'go' ( through
     a pipe ) when the user hits the button.
     Second, if you aren't already using a Xeon with its largest L2
     cache configuration, you should probably be using a Xeon with its
     largest L2 cache configuration.  Intel cpus tend to fall on their
     face with DATA-memory-intensive applications due to their 
     undersized caches.   The undersized cache works ok for instructions
     because instructions are pretty compact, but it does not work
     well for data.
     If the box you are using does not have a 100MHz memory bus, you need
     to get one that does.
:While I don't want to get accused of not trying to figure this one out on my
:own.... Suppose I mmap a large (2MB or more) file. Should any zero'ing be
:going on when I touch those pages for the first time? From the analyzer, it
:looks like it's zeroing pages before putting what it read from the disk into
:them, but as you know, figuring out what's really going on by watching a
:logic analyzer is a form of witchcraft... If this is the case, turning this
:off would greatly help me. :)
    It should not be zeroing pages before doing full reads into them.
    That is pretty well optimized, usually.
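
Whatever the kernel does on first touch, that cost can be moved out of the critical path by touching the mapping ahead of time. The following is a minimal sketch, assuming a read-only mapping of a regular file; `map_and_prefault` is a hypothetical helper name, not something from the FreeBSD tree.

```c
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: map a file read-only and read one byte from every
 * page, so the page faults (and the disk reads behind them) happen now
 * instead of on first use.  Returns the mapping and stores its length in
 * *len_out, or returns MAP_FAILED on error or for an empty file.
 */
void *
map_and_prefault(const char *path, size_t *len_out)
{
    struct stat st;
    void *p;
    int fd;

    fd = open(path, O_RDONLY);
    if (fd == -1)
        return (MAP_FAILED);
    if (fstat(fd, &st) == -1 || st.st_size <= 0) {
        close(fd);
        return (MAP_FAILED);
    }

    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return (MAP_FAILED);

    /* Touch one byte per page; volatile keeps the reads from being elided. */
    const volatile unsigned char *bytes = p;
    size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char sink = 0;
    for (size_t off = 0; off < (size_t)st.st_size; off += pagesize)
        sink ^= bytes[off];
    (void)sink;

    *len_out = (size_t)st.st_size;
    return (p);
}
```

This only helps if there is enough memory to keep the touched pages resident until they are needed.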
    Third, Memory->PCI transfers are best done with DMA ( as you
    already know ).  For a frame store, you can eke out additional
    PCI bus speed by messing with the burst transfer length
    ( especially if the cpu is not heavily involved and can afford
    to stall a little more ).  You should be able to push
    120 MBytes/sec on a PCI bus by tuning the DMA burst.
    The PCI card should have a FIFO big enough to accommodate the
    burst, too.  If you do a large transfer to a PCI card's frame
    buffer with memcpy() ( or equivalent ), you eat double the
    memory bandwidth plus blow away the data cache on the cpu.
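
To put rough numbers on this, the arithmetic can be written as a few small helpers. This is a back-of-the-envelope sketch: the function names are hypothetical, and the factor of two for memcpy() is simply the rule of thumb stated above, not a measurement.

```c
/* Payload rate in bytes/sec for a stream of uncompressed frames. */
double
stream_bytes_per_sec(unsigned width, unsigned height,
    unsigned bytes_per_pixel, unsigned fps)
{
    return ((double)width * height * bytes_per_pixel * fps);
}

/*
 * Memory traffic for a CPU copy, using the rule of thumb above that
 * memcpy() eats double the memory bandwidth of the payload.
 */
double
memcpy_memory_traffic(double payload_bytes_per_sec)
{
    return (2.0 * payload_bytes_per_sec);
}

/* Fraction of a bus budget (e.g. 120 MBytes/sec) used by the payload. */
double
bus_utilization(double payload_bytes_per_sec, double bus_bytes_per_sec)
{
    return (payload_bytes_per_sec / bus_bytes_per_sec);
}
```

As an example with an assumed frame format (the thread does not give one): 640x480 at 2 bytes per pixel and 60 fps is 36,864,000 bytes/sec, about 31% of a 120 MBytes/sec budget (taking a MByte as 10^6 bytes), and roughly 73.7 MBytes/sec of memory traffic if the copy is done with memcpy().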
    Fourth, if you are doing direct frame store from disk to a
    PCI card, you may wish to consider building a custom piece
    of hardware / firmware to actually use the SCSI bus to 
    transfer the data directly ( i.e. put the frame store *on*
    the SCSI bus and have it master the data directly from the
    drives without host intervention ).  This is a rather more
    complex solution.
    Fifth - double-wide (64 bit wide) PCI busses or AGP busses.
    AGP can certainly be done on a PC.  I'm not sure what is 
    available in regards to 64 bit PCI busses.  However, both
    these options are departures from the norm and may not be
    cost effective.
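
For comparing these bus options, the theoretical peak is just width times clock times transfers per clock. The helper below is a sketch with a hypothetical name; real throughput is lower because of arbitration, address phases and wait states.

```c
/*
 * Theoretical peak of a parallel bus in bytes/sec: data width in bits,
 * clock in Hz, and data transfers per clock (1 for PCI and AGP 1x,
 * 2 for AGP 2x).
 */
double
bus_peak_bytes_per_sec(unsigned width_bits, double clock_hz,
    unsigned transfers_per_clock)
{
    return ((width_bits / 8.0) * clock_hz * transfers_per_clock);
}
```

With nominal 33 MHz and 66 MHz clocks this gives 132 MBytes/sec for 32-bit PCI, 264 MBytes/sec for 64-bit PCI at the same clock, 264 MBytes/sec for AGP 1x and 528 MBytes/sec for AGP 2x.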
					-Matt
					Matthew Dillon 
					<dillon@backplane.com>
:(If I'm not being clear enough, imagine mmap'ing a movie, and memcpy'ing it
:into a frame buffer at 60fps, to get an idea of the kind of data I'm going
:through)
:
:I hope this sort of explained my application, although I'm sure there are
:arguments either way if this is really going to help me or not.
:
:Thanks again,
:
:Kevin
