Date:      Fri, 4 Sep 1998 19:26:53 -0400 (EDT)
From:      Bill Paul <wpaul@skynet.ctr.columbia.edu>
To:        hackers@FreeBSD.ORG
Subject:   Questions for networking gurus
Message-ID:  <199809042326.TAA13218@skynet.ctr.columbia.edu>

I'm trying to work out a packet reception strategy for the RealTek
ethernet controller. The RealTek chip is a bus master, which normally
is a good thing because with the right sort of interface you can eliminate
buffer copies within the driver code. Unfortunately, the RealTek is designed
in such a way that it's almost impossible to avoid buffer copies even with
bus master capability. I think there is a way to do it, but I'm not sure
if it will work correctly as it depends a lot on how the higher protocol
layers work, and I'm a bit fuzzy on some of the details.

The RealTek chip works roughly as follows. The driver allocates a buffer
area of 8K, 16K, 32K or 64K in size, depending on the setting of certain
bits in the receive config register. This buffer resides in system
memory, independent of the chip's packet FIFO memory. The driver 
then gives the chip the base address of this buffer area and activates 
the receiver.

When the chip receives a packet, it writes a 32-bit header value into
the driver's buffer area and then copies the packet immediately after
the header. (The header contains the packet length and some status bits.) 
The chip rounds up its address pointer to a 32-bit boundary and then 
writes the next header and packet, and so on until it hits the end of the 
buffer area. The driver is supposed to pass the packets along to the 
higher level protocols and advance a special register to indicate how 
much of the receive buffer has been processed. It is possible for the 
chip to wrap a packet from the end of the receive area back to the 
beginning again assuming the driver has freed up space at the beginning 
of the buffer area.
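
To make the layout above concrete, here is a minimal user-space sketch in
C of walking such a receive area. It is an illustration only, not driver
code. The header layout (status in the low 16 bits, length in the high 16
bits, little-endian) and the names rx_frame, ring_read32 and ring_next are
assumptions made for the example; the description above only says that the
header holds a length and some status bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RX_HDR_LEN 4u   /* the 32-bit header the chip writes before each packet */

/* Hypothetical decoded view of one received frame. */
struct rx_frame {
    uint16_t status;    /* status bits (assumed: low half of the header) */
    uint16_t len;       /* packet length (assumed: high half of the header) */
    size_t   data_off;  /* offset of the first packet byte within the area */
};

/* Read a little-endian 32-bit value at `off`, wrapping at the end of the area. */
static uint32_t ring_read32(const uint8_t *ring, size_t ring_size, size_t off)
{
    uint32_t v = 0;
    for (unsigned i = 0; i < 4; i++)
        v |= (uint32_t)ring[(off + i) % ring_size] << (8 * i);
    return v;
}

/*
 * Decode the frame whose header starts at `off` and return the offset of
 * the next header: header + packet, rounded up to a 32-bit boundary and
 * wrapped back to the start of the area if it runs past the end.
 */
static size_t ring_next(const uint8_t *ring, size_t ring_size, size_t off,
                        struct rx_frame *f)
{
    assert(ring_size % 4 == 0 && off < ring_size);

    uint32_t hdr = ring_read32(ring, ring_size, off);
    f->status   = (uint16_t)(hdr & 0xffffu);
    f->len      = (uint16_t)(hdr >> 16);
    f->data_off = (off + RX_HDR_LEN) % ring_size;

    size_t next = (off + RX_HDR_LEN + f->len + 3u) & ~(size_t)3u;
    return next % ring_size;
}
```

A driver loop would call ring_next() repeatedly, hand each frame upward,
and then write the returned offset to the chip's "processed" register.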

This is sort of an oddball mechanism given the way other devices work.
With most bus master chips, you have a descriptor mechanism that allows
the driver to pre-allocate individual packet buffers (in our case mbuf
clusters) and provide the buffer addresses to the chip. The chip can
then DMA incoming packets straight into the mbufs, and the driver can
just pass them along. This completely eliminates the need for buffer
copies, in exchange for some wasted space since an mbuf cluster buffer
is 2048 bytes in size and an ethernet frame is never larger than about
1500 bytes.

You can't do that with the RealTek chip though. Here, you have only
one contiguous receive buffer, and you can't predict the sizes or
offsets of the packets within the buffer. The simplest way to deal with
this is to use m_devget() to copy packet data out of the receive buffer
and into mbufs and hand those off to the upper layers.
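
The cost of the copying approach can be seen in a small user-space sketch.
This is not m_devget() itself, which fills an mbuf chain; ring_copy_out is
a hypothetical stand-in that copies one packet into a flat buffer and
handles the case where the packet wraps past the end of the receive area.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy `len` bytes starting at offset `off` of the receive area into
 * `dst`. If the packet runs past the end of the area, the remainder is
 * taken from the start. Every byte is touched by the CPU once, which is
 * exactly the work a bus master design is supposed to avoid.
 */
static void ring_copy_out(const uint8_t *ring, size_t ring_size, size_t off,
                          uint8_t *dst, size_t len)
{
    assert(off < ring_size && len <= ring_size);

    size_t first = ring_size - off;      /* bytes available before the end */
    if (first > len)
        first = len;

    memcpy(dst, ring + off, first);            /* part before the wrap */
    memcpy(dst + first, ring, len - first);    /* wrapped part, may be empty */
}
```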

This is silly though, because it defeats the purpose of the bus master
capability, which is to avoid having the CPU perform buffer copies.

One possible alternative was proposed to me recently by Matthew Dodd,
which is to use the external data pointers in mbufs. The idea works
like this:

- The chip receives a packet and copies it into the RX buffer region.
- The driver allocates a single mbuf and sets the M_EXT flag in its
  header.
- The driver determines the start address of the packet within the buffer
  region and puts that address into the external data pointer of the
  mbuf.
- The driver sets the mbuf's data length to the length of the packet and
  specifies a free() routine to use to deallocate the external storage
  when the mbuf is released. (We don't want to actually free the
  buffer space though, so the free routine can be a no-op.)
- Lastly, the driver passes the mbuf to ether_input() for processing.
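
The steps above can be sketched in user-space C. This does not use the
kernel's struct mbuf; fake_mbuf, FAKE_M_EXT, rx_wrap_frame and fake_m_free
are hypothetical stand-ins that only mimic the idea of an external data
pointer with a free routine. The sketch also assumes that a frame which
wraps around the end of the receive area is rejected, since it is not
contiguous in memory and could not be described by a single pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define FAKE_M_EXT 0x1  /* stand-in for the M_EXT flag */

/* Hypothetical stand-in for an mbuf that points at external storage. */
struct fake_mbuf {
    int      flags;
    uint8_t *data;      /* start of packet data */
    size_t   len;       /* length of packet data */
    uint8_t *ext_buf;   /* external storage the data lives in */
    size_t   ext_size;
    void   (*ext_free)(uint8_t *buf, size_t size);
};

/* The receive area is never really freed, so the free routine is a no-op. */
static void rx_ext_free_noop(uint8_t *buf, size_t size)
{
    (void)buf;
    (void)size;
}

/* Wrap the packet at [off, off + len) without copying it. */
static struct fake_mbuf *rx_wrap_frame(uint8_t *ring, size_t ring_size,
                                       size_t off, size_t len)
{
    if (off >= ring_size || len > ring_size - off)
        return NULL;    /* invalid, or wraps past the end of the area */

    struct fake_mbuf *m = calloc(1, sizeof(*m));
    if (m == NULL)
        return NULL;

    m->flags    = FAKE_M_EXT;
    m->ext_buf  = ring + off;
    m->ext_size = len;
    m->ext_free = rx_ext_free_noop;
    m->data     = m->ext_buf;
    m->len      = len;
    return m;           /* a real driver would now hand this to ether_input() */
}

/* Release the stand-in mbuf, calling the external free routine first. */
static void fake_m_free(struct fake_mbuf *m)
{
    if (m == NULL)
        return;
    if ((m->flags & FAKE_M_EXT) && m->ext_free != NULL)
        m->ext_free(m->ext_buf, m->ext_size);
    free(m);
}
```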

This avoids a buffer copy by using an mbuf to 'encapsulate' the packet
data as an external data region, but it creates a problem: once the driver
ties a portion of the receive buffer to an mbuf, it can't allow that
portion of the buffer to be overwritten by the chip's DMA engine until
the mbuf has been released. Otherwise, the packet data will be corrupted
while another part of the kernel is fiddling with it.

It may be possible to get around this by pre-allocating several receive 
buffer areas: if all of the space in the first region has been tied to 
mbufs and remains unreleased, the driver can reload the chip's receive 
buffer address register with the address of another buffer. Assuming 
that all of the space in the first region will be released eventually, it 
should be possible to provide room for the chip to DMA new frames while 
allowing the protocols time to process existing frames in previous buffers.
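
One way to picture the bookkeeping this would need is a small pool of
receive areas, each with a count of mbufs that still point into it. The
following user-space C sketch is only an illustration under that
assumption; rx_pool, RX_NAREAS and the function names are hypothetical,
and the point where a real driver would reload the chip's receive buffer
address register is marked with a comment rather than implemented.

```c
#include <assert.h>
#include <stddef.h>

#define RX_NAREAS 4     /* hypothetical number of pre-allocated areas */

struct rx_area {
    unsigned outstanding;   /* mbufs still pointing into this area */
};

struct rx_pool {
    struct rx_area area[RX_NAREAS];
    int            active;  /* area the chip is currently filling */
};

/* A frame in the active area was tied to an mbuf and passed upward. */
static void rx_pool_ref(struct rx_pool *p)
{
    p->area[p->active].outstanding++;
}

/* Called from the mbuf's free routine when the protocols are done. */
static void rx_pool_unref(struct rx_pool *p, int idx)
{
    assert(idx >= 0 && idx < RX_NAREAS);
    assert(p->area[idx].outstanding > 0);
    p->area[idx].outstanding--;
}

/*
 * Find another area with no outstanding mbufs and make it active.
 * Returns its index, or -1 if every other area is still in use, in
 * which case the driver would have to fall back to copying or drop.
 */
static int rx_pool_switch(struct rx_pool *p)
{
    for (int i = 1; i < RX_NAREAS; i++) {
        int idx = (p->active + i) % RX_NAREAS;
        if (p->area[idx].outstanding == 0) {
            p->active = idx;
            /* real driver: reload the receive buffer address register here */
            return idx;
        }
    }
    return -1;
}
```

The size of RX_NAREAS is exactly what the second question below is about:
it has to cover the longest time the protocols hold on to an mbuf.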

However, this brings up the following questions:

1) Does using external data regions with mbufs like this actually work?
   I know it works with mbuf clusters, but that's sort of a special case.
   I remember reading somewhere, possibly in TCP/IP Illustrated Vol. 2,
   that there were bugs that prevented this from working correctly 100%
   of the time except for the mbuf cluster case. Has this been fixed, or
   are there still pitfalls?

2) What's the longest time that an mbuf chain with received packet data
   will survive inside the kernel? The driver has to allocate enough
   memory so that it can continue handling data from the chip while
   waiting for previous buffers to be freed by the protocols, but if
   an mbuf can get hung up inside the protocols for a very long time
   (or worse, be locked indefinitely), then the buffer allocation
   would be ridiculously large. This would outweigh the benefit of
   avoiding copies.

So, does this scheme sound sensible, or should I just swallow my pride
and settle for using m_devget()? It would be nice to find a way to
actually squeeze some decent performance out of this gawdawful device
just to spite the designers. If anybody has tried to do something like
this before, or is familiar with the guts of the BSD networking code,
I'd appreciate any insights.

-Bill

-- 
=============================================================================
-Bill Paul            (212) 854-6020 | System Manager, Master of Unix-Fu
Work:         wpaul@ctr.columbia.edu | Center for Telecommunications Research
Home:  wpaul@skynet.ctr.columbia.edu | Columbia University, New York City
=============================================================================
 "It is not I who am crazy; it is I who am mad!" - Ren Hoek, "Space Madness"
=============================================================================
