Date:      Sun, 9 Jul 2000 22:23:41 -0600
From:      "Kenneth D. Merry" <ken@kdm.org>
To:        Alfred Perlstein <bright@wintelcom.net>
Cc:        net@FreeBSD.ORG, dg@FreeBSD.ORG, wollman@FreeBSD.ORG
Subject:   Re: argh! Re: weird things with M_EXT and large packets
Message-ID:  <20000709222341.A20360@panzer.kdm.org>
In-Reply-To: <20000709205124.A25571@fw.wintelcom.net>; from bright@wintelcom.net on Sun, Jul 09, 2000 at 08:51:24PM -0700
References:  <20000709140441.T25571@fw.wintelcom.net> <20000709205124.A25571@fw.wintelcom.net>

On Sun, Jul 09, 2000 at 20:51:24 -0700, Alfred Perlstein wrote:
> * Alfred Perlstein <bright@wintelcom.net> [000709 14:04] wrote:
> > I have some code here sending a mbuf via:
> > 
> > error = (*so->so_proto->pr_usrreqs->pru_send)(so, 0, m, 0, 0, p);
> > 
> > m is setup like so:
> > 
> >     m->m_ext.ext_free = kblob_mbuf_free;
> >     m->m_ext.ext_ref = kblob_mbuf_ref;
> >     m->m_ext.ext_buf = (void *)kb;
> >     m->m_ext.ext_size = kb->kb_len;
> >     m->m_data = (char *) kb->kb_data + uap->offset;
> >     m->m_flags |= M_EXT;
> >     m->m_pkthdr.len = m->m_len = uap->nbytes;
> > 
> > uap->nbytes is 59499.
> > 
> > It looks like the packet is being broken up or referenced to be sent,
> > but at a certain point it hangs.
> 
> I'm 99.99% sure what's going on: since I'm using normal kernel malloc
> for these external clusters, the device driver is failing to notice that
> the data crosses a page boundary and isn't breaking the data up
> properly.  Since the memory is fragmented, it's passing garbage over the
> wire that doesn't match the checksum (hence the resending of the data).
> 
> Doing a transfer over localhost works fine.
> 
> If I use contigmalloc to allocate the buffers then it works, but I would
> really rather not use contigmalloc because frankly it scares me.

I had the same problem earlier this year, except it was with pages passed
from userland into the kernel.

My solution was to walk each incoming buffer and detect boundaries between
chunks of contiguous pages.  (So I wound up with a set of physical pointers
and lengths.)
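
Roughly, that walk looks like the sketch below (hypothetical names, and it
assumes the buffer is wired kernel memory, using vtophys() for the
virtual-to-physical translation):

    #include <sys/param.h>      /* PAGE_SIZE, PAGE_MASK, MIN() */
    #include <vm/vm.h>
    #include <vm/pmap.h>        /* vtophys(); exact header varies by platform */

    struct phys_seg {
        vm_offset_t ps_addr;    /* physical start of segment */
        vm_size_t   ps_len;     /* length of segment */
    };

    /*
     * Walk a kernel virtual buffer and split it into physically contiguous
     * segments.  Returns the number of segments filled in, or -1 if the
     * caller's array is too small.  (Sketch only; no error handling.)
     */
    static int
    buf_to_phys_segs(caddr_t va, size_t len, struct phys_seg *segs, int maxsegs)
    {
        size_t chunk;
        int nsegs = 0;

        while (len > 0) {
            if (nsegs >= maxsegs)
                return (-1);
            /* Start a new segment; take at most the rest of this page. */
            segs[nsegs].ps_addr = vtophys(va);
            chunk = MIN(len, PAGE_SIZE - ((vm_offset_t)va & PAGE_MASK));
            segs[nsegs].ps_len = chunk;
            va += chunk;
            len -= chunk;
            /* Grow the segment while the next pages are physically adjacent. */
            while (len > 0 &&
                vtophys(va) == segs[nsegs].ps_addr + segs[nsegs].ps_len) {
                chunk = MIN(len, PAGE_SIZE);
                segs[nsegs].ps_len += chunk;
                va += chunk;
                len -= chunk;
            }
            nsegs++;
        }
        return (nsegs);
    }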

> Is there a specific reason the network drivers (or at least fxp)
> don't seem to check page boundaries so that discontig kmem can be
> passed to the drivers in large chunks?  I'd rather not have to
> allocate size/PAGE_SIZE mbuf headers for each send.
> 
> This may be only fxp doing this incorrectly, or I may just be totally
> off; does this all make sense?

It does make sense.  I would bet that most, if not all, network drivers
don't check for contiguous memory.

There are numerous reasons for this, but I think the bottom line is that
it's too much trouble for too little gain.

Most network devices that FreeBSD supports have an MTU of 1500 bytes or so,
and at least with standard mbufs, the drivers don't have to worry about the
chunk of data they get crossing page boundaries.

Even drivers with larger MTUs, like gigabit ethernet drivers, typically
take chains of mbufs, do a separate vtophys() on each element in the
chain, and pass it down to the card.  Again, they expect each mbuf to
point to a physically contiguous chunk of memory.
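
For illustration, a pre-busdma transmit path typically does something along
these lines (a simplified sketch, not taken from fxp or any particular
driver; the descriptor layout is made up):

    #include <sys/param.h>
    #include <sys/errno.h>      /* ENOBUFS */
    #include <sys/mbuf.h>
    #include <vm/vm.h>
    #include <vm/pmap.h>        /* vtophys(); exact header varies by platform */

    /* Made-up DMA descriptor: one physical address/length pair per buffer. */
    struct tx_desc {
        u_int32_t td_addr;
        u_int32_t td_len;
    };

    static int
    drv_encap(struct tx_desc *ring, int maxdesc, struct mbuf *m_head)
    {
        struct mbuf *m;
        int idx = 0;

        for (m = m_head; m != NULL; m = m->m_next) {
            if (m->m_len == 0)
                continue;
            if (idx >= maxdesc)
                return (ENOBUFS);
            /*
             * One vtophys() per mbuf: the driver implicitly assumes that
             * m_data through m_data + m_len is physically contiguous.
             */
            ring[idx].td_addr = vtophys(mtod(m, caddr_t));
            ring[idx].td_len = m->m_len;
            idx++;
        }
        return (0);
    }

A 59499-byte malloc'ed buffer hung off a single mbuf breaks that assumption
as soon as the underlying pages aren't physically adjacent.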

One thing to keep in mind about allocating huge chunks of memory and
passing them down the network stack is that the big chunks will get split
up, either by the TCP layer or the IP layer, into multiple pieces, each
with its own mbuf header.

The zero copy send code that Drew Gallatin wrote uses page-sized chunks to
pass things around.  With a gigabit ethernet jumbo MTU (9000 bytes), that
is very efficient on the Alpha, with its 8K page size, but less efficient
on the i386, with its 4K page size.  (Since you end up with double the
number of chunks.)

From the benchmarks I've done, increasing the chunk size from 4K to 8K on
the i386 would cut CPU utilization in half on sends over gigabit ethernet.
The problem in that instance (according to Drew) is getting the COW stuff
right for chunks of data bigger than a page.

Another thing I learned from doing benchmarks is that increasing the chunk
size to something larger than your MTU doesn't help CPU utilization much,
if at all, since the larger chunks eventually get broken up into MTU-sized
chunks.

The most efficient chunk size for adapters with larger MTUs (i.e. more than
4K) is the largest page multiple that is less than the MTU size.
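
For example, with a 9000-byte jumbo MTU that works out to 8192-byte chunks
on both the i386 (two 4K pages) and the Alpha (one 8K page); as a trivial
sketch (illustrative helper, not an existing kernel routine):

    #include <sys/param.h>      /* PAGE_SIZE */

    /* Round the MTU down to a page multiple, e.g. 9000 -> 8192. */
    static __inline u_int
    chunk_size_for_mtu(u_int mtu)
    {
        return (mtu - (mtu % PAGE_SIZE));
    }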

I'm not sure if this will work with what you're trying to do, but you could
use contigmalloc() to allocate a large chunk of memory (say multiple
megabytes in size) and then break it up into smaller chunks of memory that
are then tacked onto mbufs.
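
Very roughly, that could look like the sketch below (hypothetical names
throughout; the chunk_free()/chunk_ref() routines, reference counting on
the big region, and the free-list bookkeeping are all left out):

    #include <sys/param.h>
    #include <sys/errno.h>      /* ENOMEM */
    #include <sys/malloc.h>     /* contigmalloc(), M_DEVBUF */
    #include <sys/mbuf.h>

    #define BIGBUF_SIZE (2 * 1024 * 1024)   /* one big contiguous region */
    #define CHUNK_SIZE  8192                /* per-mbuf chunk */

    /* Hypothetical free/ref hooks, matching the m_ext callback signatures. */
    static void chunk_free(caddr_t buf, u_int size);
    static void chunk_ref(caddr_t buf, u_int size);

    static caddr_t bigbuf;

    /* Grab one physically contiguous region up front. */
    static int
    bigbuf_init(void)
    {
        bigbuf = contigmalloc(BIGBUF_SIZE, M_DEVBUF, M_NOWAIT,
            0, 0xffffffff, PAGE_SIZE, 0);
        return (bigbuf == NULL ? ENOMEM : 0);
    }

    /* Hand out the i'th chunk, attached to an mbuf as external storage. */
    static struct mbuf *
    chunk_get_mbuf(int i)
    {
        struct mbuf *m;

        MGETHDR(m, M_DONTWAIT, MT_DATA);
        if (m == NULL)
            return (NULL);
        m->m_ext.ext_buf = bigbuf + i * CHUNK_SIZE;
        m->m_ext.ext_size = CHUNK_SIZE;
        m->m_ext.ext_free = chunk_free;
        m->m_ext.ext_ref = chunk_ref;
        m->m_data = m->m_ext.ext_buf;
        m->m_flags |= M_EXT;
        /* The caller sets m_len/m_pkthdr.len once it fills the chunk. */
        return (m);
    }

The tricky part is the bookkeeping: chunks have to go back onto a free list
when the last mbuf referencing them is freed, which is what the ext_free
and ext_ref hooks are for.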

The ti(4) driver uses that approach in its stock (i.e. non-zero-copy) jumbo
receive buffer allocation code, since the type of jumbo receive buffer it
uses by default is expected to consist of one contiguous piece of memory.
(The Tigon firmware also supports another type of jumbo receive buffer with
4 S/G entries.)

Ken
-- 
Kenneth Merry
ken@kdm.org

