From: Andrew Gallatin <gallatin@cs.duke.edu>
Date: Thu, 20 Jun 2002 12:25:58 -0400 (EDT)
To: Bosko Milekic
Cc: "Kenneth D. Merry", current@FreeBSD.ORG, net@FreeBSD.ORG
Subject: Re: new zero copy sockets snapshot

Bosko Milekic writes:
 > > Years ago, I used Wollman's MCLBYTES > PAGE_SIZE support (introduced
 > > in rev 1.20 of uipc_mbuf.c) and it seemed to work OK then.  But having
 > > 16K clusters is a huge waste of space. ;).
 >
 >   Since then, the mbuf allocator in -CURRENT has totally changed.  It is
 > still possible to provide allocations of > PAGE_SIZE buffers, however
 > they will likely not map physically contiguous memory.  If you happen to
 > have a device that doesn't support scatter/gather for DMA, then these
 > buffers will be broken for it (I know that if_ti is not a problem).

Actually, it will be a problem for if_ti.  The original Tigon 1s didn't
support s/g DMA.  I think we should just not support jumbo frames on
Tigon 1s.

 >   The other issue is that the mbuf allocator then, as well as the new
 > mbuf allocator, uses the kmem_malloc() interface that was also used by
 > malloc() to perform allocations of wired-down pages.  I am not sure if
 > you'll be able to play those tricks where you unmap and remap the page
 > that is allocated for you once it comes out of the mbuf allocator.  Do
 > you think it would work?

I don't think so, but I haven't read the code carefully and I don't
know for certain.  However, my intent was to use a jumbo mbuf type for
copyin and to clean up the existing infrastructure for drivers with
brain-dead firmware, not to use a new 10K cluster as a framework for
zero-copy.
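Just to illustrate what "broken for devices without s/g DMA" means in
practice: a driver handed one of those > PAGE_SIZE buffers would have
to verify that the backing pages are physically adjacent before it
could program a single DMA segment.  Rough sketch only; the function
name is mine and nothing like it exists in the tree:

    /*
     * Sketch: walk a buffer page by page and check whether its
     * backing pages are physically contiguous.  A device that can't
     * do s/g DMA needs every page after the first to be physically
     * adjacent to the previous one.
     */
    #include <sys/param.h>
    #include <vm/vm.h>      /* for vtophys */
    #include <vm/pmap.h>    /* for vtophys */

    static int
    buf_is_phys_contig(vm_offset_t va, size_t len)
    {
            vm_offset_t v, prev, cur;

            prev = vtophys(trunc_page(va));
            for (v = trunc_page(va) + PAGE_SIZE; v < va + len;
                v += PAGE_SIZE) {
                    cur = vtophys(v);
                    if (cur != prev + PAGE_SIZE)
                            return (0);  /* needs s/g DMA (or a copy) */
                    prev = cur;
            }
            return (1);
    }

A driver without s/g support would have to fall back to a copy (or
refuse the buffer) whenever this returns 0.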
 > > Do you think it would be feasible to glue in a new jumbo (10K?)
 > > allocator on top of the existing mbuf and mcl allocators using the
 > > existing mechanisms and the existing MCLBYTES > PAGE_SIZE support
 > > (but broken out into separate functions and macros)?
 >
 >   Assuming that you can still play those VM tricks with the pages spit
 > out by mb_alloc (kern/subr_mbuf.c in -CURRENT), then this wouldn't be a
 > problem at all.  It's easy to add a new fixed-size type allocation to
 > mb_alloc.  In fact, it would be beneficial.  mb_alloc uses per-CPU
 > caches and also makes mbuf and cluster allocations share the same
 > per-CPU lock.  What could be done is that the jumbo buffer allocations
 > could share the same lock as well (since they will likely usually be
 > allocated right after an mbuf is).  This would give us jumbo-cluster
 > support, but it would only be useful for devices clued enough to break
 > up the cluster into PAGE_SIZE chunks and do scatter/gather.  For most
 > worthy gigE devices, I don't think this should be a problem.

I'm a bit worried about other devices.  Traditionally, mbufs have never
crossed page boundaries, so most drivers never bother to check for a
transmit mbuf crossing a page boundary.  Using physically discontiguous
mbufs could lead to a lot of subtle data corruption.

One question.  I've observed some really anomalous behaviour under
-stable with my Myricom GM driver (2Gb/s + 2Gb/s link speed, dual 1GHz
PIII).  When I use 4K mbufs for receives, the best speed I see is about
1300Mb/sec.  However, if I use private 9K physically contiguous buffers
I see 1850Mb/sec (iperf TCP).

The obvious conclusion is that there's a lot of overhead in setting up
the DMA engines, but that's not the case; we have a fairly quick chain
DMA engine.  I've provided a "control" by breaking my contiguous
buffers down into 4K chunks, so that I do the same number of DMAs in
both cases, and I still see ~1850Mb/sec for the 9K buffers.

A coworker suggested that the problem is that when doing copyouts to
userspace, the PIII does speculative reads and loads the cache with the
next page.  With discontiguous buffers, though, the next chunk lives at
a totally different address, so the prefetched page is wasted and we
effectively take 2x the number of cache misses we need to.  Does that
sound reasonable to you?

I need to try malloc'ing virtually contiguous but physically
discontiguous buffers and see if I get the same (good) performance (a
rough sketch of that test is appended below).

Cheers,

Drew
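P.S.  The test I have in mind is roughly the following; this is only a
sketch, and the function name, malloc type, and the 9K size are
illustrative rather than actual GM driver code:

    /*
     * Sketch: allocate the 9K receive buffers two ways and compare
     * iperf numbers.  contigmalloc() gives wired, physically
     * contiguous, page-aligned memory; plain malloc() gives wired
     * memory that is only guaranteed to be virtually contiguous.
     */
    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/malloc.h>
    #include <vm/vm.h>
    #include <vm/vm_extern.h>   /* for contigmalloc */

    #define JUMBO_LEN       (9 * 1024)

    static void *
    gm_alloc_rx_buf(int phys_contig)
    {
            if (phys_contig)
                    return (contigmalloc(JUMBO_LEN, M_DEVBUF, M_NOWAIT,
                        0, 0xffffffffUL, PAGE_SIZE, 0));
            return (malloc(JUMBO_LEN, M_DEVBUF, M_NOWAIT));
    }

The DMA chaining stays the same in both cases (same number of 4K
chunks), so any difference in throughput should come down to how the
pages lie physically.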