From owner-freebsd-mips@FreeBSD.ORG  Mon Aug 20 19:01:38 2012
From: John Baldwin <jhb@freebsd.org>
To: Ian Lepore, scottl@freebsd.org, Justin Gibbs
Cc: arm@freebsd.org, mips@freebsd.org, Peter Jeremy
Date: Mon, 20 Aug 2012 14:34:16 -0400
Subject: Re: On-stack allocation of DMA S/G lists
Message-Id: <201208201434.16538.jhb@freebsd.org>
In-Reply-To: <201208071406.45172.jhb@freebsd.org>
References: <20120703111753.GB72292@server.rulingia.com>
	<1344355782.1128.186.camel@revolution.hippie.lan>
	<201208071406.45172.jhb@freebsd.org>

[ Adding Scott Long and Justin Gibbs for their input ]

On Tuesday, August 07, 2012 2:06:44 pm John Baldwin wrote:
> On Tuesday, August 07, 2012 12:09:42 pm Ian Lepore wrote:
> > On Mon, 2012-08-06 at 10:26 -0400, John Baldwin wrote:
> > > On Thursday, July 12, 2012 8:26:05 am John Baldwin wrote:
> > > > On Sunday, July 08, 2012 7:05:16 am Peter Jeremy wrote:
> > > > > BTW(2): Whilst studying busdma_machdep.c for arm and mips, I've
> > > > > noticed they appear to potentially allocate substantial kernel stack
> > > > > under some conditions as several bus_dma(9) functions include:
> > > > > 	bus_dma_segment_t dm_segments[dmat->nsegments];
> > > > > What prevents this overflowing the kernel stack?
> > > >
> > > > That does seem dubious.  x86 stores the array in the tag instead.
> > >
> > > I have an untested patch to change bus-dma on arm and mips to allocate a
> > > dynamic S/G list in each DMA tag on first use instead of using on-stack
> > > allocation (which I think is rather bogus).  Can folks review and test
> > > this patch please?  Thanks.
> > >
> > > http://www.FreeBSD.org/~jhb/patches/arm_mips_dynamic_dma_segs.patch
> >
> > I'm worried about changing a per-mapping-call resource to a per-dma-tag
> > resource here.  What prevents the situation where you have two
> > bus_dmamap_load() calls in progress at the same time using different
> > buffers but the same tag?
> >
> > I can't find anything in the docs that indicates you have to provide
> > external locking of the tag for map load/unload calls, or that even
> > implies the tag can be modified by a mapping operation.  The lockfunc
> > stuff related to creating the tag is documented as being used only
> > during a deferred callback.
>
> Actually, I do think it is implicit that you won't do concurrent loads
> on a DMA tag, though that may not be obvious.  Keep in mind that this
> is what x86's bus_dma has always done.  For storage drivers you certainly
> can't do this or you risk completing I/O requests out-of-order, which can
> break an upper-layer assumption in a filesystem.  Note that all other
> platforms do this as well; only arm and mips allocate on the stack.

One thing I should note is that it is assumed that the lock specified by
the lockfunc stuff is held when you invoke bus_dmamap_load*().
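To make that concrete, the usual idiom looks something like the sketch
below.  (This is only an illustration; the foo_* names, softc layout, and
tag parameters are made up rather than taken from any real driver.)  The
mutex you pass to bus_dma_tag_create() as the lockfuncarg for
busdma_lock_mutex() is the same mutex you hold around the load call, so
the callback always runs under that lock:

#include <sys/param.h>
#include <sys/bus.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <machine/bus.h>

struct foo_softc {
	device_t	dev;
	struct mtx	mtx;	/* initialized elsewhere with mtx_init() */
	bus_dma_tag_t	dtag;
	bus_dmamap_t	dmap;
};

static void
foo_load_cb(void *arg, bus_dma_segment_t *segs, int nseg, int error)
{
	struct foo_softc *sc = arg;

	mtx_assert(&sc->mtx, MA_OWNED);
	/* Program segs[0 .. nseg-1] into the device's S/G descriptors. */
}

static int
foo_start_io(struct foo_softc *sc, void *buf, bus_size_t len)
{
	int error;

	error = bus_dma_tag_create(bus_get_dma_tag(sc->dev),
	    1, 0,			/* alignment, boundary */
	    BUS_SPACE_MAXADDR,		/* lowaddr */
	    BUS_SPACE_MAXADDR,		/* highaddr */
	    NULL, NULL,			/* filter, filterarg */
	    MAXPHYS,			/* maxsize */
	    32,				/* nsegments */
	    PAGE_SIZE,			/* maxsegsz */
	    0,				/* flags */
	    busdma_lock_mutex,		/* lockfunc */
	    &sc->mtx,			/* lockfuncarg: same lock as below */
	    &sc->dtag);
	if (error != 0)
		return (error);
	error = bus_dmamap_create(sc->dtag, 0, &sc->dmap);
	if (error != 0)
		return (error);

	mtx_lock(&sc->mtx);		/* lock held across the load */
	error = bus_dmamap_load(sc->dtag, sc->dmap, buf, len,
	    foo_load_cb, sc, 0);
	mtx_unlock(&sc->mtx);
	return (error);
}

If the load cannot complete right away, bus_dmamap_load() returns
EINPROGRESS and the callback is invoked later; in that deferred case
busdma brackets the callback with the tag's lockfunc (here
busdma_lock_mutex() on the same mutex), which is why the load-time lock
and the "deferred callback" lock end up being one and the same.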
> > The existing code seems to go out of its way to avoid modifying the tag
> > during a mapping operation.  For example, it decides at tag creation
> > time whether any bounce pages might ever be needed for the tag, and if
> > so it pre-sets a bounce zone in the tag, then at mapping time the bounce
> > zone is protected with its own lock when it gets modified.  To me this
> > feels like a way to specifically avoid the need to lock or modify the
> > tag during a mapping operation.
> >
> > Assuming that all of the foregoing is moot for some reason I've
> > overlooked, then on a purely implementation level, could all the
> > duplicated code to allocate the array when necessary be moved into
> > bus_dmamap_load_buffer(), triggered by a NULL 'segs' pointer?
>
> Nope, bus_dmamap_load() doesn't know which of M_NOWAIT / M_WAITOK is
> appropriate to use.
>
> > And just for the record, looking at the problem from an even more
> > distant vantage... is there really a problem with stack-allocating the
> > segments?  On a 64-bit arch the struct is like 16 bytes.  Typical usage
> > is to allocate a tag allowing 1 or just a few segments.  Is anyone
> > really going to create a tag specifying hundreds of segments that would
> > overflow the stack?  If they try, wouldn't failing the tag create be
> > good enough?
>
> I/O devices can allocate tags with several S/G elements.  An mfi(4) tag
> on i386 would use a 256 byte segments array (512 on amd64).  That's not
> entirely trivial.  It would be worse if you couldn't depend on
> dmat->nsegments and had to always allocate the full size.  Presumably
> though we require C99 at that point (and it requires that?).

Also, we are moving to a model of using a larger MAXPHYS (some places
already modify FreeBSD to bump MAXPHYS up), and we are looking at using
direct dispatch for disk I/O rather than always bouncing requests through
g_up/g_down, to gain greater concurrency in the disk I/O subsystem.  Both
of these are going to place greater stress on the kernel stack and argue
for less kernel stack use, not more.  Scott noted that a 1MB I/O request
can require up to 256 segments, which is 4k of segment list.  That would
be too much space on the stack.

One other option is to move the S/G lists into the dma map structure
instead of being in the tag, but to date that has not been done because
it would introduce a significant amount of memory overhead.

Also, I don't see a great need to allow concurrent loads within a single
tag.  Typically you allocate a single tag for each "thing" that can accept
a stream of DMA requests (so one tag for each command queue on an HBA, or
for each descriptor ring on a NIC).  For example, in the case of a multiq
NIC you want to use separate tags so that each tag uses a per-ring lock.
But typically that stream of DMA requests requires serialization, so you
would never end up wanting to have concurrent loads on the same tag.

-- 
John Baldwin
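As a quick sanity check of the 4k figure above, here is a small standalone
computation (illustrative only; it assumes 4KB pages and the 64-bit layout
of bus_dma_segment_t, i.e. an 8-byte ds_addr plus an 8-byte ds_len):

#include <stdio.h>

int
main(void)
{
	const unsigned io_size   = 1024 * 1024;	/* 1MB I/O request */
	const unsigned page_size = 4096;	/* worst case: one segment per page */
	const unsigned seg_size  = 8 + 8;	/* 64-bit ds_addr + ds_len */
	unsigned nsegs = io_size / page_size;	/* 256 segments */

	printf("%u segments -> %u bytes of S/G list\n",
	    nsegs, nsegs * seg_size);
	return (0);
}

This prints "256 segments -> 4096 bytes of S/G list", matching the figure
quoted from Scott.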