From: Ian Lepore <freebsd@damnhippie.dyndns.org>
To: freebsd-arch@freebsd.org, freebsd-arm@freebsd.org, freebsd-mips@freebsd.org
Date: Thu, 23 Aug 2012 17:09:53 -0600
Subject: Re: Partial cacheline flush problems on ARM and MIPS
Message-ID: <1345763393.27688.578.camel@revolution.hippie.lan>
In-Reply-To: <3A08EB08-2BBF-4B0F-97F2-A3264754C4B7@bsdimp.com>
References: <1345757300.27688.535.camel@revolution.hippie.lan>
	 <3A08EB08-2BBF-4B0F-97F2-A3264754C4B7@bsdimp.com>

On Thu, 2012-08-23 at 15:50 -0600, Warner Losh wrote:
> On Aug 23, 2012, at 3:28 PM, Ian Lepore wrote:
> > A recent innocuous change to the USB driver code caused intermittent
> > errors in the umass(4) driver on ARM and MIPS platforms, and this
> 
> I think the proper solution is to segregate the DMA and non-DMA parts
> of structures so that the two never share a cache line.
> 
> I also wonder why we don't allocate the DMA memory for these
> structures separately from the non-DMA parts.  This would eliminate
> the USB_CACHE_BYTES kludge (which is CPU dependent, not arch
> dependent) and move the knowledge of this junk into the busdma layer
> where it belongs.  From my understanding of the issue, this would
> completely eliminate the problem forever!
> 
> Sharing a cacheline between something that is DMA aware and something
> that is not is just begging for trouble.  We're doing more work than
> we need to in order to support this dubious feature, and we'd be
> miles ahead if we didn't share at all.
> 
> Warner

It seems to me that what we have here is a new type of constraint on
DMA operations, and we need a way to communicate that constraint from
the part of the platform support code that knows about it to the
drivers and driver support code that need to know.  The way we
communicate DMA constraints is with a busdma tag, but right now that
tag only communicates constraints that were needed for ISA and PCI
busses, namely buffer alignment, boundary-crossing restrictions, and
exclusion regions.  Now we have a new type of constraint; I think of
it as "granularity".

In effect, we have a DMA system that can only do DMA in
cacheline-sized chunks.  Even when the IO size -- and thus the number
of "bits on the wire" -- is less than the cacheline size, at the end
of the DMA operation (which includes the software-assisted coherency
operations) the number of bytes in memory that may be modified is the
size of a cacheline.  This is because "the DMA system" is not just
the engine that moves bytes around; it's the combination of hardware
and software that works together to maintain cache coherency.

Ideally we'd find a way to communicate this new constraint using the
existing mechanism, the busdma tag, and ideally we'd not have to
change every existing call to bus_dma_tag_create() to add a new parm.
As I understand it, parent tags are now passed down through the
newbus hierarchy consistently, such that a tag at the nexus level
could describe a platform requirement such as granularity, and all
devices and the helper code they use would have access to that
constraint via inheritance from ancestors' tags.  Maybe we could have
a new flavor of bus_dma_tag_create() that takes a struct of parms, so
that existing calls wouldn't have to be changed.
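Something like the following is what I'm picturing.  To be clear,
none of these names exist in the tree today -- struct
bus_dma_tag_params, bus_dma_tag_create_params(), and the granularity
field are all invented for illustration -- and I've left out the
filter and lock arguments to keep the sketch short:

	/*
	 * Hypothetical parms-struct flavor of tag creation.  Fields
	 * left at zero would inherit from the parent tag, so a
	 * nexus-level tag could establish a platform-wide granularity
	 * and every child tag would pick it up without any driver
	 * changes.  Types come from <machine/bus.h>.
	 */
	struct bus_dma_tag_params {
		bus_size_t	alignment;	/* buffer alignment */
		bus_size_t	boundary;	/* boundary-crossing limit */
		bus_addr_t	lowaddr;	/* exclusion region bounds */
		bus_addr_t	highaddr;
		bus_size_t	maxsize;	/* max total mapping size */
		int		nsegments;	/* max s/g segments */
		bus_size_t	maxsegsz;	/* max single segment size */
		int		flags;
		bus_size_t	granularity;	/* NEW: smallest unit the
						 * hw+sw "DMA system" can
						 * touch; the cacheline
						 * size on arm/mips, 1 on
						 * coherent platforms */
	};

	int bus_dma_tag_create_params(bus_dma_tag_t parent,
	    const struct bus_dma_tag_params *parms, bus_dma_tag_t *dmat);

A driver that doesn't know or care about granularity would simply
never set the field, and would inherit whatever the platform
established at the nexus level.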
Communicating the constraint is only part of the problem; it also has
to be easy for drivers to work with that constraint, especially
drivers that are not targeted specifically at platforms with granular
DMA.  I think we can achieve a huge chunk of that purely within the
arm/mips implementations of bus_dmamem_alloc(), but even so there
would be a new conceptual limitation on using that routine: it is
specifically for allocating DMA buffers, and that means there would
be a new rule that the CPU cannot access any memory within such a
buffer while an IO operation is in progress.  I'd also like to say
there's a new rule that you cannot subdivide a buffer obtained from
bus_dmamem_alloc() into multiple buffers, or into a combination of
DMA and CPU-accessed data.  That would be bad news for the USB
subsystem, and perhaps for other drivers.  If that idea is either
impossible or particularly contentious, then I guess we'd need some
new helper routines so that a driver can subdivide the memory in a
way that doesn't violate any constraints implied by the tag used to
allocate the buffer.

Not all IO occurs using buffers obtained from bus_dmamem_alloc(), and
I doubt we can reasonably ever require that it be so.  I think the
only hope we have of handling that problem is to bounce the requests
that don't meet the granularity constraint, just as we'd have to do
if a caller-supplied buffer fell into an exclusion zone or violated
an alignment or boundary constraint.  When I've tossed this idea out
in the past there was instant resistance.  Yeah, bounce buffers are
massively inefficient, but my experience has been that most of the IO
that isn't aligned and sized to a multiple of a cacheline is small IO
(a few to a few dozen bytes).  I've never seen a case of page-sized
or larger IO requests that required partial-cacheline handling.  I'm
sure some examples exist, but they're probably more the exception
than the rule.  (And the bad performance you'd get from bouncing and
copying massive bulk data flows would be a powerful incentive to
track down the culprit and improve the driver.)

-- 
Ian
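P.S.  To make the bouncing idea concrete, the decision inside the
arm/mips map-load path might look something like the fragment below.
This is only a sketch: must_bounce() is an invented name, and I'm
assuming the tag's granularity is a power of two and that the code
lives inside the busdma implementation, where the tag struct is not
opaque.

	/*
	 * Return non-zero if a caller-supplied buffer has to bounce
	 * because it doesn't begin and end on a granularity
	 * (cacheline) boundary, and thus could share a cacheline with
	 * data the CPU is still using.  This is the same decision we
	 * already make for exclusion zones and alignment; only the
	 * trigger is new.
	 */
	static int
	must_bounce(bus_dma_tag_t dmat, bus_addr_t addr, bus_size_t len)
	{
		bus_size_t mask = dmat->granularity - 1;

		return ((addr & mask) != 0 || (len & mask) != 0);
	}

Since the requests that trip this test are almost always small, the
cost of the extra copy should be noise compared to the IO itself.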