From owner-freebsd-mips@FreeBSD.ORG  Thu Aug 23 23:26:26 2012
Return-Path: <owner-freebsd-mips@FreeBSD.ORG>
Delivered-To: freebsd-mips@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 4A6831065670
	for <freebsd-mips@freebsd.org>; Thu, 23 Aug 2012 23:26:26 +0000 (UTC)
	(envelope-from imp@bsdimp.com)
Received: from mail-pb0-f54.google.com (mail-pb0-f54.google.com
	[209.85.160.54])
	by mx1.freebsd.org (Postfix) with ESMTP id 1292B8FC0A
	for <freebsd-mips@freebsd.org>; Thu, 23 Aug 2012 23:26:25 +0000 (UTC)
Received: by pbbrp2 with SMTP id rp2so2404540pbb.13
	for <freebsd-mips@freebsd.org>; Thu, 23 Aug 2012 16:26:25 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=google.com; s=20120113;
	h=sender:subject:mime-version:content-type:from:in-reply-to:date:cc
	:content-transfer-encoding:message-id:references:to:x-mailer
	:x-gm-message-state;
	bh=huWGlQ/K+J9q23N2Dy+i8d3okVdSWb7iIk3+N02V218=;
	b=YfKNPahQCrou2U4Q1U4stX2TJDFxz4BYJ0KvY5YiRw4/auKAcOx9i1gYq1M+2qFl9K
	6HGNMnrm2CnP2EJzUeqHpR7BJZzBXcQoGp6HZRR0G/vCt9L76tM366zas//VX+fE6QXO
	xGw0oYwgRvzXrZL0UhUyubitzzddSUO2kvJixCeFF3npI4xBmteYQSlgy9BkmdzobaU/
	nis2VPvdWUN7keQyw2dOiuDllA7Bvtb+EFYRvha6cu/hnfRjUe79oEo2ZZs03eK5GooI
	J6exfluNdq3sQToZ9Ak0iRdVJk52OCRdwMLVb95CDlCCbv26MwBu72c19I/tyuKosf/2
	bs6g==
Received: by 10.68.196.193 with SMTP id io1mr8509341pbc.17.1345764385480;
	Thu, 23 Aug 2012 16:26:25 -0700 (PDT)
Received: from [10.30.101.53] ([209.117.142.2])
	by mx.google.com with ESMTPS id pj10sm6923195pbb.46.2012.08.23.16.26.24
	(version=TLSv1/SSLv3 cipher=OTHER);
	Thu, 23 Aug 2012 16:26:25 -0700 (PDT)
Sender: Warner Losh <wlosh@bsdimp.com>
Mime-Version: 1.0 (Apple Message framework v1084)
Content-Type: text/plain; charset=us-ascii
From: Warner Losh <imp@bsdimp.com>
In-Reply-To: <1345763393.27688.578.camel@revolution.hippie.lan>
Date: Thu, 23 Aug 2012 17:26:19 -0600
Content-Transfer-Encoding: quoted-printable
Message-Id: <FD8DC82C-AD3B-4EBC-A625-62A37B9ECBF1@bsdimp.com>
References: <1345757300.27688.535.camel@revolution.hippie.lan>
	<3A08EB08-2BBF-4B0F-97F2-A3264754C4B7@bsdimp.com>
	<1345763393.27688.578.camel@revolution.hippie.lan>
To: Ian Lepore <freebsd@damnhippie.dyndns.org>
X-Mailer: Apple Mail (2.1084)
X-Gm-Message-State: ALoCoQlS+8VNNlDUQuomr0fYnz3Q0EHVIKHoR3CRrawN4rXayM8iLbnKpNsVNF2p+tIkERBOaY8X
Cc: freebsd-arm@freebsd.org, freebsd-mips@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: Partial cacheline flush problems on ARM and MIPS
X-BeenThere: freebsd-mips@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Porting FreeBSD to MIPS <freebsd-mips.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-mips>,
	<mailto:freebsd-mips-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-mips>
List-Post: <mailto:freebsd-mips@freebsd.org>
List-Help: <mailto:freebsd-mips-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-mips>,
	<mailto:freebsd-mips-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 23 Aug 2012 23:26:26 -0000


On Aug 23, 2012, at 5:09 PM, Ian Lepore wrote:

> On Thu, 2012-08-23 at 15:50 -0600, Warner Losh wrote:=20
>> On Aug 23, 2012, at 3:28 PM, Ian Lepore wrote:
>>> A recent innocuous change to the USB driver code caused intermittant
>>> errors in the umass(4) driver on ARM and MIPS platforms, and this
>>=20
>> I think the proper solution is to segregate DMA and non-DMA parts of =
structures so that you don't have both sharing a cache line.
>>=20
>> I also wonder why we don't allocate the DMA memory for these =
structures separately from the non-DMA parts.  This would eliminate the =
USB_CACHE_BYTES kludge (which is CPU dependent, not arch dependent) and =
move the knowledge of this junk into busdma layer where it belongs.  =
=46rom my understanding of the issue, this would completely eliminate =
the problem forever!
>>=20
>> Sharing a cacheline between something that is DMA aware and something =
that is just begging for trouble.  We're  doing more work than we need =
to to support this dubious feature and we'd be miles ahead if we could =
not share at all.
>>=20
>> Warner
>>=20
>=20
> It seems to me that what we have here is a new type of constraint on =
DMA
> operations, and we need a way to communicate that constraint from the
> part of the platform support code that knows about it to the drivers =
and
> driver support code that needs to know.  The way we communicate DMA
> constraints is with a busdma tag, but right now that tag only
> communicates constraints that were needed for ISA and PCI busses, =
namely
> buffer alignment, boundary-crossing restrictions, and exclusion
> regions. =20
>=20
> Now we have a new type of constraint, I think of it as "granularity".
> In effect, we have a DMA system that can only do DMA in cacheline =
sized
> chunks.  Even when the IO size -- and thus the number of "bits on the
> wire" -- is less than the cacheline size, at the end of the DMA
> operation (which includes the software-assisted coherency operations)
> the number of bytes in memory that may be modified is the size of a
> cacheline.  This is because "the DMA system" is not just the engine =
that
> moves bytes around, it's the combination of hardware and software that
> work together to maintain cache coherency.

But this isn't new.  It is an alignment requirement, which carries with =
it an implicit size requirement.  If you enforce the alignment, and =
force all 'sub buffers' to have this alignment, you don't need the new =
thing.

> Ideally we'd find a way to communicate this new constraint using the
> existing mechanism, the busdma tag, and ideally we'd not have to =
change
> every existing call to bus_dma_tag_create() to add a new parm.  As I
> understand it, parent tags are now passed down through the newbus
> hierarchy consistantly, such that a tag at the nexus level could
> describe a platform requirement such as granularity, and all devices =
and
> the helper code they use will have access to that constraint via
> inheritance from ancestors' tags.  Maybe we could have a new flavor of
> bus_dma_tag_create() that takes a struct of parms, and existing calls
> wouldn't have to be changed.

Wouldn't a simpler solution be to just make this alignment requirement =
be part of the global parent tag on these platforms and to make sure all =
drivers on those platforms use it and don't cop-out and pass NULL?

> Communicating the constraint is only part of the problem; it also has =
to
> be easy for drivers to work with that constraint, especially drivers
> that are not targeted specifically at platforms with granular DMA.  I
> think we can achieve a huge chunk of that purely within the arm/mips
> implementation of bus_dmamem_alloc(), but even so there would be a new
> conceptual limitation on using that routine: it is specifically for
> allocating DMA buffers, and that means that there will be a new a rule
> that the CPU cannot access any memory within that buffer while an IO
> operation is in progress.

I don't think we should pander to drivers that don't know how to do DMA =
properly.  We get it almost right in bus_dma now.  However, going from =
almost right to completely right is hard and we keep uncovering edge =
cases that bite us.  Wouldn't it be better to eliminate all these weird =
edge cases?

> I'd also like to say there's a new rule that you cannot subdivide a
> buffer obtained from bus_dmamem_alloc() into multiple buffers, or into =
a
> combination of DMA and CPU-accessed data.  That would be bad news for
> the USB subsystem, and perhaps other drivers.  If this idea is either
> impossible or particularly contentious, then I guess we'd need some =
new
> helper routines so that a driver can subdivide the memory in a way =
that
> doesn't violate any constraints implied by the tag used to allocate =
the
> buffer.

When the USB subsystem went into the tree, this was one of the =
criticisms that was ignored.  It has come back to bite us time and time =
again.  Perhaps it is time to fix it once and for all.

> Not all IO occurs using buffers obtained from bus_dmamem_alloc(), and =
I
> doubt we can reasonably ever require that it be so. =20

True, but the I/O that's not in memory from bus_dmamem_alloc is page =
aligned.

> I think the only
> hope we have of handling that problem is to bounce the requests that
> don't meet the granularity constraint, just as we'd have to do if the
> caller-supplied buffer fell into an exclusion zone or violated an
> alignment or boundary constraint.  When I've tossed this idea out in =
the
> past there was instant resistance.  Yeah, bounce buffers are massively
> inefficient, but my experience has been that most of the IO that isn't
> aligned and sized to a multiple of a cacheline is small IO (a few to a
> few dozen bytes).  I've never seen a case of page-sized or larger IO
> requests that required partial-cacheline handling.  I'm sure some
> examples exist, but they're probably more the exception than the rule.
> (And the bad performance you'd get from bouncing and copying massive
> bulk data flow would be a powerful incentive to track down the culprit
> and improve the driver.)

That's also the underlying idea in the bus_dma stuff.  You give the =
constraints, you get the buffers and if you have a buffer that's outside =
the constraints it gets bounced.  That's why the sync operations on on =
DMA items, not on cache line items.  While cache lines are one issue, =
memory placement can be another.  Floppy drives, for example, couldn't =
DMA past the first 16MB and if you have a buffer passed in that's =
outside of that, it bounces.  If this bouncing produces slower code, =
then the drivers should be updated to have better alignment.

The USB subsystem is making assumptions about the underlying cache =
mechanisms that aren't really true.  Ideally, we could get it to stop =
doing that.

Warner=