Date:      Thu, 23 Aug 2012 17:26:19 -0600
From:      Warner Losh <imp@bsdimp.com>
To:        Ian Lepore <freebsd@damnhippie.dyndns.org>
Cc:        freebsd-arm@freebsd.org, freebsd-mips@freebsd.org, freebsd-arch@freebsd.org
Subject:   Re: Partial cacheline flush problems on ARM and MIPS
Message-ID:  <FD8DC82C-AD3B-4EBC-A625-62A37B9ECBF1@bsdimp.com>
In-Reply-To: <1345763393.27688.578.camel@revolution.hippie.lan>
References:  <1345757300.27688.535.camel@revolution.hippie.lan> <3A08EB08-2BBF-4B0F-97F2-A3264754C4B7@bsdimp.com> <1345763393.27688.578.camel@revolution.hippie.lan>


On Aug 23, 2012, at 5:09 PM, Ian Lepore wrote:

> On Thu, 2012-08-23 at 15:50 -0600, Warner Losh wrote:
>> On Aug 23, 2012, at 3:28 PM, Ian Lepore wrote:
>>> A recent innocuous change to the USB driver code caused intermittent
>>> errors in the umass(4) driver on ARM and MIPS platforms, and this
>>
>> I think the proper solution is to segregate DMA and non-DMA parts of structures so that you don't have both sharing a cache line.
>>
>> I also wonder why we don't allocate the DMA memory for these structures separately from the non-DMA parts.  This would eliminate the USB_CACHE_BYTES kludge (which is CPU dependent, not arch dependent) and move the knowledge of this junk into the busdma layer where it belongs.  From my understanding of the issue, this would completely eliminate the problem forever!
>>
>> Sharing a cacheline between something that is DMA aware and something that isn't is just begging for trouble.  We're doing more work than we need to in order to support this dubious feature, and we'd be miles ahead if we could avoid sharing at all.
>>
>> Warner
>>
>
> It seems to me that what we have here is a new type of constraint on DMA
> operations, and we need a way to communicate that constraint from the
> part of the platform support code that knows about it to the drivers and
> driver support code that needs to know.  The way we communicate DMA
> constraints is with a busdma tag, but right now that tag only
> communicates constraints that were needed for ISA and PCI busses, namely
> buffer alignment, boundary-crossing restrictions, and exclusion
> regions.
>
> Now we have a new type of constraint; I think of it as "granularity".
> In effect, we have a DMA system that can only do DMA in cacheline-sized
> chunks.  Even when the IO size -- and thus the number of "bits on the
> wire" -- is less than the cacheline size, at the end of the DMA
> operation (which includes the software-assisted coherency operations)
> the number of bytes in memory that may be modified is the size of a
> cacheline.  This is because "the DMA system" is not just the engine that
> moves bytes around, it's the combination of hardware and software that
> work together to maintain cache coherency.

But this isn't new.  It is an alignment requirement, which carries with it an implicit size requirement.  If you enforce the alignment, and force all 'sub buffers' to have this alignment, you don't need the new thing.
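
Something like this layout is what I have in mind (just a sketch; the struct and field names are made up, but CACHE_LINE_SIZE and __aligned() are what the tree already provides):

#include <sys/param.h>		/* CACHE_LINE_SIZE, __aligned() */

struct xfer_desc {
	/* Only the DMA engine reads or writes this part. */
	struct {
		uint32_t	status;
		uint32_t	length;
	} hw __aligned(CACHE_LINE_SIZE);

	/*
	 * CPU-only bookkeeping.  Aligning this member too forces the
	 * compiler to pad hw out to a full cacheline, so the CPU never
	 * dirties a line the hardware may also be writing.
	 */
	struct {
		void	*cookie;
		int	 flags;
	} sw __aligned(CACHE_LINE_SIZE);
};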

> Ideally we'd find a way to communicate this new constraint using the
> existing mechanism, the busdma tag, and ideally we'd not have to change
> every existing call to bus_dma_tag_create() to add a new parm.  As I
> understand it, parent tags are now passed down through the newbus
> hierarchy consistently, such that a tag at the nexus level could
> describe a platform requirement such as granularity, and all devices and
> the helper code they use will have access to that constraint via
> inheritance from ancestors' tags.  Maybe we could have a new flavor of
> bus_dma_tag_create() that takes a struct of parms, and existing calls
> wouldn't have to be changed.

Wouldn't a simpler solution be to just make this alignment requirement be part of the global parent tag on these platforms, and to make sure all drivers on those platforms use it and don't cop out and pass NULL?
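
On the driver side that's just the normal idiom, something like this sketch (the function name is invented, error handling and includes trimmed; tag constraints combine with the parent's, so a cacheline alignment in the root tag is inherited automatically):

static int
mydev_dma_init(device_t dev, bus_dma_tag_t *tagp)
{
	return (bus_dma_tag_create(
	    bus_get_dma_tag(dev),	/* parent tag: never NULL */
	    1,				/* no alignment beyond the parent's */
	    0,				/* no boundary restriction */
	    BUS_SPACE_MAXADDR,		/* lowaddr */
	    BUS_SPACE_MAXADDR,		/* highaddr */
	    NULL, NULL,			/* filter, filterarg */
	    MAXBSIZE,			/* maxsize */
	    1,				/* nsegments */
	    MAXBSIZE,			/* maxsegsz */
	    0,				/* flags */
	    NULL, NULL,			/* lockfunc, lockfuncarg */
	    tagp));
}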

> Communicating the constraint is only part of the problem; it also has to
> be easy for drivers to work with that constraint, especially drivers
> that are not targeted specifically at platforms with granular DMA.  I
> think we can achieve a huge chunk of that purely within the arm/mips
> implementation of bus_dmamem_alloc(), but even so there would be a new
> conceptual limitation on using that routine: it is specifically for
> allocating DMA buffers, and that means that there will be a new rule
> that the CPU cannot access any memory within that buffer while an IO
> operation is in progress.

I don't think we should pander to drivers that don't know how to do DMA properly.  We get it almost right in bus_dma now.  However, going from almost right to completely right is hard and we keep uncovering edge cases that bite us.  Wouldn't it be better to eliminate all these weird edge cases?

> I'd also like to say there's a new rule that you cannot subdivide a
> buffer obtained from bus_dmamem_alloc() into multiple buffers, or into a
> combination of DMA and CPU-accessed data.  That would be bad news for
> the USB subsystem, and perhaps other drivers.  If this idea is either
> impossible or particularly contentious, then I guess we'd need some new
> helper routines so that a driver can subdivide the memory in a way that
> doesn't violate any constraints implied by the tag used to allocate the
> buffer.

When the USB subsystem went into the tree, this was one of the criticisms that was ignored.  It has come back to bite us time and time again.  Perhaps it is time to fix it once and for all.

> Not all IO occurs using buffers obtained from bus_dmamem_alloc(), and I
> doubt we can reasonably ever require that it be so.

True, but the I/O that's not in memory from bus_dmamem_alloc is page aligned.

> I think the only
> hope we have of handling that problem is to bounce the requests that
> don't meet the granularity constraint, just as we'd have to do if the
> caller-supplied buffer fell into an exclusion zone or violated an
> alignment or boundary constraint.  When I've tossed this idea out in the
> past there was instant resistance.  Yeah, bounce buffers are massively
> inefficient, but my experience has been that most of the IO that isn't
> aligned and sized to a multiple of a cacheline is small IO (a few to a
> few dozen bytes).  I've never seen a case of page-sized or larger IO
> requests that required partial-cacheline handling.  I'm sure some
> examples exist, but they're probably more the exception than the rule.
> (And the bad performance you'd get from bouncing and copying massive
> bulk data flow would be a powerful incentive to track down the culprit
> and improve the driver.)

That's also the underlying idea in the bus_dma stuff.  You give the constraints, you get the buffers, and if you have a buffer that's outside the constraints it gets bounced.  That's why the sync operations are on DMA items, not on cache line items.  While cache lines are one issue, memory placement can be another.  Floppy drives, for example, couldn't DMA past the first 16MB, and if you have a buffer passed in that's outside of that, it bounces.  If this bouncing produces slower code, then the drivers should be updated to have better alignment.

The USB subsystem is making assumptions about the underlying cache mechanisms that aren't really true.  Ideally, we could get it to stop doing that.

Warner


