Date: Thu, 23 Aug 2012 17:26:19 -0600
From: Warner Losh <imp@bsdimp.com>
To: Ian Lepore <freebsd@damnhippie.dyndns.org>
Cc: freebsd-arm@freebsd.org, freebsd-mips@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: Partial cacheline flush problems on ARM and MIPS
Message-ID: <FD8DC82C-AD3B-4EBC-A625-62A37B9ECBF1@bsdimp.com>
In-Reply-To: <1345763393.27688.578.camel@revolution.hippie.lan>
References: <1345757300.27688.535.camel@revolution.hippie.lan> <3A08EB08-2BBF-4B0F-97F2-A3264754C4B7@bsdimp.com> <1345763393.27688.578.camel@revolution.hippie.lan>
On Aug 23, 2012, at 5:09 PM, Ian Lepore wrote:

> On Thu, 2012-08-23 at 15:50 -0600, Warner Losh wrote:
>> On Aug 23, 2012, at 3:28 PM, Ian Lepore wrote:
>>> A recent innocuous change to the USB driver code caused
>>> intermittent errors in the umass(4) driver on ARM and MIPS
>>> platforms, and this
>>
>> I think the proper solution is to segregate the DMA and non-DMA
>> parts of structures so that you don't have both sharing a cache
>> line.
>>
>> I also wonder why we don't allocate the DMA memory for these
>> structures separately from the non-DMA parts. This would eliminate
>> the USB_CACHE_BYTES kludge (which is CPU dependent, not arch
>> dependent) and move the knowledge of this junk into the busdma
>> layer where it belongs. From my understanding of the issue, this
>> would completely eliminate the problem forever!
>>
>> Sharing a cacheline between something that is DMA aware and
>> something that isn't is just begging for trouble. We're doing more
>> work than we need to in order to support this dubious feature, and
>> we'd be miles ahead if we didn't share at all.
>>
>> Warner
>
> It seems to me that what we have here is a new type of constraint
> on DMA operations, and we need a way to communicate that constraint
> from the part of the platform support code that knows about it to
> the drivers and driver support code that need to know. The way we
> communicate DMA constraints is with a busdma tag, but right now that
> tag only communicates constraints that were needed for ISA and PCI
> busses, namely buffer alignment, boundary-crossing restrictions, and
> exclusion regions.
>
> Now we have a new type of constraint; I think of it as
> "granularity". In effect, we have a DMA system that can only do DMA
> in cacheline-sized chunks. Even when the IO size -- and thus the
> number of "bits on the wire" -- is less than the cacheline size, at
> the end of the DMA operation (which includes the software-assisted
> coherency operations) the number of bytes in memory that may be
> modified is the size of a cacheline. This is because "the DMA
> system" is not just the engine that moves bytes around, it's the
> combination of hardware and software that work together to maintain
> cache coherency.

But this isn't new. It is an alignment requirement, which carries
with it an implicit size requirement. If you enforce the alignment,
and force all 'sub-buffers' to have this alignment, you don't need
the new thing.

> Ideally we'd find a way to communicate this new constraint using
> the existing mechanism, the busdma tag, and ideally we'd not have to
> change every existing call to bus_dma_tag_create() to add a new
> parm. As I understand it, parent tags are now passed down through
> the newbus hierarchy consistently, such that a tag at the nexus
> level could describe a platform requirement such as granularity, and
> all devices and the helper code they use will have access to that
> constraint via inheritance from ancestors' tags. Maybe we could have
> a new flavor of bus_dma_tag_create() that takes a struct of parms,
> and existing calls wouldn't have to be changed.

Wouldn't a simpler solution be to just make this alignment
requirement part of the global parent tag on these platforms, and to
make sure all drivers on those platforms use it and don't cop out
and pass NULL? A sketch of what that looks like per driver is below.
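Something like this (an untested sketch: the foo_ name and the
maxsize/nsegments values are invented for illustration, but the
bus_dma_tag_create() argument order is the stock one, and
bus_get_dma_tag() is how a driver inherits the bus's parent tag
instead of copping out with NULL):

  #include <sys/param.h>
  #include <sys/bus.h>
  #include <machine/bus.h>

  static int
  foo_create_dma_tag(device_t dev, bus_dma_tag_t *tag)
  {
          /* Inherit the platform constraints; never pass NULL here. */
          return (bus_dma_tag_create(bus_get_dma_tag(dev),
              CACHE_LINE_SIZE,        /* alignment: whole cachelines */
              0,                      /* boundary: none */
              BUS_SPACE_MAXADDR,      /* lowaddr */
              BUS_SPACE_MAXADDR,      /* highaddr */
              NULL, NULL,             /* filter, filterarg */
              MAXPHYS,                /* maxsize (illustrative) */
              1,                      /* nsegments */
              MAXPHYS,                /* maxsegsz */
              0,                      /* flags */
              NULL, NULL,             /* lockfunc, lockfuncarg */
              tag));
  }

Since a child tag's alignment is combined with its parent's, a
nexus-level tag on arm/mips that already carries CACHE_LINE_SIZE
alignment would hand the constraint to every driver for free.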
> Communicating the constraint is only part of the problem; it also
> has to be easy for drivers to work with that constraint, especially
> drivers that are not targeted specifically at platforms with
> granular DMA. I think we can achieve a huge chunk of that purely
> within the arm/mips implementation of bus_dmamem_alloc(), but even
> so there would be a new conceptual limitation on using that routine:
> it is specifically for allocating DMA buffers, and that means that
> there will be a new rule that the CPU cannot access any memory
> within that buffer while an IO operation is in progress.

I don't think we should pander to drivers that don't know how to do
DMA properly. We get it almost right in bus_dma now. However, going
from almost right to completely right is hard, and we keep uncovering
edge cases that bite us. Wouldn't it be better to eliminate all these
weird edge cases?

> I'd also like to say there's a new rule that you cannot subdivide a
> buffer obtained from bus_dmamem_alloc() into multiple buffers, or
> into a combination of DMA and CPU-accessed data. That would be bad
> news for the USB subsystem, and perhaps other drivers. If this idea
> is either impossible or particularly contentious, then I guess we'd
> need some new helper routines so that a driver can subdivide the
> memory in a way that doesn't violate any constraints implied by the
> tag used to allocate the buffer.

When the USB subsystem went into the tree, this was one of the
criticisms that was ignored. It has come back to bite us time and
time again. Perhaps it is time to fix it once and for all.

> Not all IO occurs using buffers obtained from bus_dmamem_alloc(),
> and I doubt we can reasonably ever require that it be so.

True, but the I/O that's not in memory from bus_dmamem_alloc() is
page-aligned.

> I think the only hope we have of handling that problem is to bounce
> the requests that don't meet the granularity constraint, just as
> we'd have to do if the caller-supplied buffer fell into an exclusion
> zone or violated an alignment or boundary constraint. When I've
> tossed this idea out in the past there was instant resistance. Yeah,
> bounce buffers are massively inefficient, but my experience has been
> that most of the IO that isn't aligned and sized to a multiple of a
> cacheline is small IO (a few to a few dozen bytes). I've never seen
> a case of page-sized or larger IO requests that required
> partial-cacheline handling. I'm sure some examples exist, but
> they're probably more the exception than the rule. (And the bad
> performance you'd get from bouncing and copying massive bulk data
> flow would be a powerful incentive to track down the culprit and
> improve the driver.)

That's also the underlying idea in the bus_dma stuff. You give the
constraints, you get the buffers, and if you have a buffer that's
outside the constraints it gets bounced. That's why the sync
operations are on DMA items, not on cacheline items. While cache
lines are one issue, memory placement can be another. Floppy drives,
for example, couldn't DMA past the first 16MB, and if a buffer passed
in is outside of that, it bounces. If this bouncing produces slower
code, then the drivers should be updated to have better alignment.

The USB subsystem is making assumptions about the underlying cache
mechanisms that aren't really true. Ideally, we could get it to stop
doing that.

Warner
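P.S. To make the "segregate the DMA and non-DMA parts" point from the
top of this mail concrete, here's an illustrative sketch (the struct
and field names are invented, not taken from the real USB code):

  #include <sys/param.h>      /* CACHE_LINE_SIZE, __aligned() */
  #include <machine/bus.h>    /* bus_dmamap_t */

  /* The part the device DMAs into, padded out to whole cachelines. */
  struct foo_dma_part {
          uint32_t desc[8];           /* device reads/writes this */
  } __aligned(CACHE_LINE_SIZE);

  /* CPU-only bookkeeping holds a pointer; it never shares a line. */
  struct foo_softc {
          struct foo_dma_part *dma;   /* from bus_dmamem_alloc() */
          bus_dmamap_t map;
          int flags;                  /* safe for the CPU to touch mid-IO */
  };

The __aligned() both aligns the type and rounds its size up to a
multiple of the cacheline size, so even an array of foo_dma_part
never shares a cacheline with CPU-owned data.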