Date: Thu, 23 Aug 2012 17:09:53 -0600 From: Ian Lepore <freebsd@damnhippie.dyndns.org> To: freebsd-arch@freebsd.org, freebsd-arm@freebsd.org, freebsd-mips@freebsd.org Subject: Re: Partial cacheline flush problems on ARM and MIPS Message-ID: <1345763393.27688.578.camel@revolution.hippie.lan> In-Reply-To: <3A08EB08-2BBF-4B0F-97F2-A3264754C4B7@bsdimp.com> References: <1345757300.27688.535.camel@revolution.hippie.lan> <3A08EB08-2BBF-4B0F-97F2-A3264754C4B7@bsdimp.com>
next in thread | previous in thread | raw e-mail | index | archive | help
On Thu, 2012-08-23 at 15:50 -0600, Warner Losh wrote: > On Aug 23, 2012, at 3:28 PM, Ian Lepore wrote: > > A recent innocuous change to the USB driver code caused intermittant > > errors in the umass(4) driver on ARM and MIPS platforms, and this > > I think the proper solution is to segregate DMA and non-DMA parts of structures so that you don't have both sharing a cache line. > > I also wonder why we don't allocate the DMA memory for these structures separately from the non-DMA parts. This would eliminate the USB_CACHE_BYTES kludge (which is CPU dependent, not arch dependent) and move the knowledge of this junk into busdma layer where it belongs. From my understanding of the issue, this would completely eliminate the problem forever! > > Sharing a cacheline between something that is DMA aware and something that is just begging for trouble. We're doing more work than we need to to support this dubious feature and we'd be miles ahead if we could not share at all. > > Warner > It seems to me that what we have here is a new type of constraint on DMA operations, and we need a way to communicate that constraint from the part of the platform support code that knows about it to the drivers and driver support code that needs to know. The way we communicate DMA constraints is with a busdma tag, but right now that tag only communicates constraints that were needed for ISA and PCI busses, namely buffer alignment, boundary-crossing restrictions, and exclusion regions. Now we have a new type of constraint, I think of it as "granularity". In effect, we have a DMA system that can only do DMA in cacheline sized chunks. Even when the IO size -- and thus the number of "bits on the wire" -- is less than the cacheline size, at the end of the DMA operation (which includes the software-assisted coherency operations) the number of bytes in memory that may be modified is the size of a cacheline. This is because "the DMA system" is not just the engine that moves bytes around, it's the combination of hardware and software that work together to maintain cache coherency. Ideally we'd find a way to communicate this new constraint using the existing mechanism, the busdma tag, and ideally we'd not have to change every existing call to bus_dma_tag_create() to add a new parm. As I understand it, parent tags are now passed down through the newbus hierarchy consistantly, such that a tag at the nexus level could describe a platform requirement such as granularity, and all devices and the helper code they use will have access to that constraint via inheritance from ancestors' tags. Maybe we could have a new flavor of bus_dma_tag_create() that takes a struct of parms, and existing calls wouldn't have to be changed. Communicating the constraint is only part of the problem; it also has to be easy for drivers to work with that constraint, especially drivers that are not targeted specifically at platforms with granular DMA. I think we can achieve a huge chunk of that purely within the arm/mips implementation of bus_dmamem_alloc(), but even so there would be a new conceptual limitation on using that routine: it is specifically for allocating DMA buffers, and that means that there will be a new a rule that the CPU cannot access any memory within that buffer while an IO operation is in progress. I'd also like to say there's a new rule that you cannot subdivide a buffer obtained from bus_dmamem_alloc() into multiple buffers, or into a combination of DMA and CPU-accessed data. That would be bad news for the USB subsystem, and perhaps other drivers. If this idea is either impossible or particularly contentious, then I guess we'd need some new helper routines so that a driver can subdivide the memory in a way that doesn't violate any constraints implied by the tag used to allocate the buffer. Not all IO occurs using buffers obtained from bus_dmamem_alloc(), and I doubt we can reasonably ever require that it be so. I think the only hope we have of handling that problem is to bounce the requests that don't meet the granularity constraint, just as we'd have to do if the caller-supplied buffer fell into an exclusion zone or violated an alignment or boundary constraint. When I've tossed this idea out in the past there was instant resistance. Yeah, bounce buffers are massively inefficient, but my experience has been that most of the IO that isn't aligned and sized to a multiple of a cacheline is small IO (a few to a few dozen bytes). I've never seen a case of page-sized or larger IO requests that required partial-cacheline handling. I'm sure some examples exist, but they're probably more the exception than the rule. (And the bad performance you'd get from bouncing and copying massive bulk data flow would be a powerful incentive to track down the culprit and improve the driver.) -- Ian
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?1345763393.27688.578.camel>