Date: Thu, 23 Aug 2012 15:28:20 -0600
From: Ian Lepore <freebsd@damnhippie.dyndns.org>
To: freebsd-arch@freebsd.org, freebsd-mips@freebsd.org, freebsd-arm@freebsd.org
Subject: Partial cacheline flush problems on ARM and MIPS
Message-ID: <1345757300.27688.535.camel@revolution.hippie.lan>
A recent innocuous change to the USB driver code caused intermittent errors in the umass(4) driver on ARM and MIPS platforms. This message is some background on partial cacheline flushes, plus what I found while investigating the cause, which is rooted in the DMA cache coherency code.

First I need to say that when I say ARM in this message I mean "ARM and MIPS and any other platform where cache coherency around DMA operations is maintained with help from software (as opposed to being implemented purely in hardware as it is on i386/amd64)." I have no experience on MIPS, but I believe it is similar to ARM with regard to cache coherency. As far as I know, this is not specific to VIVT caches but rather to software cache coherency, so it probably applies to armv6/v7 as well as the v4 and v5 parts I'm working with.

Over the years we've had a series of glitches and patches in the partial cacheline flush logic for arm. Even with all the fixes, I've thought that there are two ways the scheme could fail, but I've never been able to demonstrate either experimentally, and empirically the failures seem to be rare. Both involve the fact that we have several software entities trying to maintain the caches concurrently, without knowledge of each others' actions or needs. The two ways are basically variations of the same situation: a dirty cacheline is flushed by software while a DMA operation that overlaps that cacheline is in progress.

 * A cpu_dcache_wbinv_all() call happens after some DMA data has hit main memory, but before bus_dmamap_sync(POSTREAD) is called.

 * Two independent DMA operations are happening in different parts of the same cacheline and some DMA data from both devices has hit main memory; whichever operation does the POSTREAD sync first can wipe out data in main memory from the other operation that is still in progress.

For problems to happen, the CPU also has to modify the same memory location / cacheline as the DMA, so that the cache holds the newest data for part of the cacheline and main memory holds the newest data for another part. Our logic for handling POSTREAD partial cacheline flushes creates this condition even if it doesn't already exist on entry to the sync routine (that logic is sketched below).

The wbinv_all() situation seemed to me the most likely to occur. It gets called from a variety of places for a variety of reasons. It is called from several places in pmap.c, and it appears to me that many of those invocations can happen at completely arbitrary points with respect to any IO that's in progress. Another way wbinv_all() can get invoked is during a call to wbinv_range() or even just inv_range(), when the range to be invalidated is so large that it's too inefficient to loop through it discarding a line at a time on a given platform. In those cases, for some arm platforms, the inv_range() implementation just calls wbinv_all() internally.

The implication of the last paragraph is that wbinv_all() can potentially be invoked as part of the busdma sync operations for any IO (PREREAD, PREWRITE, or POSTREAD) from any device at any time.
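For anyone not following along in the source, here is roughly what that POSTREAD partial-cacheline handling amounts to. This is a simplified sketch rather than the literal busdma sync code: only the leading partial line is shown (a trailing partial line gets the same treatment), locking around the temporary buffer is omitted, and the 32-byte line size is just what the parts I'm working with use.

#include <sys/param.h>
#include <sys/systm.h>          /* memcpy, printf */
#include <machine/cpufunc.h>    /* cpu_dcache_inv_range */

#define DCACHE_LINE     32      /* dcache line size on the parts discussed here */

static void
postread_partial_inv(vm_offset_t buf, vm_size_t len)
{
        uint8_t save[DCACHE_LINE];
        vm_offset_t line = buf & ~(DCACHE_LINE - 1);
        vm_size_t head = buf - line;    /* neighbouring bytes ahead of our buffer */

        if (head != 0) {
                /* Preserve the neighbouring bytes across the invalidate... */
                memcpy(save, (void *)line, head);
                cpu_dcache_inv_range(line, DCACHE_LINE);
                /*
                 * ...and restore them.  Touching the line pulls it back into
                 * the cache and dirties it, so the cache now holds the newest
                 * data for part of the line while main memory holds the
                 * newest data for the rest: exactly the condition described
                 * above.
                 */
                memcpy((void *)line, save, head);

                if (len <= DCACHE_LINE - head)
                        return;
                buf = line + DCACHE_LINE;
                len -= DCACHE_LINE - head;
        }
        cpu_dcache_inv_range(buf, len);
}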
A recent USB driver change moved some things around in memory, such that a small (13 byte) IO buffer became split across two cachelines, and suddenly we had intermittent (but fairly frequent) failures reported by umass(4). Some logging from the usb driver showed that there was stale data from previous IO operations in part of the IO buffer.

I added some code to pre-initialize the buffer to various byte patterns before starting the IO. After the IO, part of the buffer would still contain those patterns, and the rest of the buffer (after the cacheline split point) contained newer data from the IO. It looked pretty conclusively as if the partial cacheline flush logic was failing.

First I investigated the logic for handling such splits, but it was working correctly. So I moved on to assuming that the cause was one of the two potential problems I've long suspected. I received a helpful clue from Hans that the buffer in question was allocated once at device creation and remained allocated from that point on. That made it easy to save the buffer pointer when it was created and write wrappers for all the cache writeback and invalidate routines that checked whether the cacheline containing that buffer was part of the cache operation (a rough sketch of such a wrapper appears below).

What I expected to see was that USB would call the busdma sync ops before starting the IO, and then, before it called the post-IO sync ops, something else in the system would call wbinv_all() or a [wb]inv_range() that included the umass buffer address. What I actually saw was that this never happened. Not even once. Very rarely I would see some other [wb]inv_range() calls happen, but the ranges never involved the umass buffer, and the unit I'm doing these tests on (a DreamPlug) is not one that ever turns an inv_range into a wbinv_all.

It eventually occurred to me that I had been overlooking the most obvious way a dirty cacheline can get written back to main memory: the cache hardware needs to evict a line to make room for a new incoming line, and the line it has chosen is dirty and has to be written back before being evicted. Unfortunately, there is no way to instrument the software to detect that happening, so now I'm in the position of proving something based on the complete lack of evidence that anything else is the cause. That's a great way to promote a conspiracy theory, not so great for debugging.

In addition to showing that no software-triggered flush/invalidate operations are affecting the cacheline, I was able to show that the problem wasn't just that a partial cacheline flush was involved, but that the error condition depended on the specific memory addresses (and thus the specific cachelines) involved. At the point in the usb code where that buffer is allocated, I changed the code to add 32 bytes to the buffer offset, so that the buffer is still split across two cachelines in exactly the same way as before, but now it's two different cachelines. With that change, the error doesn't occur.

I think that may lend some weight to the theory that hardware cacheline eviction is causing a flush of a dirty cacheline while IO into that memory is in progress, but it's just more circumstantial evidence. The intermittent-but-frequent nature of the error may also be circumstantial evidence that hardware eviction is the cause. My DreamPlug unit has a 4-way set-associative cache that selects one of the ways at random when it needs to evict a line for refill. That would seem to imply that there's a one in four chance that the cacheline holding the umass status buffer is the one that gets hit, and that seems to match the symptoms I see of "this usb drive kind of works but there are tons of errors spewing on the console about it". Sometimes you get several failures in a row and the drive fails to attach, but most of the time it limps along with lots of errors followed by successful retries.
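As an aside, the cache-op wrappers mentioned above were nothing elaborate; they amounted to something like the hypothetical sketch below (reusing DCACHE_LINE from the earlier sketch). The names watched_buf, orig_inv_range, and traced_inv_range are invented for illustration; the actual ad-hoc debug patch isn't reproduced here.

/*
 * Hypothetical debug wrapper: log any dcache invalidate whose range
 * overlaps the cacheline holding the watched buffer, then call the
 * original routine.  Similar wrappers fronted wbinv_range(), and the
 * wbinv_all() wrapper simply logged every call (it has no range).
 */
static vm_offset_t watched_buf;         /* saved where umass allocates its buffer */
static void (*orig_inv_range)(vm_offset_t, vm_size_t);  /* the original routine */

static void
traced_inv_range(vm_offset_t va, vm_size_t len)
{
        vm_offset_t wline = watched_buf & ~(DCACHE_LINE - 1);

        if (watched_buf != 0 && wline < va + len && wline + DCACHE_LINE > va)
                printf("inv_range 0x%08x/0x%x overlaps watched line 0x%08x\n",
                    (u_int)va, (u_int)len, (u_int)wline);
        orig_inv_range(va, len);
}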
I considered trying to lock the cacheline in question into the cache as a way of confirming this theory (that should make the error go away). It turns out that's not especially easy to do on this platform: you can't lock a single cacheline, you have to lock a whole cache way. That's a pretty big change that would perturb system operation in general, so it may be hard to draw conclusions from the results.

The ARM Architecture Reference Manual mentions the following guidelines as part of the strategy for handling DMA/cache coherency:

 * marking the memory areas involved in the DMA operation as uncachable and/or unbufferable

 * cleaning and/or invalidating the data cache, at least with respect to the address range involved in the DMA operation

 * draining the write buffer

 * restrictions on processor accesses to the address range involved in the DMA operation until it is known that the DMA operation is complete

Our partial cacheline flush logic is trying to wish away the last bullet item, but now I think we can never successfully do so. Until last week I thought we had a theoretical problem that could eventually be fixed with a sufficiently clever cache maintenance implementation that somehow avoided having unrelated parts of the OS interfere with each other's operations. Now it appears that hardware operations we have no direct control over can also lead to memory corruption, and no amount of software cleverness is ever going to allow concurrent CPU and DMA access to the same memory without disabling the cache for that memory range.

At this point I was going to launch into some "what we can do about it" rambling, but this is long enough already; I think I'll leave this message as a summary of where we've come from and what I learned in the past few days, and leave "what next" for followups.

-- Ian