Date: Mon, 24 Dec 2012 13:42:55 -0700 From: Ian Lepore <freebsd@damnhippie.dyndns.org> To: Jeff Roberson <jroberson@jroberson.net> Cc: powerpc@freebsd.org, marcel@freebsd.org, mips@freebsd.org, John Baldwin <jhb@freebsd.org>, mav@freebsd.org, scottl@freebsd.org, attilio@freebsd.org, kib@freebsd.org, sparc64@freebsd.org, arm@freebsd.org Subject: Re: Call for testing and review, busdma changes Message-ID: <1356381775.1129.181.camel@revolution.hippie.lan> In-Reply-To: <alpine.BSF.2.00.1212231418120.2005@desktop> References: <alpine.BSF.2.00.1212080841370.4081@desktop> <1355077061.87661.320.camel@revolution.hippie.lan> <alpine.BSF.2.00.1212090840080.4081@desktop> <1355085250.87661.345.camel@revolution.hippie.lan> <alpine.BSF.2.00.1212231418120.2005@desktop>
next in thread | previous in thread | raw e-mail | index | archive | help
On Sun, 2012-12-23 at 14:22 -1000, Jeff Roberson wrote: > On Sun, 9 Dec 2012, Ian Lepore wrote: > > > On Sun, 2012-12-09 at 08:48 -1000, Jeff Roberson wrote: > >> > >> On Sun, 9 Dec 2012, Ian Lepore wrote: > >> > >>> On Sat, 2012-12-08 at 08:51 -1000, Jeff Roberson wrote: > >>>> Hello, > >>>> > >>>> http://people.freebsd.org/~jeff/physbio.diff > >>>> [...] > >>> > >>> Testing on armv4/v5 platforms, the patches applied and compiled cleanly, > >>> but fault-abort with a NULL deref on what appears to be the first call > >>> to bus_dmamap_load(). I'll dig deeper into debugging that after > >>> browsing the new code so I can figure out where to start debugging. > >> > >> Can you give me the file and line of the fault? > >> > > > > Better than that, I can give you patches (attached). There were two > > little glitches... the allocated slist wasn't getting attached to the > > map, and when building the slist in _bus_dmamap_load_buffer() the slist > > needs to participate in the coalescing logic near the end of the loop. > > I understand the first problem. The second, I'm not sure about. When > we're going through bounce pages we use virtual to virtual. Why do you > need to flush the cache lines for the source addresses? The device will > only do io to the bounce pages so they are the only that need other > synchronization. > As I indicated in my following paragraph, I didn't think my fix for constructing the sync list was perfect, it was just good enough to allow more testing to progress. In particular I don't think it would be right if bouncing were required, but it never is on the arm platforms I have to test with. The logic in your patch didn't work in the normal non-bounce case. The main loop that does the mapping works on a single page at a time. For each page it decides whether to use a bounce page instead, then it either merges the page into the current segment if it can, or starts a new segment. Your patch has it create a new sync list entry for every non-bounced page, but the sync list is sized to maxsegs not maxpages. So your logic writes outside the bounds of the sync list array, and my logic includes the virtual ranges of pages that will be bounced in the sync list (which I think should be harmless anyway, just a bit inefficient). The logic is wrong in both cases, but since bouncing never happens on my hardware I knew my tweak would let me do more testing. In the past, I've never paid close attention to the bounce logic, I've tended to read around it as if it weren't there. Recently I've begun to look at it and the more I look the more I think that to the degree that it works at all, it works by accident. I think mostly the logic just never gets used on the arm platforms that people are working with, because there'd be more problems reported if it were. For example... there is no handling of the PREREAD case when syncing bounce pages. A POSTREAD invalidate is insufficient; if any of the cache lines are dirty at the time the DMA starts, there are several ways those lines can get written back to physical memory while the DMA is in progress, corrupting the data in the IO buffer. Another example... there is an unordered list of bounce pages which are substituted for real pages as needed, one page at a time. No attempt is made to ensure that a run of contiguous physical pages that need to be bounced turns into a corresponding single run of contiguous pages in the bounce zone. In other words, it seems like over time as the list becomes less ordered the bounce logic would increasingly create multi-segment DMA. The reason that seems like a problem is that on every system I've checked (arm and i386) getting to multi-user mode creates about 100 dma tags, and typically only 5 or 6 of those tags allow more than one segment. I understand that this is not "incorrect operation" of the busdma code, but it does seem to be circumstantial evidence that this logic isn't being exercised much, since it appears that few drivers can actually handle multi-segment DMA. While the work I've done so far on arm busdma has concentrated on the buffer allocation and correct operation of the sync routines in the non-bounce cases, I think I'll be looking closely at the map load (currently seems a bit inefficient) and bounce-handling logic soon. > > > > I'm not positive the attached quick-fix is perfect, but it allows one of > > my arm units here to boot and run and pass some basic IO smoke tests. > > I'll be trying it on other arm platforms later today. > > > >>> > >>> Just from an initial quick browsing of the new code for armv4, I notice > >>> these things... > >>> > >>> The new routine __bus_dmamap_mayblock() does (*flags) |= BUS_DMA_WAITOK; > >>> which is a no-op because BUS_DMA_WAITOK is zero. I'm not sure what the > >>> intent was, but I think any modification of the WAIT-related flags would > >>> be conceptually wrong because it overrides the caller's instructions on > >>> whether or not to do a deferred load when resources aren't available. > >> > >> The intent of the original load function seems to be that it is always > >> blocking. Most architectures force the flag. The mbuf and uio load > >> routines don't support blocking at all because the callback lacks the > >> information needed to restart the operation. > >> > > > > The manpage states that bus_dmamap_load() never blocks for any reason. > > The mbuf and uio flavors say they are variations of bus_dmapmap_load(); > > I take the implication of that to mean that they also are defined as > > never blocking. > > > > The docs then say that WAITOK vs. NOWAIT control the scheduling of a > > deferred load, and that NOWAIT is forced in the mbuf and uio cases. > > Your refactored code already handles that by adding NOWAIT to the > > caller-specified flags for mbufs and uio. > > > > So the code will do the right thing now (with your patches), but only > > because (*flags) |= BUS_DMA_WAITOK is a no-op. > > This code is problematic. We can only support WAITOK operations for > bus_dmamap_load() because the callbacks from the swi when bounce pages are > available can only remember one pointer. To fix this for real we'd have > to remember the original type and call the appropriate load function > again. This could be solved by having a { type, pointer, length } > structure that is passed between the front and back-end. This would solve > your mbuf problem as well. I think I will do this next. I'm lost here. I don't understand your point about only being able to remember one pointer. We only ever need to remember one pointer, because a map can only be associated with a single buffer at a time. The fact that a deferred load can't be easily implemented in the case of mbufs and uio might be why the manpage says the NOWAIT flag is implied in those cases. There's no reason why the busdma code can't handle deferred loading for mbufs and uio -- as you say, it seems to only involve storing a bit of extra info for the callback. I think the hard part of the problem is elsewhere. For a transmit mbuf that needs a deferred mapping, does the driver's interface to the network stack allow for the concept of "this call has neither suceeded nor failed, but either outcome is possible, I'll let you know later" (I have no idea what the answer to that is but I can see how it could quickly become complex for drivers and other higher layers to deal with). For a deferred uio mapping, what if the userland pmap is no longer valid when the callback happens? What if userland has freed the memory? I suspect the implied NOWAIT requirement exists to wish away the need to handle such things. My original point was that the WAITOK flag has a value of zero. Your code gives the impression that it forces all callers to accept a deferred mapping, overriding whatever flag they may have passed, but in fact it doesn't do anything at all. Hmm, I just looked at non-arm implementations more closely and I see that some of them have the same logic as your patches, they OR-in a zero in an apparent attempt to override the caller's flags. That's contrary to what the manpage says and IMO contrary to common sense -- if a caller has said that it is unable to handle a deferred mapping callback, it's unlikely that anything is going to work correctly if a deferred callback happens (in fact, it sounds like a crash or panic waiting to happen). I think the only reason the existing code hasn't caused problems is because the flag value is zero and it actually does nothing except mislead when you read it. > By the way I have an mbuf branch that pushes data and mbuf structure into > separate cachelines. It's not only faster on arm and the like but on x86 > as well. So someone should pursue this and you'd have way fewer partial > line flushes. > [...] > You can skip the [mbuf] sync because you know the information in the mbuf header > is not necessary? > Not exactly. In the general case, if the buffer isn't aligned and padded to a cacheline boundary, we don't know anything about the memory before and after the buffer which partially shares a cacheline we're about to operate on, so the existing code attempts to preserve the contents of that unknown memory across the sync operation. (It cannot possibly be reliably sucessful in doing so, but that's another story.) In the case of an mbuf, where the data area following the header is always misaligned to a cacheline, we do know something about the memory before the data buffer: we know that it's an mbuf header which will not be accessed by the cpu while the dma is in progress. Likewise if the length isn't a multiple of the line size we know the data after the mapped length is still part of a buffer which is allocated to a multiple of the line size and the cpu won't touch that memory during the dma. Using that knowledge we can handle a PREREAD or PREWRITE by doing a wbinv on the cacheline (this is no different than the general case), and then we can handle the POSTREAD as if the buffer were aligned. We don't have to attempt to preserve partial cachelines before/after the buffer because we know the cpu hasn't touched that memory during the dma, so no cache line load could have happened. In the changes you mentioned, how do you ensure the mbuf header and data end up in separate cache lines without compile-time knowledge of cache line size? Even if the embedded data buffer in an mbuf becomes cache-aligned, the special mbuf awareness will still be needed to handle the fact that the mapped data length may not be a multiple of the cache line size. We'll still need the special knowledge that it's an mbuf so that we can sync the entire last cacheline without attempting to preserve the part beyond the end of the dma buffer. (If this idea makes you vaguely uneasy, you're not alone. However, I've been assured by Scott and Warner that it's always safe to treat mbufs this way.) > > > > I have patches to fix the mbuf partial sync and other things related to > > partial sync problems, and my next step will be to try to rework them to > > fit in with what you've done. Based on what I've seen so far in your > > refactoring, I think just passing a flag to say it's an mbuf, from the > > MI to the MD mapping routine, will provide enough info. > > I think I described a better approach above that will solve mutliple > problems. Let me know what you think. > > Jeff In the long run we can't just reduce the incidence of partial cacheline flushes, they have to be eliminated completely. We can eliminate them for mbufs because of arbitrary rules for mbufs that are supposedly universally followed in the existing code already. We can also eliminate them for any buffers which are allocated by bus_dmamem_alloc(), basically for reasons similar to mbufs: we can ensure that allocated buffers are always aligned and padded appropriately, and establish a rule that says you must not allow any other access to memory in the allocated buffer during dma. Most especially that means you can't allocate a big buffer and sub-divide it in such a way that there is concurrent cpu and dma access, or multiple concurrent dma operations, within a single buffer returned by bus_dmamem_alloc(). Still unresolved is what to do about the remaining cases -- attempts to do dma in arbitrary buffers not obtained from bus_dmamem_alloc() which are not aligned and padded appropriately. There was some discussion a while back, but no clear resolution. I decided not to get bogged down by that fact and to fix the mbuf and allocated-buffer situations that we know how to deal with for now. -- Ian
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?1356381775.1129.181.camel>