Date: Mon, 21 May 2012 10:20:21 -0600 From: Ian Lepore <freebsd@damnhippie.dyndns.org> To: Svatopluk Kraus <onwahe@gmail.com> Cc: Richard Hodges <richard@hodges.org>, freebsd-arm@freebsd.org, hackers@freebsd.org Subject: Re: ARM + CACHE_LINE_SIZE + DMA Message-ID: <1337617221.2516.38.camel@revolution.hippie.lan> In-Reply-To: <CAFHCsPVxkhNfiTQp7gvjfonfTjoG-28RgNrG=%2BdxbGhzxqY%2BDg@mail.gmail.com> References: <CAFHCsPUdZXGKFvmVGgaEUsfhwd28mNVGaY84ExcJp=ogQxzPJQ@mail.gmail.com> <1337285248.1503.308.camel@revolution.hippie.lan> <CAFHCsPVxkhNfiTQp7gvjfonfTjoG-28RgNrG=%2BdxbGhzxqY%2BDg@mail.gmail.com>
next in thread | previous in thread | raw e-mail | index | archive | help
On Fri, 2012-05-18 at 16:13 +0200, Svatopluk Kraus wrote: > On Thu, May 17, 2012 at 10:07 PM, Ian Lepore > <freebsd@damnhippie.dyndns.org> wrote: > > On Thu, 2012-05-17 at 15:20 +0200, Svatopluk Kraus wrote: > >> Hi, > >> > >> I'm working on DMA bus implementation for ARM11mpcore platform. I've > >> looked at implementation in ARM tree, but IMHO it only works with some > >> assumptions. There is a problem with DMA on memory block which is not > >> aligned on CACHE_LINE_SIZE (start and end) if memory is not coherent. > >> > >> Let's have a buffer for DMA which is no aligned on CACHE_LINE_SIZE. > >> Then first cache line associated with the buffer can be divided into > >> two parts, A and B, where A is a memory we know nothing about it and B > >> is buffer memory. The same stands for last cache line associatted with > >> the buffer. We have no problem if a memory is coherent. Otherwise it > >> depends on memory attributes. > >> > >> 1. [no cache] attribute > >> No problem as memory is coherent. > >> > >> 2. [write throught] attribute > >> The part A can be invalidated without loss of any data. It's not problem too. > >> > >> 3. [write back] attribute > >> In general, there is no way how to keep both parts consistent. At the > >> start of DMA transaction, the cache line is written back and > >> invalidated. However, as we know nothing about memory associated with > >> part A of the cache line, the cache line can be filled again at any > >> time and messing up DMA transaction if flushed. Even if the cache line > >> is only filled but not flushed during DMA transaction, we must make it > >> coherent with memory after that. There is a trick with saving part A > >> of the line into temporary buffer, invalidating the line, and > >> restoring part A in current ARM (MIPS) implementation. However, if > >> somebody is writting to memory associated with part A of the line > >> during this trick, the part A will be messed up. Moreover, the part A > >> can be part of another DMA transaction. > >> > >> To safely use DMA with no coherent memory, a memory with [no cache] or > >> [write throught] attributes can be used without problem. A memory with > >> [write back] attribute must be aligned on CACHE_LINE_SIZE. > >> > >> However, for example mbuf, a buffer for DMA can be part of a structure > >> which can be aligned on CACHE_LINE_SIZE, but not the buffer itself. We > >> can know that nobody will write to the structure during DMA > >> transaction, so it's safe to use the buffer event if it's not aligned > >> on CACHE_LINE_SIZE. > >> > >> So, in practice, if DMA buffer is not aligned on CACHE_LINE_SIZE and > >> we want to avoid bounce pages overhead, we must support additional > >> information to DMA transaction. It should be easy to support the > >> information about drivers data buffers. However, what about OS data > >> buffers like mentioned mbufs? > >> > >> The question is following. Is or can be guaranteed for all or at least > >> well-known OS data buffers which can be part of DMA access that the > >> not CACHE_LINE_SIZE aligned buffers are surrounded by data which > >> belongs to the same object as the buffer and the data is not written > >> by OS when given to a driver? > >> > >> Any answer is appreciated. However, 'bounce pages' is not an answer. > >> > >> Thanks, Svata > > > > I'm adding freebsd-arm@ to the CC list; that's where this has been > > discussed before. > > > > Your analysis is correct... to the degree that it works at all right > > now, it's working by accident. At work we've been making the good > > accident a bit more likely by setting the minimum allocation size to > > arm_dcache_align in kern_malloc.c. This makes it somewhat less likely > > that unrelated objects in the kernel are sharing a cache line, but it > > also reduces the effectiveness of the cache somewhat. > > > > Another factor, not mentioned in your analysis, is the size of the IO > > operation. Even if the beginning of the DMA buffer is cache-aligned, if > > the size isn't exactly a multiple of the cache line size you still have > > the partial flush situation and all of its problems. > > > > It's not guaranteed that data surrounding a DMA buffer will be untouched > > during the DMA, even when that surrounding data is part of the same > > conceptual object as the IO buffer. It's most often true, but certainly > > not guaranteed. In addition, as Mark pointed out in a prior reply, > > sometimes the DMA buffer is on the stack, and even returning from the > > function that starts the IO operation affects the cacheline associated > > with the DMA buffer. Consider something like this: > > > > void do_io() > > { > > int buffer; > > start_read(&buffer); > > // maybe do other stuff here > > wait_for_read_done(); > > } > > > > start_read() gets some IO going, so before it returns a call has been > > made to bus_dmamap_sync(..., BUS_DMASYNC_PREREAD) and an invalidate gets > > done on the cacheline containing the variable 'buffer'. The act of > > returning from the start_read() function causes that cacheline to get > > reloaded, so now the stale pre-DMA value of the variable 'buffer' is in > > cache again. Right after that, the DMA completes so that ram has a > > newer value that belongs in the buffer variable and the copy in the > > cacheline is stale. > > > > Before control gets into the wait_for_read_done() routine that will > > attempt to handle the POSTREAD partial cacheline flush, another thread > > gets control and begins setting up a new DMA for another device, > > different buffer. This time it's a read using a 64K buffer for disk IO. > > The busdma sync code calls cpu_dcache_inv_range() for that buffer but > > because the range is so large, it gets turned into a > > cpu_dcache_wbinv_all() because that's cheaper than looping through an > > arbitrarily large range invalidating a line at a time. > > > > Except, ooops, that means we write back to ram the cacheline holding the > > stale value of the 'buffer' variable, wiping out the data brought in by > > DMA before the partial cachline flush code could do its dance to > > preserve it. > > > > There are several variations of the above scenario; it doesn't require a > > stack-allocated buffer to trigger a writeback of a stale value. Any > > cacheline that gets dirtied after a PREREAD invalidate can end up > > overwriting fresh DMA data in ram with stale data from the cacheline at > > any time, because any call to cpu_dcache_inv_range() or > > cpu_dcache_wbinv_range() can get turned into a cpu_dcache_wbinv_all(). > > That means any DMA operation and also a context switch which calls > > cpu_dcache_wbinv_all(). > > > > If you rule out bounce buffers as a solution, then I think that may > > leave us with just one option: make the pages uncacheable for the > > duration of the DMA. Essentially force a DMA_COHERENT buffer if the > > driver didn't already do so. I think doing so blindly may have > > performance implications every bit as bad as using bounce buffers. (For > > example, turning off cache on a stack page could really hurt.) The code > > to do the remapping already exists in pmap.c as part of handling > > multiple mappings for VIVT caches. From the busdma code you could > > accomplish the remapping by making a temporary writable kernel mapping > > for the buffer pages in the PRE handling and undo that mapping in the > > POST handling. > > > > It may be that a hybrid approach would work. For an unaligned buffer, > > if it isn't already DMA_COHERENT, then if it is below a certain size > > bounce it, otherwise remap the buffer's pages. When I was knee-deep in > > this problem last summer one of the things I noticed in our systems was > > that large DMA operations (1 KByte or larger buffers) tend to be > > DMA_COHERENT buffers, and when not they're already cache aligned and a > > power of two in size. For us the partial cacheline flush situations are > > almost always caused by tiny IO of 1 to 128 bytes length, usually > > serial-comms or usb related. > Good to know. > > > > > I think pre-allocating a few pages for bouncing small unaligned IO would > > be a big win compared to remapping the pages as uncacheable. The > > remapping has to take locks and search lists of pages and so on; it > > should be way faster to do a small memcpy() instead. Buffers bigger > > than some (perhaps tunable) limit would get remapped instead of > > bounced. > > > > It also might be nice to have a knob to enable logging when bouncing or > > remapping is used to avoid partial cacheline operations, to make it easy > > to find drivers that could be tweaked for better performance. If you're > > bouncing 2 or 3 operations per second with a 4-byte buffer that's no big > > deal. If every network packet is resulting in bouncing/remapping you'd > > want to know about that. > It sounds resonable for me. > > > > > -- Ian > > Thanks for replies. > > So, we have to check DMA buffers if they are aligned and if not, we > have two possibilies in general. > > 1. To not assume anything about surrounding data around unaligned DMA > buffer at all. This always leads to bouncing or memory attributes > changing in no coherent case. > I think this is the only safe starting assumption. Even if every driver in the current source base is fixed, there are 3rd party drivers that aren't checked in, and also it's possible that the IO buffers are in userspace and it's impossible for the kernel to make any assumptions about how the surrounding memory is being accessed during the IO. At work we have several proprietary drivers that do IO directly to/from userspace buffers. > 2. To add new flag (something like BUS_DMA_UNALIGNED_SAFE) and set it > in dmamap load functions in cases that we know it's safe to use an > unaligned buffer. This way we can avoid bouncing in some cases. > This seems like it could be a useful optimization for a driver that can't afford the performance hit of the automatic handling from item 1 above. It also seems like it could be a "please shoot me in the foot" flag if it's used unwisely (or, more likely, if code is blindly cloned from an existing driver that does use it wisely into a new driver where it's not appropriate). > I didn't know about drivers that are using DMA buffers on stack. > However, to patch such a driver is something I can do on my own. I.e., > I always can decide that a driver buffer is safe for DMA even > unaligned. Moreover, for example, DMA descriptors rings are defined as > an array in some net drivers and a descriptor size could be smaller > than CACHE_LINE_SIZE. The drivers must be modified anyway to made > descriptors coherent or aligned. > > What I can do in a driver it's not so simple in case of OS buffers > like mbufs. I can check how mbufs are used in current implementation > and say, yes, it's safe to use them unaligned. However, it can be > changed in next release if anybody won't take care of it. It would be > nice to have a maintained list of OS buffers which are DMA safe in > respect of CACHE_LINE_SIZE. Is the flag and the list interesting for > somebody else? > I don't have a definitive answer for this, but my assumption has always been that once an mbuf is handed to a driver (for example, when it's added to an interface's send queue), the driver then "owns" that mbuf and nothing else in the system will touch it until the driver makes a call to hand it off or free it. If that assumption is true, a driver could make good use of a BUS_DMA_UNALIGNED_SAFE flag with mbufs. The part that scares me about my assumption is the m_ext.ref_cnt field of the mbuf. Its existance seems to imply that multiple entities concurrently have an interest in the data. On the other hand, the lack of any built in provisions for locking seems to imply that concurrent access isn't happening, or perhaps it implies that any required synchronization is temporal rather than lock-based. I've never found anything in writing that explains mbuf usage conventions at this level of detail. > Some more notes. > > SMP makes things worse and ARM11mpcore is about SMP too. For example, > another thread could be open about that how to flush caches (exclusive > L1 cache) in SMP case. > > I'm not sure how to correctly change memory attributes on page which > is in use. Making new temporary mapping with different attributes is > wrong and does not help at all. It's question how to do TLB and cache > flushes on two and more processors and be sure that everything is OK. > It could be slow and maybe, changing memory attributes on the fly is > not a good idea at all. > My suggestion of making a temporary writable mapping was the answer to how to correctly change memory attributes on a page which is in use, at least in the existing code, which is for a single processor. You don't need, and won't even use, the temporary mapping. You would make it just because doing so invokes logic in arm/arm/pmap.c which will find all existing virtual mappings of the given physical pages, and disable caching in each of those existing mappings. In effect, it makes all existing mappings of the physical pages DMA_COHERENT. When you later free the temporary mapping, all other existing mappings are changed back to being cacheable (as long as no more than one of the mappings that remain is writable). I don't know that making a temporary mapping just for its side effect of changing other existing mappings is a good idea, it's just a quick and easy thing to do if you want to try changing all existing mappings to non-cacheable. It could be that a better way would be to have the busdma_machdep code call directly to lower-level routines in pmap.c to change existing mappings without making a new temporary mapping in the kernel pmap. The actual changes to the existing mappings are made by pmap_fix_cache() but that routine isn't directly callable right now. Also, as far as I know all of this automatic disabling of cache for multiple writable mappings applies only to VIVT cache architectures. I'm not sure how the pmap code is going to change to support VIPT and PIPT caches, but it may no longer be true that making a second writable mapping of a page will lead to changing all existing mappings to non-cacheable. -- Ian
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?1337617221.2516.38.camel>