Date: Wed, 26 Oct 2005 10:29:21 -0600
From: Scott Long <scottl@samsco.org>
To: Jacques Caron <jc@oxado.com>
Cc: freebsd-amd64@freebsd.org, sos@freebsd.org
Subject: Re: busdma dflt_lock on amd64 > 4 GB
Message-ID: <435FAEE1.6000506@samsco.org>
In-Reply-To: <6.2.3.4.0.20051026163501.03b7d3e8@wheresmymailserver.com>
References: <6.2.3.4.0.20051025171333.03a15490@pop.interactivemediafactory.net> <6.2.3.4.0.20051026131012.03a80a20@pop.interactivemediafactory.net> <435F8E06.9060507@samsco.org> <6.2.3.4.0.20051026163501.03b7d3e8@wheresmymailserver.com>
Jacques Caron wrote:
> Hi Scott,
>
> Thanks for the input. I'm utterly lost in unknown terrain, but I'm
> trying to understand...
>
> At 16:09 26/10/2005, Scott Long wrote:
>
>> So, the panic is doing exactly what it is supposed to do. It's
>> guarding against bugs in the driver. The workaround for this is to
>> use the NOWAIT flag in all instances of bus_dmamap_load() where
>> deferrals can happen.
>
> As pointed out by Soren, this is not documented in man bus_dma :-/
> It says bus_dmamap_load flags are supposed to be 0, and
> BUS_DMA_ALLOCNOW should be set at tag creation to avoid EINPROGRESS.
> I'm not sure the two would actually be equivalent, either.

They are not. The point of the ALLOCNOW flag is to try to avoid
mallocing buffers later, when the map is created. It's not the
solution to any problem, just a shortcut. It's really only useful for
drivers that allocate maps on the fly instead of pre-allocating them.

> And from what I understand, even a call to bus_dma_tag_create with
> BUS_DMA_ALLOCNOW can be successful but not actually allocate what
> will be needed later (see below).

Like I replied to Soeren, each bounce zone is guaranteed to have
enough pages for one transaction by one consumer. Allocating more
maps can increase the size of the pool, but allocating more tags
cannot. Again, this is to guard against over-allocation. busdma
doesn't know whether a tag will be used for static buffers or dynamic
buffers, and static buffers tend to be large and not require a map or
bouncing. It used to be that bus_dma_tag_create() would always
increase the page allocation in the zone, but then we got into
problems with drivers wanting large static allocations and fooling
busdma into exhausting physical memory by allocating too many bounce
pages, none of which were needed.

Another approach that I've been considering is adding a
BUS_DMA_STATICMAP flag to bus_dma_tag_create() that tells it to not
allocate bounce pages and not allow deferrals. Then the bounce page
limit heuristics can be removed from tags that don't have that flag,
and the code will be simpler and more predictable. But, since any
time you touch busdma you have to consider many dozens of drivers,
it's not something that I'm ready to do without more thought.

>> This, however, means that using bounce pages still remains fragile
>> and that the driver is still likely to return ENOMEM to the upper
>> layers. C'est la vie, I guess. At one time I had patches that
>> made ATA use the busdma API correctly (it is one of the few
>> remaining that does not), but they rotted over time.
>
> So what would be the "correct" way? Move the part that comes after
> the DMA setup into the callback? I suppose there are limitations as
> to what can happen in the callback, though, so it would complicate
> things quite a bit.
>
> Obviously, a lockfunc would be needed in this situation, right?

I sent a long email on this on Dec 14, 2004. I'll pull it up and
forward it out. What I really should do is publish a definitive
article on the whole topic. As for 'limitations as to what can happen
in the callback', there are none if you use the correct code
structure.
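To make that concrete, here's a rough sketch of the structure I have
in mind. This is not ATA's actual code; the xx_ names, the softc
layout, and the tag parameters are made up for illustration, and only
the busdma calls themselves are real:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bus.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <machine/bus.h>

struct xx_softc {
    struct mtx      xx_mtx;     /* driver lock, shared with busdma;
                                   initialized with mtx_init() */
    bus_dma_tag_t   xx_tag;
    bus_dmamap_t    xx_map;
};

/*
 * All hardware programming happens here, whether the load completed
 * immediately or was deferred and run later by busdma.  busdma calls
 * the lockfunc around deferred callbacks, so this runs with xx_mtx
 * held in both cases and no special-casing is needed.
 */
static void
xx_dma_callback(void *arg, bus_dma_segment_t *segs, int nseg, int error)
{
    struct xx_softc *sc = arg;

    if (error != 0) {
        /* abort the transaction */
        return;
    }
    /* program segs[0 .. nseg-1] into the controller */
}

static int
xx_attach_dma(struct xx_softc *sc)
{
    int error;

    /*
     * Pass busdma_lock_mutex and the driver mutex as the lockfunc
     * pair so deferred callbacks are serialized against the driver.
     */
    error = bus_dma_tag_create(NULL, 1, 0,
        BUS_SPACE_MAXADDR_32BIT, BUS_SPACE_MAXADDR,
        NULL, NULL, MAXBSIZE, btoc(MAXBSIZE) + 1, MAXBSIZE, 0,
        busdma_lock_mutex, &sc->xx_mtx, &sc->xx_tag);
    if (error != 0)
        return (error);
    return (bus_dmamap_create(sc->xx_tag, 0, &sc->xx_map));
}

static int
xx_start(struct xx_softc *sc, void *buf, bus_size_t len)
{
    int error;

    /*
     * No BUS_DMA_NOWAIT here: EINPROGRESS just means the load was
     * deferred and xx_dma_callback() will run when pages free up.
     */
    error = bus_dmamap_load(sc->xx_tag, sc->xx_map, buf, len,
        xx_dma_callback, sc, 0);
    if (error == EINPROGRESS)
        return (0);
    return (error);
}

The point is that everything that comes after the DMA setup lives in
the callback, so the immediate and deferred cases follow the exact
same path.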
> Also, I believe many other drivers just have lots of
> BUS_DMA_ALLOCNOW or BUS_DMA_NOWAIT all over the place, I'm not sure
> that's the "correct" way, is it?

Most network drivers use these because they prefer to handle the
ENOMEM case rather than handle the possibility of out-of-order
packets caused by deferrals (though this is really not possible;
busdma guards against it). The network stack is designed to handle
loss on both the transmitting end and the receiving end, unlike the
storage layer. Keep in mind that this discussion started with talking
about ATA =-)

>> No. Some tags specifically should not permit deferrals.
>
> How do they do that? Setting BUS_DMA_ALLOCNOW in the tag, or
> BUS_DMA_NOWAIT in the map_load, or both, or something else?

They set it by using NULL as the lockfunc.

> What should make one decide when deferrals should not be permitted?

Static allocations should never require bouncing, and thus should
never have a deferral. The assertion is there to make sure that a
driver doesn't accidentally try to use a tag created for static
buffers for dynamic buffers.
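In other words, something like the following, again with made-up
names (it assumes ring fields added to the hypothetical xx_softc from
the earlier sketch). The NULL lockfunc is what installs the default
lock, i.e. the dflt_lock panic you hit, which fires if a deferral is
ever attempted on this tag:

#define XX_RING_SIZE    4096    /* made-up descriptor ring size */

/*
 * Runs synchronously from bus_dmamap_load() since bus_dmamem_alloc()
 * memory never bounces; just records the bus address of the ring.
 */
static void
xx_ring_callback(void *arg, bus_dma_segment_t *segs, int nseg, int error)
{
    if (error == 0)
        *(bus_addr_t *)arg = segs[0].ds_addr;
}

static int
xx_alloc_ring(struct xx_softc *sc)
{
    int error;

    /*
     * NULL lockfunc/lockfuncarg: deferrals are forbidden on this
     * tag, which is fine because the buffer is static and will
     * never need bounce pages.
     */
    error = bus_dma_tag_create(NULL, PAGE_SIZE, 0,
        BUS_SPACE_MAXADDR_32BIT, BUS_SPACE_MAXADDR,
        NULL, NULL, XX_RING_SIZE, 1, XX_RING_SIZE, 0,
        NULL, NULL, &sc->xx_ring_tag);
    if (error != 0)
        return (error);
    error = bus_dmamem_alloc(sc->xx_ring_tag, &sc->xx_ring,
        BUS_DMA_NOWAIT | BUS_DMA_ZERO, &sc->xx_ring_map);
    if (error != 0)
        return (error);
    return (bus_dmamap_load(sc->xx_ring_tag, sc->xx_ring_map,
        sc->xx_ring, XX_RING_SIZE, xx_ring_callback,
        &sc->xx_ring_busaddr, BUS_DMA_NOWAIT));
}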
> It is my impression that quite a few drivers happily decide they
> don't like deferrals at all whatever happens...

Again, these are mostly network drivers, and the network stack is
designed to reliably handle this. The storage stack tries a little
bit to handle it, but it's not reliable. Nor should it have to handle
it; direct I/O _must_always_succeed_. What if you're out of RAM and
the VM system tries to write some pages to swap in order to free up
RAM, but those writes fail with ENOMEM? Again, FreeBSD has shown
excellent handling of high memory pressure situations over the years
where other OS's die horribly. This is one of the reasons why.

>> Just about every other modern driver honors the API correctly.
>
> Depends what you mean by "correctly". I'm not sure using
> BUS_DMA_NOWAIT is the right way to go as it fails if there is
> contention for bounce buffers.
>
>> Bounce pages cannot be reclaimed to the system, so overallocating
>> just wastes memory.
>
> I'm not talking about over-allocating, but rather allocating what is
> needed: I don't understand why bus_dma_tag_create limits the total
> number of bounce pages in a bounce zone to maxsize if
> BUS_DMA_ALLOCNOW is set (which prevents bus_dmamap_create from
> allocating any further bounce pages as long as there's only one map
> per tag, which seems pretty common), while bus_dmamap_create will
> allocate maxsize additional pages if BUS_DMA_ALLOCNOW was not set.

Actually, one map per tag is not common. If the ATA driver supported
tagged queuing (which I assume it will someday for SATAII, yes?) then
it would have multiple maps. Just about every other modern block
driver supports multiple concurrent transactions and thus multiple
maps.

> The end result is that the ata driver is limited to 32 bounce pages
> whatever the number of instances (I guess that's channels, or
> disks?), while other drivers get hundreds of bounce pages which they
> hardly use. Maybe this is intended and it's just the way the ata
> driver uses tags and maps that is wrong, maybe it's the busdma logic
> that is wrong, I don't know...

If a map is being created for every drive in the system, and the
result is that not enough bounce pages are being reserved for all
three drives to operate concurrently, then there might be a bug in
busdma. We should discuss this offline.

>> The whole point of the deferral mechanism is to allow you to
>> allocate enough pages for a normal load while also being able to
>> handle sporadic spikes in load (like when the syncer runs) without
>> trapping memory.
>
> In this case 32 bounce pages (out of 8 GB RAM) for 6 disks seems
> like a very tight bottleneck to me.

If that's all that is needed to saturate non-tagged ATA, then there
is nothing wrong with that. But once tagged queuing comes into the
picture, more resources will need to be reserved, of course. This
should all just work, since it works for other drivers, but I'm happy
to help investigate bugs.
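For reference, the pattern that grows the bounce pool looks roughly
like this; the channel structure and queue depth are hypothetical,
but each successful bus_dmamap_create() call is what gives busdma the
chance to reserve bounce pages for another in-flight transaction:

#define XX_QUEUE_DEPTH  32      /* made-up tagged-queuing depth */

struct xx_slot {
    bus_dmamap_t    map;        /* one map per outstanding command */
};

struct xx_channel {
    bus_dma_tag_t   dma_tag;
    struct xx_slot  slots[XX_QUEUE_DEPTH];
};

static int
xx_alloc_maps(struct xx_channel *ch)
{
    int error, i;

    /* Each map created here lets the bounce zone grow to cover one
       more concurrent transaction, up to the zone's limit. */
    for (i = 0; i < XX_QUEUE_DEPTH; i++) {
        error = bus_dmamap_create(ch->dma_tag, 0, &ch->slots[i].map);
        if (error != 0)
            return (error);
    }
    return (0);
}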
Scott