From owner-svn-src-head@freebsd.org Tue Feb 16 15:11:15 2016 Return-Path: Delivered-To: svn-src-head@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id B8EEEAAAD53 for ; Tue, 16 Feb 2016 15:11:15 +0000 (UTC) (envelope-from ian@freebsd.org) Received: from pmta2.delivery6.ore.mailhop.org (pmta2.delivery6.ore.mailhop.org [54.200.129.228]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 9B895AE4 for ; Tue, 16 Feb 2016 15:11:15 +0000 (UTC) (envelope-from ian@freebsd.org) X-MHO-User: 8e3460bf-d4bf-11e5-8de6-958346fd02ba X-Report-Abuse-To: https://support.duocircle.com/support/solutions/articles/5000540958-duocircle-standard-smtp-abuse-information X-Originating-IP: 73.34.117.227 X-Mail-Handler: DuoCircle Outbound SMTP Received: from ilsoft.org (unknown [73.34.117.227]) by outbound2.ore.mailhop.org (Halon Mail Gateway) with ESMTPSA; Tue, 16 Feb 2016 15:11:29 +0000 (UTC) Received: from rev (rev [172.22.42.240]) by ilsoft.org (8.15.2/8.14.9) with ESMTP id u1GFA4nV027192; Tue, 16 Feb 2016 08:10:05 -0700 (MST) (envelope-from ian@freebsd.org) Message-ID: <1455635404.12873.78.camel@freebsd.org> Subject: Re: svn commit: r295557 - head/sys/dev/uart From: Ian Lepore To: Bruce Evans Cc: Michal Meloun , src-committers@freebsd.org, Marius Strobl , svn-src-all@freebsd.org, svn-src-head@freebsd.org Date: Tue, 16 Feb 2016 08:10:04 -0700 In-Reply-To: <20160216115639.T2000@besplex.bde.org> References: <201602120514.u1C5EwWt053622@repo.freebsd.org> <20160212164755.GC4980@alchemy.franken.de> <20160213041246.V1974@besplex.bde.org> <20160213005801.GF15359@alchemy.franken.de> <56C1BDE2.8090300@freebsd.org> <20160216080249.F1233@besplex.bde.org> <1455579466.12873.23.camel@freebsd.org> <20160216103914.F1693@besplex.bde.org> <1455582802.12873.49.camel@freebsd.org> <20160216115639.T2000@besplex.bde.org> Content-Type: text/plain; charset="us-ascii" X-Mailer: Evolution 3.16.5 FreeBSD GNOME Team Port Mime-Version: 1.0 Content-Transfer-Encoding: 7bit X-BeenThere: svn-src-head@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: SVN commit messages for the src tree for head/-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 16 Feb 2016 15:11:15 -0000 On Tue, 2016-02-16 at 12:46 +1100, Bruce Evans wrote: > On Mon, 15 Feb 2016, Ian Lepore wrote: > > > On Tue, 2016-02-16 at 11:01 +1100, Bruce Evans wrote: > >> On Mon, 15 Feb 2016, Ian Lepore wrote: > >> > >>> On Tue, 2016-02-16 at 09:28 +1100, Bruce Evans wrote: > >>>> On Mon, 15 Feb 2016, Michal Meloun wrote: > >>>> > >>>>> [...] > >>>>> Please note that ARM architecture does not have vectored interrupts, > >>>>> CPU must read actual interrupt source from external interrupt > >>>>> controller (GIC) register. This register contain predefined value if > >>>>> none of interrupts are active. > >>>>> > >>>>> 1 - CPU1: enters ns8250_bus_transmit() and sets IER_ETXRDY. > >>>>> 2 - HW: UART interrupt is asserted, processed by GIC and signaled > >>>>> to CPU2. > >>>>> 3 - CPU2: enters interrupt service. > >>>> > >>>> It is blocked by uart_lock(), right? > >>>> > >>>>> 4 - CPU1: writes character to into REG_DATA register. > >>>>> 5 - HW: UART clear its interrupt request > >>>>> 6 - CPU2: reads interrupt source register. No active interrupt is > >>>>> found, spurious interrupt is signaled, and CPU leaves interrupted > >>>>> state. > >>>>> 7 - CPU1: executes uart_barrier(). This function is not empty on ARM, > >>>>> and can be slow in some cases. > >>>> > >>>> It is not empty even on x86, although it probably should be. > >>>> > >>>> BTW, if arm needs the barrier, then how does it work with > >>>> bus_space_barrier() referenced in just 25 files in all of /sys/dev? > >>> > >>> With a hack, of course. In the arm interrupt-controller drivers we > >>> always call bus_space_barrier() right before doing an EOI. It's not a > >>> 100% solution, but in practice it seems to work pretty well. > >> > >> I thought about the x86 behaviour a bit more and now see that it does > >> need barriers but not the ones given by bus_space_barrier(). All (?) > >> interrupt handlers use mutexes (if not driver ones, then higher-level > >> ones). These might give stronger or different ordering than given by > >> bus_space_barrier(). On x86, they use the same memory bus lock as > >> the bus_space_barrier(). This is needed to give ordering across > >> CPUs. But for accessing a single device, you only need program order > >> for a single CPU. This is automatic on x86 provided a mutex is used > >> to prevent other CPUs accessing the same device. And if you don't use > >> a mutex, then bus_space_barrier() cannot give the necessary ordering > >> since if cannot prevent other CPUs interfering. > >> > >> So how does bus_space_barrier() before EOI make much difference? It > >> doesn't affect the order for a bunch of accesses on a single CPU. > >> It must do more than a mutex to do something good across CPUs. > >> Arguably, it is a bug in mutexes is they don't gives synchronization > >> for device memory. > >>> ... > >>> The hack code does a drain-write-buffer which doesn't g'tee that the > >>> slow peripheral write has made it all the way to the device, but it > >>> does at least g'tee that the write to the bus the perhiperal is on has > >>> been posted and ack'd by any bus<->bus bridge, and that seems to be > >>> good enough in practice. (If there were multiple bridged busses > >>> downstream it probably wouldn't be, but so far things aren't that > >>> complicated inside the socs we support.) > >> > >> Hmm, so there is some automatic strong ordering but mutexes don't > >> work for device memory? > > > > I guess you keep mentioning mutexes because on x86 their implementation > > uses some of the same instructions that are involved in bus_space > > barriers on x86? Otherwise I can't see what they have to do with > > anything related to the spurious interrupts that happen on arm. (You > > also mentioned multiple CPUs, which is not a requirement for this > > trouble on arm, it'll happen with a single core.) > > Partly. I wasn't worrying about this "spurious" interrupt but locking > in general. Now I don't see how mutexes can work unless they order > device accesses in exactly the same way as ordinary memory accesses. > Mutexes on arm are implemented with entirely different instructions, which do not cause any of this buffer draining or influence the ordering of surrouding accesses to non-mutex data. When a mutex is used to prevent concurrent hardware access between the interrupt and non-interrupt parts of a driver, or multiple cores accessing the hardware, a bus_space_barrier() is required after writing to the hardware and before releasing the mutex. Of course, such barriers don't exist in most drivers, especially ones not written for soc-specific hardware, and to the degree those drivers work, it's by accident. ::sigh:: There's no easy way to slip in a "mostly fixes all drivers" hack like the EOI hack, short of doing a barrier at the end of every bus_space access, and that's too expensive. > > The piece of info you're missing might be the fact that memory-mapped > > device registers on arm are mapped with the Device attribute which > > gives stronger ordering than Normal memory. In particular, writes are > > in order and not combined, but they are buffered. > > arm doesn't need the barrier after each output byte then, and in general > bus_space_write_multi_N() shouldn't need a barrier after each write, but > only it can reasonably know if it does, so any barrier(s) for it belong > in it and not in callers. > There is certainly no need for a barrier call after each byte is written into the fifo (on any hardware that I know of). A single barrier at the end of the loop should suffice. Hmmm, I wonder if that implies that another way to fix the original problem, one that would work with buggy hardware too, would be: enable interrupt barrier loop to fill the fifo barrier The first barrier would ensure the interrupt-enable reaches its register before any data gets to the fifo (probably not needed but safe and not-too-expensive). The second barrier should ensure that at least one byte has gotten to the TX register and that should result in de -asserting the TXRDY before control leaves the uart interrupt routines. > Drivers for arm could also do a bunch of unrelated writes (and reads?) > followed by a single barrier. But this is even more MD. > That's what virtually all soc-specific drivers should be doing on arm, but even that close to the metal, most of them don't. (I'm as guilty as anyone on that front -- some of this knowledge was acquired after drivers were written, and I haven't gone back and cleaned up.) > > In some designs > > there are multiple buffers, so there can be multiple writes that > > haven't reached the hardware yet. A read from the same region will > > stall until all writes to that region are done, and there is also an > > instruction that specifically forces out the buffers and stalls until > > they're empty. > > The multiple regions should be in separate bus spaces so that the barriers > can be optimized to 1 at the end. I now see how the optimization can > almost be done at the bus_space level -- drivers call > bus_space_maybe_barrier() after every access, but this is a no-op on bus > spaces with sufficient ordering iff the bus space hasn't changed; > drivers call bus_space_sync_barriers() at the end. I think the DMA sync > functions work a bit like this. A well-written interrupt handler would > have just 1 bus_space_sync_barriers() at the end. This works like the > EOI hack. If all arches have sufficiently strong ordering then > bus_space_maybe_barrier() is not needed but more careful syncing is > needed when switching the bus space. > There is a special place in my heart for any solution that includes a function with "maybe" in the name. Beyond that, I'm going to have to ponder this one. The part that sounds expensive at runtime is keeping track of which bus_space was last accessed so that barriers can be automatically inserted between writes to difference spaces. (Or semi-automatically if it's done only on a maybe_barrier, but the bookkeeping is the same.) > > Without doing the drain-write-buffer (or a device read) after each > > write, the only g'tee you'd get is that each device sees the writes > > directed at it in the order they were issued. With devices A and B, > > you could write a sequence of A1 B1 A2 B2 A3 B3 and they could arrive > > at the devices as A1 A2 B1 B2 A3 B3, or any other permutation, as long > > as device A sees 123 and device B sees 123. > > I think it is even more important that the order is synchronized relative > to memory for mutex locking. Mutex locking presumably doesn't issue > bus barriers because that would be too wasteful for the usual case. So > before almost every mutex unlock for a driver, there must be a > bus_space_sync_barriers() call and this must flush all the write buffers > for devices before all memory accesses for the mutex, so that the mutex > works as expected. But is this automatic on arm? Ordinary memory is > like a different bus space that has a different write buffer which might > be flushed in a different order. > Looks like what I wrote in my first paragraphs might have been more in context here. Ordinary memory is literally another bus here. > > So on arm the need for barriers arises primarily when two different > > devices interact with each other in some way and it matters that a > > series of interleaved writes reaches the devices in the same relative > > order they were issued by the cpu. That condition mostly comes up only > > in terms of the PIC interacting with basically every other device. > > The code under discussion seems to provide a good example of why you > need ordering relative to ordinary memory too. In the old version, > CPU2 is waiting for the unlock. If it sees this before the device > memory is flushed, then it might be confused. I guess if the device > memory is strongly ordered (but buffered) CPU2 would not see reads out > of order for the device memory, but it might see everything for device > memory out order relative to ordinary memory. However, the sprinkled > bs barriers order everything. > Yes, the bus_space_barrier that is done for the EOI hack is also a barrier for any preceeding ordinary memory accesses. But it usually happens right after a mutex has been released in an interrupt routine, so there's still that big dose of "works by accident". > > I expect trouble to show up any time now as we start implementing DMA > > drivers in socs that have generic DMA engines that are only loosely > > coupled to the devices they're moving data for. That just seems like > > another place where a single driver is coordinating the actions of two > > different pieces of hardware that may be on different busses, and it's > > ripe for the lack of barriers to cause rare or intermittant failures. > > I don't understand DMA syncs for most NICs either. They seem too simple > to work. Mutex locking isn't going to control the order of host DMA. > On arm, the cache sync operations for DMA all end with the same drain write buffer sequence used in the EOI hack. -- Ian