Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 16 Feb 2016 08:10:04 -0700
From:      Ian Lepore <ian@freebsd.org>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        Michal Meloun <mmel@freebsd.org>, src-committers@freebsd.org, Marius Strobl <marius@freebsd.org>, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject:   Re: svn commit: r295557 - head/sys/dev/uart
Message-ID:  <1455635404.12873.78.camel@freebsd.org>
In-Reply-To: <20160216115639.T2000@besplex.bde.org>
References:  <201602120514.u1C5EwWt053622@repo.freebsd.org> <20160212164755.GC4980@alchemy.franken.de> <20160213041246.V1974@besplex.bde.org> <20160213005801.GF15359@alchemy.franken.de> <56C1BDE2.8090300@freebsd.org> <20160216080249.F1233@besplex.bde.org> <1455579466.12873.23.camel@freebsd.org> <20160216103914.F1693@besplex.bde.org> <1455582802.12873.49.camel@freebsd.org> <20160216115639.T2000@besplex.bde.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Tue, 2016-02-16 at 12:46 +1100, Bruce Evans wrote:
> On Mon, 15 Feb 2016, Ian Lepore wrote:
> 
> > On Tue, 2016-02-16 at 11:01 +1100, Bruce Evans wrote:
> >> On Mon, 15 Feb 2016, Ian Lepore wrote:
> >>
> >>> On Tue, 2016-02-16 at 09:28 +1100, Bruce Evans wrote:
> >>>> On Mon, 15 Feb 2016, Michal Meloun wrote:
> >>>>
> >>>>> [...]
> >>>>> Please note that ARM architecture does not have vectored interrupts,
> >>>>> CPU must read actual interrupt source from external interrupt
> >>>>> controller (GIC) register. This register contain predefined value if
> >>>>> none of interrupts are active.
> >>>>>
> >>>>> 1 - CPU1: enters ns8250_bus_transmit() and sets IER_ETXRDY.
> >>>>> 2 - HW: UART interrupt is asserted, processed by GIC and signaled
> >>>>>    to CPU2.
> >>>>> 3 - CPU2: enters interrupt service.
> >>>>
> >>>> It is blocked by uart_lock(), right?
> >>>>
> >>>>> 4 - CPU1: writes character to into REG_DATA register.
> >>>>> 5 - HW: UART clear its interrupt request
> >>>>> 6 - CPU2: reads interrupt source register. No active interrupt is
> >>>>>    found, spurious interrupt is signaled, and CPU leaves interrupted
> >>>>>    state.
> >>>>> 7 - CPU1: executes uart_barrier(). This function is not empty on ARM,
> >>>>>    and can be slow in some cases.
> >>>>
> >>>> It is not empty even on x86, although it probably should be.
> >>>>
> >>>> BTW, if arm needs the barrier, then how does it work with
> >>>> bus_space_barrier() referenced in just 25 files in all of /sys/dev?
> >>>
> >>> With a hack, of course.  In the arm interrupt-controller drivers we
> >>> always call bus_space_barrier() right before doing an EOI.  It's not a
> >>> 100% solution, but in practice it seems to work pretty well.
> >>
> >> I thought about the x86 behaviour a bit more and now see that it does
> >> need barriers but not the ones given by bus_space_barrier().  All (?)
> >> interrupt handlers use mutexes (if not driver ones, then higher-level
> >> ones).   These might give stronger or different ordering than given by
> >> bus_space_barrier().  On x86, they use the same memory bus lock as
> >> the bus_space_barrier().  This is needed to give ordering across
> >> CPUs.  But for accessing a single device, you only need program order
> >> for a single CPU.  This is automatic on x86 provided a mutex is used
> >> to prevent other CPUs accessing the same device.  And if you don't use
> >> a mutex, then bus_space_barrier() cannot give the necessary ordering
> >> since if cannot prevent other CPUs interfering.
> >>
> >> So how does bus_space_barrier() before EOI make much difference?  It
> >> doesn't affect the order for a bunch of accesses on a single CPU.
> >> It must do more than a mutex to do something good across CPUs.
> >> Arguably, it is a bug in mutexes is they don't gives synchronization
> >> for device memory.
> >>> ...
> >>> The hack code does a drain-write-buffer which doesn't g'tee that the
> >>> slow peripheral write has made it all the way to the device, but it
> >>> does at least g'tee that the write to the bus the perhiperal is on has
> >>> been posted and ack'd by any bus<->bus bridge, and that seems to be
> >>> good enough in practice.  (If there were multiple bridged busses
> >>> downstream it probably wouldn't be, but so far things aren't that
> >>> complicated inside the socs we support.)
> >>
> >> Hmm, so there is some automatic strong ordering but mutexes don't
> >> work for device memory?
> >
> > I guess you keep mentioning mutexes because on x86 their implementation
> > uses some of the same instructions that are involved in bus_space
> > barriers on x86?  Otherwise I can't see what they have to do with
> > anything related to the spurious interrupts that happen on arm.  (You
> > also mentioned multiple CPUs, which is not a requirement for this
> > trouble on arm, it'll happen with a single core.)
> 
> Partly.  I wasn't worrying about this "spurious" interrupt but locking
> in general.  Now I don't see how mutexes can work unless they order
> device accesses in exactly the same way as ordinary memory accesses.
> 

Mutexes on arm are implemented with entirely different instructions,
which do not cause any of this buffer draining or influence the
ordering of surrouding accesses to non-mutex data.  When a mutex is
used to prevent concurrent hardware access between the interrupt and
non-interrupt parts of a driver, or multiple cores accessing the
hardware, a bus_space_barrier() is required after writing to the
hardware and before releasing the mutex.

Of course, such barriers don't exist in most drivers, especially ones
not written for soc-specific hardware, and to the degree those drivers
work, it's by accident.  ::sigh::  There's no easy way to slip in a
"mostly fixes all drivers" hack like the EOI hack, short of doing a
barrier at the end of every bus_space access, and that's too expensive.

> > The piece of info you're missing might be the fact that memory-mapped
> > device registers on arm are mapped with the Device attribute which
> > gives stronger ordering than Normal memory.  In particular, writes are
> > in order and not combined, but they are buffered.
> 
> arm doesn't need the barrier after each output byte then, and in general
> bus_space_write_multi_N() shouldn't need a barrier after each write, but
> only it can reasonably know if it does, so any barrier(s) for it belong
> in it and not in callers.
> 

There is certainly no need for a barrier call after each byte is
written into the fifo (on any hardware that I know of).  A single
barrier at the end of the loop should suffice.

Hmmm, I wonder if that implies that another way to fix the original
problem, one that would work with buggy hardware too, would be:

  enable interrupt
  barrier
  loop to fill the fifo
  barrier

The first barrier would ensure the interrupt-enable reaches its
register before any data gets to the fifo (probably not needed but safe
and not-too-expensive).  The second barrier should ensure that at least
one byte has gotten to the TX register and that should result in de
-asserting the TXRDY before control leaves the uart interrupt routines.

> Drivers for arm could also do a bunch of unrelated writes (and reads?)
> followed by a single barrier.  But this is even more MD.
> 

That's what virtually all soc-specific drivers should be doing on arm,
but even that close to the metal, most of them don't.  (I'm as guilty
as anyone on that front -- some of this knowledge was acquired after
drivers were written, and I haven't gone back and cleaned up.)

> > In some designs
> > there are multiple buffers, so there can be multiple writes that
> > haven't reached the hardware yet.  A read from the same region will
> > stall until all writes to that region are done, and there is also an
> > instruction that specifically forces out the buffers and stalls until
> > they're empty.
> 
> The multiple regions should be in separate bus spaces so that the barriers
> can be optimized to 1 at the end.  I now see how the optimization can
> almost be done at the bus_space level -- drivers call
> bus_space_maybe_barrier() after every access, but this is a no-op on bus
> spaces with sufficient ordering iff the bus space hasn't changed;
> drivers call bus_space_sync_barriers() at the end.  I think the DMA sync
> functions work a bit like this.  A well-written interrupt handler would
> have just 1 bus_space_sync_barriers() at the end.  This works like the
> EOI hack.  If all arches have sufficiently strong ordering then
> bus_space_maybe_barrier() is not needed but more careful syncing is
> needed when switching the bus space.
> 

There is a special place in my heart for any solution that includes a
function with "maybe" in the name.

Beyond that, I'm going to have to ponder this one.  The part that
sounds expensive at runtime is keeping track of which bus_space was
last accessed so that barriers can be automatically inserted between
writes to difference spaces.  (Or semi-automatically if it's done only
on a maybe_barrier, but the bookkeeping is the same.)

> > Without doing the drain-write-buffer (or a device read) after each
> > write, the only g'tee you'd get is that each device sees the writes
> > directed at it in the order they were issued.  With devices A and B,
> > you could write a sequence of A1 B1 A2 B2 A3 B3 and they could arrive
> > at the devices as A1 A2 B1 B2 A3 B3, or any other permutation, as long
> > as device A sees 123 and device B sees 123.
> 
> I think it is even more important that the order is synchronized relative
> to memory for mutex locking.  Mutex locking presumably doesn't issue
> bus barriers because that would be too wasteful for the usual case.  So
> before almost every mutex unlock for a driver, there must be a
> bus_space_sync_barriers() call and this must flush all the write buffers
> for devices before all memory accesses for the mutex, so that the mutex
> works as expected.  But is this automatic on arm?  Ordinary memory is
> like a different bus space that has a different write buffer which might
> be flushed in a different order.
> 

Looks like what I wrote in my first paragraphs might have been more in
context here.  Ordinary memory is literally another bus here.

> > So on arm the need for barriers arises primarily when two different
> > devices interact with each other in some way and it matters that a
> > series of interleaved writes reaches the devices in the same relative
> > order they were issued by the cpu.  That condition mostly comes up only
> > in terms of the PIC interacting with basically every other device.
> 
> The code under discussion seems to provide a good example of why you
> need ordering relative to ordinary memory too.  In the old version,
> CPU2 is waiting for the unlock.  If it sees this before the device
> memory is flushed, then it might be confused.  I guess if the device
> memory is strongly ordered (but buffered) CPU2 would not see reads out
> of order for the device memory, but it might see everything for device
> memory out order relative to ordinary memory.  However, the sprinkled
> bs barriers order everything.
> 

Yes, the bus_space_barrier that is done for the EOI hack is also a
barrier for any preceeding ordinary memory accesses.  But it usually
happens right after a mutex has been released in an interrupt routine,
so there's still that big dose of "works by accident".

> > I expect trouble to show up any time now as we start implementing DMA
> > drivers in socs that have generic DMA engines that are only loosely
> > coupled to the devices they're moving data for.  That just seems like
> > another place where a single driver is coordinating the actions of two
> > different pieces of hardware that may be on different busses, and it's
> > ripe for the lack of barriers to cause rare or intermittant failures.
> 
> I don't understand DMA syncs for most NICs either.  They seem too simple
> to work.  Mutex locking isn't going to control the order of host DMA.
> 

On arm, the cache sync operations for DMA all end with the same drain
write buffer sequence used in the EOI hack.

-- Ian



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?1455635404.12873.78.camel>