From: Ian Lepore <freebsd@damnhippie.dyndns.org>
To: freebsd-arch@freebsd.org, freebsd-mips@freebsd.org, freebsd-arm@freebsd.org
Date: Thu, 23 Aug 2012 15:28:20 -0600
Subject: Partial cacheline flush problems on ARM and MIPS

A recent innocuous change to the USB driver code caused intermittent
errors in the umass(4) driver on ARM and MIPS platforms.  This message
is some background on partial cacheline flushes, plus what I found
while investigating the cause, which is rooted in the DMA cache
coherency code.

First I need to say that when I say ARM in this message I mean "ARM and
MIPS and any other platform where cache coherency around DMA operations
is maintained with help from software (as opposed to being implemented
purely in hardware, as on i386/amd64)."  I have no experience on MIPS,
but I believe it is similar to ARM with regard to cache coherency.  As
far as I know, this is not specific to VIVT caches but to software
cache coherency in general, so it probably applies to armv6/v7 as well
as the armv4/v5 systems I'm working with.

Over the years we've had a series of glitches and patches in the
partial cacheline flush logic for ARM.  Even with all the fixes, I've
long thought there were two ways the scheme could fail, but I've never
been able to demonstrate either experimentally, and empirically the
failures seem to be rare.  Both involve the fact that several software
entities try to maintain the caches concurrently, without knowledge of
each other's actions or needs.  The two ways are variations of the same
situation: a dirty cacheline is flushed by software while a DMA
operation that overlaps that cacheline is in progress.

* A cpu_dcache_wbinv_all() call happens after some DMA data has hit
  main memory, but before bus_dmamap_sync(POSTREAD) is called.
* Two independent DMA operations are happening in different parts of
  the same cacheline, and some DMA data from both devices has hit main
  memory; whichever operation does the POSTREAD sync first can wipe out
  the other operation's still-in-progress data in main memory.

For problems to happen, the CPU also has to modify the same memory
location / cacheline as the DMA, so that the cache holds the newest
data for part of the line and main memory holds the newest data for the
rest.  Our logic for handling POSTREAD partial cacheline flushes
creates this condition even if it doesn't already exist on entry to the
sync routine (a sketch of that logic appears below).

The wbinv_all() situation seemed to me the most likely to occur.  It
gets called from a variety of places for a variety of reasons.  It is
called from several places in pmap.c; it appears to me that many of
those invocations can happen at completely arbitrary points with
respect to any IO that's in progress.  Another way wbinv_all() can get
invoked is during a call to wbinv_range() or even just inv_range(),
when the range to be invalidated is so large that looping through it
and discarding one line at a time would be too inefficient on a given
platform.  In those cases, on some ARM platforms, the inv_range()
implementation just calls wbinv_all() internally.

The implication is that wbinv_all() can potentially be invoked as part
of the busdma sync operations for any IO -- PREREAD, PREWRITE, or
POSTREAD -- from any device at any time.

A recent USB driver change moved some things around in memory, such
that a small (13-byte) IO buffer became split across two cachelines,
and suddenly we had intermittent (but fairly frequent) failures
reported by umass(4).  Some logging in the USB driver showed stale data
from previous IO operations in part of the IO buffer.  I added code to
pre-initialize the buffer with known byte patterns before starting the
IO; after the IO, part of the buffer still contained those patterns,
and the rest of the buffer (after the cacheline split point) contained
newer data from the IO.  It looked pretty conclusively as if the
partial cacheline flush logic was failing.

First I investigated the logic for handling such splits, but it was
working correctly.  So I moved on to assuming the cause was one of the
two potential problems I've long suspected.  A helpful clue from Hans:
the buffer in question is allocated once at device creation and remains
allocated from that point on.  That made it easy to save the buffer
pointer when it was created and to write wrappers for all the cache
writeback and invalidate routines that check whether the cacheline
containing that buffer is part of the cache operation.

What I expected to see was USB calling the busdma sync ops before
starting the IO, and then, before it called the post-IO sync ops,
something else in the system calling wbinv_all() or a [wb]inv_range()
that included the umass buffer address.  What I actually saw was that
that never happened.  Not even once.  Very rarely I would see some
other [wb]inv_range() calls, but the ranges never involved the umass
buffer, and the unit I'm doing these tests on (a DreamPlug) is not one
that ever turns an inv_range() into a wbinv_all().
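Roughly, the wrappers had this shape (a simplified sketch, not the
actual patch; watch_buf, watch_len, and real_dcache_inv_range are
illustrative names standing in for the per-platform cpufunc plumbing):

    #include <sys/param.h>
    #include <sys/systm.h>

    static vm_offset_t watch_buf;   /* set where the umass buffer is created */
    static vm_size_t watch_len;

    /* The platform's original routine, stashed before installing the wrapper. */
    extern void real_dcache_inv_range(vm_offset_t, vm_size_t);

    static int
    overlaps_watched(vm_offset_t start, vm_size_t size)
    {
            return (watch_buf != 0 &&
                start < watch_buf + watch_len &&
                watch_buf < start + size);
    }

    static void
    traced_dcache_inv_range(vm_offset_t start, vm_size_t size)
    {
            if (overlaps_watched(start, size))
                    printf("inv_range covers watched buffer: 0x%08lx/%lu\n",
                        (u_long)start, (u_long)size);
            real_dcache_inv_range(start, size);
    }

An analogous wrapper goes around each of the other writeback/invalidate
entry points; the wbinv_all() wrapper logs unconditionally, since by
definition it covers every line in the cache.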
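And for reference, here is the shape of the POSTREAD partial-line
handling mentioned earlier, showing just the head-of-buffer case (a
simplified sketch of the logic, not a verbatim copy of the busdma code;
arm_dcache_align and cpu_dcache_inv_range come from
<machine/cpufunc.h>):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <machine/cpufunc.h>

    static void
    postread_fixup_head(vm_offset_t buf)
    {
            char save[64];          /* assumes a line size of 64 or less */
            vm_offset_t line = buf & ~(vm_offset_t)(arm_dcache_align - 1);
            vm_size_t head = buf - line;    /* CPU-owned bytes before buf */

            if (head == 0)
                    return;         /* buffer starts on a line boundary */
            /* Preserve the CPU-owned bytes that share the line... */
            memcpy(save, (void *)line, head);
            /* ...discard the whole line so reads refill from main memory... */
            cpu_dcache_inv_range(line, arm_dcache_align);
            /*
             * ...and write the CPU-owned bytes back.  The line is now dirty
             * again; if it gets written back -- by software, or by a
             * hardware eviction -- while DMA into the rest of the line is
             * still in flight, the DMA data in main memory is destroyed.
             */
            memcpy((void *)line, save, head);
    }

(The tail-of-buffer case is symmetric.)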
It eventually occurred to me that I had been overlooking the most
obvious way a dirty cacheline can get written back to main memory: the
cache hardware needs to evict a line to make room for a new incoming
line, and the line it chooses is dirty, so it gets written back before
being evicted.  Unfortunately, there is no way to instrument the
software to detect that happening, so now I'm in the position of
proving something based on a complete lack of evidence that anything
else is the cause.  That's a great way to promote a conspiracy theory;
not so great for debugging.

In addition to showing that no software-triggered flush/invalidate
operations were affecting the cacheline, I was able to show that the
problem wasn't just that a partial cacheline flush was involved: the
error depended on the specific memory addresses (and thus the specific
cachelines) involved.  At the point in the USB code where the buffer is
allocated, I changed the code to add 32 bytes to the buffer offset, so
that the buffer is still split across two cachelines in exactly the
same way as before, but now across two different cachelines.  With that
change the error doesn't occur.  I think that lends some weight to the
theory that hardware cacheline eviction is flushing a dirty cacheline
while IO into that memory is in progress, but it's just more
circumstantial evidence.

The intermittent-but-frequent nature of the error may also be
circumstantial evidence that hardware eviction is the cause.  My
DreamPlug has a 4-way set-associative cache that selects one of the
ways at random when it needs to evict a line for a refill.  That would
seem to imply a one-in-four chance that the cacheline holding the umass
status buffer is the one that gets hit, and that matches the symptoms I
see: "this USB drive kind of works, but there are tons of errors
spewing on the console about it."  Sometimes there are several failures
in a row and the drive fails to attach, but most of the time it limps
along with lots of errors followed by successful retries.

I considered trying to lock the cacheline in question into the cache as
a way of confirming this (that should make the error go away).  It
turns out that's not especially easy to do on this platform, and you
can't lock a single cacheline; you have to lock a whole cache way.
That's a big enough change to perturb system operation in general, so
it may be hard to draw conclusions from the results.

The ARM Architecture Reference Manual mentions the following guidelines
as part of the strategy for handling DMA/cache coherency:

* marking the memory areas involved in the DMA operation as uncacheable
  and/or unbufferable
* cleaning and/or invalidating the data cache, at least with respect to
  the address range involved in the DMA operation
* draining the write buffer
* restricting processor accesses to the address range involved in the
  DMA operation until it is known that the DMA operation is complete

Our partial cacheline flush logic is trying to wish away the last
bullet item, but now I think we can never successfully do so (the
sketch below shows what that bullet means in busdma terms).
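A minimal sketch of the required discipline for a device-to-memory
transfer follows.  The sync calls are the real bus_dmamap_sync(9)
interface; the tag, map, and helper routines are hypothetical
driver-local names:

    #include <sys/param.h>
    #include <sys/bus.h>
    #include <machine/bus.h>

    extern bus_dma_tag_t sc_tag;    /* hypothetical softc members */
    extern bus_dmamap_t sc_map;
    void start_device_dma(void);    /* hypothetical: start the transfer */
    void wait_for_dma_done(void);   /* hypothetical: wait for completion */

    static void
    dma_read_with_discipline(void)
    {
            /* Write back / discard CPU-side cachelines covering the buffer. */
            bus_dmamap_sync(sc_tag, sc_map, BUS_DMASYNC_PREREAD);
            start_device_dma();
            /*
             * Between the PREREAD and POSTREAD syncs, nothing on the CPU may
             * touch any byte that shares a cacheline with the buffer.  With
             * a 13-byte buffer split across two lines, unrelated code can't
             * realistically honor that.
             */
            wait_for_dma_done();
            /* Discard stale lines so the CPU sees what the device wrote. */
            bus_dmamap_sync(sc_tag, sc_map, BUS_DMASYNC_POSTREAD);
    }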
Until last week I thought we had a theoretical problem that could
eventually be fixed by a sufficiently clever cache maintenance
implementation that somehow kept unrelated parts of the OS from
interfering with each other's operations.  Now it appears that hardware
operations we have no direct control over can also lead to memory
corruption, and no amount of software cleverness is ever going to allow
concurrent CPU and DMA access to the same memory without disabling the
cache for that memory range.

At this point I was going to launch into some "what can we do about it"
rambling, but this is long enough already.  I'll leave this message as
a summary of where we've come from and what I learned in the past few
days, and leave "what next" for followups.

-- Ian