Date:      Wed, 11 Jan 2012 09:59:47 -0700
From:      Scott Long <scottl@samsco.org>
To:        Luigi Rizzo <rizzo@iet.unipi.it>
Cc:        Adrian Chadd <adrian@freebsd.org>, freebsd-current@freebsd.org
Subject:   Re: memory barriers in bus_dmamap_sync() ?
Message-ID:  <4E8FCE8E-DDCB-4B38-9BFD-2A67BF03D50F@samsco.org>
In-Reply-To: <20120111162944.GB2266@onelab2.iet.unipi.it>
References:  <20120110213719.GA92799@onelab2.iet.unipi.it> <CAJ-VmomdQ5ZWBf_h1xJhppO8WsinvK7RJiDSgDrYKpo+J8eGYQ@mail.gmail.com> <20120110224100.GB93082@onelab2.iet.unipi.it> <201201111005.28610.jhb@freebsd.org> <20120111162944.GB2266@onelab2.iet.unipi.it>


On Jan 11, 2012, at 9:29 AM, Luigi Rizzo wrote:

> On Wed, Jan 11, 2012 at 10:05:28AM -0500, John Baldwin wrote:
>> On Tuesday, January 10, 2012 5:41:00 pm Luigi Rizzo wrote:
>>> On Tue, Jan 10, 2012 at 01:52:49PM -0800, Adrian Chadd wrote:
>>>> On 10 January 2012 13:37, Luigi Rizzo <rizzo@iet.unipi.it> wrote:
>>>>> I was glancing through manpages and implementations of bus_dma(9)
>>>>> and i am a bit unclear on what this API (in particular,
>>>>> bus_dmamap_sync()) does in terms of memory barriers.
>>>>>
>>>>> I see that the x86/amd64 and ia64 code only does the bounce
>>>>> buffers.
>>
>> That is because x86 in general does not need memory barriers. ...
>
> maybe they are not called memory barriers but for instance
> how do i make sure, even on the x86, that a write to the NIC ring
> is properly flushed before the write to the 'start' register occurs ?
>

Flushed from where?  The CPU's cache, or the device memory and PCI bus?
I already told you that x86/64 is fundamentally designed around bus
snooping, and John already told you that we map device memory to be
uncached.  Also, PCI guarantees that reads and writes are retired in
order, and that reads are therefore flushing barriers.  So let's take
two scenarios.  In the first scenario, the NIC descriptors are in device
memory, so the driver has to do bus_space accesses to write them.

Scenario 1
1. Driver writes to the descriptors.  These may or may not hang out in
the CPU's cache, though they probably won't because we map PCI device
memory as uncacheable.  But let's say for the sake of argument that they
are cached.
2. Driver writes to the 'go' register on the card.  This may or may not
be in the CPU's cache, as in step 1.
3. The writes get flushed out of the CPU and onto the host bus.  Again,
the x86/64 architecture guarantees that these writes won't be reordered.
4. The writes get onto the PCI bus and are buffered at the first bridge.
5. PCI ordering rules keep the writes in order, and they eventually make
it to the card in the same order that the driver executed them.
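
In driver terms, scenario 1 looks roughly like the sketch below.  The
foo_softc, register offsets, and descriptor window are invented purely
for illustration, not taken from any real driver:

#include <sys/param.h>
#include <sys/bus.h>
#include <machine/bus.h>

#define FOO_DESC_BASE	0x1000	/* hypothetical descriptor window in a BAR */
#define FOO_REG_GO	0x0040	/* hypothetical 'go' doorbell register */

struct foo_softc {
	bus_space_tag_t		sc_bst;
	bus_space_handle_t	sc_bsh;
};

static void
foo_start(struct foo_softc *sc, const uint32_t *desc, int nwords)
{
	int i;

	/* Step 1: program the descriptor words in device memory. */
	for (i = 0; i < nwords; i++)
		bus_space_write_4(sc->sc_bst, sc->sc_bsh,
		    FOO_DESC_BASE + i * 4, desc[i]);

	/*
	 * Step 2: hit the 'go' register.  Device memory is mapped
	 * uncacheable and PCI retires the writes in order, so no
	 * explicit barrier is needed between steps 1 and 2.
	 */
	bus_space_write_4(sc->sc_bst, sc->sc_bsh, FOO_REG_GO, 1);
}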

Scenario 2
1. Driver writes to the descriptors in host memory.  This memory is
mapped as cacheable, so these writes hang out in the CPU.
2. Driver writes to the 'go' register on the card.  This may or may not
hang out in the CPU's cache, but likely won't as discussed previously.
3. The 'go' write eventually makes its way down to the card, and the
card starts its processing.
4. The card masters a PCI read for the descriptor data, and the request
goes up the PCI bus to the host bridge.
5. Thanks to the fundamental design guarantees on x86/64, the PCI host
bridge, memory controller, and CPU all snoop each other.  In this case,
the CPU sees the read come from the PCI host bridge, knows that it's for
data that's in its cache, and intercepts and fills the request.
Coherency is preserved!
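
And a similarly hypothetical sketch of scenario 2, with the descriptors
in cacheable host memory set up through bus_dma(9).  The foo_* names and
the descriptor layout are again made up:

#include <sys/param.h>
#include <sys/bus.h>
#include <sys/endian.h>
#include <machine/bus.h>

#define FOO_REG_GO	0x0040	/* hypothetical 'go' doorbell register */

struct foo_desc {			/* hypothetical descriptor layout */
	uint64_t	fd_addr;
	uint32_t	fd_len;
	uint32_t	fd_flags;
};

struct foo_softc {
	bus_space_tag_t		sc_bst;
	bus_space_handle_t	sc_bsh;
	bus_dma_tag_t		sc_desc_tag;
	bus_dmamap_t		sc_desc_map;
	struct foo_desc		*sc_descs;	/* KVA of the descriptor ring */
};

static void
foo_start(struct foo_softc *sc, int idx, bus_addr_t pa, uint32_t len)
{
	/* Step 1: fill in the descriptor; this can sit in the CPU cache. */
	sc->sc_descs[idx].fd_addr = htole64(pa);
	sc->sc_descs[idx].fd_len = htole32(len);
	sc->sc_descs[idx].fd_flags = htole32(1);	/* e.g. "owned by card" */

	/*
	 * On x86/64 this sync is a no-op as far as cache maintenance goes
	 * (snooping keeps the card's read coherent), but it's still
	 * required so the same code works with bounce buffers and on
	 * other architectures.
	 */
	bus_dmamap_sync(sc->sc_desc_tag, sc->sc_desc_map,
	    BUS_DMASYNC_PREWRITE);

	/* Step 2: ring the doorbell; the card then masters its read (step 4). */
	bus_space_write_4(sc->sc_bst, sc->sc_bsh, FOO_REG_GO, 1);
}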

Explicit barriers aren't needed in either scenario; everything will
retire correctly and in order.  The only caveat is the buffering that
happens on the PCI bus.  A write by the host might take a relatively
long and indeterminate time to reach the card thanks to this buffering
and the bus being busy.  To guarantee that you know when the write has
been delivered and retired, you can do a read immediately after the
write.  On some systems, this might also boost the transaction priority
of the write and get it down faster, but that's really not a reliable
guarantee.  All you'll know is that when the read completes, the write
prior to it has also completed.
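
In code, that read-back flush is just the following; FOO_REG_STATUS
stands in for any harmless readable register on the same device, and
both offsets are hypothetical:

#include <sys/param.h>
#include <sys/bus.h>
#include <machine/bus.h>

#define FOO_REG_GO	0x0040	/* hypothetical doorbell register */
#define FOO_REG_STATUS	0x0044	/* hypothetical readable register */

static void
foo_kick_and_flush(bus_space_tag_t bst, bus_space_handle_t bsh)
{
	/* Post the 'go' write... */
	bus_space_write_4(bst, bsh, FOO_REG_GO, 1);

	/*
	 * ...then read from the same device.  PCI ordering forbids the
	 * read from passing the posted write, so once this read returns
	 * the write is known to have been delivered to the card.
	 */
	(void)bus_space_read_4(bst, bsh, FOO_REG_STATUS);
}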

Where barriers _are_ needed is in interrupt handlers, and I can discuss
that if you're interested.

Scott



