Date:      Wed, 22 Feb 2012 17:06:23 +0200
From:      JD Louw <jdl.ntq@gmail.com>
To:        John Baldwin <jhb@freebsd.org>
Cc:        freebsd-drivers@freebsd.org
Subject:   Re: bus_dma coalesce advice
Message-ID:  <CAB-7mS4Q3Z+c8H8rJV6z17vmMab2djTO23s2Z8Z3HSyGKFyuUQ@mail.gmail.com>
In-Reply-To: <201202211504.34169.jhb@freebsd.org>
References:  <CAB-7mS5JXO8khFLWnSPhu9nadTW9JWakCp-3bP4vwoJ5KXsX8w@mail.gmail.com> <201202211504.34169.jhb@freebsd.org>

On Tue, Feb 21, 2012 at 10:04 PM, John Baldwin <jhb@freebsd.org> wrote:
> On Monday, February 20, 2012 1:05:40 pm JD Louw wrote:
>> Hi,
>>
>> I have a Xilinx FPGA PCIe DMA design that I'd like to get going on
>> FreeBSD. I'd like some advice on the best practice of the bus_dma
>> functions. Specifically, I'd like to understand how best to coalesce
>> multiple DMA transactions.
>>
>> Using the bus_dma_tag_create and bus_dmamem_alloc functions I create
>> 256 contiguous descriptors.
>>
>>       bus_dma_tag_create(NULL,                /* parent */
>>               4,                              /* alignment */
>>               0,                              /* bounds */
>>               BUS_SPACE_MAXADDR,              /* lowaddr */
>>               BUS_SPACE_MAXADDR,              /* highaddr */
>>               NULL, NULL,                     /* filter, filterarg */
>>               256*sizeof(descriptor),         /* maxsize */
>>               1,                              /* nsegments */
>>               256*sizeof(descriptor),         /* maxsegsize */
>>               BUS_DMA_ALLOCNOW,               /* flags */
>>               NULL, NULL,                     /* lockfunc, lockarg */
>>               &desc_tag);                     /* dmat */
>>
>> I then create another bus_dma_tag for the data:
>>
>>       bus_dma_tag_create(NULL,                /* parent */
>>               4,                              /* alignment */
>>               0,                              /* bounds */
>>               BUS_SPACE_MAXADDR,              /* lowaddr */
>>               BUS_SPACE_MAXADDR,              /* highaddr */
>>               NULL, NULL,                     /* filter, filterarg */
>>               0xFFFFF,                        /* maxsize - 1MB */
>>               256,                            /* nsegments */
>>               0x1000,                         /* maxsegsize - 4KB */
>>               BUS_DMA_ALLOCNOW,               /* flags */
>>               NULL, NULL,                     /* lockfunc, lockarg */
>>               &data_tag);                     /* dmat */
>>
>> Now my question: In order to batch several mbufs/uios into the 256
>> descriptors I'd like to do multiple bus_dmamap_loads on the data tag.
>> But reading the bus_dmamap_load_mbuf/uio code it looks like this is
>> not a good idea. Each mapping operation does not subtract its nsegment
>> count from the tag maximum nsegment count, so at some point
>> bus_dmamap_load will overrun my 256 descriptors.
>
> Does your DMA engine really allow a single transaction to span more than 256
> descriptors?  (The 'nsegments' is the maximum number of S/G entries for a
> single transaction, not the number of entries in your ring.)
>
>> Do I need to allocate a separate set of descriptors for each bus_dma mapping?
>>
>> Any advice much appreciated,
>
> Typically in a NIC driver you will use bus_dmamap_load_mbuf_sg() to populate
> an array of S/G elements on the stack.  You can check the returned value for
> the number of segments and handle the case where it exceeds the number of
> segments you actually have available (e.g. by calling m_collapse() or
> m_defrag() or just queueing the packet until you get a TX completion interrupt
> that frees up some descriptors).  Note that for all of those cases you will
> need to do a bus_dmamap_unload() first.
>
> --
> John Baldwin

I'm not sure how NIC ring buffer descriptors are structured, but the
FPGA DMA descriptor structure looks as follows:

struct descriptor {
	uint64_t seg_phys_addr;
	uint32_t seg_len;
	uint64_t next_desc_phys_addr;
};

The FPGA's descriptors are chained together in a linked list using
physical addressing, so I can chain together as many descriptors as I
want to. The chain is terminated by a NULL pointer. Something like
this:

  d------>d------->d------>NULL
  |       |        |
  |       |        |
  v       v        v
 seg     seg      seg


The physical address of the first descriptor is written to a DMA
address register and the engine is started by writing to the DMA
control register. A hardware interrupt is generated once the DMA
engine is done walking the descriptor chain and sucking in all
segments.


I'd like to lessen the interrupt load by loading more than one uio/mbuf
map into the chain before starting off the DMA. But since I don't know
beforehand how many segments each uio/mbuf load will occupy I may
overrun the 256 chain elements.

One solution I can think of is to create multiple smaller descriptor
chains (let's say 64 descriptors long), one for each uio/mbuf. Then
after loading multiple uio/mbufs I can link the occupied parts of each
chain together in one big chain.


