From: JD Louw
To: John Baldwin
Cc: freebsd-drivers@freebsd.org
Date: Wed, 22 Feb 2012 17:06:23 +0200
Subject: Re: bus_dma coalesce advice
In-Reply-To: <201202211504.34169.jhb@freebsd.org>

On Tue, Feb 21, 2012 at 10:04 PM, John Baldwin wrote:
> On Monday, February 20, 2012 1:05:40 pm JD Louw wrote:
>> Hi,
>>
>> I have a Xilinx FPGA PCIe DMA design that I'd like to get going
>> on FreeBSD. I'd like some advice on the best practice of the bus_dma
>> functions. Specifically, I'd like to understand how best to coalesce
>> multiple DMA transactions.
>>
>> Using the bus_dma_tag_create and bus_dmamem_alloc functions I create
>> 256 contiguous descriptors.
>>
>>       bus_dma_tag_create(NULL,                /* parent */
>>               4,                              /* alignment */
>>               0,                              /* bounds */
>>               BUS_SPACE_MAXADDR,              /* lowaddr */
>>               BUS_SPACE_MAXADDR,              /* highaddr */
>>               NULL, NULL,                     /* filter, filterarg */
>>               256*sizeof(descriptor),         /* maxsize */
>>               1,                              /* nsegments */
>>               256*sizeof(descriptor),         /* maxsegsize */
>>               BUS_DMA_ALLOCNOW,               /* flags */
>>               NULL, NULL,                     /* lockfunc, lockarg */
>>               &desc_tag);                     /* dmat */
>>
>> I then create another bus_dma_tag for the data:
>>
>>       bus_dma_tag_create(NULL,                /* parent */
>>               4,                              /* alignment */
>>               0,                              /* bounds */
>>               BUS_SPACE_MAXADDR,              /* lowaddr */
>>               BUS_SPACE_MAXADDR,              /* highaddr */
>>               NULL, NULL,                     /* filter, filterarg */
>>               0xFFFFF,                        /* maxsize - 1MB */
>>               256,                            /* nsegments */
>>               0x1000,                         /* maxsegsize - 4KB */
>>               BUS_DMA_ALLOCNOW,               /* flags */
>>               NULL, NULL,                     /* lockfunc, lockarg */
>>               &data_tag);                     /* dmat */
>>
>> Now my question: In
>> order to batch several mbufs/uios into the 256
>> descriptors I'd like to do multiple bus_dmamap_loads on the data tag.
>> But reading the bus_dmamap_load_mbuf/uio code it looks like this is
>> not a good idea. Each mapping operation does not subtract its nsegment
>> count from the tag maximum nsegment count, so at some point
>> bus_dmamap_load will overrun my 256 descriptors.
>
> Does your DMA engine really allow a single transaction to span more than 256
> descriptors?  (The 'nsegments' is the maximum number of S/G entries for a
> single transaction, not the number of entries in your ring.)
>
>> Do I need to allocate a separate set of descriptors for each bus_dmamapping?
>>
>> Any advice much appreciated,
>
> Typically in a NIC driver you will use bus_dmamap_load_mbuf_sg() to populate
> an array of S/G elements on the stack.  You can check the returned value for
> the number of segments and handle the case where it exceeds the number of
> segments you actually have available (e.g. by calling m_collapse() or
> m_defrag() or just queueing the packet until you get a TX completion interrupt
> that frees up some descriptors).  Note that for all of those cases you will
> need to do a bus_dmamap_unload() first.
>
> --
> John Baldwin

I'm not sure how NIC ring buffer descriptors are structured, but the
FPGA DMA descriptor structure looks as follows:

struct descriptor {
    uint64_t seg_phys_addr;
    uint32_t seg_len;
    uint64_t next_desc_phys_addr;
};

The FPGA's descriptors are chained together in a linked list using
physical addressing, so I can chain together as many descriptors as I
want to. The chain is terminated by a NULL pointer. Something like
this:

d------>d------->d------>NULL
|       |        |
|       |        |
⌄       ⌄        ⌄
seg     seg      seg

The physical address of the first descriptor is written to a DMA
address register and the engine is started by writing to the DMA
control register.
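
To make the chain layout concrete, here is a rough user-space sketch in C.
It is not driver code and has not been run against the hardware.
struct seg, chain_build() and the ring_busaddr parameter are hypothetical
names introduced for illustration; the bus address of the descriptor
memory is assumed to be known already, and the struct is used exactly as
written above (whether the engine expects a packed layout is not covered
here).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Descriptor layout as given in the message. */
struct descriptor {
    uint64_t seg_phys_addr;
    uint32_t seg_len;
    uint64_t next_desc_phys_addr;
};

/* Hypothetical stand-in for one DMA segment (bus address + length). */
struct seg {
    uint64_t addr;
    uint32_t len;
};

/*
 * Fill ring[0..nsegs-1] with one descriptor per segment and link them
 * by bus address. ring_busaddr is the (assumed known) bus address of
 * ring[0]. The last descriptor gets a zero next pointer, which ends
 * the chain. Returns the address that would be written to the DMA
 * address register, or 0 if there is nothing to chain.
 */
static uint64_t
chain_build(struct descriptor *ring, uint64_t ring_busaddr, size_t ring_len,
    const struct seg *segs, size_t nsegs)
{
    size_t i;

    assert(nsegs <= ring_len);    /* caller must not overrun the ring */
    if (nsegs == 0)
        return (0);
    for (i = 0; i < nsegs; i++) {
        ring[i].seg_phys_addr = segs[i].addr;
        ring[i].seg_len = segs[i].len;
        ring[i].next_desc_phys_addr = (i + 1 < nsegs) ?
            ring_busaddr + (i + 1) * sizeof(struct descriptor) : 0;
    }
    return (ring_busaddr);
}
```

Calling chain_build() with three segments fills ring[0] to ring[2] and
leaves ring[2].next_desc_phys_addr at zero, matching the diagram above.
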
A hardware interrupt is generated once the DMA engine is done
walking the descriptor chain and sucking in all segments.

I'd like to lessen the interrupt load by loading more than one uio/mbuf
map into the chain before starting off the DMA. But since I don't know
beforehand how many segments each uio/mbuf load will occupy, I may
overrun the 256 chain elements.

One solution I can think of is to create multiple smaller descriptor
chains (let's say 64 descriptors long), one for each uio/mbuf. Then,
after loading multiple uio/mbufs, I can link the occupied parts of each
chain together in one big chain.
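
The sub-chain idea could look roughly like the following user-space
sketch in C. It only models the linking logic and is not tested driver
code. struct batch, struct seg, batch_add() and batch_link() are
hypothetical names, the 4 x 64 split is just the example figure from
above, and the bus address of the descriptor memory is assumed to be
known.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SUBCHAIN_LEN 64    /* descriptors reserved per uio/mbuf */
#define NSUBCHAINS   4     /* 4 * 64 = 256 descriptors in total */

struct descriptor {
    uint64_t seg_phys_addr;
    uint32_t seg_len;
    uint64_t next_desc_phys_addr;
};

/* Hypothetical stand-in for one DMA segment (bus address + length). */
struct seg {
    uint64_t addr;
    uint32_t len;
};

struct batch {
    struct descriptor ring[NSUBCHAINS * SUBCHAIN_LEN];
    uint64_t ring_busaddr;      /* bus address of ring[0], assumed known */
    size_t used[NSUBCHAINS];    /* descriptors filled in each sub-chain */
    size_t nloaded;             /* sub-chains filled so far */
};

static uint64_t
desc_busaddr(const struct batch *b, size_t idx)
{
    return (b->ring_busaddr + idx * sizeof(struct descriptor));
}

/*
 * Fill the next free sub-chain from one load. Returns 0 on success,
 * -1 if no sub-chain is free or the load needs more than SUBCHAIN_LEN
 * segments (the caller would then start the DMA or defragment first).
 */
static int
batch_add(struct batch *b, const struct seg *segs, size_t nsegs)
{
    size_t base, i;

    if (b->nloaded == NSUBCHAINS || nsegs == 0 || nsegs > SUBCHAIN_LEN)
        return (-1);
    base = b->nloaded * SUBCHAIN_LEN;
    for (i = 0; i < nsegs; i++) {
        b->ring[base + i].seg_phys_addr = segs[i].addr;
        b->ring[base + i].seg_len = segs[i].len;
        b->ring[base + i].next_desc_phys_addr = (i + 1 < nsegs) ?
            desc_busaddr(b, base + i + 1) : 0;
    }
    b->used[b->nloaded++] = nsegs;
    return (0);
}

/*
 * Link the occupied part of each sub-chain to the head of the next one.
 * The last sub-chain keeps its zero terminator. Returns the bus address
 * of the first descriptor, or 0 if nothing was loaded.
 */
static uint64_t
batch_link(struct batch *b)
{
    size_t k, last;

    if (b->nloaded == 0)
        return (0);
    for (k = 0; k + 1 < b->nloaded; k++) {
        last = k * SUBCHAIN_LEN + b->used[k] - 1;
        b->ring[last].next_desc_phys_addr =
            desc_busaddr(b, (k + 1) * SUBCHAIN_LEN);
    }
    return (desc_busaddr(b, 0));
}
```

With this layout a load can never spill into a neighbouring sub-chain,
at the cost of leaving the unused tail of each 64-entry block idle for
that batch.
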