From owner-svn-src-head@freebsd.org Tue Sep 1 20:41:48 2020 Return-Path: Delivered-To: svn-src-head@mailman.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.nyi.freebsd.org (Postfix) with ESMTP id B44AA3D7A99; Tue, 1 Sep 2020 20:41:48 +0000 (UTC) (envelope-from vmaffione@FreeBSD.org) Received: from mxrelay.nyi.freebsd.org (mxrelay.nyi.freebsd.org [IPv6:2610:1c1:1:606c::19:3]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (4096 bits) client-digest SHA256) (Client CN "mxrelay.nyi.freebsd.org", Issuer "Let's Encrypt Authority X3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4BgzWc4Ntcz4DHW; Tue, 1 Sep 2020 20:41:48 +0000 (UTC) (envelope-from vmaffione@FreeBSD.org) Received: from repo.freebsd.org (repo.freebsd.org [IPv6:2610:1c1:1:6068::e6a:0]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mxrelay.nyi.freebsd.org (Postfix) with ESMTPS id 7B43B1F1E7; Tue, 1 Sep 2020 20:41:48 +0000 (UTC) (envelope-from vmaffione@FreeBSD.org) Received: from repo.freebsd.org ([127.0.1.37]) by repo.freebsd.org (8.15.2/8.15.2) with ESMTP id 081Kfm9w098075; Tue, 1 Sep 2020 20:41:48 GMT (envelope-from vmaffione@FreeBSD.org) Received: (from vmaffione@localhost) by repo.freebsd.org (8.15.2/8.15.2/Submit) id 081KflZN098070; Tue, 1 Sep 2020 20:41:47 GMT (envelope-from vmaffione@FreeBSD.org) Message-Id: <202009012041.081KflZN098070@repo.freebsd.org> X-Authentication-Warning: repo.freebsd.org: vmaffione set sender to vmaffione@FreeBSD.org using -f From: Vincenzo Maffione Date: Tue, 1 Sep 2020 20:41:47 +0000 (UTC) To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org Subject: svn commit: r365061 - in head/sys: dev/bnxt dev/mgb dev/vmware/vmxnet3 net X-SVN-Group: head X-SVN-Commit-Author: vmaffione X-SVN-Commit-Paths: in head/sys: dev/bnxt dev/mgb dev/vmware/vmxnet3 net X-SVN-Commit-Revision: 365061 X-SVN-Commit-Repository: base MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-BeenThere: svn-src-head@freebsd.org X-Mailman-Version: 2.1.33 Precedence: list List-Id: SVN commit messages for the src tree for head/-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 01 Sep 2020 20:41:48 -0000 Author: vmaffione Date: Tue Sep 1 20:41:47 2020 New Revision: 365061 URL: https://svnweb.freebsd.org/changeset/base/365061 Log: iflib: leave only 1 receive descriptor unused The pidx argument of isc_rxd_flush() indicates which is the last valid receive descriptor to be used by the NIC. However, current code has multiple issues: - Intel drivers write pidx to their RDT register, which means that NICs will only use the descriptors up to pidx-1 (modulo ring size N), and won't actually use the one pointed by pidx. This does not break reception, but it is anyway confusing and suboptimal (the NIC will actually see only N-2 descriptors as available, rather than N-1). Other drivers (if_vmx, if_bnxt, if_mgb) adhere to this semantic). - The semantic used by Intel (RDT is one descriptor past the last valid one) is used by most (if not all) NICs, and it is also used on the TX side (also in iflib). Since iflib is not currently using this semantic for RX, it must decrement fl->ifl_pidx (modulo N) before calling isc_rxd_flush(), and then the per-driver callback implementation must increment the index again (to match the real semantic). This is confusing and suboptimal. - The iflib refill function is also called at initialization. However, in case the ring size is smaller than 128 (e.g. if_mgb), the refill function will actually prepare all the receive descriptors (N), without leaving one unused, as most of NICs assume (e.g. to avoid RDT to overrun RDH). I can speculate that the code looks like this right now because this issue showed up during testing (e.g. with if_mgb), and it was easy to workaround by decrementing pidx before isc_rxd_flush(). The goal of this change is to simplify the code (removing a bunch of instructions from the RX fast path), and to make the semantic of isc_rxd_flush() consistent across drivers. To achieve this, we: - change the semantics of the pidx argument to the usual one (that is the index one past the last valid one), so that both iflib and drivers avoid the decrement/increment dance. - fix the initialization code to prepare at most N-1 descriptors. Reviewed by: markj MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D26191 Modified: head/sys/dev/bnxt/bnxt_txrx.c head/sys/dev/mgb/if_mgb.c head/sys/dev/mgb/if_mgb.h head/sys/dev/vmware/vmxnet3/if_vmx.c head/sys/net/iflib.c Modified: head/sys/dev/bnxt/bnxt_txrx.c ============================================================================== --- head/sys/dev/bnxt/bnxt_txrx.c Tue Sep 1 20:13:50 2020 (r365060) +++ head/sys/dev/bnxt/bnxt_txrx.c Tue Sep 1 20:41:47 2020 (r365061) @@ -316,10 +316,9 @@ bnxt_isc_rxd_flush(void *sc, uint16_t rxqid, uint8_t f if (softc->rx_cp_rings[rxqid].cons != UINT32_MAX) BNXT_CP_IDX_DISABLE_DB(&softc->rx_cp_rings[rxqid].ring, softc->rx_cp_rings[rxqid].cons); - /* We're given the last filled RX buffer here, not the next empty one */ - BNXT_RX_DB(rx_ring, RING_NEXT(rx_ring, pidx)); + BNXT_RX_DB(rx_ring, pidx); /* TODO: Cumulus+ doesn't need the double doorbell */ - BNXT_RX_DB(rx_ring, RING_NEXT(rx_ring, pidx)); + BNXT_RX_DB(rx_ring, pidx); return; } Modified: head/sys/dev/mgb/if_mgb.c ============================================================================== --- head/sys/dev/mgb/if_mgb.c Tue Sep 1 20:13:50 2020 (r365060) +++ head/sys/dev/mgb/if_mgb.c Tue Sep 1 20:41:47 2020 (r365061) @@ -1191,7 +1191,12 @@ mgb_isc_rxd_flush(void *xsc, uint16_t rxqid, uint8_t f sc = xsc; KASSERT(rxqid == 0, ("tried to flush RX Channel %d.\n", rxqid)); - sc->rx_ring_data.last_tail = pidx; + /* + * According to the programming guide, last_tail must be set to + * the last valid RX descriptor, rather than to the one past that. + * Note that this is not true for the TX ring! + */ + sc->rx_ring_data.last_tail = MGB_PREV_RING_IDX(pidx); CSR_WRITE_REG(sc, MGB_DMA_RX_TAIL(rxqid), sc->rx_ring_data.last_tail); return; } Modified: head/sys/dev/mgb/if_mgb.h ============================================================================== --- head/sys/dev/mgb/if_mgb.h Tue Sep 1 20:13:50 2020 (r365060) +++ head/sys/dev/mgb/if_mgb.h Tue Sep 1 20:41:47 2020 (r365061) @@ -178,7 +178,8 @@ #define MGB_DESC_GET_FRAME_LEN(_desc) \ (((_desc)->ctl & MGB_DESC_FRAME_LEN_MASK) >> 16) -#define MGB_NEXT_RING_IDX(_idx) (((_idx) + 1) % MGB_DMA_RING_SIZE) +#define MGB_NEXT_RING_IDX(_idx) (((_idx) == MGB_DMA_RING_SIZE - 1) ? 0 : ((_idx_) + 1)) +#define MGB_PREV_RING_IDX(_idx) (((_idx) == 0) ? (MGB_DMA_RING_SIZE - 1) : ((_idx_) - 1)) #define MGB_RING_SPACE(_sc) \ ((((_sc)->tx_ring_data.last_head - (_sc)->tx_ring_data.last_tail - 1) \ + MGB_DMA_RING_SIZE ) % MGB_DMA_RING_SIZE ) Modified: head/sys/dev/vmware/vmxnet3/if_vmx.c ============================================================================== --- head/sys/dev/vmware/vmxnet3/if_vmx.c Tue Sep 1 20:13:50 2020 (r365060) +++ head/sys/dev/vmware/vmxnet3/if_vmx.c Tue Sep 1 20:41:47 2020 (r365061) @@ -1744,13 +1744,6 @@ vmxnet3_isc_rxd_flush(void *vsc, uint16_t rxqid, uint8 else r = VMXNET3_BAR0_RXH2(rxqid); - /* - * pidx is the index of the last descriptor with a buffer the device - * can use, and the device needs to be told which index is one past - * that. - */ - if (++pidx == rxr->vxrxr_ndesc) - pidx = 0; vmxnet3_write_bar0(sc, r, pidx); } Modified: head/sys/net/iflib.c ============================================================================== --- head/sys/net/iflib.c Tue Sep 1 20:13:50 2020 (r365060) +++ head/sys/net/iflib.c Tue Sep 1 20:41:47 2020 (r365061) @@ -1959,7 +1959,8 @@ _rxq_refill_cb(void *arg, bus_dma_segment_t *segs, int * @count: the number of new buffers to allocate * * (Re)populate an rxq free-buffer list with up to @count new packet buffers. - * The caller must assure that @count does not exceed the queue's capacity. + * The caller must assure that @count does not exceed the queue's capacity + * minus one (since we always leave a descriptor unavailable). */ static uint8_t iflib_fl_refill(if_ctx_t ctx, iflib_fl_t fl, int count) @@ -1974,6 +1975,8 @@ iflib_fl_refill(if_ctx_t ctx, iflib_fl_t fl, int count int err, frag_idx, i, idx, n, pidx; qidx_t credits; + MPASS(count <= fl->ifl_size - fl->ifl_credits - 1); + sd_m = fl->ifl_sds.ifsd_m; sd_map = fl->ifl_sds.ifsd_map; sd_cl = fl->ifl_sds.ifsd_cl; @@ -2081,15 +2084,10 @@ iflib_fl_refill(if_ctx_t ctx, iflib_fl_t fl, int count fl->ifl_credits = credits; } DBG_COUNTER_INC(rxd_flush); - if (fl->ifl_pidx == 0) - pidx = fl->ifl_size - 1; - else - pidx = fl->ifl_pidx - 1; - bus_dmamap_sync(fl->ifl_ifdi->idi_tag, fl->ifl_ifdi->idi_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); ctx->isc_rxd_flush(ctx->ifc_softc, fl->ifl_rxq->ifr_id, - fl->ifl_id, pidx); + fl->ifl_id, fl->ifl_pidx); if (__predict_true(bit_test(fl->ifl_rx_bitmap, frag_idx))) { fl->ifl_fragidx = frag_idx + 1; if (fl->ifl_fragidx == fl->ifl_size) @@ -2105,7 +2103,17 @@ iflib_fl_refill(if_ctx_t ctx, iflib_fl_t fl, int count static inline uint8_t iflib_fl_refill_all(if_ctx_t ctx, iflib_fl_t fl) { - /* we avoid allowing pidx to catch up with cidx as it confuses ixl */ + /* + * We leave an unused descriptor to avoid pidx to catch up with cidx. + * This is important as it confuses most NICs. For instance, + * Intel NICs have (per receive ring) RDH and RDT registers, where + * RDH points to the next receive descriptor to be used by the NIC, + * and RDT for the next receive descriptor to be published by the + * driver to the NIC (RDT - 1 is thus the last valid one). + * The condition RDH == RDT means no descriptors are available to + * the NIC, and thus it would be ambiguous if it also meant that + * all the descriptors are available to the NIC. + */ int32_t reclaimable = fl->ifl_size - fl->ifl_credits - 1; #ifdef INVARIANTS int32_t delta = fl->ifl_size - get_inuse(fl->ifl_size, fl->ifl_cidx, fl->ifl_pidx, fl->ifl_gen) - 1; @@ -2210,12 +2218,15 @@ iflib_fl_setup(iflib_fl_t fl) fl->ifl_zone = m_getzone(fl->ifl_buf_size); - /* avoid pre-allocating zillions of clusters to an idle card - * potentially speeding up attach + /* + * Avoid pre-allocating zillions of clusters to an idle card + * potentially speeding up attach. In any case make sure + * to leave a descriptor unavailable. See the comment in + * iflib_fl_refill_all(). */ - (void)iflib_fl_refill(ctx, fl, min(128, fl->ifl_size)); - MPASS(min(128, fl->ifl_size) == fl->ifl_credits); - if (min(128, fl->ifl_size) != fl->ifl_credits) + MPASS(fl->ifl_size > 0); + (void)iflib_fl_refill(ctx, fl, min(128, fl->ifl_size - 1)); + if (min(128, fl->ifl_size - 1) != fl->ifl_credits) return (ENOBUFS); /* * handle failure