From owner-svn-src-all@FreeBSD.ORG Tue Jan 6 20:33:50 2015 Return-Path: Delivered-To: svn-src-all@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 26A3B1B2; Tue, 6 Jan 2015 20:33:50 +0000 (UTC) Received: from mail-pa0-x22e.google.com (mail-pa0-x22e.google.com [IPv6:2607:f8b0:400e:c03::22e]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id DF90D3DB; Tue, 6 Jan 2015 20:33:49 +0000 (UTC) Received: by mail-pa0-f46.google.com with SMTP id lf10so15510pab.5; Tue, 06 Jan 2015 12:33:49 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:mail-followup-to :references:mime-version:content-type:content-disposition :content-transfer-encoding:in-reply-to:user-agent; bh=wAkrUEhLFOUhXhFR84/EtnVaowtak4/VcNKFrYVW6kQ=; b=FzLSIke2hAAhcIQGGUtnDtrzw6z6vYIVvDG/Skuc13dIclIO0tA1C8QWTlALngxqEd 6atghftGrL1EHtShcJ3tVkVwj/6hlUUMFmPCNA1kuNh6GR/4EU39Vik7OHsaineDR17y 7OkhVbKVpQw8Sau+WN/k+aV6tQX7i5OrJrQLt77bRguK1gN5kSho9qaEoS9IY5QKc4IZ jqGZ6tVD/MgpF5C3hitIIX6oO0FdZ5UnR+B993SvZ6qoOJyFgqLCKAko3O+F0CDU0Pqq P+YnltrGoYs1uloQQ9FDx/0akUMlwrd2nlm9k2eRCCamwQeKlDC3te9q0lh2y9+fIFqg yxzw== X-Received: by 10.66.251.167 with SMTP id zl7mr161478244pac.140.1420576429361; Tue, 06 Jan 2015 12:33:49 -0800 (PST) Received: from ox ([24.6.44.228]) by mx.google.com with ESMTPSA id hs10sm57812548pdb.33.2015.01.06.12.33.46 (version=TLSv1.2 cipher=RC4-SHA bits=128/128); Tue, 06 Jan 2015 12:33:48 -0800 (PST) Sender: Navdeep Parhar Date: Tue, 6 Jan 2015 12:33:44 -0800 From: Navdeep Parhar To: Luigi Rizzo Subject: Re: svn commit: r276485 - in head/sys: conf dev/cxgbe modules/cxgbe/if_cxgbe Message-ID: <20150106203344.GB26068@ox> Mail-Followup-To: Luigi Rizzo , "src-committers@freebsd.org" , "svn-src-all@freebsd.org" , "svn-src-head@freebsd.org" References: <201412312319.sBVNJHca031041@svn.freebsd.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Cc: "svn-src-head@freebsd.org" , "svn-src-all@freebsd.org" , "src-committers@freebsd.org" X-BeenThere: svn-src-all@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: "SVN commit messages for the entire src tree \(except for " user" and " projects" \)" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 06 Jan 2015 20:33:50 -0000 On Tue, Jan 06, 2015 at 07:58:34PM +0100, Luigi Rizzo wrote: > > > On Thu, Jan 1, 2015 at 12:19 AM, Navdeep Parhar wrote: > > Author: np > Date: Wed Dec 31 23:19:16 2014 > New Revision: 276485 > URL: https://svnweb.freebsd.org/changeset/base/276485 > > Log: >   cxgbe(4): major tx rework. > > > FYI, this commit has some unnamed unions (eg. in t4_mp_ring.c) > which prevent the kernel from compiling with our stock gcc > and its standard kernel build flags (specifically -std=...). > > Adding the following in the kernel config > > makeoptions COPTFLAGS="-fms-extensions" > > seems to do the job > > I know it is unavoidable that we'll end up with gcc not working, > but maybe we can still avoid unnamed unions. There are two unresolved issues with mp_ring and I had to make the driver amd64-only while I consider my options. - platforms where gcc is the default (and our version has problems with unnamed unions). This is simple to fix but reduces the readability of the code. But sure, if building head with gcc is popular then that trumps readability. I wonder if adding -fms-extensions just to the driver's build flags would be an acceptable compromise. - platforms without the acq/rel versions of 64b cmpset. I think it would be simple to add acq/rel variants to i386/pc98 and others that already have 64b cmpset. The driver will be permanently unplugged from whatever remains (only 32 bit powerpc I think). I'll try to sort all this out within the next couple of weeks. Regards, Navdeep > > cheers > luigi >   > > >   a) Front load as much work as possible in if_transmit, before any driver >   lock or software queue has to get involved. > >   b) Replace buf_ring with a brand new mp_ring (multiproducer ring).  This >   is specifically for the tx multiqueue model where one of the if_transmit >   producer threads becomes the consumer and other producers carry on as >   usual.  mp_ring is implemented as standalone code and it should be >   possible to use it in any driver with tx multiqueue.  It also has: >   - the ability to enqueue/dequeue multiple items.  This might become >     significant if packet batching is ever implemented. >   - an abdication mechanism to allow a thread to give up writing tx >     descriptors and have another if_transmit thread take over.  A thread >     that's writing tx descriptors can end up doing so for an unbounded >     time period if a) there are other if_transmit threads continuously >     feeding the sofware queue, and b) the chip keeps up with whatever the >     thread is throwing at it. >   - accurate statistics about interesting events even when the stats come >     at the expense of additional branches/conditional code. > >   The NIC txq lock is uncontested on the fast path at this point.  I've >   left it there for synchronization with the control events (interface >   up/down, modload/unload). > >   c) Add support for "type 1" coalescing work request in the normal NIC tx >   path.  This work request is optimized for frames with a single item in >   the DMA gather list.  These are very common when forwarding packets. >   Note that netmap tx in cxgbe already uses these "type 1" work requests. > >   d) Do not request automatic cidx updates every 32 descriptors.  Instead, >   request updates via bits in individual work requests (still every 32 >   descriptors approximately).  Also, request an automatic final update >   when the queue idles after activity.  This means NIC tx reclaim is still >   performed lazily but it will catch up quickly as soon as the queue >   idles.  This seems to be the best middle ground and I'll probably do >   something similar for netmap tx as well. > >   e) Implement a faster tx path for WRQs (used by TOE tx and control >   queues, _not_ by the normal NIC tx).  Allow work requests to be written >   directly to the hardware descriptor ring if room is available.  I will >   convert t4_tom and iw_cxgbe modules to this faster style gradually. > >   MFC after:    2 months > > Added: >   head/sys/dev/cxgbe/t4_mp_ring.c   (contents, props changed) >   head/sys/dev/cxgbe/t4_mp_ring.h   (contents, props changed) > Modified: >   head/sys/conf/files >   head/sys/dev/cxgbe/adapter.h >   head/sys/dev/cxgbe/t4_l2t.c >   head/sys/dev/cxgbe/t4_main.c >   head/sys/dev/cxgbe/t4_sge.c >   head/sys/modules/cxgbe/if_cxgbe/Makefile > > Modified: head/sys/conf/files > =========================================================================== > === > --- head/sys/conf/files Wed Dec 31 22:52:43 2014        (r276484) > +++ head/sys/conf/files Wed Dec 31 23:19:16 2014        (r276485) > @@ -1142,6 +1142,8 @@ dev/cxgb/sys/uipc_mvec.c  optional cxgb p >         compile-with "${NORMAL_C} -I$S/dev/cxgb" >  dev/cxgb/cxgb_t3fw.c           optional cxgb cxgb_t3fw \ >         compile-with "${NORMAL_C} -I$S/dev/cxgb" > +dev/cxgbe/t4_mp_ring.c         optional cxgbe pci \ > +       compile-with "${NORMAL_C} -I$S/dev/cxgbe" >  dev/cxgbe/t4_main.c            optional cxgbe pci \ >         compile-with "${NORMAL_C} -I$S/dev/cxgbe" >  dev/cxgbe/t4_netmap.c          optional cxgbe pci \ > > Modified: head/sys/dev/cxgbe/adapter.h > =========================================================================== > === > --- head/sys/dev/cxgbe/adapter.h        Wed Dec 31 22:52:43 2014        > (r276484) > +++ head/sys/dev/cxgbe/adapter.h        Wed Dec 31 23:19:16 2014        > (r276485) > @@ -152,7 +152,8 @@ enum { >         CL_METADATA_SIZE = CACHE_LINE_SIZE, > >         SGE_MAX_WR_NDESC = SGE_MAX_WR_LEN / EQ_ESIZE, /* max WR size in > desc */ > -       TX_SGL_SEGS = 36, > +       TX_SGL_SEGS = 39, > +       TX_SGL_SEGS_TSO = 38, >         TX_WR_FLITS = SGE_MAX_WR_LEN / 8 >  }; > > @@ -273,6 +274,7 @@ struct port_info { >         struct timeval last_refreshed; >         struct port_stats stats; >         u_int tnl_cong_drops; > +       u_int tx_parse_error; > >         eventhandler_tag vlan_c; > > @@ -308,23 +310,9 @@ struct tx_desc { >         __be64 flit[8]; >  }; > > -struct tx_map { > -       struct mbuf *m; > -       bus_dmamap_t map; > -}; > - > -/* DMA maps used for tx */ > -struct tx_maps { > -       struct tx_map *maps; > -       uint32_t map_total;     /* # of DMA maps */ > -       uint32_t map_pidx;      /* next map to be used */ > -       uint32_t map_cidx;      /* reclaimed up to this index */ > -       uint32_t map_avail;     /* # of available maps */ > -}; > - >  struct tx_sdesc { > +       struct mbuf *m;         /* m_nextpkt linked chain of frames */ >         uint8_t desc_used;      /* # of hardware descriptors used by the WR > */ > -       uint8_t credits;        /* NIC txq: # of frames sent out in the WR > */ >  }; > > > @@ -378,16 +366,12 @@ struct sge_iq { >  enum { >         EQ_CTRL         = 1, >         EQ_ETH          = 2, > -#ifdef TCP_OFFLOAD >         EQ_OFLD         = 3, > -#endif > >         /* eq flags */ > -       EQ_TYPEMASK     = 7,            /* 3 lsbits hold the type */ > -       EQ_ALLOCATED    = (1 << 3),     /* firmware resources allocated */ > -       EQ_DOOMED       = (1 << 4),     /* about to be destroyed */ > -       EQ_CRFLUSHED    = (1 << 5),     /* expecting an update from SGE */ > -       EQ_STALLED      = (1 << 6),     /* out of hw descriptors or dmamaps > */ > +       EQ_TYPEMASK     = 0x3,          /* 2 lsbits hold the type (see > above) */ > +       EQ_ALLOCATED    = (1 << 2),     /* firmware resources allocated */ > +       EQ_ENABLED      = (1 << 3),     /* open for business */ >  }; > >  /* Listed in order of preference.  Update t4_sysctls too if you change > these */ > @@ -402,32 +386,25 @@ enum {DOORBELL_UDB, DOORBELL_WCWR, DOORB >  struct sge_eq { >         unsigned int flags;     /* MUST be first */ >         unsigned int cntxt_id;  /* SGE context id for the eq */ > -       bus_dma_tag_t desc_tag; > -       bus_dmamap_t desc_map; > -       char lockname[16]; >         struct mtx eq_lock; > >         struct tx_desc *desc;   /* KVA of descriptor ring */ > -       bus_addr_t ba;          /* bus address of descriptor ring */ > -       struct sge_qstat *spg;  /* status page, for convenience */ >         uint16_t doorbells; >         volatile uint32_t *udb; /* KVA of doorbell (lies within BAR2) */ >         u_int udb_qid;          /* relative qid within the doorbell page */ > -       uint16_t cap;           /* max # of desc, for convenience */ > -       uint16_t avail;         /* available descriptors, for convenience * > / > -       uint16_t qsize;         /* size (# of entries) of the queue */ > +       uint16_t sidx;          /* index of the entry with the status page > */ >         uint16_t cidx;          /* consumer idx (desc idx) */ >         uint16_t pidx;          /* producer idx (desc idx) */ > -       uint16_t pending;       /* # of descriptors used since last > doorbell */ > +       uint16_t equeqidx;      /* EQUEQ last requested at this pidx */ > +       uint16_t dbidx;         /* pidx of the most recent doorbell */ >         uint16_t iqid;          /* iq that gets egr_update for the eq */ >         uint8_t tx_chan;        /* tx channel used by the eq */ > -       struct task tx_task; > -       struct callout tx_callout; > +       volatile u_int equiq;   /* EQUIQ outstanding */ > > -       /* stats */ > - > -       uint32_t egr_update;    /* # of SGE_EGR_UPDATE notifications for eq > */ > -       uint32_t unstalled;     /* recovered from stall */ > +       bus_dma_tag_t desc_tag; > +       bus_dmamap_t desc_map; > +       bus_addr_t ba;          /* bus address of descriptor ring */ > +       char lockname[16]; >  }; > >  struct sw_zone_info { > @@ -499,18 +476,19 @@ struct sge_fl { >         struct cluster_layout cll_alt;  /* alternate refill zone, layout */ >  }; > > +struct mp_ring; > + >  /* txq: SGE egress queue + what's needed for Ethernet NIC */ >  struct sge_txq { >         struct sge_eq eq;       /* MUST be first */ > >         struct ifnet *ifp;      /* the interface this txq belongs to */ > -       bus_dma_tag_t tx_tag;   /* tag for transmit buffers */ > -       struct buf_ring *br;    /* tx buffer ring */ > +       struct mp_ring *r;      /* tx software ring */ >         struct tx_sdesc *sdesc; /* KVA of software descriptor ring */ > -       struct mbuf *m;         /* held up due to temporary resource > shortage */ > - > -       struct tx_maps txmaps; > +       struct sglist *gl; > +       __be32 cpl_ctrl0;       /* for convenience */ > > +       struct task tx_reclaim_task; >         /* stats for common events first */ > >         uint64_t txcsum;        /* # of times hardware assisted with > checksum */ > @@ -519,13 +497,12 @@ struct sge_txq { >         uint64_t imm_wrs;       /* # of work requests with immediate data * > / >         uint64_t sgl_wrs;       /* # of work requests with direct SGL */ >         uint64_t txpkt_wrs;     /* # of txpkt work requests (not coalesced) > */ > -       uint64_t txpkts_wrs;    /* # of coalesced tx work requests */ > -       uint64_t txpkts_pkts;   /* # of frames in coalesced tx work > requests */ > +       uint64_t txpkts0_wrs;   /* # of type0 coalesced tx work requests */ > +       uint64_t txpkts1_wrs;   /* # of type1 coalesced tx work requests */ > +       uint64_t txpkts0_pkts;  /* # of frames in type0 coalesced tx WRs */ > +       uint64_t txpkts1_pkts;  /* # of frames in type1 coalesced tx WRs */ > >         /* stats for not-that-common events */ > - > -       uint32_t no_dmamap;     /* no DMA map to load the mbuf */ > -       uint32_t no_desc;       /* out of hardware descriptors */ >  } __aligned(CACHE_LINE_SIZE); > >  /* rxq: SGE ingress queue + SGE free list + miscellaneous items */ > @@ -574,7 +551,13 @@ struct wrqe { >         STAILQ_ENTRY(wrqe) link; >         struct sge_wrq *wrq; >         int wr_len; > -       uint64_t wr[] __aligned(16); > +       char wr[] __aligned(16); > +}; > + > +struct wrq_cookie { > +       TAILQ_ENTRY(wrq_cookie) link; > +       int ndesc; > +       int pidx; >  }; > >  /* > @@ -585,17 +568,32 @@ struct sge_wrq { >         struct sge_eq eq;       /* MUST be first */ > >         struct adapter *adapter; > +       struct task wrq_tx_task; > + > +       /* Tx desc reserved but WR not "committed" yet. */ > +       TAILQ_HEAD(wrq_incomplete_wrs , wrq_cookie) incomplete_wrs; > > -       /* List of WRs held up due to lack of tx descriptors */ > +       /* List of WRs ready to go out as soon as descriptors are > available. */ >         STAILQ_HEAD(, wrqe) wr_list; > +       u_int nwr_pending; > +       u_int ndesc_needed; > >         /* stats for common events first */ > > -       uint64_t tx_wrs;        /* # of tx work requests */ > +       uint64_t tx_wrs_direct; /* # of WRs written directly to desc ring. > */ > +       uint64_t tx_wrs_ss;     /* # of WRs copied from scratch space. */ > +       uint64_t tx_wrs_copied; /* # of WRs queued and copied to desc ring. > */ > >         /* stats for not-that-common events */ > > -       uint32_t no_desc;       /* out of hardware descriptors */ > +       /* > +        * Scratch space for work requests that wrap around after reaching > the > +        * status page, and some infomation about the last WR that used it. > +        */ > +       uint16_t ss_pidx; > +       uint16_t ss_len; > +       uint8_t ss[SGE_MAX_WR_LEN]; > + >  } __aligned(CACHE_LINE_SIZE); > > > @@ -744,7 +742,7 @@ struct adapter { >         struct sge sge; >         int lro_timeout; > > -       struct taskqueue *tq[NCHAN];    /* taskqueues that flush data out * > / > +       struct taskqueue *tq[NCHAN];    /* General purpose taskqueues */ >         struct port_info *port[MAX_NPORTS]; >         uint8_t chan_map[NCHAN]; > > @@ -978,12 +976,11 @@ static inline int >  tx_resume_threshold(struct sge_eq *eq) >  { > > -       return (eq->qsize / 4); > +       /* not quite the same as qsize / 4, but this will do. */ > +       return (eq->sidx / 4); >  } > >  /* t4_main.c */ > -void t4_tx_task(void *, int); > -void t4_tx_callout(void *); >  int t4_os_find_pci_capability(struct adapter *, int); >  int t4_os_pci_save_state(struct adapter *); >  int t4_os_pci_restore_state(struct adapter *); > @@ -1024,16 +1021,15 @@ int t4_setup_adapter_queues(struct adapt >  int t4_teardown_adapter_queues(struct adapter *); >  int t4_setup_port_queues(struct port_info *); >  int t4_teardown_port_queues(struct port_info *); > -int t4_alloc_tx_maps(struct tx_maps *, bus_dma_tag_t, int, int); > -void t4_free_tx_maps(struct tx_maps *, bus_dma_tag_t); >  void t4_intr_all(void *); >  void t4_intr(void *); >  void t4_intr_err(void *); >  void t4_intr_evt(void *); >  void t4_wrq_tx_locked(struct adapter *, struct sge_wrq *, struct wrqe *); > -int t4_eth_tx(struct ifnet *, struct sge_txq *, struct mbuf *); >  void t4_update_fl_bufsize(struct ifnet *); > -int can_resume_tx(struct sge_eq *); > +int parse_pkt(struct mbuf **); > +void *start_wrq_wr(struct sge_wrq *, int, struct wrq_cookie *); > +void commit_wrq_wr(struct sge_wrq *, void *, struct wrq_cookie *); > >  /* t4_tracer.c */ >  struct t4_tracer; > > Modified: head/sys/dev/cxgbe/t4_l2t.c > =========================================================================== > === > --- head/sys/dev/cxgbe/t4_l2t.c Wed Dec 31 22:52:43 2014        (r276484) > +++ head/sys/dev/cxgbe/t4_l2t.c Wed Dec 31 23:19:16 2014        (r276485) > @@ -113,16 +113,15 @@ found: >  int >  t4_write_l2e(struct adapter *sc, struct l2t_entry *e, int sync) >  { > -       struct wrqe *wr; > +       struct wrq_cookie cookie; >         struct cpl_l2t_write_req *req; >         int idx = e->idx + sc->vres.l2t.start; > >         mtx_assert(&e->lock, MA_OWNED); > > -       wr = alloc_wrqe(sizeof(*req), &sc->sge.mgmtq); > -       if (wr == NULL) > +       req = start_wrq_wr(&sc->sge.mgmtq, howmany(sizeof(*req), 16), & > cookie); > +       if (req == NULL) >                 return (ENOMEM); > -       req = wrtod(wr); > >         INIT_TP_WR(req, 0); >         OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_L2T_WRITE_REQ, idx | > @@ -132,7 +131,7 @@ t4_write_l2e(struct adapter *sc, struct >         req->vlan = htons(e->vlan); >         memcpy(req->dst_mac, e->dmac, sizeof(req->dst_mac)); > > -       t4_wrq_tx(sc, wr); > +       commit_wrq_wr(&sc->sge.mgmtq, req, &cookie); > >         if (sync && e->state != L2T_STATE_SWITCHING) >                 e->state = L2T_STATE_SYNC_WRITE; > > Modified: head/sys/dev/cxgbe/t4_main.c > =========================================================================== > === > --- head/sys/dev/cxgbe/t4_main.c        Wed Dec 31 22:52:43 2014        > (r276484) > +++ head/sys/dev/cxgbe/t4_main.c        Wed Dec 31 23:19:16 2014        > (r276485) > @@ -66,6 +66,7 @@ __FBSDID("$FreeBSD$"); >  #include "common/t4_regs_values.h" >  #include "t4_ioctl.h" >  #include "t4_l2t.h" > +#include "t4_mp_ring.h" > >  /* T4 bus driver interface */ >  static int t4_probe(device_t); > @@ -378,7 +379,8 @@ static void build_medialist(struct port_ >  static int cxgbe_init_synchronized(struct port_info *); >  static int cxgbe_uninit_synchronized(struct port_info *); >  static int setup_intr_handlers(struct adapter *); > -static void quiesce_eq(struct adapter *, struct sge_eq *); > +static void quiesce_txq(struct adapter *, struct sge_txq *); > +static void quiesce_wrq(struct adapter *, struct sge_wrq *); >  static void quiesce_iq(struct adapter *, struct sge_iq *); >  static void quiesce_fl(struct adapter *, struct sge_fl *); >  static int t4_alloc_irq(struct adapter *, struct irq *, int rid, > @@ -434,7 +436,6 @@ static int sysctl_tx_rate(SYSCTL_HANDLER >  static int sysctl_ulprx_la(SYSCTL_HANDLER_ARGS); >  static int sysctl_wcwr_stats(SYSCTL_HANDLER_ARGS); >  #endif > -static inline void txq_start(struct ifnet *, struct sge_txq *); >  static uint32_t fconf_to_mode(uint32_t); >  static uint32_t mode_to_fconf(uint32_t); >  static uint32_t fspec_to_fconf(struct t4_filter_specification *); > @@ -1429,67 +1430,36 @@ cxgbe_transmit(struct ifnet *ifp, struct >  { >         struct port_info *pi = ifp->if_softc; >         struct adapter *sc = pi->adapter; > -       struct sge_txq *txq = &sc->sge.txq[pi->first_txq]; > -       struct buf_ring *br; > +       struct sge_txq *txq; > +       void *items[1]; >         int rc; > >         M_ASSERTPKTHDR(m); > +       MPASS(m->m_nextpkt == NULL);    /* not quite ready for this yet */ > >         if (__predict_false(pi->link_cfg.link_ok == 0)) { >                 m_freem(m); >                 return (ENETDOWN); >         } > > -       /* check if flowid is set */ > -       if (M_HASHTYPE_GET(m) != M_HASHTYPE_NONE) > -               txq += ((m->m_pkthdr.flowid % (pi->ntxq - pi-> > rsrv_noflowq)) > -                   + pi->rsrv_noflowq); > -       br = txq->br; > - > -       if (TXQ_TRYLOCK(txq) == 0) { > -               struct sge_eq *eq = &txq->eq; > - > -               /* > -                * It is possible that t4_eth_tx finishes up and releases > the > -                * lock between the TRYLOCK above and the drbr_enqueue > here.  We > -                * need to make sure that this mbuf doesn't just sit there > in > -                * the drbr. > -                */ > - > -               rc = drbr_enqueue(ifp, br, m); > -               if (rc == 0 && callout_pending(&eq->tx_callout) == 0 && > -                   !(eq->flags & EQ_DOOMED)) > -                       callout_reset(&eq->tx_callout, 1, t4_tx_callout, > eq); > +       rc = parse_pkt(&m); > +       if (__predict_false(rc != 0)) { > +               MPASS(m == NULL);                       /* was freed > already */ > +               atomic_add_int(&pi->tx_parse_error, 1); /* rare, atomic is > ok */ >                 return (rc); >         } > > -       /* > -        * txq->m is the mbuf that is held up due to a temporary shortage > of > -        * resources and it should be put on the wire first.  Then what's > in > -        * drbr and finally the mbuf that was just passed in to us. > -        * > -        * Return code should indicate the fate of the mbuf that was passed > in > -        * this time. > -        */ > - > -       TXQ_LOCK_ASSERT_OWNED(txq); > -       if (drbr_needs_enqueue(ifp, br) || txq->m) { > - > -               /* Queued for transmission. */ > - > -               rc = drbr_enqueue(ifp, br, m); > -               m = txq->m ? txq->m : drbr_dequeue(ifp, br); > -               (void) t4_eth_tx(ifp, txq, m); > -               TXQ_UNLOCK(txq); > -               return (rc); > -       } > +       /* Select a txq. */ > +       txq = &sc->sge.txq[pi->first_txq]; > +       if (M_HASHTYPE_GET(m) != M_HASHTYPE_NONE) > +               txq += ((m->m_pkthdr.flowid % (pi->ntxq - pi-> > rsrv_noflowq)) + > +                   pi->rsrv_noflowq); > > -       /* Direct transmission. */ > -       rc = t4_eth_tx(ifp, txq, m); > -       if (rc != 0 && txq->m) > -               rc = 0; /* held, will be transmitted soon (hopefully) */ > +       items[0] = m; > +       rc = mp_ring_enqueue(txq->r, items, 1, 4096); > +       if (__predict_false(rc != 0)) > +               m_freem(m); > > -       TXQ_UNLOCK(txq); >         return (rc); >  } > > @@ -1499,17 +1469,17 @@ cxgbe_qflush(struct ifnet *ifp) >         struct port_info *pi = ifp->if_softc; >         struct sge_txq *txq; >         int i; > -       struct mbuf *m; > >         /* queues do not exist if !PORT_INIT_DONE. */ >         if (pi->flags & PORT_INIT_DONE) { >                 for_each_txq(pi, i, txq) { >                         TXQ_LOCK(txq); > -                       m_freem(txq->m); > -                       txq->m = NULL; > -                       while ((m = buf_ring_dequeue_sc(txq->br)) != NULL) > -                               m_freem(m); > +                       txq->eq.flags &= ~EQ_ENABLED; >                         TXQ_UNLOCK(txq); > +                       while (!mp_ring_is_idle(txq->r)) { > +                               mp_ring_check_drainage(txq->r, 0); > +                               pause("qflush", 1); > +                       } >                 } >         } >         if_qflush(ifp); > @@ -1564,7 +1534,7 @@ cxgbe_get_counter(struct ifnet *ifp, ift >                         struct sge_txq *txq; > >                         for_each_txq(pi, i, txq) > -                               drops += txq->br->br_drops; > +                               drops += counter_u64_fetch(txq->r->drops); >                 } > >                 return (drops); > @@ -3236,7 +3206,8 @@ cxgbe_init_synchronized(struct port_info >  { >         struct adapter *sc = pi->adapter; >         struct ifnet *ifp = pi->ifp; > -       int rc = 0; > +       int rc = 0, i; > +       struct sge_txq *txq; > >         ASSERT_SYNCHRONIZED_OP(sc); > > @@ -3265,6 +3236,17 @@ cxgbe_init_synchronized(struct port_info >         } > >         /* > +        * Can't fail from this point onwards.  Review > cxgbe_uninit_synchronized > +        * if this changes. > +        */ > + > +       for_each_txq(pi, i, txq) { > +               TXQ_LOCK(txq); > +               txq->eq.flags |= EQ_ENABLED; > +               TXQ_UNLOCK(txq); > +       } > + > +       /* >          * The first iq of the first port to come up is used for tracing. >          */ >         if (sc->traceq < 0) { > @@ -3297,7 +3279,8 @@ cxgbe_uninit_synchronized(struct port_in >  { >         struct adapter *sc = pi->adapter; >         struct ifnet *ifp = pi->ifp; > -       int rc; > +       int rc, i; > +       struct sge_txq *txq; > >         ASSERT_SYNCHRONIZED_OP(sc); > > @@ -3314,6 +3297,12 @@ cxgbe_uninit_synchronized(struct port_in >                 return (rc); >         } > > +       for_each_txq(pi, i, txq) { > +               TXQ_LOCK(txq); > +               txq->eq.flags &= ~EQ_ENABLED; > +               TXQ_UNLOCK(txq); > +       } > + >         clrbit(&sc->open_device_map, pi->port_id); >         PORT_LOCK(pi); >         ifp->if_drv_flags &= ~IFF_DRV_RUNNING; > @@ -3543,15 +3532,17 @@ port_full_uninit(struct port_info *pi) > >         if (pi->flags & PORT_INIT_DONE) { > > -               /* Need to quiesce queues.  XXX: ctrl queues? */ > +               /* Need to quiesce queues.  */ > + > +               quiesce_wrq(sc, &sc->sge.ctrlq[pi->port_id]); > >                 for_each_txq(pi, i, txq) { > -                       quiesce_eq(sc, &txq->eq); > +                       quiesce_txq(sc, txq); >                 } > >  #ifdef TCP_OFFLOAD >                 for_each_ofld_txq(pi, i, ofld_txq) { > -                       quiesce_eq(sc, &ofld_txq->eq); > +                       quiesce_wrq(sc, ofld_txq); >                 } >  #endif > > @@ -3576,23 +3567,39 @@ port_full_uninit(struct port_info *pi) >  } > >  static void > -quiesce_eq(struct adapter *sc, struct sge_eq *eq) > +quiesce_txq(struct adapter *sc, struct sge_txq *txq) >  { > -       EQ_LOCK(eq); > -       eq->flags |= EQ_DOOMED; > +       struct sge_eq *eq = &txq->eq; > +       struct sge_qstat *spg = (void *)&eq->desc[eq->sidx]; > > -       /* > -        * Wait for the response to a credit flush if one's > -        * pending. > -        */ > -       while (eq->flags & EQ_CRFLUSHED) > -               mtx_sleep(eq, &eq->eq_lock, 0, "crflush", 0); > -       EQ_UNLOCK(eq); > +       (void) sc;      /* unused */ > > -       callout_drain(&eq->tx_callout); /* XXX: iffy */ > -       pause("callout", 10);           /* Still iffy */ > +#ifdef INVARIANTS > +       TXQ_LOCK(txq); > +       MPASS((eq->flags & EQ_ENABLED) == 0); > +       TXQ_UNLOCK(txq); > +#endif > > -       taskqueue_drain(sc->tq[eq->tx_chan], &eq->tx_task); > +       /* Wait for the mp_ring to empty. */ > +       while (!mp_ring_is_idle(txq->r)) { > +               mp_ring_check_drainage(txq->r, 0); > +               pause("rquiesce", 1); > +       } > + > +       /* Then wait for the hardware to finish. */ > +       while (spg->cidx != htobe16(eq->pidx)) > +               pause("equiesce", 1); > + > +       /* Finally, wait for the driver to reclaim all descriptors. */ > +       while (eq->cidx != eq->pidx) > +               pause("dquiesce", 1); > +} > + > +static void > +quiesce_wrq(struct adapter *sc, struct sge_wrq *wrq) > +{ > + > +       /* XXXTX */ >  } > >  static void > @@ -4892,6 +4899,9 @@ cxgbe_sysctls(struct port_info *pi) >         oid = SYSCTL_ADD_NODE(ctx, children, OID_AUTO, "stats", CTLFLAG_RD, >             NULL, "port statistics"); >         children = SYSCTL_CHILDREN(oid); > +       SYSCTL_ADD_UINT(ctx, children, OID_AUTO, "tx_parse_error", > CTLFLAG_RD, > +           &pi->tx_parse_error, 0, > +           "# of tx packets with invalid length or # of segments"); > >  #define SYSCTL_ADD_T4_REG64(pi, name, desc, reg) \ >         SYSCTL_ADD_OID(ctx, children, OID_AUTO, name, \ > @@ -6947,74 +6957,6 @@ sysctl_wcwr_stats(SYSCTL_HANDLER_ARGS) >  } >  #endif > > -static inline void > -txq_start(struct ifnet *ifp, struct sge_txq *txq) > -{ > -       struct buf_ring *br; > -       struct mbuf *m; > - > -       TXQ_LOCK_ASSERT_OWNED(txq); > - > -       br = txq->br; > -       m = txq->m ? txq->m : drbr_dequeue(ifp, br); > -       if (m) > -               t4_eth_tx(ifp, txq, m); > -} > - > -void > -t4_tx_callout(void *arg) > -{ > -       struct sge_eq *eq = arg; > -       struct adapter *sc; > - > -       if (EQ_TRYLOCK(eq) == 0) > -               goto reschedule; > - > -       if (eq->flags & EQ_STALLED && !can_resume_tx(eq)) { > -               EQ_UNLOCK(eq); > -reschedule: > -               if (__predict_true(!(eq->flags && EQ_DOOMED))) > -                       callout_schedule(&eq->tx_callout, 1); > -               return; > -       } > - > -       EQ_LOCK_ASSERT_OWNED(eq); > - > -       if (__predict_true((eq->flags & EQ_DOOMED) == 0)) { > - > -               if ((eq->flags & EQ_TYPEMASK) == EQ_ETH) { > -                       struct sge_txq *txq = arg; > -                       struct port_info *pi = txq->ifp->if_softc; > - > -                       sc = pi->adapter; > -               } else { > -                       struct sge_wrq *wrq = arg; > - > -                       sc = wrq->adapter; > -               } > - > -               taskqueue_enqueue(sc->tq[eq->tx_chan], &eq->tx_task); > -       } > - > -       EQ_UNLOCK(eq); > -} > - > -void > -t4_tx_task(void *arg, int count) > -{ > -       struct sge_eq *eq = arg; > - > -       EQ_LOCK(eq); > -       if ((eq->flags & EQ_TYPEMASK) == EQ_ETH) { > -               struct sge_txq *txq = arg; > -               txq_start(txq->ifp, txq); > -       } else { > -               struct sge_wrq *wrq = arg; > -               t4_wrq_tx_locked(wrq->adapter, wrq, NULL); > -       } > -       EQ_UNLOCK(eq); > -} > - >  static uint32_t >  fconf_to_mode(uint32_t fconf) >  { > @@ -7452,9 +7394,9 @@ static int >  set_filter_wr(struct adapter *sc, int fidx) >  { >         struct filter_entry *f = &sc->tids.ftid_tab[fidx]; > -       struct wrqe *wr; >         struct fw_filter_wr *fwr; >         unsigned int ftid; > +       struct wrq_cookie cookie; > >         ASSERT_SYNCHRONIZED_OP(sc); > > @@ -7473,12 +7415,10 @@ set_filter_wr(struct adapter *sc, int fi > >         ftid = sc->tids.ftid_base + fidx; > > -       wr = alloc_wrqe(sizeof(*fwr), &sc->sge.mgmtq); > -       if (wr == NULL) > +       fwr = start_wrq_wr(&sc->sge.mgmtq, howmany(sizeof(*fwr), 16), & > cookie); > +       if (fwr == NULL) >                 return (ENOMEM); > - > -       fwr = wrtod(wr); > -       bzero(fwr, sizeof (*fwr)); > +       bzero(fwr, sizeof(*fwr)); > >         fwr->op_pkd = htobe32(V_FW_WR_OP(FW_FILTER_WR)); >         fwr->len16_pkd = htobe32(FW_LEN16(*fwr)); > @@ -7547,7 +7487,7 @@ set_filter_wr(struct adapter *sc, int fi >         f->pending = 1; >         sc->tids.ftids_in_use++; > > -       t4_wrq_tx(sc, wr); > +       commit_wrq_wr(&sc->sge.mgmtq, fwr, &cookie); >         return (0); >  } > > @@ -7555,22 +7495,21 @@ static int >  del_filter_wr(struct adapter *sc, int fidx) >  { >         struct filter_entry *f = &sc->tids.ftid_tab[fidx]; > -       struct wrqe *wr; >         struct fw_filter_wr *fwr; >         unsigned int ftid; > +       struct wrq_cookie cookie; > >         ftid = sc->tids.ftid_base + fidx; > > -       wr = alloc_wrqe(sizeof(*fwr), &sc->sge.mgmtq); > -       if (wr == NULL) > +       fwr = start_wrq_wr(&sc->sge.mgmtq, howmany(sizeof(*fwr), 16), & > cookie); > +       if (fwr == NULL) >                 return (ENOMEM); > -       fwr = wrtod(wr); >         bzero(fwr, sizeof (*fwr)); > >         t4_mk_filtdelwr(ftid, fwr, sc->sge.fwq.abs_id); > >         f->pending = 1; > -       t4_wrq_tx(sc, wr); > +       commit_wrq_wr(&sc->sge.mgmtq, fwr, &cookie); >         return (0); >  } > > @@ -8170,6 +8109,7 @@ t4_ioctl(struct cdev *dev, unsigned long > >                 /* MAC stats */ >                 t4_clr_port_stats(sc, pi->tx_chan); > +               pi->tx_parse_error = 0; > >                 if (pi->flags & PORT_INIT_DONE) { >                         struct sge_rxq *rxq; > @@ -8192,24 +8132,24 @@ t4_ioctl(struct cdev *dev, unsigned long >                                 txq->imm_wrs = 0; >                                 txq->sgl_wrs = 0; >                                 txq->txpkt_wrs = 0; > -                               txq->txpkts_wrs = 0; > -                               txq->txpkts_pkts = 0; > -                               txq->br->br_drops = 0; > -                               txq->no_dmamap = 0; > -                               txq->no_desc = 0; > +                               txq->txpkts0_wrs = 0; > +                               txq->txpkts1_wrs = 0; > +                               txq->txpkts0_pkts = 0; > +                               txq->txpkts1_pkts = 0; > +                               mp_ring_reset_stats(txq->r); >                         } > >  #ifdef TCP_OFFLOAD >                         /* nothing to clear for each ofld_rxq */ > >                         for_each_ofld_txq(pi, i, wrq) { > -                               wrq->tx_wrs = 0; > -                               wrq->no_desc = 0; > +                               wrq->tx_wrs_direct = 0; > +                               wrq->tx_wrs_copied = 0; >                         } >  #endif >                         wrq = &sc->sge.ctrlq[pi->port_id]; > -                       wrq->tx_wrs = 0; > -                       wrq->no_desc = 0; > +                       wrq->tx_wrs_direct = 0; > +                       wrq->tx_wrs_copied = 0; >                 } >                 break; >         } > > Added: head/sys/dev/cxgbe/t4_mp_ring.c > =========================================================================== > === > --- /dev/null   00:00:00 1970   (empty, because file is newly added) > +++ head/sys/dev/cxgbe/t4_mp_ring.c     Wed Dec 31 23:19:16 2014        > (r276485) > @@ -0,0 +1,364 @@ > +/*- > + * Copyright (c) 2014 Chelsio Communications, Inc. > + * All rights reserved. > + * Written by: Navdeep Parhar > + * > + * Redistribution and use in source and binary forms, with or without > + * modification, are permitted provided that the following conditions > + * are met: > + * 1. Redistributions of source code must retain the above copyright > + *    notice, this list of conditions and the following disclaimer. > + * 2. Redistributions in binary form must reproduce the above copyright > + *    notice, this list of conditions and the following disclaimer in the > + *    documentation and/or other materials provided with the distribution. > + * > + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND > + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE > + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR > PURPOSE > + * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE > + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR > CONSEQUENTIAL > + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS > + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) > + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, > STRICT > + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY > WAY > + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF > + * SUCH DAMAGE. > + */ > + > +#include > +__FBSDID("$FreeBSD$"); > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include "t4_mp_ring.h" > + > +union ring_state { > +       struct { > +               uint16_t pidx_head; > +               uint16_t pidx_tail; > +               uint16_t cidx; > +               uint16_t flags; > +       }; > +       uint64_t state; > +}; > + > +enum { > +       IDLE = 0,       /* consumer ran to completion, nothing more to do. > */ > +       BUSY,           /* consumer is running already, or will be shortly. > */ > +       STALLED,        /* consumer stopped due to lack of resources. */ > +       ABDICATED,      /* consumer stopped even though there was work to > be > +                          done because it wants another thread to take > over. */ > +}; > + > +static inline uint16_t > +space_available(struct mp_ring *r, union ring_state s) > +{ > +       uint16_t x = r->size - 1; > + > +       if (s.cidx == s.pidx_head) > +               return (x); > +       else if (s.cidx > s.pidx_head) > +               return (s.cidx - s.pidx_head - 1); > +       else > +               return (x - s.pidx_head + s.cidx); > +} > + > +static inline uint16_t > +increment_idx(struct mp_ring *r, uint16_t idx, uint16_t n) > +{ > +       int x = r->size - idx; > + > +       MPASS(x > 0); > +       return (x > n ? idx + n : n - x); > +} > + > +/* Consumer is about to update the ring's state to s */ > +static inline uint16_t > +state_to_flags(union ring_state s, int abdicate) > +{ > + > +       if (s.cidx == s.pidx_tail) > +               return (IDLE); > +       else if (abdicate && s.pidx_tail != s.pidx_head) > +               return (ABDICATED); > + > +       return (BUSY); > +} > + > +/* > + * Caller passes in a state, with a guarantee that there is work to do and > that > + * all items up to the pidx_tail in the state are visible. > + */ > +static void > +drain_ring(struct mp_ring *r, union ring_state os, uint16_t prev, int > budget) > +{ > +       union ring_state ns; > +       int n, pending, total; > +       uint16_t cidx = os.cidx; > +       uint16_t pidx = os.pidx_tail; > + > +       MPASS(os.flags == BUSY); > +       MPASS(cidx != pidx); > + > +       if (prev == IDLE) > +               counter_u64_add(r->starts, 1); > +       pending = 0; > +       total = 0; > + > +       while (cidx != pidx) { > + > +               /* Items from cidx to pidx are available for consumption. * > / > +               n = r->drain(r, cidx, pidx); > +               if (n == 0) { > +                       critical_enter(); > +                       do { > +                               os.state = ns.state = r->state; > +                               ns.cidx = cidx; > +                               ns.flags = STALLED; > +                       } while (atomic_cmpset_64(&r->state, os.state, > +                           ns.state) == 0); > +                       critical_exit(); > +                       if (prev != STALLED) > +                               counter_u64_add(r->stalls, 1); > +                       else if (total > 0) { > +                               counter_u64_add(r->restarts, 1); > +                               counter_u64_add(r->stalls, 1); > +                       } > +                       break; > +               } > +               cidx = increment_idx(r, cidx, n); > +               pending += n; > +               total += n; > + > +               /* > +                * We update the cidx only if we've caught up with the > pidx, the > +                * real cidx is getting too far ahead of the one visible to > +                * everyone else, or we have exceeded our budget. > +                */ > +               if (cidx != pidx && pending < 64 && total < budget) > +                       continue; > +               critical_enter(); > +               do { > +                       os.state = ns.state = r->state; > +                       ns.cidx = cidx; > +                       ns.flags = state_to_flags(ns, total >= budget); > +               } while (atomic_cmpset_acq_64(&r->state, os.state, > ns.state) == 0); > +               critical_exit(); > + > +               if (ns.flags == ABDICATED) > +                       counter_u64_add(r->abdications, 1); > +               if (ns.flags != BUSY) { > +                       /* Wrong loop exit if we're going to stall. */ > +                       MPASS(ns.flags != STALLED); > +                       if (prev == STALLED) { > +                               MPASS(total > 0); > +                               counter_u64_add(r->restarts, 1); > +                       } > +                       break; > +               } > + > +               /* > +                * The acquire style atomic above guarantees visibility of > items > +                * associated with any pidx change that we notice here. > +                */ > +               pidx = ns.pidx_tail; > +               pending = 0; > +       } > +} > + > +int > +mp_ring_alloc(struct mp_ring **pr, int size, void *cookie, ring_drain_t > drain, > +    ring_can_drain_t can_drain, struct malloc_type *mt, int flags) > +{ > +       struct mp_ring *r; > + > +       /* All idx are 16b so size can be 65536 at most */ > +       if (pr == NULL || size < 2 || size > 65536 || drain == NULL || > +           can_drain == NULL) > +               return (EINVAL); > +       *pr = NULL; > +       flags &= M_NOWAIT | M_WAITOK; > +       MPASS(flags != 0); > + > +       r = malloc(__offsetof(struct mp_ring, items[size]), mt, flags | > M_ZERO); > +       if (r == NULL) > +               return (ENOMEM); > +       r->size = size; > +       r->cookie = cookie; > +       r->mt = mt; > +       r->drain = drain; > +       r->can_drain = can_drain; > +       r->enqueues = counter_u64_alloc(flags); > +       r->drops = counter_u64_alloc(flags); > +       r->starts = counter_u64_alloc(flags); > +       r->stalls = counter_u64_alloc(flags); > +       r->restarts = counter_u64_alloc(flags); > +       r->abdications = counter_u64_alloc(flags); > +       if (r->enqueues == NULL || r->drops == NULL || r->starts == NULL || > +           r->stalls == NULL || r->restarts == NULL || > +           r->abdications == NULL) { > +               mp_ring_free(r); > +               return (ENOMEM); > +       } > + > +       *pr = r; > +       return (0); > +} > + > +void > + > +mp_ring_free(struct mp_ring *r) > +{ > + > +       if (r == NULL) > +               return; > + > +       if (r->enqueues != NULL) > +               counter_u64_free(r->enqueues); > +       if (r->drops != NULL) > +               counter_u64_free(r->drops); > +       if (r->starts != NULL) > +               counter_u64_free(r->starts); > +       if (r->stalls != NULL) > +               counter_u64_free(r->stalls); > +       if (r->restarts != NULL) > +               counter_u64_free(r->restarts); > +       if (r->abdications != NULL) > +               counter_u64_free(r->abdications); > + > +       free(r, r->mt); > +} > + > +/* > + * Enqueue n items and maybe drain the ring for some time. > + * > + * Returns an errno. > + */ > +int > +mp_ring_enqueue(struct mp_ring *r, void **items, int n, int budget) > +{ > +       union ring_state os, ns; > +       uint16_t pidx_start, pidx_stop; > +       int i; > + > +       MPASS(items != NULL); > +       MPASS(n > 0); > + > > *** DIFF OUTPUT TRUNCATED AT 1000 LINES *** > > > > > > -- > -----------------------------------------+------------------------------- >  Prof. Luigi RIZZO, rizzo@iet.unipi.it  . Dip. di Ing. dell'Informazione >  http://www.iet.unipi.it/~luigi/        . Universita` di Pisa >  TEL      +39-050-2211611               . via Diotisalvi 2 >  Mobile   +39-338-6809875               . 56122 PISA (Italy) > -----------------------------------------+-------------------------------