From: Shtorm
To: Eugene Grosbein
Cc: net
Date: Fri, 24 Dec 2010 17:41:38 +0200
Subject: Re: lagg/lacp poor traffic distribution
Message-ID: <1293205298.1917.191.camel@stormi>
In-Reply-To: <4D1083D6.6010707@rdtc.ru>
References: <4D0CFEFF.3000902@rdtc.ru> <1292844095.1917.136.camel@stormi> <4D1083D6.6010707@rdtc.ru>
List-Id: Networking and TCP/IP with FreeBSD

On Tue, 2010-12-21 at 16:39 +0600, Eugene Grosbein wrote:
> On 20.12.2010 17:21, Shtorm wrote:
> > On Sun, 2010-12-19 at 00:35 +0600, Eugene Grosbein wrote:
> >> Hi!
> >>
> >> I've loaded router using two lagg interfaces in LACP mode.
> >> lagg0 has IP address and two ports (em0 and em1) and carry untagged frames.
> >> lagg1 has no IP address and has two ports (igb0 and igb1) and carry
> >> about 1000 dot-q vlans with lots of hosts in each vlan.
> >>
> >> For lagg1, lagg distributes outgoing traffic over two ports just fine.
> >> For lagg0 (untagged ethernet segment with only 2 MAC addresses)
> >> less than 0.07% (54Mbit/s max) of traffic goes to em0
> >> and over 99.92% goes to em1, that's bad.
> >>
> >> That's general traffic of several thousands of customers surfing the web,
> >> using torrents etc. I've glanced over lagg/lacp sources if src/sys/net/
> >> and found nothing suspicious, it should extract and use srcIP/dstIP for hash.
> >>
> >> How do I debug this problem?
> >>
> >> Eugene Grosbein
> >> _______________________________________________
> >> freebsd-net@freebsd.org mailing list
> >> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> >> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
> >
> > I had this problem with igb driver, and I found, that lagg selects
> > outgoing interface based on packet header flowid field if M_FLOWID field
> > is set. And in the igb driver code flowid is set as
> >
> > #if __FreeBSD_version >= 800000
> >                         rxr->fmp->m_pkthdr.flowid = que->msix;
> >                         rxr->fmp->m_flags |= M_FLOWID;
> > #endif
> >
> > The same thing in em driver with MULTIQUEUE
> >
> > That does not give enough number of flows to balance traffic well, so I
> > commented check in if_lagg.c
> >
> > lagg_lb_start(struct lagg_softc *sc, struct mbuf *m)
> > {
> >         struct lagg_lb *lb = (struct lagg_lb *)sc->sc_psc;
> >         struct lagg_port *lp = NULL;
> >         uint32_t p = 0;
> >
> > //      if (m->m_flags & M_FLOWID)
> > //              p = m->m_pkthdr.flowid;
> > //      else
> >
> > and with this change I have much better load distribution across interfaces.
> >
> > Hope it helps.
>
> You are perfectly right. By disabling flow usage I've obtained load sharing
> close to even (final patch follows). Two questions:
>
> 1. Is it a bug or design problem?
> 2. Will I get problems like packet reordering by permanently disabling
> usage of these flows in lagg(4)?
>
> --- if_lagg.c.orig      2010-12-20 22:53:21.000000000 +0600
> +++ if_lagg.c   2010-12-21 13:37:20.000000000 +0600
> @@ -168,6 +168,11 @@
>      &lagg_failover_rx_all, 0,
>      "Accept input from any interface in a failover lagg");
>
> +int lagg_use_flows = 1;
> +SYSCTL_INT(_net_link_lagg, OID_AUTO, use_flows, CTLFLAG_RW,
> +    &lagg_use_flows, 1,
> +    "Use flows for load sharing");
> +
>  static int
>  lagg_modevent(module_t mod, int type, void *data)
>  {
> @@ -1666,7 +1671,7 @@
>         struct lagg_port *lp = NULL;
>         uint32_t p = 0;
>
> -       if (m->m_flags & M_FLOWID)
> +       if (lagg_use_flows && (m->m_flags & M_FLOWID))
>                 p = m->m_pkthdr.flowid;
>         else
>                 p = lagg_hashmbuf(m, lb->lb_key);
> --- if_lagg.h.orig      2010-12-21 16:34:35.000000000 +0600
> +++ if_lagg.h   2010-12-21 16:35:27.000000000 +0600
> @@ -242,6 +242,8 @@
>  int            lagg_enqueue(struct ifnet *, struct mbuf *);
>  uint32_t       lagg_hashmbuf(struct mbuf *, uint32_t);
>
> +extern int lagg_use_flows;
> +
>  #endif /* _KERNEL */
>
>  #endif /* _NET_LAGG_H */
> --- ieee8023ad_lacp.c.orig      2010-12-21 16:36:09.000000000 +0600
> +++ ieee8023ad_lacp.c   2010-12-21 16:35:58.000000000 +0600
> @@ -812,7 +812,7 @@
>                 return (NULL);
>         }
>
> -       if (m->m_flags & M_FLOWID)
> +       if (lagg_use_flows && (m->m_flags & M_FLOWID))
>                 hash = m->m_pkthdr.flowid;
>         else
>                 hash = lagg_hashmbuf(m, lsc->lsc_hashkey);
>
> Eugene Grosbein

Could you please look at a different, and maybe more generic, solution
to the lagg and flowid problem?

In if_ethersubr.c, in the function ether_input, I added this code:

	m->m_flags |= M_FLOWID;
	m->m_pkthdr.flowid = eh->ether_dhost[0] + eh->ether_dhost[1] +
	    eh->ether_dhost[2] + eh->ether_dhost[3] +
	    eh->ether_dhost[4] + eh->ether_dhost[5];
	m->m_pkthdr.flowid += eh->ether_shost[0] + eh->ether_shost[1] +
	    eh->ether_shost[2] + eh->ether_shost[3] +
	    eh->ether_shost[4] + eh->ether_shost[5];

and in the function ether_demux I added this:

	case ETHERTYPE_IP:
		if ((m = ip_fastforward(m)) == NULL)
			return;
		isr = NETISR_IP;

		struct ipheader {
			u_char offset[12];	// ip header fields not actually needed
			u_char src[4];		// ip src
			u_char dst[4];		// ip dst
		} __packed __aligned(4);

		// ip header and mbuf stuff stolen from ip_fastforward
		if (m->m_pkthdr.len < sizeof(struct ipheader)) {
			if_printf(ifp, "flowid math: discard frame with too small header\n");
			goto discard;
		}
		if (m->m_len < sizeof(struct ipheader) &&
		    (m = m_pullup(m, sizeof(struct ipheader))) == NULL) {
			if_printf(ifp, "flowid math: discard frame at pullup\n");
			return;		/* mbuf already free'd */
		}
		struct ipheader *ip;
		ip = mtod(m, struct ipheader *);
		m->m_pkthdr.flowid += ip->src[0] + ip->src[1] + ip->src[2] + ip->src[3];
		m->m_pkthdr.flowid += ip->dst[0] + ip->dst[1] + ip->dst[2] + ip->dst[3];
		// if_printf(ifp, "Calculated flow id %d\n", m->m_pkthdr.flowid);
		break;
	case ETHERTYPE_ARP:

Sorry, I have no idea how to create a nice diff; maybe a pointer to a
small howto would help :)

This code should probably be wrapped in a sysctl check, so that it can
be enabled or disabled separately for the L2 and L3 information, but I
do not know how to do this.

If we calculate the flowid early, at Ethernet input, we not only solve
the lagg load distribution problem: different flows can also be
processed by different netisr threads when fastforwarding is disabled.

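To make the arithmetic above concrete, here is a minimal user-space sketch of the same additive flow id. It is not kernel code: the helper names sum_bytes, flowid_l2l3 and pick_port are hypothetical, and choosing the port as the flow id modulo the number of ports is an assumption about how a load-balancing lagg would consume the value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum the bytes of a buffer, mirroring the per-byte additions in the patch. */
static uint32_t
sum_bytes(const uint8_t *p, size_t n)
{
	uint32_t s = 0;

	for (size_t i = 0; i < n; i++)
		s += p[i];
	return (s);
}

/*
 * Hypothetical helper, not a kernel function: builds the flow id from the
 * destination and source MAC addresses plus the IPv4 source and destination
 * addresses, exactly as the two code fragments above add them up.
 */
static uint32_t
flowid_l2l3(const uint8_t dmac[6], const uint8_t smac[6],
    const uint8_t sip[4], const uint8_t dip[4])
{
	return (sum_bytes(dmac, 6) + sum_bytes(smac, 6) +
	    sum_bytes(sip, 4) + sum_bytes(dip, 4));
}

/*
 * Assumption: the outgoing port is the flow id modulo the number of
 * active ports in the aggregate.
 */
static unsigned
pick_port(uint32_t flowid, unsigned nports)
{
	assert(nports > 0);
	return (flowid % nports);
}
```

On a segment with only two MAC addresses the L2 part of the sum is constant, so only the IP bytes move a packet between ports. The sum is also symmetric in source and destination, so both directions of a conversation get the same id, and with 20 bytes added the result never exceeds 20 * 255 = 5100.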
I know this wastes some CPU, but as an example, on a router with 4 cores
and two em cards, top looks like this:

last pid: 84129;  load averages:  0.06,  0.12,  0.09   up 9+14:21:30  17:30:50
175 processes: 6 running, 145 sleeping, 24 waiting
CPU 0:  2.3% user,  0.0% nice,  3.9% system, 27.9% interrupt, 65.9% idle
CPU 1:  1.6% user,  0.0% nice,  1.6% system, 26.4% interrupt, 70.5% idle
CPU 2:  3.1% user,  0.0% nice,  1.6% system, 34.1% interrupt, 61.2% idle
CPU 3:  0.8% user,  0.0% nice,  1.5% system, 30.0% interrupt, 67.7% idle
Mem: 175M Active, 38M Inact, 262M Wired, 108K Cache, 60M Buf, 1497M Free
Swap:

# netstat -I em0 -w 1
            input          (em0)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
     46381     0      0   49430586      36616     0   21480310     0
     45753     0      0   47685283      36941     0   22789381     0
     46167     0      0   48173940      37736     0   23442515     0
     46608     0      0   49172207      37705     0   23199023     0
     50114     0      0   53050475      39719     0   23409046     0
     47786     0      0   49826567      37621     0   23658505     0
^C
#

This box is an all-in-one router: a PPPoE server with NAT and dummynet
shaping. With my changes it can NAT, shape and forward up to 100 kpps in
each direction with 30% idle on all cores.

Thanks a lot.
Yuriy.