From owner-freebsd-net@FreeBSD.ORG Fri Oct 26 13:53:57 2012 Return-Path: Delivered-To: freebsd-net@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id B2EB2239 for ; Fri, 26 Oct 2012 13:53:57 +0000 (UTC) (envelope-from glebius@FreeBSD.org) Received: from cell.glebius.int.ru (glebius.int.ru [81.19.64.117]) by mx1.freebsd.org (Postfix) with ESMTP id E2A798FC16 for ; Fri, 26 Oct 2012 13:53:56 +0000 (UTC) Received: from cell.glebius.int.ru (localhost [127.0.0.1]) by cell.glebius.int.ru (8.14.5/8.14.5) with ESMTP id q9QDrtsD012468; Fri, 26 Oct 2012 17:53:55 +0400 (MSK) (envelope-from glebius@FreeBSD.org) Received: (from glebius@localhost) by cell.glebius.int.ru (8.14.5/8.14.5/Submit) id q9QDrtEJ012467; Fri, 26 Oct 2012 17:53:55 +0400 (MSK) (envelope-from glebius@FreeBSD.org) X-Authentication-Warning: cell.glebius.int.ru: glebius set sender to glebius@FreeBSD.org using -f Date: Fri, 26 Oct 2012 17:53:54 +0400 From: Gleb Smirnoff To: Sebastian Kuzminsky Subject: Re: fragmentation problem in FreeBSD 7 Message-ID: <20121026135354.GD70741@FreeBSD.org> References: MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="jCrbxBqMcLqd4mOl" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Cc: freebsd-net@FreeBSD.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 26 Oct 2012 13:53:57 -0000 --jCrbxBqMcLqd4mOl Content-Type: text/plain; charset=koi8-r Content-Disposition: inline Sebastian, On Tue, Oct 23, 2012 at 12:25:59PM -0600, Sebastian Kuzminsky wrote: S> The root of the problem is two-fold: S> S> 1. ip_output.c:ip_fragment() does not clear the CSUM_IP flag in the mbuf S> when it does software IP checksum computation, so the mbuf still looks like S> it needs IP checksumming. S> S> 2. The em driver does not advertise IP checksum offloading, but still S> checks the CSUM_IP flag in the mbuf and modifies the packet when that flag S> is set (this is in em_transmit_checksum_setup(), called by em_xmit()). S> Unfortunately the em driver gets the checksum wrong in this case, i guess S> that's why it doesn't advertise this capability in its if_hwassist! S> S> So the fragments that ip_fastfwd.c:ip_fastforward() gets from S> ip_output.c:ip_fragment() have ip->ip_sum set correctly, but the S> mbuf->m_pkthdr.csum_flags incorrectly has CSUM_IP still set, and this S> causes the em driver to emit incorrect packets. S> S> There are some other callers of ip_fragment(), notably ip_output(). S> ip_output() clears CSUM_IP in the mbuf csum_flags itself if it's not in S> if_hwassist, so avoids this problem. S> S> So, the fix is simple: clear the mbuf's CSUM_IP when computing ip->ip_sum S> in ip_fragment(). The first attached patch (against S> gitorious/svn_stable_7) does this. S> S> In looking at this issue, I noticed that ip_output()'s use of sw_csum is S> inconsistent. ip_output() splits the mbuf's csum_flags into two parts: the S> stuff that hardware will assist with (these flags get left in the mbuf) and S> the stuff that software needs to do (these get moved to sw_csum). But S> later ip_output() calls functions that don't get sw_csum, or that don't S> know to look in it and look in the mbuf instead. My second patch fixes S> these kinds of issues and (IMO) simplifies the code by leaving all the S> packet's checksumming needs in the mbuf, getting rid of sw_csum entirely. Thanks for submission! I'm about to commit the attached patch to head. Can you please review it? Haven't I missed anything important? I have tested it in a 3-box setup similar to yours, except the middle box doesn't have em(4) NIC, it has igb(4). I've tested it with rxcsum/txcsum on and off on both NICs. I have also moved from CSUM_DELAY_IP to CSUM_IP. AFAIU, the alias CSUM_DELAY_IP was made to match CSUM_DELAY_DATA. But to my point of view it makes it more difficult to understand code, because a person reading code sees different constants in the stack and in drivers. Since your change touches every line in the stack, that utilizes CSUM_DELAY_IP, I decided to consistently use CSUM_IP constant. -- Totus tuus, Glebius. --jCrbxBqMcLqd4mOl Content-Type: text/x-diff; charset=koi8-r Content-Disposition: attachment; filename="csum.diff" Index: netinet/ip_output.c =================================================================== --- netinet/ip_output.c (revision 242127) +++ netinet/ip_output.c (working copy) @@ -125,7 +125,7 @@ struct sockaddr_in *dst; struct in_ifaddr *ia; int isbroadcast; - uint16_t ip_len, ip_off, sw_csum; + uint16_t ip_len, ip_off; struct route iproute; struct rtentry *rte; /* cache for ro->ro_rt */ struct in_addr odst; @@ -583,18 +583,16 @@ } m->m_pkthdr.csum_flags |= CSUM_IP; - sw_csum = m->m_pkthdr.csum_flags & ~ifp->if_hwassist; - if (sw_csum & CSUM_DELAY_DATA) { + if (m->m_pkthdr.csum_flags & CSUM_DELAY_DATA & ~ifp->if_hwassist) { in_delayed_cksum(m); - sw_csum &= ~CSUM_DELAY_DATA; + m->m_pkthdr.csum_flags &= ~CSUM_DELAY_DATA; } #ifdef SCTP - if (sw_csum & CSUM_SCTP) { + if (m->m_pkthdr.csum_flags & CSUM_SCTP & ~ifp->if_hwassist) { sctp_delayed_cksum(m, (uint32_t)(ip->ip_hl << 2)); - sw_csum &= ~CSUM_SCTP; + m->m_pkthdr.csum_flags &= ~CSUM_SCTP; } #endif - m->m_pkthdr.csum_flags &= ifp->if_hwassist; /* * If small enough for interface, or the interface will take @@ -604,8 +602,10 @@ (m->m_pkthdr.csum_flags & ifp->if_hwassist & CSUM_TSO) != 0 || ((ip_off & IP_DF) == 0 && (ifp->if_hwassist & CSUM_FRAGMENT))) { ip->ip_sum = 0; - if (sw_csum & CSUM_DELAY_IP) + if (m->m_pkthdr.csum_flags & CSUM_IP & ~ifp->if_hwassist) { ip->ip_sum = in_cksum(m, hlen); + m->m_pkthdr.csum_flags &= ~CSUM_IP; + } /* * Record statistics for this interface address. @@ -646,7 +646,7 @@ * Too large for interface; fragment if possible. If successful, * on return, m will point to a list of packets to be sent. */ - error = ip_fragment(ip, &m, mtu, ifp->if_hwassist, sw_csum); + error = ip_fragment(ip, &m, mtu, ifp->if_hwassist); if (error) goto bad; for (; m; m = m0) { @@ -691,11 +691,10 @@ * chain of fragments that should be freed by the caller. * * if_hwassist_flags is the hw offload capabilities (see if_data.ifi_hwassist) - * sw_csum contains the delayed checksums flags (e.g., CSUM_DELAY_IP). */ int ip_fragment(struct ip *ip, struct mbuf **m_frag, int mtu, - u_long if_hwassist_flags, int sw_csum) + u_long if_hwassist_flags) { int error = 0; int hlen = ip->ip_hl << 2; @@ -833,8 +832,10 @@ m->m_pkthdr.csum_flags = m0->m_pkthdr.csum_flags; mhip->ip_off = htons(mhip->ip_off); mhip->ip_sum = 0; - if (sw_csum & CSUM_DELAY_IP) + if (m->m_pkthdr.csum_flags & CSUM_IP & ~if_hwassist_flags) { mhip->ip_sum = in_cksum(m, mhlen); + m->m_pkthdr.csum_flags &= ~CSUM_IP; + } *mnext = m; mnext = &m->m_nextpkt; } @@ -853,8 +854,10 @@ ip->ip_len = htons((u_short)m0->m_pkthdr.len); ip->ip_off = htons(ip_off | IP_MF); ip->ip_sum = 0; - if (sw_csum & CSUM_DELAY_IP) + if (m0->m_pkthdr.csum_flags & CSUM_IP & ~if_hwassist_flags) { ip->ip_sum = in_cksum(m0, hlen); + m0->m_pkthdr.csum_flags &= ~CSUM_IP; + } done: *m_frag = m0; Index: netinet/ip_mroute.c =================================================================== --- netinet/ip_mroute.c (revision 242127) +++ netinet/ip_mroute.c (working copy) @@ -2404,7 +2404,8 @@ ip->ip_sum = in_cksum(mb_copy, ip->ip_hl << 2); } else { /* Fragment the packet */ - if (ip_fragment(ip, &mb_copy, mtu, 0, CSUM_DELAY_IP) != 0) { + mb_copy->m_pkthdr.csum_flags |= CSUM_IP; + if (ip_fragment(ip, &mb_copy, mtu, 0) != 0) { m_freem(mb_copy); return NULL; } Index: netinet/ip_var.h =================================================================== --- netinet/ip_var.h (revision 242127) +++ netinet/ip_var.h (working copy) @@ -210,7 +210,7 @@ int ip_ctloutput(struct socket *, struct sockopt *sopt); void ip_drain(void); int ip_fragment(struct ip *ip, struct mbuf **m_frag, int mtu, - u_long if_hwassist_flags, int sw_csum); + u_long if_hwassist_flags); void ip_forward(struct mbuf *m, int srcrt); void ip_init(void); #ifdef VIMAGE Index: netinet/ip_fastfwd.c =================================================================== --- netinet/ip_fastfwd.c (revision 242127) +++ netinet/ip_fastfwd.c (working copy) @@ -542,10 +542,8 @@ * We have to fragment the packet */ m->m_pkthdr.csum_flags |= CSUM_IP; - if (ip_fragment(ip, &m, mtu, ifp->if_hwassist, - (~ifp->if_hwassist & CSUM_DELAY_IP))) { + if (ip_fragment(ip, &m, mtu, ifp->if_hwassist)) goto drop; - } KASSERT(m != NULL, ("null mbuf and no error")); /* * Send off the fragments via outgoing interface Index: netpfil/pf/pf.c =================================================================== --- netpfil/pf/pf.c (revision 242127) +++ netpfil/pf/pf.c (working copy) @@ -5140,7 +5140,7 @@ struct pf_addr naddr; struct pf_src_node *sn = NULL; int error = 0; - uint16_t ip_len, ip_off, sw_csum; + uint16_t ip_len, ip_off; KASSERT(m && *m && r && oifp, ("%s: invalid parameters", __func__)); KASSERT(dir == PF_IN || dir == PF_OUT, ("%s: invalid direction", @@ -5240,18 +5240,16 @@ /* Copied from FreeBSD 10.0-CURRENT ip_output. */ m0->m_pkthdr.csum_flags |= CSUM_IP; - sw_csum = m0->m_pkthdr.csum_flags & ~ifp->if_hwassist; - if (sw_csum & CSUM_DELAY_DATA) { + if (m0->m_pkthdr.csum_flags & CSUM_DELAY_DATA & ~ifp->if_hwassist) { in_delayed_cksum(m0); - sw_csum &= ~CSUM_DELAY_DATA; + m0->m_pkthdr.csum_flags &= ~CSUM_DELAY_DATA; } #ifdef SCTP - if (sw_csum & CSUM_SCTP) { + if (m0->m_pkthdr.csum_flags & CSUM_SCTP & ~ifp->if_hwassist) { sctp_delayed_cksum(m, (uint32_t)(ip->ip_hl << 2)); - sw_csum &= ~CSUM_SCTP; + m0->m_pkthdr.csum_flags &= ~CSUM_SCTP; } #endif - m0->m_pkthdr.csum_flags &= ifp->if_hwassist; /* * If small enough for interface, or the interface will take @@ -5261,8 +5259,10 @@ (m0->m_pkthdr.csum_flags & ifp->if_hwassist & CSUM_TSO) != 0 || ((ip_off & IP_DF) == 0 && (ifp->if_hwassist & CSUM_FRAGMENT))) { ip->ip_sum = 0; - if (sw_csum & CSUM_DELAY_IP) + if (m0->m_pkthdr.csum_flags & CSUM_IP & ~ifp->if_hwassist) { ip->ip_sum = in_cksum(m0, ip->ip_hl << 2); + m0->m_pkthdr.csum_flags &= ~CSUM_IP; + } m0->m_flags &= ~(M_PROTOFLAGS); error = (*ifp->if_output)(ifp, m0, sintosa(&dst), NULL); goto done; @@ -5280,7 +5280,7 @@ goto bad; } - error = ip_fragment(ip, &m0, ifp->if_mtu, ifp->if_hwassist, sw_csum); + error = ip_fragment(ip, &m0, ifp->if_mtu, ifp->if_hwassist); if (error) goto bad; Index: net/if_bridge.c =================================================================== --- net/if_bridge.c (revision 242127) +++ net/if_bridge.c (working copy) @@ -3379,8 +3379,8 @@ goto out; ip = mtod(m, struct ip *); - error = ip_fragment(ip, &m, ifp->if_mtu, ifp->if_hwassist, - CSUM_DELAY_IP); + m->m_pkthdr.csum_flags |= CSUM_IP; + error = ip_fragment(ip, &m, ifp->if_mtu, ifp->if_hwassist); if (error) goto out; --jCrbxBqMcLqd4mOl--