Date: Fri, 26 Oct 2012 17:53:54 +0400 From: Gleb Smirnoff <glebius@FreeBSD.org> To: Sebastian Kuzminsky <seb@lineratesystems.com> Cc: freebsd-net@FreeBSD.org Subject: Re: fragmentation problem in FreeBSD 7 Message-ID: <20121026135354.GD70741@FreeBSD.org> In-Reply-To: <CAN=597Rb-ToBQuJ%2BYet9e25Hbt-QmLJPKUXGf1pFEbVsRvFONg@mail.gmail.com> References: <CAN=597Rb-ToBQuJ%2BYet9e25Hbt-QmLJPKUXGf1pFEbVsRvFONg@mail.gmail.com>
next in thread | previous in thread | raw e-mail | index | archive | help
--jCrbxBqMcLqd4mOl
Content-Type: text/plain; charset=koi8-r
Content-Disposition: inline
Sebastian,
On Tue, Oct 23, 2012 at 12:25:59PM -0600, Sebastian Kuzminsky wrote:
S> The root of the problem is two-fold:
S>
S> 1. ip_output.c:ip_fragment() does not clear the CSUM_IP flag in the mbuf
S> when it does software IP checksum computation, so the mbuf still looks like
S> it needs IP checksumming.
S>
S> 2. The em driver does not advertise IP checksum offloading, but still
S> checks the CSUM_IP flag in the mbuf and modifies the packet when that flag
S> is set (this is in em_transmit_checksum_setup(), called by em_xmit()).
S> Unfortunately the em driver gets the checksum wrong in this case, i guess
S> that's why it doesn't advertise this capability in its if_hwassist!
S>
S> So the fragments that ip_fastfwd.c:ip_fastforward() gets from
S> ip_output.c:ip_fragment() have ip->ip_sum set correctly, but the
S> mbuf->m_pkthdr.csum_flags incorrectly has CSUM_IP still set, and this
S> causes the em driver to emit incorrect packets.
S>
S> There are some other callers of ip_fragment(), notably ip_output().
S> ip_output() clears CSUM_IP in the mbuf csum_flags itself if it's not in
S> if_hwassist, so avoids this problem.
S>
S> So, the fix is simple: clear the mbuf's CSUM_IP when computing ip->ip_sum
S> in ip_fragment(). The first attached patch (against
S> gitorious/svn_stable_7) does this.
S>
S> In looking at this issue, I noticed that ip_output()'s use of sw_csum is
S> inconsistent. ip_output() splits the mbuf's csum_flags into two parts: the
S> stuff that hardware will assist with (these flags get left in the mbuf) and
S> the stuff that software needs to do (these get moved to sw_csum). But
S> later ip_output() calls functions that don't get sw_csum, or that don't
S> know to look in it and look in the mbuf instead. My second patch fixes
S> these kinds of issues and (IMO) simplifies the code by leaving all the
S> packet's checksumming needs in the mbuf, getting rid of sw_csum entirely.
Thanks for submission!
I'm about to commit the attached patch to head. Can you please review it?
Haven't I missed anything important?
I have tested it in a 3-box setup similar to yours, except the middle box
doesn't have em(4) NIC, it has igb(4). I've tested it with rxcsum/txcsum on
and off on both NICs.
I have also moved from CSUM_DELAY_IP to CSUM_IP. AFAIU, the alias CSUM_DELAY_IP
was made to match CSUM_DELAY_DATA. But to my point of view it makes it more
difficult to understand code, because a person reading code sees different
constants in the stack and in drivers. Since your change touches every line
in the stack, that utilizes CSUM_DELAY_IP, I decided to consistently use
CSUM_IP constant.
--
Totus tuus, Glebius.
--jCrbxBqMcLqd4mOl
Content-Type: text/x-diff; charset=koi8-r
Content-Disposition: attachment; filename="csum.diff"
Index: netinet/ip_output.c
===================================================================
--- netinet/ip_output.c (revision 242127)
+++ netinet/ip_output.c (working copy)
@@ -125,7 +125,7 @@
struct sockaddr_in *dst;
struct in_ifaddr *ia;
int isbroadcast;
- uint16_t ip_len, ip_off, sw_csum;
+ uint16_t ip_len, ip_off;
struct route iproute;
struct rtentry *rte; /* cache for ro->ro_rt */
struct in_addr odst;
@@ -583,18 +583,16 @@
}
m->m_pkthdr.csum_flags |= CSUM_IP;
- sw_csum = m->m_pkthdr.csum_flags & ~ifp->if_hwassist;
- if (sw_csum & CSUM_DELAY_DATA) {
+ if (m->m_pkthdr.csum_flags & CSUM_DELAY_DATA & ~ifp->if_hwassist) {
in_delayed_cksum(m);
- sw_csum &= ~CSUM_DELAY_DATA;
+ m->m_pkthdr.csum_flags &= ~CSUM_DELAY_DATA;
}
#ifdef SCTP
- if (sw_csum & CSUM_SCTP) {
+ if (m->m_pkthdr.csum_flags & CSUM_SCTP & ~ifp->if_hwassist) {
sctp_delayed_cksum(m, (uint32_t)(ip->ip_hl << 2));
- sw_csum &= ~CSUM_SCTP;
+ m->m_pkthdr.csum_flags &= ~CSUM_SCTP;
}
#endif
- m->m_pkthdr.csum_flags &= ifp->if_hwassist;
/*
* If small enough for interface, or the interface will take
@@ -604,8 +602,10 @@
(m->m_pkthdr.csum_flags & ifp->if_hwassist & CSUM_TSO) != 0 ||
((ip_off & IP_DF) == 0 && (ifp->if_hwassist & CSUM_FRAGMENT))) {
ip->ip_sum = 0;
- if (sw_csum & CSUM_DELAY_IP)
+ if (m->m_pkthdr.csum_flags & CSUM_IP & ~ifp->if_hwassist) {
ip->ip_sum = in_cksum(m, hlen);
+ m->m_pkthdr.csum_flags &= ~CSUM_IP;
+ }
/*
* Record statistics for this interface address.
@@ -646,7 +646,7 @@
* Too large for interface; fragment if possible. If successful,
* on return, m will point to a list of packets to be sent.
*/
- error = ip_fragment(ip, &m, mtu, ifp->if_hwassist, sw_csum);
+ error = ip_fragment(ip, &m, mtu, ifp->if_hwassist);
if (error)
goto bad;
for (; m; m = m0) {
@@ -691,11 +691,10 @@
* chain of fragments that should be freed by the caller.
*
* if_hwassist_flags is the hw offload capabilities (see if_data.ifi_hwassist)
- * sw_csum contains the delayed checksums flags (e.g., CSUM_DELAY_IP).
*/
int
ip_fragment(struct ip *ip, struct mbuf **m_frag, int mtu,
- u_long if_hwassist_flags, int sw_csum)
+ u_long if_hwassist_flags)
{
int error = 0;
int hlen = ip->ip_hl << 2;
@@ -833,8 +832,10 @@
m->m_pkthdr.csum_flags = m0->m_pkthdr.csum_flags;
mhip->ip_off = htons(mhip->ip_off);
mhip->ip_sum = 0;
- if (sw_csum & CSUM_DELAY_IP)
+ if (m->m_pkthdr.csum_flags & CSUM_IP & ~if_hwassist_flags) {
mhip->ip_sum = in_cksum(m, mhlen);
+ m->m_pkthdr.csum_flags &= ~CSUM_IP;
+ }
*mnext = m;
mnext = &m->m_nextpkt;
}
@@ -853,8 +854,10 @@
ip->ip_len = htons((u_short)m0->m_pkthdr.len);
ip->ip_off = htons(ip_off | IP_MF);
ip->ip_sum = 0;
- if (sw_csum & CSUM_DELAY_IP)
+ if (m0->m_pkthdr.csum_flags & CSUM_IP & ~if_hwassist_flags) {
ip->ip_sum = in_cksum(m0, hlen);
+ m0->m_pkthdr.csum_flags &= ~CSUM_IP;
+ }
done:
*m_frag = m0;
Index: netinet/ip_mroute.c
===================================================================
--- netinet/ip_mroute.c (revision 242127)
+++ netinet/ip_mroute.c (working copy)
@@ -2404,7 +2404,8 @@
ip->ip_sum = in_cksum(mb_copy, ip->ip_hl << 2);
} else {
/* Fragment the packet */
- if (ip_fragment(ip, &mb_copy, mtu, 0, CSUM_DELAY_IP) != 0) {
+ mb_copy->m_pkthdr.csum_flags |= CSUM_IP;
+ if (ip_fragment(ip, &mb_copy, mtu, 0) != 0) {
m_freem(mb_copy);
return NULL;
}
Index: netinet/ip_var.h
===================================================================
--- netinet/ip_var.h (revision 242127)
+++ netinet/ip_var.h (working copy)
@@ -210,7 +210,7 @@
int ip_ctloutput(struct socket *, struct sockopt *sopt);
void ip_drain(void);
int ip_fragment(struct ip *ip, struct mbuf **m_frag, int mtu,
- u_long if_hwassist_flags, int sw_csum);
+ u_long if_hwassist_flags);
void ip_forward(struct mbuf *m, int srcrt);
void ip_init(void);
#ifdef VIMAGE
Index: netinet/ip_fastfwd.c
===================================================================
--- netinet/ip_fastfwd.c (revision 242127)
+++ netinet/ip_fastfwd.c (working copy)
@@ -542,10 +542,8 @@
* We have to fragment the packet
*/
m->m_pkthdr.csum_flags |= CSUM_IP;
- if (ip_fragment(ip, &m, mtu, ifp->if_hwassist,
- (~ifp->if_hwassist & CSUM_DELAY_IP))) {
+ if (ip_fragment(ip, &m, mtu, ifp->if_hwassist))
goto drop;
- }
KASSERT(m != NULL, ("null mbuf and no error"));
/*
* Send off the fragments via outgoing interface
Index: netpfil/pf/pf.c
===================================================================
--- netpfil/pf/pf.c (revision 242127)
+++ netpfil/pf/pf.c (working copy)
@@ -5140,7 +5140,7 @@
struct pf_addr naddr;
struct pf_src_node *sn = NULL;
int error = 0;
- uint16_t ip_len, ip_off, sw_csum;
+ uint16_t ip_len, ip_off;
KASSERT(m && *m && r && oifp, ("%s: invalid parameters", __func__));
KASSERT(dir == PF_IN || dir == PF_OUT, ("%s: invalid direction",
@@ -5240,18 +5240,16 @@
/* Copied from FreeBSD 10.0-CURRENT ip_output. */
m0->m_pkthdr.csum_flags |= CSUM_IP;
- sw_csum = m0->m_pkthdr.csum_flags & ~ifp->if_hwassist;
- if (sw_csum & CSUM_DELAY_DATA) {
+ if (m0->m_pkthdr.csum_flags & CSUM_DELAY_DATA & ~ifp->if_hwassist) {
in_delayed_cksum(m0);
- sw_csum &= ~CSUM_DELAY_DATA;
+ m0->m_pkthdr.csum_flags &= ~CSUM_DELAY_DATA;
}
#ifdef SCTP
- if (sw_csum & CSUM_SCTP) {
+ if (m0->m_pkthdr.csum_flags & CSUM_SCTP & ~ifp->if_hwassist) {
sctp_delayed_cksum(m, (uint32_t)(ip->ip_hl << 2));
- sw_csum &= ~CSUM_SCTP;
+ m0->m_pkthdr.csum_flags &= ~CSUM_SCTP;
}
#endif
- m0->m_pkthdr.csum_flags &= ifp->if_hwassist;
/*
* If small enough for interface, or the interface will take
@@ -5261,8 +5259,10 @@
(m0->m_pkthdr.csum_flags & ifp->if_hwassist & CSUM_TSO) != 0 ||
((ip_off & IP_DF) == 0 && (ifp->if_hwassist & CSUM_FRAGMENT))) {
ip->ip_sum = 0;
- if (sw_csum & CSUM_DELAY_IP)
+ if (m0->m_pkthdr.csum_flags & CSUM_IP & ~ifp->if_hwassist) {
ip->ip_sum = in_cksum(m0, ip->ip_hl << 2);
+ m0->m_pkthdr.csum_flags &= ~CSUM_IP;
+ }
m0->m_flags &= ~(M_PROTOFLAGS);
error = (*ifp->if_output)(ifp, m0, sintosa(&dst), NULL);
goto done;
@@ -5280,7 +5280,7 @@
goto bad;
}
- error = ip_fragment(ip, &m0, ifp->if_mtu, ifp->if_hwassist, sw_csum);
+ error = ip_fragment(ip, &m0, ifp->if_mtu, ifp->if_hwassist);
if (error)
goto bad;
Index: net/if_bridge.c
===================================================================
--- net/if_bridge.c (revision 242127)
+++ net/if_bridge.c (working copy)
@@ -3379,8 +3379,8 @@
goto out;
ip = mtod(m, struct ip *);
- error = ip_fragment(ip, &m, ifp->if_mtu, ifp->if_hwassist,
- CSUM_DELAY_IP);
+ m->m_pkthdr.csum_flags |= CSUM_IP;
+ error = ip_fragment(ip, &m, ifp->if_mtu, ifp->if_hwassist);
if (error)
goto out;
--jCrbxBqMcLqd4mOl--
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20121026135354.GD70741>
