Date: Tue, 23 Oct 2012 12:25:59 -0600
From: Sebastian Kuzminsky <seb@lineratesystems.com>
To: freebsd-net@freebsd.org
Subject: fragmentation problem in FreeBSD 7
Message-ID: <CAN=597Rb-ToBQuJ%2BYet9e25Hbt-QmLJPKUXGf1pFEbVsRvFONg@mail.gmail.com>
[-- Attachment #1 --]
Hi folks, this is my first post to freebsd-net, and my first bug-fix
submission... I hope this is the right mailing list for this issue, and
the right format for sending in patches....
I'm working on a derivative of FreeBSD 7.
I've run into a problem with IP header checksums when fragmenting to an
e1000 (em) interface, and I've narrowed it down to a very simple test. The
test setup is like this:
[computer A]---(network 1)---[computer B]---(network 2)---[computer C]
That gorgeous drawing shows computer A connected to computer B via network
1, and computer B connected to computer C via network 2. Computer B is set
up to forward packets between networks 1 and 2. A can see B but not C. C
can see B but not A. B forwards between A and C. Pretty simple.
One of B's NICs is a Broadcom, handled by the bce driver; this one works
fine in all my testing.
B's other NIC is an Intel PRO/1000 handled by the em driver. This is the
one giving me trouble.
The test disables PMTUD on all three hosts. It then sets the MTU of the
bce and em interfaces to the unrealistically low value of 72 bytes, and
tries to pass TCP packets back and forth using nc on computers A and C
(with computer B acting as a gateway). This is to force the B gateway to
fragment the TCP frames it forwards.
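To make the effect of the 72-byte MTU concrete, here is a minimal sketch of the fragment-size arithmetic. It assumes a 20-byte IP header with no options, and relies on the IPv4 rule that the payload of every non-final fragment must be a multiple of 8 bytes. The function names are hypothetical and are not taken from the kernel.

```c
/* Largest payload a non-final fragment can carry: whatever fits after
 * the IP header, rounded down to a multiple of 8 bytes. */
static int toy_frag_payload(int mtu, int hlen)
{
    return (mtu - hlen) & ~7;
}

/* Number of IP packets on the wire for a datagram of ip_len bytes
 * (header included).  Returns -1 if the MTU is too small to fragment. */
static int toy_frag_count(int ip_len, int hlen, int mtu)
{
    int per_frag = toy_frag_payload(mtu, hlen);
    int payload = ip_len - hlen;

    if (ip_len <= mtu)
        return 1;               /* fits, no fragmentation needed */
    if (per_frag < 8)
        return -1;              /* cannot make progress */
    return (payload + per_frag - 1) / per_frag;
}
```

With an MTU of 72 and a 20-byte header, each fragment carries at most 48 bytes of payload, so even a modest TCP segment is split into several fragments and the fragmentation path is exercised heavily.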
Receiving on the em and sending on the bce works just fine (as noted
above). Small TCP frames that fit in the MTU, big TCP frames that get
fragmented, no problems.
Receiving on the bce and sending on the em interface works fine for small
TCP frames that don't need fragmentation, but when B has to fragment the IP
packets before sending them out the em, the IP header checksums in the IP
packets that appear on the em's wires are wrong. I came to this conclusion
by packet capture and by watching the 'bad header checksums' counter of
'netstat -s -p ip', both running on the computer receiving the fragments.
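For anyone who wants to check a captured header by hand, the following is a minimal sketch of the standard Internet checksum (RFC 1071) applied to an IP header. It is not the kernel's in_cksum(); the function names are hypothetical, and the header is assumed to be available as a plain byte array in network byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum of the header taken as 16-bit big-endian words,
 * folded to 16 bits and inverted. */
static uint16_t toy_ip_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
    if (len & 1)                      /* pad a trailing odd byte with zero */
        sum += (uint32_t)hdr[len - 1] << 8;
    while (sum >> 16)                 /* fold carries back into the low 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* A received header is good if the checksum computed over the whole
 * header, including the stored ip_sum field, comes out as zero. */
static int toy_ip_checksum_ok(const uint8_t *hdr, size_t len)
{
    return toy_ip_checksum(hdr, len) == 0;
}
```

Running toy_ip_checksum_ok() over the first 20 bytes of each captured fragment (more if the header carries options) gives the same verdict the receiver uses when it increments its bad-header-checksum counter.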
OK, those are all my observations. Next come my thoughts about the cause and a
proposed fix.
The root of the problem is two-fold:
1. ip_output.c:ip_fragment() does not clear the CSUM_IP flag in the mbuf
when it does software IP checksum computation, so the mbuf still looks like
it needs IP checksumming.
2. The em driver does not advertise IP checksum offloading, but still
checks the CSUM_IP flag in the mbuf and modifies the packet when that flag
is set (this is in em_transmit_checksum_setup(), called by em_xmit()).
Unfortunately the em driver gets the checksum wrong in this case; I guess
that's why it doesn't advertise this capability in its if_hwassist!
So the fragments that ip_fastfwd.c:ip_fastforward() gets from
ip_output.c:ip_fragment() have ip->ip_sum set correctly, but the
mbuf->m_pkthdr.csum_flags incorrectly has CSUM_IP still set, and this
causes the em driver to emit incorrect packets.
There are some other callers of ip_fragment(), notably ip_output().
ip_output() clears CSUM_IP in the mbuf csum_flags itself if it's not in
if_hwassist, so it avoids this problem.
So, the fix is simple: clear the mbuf's CSUM_IP when computing ip->ip_sum
in ip_fragment(). The first attached patch (against
gitorious/svn_stable_7) does this.
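The interaction can be illustrated with a small toy model. This is only a sketch: the struct, the flag value and every toy_* name are made up for illustration, and the "driver" simply models the behaviour described above (it acts on the flag whether or not it advertises the capability, and gets the result wrong).

```c
#include <stdint.h>

/* Illustrative flag value only; the real definition lives in sys/mbuf.h. */
#define TOY_CSUM_IP 0x0001u

struct toy_pkt {
    uint32_t csum_flags;  /* stands in for m_pkthdr.csum_flags */
    int      ip_sum_ok;   /* 1 if ip_sum currently holds a correct checksum */
};

/* Fragmenter: computes the IP checksum in software when sw_csum asks for it.
 * clear_flag == 0 models the old behaviour, 1 models the proposed fix. */
static void toy_fragment(struct toy_pkt *p, uint32_t sw_csum, int clear_flag)
{
    p->ip_sum_ok = 0;
    if (sw_csum & TOY_CSUM_IP) {
        p->ip_sum_ok = 1;
        if (clear_flag)
            p->csum_flags &= ~TOY_CSUM_IP;
    }
}

/* Driver that rewrites the checksum whenever the flag is set, and gets it
 * wrong. */
static void toy_em_xmit(struct toy_pkt *p)
{
    if (p->csum_flags & TOY_CSUM_IP)
        p->ip_sum_ok = 0;
}

/* Fast-forward path: mark the packet, fragment it, hand it to the driver.
 * Returns 1 if the checksum that reaches the wire is correct. */
static int toy_forward(uint32_t if_hwassist, int clear_flag)
{
    struct toy_pkt p = { 0, 0 };

    p.csum_flags |= TOY_CSUM_IP;
    toy_fragment(&p, ~if_hwassist & TOY_CSUM_IP, clear_flag);
    toy_em_xmit(&p);
    return p.ip_sum_ok;
}
```

In this model toy_forward(0, 0) returns 0 (the stale flag lets the driver clobber a good checksum), while toy_forward(0, 1) returns 1 because the fragmenter clears the flag once the software checksum is in place.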
In looking at this issue, I noticed that ip_output()'s use of sw_csum is
inconsistent. ip_output() splits the mbuf's csum_flags into two parts: the
stuff that hardware will assist with (these flags get left in the mbuf) and
the stuff that software needs to do (these get moved to sw_csum). But
later ip_output() calls functions that don't get sw_csum, or that don't
know to look in it and look in the mbuf instead. My second patch fixes
these kinds of issues and (IMO) simplifies the code by leaving all the
packet's checksumming needs in the mbuf, getting rid of sw_csum entirely.
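As a rough sketch of the two bookkeeping styles, the code below contrasts the split into "hardware part in the mbuf, software part in sw_csum" with a single record that is queried per operation. The flag values and all toy_* names are illustrative assumptions, not the kernel's definitions.

```c
#include <stdint.h>

/* Illustrative flag values only. */
#define TOY_CSUM_IP   0x0001u
#define TOY_CSUM_DATA 0x0006u

/* Old style: the packet's needs are split across two places. */
struct toy_split {
    uint32_t mbuf_flags;  /* what hardware will do (left in the mbuf) */
    uint32_t sw_csum;     /* what software must do (local variable) */
};

static struct toy_split toy_split_needs(uint32_t needs, uint32_t if_hwassist)
{
    struct toy_split s;

    s.sw_csum = needs & ~if_hwassist;
    s.mbuf_flags = needs & if_hwassist;
    return s;
}

/* New style: the mbuf flags are the only record.  Ask, per operation,
 * whether software has to do it for this interface. */
static int toy_needs_sw(uint32_t csum_flags, uint32_t which,
    uint32_t if_hwassist)
{
    return (csum_flags & which & ~if_hwassist) != 0;
}

/* Once software has done the work, clear it from the single record. */
static uint32_t toy_done_in_sw(uint32_t csum_flags, uint32_t which)
{
    return csum_flags & ~which;
}
```

Both styles agree on when software has to compute a checksum. The difference is that in the second style any function that only receives the mbuf, such as ip_fragment() or in_delayed_cksum(), sees the complete picture.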
--
Sebastian Kuzminsky
Linerate Systems
[-- Attachment #2 --]
From c04a7a95890ef5d032e6998675496bb438c3a14b Mon Sep 17 00:00:00 2001
From: Sebastian Kuzminsky <seb@lineratesystems.com>
Date: Mon, 22 Oct 2012 21:08:40 -0600
Subject: [PATCH 1/2] Update the mbuf csum_flags of IP fragments when
computing their IP checksum
Before this commit, the ip_fragment() function does not clear the
mbuf CSUM_IP flag ("this mbuf needs an IP header checksum"), even
when it computes the IP header checksum itself. This behavior is
acceptable when ip_fragment() is called from ip_output(), because
ip_output() clears the mbuf's flag. But it is not acceptable when
ip_fragment() is called from ip_fastforward(), because ip_fastforward()
does not clear the mbuf's flag.
The result is that, when forwarding a packet that needs fragmentation,
and the fragments are sent by a NIC that does not advertise hardware
IP checksum offloading, and that NIC *does* check for the CSUM_IP
flag anyway and then gets the IP checksum wrong, *then* the fragments
going out on the wire would have the wrong checksum.
The em driver does not advertise IP header checksum offloading, but
does try to set up IP header checksum offloading anyway when the
mbuf is marked CSUM_IP, and it gets the IP checksum wrong.
The fix is to clear the CSUM_IP flag in the mbuf in ip_fragment()
when the IP checksum is computed, to let the lower layers know that
they don't need to do it.
---
sys/netinet/ip_fastfwd.c | 9 ++++++++-
sys/netinet/ip_output.c | 30 ++++++++++++++++++++++++++++--
2 files changed, 36 insertions(+), 3 deletions(-)
diff --git a/sys/netinet/ip_fastfwd.c b/sys/netinet/ip_fastfwd.c
index e84695e..ae65bfe 100644
--- a/sys/netinet/ip_fastfwd.c
+++ b/sys/netinet/ip_fastfwd.c
@@ -558,12 +558,19 @@ passout:
goto consumed;
} else {
/*
- * We have to fragment the packet
+ * We have to fragment this packet, and the fragments
+ * will need all-new IP checksums. (The payload
+ * checksums, if any, don't need to be modified
+ * because the payload will be reassembled before
+ * delivery.)
*/
m->m_pkthdr.csum_flags |= CSUM_IP;
/*
* ip_fragment expects ip_len and ip_off in host byte
* order but returns all packets in network byte order
+ * If if_hwassist doesn't advertise IP checksum
+ * offloading, ask ip_fragment to do it for us in
+ * software now.
*/
if (ip_fragment(ip, &m, mtu, ifp->if_hwassist,
(~ifp->if_hwassist & CSUM_DELAY_IP))) {
diff --git a/sys/netinet/ip_output.c b/sys/netinet/ip_output.c
index adbb074..08c4185 100644
--- a/sys/netinet/ip_output.c
+++ b/sys/netinet/ip_output.c
@@ -506,12 +506,30 @@ passout:
}
}
+ /* Annotate the outgoing packet: it needs its IP header checksummed. */
m->m_pkthdr.csum_flags |= CSUM_IP;
+
+ /* sw_csum is everything the packet needs that *won't* be done in
+ * hardware.
+ */
sw_csum = m->m_pkthdr.csum_flags & ~ifp->if_hwassist;
+
+ /* Do payload checksumming in software, now, if needed & wanted. */
if (sw_csum & CSUM_DELAY_DATA) {
in_delayed_cksum(m);
sw_csum &= ~CSUM_DELAY_DATA;
}
+
+ /* Clear all the packet's needs that'll be done by software.
+ * At this point the packet's needs are (m_pkthdr.csum_flags | sw_csum),
+ * and software should do the stuff in sw_csum.
+ *
+ * FIXME: This is a bug, stuff in the code paths after this
+ * (for example ip_fragment) expect m_pkthdr->csum_flags to be the
+ * list of stuff the packet needs. in_delayed_cksum() above also
+ * has this expectation, which is why this code is convoluted to
+ * call it before clearing m's csum_flags.
+ */
m->m_pkthdr.csum_flags &= ifp->if_hwassist;
/*
@@ -526,6 +544,10 @@ passout:
ip->ip_sum = 0;
if (sw_csum & CSUM_DELAY_IP)
ip->ip_sum = in_cksum(m, hlen);
+ /* Normally we'd clear CSUM_DELAY_IP out of sw_csum
+ * here, but that variable is not used again before
+ * it passes out of scope.
+ */
/*
* Record statistics for this interface address.
@@ -743,8 +765,10 @@ smart_frag_failure:
m->m_pkthdr.csum_flags = m0->m_pkthdr.csum_flags;
mhip->ip_off = htons(mhip->ip_off);
mhip->ip_sum = 0;
- if (sw_csum & CSUM_DELAY_IP)
+ if (sw_csum & CSUM_DELAY_IP) {
mhip->ip_sum = in_cksum(m, mhlen);
+ m->m_pkthdr.csum_flags &= ~CSUM_DELAY_IP;
+ }
*mnext = m;
mnext = &m->m_nextpkt;
}
@@ -764,8 +788,10 @@ smart_frag_failure:
ip->ip_off |= IP_MF;
ip->ip_off = htons(ip->ip_off);
ip->ip_sum = 0;
- if (sw_csum & CSUM_DELAY_IP)
+ if (sw_csum & CSUM_DELAY_IP) {
ip->ip_sum = in_cksum(m0, hlen);
+ m0->m_pkthdr.csum_flags &= ~CSUM_DELAY_IP;
+ }
done:
*m_frag = m0;
--
1.7.8.3
[-- Attachment #3 --]
From ebbbd7ad64a1cafd9a4b1182ede182fbc373b529 Mon Sep 17 00:00:00 2001
From: Sebastian Kuzminsky <seb@lineratesystems.com>
Date: Tue, 23 Oct 2012 10:59:20 -0600
Subject: [PATCH 2/2] Simplify the tracking of mbuf checksumming needs
The IP code tracks outgoing packets' checksumming needs inconsistently.
The sw_csum variable complicates but does not add value.
The sw_csum variable is not needed. This commit removes it. The
mbuf's m_pkthdr->csum_flags are now the one record of what the
packet needs.
---
sys/contrib/pf/net/pf.c | 14 +++++---------
sys/net/if_bridge.c | 8 ++++++--
sys/netinet/ip_fastfwd.c | 3 +--
sys/netinet/ip_mroute.c | 3 ++-
sys/netinet/ip_output.c | 40 ++++++++++------------------------------
sys/netinet/ip_var.h | 2 +-
6 files changed, 25 insertions(+), 45 deletions(-)
diff --git a/sys/contrib/pf/net/pf.c b/sys/contrib/pf/net/pf.c
index 20e925b..0cdd217 100644
--- a/sys/contrib/pf/net/pf.c
+++ b/sys/contrib/pf/net/pf.c
@@ -6253,9 +6253,6 @@ pf_route(struct mbuf **m, struct pf_rule *r, int dir, struct ifnet *oifp,
struct pf_addr naddr;
struct pf_src_node *sn = NULL;
int error = 0;
-#ifdef __FreeBSD__
- int sw_csum;
-#endif
#ifdef IPSEC
struct m_tag *mtag;
#endif /* IPSEC */
@@ -6361,8 +6358,7 @@ pf_route(struct mbuf **m, struct pf_rule *r, int dir, struct ifnet *oifp,
#ifdef __FreeBSD__
/* Copied from FreeBSD 5.1-CURRENT ip_output. */
m0->m_pkthdr.csum_flags |= CSUM_IP;
- sw_csum = m0->m_pkthdr.csum_flags & ~ifp->if_hwassist;
- if (sw_csum & CSUM_DELAY_DATA) {
+ if (m0->m_pkthdr.csum_flags & CSUM_DELAY_DATA & ~ifp->if_hwassist) {
/*
* XXX: in_delayed_cksum assumes HBO for ip->ip_len (at least)
*/
@@ -6371,9 +6367,8 @@ pf_route(struct mbuf **m, struct pf_rule *r, int dir, struct ifnet *oifp,
in_delayed_cksum(m0);
HTONS(ip->ip_len);
HTONS(ip->ip_off);
- sw_csum &= ~CSUM_DELAY_DATA;
+ m0->m_pkthdr.csum_flags &= ~CSUM_DELAY_DATA;
}
- m0->m_pkthdr.csum_flags &= ifp->if_hwassist;
if (ntohs(ip->ip_len) <= ifp->if_mtu ||
(m0->m_pkthdr.csum_flags & ifp->if_hwassist & CSUM_TSO) != 0 ||
@@ -6384,7 +6379,7 @@ pf_route(struct mbuf **m, struct pf_rule *r, int dir, struct ifnet *oifp,
* ip->ip_off = htons(ip->ip_off);
*/
ip->ip_sum = 0;
- if (sw_csum & CSUM_DELAY_IP) {
+ if (m0->m_pkthdr.csum_flags & CSUM_DELAY_IP & ~ifp->if_hwassist) {
/* From KAME */
if (ip->ip_v == IPVERSION &&
(ip->ip_hl << 2) == sizeof(*ip)) {
@@ -6392,6 +6387,7 @@ pf_route(struct mbuf **m, struct pf_rule *r, int dir, struct ifnet *oifp,
} else {
ip->ip_sum = in_cksum(m0, ip->ip_hl << 2);
}
+ m0->m_pkthdr.csum_flags &= ~CSUM_DELAY_IP;
}
PF_UNLOCK();
error = (*ifp->if_output)(ifp, m0, sintosa(dst), ro->ro_rt);
@@ -6478,7 +6474,7 @@ pf_route(struct mbuf **m, struct pf_rule *r, int dir, struct ifnet *oifp,
*/
NTOHS(ip->ip_len);
NTOHS(ip->ip_off);
- error = ip_fragment(ip, &m0, ifp->if_mtu, ifp->if_hwassist, sw_csum);
+ error = ip_fragment(ip, &m0, ifp->if_mtu, ifp->if_hwassist);
#else
error = ip_fragment(m0, ifp, ifp->if_mtu);
#endif
diff --git a/sys/net/if_bridge.c b/sys/net/if_bridge.c
index 4d1b9da..a2a2b0d 100644
--- a/sys/net/if_bridge.c
+++ b/sys/net/if_bridge.c
@@ -3347,8 +3347,12 @@ bridge_fragment(struct ifnet *ifp, struct mbuf *m, struct ether_header *eh,
goto out;
ip = mtod(m, struct ip *);
- error = ip_fragment(ip, &m, ifp->if_mtu, ifp->if_hwassist,
- CSUM_DELAY_IP);
+ /* We're going to fragment the IP packet, the fragments will need
+ * new IP checksums.
+ */
+ m->m_pkthdr.csum_flags |= CSUM_DELAY_IP;
+
+ error = ip_fragment(ip, &m, ifp->if_mtu, ifp->if_hwassist);
if (error)
goto out;
diff --git a/sys/netinet/ip_fastfwd.c b/sys/netinet/ip_fastfwd.c
index ae65bfe..852ea8e 100644
--- a/sys/netinet/ip_fastfwd.c
+++ b/sys/netinet/ip_fastfwd.c
@@ -572,8 +572,7 @@ passout:
* offloading, ask ip_fragment to do it for us in
* software now.
*/
- if (ip_fragment(ip, &m, mtu, ifp->if_hwassist,
- (~ifp->if_hwassist & CSUM_DELAY_IP))) {
+ if (ip_fragment(ip, &m, mtu, ifp->if_hwassist)) {
goto drop;
}
KASSERT(m != NULL, ("null mbuf and no error"));
diff --git a/sys/netinet/ip_mroute.c b/sys/netinet/ip_mroute.c
index d60e8bd..dc5b0af 100644
--- a/sys/netinet/ip_mroute.c
+++ b/sys/netinet/ip_mroute.c
@@ -2630,7 +2630,8 @@ pim_register_prepare(struct ip *ip, struct mbuf *m)
ip->ip_sum = in_cksum(mb_copy, ip->ip_hl << 2);
} else {
/* Fragment the packet */
- if (ip_fragment(ip, &mb_copy, mtu, 0, CSUM_DELAY_IP) != 0) {
+ mb_copy->m_pkthdr.csum_flags |= CSUM_DELAY_IP;
+ if (ip_fragment(ip, &mb_copy, mtu, 0) != 0) {
m_freem(mb_copy);
return NULL;
}
diff --git a/sys/netinet/ip_output.c b/sys/netinet/ip_output.c
index 08c4185..c6f2d2d 100644
--- a/sys/netinet/ip_output.c
+++ b/sys/netinet/ip_output.c
@@ -112,7 +112,7 @@ ip_output(struct mbuf *m, struct mbuf *opt, struct route *ro, int flags,
int len, error = 0;
struct sockaddr_in *dst = NULL; /* keep compiler happy */
struct in_ifaddr *ia = NULL;
- int isbroadcast, sw_csum;
+ int isbroadcast;
struct route iproute;
struct in_addr odst;
#ifdef IPFIREWALL_FORWARD
@@ -509,29 +509,12 @@ passout:
/* Annotate the outgoing packet: it needs its IP header checksummed. */
m->m_pkthdr.csum_flags |= CSUM_IP;
- /* sw_csum is everything the packet needs that *won't* be done in
- * hardware.
- */
- sw_csum = m->m_pkthdr.csum_flags & ~ifp->if_hwassist;
-
/* Do payload checksumming in software, now, if needed & wanted. */
- if (sw_csum & CSUM_DELAY_DATA) {
+ if (m->m_pkthdr.csum_flags & CSUM_DELAY_DATA & ~ifp->if_hwassist) {
in_delayed_cksum(m);
- sw_csum &= ~CSUM_DELAY_DATA;
+ m->m_pkthdr.csum_flags &= ~CSUM_DELAY_DATA;
}
- /* Clear all the packet's needs that'll be done by software.
- * At this point the packet's needs are (m_pkthdr.csum_flags | sw_csum),
- * and software should do the stuff in sw_csum.
- *
- * FIXME: This is a bug, stuff in the code paths after this
- * (for example ip_fragment) expect m_pkthdr->csum_flags to be the
- * list of stuff the packet needs. in_delayed_cksum() above also
- * has this expectation, which is why this code is convoluted to
- * call it before clearing m's csum_flags.
- */
- m->m_pkthdr.csum_flags &= ifp->if_hwassist;
-
/*
* If small enough for interface, or the interface will take
* care of the fragmentation for us, we can just send directly.
@@ -542,12 +525,10 @@ passout:
ip->ip_len = htons(ip->ip_len);
ip->ip_off = htons(ip->ip_off);
ip->ip_sum = 0;
- if (sw_csum & CSUM_DELAY_IP)
+ if ((m->m_pkthdr.csum_flags & CSUM_DELAY_IP) & ~ifp->if_hwassist) {
ip->ip_sum = in_cksum(m, hlen);
- /* Normally we'd clear CSUM_DELAY_IP out of sw_csum
- * here, but that variable is not used again before
- * it passes out of scope.
- */
+ m->m_pkthdr.csum_flags &= ~CSUM_DELAY_IP;
+ }
/*
* Record statistics for this interface address.
@@ -589,7 +570,7 @@ passout:
* Too large for interface; fragment if possible. If successful,
* on return, m will point to a list of packets to be sent.
*/
- error = ip_fragment(ip, &m, mtu, ifp->if_hwassist, sw_csum);
+ error = ip_fragment(ip, &m, mtu, ifp->if_hwassist);
if (error)
goto bad;
for (; m; m = m0) {
@@ -633,11 +614,10 @@ bad:
* chain of fragments that should be freed by the caller.
*
* if_hwassist_flags is the hw offload capabilities (see if_data.ifi_hwassist)
- * sw_csum contains the delayed checksums flags (e.g., CSUM_DELAY_IP).
*/
int
ip_fragment(struct ip *ip, struct mbuf **m_frag, int mtu,
- u_long if_hwassist_flags, int sw_csum)
+ u_long if_hwassist_flags)
{
int error = 0;
int hlen = ip->ip_hl << 2;
@@ -765,7 +745,7 @@ smart_frag_failure:
m->m_pkthdr.csum_flags = m0->m_pkthdr.csum_flags;
mhip->ip_off = htons(mhip->ip_off);
mhip->ip_sum = 0;
- if (sw_csum & CSUM_DELAY_IP) {
+ if (m->m_pkthdr.csum_flags & CSUM_DELAY_IP & ~if_hwassist_flags) {
mhip->ip_sum = in_cksum(m, mhlen);
m->m_pkthdr.csum_flags &= ~CSUM_DELAY_IP;
}
@@ -788,7 +768,7 @@ smart_frag_failure:
ip->ip_off |= IP_MF;
ip->ip_off = htons(ip->ip_off);
ip->ip_sum = 0;
- if (sw_csum & CSUM_DELAY_IP) {
+ if (m0->m_pkthdr.csum_flags & CSUM_DELAY_IP & ~if_hwassist_flags) {
ip->ip_sum = in_cksum(m0, hlen);
m0->m_pkthdr.csum_flags &= ~CSUM_DELAY_IP;
}
diff --git a/sys/netinet/ip_var.h b/sys/netinet/ip_var.h
index 19e9b7e..8cbc74d 100644
--- a/sys/netinet/ip_var.h
+++ b/sys/netinet/ip_var.h
@@ -195,7 +195,7 @@ int ip_ctloutput(struct socket *, struct sockopt *sopt);
void ip_drain(void);
void ip_fini(void *xtp);
int ip_fragment(struct ip *ip, struct mbuf **m_frag, int mtu,
- u_long if_hwassist_flags, int sw_csum);
+ u_long if_hwassist_flags);
void ip_forward(struct mbuf *m, int srcrt);
void ip_init(void);
extern int
--
1.7.8.3
