From owner-freebsd-net@FreeBSD.ORG  Tue Feb  5 10:49:02 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 65EE156A;
 Tue,  5 Feb 2013 10:49:02 +0000 (UTC)
 (envelope-from randall@lakerest.net)
Received: from lakerest.net (lakerest.net [70.155.160.98])
 by mx1.freebsd.org (Postfix) with ESMTP id D066CD62;
 Tue,  5 Feb 2013 10:49:01 +0000 (UTC)
Received: from [10.1.1.101] (bsd4.lakerest.net [70.155.160.102])
 (authenticated bits=0)
 by lakerest.net (8.14.4/8.14.3) with ESMTP id r15AnGZb038037
 (version=TLSv1/SSLv3 cipher=AES128-SHA bits=128 verify=NOT);
 Tue, 5 Feb 2013 05:49:16 -0500 (EST)
 (envelope-from randall@lakerest.net)
Subject: Re: Driver patch to look at...
Mime-Version: 1.0 (Apple Message framework v1283)
Content-Type: text/plain; charset=us-ascii
From: Randy Stewart <randall@lakerest.net>
In-Reply-To: <201302041524.58699.jhb@freebsd.org>
Date: Tue, 5 Feb 2013 05:49:00 -0500
Content-Transfer-Encoding: quoted-printable
Message-Id: <45AD1046-A630-4C96-B4D2-B8A7D6A6DC45@lakerest.net>
References: <D3AA078A-CD19-4228-A019-BE9C985895E2@lakerest.net>
 <201302041524.58699.jhb@freebsd.org>
To: John Baldwin <jhb@freebsd.org>
X-Mailer: Apple Mail (2.1283)
Cc: freebsd-net@freebsd.org, Robert Watson <rwatson@freebsd.org>,
 Kip Macy <kmacy@freebsd.org>, jv@freebsd.org
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 05 Feb 2013 10:49:02 -0000


On Feb 4, 2013, at 3:24 PM, John Baldwin wrote:

> On Monday, February 04, 2013 12:22:49 pm Randy Stewart wrote:
>> All:
>>=20
>> I have been working with TCP in gigabit networks (igb driver =
actually) and have
>> found a very nasty problem with the way the driver is doing its put =
back when
>> it fills the out-bound transmit queue.
>>=20
>> Basically it has taken a packet from the head of the ring buffer, and =
then=20
>> realizes it can't fit it into the transmit queue. So it just =
re-enqueue's it
>> into the ring buffer. Whats wrong with that? Well most of the time =
there
>> are anywhere from 10-50 packets (maybe more) in that ring buffer when =
you are
>> operating at full speed (or trying to). This means you will see 10 =
duplicate
>> ACKs, do a fast retransmit and cut your cwnd in half.. not very nice =
actually.
>>=20
>> The patch I have attached makes it so that
>>=20
>> 1) There are ways to swap back.
>> 2) Use the peek in the ring buffer and only
>>   dequeue the packet if we put it into the transmit ring
>> 3) If something goes wrong and the transmit frees the packet we =
dequeue it.
>> 4) If the transmit changed it (defrag etc) then swap out the new mbuf =
that
>>   has taken its place.
>>=20
>> I have fixed the four intel drivers that had this systemic issue, but =
there
>> are still more to fix.
>>=20
>> Comments/review .. rotten egg's etc.. would be most welcome before
>> I commit this..
>=20
> Does this only happen in drivers that use buffering?

Yep, there are a lot of drivers that *do not* use the drbr_xxxx() =
functions and
for those they do the IFQ_DRV_PREPEND().. its only the newer drivers =
like the
intel 1Gig and 10Gig ones that seem effected

Also effected are :

bxe
cxgb
oce
en

I have not fixed those yet.

>  I seem to recall that
> drivers using IFQ would just stuff the packet at the head of the IFQ =
via
> IFQ_DRV_PREPEND() in this case so it is still the next packet to =
transmit.
> See, for example, this bit in dc_start_locked():
>=20
> 	for (queued =3D 0; !IFQ_DRV_IS_EMPTY(&ifp->if_snd); ) {
> 		/*
> 		 * If there's no way we can send any packets, return =
now.
> 		 */
> 		if (sc->dc_cdata.dc_tx_cnt > DC_TX_LIST_CNT - =
DC_TX_LIST_RSVD) {
> 			ifp->if_drv_flags |=3D IFF_DRV_OACTIVE;
> 			break;
> 		}
> 		IFQ_DRV_DEQUEUE(&ifp->if_snd, m_head);
> 		if (m_head =3D=3D NULL)
> 			break;
>=20
> 		if (dc_encap(sc, &m_head)) {
> 			if (m_head =3D=3D NULL)
> 				break;
> 			IFQ_DRV_PREPEND(&ifp->if_snd, m_head);
> 			ifp->if_drv_flags |=3D IFF_DRV_OACTIVE;
> 			break;
> 		}
>=20
> It sounds like drbr/buf_ring just don't handle this case correctly?  =
It
> seems a shame to have to duplicate so much code in the various drivers =
to
> fix this, but that seems to be par for the course when using buf_ring. =
:(
> (buggy in edge cases and lots of duplicated code that is).

> Also, doing the drbr_swap() just so that drbr_dequeue() returns what =
you
> just swapped in seems... odd.  It seems that it would be nicer instead
> to have some sort of drbr_peek() / drbr_advance() where the latter =
just
> skips over whatever the current head is?  Then you could have =
something
> like:
>=20
> 	while ((next =3D drbr_peek()) !=3D NULL) {
> 		if (!foo_encap(&next)) {
> 			if (next =3D=3D NULL)
> 				drbr_advance();
> 			break;
> 		}
> 		drbr_advance();
> 	}
>=20

That was what I originally did (without the rename), but there is a sure =
crash waiting in that.
The only difference from what I originally had was just drbr_dequeue().. =
but
I was being a bit lazy and not wanting to add yet another function to =
the=20
drbr_xxxx code since essential it would be a clone of drbr_dequeue() =
without
returning the mbuf.

The crash potential here is in that foo_encap(&next) often times will =
return
a different mbuf (at least in the igb driver it does). I believe its due
to either the m_pullup() which could change the lead mbuf you want
in the drbr_ring, or the m_defrag all within igb_xmit. Thus you have
to track what comes back from the !foo_encap() call and compare it to=20
see if you must swap it out.=20


> I guess the sticky widget is the case of ATLQ as you need to dequeue =
the
> packet always in the ALTQ case and put it back if the encap fails.

Yeah ALTQ is ugly and IMO we need to re-write it anyway.. maybe re-think
this whole layer ;-0

>  For
> your patch it's not clear to me how that works.  It seems that if the
> encap routine frees the mbuf you try to dereference a freed pointer =
when
> you call drbr_dequeue().

Hmm you are right.. I forgot how we keep those using the mbuf itself...

>  I really think you will instead need some sort
> of 'drbr_putback()' and have 'drbr_peek()' dequeue in the ALTQ case =
and
> use 'drbr_putback()' to put it back (PREPEND) in the ALTQ case.

We could do that but drbr_putback() would probably need both the old
and new pointers and then we could make it do the ring_swap() to put
the right mbuf in place..

Let me go explore that and come up with a better patch ;-)

R


>=20
> --=20
> John Baldwin
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
>=20

-----
Randall Stewart
randall@lakerest.net