From owner-freebsd-hackers Fri May 30 00:01:32 1997
Message-Id: <199705300701.QAA12323@hotaka.csl.sony.co.jp>
To: Chris Csanady
cc: hackers@FreeBSD.ORG
Subject: Re: TCP/IP bug? Unnecessary fragmentation...
In-reply-to: Your message of "Thu, 29 May 1997 15:44:30 EST."
             <199705292044.PAA04544@friley01.res.iastate.edu>
Date: Fri, 30 May 1997 16:01:00 +0900
From: Kenjiro Cho
Sender: owner-hackers@FreeBSD.ORG

Chris Csanady said:
>> Another oddity is that if you close the connection immediately, for the
>> same data sizes that it does this odd fragmentation (100 < S < 208),
>> the FIN is piggybacked on the 2nd data segment.  This doesn't happen
>> with any other data size.

>> I have some graphs, and I believe that it does it at many other sizes too.
>> The performance hit is _much_ more than just mbuf/cluster handling should
>> impose, it would seem.  It seems to do it up around 2000, 2100(?), 4000...

>> Is this a generic 4.4BSD problem, or is it FreeBSD specific?

It is not a bug; it is caused by a mismatch between the mbuf allocation
strategy and TCP, and it is common to *BSD systems.

The mbuf allocation algorithm is: if the data fits within 2 small mbufs
(100 bytes for the first one because of the packet header, 108 bytes for
each one after that), allocate small mbufs; if the data is larger than
that, allocate a 2KB mbuf cluster.

So, 2 small mbufs will be allocated when you send:

	100 < len <= 208
or
	n * 2K + 108 < len <= n * 2K + 216

tcp_usr_send is called for each mbuf.  TCP's Nagle algorithm prevents
the sender from sending a small packet while outstanding packets have
not yet been acknowledged, so the 2nd packet won't be sent until the ack
of the 1st packet comes back.  To make matters worse, that ack will be
delayed up to 200ms on the receiver side by the delayed-ack mechanism.

Considering how widely TCP is used, I think the socket layer should try
to hand the data to tcp_usr_send all at once when possible.
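To illustrate, here is a rough user-space sketch of the carving rule
described above.  The constants and the carve() helper are mine, taken
from the numbers in this message (100/108-byte small mbufs, 2KB
clusters), not lifted from the real sosend() code, so take it only as
an approximation of the decision the socket layer makes:

#include <stdio.h>

#define MHLEN    100        /* data bytes in the first (pkthdr) mbuf */
#define MLEN     108        /* data bytes in each subsequent mbuf */
#define MCLBYTES 2048       /* data bytes in an mbuf cluster */

/* Show how a send of `len' bytes gets carved up, chunk by chunk. */
static void
carve(size_t len)
{
	size_t resid = len;
	int first = 1, mbufs = 0, clusters = 0;

	while (resid > 0) {
		size_t small = first ? MHLEN : MLEN;
		size_t n;

		if (resid > small + MLEN) {
			/* Too big for two small mbufs: use a cluster. */
			n = resid < MCLBYTES ? resid : MCLBYTES;
			clusters++;
		} else {
			/* Fits in at most two small mbufs. */
			n = resid < small ? resid : small;
			mbufs++;
		}
		resid -= n;
		first = 0;
	}
	printf("len %4lu -> %d cluster(s) + %d small mbuf(s)\n",
	    (unsigned long)len, clusters, mbufs);
}

int
main(void)
{
	/* Sizes straddling the boundaries discussed above. */
	size_t sizes[] = { 100, 101, 208, 209, 2156, 2157, 2264, 2265 };
	size_t i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		carve(sizes[i]);
	return (0);
}

Every size that lands in a 2-small-mbuf range hands TCP two separate
sends, and the second one is exactly what Nagle plus the receiver's
200ms delayed ack hold up.

--kj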