From owner-freebsd-hackers  Wed Aug 19 07:23:28 1998
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Received: (from majordom@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id HAA18999
          for freebsd-hackers-outgoing; Wed, 19 Aug 1998 07:23:28 -0700 (PDT)
          (envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from bsd.synx.com (rt.synx.com [194.167.81.239])
          by hub.freebsd.org (8.8.8/8.8.8) with SMTP id HAA18986
          for <hackers@FreeBSD.ORG>; Wed, 19 Aug 1998 07:23:18 -0700 (PDT)
          (envelope-from root@synx.com)
Received: from synx.com (rn [192.1.1.241]) by bsd.synx.com (8.6.12/8.6.12) with ESMTP id PAA28280; Wed, 19 Aug 1998 15:22:19 +0100
Message-Id: <199808191422.PAA28280@bsd.synx.com>
Date: Wed, 19 Aug 1998 16:22:10 +0200 (CEST)
From: Remy NONNENMACHER <remy@synx.com>
Reply-To: remy@synx.com
Subject: Re: Yard/FreeBSD Problem (fwd) 
To: didier@omnix.net
cc: dg@root.com, hackers@FreeBSD.ORG, support@yard.de
In-Reply-To: <Pine.BSF.3.96.980819091204.25143A-100000@omnix.net>
MIME-Version: 1.0
Content-Type: TEXT/plain; CHARSET=US-ASCII
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On 19 Aug, Didier Derny wrote:
> On Mon, 17 Aug 1998, Remy NONNENMACHER wrote:
>> 
>> I think I got the point. Didier sent me a tcpdump trace of the exchange
>> between the client and the server. The protocol uses a lot of small
>> packets flowing back and forth, so delayed_ack=1 would be a good thing.
>> Unfortunately, sometimes (3 times in the trace), the protocol
>> hit the 100-byte syndrome. Precisely, the application wrote
>> 163 bytes, the database replied with 119 bytes and the application wrote
>> 105 bytes. Here are fragments:
>> 
>> 13:16:24.147494 1035 > yardsql: P 401:501(100) ack 70 win 17280
>> 13:16:24.232584 yardsql > 1035: . ack 501 win 17280
>> 13:16:24.232629 1035 > yardsql: P 501:564(63) ack 70 win 17280
>> 13:16:24.234125 yardsql > 1035: P 70:170(100) ack 564 win 17280
>> 13:16:24.432584 1035 > yardsql: . ack 170 win 17280
>> 13:16:24.432624 yardsql > 1035: P 170:193(23) ack 564 win 17280
>> 13:16:24.432767 1035 > yardsql: P 564:639(75) ack 193 win 17280
>> 13:16:24.433231 yardsql > 1035: P 193:293(100) ack 639 win 17280
>> 13:16:24.632595 1035 > yardsql: . ack 293 win 17280
>> 13:16:24.632639 yardsql > 1035: P 293:312(19) ack 639 win 17280
>> 
>> The 100-byte syndrome caused bad fragmentation and delayed the whole
>> transaction by half a second (the mean response time for the other
>> exchanges is about 1 millisecond).
>> 
>> The solution here seems to be to force TCP_NODELAY and delayed_ack=1.
>> 
> 
> Hi,
> 
> In short, is it a general problem with the TCP/IP stack on all platforms?
> A problem specific to BSD and BSD-like TCP/IP stacks?
> Is it a bug?

It's a feature ;).
See http://www.kohala.com/~rstevens/vanj.88jul20.txt for a detailed
explanation of its origin. It also affects the NetBSD stack. The
idea behind it is to reduce data movement inside the kernel,
between sosend() and tcp_output(). Something like:

kern/uipc_socket.c :

sosend()
	.
	.
	if (size to send >= MINCLSIZE) {
		allocate a cluster
		copy user data into the cluster (MINCLSIZE = 208 bytes)
		/* more work must be done by tcp_output() */
	} else {
		/* less work to be done by tcp_output() */
		allocate an mbuf with header
		copy the first 100 bytes (128-20-8)
		allocate an mbuf without header
		copy up to 108 bytes (128-20)
	}
	tcp_output()
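
To make the outline above concrete, here is a small user-space sketch
(not kernel code) that computes how one write would be cut up before
reaching tcp_output(). The helper name split_write() is hypothetical,
the mbuf sizes are the ones quoted above, and the MCLBYTES value of
2048 is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Sizes quoted in this message for 128-byte mbufs; MCLBYTES is assumed. */
enum {
	MHLEN     = 100,	/* data room in an mbuf with header (128-20-8) */
	MLEN      = 108,	/* data room in an mbuf without header (128-20) */
	MINCLSIZE = 208,	/* smallest write that is given a cluster */
	MCLBYTES  = 2048	/* cluster size (assumption) */
};

/*
 * Hypothetical helper: store in chunks[] the sizes of the pieces that one
 * write of `resid` bytes is split into, following the outline above.
 * Returns the number of pieces stored (at most max_chunks).
 */
static size_t
split_write(size_t resid, size_t *chunks, size_t max_chunks)
{
	size_t n = 0;

	assert(chunks != NULL && max_chunks > 0);
	while (resid > 0 && n < max_chunks) {
		size_t room;

		if (resid >= MINCLSIZE)
			room = MCLBYTES;	/* large write: one cluster */
		else if (n == 0)
			room = MHLEN;		/* first mbuf carries the header */
		else
			room = MLEN;		/* following mbuf, no header */

		chunks[n] = resid < room ? resid : room;
		resid -= chunks[n];
		n++;
	}
	return n;
}
```

With these numbers a 163-byte write comes out as 100 + 63 bytes, which
is the pair of segments visible at the top of the quoted trace, while a
write of 208 bytes or more goes out in one piece.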

Well, by now, with all the CPU power we have, and considering the delay
introduced when delayed sending (Nagle) meets a delayed ACK, we can
seriously consider phasing out this optimization (or, at least, making it
tunable through a sysctl).

Bill Fenner (in -net) proposed a fix. Another simple way may be to
locate the line
	if (resid >= MINCLSIZE)
in sosend() in kern/uipc_socket.c, and change it to:
	if (resid >= MHLEN)
(warning: not tested)
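
As a rough illustration of what that one-line change is meant to do,
the user-space model below counts the pieces a write is split into for
a given cluster threshold. It is only a sketch: pieces_for_write() is a
hypothetical name, the mbuf sizes are the ones mentioned in this
message, and MCLBYTES = 2048 is an assumption. It says nothing about
side effects of the change inside the real kernel.

```c
#include <assert.h>
#include <stddef.h>

enum {
	MHLEN     = 100,	/* data room in an mbuf with header */
	MLEN      = 108,	/* data room in an mbuf without header */
	MINCLSIZE = 208,	/* current cluster threshold */
	MCLBYTES  = 2048	/* cluster size (assumption) */
};

/*
 * Hypothetical model: number of pieces one write of `resid` bytes is cut
 * into when writes of at least `cluster_threshold` bytes get a cluster.
 */
static size_t
pieces_for_write(size_t resid, size_t cluster_threshold)
{
	size_t n = 0;

	assert(cluster_threshold > 0);
	while (resid > 0) {
		size_t room;

		if (resid >= cluster_threshold)
			room = MCLBYTES;
		else
			room = (n == 0) ? MHLEN : MLEN;

		resid -= resid < room ? resid : room;
		n++;
	}
	return n;
}
```

In this model pieces_for_write(163, MINCLSIZE) is 2, while
pieces_for_write(163, MHLEN) is 1, so the 163-byte write would no
longer leave a small second segment waiting behind Nagle.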

All this needs a complete review by one of the great ancient TCP
gods....


> Why is it working with linux ?
>

I don't have a Linux kernel at hand to check whether it uses the same
'optimization', so I can't tell.
 
> Yard modified their application to include a TCP_NODELAY.  But
> they have discovered that after a "dup" the TCP_NODELAY flag was lost.
> Is it the normal behavior for "dup" ?
> 

Seems to be a known issue.
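
One way to check this on a given system is to set the option, dup()
the descriptor and read the option back on the duplicate. The sketch
below does just that; nodelay_survives_dup() is a hypothetical helper
name. Since dup() returns a second descriptor for the same open socket,
one would normally expect the option to be shared, so a result of 0
would point at something else in the application (for example a
different socket) rather than at dup() itself.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical check: returns 1 if TCP_NODELAY still reads back as set on
 * a dup()ed descriptor, 0 if it reads back as cleared, -1 on any error.
 */
static int
nodelay_survives_dup(void)
{
	int on = 1, val = 0, result = -1;
	int s, d = -1;
	socklen_t len = sizeof(val);

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s < 0)
		return -1;

	if (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)) == 0 &&
	    (d = dup(s)) >= 0 &&
	    getsockopt(d, IPPROTO_TCP, TCP_NODELAY, &val, &len) == 0)
		result = (val != 0);	/* some stacks return a flag bit, not 1 */

	if (d >= 0)
		close(d);
	close(s);
	return result;
}
```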

> After Yard modified their source code, it's partly working:
> sometimes the system is very fast (like with delayed_ack=0) and sometimes
> it becomes extremely slow (like with delayed_ack=1).
> 

Probably a socket without TCP_NODELAY set and a write of 101 to 207
bytes. Outside of these limits, the ping/pong exchange works very well.
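
For reference, the size of the stalls in the slow case can be read
directly from the timestamps of the trace quoted at the top of this
message. The small sketch below does the subtraction; ts_to_us() and
gap_us() are hypothetical helper names, and the parser assumes the
tcpdump "HH:MM:SS.uuuuuu" form with a six-digit fraction.

```c
#include <assert.h>
#include <stdio.h>

/* Parse "HH:MM:SS.uuuuuu" into microseconds since midnight; -1 on bad input. */
static long long
ts_to_us(const char *ts)
{
	int h, m, s;
	long us;

	if (sscanf(ts, "%d:%d:%d.%6ld", &h, &m, &s, &us) != 4)
		return -1;
	return ((h * 60LL + m) * 60 + s) * 1000000LL + us;
}

/* Microseconds elapsed between two timestamps of the same day. */
static long long
gap_us(const char *earlier, const char *later)
{
	long long a = ts_to_us(earlier);
	long long b = ts_to_us(later);

	assert(a >= 0 && b >= 0);
	return b - a;
}
```

Applied to the trace, the bare ACKs arrive 85090, 198459 and 199364
microseconds after the 100-byte segments they acknowledge. That is
consistent with a delayed-ACK timer of up to 200 ms, and the three
waits add up to roughly 0.48 second.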

> I've been able to reproduce the same phenomenon by manually toggling
> delayed_ack while the application was running.
> 
> Do you have any suggestion ?
> 

Work around this by forcing TCP_NODELAY inside the kernel until the
sosend() 100-208 byte syndrome has been reviewed. This can be done as
follows:

 (in tcp_output(), netinet/tcp_output.c)

Change:
		.
	if ((idle || tp->t_flags & TF_NODELAY) &&
		.
to:
		.
	if ((idle || 1 || tp->t_flags & TF_NODELAY) &&
		.

(horrible, no?)

RN.

