From owner-freebsd-hackers Wed Aug 19 07:23:28 1998
Message-Id: <199808191422.PAA28280@bsd.synx.com>
Date: Wed, 19 Aug 1998 16:22:10 +0200 (CEST)
From: Remy NONNENMACHER
Reply-To: remy@synx.com
Subject: Re: Yard/FreeBSD Problem (fwd)
To: didier@omnix.net
cc: dg@root.com, hackers@FreeBSD.ORG, support@yard.de

On 19 Aug, Didier Derny wrote:
> On Mon, 17 Aug 1998, Remy NONNENMACHER wrote:
>>
>> I think I got the point. Didier sent me a tcpdump trace of the exchange
>> between the client and the server. The protocol uses a lot of small
>> packets flowing back and forth, so ack_delayed=1 would be a good thing.
>> Unfortunately, sometimes (i.e., 3 times in the trace), the protocol
>> encountered the 100-byte syndrome. Precisely, the application wrote
>> 163 bytes, the database replied with 119 bytes and the application wrote
>> 105 bytes. Here are fragments:
>>
>> 13:16:24.147494 1035 > yardsql: P 401:501(100) ack 70 win 17280
>> 13:16:24.232584 yardsql > 1035: . ack 501 win 17280
>> 13:16:24.232629 1035 > yardsql: P 501:564(63) ack 70 win 17280
>> 13:16:24.234125 yardsql > 1035: P 70:170(100) ack 564 win 17280
>> 13:16:24.432584 1035 > yardsql: . ack 170 win 17280
>> 13:16:24.432624 yardsql > 1035: P 170:193(23) ack 564 win 17280
>> 13:16:24.432767 1035 > yardsql: P 564:639(75) ack 193 win 17280
>> 13:16:24.433231 yardsql > 1035: P 193:293(100) ack 639 win 17280
>> 13:16:24.632595 1035 > yardsql: . ack 293 win 17280
>> 13:16:24.632639 yardsql > 1035: P 293:312(19) ack 639 win 17280
>>
>> The 100-byte syndrome caused bad fragmentation and delayed the whole
>> transaction by half a second (the mean response time for other exchanges
>> is about 1 millisecond).
>>
>> The solution here seems to be to force TCP_NODELAY and ack_delayed=1.
>>
>
> Hi,
>
> In short, is it a general problem with the tcpip stack on all platforms ?
> a specific problem to bsd and bsd like tcpip stack ?
> Is it a bug ?

It's a feature ;). See http://www.kohala.com/~rstevens/vanj.88jul20.txt
for a detailed explanation of the origin of this. It affects the NetBSD
stack also. The idea behind it is to reduce data movement inside the
kernel, between sosend and tcp_output. Something like:

kern/uipc_socket.c: sosend()
    .
    .
    if (size to send >= MINCLSIZE) {    /* MINCLSIZE = 208 bytes */
        allocate a cluster
        copy user data in the cluster
        /* more work must be done by tcp_output() */
    } else {
        /* less work to be done by tcp_output() */
        allocate a mbuf with header
        copy 100 first bytes (128-20-8)
        allocate a mbuf without header
        copy up to 108 bytes (128-20)
    }
    tcp_output()

Well, by now, with all the power we have, and considering the delay
introduced by delayed sending (Nagle) facing a delayed ack, we can
seriously consider phasing out this optimization (or, at least, making
it sysctl'able). Bill Fenner (in -net) proposed a fix. Another simple
way may be to locate the line

    if (resid >= MINCLSIZE)

in kern/uipc_socket.c (sosend()), and to change it to:

    if (resid >= MHLEN)

(warning: not tested)

All this needs a complete review from one of the great ancient TCP
gods....

> Why is it working with linux ?
>

I don't have a Linux kernel at hand to check whether they use the same
'optimization', so I can't tell.

> Yard modified their application to include a TCP_NODELAY. But
> they have discovered that after a "dup" the TCP_NODELAY flag was lost.
> Is it the normal behavior for "dup" ?
>

Seems to be a known point.

> After the modification by Yard of their source code, it's partly working:
> sometimes the system is very fast (like with delayed_ack=0) and sometimes
> it becomes extremely slow (like with delay_ack=1).
>

Probably TCP_NODELAY=0 and a 101 to 207 byte packet. Outside of these
limits, the ping/pong exchange will work very well.

> I've been able to reproduce the same phenomenon by manually toggling
> delay_ack while the application was running.
>
> Do you have any suggestion ?
>

Fix this by forcing TCP_NODELAY inside the kernel until the sosend
100-208 byte syndrome has been reviewed. This can be done (in
netinet/tcp_output.c, tcp_output()) by changing:

    if ((idle || tp->t_flags & TF_NODELAY) &&

to:

    if ((idle || 1 || tp->t_flags & TF_NODELAY) &&

(horrible, no?)

RN.
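
The size arithmetic behind the "101 to 207 bytes" window can be checked with
a small C sketch. It only models the split described in the sosend() outline
in this mail; it is not kernel code. The constant names with an _ASSUMED
suffix and both helper functions are hypothetical, and the values (128-byte
mbuf, 20-byte mbuf header, 8-byte packet header) are taken from the figures
quoted in the mail, not read from a running kernel. Writes of MINCLSIZE bytes
or more are treated as a single chunk, ignoring the cluster size limit.

```c
#include <stddef.h>

/* Sizes as quoted in the mail; assumptions, not read from a kernel. */
enum {
    MSIZE_ASSUMED     = 128,                          /* whole mbuf       */
    MLEN_ASSUMED      = MSIZE_ASSUMED - 20,           /* 108: no pkt hdr  */
    MHLEN_ASSUMED     = MLEN_ASSUMED - 8,             /* 100: with hdr    */
    MINCLSIZE_ASSUMED = MHLEN_ASSUMED + MLEN_ASSUMED  /* 208              */
};

/*
 * Hypothetical helper: bytes that end up in the first chunk handed to
 * tcp_output() for a write of `resid` bytes, following the outline above.
 * Simplification: a cluster is assumed to hold the whole write.
 */
static size_t first_chunk(size_t resid)
{
    if (resid >= (size_t)MINCLSIZE_ASSUMED)
        return resid;                    /* cluster path: one chunk */
    if (resid > (size_t)MHLEN_ASSUMED)
        return (size_t)MHLEN_ASSUMED;    /* header mbuf filled first */
    return resid;                        /* fits in the header mbuf */
}

/*
 * Hypothetical helper: nonzero when a write is split into a 100-byte
 * segment plus a small remainder, i.e. the 101..207 byte window.
 */
static int hits_syndrome(size_t resid)
{
    return resid > (size_t)MHLEN_ASSUMED &&
           resid < (size_t)MINCLSIZE_ASSUMED;
}
```

With these assumptions, the 163-byte write from the trace gives
first_chunk(163) == 100, leaving 63 bytes for the second segment, which
matches the 401:501(100) and 501:564(63) lines. Writes of exactly 100 or
208 bytes fall outside the window.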