Date: Wed, 19 Aug 1998 16:22:10 +0200 (CEST)
From: Remy NONNENMACHER <remy@synx.com>
To: didier@omnix.net
Cc: dg@root.com, hackers@FreeBSD.ORG, support@yard.de
Subject: Re: Yard/FreeBSD Problem (fwd)
Message-ID: <199808191422.PAA28280@bsd.synx.com>
In-Reply-To: <Pine.BSF.3.96.980819091204.25143A-100000@omnix.net>
On 19 Aug, Didier Derny wrote:
> On Mon, 17 Aug 1998, Remy NONNENMACHER wrote:
>>
>> I think I got the point. Didier sent me a tcpdump trace of the exchange
>> between the client and the server. The protocol uses a lot of small
>> packets flowing back and forth, so delayed_ack=1 would be a good thing.
>> Unfortunately, sometimes (i.e., 3 times in the trace), the protocol
>> ran into the 100-byte syndrome. Precisely, the application wrote
>> 163 bytes, the database replied with 119 bytes and the application wrote
>> 105 bytes. Here are fragments:
>>
>> 13:16:24.147494 1035 > yardsql: P 401:501(100) ack 70 win 17280
>> 13:16:24.232584 yardsql > 1035: . ack 501 win 17280
>> 13:16:24.232629 1035 > yardsql: P 501:564(63) ack 70 win 17280
>> 13:16:24.234125 yardsql > 1035: P 70:170(100) ack 564 win 17280
>> 13:16:24.432584 1035 > yardsql: . ack 170 win 17280
>> 13:16:24.432624 yardsql > 1035: P 170:193(23) ack 564 win 17280
>> 13:16:24.432767 1035 > yardsql: P 564:639(75) ack 193 win 17280
>> 13:16:24.433231 yardsql > 1035: P 193:293(100) ack 639 win 17280
>> 13:16:24.632595 1035 > yardsql: . ack 293 win 17280
>> 13:16:24.632639 yardsql > 1035: P 293:312(19) ack 639 win 17280
>>
>> The 100-byte syndrome caused bad fragmentation and delayed the whole
>> transaction by half a second (the mean response time for other exchanges
>> is about 1 millisecond).
>>
>> The solution here seems to be forcing TCP_NODELAY while keeping
>> delayed_ack=1.
>
> Hi,
>
> In short, is it a general problem with the TCP/IP stack on all platforms?
> A specific problem of BSD and BSD-like TCP/IP stacks?
> Is it a bug?

It's a feature ;). See http://www.kohala.com/~rstevens/vanj.88jul20.txt
for a detailed explanation of its origin. It affects the NetBSD stack
too.

The idea behind it is to reduce data movement inside the kernel,
between sosend and tcp_output. Something like:

kern/uipc_socket.c : sosend()
    .
    .
    if (size to send >= MINCLSIZE) {      /* MINCLSIZE = 208 bytes */
        allocate a cluster
        copy user data into the cluster
        /* more work must be done by tcp_output() */
    } else {
        /* less work to be done by tcp_output() */
        allocate an mbuf with header
        copy the first 100 bytes (128-20-8)
        allocate an mbuf without header
        copy up to 108 bytes (128-20)
    }
    tcp_output()

Well, by now, with all the power we have, and considering the delay
introduced when delayed sending (Nagle) meets a delayed ACK, we can
seriously consider phasing out this optimization (or at least making it
sysctl-able).

Bill Fenner (in -net) proposed a fix. Another simple way may be to locate
the line

    if (resid >= MINCLSIZE)

in kern/uipc_socket.c (sosend()), and to change it to:

    if (resid >= MHLEN)

(warning: not tested)

All this needs a complete review from one of the great ancient TCP gods....

> Why is it working with Linux?

I don't have a Linux kernel to check whether it uses the same
'optimization', so I can't tell.

> Yard modified their application to include a TCP_NODELAY. But
> they have discovered that after a "dup" the TCP_NODELAY flag was lost.
> Is it the normal behavior for "dup"?

That seems to be a known issue.

> After the modification by Yard of their source code, it's partly working:
> sometimes the system is very fast (like with delayed_ack=0) and sometimes
> it becomes extremely slow (like with delayed_ack=1).

Probably TCP_NODELAY=0 and a 101- to 207-byte write. Outside of these
limits, the ping-pong exchange will work very well.

> I've been able to reproduce the same phenomenon by manually toggling
> delayed_ack while the application was running.
>
> Do you have any suggestions?

Fix this by forcing TCP_NODELAY inside the kernel until the sosend
100-208 byte syndrome has been reviewed. This can be done in
netinet/tcp_output.c, tcp_output(), by changing:

    .
    if ((idle || tp->t_flags & TF_NODELAY) &&
    .

to:

    .
    if ((idle || 1 || tp->t_flags & TF_NODELAY) &&
    .

(horrible, no?)

RN.