From owner-freebsd-bugs@FreeBSD.ORG Sat Jan 15 05:31:40 2011
Date: Sat, 15 Jan 2011 16:31:10 +1100 (EST)
From: Bruce Evans
To: Stefan `Sec` Zehl
Cc: freebsd-bugs@FreeBSD.org, FreeBSD-gnats-submit@FreeBSD.org
Subject: Re: kern/154006: tcp "window probe" bug on 64bit
Message-ID: <20110115143903.K16210@besplex.bde.org>
In-Reply-To: <20110115013336.A314E2845B@ice.42.org>
References: <20110115013336.A314E2845B@ice.42.org>

On Sat, 15 Jan 2011, Stefan `Sec` Zehl wrote:

>> Description:
>
> On amd64 the PERSIST timer does not get started (and consequently executed)
> for tcp connections stalled on a 0-size receive window. This means that no
> single-byte probe packet is sent, so connections might hang indefinitely.
>
> This is due to a missing (long) conversion in tcp_output.c around line 562
> where "adv" is calculated.
>
> After this patch, amd64 behaves the same way as i386 again.
>> Fix:
>
> --- src/sys/netinet/tcp_output.c 2010-09-20 17:49:17.000000000 +0200
> +++ src/sys/netinet/tcp_output.c 2011-01-14 19:30:46.000000000 +0100
> @@ -571,7 +559,7 @@
>                   * TCP_MAXWIN << tp->rcv_scale.
>                   */
>                  long adv = min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) -
> -                        (tp->rcv_adv - tp->rcv_nxt);
> +                        (long) (tp->rcv_adv - tp->rcv_nxt);
>
>                  if (adv >= (long) (2 * tp->t_maxseg))
>                          goto send;
>

Many other type errors are visible in this patch:
- min() takes 'unsigned int' args, but is passed 'signed long' args:
  - recwin has type long.  This is smaller (same size but smaller max) than
    'unsigned int' on 32-bit arches, and larger on 64-bit arches.
  - TCP_MAXWIN has type int (except on 16-bit arches, which are not
    supported and are no longer permitted by POSIX).  Then we explicitly
    make its type incompatible with min() by casting to long.
  The 16-bit arches don't matter, except they are responsible for many of
  the type errors here.  recwin is long and TCP_MAXWIN is cast to long
  since plain int was not long enough on 16-bit arches.
  Hopefully both of min()'s parameters are non-negative and <= UINT_MAX.
  Then nothing bad happens when min() converts them to u_int.
  The result of min() has type u_int.
- rcv_adv has type tcp_seq.  Seems correct.
- tcp_seq has type u_int32_t.  Seems correct, except for its old spelling.
  The spelling is not so old that it is u_long (to support the 16-bit
  arches), but it hasn't caught up with C99 yet.
- rcv_nxt has type u_int32_t.  Seems logically incorrect -- should be
  tcp_seq.
- (tp->rcv_adv - tp->rcv_nxt) has type [ the default promotion of
  { tcp_seq, u_int32_t } ].  This is u_int on all supported arches.
  Apparently, the value of this should always be positive, since the cast
  doesn't change this on 64-bit arches.  However, the cast might break this
  on 32-bit arches (it breaks the value whenever it exceeds 0x80000000, if
  that can happen, since longs are smaller than u_int's on 32-bit arches).
- the type of the expression for the rvalue is [ the default promotion of
  { u_int, u_int } ] in the old version, and the same with the last u_int
  replaced by long in the patched version.

It is most natural to subtract u_int's here, like the old version did --
everything in sight is (except for all the type errors) a sequence number
or a difference of sequence numbers; the differences are always taken mod
2**32 and are non-negative, but we must be careful if the difference should
really be negative.  The SEQ_LT() family of macros can be used to determine
if differences should be negative (this family is further towards losing
16-bitness -- it casts to int instead of to long).  Unfortunately there is
no SEQ_DIFF() macro to simplify easy cases of taking differences.  I think
there are scattered casts for this as here.

So casting to long is not good.  It gives another type error to analyse,
and works accidentally.

Further analysis: without the patch:

        long adv = x - y;

where x has type u_int and y has type u_int.  The difference always has
type u_int; if x is sequentially less than y, then the difference should be
negative, but its type forces it to be positive.  We should use SEQ_FOO()
if this is possible, or we can use delicate conversions if we do only 2
pages of analysis per line to justify the delicacies (not too bad if there
is a macro for this).
- On 32-bit arches, long is smaller than u_int, so the assignment overflows
  if the difference should have been negative.  The behaviour is undefined,
  but on normal 2's complement arches, it is benign and fixes up the sign
  error.
- On 64-bit arches, long is larger than u_int, so the difference remains
  nonnegative when it should have been negative, and is normally huge
  (something like 0U - 1U = 0xFFFFFFFF).  The huge value is near UINT_MAX.
  LONG_MAX is much larger, so the assignment doesn't overflow and the value
  remains near UINT_MAX.
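
The two unpatched cases can be reproduced outside the kernel.  What follows
is only a minimal sketch, not code from tcp_output.c: the names adv_lp64(),
adv_ilp32() and check_unpatched() are made up for illustration, and
int64_t/int32_t stand in for long on 64-bit and 32-bit arches so that both
cases compile on one machine.  It assumes the usual 2's complement wrapping
when an out-of-range value is converted to a signed type, which strict C
does not guarantee.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustration only: "long adv = x - y" with a 64-bit long.
 * The 32-bit unsigned difference is widened, so it is never negative.
 */
int64_t
adv_lp64(uint32_t x, uint32_t y)
{
    uint32_t diff = x - y;      /* always taken mod 2**32 */

    return ((int64_t)diff);
}

/*
 * Illustration only: the same assignment with a 32-bit long.
 * The conversion is out of range when diff >= 2**31; this sketch
 * assumes it wraps, as on normal 2's complement compilers.
 */
int32_t
adv_ilp32(uint32_t x, uint32_t y)
{
    uint32_t diff = x - y;

    return ((int32_t)diff);
}

/* Hypothetical self-check of the values discussed in the text. */
void
check_unpatched(void)
{
    /* x sequentially less than y: 0U - 1U */
    assert(adv_lp64(0, 1) == INT64_C(0xFFFFFFFF));  /* huge, positive */
    assert(adv_ilp32(0, 1) == -1);                  /* sign fixed up */

    /* x >= y: both widths agree */
    assert(adv_lp64(5, 3) == 2);
    assert(adv_ilp32(5, 3) == 2);
}
```

With a 64-bit long the first case yields a value near UINT_MAX, so any
later comparison such as (adv >= 2 * t_maxseg) is trivially true.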
With the patch:

        long adv = x - (long)y;

where x has type u_int and (long)y has type long:
- On 32-bit arches, long is smaller than u_int, so (long)y may overflow;
  overflow gives undefined behaviour which happens to be benign.  Then the
  binary promotions apply.  Although I have been describing long as being
  smaller than u_int on 32-bit arches, in the C type system it is logically
  larger, so the binary promotions promote x to long too, and leave (long)y
  unchanged.  "Promotion" of x is really demotion, so it may overflow
  benignly just like for y.  I think the difference doesn't overflow, and
  even if it does then the result is the same as before, since everything
  will be done in 32-bit registers using the same code as before.
- On 64-bit arches: long is larger than u_int, so (long)y doesn't change
  the value of y.  The binary promotions then promote x to long without
  changing its value, and don't change (long)y's type or value.  Both terms
  remain nonnegative.  (long)y can still be garbage -- something like
  0xFFFFFFFF when it should be -1.  I think this causes problems, but much
  smaller than before.

Oops, the above may be wrong about y possibly wanting to be negative.
Things are not quite as complicated if this sequence cannot occur:
- if this can occur, then (x - (long)y) is a large negative number when it
  should be a small positive number (not much larger than x).  This doesn't
  seem to be what causes the main problem.
- the main problem is just when x < y.  Then (x - y) gives a huge unsigned
  int value (which bogusly assigning to a long doesn't fix up for the
  64-bit case).  But (x - (long)y) gives a negative value when x < y,
  without additional type errors or overflows on either 32-bit or 64-bit
  arches provided x and y are not very large.
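
The 64-bit behaviour with and without the cast can be checked with a small
sketch.  This is an illustration under assumptions, not the kernel code:
adv_old_lp64(), adv_patched_lp64() and check_patched() are hypothetical
names, and int64_t stands in for a 64-bit long.

```c
#include <assert.h>
#include <stdint.h>

/* Illustration only: old code, u_int difference widened to a 64-bit long. */
int64_t
adv_old_lp64(uint32_t x, uint32_t y)
{
    uint32_t diff = x - y;      /* mod 2**32, never negative */

    return ((int64_t)diff);
}

/*
 * Illustration only: patched code, "x - (long)y" with a 64-bit long.
 * Both operands are widened without changing their values before
 * the subtraction, so the result can be negative.
 */
int64_t
adv_patched_lp64(uint32_t x, uint32_t y)
{
    return ((int64_t)x - (int64_t)y);
}

/* Hypothetical self-check of the cases discussed in the text. */
void
check_patched(void)
{
    /* Main problem, x < y: old code is huge, patched code is negative. */
    assert(adv_old_lp64(1000, 3000) == INT64_C(4294965296));
    assert(adv_patched_lp64(1000, 3000) == -2000);

    /* x >= y: both agree. */
    assert(adv_old_lp64(3000, 1000) == 2000);
    assert(adv_patched_lp64(3000, 1000) == 2000);

    /*
     * y "wants" to be -1 but is 0xFFFFFFFF: the patched result is a
     * large negative number instead of x + 1.
     */
    assert(adv_patched_lp64(1000, UINT32_C(0xFFFFFFFF)) ==
        -INT64_C(4294966295));
}
```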
Better fixes:

(A) explicitly convert to int instead of implicitly converting to long:

        long adv = (int)(min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) -
                (tp->rcv_adv - tp->rcv_nxt));

or more complete fixes for type errors (beware of things needing to remain
bogusly long):

        /* Also change recwin to int32_t. */
        int adv = imin(recwin, TCP_MAXWIN << tp->rcv_scale) -
                (int)(tp->rcv_adv - tp->rcv_nxt);

This doesn't fix some style bugs:
- nested declaration.
- initialization in declaration.

tcp code already uses scattered conversions like this a bit too much.
E.g., in tcp_input.c, there is one imax() very like the above imin().  This
seems to be the only one involving the window, however; it initializes
`win' which already has type int, but some other window variables have type
u_int...

Later code in tcp_output uses bogus casts to long and larger code instead:

%       if (recwin < (long)(tp->rcv_adv - tp->rcv_nxt))
%               recwin = (long)(tp->rcv_adv - tp->rcv_nxt);
%       if (recwin > (long)TCP_MAXWIN << tp->rcv_scale)
%               recwin = (long)TCP_MAXWIN << tp->rcv_scale;
%       ...
%       if (recwin > 0 && SEQ_GT(tp->rcv_nxt + recwin, tp->rcv_adv))
%               tp->rcv_adv = tp->rcv_nxt + recwin;

Note that the first statement avoids using the technically incorrect
SEQ_FOO() although its internals are better (cast to int instead of long).
It uses casts essentially like yours.  Then further analysis is simpler
because everything is converted to long.  The second statement is similar
to the first half of the broken expression.

Large code using if's and else's and tests (x >= y) before subtracting y
from x is much easier to get right than 1 complicated 1-statement
expression like the broken one.  It takes these (x >= y) tests to make code
with mixed types obviously correct.  But I prefer small fast code with ints
for everything, since type analysis is too hard.

(B) Use SEQ_FOO().
This can be used for the difference of the sequence numbers, but using it
on the final difference is not quite right since neither x nor y is a
sequence number.  In practice SEQ_LT(x, y) will work.

(C) Put (A) or (B) in a macro.  It can depend on benign overflow, or test
values if necessary.  All this macro is about is subtracting 2 sequence
values, or possibly differences of and bounds of sequence values, with a
result that is negative iff that is needed, and a type that is signed iff a
negative value makes sense or can be handled by the caller (int should do
for the signed cases, else the type should remain tcp_seq or its
promotion).

Using ints for tcp_seq is technically invalid since they overflow at value
INT_MAX.

Bruce
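
For reference, one possible shape of the macro suggested in (C) is sketched
below.  Everything here is an assumption made for illustration: SEQ_DIFF()
does not exist in the tree (as noted above), window_adv() and
check_seq_diff() are hypothetical names, and SEQ_LT() is redefined locally
in the cast-to-int style described in the message rather than taken from a
kernel header.  The sketch depends on benign 2's complement wrapping and on
the operands being small enough that the int subtraction does not overflow.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t tcp_seq;

/* Local stand-in for the kernel macro, in the cast-to-int style. */
#define SEQ_LT(a, b)    ((int32_t)((tcp_seq)(a) - (tcp_seq)(b)) < 0)

/*
 * Hypothetical helper: signed difference of two sequence numbers,
 * taken mod 2**32 and then reinterpreted as a 32-bit signed value.
 */
#define SEQ_DIFF(a, b)  ((int32_t)((tcp_seq)(a) - (tcp_seq)(b)))

/*
 * Hypothetical version of the window computation using ints for
 * everything, along the lines of the second form of fix (A).
 */
int32_t
window_adv(int32_t recwin, int32_t maxwin, tcp_seq rcv_adv, tcp_seq rcv_nxt)
{
    int32_t win = recwin < maxwin ? recwin : maxwin;    /* imin() */

    return (win - SEQ_DIFF(rcv_adv, rcv_nxt));
}

/* Hypothetical self-check. */
void
check_seq_diff(void)
{
    assert(SEQ_DIFF(5, 3) == 2);
    assert(SEQ_DIFF(3, 5) == -2);
    assert(SEQ_LT(3, 5) && !SEQ_LT(5, 3));

    /* Difference across the 2**32 wrap: 0x10 is 0x20 past 0xFFFFFFF0. */
    assert(SEQ_DIFF(UINT32_C(0x00000010), UINT32_C(0xFFFFFFF0)) == 0x20);

    /* Same result on 32-bit and 64-bit arches, including negative ones. */
    assert(window_adv(8000, 65535, 1100, 100) == 7000);
    assert(window_adv(1000, 65535, 3100, 100) == -2000);
}
```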