From owner-freebsd-bugs@FreeBSD.ORG  Sat Jan 15 05:31:40 2011
Return-Path: <owner-freebsd-bugs@FreeBSD.ORG>
Delivered-To: freebsd-bugs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 0D2CA1065673;
	Sat, 15 Jan 2011 05:31:40 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail05.syd.optusnet.com.au (mail05.syd.optusnet.com.au
	[211.29.132.186])
	by mx1.freebsd.org (Postfix) with ESMTP id 6DCEA8FC0A;
	Sat, 15 Jan 2011 05:31:39 +0000 (UTC)
Received: from c122-106-165-206.carlnfd1.nsw.optusnet.com.au
	(c122-106-165-206.carlnfd1.nsw.optusnet.com.au [122.106.165.206])
	by mail05.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	p0F5VAbG024545
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Sat, 15 Jan 2011 16:31:12 +1100
Date: Sat, 15 Jan 2011 16:31:10 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Stefan `Sec` Zehl <sec@42.org>
In-Reply-To: <20110115013336.A314E2845B@ice.42.org>
Message-ID: <20110115143903.K16210@besplex.bde.org>
References: <20110115013336.A314E2845B@ice.42.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-bugs@FreeBSD.org, FreeBSD-gnats-submit@FreeBSD.org
Subject: Re: kern/154006: tcp "window probe" bug on 64bit
X-BeenThere: freebsd-bugs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Bug reports <freebsd-bugs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-bugs>,
	<mailto:freebsd-bugs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-bugs>
List-Post: <mailto:freebsd-bugs@freebsd.org>
List-Help: <mailto:freebsd-bugs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-bugs>,
	<mailto:freebsd-bugs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 15 Jan 2011 05:31:40 -0000

On Sat, 15 Jan 2011, Stefan `Sec` Zehl wrote:

>> Description:
>
> On amd64 the PERSIST timer does not get started (and consequently executed)
> for tcp connections stalled on a 0-size receive window. This means that no
> single-byte probe packet is sent, so connections might hang indefinitely.
>
> This is due to a missing (long) conversion in tcp_output.c around line 562
> where "adv" is calculated.
>
> After this patch, amd64 behaves the same way as i386 again.

>> Fix:
>
> --- src/sys/netinet/tcp_output.c	2010-09-20 17:49:17.000000000 +0200
> +++ src/sys/netinet/tcp_output.c	2011-01-14 19:30:46.000000000 +0100
> @@ -571,7 +559,7 @@
> 		 * TCP_MAXWIN << tp->rcv_scale.
> 		 */
> 		long adv = min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) -
> -			(tp->rcv_adv - tp->rcv_nxt);
> +			(long) (tp->rcv_adv - tp->rcv_nxt);
>
> 		if (adv >= (long) (2 * tp->t_maxseg))
> 			goto send;
>

Many other type errors are visible in this patch:
- min() takes 'unsigned int' args, but is passed 'signed long' args:
   - recwin has type long.  This is smaller (same size but smaller max)
     than 'unsigned int' on 32-bit arches, and larger on 64-bit arches
   - TCP_MAXWIN has type int (except on 16-bit arches, which are not
     supported and are no longer permitted by POSIX).  Then we explicitly
     make its type incompatible with min() by casting to long.  The 16-bit
     arches don't matter, except they are responsible for many of the type
     errors here.  recwin is long and TCP_MAXWIN is cast to long since plain
     int was not long enough on 16-bit arches.
   Hopefully both of min()'s parameters are non-negative and <= UINT_MAX.
   Then nothing bad happens when min() converts them to u_int.  The result
   of min() has type u_int.
- rcv_adv has type tcp_seq.  Seems correct
- tcp_seq has type u_int32_t.  Seems correct, except for its old spelling.
   The spelling is not so old that it is u_long (to support the 16-bit arches),
   but it hasn't caught up with C99 yet.
- rcv_nxt has type u_int32_t.  Seems logically incorrect -- should be tcp_seq.
- (tp->rcv_adv - tp->rcv_nxt) has type [ the default promotion of { tcp_seq,
   u_int32_t } ].  This is u_int on all supported arches.  Apparently, the
   value of this should always be positive, since the cast doesn't change
   this on 64-bit arches.  However, the cast might break this on 32-bit
   arches (it breaks the value whenever it exceeds 0x80000000, if that can
   happen, since longs are smaller than u_int's on 32-bit arches).
- the type of the expression for the rvalue is [ the default promotion of
   { u_int, u_int } ] in the old version, and the same with the last u_int
   replaced by long in the patched version.  It is most natural to subtract
   u_int's here, like the old version did -- everything in sight is (except
   for all the type errors) a sequence number or a difference of sequence
   numbers; the differences are always taken mod 2**32 and are non-negative,
   but we must be careful if the difference should really be negative.  The
   SEQ_LT() family of macros can be used to determine if differences should
   be negative (this family is further towards losing 16-bitness -- it casts
   to int instead of to long).  Unfortunately there is no SEQ_DIFF() macro
   to simplify easy cases of taking differences.  I think there are scattered
   casts for this as here.
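
The first point above (min() converting its arguments to u_int) can be
sketched in isolation.  This is only an illustration under stated
assumptions: min_u() is a local stand-in for the kernel's min(), int64_t
plays the role of long on a 64-bit arch, and clamp_like_min() is a
hypothetical name, not anything in tcp_output.c.

```c
#include <stdint.h>

/* Local stand-in for the kernel's min(): takes and returns u_int. */
static inline uint32_t
min_u(uint32_t a, uint32_t b)
{
	return (a < b ? a : b);
}

/*
 * Hypothetical helper: pass 64-bit signed values (the role of 'long'
 * on amd64) through min_u().  The casts spell out the conversion that
 * a call to the real min() performs implicitly.
 */
int64_t
clamp_like_min(int64_t recwin, int64_t maxwin)
{
	/* Values outside [0, UINT32_MAX] are reduced mod 2**32 here. */
	return (min_u((uint32_t)recwin, (uint32_t)maxwin));
}
```

With in-range arguments nothing bad happens, e.g.
clamp_like_min(65535, 1073725440) gives 65535.  A negative or very large
argument is silently reduced mod 2**32 before the comparison.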

So casting to long is not good.  It gives another type error to analyse,
and works accidentally.

Further analysis: without the patch:

 		long adv = x - y;

where x has type u_int and y has type u_int.  The difference always has
type u_int; if x is sequentially less than y, then the difference should
be negative, but its type forces it to be positive.  We should use
SEQ_FOO() if this is possible, or we can use delicate conversions if we
do only 2 pages of analysis per line to justify the delicacies (not too
bad if there is a macro for this).

- On 32-bit arches, long is smaller than u_int, so the assignment overflows
   if the difference should have been negative.  The result is
   implementation-defined, but on normal 2's complement arches, it is benign
   and fixes up the sign error.

- On 64-bit arches, long is larger than u_int, so the difference remains
   nonnegative when it should have been negative, and is normally huge
   (something like 0U - 1U = 0xFFFFFFFF).  The huge value is near UINT_MAX.
   LONG_MAX is much larger, so the assignment doesn't overflow and the
   value remains near UINT_MAX.
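
The two unpatched cases can be modelled portably by using int64_t for a
64-bit long and int32_t for a 32-bit long.  This is a sketch only; the
function names are made up for illustration, and the 32-bit result relies
on the usual 2's complement wraparound (implementation-defined in C, but
what gcc and clang do).

```c
#include <stdint.h>

/* 'long adv = x - y' as it behaves when long is 64 bits (amd64). */
int64_t
adv_unpatched_lp64(uint32_t x, uint32_t y)
{
	/* The u_int difference is widened without sign extension. */
	return (x - y);
}

/* 'long adv = x - y' as it behaves when long is 32 bits (i386). */
int32_t
adv_unpatched_ilp32(uint32_t x, uint32_t y)
{
	/* Out-of-range conversion; wraps to negative on 2's complement. */
	return ((int32_t)(x - y));
}
```

For x = 0 and y = 1, the 64-bit version gives 4294967295 and the 32-bit
version gives -1, which is the accidental fix-up described above.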

With the patch:

 		long adv = x - (long)y;

where x has type u_int and (long)y has type long:

- On 32-bit arches, long is smaller than u_int, so (long)y may overflow;
   the overflow gives an implementation-defined result which happens to be
   benign.  Then
   the binary promotions apply.  Although I have been describing long as
   being smaller than u_int on 32-bit arches, in the C type system it is
   logically larger, so the binary promotions promote x to long too, and
   leave (long)y unchanged.  "Promotion" of x is really demotion, so it
   may overflow benignly just like for y.  I think the difference doesn't
   overflow, and even if it does then the result is the same as before,
   since everything will be done in 32-bit registers using the same code
   as before.

- On 64-bit arches: long is larger than u_int, so (long)y doesn't change
   the value of y.  The binary promotions then promote x to long without
   changing its value, and don't change (long)y's type or value.  Both
   terms remain nonnegative.  (long)y can still be garbage -- something
   like 0xFFFFFFFF when it should be -1.  I think this causes problems,
   but much smaller than before.  Oops, the above may be wrong about y possibly
   wanting to be negative.  Things are not quite as complicated if this
   sequence cannot occur:
   - if this can occur, then (x - (long)y) is a large negative number when
     it should be a small positive number (not much larger than x).  This
     doesn't seem to be what causes the main problem.
   - the main problem is just when x < y.  Then (x - y) gives a huge
     unsigned int value (which bogusly assigning to a long doesn't fix
     up for the 64-bit case).  But (x - (long)y) gives a negative value
     when x < y, without additional type errors or overflows on either
     32-bit or 64-bit arches provided x and y are not very large.
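
The patched expression on a 64-bit arch can be sketched the same way,
again with int64_t standing in for long and a hypothetical function name.
It shows both sub-cases: x < y now gives a negative result, while a y
that "wanted" to be -1 gives a large negative result instead of a small
positive one.

```c
#include <stdint.h>

/* 'long adv = x - (long)y' as it behaves when long is 64 bits. */
int64_t
adv_patched_lp64(uint32_t x, uint32_t y)
{
	/*
	 * (int64_t)y keeps y's value; x is then converted to int64_t
	 * too, so the subtraction is signed and cannot wrap here.
	 */
	return (x - (int64_t)y);
}
```

For example, adv_patched_lp64(0, 1) is -1, but
adv_patched_lp64(100, 0xFFFFFFFF) is -4294967195 where 101 might have
been intended.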

Better fixes:

(A) explicitly convert to int instead of implicitly converting to long:

 		long adv = (int)
 		    min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) -
 		    (tp->rcv_adv - tp->rcv_nxt);

or more complete fixes for type errors (beware of things needing to remain
bogusly long):

 		/* Also change recwin to int32_t. */
 		int adv = imin(recwin, TCP_MAXWIN << tp->rcv_scale) -
 		    (int)(tp->rcv_adv - tp->rcv_nxt);
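
As a standalone sketch of the second variant: the function below takes
the fields as plain parameters instead of reading them from the tcpcb,
defines imin() locally as a stand-in for the kernel's, and uses the name
adv_as_int() purely for illustration.  It assumes rcv_scale is at most 14
so that the shift fits in an int, and that int is 32 bits.

```c
#include <stdint.h>

#define	TCP_MAXWIN	65535	/* largest unscaled window */

/* Local stand-in for the kernel's imin(). */
static inline int
imin(int a, int b)
{
	return (a < b ? a : b);
}

/* Hypothetical standalone version of the all-int adv calculation. */
int
adv_as_int(int32_t recwin, uint32_t rcv_adv, uint32_t rcv_nxt,
    int rcv_scale)
{
	/*
	 * The sequence-number difference is taken mod 2**32 and then
	 * reinterpreted as signed, as the SEQ_LT() family does.
	 */
	return (imin(recwin, TCP_MAXWIN << rcv_scale) -
	    (int)(rcv_adv - rcv_nxt));
}
```

With recwin = 0 and 2000 bytes already advertised, this gives -2000 on
both 32-bit and 64-bit arches, so the comparison against 2 * t_maxseg
fails as it should.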

This doesn't fix some style bugs:
- nested declaration.
- initialization in declaration

tcp code already uses scattered conversions like this a bit too much.  E.g.,
in tcp_input.c, there is one imax() very like the above imin().  This seems
to be the only one involving the window, however; it initializes `win'
which already has type int, but some other window variables have type
u_int...

Later code in tcp_output uses bogus casts to long and larger code instead:

% 	if (recwin < (long)(tp->rcv_adv - tp->rcv_nxt))
% 		recwin = (long)(tp->rcv_adv - tp->rcv_nxt);
% 	if (recwin > (long)TCP_MAXWIN << tp->rcv_scale)
% 		recwin = (long)TCP_MAXWIN << tp->rcv_scale;
% 	...
% 	if (recwin > 0 && SEQ_GT(tp->rcv_nxt + recwin, tp->rcv_adv))
% 		tp->rcv_adv = tp->rcv_nxt + recwin;

Note that the first statement avoids using the technically incorrect
SEQ_FOO() although its internals are better (cast to int instead of
long).  It uses casts essentially like yours.  Then further analysis
is simpler because everything is converted to long.  The second statement
is similar to the first half of the broken expression.  Large code using
if's and else's and tests (x >= y) before subtracting y from x is much
easier to get right than 1 complicated 1-statement expression like the
broken one.  It takes these (x >= y) tests to make code with mixed types
obviously correct.  But I prefer small fast code with ints for everything,
since type analysis is too hard.
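
The "test (x >= y) before subtracting" style might look like the
following.  This is a sketch, not the tcp_output.c code: adv_stepwise()
and its parameters are hypothetical, and returning 0 when the difference
would have been negative is an assumption about what the caller wants.

```c
#include <stdint.h>

/*
 * Hypothetical step-by-step version: clamp the window, then subtract
 * only when the result cannot go negative.  Everything stays unsigned.
 */
uint32_t
adv_stepwise(uint32_t recwin, uint32_t maxwin, uint32_t advertised)
{
	uint32_t win;

	win = recwin;
	if (win > maxwin)
		win = maxwin;
	if (win >= advertised)
		return (win - advertised);
	/* The difference would be negative; report no gain. */
	return (0);
}
```

Each step involves a single type and an explicit test, so there are no
conversions to analyse.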

(B) Use SEQ_FOO().  This can be used for the difference of the sequence
numbers, but using it on the final difference is not quite right since
neither x nor y is a sequence number.  In practice SEQ_LT(x, y) will work.
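
For reference, a sketch of how SEQ_LT() behaves, with the macro written
out as described above (cast the u_int difference to int and compare with
0).  The wrapper functions are hypothetical and exist only so that the
behaviour can be checked outside the kernel.

```c
#include <stdint.h>

typedef uint32_t tcp_seq;

/* Sequence-space comparison: difference taken mod 2**32, then signed. */
#define	SEQ_LT(a, b)	((int)((a) - (b)) < 0)

/* Hypothetical wrapper: is sequence number a before b? */
int
seq_before(tcp_seq a, tcp_seq b)
{
	return (SEQ_LT(a, b));
}

/* Hypothetical wrapper: would the final difference x - y be negative? */
int
adv_is_negative(uint32_t x, uint32_t y)
{
	return (SEQ_LT(x, y));
}
```

seq_before(0xFFFFFFF0, 0x10) is true even though the first value is
numerically larger, since the comparison is done in sequence space.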

(C) Put (A) or (B) in a macro.  It can depend on benign overflow, or test
values if necessary.  All this macro is about is subtracting 2 sequence
values, or possibly differences of and bounds of sequence values, with
a result that is negative iff that is needed, and a type that is signed
iff a negative value makes sense or can be handled by the caller (int
should do for the signed cases, else the type should remain tcp_seq or
its promotion).  Using ints for tcp_seq is technically invalid since
they overflow at value INT_MAX.
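
A minimal sketch of such a macro, assuming the benign-overflow approach.
SEQ_DIFF() does not exist in the tree; the name, and the seq_diff()
wrapper, are hypothetical.  The conversion to int is
implementation-defined for differences >= 2**31 but wraps as expected on
2's complement arches.

```c
#include <stdint.h>

typedef uint32_t tcp_seq;

/*
 * Hypothetical: signed difference of 2 sequence values.  The
 * subtraction is done mod 2**32 in tcp_seq and then converted to int,
 * so the result is negative iff a is sequentially before b.
 */
#define	SEQ_DIFF(a, b)	((int)((tcp_seq)(a) - (tcp_seq)(b)))

/* Function form of the same thing, for callers that prefer one. */
int
seq_diff(tcp_seq a, tcp_seq b)
{
	return (SEQ_DIFF(a, b));
}
```

For example, seq_diff(0, 1) is -1 and seq_diff(0x10, 0xFFFFFFF0) is 32
on both 32-bit and 64-bit arches.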

Bruce