From: Andre Oppermann
Date: Fri, 01 Feb 2013 23:39:32 +0100
To: Kevin Day
Cc: freebsd-net@freebsd.org
Subject: Re: Syncookies break with Windows 8
Message-ID: <510C4424.4030701@networx.ch>

On 01.02.2013 22:21, Kevin Day wrote:
> We've got a large cluster of HTTP servers, each server handling >10,000 req/sec. Occasionally, and
> during periods of heavy load, we'd get complaints from some users that downloads were working but
> going EXTREMELY slowly. After a whole lot of debugging, we narrowed it down to being only Windows
> 8 clients experiencing this problem. It turns out that FreeBSD's implementation of syncookies is
> likely violating RFC1323.
>
> When syncookies kicks in, either because the syncache limit is reached or
> net.inet.tcp.syncookies_only is set, some shortcuts are taken with regard to TCP connections.
> Unlike some other syncookies implementations which (ab)use timestamps to store options, the
> FreeBSD implementation of syncookies discards TCP options such as window scaling. In itself this
> isn't a bad thing, but it becomes a bad thing because we then lie and pretend that we are
> supporting window scaling.

This is not true. FreeBSD uses bits in the timestamp to encode all
recognized TCP options including window scaling.

> According to RFC1323, if you want to use TCP window scaling, the client says so on the initial
> SYN. If the server is also willing to use scaling, it says so on the SYN/ACK. If both parties
> included a scaling option on their respective SYN, you assume window scaling is working and
> proceed to use it. If one or both parties don't have a scaling option, you don't scale at all.
> The problem here is that with syncookies, we don't save the wscale parameter from the client's
> SYN, but offer to use window scaling anyway on our SYN/ACK, so the client thinks we successfully
> negotiated window scaling even though we haven't.

The syncookie window scale is stored in the timestamp. Of course this
becomes problematic as you describe when timestamps are not active on a
connection.

> This is how a normal window scaled connection happens:
>
> client > server: Flags [S], win 65535, options [mss 1460,nop,wscale 4,nop,nop,sackOK], length 0
> (client is connecting, offering a window of 64K, but if scaling is negotiated wants to scale
> future window sizes by 4 bits)
>
> server > client: Flags [S.], win 65535, options [mss 1460,nop,wscale 5,sackOK,eol], length 0
> (server is ACKing the client's SYN, also offering an unscaled window of 64K, but wanting to shift
> by 5 going forward)
>
> The server and client both offered window scaling, so they're now using it from this point on.
> All window sizes sent/received are shifted by the appropriate number of bits.

No timestamps.

> When syncookies kicks in on the server, and the client is anything BUT Windows 8, this happens:
>
> client > server: Flags [S], win 65535, options [mss 1460,nop,wscale 4,nop,nop,sackOK], length 0
> However, syncookies cause the options to get lost. The client sent the "wscale 4" parameter, but
> we immediately forgot it.
>
> server > client: Flags [S.], win 65535, options [mss 1460,nop,wscale 5,sackOK,eol], length 0
> (server is ACKing the client's SYN, also offering an unscaled window of 64K, but wanting to shift
> by 5 going forward)
>
> The server sent a wscale back on its SYN/ACK, so the client thinks window scaling is now in
> effect. But it's not, the server didn't remember the client's wscale option, so it's not scaling
> any of the received window sizes that are coming in from the client. This doesn't actually hurt
> much. The client thinks it's telling us it has a 1MB window open, but we're only hearing that
> it's sent a 64K window, so that's all we ever use. It's "failing safe" here, and nothing actually
> breaks.
>
>
> Now throw Windows 8 into the mix. Windows 8's TCP auto tuning is much more aggressive than
> previous versions of Windows. I honestly can't tell if this is a bug or intentional design, but
> Windows will sometimes, intermittently, advertise a much much larger wscale option than it
> actually needs.
> This is a mild example of what happens:
>
> client > server: Flags [S], win 8192, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
> (client is connecting, offering an unscaled window of 8192 bytes, but wants to negotiate window
> scaling of 8 bits if the server will accept it)
>
> server > client: Flags [S.], win 65535, options [mss 1460,nop,wscale 5,sackOK,eol], length 0
> (server is ACKing the client's SYN, also offering an unscaled window of 64K, but wanting to shift
> by 5 going forward)
>
> We're at the same point here as in the above example, the client now believes we've successfully
> negotiated window scaling, but on the server side we're treating all window sizes coming from the
> client as being shifted by 0. So the client sends its first ACK:
>
> client > server: Flags [.], seq 1, ack 1, win 256, length 0
>
> The client believes we're still scaling everything it says by 8 bits, but it only wants to give
> us a 64K window, so it's saying 256 here. (256<<8 = 65536). We don't remember that we agreed to
> shift everything by 8, so we treat that as just 256. The connection now proceeds, but we think we
> can only send 256 bytes at a time. It is extremely slow.

Yeah, that's bad.

> I have seen Windows 8 attempt to use wscale parameters of 8 all the way up to 10. While I've only
> caught a few cases of this happening in the wild, when it's using 10 we end up thinking we only
> have a 64 byte window and things get really silly really fast.

Indeed.

> I've been talking with someone on Microsoft's side of things about why Windows is choosing to do
> this. But my own view of this is that if syncookies are being used in their current state (we
> lose the client's wscale option), we can't advertise wscale on the SYN/ACK. My reading of RFC1323
> says that if we put a wscale option in our SYN/ACK that means we agreed to use the client's
> wscale in their SYN. I don't think that's correct.
> If syncookies are being used, we should
> advertise MIN(sb_max, TCP_MAXWIN) with no scaling and stay within the RFC.
>
> This doesn't affect Linux because it uses timestamp options to stuff the client's wscale, so it
> gets re-learned on the ACK. OpenBSD and OS X don't have syncookies. NetBSD seems to have the same
> problem if its new syncookie implementation gets turned on.

This can't be because of the lack of timestamps. Linux must be encoding
the scale in the ISS, taking bits away from the cookie. I haven't looked
into how Linux actually does it recently.

> Any thoughts? Was there a reason why we're forcing the use of wscale on syncookie connections?

We can change the behavior of syncookies in a couple of ways to deal with
this problem:

 1/ send syncookies only when the syncache overflows and set wscale to 0 in
    the SYN-ACK when timestamps are not active.

 2/ move the wscale bits from timestamp encoding to the ISS, taking bits
    away from the cookie.

At the moment we send syncookies on every SYN-ACK and bump the oldest entry
from the syncache when it is full. That results in potentially every segment
degrading to syncookie only.

The default values are insufficient for such high loads. In general at
10,000 connections per second you should significantly increase the size
of your syncache to 3 * conn/sec at least. In the loader you can set these
tunables:

 net.inet.tcp.syncache.hashsize = 2048
 net.inet.tcp.syncache.bucketlimit = 32
 net.inet.tcp.syncache.cachelimit = 65536

These settings are a bit more complicated than they should be.

Going forward I have to take a closer look at the modifications in 1/ and 2/
and possibly combinations of both. The main issue is the tradeoff in cookie
bits vs. cookie life time and how fast the hash can be cracked these days.
OTOH a too complicated hash would cost us significant CPU power too.

-- 
Andre
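
For readers following the numbers in this thread, here is a minimal sketch of the two pieces of arithmetic discussed above: how a 16-bit window field is interpreted with and without the sender's shift count, and the "3 * conn/sec" syncache sizing rule of thumb. The function names are hypothetical and chosen for illustration only; this is not FreeBSD kernel code, and the shift limit of 14 is the maximum from RFC 1323.

```python
TCP_MAX_WINSHIFT = 14  # largest window shift count permitted by RFC 1323


def effective_window(window_field: int, shift: int) -> int:
    """Bytes implied by a 16-bit TCP window field and a shift count.

    Hypothetical helper for illustration; not taken from any TCP stack.
    """
    if not 0 <= window_field <= 0xFFFF:
        raise ValueError("window field must fit in 16 bits")
    if not 0 <= shift <= TCP_MAX_WINSHIFT:
        raise ValueError("shift count must be between 0 and 14")
    return window_field << shift


def min_syncache_entries(conn_per_sec: int, factor: int = 3) -> int:
    """Rule of thumb from the mail: at least 3 * conn/sec entries."""
    return factor * conn_per_sec


if __name__ == "__main__":
    # Windows 8 trace: the client sends win 256 and means 256 << 8.
    print(effective_window(256, 8))    # what the client intends: 65536
    print(effective_window(256, 0))    # what a server that lost wscale sees: 256

    # "Fail safe" trace: the client means about 1MB, the server hears 64K.
    print(effective_window(65535, 4))  # 1048560
    print(effective_window(65535, 0))  # 65535

    # Sizing at 10,000 conn/sec versus the suggested tunables.
    needed = min_syncache_entries(10_000)
    suggested = 2048 * 32              # hashsize * bucketlimit from the mail
    print(needed, suggested, suggested >= needed)
```

With wscale 10 the same 64K intent arrives as a window field of 64, which matches the "64 byte window" case quoted above. The suggested cachelimit of 65536 is the product of the suggested hashsize and bucketlimit, and it leaves roughly a factor of two of headroom over the 30,000 entries the rule of thumb asks for at 10,000 connections per second.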