From: Kevin Day
To: freebsd-net@freebsd.org
Subject: Syncookies break with Windows 8
Date: Fri, 1 Feb 2013 15:21:15 -0600

We've got a large cluster of HTTP servers, each server handling
>10,000 req/sec. Occasionally, during periods of heavy load, we'd get
complaints from some users that downloads were working but going
EXTREMELY slowly. After a whole lot of debugging, we narrowed it down to
only Windows 8 clients experiencing this problem. It turns out that
FreeBSD's implementation of syncookies is likely violating RFC1323.

When syncookies kick in, either because the syncache limit is reached
or net.inet.tcp.syncookies_only is set, some shortcuts are taken with
regard to TCP connections. Unlike some other syncookies implementations
which (ab)use timestamps to store options, the FreeBSD implementation of
syncookies discards TCP options such as window scaling. In itself this
isn't a bad thing, but it becomes a bad thing because we then lie and
pretend that we are supporting window scaling.

According to RFC1323, if you want to use TCP window scaling, the client
says so on the initial SYN. If the server is also willing to use
scaling, it says so on the SYN/ACK. If both parties included a scaling
option on their respective SYN, you assume window scaling is working and
proceed to use it. If one or both parties don't have a scaling option,
you don't scale at all. The problem here is that with syncookies, we
don't save the wscale parameter from the client's SYN, but offer to use
window scaling anyway on our SYN/ACK, so the client thinks we
successfully negotiated window scaling even though we haven't.
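
To make the negotiation rule above concrete, here is a minimal Python
sketch. It is only an illustration of the rule as described in this
message, not FreeBSD code; the function name negotiate_wscale and the
use of None to mean "option absent" are assumptions made for the
example.

```python
from typing import Optional, Tuple


def negotiate_wscale(syn_wscale: Optional[int],
                     synack_wscale: Optional[int]) -> Tuple[int, int]:
    """Return (client_shift, server_shift) in effect after the handshake.

    Hypothetical helper: None means the segment carried no wscale option.
    Scaling is used only if BOTH the SYN and the SYN/ACK carried the
    option; otherwise neither side scales (both shifts are 0).
    """
    if syn_wscale is None or synack_wscale is None:
        return (0, 0)
    return (syn_wscale, synack_wscale)


# Normal handshake: both sides sent wscale, so both shifts apply.
print(negotiate_wscale(4, 5))

# One side omitted the option: no scaling in either direction.
print(negotiate_wscale(4, None))

# The syncookie case from this message: the client saw wscale on both
# segments, so from its side scaling is on. The server has forgotten the
# client's option, so it applies a shift of 0 to the client's windows.
client_view = negotiate_wscale(4, 5)
server_shift_applied_to_client = 0
print(client_view[0], server_shift_applied_to_client)
```

The mismatch in the last case (the client shifts by 4, the server
assumes 0) is the disagreement the rest of this message is about.
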

This is how a normal window scaled connection happens:

  client > server: Flags [S], win 65535, options [mss 1460,nop,wscale 4,nop,nop,sackOK], length 0

(client is connecting, offering a window of 64K, but if scaling is
negotiated wants to scale future window sizes by 4 bits)

  server > client: Flags [S.], win 65535, options [mss 1460,nop,wscale 5,sackOK,eol], length 0

(server is ACKing the client's SYN, also offering an unscaled window of
64K, but wanting to shift by 5 going forward)

The server and client both offered window scaling, so they're now using
it from this point on. All window sizes sent/received are shifted by the
appropriate number of bits.

When syncookies kick in on the server, and the client is anything BUT
Windows 8, this happens:

  client > server: Flags [S], win 65535, options [mss 1460,nop,wscale 4,nop,nop,sackOK], length 0

However, syncookies cause the options to get lost. The client sent the
"wscale 4" parameter, but we immediately forgot it.

  server > client: Flags [S.], win 65535, options [mss 1460,nop,wscale 5,sackOK,eol], length 0

(server is ACKing the client's SYN, also offering an unscaled window of
64K, but wanting to shift by 5 going forward)

The server sent a wscale back on its SYN/ACK, so the client thinks
window scaling is now in effect. But it's not: the server didn't
remember the client's wscale option, so it's not scaling any of the
received window sizes that are coming in from the client. This doesn't
actually hurt much. The client thinks it's telling us it has a 1MB
window open, but we're only hearing that it's sent a 64K window, so
that's all we ever use. It's "failing safe" here, and nothing actually
breaks.

Now throw Windows 8 into the mix. Windows 8's TCP auto tuning is much
more aggressive than previous versions of Windows.
I honestly can't tell if this is a bug or intentional design, but
Windows will sometimes, intermittently, advertise a much, much larger
wscale option than it actually needs. This is a mild example of what
happens:

  client > server: Flags [S], win 8192, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0

(client is connecting, offering an unscaled window of 8192 bytes, but
wants to negotiate window scaling of 8 bits if the server will accept
it)

  server > client: Flags [S.], win 65535, options [mss 1460,nop,wscale 5,sackOK,eol], length 0

(server is ACKing the client's SYN, also offering an unscaled window of
64K, but wanting to shift by 5 going forward)

We're at the same point here as in the above example: the client now
believes we've successfully negotiated window scaling, but on the server
side we're treating all window sizes coming from the client as being
shifted by 0.

So the client sends its first ACK:

  client > server: Flags [.], seq 1, ack 1, win 256, length 0

The client believes we're still scaling everything it says by 8 bits,
but it only wants to give us a 64K window, so it's saying 256 here
(256<<8 = 65536). We don't remember that we agreed to shift everything
by 8, so we treat that as just 256. The connection now proceeds, but we
think we can only send 256 bytes at a time. It is extremely slow.

I have seen Windows 8 attempt to use wscale parameters of 8 all the way
up to 10. While I've only caught a few cases of this happening in the
wild, when it's using 10 we end up thinking we only have a 64 byte
window and things get really silly really fast.

I've been talking with someone on Microsoft's side of things about why
Windows is choosing to do this. But my own view of this is that if
syncookies are being used in their current state (we lose the client's
wscale option), we can't advertise wscale on the SYN/ACK.
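
The window arithmetic from the Windows 8 example above can be checked
with a few lines of Python. This is just a worked calculation under the
assumptions stated in this message (the client wants a 64K window and
the server applies a shift of 0); the names window_bytes and TARGET are
made up for the illustration.

```python
def window_bytes(win_field: int, shift: int) -> int:
    """Window size in bytes for a TCP window field and a shift count."""
    return win_field << shift


TARGET = 65536  # the 64K window the client actually wants to offer

for client_shift in (8, 9, 10):
    # Value the client puts in the TCP header's window field.
    win_field = TARGET >> client_shift
    # What the client means, using the shift it believes was negotiated.
    intended = window_bytes(win_field, client_shift)
    # What the server hears, having forgotten the client's wscale.
    seen = window_bytes(win_field, 0)
    print(f"wscale {client_shift}: win field {win_field}, "
          f"client means {intended} bytes, server sees {seen} bytes")
```

With a shift of 8 the server sees a 256 byte window, and with a shift
of 10 it sees 64 bytes, matching the numbers above.
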
My reading of RFC1323 says that if we put a wscale option in our
SYN/ACK, that means we agreed to use the client's wscale from their SYN.
I don't think what we're doing now is correct. If syncookies are being
used, we should advertise MIN(sb_max, TCP_MAXWIN) with no scaling and
stay within the RFC.

This doesn't affect Linux because it uses timestamp options to store
the client's wscale, so it gets re-learned on the ACK. OpenBSD and OS X
don't have syncookies. NetBSD seems to have the same problem if its new
syncookie implementation gets turned on.

Any thoughts? Was there a reason why we're forcing the use of wscale on
syncookie connections?

-- Kevin