Date: Wed, 27 Dec 2006 14:18:44 -0800
From: Matthew Hudson <fbsd@synoptic.org>
To: freebsd-net@freebsd.org
Subject: Re: Diagnose co-location networking problem

On Tue, Dec 26, 2006 at 06:45:39PM -0800, Stephan Wehner wrote:
> So I am thinking the problem may be with the co-location operation.
>
> How can I make sure? How can I diagnose this? The only idea I had was
> to run tcpdump on my Linux client (tcpdump host stbgo.org), and indeed
> I can see entries lines this:
>

I troubleshoot issues just like this for a living, so I hope I can be of
some help. Others have already suggested some useful strategies, so I'll
try to focus on ones that I haven't seen mentioned yet.

Right off the bat, based on what you've described, I'd tend to suspect
some sort of transparent proxy, be it a stateful firewall or an
intermediary loadbalancer of some sort.
The fact that your ssh connection from the same source IP (I'm assuming)
isn't showing any symptoms would tend to de-emphasize layers 1-3 (IP on
down to ethernet), ruling out packet loss due to an ethernet duplex
mismatch/cabling problem and bad IP routing, though it doesn't rule out
rate limiting. However, if you've been experiencing intermittent pauses
with your ssh session, even if they don't coincide with interruptions in
http traffic, then you may still have a packet loss issue.

If you suspect packet loss, confirm with 'netstat -i' and look at the
Ierrs and Oerrs columns; they should both be 0 if everything is spiff.
Also check the TCP retransmit counters in 'netstat -s' (you will always
have some retransmission, you just don't want a *lot* of it). I should
note that I think this is a low probability based on symptoms.

Actually, based on the traffic snip you quoted, I tend to strongly
suspect a firewall/loadbalancer/proxy.. note the source IP:

> 21:40:22.162536 192.168.2.54.35932 > 65.110.18.138.80:
> S 1526509984:1526509984(0) win 5840 (DF)

The source IP is 192.168.2.54, which isn't a routable IP address. Unless
you're coming through a VPN or are local to the network, this is clear
evidence that there is a box in the middle that's at least smart enough
to do address translation.

To troubleshoot everything else, I would start by recording a full
traffic capture from both the client and the server and trying to
reproduce the problem. It sounds like that shouldn't be a problem. On
the client I'd run:

    tcpdump -n -s 1600 -i <interface> -w clientside.dmp host <server-ip>

On the server I'd run:

    tcpdump -n -s 1600 -i <interface> -w serverside.dmp

Plan on the clientside.dmp and serverside.dmp files getting large fast.
That's ok; you just want to be sure to get everything. Let these two
dumps run and then proceed to reproduce the problem. If you can, get a
good mix of good connections vs failed ones. Then stop the dumps; it's
time for analysis. For this, I'd recommend using the program 'tcptrace',
which is in the ports tree.
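As an aside, the 'netstat -i' error-counter check mentioned above is easy
to script. Here's a rough sketch; the interface names, counter values,
and column layout in the canned sample are just assumptions for
illustration -- on a live box you'd pipe real `netstat -i` output into
the awk filter instead:

```shell
#!/bin/sh
# Hedged sketch: scan `netstat -i` style output for non-zero Ierrs/Oerrs.
# The canned sample below stands in for real `netstat -i` output.
sample='Name    Mtu Network       Address            Ipkts Ierrs    Opkts Oerrs  Coll
em0    1500 <Link#1>      00:11:22:33:44:55 812345     0   790123     17     0
lo0   16384 <Link#2>      lo0               10234      0    10234      0     0'

# Columns 6 and 8 are Ierrs and Oerrs; both should be 0 on a healthy link.
flagged=$(printf '%s\n' "$sample" | awk 'NR > 1 && ($6 + 0 > 0 || $8 + 0 > 0) {
    printf "%s: Ierrs=%s Oerrs=%s -- possible packet loss\n", $1, $6, $8
}')
printf '%s\n' "$flagged"
```

Any interface this prints is worth a closer look at cabling and duplex
settings before blaming anything upstream.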
I'd start by looking in clientside.dmp for failed connection
attempts/short connections. You can do this using the command:

    tcptrace -n -b clientside.dmp

and you should see something like this:

    hudson@Nikto:~/share > tcptrace -n -b dumpexample.dmp
    1 arg remaining, starting with 'dumpexample.dmp'
    Ostermann's tcptrace -- version 6.6.1 -- Wed Nov 19, 2003

    496 packets seen, 496 TCP packets traced
    elapsed wallclock time: 0:00:00.030771, 16119 pkts/sec analyzed
    trace file elapsed time: 0:00:25.364361
    TCP connection info:
      1: 10.192.4.16:59723 - 72.14.253.99:80 (a2b)      7>    7<  (complete)
      2: 195.64.132.11:29957 - 10.192.4.16:80 (c2d)     1>    3<
      3: 10.192.4.16:51717 - 198.238.212.10:80 (e2f)   30>   41<
      4: 10.192.4.16:64601 - 198.238.212.10:80 (g2h)   17>    9<
      5: 10.192.4.16:54693 - 198.238.212.10:80 (i2j)   26>   15<  (complete)
      6: 10.192.4.16:65285 - 198.238.212.30:80 (k2l)   33>   52<
      7: 10.192.4.16:54362 - 66.35.250.151:80 (m2n)     5>    5<  (complete)
      8: 10.192.4.16:65391 - 66.35.250.150:80 (o2p)    14>   16<  (complete)

This gives you a rough outline of the connections in the dump and tells
you how many packets were sent in either direction. If a connection
failed, you should see a very low packet count for it; connection #2 in
the above example would be suspect, for instance. Once you have isolated
an interesting connection, you can use tcpdump again to filter on that
connection and get the full story:

    tcpdump -n -r dumpexample.dmp port 29957

My first hunch would be that there are intermittent connection
establishment failures thanks to the firewall/loadbalancer. This would
manifest itself as SYNs being seen on the client side that are not seen
on the server side. (Find a connection where SYNs aren't being answered
in clientside.dmp, and then check serverside.dmp to see if the SYNs are
being received; cross-reference by time.) If the SYNs are making it to
the server, then you have a server issue; if they aren't, then we're
still looking at a potential middlebox problem.
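That first pass over the tcptrace summary can be automated with a quick
awk filter that flags low-packet-count connections. A sketch, assuming
the `tcptrace -n -b` listing format shown above; the 5-packet cutoff is
an arbitrary choice, and the canned sample mimics the listing rather
than coming from a real dump:

```shell
#!/bin/sh
# Hedged sketch: flag suspiciously short connections in `tcptrace -n -b`
# style output. In practice you would pipe in real output from
# `tcptrace -n -b clientside.dmp` instead of this canned sample.
sample='  1: 10.192.4.16:59723 - 72.14.253.99:80 (a2b) 7> 7< (complete)
  2: 195.64.132.11:29957 - 10.192.4.16:80 (c2d) 1> 3<
  3: 10.192.4.16:51717 - 198.238.212.10:80 (e2f) 30> 41<'

# Fields: index, src, "-", dst, label, "N>", "N<"; strip the arrows and
# flag any connection with fewer than 5 packets in either direction.
suspects=$(printf '%s\n' "$sample" | awk '{
    sent = $6; recv = $7
    sub(/>/, "", sent); sub(/</, "", recv)
    if (sent + 0 < 5 || recv + 0 < 5)
        printf "conn %s %s - %s (%s> %s<)\n", $1, $2, $4, sent, recv
}')
printf '%s\n' "$suspects"
```

Each flagged connection gives you a port number to feed back into the
`tcpdump -n -r dumpexample.dmp port NNNNN` step above.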
When looking for the SYNs in the serverside dump, don't filter by IP
address, as it's possible that the failure is due to bad address
translation somewhere... i.e. you may see SYNs being received at the
same time that the client is sending them, but with the wrong source IP
address.

If the problem isn't restricted to connection establishment, then I'd
look for connections in clientside.dmp that have long pauses in them and
try to explain those pauses by comparing with serverside.dmp. To isolate
connections with pauses in them, I'd again turn to the trusty 'tcptrace'
program. This time, however, I'd use the '-l' ("long") switch to get
more details on individual connections and grep for anomalies. Here's an
example of tcptrace long output:

    TCP connection 1:
            host a:        10.192.4.16:59723
            host b:        72.14.253.99:80
            complete conn: yes
            first packet:  Wed Dec 27 13:29:59.651504 2006
            last packet:   Wed Dec 27 13:30:02.302161 2006
            elapsed time:  0:00:02.650656
            total packets: 14
            filename:      dumpexample.dmp
       a->b:                              b->a:
         total packets:         7           total packets:         7
         ack pkts sent:         6           ack pkts sent:         7
         pure acks sent:        3           pure acks sent:        2
         sack pkts sent:        0           sack pkts sent:        0
         dsack pkts sent:       0           dsack pkts sent:       0
         max sack blks/ack:     0           max sack blks/ack:     0
         unique bytes sent:  1271           unique bytes sent:  2399
         actual data pkts:      2           actual data pkts:      3
         actual data bytes:  1271           actual data bytes:  2399
         rexmt data pkts:       0           rexmt data pkts:       0
         rexmt data bytes:      0           rexmt data bytes:      0
         zwnd probe pkts:       0           zwnd probe pkts:       0
         zwnd probe bytes:      0           zwnd probe bytes:      0
         outoforder pkts:       0           outoforder pkts:       0
         pushed data pkts:      2           pushed data pkts:      2
         SYN/FIN pkts sent:   1/1           SYN/FIN pkts sent:   1/1
         req 1323 ws/ts:      Y/Y           req 1323 ws/ts:      N/N
         adv wind scale:        0           adv wind scale:        0
         req sack:              Y           req sack:              N
         sacks sent:            0           sacks sent:            0
         urgent data pkts:      0 pkts      urgent data pkts:      0 pkts
         urgent data bytes:     0 bytes     urgent data bytes:     0 bytes
         mss requested:      1460 bytes     mss requested:      1460 bytes
         max segm size:       762 bytes     max segm size:      1430 bytes
         min segm size:       509 bytes     min segm size:       151 bytes
         avg segm size:       635 bytes     avg segm size:       799 bytes
         max win adv:       65535 bytes     max win adv:        8190 bytes
         min win adv:       64882 bytes     min win adv:        6444 bytes
         zero win adv:          0 times     zero win adv:          0 times
         avg win adv:       65441 bytes     avg win adv:        7317 bytes
         initial window:      509 bytes     initial window:     2248 bytes
         initial window:        1 pkts      initial window:        2 pkts
         ttl stream length:  1271 bytes     ttl stream length:  2399 bytes
         missed data:           0 bytes     missed data:           0 bytes
         truncated data:        0 bytes     truncated data:        0 bytes
         truncated packets:     0 pkts      truncated packets:     0 pkts
         data xmit time:    2.495 secs      data xmit time:    2.566 secs
         idletime max:     2449.3 ms        idletime max:     2519.2 ms
         throughput:          480 Bps       throughput:          905 Bps

I'd look at the fields 'elapsed time' and 'idletime max' (in the
direction from the server to the client only, in this case the "b->a"
column; the other direction will always have long idle times due to the
nature of HTTP). Some clever grepping should isolate interesting
candidate connections, which you can then pull out with tcpdump.

At this point, if it is indeed a middlebox problem, there's probably not
much you can do about it yourself. But if you can isolate the symptoms
and even provide example tcpdumps illustrating the problem, then you
greatly increase the chances that your ISP's support staff can resolve
it. Many times, even if they know that a problem exists, they may not
know how to resolve it... having tcpdumps handy makes it easier for them
to show the problem to someone else (say, the firewall/loadbalancer
vendor) who can tell them how to fix it. I know this from experience; I
work at a company that makes loadbalancers. ;)

Hope that helps,

--
Matthew Hudson
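P.S. The "clever grepping" of the tcptrace long output could look
something like the sketch below. It flags 'idletime max' lines where the
second (b->a, server-to-client) value exceeds a threshold; the canned
sample lines and the 2000 ms cutoff are just assumptions for
illustration -- in practice, feed it real `tcptrace -n -l serverside.dmp`
output and tune the cutoff.

```shell
#!/bin/sh
# Hedged sketch: canned "idletime max" lines standing in for real
# `tcptrace -n -l` output; tcptrace prints one value per direction.
sample='   idletime max:   2449.3 ms      idletime max:   2519.2 ms
   idletime max:     12.1 ms      idletime max:      9.8 ms
   idletime max:   8107.4 ms      idletime max:     15.0 ms'

# The last number on each line is the b->a idle time; only that direction
# matters, since client-side (a->b) pauses are normal for HTTP.
long_idle=$(printf '%s\n' "$sample" | awk '$NF == "ms" && $(NF-1) + 0 > 2000')
printf '%s\n' "$long_idle"
```

Note that the third sample line is deliberately *not* flagged: its big
idle time is on the a->b side, which is expected.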