Date: Mon, 25 Sep 2006 02:57:45 -0700
From: John-Mark Gurney <gurney_j@resnet.uoregon.edu>
To: current@FreeBSD.org, net@FreeBSD.org
Cc: Andre Oppermann <andre@FreeBSD.org>, mohans@FreeBSD.org
Subject: odd TCP rtt/retransmit timeout issue...
Message-ID: <20060925095745.GA80527@funkthat.com>

I was bringing up another interface that I had just added to /etc/rc.conf and ran the command /etc/rc.d/netif start to initialize it... But then my connection never came back.... I found that the shell was still active, as I could type commands like sleep 5, and another session's w would see sleep 5 run on the session... Even filling up the send-q with 32k of data didn't get the HEAD box to send any data to the client...

With the help of silby, I managed to find that the t_rxtcur value in the tcpcb was getting a very large value. The session that hung had a retransmit timeout of 19 days... This led us to find that the TCPT_RANGESET macro was letting very large tvmin values override the more sane tvmax values due to an extra else. I have fixed that, so we shouldn't see any more multi-day timeouts, but we still apparently have a problem where the calculated rtt value is wildly incorrect...

It appears that each connection gets a different "random" rtt value... From a few connections to my machine:

    (kgdb) print ((struct tcpcb *)0xc3a34af8)->t_rxtcur
    $3 = 64000
    (kgdb) print ((struct tcpcb *)0xc3a3457c)->t_rxtcur
    $6 = 1662654093
    (kgdb) print ((struct tcpcb *)0xc3a343a8)->t_rxtcur
    $12 = 1358
    (kgdb) print ((struct tcpcb *)0xc3a9e1d4)->t_rxtcur
    $17 = 203
    (kgdb) print ((struct tcpcb *)0xc3a9e000)->t_rxtcur
    $19 = 284155863

Most connections are stable around the "picked" value, though I have seen some connections oscillate between 64000 and a really large value..

I was trying to track this down, and a kernel as of 9/17 exhibits the problem. I had managed to track it down to a RELENG_6 commit (which obviously would affect HEAD) when I realized that each connection got a different value, and that in my older tests I was just getting lucky in not having a bad timeout...

To obtain these values, I used kgdb kernel /dev/mem, and put the value from the first column of netstat -Aanfinet's output in as the tcpcb pointer above..
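
The range-clamping problem described above can be sketched in a few lines of C. This is only an illustration of the general pattern, not the FreeBSD source: the function names clamp_with_else and clamp_without_else are hypothetical stand-ins for the macro before and after the change, and the sample numbers are borrowed from the kgdb output purely as plausible inputs. The assumption is that the upper bound was checked in an else branch, so it was skipped whenever the lower-bound branch fired.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for a range-set helper with the extra "else".
 * The upper bound is only checked when the lower bound did not fire,
 * so a bogus, very large lo wins over a sane hi.
 */
unsigned long
clamp_with_else(unsigned long value, unsigned long lo, unsigned long hi)
{
	unsigned long tv = value;

	if (tv < lo)
		tv = lo;
	else if (tv > hi)
		tv = hi;
	return (tv);
}

/*
 * Same helper without the "else": the upper bound is always applied
 * last, so the result can never exceed hi, whatever lo happens to be.
 */
unsigned long
clamp_without_else(unsigned long value, unsigned long lo, unsigned long hi)
{
	unsigned long tv = value;

	if (tv < lo)
		tv = lo;
	if (tv > hi)
		tv = hi;
	return (tv);
}

/* Small self-check using illustrative values only. */
void
check_clamps(void)
{
	/* Sane bounds: both variants agree. */
	assert(clamp_with_else(1358UL, 30UL, 64000UL) == 1358UL);
	assert(clamp_without_else(1358UL, 30UL, 64000UL) == 1358UL);

	/* Bogus lower bound: only the variant without "else" stays <= hi. */
	assert(clamp_with_else(1000UL, 1662654093UL, 64000UL) == 1662654093UL);
	assert(clamp_without_else(1000UL, 1662654093UL, 64000UL) == 64000UL);
}
```

Dropping the else only caps the final timeout; it does nothing about why the lower bound (derived from the rtt estimate) is so large in the first place, which is the open question here.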

(Why is the column named Socket, when it's the control block struct and not the socket struct?)

Anyone want to track down why we are getting such large values in there? I'll try to backtrack farther to see how old this issue is..

-- 
John-Mark Gurney    Voice: +1 415 225 5579

"All that I will do, has been done, All that I have, has not."