From owner-freebsd-current Wed Nov 24 12:36:26 1999 Delivered-To: freebsd-current@freebsd.org Received: from wall.polstra.com (rtrwan160.accessone.com [206.213.115.74]) by hub.freebsd.org (Postfix) with ESMTP id D05F7155DE for ; Wed, 24 Nov 1999 12:36:19 -0800 (PST) (envelope-from jdp@polstra.com) Received: from vashon.polstra.com (vashon.polstra.com [206.213.73.13]) by wall.polstra.com (8.9.3/8.9.3) with ESMTP id MAA01202; Wed, 24 Nov 1999 12:36:08 -0800 (PST) (envelope-from jdp@polstra.com) From: John Polstra Received: (from jdp@localhost) by vashon.polstra.com (8.9.3/8.9.1) id MAA24659; Wed, 24 Nov 1999 12:36:08 -0800 (PST) (envelope-from jdp@polstra.com) Date: Wed, 24 Nov 1999 12:36:08 -0800 (PST) Message-Id: <199911242036.MAA24659@vashon.polstra.com> To: wollman@khavrinen.lcs.mit.edu Subject: Re: Route table leaks In-Reply-To: <199911221552.KAA84691@khavrinen.lcs.mit.edu> References: <199911220150.UAA78559@khavrinen.lcs.mit.edu> <199911221552.KAA84691@khavrinen.lcs.mit.edu> Organization: Polstra & Co., Seattle, WA Cc: current@freebsd.org Sender: owner-freebsd-current@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG I've been working on the cvsup-master route table leaks. I haven't found the bug yet, but I've got some clues now. If this info inspires a Eureka! from any of you, please let me know. I started by running this script to print out key information every 2 seconds while I ran a test: #! /bin/sh while :; do date "+%H:%M:%S" vmstat -m | grep 'routetbl[ ]' || exit netstat -ran | egrep 'default|206\.213\.73\.12' || exit echo "=============================" || exit sleep 2 || exit done (Yes, this is crude. But remember, the machine is 900 miles away and I'm trying not to disrupt service too much.) The output began like this: 19:41:56 routetbl 3682 518K 522K 21221K 11473 0 0 16,32,64,128,256 default 204.216.27.17 UGc 1823 60 wb0 ============================= 19:41:58 routetbl 3682 518K 522K 21221K 11473 0 0 16,32,64,128,256 default 204.216.27.17 UGc 1823 60 wb0 I.e., there were 3682 route table structures in use, and 1823 references to the default route. Then I made a connection to the CVSup server from one of my own machines (206.213.73.12): 19:42:00 routetbl 3684 518K 522K 21221K 11475 0 0 16,32,64,128,256 default 204.216.27.17 UGc 1824 60 wb0 206.213.73.12 204.216.27.17 UGHW 1 64 wb0 So far, so good. A cloned route was created and the refcount on the default route was incremented by one. Two new route table entries were allocated, and that seems to be normal and OK. I immediately closed the connection, and it entered the TIME_WAIT state on cvsup-master. The script output remained as above for 60 seconds (2 * MSL), after which it changed to this: 19:42:59 routetbl 3684 518K 522K 21221K 11475 0 0 16,32,64,128,256 default 204.216.27.17 UGc 1824 62 wb0 206.213.73.12 204.216.27.17 UGHW3 0 64 wb0 3600 This still looks OK. The cloned route has gained the "3" flag and a 1-hour expiration time. That is because the TIME_WAIT state has ended, the PCB has been discarded, and the cloned route is now being managed by the caching code in "sys/netinet/in_rmx.c". (That's what the "3" flag signifies.) The basic idea of this code (as I understand it) is to keep cloned routes for dead connections around for awhile in case they are needed again soon. That's useful since the cloned routes contain RTT estimates and so forth. Now we get to the interesting part. One would expect the route to remain cached for 3600 seconds. There are ways that in_rmx.c can expire it sooner than that, but I confirmed that those situations (e.g., too many cached routes) aren't arising. Nevertheless, the route is deleted after roughly another 200 seconds: 19:46:14 routetbl 3684 518K 522K 21221K 11478 0 0 16,32,64,128,256 default 204.216.27.17 UGc 1824 64 wb0 206.213.73.12 204.216.27.17 UGHW3 0 64 wb0 3405 ============================= 19:46:16 routetbl 3682 518K 522K 21221K 11480 0 0 16,32,64,128,256 default 204.216.27.17 UGc 1823 64 wb0 Using DDB and some of routed's tracing options I determined that routed is deleting the route. More on that later. Anyway, given that the route is being deleted, things still look OK in the above. The route is deleted, the 2 route table entries are freed again, and the refcount on the default route is decremented back to its original value. But look what happens in the next 2 seconds: 19:46:18 routetbl 3684 518K 522K 21221K 11482 0 0 16,32,64,128,256 default 204.216.27.17 UGc 1824 64 wb0 The 2 entries were allocated again and the refcount on the default route was incremented. Why? I don't know (yet). But the numbers remain that way thereafter. That's the leak, and I can reproduce it reliably on cvsup-master. Unfortunately, I cannot reproduce the problem here on my own -current machine. I tried to simulate the environment as accurately as possible, including running routed. On my machine, routed deletes the cached route before it has expired too, but the leak doesn't happen. One other thing. Back on cvsup-master, I changed rc.conf so that it sets the default route statically, and I disabled routed. That has completely eliminated the route table leak. Any ideas? Using DDB remotely through a console switch really isn't much fun. I'd prefer a Eureka! from somebody. :-) John -- John Polstra jdp@polstra.com John D. Polstra & Co., Inc. Seattle, Washington USA "No matter how cynical I get, I just can't keep up." -- Nora Ephron To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-current" in the body of the message