From owner-freebsd-current  Wed Jan 16 16:43:59 2002
Delivered-To: freebsd-current@freebsd.org
Received: from mail.acns.ab.ca (h24-64-56-135.cg.shawcable.net [24.64.56.135])
	by hub.freebsd.org (Postfix) with ESMTP id 69A6D37B405
	for <current@freebsd.org>; Wed, 16 Jan 2002 16:43:49 -0800 (PST)
Received: from colnta.acns.ab.ca (colnta.acns.ab.ca [192.168.1.2])
	by mail.acns.ab.ca (8.11.6/8.11.3) with ESMTP id g0H0hgI82489;
	Wed, 16 Jan 2002 17:43:42 -0700 (MST)
	(envelope-from davidc@colnta.acns.ab.ca)
Received: (from davidc@localhost)
	by colnta.acns.ab.ca (8.11.6/8.11.3) id g0H0hgQ02155;
	Wed, 16 Jan 2002 17:43:42 -0700 (MST)
	(envelope-from davidc)
Date: Wed, 16 Jan 2002 17:43:42 -0700
From: Chad David <davidc@acns.ab.ca>
To: Terry Lambert <tlambert2@mindspring.com>
Cc: Chad David <davidc@acns.ab.ca>, current@freebsd.org
Subject: Re: socket shutdown delay?
Message-ID: <20020116174342.A2097@colnta.acns.ab.ca>
Mail-Followup-To: Terry Lambert <tlambert2@mindspring.com>,
	Chad David <davidc@acns.ab.ca>, current@freebsd.org
References: <20020116070908.A803@colnta.acns.ab.ca> <3C45F32A.5B517F7E@mindspring.com> <20020116152908.A1476@colnta.acns.ab.ca> <3C4611D7.F99A5147@mindspring.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <3C4611D7.F99A5147@mindspring.com>; from tlambert2@mindspring.com on Wed, Jan 16, 2002 at 03:50:47PM -0800
Sender: owner-freebsd-current@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-current.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-current>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-current>
X-Loop: FreeBSD.ORG

On Wed, Jan 16, 2002 at 03:50:47PM -0800, Terry Lambert wrote:
> Chad David wrote:
> > > A connection goes into FIN_WAIT_2 when it has received the ACK
> > > of the FIN, but not received a FIN (or sent an ACK) itself, thus
> > > permitting it to enter TIME_WAIT state for 2MSL before proceeding
> > > to the CLOSED state, as a result of a server initiated close.
> > >
> > > A connection goes into LAST_ACK when it has sent a FIN and not
> > > received the ACK of the FIN before proceeding to the CLOSED
> > > state, as a result of a client initiated close.
> > 
> 

The direct cause is a bug in my client.  I called close(2) outside of the
main loop (one line off :( ), so none of the client-side sockets were
getting closed.  When I fixed this, all of the connections went to
TIME_WAIT right away.
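
A minimal sketch of the corrected loop shape, in Python for brevity.  The
names serve, run_client and demo are hypothetical and are not the original
client or server code; the only point is that the close sits inside the
loop body, so each socket is released before the next connection is made.

```python
import socket
import threading


def serve(listener, count, payload=b"x" * 1024):
    """Accept `count` connections: read a request, write ~1k, close."""
    for _ in range(count):
        conn, _addr = listener.accept()
        with conn:                      # closes the accepted socket each time
            conn.recv(64)
            conn.sendall(payload)


def run_client(addr, iterations):
    """One connection at a time; returns the total number of bytes read."""
    total = 0
    for _ in range(iterations):
        sock = socket.create_connection(addr)
        try:
            sock.sendall(b"ping")
            while True:
                chunk = sock.recv(4096)
                if not chunk:           # EOF: the server closed its side
                    break
                total += len(chunk)
        finally:
            sock.close()                # inside the loop, not one line below it
    return total


def demo(iterations=3):
    """Run the client against a throwaway server on the loopback address."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))     # port 0: let the kernel pick a port
    listener.listen(5)
    server = threading.Thread(target=serve, args=(listener, iterations),
                              daemon=True)
    server.start()
    try:
        return run_client(listener.getsockname(), iterations)
    finally:
        server.join(timeout=5)
        listener.close()
```

Moving the close() call one level out of the for loop reproduces the bug:
only the last socket would be closed and the rest would stay open until
the process exits.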

I'm still not convinced that all is well though, as on Solaris 5.9 and
FreeBSD 4.4-STABLE I do not see the problem with the bad client.

I'll address your points below, but if you don't feel like chasing this
anymore that is fine with me... I'll add it to my list of things to
try and understand on my next vacation :).

> Also make sure that the keepalive sysctl is set on (1).

colnta->sysctl -a | grep keepalive
net.inet.tcp.always_keepalive: 1

> 
> This doesn't tell me the setting of the keepalive sysctl.  By
> default, it won't be on unless the sysctl forces it on, which
> it does by default, unless it's been changed, or the default
> has been changed in -current (don't know).  So check this one.

It must be on by default.

> 
> > I also do not call
> > shutdown() on either end, and both the client and server
> > processes have exited and the connections still do not clear
> > up (in time they do, around 10 minutes).
> 
> You should probably call shutdown(2), if you want your code
> to be mostly correct.

Call shutdown(2) instead of close(2)?
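
A minimal sketch of how the two calls relate, assuming the usual BSD
socket semantics: shutdown(2) ends one or both directions of the
conversation, while close(2) is still what releases the descriptor, so
the two are complementary.  The function name half_close_demo is
hypothetical, and a local socketpair stands in for a real TCP connection.

```python
import socket


def half_close_demo():
    """Show that shutdown(SHUT_WR) ends one direction but keeps the socket."""
    a, b = socket.socketpair()          # connected pair standing in for client/server
    try:
        a.sendall(b"request")
        a.shutdown(socket.SHUT_WR)      # signal end of data; 'a' can still read
        request = b.recv(64)
        eof = b.recv(64)                # b"" means the peer finished sending
        b.sendall(b"reply")             # the other direction is still open
        reply = a.recv(64)
        return request, eof, reply
    finally:
        a.close()                       # close() is still needed to free descriptors
        b.close()
```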

> 
> I suspect that you are just doing a large number of connections.

One connection at a time, as fast as the client can loop, with
a small (1k) amount of data being returned by the server.

> 
> I frequently ended up with 50,000+ connections in TIME_WAIT
> state (I rarely use the same machine for both the client and
> the server, since that is not representative of real world
> use), and, of course, it takes 2MSL for TIME_WAIT to drain
> connections out.

Agreed, I'm still testing functionality.  I just got hit with
this while trying to check for simple memory leaks and broken
code (not load testing).

> 
> My guess is that you have run out of mbufs (your usage stats
> tell me nothing about the available number of real mbufs;
> even the "0 requests for memory denied" is not really as
> useful as it would appear in the stats), or you just have
> an incredibly large number of files open.

colnta->sysctl -a | grep mbuf
kern.ipc.nmbufs: 67584
kern.ipc.mbuf_wait: 64
kern.ipc.mbuf_limit: 512

> > > 3)    You've exhausted your mbufs before you've exhausted
> > >       the number of simultaneous connections you are
> > >       permitted, because you have incorrectly tuned your
> > >       kernel, and therefore all your connections are sitting
> > >       in a starvation deadlock, waiting for packets that can
> > >       never be sent because there are no mbufs available.
> > 
> > The client eventually fails with EADDRNOTAVAIL.
> 
> Yes, this is the outbound connection limitation because of the
> ports.  There's three bugs there, in FreeBSD, as well, but they
> generally limit the outbound connections, rather than causing
> problems.
> 
> One tuning variable you probably want on the machine making the
> connections is to up the TCP port range to 65535; you will have
> to do two sysctls in order to do this.  This will delay your
> client failure by about a factor of 8-10 times as many
> connections (outbound connections count against the total, but
> inbound connections do not, since they do not use up socket/port
> pairs by source).

With the fixed client it never fails.  I moved a few GB through it
without any problem.

> > 
> > and a few minutes later:
> > colnta->netstat -an | grep FIN_WAIT_2 | wc
> >     1434    8604  111852
> 
> This indicates a 2MSL draining.  The resource track close could
> also be slow.  You could probably get an incredible speedup by
> doing explicit closes in the client program, starting with the
> highest used fd, and working down, instead of going the other
> way (it's probably a good idea to modify the FreeBSD resource
> track close to do the same thing).

If I had been doing any explicit closes :(.

> 
> There are some other inefficiencies in the fd code that can be
> addressed... nominally, the allocation is a linear search at
> the last valid one going higher.  For most servers, this could
> be significantly improved by linking free fd's in a sparse
> list onto a "freelist", and maintaining a pointer to that,
> instead of the index to the first free one, but that should only
> impact you on allocation (like the inpcb hash, which fails
> pretty badly, even when you tune up the hash size to some
> unreasonable amount, and the port allocation for outbound
> connections, which is, frankly, broken.  Both could benefit from
> a nice btree overhaul).

I actually implemented something for this type of problem over Christmas
with one of the Solaris engineers.  It was inspired by Jeff Bonwick's
vmem stuff (Usenix 2001), but was bit mask based, so the actual storage
overhead was a lot less, with what appeared to be very good allocate and
free times (O(n) as the worst case with O(1) typically).

I still need to add support for layering to allow it to scale across
multiple processors... but I'm getting off topic.
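
To make the general idea concrete, here is a minimal sketch of a bit mask
based index allocator in Python.  It is an illustration only, not the
allocator described above (whose details are not given here) and not
Bonwick's vmem; the class name BitmapAllocator, the 64-bit word size and
the hint field are all assumptions.  It hands out the lowest free index,
which is what descriptor allocation needs.

```python
class BitmapAllocator:
    """Allocate the lowest free index from a fixed range, one bit per index."""

    WORD = 64
    FULL = (1 << WORD) - 1

    def __init__(self, size):
        nwords = (size + self.WORD - 1) // self.WORD
        self.words = [0] * nwords       # a set bit means "in use"
        if size % self.WORD:
            # Mark the padding bits of the last word as permanently used.
            self.words[-1] = self.FULL & ~((1 << (size % self.WORD)) - 1)
        self.hint = 0                   # every word below this index is full

    def alloc(self):
        # Typically the hint word has a free bit, so this loop runs once;
        # in the worst case it scans every word.
        for w in range(self.hint, len(self.words)):
            word = self.words[w]
            if word != self.FULL:
                free_bits = ~word & self.FULL
                bit = (free_bits & -free_bits).bit_length() - 1  # lowest free bit
                self.words[w] = word | (1 << bit)
                self.hint = w
                return w * self.WORD + bit
        raise MemoryError("no free index")

    def free(self, index):
        w, bit = divmod(index, self.WORD)
        if not (self.words[w] >> bit) & 1:
            raise ValueError("index %d is not allocated" % index)
        self.words[w] &= ~(1 << bit)
        if w < self.hint:               # keep the "all full below hint" invariant
            self.hint = w
```

The storage cost is one bit per index plus the hint, and freeing is a
constant-time bit clear.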

> 
> The timer code is also pretty sucky, even with a very large
> callout wheel.  It would be incredibly valuable to have fixed
> interval timers ordered by entry on interval specific lists
> (e.g. MSL and 2MSL lists, as well as other common ones), so
> that the scan of the timer entries could be stopped at the
> first one whose expiration time was after the current time for
> the given interval callout.  This would save you almost all of
> your callout list traversals, which, with the wheel, have to be
> ordered (see the Rice University paper on opportunistic timers
> for a glancing approach at solving the real problem here).

I think I have that paper around here somewhere... is it older,
like from around 1990?
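
The per-interval list idea quoted above can be sketched roughly as
follows.  This is an illustration in Python, not kernel code; the class
name IntervalTimerList and its methods are hypothetical, and cancelling a
timer is left out.  Because every entry on a given list has the same
interval, arming order is expiry order, so the scan can stop at the first
entry that is not yet due.

```python
from collections import deque


class IntervalTimerList:
    """Timers that all share one fixed interval, kept in arming order."""

    def __init__(self, interval):
        self.interval = interval
        self.pending = deque()          # (expiry, item) pairs, oldest first

    def arm(self, now, item):
        # Same interval for every entry, and `now` is assumed never to go
        # backwards, so appending keeps the list sorted without a search.
        self.pending.append((now + self.interval, item))

    def expire(self, now):
        """Return the items that are due, stopping at the first that is not."""
        due = []
        while self.pending and self.pending[0][0] <= now:
            due.append(self.pending.popleft()[1])
        return due
```

One such list per common interval (for example one for MSL and one for
2MSL) would give constant-time arming and an expiry scan that only touches
entries that are actually due.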

> 
> These aren't biting you, though, because the quick draining is
> happening, indicating that it's not really the timer code or
> the other code that's your immediate problem (though you might
> speed draining by a factor of 3 just by fixing the timers to
> use ordered lists per interval, rather than the callout wheel).

Maybe tomorrow night :).

> 
> > > 4)    You've got local hacks that you aren't telling us
> > >       about (shame on you!).
> > 
> > Nope.  Stock -current, none of my patches applied.
> 
> Heh... "not useful information without a date of cvsup,
> and then possibly not even then".  Moving target problems...

The original email has the uname and a dmesg, but:
FreeBSD colnta 5.0-CURRENT FreeBSD 5.0-CURRENT #17: Sun Jan 13 03:51:32 MST 2002

> 
> Can you repeat this on 4.5RC?  If so, try 4.4-RELEASE.  It
> may be related to the SYN cache code.

I do not have a RC or RELEASE box, but 4.4-STABLE does not do this.

> 
> The SYN-cookie code is vulnerable to the "ACK gun" attack,
> and since the SYN cache code falls back into SYN cookie
> (it assumes that the reason it didn't find the corresponding
> SYN in the SYN cache is that it overflowed and was discarded,
> turning naked ACK attempts into SYN-cookie attempts completely
> automatically), you might be hitting it that way.
> 
> If that's the case, then I suggest leaving the SYN cache
> enabled, and disabling the SYN cookie.  If that doesn't fix
> it, then you may also want to try disabling the SYN cache.

I'll have to look into this stuff to understand what you are saying.

> 
> Other than that, once you've tried this, then I will need to
> know what the failure modes are, and then more about the
> client and server code (kqueue based?  Standard sockets
> based?), and then I can suggest more to narrow it down.

Very simple sockets.  Basically:
	... accept() -> read() -> write() -> close() ...

The actual read(), write(), and close() take place in a separate thread,
but there is only one thread active at a time.
	

> 
> Another thing you may want to try is delay closing the
> server side of the connection for 1-2 seconds after the
> last write.  This is the canonical way of forcing a client
> to do the close first in all cases, which totally avoids
> the server-side-close-first case, which also avoids the
> FIN_WAIT_2.  For real code, you would have to add a "close
> cache" and timer.

Given that each connection is in its own thread, this is very doable...
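
A minimal sketch of the thread-per-connection version of that delay, in
Python.  The function name linger_then_close and the 2 second default are
assumptions; the idea is just to wait a short while after the last write
for the client to close first, and only then close the server side.

```python
import socket


def linger_then_close(conn, delay=2.0):
    """Wait up to `delay` seconds for the peer to close, then close `conn`.

    Returns True if the peer closed first, False if the wait timed out.
    """
    conn.settimeout(delay)
    try:
        # recv() returning b"" means the peer has closed its side.
        # Each recv() restarts the timeout, so a peer that keeps sending
        # could extend the wait; real code would track a fixed deadline.
        while conn.recv(4096):
            pass
        return True
    except socket.timeout:
        return False
    finally:
        conn.close()
```

A server thread would call this in place of a bare close() after its
final write.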

> 
> Hope this helps...

If nothing else I'm learning... I just wish I could read as fast
as you can type :).

-- 
Chad David        davidc@acns.ab.ca
www.FreeBSD.org   davidc@freebsd.org
ACNS Inc.         Calgary, Alberta Canada
Fourthly, The constant breeders, beside the gain of eight shillings
sterling per annum by the sale of their children, will be rid of the
charge of maintaining them after the first year. - Jonathan Swift
