From owner-freebsd-net@FreeBSD.ORG  Fri Nov 30 14:09:10 2012
Date: Fri, 30 Nov 2012 09:09:08 -0500
Message-ID: <CAEo_tUH9LPzPFP-O=317rYEQ3nT66b4biQshV_8=L8hReO_BLg@mail.gmail.com>
Subject: Problems with ephemeral port selection
From: Keith Arner <vornum@gmail.com>
To: freebsd-net@freebsd.org
Content-Type: text/plain; charset=ISO-8859-1

I've noticed some issues with ephemeral port number selection from
tcp_connect(), which limit the number of concurrent, outgoing connections
that can be established (connect(), rather than accept()).  Sifting through
the source code, I believe the issues stem from two problems in the
tcp_connect() code path.  Specifically:

 1) The wrong function gets called to determine if a given ephemeral
    port number is currently usable.
 2) The ephemeral port number gets selected without considering the
    foreign addr/port.

Curiously, the effect of #1 mostly cancels the effect of #2, such that
the common calling convention gives you a correct result so long as you
only have a small number of outgoing connections.  However, once you get to
a large number of outgoing connections, things start to break down.  (I'll
define large and small later.)

As a side note, I have been working with FreeBSD 7.2.  The implementations
of several of the relevant functions have been refactored somewhere between
7.2-RELEASE and 9-STABLE, but the core problems in the logic seem to be
the same between versions.

For problem #1, the code path that selects the ephemeral port number is:
 tcp_connect() ->
   in_pcbbind() ->
     in_pcbbind_setup() ->
       in_pcb_lport() [not in FreeBSD 7.2] ->
         in_pcblookup_local()

There is a loop in in_pcb_lport() [or directly in in_pcbbind_setup() in
earlier releases] that considers candidate ephemeral port numbers and
calls in_pcblookup_local() to determine if a given candidate is suitable.
The default behaviour (if the caller has not set either SO_REUSEADDR or
SO_REUSEPORT) is to pick a local port number that is not in use by
*any* local TCP socket.

So long as the number of concurrent, outgoing connections is less than the
range configured by `sysctl net.inet.ip.portrange.*`, selecting a totally
unique ephemeral port number works OK.  However, you cannot exceed that
limit, even if each outgoing connection has a unique faddr/fport.  This
does not limit the number of connections that can be accept()'ed, only the
number of connections that can be connect()'ed.

In this particular path, I think the code should call in_pcblookup_hash(),
rather than in_pcblookup_local().  The criteria in in_pcblookup_hash() only
match if the full 5-tuple matches, rather than just the local port number.
The complication, of course, comes from the fact that in_pcbbind() is
called both from bind() and for the implicit bind that happens for a
connect().  The matching criteria in in_pcblookup_local() make sense for
the former but not quite for the latter.
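
To make the difference between the two matching criteria concrete, here is
a minimal user-space sketch.  It is a toy model, not the kernel code: the
struct and both function names are made up for illustration, and it ignores
wildcard addresses, listening sockets and locking entirely.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a connected TCP pcb; NOT the kernel's struct inpcb. */
struct toy_conn {
	uint32_t laddr, faddr;
	uint16_t lport, fport;
};

/*
 * Port-only match, analogous to the default check described above:
 * a candidate local port is rejected if ANY entry already uses it.
 */
static bool
toy_lport_in_use(const struct toy_conn *tbl, size_t n, uint16_t lport)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].lport == lport)
			return true;
	return false;
}

/*
 * Full-tuple match: a candidate is rejected only if an entry with the
 * same laddr, lport, faddr and fport already exists.
 */
static bool
toy_tuple_in_use(const struct toy_conn *tbl, size_t n,
    const struct toy_conn *c)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].laddr == c->laddr && tbl[i].lport == c->lport &&
		    tbl[i].faddr == c->faddr && tbl[i].fport == c->fport)
			return true;
	return false;
}
```

With one existing connection on local port 50000, the port-only check
refuses port 50000 for every new connection, while the full-tuple check
refuses it only for a connection to the same faddr/fport.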

I mentioned that the above is the default behaviour you get when you don't
specify SO_REUSEADDR or SO_REUSEPORT.  Setting SO_REUSEADDR
before calling connect() has some surprising consequences (surprising in the
sense that I don't believe SO_REUSEADDR is supposed to have any effect
on connect()).  In this case, when in_pcblookup_local() is called, wild_okay
is set to false.  This changes the matching criteria to (in effect) allow
tcp_connect() to use the full 5-tuple space.  However, this brings us to the
second problem.

Problem #2 is that the ephemeral port number is chosen before the
fport/faddr gets set on the pcb; that is, tcp_connect() calls in_pcbbind() to
select the ephemeral port number, *then* calls in_pcbconnect_setup() to
populate the fport/faddr.  With SO_REUSEADDR, in_pcbbind() can select
an in-use local port.  If the local port is used by a socket with a different
laddr/fport/faddr, all is good.  However, if the local port selection
results in a full conflict it will get rejected by the call to
in_pcblookup_hash() inside
in_pcbconnect_setup().  This happens *after* the loop inside
in_pcbbind(), so the call to tcp_connect() fails with EADDRINUSE.  Thus,
with SO_REUSEADDR, connect() can fail with EADDRINUSE long before
the ephemeral port space has been exhausted.  The application could retry
the call to connect() and likely succeed, as a new local port would be
selected.
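
The application-side retry could look roughly like the sketch below.
connect_retry() is a hypothetical helper, not a system API.  It assumes IPv4
and opens a fresh socket for each attempt, on the assumption that a socket
whose connect() has failed should not be reused.

```c
#include <errno.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/*
 * Hypothetical helper: connect to dst over TCP with SO_REUSEADDR set,
 * retrying with a new socket while connect() fails with EADDRINUSE.
 * Returns a connected descriptor, or -1 with errno set.  If max_tries
 * is not positive, no attempt is made and errno is EADDRINUSE.
 */
static int
connect_retry(const struct sockaddr_in *dst, int max_tries)
{
	int saved = EADDRINUSE;

	for (int i = 0; i < max_tries; i++) {
		int on = 1;
		int fd = socket(AF_INET, SOCK_STREAM, 0);

		if (fd < 0)
			return -1;	/* errno set by socket() */
		if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR,
		    &on, sizeof(on)) == 0 &&
		    connect(fd, (const struct sockaddr *)dst,
		    sizeof(*dst)) == 0)
			return fd;
		saved = errno;
		close(fd);
		if (saved != EADDRINUSE)
			break;		/* only a port conflict is retried */
	}
	errno = saved;
	return -1;
}
```

This only works around the symptom; every retry costs another socket() and
connect() round through the kernel.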

Overall, this behaviour hinders the ability to open a large number of
outbound connections:
 * If you don't specify SO_REUSEADDR, you have a fairly limited maximum
   number of outbound connections.
 * If you do specify SO_REUSEADDR, you are able to open a much larger
   number of outbound connections, but must retry on EADDRINUSE.

I believe that the logic under tcp_connect() should be modified to:

 - behave uniformly whether or not SO_REUSEADDR has been set
 - allow outgoing connection requests to re-use a local port number, so
   long as the remaining elements of the tuple (laddr, fport, faddr) are
   unique
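
As a rough illustration of the second point, here is a toy model of port
selection that takes the foreign addr/port into account.  This is not a
patch: the struct and function names are invented, the scan is sequential
instead of following the kernel's selection policy, and it ignores sockets
that reserved a port through an explicit bind() or listen(), which a real
implementation would still have to respect.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy connection entry; NOT the kernel's struct inpcb. */
struct conn4 {
	uint32_t laddr, faddr;
	uint16_t lport, fport;
};

/* True if an entry with exactly this laddr/lport/faddr/fport exists. */
static bool
conn4_exists(const struct conn4 *tbl, size_t n, const struct conn4 *c)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].laddr == c->laddr && tbl[i].lport == c->lport &&
		    tbl[i].faddr == c->faddr && tbl[i].fport == c->fport)
			return true;
	return false;
}

/*
 * Pick a local port in [first, last] such that the complete tuple is
 * unique.  Returns the port, or 0 if every candidate conflicts.
 * Assumes 0 < first <= last.
 */
static uint16_t
pick_lport(const struct conn4 *tbl, size_t n, uint32_t laddr,
    uint32_t faddr, uint16_t fport, uint16_t first, uint16_t last)
{
	struct conn4 c = { .laddr = laddr, .faddr = faddr, .fport = fport };

	/* 32-bit counter so the loop terminates when last == 65535. */
	for (uint32_t p = first; p <= last; p++) {
		c.lport = (uint16_t)p;
		if (!conn4_exists(tbl, n, &c))
			return c.lport;
	}
	return 0;
}
```

In this model a local port already used towards one destination can be
handed out again for a different faddr/fport, and selection only fails once
the range is exhausted for that particular destination.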

Keith

-- 
"A problem well put is half solved."