Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 1 Jul 2020 01:23:50 +0000
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Benjamin Kaduk <kaduk@mit.edu>
Cc:        Benjamin Kaduk <bjkfbsd@gmail.com>, Rick Macklem <rmacklem@freebsd.org>, src-committers <src-committers@freebsd.org>, "svn-src-projects@freebsd.org" <svn-src-projects@freebsd.org>
Subject:   Re: svn commit: r362798 - in projects/nfs-over-tls/sys/rpc: . rpcsec_tls
Message-ID:  <QB1PR01MB33642D5CC58DF44548BB1911DD6C0@QB1PR01MB3364.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <QB1PR01MB3364FE7A60B953C2D730E6F3DD6C0@QB1PR01MB3364.CANPRD01.PROD.OUTLOOK.COM>
References:  <202006301449.05UEnq2x072917@repo.freebsd.org> <CAJ5_RoDe=_s2LZociYXTmdVOP%2BLJDA5HJ7jZkKr7LChffbaH8w@mail.gmail.com> <QB1PR01MB336441A427B14216A4A20384DD6F0@QB1PR01MB3364.CANPRD01.PROD.OUTLOOK.COM>, <20200630163340.GN58278@kduck.mit.edu>, <QB1PR01MB3364FE7A60B953C2D730E6F3DD6C0@QB1PR01MB3364.CANPRD01.PROD.OUTLOOK.COM>

next in thread | previous in thread | raw e-mail | index | archive | help
Rick Macklem wrote:=0A=
>Benjamin Kaduk wrote:=0A=
>>On Tue, Jun 30, 2020 at 04:20:45PM +0000, Rick Macklem wrote:=0A=
>>> Benjamin Kaduk wrote:=0A=
>>> >On Tue, Jun 30, 2020 at 7:49 AM Rick Macklem <rmacklem@freebsd.org<mai=
lto:rmacklem@freebsd.org>> wrote:=0A=
>>> >Author: rmacklem=0A=
>>> >Date: Tue Jun 30 14:49:51 2020=0A=
>>> >New Revision: 362798=0A=
>>> >URL: https://svnweb.freebsd.org/changeset/base/362798=0A=
>>> >=0A=
>>> >Log:=0A=
>>> >  Testing when a server does not respond to TLS handshake records expo=
sed=0A=
>>> >  a couple of problems, since the daemon would be in SSL_connect() for=
 6 minutes.=0A=
>>> >=0A=
>>> >  - When the upcall timed out and was retried, the RPCTLS_SYSC_CLSOCKE=
T syscall=0A=
>>> >    was broken and did not return an error upon a retry. It allocated =
a file=0A=
>>> >    descriptor for a NULL socket.=0A=
>>> >  - The socket structure in the kernel could be free'd while the daemo=
n was=0A=
>>> >    still using it in SSL_connect().=0A=
>>> >  - Adjust the timeout a retry count so that upcalls are only attempte=
d once=0A=
>>> >    with a 10minute timeout.=0A=
>>> >=0A=
>>> >=0A=
>>> >10 minutes seems really long!  It sounds from the description like the=
 upcall so >>that=0A=
>>> >userspace can run SSL_connect() was taking 6 minutes, and you needed 1=
0 >>>minutes so=0A=
>>> >as to be longer than the 6 minutes that is "out of your control"?=0A=
>>> Well, I think a long timeout here is ok, since a timeout indicates a br=
oken daemon.=0A=
>>> (The upcalls to the local daemon should be reliable and cannot safely b=
e redone.=0A=
>>>  In a perfect world, the upcall mechanism would be "exactly once" inste=
ad of=0A=
>>>  "at least once". I think an upcall might fail when the mbuf pool in th=
e kernel=0A=
>>>  is exhausted, but that should be rare.)=0A=
>>>=0A=
>>> >I feel like there should be some sockopts available to get the SSL_con=
nect() timeout=0A=
>>> >down, so that the upcall timeout doesn't need to be so long, either.=
=0A=
>>> Yes, 6 minutes does seem like a long time. I only discovered this yeste=
rday when=0A=
>>> I simulated a server that did not respond to handshake records.=0A=
>>>=0A=
>>> I haven't yet dug into the openssl code to see if there is a way to adj=
ust this=0A=
>>> timeout.=0A=
>>> I also do not know what a good timeout value for SSL_connect() might be=
,=0A=
>>> even if the daemon can override the default.=0A=
>>>=0A=
>>> In practice, this should only happen when trying to do an NFS mount on=
=0A=
>>> a broken server which responds to the "STARTTLS" Null RPC, but does not=
=0A=
>>> do the handshake.=0A=
>>> Having the mount attempt stuck for 6minutes before failing is not that =
serious=0A=
>>> a problem, imho.=0A=
>>> (When systems boot after something like a power failure, delays getting=
 NFS=0A=
>>>  mounts done, due to the NFS server/network needing to be up, is fairly=
=0A=
>>>  normal. The "-b" option to put the mount attempt in background has bee=
n=0A=
>>>  around for a long time for this.)=0A=
>>>=0A=
>>> If you happen to know how to set a timeout for SSL_connect() in the ope=
nssl=0A=
>>> library, I would be interested in hearing that.=0A=
>>=0A=
>>As it happens, I took a look before I wrote the initial note, and there=
=0A=
>>doesn't seem to be any intrinsic TLS (not DTLS) handshake timeouts in=0A=
>>libssl itself; I expect this is actually just the (kernel's!) TCP timeout=
.=0A=
>>So you'd be getting the socket fd (e.g., SSL_get_fd(), if you don't have =
a=0A=
>>reference already) and using setsockopt() to set the timeout(s).=0A=
>Interesting. The test case I simulated did not close the TCP socket used b=
y=0A=
>SSL_connect(). The server just replied to the STARTTLS Null RPC, but did n=
ot=0A=
>call SSL_accept(), so the server side just isn't playing "handshake".=0A=
>"netstat -a" showed the connection as ESTABLISHED.=0A=
>During debugging, I also used the trick of putting:=0A=
>    while (1)=0A=
>        sleep(1);=0A=
>right after the SSL_connect() call and, when watching it via "ps",=0A=
>it would switch from "sbwait" to "nanoslp" after 6 minutes and=0A=
>a syslog() call showed that SSL_connect() had returned -1.=0A=
>=0A=
>So, if the TCP connection was "established", what caused the SSL_connect()=
=0A=
>to return with an error (-1) after 6 minutes?=0A=
>=0A=
>Now, there is a 6 minute idle timeout in the RPC code for TCP where it,=0A=
>by default, closes the connection when there is 6 minutes without any=0A=
>activity. (I have to look if waiting for a reply for the upcall implies "n=
o activity" and >if=0A=
>this also happens for AF_LOCAL sockets, which is what the upcalls use.)=0A=
Ok, I figured out what is happening for this test.=0A=
It is the 6 minute idle timeout, but it occurs at the server end, where the=
 NFS server=0A=
end shuts down the TCP connection.=0A=
Now, the client cannot assume all servers will do this.=0A=
I'm going to try playing around with doing a shutdown of the socket on the=
=0A=
client end after a shorter timeout on the upcall and see if that can get=0A=
SSL_connect() to return with a failure in the daemon.=0A=
=0A=
>Now, if that happens, a SIGPIPE would be posted to the daemon, which=0A=
>is SIG_IGN'd by the daemon. But maybe the SIGPIPE somehow causes=0A=
>SSL_connect() to return -1 by making the syscall it is doing (read/recv on=
 the=0A=
>TCP socket sitting in sbwait) return EINTR, or something like that?=0A=
Ignore this "theory". It was bunk.=0A=
=0A=
>I can change this 6minute timeout to see if that affects it.=0A=
Can't be changed, since it is at the server end of the TCP connection.=0A=
(A comment in the krpc code mentions a 5minute timeout in the client,=0A=
 but I don't see that in the code?)=0A=
=0A=
>When you've got upcalls and library functions both talking to sockets it=
=0A=
>can get interesting.=0A=
>=0A=
>Thanks for the comments, rick=0A=
=0A=
Correcting myself, rick=0A=
=0A=
-Ben=0A=
=0A=



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?QB1PR01MB33642D5CC58DF44548BB1911DD6C0>