Date: Wed, 1 Jul 2020 00:43:41 +0000 From: Rick Macklem <rmacklem@uoguelph.ca> To: Benjamin Kaduk <kaduk@mit.edu> Cc: Benjamin Kaduk <bjkfbsd@gmail.com>, Rick Macklem <rmacklem@freebsd.org>, src-committers <src-committers@freebsd.org>, "svn-src-projects@freebsd.org" <svn-src-projects@freebsd.org> Subject: Re: svn commit: r362798 - in projects/nfs-over-tls/sys/rpc: . rpcsec_tls Message-ID: <QB1PR01MB3364FE7A60B953C2D730E6F3DD6C0@QB1PR01MB3364.CANPRD01.PROD.OUTLOOK.COM> In-Reply-To: <20200630163340.GN58278@kduck.mit.edu> References: <202006301449.05UEnq2x072917@repo.freebsd.org> <CAJ5_RoDe=_s2LZociYXTmdVOP%2BLJDA5HJ7jZkKr7LChffbaH8w@mail.gmail.com> <QB1PR01MB336441A427B14216A4A20384DD6F0@QB1PR01MB3364.CANPRD01.PROD.OUTLOOK.COM>, <20200630163340.GN58278@kduck.mit.edu>
next in thread | previous in thread | raw e-mail | index | archive | help
Benjamin Kaduk wrote:=0A= >On Tue, Jun 30, 2020 at 04:20:45PM +0000, Rick Macklem wrote:=0A= >> Benjamin Kaduk wrote:=0A= >> >On Tue, Jun 30, 2020 at 7:49 AM Rick Macklem <rmacklem@freebsd.org<mail= to:rmacklem@freebsd.org>> wrote:=0A= >> >Author: rmacklem=0A= >> >Date: Tue Jun 30 14:49:51 2020=0A= >> >New Revision: 362798=0A= >> >URL: https://svnweb.freebsd.org/changeset/base/362798=0A= >> >=0A= >> >Log:=0A= >> > Testing when a server does not respond to TLS handshake records expos= ed=0A= >> > a couple of problems, since the daemon would be in SSL_connect() for = 6 minutes.=0A= >> >=0A= >> > - When the upcall timed out and was retried, the RPCTLS_SYSC_CLSOCKET= syscall=0A= >> > was broken and did not return an error upon a retry. It allocated a= file=0A= >> > descriptor for a NULL socket.=0A= >> > - The socket structure in the kernel could be free'd while the daemon= was=0A= >> > still using it in SSL_connect().=0A= >> > - Adjust the timeout a retry count so that upcalls are only attempted= once=0A= >> > with a 10minute timeout.=0A= >> >=0A= >> >=0A= >> >10 minutes seems really long! It sounds from the description like the = upcall so >>that=0A= >> >userspace can run SSL_connect() was taking 6 minutes, and you needed 10= >>minutes so=0A= >> >as to be longer than the 6 minutes that is "out of your control"?=0A= >> Well, I think a long timeout here is ok, since a timeout indicates a bro= ken daemon.=0A= >> (The upcalls to the local daemon should be reliable and cannot safely be= redone.=0A= >> In a perfect world, the upcall mechanism would be "exactly once" instea= d of=0A= >> "at least once". I think an upcall might fail when the mbuf pool in the= kernel=0A= >> is exhausted, but that should be rare.)=0A= >>=0A= >> >I feel like there should be some sockopts available to get the SSL_conn= ect() timeout=0A= >> >down, so that the upcall timeout doesn't need to be so long, either.=0A= >> Yes, 6 minutes does seem like a long time. I only discovered this yester= day when=0A= >> I simulated a server that did not respond to handshake records.=0A= >>=0A= >> I haven't yet dug into the openssl code to see if there is a way to adju= st this=0A= >> timeout.=0A= >> I also do not know what a good timeout value for SSL_connect() might be,= =0A= >> even if the daemon can override the default.=0A= >>=0A= >> In practice, this should only happen when trying to do an NFS mount on= =0A= >> a broken server which responds to the "STARTTLS" Null RPC, but does not= =0A= >> do the handshake.=0A= >> Having the mount attempt stuck for 6minutes before failing is not that s= erious=0A= >> a problem, imho.=0A= >> (When systems boot after something like a power failure, delays getting = NFS=0A= >> mounts done, due to the NFS server/network needing to be up, is fairly= =0A= >> normal. The "-b" option to put the mount attempt in background has been= =0A= >> around for a long time for this.)=0A= >>=0A= >> If you happen to know how to set a timeout for SSL_connect() in the open= ssl=0A= >> library, I would be interested in hearing that.=0A= >=0A= >As it happens, I took a look before I wrote the initial note, and there=0A= >doesn't seem to be any intrinsic TLS (not DTLS) handshake timeouts in=0A= >libssl itself; I expect this is actually just the (kernel's!) TCP timeout.= =0A= >So you'd be getting the socket fd (e.g., SSL_get_fd(), if you don't have a= =0A= >reference already) and using setsockopt() to set the timeout(s).=0A= Interesting. The test case I simulated did not close the TCP socket used by= =0A= SSL_connect(). The server just replied to the STARTTLS Null RPC, but did no= t=0A= call SSL_accept(), so the server side just isn't playing "handshake".=0A= "netstat -a" showed the connection as ESTABLISHED.=0A= During debugging, I also used the trick of putting:=0A= while (1)=0A= sleep(1);=0A= right after the SSL_connect() call and, when watching it via "ps",=0A= it would switch from "sbwait" to "nanoslp" after 6 minutes and=0A= a syslog() call showed that SSL_connect() had returned -1.=0A= =0A= So, if the TCP connection was "established", what caused the SSL_connect()= =0A= to return with an error (-1) after 6 minutes?=0A= =0A= Now, there is a 6 minute idle timeout in the RPC code for TCP where it,=0A= by default, closes the connection when there is 6 minutes without any=0A= activity. (I have to look if waiting for a reply for the upcall implies "no= activity" and if=0A= this also happens for AF_LOCAL sockets, which is what the upcalls use.)=0A= =0A= Now, if that happens, a SIGPIPE would be posted to the daemon, which=0A= is SIG_IGN'd by the daemon. But maybe the SIGPIPE somehow causes=0A= SSL_connect() to return -1 by making the syscall it is doing (read/recv on = the=0A= TCP socket sitting in sbwait) return EINTR, or something like that?=0A= =0A= I can change this 6minute timeout to see if that affects it.=0A= =0A= When you've got upcalls and library functions both talking to sockets it=0A= can get interesting.=0A= =0A= Thanks for the comments, rick=0A= =0A= -Ben=0A= =0A=
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?QB1PR01MB3364FE7A60B953C2D730E6F3DD6C0>