Date: Tue, 30 Jun 2020 09:33:40 -0700
From: Benjamin Kaduk <kaduk@mit.edu>
To: Rick Macklem <rmacklem@uoguelph.ca>
Cc: Benjamin Kaduk <bjkfbsd@gmail.com>, Rick Macklem <rmacklem@freebsd.org>, src-committers <src-committers@freebsd.org>, "svn-src-projects@freebsd.org" <svn-src-projects@freebsd.org>
Subject: Re: svn commit: r362798 - in projects/nfs-over-tls/sys/rpc: . rpcsec_tls
Message-ID: <20200630163340.GN58278@kduck.mit.edu>
In-Reply-To: <QB1PR01MB336441A427B14216A4A20384DD6F0@QB1PR01MB3364.CANPRD01.PROD.OUTLOOK.COM>
References: <202006301449.05UEnq2x072917@repo.freebsd.org> <CAJ5_RoDe=_s2LZociYXTmdVOP%2BLJDA5HJ7jZkKr7LChffbaH8w@mail.gmail.com> <QB1PR01MB336441A427B14216A4A20384DD6F0@QB1PR01MB3364.CANPRD01.PROD.OUTLOOK.COM>
On Tue, Jun 30, 2020 at 04:20:45PM +0000, Rick Macklem wrote:
> Benjamin Kaduk wrote:
> >On Tue, Jun 30, 2020 at 7:49 AM Rick Macklem <rmacklem@freebsd.org> wrote:
> >Author: rmacklem
> >Date: Tue Jun 30 14:49:51 2020
> >New Revision: 362798
> >URL: https://svnweb.freebsd.org/changeset/base/362798
> >
> >Log:
> >  Testing when a server does not respond to TLS handshake records exposed
> >  a couple of problems, since the daemon would be in SSL_connect() for 6 minutes.
> >
> >  - When the upcall timed out and was retried, the RPCTLS_SYSC_CLSOCKET syscall
> >    was broken and did not return an error upon a retry. It allocated a file
> >    descriptor for a NULL socket.
> >  - The socket structure in the kernel could be free'd while the daemon was
> >    still using it in SSL_connect().
> >  - Adjust the timeout a retry count so that upcalls are only attempted once
> >    with a 10minute timeout.
> >
> >
> >10 minutes seems really long!  It sounds from the description like the upcall so that
> >userspace can run SSL_connect() was taking 6 minutes, and you needed 10 minutes so
> >as to be longer than the 6 minutes that is "out of your control"?
> Well, I think a long timeout here is ok, since a timeout indicates a broken daemon.
> (The upcalls to the local daemon should be reliable and cannot safely be redone.
> In a perfect world, the upcall mechanism would be "exactly once" instead of
> "at least once". I think an upcall might fail when the mbuf pool in the kernel
> is exhausted, but that should be rare.)
>
> >I feel like there should be some sockopts available to get the SSL_connect() timeout
> >down, so that the upcall timeout doesn't need to be so long, either.
> Yes, 6 minutes does seem like a long time. I only discovered this yesterday when
> I simulated a server that did not respond to handshake records.
>
> I haven't yet dug into the openssl code to see if there is a way to adjust this
> timeout.
> I also do not know what a good timeout value for SSL_connect() might be,
> even if the daemon can override the default.
>
> In practice, this should only happen when trying to do an NFS mount on
> a broken server which responds to the "STARTTLS" Null RPC, but does not
> do the handshake.
> Having the mount attempt stuck for 6minutes before failing is not that serious
> a problem, imho.
> (When systems boot after something like a power failure, delays getting NFS
> mounts done, due to the NFS server/network needing to be up, is fairly
> normal. The "-b" option to put the mount attempt in background has been
> around for a long time for this.)
>
> If you happen to know how to set a timeout for SSL_connect() in the openssl
> library, I would be interested in hearing that.

As it happens, I took a look before I wrote the initial note, and there
don't seem to be any intrinsic TLS (not DTLS) handshake timeouts in
libssl itself; I expect this is actually just the (kernel's!) TCP timeout.
So you'd be getting the socket fd (e.g., SSL_get_fd(), if you don't have a
reference already) and using setsockopt() to set the timeout(s).

-Ben
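A minimal sketch of the setsockopt() approach suggested above, under some assumptions: the thread does not say which socket options to use, so SO_RCVTIMEO and SO_SNDTIMEO are a guess at one reasonable choice, and the helper name set_io_timeout() is hypothetical. It only bounds how long each blocking read or write on the descriptor may wait; how SSL_connect() reports the resulting failure is not covered here.

```c
#include <sys/socket.h>
#include <sys/time.h>

/*
 * Hypothetical helper (not from the thread): limit how long blocking
 * reads and writes on fd may wait, so that a peer which never answers
 * handshake records cannot stall the caller for minutes.
 * SO_RCVTIMEO/SO_SNDTIMEO are an assumed choice of "timeout(s)".
 * Returns 0 on success, -1 with errno set on failure.
 */
static int
set_io_timeout(int fd, time_t seconds)
{
	struct timeval tv;

	tv.tv_sec = seconds;
	tv.tv_usec = 0;

	/* Bound the wait for incoming data (e.g. the peer's handshake reply). */
	if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) == -1)
		return (-1);
	/* Bound the wait for outgoing data as well. */
	if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) == -1)
		return (-1);
	return (0);
}
```

In the daemon this would presumably be called as something like set_io_timeout(SSL_get_fd(ssl), 30) before SSL_connect(), where ssl is the connection's SSL object; the 30 seconds is a placeholder, not a recommended value.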