Date: Fri, 11 Dec 2020 23:28:30 +0000
From: Rick Macklem <rmacklem@uoguelph.ca>
To: J David <j.david.lists@gmail.com>
Cc: "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject: Re: Major issues with nfsv4
Message-ID: <YQXPR0101MB09680D155B6D685442B5E25EDDCA0@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <CABXB=RSyN%2Bo2yXcpmYw8sCSUUDhN-w28Vu9v_cCWa-2=pLZmHg@mail.gmail.com>
References: <CABXB=RRB2nUk0pPDisBQPdicUA3ooHpg8QvBwjG_nFU4cHvCYw@mail.gmail.com>
 <YQXPR0101MB096849ADF24051F7479E565CDDCA0@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM>
 <CABXB=RSyN%2Bo2yXcpmYw8sCSUUDhN-w28Vu9v_cCWa-2=pLZmHg@mail.gmail.com>
J David wrote:
>Unfortunately, switching the FreeBSD NFS clients to NFSv4.1 did not
>resolve our issue. But I've narrowed down the problem to a harmful
>interaction between NFSv4 and nullfs.
I am afraid I know nothing about nullfs and jails. I suspect it will be
something related to when file descriptors in the NFS client mount get
closed.

The NFSv4 Open is a Windows Open lock and has nothing to do with a
POSIX open. Since only one of these can exist for each
<client process, file> tuple, the NFSv4 Close must be delayed until all
POSIX Opens on the file have been closed, including open file
descriptors inherited by child processes.

Someone else recently reported problems using nullfs and vnet jails.

>These FreeBSD NFS clients form a pool of application servers that run
>jobs for the application. A given job needs read-write access to its
>data and read-only access to the set of binaries it needs to run.
>
>The job data is horizontally partitioned across a set of directory
>trees spread over one set of NFS servers. A separate set of NFS
>servers stores the read-only binary roots.
>
>The jobs are assigned to these machines by a scheduler. A job might
>take five milliseconds or five days.
>
>Historically, we have mounted the job data trees and the various
>binary roots on each application server over NFSv3. When a job
>starts, its setup binds the needed data and binaries into a jail via
>nullfs, then runs the job in the jail. This approach has worked
>perfectly for 10+ years.
Well, NFSv3 is not going away any time soon, so if you don't need any
of the additional features NFSv4 offers...

>After I switched a server to NFSv4.1 to test that recommendation, it
>started having the same load problems as NFSv4. As a test, I altered
>it to mount NFS directly in the jails for both the data and the
>binaries. As "nullfs-NFS" jobs finished and "direct NFS" jobs
>started, the load and CPU usage started to fall dramatically.
Good work isolating the problem. I may try playing with NFSv4/nullfs
someday soon and see if I can break it.

>The critical problem with this approach is that privileged TCP ports
>are a finite resource. At two per job, this creates two issues.
>
>First, it places a hard limit on simultaneous jobs per server that is
>inconsistent with the hardware's capabilities. Second, due to
>TIME_WAIT, it places a hard limit on job throughput. In practice,
>these limits also interfere with each other; the more simultaneous
>long jobs are running, the more impact TIME_WAIT has on short job
>throughput.
>
>While it's certainly possible to configure NFS not to require reserved
>ports, the slightest possibility of a non-root user establishing a
>session to the NFS server kills that as an option.
Personally, I've never thought the reserved port# requirement provided
any real security for most situations. Unless you set
"vfs.usermount=1", only root can do the mount. For non-root to mount
the NFS server when "vfs.usermount=0", a user would have to run their
own custom hacked userland NFS client. Although doable, I have never
heard of it being done.

rick

>Turning down TIME_WAIT helps, though the ability to do that only on
>the interface facing the NFS server would be more palatable than
>doing it globally. Adjusting net.inet.ip.portrange.lowlast does not
>seem to help.
>
>The code at sys/nfs/krpc_subr.c correctly uses ports between
>IPPORT_RESERVED and IPPORT_RESERVED/2 instead of ipport_lowfirstauto
>and ipport_lowlastauto. But is that the correct place to look for
>NFSv4.1?
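I can't say off the top of my head where the NFSv4.1 client's TCP
connection gets its port bound, so take the krpc_subr.c reference above
with a grain of salt. The arithmetic is the same wherever the bind
happens, though: a range from IPPORT_RESERVED/2 (512) up to
IPPORT_RESERVED (1024) is only about 512 ports per client address, so
two connections per job plus TIME_WAIT will run through it quickly.
Purely as an illustration of that style of binding loop (a userland
sketch, not the kernel RPC code; bind_resv() is a made-up helper name):

/*
 * Userland sketch only -- not the in-kernel NFS/RPC code. Walks down
 * from IPPORT_RESERVED - 1 to IPPORT_RESERVED / 2 looking for a free
 * low port, which is all the "reserved port" pool amounts to.
 * Needs root to bind ports below 1024.
 */
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define RESV_HI 1024		/* IPPORT_RESERVED */
#define RESV_LO (1024 / 2)	/* IPPORT_RESERVED / 2 */

static int
bind_resv(int s)		/* made-up helper name, for illustration */
{
	struct sockaddr_in sin;
	int port;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_len = sizeof(sin);
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	for (port = RESV_HI - 1; port >= RESV_LO; port--) {
		sin.sin_port = htons(port);
		if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == 0)
			return (port);	/* grabbed one of the ~512 ports */
	}
	return (-1);	/* all of them bound or sitting in TIME_WAIT */
}

int
main(void)
{
	int s, port;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s < 0) {
		perror("socket");
		return (1);
	}
	port = bind_resv(s);
	if (port < 0)
		fprintf(stderr, "no reserved port available\n");
	else
		printf("bound to port %d\n", port);
	close(s);
	return (0);
}

Userland RPC code has bindresvport(3) for the same purpose, which is
why every reserved-port consumer on the box ends up competing for the
same small pool.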
>How explosive would adding SO_REUSEADDR to the NFS client be? It's
>not a full solution, but it would handle the TIME_WAIT side of the
>issue. Even so, there may be no workaround for the simultaneous
>mount limit as long as reserved ports are required. Solving the
>negative interaction with nullfs seems like the only long-term fix.
>What would be a good next step there?
>
>Thanks!
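I haven't tried it, so I won't promise it isn't explosive. In userland
terms it is just a setsockopt() before the bind(); the real change
would have to go wherever the kernel RPC client creates and binds its
socket, which I'd have to go look up. Here is only the userland
equivalent, as a sketch (the port number is arbitrary):

/*
 * Userland sketch of what SO_REUSEADDR buys you: bind() to a low port
 * that may still have a connection in TIME_WAIT succeeds instead of
 * failing with EADDRINUSE. Not a patch to the NFS client.
 */
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	struct sockaddr_in sin;
	int s, on = 1;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s < 0) {
		perror("socket");
		return (1);
	}
	/* The one-line change being asked about, expressed in userland. */
	if (setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) < 0) {
		perror("setsockopt");
		return (1);
	}
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_len = sizeof(sin);
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	sin.sin_port = htons(1000);	/* arbitrary reserved port for the example */
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) < 0)
		perror("bind");		/* EADDRINUSE if the port is truly in use */
	else
		printf("bound to port 1000\n");
	close(s);
	return (0);
}

Note that even with SO_REUSEADDR a connect() can still fail if the
exact same 4-tuple (local addr/port, server addr/port) is sitting in
TIME_WAIT, so I would expect it to help but not make the TIME_WAIT
problem disappear entirely.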
