Date: Fri, 11 Dec 2020 16:52:16 -0500
From: J David <j.david.lists@gmail.com>
To: Rick Macklem <rmacklem@uoguelph.ca>
Cc: "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject: Re: Major issues with nfsv4
Message-ID: <CABXB=RSyN%2Bo2yXcpmYw8sCSUUDhN-w28Vu9v_cCWa-2=pLZmHg@mail.gmail.com>
In-Reply-To: <YQXPR0101MB096849ADF24051F7479E565CDDCA0@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM>
References: <CABXB=RRB2nUk0pPDisBQPdicUA3ooHpg8QvBwjG_nFU4cHvCYw@mail.gmail.com>
 <YQXPR0101MB096849ADF24051F7479E565CDDCA0@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM>
Unfortunately, switching the FreeBSD NFS clients to NFSv4.1 did not resolve our issue. But I've narrowed the problem down to a harmful interaction between NFSv4 and nullfs.

These FreeBSD NFS clients form a pool of application servers that run jobs for the application. A given job needs read-write access to its data and read-only access to the set of binaries it needs to run. The job data is horizontally partitioned across a set of directory trees spread over one set of NFS servers. A separate set of NFS servers stores the read-only binary roots. Jobs are assigned to these machines by a scheduler; a job might take five milliseconds or five days.

Historically, we have mounted the job data trees and the various binary roots on each application server over NFSv3. When a job starts, its setup binds the needed data and binaries into a jail via nullfs, then runs the job in the jail. This approach has worked perfectly for 10+ years.

After I switched a server to NFSv4.1 to test that recommendation, it started having the same load problems as NFSv4. As a test, I altered it to mount NFS directly in the jails for both the data and the binaries. As the "nullfs-NFS" jobs finished and the "direct NFS" jobs started, load and CPU usage fell dramatically.

The critical problem with this approach is that privileged TCP ports are a finite resource. At two per job (one mount for its data, one for its binaries), this creates two issues. First, it puts a hard cap on simultaneous jobs per server that has nothing to do with the hardware's actual capacity. Second, because of TIME_WAIT, it puts a hard cap on job throughput. In practice the two limits also interfere with each other: the more long jobs are running simultaneously, the more impact TIME_WAIT has on short-job throughput. (Rough numbers in the P.S. below.)

While it's certainly possible to configure NFS not to require reserved ports, the slightest possibility of a non-root user establishing a session to the NFS server kills that as an option.

Turning down TIME_WAIT helps, though it would be more palatable to do that only on the interface facing the NFS servers rather than globally.

Adjusting net.inet.ip.portrange.lowlast does not seem to help. The code at sys/nfs/krpc_subr.c correctly uses ports between IPPORT_RESERVED and IPPORT_RESERVED/2 rather than ipport_lowfirstauto and ipport_lowlastauto, but is that the right place to look for NFSv4.1?

How explosive would adding SO_REUSEADDR to the NFS client be? It's not a full solution, but it would handle the TIME_WAIT side of the issue. Even so, there may be no workaround for the simultaneous-mount limit as long as reserved ports are required, so solving the negative interaction with nullfs seems like the only long-term fix. What would be a good next step there?

Thanks!
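
P.S. To put rough numbers on the port math (back-of-envelope, using the defaults as I understand them, not measurements): the range krpc_subr.c draws from, IPPORT_RESERVED/2 up to IPPORT_RESERVED, is ports 512-1023, so at most 512 ports, minus whatever other services already hold. At two connections per job, that is a ceiling of roughly 256 simultaneous jobs per server. And if TIME_WAIT is 2*MSL (60 seconds with the default net.inet.tcp.msl of 30000), port turnover caps sustained job starts at something like 512/60, call it 8-9 per second per server, no matter how short the jobs are. Happy to be corrected if I have those constants wrong.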
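
And to make sure I'm asking the SO_REUSEADDR question clearly, here is a toy userland sketch of the mechanism I mean (not the actual kernel RPC code): set SO_REUSEADDR on the socket, then walk down the same reserved range krpc_subr.c uses until bind(2) succeeds. Roughly what bindresvport(3) does, plus the socket option I'm asking about. Needs root, of course, since the ports are below 1024.

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <err.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    struct sockaddr_in sin;
    int s, on = 1, port;

    if ((s = socket(AF_INET, SOCK_STREAM, 0)) == -1)
        err(1, "socket");

    /* The part I'm asking about: allow reuse of a port stuck in TIME_WAIT. */
    if (setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) == -1)
        err(1, "setsockopt");

    memset(&sin, 0, sizeof(sin));
    sin.sin_len = sizeof(sin);
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);

    /* Same range krpc_subr.c uses: [IPPORT_RESERVED/2, IPPORT_RESERVED). */
    for (port = IPPORT_RESERVED - 1; port >= IPPORT_RESERVED / 2; port--) {
        sin.sin_port = htons(port);
        if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == 0) {
            printf("bound to reserved port %d\n", port);
            close(s);
            return (0);
        }
    }
    warnx("no reserved port available");
    close(s);
    return (1);
}

Obviously the in-kernel client wouldn't do it this way; this is just to pin down which behavior I'm proposing.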