Date: Fri, 11 Dec 2020 16:08:10 -0700
From: Alan Somers <asomers@freebsd.org>
To: J David <j.david.lists@gmail.com>
Cc: Rick Macklem <rmacklem@uoguelph.ca>, "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject: Re: Major issues with nfsv4
Message-ID: <CAOtMX2h82vU6Tt5eOCCEz=iOGhxHdL1XBnjvCTqqFEsaSMTjaA@mail.gmail.com>
In-Reply-To: <CABXB=RSyN%2Bo2yXcpmYw8sCSUUDhN-w28Vu9v_cCWa-2=pLZmHg@mail.gmail.com>
References: <CABXB=RRB2nUk0pPDisBQPdicUA3ooHpg8QvBwjG_nFU4cHvCYw@mail.gmail.com> <YQXPR0101MB096849ADF24051F7479E565CDDCA0@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM> <CABXB=RSyN%2Bo2yXcpmYw8sCSUUDhN-w28Vu9v_cCWa-2=pLZmHg@mail.gmail.com>
On Fri, Dec 11, 2020 at 2:52 PM J David <j.david.lists@gmail.com> wrote:
> Unfortunately, switching the FreeBSD NFS clients to NFSv4.1 did not
> resolve our issue. But I've narrowed the problem down to a harmful
> interaction between NFSv4 and nullfs.
>
> These FreeBSD NFS clients form a pool of application servers that run
> jobs for the application. A given job needs read-write access to its
> data and read-only access to the set of binaries it needs to run.
>
> The job data is horizontally partitioned across a set of directory
> trees spread over one set of NFS servers. A separate set of NFS
> servers stores the read-only binary roots.
>
> The jobs are assigned to these machines by a scheduler. A job might
> take five milliseconds or five days.
>
> Historically, we have mounted the job data trees and the various
> binary roots on each application server over NFSv3. When a job
> starts, its setup binds the needed data and binaries into a jail via
> nullfs, then runs the job in the jail. This approach has worked
> perfectly for 10+ years.
>
> After I switched a server to NFSv4.1 to test that recommendation, it
> started having the same load problems as NFSv4. As a test, I altered
> it to mount NFS directly in the jails for both the data and the
> binaries. As "nullfs-NFS" jobs finished and "direct NFS" jobs
> started, the load and CPU usage fell dramatically.
>
> The critical problem with this approach is that privileged TCP ports
> are a finite resource. At two per job, this creates two issues.
>
> First, it places a hard limit on simultaneous jobs per server that is
> inconsistent with the hardware's capabilities. Second, due to
> TIME_WAIT, it places a hard limit on job throughput. In practice,
> these limits also interfere with each other: the more simultaneous
> long jobs are running, the more impact TIME_WAIT has on short-job
> throughput.
>
> While it's certainly possible to configure NFS not to require reserved
> ports, the slightest possibility of a non-root user establishing a
> session to the NFS server kills that as an option.
>
> Turning down TIME_WAIT helps, though the ability to do that only on
> the interface facing the NFS server would be more palatable than doing
> it globally.
>
> Adjusting net.inet.ip.portrange.lowlast does not seem to help. The
> code at sys/nfs/krpc_subr.c correctly uses ports between
> IPPORT_RESERVED and IPPORT_RESERVED/2 instead of ipport_lowfirstauto
> and ipport_lowlastauto. But is that the correct place to look for
> NFSv4.1?
>
> How explosive would adding SO_REUSEADDR to the NFS client be? It's
> not a full solution, but it would handle the TIME_WAIT side of the
> issue.
>
> Even so, there may be no workaround for the simultaneous-mount limit
> as long as reserved ports are required. Solving the negative
> interaction with nullfs seems like the only long-term fix.
>
> What would be a good next step there?
>
> Thanks!

That's some good information. However, it must not be the whole story.
I've been nullfs-mounting my NFS mounts for years. For example, right
now on a FreeBSD 12.2-RC2 machine:

> sudo nfsstat -m
Password:
192.168.0.2:/home on /usr/home
nfsv4,minorversion=1,tcp,resvport,soft,cto,sec=sys,acdirmin=3,acdirmax=60,acregmin=5,acregmax=60,nametimeo=60,negnametimeo=60,rsize=65536,wsize=65536,readdirsize=65536,readahead=1,wcommitsize=16777216,timeout=120,retrans=2147483647
> mount | grep home
192.168.0.2:/home on /usr/home (nfs, nfsv4acls)
/usr/home on /iocage/jails/rustup2/root/usr/home (nullfs)

Are you using any mount options with nullfs? It might be worth trying
to make the read-only mount read-write, to see if that helps. And what
does "jls -n" show?
-Alan