Date: Fri, 11 Dec 2020 16:35:22 -0700
From: Alan Somers <asomers@freebsd.org>
To: Rick Macklem <rmacklem@uoguelph.ca>
Cc: J David <j.david.lists@gmail.com>, "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject: Re: Major issues with nfsv4
Message-ID: <CAOtMX2i1S7TtF-X1oYLhaOGQ4-RnEgR+BSLzzNchZGiHBzE0Ug@mail.gmail.com>
In-Reply-To: <YQXPR0101MB09680D155B6D685442B5E25EDDCA0@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM>
References: <CABXB=RRB2nUk0pPDisBQPdicUA3ooHpg8QvBwjG_nFU4cHvCYw@mail.gmail.com>
 <YQXPR0101MB096849ADF24051F7479E565CDDCA0@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM>
 <CABXB=RSyN+o2yXcpmYw8sCSUUDhN-w28Vu9v_cCWa-2=pLZmHg@mail.gmail.com>
 <YQXPR0101MB09680D155B6D685442B5E25EDDCA0@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM>
On Fri, Dec 11, 2020 at 4:28 PM Rick Macklem <rmacklem@uoguelph.ca> wrote:

> J David wrote:
> >Unfortunately, switching the FreeBSD NFS clients to NFSv4.1 did not
> >resolve our issue. But I've narrowed down the problem to a harmful
> >interaction between NFSv4 and nullfs.
> I am afraid I know nothing about nullfs and jails. I suspect it will be
> something related to when file descriptors in the NFS client mount
> get closed.
>
> The NFSv4 Open is a Windows Open lock and has nothing to do with
> a POSIX open. Since only one of these can exist for each
> <client process, file> tuple, the NFSv4 Close must be delayed until
> all POSIX Opens on the file have been closed, including open file
> descriptors inherited by child processes.
>

Does it make a difference whether the files are opened read-only or
read-write? My longstanding practice has been to never use NFS to store
object files while compiling. I do that for performance reasons, and I
didn't think that nullfs had anything to do with it (but maybe it does).
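To make the descriptor-lifetime point concrete: a single NFSv4 Open backs
every descriptor that open(2), dup(2), or fork(2) produces for a file, so
the client can send the over-the-wire Close only after the last of them is
closed. A minimal userland sketch (the path is hypothetical, and this shows
only the POSIX side of the pattern, not the client internals):

    /*
     * Minimal sketch: a file descriptor inherited across fork() keeps
     * the file open, so an NFSv4 client must delay the over-the-wire
     * Close until the last POSIX close, not the first one.
     */
    #include <sys/wait.h>

    #include <err.h>
    #include <fcntl.h>
    #include <unistd.h>

    int
    main(void)
    {
        int fd;
        pid_t pid;

        /* Hypothetical file on an NFSv4 mount. */
        if ((fd = open("/mnt/nfs/somefile", O_RDWR)) == -1)
            err(1, "open");
        if ((pid = fork()) == -1)
            err(1, "fork");
        if (pid == 0) {
            /* Child inherits fd; the single NFSv4 Open stays live. */
            sleep(10);
            close(fd);  /* Last POSIX close; only now may the client
                           issue the NFSv4 Close over the wire. */
            _exit(0);
        }
        close(fd);      /* Parent's close alone is not enough. */
        waitpid(pid, NULL, 0);
        return (0);
    }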
>
> Someone else recently reported problems using nullfs and vnet jails.
>
> >These FreeBSD NFS clients form a pool of application servers that run
> >jobs for the application. A given job needs read-write access to its
> >data and read-only access to the set of binaries it needs to run.
> >
> >The job data is horizontally partitioned across a set of directory
> >trees spread over one set of NFS servers. A separate set of NFS
> >servers store the read-only binary roots.
> >
> >The jobs are assigned to these machines by a scheduler. A job might
> >take five milliseconds or five days.
> >
> >Historically, we have mounted the job data trees and the various
> >binary roots on each application server over NFSv3. When a job
> >starts, its setup binds the needed data and binaries into a jail via
> >nullfs, then runs the job in the jail. This approach has worked
> >perfectly for 10+ years.
> Well, NFSv3 is not going away any time soon, so if you don't need
> any of the additional features it offers...
>
> >After I switched a server to NFSv4.1 to test that recommendation, it
> >started having the same load problems as NFSv4. As a test, I altered
> >it to mount NFS directly in the jails for both the data and the
> >binaries. As "nullfs-NFS" jobs finished and "direct NFS" jobs
> >started, the load and CPU usage started to fall dramatically.
> Good work isolating the problem. I may try playing with NFSv4/nullfs
> someday soon and see if I can break it.
>
> >The critical problem with this approach is that privileged TCP ports
> >are a finite resource. At two per job, this creates two issues.
> >
> >First, there's a hard limit on simultaneous jobs per server that is
> >inconsistent with the hardware's capabilities. Second, due to
> >TIME_WAIT, it places a hard limit on job throughput. In practice,
> >these limits also interfere with each other; the more simultaneous
> >long jobs are running, the more impact TIME_WAIT has on short job
> >throughput.
> >
> >While it's certainly possible to configure NFS not to require reserved
> >ports, the slightest possibility of a non-root user establishing a
> >session to the NFS server kills that as an option.
> Personally, I've never thought the reserved port# requirement provided
> any real security for most situations. Unless you set "vfs.usermount=1",
> only root can do the mount. For non-root to mount the NFS server
> when "vfs.usermount=0", a user would have to run their own custom hacked
> userland NFS client. Although doable, I have never heard of it being done.
>

There are a few out there. For example, https://github.com/sahlberg/libnfs .

>
> rick
>
> Turning down TIME_WAIT helps, though the ability to do that only on
> the interface facing the NFS server would be more palatable than doing
> it globally.
>
> Adjusting net.inet.ip.portrange.lowlast does not seem to help. The
> code at sys/nfs/krpc_subr.c correctly uses ports between
> IPPORT_RESERVED and IPPORT_RESERVED/2 instead of ipport_lowfirstauto
> and ipport_lowlastauto. But is that the correct place to look for
> NFSv4.1?
>
> How explosive would adding SO_REUSEADDR to the NFS client be? It's
> not a full solution, but it would handle the TIME_WAIT side of the
> issue.
>
> Even so, there may be no workaround for the simultaneous mount limit
> as long as reserved ports are required. Solving the negative
> interaction with nullfs seems like the only long-term fix.
>
> What would be a good next step there?
>
> Thanks!
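On the SO_REUSEADDR question above, a minimal userland sketch of the
pattern being proposed: binding a reserved port with SO_REUSEADDR set.
The in-kernel RPC client binds through kernel socket interfaces rather
than these calls, so this is only an analogy, and the port number is an
arbitrary example:

    /*
     * Userland analogy for the SO_REUSEADDR question: allow bind(2) to
     * a reserved port even while an old connection from that port sits
     * in TIME_WAIT. connect(2) can still fail with EADDRINUSE if the
     * exact 4-tuple is in TIME_WAIT, which is why this alone is not a
     * full fix.
     */
    #include <sys/socket.h>

    #include <netinet/in.h>

    #include <err.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        struct sockaddr_in sin;
        int s, one = 1;

        if ((s = socket(AF_INET, SOCK_STREAM, 0)) == -1)
            err(1, "socket");
        if (setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one,
            sizeof(one)) == -1)
            err(1, "setsockopt");

        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_len = sizeof(sin);      /* FreeBSD-specific length field */
        sin.sin_port = htons(1000);     /* arbitrary port < IPPORT_RESERVED;
                                           binding it requires root */
        sin.sin_addr.s_addr = htonl(INADDR_ANY);
        if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1)
            err(1, "bind");
        /* connect(2) to the NFS server would follow here. */
        close(s);
        return (0);
    }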