Date: Fri, 11 Dec 2020 23:28:30 +0000
From: Rick Macklem <rmacklem@uoguelph.ca>
To: J David <j.david.lists@gmail.com>
Cc: "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject: Re: Major issues with nfsv4
Message-ID: <YQXPR0101MB09680D155B6D685442B5E25EDDCA0@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <CABXB=RSyN+o2yXcpmYw8sCSUUDhN-w28Vu9v_cCWa-2=pLZmHg@mail.gmail.com>
References: <CABXB=RRB2nUk0pPDisBQPdicUA3ooHpg8QvBwjG_nFU4cHvCYw@mail.gmail.com>
 <YQXPR0101MB096849ADF24051F7479E565CDDCA0@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM>
 <CABXB=RSyN+o2yXcpmYw8sCSUUDhN-w28Vu9v_cCWa-2=pLZmHg@mail.gmail.com>
J David wrote:
>Unfortunately, switching the FreeBSD NFS clients to NFSv4.1 did not
>resolve our issue. But I've narrowed down the problem to a harmful
>interaction between NFSv4 and nullfs.
I am afraid I know nothing about nullfs and jails. I suspect it will be
something related to when file descriptors in the NFS client mount
get closed.

The NFSv4 Open is a Windows Open lock and has nothing to do with
a POSIX open. Since only one of these can exist for each
<client process, file> tuple, the NFSv4 Close must be delayed until
all POSIX opens on the file have been closed, including open file
descriptors inherited by child processes.

Someone else recently reported problems using nullfs and vnet jails.

>These FreeBSD NFS clients form a pool of application servers that run
>jobs for the application. A given job needs read-write access to its
>data and read-only access to the set of binaries it needs to run.
>
>The job data is horizontally partitioned across a set of directory
>trees spread over one set of NFS servers. A separate set of NFS
>servers stores the read-only binary roots.
>
>The jobs are assigned to these machines by a scheduler. A job might
>take five milliseconds or five days.
>
>Historically, we have mounted the job data trees and the various
>binary roots on each application server over NFSv3. When a job
>starts, its setup binds the needed data and binaries into a jail via
>nullfs, then runs the job in the jail. This approach has worked
>perfectly for 10+ years.
Well, NFSv3 is not going away any time soon, so if you don't need
any of the additional features NFSv4 offers...

>After I switched a server to NFSv4.1 to test that recommendation, it
>started having the same load problems as NFSv4. As a test, I altered
>it to mount NFS directly in the jails for both the data and the
>binaries. As "nullfs-NFS" jobs finished and "direct NFS" jobs
>started, the load and CPU usage started to fall dramatically.
Good work isolating the problem. I may try playing with NFSv4/nullfs
someday soon and see if I can break it.

>The critical problem with this approach is that privileged TCP ports
>are a finite resource. At two per job, this creates two issues.
>
>First, there's a hard limit on simultaneous jobs per server that is
>inconsistent with the hardware's capabilities. Second, due to
>TIME_WAIT, it places a hard limit on job throughput. In practice,
>these limits also interfere with each other; the more simultaneous
>long jobs are running, the more impact TIME_WAIT has on short-job
>throughput.
>
>While it's certainly possible to configure NFS not to require reserved
>ports, the slightest possibility of a non-root user establishing a
>session to the NFS server kills that as an option.
Personally, I've never thought the reserved port# requirement provided
any real security for most situations. Unless you set "vfs.usermount=1",
only root can do the mount. For non-root to mount the NFS server
when "vfs.usermount=0", a user would have to run their own custom hacked
userland NFS client. Although doable, I have never heard of it being done.

rick
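
As an aside, here is a minimal sketch, assuming nothing beyond FreeBSD's
sysctlbyname(3), of reading the vfs.usermount knob mentioned above; the
program and its output strings are purely illustrative and not part of
any NFS code:

/*
 * Illustrative only: report whether non-root users may mount
 * filesystems on this host (vfs.usermount).  When it is 0, only
 * root can issue mount(2)/nmount(2), so reaching the NFS server
 * from an unreserved port would take a custom userland client.
 */
#include <sys/types.h>
#include <sys/sysctl.h>

#include <stdio.h>

int
main(void)
{
	int usermount;
	size_t len = sizeof(usermount);

	if (sysctlbyname("vfs.usermount", &usermount, &len, NULL, 0) == -1) {
		perror("sysctlbyname(vfs.usermount)");
		return (1);
	}
	printf("vfs.usermount = %d (%s)\n", usermount,
	    usermount ? "non-root mounts allowed" : "root-only mounts");
	return (0);
}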
>Turning down TIME_WAIT helps, though the ability to do that only on
>the interface facing the NFS server would be more palatable than doing
>it globally.
>
>Adjusting net.inet.ip.portrange.lowlast does not seem to help. The
>code at sys/nfs/krpc_subr.c correctly uses ports between
>IPPORT_RESERVED and IPPORT_RESERVED/2 instead of ipport_lowfirstauto
>and ipport_lowlastauto. But is that the correct place to look for
>NFSv4.1?
>
>How explosive would adding SO_REUSEADDR to the NFS client be? It's
>not a full solution, but it would handle the TIME_WAIT side of the
>issue.
>
>Even so, there may be no workaround for the simultaneous mount limit
>as long as reserved ports are required. Solving the negative
>interaction with nullfs seems like the only long-term fix.
>
>What would be a good next step there?
>
>Thanks!
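
To make the port-range and TIME_WAIT mechanics in the questions above
concrete, here is a minimal userland sketch; bind_reserved() is a made-up
illustrative helper, not anything from sys/nfs or sys/rpc, and the only
symbols taken from the discussion are IPPORT_RESERVED and SO_REUSEADDR:

/*
 * Illustrative only: bind a socket to a port in the reserved range
 * (IPPORT_RESERVED/2 .. IPPORT_RESERVED-1, i.e. 512..1023) with
 * SO_REUSEADDR set, so the bind(2) can succeed even if that port is
 * still held by a socket in TIME_WAIT.  Binding below 1024 normally
 * requires root.
 */
#include <sys/types.h>
#include <sys/socket.h>

#include <netinet/in.h>

#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int
bind_reserved(int s)
{
	struct sockaddr_in sin;
	int on = 1;
	u_short port;

	/* Allow reuse of a local port still in TIME_WAIT. */
	if (setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) == -1)
		return (-1);

	memset(&sin, 0, sizeof(sin));
	sin.sin_len = sizeof(sin);
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);

	/* Try each port in the reserved range until one binds. */
	for (port = IPPORT_RESERVED - 1; port >= IPPORT_RESERVED / 2; port--) {
		sin.sin_port = htons(port);
		if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == 0)
			return (0);
	}
	return (-1);
}

int
main(void)
{
	int s = socket(AF_INET, SOCK_STREAM, 0);

	if (s == -1 || bind_reserved(s) == -1) {
		perror("bind_reserved");
		return (1);
	}
	printf("bound a reserved port\n");
	close(s);
	return (0);
}

None of this changes the in-kernel client, of course; the real work would
be in the kernel RPC code, but the sketch shows why the pool of usable
ports is so small and what SO_REUSEADDR would buy on the TIME_WAIT side.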