Date: Fri, 11 Dec 2020 23:28:30 +0000
From: Rick Macklem <rmacklem@uoguelph.ca>
To: J David <j.david.lists@gmail.com>
Cc: "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject: Re: Major issues with nfsv4
Message-ID: <YQXPR0101MB09680D155B6D685442B5E25EDDCA0@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <CABXB=RSyN+o2yXcpmYw8sCSUUDhN-w28Vu9v_cCWa-2=pLZmHg@mail.gmail.com>
References: <CABXB=RRB2nUk0pPDisBQPdicUA3ooHpg8QvBwjG_nFU4cHvCYw@mail.gmail.com>
 <YQXPR0101MB096849ADF24051F7479E565CDDCA0@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM>
 <CABXB=RSyN+o2yXcpmYw8sCSUUDhN-w28Vu9v_cCWa-2=pLZmHg@mail.gmail.com>
J David wrote:
>Unfortunately, switching the FreeBSD NFS clients to NFSv4.1 did not
>resolve our issue. But I've narrowed down the problem to a harmful
>interaction between NFSv4 and nullfs.
I am afraid I know nothing about nullfs and jails. I suspect it will be
something related to when file descriptors in the NFS client mount
get closed.

The NFSv4 Open is a Windows Open lock and has nothing to do with
a POSIX open. Since only one of these can exist for each
<client process, file> tuple, the NFSv4 Close must be delayed until
all POSIX opens on the file have been closed, including open file
descriptors inherited by child processes.

Someone else recently reported problems using nullfs and vnet jails.

>These FreeBSD NFS clients form a pool of application servers that run
>jobs for the application. A given job needs read-write access to its
>data and read-only access to the set of binaries it needs to run.
>
>The job data is horizontally partitioned across a set of directory
>trees spread over one set of NFS servers. A separate set of NFS
>servers stores the read-only binary roots.
>
>The jobs are assigned to these machines by a scheduler. A job might
>take five milliseconds or five days.
>
>Historically, we have mounted the job data trees and the various
>binary roots on each application server over NFSv3. When a job
>starts, its setup binds the needed data and binaries into a jail via
>nullfs, then runs the job in the jail. This approach has worked
>perfectly for 10+ years.
Well, NFSv3 is not going away any time soon, so if you don't need
any of the additional features NFSv4 offers...

>After I switched a server to NFSv4.1 to test that recommendation, it
>started having the same load problems as NFSv4. As a test, I altered
>it to mount NFS directly in the jails for both the data and the
>binaries. As "nullfs-NFS" jobs finished and "direct NFS" jobs
>started, the load and CPU usage started to fall dramatically.
Good work isolating the problem. I may try playing with NFSv4/nullfs
someday soon and see if I can break it.

>The critical problem with this approach is that privileged TCP ports
>are a finite resource. At two per job, this creates two issues.
>
>First, there's a hard limit on simultaneous jobs per server that is
>inconsistent with the hardware's capabilities. Second, due to
>TIME_WAIT, it places a hard limit on job throughput. In practice,
>these limits also interfere with each other; the more simultaneous
>long jobs are running, the more impact TIME_WAIT has on short-job
>throughput.
>
>While it's certainly possible to configure NFS not to require reserved
>ports, the slightest possibility of a non-root user establishing a
>session to the NFS server kills that as an option.
Personally, I've never thought the reserved port# requirement provided
any real security for most situations. Unless you set "vfs.usermount=1",
only root can do the mount. For non-root to mount the NFS server
when "vfs.usermount=0", a user would have to run their own custom hacked
userland NFS client. Although doable, I have never heard of it being done.

rick
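
As an aside, here is a minimal sketch, assuming nothing beyond FreeBSD's
sysctlbyname(3), of reading the vfs.usermount knob mentioned above; the
program and its output strings are purely illustrative and not part of
any NFS code:

/*
 * Illustrative only: report whether non-root users may mount
 * filesystems on this host (vfs.usermount).  When it is 0, only
 * root can issue mount(2)/nmount(2), so reaching the NFS server
 * from an unreserved port would take a custom userland client.
 */
#include <sys/types.h>
#include <sys/sysctl.h>

#include <stdio.h>

int
main(void)
{
	int usermount;
	size_t len = sizeof(usermount);

	if (sysctlbyname("vfs.usermount", &usermount, &len, NULL, 0) == -1) {
		perror("sysctlbyname(vfs.usermount)");
		return (1);
	}
	printf("vfs.usermount = %d (%s)\n", usermount,
	    usermount ? "non-root mounts allowed" : "root-only mounts");
	return (0);
}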
>Turning down TIME_WAIT helps, though the ability to do that only on
>the interface facing the NFS server would be more palatable than doing
>it globally.
>
>Adjusting net.inet.ip.portrange.lowlast does not seem to help. The
>code at sys/nfs/krpc_subr.c correctly uses ports between
>IPPORT_RESERVED and IPPORT_RESERVED/2 instead of ipport_lowfirstauto
>and ipport_lowlastauto. But is that the correct place to look for
>NFSv4.1?
>
>How explosive would adding SO_REUSEADDR to the NFS client be? It's
>not a full solution, but it would handle the TIME_WAIT side of the
>issue.
>
>Even so, there may be no workaround for the simultaneous mount limit
>as long as reserved ports are required. Solving the negative
>interaction with nullfs seems like the only long-term fix.
>
>What would be a good next step there?
>
>Thanks!
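
To make the port-range and TIME_WAIT mechanics in the questions above
concrete, here is a minimal userland sketch; bind_reserved() is a made-up
illustrative helper, not anything from sys/nfs or sys/rpc, and the only
symbols taken from the discussion are IPPORT_RESERVED and SO_REUSEADDR:

/*
 * Illustrative only: bind a socket to a port in the reserved range
 * (IPPORT_RESERVED/2 .. IPPORT_RESERVED-1, i.e. 512..1023) with
 * SO_REUSEADDR set, so the bind(2) can succeed even if that port is
 * still held by a socket in TIME_WAIT.  Binding below 1024 normally
 * requires root.
 */
#include <sys/types.h>
#include <sys/socket.h>

#include <netinet/in.h>

#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int
bind_reserved(int s)
{
	struct sockaddr_in sin;
	int on = 1;
	u_short port;

	/* Allow reuse of a local port still in TIME_WAIT. */
	if (setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) == -1)
		return (-1);

	memset(&sin, 0, sizeof(sin));
	sin.sin_len = sizeof(sin);
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);

	/* Try each port in the reserved range until one binds. */
	for (port = IPPORT_RESERVED - 1; port >= IPPORT_RESERVED / 2; port--) {
		sin.sin_port = htons(port);
		if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == 0)
			return (0);
	}
	return (-1);
}

int
main(void)
{
	int s = socket(AF_INET, SOCK_STREAM, 0);

	if (s == -1 || bind_reserved(s) == -1) {
		perror("bind_reserved");
		return (1);
	}
	printf("bound a reserved port\n");
	close(s);
	return (0);
}

None of this changes the in-kernel client, of course; the real work would
be in the kernel RPC code, but the sketch shows why the pool of usable
ports is so small and what SO_REUSEADDR would buy on the TIME_WAIT side.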