Date:      Fri, 11 Dec 2020 16:52:16 -0500
From:      J David <j.david.lists@gmail.com>
To:        Rick Macklem <rmacklem@uoguelph.ca>
Cc:        "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject:   Re: Major issues with nfsv4
Message-ID:  <CABXB=RSyN+o2yXcpmYw8sCSUUDhN-w28Vu9v_cCWa-2=pLZmHg@mail.gmail.com>
In-Reply-To: <YQXPR0101MB096849ADF24051F7479E565CDDCA0@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM>
References:  <CABXB=RRB2nUk0pPDisBQPdicUA3ooHpg8QvBwjG_nFU4cHvCYw@mail.gmail.com> <YQXPR0101MB096849ADF24051F7479E565CDDCA0@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM>

Unfortunately, switching the FreeBSD NFS clients to NFSv4.1 did not
resolve our issue.  But I've narrowed down the problem to a harmful
interaction between NFSv4 and nullfs.

These FreeBSD NFS clients form a pool of application servers that run
jobs for the application.  A given job needs read-write access to its
data and read-only access to the set of binaries it needs to run.

The job data is horizontally partitioned across a set of directory
trees spread over one set of NFS servers.  A separate set of NFS
servers stores the read-only binary roots.

The jobs are assigned to these machines by a scheduler.  A job might
take five milliseconds or five days.

Historically, we have mounted the job data trees and the various
binary roots on each application server over NFSv3.  When a job
starts, its setup binds the needed data and binaries into a jail via
nullfs, then runs the job in the jail.  This approach has worked
perfectly for 10+ years.
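
Concretely, the per-job setup boils down to something like this (host
names and paths here are made up for illustration):

    # Host-wide NFSv3 mounts, done once at boot:
    mount -t nfs -o nfsv3 data1:/export/jobs   /nfs/data1
    mount -t nfs -o nfsv3 bin1:/export/binroot /nfs/binroot

    # Per-job setup: nullfs-bind just what the job needs into its jail:
    mount -t nullfs       /nfs/data1/job1234 /jails/job1234/data
    mount -t nullfs -o ro /nfs/binroot       /jails/job1234/bin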

After I switched a server to NFSv4.1 to test that recommendation, it
started having the same load problems we saw with NFSv4.0.  As a test, I altered
it to mount NFS directly in the jails for both the data and the
binaries.  As "nullfs-NFS" jobs finished and "direct NFS" jobs
started, the load and CPU usage started to fall dramatically.
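
In case it helps to see it, the "direct NFS" test jobs mount roughly
like this instead (again, names and paths are illustrative):

    # Mount NFS directly under the jail root, no nullfs layer:
    mount -t nfs -o nfsv4,minorversion=1    data1:/export/jobs/job1234 /jails/job1234/data
    mount -t nfs -o nfsv4,minorversion=1,ro bin1:/export/binroot       /jails/job1234/bin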

The critical problem with this approach is that privileged TCP ports
are a finite resource.  At two per job, this creates two issues.

First, it places a hard limit on simultaneous jobs per server that is
well below what the hardware could otherwise handle.  Second, due to
TIME_WAIT, it places a hard limit on job throughput.  In practice,
these limits also interfere with each other; the more simultaneous
long jobs are running, the more impact TIME_WAIT has on short-job
throughput.
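
To put rough numbers on that (assuming the client draws from the
512-1023 range discussed below): ~512 reserved ports at two per job is
at most ~256 concurrent jobs, and with TIME_WAIT holding a port for
2*MSL (60 seconds at the FreeBSD default, if I'm reading it right),
sustained throughput caps out somewhere around 8-9 short jobs per
second per server.  The current damage is easy enough to see:

    # TIME_WAIT lasts 2*MSL; msl is in milliseconds (default 30000 = 60s).
    sysctl net.inet.tcp.msl

    # How many TCP connections are sitting in TIME_WAIT right now:
    netstat -an -p tcp | grep -c TIME_WAIT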

While it's certainly possible to configure NFS not to require reserved
ports, the slightest possibility of a non-root user establishing a
session to the NFS server kills that as an option.
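
To be explicit about why that's off the table: as far as I can tell
the server-side knob is a sysctl along these lines (name from memory,
so treat it as approximate):

    # On the NFS servers: stop requiring client connections to originate
    # from reserved ports.  (Sysctl name quoted from memory.)
    sysctl vfs.nfsd.nfs_privport=0

But once that's off, any local user on any host that can reach the
server can speak NFS to it while claiming whatever credentials it
likes, which is exactly the exposure we can't accept.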

Turning down TIME_WAIT helps, though the ability to do that only on
the interface facing the NFS server would be more palatable than doing
it globally.
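
For reference, what I've been doing is the global version, roughly:

    # TIME_WAIT = 2*MSL, so this drops it from 60s to 10s system-wide.
    sysctl net.inet.tcp.msl=5000

which helps the NFS side but also shortens TIME_WAIT for every other
connection the machine makes.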

Adjusting net.inet.ip.portrange.lowlast does not seem to help.  The
code at sys/nfs/krpc_subr.c correctly uses ports between
IPPORT_RESERVED and IPPORT_RESERVED/2 instead of ipport_lowfirstauto
and ipport_lowlastauto.  But is that the correct place to look for
NFSv4.1?
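
For completeness, these are the sysctls I was adjusting, which the
kernel RPC code appears to bypass in favor of the hard-coded range:

    # The low port-range sysctls; krpc_subr.c does not appear to consult
    # these when it binds a reserved port.
    sysctl net.inet.ip.portrange.lowfirst net.inet.ip.portrange.lowlast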

How explosive would adding SO_REUSEADDR to the NFS client be?  It's
not a full solution, but it would handle the TIME_WAIT side of the
issue.

Even so, there may be no workaround for the simultaneous mount limit
as long as reserved ports are required.  Solving the negative
interaction with nullfs seems like the only long-term fix.

What would be a good next step there?

Thanks!


