Date:      Fri, 23 Oct 2015 08:04:23 -0400 (EDT)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        FreeBSD FS <freebsd-fs@freebsd.org>
Cc:        Josh Paetzel <josh@ixsystems.com>, Alexander Motin <mav@freebsd.org>,  ken@freebsd.org
Subject:   NFS FHA issue and possible change to the algorithm
Message-ID:  <1927021065.49882210.1445601863864.JavaMail.zimbra@uoguelph.ca>
In-Reply-To: <2144282175.49880310.1445601737139.JavaMail.zimbra@uoguelph.ca>

[-- Attachment #1 --]
Hi,

An off-list discussion occurred in which a site running an NFS server found that
they needed to disable File Handle Affinity (FHA) to get good performance.
Here is a re-post of some of that discussion (with Josh's permission).
First, what was observed w.r.t. the machine.
Josh Paetzel wrote:
>>>> It's all good.
>>>>
>>>> It's a 96GB RAM machine and I have 2 million nmbclusters, so 8GB RAM,
>>>> and we've tried 1024 NFS threads.
>>>>
>>>> It might be running out of network memory, but we can't really afford to
>>>> give it any more; for this use case, disabling FHA might end up being the
>>>> way to go.
>>>>
I wrote:
>>> Just to fill mav@ in, the person who reported a serious performance
>>> problem to Josh was able to fix it by disabling FHA.
Josh Paetzel wrote:
>>
>> There's about 300 virtual machines that mount root from a read only NFS
>> share.
>>
>> There are also another few hundred users who mount their home directories
>> over NFS.  When things go sideways it is always the virtual machines
>> that become unusable: 45 seconds to log in via ssh, 15 minutes to boot,
>> stuff like that.
>>
>> [root@head2] ~# nfsstat -s 1
>>  GtAttr Lookup Rdlink   Read  Write Rename Access  Rddir
>>    4117     17      0    124    689      4    680      0
>>    4750     31      5    121    815      3    950      1
>>    4168     16      0    109    659      9    672      0
>>    4416     24      0    112    771      3    748      0
>>    5038     86      0     76    728      4    825      0
>>    5602     21      0     76    740      3    702      6
>>
>> [root@head2] ~# arcstat.py 1
>>     time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
>> 18:25:36    21     0      0     0    0     0    0     0    0    65G   65G
>> 18:25:37  1.8K    23      1    23    1     0    0     7    0    65G   65G
>> 18:25:38  1.9K    88      4    32    1    56   32     3    0    65G   65G
>> 18:25:39  2.2K    67      3    62    2     5    5     2    0    65G   65G
>> 18:25:40  2.7K   132      4    39    1    93   17     8    0    65G   65G
>>
>> last pid:  7800;  load averages:  1.44,  1.65,  1.68    up 0+19:22:29  18:26:16
>> 69 processes:  1 running, 68 sleeping
>> CPU:  0.1% user,  0.0% nice,  1.8% system,  0.9% interrupt, 97.3% idle
>> Mem: 297M Active, 180M Inact, 74G Wired, 140K Cache, 565M Buf, 19G Free
>> ARC: 66G Total, 39G MFU, 24G MRU, 53M Anon, 448M Header, 1951M Other
>> Swap: 28G Total, 28G Free
>>
>>   PID USERNAME       THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
>>  9915 root            37  52    0  9900K  2060K rpcsvc 16  16.7H 24.02% nfsd
>>  6402 root             1  52    0 85352K 20696K select  8  47:17  3.08% python2.7
>> 43178 root             1  20    0 70524K 30752K select  7  31:04  0.59% rsync
>>  7363 root             1  20    0 49512K  6456K CPU16  16   0:00  0.59% top
>> 37968 root             1  20    0 70524K 31432K select  7  16:53  0.00% rsync
>> 37969 root             1  20    0 55752K 11052K select  1   9:11  0.00% ssh
>> 13516 root            12  20    0   176M 41152K uwait  23   4:14  0.00% collectd
>> 31375 root            12  20    0   176M 42432K uwa
>>
>> This is a quick peek at the system at the end of the day, so load has
>> dropped off considerably; however, the main takeaway is that it has plenty
>> of free RAM and the ZFS ARC hit percentage is > 99%.
>>
I wrote:
>>> I took a look at it and I wonder if it is time to consider changing the
>>> algorithm somewhat?
>>>
>>> The main thing that I wonder about is doing FHA for all the RPCs other than
>>> Read and Write.
>>>
>>> In particular, Getattr is often the most frequent RPC and doing FHA for it
>>> seems like wasted overhead to me. Normally separate Getattr RPCs wouldn't be
>>> done for FHs that are being Read/Written, since the Read/Write reply has
>>> updated attributes in it.
>>>
Although the load is mostly Getattr RPCs and I think the above statement is correct,
I don't know whether the overhead of doing FHA for all the Getattr RPCs explains the
observed performance problem.

I don't see how doing FHA for RPCs like Getattr will improve their performance.
Note that when the FHA algorithm was originally done, there wasn't a shared vnode
lock and, as such, all RPCs on a given FH/vnode would have been serialized by the
vnode lock anyhow. Now, with shared vnode locks, this isn't the case for frequently
performed RPCs like Getattr, Read (and, for ZFS, Write), Lookup and Access. I have
always felt that doing FHA for RPCs other than Read and Write didn't make much
sense, but I don't have any evidence that it causes a significant performance penalty.
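
For anyone not familiar with the mechanism, here is a stripped-down sketch of the
general idea behind FHA. This is illustrative only, not the actual sys/nfs/nfs_fha.c
code, and fh_hash()/fha_pick_thread() are made-up names: the file handle in the
request is hashed and the RPC is steered toward a preferred service thread, which is
what keeps Reads and Writes on one file in order, but it is also the hashing and
(in the real code) locked hash-table lookup work that every Getattr pays for when
FHA is applied to all RPCs.

/*
 * Illustrative sketch only -- not the real nfs_fha.c code.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Trivial FNV-1a hash over the opaque file handle bytes. */
static uint32_t
fh_hash(const uint8_t *fh, size_t len)
{
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= fh[i];
		h *= 16777619u;
	}
	return (h);
}

/*
 * Map a file handle to a preferred service thread.  Every RPC routed
 * through a path like this pays for the hash (and, in the real code, a
 * locked table lookup) whether or not it benefits from the affinity.
 */
static int
fha_pick_thread(const uint8_t *fh, size_t len, int nthreads)
{

	return ((int)(fh_hash(fh, len) % (uint32_t)nthreads));
}

int
main(void)
{
	/* A fake 16-byte file handle, just to exercise the functions. */
	uint8_t fh[16];

	memset(fh, 0xab, sizeof(fh));
	printf("preferred thread: %d of 32\n",
	    fha_pick_thread(fh, sizeof(fh), 32));
	return (0);
}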

Anyhow, the attached simple patch limits FHA to Read and Write RPCs.
The simple testing I've done shows it to be about performance neutral (a 0-1%
improvement), but I only have small hardware, no ZFS, and no easy way to emulate a
load of mostly Getattr RPCs. As such, unless others can determine whether this patch
(or some other one) helps with this kind of load, I don't think committing it makes
much sense.
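
For what it's worth, here is the sort of thing I mean by emulating such a load
(a rough sketch only, nothing I have actually run and not part of the patch): a
client-side loop of stat(2) calls against files on an NFS mount. For the calls to
actually turn into Getattr RPCs at the server, the client's attribute caching needs
to be effectively disabled (e.g. acregmin/acregmax/acdirmin/acdirmax set to 0 on the
mount); otherwise most of them are answered from the client's attribute cache.

/*
 * getattr_load.c -- rough illustration only, not part of the patch.
 * Repeatedly stat(2) the named files (which should live on an NFS mount
 * with attribute caching effectively disabled) so the server sees a
 * stream of mostly Getattr RPCs.
 */
#include <sys/stat.h>

#include <err.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
	struct stat sb;
	long iters, i;
	int j;

	if (argc < 3)
		errx(1, "usage: %s iterations file [file ...]", argv[0]);
	iters = strtol(argv[1], NULL, 10);

	for (i = 0; i < iters; i++) {
		for (j = 2; j < argc; j++) {
			/* Each uncached stat() becomes an NFS Getattr. */
			if (stat(argv[j], &sb) == -1)
				warn("stat %s", argv[j]);
		}
	}
	printf("done: %ld passes over %d files\n", iters, argc - 2);
	return (0);
}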

If anyone can test this, or has comments on it or suggestions for other possible
changes to the FHA algorithm, please do so.
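
For reference, once the patch is applied the old behaviour can be restored by
setting the new sysctl nonzero. A minimal illustration using sysctlbyname(3)
follows; the full MIB name is my assumption (the patch hangs "enable_allrpcs" off
the FHA sysctl tree, which should come out as vfs.nfsd.fha.enable_allrpcs for the
new NFS server and vfs.nfsrv.fha.enable_allrpcs for the old one), and in practice
sysctl(8) from the shell does the same thing.

/*
 * Illustration only, not part of the patch.  Flip the proposed knob back
 * to the old behaviour (FHA applied to all RPCs).  The MIB name below is
 * an assumption about where the FHA sysctl tree ends up.
 */
#include <sys/types.h>
#include <sys/sysctl.h>

#include <err.h>
#include <stdio.h>

int
main(void)
{
	u_int newval = 1;		/* 1 == apply FHA to all RPCs again */
	u_int oldval;
	size_t oldlen = sizeof(oldval);

	if (sysctlbyname("vfs.nfsd.fha.enable_allrpcs", &oldval, &oldlen,
	    &newval, sizeof(newval)) == -1)
		err(1, "sysctlbyname");
	printf("enable_allrpcs: %u -> %u\n", oldval, newval);
	return (0);
}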

Thanks, rick


[-- Attachment #2 --]
--- nfs/nfs_fha.c.sav	2015-10-21 19:29:53.000000000 -0400
+++ nfs/nfs_fha.c	2015-10-22 19:31:43.000000000 -0400
@@ -42,6 +42,8 @@ __FBSDID("$FreeBSD: head/sys/nfs/nfs_fha
 
 static MALLOC_DEFINE(M_NFS_FHA, "NFS FHA", "NFS FHA");
 
+static u_int	nfsfha_enableallrpcs = 0;
+
 /*
  * XXX need to commonize definitions between old and new NFS code.  Define
  * this here so we don't include one nfsproto.h over the other.
@@ -109,6 +111,10 @@ fha_init(struct fha_params *softc)
 	    OID_AUTO, "fhe_stats", CTLTYPE_STRING | CTLFLAG_RD, 0, 0,
 	    softc->callbacks.fhe_stats_sysctl, "A", "");
 
+	SYSCTL_ADD_UINT(&softc->sysctl_ctx, SYSCTL_CHILDREN(softc->sysctl_tree),
+	    OID_AUTO, "enable_allrpcs", CTLFLAG_RW,
+	    &nfsfha_enableallrpcs, 0, "Enable FHA for all RPCs");
+
 }
 
 void
@@ -383,6 +389,7 @@ fha_assign(SVCTHREAD *this_thread, struc
 	struct fha_info i;
 	struct fha_hash_entry *fhe;
 	struct fha_callbacks *cb;
+	rpcproc_t procnum;
 
 	cb = &softc->callbacks;
 
@@ -399,6 +406,24 @@ fha_assign(SVCTHREAD *this_thread, struc
 	if (req->rq_vers != 2 && req->rq_vers != 3)
 		goto thist;
 
+	/*
+	 * The main reason for use of FHA now that FreeBSD supports shared
+	 * vnode locks is to try and maintain sequential ordering of Read
+	 * and Write operations.  Also, it has been observed that some
+	 * RPC loads, such as one mostly of Getattr RPCs, perform better
+	 * without FHA applied to them.  As such, FHA is only applied to
+	 * Read and Write RPCs by default.
+	 * The sysctl "fha.enable_allrpcs" can be set nonzero so that FHA is
+	 * applied to all RPCs for backwards compatibility with the old FHA
+	 * code.
+	 */
+	procnum = req->rq_proc;
+	if (req->rq_vers == 2)
+		procnum = cb->get_procnum(procnum);
+	if (cb->is_read(procnum) == 0 && cb->is_write(procnum) == 0 &&
+	    nfsfha_enableallrpcs == 0)
+		goto thist;
+
 	fha_extract_info(req, &i, cb);
 
 	/*