Date:      Fri, 23 Oct 2015 08:04:23 -0400 (EDT)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        FreeBSD FS <freebsd-fs@freebsd.org>
Cc:        Josh Paetzel <josh@ixsystems.com>, Alexander Motin <mav@freebsd.org>,  ken@freebsd.org
Subject:   NFS FHA issue and possible change to the algorithm
Message-ID:  <1927021065.49882210.1445601863864.JavaMail.zimbra@uoguelph.ca>
In-Reply-To: <2144282175.49880310.1445601737139.JavaMail.zimbra@uoguelph.ca>


Hi,

In an off-list discussion, a site running an NFS server found that they needed to
disable File Handle Affinity (FHA) to get good performance.
Here is a re-post of some of that discussion (with Josh's permission).
First, what was observed w.r.t. the machine.
Josh Paetzel wrote:
>>>> It's all good.
>>>>
>>>> It's a 96GB RAM machine and I have 2 million nmbclusters, so 8GB RAM,
>>>> and we've tried 1024 NFS threads.
>>>>
>>>> It might be running out of network memory but we can't really afford to
>>>> give it any more, for this use case disabling FHA might end up being the
>>>> way to go.
>>>>
I wrote:
>>> Just to fill mav@ in, the person that reported a serious performance problem
>>> to Josh was able to fix it by disabling FHA.
Josh Paetzel wrote:
>>
>> There are about 300 virtual machines that mount root from a read-only NFS
>> share.
>>
>> There are also another few hundred users who mount their home directories
>> over NFS.  When things went sideways it was always the virtual machines
>> that became unusable: 45 seconds to log in via ssh, 15 minutes to boot,
>> stuff like that.
>>
>> [root@head2] ~# nfsstat -s 1
>>  GtAttr Lookup Rdlink   Read  Write Rename Access  Rddir
>>    4117     17      0    124    689      4    680      0
>>    4750     31      5    121    815      3    950      1
>>    4168     16      0    109    659      9    672      0
>>    4416     24      0    112    771      3    748      0
>>    5038     86      0     76    728      4    825      0
>>    5602     21      0     76    740      3    702      6
>>
>> [root@head2] ~# arcstat.py 1
>>     time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
>> 18:25:36    21     0      0     0    0     0    0     0    0    65G   65G
>> 18:25:37  1.8K    23      1    23    1     0    0     7    0    65G   65G
>> 18:25:38  1.9K    88      4    32    1    56   32     3    0    65G   65G
>> 18:25:39  2.2K    67      3    62    2     5    5     2    0    65G   65G
>> 18:25:40  2.7K   132      4    39    1    93   17     8    0    65G   65G
>>
>> last pid:  7800;  load averages:  1.44,  1.65,  1.68    up 0+19:22:29  18:26:16
>> 69 processes:  1 running, 68 sleeping
>> CPU:  0.1% user,  0.0% nice,  1.8% system,  0.9% interrupt, 97.3% idle
>> Mem: 297M Active, 180M Inact, 74G Wired, 140K Cache, 565M Buf, 19G Free
>> ARC: 66G Total, 39G MFU, 24G MRU, 53M Anon, 448M Header, 1951M Other
>> Swap: 28G Total, 28G Free
>>
>>   PID USERNAME       THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
>>  9915 root            37  52    0  9900K  2060K rpcsvc 16  16.7H 24.02% nfsd
>>  6402 root             1  52    0 85352K 20696K select  8  47:17  3.08% python2.7
>> 43178 root             1  20    0 70524K 30752K select  7  31:04  0.59% rsync
>>  7363 root             1  20    0 49512K  6456K CPU16  16   0:00  0.59% top
>> 37968 root             1  20    0 70524K 31432K select  7  16:53  0.00% rsync
>> 37969 root             1  20    0 55752K 11052K select  1   9:11  0.00% ssh
>> 13516 root            12  20    0   176M 41152K uwait  23   4:14  0.00% collectd
>> 31375 root            12  20    0   176M 42432K uwa
>>
>> This is a quick peek at the system at the end of the day, so load has
>> dropped off considerably; however, the main takeaway is that it has plenty of
>> free RAM and the ZFS ARC hit percentage is > 99%.
>>
I wrote:
>>> I took a look at it and I wonder if it is time to consider changing the
>>> algorithm somewhat?
>>>
>>> The main thing that I wonder about is doing FHA for all the RPCs other than
>>> Read and Write.
>>>
>>> In particular, Getattr is often the most frequent RPC, and doing FHA for it
>>> seems like wasted overhead to me.  Normally, separate Getattr RPCs wouldn't
>>> be done for FHs that are being Read/Written, since the Read/Write reply has
>>> updated attributes in it.
>>>
Although the load is mostly Getattr RPCs and I think the above statement is correct,
I don't know whether the overhead of doing FHA for all the Getattr RPCs explains the
observed performance problem.

I don't see how doing FHA for RPCs like Getattr will improve their performance.
Note that when the FHA algorithm was originally written, there wasn't a shared vnode
lock and, as such, all RPCs on a given FH/vnode would have been serialized by the vnode
lock anyhow.  Now, with shared vnode locks, this isn't the case for frequently performed
RPCs like Getattr, Read (Write for ZFS), Lookup and Access.  I have always felt that
doing FHA for RPCs other than Read and Write didn't make much sense, but I don't
have any evidence that it causes a significant performance penalty.
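
To make the idea concrete, here is a rough userland model of the dispatch rule,
which is the gist of what the attached patch does in fha_assign().  It is only a
sketch: the NFSv3 procedure numbers are the real ones, but the helper names, the
boolean knob and the standalone program structure are just stand-ins for the
is_read()/is_write() callbacks and the sysctl in sys/nfs/nfs_fha.c:

/*
 * Rough userland model of the proposed dispatch rule; not kernel code.
 * use_fha() decides whether a request should go through the FHA hash
 * (grouping requests for the same file handle onto one nfsd thread) or
 * simply be handed to any idle thread.
 */
#include <stdbool.h>
#include <stdio.h>

#define	NFSPROC_GETATTR	1	/* NFSv3 procedure numbers */
#define	NFSPROC_READ	6
#define	NFSPROC_WRITE	7

static bool fha_enable_allrpcs = false;	/* models the proposed sysctl */

static bool
is_read(int procnum)
{

	return (procnum == NFSPROC_READ);
}

static bool
is_write(int procnum)
{

	return (procnum == NFSPROC_WRITE);
}

/* Apply affinity only to Read and Write unless the "all RPCs" knob is set. */
static bool
use_fha(int procnum)
{

	if (fha_enable_allrpcs)
		return (true);
	return (is_read(procnum) || is_write(procnum));
}

int
main(void)
{

	printf("Getattr uses FHA: %d\n", use_fha(NFSPROC_GETATTR));
	printf("Read uses FHA:    %d\n", use_fha(NFSPROC_READ));
	printf("Write uses FHA:   %d\n", use_fha(NFSPROC_WRITE));
	return (0);
}

Everything else (Getattr, Lookup, Access, ...) would just go to whatever nfsd thread
is free, the same as with FHA disabled.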

Anyhow, the attached simple patch limits FHA to Read and Write RPCs by default
(a sysctl can re-enable it for all RPCs).
The simple testing I've done shows it to be about performance neutral (0-1% improvement),
but I only have small hardware and no ZFS, nor any easy way to emulate a load of mostly
Getattr RPCs.  As such, unless others can determine whether this patch (or some other one)
helps with this kind of load, I don't think committing it makes much sense.

If anyone can test this, or has comments on it or suggestions for other possible
changes to the FHA algorithm, please let me know.
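
In case it helps anyone testing: the knob the patch adds should show up as a
read/write sysctl under the same tree as the existing FHA sysctls.  I am assuming
it appears as vfs.nfsd.fha.enable_allrpcs for the new NFS server (that is a guess
at the final OID name; adjust it if it lands elsewhere), so setting it to 1, either
with sysctl(8) or with a trivial program like the one below, should restore the old
behaviour of applying FHA to all RPCs:

/*
 * Roughly equivalent to "sysctl vfs.nfsd.fha.enable_allrpcs=1" (run as root).
 * The OID name is an assumption about where the patch's "enable_allrpcs"
 * node ends up; change it if the knob lands under a different tree.
 */
#include <sys/types.h>
#include <sys/sysctl.h>

#include <err.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
	u_int val;

	/* Default to enabling FHA for all RPCs; pass 0 to go back to Read/Write only. */
	val = (argc > 1) ? (u_int)atoi(argv[1]) : 1;
	if (sysctlbyname("vfs.nfsd.fha.enable_allrpcs", NULL, NULL,
	    &val, sizeof(val)) == -1)
		err(1, "sysctlbyname");
	return (0);
}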

Thanks, rick


[Attachment: nfsfha.patch]

--- nfs/nfs_fha.c.sav	2015-10-21 19:29:53.000000000 -0400
+++ nfs/nfs_fha.c	2015-10-22 19:31:43.000000000 -0400
@@ -42,6 +42,8 @@ __FBSDID("$FreeBSD: head/sys/nfs/nfs_fha
 
 static MALLOC_DEFINE(M_NFS_FHA, "NFS FHA", "NFS FHA");
 
+static u_int	nfsfha_enableallrpcs = 0;
+
 /*
  * XXX need to commonize definitions between old and new NFS code.  Define
  * this here so we don't include one nfsproto.h over the other.
@@ -109,6 +111,10 @@ fha_init(struct fha_params *softc)
 	    OID_AUTO, "fhe_stats", CTLTYPE_STRING | CTLFLAG_RD, 0, 0,
 	    softc->callbacks.fhe_stats_sysctl, "A", "");
 
+	SYSCTL_ADD_UINT(&softc->sysctl_ctx, SYSCTL_CHILDREN(softc->sysctl_tree),
+	    OID_AUTO, "enable_allrpcs", CTLFLAG_RW,
+	    &nfsfha_enableallrpcs, 0, "Enable FHA for all RPCs");
+
 }
 
 void
@@ -383,6 +389,7 @@ fha_assign(SVCTHREAD *this_thread, struc
 	struct fha_info i;
 	struct fha_hash_entry *fhe;
 	struct fha_callbacks *cb;
+	rpcproc_t procnum;
 
 	cb = &softc->callbacks;
 
@@ -399,6 +406,24 @@ fha_assign(SVCTHREAD *this_thread, struc
 	if (req->rq_vers != 2 && req->rq_vers != 3)
 		goto thist;
 
+	/*
+	 * The main reason for use of FHA now that FreeBSD supports shared
+	 * vnode locks is to try and maintain sequential ordering of Read
+	 * and Write operations.  Also, it has been observed that some
+	 * RPC loads, such as one mostly of Getattr RPCs, perform better
+	 * without FHA applied to them.  As such, FHA is only applied to
+	 * Read and Write RPCs by default.
+	 * The sysctl "fha.enable_allrpcs" can be set nonzero so that FHA is
+	 * applied to all RPCs for backwards compatibility with the old FHA
+	 * code.
+	 */
+	procnum = req->rq_proc;
+	if (req->rq_vers == 2)
+		procnum = cb->get_procnum(procnum);
+	if (cb->is_read(procnum) == 0 && cb->is_write(procnum) == 0 &&
+	    nfsfha_enableallrpcs == 0)
+		goto thist;
+
 	fha_extract_info(req, &i, cb);
 
 	/*


