Date: Sat, 28 Mar 2020 13:22:56 +0100
From: Peter Eriksson <pen@lysator.liu.se>
To: FreeBSD Filesystems <freebsd-fs@freebsd.org>
Cc: "PK1048.COM" <info@pk1048.com>
Subject: Re: ZFS/NFS hickups and some tools to monitor stuff...
Message-ID: <E6171E44-F677-4926-9F55-775F538900E4@lysator.liu.se>
In-Reply-To: <66AB88C0-12E8-48A0-9CD7-75B30C15123A@pk1048.com>
References: <CFD0E4E5-EF2B-4789-BF14-F46AC569A191@lysator.liu.se> <66AB88C0-12E8-48A0-9CD7-75B30C15123A@pk1048.com>
> On 28 Mar 2020, at 04:07, PK1048.COM <info@pk1048.com> wrote:
>>
>> The last couple of weeks I've been fighting with a severe case of NFS users complaining about slow response times from our (5) FreeBSD 11.3-RELEASE-p6 file servers. Now even though our SMB (Windows) users (thankfully, since they are like 500 per server vs 50 NFS users) didn't see the same slowdown (or at least didn't complain about it), the root cause is probably ZFS-related.
>
> What is the use case for the NFS users: Unix home directories? VM image files? etc.
Mostly home directories. No VM image files. About 20000 filesystems per server with around 100 snapshots per filesystem. Around 150-180M files/directories per server.
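(For reference, those counts can be reproduced with plain "zfs list"; just a sketch, using the pool name "DATA" that shows up further down in this mail:)

  # count filesystems and snapshots on one server (slow with this many datasets)
  zfs list -H -t filesystem -r DATA | wc -l
  zfs list -H -t snapshot -r DATA | wc -l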
>> 3. Rsync cloning of data into the servers. Response times up to 15 minutes were seen… Yes, 15 minutes to do a mkdir("test-dir"). Possibly in conjunction with #1 above….
>
> There is something very wrong with your setup with that kind of NFS response time.
>
> Maybe I'm asking the obvious, but what is performance like natively on the server for these operations?
Normal response times look like this. First from a small Intel NUC running OmniOS, so NFS 4.0:
$ ./pfst -v /mnt/filur01
[pfst, version 1.7 - Peter Eriksson <pen@lysator.liu.se>]
2020-03-28 12:19:10 [2114 µs]: /mnt/filur01: mkdir("t-omnibus-821-1")
2020-03-28 12:19:10 [ 413 µs]: /mnt/filur01: rmdir("t-omnibus-821-1")
2020-03-28 12:19:11 [1375 µs]: /mnt/filur01: mkdir("t-omnibus-821-2")
2020-03-28 12:19:11 [ 438 µs]: /mnt/filur01: rmdir("t-omnibus-821-2")
2020-03-28 12:19:12 [1329 µs]: /mnt/filur01: mkdir("t-omnibus-821-3")
2020-03-28 12:19:12 [ 428 µs]: /mnt/filur01: rmdir("t-omnibus-821-3")
2020-03-28 12:19:13 [1253 µs]: /mnt/filur01: mkdir("t-omnibus-821-4")
2020-03-28 12:19:13 [ 395 µs]: /mnt/filur01: rmdir("t-omnibus-821-4")
I.e., an mkdir() takes around 1-2 ms and an rmdir() around 0.4 ms.
Same from a CentOS 7 client (different hardware, a Dell workstation, NFS 4.1):
$ ./pfst -v /mnt/filur01
[pfst, version 1.6 - Peter Eriksson <pen@lysator.liu.se>]
2020-03-28 12:21:15 [ 633 µs]: /mnt/filur01: mkdir("t-electra-965-1")
2020-03-28 12:21:15 [ 898 µs]: /mnt/filur01: rmdir("t-electra-965-1")
2020-03-28 12:21:16 [1019 µs]: /mnt/filur01: mkdir("t-electra-965-2")
2020-03-28 12:21:16 [ 709 µs]: /mnt/filur01: rmdir("t-electra-965-2")
2020-03-28 12:21:17 [ 955 µs]: /mnt/filur01: mkdir("t-electra-965-3")
2020-03-28 12:21:17 [ 668 µs]: /mnt/filur01: rmdir("t-electra-965-3")
mkdir() and rmdir() take about the same amount of time here (0.6-1 ms).
(The above only tests operations on already-mounted filesystems. Mounting filesystems occasionally has its own, partly different, set of problems.)
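(The really bad stalls don't need pfst to be visible; a throwaway shell loop like the sketch below is enough to catch anything in the multi-second range. /usr/bin/time only has roughly 10 ms resolution, and the directory name is just an example:)

  # crude mkdir/rmdir latency probe against an NFS mount (rough stand-in for pfst)
  d=/mnt/filur01/t-probe-$$
  while :; do
      /usr/bin/time -p mkdir "$d"   # the "real" line is the elapsed time in seconds
      /usr/bin/time -p rmdir "$d"
      sleep 1
  done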
> What does the disk %busy look like on the disks that make up the vdev's? (iostat -x)
Don't have those numbers (from when we were seeing problems) unfortunately, but if I remember correctly the disks were fairly busy during the resilver (not surprising).
Current status (right now):
# iostat -x 10 |egrep -v pass
                        extended device statistics
device       r/s     w/s     kr/s     kw/s  ms/r  ms/w  ms/o  ms/t  qlen  %b
nvd0           0       0      0.0      0.0     0     0     0     0     0   0
da0            3      55     31.1   1129.4    10     1    87     3     0  13
da1            4      53     31.5   1109.1    10     1    86     3     0  13
da2            5      51     41.9   1082.4     9     1    87     3     0  14
da3            3      55     31.1   1129.7    10     1    85     2     0  13
da4            4      53     31.3   1108.7    10     1    86     3     0  13
da5            5      52     41.7   1081.3     9     1    87     3     0  14
da6            3      55     27.6   1103.9    10     1    94     2     0  13
da7            4      54     34.2   1064.7    10     1    92     3     0  14
da8            5      55     39.6   1088.2    10     1    94     3     0  15
da9            3      55     27.7   1103.4    10     1    92     2     0  13
da10           4      53     34.2   1065.0    10     1    92     3     0  14
da11           5      55     39.5   1089.0    10     1    95     3     0  15
da12           1      23      4.7    553.5     0     0     0     0     0   0
da13           1      23      4.7    553.5     0     0     0     0     0   0
da14           0      23      1.1    820.6     0     0     0     0     0   1
da15           0      23      1.0    820.7     0     0     0     0     0   1
da16           0       0      0.0      0.0     1     0     0     1     0   0
da17           0       0      0.0      0.0     1     0     0     0     0   0
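(If/when the problem comes back, per-disk busy can also be watched live with gstat; -p limits it to the physical providers, i.e. the disks themselves:)

  # live per-disk queue depth and %busy, physical providers only
  gstat -p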
>> Previously #1 and #2 haven't caused that many problems, and #3 definitely hasn't. Something has changed in the last half year or so, but so far I haven't been able to figure it out.
>
> The degradation over time makes me think fragmentation from lots of small writes or an overly full zpool, but that should affect SMB as well as NFS.
Yeah, the zpool is around 50% full so no.
(One of the servers right now:)
# zpool iostat -v DATA
                                     capacity     operations    bandwidth
pool                               alloc   free   read  write   read  write
--------------------------------   -----  -----  -----  -----  -----  -----
DATA                               57.7T  50.3T     46    556   403K  8.48M
  raidz2                           29.0T  25.0T     23    253   204K  3.49M
    diskid/DISK-7PK8DWXC               -      -      3     52  31.1K  1.10M
    diskid/DISK-7PK5X6XC               -      -      4     50  31.5K  1.08M
    diskid/DISK-7PK4W4BC               -      -      5     49  41.9K  1.06M
    diskid/DISK-7PK204LG               -      -      3     52  31.1K  1.10M
    diskid/DISK-7PK2GDHG               -      -      4     50  31.3K  1.08M
    diskid/DISK-7PK7850C               -      -      5     49  41.7K  1.06M
  raidz2                           28.8T  25.2T     23    256   199K  3.39M
    diskid/DISK-7PK62HHC               -      -      3     53  27.6K  1.08M
    diskid/DISK-7PK6SG3C               -      -      4     52  34.2K  1.04M
    diskid/DISK-7PK8DRHC               -      -      5     53  39.6K  1.06M
    diskid/DISK-7PK85ADG               -      -      3     53  27.7K  1.08M
    diskid/DISK-2TK7PBYD               -      -      4     51  34.2K  1.04M
    diskid/DISK-7PK6WY9C               -      -      5     53  39.5K  1.06M
logs                                   -      -      -      -      -      -
  diskid/DISK-BTHV7146043U400NGN    289M   372G      0     23      0   821K
  diskid/DISK-BTHV71460441400NGN    274M   372G      0     23      0   821K
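(Pool fill level and free-space fragmentation, for the record, can be summarised with:)

  # capacity and free-space fragmentation summary for the pool
  zpool list -o name,size,alloc,free,cap,frag,health DATA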
> What type of SSDs are you using for the SLOG (every zpool has a built-in ZIL, you have a SLOG when you add separate devices for the Intent Log for sync operations)? I had tried a run-of-the-mill consumer SSD for the SLOG and NFS sync performance went DOWN! For SLOG devices you really must have real enterprise-grade server-class SSDs designed for very heavy random WRITE load. Today you can get enterprise SSDs designed for READ, WRITE, or a combination.
Dell-original enterprise SSDs like the Intel DC S3520 and Intel DC S3700 series (SATA).

Also tested some Intel SSD 750s (NVMe on PCIe 3.0 cards).

Going to test some pure SAS write-optimised SSDs too (as soon as we can get delivery of them). However, I don't really suspect the SSDs anymore. (We've had other issues with the S3520 series though, so we might replace them anyway - they have a tendency to play "possum" every now and then - "die" and go offline - but if we just let them sit there they reappear again automagically after a couple of weeks… We've seen the same thing happen with the Intel P3520s too (the PCIe variant), but that one didn't just drop off the bus - it took down the server too by hanging the PCIe bus - crappy buggers.)
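(A quick way to see how hard the sync/ZIL path is actually being hit while evaluating SLOG devices is to count zil_commit() calls; just a sketch, assuming the fbt probe for the ZFS module is available on the running kernel:)

  # count zil_commit() calls per second per process name - a rough gauge of sync-write pressure
  dtrace -n 'fbt::zil_commit:entry { @c[execname] = count(); } tick-1sec { printa(@c); trunc(@c); }'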
> Depending on your NFS use case, you may or may not be seeing sync operations. If you are seeing async operations then a ZIL/SLOG will not help, as they are only used for sync operations. For async operations multiple writes are combined in RAM and committed to disk at least every 5 seconds. For sync operations the writes are immediately written to the ZIL/SLOG, so a SLOG _can_ be of benefit if it is fast enough for random write operations. Since a sync operation must be committed to non-volatile storage before the write call can return, and having a single SLOG device can lead to loss of data, it is prudent to mirror the SLOG devices. Having a mirrored pair does not slow down operations as both devices are written to in parallel. The SLOG only exists to meet the sync requirement. If the system does not crash, the SLOG is (almost) never actually read. The exception, and the reason I say "almost", is that if you are writing so much data over long periods of time that the zpool disks cannot catch up between write requests, then you may have to read some TXGs (transaction groups) from the SLOG to get them into the zpool.
>
>> We've tried it with and without L2ARC,
>
> ARC and L2ARC will only affect read operations, and you can see the effect with zfs-stats -E.
Well, L2ARC uses up some of the ARC, so if the ARC is too tight to begin with then it'll make reads go even slower.
Which actually is my main suspect (the ARC) right now, at least for the resilver problem. When the resilver was running it seems it would read a lot of data into the ARC, causing it to drop older data (& metadata) from the ARC cache. Which then became a problem due to three issues:
1. The problematic server had "compressed ARC" disabled due to a previous kernel-crashing problem (which happened less often with it disabled).
2. ZFS prioritises writes over reads.
3. The metadata (directory information, for example) wasn't in the ARC anymore, so it needed to be read from disk again and again. (ARC metadata usage can be eyeballed with the sysctl sketch below.)
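(How much of the ARC is metadata, and how close it is to its limit, can be checked with sysctl; the names below are the ones used by the in-tree 11.x ZFS as far as I can tell:)

  # ARC size vs. its cap, and metadata usage vs. the metadata limit
  sysctl kstat.zfs.misc.arcstats.size \
         kstat.zfs.misc.arcstats.c_max \
         kstat.zfs.misc.arcstats.arc_meta_used \
         kstat.zfs.misc.arcstats.arc_meta_limit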
This, together with the fact that:

We are taking hourly snapshots of all filesystems - this needs to access metadata for all filesystems.

We are cleaning snapshots nightly - this needs to access metadata for all filesystems (and all old snapshots).

We had a problem with the cleaning script so it was running behind, and more and more snapshots were piling up on a couple of the file servers - so instead of ~100 snapshots per filesystem there were around 400. Including the server where we were resilvering.
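(Spotting where the cleanup has fallen behind is a quick one-liner:)

  # snapshots per filesystem, worst offenders first
  zfs list -H -t snapshot -o name -r DATA | sed 's/@.*//' | sort | uniq -c | sort -rn | head -20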
The 500-600 Samba "smbd" processes each use a lot of RAM (100-200MB), i.e. roughly 60-120GB in total, so they compete with the ARC too.
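(A rough upper bound on that can be had by summing the smbd resident sizes - it over-counts shared pages, but gives an idea:)

  # total resident memory of all smbd processes, in GB (rss is reported in KB)
  ps -axo rss,comm | awk '$2 == "smbd" { sum += $1 } END { printf "%.1f GB\n", sum/1048576 }'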
Many of the NFS operations access metadata from the filesystems - for example mount operations or filename lookups. So if any of that was evicted from the ARC it would have to be read from disk again. And when ZFS was busy writing resilver data, that would take a long time.
Also, when a filesystem is near its quota, that too will cause the ZFS transaction size to go down - inflating write IOPS and starving the read IOPS again…
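(The near-quota filesystems are easy to list; "available" on a dataset with a quota is capped by that quota, so sorting on it finds the candidates:)

  # filesystems with the least available space, i.e. the near-quota ones; -p prints raw bytes
  zfs get -Hrp -t filesystem -o name,value available DATA | sort -k2 -n | head -20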
We are probably going to invest in more RAM for the servers (going from 256GB to 512GB or 768GB) to allow for way more ARC.
> You should not need to make any system tunings to make this work.
Well, I had to tune down vfs.zfs.arc_max from the default (90% of RAM) to 128GB since the Samba smbd processes also need RAM to run, and the auto-tuning of the ARC in FreeBSD doesn't seem to work that well. In order to mitigate that problem I'm currently using these settings:

vfs.zfs.arc_max = 96GB (tuned down from 128GB since the ARC overshot the 128GB allocated and started killing random processes)
vfs.zfs.arc_meta_limit = 50% of ARC (default is 25%)
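(For reference, that corresponds to something like the following in /etc/sysctl.conf - or /boot/loader.conf if the running kernel does not accept them at runtime; the byte values are just the 96GB/48GB example, not a recommendation:)

  # ARC capped at ~96 GB, with half of that allowed for metadata
  vfs.zfs.arc_max=103079215104
  vfs.zfs.arc_meta_limit=51539607552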
The overshooting of the ARC (especially when resilvering) was probably due to us also tuning kern.maxvnodes - it seems that if you touch that number when the system is already running, the ARC goes berserk. So I can probably increase the ARC to 128GB again. It looks to be more stable now that we've removed the kern.maxvnodes setting completely.
It would have been nice to have a setting for a minimum amount of metadata in the ARC, so that it would be better protected from normal data evicting it…
We've installed 512GB instead of 256GB in a test server now and will see what settings might work well there...
Right now (with no resilver running, not many "near-full-quota" filesystems, and many of the extraneous snapshots deleted) things are looking much better:
$ ./pfst -t200ms /mnt/filur0*
…
2020-03-26 21:16:40 [1678 ms]: /mnt/filur01: mkdir("t-omnibus-782-190284") [Time limit exceeded]
2020-03-26 21:16:47 [1956 ms]: /mnt/filur01: mkdir("t-omnibus-782-190289") [Time limit exceeded]
2020-03-26 21:16:53 [1439 ms]: /mnt/filur01: mkdir("t-omnibus-782-190293") [Time limit exceeded]
2020-03-26 21:16:59 [1710 ms]: /mnt/filur01: mkdir("t-omnibus-782-190297") [Time limit exceeded]
2020-03-26 21:17:05 [2044 ms]: /mnt/filur01: mkdir("t-omnibus-782-190301") [Time limit exceeded]
2020-03-26 21:17:11 [1955 ms]: /mnt/filur01: mkdir("t-omnibus-782-190305") [Time limit exceeded]
2020-03-26 21:17:16 [1515 ms]: /mnt/filur01: mkdir("t-omnibus-782-190309") [Time limit exceeded]
...
2020-03-28 06:54:54 [ 370 ms]: /mnt/filur06: mkdir("t-omnibus-783-311285") [Time limit exceeded]
2020-03-28 07:00:01 [ 447 ms]: /mnt/filur01: mkdir("t-omnibus-782-311339") [Time limit exceeded]
2020-03-28 07:00:01 [ 312 ms]: /mnt/filur06: mkdir("t-omnibus-783-311591") [Time limit exceeded]
2020-03-28 07:00:02 [ 291 ms]: /mnt/filur06: mkdir("t-omnibus-783-311592") [Time limit exceeded]
2020-03-28 07:00:05 [ 378 ms]: /mnt/filur06: mkdir("t-omnibus-783-311594") [Time limit exceeded]
2020-03-28 10:35:13 [1876 ms]: /mnt/filur01: mkdir("t-omnibus-782-324215") [Time limit exceeded]
(This only prints operations taking more than 200ms)
The last ~2s entry is probably due to a new filesystem being created. This triggers the NFS mount daemon to update the kernel's list of exports (and hold an NFS kernel lock), which takes some time (although it's much faster now with the "incremental update" fix in mountd).
The slow ones around 21:16 are due to me doing a number of "zfs rename DATA/home/$USER DATA/archive/home/$USER" operations. This too causes a lot of NFS mountd activity (removing NFS exports from the kernel), causing an NFS kernel lock to be held...
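(The effect is easy to see by watching the size of the export list while such a batch of renames runs; a rough check, assuming mountd answers on the server itself:)

  # number of exports mountd currently serves - each create/rename of a shared filesystem makes it reload this
  showmount -e localhost | wc -l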
- Peter
