From: Peter Eriksson <pen@lysator.liu.se>
Subject: Re: ZFS/NFS hickups and some tools to monitor stuff...
Date: Sat, 28 Mar 2020 13:22:56 +0100
To: FreeBSD Filesystems <freebsd-fs@freebsd.org>
Cc: PK1048.COM
In-Reply-To: <66AB88C0-12E8-48A0-9CD7-75B30C15123A@pk1048.com>

> On 28 Mar 2020, at 04:07, PK1048.COM wrote:
>>
>> The last couple of weeks I've been fighting with a severe case of NFS users complaining about slow response times from our (5) FreeBSD 11.3-RELEASE-p6 file servers.
>> Now even though our SMB (Windows) users (thankfully, since they are like 500 per server vs 50 NFS users) didn't see the same slowdown (or at least didn't complain about it), the root cause is probably ZFS-related.
>
> What is the use case for the NFS users: Unix home directories? VM image files? etc.

Mostly home directories. No VM image files. About 20000 filesystems per server with around 100 snapshots per filesystem. Around 150-180M files/directories per server.

>> 3. Rsync cloning of data into the servers. Response times of up to 15 minutes were seen… Yes, 15 minutes to do a mkdir("test-dir"). Possibly in conjunction with #1 above….
>
> There is something very wrong with your setup with that kind of NFS response time.
>
> Maybe I'm asking the obvious, but what is performance like natively on the server for these operations?

Normal response times: this is from a small Intel NUC running OmniOS, so NFS 4.0:

$ ./pfst -v /mnt/filur01
[pfst, version 1.7 - Peter Eriksson]
2020-03-28 12:19:10 [2114 µs]: /mnt/filur01: mkdir("t-omnibus-821-1")
2020-03-28 12:19:10 [ 413 µs]: /mnt/filur01: rmdir("t-omnibus-821-1")
2020-03-28 12:19:11 [1375 µs]: /mnt/filur01: mkdir("t-omnibus-821-2")
2020-03-28 12:19:11 [ 438 µs]: /mnt/filur01: rmdir("t-omnibus-821-2")
2020-03-28 12:19:12 [1329 µs]: /mnt/filur01: mkdir("t-omnibus-821-3")
2020-03-28 12:19:12 [ 428 µs]: /mnt/filur01: rmdir("t-omnibus-821-3")
2020-03-28 12:19:13 [1253 µs]: /mnt/filur01: mkdir("t-omnibus-821-4")
2020-03-28 12:19:13 [ 395 µs]: /mnt/filur01: rmdir("t-omnibus-821-4")

I.e. an mkdir() takes around 1-2 ms and an rmdir() around 0.4 ms.

Same from a CentOS 7 client (different hardware, a Dell workstation, NFS 4.1):

$ ./pfst -v /mnt/filur01
[pfst, version 1.6 - Peter Eriksson]
2020-03-28 12:21:15 [ 633 µs]: /mnt/filur01: mkdir("t-electra-965-1")
2020-03-28 12:21:15 [ 898 µs]: /mnt/filur01: rmdir("t-electra-965-1")
2020-03-28 12:21:16 [1019 µs]: /mnt/filur01: mkdir("t-electra-965-2")
2020-03-28 12:21:16 [ 709 µs]: /mnt/filur01: rmdir("t-electra-965-2")
2020-03-28 12:21:17 [ 955 µs]: /mnt/filur01: mkdir("t-electra-965-3")
2020-03-28 12:21:17 [ 668 µs]: /mnt/filur01: rmdir("t-electra-965-3")

Mkdir & rmdir take about the same amount of time here (0.6 - 1 ms).

(The above only tests operations on already-mounted filesystems. Mounting filesystems has its own set of partly different problems occasionally.)
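(pfst above is just a small probe tool of mine. If you want to do something similar without it, a rough sh stand-in is below - the mount point is only an example, and /usr/bin/time's roughly 10 ms resolution is much coarser than pfst's microsecond timestamps, but it is plenty to catch the 200 ms-and-up stalls discussed here:

#!/bin/sh
# Minimal stand-in for pfst: create and remove a test directory on an
# NFS mount once per second and print how long each call took.
# The mount point and the directory naming scheme are just examples.
MNT=${1:-/mnt/filur01}
i=0
while :; do
        i=$((i + 1))
        d="$MNT/t-$(hostname -s)-$$-$i"
        echo "== $(date '+%Y-%m-%d %H:%M:%S') $d"
        /usr/bin/time mkdir "$d"    # prints real/user/sys time for the mkdir()
        /usr/bin/time rmdir "$d"    # prints real/user/sys time for the rmdir()
        sleep 1
done
)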
> What does the disk %busy look like on the disks that make up the vdev's? (iostat -x)

I don't have those numbers from when we were seeing problems, unfortunately, but if I remember correctly the disks were fairly busy during the resilver (not surprising).

Current status (right now):

# iostat -x 10 | egrep -v pass
                        extended device statistics
device       r/s     w/s     kr/s     kw/s  ms/r  ms/w  ms/o  ms/t  qlen  %b
nvd0           0       0      0.0      0.0     0     0     0     0     0   0
da0            3      55     31.1   1129.4    10     1    87     3     0  13
da1            4      53     31.5   1109.1    10     1    86     3     0  13
da2            5      51     41.9   1082.4     9     1    87     3     0  14
da3            3      55     31.1   1129.7    10     1    85     2     0  13
da4            4      53     31.3   1108.7    10     1    86     3     0  13
da5            5      52     41.7   1081.3     9     1    87     3     0  14
da6            3      55     27.6   1103.9    10     1    94     2     0  13
da7            4      54     34.2   1064.7    10     1    92     3     0  14
da8            5      55     39.6   1088.2    10     1    94     3     0  15
da9            3      55     27.7   1103.4    10     1    92     2     0  13
da10           4      53     34.2   1065.0    10     1    92     3     0  14
da11           5      55     39.5   1089.0    10     1    95     3     0  15
da12           1      23      4.7    553.5     0     0     0     0     0   0
da13           1      23      4.7    553.5     0     0     0     0     0   0
da14           0      23      1.1    820.6     0     0     0     0     0   1
da15           0      23      1.0    820.7     0     0     0     0     0   1
da16           0       0      0.0      0.0     1     0     0     1     0   0
da17           0       0      0.0      0.0     1     0     0     0     0   0

>> Previously #1 and #2 hasn't caused that much problems, and #3 definitely. Something has changed the last half year or so but so far I haven't been able to figure it out.
>
> The degradation over time makes me think fragmentation of lots of small writes or an overly full zpool, but that should effect SMB as well as NFS.

Yeah, the zpool is around 50% full, so no.

(One of the servers right now:)

# zpool iostat -v DATA
                                   capacity     operations    bandwidth
pool                             alloc   free   read  write   read  write
-------------------------------- -----  -----  -----  -----  -----  -----
DATA                             57.7T  50.3T     46    556   403K  8.48M
  raidz2                         29.0T  25.0T     23    253   204K  3.49M
    diskid/DISK-7PK8DWXC             -      -      3     52  31.1K  1.10M
    diskid/DISK-7PK5X6XC             -      -      4     50  31.5K  1.08M
    diskid/DISK-7PK4W4BC             -      -      5     49  41.9K  1.06M
    diskid/DISK-7PK204LG             -      -      3     52  31.1K  1.10M
    diskid/DISK-7PK2GDHG             -      -      4     50  31.3K  1.08M
    diskid/DISK-7PK7850C             -      -      5     49  41.7K  1.06M
  raidz2                         28.8T  25.2T     23    256   199K  3.39M
    diskid/DISK-7PK62HHC             -      -      3     53  27.6K  1.08M
    diskid/DISK-7PK6SG3C             -      -      4     52  34.2K  1.04M
    diskid/DISK-7PK8DRHC             -      -      5     53  39.6K  1.06M
    diskid/DISK-7PK85ADG             -      -      3     53  27.7K  1.08M
    diskid/DISK-2TK7PBYD             -      -      4     51  34.2K  1.04M
    diskid/DISK-7PK6WY9C             -      -      5     53  39.5K  1.06M
logs                                 -      -      -      -      -      -
  diskid/DISK-BTHV7146043U400NGN  289M   372G      0     23      0   821K
  diskid/DISK-BTHV71460441400NGN  274M   372G      0     23      0   821K

> What type of SSD's are you using for the SLOG (every zpool has a built-in ZIL, you have a SLOG when you add separate devices for the Intent Log for sync operations)? I had tried a run of the mill consumer SSD for the SLOG and NFS sync performance went DOWN! For SLOG devices you really must have real enterprise grade server class SSDs designed for very heavy random WRITE load. Today you can get enterprise SSDs designed for READ, WRITE, or a combination.

Dell-original enterprise SSDs like the Intel DC S3520-series & Intel DC S3700-series (SATA).

Also tested some Intel SSD 750 (NVMe on PCIe 3.0 cards).

Going to test some pure SAS write-optimised SSDs too (as soon as we can get delivery of them). However, I don't really suspect the SSDs anymore.
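(Side note on testing SLOG candidates: on reasonably recent FreeBSD, diskinfo(8) has a synchronous-write ("SLOG") test that reports sync write latency per block size. It is a destructive write test, so only run it against an empty device that is not yet part of a pool - /dev/daX below is just a placeholder:

# diskinfo -wS /dev/daX

The sync write latency it reports is the number that matters for the NFS sync-write path discussed below.)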
(We've had other issues with the S3520-series though, so we might replace them anyway - they have a tendency to play "possum" every now and then - "die" and go offline - however if we just let them sit there, after a couple of weeks they would reappear again automagically… We've seen the same thing happen with the Intel P3520's too (the PCIe variant), but then it didn't just drop off the bus - it took down the server too by hanging the PCIe bus - crappy buggers.)

> Depending on your NFS use case, you may or may not be seeing sync operations. If you are seeing async operations then a ZIL/SLOG will not help as they are only used for sync operations. For async operations multiple writes are combined in RAM and committed to disk at least every 5 seconds. For sync operations the writes are immediately written to the ZIL/SLOG, so a SLOG _can_ be of benefit if it is fast enough for random write operations. Since a sync operation must be committed to non-volatile storage before the write call can return, and having a single SLOG device can lead to loss of data, it is prudent to mirror the SLOG devices. Having a mirrored pair does not slow down operations as both devices are written to in parallel. The SLOG only exists to meet the sync requirement. If the system does not crash, the SLOG is (almost) never actually read. The exception, and the reason I say "almost", is that if you are writing so much data over long periods of time that the zpool disks cannot catch up between write requests, then you may have to read some TXG (transaction groups) from the SLOG to get them into the zpool.
>
>> We've tried it with and without L2ARC,
>
> ARC and L2ARC will only effect read operations, and you can see the effect with zfs-stats -E.

Well, L2ARC uses up some of the ARC, so if the ARC is too tight from the beginning then it'll make reads go even slower.

Which actually is my main suspect (the ARC) right now, at least for the resilver problem. When the resilver was running it seems it would read a lot of data into the ARC, causing it to drop older data (& metadata) from the ARC cache. Which then became a problem due to three issues:

1. The problematic server had "compressed ARC" disabled due to a previous kernel-crashing problem (which happened less often with that disabled).
2. ZFS prioritises writes over reads.
3. The metadata (directory information, for example) wasn't in the ARC anymore, so it needed to be read from disk again and again.

This together with the fact that:

- We are taking hourly snapshots of all filesystems - this needs to access metadata for all filesystems.
- We are cleaning snapshots nightly - this needs to access metadata for all filesystems (and all old snapshots).
- We had a problem with the cleaning script so it was running behind, and more and more snapshots were piling up on a couple of the fileservers - so instead of 100 snapshots there were around 400, including on the server where we were resilvering. (A rough sketch of that kind of cleanup job follows below.)
- The 500-600 Samba "smbd" processes are using a lot of RAM each (100-200MB), i.e. 60-120GB, so they would compete with the ARC too.
- Many of the NFS operations access metadata from the filesystems - for example mount operations, or filename lookups. So if any of that was evicted from the ARC then it would have to go to disk to read it again. And when ZFS was busy writing resilver data, that would take a long time.
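(For reference, a very rough sketch of the kind of nightly snapshot cleanup job mentioned above - NOT our actual script; the "hourly-" snapshot naming and the keep-count are just placeholders:

#!/bin/sh
# Keep only the newest $KEEP "hourly-*" snapshots per filesystem in the
# pool and destroy the rest, oldest first.
POOL=DATA
KEEP=100
zfs list -H -r -o name -t filesystem "$POOL" | while read -r fs; do
        snaps=$(zfs list -H -d 1 -o name -t snapshot -s creation "$fs" | grep "@hourly-")
        total=$(echo "$snaps" | grep -c .)
        excess=$((total - KEEP))
        [ "$excess" -gt 0 ] || continue
        echo "$snaps" | head -n "$excess" | while read -r snap; do
                zfs destroy "$snap"     # oldest snapshots beyond the keep limit
        done
done

Even a loop like this has to walk every filesystem and its snapshots, which is exactly the metadata traffic described above - with ~20000 filesystems per server it is not a cheap job.)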
Also, when a filesystem is near its quota, that too will cause the ZFS transaction size to go down - inflating write IOPS and starving read IOPS again…

We are probably going to invest in more RAM for the servers (going from 256GB to 512GB or 768GB) to allow for way more ARC.

> You should not need to make any system tunings to make this work.

Well, I had to tune down vfs.zfs.arc_max from the default (90% of RAM) to 128GB since the Samba smbd processes also need RAM to run, and the auto-tuning of the ARC in FreeBSD doesn't seem to work that well. In order to mitigate that problem I'm currently using these settings:

vfs.zfs.arc_max = 96GB (tuned down from 128GB since the ARC overshot the 128GB allocated and started killing random processes)
vfs.zfs.arc_meta_limit = 50% of ARC (default is 25%)

The overshooting of the ARC (especially when resilvering) was probably due to us also tuning kern.maxvnodes - it seems that if you touch that number while the system is already running, the ARC goes berserk. So I can probably increase the ARC to 128GB again. It looks to be more stable now that we've removed the kern.maxvnodes setting completely.

It would have been nice to have a setting for a minimum amount of metadata in the ARC, so it would be better protected from normal data evicting it…

We've installed 512GB instead of 256GB in a test server now and will see what settings might work well there...

Right now (with no resilver running, not many "near-full-quota" filesystems, and many of the extraneous snapshots deleted) things are looking much better:

$ ./pfst -t200ms /mnt/filur0*
…
2020-03-26 21:16:40 [1678 ms]: /mnt/filur01: mkdir("t-omnibus-782-190284") [Time limit exceeded]
2020-03-26 21:16:47 [1956 ms]: /mnt/filur01: mkdir("t-omnibus-782-190289") [Time limit exceeded]
2020-03-26 21:16:53 [1439 ms]: /mnt/filur01: mkdir("t-omnibus-782-190293") [Time limit exceeded]
2020-03-26 21:16:59 [1710 ms]: /mnt/filur01: mkdir("t-omnibus-782-190297") [Time limit exceeded]
2020-03-26 21:17:05 [2044 ms]: /mnt/filur01: mkdir("t-omnibus-782-190301") [Time limit exceeded]
2020-03-26 21:17:11 [1955 ms]: /mnt/filur01: mkdir("t-omnibus-782-190305") [Time limit exceeded]
2020-03-26 21:17:16 [1515 ms]: /mnt/filur01: mkdir("t-omnibus-782-190309") [Time limit exceeded]
...
2020-03-28 06:54:54 [ 370 ms]: /mnt/filur06: mkdir("t-omnibus-783-311285") [Time limit exceeded]
2020-03-28 07:00:01 [ 447 ms]: /mnt/filur01: mkdir("t-omnibus-782-311339") [Time limit exceeded]
2020-03-28 07:00:01 [ 312 ms]: /mnt/filur06: mkdir("t-omnibus-783-311591") [Time limit exceeded]
2020-03-28 07:00:02 [ 291 ms]: /mnt/filur06: mkdir("t-omnibus-783-311592") [Time limit exceeded]
2020-03-28 07:00:05 [ 378 ms]: /mnt/filur06: mkdir("t-omnibus-783-311594") [Time limit exceeded]
2020-03-28 10:35:13 [1876 ms]: /mnt/filur01: mkdir("t-omnibus-782-324215") [Time limit exceeded]

(This only prints operations taking more than 200 ms.)

The last ~2s one is probably due to a new filesystem being created. This triggers the NFS mount daemon to update the kernel list of exports (and hold an NFS kernel lock), which takes some time (although it's much faster now with the "incremental update" fix in mountd).

The slow ones around 21:16 are due to me doing a number of "zfs rename DATA/home/$USER DATA/archive/home/$USER" operations.
This too causes a lot of NFS mountd activity (removing NFS exports from the kernel), causing an NFS kernel lock to be held...

- Peter