From: Peter Eriksson <pen@lysator.liu.se>
Subject: Re: ZFS/NFS hickups and some tools to monitor stuff...
Date: Sun, 29 Mar 2020 20:23:12 +0200
To: FreeBSD Filesystems <freebsd-fs@freebsd.org>
Cc: "PK1048.COM"

>>
>> Mostly home directories. No VM image files. About 20000 filesystems per server with around 100 snapshots per filesystem.
>> Around 150-180M files/directories per server.
>
> Wow. By "filesystems" I assume you mean zfs datasets and not zpools?

Yes, ZFS datasets. Our AD right now contains ~130000 users (not all are active, thankfully - typically around 3000-5000 are online at the same time, 90% via SMB), spread out over a number of servers.

> Why so many? In the early days of zfs, due to the lack of per-user quotas, there were sites with one zfs dataset per user (home directory) so that quotas could be enforced. This raised serious performance issues, mostly at boot time due to the need to mount so many different datasets. Per- ...
>
> But, this may be the underlying source of the performance issues.

Yeah. I'm aware of the user quotas, which are nice. However, they only solve part of the problem(s):

1. With the GDPR laws and the "right to be forgotten" we need to be able to delete a user's files when they leave their employment here and/or when students stop studying. Together with snapshots this would become a much harder operation: if we just had one big dataset for all users we couldn't simply delete a specific user's dataset (and its snapshots) - basically we would have to delete the snapshots for *all* users then...

2. Users that write a lot of data every day - this uses up a lot of "quota" and not "refquota". The "userquota" just counts against the "refquota"... And then even if we find that user and make them stop writing, their old data will live on in the snapshots...

Right now we are trying out a scheme where we give each user dataset a "userquota" of X, a "refquota" of X+1G and a "quota" of 3*X. This will hopefully lessen the problem of "slow servers" when a user is filling up their refquota, since they'll run into their userquota before the refquota...

I've also modified the "zfs snap" command to avoid taking snapshots of near-full datasets. We'll see if this makes things better.

(I've also modified it to add a "zfs clean" command, to parallelise snapshot deletion a bit and to make it possible to be smarter about which snapshots to remove. We used to do this via a Python script, but that is not nearly as efficient as doing it directly in the "zfs" command. We normally set a user property like "se.liu.it:expires" to the date when a snapshot should expire, and then "zfs clean" can look for that property and delete just those that have expired. The idea is to keep hourly snapshots for 2 days, daily snapshots for 2 weeks and weekly snapshots for 3 months (or so - we've been testing different retention times here); anything older is on the backup server, so users can easily recover their own files via the Windows "Previous Versions" feature (or via the ".zfs/snapshot" directory for Unix users).)
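For illustration, a rough sketch of the quota scheme and the expiry tagging described above (the pool, dataset and user names plus the numbers are made up - X=25G here is just an example value, and the expiry date is arbitrary):

  # Per-user dataset: userquota = X, refquota = X+1G, quota = 3*X
  zfs set userquota@jdoe=25G tank/home/jdoe
  zfs set refquota=26G tank/home/jdoe
  zfs set quota=75G tank/home/jdoe

  # Tag a snapshot with an expiry date for a later cleanup pass to act on
  zfs set se.liu.it:expires=2020-04-12 tank/home/jdoe@hourly-2020-03-29
  zfs get -H -o value se.liu.it:expires tank/home/jdoe@hourly-2020-03-29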
>>> Maybe I'm asking the obvious, but what is performance like natively on the server for these operations?
>>
>> Normal response times:
>>
>> This is from a small Intel NUC running OmniOS, so NFS 4.0:
>>
>> $ ./pfst -v /mnt/filur01
>> [pfst, version 1.7 - Peter Eriksson]
>> 2020-03-28 12:19:10 [2114 µs]: /mnt/filur01: mkdir("t-omnibus-821-1")
>
> You misunderstood my question. When you are seeing the performance issue via NFS, do you also see a performance issue directly on the NFS server?

The tests I ran did not indicate the same performance issues directly on the server (I ran the same test program locally). Well, except for "zfs" commands being slow, though.

> If the SMB clients are all (or mostly) Windows 8 or newer, the Microsoft CIFS/SMB client stack has lots of caching to make poor server performance feel good. That caching by the client may be masking comparable performance issues via SMB. Testing directly on the server will remove the network file share layer from the discussion, or focus the discussion there.
>
>> Mkdir & rmdir take about the same amount of time here (0.6 - 1 ms).
>
> Do reads/writes from/to existing files show the same degradation? Especially reads?

Didn't test that at the time, unfortunately. And right now things are running pretty OK...

>>> What does the disk %busy look like on the disks that make up the vdevs? (iostat -x)
>>
>> Don't have those numbers (from when we were seeing problems) unfortunately, but if I remember correctly they were fairly busy during the resilver (not surprising).
>>
>> Current status (right now):
>>
>> # iostat -x 10 | egrep -v pass
>>                        extended device statistics
>> device     r/s   w/s   kr/s    kw/s  ms/r  ms/w  ms/o  ms/t  qlen  %b
>> nvd0         0     0    0.0     0.0     0     0     0     0     0   0
>> da0          3    55   31.1  1129.4    10     1    87     3     0  13
>> da1          4    53   31.5  1109.1    10     1    86     3     0  13
>> da2          5    51   41.9  1082.4     9     1    87     3     0  14
>
> If this is during typical activity, you are already using 13% of your capacity. I also don't like the 80 ms per operation times.

The spinning-rust drives are HGST He10 (10TB SAS 7200rpm) drives on Dell HBA330 controllers (LSI SAS3008). We also use HP servers with their own Smart H241 HBAs and are seeing similar latencies there.

https://www.storagereview.com/review/hgst-ultrastar-he10-10tb-enterprise-hard-drive-review

> What does zpool list show (fragmentation)?

22-27% fragmentation with 50-53% cap (108T size) on the 3 biggest servers.

>> At least for the resilver problem.
>
> There are tunings you can apply to make the resilver even more background than it usually is. I don't have them off the top of my head.

Yeah, I tried those. Didn't make much difference though...

> I have managed zfs servers with hundreds of thousands of snapshots with no performance penalty, except for snapshot management functions (zfs list -t snapshot, which would take many, many minutes to complete), so just the presence of snapshots should not hurt. Be aware that destroying snapshots in the order in which they were _created_ is much faster. In other words, always destroy the oldest snapshot first and work your way forward.

Yeah, I know. That's what we are doing.

- Peter
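PS. A minimal shell sketch of that "oldest snapshot first" expiry cleanup (our real "zfs clean" does this inside the zfs command itself; the dataset name below is made up, and the "echo" keeps it a dry run - drop it to actually destroy anything):

  # Walk one dataset's snapshots in creation order (oldest first) and
  # remove only those whose "se.liu.it:expires" date has passed.
  today=$(date +%Y%m%d)
  zfs list -H -t snapshot -o name -s creation tank/home/jdoe |
  while read -r snap; do
      expires=$(zfs get -H -o value se.liu.it:expires "$snap")
      [ "$expires" = "-" ] && continue        # snapshot has no expiry tag
      if [ "$(echo "$expires" | tr -d '-')" -le "$today" ]; then
          echo zfs destroy "$snap"
      fi
  done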