Date:      Sun, 29 Mar 2020 20:23:12 +0200
From:      Peter Eriksson <pen@lysator.liu.se>
To:        FreeBSD Filesystems <freebsd-fs@freebsd.org>
Cc:        "PK1048.COM" <info@pk1048.com>
Subject:   Re: ZFS/NFS hickups and some tools to monitor stuff...
Message-ID:  <982F9A21-FF1C-4DAB-98B3-610D70714ED3@lysator.liu.se>
In-Reply-To: <FE244C11-44CA-4DCC-8CD9-A8C7A7C5F059@pk1048.com>
References:  <CFD0E4E5-EF2B-4789-BF14-F46AC569A191@lysator.liu.se> <66AB88C0-12E8-48A0-9CD7-75B30C15123A@pk1048.com> <E6171E44-F677-4926-9F55-775F538900E4@lysator.liu.se> <FE244C11-44CA-4DCC-8CD9-A8C7A7C5F059@pk1048.com>


>> 
>> Mostly home directories. No VM image files. About 20000 filesystems per server with around 100 snapshots per filesystem. Around 150-180M files/directories per server.
> 
> Wow. By “filesystems” I assume you mean ZFS datasets and not zpools?
> 
Yes, ZFS datasets. Our AD right now contains ~130000 users (not all are active, thankfully; typically around 3000-5000 are online at the same time, 90% via SMB), spread out over a number of servers.


> Why so many? In the early days of ZFS, due to the lack of per-user quotas, there were sites with one ZFS dataset per user (home directory) so that quotas could be enforced. This raised serious performance issues, mostly at boot time due to the need to mount so many different datasets. Per-
...
> But, this may be the underlying source of the performance issues.

Yeah. I’m aware of the user quotas, which are nice. However, they only solve part of the problem(s):

1. With the GDPR laws and the “right to be forgotten” we need to be able to delete a user’s files when they leave their employment here and/or when students stop studying. Together with snapshots this would become a much harder operation: if we just had one big dataset for all users, then we couldn’t simply delete a specific user’s dataset (and its snapshots). Basically we would have to delete the snapshots for *all users* then…

2. Users that write a lot of data every day - this uses up a lot of “quota” and not “refquota”. The “userquota” just counts the “refquota”… And then even if we find that user and make them stop writing, their old data will live on in the snapshots…

Right now we are trying out a scheme where we give each user dataset a “userquota” of X, a “refquota” of X+1G and a “quota” of 3*X. This will hopefully lessen the “slow server” problem when a user is filling up their refquota, since they’ll run into their userquota before the refquota…
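The arithmetic of that scheme can be sketched in Python (a hypothetical helper; the 25 GiB base figure is just an example, and in practice the values are of course set per dataset with “zfs set”):

```python
# Sketch of the userquota/refquota/quota scheme described above.
# The helper name and the 25 GiB example are invented for illustration.

GIB = 1024 ** 3

def quota_scheme(x_bytes):
    """Return the three properties to set for a base user quota of x_bytes."""
    return {
        "userquota": x_bytes,        # hit first: counts the user's referenced data
        "refquota": x_bytes + GIB,   # X + 1G: headroom above the userquota
        "quota": 3 * x_bytes,        # 3*X: also covers space held by snapshots
    }

scheme = quota_scheme(25 * GIB)
```

With these numbers a user should always trip their userquota before the refquota, keeping the dataset itself from filling up.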

I’ve also modified the “zfs snap” command to avoid taking snapshots on near-full datasets. We’ll see if this makes things better.
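The guard itself is simple; a minimal Python sketch of the idea (the 95% threshold and the helper name are invented for illustration - the real check lives inside the modified “zfs” command):

```python
# Skip snapshots on near-full datasets: given the dataset's used and
# available byte counts (as zfs reports them), decide whether to snapshot.
# The 95% fill threshold is an assumed example value.

def should_snapshot(used_bytes, avail_bytes, max_fill=0.95):
    """Return True when the dataset is below the fill threshold."""
    total = used_bytes + avail_bytes
    if total == 0:
        return False  # degenerate case: nothing to snapshot
    return used_bytes / total <= max_fill
```

The point being that a snapshot on an almost-full dataset pins soon-to-be-overwritten blocks and can push the user over quota even faster.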

(I’ve also modified it to have a “zfs clean” command that parallelises the snapshot deletion a bit, and makes it possible to be smart about which snapshots to remove. We used to do this via a Python script, but that is not nearly as efficient as doing it directly in the “zfs” command. We normally set a user property like “se.liu.it:expires” to the date when a snapshot should expire, and the “zfs clean” command can then look for that property and delete just those that have expired. The idea is to keep hourly snapshots for 2 days, daily snapshots for 2 weeks and weekly snapshots for 3 months (or so - we’ve been testing different retention times here); anything older is on the backup server, so users can easily recover their own files via the Windows “Previous Versions” feature (or via the “.zfs/snapshot” directory for Unix users).)
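The selection logic of that cleanup can be sketched in Python (dataset names, dates and the helper are invented for illustration; the real “zfs clean” is a local modification of the zfs command itself, fed by “zfs list -o name,se.liu.it:expires”):

```python
# Pick the snapshots whose "se.liu.it:expires" user property has passed.
# Input mimics zfs list output: (snapshot_name, expires) pairs, where
# expires is an ISO date string or "-" when the property is unset.
from datetime import date

def expired_snapshots(listing, today):
    """Return the names of snapshots whose expiry date is before today."""
    doomed = []
    for name, expires in listing:
        if expires == "-":
            continue  # property not set: never auto-delete
        if date.fromisoformat(expires) < today:
            doomed.append(name)
    # Destroying oldest-first is faster; these example names sort by date.
    return sorted(doomed)

listing = [
    ("tank/home/alice@2020-01-01.daily", "2020-01-15"),
    ("tank/home/alice@2020-03-28.daily", "2020-04-11"),
    ("tank/home/alice@manual", "-"),
]
doomed = expired_snapshots(listing, date(2020, 3, 29))
```

Snapshots without the property (manual ones, for instance) are simply left alone.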



>>> Maybe I’m asking the obvious, but what is performance like natively on the server for these operations?
>> 
>> Normal response times:
>> 
>> This is from a small Intel NUC running OmniOS, so NFS 4.0:
>> 
>> $ ./pfst -v /mnt/filur01
>> [pfst, version 1.7 - Peter Eriksson <pen@lysator.liu.se>]
>> 2020-03-28 12:19:10 [2114 µs]: /mnt/filur01: mkdir("t-omnibus-821-1")
> 
> You misunderstood my question. When you are seeing the performance issue via NFS, do you also see a performance issue directly on the NFS server?

The tests I ran did not indicate the same performance issues directly on the server (I ran the same test program locally). Well, except for “zfs” commands being slow, that is.


> If the SMB clients are all (or mostly) Windows 8 or newer, the Microsoft CIFS/SMB client stack has lots of caching to make poor server performance feel good. That caching by the client may be masking comparable performance issues via SMB. Testing directly on the server will remove the network file share layer from the discussion, or focus the discussion there.
> 
>> Mkdir & rmdir take about the same amount of time here (0.6 - 1ms).
> 
> Do reads/writes from/to existing files show the same degradation? Especially reads?

Didn’t test that at the time, unfortunately. And right now things are running pretty OK…


>>> What does the disk %busy look like on the disks that make up the vdevs? (iostat -x)
>>=20
>> Don’t have those numbers (from when we were seeing problems) unfortunately, but if I remember correctly the disks were fairly busy during the resilver (not surprising).
>> 
>> Current status (right now):
>> 
>> # iostat -x 10 | egrep -v pass
>>                        extended device statistics
>> device       r/s     w/s     kr/s     kw/s  ms/r  ms/w  ms/o  ms/t  qlen  %b
>> nvd0           0       0      0.0      0.0     0     0     0     0     0   0
>> da0            3      55     31.1   1129.4    10     1    87     3     0  13
>> da1            4      53     31.5   1109.1    10     1    86     3     0  13
>> da2            5      51     41.9   1082.4     9     1    87     3     0  14
> 
> If this is during typical activity, you are already using 13% of your capacity. I also don’t like the 80ms per-operation times.

The spinning-rust drives are HGST He10 (10TB SAS 7200rpm) drives on Dell HBA330 controllers (LSI SAS3008). We also use HP servers with their own Smart H241 HBAs and are seeing similar latencies there.

https://www.storagereview.com/review/hgst-ultrastar-he10-10tb-enterprise-hard-drive-review


> What does zpool list show (fragmentation)?


22-27% fragmentation at 50-53% capacity used (108T size) on the 3 biggest servers.


>> At least for the resilver problem.
> 
> There are tunings you can apply to make the resilver even more background than it usually is. I don’t have them off the top of my head.

Yeah, I tried those. Didn’t make much difference though…


> I have managed ZFS servers with hundreds of thousands of snapshots with no performance penalty, except for snapshot management functions (zfs list -t snapshot, which would take many, many minutes to complete), so just the presence of snapshots should not hurt. Be aware that destroying snapshots in the order in which they were _created_ is much faster. In other words, always destroy the oldest snapshot first and work your way forward.

Yeah, I know. That’s what we are doing.

- Peter



