Date: Sun, 29 Mar 2020 20:23:12 +0200
From: Peter Eriksson <pen@lysator.liu.se>
To: FreeBSD Filesystems <freebsd-fs@freebsd.org>
Cc: "PK1048.COM" <info@pk1048.com>
Subject: Re: ZFS/NFS hickups and some tools to monitor stuff...
Message-ID: <982F9A21-FF1C-4DAB-98B3-610D70714ED3@lysator.liu.se>
In-Reply-To: <FE244C11-44CA-4DCC-8CD9-A8C7A7C5F059@pk1048.com>
References: <CFD0E4E5-EF2B-4789-BF14-F46AC569A191@lysator.liu.se> <66AB88C0-12E8-48A0-9CD7-75B30C15123A@pk1048.com> <E6171E44-F677-4926-9F55-775F538900E4@lysator.liu.se> <FE244C11-44CA-4DCC-8CD9-A8C7A7C5F059@pk1048.com>
>>
>> Mostly home directories. No VM image files. About 20000 filesystems per server with around 100 snapshots per filesystem. Around 150-180M files/directories per server.
>
> Wow. By "filesystems" I assume you mean zfs datasets and not zpools?
>

Yes, ZFS datasets. Our AD right now contains ~130000 users (not all are active, thankfully; typically around 3000-5000 are online at the same time, 90% of them via SMB), spread out over a number of servers.

> Why so many? In the early days of zfs, due to the lack of per-user quotas, there were sites with one zfs dataset per user (home directory) so that quotas could be enforced. This raised serious performance issues, mostly at boot time due to the need to mount so many different datasets. Per- ...
> But, this may be the underlying source of the performance issues.

Yeah. I'm aware of the user quotas, which are nice. However, they only solve part of the problem(s):

1. With the GDPR laws and the "right to be forgotten" we need to be able to delete a user's files when they leave their employment here and/or when students stop studying. Together with snapshots that becomes a much harder operation: if we just have one big dataset for all users, then we can't simply delete a specific user's dataset (and its snapshots). Basically we would have to delete the snapshots for *all* users then...

2. Users that write a lot of data every day use up a lot of "quota" but not "refquota", and the per-user quota only counts referenced data (like "refquota")... And then, even if we find that user and make them stop writing, their old data will live on in the snapshots...

Right now we are trying out a scheme where we give each user dataset a "userquota" of X, a "refquota" of X+1G and a "quota" of 3*X. This will hopefully lessen the "slow server" problem when a user is filling up their refquota, since they'll run into their user quota before the refquota...

I've also modified the "zfs snap" command to avoid taking snapshots on near-full datasets. We'll see if this makes things better.

(I've also modified it to add a "zfs clean" command that parallelises the snapshot deletion a bit and makes it possible to be smart about which snapshots to remove. We used to do this via a Python script, but that is not nearly as efficient as doing it directly in the "zfs" command. We normally set a user property like "se.liu.it:expires" to the date when a snapshot should expire, and then "zfs clean" can look for that property and delete just those that have expired. The idea is to keep hourly snapshots for 2 days, daily snapshots for 2 weeks and weekly snapshots for 3 months (or so; we've been testing different retention times here) - anything older is on the backup server. That way users can easily recover their own files via the Windows "Previous Versions" feature (or via the .zfs/snapshot directory for Unix users).)
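To give an idea of the expiry logic (a very simplified sketch only - the real code sits inside the modified "zfs" command, and the dataset name below is just a placeholder), it is roughly equivalent to:

  #!/usr/bin/env python3
  # Simplified sketch of the "zfs clean" expiry idea.
  # Assumes snapshots carry a "se.liu.it:expires" user property set to an
  # ISO date (YYYY-MM-DD); the dataset name is a placeholder.
  import subprocess
  from datetime import date

  DATASET = "tank/home/someuser"   # placeholder

  # "-s creation" sorts oldest first, so expired snapshots get destroyed
  # in creation order (the cheap order to destroy them in).
  out = subprocess.run(
      ["zfs", "list", "-H", "-t", "snapshot", "-r", DATASET,
       "-o", "name,se.liu.it:expires", "-s", "creation"],
      capture_output=True, text=True, check=True).stdout

  today = date.today().isoformat()
  for line in out.splitlines():
      name, expires = line.split("\t")
      if expires != "-" and expires <= today:   # ISO dates compare as strings
          subprocess.run(["zfs", "destroy", name], check=True)

The real command also runs the destroys for many datasets in parallel, which the sketch leaves out.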
>>> Maybe I'm asking the obvious, but what is performance like natively on the server for these operations?
>>
>> Normal response times:
>>
>> This is from a small Intel NUC running OmniOS, so NFS 4.0:
>>
>> $ ./pfst -v /mnt/filur01
>> [pfst, version 1.7 - Peter Eriksson <pen@lysator.liu.se>]
>> 2020-03-28 12:19:10 [2114 µs]: /mnt/filur01: mkdir("t-omnibus-821-1")
>
> You misunderstood my question. When you are seeing the performance issue via NFS, do you also see a performance issue directly on the NFS server?

The tests I ran did not indicate the same performance issues directly on the server (I ran the same test program locally). Well, except for the "zfs" commands being slow, though.

> If the SMB clients are all (or mostly) Windows 8 or newer, the Microsoft CIFS/SMB client stack has lots of caching to make poor server performance feel good. That caching by the client may be masking comparable performance issues via SMB. Testing directly on the server will remove the network file share layer from the discussion, or focus the discussion there.
>
>> Mkdir & rmdir take about the same amount of time here (0.6 - 1 ms).
>
> Do reads/writes from/to existing files show the same degradation? Especially reads?

Didn't test that at the time, unfortunately. And right now things are running pretty OK.

>>> What does the disk %busy look like on the disks that make up the vdevs? (iostat -x)
>>
>> Don't have those numbers (from when we were seeing problems) unfortunately, but if I remember correctly the disks were fairly busy during the resilver (not surprising).
>>
>> Current status (right now):
>>
>> # iostat -x 10 | egrep -v pass
>>                        extended device statistics
>> device     r/s   w/s    kr/s     kw/s  ms/r  ms/w  ms/o  ms/t  qlen  %b
>> nvd0         0     0     0.0      0.0     0     0     0     0     0   0
>> da0          3    55    31.1   1129.4    10     1    87     3     0  13
>> da1          4    53    31.5   1109.1    10     1    86     3     0  13
>> da2          5    51    41.9   1082.4     9     1    87     3     0  14
>
> If this is during typical activity, you are already using 13% of your capacity. I also don't like the 80 ms per operation times.

The spinning rust drives are HGST He10 (10TB SAS 7200rpm) drives on Dell HBA330 controllers (LSI SAS3008). We also use HP servers with their own Smart H241 HBAs and are seeing similar latencies there.

https://www.storagereview.com/review/hgst-ultrastar-he10-10tb-enterprise-hard-drive-review

> What does zpool list show (fragmentation)?

22-27% fragmentation at 50-53% capacity used (108T pool size) on the 3 biggest servers.

>> At least for the resilver problem.
>
> There are tunings you can apply to make the resilver even more background than it usually is. I don't have them off the top of my head.

Yeah, I tried those. Didn't make much difference though...

> I have managed zfs servers with hundreds of thousands of snapshots with no performance penalty, except for snapshot management functions (zfs list -t snapshot, which would take many, many minutes to complete), so just the presence of snapshots should not hurt. Be aware that destroying snapshots in the order in which they were _created_ is much faster. In other words, always destroy the oldest snapshot first and work your way forward.

Yeah, I know. That's what we are doing.

- Peter
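PS. The "avoid taking snapshots on near-full datasets" change I mentioned is really done inside the modified "zfs" command, but the check boils down to something like this (sketch only; the 5% threshold and the dataset name are arbitrary examples):

  #!/usr/bin/env python3
  # Rough sketch of the "skip near-full datasets" check before snapshotting.
  import subprocess
  from datetime import datetime

  DATASET = "tank/home/someuser"   # placeholder
  MIN_FREE = 0.05                  # skip the snapshot if < 5% space left

  out = subprocess.run(
      ["zfs", "get", "-H", "-p", "-o", "value", "available,used", DATASET],
      capture_output=True, text=True, check=True).stdout
  avail, used = (int(v) for v in out.split())

  # used+avail as a rough "total" for the dataset; good enough as a heuristic
  if avail / (avail + used) < MIN_FREE:
      print(f"{DATASET}: nearly full, skipping snapshot")
  else:
      snapname = DATASET + "@auto-" + datetime.now().strftime("%Y-%m-%d.%H:%M")
      subprocess.run(["zfs", "snapshot", snapname], check=True)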