Date: Sun, 29 Mar 2020 21:16:00 +0200
From: Peter Eriksson <pen@lysator.liu.se>
To: FreeBSD Filesystems <freebsd-fs@freebsd.org>
Cc: "PK1048.COM" <info@pk1048.com>
Subject: Re: ZFS/NFS hickups and some tools to monitor stuff...
Message-ID: <7790DD37-4F95-409E-9E33-1A330B1B49C8@lysator.liu.se>
In-Reply-To: <CDB51790-ED6B-4670-B256-43CDF98BD26D@pk1048.com>
References: <CFD0E4E5-EF2B-4789-BF14-F46AC569A191@lysator.liu.se> <66AB88C0-12E8-48A0-9CD7-75B30C15123A@pk1048.com> <E6171E44-F677-4926-9F55-775F538900E4@lysator.liu.se> <FE244C11-44CA-4DCC-8CD9-A8C7A7C5F059@pk1048.com> <982F9A21-FF1C-4DAB-98B3-610D70714ED3@lysator.liu.se> <CDB51790-ED6B-4670-B256-43CDF98BD26D@pk1048.com>
> I thought that snapshot deletion was single threaded within a zpool, since TXGs are zpool-wide, not per dataset. So you may not be able to destroy snapshots in parallel.

Yeah, I thought so too but decided to try it anyway. It sometimes goes faster, and since I decoupled the "read snapshots to delete" and the "do the deletion" steps into separate threads, it no longer has to read all the snapshots to delete first and then delete them all; it can interleave the jobs.

Basically what my code now does is:

  for all datasets (recursively)
      collect_snapshots_to_delete
      if more than a (configurable) limit is queued
          start a deletion worker (up to a configurable number of workers)

So it can continue gathering snapshots to delete while deleting a batch, and it doesn't have to wait for all snapshots in all datasets to be read before it starts deleting. So even if it is slow for some reason, it will at least have deleted _some_ snapshots by the time we terminate the "clean" command.

I did some tests with different numbers of "worker" threads and I actually did see some speed improvements (the time was cut in half in some cases). But it varies a lot, I guess - if all the metadata is in the ARC it is normally pretty quick anyway.

I've been thinking of also adding separate read workers, so that if one dataset takes a long time to read its snapshots the others could continue, but that is a bit harder to code in a good way :-)

What we do now is (simplified):

  # Create hourly snapshots that expire in 2 days:
  zfs snap -r -E "se.liu.it:expires" -e 2d "DATA/staff@${DATETIME}"

  # Clean expired snapshots (10 workers, at least 500 snapshots per delete):
  zfs clean -r -E "se.liu.it:expires" -P10 -L500 -e DATA/staff
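Just to illustrate the idea with the stock tools only: something like the pipeline below deletes expired snapshots with ten "zfs destroy" processes in parallel, without any batching. It is only a sketch, and it assumes the expiry property holds an absolute expiry time as a Unix timestamp, which is purely an assumption made for this example:

  # List all snapshots with their expiry property, pick the expired
  # ones and destroy them with up to 10 parallel "zfs destroy" runs.
  NOW=$(date +%s)
  zfs list -H -r -t snapshot -o name,se.liu.it:expires DATA/staff |
      awk -v now="$NOW" '$2 != "-" && $2 <= now { print $1 }' |
      xargs -n 1 -P 10 zfs destroy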
I have my patch available at GitHub ( https://github.com/ptrrkssn/freebsd-stuff ) if it would be of interest.

(At first I modified the "zfs destroy" command, but I always feel nervous about using that one since a slip of the finger could have catastrophic consequences, so I decided to create a separate command that only works on snapshots and nothing else.)

> I expect zpool/zfs commands to be very slow when large zfs operations are in flight. The fact that you are not seeing the issue locally means the issue is not directly with the zpool/dataset but somehow with the interaction between NFS Client <-> NFS Server <-> ZFS dataset ... NFS does not have to be sync, but can you force the NFS client to always use sync writes? That might better leverage the SLOG. Since our use case is VMs and VirtualBox does sync writes, we get the benefit of the SLOG.
>
>>> If this is during typical activity, you are already using 13% of your capacity. I also don't like the 80ms per operation times.
>>
>> The spinning rust drives are HGST He10 (10TB SAS 7200rpm) drives on Dell HBA330 controllers (LSI SAS3008). We also use HP servers with their own Smart H241 HBAs and are seeing similar latencies there.
>
> That should be OK, but I have heard some reports of issues with the HP Smart 4xx series controllers with FreeBSD. Why are you seeing higher disk latency with SAS than we are with SATA? I assume you checked logs for device communication errors and retries?

Yeah, no errors. The HP H241 HBAs are not as well supported as the SAS3008 ones, but they work OK - at least if you force them into "HBA" mode (changeable from the BIOS). Until we did that they had their problems, yes... and there were also some firmware issues on certain releases.

Anyway, we are going to expand the RAM in the servers from 256GB to 512GB (or 768GB). A test I did on our test server seems to indicate that the metadata working set fits much better with more RAM, so everything is much faster.

(Now I'd also like to see persistent L2ARC support (it would be great to have the metadata cached on faster SSDs and have it survive a reboot), but that won't happen until the switch to OpenZFS (FreeBSD 13, hopefully), so...)

- Peter