From: Peter Eriksson <pen@lysator.liu.se>
To: FreeBSD Filesystems <freebsd-fs@freebsd.org>
Cc: "PK1048.COM"
Subject: Re: ZFS/NFS hickups and some tools to monitor stuff...
Date: Sun, 29 Mar 2020 21:16:00 +0200

> I thought that snapshot deletion was single threaded within a zpool, since TXGs are zpool-wide, not per dataset. So you may not be able to destroy snapshots in parallel.

Yeah, I thought so too but decided to try it anyway. It sometimes goes faster, and since I decoupled the "read snapshots to delete" and the "do the deletion" steps into separate threads, it no longer has to read all the snapshots to delete first and then delete them all, but can interleave the two jobs.

Basically, what my code now does is:

    for all datasets (recursively)
        collect_snapshots_to_delete
        if more than a (configurable) limit is queued
            start a deletion worker (up to a configurable number of workers)

So it can continue gathering snapshots to delete while a batch is being deleted, and it doesn't have to wait until it has read all snapshots in all datasets before it starts deleting anything. So even if it is slow for some reason, at least it will have deleted _some_ snapshots by the time we terminate the "clean" command.

I did some tests on the speed with different numbers of "worker" threads and I actually did see some improvements (it cut the time in half in some cases). But it varies a lot I guess - if all the metadata is in the ARC then it is normally pretty quick anyway.

I've been thinking of also adding separate read workers, so that if one dataset takes a long time to read its snapshots the others can continue, but that's a bit harder to code in a good way :-)
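To illustrate the idea (this is only a minimal standalone sketch, not the actual code from my patch - the snapshot names are made up and destroy_batch() is just a placeholder for the real destroy call), the collector/worker split looks roughly like this:

/*
 * clean-sketch.c - illustrative only. One collector thread queues
 * snapshot names; NUM_WORKERS deletion workers drain the queue in
 * batches of BATCH_LIMIT. A real tool would walk the datasets and
 * destroy each batch via ZFS instead of printing it.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BATCH_LIMIT 500     /* like -L500: snapshots per delete batch */
#define NUM_WORKERS 10      /* like -P10:  parallel deletion workers  */
#define QUEUE_MAX   100000

static char *queue[QUEUE_MAX];
static size_t q_len;
static bool collecting_done;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_cond = PTHREAD_COND_INITIALIZER;

/* Placeholder: a real worker would destroy the whole batch in one go. */
static void destroy_batch(char **batch, size_t n)
{
    printf("deleting %zu snapshots (first: %s)\n", n, batch[0]);
    for (size_t i = 0; i < n; i++)
        free(batch[i]);
}

static void *deletion_worker(void *arg)
{
    (void)arg;
    for (;;) {
        char *batch[BATCH_LIMIT];
        size_t n = 0;
        bool done;

        pthread_mutex_lock(&q_lock);
        /* Sleep until a full batch is queued or collection has finished */
        while (q_len < BATCH_LIMIT && !collecting_done)
            pthread_cond_wait(&q_cond, &q_lock);
        while (n < BATCH_LIMIT && q_len > 0)
            batch[n++] = queue[--q_len];
        done = collecting_done && q_len == 0;
        pthread_mutex_unlock(&q_lock);

        if (n > 0)
            destroy_batch(batch, n);
        if (done)
            return NULL;
    }
}

/* Placeholder collector: the real one recurses over the datasets and
 * queues every snapshot whose expire property has passed. */
static void collect_snapshots(void)
{
    char name[128];

    for (int i = 0; i < 5000; i++) {
        snprintf(name, sizeof(name), "DATA/staff@auto-%d", i);
        pthread_mutex_lock(&q_lock);
        if (q_len < QUEUE_MAX)
            queue[q_len++] = strdup(name);
        if (q_len >= BATCH_LIMIT)
            pthread_cond_signal(&q_cond);   /* a batch is ready */
        pthread_mutex_unlock(&q_lock);
    }
    pthread_mutex_lock(&q_lock);
    collecting_done = true;
    pthread_cond_broadcast(&q_cond);        /* wake workers to drain the rest */
    pthread_mutex_unlock(&q_lock);
}

int main(void)
{
    pthread_t workers[NUM_WORKERS];

    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&workers[i], NULL, deletion_worker, NULL);
    collect_snapshots();
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(workers[i], NULL);
    return 0;
}

It builds with something like "cc -o clean-sketch clean-sketch.c -lpthread". The point is just that the collector keeps feeding the queue while earlier batches are being destroyed, so deletions start long before the scan of all datasets has finished.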
What we do now is (simplified):

    # Create hourly snapshots that expire in 2 days:
    zfs snap -r -E "se.liu.it:expires" -e 2d "DATA/staff@${DATETIME}"

    # Clean expired snapshots (10 workers, at least 500 snapshots per delete):
    zfs clean -r -E "se.liu.it:expires" -P10 -L500 -e DATA/staff

I have my patch available on GitHub ( https://github.com/ptrrkssn/freebsd-stuff ) if it would be of interest.

(At first I modified the "zfs destroy" command, but I always feel nervous about using that one since a slip of the finger could have catastrophic consequences, so I decided to create a separate command that only works on snapshots and nothing else.)

> I expect zpool/zfs commands to be very slow when large zfs operations are in flight. The fact that you are not seeing the issue locally means the issue is not directly with the zpool/dataset but somehow with the interaction between NFS Client <-> NFS Server <-> ZFS dataset ... NFS does not have to be sync, but can you force the NFS client to always use sync writes? That might better leverage the SLOG. Since our use case is VMs and VirtualBox does sync writes, we get the benefit of the SLOG.
>
>>> If this is during typical activity, you are already using 13% of your capacity. I also don't like the 80ms per operation times.
>>
>> The spinning rust drives are HGST He10 (10TB SAS 7200rpm) drives on Dell HBA330 controllers (LSI SAS3008). We also use HP servers with their own Smart H241 HBAs and are seeing similar latencies there.
>
> That should be OK, but I have heard some reports of issues with the HP Smart 4xx series controllers with FreeBSD. Why are you seeing higher disk latency with SAS than we are with SATA? I assume you checked the logs for device communication errors and retries?

Yeah, no errors. The HP H241 HBAs are not as well supported as the SAS3008 ones, but they work OK, at least if you force them into "HBA" mode (changeable from the BIOS). Until we did that they did have their problems, yes... and there were also some firmware issues on certain releases.

Anyway, we are going to expand the RAM in the servers from 256GB to 512GB (or 768GB). A test I did on our test server seems to indicate that with more RAM the metadata set fits much better in the ARC, so everything is much faster.

(Now I'd also like to see persistent L2ARC support (it would be great to have the metadata cached on faster SSDs and have it survive a reboot), but that won't happen until the switch to OpenZFS (FreeBSD 13, hopefully), so...)

- Peter