From: Peter Eriksson <pen@lysator.liu.se>
To: FreeBSD Filesystems <freebsd-fs@freebsd.org>
Cc: "PK1048.COM"
Subject: Re: ZFS/NFS hickups and some tools to monitor stuff...
Date: Sun, 29 Mar 2020 21:16:00 +0200

> I thought that snapshot deletion was single threaded within a zpool, since TXGs are zpool-wide, not per dataset. So you may not be able to destroy snapshots in parallel.

Yeah, I thought so too but decided to try it anyway. It sometimes goes faster, and since I decoupled the "read snapshots to delete" and the "do the deletion" steps into separate threads, it no longer has to read all the snapshots to delete first and then delete them all, but can interleave the two jobs.

Basically, what my code now does is:

    for all datasets (recursively)
        collect_snapshots_to_delete
        if more than a (configurable) limit is queued
            start a deletion worker (up to a configurable number of workers)

So it can continue gathering snapshots to delete while a batch is being deleted, and it doesn't have to wait until it has read all snapshots in all datasets before it starts deleting anything. So even if it is slow for some reason, at least it will have deleted _some_ snapshots by the time we terminate the "clean" command.

I did some tests on the speed with different numbers of "worker" threads and I actually did see some improvements (it cut the time in half in some cases). But it varies a lot I guess - if all the metadata is in the ARC then it is normally pretty quick anyway.

I've been thinking of also adding separate read workers, so that if one dataset takes a long time to read its snapshots the others can continue, but that's a bit harder to code in a good way :-)
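To illustrate the idea (this is only a minimal standalone sketch, not the actual code from my patch - the snapshot names are made up and destroy_batch() is just a placeholder for the real destroy call), the collector/worker split looks roughly like this:

/*
 * clean-sketch.c - illustrative only. One collector thread queues
 * snapshot names; NUM_WORKERS deletion workers drain the queue in
 * batches of BATCH_LIMIT. A real tool would walk the datasets and
 * destroy each batch via ZFS instead of printing it.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BATCH_LIMIT 500     /* like -L500: snapshots per delete batch */
#define NUM_WORKERS 10      /* like -P10:  parallel deletion workers  */
#define QUEUE_MAX   100000

static char *queue[QUEUE_MAX];
static size_t q_len;
static bool collecting_done;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_cond = PTHREAD_COND_INITIALIZER;

/* Placeholder: a real worker would destroy the whole batch in one go. */
static void destroy_batch(char **batch, size_t n)
{
    printf("deleting %zu snapshots (first: %s)\n", n, batch[0]);
    for (size_t i = 0; i < n; i++)
        free(batch[i]);
}

static void *deletion_worker(void *arg)
{
    (void)arg;
    for (;;) {
        char *batch[BATCH_LIMIT];
        size_t n = 0;
        bool done;

        pthread_mutex_lock(&q_lock);
        /* Sleep until a full batch is queued or collection has finished */
        while (q_len < BATCH_LIMIT && !collecting_done)
            pthread_cond_wait(&q_cond, &q_lock);
        while (n < BATCH_LIMIT && q_len > 0)
            batch[n++] = queue[--q_len];
        done = collecting_done && q_len == 0;
        pthread_mutex_unlock(&q_lock);

        if (n > 0)
            destroy_batch(batch, n);
        if (done)
            return NULL;
    }
}

/* Placeholder collector: the real one recurses over the datasets and
 * queues every snapshot whose expire property has passed. */
static void collect_snapshots(void)
{
    char name[128];

    for (int i = 0; i < 5000; i++) {
        snprintf(name, sizeof(name), "DATA/staff@auto-%d", i);
        pthread_mutex_lock(&q_lock);
        if (q_len < QUEUE_MAX)
            queue[q_len++] = strdup(name);
        if (q_len >= BATCH_LIMIT)
            pthread_cond_signal(&q_cond);   /* a batch is ready */
        pthread_mutex_unlock(&q_lock);
    }
    pthread_mutex_lock(&q_lock);
    collecting_done = true;
    pthread_cond_broadcast(&q_cond);        /* wake workers to drain the rest */
    pthread_mutex_unlock(&q_lock);
}

int main(void)
{
    pthread_t workers[NUM_WORKERS];

    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&workers[i], NULL, deletion_worker, NULL);
    collect_snapshots();
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(workers[i], NULL);
    return 0;
}

It builds with something like "cc -o clean-sketch clean-sketch.c -lpthread". The point is just that the collector keeps feeding the queue while earlier batches are being destroyed, so deletions start long before the scan of all datasets has finished.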
What we do now is (simplified):

    # Create hourly snapshots that expire in 2 days:
    zfs snap -r -E "se.liu.it:expires" -e 2d "DATA/staff@${DATETIME}"

    # Clean expired snapshots (10 workers, at least 500 snapshots per delete):
    zfs clean -r -E "se.liu.it:expires" -P10 -L500 -e DATA/staff

I have my patch available on GitHub ( https://github.com/ptrrkssn/freebsd-stuff ) if it would be of interest.

(At first I modified the "zfs destroy" command, but I always feel nervous about using that one since a slip of the finger could have catastrophic consequences, so I decided to create a separate command that only works on snapshots and nothing else.)

> I expect zpool/zfs commands to be very slow when large zfs operations are in flight. The fact that you are not seeing the issue locally means the issue is not directly with the zpool/dataset but somehow with the interaction between NFS Client <-> NFS Server <-> ZFS dataset ... NFS does not have to be sync, but can you force the NFS client to always use sync writes? That might better leverage the SLOG. Since our use case is VMs and VirtualBox does sync writes, we get the benefit of the SLOG.
>
>>> If this is during typical activity, you are already using 13% of your capacity. I also don't like the 80ms per operation times.
>>
>> The spinning rust drives are HGST He10 (10TB SAS 7200rpm) drives on Dell HBA330 controllers (LSI SAS3008). We also use HP servers with their own Smart H241 HBAs and are seeing similar latencies there.
>
> That should be OK, but I have heard some reports of issues with the HP Smart 4xx series controllers with FreeBSD. Why are you seeing higher disk latency with SAS than we are with SATA? I assume you checked the logs for device communication errors and retries?

Yeah, no errors. The HP H241 HBAs are not as well supported as the SAS3008 ones, but they work OK, at least if you force them into "HBA" mode (changeable from the BIOS). Until we did that they did have their problems, yes... and there were also some firmware issues on certain releases.

Anyway, we are going to expand the RAM in the servers from 256GB to 512GB (or 768GB). A test I did on our test server seems to indicate that with more RAM the metadata set fits much better in the ARC, so everything is much faster.

(Now I'd also like to see persistent L2ARC support (it would be great to have the metadata cached on faster SSDs and have it survive a reboot), but that won't happen until the switch to OpenZFS (FreeBSD 13, hopefully), so...)

- Peter