From: Peter Eriksson <pen@lysator.liu.se>
Subject: Re: ZFS/NFS hickups and some tools to monitor stuff...
Date: Sun, 29 Mar 2020 20:23:12 +0200
To: FreeBSD Filesystems <freebsd-fs@freebsd.org>
Cc: "PK1048.COM"

>>
>> Mostly home directories. No VM image files. About 20000 filesystems per server with around 100 snapshots per filesystem.
>> Around 150-180M files/directories per server.
>
> Wow. By "filesystems" I assume you mean zfs datasets and not zpools?

Yes, ZFS datasets. Our AD right now contains ~130000 users (not all are active, thankfully - typically around 3000-5000 are online at the same time, 90% via SMB), spread out over a number of servers.

> Why so many? In the early days of zfs, due to the lack of per-user quotas, there were sites with one zfs dataset per user (home directory) so that quotas could be enforced. This raised serious performance issues, mostly at boot time due to the need to mount so many different datasets. Per- ...
>
> But, this may be the underlying source of the performance issues.

Yeah. I'm aware of the user quotas, which are nice. However, they only solve part of the problem(s):

1. With the GDPR laws and the "right to be forgotten" we need to be able to delete a user's files when they leave their employment here and/or when students stop studying. Together with snapshots this would become a much harder operation: if we just had one big dataset for all users we couldn't simply delete a specific user's dataset (and its snapshots) - basically we would have to delete the snapshots for *all* users then...

2. Users that write a lot of data every day - this uses up a lot of "quota" and not "refquota". The "userquota" just counts against the "refquota"... And then even if we find that user and make them stop writing, their old data will live on in the snapshots...

Right now we are trying out a scheme where we give each user dataset a "userquota" of X, a "refquota" of X+1G and a "quota" of 3*X. This will hopefully lessen the problem of "slow servers" when a user is filling up their refquota, since they'll run into their userquota before the refquota...

I've also modified the "zfs snap" command to avoid taking snapshots of near-full datasets. We'll see if this makes things better.

(I've also modified it to add a "zfs clean" command, to parallelise snapshot deletion a bit and to make it possible to be smarter about which snapshots to remove. We used to do this via a Python script, but that is not nearly as efficient as doing it directly in the "zfs" command. We normally set a user property like "se.liu.it:expires" to the date when a snapshot should expire, and then "zfs clean" can look for that property and delete just those that have expired. The idea is to keep hourly snapshots for 2 days, daily snapshots for 2 weeks and weekly snapshots for 3 months (or so - we've been testing different retention times here); anything older is on the backup server, so users can easily recover their own files via the Windows "Previous Versions" feature (or via the ".zfs/snapshot" directory for Unix users).)
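For illustration, a rough sketch of the quota scheme and the expiry tagging described above (the pool, dataset and user names plus the numbers are made up - X=25G here is just an example value, and the expiry date is arbitrary):

  # Per-user dataset: userquota = X, refquota = X+1G, quota = 3*X
  zfs set userquota@jdoe=25G tank/home/jdoe
  zfs set refquota=26G tank/home/jdoe
  zfs set quota=75G tank/home/jdoe

  # Tag a snapshot with an expiry date for a later cleanup pass to act on
  zfs set se.liu.it:expires=2020-04-12 tank/home/jdoe@hourly-2020-03-29
  zfs get -H -o value se.liu.it:expires tank/home/jdoe@hourly-2020-03-29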
>>> Maybe I'm asking the obvious, but what is performance like natively on the server for these operations?
>>
>> Normal response times:
>>
>> This is from a small Intel NUC running OmniOS, so NFS 4.0:
>>
>> $ ./pfst -v /mnt/filur01
>> [pfst, version 1.7 - Peter Eriksson]
>> 2020-03-28 12:19:10 [2114 µs]: /mnt/filur01: mkdir("t-omnibus-821-1")
>
> You misunderstood my question. When you are seeing the performance issue via NFS, do you also see a performance issue directly on the NFS server?

The tests I ran did not indicate the same performance issues directly on the server (I ran the same test program locally). Well, except for "zfs" commands being slow, though.

> If the SMB clients are all (or mostly) Windows 8 or newer, the Microsoft CIFS/SMB client stack has lots of caching to make poor server performance feel good. That caching by the client may be masking comparable performance issues via SMB. Testing directly on the server will remove the network file share layer from the discussion, or focus the discussion there.
>
>> Mkdir & rmdir take about the same amount of time here (0.6 - 1 ms).
>
> Do reads/writes from/to existing files show the same degradation? Especially reads?

Didn't test that at the time, unfortunately. And right now things are running pretty OK...

>>> What does the disk %busy look like on the disks that make up the vdevs? (iostat -x)
>>
>> Don't have those numbers (from when we were seeing problems) unfortunately, but if I remember correctly they were fairly busy during the resilver (not surprising).
>>
>> Current status (right now):
>>
>> # iostat -x 10 | egrep -v pass
>>                        extended device statistics
>> device     r/s   w/s   kr/s    kw/s  ms/r  ms/w  ms/o  ms/t  qlen  %b
>> nvd0         0     0    0.0     0.0     0     0     0     0     0   0
>> da0          3    55   31.1  1129.4    10     1    87     3     0  13
>> da1          4    53   31.5  1109.1    10     1    86     3     0  13
>> da2          5    51   41.9  1082.4     9     1    87     3     0  14
>
> If this is during typical activity, you are already using 13% of your capacity. I also don't like the 80 ms per operation times.

The spinning-rust drives are HGST He10 (10TB SAS 7200rpm) drives on Dell HBA330 controllers (LSI SAS3008). We also use HP servers with their own Smart H241 HBAs and are seeing similar latencies there.

https://www.storagereview.com/review/hgst-ultrastar-he10-10tb-enterprise-hard-drive-review

> What does zpool list show (fragmentation)?

22-27% fragmentation with 50-53% cap (108T size) on the 3 biggest servers.

>> At least for the resilver problem.
>
> There are tunings you can apply to make the resilver even more background than it usually is. I don't have them off the top of my head.

Yeah, I tried those. Didn't make much difference though...

> I have managed zfs servers with hundreds of thousands of snapshots with no performance penalty, except for snapshot management functions (zfs list -t snapshot, which would take many, many minutes to complete), so just the presence of snapshots should not hurt. Be aware that destroying snapshots in the order in which they were _created_ is much faster. In other words, always destroy the oldest snapshot first and work your way forward.

Yeah, I know. That's what we are doing.

- Peter
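PS. A minimal shell sketch of that "oldest snapshot first" expiry cleanup (our real "zfs clean" does this inside the zfs command itself; the dataset name below is made up, and the "echo" keeps it a dry run - drop it to actually destroy anything):

  # Walk one dataset's snapshots in creation order (oldest first) and
  # remove only those whose "se.liu.it:expires" date has passed.
  today=$(date +%Y%m%d)
  zfs list -H -t snapshot -o name -s creation tank/home/jdoe |
  while read -r snap; do
      expires=$(zfs get -H -o value se.liu.it:expires "$snap")
      [ "$expires" = "-" ] && continue        # snapshot has no expiry tag
      if [ "$(echo "$expires" | tr -d '-')" -le "$today" ]; then
          echo zfs destroy "$snap"
      fi
  done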