From: Peter Eriksson <pen@lysator.liu.se>
Subject: Re: ZFS/NFS hickups and some tools to monitor stuff...
Date: Sat, 28 Mar 2020 13:22:56 +0100
To: FreeBSD Filesystems <freebsd-fs@freebsd.org>
Cc: PK1048.COM
In-Reply-To: <66AB88C0-12E8-48A0-9CD7-75B30C15123A@pk1048.com>

> On 28 Mar 2020, at 04:07, PK1048.COM wrote:
>>
>> The last couple of weeks I've been fighting with a severe case of NFS users complaining about slow response times from our (5) FreeBSD 11.3-RELEASE-p6 file servers.
>> Now even though our SMB (Windows) users (thankfully, since they are like 500 per server vs 50 NFS users) didn't see the same slowdown (or at least didn't complain about it), the root cause is probably ZFS-related.
>
> What is the use case for the NFS users: Unix home directories? VM image files? etc.

Mostly home directories. No VM image files. About 20000 filesystems per server with around 100 snapshots per filesystem. Around 150-180M files/directories per server.

>> 3. Rsync cloning of data into the servers. Response times of up to 15 minutes were seen… Yes, 15 minutes to do a mkdir("test-dir"). Possibly in conjunction with #1 above….
>
> There is something very wrong with your setup with that kind of NFS response time.
>
> Maybe I'm asking the obvious, but what is performance like natively on the server for these operations?

Normal response times: this is from a small Intel NUC running OmniOS, so NFS 4.0:

$ ./pfst -v /mnt/filur01
[pfst, version 1.7 - Peter Eriksson]
2020-03-28 12:19:10 [2114 µs]: /mnt/filur01: mkdir("t-omnibus-821-1")
2020-03-28 12:19:10 [ 413 µs]: /mnt/filur01: rmdir("t-omnibus-821-1")
2020-03-28 12:19:11 [1375 µs]: /mnt/filur01: mkdir("t-omnibus-821-2")
2020-03-28 12:19:11 [ 438 µs]: /mnt/filur01: rmdir("t-omnibus-821-2")
2020-03-28 12:19:12 [1329 µs]: /mnt/filur01: mkdir("t-omnibus-821-3")
2020-03-28 12:19:12 [ 428 µs]: /mnt/filur01: rmdir("t-omnibus-821-3")
2020-03-28 12:19:13 [1253 µs]: /mnt/filur01: mkdir("t-omnibus-821-4")
2020-03-28 12:19:13 [ 395 µs]: /mnt/filur01: rmdir("t-omnibus-821-4")

I.e. an mkdir() takes around 1-2 ms and an rmdir() around 0.4 ms.

Same from a CentOS 7 client (different hardware, a Dell workstation, NFS 4.1):

$ ./pfst -v /mnt/filur01
[pfst, version 1.6 - Peter Eriksson]
2020-03-28 12:21:15 [ 633 µs]: /mnt/filur01: mkdir("t-electra-965-1")
2020-03-28 12:21:15 [ 898 µs]: /mnt/filur01: rmdir("t-electra-965-1")
2020-03-28 12:21:16 [1019 µs]: /mnt/filur01: mkdir("t-electra-965-2")
2020-03-28 12:21:16 [ 709 µs]: /mnt/filur01: rmdir("t-electra-965-2")
2020-03-28 12:21:17 [ 955 µs]: /mnt/filur01: mkdir("t-electra-965-3")
2020-03-28 12:21:17 [ 668 µs]: /mnt/filur01: rmdir("t-electra-965-3")

Mkdir & rmdir take about the same amount of time here (0.6 - 1 ms).

(The above only tests operations on already-mounted filesystems. Mounting filesystems has its own set of partly different problems occasionally.)
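(pfst above is just a small probe tool of mine. If you want to do something similar without it, a rough sh stand-in is below - the mount point is only an example, and /usr/bin/time's roughly 10 ms resolution is much coarser than pfst's microsecond timestamps, but it is plenty to catch the 200 ms-and-up stalls discussed here:

#!/bin/sh
# Minimal stand-in for pfst: create and remove a test directory on an
# NFS mount once per second and print how long each call took.
# The mount point and the directory naming scheme are just examples.
MNT=${1:-/mnt/filur01}
i=0
while :; do
        i=$((i + 1))
        d="$MNT/t-$(hostname -s)-$$-$i"
        echo "== $(date '+%Y-%m-%d %H:%M:%S') $d"
        /usr/bin/time mkdir "$d"    # prints real/user/sys time for the mkdir()
        /usr/bin/time rmdir "$d"    # prints real/user/sys time for the rmdir()
        sleep 1
done
)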
> What does the disk %busy look like on the disks that make up the vdev's? (iostat -x)

I don't have those numbers from when we were seeing problems, unfortunately, but if I remember correctly the disks were fairly busy during the resilver (not surprising).

Current status (right now):

# iostat -x 10 | egrep -v pass
                        extended device statistics
device       r/s     w/s     kr/s     kw/s  ms/r  ms/w  ms/o  ms/t  qlen  %b
nvd0           0       0      0.0      0.0     0     0     0     0     0   0
da0            3      55     31.1   1129.4    10     1    87     3     0  13
da1            4      53     31.5   1109.1    10     1    86     3     0  13
da2            5      51     41.9   1082.4     9     1    87     3     0  14
da3            3      55     31.1   1129.7    10     1    85     2     0  13
da4            4      53     31.3   1108.7    10     1    86     3     0  13
da5            5      52     41.7   1081.3     9     1    87     3     0  14
da6            3      55     27.6   1103.9    10     1    94     2     0  13
da7            4      54     34.2   1064.7    10     1    92     3     0  14
da8            5      55     39.6   1088.2    10     1    94     3     0  15
da9            3      55     27.7   1103.4    10     1    92     2     0  13
da10           4      53     34.2   1065.0    10     1    92     3     0  14
da11           5      55     39.5   1089.0    10     1    95     3     0  15
da12           1      23      4.7    553.5     0     0     0     0     0   0
da13           1      23      4.7    553.5     0     0     0     0     0   0
da14           0      23      1.1    820.6     0     0     0     0     0   1
da15           0      23      1.0    820.7     0     0     0     0     0   1
da16           0       0      0.0      0.0     1     0     0     1     0   0
da17           0       0      0.0      0.0     1     0     0     0     0   0

>> Previously #1 and #2 hasn't caused that much problems, and #3 definitely. Something has changed the last half year or so but so far I haven't been able to figure it out.
>
> The degradation over time makes me think fragmentation of lots of small writes or an overly full zpool, but that should effect SMB as well as NFS.

Yeah, the zpool is around 50% full, so no.

(One of the servers right now:)

# zpool iostat -v DATA
                                   capacity     operations    bandwidth
pool                             alloc   free   read  write   read  write
-------------------------------- -----  -----  -----  -----  -----  -----
DATA                             57.7T  50.3T     46    556   403K  8.48M
  raidz2                         29.0T  25.0T     23    253   204K  3.49M
    diskid/DISK-7PK8DWXC             -      -      3     52  31.1K  1.10M
    diskid/DISK-7PK5X6XC             -      -      4     50  31.5K  1.08M
    diskid/DISK-7PK4W4BC             -      -      5     49  41.9K  1.06M
    diskid/DISK-7PK204LG             -      -      3     52  31.1K  1.10M
    diskid/DISK-7PK2GDHG             -      -      4     50  31.3K  1.08M
    diskid/DISK-7PK7850C             -      -      5     49  41.7K  1.06M
  raidz2                         28.8T  25.2T     23    256   199K  3.39M
    diskid/DISK-7PK62HHC             -      -      3     53  27.6K  1.08M
    diskid/DISK-7PK6SG3C             -      -      4     52  34.2K  1.04M
    diskid/DISK-7PK8DRHC             -      -      5     53  39.6K  1.06M
    diskid/DISK-7PK85ADG             -      -      3     53  27.7K  1.08M
    diskid/DISK-2TK7PBYD             -      -      4     51  34.2K  1.04M
    diskid/DISK-7PK6WY9C             -      -      5     53  39.5K  1.06M
logs                                 -      -      -      -      -      -
  diskid/DISK-BTHV7146043U400NGN  289M   372G      0     23      0   821K
  diskid/DISK-BTHV71460441400NGN  274M   372G      0     23      0   821K

> What type of SSD's are you using for the SLOG (every zpool has a built-in ZIL, you have a SLOG when you add separate devices for the Intent Log for sync operations)? I had tried a run of the mill consumer SSD for the SLOG and NFS sync performance went DOWN! For SLOG devices you really must have real enterprise grade server class SSDs designed for very heavy random WRITE load. Today you can get enterprise SSDs designed for READ, WRITE, or a combination.

Dell-original enterprise SSDs like the Intel DC S3520-series & Intel DC S3700-series (SATA).

Also tested some Intel SSD 750 (NVMe on PCIe 3.0 cards).

Going to test some pure SAS write-optimised SSDs too (as soon as we can get delivery of them). However, I don't really suspect the SSDs anymore.
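(Side note on testing SLOG candidates: on reasonably recent FreeBSD, diskinfo(8) has a synchronous-write ("SLOG") test that reports sync write latency per block size. It is a destructive write test, so only run it against an empty device that is not yet part of a pool - /dev/daX below is just a placeholder:

# diskinfo -wS /dev/daX

The sync write latency it reports is the number that matters for the NFS sync-write path discussed below.)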
(We've had other issues with the S3520-series though, so we might replace them anyway - they have a tendency to play "possum" every now and then - "die" and go offline - however if we just let them sit there, after a couple of weeks they would reappear again automagically… We've seen the same thing happen with the Intel P3520's too (the PCIe variant), but then it didn't just drop off the bus - it took down the server too by hanging the PCIe bus - crappy buggers.)

> Depending on your NFS use case, you may or may not be seeing sync operations. If you are seeing async operations then a ZIL/SLOG will not help as they are only used for sync operations. For async operations multiple writes are combined in RAM and committed to disk at least every 5 seconds. For sync operations the writes are immediately written to the ZIL/SLOG, so a SLOG _can_ be of benefit if it is fast enough for random write operations. Since a sync operation must be committed to non-volatile storage before the write call can return, and having a single SLOG device can lead to loss of data, it is prudent to mirror the SLOG devices. Having a mirrored pair does not slow down operations as both devices are written to in parallel. The SLOG only exists to meet the sync requirement. If the system does not crash, the SLOG is (almost) never actually read. The exception, and the reason I say "almost", is that if you are writing so much data over long periods of time that the zpool disks cannot catch up between write requests, then you may have to read some TXG (transaction groups) from the SLOG to get them into the zpool.
>
>> We've tried it with and without L2ARC,
>
> ARC and L2ARC will only effect read operations, and you can see the effect with zfs-stats -E.

Well, L2ARC uses up some of the ARC, so if the ARC is too tight from the beginning then it'll make reads go even slower.

Which actually is my main suspect (the ARC) right now, at least for the resilver problem. When the resilver was running it seems it would read a lot of data into the ARC, causing it to drop older data (& metadata) from the ARC cache. Which then became a problem due to three issues:

1. The problematic server had "compressed ARC" disabled due to a previous kernel-crashing problem (which happened less often with that disabled).
2. ZFS prioritises writes over reads.
3. The metadata (directory information, for example) wasn't in the ARC anymore, so it needed to be read from disk again and again.

This together with the fact that:

- We are taking hourly snapshots of all filesystems - this needs to access metadata for all filesystems.
- We are cleaning snapshots nightly - this needs to access metadata for all filesystems (and all old snapshots).
- We had a problem with the cleaning script so it was running behind, and more and more snapshots were piling up on a couple of the fileservers - so instead of 100 snapshots there were around 400, including on the server where we were resilvering. (A rough sketch of that kind of cleanup job follows below.)
- The 500-600 Samba "smbd" processes are using a lot of RAM each (100-200MB), i.e. 60-120GB, so they would compete with the ARC too.
- Many of the NFS operations access metadata from the filesystems - for example mount operations, or filename lookups. So if any of that was evicted from the ARC then it would have to go to disk to read it again. And when ZFS was busy writing resilver data, that would take a long time.
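(For reference, a very rough sketch of the kind of nightly snapshot cleanup job mentioned above - NOT our actual script; the "hourly-" snapshot naming and the keep-count are just placeholders:

#!/bin/sh
# Keep only the newest $KEEP "hourly-*" snapshots per filesystem in the
# pool and destroy the rest, oldest first.
POOL=DATA
KEEP=100
zfs list -H -r -o name -t filesystem "$POOL" | while read -r fs; do
        snaps=$(zfs list -H -d 1 -o name -t snapshot -s creation "$fs" | grep "@hourly-")
        total=$(echo "$snaps" | grep -c .)
        excess=$((total - KEEP))
        [ "$excess" -gt 0 ] || continue
        echo "$snaps" | head -n "$excess" | while read -r snap; do
                zfs destroy "$snap"     # oldest snapshots beyond the keep limit
        done
done

Even a loop like this has to walk every filesystem and its snapshots, which is exactly the metadata traffic described above - with ~20000 filesystems per server it is not a cheap job.)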
Also, when a filesystem is near its quota, that too will cause the ZFS transaction size to go down - inflating write IOPS and starving read IOPS again…

We are probably going to invest in more RAM for the servers (going from 256GB to 512GB or 768GB) to allow for way more ARC.

> You should not need to make any system tunings to make this work.

Well, I had to tune down vfs.zfs.arc_max from the default (90% of RAM) to 128GB since the Samba smbd processes also need RAM to run, and the auto-tuning of the ARC in FreeBSD doesn't seem to work that well. In order to mitigate that problem I'm currently using these settings:

vfs.zfs.arc_max = 96GB (tuned down from 128GB since the ARC overshot the 128GB allocated and started killing random processes)
vfs.zfs.arc_meta_limit = 50% of ARC (default is 25%)

The overshooting of the ARC (especially when resilvering) was probably due to us also tuning kern.maxvnodes - it seems that if you touch that number while the system is already running, the ARC goes berserk. So I can probably increase the ARC to 128GB again. It looks to be more stable now that we've removed the kern.maxvnodes setting completely.

It would have been nice to have a setting for a minimum amount of metadata in the ARC, so it would be better protected from normal data evicting it…

We've installed 512GB instead of 256GB in a test server now and will see what settings might work well there...

Right now (with no resilver running, not many "near-full-quota" filesystems, and many of the extraneous snapshots deleted) things are looking much better:

$ ./pfst -t200ms /mnt/filur0*
…
2020-03-26 21:16:40 [1678 ms]: /mnt/filur01: mkdir("t-omnibus-782-190284") [Time limit exceeded]
2020-03-26 21:16:47 [1956 ms]: /mnt/filur01: mkdir("t-omnibus-782-190289") [Time limit exceeded]
2020-03-26 21:16:53 [1439 ms]: /mnt/filur01: mkdir("t-omnibus-782-190293") [Time limit exceeded]
2020-03-26 21:16:59 [1710 ms]: /mnt/filur01: mkdir("t-omnibus-782-190297") [Time limit exceeded]
2020-03-26 21:17:05 [2044 ms]: /mnt/filur01: mkdir("t-omnibus-782-190301") [Time limit exceeded]
2020-03-26 21:17:11 [1955 ms]: /mnt/filur01: mkdir("t-omnibus-782-190305") [Time limit exceeded]
2020-03-26 21:17:16 [1515 ms]: /mnt/filur01: mkdir("t-omnibus-782-190309") [Time limit exceeded]
...
2020-03-28 06:54:54 [ 370 ms]: /mnt/filur06: mkdir("t-omnibus-783-311285") [Time limit exceeded]
2020-03-28 07:00:01 [ 447 ms]: /mnt/filur01: mkdir("t-omnibus-782-311339") [Time limit exceeded]
2020-03-28 07:00:01 [ 312 ms]: /mnt/filur06: mkdir("t-omnibus-783-311591") [Time limit exceeded]
2020-03-28 07:00:02 [ 291 ms]: /mnt/filur06: mkdir("t-omnibus-783-311592") [Time limit exceeded]
2020-03-28 07:00:05 [ 378 ms]: /mnt/filur06: mkdir("t-omnibus-783-311594") [Time limit exceeded]
2020-03-28 10:35:13 [1876 ms]: /mnt/filur01: mkdir("t-omnibus-782-324215") [Time limit exceeded]

(This only prints operations taking more than 200 ms.)

The last ~2s one is probably due to a new filesystem being created. This triggers the NFS mount daemon to update the kernel list of exports (and hold an NFS kernel lock), which takes some time (although it's much faster now with the "incremental update" fix in mountd).

The slow ones around 21:16 are due to me doing a number of "zfs rename DATA/home/$USER DATA/archive/home/$USER" operations.
This too causes a lot of NFS mountd activity (removing NFS exports from the kernel), causing an NFS kernel lock to be held...

- Peter