From: Mathieu Chouquet-Stringer
To: freebsd-fs@freebsd.org
Date: Sat, 27 Mar 2021 17:06:43 +0100
Subject: Re: Scrub incredibly slow with 13.0-RC3 (as well as RC1 & 2)

Hello,

TL;DR: read_max makes no difference, don't use hpet with KVM, use
scan_legacy.

On Sat, Mar 27, 2021 at 11:59:43AM +0000, tech-lists wrote:
> Have you tried the following sysctl:
>
> vfs.read_max=128

It's something I hadn't tried, so I gave it a shot. Note that I first
ran my tests on zroot, which is on a virtio controller (the image file
lives on a mirror made of two SSDs).
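For reference, each data point below is just three back-to-back scrubs
per vfs.read_max value; the loop was essentially something like this
(a plain sh sketch, polling zpool status until each scrub completes,
pool name hard-coded):

  for rm in 512 256 128 64; do
      sysctl vfs.read_max=$rm
      for run in 1 2 3; do
          zpool scrub zroot
          # wait for the scrub to finish before starting the next one
          while zpool status zroot | grep -q 'scrub in progress'; do
              sleep 5
          done
          # keep the one-line summary for the tables below
          zpool status zroot | grep 'scan:'
      done
  done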
This pool is small, with only 12.1G allocated out of 21.9G:

  NAME    SIZE   ALLOC  FREE   CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
  zroot   21.9G  12.1G  9.88G  -        -         37%   54%  1.00x  ONLINE  -

Scrubbing is hard to bench because the status seems stuck forever:

  scan: scrub in progress since Sat Mar 27 15:16:27 2021
        11.8G scanned at 146M/s, 1.23M issued at 15.2K/s, 12.1G total
        0B repaired, 0.01% done, no estimated completion time

  scan: scrub in progress since Sat Mar 27 15:16:27 2021
        12.0G scanned at 145M/s, 3.58M issued at 43.2K/s, 12.1G total
        0B repaired, 0.03% done, no estimated completion time

  scan: scrub in progress since Sat Mar 27 15:16:27 2021
        12.1G scanned at 142M/s, 409M issued at 4.70M/s, 12.1G total
        0B repaired, 3.31% done, no estimated completion time

Look how it has already read 12.0G, pretty much 100% of the allocated
space in the pool, and it still shows 0.03% done...

And with a bs of 128k (the zfs recordsize), dd with iflags=direct tells
me I can read this thing at:

  23266197504 bytes (23 GB, 22 GiB) transferred 23.042s, 1010 MB/s

With a bs of 32k I still get 310MB/s.

So please find below the scrub durations with vfs.read_max at 512, 256,
128 and 64: it makes no difference. Please note it seems to be
completely CPU bound again, as I see this kernel thread using 100% of a
CPU the whole time the scrub is running:

  31 root  -8  -  0B  3952K  CPU0  0  20:05  100.00% zfskern{txg_thread_enter}

- vfs.read_max=512
  vfs.read_max: 64 -> 512
  - Scrub 1
    scan: scrub repaired 0B in 00:02:53 with 0 errors on Sat Mar 27 15:10:41 2021
  - Scrub 2
    scan: scrub repaired 0B in 00:02:52 with 0 errors on Sat Mar 27 15:13:33 2021
  - Scrub 3
    scan: scrub repaired 0B in 00:02:54 with 0 errors on Sat Mar 27 15:16:27 2021

- vfs.read_max=256
  vfs.read_max: 512 -> 256
  - Scrub 1
    scan: scrub repaired 0B in 00:02:52 with 0 errors on Sat Mar 27 15:19:19 2021
  - Scrub 2
    scan: scrub repaired 0B in 00:02:53 with 0 errors on Sat Mar 27 15:22:12 2021
  - Scrub 3
    scan: scrub repaired 0B in 00:02:52 with 0 errors on Sat Mar 27 15:25:04 2021

- vfs.read_max=128
  vfs.read_max: 256 -> 128
  - Scrub 1
    scan: scrub repaired 0B in 00:02:52 with 0 errors on Sat Mar 27 15:27:56 2021
  - Scrub 2
    scan: scrub repaired 0B in 00:02:53 with 0 errors on Sat Mar 27 15:30:49 2021
  - Scrub 3
    scan: scrub repaired 0B in 00:02:52 with 0 errors on Sat Mar 27 15:33:41 2021

- vfs.read_max=64
  vfs.read_max: 128 -> 64
  - Scrub 1
    scan: scrub repaired 0B in 00:02:53 with 0 errors on Sat Mar 27 15:36:34 2021
  - Scrub 2
    scan: scrub repaired 0B in 00:02:52 with 0 errors on Sat Mar 27 15:39:26 2021
  - Scrub 3
    scan: scrub repaired 0B in 00:02:53 with 0 errors on Sat Mar 27 15:42:19 2021

I am attaching a flamegraph for that busy thread: again, a lot of
hpet_get_timecount...

So I went and disabled hpet in KVM: I gained 30 seconds, and again,
playing with vfs.read_max made no difference, but then again it's an
SSD. And given I am still CPU bound, I'm not sure I'll gain anything by
playing with block sizes...

But the first lesson is to not use hpet with KVM.
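In case someone wants to try the same thing, this is roughly what
disabling hpet amounts to (a sketch, adjust for your setup): with
libvirt, virsh edit the guest and make sure the <clock> section
contains <timer name='hpet' present='no'/>; with plain qemu it's the
-no-hpet switch (or the machine hpet property on newer versions).
Inside the FreeBSD guest you can also check and change the timecounter
without touching the VM definition:

  sysctl kern.timecounter.choice     # what is available, with quality values
  sysctl kern.timecounter.hardware   # what is currently in use
  # e.g. switch away from HPET to whatever .choice lists as good
  # (the exact name varies per machine):
  sysctl kern.timecounter.hardware=ACPI-fast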
Same runs with hpet disabled in KVM:

- vfs.read_max=512
  vfs.read_max: 64 -> 512
  - Scrub 1
    scan: scrub repaired 0B in 00:02:19 with 0 errors on Sat Mar 27 16:14:25 2021
  - Scrub 2
    scan: scrub repaired 0B in 00:02:19 with 0 errors on Sat Mar 27 16:16:44 2021
  - Scrub 3
    scan: scrub repaired 0B in 00:02:18 with 0 errors on Sat Mar 27 16:19:02 2021

- vfs.read_max=256
  vfs.read_max: 512 -> 256
  - Scrub 1
    scan: scrub repaired 0B in 00:02:18 with 0 errors on Sat Mar 27 16:21:20 2021
  - Scrub 2
    scan: scrub repaired 0B in 00:02:19 with 0 errors on Sat Mar 27 16:23:39 2021
  - Scrub 3
    scan: scrub repaired 0B in 00:02:18 with 0 errors on Sat Mar 27 16:25:57 2021

- vfs.read_max=128
  vfs.read_max: 256 -> 128
  - Scrub 1
    scan: scrub repaired 0B in 00:02:18 with 0 errors on Sat Mar 27 16:28:15 2021
  - Scrub 2
    scan: scrub repaired 0B in 00:02:18 with 0 errors on Sat Mar 27 16:30:33 2021
  - Scrub 3
    scan: scrub repaired 0B in 00:02:18 with 0 errors on Sat Mar 27 16:32:51 2021

- vfs.read_max=64
  vfs.read_max: 128 -> 64
  - Scrub 1
    scan: scrub repaired 0B in 00:02:19 with 0 errors on Sat Mar 27 16:35:10 2021
  - Scrub 2
    scan: scrub repaired 0B in 00:02:18 with 0 errors on Sat Mar 27 16:37:28 2021
  - Scrub 3
    scan: scrub repaired 0B in 00:02:19 with 0 errors on Sat Mar 27 16:39:47 2021

An interesting thing is that setting vfs.zfs.scan_legacy=1 makes the
progress start right away. See how it has scanned almost 50% of the
pool and already reports 48.50% done, while with the non-legacy scan
the progress only moves after reading all the allocated space:

  scan: scrub in progress since Sat Mar 27 16:41:33 2021
        5.85G scanned at 133M/s, 5.85G issued at 133M/s, 12.1G total
        0B repaired, 48.50% done, 00:00:47 to go

It's altogether way quicker (the best run drops from 2:18 to 1:35, so
43 seconds faster than the no-hpet test!!!) and totally reproducible:

  scan: scrub repaired 0B in 00:01:35 with 0 errors on Sat Mar 27 16:43:08 2021
  scan: scrub repaired 0B in 00:01:35 with 0 errors on Sat Mar 27 16:46:10 2021

I will start a scrub of my raidz2 zpool and tweak read_max to see if it
makes a difference, but at this point I doubt it.

And scan_legacy could be quicker because I see as many busy threads as
there are vdevs, whereas if I start a scan with the non-legacy scan, I
only see one from the get-go:

  PID USERNAME  PRI NICE  SIZE  RES  STATE  C   TIME   WCPU COMMAND
    0 root      -16    -    0B  13M  CPU0   0   8:40 99.17% kernel{z_rd_iss_4}
    0 root      -16    -    0B  13M  CPU9   9   8:40 99.15% kernel{z_rd_iss_6}
    0 root      -16    -    0B  13M  CPU7   7   8:40 98.97% kernel{z_rd_iss_5}
    0 root      -16    -    0B  13M  CPU4   4   8:40 98.96% kernel{z_rd_iss_7}
    0 root      -16    -    0B  13M  CPU6   6   8:40 98.59% kernel{z_rd_iss_1}
    0 root      -16    -    0B  13M  CPU5   5   8:40 98.56% kernel{z_rd_iss_2}
    0 root      -16    -    0B  13M  CPU14 14   8:40 98.33% kernel{z_rd_iss_3}
    0 root      -16    -    0B  13M  CPU11 11   8:40 97.89% kernel{z_rd_iss_0}

-- 
Mathieu Chouquet-Stringer
            The sun itself sees not till heaven clears.
                         -- William Shakespeare --