From: Mathieu Chouquet-Stringer
To: freebsd-fs@freebsd.org
Date: Sat, 27 Mar 2021 17:06:43 +0100
Subject: Re: Scrub incredibly slow with 13.0-RC3 (as well as RC1 & 2)

Hello,

TL;DR: read_max makes no difference, don't use hpet with KVM, use
scan_legacy.

On Sat, Mar 27, 2021 at 11:59:43AM +0000, tech-lists wrote:
> Have you tried the following sysctl:
>
> vfs.read_max=128

It's something I hadn't tried, so I gave it a shot. Note that I first
ran my tests on zroot, which is on a virtio controller (the image file
lives on a mirror made of two SSDs).
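For reference, each data point below is just three back-to-back scrubs
per vfs.read_max value; the loop was essentially something like this
(a plain sh sketch, polling zpool status until each scrub completes,
pool name hard-coded):

  for rm in 512 256 128 64; do
      sysctl vfs.read_max=$rm
      for run in 1 2 3; do
          zpool scrub zroot
          # wait for the scrub to finish before starting the next one
          while zpool status zroot | grep -q 'scrub in progress'; do
              sleep 5
          done
          # keep the one-line summary for the tables below
          zpool status zroot | grep 'scan:'
      done
  done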
This pool is small, with only 12.1G allocated out of 21.9G:

  NAME    SIZE   ALLOC  FREE   CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
  zroot   21.9G  12.1G  9.88G  -        -         37%   54%  1.00x  ONLINE  -

Scrubbing is hard to bench because the status seems stuck forever:

  scan: scrub in progress since Sat Mar 27 15:16:27 2021
        11.8G scanned at 146M/s, 1.23M issued at 15.2K/s, 12.1G total
        0B repaired, 0.01% done, no estimated completion time

  scan: scrub in progress since Sat Mar 27 15:16:27 2021
        12.0G scanned at 145M/s, 3.58M issued at 43.2K/s, 12.1G total
        0B repaired, 0.03% done, no estimated completion time

  scan: scrub in progress since Sat Mar 27 15:16:27 2021
        12.1G scanned at 142M/s, 409M issued at 4.70M/s, 12.1G total
        0B repaired, 3.31% done, no estimated completion time

Look how it has already read 12.0G, pretty much 100% of the allocated
space in the pool, and it still shows 0.03% done...

And with a bs of 128k (the zfs recordsize), dd with iflags=direct tells
me I can read this thing at:

  23266197504 bytes (23 GB, 22 GiB) transferred 23.042s, 1010 MB/s

With a bs of 32k I still get 310MB/s.

So please find below the scrub durations with vfs.read_max at 512, 256,
128 and 64: it makes no difference. Please note it seems to be
completely CPU bound again, as I see this kernel thread using 100% of a
CPU the whole time the scrub is running:

  31 root  -8  -  0B  3952K  CPU0  0  20:05  100.00% zfskern{txg_thread_enter}

- vfs.read_max=512
  vfs.read_max: 64 -> 512
  - Scrub 1
    scan: scrub repaired 0B in 00:02:53 with 0 errors on Sat Mar 27 15:10:41 2021
  - Scrub 2
    scan: scrub repaired 0B in 00:02:52 with 0 errors on Sat Mar 27 15:13:33 2021
  - Scrub 3
    scan: scrub repaired 0B in 00:02:54 with 0 errors on Sat Mar 27 15:16:27 2021

- vfs.read_max=256
  vfs.read_max: 512 -> 256
  - Scrub 1
    scan: scrub repaired 0B in 00:02:52 with 0 errors on Sat Mar 27 15:19:19 2021
  - Scrub 2
    scan: scrub repaired 0B in 00:02:53 with 0 errors on Sat Mar 27 15:22:12 2021
  - Scrub 3
    scan: scrub repaired 0B in 00:02:52 with 0 errors on Sat Mar 27 15:25:04 2021

- vfs.read_max=128
  vfs.read_max: 256 -> 128
  - Scrub 1
    scan: scrub repaired 0B in 00:02:52 with 0 errors on Sat Mar 27 15:27:56 2021
  - Scrub 2
    scan: scrub repaired 0B in 00:02:53 with 0 errors on Sat Mar 27 15:30:49 2021
  - Scrub 3
    scan: scrub repaired 0B in 00:02:52 with 0 errors on Sat Mar 27 15:33:41 2021

- vfs.read_max=64
  vfs.read_max: 128 -> 64
  - Scrub 1
    scan: scrub repaired 0B in 00:02:53 with 0 errors on Sat Mar 27 15:36:34 2021
  - Scrub 2
    scan: scrub repaired 0B in 00:02:52 with 0 errors on Sat Mar 27 15:39:26 2021
  - Scrub 3
    scan: scrub repaired 0B in 00:02:53 with 0 errors on Sat Mar 27 15:42:19 2021

I am attaching a flamegraph for that busy thread: again, a lot of
hpet_get_timecount...

So I went and disabled hpet in KVM: I gained 30 seconds, and again,
playing with vfs.read_max made no difference, but then again it's an
SSD. And given I am still CPU bound, I'm not sure I'll gain anything by
playing with block sizes...

But the first lesson is to not use hpet with KVM.
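In case someone wants to try the same thing, this is roughly what
disabling hpet amounts to (a sketch, adjust for your setup): with
libvirt, virsh edit the guest and make sure the <clock> section
contains <timer name='hpet' present='no'/>; with plain qemu it's the
-no-hpet switch (or the machine hpet property on newer versions).
Inside the FreeBSD guest you can also check and change the timecounter
without touching the VM definition:

  sysctl kern.timecounter.choice     # what is available, with quality values
  sysctl kern.timecounter.hardware   # what is currently in use
  # e.g. switch away from HPET to whatever .choice lists as good
  # (the exact name varies per machine):
  sysctl kern.timecounter.hardware=ACPI-fast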
Same runs with hpet disabled in KVM:

- vfs.read_max=512
  vfs.read_max: 64 -> 512
  - Scrub 1
    scan: scrub repaired 0B in 00:02:19 with 0 errors on Sat Mar 27 16:14:25 2021
  - Scrub 2
    scan: scrub repaired 0B in 00:02:19 with 0 errors on Sat Mar 27 16:16:44 2021
  - Scrub 3
    scan: scrub repaired 0B in 00:02:18 with 0 errors on Sat Mar 27 16:19:02 2021

- vfs.read_max=256
  vfs.read_max: 512 -> 256
  - Scrub 1
    scan: scrub repaired 0B in 00:02:18 with 0 errors on Sat Mar 27 16:21:20 2021
  - Scrub 2
    scan: scrub repaired 0B in 00:02:19 with 0 errors on Sat Mar 27 16:23:39 2021
  - Scrub 3
    scan: scrub repaired 0B in 00:02:18 with 0 errors on Sat Mar 27 16:25:57 2021

- vfs.read_max=128
  vfs.read_max: 256 -> 128
  - Scrub 1
    scan: scrub repaired 0B in 00:02:18 with 0 errors on Sat Mar 27 16:28:15 2021
  - Scrub 2
    scan: scrub repaired 0B in 00:02:18 with 0 errors on Sat Mar 27 16:30:33 2021
  - Scrub 3
    scan: scrub repaired 0B in 00:02:18 with 0 errors on Sat Mar 27 16:32:51 2021

- vfs.read_max=64
  vfs.read_max: 128 -> 64
  - Scrub 1
    scan: scrub repaired 0B in 00:02:19 with 0 errors on Sat Mar 27 16:35:10 2021
  - Scrub 2
    scan: scrub repaired 0B in 00:02:18 with 0 errors on Sat Mar 27 16:37:28 2021
  - Scrub 3
    scan: scrub repaired 0B in 00:02:19 with 0 errors on Sat Mar 27 16:39:47 2021

An interesting thing is that setting vfs.zfs.scan_legacy=1 makes the
progress start right away. See how it has scanned almost 50% of the
pool and already reports 48.50% done, while with the non-legacy scan
the progress only moves after reading all the allocated space:

  scan: scrub in progress since Sat Mar 27 16:41:33 2021
        5.85G scanned at 133M/s, 5.85G issued at 133M/s, 12.1G total
        0B repaired, 48.50% done, 00:00:47 to go

It's altogether way quicker (the best run drops from 2:18 to 1:35, so
43 seconds faster than the no-hpet test!!!) and totally reproducible:

  scan: scrub repaired 0B in 00:01:35 with 0 errors on Sat Mar 27 16:43:08 2021
  scan: scrub repaired 0B in 00:01:35 with 0 errors on Sat Mar 27 16:46:10 2021

I will start a scrub of my raidz2 zpool and tweak read_max to see if it
makes a difference, but at this point I doubt it.

And scan_legacy could be quicker because I see as many busy threads as
there are vdevs, whereas if I start a scan with the non-legacy scan, I
only see one from the get-go:

  PID USERNAME  PRI NICE  SIZE  RES  STATE  C   TIME   WCPU COMMAND
    0 root      -16    -    0B  13M  CPU0   0   8:40 99.17% kernel{z_rd_iss_4}
    0 root      -16    -    0B  13M  CPU9   9   8:40 99.15% kernel{z_rd_iss_6}
    0 root      -16    -    0B  13M  CPU7   7   8:40 98.97% kernel{z_rd_iss_5}
    0 root      -16    -    0B  13M  CPU4   4   8:40 98.96% kernel{z_rd_iss_7}
    0 root      -16    -    0B  13M  CPU6   6   8:40 98.59% kernel{z_rd_iss_1}
    0 root      -16    -    0B  13M  CPU5   5   8:40 98.56% kernel{z_rd_iss_2}
    0 root      -16    -    0B  13M  CPU14 14   8:40 98.33% kernel{z_rd_iss_3}
    0 root      -16    -    0B  13M  CPU11 11   8:40 97.89% kernel{z_rd_iss_0}

-- 
Mathieu Chouquet-Stringer
            The sun itself sees not till heaven clears.
                         -- William Shakespeare --