Date:      Mon, 4 Sep 2023 02:56:53 -0700
From:      Mark Millard <marklmi@yahoo.com>
To:        Alexander Motin <mav@FreeBSD.org>
Cc:        dev-commits-src-main@freebsd.org, Current FreeBSD <freebsd-current@freebsd.org>
Subject:   Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned
Message-ID:  <C78808EB-FDE2-432D-8309-A7017DFC1BCE@yahoo.com>
In-Reply-To: <2BDD30B5-6248-4EC3-83C8-0499E0717D1D@yahoo.com>
References:  <F20BD354-2E10-4CD9-88FB-99DB1E405A52.ref@yahoo.com> <F20BD354-2E10-4CD9-88FB-99DB1E405A52@yahoo.com> <c9c5096b-33d9-1cb6-d446-97b99f324fce@FreeBSD.org> <673A446E-6F94-451E-910F-079F678C5289@yahoo.com> <2BDD30B5-6248-4EC3-83C8-0499E0717D1D@yahoo.com>

On Sep 4, 2023, at 02:00, Mark Millard <marklmi@yahoo.com> wrote:

> On Sep 3, 2023, at 23:35, Mark Millard <marklmi@yahoo.com> wrote:
> 
>> On Sep 3, 2023, at 22:06, Alexander Motin <mav@FreeBSD.org> wrote:
>> 
>>> 
>>> On 03.09.2023 22:54, Mark Millard wrote:
>>>> After that ^t produced the likes of:
>>>> load: 6.39  cmd: sh 4849 [tx->tx_quiesce_done_cv] 10047.33r 0.51u 121.32s 1% 13004k
>>> 
>>> So the full state is not "tx->tx", but is actually
>>> "tx->tx_quiesce_done_cv", which means the thread is waiting for a new
>>> transaction to be opened, which means some previous one has to be
>>> quiesced and then synced.
>>> 
>>>> #0 0xffffffff80b6f103 at mi_switch+0x173
>>>> #1 0xffffffff80bc0f24 at sleepq_switch+0x104
>>>> #2 0xffffffff80aec4c5 at _cv_wait+0x165
>>>> #3 0xffffffff82aba365 at txg_wait_open+0xf5
>>>> #4 0xffffffff82a11b81 at dmu_free_long_range+0x151
>>> 
>>> Here it seems the transaction commit is being waited on due to a
>>> large amount of delete operations, which ZFS tries to spread between
>>> separate TXGs.
>> 
>> That fit the context: cleaning out /usr/local/poudriere/data/.m/
>> 
>>> You should probably see some large and growing number in
>>> sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay .
>> 
>> After the reboot I started a -J64 example. It has avoided the
>> early "witness exhausted". Again I ^C'd about an hour after
>> the 2nd builder had started. So: again cleaning out
>> /usr/local/poudriere/data/.m/ . Only seconds elapsed between:
>> 
>> # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
>> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276042
>> 
>> # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
>> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276427
>> 
>> # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
>> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 277323
>> 
>> # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
>> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 278027
>> 
>> I have found a measure of progress: zfs list's USED
>> for /usr/local/poudriere/data/.m is decreasing. So
>> ztop's d/s was a good classification: deletes.
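>> 
>> (For reference, a minimal way to watch both figures together, assuming
>> /usr/local/poudriere/data/.m really is a mounted dataset, would be an
>> untested sh one-liner such as:
>> 
>> # while :; do sysctl -n kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay; zfs list -H -o used /usr/local/poudriere/data/.m; sleep 10; done
>> 
>> with the counter growing while USED shrinks.)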
>> 
>>>> #5 0xffffffff829a87d2 at zfs_rmnode+0x72
>>>> #6 0xffffffff829b658d at zfs_freebsd_reclaim+0x3d
>>>> #7 0xffffffff8113a495 at VOP_RECLAIM_APV+0x35
>>>> #8 0xffffffff80c5a7d9 at vgonel+0x3a9
>>>> #9 0xffffffff80c5af7f at vrecycle+0x3f
>>>> #10 0xffffffff829b643e at zfs_freebsd_inactive+0x4e
>>>> #11 0xffffffff80c598cf at vinactivef+0xbf
>>>> #12 0xffffffff80c590da at vput_final+0x2aa
>>>> #13 0xffffffff80c68886 at kern_funlinkat+0x2f6
>>>> #14 0xffffffff80c68588 at sys_unlink+0x28
>>>> #15 0xffffffff8106323f at amd64_syscall+0x14f
>>>> #16 0xffffffff8103512b at fast_syscall_common+0xf8
>>> 
>>> What we don't see here is what the quiesce and sync threads of the
>>> pool are actually doing.  The sync thread has plenty of different
>>> jobs, including async write, async destroy, scrub and others, that
>>> may all delay each other.
>>> 
>>> Before you rebooted the system, depending on how alive it is, could
>>> you save a number of outputs of `procstat -akk`, or at least
>>> specifically `procstat -akk | grep txg_thread_enter` if the full
>>> output is hard?  Or somehow else observe what they are doing.
>> 
>> # procstat -akk > ~/mmjnk00.txt
>> # procstat -akk > ~/mmjnk01.txt
>> # procstat -akk > ~/mmjnk02.txt
>> # procstat -akk > ~/mmjnk03.txt
>> # procstat -akk > ~/mmjnk04.txt
>> # procstat -akk > ~/mmjnk05.txt
>> # grep txg_thread_enter ~/mmjnk0[0-5].txt
>> /usr/home/root/mmjnk00.txt:    6 100881 zfskern             txg_thread_enter    mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
>> /usr/home/root/mmjnk00.txt:    6 100882 zfskern             txg_thread_enter    mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
>> /usr/home/root/mmjnk01.txt:    6 100881 zfskern             txg_thread_enter    mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
>> /usr/home/root/mmjnk01.txt:    6 100882 zfskern             txg_thread_enter    mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
>> /usr/home/root/mmjnk02.txt:    6 100881 zfskern             txg_thread_enter    mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
>> /usr/home/root/mmjnk02.txt:    6 100882 zfskern             txg_thread_enter    mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
>> /usr/home/root/mmjnk03.txt:    6 100881 zfskern             txg_thread_enter    mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
>> /usr/home/root/mmjnk03.txt:    6 100882 zfskern             txg_thread_enter    mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
>> /usr/home/root/mmjnk04.txt:    6 100881 zfskern             txg_thread_enter    mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
>> /usr/home/root/mmjnk04.txt:    6 100882 zfskern             txg_thread_enter    mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
>> /usr/home/root/mmjnk05.txt:    6 100881 zfskern             txg_thread_enter    mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
>> /usr/home/root/mmjnk05.txt:    6 100882 zfskern             txg_thread_enter    mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
>> 
>> (Hopefully that will be a sufficiently useful start.)
>> 
>>> `zpool status`, `zpool get all` and `sysctl -a` would also not harm.
>> 
>> It is a very simple zpool configuration: one partition.
>> I only use it for bectl BE reasons, not the general
>> range of reasons for using zfs. I created the media with
>> my normal content, then checkpointed before doing the
>> git fetch to start to set up the experiment.
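>> 
>> (A sketch of that checkpoint workflow, from memory, using the
>> standard zpool-checkpoint(8) commands and the pool name shown below:
>> 
>> # zpool checkpoint zamd64
>> ... run the experiment ...
>> # zpool export zamd64
>> # zpool import --rewind-to-checkpoint zamd64
>> 
>> The rewind step needs the pool exported, so for a boot pool it has to
>> be done from other boot media. `zpool checkpoint -d zamd64` instead
>> just discards the checkpoint when keeping the results.)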
>> 
>> # zpool status
>> pool: zamd64
>> state: ONLINE
>> status: Some supported and requested features are not enabled on the pool.
>> The pool can still be used, but some features are unavailable.
>> action: Enable all features using 'zpool upgrade'. Once this is done,
>> the pool may no longer be accessible by software that does not support
>> the features. See zpool-features(7) for details.
>> checkpoint: created Sun Sep  3 11:46:54 2023, consumes 2.17M
>> config:
>> 
>> NAME            STATE     READ WRITE CKSUM
>> zamd64          ONLINE       0     0     0
>>   gpt/amd64zfs  ONLINE       0     0     0
>> 
>> errors: No known data errors
>> 
>> There was also a snapshot in place before I did the
>> checkpoint operation.
>> 
>> I deliberately did not use my typical openzfs-2.1-freebsd
>> for compatibility but used defaults when creating the pool:
>> 
>> # zpool get all
>> NAME    PROPERTY                       VALUE                          SOURCE
>> zamd64  size                           872G                           -
>> zamd64  capacity                       21%                            -
>> zamd64  altroot                        -                              default
>> zamd64  health                         ONLINE                         -
>> zamd64  guid                           4817074778276814820            -
>> zamd64  version                        -                              default
>> zamd64  bootfs                         zamd64/ROOT/main-amd64         local
>> zamd64  delegation                     on                             default
>> zamd64  autoreplace                    off                            default
>> zamd64  cachefile                      -                              default
>> zamd64  failmode                       wait                           default
>> zamd64  listsnapshots                  off                            default
>> zamd64  autoexpand                     off                            default
>> zamd64  dedupratio                     1.00x                          -
>> zamd64  free                           688G                           -
>> zamd64  allocated                      184G                           -
>> zamd64  readonly                       off                            -
>> zamd64  ashift                         0                              default
>> zamd64  comment                        -                              default
>> zamd64  expandsize                     -                              -
>> zamd64  freeing                        0                              -
>> zamd64  fragmentation                  17%                            -
>> zamd64  leaked                         0                              -
>> zamd64  multihost                      off                            default
>> zamd64  checkpoint                     2.17M                          -
>> zamd64  load_guid                      17719601284614388220           -
>> zamd64  autotrim                       off                            default
>> zamd64  compatibility                  off                            default
>> zamd64  bcloneused                     0                              -
>> zamd64  bclonesaved                    0                              -
>> zamd64  bcloneratio                    1.00x                          -
>> zamd64  feature@async_destroy          enabled                        local
>> zamd64  feature@empty_bpobj            active                         local
>> zamd64  feature@lz4_compress           active                         local
>> zamd64  feature@multi_vdev_crash_dump  enabled                        local
>> zamd64  feature@spacemap_histogram     active                         local
>> zamd64  feature@enabled_txg            active                         local
>> zamd64  feature@hole_birth             active                         local
>> zamd64  feature@extensible_dataset     active                         local
>> zamd64  feature@embedded_data          active                         local
>> zamd64  feature@bookmarks              enabled                        local
>> zamd64  feature@filesystem_limits      enabled                        local
>> zamd64  feature@large_blocks           enabled                        local
>> zamd64  feature@large_dnode            enabled                        local
>> zamd64  feature@sha512                 enabled                        local
>> zamd64  feature@skein                  enabled                        local
>> zamd64  feature@edonr                  enabled                        local
>> zamd64  feature@userobj_accounting     active                         local
>> zamd64  feature@encryption             enabled                        local
>> zamd64  feature@project_quota          active                         local
>> zamd64  feature@device_removal         enabled                        local
>> zamd64  feature@obsolete_counts        enabled                        local
>> zamd64  feature@zpool_checkpoint       active                         local
>> zamd64  feature@spacemap_v2            active                         local
>> zamd64  feature@allocation_classes     enabled                        local
>> zamd64  feature@resilver_defer         enabled                        local
>> zamd64  feature@bookmark_v2            enabled                        local
>> zamd64  feature@redaction_bookmarks    enabled                        local
>> zamd64  feature@redacted_datasets      enabled                        local
>> zamd64  feature@bookmark_written       enabled                        local
>> zamd64  feature@log_spacemap           active                         local
>> zamd64  feature@livelist               enabled                        local
>> zamd64  feature@device_rebuild         enabled                        local
>> zamd64  feature@zstd_compress          enabled                        local
>> zamd64  feature@draid                  enabled                        local
>> zamd64  feature@zilsaxattr             active                         local
>> zamd64  feature@head_errlog            active                         local
>> zamd64  feature@blake3                 enabled                        local
>> zamd64  feature@block_cloning          enabled                        local
>> zamd64  feature@vdev_zaps_v2           active                         local
>> zamd64  feature@redaction_list_spill   disabled                       local
>> 
>> /etc/sysctl.conf does have:
>> 
>> vfs.zfs.min_auto_ashift=12
>> vfs.zfs.per_txg_dirty_frees_percent=5
>> 
>> The vfs.zfs.per_txg_dirty_frees_percent setting is from prior help
>> from Mateusz Guzik, where after testing the change I reported:
>> 
>> Result summary: Seems to have avoided the sustained periods
>> of low load average activity. Much better for the context.
>> 
>> That was for a different machine (aarch64, 8 cores), but it
>> was also for poudriere bulk use.
>> 
>> It turns out the default of 30 was causing something like what
>> is seen here: I could have presented some of the information
>> via the small load average figures here.
>> 
>> (Note: 5 is the old default, 30 is newer. Other contexts
>> have other problems with 5: no single right setting and
>> no automated configuration.)
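>> 
>> (If I recall correctly, the same knob is also writable at runtime,
>> so a value can be tried without a reboot:
>> 
>> # sysctl vfs.zfs.per_txg_dirty_frees_percent=5
>> 
>> before repeating a poudriere run.)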
>> 
>> Other than those 2 items, zfs is untuned (defaults).
>> 
>> sysctl -a is a lot more output (864930 bytes) so I'll skip
>> it for now.
>> 
>>> PS: I may be wrong, but the USB in "USB3 NVMe SSD storage" makes me
>>> shiver. Make sure there are no storage problems, like some huge
>>> delays, timeouts, etc., that can be seen, for example, as busy
>>> percents regularly spiking far above 100% in your `gstat -spod`.
>>> 
>> 
>> The "gstat -spod" output showed (and shows): around 0.8ms/w to 3ms/w,
>> mostly at the lower end of the range. < 98%busy, no spikes to > 100%.
>> It is a previously unused Samsung PSSD T7 Touch.
> 
> A little more context here: that is for the "kB" figures seen
> during the cleanup/delete activity. During port builds into
> packages, larger "kB" figures are seen and the ms/w figures
> will tend to be larger as well. The larger sizes can also
> lead to reaching somewhat above 100 %busy some of the time.
> 
> I'll also note that I've ended up doing a lot more write
> activity exploring than I'd expected.
> 
> I was not prepared to replace the content of a PCIe slot's media
> or M.2 connection's media for the temporary purpose. No spare
> supply for those, so no simple swapping for those.
> 
> 

Trying -J36 (so: 32+4) got to 470 built in about an hour
after [02] reached "Builder started". /usr/local/poudriere/data/.m
used a little under 40 GiBytes at that point. (I do not have
a file count.)
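
(For a future run, a rough file count and size at that point could be
grabbed with something like:

# find /usr/local/poudriere/data/.m -type f | wc -l
# du -sh /usr/local/poudriere/data/.m

but I did not think to capture those this time.)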

The cleanup seems to have gone somewhat faster after my ^C for
this context:

^C[01:20:20] Error: Signal SIGINT caught, cleaning up and exiting
[01:20:20] [27] [00:02:54] Finished math/p5-Data-Float | p5-Data-Float-0.013: Success
[main-amd64-bulk_a-default] [2023-09-04_00h30m42s] [sigint:] Queued: 34588 Built: 502   Failed: 1     Skipped: 50    Ignored: 335   Fetched: 0     Tobuild: 33700  Time: 01:20:12
[01:20:22] Logs: /usr/local/poudriere/data/logs/bulk/main-amd64-bulk_a-default/2023-09-04_00h30m42s
[01:20:23] [25] [00:04:46] Finished www/p5-HTML-TreeBuilder-XPath | p5-HTML-TreeBuilder-XPath-0.14_1: Success
[01:20:24] Cleaning up
[02:17:01] Unmounting file systems
Exiting with status 1

So it took about an hour to clean up after 502 port builds into
packages (not published, though).

( gstat -spod showed a fairly general, sustained lack of read activity,
instead of the comparatively small sustained amount I'd not mentioned
for the previous explorations. Maybe that helped. )
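
( For a future run, gstat's batch mode should allow keeping a record
of that instead of just watching the curses display, assuming I have
the flags right, with something like:

# gstat -spod -B -I 5s > ~/gstat-cleanup.log
)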


===
Mark Millard
marklmi at yahoo.com



