Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 19 Aug 2023 16:27:26 -0700
From:      Mark Millard <marklmi@yahoo.com>
To:        Current FreeBSD <freebsd-current@freebsd.org>
Subject:   Re: ZFS deadlock in 14 [USE_TMPFS=no poudriere messed up from the start, lots of "vlruwk"]
Message-ID:  <8D0C1422-CE60-4266-8051-2296C3E9B7D7@yahoo.com>
In-Reply-To: <ACACF575-A23B-480C-96D5-9C9599C917FC@yahoo.com>
References:  <59FCB309-4A55-4924-98C4-7ACCA70FD299@yahoo.com> <0F2C42B4-36FF-443A-A174-5B0CC57C4FC7@yahoo.com> <3AA253E3-C4F0-4AA3-9C37-D77E7527A458@yahoo.com> <C5747BF8-724E-43B7-88D0-A9F70485E7E1@yahoo.com> <ACACF575-A23B-480C-96D5-9C9599C917FC@yahoo.com>

next in thread | previous in thread | raw e-mail | index | archive | help
On Aug 19, 2023, at 15:41, Mark Millard <marklmi@yahoo.com> wrote:

> On Aug 19, 2023, at 13:41, Mark Millard <marklmi@yahoo.com> wrote:
>=20
>> [I forgot to adjust USE_TMPFS for the purpose of the test.
>> So I'll later be starting over.]
>>=20
>> . . .
>=20
> I finally got around to starting a from-scratch bulk -a
> again (based on USE_TMPFS=3Dno this time). This is with
> 15107.patch and 15122.patch applied. This is a non-debug
> kernel experiment.
>=20
> Interstingly it got:
>=20
> [00:01:34] [01] [00:00:00] Builder starting
> [00:01:57] [01] [00:00:23] Builder started
> [00:01:57] [01] [00:00:00] Building ports-mgmt/pkg | pkg-1.20.4
> [00:03:09] [01] [00:01:12] Finished ports-mgmt/pkg | pkg-1.20.4: =
Success
> [00:03:21] [01] [00:00:00] Building print/indexinfo | indexinfo-0.3.1
> [00:03:21] [02] [00:00:00] Builder starting
> [00:03:21] [03] [00:00:00] Builder starting
> [00:03:21] [04] [00:00:00] Builder starting
> [00:03:21] [05] [00:00:00] Builder starting
> [00:03:21] [06] [00:00:00] Builder starting
> [00:03:21] [07] [00:00:00] Builder starting
> [00:03:22] [08] [00:00:00] Builder starting
> [00:03:22] [09] [00:00:00] Builder starting
> [00:03:22] [10] [00:00:00] Builder starting
> [00:03:22] [11] [00:00:00] Builder starting
> [00:03:22] [12] [00:00:00] Builder starting
> [00:03:22] [13] [00:00:00] Builder starting
> [00:03:22] [14] [00:00:00] Builder starting
> [00:03:22] [15] [00:00:00] Builder starting
> [00:03:22] [16] [00:00:00] Builder starting
> [00:03:22] [17] [00:00:00] Builder starting
> [00:03:22] [18] [00:00:00] Builder starting
> [00:03:22] [19] [00:00:00] Builder starting
> [00:03:22] [20] [00:00:00] Builder starting
> [00:03:22] [21] [00:00:00] Builder starting
> [00:03:22] [22] [00:00:00] Builder starting
> [00:03:22] [23] [00:00:00] Builder starting
> [00:03:22] [24] [00:00:00] Builder starting
> [00:03:22] [25] [00:00:00] Builder starting
> [00:03:22] [26] [00:00:00] Builder starting
> [00:03:22] [27] [00:00:00] Builder starting
> [00:03:22] [28] [00:00:00] Builder starting
> [00:03:22] [29] [00:00:00] Builder starting
> [00:03:22] [30] [00:00:00] Builder starting
> [00:03:22] [31] [00:00:00] Builder starting
> [00:03:22] [32] [00:00:00] Builder starting
> [00:03:30] [01] [00:00:09] Finished print/indexinfo | indexinfo-0.3.1: =
Success
> [00:03:31] [01] [00:00:00] Building devel/gettext-runtime | =
gettext-runtime-0.22
>=20
> and is still that way minutes later.
>=20
> ^T shows:
>=20
> [00:03:31] [01] [00:00:00] Building devel/gettext-runtime | =
gettext-runtime-0.22
> load: 13.02  cmd: sh 2187 [vlruwk] 570.19r 0.62u 38.60s 9% 3948k
> #0 0xffffffff80b7701b at mi_switch+0xbb
> #1 0xffffffff80bc976f at sleepq_timedwait+0x2f
> #2 0xffffffff80b76770 at _sleep+0x1d0
> #3 0xffffffff80c5b435 at vn_alloc_hard+0x2a5
> #4 0xffffffff80c50b72 at getnewvnode_reserve+0x92
> #5 0xffffffff829b9b12 at zfs_zget+0x22
> #6 0xffffffff829a6a8d at zfs_dirent_lookup+0x16d
> #7 0xffffffff829a6b5f at zfs_dirlook+0x7f
> #8 0xffffffff829b6410 at zfs_lookup+0x350
> #9 0xffffffff829b182a at zfs_freebsd_cachedlookup+0x6a
> #10 0xffffffff80c36a0d at vfs_cache_lookup+0xad
> #11 0xffffffff80c45141 at vfs_lookup+0x581
> #12 0xffffffff80c44238 at namei+0x238
> #13 0xffffffff80c63b5e at kern_statat+0xee
> #14 0xffffffff80c64237 at sys_fstatat+0x27
> #15 0xffffffff81049a79 at amd64_syscall+0x109
> #16 0xffffffff8101f11b at fast_syscall_common+0xf8
> [main-amd64-bulk_a-default] [2023-08-19_15h14m10s] [parallel_build:] =
Queued: 34435 Built: 2     Failed: 0     Skipped: 35    Ignored: 358   =
Fetched: 0     Tobuild: 34040  Time: 00:10:52
> ID  TOTAL                   ORIGIN   PKGNAME              PHASE PHASE  =
  TMPFS  CPU% MEM%
> [01] 00:07:29 devel/gettext-runtime | gettext-runtime-0.22 build =
00:06:32       25.4%   0%
> [00:11:25] Logs: =
/usr/local/poudriere/data/logs/bulk/main-amd64-bulk_a-default/2023-08-19_1=
5h14m10s
>=20
> Note the 3:31->11:25 .
>=20
> Top is showing lots of "vlruwk". For example:
>=20
>  362     0 root         40    0  27076Ki   13776Ki CPU19   19   4:23   =
0.00% cpdup -i0 -o ref 32
>  349     0 root         53    0  27076Ki   13776Ki vlruwk  22   4:20   =
0.01% cpdup -i0 -o ref 31
>  328     0 root         68    0  27076Ki   13804Ki vlruwk   8   4:30   =
0.01% cpdup -i0 -o ref 30
>  304     0 root         37    0  27076Ki   13792Ki vlruwk   6   4:18   =
0.01% cpdup -i0 -o ref 29
>  282     0 root         42    0  33220Ki   13956Ki vlruwk   8   4:33   =
0.01% cpdup -i0 -o ref 28
>  242     0 root         56    0  27076Ki   13796Ki vlruwk   4   4:28   =
0.00% cpdup -i0 -o ref 27
>=20
> In other words, it is messed up from the start, not
> just later.
>=20
> It does suggest that the dbg kernel should not end up with
> resource problems: not that much gets very far. So I'll
> probably stop it and substitute the debug kernel, reboot
> and try again.

Still for nodbg kernel . . .

The "vlruwk" processes do occasionally instead show a
CPU?? . Nothing seems stuck in only one STATE. (Live
lock?)

As for using the dbg kernel instead . . .

Most of the time that processes are showing CPU?? more
progress is made in building, but basically one builder.
vlruwk dooes show up, gradually showing a larger fracion
of the time. ref 02 .. ref 32 are still in cpdup -i0 -o .
*vnode is showing up some as well. N process  looks to
be stuck in just one of those. (Live lock?)

The debug kernel is not reporting anything during this
so far.

(some time goes by)

At this point vlruwk is fairly commonly what mostlt display
for the cpdup's that are not finishing --but none are stuck
in vlruuwk .

Looks like I should try without the 2 patches (15107 and
15122).


=3D=3D=3D
Mark Millard
marklmi at yahoo.com




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?8D0C1422-CE60-4266-8051-2296C3E9B7D7>