Date:      Sat, 19 Aug 2023 17:33:47 -0700
From:      Mark Millard <marklmi@yahoo.com>
To:        Current FreeBSD <freebsd-current@freebsd.org>
Cc:        Alexander Motin <mav@FreeBSD.org>
Subject:   Re: ZFS deadlock in 14 [USE_TMPFS=no poudriere messed up from the start, lots of "vlruwk"]
Message-ID:  <C86552F8-ABDF-4075-B0BD-FF4D384696C1@yahoo.com>
In-Reply-To: <8D0C1422-CE60-4266-8051-2296C3E9B7D7@yahoo.com>
References:  <59FCB309-4A55-4924-98C4-7ACCA70FD299@yahoo.com> <0F2C42B4-36FF-443A-A174-5B0CC57C4FC7@yahoo.com> <3AA253E3-C4F0-4AA3-9C37-D77E7527A458@yahoo.com> <C5747BF8-724E-43B7-88D0-A9F70485E7E1@yahoo.com> <ACACF575-A23B-480C-96D5-9C9599C917FC@yahoo.com> <8D0C1422-CE60-4266-8051-2296C3E9B7D7@yahoo.com>

On Aug 19, 2023, at 16:27, Mark Millard <marklmi@yahoo.com> wrote:

> On Aug 19, 2023, at 15:41, Mark Millard <marklmi@yahoo.com> wrote:
>
>> On Aug 19, 2023, at 13:41, Mark Millard <marklmi@yahoo.com> wrote:
>>
>>> [I forgot to adjust USE_TMPFS for the purpose of the test.
>>> So I'll later be starting over.]
>>>
>>> . . .
>>
>> I finally got around to starting a from-scratch bulk -a
>> again (based on USE_TMPFS=no this time). This is with
>> 15107.patch and 15122.patch applied. This is a non-debug
>> kernel experiment.
>>
>> Interestingly, it got:
>>
>> [00:01:34] [01] [00:00:00] Builder starting
>> [00:01:57] [01] [00:00:23] Builder started
>> [00:01:57] [01] [00:00:00] Building ports-mgmt/pkg | pkg-1.20.4
>> [00:03:09] [01] [00:01:12] Finished ports-mgmt/pkg | pkg-1.20.4: Success
>> [00:03:21] [01] [00:00:00] Building print/indexinfo | indexinfo-0.3.1
>> [00:03:21] [02] [00:00:00] Builder starting
>> [00:03:21] [03] [00:00:00] Builder starting
>> [00:03:21] [04] [00:00:00] Builder starting
>> [00:03:21] [05] [00:00:00] Builder starting
>> [00:03:21] [06] [00:00:00] Builder starting
>> [00:03:21] [07] [00:00:00] Builder starting
>> [00:03:22] [08] [00:00:00] Builder starting
>> [00:03:22] [09] [00:00:00] Builder starting
>> [00:03:22] [10] [00:00:00] Builder starting
>> [00:03:22] [11] [00:00:00] Builder starting
>> [00:03:22] [12] [00:00:00] Builder starting
>> [00:03:22] [13] [00:00:00] Builder starting
>> [00:03:22] [14] [00:00:00] Builder starting
>> [00:03:22] [15] [00:00:00] Builder starting
>> [00:03:22] [16] [00:00:00] Builder starting
>> [00:03:22] [17] [00:00:00] Builder starting
>> [00:03:22] [18] [00:00:00] Builder starting
>> [00:03:22] [19] [00:00:00] Builder starting
>> [00:03:22] [20] [00:00:00] Builder starting
>> [00:03:22] [21] [00:00:00] Builder starting
>> [00:03:22] [22] [00:00:00] Builder starting
>> [00:03:22] [23] [00:00:00] Builder starting
>> [00:03:22] [24] [00:00:00] Builder starting
>> [00:03:22] [25] [00:00:00] Builder starting
>> [00:03:22] [26] [00:00:00] Builder starting
>> [00:03:22] [27] [00:00:00] Builder starting
>> [00:03:22] [28] [00:00:00] Builder starting
>> [00:03:22] [29] [00:00:00] Builder starting
>> [00:03:22] [30] [00:00:00] Builder starting
>> [00:03:22] [31] [00:00:00] Builder starting
>> [00:03:22] [32] [00:00:00] Builder starting
>> [00:03:30] [01] [00:00:09] Finished print/indexinfo | indexinfo-0.3.1: Success
>> [00:03:31] [01] [00:00:00] Building devel/gettext-runtime | gettext-runtime-0.22
>>
>> and is still that way minutes later.
>>
>> ^T shows:
>>
>> [00:03:31] [01] [00:00:00] Building devel/gettext-runtime | gettext-runtime-0.22
>> load: 13.02  cmd: sh 2187 [vlruwk] 570.19r 0.62u 38.60s 9% 3948k
>> #0 0xffffffff80b7701b at mi_switch+0xbb
>> #1 0xffffffff80bc976f at sleepq_timedwait+0x2f
>> #2 0xffffffff80b76770 at _sleep+0x1d0
>> #3 0xffffffff80c5b435 at vn_alloc_hard+0x2a5
>> #4 0xffffffff80c50b72 at getnewvnode_reserve+0x92
>> #5 0xffffffff829b9b12 at zfs_zget+0x22
>> #6 0xffffffff829a6a8d at zfs_dirent_lookup+0x16d
>> #7 0xffffffff829a6b5f at zfs_dirlook+0x7f
>> #8 0xffffffff829b6410 at zfs_lookup+0x350
>> #9 0xffffffff829b182a at zfs_freebsd_cachedlookup+0x6a
>> #10 0xffffffff80c36a0d at vfs_cache_lookup+0xad
>> #11 0xffffffff80c45141 at vfs_lookup+0x581
>> #12 0xffffffff80c44238 at namei+0x238
>> #13 0xffffffff80c63b5e at kern_statat+0xee
>> #14 0xffffffff80c64237 at sys_fstatat+0x27
>> #15 0xffffffff81049a79 at amd64_syscall+0x109
>> #16 0xffffffff8101f11b at fast_syscall_common+0xf8
>> [main-amd64-bulk_a-default] [2023-08-19_15h14m10s] [parallel_build:] Queued: 34435 Built: 2     Failed: 0     Skipped: 35    Ignored: 358   Fetched: 0     Tobuild: 34040  Time: 00:10:52
>> ID  TOTAL                   ORIGIN   PKGNAME              PHASE PHASE    TMPFS  CPU% MEM%
>> [01] 00:07:29 devel/gettext-runtime | gettext-runtime-0.22 build 00:06:32       25.4%   0%
>> [00:11:25] Logs: /usr/local/poudriere/data/logs/bulk/main-amd64-bulk_a-default/2023-08-19_15h14m10s
>>
>> Note the 3:31 -> 11:25 gap in the timestamps.
>>
>> Top is showing lots of "vlruwk". For example:
>>
>> 362     0 root         40    0  27076Ki   13776Ki CPU19   19   4:23   0.00% cpdup -i0 -o ref 32
>> 349     0 root         53    0  27076Ki   13776Ki vlruwk  22   4:20   0.01% cpdup -i0 -o ref 31
>> 328     0 root         68    0  27076Ki   13804Ki vlruwk   8   4:30   0.01% cpdup -i0 -o ref 30
>> 304     0 root         37    0  27076Ki   13792Ki vlruwk   6   4:18   0.01% cpdup -i0 -o ref 29
>> 282     0 root         42    0  33220Ki   13956Ki vlruwk   8   4:33   0.01% cpdup -i0 -o ref 28
>> 242     0 root         56    0  27076Ki   13796Ki vlruwk   4   4:28   0.00% cpdup -i0 -o ref 27
>>
>> In other words, it is messed up from the start, not
>> just later.
>>
>> This does suggest that the debug kernel should not end up with
>> resource problems, since not that much gets very far. So I'll
>> probably stop this run, substitute the debug kernel, reboot,
>> and try again.
>
> Still for the nodbg kernel . . .
>
> The "vlruwk" processes do occasionally instead show a
> CPU?? . Nothing seems stuck in only one STATE. (Live
> lock?)
>=20
> As for using the dbg kernel instead . . .
>=20
> Most of the time that processes are showing CPU?? more
> progress is made in building, but basically one builder.
> vlruwk dooes show up, gradually showing a larger fracion
> of the time. ref 02 .. ref 32 are still in cpdup -i0 -o .
> *vnode is showing up some as well. N process  looks to
> be stuck in just one of those. (Live lock?)
>=20
> The debug kernel is not reporting anything during this
> so far.
>=20
> (some time goes by)
>=20
> At this point vlruwk is fairly commonly what mostlt display
> for the cpdup's that are not finishing --but none are stuck
> in vlruuwk .
>=20
> Looks like I should try without the 2 patches (15107 and
> 15122).
>=20

I restored the source to eliminate the 2 patches and rebuilt the
kernels, installed the nodbg one, rebooted, and tried again, still
using USE_TMPFS=no . . .
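
For reference, a rough sketch of that revert-and-rebuild sequence
(GENERIC-NODBG here is only a stand-in for my actual non-debug kernel
configuration name, and the patches could be backed out by other means):

cd /usr/src
# back out the two review patches:
git apply -R 15107.patch
git apply -R 15122.patch
# rebuild and install the non-debug kernel, then reboot:
make buildkernel KERNCONF=GENERIC-NODBG
make installkernel KERNCONF=GENERIC-NODBG
shutdown -r now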

The cpdup's show the same general sort of vlruwk/CPU??/*vnode
activity. ref 02 .. ref 32 still do not complete,
leaving 1 builder active.
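
For reference, the sort of thing that can be used to watch this from
another terminal is roughly the following (a sketch; PID 349 is just
one of the cpdup's from the earlier top output):

# kernel stack of one stuck cpdup, similar to what ^T reports:
procstat -kk 349
# which processes are currently waiting on the vlruwk channel:
ps -axo pid,state,wchan,command | grep vlruwk
# vnode counts versus the configured limit:
sysctl kern.maxvnodes vfs.numvnodes vfs.freevnodes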

In other words, the patches did not make any difference that I've
noticed in the behavior I'm reporting.
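
For completeness, a minimal sketch of the poudriere side of the test
setup (assuming the usual /usr/local/etc/poudriere.conf location; the
jail name is inferred from the log path quoted above):

# /usr/local/etc/poudriere.conf (excerpt): keep builder data on ZFS
USE_TMPFS=no

# the from-scratch bulk build of everything:
poudriere bulk -a -j main-amd64-bulk_a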


===
Mark Millard
marklmi at yahoo.com



