Date:      Sat, 4 May 2024 09:59:15 -0700
From:      Mark Millard <marklmi@yahoo.com>
To:        Current FreeBSD <freebsd-current@freebsd.org>, freebsd-amd64@freebsd.org
Subject:   main [so: 15] amd64: Rare poudriere bulk builder "stuck in umtxq_sleep" condition (race failure?) during high-load-average "poudriere bulk -c -a" runs
Message-ID:  <F54B5572-3A8A-45BE-BD0F-7F0ED2D6D4C9@yahoo.com>
References:  <F54B5572-3A8A-45BE-BD0F-7F0ED2D6D4C9.ref@yahoo.com>

I recently did some of my rare "poudriere bulk -c -a" high-load-average
style experiments, here on a 7950X3D (amd64) system and I ended up with
a couple of stuck builders (one per bulk run of 2 runs). Contexts:

# uname -apKU
FreeBSD 7950X3D-UFS 15.0-CURRENT FreeBSD 15.0-CURRENT #142 main-n269589-9dcf39575efb-dirty: Sun Apr 21 07:28:55 UTC 2024     root@7950X3D-ZFS:/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-NODBG amd64 amd64 1500018 1500018

# uname -apKU
FreeBSD 7950X3D-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT #142 main-n269589-9dcf39575efb-dirty: Sun Apr 21 07:28:55 UTC 2024     root@7950X3D-ZFS:/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-NODBG amd64 amd64 1500018 1500018

So: One was in a ZFS context and the other was in a UFS context.

32 hardware threads, 32 builders, ALLOW_MAKE_JOBS=yes in use
(no use of MAKE_JOBS_NUMBER_LIMIT or the like), USE_TMPFS=all
in use, TMPFS_BLACKLIST in use, 192 GiBytes of RAM, and a
512 GiByte swap partition in use, so SystemRAM+SystemSWAP
totals 704 GiBytes.
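
For completeness, those knobs live in poudriere.conf and look
roughly like the below (a sketch, not a verbatim copy of my file;
the TMPFS_BLACKLIST entries shown are just illustrative examples):

# /usr/local/etc/poudriere.conf (excerpt; illustrative values)
ALLOW_MAKE_JOBS=yes
# (MAKE_JOBS_NUMBER_LIMIT deliberately left unset)
USE_TMPFS=all
# ports whose work areas are too big to keep in tmpfs; examples only:
TMPFS_BLACKLIST="rust llvm*"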


I'll start with notes about the more recent UFS context experiment . . .

graphics/pinta in the UFS experiment had gotten stuck in threads
of /usr/local/bin/mono (mono-sgen):

[05] 15:31:47                     graphics/pinta | pinta-1.7.1_4                                      stage 15:28:31 2.30 GiB   0%   0%

# procstat -k -k 93415
  PID    TID COMM                TDNAME              KSTACK
93415 671706 mono-sgen           -                   mi_switch+0xba sleepq_catch_signals+0x2c6 sleepq_wait_sig+0x9 _sleep+0x1ae umtxq_sleep+0x2cd do_lock_umutex+0x6a6 __umtx_op_wait_umutex+0x49 sys__umtx_op+0x7e amd64_syscall+0x115 fast_syscall_common+0xf8
93415 678651 mono-sgen           SGen worker         mi_switch+0xba sleepq_catch_signals+0x2c6 sleepq_wait_sig+0x9 _sleep+0x1ae umtxq_sleep+0x2cd do_wait+0x244 __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x115 fast_syscall_common+0xf8
93415 678652 mono-sgen           Finalizer           mi_switch+0xba sleepq_catch_signals+0x2c6 sleepq_wait_sig+0x9 _sleep+0x1ae umtxq_sleep+0x2cd __umtx_op_sem2_wait+0x49a sys__umtx_op+0x7e amd64_syscall+0x115 fast_syscall_common+0xf8
93415 678655 mono-sgen           -                   mi_switch+0xba sleepq_catch_signals+0x2c6 sleepq_wait_sig+0x9 _sleep+0x1ae umtxq_sleep+0x2cd do_wait+0x244 __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x115 fast_syscall_common+0xf8
93415 678660 mono-sgen           Thread Pool Wor     mi_switch+0xba sleepq_catch_signals+0x2c6 sleepq_wait_sig+0x9 _sleep+0x1ae umtxq_sleep+0x2cd do_lock_umutex+0x6a6 __umtx_op_wait_umutex+0x49 sys__umtx_op+0x7e amd64_syscall+0x115 fast_syscall_common+0xf8

So I did a kill -9 93415 to let the bulk run complete.
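
(For anyone wanting to inspect a similarly wedged builder: the
sequence I use is roughly the below; the JID and PID values shown
are placeholders for what jls and ps actually report.)

# sketch; 37 and 93415 are illustrative values
jls                   # find the stuck builder jail's JID
ps -xww -J 37         # list the processes running inside that jail
procstat -k -k 93415  # dump kernel stacks of the wedged process
kill -9 93415         # then let the bulk run continue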

I then removed the BROKEN marking I had earlier added to print/miktex
(it had gotten stuck in the ZFS experiment) and tried both ports in
the now tiny-load-average UFS context: bulk print/miktex graphics/pinta

They both worked just fine, not getting stuck (UFS context):

[00:00:50] [02] [00:00:25] Finished  graphics/pinta | pinta-1.7.1_4: Success ending TMPFS: 2.30 GiB
[00:14:11] [01] [00:13:47] Finished  print/miktex | miktex-23.9_3: Success ending TMPFS: 3.21 GiB

I'll note that the "procstat -k -k" output for the stuck print/miktex
in the ZFS context had looked like:

# procstat -k -k 70121
  PID    TID COMM                TDNAME              KSTACK
70121 409420 miktex-ctangle      -                   mi_switch+0xba sleepq_catch_signals+0x2c6 sleepq_wait_sig+0x9 _sleep+0x1ae umtxq_sleep+0x2cd do_wait+0x244 __umtx_op_wait+0x53 sys__umtx_op+0x7e amd64_syscall+0x115 fast_syscall_common+0xf8
70121 646547 miktex-ctangle      -                   mi_switch+0xba sleepq_catch_signals+0x2c6 sleepq_wait_sig+0x9 _sleep+0x1ae kqueue_scan+0x9f1 kqueue_kevent+0x13b kern_kevent_fp+0x4b kern_kevent_generic+0xd6 sys_kevent+0x61 amd64_syscall+0x115 fast_syscall_common+0xf8
70121 646548 miktex-ctangle      -                   mi_switch+0xba sleepq_catch_signals+0x2c6 sleepq_wait_sig+0x9 _sleep+0x1ae umtxq_sleep+0x2cd do_wait+0x244 __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x115 fast_syscall_common+0xf8

Note that, unlike in the UFS context, the stacks above also involve kqueue_scan.

It looks like there is some form of race(?) failure that can occur
on amd64, and that rarely does occur in high-load-average contexts.

I've no clue how to reduce this to a simple, repeatable context.
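
If someone wants to try anyway, the obvious brute-force attempt
would be a loop like the below, rerun while the machine is
otherwise kept busy (the jail name main-amd64 is a placeholder;
-C recleans just the listed ports on each pass):

# keep rebuilding the two affected ports until a builder wedges
while poudriere bulk -C -j main-amd64 print/miktex graphics/pinta
do
    : # each successful pass just loops again
done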


===
Mark Millard
marklmi at yahoo.com



