Date: Sat, 4 May 2024 15:09:20 -0700
From: Mark Millard <marklmi@yahoo.com>
To: Current FreeBSD <freebsd-current@freebsd.org>, freebsd-amd64@freebsd.org
Subject: Re: main [so: 15] amd64: Rare poudriere bulk builder "stuck in umtxq_sleep" condition (race failure?) during high-load-average "poudriere bulk -c -a" runs
Message-ID: <0E773BD7-7C19-4AA7-A66D-7C645BCE0182@yahoo.com>
In-Reply-To: <F54B5572-3A8A-45BE-BD0F-7F0ED2D6D4C9@yahoo.com>
References: <F54B5572-3A8A-45BE-BD0F-7F0ED2D6D4C9@yahoo.com>

On May 4, 2024, at 09:59, Mark Millard <marklmi@yahoo.com> wrote:

> I recently did some of my rare "poudriere bulk -c -a" high-load-average
> style experiments, here on a 7950X3D (amd64) system and I ended up with
> a couple of stuck builders (one per bulk run of 2 runs). Contexts:
>
> # uname -apKU
> FreeBSD 7950X3D-UFS 15.0-CURRENT FreeBSD 15.0-CURRENT #142 main-n269589-9dcf39575efb-dirty: Sun Apr 21 07:28:55 UTC 2024 root@7950X3D-ZFS:/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-NODBG amd64 amd64 1500018 1500018
>
> # uname -apKU
> FreeBSD 7950X3D-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT #142 main-n269589-9dcf39575efb-dirty: Sun Apr 21 07:28:55 UTC 2024 root@7950X3D-ZFS:/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-NODBG amd64 amd64 1500018 1500018
>
> So: One was in a ZFS context and the other was in a UFS context.
>
> 32 hardware threads, 32 builders, ALLOW_MAKE_JOBS=yes in use
> (no use of MAKE_JOBS_NUMBER_LIMIT or the like), USE_TMPFS=all
> in use, TMPFS_BLACKLIST in use, 192 GiBytes of RAM, 512 GiByte
> Swap partition in use, so SystemRAM+SystemSWAP being
> 704 GiBytes.
>
>
> I'll start with notes about the more recent UFS context experiment . . .
>
> graphics/pinta in the UFS experiment had gotten stuck in threads
> of /usr/local/bin/mono (mono-sgen):
>
> [05] 15:31:47 graphics/pinta | pinta-1.7.1_4 stage 15:28:31 2.30 GiB 0% 0%
>
> # procstat -k -k 93415
>   PID    TID COMM                TDNAME              KSTACK
> 93415 671706 mono-sgen           -                   mi_switch+0xba sleepq_catch_signals+0x2c6 sleepq_wait_sig+0x9 _sleep+0x1ae umtxq_sleep+0x2cd do_lock_umutex+0x6a6 __umtx_op_wait_umutex+0x49 sys__umtx_op+0x7e amd64_syscall+0x115 fast_syscall_common+0xf8
> 93415 678651 mono-sgen           SGen worker         mi_switch+0xba sleepq_catch_signals+0x2c6 sleepq_wait_sig+0x9 _sleep+0x1ae umtxq_sleep+0x2cd do_wait+0x244 __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x115 fast_syscall_common+0xf8
> 93415 678652 mono-sgen           Finalizer           mi_switch+0xba sleepq_catch_signals+0x2c6 sleepq_wait_sig+0x9 _sleep+0x1ae umtxq_sleep+0x2cd __umtx_op_sem2_wait+0x49a sys__umtx_op+0x7e amd64_syscall+0x115 fast_syscall_common+0xf8
> 93415 678655 mono-sgen           -                   mi_switch+0xba sleepq_catch_signals+0x2c6 sleepq_wait_sig+0x9 _sleep+0x1ae umtxq_sleep+0x2cd do_wait+0x244 __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x115 fast_syscall_common+0xf8
> 93415 678660 mono-sgen           Thread Pool Wor     mi_switch+0xba sleepq_catch_signals+0x2c6 sleepq_wait_sig+0x9 _sleep+0x1ae umtxq_sleep+0x2cd do_lock_umutex+0x6a6 __umtx_op_wait_umutex+0x49 sys__umtx_op+0x7e amd64_syscall+0x115 fast_syscall_common+0xf8
>
> So I did a kill -9 93415 to let the bulk run complete.
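
A minimal sketch, not part of the original report, of one way to pick out the threads in "procstat -k -k" output that are parked in umtxq_sleep, as in the listing above. The function name stuck_umtx_threads is hypothetical. It only filters text on stdin and assumes the first two columns are PID and TID, so it says nothing about whether a thread is truly stuck or merely waiting.

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not from the original message):
# read "procstat -k -k" output on stdin and print "PID TID" for every
# thread whose kernel stack contains an umtxq_sleep frame.
stuck_umtx_threads() {
    awk '
        $1 == "PID" { next }        # skip the column header line
        /umtxq_sleep\+/ {
            # Fields 1 and 2 are PID and TID. COMM and TDNAME may
            # contain spaces (e.g. "Thread Pool Wor"), so they are
            # deliberately not parsed here.
            print $1, $2
        }
    '
}

# Example use on FreeBSD (93415 is the PID from the report above):
#   procstat -k -k 93415 | stuck_umtx_threads
```

Threads sleeping elsewhere, such as the kqueue_scan thread seen in the ZFS case, are not listed by this filter.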
>
> I then removed my ADDITION of BROKEN to print/miktex that had gotten
> stuck in the ZFS experiment and tried in the now tiny-load-average
> UFS context: bulk print/miktex graphics/pinta
>
> They both worked just fine, not getting stuck (UFS context):
>
> [00:00:50] [02] [00:00:25] Finished graphics/pinta | pinta-1.7.1_4: Success ending TMPFS: 2.30 GiB
> [00:14:11] [01] [00:13:47] Finished print/miktex | miktex-23.9_3: Success ending TMPFS: 3.21 GiB
>
> I'll note that the "procstat -k -k" for the stuck print/miktex
> in the ZFS context had looked like:
>
> # procstat -k -k 70121
>   PID    TID COMM                TDNAME              KSTACK
> 70121 409420 miktex-ctangle      -                   mi_switch+0xba sleepq_catch_signals+0x2c6 sleepq_wait_sig+0x9 _sleep+0x1ae umtxq_sleep+0x2cd do_wait+0x244 __umtx_op_wait+0x53 sys__umtx_op+0x7e amd64_syscall+0x115 fast_syscall_common+0xf8
> 70121 646547 miktex-ctangle      -                   mi_switch+0xba sleepq_catch_signals+0x2c6 sleepq_wait_sig+0x9 _sleep+0x1ae kqueue_scan+0x9f1 kqueue_kevent+0x13b kern_kevent_fp+0x4b kern_kevent_generic+0xd6 sys_kevent+0x61 amd64_syscall+0x115 fast_syscall_common+0xf8
> 70121 646548 miktex-ctangle      -                   mi_switch+0xba sleepq_catch_signals+0x2c6 sleepq_wait_sig+0x9 _sleep+0x1ae umtxq_sleep+0x2cd do_wait+0x244 __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x115 fast_syscall_common+0xf8
>
> Note that, unlike the UFS context, the above also involves: kqueue_scan
>
> It looks like there is some form of failing race(?) condition
> that can occur on amd64 --and does rarely occur in high load
> average contexts.
>
> I've no clue how to reduce this to a simple, repeatable context.
>

Some other oddities, including a comparison on ZFS of a run using
MUTUALLY_EXCLUSIVE_BUILD_PACKAGES to a run not using such. USE_TMPFS=all
and ALLOW_MAKE_JOBS were always in use.

The combinations were:

A) ZFS without TMPFS_BLACKLIST but with MUTUALLY_EXCLUSIVE_BUILD_PACKAGES
B) ZFS with TMPFS_BLACKLIST but without MUTUALLY_EXCLUSIVE_BUILD_PACKAGES
C) UFS with TMPFS_BLACKLIST but without MUTUALLY_EXCLUSIVE_BUILD_PACKAGES

(No claim that the likes of, say, (A) vs. (B) is a cause of any oddities.)


Just (B) and (C) got ess-emacs_devel build lock failures, (A) did not:
(diff of error log files used)

1081,1082c1104,1105
< Error: file-locked ("~/ESS-24.01.1/lisp/ess-autoloads.el" "root@amd64_ZFS (pid 6112)" "Cannot resolve lock conflict in batch mode")
< ask-user-about-lock("~/ESS-24.01.1/lisp/ess-autoloads.el" "root@amd64_ZFS (pid 6112)")
---
> Error: file-locked ("~/ESS-24.01.1/lisp/ess-autoloads.el" "root@amd64_UFS (pid 81244)" "Cannot resolve lock conflict in batch mode")
> ask-user-about-lock("~/ESS-24.01.1/lisp/ess-autoloads.el" "root@amd64_UFS (pid 81244)")
1090a1114,1118


(B) and (C) got distinct mate-terminal build Segmentation fault failures
[(A) did not get any such]:
(diff of error log files used)

2965,2966c2960
< Warning: Could not merge it translation for msgid:
< DOCUMENT AND MODIFIED VERSIONS OF THE DOCUMENT ARE PROVIDED UNDER THE TERMS OF THE GNU FREE DOCUMENTATION LICENSE WITH THE FURTHER UNDERSTANDING THAT:
---
> if ! test -d "kk/"; then mkdir "kk/"; fi
2972,2973d2965
< if ! test -d "kk/"; then mkdir "kk/"; fi
< Segmentation fault (core dumped)
. . .
> touch "ky/ky.stamp"
> Segmentation fault (core dumped)


Just (A) got openvpn-auth-ldap build failure:

TRConfig.m:43:9: fatal error: 'TRConfigParser.h' file not found
   43 | #import "TRConfigParser.h"
      |         ^~~~~~~~~~~~~~~~~~
1 error generated.
*** [TRConfig.o] Error code 1

make[1]: stopped in /wrkdirs/usr/ports/security/openvpn-auth-ldap/work/openvpn-auth-ldap-auth-ldap-2.0.4/src


Just (A) got rinetd build failure:

--- index.html ---
if which roffit >/dev/null 2>&1; then roffit < > index.html; else touch index.html; fi
sh: Syntax error: redirection unexpected (expecting word)
*** [index.html] Error code 2

make[1]: stopped in /wrkdirs/usr/ports/net/rinetd/work/rinetd-d4e0a60
make[1]: 1 error


Just (A) and (B) got the below adacurses example failure (1st failure shown):

gnatlink rain.ali -L../lib -lAdaCurses -fstack-protector-strong -lncursesw -lncurses -lmenu -lform -lpanel -fstack-protector-strong -lncursesw -lncurses -lmenu -lform -lpanel
gcc -c -I. -I../src -I./../src -O2 -I. ncurses2-acs_display.adb
gcc -c -I. -I../src -I./../src -O2 -I. sample-keyboard_handler.adb
/usr/local/bin/ld: cannot find -lAdaCurses: No such file or directory
collect2: error: ld returned 1 exit status
gnatlink: error when calling /usr/local/gnat12/bin/gcc
gnatmake: *** link failed.
gmake[1]: *** [Makefile:168: rain] Error 4
gmake[1]: *** Waiting for unfinished jobs....


Just (A) got amath example failure (1st failure shown):

mkdir -p shared
c++ -O2 -I. -I../.. -Wall -c clear.cpp
c++ -O3 -I. -I.. -Wall -fPIC -c bigint.cpp -o shared/bigint.o
error: unable to open output file 'shared/aengine.o': 'No such file or directory'
1 error generated.
c++ -O2 -I. -I.. -Wall -c console_stdc.cpp
c++ -O2 -I. -I../.. -Wall -c delete.cpp
c++ -O2 -I. -I.. -Wall -c console_termios.cpp
c++ -O2 -I. -I.. -Wall -c fgrid.cpp
c++ -O3 -I. -I.. -Wall -c bigint.cpp -o static/bigint.o
c++ -O2 -I. -I.. -Wall -c lexer.cpp
c++ -O2 -I. -I.. -Wall -c console_windows.cpp
c++ -O2 -I. -I../.. -Wall -c digits.cpp
c++ -O3 -I. -I.. -Wall -c charbuf.cpp -o static/charbuf.o
gmake[1]: *** [Makefile:29: shared/aengine.o] Error 1
gmake[1]: *** Waiting for unfinished jobs....


ALSO amath: Just (B) got amath example failure (1st failure shown):

error: unable to open output file 'shared/cacos.o': 'No such file or directory'
1 error generated.
gmake[1]: *** [Makefile:31: shared/cacos.o] Error 1
gmake[1]: *** Waiting for unfinished jobs....
gmake[1]: Leaving directory '/wrkdirs/usr/ports/math/amath/work/amath-1.8.5/src/cplex'
gmake: *** [Makefile:76: shared-libs] Error 2
gmake: *** Waiting for unfinished jobs....


Just (B) got a berkeleygw (BerkeleyGW) example failure (1st failure shown):

gfortran13 -ffree-form -ffree-line-length-none -I ../Common -I /usr/local/include -c -O3 genwf_mpi.p.f -o genwf_mpi.o -J./
gfortran13 -ffree-form -ffree-line-length-none -I ../Common -I /usr/local/include -c -O3 input_common.p.f -o input_common.o -J./
gfortran13 -ffree-form -ffree-line-length-none -I ../Common -I /usr/local/include -c -O3 genwf_mpi.p.f -o genwf_mpi.o -J./
/usr/local/bin/ld: /usr/lib/crt1.o: in function `_start':
/usr/main-src/lib/csu/amd64/crt1_s.S:69: undefined reference to `main'
collect2: error: ld returned 1 exit status
gmake[2]: *** [../Common/common-rules.mk:321: ../Common/print_version_info.x] Error 1
gmake[2]: Leaving directory '/wrkdirs/usr/ports/science/berkeleygw/work/BerkeleyGW-3.0.1/Common'
gmake[1]: *** [Makefile:82: make-Common] Error 2
gmake[1]: *** Waiting for unfinished jobs....


Just (B) got artemis example failure (1st failure shown):

CLASSPATH=lib/commons-lang-2.6.jar:lib/biojava.jar:lib/jemAlign.jar:lib/j2ssh/j2ssh-core.jar:lib/ibatis/ibatis-2.3.4.726.jar:lib/ibatis/log4j-1.2.14.jar:lib/postgresql-8.4-701.jdbc3.jar:lib/picard/picard.jar:lib/commons-net-3.6.jar:lib/batik/batik-awt-util.jar:lib/batik/batik-dom.jar:lib/batik/batik-ext.jar:lib/batik/batik-svggen.jar:lib/batik/batik-util.jar:lib/batik/batik-xml.jar:. javac -source 1.8 -target 1.8 uk/ac/sanger/artemis/io/ReadOnlyEntry.java
Note: uk/ac/sanger/artemis/ActionVector.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
uk/ac/sanger/artemis/FilteredEntryGroup.java:43: error: cannot access EntryGroup
public class FilteredEntryGroup implements EntryGroup
^
bad class file: ./uk/ac/sanger/artemis/EntryGroup.class
class file contains wrong class: java.util.NoSuchElementException
Please remove or make sure it appears in the correct subdirectory of the classpath.
gmake: *** [Makefile:47: uk/ac/sanger/artemis/FilteredEntryGroup.class] Error 1
gmake: *** Waiting for unfinished jobs....


Just (B) got ess-emacs_canna example failure (1st failure shown):

Error: error ("Cannot resolve lock conflict in batch mode")
mapbacktrace(#f(compiled-function (evald func args flags) #<bytecode 0x1f681b36ebaed07c>))
debug-early-backtrace()
debug-early(error (error "Cannot resolve lock conflict in batch mode"))
signal(error ("Cannot resolve lock conflict in batch mode"))
error("Cannot resolve lock conflict in batch mode")
ask-user-about-lock("~/ESS-24.01.1/lisp/ess-autoloads.el" "root@amd64_ZFS (pid 38427)")
autoload-find-generated-file("/wrkdirs/usr/ports/math/ess/work-canna/ESS-24.01.1/lisp/ess-autoloads.el")
make-directory-autoloads(("~/ESS-24.01.1/lisp/") "/wrkdirs/usr/ports/math/ess/work-canna/ESS-24.01.1/lisp/ess-autoloads.el")
update-directory-autoloads("~/ESS-24.01.1/lisp/")
(progn (setq make-backup-files nil) (setq generated-autoload-file (expand-file-name "ess-autoloads.el")) (setq find-file-visit-truename t) (update-directory-autoloads default-directory))
eval((progn (setq make-backup-files nil) (setq generated-autoload-file (expand-file-name "ess-autoloads.el")) (setq find-file-visit-truename t) (update-directory-autoloads default-directory)) t)
command-line-1(("--eval" "(progn(setq make-backup-files nil)(setq generated-autoload-file (expand-file-name \"ess-autoloads.el\"))(setq find-file-visit-truename t)(update-directory-autoloads default-directory))"))
command-line()
normal-top-level()
Output written on refcard.pdf (2 pages, 179059 bytes).
Transcript written on refcard.log.
gmake[1]: *** [Makefile:56: ess-autoloads.el] Error 255
gmake[1]: Leaving directory '/wrkdirs/usr/ports/math/ess/work-canna/ESS-24.01.1/lisp'
gmake: *** [Makefile:49: autoloads] Error 2
gmake: *** Waiting for unfinished jobs....


Just (B) got ngs-sdk example failure (1st failure shown):

error: unable to open output file '/wrkdirs/usr/ports/biology/ngs-sdk/work/ngs-3.0.1/build/ngs-sdk/FreeBSD/clang/amd64/rel/obj/language/c++/ReadCollection.pic.o': 'No such file or directory'
1 error generated.
. . .
gmake[3]: *** [/wrkdirs/usr/ports/biology/ngs-sdk/work/ngs-3.0.1/ngs-sdk/./Makefile.config.FreeBSD.amd64:109: /wrkdirs/usr/ports/biology/ngs-sdk/work/ngs-3.0.1/build/ngs-sdk/FreeBSD/clang/amd64/rel/obj/language/c++/ReadCollection.pic.o] Error 1
gmake[3]: *** Waiting for unfinished jobs....


Just (B) got qucsator example failure (1st failure shown):

FAILED: src/interface/CMakeFiles/coreInterface.dir/qucs_interface.cpp.o
/usr/bin/c++ -DDEBUG -DHAVE_CONFIG_H -I/wrkdirs/usr/ports/cad/qucsator/work/qucsator-0.0.20-4-g22126bb9 -I/wrkdirs/usr/ports/cad/qucsator/work/qucsator-0.0.20-4-g22126bb9/src/math -I/wrkdirs/usr/ports/cad/qucsator/work/qucsator-0.0.20-4-g22126bb9/src -I/wrkdirs/usr/ports/cad/qucsator/work/qucsator-0.0.20-4-g22126bb9/src/components -I/wrkdirs/usr/ports/cad/qucsator/work/qucsator-0.0.20-4-g22126bb9/src/interface -I/wrkdirs/usr/ports/cad/qucsator/work/.build -I/wrkdirs/usr/ports/cad/qucsator/work/.build/src -I/wrkdirs/usr/ports/cad/qucsator/work/.build/src/components -O2 -pipe -fstack-protector-strong -fno-strict-aliasing -fPIC -Wall -std=c++11 -stdlib=libc++ -O4 -DNDEBUG -MD -MT src/interface/CMakeFiles/coreInterface.dir/qucs_interface.cpp.o -MF src/interface/CMakeFiles/coreInterface.dir/qucs_interface.cpp.o.d -o src/interface/CMakeFiles/coreInterface.dir/qucs_interface.cpp.o -c /wrkdirs/usr/ports/cad/qucsator/work/qucsator-0.0.20-4-g22126bb9/src/interface/qucs_interface.cpp
c++: warning: -O4 is equivalent to -O3 [-Wdeprecated]
In file included from /wrkdirs/usr/ports/cad/qucsator/work/qucsator-0.0.20-4-g22126bb9/src/interface/qucs_interface.cpp:38:
/wrkdirs/usr/ports/cad/qucsator/work/qucsator-0.0.20-4-g22126bb9/src/components/components.h:164:10: fatal error: 'verilog/tff_SR.core.h' file not found
  164 | #include "verilog/tff_SR.core.h"
      |          ^~~~~~~~~~~~~~~~~~~~~~~
1 error generated.


Just (B) got yuck-cmdline-parser example failure (1st failure shown):

mv -f .deps/yuck_scmver-yuck-scmver.Tpo .deps/yuck_scmver-yuck-scmver.Po
mv: rename .deps/yuck_scmver-yuck-scmver.Tpo to .deps/yuck_scmver-yuck-scmver.Po: No such file or directory
gmake[2]: *** [Makefile:587: yuck_scmver-yuck-scmver.o] Error 1
gmake[2]: *** Waiting for unfinished jobs....


Just (A) got ess-emacs_devel_nox example failure (1st failure shown):

~/ESS-24.01.1/lisp/ess-autoloads.el: root@amd64_ZFS (pid 15639), Cannot resolve lock conflict in batch mode
Error: file-locked ("~/ESS-24.01.1/lisp/ess-autoloads.el" "root@amd64_ZFS (pid 15639)" "Cannot resolve lock conflict in batch mode")
ask-user-about-lock("~/ESS-24.01.1/lisp/ess-autoloads.el" "root@amd64_ZFS (pid 15639)")
autoload-find-generated-file("/wrkdirs/usr/ports/math/ess/work-devel_nox/ESS-24.01.1/lisp/ess-autoloads.el")
make-directory-autoloads(("~/ESS-24.01.1/lisp/") "/wrkdirs/usr/ports/math/ess/work-devel_nox/ESS-24.01.1/lisp/ess-autoloads.el")
update-directory-autoloads("~/ESS-24.01.1/lisp/")
(progn (setq make-backup-files nil) (setq generated-autoload-file (expand-file-name "ess-autoloads.el")) (setq find-file-visit-truename t) (update-directory-autoloads default-directory))
eval((progn (setq make-backup-files nil) (setq generated-autoload-file (expand-file-name "ess-autoloads.el")) (setq find-file-visit-truename t) (update-directory-autoloads default-directory)) t)
command-line-1(("--eval" "(progn(setq make-backup-files nil)(setq generated-autoload-file (expand-file-name \"ess-autoloads.el\"))(setq find-file-visit-truename t)(update-directory-autoloads default-directory))"))
command-line()
normal-top-level()
gmake[1]: *** [Makefile:56: ess-autoloads.el] Error 255
gmake[1]: *** Waiting for unfinished jobs....


Just (A) got fl_moxgen example failure (1st failure shown):

===> Building for fl_moxgen-1.00_2
--- fl_moxgen_defines.h ---
--- write_pdf.o ---
write_pdf.c:12:10: fatal error: 'fl_moxgen_defines.h' file not found
   12 | #include "fl_moxgen_defines.h"
      |          ^~~~~~~~~~~~~~~~~~~~~
1 error generated.
*** [write_pdf.o] Error code 1

make: stopped in /wrkdirs/usr/ports/comms/fl_moxgen/work/Fl_MoxGen-1.00
make: 1 error


Just (A) got grx example failure (1st failure shown):

cc -L/usr/local/lib -fstack-protector-strong -o ../bin/bin2c utilprog/bin2c.o ../lib/unix/libgrx20X.a -L/usr/local/lib -lX11
cc -L/usr/local/lib -fstack-protector-strong -o ../bin/fnt2c utilprog/fnt2c.o ../lib/unix/libgrx20X.a -L/usr/local/lib -lX11
cc: error: no such file or directory: 'utilprog/bin2c.o'
gmake[1]: *** [makefile.x11:150: ../bin/bin2c] Error 1
gmake[1]: *** Waiting for unfinished jobs....


Just (A) got openvsp example failure (1st failure shown):

In file included from /wrkdirs/usr/ports/cad/openvsp/work/.build/Libraries-prefix/src/Libraries-build/STEPCODE-prefix/src/STEPCODE-build/schemas/sdai_ap203/SdaiCONFIG_CONTROL_DESIGN_unity_entities.cc:70:
/wrkdirs/usr/ports/cad/openvsp/work/.build/Libraries-prefix/src/Libraries-build/STEPCODE-prefix/src/STEPCODE-build/schemas/sdai_ap203/entity/SdaiB_spline_surface.cc:334:1: error: unknown type name 'f'
  334 | f
      | ^


===
Mark Millard
marklmi at yahoo.com