Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 13 Feb 2018 10:46:32 -0800 (PST)
From:      Don Lewis <truckman@FreeBSD.org>
To:        freebsd-hackers@freebsd.org
Subject:   rare gmake jobserver hangs, lost SIGCHLD?
Message-ID:  <tkrat.c6385fc59f468b30@FreeBSD.org>

next in thread | raw e-mail | index | archive | help
I have ALLOW_MAKE_JOBS=yes in poudriere.conf on my package build machine
so that large ports like chromium will build more quickly at the end of
the poudriere run when they are the last ports to build.  A downside of
this is that the load average can get pretty high (in the 40's) earlier
in the poudriere run, but I've got plenty of RAM and swap so that
doesn't seem to cause any obvious problems.

My package build box has an AMD FX-8320E Eight-Core Processor, but I'm
breaking in a Ryzen 7 1700X replacement.  Both run 12.0-CURRENT, which
is updated on an irregular basis every month or so.  I generally build a
set of ~1700 ports for three different jails, and the average poudriere
run probably builds about 500 ports on average.

Starting in December 2016, I've seen three build runaway failures on
different ports.  The first two are on my older machine, the last on the
Ryzen.  The one thing they all have in common is that there are one or
more gmake processes in a wait state and they have a number of zombified
child processes:

December 2016 - 10-STABLE jail:
USER   PID %CPU %MEM   VSZ  RSS TT  STAT STARTED    TIME COMMAND
root 76087  0.0  0.0  9132 1336  0  I+J   8:51AM 0:00.04 /usr/bin/make -C /usr/p
orts/multimedia/gstreamer-plugins-good build
root 76113  0.0  0.0 14680 1896  0  I+J   8:51AM 0:00.01 - gmake -f Makefile -j8
 DESTDIR=/wrkdirs/usr/ports/multimedia/gstreamer-plugins-good/work/stage all
root 76129  0.0  0.0 14680 1916  0  I+J   8:51AM 0:00.01 |-- gmake -C videofilte
r
root 76148  0.0  0.0     0    0  0  Z+J   8:51AM 0:00.59 | `-- <defunct>
root 76156  0.0  0.0 14680 1968  0  I+J   8:51AM 0:00.01 |-- gmake -C audiofx
root 76625  0.0  0.0     0    0  0  Z+J   8:51AM 0:00.34 | `-- <defunct>
root 76164  0.0  0.0 14680 1912  0  I+J   8:51AM 0:00.01 |-- gmake -C audioparse
rs
root 78664  0.0  0.0     0    0  0  Z+J   8:51AM 0:00.43 | `-- <defunct>
root 77796  0.0  0.0 14680 1948  0  I+J   8:51AM 0:00.01 |-- gmake -C debugutils
root 78663  0.0  0.0     0    0  0  Z+J   8:51AM 0:00.00 | `-- <defunct>
root 78193  0.0  0.0 14680 1920  0  I+J   8:51AM 0:00.01 |-- gmake -C deinterlac
e
root 78232  0.0  0.0 14680 1952  0  I+J   8:51AM 0:00.01 | `-- gmake all-am
root 78566  0.0  0.0     0    0  0  Z+J   8:51AM 0:01.45 |   `-- <defunct>
root 78681  0.0  0.0 14680 1940  0  I+J   8:51AM 0:00.01 |-- gmake -C effectv
root 78819  0.0  0.0     0    0  0  Z+J   8:51AM 0:00.28 | `-- <defunct>
root 78830  0.0  0.0 14680 1916  0  I+J   8:51AM 0:00.01 |-- gmake -C equalizer
root 78834  0.0  0.0     0    0  0  Z+J   8:51AM 0:00.43 | `-- <defunct>
root 78952  0.0  0.0 14680 1912  0  I+J   8:51AM 0:00.01 `-- gmake -C flv
root 79075  0.0  0.0     0    0  0  Z+J   8:51AM 0:00.00   `-- <defunct>

September 2017 - 12-CURRENT jail:
USER   PID %CPU %MEM   VSZ   RSS TT  STAT STARTED    TIME COMMAND
root  8279  0.0  0.0 10832     0  1  IW+J -       0:00.00 /usr/bin/make -C /usr/
ports/editors/openoffice-devel build
root  8550  0.0  0.0 24852     0  1  IW+J -       0:00.00 - perl -S /wrkdirs/usr
/ports/editors/openoffice-devel/work/aoo-4.2.0/main/solenv/bin/build.pl --all -P
2 -- -P4
root 64577  0.0  0.0 24852     0  1  IW+J -       0:00.00 `-- perl -S /wrkdirs/u
sr/ports/editors/openoffice-devel/work/aoo-4.2.0/main/solenv/bin/build.pl --all
-P2 -- -P4
root 64578  0.0  0.0 11700     0  1  IW+J -       0:00.00   `-- dmake -P4
root 64583  0.0  0.0 12688     0  1  IW+J -       0:00.00     `-- /usr/local/bin
/bash -c cd .. && gmake  -r -j4   && gmake  -r deliverlog
root 64584  0.0  0.1 53556 44532  1  I+J  20:32   0:29.23       `-- gmake -r -j4
root 30217  0.0  0.0     0     0  1  Z+J  20:49   0:00.01         |-- <defunct>
root 30220  0.0  0.0     0     0  1  Z+J  20:49   0:00.01         |-- <defunct>
root 30234  0.0  0.0     0     0  1  Z+J  20:49   0:00.01         |-- <defunct>
root 30236  0.0  0.0     0     0  1  Z+J  20:49   0:00.01         `-- <defunct>


USER     PID %CPU %MEM   VSZ   RSS TT  STAT STARTED    TIME COMMAND
nobody 43355  0.0  0.0 10500  1596  1  I+J  11:09   0:00.05 /usr/bin/make -C /usr/ports/lang/rust build
nobody 43444  0.0  0.1 46368 36944  1  I+J  11:09   0:07.20 - /usr/local/bin/python2.7 /wrkdirs/usr/ports/lang/rust/work/rustc-1.23.0-src/x.py build --verbose --config ./config.toml --jobs 7
nobody 71768  0.0  0.0 25436 11536  1  I+J  11:10   0:00.21 `-- /wrkdirs/usr/ports/lang/rust/work/rustc-1.23.0-src/build/bootstrap/debug/bootstrap build --verbose --config ./config.toml --jobs 7
nobody  9694  0.0  0.0 23260 13172  1  I+J  11:14   0:00.01   `-- cmake --build . --target install --config Release -- -j 7
nobody  9746  0.0  0.0 11976  2412  1  I+J  11:14   0:00.01     `-- /usr/local/bin/gmake -j 7 install
nobody  9784  0.0  0.0 12624  3292  1  I+J  11:14   0:00.08       `-- /usr/local/bin/gmake -f CMakeFiles/Makefile2 all
nobody  9539  0.0  0.0 11976  2404  1  I+J  11:16   0:00.02         |-- /usr/local/bin/gmake -f lib/DebugInfo/CodeView/CMakeFiles/LLVMDebugInfoCodeView.dir/build.make lib/DebugInfo/CodeView/CMakeFiles/LLVMDebugInfoCodeView.dir/build
nobody 10364  0.0  0.0     0     0  1  Z+J  11:17   0:02.12         | `-- <defunct>
nobody 10119  0.0  0.0 11908  2264  1  I+J  11:14   0:00.01         |-- /usr/local/bin/gmake -f lib/TableGen/CMakeFiles/LLVMTableGen.dir/build.make lib/TableGen/CMakeFiles/LLVMTableGen.dir/build
nobody 10231  0.0  0.0     0     0  1  Z+J  11:14   0:00.95         | `-- <defunct>
nobody 10420  0.0  0.0 12020  2524  1  I+J  11:14   0:00.02         |-- /usr/local/bin/gmake -f lib/MC/MCParser/CMakeFiles/LLVMMCParser.dir/build.make lib/MC/MCParser/CMakeFiles/LLVMMCParser.dir/build
nobody  9353  0.0  0.0     0     0  1  Z+J  11:16   0:00.86         | `-- <defunct>
nobody 10667  0.0  0.0 12476  2800  1  I+J  11:14   0:00.03         |-- /usr/local/bin/gmake -f lib/Support/CMakeFiles/LLVMSupport.dir/build.make lib/Support/CMakeFiles/LLVMSupport.dir/build
nobody 10766  0.0  0.0     0     0  1  Z+J  11:14   0:04.21         | `-- <defunct>
nobody 10726  0.0  0.0 12120  2576  1  I+J  11:14   0:00.03         |-- /usr/local/bin/gmake -f utils/TableGen/CMakeFiles/obj.llvm-tblgen.dir/build.make utils/TableGen/CMakeFiles/obj.llvm-tblgen.dir/build
nobody  9446  0.0  0.0     0     0  1  Z+J  11:16   0:03.01         | `-- <defunct>
nobody 10743  0.0  0.0 12140  2652  1  I+J  11:14   0:00.04         |-- /usr/local/bin/gmake -f lib/MC/CMakeFiles/LLVMMC.dir/build.make lib/MC/CMakeFiles/LLVMMC.dir/build
nobody 10795  0.0  0.0     0     0  1  Z+J  11:14   0:02.35         | `-- <defunct>
nobody 10790  0.0  0.0 12412  2828  1  I+J  11:17   0:00.05         `-- /usr/local/bin/gmake -f lib/DebugInfo/PDB/CMakeFiles/LLVMDebugInfoPDB.dir/build.make lib/DebugInfo/PDB/CMakeFiles/LLVMDebugInfoPDB.dir/build
nobody 17155  0.0  0.0     0     0  1  Z+J  11:17   0:00.01           `-- <defunct>


At least the first and third failures sound like the potential deadlock
described in the GNU make jobserver algorithm description here:
http://make.mad-scientist.net/papers/jobserver-implementation/
The description of the algorithm will make more sense if you renumber
the sub-steps under step 4 from 1-5 to a-e.

What happens if we enter the read() in step 4e, and between the time we
check the fd for validity and the time the read sees there is no data
and goes to sleep if a SIGCHLD arrives and closes fd?  Is that even
possible or is the SIGCHLD delivery deferred until the sycall goes to
sleep?

Could we sometimes lose the SIGCHLD?  It looks like losing duplicate
signals should be OK and receiving one per child should not be
necessary.  If multiple children exit after step 4a, as long as SIGCHLD
is delivered at least once before step 4e, then the handler should close
the duplicate fd and step 4e should get EBADF.

The cited document says that the deadlock can occur only if two
instances of make run into this problem.  Because of that I'm suspicious
that this failure is actually more common.  If I watch top during a
poudriere run, I often see long stretches of time when the count of
zombie processes is in the mid single digits.

I haven't yet dug into the code, but I thought I should share my
observations to date.




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?tkrat.c6385fc59f468b30>