Date: Sat, 2 Apr 2022 11:27:51 -0700
From: Mark Millard <marklmi@yahoo.com>
To: Dimitry Andric <dim@FreeBSD.org>
Cc: "toolchain@freebsd.org" <toolchain@FreeBSD.org>, Piotr Kubaj <pkubaj@anongoth.pl>, Glen Barber <gjb@FreeBSD.org>
Subject: Re: [package - 130arm64-default][lang/gcc12-devel] Failed for gcc12-devel-12.0.1.s20220306_2 in build/runaway
Message-ID: <5EEA824B-1EC5-48AB-B40D-A3D18E73B739@yahoo.com>
In-Reply-To: <17DDA9F6-CFD0-479B-B3B5-51B570893863@yahoo.com>
References: <202203261416.22QEGtRR065106@ampere3.nyi.freebsd.org> <A4CB59C1-229B-4F61-837D-5B557DFA8339@FreeBSD.org> <21D1C2BF-151E-4252-936C-B5B22C9C8071@yahoo.com> <75A61EB5-70D1-4E1F-89D2-524407854D6F@yahoo.com> <FE5F8CCE-BBC2-4A3F-B95D-22B51C6A9833@yahoo.com> <17CAD266-C7C0-4CD7-B255-3DC07F422EB5@yahoo.com> <2D081409-B3E7-422D-98C4-AC7394915F72@yahoo.com> <17DDA9F6-CFD0-479B-B3B5-51B570893863@yahoo.com>

On 2022-Mar-27, at 09:02, Mark Millard <marklmi@yahoo.com> wrote:

. . .
> On 26 Mar 2022, at 15:16, pkg-fallout@freebsd.org <pkg-fallout@FreeBSD.org> wrote:
>>
>> . . .
>> Log URL: http://ampere3.nyi.freebsd.org/data/130arm64-default/60ab72786154/logs/gcc12-devel-12.0.1.s20220306_2.log
>> . . .

Turns out that log (and other examples of lang/gcc12-devel runaway kills) does have a hint about what timeouts to change:

QUOTE
=>> Killing runaway build after 7200 seconds with no output
END QUOTE

Quarterly's build for 12.3 also got a kill, with the same message. See:

https://lists.freebsd.org/archives/freebsd-toolchain/2022-April/000478.html

I'll note that, after the message, the kill can be hours later in the build's activity, depending on the size of the log file: the log file is evaluated before the kill is done, and the scans (plural!) of huge log files involved can take that long.

The message is from:

# grep -r "seconds with no output" /usr/local/share/poudriere/ | more
/usr/local/share/poudriere/common.sh:    msg "Killing runaway build after ${NOHANG_TIME} seconds with no output"

So, at least NOHANG_TIME needs to increase as long as bootstrap-lto-noplugin is in use. (There may be more.) Note that NOHANG_TIME is not specific to the individual port.

As I remember, the kills tend to happen between 11 and 12 hours into the aarch64 build but the successful builds take 20..24 hours. It is not great evidence, but it might suggest more than doubling NOHANG_TIME (for aarch64 jails?). Looking at it differently, since it does sometimes build, a smaller increase might avoid most of the kills that are now happening.

For poudriere, there are:

NOHANG_TIME
MAX_EXECUTION_TIME
MAX_EXECUTION_TIME_EXTRACT
MAX_EXECUTION_TIME_INSTALL
MAX_EXECUTION_TIME_PACKAGE
MAX_EXECUTION_TIME_DEINSTALL
QEMU_MAX_EXECUTION_TIME
QEMU_NOHANG_TIME

These are not independent, however.
Setting a larger MAX_EXECUTION_TIME* value can be ineffective with a small NOHANG_TIME, for example, if the activity that takes the extra time happens to not output periodically. (I've run into that before when I tried a bulk -a with WITH_DEBUG= in use.)

One of the issues with poudriere's timeouts is that they do not auto-scale to match machine performance. Figures suited to slower environments may be time/power wasters on faster hardware when a build process really does run away.

Another point is that there is no scaling based on expected/historical time frames. So runaways of port builds that should not take much time instead run for a long time before being killed, because the limits have to allow for ports that are expected to take a long time to build.

Another issue is that, for multiple builders doing a build over the same time frame, the other activity can lead to longer times of "seconds with no output".

Part of the issue for lang/gcc* is that part of the bootstrap-lto-noplugin processing does not respect the limits on parallel activity. Having multiple bootstrap-lto-noplugin going because of multiple lang/gcc* building at the same time apparently can lead to very high load averages for a time. In fact, even just one lang/gcc* doing bootstrap-lto-noplugin can have a load average for a time that is something like 1.5 * (# hardware threads) when the build was told to use the # hardware threads as the limit. (Cores in this context.)

When changes are made to how things build, who is supposed to determine how to adjust poudriere's settings to match on the various build architectures? Is this something "exp run" sort of experiments are appropriate for determining for each build architecture --or at least for tier 1 architectures?

(Just me pondering, given that what *TIME* settings to use is not obvious.)

===
Mark Millard
marklmi at yahoo.com
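
The point above about MAX_EXECUTION_TIME* being ineffective with a small NOHANG_TIME can be sketched in sh. This is only a toy model, not poudriere's code: the kill_reason function, its arguments, and the order of the checks are assumptions made for illustration. Only the 7200 figure comes from the kill message in the log; the other numbers are invented.

```sh
#!/bin/sh
# Toy model of two build limits; NOT poudriere's actual logic.
# kill_reason is a hypothetical helper, named here only for illustration.
#
# usage: kill_reason ELAPSED SILENT NOHANG MAXEXEC
#   ELAPSED  seconds since the build phase started
#   SILENT   seconds since the build last produced output
#   NOHANG   stand-in for NOHANG_TIME
#   MAXEXEC  stand-in for MAX_EXECUTION_TIME
kill_reason() {
    elapsed=$1; silent=$2; nohang=$3; maxexec=$4
    if [ "$silent" -ge "$nohang" ]; then
        echo "runaway"    # silent for at least NOHANG seconds
    elif [ "$elapsed" -ge "$maxexec" ]; then
        echo "timeout"    # total run time reached MAXEXEC
    else
        echo "ok"
    fi
}

# A phase 12 hours in (43200 s) that has been silent for 2 hours (7200 s):
kill_reason 43200 7200 7200 86400     # prints "runaway"
# Raising only the overall limit does not help:
kill_reason 43200 7200 7200 172800    # still prints "runaway"
# Raising the no-output limit as well lets the phase continue:
kill_reason 43200 7200 18000 172800   # prints "ok"
```

In this model the no-output check fires regardless of how generous the overall limit is, which is why a long but quiet step would need both settings raised together.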