Date: Sun, 17 Jan 2021 17:50:10 -0800 From: bob prohaska <fbsd@www.zefox.net> To: bob prohaska <fbsd@www.zefox.net> Cc: Current FreeBSD <freebsd-current@freebsd.org>, freebsd-arm@freebsd.org Subject: Re: Silent hang in buildworld, was Re: Invoking -v for clang during buildworld Message-ID: <20210118015009.GA31353@www.zefox.net> In-Reply-To: <85889EAE-F579-4220-9185-944D9AA5075A@yahoo.com> References: <20210116043740.GA19523@www.zefox.net> <ED26508F-282D-439D-8A6A-65A136C76C84@yahoo.com> <20210116155538.GA24259@www.zefox.net> <F62027C8-0813-4E6F-934A-3088F966AF8C@yahoo.com> <20210116220334.GA26756@www.zefox.net> <C75D3D9C-4284-4BE3-B2CD-5DC6BBB60843@yahoo.com> <20210117174006.GA30728@www.zefox.net> <85889EAE-F579-4220-9185-944D9AA5075A@yahoo.com>
next in thread | previous in thread | raw e-mail | index | archive | help
On Sun, Jan 17, 2021 at 12:30:51PM -0800, Mark Millard wrote: > > > On 2021-Jan-17, at 09:40, bob prohaska <fbsd at www.zefox.net> wrote: > > > On Sat, Jan 16, 2021 at 03:04:04PM -0800, Mark Millard wrote: > >> > >> Other than -j1 style builds (or equivalent), one pretty much > >> always needs to go looking around for a non-panic failure. It > >> is uncommon for all the material to be together in the build > >> log in such contexts. > > > > Running make cleandir twice and restarting -j4 buildworld brought > > the process full circle: A silent hang, no debugger response, no > > console warnings. That's what sent me down the rabbit hole of make > > without clean, which worked at least once... > > Unfortunately, such a hang tends to mean that log files and such > were not completely written out to media. We do not get to see > evidence of the actual failure time frame, just somewhat before. > (compiler/linker output and such can have the same issues of > ending up with incomplete updates.) > > So, pretty much my notes are unlikely to be strongly tied to > any solid evidence: more like alternatives to possibly explore > that could be far off the mark. > > It is not clear if you were using: > > LDFLAGS.lld+= -Wl,--threads=1 > > or some such to limit the multi-thread linking and its memory. No, I wasn't trying to limit ld.lld thread number. > I'll note that if -j4 gets 4 links running in parallel it used > to be each could have something like 5 threads active on a 4 > core machine, so 20 or so threads. (I've not checked llvm11's > lld behavior. It might avoid such for defaults.) > > You have not reported any testing of -j2 or -j3 so far, just > -j4 . (Another way of limiting memory use, power use, temperature, > etc. .) > Not recently, simply because it's so slow to build. On my "production" armv7 machines running stable/12 I do use -j2. But, they get updated only a couple times per year, when there's a security issue. > You have not reported if your boot complained about the swap > space size or if you have adjusted related settings to make > non-default tradeoffs for swap amanagment for these specific > tests. I recommend not tailoring and using a swap size total > that is somewhat under what starts to complain when there is > no tailoring. > Both Pi2 and Pi3 have been complaining about too much swap since I first got them. Near as can be told it's never been a demonstrated problem, thus far. Now, as things like LLVM get bigger and bigger, it seems possible excess swap might cause, or obscure, other problems. For the Pi2 I picked 2 GB from the old "2x physical RAM" rule. > > > The residue of the top screen shows > > > > last pid: 63377; load averages: 4.29, 4.18, 4.15 up 1+07:11:07 04:46:46 > > 60 processes: 5 running, 55 sleeping > > CPU: 70.7% user, 0.0% nice, 26.5% system, 2.8% interrupt, 0.0% idle > > Mem: 631M Active, 4932K Inact, 92M Laundry, 166M Wired, 98M Buf, 18M Free > > Swap: 2048M Total, 119M Used, 1928M Free, 5% Inuse, 16K In, 3180K Out > > packet_write_wait: Connection to 50.1.20.26 port 22: Broken pipe > > bob@raspberrypi:~ $ ssh www.zefox.com RES STATE C TIME WCPU COMMAND > > ssh: connect to host www.zefox.com port 22: Connection timed out86.17% c++ > > bob@raspberrypi:~ $ 1 99 0 277M 231M RUN 0 3:26 75.00% c++ > > 63245 bob 1 99 0 219M 173M CPU0 0 2:10 73.12% c++ > > 62690 bob 1 98 0 354M 234M RUN 3 9:42 47.06% c++ > > 63377 bob 1 30 0 5856K 2808K nanslp 0 0:00 3.13% gstat > > 38283 bob 1 24 0 5208K 608K wait 2 2:00 0.61% sh > > 995 bob 1 20 0 6668K 1184K CPU3 3 8:46 0.47% top > > 990 bob 1 20 0 12M 1060K select 2 0:48 0.05% sshd > > .... > > This does not look like ld was in use as of the last top > display update's content. But the time between reasonable > display updates is fairly long relative to CPU activity > so it is only suggestive. > > > [apologies for typing over the remnants] > > > > I've put copies of the build and swap logs at > > > > http://www.zefox.net/~fbsd/rpi2/buildworld/ > > > > The last vmstat entry (10 second repeat time) reports: > > procs memory page disks faults cpu > > r b w avm fre flt re pi po fr sr da0 sd0 in sy cs us sy id > > 4 0 14 969160 91960 685 2 2 1 707 304 0 0 11418 692 1273 45 5 50 > > > > Does that point to the memory exhaustion suggested earlier in the thread? > > At this point /boot/loader.conf contains vm.pfault_oom_attempts="-1", but > > that's a relic of long-ago attempts to use USB flash for root and swap. > > Might removing it stimulate more warning messages? > > > > vm.pfault_oom_attempts="-1" should only be used in contexts where > running out of swap will not happen. Otherwise a deadlocked system > can result if it does run out of swap. (Run-out has more senses the > just the swap partition being fully used: other internal resources > for keeping track of the swap can run into its limits.) I've no > evidence that the -1 was actually a problem. > > I do not find any 1000+ ms/w or ms/r figures in swapscript.log . > I found 3 examples of a little under 405 (on sdda0*), 3 between > 340 and 345 (da0*), 4 in the 200s (da0*), under 60 in the 100s > (da0*). It does not look to me like the recorded part had problems > with the long latencies that you used to have happen. > > So I've not found any specific evidence about what led to the > hangup. So my earlier questions/suggestions are basically > arbitrary and I would not know what to do with any answers > to the questions. > > The only notes that are fairly solid are about the hangup leading > to there being some files that were likely incompletely updated > (logs, compiler output files, etc.). > The notion that log files might be truncated didn't didn't register until you brought it up. The obvious things to try seem to be: Disable vm.pfault_oom_attempts="-1" Decrease swap partition size Try using WITH_META_MODE WITH_META_MODE seems relatively easy; just add -DWITH_META_MODE to the make command line for buildworld and add filemon_load="YES" to boot/loader.conf, but I wonder about the burdens it'll impose on CPU and disk space/throughput. There's nothing to lose at this stage, but I'm all ears if there's a better approach. Many thanks for your patient good counsel! bob prohaska
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20210118015009.GA31353>