From owner-freebsd-arm@freebsd.org Mon Jan 18 01:50:09 2021 Return-Path: Delivered-To: freebsd-arm@mailman.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.nyi.freebsd.org (Postfix) with ESMTP id 4CE654DDDF2; Mon, 18 Jan 2021 01:50:09 +0000 (UTC) (envelope-from fbsd@www.zefox.net) Received: from www.zefox.net (www.zefox.net [50.1.20.27]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (2048 bits) client-digest SHA256) (Client CN "www.zefox.com", Issuer "www.zefox.com" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 4DJvqh0Lzhz4mQy; Mon, 18 Jan 2021 01:50:07 +0000 (UTC) (envelope-from fbsd@www.zefox.net) Received: from www.zefox.net (localhost [127.0.0.1]) by www.zefox.net (8.16.1/8.15.2) with ESMTPS id 10I1oA88032823 (version=TLSv1.3 cipher=TLS_AES_256_GCM_SHA384 bits=256 verify=NO); Sun, 17 Jan 2021 17:50:10 -0800 (PST) (envelope-from fbsd@www.zefox.net) Received: (from fbsd@localhost) by www.zefox.net (8.16.1/8.15.2/Submit) id 10I1oApG032822; Sun, 17 Jan 2021 17:50:10 -0800 (PST) (envelope-from fbsd) Date: Sun, 17 Jan 2021 17:50:10 -0800 From: bob prohaska To: bob prohaska Cc: Current FreeBSD , freebsd-arm@freebsd.org Subject: Re: Silent hang in buildworld, was Re: Invoking -v for clang during buildworld Message-ID: <20210118015009.GA31353@www.zefox.net> References: <20210116043740.GA19523@www.zefox.net> <20210116155538.GA24259@www.zefox.net> <20210116220334.GA26756@www.zefox.net> <20210117174006.GA30728@www.zefox.net> <85889EAE-F579-4220-9185-944D9AA5075A@yahoo.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <85889EAE-F579-4220-9185-944D9AA5075A@yahoo.com> X-Rspamd-Queue-Id: 4DJvqh0Lzhz4mQy X-Spamd-Bar: - Authentication-Results: mx1.freebsd.org; dkim=none; dmarc=none; spf=none (mx1.freebsd.org: domain of fbsd@www.zefox.net has no SPF policy when checking 50.1.20.27) smtp.mailfrom=fbsd@www.zefox.net X-Spamd-Result: default: False [-1.10 / 15.00]; RCVD_TLS_ALL(0.00)[]; ARC_NA(0.00)[]; WWW_DOT_DOMAIN(0.50)[]; MID_RHS_MATCH_FROM(0.00)[]; FROM_HAS_DN(0.00)[]; RCPT_COUNT_THREE(0.00)[3]; TO_DN_SOME(0.00)[]; NEURAL_HAM_LONG(-1.00)[-1.000]; MIME_GOOD(-0.10)[text/plain]; DMARC_NA(0.00)[zefox.net]; RBL_DBL_DONT_QUERY_IPS(0.00)[50.1.20.27:from]; AUTH_NA(1.00)[]; SPAMHAUS_ZRD(0.00)[50.1.20.27:from:127.0.2.255]; TO_MATCH_ENVRCPT_SOME(0.00)[]; NEURAL_HAM_SHORT(-1.00)[-1.000]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; R_SPF_NA(0.00)[no SPF record]; FROM_EQ_ENVFROM(0.00)[]; R_DKIM_NA(0.00)[]; MIME_TRACE(0.00)[0:+]; ASN(0.00)[asn:7065, ipnet:50.1.16.0/20, country:US]; RCVD_COUNT_TWO(0.00)[2]; MAILMAN_DEST(0.00)[freebsd-current,freebsd-arm]; MID_RHS_WWW(0.50)[] X-BeenThere: freebsd-arm@freebsd.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: "Porting FreeBSD to ARM processors." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 18 Jan 2021 01:50:09 -0000 On Sun, Jan 17, 2021 at 12:30:51PM -0800, Mark Millard wrote: > > > On 2021-Jan-17, at 09:40, bob prohaska wrote: > > > On Sat, Jan 16, 2021 at 03:04:04PM -0800, Mark Millard wrote: > >> > >> Other than -j1 style builds (or equivalent), one pretty much > >> always needs to go looking around for a non-panic failure. It > >> is uncommon for all the material to be together in the build > >> log in such contexts. > > > > Running make cleandir twice and restarting -j4 buildworld brought > > the process full circle: A silent hang, no debugger response, no > > console warnings. That's what sent me down the rabbit hole of make > > without clean, which worked at least once... > > Unfortunately, such a hang tends to mean that log files and such > were not completely written out to media. We do not get to see > evidence of the actual failure time frame, just somewhat before. > (compiler/linker output and such can have the same issues of > ending up with incomplete updates.) > > So, pretty much my notes are unlikely to be strongly tied to > any solid evidence: more like alternatives to possibly explore > that could be far off the mark. > > It is not clear if you were using: > > LDFLAGS.lld+= -Wl,--threads=1 > > or some such to limit the multi-thread linking and its memory. No, I wasn't trying to limit ld.lld thread number. > I'll note that if -j4 gets 4 links running in parallel it used > to be each could have something like 5 threads active on a 4 > core machine, so 20 or so threads. (I've not checked llvm11's > lld behavior. It might avoid such for defaults.) > > You have not reported any testing of -j2 or -j3 so far, just > -j4 . (Another way of limiting memory use, power use, temperature, > etc. .) > Not recently, simply because it's so slow to build. On my "production" armv7 machines running stable/12 I do use -j2. But, they get updated only a couple times per year, when there's a security issue. > You have not reported if your boot complained about the swap > space size or if you have adjusted related settings to make > non-default tradeoffs for swap amanagment for these specific > tests. I recommend not tailoring and using a swap size total > that is somewhat under what starts to complain when there is > no tailoring. > Both Pi2 and Pi3 have been complaining about too much swap since I first got them. Near as can be told it's never been a demonstrated problem, thus far. Now, as things like LLVM get bigger and bigger, it seems possible excess swap might cause, or obscure, other problems. For the Pi2 I picked 2 GB from the old "2x physical RAM" rule. > > > The residue of the top screen shows > > > > last pid: 63377; load averages: 4.29, 4.18, 4.15 up 1+07:11:07 04:46:46 > > 60 processes: 5 running, 55 sleeping > > CPU: 70.7% user, 0.0% nice, 26.5% system, 2.8% interrupt, 0.0% idle > > Mem: 631M Active, 4932K Inact, 92M Laundry, 166M Wired, 98M Buf, 18M Free > > Swap: 2048M Total, 119M Used, 1928M Free, 5% Inuse, 16K In, 3180K Out > > packet_write_wait: Connection to 50.1.20.26 port 22: Broken pipe > > bob@raspberrypi:~ $ ssh www.zefox.com RES STATE C TIME WCPU COMMAND > > ssh: connect to host www.zefox.com port 22: Connection timed out86.17% c++ > > bob@raspberrypi:~ $ 1 99 0 277M 231M RUN 0 3:26 75.00% c++ > > 63245 bob 1 99 0 219M 173M CPU0 0 2:10 73.12% c++ > > 62690 bob 1 98 0 354M 234M RUN 3 9:42 47.06% c++ > > 63377 bob 1 30 0 5856K 2808K nanslp 0 0:00 3.13% gstat > > 38283 bob 1 24 0 5208K 608K wait 2 2:00 0.61% sh > > 995 bob 1 20 0 6668K 1184K CPU3 3 8:46 0.47% top > > 990 bob 1 20 0 12M 1060K select 2 0:48 0.05% sshd > > .... > > This does not look like ld was in use as of the last top > display update's content. But the time between reasonable > display updates is fairly long relative to CPU activity > so it is only suggestive. > > > [apologies for typing over the remnants] > > > > I've put copies of the build and swap logs at > > > > http://www.zefox.net/~fbsd/rpi2/buildworld/ > > > > The last vmstat entry (10 second repeat time) reports: > > procs memory page disks faults cpu > > r b w avm fre flt re pi po fr sr da0 sd0 in sy cs us sy id > > 4 0 14 969160 91960 685 2 2 1 707 304 0 0 11418 692 1273 45 5 50 > > > > Does that point to the memory exhaustion suggested earlier in the thread? > > At this point /boot/loader.conf contains vm.pfault_oom_attempts="-1", but > > that's a relic of long-ago attempts to use USB flash for root and swap. > > Might removing it stimulate more warning messages? > > > > vm.pfault_oom_attempts="-1" should only be used in contexts where > running out of swap will not happen. Otherwise a deadlocked system > can result if it does run out of swap. (Run-out has more senses the > just the swap partition being fully used: other internal resources > for keeping track of the swap can run into its limits.) I've no > evidence that the -1 was actually a problem. > > I do not find any 1000+ ms/w or ms/r figures in swapscript.log . > I found 3 examples of a little under 405 (on sdda0*), 3 between > 340 and 345 (da0*), 4 in the 200s (da0*), under 60 in the 100s > (da0*). It does not look to me like the recorded part had problems > with the long latencies that you used to have happen. > > So I've not found any specific evidence about what led to the > hangup. So my earlier questions/suggestions are basically > arbitrary and I would not know what to do with any answers > to the questions. > > The only notes that are fairly solid are about the hangup leading > to there being some files that were likely incompletely updated > (logs, compiler output files, etc.). > The notion that log files might be truncated didn't didn't register until you brought it up. The obvious things to try seem to be: Disable vm.pfault_oom_attempts="-1" Decrease swap partition size Try using WITH_META_MODE WITH_META_MODE seems relatively easy; just add -DWITH_META_MODE to the make command line for buildworld and add filemon_load="YES" to boot/loader.conf, but I wonder about the burdens it'll impose on CPU and disk space/throughput. There's nothing to lose at this stage, but I'm all ears if there's a better approach. Many thanks for your patient good counsel! bob prohaska