From nobody Thu Feb  1 14:47:44 2024
Date: Thu, 01 Feb 2024 14:47:44 +0000
From: Karl Pielorz <kpielorz_lst@tdx.co.uk>
To: Daniel Braniss <danny@cs.huji.ac.il>,
        freebsd-hackers <hackers@freebsd.org>
Subject: Re: ... was killed: a thread waited too long to allocate a page
Message-ID: <29D13BFFCFA5255C07379043@[10.12.30.106]>
In-Reply-To: <0C31C8D8-2335-43ED-96B3-21AC46F30C1D@cs.huji.ac.il>
References: <0C31C8D8-2335-43ED-96B3-21AC46F30C1D@cs.huji.ac.il>
X-Mailer: Mulberry/4.0.8 (Win32)
List-Id: Technical discussions relating to FreeBSD <freebsd-hackers.freebsd.org>
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers
List-Help: <mailto:freebsd-hackers+help@freebsd.org>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Subscribe: <mailto:freebsd-hackers+subscribe@freebsd.org>
List-Unsubscribe: <mailto:freebsd-hackers+unsubscribe@freebsd.org>
Sender: owner-freebsd-hackers@freebsd.org
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline


--On 28 December 2023 11:38 +0200 Daniel Braniss <danny@cs.huji.ac.il> 
wrote:

> hi,
> I'm running 13.2 Stable on this particular host, which has about 200TB of
> zfs storage the host also has some 132Gb of memory,
> lately, mountd is getting killed:
>   kernel: pid 3212 (mountd), jid 0, uid 0, was killed: a thread waited
> too long to allocate a page
>
> rpcinfo shows it's still there, but
> 	service mountd restart
> fails.
>
> only solution is to reboot.
> BTW, the only 'heavy' stuff that I can see are several rsync
> processes.

Hi,

I seem to have run into something similar. I recently upgraded a 12.4 box 
to 13.2p9. The box has 32G of RAM, and runs ZFS. We do a lot of rsync work 
to it monthly - the first month we did this under 13.2p9, a lot of 
processes were killed, all with a similar (but not identical) message, e.g.

pid 11103 (ssh), jid 0, uid 0, was killed: failed to reclaim memory
pid 10972 (local-unbound), jid 0, uid 59, was killed: failed to reclaim memory
pid 3223 (snmpd), jid 0, uid 0, was killed: failed to reclaim memory
pid 3243 (mountd), jid 0, uid 0, was killed: failed to reclaim memory
pid 3251 (nfsd), jid 0, uid 0, was killed: failed to reclaim memory
pid 10996 (sshd), jid 0, uid 0, was killed: failed to reclaim memory
pid 3257 (sendmail), jid 0, uid 0, was killed: failed to reclaim memory
pid 8562 (csh), jid 0, uid 0, was killed: failed to reclaim memory
pid 3363 (smartd), jid 0, uid 0, was killed: failed to reclaim memory
pid 8558 (csh), jid 0, uid 0, was killed: failed to reclaim memory
pid 3179 (ntpd), jid 0, uid 0, was killed: failed to reclaim memory
pid 8555 (tcsh), jid 0, uid 1001, was killed: failed to reclaim memory
pid 3260 (sendmail), jid 0, uid 25, was killed: failed to reclaim memory
pid 2806 (devd), jid 0, uid 0, was killed: failed to reclaim memory
pid 3156 (rpcbind), jid 0, uid 0, was killed: failed to reclaim memory
pid 3252 (nfsd), jid 0, uid 0, was killed: failed to reclaim memory
pid 3377 (getty), jid 0, uid 0, was killed: failed to reclaim memory

This looks like an 'out of RAM' type of situation - but at the time, top showed:

last pid: 12622;  load averages:  0.10,  0.24,  0.13 

7 processes:   1 running, 6 sleeping
CPU:  0.1% user,  0.0% nice,  0.2% system,  0.0% interrupt, 99.7% idle
Mem: 4324K Active, 8856K Inact, 244K Laundry, 24G Wired, 648M Buf, 7430M Free
ARC: 20G Total, 8771M MFU, 10G MRU, 2432K Anon, 161M Header, 920M Other
     15G Compressed, 23G Uncompressed, 1.59:1 Ratio
Swap: 8192M Total, 5296K Used, 8187M Free

Rebooting recovered it, and it completed the rsync after the reboot - 
which left us with:

last pid: 12570;  load averages:  0.07,  0.14,  0.17 
up 0+00:15:06  14:43:56
26 processes:  1 running, 25 sleeping
CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Mem: 39M Active, 5640K Inact, 17G Wired, 42M Buf, 14G Free
ARC: 15G Total, 33M MFU, 15G MRU, 130K Anon, 32M Header, 138M Other
     14G Compressed, 15G Uncompressed, 1.03:1 Ratio
Swap: 8192M Total, 8192M Free


I've not seen any bug reports along these lines; in fact there's very 
little coverage of this specific error at all.

My only thought is to set a sysctl to limit ZFS ARC usage, i.e. to leave 
more free RAM available to the rest of the system. During the rsync it was 
'swapping' occasionally (a few K in, a few K out) - but it never ran out of 
swap that I saw - and it certainly didn't look like a complete 
out-of-memory scenario (which is what it felt like with everything getting 
killed).
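
For anyone wanting to try the ARC limit idea, here is a rough sketch of how 
the cap could be worked out. It assumes the tunable is vfs.zfs.arc_max 
(check with 'sysctl -d vfs.zfs.arc_max' on your release), and the 16G 
reserve is purely an illustrative figure for a 32G box, not a tested 
recommendation. The helper names are made up for the example.

```python
GIB = 1024 ** 3  # bytes per GiB


def arc_max_bytes(total_ram_gib, reserve_gib):
    """Return an ARC cap in bytes that keeps reserve_gib of RAM outside the ARC."""
    if reserve_gib <= 0 or reserve_gib >= total_ram_gib:
        raise ValueError("reserve must be greater than 0 and less than total RAM")
    return (total_ram_gib - reserve_gib) * GIB


def loader_conf_line(limit_bytes):
    """Format the cap as a line for /boot/loader.conf (assumed tunable name)."""
    return f'vfs.zfs.arc_max="{limit_bytes}"'


if __name__ == "__main__":
    # Illustrative numbers only: 32G box, keep 16G away from the ARC.
    limit = arc_max_bytes(32, 16)
    print(loader_conf_line(limit))
    # The same value could be tried at runtime, assuming the sysctl is writable:
    print(f"sysctl vfs.zfs.arc_max={limit}")
```

The value is in bytes; whether a cap like this actually avoids the kills is 
something that would need testing over a full rsync run.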


-Karl