From: Mark Millard <marklmi@yahoo.com>
List-Id: Technical discussions relating to FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers
Subject: Re: ...
 was killed: a thread waited too long to allocate a page
 [actually: was killed: failed to reclaim memory problem]
Date: Thu, 1 Feb 2024 08:30:19 -0800
To: kpielorz_lst@tdx.co.uk, FreeBSD Hackers

Karl Pielorz wrote on Thu, 01 Feb 2024 14:47:44 UTC:

> --On 28 December 2023 11:38 +0200 Daniel Braniss wrote:
>
> > hi,
> > I'm running 13.2 Stable on this particular host, which has about 200TB
> > of zfs storage. The host also has some 132GB of memory.
> > Lately, mountd is getting killed:
> >
> > kernel: pid 3212 (mountd), jid 0, uid 0, was killed: a thread waited
> > too long to allocate a page
> >
> > rpcinfo shows it's still there, but
> > service mountd restart
> > fails.
> >
> > The only solution is to reboot.
> > BTW, the only 'heavy' stuff that I can see is several rsync
> > processes.
>
> Hi,
>
> I seem to have run into something similar. I recently upgraded a 12.4 box
> to 13.2p9. The box has 32G of RAM, and runs ZFS.
> We do a lot of rsync work
> to it monthly - the first month we've done this with 13.2p9 we got a lot
> of processes killed, all with a similar (but not identical) message, e.g.:
>
> pid 11103 (ssh), jid 0, uid 0, was killed: failed to reclaim memory
> pid 10972 (local-unbound), jid 0, uid 59, was killed: failed to reclaim memory
> pid 3223 (snmpd), jid 0, uid 0, was killed: failed to reclaim memory
> pid 3243 (mountd), jid 0, uid 0, was killed: failed to reclaim memory
> pid 3251 (nfsd), jid 0, uid 0, was killed: failed to reclaim memory
> pid 10996 (sshd), jid 0, uid 0, was killed: failed to reclaim memory
> pid 3257 (sendmail), jid 0, uid 0, was killed: failed to reclaim memory
> pid 8562 (csh), jid 0, uid 0, was killed: failed to reclaim memory
> pid 3363 (smartd), jid 0, uid 0, was killed: failed to reclaim memory
> pid 8558 (csh), jid 0, uid 0, was killed: failed to reclaim memory
> pid 3179 (ntpd), jid 0, uid 0, was killed: failed to reclaim memory
> pid 8555 (tcsh), jid 0, uid 1001, was killed: failed to reclaim memory
> pid 3260 (sendmail), jid 0, uid 25, was killed: failed to reclaim memory
> pid 2806 (devd), jid 0, uid 0, was killed: failed to reclaim memory
> pid 3156 (rpcbind), jid 0, uid 0, was killed: failed to reclaim memory
> pid 3252 (nfsd), jid 0, uid 0, was killed: failed to reclaim memory
> pid 3377 (getty), jid 0, uid 0, was killed: failed to reclaim memory
>
> This 'looks' like an 'out of RAM' type situation - but at the time, top
> showed:
>
> last pid: 12622;  load averages: 0.10, 0.24, 0.13
> 7 processes: 1 running, 6 sleeping
> CPU: 0.1% user, 0.0% nice, 0.2% system, 0.0% interrupt, 99.7% idle
> Mem: 4324K Active, 8856K Inact, 244K Laundry, 24G Wired, 648M Buf, 7430M Free
> ARC: 20G Total, 8771M MFU, 10G MRU, 2432K Anon, 161M Header, 920M Other
>      15G Compressed, 23G Uncompressed, 1.59:1 Ratio
> Swap: 8192M Total, 5296K Used, 8187M Free
>
> Rebooting it recovers it, and it completed the rsync
> after the reboot -
> which left us with:
>
> last pid: 12570;  load averages: 0.07, 0.14, 0.17  up 0+00:15:06  14:43:56
> 26 processes: 1 running, 25 sleeping
> CPU: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
> Mem: 39M Active, 5640K Inact, 17G Wired, 42M Buf, 14G Free
> ARC: 15G Total, 33M MFU, 15G MRU, 130K Anon, 32M Header, 138M Other
>      14G Compressed, 15G Uncompressed, 1.03:1 Ratio
> Swap: 8192M Total, 8192M Free
>
> I've not seen any bug reports along this line - in fact, very little
> coverage at all of the specific error.
>
> My only thought is to set a sysctl to limit ZFS ARC usage, i.e. to leave
> more free RAM floating around the system. During the rsync it was
> 'swapping' occasionally (a few K in, a few K out) - but it never ran out
> of swap that I saw - and it certainly didn't look like a complete
> out-of-memory scenario (which is what it felt like, with everything
> getting killed).

One direction of control is the following. What do you have for this
tunable (copied from my /boot/loader.conf)?

#
# Delay when persistent low free RAM leads to
# Out Of Memory killing of processes:
vm.pageout_oom_seq=120

The default is 12 (last I knew, anyway). The 120 figure has allowed me and
others to do buildworld, buildkernel, and poudriere bulk runs on small arm
boards using all cores that otherwise got "failed to reclaim memory" (to
use the modern, improved [not misleading] message text). Similarly for
others who had other kinds of contexts that got the message.

(The units for the 120 are not time units: it is more like a number of
(re)tries to gain at least a target amount of Free RAM before failure
handling starts. The comment wording is based on a consequence of the
assignment.)

The 120 is not a maximum, just a figure that has proved useful in various
contexts. But see the notes below as well.
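For reference, a sketch of how both knobs could be set, combining the
vm.pageout_oom_seq tunable above with Karl's idea of capping the ARC
(vfs.zfs.arc_max is the usual OpenZFS knob for that; the "16G" value below
is purely illustrative, not a recommendation):

```
# /boot/loader.conf -- applied at the next boot:
vm.pageout_oom_seq=120    # more reclaim passes before "failed to reclaim
                          # memory" kills begin (default 12)
vfs.zfs.arc_max="16G"     # illustrative ARC cap; size it for your workload

# The same values can be inspected/changed on a running system (as root):
sysctl vm.pageout_oom_seq        # show the current value
sysctl vm.pageout_oom_seq=120    # change it without a reboot
sysctl vfs.zfs.arc_max           # show the current ARC cap
```

Whether a runtime change to vfs.zfs.arc_max shrinks an already-large ARC
promptly can vary; the loader.conf route avoids that question.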
Notes:

"failed to reclaim memory" can happen even with swap space enabled but no
swap in use: sufficiently active pages are just not paged out to swap
space, so if most non-wired pages are classified as active, the kills can
start. (There are some other parameters of possible use for some other
modern "was killed" reason texts.)

Wired pages are pages that cannot be swapped out, even if classified as
inactive. Your report indicates 24G Wired, with 20G of that being from ARC
use. This likely was after some processes had already been killed, so
likely more was wired and less was free at the start of the kills. That
24G+ of wired meant that under 8 GiBytes were available for everything
else.

Avoiding that by limiting the ARC (tuning ZFS), or adjusting how the work
load is spread over time, or some combination, also looks appropriate.

I've no clue why ARC use would be significantly different for 12.4 vs.
13.2p9.

===
Mark Millard
marklmi at yahoo.com