From: Mark Millard <marklmi@yahoo.com>
List-Id: Technical discussions relating to FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers
Subject: Re: ...
 was killed: a thread waited too long to allocate a page
 [actually: was killed: failed to reclaim memory problem]
Date: Thu, 1 Feb 2024 08:30:19 -0800
To: kpielorz_lst@tdx.co.uk, FreeBSD Hackers

Karl Pielorz wrote on Thu, 01 Feb 2024 14:47:44 UTC:

> --On 28 December 2023 11:38 +0200 Daniel Braniss wrote:
>
> > hi,
> > I'm running 13.2 Stable on this particular host, which has about 200TB
> > of zfs storage. The host also has some 132GB of memory.
> > Lately, mountd is getting killed:
> >
> > kernel: pid 3212 (mountd), jid 0, uid 0, was killed: a thread waited
> > too long to allocate a page
> >
> > rpcinfo shows it's still there, but
> > service mountd restart
> > fails.
> >
> > The only solution is to reboot.
> > BTW, the only 'heavy' stuff that I can see is several rsync
> > processes.
>
> Hi,
>
> I seem to have run into something similar. I recently upgraded a 12.4 box
> to 13.2p9. The box has 32G of RAM, and runs ZFS.
> We do a lot of rsync work
> to it monthly - the first month we've done this with 13.2p9 we got a lot
> of processes killed, all with a similar (but not identical) message, e.g.:
>
> pid 11103 (ssh), jid 0, uid 0, was killed: failed to reclaim memory
> pid 10972 (local-unbound), jid 0, uid 59, was killed: failed to reclaim memory
> pid 3223 (snmpd), jid 0, uid 0, was killed: failed to reclaim memory
> pid 3243 (mountd), jid 0, uid 0, was killed: failed to reclaim memory
> pid 3251 (nfsd), jid 0, uid 0, was killed: failed to reclaim memory
> pid 10996 (sshd), jid 0, uid 0, was killed: failed to reclaim memory
> pid 3257 (sendmail), jid 0, uid 0, was killed: failed to reclaim memory
> pid 8562 (csh), jid 0, uid 0, was killed: failed to reclaim memory
> pid 3363 (smartd), jid 0, uid 0, was killed: failed to reclaim memory
> pid 8558 (csh), jid 0, uid 0, was killed: failed to reclaim memory
> pid 3179 (ntpd), jid 0, uid 0, was killed: failed to reclaim memory
> pid 8555 (tcsh), jid 0, uid 1001, was killed: failed to reclaim memory
> pid 3260 (sendmail), jid 0, uid 25, was killed: failed to reclaim memory
> pid 2806 (devd), jid 0, uid 0, was killed: failed to reclaim memory
> pid 3156 (rpcbind), jid 0, uid 0, was killed: failed to reclaim memory
> pid 3252 (nfsd), jid 0, uid 0, was killed: failed to reclaim memory
> pid 3377 (getty), jid 0, uid 0, was killed: failed to reclaim memory
>
> This 'looks' like an 'out of RAM' type situation - but at the time, top
> showed:
>
> last pid: 12622;  load averages: 0.10, 0.24, 0.13
> 7 processes: 1 running, 6 sleeping
> CPU: 0.1% user, 0.0% nice, 0.2% system, 0.0% interrupt, 99.7% idle
> Mem: 4324K Active, 8856K Inact, 244K Laundry, 24G Wired, 648M Buf, 7430M Free
> ARC: 20G Total, 8771M MFU, 10G MRU, 2432K Anon, 161M Header, 920M Other
>      15G Compressed, 23G Uncompressed, 1.59:1 Ratio
> Swap: 8192M Total, 5296K Used, 8187M Free
>
> Rebooting it recovers it, and it completed the rsync
> after the reboot -
> which left us with:
>
> last pid: 12570;  load averages: 0.07, 0.14, 0.17  up 0+00:15:06  14:43:56
> 26 processes: 1 running, 25 sleeping
> CPU: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
> Mem: 39M Active, 5640K Inact, 17G Wired, 42M Buf, 14G Free
> ARC: 15G Total, 33M MFU, 15G MRU, 130K Anon, 32M Header, 138M Other
>      14G Compressed, 15G Uncompressed, 1.03:1 Ratio
> Swap: 8192M Total, 8192M Free
>
> I've not seen any bug reports along this line - in fact, very little
> coverage at all of the specific error.
>
> My only thought is to set a sysctl to limit ZFS ARC usage, i.e. to leave
> more free RAM floating around the system. During the rsync it was
> 'swapping' occasionally (a few K in, a few K out) - but it never ran out
> of swap that I saw - and it certainly didn't look like a complete
> out-of-memory scenario (which is what it felt like, with everything
> getting killed).

One direction of control is the following. What do you have for this
tunable (copied from my /boot/loader.conf)?

#
# Delay when persistent low free RAM leads to
# Out Of Memory killing of processes:
vm.pageout_oom_seq=120

The default is 12 (last I knew, anyway). The 120 figure has allowed me and
others to do buildworld, buildkernel, and poudriere bulk runs on small arm
boards using all cores that otherwise got "failed to reclaim memory" (to
use the modern, improved [not misleading] message text). Similarly for
others who had other kinds of contexts that got the message.

(The units for the 120 are not time units: it is more like a number of
(re)tries to gain at least a target amount of Free RAM before failure
handling starts. The comment wording is based on a consequence of the
assignment.)

The 120 is not a maximum, just a figure that has proved useful in various
contexts. But see the notes below as well.
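For reference, a sketch of how both knobs could be set, combining the
vm.pageout_oom_seq tunable above with Karl's idea of capping the ARC
(vfs.zfs.arc_max is the usual OpenZFS knob for that; the "16G" value below
is purely illustrative, not a recommendation):

```
# /boot/loader.conf -- applied at the next boot:
vm.pageout_oom_seq=120    # more reclaim passes before "failed to reclaim
                          # memory" kills begin (default 12)
vfs.zfs.arc_max="16G"     # illustrative ARC cap; size it for your workload

# The same values can be inspected/changed on a running system (as root):
sysctl vm.pageout_oom_seq        # show the current value
sysctl vm.pageout_oom_seq=120    # change it without a reboot
sysctl vfs.zfs.arc_max           # show the current ARC cap
```

Whether a runtime change to vfs.zfs.arc_max shrinks an already-large ARC
promptly can vary; the loader.conf route avoids that question.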
Notes:

"failed to reclaim memory" can happen even with swap space enabled but no
swap in use: sufficiently active pages are just not paged out to swap
space, so if most non-wired pages are classified as active, the kills can
start. (There are some other parameters of possible use for some other
modern "was killed" reason texts.)

Wired pages are pages that cannot be swapped out, even if classified as
inactive. Your report indicates 24G Wired, with 20G of that being from ARC
use. This likely was after some processes had already been killed, so
likely more was wired and less was free at the start of the kills. That
24G+ of wired meant that under 8 GiBytes were available for everything
else.

Avoiding that by limiting the ARC (tuning ZFS), or adjusting how the work
load is spread over time, or some combination, also looks appropriate.

I've no clue why ARC use would be significantly different for 12.4 vs.
13.2p9.

===
Mark Millard
marklmi at yahoo.com