From: Mark Millard
Subject: Re: Chasing OOM Issues - good sysctl metrics to use?
List-Id: Discussions about the use of FreeBSD-current
List-Archive: https://lists.freebsd.org/archives/freebsd-current
Message-Id: <8C14A90D-3429-437C-A815-E811B7BFBF05@yahoo.com>
Date: Sat, 14 May 2022 01:09:30 -0700
To: Pete Wright, freebsd-current

Pete Wright wrote on
Date: Fri, 13 May 2022 13:43:11 -0700 :

> On 5/11/22 12:52, Mark Millard wrote:
> >
> >
> > Relative to avoiding hang-ups, so far it seems that
> > use of vm.swap_enabled=0 with vm.swap_idle_enabled=0
> > makes hang-ups less likely/less frequent/harder to
> > produce examples of. But is no guarantee of lack of
> > a hang-up. It does change the cause of the hang-up
> > (in that it avoids processes with kernel stacks swapped
> > out being involved).
>
> thanks for the above analysis Mark. i am going to test these settings
> out now as i'm still seeing the lockup.
>
> this most recent hang-up was using a patch tijl_at_ asked me to test
> (attached to this email), and the default setting of vm.pageout_oom_seq:
> 12.

I also had been running various tests for tijl_at_, the same
sort of 'removal of the " + 1"' patch. I had found a basic
way to tell if a fundamental problem was completely avoided
or not, without having to wait long periods of activity to
do so. But that does not mean the test is a good simulation
of your context's sequence that leads to issues. Nor does it
indicate how wide a range of activity is fairly likely to
reach the failing conditions.

You could see how vm.pageout_oom_seq=120 does for you with
the patch. I was never patient enough to wait long enough
for this to OOM kill or hang-up in my test context.

I've been reporting the likes of:

# sysctl vm.domain.0.stats # done after the fact
vm.domain.0.stats.inactive_pps: 1037
vm.domain.0.stats.free_severe: 15566
vm.domain.0.stats.free_min: 25759
vm.domain.0.stats.free_reserved: 5374
vm.domain.0.stats.free_target: 86914
vm.domain.0.stats.inactive_target: 130371
vm.domain.0.stats.unswppdpgs: 0
vm.domain.0.stats.unswappable: 0
vm.domain.0.stats.laundpdpgs: 858845
vm.domain.0.stats.laundry: 9
vm.domain.0.stats.inactpdpgs: 1040939
vm.domain.0.stats.inactive: 1063
vm.domain.0.stats.actpdpgs: 407937767
vm.domain.0.stats.active: 1032
vm.domain.0.stats.free_count: 3252526

But I also have a kernel that reports just before the
call that is to cause an OOM kill, ending up with output
like:

vm_pageout_mightbe_oom: kill context: v_free_count: 15306, v_inactive_count: 1, v_laundry_count: 64, v_active_count: 3891599
May 11 00:44:11 CA72_Mbin_ZFS kernel: pid 844 (stress), jid 0, uid 0, was killed: failed to reclaim memory

(I was testing main [so: 14].) So I report that as well.
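
For comparing such reports across runs, the "name: value" pairs from
sysctl vm.domain.0.stats can be pulled into named integers. What
follows is only a minimal sketch in Python, assuming the output format
shown above. The names parse_domain_stats and summarize, and the
choice of derived comparisons, are illustrative assumptions and not
part of FreeBSD or of the testing described in this message.

```python
import re

# Matches one "vm.domain.<N>.stats.<name>: <integer>" pair, as in the
# sysctl output quoted in the message. Works whether the pairs are on
# separate lines or run together on one line.
STAT_RE = re.compile(r"vm\.domain\.(\d+)\.stats\.(\w+):\s*(\d+)")


def parse_domain_stats(text):
    """Return {domain: {stat_name: value}} for every pair found in text."""
    stats = {}
    for domain, name, value in STAT_RE.findall(text):
        stats.setdefault(int(domain), {})[name] = int(value)
    return stats


def summarize(domain_stats):
    """Derive a few comparisons from one domain's page counts."""
    return {
        # Sum of the inactive and laundry queue page counts.
        "inactive_plus_laundry": domain_stats["inactive"] + domain_stats["laundry"],
        # Free page count compared against the reported thresholds.
        "free_below_min": domain_stats["free_count"] < domain_stats["free_min"],
        "free_below_target": domain_stats["free_count"] < domain_stats["free_target"],
    }


# A subset of the figures reported in the message.
SAMPLE = """\
vm.domain.0.stats.free_min: 25759
vm.domain.0.stats.free_target: 86914
vm.domain.0.stats.laundry: 9
vm.domain.0.stats.inactive: 1063
vm.domain.0.stats.free_count: 3252526
"""

summary = summarize(parse_domain_stats(SAMPLE)[0])
print(summary)
```

The sample figures were taken after the fact, so the free count is well
above both thresholds; the comparisons only become interesting for
values captured near an OOM kill or hang-up.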
Since I was using stress as part of my test context, there
were also lines like:

stress: FAIL: [843] (415) <-- worker 844 got signal 9
stress: WARN: [843] (417) now reaping child worker processes
stress: FAIL: [843] (451) failed run completed in 119s

(tijl_at_ had me add v_laundry_count and v_active_count
to what I've had carried forward since back in 2018 when
Mark J. provided the original extra message.)

Turns out the kernel debugger (db> prompt) can report the
same general sort of figures:

db> show page
vm_cnt.v_free_count: 15577
vm_cnt.v_inactive_count: 1
vm_cnt.v_active_count: 3788852
vm_cnt.v_laundry_count: 0
vm_cnt.v_wire_count: 272395
vm_cnt.v_free_reserved: 5374
vm_cnt.v_free_min: 25759
vm_cnt.v_free_target: 86914
vm_cnt.v_inactive_target: 130371

db> show pageq
pq_free 15577
dom 0 page_cnt 4077116 free 15577 pq_act 3788852 pq_inact 1 pq_laund 0 pq_unsw 0

(Note: pq_unsw is a non-swappable count that excludes
the wired count, apparently matching
vm.domain.0.stats.unswappable .)

The above is the smallest pq_inact+pq_laund that I saw
at the OOM kill time or during a "hang-up" (what I saw
across example "hang-ups" suggests to me a livelock
context, not a deadlock context).

> interestingly enough with the patch applied i observed a smaller
> amount of memory used for laundry as well as less swap space used until
> right before the crash.

If your logging of values has been made public, I've
not (yet?) looked at it at all.

None of my testing reached a stage of having much
swap space in use. But the test is biased to produce
the problems quickly, rather than to explore a range
of ways to reach conditions with the problem.

I've stopped testing for now and am doing a round
of OS building and upgrading, port (re-)building and
installing and the like, mostly for aarch64 but also
for armv7 and amd64. (This is without the 'remove
" + 1"' patch.)

One of the points is to see if I get any evidence of
vm.swap_enabled=0 with vm.swap_idle_enabled=0 ending up
contributing to any problems in my normal usage. So far: no.
vm.pageout_oom_seq=120 is in use for this, my normal
context since sometime in 2018.

===
Mark Millard
marklmi at yahoo.com
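
For watching the counters discussed above during normal usage, one
option is a small periodic logger. The following is only a sketch in
Python under stated assumptions: it relies on the FreeBSD sysctl
command's -n flag to print a value without its name, and the function
names, the OID selection, the log path and the interval are all
hypothetical choices rather than anything used in the testing
described in this message.

```python
import os
import subprocess
import time

# OIDs named in the message; the selection itself is an assumption.
OIDS = [
    "vm.domain.0.stats.free_count",
    "vm.domain.0.stats.inactive",
    "vm.domain.0.stats.laundry",
    "vm.domain.0.stats.active",
    "vm.pageout_oom_seq",
]


def read_sysctl(name):
    """Return the value of one sysctl OID as a string via `sysctl -n`."""
    result = subprocess.run(
        ["sysctl", "-n", name], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def format_sample(oids, reader=read_sysctl, now=None):
    """Build one log line: a local timestamp followed by name=value pairs."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime(now))
    pairs = " ".join(f"{name}={reader(name)}" for name in oids)
    return f"{stamp} {pairs}"


def log_forever(path, interval_seconds=10, oids=OIDS):
    """Append one sample to path every interval_seconds until interrupted."""
    with open(path, "a") as log:
        while True:
            log.write(format_sample(oids) + "\n")
            # Flush Python's buffer and ask the OS to write the file out,
            # so the most recent samples are more likely to survive a hang.
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval_seconds)


# Example call (hypothetical path), left commented out so that loading
# this file does not start the loop:
# log_forever("/var/log/vm-stats.log", interval_seconds=10)
```

The reader argument to format_sample exists so that the formatting can
be exercised without a FreeBSD system, by passing a function that
returns canned values.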