Date: Wed, 24 Feb 2021 13:02:27 +0200
From: Konstantin Belousov
To: Alan Somers
Cc: FreeBSD Hackers
Subject: Re: The out-of-swap killer makes poor choices

On Tue, Feb 23, 2021 at 04:29:46PM -0700, Alan Somers wrote:
> On Tue, Feb 23, 2021 at 3:36 PM Konstantin Belousov wrote:
>
> > On Tue, Feb 23, 2021 at 02:20:21PM -0700, Alan Somers wrote:
> > > On Tue, Feb 23, 2021 at 2:11 PM Konstantin Belousov wrote:
> > >
> > > > On Tue, Feb 23, 2021 at 01:49:49PM -0700, Alan Somers wrote:
> > > > > To me it's always seemed like the out-of-swap killer kills the
> > > > > wrong process.  Oh, it does the right thing with a trivial
> > > > > while(1) {malloc()} test program, but not with real workloads.
> > > > > To summarize the logic in vm_pageout_oom:
> > > > >
> > > > > * Don't kill system, protected, or killed processes
> > > > > * Don't kill processes with a thread that isn't running or
> > > > >   suspended
> > > > > * Kill whichever process is using the most swap or swap + ram,
> > > > >   depending on the shortage variable.  On ties, kill the newest
> > > > >   one.
> > > > >
> > > > > This algorithm probably made sense in the days when computers had
> > > > > much more swap than RAM.  But now it leads to several problems:
> > > > >
> > > > > * It's almost guaranteed to do the wrong thing when shortage ==
> > > > >   VM_OOM_SWAPZ and there is little or no swap configured.  If no
> > > > >   swap is configured, it will kill the newest running or suspended
> > > > >   process.  If a little bit is configured, it will probably kill
> > > > >   some idle process, like zfsd, that is swapped out because it
> > > > >   doesn't run very often.
> > > > >
> > > > > * Even if multiple GB of swap are configured, the OOM killer is
> > > > >   still biased towards killing idle processes when shortage ==
> > > > >   VM_OOM_SWAPZ.  Most often, the process responsible for an
> > > > >   out-of-memory condition is not idle, and is consuming large
> > > > >   amounts of RAM.
> > > > >
> > > > > * It ignores RLIMIT_RSS.  We consider that rlimit when deciding
> > > > >   whether to move a process from RAM to swap.
> > > > >
> > > > > * The "out of swap space" kernel message doesn't specify whether
> > > > >   the process was killed because of insufficient swap or RAM (the
> > > > >   shortage variable)
> > > > >
> > > > > I propose the following changes:
> > > > >
> > > > > * Incorporate shortage into the "out of swap space" message.
> > > > ok with me, not sure if users could make any action based on discretion
> > > >
> > > > > * When walking the process list, if any process exceeds its
> > > > >   RLIMIT_RSS, choose it immediately, without bothering to compare
> > > > >   it to older processes.
> > > > RSS was never supposed to be a limit on how many pages are resident.
> > > > It only provided some preference for more aggressive paging out
> > > > process' pages.
> > > >
> > > > Or put it differently, RSS is not supposed to be the working set size
> > > > in VMS/NT sense.
> > > >
> > >
> > > Sure, but given that we must kill _something_, preferentially killing a
> > > process that was specifically limited sounds better than killing a
> > > process that wasn't, won't you agree?
> > Semantic of RLIMIT_RSS is not to limit, but to give preference for pageout.
> > Changing it to the semantic of 'preference for OOM' would give the similar
> > complaint.
> >
> > > > > * Always consider the sum of a process's RAM + swap, regardless
> > > > >   of the shortage variable.
> > > > >
> > > > > Does this make sense?  Am I missing something about shortage ==
> > > > > VM_OOM_SWAPZ?  I don't understand why you would ever want to
> > > > > exclude processes' RAM usage.  That logic was added in revision
> > > > > 2025d69ba7a68a5af173007a8072c45ad797ea23, but I don't understand
> > > > > the rationale.
> > > >
> > > > SWAPZ means that swap zone is exhausted.  In this case, killing a
> > > > process that does not use swap, would not free any space in the zone.
> > > > Similarly, we should select a process with largest swap (== metadata
> > > > kept in swap zone) use to free something in swap zone.
> > > >
> > >
> > > But killing a process that does not use swap could reduce the need for
> > > more swap by other processes.  How many cases are there where a process
> > > needs more SWAP and won't settle for RAM instead?
> > Both choices are somewhat random.
> > The goal is to get more swap zone slack,
> > and this is what the code tried to target.
> >
> > In fact, if OOM kills largest RAM+swap consumer, then with the small swap
> > there is huge chance that swap is not freed, and then on the next nearby
> > pageout attempt some more process would be killed, perhaps innocently.
> >
> > OOM purpose is not to smoother operation of over-committed system, but
> > to have it survive (avoid low resources deadlock) to the state where it
> > can be examined and possibly corrected.
> >
> > > > In other words, such kill could be not enough and really require
> > > > more and more rounds of OOM, esp. on machine with very small swap
> > > > configured.
> >
> Ok, I'll abandon this idea.

No OOM algorithm would ever satisfy everybody.  I explained the reasoning
for the current design, even if it actually evolved this way instead of
being written as a whole with the stated goal.  I do not object to adding
something that would make it fit other goals better as well, but the
current idea of making the system survive should be kept.

As I remember, Linux has more advanced controls to guide OOM decisions.
We only have the 'protected' flag, which should prevent the killer from
ever touching a specific process, like sshd.
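
For readers following the thread, the selection rule as summarized in the
quoted text can be modeled in user space roughly as below.  This is only a
sketch of that summary, not the kernel's vm_pageout_oom(): struct
proc_model, enum oom_shortage, oom_score() and oom_pick_victim() are
hypothetical names, the array is assumed to be ordered newest-first, and
the tie handling follows the description in the thread rather than the
kernel source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's shortage reasons. */
enum oom_shortage { OOM_MEM, OOM_SWAPZ };

/* Hypothetical user-space model of a process; not the kernel's struct proc. */
struct proc_model {
	int	pid;
	bool	exempt;		/* system, protected, or already killed */
	bool	threads_ok;	/* every thread is running or suspended */
	size_t	swap_pages;	/* pages accounted in swap */
	size_t	resident_pages;	/* pages resident in RAM */
};

/*
 * Score used for comparison: swap only when the swap zone is exhausted,
 * swap + RAM otherwise.
 */
static size_t
oom_score(const struct proc_model *p, enum oom_shortage shortage)
{
	size_t size = p->swap_pages;

	if (shortage == OOM_MEM)
		size += p->resident_pages;
	return (size);
}

/*
 * Pick the eligible process with the largest score.  The array is assumed
 * to be ordered newest-first, so the strict comparison keeps the newest
 * process on ties.  Returns NULL when nothing is eligible.
 */
const struct proc_model *
oom_pick_victim(const struct proc_model *procs, size_t n,
    enum oom_shortage shortage)
{
	const struct proc_model *victim = NULL;
	size_t bigsize = 0;

	assert(procs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		const struct proc_model *p = &procs[i];
		size_t size;

		if (p->exempt || !p->threads_ok)
			continue;
		size = oom_score(p, shortage);
		if (victim == NULL || size > bigsize) {
			victim = p;
			bigsize = size;
		}
	}
	return (victim);
}
```

In this model, a large process that uses no swap is never preferred under
OOM_SWAPZ as long as some other eligible process holds swap, which is the
behaviour both sides of the thread describe: it targets swap zone slack
rather than the largest overall consumer.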