From: Alan Somers
Date: Tue, 23 Feb 2021 16:29:46 -0700
Subject: Re: The out-of-swap killer makes poor choices
To: Konstantin Belousov
Cc: FreeBSD Hackers

On Tue, Feb 23, 2021 at 3:36 PM Konstantin Belousov wrote:

> On Tue, Feb 23, 2021 at 02:20:21PM -0700, Alan Somers wrote:
> > On Tue, Feb 23, 2021 at 2:11 PM Konstantin Belousov wrote:
> >
> > > On Tue, Feb 23, 2021 at 01:49:49PM -0700, Alan Somers wrote:
> > > > To me it's always seemed like the out-of-swap killer kills the wrong
> > > > process. Oh, it does the right thing with a trivial while(1) {malloc()}
> > > > test program, but not with real workloads. To summarize the logic in
> > > > vm_pageout_oom:
> > > >
> > > > * Don't kill system, protected, or killed processes
> > > > * Don't kill processes with a thread that isn't running or suspended
> > > > * Kill whichever process is using the most swap or swap + ram, depending on
> > > > the shortage variable. On ties, kill the newest one.
> > > >
> > > > This algorithm probably made sense in the days when computers had much more
> > > > swap than RAM. But now it leads to several problems:
> > > >
> > > > * It's almost guaranteed to do the wrong thing when shortage ==
> > > > VM_OOM_SWAPZ and there is little or no swap configured. If no swap is
> > > > configured, it will kill the newest running or suspended process. If a
> > > > little bit is configured, it will probably kill some idle process, like
> > > > zfsd, that is swapped out because it doesn't run very often.
> > > >
> > > > * Even if multiple GB of swap are configured, the OOM killer is still
> > > > biased towards killing idle processes when shortage == VM_OOM_SWAPZ. Most
> > > > often, the process responsible for an out-of-memory condition is not idle,
> > > > and is consuming large amounts of RAM.
> > > >
> > > > * It ignores RLIMIT_RSS. We consider that rlimit when deciding whether to
> > > > move a process from RAM to swap.
> > > >
> > > > * The "out of swap space" kernel message doesn't specify whether the
> > > > process was killed because of insufficient swap or RAM (the shortage
> > > > variable)
> > > >
> > > > I propose the following changes:
> > > >
> > > > * Incorporate shortage into the "out of swap space" message.
> > > ok with me, not sure if users could make any action based on discretion
> > >
> > > > * When walking the process list, if any process exceeds its RLIMIT_RSS,
> > > > choose it immediately, without bothering to compare it to older processes.
> > > RSS was never supposed to be a limit on how many pages are resident.
> > > It only provided some preference for more aggressive paging out process'
> > > pages.
> > >
> > > Or put it differently, RSS is not supposed to be the working set size
> > > in VMS/NT sense.
> > >
> >
> > Sure, but given that we must kill _something_, preferentially killing a
> > process that was specifically limited sounds better than killing a process
> > that wasn't, won't you agree?
> Semantic of RLIMIT_RSS is not to limit, but to give preference for pageout.
> Changing it to the semantic of 'preference for OOM' would give the similar
> complaint.
>
> >
> > >
> > > > * Always consider the sum of a process's RAM + swap, regardless of the
> > > > shortage variable.
> > > >
> > > > Does this make sense? Am I missing something about shortage ==
> > > > VM_OOM_SWAPZ? I don't understand why you would ever want to exclude
> > > > processes' RAM usage.
> > > > That logic was added in revision
> > > > 2025d69ba7a68a5af173007a8072c45ad797ea23, but I don't understand the
> > > > rationale.
> > >
> > > SWAPZ means that swap zone is exhausted. In this case, killing a process
> > > that does not use swap, would not free any space in the zone. Similarly,
> > > we should select a process with largest swap (== metadata kept in swap zone)
> > > use to free something in swap zone.
> > >
> >
> > But killing a process that does not use swap could reduce the need for more
> > swap by other processes. How many cases are there where a process needs
> > more SWAP and won't settle for RAM instead?
> Both choices are somewhat random. The goal is to get more swap zone slack,
> and this is what the code tried to target.
>
> In fact, if OOM kills largest RAM+swap consumer, then with the small swap
> there is huge chance that swap is not freed, and then on the next nearby
> pageout attempt some more process would be killed, perhaps innocently.
>
> OOM purpose is not to smoother operation of over-committed system, but
> to have it survive (avoid low resources deadlock) to the state where it
> can be examined and possibly corrected.
>
> >
> > >
> > > In other words, such kill could be not enough and really require more and
> > > more rounds of OOM, esp. on machine with very small swap configured.
> >

Ok, I'll abandon this idea.
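
For readers of the archive, the selection rules summarized at the top of the
thread can be sketched as a small userland model. This is only an illustration
of the summary as written here, not the code in vm_pageout_oom: struct
model_proc, enum model_shortage, model_pick_victim() and the sample table are
all hypothetical names, and the tie-breaking follows the description in the
thread rather than the kernel source.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two shortage reasons discussed above. */
enum model_shortage { MODEL_OOM_MEM, MODEL_OOM_SWAPZ };

/* Hypothetical, simplified view of a process; this is not struct proc. */
struct model_proc {
    const char *name;
    bool exempt;             /* system, protected, or already killed */
    bool threads_eligible;   /* every thread is running or suspended */
    unsigned long ram_pages;
    unsigned long swap_pages;
    unsigned long start_seq; /* larger means started more recently */
};

/* Return the index of the chosen victim, or -1 if nothing is eligible. */
static int
model_pick_victim(const struct model_proc *p, size_t n,
    enum model_shortage why)
{
    int best = -1;
    unsigned long best_size = 0;

    for (size_t i = 0; i < n; i++) {
        /* Rules 1 and 2: skip exempt processes and ineligible threads. */
        if (p[i].exempt || !p[i].threads_eligible)
            continue;

        /* Rule 3: count swap only for SWAPZ, swap + RAM otherwise. */
        unsigned long size = p[i].swap_pages;
        if (why == MODEL_OOM_MEM)
            size += p[i].ram_pages;

        /* Largest consumer wins; on ties, prefer the newest process. */
        if (best == -1 || size > best_size ||
            (size == best_size && p[i].start_seq > p[best].start_seq)) {
            best = (int)i;
            best_size = size;
        }
    }
    return (best);
}

/* Made-up sample: an exempt process, an idle swapped-out daemon, a RAM hog. */
static const struct model_proc model_demo[] = {
    { "init",  true,  true, 200,    0,   1 },
    { "idled", false, true, 10,     500, 2 },
    { "hog",   false, true, 900000, 0,   3 },
};
```

With this made-up table, model_pick_victim(model_demo, 3, MODEL_OOM_SWAPZ)
returns 1 (the idle daemon) and model_pick_victim(model_demo, 3, MODEL_OOM_MEM)
returns 2 (the RAM hog). That shows both sides of the exchange: the swap-only
count picks an idle process, but killing the hog would free nothing in the swap
zone, since in this model it holds no swap.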