Date: Sun, 4 Apr 2021 22:23:53 +0300
From: Konstantin Belousov <kostikbel@gmail.com>
To: Poul-Henning Kamp
Cc: Warner Losh, Mateusz Guzik, FreeBSD CURRENT
Subject: Re: [SOLVED] Re: Strange behavior after running under high load

On Sun, Apr 04, 2021 at 07:01:44PM +0000, Poul-Henning Kamp wrote:
> --------
> Konstantin Belousov writes:
>
> > But what would you provide as the input for the PID controller, and what would be the targets?
>
> Viewing this purely as a vnode-related issue is wrong; this is about memory allocation in general.
>
> We may or may not want a PID regulator, but putting it on counts of vnodes would not improve things, precisely, as you point out, because the amount of memory a vnode ties up has enormous variance.

Yes.

> We should focus on the end goal: to ensure "sufficient" memory can always be allocated for any purpose "without major delay".

And no.

> Architecturally there are three major problems:
>
> A) While each subsystem generally has a good idea about memory that can be released "without major delay", the information does not trickle up through a summarizing NUMA-aware tree.
>
> B) We lack a nuanced call-back to tell the subsystems to release some of their memory "without major delay".

The delay in the wall-clock sense is not what drives the issue. We cannot expect any I/O to proceed while we are low on memory, in the sense that the allocators cannot respond right now. More and more, our I/O subsystem requires allocating memory to make any progress with I/O. This is already quite bad with GEOM, although some hacks keep it from standing out too much. It is very bad with ZFS, where swap on zvols causes deadlocks almost immediately.

> C) We have never attempted to enlist userland, where jemalloc often hangs on to a lot of unused VM pages.

Userland does not add to this problem, because the pagedaemon typically has enough processing power to convert user-allocated pages into usable clean or free pages. Of course, if there is no swap and dirty anonymous pages cannot be laundered, the problem accumulates, but normally the operating system does not have an issue with user pages.

> As far as vnodes go:
>
> It used to be that "without major delay" meant "without disk I/O", which in turn led to the "dirty buffers/VM pages" heuristic.
>
> With microsecond SSD backing store, that heuristic is not only invalid, it is downright harmful in many cases.
>
> GEOM maintains estimates of per-provider latency, and VM+VFS should use that to schedule write-back so that more of it happens outside rush hour, in order to increase the amount of memory which can be released "without major delay".
>
> Today that happens largely as a side effect of the periodic syncer, which does a really bad job at it, because it still expects VAX-era hardware performance and workloads.

I/O latency is not the factor there. We must avoid situations where instantiating a vnode stalls waiting for KVA to appear; similarly, we must avoid a system state where vnode allocation has consumed so much kmem that other allocations stall. It is quite telling that we do not shrink the vnode list on low-memory events, and vnlru does not account for memory pressure either. The problem is that it is not clear how to express the relation between a safe allocator state and our desire to cache file system data, which is bound to vnode identity.
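
For concreteness, a minimal sketch of what such a hook could look like, built on the existing vm_lowmem eventhandler. The vnlru_request_trim() helper and the trim target are hypothetical placeholders for whatever mechanism vnlru would actually grow; the EVENTHANDLER/SYSINIT machinery and the vm_lowmem event exist today, and the VM_LOW_KMEM flag is assumed to be the one declared in sys/eventhandler.h. This is illustration, not a patch proposal.

/*
 * Sketch only: let the vnode layer react to kmem/KVA pressure via the
 * existing vm_lowmem eventhandler.  vnlru_request_trim() is a
 * hypothetical helper standing in for a real vnlru interface that
 * would recycle clean, unreferenced vnodes without initiating I/O.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/eventhandler.h>
#include <sys/kernel.h>

static void
vfs_lowmem_trim(void *arg __unused, int flags)
{
        /*
         * React only to kmem/KVA shortage; a plain page shortage is
         * already handled by the page daemon and laundering.
         */
        if ((flags & VM_LOW_KMEM) == 0)
                return;

        /*
         * Hypothetical call: ask vnlru to recycle a bounded number of
         * clean vnodes "without major delay", i.e. without doing I/O.
         */
        /* vnlru_request_trim(desiredvnodes / 20); */
}

static void
vfs_lowmem_reg(void *arg __unused)
{
        EVENTHANDLER_REGISTER(vm_lowmem, vfs_lowmem_trim, NULL,
            EVENTHANDLER_PRI_FIRST);
}
SYSINIT(vfs_lowmem_reg, SI_SUB_VFS, SI_ORDER_ANY, vfs_lowmem_reg, NULL);

The hard part is not the callback itself but choosing the trim target: that is exactly the relation between allocator state and the vnode cache that is not yet clear how to express.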