Date: Sun, 4 Apr 2021 22:23:53 +0300
From: Konstantin Belousov <kostikbel@gmail.com>
To: Poul-Henning Kamp
Cc: Warner Losh, Mateusz Guzik, FreeBSD CURRENT
Subject: Re: [SOLVED] Re: Strange behavior after running under high load

On Sun, Apr 04, 2021 at 07:01:44PM +0000, Poul-Henning Kamp wrote:
> --------
> Konstantin Belousov writes:
>
> > But what would you provide as the input for the PID controller, and what would be the targets?
>
> Viewing this purely as a vnode-related issue is wrong; this is about memory allocation in general.
>
> We may or may not want a PID regulator, but putting it on counts of vnodes would not improve things, precisely, as you point out, because the amount of memory a vnode ties up has enormous variance.

Yes.

> We should focus on the end goal: to ensure "sufficient" memory can always be allocated for any purpose "without major delay".

And no.

> Architecturally there are three major problems:
>
> A) While each subsystem generally has a good idea about memory that can be released "without major delay", the information does not trickle up through a summarizing NUMA-aware tree.
>
> B) We lack a nuanced call-back to tell the subsystems to release some of their memory "without major delay".

The delay in the wall-clock sense is not what drives the issue. We cannot expect any I/O to proceed while we are low on memory, in the sense that the allocators cannot respond right now. More and more, our I/O subsystem requires allocating memory to make any progress with I/O. This is already quite bad with GEOM, although some hacks keep it from standing out too much. It is very bad with ZFS, where swap on zvols causes deadlocks almost immediately.

> C) We have never attempted to enlist userland, where jemalloc often hangs on to a lot of unused VM pages.

Userland does not add to this problem, because the pagedaemon typically has enough processing power to convert user-allocated pages into usable clean or free pages. Of course, if there is no swap and dirty anonymous pages cannot be laundered, the problem accumulates, but normally the operating system does not have an issue with user pages.

> As far as vnodes go:
>
> It used to be that "without major delay" meant "without disk I/O", which in turn led to the "dirty buffers/VM pages" heuristic.
>
> With microsecond SSD backing store, that heuristic is not only invalid, it is downright harmful in many cases.
>
> GEOM maintains estimates of per-provider latency, and VM+VFS should use that to schedule write-back so that more of it happens outside rush hour, in order to increase the amount of memory which can be released "without major delay".
>
> Today that happens largely as a side effect of the periodic syncer, which does a really bad job at it, because it still expects VAX-era hardware performance and workloads.

I/O latency is not the factor there. We must avoid situations where instantiating a vnode stalls waiting for KVA to appear; similarly, we must avoid a system state where vnode allocation has consumed so much kmem that other allocations stall. It is quite telling that we do not shrink the vnode list on low-memory events, and vnlru does not account for memory pressure either. The problem is that it is not clear how to express the relation between a safe allocator state and our desire to cache file system data, which is bound to vnode identity.
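
For concreteness, a minimal sketch of what such a hook could look like, built on the existing vm_lowmem eventhandler. The vnlru_request_trim() helper and the trim target are hypothetical placeholders for whatever mechanism vnlru would actually grow; the EVENTHANDLER/SYSINIT machinery and the vm_lowmem event exist today, and the VM_LOW_KMEM flag is assumed to be the one declared in sys/eventhandler.h. This is illustration, not a patch proposal.

/*
 * Sketch only: let the vnode layer react to kmem/KVA pressure via the
 * existing vm_lowmem eventhandler.  vnlru_request_trim() is a
 * hypothetical helper standing in for a real vnlru interface that
 * would recycle clean, unreferenced vnodes without initiating I/O.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/eventhandler.h>
#include <sys/kernel.h>

static void
vfs_lowmem_trim(void *arg __unused, int flags)
{
        /*
         * React only to kmem/KVA shortage; a plain page shortage is
         * already handled by the page daemon and laundering.
         */
        if ((flags & VM_LOW_KMEM) == 0)
                return;

        /*
         * Hypothetical call: ask vnlru to recycle a bounded number of
         * clean vnodes "without major delay", i.e. without doing I/O.
         */
        /* vnlru_request_trim(desiredvnodes / 20); */
}

static void
vfs_lowmem_reg(void *arg __unused)
{
        EVENTHANDLER_REGISTER(vm_lowmem, vfs_lowmem_trim, NULL,
            EVENTHANDLER_PRI_FIRST);
}
SYSINIT(vfs_lowmem_reg, SI_SUB_VFS, SI_ORDER_ANY, vfs_lowmem_reg, NULL);

The hard part is not the callback itself but choosing the trim target: that is exactly the relation between allocator state and the vnode cache that is not yet clear how to express.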