Date: Fri, 13 Nov 2015 20:32:41 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Warner Losh
Cc: Michael Tuexen, freebsd-arm
Subject: Re: Memory management issue on RPi?
Message-ID: <20151113183241.GA2257@kib.kiev.ua>

On Fri, Nov 13, 2015 at 08:12:04AM -0700, Warner Losh wrote:
> On Fri, Nov 13, 2015 at 1:23 AM, Michael Tuexen wrote:
> >
> > > On 12 Nov 2015, at 21:03, Konstantin Belousov wrote:
> > >
> > > On Thu, Nov 12, 2015 at 08:47:29PM +0100, Michael Tuexen wrote:
> > >>> On 12 Nov 2015, at 19:09, Konstantin Belousov wrote:
> > >>>
> > >>> On Thu, Nov 12, 2015 at 06:57:03PM +0100, Michael Tuexen wrote:
> > >>>>> On 12 Nov 2015, at 18:12, Konstantin Belousov wrote:
> > >>>>>
> > >>>>> On Thu, Nov 12, 2015 at 05:25:37PM +0100, Michael Tuexen wrote:
> > >>>>>>> On 12 Nov 2015, at 13:18, Konstantin Belousov wrote:
> > >>>>>>> This is a known problem with the swap-less OOM. The following patch
> > >>>>>>> should give you immediate relief. You might want to tweak the
> > >>>>>>> sysctl vm.pageout_oom_seq if the default value is not right; it was
> > >>>>>>> selected by a 'try and see' approach on a very small (32 or 64MB)
> > >>>>>>> i386 VM.
> > >>>>>> It just works... Will do some more testing...
> > >>>>>
> > >>>>> I am more interested in a report of whether OOM was triggered when
> > >>>>> it should be.
> > >>>> How do I know? What output do you want to see?
> > >>>>
> > >>>> Best regards
> > >>>> Michael
> > >>>>>
> > >>>>> Try running several instances of 'sort /dev/zero'.
> > >>> ^^^^^^^^^^^^^ I already answered this.
> > >>> Run sort /dev/zero, and see whether OOM fires.
> > >> OK, now I understand. You want to see if some processes are getting
> > >> killed. (I was thinking that you might want to see some sysctl
> > >> counters or so.)
> > >>
> > >> Results:
> > >> * I'm able to compile/link/install a kernel from source. This was not
> > >>   possible before.
> > >> * When running three instances of sort /dev/zero, two of them get
> > >>   killed after a while (less than a minute). One continued to run,
> > >>   but was also killed eventually. All via ssh login.
> > > Exactly, this is the experiment I wanted to see, and even better, the
> > > results are good.
> > Any plans to commit it?
>
> These changes are good as an experiment. On the RPi, the CPU is fast
> relative to the extremely slow SD card that pages are laundered to,
> so deferring the calls to the actual OOM a bit is useful. However, a
> simple count won't self-scale. What's good for the RPi is likely poor
> for a CPU connected to faster storage: there, the OOM won't kill
> things quickly enough. I imagine that there may be a more complicated
> relationship between the rate of page dirtying and laundering.

The biggest problematic case fixed by this approach is the *swap-less*
setup, where the speed of the slow storage does not matter at all for
the speed of the pagedaemon, since there is no swap.

> I'd hope that there'd be some kind of scaling that would take this
> variation into account.
>
> At Netflix, we're developing some patches to do more pro-active
> laundering of pages rather than waiting for the page daemon to kick
> in. We do this primarily to avoid flushing the UMA caches, which has
> performance implications that we need to address to smooth out the
> performance. Perhaps something like this would be a more general way
> to cope with this situation?

Page laundering speed cannot be a factor in deciding to trigger OOM.
If you can still clean something up, then OOM must not be fired. The
patch does not trigger OOM immediately when no progress is made,
because it expects that some delay might indeed allow the async I/O to
finish and provide some pages to cover the deficit. Only when progress
stalls completely does the ticking toward OOM start, and several
iterations are performed before a deadlock is claimed. There is no
good heuristic I could formulate for a suitable iteration count, but
the current value was tested on both small (32-64MB) and large (32GB)
machines and found satisfactory. Even then, it is run-time tunable, so
the operator can set a better-suited value.
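To make the back-off concrete, here is a minimal userland sketch of
the counting scheme in C. It is not the actual pagedaemon code: every
identifier in it is invented for illustration, and the threshold
default of 12 is only an example standing in for the
vm.pageout_oom_seq sysctl mentioned above.

#include <stdio.h>

static int pageout_oom_seq = 12;  /* stand-in for vm.pageout_oom_seq */
static int oom_stall_count;       /* consecutive passes without progress */

static void
declare_oom(void)
{
	/* In the kernel this would pick and kill the largest process. */
	printf("OOM declared after %d stalled passes\n", pageout_oom_seq);
}

/* Hypothetical hook called at the end of each pageout pass. */
static void
pageout_pass_done(int pages_reclaimed)
{
	if (pages_reclaimed > 0) {
		/* Any progress at all: OOM must not fire, reset the count. */
		oom_stall_count = 0;
		return;
	}
	/* Complete stall: tick, and claim a deadlock only after
	   pageout_oom_seq consecutive stalled passes. */
	if (++oom_stall_count >= pageout_oom_seq) {
		oom_stall_count = 0;
		declare_oom();
	}
}

int
main(void)
{
	/* Simulate five productive passes, then a sustained stall. */
	for (int i = 0; i < 30; i++)
		pageout_pass_done(i < 5 ? 100 : 0);
	return (0);
}

The property that matters is that any reclaim progress resets the
counter, so slow storage by itself never causes a kill; only a
complete and sustained stall does.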
OOM means that user data is lost. Netflix might not care, due to the
specifics of its load, but I and most other users do care about our
data. I always prefer the kernel flushing the caches (not only UMA
caches, but also pv entries, UFS dirhashes, GPU unpinned buffers,
etc.) over deadlocking or killing the browser where I have filled in
a long form, or a text editor, or any other unrecoverable state. If
OOM is not fatal for your data, you can reduce the value of the
tunable to prefer kernel caches over user data.
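For completeness: changing the tunable at run time is an ordinary
sysctl write, e.g. 'sysctl vm.pageout_oom_seq=5' as root. The same
operation from C, as a small sketch using sysctlbyname(3); the value
5 is an arbitrary example, and lower values make the OOM kill fire
sooner:

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

int
main(void)
{
	int oldval, newval = 5;		/* 5 is only an example value */
	size_t oldlen = sizeof(oldval);

	/* Read the old value and install the new one; the write half
	   requires root privileges. */
	if (sysctlbyname("vm.pageout_oom_seq", &oldval, &oldlen,
	    &newval, sizeof(newval)) == -1) {
		perror("sysctlbyname");
		return (1);
	}
	printf("vm.pageout_oom_seq: %d -> %d\n", oldval, newval);
	return (0);
}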
And, to make it clear, the current code which triggers OOM does not
make much sense. It mostly takes the count of free pages as the
indicator of an OOM condition, which fails to account for the simple
fact that queued pages may be laundered or discarded. As a result,
false OOM is triggered, and it is easier to get a false trigger on a
swap-less system because swap is always 'full'. This is orthogonal to
the issue of pagedaemon performance.
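Schematically, with invented and grossly simplified counters (this is
not the kernel's actual accounting), the difference between the two
trigger conditions looks like this:

#include <stdbool.h>
#include <stdio.h>

/* Illustrative page counters only; real VM accounting is richer. */
struct vm_counters {
	int free_pages;		/* immediately available */
	int free_min;		/* reserve threshold */
	int inactive_clean;	/* could simply be discarded */
	int inactive_dirty;	/* could be laundered, given somewhere to write */
	int swap_free;		/* zero on a swap-less system */
};

/* Free count alone: on a swap-less box this fires even when plenty
   of queued pages are still reclaimable, i.e. a false OOM. */
static bool
oom_naive(const struct vm_counters *c)
{
	return (c->free_pages < c->free_min);
}

/* The point made above: as long as something can still be laundered
   or discarded, OOM must not fire. */
static bool
oom_accounting_for_reclaim(const struct vm_counters *c)
{
	int reclaimable = c->inactive_clean +
	    (c->swap_free > 0 ? c->inactive_dirty : 0);

	return (c->free_pages + reclaimable < c->free_min);
}

int
main(void)
{
	/* Swap-less machine: low free count, many droppable clean pages. */
	struct vm_counters c = { 50, 100, 400, 300, 0 };

	printf("naive trigger: %d, reclaim-aware trigger: %d\n",
	    (int)oom_naive(&c), (int)oom_accounting_for_reclaim(&c));
	return (0);
}

On the swap-less example the naive check fires even though hundreds of
clean pages could simply be dropped, which is exactly the false
trigger described above.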