Date: Tue, 14 Aug 2018 22:26:39 -0600
From: Warner Losh <imp@bsdimp.com>
To: bob prohaska <fbsd@www.zefox.net>
Cc: Mark Millard <marklmi@yahoo.com>, freebsd-arm <freebsd-arm@freebsd.org>,
    Mark Johnston <markj@freebsd.org>
Subject: Re: RPI3 swap experiments (grace under pressure)
Message-ID: <CANCZdfoB_AcidFpKT_ZmZWUFnmC4Bw55krK+MqEmmj=f9KMQ2Q@mail.gmail.com>
In-Reply-To: <20180815013612.GB51051@www.zefox.net>
References: <20180812173248.GA81324@phouka1.phouka.net>
    <20180812224021.GA46372@www.zefox.net>
    <B81E53A9-459E-4489-883B-24175B87D049@yahoo.com>
    <20180813021226.GA46750@www.zefox.net>
    <0D8B9A29-DD95-4FA3-8F7D-4B85A3BB54D7@yahoo.com>
    <FC0798A1-C805-4096-9EB1-15E3F854F729@yahoo.com>
    <20180813185350.GA47132@www.zefox.net>
    <FA3B8541-73E0-4796-B2AB-D55CE40B9654@yahoo.com>
    <20180814014226.GA50013@www.zefox.net>
    <CANCZdfqFKY3Woa+9pVS5hika_JUAUCxAvLznSS4gaLq2kKoWtQ@mail.gmail.com>
    <20180815013612.GB51051@www.zefox.net>
On Tue, Aug 14, 2018 at 7:36 PM, bob prohaska <fbsd@www.zefox.net> wrote:

> On Tue, Aug 14, 2018 at 05:50:11PM -0600, Warner Losh wrote:
> [big snip]
> >
> > So, philosophically, I agree that the system shouldn't suck. Making it
> > robust against suckage for extreme events that don't match the historic
> > usage of BSD, though, is going to take some work.
>
> You've taught me a lot in the snippage above, but you skipped a key
> question:
>
> What do modern sysadmins in datacenter environments want their machines
> to do when overloaded? The overloads could be malign or benign; they
> might even be profitable. In the old days the rule seemed to be "slow
> down if you must, but don't stop". Page first, swap second, kill third.
>
> Has that changed? Perhaps the jobs aborted by OOMA can be restarted by
> another machine in the cloud? Then OOMA makes a great deal more sense.

No, that hasn't changed. That's still the order we do things in FreeBSD.
The question is always: when do you give up on each level?

When do you start to swap? Ideally never, but you should swap when your
current rate of page cleaning can't keep up with demand and there are
dirty pages you could reclaim by swapping them out.

When do you OOMA? Ideally never, which is why we try to avoid it. The
problem is that the heuristic we use to avoid it (12 tries) is well tuned
for systems that are well matched to their I/O system, where dirty pages
are created more slowly than the disk can swap them out and there's rarely
a backup in the disk system. The upper layers have no knowledge of how
heavily we're loading the disks (apart from a few kludges like
runningbuf), so the VM system has to guess how best to put load onto the
disks to get the most out of them.

All the tunables in the kernel for the VM system try to address these
balance points. We want to have enough free pages that we can hand them to
processes without making them sleep, and we want to keep that count above
a minimum that is basically the response time of the page daemon to new or
changing demand. The extra pages act as a shock absorber for changes in
load.

Beyond that, the PID controller in the page daemon does just enough work
to ensure we push out the pages needed to keep up with demand, yet not so
much that we do unnecessary work. It knows how many new pages will be
dirtied (estimated from recent events) and how many clean ones will show
up (also based on recent history), so it can guess fairly well how many
pages it must launder in the next interval to stay above the low-water
mark. The PID keeps oscillations down and lets the daemon respond to
trouble more quickly than a simple proportional "steer out the error" loop
would. Tuning a PID loop can be tricky in some applications, but from the
data I've seen so far, FreeBSD isn't one of the tricky ones: there's a
broad range of Kp, Kd and Ki values that give good results.

So I think we may be seeing several problems here. One is that the normal
write speed of thumb drives isn't that great, so the ability to push pages
out is diminished. (Think of it this way: if page-out cost nothing, you'd
be limited only by the architecture's VM limits; real disks take time, so
the practical limits are somewhat lower.) In the past, read and write
speeds have stayed within the same order of magnitude (more or less:
median might be 5ms and P99 40ms with a max near 60ms, for example, with
similar numbers for reads and writes), but with some flash that's no
longer true.
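(To make the laundering loop concrete, here is a minimal sketch in C of a
PID controller of the kind described above. It is illustrative only, not
the actual vm_pageout.c code; the structure, names, and gains are all
assumed.)

    /*
     * Illustrative PID controller for page laundering.  The setpoint is
     * the free-page low-water mark; the output is how many pages to
     * launder in the next interval.  Kp steers out the current error,
     * Ki removes steady-state droop, and Kd damps oscillation.
     */
    struct pid_state {
            double Kp, Ki, Kd;      /* controller gains (tunable) */
            double integral;        /* accumulated error */
            double prev_error;      /* error seen last interval */
    };

    static long
    launder_target(struct pid_state *p, long free_pages, long low_water)
    {
            double error = (double)(low_water - free_pages);
            double derivative = error - p->prev_error;

            p->integral += error;
            p->prev_error = error;

            double out = p->Kp * error + p->Ki * p->integral +
                p->Kd * derivative;

            /* A non-positive target means no laundering is needed. */
            return (out > 0.0 ? (long)out : 0);
    }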
So even when there are no bugs or trouble in the I/O stack, you have
things working against you.

Next, thumb drives have an 'erase size' that's more like 64k or 128k, not
4k, and the swapper's traditional behavior is to write somewhat less than
that at a time, which can make these drives perform even worse due to
rewriting. (The good ones have a true log device behind the scenes, so
this doesn't matter; the bad ones cheat on cost, don't have enough RAM for
the LUTs needed to do this, and so make tradeoffs, one of which can be
read-modify-write.)

Next, there are issues with something else in the system. Either the drive
stops responding (so we get timeouts), or the USB stack hiccups (so we get
timeouts), or something. This problem comes and goes and confounds efforts
to make the first problems better...

So I think what's well tuned for the gear in a server doing traditional
database and/or compute workloads may not be so well tuned for the RPi3
when you add NAND that can vary a lot in performance, with fast reads and
slow writes when the mix isn't that high. The system can be tuned to cope,
but isn't tuned that way out of the box.

tl;dr: these systems are different enough from the normal case that
additional tuning is needed, where the normal systems work great out of
the box. Plus some code tuneups may help the algorithms be more dynamic
than they are today.
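(To put rough numbers on the erase-size problem, with figures assumed for
illustration: a 4k swap write landing in a 128k erase block on a drive
with no log/FTL to hide it forces the drive to read the 128k block, erase
it, and write the whole 128k back. That's 128k / 4k = 32x write
amplification for 4k of useful data, which is why small swap I/O can run
so much slower than the drive's nominal write speed.)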
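(One concrete knob, sketched here with an assumed value to experiment with
rather than a recommendation: the "12 tries" heuristic above is exposed as
the vm.pageout_oom_seq sysctl, so raising it makes the page daemon retry
longer before the OOM killer fires, e.g.

    # default is 12; larger values tolerate slower swap devices
    sysctl vm.pageout_oom_seq=120

on a system whose swap device falls behind in bursts.)

Warner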