From: Warner Losh
Date: Tue, 14 Aug 2018 22:26:39 -0600
Subject: Re: RPI3 swap experiments (grace under pressure)
To: bob prohaska
Cc: Mark Millard, freebsd-arm, Mark Johnston
In-Reply-To: <20180815013612.GB51051@www.zefox.net>

On Tue, Aug 14, 2018 at 7:36 PM, bob prohaska wrote:

> On Tue, Aug 14, 2018 at 05:50:11PM -0600, Warner Losh wrote:
> [big snip]
> >
> > So, philosophically, I agree that the system shouldn't suck. Making it
> > robust against suckage for extreme events that don't match the historic
> > usage of BSD, though, is going to take some work.
>
> You've taught me a lot in the snippage above, but you skipped a key
> question:
>
> What do modern sysadmins in datacenter environments want their machines
> to do when overloaded? The overloads could be malign or benign, they
> might even be profitable. In the old days the rule seemed to be "slow
> down if you must, but don't stop". Page first, swap second, kill third.
>
> Has that changed? Perhaps the jobs aborted by OOMA can be restarted by
> another machine in the cloud? Then OOMA makes a great deal more sense.

No. That's not changed. That's the order we do things in FreeBSD still.
The question is always when do you give up on each level? When do you
start to swap? Ideally never, but you should swap when your current rate
of page cleaning can't keep up with the demand, and there are dirty pages
that you could use if you swapped them out.

When do you OOMA? Ideally, never, which is why we try to avoid it. The
problem is that the heuristic we use to avoid it (12 tries) is well tuned
for systems that are well matched to the I/O system, where dirty pages are
created more slowly than the disk can swap them out and there's rarely a
backup in the disk system. There's no knowledge in the upper layers of how
much we're loading the disks (apart from a few kludges like runningbuf),
so the VM system has to guess how it can best put load onto the disks to
get the most out of them.

All the tunables in the kernel for the VM system try to address these
balance points. We want to have enough free pages that we can hand them to
processes without making them sleep. We want to keep the free count above
a minimum, which is basically set by the response time of the page daemon
to new or changing demand. The extra pages act as a shock absorber for
changes in load. Otherwise, the PID that's in the page daemon does just
enough work to ensure that we push out the pages we need to in order to
keep up with demand, yet not so much that we do too much work.
It knows how many new pages will be dirtied (estimated based on recent
events) and how many clean ones will show up (also based on recent
history), so it can guess fairly well that, to keep above the low water
mark, it needs to launder so many pages in the next interval. The PID
keeps the oscillations down, and allows it to respond more quickly to
trouble than a simple 'steer out the error (P)' loop. However, tuning the
PID loop can be tricky in some applications. From the data I've seen so
far, FreeBSD isn't one of the tricky applications: there's a broad range
of Kp, Kd and Ki values that give good results.

So I think we may be seeing several problems here.

One is that the normal write speed of thumb drives isn't that great, so
the ability to push pages out is diminished (think of it this way: if
page-out had zero cost, you would be limited only by the architecture's
VM limits; real disks take time, so the practical limits are somewhat
less than that). In the past, read and write speeds have stayed within
the same order of magnitude (more or less: median may be 5ms and P99 may
be 40ms with max somewhere near 60ms, for example, and the numbers are
similar for read and write), but with some flash that's no longer true.
So even when there are no bugs / trouble in the I/O stack, you have
things stacked against you.

Next, you have the problem that thumb drives have an 'erase size' that's
more like 64k or 128k or so, not 4k, so the traditional behavior of the
swapper is going to do writes somewhat smaller than that, which can make
these drives perform even worse due to rewriting (the good ones have a
true log device behind the scenes, so this doesn't matter; the bad ones
cheat on cost and don't have enough RAM for the LUTs needed to do this,
so they make tradeoffs, one of which can be read-modify-write, RMW).

Next, there are issues with something in the system. Either the drive
stops responding (so we get timeouts) or the USB stack hiccups (so we get
timeouts) or something.
This problem comes and goes and confounds efforts to make the first
problems better...

So I think what's well tuned for the gear that's in a server doing
traditional database and/or compute workloads may not be so well tuned
for the RPi3 when you attach NAND that can vary a lot in performance, as
well as have fast reads and slow writes when the mix isn't that high. The
system can be tuned to cope, but isn't tuned that way out of the box.

tl;dr: these systems are different enough from the normal system that
additional tuning is needed where the normal systems work great out of
the box. Plus some code tuneups may help the algorithms be more dynamic
than they are today.

Warner
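
For readers who have not met a PID loop before, the idea in the message above (pick how many pages to launder each interval so the free page count stays near a target) can be sketched as a toy model. This is a minimal sketch under stated assumptions, not FreeBSD's page daemon code: the names `PidController` and `simulate`, the gain values, and the one-number "device write cap" are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PidController:
    """Textbook discrete PID controller (illustrative, not kernel code)."""
    setpoint: float          # target number of free pages (assumed value)
    kp: float                # proportional gain
    ki: float                # integral gain
    kd: float                # derivative gain
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, measured: float) -> float:
        """Return how many pages to launder in the next interval."""
        error = self.setpoint - measured      # positive when short of free pages
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        output = (self.kp * error
                  + self.ki * self.integral
                  + self.kd * derivative)
        return max(0.0, output)               # cannot launder a negative amount


def simulate(ctrl, free_pages, demand_per_tick, max_launder_per_tick, ticks):
    """Toy model: each tick, processes take pages and the daemon frees some.

    max_launder_per_tick stands in for the swap device's write speed.
    Returns the free page count after every tick.
    """
    history = []
    for _ in range(ticks):
        wanted = ctrl.step(free_pages)
        laundered = min(wanted, max_launder_per_tick)   # device write cap
        free_pages = max(0.0, free_pages - demand_per_tick + laundered)
        history.append(free_pages)
    return history


if __name__ == "__main__":
    # Same controller and demand; only the device write cap differs.
    fast = simulate(PidController(1000, 0.5, 0.1, 0.1), 1000.0, 100.0, 500.0, 200)
    slow = simulate(PidController(1000, 0.5, 0.1, 0.1), 1000.0, 100.0, 50.0, 200)
    print("fast device, final free pages:", round(fast[-1]))
    print("slow device, final free pages:", round(slow[-1]))
```

In this toy, the controller holds the free page count at the target as long as the device can absorb the requested writes. Once the write cap drops below the rate at which pages are consumed, no choice of gains helps and the free count drains to zero, which is the "slow writes" situation described above in miniature.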