From: Warner Losh
Date: Sat, 3 Jun 2017 23:49:01 -0600
Subject: Re: Time to increase MAXPHYS?
To: Allan Jude
Cc: FreeBSD Current

On Sat, Jun 3, 2017 at 11:28 PM, Warner Losh wrote:

> On Sat, Jun 3, 2017 at 9:55 PM, Allan Jude wrote:
>
>> On 2017-06-03 22:35, Julian Elischer wrote:
>>> On 4/6/17 4:59 am, Colin Percival wrote:
>>>> On January 24, 1998, in what was later renumbered to SVN r32724,
>>>> dyson@ wrote:
>>>>> Add better support for larger I/O clusters, including larger
>>>>> physical I/O. The support is not mature yet, and some of the
>>>>> underlying implementation needs help. However, support does exist
>>>>> for IDE devices now.
>>>> and increased MAXPHYS from 64 kB to 128 kB. Is it time to increase
>>>> it again, or do we need to wait at least two decades between changes?
>>>>
>>>> This is hurting performance on some systems; in particular, EC2 "io1"
>>>> disks are optimized for 256 kB I/Os, EC2 "st1" (throughput-optimized
>>>> spinning rust) disks are optimized for 1 MB I/Os, and Amazon's NFS
>>>> service (EFS) recommends using a maximum I/O size of 1 MB (and
>>>> despite NFS not being *physical* I/O, it still seems to be limited
>>>> by MAXPHYS).
>>>>
>>> We increased it in FreeBSD 8 and 10.3 on our systems, with only good
>>> results:
>>>
>>> sys/sys/param.h:#define MAXPHYS (1024 * 1024) /* max raw I/O
>>> transfer size */
>>
>> At some point Warner and I discussed how hard it might be to make this
>> a boot-time tunable, so that big amd64 machines can have a larger value
>> without causing problems for smaller machines.
>>
>> ZFS supports a block size of 1 MB, and doing I/Os in 128 kB negates
>> some of the benefit.
>>
>> I am preparing some benchmarks and other data, along with a patch to
>> increase the maximum size of pipe I/Os as well, because using 1 MB
>> offers a relatively large performance gain there too.
>
> It doesn't look to be hard to change this, though struct buf depends on
> MAXPHYS:
>
>     struct vm_page *b_pages[btoc(MAXPHYS)];
>
> and b_pages isn't the last member, so changing MAXPHYS at boot time
> would cause an ABI change. IMHO, we should move it to the last element
> so that wouldn't happen. IIRC all buf allocations come from a fixed
> pool. We'd have to audit any place that creates one on the stack
> expecting it to be persisted; given how things work, I don't think
> that's possible, so we may be safe. Thankfully, struct bio doesn't seem
> to be affected.
>
> As for making it boot-time configurable, it shouldn't be too horrible
> with the above change. We should have enough of the tunables mechanism
> up early enough to pull this in before we create the buf pool.
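To make that concrete, here's a stand-alone sketch of the shape of the
change. It is illustrative only: the names are made up, this is not the
real sys/buf.h, and a userland getenv() stands in for the kernel's
early-boot tunable fetch.

/*
 * Illustrative sketch only -- not the real struct buf.  The point is that
 * if the page array is the *last* member, its length can follow a
 * boot-time "maxphys" value without moving the offsets of any other
 * member, so code that never touches b_pages sees the same layout.
 */
#include <stdio.h>
#include <stdlib.h>

struct page;                            /* stand-in for struct vm_page */

struct xbuf {
	long	b_bcount;		/* fixed-offset members first */
	long	b_resid;
	int	b_npages;
	/* ... everything else ... */
	struct page *b_pages[];		/* flexible array member, sized at
					   allocation time instead of
					   btoc(MAXPHYS) */
};

#define	PAGE_SZ		4096		/* assumed page size for the sketch */
#define	DFLT_MAXPHYS	(128 * 1024)	/* today's default */

int
main(void)
{
	/* In the kernel this would be a tunable fetch early in boot; here
	   an environment variable stands in for the loader tunable. */
	const char *env = getenv("MAXPHYS");
	size_t maxphys = (env != NULL) ? strtoul(env, NULL, 0) : DFLT_MAXPHYS;
	size_t npages = (maxphys + PAGE_SZ - 1) / PAGE_SZ;
	size_t bufsz = sizeof(struct xbuf) + npages * sizeof(struct page *);

	/* The fixed buf pool would be carved up using this per-buf size. */
	struct xbuf *bp = malloc(bufsz);
	if (bp == NULL)
		return (1);
	bp->b_npages = (int)npages;
	printf("maxphys %zu -> %zu page slots, %zu bytes per buf\n",
	    maxphys, npages, bufsz);
	free(bp);
	return (0);
}

Whether the pool is sized this way or b_pages simply stays fixed at the new
maximum, the audit of stack-allocated bufs mentioned above still applies;
the sketch only shows that the other members keep their offsets.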
> Netflix runs a MAXPHYS of 8 MB. There are issues with something this
> big, to be sure, especially on memory-limited systems. Lots of hardware
> can't do an I/O this big, and some drivers can't cope even if the
> underlying hardware can. Since we don't use such drivers at work, I
> don't have a list handy (though I think the SG list for NVMe limits it
> to 1 MB). 128k is a totally reasonable bump by default, but I think
> going larger by default should be approached with some caution given the
> overhead that adds to struct buf. Having it be a run-time tunable would
> be great.

Of course 128k is reasonable, it's the current default :). I meant to say
that doubling it would have a limited impact. 1 MB might be a good default,
but it might be too big for smaller systems (nothing says it has to be an
MI constant, though). It would be a perfectly fine default if it were a
tunable.

> There's a number of places in userland that depend on MAXPHYS, which is
> unfortunate since they assume a fixed value and don't pick it up from
> the kernel or the kernel config. Thankfully, there are only a limited
> number of these.

There are a number of other places that assume MAXPHYS is constant. The
ahci driver uses it to define the maximum number of SG operations you can
have, for example. aio has an array sized based off of it. There are some
places that use it when they should use 128k instead. Several places use it
to define other constants, and it would take a while to run them all to
ground to make sure they are all good. We might need to bump DFLTPHYS as
well, so it might also make a good tunable. A few places check things
against a fixed multiple of MAXPHYS; those are rules of thumb that kinda
work today, maybe by accident, or maybe the 100 * MAXPHYS is highly
scientific. It's hard to say without careful study. For example, until
recently, nvmecontrol would use MAXPHYS, but that's the system-default
MAXPHYS, not necessarily what the running kernel was built with. And even
if it were, there's currently a hard 1 MB limit per I/O imposed by how the
driver uses NVMe's SG lists. It doesn't show up as MAXPHYS, though, but
rather as NVME_MAX_XFER_SIZE in places. It totally surprised me when I hit
this problem at runtime and tracked it to ground.

> Of course, there are times when I/Os can return much more than this.
> Reading drive log pages, for example, can generate tens or hundreds of
> MB of data, and there's no way to do that in one transaction today. If
> drive makers were perfect, we could use the generally defined offset and
> length fields to read them out piecemeal. That assumes the log is
> stable, a big if for some of the snapshots of internal-state logs that
> are sometimes necessary to investigate problems... It sure would be nice
> if there were a way to have super-huge I/O on an exception basis for
> these situations.

The hardest part about doing this is chasing down all the references, since
it winds up in the craziest of places.

Warner
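P.S. On the userland point: if the kernel ever exported the value (say via
a kern.maxphys sysctl, a hypothetical name here since nothing provides it
today), tools could pick it up at run time instead of baking in the
compile-time constant. A minimal sketch of that, with a conservative
fallback:

/*
 * Hypothetical example: ask the running kernel for its maximum physical
 * I/O size instead of compiling MAXPHYS into the binary.  The
 * "kern.maxphys" OID is an assumption; when the lookup fails we fall
 * back to the long-standing 128k default.
 */
#include <sys/types.h>
#include <sys/sysctl.h>

#include <stdio.h>

static size_t
get_maxphys(void)
{
	unsigned long val;
	size_t len = sizeof(val);

	if (sysctlbyname("kern.maxphys", &val, &len, NULL, 0) == 0)
		return ((size_t)val);
	return (128 * 1024);		/* conservative fallback */
}

int
main(void)
{
	printf("max physical I/O size: %zu bytes\n", get_maxphys());
	return (0);
}

A tool like nvmecontrol would still be bound by the driver's own transfer
limit (the NVME_MAX_XFER_SIZE mentioned above), but at least the MAXPHYS
half would track the kernel it's actually running on.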