From: Warner Losh
Date: Sat, 3 Jun 2017 23:28:23 -0600
Subject: Re: Time to increase MAXPHYS?
To: Allan Jude
Cc: FreeBSD Current

On Sat, Jun 3, 2017 at 9:55 PM, Allan Jude wrote:

> On 2017-06-03 22:35, Julian Elischer wrote:
> > On 4/6/17 4:59 am, Colin Percival wrote:
> >> On January 24, 1998, in what was later renumbered to SVN r32724, dyson@
> >> wrote:
> >>> Add better support for larger I/O clusters, including larger physical
> >>> I/O. The support is not mature yet, and some of the underlying
> >>> implementation needs help. However, support does exist for IDE
> >>> devices now.
> >> and increased MAXPHYS from 64 kB to 128 kB. Is it time to increase it
> >> again, or do we need to wait at least two decades between changes?
> >>
> >> This is hurting performance on some systems; in particular, EC2 "io1"
> >> disks are optimized for 256 kB I/Os, EC2 "st1" (throughput optimized
> >> spinning rust) disks are optimized for 1 MB I/Os, and Amazon's NFS
> >> service (EFS) recommends using a maximum I/O size of 1 MB (and despite
> >> NFS not being *physical* I/O it seems to still be limited by MAXPHYS).
> >>
> > We increase it in freebsd 8 and 10.3 on our systems, Only good results.
> >
> > sys/sys/param.h:#define MAXPHYS (1024 * 1024) /* max raw I/O transfer size */
>
> At some point Warner and I discussed how hard it might be to make this a
> boot time tunable, so that big amd64 machines can have a larger value
> without causing problems for smaller machines.
>
> ZFS supports a block size of 1mb, and doing I/Os in 128kb negates some
> of the benefit.
>
> I am preparing some benchmarks and other data along with a patch to
> increase the maximum size of pipe I/O's as well, because using 1MB
> offers a relatively large performance gain there as well.

It doesn't look to be hard to change this, though struct buf depends on
MAXPHYS:

    struct vm_page *b_pages[btoc(MAXPHYS)];

and b_pages isn't the last member of the struct, so changing MAXPHYS at
boot time would cause an ABI change: every member after b_pages would
move. IMHO, we should move it to the last element so that wouldn't
happen. IIRC all buf allocations are from a fixed pool. We'd have to
audit anybody that creates one on the stack knowing it will be
persisted. Given how things work, I don't think this is possible, so we
may be safe. Thankfully, struct bio doesn't seem to be affected.

As for making it boot-time configurable, it shouldn't be too horrible
with the above change.
We should have enough of the tunables mechanism up early enough to pull
this in before we create the buf pool.

Netflix runs MAXPHYS of 8MB. There are issues with something this big,
to be sure, especially on memory-limited systems. Lots of hardware can't
do this big an I/O, and some drivers can't cope, even if the underlying
hardware can. Since we don't use such drivers at work, I don't have a
list handy (though I think the SG list for NVMe limits it to 1MB). 128k
is a totally reasonable bump by default, but I think going larger by
default should be approached with some caution given the overhead that
adds to struct buf. Having it be a run-time tunable would be great.

There are a number of places in userland that depend on MAXPHYS, which
is unfortunate since they assume a fixed value and don't pick it up from
the kernel or kernel config. Thankfully, there are only a limited number
of these.

Of course, there are times when I/Os can return much more than this.
Reading drive log pages, for example, can generate tens or hundreds of
MB of data, and there's no way to do that with one transaction today. If
drive makers were perfect, we could use the generally defined offset and
length fields to read them out piecemeal. That only works if the log is
stable, which is a big if for some of the snapshots of internal state
logs that are sometimes necessary to investigate problems... It sure
would be nice if there were a way to have super-huge I/O on an exception
basis for these situations.

Warner
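
To make the struct buf points above concrete, here is a minimal sketch
in C. It is not the real struct buf: the sketch_ structs, the SKETCH_
macros, and the 4 KiB page size are all assumptions made up for
illustration. It shows two things: an array sized from MAXPHYS moves
every member declared after it when MAXPHYS changes (and moves nothing
when it is the last member), and the pointer array grows linearly with
MAXPHYS.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for illustration only: 4 KiB pages. */
#define SKETCH_PAGE_SIZE 4096u

/* Bytes to pages, rounding up (the same idea as the kernel's btoc()). */
#define SKETCH_BTOC(bytes) (((bytes) + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE)

/* Hypothetical layouts with the page array in the middle of the struct. */
struct sketch_mid_128k {
	long  b_flags;
	void *b_pages[SKETCH_BTOC(128 * 1024)];
	int   b_npages;
};

struct sketch_mid_1m {
	long  b_flags;
	void *b_pages[SKETCH_BTOC(1024 * 1024)];
	int   b_npages;
};

/* The same hypothetical members with the page array moved to the end. */
struct sketch_end_128k {
	long  b_flags;
	int   b_npages;
	void *b_pages[SKETCH_BTOC(128 * 1024)];
};

struct sketch_end_1m {
	long  b_flags;
	int   b_npages;
	void *b_pages[SKETCH_BTOC(1024 * 1024)];
};

/* Returns 1 if growing a mid-struct array moves the member after it. */
static int
sketch_mid_layout_moves(void)
{
	return offsetof(struct sketch_mid_128k, b_npages) !=
	    offsetof(struct sketch_mid_1m, b_npages);
}

/* Returns 1 if growing a trailing array moves the other members. */
static int
sketch_end_layout_moves(void)
{
	return offsetof(struct sketch_end_128k, b_npages) !=
	    offsetof(struct sketch_end_1m, b_npages);
}

/* Bytes the page-pointer array adds to each buffer for a given MAXPHYS. */
static size_t
sketch_pages_array_bytes(size_t maxphys)
{
	assert(maxphys > 0);
	return SKETCH_BTOC(maxphys) * sizeof(void *);
}
```

Under these assumptions, and with 8-byte pointers, the array alone costs
256 bytes per buffer at 128 kB, 2 kB at 1 MB, and 16 kB at 8 MB, which
is the kind of overhead that argues for caution about a large default.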