Date: Fri, 11 Mar 2016 09:15:52 -0700
Subject: Re: svn commit: r292074 - in head/sys/dev: nvd nvme
From: Alan Somers
To: Alexander Motin
Cc: Warner Losh, Steven Hartland, "src-committers@freebsd.org",
 "svn-src-all@freebsd.org", "svn-src-head@freebsd.org"

Interesting. I didn't know about the alternate meaning of stripesize. I
agree then that there's currently no way to tune ZFS to respect NVMe's
128KB boundaries. One could set vfs.zfs.vdev.aggregation_limit to 128KB,
but that would only halfway solve the problem, because allocations could
still be unaligned.
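To see why alignment is the sticking point, here is a minimal standalone
sketch (my own illustration, not actual ZFS or nvme(4) code): a write no
larger than 128KB still crosses a boundary whenever its offset is not
128KB-aligned.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define BOUNDARY (128 * 1024)   /* 128KB native boundary */

    /* Does [offset, offset + len) cross a 128KB boundary? */
    static bool
    crosses_boundary(uint64_t offset, uint64_t len)
    {
            return (offset / BOUNDARY != (offset + len - 1) / BOUNDARY);
    }

    int
    main(void)
    {
            /* Aligned 128KB write: stays inside one window. */
            printf("%d\n", crosses_boundary(0, BOUNDARY));    /* prints 0 */
            /* Same size, offset by one 4KB block: crosses. */
            printf("%d\n", crosses_boundary(4096, BOUNDARY)); /* prints 1 */
            return (0);
    }

So capping the aggregation size only helps if the allocator also
guarantees 128KB-aligned offsets, which ZFS does not.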
Frankly, I'm surprised that NVMe drives should have such a small limit
when SATA and SAS devices commonly handle single commands that span
multiple megabytes. I don't think there's any way to adapt ZFS to this
limit without hurting it in other ways, for example by restricting its
ability to use large _or_ small record sizes. Hopefully the NVMe slow
path isn't _too_ slow.

On Fri, Mar 11, 2016 at 2:07 AM, Alexander Motin wrote:
> On 11.03.16 06:58, Alan Somers wrote:
> > Do they behave badly for writes that cross a 128KB boundary, but are
> > nonetheless aligned to 128KB boundaries? Then I don't understand how
> > this change (or mav's replacement) is supposed to help. The
> > stripesize is supposed to be the minimum write that the device can
> > accept without requiring a read-modify-write. ZFS guarantees that it
> > will never issue a write smaller than the stripesize, nor will it
> > ever issue a write that is not aligned to a stripesize boundary. But
> > even if ZFS worked with 128KB stripesizes, it would still happily
> > issue writes a multiple of 128KB in size, and these would cross
> > those boundaries. Am I not understanding something here?
>
> stripesize is not necessarily related to read-modify-write. It reports
> "some" native boundaries of the device. For example, a RAID0 array has
> stripes, crossing which does not cause read-modify-write cycles, but
> does cause I/O splits and head seeks on extra disks. This, as I
> understand it, is the case for some of Intel's NVMe device models
> here, and is the reason why a 128KB stripesize was originally
> reported.
>
> We cannot demand that all file systems never issue I/Os smaller than
> the stripesize, since it can be 128KB, 1MB or even more (and if that
> were a hard requirement, it would be called sectorsize). If ZFS (in
> this case) doesn't support allocation block sizes above 8K (and even
> that is very space-inefficient), and it has no other mechanism to
> optimize I/O alignment, then that is not a problem of the NVMe device
> or driver, but only of ZFS itself. So what I have done here is move
> the workaround from the improper place (NVMe) to the proper one (ZFS):
> NVMe now correctly reports its native 128K boundaries, which will be
> respected, for example, by gpart, which in turn helps UFS align its
> 32K blocks, while ZFS will correctly ignore values it cannot optimize
> for, falling back to efficient 512-byte allocations.
>
> PS: the meaning of stripesize is not limited to read-modify-write. For
> example, a RAID5 of five 512e disks actually has three stripe sizes:
> 4K, 64K and 256K. Aligned 4K writes avoid read-modify-write inside
> each drive, I/Os that don't cross 64K boundaries without reason
> improve parallel performance, and aligned 256K writes avoid
> read-modify-write at the RAID5 level. Obviously not all of those
> optimizations are achievable in every environment, and the bigger the
> stripe size, the harder it is to optimize for, but that does not mean
> such optimization is impossible. It would be good to be able to report
> all of them, allowing each consumer to use as many as it can.
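As an aside, the boundary split described above is easy to picture with
a small standalone sketch. This is my illustration of the idea, not the
actual nvme(4) splitting code:

    #include <stdint.h>
    #include <stdio.h>

    #define BOUNDARY (128 * 1024)   /* native 128KB boundary */

    /*
     * Split [offset, offset + len) into chunks that never cross a
     * 128KB boundary, roughly what a driver-level splitter does.
     */
    static void
    split_io(uint64_t offset, uint64_t len)
    {
            while (len > 0) {
                    uint64_t room = BOUNDARY - offset % BOUNDARY;
                    uint64_t chunk = len < room ? len : room;

                    printf("chunk: offset=%ju len=%ju\n",
                        (uintmax_t)offset, (uintmax_t)chunk);
                    offset += chunk;
                    len -= chunk;
            }
    }

    int
    main(void)
    {
            /* A 256KB write starting 4KB past a boundary: three chunks. */
            split_io(4096, 256 * 1024);
            return (0);
    }

Each chunk stays inside one native window, at the cost of issuing more
commands.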
> > On Thu, Mar 10, 2016 at 9:34 PM, Warner Losh wrote:
> >
> >     Some Intel NVMe drives behave badly when the LBA range crosses a
> >     128k boundary. Their performance is worse for those transactions
> >     than for ones that don't cross the 128k boundary.
> >
> >     Warner
> >
> >     On Thu, Mar 10, 2016 at 11:01 AM, Alan Somers wrote:
> >
> >         Are you saying that Intel NVMe controllers perform poorly for
> >         all I/Os that are less than 128KB, or just for I/Os of any
> >         size that cross a 128KB boundary?
> >
> >         On Thu, Dec 10, 2015 at 7:06 PM, Steven Hartland wrote:
> >
> >             Author: smh
> >             Date: Fri Dec 11 02:06:03 2015
> >             New Revision: 292074
> >             URL: https://svnweb.freebsd.org/changeset/base/292074
> >
> >             Log:
> >               Limit stripesize reported from nvd(4) to 4K
> >
> >               Intel NVMe controllers have a slow path for I/Os that
> >               span a 128KB stripe boundary but ZFS limits ashift,
> >               which is derived from d_stripesize, to 13 (8KB) so we
> >               limit the stripesize reported to geom(8) to 4KB.
> >
> >               This may result in a small number of additional I/Os
> >               to require splitting in nvme(4), however the NVMe I/O
> >               path is very efficient so these additional I/Os will
> >               cause very minimal (if any) difference in performance
> >               or CPU utilisation.
> >
> >               This can be controlled by the new sysctl
> >               kern.nvme.max_optimal_sectorsize.
> >
> >               MFC after:    1 week
> >               Sponsored by: Multiplay
> >               Differential Revision: https://reviews.freebsd.org/D4446
> >
> >             Modified:
> >               head/sys/dev/nvd/nvd.c
> >               head/sys/dev/nvme/nvme.h
> >               head/sys/dev/nvme/nvme_ns.c
> >               head/sys/dev/nvme/nvme_sysctl.c
>
> --
> Alexander Motin
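For reference, the clamping described in the commit log reduces to
something like the following standalone sketch. This is my approximation
of the idea, not the actual nvd(4)/nvme(4) code; only the
kern.nvme.max_optimal_sectorsize sysctl name is real, everything else
here is illustrative:

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Approximation of the r292074 idea: the controller's native
     * boundary is 128KB, but the stripesize advertised to GEOM is
     * capped (by default at 4KB) so that ZFS, whose ashift tops out
     * at 13 (8KB), can still honor it. "max_optimal" stands in for
     * the kern.nvme.max_optimal_sectorsize tunable.
     */
    static uint32_t
    clamp_stripesize(uint32_t native, uint32_t max_optimal)
    {
            return (native > max_optimal ? max_optimal : native);
    }

    int
    main(void)
    {
            uint32_t native = 128 * 1024;

            printf("%u\n", clamp_stripesize(native, 4096)); /* 4096 */
            return (0);
    }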