Date: Fri, 11 Mar 2016 09:15:52 -0700
Subject: Re: svn commit: r292074 - in head/sys/dev: nvd nvme
From: Alan Somers
To: Alexander Motin
Cc: Warner Losh, Steven Hartland, "src-committers@freebsd.org",
 "svn-src-all@freebsd.org", "svn-src-head@freebsd.org"

Interesting. I didn't know about the alternate meaning of stripesize. I
agree then that there's currently no way to tune ZFS to respect NVMe's
128KB boundaries. One could set vfs.zfs.vdev.aggregation_limit to 128KB,
but that would only halfway solve the problem, because allocations could
still be unaligned.
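To see why alignment is the sticking point, here is a minimal standalone
sketch (my own illustration, not actual ZFS or nvme(4) code): a write no
larger than 128KB still crosses a boundary whenever its offset is not
128KB-aligned.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define BOUNDARY (128 * 1024)   /* 128KB native boundary */

    /* Does [offset, offset + len) cross a 128KB boundary? */
    static bool
    crosses_boundary(uint64_t offset, uint64_t len)
    {
            return (offset / BOUNDARY != (offset + len - 1) / BOUNDARY);
    }

    int
    main(void)
    {
            /* Aligned 128KB write: stays inside one window. */
            printf("%d\n", crosses_boundary(0, BOUNDARY));    /* prints 0 */
            /* Same size, offset by one 4KB block: crosses. */
            printf("%d\n", crosses_boundary(4096, BOUNDARY)); /* prints 1 */
            return (0);
    }

So capping the aggregation size only helps if the allocator also
guarantees 128KB-aligned offsets, which ZFS does not.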
Frankly, I'm surprised that NVMe drives should have such a small limit
when SATA and SAS devices commonly handle single commands that span
multiple megabytes. I don't think there's any way to adapt ZFS to this
limit without hurting it in other ways, for example by restricting its
ability to use large _or_ small record sizes. Hopefully the NVMe slow
path isn't _too_ slow.

On Fri, Mar 11, 2016 at 2:07 AM, Alexander Motin wrote:
> On 11.03.16 06:58, Alan Somers wrote:
> > Do they behave badly for writes that cross a 128KB boundary, but are
> > nonetheless aligned to 128KB boundaries? Then I don't understand how
> > this change (or mav's replacement) is supposed to help. The
> > stripesize is supposed to be the minimum write that the device can
> > accept without requiring a read-modify-write. ZFS guarantees that it
> > will never issue a write smaller than the stripesize, nor will it
> > ever issue a write that is not aligned to a stripesize boundary. But
> > even if ZFS worked with 128KB stripesizes, it would still happily
> > issue writes a multiple of 128KB in size, and these would cross
> > those boundaries. Am I not understanding something here?
>
> stripesize is not necessarily related to read-modify-write. It reports
> "some" native boundaries of the device. For example, a RAID0 array has
> stripes, crossing which does not cause read-modify-write cycles, but
> does cause I/O splits and head seeks on extra disks. This, as I
> understand it, is the case for some of Intel's NVMe device models
> here, and is the reason why a 128KB stripesize was originally
> reported.
>
> We cannot demand that all file systems never issue I/Os smaller than
> the stripesize, since it can be 128KB, 1MB or even more (and if that
> were a hard requirement, it would be called sectorsize). If ZFS (in
> this case) doesn't support allocation block sizes above 8K (and even
> that is very space-inefficient), and it has no other mechanism to
> optimize I/O alignment, then that is not a problem of the NVMe device
> or driver, but only of ZFS itself. So what I have done here is move
> the workaround from the improper place (NVMe) to the proper one (ZFS):
> NVMe now correctly reports its native 128K boundaries, which will be
> respected, for example, by gpart, which in turn helps UFS align its
> 32K blocks, while ZFS will correctly ignore values it cannot optimize
> for, falling back to efficient 512-byte allocations.
>
> PS: the meaning of stripesize is not limited to read-modify-write. For
> example, a RAID5 of five 512e disks actually has three stripe sizes:
> 4K, 64K and 256K. Aligned 4K writes avoid read-modify-write inside
> each drive, I/Os that don't cross 64K boundaries without reason
> improve parallel performance, and aligned 256K writes avoid
> read-modify-write at the RAID5 level. Obviously not all of those
> optimizations are achievable in every environment, and the bigger the
> stripe size, the harder it is to optimize for, but that does not mean
> such optimization is impossible. It would be good to be able to report
> all of them, allowing each consumer to use as many as it can.
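As an aside, the boundary split described above is easy to picture with
a small standalone sketch. This is my illustration of the idea, not the
actual nvme(4) splitting code:

    #include <stdint.h>
    #include <stdio.h>

    #define BOUNDARY (128 * 1024)   /* native 128KB boundary */

    /*
     * Split [offset, offset + len) into chunks that never cross a
     * 128KB boundary, roughly what a driver-level splitter does.
     */
    static void
    split_io(uint64_t offset, uint64_t len)
    {
            while (len > 0) {
                    uint64_t room = BOUNDARY - offset % BOUNDARY;
                    uint64_t chunk = len < room ? len : room;

                    printf("chunk: offset=%ju len=%ju\n",
                        (uintmax_t)offset, (uintmax_t)chunk);
                    offset += chunk;
                    len -= chunk;
            }
    }

    int
    main(void)
    {
            /* A 256KB write starting 4KB past a boundary: three chunks. */
            split_io(4096, 256 * 1024);
            return (0);
    }

Each chunk stays inside one native window, at the cost of issuing more
commands.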
> > On Thu, Mar 10, 2016 at 9:34 PM, Warner Losh wrote:
> >
> >     Some Intel NVMe drives behave badly when the LBA range crosses a
> >     128k boundary. Their performance is worse for those transactions
> >     than for ones that don't cross the 128k boundary.
> >
> >     Warner
> >
> >     On Thu, Mar 10, 2016 at 11:01 AM, Alan Somers wrote:
> >
> >         Are you saying that Intel NVMe controllers perform poorly for
> >         all I/Os that are less than 128KB, or just for I/Os of any
> >         size that cross a 128KB boundary?
> >
> >         On Thu, Dec 10, 2015 at 7:06 PM, Steven Hartland wrote:
> >
> >             Author: smh
> >             Date: Fri Dec 11 02:06:03 2015
> >             New Revision: 292074
> >             URL: https://svnweb.freebsd.org/changeset/base/292074
> >
> >             Log:
> >               Limit stripesize reported from nvd(4) to 4K
> >
> >               Intel NVMe controllers have a slow path for I/Os that
> >               span a 128KB stripe boundary but ZFS limits ashift,
> >               which is derived from d_stripesize, to 13 (8KB) so we
> >               limit the stripesize reported to geom(8) to 4KB.
> >
> >               This may result in a small number of additional I/Os
> >               to require splitting in nvme(4), however the NVMe I/O
> >               path is very efficient so these additional I/Os will
> >               cause very minimal (if any) difference in performance
> >               or CPU utilisation.
> >
> >               This can be controlled by the new sysctl
> >               kern.nvme.max_optimal_sectorsize.
> >
> >               MFC after:    1 week
> >               Sponsored by: Multiplay
> >               Differential Revision: https://reviews.freebsd.org/D4446
> >
> >             Modified:
> >               head/sys/dev/nvd/nvd.c
> >               head/sys/dev/nvme/nvme.h
> >               head/sys/dev/nvme/nvme_ns.c
> >               head/sys/dev/nvme/nvme_sysctl.c
>
> --
> Alexander Motin
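For reference, the clamping described in the commit log reduces to
something like the following standalone sketch. This is my approximation
of the idea, not the actual nvd(4)/nvme(4) code; only the
kern.nvme.max_optimal_sectorsize sysctl name is real, everything else
here is illustrative:

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Approximation of the r292074 idea: the controller's native
     * boundary is 128KB, but the stripesize advertised to GEOM is
     * capped (by default at 4KB) so that ZFS, whose ashift tops out
     * at 13 (8KB), can still honor it. "max_optimal" stands in for
     * the kern.nvme.max_optimal_sectorsize tunable.
     */
    static uint32_t
    clamp_stripesize(uint32_t native, uint32_t max_optimal)
    {
            return (native > max_optimal ? max_optimal : native);
    }

    int
    main(void)
    {
            uint32_t native = 128 * 1024;

            printf("%u\n", clamp_stripesize(native, 4096)); /* 4096 */
            return (0);
    }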