From: Warner Losh <wlosh@bsdimp.com>
To: Alan Somers
Cc: Alexander Motin, Steven Hartland, src-committers@freebsd.org,
    svn-src-all@freebsd.org, svn-src-head@freebsd.org
Date: Fri, 11 Mar 2016 09:31:48 -0700
Subject: Re: svn commit: r292074 - in head/sys/dev: nvd nvme

On Fri, Mar 11, 2016 at 9:24 AM, Warner Losh wrote:
>
> On Fri, Mar 11, 2016 at 9:15 AM, Alan Somers wrote:
>
>> Interesting. I didn't know about the alternate meaning of stripesize.
>> I agree then that there's currently no way to tune ZFS to respect
>> NVMe's 128KB boundaries.
>> One could set vfs.zfs.vdev.aggregation_limit to 128KB, but that would
>> only halfway solve the problem, because allocations could still be
>> unaligned. Frankly, I'm surprised that NVMe drives should have such a
>> small limit when SATA and SAS devices commonly handle single commands
>> that span multiple MB. I don't think there's any way to adapt ZFS to
>> this limit without hurting it in other ways; for example, by
>> restricting its ability to use large _or_ small record sizes.
>>
>> Hopefully the NVMe slow path isn't _too_ slow.
>
> Let's be clear here: this is purely an Intel controller issue, not an
> NVMe issue. Most other NVMe drives don't have any issues with this at
> all, at least for the drives I've been testing from well-known NAND
> players (I'm unsure if they are released yet, so I can't name names,
> other than to say that they aren't OCZ). All these NVMe drives handle
> 1MB I/Os with approximately the same performance as 128k or 64k I/Os.
> The enterprise-grade drives are quite fast and quite nice. It's the
> lower-end, consumer drives that have more issues. Since those have
> been eliminated from our detailed consideration, I'm unsure if they
> have issues.
>
> And the Intel issue is a more subtle one, having more to do with PCIe
> burst sizes than with crossing the 128k boundary as such. I've asked
> my contacts inside Intel (who I don't think read these lists) for the
> exact details.
>
> And keep in mind the original description was this:

Quote:

    Intel NVMe controllers have a slow path for I/Os that span a 128KB
    stripe boundary but ZFS limits ashift, which is derived from
    d_stripesize, to 13 (8KB) so we limit the stripesize reported to
    geom(8) to 4KB.

    This may result in a small number of additional I/Os to require
    splitting in nvme(4), however the NVMe I/O path is very efficient so
    these additional I/Os will cause very minimal (if any) difference in
    performance or CPU utilisation.

unquote

So the issue seems to be getting blown up a bit. It's better if you
don't generate these I/Os, but the driver copes by splitting them on the
affected drives, causing a small inefficiency: you're increasing the
number of I/Os needed to complete the transfer, cutting into the IOPS
budget.

Warner

> Warner
>
>
>> On Fri, Mar 11, 2016 at 2:07 AM, Alexander Motin wrote:
>>
>>> On 11.03.16 06:58, Alan Somers wrote:
>>> > Do they behave badly for writes that cross a 128KB boundary, but
>>> > are nonetheless aligned to 128KB boundaries? Then I don't
>>> > understand how this change (or mav's replacement) is supposed to
>>> > help. The stripesize is supposed to be the minimum write that the
>>> > device can accept without requiring a read-modify-write. ZFS
>>> > guarantees that it will never issue a write smaller than the
>>> > stripesize, nor will it ever issue a write that is not aligned to
>>> > a stripesize boundary. But even if ZFS worked with 128KB
>>> > stripesizes, it would still happily issue writes a multiple of
>>> > 128KB in size, and these would cross those boundaries. Am I not
>>> > understanding something here?
>>>
>>> stripesize is not necessarily related to read-modify-write. It
>>> reports "some" native boundaries of the device. For example, a RAID0
>>> array has stripes, crossing which does not cause read-modify-write
>>> cycles, but does cause I/O splits and head seeks on extra disks.
>>> This, as I understand it, is the case for some of Intel's NVMe
>>> device models here, and is the reason why the 128KB stripesize was
>>> originally reported.
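>>>
>>> To make that concrete, the boundary-crossing arithmetic looks
>>> roughly like the sketch below (illustrative only, not the nvme(4)
>>> splitting code; 128KB is simply the boundary discussed in this
>>> thread):
>>>
>>>     #include <stdint.h>
>>>     #include <stdio.h>
>>>
>>>     /* Native boundary being discussed; purely illustrative. */
>>>     #define NATIVE_BOUNDARY (128 * 1024ULL)
>>>
>>>     /* Pieces a transfer splits into if it may not cross a boundary. */
>>>     static unsigned
>>>     split_count(uint64_t offset, uint64_t length)
>>>     {
>>>             uint64_t first = offset / NATIVE_BOUNDARY;
>>>             uint64_t last = (offset + length - 1) / NATIVE_BOUNDARY;
>>>
>>>             return ((unsigned)(last - first) + 1);
>>>     }
>>>
>>>     int
>>>     main(void)
>>>     {
>>>             /* A 1MB write starting 4KB past a boundary: 9 pieces. */
>>>             printf("%u\n", split_count(4096, 1024 * 1024));
>>>             /* The same 1MB write, 128KB-aligned: 8 pieces. */
>>>             printf("%u\n", split_count(0, 1024 * 1024));
>>>             return (0);
>>>     }
>>>
>>> An aligned transfer still gets split at its internal boundaries; the
>>> misaligned one just pays for one extra piece.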
>>>
>>> We can not demand that all file systems never issue I/Os smaller
>>> than the stripesize, since it can be 128KB, 1MB or even more (and if
>>> we could demand that, it would be called sectorsize). If ZFS (in
>>> this case) doesn't support allocation block sizes above 8K (and even
>>> that is very space-inefficient), and it has no other mechanisms to
>>> optimize I/O alignment, then it is not a problem of the NVMe device
>>> or driver, but only of ZFS itself. So what I have done here is move
>>> the workaround from the improper place (NVMe) to the proper one
>>> (ZFS): NVMe now correctly reports its native 128K boundaries, which
>>> will be respected, for example, by gpart, which in turn helps UFS
>>> align its 32K blocks, while ZFS will correctly ignore values it
>>> can't optimize for, falling back to efficient 512-byte allocations.
>>>
>>> PS: about the meaning of stripesize not being limited to
>>> read-modify-write: for example, a RAID5 of 5 512e disks actually has
>>> three stripe sizes: 4K, 64K and 256K. Aligned 4K writes avoid
>>> read-modify-write inside the drive, I/Os that don't cross 64K
>>> boundaries without reason improve parallel performance, and aligned
>>> 256K writes avoid read-modify-write at the RAID5 level. Obviously
>>> not all of those optimizations are achievable in all environments,
>>> and the bigger the stripe size the harder it is to optimize for, but
>>> that does not mean such optimization is impossible. It would be good
>>> to be able to report all of them, allowing each consumer to use as
>>> many of them as it can.
>>>
>>> > On Thu, Mar 10, 2016 at 9:34 PM, Warner Losh wrote:
>>> >
>>> > Some Intel NVMe drives behave badly when the LBA range crosses a
>>> > 128k boundary. Their performance is worse for those transactions
>>> > than for ones that don't cross the 128k boundary.
>>> >
>>> > Warner
>>> >
>>> > On Thu, Mar 10, 2016 at 11:01 AM, Alan Somers wrote:
>>> >
>>> > Are you saying that Intel NVMe controllers perform poorly for all
>>> > I/Os that are less than 128KB, or just for I/Os of any size that
>>> > cross a 128KB boundary?
>>> >
>>> > On Thu, Dec 10, 2015 at 7:06 PM, Steven Hartland wrote:
>>> >
>>> > Author: smh
>>> > Date: Fri Dec 11 02:06:03 2015
>>> > New Revision: 292074
>>> > URL: https://svnweb.freebsd.org/changeset/base/292074
>>> >
>>> > Log:
>>> >   Limit stripesize reported from nvd(4) to 4K
>>> >
>>> >   Intel NVMe controllers have a slow path for I/Os that span a
>>> >   128KB stripe boundary but ZFS limits ashift, which is derived
>>> >   from d_stripesize, to 13 (8KB) so we limit the stripesize
>>> >   reported to geom(8) to 4KB.
>>> >
>>> >   This may result in a small number of additional I/Os to require
>>> >   splitting in nvme(4), however the NVMe I/O path is very
>>> >   efficient so these additional I/Os will cause very minimal (if
>>> >   any) difference in performance or CPU utilisation.
>>> >
>>> >   This can be controlled by the new sysctl
>>> >   kern.nvme.max_optimal_sectorsize.
>>> >
>>> >   MFC after:    1 week
>>> >   Sponsored by: Multiplay
>>> >   Differential Revision: https://reviews.freebsd.org/D4446
>>> >
>>> > Modified:
>>> >   head/sys/dev/nvd/nvd.c
>>> >   head/sys/dev/nvme/nvme.h
>>> >   head/sys/dev/nvme/nvme_ns.c
>>> >   head/sys/dev/nvme/nvme_sysctl.c
>>>
>>> --
>>> Alexander Motin
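
As a footnote on the numbers in the commit message: the ashift it refers
to is just log2 of the allocation size, and ZFS caps it at 13 (8KB). A
toy sketch of that constraint (the cap and the sizes are taken from this
thread; the real ZFS vdev code is more involved):

    #include <stdint.h>
    #include <stdio.h>

    /* ZFS's ashift cap per the commit message: 13, i.e. 8KB. */
    #define ZFS_ASHIFT_MAX  13

    /* Floor of log2, i.e. the ashift a given stripesize asks for. */
    static int
    desired_ashift(uint64_t stripesize)
    {
            int shift = 0;

            while ((stripesize >>= 1) != 0)
                    shift++;
            return (shift);
    }

    int
    main(void)
    {
            uint64_t sizes[] = { 4096, 8192, 128 * 1024 };

            for (int i = 0; i < 3; i++) {
                    int a = desired_ashift(sizes[i]);

                    printf("stripesize %7ju -> ashift %d (%s)\n",
                        (uintmax_t)sizes[i], a,
                        a <= ZFS_ASHIFT_MAX ? "usable" : "above the cap");
            }
            return (0);
    }

A reported 128K stripesize asks for ashift 17, well above the cap, which
is why r292074 clamped the reported value to 4K (ashift 12) and why
mav's replacement lets ZFS simply ignore boundaries it cannot honor.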