From: Warner Losh <wlosh@bsdimp.com>
To: Alan Somers
Cc: Alexander Motin, Steven Hartland, src-committers@freebsd.org,
    svn-src-all@freebsd.org, svn-src-head@freebsd.org
Date: Fri, 11 Mar 2016 09:31:48 -0700
Subject: Re: svn commit: r292074 - in head/sys/dev: nvd nvme

On Fri, Mar 11, 2016 at 9:24 AM, Warner Losh wrote:
>
> On Fri, Mar 11, 2016 at 9:15 AM, Alan Somers wrote:
>
>> Interesting. I didn't know about the alternate meaning of stripesize.
>> I agree then that there's currently no way to tune ZFS to respect
>> NVMe's 128KB boundaries.
>> One could set vfs.zfs.vdev.aggregation_limit to 128KB, but that would
>> only halfway solve the problem, because allocations could still be
>> unaligned. Frankly, I'm surprised that NVMe drives should have such a
>> small limit when SATA and SAS devices commonly handle single commands
>> that span multiple MB. I don't think there's any way to adapt ZFS to
>> this limit without hurting it in other ways; for example, by
>> restricting its ability to use large _or_ small record sizes.
>>
>> Hopefully the NVMe slow path isn't _too_ slow.
>
> Let's be clear here: this is purely an Intel controller issue, not an
> NVMe issue. Most other NVMe drives don't have any issues with this at
> all, at least for the drives I've been testing from well-known NAND
> players (I'm unsure if they are released yet, so I can't name names,
> other than to say that they aren't OCZ). All these NVMe drives handle
> 1MB I/Os with approximately the same performance as 128k or 64k I/Os.
> The enterprise-grade drives are quite fast and quite nice. It's the
> lower-end, consumer drives that have more issues. Since those have
> been eliminated from our detailed consideration, I'm unsure if they
> have issues.
>
> And the Intel issue is a more subtle one, having more to do with PCIe
> burst sizes than with crossing the 128k boundary as such. I've asked
> my contacts inside Intel (who I don't think read these lists) for the
> exact details.
>
> And keep in mind the original description was this:

Quote:

    Intel NVMe controllers have a slow path for I/Os that span a 128KB
    stripe boundary but ZFS limits ashift, which is derived from
    d_stripesize, to 13 (8KB) so we limit the stripesize reported to
    geom(8) to 4KB.

    This may result in a small number of additional I/Os to require
    splitting in nvme(4), however the NVMe I/O path is very efficient so
    these additional I/Os will cause very minimal (if any) difference in
    performance or CPU utilisation.

unquote

So the issue seems to be getting blown up a bit. It's better if you
don't generate these I/Os, but the driver copes by splitting them on the
affected drives, causing a small inefficiency: you're increasing the
number of I/Os needed to complete the transfer, cutting into the IOPS
budget.

Warner

> Warner
>
>
>> On Fri, Mar 11, 2016 at 2:07 AM, Alexander Motin wrote:
>>
>>> On 11.03.16 06:58, Alan Somers wrote:
>>> > Do they behave badly for writes that cross a 128KB boundary, but
>>> > are nonetheless aligned to 128KB boundaries? Then I don't
>>> > understand how this change (or mav's replacement) is supposed to
>>> > help. The stripesize is supposed to be the minimum write that the
>>> > device can accept without requiring a read-modify-write. ZFS
>>> > guarantees that it will never issue a write smaller than the
>>> > stripesize, nor will it ever issue a write that is not aligned to
>>> > a stripesize boundary. But even if ZFS worked with 128KB
>>> > stripesizes, it would still happily issue writes a multiple of
>>> > 128KB in size, and these would cross those boundaries. Am I not
>>> > understanding something here?
>>>
>>> stripesize is not necessarily related to read-modify-write. It
>>> reports "some" native boundaries of the device. For example, a RAID0
>>> array has stripes, crossing which does not cause read-modify-write
>>> cycles, but does cause I/O splits and head seeks on extra disks.
>>> This, as I understand it, is the case for some of Intel's NVMe
>>> device models here, and is the reason why the 128KB stripesize was
>>> originally reported.
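>>>
>>> To make that concrete, the boundary-crossing arithmetic looks
>>> roughly like the sketch below (illustrative only, not the nvme(4)
>>> splitting code; 128KB is simply the boundary discussed in this
>>> thread):
>>>
>>>     #include <stdint.h>
>>>     #include <stdio.h>
>>>
>>>     /* Native boundary being discussed; purely illustrative. */
>>>     #define NATIVE_BOUNDARY (128 * 1024ULL)
>>>
>>>     /* Pieces a transfer splits into if it may not cross a boundary. */
>>>     static unsigned
>>>     split_count(uint64_t offset, uint64_t length)
>>>     {
>>>             uint64_t first = offset / NATIVE_BOUNDARY;
>>>             uint64_t last = (offset + length - 1) / NATIVE_BOUNDARY;
>>>
>>>             return ((unsigned)(last - first) + 1);
>>>     }
>>>
>>>     int
>>>     main(void)
>>>     {
>>>             /* A 1MB write starting 4KB past a boundary: 9 pieces. */
>>>             printf("%u\n", split_count(4096, 1024 * 1024));
>>>             /* The same 1MB write, 128KB-aligned: 8 pieces. */
>>>             printf("%u\n", split_count(0, 1024 * 1024));
>>>             return (0);
>>>     }
>>>
>>> An aligned transfer still gets split at its internal boundaries; the
>>> misaligned one just pays for one extra piece.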
>>>
>>> We can not demand that all file systems never issue I/Os smaller
>>> than the stripesize, since it can be 128KB, 1MB or even more (and if
>>> we could demand that, it would be called sectorsize). If ZFS (in
>>> this case) doesn't support allocation block sizes above 8K (and even
>>> that is very space-inefficient), and it has no other mechanisms to
>>> optimize I/O alignment, then it is not a problem of the NVMe device
>>> or driver, but only of ZFS itself. So what I have done here is move
>>> the workaround from the improper place (NVMe) to the proper one
>>> (ZFS): NVMe now correctly reports its native 128K boundaries, which
>>> will be respected, for example, by gpart, which in turn helps UFS
>>> align its 32K blocks, while ZFS will correctly ignore values it
>>> can't optimize for, falling back to efficient 512-byte allocations.
>>>
>>> PS: about the meaning of stripesize not being limited to
>>> read-modify-write: for example, a RAID5 of 5 512e disks actually has
>>> three stripe sizes: 4K, 64K and 256K. Aligned 4K writes avoid
>>> read-modify-write inside the drive, I/Os that don't cross 64K
>>> boundaries without reason improve parallel performance, and aligned
>>> 256K writes avoid read-modify-write at the RAID5 level. Obviously
>>> not all of those optimizations are achievable in all environments,
>>> and the bigger the stripe size the harder it is to optimize for, but
>>> that does not mean such optimization is impossible. It would be good
>>> to be able to report all of them, allowing each consumer to use as
>>> many of them as it can.
>>>
>>> > On Thu, Mar 10, 2016 at 9:34 PM, Warner Losh wrote:
>>> >
>>> > Some Intel NVMe drives behave badly when the LBA range crosses a
>>> > 128k boundary. Their performance is worse for those transactions
>>> > than for ones that don't cross the 128k boundary.
>>> >
>>> > Warner
>>> >
>>> > On Thu, Mar 10, 2016 at 11:01 AM, Alan Somers wrote:
>>> >
>>> > Are you saying that Intel NVMe controllers perform poorly for all
>>> > I/Os that are less than 128KB, or just for I/Os of any size that
>>> > cross a 128KB boundary?
>>> >
>>> > On Thu, Dec 10, 2015 at 7:06 PM, Steven Hartland wrote:
>>> >
>>> > Author: smh
>>> > Date: Fri Dec 11 02:06:03 2015
>>> > New Revision: 292074
>>> > URL: https://svnweb.freebsd.org/changeset/base/292074
>>> >
>>> > Log:
>>> >   Limit stripesize reported from nvd(4) to 4K
>>> >
>>> >   Intel NVMe controllers have a slow path for I/Os that span a
>>> >   128KB stripe boundary but ZFS limits ashift, which is derived
>>> >   from d_stripesize, to 13 (8KB) so we limit the stripesize
>>> >   reported to geom(8) to 4KB.
>>> >
>>> >   This may result in a small number of additional I/Os to require
>>> >   splitting in nvme(4), however the NVMe I/O path is very
>>> >   efficient so these additional I/Os will cause very minimal (if
>>> >   any) difference in performance or CPU utilisation.
>>> >
>>> >   This can be controlled by the new sysctl
>>> >   kern.nvme.max_optimal_sectorsize.
>>> >
>>> >   MFC after:    1 week
>>> >   Sponsored by: Multiplay
>>> >   Differential Revision: https://reviews.freebsd.org/D4446
>>> >
>>> > Modified:
>>> >   head/sys/dev/nvd/nvd.c
>>> >   head/sys/dev/nvme/nvme.h
>>> >   head/sys/dev/nvme/nvme_ns.c
>>> >   head/sys/dev/nvme/nvme_sysctl.c
>>>
>>> --
>>> Alexander Motin
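
As a footnote on the numbers in the commit message: the ashift it refers
to is just log2 of the allocation size, and ZFS caps it at 13 (8KB). A
toy sketch of that constraint (the cap and the sizes are taken from this
thread; the real ZFS vdev code is more involved):

    #include <stdint.h>
    #include <stdio.h>

    /* ZFS's ashift cap per the commit message: 13, i.e. 8KB. */
    #define ZFS_ASHIFT_MAX  13

    /* Floor of log2, i.e. the ashift a given stripesize asks for. */
    static int
    desired_ashift(uint64_t stripesize)
    {
            int shift = 0;

            while ((stripesize >>= 1) != 0)
                    shift++;
            return (shift);
    }

    int
    main(void)
    {
            uint64_t sizes[] = { 4096, 8192, 128 * 1024 };

            for (int i = 0; i < 3; i++) {
                    int a = desired_ashift(sizes[i]);

                    printf("stripesize %7ju -> ashift %d (%s)\n",
                        (uintmax_t)sizes[i], a,
                        a <= ZFS_ASHIFT_MAX ? "usable" : "above the cap");
            }
            return (0);
    }

A reported 128K stripesize asks for ashift 17, well above the cap, which
is why r292074 clamped the reported value to 4K (ashift 12) and why
mav's replacement lets ZFS simply ignore boundaries it cannot honor.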