From: Jeff Roberson
Date: Mon, 30 Dec 2019 11:20:48 -1000 (HST)
To: Warner Losh
cc: Alexander Motin, Alexey Dokuchaev, src-committers, svn-src-all, svn-src-head
Subject: Re: svn commit: r356185 - in head: lib/geom lib/geom/sched sys/geom sys/geom/sched sys/modules/geom sys/modules/geom/geom_sched sys/sys
List-Id: SVN commit messages for the src tree for head/-current

On Mon, 30 Dec 2019, Warner Losh wrote:

> On Mon, Dec 30, 2019 at 12:55 PM Alexander Motin wrote:
>       On 30.12.2019 12:02, Alexey Dokuchaev wrote:
>       > On Mon, Dec 30, 2019 at 08:55:14AM -0700, Warner Losh wrote:
>       >> On Mon, Dec 30, 2019, 5:32 AM Alexey Dokuchaev wrote:
>       >>> On Sun, Dec 29, 2019 at 09:16:04PM +0000, Alexander Motin wrote:
>       >>>> New Revision: 356185
>       >>>> URL: https://svnweb.freebsd.org/changeset/base/356185
>       >>>>
>       >>>> Log:
>       >>>>   Remove GEOM_SCHED class and gsched tool.
>       >>>>   [...]
>       >>>
>       >>> Wow, that was unexpected, I use it on all my machines' HDD drives.
>       >>> Is there a planned replacement, or I'd better create a port for the
>       >>> GEOM_SCHED class and gsched(8) tool?
>       >>
>       >> How much of a performance improvement do you see with it?
>       >>
>       >> There has been no tweaks to this geom in years and years. It was tuned
>       >> to 10 year old hard drives and never retuned for anything newer.
>       >
>       > Well, hard drives essentially didn't change since then, still being the
>       > same rotation media. :)
>
>       At least some papers about gsched I read mention adX devices, which
>       means old ATA stack and no NCQ.  It can be quite a significant change to
>       let HDD to do its own scheduling.  Also about a year ago in r335066
>       Warner added sysctl debug.bioq_batchsize, which if set to non-zero value
>       may, I think, improve fairness between several processes, just not sure
>       why it was never enabled.
>
> I never enabled it because I never had a good car size as the default. I'm
> guessing it's somewhere on the order of 2 times the queue size in hardware,
> but with modern drives I think phk might be right and that disabling
> disksort entirely might be optimal, or close to optimal.
>
>       >> And when I played with it a few years ago, I saw no improvements...
>       >
>       > Admittedly, I've only did some tests no later than in 8.4 times when I
>       > first started using it.  Fair point, though, I should redo them again.
>
>       I'm sorry to create a regression for you, if there is really one.  As I
>       have written I don't have so much against the scheduler part itself, as
>       against the accumulated technical debt and the way integration is done,
>       such as mechanism of live insertion, etc.  Without unmapped I/O and
>       direct dispatch I bet it must be quite slow on bigger systems, that is
>       why I doubted anybody really use it.
>
>       > Is there a planned replacement, or I'd better create a port for the
>       > GEOM_SCHED class and gsched(8) tool?
>
>       I wasn't planning replacement.  And moving it to ports would be a
>       problem, since in process I removed few capabilities critical for it:
>       nstart/nend for live insertion and BIO classification for scheduling.
>       But the last I don't mind to return if there appear to be a need.  It is
>       only the first I am strongly against.
>       But if somebody would like to reimplement it, maybe it would be
>       better to consider merging it with CAM I/O scheduler by Warner?  The
>       one at least knows about device queue depth, etc.  We could return the
>       BIO classification to be used by CAM scheduler instead, if needed.
>
> I'd be keen on helping anybody that wants to experiment with hard disk
> drive optimizations in iosched. My doodles to make it better showed no early
> improvements, so I've not tried to bring them into the tree. However, our
> workload is basically 'large block random' which isn't the same as others
> and others might have a workload that could benefit. I've found a marginal
> improvement from the read over writes bias in our workload, and
> another marginal improvement for favoring metadata reads over normal reads
> (because for us, sendfile blocks for some of these reads, but others may see
> no improvement). I'm working to clean up the metadata read stuff to get it
> into the tree. I've not tested it on ZFS, though, so there will be no ZFS
> metadata labeling in the initial commit.
>
> So I like the idea, and would love to work with someone that needs it
> and/or whose work loads can be improved by it.

The biggest issue I have found with drive sorting and traditional elevator
algorithms is that they are not latency limiting. We have other problems at
higher layers where we schedule too many writes simultaneously, and these
contribute substantially to I/O latency. Also, read-after-writes are blocked
in the buffer cache while a senseless number of buffers are queued and locked.

An algorithm I have found effective and implemented at least twice is to
estimate I/O time and then give a maximum sort latency. For many drives you
have to go further and starve them for I/O until they complete a particularly
long-running operation, or they can continue to decide to sort something out
indefinitely if the I/O you add to the queue is preferable.

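
A minimal sketch of the starvation idea in the paragraph above, written in Python for brevity rather than kernel C. It is an illustration under stated assumptions, not code from FreeBSD or from any implementation mentioned in this thread: the class name `StarveGate`, its methods, and the 150 ms default are all hypothetical. The dispatcher withholds new I/O whenever the oldest outstanding request has been with the drive longer than the latency bound, so the drive has nothing more attractive to pick and must finish the overdue request.

```python
class StarveGate:
    """Withhold new I/O while the oldest outstanding request is overdue.

    Hypothetical illustration; names and the default bound are assumptions.
    """

    def __init__(self, max_latency_ms=150.0):
        self.max_latency_ms = max_latency_ms
        self.outstanding = {}  # request id -> time it was sent to the drive (ms)

    def may_dispatch(self, now_ms):
        """Return True if the drive may be given more work at time now_ms."""
        if not self.outstanding:
            return True
        oldest = min(self.outstanding.values())
        # Overdue: the drive has held some request past the latency bound.
        return now_ms - oldest <= self.max_latency_ms

    def dispatch(self, req_id, now_ms):
        """Send a request to the drive, or return False if it is withheld."""
        if not self.may_dispatch(now_ms):
            return False
        self.outstanding[req_id] = now_ms
        return True

    def complete(self, req_id):
        """Record that the drive finished a request."""
        self.outstanding.pop(req_id, None)
```

A caller would keep withheld requests in its own software queue and retry them after each completion. A real scheduler would compare against an estimated service time per request rather than one fixed bound.
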
The basic notion is to give a boundary, perhaps 100-200ms, for reads and
usually twice that or more for writes. You can sort I/O within batches of
that size. You might violate the batch if a directly adjacent block is
scheduled and you can concatenate them into a single transfer.

You also have to consider whether the drive has a write cache enabled or not
and whether the filesystem or application is going to sync the disk. Many
SATA drives want an idle queue when they sync for best behavior. You probably
also want a larger write queue for uncached writes but preferably not the
entire drive queue. Eventually cached writes cause stalls on flush, and too
many in queue will just hold up queue space while they normally complete so
quickly that a deep queue depth is not important.

Elements of this are also useful on SSDs where you want to manage latency and
queue depth.

I suspect the drive queue is indeed preferable to the simple implementations
we've had in tree.

Thanks,
Jeff

>
> Warner
>
>       --
>       Alexander Motin
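
The latency-bounded batching described in the message above can be sketched roughly as follows, in Python for brevity rather than kernel C. This is an illustration under stated assumptions, not the implementation referred to in the message: `Req`, `build_batch`, and the budget constants are hypothetical names, the per-request time estimate is assumed to come from some external model, and a merged transfer is assumed to add negligible time. Requests are taken in arrival order until their estimated service time would exceed the budget, the batch is then sorted by offset, and a request past the budget is still accepted if it directly abuts one already in the batch.

```python
from dataclasses import dataclass

READ_BUDGET_MS = 150.0                 # assumed value within the 100-200ms range
WRITE_BUDGET_MS = 2 * READ_BUDGET_MS   # "twice that or more" for writes


@dataclass
class Req:
    offset: int    # starting block
    length: int    # number of blocks
    est_ms: float  # estimated service time (assumed to come from a model)


def build_batch(queue, budget_ms):
    """Split `queue` (arrival order) into a sorted batch and deferred requests."""
    batch, deferred = [], []
    spent = 0.0
    closed = False  # set once a request no longer fits in the budget
    for req in queue:
        merged = False
        for b in batch:
            if b.offset + b.length == req.offset:      # req directly follows b
                b.length += req.length
                merged = True
            elif req.offset + req.length == b.offset:  # req directly precedes b
                b.offset = req.offset
                b.length += req.length
                merged = True
            if merged:
                break
        if merged:
            continue  # concatenated; assumed not to consume extra budget
        if not closed and spent + req.est_ms <= budget_ms:
            batch.append(Req(req.offset, req.length, req.est_ms))  # copy
            spent += req.est_ms
        else:
            closed = True
            deferred.append(req)
    batch.sort(key=lambda r: r.offset)  # elevator order within the batch only
    return batch, deferred
```

Sorting is confined to one batch, so no request waits behind more than roughly one budget's worth of reordered I/O. Reads and writes would be batched separately with their own budgets, and the sketch ignores write ordering constraints, write cache state, and sync handling.
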