Date: Wed, 17 Oct 2007 00:23:14 +0200
From: Fabio Checconi <fabio@freebsd.org>
To: Karsten Behrmann
Cc: freebsd-hackers@freebsd.org
Subject: Re: Pluggable Disk Scheduler Project

> From: Karsten Behrmann
> Date: Tue, Oct 16, 2007 04:10:37PM +0200
>
> > Hi,
> > is anybody working on the `Pluggable Disk Scheduler Project' from
> > the ideas page?
> I've been kicking the idea around in my head, but I'm probably newer to
> everything involved than you are, so feel free to pick it up. If you
> want, we can toss some ideas and code to each other, though I don't
> really have anything on the latter.

Thank you for your answer; I'd really like to work and discuss with you
and with anyone else interested in this project.

> > o Where is the right place (in GEOM) for a disk scheduler?
> I have spent some time at EuroBSDCon talking to Kirk and phk about
> this, and the result was that I now know strong proponents both for
> putting it into the disk drivers and for putting it into GEOM ;-)
>
> Personally, I would put it into GEOM. I'll go into more detail on
> this later, but basically, GEOM seems a better fit for "high-level"
> code than a device driver, and if done properly the performance
> penalties should be negligible.

I'm also quite interested in the arguments for the opposite solution;
I started from GEOM because it seemed to be a) what was proposed and
requested on the ideas page, and b) cleaner, at least for a prototype.
I also wanted to start with some code in order to evaluate the
performance penalties of that approach.
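To make the discussion more concrete, here is a minimal sketch of the
shape the prototype currently takes. The names (g_sched_softc,
g_sched_start, g_sched_done) and the softc layout are only
illustrative, initialization (bioq_init() and mtx_init() at geom
creation time) is omitted, and so is everything except the basic
dispatch decision:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/bio.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>
    #include <geom/geom.h>

    /*
     * Illustrative per-geom state: a queue of waiting requests and
     * the number of requests currently outstanding at the provider.
     */
    struct g_sched_softc {
        struct mtx              sc_mtx;
        struct bio_queue_head   sc_queue;
        int                     sc_in_flight;
    };

    static void g_sched_done(struct bio *cbp);

    static void
    g_sched_start(struct bio *bp)
    {
        struct g_geom *gp = bp->bio_to->geom;
        struct g_sched_softc *sc = gp->softc;
        struct g_consumer *cp = LIST_FIRST(&gp->consumer);
        struct bio *cbp;

        /* Clone the request, as any transformation geom does. */
        cbp = g_clone_bio(bp);
        if (cbp == NULL) {
            g_io_deliver(bp, ENOMEM);
            return;
        }
        cbp->bio_done = g_sched_done;

        mtx_lock(&sc->sc_mtx);
        if (sc->sc_in_flight == 0) {
            /* Nothing under service: pass the request down now. */
            sc->sc_in_flight++;
            mtx_unlock(&sc->sc_mtx);
            g_io_request(cbp, cp);
        } else {
            /* Otherwise hold it until the current request is done. */
            bioq_insert_tail(&sc->sc_queue, cbp);
            mtx_unlock(&sc->sc_mtx);
        }
    }

The interesting part, of course, is the policy deciding when the
queued requests are released, which is where anticipation would live.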
I am a little scared by the prospect of changing the queueing
mechanisms that drivers use, as modifications of that kind can be
difficult to write, test and maintain, but I'd really like to know
what people with experience in those kernel areas think about the
possibility of doing more complex I/O scheduling, with some sort of
unified interface, at this level. As a side note, I have not posted
any performance numbers yet because at the moment I only have access
to old ATA drives that would not give significant results.

> > o How can anticipation be introduced into the GEOM framework?
> I wouldn't focus on just anticipation, but also other types of
> schedulers (I/O scheduling influenced by nice value?)

That would be interesting, especially for the background fsck case.
I think that some kind of fair-sharing approach should be used; as
you say below, a priority-driven scheduler can have interactions with
the VFS that are difficult to track. (This problem was pointed out
also in one of the follow-ups to [1].)

> > o How to deal with metadata requests and other VFS issues?
> Like any other disk request, though for priority-respecting
> schedulers this may get rather interesting.
>
> [...]
> > The main idea is to allow the scheduler to enqueue the requests
> > having only one (other small fixed numbers can be better on some
> > hardware) outstanding request and to pass new requests to its
> > provider only after the service of the previous one ended.
> You'll want to queue at least two requests at once. The reason for
> this is performance:
> Currently, drivers queue their own I/O. This means that as soon
> as a request completes (on devices that don't have in-device
> queues), they can fairly quickly grab a new request from their
> internal queue and push it back to the device from the interrupt
> handler or some other fast method.

Wouldn't that require, to be sustainable (unless you want a fast
dispatch only every other request), that the driver queue always
contain two or more requests? To keep it that way you have to ask
the upper scheduler for a new request every time one is taken from
the driver queue and sent to the disk, or in any case before the
request under service completes. But then you cannot have an
anticipation mechanism, because the next request you would want to
dispatch from the upper scheduler has not been issued yet (it will
be issued only after the one being served completes and the
userspace application gets to run again.)

> Having the device idle while the response percolates up the geom
> stack and a new request down will likely be rather wasteful.

I completely agree. I have only done it this way because it was the
least intrusive option I could find. What could be other, more
efficient alternatives? (Obviously without subverting any of the
existing interfaces, and still allowing the anticipation of requests.)

> For disks with queuing, I'd recommend trying to keep the queue
> reasonably full (unless the queuing strategy says otherwise),
> for disks without queuing I'd say we want to push at least one
> more request down. Personally, I think the sanest design would
> be to have device drivers return a "temporary I/O error" along
> the lines of EAGAIN, meaning their queue is full.

This can be easily done for asynchronous requests.
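Just to fix the ideas, with such an interface the dispatch loop at
the scheduler level could look like the sketch below (reusing the
softc of the earlier fragment). Note that g_io_request_nb() is
entirely made up: today g_io_request() returns void and errors only
travel back through the completion path, so something like this would
need a small extension of the interface towards the lower layers:

    /*
     * Hypothetical dispatch loop: keep feeding the provider until
     * the driver below reports that its internal queue is full.
     * g_io_request_nb() is an assumed variant of g_io_request()
     * that returns EAGAIN, leaving the bio untouched, when there
     * is no room below; it does not exist in the current tree.
     * Locking is omitted for brevity.
     */
    static void
    g_sched_dispatch(struct g_sched_softc *sc, struct g_consumer *cp)
    {
        struct bio *bp;

        while ((bp = bioq_first(&sc->sc_queue)) != NULL) {
            bioq_remove(&sc->sc_queue, bp);
            if (g_io_request_nb(bp, cp) == EAGAIN) {
                /* Queue full: put the request back and stop. */
                bioq_insert_head(&sc->sc_queue, bp);
                break;
            }
            sc->sc_in_flight++;
        }
    }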
The troubles arise when dealing with synchronous requests: if you
dispatch more than one of them, you are serving more than one process,
and you get long seek times between the requests you have dispatched.
I think we should do some measurements on real hardware (I will, as
soon as I have access to a realistic test system), looking at what
happens when we allow multiple outstanding requests.

> > The example scheduler in the draft takes the following approach:
> >
> >   o a scheduling GEOM class is introduced. It can be stacked on
> >     top of disk geoms, and schedules all the requests coming
> >     from its consumers. I'm not absolutely sure that a new class
> >     is really needed, but I think that it can simplify testing and
> >     experimenting with various solutions on the scheduler placement.
> Probably, though we'll want to make sure that they stack on top of
> (or are inside of?) the geoms talking to the disks, because it rarely
> makes sense to put a queuing geom on top of, say, a disklabel geom.
>
> The advantage of making it a full geom is configurability. You would
> be able to swap out a scheduler at runtime, select different
> schedulers for different disks, and potentially even test new
> schedulers without rebooting (though you wouldn't want to do that
> for benchmarks).

Ok.

> >   o Requests coming from consumers are passed down immediately
> >     if there is no other request under service, otherwise they
> >     are queued in a bioq.
> This is specific to the anticipatory scheduler. I would say in more
> general terms:
> - A queuing geom is to push all requests that it wants serviced down
>   towards the disk, until the disk reports a queue full. A queuing
>   geom is allowed to hold back requests even when the driver queue
>   is not full yet, if it does not want the disk to attempt such I/O
>   yet (such as the anticipatory scheduler waiting for another disk
>   request near the last one, or the process-priority scheduler
>   holding back a low-priority request that would potentially cause
>   a long seek, until I/O has been idle).
> This dispels phk's anti-geom argument of "it will be inefficient
> because it will take longer for a new request to get to the driver" -
> if the queuing strategy had wanted the request to be sent to the
> drive, it would already have sent it. (The exception is that the disk
> will have its internal queue a little more empty than it could be -
> not a big issue with queue sizes of 8 or 16.)

Well, it will take longer for new requests to get to the driver,
because we are inserting a whole new geom (or a new layer into an
existing one) in the path of every request. Of course, the benefits
should show up in the time it takes to complete the requests, or in
other properties of the service, such as fairness or bounded service
times. The mechanism you describe, a high-level scheduler that keeps
an underlying queue full, is somewhat similar to what Hybrid did with
its two-stage pipelined structure, but to the best of my knowledge it
did not work with synchronous requests, for the same reasons described
above.
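To make the draft's behaviour concrete, its completion path looks
more or less like the sketch below; the names follow the earlier
fragments and, again, this is only an illustration (the anticipation
timer, which would go where the comment is, is left out):

    static void
    g_sched_done(struct bio *cbp)
    {
        struct g_geom *gp = cbp->bio_from->geom;
        struct g_sched_softc *sc = gp->softc;
        struct bio *next;

        mtx_lock(&sc->sc_mtx);
        sc->sc_in_flight--;

        /*
         * An anticipatory scheduler would arm a timer here and wait
         * for a new request close to the completed one; this sketch
         * just picks the next queued request, if there is one.
         */
        next = bioq_first(&sc->sc_queue);
        if (next != NULL) {
            bioq_remove(&sc->sc_queue, next);
            sc->sc_in_flight++;
        }
        mtx_unlock(&sc->sc_mtx);

        /* Complete the parent bio; g_std_done() destroys the clone. */
        g_std_done(cbp);

        if (next != NULL)
            g_io_request(next, LIST_FIRST(&gp->consumer));
    }

Dispatching directly from the done routine keeps the example short;
a real implementation would probably defer the dispatch, to avoid
doing too much work in the g_up thread.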
> [...]
> > So, as I've said, I'd like to know what you think about the
> > subject, if I'm missing something, if there is some kind of
> > interest in this, and if/how this work can proceed.
> I would think that this would be quite useful, but I don't think my
> voice counts for much ;-)
> It would help
> - servers where anticipatory scheduling performs better than the
>   elevator
> - realtime environments that need a scheduler fitting their needs
> - the background fsck, if someone implements a "priority" scheduler
>
> Anyway, that's a quick dump of my thoughts on the subject so far;
> I've wanted to get started on this myself but haven't got around to
> it yet (I'm fairly new to FreeBSD).
> If you want to hash some ideas out with me, I'll be watching my
> inbox and this ML, and you can reach me on IRC as BearPerson on
> freenode, quakenet, undernet, or whatever else you ask me to connect
> to, or via whatever other method is convenient to you ;)

Thank you very much, your help is really appreciated. I'll try to set
up some kind of testing environment and share some results.

[1] http://lists.freebsd.org/pipermail/freebsd-geom/2007-January/001854.html