Date:      Tue, 16 Oct 2007 16:10:37 +0200
From:      Karsten Behrmann <BearPerson@gmx.net>
To:        freebsd-hackers@freebsd.org
Subject:   Re: Pluggable Disk Scheduler Project
Message-ID:  <20071016161037.5ab1b74f@39-25.mops.rwth-aachen.de>
In-Reply-To: <20071011022001.GC13480@gandalf.sssup.it>
References:  <20071011022001.GC13480@gandalf.sssup.it>

> Hi,
>     is anybody working on the `Pluggable Disk Scheduler Project' from
> the ideas page?
I've been kicking the idea around in my head, but I'm probably newer to
everything involved than you are, so feel free to pick it up. If you want,
we can toss some ideas and code to each other, though I don't really
have anything on the latter.

[...]
> After reading [1], [2] and its follow-ups the main problems that
> need to be addressed seem to be:
> 
>     o Is working on disk scheduling worth it at all?
Probably. One of the main applications would be making the background
fsck a little better behaved.

>     o Where is the right place (in GEOM) for a disk scheduler?
I have spent some time at EuroBSDCon talking to Kirk and phk about
this, and the result was that I now know strong proponents both for
putting it into the disk drivers and for putting it into geom ;-)

Personally, I would put it into geom. I'll go into more detail on
this later, but basically, geom seems a better fit for "high-level"
code than a device driver, and if done properly performance penalties
should be negligible.

>     o How can anticipation be introduced into the GEOM framework?
I wouldn't focus on just anticipation, but also other types of
schedulers (I/O scheduling influenced by nice value?)

>     o What can be an interface for disk schedulers?
Good question, but geom seems a good start ;)

>     o How to deal with devices that handle multiple requests at a time?
Bad news first: this is most disks out there, in a way ;)
SCSI has tagged queuing, ATA has native command queuing or
whatever the ata people came up with over their morning coffee today.
I'll mention a bit more about this further down.

>     o How to deal with metadata requests and other VFS issues?
Like any other disk request, though for priority-respecting
schedulers this may get rather interesting.

[...]
> The main idea is to allow the scheduler to enqueue the requests
> having only one (other small fixed numbers can be better on some
> hardware) outstanding request and to pass new requests to its
> provider only after the service of the previous one ended.
You'll want to queue at least two requests at once. The reason for
this is performance:
Currently, drivers queue their own I/O. This means that as soon
as a request completes (on devices that don't have in-device
queues), they can fairly quickly grab a new request from their
internal queue and push it back to the device from the interrupt
handler or some other fast method.
Having the device sit idle while the response percolates up the
geom stack and a new request travels back down will likely be
rather wasteful.
For disks with queuing, I'd recommend trying to keep the queue
reasonably full (unless the queuing strategy says otherwise),
for disks without queuing I'd say we want to push at least one
more request down. Personally, I think the sanest design would
be to have device drivers return a "temporary I/O error" along
the lines of EAGAIN, meaning their queue is full.
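To make that concrete, here's a rough sketch of the convention I'm
proposing, in kernel-style C. All the names (xx_strategy, xx_done,
XX_QUEUE_DEPTH) are made up for illustration - no driver works like
this today:

    #include <sys/param.h>
    #include <sys/bio.h>
    #include <sys/errno.h>

    #define XX_QUEUE_DEPTH  8       /* e.g. the device's NCQ/TCQ depth */

    static int xx_inflight;         /* requests currently on the hardware */

    /*
     * Hypothetical strategy entry point: refuse work when the device
     * queue is full, instead of buffering it in a driver-internal bioq.
     */
    static int
    xx_strategy(struct bio *bp)
    {
            if (xx_inflight >= XX_QUEUE_DEPTH)
                    return (EAGAIN);    /* queue full, retry on completion */
            xx_inflight++;
            /* ... hand bp to the hardware here ... */
            return (0);
    }

    /* Completion path: free a slot and let the scheduler push more. */
    static void
    xx_done(struct bio *bp)
    {
            xx_inflight--;
            biodone(bp);
            /* ... poke the scheduler geom to dispatch again ... */
    }

The point is that the layer above always knows whether the driver can
take more work, without the driver keeping a private queue that the
scheduler can't reorder.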

> The example scheduler in the draft takes the following approach:
> 
>     o a scheduling GEOM class is introduced.  It can be stacked on
>       top of disk geoms, and schedules all the requests coming
>       from its consumers.  I'm not absolutely sure that a new class
>       is really needed but I think that it can simplify testing and
>       experimenting with various solutions on the scheduler placement.
Probably, though we'll want to make sure that they stack on top of
(or are inside of?) the geoms talking to the disks, because it rarely
makes sense to put a queuing geom on top of, say, a disklabel geom.

The advantage of making it a full geom is configurability. You would
be able to swap out a scheduler at runtime, select different
schedulers for different disks, and potentially even test new
schedulers without rebooting (though you wouldn't want to do that
for benchmarks).
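For the curious, the skeleton of such a class is pretty small. This
is only a pass-through sketch against the current geom API
(g_clone_bio, g_io_request and friends are real; the taste/attach
plumbing that would hook it below a disk's provider is omitted):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/bio.h>
    #include <geom/geom.h>

    /*
     * Pass every request straight through; a real scheduler would
     * decide here whether to queue the bio or dispatch it.
     */
    static void
    g_sched_start(struct bio *bp)
    {
            struct g_geom *gp = bp->bio_to->geom;
            struct bio *cbp;

            cbp = g_clone_bio(bp);
            if (cbp == NULL) {
                    g_io_deliver(bp, ENOMEM);
                    return;
            }
            cbp->bio_done = g_std_done;
            g_io_request(cbp, LIST_FIRST(&gp->consumer));
    }

    static struct g_class g_sched_class = {
            .name = "SCHED",
            .version = G_VERSION,
            /* .taste or .ctlreq would create the geom, set
             * gp->start = g_sched_start, and attach a consumer
             * below the disk's provider. */
    };
    DECLARE_GEOM_CLASS(g_sched_class, g_sched);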

>     o  Requests coming from consumers are passed down immediately
>       if there is no other request under service, otherwise they
>       are queued in a bioq.
This is specific to the anticipatory scheduler. I would put it in
more general terms:
- A queuing geom should push every request it wants serviced down
towards the disk, until the disk reports that its queue is full. A
queuing geom is allowed to hold back requests even while the driver
queue is not yet full, if it does not want the disk to attempt such
I/O yet (for example, the anticipatory scheduler waiting for another
request near the last one, or a process-priority scheduler holding
back a low-priority request that would potentially cause a long
seek, until I/O has been idle for a while). A sketch of this
dispatch loop follows below.
This dispels phk's anti-geom argument of "it will be inefficient
because it will take longer for a new request to get to the driver" -
if the queuing strategy had wanted the request to be sent to the
drive, it would already have sent it. (The exception is that the disk
will have its internal queue a little more empty than it could be -
not a big issue with queue sizes of 8 or 16)
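In code, the dispatch loop implied by that rule could look something
like this. strategy_wants_dispatch() is made up (it's wherever the
anticipation timer or priority policy would live), and xx_strategy()
is the hypothetical queue-full-reporting driver entry from the
earlier sketch; the bioq primitives are the real ones from
sys/bio.h:

    #include <sys/param.h>
    #include <sys/bio.h>
    #include <sys/errno.h>

    /* Both hypothetical, carried over from the sketches above. */
    int xx_strategy(struct bio *bp);              /* EAGAIN when full */
    int strategy_wants_dispatch(struct bio *bp);  /* policy decision */

    static struct bio_queue_head sched_queue;  /* bioq_init()ed at attach */

    static void
    sched_dispatch(void)
    {
            struct bio *bp;

            while ((bp = bioq_first(&sched_queue)) != NULL) {
                    if (!strategy_wants_dispatch(bp))
                            break;          /* policy says hold it back */
                    bioq_remove(&sched_queue, bp);
                    if (xx_strategy(bp) == EAGAIN) {
                            /* Driver queue full: put it back and wait
                             * for a completion to call us again. */
                            bioq_insert_head(&sched_queue, bp);
                            break;
                    }
            }
    }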

[...]
> So, as I've said, I'd like to know what you think about the subject,
> if I'm missing something, if there is some kind of interest on this
> and if/how can this work proceed.
I would think that this would be quite useful, but I don't think my
voice counts for much ;-)
It would help
 - servers where anticipatory performs better than elevator
 - realtime environments that need a scheduler fitting their needs
 - the background fsck, if someone implements a "priority" scheduler

Anyway, that's a quick dump of my thoughts on the subject so far.
I've wanted to get started on this myself but haven't gotten around
to it yet (I'm fairly new to FreeBSD).
If you want to hash some ideas out with me, I'll be watching my inbox
and this ML, and you can reach me on IRC as BearPerson on freenode,
quakenet, undernet, or whatever else you ask me to connect to, or via
whatever other method is convenient to you ;)

So Far,
  Karsten "BearPerson" Behrmann


