Date: Tue, 16 Oct 2007 16:10:37 +0200
From: Karsten Behrmann <BearPerson@gmx.net>
To: freebsd-hackers@freebsd.org
Subject: Re: Pluggable Disk Scheduler Project
Message-ID: <20071016161037.5ab1b74f@39-25.mops.rwth-aachen.de>
In-Reply-To: <20071011022001.GC13480@gandalf.sssup.it>
References: <20071011022001.GC13480@gandalf.sssup.it>

> Hi,
> is anybody working on the `Pluggable Disk Scheduler Project' from
> the ideas page?

I've been kicking the idea around in my head, but I'm probably newer to
everything involved than you are, so feel free to pick it up. If you want,
we can toss some ideas and code to each other, though I don't really have
anything on the latter.

[...]

> After reading [1], [2] and its follow-ups the main problems that
> need to be addressed seem to be:
>
> o is working on disk scheduling worth at all?

Probably. One of the main applications would be to make the background
fsck a little more well-behaved.

> o Where is the right place (in GEOM) for a disk scheduler?

I have spent some time at eurobsdcon talking to Kirk and phk about this,
and the result was that I now know strong proponents both for putting it
into the disk drivers and for putting it into geom ;-)

Personally, I would put it into geom. I'll go into more detail on this
later, but basically, geom seems a better fit for "high-level" code than
a device driver, and if done properly, performance penalties should be
negligible.

> o How can anticipation be introduced into the GEOM framework?

I wouldn't focus on just anticipation, but also on other types of
schedulers (I/O scheduling influenced by nice value?)

> o What can be an interface for disk schedulers?

Good question, but geom seems a good start ;)

> o How to deal with devices that handle multiple request per time?

Bad news first: this is most disks out there, in a way ;)
SCSI has tagged queuing, ATA has native command queuing, or whatever the
ATA people came up with over their morning coffee today. I'll mention a
bit more about this further down.

> o How to deal with metadata requests and other VFS issues?

Like any other disk request, though for priority-respecting schedulers
this may get rather interesting.

[...]

> The main idea is to allow the scheduler to enqueue the requests
> having only one (other small fixed numbers can be better on some
> hardware) outstanding request and to pass new requests to its
> provider only after the service of the previous one ended.

You'll want to queue at least two requests at once. The reason for this is
performance:
Currently, drivers queue their own I/O. This means that as soon as a
request completes (on devices that don't have in-device queues), they can
fairly quickly grab a new request from their internal queue and push it
back to the device from the interrupt handler or some other fast method.
Having the device idle while the response percolates up the geom stack
and a new request down will likely be rather wasteful.

For disks with queuing, I'd recommend trying to keep the queue reasonably
full (unless the queuing strategy says otherwise); for disks without
queuing, I'd say we want to push at least one more request down.
Personally, I think the sanest design would be to have device drivers
return a "temporary I/O error" along the lines of EAGAIN, meaning their
queue is full.

> The example scheduler in the draft takes the following approach:
>
> o a scheduling GEOM class is introduced. It can be stacked on
>   top of disk geoms, and schedules all the requests coming
>   from its consumers. I'm not absolutely sure that a new class
>   is really needed but I think that it can simplify testing and
>   experimenting with various solutions on the scheduler placement.

Probably, though we'll want to make sure that they stack on top of (or are
inside of?) the geoms talking to the disks, because it rarely makes sense
to put a queuing geom on top of, say, a disklabel geom.

The advantage of making it a full geom is configurability.
You would be able to swap out a scheduler at runtime, select different
schedulers for different disks, and potentially even test new schedulers
without rebooting (though you wouldn't want to do that for benchmarks).

> o Requests coming from consumers are passed down immediately
>   if there is no other request under service, otherwise they
>   are queued in a bioq.

This is specific to the anticipatory scheduler. I would say in more
general terms:
- A queuing geom is to push all requests that it wants serviced down
  towards the disk, until the disk reports a queue full. A queuing geom
  is allowed to hold back requests even when the driver queue is not
  full yet, if it does not want the disk to attempt such I/O yet (such
  as the anticipatory scheduler waiting for another disk request near
  the last one, or the process-priority scheduler holding back a
  low-priority request that would potentially cause a long seek, until
  I/O has been idle).

This dispels phk's anti-geom argument of "it will be inefficient because
it will take longer for a new request to get to the driver" - if the
queuing strategy had wanted the request to be sent to the drive, it would
already have sent it. (The exception is that the disk will have its
internal queue a little more empty than it could be - not a big issue
with queue sizes of 8 or 16.)

[...]

> So, as I've said, I'd like to know what you think about the subject,
> if I'm missing something, if there is some kind of interest on this
> and if/how can this work proceed.

I would think that this would be quite useful, but I don't think my voice
counts for much ;-)
It would help:
- servers where anticipatory performs better than elevator
- realtime environments that need a scheduler fitting their needs
- the background fsck, if someone implements a "priority" scheduler

Anyway, that's a quick dump of my thoughts on the subject so far. I've
myself wanted to get started on this but didn't get around to it yet
(I'm fairly new to FreeBSD).
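
To make the "push down until the disk reports a queue full" rule a bit
more concrete, here is a minimal userland sketch. It is not kernel code
and does not use any real GEOM or driver interface: struct driver,
struct sched, drv_submit(), drv_complete(), sched_enqueue(),
sched_dispatch(), the may_dispatch hook, the toy near_only() strategy and
both size constants are names made up for illustration. The only thing
taken from the discussion above is the idea that the driver answers
EAGAIN when its queue is full, and that the queuing layer may hold a
request back on its own.

```c
/* Userland sketch only: none of these names are real GEOM or driver APIs. */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define DRV_QUEUE_DEPTH 2   /* assumed in-device queue depth */
#define SCHED_MAX       16  /* assumed capacity of the scheduler's own queue */

struct request { long block; };

/* Hypothetical driver: only tracks how many requests it has accepted. */
struct driver { int inflight; };

/* Returns 0 if accepted, EAGAIN if the driver queue is full. */
static int
drv_submit(struct driver *d, const struct request *r)
{
    (void)r;
    if (d->inflight >= DRV_QUEUE_DEPTH)
        return (EAGAIN);
    d->inflight++;
    return (0);
}

/* Called when the device finishes one request; frees one queue slot. */
static void
drv_complete(struct driver *d)
{
    assert(d->inflight > 0);
    d->inflight--;
}

/* Hypothetical queuing layer: a FIFO ring plus an optional strategy hook. */
struct sched {
    struct request q[SCHED_MAX];
    size_t head, count;
    bool (*may_dispatch)(const struct request *); /* NULL: never hold back */
};

/* Toy strategy: hold back anything that would mean a "long seek". */
static bool
near_only(const struct request *r)
{
    return (r->block < 1000);
}

/* Returns 0 on success, ENOSPC if the scheduler's own queue is full. */
static int
sched_enqueue(struct sched *s, struct request r)
{
    if (s->count == SCHED_MAX)
        return (ENOSPC);
    s->q[(s->head + s->count) % SCHED_MAX] = r;
    s->count++;
    return (0);
}

/*
 * Push requests down until the driver reports EAGAIN or the strategy
 * decides to hold the next request back. Returns the number sent.
 */
static int
sched_dispatch(struct sched *s, struct driver *d)
{
    int sent = 0;

    while (s->count > 0) {
        const struct request *r = &s->q[s->head];

        if (s->may_dispatch != NULL && !s->may_dispatch(r))
            break;              /* strategy holds this one back */
        if (drv_submit(d, r) == EAGAIN)
            break;              /* driver queue is full */
        s->head = (s->head + 1) % SCHED_MAX;
        s->count--;
        sent++;
    }
    return (sent);
}
```

The intended use would be to call sched_dispatch() after every
sched_enqueue() and again after every drv_complete(), so the driver
queue is topped up as soon as a slot frees, unless the strategy hook
(near_only() here, or NULL for none) says to wait.
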
If you want to hash some ideas out with me, I'll be watching my inbox and
this ML, and you can reach me on IRC as BearPerson on freenode, quakenet,
undernet, or whatever else you ask me to connect to, or via whatever
other method is convenient to you ;)

So Far,
  Karsten "BearPerson" Behrmann