Date: Sun, 22 Mar 2009 02:00:59 +0100
From: Luigi Rizzo <rizzo@iet.unipi.it>
To: Poul-Henning Kamp <phk@phk.freebsd.dk>
Cc: luigi@freebsd.org, Pawel Jakub Dawidek <pjd@freebsd.org>, Ivan Voras <ivoras@freebsd.org>, freebsd-geom@freebsd.org
Subject: disk scheduling (was: Re: RFC: adding 'proxy' nodes to provider ports (with patch))
Message-ID: <eb21ef440903211800h266ec0aes158cb189095289c1@mail.gmail.com>

On Sat, Mar 21, 2009 at 9:24 PM, Poul-Henning Kamp <phk@phk.freebsd.dk> wrote:
> In message <20090321200334.GB3102@garage.freebsd.pl>, Pawel Jakub Dawidek writes:
>
>>       Special GEOM classes.
>>       ---------------------
>>
>>       - There are no special GEOM classes.
>>
>> I wonder if phk changed his opinion over time. :)
>
> He didn't.
>
>> Maybe instead of adding special providers and GEOM classes, the
>> infrastructure should be extended in some way, so that we won't use
>> the provider term to describe something that isn't really a regular
>> GEOM provider.
>
> I have not had time to read this entire thread, being somewhat
> snowed under with work elsewhere.
>
> First up, I am not sure I understand why the proxy nodes would
> be the (or even 'a') right solution for I/O scheduling.
>
> In fact, it is not very clear to me at all that scheduling should
> happen inside GEOM at all.
>
> I would tend to think that it belongs in the device driver, where
> intelligent information about things like tagged queuing abilities
> can be taken into account.
>
> For any kind of scheduling to do anything non-trivial, requests
> need to be piled up so they can be reordered; doing that in places
> where bios don't naturally pile up would require a damn good
> argument and strong numbers to convince me.
>
> Where they already do pile up, the existing disksort mechanism and
> API can be used.  (If you want to mess with the disksort
> *algorithm*, by all means do so, but that should not require you
> to hack up any APIs, apart from the one to select the algorithm.)

The thread was meant to be about inserting transparent nodes in GEOM.
Scheduling was just the example where the problem came up, but since
you ask, let's take a short diversion (and let me relabel this thread
so we can discuss the two things separately).

+ nobody disputes that the ideal place for scheduling is where
  requests naturally "pile up". Too bad this ideal place is sometimes
  one we cannot access, i.e. the firmware of the disk drive.

+ some scheduling algorithms are "non work-conserving": they work by
  delaying some requests in the hope of saving seeks. They can be very
  effective (we posted numbers in our previous posting in January, and
  the literature on anticipatory scheduling has more). Because of the
  way they work, these algorithms artificially cause queues to build
  up, so you can implement them effectively even above the device
  driver.

+ changing disksort can do some of what one would want, but not all.
  E.g. if you need to delay requests (as several disk schedulers do),
  then you must interact heavily with the driver: to make sure it does
  not assume the scheduler is work-conserving (some do, as we found
  out in the GSoC 2005 work on disk schedulers), and to find out which
  kind of locking to use when it is time to reinject delayed requests.
  So implementing certain scheduling algorithms in the device driver
  requires specific code in each and every driver (a sketch of how a
  driver uses the stock queue API today follows this list).

+ of course adding a disk scheduler to one's system is completely
  optional, and there is no intention to change any current default.
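
To make the point about the stock API concrete, here is a minimal
sketch, in the style of a disk(9) driver, of where requests pile up
today and where a different sort algorithm would plug in. The
mydisk_* names and the softc layout are hypothetical inventions for
illustration; only the bioq_*() calls, biodone(), and the mutex
primitives are the real kernel API from sys/bio.h and friends.

    /*
     * Minimal sketch, not real driver code.  sc_bioq is assumed to
     * have been set up with bioq_init() at attach time.
     */
    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/bio.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>
    #include <geom/geom_disk.h>

    struct mydisk_softc {
            struct mtx              sc_mtx;   /* serializes queue access */
            struct bio_queue_head   sc_bioq;  /* this is where bios pile up */
    };

    static void mydisk_start(struct mydisk_softc *);

    /* d_strategy: GEOM hands us one struct bio per request. */
    static void
    mydisk_strategy(struct bio *bp)
    {
            struct mydisk_softc *sc = bp->bio_disk->d_drv1;

            mtx_lock(&sc->sc_mtx);
            /*
             * One-way elevator insertion.  Changing the disksort
             * *algorithm* means replacing this one call; the queue
             * API around it stays the same.
             */
            bioq_disksort(&sc->sc_bioq, bp);
            mtx_unlock(&sc->sc_mtx);
            mydisk_start(sc);
    }

    static void
    mydisk_start(struct mydisk_softc *sc)
    {
            struct bio *bp;

            mtx_lock(&sc->sc_mtx);
            while ((bp = bioq_first(&sc->sc_bioq)) != NULL) {
                    bioq_remove(&sc->sc_bioq, bp);
                    /* ... hand bp to the controller; biodone(bp) on completion. */
            }
            mtx_unlock(&sc->sc_mtx);
    }

Note that a work-conserving replacement for bioq_disksort() slots in
trivially, while a non work-conserving one does not: mydisk_start()
above assumes it may drain the queue dry, which is exactly the
driver-interaction problem described in the list above.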
If you want a quick example of how you can fix some severe problems
with the current disk scheduler, even when scheduling above the device
driver, try the same experiments we did, first without a scheduler and
then with the geom_sched module that we posted:

1. run a few 'dd' readers in parallel on top of an ATA or SATA disk,
   and look at the overall throughput with and without the scheduler;

2. run a cvs update (or another seeky application) in parallel with a
   sequential dd reader, and look at how slowly 'dd' runs without the
   scheduler;

3. run a cvs update (or another seeky application) in parallel with a
   sequential dd writer, and look at how slowly cvs goes without the
   scheduler.

Examples #1 and #2 are a direct result of the request patterns issued
by readers, and cannot be fixed with work-conserving changes to
disksort. Each reader has only one pending request at a time, so the
disk does a seek on every request and throughput degrades heavily.
With anticipation, after serving one request you give the process a
little time to issue the next one, so you can serve a short burst of
requests from each reader, boosting both individual and overall
throughput.

Example #3 is a result of the "capture effect" of our disksort:
writers have many pending requests, and if they are for contiguous
blocks, then once one of them is served the disk keeps serving the
same process, starving the others. Here you can do a lot of useful
work even above the device driver, e.g. refuse to serve more than so
many contiguous requests in a row (sketched below).
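
On that last point, here is a hedged sketch of one way the dequeue
side could cap contiguous runs. It continues the previous sketch:
MAX_CONTIG_RUN, mydisk_next() and the two extra softc fields are my
inventions for illustration, not geom_sched's actual code; only
bioq_first()/bioq_remove() and the TAILQ layout of the queue are the
real kernel structures.

    /*
     * Sketch only: break the disksort "capture effect" by refusing
     * to serve more than MAX_CONTIG_RUN contiguous requests in a
     * row.  Assumes the softc from the previous sketch grows:
     *      off_t sc_next_offset;   -- end of the last bio served
     *      int   sc_contig_run;    -- length of the current run
     */
    #define MAX_CONTIG_RUN  8       /* arbitrary cap, tune by experiment */

    static struct bio *
    mydisk_next(struct mydisk_softc *sc)
    {
            struct bio *bp, *alt;

            mtx_assert(&sc->sc_mtx, MA_OWNED);
            bp = bioq_first(&sc->sc_bioq);
            if (bp == NULL)
                    return (NULL);
            if (bp->bio_offset == sc->sc_next_offset &&
                sc->sc_contig_run >= MAX_CONTIG_RUN) {
                    /* Run is long enough; prefer a bio from elsewhere. */
                    TAILQ_FOREACH(alt, &sc->sc_bioq.queue, bio_queue) {
                            if (alt->bio_offset != sc->sc_next_offset) {
                                    bp = alt;
                                    break;
                            }
                    }
            }
            if (bp->bio_offset == sc->sc_next_offset)
                    sc->sc_contig_run++;
            else
                    sc->sc_contig_run = 0;
            sc->sc_next_offset = bp->bio_offset + bp->bio_length;
            bioq_remove(&sc->sc_bioq, bp);
            return (bp);
    }

mydisk_start() would then call mydisk_next() instead of using
bioq_first()/bioq_remove() directly. It is deliberately crude, and a
real scheduler would also bound how far it deviates from the sorted
order, but even this simple cap stops one writer with a long
contiguous backlog from monopolizing the disk.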
cheers
luigi