From owner-freebsd-arch  Mon Feb  5 13:24:29 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from fw.wintelcom.net (ns1.wintelcom.net [209.1.153.20])
	by hub.freebsd.org (Postfix) with ESMTP
	id B246C37B401; Mon,  5 Feb 2001 13:24:07 -0800 (PST)
Received: (from bright@localhost)
	by fw.wintelcom.net (8.10.0/8.10.0) id f15LLr011092;
	Mon, 5 Feb 2001 13:21:53 -0800 (PST)
Date: Mon, 5 Feb 2001 13:21:52 -0800
From: Alfred Perlstein <bright@wintelcom.net>
To: Poul-Henning Kamp <phk@critter.freebsd.dk>
Cc: "Justin T. Gibbs" <gibbs@scsiguy.com>,
	Randell Jesup <rjesup@wgate.com>,
	Matt Dillon <dillon@earth.backplane.com>,
	Matthew Jacob <mjacob@feral.com>, Mike Smith <msmith@FreeBSD.ORG>,
	Dag-Erling Smorgrav <des@ofug.org>,
	Dan Nelson <dnelson@emsphone.com>,
	Seigo Tanimura <tanimura@r.dl.itc.u-tokyo.ac.jp>, arch@FreeBSD.ORG
Subject: Re: Bumping up {MAX,DFLT}*PHYS (was Re: Bumping up {MAX,DFL}*SIZ in i386)
Message-ID: <20010205132152.E26076@fw.wintelcom.net>
References: <20010205124707.Y26076@fw.wintelcom.net> <28618.981406901@critter>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <28618.981406901@critter>; from phk@critter.freebsd.dk on Mon, Feb 05, 2001 at 10:01:41PM +0100
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

* Poul-Henning Kamp <phk@critter.freebsd.dk> [010205 13:01] wrote:
> In message <20010205124707.Y26076@fw.wintelcom.net>, Alfred Perlstein writes:
> 
> >One of the suggestions that Poul-Henning made was to have the device
> >somehow specify an optimal clustering strategy, being able to specify
> >bounds and sizes.
> >
> >[...]
> >
> >Currently (i think) we only cluster based on logical file offsets,
> >it would be interesting to allow drivers to do callbacks into the
> >FS to ask for blocks physically adjacent to the blocks being written.
> 
> I've been playing with various ideas in this area, and to be frank,
> totally failed to come up with a breakthrough.
> 
> Given methods like striping and RAID-5, it becomes nontrivial to
> find a specification language for the driver to say "it would be
> quick to write the following blocks also" and it would be even
> slower to determine if this was indeed feasible.

You're right, it's non-trivial.  However, the difference between
memory and disk speed is also non-trivial, so almost every reasonable
algorithm should be considered to reduce/optimize disk traffic.

A simple call into the VFS should be able to accomplish this.  AFAIK,
when a VFS has a disk/physical backing it also hashes/sorts bufs based
on physical backing location.  Although I may be remembering stuff
from 4.3BSD or 4.4BSD instead of the current code...

In fact, if it is stored and hashed in the bufs you really don't need
a callback into the VFS, you just need a generic function to call
that gathers blocks that are dirty, unlocked and physically
contiguous.
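
As a rough sketch of what such a generic function might look like
(struct xbuf and gather_contig() are made-up names for illustration,
not the real struct buf or any existing kernel routine, and it assumes
the bufs are already sorted by physical block number):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a buffer header. */
struct xbuf {
	long	pblkno;		/* physical block number on the device */
	int	dirty;		/* nonzero if the buffer needs writing */
	int	locked;		/* nonzero if someone else holds it */
};

/* A neighbour can join the run only if it is dirty and unlocked. */
static int
xbuf_usable(const struct xbuf *bp)
{
	return (bp->dirty && !bp->locked);
}

/*
 * bufs[] is assumed sorted by pblkno.  Starting from bufs[start], which
 * must be written anyway, grow the run forwards and then backwards while
 * the neighbour is physically adjacent and usable, up to max buffers.
 * Stores the index of the first buffer of the run in *first and returns
 * the number of buffers in the run.
 */
size_t
gather_contig(const struct xbuf *bufs, size_t n, size_t start, size_t max,
    size_t *first)
{
	size_t lo = start, hi = start;

	assert(start < n && max >= 1);
	while (hi - lo + 1 < max && hi + 1 < n &&
	    xbuf_usable(&bufs[hi + 1]) &&
	    bufs[hi + 1].pblkno == bufs[hi].pblkno + 1)
		hi++;
	while (hi - lo + 1 < max && lo > 0 &&
	    xbuf_usable(&bufs[lo - 1]) &&
	    bufs[lo - 1].pblkno + 1 == bufs[lo].pblkno)
		lo--;
	*first = lo;
	return (hi - lo + 1);
}
```

The caller would then chain bufs[*first] through the end of the run
into one request.  The starting buffer is always included, so the
worst case is a single-block write.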

> "feasible" covers not only "do we have it in RAM", but also "is it
> already scheduled for writing", "is it dirty" and not the least
> "would softupdates take a fit if we wrote it".

This is why callbacks into the VFS are probably a good idea along
with a generic function that accomplishes what we currently do,
except without the vm-remapping into the pbuf.  (use a linked
chain of bufs instead)

> The best I have been able to do so far is if the device-driver
> can specify the following quantities:
> 
> 	(M) maximum request size
> 	(R) preferred request size
> 	(B) preferred request sector boundary 
> 
> The clustering code would then try to increase request to:
> 
> 	N * R sectors starting X
> 	where X mod B == 0
> 	and N * R <= M
> 
> Having found a cluster opportunity, the cluster code will
> issue the read/write request specifying:
> 
> 	(E) First possible sector in request
> 	(S) First mandatory sector in request
> 	(L) Last mandatory sector in request
> 	(F) Last possible sector in request
> 	(B) Sector address of (S) on media.
>
> The driver has to process the data from [S ... L],
> and can optionally process [E...S[ and ]L...F] if
> that seems convenient.


Well, there are some assertions and questions I have about this:

1) a device should not refuse to write a block unless there's an
   error, meaning if 'S' can't be satisfied, it should at least
   write the single block out.
   I think S & L pretty much have to be equal to each other, otherwise
   we can have tricky issues to deal with where S through L never
   become clusterable (they are locked for long periods, or just
   clean).

2) the device should be able to allow a certain amount of
   fragmentation.  Currently (afaik) the clustering code does
   not tolerate gaps, clean bufs or locked bufs within the
   request.  This ought to be changed: there's no reason why
   a request really needs to be completely contiguous, as the
   really painful part of disk I/O is the seek.  Being able
   to cluster data with gaps on the same track/cyl is much
   more important than not having any breaks in it at all.

3) with #2, it would be important to specify a tolerance for such
   'holes' in the cluster operation in case the device does have
   a penalty for gaps.
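
Points 2 and 3 together could look something like the sketch below.
cluster_with_holes(), max_gap and max_span are hypothetical names, and
the input is assumed to be the sorted physical block numbers of bufs
that are already known to be dirty and unlocked:

```c
#include <assert.h>
#include <stddef.h>

/*
 * blks[] holds strictly increasing physical block numbers of writable
 * bufs.  Starting at blks[start], keep adding the next block as long as
 * the hole in front of it is at most max_gap blocks and the whole
 * cluster, holes included, spans at most max_span blocks.
 * Returns the number of entries in the cluster (at least 1).
 */
size_t
cluster_with_holes(const long *blks, size_t n, size_t start, long max_gap,
    long max_span)
{
	size_t i = start;

	assert(start < n && max_gap >= 0 && max_span >= 1);
	while (i + 1 < n) {
		long hole = blks[i + 1] - blks[i] - 1;
		long span = blks[i + 1] - blks[start] + 1;

		if (hole < 0 || hole > max_gap || span > max_span)
			break;
		i++;
	}
	return (i - start + 1);
}
```

A device with no penalty for gaps would pass a large max_gap, and one
that wants strictly contiguous requests would pass 0, which gives back
the current behaviour.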

> If somebody is looking for a good project, benchmarking
> the performance of our current clustering and playing
> around with various changes would not be the worst 
> way to spend some winter evenings.  Playing with FFS/UFS
> options (block/fragment etc) at the same time may be
> worth while.

Actually, I'm not looking for a project, I'm looking for time. :)

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."

