Date: Mon, 10 Apr 2000 08:24:53 -0700
From: Julian Elischer <julian@elischer.org>
To: Poul-Henning Kamp <phk@freebsd.org>
Cc: arch@freebsd.org
Subject: Re: BUF/BIO roadmap.
Message-ID: <38F1F245.2781E494@elischer.org>
References: <23546.955367727@critter.freebsd.dk>

Poul-Henning Kamp wrote:
>
> Core asked me to produce a short document about what I am trying to
> do with the struct buf / struct bio and all that jazz.
>
> This paper can be found here: http://phk.freebsd.dk/Bio/bio.ps
>
> There are two parts to it:
>
> 1. The argumentation for splitting struct bio out from struct buf
>
> 2. A road map for the stackable BIO system.

I agree with all that is said. Some comments follow; I would like to see these issues addressed.

When I did this I didn't try to separate the buf into two structures, but rather introduced a structure called an iorequest (ioreq). This structure was only present in the disk stacking layer, to limit changes elsewhere in the kernel for stability reasons and to allow the kernel to still be compiled without the new structure. (This was a mistake.) The top-level strategy routine allocated one of these and extracted the needed fields out of the struct buf. The aim was eventually to do a cleanup on struct buf once ioreq had become ubiquitous in drivers, and remove all the I/O-related fields. Different approach, same result.

From memory, I managed to do this without having both a pblkno and an lblkno in the bio struct (my ioreq), so I wonder whether both are really needed. (I noticed that you still have both in the document. I may have got the names wrong, as the document is not in front of me.)

The devices and stacking had exactly the same semantics as you suggest (re: refusing to open clashing devices etc.), so I agree totally with that.

You make no mention of how one maps an arbitrarily stacked set of partitions into minor numbers. For example, what is the minor number of a partition called da2s1de?

[scsi-disk(2)]----[MBR(1)]---[BSD(d)]---[BSD(e)]--device_node

(Here someone has put a BSD partition within a BSD partition, which should be legal, right?) I handled this by allocating minors on the fly and making DEVFS a required item. What is your suggestion?

The issue of only physically mapping bufs is not related.
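
To make the "minors on the fly" idea above concrete, here is a minimal user-space sketch, not the code I actually had. It assumes a fixed-size table keyed by the full stacked node name (e.g. "da2s1de"). The names minor_for_node(), minor_release(), MAX_NODES and NAME_LEN are invented for this illustration and are not kernel interfaces.

```c
#include <assert.h>
#include <string.h>

#define MAX_NODES 64    /* arbitrary table size for this sketch */
#define NAME_LEN  32    /* arbitrary name limit for this sketch */

/* One slot per stacked device node; the slot index doubles as the minor. */
struct node_entry {
    char name[NAME_LEN];
    int  in_use;
};

static struct node_entry node_table[MAX_NODES];

/*
 * Return the minor for a node name, allocating the lowest free slot the
 * first time the name is seen. Returns -1 if the name is too long or the
 * table is full.
 */
int
minor_for_node(const char *name)
{
    int i, free_slot = -1;

    if (strlen(name) >= NAME_LEN)
        return -1;
    for (i = 0; i < MAX_NODES; i++) {
        if (node_table[i].in_use) {
            if (strcmp(node_table[i].name, name) == 0)
                return i;               /* already has a minor */
        } else if (free_slot < 0) {
            free_slot = i;              /* remember the lowest free slot */
        }
    }
    if (free_slot < 0)
        return -1;
    strcpy(node_table[free_slot].name, name);
    node_table[free_slot].in_use = 1;
    return free_slot;
}

/* Give a minor back when its node goes away (e.g. partition invalidated). */
void
minor_release(int minor)
{
    assert(minor >= 0 && minor < MAX_NODES);
    node_table[minor].in_use = 0;
    node_table[minor].name[0] = '\0';
}
```

Because the number carries no meaning of its own in a scheme like this, something such as DEVFS has to publish the name-to-node mapping, which is why I made it a required item.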

Unless we get BSDI-style interrupt threads, the idea of propagating 'probe' operations upward cannot be done safely (believe me, I looked at this a lot). I even had it running that way. It works, but it's not guaranteed safe. (It never bit me, but statistically it would eventually bite someone.)

The solution I eventually came to, but was never given the opportunity to check in, was to immediately propagate the 'arrival' events to a separate kernel daemon, called devd, and queue devd to be scheduled. The events would be queued for devd's attention. Each event consisted of information and a function to run. Thus each driver would schedule one of its functions to be run at kernel process level (where sleeping on I/O is possible), and that function would be responsible for initiating the probes for partition types etc.

The other problem I faced was the possibility that while a low-level device was open, the user might rewrite the structures that defined some upper-layer devices. My solution was that on the close() of the lower-level device, all the upper-level devices were asked to verify that they were still valid. This was a variant of the probe() call that used a lot of the same code in some cases. The 'verify' request propagated up (in the context of the closing process, so devd was not involved), and on encountering a newly invalidated partition it was switched into an 'invalidate' request, which was further propagated up to any higher-layer devices.

One result of this was that, as all close operations on direct devices (effectively) caused reprobing, all that devd had to do to probe a device when asked was { open(); close(); } on the lowest-level device.

As an example of how this worked: closing /dev/da0 would ask the MBR node to re-examine the MBR. It in turn would pass up 'invalidate' events to any (old) partitions that were not the same, and 'verify' events to any partitions that appeared the same at the MBR level.
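
The verify/invalidate walk described above can be sketched roughly as follows. This is an illustrative user-space model under stated assumptions, not the original code: struct stack_node, node_verify(), node_invalidate() and lower_device_closed() are invented names, and the still_matches flag stands in for actually re-reading the on-disk table and comparing it with what the node was built from.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 8  /* arbitrary fan-out limit for this sketch */

enum node_state { NODE_VALID, NODE_INVALID };

/* A node in the stack: the MBR handler, a slice, a BSD partition, ... */
struct stack_node {
    const char        *name;
    enum node_state    state;
    int                still_matches; /* stand-in for re-reading the disk */
    struct stack_node *child[MAX_CHILDREN];
    int                nchildren;
};

/* 'invalidate': this node and everything stacked above it is gone. */
static void
node_invalidate(struct stack_node *n)
{
    int i;

    assert(n != NULL);
    n->state = NODE_INVALID;
    for (i = 0; i < n->nchildren; i++)
        node_invalidate(n->child[i]);
}

/*
 * 'verify': if the node still looks the same, pass verify further up;
 * otherwise the request turns into an invalidate from here upward.
 */
static void
node_verify(struct stack_node *n)
{
    int i;

    assert(n != NULL);
    if (!n->still_matches) {
        node_invalidate(n);
        return;
    }
    n->state = NODE_VALID;
    for (i = 0; i < n->nchildren; i++)
        node_verify(n->child[i]);
}

/* Called on close() of the lowest-level device, in the closer's context. */
void
lower_device_closed(struct stack_node *lowest)
{
    int i;

    assert(lowest != NULL);
    for (i = 0; i < lowest->nchildren; i++)
        node_verify(lowest->child[i]);
}
```

In this model a slice whose MBR entry changed is invalidated together with any BSD partitions inside it, while an untouched sibling slice only receives a verify.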

Obviously the higher-level nodes could not have been open, or the open of /dev/da0 would not have been possible. With DEVFS it might actually be possible that opening /dev/da0 would instantaneously invalidate all the higher nodes (which would remove them from the devfs; you can't open them now anyhow), and you would just allow them to be rebuilt from scratch on the close(). (This idea may be a bit radical for some.)

In the downward propagation of open() and close() calls, we need to propagate open-for-read and open-for-read+write independently. If one partition is already open for read and then another is opened for read/write, then the 'downstream' device needs to be upgraded to read/write. However, if the read/write upper device is then closed, the downstream device should be downgraded to read-only. This cannot be guaranteed under present semantics, as only the last close is passed to the device layer.

My preliminary suggestion is the addition of a method, accessmode(), to the cdevsw entry for a device, that is called before/after/instead of the open() and close() calls IF THE DRIVER SUPPORTS IT, and that fully propagates this information. Justin Gibbs suggested that this call should also allow the driver to know WHO is making the call, and that it should also be called when a fork() or dup() call is made, so that the driver can keep an accurate accounting of which modes are presently in use and which are not. (I am including the whole stacking framework under the name "driver" here.) This needs further discussion, and I think there may be better solutions.

Some upward-propagating events, such as revocation, may be safe at the interrupt level. However, I think that a general mechanism such as I implemented with devd can be a win in the long run, as it can be proven to be safe, with the only point of danger being limited to the code that queues the action request, which can be kept small enough to be rigorously analysed.
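
Going back to the read/write upgrade and downgrade case: the bookkeeping an accessmode()-style call would need can be sketched as below. This is only a user-space illustration of the counting, assuming the downstream device is told about every open and close rather than just the last close. No such cdevsw method exists today, and struct lower_dev, lower_access_open() and lower_access_close() are invented names.

```c
#include <assert.h>

enum dev_mode { MODE_CLOSED, MODE_RO, MODE_RW };

/* Per-device accounting of how the nodes stacked above have it open. */
struct lower_dev {
    int           readers;      /* upper opens that are read-only */
    int           writers;      /* upper opens that are read/write */
    enum dev_mode mode;         /* mode the device is currently in */
    int           mode_changes; /* transitions so far, for illustration */
};

/* Work out the mode the current set of upper opens requires. */
static enum dev_mode
wanted_mode(const struct lower_dev *d)
{
    if (d->writers > 0)
        return MODE_RW;
    if (d->readers > 0)
        return MODE_RO;
    return MODE_CLOSED;
}

/* Upgrade or downgrade the device if the required mode has changed. */
static void
lower_update(struct lower_dev *d)
{
    enum dev_mode want = wanted_mode(d);

    if (want != d->mode) {
        d->mode = want;     /* a real driver would reopen or remode here */
        d->mode_changes++;
    }
}

/* An upper node was opened; for_write is nonzero for read/write. */
void
lower_access_open(struct lower_dev *d, int for_write)
{
    if (for_write)
        d->writers++;
    else
        d->readers++;
    lower_update(d);
}

/* An upper node was closed; must mirror the matching open. */
void
lower_access_close(struct lower_dev *d, int for_write)
{
    if (for_write) {
        assert(d->writers > 0);
        d->writers--;
    } else {
        assert(d->readers > 0);
        d->readers--;
    }
    lower_update(d);
}
```

With a reader open, a writer opening upgrades the device to read/write, and the writer closing drops it back to read-only, which is exactly the transition the last-close-only semantics cannot express. Calls on fork() or dup(), as Justin suggested, would feed the same counters.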

As I have said before, it's a pity that NIH removed my code, but having this pretty much identical code added certainly is the right thing, even if it is 2 years later than it would have been. (* had to say it, you know..)

>
> Barring any competent objections, the patch still up for review
> at http://phk.freebsd.dk/misc will be committed and work progress
> according to the roadmap.

I haven't looked again, but I assume this is still the patch I agreed to before.

>
> --
> Poul-Henning Kamp             | UNIX since Zilog Zeus 3.20
> phk@FreeBSD.ORG               | TCP/IP since RFC 956
> FreeBSD coreteam member       | BSD since 4.3-tahoe
> Never attribute to malice what can adequately be explained by incompetence.
>
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-arch" in the body of the message

--
      __--_|\  Julian Elischer
     /       \ julian@elischer.org
    (   OZ    ) World tour 2000
---> X_.---._/  presently in:  Perth
            v