From owner-freebsd-hackers Sat Oct 25 18:06:48 1997 Return-Path: Received: (from root@localhost) by hub.freebsd.org (8.8.7/8.8.7) id SAA04106 for hackers-outgoing; Sat, 25 Oct 1997 18:06:48 -0700 (PDT) (envelope-from owner-freebsd-hackers) Received: from smtp04.primenet.com (smtp04.primenet.com [206.165.5.85]) by hub.freebsd.org (8.8.7/8.8.7) with ESMTP id SAA04078 for ; Sat, 25 Oct 1997 18:06:39 -0700 (PDT) (envelope-from tlambert@usr04.primenet.com) Received: (from daemon@localhost) by smtp04.primenet.com (8.8.7/8.8.7) id SAA16040; Sat, 25 Oct 1997 18:06:37 -0700 (MST) Received: from usr04.primenet.com(206.165.6.204) via SMTP by smtp04.primenet.com, id smtpd016033; Sat Oct 25 18:06:29 1997 Received: (from tlambert@localhost) by usr04.primenet.com (8.8.5/8.8.5) id SAA15742; Sat, 25 Oct 1997 18:06:25 -0700 (MST) From: Terry Lambert Message-Id: <199710260106.SAA15742@usr04.primenet.com> Subject: Re: zipfs filesystem anyone ? To: gurney_j@resnet.uoregon.edu Date: Sun, 26 Oct 1997 01:06:25 +0000 (GMT) Cc: tlambert@primenet.com, michaelh@cet.co.jp, luigi@labinfo.iet.unipi.it, hackers@FreeBSD.ORG In-Reply-To: <19971025004544.21378@hydrogen.nike.efn.org> from "John-Mark Gurney" at Oct 25, 97 00:45:44 am X-Mailer: ELM [version 2.4 PL23] Content-Type: text Sender: owner-freebsd-hackers@FreeBSD.ORG X-Loop: FreeBSD.org Precedence: bulk > > The list terminator is still necessary for the population pass, where > > the vectors get populated. The VOP_REPLACEME needs to be seperate. > > I can see on the vnodeopv_desc of the vfs layer, but why does the NULL > entry on the array of the vnodeop_desc array in vnode_if.c need the > last NULL, that is what vfs_opv_numops is for... (which the population > code uses)... OK. This code is not obvious. I will have to refer to the original kern/vfs_init.c code for this. In vfs_opv_init(), there is a MALLOC() where the dynamic vectors are allocated. The purpose of this is that the per FS descriptor list for the VOP's defined for a given VFS layer is sparse. If you look further down in vfs_opv_init(), you will see: /* * Finally, go back and replace unfilled routines * with their default. (Sigh, an O(n^3) algorithm. I * could make it better, but that'd be work, and n is small.) */ So basically, it takes a list like: { A, D, F, J, NULL } And changes it to: { A, b, c, D, e, F, g, h, i, J, k, l } The point being that it knows the end of the per FS descriptor list by the NULL. This actually bears *directly* on my sorting recommendation: if the descriptor list is sparse, you will not have a baseline for a sort of a new descriptor entry that isn't also in the globally declared descriptor array, because only the descriptors defined in vnode_if.c will have known offsets. > so what I'm saying is we make two vars, maxvfs_opv_numops in which we > allocate all the memory for that many ops... then as a new op is added > we simply increase vfs_opv_numops to keep track of were we add the next > op.. when it's equal to (or greater) then we simply disallow adding any > new ops... Yes, this is just housekeeping, and it doesn't matter how it is accomplished so long as it doesn't preclude some future use we haven't thought of yet. It's a "don't care" state. > > > personally, a possible define that declares I want a possible X extra > > > vop vectors... and then have a couple variables that tell you the > > > maximum number, and the current last offset... this shouldn't be hard > > > to do... > > > > They kind of need to be static, unless the vector list becomes a linker > > set. This can't happen until the linker set can be agregated between > > I was talking about the size of the list being variable, but that ther > be an options like: > options "EXTRA_VNODEOPS=5" > > then we simply do: > int maxvfs_opv_numops = (sizeof(vfs_op_descs)/sizeof(struct vnodeop_desc *)) + > EXTRA_VNODEOPS; > > and then the define is like: > struct vnodeop_desc *vfs_op_descs[num] = { > }; > > and we simply modify the awk script to properly generate num... which > isn't hard... See above. The descriptors have to be unique at the time vnode_if.c is compiled; different indices pointing to the same address won't cut it. Probably you will need to have a special tag value, or you will need to have a duplicate array containing only the replacement descriptor entries portion of the vfs_op_descs table. Or you will need to save them one behind when you add the descriptor, and remove them one behind. I would caution *againt* saving them one behind, though. I could easily see a new VOP being used by two loadable FS's. To make the thing work, you would need to keep a reference count -- and that implies some place to keep it. I would suggest in a structure with the VOP descriptors you replace. Really, the output (populated) per FS structure with all entries filled in for the defaults should be possible to dereference directly -- *IFF* it's sorted. This gets rid of the function stubs for the VOP_* calls, and saves a lot of call overhead. Think about it, anyway... > arg, I see what you mean.. ffs depends upon some of ufs's functions.. > (arg, I was assuming that each module was independant)... Yes. This kind of screws up the BSDI plans for Soft Updates, since it mean the soft updates capable UFS will not be able to be shared with MFS, LFS, etc.. > right now the only way I can think of preventhing this from being a > problem is to add the MOUNT_xxx to the declaration of the vnodeops, > and then including in the ffs a dependancy of ufs... No. If the entries are sorted, and you have a reference, then an init routine can reference the previously defined loaded module (the symbol table is cumulative). This at least gets you an address to pass into the modified vfs_opv_init() function so you can convert the UFS descriptor list to a set of function pointers. The trick is to know beforehand that the populated descriptor lists are in the same order in the UFS and the FFS, and just use the offset so you don't have to reference every function by symbol. Hence the sort. It can be a simple bubble sort (that's what I used), since you can sort it in place in the populated vector (dynamic vector is a bad name for this in the code, IMO). > > Sorting the list also allows you to reference by index instead of by > > descriptor. If you think about it, you won't have a descriptor for > > a new VOP, if you dynamically load it, in vnode_if.h. This loses the > > little glue function. To fix this, you have to kill the glue functions. > > And you can do it if you can indext into the op vector and get the right > > function. > > umm... isn't this already what is done? from vfs_opv_init: > opv_desc_vector[opve_descp->opve_op->vdesc_offset] = > opve_descp->opve_impl; > the above line will set the appropriate vector by the vdesc_offset.. No. This requires that the symbol set for the descriptor set be complete; if you are adding a new VOP, then it won't be (unless you have ELF linker set agregation across distinct per kld sections working). Even then, the VOP entry may be duplicated, which I don't know how the linker set could handle correctly. It really needs a reference count for each FS that's using the new VOP. It becomes a can of worms quickly unless you reduce the decriptor list to a set of indices into an array of VOP functions. So again, doing the physical sort is the right way to go. Note that if you ad a VOP, you have to go back and repopulate the dynamic entry wheere the new VOP came in, if there is a default behaviour specified. 8-(. You can make the sort part of the vfs_opv_init(). Since the vfs_opv_init() "knows" the descriptor list, this works. > then at the end, there is a second pass to fill in the remaining entries > with the entry from VOFFSET(vop_default) of it's own vector.. > > hmm... looks like we need to remove this comment: > /* > * Finally, go back and replace unfilled routines > * with their default. (Sigh, an O(n^3) algorithm. I > * could make it better, but that'd be work, and n is small.) > */ > as right now the final routine runs in n as far as I can tell... :) The O(n^3) is wrong, unlesss you consider the whole thing, and you consider it in light of how the UFS is referenced by the FFS. For that specific case, O(n^3) is correct. This is because there is a seperate "dynamic vector" for each reference instance of a "non-dynamic vector". This is also not very obvious from the code. You need to look at the array declarations themselves, and the specfs ops, etc., in context to see this. The part of the comment that says "replace unfilled routines" is correct, but, again, it's descriptor based, which breaks truly dynamic loading of intermediate VFS's as opposed to complete-to-bottom stacks. The problem is in telling "NULL" entries from "uninitialized" entries. You need to know this because you want to collapse the first defined entry to the top of the stack so that you have the minimum possible call overhead for any individual stacked VOP. You can *almost* see how this is supposed to work in the "FFS stacked on UFS" case. It's kind of obfuscated by the "struct fileops" crap, and the fact that the vnode handling is from a global free pool, instead of the vnode objects being subelements in the per FS object structure (for UFS, struct inode for the in core inode). If you ignore that code and pretend the references are direct instead of indirect, it's a lot easier to visualize (one of the reasons I think the code should be changed to make it match the conceptual design). Right now, there is no real "collapse" taking place. Mostly because things with NULL entries are broken and getting more broken the more people stuff VM code into them. If you look at the "UFS in FFS VOP structure" entries, you can see how the vnode pointer is supposed to be an abstract type in each FS. The per FS macro for "VTOI" and "ITOV" are supposed to be the only gates where a vnode is not treated as an opaque type (and that goes back to the global pool implementation; you *could* use pointer math to make it *truly* opaque if the vnode were a subelement). It's all very complicated, but you can see it if you take enough time. Unfortunately, not very many people have taken enough time, since it's a lot more convoluted than it should be (ideally and IMO, anyway). But to ascend out of the fourth ring of hell, you first have to visit the fourth ring of hell... otherwise you can't get there from here. ;-). Once enough people understand the code so that a code review is possible and people can get the warm fuzzies about the new code, it should be possible to fix all of it to make it easier for people to get their brains around. Then it picks up speed as it rolls downhill. Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers.