From owner-freebsd-hackers Thu Nov  9 10:28:24 2000
From: Terry Lambert
Message-Id: <200011091827.LAA22910@usr08.primenet.com>
Subject: Re: close call in a device ?
To: bschwand@dvart.com (bruno schwander)
Date: Thu, 9 Nov 2000 18:27:56 +0000 (GMT)
Cc: tlambert@primenet.com (Terry Lambert), freebsd-hackers@FreeBSD.ORG
In-Reply-To: <3A09F3BF.B028E0F8@dvart.com> from "bruno schwander" at Nov 08, 2000 04:45:51 PM
X-Mailer: ELM [version 2.5 PL2]

> > To add to this, the close calls can be forced; there is a flag
> > in the device structure which can force notification.  I'm not
> > sure what it does over a fork(), though: I think you really want
> > open notification.
>
> You mean that when I register my device/kernel module, I can
> explicitly request that all close calls will notify my module?
> That is exactly what I need.

Add D_TRACKCLOSE to d_flags for your device.  When the d_close()
of your device is called, the first arg is the dev.

Unfortunately, vcount() is used to see whether the close is really
a final close or not, and the vp is not passed into the close
itself.  You will have to track closes yourself.

One kludge to get around having to do this is to modify spec_close()
to do:

	} else if (devsw(dev)->d_flags & D_TRACKCLOSE) {
		/* Keep device updated on status */
		if (vcount(vp) <= 1) {
			/* last close: clear flag to signal the driver */
			devsw(dev)->d_flags &= ~D_TRACKCLOSE;
		}
	} else if (vcount(vp) > 1) {

and then do this as the _first_ thing in your close code:

	if (!(devsw(dev)->d_flags & D_TRACKCLOSE)) {
		/* magic: final close: add the flag back in to stay sane */
		devsw(dev)->d_flags |= D_TRACKCLOSE;
		...
	}

You can find spec_close() in /sys/miscfs/specfs/spec_vnops.c.  You
probably really ought to just add the flag back in on the first
open instead.

The thing that makes this a kludge is that it very evilly unsets a
flag it shouldn't unset, and it makes it the job of the device to
fix up the damage (the interface isn't reflexive).  A secondary nit
is that it is not really reentrant while the flag is clear, so you
have to be careful.

Really, since you will be doing per-open instance housekeeping
anyway, you ought to just add a list pointer to the per-open
instance data, and keep the open instances on a linked list; you
will have to look up the per-open instance data somehow anyway, and
it might as well be a list traversal.  When list membership goes
from 1 to 0, you'll know it's the last close, and you can free the
global (non per-open instance) resources.  Traditionally, this is
done using a minor number, but you can't just naively allocate an
unused one, since you might not get called.
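To make that concrete, here is a sketch of the list bookkeeping.
All the names here are made up (mydev_inst, M_MYDEV, the mi_unit
key), and the real lookup key depends on how your driver identifies
an open instance:

	#include <sys/param.h>
	#include <sys/systm.h>
	#include <sys/malloc.h>
	#include <sys/queue.h>

	MALLOC_DEFINE(M_MYDEV, "mydev", "mydev per-open instance data");

	/* Hypothetical per-open instance data. */
	struct mydev_inst {
		LIST_ENTRY(mydev_inst)	mi_link;	/* the list pointer */
		int			mi_unit;	/* made-up lookup key */
		/* ... per-open resources hang here ... */
	};

	static LIST_HEAD(, mydev_inst) mydev_opens =
	    LIST_HEAD_INITIALIZER(mydev_opens);

	/* Call from d_open(): create and enqueue a new open instance. */
	static struct mydev_inst *
	mydev_inst_alloc(int unit)
	{
		struct mydev_inst *mi;

		mi = malloc(sizeof(*mi), M_MYDEV, M_WAITOK);
		bzero(mi, sizeof(*mi));
		mi->mi_unit = unit;
		LIST_INSERT_HEAD(&mydev_opens, mi, mi_link);
		return (mi);
	}

	/*
	 * Call from d_close(): when membership goes from 1 to 0, it
	 * was the last close, and global resources can be freed.
	 */
	static void
	mydev_inst_free(struct mydev_inst *mi)
	{
		LIST_REMOVE(mi, mi_link);
		free(mi, M_MYDEV);
		if (LIST_EMPTY(&mydev_opens)) {
			/* last close: free non per-open resources here */
		}
	}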
> What do you mean by open notification ?  I do get all "open" calls
> to my device, just not all the "close"

For each open, d_open() gets called.  This is where you will be
creating your per-open instance data.

You should look at how fd's are handled over a fork() or other
call.  Without a look at this code in depth, I can't tell you today
whether or not your d_open() code will get called again for each
open fd.  If it doesn't, this could be a problem for you.  It used
to get called, FWIW.

> > The main problem with per process resources is that the VFS that
> > implements devices, specfs, doesn't own its own vnodes.
>
> Could you develop a little ?  I don't know about VFS, specfs and
> vnodes...

When you perform an operation on an open file, the vnode pointer is
dereferenced out of the per process open file table.  The kernel
internally doesn't know from file handles (an architectural bug,
IMO), so not only is it hard to do file I/O in the kernel, but you
have this opaque handle called a vnode.

When you do an ioctl() or something else, then because this is a
special device, there is a dereference into a function table pointed
to by the vnode.  This table is called "struct fileops", and the
table for special devices is spec_fileops.  So you give it a vnode,
it dereferences the fcntl() function pointer out of this table to
make the call, and passes the vnode pointer as an argument.

In the spec_fileops version of fcntl(), the device specific data is
dereferenced out of the vnode; it can do this because it knows that
any vnode of type VCHR will have one of these structures on it.
This is used by specfs to locate the correct device entry point to
call: your device.  Your device driver function is then called with
the device private data pointer from the vnode, called "dev".  It's
a pointer to your device private data.

Because the specfs does not own its own vnodes, each time you open
a device, you get the same vnode back from specfs.  It can't give
you a different one, because you asked for the same one: by the
time it gets to the open, all it has is the vnode of the parent
directory, a major number, and a minor number.  So there's no way
for the open to return a unique instance of the device each time
you open it, because it can only return one vnode.

This gets worse because of fork() and other fd management
behaviour.  The kernel likes to give back the same vnode to a user
space process as often as possible.  If one of these calls returns
another reference to an existing open instance (say you open the
same device twice from the same program, or you call dup() or
dup2()), then you may not get a call all the way down to the open,
like you expect.

This code is pretty convoluted, and I haven't traced it, so I can't
give you an exact answer; I'm also probably not running the exact
same version of the code as you, so even if I did the work and gave
you the right answer for me, it might not be the same answer for
you.  So all I'm saying is "here there be dragons", and you should
be very, very cautious.

If you do give back a different vnode, then you will have to be
careful about vcount(), which will always return "1" for your
vnodes, and so it will always close them (this could be a benefit,
actually: you could ignore D_TRACKCLOSE entirely).

In any case, the vnodes are not the specfs's or the device's to
give; they belong to the system.
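One way to watch the fd-level aliasing described above from user
space is something like this (the device path is made up, and the
comments assume a driver that does not set D_TRACKCLOSE):

	#include <stdio.h>
	#include <fcntl.h>
	#include <unistd.h>

	int
	main(void)
	{
		/* Hypothetical device node; substitute your own. */
		int fd1 = open("/dev/mydev", O_RDWR);	/* open call #1 */
		int fd2 = open("/dev/mydev", O_RDWR);	/* same vnode back */
		int fd3 = dup(fd1);	/* no open at all: just a new
					   fd referencing fd1's file */

		if (fd1 < 0 || fd2 < 0 || fd3 < 0) {
			perror("open/dup");
			return (1);
		}

		/*
		 * Three fds, one vnode.  Without D_TRACKCLOSE,
		 * spec_close() sees vcount(vp) > 1 and skips the
		 * device close until the final reference goes away:
		 */
		close(fd3);	/* drops the fd, not the open file */
		close(fd2);	/* vcount(vp) > 1: d_close() skipped */
		close(fd1);	/* final close: d_close() runs here */
		return (0);
	}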
Because the vnodes belong to the system, there are three
consequences:

1) You can have cache effects, like those I talked about for
   fork(), dup(), dup2(), and so on.

2) You won't necessarily get notified all the way down in your
   device driver when a new fd references an existing vp, so
   relying on the open/close notification might not work.

3) The way a device driver is "supposed" to tell which device is
   which is by using the minor number out of the "dev" structure
   (the major number is what got you to your device driver through
   specfs in the first place)... and your device, if it acts this
   way, will have the same minor number each time it is called.
   That means you would have to differentiate by making up a new
   minor number for each open, and returning that as a different
   vnode.

The catch is that the vnode allocation is done above the device
driver's control, so this is hard: you _must_ modify specfs to get
this behaviour; in particular, you _must_ modify the code that
calls the spec_open() call, since that's where the vnode gets
allocated for the open -- and that's two layers up from your
device, so you will probably have to add your own D_ flag and teach
the upper layer about it.

It is not all that complicated once you understand the code, but
there are a lot of places where you have to be careful so that you
don't get bitten.

I think the biggest problem is actually going to be #1, above.  I
expect that what will happen is that the VFS on which the special
device is located will look in its little name cache, given the
directory vp of the parent directory for the device and the name of
the device, and then just return a new reference to a cached copy
of the vp that it gets back from this cache.  The problem is going
to be that no matter how you got there, the upper level VFS that
has the device on it will probably get in your way.

Really, the vnode cache wants to be in common upper level code, and
to let the lowest level code that's still a VFS (in this case,
specfs) set a flag on the vnode that says "don't cache this thing,
please; I want each open to call my device open, because each open
is different".

> What I did is make a module that defines a struct cdevsw with the
> open/read/etc callbacks, then I register my calls for various
> device entries with make_dev(), and at the end used the
> DEV_MODULE() macro to declare it to the system.  I modeled that
> after the example in /usr/src/share/examples/kld of FreeBSD 4

This is the way to do it (a rough skeleton of that shape is
sketched below).

> Is there a different driver/module architecture ?

You will need to change the existing architecture to be able to do
what you want without having a different device name be used for
each open (in order to get cloning devices).  The actual "clone"
event is when the upper level VFS where the device lives calls down
through the specfs to the device to get a device with a different
minor number, and a different vnode to point to it, so that the per
process open file table won't get confused.

> > This is actually the primary reason that VMWARE can only run
> > one instance at a time of a virtual machine: there is no way
> > to have per open instance resources, which are not shared.
> >
> > If you were to use the TFS flag (grep for TFS in the headers,
> > that's a substring), you could make specfs own its own vnodes.
>
> Where should I look for this ?  I looked into /usr/src/ and only
> some references to NTFS and TFS filesystems turned up ?

vfs.h.  You aren't going to see an FS, only the flag that lets an
FS own its own vnodes (and it's a kludge).

The documentation above is much more thorough (and I expect it will
be corrected here and there by people reading this, or by you, when
you dive into the code and find somewhere where I've given you old
information).

> Would I have to roll out a custom filesystem to have this running ?

No.  You just have to change the way the specfs is treated by the
system.
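For reference, the skeleton mentioned above looks roughly like this
in FreeBSD 4 style.  Every name here is made up and the major
number is a placeholder; the thing to actually copy is the example
under /usr/src/share/examples/kld:

	#include <sys/param.h>
	#include <sys/systm.h>
	#include <sys/kernel.h>
	#include <sys/module.h>
	#include <sys/conf.h>
	#include <sys/uio.h>

	static d_open_t		mydev_open;
	static d_close_t	mydev_close;
	static d_read_t		mydev_read;

	#define MYDEV_MAJOR	200	/* placeholder local major */

	static struct cdevsw mydev_cdevsw = {
		/* open */	mydev_open,
		/* close */	mydev_close,
		/* read */	mydev_read,
		/* write */	nowrite,
		/* ioctl */	noioctl,
		/* poll */	nopoll,
		/* mmap */	nommap,
		/* strategy */	nostrategy,
		/* name */	"mydev",
		/* maj */	MYDEV_MAJOR,
		/* dump */	nodump,
		/* psize */	nopsize,
		/* flags */	D_TRACKCLOSE,	/* see every close */
		/* bmaj */	-1
	};

	static dev_t mydev_handle;

	static int
	mydev_open(dev_t dev, int oflags, int devtype, struct proc *p)
	{
		/* per-open instance housekeeping goes here */
		return (0);
	}

	static int
	mydev_close(dev_t dev, int fflag, int devtype, struct proc *p)
	{
		/* close tracking goes here */
		return (0);
	}

	static int
	mydev_read(dev_t dev, struct uio *uio, int ioflag)
	{
		return (0);
	}

	static int
	mydev_modevent(module_t mod, int type, void *unused)
	{
		switch (type) {
		case MOD_LOAD:
			mydev_handle = make_dev(&mydev_cdevsw, 0,
			    UID_ROOT, GID_WHEEL, 0600, "mydev");
			return (0);
		case MOD_UNLOAD:
			destroy_dev(mydev_handle);
			return (0);
		default:
			return (0);
		}
	}

	DEV_MODULE(mydev, mydev_modevent, NULL);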
Mostly, I run systems without a specfs, since I think that struct
fileops is a terrible, terrible kludge, but my code has diverged
significantly since 1995, and continues to diverge even more
radically as time goes on (so it won't be much use to you, since I
haven't pulled a [Linux] Alan Cox and don't maintain an FTP
distribution point for my changes, or sync them with the source
tree more than once a month or so -- sorry).

> > The way you would handle your problem then is by returning a
> > different instance of the device, with a different instance of
> > per process attached storage.  It's pretty simple to do this:
> > just return a different vnode for the next open of the same
> > device, instead of the same vnode with an additional reference.
>
> this is really confusing me... in the example I had, the only
> thing I return from my open routine is an int telling success or
> errors happened... any pointers for the vnode stuff ?  if it could
> apply to what I am trying to do ?
>
> Am I basing my driver on the wrong stuff ?

No.  You are basing your driver on the current FreeBSD state of the
art.  If you don't push for a higher state of the art, you will not
be able to do what you want, and you will have to do it without a
clone device.

If you want to go that route, look at the library code for
openpty(), and look at the pty driver.  Basically, you will need
to:

1) Create a bunch of minor devices off the same major.

2) Open each minor device only once, and make your code iterate
   through all possible minor devices before giving up (there is a
   user space sketch of this scan below).

> > NB: If you are trying to do this for VMWARE or some other binary
> > code, there's no way that the pty opening solution suggested in
> > the previous posting will be able to work for you, since the code
>
> Yes, I came to that conclusion too.

So I can assume that this is a binary interface issue?

If it's a kernel space issue, and _never_ a user space issue, you
might be able to kludge this by calling your clone interface
internally, once per minor number.  The resource tracking for doing
this right will be a pain, but it should work.

If it's a user space issue, then you will need to make cloning
devices work.  Unfortunately, many of the things I have described
above will be a problem for you, or I'd just say "use a portal, and
make it map to multiple real devices, one for each time you open
the portal".  The caching issues and notification issues noted
above would still kill you with a portal, though, and a portal will
look like a different special file type instead of looking like a
device, so you would not be able to ioctl() or otherwise treat the
thing as a device.

The problem is that there will be great resistance to these
changes, since they will have to be broken up into pieces to get
past reviewers.  Since you have to have the directory cache in
common code up front, this is going to be the hardest sell: people
will see no difference in functionality until other code is
written, so it will look like a gratuitous change to the
unenlightened.  I had similar problems when I wanted to make
unrelated VFS changes which would eventually have led to working
stacking: they were unwilling to take such a large chunk at one
time, and they were unwilling to take what they considered
gratuitous changes once it was broken up into pieces small enough
that they would accept them.
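The minor-device scan in step 2 above looks roughly like this from
user space (the device name and count are made up; openpty() in
libc does the same dance over the pty names):

	#include <stdio.h>
	#include <fcntl.h>
	#include <errno.h>
	#include <unistd.h>

	#define NMINOR	32	/* assumed number of pre-created minors */

	/*
	 * Try each minor device in turn until one opens; the driver
	 * is expected to return EBUSY for a minor that is already
	 * open, so anything else is a real error.
	 */
	int
	open_clone(void)
	{
		char path[32];
		int i, fd;

		for (i = 0; i < NMINOR; i++) {
			snprintf(path, sizeof(path), "/dev/mydev%d", i);
			fd = open(path, O_RDWR);
			if (fd >= 0)
				return (fd);	/* got a free instance */
			if (errno != EBUSY)
				break;		/* real error; give up */
		}
		return (-1);	/* all instances busy, or error */
	}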
As you can see above, the required changes are a little bit larger
in scope than I recently led Jordan Hubbard to believe (if you
followed that discussion).  But I think they are necessary to
permit progress in research in the area where you are doing your
work (I did a similar thing back in 1997, which is part of my
source base, in order to get true cloning pty drivers).

If you can get even partial buy-in from a core team member, you
will be much better off.  Poul-Henning has shown some interest in
this area with some of his recent work on his own cloning
implementation requiring devfs, but I think his code is far from
complete (unless he has uncommitted patches he is willing to
share).

I can advise you on implementation, and even give you some code
bits that aren't too far out of sync with how FreeBSD works, but
for some things you will be on your own (e.g. in my system, vnodes
are already owned by each VFS, and have been for years, so I don't
have the TFS issue to deal with to kludge around the system owning
the things, etc.).

Good luck, and I hope this has at least given you more insight into
what's involved, and where to look in the source for more answers.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message