From owner-freebsd-hackers Thu Nov  9 10:28:24 2000
From: Terry Lambert
Message-Id: <200011091827.LAA22910@usr08.primenet.com>
Subject: Re: close call in a device ?
To: bschwand@dvart.com (bruno schwander)
Date: Thu, 9 Nov 2000 18:27:56 +0000 (GMT)
Cc: tlambert@primenet.com (Terry Lambert), freebsd-hackers@FreeBSD.ORG
In-Reply-To: <3A09F3BF.B028E0F8@dvart.com> from "bruno schwander" at Nov 08, 2000 04:45:51 PM
X-Mailer: ELM [version 2.5 PL2]

> > To add to this, the close calls can be forced; there is a flag
> > in the device structure which can force notification.  I'm not
> > sure what it does over a fork(), though: I think you really want
> > open notification.
>
> You mean that when I register my device/kernel module, I can
> explicitly request that all close calls will notify my module?
> That is exactly what I need.

Add D_TRACKCLOSE to d_flags for your device.  When the d_close()
of your device is called, the first arg is the dev.

Unfortunately, vcount() is used to see whether the close is really
a final close or not, and the vp is not passed into the close
itself.  You will have to track closes yourself.

One kludge to get around having to do this is to modify spec_close()
to do:

	} else if (devsw(dev)->d_flags & D_TRACKCLOSE) {
		/* Keep device updated on status */
		if (vcount(vp) <= 1) {
			/* last close: clear flag to signal the driver */
			devsw(dev)->d_flags &= ~D_TRACKCLOSE;
		}
	} else if (vcount(vp) > 1) {

and then do this as the _first_ thing in your close code:

	if (!(devsw(dev)->d_flags & D_TRACKCLOSE)) {
		/* magic: final close: add the flag back in to stay sane */
		devsw(dev)->d_flags |= D_TRACKCLOSE;
		...
	}

You can find spec_close() in /sys/miscfs/specfs/spec_vnops.c.  You
probably really ought to just add the flag back in on the first
open instead.

The thing that makes this a kludge is that it very evilly unsets a
flag it shouldn't unset, and it makes it the job of the device to
fix up the damage (the interface isn't reflexive).  A secondary nit
is that it is not really reentrant while the flag is clear, so you
have to be careful.

Really, since you will be doing per-open instance housekeeping
anyway, you ought to just add a list pointer to the per-open
instance data, and keep the open instances on a linked list; you
will have to look up the per-open instance data somehow anyway, and
it might as well be a list traversal.  When list membership goes
from 1 to 0, you'll know it's the last close, and you can free the
global (non per-open instance) resources.  Traditionally, this is
done using a minor number, but you can't just naively allocate an
unused one, since you might not get called.
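To make that concrete, here is a sketch of the list bookkeeping.
All the names here are made up (mydev_inst, M_MYDEV, the mi_unit
key), and the real lookup key depends on how your driver identifies
an open instance:

	#include <sys/param.h>
	#include <sys/systm.h>
	#include <sys/malloc.h>
	#include <sys/queue.h>

	MALLOC_DEFINE(M_MYDEV, "mydev", "mydev per-open instance data");

	/* Hypothetical per-open instance data. */
	struct mydev_inst {
		LIST_ENTRY(mydev_inst)	mi_link;	/* the list pointer */
		int			mi_unit;	/* made-up lookup key */
		/* ... per-open resources hang here ... */
	};

	static LIST_HEAD(, mydev_inst) mydev_opens =
	    LIST_HEAD_INITIALIZER(mydev_opens);

	/* Call from d_open(): create and enqueue a new open instance. */
	static struct mydev_inst *
	mydev_inst_alloc(int unit)
	{
		struct mydev_inst *mi;

		mi = malloc(sizeof(*mi), M_MYDEV, M_WAITOK);
		bzero(mi, sizeof(*mi));
		mi->mi_unit = unit;
		LIST_INSERT_HEAD(&mydev_opens, mi, mi_link);
		return (mi);
	}

	/*
	 * Call from d_close(): when membership goes from 1 to 0, it
	 * was the last close, and global resources can be freed.
	 */
	static void
	mydev_inst_free(struct mydev_inst *mi)
	{
		LIST_REMOVE(mi, mi_link);
		free(mi, M_MYDEV);
		if (LIST_EMPTY(&mydev_opens)) {
			/* last close: free non per-open resources here */
		}
	}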
> What do you mean by open notification ?  I do get all "open" calls
> to my device, just not all the "close"

For each open, d_open() gets called.  This is where you will be
creating your per-open instance data.

You should look at how fd's are handled over a fork() or other
call.  Without a look at this code in depth, I can't tell you today
whether or not your d_open() code will get called again for each
open fd.  If it doesn't, this could be a problem for you.  It used
to get called, FWIW.

> > The main problem with per process resources is that the VFS that
> > implements devices, specfs, doesn't own its own vnodes.
>
> Could you develop a little ?  I don't know about VFS, specfs and
> vnodes...

When you perform an operation on an open file, the vnode pointer is
dereferenced out of the per process open file table.  The kernel
internally doesn't know from file handles (an architectural bug,
IMO), so not only is it hard to do file I/O in the kernel, but you
have this opaque handle called a vnode.

When you do an ioctl() or something else, then because this is a
special device, there is a dereference into a function table pointed
to by the vnode.  This table is called "struct fileops", and the
table for special devices is spec_fileops.  So you give it a vnode,
it dereferences the fcntl() function pointer out of this table to
make the call, and passes the vnode pointer as an argument.

In the spec_fileops version of fcntl(), the device specific data is
dereferenced out of the vnode; it can do this because it knows that
any vnode of type VCHR will have one of these structures on it.
This is used by specfs to locate the correct device entry point to
call: your device.  Your device driver function is then called with
the device private data pointer from the vnode, called "dev".  It's
a pointer to your device private data.

Because the specfs does not own its own vnodes, each time you open
a device, you get the same vnode back from specfs.  It can't give
you a different one, because you asked for the same one: by the
time it gets to the open, all it has is the vnode of the parent
directory, a major number, and a minor number.  So there's no way
for the open to return a unique instance of the device each time
you open it, because it can only return one vnode.

This gets worse because of fork() and other fd management
behaviour.  The kernel likes to give back the same vnode to a user
space process as often as possible.  If one of these calls returns
another reference to an existing open instance (say you open the
same device twice from the same program, or you call dup() or
dup2()), then you may not get a call all the way down to the open,
like you expect.

This code is pretty convoluted, and I haven't traced it, so I can't
give you an exact answer; I'm also probably not running the exact
same version of the code as you, so even if I did the work and gave
you the right answer for me, it might not be the same answer for
you.  So all I'm saying is "here there be dragons", and you should
be very, very cautious.

If you do give back a different vnode, then you will have to be
careful about vcount(), which will always return "1" for your
vnodes, and so it will always close them (this could be a benefit,
actually: you could ignore D_TRACKCLOSE entirely).

In any case, the vnodes are not the specfs's or the device's to
give; they belong to the system.
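One way to watch the fd-level aliasing described above from user
space is something like this (the device path is made up, and the
comments assume a driver that does not set D_TRACKCLOSE):

	#include <stdio.h>
	#include <fcntl.h>
	#include <unistd.h>

	int
	main(void)
	{
		/* Hypothetical device node; substitute your own. */
		int fd1 = open("/dev/mydev", O_RDWR);	/* open call #1 */
		int fd2 = open("/dev/mydev", O_RDWR);	/* same vnode back */
		int fd3 = dup(fd1);	/* no open at all: just a new
					   fd referencing fd1's file */

		if (fd1 < 0 || fd2 < 0 || fd3 < 0) {
			perror("open/dup");
			return (1);
		}

		/*
		 * Three fds, one vnode.  Without D_TRACKCLOSE,
		 * spec_close() sees vcount(vp) > 1 and skips the
		 * device close until the final reference goes away:
		 */
		close(fd3);	/* drops the fd, not the open file */
		close(fd2);	/* vcount(vp) > 1: d_close() skipped */
		close(fd1);	/* final close: d_close() runs here */
		return (0);
	}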
Because the vnodes belong to the system, there are three
consequences:

1) You can have cache effects, like those I talked about for
   fork(), dup(), dup2(), and so on.

2) You won't necessarily get notified all the way down in your
   device driver when a new fd references an existing vp, so
   relying on the open/close notification might not work.

3) The way a device driver is "supposed" to tell which device is
   which is by using the minor number out of the "dev" structure
   (the major number is what got you to your device driver through
   specfs in the first place)... and your device, if it acts this
   way, will have the same minor number each time it is called.
   That means you would have to differentiate by making up a new
   minor number for each open, and returning that as a different
   vnode.

The catch is that the vnode allocation is done above the device
driver's control, so this is hard: you _must_ modify specfs to get
this behaviour; in particular, you _must_ modify the code that
calls the spec_open() call, since that's where the vnode gets
allocated for the open -- and that's two layers up from your
device, so you will probably have to add your own D_ flag and teach
the upper layer about it.

It is not all that complicated once you understand the code, but
there are a lot of places where you have to be careful so that you
don't get bitten.

I think the biggest problem is actually going to be #1, above.  I
expect that what will happen is that the VFS on which the special
device is located will look in its little name cache, given the
directory vp of the parent directory for the device and the name of
the device, and then just return a new reference to a cached copy
of the vp that it gets back from this cache.  The problem is going
to be that no matter how you got there, the upper level VFS that
has the device on it will probably get in your way.

Really, the vnode cache wants to be in common upper level code, and
to let the lowest level code that's still a VFS (in this case,
specfs) set a flag on the vnode that says "don't cache this thing,
please; I want each open to call my device open, because each open
is different".

> What I did is make a module that defines a struct cdevsw with the
> open/read/etc callbacks, then I register my calls for various
> device entries with make_dev(), and at the end used the
> DEV_MODULE() macro to declare it to the system.  I modeled that
> after the example in /usr/src/share/examples/kld of FreeBSD 4

This is the way to do it (a rough skeleton of that shape is
sketched below).

> Is there a different driver/module architecture ?

You will need to change the existing architecture to be able to do
what you want without having a different device name be used for
each open (in order to get cloning devices).  The actual "clone"
event is when the upper level VFS where the device lives calls down
through the specfs to the device to get a device with a different
minor number, and a different vnode to point to it, so that the per
process open file table won't get confused.

> > This is actually the primary reason that VMWARE can only run
> > one instance at a time of a virtual machine: there is no way
> > to have per open instance resources, which are not shared.
> >
> > If you were to use the TFS flag (grep for TFS in the headers,
> > that's a substring), you could make specfs own its own vnodes.
>
> Where should I look for this ?  I looked into /usr/src/ and only
> some references to NTFS and TFS filesystems turned up ?

vfs.h.  You aren't going to see an FS, only the flag that lets an
FS own its own vnodes (and it's a kludge).

The documentation above is much more thorough (and I expect it will
be corrected here and there by people reading this, or by you, when
you dive into the code and find somewhere where I've given you old
information).

> Would I have to roll out a custom filesystem to have this running ?

No.  You just have to change the way the specfs is treated by the
system.
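For reference, the skeleton mentioned above looks roughly like this
in FreeBSD 4 style.  Every name here is made up and the major
number is a placeholder; the thing to actually copy is the example
under /usr/src/share/examples/kld:

	#include <sys/param.h>
	#include <sys/systm.h>
	#include <sys/kernel.h>
	#include <sys/module.h>
	#include <sys/conf.h>
	#include <sys/uio.h>

	static d_open_t		mydev_open;
	static d_close_t	mydev_close;
	static d_read_t		mydev_read;

	#define MYDEV_MAJOR	200	/* placeholder local major */

	static struct cdevsw mydev_cdevsw = {
		/* open */	mydev_open,
		/* close */	mydev_close,
		/* read */	mydev_read,
		/* write */	nowrite,
		/* ioctl */	noioctl,
		/* poll */	nopoll,
		/* mmap */	nommap,
		/* strategy */	nostrategy,
		/* name */	"mydev",
		/* maj */	MYDEV_MAJOR,
		/* dump */	nodump,
		/* psize */	nopsize,
		/* flags */	D_TRACKCLOSE,	/* see every close */
		/* bmaj */	-1
	};

	static dev_t mydev_handle;

	static int
	mydev_open(dev_t dev, int oflags, int devtype, struct proc *p)
	{
		/* per-open instance housekeeping goes here */
		return (0);
	}

	static int
	mydev_close(dev_t dev, int fflag, int devtype, struct proc *p)
	{
		/* close tracking goes here */
		return (0);
	}

	static int
	mydev_read(dev_t dev, struct uio *uio, int ioflag)
	{
		return (0);
	}

	static int
	mydev_modevent(module_t mod, int type, void *unused)
	{
		switch (type) {
		case MOD_LOAD:
			mydev_handle = make_dev(&mydev_cdevsw, 0,
			    UID_ROOT, GID_WHEEL, 0600, "mydev");
			return (0);
		case MOD_UNLOAD:
			destroy_dev(mydev_handle);
			return (0);
		default:
			return (0);
		}
	}

	DEV_MODULE(mydev, mydev_modevent, NULL);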
Mostly, I run systems without a specfs, since I think that struct
fileops is a terrible, terrible kludge, but my code has diverged
significantly since 1995, and continues to diverge even more
radically as time goes on (so it won't be much use to you, since I
haven't pulled a [Linux] Alan Cox and don't maintain an FTP
distribution point for my changes, or sync them with the source
tree more than once a month or so -- sorry).

> > The way you would handle your problem then is by returning a
> > different instance of the device, with a different instance of
> > per process attached storage.  It's pretty simple to do this:
> > just return a different vnode for the next open of the same
> > device, instead of the same vnode with an additional reference.
>
> this is really confusing me... in the example I had, the only
> thing I return from my open routine is an int telling success or
> errors happened... any pointers for the vnode stuff ?  if it could
> apply to what I am trying to do ?
>
> Am I basing my driver on the wrong stuff ?

No.  You are basing your driver on the current FreeBSD state of the
art.  If you don't push for a higher state of the art, you will not
be able to do what you want, and you will have to do it without a
clone device.

If you want to go that route, look at the library code for
openpty(), and look at the pty driver.  Basically, you will need
to:

1) Create a bunch of minor devices off the same major.

2) Open each minor device only once, and make your code iterate
   through all possible minor devices before giving up (there is a
   user space sketch of this scan below).

> > NB: If you are trying to do this for VMWARE or some other binary
> > code, there's no way that the pty opening solution suggested in
> > the previous posting will be able to work for you, since the code
>
> Yes, I came to that conclusion too.

So I can assume that this is a binary interface issue?

If it's a kernel space issue, and _never_ a user space issue, you
might be able to kludge this by calling your clone interface
internally, once per minor number.  The resource tracking for doing
this right will be a pain, but it should work.

If it's a user space issue, then you will need to make cloning
devices work.  Unfortunately, many of the things I have described
above will be a problem for you, or I'd just say "use a portal, and
make it map to multiple real devices, one for each time you open
the portal".  The caching issues and notification issues noted
above would still kill you with a portal, though, and a portal will
look like a different special file type instead of looking like a
device, so you would not be able to ioctl() or otherwise treat the
thing as a device.

The problem is that there will be great resistance to these
changes, since they will have to be broken up into pieces to get
past reviewers.  Since you have to have the directory cache in
common code up front, this is going to be the hardest sell: people
will see no difference in functionality until other code is
written, so it will look like a gratuitous change to the
unenlightened.  I had similar problems when I wanted to make
unrelated VFS changes which would eventually have led to working
stacking: they were unwilling to take such a large chunk at one
time, and they were unwilling to take what they considered
gratuitous changes once it was broken up into pieces small enough
that they would accept them.
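The minor-device scan in step 2 above looks roughly like this from
user space (the device name and count are made up; openpty() in
libc does the same dance over the pty names):

	#include <stdio.h>
	#include <fcntl.h>
	#include <errno.h>
	#include <unistd.h>

	#define NMINOR	32	/* assumed number of pre-created minors */

	/*
	 * Try each minor device in turn until one opens; the driver
	 * is expected to return EBUSY for a minor that is already
	 * open, so anything else is a real error.
	 */
	int
	open_clone(void)
	{
		char path[32];
		int i, fd;

		for (i = 0; i < NMINOR; i++) {
			snprintf(path, sizeof(path), "/dev/mydev%d", i);
			fd = open(path, O_RDWR);
			if (fd >= 0)
				return (fd);	/* got a free instance */
			if (errno != EBUSY)
				break;		/* real error; give up */
		}
		return (-1);	/* all instances busy, or error */
	}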
As you can see above, the required changes are a little bit larger
in scope than I recently led Jordan Hubbard to believe (if you
followed that discussion).  But I think they are necessary to
permit progress in research in the area where you are doing your
work (I did a similar thing back in 1997, which is part of my
source base, in order to get true cloning pty drivers).

If you can get even partial buy-in from a core team member, you
will be much better off.  Poul-Henning has shown some interest in
this area with some of his recent work on his own cloning
implementation requiring devfs, but I think his code is far from
complete (unless he has uncommitted patches he is willing to
share).

I can advise you on implementation, and even give you some code
bits that aren't too far out of sync with how FreeBSD works, but
for some things you will be on your own (e.g. in my system, vnodes
are already owned by each VFS, and have been for years, so I don't
have the TFS issue to deal with to kludge around the system owning
the things, etc.).

Good luck, and I hope this has at least given you more insight into
what's involved, and where to look in the source for more answers.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message