From owner-freebsd-current  Thu Apr  3 18:38:20 1997
Return-Path:
Received: (from root@localhost) by freefall.freebsd.org (8.8.5/8.8.5)
	id SAA08640 for current-outgoing; Thu, 3 Apr 1997 18:38:20 -0800 (PST)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.50])
	by freefall.freebsd.org (8.8.5/8.8.5) with SMTP id SAA08628
	for ; Thu, 3 Apr 1997 18:38:17 -0800 (PST)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9)
	id TAA18256 for current@freebsd.org; Thu, 3 Apr 1997 19:21:29 -0700
From: Terry Lambert
Message-Id: <199704040221.TAA18256@phaeton.artisoft.com>
Subject: Re: DISCUSS: system open file table
To: current@freebsd.org
Date: Thu, 3 Apr 1997 19:21:23 -0700 (MST)
In-Reply-To: <199704040127.RAA06069@root.com> from "David Greenman" at Apr 3, 97 05:27:00 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-current@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

> >But... currently, a vnode reference is not the same thing as an open
> >reference.
>
>    Actually, for all practical purposes, it is. Ideally, everything in
> the kernel would do a "VOP_OPEN" (actually, vn_open) for internal file
> I/O (such as coredumps)... and I think we actually do now. There was a
> time in the past where this wasn't the case, however.

Is this new since the Lite2 merge?  My Lite2 tree is not on this machine,
so I can't check very easily.  If it is, I need to back off until I've
had a chance to look at the Lite2 stuff...

Actually, the vnode is returned by VOP_LOOKUP, by way of the namei()
call.  The VOP_OPEN is one of those few "veto" based interfaces that
actually works.  The open calls ufs_open() in ufs_vnops.c, which is
basically there to veto attempts to zap "append only" files.

One problem I have with this is that the VOP_LOOKUP calls the generic
kernel allocation code, but the deallocation code is called by the same
layer that called the VOP_OPEN.  I would really rather that if you call
a VOP to allocate a vnode, you call a VOP to free one.  We can discuss
whether or not the VFS should be consuming a kernel vnode pool management
interface in another thread; if the interface is reflexive, it doesn't
matter, because that consumption is opaque.

If the vnode reference instance *were* the open instance, I'd be OK with
leaving the interface at the VOP_ layer... though it still makes it
difficult to perform an open in the kernel for a file in the FS proper,
because VOP_'s are per FS, which is why we have namei().

The vn_open() solution for this problem is not very nice, because it
assumes that it will be called in a process context... I can't just pass
it a manifest SUSER credential.

The system open file table entry is really just a credential holder in
the kernel case, and it makes it easier to deal with the idea of revoking
a vnode: because the reference is to the system open file table entry
instead of to the vnode, you can revoke the vnode that the entry points
to without notifying the people who referenced it until they go to access
the now invalid vnode.  If they have a vnode pointer instead, they have
to be able to "check it" to see if it's valid, or be notified.  There's
no real clean failure on reference.

So effectively, it's not only a credential holder, it's a handle that can
be invalidated out from under its holder.  This is the same thing that
happens to a user space process in the case of a forcible unmount of an
FS where it has a file open.
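To make the "credential holder plus revocable handle" point concrete,
the system open file table entry (struct file, sys/file.h) looks roughly
like this -- field layout from memory of the 4.4BSD tree, trimmed, so
treat it as a sketch rather than gospel:

	struct file {
		LIST_ENTRY(file) f_list;  /* list of active files */
		short	f_flag;		/* open flags, see fcntl.h */
		short	f_type;		/* DTYPE_VNODE or DTYPE_SOCKET */
		short	f_count;	/* reference count */
		short	f_msgcount;	/* references from message queue */
		struct	ucred *f_cred;	/* the credential "holder" */
		struct	fileops *f_ops;	/* read/write/ioctl/select/close */
		off_t	f_offset;
		caddr_t	f_data;		/* vnode (or socket) pointer */
	};

The f_cred is the credential holder; f_data is the revocable part.  When
the vnode is revoked, vgone()/vclean() point its op vector at the deadfs
vector, so a stale holder of the file table entry gets a clean error on
its next access, instead of having to be notified or to "check" a raw
vnode pointer itself.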
> >Also, for things like a CIFS/SMBFS/AppleTalk/NetWare client, I want
> >to be able to use credentials which are not BSD process credentials,
> >but user credentials.
>
>    I don't think this makes any sense. Process credentials are an
> instance of user credentials in the kernel.

It lets me look up my user credentials indirectly, as root.  This lets
me have a "password cache", either "unlocked by user credentials" or
stored in a session manager.  This lets me have separate "alien"
credentials from some other process, but use the same connection to the
server for multiple user sessions.

I know this works for CIFS Kerberos tickets.  I admit, I think that an
SMBFS (Samba, etc.) client would need a server connection per user; on
the other hand, it could virtualize these (say, a maximum pool of 10
active connections) and, using the credential lookup, make another
connection on the requesting user's behalf after discarding the tail of
the LRU list of 10.

For NetWare, which handles multiple sessions over a single connection
for OS/2 and NT clients, it should work on one connection (though
sessions might want to be pooled).  It may also be that the session
ticket was supplied by NDS or some other directory server (LDAP?  X.500?)
and not be a Kerberos ticket at all; so we can't just "handle it all
the same".

> >I want to make the distinction between a cached reference and a real
> >reference, as well, both for the file I/O in a kernel thread, and
> >to get around some of the recent problems that come from VOP_LOCK
> >handling two types of locks.
>
>    Hmmm. I agree with one thing: the current kludge of having vnodes
> with a "0" reference count + vnode generation count in the cache seems
> very wrong to me. It's done this way because of the handsprings one
> must do for NFS (and presumably any other "stateless" filesystem, which
> can't hold an "open" instance)...

Yes, and it's complicated by a relatively high turnover, though this
would probably tail off a lot if the vnode were FS associative instead
of in a global pool.  The SVR4 solution for the name cache (which has
similar problems) is to flush the cache by vnode (or by VFS, when an FS
is unmounted).

The NFS problem is less of an issue if the VFS handles cookie generation
a bit more intelligently, and doesn't use the vp to do it.  This is also
a "vp is FS associative" argument... the NFS file handle lookup is done
using the VFS OP FHTOVP to invoke a per FS "*_fhtovp" function, so the
NFS wiring is all already there.

> >Finally, there is the issue of taking faults in kernel mode without
> >a process context to sleep on. I'd like to see the sleeping moved to
> >the address of the field in the system open file table, so that the
> >sleep handle doesn't have to know what kind of caller is making the
> >call.
>
>    Hmmm. Yes, I can see how this would be useful, but on the other
> hand, you have to have some saved state (whether that is a kernel
> thread or a process context), and any notion of kernel threads in
> FreeBSD (which I think is highly unlikely to ever occur) is going to
> have to deal with sleeping anyway... so I don't see a problem here.
> (Note: don't confuse this statement with kernel support for user
> threads, which IS very likely to occur in FreeBSD's near future).

For kernel threading, the idea would be to allocate a context for the
call; this would include a kernel stack, etc.
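To make the sleep-channel idea concrete: tsleep() and wakeup() only care
about an address, so if the blocking point is a field in the system open
file table entry, the waker doesn't need to know what kind of caller is
waiting.  A minimal sketch, assuming the current tsleep() interface;
fo_block()/fo_unblock() are made-up names for illustration, not existing
kernel functions:

	#include <sys/param.h>
	#include <sys/systm.h>
	#include <sys/file.h>

	/*
	 * Hypothetical: block a caller on the open file table entry
	 * itself.  The sleep channel is the address of a field in
	 * struct file, so the same code serves any kind of caller.
	 */
	static int
	fo_block(fp)
		struct file *fp;
	{
		/* PRIBIO priority, no timeout. */
		return (tsleep((caddr_t)&fp->f_data, PRIBIO, "foblk", 0));
	}

	static void
	fo_unblock(fp)
		struct file *fp;
	{
		wakeup((caddr_t)&fp->f_data);
	}

The catch, per David's point above, is that tsleep() still saves its
state in curproc; the per-call context (kernel stack, etc.) is what would
supply that state for a caller that isn't a process.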
It's roughly what you would need to support an async call gate for system
calls (an "asyscall" instead of syscall, operating off the same sysent[]
table, with another flag for "CALL_CAN_BLOCK"), to support call
conversion for a full user space POSIX threading implementation.

I actually think there was a kernel threading implementation posted to
the SMP list a while back -- I know that one was done for FreeBSD, in
any case, so I can probably dig it out from somewhere, even if it wasn't
the SMP list.

But I agree that supporting a future kernel threading implementation
isn't the primary reason for doing this.


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.