Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 29 Jul 2003 00:04:50 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Robert Watson <rwatson@FreeBSD.org>
Cc:        Kris Kennaway <kris@obsecurity.org>
Subject:   Re: LOR with filedesc structure and Giant
Message-ID:  <3F261C92.2FB1AFCD@mindspring.com>
References:  <Pine.NEB.3.96L.1030728110437.57321A-100000@fledge.watson.org>

next in thread | previous in thread | raw e-mail | index | archive | help
Robert Watson wrote:
> On Sun, 27 Jul 2003, Kris Kennaway wrote:
> > After upgrading last night, one of the package machines found this:
> 
> I've bumped into some similar problems -- it's a property of how we
> current lock select().  We hold the file descriptor lock for the duration
> of polling each object being "selected", and if any of those objects has
> to grab a lock for any reason, it has to implicitly fall after the file
> descriptor lock.  I actually run into this in some of our MAC code,
> because I need to grab a vnode lock to authorize polling the vnode using
> VOP_POLL(), and since the vnode lock is a sleep lock, this generates a
> WITNESS warning.  Unfortunately, it's not immediately clear what a better
> locking scheme would look like without going overboard on the fine-grained
> side.  We probably need to grab Giant before entering the select code
> since it's highly likely something in there will require Giant -- it
> reaches down into VFS, the device stuff, socket code, tc.

This is basically an instance of the race I warned against being
an impending danger to FreeBSD last week, now that we have the
possibility of multiple application threads in the kernel
simutaneously.

The easiest fix is to take a reference on the descriptors in the
select itself around the calls to selscan().  This is annoying,
because it's moderately expensive.

The problem is that you have to avoid false positives, and you
could sleep as a result of some of the underylying implementation
stuff.

Effectively, this means that you have to allocate a copy of the
descriptor table to hold a reference over the whole thing, so
that a close of the descriptors out from under you doesn't cause
you to panic after coming back.

The tricky part is that you have to change the semantics of the
selscan() (this part is ugly), which you do by verifying that
the descriptor you are using is the same one you were using.  You
have to do this because there's no way to safely avoid the race
of the first and second selscan, where someone closes a descriptor
out from under you, and then opens another one in its place in a
separate thread in the same application.

The way Solaris deals with this issue is that all sleeps are at
the same level, so what it can do is force the call to fail out
(basically, a cancellation point) for outstanding read() on an
fd in one thread, with the descriptor being closed in another.

It turns out that parts of the Java netwoking implementation in
the full-on Java implementation actually *depends* on this
behaviour, so we will have to bite the bullet on it eventually,
or rewrite that code when/if it makes it to FreeBSD.

The only other things I've thought of that could let you fix
this is to be Evil(tm) and fail the close() with the only error
that would cause the application to retry it later -- EINTR.
This is truly evil, since it's likely the application doesn't
check the return value from close, and even if it did, an EINTR
would likely cause it to try again immediately.

The only other semi-sane approach is Too Evil For Words(tm), which
would be to delay the return on the close until the reference count
goes from 2->1; the reason this is Too Evil For Words(tm) is this:
what do you do when one thread is blocked in the close(), and one
is blocked in a read(), and a third thread *also* calls close()?
You'd pretty much be guaranteeing that applications that assume
the Solaris semantics will all break (arguably, they are already
broken, with one thread closing files out from under another, but
they appear to function.  I guess there's always "psignal(p,SIGABRT);"
8-) 8-)).

I don't think FreeBSD could really implement the Solaris approach,
without a lot of code needing to be rewritten to support the idea
of cancellation of blocked operations pending completion (basically,
a wake-up-any-slept-on-this-fd-process).  Maybe Dragonfly BSD will
be able to solve this elegantly, but I really doubt it's going to
be able to do it.

You might also be able to kludge up an "internal signal", but the
slept thread is probably not sleeping on e.g. the descriptor address
(maybe it's slept in a malloc, etc.), so I don't know how you'd know
that it had a reference to the thing your trying to take away from
it without a rendevous.

Maybe it's time to give each blocking context a kqueue or something
equally gross.  8-(.

-- Terry



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?3F261C92.2FB1AFCD>