From owner-freebsd-smp Sat Apr 10  5:41:23 1999
Delivered-To: freebsd-smp@freebsd.org
Received: from spinner.netplex.com.au (spinner.netplex.com.au [202.12.86.3])
	by hub.freebsd.org (Postfix) with ESMTP id 87B7914E9C
	for ; Sat, 10 Apr 1999 05:41:14 -0700 (PDT)
	(envelope-from peter@spinner.netplex.com.au)
Received: from spinner.netplex.com.au (localhost [127.0.0.1])
	by spinner.netplex.com.au (Postfix) with ESMTP id 44DD21F4D;
	Sat, 10 Apr 1999 20:38:59 +0800 (WST)
	(envelope-from peter@spinner.netplex.com.au)
X-Mailer: exmh version 2.0.2 2/24/98
To: Terry Lambert
Cc: barney@databus.com, smp@FreeBSD.ORG, dot@dotat.at
Subject: Re: concurrent select()s on listen socket broken under SMP
In-reply-to: Your message of "Sat, 10 Apr 1999 00:39:19 GMT."
	<199904100039.RAA05457@usr01.primenet.com>
Date: Sat, 10 Apr 1999 20:38:58 +0800
From: Peter Wemm
Message-Id: <19990410123901.44DD21F4D@spinner.netplex.com.au>
Sender: owner-freebsd-smp@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Terry Lambert wrote:
> > This is *NOT* an SMP-specific problem; it's a generic problem in the
> > socket API and implementation.  You can trigger it on a UP kernel too,
> > it just happens to be real easy under SMP.
> >
> > See rev 1.52 and 1.54 of uipc_socket.c and related files for a
> > description of what's really going on.  (The problem there is from a
> > different angle; it deals with remote users being able to trigger the
> > race remotely as a denial-of-service attack.)
> >
> > The original poster didn't mention what version of FreeBSD was in use.
> > 3.1-RELEASE has this bug...  It's fixed in 4.0 and 3.1-STABLE (after
> > 26-feb-99).
>
> He said he was running a current version of 3.1-STABLE.

Either way, it doesn't matter.  As I said, look at the description of the
changes for details of what is going on.  The problem in the patch was a
*specific case* where it was possible to provoke this race with a
single-process listener.
Having multiple processes in a select/accept listening role is always
vulnerable..  select() and accept() are different syscalls, and there is no
guarantee that you're going to get a continuous execution run from the
select being true to the socket still being ready for accept.  Eg:

    Process 1                          Process 2
    [sleeping in select]               [sleeping in select]
    [incoming connection - socket becomes readable, both become runnable]
    Runs.. does accept()
    [connection taken - another
     accept will hang]
                                       Finally runs, does accept(), hangs..

At this point, Process 1 has taken the connection and Process 2 is locked
in accept().  The only way out of this is to set the listening socket
non-blocking so that accept() will abort on a false start.

This race is inherent in the design of select() and accept().  It has
nothing to do with SMP etc.

> > Terry, rest assured, select() and poll() are quite SMP safe.
> >
> > The other serious problem is select collisions.  I'd strongly advise
> > using a mutex so that only one process is ever in a select() waiting
> > for connections on a socket.  If two or more processes select() waiting
> > for the listening socket to become readable, when it does - a select
> > collision happens.  ie: *EVERY SINGLE PROCESS* that was asleep in
> > select gets woken up, and runs, and scans its select tables, and 99%
> > go back to sleep.
>
> Yes.  This is the classic "thundering herd" problem for which
> wakeup_one() (or wake_one()) was invented.

wakeup_one() cannot be used for select().  Look at the example above.
What if Process 1 got selected for the wakeup, and instead it serviced
something else of a higher priority?  Suppose it hangs?  The connection
will be in limbo with nothing to service it.  Suppose it (Process 1) core
dumps?  How is Process 2 going to get woken up instead?

This is not quite 'the' thundering herd, BTW.  If you have 3 processes
selecting on an event and it happens, all three (by design of the select()
API) need to be woken.
The bug is that not just those three get woken but every selecting process
on the system, including those that don't give a damn about the given
event.

> The real problem is that for some types of events, you want all
> of the processes waiting on the event to go, and for other types of
> events, you want only one process to go.

In select() and poll(), you *cannot* "pick one".  All the processes
listening on the event *must* be woken - because the processes themselves
scan the vectors and decide what they will do and when.  A process can't
"send back" the wakeup if it's going to do something else instead.

> I think for a socket on which a listen has been called, it's pretty
> obvious that you can only service a single accept at a time, and
> so based on the character of the object selected, wakeup_one() is
> the correct call.

Even then, you can't do this safely.  Sure, only one can accept the
connection, but if you pick one, how will you know whether it actually
will?  Remember, it's doing a select() so that it can look at *multiple*
things at once.

> I think that select collisions should be outlawed, unless the
> descriptor in question explicitly permits them, but I'd be OK
> with having to change the current behaviour via fcntl on a
> per-fd basis.

This won't work, for the reasons above..  You'll get missed events all
over the place in a scenario where this is actually activated..  Remember,
select collisions are pretty rare for 99% of the software out there.

> > If you have a couple of hundred processes asleep in select(), this
> > hits like a bomb going off.  You can get a spontaneous load average
> > jump from 0.2 to 100+ in an instant, depending on when loadav() runs.
>
> And this is a design error, which should be corrected, whether you
> argue that it's a scheduler problem or a wakeup problem.  Dynix and
> SVR4.2 both corrected this problem a while ago, using a wake_one()
> approach.
Yes, it is a problem and should be fixed, but wakeup_one() isn't even
remotely the answer..

> > Select collisions are a well understood problem, and there is no
> > cheap, easy, trivial fix that scales well.  (Sure, making the selinfo
> > struct have 5 pid's rather than 1 would work for <= 6 processes
> > waiting, but the problem is still there for > 6.  You can't allocate
> > memory for it due to the nature of the implementation without
> > modifying the driver interface so that there is a "d_unselect()"
> > handler or something - selinfo's get released long after the process
> > has given up and gone onto something else.  Changing the driver API
> > _again_ isn't something we want to do yet...)
>
> I think what you want is a LIFO queue for people selecting on an fd,
> and perhaps a per-fd fcntl'able flag that can turn that into a FIFO
> policy instead for resources for which there are interactive waits
> (clearly, for a server, LIFO is the correct ordering; at Novell, my
> suggestion that a LIFO service order be used for the NCP streams mux
> resulted in a 35% increase in throughput as a result of reduced
> paging load, making the UNIX solution faster than Native NetWare).

A list/queue/whatever would be one way, but this requires a driver API
redesign in a big way.  At the moment, the drivers manage the select event
wakeups.  A process calls select()/poll(), and this eventually translates
into a selscan() call which ends up calling the driver's d_select() (now
d_poll) entry point.  If the event is not presently active, selrecord() is
used to log the event in the *driver*'s private selinfo struct that it
provides for that event.  At some later time, the event "happens", and the
driver does a selwakeup() on the selinfo.  The driver has *no idea*
whether the process is still asleep waiting for the event, or has died, or
otherwise, because there is no call from the select subsystem to tell the
driver that it's no longer interested.
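To make that record-then-wake flow concrete, here is a toy userland model
of the driver-side protocol; selinfo, selrecord() and selwakeup() are
simplified stand-ins of my own, capturing only the control flow (one
recorded pid plus a collision flag), not the kernel's real structures:

```c
#include <assert.h>
#include <stdio.h>

/* Toy model of the driver select protocol: record interest if the event
 * is not ready, wake later.  One pid slot plus a collision flag. */
struct selinfo {
	int	si_pid;		/* the single recorded waiter (0 = none) */
	int	si_coll;	/* set when a second waiter collides */
};

static void selrecord(int pid, struct selinfo *sip)
{
	if (sip->si_pid != 0 && sip->si_pid != pid)
		sip->si_coll = 1;	/* collision: must wake everyone */
	else
		sip->si_pid = pid;
}

/* Driver d_select()/d_poll() entry: report readiness, else log interest
 * in the driver's own selinfo and report "not ready". */
static int drv_poll(int pid, int event_ready, struct selinfo *sip)
{
	if (event_ready)
		return 1;
	selrecord(pid, sip);
	return 0;
}

static int woken_all;	/* stands in for waking every selecting process */

/* The event "happens".  The driver can't know whether the recorded
 * process still cares (or even still exists), so on a collision the
 * only safe move is to wake every process sleeping in select(). */
static void selwakeup(struct selinfo *sip)
{
	woken_all = sip->si_coll;
	sip->si_pid = 0;
	sip->si_coll = 0;
}
```

Note there is no "unselect" entry point anywhere in this protocol: once a
process gives up on the event, the stale si_pid just sits there until the
next selrecord() or selwakeup(), which is exactly the cleanup problem
described above.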
As a result, the drivers cannot maintain a LIFO or anything like that,
since they never get a chance to clean out the list of things that are no
longer interested in the event.  However, if we're prepared to get a
little sloppy and do lazy reclaims of buffers or something like that, then
selrecord() could prune the table if needed, and selwakeup() could do it
too.  We could probably fudge it a little, so that instead of the selinfo
holding a short pid, it could hold a union and an array of 4 or 8 pids
which transmutes into a list header or something if all 4 or 8 slots get
used up.

> One possible solution, which would do dick for the SMP implications,
> would be to add a "time_in_select", and then have wakeup_one() only
> wake the one who called last, using a linear iteration.  This is
> basically what it does now (why the hell does it wake up more than
> one *just* because the first one is swapped out?!?).

If only it were that simple..

> Terry Lambert
> terry@lambert.org

Cheers,
-Peter

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-smp" in the body of the message