From: Zach Brown <zab@zabbo.net>
Date: Mon, 5 Jul 1999 01:10:38 -0400 (EDT)
To: Jonathan Lemon
Cc: Mike Smith, hackers@FreeBSD.ORG
Subject: Re: poll() scalability

On Sun, 4 Jul 1999, Jonathan Lemon wrote:

> I would think that a system that uses callbacks (like POSIX's completion
> signals) would be more expensive than a call like poll() or select().

the sigio/siginfo model is a few orders of magnitude cheaper than
poll/select as you scale the number of fds you're watching.  the reasons
being that select()/poll() have that large chunk of state to throw around
on every syscall, yet in real-world use very rarely return more than a
few active pollfds.

with the sigio/siginfo model you register your interest in the fd at fd
creation.  from then on, when a POLL_ event happens on the fd, we notice
that it has an rt signal queue registered and a siginfo struct is tacked
onto the end of that queue.  these code paths can be nice and light.  the
siginfo enqueueing can be pointed at multiple queues by registering a
process group with F_SETOWN, etc.

(and yes, the siginfo struct has stuff for telling what process just sent
you a signal via kill(), posix timers, normal signal delivery, telling
things about the child that just sent you sigchld, the faulting addr for
segv and friends, in addition to the band (POLL_) info for sigio)

it's important to notice that we don't actually use signal delivery for
this sigio/siginfo stuff; we mask the signal and use sigwaitinfo() to
block on, or pop, the next siginfo struct off the queue.  dealing with
async signals jumping in would be annoying, and to do it right one would
probably want to simply enqueue the siginfo delivered to the signal
handler into a nice fifo that the real thread of execution would deal
with..  instead of doing all this grossness, we just let the kernel
maintain the siginfo queue.  it's quite like the 'delta poll' system
proposed, but with differently inelegant semantics.

I'd say if one were to design an event queueing/notification system and
add a new api for it, we'd want to do it correctly from the get-go and
lose the similarity to existing interfaces entirely, unless it really
makes sense to behave like them (which it doesn't in the poll() case,
imho).

> Also, you really want to return more than one event at a time in
> order to amortize the cost of the system call over several events, this
> doesn't seem possible with callbacks (or upcalls).

yes, that would be a nice behaviour, but I haven't seen it become a real
issue yet.  the sigwaitinfo() syscall is just so much lighter than all
the other things going on in the situation where you actually use this
system.  for example, using this to serve a web page in the super fast
case looks something like this (there's a rough C sketch of the whole
loop further down):

  sigwaitinfo() - aha, POLL_IN on the listening socket..
  accept() the new fd
  set up sigio and such on the new fd (dorky, we have to do this in linux
    rather than inheriting it from the listening fd, but it has yet to
    show up on the profile radar, so, whatever :))
  read() in the header (usually done in one read, but rarely it will
    block and require falling back to a POLL_IN on the new fd)
  parse the header, ideally hash/lookup.
  write() out the precalced header and premapped data.  perhaps a
    writev() if you're a wimp :) :)

so even in the ridiculously light path of a cheating caching webserver,
the overhead of copying the siginfo over is dwarfed by the rest of the
stuff we're doing in response to the event.  of course, this could change
if you had a situation where you could burn through events like nothing
else and simply couldn't deal with the lock-step..

> Also, I would guess that you would start getting into locking problems,
> and how to cancel a signal which has already been posted.

locking problems?

yes, the possibility of getting stale events in the queue is _annoying_.
This is going to be a problem in any system that passes state deltas to
the process in a queued manner.  hacks could be put in, and perhaps
should be, to remove events in the queue for a fd when it is closed, etc.

take the web server case again.  it is quite possible to close() an fd
while there is an event queued for it, and then accept() a new fd that
now has a bogus event coming down the pipe for it.  I get around this
garbage in the cheesy web server by doing deferred close()s on fds, based
on the length of the queue at the point where I stopped being interested
in the fd (and as such turned off sigio delivery).  it's gross.
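to make the hand-waving concrete, here's roughly what the whole dance
looks like in C.  a sketch only, not the actual server code: F_SETSIG and
the siginfo si_fd/si_band fields are linuxisms, serve_request() is a
made-up stand-in for the parse-and-write fast path, and most error
handling has been tossed overboard:

#define _GNU_SOURCE             /* F_SETSIG et al are linux extensions */
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <sys/socket.h>
#include <unistd.h>

#define EVQ_SIG (SIGRTMIN + 1)  /* the rt signal our siginfos queue on */

/* register interest: POLL_ events on fd now queue a siginfo for us */
static void watch_fd(int fd)
{
	fcntl(fd, F_SETOWN, getpid());
	fcntl(fd, F_SETSIG, EVQ_SIG);
	fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_ASYNC | O_NONBLOCK);
}

/* stand-in for the fast path: read the header, write the canned reply */
static void serve_request(int fd)
{
	char buf[4096];

	if (read(fd, buf, sizeof(buf)) > 0)
		write(fd, "HTTP/1.0 200 OK\r\n\r\nhi\n", 22);
}

static void event_loop(int listen_fd)
{
	sigset_t sigs;
	siginfo_t info;

	/* mask the signal; we never take async delivery, we just pop
	 * siginfo structs off the kernel's queue with sigwaitinfo() */
	sigemptyset(&sigs);
	sigaddset(&sigs, EVQ_SIG);
	sigprocmask(SIG_BLOCK, &sigs, NULL);

	watch_fd(listen_fd);

	for (;;) {
		if (sigwaitinfo(&sigs, &info) < 0)
			continue;

		if (info.si_fd == listen_fd) {
			int fd = accept(listen_fd, NULL, NULL);
			if (fd >= 0)
				watch_fd(fd);	/* the dorky re-registration */
		} else if (info.si_band & (POLLIN | POLLERR | POLLHUP)) {
			serve_request(info.si_fd);
		}
	}
}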
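and the deferred close bolts onto that loop as a little bookkeeping.
again just a sketch (plus <stdlib.h>), and I'm hand-waving how you
estimate the queue depth (a safe overestimate merely delays the real
close()).  the idea is to turn off sigio, keep the fd number pinned so
accept() can't reuse it, and only really close() once enough further
events have drained that nothing stale can still be queued for it:

struct dying_fd {
	int fd;
	long close_at;		/* ev_seq at which it's safe to close */
	struct dying_fd *next;
};

static long ev_seq;		/* bumped once per sigwaitinfo() pop */
static struct dying_fd *dying;

static void lazy_close(int fd, long queue_depth_guess)
{
	struct dying_fd *d = malloc(sizeof(*d));

	/* stop new events, but keep the fd number pinned */
	fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) & ~O_ASYNC);
	d->fd = fd;
	d->close_at = ev_seq + queue_depth_guess;
	d->next = dying;
	dying = d;
}

/* called once per event pop, before dispatching */
static void reap_dying(void)
{
	struct dying_fd **dp = &dying, *d;

	ev_seq++;
	while ((d = *dp) != NULL) {
		if (ev_seq >= d->close_at) {
			close(d->fd);	/* finally */
			*dp = d->next;
			free(d);
		} else {
			dp = &d->next;
		}
	}
}

the event loop would also check si_fd against the dying list before
dispatching, so a stale event gets swallowed instead of served.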
but even with these problems, the rt signal queue is quite powerful.  to
do better would require a fair bit of engineering, and one might quickly
be bogged down in featuritis.

-- 
zach

- - - - - - 007 373 5963