From owner-freebsd-hackers  Mon Feb 20 09:17:26 1995
Return-Path: hackers-owner
Received: (from majordom@localhost) by freefall.cdrom.com (8.6.9/8.6.6)
	id JAA28470 for hackers-outgoing; Mon, 20 Feb 1995 09:17:26 -0800
Received: from cs.weber.edu (cs.weber.edu [137.190.16.16]) by
	freefall.cdrom.com (8.6.9/8.6.6) with SMTP id JAA28464 for ;
	Mon, 20 Feb 1995 09:17:25 -0800
Received: by cs.weber.edu (4.1/SMI-4.1.1) id AA03273;
	Mon, 20 Feb 95 10:10:57 MST
From: terry@cs.weber.edu (Terry Lambert)
Message-Id: <9502201710.AA03273@cs.weber.edu>
Subject: Re: getrlimit()/setrlimit() strangeness
To: wpaul@skynet.ctr.columbia.edu (Wankle Rotary Engine)
Date: Mon, 20 Feb 95 10:10:57 MST
Cc: freebsd-hackers@FreeBSD.org
In-Reply-To: <199502200656.BAA02173@skynet.ctr.columbia.edu> from
	"Wankle Rotary Engine" at Feb 20, 95 01:56:03 am
X-Mailer: ELM [version 2.4dev PL52]
Sender: hackers-owner@FreeBSD.org
Precedence: bulk

> The other day a user here asked about increasing the per-process limit
> for the maximum number of open file descriptors (they have a server
> process that needs to have many file descriptors open at once for some
> periods of time).  I put together the following test program to
> demonstrate how getrlimit() and setrlimit() could be used for this
> purpose:

[ ... ]

> This attempts to set the number of permitted open file descriptors to
> 1024, which is only possible if the hard limit is equal to or higher
> than that.  I decided to try this program on all the platforms I had
> around to see just how portable it would be.  Turns out that it works
> fine on just about all of them -- except FreeBSD. :(

[ ... ]

> In FreeBSD-current, weird things happen.  I'll use freefall as an
> example since I tested this program there.  (The same behavior shows
> up on my office machine, only my default limits are different because
> my system configuration isn't the same as freefall's.)
>
> On freefall, I defined MAXCONNECTIONS to be 2048 instead of 1024 since
> freefall's hard limit was higher than 1024.
>
> getrlimit() reported that the soft file descriptor limit was 128 (which
> is correct) and that the hard limit was -1 (which is thoroughly bogus).
> The sysctl command showed that the hard limit was 1320.  Attempting to
> set the soft and hard limits to 2048 appeared to succeed, but reading
> back the limits afterwards showed that both limits were maxed out at
> 1320.  This behavior is not what I consider to be correct: the attempt
> to raise the limits above the hard limit should have failed noisily;
> instead it failed silently and the limits were trimmed at the hard
> threshold.  And the hard resource limit is most definitely being
> reported incorrectly.  Why sysctl can see it properly but not
> getrlimit() I have no idea.  Yet.
>
> On my 1.1.5.1 system at home, the results were a little different but
> equally broken: instead of -1, getrlimit() reported the hard limit to
> be something in the neighborhood of MAXINT.  Aside from that, it
> behaved the same as freefall, which is to say it screwed up.
>
> Anybody else notice this?  Better yet, anybody know how to fix it? :)

This is part of the stuff that needs to be fixed for kernel and user
space multithreading, and as a result of kernel multithreading, it also
wants to be fixed for SMP.

Take a look at the way the per-process open file table maps into the
system open file table, and the way the per-process open file table is
allocated for the process.
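For reference, the kind of getrlimit()/setrlimit() test being described
in the quoted text looks roughly like the sketch below.  This is a
minimal sketch, not the trimmed original program; MAXCONNECTIONS, the
output format, and the error handling are illustrative only.

#include <stdio.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>

#define MAXCONNECTIONS 1024	/* illustrative target for both limits */

int
main(void)
{
	struct rlimit rl;

	/* Read the current soft and hard descriptor limits. */
	if (getrlimit(RLIMIT_NOFILE, &rl) == -1) {
		perror("getrlimit");
		return (1);
	}
	printf("soft %lld, hard %lld\n",
	    (long long)rl.rlim_cur, (long long)rl.rlim_max);

	/*
	 * Try to raise both limits.  If MAXCONNECTIONS exceeds the hard
	 * limit, this should fail (EINVAL for cur > max, EPERM for
	 * raising the hard limit without privilege), not be silently
	 * trimmed.
	 */
	rl.rlim_cur = rl.rlim_max = MAXCONNECTIONS;
	if (setrlimit(RLIMIT_NOFILE, &rl) == -1) {
		perror("setrlimit");
		return (1);
	}

	/* Read the limits back to see what the kernel really applied. */
	if (getrlimit(RLIMIT_NOFILE, &rl) == -1) {
		perror("getrlimit");
		return (1);
	}
	printf("soft now %lld, hard now %lld\n",
	    (long long)rl.rlim_cur, (long long)rl.rlim_max);
	return (0);
}

On a system where setrlimit() behaves correctly, the second getrlimit()
reports exactly the requested values, or the setrlimit() call fails
outright; it never silently clamps the request.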
In most UNIX implementations, what happens is that the per-process open
file table is allocated in chunks (usually chunks of 32), and is then
chained as a linked list of chunks.  In SVR4, the kernel realloc is used
to reallocate the structure as necessary to expand it.  It turns out
that this is about 30% more efficient for your typical programs (this
caveat because bash is not a typical program and will screw you on
nearly every platform as it tries to move the real handles it maintains
around to not conflict with pipes and/or assigned descriptors).

The problem is that since BSD uses neither the SunOS approach nor the
SVR4 approach, it is doomed to failure even when the size is increased.
You cannot allow an increase to take place, even if requested.  In
effect, it might even be possible to write off the end of the list and
blow kernel memory, although blowing it to something "useful" instead
of just resulting in "denial of service" is another matter, and I think
is statistically improbable, since the values being blown in are vnode
addresses and are therefore not very predictable.  Even if you could
predict them, I think that getting a usable value is another matter.

If someone goes in to fix this, I'd suggest a hash collapse for the
system open file table so that there are not multiple system open file
table entries pointing to the same vnode.  I'd also suggest a reference
count on the structure itself, and I'd suggest moving the current file
offset into a per-process area; the current location is bogus for
threading.  The current system open file limit idea is also bogus
without the hash collapse, since it refers to the limit on open files
for all processes instead of the limit on unique open files for the
system.

If you really care about threading, atomic seek/read and seek/write
system calls (I believe SVR4 calls these pread/pwrite) should also be
implemented to avoid seek/seek/read/read and other race conditions
resulting from the offset being a shared quantity (shared only between
threads using the same context, if the other suggested changes are
implemented).


					Terry Lambert
					terry@cs.weber.edu
---
Any opinions in this posting are my own and not those of my present
or previous employers.
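For concreteness, the shared-offset race and the atomic alternative
described above can be sketched as follows.  This uses pread() as SVR4
and later POSIX define it; the file name and offset are arbitrary and
only stand in for whatever a threaded server would actually read.

#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
	char buf[16];
	int fd;

	fd = open("/etc/motd", O_RDONLY);	/* any readable file */
	if (fd == -1)
		return (1);

	/*
	 * Racy idiom with a shared offset: another thread using the
	 * same descriptor can seek between these two calls, so two
	 * threads can interleave as seek/seek/read/read and both end
	 * up reading from the second thread's offset.
	 */
	(void)lseek(fd, (off_t)100, SEEK_SET);
	(void)read(fd, buf, sizeof(buf));

	/*
	 * Atomic form: the offset is an explicit argument, and the
	 * shared file offset is neither consulted nor updated.
	 */
	(void)pread(fd, buf, sizeof(buf), (off_t)100);

	(void)close(fd);
	return (0);
}

Because the offset never passes through the shared file table entry, no
ordering of other threads' seeks can change what pread() returns.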