From owner-freebsd-arch  Tue Jul 16  2:54:58 2002
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.FreeBSD.org (mx1.FreeBSD.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id A92F837B400
	for <arch@FreeBSD.ORG>; Tue, 16 Jul 2002 02:54:51 -0700 (PDT)
Received: from gw.catspoiler.org (217-ip-163.nccn.net [209.79.217.163])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 0811B43E4A
	for <arch@FreeBSD.ORG>; Tue, 16 Jul 2002 02:54:51 -0700 (PDT)
	(envelope-from dl-freebsd@catspoiler.org)
Received: from mousie.catspoiler.org (mousie.catspoiler.org [192.168.101.2])
	by gw.catspoiler.org (8.12.5/8.12.5) with ESMTP id g6G9scwr026538;
	Tue, 16 Jul 2002 02:54:42 -0700 (PDT)
	(envelope-from dl-freebsd@catspoiler.org)
Message-Id: <200207160954.g6G9scwr026538@gw.catspoiler.org>
Date: Tue, 16 Jul 2002 02:54:38 -0700 (PDT)
From: Don Lewis <dl-freebsd@catspoiler.org>
Subject: Re: wiring the sysctl output buffer
To: bde@zeta.org.au
Cc: arch@FreeBSD.ORG
In-Reply-To: <20020714201020.F35894-100000@gamplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/plain; charset=US-ASCII
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

On 14 Jul, Bruce Evans wrote:
> On Sun, 14 Jul 2002, Don Lewis wrote:
> 
>> A number of places in the kernel call SYSCTL_OUT() while holding locks.
>> The problem is that SYSCTL_OUT() can block while it wires down the "old"
>> buffer in user space.  If the data is small, such as an integer value,
> 
> This seems to have been broken by FreeBSD.  In rev.1.1, vslock() was
> called at the start of the sysctl, so the only problem with it was its
> pessimalness (*).  Now I think the vslock() is just a pessimization,
> since the point of doing it was to avoid blocking during the copyout(),
> but blocking during vslock() is equivalent.  Also, vslock() is insufficent
> if the kernel is preemptible.

Yes, in some ways it looks like we went backwards.  We can solve the
problem of the data changing while it is being copied by locking the
source of the data, but we need to call vslock() before locking the
source of the data to avoid being blocked in vslock() while holding
locks.  The current implematation which calls vslock() in the first
call to SYSCTL_OUT() is the reason that I'm looking at this stuff.

>> the best solution may be to copy the data and defer calling SYSCTL_OUT()
>> until after the locks are released.  This extra copy could be wasteful
>> if the data is large, and it may be cumbersome if the data size is not
>> known ahead of time, since determining the data size may be expensive.
> 
> The extra copy is unlikely to be more wasteful than vslock()'ing, since
> vfslock() is a very expensive operation.  Some instruction counts on
> a UP i386 kernel with no INVARIANTS or WITNESS:
> 
> (*) vslock():   about 2700 instructions (for a length of just 16)
>     vsunlock(): about 1400
>     useracc():  about  400
> 
> `oldlen' is known ahead of time, so a large enough buffer could be
> allocated in most cases.  Perhaps vslock() iff oldlen is very large.
> The data could still change while it is being copied to a malloc()'ed
> buffer (or earlier) if the kernel is preemptible.

I basically like this approach.  In most cases the data will be a small
and of fixed size, so the handler could even store the temporary copy on
the stack.  If the data is a type that can be copied atomically, like an
int, we don't have to worry about the data changing in mid-copy.  If it
is larger and we care about getting an atomic snapshot, the handler can
lock the source for the duration of the copy to the temporary buffer.

For large amounts of data, we probably want to avoid copying to a
temporary buffer.  PHK raised the issue of the user specifying a buffer
size that is way too large.  Allocating a huge buffer in wired down
kernel memory sounds even worse than wiring down a huge buffer in user
space.

If we don't want to allocate a temporary buffer of length oldlen because
it is much larger than what looks to be reasonable, but the amount of
data is not easy to calculate ahead of time, we have another potential
problem.  We might have to lock a data structure, walk through this
data structure and calculate the total data size, and unlock the data
structure.  After we malloc() the buffer, relock the data structure, and
start traversing the data structure again to do the copy, we may
discover that the amount of data to be returned has increased, so we
would have to drop the lock, free the buffer, and start over again.

>> The data size could also potentially change between the time the size is
>> calculatated and the time when the data is copied to the temporary
>> buffer if the locks must be released so that MALLOC() can be called to
>> allocate the buffer.
> 
> Hopefully the sysctl locking is enough to prevent at least *req changing
> underneath us.
> 
>> I think the best solution to this problem is to add a sysctl system API
>> to prewire the output buffer that can be called before grabbing the
>> locks.  Doing so allows the existing code to operate properly with only
>> minimal changes.
> 
>> Here's a patch that implements this new API and illustrates how it can
>> be used to fix the kern.function_list sysctl, which wants to call
>> SYSCTL_OUT() multiple times while walking a locked data structure.  I
>> think this is the best fix, though I'd like some feedback on whether
>> this is the best API.
> 
> I think the callers of SYSCTL_OUT() don't need a new API, but they
> should supply the data in a form suitable for copying out directly.
> This probably involves them making a copy while holding suitable
> domain-specific locks.  They can't just point to an active kernel
> struct since it might change.
> 
> I would just make a copy at a high level for now.

I'd vote for a hybrid approach.  I would remove the vslock() call from
sysctl_old_user() and provide the new API for invoking it from the
handler if appropriate.  Instead, I would call WITNESS_SLEEP() if
vslock() had not been called to wire the buffer to guard against
blocking while locks are held.  I would keep the call to vsunlock() in
the sysctl cleanup code.

If the handler makes a copy of the data and releases any locks before
calling SYSCTL_OUT(), then we can avoid the expense of vslock().  If
appropriate, the handler would still have the ability to wire the buffer
before grabbing a lock, and then invoke SYSCTL_OUT() while the lock is
held.

I'd also like to provide a second argument to the interface to vslock()
to allow the handler to specify a maximum, kernel enforced, buffer
length to potentially limit the amount of wired memory to something less
than oldlen, but unless we add a member to the sysctl_req structure to
hold it, we would need to overwrite oldlen so that we can pass the
proper value to vsunlock() in the cleanup code.  What kind impact would
changing sysctl_req have on binary compatibility?  I doesn't look like a
problem to me, since this structure seems to be mostly treated as an
opaque type.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message