From owner-freebsd-arch Tue Jul 16 2:54:58 2002 Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.FreeBSD.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id A92F837B400 for ; Tue, 16 Jul 2002 02:54:51 -0700 (PDT) Received: from gw.catspoiler.org (217-ip-163.nccn.net [209.79.217.163]) by mx1.FreeBSD.org (Postfix) with ESMTP id 0811B43E4A for ; Tue, 16 Jul 2002 02:54:51 -0700 (PDT) (envelope-from dl-freebsd@catspoiler.org) Received: from mousie.catspoiler.org (mousie.catspoiler.org [192.168.101.2]) by gw.catspoiler.org (8.12.5/8.12.5) with ESMTP id g6G9scwr026538; Tue, 16 Jul 2002 02:54:42 -0700 (PDT) (envelope-from dl-freebsd@catspoiler.org) Message-Id: <200207160954.g6G9scwr026538@gw.catspoiler.org> Date: Tue, 16 Jul 2002 02:54:38 -0700 (PDT) From: Don Lewis Subject: Re: wiring the sysctl output buffer To: bde@zeta.org.au Cc: arch@FreeBSD.ORG In-Reply-To: <20020714201020.F35894-100000@gamplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/plain; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.ORG On 14 Jul, Bruce Evans wrote: > On Sun, 14 Jul 2002, Don Lewis wrote: > >> A number of places in the kernel call SYSCTL_OUT() while holding locks. >> The problem is that SYSCTL_OUT() can block while it wires down the "old" >> buffer in user space. If the data is small, such as an integer value, > > This seems to have been broken by FreeBSD. In rev.1.1, vslock() was > called at the start of the sysctl, so the only problem with it was its > pessimalness (*). Now I think the vslock() is just a pessimization, > since the point of doing it was to avoid blocking during the copyout(), > but blocking during vslock() is equivalent. Also, vslock() is insufficent > if the kernel is preemptible. Yes, in some ways it looks like we went backwards. We can solve the problem of the data changing while it is being copied by locking the source of the data, but we need to call vslock() before locking the source of the data to avoid being blocked in vslock() while holding locks. The current implematation which calls vslock() in the first call to SYSCTL_OUT() is the reason that I'm looking at this stuff. >> the best solution may be to copy the data and defer calling SYSCTL_OUT() >> until after the locks are released. This extra copy could be wasteful >> if the data is large, and it may be cumbersome if the data size is not >> known ahead of time, since determining the data size may be expensive. > > The extra copy is unlikely to be more wasteful than vslock()'ing, since > vfslock() is a very expensive operation. Some instruction counts on > a UP i386 kernel with no INVARIANTS or WITNESS: > > (*) vslock(): about 2700 instructions (for a length of just 16) > vsunlock(): about 1400 > useracc(): about 400 > > `oldlen' is known ahead of time, so a large enough buffer could be > allocated in most cases. Perhaps vslock() iff oldlen is very large. > The data could still change while it is being copied to a malloc()'ed > buffer (or earlier) if the kernel is preemptible. I basically like this approach. In most cases the data will be a small and of fixed size, so the handler could even store the temporary copy on the stack. If the data is a type that can be copied atomically, like an int, we don't have to worry about the data changing in mid-copy. If it is larger and we care about getting an atomic snapshot, the handler can lock the source for the duration of the copy to the temporary buffer. For large amounts of data, we probably want to avoid copying to a temporary buffer. PHK raised the issue of the user specifying a buffer size that is way too large. Allocating a huge buffer in wired down kernel memory sounds even worse than wiring down a huge buffer in user space. If we don't want to allocate a temporary buffer of length oldlen because it is much larger than what looks to be reasonable, but the amount of data is not easy to calculate ahead of time, we have another potential problem. We might have to lock a data structure, walk through this data structure and calculate the total data size, and unlock the data structure. After we malloc() the buffer, relock the data structure, and start traversing the data structure again to do the copy, we may discover that the amount of data to be returned has increased, so we would have to drop the lock, free the buffer, and start over again. >> The data size could also potentially change between the time the size is >> calculatated and the time when the data is copied to the temporary >> buffer if the locks must be released so that MALLOC() can be called to >> allocate the buffer. > > Hopefully the sysctl locking is enough to prevent at least *req changing > underneath us. > >> I think the best solution to this problem is to add a sysctl system API >> to prewire the output buffer that can be called before grabbing the >> locks. Doing so allows the existing code to operate properly with only >> minimal changes. > >> Here's a patch that implements this new API and illustrates how it can >> be used to fix the kern.function_list sysctl, which wants to call >> SYSCTL_OUT() multiple times while walking a locked data structure. I >> think this is the best fix, though I'd like some feedback on whether >> this is the best API. > > I think the callers of SYSCTL_OUT() don't need a new API, but they > should supply the data in a form suitable for copying out directly. > This probably involves them making a copy while holding suitable > domain-specific locks. They can't just point to an active kernel > struct since it might change. > > I would just make a copy at a high level for now. I'd vote for a hybrid approach. I would remove the vslock() call from sysctl_old_user() and provide the new API for invoking it from the handler if appropriate. Instead, I would call WITNESS_SLEEP() if vslock() had not been called to wire the buffer to guard against blocking while locks are held. I would keep the call to vsunlock() in the sysctl cleanup code. If the handler makes a copy of the data and releases any locks before calling SYSCTL_OUT(), then we can avoid the expense of vslock(). If appropriate, the handler would still have the ability to wire the buffer before grabbing a lock, and then invoke SYSCTL_OUT() while the lock is held. I'd also like to provide a second argument to the interface to vslock() to allow the handler to specify a maximum, kernel enforced, buffer length to potentially limit the amount of wired memory to something less than oldlen, but unless we add a member to the sysctl_req structure to hold it, we would need to overwrite oldlen so that we can pass the proper value to vsunlock() in the cleanup code. What kind impact would changing sysctl_req have on binary compatibility? I doesn't look like a problem to me, since this structure seems to be mostly treated as an opaque type. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message