From owner-freebsd-current@FreeBSD.ORG  Thu Mar  4 13:42:50 2004
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP
	id EC63F16A4CE; Thu,  4 Mar 2004 13:42:49 -0800 (PST)
Received: from gw.catspoiler.org (217-ip-163.nccn.net [209.79.217.163])
	by mx1.FreeBSD.org (Postfix) with ESMTP
	id 6D16D43D2D; Thu,  4 Mar 2004 13:42:47 -0800 (PST)
	(envelope-from truckman@FreeBSD.org)
Received: from FreeBSD.org (mousie.catspoiler.org [192.168.101.2])
	by gw.catspoiler.org (8.12.9p2/8.12.9) with ESMTP id i24LgS7E005323;
	Thu, 4 Mar 2004 13:42:33 -0800 (PST)
	(envelope-from truckman@FreeBSD.org)
Message-Id: <200403042142.i24LgS7E005323@gw.catspoiler.org>
Date: Thu, 4 Mar 2004 13:42:28 -0800 (PST)
From: Don Lewis <truckman@FreeBSD.org>
To: bde@zeta.org.au
In-Reply-To: <20040305012433.B9033@gamplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/plain; charset=us-ascii
cc: freebsd-current@FreeBSD.org
cc: rwatson@FreeBSD.org
Subject: Re: sysctl spinning (was: Re: ps Causes Hard Hang)
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 04 Mar 2004 21:42:50 -0000

On  5 Mar, Bruce Evans wrote:
> On Thu, 4 Mar 2004, Don Lewis wrote:
> 
>> I just checked the mlock() man page in the Single UNIX Specification
>> Version 3.  Our mlock() implementation is broken.  We should be
>> returning ENOMEM here, though this will result in some sort of user
>> visible sysctl breakage instead of a tight loop.
> 
> POSIX is more authoritative.  It says that ENOMEM may be returned if
> an implementation-defined limit would be exceeded, and that EAGAIN
> shall be returned if some or all of the memory could not be locked
> when the call was made.  So it literally requires returning EAGAIN
> even when the limit would be exceeded, but it doesn't mean to require
> that.

That's the same as SUSv3.

I was pretty frazzled when I sent my previous message and was thinking
the test was something like:
	if (atop(size) > vm_page_max_wired)

The following patch is probably better.

Index: sys/vm/vm_mmap.c
===================================================================
RCS file: /home/ncvs/src/sys/vm/vm_mmap.c,v
retrieving revision 1.181
diff -u -r1.181 vm_mmap.c
--- sys/vm/vm_mmap.c	1 Mar 2004 02:44:33 -0000	1.181
+++ sys/vm/vm_mmap.c	4 Mar 2004 21:24:29 -0000
@@ -923,8 +923,8 @@
 	if (addr + size < addr)
 		return (EINVAL);
 
-	if (atop(size) + cnt.v_wire_count > vm_page_max_wired)
-		return (EAGAIN);
+	if (atop(size) > vm_page_max_wired)
+		return (ENOMEM);
 
 	PROC_LOCK(proc);
 	if (size + ptoa(pmap_wired_count(vm_map_pmap(&proc->p_vmspace->vm_map))) >
@@ -933,6 +933,9 @@
 		return (ENOMEM);
 	}
 	PROC_UNLOCK(proc);
+
+	if (atop(size) + cnt.v_wire_count > vm_page_max_wired)
+		return (EAGAIN);
 
 	error = vm_map_wire(&proc->p_vmspace->vm_map, addr,
 		     addr + size, VM_MAP_WIRE_USER|VM_MAP_WIRE_NOHOLES);


But ...

>> > I think EAGAIN was only meant for retrying after transient changes
>> > to the data.
>>
>> I think there may be legitimate cases where our memory wiring
>> implementation would legitimately want to return EAGAIN and not cause a
>> tight loop, but there is currently no way to distinguish this from the
>> case that you mention.
> 
> I think it should return EAGAIN except when certain limits including the
> rlimit would be exceeded.  The rlimit is often large or infinity, so
> it must be easy to exceed the system-wide limit if lots of processes uses
> locked memory.  Perhaps every process should be guaranteed a few pages of
> locked memory so that most systctls always work.  This is like guaranteeing
> that a minimum number of files (20 or so) can be opened, but I don't
> believe we even guarantee that, especially when the files are pipes
> wanting 64K of wired memory each.

One problem is that we should really distinguish between "permanently"
wired memory (from mlock(), etc.), from memory that is transiently wired
by sysctl() and device drivers.   I started regretting my change that
merged the mlock() and vslock() implementations after I realized this.

Permanent requests should be rejected while there is still memory
available for transient requests.

All transient requests should be checked against the limits.

There should be a per-uid limit on total wired memory so that a single
user can't DoS the system by consuming all the allowable wired memory
even if the user forks a large number of processes to take full
advantage of any per-process limits.

There should be a way of queuing transient requests, either blocking
transparently within the API, or by allowing the caller to sleep and be
woken when the request would succeed.