From: Alan Cox
To: John Baldwin
Cc: freebsd-hackers@freebsd.org, Sergey Kandaurov
Reply-To: alc@freebsd.org
Subject: Re: [rfc] allow to boot with >= 256GB physmem
Date: Fri, 21 Jan 2011 15:43:22 -0600
References: <201101211244.13830.jhb@freebsd.org>
List-Id: Technical Discussions relating to FreeBSD

On Fri, Jan 21, 2011 at 2:58 PM, Alan Cox wrote:

> On Fri, Jan 21, 2011 at 11:44 AM, John Baldwin wrote:
>
>> On Friday, January 21, 2011 11:09:10 am Sergey Kandaurov wrote:
>> > Hello.
>> >
>> > Some time ago I was faced with a problem booting with 400GB physmem.
>> > The problem is that the type of vm.max_proc_mmap overflows with
>> > such a high value, and that results in a broken mmap() syscall.
>> > The max_proc_mmap value is a signed int and is roughly calculated
>> > in vmmapentry_rsrc_init() from the u_long vm_kmem_size as the quotient:
>> > vm_kmem_size / sizeof(struct vm_map_entry) / 100.
>> >
>> > Although at the time it was introduced in svn r57263 the value
>> > was quite low (e.g. the related commit log states:
>> > "The value defaults to around 9000 for a 128MB machine."),
>> > the problem is observed on amd64, where KVA space after
>> > r212784 is effectively bound only by the physical memory size.
>> >
>> > With INT_MAX here being 0x7fffffff, and sizeof(struct vm_map_entry)
>> > being 120, it's enough to have slightly less than 256GB to be able
>> > to reproduce the problem.
>> >
>> > I rewrote vmmapentry_rsrc_init() to set a large enough limit for
>> > max_proc_mmap just to protect from integer type overflow.
>> > As it's also possible to tune this value live, I also added a
>> > simple anti-foot-shooting constraint to its sysctl handler.
>> > I'm not sure though if it's worth committing the second part.
>> >
>> > As this patch may cause some bikeshedding,
>> > I'd like to hear your comments before I commit it.
>> >
>> > http://plukky.net/~pluknet/patches/max_proc_mmap.diff
>>
>> Is there any reason we can't just make this variable and sysctl a long?
>>
>>
> Or just delete it.
>
> 1. Contrary to what the commit message says, this sysctl does not
> effectively limit the number of vm map entries.
> It only limits the number
> that are created by one system call, mmap(). Other system calls create vm
> map entries just as easily, for example, mprotect(), madvise(), mlock(), and
> minherit(). Basically, anything that alters the properties of a mapping.
> Thus, in 2000, after this sysctl was added, the same resource exhaustion
> induced crash could have been reproduced by trivially changing the program
> in PR/16573 to do an mprotect() or two.
>
> In a nutshell, if you want to really limit the number of vm map entries
> that a process can allocate, the implementation is a bit more involved than
> what was done for this sysctl.
>
> 2. UMA implements M_WAITOK, whereas the old zone allocator in 2000 did
> not. Moreover, vm map entries for user maps are allocated with M_WAITOK.
> So, the exact crash reported in PR/16573 couldn't happen any longer.
>
>
Actually, I take back part of what I said here. The old zone allocator did
implement something like M_WAITOK, and that appears to have been used for
user maps. However, the crash described in PR/16573 was actually on the
allocation of a vm map entry within the *kernel* address space for a process
U area. This type of allocation did not use the old zone allocator's
equivalent to M_WAITOK. However, we no longer have U areas, so the exact
crash scenario is clearly no longer possible.

Interestingly, the sysctl in question has no direct effect on the allocation
of kernel vm map entries. So, I remain skeptical that this sysctl is
preventing any resource exhaustion based panics in the current kernel.
Again, I would be thrilled to see one or more people do some testing, such
as rerunning the program from PR/16573.


> 3. We now have the "vmemoryuse" resource limit. When this sysctl was
> defined, we didn't. Limiting the virtual memory indirectly but effectively
> limits the number of vm map entries that a process can allocate.
>
> In summary, I would do a little due diligence, for example, run the program
> from PR/16573 with the limit disabled. If you can't reproduce the crash, in
> other words, nothing contradicts point #2 above, then I would just delete
> this sysctl.
>
> Alan
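
For reference, the overflow arithmetic quoted at the top of this thread can
be checked with a small C sketch. It is an illustration only, not the kernel
code. It assumes that the quotient vm_kmem_size / sizeof(struct vm_map_entry)
is narrowed to a 32-bit signed int before the division by 100, which is what
the "slightly less than 256GB" threshold suggests. It takes the entry size of
120 bytes from the quoted message and treats vm_kmem_size as equal to
physical memory. All function names are hypothetical.

```c
#include <stdint.h>

/* Assumed from the quoted message: sizeof(struct vm_map_entry) on amd64. */
#define MAP_ENTRY_SIZE 120ULL

/*
 * Model a two's-complement narrowing of a 64-bit quotient to a 32-bit
 * signed int. Done by hand so the result does not depend on the
 * compiler's implementation-defined conversion.
 */
int32_t
narrow_to_int32(uint64_t q)
{
	uint32_t low = (uint32_t)q;	/* keep the low 32 bits */
	int64_t s = (low > (uint32_t)INT32_MAX) ?
	    (int64_t)low - 4294967296LL : (int64_t)low;

	return ((int32_t)s);
}

/* Hypothetical model of the limit when it is kept in a signed int. */
int32_t
max_proc_mmap_int_model(uint64_t kmem_size)
{
	return (narrow_to_int32(kmem_size / MAP_ENTRY_SIZE) / 100);
}

/* The same quotient kept in a 64-bit type, as with a long on amd64. */
int64_t
max_proc_mmap_wide_model(uint64_t kmem_size)
{
	return ((int64_t)(kmem_size / MAP_ENTRY_SIZE / 100));
}

/* Smallest kmem_size whose first quotient no longer fits in an int. */
uint64_t
overflow_threshold(void)
{
	return (((uint64_t)INT32_MAX + 1) * MAP_ENTRY_SIZE);
}
```

Under these assumptions overflow_threshold() comes out at 240 GiB, which
matches "slightly less than 256GB". With a 400 GiB input the int model
yields a negative limit, while the wide model stays positive.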