From owner-freebsd-stable@FreeBSD.ORG Sat Aug 20 18:36:58 2011
Message-ID: <4E4FFAD3.4090706@rice.edu>
Date: Sat, 20 Aug 2011 13:20:03 -0500
From: Alan Cox
User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.9.2.17) Gecko/20110620 Thunderbird/3.1.10
To: Kostik Belousov
Cc: alc@freebsd.org, freebsd-stable@freebsd.org, perryh@pluto.rain.com, "Alexander V. Chernikov", daniel@digsys.bg
References: <4E4143A6.6030307@digsys.bg> <935F8EC2-88E0-45A3-BE8B-7210BE223BC5@mac.com> <4e42a0c0.e2t/9MF98O3HFjb1%perryh@pluto.rain.com> <4E4CCA6C.8020408@ipfw.ru> <20110820174147.GW17489@deviant.kiev.zoral.com.ua>
In-Reply-To: <20110820174147.GW17489@deviant.kiev.zoral.com.ua>
Subject: Re: 32GB limit per swap device?
List-Id: Production branch of FreeBSD source code

On 08/20/2011 12:41, Kostik Belousov wrote:
> On Sat, Aug 20, 2011 at 12:33:29PM -0500, Alan Cox wrote:
>> On Thu, Aug 18, 2011 at 3:16 AM, Alexander V. Chernikov wrote:
>>
>>> On 10.08.2011 19:16, perryh@pluto.rain.com wrote:
>>>
>>>> Chuck Swiger wrote:
>>>>
>>>> On Aug 9, 2011, at 7:26 AM, Daniel Kalchev wrote:
>>>>>> I am trying to set up 64GB partitions for swap for a system that
>>>>>> has 64GB of RAM (with the idea to dump kernel core etc). But, on
>>>>>> 8-stable as of today I get:
>>>>>>
>>>>>> WARNING: reducing size to maximum of 67108864 blocks per swap unit
>>>>>>
>>>>>> Is there workaround for this limitation?
>>>>>>
>>> Another interesting question:
>>>
>>> swap pager operates in page blocks (PAGE_SIZE=4k on common arch).
>>>
>>> Block device size is passed to swaponsomething() in number of _disk_ blocks
>>> (e.g. in DEV_BSIZE=512). After that, kernel b-lists (on top of which swap
>>> pager is built) maximum objects check is enforced.
>>>
>>> The (possible) problem is that real object count we will operate on is not
>>> the value passed to swaponsomething() since it is calculated in wrong units.
>>>
>>> we should check b-list limit on (X * DEV_BSIZE / PAGE_SIZE) value which
>>> is rough (X / 8) so we should be able to address 32*8=256G.
>>>
>>> The code should look like this:
>>>
>>> Index: vm/swap_pager.c
>>> ===================================================================
>>> --- vm/swap_pager.c (revision 223877)
>>> +++ vm/swap_pager.c (working copy)
>>> @@ -2129,6 +2129,15 @@ swaponsomething(struct vnode *vp, void *id, u_long
>>>          u_long mblocks;
>>>
>>>          /*
>>> +         * nblks is in DEV_BSIZE'd chunks, convert to PAGE_SIZE'd chunks.
>>> +         * First chop nblks off to page-align it, then convert.
>>> +         *
>>> +         * sw->sw_nblks is in page-sized chunks now too.
>>> +         */
>>> +        nblks &= ~(ctodb(1) - 1);
>>> +        nblks = dbtoc(nblks);
>>> +
>>> +        /*
>>>           * If we go beyond this, we get overflows in the radix
>>>           * tree bitmap code.
>>>           */
>>> @@ -2138,14 +2147,6 @@ swaponsomething(struct vnode *vp, void *id, u_long
>>>                      mblocks);
>>>                  nblks = mblocks;
>>>          }
>>> -        /*
>>> -         * nblks is in DEV_BSIZE'd chunks, convert to PAGE_SIZE'd chunks.
>>> -         * First chop nblks off to page-align it, then convert.
>>> -         *
>>> -         * sw->sw_nblks is in page-sized chunks now too.
>>> -         */
>>> -        nblks &= ~(ctodb(1) - 1);
>>> -        nblks = dbtoc(nblks);
>>>
>>>          sp = malloc(sizeof *sp, M_VMPGDATA, M_WAITOK | M_ZERO);
>>>          sp->sw_vp = vp;
>>>
>>>
>>> (move pages recalculation before b-list check)
>>>
>>>
>>> Can someone comment on this?
>>>
>>>
>> I believe that you are correct. Have you tried testing this change on a
>> large swap device?
> I probably agree too, but I am in the process of re-reading the swap code,
> and I do not quite believe in the limit.
>

I'm uncertain whether the current limit, "0x40000000 / BLIST_META_RADIX",
is exact or not, but I doubt that it is too large.

> When the initial code was committed, our daddr_t was 32bit, I checked
> the RELENG_4 sources. Current code uses int64_t for daddr_t. My impression
> right now is that we only utilize the low 32bits of daddr_t.
>
> Esp. interesting looks the following typedef:
> typedef uint32_t u_daddr_t; /* unsigned disk address */
> which (correctly) means that typical mask (u_daddr_t)-1 is 0xffffffff.
>
> I wonder whether we could just use full 64bit and de-facto remove the
> limitation on the swap partition size.

I would rather argue first that the subr_blist code should not be using
daddr_t at all. The code is abusing daddr_t and defining u_daddr_t to
represent things that are not disk addresses. Instead, it should either
define its own type or directly use (u)int*_t.

Then, as for choosing between 32 and 64 bits, I'm skeptical of using this structure for managing more than 32 bits worth of blocks, given the amount of RAM it will use.
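
For readers following the arithmetic in this thread, here is a minimal sketch in C of the two calculations being discussed: how large a swap device the block limit allows depending on the unit it is checked in, and roughly how much RAM a bitmap-based allocator needs for a given number of blocks. The 512-byte and 4k sizes and the 67108864-block limit come from the messages above. The function names, the 32-blocks-per-leaf figure, and the 16-byte leaf node size are assumptions made for illustration only; they are not taken from the kernel sources.

```c
#include <assert.h>
#include <stdint.h>

/* Values quoted in the thread. */
static const uint64_t dev_bsize = 512;              /* DEV_BSIZE */
static const uint64_t page_size = 4096;             /* PAGE_SIZE on common arch */
static const uint64_t max_blist_blocks = 67108864;  /* from the WARNING message */

/* Assumptions for the RAM estimate; hypothetical, not from the thread. */
static const uint64_t assumed_blocks_per_leaf = 32; /* one 32-bit bitmap word */
static const uint64_t assumed_leaf_node_bytes = 16; /* size of one leaf node */

/*
 * Largest swap device, in bytes, when the block limit is applied to
 * blocks of unit_size bytes.
 */
static uint64_t
max_swap_bytes(uint64_t max_blocks, uint64_t unit_size)
{
	return (max_blocks * unit_size);
}

/* Lower bound on allocator RAM: one bitmap bit per managed block. */
static uint64_t
bitmap_bytes(uint64_t nblocks)
{
	return (nblocks / 8);
}

/*
 * RAM used by leaf nodes alone under the assumed node layout above.
 * Interior (meta) nodes would add to this figure.
 */
static uint64_t
leaf_node_bytes(uint64_t nblocks)
{
	/* Keep the sketch simple: require a whole number of leaves. */
	assert(nblocks % assumed_blocks_per_leaf == 0);
	return (nblocks / assumed_blocks_per_leaf * assumed_leaf_node_bytes);
}
```

With the limit applied to 512-byte disk blocks, max_swap_bytes() gives 32 GiB, which matches the subject line; applied to 4k pages it gives 256 GiB, which is the figure in the proposed patch. For 2^32 blocks, bitmap_bytes() gives 512 MiB for the bitmap alone, and under the assumed node layout leaf_node_bytes() gives 2 GiB.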