From: Cedric Blancher
Date: Sat, 21 Sep 2013 03:16:06 +0200
Subject: Re: About Transparent Superpages and Non-transparent superapges
To: Sebastian Kuzminsky
Content-Type: text/plain;
charset=ISO-8859-1
Cc: Patrick Dung, "freebsd-hackers@freebsd.org", "ivoras@freebsd.org"
List-Id: Technical Discussions relating to FreeBSD

[repost, the previous email was stuck because I used an old email address]

On 21 September 2013 03:09, Cedric Blancher wrote:
> On 20 September 2013 17:20, Sebastian Kuzminsky wrote:
>> On Sep 19, 2013, at 22:06 , Patrick Dung wrote:
>>
>>> >We at Line Rate (now F5) are developing support for 1 Gig superpages on amd64. We're basing our work on 9.1.0 for now.
>>> >
>>> >An early preview is available here:
>>> >
>>> >https://github.com/Seb-LineRate/freebsd/tree/freebsd-9.1.0-1gig-pages-NOT-READY-2
>>>
>>> That is cool.
>>>
>>> What type of applications can take advantage of the 1Gb page size?
>>> And is it transparent? Or applications need to be modified?
>>
>> It's transparent for the kernel: all of UMA and kmem_malloc()/kmem_free() is backed by 1 gig superpages.
>>
>> It's not transparent for userspace: applications need to pass a new flag to mmap() to get 1 gig pages.
>
> That may be the wrong approach. What happens if x86 gets more
> huge/largepage sizes like SPARC does (hint: sign an NDA with Intel and
> AMD and get surprised, and then allocate 16 more bits for mmap() if
> you wish to stick with your approach)? For example SPARC64 does 8k,
> 64k, 512k, 4M, 32M, 256M, 2GB and 256GB pages (actual page sizes
> differ from MMU to MMU implementation, and can be probed via pagesize
> -a).
>
> A much better option would be to follow the Solaris approach, which has
> APIs to enumerate the available page sizes and then set the size for
> the heap, the stack, or a given address range (the last one is used to
> get largepages for file I/O via mmap()).
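The enumeration step could look roughly like the sketch below. It assumes getpagesizes() from <sys/mman.h>, which Solaris provides and which I believe FreeBSD provides with the same signature. On any other system it falls back to reporting only the base page size from sysconf(). The names list_pagesizes(), print_pagesizes() and MAX_PAGESIZES are made up for this example.

```c
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>
#if defined(__sun) || defined(__FreeBSD__)
#include <sys/mman.h>           /* getpagesizes(), assumed available here */
#endif

#define MAX_PAGESIZES 32        /* hypothetical upper bound for this sketch */

/*
 * Fill sizes[] with the page sizes the system reports.
 * Returns the number of entries stored, or -1 on error.
 */
int
list_pagesizes(size_t sizes[], int nelem)
{
        if (sizes == NULL || nelem < 1)
                return (-1);
#if defined(__sun) || defined(__FreeBSD__)
        return (getpagesizes(sizes, nelem));
#else
        /* Fallback: only the base page size is known portably. */
        long base = sysconf(_SC_PAGESIZE);

        if (base <= 0)
                return (-1);
        sizes[0] = (size_t)base;
        return (1);
#endif
}

/* Print one page size per line, in bytes. */
void
print_pagesizes(void)
{
        size_t sizes[MAX_PAGESIZES];
        int i, n = list_pagesizes(sizes, MAX_PAGESIZES);

        for (i = 0; i < n; i++)
                printf("%zu\n", sizes[i]);
}
```

An application would pick one of the reported sizes and hand it to whatever call sets the preferred size, instead of encoding a fixed size in an mmap() flag.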
>
> For example ksh93 uses this to request 64k pages for the stack (this
> mainly aims at SPARC where 64k stack pages can be a real performance
> booster if you shuffle a lot of strings via stack):
> -----------
> int main(int argc, char *argv[])
> {
> #if _lib_memcntl
>         /* advise larger stack size */
>         struct memcntl_mha mha;
>         mha.mha_cmd = MHA_MAPSIZE_STACK;
>         mha.mha_flags = 0;
>         mha.mha_pagesize = 64 * 1024;
>         (void)memcntl(NULL, 0, MC_HAT_ADVISE, (caddr_t)&mha, 0, 0);
> #endif
>         return(sh_main(argc, argv, (Shinit_f)0));
> }
> -----------
>
> Below is the memcntl(2) manpage describing the API:
> ---------------------------------------
>
> System Calls                                            memcntl(2)
>
> NAME
>      memcntl - memory management control
>
> SYNOPSIS
>      #include <sys/types.h>
>      #include <sys/mman.h>
>
>      int memcntl(caddr_t addr, size_t len, int cmd, caddr_t arg,
>          int attr, int mask);
>
> DESCRIPTION
>      The memcntl() function allows the calling process to apply a
>      variety of control operations over the address space identified
>      by the mappings established for the address range
>      [addr, addr + len).
>
>      The addr argument must be a multiple of the pagesize as
>      returned by sysconf(3C). The scope of the control operations
>      can be further defined with additional selection criteria
>      (in the form of attributes) according to the bit pattern
>      contained in attr.
>
>      The following attributes specify page mapping selection
>      criteria:
>
>      SHARED          Page is mapped shared.
>
>      PRIVATE         Page is mapped private.
>
>      The following attributes specify page protection selection
>      criteria. The selection criteria are constructed by a bitwise
>      OR operation on the attribute bits and must match exactly.
>
>      PROT_READ       Page can be read.
>
>      PROT_WRITE      Page can be written.
>
>      PROT_EXEC       Page can be executed.
>
>      The following criteria may also be specified:
>
> SunOS 5.11              Last change: 10 Apr 2007
>
>      PROC_TEXT       Process text.
>
>      PROC_DATA       Process data.
>
>      The PROC_TEXT attribute specifies all privately mapped segments
>      with read and execute permission, and the PROC_DATA attribute
>      specifies all privately mapped segments with write permission.
>
>      Selection criteria can be used to describe various abstract
>      memory objects within the address space on which to operate.
>      If an operation shall not be constrained by the selection
>      criteria, attr must have the value 0.
>
>      The operation to be performed is identified by the argument
>      cmd. The symbolic names for the operations are defined in
>      <sys/mman.h> as follows:
>
>      MC_LOCK
>
>          Lock in memory all pages in the range with attributes
>          attr. A given page may be locked multiple times through
>          different mappings; however, within a given mapping,
>          page locks do not nest. Multiple lock operations on the
>          same address in the same process will all be removed
>          with a single unlock operation. A page locked in one
>          process and mapped in another (or visible through a
>          different mapping in the locking process) is locked in
>          memory as long as the locking process does neither an
>          implicit nor explicit unlock operation. If a locked
>          mapping is removed, or a page is deleted through file
>          removal or truncation, an unlock operation is implicitly
>          performed. If a writable MAP_PRIVATE page in the address
>          range is changed, the lock will be transferred to the
>          private page.
>
>          The arg argument is not used, but must be 0 to ensure
>          compatibility with potential future enhancements.
>
>      MC_LOCKAS
>
>          Lock in memory all pages mapped by the address space
>          with attributes attr.
>          The addr and len arguments are not used, but must be
>          NULL and 0 respectively, to ensure compatibility with
>          potential future enhancements. The arg argument is a bit
>          pattern built from the flags:
>
>          MCL_CURRENT     Lock current mappings.
>
>          MCL_FUTURE      Lock future mappings.
>
>          The value of arg determines whether the pages to be
>          locked are those currently mapped by the address space,
>          those that will be mapped in the future, or both. If
>          MCL_FUTURE is specified, then all mappings subsequently
>          added to the address space will be locked, provided
>          sufficient memory is available.
>
>      MC_SYNC
>
>          Write to their backing storage locations all modified
>          pages in the range with attributes attr. Optionally,
>          invalidate cache copies. The backing storage for a
>          modified MAP_SHARED mapping is the file the page is mapped
>          to; the backing storage for a modified MAP_PRIVATE
>          mapping is its swap area. The arg argument is a bit pattern
>          built from the flags used to control the behavior of the
>          operation:
>
>          MS_ASYNC        Perform asynchronous writes.
>
>          MS_SYNC         Perform synchronous writes.
>
>          MS_INVALIDATE   Invalidate mappings.
>
>          MS_ASYNC Return immediately once all write operations
>          are scheduled; with MS_SYNC the function will not return
>          until all write operations are completed.
>
>          MS_INVALIDATE Invalidate all cached copies of data in
>          memory, so that further references to the pages will be
>          obtained by the system from their backing storage
>          locations. This operation should be used by applications
>          that require a memory object to be in a known state.
>
>      MC_UNLOCK
>
>          Unlock all pages in the range with attributes attr. The
>          arg argument is not used, but must be 0 to ensure
>          compatibility with potential future enhancements.
>
>      MC_UNLOCKAS
>
>          Remove address space memory locks and locks on all pages
>          in the address space with attributes attr. The addr,
>          len, and arg arguments are not used, but must be NULL, 0
>          and 0, respectively, to ensure compatibility with
>          potential future enhancements.
>
>      MC_HAT_ADVISE
>
>          Advise system how a region of user-mapped memory will be
>          accessed. The arg argument is interpreted as a "struct
>          memcntl_mha *". The following members are defined in a
>          struct memcntl_mha:
>
>              uint_t mha_cmd;
>              uint_t mha_flags;
>              size_t mha_pagesize;
>
>          The accepted values for mha_cmd are:
>
>              MHA_MAPSIZE_VA
>              MHA_MAPSIZE_STACK
>              MHA_MAPSIZE_BSSBRK
>
>          The mha_flags member is reserved for future use and must
>          always be set to 0. The mha_pagesize member must be a
>          valid size as obtained from getpagesizes(3C) or the
>          constant value 0 to allow the system to choose an
>          appropriate hardware address translation mapping size.
>
>          MHA_MAPSIZE_VA sets the preferred hardware address
>          translation mapping size of the region of memory from
>          addr to addr + len. Both addr and len must be aligned to
>          an mha_pagesize boundary. The entire virtual address
>          region from addr to addr + len must not have any holes.
>          Permissions within each mha_pagesize-aligned portion of
>          the region must be consistent. When a size of 0 is
>          specified, the system selects an appropriate size based
>          on the size and alignment of the memory region, type of
>          processor, and other considerations.
>
>          MHA_MAPSIZE_STACK sets the preferred hardware address
>          translation mapping size of the process main thread
>          stack segment. The addr and len arguments must be NULL
>          and 0, respectively.
>
>          MHA_MAPSIZE_BSSBRK sets the preferred hardware address
>          translation mapping size of the process heap. The addr
>          and len arguments must be NULL and 0, respectively. See
>          the NOTES section of the ppgsz(1) manual page for
>          additional information on process heap alignment.
>
>          The attr argument must be 0 for all MC_HAT_ADVISE
>          operations.
>
>      The mask argument must be 0; it is reserved for future use.
>
>      Locks established with the lock operations are not inherited
>      by a child process after fork(2). The memcntl() function
>      fails if it attempts to lock more memory than a
>      system-specific limit.
>
>      Due to the potential impact on system resources, the
>      operations MC_LOCKAS, MC_LOCK, MC_UNLOCKAS, and MC_UNLOCK are
>      restricted to privileged processes.
>
> USAGE
>      The memcntl() function subsumes the operations of plock(3C).
>
>      MC_HAT_ADVISE is intended to improve performance of
>      applications that use large amounts of memory on processors that
>      support multiple hardware address translation mapping sizes;
>      however, it should be used with care. Not all processors
>      support all sizes with equal efficiency. Use of larger sizes
>      may also introduce extra overhead that could reduce
>      performance or available memory. Using large sizes for one
>      application may reduce available resources for other
>      applications and result in slower system wide performance.
>
> RETURN VALUES
>      Upon successful completion, memcntl() returns 0; otherwise,
>      it returns -1 and sets errno to indicate an error.
>
> ERRORS
>      The memcntl() function will fail if:
>
>      EAGAIN    When the selection criteria match, some or all of
>                the memory identified by the operation could not
>                be locked when MC_LOCK or MC_LOCKAS was specified,
>                some or all mappings in the address range [addr,
>                addr + len) are locked for I/O when MC_HAT_ADVISE
>                was specified, or the system has insufficient
>                resources when MC_HAT_ADVISE was specified.
>
>                The cmd is MC_LOCK or MC_LOCKAS and locking the
>                memory identified by this operation would exceed a
>                limit or resource control on locked memory.
>
>      EBUSY     When the selection criteria match, some or all of
>                the addresses in the range [addr, addr + len) are
>                locked and MC_SYNC with the MS_INVALIDATE option
>                was specified.
>
>      EINVAL    The addr argument specifies invalid selection
>                criteria or is not a multiple of the page size as
>                returned by sysconf(3C); the addr and/or len
>                argument does not have the value 0 when MC_LOCKAS
>                or MC_UNLOCKAS is specified; the arg argument is
>                not valid for the function specified; mha_pagesize
>                or mha_cmd is invalid; or MC_HAT_ADVISE is
>                specified and not all pages in the specified region
>                have the same access permissions within the given
>                size boundaries.
>
>      ENOMEM    When the selection criteria match, some or all of
>                the addresses in the range [addr, addr + len) are
>                invalid for the address space of a process or
>                specify one or more pages which are not mapped.
>
>      EPERM     The {PRIV_PROC_LOCK_MEMORY} privilege is not
>                asserted in the effective set of the calling
>                process and MC_LOCK, MC_LOCKAS, MC_UNLOCK, or
>                MC_UNLOCKAS was specified.
>
> ATTRIBUTES
>      See attributes(5) for descriptions of the following
>      attributes:
>
>       ___________________________________________________________
>      |       ATTRIBUTE TYPE        |       ATTRIBUTE VALUE       |
>      |_____________________________|_____________________________|
>      | MT-Level                    | MT-Safe                     |
>      |_____________________________|_____________________________|
>
> SEE ALSO
>      ppgsz(1), fork(2), mmap(2), mprotect(2), getpagesizes(3C),
>      mlock(3C), mlockall(3C), msync(3C), plock(3C), sysconf(3C),
>      attributes(5), privileges(5)
>
> ---------------------------------------
>
> Ced
> --
> Cedric Blancher
> Institute Pasteur

-- 
Cedric Blancher
Institute Pasteur
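
The MHA_MAPSIZE_VA rule in the manpage (both addr and len aligned to an mha_pagesize boundary) is the part callers most easily get wrong, so here is a rough sketch of one way to handle it. The helper align_inward(), the struct aligned_range and the wrapper advise_pagesize() are hypothetical names of mine, not part of any Solaris API. The memcntl() call is only compiled on Solaris, and shrinking the range inward is my own design choice: it never advises memory outside what the caller passed in.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__sun)
#include <sys/types.h>
#include <sys/mman.h>           /* memcntl(), struct memcntl_mha */
#endif

/* Hypothetical result type: an aligned sub-range of the caller's region. */
struct aligned_range {
        uintptr_t start;
        size_t    len;          /* 0 means no aligned sub-range fits */
};

/*
 * Shrink [addr, addr + len) to the largest sub-range whose start and
 * length are both multiples of pagesize (which must be a power of two).
 */
struct aligned_range
align_inward(uintptr_t addr, size_t len, size_t pagesize)
{
        struct aligned_range r = { 0, 0 };
        uintptr_t mask, start, end;

        if (pagesize == 0 || (pagesize & (pagesize - 1)) != 0)
                return (r);
        mask = (uintptr_t)pagesize - 1;
        end = addr + len;
        if (end < addr)                 /* range wraps around */
                return (r);
        start = (addr + mask) & ~mask;  /* round start up */
        if (start < addr)               /* rounding up overflowed */
                return (r);
        end &= ~mask;                   /* round end down */
        if (end > start) {
                r.start = start;
                r.len = (size_t)(end - start);
        }
        return (r);
}

#if defined(__sun)
/* Advise a preferred page size for the aligned part of a mapping. */
int
advise_pagesize(void *addr, size_t len, size_t pagesize)
{
        struct aligned_range r = align_inward((uintptr_t)addr, len, pagesize);
        struct memcntl_mha mha;

        if (r.len == 0)
                return (-1);
        mha.mha_cmd = MHA_MAPSIZE_VA;
        mha.mha_flags = 0;              /* reserved, must be 0 */
        mha.mha_pagesize = pagesize;
        /* attr and mask must both be 0 for MC_HAT_ADVISE. */
        return (memcntl((caddr_t)r.start, r.len, MC_HAT_ADVISE,
            (caddr_t)&mha, 0, 0));
}
#endif
```

With a 64k page size, a region starting at 0x1000 and 0x30000 bytes long shrinks to start 0x10000 with length 0x20000. A region that cannot hold a single aligned page comes back with len 0 and should be left alone.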