From: Cedric Blancher
To: Sebastian Kuzminsky
Cc: Patrick Dung, "freebsd-hackers@freebsd.org", "ivoras@freebsd.org"
Date: Sat, 21 Sep 2013 03:09:24 +0200
Subject: Re: About Transparent Superpages and Non-transparent superapges

On 20 September 2013 17:20, Sebastian Kuzminsky wrote:
> On Sep 19, 2013, at 22:06 , Patrick Dung wrote:
>
>> >We at Line Rate (now F5) are developing support for 1 Gig superpages on amd64. We're basing our work on 9.1.0 for now.
>> >
>> >An early preview is available here:
>> >
>> >https://github.com/Seb-LineRate/freebsd/tree/freebsd-9.1.0-1gig-pages-NOT-READY-2
>>
>> That is cool.
>>
>> What type of applications can take advantage of the 1Gb page size?
>> And is it transparent? Or applications need to be modified?
>
> It's transparent for the kernel: all of UMA and kmem_malloc()/kmem_free() is backed by 1 gig superpages.
>
> It's not transparent for userspace: applications need to pass a new flag to mmap() to get 1 gig pages.

That may be the wrong approach. What happens if x86 gets more
huge/large page sizes, like SPARC has (hint: sign an NDA with Intel and
AMD and get surprised, and then allocate 16 more bits for mmap() if
you wish to stick with your approach)? For example, SPARC64 does 8k,
64k, 512k, 4M, 32M, 256M, 2GB and 256GB pages (the actual page sizes
differ from one MMU implementation to another, and can be probed via
pagesize -a).

A much better option would be to follow the Solaris API, which has
calls to enumerate the available page sizes and then set one for the
heap, the stack, or a given address range (the last is used to get
large pages for file I/O via mmap()).
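A minimal sketch of the enumeration step, not taken from the original
thread: on Solaris, getpagesizes(3C) (listed under SEE ALSO in the
manpage below) reports the supported page sizes, and FreeBSD provides a
getpagesizes(3) with the same signature. The helper name
list_page_sizes() and the sysconf() fallback for other systems are
assumptions made only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: store up to `max` supported page sizes in
 * `sizes` and return how many were stored, or -1 on error.
 */
int
list_page_sizes(size_t *sizes, int max)
{
	assert(sizes != NULL && max > 0);	/* caller must supply a buffer */
#if defined(__sun) || defined(__FreeBSD__)
	/* getpagesizes() returns the number of sizes retrieved, or -1. */
	return (getpagesizes(sizes, max));
#else
	/* Assumed fallback: only the base page size is known here. */
	long base = sysconf(_SC_PAGESIZE);

	if (base <= 0)
		return (-1);
	sizes[0] = (size_t)base;
	return (1);
#endif
}
```

An application could call such a helper once at startup and pass one of
the returned values as mha_pagesize, instead of relying on a fixed page
size encoded in an mmap() flag.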

For example, ksh93 uses memcntl() to request 64k pages for the stack
(this mainly targets SPARC, where 64k stack pages can be a real
performance booster if you shuffle a lot of strings via the stack):
-----------
int main(int argc, char *argv[])
{
#if _lib_memcntl
	/* advise larger stack size */
	struct memcntl_mha mha;
	mha.mha_cmd = MHA_MAPSIZE_STACK;
	mha.mha_flags = 0;
	mha.mha_pagesize = 64 * 1024;
	(void)memcntl(NULL, 0, MC_HAT_ADVISE, (caddr_t)&mha, 0, 0);
#endif
	return(sh_main(argc, argv, (Shinit_f)0));
}
-----------

Below is the memcntl(2) manpage describing the API:
---------------------------------------
System Calls                                              memcntl(2)

NAME
     memcntl - memory management control

SYNOPSIS
     #include <sys/types.h>
     #include <sys/mman.h>

     int memcntl(caddr_t addr, size_t len, int cmd, caddr_t arg,
         int attr, int mask);

DESCRIPTION
     The memcntl() function allows the calling process to apply a
     variety of control operations over the address space identified
     by the mappings established for the address range
     [addr, addr + len).

     The addr argument must be a multiple of the pagesize as
     returned by sysconf(3C). The scope of the control operations
     can be further defined with additional selection criteria (in
     the form of attributes) according to the bit pattern contained
     in attr.

     The following attributes specify page mapping selection
     criteria:

     SHARED         Page is mapped shared.
     PRIVATE        Page is mapped private.

     The following attributes specify page protection selection
     criteria. The selection criteria are constructed by a bitwise
     OR operation on the attribute bits and must match exactly.

     PROT_READ      Page can be read.
     PROT_WRITE     Page can be written.
     PROT_EXEC      Page can be executed.

     The following criteria may also be specified:

SunOS 5.11          Last change: 10 Apr 2007                       1

     PROC_TEXT      Process text.
     PROC_DATA      Process data.

     The PROC_TEXT attribute specifies all privately mapped segments
     with read and execute permission, and the PROC_DATA attribute
     specifies all privately mapped segments with write permission.

     Selection criteria can be used to describe various abstract
     memory objects within the address space on which to operate. If
     an operation shall not be constrained by the selection
     criteria, attr must have the value 0.

     The operation to be performed is identified by the argument
     cmd. The symbolic names for the operations are defined in
     <sys/mman.h> as follows:

     MC_LOCK
          Lock in memory all pages in the range with attributes
          attr. A given page may be locked multiple times through
          different mappings; however, within a given mapping, page
          locks do not nest. Multiple lock operations on the same
          address in the same process will all be removed with a
          single unlock operation. A page locked in one process and
          mapped in another (or visible through a different mapping
          in the locking process) is locked in memory as long as the
          locking process does neither an implicit nor explicit
          unlock operation. If a locked mapping is removed, or a
          page is deleted through file removal or truncation, an
          unlock operation is implicitly performed. If a writable
          MAP_PRIVATE page in the address range is changed, the lock
          will be transferred to the private page.

          The arg argument is not used, but must be 0 to ensure
          compatibility with potential future enhancements.

     MC_LOCKAS
          Lock in memory all pages mapped by the address space with
          attributes attr. The addr and len arguments are not used,
          but must be NULL and 0 respectively, to ensure
          compatibility with potential future enhancements. The arg
          argument is a bit pattern built from the flags:

          MCL_CURRENT    Lock current mappings.
          MCL_FUTURE     Lock future mappings.

          The value of arg determines whether the pages to be locked
          are those currently mapped by the address space, those
          that will be mapped in the future, or both. If MCL_FUTURE
          is specified, then all mappings subsequently added to the
          address space will be locked, provided sufficient memory
          is available.

     MC_SYNC
          Write to their backing storage locations all modified
          pages in the range with attributes attr. Optionally,
          invalidate cache copies. The backing storage for a
          modified MAP_SHARED mapping is the file the page is mapped
          to; the backing storage for a modified MAP_PRIVATE mapping
          is its swap area. The arg argument is a bit pattern built
          from the flags used to control the behavior of the
          operation:

          MS_ASYNC         Perform asynchronous writes.
          MS_SYNC          Perform synchronous writes.
          MS_INVALIDATE    Invalidate mappings.

          MS_ASYNC Return immediately once all write operations are
          scheduled; with MS_SYNC the function will not return until
          all write operations are completed.

          MS_INVALIDATE Invalidate all cached copies of data in
          memory, so that further references to the pages will be
          obtained by the system from their backing storage
          locations. This operation should be used by applications
          that require a memory object to be in a known state.

     MC_UNLOCK
          Unlock all pages in the range with attributes attr. The
          arg argument is not used, but must be 0 to ensure
          compatibility with potential future enhancements.

     MC_UNLOCKAS
          Remove address space memory locks and locks on all pages
          in the address space with attributes attr. The addr, len,
          and arg arguments are not used, but must be NULL, 0 and 0,
          respectively, to ensure compatibility with potential
          future enhancements.

     MC_HAT_ADVISE
          Advise system how a region of user-mapped memory will be
          accessed.
          The arg argument is interpreted as a
          "struct memcntl_mha *". The following members are defined
          in a struct memcntl_mha:

              uint_t mha_cmd;
              uint_t mha_flags;
              size_t mha_pagesize;

          The accepted values for mha_cmd are:

              MHA_MAPSIZE_VA
              MHA_MAPSIZE_STACK
              MHA_MAPSIZE_BSSBRK

          The mha_flags member is reserved for future use and must
          always be set to 0. The mha_pagesize member must be a
          valid size as obtained from getpagesizes(3C) or the
          constant value 0 to allow the system to choose an
          appropriate hardware address translation mapping size.

          MHA_MAPSIZE_VA sets the preferred hardware address
          translation mapping size of the region of memory from addr
          to addr + len. Both addr and len must be aligned to an
          mha_pagesize boundary. The entire virtual address region
          from addr to addr + len must not have any holes.
          Permissions within each mha_pagesize-aligned portion of
          the region must be consistent. When a size of 0 is
          specified, the system selects an appropriate size based on
          the size and alignment of the memory region, type of
          processor, and other considerations.

          MHA_MAPSIZE_STACK sets the preferred hardware address
          translation mapping size of the process main thread stack
          segment. The addr and len arguments must be NULL and 0,
          respectively.

          MHA_MAPSIZE_BSSBRK sets the preferred hardware address
          translation mapping size of the process heap. The addr and
          len arguments must be NULL and 0, respectively. See the
          NOTES section of the ppgsz(1) manual page for additional
          information on process heap alignment.

          The attr argument must be 0 for all MC_HAT_ADVISE
          operations.

     The mask argument must be 0; it is reserved for future use.

     Locks established with the lock operations are not inherited by
     a child process after fork(2). The memcntl() function fails if
     it attempts to lock more memory than a system-specific limit.

     Due to the potential impact on system resources, the operations
     MC_LOCKAS, MC_LOCK, MC_UNLOCKAS, and MC_UNLOCK are restricted
     to privileged processes.

USAGE
     The memcntl() function subsumes the operations of plock(3C).

     MC_HAT_ADVISE is intended to improve performance of
     applications that use large amounts of memory on processors
     that support multiple hardware address translation mapping
     sizes; however, it should be used with care. Not all processors
     support all sizes with equal efficiency. Use of larger sizes
     may also introduce extra overhead that could reduce performance
     or available memory. Using large sizes for one application may
     reduce available resources for other applications and result in
     slower system wide performance.

RETURN VALUES
     Upon successful completion, memcntl() returns 0; otherwise, it
     returns -1 and sets errno to indicate an error.

ERRORS
     The memcntl() function will fail if:

     EAGAIN    When the selection criteria match, some or all of the
               memory identified by the operation could not be
               locked when MC_LOCK or MC_LOCKAS was specified, some
               or all mappings in the address range
               [addr, addr + len) are locked for I/O when
               MC_HAT_ADVISE was specified, or the system has
               insufficient resources when MC_HAT_ADVISE was
               specified.

               The cmd is MC_LOCK or MC_LOCKAS and locking the
               memory identified by this operation would exceed a
               limit or resource control on locked memory.

     EBUSY     When the selection criteria match, some or all of the
               addresses in the range [addr, addr + len) are locked
               and MC_SYNC with the MS_INVALIDATE option was
               specified.

     EINVAL    The addr argument specifies invalid selection
               criteria or is not a multiple of the page size as
               returned by sysconf(3C); the addr and/or len argument
               does not have the value 0 when MC_LOCKAS or
               MC_UNLOCKAS is specified; the arg argument is not
               valid for the function specified; mha_pagesize or
               mha_cmd is invalid; or MC_HAT_ADVISE is specified and
               not all pages in the specified region have the same
               access permissions within the given size boundaries.

     ENOMEM    When the selection criteria match, some or all of the
               addresses in the range [addr, addr + len) are invalid
               for the address space of a process or specify one or
               more pages which are not mapped.

     EPERM     The {PRIV_PROC_LOCK_MEMORY} privilege is not asserted
               in the effective set of the calling process and
               MC_LOCK, MC_LOCKAS, MC_UNLOCK, or MC_UNLOCKAS was
               specified.

ATTRIBUTES
     See attributes(5) for descriptions of the following attributes:

     _____________________________________________________________
    |       ATTRIBUTE TYPE         |       ATTRIBUTE VALUE        |
    |______________________________|______________________________|
    | MT-Level                     | MT-Safe                      |
    |______________________________|______________________________|

SEE ALSO
     ppgsz(1), fork(2), mmap(2), mprotect(2), getpagesizes(3C),
     mlock(3C), mlockall(3C), msync(3C), plock(3C), sysconf(3C),
     attributes(5), privileges(5)
---------------------------------------

Ced
-- 
Cedric Blancher
Institute Pasteur
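
As a rough illustration of the MHA_MAPSIZE_VA rule in the manpage above
(both addr and len must be aligned to an mha_pagesize boundary), the
sketch below picks the largest page size that satisfies the alignment
for a given range. The function name and the convention of returning 0
when nothing fits (0 lets the system choose, per the manpage) are
assumptions for illustration and not part of any real API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: from `n` candidate page sizes (for example the
 * list reported by getpagesizes()), return the largest one to which
 * both `addr` and `len` are aligned, or 0 if none qualifies.
 */
size_t
largest_aligned_pagesize(uintptr_t addr, size_t len,
    const size_t *sizes, int n)
{
	size_t best = 0;

	assert(n == 0 || sizes != NULL);
	for (int i = 0; i < n; i++) {
		size_t ps = sizes[i];

		/* Skip sizes that would violate the alignment rule. */
		if (ps == 0 || addr % ps != 0 || len % ps != 0)
			continue;
		if (ps > best)
			best = ps;
	}
	return (best);
}
```

The result could be stored in mha_pagesize before an MC_HAT_ADVISE
call; a return of 0 simply hands the choice back to the system.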