Date: Sun, 31 Jan 2010 23:16:11 +0000 (UTC)
From: Jason Evans <jasone@FreeBSD.org>
To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: svn commit: r203329 - head/lib/libc/stdlib
Message-ID: <201001312316.o0VNGBLo068121@svn.freebsd.org>
Author: jasone
Date: Sun Jan 31 23:16:10 2010
New Revision: 203329
URL: http://svn.freebsd.org/changeset/base/203329

Log:
  Fix bugs:
  * Fix a race in chunk_dealloc_dss().
  * Check for allocation failure before zeroing memory in base_calloc().

  Merge enhancements from a divergent version of jemalloc:
  * Convert thread-specific caching from magazines to an algorithm that is
    more tunable, and implement incremental GC.
  * Add support for medium size classes, [4KiB..32KiB], 2KiB apart by
    default.
  * Add dirty page tracking for pages within active small/medium object
    runs.  This allows malloc to track precisely which pages are in active
    use, which makes dirty page purging more effective.
  * Base maximum dirty page count on proportion of active memory.
  * Use optional zeroing in arena_chunk_alloc() to avoid needless zeroing
    of chunks.  This is useful in the context of DSS allocation, since a
    long-lived application may commonly recycle chunks.
  * Increase the default chunk size from 1MiB to 4MiB.

  Remove feature:
  * Remove the dynamic rebalancing code, since thread caching reduces its
    utility.

Added:
  head/lib/libc/stdlib/ql.h   (contents, props changed)
  head/lib/libc/stdlib/qr.h   (contents, props changed)
Modified:
  head/lib/libc/stdlib/malloc.3
  head/lib/libc/stdlib/malloc.c

Modified: head/lib/libc/stdlib/malloc.3
==============================================================================
--- head/lib/libc/stdlib/malloc.3	Sun Jan 31 22:31:01 2010	(r203328)
+++ head/lib/libc/stdlib/malloc.3	Sun Jan 31 23:16:10 2010	(r203329)
@@ -32,7 +32,7 @@
 .\" @(#)malloc.3	8.1 (Berkeley) 6/4/93
 .\" $FreeBSD$
 .\"
-.Dd September 26, 2009
+.Dd January 31, 2010
 .Dt MALLOC 3
 .Os
 .Sh NAME
@@ -55,9 +55,7 @@
 .Ft const char *
 .Va _malloc_options ;
 .Ft void
-.Fo \*(lp*_malloc_message\*(rp
-.Fa "const char *p1" "const char *p2" "const char *p3" "const char *p4"
-.Fc
+.Fn \*(lp*_malloc_message\*(rp "const char *p1" "const char *p2" "const char *p3" "const char *p4"
 .In malloc_np.h
 .Ft size_t
 .Fn malloc_usable_size "const void *ptr"
@@ -124,7 +122,9 @@ will free the passed pointer when the re
 This is a
 .Fx
 specific API designed to ease the problems with traditional coding styles
-for realloc causing memory leaks in libraries.
+for
+.Fn realloc
+causing memory leaks in libraries.
 .Pp
 The
 .Fn free
@@ -184,18 +184,6 @@ flags being set) become fatal.
 The process will call
 .Xr abort 3
 in these cases.
-.It B
-Double/halve the per-arena lock contention threshold at which a thread is
-randomly re-assigned to an arena.
-This dynamic load balancing tends to push threads away from highly contended
-arenas, which avoids worst case contention scenarios in which threads
-disproportionately utilize arenas.
-However, due to the highly dynamic load that applications may place on the
-allocator, it is impossible for the allocator to know in advance how sensitive
-it should be to contention over arenas.
-Therefore, some applications may benefit from increasing or decreasing this
-threshold parameter.
-This option is not available for some configurations (non-PIC).
 .It C
 Double/halve the size of the maximum size class that is a multiple of the
 cacheline size (64).
@@ -209,44 +197,62 @@ This option is enabled by default.
 See the
 .Dq M
 option for related information and interactions.
+.It E
+Double/halve the size of the maximum medium size class.
+The valid range is from one page to one half chunk.
+The default value is 32 KiB.
 .It F
-Double/halve the per-arena maximum number of dirty unused pages that are
-allowed to accumulate before informing the kernel about at least half of those
-pages via
+Halve/double the per-arena minimum ratio of active to dirty pages.
+Some dirty unused pages may be allowed to accumulate, within the limit set by
+the ratio, before informing the kernel about at least half of those pages via
 .Xr madvise 2 .
 This provides the kernel with sufficient information to recycle dirty pages if
 physical memory becomes scarce and the pages remain unused.
-The default is 512 pages per arena;
-.Ev MALLOC_OPTIONS=10f
-will prevent any dirty unused pages from accumulating.
+The default minimum ratio is 32:1;
+.Ev MALLOC_OPTIONS=6F
+will disable dirty page purging.
 .It G
-When there are multiple threads, use thread-specific caching for objects that
-are smaller than one page.
-This option is enabled by default.
-Thread-specific caching allows many allocations to be satisfied without
-performing any thread synchronization, at the cost of increased memory use.
+Double/halve the approximate interval (counted in terms of
+thread-specific cache allocation/deallocation events) between full
+thread-specific cache garbage collection sweeps.
+Garbage collection is actually performed incrementally, one size
+class at a time, in order to avoid large collection pauses.
+The default sweep interval is 8192;
+.Ev MALLOC_OPTIONS=14g
+will disable garbage collection.
+.It H
+Double/halve the number of thread-specific cache slots per size
+class.
+When there are multiple threads, each thread uses a
+thread-specific cache for small and medium objects.
+Thread-specific caching allows many allocations to be satisfied
+without performing any thread synchronization, at the cost of
+increased memory use.
 See the
-.Dq R
+.Dq G
 option for related tuning information.
-This option is not available for some configurations (non-PIC).
+The default number of cache slots is 128;
+.Ev MALLOC_OPTIONS=7h
+will disable thread-specific caching.
+Note that one cache slot per size class is not a valid
+configuration due to implementation details.
 .It J
 Each byte of new memory allocated by
 .Fn malloc ,
-.Fn realloc
+.Fn realloc ,
 or
 .Fn reallocf
 will be initialized to 0xa5.
 All memory returned by
 .Fn free ,
-.Fn realloc
+.Fn realloc ,
 or
 .Fn reallocf
 will be initialized to 0x5a.
 This is intended for debugging and will impact performance negatively.
 .It K
 Double/halve the virtual memory chunk size.
-The default chunk size is the maximum of 1 MB and the largest
-page size that is less than or equal to 4 MB.
+The default chunk size is 4 MiB.
 .It M
 Use
 .Xr mmap 2
@@ -279,14 +285,6 @@ Double/halve the size of the maximum siz
 quantum (8 or 16 bytes, depending on architecture).
 Above this size, cacheline spacing is used for size classes.
 The default value is 128 bytes.
-.It R
-Double/halve magazine size, which approximately doubles/halves the number of
-rounds in each magazine.
-Magazines are used by the thread-specific caching machinery to acquire and
-release objects in bulk.
-Increasing the magazine size decreases locking overhead, at the expense of
-increased memory usage.
-This option is not available for some configurations (non-PIC).
 .It U
 Generate
 .Dq utrace
@@ -297,8 +295,7 @@ Consult the source for details on this o
 .It V
 Attempting to allocate zero bytes will return a
 .Dv NULL
-pointer instead of
-a valid pointer.
+pointer instead of a valid pointer.
 (The default behavior is to make a minimal allocation and return a
 pointer to it.)
 This option is provided for System V compatibility.
@@ -306,21 +303,20 @@ This option is incompatible with the
 .Dq X
 option.
 .It X
-Rather than return failure for any allocation function,
-display a diagnostic message on
-.Dv stderr
-and cause the program to drop
-core (using
+Rather than return failure for any allocation function, display a diagnostic
+message on
+.Dv STDERR_FILENO
+and cause the program to drop core (using
 .Xr abort 3 ) .
-This option should be set at compile time by including the following in
-the source code:
+This option should be set at compile time by including the following in the
+source code:
 .Bd -literal -offset indent
 _malloc_options = "X";
 .Ed
 .It Z
 Each byte of new memory allocated by
 .Fn malloc ,
-.Fn realloc
+.Fn realloc ,
 or
 .Fn reallocf
 will be initialized to 0.
@@ -378,9 +374,9 @@ improve performance, mainly due to reduc
 However, it may make sense to reduce the number of arenas if an application
 does not make much use of the allocation functions.
 .Pp
-In addition to multiple arenas, this allocator supports thread-specific
-caching for small objects (smaller than one page), in order to make it
-possible to completely avoid synchronization for most small allocation requests.
+In addition to multiple arenas, this allocator supports thread-specific caching
+for small and medium objects, in order to make it possible to completely avoid
+synchronization for most small and medium allocation requests.
 Such caching allows very fast allocation in the common case, but it increases
 memory usage and fragmentation, since a bounded number of objects can remain
 allocated in each thread cache.
@@ -391,23 +387,27 @@ Chunks are always aligned to multiples o
 This alignment makes it possible to find metadata for user objects very
 quickly.
 .Pp
-User objects are broken into three categories according to size: small, large,
-and huge.
+User objects are broken into four categories according to size: small, medium,
+large, and huge.
 Small objects are smaller than one page.
+Medium objects range from one page to an upper limit determined at run time (see
+the
+.Dq E
+option).
 Large objects are smaller than the chunk size.
 Huge objects are a multiple of the chunk size.
-Small and large objects are managed by arenas; huge objects are managed
+Small, medium, and large objects are managed by arenas; huge objects are managed
 separately in a single data structure that is shared by all threads.
 Huge objects are used by applications infrequently enough that this single
 data structure is not a scalability issue.
 .Pp
 Each chunk that is managed by an arena tracks its contents as runs of
-contiguous pages (unused, backing a set of small objects, or backing one large
-object).
+contiguous pages (unused, backing a set of small or medium objects, or backing
+one large object).
 The combination of chunk alignment and chunk page maps makes it possible to
 determine all metadata regarding small and large allocations in constant time.
 .Pp
-Small objects are managed in groups by page runs.
+Small and medium objects are managed in groups by page runs.
 Each run maintains a bitmap that tracks which regions are in use.
 Allocation requests that are no more than half the quantum (8 or 16, depending
 on architecture) are rounded up to the nearest power of two.
@@ -419,10 +419,17 @@ Allocation requests that are more than t
 class, but no more than the minimum subpage-multiple size class (see the
 .Dq C
 option) are rounded up to the nearest multiple of the cacheline size (64).
-Allocation requests that are more than the minimum subpage-multiple size class
-are rounded up to the nearest multiple of the subpage size (256).
-Allocation requests that are more than one page, but small enough to fit in
-an arena-managed chunk (see the
+Allocation requests that are more than the minimum subpage-multiple size class,
+but no more than the maximum subpage-multiple size class are rounded up to the
+nearest multiple of the subpage size (256).
+Allocation requests that are more than the maximum subpage-multiple size class,
+but no more than the maximum medium size class (see the
+.Dq E
+option) are rounded up to the nearest medium size class; spacing is an
+automatically determined power of two and ranges from the subpage size to the
+page size.
+Allocation requests that are more than the maximum medium size class, but small
+enough to fit in an arena-managed chunk (see the
 .Dq K
 option), are rounded up to the nearest run size.
 Allocation requests that are too large to fit in an arena-managed chunk are
@@ -480,13 +487,12 @@ option is set, all warnings are treated
 .Pp
 The
 .Va _malloc_message
-variable allows the programmer to override the function which emits
-the text strings forming the errors and warnings if for some reason
-the
-.Dv stderr
+variable allows the programmer to override the function which emits the text
+strings forming the errors and warnings if for some reason the
+.Dv STDERR_FILENO
 file descriptor is not suitable for this.
-Please note that doing anything which tries to allocate memory in
-this function is likely to result in a crash or deadlock.
+Please note that doing anything which tries to allocate memory in this function
+is likely to result in a crash or deadlock.
 .Pp
 All messages are prefixed by
 .Dq Ao Ar progname Ac Ns Li : (malloc) .

Modified: head/lib/libc/stdlib/malloc.c
==============================================================================
--- head/lib/libc/stdlib/malloc.c	Sun Jan 31 22:31:01 2010	(r203328)
+++ head/lib/libc/stdlib/malloc.c	Sun Jan 31 23:16:10 2010	(r203329)
@@ -1,5 +1,5 @@
 /*-
- * Copyright (C) 2006-2008 Jason Evans <jasone@FreeBSD.org>.
+ * Copyright (C) 2006-2010 Jason Evans <jasone@FreeBSD.org>.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
@@ -47,58 +47,67 @@
 *
 *   Allocation requests are rounded up to the nearest size class, and no record
 *   of the original request size is maintained.  Allocations are broken into
- *   categories according to size class.  Assuming runtime defaults, 4 kB pages
+ *   categories according to size class.  Assuming runtime defaults, 4 KiB pages
 *   and a 16 byte quantum on a 32-bit system, the size classes in each category
 *   are as follows:
 *
- *   |=======================================|
- *   | Category | Subcategory      |    Size |
- *   |=======================================|
- *   | Small    | Tiny             |       2 |
- *   |          |                  |       4 |
- *   |          |                  |       8 |
- *   |          |------------------+---------|
- *   |          | Quantum-spaced   |      16 |
- *   |          |                  |      32 |
- *   |          |                  |      48 |
- *   |          |                  |     ... |
- *   |          |                  |      96 |
- *   |          |                  |     112 |
- *   |          |                  |     128 |
- *   |          |------------------+---------|
- *   |          | Cacheline-spaced |     192 |
- *   |          |                  |     256 |
- *   |          |                  |     320 |
- *   |          |                  |     384 |
- *   |          |                  |     448 |
- *   |          |                  |     512 |
- *   |          |------------------+---------|
- *   |          | Sub-page         |     760 |
- *   |          |                  |    1024 |
- *   |          |                  |    1280 |
- *   |          |                  |     ... |
- *   |          |                  |    3328 |
- *   |          |                  |    3584 |
- *   |          |                  |    3840 |
- *   |=======================================|
- *   | Large                       |    4 kB |
- *   |                             |    8 kB |
- *   |                             |   12 kB |
- *   |                             |     ... |
- *   |                             | 1012 kB |
- *   |                             | 1016 kB |
- *   |                             | 1020 kB |
- *   |=======================================|
- *   | Huge                        |    1 MB |
- *   |                             |    2 MB |
- *   |                             |    3 MB |
- *   |                             |     ... |
- *   |=======================================|
+ *   |========================================|
+ *   | Category | Subcategory      |     Size |
+ *   |========================================|
+ *   | Small    | Tiny             |        2 |
+ *   |          |                  |        4 |
+ *   |          |                  |        8 |
+ *   |          |------------------+----------|
+ *   |          | Quantum-spaced   |       16 |
+ *   |          |                  |       32 |
+ *   |          |                  |       48 |
+ *   |          |                  |      ... |
+ *   |          |                  |       96 |
+ *   |          |                  |      112 |
+ *   |          |                  |      128 |
+ *   |          |------------------+----------|
+ *   |          | Cacheline-spaced |      192 |
+ *   |          |                  |      256 |
+ *   |          |                  |      320 |
+ *   |          |                  |      384 |
+ *   |          |                  |      448 |
+ *   |          |                  |      512 |
+ *   |          |------------------+----------|
+ *   |          | Sub-page         |      760 |
+ *   |          |                  |     1024 |
+ *   |          |                  |     1280 |
+ *   |          |                  |      ... |
+ *   |          |                  |     3328 |
+ *   |          |                  |     3584 |
+ *   |          |                  |     3840 |
+ *   |========================================|
+ *   | Medium                      |    4 KiB |
+ *   |                             |    6 KiB |
+ *   |                             |    8 KiB |
+ *   |                             |      ... |
+ *   |                             |   28 KiB |
+ *   |                             |   30 KiB |
+ *   |                             |   32 KiB |
+ *   |========================================|
+ *   | Large                       |   36 KiB |
+ *   |                             |   40 KiB |
+ *   |                             |   44 KiB |
+ *   |                             |      ... |
+ *   |                             | 1012 KiB |
+ *   |                             | 1016 KiB |
+ *   |                             | 1020 KiB |
+ *   |========================================|
+ *   | Huge                        |    1 MiB |
+ *   |                             |    2 MiB |
+ *   |                             |    3 MiB |
+ *   |                             |      ... |
+ *   |========================================|
 *
- * A different mechanism is used for each category:
+ * Different mechanisms are used according to category:
 *
- *   Small : Each size class is segregated into its own set of runs.  Each run
- *           maintains a bitmap of which regions are free/allocated.
+ *   Small/medium : Each size class is segregated into its own set of runs.
+ *                  Each run maintains a bitmap of which regions are
+ *                  free/allocated.
 *
 *   Large : Each allocation is backed by a dedicated run.  Metadata are stored
 *           in the associated arena chunk header maps.
@@ -134,18 +143,11 @@
 #define	MALLOC_TINY
 
 /*
- * MALLOC_MAG enables a magazine-based thread-specific caching layer for small
+ * MALLOC_TCACHE enables a thread-specific caching layer for small and medium
 * objects.  This makes it possible to allocate/deallocate objects without any
 * locking when the cache is in the steady state.
 */
-#define	MALLOC_MAG
-
-/*
- * MALLOC_BALANCE enables monitoring of arena lock contention and dynamically
- * re-balances arena load if exponentially averaged contention exceeds a
- * certain threshold.
- */
-#define	MALLOC_BALANCE
+#define	MALLOC_TCACHE
 
 /*
 * MALLOC_DSS enables use of sbrk(2) to allocate chunks from the data storage
@@ -166,7 +168,6 @@ __FBSDID("$FreeBSD$");
 #include "namespace.h"
 #include <sys/mman.h>
 #include <sys/param.h>
-#include <sys/stddef.h>
 #include <sys/time.h>
 #include <sys/types.h>
 #include <sys/sysctl.h>
@@ -185,6 +186,7 @@ __FBSDID("$FreeBSD$");
 #include <stdbool.h>
 #include <stdio.h>
 #include <stdint.h>
+#include <inttypes.h>
 #include <stdlib.h>
 #include <string.h>
 #include <strings.h>
@@ -192,18 +194,11 @@ __FBSDID("$FreeBSD$");
 
 #include "un-namespace.h"
 
-#ifdef MALLOC_DEBUG
-#  ifdef NDEBUG
-#    undef NDEBUG
-#  endif
-#else
-#  ifndef NDEBUG
-#    define NDEBUG
-#  endif
-#endif
-#include <assert.h>
-
 #include "rb.h"
+#if (defined(MALLOC_TCACHE) && defined(MALLOC_STATS))
+#include "qr.h"
+#include "ql.h"
+#endif
 
 #ifdef MALLOC_DEBUG
    /* Disable inlining to make debugging easier. */
@@ -214,55 +209,57 @@ __FBSDID("$FreeBSD$");
 #define	STRERROR_BUF		64
 
 /*
- * Minimum alignment of allocations is 2^QUANTUM_2POW bytes.
+ * Minimum alignment of allocations is 2^LG_QUANTUM bytes.
 */
 #ifdef __i386__
-#  define QUANTUM_2POW		4
-#  define SIZEOF_PTR_2POW	2
+#  define LG_QUANTUM		4
+#  define LG_SIZEOF_PTR		2
 #  define CPU_SPINWAIT		__asm__ volatile("pause")
 #endif
 #ifdef __ia64__
-#  define QUANTUM_2POW		4
-#  define SIZEOF_PTR_2POW	3
+#  define LG_QUANTUM		4
+#  define LG_SIZEOF_PTR		3
 #endif
 #ifdef __alpha__
-#  define QUANTUM_2POW		4
-#  define SIZEOF_PTR_2POW	3
+#  define LG_QUANTUM		4
+#  define LG_SIZEOF_PTR		3
 #  define NO_TLS
 #endif
 #ifdef __sparc64__
-#  define QUANTUM_2POW		4
-#  define SIZEOF_PTR_2POW	3
+#  define LG_QUANTUM		4
+#  define LG_SIZEOF_PTR		3
 #  define NO_TLS
 #endif
 #ifdef __amd64__
-#  define QUANTUM_2POW		4
-#  define SIZEOF_PTR_2POW	3
+#  define LG_QUANTUM		4
+#  define LG_SIZEOF_PTR		3
 #  define CPU_SPINWAIT		__asm__ volatile("pause")
 #endif
 #ifdef __arm__
-#  define QUANTUM_2POW		3
-#  define SIZEOF_PTR_2POW	2
+#  define LG_QUANTUM		3
+#  define LG_SIZEOF_PTR		2
 #  define NO_TLS
 #endif
 #ifdef __mips__
-#  define QUANTUM_2POW		3
-#  define SIZEOF_PTR_2POW	2
+#  define LG_QUANTUM		3
+#  define LG_SIZEOF_PTR		2
 #  define NO_TLS
 #endif
 #ifdef __powerpc__
-#  define QUANTUM_2POW		4
-#  define SIZEOF_PTR_2POW	2
+#  define LG_QUANTUM		4
+#endif
+#ifdef __s390x__
+#  define LG_QUANTUM		4
 #endif
 
-#define	QUANTUM		((size_t)(1U << QUANTUM_2POW))
+#define	QUANTUM		((size_t)(1U << LG_QUANTUM))
 #define	QUANTUM_MASK	(QUANTUM - 1)
 
-#define	SIZEOF_PTR	(1U << SIZEOF_PTR_2POW)
+#define	SIZEOF_PTR	(1U << LG_SIZEOF_PTR)
 
-/* sizeof(int) == (1U << SIZEOF_INT_2POW). */
-#ifndef SIZEOF_INT_2POW
-#  define SIZEOF_INT_2POW	2
+/* sizeof(int) == (1U << LG_SIZEOF_INT). */
+#ifndef LG_SIZEOF_INT
+#  define LG_SIZEOF_INT	2
 #endif
 
 /* We can't use TLS in non-PIC programs, since TLS relies on loader magic. */
@@ -271,13 +268,9 @@ __FBSDID("$FreeBSD$");
 #endif
 
 #ifdef NO_TLS
-   /* MALLOC_MAG requires TLS. */
-#  ifdef MALLOC_MAG
-#    undef MALLOC_MAG
-#  endif
-   /* MALLOC_BALANCE requires TLS. */
-#  ifdef MALLOC_BALANCE
-#    undef MALLOC_BALANCE
+   /* MALLOC_TCACHE requires TLS. */
+#  ifdef MALLOC_TCACHE
+#    undef MALLOC_TCACHE
 #  endif
 #endif
 
@@ -285,17 +278,24 @@ __FBSDID("$FreeBSD$");
 * Size and alignment of memory chunks that are allocated by the OS's virtual
 * memory system.
 */
-#define	CHUNK_2POW_DEFAULT	20
+#define	LG_CHUNK_DEFAULT	22
 
-/* Maximum number of dirty pages per arena. */
-#define	DIRTY_MAX_DEFAULT	(1U << 9)
+/*
+ * The minimum ratio of active:dirty pages per arena is computed as:
+ *
+ *   (nactive >> opt_lg_dirty_mult) >= ndirty
+ *
+ * So, supposing that opt_lg_dirty_mult is 5, there can be no less than 32
+ * times as many active pages as dirty pages.
+ */
+#define	LG_DIRTY_MULT_DEFAULT	5
 
 /*
 * Maximum size of L1 cache line.  This is used to avoid cache line aliasing.
 * In addition, this controls the spacing of cacheline-spaced size classes.
 */
-#define	CACHELINE_2POW		6
-#define	CACHELINE		((size_t)(1U << CACHELINE_2POW))
+#define	LG_CACHELINE		6
+#define	CACHELINE		((size_t)(1U << LG_CACHELINE))
 #define	CACHELINE_MASK		(CACHELINE - 1)
 
 /*
@@ -305,13 +305,13 @@ __FBSDID("$FreeBSD$");
 * There must be at least 4 subpages per page, due to the way size classes are
 * handled.
 */
-#define	SUBPAGE_2POW		8
-#define	SUBPAGE			((size_t)(1U << SUBPAGE_2POW))
+#define	LG_SUBPAGE		8
+#define	SUBPAGE			((size_t)(1U << LG_SUBPAGE))
 #define	SUBPAGE_MASK		(SUBPAGE - 1)
 
@@ -319,14 +319,20 @@ __FBSDID("$FreeBSD$");
 #ifdef MALLOC_TINY
    /* Smallest size class to support. */
-#  define TINY_MIN_2POW		1
+#  define LG_TINY_MIN		1
 #endif
 
 /*
 * Maximum size class that is a multiple of the quantum, but not (necessarily)
 * a power of 2.  Above this size, allocations are rounded up to the nearest
 * power of 2.
 */
-#define	QSPACE_MAX_2POW_DEFAULT	7
+#define	LG_QSPACE_MAX_DEFAULT	7
 
 /*
 * Maximum size class that is a multiple of the cacheline, but not (necessarily)
 * a power of 2.  Above this size, allocations are rounded up to the nearest
 * power of 2.
 */
-#define	CSPACE_MAX_2POW_DEFAULT	9
+#define	LG_CSPACE_MAX_DEFAULT	9
+
+/*
+ * Maximum medium size class.  This must not be more than 1/4 of a chunk
+ * (LG_MEDIUM_MAX_DEFAULT <= LG_CHUNK_DEFAULT - 2).
+ */
+#define	LG_MEDIUM_MAX_DEFAULT	15
 
 /*
 * RUN_MAX_OVRHD indicates maximum desired run header overhead.  Runs are sized
@@ -350,7 +356,10 @@ __FBSDID("$FreeBSD$");
 #define	RUN_MAX_OVRHD_RELAX	0x00001800U
 
 /* Put a cap on small object run size.  This overrides RUN_MAX_OVRHD. */
-#define	RUN_MAX_SMALL	(12 * PAGE_SIZE)
+#define	RUN_MAX_SMALL							\
+	(arena_maxclass <= (1U << (CHUNK_MAP_LG_PG_RANGE + PAGE_SHIFT))	\
+	    ? arena_maxclass : (1U << (CHUNK_MAP_LG_PG_RANGE +		\
+	    PAGE_SHIFT)))
 
 /*
 * Hyper-threaded CPUs may need a special instruction inside spin loops in
@@ -366,40 +375,21 @@ __FBSDID("$FreeBSD$");
 * potential for priority inversion deadlock.  Backing off past a certain point
 * can actually waste time.
 */
-#define	SPIN_LIMIT_2POW		11
-
-/*
- * Conversion from spinning to blocking is expensive; we use (1U <<
- * BLOCK_COST_2POW) to estimate how many more times costly blocking is than
- * worst-case spinning.
- */
-#define	BLOCK_COST_2POW		4
-
-#ifdef MALLOC_MAG
-   /*
-    * Default magazine size, in bytes.  max_rounds is calculated to make
-    * optimal use of the space, leaving just enough room for the magazine
-    * header.
-    */
-#  define MAG_SIZE_2POW_DEFAULT	9
-#endif
+#define	LG_SPIN_LIMIT		11
 
-#ifdef MALLOC_BALANCE
+#ifdef MALLOC_TCACHE
    /*
-    * We use an exponential moving average to track recent lock contention,
-    * where the size of the history window is N, and alpha=2/(N+1).
-    *
-    * Due to integer math rounding, very small values here can cause
-    * substantial degradation in accuracy, thus making the moving average decay
-    * faster than it would with precise calculation.
+    * Default number of cache slots for each bin in the thread cache (0:
+    * disabled).
     */
-#  define BALANCE_ALPHA_INV_2POW	9
-
+#  define LG_TCACHE_NSLOTS_DEFAULT	7
   /*
-    * Threshold value for the exponential moving contention average at which to
-    * re-assign a thread.
+    * (1U << opt_lg_tcache_gc_sweep) is the approximate number of
+    * allocation events between full GC sweeps (-1: disabled).  Integer
+    * rounding may cause the actual number to be slightly higher, since GC is
+    * performed incrementally.
    */
-#  define BALANCE_THRESHOLD_DEFAULT	(1U << (SPIN_LIMIT_2POW-4))
+#  define LG_TCACHE_GC_SWEEP_DEFAULT	13
 #endif
 
 /******************************************************************************/
@@ -426,6 +416,17 @@ static malloc_mutex_t	init_lock = {_SPIN
 
 #ifdef MALLOC_STATS
 
+#ifdef MALLOC_TCACHE
+typedef struct tcache_bin_stats_s tcache_bin_stats_t;
+struct tcache_bin_stats_s {
+	/*
+	 * Number of allocation requests that corresponded to the size of this
+	 * bin.
+	 */
+	uint64_t	nrequests;
+};
+#endif
+
 typedef struct malloc_bin_stats_s malloc_bin_stats_t;
 struct malloc_bin_stats_s {
 	/*
@@ -434,9 +435,12 @@ struct malloc_bin_stats_s {
 	 */
 	uint64_t	nrequests;
 
-#ifdef MALLOC_MAG
-	/* Number of magazine reloads from this bin. */
-	uint64_t	nmags;
+#ifdef MALLOC_TCACHE
+	/* Number of tcache fills from this bin. */
+	uint64_t	nfills;
+
+	/* Number of tcache flushes to this bin. */
+	uint64_t	nflushes;
 #endif
 
 	/* Total number of runs created for this bin's size class. */
@@ -449,10 +453,24 @@ struct malloc_bin_stats_s {
 	uint64_t	reruns;
 
 	/* High-water mark for this bin. */
-	unsigned long	highruns;
+	size_t		highruns;
 
 	/* Current number of runs in this bin. */
-	unsigned long	curruns;
+	size_t		curruns;
+};
+
+typedef struct malloc_large_stats_s malloc_large_stats_t;
+struct malloc_large_stats_s {
+	/*
+	 * Number of allocation requests that corresponded to this size class.
+	 */
+	uint64_t	nrequests;
+
+	/* High-water mark for this size class. */
+	size_t		highruns;
+
+	/* Current number of runs of this size class. */
+	size_t		curruns;
 };
 
 typedef struct arena_stats_s arena_stats_t;
@@ -474,14 +492,21 @@ struct arena_stats_s {
 	uint64_t	nmalloc_small;
 	uint64_t	ndalloc_small;
 
+	size_t		allocated_medium;
+	uint64_t	nmalloc_medium;
+	uint64_t	ndalloc_medium;
+
 	size_t		allocated_large;
 	uint64_t	nmalloc_large;
 	uint64_t	ndalloc_large;
 
-#ifdef MALLOC_BALANCE
-	/* Number of times this arena reassigned a thread due to contention. */
-	uint64_t	nbalance;
-#endif
+	/*
+	 * One element for each possible size class, including sizes that
+	 * overlap with bin size classes.  This is necessary because ipalloc()
+	 * sometimes has to use such large objects in order to assure proper
+	 * alignment.
+	 */
+	malloc_large_stats_t	*lstats;
 };
 
 typedef struct chunk_stats_s chunk_stats_t;
@@ -490,14 +515,14 @@ struct chunk_stats_s {
 	uint64_t	nchunks;
 
 	/* High-water mark for number of chunks allocated. */
-	unsigned long	highchunks;
+	size_t		highchunks;
 
 	/*
 	 * Current number of chunks allocated.  This value isn't maintained for
 	 * any other purpose, so keep track of it in order to be able to set
 	 * highchunks.
 	 */
-	unsigned long	curchunks;
+	size_t		curchunks;
 };
 
 #endif /* #ifdef MALLOC_STATS */
@@ -550,14 +575,14 @@ struct arena_chunk_map_s {
	 * Run address (or size) and various flags are stored together.  The bit
	 * layout looks like (assuming 32-bit system):
	 *
-	 *   ???????? ???????? ????---- ---kdzla
+	 *   ???????? ???????? ????cccc ccccdzla
	 *
	 * ? : Unallocated: Run address for first/last pages, unset for internal
	 *     pages.
-	 *     Small: Run address.
+	 *     Small/medium: Don't care.
	 *     Large: Run size for first page, unset for trailing pages.
	 * - : Unused.
-	 * k : key?
+	 * c : refcount (could overflow for PAGE_SIZE >= 128 KiB)
	 * d : dirty?
	 * z : zeroed?
	 * l : large?
	 * a : allocated?
	 *
	 * Following are example bit patterns for the three types of runs.
	 *
-	 * r : run address
+	 * p : run page offset
	 * s : run size
	 * x : don't care
	 * - : 0
	 *
	 * Unallocated:
	 *   ???????? ???????? ????---- ----d---
	 *   xxxxxxxx xxxxxxxx xxxx---- ----d---
	 *   ssssssss ssssssss ssss---- -----z--
	 *
-	 * Small:
-	 *   rrrrrrrr rrrrrrrr rrrr---- -------a
-	 *   rrrrrrrr rrrrrrrr rrrr---- -------a
-	 *   rrrrrrrr rrrrrrrr rrrr---- -------a
+	 * Small/medium:
+	 *   pppppppp ppppcccc cccccccc cccc---a
+	 *   pppppppp ppppcccc cccccccc cccc---a
+	 *   pppppppp ppppcccc cccccccc cccc---a
	 *
	 * Large:
	 *   ssssssss ssssssss ssss---- ------la
	 *   -------- -------- -------- ------la
@@ -587,11 +612,19 @@
	 *   -------- -------- -------- ------la
	 */
	size_t	bits;
-#define	CHUNK_MAP_KEY		((size_t)0x10U)
-#define	CHUNK_MAP_DIRTY		((size_t)0x08U)
-#define	CHUNK_MAP_ZEROED	((size_t)0x04U)
-#define	CHUNK_MAP_LARGE		((size_t)0x02U)
-#define	CHUNK_MAP_ALLOCATED	((size_t)0x01U)
+#define	CHUNK_MAP_PG_MASK	((size_t)0xfff00000U)
+#define	CHUNK_MAP_PG_SHIFT	20
+#define	CHUNK_MAP_LG_PG_RANGE	12
+
+#define	CHUNK_MAP_RC_MASK	((size_t)0xffff0U)
+#define	CHUNK_MAP_RC_ONE	((size_t)0x00010U)
+
+#define	CHUNK_MAP_FLAGS_MASK	((size_t)0xfU)
+#define	CHUNK_MAP_DIRTY		((size_t)0x8U)
+#define	CHUNK_MAP_ZEROED	((size_t)0x4U)
+#define	CHUNK_MAP_LARGE		((size_t)0x2U)
+#define	CHUNK_MAP_ALLOCATED	((size_t)0x1U)
+#define	CHUNK_MAP_KEY		(CHUNK_MAP_DIRTY | CHUNK_MAP_ALLOCATED)
 };
 typedef rb_tree(arena_chunk_map_t) arena_avail_tree_t;
 typedef rb_tree(arena_chunk_map_t) arena_run_tree_t;
@@ -605,6 +638,13 @@ struct arena_chunk_s {
	/* Linkage for the arena's chunks_dirty tree. */
	rb_node(arena_chunk_t) link_dirty;
 
+	/*
+	 * True if the chunk is currently in the chunks_dirty tree, due to
+	 * having at some point contained one or more dirty pages.  Removal
+	 * from chunks_dirty is lazy, so (dirtied && ndirty == 0) is possible.
+	 */
+	bool		dirtied;
+
	/* Number of dirty pages. */
	size_t		ndirty;
 
@@ -670,6 +710,10 @@ struct arena_bin_s {
 #endif
 };
 
+#ifdef MALLOC_TCACHE
+typedef struct tcache_s tcache_t;
+#endif
+
 struct arena_s {
 #ifdef MALLOC_DEBUG
	uint32_t		magic;
@@ -681,6 +725,13 @@ struct arena_s {
 
 #ifdef MALLOC_STATS
	arena_stats_t		stats;
+#  ifdef MALLOC_TCACHE
+	/*
+	 * List of tcaches for extant threads associated with this arena.
+	 * Stats from these are merged incrementally, and at exit.
+	 */
+	ql_head(tcache_t)	tcache_ql;
+#  endif
 #endif
 
	/* Tree of dirty-page-containing chunks this arena manages. */
@@ -698,6 +749,9 @@ struct arena_s {
	 */
	arena_chunk_t		*spare;
 
+	/* Number of pages in active runs. */
+	size_t			nactive;
+
	/*
	 * Current count of pages within unused runs that are potentially
	 * dirty, and for which madvise(... MADV_FREE) has not been called.  By
@@ -712,67 +766,77 @@ struct arena_s {
	 */
	arena_avail_tree_t	runs_avail;
 
-#ifdef MALLOC_BALANCE
	/*
-	 * The arena load balancing machinery needs to keep track of how much
-	 * lock contention there is.  This value is exponentially averaged.
-	 */
-	uint32_t		contention;
-#endif
-
-	/*
-	 * bins is used to store rings of free regions of the following sizes,
-	 * assuming a 16-byte quantum, 4kB page size, and default
+	 * bins is used to store trees of free regions of the following sizes,
+	 * assuming a 16-byte quantum, 4 KiB page size, and default
	 * MALLOC_OPTIONS.
	 *
-	 *   bins[i] | size |
-	 *   --------+------+
-	 *         0 |    2 |
-	 *         1 |    4 |
-	 *         2 |    8 |
-	 *   --------+------+
-	 *         3 |   16 |
-	 *         4 |   32 |
-	 *         5 |   48 |
-	 *         6 |   64 |
-	 *         :      :
-	 *         :      :
-	 *        33 |  496 |
-	 *        34 |  512 |
-	 *   --------+------+
-	 *        35 | 1024 |
-	 *        36 | 2048 |
-	 *   --------+------+
+	 *   bins[i] |   size |
+	 *   --------+--------+
+	 *         0 |      2 |
+	 *         1 |      4 |
+	 *         2 |      8 |
+	 *   --------+--------+
+	 *         3 |     16 |
+	 *         4 |     32 |
+	 *         5 |     48 |
+	 *         :        :
+	 *         8 |     96 |
+	 *         9 |    112 |
+	 *        10 |    128 |
+	 *   --------+--------+
+	 *        11 |    192 |
+	 *        12 |    256 |
+	 *        13 |    320 |
+	 *        14 |    384 |
+	 *        15 |    448 |
+	 *        16 |    512 |
+	 *   --------+--------+
+	 *        17 |    768 |
+	 *        18 |   1024 |
+	 *        19 |   1280 |
+	 *         :        :
+	 *        27 |   3328 |
+	 *        28 |   3584 |
+	 *        29 |   3840 |
+	 *   --------+--------+
+	 *        30 |  4 KiB |
+	 *        31 |  6 KiB |
+	 *        32 |  8 KiB |
+	 *         :        :
+	 *        42 | 28 KiB |
+	 *        43 | 30 KiB |
+	 *        44 | 32 KiB |
+	 *   --------+--------+
	 */
	arena_bin_t		bins[1]; /* Dynamically sized. */
 };
 
 /******************************************************************************/
 /*
- * Magazine data structures.
+ * Thread cache data structures.
 */
 
-#ifdef MALLOC_MAG
-typedef struct mag_s mag_t;
-struct mag_s {
-	size_t		binind; /* Index of associated bin. */
-	size_t		nrounds;
-	void		*rounds[1]; /* Dynamically sized. */
-};

*** DIFF OUTPUT TRUNCATED AT 1000 LINES ***
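To make the size-class rounding rules described in the updated malloc.3 text
concrete, here is a minimal C sketch assuming the documented defaults: a
16-byte quantum, 4 KiB pages, a 128-byte maximum quantum-spaced class, a
512-byte maximum cacheline-spaced class, 256-byte subpages, and a 32 KiB
medium maximum with 2 KiB spacing (per the commit log).  The constants and
the size_class()/round_up() helpers are illustrative only; they are not the
allocator's internal API, and large/huge rounding is simplified.

    #include <stddef.h>
    #include <stdio.h>

    #define QUANTUM      16              /* minimum alignment */
    #define CACHELINE    64              /* cacheline size (C option) */
    #define SUBPAGE      256             /* subpage size */
    #define PAGE_SIZE    4096
    #define QSPACE_MAX   128             /* max quantum-spaced class (Q) */
    #define CSPACE_MAX   512             /* max cacheline-spaced class (C) */
    #define SUBPAGE_MAX  (PAGE_SIZE - SUBPAGE) /* 3840 */
    #define MEDIUM_MAX   (32 * 1024)     /* default E option value */
    #define MEDIUM_SPACE (2 * 1024)      /* default medium spacing */

    /* Round size up to the next multiple of a power-of-two alignment. */
    static size_t
    round_up(size_t size, size_t align)
    {

            return ((size + align - 1) & ~(align - 1));
    }

    static size_t
    size_class(size_t size)
    {

            if (size <= QUANTUM / 2) {
                    /* Tiny: round up to the nearest power of two (>= 2). */
                    size_t s = 2;
                    while (s < size)
                            s <<= 1;
                    return (s);
            }
            if (size <= QSPACE_MAX)
                    return (round_up(size, QUANTUM));
            if (size <= CSPACE_MAX)
                    return (round_up(size, CACHELINE));
            if (size <= SUBPAGE_MAX)
                    return (round_up(size, SUBPAGE));
            if (size <= MEDIUM_MAX)
                    return (round_up(size, MEDIUM_SPACE));
            /* Large: rounded to run size; simplified here as page rounding. */
            return (round_up(size, PAGE_SIZE));
    }

    int
    main(void)
    {
            size_t sizes[] = { 1, 20, 200, 600, 2000, 5000, 40000 };

            for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
                    printf("%zu -> %zu\n", sizes[i], size_class(sizes[i]));
            return (0);
    }

For example, a 600-byte request lands in the 768-byte subpage-spaced class,
and a 5000-byte request in the 6 KiB medium class, matching the bins table
above.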
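The new dirty-page policy replaces the old fixed per-arena dirty page cap
with a ratio test.  The invariant quoted in the diff,
(nactive >> opt_lg_dirty_mult) >= ndirty, can be read as the predicate
below.  The variable names come from the diff, but the function itself is a
hypothetical sketch, not the actual purge code.

    #include <stdbool.h>
    #include <stddef.h>

    static size_t opt_lg_dirty_mult = 5;  /* LG_DIRTY_MULT_DEFAULT */

    /*
     * True once dirty pages exceed 1/(2^opt_lg_dirty_mult) of the arena's
     * active pages, i.e. once the documented invariant no longer holds; at
     * that point the allocator would madvise(2) at least half of the dirty
     * pages.  With the default of 5, purging starts past a 32:1 ratio.
     */
    static bool
    arena_purge_needed(size_t nactive, size_t ndirty)
    {

            return ((nactive >> opt_lg_dirty_mult) < ndirty);
    }

Tying the dirty limit to nactive means a large, busy arena may retain more
dirty pages than a small one, instead of every arena sharing one fixed cap.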
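The G option's incremental garbage collection can be paced as sketched
below: a full sweep every (1U << opt_lg_tcache_gc_sweep) cache events, with
one size class swept per GC step so no single event incurs a large pause.
This is a speculative illustration of the pacing arithmetic only; the
tcache_gc_t structure and tcache_gc_event() function are invented names, and
the real implementation differs.

    #include <limits.h>
    #include <stdint.h>

    #define LG_TCACHE_GC_SWEEP_DEFAULT 13  /* 8192 events per full sweep */

    typedef struct {
            uint64_t ev_cnt;    /* events since the last incremental GC step */
            unsigned next_bin;  /* next size class (bin) to sweep */
            unsigned nbins;     /* number of cached size classes (> 0) */
    } tcache_gc_t;

    /*
     * Call on every cache allocation/deallocation event.  Returns the index
     * of the bin to sweep, or UINT_MAX if no GC step is due yet.  Sweeping
     * one bin per step spreads a full sweep over roughly
     * (1U << lg_gc_sweep) events.
     */
    static unsigned
    tcache_gc_event(tcache_gc_t *tc, unsigned lg_gc_sweep)
    {
            uint64_t incr = ((uint64_t)1 << lg_gc_sweep) / tc->nbins;

            if (incr == 0)
                    incr = 1;  /* more bins than events; step every event */
            if (++tc->ev_cnt < incr)
                    return (UINT_MAX);
            tc->ev_cnt = 0;
            tc->next_bin = (tc->next_bin + 1) % tc->nbins;
            return (tc->next_bin);
    }

The integer division here is also why the man page warns that rounding may
make the effective sweep interval slightly longer than 2^opt_lg_tcache_gc_sweep.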
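Finally, the repacked arena_chunk_map_t bits can be decoded with the
CHUNK_MAP_* masks shown in the diff (32-bit layout: a 12-bit run page
offset, a 16-bit refcount, and 4 flag bits).  The mask values below are
copied from the diff; the accessor functions are illustrative only and do
not exist in malloc.c.

    #include <stdbool.h>
    #include <stddef.h>

    /* Masks copied from the arena_chunk_map_s definition above. */
    #define CHUNK_MAP_PG_MASK    ((size_t)0xfff00000U)
    #define CHUNK_MAP_PG_SHIFT   20
    #define CHUNK_MAP_RC_MASK    ((size_t)0xffff0U)
    #define CHUNK_MAP_RC_ONE     ((size_t)0x00010U)
    #define CHUNK_MAP_DIRTY      ((size_t)0x8U)
    #define CHUNK_MAP_ZEROED     ((size_t)0x4U)
    #define CHUNK_MAP_LARGE      ((size_t)0x2U)
    #define CHUNK_MAP_ALLOCATED  ((size_t)0x1U)

    /* Small/medium pages: offset of this page within its run, in pages. */
    static size_t
    map_run_pg(size_t bits)
    {

            return ((bits & CHUNK_MAP_PG_MASK) >> CHUNK_MAP_PG_SHIFT);
    }

    /* Per-page count of allocated regions (the 16-bit "c" field). */
    static size_t
    map_refcount(size_t bits)
    {

            return ((bits & CHUNK_MAP_RC_MASK) / CHUNK_MAP_RC_ONE);
    }

    /* Example flag test on the low four bits. */
    static bool
    map_is_dirty(size_t bits)
    {

            return ((bits & CHUNK_MAP_DIRTY) != 0);
    }

Packing the run page offset into the map entry (rather than storing a run
address) is what lets the per-page refcounts fit, which in turn enables the
precise dirty-page tracking described in the commit log.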