From owner-freebsd-hackers@FreeBSD.ORG Sun Apr 29 23:49:15 2012
Message-ID: <4F9DD372.1020001@rice.edu>
Date: Sun, 29 Apr 2012 18:49:06 -0500
From: Alan Cox
To: Andrey Zonov
Cc: Konstantin Belousov, freebsd-hackers@freebsd.org, alc@freebsd.org
Subject: Re: problems with mmap() and disk caching
References: <4F7B495D.3010402@zonov.org>
 <20120404071746.GJ2358@deviant.kiev.zoral.com.ua> <4F7DC037.9060803@rice.edu>
 <201204091126.25260.jhb@freebsd.org> <4F845D9B.10004@rice.edu>
 <4F851F87.3050206@zonov.org>
In-Reply-To: <4F851F87.3050206@zonov.org>

On 04/11/2012 01:07, Andrey Zonov wrote:
> On 10.04.2012 20:19, Alan Cox wrote:
>> On 04/09/2012 10:26, John Baldwin wrote:
>>> On Thursday, April 05, 2012 11:54:31 am Alan Cox wrote:
>>>> On 04/04/2012 02:17, Konstantin Belousov wrote:
>>>>> On Tue, Apr 03, 2012 at 11:02:53PM +0400, Andrey Zonov wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I open the file, then call mmap() on the whole file and get a pointer,
>>>>>> then I work with this pointer.  I expect that a page should be touched
>>>>>> only once to get it into memory (disk cache?), but this doesn't work!
>>>>>>
>>>>>> I wrote the test (attached) and ran it for a 1G file generated from
>>>>>> /dev/random; the result is the following:
>>>>>>
>>>>>> Prepare file:
>>>>>> # swapoff -a
>>>>>> # newfs /dev/ada0b
>>>>>> # mount /dev/ada0b /mnt
>>>>>> # dd if=/dev/random of=/mnt/random-1024 bs=1m count=1024
>>>>>>
>>>>>> Purge cache:
>>>>>> # umount /mnt
>>>>>> # mount /dev/ada0b /mnt
>>>>>>
>>>>>> Run test:
>>>>>> $ ./mmap /mnt/random-1024 30
>>>>>> mmap: 1 pass took: 7.431046 (none: 262112; res: 32; super: 0; other: 0)
>>>>>> mmap: 2 pass took: 7.356670 (none: 261648; res: 496; super: 0; other: 0)
>>>>>> mmap: 3 pass took: 7.307094 (none: 260521; res: 1623; super: 0; other: 0)
>>>>>> mmap: 4 pass took: 7.350239 (none: 258904; res: 3240; super: 0; other: 0)
>>>>>> mmap: 5 pass took: 7.392480 (none: 257286; res: 4858; super: 0; other: 0)
>>>>>> mmap: 6 pass took: 7.292069 (none: 255584; res: 6560; super: 0; other: 0)
>>>>>> mmap: 7 pass took: 7.048980 (none: 251142; res: 11002; super: 0; other: 0)
>>>>>> mmap: 8 pass took: 6.899387 (none: 247584; res: 14560; super: 0; other: 0)
>>>>>> mmap: 9 pass took: 7.190579 (none: 242992; res: 19152; super: 0; other: 0)
>>>>>> mmap: 10 pass took: 6.915482 (none: 239308; res: 22836; super: 0; other: 0)
>>>>>> mmap: 11 pass took: 6.565909 (none: 232835; res: 29309; super: 0; other: 0)
>>>>>> mmap: 12 pass took: 6.423945 (none: 226160; res: 35984; super: 0; other: 0)
>>>>>> mmap: 13 pass took: 6.315385 (none: 208555; res: 53589; super: 0; other: 0)
>>>>>> mmap: 14 pass took: 6.760780 (none: 192805; res: 69339; super: 0; other: 0)
>>>>>> mmap: 15 pass took: 5.721513 (none: 174497; res: 87647; super: 0; other: 0)
>>>>>> mmap: 16 pass took: 5.004424 (none: 155938; res: 106206; super: 0; other: 0)
>>>>>> mmap: 17 pass took: 4.224926 (none: 135639; res: 126505; super: 0; other: 0)
>>>>>> mmap: 18 pass took: 3.749608 (none: 117952; res: 144192; super: 0; other: 0)
>>>>>> mmap: 19 pass took: 3.398084 (none: 99066; res: 163078; super: 0; other: 0)
>>>>>> mmap: 20 pass took: 3.029557 (none: 74994; res: 187150; super: 0; other: 0)
>>>>>> mmap: 21 pass took: 2.379430 (none: 55231; res: 206913; super: 0; other: 0)
>>>>>> mmap: 22 pass took: 2.046521 (none: 40786; res: 221358; super: 0; other: 0)
>>>>>> mmap: 23 pass took: 1.152797 (none: 30311; res: 231833; super: 0; other: 0)
>>>>>> mmap: 24 pass took: 0.972617 (none: 16196; res: 245948; super: 0; other: 0)
>>>>>> mmap: 25 pass took: 0.577515 (none: 8286; res: 253858; super: 0; other: 0)
>>>>>> mmap: 26 pass took: 0.380738 (none: 3712; res: 258432; super: 0; other: 0)
>>>>>> mmap: 27 pass took: 0.253583 (none: 1193; res: 260951; super: 0; other: 0)
>>>>>> mmap: 28 pass took: 0.157508 (none: 0; res: 262144; super: 0; other: 0)
>>>>>> mmap: 29 pass took: 0.156169 (none: 0; res: 262144; super: 0; other: 0)
>>>>>> mmap: 30 pass took: 0.156550 (none: 0; res: 262144; super: 0; other: 0)
>>>>>>
>>>>>> If I run this:
>>>>>> $ cat /mnt/random-1024 > /dev/null
>>>>>> before the test, then the result is the following:
>>>>>>
>>>>>> $ ./mmap /mnt/random-1024 5
>>>>>> mmap: 1 pass took: 0.337657 (none: 0; res: 262144; super: 0; other: 0)
>>>>>> mmap: 2 pass took: 0.186137 (none: 0; res: 262144; super: 0; other: 0)
>>>>>> mmap: 3 pass took: 0.186132 (none: 0; res: 262144; super: 0; other: 0)
>>>>>> mmap: 4 pass took: 0.186535 (none: 0; res: 262144; super: 0; other: 0)
>>>>>> mmap: 5 pass took: 0.190353 (none: 0; res: 262144; super: 0; other: 0)
>>>>>>
>>>>>> This is what I expect.  But why doesn't this work without reading the
>>>>>> file manually?
>>>>>
>>>>> The issue seems to be in some change in the behaviour of the reserv or
>>>>> phys allocator.  I Cc:ed Alan.
>>>>
>>>> I'm pretty sure that the behavior here hasn't significantly changed in
>>>> about twelve years.  Otherwise, I agree with your analysis.
>>>>
>>>> On more than one occasion, I've been tempted to change:
>>>>
>>>>                         pmap_remove_all(mt);
>>>>                         if (mt->dirty != 0)
>>>>                                 vm_page_deactivate(mt);
>>>>                         else
>>>>                                 vm_page_cache(mt);
>>>>
>>>> to:
>>>>
>>>>                         vm_page_dontneed(mt);
>>>>
>>>> because I suspect that the current code does more harm than good.  In
>>>> theory, it saves activations of the page daemon.  However, more often
>>>> than not, I suspect that we are spending more on page reactivations
>>>> than we are saving on page daemon activations.  The sequential access
>>>> detection heuristic is just too easily triggered.  For example, I've
>>>> seen it triggered by demand paging of the gcc text segment.  Also, I
>>>> think that pmap_remove_all() and especially vm_page_cache() are too
>>>> severe for a detection heuristic that is so easily triggered.
>>>
>>> Are you planning to commit this?
>>>
>>
>> Not yet.  I did some tests with a file that was several times larger
>> than DRAM, and I didn't like what I saw.  Initially, everything behaved
>> as expected, but about halfway through the test the bulk of the pages
>> were active.  Despite the call to pmap_clear_reference() in
>> vm_page_dontneed(), the page daemon is finding the pages to be
>> referenced and reactivating them.  The net result is that the time it
>> takes to read the file (from a relatively fast SSD) goes up by about
>> 12%.  So, this still needs work.
>>
>
> Hi Alan,
>
> What do you think about the attached patch?
>

Sorry for the slow reply; I've been rather busy for the past couple of
weeks.  What you propose is clearly good for sequential accesses, but not
so good for random accesses.  Keep in mind that the potential costs of
unconditionally increasing the read window include not only wasted I/O but
also increased memory pressure.

Rather than argue about which is more important, sequential or random
access, I think it's more productive to replace the sequential access
heuristic.  The current heuristic is just not that sophisticated.  It's
easy to do better.

The attached patch implements a new heuristic, which starts with the same
initial read window as the current heuristic, but arithmetically grows the
window on sequential page faults.  From a stylistic standpoint, this patch
also cleanly separates the "read ahead" logic from the "cache behind"
logic.  At the same time, this new heuristic is more selective about
performing cache behind.  It requires three or four sequential page faults
before cache behind is enabled.  More precisely, it requires the read-ahead
window to reach its maximum size before cache behind is enabled.

For long, sequential accesses, the results of my performance tests are just
as good as unconditionally increasing the window size.  I'm also seeing
fewer pages needlessly cached by the cache behind heuristic.  That said,
there is still room for improvement.  We are still not achieving the same
sequential performance as "dd", and there are still more pages being cached
than I would like.

Alan
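[The test program that Andrey attached earlier in the thread is not included
in this message.  For reference, the following is a minimal sketch of an
equivalent harness; it assumes the "none/res/super/other" counters come from
mincore(2) and that one byte of every page is read per timed pass.  The file
name and all identifiers are illustrative, not Andrey's actual code.]

/*
 * mmap.c -- illustrative reconstruction, not the original attachment.
 * Touch every page of a mapped file once per pass, time the pass, and
 * classify the mapping's pages with mincore(2).
 */
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    struct stat st;
    struct timespec t0, t1;
    char *p, *vec;
    size_t i, none, npages, other, res, super;
    long pagesize;
    volatile char sink;
    int fd, pass, passes;

    if (argc != 3)
        errx(1, "usage: %s file passes", argv[0]);
    passes = atoi(argv[2]);
    if ((fd = open(argv[1], O_RDONLY)) == -1)
        err(1, "open");
    if (fstat(fd, &st) == -1)
        err(1, "fstat");
    pagesize = sysconf(_SC_PAGESIZE);
    npages = (st.st_size + pagesize - 1) / pagesize;
    p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        err(1, "mmap");
    if ((vec = malloc(npages)) == NULL)
        err(1, "malloc");

    for (pass = 1; pass <= passes; pass++) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        /* Read one byte from every page; this is what drives the faults. */
        for (i = 0; i < npages; i++)
            sink = p[i * pagesize];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        /* Classify the pages of the mapping after this pass. */
        if (mincore(p, st.st_size, vec) == -1)
            err(1, "mincore");
        none = res = super = other = 0;
        for (i = 0; i < npages; i++) {
            if (vec[i] == 0)
                none++;
            else if (vec[i] & MINCORE_SUPER)
                super++;
            else if (vec[i] & MINCORE_INCORE)
                res++;
            else
                other++;
        }
        printf("mmap: %d pass took: %f "
            "(none: %zu; res: %zu; super: %zu; other: %zu)\n",
            pass, (t1.tv_sec - t0.tv_sec) +
            (t1.tv_nsec - t0.tv_nsec) / 1e9,
            none, res, super, other);
    }
    return (0);
}

[Built with something like "cc -o mmap mmap.c", running
"./mmap /mnt/random-1024 30" prints one line per pass in the same format as
the output quoted above.]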
Content-Type: text/plain; name="vm_fault_cache98.patch"
Content-Disposition: attachment; filename="vm_fault_cache98.patch"

Index: vm/vm_map.c
===================================================================
--- vm/vm_map.c (revision 234106)
+++ vm/vm_map.c (working copy)
@@ -1300,6 +1300,8 @@ charged:
     new_entry->protection = prot;
     new_entry->max_protection = max;
     new_entry->wired_count = 0;
+    new_entry->read_ahead = VM_FAULT_READ_AHEAD_INIT;
+    new_entry->next_read = OFF_TO_IDX(offset);
 
     KASSERT(cred == NULL || !ENTRY_CHARGED(new_entry),
         ("OVERCOMMIT: vm_map_insert leaks vm_map %p", new_entry));
Index: vm/vm_map.h
===================================================================
--- vm/vm_map.h (revision 234106)
+++ vm/vm_map.h (working copy)
@@ -112,8 +112,9 @@ struct vm_map_entry {
     vm_prot_t protection;       /* protection code */
     vm_prot_t max_protection;   /* maximum protection */
     vm_inherit_t inheritance;   /* inheritance */
+    uint8_t read_ahead;         /* pages in the read-ahead window */
     int wired_count;            /* can be paged if = 0 */
-    vm_pindex_t lastr;          /* last read */
+    vm_pindex_t next_read;      /* index of the next sequential read */
     struct ucred *cred;         /* tmp storage for creator ref */
 };
 
@@ -330,6 +331,14 @@ long vmspace_wired_count(struct vmspace *vmspace);
 #define VM_FAULT_DIRTY 2        /* Dirty the page; use w/VM_PROT_COPY */
 
 /*
+ * Initially, mappings are slightly sequential.  The maximum window size must
+ * account for the map entry's "read_ahead" field being defined as an uint8_t.
+ */
+#define VM_FAULT_READ_AHEAD_MIN     7
+#define VM_FAULT_READ_AHEAD_INIT    15
+#define VM_FAULT_READ_AHEAD_MAX     min(atop(MAXPHYS) - 1, UINT8_MAX)
+
+/*
  * The following "find_space" options are supported by vm_map_find()
  */
 #define VMFS_NO_SPACE 0         /* don't find; use the given range */
Index: vm/vm_fault.c
===================================================================
--- vm/vm_fault.c (revision 234106)
+++ vm/vm_fault.c (working copy)
@@ -118,9 +118,11 @@ static int prefault_pageorder[] = {
 static int vm_fault_additional_pages(vm_page_t, int, int, vm_page_t *, int *);
 static void vm_fault_prefault(pmap_t, vm_offset_t, vm_map_entry_t);
 
-#define VM_FAULT_READ_AHEAD 8
-#define VM_FAULT_READ_BEHIND 7
-#define VM_FAULT_READ (VM_FAULT_READ_AHEAD+VM_FAULT_READ_BEHIND+1)
+#define VM_FAULT_READ_BEHIND    8
+#define VM_FAULT_READ_MAX       (1 + VM_FAULT_READ_AHEAD_MAX)
+#define VM_FAULT_NINCR          (VM_FAULT_READ_MAX / VM_FAULT_READ_BEHIND)
+#define VM_FAULT_SUM            (VM_FAULT_NINCR * (VM_FAULT_NINCR + 1) / 2)
+#define VM_FAULT_CACHE_BEHIND   (VM_FAULT_READ_BEHIND * VM_FAULT_SUM)
 
 struct faultstate {
     vm_page_t m;
@@ -136,6 +138,8 @@ struct faultstate {
     int vfslocked;
 };
 
+static void vm_fault_cache_behind(const struct faultstate *fs, int distance);
+
 static inline void
 release_page(struct faultstate *fs)
 {
@@ -236,13 +240,13 @@ vm_fault_hold(vm_map_t map, vm_offset_t vaddr, vm_
     int fault_flags, vm_page_t *m_hold)
 {
     vm_prot_t prot;
-    int is_first_object_locked, result;
-    boolean_t growstack, wired;
+    long ahead, behind;
+    int alloc_req, era, faultcount, nera, reqpage, result;
+    boolean_t growstack, is_first_object_locked, wired;
     int map_generation;
     vm_object_t next_object;
-    vm_page_t marray[VM_FAULT_READ], mt, mt_prev;
+    vm_page_t marray[VM_FAULT_READ_MAX];
     int hardfault;
-    int faultcount, ahead, behind, alloc_req;
     struct faultstate fs;
     struct vnode *vp;
     int locked, error;
@@ -252,7 +256,7 @@ vm_fault_hold(vm_map_t map, vm_offset_t vaddr, vm_
     PCPU_INC(cnt.v_vm_faults);
     fs.vp = NULL;
     fs.vfslocked = 0;
-    faultcount = behind = 0;
+    faultcount = reqpage = 0;
 
 RetryFault:;
 
@@ -460,76 +464,48 @@ readrest:
          */
        if (TRYPAGER) {
            int rv;
-           int reqpage = 0;
            u_char behavior = vm_map_entry_behavior(fs.entry);
 
            if (behavior == MAP_ENTRY_BEHAV_RANDOM ||
                P_KILLED(curproc)) {
+               behind = 0;
                ahead = 0;
+           } else if (behavior == MAP_ENTRY_BEHAV_SEQUENTIAL) {
                behind = 0;
+               ahead = atop(fs.entry->end - vaddr) - 1;
+               if (ahead > VM_FAULT_READ_AHEAD_MAX)
+                   ahead = VM_FAULT_READ_AHEAD_MAX;
+               if (fs.pindex == fs.entry->next_read)
+                   vm_fault_cache_behind(&fs,
+                       VM_FAULT_READ_MAX);
            } else {
-               behind = (vaddr - fs.entry->start) >> PAGE_SHIFT;
+               /*
+                * If this is a sequential page fault, then
+                * arithmetically increase the number of pages
+                * in the read-ahead window.  Otherwise, reset
+                * the read-ahead window to its smallest size.
+                */
+               behind = atop(vaddr - fs.entry->start);
                if (behind > VM_FAULT_READ_BEHIND)
                    behind = VM_FAULT_READ_BEHIND;
-
-               ahead = ((fs.entry->end - vaddr) >> PAGE_SHIFT) - 1;
-               if (ahead > VM_FAULT_READ_AHEAD)
-                   ahead = VM_FAULT_READ_AHEAD;
+               ahead = atop(fs.entry->end - vaddr) - 1;
+               era = fs.entry->read_ahead;
+               if (fs.pindex == fs.entry->next_read) {
+                   nera = era + behind;
+                   if (nera > VM_FAULT_READ_AHEAD_MAX)
+                       nera = VM_FAULT_READ_AHEAD_MAX;
+                   behind = 0;
+                   if (ahead > nera)
+                       ahead = nera;
+                   if (era == VM_FAULT_READ_AHEAD_MAX)
+                       vm_fault_cache_behind(&fs,
+                           VM_FAULT_CACHE_BEHIND);
+               } else if (ahead > VM_FAULT_READ_AHEAD_MIN)
+                   ahead = VM_FAULT_READ_AHEAD_MIN;
+               if (era != ahead)
+                   fs.entry->read_ahead = ahead;
            }
-           is_first_object_locked = FALSE;
-           if ((behavior == MAP_ENTRY_BEHAV_SEQUENTIAL ||
-               (behavior != MAP_ENTRY_BEHAV_RANDOM &&
-                fs.pindex >= fs.entry->lastr &&
-                fs.pindex < fs.entry->lastr + VM_FAULT_READ)) &&
-               (fs.first_object == fs.object ||
-                (is_first_object_locked = VM_OBJECT_TRYLOCK(fs.first_object))) &&
-               fs.first_object->type != OBJT_DEVICE &&
-               fs.first_object->type != OBJT_PHYS &&
-               fs.first_object->type != OBJT_SG) {
-               vm_pindex_t firstpindex;
-
-               if (fs.first_pindex < 2 * VM_FAULT_READ)
-                   firstpindex = 0;
-               else
-                   firstpindex = fs.first_pindex - 2 * VM_FAULT_READ;
-               mt = fs.first_object != fs.object ?
-                   fs.first_m : fs.m;
-               KASSERT(mt != NULL, ("vm_fault: missing mt"));
-               KASSERT((mt->oflags & VPO_BUSY) != 0,
-                   ("vm_fault: mt %p not busy", mt));
-               mt_prev = vm_page_prev(mt);
-
-               /*
-                * note: partially valid pages cannot be
-                * included in the lookahead - NFS piecemeal
-                * writes will barf on it badly.
-                */
-               while ((mt = mt_prev) != NULL &&
-                   mt->pindex >= firstpindex &&
-                   mt->valid == VM_PAGE_BITS_ALL) {
-                   mt_prev = vm_page_prev(mt);
-                   if (mt->busy ||
-                       (mt->oflags & VPO_BUSY))
-                       continue;
-                   vm_page_lock(mt);
-                   if (mt->hold_count ||
-                       mt->wire_count) {
-                       vm_page_unlock(mt);
-                       continue;
-                   }
-                   pmap_remove_all(mt);
-                   if (mt->dirty != 0)
-                       vm_page_deactivate(mt);
-                   else
-                       vm_page_cache(mt);
-                   vm_page_unlock(mt);
-               }
-               ahead += behind;
-               behind = 0;
-           }
-           if (is_first_object_locked)
-               VM_OBJECT_UNLOCK(fs.first_object);
-
 
            /*
             * Call the pager to retrieve the data, if any, after
             * releasing the lock on the map.  We hold a ref on
@@ -899,7 +875,7 @@ vnode_locked:
      * without holding a write lock on it.
      */
     if (hardfault)
-        fs.entry->lastr = fs.pindex + faultcount - behind;
+        fs.entry->next_read = fs.pindex + faultcount - reqpage;
 
     if ((prot & VM_PROT_WRITE) != 0 ||
         (fault_flags & VM_FAULT_DIRTY) != 0) {
@@ -992,6 +968,56 @@ vnode_locked:
 }
 
 /*
+ * Speed up the reclamation of up to "distance" pages that precede the
+ * faulting pindex within the first object of the shadow chain.
+ */
+static void
+vm_fault_cache_behind(const struct faultstate *fs, int distance)
+{
+    vm_page_t m, m_prev;
+    vm_pindex_t pindex;
+    boolean_t is_first_object_locked;
+
+    VM_OBJECT_LOCK_ASSERT(fs->object, MA_OWNED);
+    is_first_object_locked = FALSE;
+    if (fs->first_object != fs->object && !(is_first_object_locked =
+        VM_OBJECT_TRYLOCK(fs->first_object)))
+        return;
+    if (fs->first_object->type != OBJT_DEVICE &&
+        fs->first_object->type != OBJT_PHYS &&
+        fs->first_object->type != OBJT_SG) {
+        if (fs->first_pindex < distance)
+            pindex = 0;
+        else
+            pindex = fs->first_pindex - distance;
+        if (pindex < OFF_TO_IDX(fs->entry->offset))
+            pindex = OFF_TO_IDX(fs->entry->offset);
+        m = fs->first_object != fs->object ? fs->first_m : fs->m;
+        KASSERT(m != NULL, ("vm_fault_cache_behind: page missing"));
+        KASSERT((m->oflags & VPO_BUSY) != 0,
+            ("vm_fault_cache_behind: page %p is not busy", m));
+        m_prev = vm_page_prev(m);
+        while ((m = m_prev) != NULL && m->pindex >= pindex &&
+            m->valid == VM_PAGE_BITS_ALL) {
+            m_prev = vm_page_prev(m);
+            if (m->busy != 0 || (m->oflags & VPO_BUSY) != 0)
+                continue;
+            vm_page_lock(m);
+            if (m->hold_count == 0 && m->wire_count == 0) {
+                pmap_remove_all(m);
+                if (m->dirty != 0)
+                    vm_page_deactivate(m);
+                else
+                    vm_page_cache(m);
+            }
+            vm_page_unlock(m);
+        }
+    }
+    if (is_first_object_locked)
+        VM_OBJECT_UNLOCK(fs->first_object);
+}
+
+/*
  * vm_fault_prefault provides a quick way of clustering
  * pagefaults into a processes address space.  It is a "cousin"
  * of vm_map_pmap_enter, except it runs at page fault time instead
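[A worked example of the constants above, for readers following the patch
arithmetic.  The numbers assume the stock MAXPHYS of 128 kB and 4 kB pages;
both depend on the kernel configuration, so treat this as an illustration
rather than as part of the patch.]

/*
 * Derived window sizes, assuming MAXPHYS = 128 kB and PAGE_SIZE = 4 kB:
 *
 *   VM_FAULT_READ_AHEAD_MAX = min(atop(MAXPHYS) - 1, UINT8_MAX)
 *                           = min(32 - 1, 255)  = 31 pages
 *   VM_FAULT_READ_MAX       = 1 + 31            = 32 pages (one 128 kB read)
 *   VM_FAULT_NINCR          = 32 / 8            = 4
 *   VM_FAULT_SUM            = 4 * (4 + 1) / 2   = 10
 *   VM_FAULT_CACHE_BEHIND   = 8 * 10            = 80 pages (320 kB)
 *
 * Each sequential fault grows the window by "behind" (at most
 * VM_FAULT_READ_BEHIND = 8 pages), so read_ahead typically goes
 * 15 -> 23 -> 31; once it has reached VM_FAULT_READ_AHEAD_MAX, the next
 * sequential fault also calls vm_fault_cache_behind(), which is the
 * "three or four sequential page faults" mentioned in the message above.
 * A non-sequential fault shrinks the window back to
 * VM_FAULT_READ_AHEAD_MIN (7 pages).
 */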