Date:      Sun, 29 Apr 2012 18:49:06 -0500
From:      Alan Cox <alc@rice.edu>
To:        Andrey Zonov <andrey@zonov.org>
Cc:        Konstantin Belousov <kostikbel@gmail.com>, freebsd-hackers@freebsd.org, alc@freebsd.org
Subject:   Re: problems with mmap() and disk caching
Message-ID:  <4F9DD372.1020001@rice.edu>
In-Reply-To: <4F851F87.3050206@zonov.org>
References:  <4F7B495D.3010402@zonov.org> <20120404071746.GJ2358@deviant.kiev.zoral.com.ua> <4F7DC037.9060803@rice.edu> <201204091126.25260.jhb@freebsd.org> <4F845D9B.10004@rice.edu> <4F851F87.3050206@zonov.org>


On 04/11/2012 01:07, Andrey Zonov wrote:
> On 10.04.2012 20:19, Alan Cox wrote:
>> On 04/09/2012 10:26, John Baldwin wrote:
>>> On Thursday, April 05, 2012 11:54:31 am Alan Cox wrote:
>>>> On 04/04/2012 02:17, Konstantin Belousov wrote:
>>>>> On Tue, Apr 03, 2012 at 11:02:53PM +0400, Andrey Zonov wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I open the file, then call mmap() on the whole file and get a pointer,
>>>>>> then I work with this pointer.  I expect that each page should only
>>>>>> need to be touched once to get it into memory (the disk cache?), but
>>>>>> this doesn't work!
>>>>>>
>>>>>> I wrote the test (attached) and ran it on a 1G file generated from
>>>>>> /dev/random; the result is the following:
>>>>>>
>>>>>> Prepare file:
>>>>>> # swapoff -a
>>>>>> # newfs /dev/ada0b
>>>>>> # mount /dev/ada0b /mnt
>>>>>> # dd if=/dev/random of=/mnt/random-1024 bs=1m count=1024
>>>>>>
>>>>>> Purge cache:
>>>>>> # umount /mnt
>>>>>> # mount /dev/ada0b /mnt
>>>>>>
>>>>>> Run test:
>>>>>> $ ./mmap /mnt/random-1024 30
>>>>>> mmap: 1 pass took: 7.431046 (none: 262112; res: 32; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 2 pass took: 7.356670 (none: 261648; res: 496; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 3 pass took: 7.307094 (none: 260521; res: 1623; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 4 pass took: 7.350239 (none: 258904; res: 3240; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 5 pass took: 7.392480 (none: 257286; res: 4858; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 6 pass took: 7.292069 (none: 255584; res: 6560; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 7 pass took: 7.048980 (none: 251142; res: 11002; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 8 pass took: 6.899387 (none: 247584; res: 14560; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 9 pass took: 7.190579 (none: 242992; res: 19152; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 10 pass took: 6.915482 (none: 239308; res: 22836; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 11 pass took: 6.565909 (none: 232835; res: 29309; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 12 pass took: 6.423945 (none: 226160; res: 35984; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 13 pass took: 6.315385 (none: 208555; res: 53589; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 14 pass took: 6.760780 (none: 192805; res: 69339; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 15 pass took: 5.721513 (none: 174497; res: 87647; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 16 pass took: 5.004424 (none: 155938; res: 106206; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 17 pass took: 4.224926 (none: 135639; res: 126505; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 18 pass took: 3.749608 (none: 117952; res: 144192; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 19 pass took: 3.398084 (none: 99066; res: 163078; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 20 pass took: 3.029557 (none: 74994; res: 187150; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 21 pass took: 2.379430 (none: 55231; res: 206913; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 22 pass took: 2.046521 (none: 40786; res: 221358; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 23 pass took: 1.152797 (none: 30311; res: 231833; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 24 pass took: 0.972617 (none: 16196; res: 245948; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 25 pass took: 0.577515 (none: 8286; res: 253858; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 26 pass took: 0.380738 (none: 3712; res: 258432; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 27 pass took: 0.253583 (none: 1193; res: 260951; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 28 pass took: 0.157508 (none: 0; res: 262144; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 29 pass took: 0.156169 (none: 0; res: 262144; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 30 pass took: 0.156550 (none: 0; res: 262144; super:
>>>>>> 0; other: 0)
>>>>>>
>>>>>> If I run this:
>>>>>> $ cat /mnt/random-1024 > /dev/null
>>>>>> before the test, then the result is the following:
>>>>>>
>>>>>> $ ./mmap /mnt/random-1024 5
>>>>>> mmap: 1 pass took: 0.337657 (none: 0; res: 262144; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 2 pass took: 0.186137 (none: 0; res: 262144; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 3 pass took: 0.186132 (none: 0; res: 262144; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 4 pass took: 0.186535 (none: 0; res: 262144; super:
>>>>>> 0; other: 0)
>>>>>> mmap: 5 pass took: 0.190353 (none: 0; res: 262144; super:
>>>>>> 0; other: 0)
>>>>>>
>>>>>> This is what I expect.  But why doesn't this work without reading the
>>>>>> file manually?
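
A test along these lines can be sketched with mincore(2).  The program
below is only an approximation with invented names, not the original
attached test; it assumes the FreeBSD MINCORE_INCORE and MINCORE_SUPER
flags from <sys/mman.h>, touches every page of the mapping on each pass,
and counts pages as "none" (not resident), "res" (resident), or "super"
(part of a superpage mapping):

#include <sys/mman.h>
#include <sys/stat.h>
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
	struct stat st;
	struct timespec ts, te;
	char *p, *vec;
	size_t i, none, npages, other, res, super;
	volatile char c;
	long pgsz;
	int fd, pass, passes;

	if (argc != 3)
		errx(1, "usage: %s file passes", argv[0]);
	passes = atoi(argv[2]);
	if ((fd = open(argv[1], O_RDONLY)) == -1)
		err(1, "open");
	if (fstat(fd, &st) == -1)
		err(1, "fstat");
	if ((p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0)) ==
	    MAP_FAILED)
		err(1, "mmap");
	pgsz = sysconf(_SC_PAGESIZE);
	npages = (st.st_size + pgsz - 1) / pgsz;
	if ((vec = malloc(npages)) == NULL)
		err(1, "malloc");
	for (pass = 1; pass <= passes; pass++) {
		clock_gettime(CLOCK_MONOTONIC, &ts);
		for (i = 0; i < (size_t)st.st_size; i += pgsz)
			c = p[i];		/* fault the page in */
		clock_gettime(CLOCK_MONOTONIC, &te);
		if (mincore(p, st.st_size, vec) == -1)
			err(1, "mincore");
		none = res = super = other = 0;
		for (i = 0; i < npages; i++) {
			if (vec[i] == 0)
				none++;		/* not resident */
			else if (vec[i] & MINCORE_SUPER)
				super++;	/* mapped by a superpage */
			else if (vec[i] & MINCORE_INCORE)
				res++;		/* resident, small page */
			else
				other++;
		}
		printf("mmap: %d pass took: %f (none: %zu; res: %zu; "
		    "super: %zu; other: %zu)\n", pass,
		    (te.tv_sec - ts.tv_sec) + (te.tv_nsec - ts.tv_nsec) / 1e9,
		    none, res, super, other);
	}
	return (0);
}
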
>>>>> The issue seems to be some change in the behaviour of the reserv or
>>>>> phys allocator.  I Cc:ed Alan.
>>>> I'm pretty sure that the behavior here hasn't significantly changed in
>>>> about twelve years. Otherwise, I agree with your analysis.
>>>>
>>>> On more than one occasion, I've been tempted to change:
>>>>
>>>> 	pmap_remove_all(mt);
>>>> 	if (mt->dirty != 0)
>>>> 		vm_page_deactivate(mt);
>>>> 	else
>>>> 		vm_page_cache(mt);
>>>>
>>>> to:
>>>>
>>>> 	vm_page_dontneed(mt);
>>>>
>>>> because I suspect that the current code does more harm than good. In
>>>> theory, it saves activations of the page daemon. However, more often
>>>> than not, I suspect that we are spending more on page reactivations 
>>>> than
>>>> we are saving on page daemon activations. The sequential access
>>>> detection heuristic is just too easily triggered. For example, I've
>>>> seen it triggered by demand paging of the gcc text segment. Also, I
>>>> think that pmap_remove_all() and especially vm_page_cache() are too
>>>> severe for a detection heuristic that is so easily triggered.
>>> Are you planning to commit this?
>>>
>>
>> Not yet. I did some tests with a file that was several times larger than
>> DRAM, and I didn't like what I saw. Initially, everything behaved as
>> expected, but about halfway through the test the bulk of the pages were
>> active. Despite the call to pmap_clear_reference() in
>> vm_page_dontneed(), the page daemon is finding the pages to be
>> referenced and reactivating them. The net result is that the time it
>> takes to read the file (from a relatively fast SSD) goes up by about
>> 12%. So, this still needs work.
>>
>
> Hi Alan,
>
> What do you think about attached patch?
>
>

Sorry for the slow reply, I've been rather busy for the past couple of 
weeks.  What you propose is clearly good for sequential accesses, but 
not so good for random accesses.  Keep in mind, the potential costs of 
unconditionally increasing the read window include not only wasted I/O 
but also increased memory pressure.  Rather than argue about which is 
more important, sequential or random access, I think it's more 
productive to replace the sequential access heuristic.  The current 
heuristic is just not that sophisticated.  It's easy to do better.

The attached patch implements a new heuristic, which starts with the 
same initial read window as the current heuristic, but arithmetically 
grows the window on sequential page faults.  From a stylistic 
standpoint, this patch also cleanly separates the "read ahead" logic 
from the "cache behind" logic.

At the same time, this new heuristic is more selective about performing 
cache behind.  It requires three or four sequential page faults before 
cache behind is enabled.  More precisely, it requires the read ahead 
window to reach its maximum size before cache behind is enabled.
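
For illustration, the window-growth logic in the attached patch can be
modeled in user space.  The sketch below is a simplification, assuming
4 KB pages and MAXPHYS = 128 KB (so the maximum window is 31 pages); it
grows the window by a fixed 8 pages per sequential fault, whereas the
real code grows it by the number of pages behind the fault address
(capped at 8) and also clips it to the end of the map entry:

#include <stdio.h>

#define	READ_BEHIND	8	/* VM_FAULT_READ_BEHIND */
#define	AHEAD_MIN	7	/* VM_FAULT_READ_AHEAD_MIN */
#define	AHEAD_INIT	15	/* VM_FAULT_READ_AHEAD_INIT */
#define	AHEAD_MAX	31	/* atop(MAXPHYS) - 1 with 4 KB pages */

/* Per-map-entry state; "read_ahead" is a uint8_t in the real map entry. */
static unsigned int read_ahead = AHEAD_INIT;
static unsigned long next_read;

static void
fault(unsigned long pindex)
{
	unsigned int ahead, era;

	era = read_ahead;
	if (pindex == next_read) {
		/* Sequential fault: grow the window arithmetically. */
		ahead = era + READ_BEHIND;
		if (ahead > AHEAD_MAX)
			ahead = AHEAD_MAX;
		if (era == AHEAD_MAX)
			printf("pindex %lu: cache behind\n", pindex);
	} else {
		/* Non-sequential fault: shrink back to the minimum. */
		ahead = AHEAD_MIN;
	}
	read_ahead = ahead;
	/* The real code updates this after the hard fault completes. */
	next_read = pindex + ahead + 1;
	printf("pindex %lu: read ahead %u pages, next_read %lu\n",
	    pindex, ahead, next_read);
}

int
main(void)
{
	unsigned long pindex;

	/*
	 * Sequential scan: the window grows to the maximum, and cache
	 * behind kicks in on subsequent faults.
	 */
	for (pindex = 0; pindex < 200; pindex = next_read)
		fault(pindex);
	/* A non-sequential fault resets the window to its minimum size. */
	fault(12345);
	return (0);
}
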

For long, sequential accesses, the results of my performance tests are 
just as good as with unconditionally increasing the window size.  I'm also 
seeing fewer pages needlessly cached by the cache behind heuristic.  
That said, there is still room for improvement.  We are still not 
achieving the same sequential performance as "dd", and there are still 
more pages being cached than I would like.
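
(Here "dd" means a plain sequential read of the same file, e.g. something
like "dd if=/mnt/random-1024 of=/dev/null bs=1m", which goes through the
file system's read path rather than taking page faults.)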

Alan



[Attachment: vm_fault_cache98.patch]

Index: vm/vm_map.c
===================================================================
--- vm/vm_map.c	(revision 234106)
+++ vm/vm_map.c	(working copy)
@@ -1300,6 +1300,8 @@ charged:
 	new_entry->protection = prot;
 	new_entry->max_protection = max;
 	new_entry->wired_count = 0;
+	new_entry->read_ahead = VM_FAULT_READ_AHEAD_INIT;
+	new_entry->next_read = OFF_TO_IDX(offset);
 
 	KASSERT(cred == NULL || !ENTRY_CHARGED(new_entry),
 	    ("OVERCOMMIT: vm_map_insert leaks vm_map %p", new_entry));
Index: vm/vm_map.h
===================================================================
--- vm/vm_map.h	(revision 234106)
+++ vm/vm_map.h	(working copy)
@@ -112,8 +112,9 @@ struct vm_map_entry {
 	vm_prot_t protection;		/* protection code */
 	vm_prot_t max_protection;	/* maximum protection */
 	vm_inherit_t inheritance;	/* inheritance */
+	uint8_t read_ahead;		/* pages in the read-ahead window */
 	int wired_count;		/* can be paged if = 0 */
-	vm_pindex_t lastr;		/* last read */
+	vm_pindex_t next_read;		/* index of the next sequential read */
 	struct ucred *cred;		/* tmp storage for creator ref */
 };
 
@@ -330,6 +331,14 @@ long vmspace_wired_count(struct vmspace *vmspace);
 #define	VM_FAULT_DIRTY 2		/* Dirty the page; use w/VM_PROT_COPY */
 
 /*
+ * Initially, mappings are slightly sequential.  The maximum window size must
+ * account for the map entry's "read_ahead" field being defined as a uint8_t.
+ */
+#define	VM_FAULT_READ_AHEAD_MIN		7
+#define	VM_FAULT_READ_AHEAD_INIT	15
+#define	VM_FAULT_READ_AHEAD_MAX		min(atop(MAXPHYS) - 1, UINT8_MAX)
+
+/*
  * The following "find_space" options are supported by vm_map_find()
  */
 #define	VMFS_NO_SPACE		0	/* don't find; use the given range */
Index: vm/vm_fault.c
===================================================================
--- vm/vm_fault.c	(revision 234106)
+++ vm/vm_fault.c	(working copy)
@@ -118,9 +118,11 @@ static int prefault_pageorder[] = {
 static int vm_fault_additional_pages(vm_page_t, int, int, vm_page_t *, int *);
 static void vm_fault_prefault(pmap_t, vm_offset_t, vm_map_entry_t);
 
-#define VM_FAULT_READ_AHEAD 8
-#define VM_FAULT_READ_BEHIND 7
-#define VM_FAULT_READ (VM_FAULT_READ_AHEAD+VM_FAULT_READ_BEHIND+1)
+#define	VM_FAULT_READ_BEHIND	8
+#define	VM_FAULT_READ_MAX	(1 + VM_FAULT_READ_AHEAD_MAX)
+#define	VM_FAULT_NINCR		(VM_FAULT_READ_MAX / VM_FAULT_READ_BEHIND)
+#define	VM_FAULT_SUM		(VM_FAULT_NINCR * (VM_FAULT_NINCR + 1) / 2)
+#define	VM_FAULT_CACHE_BEHIND	(VM_FAULT_READ_BEHIND * VM_FAULT_SUM)
 
 struct faultstate {
 	vm_page_t m;
@@ -136,6 +138,8 @@ struct faultstate {
 	int vfslocked;
 };
 
+static void vm_fault_cache_behind(const struct faultstate *fs, int distance);
+
 static inline void
 release_page(struct faultstate *fs)
 {
@@ -236,13 +240,13 @@ vm_fault_hold(vm_map_t map, vm_offset_t vaddr, vm_
     int fault_flags, vm_page_t *m_hold)
 {
 	vm_prot_t prot;
-	int is_first_object_locked, result;
-	boolean_t growstack, wired;
+	long ahead, behind;
+	int alloc_req, era, faultcount, nera, reqpage, result;
+	boolean_t growstack, is_first_object_locked, wired;
 	int map_generation;
 	vm_object_t next_object;
-	vm_page_t marray[VM_FAULT_READ], mt, mt_prev;
+	vm_page_t marray[VM_FAULT_READ_MAX];
 	int hardfault;
-	int faultcount, ahead, behind, alloc_req;
 	struct faultstate fs;
 	struct vnode *vp;
 	int locked, error;
@@ -252,7 +256,7 @@ vm_fault_hold(vm_map_t map, vm_offset_t vaddr, vm_
 	PCPU_INC(cnt.v_vm_faults);
 	fs.vp = NULL;
 	fs.vfslocked = 0;
-	faultcount = behind = 0;
+	faultcount = reqpage = 0;
 
 RetryFault:;
 
@@ -460,76 +464,48 @@ readrest:
 		 */
 		if (TRYPAGER) {
 			int rv;
-			int reqpage = 0;
 			u_char behavior = vm_map_entry_behavior(fs.entry);
 
 			if (behavior == MAP_ENTRY_BEHAV_RANDOM ||
 			    P_KILLED(curproc)) {
+				behind = 0;
 				ahead = 0;
+			} else if (behavior == MAP_ENTRY_BEHAV_SEQUENTIAL) {
 				behind = 0;
+				ahead = atop(fs.entry->end - vaddr) - 1;
+				if (ahead > VM_FAULT_READ_AHEAD_MAX)
+					ahead = VM_FAULT_READ_AHEAD_MAX;
+				if (fs.pindex == fs.entry->next_read)
+					vm_fault_cache_behind(&fs,
+					    VM_FAULT_READ_MAX);
 			} else {
-				behind = (vaddr - fs.entry->start) >> PAGE_SHIFT;
+				/*
+				 * If this is a sequential page fault, then
+				 * arithmetically increase the number of pages
+				 * in the read-ahead window.  Otherwise, reset
+				 * the read-ahead window to its smallest size.
+				 */
+				behind = atop(vaddr - fs.entry->start);
 				if (behind > VM_FAULT_READ_BEHIND)
 					behind = VM_FAULT_READ_BEHIND;
-
-				ahead = ((fs.entry->end - vaddr) >> PAGE_SHIFT) - 1;
-				if (ahead > VM_FAULT_READ_AHEAD)
-					ahead = VM_FAULT_READ_AHEAD;
+				ahead = atop(fs.entry->end - vaddr) - 1;
+				era = fs.entry->read_ahead;
+				if (fs.pindex == fs.entry->next_read) {
+					nera = era + behind;
+					if (nera > VM_FAULT_READ_AHEAD_MAX)
+						nera = VM_FAULT_READ_AHEAD_MAX;
+					behind = 0;
+					if (ahead > nera)
+						ahead = nera;
+					if (era == VM_FAULT_READ_AHEAD_MAX)
+						vm_fault_cache_behind(&fs,
+						    VM_FAULT_CACHE_BEHIND);
+				} else if (ahead > VM_FAULT_READ_AHEAD_MIN)
+					ahead = VM_FAULT_READ_AHEAD_MIN;
+				if (era != ahead)
+					fs.entry->read_ahead = ahead;
 			}
-			is_first_object_locked = FALSE;
-			if ((behavior == MAP_ENTRY_BEHAV_SEQUENTIAL ||
-			     (behavior != MAP_ENTRY_BEHAV_RANDOM &&
-			      fs.pindex >= fs.entry->lastr &&
-			      fs.pindex < fs.entry->lastr + VM_FAULT_READ)) &&
-			    (fs.first_object == fs.object ||
-			     (is_first_object_locked = VM_OBJECT_TRYLOCK(fs.first_object))) &&
-			    fs.first_object->type != OBJT_DEVICE &&
-			    fs.first_object->type != OBJT_PHYS &&
-			    fs.first_object->type != OBJT_SG) {
-				vm_pindex_t firstpindex;
 
-				if (fs.first_pindex < 2 * VM_FAULT_READ)
-					firstpindex = 0;
-				else
-					firstpindex = fs.first_pindex - 2 * VM_FAULT_READ;
-				mt = fs.first_object != fs.object ?
-				    fs.first_m : fs.m;
-				KASSERT(mt != NULL, ("vm_fault: missing mt"));
-				KASSERT((mt->oflags & VPO_BUSY) != 0,
-				    ("vm_fault: mt %p not busy", mt));
-				mt_prev = vm_page_prev(mt);
-
-				/*
-				 * note: partially valid pages cannot be 
-				 * included in the lookahead - NFS piecemeal
-				 * writes will barf on it badly.
-				 */
-				while ((mt = mt_prev) != NULL &&
-				    mt->pindex >= firstpindex &&
-				    mt->valid == VM_PAGE_BITS_ALL) {
-					mt_prev = vm_page_prev(mt);
-					if (mt->busy ||
-					    (mt->oflags & VPO_BUSY))
-						continue;
-					vm_page_lock(mt);
-					if (mt->hold_count ||
-					    mt->wire_count) {
-						vm_page_unlock(mt);
-						continue;
-					}
-					pmap_remove_all(mt);
-					if (mt->dirty != 0)
-						vm_page_deactivate(mt);
-					else
-						vm_page_cache(mt);
-					vm_page_unlock(mt);
-				}
-				ahead += behind;
-				behind = 0;
-			}
-			if (is_first_object_locked)
-				VM_OBJECT_UNLOCK(fs.first_object);
-
 			/*
 			 * Call the pager to retrieve the data, if any, after
 			 * releasing the lock on the map.  We hold a ref on
@@ -899,7 +875,7 @@ vnode_locked:
 	 * without holding a write lock on it.
 	 */
 	if (hardfault)
-		fs.entry->lastr = fs.pindex + faultcount - behind;
+		fs.entry->next_read = fs.pindex + faultcount - reqpage;
 
 	if ((prot & VM_PROT_WRITE) != 0 ||
 	    (fault_flags & VM_FAULT_DIRTY) != 0) {
@@ -992,6 +968,56 @@ vnode_locked:
 }
 
 /*
+ * Speed up the reclamation of up to "distance" pages that precede the
+ * faulting pindex within the first object of the shadow chain.
+ */
+static void
+vm_fault_cache_behind(const struct faultstate *fs, int distance)
+{
+	vm_page_t m, m_prev;
+	vm_pindex_t pindex;
+	boolean_t is_first_object_locked;
+
+	VM_OBJECT_LOCK_ASSERT(fs->object, MA_OWNED);
+	is_first_object_locked = FALSE;
+	if (fs->first_object != fs->object && !(is_first_object_locked =
+	    VM_OBJECT_TRYLOCK(fs->first_object)))
+		return;
+	if (fs->first_object->type != OBJT_DEVICE &&
+	    fs->first_object->type != OBJT_PHYS &&
+	    fs->first_object->type != OBJT_SG) {
+		if (fs->first_pindex < distance)
+			pindex = 0;
+		else
+			pindex = fs->first_pindex - distance;
+		if (pindex < OFF_TO_IDX(fs->entry->offset))
+			pindex = OFF_TO_IDX(fs->entry->offset);
+		m = fs->first_object != fs->object ? fs->first_m : fs->m;
+		KASSERT(m != NULL, ("vm_fault_cache_behind: page missing"));
+		KASSERT((m->oflags & VPO_BUSY) != 0,
+		    ("vm_fault_cache_behind: page %p is not busy", m));
+		m_prev = vm_page_prev(m);
+		while ((m = m_prev) != NULL && m->pindex >= pindex &&
+		    m->valid == VM_PAGE_BITS_ALL) {
+			m_prev = vm_page_prev(m);
+			if (m->busy != 0 || (m->oflags & VPO_BUSY) != 0)
+				continue;
+			vm_page_lock(m);
+			if (m->hold_count == 0 && m->wire_count == 0) {
+				pmap_remove_all(m);
+				if (m->dirty != 0)
+					vm_page_deactivate(m);
+				else
+					vm_page_cache(m);
+			}
+			vm_page_unlock(m);
+		}
+	}
+	if (is_first_object_locked)
+		VM_OBJECT_UNLOCK(fs->first_object);
+}
+
+/*
  * vm_fault_prefault provides a quick way of clustering
  * pagefaults into a processes address space.  It is a "cousin"
  * of vm_map_pmap_enter, except it runs at page fault time instead



