Date:      Thu, 05 Apr 2012 10:54:31 -0500
From:      Alan Cox <alc@rice.edu>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        alc@freebsd.org, freebsd-hackers@freebsd.org, Andrey Zonov <andrey@zonov.org>
Subject:   Re: problems with mmap() and disk caching
Message-ID:  <4F7DC037.9060803@rice.edu>
In-Reply-To: <20120404071746.GJ2358@deviant.kiev.zoral.com.ua>
References:  <4F7B495D.3010402@zonov.org> <20120404071746.GJ2358@deviant.kiev.zoral.com.ua>

On 04/04/2012 02:17, Konstantin Belousov wrote:
> On Tue, Apr 03, 2012 at 11:02:53PM +0400, Andrey Zonov wrote:
>> Hi,
>>
>> I open the file, then call mmap() on the whole file and get a pointer,
>> then I work with this pointer.  I expect that a page only needs to be
>> touched once to get it into memory (disk cache?), but this doesn't work!
>>
>> I wrote a test (attached) and ran it on a 1G file generated from
>> /dev/random; the result is the following:
>>
>> Prepare file:
>> # swapoff -a
>> # newfs /dev/ada0b
>> # mount /dev/ada0b /mnt
>> # dd if=/dev/random of=/mnt/random-1024 bs=1m count=1024
>>
>> Purge cache:
>> # umount /mnt
>> # mount /dev/ada0b /mnt
>>
>> Run test:
>> $ ./mmap /mnt/random-1024 30
>> mmap:  1 pass took:   7.431046 (none: 262112; res:     32; super:      0; other:      0)
>> mmap:  2 pass took:   7.356670 (none: 261648; res:    496; super:      0; other:      0)
>> mmap:  3 pass took:   7.307094 (none: 260521; res:   1623; super:      0; other:      0)
>> mmap:  4 pass took:   7.350239 (none: 258904; res:   3240; super:      0; other:      0)
>> mmap:  5 pass took:   7.392480 (none: 257286; res:   4858; super:      0; other:      0)
>> mmap:  6 pass took:   7.292069 (none: 255584; res:   6560; super:      0; other:      0)
>> mmap:  7 pass took:   7.048980 (none: 251142; res:  11002; super:      0; other:      0)
>> mmap:  8 pass took:   6.899387 (none: 247584; res:  14560; super:      0; other:      0)
>> mmap:  9 pass took:   7.190579 (none: 242992; res:  19152; super:      0; other:      0)
>> mmap: 10 pass took:   6.915482 (none: 239308; res:  22836; super:      0; other:      0)
>> mmap: 11 pass took:   6.565909 (none: 232835; res:  29309; super:      0; other:      0)
>> mmap: 12 pass took:   6.423945 (none: 226160; res:  35984; super:      0; other:      0)
>> mmap: 13 pass took:   6.315385 (none: 208555; res:  53589; super:      0; other:      0)
>> mmap: 14 pass took:   6.760780 (none: 192805; res:  69339; super:      0; other:      0)
>> mmap: 15 pass took:   5.721513 (none: 174497; res:  87647; super:      0; other:      0)
>> mmap: 16 pass took:   5.004424 (none: 155938; res: 106206; super:      0; other:      0)
>> mmap: 17 pass took:   4.224926 (none: 135639; res: 126505; super:      0; other:      0)
>> mmap: 18 pass took:   3.749608 (none: 117952; res: 144192; super:      0; other:      0)
>> mmap: 19 pass took:   3.398084 (none:  99066; res: 163078; super:      0; other:      0)
>> mmap: 20 pass took:   3.029557 (none:  74994; res: 187150; super:      0; other:      0)
>> mmap: 21 pass took:   2.379430 (none:  55231; res: 206913; super:      0; other:      0)
>> mmap: 22 pass took:   2.046521 (none:  40786; res: 221358; super:      0; other:      0)
>> mmap: 23 pass took:   1.152797 (none:  30311; res: 231833; super:      0; other:      0)
>> mmap: 24 pass took:   0.972617 (none:  16196; res: 245948; super:      0; other:      0)
>> mmap: 25 pass took:   0.577515 (none:   8286; res: 253858; super:      0; other:      0)
>> mmap: 26 pass took:   0.380738 (none:   3712; res: 258432; super:      0; other:      0)
>> mmap: 27 pass took:   0.253583 (none:   1193; res: 260951; super:      0; other:      0)
>> mmap: 28 pass took:   0.157508 (none:      0; res: 262144; super:      0; other:      0)
>> mmap: 29 pass took:   0.156169 (none:      0; res: 262144; super:      0; other:      0)
>> mmap: 30 pass took:   0.156550 (none:      0; res: 262144; super:      0; other:      0)
>>
>> If I run this:
>> $ cat /mnt/random-1024 > /dev/null
>> before the test, then the result is the following:
>>
>> $ ./mmap /mnt/random-1024 5
>> mmap:  1 pass took:   0.337657 (none:      0; res: 262144; super:      0; other:      0)
>> mmap:  2 pass took:   0.186137 (none:      0; res: 262144; super:      0; other:      0)
>> mmap:  3 pass took:   0.186132 (none:      0; res: 262144; super:      0; other:      0)
>> mmap:  4 pass took:   0.186535 (none:      0; res: 262144; super:      0; other:      0)
>> mmap:  5 pass took:   0.190353 (none:      0; res: 262144; super:      0; other:      0)
>>
>> This is what I expect.  But why doesn't this work without reading the
>> file manually?
> Issue seems to be in some change of the behaviour of the reserv or
> phys allocator. I Cc:ed Alan.

I'm pretty sure that the behavior here hasn't significantly changed in 
about twelve years.  Otherwise, I agree with your analysis.

On more than one occasion, I've been tempted to change:

                                         pmap_remove_all(mt);
                                         if (mt->dirty != 0)
                                                 vm_page_deactivate(mt);
                                         else
                                                 vm_page_cache(mt);

to:

                                         vm_page_dontneed(mt);

because I suspect that the current code does more harm than good.  In 
theory, it saves activations of the page daemon.  However, more often 
than not, I suspect that we are spending more on page reactivations than 
we are saving on page daemon activations.  The sequential access 
detection heuristic is just too easily triggered.  For example, I've 
seen it triggered by demand paging of the gcc text segment.  Also, I 
think that pmap_remove_all() and especially vm_page_cache() are too 
severe for a detection heuristic that is so easily triggered.

> What happens is that the fault handler deactivates or caches the pages
> previous to the one which would satisfy the fault. See the if()
> statement starting at line 463 of vm/vm_fault.c. Since all pages
> of the object in your test are clean, the pages are cached.
>
> Next fault would need to allocate some more pages for different index
> of the same object. What I see is that vm_reserv_alloc_page() returns a
> page that is from the cache for the same object, but different pindex.
> As an obvious result, the page is invalidated and repurposed. When next
> loop started, the page is not resident anymore, so it has to be re-read
> from disk.
>
> The behaviour of the allocator is not consistent, so some pages are not
> reused, allowing the test to converge and to collect all pages of the
> object eventually.
>
> Calling madvise(MADV_RANDOM) fixes the issue, because the code to
> deactivate/cache the pages is turned off. On the other hand, it also
> turns off read-ahead for faulting, and the first loop becomes eternally
> long.
>
> Doing MADV_WILLNEED does not fix the problem indeed, since willneed
> reactivates the pages of the object at the time of call. To use
> MADV_WILLNEED, you would need to call it between faults/memcpy.
>
>> I've also never seen super pages, how to make them work?
> They just work, at least for me. Look at the output of procstat -v
> after enough loops finished to not cause disk activity.
>
>> I've been playing with madvise and posix_fadvise but no luck.  BTW,
>> posix_fadvise(POSIX_FADV_WILLNEED) does nothing as the commentary says,
>> shouldn't this be documented in the manual page?
>>
>> All tests were run under 9.0-STABLE (r233744).
>>
>> -- 
>> Andrey Zonov
>> /*_
>>   * Andrey Zonov (c) 2011
>>   */
>>
>> #include <sys/mman.h>
>> #include <sys/types.h>
>> #include <sys/time.h>
>> #include <sys/stat.h>
>> #include <err.h>
>> #include <fcntl.h>
>> #include <stdlib.h>
>> #include <string.h>
>> #include <unistd.h>
>>
>> int
>> main(int argc, char **argv)
>> {
>> 	int i;
>> 	int fd;
>> 	int num;
>> 	int block;
>> 	int pagesize;
>> 	size_t n;
>> 	size_t size;
>> 	size_t none, incore, super, other;
>> 	char *p;
>> 	char *tmp;
>> 	char *vec;
>> 	char *vecp;
>> 	struct stat sb;
>> 	struct timeval tp, tp1, tp2;
>>
>> 	if (argc < 2 || argc > 4)
>> 		errx(1, "usage: mmap <filename> [num] [block]");
>>
>> 	fd = open(argv[1], O_RDONLY);
>> 	if (fd == -1)
>> 		err(1, "open()");
>>
>> 	num = 1;
>> 	if (argc >= 3)
>> 		num = atoi(argv[2]);
>>
>> 	pagesize = getpagesize();
>> 	block = pagesize;
>> 	if (argc == 4)
>> 		block = atoi(argv[3]);
>>
>> 	if (fstat(fd, &sb) == -1)
>> 		err(1, "fstat()");
>> 	size = sb.st_size;
>>
>> #if 0
>> 	if (posix_fadvise(fd, (off_t)0, (off_t)0, POSIX_FADV_WILLNEED) == -1)
>> 		err(1, "posix_fadvise()");
>> #endif
>>
>> 	p = mmap(NULL, sb.st_size, PROT_READ, /*MAP_PREFAULT_READ |*/ MAP_PRIVATE, fd, (off_t)0);
>> 	if (p == MAP_FAILED)
>> 		err(1, "mmap()");
>>
>> #if 0
>> 	if (madvise(p, (size_t)size, MADV_WILLNEED) == -1)
>> 		err(1, "madvise()");
>> #endif
>>
>> 	tmp = calloc(1, block);
>> 	if (tmp == NULL)
>> 		err(1, "calloc()");
>> 	vec = calloc(1, size / pagesize);
>> 	if (vec == NULL)
>> 		err(1, "calloc()");
>> 	for (i = 0; i < num; i++) {
>> 		gettimeofday(&tp1, NULL);
>> 		for (n = 0; n < size / block; n++)
>> 			memcpy(tmp, p + (n * block), block);
>> 		gettimeofday(&tp2, NULL);
>> 		timersub(&tp2, &tp1, &tp);
>>
>> 		if (mincore(p, size, vec) == -1)
>> 			err(1, "mincore()");
>>
>> 		none = incore = super = other = 0;
>> 		for (vecp = vec; (size_t)(vecp - vec) < size / pagesize; vecp++) {
>> 			if (*vecp == 0)
>> 				none++;
>> 			else if (*vecp & MINCORE_INCORE)
>> 				incore++;
>> 			else if (*vecp & MINCORE_SUPER)
>> 				super++;
>> 			else
>> 				other++;
>> 		}
>> 		warnx("%2d pass took: %3ld.%06ld (none: %6zu; res: %6zu; super: %6zu; other: %6zu)",
>> 		   i + 1, (long)tp.tv_sec, (long)tp.tv_usec, none, incore, super, other);
>> 	}
>> 	free(vec);
>> 	free(tmp);
>>
>> 	if (munmap(p, sb.st_size) == -1)
>> 		err(1, "munmap()");
>>
>> 	close(fd);
>>
>> 	exit(0);
>> }
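
A note on the page classification in the test program above, offered as
a sketch rather than anything stated in the thread: the loop tests
MINCORE_INCORE before MINCORE_SUPER.  Assuming mincore() reports a page
inside a superpage mapping with both bits set (an assumption about the
kernel, not confirmed here), such pages would always be counted as "res"
and never as "super".  The hypothetical helper classify_pages() below
tests the superpage bit first.  The fallback flag values are there only
so the sketch compiles where <sys/mman.h> does not define them.

```c
#include <stddef.h>
#include <sys/mman.h>

/*
 * Fallback values (assumptions, for illustration only) so that the
 * sketch still compiles on systems whose <sys/mman.h> lacks these flags.
 */
#ifndef MINCORE_INCORE
#define MINCORE_INCORE 0x1
#endif
#ifndef MINCORE_SUPER
#define MINCORE_SUPER 0x20
#endif

/* Hypothetical result type; not part of the original test program. */
struct page_counts {
	size_t none;	/* no flag of interest set */
	size_t incore;	/* resident, not part of a superpage mapping */
	size_t super;	/* superpage bit set */
};

/*
 * Classify the per-page status bytes filled in by mincore().  The
 * superpage bit is tested first so that a page carrying both flags is
 * counted as "super" rather than as plain "incore".
 */
void
classify_pages(const char *vec, size_t npages, struct page_counts *pc)
{
	size_t i;

	pc->none = pc->incore = pc->super = 0;
	for (i = 0; i < npages; i++) {
		if (vec[i] & MINCORE_SUPER)
			pc->super++;
		else if (vec[i] & MINCORE_INCORE)
			pc->incore++;
		else
			pc->none++;
	}
}
```

In the test program this would take the place of the vecp loop, called
as classify_pages(vec, size / pagesize, &pc) after mincore() returns.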
>> _______________________________________________
>> freebsd-hackers@freebsd.org mailing list
>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
>> To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@freebsd.org"
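
A minimal sketch, not taken from the thread, of the madvise(MADV_RANDOM)
workaround that Konstantin describes in the quoted text.  The helper name
map_random() is hypothetical.  As he notes, this advice also turns off
read-ahead on faults, so the first pass over a cold file is expected to
be very slow.

```c
#define _DEFAULT_SOURCE	/* glibc only: expose madvise() in strict modes */
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Map a whole file read-only and mark the mapping MADV_RANDOM.
 * Returns the mapping and stores its length in *lenp, or returns
 * MAP_FAILED on any error (including an empty file).
 */
void *
map_random(const char *path, size_t *lenp)
{
	struct stat sb;
	void *p;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd == -1)
		return (MAP_FAILED);
	if (fstat(fd, &sb) == -1 || sb.st_size == 0) {
		close(fd);
		return (MAP_FAILED);
	}

	p = mmap(NULL, (size_t)sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd);	/* the mapping remains valid after close() */
	if (p == MAP_FAILED)
		return (MAP_FAILED);

	/* Ask the VM system not to assume sequential access. */
	if (madvise(p, (size_t)sb.st_size, MADV_RANDOM) == -1) {
		munmap(p, (size_t)sb.st_size);
		return (MAP_FAILED);
	}

	*lenp = (size_t)sb.st_size;
	return (p);
}
```

A caller would walk the returned mapping the same way the test program
does and munmap() it when finished.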



