Date:      Wed, 4 Apr 2012 10:17:46 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Andrey Zonov <andrey@zonov.org>
Cc:        alc@freebsd.org, freebsd-hackers@freebsd.org
Subject:   Re: problems with mmap() and disk caching
Message-ID:  <20120404071746.GJ2358@deviant.kiev.zoral.com.ua>
In-Reply-To: <4F7B495D.3010402@zonov.org>
References:  <4F7B495D.3010402@zonov.org>

On Tue, Apr 03, 2012 at 11:02:53PM +0400, Andrey Zonov wrote:
> Hi,
>
> I open the file, then call mmap() on the whole file and get a pointer,
> then I work with this pointer.  I expect that a page should need to be
> touched only once to get it into memory (disk cache?), but this doesn't work!
>
> I wrote a test (attached) and ran it on a 1G file generated from
> /dev/random; the result is the following:
>
> Prepare file:
> # swapoff -a
> # newfs /dev/ada0b
> # mount /dev/ada0b /mnt
> # dd if=/dev/random of=/mnt/random-1024 bs=1m count=1024
>
> Purge cache:
> # umount /mnt
> # mount /dev/ada0b /mnt
>
> Run test:
> $ ./mmap /mnt/random-1024 30
> mmap:  1 pass took:   7.431046 (none: 262112; res:     32; super:      0; other:      0)
> mmap:  2 pass took:   7.356670 (none: 261648; res:    496; super:      0; other:      0)
> mmap:  3 pass took:   7.307094 (none: 260521; res:   1623; super:      0; other:      0)
> mmap:  4 pass took:   7.350239 (none: 258904; res:   3240; super:      0; other:      0)
> mmap:  5 pass took:   7.392480 (none: 257286; res:   4858; super:      0; other:      0)
> mmap:  6 pass took:   7.292069 (none: 255584; res:   6560; super:      0; other:      0)
> mmap:  7 pass took:   7.048980 (none: 251142; res:  11002; super:      0; other:      0)
> mmap:  8 pass took:   6.899387 (none: 247584; res:  14560; super:      0; other:      0)
> mmap:  9 pass took:   7.190579 (none: 242992; res:  19152; super:      0; other:      0)
> mmap: 10 pass took:   6.915482 (none: 239308; res:  22836; super:      0; other:      0)
> mmap: 11 pass took:   6.565909 (none: 232835; res:  29309; super:      0; other:      0)
> mmap: 12 pass took:   6.423945 (none: 226160; res:  35984; super:      0; other:      0)
> mmap: 13 pass took:   6.315385 (none: 208555; res:  53589; super:      0; other:      0)
> mmap: 14 pass took:   6.760780 (none: 192805; res:  69339; super:      0; other:      0)
> mmap: 15 pass took:   5.721513 (none: 174497; res:  87647; super:      0; other:      0)
> mmap: 16 pass took:   5.004424 (none: 155938; res: 106206; super:      0; other:      0)
> mmap: 17 pass took:   4.224926 (none: 135639; res: 126505; super:      0; other:      0)
> mmap: 18 pass took:   3.749608 (none: 117952; res: 144192; super:      0; other:      0)
> mmap: 19 pass took:   3.398084 (none:  99066; res: 163078; super:      0; other:      0)
> mmap: 20 pass took:   3.029557 (none:  74994; res: 187150; super:      0; other:      0)
> mmap: 21 pass took:   2.379430 (none:  55231; res: 206913; super:      0; other:      0)
> mmap: 22 pass took:   2.046521 (none:  40786; res: 221358; super:      0; other:      0)
> mmap: 23 pass took:   1.152797 (none:  30311; res: 231833; super:      0; other:      0)
> mmap: 24 pass took:   0.972617 (none:  16196; res: 245948; super:      0; other:      0)
> mmap: 25 pass took:   0.577515 (none:   8286; res: 253858; super:      0; other:      0)
> mmap: 26 pass took:   0.380738 (none:   3712; res: 258432; super:      0; other:      0)
> mmap: 27 pass took:   0.253583 (none:   1193; res: 260951; super:      0; other:      0)
> mmap: 28 pass took:   0.157508 (none:      0; res: 262144; super:      0; other:      0)
> mmap: 29 pass took:   0.156169 (none:      0; res: 262144; super:      0; other:      0)
> mmap: 30 pass took:   0.156550 (none:      0; res: 262144; super:      0; other:      0)
>
> If I run this:
> $ cat /mnt/random-1024 > /dev/null
> before the test, then the result is the following:
>
> $ ./mmap /mnt/random-1024 5
> mmap:  1 pass took:   0.337657 (none:      0; res: 262144; super:      0; other:      0)
> mmap:  2 pass took:   0.186137 (none:      0; res: 262144; super:      0; other:      0)
> mmap:  3 pass took:   0.186132 (none:      0; res: 262144; super:      0; other:      0)
> mmap:  4 pass took:   0.186535 (none:      0; res: 262144; super:      0; other:      0)
> mmap:  5 pass took:   0.190353 (none:      0; res: 262144; super:      0; other:      0)
>
> This is what I expect.  But why doesn't this work without reading the file
> manually?
The issue seems to be some change in the behaviour of the reservation
(vm_reserv) or physical page allocator. I Cc:ed Alan.

What happens is that the fault handler deactivates or caches the pages
preceding the one which would satisfy the fault. See the if()
statement starting at line 463 of vm/vm_fault.c. Since all pages
of the object in your test are clean, the pages are cached.

The next fault needs to allocate some more pages for a different index
of the same object. What I see is that vm_reserv_alloc_page() returns a
page that is in the cache for the same object, but at a different pindex.
As an obvious result, the page is invalidated and repurposed. When the next
loop starts, the page is not resident anymore, so it has to be re-read
from disk.

The behaviour of the allocator is not consistent, so some pages are not
reused, allowing the test to converge and to collect all pages of the
object eventually.

Calling madvise(MADV_RANDOM) fixes the issue, because the code that
deactivates/caches the pages is turned off. On the other hand, it also
turns off read-ahead for faulting, and the first loop becomes extremely
slow.

MADV_WILLNEED indeed does not fix the problem, since willneed
reactivates the pages of the object only at the time of the call. To use
MADV_WILLNEED, you would need to call it between faults/memcpy.

>
> I've also never seen super pages, how to make them work?
They just work, at least for me. Look at the output of procstat -v
after enough loops have finished that the test no longer causes disk
activity.
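
A guess at why the test itself never counts them: it checks
MINCORE_INCORE before MINCORE_SUPER, so if mincore() sets both bits for
a superpage mapping (an assumption on my side, not verified against your
run), every such page lands in the "res" column. A minimal sketch of the
reordered check; classify_page() and the enum are hypothetical names,
and the fallback flag values are assumed from FreeBSD 9's <sys/mman.h>:

```c
#include <sys/types.h>
#include <sys/mman.h>

/*
 * Fallbacks so the sketch also builds where these FreeBSD flags are
 * not defined. The values are an assumption (FreeBSD 9 <sys/mman.h>).
 */
#ifndef MINCORE_INCORE
#define	MINCORE_INCORE	0x1
#endif
#ifndef MINCORE_SUPER
#define	MINCORE_SUPER	0x20
#endif

enum page_class { PAGE_NONE, PAGE_RESIDENT, PAGE_SUPER, PAGE_OTHER };

/*
 * Classify one byte of the mincore() vector. MINCORE_SUPER is tested
 * first because a superpage entry may carry MINCORE_INCORE as well.
 */
enum page_class
classify_page(unsigned char v)
{
	if (v == 0)
		return (PAGE_NONE);
	if (v & MINCORE_SUPER)
		return (PAGE_SUPER);
	if (v & MINCORE_INCORE)
		return (PAGE_RESIDENT);
	return (PAGE_OTHER);
}
```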

>
> I've been playing with madvise and posix_fadvise but no luck.  BTW,
> posix_fadvise(POSIX_FADV_WILLNEED) does nothing as the commentary says,
> shouldn't this be documented in the manual page?
>
> All tests were run under 9.0-STABLE (r233744).
>
> --
> Andrey Zonov

> /*_
>  * Andrey Zonov (c) 2011
>  */
>
> #include <sys/mman.h>
> #include <sys/types.h>
> #include <sys/time.h>
> #include <sys/stat.h>
> #include <err.h>
> #include <fcntl.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
>
> int
> main(int argc, char **argv)
> {
> 	int i;
> 	int fd;
> 	int num;
> 	int block;
> 	int pagesize;
> 	size_t n;
> 	size_t size;
> 	size_t none, incore, super, other;
> 	char *p;
> 	char *tmp;
> 	char *vec;
> 	char *vecp;
> 	struct stat sb;
> 	struct timeval tp, tp1, tp2;
>
> 	if (argc < 2 || argc > 4)
> 		errx(1, "usage: mmap <filename> [num] [block]");
>
> 	fd = open(argv[1], O_RDONLY);
> 	if (fd == -1)
> 		err(1, "open()");
>
> 	num = 1;
> 	if (argc >= 3)
> 		num = atoi(argv[2]);
>
> 	pagesize = getpagesize();
> 	block = pagesize;
> 	if (argc == 4)
> 		block = atoi(argv[3]);
>
> 	if (fstat(fd, &sb) == -1)
> 		err(1, "fstat()");
> 	size = sb.st_size;
>
> #if 0
> 	if (posix_fadvise(fd, (off_t)0, (off_t)0, POSIX_FADV_WILLNEED) == -1)
> 		err(1, "posix_fadvise()");
> #endif
>
> 	p = mmap(NULL, sb.st_size, PROT_READ, /*MAP_PREFAULT_READ |*/ MAP_PRIVATE, fd, (off_t)0);
> 	if (p == MAP_FAILED)
> 		err(1, "mmap()");
>
> #if 0
> 	if (madvise(p, (size_t)size, MADV_WILLNEED) == -1)
> 		err(1, "madvise()");
> #endif
>
> 	tmp = calloc(1, block);
> 	if (tmp == NULL)
> 		err(1, "calloc()");
> 	vec = calloc(1, size / pagesize);
> 	if (vec == NULL)
> 		err(1, "calloc()");
> 	for (i = 0; i < num; i++) {
> 		gettimeofday(&tp1, NULL);
> 		for (n = 0; n < size / block; n++)
> 			memcpy(tmp, p + (n * block), block);
> 		gettimeofday(&tp2, NULL);
> 		timersub(&tp2, &tp1, &tp);
>
> 		if (mincore(p, size, vec) == -1)
> 			err(1, "mincore()");
>
> 		none = incore = super = other = 0;
> 		for (vecp = vec; (size_t)(vecp - vec) < size / pagesize; vecp++) {
> 			if (*vecp == 0)
> 				none++;
> 			else if (*vecp & MINCORE_INCORE)
> 				incore++;
> 			else if (*vecp & MINCORE_SUPER)
> 				super++;
> 			else
> 				other++;
> 		}
> 		warnx("%2d pass took: %3ld.%06ld (none: %6ld; res: %6ld; super: %6ld; other: %6ld)",
> 		   i + 1, tp.tv_sec, tp.tv_usec, none, incore, super, other);
> 	}
> 	free(vec);
> 	free(tmp);
>
> 	if (munmap(p, sb.st_size) == -1)
> 		err(1, "munmap()");
>
> 	close(fd);
>
> 	exit(0);
> }




