Message-ID: <4F7DC037.9060803@rice.edu>
Date: Thu, 05 Apr 2012 10:54:31 -0500
From: Alan Cox
To: Konstantin Belousov
Cc: alc@freebsd.org, freebsd-hackers@freebsd.org, Andrey Zonov
References: <4F7B495D.3010402@zonov.org> <20120404071746.GJ2358@deviant.kiev.zoral.com.ua>
In-Reply-To: <20120404071746.GJ2358@deviant.kiev.zoral.com.ua>
Subject: Re: problems with mmap() and disk caching

On 04/04/2012 02:17, Konstantin Belousov wrote:
> On Tue, Apr 03, 2012 at 11:02:53PM +0400, Andrey Zonov wrote:
>> Hi,
>>
>> I open a file, call mmap() on the whole file to get a pointer, and
>> then work through that pointer. I expect that a page only needs to
>> be touched once to bring it into memory (disk cache?), but this
>> doesn't work!
>>
>> I wrote a test (attached) and ran it on a 1G file generated from
>> /dev/random; the result is the following:
>>
>> Prepare file:
>> # swapoff -a
>> # newfs /dev/ada0b
>> # mount /dev/ada0b /mnt
>> # dd if=/dev/random of=/mnt/random-1024 bs=1m count=1024
>>
>> Purge cache:
>> # umount /mnt
>> # mount /dev/ada0b /mnt
>>
>> Run test:
>> $ ./mmap /mnt/random-1024 30
>> mmap:  1 pass took:   7.431046 (none: 262112; res:     32; super:      0; other:      0)
>> mmap:  2 pass took:   7.356670 (none: 261648; res:    496; super:      0; other:      0)
>> mmap:  3 pass took:   7.307094 (none: 260521; res:   1623; super:      0; other:      0)
>> mmap:  4 pass took:   7.350239 (none: 258904; res:   3240; super:      0; other:      0)
>> mmap:  5 pass took:   7.392480 (none: 257286; res:   4858; super:      0; other:      0)
>> mmap:  6 pass took:   7.292069 (none: 255584; res:   6560; super:      0; other:      0)
>> mmap:  7 pass took:   7.048980 (none: 251142; res:  11002; super:      0; other:      0)
>> mmap:  8 pass took:   6.899387 (none: 247584; res:  14560; super:      0; other:      0)
>> mmap:  9 pass took:   7.190579 (none: 242992; res:  19152; super:      0; other:      0)
>> mmap: 10 pass took:   6.915482 (none: 239308; res:  22836; super:      0; other:      0)
>> mmap: 11 pass took:   6.565909 (none: 232835; res:  29309; super:      0; other:      0)
>> mmap: 12 pass took:   6.423945 (none: 226160; res:  35984; super:      0; other:      0)
>> mmap: 13 pass took:   6.315385 (none: 208555; res:  53589; super:      0; other:      0)
>> mmap: 14 pass took:   6.760780 (none: 192805; res:  69339; super:      0; other:      0)
>> mmap: 15 pass took:   5.721513 (none: 174497; res:  87647; super:      0; other:      0)
>> mmap: 16 pass took:   5.004424 (none: 155938; res: 106206; super:      0; other:      0)
>> mmap: 17 pass took:   4.224926 (none: 135639; res: 126505; super:      0; other:      0)
>> mmap: 18 pass took:   3.749608 (none: 117952; res: 144192; super:      0; other:      0)
>> mmap: 19 pass took:   3.398084 (none:  99066; res: 163078; super:      0; other:      0)
>> mmap: 20 pass took:   3.029557 (none:  74994; res: 187150; super:      0; other:      0)
>> mmap: 21 pass took:   2.379430 (none:  55231; res: 206913; super:      0; other:      0)
>> mmap: 22 pass took:   2.046521 (none:  40786; res: 221358; super:      0; other:      0)
>> mmap: 23 pass took:   1.152797 (none:  30311; res: 231833; super:      0; other:      0)
>> mmap: 24 pass took:   0.972617 (none:  16196; res: 245948; super:      0; other:      0)
>> mmap: 25 pass took:   0.577515 (none:   8286; res: 253858; super:      0; other:      0)
>> mmap: 26 pass took:   0.380738 (none:   3712; res: 258432; super:      0; other:      0)
>> mmap: 27 pass took:   0.253583 (none:   1193; res: 260951; super:      0; other:      0)
>> mmap: 28 pass took:   0.157508 (none:      0; res: 262144; super:      0; other:      0)
>> mmap: 29 pass took:   0.156169 (none:      0; res: 262144; super:      0; other:      0)
>> mmap: 30 pass took:   0.156550 (none:      0; res: 262144; super:      0; other:      0)
>>
>> If I run this:
>> $ cat /mnt/random-1024 > /dev/null
>> before the test, then the result is the following:
>>
>> $ ./mmap /mnt/random-1024 5
>> mmap:  1 pass took:   0.337657 (none:      0; res: 262144; super:      0; other:      0)
>> mmap:  2 pass took:   0.186137 (none:      0; res: 262144; super:      0; other:      0)
>> mmap:  3 pass took:   0.186132 (none:      0; res: 262144; super:      0; other:      0)
>> mmap:  4 pass took:   0.186535 (none:      0; res: 262144; super:      0; other:      0)
>> mmap:  5 pass took:   0.190353 (none:      0; res: 262144; super:      0; other:      0)
>>
>> This is what I expect.
>> But why doesn't this work without reading the file
>> manually?
> The issue seems to be in some change of the behaviour of the reserv or
> phys allocator. I Cc:ed Alan.

I'm pretty sure that the behavior here hasn't significantly changed in
about twelve years. Otherwise, I agree with your analysis.

On more than one occasion, I've been tempted to change:

	pmap_remove_all(mt);
	if (mt->dirty != 0)
		vm_page_deactivate(mt);
	else
		vm_page_cache(mt);

to:

	vm_page_dontneed(mt);

because I suspect that the current code does more harm than good. In
theory, it saves activations of the page daemon. However, more often
than not, I suspect that we are spending more on page reactivations than
we are saving on page daemon activations. The sequential access
detection heuristic is just too easily triggered. For example, I've
seen it triggered by demand paging of the gcc text segment. Also, I
think that pmap_remove_all() and especially vm_page_cache() are too
severe for a detection heuristic that is so easily triggered.

> What happens is that the fault handler deactivates or caches the pages
> preceding the one which would satisfy the fault. See the if()
> statement starting at line 463 of vm/vm_fault.c. Since all pages
> of the object in your test are clean, the pages are cached.
>
> The next fault would need to allocate some more pages for a different
> index of the same object. What I see is that vm_reserv_alloc_page()
> returns a page that is from the cache for the same object, but with a
> different pindex. As an obvious result, the page is invalidated and
> repurposed. When the next loop starts, the page is not resident
> anymore, so it has to be re-read from disk.
>
> The behaviour of the allocator is not consistent, so some pages are not
> reused, allowing the test to converge and to collect all pages of the
> object eventually.
>
> Calling madvise(MADV_RANDOM) fixes the issue, because the code to
> deactivate/cache the pages is turned off.
> On the other hand, it also turns off read-ahead for faulting, and the
> first loop becomes eternally long.
>
> Doing MADV_WILLNEED does not fix the problem indeed, since willneed
> reactivates the pages of the object at the time of the call. To use
> MADV_WILLNEED, you would need to call it between faults/memcpy.
>
>> I've also never seen super pages, how to make them work?
> They just work, at least for me. Look at the output of procstat -v
> after enough loops have finished to not cause disk activity.
>
>> I've been playing with madvise and posix_fadvise but no luck. BTW,
>> posix_fadvise(POSIX_FADV_WILLNEED) does nothing as the commentary says,
>> shouldn't this be documented in the manual page?
>>
>> All tests were run under 9.0-STABLE (r233744).
>>
>> --
>> Andrey Zonov
>> /*_
>>  * Andrey Zonov (c) 2011
>>  */
>>
>> #include <sys/types.h>
>> #include <sys/mman.h>
>> #include <sys/stat.h>
>> #include <sys/time.h>
>> #include <err.h>
>> #include <fcntl.h>
>> #include <stdlib.h>
>> #include <string.h>
>> #include <unistd.h>
>>
>> int
>> main(int argc, char **argv)
>> {
>> 	int i;
>> 	int fd;
>> 	int num;
>> 	int block;
>> 	int pagesize;
>> 	size_t n;
>> 	size_t size;
>> 	size_t none, incore, super, other;
>> 	char *p;
>> 	char *tmp;
>> 	char *vec;
>> 	char *vecp;
>> 	struct stat sb;
>> 	struct timeval tp, tp1, tp2;
>>
>> 	if (argc < 2 || argc > 4)
>> 		errx(1, "usage: mmap <file> [num] [block]");
>>
>> 	fd = open(argv[1], O_RDONLY);
>> 	if (fd == -1)
>> 		err(1, "open()");
>>
>> 	num = 1;
>> 	if (argc >= 3)
>> 		num = atoi(argv[2]);
>>
>> 	pagesize = getpagesize();
>> 	block = pagesize;
>> 	if (argc == 4)
>> 		block = atoi(argv[3]);
>>
>> 	if (fstat(fd, &sb) == -1)
>> 		err(1, "fstat()");
>> 	size = sb.st_size;
>>
>> #if 0
>> 	if (posix_fadvise(fd, (off_t)0, (off_t)0, POSIX_FADV_WILLNEED) == -1)
>> 		err(1, "posix_fadvise()");
>> #endif
>>
>> 	p = mmap(NULL, sb.st_size, PROT_READ, /*MAP_PREFAULT_READ |*/ MAP_PRIVATE, fd, (off_t)0);
>> 	if (p == MAP_FAILED)
>> 		err(1, "mmap()");
>>
>> #if 0
>> 	if (madvise(p, (size_t)size, MADV_WILLNEED) == -1)
>> 		err(1, "madvise()");
>> #endif
>>
>> 	tmp = calloc(1, block);
>> 	if (tmp == NULL)
>> 		err(1, "calloc()");
>> 	vec = calloc(1, size / pagesize);
>> 	if (vec == NULL)
>> 		err(1, "calloc()");
>> 	for (i = 0; i < num; i++) {
>> 		/* Time one full sequential copy pass over the mapping. */
>> 		gettimeofday(&tp1, NULL);
>> 		for (n = 0; n < size / block; n++)
>> 			memcpy(tmp, p + (n * block), block);
>> 		gettimeofday(&tp2, NULL);
>> 		timersub(&tp2, &tp1, &tp);
>>
>> 		/* Ask the kernel which pages are resident after the pass. */
>> 		if (mincore(p, size, vec) == -1)
>> 			err(1, "mincore()");
>>
>> 		none = incore = super = other = 0;
>> 		for (vecp = vec; (size_t)(vecp - vec) < size / pagesize; vecp++) {
>> 			if (*vecp == 0)
>> 				none++;
>> 			else if (*vecp & MINCORE_INCORE)
>> 				incore++;
>> 			else if (*vecp & MINCORE_SUPER)
>> 				super++;
>> 			else
>> 				other++;
>> 		}
>> 		warnx("%2d pass took: %3ld.%06ld (none: %6zu; res: %6zu; super: %6zu; other: %6zu)",
>> 		    i + 1, (long)tp.tv_sec, (long)tp.tv_usec, none, incore, super, other);
>> 	}
>> 	free(vec);
>> 	free(tmp);
>>
>> 	if (munmap(p, sb.st_size) == -1)
>> 		err(1, "munmap()");
>>
>> 	close(fd);
>>
>> 	exit(0);
>> }
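
For reference, the madvise(MADV_RANDOM) workaround mentioned in the thread could look roughly like the sketch below. This is only an illustration under assumptions: the helper name touch_pages_random() is hypothetical and not part of the attached test, and it touches one byte per page instead of using memcpy(). Per the explanation above, the hint should turn off the deactivate/cache-behind path in the fault handler, at the cost of losing read-ahead on the first pass.

```c
/* Assumption: glibc only exposes madvise() under strict -std= modes when
 * this is defined; it should be harmless on other systems. */
#define _DEFAULT_SOURCE

#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the original test program): map `path`
 * read-only, advise the kernel that access will be random, then touch
 * one byte in every page.  Returns the number of pages touched, or -1
 * on error (including an empty file, which cannot be mapped).
 */
long
touch_pages_random(const char *path)
{
	struct stat sb;
	volatile unsigned char sink = 0;
	unsigned char *p;
	long pagesize, pages = 0;
	off_t off;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd == -1)
		return (-1);
	if (fstat(fd, &sb) == -1 || sb.st_size == 0) {
		close(fd);
		return (-1);
	}
	p = mmap(NULL, (size_t)sb.st_size, PROT_READ, MAP_PRIVATE, fd, (off_t)0);
	close(fd);	/* the mapping stays valid after close() */
	if (p == MAP_FAILED)
		return (-1);

	/* The hint is advisory: keep going even if the kernel rejects it. */
	if (madvise(p, (size_t)sb.st_size, MADV_RANDOM) == -1)
		perror("madvise");

	pagesize = sysconf(_SC_PAGESIZE);
	for (off = 0; off < sb.st_size; off += pagesize) {
		sink ^= p[off];		/* fault the page in */
		pages++;
	}
	munmap(p, (size_t)sb.st_size);
	return (pages);
}
```

In the attached test, the equivalent change would be a single madvise(p, size, MADV_RANDOM) call right after the mmap() call; whether the slower first pass is an acceptable trade-off depends on the workload.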