From: Konstantin Belousov
Date: Mon, 9 Apr 2012 12:18:39 +0300
To: Andrey Zonov
Cc: alc@freebsd.org, freebsd-hackers@freebsd.org, Alan Cox
Subject: Re: problems with mmap() and disk caching
Message-ID: <20120409091839.GH2358@deviant.kiev.zoral.com.ua>
In-Reply-To: <4F828D15.8080604@zonov.org>

On Mon, Apr 09, 2012 at 11:17:41AM +0400, Andrey Zonov wrote:
> On 06.04.2012 12:13, Konstantin Belousov wrote:
> >On Thu, Apr 05, 2012 at 11:54:53PM +0400, Andrey Zonov wrote:
> >>On 05.04.2012 23:41, Konstantin Belousov wrote:
> >>>On Thu, Apr 05, 2012 at 11:33:46PM +0400, Andrey Zonov wrote:
> >>>>On 05.04.2012 19:54, Alan Cox wrote:
> >>>>>On 04/04/2012 02:17, Konstantin Belousov wrote:
> >>>>>>On Tue, Apr 03, 2012 at 11:02:53PM +0400, Andrey Zonov wrote:
> >>>>[snip]
> >>>>>>>This is what I expect. But why this doesn't work without reading file
> >>>>>>>manually?
> >>>>>>Issue seems to be in some change of the behaviour of the reserv or
> >>>>>>phys allocator. I Cc:ed Alan.
> >>>>>
> >>>>>I'm pretty sure that the behavior here hasn't significantly changed in
> >>>>>about twelve years. Otherwise, I agree with your analysis.
> >>>>>
> >>>>>On more than one occasion, I've been tempted to change:
> >>>>>
> >>>>>pmap_remove_all(mt);
> >>>>>if (mt->dirty != 0)
> >>>>>        vm_page_deactivate(mt);
> >>>>>else
> >>>>>        vm_page_cache(mt);
> >>>>>
> >>>>>to:
> >>>>>
> >>>>>vm_page_dontneed(mt);
> >>>>>
> >>>>
> >>>>Thanks Alan! Now it works as I expect!
> >>>>
> >>>>But I have more questions to you and kib@. They are in my test below.
> >>>>
> >>>>So, prepare file as earlier, and take information about memory usage
> >>>>from top(1).
> >>>>After preparation, but before test:
> >>>>Mem: 80M Active, 55M Inact, 721M Wired, 215M Buf, 46G Free
> >>>>
> >>>>First run:
> >>>>$ ./mmap /mnt/random
> >>>>mmap: 1 pass took: 7.462865 (none: 0; res: 262144; super: 0; other: 0)
> >>>>
> >>>>No super pages after first run, why?..
> >>>>
> >>>>Mem: 79M Active, 1079M Inact, 722M Wired, 216M Buf, 45G Free
> >>>>
> >>>>Now the file is in inactive memory, that's good.
> >>>>
> >>>>Second run:
> >>>>$ ./mmap /mnt/random
> >>>>mmap: 1 pass took: 0.004191 (none: 0; res: 262144; super: 511; other: 0)
> >>>>
> >>>>All super pages are here, nice.
> >>>>
> >>>>Mem: 1103M Active, 55M Inact, 722M Wired, 216M Buf, 45G Free
> >>>>
> >>>>Wow, all inactive pages moved to active and sit there even after process
> >>>>was terminated, that's not good, what do you think?
> >>>Why do you think this is 'not good' ? You have plenty of free memory,
> >>>there is no memory pressure, and all pages were referenced recently.
> >>>There is no reason for them to be deactivated.
> >>>
> >>
> >>I always thought that active memory is the sum of resident memory of
> >>all processes, inactive shows disk cache and wired shows kernel itself.
> >So you are wrong. Both active and inactive memory can be mapped and
> >not mapped, both can belong to vnode or to anonymous objects etc.
> >Active/inactive distinction is only the amount of references that was
> >noted by pagedaemon, or some other page history like the way it was
> >unwired.
> >
> >Wired does not necessarily mean kernel-used pages, user processes can
> >wire their pages as well.
>
> Let's talk about that in detail.
>
> My understanding is the following:
>
> Active memory: the memory which is referenced by application.

Assuming the part 'by application' is removed, this sentence is almost right.
Any managed mapping of the page participates in the active references.

> An application may get memory only through mmap() (allocators don't use
> brk()/sbrk() any more).
> The resident memory of an application is the
> sum of physical used memory. So, sum of RSS is active memory.

First, brk/sbrk is still used. Second, there is no requirement that
resident pages are referenced. E.g. a page could have participated in a
buffer, and unwiring on buffer dissolve put it into the inactive state.
Or the pagedaemon cleared the reference and moved the page to the inactive
queue. Or the page was prefaulted by one of several optimizations.

Moreover, there is a subtle difference between 'resident' and 'not causing
a fault on access'. A page may be resident, but the pte was not
preinstalled, or the pte was flushed, etc.

>
> Inactive memory: the memory which has no references. Once we call
> read() on the file, the file is in inactive memory, because we have no
> references to this object, we just read it. This is also released
> memory by free().

On buffer dissolve, the buffer cache explicitly puts the pages constituting
the buffer into the inactive queue. In fact, that is not quite right: e.g.
if the same pages are mapped and actively referenced, then the pagedaemon
now has slightly more work to do to move the pages from inactive to active.

And free(3) operates at so much higher a level than the vm subsystem that
describing the interaction between the two is impossible in any definitive
way. Old naive mallocs put the block description at the beginning of the
block, actually causing free() to reference at least the first page of the
block. Jemalloc often does madvise(MADV_FREE) for large freed allocations.
MADV_FREE moves pages between queues probabilistically.

>
> Cache memory: I don't know what it is. It's always small enough to not
> think about it.

This was the bug you reported, and which Alan fixed on Sunday.

>
> Wired memory: kernel memory and yes, application may get wired memory
> through mlock()/mlockall(), but I haven't seen any real application
> which calls mlock().

ntpd, amd from the base system.
gpg and similar programs try to mlock their key store to avoid leaking
sensitive material to swap. cdrecord(8) tried to mlock itself to avoid
indefinite stalls during write.

>
> >>
> >>>>
> >>>>Read the file:
> >>>>$ cat /mnt/random > /dev/null
> >>>>
> >>>>Mem: 79M Active, 55M Inact, 1746M Wired, 1240M Buf, 45G Free
> >>>>
> >>>>Now the file is in wired memory. I do not understand why so.
> >>>You do use UFS, right ?
> >>
> >>Yes.
> >>
> >>>There are enough buffer headers and buffer KVA
> >>>to have buffers allocated for the whole file content. Since buffers wire
> >>>corresponding pages, you get pages migrated to wired.
> >>>
> >>>When there appears a buffer pressure (i.e., any other i/o started),
> >>>the buffers will be repurposed and pages moved to inactive.
> >>>
> >>
> >>OK, how can I get amount of disk cache?
> >You cannot. At least I am not aware of any counter that keeps track
> >of the resident pages belonging to vnode pager.
> >
> >Buffers should not be thought of as disk cache, pages cache disk content.
> >Instead, VMIO buffers only provide bread()/bwrite() compatible interface
> >to the page cache (*) for filesystems.
> >(*) - The term cache is used in the generic sense, not to be confused
> >with the cached pages counter from top etc.
> >
>
> Yes, I know that. I try once again to ask my question about buffers.
> Is it reasonable to use 10% of the physical memory for them, or may we
> set a rational upper limit automatically?
>
> >>
> >>>>
> >>>>Could you please give me explanation about active/inactive/wired memory?
> >>>>
> >>>>
> >>>>>because I suspect that the current code does more harm than good. In
> >>>>>theory, it saves activations of the page daemon. However, more often
> >>>>>than not, I suspect that we are spending more on page reactivations than
> >>>>>we are saving on page daemon activations. The sequential access
> >>>>>detection heuristic is just too easily triggered.
> >>>>>For example, I've seen
> >>>>>it triggered by demand paging of the gcc text segment. Also, I think
> >>>>>that pmap_remove_all() and especially vm_page_cache() are too severe for
> >>>>>a detection heuristic that is so easily triggered.
> >>>>>
> >>>>[snip]
> >>>>
> >>>>--
> >>>>Andrey Zonov
> >>
> >>--
> >>Andrey Zonov
>
> --
> Andrey Zonov
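
The top(1) snapshots quoted in this thread can be cross-checked with a little arithmetic. The sketch below is not part of the original exchange. It parses the quoted "Mem:" lines and compares how much memory moved between queues at each step with the page counts printed by the ./mmap test. The 4 KiB base page and 2 MiB superpage sizes are assumptions (typical for amd64, not stated in the thread), the reading of "super: 511" as a count of superpage mappings is also an assumption, and the helper names parse_mem and delta_mib are hypothetical.

```python
import re

# Assumed sizes (typical for amd64); the thread does not state them.
PAGE_SIZE = 4096                 # bytes per base page
SUPERPAGE_SIZE = 2 * 1024**2     # bytes per superpage

UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_mem(line):
    """Parse a top(1) 'Mem:' line into {queue name: bytes}."""
    fields = re.findall(r"(\d+)([KMG]) (\w+)", line)
    return {name: int(num) * UNITS[unit] for num, unit, name in fields}


def delta_mib(before, after, queue):
    """Change in one queue between two snapshots, in MiB."""
    return (after[queue] - before[queue]) // 1024**2


# Snapshots copied verbatim from the thread.
prepared = parse_mem("Mem: 80M Active, 55M Inact, 721M Wired, 215M Buf, 46G Free")
first_run = parse_mem("Mem: 79M Active, 1079M Inact, 722M Wired, 216M Buf, 45G Free")
second_run = parse_mem("Mem: 1103M Active, 55M Inact, 722M Wired, 216M Buf, 45G Free")
after_cat = parse_mem("Mem: 79M Active, 55M Inact, 1746M Wired, 1240M Buf, 45G Free")

# Counts reported by the ./mmap test in the thread.
file_mib = 262144 * PAGE_SIZE // 1024**2       # 'res: 262144' base pages
super_mib = 511 * SUPERPAGE_SIZE // 1024**2    # 'super: 511', assumed to be mappings

print("file size from res count:", file_mib, "MiB")
print("covered by superpages:   ", super_mib, "MiB")
print("first run, Inact grew by:", delta_mib(prepared, first_run, "Inact"), "MiB")
print("second run, Active grew: ", delta_mib(first_run, second_run, "Active"), "MiB")
print("after cat, Wired grew by:", delta_mib(second_run, after_cat, "Wired"), "MiB")
print("after cat, Buf grew by:  ", delta_mib(second_run, after_cat, "Buf"), "MiB")
```

Under these assumptions every step moves the same 1024 MiB: into Inact after the first run, into Active after the second, and into Wired (and Buf) after cat. That is consistent with the explanation that the same file pages change queues rather than being copied. top(1) rounds its figures to whole units, so agreement this exact should not be expected in general.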