Date: Fri, 2 Oct 2015 18:50:36 -0500
From: Alan Cox <alc@rice.edu>
To: John Baldwin <jhb@freebsd.org>
Cc: Mark Johnston <markj@freebsd.org>, src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r288431 - in head/sys: kern sys vm
Message-ID: <F3EF914A-8296-4833-BCF8-B9D878CAB80C@rice.edu>
In-Reply-To: <4276391.z2UvhhORjP@ralph.baldwin.cx>
References: <201509302306.t8UN6UwX043736@repo.freebsd.org> <1837187.vUDrWYExQX@ralph.baldwin.cx> <20151002045842.GA18421@raichu> <4276391.z2UvhhORjP@ralph.baldwin.cx>

On Oct 2, 2015, at 10:59 AM, John Baldwin <jhb@freebsd.org> wrote:

> On Thursday, October 01, 2015 09:58:43 PM Mark Johnston wrote:
>> On Thu, Oct 01, 2015 at 09:32:45AM -0700, John Baldwin wrote:
>>> On Wednesday, September 30, 2015 11:06:30 PM Mark Johnston wrote:
>>>> Author: markj
>>>> Date: Wed Sep 30 23:06:29 2015
>>>> New Revision: 288431
>>>> URL: https://svnweb.freebsd.org/changeset/base/288431
>>>>
>>>> Log:
>>>> As a step towards the elimination of PG_CACHED pages, rework the handling
>>>> of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to
>>>> the head of the inactive queue instead of being cached.
>>>>
>>>> This affects the implementation of POSIX_FADV_NOREUSE as well, since it
>>>> works by applying POSIX_FADV_DONTNEED to file ranges after they have been
>>>> read or written. At that point the corresponding buffers may still be
>>>> dirty, so the previous implementation would coalesce successive ranges and
>>>> apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the
>>>> dirty buffers would eventually be cached. To preserve this behaviour in an
>>>> efficient manner, this change adds a new buf flag, B_NOREUSE, which causes
>>>> the pages backing a VMIO buf to be placed at the head of the inactive queue
>>>> when the buf is released. POSIX_FADV_NOREUSE then works by setting this
>>>> flag in bufs that underlie the specified range.
>>>
>>> Putting these pages back on the inactive queue completely defeats the primary
>>> purpose of DONTNEED and NOREUSE. The primary purpose is to move the pages out
>>> of the VM object's tree of pages and into the free pool so that the application
>>> can instruct the VM to free memory more efficiently than relying on page daemon.
>>>
>>> The implementation used cache pages instead of free as a cheap optimization so
>>> that if an application did something dumb where it used DONTNEED and then turned
>>> around and read the file it would not have to go to disk if the pages had not
>>> yet been reused. In practice this didn't work out so well because PG_CACHE pages
>>> don't really work well.
>>>
>>> However, using PG_CACHE was secondary to the primary purpose of explicitly freeing
>>> memory that an application knew wasn't going to be reused and avoiding the need
>>> for pagedaemon to run at all. I think this should be freeing the pages instead of
>>> keeping them inactive. If an application uses DONTNEED or NOREUSE and then turns
>>> around and rereads the file, it generally deserves to have to go to disk for it.
>>
>> A problem with this is that one application's DONTNEED or NOREUSE hint
>> would cause every application reading or writing that file to go to
>> disk, but posix_fadvise(2) is explicitly intended for applications that
>> wish to provide hints about their own access patterns. I realize that
>> it's typically used with application-private files, but that's not a
>> requirement of the interface. Deactivating (or caching) the backing
>> pages generally avoids this problem.
>
> I think it is not unreasonable to expect that fadvise() incurs system-wide
> effects. A properly implemented WILLNEED that does read-ahead cannot work
> without incurring system-wide effects. I had always assumed that fadvise()
> operated on a file, not a given process' view of a file (unlike, say,
> madvise which only operates on mappings and only indirectly affects
> file-backed data).
>

Can you elaborate on what you mean by “I had always assumed that
fadvise() operated on a file, …”?

Under the previous implementation, if you did an fadvise(DONTNEED) on a
file, in order to cache the file’s pages, those pages first had to be
unmapped from any address space. (You can find this unmapping performed
by vm_page_try_to_cache().) In other words, there was never any code
that said, “Is this a mapped page, and if it is, don’t cache it because
we’re actually performing an fadvise().” So, to pick an extreme example,
if you did an fadvise(“libc.so”, DONTNEED), unless some process had
libc.so wired, then every single mapping to every single page of libc.so
was going to be destroyed and the pages moved to the cache. However,
because we moved the pages to the cache (rather than freeing them), and
libc.so is frequently accessed, a subsequent instruction fetch would
have faulted and been able to reactivate the cached page, avoiding an
I/O operation. In other words, that we were caching the pages targeted
by fadvise() rather than simply freeing them mattered in cases where the
pages were in use/accessed by multiple processes.

>>> I'm pretty sure I had mentioned this to Alan before. I believe that the idea is
>>> that pagedaemon should be cheap enough that having it run anyway shouldn't be an
>>> issue, but I'm a bit skeptical of that. :) Lock contention is always possible and
>>> having DONTNEED/NOREUSE move pages to PG_CACHE avoided lock contention with
>>> pagedaemon during application page faults (since pagedaemon potentially never has
>>> to run).
>>
>> That's true, but the page queue locking (and the pagedaemon's
>> manipulation of the page queue locks) has also become more fine-grained
>> since posix_fadvise(2) was added. In particular, from some reading of
>> sys/vm in stable/8, inactive queue scans used to be performed with the
>> global page queue lock held; it was only dropped to launder dirty pages.
>> Now, the page queue lock is split into separate locks for the active and
>> inactive page queues, and the pagedaemon drops the inactive queue lock
>> for each page in all but a few exceptional cases. Does the optimization
>> of freeing or caching DONTNEED pages buy us all that much now?
>>
>> Some synthetic testing in which an application writes out many large
>> (2G) files and calls posix_fadvise(FADV_DONTNEED) after each one shows
>> no significant difference in runtime if the buffer pages are deactivated
>> vs. freed. (My test just modifies vfs_vmio_unwire() to treat B_NOREUSE
>> identically to B_DIRECT.) Unsurprisingly, I see very little lock
>> contention in the latter case, but in the former, most of the lock
>> contention is short (i.e. the mutex is acquired while spinning), and
>> a large majority of the contention is on the free page queue mutex. If
>> lock contention there is a concern, wouldn't it be better to try and
>> address that directly rather than by bypassing the pagedaemon?
>
> The lock contention was related to one process faulting in a new page due to
> a malloc() while pagedaemon ran. Also, it wasn't a steady type of contention
> that would show up in an average. Instead, it was the outliers (which in the
> case on 8.x were on the order of 2 seconds) that were problematic. I used a
> hack to log "long" wait times for specific processes to both debug this and
> evaluate the solution. I have a test program laying around from when I last
> tested this. I'll see what I can reproduce (before it required a machine
> with at least 24GB of RAM to reproduce).
>
> The only foolproof way to reduce contention to zero is to eliminate one of
> the contending threads. :) I do think there are situations where an
> application may be more informed about the optimal memory pattern for its
> workload than what the VM system can infer from heuristics.
> Currently there is no other way to flush a file's contents from RAM. If we
> had things like DONTNEED_I_MEAN_IT and DONTNEED_IM_NOT_SURE perhaps we could
> have a sliding scale, but at the moment the policy isn't that fine-grained.
>
> --
> John Baldwin
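
For anyone who wants to try the kind of synthetic test Mark describes above (write out a file, then call posix_fadvise() with POSIX_FADV_DONTNEED on it), here is a minimal userland sketch in C. It is not Mark's test program. The function name write_then_dontneed(), the 64 KB buffer, and the fsync() before the advice call are assumptions made for illustration; only open(2), write(2), fsync(2), close(2) and posix_fadvise(2) are real interfaces.

```c
#define _POSIX_C_SOURCE 200112L

#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `total` bytes of filler to `path`, then
 * advise the kernel that the file's cached contents will not be needed
 * again.  Returns 0 on success or an errno-style error number.
 */
int
write_then_dontneed(const char *path, size_t total)
{
	char buf[64 * 1024];
	size_t left = total;
	int error, fd;

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd == -1)
		return (errno);

	memset(buf, 'x', sizeof(buf));
	while (left > 0) {
		size_t chunk = left < sizeof(buf) ? left : sizeof(buf);
		ssize_t n = write(fd, buf, chunk);

		if (n == -1) {
			if (errno == EINTR)
				continue;
			error = errno;
			close(fd);
			return (error);
		}
		left -= (size_t)n;
	}

	/*
	 * Assumption of this sketch: flush first so that the advice is
	 * applied to clean data rather than to still-dirty buffers.
	 */
	if (fsync(fd) == -1) {
		error = errno;
		close(fd);
		return (error);
	}

	/*
	 * posix_fadvise() returns the error number directly instead of
	 * setting errno.  A length of 0 means "through end of file".
	 */
	error = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
	close(fd);
	return (error);
}
```

Calling write_then_dontneed() in a loop over several large files, and timing the loop, gives a rough analogue of the runtime comparison discussed above. Whether the pages end up freed, cached, or deactivated depends on the kernel, which is the point under debate in this thread.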