From owner-svn-src-head@freebsd.org Fri Oct 2 23:50:45 2015
Subject: Re: svn commit: r288431 - in head/sys: kern sys vm
From: Alan Cox
In-Reply-To: <4276391.z2UvhhORjP@ralph.baldwin.cx>
Date: Fri, 2 Oct 2015 18:50:36 -0500
Cc: Mark Johnston, src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
References: <201509302306.t8UN6UwX043736@repo.freebsd.org> <1837187.vUDrWYExQX@ralph.baldwin.cx> <20151002045842.GA18421@raichu> <4276391.z2UvhhORjP@ralph.baldwin.cx>
To: John Baldwin
List-Id: SVN commit messages for the src tree for head/-current

On Oct 2, 2015, at 10:59 AM, John Baldwin wrote:

> On Thursday, October 01, 2015 09:58:43 PM Mark Johnston wrote:
>> On Thu, Oct 01, 2015 at 09:32:45AM -0700, John Baldwin wrote:
>>> On Wednesday, September 30, 2015 11:06:30 PM Mark Johnston wrote:
>>>> Author: markj
>>>> Date: Wed Sep 30 23:06:29 2015
>>>> New Revision: 288431
>>>> URL: https://svnweb.freebsd.org/changeset/base/288431
>>>>
>>>> Log:
>>>> As a step towards the elimination of PG_CACHED pages, rework the handling
>>>> of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to
>>>> the head of the inactive queue instead of being cached.
>>>>
>>>> This affects the implementation of POSIX_FADV_NOREUSE as well, since it
>>>> works by applying POSIX_FADV_DONTNEED to file ranges after they have been
>>>> read or written. At that point the corresponding buffers may still be
>>>> dirty, so the previous implementation would coalesce successive ranges and
>>>> apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the
>>>> dirty buffers would eventually be cached. To preserve this behaviour in an
>>>> efficient manner, this change adds a new buf flag, B_NOREUSE, which causes
>>>> the pages backing a VMIO buf to be placed at the head of the inactive queue
>>>> when the buf is released. POSIX_FADV_NOREUSE then works by setting this
>>>> flag in bufs that underlie the specified range.
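
For readers looking at this from the application side, here is a minimal
userland sketch of how a program might issue the hint that the log above
describes. The helper name read_once_noreuse() is hypothetical and is not
part of the commit; the only real interfaces used are open(2), read(2) and
posix_fadvise(2). The hint is treated as purely advisory, so a failure to
apply it is ignored.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the commit): read a file once, front to
 * back, after hinting that the data will not be reused.
 * Returns the number of bytes read, or -1 on error.
 */
long
read_once_noreuse(const char *path)
{
	char buf[64 * 1024];
	long total = 0;
	ssize_t n;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return (-1);

	/*
	 * Offset 0 and length 0 cover the whole file.  posix_fadvise()
	 * returns an error number instead of setting errno; because this
	 * is only a hint, the sketch ignores the result.
	 */
	(void)posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);

	while ((n = read(fd, buf, sizeof(buf))) > 0)
		total += n;

	close(fd);
	return (n < 0 ? -1 : total);
}
```

What the kernel does with the hint (cache, deactivate, or free the backing
pages) is exactly the policy question being debated in this thread; the
sketch only shows where the hint enters.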
>>>
>>> Putting these pages back on the inactive queue completely defeats the primary
>>> purpose of DONTNEED and NOREUSE. The primary purpose is to move the pages out
>>> of the VM object's tree of pages and into the free pool so that the application
>>> can instruct the VM to free memory more efficiently than relying on page daemon.
>>>
>>> The implementation used cache pages instead of free as a cheap optimization so
>>> that if an application did something dumb where it used DONTNEED and then turned
>>> around and read the file it would not have to go to disk if the pages had not
>>> yet been reused. In practice this didn't work out so well because PG_CACHE pages
>>> don't really work well.
>>>
>>> However, using PG_CACHE was secondary to the primary purpose of explicitly freeing
>>> memory that an application knew wasn't going to be reused and avoiding the need
>>> for pagedaemon to run at all. I think this should be freeing the pages instead of
>>> keeping them inactive. If an application uses DONTNEED or NOREUSE and then turns
>>> around and rereads the file, it generally deserves to have to go to disk for it.
>>
>> A problem with this is that one application's DONTNEED or NOREUSE hint
>> would cause every application reading or writing that file to go to
>> disk, but posix_fadvise(2) is explicitly intended for applications that
>> wish to provide hints about their own access patterns. I realize that
>> it's typically used with application-private files, but that's not a
>> requirement of the interface. Deactivating (or caching) the backing
>> pages generally avoids this problem.
>
> I think it is not unreasonable to expect that fadvise() incurs system-wide
> effects. A properly implemented WILLNEED that does read-ahead cannot work
> without incurring system-wide effects.
> I had always assumed that fadvise()
> operated on a file, not a given process' view of a file (unlike, say,
> madvise which only operates on mappings and only indirectly affects
> file-backed data).
>

Can you elaborate on what you mean by "I had always assumed that
fadvise() operated on a file, ..."?

Under the previous implementation, if you did an fadvise(DONTNEED) on a
file, then in order to cache the file's pages, those pages first had to
be unmapped from any address space. (You can find this unmapping
performed by vm_page_try_to_cache().) In other words, there was never
any code that said, "Is this a mapped page, and if it is, don't cache it
because we're actually performing an fadvise()." So, to pick an extreme
example, if you did an fadvise("libc.so", DONTNEED), unless some process
had libc.so wired, then every single mapping to every single page of
libc.so was going to be destroyed and the pages moved to the cache.
However, because we moved the pages to the cache (rather than freeing
them), and libc.so is frequently accessed, a subsequent instruction
fetch would have faulted and been able to reactivate the cached page,
avoiding an I/O operation. Put differently, the fact that we cached the
pages targeted by fadvise(), rather than simply freeing them, mattered
in cases where the pages were in use or accessed by multiple processes.

>>> I'm pretty sure I had mentioned this to Alan before. I believe that the idea is
>>> that pagedaemon should be cheap enough that having it run anyway shouldn't be an
>>> issue, but I'm a bit skeptical of that. :) Lock contention is always possible and
>>> having DONTNEED/NOREUSE move pages to PG_CACHE avoided lock contention with
>>> pagedaemon during application page faults (since pagedaemon potentially never has
>>> to run).
>>
>> That's true, but the page queue locking (and the pagedaemon's
>> manipulation of the page queue locks) has also become more fine-grained
>> since posix_fadvise(2) was added. In particular, from some reading of
>> sys/vm in stable/8, inactive queue scans used to be performed with the
>> global page queue lock held; it was only dropped to launder dirty pages.
>> Now, the page queue lock is split into separate locks for the active and
>> inactive page queues, and the pagedaemon drops the inactive queue lock
>> for each page in all but a few exceptional cases. Does the optimization
>> of freeing or caching DONTNEED pages buy us all that much now?
>>
>> Some synthetic testing in which an application writes out many large
>> (2G) files and calls posix_fadvise(FADV_DONTNEED) after each one shows
>> no significant difference in runtime if the buffer pages are deactivated
>> vs. freed. (My test just modifies vfs_vmio_unwire() to treat B_NOREUSE
>> identically to B_DIRECT.) Unsurprisingly, I see very little lock
>> contention in the latter case, but in the former, most of the lock
>> contention is short (i.e. the mutex is acquired while spinning), and
>> a large majority of the contention is on the free page queue mutex. If
>> lock contention there is a concern, wouldn't it be better to try and
>> address that directly rather than by bypassing the pagedaemon?
>
> The lock contention was related to one process faulting in a new page due to
> a malloc() while pagedaemon ran. Also, it wasn't a steady type of contention
> that would show up in an average. Instead, it was the outliers (which in the
> case on 8.x were on the order of 2 seconds) that were problematic. I used a
> hack to log "long" wait times for specific processes to both debug this and
> evaluate the solution. I have a test program laying around from when I last
> tested this.
> I'll see what I can reproduce (before it required a machine
> with at least 24GB of RAM to reproduce).
>
> The only foolproof way to reduce contention to zero is to eliminate one of
> the contending threads. :) I do think there are situations where an
> application may be more informed about the optimal memory pattern for its
> workload than what the VM system can infer from heuristics. Currently there
> is no other way to flush a file's contents from RAM. If we had things like
> DONTNEED_I_MEAN_IT and DONTNEED_IM_NOT_SURE perhaps we could have a sliding
> scale, but at the moment the policy isn't that fine-grained.
>
> --
> John Baldwin
>
>
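
For anyone who wants to try the kind of synthetic test Mark describes in
the quoted text, the userland half might look roughly like the sketch
below. It is not Mark's actual test program, which is not shown in this
thread. The function name write_and_drop() and the 64 KB write size are
assumptions made for illustration, and the kernel-side change to
vfs_vmio_unwire() is not represented at all.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write nbytes of filler data to path, then hint
 * that the file's pages are no longer needed.  Returns 0 on success,
 * -1 if the file could not be written, or the error number returned by
 * posix_fadvise().
 */
int
write_and_drop(const char *path, size_t nbytes)
{
	char buf[64 * 1024];
	size_t left = nbytes;
	int error, fd;

	memset(buf, 'x', sizeof(buf));

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return (-1);

	while (left > 0) {
		size_t chunk = left < sizeof(buf) ? left : sizeof(buf);
		ssize_t n = write(fd, buf, chunk);

		if (n <= 0) {
			close(fd);
			return (-1);
		}
		left -= (size_t)n;
	}

	/* Offset 0 and length 0 apply the hint to the whole file. */
	error = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
	close(fd);
	return (error);
}
```

Calling this in a loop over many large files and timing the run under each
kernel policy would approximate the comparison described above. Note that
the sketch does not fsync() before the hint, so some buffers may still be
dirty when posix_fadvise() is called, which is the situation the commit
log discusses.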