From owner-svn-src-head@freebsd.org  Fri Oct  2 23:50:45 2015
Content-Type: text/plain; charset=windows-1252
Mime-Version: 1.0 (Mac OS X Mail 7.3 \(1878.6\))
Subject: Re: svn commit: r288431 - in head/sys: kern sys vm
From: Alan Cox <alc@rice.edu>
In-Reply-To: <4276391.z2UvhhORjP@ralph.baldwin.cx>
Date: Fri, 2 Oct 2015 18:50:36 -0500
Cc: Mark Johnston <markj@freebsd.org>, src-committers@freebsd.org,
 svn-src-all@freebsd.org, svn-src-head@freebsd.org
Content-Transfer-Encoding: quoted-printable
Message-Id: <F3EF914A-8296-4833-BCF8-B9D878CAB80C@rice.edu>
References: <201509302306.t8UN6UwX043736@repo.freebsd.org>
 <1837187.vUDrWYExQX@ralph.baldwin.cx> <20151002045842.GA18421@raichu>
 <4276391.z2UvhhORjP@ralph.baldwin.cx>
To: John Baldwin <jhb@freebsd.org>
List-Id: SVN commit messages for the src tree for head/-current
 <svn-src-head.freebsd.org>


On Oct 2, 2015, at 10:59 AM, John Baldwin <jhb@freebsd.org> wrote:

> On Thursday, October 01, 2015 09:58:43 PM Mark Johnston wrote:
>> On Thu, Oct 01, 2015 at 09:32:45AM -0700, John Baldwin wrote:
>>> On Wednesday, September 30, 2015 11:06:30 PM Mark Johnston wrote:
>>>> Author: markj
>>>> Date: Wed Sep 30 23:06:29 2015
>>>> New Revision: 288431
>>>> URL: https://svnweb.freebsd.org/changeset/base/288431
>>>>
>>>> Log:
>>>>  As a step towards the elimination of PG_CACHED pages, rework the handling
>>>>  of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to
>>>>  the head of the inactive queue instead of being cached.
>>>>
>>>>  This affects the implementation of POSIX_FADV_NOREUSE as well, since it
>>>>  works by applying POSIX_FADV_DONTNEED to file ranges after they have been
>>>>  read or written.  At that point the corresponding buffers may still be
>>>>  dirty, so the previous implementation would coalesce successive ranges and
>>>>  apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the
>>>>  dirty buffers would eventually be cached.  To preserve this behaviour in an
>>>>  efficient manner, this change adds a new buf flag, B_NOREUSE, which causes
>>>>  the pages backing a VMIO buf to be placed at the head of the inactive queue
>>>>  when the buf is released.  POSIX_FADV_NOREUSE then works by setting this
>>>>  flag in bufs that underlie the specified range.
>>>
>>> Putting these pages back on the inactive queue completely defeats the primary
>>> purpose of DONTNEED and NOREUSE.  The primary purpose is to move the pages out
>>> of the VM object's tree of pages and into the free pool so that the application
>>> can instruct the VM to free memory more efficiently than relying on page daemon.
>>>
>>> The implementation used cache pages instead of free as a cheap optimization so
>>> that if an application did something dumb where it used DONTNEED and then turned
>>> around and read the file it would not have to go to disk if the pages had not
>>> yet been reused.  In practice this didn't work out so well because PG_CACHE pages
>>> don't really work well.
>>>
>>> However, using PG_CACHE was secondary to the primary purpose of explicitly freeing
>>> memory that an application knew wasn't going to be reused and avoiding the need
>>> for pagedaemon to run at all.  I think this should be freeing the pages instead of
>>> keeping them inactive.  If an application uses DONTNEED or NOREUSE and then turns
>>> around and rereads the file, it generally deserves to have to go to disk for it.
>>
>> A problem with this is that one application's DONTNEED or NOREUSE hint
>> would cause every application reading or writing that file to go to
>> disk, but posix_fadvise(2) is explicitly intended for applications that
>> wish to provide hints about their own access patterns. I realize that
>> it's typically used with application-private files, but that's not a
>> requirement of the interface. Deactivating (or caching) the backing
>> pages generally avoids this problem.
>
> I think it is not unreasonable to expect that fadvise() incurs system-wide
> effects.  A properly implemented WILLNEED that does read-ahead cannot work
> without incurring system-wide effects.  I had always assumed that fadvise()
> operated on a file, not a given process' view of a file (unlike, say,
> madvise which only operates on mappings and only indirectly affects
> file-backed data).
>


Can you elaborate on what you mean by “I had always assumed that
fadvise() operated on a file, …”?

Under the previous implementation, if you did an fadvise(DONTNEED) on a
file, in order to cache the file’s pages, those pages first had to be
unmapped from any address space.  (You can find this unmapping performed
by vm_page_try_to_cache().)  In other words, there was never any code
that said, “Is this a mapped page, and if it is, don’t cache it because
we’re actually performing an fadvise().”  So, to pick an extreme
example, if you did an fadvise("libc.so", DONTNEED), unless some process
had libc.so wired, then every single mapping to every single page of
libc.so was going to be destroyed and the pages moved to the cache.
However, because we moved the pages to the cache (rather than freeing
them), and libc.so is frequently accessed, a subsequent instruction
fetch would have faulted and been able to reactivate the cached page,
avoiding an I/O operation.  In short, the fact that we were caching the
pages targeted by fadvise(), rather than simply freeing them, mattered
in cases where the pages were in use/accessed by multiple processes.
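
For concreteness, the userland side of that example looks roughly like
the following.  This is only a minimal sketch: drop_cached_range() is a
hypothetical helper name, not anything in the tree, and the fsync()
before the hint is an added assumption (so that no dirty data is left
behind the range).  Note that the call names only a descriptor and a
byte range; nothing in it refers to the caller's own mappings.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* expose posix_fadvise() in strict modes */
#endif

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: tell the kernel that bytes [offset, offset + len)
 * of the file behind fd will not be needed again.  Per POSIX, len == 0
 * means "from offset to the end of the file".
 *
 * Returns 0 on success or an errno value on failure.
 */
int
drop_cached_range(int fd, off_t offset, off_t len)
{
	/* Negative ranges are treated as a caller bug in this sketch. */
	assert(offset >= 0 && len >= 0);

	/* Assumption: flush dirty data first so the pages are clean. */
	if (fsync(fd) == -1)
		return (errno);

	/* posix_fadvise() returns the error number itself; errno is not set. */
	return (posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED));
}
```

Calling drop_cached_range(fd, 0, 0) on a descriptor for a shared file is
the libc.so scenario above in miniature: the hint is issued by one
process, but it is applied to pages that other processes may be using.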


>>> I'm pretty sure I had mentioned this to Alan before.  I believe that the idea is
>>> that pagedaemon should be cheap enough that having it run anyway shouldn't be an
>>> issue, but I'm a bit skeptical of that. :)  Lock contention is always possible and
>>> having DONTNEED/NOREUSE move pages to PG_CACHE avoided lock contention with
>>> pagedaemon during application page faults (since pagedaemon potentially never has
>>> to run).
>>
>> That's true, but the page queue locking (and the pagedaemon's
>> manipulation of the page queue locks) has also become more fine-grained
>> since posix_fadvise(2) was added. In particular, from some reading of
>> sys/vm in stable/8, inactive queue scans used to be performed with the
>> global page queue lock held; it was only dropped to launder dirty pages.
>> Now, the page queue lock is split into separate locks for the active and
>> inactive page queues, and the pagedaemon drops the inactive queue lock
>> for each page in all but a few exceptional cases. Does the optimization
>> of freeing or caching DONTNEED pages buy us all that much now?
>>
>> Some synthetic testing in which an application writes out many large
>> (2G) files and calls posix_fadvise(FADV_DONTNEED) after each one shows
>> no significant difference in runtime if the buffer pages are deactivated
>> vs. freed. (My test just modifies vfs_vmio_unwire() to treat B_NOREUSE
>> identically to B_DIRECT.) Unsurprisingly, I see very little lock
>> contention in the latter case, but in the former, most of the lock
>> contention is short (i.e. the mutex is acquired while spinning), and
>> a large majority of the contention is on the free page queue mutex. If
>> lock contention there is a concern, wouldn't it be better to try and
>> address that directly rather than by bypassing the pagedaemon?
>
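
One iteration of a test of the kind described above might look like the
sketch below.  It is not Mark's actual test program, which was not
posted; write_then_dontneed(), CHUNK_SIZE, and the filler byte are all
hypothetical choices made for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* expose posix_fadvise() in strict modes */
#endif

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK_SIZE	(1024 * 1024)	/* 1 MiB per write(); arbitrary */

/*
 * Hypothetical stand-in for one iteration of the test: write total_bytes
 * of filler to path, then hint that the whole file will not be needed.
 * Returns 0 on success or an errno value on failure.
 */
int
write_then_dontneed(const char *path, size_t total_bytes)
{
	static char chunk[CHUNK_SIZE];
	size_t remaining = total_bytes;
	int error = 0;
	int fd;

	assert(path != NULL);

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd == -1)
		return (errno);

	memset(chunk, 0xa5, sizeof(chunk));
	while (remaining > 0) {
		size_t want = remaining < sizeof(chunk) ? remaining : sizeof(chunk);
		ssize_t n = write(fd, chunk, want);

		if (n == -1) {
			if (errno == EINTR)
				continue;	/* interrupted; retry the write */
			error = errno;
			break;
		}
		remaining -= (size_t)n;
	}

	/* Offset 0 with length 0 covers the file from start to end. */
	if (error == 0)
		error = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

	if (close(fd) == -1 && error == 0)
		error = errno;
	return (error);
}
```

A driver would call this in a loop over many file names with a size of
about 2G each and time the whole run, once on a kernel that deactivates
the pages and once on a kernel that frees them.
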
> The lock contention was related to one process faulting in a new page due to
> a malloc() while pagedaemon ran.  Also, it wasn't a steady type of contention
> that would show up in an average.  Instead, it was the outliers (which in the
> case on 8.x were on the order of 2 seconds) that were problematic.  I used a
> hack to log "long" wait times for specific processes to both debug this and
> evaluate the solution.  I have a test program laying around from when I last
> tested this.  I'll see what I can reproduce (before it required a machine
> with at least 24GB of RAM to reproduce).
>
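
The point about outliers not showing up in an average can be made
concrete with a small accounting sketch.  This is not John's logging
hack, which lived in the kernel and was not posted; struct wait_stats
and its functions are hypothetical names, and the threshold is whatever
the caller decides counts as a "long" wait.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical accumulator for observed wait times, in nanoseconds. */
struct wait_stats {
	uint64_t threshold_ns;	/* waits at or above this count as "long" */
	uint64_t count;		/* number of samples recorded */
	uint64_t total_ns;	/* sum of all samples */
	uint64_t max_ns;	/* worst single wait seen */
	uint64_t long_waits;	/* number of samples >= threshold_ns */
};

void
wait_stats_init(struct wait_stats *ws, uint64_t threshold_ns)
{
	assert(ws != NULL);
	ws->threshold_ns = threshold_ns;
	ws->count = ws->total_ns = ws->max_ns = ws->long_waits = 0;
}

void
wait_stats_record(struct wait_stats *ws, uint64_t wait_ns)
{
	ws->count++;
	ws->total_ns += wait_ns;
	if (wait_ns > ws->max_ns)
		ws->max_ns = wait_ns;
	if (wait_ns >= ws->threshold_ns)
		ws->long_waits++;
}

/* Integer mean of all recorded waits; 0 if nothing was recorded. */
uint64_t
wait_stats_mean_ns(const struct wait_stats *ws)
{
	return (ws->count == 0 ? 0 : ws->total_ns / ws->count);
}
```

Feed this 999 waits of 1 microsecond and a single wait of 2 seconds, and
the mean comes out at about 2 milliseconds, while max_ns is 2 seconds
and long_waits is 1 for any threshold between the two.  The average
alone gives no sign that one fault stalled for 2 seconds.
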
> The only foolproof way to reduce contention to zero is to eliminate one of
> the contending threads. :)  I do think there are situations where an
> application may be more informed about the optimal memory pattern for its
> workload than what the VM system can infer from heuristics.  Currently there
> is no other way to flush a file's contents from RAM.  If we had things like
> DONTNEED_I_MEAN_IT and DONTNEED_IM_NOT_SURE perhaps we could have a sliding
> scale, but at the moment the policy isn't that fine-grained.
>
> -- 
> John Baldwin
>
>