From owner-freebsd-stable@FreeBSD.ORG  Fri Jun 27 06:21:32 2008
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 651F11065675
	for <freebsd-stable@freebsd.org>; Fri, 27 Jun 2008 06:21:32 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by mx1.freebsd.org (Postfix) with ESMTP id 42DD38FC14
	for <freebsd-stable@freebsd.org>; Fri, 27 Jun 2008 06:21:32 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (localhost [127.0.0.1])
	by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m5R6LTAR026413;
	Thu, 26 Jun 2008 23:21:29 -0700 (PDT)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.14.1/8.13.4/Submit) id m5R6LTum026412;
	Thu, 26 Jun 2008 23:21:29 -0700 (PDT)
Date: Thu, 26 Jun 2008 23:21:29 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200806270621.m5R6LTum026412@apollo.backplane.com>
To: Marcus Reid <marcus@blazingdot.com>
References: <20080626234455.GA77263@blazingdot.com>
	<200806270048.m5R0mDIU024172@apollo.backplane.com>
	<20080627023455.GA34022@blazingdot.com>
Cc: freebsd-stable@freebsd.org
Subject: Re: Performance of madvise / msync
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 27 Jun 2008 06:21:32 -0000

:With madvise() and without msync(), there are high numbers of
:faults, which matches the number of disk io operations.  It
:goes through cycles, every once in a while stalling while about
:60MB of data is dumped to disk at 20MB/s or so (buffers flushing?)
:At the beginning of each cycle it's fast, with 140 faults/s or so,
:and slows as the number of faults climbs to 180/s or so before
:stalling and flushing again.  It never gets _really_ slow though.

    Yah, without the msync() the dirty pages build up in the kernel's
    VM page cache.  A flush should happen automatically every 30-60
    seconds, or sooner if the buffer cache builds up too many dirty pages.

    The activity you are seeing sounds like the 30-60 second filesystem
    sync the kernel does periodically.

    Either NetBSD or OpenBSD, I forget which, implemented a partial sync
    feature to prevent long stalls when the filesystem syncer hits a file
    with a lot of dirty pages.  FreeBSD could borrow that optimization if
    they want to reduce stalls from the filesystem sync.  I ported it to DFly
    a while back myself.

:With msync() and without madvise(), things are very slow, and
:there are no faults, just writes.
:...
:>      The size_t argument to msync() (0x453b7618) is highly questionable.
:>      It could be ktrace reporting the wrong value, but maybe not.
:
:That's the size of rg2.rrd.  It's 1161524760 bytes long.
:...
:Looks like the source of my problem is very slow msync() on the
:file when the file is over a certain size.  It's still fastest
:without either madvise or msync.
:
:Thanks for your time,
:
:Marcus

    The msync() is clearly the problem.  There are numerous optimizations
    in the kernel, but msync() is frankly a rather nasty critter even with
    the optimizations in place.  Nobody using msync() in real life ever tries
    to run it over the entirety of such a large mapping... usually it is
    just run on explicit sub-ranges that the program wishes to sync.

    One reason why msync() is so nasty is that the kernel must physically
    check the page table(s) to determine whether a page has been marked dirty
    by the MMU, so it can't just iterate the pages it knows are dirty in
    the VM object.  It's nasty whether it scans the VM object and iterates
    the page tables, or scans the page tables and looks up the related VM
    pages.   The only way to optimize this is to force write-faults by
    mapping clean pages read-only, in order to track whether a page is
    actually dirty in real time instead of lazily.  Then msync() would
    only have to do a ranged-scan of the VM object's dirty-page list
    and would not have to actually check the page tables for clean pages.

    A secondary effect of the msync() is that it is initiating asynchronous
    I/O for what sounds like hundreds of VM pages, or even more.  All those
    pages are locked and busied from the point they are queued to the point
    the I/O finishes, which for some of the pages can be a very, very long
    time (into the multiples of seconds).  Pages locked that long will
    interfere with madvise() calls made after the msync(), and probably
    even interfere with the following msync().

    It used to be that msync() only synced VM pages to the underlying
    file, making them consistent with read()'s and write()'s against
    the underlying file.  Since FreeBSD uses a unified VM page cache
    this is always true.  However, the Open Group specification now
    requires that the dirty pages actually be written out to the underlying
    media... i.e. issue real I/O.  So msync() can't be a NOP if you go by
    the Open Group specification.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>