From owner-svn-src-all@freebsd.org Fri Oct 2 22:03:13 2015 Return-Path: Delivered-To: svn-src-all@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 9747BA0EF1D; Fri, 2 Oct 2015 22:03:13 +0000 (UTC) (envelope-from markjdb@gmail.com) Received: from mail-qk0-x233.google.com (mail-qk0-x233.google.com [IPv6:2607:f8b0:400d:c09::233]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4CBCF10FE; Fri, 2 Oct 2015 22:03:13 +0000 (UTC) (envelope-from markjdb@gmail.com) Received: by qkbi190 with SMTP id i190so28539308qkb.1; Fri, 02 Oct 2015 15:03:12 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=gXtCJItkCVtZD4rweN72big2DC5728OWMoPGt+u54KA=; b=Nl6r6rijMFhm98AIItT8F7r6ejp2gTz0HGWoIFtpRFg0WnxDDH7UveoRrRl0wh4q9m aEb3PA16X15oJrAjHucKvh0j0DoYf3UQVDb1uRethFyIdvKZAKB6fUEnIz1/fs4ku2vq yLI3PtTIifZs5npiBczhfuTz7rIO62d5bmItsVDLDPZITkMcPaWN82l6gIII2dCaahNH 0Tjik6KIxcNksLTnh5XjU0Xyo7clXVRpVgDCKfm1/QW8cJq4bcA4HeKQctun8nFX/T0n epNsWrYIK7OHlkZf6lueq5S6570A1dDKdkM6iqOIAKe8sCgRO6PgoHBXLExK0N4pTMcn /y1Q== X-Received: by 10.55.49.75 with SMTP id x72mr23436605qkx.45.1443823392098; Fri, 02 Oct 2015 15:03:12 -0700 (PDT) Received: from muskytusk ([104.236.250.12]) by smtp.gmail.com with ESMTPSA id y12sm5519665qgd.20.2015.10.02.15.03.11 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 02 Oct 2015 15:03:11 -0700 (PDT) Sender: Mark Johnston Date: Fri, 2 Oct 2015 22:02:34 +0000 From: Mark Johnston To: John Baldwin Cc: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org Subject: Re: svn commit: r288431 - in head/sys: kern sys vm Message-ID: <20151002220234.GA58210@muskytusk> References: <201509302306.t8UN6UwX043736@repo.freebsd.org> <1837187.vUDrWYExQX@ralph.baldwin.cx> <20151002045842.GA18421@raichu> <4276391.z2UvhhORjP@ralph.baldwin.cx> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="EVF5PPMfhYS0aIcm" Content-Disposition: inline In-Reply-To: <4276391.z2UvhhORjP@ralph.baldwin.cx> User-Agent: Mutt/1.5.23 (2014-03-12) X-BeenThere: svn-src-all@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: "SVN commit messages for the entire src tree \(except for " user" and " projects" \)" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 02 Oct 2015 22:03:13 -0000 --EVF5PPMfhYS0aIcm Content-Type: text/plain; charset=us-ascii Content-Disposition: inline On Fri, Oct 02, 2015 at 08:59:33AM -0700, John Baldwin wrote: > On Thursday, October 01, 2015 09:58:43 PM Mark Johnston wrote: > > On Thu, Oct 01, 2015 at 09:32:45AM -0700, John Baldwin wrote: > > > On Wednesday, September 30, 2015 11:06:30 PM Mark Johnston wrote: > > > > Author: markj > > > > Date: Wed Sep 30 23:06:29 2015 > > > > New Revision: 288431 > > > > URL: https://svnweb.freebsd.org/changeset/base/288431 > > > > > > > > Log: > > > > As a step towards the elimination of PG_CACHED pages, rework the handling > > > > of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to > > > > the head of the inactive queue instead of being cached. > > > > > > > > This affects the implementation of POSIX_FADV_NOREUSE as well, since it > > > > works by applying POSIX_FADV_DONTNEED to file ranges after they have been > > > > read or written. At that point the corresponding buffers may still be > > > > dirty, so the previous implementation would coalesce successive ranges and > > > > apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the > > > > dirty buffers would eventually be cached. To preserve this behaviour in an > > > > efficient manner, this change adds a new buf flag, B_NOREUSE, which causes > > > > the pages backing a VMIO buf to be placed at the head of the inactive queue > > > > when the buf is released. POSIX_FADV_NOREUSE then works by setting this > > > > flag in bufs that underlie the specified range. > > > > > > Putting these pages back on the inactive queue completely defeats the primary > > > purpose of DONTNEED and NOREUSE. The primary purpose is to move the pages out > > > of the VM object's tree of pages and into the free pool so that the application > > > can instruct the VM to free memory more efficiently than relying on page daemon. > > > > > > The implementation used cache pages instead of free as a cheap optimization so > > > that if an application did something dumb where it used DONTNEED and then turned > > > around and read the file it would not have to go to disk if the pages had not > > > yet been reused. In practice this didn't work out so well because PG_CACHE pages > > > don't really work well. > > > > > > However, using PG_CACHE was secondary to the primary purpose of explicitly freeing > > > memory that an application knew wasn't going to be reused and avoiding the need > > > for pagedaemon to run at all. I think this should be freeing the pages instead of > > > keeping them inactive. If an application uses DONTNEED or NOREUSE and then turns > > > around and rereads the file, it generally deserves to have to go to disk for it. > > > > A problem with this is that one application's DONTNEED or NOREUSE hint > > would cause every application reading or writing that file to go to > > disk, but posix_fadvise(2) is explicitly intended for applications that > > wish to provide hints about their own access patterns. I realize that > > it's typically used with application-private files, but that's not a > > requirement of the interface. Deactivating (or caching) the backing > > pages generally avoids this problem. > > I think it is not unreasonble to expect that fadvise() incurs system-wide > affects. A properly implemented WILLNEED that does read-ahead cannot work > without incurring system-wide effects. I had always assumed that fadvise() > operated on a file, not a given process' view of a file (unlike, say, > madvise which only operates on mappings and only indirectly affects > file-backed data). Well, that's even true of read(): two processes reading the same file may affect each other if one primes the buffer cache with blocks as the second process is reading them. DONTNEED and NOREUSE would specifically pessimize all processes using the file if they were to cause backing pages to be freed, though. > > > > I'm pretty sure I had mentioned this to Alan before. I believe that the idea is > > > that pagedaemon should be cheap enough that having it run anyway shouldn't be an > > > issue, but I'm a bit skeptical of that. :) Lock contention is always possible and > > > having DONTNEED/NOREUSE move pages to PG_CACHE avoided lock contention with > > > pagedaemon during application page faults (since pagedaemon potentially never has > > > to run). > > > > That's true, but the page queue locking (and the pagedaemon's > > manipulation of the page queue locks) has also become more fine-grained > > since posix_fadvise(2) was added. In particular, from some reading of > > sys/vm in stable/8, inactive queue scans used to be performed with the > > global page queue lock held; it was only dropped to launder dirty pages. > > Now, the page queue lock is split into separate locks for the active and > > inactive page queues, and the pagedaemon drops the inactive queue lock > > for each page in all but a few exceptional cases. Does the optimization > > of freeing or caching DONTNEED pages buy us all that much now? > > > > Some synthetic testing in which an application writes out many large > > (2G) files and calls posix_fadvise(FADV_DONTNEED) after each one shows > > no significant difference in runtime if the buffer pages are deactivated > > vs. freed. (My test just modifies vfs_vmio_unwire() to treat B_NOREUSE > > identically to B_DIRECT.) Unsurprisingly, I see very little lock > > contention in the latter case, but in the former, most of the lock > > contention is short (i.e. the mutex is acquired while spinning), and > > a large majority of the contention is on the free page queue mutex. If > > lock contention there is a concern, wouldn't it be better to try and > > address that directly rather than by bypassing the pagedaemon? > > The lock contention was related to one process faulting in a new page due to > a malloc() while pagedaemon ran. Also, it wasn't a steady type of contention > that would show up in an average. Instead, it was the outliers (which in the > case on 8.x were on the order of 2 seconds) that were problematic. I used a > hack to log "long" wait times for specific processes to both debug this and > evaluate the solution. I have a test program laying around from when I last > tested this. I'll see what I can reproduce (before it required a machine > with at least 24GB of RAM to reproduce). Thanks! FWIW, I found the attached patch convenient for testing and benchmarking fadvise; it adds iadvice and oadvice parameters to dd(1). > > The only foolproof way to reduce contention to zero is to eliminate one of > the contending threads. :) I do think there are situations where an > application may be more informed about the optimal memory pattern for its > workload than what the VM system can infer from heuristics. Currently there > is no other way to flush a file's contents from RAM. If we had things like > DONTNEED_I_MEAN_IT and DONTNEED_IM_NOT_SURE perhaps we could have a sliding > scale, but at the moment the policy isn't that fine-grained. Sure. I guess I'm just making the conservative argument that a "seatbelts-off" implementation isn't obviously the right choice for a default. This is something that could be controlled with a sysctl or a richer set of hints. --EVF5PPMfhYS0aIcm Content-Type: text/x-diff; charset=us-ascii Content-Disposition: attachment; filename="ddadv.diff" diff --git a/bin/dd/args.c b/bin/dd/args.c index 2f197f8..6220115 100644 --- a/bin/dd/args.c +++ b/bin/dd/args.c @@ -39,10 +39,11 @@ static char sccsid[] = "@(#)args.c 8.3 (Berkeley) 4/2/94"; #include __FBSDID("$FreeBSD$"); -#include +#include #include #include +#include #include #include #include @@ -54,14 +55,17 @@ __FBSDID("$FreeBSD$"); static int c_arg(const void *, const void *); static int c_conv(const void *, const void *); +static void f_advice(const char *, IO *); static void f_bs(char *); static void f_cbs(char *); static void f_conv(char *); static void f_count(char *); static void f_files(char *); static void f_fillchar(char *); +static void f_iadvice(char *); static void f_ibs(char *); static void f_if(char *); +static void f_oadvice(char *); static void f_obs(char *); static void f_of(char *); static void f_seek(char *); @@ -81,9 +85,11 @@ static const struct arg { { "count", f_count, C_COUNT, C_COUNT }, { "files", f_files, C_FILES, C_FILES }, { "fillchar", f_fillchar, C_FILL, C_FILL }, + { "iadvice", f_iadvice, C_IADV, C_IADV }, { "ibs", f_ibs, C_IBS, C_BS|C_IBS }, { "if", f_if, C_IF, C_IF }, { "iseek", f_skip, C_SKIP, C_SKIP }, + { "oadvice", f_oadvice, C_OADV, C_OADV }, { "obs", f_obs, C_OBS, C_BS|C_OBS }, { "of", f_of, C_OF, C_OF }, { "oseek", f_seek, C_SEEK, C_SEEK }, @@ -104,6 +110,7 @@ jcl(char **argv) char *arg; in.dbsz = out.dbsz = 512; + in.advice = out.advice = POSIX_FADV_NORMAL; while ((oper = *++argv) != NULL) { if ((oper = strdup(oper)) == NULL) @@ -306,6 +313,48 @@ f_status(char *arg) errx(1, "unknown status %s", arg); } +static const struct { + const char *name; + int advice; + int flags; +} pieces[] = { + { "noreuse", POSIX_FADV_NOREUSE, ADVBEFORE }, + { "normal", POSIX_FADV_NORMAL, ADVBEFORE }, + { "random", POSIX_FADV_RANDOM, ADVBEFORE }, + { "sequential", POSIX_FADV_SEQUENTIAL, ADVBEFORE }, + { "willneed", POSIX_FADV_WILLNEED, ADVBEFORE }, + { "dontneed", POSIX_FADV_DONTNEED, ADVAFTER }, +}; + +static void +f_advice(const char *arg, IO *io) +{ + u_long i; + + for (i = 0; i < nitems(pieces); i++) + if (strcmp(arg, pieces[i].name) == 0) + break; + if (i == nitems(pieces)) + errx(1, "'%s' isn't real advice", arg); + + io->advice = pieces[i].advice; + io->flags |= pieces[i].flags; +} + +static void +f_iadvice(char *arg) +{ + + f_advice(arg, &in); +} + +static void +f_oadvice(char *arg) +{ + + f_advice(arg, &out); +} + static const struct conv { const char *name; u_int set, noset; diff --git a/bin/dd/dd.1 b/bin/dd/dd.1 index 0908642..9b6a824 100644 --- a/bin/dd/dd.1 +++ b/bin/dd/dd.1 @@ -94,6 +94,16 @@ modes, fill with the specified .Tn ASCII character, rather than using a space or .Dv NUL . +.It Cm iadvice Ns = Ns Ar value +Use +.Xr posix_fadvise 2 +to provide the specified hint about the input file over its entire range. +This is useful for benchmarking. +Valid values are formed by removing the +.Ql POSIX_FADV_ +prefix of +.Xr posix_fadvise 2 +arguments and converting the result to lower-case. .It Cm ibs Ns = Ns Ar n Set the input block size to .Ar n @@ -108,6 +118,10 @@ Seek on the input file blocks. This is synonymous with .Cm skip Ns = Ns Ar n . +.It Cm oadvice Ns = Ns Ar value +The equivalent of +.Cm iadvice +for the output file. .It Cm obs Ns = Ns Ar n Set the output block size to .Ar n diff --git a/bin/dd/dd.c b/bin/dd/dd.c index 8ae11a7..7c78295 100644 --- a/bin/dd/dd.c +++ b/bin/dd/dd.c @@ -243,6 +243,13 @@ setup(void) ctab = casetab; } + if ((in.flags & ADVBEFORE) && + posix_fadvise(in.fd, 0, 0, in.advice) != 0) + err(1, "posix_fadvise"); + if ((out.flags & ADVBEFORE) && + posix_fadvise(out.fd, 0, 0, out.advice) != 0) + err(1, "posix_fadvise"); + if (clock_gettime(CLOCK_MONOTONIC, &st.start)) err(1, "clock_gettime"); } @@ -409,6 +416,12 @@ dd_close(void) } if (out.dbcnt || pending) dd_out(1); + if ((in.flags & ADVAFTER) && + posix_fadvise(in.fd, 0, 0, in.advice) != 0) + err(1, "posix_fadvise"); + if ((out.flags & ADVAFTER) && + posix_fadvise(out.fd, 0, 0, out.advice) != 0) + err(1, "posix_fadvise"); } void diff --git a/bin/dd/dd.h b/bin/dd/dd.h index a8b45e5..9627f52 100644 --- a/bin/dd/dd.h +++ b/bin/dd/dd.h @@ -49,9 +49,12 @@ typedef struct { #define ISSEEK 0x08 /* valid to seek on */ #define NOREAD 0x10 /* not readable */ #define ISTRUNC 0x20 /* valid to ftruncate() */ +#define ADVBEFORE 0x40 /* apply advice before I/O */ +#define ADVAFTER 0x80 /* apply advice after I/O */ u_int flags; const char *name; /* name */ + int advice; /* posix_fadvise(2) advice */ int fd; /* file descriptor */ off_t offset; /* # of blocks to skip */ } IO; @@ -98,5 +101,7 @@ typedef struct { #define C_STATUS 0x08000000 #define C_NOXFER 0x10000000 #define C_NOINFO 0x20000000 +#define C_IADV 0x40000000 +#define C_OADV 0x80000000 #define C_PARITY (C_PAREVEN | C_PARODD | C_PARNONE | C_PARSET) --EVF5PPMfhYS0aIcm--