From: Cedric Blancher
Date: Fri, 8 Jul 2016 01:20:14 +0200
Subject: Re: ZFS ARC and mmap/page cache coherency question
To: Karl Denninger
Cc: "freebsd-hackers@freebsd.org", illumos-dev, "Garrett D'Amore"

I think Garrett D'Amore had some ideas about the VM<---->ZFS
communication and double/multicaching issues too.

Ced

On 3 July 2016 at 17:43, Karl Denninger wrote:
>
> On 7/3/2016 02:45, Matthew Macy wrote:
>>
>> Cedric greatly overstates the intractability of resolving it.
>> Nonetheless, since the initial import very little has been done to
>> improve integration, and I don't know of anyone who is up to the
>> task taking an interest in it. Consequently, mmap() performance is
>> likely "doomed" for the foreseeable future.
>>
>> -M----
>
> Wellllll....
>
> I've done a fair bit of work here (see
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594) and the
> political issues are at least as bad as the coding ones.
>
> In short, what Cedric says about the root of the issue is real. VM is
> really well implemented for what it handles, but the problem is that
> while the UFS data cache is part of VM, and VM thus "knows" about it,
> ZFS is not, because it is a "bolt-on." UMA leads to further (severe)
> complications for certain workloads.
>
> Finally, the underlying ZFS dmu_tx sizing code is just plain wrong,
> and in fact this is one of the biggest issues, because when the
> system runs into trouble it can take a bad situation and make it a
> *lot* worse. There is only one write-back cache maintained instead of
> one per zvol, and that's flat-out broken. Being able to re-order
> async writes to disk (where fsync() has not been called) and minimize
> seek latency is excellent. Sadly, rotating media these days sabotage
> much of this due to opacity introduced at the drive level (e.g.
> varying sector counts per track, etc.) but it can still help. Where
> things go dramatically wrong is on a system where a large write-back
> cache is allocated relative to the underlying zvol I/O performance
> (this occurs on moderately large and bigger RAM systems) with a
> moderate number of modest-performance rotating disks; in that case it
> is entirely possible for a flush of the write buffers to take upwards
> of a *minute* to complete, during which all other writes block. If
> this happens during a period of high RAM demand and you manage to
> trigger a page-out at the same time, system performance goes straight
> into the toilet. I have seen instances where simply trying to edit a
> text file with vi (or run a "select" against a database table) hangs
> for upwards of a minute, leading you to believe the system has
> crashed, when in fact it has not.
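A back-of-the-envelope sketch makes the flush-time claim above concrete.
The cache size and pool throughput below are hypothetical figures, not
measurements from any particular system, chosen only to show the scale
of the problem on a large-RAM box backed by a few slow spindles:

    #include <stdio.h>

    /*
     * Rough flush-time estimate: dirty write-back data divided by the
     * effective write bandwidth of the pool.  Both figures are made-up
     * examples (4 GiB dirty, ~70 MiB/s sustained to rotating media).
     */
    int
    main(void)
    {
            double dirty_bytes = 4.0 * 1024 * 1024 * 1024;  /* 4 GiB dirty  */
            double pool_write_bw = 70.0 * 1024 * 1024;      /* ~70 MiB/s    */

            printf("flush time: ~%.0f seconds\n", dirty_bytes / pool_write_bw);
            return (0);
    }

With those (made-up but plausible) numbers the flush alone takes roughly
a minute, which lines up with the stalls described above.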
> The interaction of VM with the above can lead to severe pathological
> behavior, because the VM system has no way to tell the ZFS subsystem
> to pare back ARC (and, at least as important and perhaps more so,
> unused but allocated UMA) when memory pressure exists, *before* it
> pages. ZFS tries to detect memory pressure and do this itself, but it
> winds up competing with the VM system. This leads to demonstrably
> wrong behavior, because you never want to hold disk cache in
> preference to RSS: if you have a block of data from the disk, the
> best case is that you avoid one I/O (to re-read it); if you page, you
> are *guaranteed* to take one I/O (to write the paged-out RSS to disk)
> and *might* take two (if you then must read it back in).
>
> In short, trading the avoidance of one *possible* I/O for a
> *guaranteed* I/O and a second possible one is *always* a net loss.
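For context on the "competing with the VM system" point: the mechanism
the FreeBSD port has today is reactive. As far as I know the ARC
registers a handler on the kernel's vm_lowmem event (arc_lowmem() in
arc.c) and only shrinks once that fires, i.e. once the pagedaemon is
already short of pages. A minimal sketch of that style of hook, for a
kernel-module context, looks roughly like this (the names and the
handler body are illustrative placeholders, not the actual ARC code):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/kernel.h>
    #include <sys/eventhandler.h>

    /*
     * Illustrative vm_lowmem consumer.  The handler runs only after the
     * VM system is already short of pages, so any cache shrinking it
     * does competes with, rather than pre-empts, the pageout path.
     */
    static eventhandler_tag example_lowmem_tag;

    static void
    example_cache_shrink(void)
    {
            /* Placeholder: release cached buffers / idle UMA here. */
    }

    static void
    example_lowmem(void *arg __unused, int flags __unused)
    {
            example_cache_shrink();
    }

    static void
    example_lowmem_init(void)
    {
            example_lowmem_tag = EVENTHANDLER_REGISTER(vm_lowmem,
                example_lowmem, NULL, EVENTHANDLER_PRI_FIRST);
    }

The hook only arrives once paging is already imminent; there is no
earlier channel through which VM can ask ZFS to give back ARC or idle
UMA before it starts writing RSS out, which is the ordering problem
described above.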
> To "fix" all of this "correctly" (for all cases, instead of just
> certain cases) VM would have to "know" about ARC and its use of UMA,
> and be able to police both. ZFS also must have the dmu_tx write-back
> cache sized per-zvol, with its size chosen from the actual I/O
> performance characteristics of the disks in the zvol itself. I've
> looked into doing both and it's fairly complex, and what's worse is
> that it would effectively "marry" VM and ZFS, removing the "bolt-on"
> aspect of things. That in turn leads to a lot of maintenance work
> over time, because any time the ZFS code changes (and it does, quite
> a bit) you have to go back through that process in order to stay
> coherent with Illumos.
>
> The PR above resolved (completely) the issues I was having, along
> with those of a number of other people, on 10.x and before (I've not
> yet rolled it forward to 11), but it's quite clearly a hack of sorts,
> in that it detects and treats symptoms (e.g. dynamic TX cache size
> modification, etc.) rather than integrating VM and ZFS cache
> management.
>
> --
> Karl Denninger
> karl@denninger.net
> /The Market Ticker/
> /[S/MIME encrypted email preferred]/

--
Cedric Blancher
Institute Pasteur