From: Cedric Blancher
Date: Fri, 8 Jul 2016 01:20:14 +0200
Subject: Re: ZFS ARC and mmap/page cache coherency question
To: Karl Denninger
Cc: "freebsd-hackers@freebsd.org", illumos-dev, "Garrett D'Amore"

I think Garrett D'Amore had some ideas about the VM<---->ZFS
communication and double/multicaching issues too.

Ced

On 3 July 2016 at 17:43, Karl Denninger wrote:
>
> On 7/3/2016 02:45, Matthew Macy wrote:
>>
>> Cedric greatly overstates the intractability of resolving it.
>> Nonetheless, since the initial import very little has been done to
>> improve integration, and I don't know of anyone who is up to the
>> task taking an interest in it. Consequently, mmap() performance is
>> likely "doomed" for the foreseeable future.
>>
>> -M----
>
> Wellllll....
>
> I've done a fair bit of work here (see
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594) and the
> political issues are at least as bad as the coding ones.
>
> In short, what Cedric says about the root of the issue is real. VM is
> really well implemented for what it handles, but the problem is that
> while the UFS data cache is part of VM, and VM thus "knows" about it,
> ZFS is not, because it is a "bolt-on." UMA leads to further (severe)
> complications for certain workloads.
>
> Finally, the underlying ZFS dmu_tx sizing code is just plain wrong,
> and in fact this is one of the biggest issues, because when the
> system runs into trouble it can take a bad situation and make it a
> *lot* worse. There is only one write-back cache maintained instead of
> one per zvol, and that's flat-out broken. Being able to re-order
> async writes to disk (where fsync() has not been called) and minimize
> seek latency is excellent. Sadly, rotating media these days sabotage
> much of this due to opacity introduced at the drive level (e.g.
> varying sector counts per track, etc.) but it can still help. Where
> things go dramatically wrong is on a system where a large write-back
> cache is allocated relative to the underlying zvol I/O performance
> (this occurs on moderately large and bigger RAM systems) with a
> moderate number of modest-performance rotating disks; in that case it
> is entirely possible for a flush of the write buffers to take upwards
> of a *minute* to complete, during which all other writes block. If
> this happens during a period of high RAM demand and you manage to
> trigger a page-out at the same time, system performance goes straight
> into the toilet. I have seen instances where simply trying to edit a
> text file with vi (or run a "select" against a database table) hangs
> for upwards of a minute, leading you to believe the system has
> crashed, when in fact it has not.
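A back-of-the-envelope sketch makes the flush-time claim above concrete.
The cache size and pool throughput below are hypothetical figures, not
measurements from any particular system, chosen only to show the scale
of the problem on a large-RAM box backed by a few slow spindles:

    #include <stdio.h>

    /*
     * Rough flush-time estimate: dirty write-back data divided by the
     * effective write bandwidth of the pool.  Both figures are made-up
     * examples (4 GiB dirty, ~70 MiB/s sustained to rotating media).
     */
    int
    main(void)
    {
            double dirty_bytes = 4.0 * 1024 * 1024 * 1024;  /* 4 GiB dirty  */
            double pool_write_bw = 70.0 * 1024 * 1024;      /* ~70 MiB/s    */

            printf("flush time: ~%.0f seconds\n", dirty_bytes / pool_write_bw);
            return (0);
    }

With those (made-up but plausible) numbers the flush alone takes roughly
a minute, which lines up with the stalls described above.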
> The interaction of VM with the above can lead to severe pathological
> behavior, because the VM system has no way to tell the ZFS subsystem
> to pare back ARC (and, at least as important and perhaps more so,
> unused but allocated UMA) when memory pressure exists, *before* it
> pages. ZFS tries to detect memory pressure and do this itself, but it
> winds up competing with the VM system. This leads to demonstrably
> wrong behavior, because you never want to hold disk cache in
> preference to RSS: if you have a block of data from the disk, the
> best case is that you avoid one I/O (to re-read it); if you page, you
> are *guaranteed* to take one I/O (to write the paged-out RSS to disk)
> and *might* take two (if you then must read it back in).
>
> In short, trading the avoidance of one *possible* I/O for a
> *guaranteed* I/O and a second possible one is *always* a net loss.
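For context on the "competing with the VM system" point: the mechanism
the FreeBSD port has today is reactive. As far as I know the ARC
registers a handler on the kernel's vm_lowmem event (arc_lowmem() in
arc.c) and only shrinks once that fires, i.e. once the pagedaemon is
already short of pages. A minimal sketch of that style of hook, for a
kernel-module context, looks roughly like this (the names and the
handler body are illustrative placeholders, not the actual ARC code):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/kernel.h>
    #include <sys/eventhandler.h>

    /*
     * Illustrative vm_lowmem consumer.  The handler runs only after the
     * VM system is already short of pages, so any cache shrinking it
     * does competes with, rather than pre-empts, the pageout path.
     */
    static eventhandler_tag example_lowmem_tag;

    static void
    example_cache_shrink(void)
    {
            /* Placeholder: release cached buffers / idle UMA here. */
    }

    static void
    example_lowmem(void *arg __unused, int flags __unused)
    {
            example_cache_shrink();
    }

    static void
    example_lowmem_init(void)
    {
            example_lowmem_tag = EVENTHANDLER_REGISTER(vm_lowmem,
                example_lowmem, NULL, EVENTHANDLER_PRI_FIRST);
    }

The hook only arrives once paging is already imminent; there is no
earlier channel through which VM can ask ZFS to give back ARC or idle
UMA before it starts writing RSS out, which is the ordering problem
described above.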
> To "fix" all of this "correctly" (for all cases, instead of just
> certain cases) VM would have to "know" about ARC and its use of UMA,
> and be able to police both. ZFS also must have the dmu_tx write-back
> cache sized per-zvol, with its size chosen from the actual I/O
> performance characteristics of the disks in the zvol itself. I've
> looked into doing both and it's fairly complex, and what's worse is
> that it would effectively "marry" VM and ZFS, removing the "bolt-on"
> aspect of things. That in turn leads to a lot of maintenance work
> over time, because any time the ZFS code changes (and it does, quite
> a bit) you have to go back through that process in order to stay
> coherent with Illumos.
>
> The PR above resolved (completely) the issues I was having, along
> with those of a number of other people, on 10.x and before (I've not
> yet rolled it forward to 11), but it's quite clearly a hack of sorts,
> in that it detects and treats symptoms (e.g. dynamic TX cache size
> modification, etc.) rather than integrating VM and ZFS cache
> management.
>
> --
> Karl Denninger
> karl@denninger.net
> /The Market Ticker/
> /[S/MIME encrypted email preferred]/

--
Cedric Blancher
Institute Pasteur