From: Lionel Cons
Date: Tue, 5 Jul 2016 20:40:30 +0200
Subject: Re: ZFS ARC and mmap/page cache coherency question
To: Karl Denninger
Cc: FreeBSD hackers list

So what Oracle did (based on work done by Sun for OpenSolaris) was to:

1. Modify ZFS to prevent *ANY* double/multi caching [this is considered
   a design defect], and
2. Introduce a new VM subsystem which scales a lot better and provides
   hooks for [1], so there are never two or more copies of the same data
   in the system.

Given that this was a huge, paid, multi-year effort, it's not likely
that the design defects in open-source ZFS will ever go away.
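[The double-caching problem described above can be made concrete with a toy model. This is purely illustrative Python, not ZFS or kernel code; all names in it are hypothetical. It shows why two independent caches holding the same disk blocks roughly double the memory footprint compared with a unified design.]

```python
# Toy model of double caching: every block read ends up resident in both
# a "page cache" and an "ARC" dict, so memory use is roughly doubled.
# A unified design (one shared cache) would need only a single copy.

BLOCK_SIZE = 4096

def read_block(n, page_cache, arc, disk):
    """Simulate a read that populates both caches independently."""
    if n not in page_cache:
        page_cache[n] = disk[n]   # copy 1: VM page cache
    if n not in arc:
        arc[n] = disk[n]          # copy 2: the double cache
    return page_cache[n]

disk = {n: bytes(BLOCK_SIZE) for n in range(8)}
page_cache, arc = {}, {}
for n in range(8):
    read_block(n, page_cache, arc, disk)

double = (len(page_cache) + len(arc)) * BLOCK_SIZE
unified = len(page_cache) * BLOCK_SIZE
print(double, unified)   # 65536 32768 -- twice the footprint for the same data
```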
Lionel

On 5 July 2016 at 19:50, Karl Denninger wrote:
>
> On 7/5/2016 12:19, Matthew Macy wrote:
>>
>> ---- On Mon, 04 Jul 2016 19:26:06 -0700 Karl Denninger wrote ----
>> >
>> > On 7/4/2016 18:45, Matthew Macy wrote:
>> > >
>> > > ---- On Sun, 03 Jul 2016 08:43:19 -0700 Karl Denninger wrote ----
>> > > >
>> > > > On 7/3/2016 02:45, Matthew Macy wrote:
>> > > > >
>> > > > > Cedric greatly overstates the intractability of resolving it.
>> > > > > Nonetheless, since the initial import very little has been done
>> > > > > to improve integration, and I don't know of anyone who is up to
>> > > > > the task taking an interest in it. Consequently, mmap()
>> > > > > performance is likely "doomed" for the foreseeable future. -M
>> > > >
>> > > > Wellllll....
>> > > >
>> > > > I've done a fair bit of work here (see
>> > > > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594) and the
>> > > > political issues are at least as bad as the coding ones.
>> > >
>> > > Strictly speaking, the root of the problem is the ARC, not ZFS per
>> > > se. Have you ever tried disabling MFU caching to see how much worse
>> > > LRU-only is? I'm not really convinced the ARC's benefits justify
>> > > its cost.
>> > >
>> > > -M
>> >
>> > The ARC is very useful when it gets a hit, as it avoids an I/O that
>> > would otherwise take place.
>> >
>> > Where it sucks is when the system evicts working set to preserve ARC.
>> > That's always wrong, in that you're trading a speculative I/O (if the
>> > cache is hit later) for a *guaranteed* one (to page out) and maybe
>> > *two* (to page back in).
>>
>> The question wasn't ARC vs. no caching; it was LRU-only vs. LRU + MFU.
>> There are a lot of issues stemming from the fact that ZFS is a
>> transactional object store with a POSIX FS on top. One is that it
>> caches disk blocks as opposed to file blocks. However, if one could
>> resolve that and have the page cache manage these blocks, life would
>> be much, much better. However, you'd lose MFU. Hence my question.
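[The LRU-only vs. LRU + MFU question above can be illustrated with a toy simulation. This is a sketch, not the real ARC algorithm (which tracks recency and frequency lists plus ghost lists); it only shows why a frequency-aware cache survives linear scans that flush a pure LRU.]

```python
from collections import OrderedDict, Counter

# Toy comparison of a pure-LRU cache with a frequency-aware ("MFU-ish")
# cache. A repeatedly-touched "hot" key is interleaved with one-shot
# scan keys; the scan evicts it from pure LRU but not from the
# frequency-aware cache.

CAP = 4

def run(policy, accesses):
    cache = OrderedDict()   # key -> None, insertion/access order = recency
    freq = Counter()        # access counts, including misses
    hits = 0
    for k in accesses:
        freq[k] += 1
        if k in cache:
            hits += 1
            cache.move_to_end(k)             # refresh recency
        else:
            if len(cache) >= CAP:
                if policy == "lru":
                    cache.popitem(last=False)  # evict least recently used
                else:
                    # evict least frequently used resident key
                    victim = min(cache, key=lambda x: freq[x])
                    del cache[victim]
            cache[k] = None
    return hits

# "hot" is touched between scans of cold keys that overflow the cache.
trace = ["hot", "hot", 1, 2, 3, 4, "hot", 5, 6, 7, 8,
         "hot", 9, 10, 11, 12, "hot"]
print(run("lru", trace), run("lfu", trace))   # 1 4
```

Under this trace the pure LRU keeps only the initial back-to-back hit; every scan pushes "hot" out before its next access, while the frequency-aware policy retains it throughout.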
>>
>> -M
>
> I suspect there's an argument to be made there, but the present
> problems make determining the impact of that difficult or impossible,
> as those effects are swamped by the other issues.
>
> I can fairly easily create workloads on the base code where simply
> typing "vi ", making a change and hitting ":w" will result in a stall
> of tens of seconds or more while the cache flush that gets requested
> is run down. I've resolved a good part (but not all instances) of this
> through my work.
>
> My understanding is that 11- has had additional work done to the base
> code, but three underlying issues are, from what I can see in the
> commit logs and discussions, not addressed: the VM system will page
> out working set while leaving the ARC alone; UMA
> reserved-but-not-in-use space is not policed adequately when memory
> pressure exists *before* the pager starts considering evicting working
> set; and the write-back cache is for many machine configurations
> grossly inappropriate and cannot be tuned adequately by hand
> (particularly on a system with vdevs that have materially varying
> performance levels).
>
> I have more-or-less stopped work on the tree on a forward basis, since
> I got to a place with 10.2 that (1) works for my production
> requirements, resolving the problems, and (2) ran into what I deemed
> to be intractable political issues within core on progress toward
> eradicating the root of the problem.
>
> I will probably revisit the situation with 11- at some point, as I'll
> want to roll my production systems forward. However, I don't know when
> that will be -- right now 11- is stable enough for some of my embedded
> work (e.g. on the Raspberry Pi 2) but is not on my server- and
> client-class machines. Indeed, just yesterday I got a lock-order
> reversal panic while doing a shutdown after a kernel update on one of
> my lab boxes running a just-updated 11- codebase.
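[Karl's point that evicting working set instead of cache is "always wrong" follows from a simple cost asymmetry, sketched below. The numbers and function names are illustrative only, not measurements of any real system: a clean cache page costs at most one speculative re-read, while a paged-out working-set page costs a guaranteed write plus a near-certain page-in.]

```python
# Toy cost model: compare evicting a clean cache page against paging out
# working set, counting expected I/O operations. IO_COST is arbitrary.

IO_COST = 1.0

def evict_cache_page(hit_probability):
    """Cost is one future read, and only if the block is needed again."""
    return hit_probability * IO_COST

def page_out_working_set(refault_probability=1.0):
    """One guaranteed write-out now, plus a page-in when the process runs."""
    return IO_COST + refault_probability * IO_COST

# Even a cache page with a 30% chance of a future hit is far cheaper to
# drop than paging out working set that will certainly be touched again.
print(evict_cache_page(0.3))      # 0.3 expected I/Os
print(page_out_working_set())     # 2.0 I/Os
```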
>
> --
> Karl Denninger
> karl@denninger.net
> /The Market Ticker/
> /[S/MIME encrypted email preferred]/

--
Lionel