From owner-freebsd-fs@freebsd.org Thu Mar 7 05:23:40 2019
Date: Thu, 7 Mar 2019 16:23:27 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Conrad Meyer
Cc: Rick Macklem, bugzilla-noreply@freebsd.org, fs@FreeBSD.org
Subject: Re: [Bug 235774] [FUSE]: Need to evict invalidated cache contents on fuse_write_directbackend()
Message-ID: <20190307150927.L932@besplex.bde.org>
[bugzilla kills replies, but is in the Cc list twice]

On Wed, 6 Mar 2019, Conrad Meyer wrote:

> On Wed, Mar 6, 2019 at 1:32 PM Rick Macklem wrote:
>>
>> --- Comment #4 from Conrad Meyer ---
>>> I think fuse's IO_DIRECT path is a mess. Really all IO should go
>>> through the buffer cache, and B_DIRECT and ~B_CACHE are just flags
>>> that control the buffer's lifetime once the operation is complete.
>>> Removing the "direct" backends entirely (except as implementation
>>> details of strategy()) would simplify and correct the caching logic.
>>
>> Hmm, I'm not sure that I agree that all I/O should go through the
>> buffer cache, in general. (I won't admit to knowing the fuse code
>> well enough to comment specifically on it.)
>
> The scope of the bug and comment you've replied to is just FUSE IO.
>
>> … having the NFS (or FUSE) client do a
>> large amount of writing to a file can flood the buffer cache, and
>> avoiding this for the case where the client won't be reading the file
>> would be nice. What I am not sure is whether O_DIRECT is a good
>> indicator of "doing a lot of writing that won't be read back".
>
> This is the known failure mode of LRU cache policies plus finite cache
> size plus naive clients. It's not specific to any particular
> filesystem.
> You can either enlarge your LRU cache to incorporate the
> entire working set size, incorporate frequency of access in the
> eviction policy, or have smart clients provide hints (e.g.,
> POSIX_FADV_DONTNEED). O_DIRECT -> IO_DIRECT -> B_DIRECT is already
> used as a hint in the bufcache to release bufs/pages aggressively.

It is mostly a failure with naive clients. Some are so naive that they
even trust the implementation of O_DIRECT to be any good. Here the
naive client is mostly FUSE.

I fixed this in the md device using POSIX_FADV_DONTNEED and an optional
new caching option that turns this off. Clients above md can still get
slowness by using block sizes too different from the block sizes (if
any) used by the backing storage, but unlike IO_DIRECT,
POSIX_FADV_DONTNEED is only a hint, and it only discards full blocks
from the buffer cache for file systems that use the buffer cache. zfs
doesn't use the buffer cache, and most of posix_fadvise(2), including
all of POSIX_FADV_DONTNEED, is just a stub that has no effect for it.
zfs also doesn't support IO_DIRECT, so the attempted pessimizations
from using IO_DIRECT for md had no effect.

ffs has a fairly bad implementation of IO_DIRECT. For writing, it does
the write using the buffer cache and then kills the buffer. The result
for full blocks is the same as for a normal write followed by
POSIX_FADV_DONTNEED. The result for a partial block is to kill the
buffer, while POSIX_FADV_DONTNEED would keep it. For reading, it does
much the same unless the optional DIRECTIO option is configured. Then
the buffer cache is not used at all. This seems to make no significant
difference when all i/o is direct. Normal methods using 1 buffer at a
time won't thrash the buffer cache. Rawread uses a pbuf, and pbufs are
a more limited resource with more primitive management, so it might
actually be slower. zfs also doesn't support DIRECTIO.

md used to use IO_DIRECT only for reading.
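The write-then-hint pattern described above (go through the cache, then
ask the kernel to drop the now-useless pages) can be sketched from
userland. This is a hypothetical illustration, not the md code, and the
hint remains only advisory (zfs ignores it entirely):

```python
import os
import tempfile

def write_without_caching(data: bytes) -> int:
    """Write data to a scratch file, then hint the kernel to drop the
    cached pages, since we will not read the data back.

    POSIX_FADV_DONTNEED is only a hint; file systems that do not use
    the buffer/page cache for this (e.g. zfs) ignore it.
    Returns the number of bytes written.
    """
    fd, path = tempfile.mkstemp()
    try:
        n = os.write(fd, data)
        # Dirty pages cannot be discarded; flush them to stable storage
        # first so the hint can actually free them.
        os.fsync(fd)
        # offset=0, length=0 means "the whole file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        return n
    finally:
        os.close(fd)
        os.unlink(path)
```

Unlike O_DIRECT, this keeps the normal (clustered, cached) write path
and only affects the pages' lifetime after the write completes.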
With vnode backing on ffs, only reading with the same block size as ffs
was reasonably efficient. IO_DIRECT prevents normal clustering even
without DIRECTIO, so large block sizes in md were not useful (ffs
splits them up), and small block sizes were very slow. E.g., with
512-blocks in the client above md and 32K-blocks in ffs, reading 32K in
the client 512 bytes at a time uses 64 reads of the same 32K-block in
ffs. Caching in the next layer of storage is usually not so bad, but it
takes a lot of CPU and a large iops in all layers to do 64 times as
many i/o's. Now the ffs block is kept until it is all read, so this
only takes a lot of CPU and a large iops in the layers between md and
ffs, but iops there is only limited by CPU (including memory).

md didn't use IO_DIRECT for writing, since it considered that to be too
slow. But it was at worst only about 3 times slower than what md did.
md also didn't use any clustering, and it normally doesn't use async
writes (this is an unusable configuration option, since async writes
can hang), so it got much the same slowness as sync mounts in ffs. The
factor of 3 slowness is from having to do a read-modify-write to write
partial blocks. This gave most of the disadvantages of not using the
buffer cache, but still gave double-caching. Now writes in md are
cached 1 block at a time, and double-caching is avoided for file
systems that support POSIX_FADV_DONTNEED.

Even non-naive clients like md have a hard time managing the block
sizes. E.g., to work as well as possible, md would first need to
understand that POSIX_FADV_DONTNEED is not supported by some file
systems and supply workarounds. In general, the details of the caching
policies and the current cache state in the lower layer(s) would have
to be understood. Even posix_fadvise(2) doesn't understand much of
that. It is only implemented at the vfs level, where the details are
not known except indirectly by their effect on the buffer and object
caches.
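The read amplification in the 512-byte example above is simple
arithmetic; a back-of-the-envelope sketch (the function names are
illustrative, not kernel interfaces):

```python
def direct_read_ops(total_bytes: int, client_bs: int) -> int:
    # With IO_DIRECT semantics nothing stays cached, so every
    # client-sized read re-fetches the whole ffs block containing it.
    return total_bytes // client_bs

def cached_read_ops(total_bytes: int, fs_bs: int) -> int:
    # With the block kept in the buffer cache until fully consumed,
    # each ffs block is fetched from backing storage once.
    return -(-total_bytes // fs_bs)  # ceiling division

# The example above: reading 32K through md, 512 bytes at a time,
# over 32K ffs blocks.
print(direct_read_ops(32 * 1024, 512))        # 64 reads of the same block
print(cached_read_ops(32 * 1024, 32 * 1024))  # 1 read
```

The 64x factor falls entirely on the layers between md and ffs once the
ffs block is kept until fully read, which is why the remaining cost is
CPU and iops in those layers rather than real disk i/o.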
There is also some confusion and bugs involving *DONTNEED and *NOREUSE:

- vop_stdadvise() only supports POSIX_FADV_DONTNEED. It does nothing
  for the stronger hint POSIX_FADV_NOREUSE.
- posix_fadvise() knows about this bug and converts POSIX_FADV_NOREUSE
  into POSIX_FADV_DONTNEED.
- ffs IO_DIRECT wants NOREUSE semantics (to kill the buffer
  completely). It gets this by not using VOP_ADVISE(), but using the
  buffer cache.
- the buffer cache has the opposite confusion and bugs. It supports
  B_NOREUSE but not B_DONTNEED. IO_DIRECT is automatically converted to
  B_NOREUSE when ffs releases the buffer. This is how ffs kills the
  buffer without knowing the details.
- my initial fixes for md did more management that would have worked
  with NOREUSE semantics. md wants to kill the buffer too, but only
  when it is full. I found that the DONTNEED semantics as implemented
  in vop_stdadvise() worked just as well.

But there is a problem with random small i/o's. My initial fixes wanted
to kill even small buffers when the next i/o is not contiguous. But
this prevents caching when caching is especially needed (it is only for
sequential i/o's that the data is expected to not be needed again).
posix_fadvise() and vop_stdadvise() have even less idea how to handle
random i/o's. I think they just don't free partial blocks.
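The hint downgrade in the first two items above can be modeled as a
tiny dispatch table. The constants match the usual <fcntl.h> values;
the function itself is a hypothetical model of the observable behavior,
not the kernel code:

```python
# Advice values as commonly defined in <fcntl.h>.
POSIX_FADV_DONTNEED = 4
POSIX_FADV_NOREUSE = 5

def vfs_advise(advice: int) -> str:
    """Model of the fallback: the syscall layer downgrades NOREUSE to
    DONTNEED because vop_stdadvise() only implements the latter, so
    both hints end up with the same (weaker) effect."""
    if advice == POSIX_FADV_NOREUSE:
        advice = POSIX_FADV_DONTNEED  # known-unimplemented; downgrade
    if advice == POSIX_FADV_DONTNEED:
        return "drop full clean blocks from the buffer cache"
    return "no-op"
```

So from userland, NOREUSE and DONTNEED are currently indistinguishable,
even though NOREUSE is supposed to be the stronger hint.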
Bruce

From owner-freebsd-fs@freebsd.org Thu Mar 7 09:24:58 2019
Date: Thu, 07 Mar 2019 10:24:52 +0100
From: Alexander Leidinger <Alexander@leidinger.net>
To: freebsd-fs@freebsd.org
Subject: Re: 'td' vs sys/fs/nfsserver/
Message-ID: <20190307102452.Horde.LIPQtoTN3klZVA6iTMOmtYl@webmail.leidinger.net>
References: <201903052008.x25K8d4Z011653@chez.mckusick.com> <169532d2770.27fa.fa4b1493b064008fe79f0f905b8e5741@Leidinger.net> <20190306145031.GC2492@kib.kiev.ua>
In-Reply-To: <20190306145031.GC2492@kib.kiev.ua>
Quoting Konstantin Belousov (from Wed, 6 Mar 2019 16:50:31 +0200):

> On Wed, Mar 06, 2019 at 02:25:16PM +0100, Alexander Leidinger via
> freebsd-fs wrote:
>> Hi,
>>
>> About code churn:
>> - Does it matter for an end user if the code repo gets bigger?
>> - Will there be an (indirect) benefit for an end user (like code
>> which is easier to understand and as such fewer bugs while changing
>> something, or more people willing to touch/improve/extend)?
> Why does it matter how this affects end users ?

We're not making changes for the sake of making changes. We make
changes to improve something. And in the end, users are the ones who
shall benefit from improvements (be it directly by bug fixing / new
features, or indirectly by code quality improvements which prevent some
bugs in the future).

>> - How many developers mirror the repo and are at the same time space
>> limited? = Does it matter for us developers?
> Why does it matter at all?

It was one of the points people complained about in the past in such
situations.

>> - How many developers are network transfer limited, and what is the
>> amount of expected change compared to a clang / openssl / ... import?
>> - At which "churn-factor" does it not make sense anymore (and why)?
> Repo churn is a situation where developers get a significant amount
> of mail with changes that they must read, but which does not change
> functionality. The changes must be read to understand the current
> state of the code after the change, to see that there are no
> un-intentional or bad intentional chunks despite the commit message.
>
> Vcs blame becomes harder to use after the churn because, to get down
> to the interesting change for the part of the code, you must skip a
> lot of no-op commits (and before skipping, the reader needs to ensure
> that the commits are no-op). Same for vcs log.
>
> Then, when you find an older change that is interesting, it does not
> match the current code state due to the churn above it.
>
> Repo churn invalidates any out-of-tree patches or development trees
> and requires effort to re-merge.

You basically say that code-refactoring is a no-go.

> Lets ignore MFC for a moment.
>
> These are the basics of why huge style changes alone, or large sets
> of trivial non-functional changes, cause a lot of backpressure. Look
> at libexec/rtld-elf for the canonical example of code breaking several
> important style(9) rules which are not corrected because that would
> cause churn.

In this thread we are not talking about style changes. To my
understanding we are talking about code-refactoring which is supposed
to lead to:
- an easier understanding for people new to the code (and we want to
  attract new people in general, right?),
- a faster understanding for those people who already had their hands
  in there but haven't looked at it for a long time,
- a simpler interface, and even
- some clarity about the inner workings which is not available now
  (I refer to the "is td always curthread" question).

I fully agree to prevent code churn in terms of style changes.
I do not agree that code-refactoring is a no-go (it may be, depending
on the situation... IMO it should be more "yes to code-refactoring
unless" instead of "no to code-refactoring unless").

Bye,
Alexander.

--
http://www.Leidinger.net Alexander@Leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org netchild@FreeBSD.org : PGP 0x8F31830F9F2772BF