Date:      Thu, 7 Mar 2019 16:23:27 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Conrad Meyer <cem@freebsd.org>
Cc:        Rick Macklem <rmacklem@uoguelph.ca>,  "bugzilla-noreply@freebsd.org" <bugzilla-noreply@freebsd.org>,  "fs@FreeBSD.org" <fs@freebsd.org>
Subject:   Re: [Bug 235774] [FUSE]: Need to evict invalidated cache contents on fuse_write_directbackend()
Message-ID:  <20190307150927.L932@besplex.bde.org>
In-Reply-To: <CAG6CVpWyMeNBfZqRF7xvw8RAbDtM_iYCWdw0bxvKCOxcWDmP2w@mail.gmail.com>
References:  <bug-235774-3630@https.bugs.freebsd.org/bugzilla/> <bug-235774-3630-gv5OeBYwCK@https.bugs.freebsd.org/bugzilla/> <QB1PR01MB35379C69D7EE70F000B1975FDD730@QB1PR01MB3537.CANPRD01.PROD.OUTLOOK.COM> <CAG6CVpWyMeNBfZqRF7xvw8RAbDtM_iYCWdw0bxvKCOxcWDmP2w@mail.gmail.com>

[bugzilla kills replies, but is in the Cc list twice]

On Wed, 6 Mar 2019, Conrad Meyer wrote:

> On Wed, Mar 6, 2019 at 1:32 PM Rick Macklem <rmacklem@uoguelph.ca> wrote:
>>
>> --- Comment #4 from Conrad Meyer <cem@freebsd.org> ---
>>> I think fuse's IO_DIRECT path is a mess.  Really all IO should go through the
>>> buffer cache, and B_DIRECT and ~B_CACHE are just flags that control the
>>> buffer's lifetime once the operation is complete.  Removing the "direct"
>>> backends entirely (except as implementation details of strategy()) would
>>> simplify and correct the caching logic.
>>
>> Hmm, I'm not sure that I agree that all I/O should go through the buffer cache,
>> in general. (I won't admit to knowing the fuse code well enough to comment
>> specifically on it.)
>
> The scope of the bug and comment you've replied to is just FUSE IO.
>
>> … having the NFS (or FUSE) client do a
>> large amount of writing to a file can flood the buffer cache and avoiding this
>> for the case where the client won't be reading the file would be nice.
>> What I am not sure is whether O_DIRECT is a good indicator of "doing a lot of
>> writing that won't be read back".
>
> This is the known failure mode of LRU cache policies plus finite cache
> size plus naive clients.  It's not specific to any particular
> filesystem.  You can either enlarge your LRU cache to incorporate the
> entire working set size, incorporate frequency of access in eviction
> policy, or have smart clients provide hints (e.g.,
> POSIX_FADV_DONTNEED).  O_DIRECT -> IO_DIRECT -> B_DIRECT is already
> used as a hint in the bufcache to release bufs/pages aggressively.

It is mostly a failure with naive clients.  Some are so naive that they even
trust the implementation of O_DIRECT to be any good.  Here the naive client
is mostly FUSE.

I fixed this in the md device using POSIX_FADV_DONTNEED and an optional
new caching option that turns this off.  Clients above md can still get
slowness by using block sizes too different from the block sizes (if any)
used by the backing storage, but unlike IO_DIRECT, POSIX_FADV_DONTNEED is
only a hint and it only discards full blocks from the buffer cache for
file systems that use the buffer cache.

zfs doesn't use the buffer cache and most of posix_fadvise(2) including
all of POSIX_FADV_DONTNEED is just a stub that has no effect for it.
zfs also doesn't support IO_DIRECT, so the attempted pessimizations
from using IO_DIRECT for md had no effect.

ffs has a fairly bad implementation of IO_DIRECT.  For writing, it does
the write using the buffer cache and then kills the buffer.  The result
for full blocks is the same as for a normal write followed by
POSIX_FADV_DONTNEED.  The result for a partial block is to kill the
buffer while POSIX_FADV_DONTNEED would keep it.  For reading, it does
much the same unless the optional DIRECTIO option is configured.  Then
the buffer cache is not used at all.  This seems to make no significant
difference when all i/o is direct.  Normal methods using 1 buffer at a
time won't thrash the buffer cache.  Rawread uses a pbuf and pbufs are
a more limited resource with more primitive management, so it might
actually be slower.  zfs also doesn't support DIRECTIO.

md used to use IO_DIRECT only for reading.  With vnode backing on ffs,
only reading with the same block size as ffs was reasonably efficient.
IO_DIRECT prevents normal clustering even without DIRECTIO, so large
block sizes in md were not useful (ffs splits them up), and small block
sizes were very small.  E.g., with 512-blocks in the client above md and
32K-blocks in ffs, reading 32K in the client 512 bytes at a time uses
64 reads of the same 32K-block in ffs.  Caching in the next layer of
storage is usually not so bad, but it takes a lot of CPU and a large
iops in all layers to do 64 times as many i/o's.  Now the ffs block is
kept until it is all read, so this only takes a lot of CPU and a large
iops in layers between md and ffs, but iops there is only limited by
CPU (including memory).

md didn't use IO_DIRECT for writing, since it considered that to be too
slow.  But it was at worst only about 3 times slower than what md did.
md also didn't use any clustering, and it normally doesn't use async
writes (this is an unusable configuration option, since async writes
can hang), so it got much the same slowness as sync mounts in ffs.
The factor of 3 slowness is from having to do a read-modify-write to
write partial blocks.  This gave most of the disadvantages of not using
the buffer cache, but still gave double-caching.  Now writes in md are
cached 1 block at a time and double-caching is avoided for file systems
that support POSIX_FADV_DONTNEED.

Even non-naive clients like md have a hard time managing the block sizes.
E.g., to work as well as possible, md would first need to understand that
POSIX_FADV_DONTNEED is not supported by some file systems and supply
workarounds.  In general, the details of the caching policies and current
cache state in the lower layer(s) would have to be understood.  Even
posix_fadvise(2) doesn't understand much of that.  It is only implemented
at the vfs level where the details are not known except indirectly by their
effect on the buffer and object caches.

There is also some confusion and bugs involving *DONTNEED and *NOREUSE:
- vop_stdadvise() only supports POSIX_FADV_DONTNEED.  It does nothing for
   the stronger hint POSIX_FADV_NOREUSE.
- posix_fadvise() knows about this bug and converts POSIX_FADV_NOREUSE
   into POSIX_FADV_DONTNEED.
- ffs IO_DIRECT wants NOREUSE semantics (to kill the buffer completely).
   It gets this by not using VOP_ADVISE(), but using the buffer cache.
- the buffer cache has the opposite confusion and bugs.  It supports
   B_NOREUSE but not B_DONTNEED.  IO_DIRECT is automatically converted to
   B_NOREUSE when ffs releases the buffer.  This is how ffs kills the
   buffer without knowing the details.
- my initial fixes for md did more management that would have worked with
   NOREUSE semantics.  md wants to kill the buffer too, but only when it
   is full.  I found that the DONTNEED semantics as implemented in
   vop_stdadvise() worked just as well.  But there is a problem with
   random small i/o's.  My initial fixes wanted to kill even small buffers
   when the next i/o is not contiguous.  But this prevents caching when
   caching is especially needed (it is only sequential i/o's where the
   data is expected to not be needed again).  posix_fadvise and
   vop_stdadvise() have even less idea how to handle random i/o's.  I
   think they just don't free partial blocks.

Bruce

From owner-freebsd-fs@freebsd.org  Thu Mar  7 09:24:58 2019
Date:      Thu, 07 Mar 2019 10:24:52 +0100
From:      Alexander Leidinger <Alexander@leidinger.net>
To:        freebsd-fs@freebsd.org
Subject:   Re: 'td' vs sys/fs/nfsserver/
Message-ID:  <20190307102452.Horde.LIPQtoTN3klZVA6iTMOmtYl@webmail.leidinger.net>
In-Reply-To: <20190306145031.GC2492@kib.kiev.ua>
References:  <201903052008.x25K8d4Z011653@chez.mckusick.com> <169532d2770.27fa.fa4b1493b064008fe79f0f905b8e5741@Leidinger.net> <20190306145031.GC2492@kib.kiev.ua>

Quoting Konstantin Belousov <kib@freebsd.org> (from Wed, 6 Mar 2019 16:50:31 +0200):

> On Wed, Mar 06, 2019 at 02:25:16PM +0100, Alexander Leidinger via
> freebsd-fs wrote:
>> Hi,
>>
>> About code churn:
>>  - Does it matter for an end user if the code repo gets bigger?
>>  - Will there be an (indirect) benefit for an end user (like code which is
>> more easy to understand and as such less bugs while changing something or
>> more people willing to touch/improve/extend)?
> Why does it matter how this affects end users ?

We're not making changes for the sake of making changes. We make
changes to improve something. And in the end, users are the ones who
shall benefit from improvements (be it directly by bug fixing / new
features, or indirectly by code quality improvements which prevent some
bugs in the future).

>>  - How many developers mirror the repo and are at the same time space
>> limited? = Does it matter for us developers?
> Why does it matter at all?

It was one of the points people complained about in the past in such=20=20
situations.

>>  - How many developers are network transfer limited and what is the amount
>> of expected change compared to a clang / openssl /.... import?
>>  - At which "churn-factor" does it not make sense anymore (and why)?
> Repo churn is a situation where developers get significant amount
> of mail with changes that they must read, but which does not change
> functionality. The changes must be read to understand the current state
> of the code after the change, to see that there is no un-intentional or
> bad intentional chunks despite the commit message.
>
> Vcs blame becomes harder to use after the churn because to get down to
> the interesting change for the part of the code, you must skip a lot
> of no-op commits (and before skipping, reader needs to ensure that the
> commits are no-op). Same for vcs log.
>
> Then, when you find the older change that is interesting, it does not
> match the current code state due to the churn above it.
>
> Repo churn invalidates any out-of-tree patches or development trees
> and requires efforts to re-merge.

You basically say that code-refactoring is a no-go.

> Lets ignore MFC for a moment.
>
> These are basics why huge style changes alone, or large set of trivial
> non-functional changes cause a lot of backpressure.  Look at libexec/rtld-elf
> for the canonical example of code breaking several important style(9)
> rules which are not corrected because that would cause churn.

In this thread we are not talking about style changes. To my
understanding we are talking about code-refactoring which is supposed
to lead to
  - easier understanding for people new to the code (and we want
to attract new people in general, right?),
  - faster understanding for those people who had their hands
already in there but didn't have a look at it for a long time,
  - a simpler interface, and even
  - some clarity about the inner workings which is not available now
(I refer to the "is td always currthread" question).

I fully agree with preventing code churn in terms of style changes.
I do not agree that code-refactoring is a no-go (it may be, depending
on the situation... IMO it should be more "yes to code-refactoring
unless" instead of "no to code-refactoring unless").

Bye,
Alexander.

-- 
http://www.Leidinger.net Alexander@Leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org    netchild@FreeBSD.org  : PGP 0x8F31830F9F2772BF
