From owner-freebsd-fs@FreeBSD.ORG Wed Jul  9 12:54:00 2014
Date: Wed, 9 Jul 2014 15:53:49 +0300
From: Konstantin Belousov <kostikbel@gmail.com>
To: Bruce Evans
Cc: freebsd-fs@freebsd.org, sparvu@systemdatarecorder.org, Don Lewis,
    freebsd-hackers@freebsd.org
Subject: Re: Strange IO performance with UFS
Message-ID: <20140709125349.GV93733@kib.kiev.ua>
In-Reply-To: <20140709213958.K1732@besplex.bde.org>

On Wed, Jul 09, 2014 at 10:23:48PM +1000, Bruce Evans wrote:
> On Tue, 8 Jul 2014, Don Lewis wrote:
>
> > On 5 Jul, Konstantin Belousov wrote:
> >> On Sat, Jul 05, 2014 at 06:18:07PM +0200, Roger Pau Monné wrote:
> >
> >>> As can be seen from the log above, at first the workload runs fine,
> >>> and the disk is only performing writes, but at some point (in this
> >>> case around 40% of completion) it starts performing this
> >>> read-before-write dance that completely screws up performance.
> >>
> >> I reproduced this locally.  I think my patch is useless for the
> >> fio/4k write situation.
> >>
> >> What happens is indeed related to the amount of available memory.
> >> When the size of the file written by fio is larger than memory, the
> >> system has to recycle the cached pages.  So at some point, doing a
> >> write has to do a read-before-write, and this occurs not at the EOF
> >> (since fio pre-allocated the job file).
> >
> > I reproduced this locally with dd if=/dev/zero bs=4k conv=notrunc ...
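The same partial-block rewrite pattern is easy to reproduce without dd
or fio.  A minimal C sketch, assuming a pre-allocated file named
"testfile" on the filesystem under test (the name and the 4k chunk size
are arbitrary choices, not anything from the thread):

#include <sys/types.h>
#include <err.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Rewrite an existing file in place in 4k chunks.  On a UFS with 32k
 * blocks, each pwrite() dirties only part of a block, so once the
 * cached pages have been recycled the kernel must read the block back
 * from disk before applying the 4k of new data.
 */
int
main(void)
{
	char buf[4096];
	off_t off, end;
	int fd;

	fd = open("testfile", O_WRONLY);
	if (fd == -1)
		err(1, "open");
	end = lseek(fd, 0, SEEK_END);	/* size of the existing file */
	memset(buf, 'x', sizeof(buf));
	for (off = 0; off < end; off += sizeof(buf))
		if (pwrite(fd, buf, sizeof(buf), off) != (ssize_t)sizeof(buf))
			err(1, "pwrite");
	close(fd);
	return (0);
}

Run it once so the file fills the cache, then again after the pages
have been recycled, and the second run should show the reads.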
> > For the small file case, if I flush the file from cache by unmounting
> > the filesystem where it resides and then remounting the filesystem,
> > then I see lots of reads right from the start.
>
> This seems to be related to kern/178997: Heavy disk I/O may hang system.
> Test programs doing more complicated versions of conv=notrunc caused
> even worse problems when run in parallel.  I lost track of what happened
> with that.  I think kib committed a partial fix that doesn't apply to
> the old version of FreeBSD that I use.

I do not think this is related to kern/178997.  Yes, kern/178997 is only
partially fixed: parallel reads and a starved writer could still cause a
buffer cache livelock.  On the other hand, I am not sure how feasible it
is to create a realistic test case for this.  A fix would not be easy.

>
> >> In fact, I used a 10G file on an 8G machine, but I interrupted the
> >> fio job before it finished.  The longer the previous job runs, the
> >> longer the new job goes without issuing reads.  If I allow the job
> >> to completely fill the cache, then the reads start immediately on
> >> the next job run.
> >>
> >> I do not see how anything could be changed there, if we want to keep
> >> the user file content on partial block writes, and we do.
> >
> > About the only thing I can think of that might help is to trigger
> > readahead when we detect sequential small writes.  We'll still have to
> > do the reads, but hopefully they will be larger and occupy less time
> > in the critical path.
>
> ffs_balloc*() already uses cluster_write(), so sequential small writes
> already normally do at least 128K of readahead and you should rarely
> see the 4K-reads (except with O_DIRECT?).

You mean cluster_read().  Indeed, ffs_balloc* already does this.  This
is also useful since it preallocates vnode pages, making writes even
less blocking.

>
> msdosfs is missing this readahead.  I never got around to sending
> my patches for this to kib in the PR 178997 discussion.
>
> Here I see full clustering with 64K-clusters on the old version of
> FreeBSD, but my drive doesn't like going back and forth, so the writes
> go 8 times as slow as without the reads instead of only 2 times as
> slow.  (It's an old ATA drive with a ~1MB buffer, but apparently it
> has dumb firmware, so seeking back just 64K is too much for it to
> cache.)  I just remembered I have a newer SATA drive with a ~32MB
> buffer.  It only goes 3 times as slow.  The second drive is also on a
> not quite so old version of FreeBSD that certainly doesn't have any
> workarounds for PR 178997.  All file systems were mounted async, which
> shouldn't affect this much.
>
> > Writing a multiple of the filesystem block size is still the most
> > efficient strategy.
>
> Except when the filesystem block size is too large to be efficient.
> The FreeBSD ffs default block size of 32K is slow for small files.
> Fragments reduce its space wastage but interact badly with the
> buffer cache.  Linux avoids some of these problems by using smaller
> filesystem block sizes and not using fragments (at least in old
> filesystems).
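To put rough numbers on the slowdown ratios above (a back-of-the-envelope
estimate, not a measurement): sequential 4k writes coalesce in the buffer
cache into full 32K blocks, so in the steady state each block costs one
32K read plus one 32K write:

pure write:             32K of disk traffic per 32K of payload (1x)
read-before-write:      32K read + 32K write per 32K of payload (2x)

So 2x is the floor for this workload, and anything beyond it (the 8x and
3x figures above) is the cost of the head seeking back and forth between
the read position and the write position.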
>
> Bruce