From owner-freebsd-fs@FreeBSD.ORG Wed Jul  9 12:54:00 2014
Date: Wed, 9 Jul 2014 15:53:49 +0300
From: Konstantin Belousov <kostikbel@gmail.com>
To: Bruce Evans
Cc: freebsd-fs@freebsd.org, sparvu@systemdatarecorder.org, Don Lewis,
    freebsd-hackers@freebsd.org
Subject: Re: Strange IO performance with UFS
Message-ID: <20140709125349.GV93733@kib.kiev.ua>
In-Reply-To: <20140709213958.K1732@besplex.bde.org>

On Wed, Jul 09, 2014 at 10:23:48PM +1000, Bruce Evans wrote:
> On Tue, 8 Jul 2014, Don Lewis wrote:
>
> > On 5 Jul, Konstantin Belousov wrote:
> >> On Sat, Jul 05, 2014 at 06:18:07PM +0200, Roger Pau Monné wrote:
> >
> >>> As can be seen from the log above, at first the workload runs fine,
> >>> and the disk is only performing writes, but at some point (in this
> >>> case around 40% of completion) it starts performing this
> >>> read-before-write dance that completely screws up performance.
> >>
> >> I reproduced this locally.  I think my patch is useless for the
> >> fio/4k write situation.
> >>
> >> What happens is indeed related to the amount of available memory.
> >> When the size of the file written by fio is larger than memory, the
> >> system has to recycle the cached pages.  So at some point, doing a
> >> write has to do a read-before-write, and this occurs not at the EOF
> >> (since fio pre-allocated the job file).
> >
> > I reproduced this locally with dd if=/dev/zero bs=4k conv=notrunc ...
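The same partial-block rewrite pattern is easy to reproduce without dd
or fio.  A minimal C sketch, assuming a pre-allocated file named
"testfile" on the filesystem under test (the name and the 4k chunk size
are arbitrary choices, not anything from the thread):

#include <sys/types.h>
#include <err.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Rewrite an existing file in place in 4k chunks.  On a UFS with 32k
 * blocks, each pwrite() dirties only part of a block, so once the
 * cached pages have been recycled the kernel must read the block back
 * from disk before applying the 4k of new data.
 */
int
main(void)
{
	char buf[4096];
	off_t off, end;
	int fd;

	fd = open("testfile", O_WRONLY);
	if (fd == -1)
		err(1, "open");
	end = lseek(fd, 0, SEEK_END);	/* size of the existing file */
	memset(buf, 'x', sizeof(buf));
	for (off = 0; off < end; off += sizeof(buf))
		if (pwrite(fd, buf, sizeof(buf), off) != (ssize_t)sizeof(buf))
			err(1, "pwrite");
	close(fd);
	return (0);
}

Run it once so the file fills the cache, then again after the pages
have been recycled, and the second run should show the reads.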
> > For the small file case, if I flush the file from cache by unmounting
> > the filesystem where it resides and then remounting the filesystem,
> > then I see lots of reads right from the start.
>
> This seems to be related to kern/178997: Heavy disk I/O may hang system.
> Test programs doing more complicated versions of conv=notrunc caused
> even worse problems when run in parallel.  I lost track of what happened
> with that.  I think kib committed a partial fix that doesn't apply to
> the old version of FreeBSD that I use.

I do not think this is related to kern/178997.  Yes, kern/178997 is only
partially fixed: parallel reads and a starved writer could still cause a
buffer cache livelock.  On the other hand, I am not sure how feasible it
is to create a realistic test case for this.  A fix would not be easy.

>
> >> In fact, I used a 10G file on an 8G machine, but I interrupted the
> >> fio job before it finished.  The longer the previous job runs, the
> >> longer the new job goes without issuing reads.  If I allow the job
> >> to completely fill the cache, then the reads start immediately on
> >> the next job run.
> >>
> >> I do not see how anything could be changed there, if we want to keep
> >> the user file content on partial block writes, and we do.
> >
> > About the only thing I can think of that might help is to trigger
> > readahead when we detect sequential small writes.  We'll still have to
> > do the reads, but hopefully they will be larger and occupy less time
> > in the critical path.
>
> ffs_balloc*() already uses cluster_write(), so sequential small writes
> already normally do at least 128K of readahead and you should rarely
> see the 4K-reads (except with O_DIRECT?).

You mean cluster_read().  Indeed, ffs_balloc* already does this.  This
is also useful since it preallocates vnode pages, making writes even
less blocking.

>
> msdosfs is missing this readahead.  I never got around to sending
> my patches for this to kib in the PR 178997 discussion.
>
> Here I see full clustering with 64K-clusters on the old version of
> FreeBSD, but my drive doesn't like going back and forth, so the writes
> go 8 times as slow as without the reads instead of only 2 times as
> slow.  (It's an old ATA drive with a ~1MB buffer, but apparently it
> has dumb firmware, so seeking back just 64K is too much for it to
> cache.)  I just remembered I have a newer SATA drive with a ~32MB
> buffer.  It only goes 3 times as slow.  The second drive is also on a
> not quite so old version of FreeBSD that certainly doesn't have any
> workarounds for PR 178997.  All file systems were mounted async, which
> shouldn't affect this much.
>
> > Writing a multiple of the filesystem block size is still the most
> > efficient strategy.
>
> Except when the filesystem block size is too large to be efficient.
> The FreeBSD ffs default block size of 32K is slow for small files.
> Fragments reduce its space wastage but interact badly with the
> buffer cache.  Linux avoids some of these problems by using smaller
> filesystem block sizes and not using fragments (at least in old
> filesystems).
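To put rough numbers on the slowdown ratios above (a back-of-the-envelope
estimate, not a measurement): sequential 4k writes coalesce in the buffer
cache into full 32K blocks, so in the steady state each block costs one
32K read plus one 32K write:

pure write:             32K of disk traffic per 32K of payload (1x)
read-before-write:      32K read + 32K write per 32K of payload (2x)

So 2x is the floor for this workload, and anything beyond it (the 8x and
3x figures above) is the cost of the head seeking back and forth between
the read position and the write position.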
>
> Bruce