From owner-freebsd-hackers@FreeBSD.ORG Mon Jul 7 13:52:12 2014 Return-Path: Delivered-To: freebsd-hackers@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 2D76AF81; Mon, 7 Jul 2014 13:52:12 +0000 (UTC) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 9655F27B7; Mon, 7 Jul 2014 13:52:11 +0000 (UTC) Received: from tom.home (kib@localhost [127.0.0.1]) by kib.kiev.ua (8.14.9/8.14.9) with ESMTP id s67DpsIh021439 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Mon, 7 Jul 2014 16:51:54 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.8.3 kib.kiev.ua s67DpsIh021439 Received: (from kostik@localhost) by tom.home (8.14.9/8.14.9/Submit) id s67DpsdY021438; Mon, 7 Jul 2014 16:51:54 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Mon, 7 Jul 2014 16:51:54 +0300 From: Konstantin Belousov To: Don Lewis Subject: Re: Strange IO performance with UFS Message-ID: <20140707135154.GM93733@kib.kiev.ua> References: <20140705195816.GV93733@kib.kiev.ua> <201407062149.s66LnGnm021769@gw.catspoiler.org> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="IhaVjqNYhz9BDJmo" Content-Disposition: inline In-Reply-To: <201407062149.s66LnGnm021769@gw.catspoiler.org> User-Agent: Mutt/1.5.23 (2014-03-12) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.0 X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on tom.home Cc: freebsd-fs@FreeBSD.org, sparvu@systemdatarecorder.org, freebsd-hackers@FreeBSD.org, roger.pau@citrix.com X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 07 Jul 2014 13:52:12 -0000 --IhaVjqNYhz9BDJmo Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sun, Jul 06, 2014 at 02:49:16PM -0700, Don Lewis wrote: > On 5 Jul, Konstantin Belousov wrote: > > On Sat, Jul 05, 2014 at 06:18:07PM +0200, Roger Pau Monn? wrote: > >> On 05/07/14 13:24, Konstantin Belousov wrote: > >> > On Sat, Jul 05, 2014 at 12:35:11PM +0200, Roger Pau Monn? wrote: > >> >> On 05/07/14 11:58, Konstantin Belousov wrote: > >> >>> On Sat, Jul 05, 2014 at 11:32:06AM +0200, Roger Pau Monn? > >> >>> wrote: > >> >>>> kernel`g_io_request+0x384 kernel`g_part_start+0x2c3=20 > >> >>>> kernel`g_io_request+0x384 kernel`g_part_start+0x2c3=20 > >> >>>> kernel`g_io_request+0x384 kernel`ufs_strategy+0x8a=20 > >> >>>> kernel`VOP_STRATEGY_APV+0xf5 kernel`bufstrategy+0x46=20 > >> >>>> kernel`cluster_read+0x5e6 kernel`ffs_balloc_ufs2+0x1be2=20 > >> >>>> kernel`ffs_write+0x310 kernel`VOP_WRITE_APV+0x166=20 > >> >>>> kernel`vn_write+0x2eb kernel`vn_io_fault_doio+0x22=20 > >> >>>> kernel`vn_io_fault1+0x78 kernel`vn_io_fault+0x173=20 > >> >>>> kernel`dofilewrite+0x85 kernel`kern_writev+0x65=20 > >> >>>> kernel`sys_write+0x63 > >> >>>>=20 > >> >>>> This can also be seen by running iostat in parallel with the > >> >>>> fio workload: > >> >>>>=20 > >> >>>> device r/s w/s kr/s kw/s qlen svc_t %b ada0=20 > >> >>>> 243.3 233.7 31053.3 29919.1 31 57.4 100 > >> >>>>=20 > >> >>>> This clearly shows that even when I was doing a sequential > >> >>>> write (the fio workload shown above), the disk was actually > >> >>>> reading more data than writing it, which makes no sense, and > >> >>>> all the reads come from the path trace shown above. > >> >>>=20 > >> >>> The backtrace above means that the BA_CLRBUF was specified for=20 > >> >>> UFS_BALLOC(). In turns, this occurs when the write size is > >> >>> less than the UFS block size. UFS has to read the block to > >> >>> ensure that partial write does not corrupt the rest of the > >> >>> buffer. > >> >>=20 > >> >> Thanks for the clarification, that makes sense. I'm not opening > >> >> the file with O_DIRECT, so shouldn't the write be cached in > >> >> memory and flushed to disk when we have the full block? It's a > >> >> sequential write, so the whole block is going to be rewritten > >> >> very soon. > >> >>=20 > >> >>>=20 > >> >>> You can get the block size for file with stat(2), st_blksize > >> >>> field of the struct stat, or using statfs(2), field f_iosize of > >> >>> struct statfs, or just looking at the dumpfs output for your > >> >>> filesystem, the bsize value. For modern UFS typical value is > >> >>> 32KB. > >> >>=20 > >> >> Yes, block size is 32KB, checked with dumpfs. I've changed the > >> >> block size in fio to 32k and then I get the expected results in > >> >> iostat and fio: > >> >>=20 > >> >> extended device statistics device r/s w/s kr/s kw/s > >> >> qlen svc_t %b ada0 1.0 658.2 31.1 84245.1 58 108.4 > >> >> 101 extended device statistics device r/s w/s kr/s > >> >> kw/s qlen svc_t %b ada0 0.0 689.8 0.0 88291.4 54 > >> >> 112.1 99 extended device statistics device r/s w/s kr/s > >> >> kw/s qlen svc_t %b ada0 1.0 593.3 30.6 75936.9 80 > >> >> 111.7 97 > >> >>=20 > >> >> write: io=3D10240MB, bw=3D81704KB/s, iops=3D2553, runt=3D128339msec > >> >=20 > >> > The current code in ffs_write() only avoids read before write when= =20 > >> > write covers complete block. I think we can somewhat loose the > >> > test to also avoid read when we are at EOF and write covers > >> > completely the valid portion of the last block. > >> >=20 > >> > This leaves the unwritten portion of the block with the garbage. I= =20 > >> > believe that it is not harmful, since the only way for usermode to= =20 > >> > access that garbage is through the mmap(2). The > >> > vnode_generic_getpages() zeroes out parts of the page which are > >> > after EOF. > >> >=20 > >> > Try this, almost completely untested: > >>=20 > >> Doesn't seem to help much, I'm still seeing the same issue. I'm > >> sampling iostat every 1s, and here's the output form the start of the > >> 4k block fio workload: > >>=20 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 349.5 0.0 44612.3 48 88.0 52 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 655.4 0.0 83773.6 76 99.8 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 699.2 0.0 89493.1 59 109.4 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 628.1 0.0 80392.6 55 114.8 98 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 655.7 0.0 83799.6 79 98.4 102 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 701.4 0.0 89782.0 80 105.5 97 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 697.9 0.0 89331.6 78 112.0 103 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 714.1 0.0 91408.7 77 110.3 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 724.0 0.0 92675.0 67 112.5 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 700.4 0.0 89646.6 49 102.5 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 686.4 0.0 87857.2 78 110.0 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 702.0 0.0 89851.6 80 112.9 97 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 736.3 0.0 94246.4 67 110.1 103 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 624.6 0.0 79950.0 48 115.7 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 704.0 0.0 90118.4 77 106.1 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 714.6 0.0 91470.0 80 103.6 99 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 710.4 0.0 90926.1 80 111.1 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 655.3 0.0 83882.1 70 115.8 99 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 539.8 0.0 69094.5 80 121.2 101 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 711.6 0.0 91087.6 79 107.9 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 705.5 0.0 90304.5 81 111.3 97 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 727.3 0.0 93092.8 81 108.9 102 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 699.5 0.0 89296.4 55 109.0 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 689.0 0.0 88066.1 78 96.6 101 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 738.3 0.0 94496.1 56 109.1 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 615.4 0.0 78770.0 80 112.3 98 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 707.3 0.0 90529.8 86 105.7 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 704.3 0.0 89333.9 67 98.3 99 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 641.3 0.0 82081.5 80 112.3 101 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 701.6 0.0 89747.9 51 101.1 101 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 693.0 0.0 88702.1 80 103.6 97 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 632.7 0.0 80991.8 80 112.0 99 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 709.0 0.0 90748.2 80 107.5 102 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 715.0 0.0 91523.0 80 104.7 101 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 650.1 0.0 83210.5 56 110.9 101 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 682.2 0.0 87319.1 57 107.9 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 719.0 0.0 92032.6 80 103.6 99 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 624.3 0.0 79905.8 80 110.5 97 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 696.5 0.0 89151.7 80 109.9 103 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 664.2 0.0 85017.6 77 109.9 101 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 681.7 0.0 87254.0 80 107.5 98 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 668.5 0.0 85569.3 57 109.9 99 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 682.3 0.0 87329.0 53 110.8 102 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 0.0 643.9 0.0 82420.9 77 104.8 101 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 107.5 457.1 13701.7 58471.3 57 106.0 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 220.9 253.9 28281.4 32498.9 54 108.8 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 290.6 277.9 37198.8 35576.1 65 94.3 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 309.3 267.9 39590.7 34295.9 80 89.5 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 213.6 302.0 27212.7 38562.0 24 93.5 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 232.1 224.3 29712.5 28339.8 31 117.4 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 262.9 249.4 33654.0 31928.1 47 81.4 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 232.2 229.2 29721.6 29340.5 50 78.5 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 222.8 229.4 28430.0 29362.7 42 85.9 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 231.5 246.5 29628.8 31555.9 6 72.9 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 261.7 256.8 33498.7 32769.1 33 83.9 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 262.7 260.7 33628.3 33279.4 35 85.5 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 234.0 249.1 29867.9 31883.1 18 90.9 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 252.1 239.8 32263.0 30581.4 32 91.2 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 241.5 257.5 30917.0 32961.1 16 69.5 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 257.9 243.5 33011.9 31164.2 32 86.8 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 237.5 235.6 30311.2 30046.9 31 67.4 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 290.4 213.1 37172.8 27277.0 79 65.3 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 216.4 284.3 27703.7 36392.5 42 95.4 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 223.8 248.2 28645.1 31774.4 16 69.4 89 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 294.0 217.7 37544.4 27864.2 64 68.0 110 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 210.7 245.6 26966.6 31439.8 59 107.4 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 228.5 265.2 29246.6 33940.5 10 99.2 98 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 279.1 218.4 35727.2 27955.0 52 71.9 102 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 232.3 293.4 29607.9 37521.4 14 93.2 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 299.5 236.6 38340.2 30288.8 79 69.7 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 216.3 268.9 27686.3 34417.3 4 90.5 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 285.8 261.0 36585.3 33409.5 53 84.6 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 228.5 232.5 29059.7 29661.1 48 74.3 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 242.7 262.4 31060.0 33588.2 27 69.9 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 248.2 252.2 31766.1 32149.3 8 78.9 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 267.9 230.2 34288.6 29462.8 62 68.5 100 > >> extended device statistics > >> device r/s w/s kr/s kw/s qlen svc_t %b > >> ada0 238.0 266.2 30375.8 34075.6 0 95.4 100 > >>=20 > >> As can be seen from the log above, at first the workload runs fine, > >> and the disk is only performing writes, but at some point (in this > >> case around 40% of completion) it starts performing this > >> read-before-write dance that completely screws up performance. > >=20 > > I reproduced this locally. I think my patch is useless for the fio/4k = write > > situation. > >=20 > > What happens is indeed related to the amount of the available memory. > > When the size of the file written by fio is larger than the memory, > > system has to recycle the cached pages. So after some moment, doing > > a write has to do read-before-write, and this occurs not at the EOF > > (since fio pre-allocated the job file). >=20 > It would seem to be much better to recycle pages associated parts of the > file that haven't been touched in a long time before recycling pages > associated with the filesystem block that is currently being written. If > the writes are sequential, then it definitely makes sense to hang on to > the block until the last portion of the block is written. It sounds > like we are doing pretty much the opposite of this. Yes, this sounds suspicious. >=20 > What seems odd is that it sounds like we are detecting the partial write > of a block, reading the block from the disk, updating it with the new > partial data, writing the block, and then immediately tossing it. It > seems odd that the the dirty block isn't allowed to stick around until > the syncer causes it to be written, with clean blocks being reclaimed in > the meantime. No, we don't. The amount of read and written bytes in the read-after-write state are equal. We read the whole buffer, then write 8 chunks by 4KB, all using the cached and delayed-write buffer. Would we indeed tossed the buffer after each write, then the read bytes count would be 8 times the write. --IhaVjqNYhz9BDJmo Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAEBAgAGBQJTuqX5AAoJEJDCuSvBvK1BxlsP/1iqvd/uelAk/fj6y6Br4cJH r2rC4SSn/+jhLRRmq4rzV913bz5+lf3KNARvnTiGwXmZhKr0U8AE98N+LN5F9Iba rJj6iPtIiuhd9EYGKHb19CC84bME70kQ0mrQ+cER0sd2SLRCGtEsIupfiEJ71Gp1 TcMTNnQ2qI50Ri8NTN/7vi0IMwZo5XPYGTX61KBxE17XYCMghQttz+pqoAd7JY7Q G/VlANlF5VTetcRGUSJ+6+v9J1QSRXTL8pN9oynmaqu+4A5vKBDppNys237Xe9Fg cGhg7K8zGgDl9MdMal8d0sLbtfmHNzvTKktUFdsvLonBiC+EOdiOzGo3tfV2Nc5r CjoU/lLPPo80RTCAo8xFaALqFuqjeSgmSDHnO2TsVSm6u7HScp3BfdXPwpIJWsbp xDcDkuzvH5f6/FkXM5nD0FznyVCwPN5H6VTID45fN0BWhH7+YMklDGyRFsAZHUNN 3e8Ye7QjMEVqSRG2m5dkoXwt34f213xiqz87r/KZh8s4RZVed7SHNsMoW5l5LTCf dDE0CjpUuaY9+xBl5pS9EFj4qctkVgI01UJ/BKy/3fubSkPfaV4CuE7BTXKCyr6q MIA9Xlr5XDkYq/dOUmI3IC6rqdGMTwIHQPGFbNkSTKyVvmLVjCUBtljSS26XCix8 tICaW7erKfH//jPOGhp+ =WJNK -----END PGP SIGNATURE----- --IhaVjqNYhz9BDJmo--