Date:        Sun, 6 Jul 2014 14:49:16 -0700 (PDT)
From:        Don Lewis <truckman@FreeBSD.org>
To:          kostikbel@gmail.com
Cc:          freebsd-fs@FreeBSD.org, sparvu@systemdatarecorder.org,
             freebsd-hackers@FreeBSD.org, roger.pau@citrix.com
Subject:     Re: Strange IO performance with UFS
Message-ID:  <201407062149.s66LnGnm021769@gw.catspoiler.org>
In-Reply-To: <20140705195816.GV93733@kib.kiev.ua>
On 5 Jul, Konstantin Belousov wrote:
> On Sat, Jul 05, 2014 at 06:18:07PM +0200, Roger Pau Monné wrote:
>> On 05/07/14 13:24, Konstantin Belousov wrote:
>> > On Sat, Jul 05, 2014 at 12:35:11PM +0200, Roger Pau Monné wrote:
>> >> On 05/07/14 11:58, Konstantin Belousov wrote:
>> >>> On Sat, Jul 05, 2014 at 11:32:06AM +0200, Roger Pau Monné wrote:
>> >>>>               kernel`g_io_request+0x384
>> >>>>               kernel`g_part_start+0x2c3
>> >>>>               kernel`g_io_request+0x384
>> >>>>               kernel`g_part_start+0x2c3
>> >>>>               kernel`g_io_request+0x384
>> >>>>               kernel`ufs_strategy+0x8a
>> >>>>               kernel`VOP_STRATEGY_APV+0xf5
>> >>>>               kernel`bufstrategy+0x46
>> >>>>               kernel`cluster_read+0x5e6
>> >>>>               kernel`ffs_balloc_ufs2+0x1be2
>> >>>>               kernel`ffs_write+0x310
>> >>>>               kernel`VOP_WRITE_APV+0x166
>> >>>>               kernel`vn_write+0x2eb
>> >>>>               kernel`vn_io_fault_doio+0x22
>> >>>>               kernel`vn_io_fault1+0x78
>> >>>>               kernel`vn_io_fault+0x173
>> >>>>               kernel`dofilewrite+0x85
>> >>>>               kernel`kern_writev+0x65
>> >>>>               kernel`sys_write+0x63
>> >>>>
>> >>>> This can also be seen by running iostat in parallel with the
>> >>>> fio workload:
>> >>>>
>> >>>>                          extended device statistics
>> >>>> device     r/s    w/s     kr/s     kw/s  qlen  svc_t   %b
>> >>>> ada0     243.3  233.7  31053.3  29919.1    31   57.4  100
>> >>>>
>> >>>> This clearly shows that even though I was doing a sequential
>> >>>> write (the fio workload shown above), the disk was actually
>> >>>> reading more data than it was writing, which makes no sense,
>> >>>> and all the reads come from the path trace shown above.
>> >>>
>> >>> The backtrace above means that BA_CLRBUF was specified for
>> >>> UFS_BALLOC().  In turn, this occurs when the write size is
>> >>> less than the UFS block size.  UFS has to read the block to
>> >>> ensure that a partial write does not corrupt the rest of the
>> >>> buffer.
>> >>
>> >> Thanks for the clarification, that makes sense.  I'm not opening
>> >> the file with O_DIRECT, so shouldn't the write be cached in
>> >> memory and flushed to disk when we have the full block?  It's a
>> >> sequential write, so the whole block is going to be rewritten
>> >> very soon.
>> >>
>> >>> You can get the block size for a file with stat(2), the
>> >>> st_blksize field of struct stat, or with statfs(2), the f_iosize
>> >>> field of struct statfs, or just by looking at the dumpfs output
>> >>> for your filesystem, the bsize value.  For modern UFS the
>> >>> typical value is 32KB.
>> >>
>> >> Yes, the block size is 32KB, checked with dumpfs.  I've changed
>> >> the block size in fio to 32k and now I get the expected results
>> >> in iostat and fio:
>> >>
>> >>                          extended device statistics
>> >> device     r/s    w/s     kr/s     kw/s  qlen  svc_t   %b
>> >> ada0       1.0  658.2     31.1  84245.1    58  108.4  101
>> >> ada0       0.0  689.8      0.0  88291.4    54  112.1   99
>> >> ada0       1.0  593.3     30.6  75936.9    80  111.7   97
>> >>
>> >> write: io=10240MB, bw=81704KB/s, iops=2553, runt=128339msec
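As an aside, the st_blksize and f_iosize values mentioned above are easy
to check from a small program as well as with dumpfs.  Here is a minimal
sketch (FreeBSD-specific, untested, and not from the original
discussion) that prints both values for a path given on the command
line; on the filesystem discussed above both should report the 32KB
block size that dumpfs shows:

/* blksize.c: print the block size reported by stat(2) and statfs(2). */
#include <sys/param.h>
#include <sys/mount.h>
#include <sys/stat.h>

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
	struct stat sb;
	struct statfs fsb;

	if (argc != 2) {
		fprintf(stderr, "usage: %s path\n", argv[0]);
		exit(1);
	}
	if (stat(argv[1], &sb) != 0 || statfs(argv[1], &fsb) != 0) {
		perror(argv[1]);
		exit(1);
	}
	/* st_blksize is the preferred I/O size for this file,
	 * f_iosize the optimal transfer block size of the filesystem. */
	printf("st_blksize: %ju\n", (uintmax_t)sb.st_blksize);
	printf("f_iosize:   %ju\n", (uintmax_t)fsb.f_iosize);
	return (0);
}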
>> > The current code in ffs_write() only avoids the read-before-write
>> > when the write covers a complete block.  I think we can somewhat
>> > loosen the test to also avoid the read when we are at EOF and the
>> > write completely covers the valid portion of the last block.
>> >
>> > This leaves the unwritten portion of the block filled with garbage.
>> > I believe that this is not harmful, since the only way for usermode
>> > to access that garbage is through mmap(2), and
>> > vnode_generic_getpages() zeroes out the parts of the page which are
>> > after EOF.
>> >
>> > Try this, almost completely untested:
>>
>> Doesn't seem to help much, I'm still seeing the same issue.  I'm
>> sampling iostat every 1s, and here's the output from the start of the
>> 4k block fio workload:
>>
>>                          extended device statistics
>> device     r/s    w/s     kr/s     kw/s  qlen  svc_t   %b
>> ada0       0.0  349.5      0.0  44612.3    48   88.0   52
>> ada0       0.0  655.4      0.0  83773.6    76   99.8  100
>> ada0       0.0  699.2      0.0  89493.1    59  109.4  100
>> ada0       0.0  628.1      0.0  80392.6    55  114.8   98
>> ada0       0.0  655.7      0.0  83799.6    79   98.4  102
>> ada0       0.0  701.4      0.0  89782.0    80  105.5   97
>> ada0       0.0  697.9      0.0  89331.6    78  112.0  103
>> ada0       0.0  714.1      0.0  91408.7    77  110.3  100
>> ada0       0.0  724.0      0.0  92675.0    67  112.5  100
>> ada0       0.0  700.4      0.0  89646.6    49  102.5  100
>> ada0       0.0  686.4      0.0  87857.2    78  110.0  100
>> ada0       0.0  702.0      0.0  89851.6    80  112.9   97
>> ada0       0.0  736.3      0.0  94246.4    67  110.1  103
>> ada0       0.0  624.6      0.0  79950.0    48  115.7  100
>> ada0       0.0  704.0      0.0  90118.4    77  106.1  100
>> ada0       0.0  714.6      0.0  91470.0    80  103.6   99
>> ada0       0.0  710.4      0.0  90926.1    80  111.1  100
>> ada0       0.0  655.3      0.0  83882.1    70  115.8   99
>> ada0       0.0  539.8      0.0  69094.5    80  121.2  101
>> ada0       0.0  711.6      0.0  91087.6    79  107.9  100
>> ada0       0.0  705.5      0.0  90304.5    81  111.3   97
>> ada0       0.0  727.3      0.0  93092.8    81  108.9  102
>> ada0       0.0  699.5      0.0  89296.4    55  109.0  100
>> ada0       0.0  689.0      0.0  88066.1    78   96.6  101
>> ada0       0.0  738.3      0.0  94496.1    56  109.1  100
>> ada0       0.0  615.4      0.0  78770.0    80  112.3   98
>> ada0       0.0  707.3      0.0  90529.8    86  105.7  100
>> ada0       0.0  704.3      0.0  89333.9    67   98.3   99
>> ada0       0.0  641.3      0.0  82081.5    80  112.3  101
>> ada0       0.0  701.6      0.0  89747.9    51  101.1  101
>> ada0       0.0  693.0      0.0  88702.1    80  103.6   97
>> ada0       0.0  632.7      0.0  80991.8    80  112.0   99
>> ada0       0.0  709.0      0.0  90748.2    80  107.5  102
>> ada0       0.0  715.0      0.0  91523.0    80  104.7  101
>> ada0       0.0  650.1      0.0  83210.5    56  110.9  101
>> ada0       0.0  682.2      0.0  87319.1    57  107.9  100
>> ada0       0.0  719.0      0.0  92032.6    80  103.6   99
>> ada0       0.0  624.3      0.0  79905.8    80  110.5   97
>> ada0       0.0  696.5      0.0  89151.7    80  109.9  103
>> ada0       0.0  664.2      0.0  85017.6    77  109.9  101
>> ada0       0.0  681.7      0.0  87254.0    80  107.5   98
>> ada0       0.0  668.5      0.0  85569.3    57  109.9   99
>> ada0       0.0  682.3      0.0  87329.0    53  110.8  102
>> ada0       0.0  643.9      0.0  82420.9    77  104.8  101
>> ada0     107.5  457.1  13701.7  58471.3    57  106.0  100
>> ada0     220.9  253.9  28281.4  32498.9    54  108.8  100
>> ada0     290.6  277.9  37198.8  35576.1    65   94.3  100
>> ada0     309.3  267.9  39590.7  34295.9    80   89.5  100
>> ada0     213.6  302.0  27212.7  38562.0    24   93.5  100
>> ada0     232.1  224.3  29712.5  28339.8    31  117.4  100
>> ada0     262.9  249.4  33654.0  31928.1    47   81.4  100
>> ada0     232.2  229.2  29721.6  29340.5    50   78.5  100
>> ada0     222.8  229.4  28430.0  29362.7    42   85.9  100
>> ada0     231.5  246.5  29628.8  31555.9     6   72.9  100
>> ada0     261.7  256.8  33498.7  32769.1    33   83.9  100
>> ada0     262.7  260.7  33628.3  33279.4    35   85.5  100
>> ada0     234.0  249.1  29867.9  31883.1    18   90.9  100
>> ada0     252.1  239.8  32263.0  30581.4    32   91.2  100
>> ada0     241.5  257.5  30917.0  32961.1    16   69.5  100
>> ada0     257.9  243.5  33011.9  31164.2    32   86.8  100
>> ada0     237.5  235.6  30311.2  30046.9    31   67.4  100
>> ada0     290.4  213.1  37172.8  27277.0    79   65.3  100
>> ada0     216.4  284.3  27703.7  36392.5    42   95.4  100
>> ada0     223.8  248.2  28645.1  31774.4    16   69.4   89
>> ada0     294.0  217.7  37544.4  27864.2    64   68.0  110
>> ada0     210.7  245.6  26966.6  31439.8    59  107.4  100
>> ada0     228.5  265.2  29246.6  33940.5    10   99.2   98
>> ada0     279.1  218.4  35727.2  27955.0    52   71.9  102
>> ada0     232.3  293.4  29607.9  37521.4    14   93.2  100
>> ada0     299.5  236.6  38340.2  30288.8    79   69.7  100
>> ada0     216.3  268.9  27686.3  34417.3     4   90.5  100
>> ada0     285.8  261.0  36585.3  33409.5    53   84.6  100
>> ada0     228.5  232.5  29059.7  29661.1    48   74.3  100
>> ada0     242.7  262.4  31060.0  33588.2    27   69.9  100
>> ada0     248.2  252.2  31766.1  32149.3     8   78.9  100
>> ada0     267.9  230.2  34288.6  29462.8    62   68.5  100
>> ada0     238.0  266.2  30375.8  34075.6     0   95.4  100
>>
>> As can be seen from the log above, at first the workload runs fine
>> and the disk is only performing writes, but at some point (in this
>> case at around 40% completion) it starts performing this
>> read-before-write dance that completely screws up performance.
>
> I reproduced this locally.  I think my patch is useless for the fio/4k
> write situation.
>
> What happens is indeed related to the amount of available memory.
> When the size of the file written by fio is larger than memory, the
> system has to recycle the cached pages.  So after some point, doing a
> write has to do a read-before-write, and this occurs not at EOF
> (since fio pre-allocated the job file).

It would seem to be much better to recycle pages associated with parts
of the file that haven't been touched in a long time before recycling
pages associated with the filesystem block that is currently being
written.  If the writes are sequential, then it definitely makes sense
to hang on to the block until the last portion of the block has been
written.  It sounds like we are doing pretty much the opposite of this.

What seems odd is that it sounds like we are detecting the partial
write of a block, reading the block from the disk, updating it with the
new partial data, writing the block out, and then immediately tossing
it.  It is strange that the dirty block isn't allowed to stick around
until the syncer causes it to be written, with clean blocks being
reclaimed in the meantime.
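For what it's worth, the effect should be reproducible without fio.
Below is a minimal sketch (not fio and not the patch discussed above;
the file name, fill pattern, and argument handling are arbitrary, and it
is only meant for eyeballing iostat, not as a benchmark) that rewrites a
file sequentially with a caller-chosen write size:

/* seqwrite.c: sequentially rewrite a file with a given write size. */
#include <sys/types.h>

#include <err.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	unsigned long long total, written;
	char *buf;
	size_t bs;
	ssize_t n;
	int fd;

	if (argc != 4)
		errx(1, "usage: %s file write_size total_bytes", argv[0]);
	bs = strtoul(argv[2], NULL, 0);
	total = strtoull(argv[3], NULL, 0);
	if (bs == 0)
		errx(1, "write_size must be nonzero");

	if ((buf = malloc(bs)) == NULL)
		err(1, "malloc");
	memset(buf, 0xa5, bs);

	/* Rewrite from the beginning of an existing file, roughly what
	 * fio does with its pre-allocated job file. */
	if ((fd = open(argv[1], O_WRONLY | O_CREAT, 0644)) == -1)
		err(1, "open");

	for (written = 0; written < total; written += (unsigned long long)n)
		if ((n = write(fd, buf, bs)) == -1)
			err(1, "write");

	close(fd);
	free(buf);
	return (0);
}

Pointed at a pre-existing file larger than RAM, a 4k write size should
eventually show the same read-before-write traffic in "iostat -x 1" as
the fio job above, while a 32k write size (one full UFS block per write
on this filesystem) should not.  On a freshly created file the newly
allocated blocks presumably never need to be read in, which is why the
pre-allocated job file matters here.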