Date: Sat, 05 Jul 2014 22:24:23 +0800 From: Julian Elischer <julian@freebsd.org> To: Konstantin Belousov <kostikbel@gmail.com>, Roger Pau Monn? <roger.pau@citrix.com> Cc: freebsd-fs@freebsd.org, Stefan Parvu <sparvu@systemdatarecorder.org>, FreeBSD Hackers <freebsd-hackers@freebsd.org> Subject: Re: Strange IO performance with UFS Message-ID: <53B80A97.1080803@freebsd.org> In-Reply-To: <20140705112448.GQ93733@kib.kiev.ua> References: <53B691EA.3070108@citrix.com> <53B69C73.7090806@citrix.com> <20140705001938.54a3873dd698080d93d840e2@systemdatarecorder.org> <53B7C616.1000702@citrix.com> <20140705095831.GO93733@kib.kiev.ua> <53B7D4DF.40301@citrix.com> <20140705112448.GQ93733@kib.kiev.ua>
next in thread | previous in thread | raw e-mail | index | archive | help
On 7/5/14, 7:24 PM, Konstantin Belousov wrote: > On Sat, Jul 05, 2014 at 12:35:11PM +0200, Roger Pau Monn? wrote: >> On 05/07/14 11:58, Konstantin Belousov wrote: >>> On Sat, Jul 05, 2014 at 11:32:06AM +0200, Roger Pau Monn? wrote: >>>> kernel`g_io_request+0x384 kernel`g_part_start+0x2c3 >>>> kernel`g_io_request+0x384 kernel`g_part_start+0x2c3 >>>> kernel`g_io_request+0x384 kernel`ufs_strategy+0x8a >>>> kernel`VOP_STRATEGY_APV+0xf5 kernel`bufstrategy+0x46 >>>> kernel`cluster_read+0x5e6 kernel`ffs_balloc_ufs2+0x1be2 >>>> kernel`ffs_write+0x310 kernel`VOP_WRITE_APV+0x166 >>>> kernel`vn_write+0x2eb kernel`vn_io_fault_doio+0x22 >>>> kernel`vn_io_fault1+0x78 kernel`vn_io_fault+0x173 >>>> kernel`dofilewrite+0x85 kernel`kern_writev+0x65 >>>> kernel`sys_write+0x63 >>>> >>>> This can also be seen by running iostat in parallel with the fio >>>> workload: >>>> >>>> device r/s w/s kr/s kw/s qlen svc_t %b ada0 >>>> 243.3 233.7 31053.3 29919.1 31 57.4 100 >>>> >>>> This clearly shows that even when I was doing a sequential write >>>> (the fio workload shown above), the disk was actually reading >>>> more data than writing it, which makes no sense, and all the >>>> reads come from the path trace shown above. >>> The backtrace above means that the BA_CLRBUF was specified for >>> UFS_BALLOC(). In turns, this occurs when the write size is less >>> than the UFS block size. UFS has to read the block to ensure that >>> partial write does not corrupt the rest of the buffer. >> Thanks for the clarification, that makes sense. I'm not opening the >> file with O_DIRECT, so shouldn't the write be cached in memory and >> flushed to disk when we have the full block? It's a sequential write, >> so the whole block is going to be rewritten very soon. >> >>> You can get the block size for file with stat(2), st_blksize field >>> of the struct stat, or using statfs(2), field f_iosize of struct >>> statfs, or just looking at the dumpfs output for your filesystem, >>> the bsize value. For modern UFS typical value is 32KB. >> Yes, block size is 32KB, checked with dumpfs. I've changed the block >> size in fio to 32k and then I get the expected results in iostat and fio: >> >> extended device statistics >> device r/s w/s kr/s kw/s qlen svc_t %b >> ada0 1.0 658.2 31.1 84245.1 58 108.4 101 >> extended device statistics >> device r/s w/s kr/s kw/s qlen svc_t %b >> ada0 0.0 689.8 0.0 88291.4 54 112.1 99 >> extended device statistics >> device r/s w/s kr/s kw/s qlen svc_t %b >> ada0 1.0 593.3 30.6 75936.9 80 111.7 97 >> >> write: io=10240MB, bw=81704KB/s, iops=2553, runt=128339msec > The current code in ffs_write() only avoids read before write when > write covers complete block. I think we can somewhat loose the test > to also avoid read when we are at EOF and write covers completely > the valid portion of the last block. > > This leaves the unwritten portion of the block with the garbage. I > believe that it is not harmful, since the only way for usermode to > access that garbage is through the mmap(2). The vnode_generic_getpages() > zeroes out parts of the page which are after EOF. I have vague memories of this being in a security bulletin once along the lines of "random data disclosure" by making tons of 1 frag size files and then mmapping them. > > Try this, almost completely untested: > > commit 30375741f5b15609e51cac5b242ecfe7d614e902 > Author: Konstantin Belousov <kib@freebsd.org> > Date: Sat Jul 5 14:19:39 2014 +0300 > > Do not do read-before-write if the written area completely covers > the valid portion of the block at EOF. > > diff --git a/sys/ufs/ffs/ffs_vnops.c b/sys/ufs/ffs/ffs_vnops.c > index 423d811..b725932 100644 > --- a/sys/ufs/ffs/ffs_vnops.c > +++ b/sys/ufs/ffs/ffs_vnops.c > @@ -729,10 +729,12 @@ ffs_write(ap) > vnode_pager_setsize(vp, uio->uio_offset + xfersize); > > /* > - * We must perform a read-before-write if the transfer size > - * does not cover the entire buffer. > + * We must perform a read-before-write if the transfer > + * size does not cover the entire buffer or the valid > + * part of the last buffer for the file. > */ > - if (fs->fs_bsize > xfersize) > + if (fs->fs_bsize > xfersize && (blkoffset != 0 || > + uio->uio_offset + xfersize < ip->i_size)) > flags |= BA_CLRBUF; > else > flags &= ~BA_CLRBUF;
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?53B80A97.1080803>