Date: Sun, 31 Mar 2013 15:37:30 +0300
From: Konstantin Belousov <kostikbel@gmail.com>
To: Scott Long <scottl@samsco.org>
Cc: current@freebsd.org, Maksim Yevmenkin <maksim.yevmenkin@gmail.com>
Subject: Re: [RFC] vfs.read_min proposal
Message-ID: <20130331123730.GL3794@kib.kiev.ua>
In-Reply-To: <8F56D4EB-E63F-4D52-A495-903019E129AF@samsco.org>
References: <CAFPOs6rNDZTqWJZ3hK=px5RX5G44Z3hfzCLQcfceQ2n_7oU3GA@mail.gmail.com>
 <20130328075209.GL3794@kib.kiev.ua>
 <CAFPOs6qo7yHgpUWsnLb0hJ9S5_fjbFvh__2-n6MQLHOVdUQmOQ@mail.gmail.com>
 <20130329205853.GB3794@kib.kiev.ua>
 <8F56D4EB-E63F-4D52-A495-903019E129AF@samsco.org>
On Sat, Mar 30, 2013 at 01:51:15AM -0600, Scott Long wrote:
> 
> On Mar 29, 2013, at 2:58 PM, Konstantin Belousov <kostikbel@gmail.com> wrote:
> 
> > I think this is definitely a feature that should be set by a flag to
> > either the file descriptor used for aio_read, or the aio_read call itself.
> > Adding a flag to aio_read() might be cumbersome from the ABI perspective.
> 
> Fine if you think that there should be a corresponding fcntl() operation, but
> I see good reason to also have a vfs.read_min that complements vfs.read_max.
> It's no less obscure.
> 
> >> finally, vfs.read_min allows us to control the size of the original disk
> >> reads, and vfs.read_max allows us to control the additional read-ahead, so
> >> we control both sides here. in fact, we can have 1mb reads and 1mb
> >> read-aheads together. granted, it's not going to be optimal for all
> >> loads; that is why the vfs.read_min default is 1. however, i strongly
> >> suspect that there are quite a few workloads where this could really
> >> help with disk i/o.
> > 
> > In fact, the existing OS behaviour is reasonable for the arguments
> > which are passed to the syscall. The specified read size is 1, and the
> > current read-ahead algorithm tries to satisfy the request with minimal
> > latency and without creating additional load under memory pressure,
> > while starting the useful optimization of the background read.
> 
> The doubled transaction made a lot of sense back when disks were very
> slow. Now, let's use a modern example:
> 
> Default UFS block size = 16k
> Default vfs.read_max = 8 (128k)
> Time spent transferring a 16k block over 3Gbps SATA: 54us
> Time spent transferring a 128k block over 3Gbps SATA: 436us
> Time spent seeking to the 16k/128k block: average 8ms on modern disks.
> % time spent on data vs. seek, 16k: 0.68%
> % time spent on data vs. seek, 128k: 5.4%
> 
> It'll take you 5% longer to get a completion back. Not nothing, but it's
> also not something that would be turned on by default, at least not
> right now. For 6Gbps SATA, it'll be half of that. However, this is a
> very idealized example. When you start getting a busy disk and the
> seek times reach the hundreds of milliseconds, this overhead goes
> well into the noise. At the same time, reducing the number of
> concurrent, unbalanced transactions to the disk makes them perform
> much better when they are at their performance saturation point, and
> we have very solid numbers to prove it.
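Just to sanity-check the figures above: assuming roughly 300MB/s of payload
bandwidth on 3Gbps SATA after 8b/10b encoding, and an 8ms average seek (both
numbers are assumptions, not measurements), the transfer/seek ratio works out
as in this little program:

/*
 * Transfer time vs. seek time for the default UFS block size and the
 * default vfs.read_max cluster size.
 */
#include <stdio.h>

int
main(void)
{
	const double rate = 300e6;	/* payload bytes/sec, 3Gbps SATA */
	const double seek = 8e-3;	/* average seek time, seconds */
	const double sizes[] = { 16 * 1024, 128 * 1024 };
	int i;

	for (i = 0; i < 2; i++) {
		double xfer = sizes[i] / rate;
		printf("%4.0fk: transfer %3.0f us, %.2f%% of an 8ms seek\n",
		    sizes[i] / 1024, xfer * 1e6, 100.0 * xfer / seek);
	}
	return (0);
}

This reproduces the 0.68% and ~5.4% figures quoted above.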
Hm, what you are talking about is not what the patch does. The patch fakes
the read size from the user.

Let me reformulate this in the following way: assume that the file extent
at the current point is contiguous, and the read-ahead was specified by the
file system. Then it makes absolutely no sense not to read the maximal
available contiguous extent, up to maxra. The current code limits the amount
of the first clustered read to the amount specified in the usermode read
request. I think this is a bug, and we do not need a tunable to disable the
bug.

> 
> I think that there's still a place for doubled transactions for read ahead,
> and that place would likely be with low-latency flash, but there's a lot
> of other factors that get in the way of that right now in FreeBSD, like the
> overhead of the threaded handoffs in GEOM.  As this area is developed
> over the next 6 months, and as we have more time to build and test
> more models, I'm sure we'll get some interesting data. But for now, I'll
> argue that Max's proposal is sound and is low maintenance.

Even with such a low-latency disk, I think that not reading the contiguous
chunk when there is read-ahead is wrong.

Could you try this, please, instead of read_min? The patch can be buggy,
but it should demonstrate what I mean.

diff --git a/sys/kern/vfs_cluster.c b/sys/kern/vfs_cluster.c
index 91a0443..c926dda 100644
--- a/sys/kern/vfs_cluster.c
+++ b/sys/kern/vfs_cluster.c
@@ -201,7 +201,7 @@ cluster_read(struct vnode *vp, u_quad_t filesize, daddr_t lblkno, long size,
 	 */
 	if (ncontig) {
 		/* Account for our first block. */
-		ncontig = min(ncontig + 1, nblks);
+		ncontig = min(ncontig + 1, maxra);
 		if (ncontig < nblks)
 			nblks = ncontig;
 		bp = cluster_rbuild(vp, filesize, lblkno,
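If you want to see the effect, here is a minimal test sketch (the method is
just my suggestion, not part of the patch): issue a single tiny read from a
large file that is not in the cache and watch the size of the very first disk
request with gstat or iostat. With the current code the first read should be
one filesystem block; with the change it should grow up to the contiguous
extent, bounded by maxra.

/*
 * Issue one tiny read so the size of the initial clustered disk read
 * can be observed externally.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	char c;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s file\n", argv[0]);
		return (1);
	}
	fd = open(argv[1], O_RDONLY);
	if (fd == -1) {
		perror("open");
		return (1);
	}
	/* One-byte read, mimicking the small reads discussed in the thread. */
	if (read(fd, &c, 1) == -1)
		perror("read");
	close(fd);
	return (0);
}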