From owner-freebsd-current@FreeBSD.ORG Fri Mar 29 20:59:02 2013
Date: Fri, 29 Mar 2013 22:58:53 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Maksim Yevmenkin
Cc: current@freebsd.org
Subject: Re: [RFC] vfs.read_min proposal
Message-ID: <20130329205853.GB3794@kib.kiev.ua>
References: <20130328075209.GL3794@kib.kiev.ua>
User-Agent: Mutt/1.5.21 (2010-09-15)
List-Id: Discussions about the use of FreeBSD-current
On Thu, Mar 28, 2013 at 10:11:25AM -0700, Maksim Yevmenkin wrote:
> On Thu, Mar 28, 2013 at 12:52 AM, Konstantin Belousov wrote:
>
> >> i would like to get some reviews, opinions and/or comments on the
> >> patch below.
> >>
> >> a little bit of background: as far as i understand, cluster_read()
> >> can initiate two disk i/o's: one for the exact amount of data being
> >> requested (rounded up to the filesystem block size) and another for
> >> a configurable read ahead. read-ahead data are always extra and do
> >> not form a superset of the data being requested. also, read ahead
> >> can be controlled via f_seqcount (on a per-descriptor basis) and/or
> >> vfs.read_max (a global knob).
> >>
> >> in some cases and/or on some workloads it can be beneficial to
> >> bundle the original data and the read-ahead data in one i/o
> >> request. in other words, read more than the caller has requested,
> >> but perform only one larger i/o, i.e. a superset of the requested
> >> data plus the read ahead.
> >
> > The totread argument to cluster_read() is supplied by the filesystem
> > to indicate how much data the current request specifies. Always
> > overriding this information means two things:
> > - you fill the buffer and page cache with potentially unused data.
>
> it very well could be
>
> > For some situations, like partial reads, it would be really bad.
>
> certainly possible
>
> > - you increase the latency by forcing the reader to wait for the
> > whole cluster which was not asked for.
>
> perhaps; however, modern drives are fast and, in fact, are a lot
> better at reading contiguous chunks without introducing significant
> delays.
>
> > So it looks like a very single- and special-purpose hack. Besides,
> > the global knob is obscure and probably would not have any use
> > outside your special situation. Would a file flag be acceptable
> > to you?
>
> a flag would probably work just as well, but having a global knob
> allows for runtime tuning without software re-configuration and
> re-start.

My point is that the 'tuning' there has too wide a scope to be
acceptable as a feature for a general-purpose OS.

>
> > What is the difference in the numbers you see, and what numbers ?
> > Is it targeted for read(2) optimizations, or are you also concerned
> > with the read-ahead done at the fault time ?
>
> ok. please consider the following: a modern high-performance web
> server - nginx - with the aio + sendfile(SF_NODISKIO) method of
> delivering data.
>
> for those who are not familiar with how it works, here is a very
> quick overview of nginx's aio + sendfile:
>
> 1) nginx always uses sendfile() with SF_NODISKIO to deliver content.
> this means that if the requested pages are not in the buffer cache,
> sendfile() sends as much as is available and returns EBUSY.
>
> 2) when nginx sees the EBUSY return code from sendfile(), it issues
> an aio_read() for 1 byte. the idea here is that the OS will issue a
> read ahead and fill up the buffer cache with some pages.
>
> 3) when the aio_read() completes, nginx calls sendfile() with
> SF_NODISKIO again, on the assumption that at least some of the pages
> will now be in the buffer cache.
>
> this model allows for completely non-blocking asynchronous data
> sending without copying anything back to user space. sounds awesome,
> right? well, it almost is. here is the problem: an aio_read() for 1
> byte will issue a read for 1 filesystem block, and *another* read for
> the read-ahead data. say we configure read ahead to be 1mb (via
> fcntl(F_READAHEAD) and/or vfs.read_max), and say our filesystem block
> size is 64k. we end up with 2 disk reads: one for 64k and another for
> 1mb, two trips to VM, and an average size per disk transfer of (64k +
> 1mb) / 2 = 544kb.
>
> now, if we use vfs.read_min and set it to 16, i.e. 1mb reads with a
> 64k filesystem block size, and turn off read ahead completely, i.e.
> set vfs.read_max to zero, then *all* cluster_read()s become nice 1mb
> chunks, and we only do one disk i/o and one trip to VM to get the
> same data. so we have effectively doubled (or halved) our iops. also,
> the average size per disk transfer is around 900kb (there is still
> some filesystem-block-sized i/o that does not go via cluster_read()),
> which is a lot nicer for modern disks.

I think this is definitely a feature that should be set by a flag on
either the file descriptor used for aio_read(), or on the aio_read()
call itself. Adding a flag to aio_read() might be cumbersome from the
ABI perspective.

>
> finally, vfs.read_min allows us to control the size of the original
> disk reads, and vfs.read_max allows us to control the additional read
> ahead, so we control both sides here. in fact, we can have 1mb reads
> and 1mb read aheads together. granted, it is not going to be optimal
> for all loads; that is why the vfs.read_min default is 1. however, i
> strongly suspect that there are quite a few workloads where this
> could really help with disk i/o.

In fact, the existing OS behaviour is reasonable for the arguments
which are passed to the syscall. The specified read size is 1, and the
current read-ahead algorithm tries to satisfy the request with minimal
latency and without creating additional load under memory pressure,
while starting the useful optimization of the background read.

Not lying to the OS could be achieved by somehow specifying to
aio_read() that you do not need the copyout, and issuing the request
for a read of the full range. This is definitely more work than
read_min, but I think the result could be useful for a wide audience.