Date: Thu, 28 Mar 2013 10:11:25 -0700
From: Maksim Yevmenkin <maksim.yevmenkin@gmail.com>
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: current@freebsd.org
Subject: Re: [RFC] vfs.read_min proposal
Message-ID: <CAFPOs6qo7yHgpUWsnLb0hJ9S5_fjbFvh__2-n6MQLHOVdUQmOQ@mail.gmail.com>
In-Reply-To: <20130328075209.GL3794@kib.kiev.ua>
References: <CAFPOs6rNDZTqWJZ3hK=px5RX5G44Z3hfzCLQcfceQ2n_7oU3GA@mail.gmail.com> <20130328075209.GL3794@kib.kiev.ua>
On Thu, Mar 28, 2013 at 12:52 AM, Konstantin Belousov <kostikbel@gmail.com> wrote:
>> i would like to get some reviews, opinions and/or comments on the patch below.
>>
>> a little bit of background: as far as i understand, cluster_read() can
>> initiate two disk i/o's: one for the exact amount of data being requested
>> (rounded up to a filesystem block size) and another for a configurable
>> read ahead. read ahead data are always extra and do not superset the data
>> being requested. also, read ahead can be controlled via f_seqcount (on a
>> per-descriptor basis) and/or vfs.read_max (global knob).
>>
>> in some cases and/or on some workloads it can be beneficial to bundle
>> the original data and the read ahead data in one i/o request. in other
>> words, read more than the caller has requested, but only perform one
>> larger i/o, i.e. superset the data being requested and the read ahead.
>
> The totread argument to the cluster_read() is supplied by the filesystem
> to indicate how much data in the current request is specified. Always
> overriding this information means two things:
> - you fill the buffer and page cache with potentially unused data.

it very well could be unused, yes.

> For some situations, like partial reads, it would be really bad.

certainly possible.

> - you increase the latency by forcing the reader to wait for the whole
> cluster which was not asked for.

perhaps. however, modern drives are fast, and, in fact, are a lot better
at reading contiguous chunks without introducing significant delays.

> So it looks like a very single- and special-purpose hack. Besides, the
> global knob is obscure and probably would not have any use except your
> special situation. Would a file flag be acceptable for you ?

a flag would probably work just as well, but having a global knob allows
for runtime tuning without software re-configuration and re-start.

> What is the difference in the numbers you see, and what numbers ?
> Is it targeted for read(2) optimizations, or are you also concerned
> with the read-ahead done at the fault time ?

ok, please consider the following: a modern high-performance web server -
nginx - with the aio + sendfile(SF_NODISKIO) method of delivering data.
for those who are not familiar with how it works, here is a very quick
overview of nginx's aio + sendfile:

1) nginx always uses sendfile() with SF_NODISKIO to deliver content. this
means that if the requested pages are not in the buffer cache, it sends
as much as is available and returns EBUSY;

2) when nginx sees the EBUSY return code from sendfile(), it issues an
aio_read() for 1 byte. the idea here is that the OS will issue a read
ahead and fill up the buffer cache with some pages;

3) when the aio_read() completes, nginx calls sendfile() with SF_NODISKIO
again, assuming that at least some of the pages will now be in the buffer
cache.

this model allows for completely non-blocking, asynchronous data delivery
without copying anything back to user space. sounds awesome, right? well,
it almost is. here is the problem: the aio_read() for 1 byte will issue
one read for 1 filesystem block, and *another* read for the read ahead
data. say we configure the read ahead to be 1mb (via ioctl(F_READAHEAD)
and/or vfs.read_max), and say our filesystem block size is 64k. we end up
with 2 disk reads: one for 64k and another for 1mb, two trips to VM, and
our average size per disk transfer is (64k + 1mb) / 2 = 544kb.

now, if we use vfs.read_min and set it to 16, i.e. 1mb reads with a 64k
filesystem block size, and turn off read ahead completely, i.e. set
vfs.read_max to zero, then *all* cluster_read()s become nice 1mb chunks,
and we only do one disk i/o and one trip to VM to get the same data. so
we effectively doubled (or halved) our iops. also, the average size per
disk transfer is around 900kb (there is still some filesystem block sized
i/o that does not go via cluster_read()), which is a lot nicer for modern
disks.
finally, vfs.read_min allows us to control the size of the original disk
reads, and vfs.read_max allows us to control the additional read ahead,
so we control both sides here. in fact, we can have 1mb reads and 1mb
read aheads together. granted, it's not going to be optimal for all
loads; that is why the vfs.read_min default is 1. however, i strongly
suspect that there are quite a few workloads where this could really help
with disk i/o.

thanks,
max
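for reference, the two configurations discussed above would be set at runtime roughly like this (vfs.read_min is the proposed knob and assumes the patch is applied; both knobs count filesystem blocks, 64k here):

```sh
# bundle request + read ahead into one 1mb i/o (64k blocks)
sysctl vfs.read_min=16   # original read grows to 16 blocks = 1mb (proposed)
sysctl vfs.read_max=0    # no separate read ahead i/o

# or: 1mb reads *and* 1mb read aheads together
sysctl vfs.read_min=16
sysctl vfs.read_max=16
```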