From owner-freebsd-current@FreeBSD.ORG Sat Mar 30 07:51:25 2013
Subject: Re: [RFC] vfs.read_min proposal
From: Scott Long
In-Reply-To: <20130329205853.GB3794@kib.kiev.ua>
Date: Sat, 30 Mar 2013 01:51:15 -0600
Message-Id: <8F56D4EB-E63F-4D52-A495-903019E129AF@samsco.org>
References: <20130328075209.GL3794@kib.kiev.ua> <20130329205853.GB3794@kib.kiev.ua>
To: Konstantin Belousov
Cc: current@freebsd.org, Maksim Yevmenkin
List-Id: Discussions about the use of FreeBSD-current

On Mar 29, 2013, at 2:58 PM, Konstantin Belousov wrote:

>>
> I think this is definitely a feature that should be set by a flag to
> either file descriptor used for aio_read, or aio_read call itself.
> Adding a flag to aio_read() might be cumbersome from the ABI perspective.
>

Fine if you think that there should be a corresponding fcntl() operation,
but I see good reason to also have a vfs.read_min that complements
vfs.read_max. It's no less obscure.

>>
>> finally, vfs.read_min allows us to control size of original disk reads,
>> and, vfs.read_max allows us to control of additional read ahead. so,
>> we control both sides here. in fact, we can have 1mb reads and 1mb
>> read aheads together. granted, its not going to be optimal for all
>> loads. that is why vfs.read_min default is 1. however, i strongly
>> suspect that there are quite a few workloads where this could really
>> help with disk i/o.
>
> In fact, the existing OS behaviour is reasonable for the arguments
> which are passed to the syscall. The specified read size is 1, and the
> current read-ahead algorithm tries to satisfy the request with minimal
> latency and without creating additional load under memory pressure,
> while starting the useful optimization of the background read.
>

The doubled transaction made a lot of sense back when disks were very slow.
Now, let's use a modern example:

Default UFS block size = 16k
Default vfs.read_max = 8 (128k)
Time spent transferring a 16k block over 3Gbps SATA: 54us
Time spent transferring a 128k block over 3Gbps SATA: 436us
Time spent seeking to the 16k/128k block: Average 8ms on modern disks.
% time spent on data vs seek, 16k: 0.68%
% time spent on data vs seek, 128k: 5.4%

It'll take you about 5% longer to get a completion back. Not nothing, but
it's also not something that would be turned on by default, at least not
right now. For 6Gbps SATA, it'll be half of that. However, this is a very
idealized example. When you start getting a busy disk and the seek times
reach the hundreds of milliseconds, this overhead goes well into the noise.
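
For anyone who wants to check the arithmetic, here is a rough sketch that
recomputes the data-vs-seek percentages. It rests on assumptions that are
not spelled out in the message: SATA's 8b/10b line coding (so a 3Gbps link
moves roughly 300e6 payload bytes per second), "k" meaning 1024 bytes, and
no protocol or controller overhead. The function names are made up for
illustration.

```python
# Back-of-the-envelope model of transfer time vs. seek time.
# Assumptions (illustrative): 8b/10b coding, so 10 line bits per
# payload byte; "k" is 1024 bytes; no command or controller overhead.

SEEK_S = 8e-3  # average seek time quoted in the message (8ms)


def transfer_time_s(size_bytes, link_gbps):
    """Seconds needed to move size_bytes over a SATA link at link_gbps."""
    payload_bytes_per_s = link_gbps * 1e9 / 10
    return size_bytes / payload_bytes_per_s


def data_vs_seek_pct(size_bytes, link_gbps, seek_s=SEEK_S):
    """Transfer time expressed as a percentage of the seek time."""
    return 100.0 * transfer_time_s(size_bytes, link_gbps) / seek_s


if __name__ == "__main__":
    for label, size in (("16k", 16 * 1024), ("128k", 128 * 1024)):
        for gbps in (3, 6):
            t_us = transfer_time_s(size, gbps) * 1e6
            pct = data_vs_seek_pct(size, gbps)
            print(f"{label} @ {gbps}Gbps: {t_us:.1f}us, {pct:.2f}% of seek")
```

Under these assumptions the 3Gbps cases land at roughly 0.68% and 5.5%,
and the 6Gbps cases at half that. Calling data_vs_seek_pct() with
seek_s=0.2 models a busy disk: the 128k transfer then drops to about
0.2% of the seek time.
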
At the same time, reducing the number of concurrent, unbalanced transactions
to the disk makes them perform much better when they are at their
performance saturation point, and we have very solid numbers to prove it.

I think that there's still a place for doubled transactions for read ahead,
and that place would likely be with low-latency flash, but there are a lot
of other factors that get in the way of that right now in FreeBSD, like the
overhead of the threaded handoffs in GEOM. As this area is developed over
the next 6 months, and as we have more time to build and test more models,
I'm sure we'll get some interesting data. But for now, I'll argue that Max's
proposal is sound and is low maintenance.

> Not lying to the OS could be achieved by somehow specifying to
> aio_read() that you do not need copyout, and issuing the request for
> read of the full range. This is definitely more work than read_min,
> but I think that the result could be useful for the wide audience.

A side-effect of the aio_mlock() work that's also going on right now is that
we won't need to lie to the OS anymore. We still may not want to do a
doubled transaction for read-ahead though, because we're constrained on disk
transactional bandwidth and we don't know that we'll always actually use the
data that gets read ahead.

In any case, it's hard for me to reconcile giving FreeBSD the tools to let
people make it faster in demonstrable ways with the argument that the tools
offered won't be used and are too obscure. Let's move forward with this.

Scott