Date: Sat, 25 Jan 2014 20:55:47 -0500 (EST)
From: Rick Macklem
To: J David
Cc: freebsd-net@freebsd.org
Subject: Re: Terrible NFS performance under 9.2-RELEASE?

J David wrote:
> On Fri, Jan 24, 2014 at 7:10 PM, Rick Macklem wrote:
> > I would like to hear if you find Linux doing read before write when
> > you use "-r 2k", since I think that is writing less than a page.
>
> It doesn't.
> As I reported in the original test, I used an 8k rsize/wsize and a
> 4k write size on the Linux test and no read-before-write was
> observed. And just now I did as you asked, a 2k test with Linux
> mounting with 32k rsize/wsize. No extra reads, excellent performance.
> FreeBSD, with the same mount options, does reads even on the appends
> in this case and can't.
>

Well, when I get home in April, I'll try the fairly recent Linux client I have at home and see what it does. Not sure what trick they could use to avoid the read before write for partial pages. (I suppose I can look at their sources, but that could be pretty scary ;-)

If I understand the 15-year-old commit message, the main problem with not doing the read before write for a partial buffer is that mmap()'d file access will look at entire pages and can potentially get garbage if the entire page isn't valid. At this time, there is a single B_CACHE flag to indicate that the buffer cache entry has been filled in. I think it would be possible to add a bitmap that marks which pages are actually allocated to the buffer cache entry, but I suspect the coding would be non-trivial. This would help for the case of page-sized writes on page boundaries, but would still require the pages to be read in before write when the writes are not page-sized writes on page boundaries.

Well, one application I do have some experience with is software builds, and the "ld" stage tends to write lots of odd-sized chunks at arbitrary byte offsets. (When I tested some code that extended the single dirty byte range to a list of dirty byte ranges, I discovered that "ld" often generates 100+ of these odd-sized, non-contiguous writes before the block is completely written. I recently added a mount option called "noncontigwr" that allows the single dirty byte range to cover these non-contiguous writes.)
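A rough model of the per-page bitmap idea may make the tradeoff concrete. This is a hypothetical Python sketch, not FreeBSD code: the PageValidBuffer class, its write() method and the 4K page / 64K buffer sizes are assumptions chosen to match the numbers in this thread, and it only counts the reads that would be needed rather than doing any I/O. Page-sized writes on page boundaries mark pages valid with no read, while a partial write into a page that is not yet valid costs one page-sized read.

```python
# Hypothetical model of a buffer cache entry with a per-page valid bitmap.
# Sizes are assumptions matching the thread: 4K pages, 64K buffers.
PAGE_SIZE = 4096
BUF_SIZE = 64 * 1024
NPAGES = BUF_SIZE // PAGE_SIZE  # 16 pages per buffer


class PageValidBuffer:
    """Tracks validity per page instead of one all-or-nothing flag
    (the role B_CACHE plays for the whole buffer today)."""

    def __init__(self) -> None:
        self.valid = 0  # bit i set => page i holds valid data
        self.reads = 0  # page-sized reads sent to the server so far

    def write(self, off: int, length: int) -> int:
        """Apply a write of `length` bytes at byte offset `off` within the
        buffer and return how many pages had to be read in first."""
        if length <= 0 or off < 0 or off >= BUF_SIZE:
            return 0
        end = min(off + length, BUF_SIZE)  # exclusive end, clipped to the buffer
        first, last = off // PAGE_SIZE, (end - 1) // PAGE_SIZE
        needed = 0
        for pg in range(first, last + 1):
            pstart, pend = pg * PAGE_SIZE, (pg + 1) * PAGE_SIZE
            fully_covered = off <= pstart and end >= pend
            if not fully_covered and not self.valid & (1 << pg):
                needed += 1  # partial write into an invalid page: read it first
            self.valid |= 1 << pg  # valid now, either read in or fully overwritten
        self.reads += needed
        return needed


# Page-sized write on a page boundary: no read-before-write.
buf = PageValidBuffer()
print(buf.write(0, PAGE_SIZE))  # 0
# 2K write into a page that is not valid yet: one page must be read first.
print(buf.write(PAGE_SIZE, 2048))  # 1

# "ld"-like pattern: one small, unaligned write into each page of the buffer.
ld = PageValidBuffer()
for pg in range(NPAGES):
    ld.write(pg * PAGE_SIZE + 123, 1000)
print(ld.reads)  # 16 separate page reads, versus one 64K read of the whole buffer
```

In this model a write that straddles several pages only pays for its partially covered first and last pages; any page it overwrites completely needs no read at all.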
Bottom line, if the pages were read in individually, the "ld" case would result in several (up to 16 for 4K in a 64K buffer) small reads against the server, which isn't nearly as efficient as one larger 64K read.

As mentioned above, I don't know how Linux would avoid the read before write for partial blocks/pages being written.

rick

>                                                                  random   random
>                KB   reclen    write  rewrite     read   reread     read    write
> Linux     1048576        2   281082   358672                     125687   121964
> FreeBSD   1048576        2    59042    22624                      10304     1933
>
> For comparison, here's the same test with 32k reclen (again, both
> Linux and FreeBSD using 32k rsize/wsize):
>
>                                                                  random   random
>                KB   reclen    write  rewrite     read   reread     read    write
> Linux     1048576       32   319387   373021                     411106   364393
> FreeBSD   1048576       32    74892    73703                      34889    66350
>
> Unfortunately it sounds like this state of affairs isn't really going
> to improve, at least in the near future. If there was one area where
> I never thought Linux would surpass us, it was NFS. :(
>
> Thanks!