Date: Thu, 30 Jan 2014 23:36:18 -0500
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
From: J David
To: Rick Macklem
Cc: Bryan Venteicher, Garrett Wollman, freebsd-net@freebsd.org

On Thu, Jan 30, 2014 at 5:44 PM, Rick Macklem wrote:
> I'd like to see MAXBSIZE
> increased to at least 128K, since that is the default block size for
> ZFS, I've been told.

Regrettably, that is incomplete. The ZFS record size is variable *up to*
128kiB by default; it's more of an upper limit than a hard and fast rule.
Also, it is configurable at runtime on a per-filesystem basis. Although
any file >128kiB probably does use 128kiB blocks, ZFS has ARC and L2ARC
and manages its own prefetch. As long as NFS treats rsize/wsize as a
fixed-size block, the number of workloads that benefit from pushing it to
128kiB is probably very limited.

> Also, for real networks, the NFS RPC message will be broken into
> quite a few packets to go on the wire, as far as I know. (I don't
> think there are real networks using a 64K jumbo packet, is there?)
> For my hardware, the packets will be 1500bytes each on the wire,
> since nothing I have does jumbo packets.

Real environments for NFS in 2014 are 10gig LANs with hardware TSO that
makes the overhead of segmentation negligible. As someone else on this
thread has already pointed out, efficiently utilizing TSO is essentially
mandatory to make good use of 10gig hardware.
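To put rough numbers on the packet counts involved (a back-of-the-envelope
sketch only: it assumes a 1500-byte MTU with TCP timestamps, so an MSS of
about 1448, the current 65535-byte TSO ceiling, and each payload sent by
itself on an idle connection; it is not taken from the NFS or TCP code):

#include <stdio.h>

/*
 * Illustrative only: wire segments and TSO bursts needed to move one
 * read/write payload of a given size, under the assumptions above.
 */
int
main(void)
{
	const int mss = 1448;          /* 1500 - 40 (IP+TCP) - 12 (timestamps) */
	const int tso_limit = 65535;   /* largest IP datagram, current TSO cap */
	const int sizes[] = { 61440, 65536, 131072 };   /* 60k, 64k, 128k */

	for (unsigned int i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		int payload = sizes[i];
		int wire_segs = (payload + mss - 1) / mss;
		int tso_bursts = (payload + tso_limit - 1) / tso_limit;
		printf("%6d byte payload: ~%2d wire segments, %d TSO burst(s)\n",
		    payload, wire_segs, tso_bursts);
	}
	return (0);
}

Note that a 64k payload is already one byte past the 65535-byte ceiling
before any RPC header goes in front of it, which ties into the "a little
bit more than a power of 2" point below.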
So as far as FreeBSD is concerned, yes, many networks effectively have a
64k MTU (for TCP only, since FreeBSD does not implement GSO at this time)
and it should act accordingly when dealing with them.

This NFS buffer size limit nearly doubles the number of TCP packets it
takes to move the same amount of data. Regardless of how those packets are
eventually segmented -- which can be effectively ignored in the real world
of hardware TSO -- the overhead of TCP and IP is not nil, cannot be
offloaded, and doubling it is not a good thing. It doubles every step down
to the very bottom, including optional stuff like PF if it is hanging
around in there.

> Unfortunately, NFS adds a little bit to the front of the data, so
> an NFS RPC will always be a little bit more than a power of 2 in
> size for reads/writes of a power of 2.

That's why NFS should be able to operate on page-sized multiples rather
than powers of 2. Then it can operate on the filesystem using the best
size for that, operate on the network using the best size for that, and
mediate the two using page-sized jumbo clusters. If you know the
underlying filesystem block size, by all means, read or write based on it
where appropriate.

> Now, I am not sure why 65535 (largest ip datagram) has been chosen
> as the default limit for TSO segments?

The process of TCP segmentation, whether offloaded or not, is performed on
a single TCP packet. It operates by reusing that packet's header over and
over for each segment, with slight modifications. Consequently, the
maximum size that can be offloaded is the maximum size that can be
segmented: one packet.

> Well, since NFS sets the TCP_NODELAY socket option, that shouldn't
> occur in the TCP layer. If some network device driver is delaying,
> waiting for more to send, then I'd say that device driver is broken.

This is not a driver issue. TCP_NODELAY means "don't wait for more data."
It doesn't mean "don't send more data that is ready to be sent." If
there's more data already present on the stream by the time the TCP stack
gets to it, which is possible in an SMP environment, TCP_NODELAY won't, as
far as I know, prevent it from being sent in the next available packet.
This isn't necessarily something that happens every time, or even
consistently, but when you're sending a hundred thousand packets per
second, it looks like the chain can indeed come off the bicycle. NFS is
not sending packets to the TCP stack; it is sending stream data. With
TCP_NODELAY it should be possible to engineer a one send = one packet
correspondence, but that's true if and only if that send is no larger than
the max packet size.
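For what it's worth, here is a minimal sketch of what I mean (generic
socket code assuming an already-connected TCP socket "s"; it is not lifted
from the NFS client or server):

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <unistd.h>
#include <err.h>

/*
 * TCP_NODELAY disables Nagle: "don't hold a small segment back waiting
 * for an ACK."  It does not promise that one write() becomes one packet.
 * If both writes below are already sitting in the socket buffer by the
 * time the stack builds the next segment, they can legitimately share it.
 */
void
send_two(int s)
{
	int one = 1;
	char a[8192], b[8192];

	memset(a, 'a', sizeof(a));
	memset(b, 'b', sizeof(b));

	if (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) == -1)
		err(1, "setsockopt(TCP_NODELAY)");

	/* Two back-to-back sends of stream data, not two "packets". */
	if (write(s, a, sizeof(a)) == -1 || write(s, b, sizeof(b)) == -1)
		err(1, "write");
}

Whether those 16k leave the host as a dozen segments, one TSO burst, or
something in between is up to the stack and the NIC at that moment, not to
the caller.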
> For real NFS environments, the performance of the file system and
> underlying disk subsystem is generally more important than the network.

Maybe this is the case if NFS is serving from one spinning disk. It's
definitely not the case for ZFS installs with 128GiB of RAM, shelves of
SAS drives, terabytes of SSD L2ARC, and STEC slog devices. The performance
of the virtual environment we're using as a test platform is remarkably
close to that. It just has the benefit of being two orders of magnitude
cheaper and therefore something that can be set aside for testing stuff
like this.

> (Some
> NAS vendors avoid this by using non-volatile ram in the server as stable
> storage, but a FreeBSD server can't expect such hardware to be available.)

Nonvolatile slogs are all but mandatory in any ZFS-backed NFS fileserver
deployment. Like TSO, this is not hypothetical; it is standard for
production deployments.

>> but what is the origin of that? Is it
>> something that could be changed?
>>
> Because disk file systems on file servers always use block sizes that
> are a power of 2.

Maybe my question wasn't phrased well. What is the origin of the huge
performance drop when a non-power-of-two size is used? This is visible
under small random ops, where the extra data in a 64k read versus a 60k
read is never used and the next block is almost certainly not going to be
read next. So it's very weird (to me) that performance drops as much as it
does.

> Agreed. I think adding a if_hw_tsomaxseg that TCP can use is preferable.

It may be valuable for other workloads, to prevent drops on some kind of
pathologically sliced-up packets, but jumbo cluster support in NFS should
pretty much guarantee that it is not going to have a problem in this area
with any interface in common use.

Thanks!