Date: Thu, 21 May 2009 16:32:03 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Andre Oppermann <andre@freebsd.org>
Cc: rwatson@freebsd.org, freebsd-current@freebsd.org
Subject: Re: Socket related code duplication in NFS
Message-ID: <Pine.GSO.4.63.0905211618510.17038@muncher.cs.uoguelph.ca>
In-Reply-To: <4A1460A3.2010202@freebsd.org>
References: <4A1460A3.2010202@freebsd.org>

On Wed, 20 May 2009, Andre Oppermann wrote:

> e) The socket buffer is most efficient when it can aggregate a number of
>    packets together before they are processed. Can the NFS code set a low
>    water mark on the socket to get called only after a few packets have
>    arrived instead of each one? (In the select and taskqueue model.)
>
I think the answer to this one is "no". NFS traffic is RPC requests and
replies, which are mostly rather small messages (the write request, read
reply and readdir reply are the exceptions). NFS performance is very
sensitive to RPC RTT, which means anything that introduces delay in getting
an RPC message through (such as waiting a little while for more
data/messages) is normally a detriment, from what I've seen.

It might be possible to handle the exceptions as a special case, but it
isn't going to be easy, since TCP doesn't handle record marks, so knowing
when a large message is coming would require something like "peeking" in
the data for the RPC record marks. (Sun RPC puts a 32-bit number in network
byte order in front of each RPC message, which is its length in bytes. A
quirk on top of this is that the high order bit of this mark indicates
whether or not it is the last segment of a message, i.e. an RPC message can
be several record marked segments.)

> f) I've been thinking of an modular socket filter approach (much like the
>    accept filter) scanning for upper layer specific markers or boundaries
>    and then signalling data availability.
>
If by this you mean scanning for the RPC message boundaries in the TCP
stream (similar to what I said above), this could be very useful. So long
as a message gets passed along as soon as you have a complete one, this
sounds like a good idea to me.

Btw, although FreeBSD currently uses 32Kbyte reads/writes, Solaris 10 is
using up to 1Mbyte and I'd like to see that happening in FreeBSD too.
(When you have 1Mbyte write request and read reply messages, delaying an
upcall until you have an entire message might work well.)

Good luck with it, it sounds like an interesting project, rick
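
For illustration, the record mark scanning discussed above might look roughly like the sketch below. It is a userland approximation, not FreeBSD kernel code: it assumes the received bytes sit in one contiguous buffer rather than an mbuf chain, and the names rm_read_be32, rpc_message_extent, RM_LAST_FRAG and RM_LEN_MASK are hypothetical, chosen only for this example. It reports a length only once every segment of a message, up to and including the one with the high order bit set, is present.

```c
#include <stddef.h>
#include <stdint.h>

/* High order bit of the record mark: set on the last segment of a message. */
#define RM_LAST_FRAG	0x80000000u
/* Remaining 31 bits: length of the segment in bytes. */
#define RM_LEN_MASK	0x7fffffffu

/* Read a 32-bit value stored in network byte order (big-endian). */
static uint32_t
rm_read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	    ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical helper: return the number of bytes (record marks included)
 * taken up by the first complete RPC message in buf, or 0 if the buffer
 * does not yet hold a complete message.
 */
static size_t
rpc_message_extent(const unsigned char *buf, size_t len)
{
	size_t off = 0;

	while (len - off >= 4) {
		uint32_t mark = rm_read_be32(buf + off);
		size_t fraglen = mark & RM_LEN_MASK;

		off += 4;
		if (fraglen > len - off)
			return 0;	/* segment body not fully buffered */
		off += fraglen;
		if (mark & RM_LAST_FRAG)
			return off;	/* last segment seen: message complete */
	}
	return 0;			/* next record mark not fully buffered */
}
```

A filter built along these lines would signal data availability as soon as rpc_message_extent() returns non-zero, so small requests and replies are not held back while large ones are delivered whole.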