From owner-freebsd-performance@FreeBSD.ORG Wed Jun 11 03:23:11 2003
Date: Wed, 11 Jun 2003 03:19:10 -0700
From: Terry Lambert <tlambert2@mindspring.com>
To: Eric Anderson
Cc: freebsd-performance@freebsd.org
Subject: Re: Slow disk write speeds over network
Message-ID: <3EE7021E.F2928B7@mindspring.com>

Eric Anderson wrote:
> Good news, but not done yet..

Keep reading: Sean Chittenden also had a couple of good pieces of
advice; read his posting too.

> > You haven't said if you were using UDP or TCP for the mounts;
> > you should definitely use TCP with FreeBSD NFS servers; it's
> > also just generally a good idea, since UDP frags act as a fixed
> > non-sliding window: NFS over UDP sucks.
>
> Most clients are TCP, but some are still UDP (due to bugs in
> unmentioned Linux distros' NFS clients).

These will be able to starve each other out.  There is a nifty DOS
against the UDP reassembly code that works by sending all but one of
the frags of an overly large datagram.
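For what it's worth, forcing the mounts onto TCP and NFSv3 is just a
matter of mount options; a minimal sketch, assuming a FreeBSD client,
with the server path, mount point, and transfer sizes as placeholders
to experiment with (the Linux line is only for the clients whose NFS
implementation can actually cope with TCP):

	# FreeBSD client: NFSv3 over TCP, 32K read/write sizes
	mount_nfs -3 -T -r 32768 -w 32768 server:/export /mnt

	# Linux client equivalent
	mount -t nfs -o tcp,nfsvers=3,rsize=32768,wsize=32768 server:/export /mnt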
> > Also, you haven't said whether you are using aliases on your
> > network cards; aliases and NFS tend to interact badly.
>
> Nope, no aliases.. I have one card on each network, with one IP per
> card.  I have full subnets (/24) full of P4's trying to slam the NFS
> server for data all the time..

Good that you have no aliases; the aliasing code is not efficient for
a large number of aliases.  Also, the in_pcbhash code could use a
rewrite to handle INADDR_ANY sockets better.  Not a problem for your
load level or configuration.

> > Finally, you probably want to tweak some sysctl's, e.g.
> >
> >	net.inet.ip.check_interface=0
> >	net.inet.tcp.inflight_enable=1
> >	net.inet.tcp.inflight_debug=0
> >	net.inet.tcp.msl=3000
> >	net.inet.tcp.inflight_min=6100
> >	net.isr.enable=1
>
> Ok - done.. some were defaults, and I couldn't find net.isr.enable..
> Did I need to config something on my kernel for it to show up?

You have to set a compile option; look in /usr/src/sys/net; grep for
"netisr_dispatch" or just "dispatch".

> Also, can you explain any of those tweaks?

The check_interface setting makes FreeBSD not care whether the
interface a response comes in on is the same one the request went out
on.  I told you to set that one in case your network topology was at
fault.

The inflight_enable allows inflight processing.  This will cause it
to use an expedited processing path.  The debug is on by default (or
was) when inflight was used, and adds overhead, so it should be
turned off.  Both of these implement about 1/3 of a receiver livelock
solution.

Setting the MSL down decreases your relative bandwidth delay product;
since you are using GigE, this should be relatively low.  If you had
non-local users on a VPN over a slow link, this would probably be a
bad thing.  Local GigE, though, and it's desirable.

The net.isr.enable=1 will save you about 10ms per packet, minimum,
and more if you have a high interrupt overhead that livelocks you out
of running NETISR.  What it does is turn on direct processing by IP
and TCP of packets as they come in off the interface, at the time you
take the interrupt.  Combined with soft interrupt coalescing and
polling, they should give you another 1/3 of the receiver livelock
fixup.  The final third isn't available unless you are willing to
hack network stack code and scheduler code, since FreeBSD doesn't
include LRP or Weighted Fair Share Queueing.

> > Given your overloading of your bus, that last one is probably
> > the most important one: it enables direct dispatch.
> >
> > You'll also want to enable DEVICE_POLLING in your kernel
> > config file (assuming you have a good ethernet card whose
> > driver supports it):
> >
> >	options DEVICE_POLLING
> >	options HZ=2000
>
> Well, the LINT file says only a few cards support it - not sure if I
> should trust that or not, but I have Intel PRO/1000T Server Adapters -
> which should be good enough cards to support it..  I've also put
> 100Mbit cards in place of the gige's for now to make sure I wasn't
> hitting a GigE problem or negotiation problem..

You should grep for DEVICE_POLLING in the network device drivers you
are interested in using to see if they have the support.

Also, you can get up to a 15% improvement by adding soft interrupt
coalescing code, if the driver doesn't already support it (I added it
for a couple of drivers, and it was committed after the benchmarks
showed it was good, but it's not everywhere); the basic idea is that
you take the interrupt, run rx_eof(), and call ether_input(), then
repeat the process until you hit some count limit, or until there's
no more data.  The direct dispatch (net.isr.enable) combined with
that will process most packet trains to completion at interrupt,
saving you 10ms up and 10ms back down per packet exchange (NETISR
only runs on exit from spl or at the HZ tick, which is by default
every 10ms).

> > ...and yet more sysctl's for this:
> >
> >	kern.polling.enable=1
> >	kern.polling.user_frac=50	# 0..100; whatever works best
> >
> > If you've got a really terrible Gigabit Ethernet card, then
> > you may be copying all your packets over again (e.g. m_pullup()),
> > and that could be eating your bus, too.
>
> Ok, so the end result is that after playing around with sysctl's,
> I've found that the tcp transfers are doing 20MB/s over FTP, but my
> NFS is around 1-2MB/s - still slow..  So we've cleared up some tcp
> issues, but yet still NFS is stinky..
>
> Any more ideas?

If you have a choice on the disks, go SCSI; you probably won't have a
choice, though, if you haven't bought them already.  The tagged
command queuing in ATAPI can't disconnect during a write, only during
a read, so writes serialize and reads don't.  On SCSI, neither writes
nor reads serialize (at least until you hit your tag queue depth).

Standard advice about MBUFS/NMBCLUSTERS; see the NOTES files about
these config options.  Also, I would make sure maxusers was non-zero:
disable the auto-tuning, since it's generally not going to give you
an optimal mix for a dedicated server, no matter what it's dedicated
to doing.

There are Sean's suggestions, too...  I don't recommend some of them,
for data integrity reasons (see my comments in response to his post),
but others are very good.

If you can get your Intel cards to play nice with your switch, going
to 8K packets (jumbograms) will help.  In my experience, Intel doesn't
play nice with other card vendors, and there's no real standard for
MTU negotiation, so you have to futz with a lot of equipment to get
it set up (manually locking the MTU).  Also, many switches (e.g.
Alpine) don't really have enough memory in them to deal with this.
Some GigE cards also have too little memory to do this and offload
the TCP checksum processing at the same time.  Reminds me: make sure
your checksums are being done by your cards, if you can: checksum
calculations in software are brutal on your performance.  It is(/was)
an ifconfig option.
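Concretely, both the jumbo MTU and the checksum offload are ifconfig
settings; a rough sketch, assuming em(4) interfaces and a card,
driver, and switch combination that actually supports both (the
interface name and MTU value are placeholders):

	ifconfig em0 mtu 9000		# jumbo frames; lock the same MTU end to end
	ifconfig em0 rxcsum txcsum	# hardware checksum offload, if the driver can
	ifconfig em0			# verify which options actually took effect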
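While you're in the kernel config anyway, the build-time knobs
mentioned above can all live together; a sketch only, with guessed
values for maxusers and NMBCLUSTERS that you'd want to tune against
your own load (DEVICE_POLLING and HZ are the ones already quoted):

	maxusers	256			# non-zero: no auto-tuning
	options		NMBCLUSTERS=32768	# mbuf clusters; see NOTES
	options		DEVICE_POLLING
	options		HZ=2000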
Just for grins (not for production!) you may want to mount your FS
async, and set the NFS async option Sean wrote about.

I would probably disable SYN-caching and SYN-cookies.  I *always*
disable SYN-cookies on any exposed machine (a computational DOS
attack is possible); the SYN-cache is a good defense against a DOS
attack, but if this is an interior machine (and it should be), then
your firewall already protects it; the SYN-cache adds some overhead
(read: latency) you probably don't want, and the cookie code will be
harmless, but isn't terribly useful unless you are getting a huge
connection-attempt-per-second rate.

You may also want to disable slow start and the Nagle algorithm, but
you will have to look those up (doing it makes you a bad network
citizen, and I would be aiding and abetting ;^)).  It shouldn't be
*too* bad if you're switched rather than bridged or hub'ed all the
way through (L4, not L2, so no Alpine GigE).

If you are willing to hack code, PSC at CMU had a nice rate-halving
implementation for a slightly older version of the BSD stack, and
both Rice University and Duke University have an LRP implementation
(Duke's is more modern), but you'll have to know what you're doing in
the stack to port any of these.  You probably don't need to worry
about load-shedding until your machine is spending all its time in
interrupt, so there's no use going into RED queueing or other
programming work.

-- Terry