Date: Tue, 14 Jan 97 23:09:29 +0000
From: Andrew.Gordon@net-tel.co.uk
To: eivind@dimaga.com
Cc: hackers@freebsd.org
Subject: Re: IPFW + Samba -> performance problem
Message-ID: <"45f6-970114231001-B849*/G=Andrew/S=Gordon/O=NET-TEL Computer Systems Ltd/PRMD=NET-TEL/ADMD=Gold 400/C=GB/"@MHS>
In-Reply-To: <3.0.32.19970114125837.00a71dc0@dimaga.com>
> This same server dials out with PPP.  One day I got a fit of paranoia,
> and decided to install ipfw to throw away packets coming from the net.
> The firewalling worked; performance for reads from Samba is the same as
> ever, but performance for writes dropped from well above 500KB/s to
> approx 20KB/s (25-fold).

BTW, if you're using /sbin/ppp and your firewalling requirements are
simple, you may find that the ppp daemon's built-in packet filters are
adequate for your purposes ("set ifilter xxxx" etc.).  This would avoid
needing IPFW in the kernel.

> Has anybody got a clue?  Because, in this case, I haven't.  (A
> hypothesis is that something might happen to the TCP_NODELAY option
> when firewalling is enabled, but this sounds kind of unlikely.)

Maybe not so unlikely, though if so it points to a need for TCP_NODELAY
at the client end (if my understanding is right).

I haven't hit this exact problem, but I did spend a long time looking at
tcpdump output a while ago to explain variable _read_ performance we
were seeing - all the old client machines (mostly 486s) had been working
fine, but a new P120 client was much slower than the other machines at
reading from Samba.  It turned out that TCP_NODELAY was the solution
(and at the time the FreeBSD port of Samba was missing a #include, so
the -O TCP_NODELAY option didn't work!).  Perhaps an explanation of what
I found will help diagnose your problem.

The SMB protocol is request/response: over a single TCP connection, the
client sends a request and waits for a response to come back, with the
next request not being issued until the previous response has been
completely received.  [I don't think this is a protocol restriction, but
in practice a single-user client doesn't know what to do next until the
previous block has come in.]  In the case of a read request, the request
is small and the response can be of variable size; but when loading .EXE
files (the main benchmark in real life) the reads seem to come in about
5K blocks.
If Samba generated the result in a single write()/writev() call on the
socket there would be no problem, but in fact it does a number of small
write() calls [presumably to handle the case of really big reads??].
The result is going to be packetized by TCP for transmission, and so you
have the Windows read size, the block size used by Samba for its write()
calls, the TCP MTU size, and the socket write buffer size all
interacting to control what happens - and all are arbitrary numbers
which don't fit in convenient multiples.  In particular, the write()
size is typically one-and-a-bit times the TCP MTU.  Also, the reads are
typically not aligned to filesystem blocks.

Now, suppose that the first few write() calls were made very quickly,
but there is a small delay (perhaps reading the disc) before the last
write().  It is extremely unlikely that the sum of all the write() calls
is an exact multiple of the MTU size, so the data will get transmitted
in a few full-size packets and a small one.  The last write() now
happens, and since the read transactions for loading .EXE files seem to
be a mixture of sizes, the overall read is not a multiple of Samba's
write() size - so the last write() will be shorter than the other ones.
If you are unlucky, the last write() will be less than the TCP MTU size.

At this point, the TCP Nagle algorithm comes into play.  This says that
the transmitter should not transmit another 'short' (i.e. < MTU size)
packet while there is already a short packet unacknowledged.  However,
since the outstanding packet(s) are less than the window size, the
receiving end implements delayed acknowledgements and will wait 200ms in
case there is data going the other way that can carry the
acknowledgement (or in case more data arrives).  As already noted,
neither of those things is going to happen in this case, so nothing
happens until the delayed ack timer goes off, the ack is transmitted,
and the last piece of the transaction can be sent.
So, if the numbers happen to stack up against you, there is a 200ms
delay per SMB read transaction - if the transactions are 5Kbyte, this
means only 25Kbyte/sec.  The whole thing is _very_ sensitive to a large
number of variables - if the network is busy, the NIC is slow, or
(window size permitting) the client is slow to ack the first window full
of data, then the server's buffers never drain and the problem doesn't
happen.

Of course, setting TCP_NODELAY disables the Nagle algorithm and so the
problem doesn't happen.  [IMHO, this should really be hard-wired in
Samba, since the Nagle algorithm is designed to optimise interactive
traffic with character echo, and the case where you "win" never happens
in SMB traffic.]

Everything I have been describing here applies to read transactions, but
of course the same considerations would apply at the client end when
doing write transactions.  Maybe Microsoft forgot the TCP_NODELAY?  Or
some similar malfunction occurs.  I would suggest watching the traffic
between client and server with tcpdump: you would hope for the gaps
between packets to be small and fairly constant - but the pattern I was
observing gave bursts of rapid transmission separated by pauses of over
100ms.

Good luck!

Andrew Gordon.