Date: Sun, 21 Jan 2007 17:25:14 +1100 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: net@freebsd.org
Subject: slow writes on nfs with bge devices

nfs writes much less well with bge NICs than with other NICs (sk, fxp,
xl, even rl).  Sometimes writing a 20K source file from vi seems to take
about 2 seconds instead of seeming to be instantaneous (this gets faster
as the system warms up).  Iozone shows the problem more reproducibly.
E.g.:

100Mbps fxp server -> 1Gbps bge 5701 client, udp:
%%%
        IOZONE: Performance Test of Sequential File I/O -- V1.16 (10/28/92)
                By Bill Norcott

        Operating System: FreeBSD -- using fsync()

        IOZONE: auto-test mode
        MB  reclen  bytes/sec written  bytes/sec read
         1     512          1516885        291918639
         1    1024          1158783        491354263
         1    2048          1573651        715694105
         1    4096          1223692        917431957
         1    8192           729513       1097929467
         2     512          1694809        281196631
         2    1024          1379228        507917189
         2    2048          1659521        789608264
         2    4096          4606056       1064567574
         2    8192          1142288       1318131028
         4     512          1242214        298269971
         4    1024          1853545        492110628
         4    2048          2120136        742888430
         4    4096          1896792       1121799065
         4    8192           850210       1441812403
         8     512          1563847        281422325
         8    1024          1480844        492749552
         8    2048          1658649        850165954
         8    4096          2105283       1211348180
         8    8192          2098425       1554875506
        16     512          1508821        296842294
        16    1024          1966239        527850530
        16    2048          2036609        842656736
        16    4096          1666138       1200594889
        16    8192          2293378       1620824908
        Completed series of tests
%%%

Here bge barely reaches 10Mbps speeds (~1.2 MB/S) for writing.  Reading
is cached well and fast.  100Mbps xl on the same client with the same
server goes at full 100Mbps speed (11.77 MB/S for all file sizes,
including larger ones, since the disk is not the limit at 100Mbps).
1Gbps sk on a different client with the same server goes at full 100Mbps
speed.  Switching to tcp gives full 100Mbps speed.  However, when the
bge link speed is reduced to 100Mbps, udp becomes about 10 times slower
than the above and tcp becomes about as slow as the above (maybe a bit
faster, but far below 11.77 MB/S).
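For concreteness, here is a minimal sketch of the kind of sequential
write test behind the "bytes/sec written" column above: write the file
in fixed-size records on the nfs mount, fsync() it, and divide the byte
count by the elapsed time.  This is not iozone's code; the path and
sizes are just placeholders.
%%%
/* Minimal iozone-style sequential write test (illustrative only). */
#include <sys/time.h>

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
        const size_t reclen = 8192;             /* record length */
        const size_t nbytes = 16 * 1024 * 1024; /* total file size (16MB) */
        struct timeval t0, t1;
        double secs;
        size_t done;
        char *buf;
        int fd;

        if ((buf = malloc(reclen)) == NULL)
                err(1, "malloc");
        memset(buf, 'x', reclen);

        /* The file lives on the nfs mount under test (placeholder path). */
        fd = open("/nfs/iozone.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd == -1)
                err(1, "open");

        gettimeofday(&t0, NULL);
        for (done = 0; done < nbytes; done += reclen)
                if (write(fd, buf, reclen) != (ssize_t)reclen)
                        err(1, "write");
        if (fsync(fd) == -1)            /* push the data to the server */
                err(1, "fsync");
        gettimeofday(&t1, NULL);

        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%zu bytes written: %.0f bytes/sec\n", nbytes, nbytes / secs);
        close(fd);
        return (0);
}
%%%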
bge is also slow at nfs serving:

1Gbps bge 5701 server -> 1Gbps sk client:
%%%
        IOZONE: Performance Test of Sequential File I/O -- V1.16 (10/28/92)
                By Bill Norcott

        Operating System: FreeBSD -- using fsync()

        IOZONE: auto-test mode
        MB  reclen  bytes/sec written  bytes/sec read
         1     512         36255350        242114472
         1    1024          3051699        413319147
         1    2048         22406458        632021710
         1    4096         22447700        851162198
         1    8192          3522493       1047562648
         2     512          3270779         48125247
         2    1024         28992179         46693718
         2    2048          5956380        753318255
         2    4096         27616650       1053311658
         2    8192          5573338         48290208
         4     512          9004770         47435659
         4    1024          9576276         45601645
         4    2048         30348874         85116667
         4    4096          8635673         86150049
         4    8192          9356773         47100031
         8     512          9762446         46424146
         8    1024         10054027         58344604
         8    2048          9197430         60253061
         8    4096         15934077         59476759
         8    8192          8765470         47647937
        16     512          5670225         46239891
        16    1024          9425169         45950990
        16    2048          9833515         46242945
        16    4096         14812057         51313693
        16    8192          9203742         47648722
        Completed series of tests
%%%

Now the available bandwidth is 10 times larger and about 9/10 of it is
still not used, with a high variance.  For larger files, the variance is
lower and the average speed is about 10MB/S.  The disk can only do about
40MB/S, and the slowest of the 1Gbps NICs (sk) can only sustain 80MB/S
through udp and about 50MB/S through tcp (it is limited by the 33 MHz
32-bit PCI bus and by being less smart than the bge interface).

When the bge NIC was on the system which is now the server with the fxp
NIC, bge and nfs worked unsurprisingly, just slower than I would have
liked.  The write speed was 20-30MB/S for large files and 30-40MB/S for
medium-sized files, with low variance.  This is the only configuration
in which nfs/bge worked as expected.

The problem is very old and not very hardware dependent.  Similar
behaviour happens when some of the following are changed:

    OS       -> FreeBSD-~5.2 or FreeBSD-6
    hardware -> newer amd64 CPU (Turion X2) with 5705 (iozone output for
                this below) instead of old amd64 CPU with 5701

The newer amd64 normally runs an i386-SMP current kernel, while the old
amd64 was running an amd64-UP current kernel in the above tests but
normally runs ~5.2 amd64-UP and behaves similarly with that.  The
combination that seemed to work right was an AthlonXP for the server
with the same 5701 and any kernel.  The only strangeness with that was
that current kernels gave a 5-10% slower nfs server despite giving a
30-90% larger packet rate for small packets.

100Mbps fxp server -> 1Gbps bge 5705 client:
%%%
        IOZONE: Performance Test of Sequential File I/O -- V1.16 (10/28/92)
                By Bill Norcott

        Operating System: FreeBSD -- using fsync()

        IOZONE: auto-test mode
        MB  reclen  bytes/sec written  bytes/sec read
         1     512          2994400        185462027
         1    1024          3074084        337817536
         1    2048          2991691        576792985
         1    4096          3074759        884740798
         1    8192          3078019       1176892296
         2     512          4262096        186709962
         2    1024          2994468        339893080
         2    2048          5112176        584846610
         2    4096          4754187        909815165
         2    8192          5100574       1212919611
         4     512          5298715        187129017
         4    1024          5302620        344445041
         4    2048          4985597        590579630
         4    4096          3703618        927711124
         4    8192          5236177       1240896243
         8     512          5142274        186899396
         8    1024          6207933        345564808
         8    2048          6162773        593088329
         8    4096          6031445        936751120
         8    8192          6072523       1224102288
        16     512          5427113        186797193
        16    1024          5065901        345544445
        16    2048          5462338        595487384
        16    4096          5256552        937013065
        16    8192          5097101       1226320870
        Completed series of tests
%%%

rl on a system with 1/20 as much CPU is faster than this.

The problem doesn't seem to affect much besides writes on nfs.  The bge
5701 works very well for most things.
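The ceilings behind those numbers are easy to spell out.  The sketch
below is just back-of-the-envelope arithmetic, not a measurement; the
~10MB/S write speed is the rough average taken from the table above.
%%%
/* Back-of-the-envelope bandwidth ceilings for the figures quoted above. */
#include <stdio.h>

int
main(void)
{
        double pci = 33.0e6 * 4;  /* 32-bit, 33 MHz PCI: 132 MB/s peak */
        double gige = 1.0e9 / 8;  /* 1Gbps Ethernet: 125 MB/s of raw bytes */
        double disk = 40.0e6;     /* disk limit quoted above */
        double write = 10.0e6;    /* ~average nfs write speed observed */

        printf("PCI peak  %5.0f MB/s  (write speed uses %4.1f%%)\n",
            pci / 1e6, 100 * write / pci);
        printf("GigE raw  %5.0f MB/s  (write speed uses %4.1f%%)\n",
            gige / 1e6, 100 * write / gige);
        printf("disk      %5.0f MB/s  (write speed uses %4.1f%%)\n",
            disk / 1e6, 100 * write / disk);
        return (0);
}
%%%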
It has a much better bus interface than the 5705 and works even better
after moving it to the old amd64 system (it can now saturate 1Gbps,
where on the AthlonXP it only got 3/4 of the way, while the 5705 only
gets 1/4 of the way).

I've been working on minimising network latency and maximising packet
rate, and normally have very low network latency (60-80 uS for ping) and
fairly high packet rates.  The changes for this are not the cause of the
bug :-), since the behaviour is not affected by running kernels without
these changes or by sysctl'ing the changes to be null.  However, the
problem looks like one caused by large latencies combined with
non-streaming protocols.  To write at just 11.77 MB/S, at least 8000
packets/second must be sent from the client to the server (11.77 MB/S of
data in packets no larger than the 1500-byte MTU, less protocol headers,
works out to roughly 8000 packets/second).  Working clients sustain this
rate, but on broken clients the rate is much lower and not sustained.

Output from netstat -s 1 on the server while writing a ~1GB file via
5701/udp:
%%%
            input        (Total)           output
   packets  errs      bytes    packets  errs      bytes colls
       900     0    1513334        142     0      33532     0
      1509     0    2564836        236     0      57368     0
      1647     0    2295802        259     0      51106     0
      1603     0    1502736        252     0      32926     0
      1055     0     637014        163     0      13938     0
       558     0    1542510         86     0      34340     0
       984     0     989854        155     0      21816     0
       864     0    1320786        135     0      38152     0
       883     0    1558060        165     0      34340     0
      1177     0    3780102        203     0      85850     0
      2087     0     954212        331     0      21210     0
      1187     0    1413568        190     0      31310     0
       650     0    3320604        101     0      75346     0
      1565     0    1706542        246     0      37976     0
      2055     0    2360620        329     0      52318     0
      1554     0    2416996        244     0      54226     0
      1402     0    2579894        220     0      58176     0
      1690     0     774488        267     0      16968     0
      1323     0    3690650        209     0      83830     0
       591     0    4519858         92     0     103110     0
%%%

There is no sign of any packet loss or switch problems.  Forcing
1000baseTX full-duplex has no effect.  Forcing 100baseTX full-duplex
makes the problem more obvious.  The mtu is 1500 throughout, since only
bge-5701 and sk support jumbo frames and I want to use udp for nfs.

5705/udp is better:
%%%
            input        (Total)           output
   packets  errs      bytes    packets  errs      bytes colls
      5209     0    6607758        846     0     151702     0
      4763     0    6684546        773     0     153520     0
      4758     0    6618498        769     0     151298     0
      3582     0    7057568        576     0     162498     0
      4935     0    5115068        800     0     116756     0
      4924     0    6622026        798     0     152802     0
      4095     0    6018462        657     0     137450     0
      4647     0    5270442        751     0     120594     0
      4673     0    5451948        758     0     123624     0
      2340     0    6001986        372     0     138168     0
      3750     0    6150610        604     0     140996     0
%%%

sk/udp works right:
%%%
            input        (Total)           output
   packets  errs      bytes    packets  errs      bytes colls
      8638     0   12384676       1440     0     293062     0
      8636     0   12415646       1439     0     293708     0
      8637     0   12415646       1441     0     293708     0
      8637     0   12415646       1439     0     293708     0
      8637     0   12417160       1440     0     293708     0
      8636     0   12413162       1439     0     293506     0
      8637     0   12414132       1439     0     293708     0
      8636     0   12417160       1440     0     293708     0
      8637     0   12415646       1439     0     293708     0
      8636     0   12417160       1440     0     293708     0
      8637     0   12414676       1439     0     293506     0
%%%

sk is under ~5.2 with latency/throughput/efficiency optimizations that
don't have much effect here.

Bruce