Date: Wed, 5 Dec 2012 14:31:17 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Andre Oppermann
Cc: Barney Cordoba, Adrian Chadd, John Baldwin, freebsd-net@FreeBSD.org
Subject: Re: Latency issues with buf_ring

On Tue, 4 Dec 2012, Andre Oppermann wrote:

> For most if not all ethernet drivers from 100Mbit/s the TX DMA rings
> are so large that buffering at the IFQ level doesn't make sense anymore
> and only adds latency.

I found sort of the opposite for bge at 1Gbps.  Most or all bge NICs
have a tx ring size of 512.  The ifq length is the tx ring size minus 1
(511).  I needed to expand this to imax(2 * tick / 4, 10000) to maximize
pps.  This does bad things to latency and worse things to caching (512
buffers might fit in the L2 cache, but 10000 buffers bust any reasonable
cache as they are cycled through), but I only tried to optimize tx pps.

> So it could simply directly put everything into
> the TX DMA and not even try to soft-queue.  If the TX DMA ring is full
> ENOBUFS is returned instead of filling yet another queue.

That could work, but upper layers currently don't understand ENOBUFS at
all, so it would work poorly now.  Also, 512 entries is not many, so
even if upper layers understood ENOBUFS it is not easy for them to
_always_ respond fast enough to keep the tx active, unless there are
upstream buffers with many more than 512 entries.  There needs to be
enough buffering somewhere so that the tx ring can be replenished almost
instantly from the buffer, to handle the worst-case latency for the
threads generating new (unbuffered) packets.  At the line rate of
~1.5 Mpps for 1 Gbps, the maximum latency that can be covered by 512
entries is only 340 usec.
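To make that number concrete, here is a rough standalone calculation
(an illustration only, not driver code; it assumes the standard
~1.488 Mpps minimum-frame line rate for 1 Gbps ethernet, and the
buffer depths discussed above):

#include <stdio.h>

/*
 * How long a tx buffer of a given depth can keep the hardware busy at
 * a given packet rate: 512 is the bge tx ring, 1023 is the ring plus
 * the default ifq, 10000 is the expanded ifq mentioned above.
 */
static double
cover_usec(int depth, double pps)
{
        return (depth / pps * 1e6);
}

int
main(void)
{
        const double pps = 1488095.0;   /* 1 Gbps, minimal packets */

        printf("tx ring (512):        %5.0f usec\n", cover_usec(512, pps));
        printf("ring + ifq (1023):    %5.0f usec\n", cover_usec(1023, pps));
        printf("expanded ifq (10000): %5.0f usec\n", cover_usec(10000, pps));
        return (0);
}

The bare ring gives the ~340 usec quoted above, and even the ring plus
the default ifq only covers about 0.7 msec, which is why a much larger
upstream buffer was needed to ride out the generating threads' latency.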
> However there
> are ALTQ interactions and other mechanisms which have to be considered
> too making it a bit more involved.

I didn't try to handle ALTQ or even optimize for TCP.

More details: to maximize pps, the main detail is to ensure that the tx
ring never becomes empty.  The tx then transmits as fast as possible.
This requires some watermark processing, but FreeBSD has almost none for
tx rings.  The following normally happens for packet generators like
ttcp and netsend:

- loop calling send() or sendto() until the tx ring (and also any
  upstream buffers) fill up.  Then ENOBUFS is returned.
- watermark processing is broken in the user API at this point.  There
  is no way for the application to wait for the ENOBUFS condition to go
  away (select() and poll() don't work).  Applications use poor
  workarounds:
  - old (~1989) ttcp sleeps for 18 msec when send() returns ENOBUFS.
    This was barely good enough for 1 Mbps ethernet (line rate ~1500 pps
    is 27 per 18 msec, so IFQ_MAXLEN = 50 combined with just a 1-entry
    tx ring provides a safety factor of about 2).  Expansion of the tx
    ring size to 512 makes this work with 10 Mbps ethernet too.
    Expansion of the ifq to 511 gives another factor of 2.  After losing
    the safety factor of 2, we can now handle 40 Mbps ethernet, and are
    only a factor of 25 short for 1 Gbps.  My hardware can't do line
    rate for small packets -- it can only do 640 kpps.  Thus ttcp is
    only a factor of 11 short of supporting the hardware at 1 Gbps.
    This assumes that sleeps of 18 msec are actually possible, which
    they aren't with HZ = 100 giving a granularity of 10 msec, so that
    sleep(18 msec) actually sleeps for an average of 23 msec.  -current
    uses the bad default of HZ = 1000.  With that, sleep(18 msec) would
    average 18.5 msec.  Of course, ttcp should sleep for more like
    1 msec if that is possible.  Then the average sleep is 1.5 msec.
    ttcp can keep up with the hardware with that, and is only slightly
    behind the hardware with the worst-case sleep of 2 msec (512+511
    packets generated every 2 msec is 511.5 kpps).  I normally use old
    ttcp, except I modify it to sleep for 1 msec instead of 18 in one
    version, and in another version I remove the sleep so that it
    busy-waits in a loop that calls send() which almost always returns
    ENOBUFS (a rough sketch of this retry loop appears further below).
    The latter wastes a lot of CPU, but is almost good enough for
    throughput testing.
  - newer ttcp tries to program the sleep time in microseconds.  This
    doesn't really work, since the sleep granularity is normally at
    least a millisecond, and even if it could be the 340 microseconds
    needed by bge with no ifq (see above, and better divide the 340 by
    2), then this is quite short and would take almost as much CPU as
    busy-waiting.  I consider HZ = 1000 to be another form of
    polling/busy-waiting and don't use it except for testing.
  - netrate/netsend also uses a programmed sleep time.  This doesn't
    really work, as above.  netsend also tries to limit its rate based
    on sleeping.  This is further from working, since even finer-grained
    sleeps are needed to limit the rate accurately than to keep up with
    the maximum rate.

Watermark processing at the kernel level is not quite as broken.  It is
mostly non-existent, but partly works, sort of accidentally.  The
difference is now that there is a tx "eof" or "completion" interrupt
which indicates the condition corresponding to the ENOBUFS condition
going away, so that the kernel doesn't have to poll for this.  This is
not really an "eof" interrupt (unless bge is programmed insanely, to
interrupt only after the tx ring is completely empty).
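For reference, the userland workaround described in the list above
boils down to something like the following.  This is only an
illustrative sketch in the spirit of a ttcp modified to sleep ~1 msec
on ENOBUFS, not ttcp's or netsend's actual code, and the function name
is made up:

#include <sys/types.h>
#include <sys/socket.h>
#include <errno.h>
#include <unistd.h>

/*
 * Sketch: there is no select()/poll() support for the ENOBUFS
 * condition, so the only options are to sleep and retry or to
 * busy-wait.  With HZ = 100 the 1 msec request is rounded up to a
 * full tick, which is the granularity problem described above.
 */
ssize_t
send_enobufs_retry(int s, const void *buf, size_t len)
{
        ssize_t n;

        for (;;) {
                n = send(s, buf, len, 0);
                if (n >= 0 || errno != ENOBUFS)
                        return (n);
                usleep(1000);   /* omit this to busy-wait instead */
        }
}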
That interrupt acts as primitive watermarking.  bge can be programmed to
interrupt after having sent every N packets (strictly, after every N
buffer descriptors, but for small packets these are the same).  When
there are more than N packets to start, say M, this acts as a watermark
at M-N packets.  bge is normally misprogrammed with N = 10.  At the line
rate of 1.5 Mpps, this asks for an interrupt rate of 150 kHz, which is
far too high and is usually unreachable, so reaching the line rate is
impossible due to the CPU load from the interrupts.  I use N = 384 or
256 so that the interrupt rate is not the dominant limit.  However,
N = 10 is better for latency and works under light loads.  It also
reduces the amount of buffering needed.

The ifq works more as part of accidental watermarking than as a buffer.
It is the same size as the tx ring (actually 1 smaller for bogus
reasons), so it is not really useful as a buffer.  However, with no
explicit watermarking, any separate buffer like the ifq provides a sort
of watermark at the boundary between the buffers.  The usefulness of
this would be most obvious if the tx "eof" interrupt were actually for
eof (perhaps that is what it was originally).  Then on the eof
interrupt, there is no time at all to generate new packets, and the
time when the tx is idle can be minimized by keeping pre-generated
packets handy where they can be copied to the tx ring at tx "eof"
interrupt time.  A buffer of about the same size as the tx ring (or
maybe 1/4 the size) is enough for this.

OTOH, with bge misprogrammed to interrupt after every 10 tx packets, the
ifq is useless for its watermark purposes.  The watermark is effectively
in the tx ring, and very strangely placed there at 10 below the top
(ring full).  Normally tx watermarks are placed near the bottom (ring
empty).  They must not be placed too near the bottom, else there would
not be enough time to replenish the ring between the time when the
"eof" (really, the "watermark") interrupt is received and when the tx
runs dry.  They should not be placed too near the top like they are in
-current's bge, else the point of having a large tx ring is defeated and
there are too many interrupts.  However, when they are placed near the
top, latency requirements are reduced.

I recently worked on buffering for sio and noticed similar problems for
tx watermarks.  Don't laugh -- serial i/o 1 character at a time at
3.686400 Mbps has much the same timing requirements as ethernet i/o 1
packet at a time at 1 Gbps.  Each serial character takes ~2.7 usec and
each minimal ethernet packet takes ~0.67 usec.  With tx "ring" sizes of
128 and 512 respectively, the ring times from full to empty are 347 usec
for serial i/o and 341 usec for ethernet i/o.  Strangely, tx is harder
than rx because:
- perfection is possible and easier to measure for tx.  It consists of
  just keeping at least 1 entry in the tx ring at all times.  Latency
  must be kept below ~340 usec to have any chance of this.  This is not
  so easy to achieve under _all_ loads.
- for rx, you have an external source generating the packets, so you
  don't have to worry about latency affecting the generators.
- the need for watermark processing is better known for rx, since it
  obviously doesn't work to generate the rx "eof" interrupt near the
  top.

The serial timing was actually harder to satisfy, because I worked on it
on a 366 MHz CPU while I worked on bge on a 2 GHz CPU, and even the
2 GHz CPU couldn't keep up with line rate (so from full to empty takes
800 usec).
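To put rough numbers on the interrupt-coalescing trade-off described
above (again just an illustrative calculation, using the ~1.5 Mpps
minimal-packet rate and the 512-entry bge ring; not driver code):

#include <stdio.h>

/*
 * bge interrupts after every N tx packets, so with a full 512-entry
 * ring the effective watermark sits at 512 - N entries, the interrupt
 * rate at line rate is rate/N, and the driver has (512 - N) packet
 * times to refill the ring before it runs dry.
 */
int
main(void)
{
        const double pps = 1488095.0;   /* 1 Gbps, minimal packets */
        const int ring = 512;
        const int nvals[] = { 10, 256, 384 };
        size_t i;

        for (i = 0; i < sizeof(nvals) / sizeof(nvals[0]); i++) {
                int n = nvals[i];

                printf("N = %3d: watermark at %3d, %6.1f kHz interrupts, "
                    "%3.0f usec to refill\n",
                    n, ring - n, pps / n / 1000.0, (ring - n) / pps * 1e6);
        }
        return (0);
}

With N = 10 the interrupt rate alone is ~150 kHz as noted above, though
there is ~340 usec of slack to refill; with N = 384 the interrupt rate
drops to ~3.9 kHz at the cost of a tighter ~86 usec refill deadline.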
It turned out that the best position for the tx low watermark is about
1/4 or 1/2 from the bottom for both sio and bge.  It must be fairly
high, else the latency requirements are not met.  In the middle is a
good general position.  Although it apparently "wastes" half of the
ring to make the latency requirements easier to meet (without very
system-dependent tuning), the efficiency lost from this is reasonably
small.

Bruce