From: Bruce Evans
Date: Fri, 28 Dec 2007 17:49:14 +1100 (EST)
To: Bruce Evans
Cc: Kostik Belousov, freebsd-stable@FreeBSD.org, freebsd-net@FreeBSD.org
Subject: Re: Packet loss every 30.999 seconds
Message-ID: <20071228170151.C4166@besplex.bde.org>
In-Reply-To: <20071228155323.X3858@besplex.bde.org>

On Fri, 28 Dec 2007, Bruce Evans wrote:

> On Fri, 28 Dec 2007, Bruce Evans wrote:
>
>> In previous mail, you (Mark) wrote:
>>
>> # With FreeBSD 4 I was able to run a UDP data collector with rtprio set,
>> # kern.ipc.maxsockbuf=20480000, then
>> # use setsockopt() with SO_RCVBUF
>> # in the application. If packets were dropped they would show up
>> # with netstat -s as "dropped due to full socket buffers".
>> #
>> # Since the packet never makes it to ip_input() I no longer have
>> # any way to count drops. There will always be corner cases where
>> # interrupts are lost and drops not accounted for if the adapter
>> # hardware can't report them, but right now I've got no way to
>> # estimate any loss.

I found where drops are recorded for the net.isr.direct=0 case. It is
in net.inet.ip.intr_queue_drops. The netisr subsystem just calls
IF_HANDOFF(), and IF_HANDOFF() calls _IF_DROP() if the queue fills up.
_IF_DROP(ifq) just increments ifq->ifq_drops. The usual case for netisrs
is for the queue to be ipintrq for NETISR_IP.

The following details don't help:

- drops for input queues don't seem to be displayed by any utilities
  (except that ones for ipintrq are displayed primitively by
  sysctl net.inet.ip.intr_queue_drops). netstat and systat only display
  drops for send queues and ip frags.

- the netisr subsystem's drop count doesn't seem to be displayed by any
  utilities except sysctl. It only counts drops due to there not being a
  queue; other drops are counted by _IF_DROP() in the per-queue counter.
  Users have a hard time integrating all these primitively displayed drop
  counts with other error counters.

- the length of ipintrq defaults to the default ifq length of
  ipqmaxlen = IFQ_MAXLEN = 50. This is inadequate if there is just one
  NIC in the system that has an rx ring size of >= slightly less than 50.
  But 1 Gbps NICs should have an rx ring size of 256 or 512 (I think the
  size is 256 for em; it is 256 for bge due to bogus configuration of
  hardware that can handle it being 512). If the larger hardware rx ring
  is actually used, then ipintrq drops are almost ensured in the direct=0
  case, so using the larger h/w ring is worse than useless (it also
  increases cache misses). This is for just one NIC.
  This problem is often limited by handling rx packets in small bursts,
  at a cost of extra overhead. Interrupt moderation increases it by
  increasing burst sizes.

  This contrasts with the handling of send queues. Send queues are
  per-interface and most drivers increase the default length from 50 to
  their ring size (-1 for bogus reasons). I think this is only an
  optimization, while a similar change for rx queues is important for
  avoiding packet loss. For send queues, the ifq acts mainly as a
  primitive implementation of watermarks. I have found that tx queue
  lengths need to be more like 5000 than 50 or 500 to provide enough
  buffering when applications are delayed by other applications or just
  by sleeping until the next clock tick, and use tx queues of length
  ~20000 (a couple of clock ticks at HZ = 100), but now think queue
  lengths should be restricted to more like 50 since long queues cannot
  fit in L2 caches (not to mention they are bad for latency).

The length of ipintrq can be changed using sysctl
net.inet.ip.intr_queue_maxlen. Changing it from 50 to 1024 turns most or
all ipintrq drops into "socket buffer full" drops (640 kpps input packets
and 434 kpps socket buffer fulls with direct=0; 640 kpps input packets and
324 kpps socket buffer fulls with direct=1).

Bruce
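
For illustration only, here is a minimal userspace sketch of the drop
accounting described above for the direct=0 case. It is not the kernel
code: struct model_ifq, model_handoff() and model_burst_drops() are
hypothetical names, and the model assumes the worst case where a whole
rx burst is handed off before the netisr gets to drain anything. The
drops field plays the role of the per-queue counter shown by
sysctl net.inet.ip.intr_queue_drops.

```c
#include <assert.h>

/* Hypothetical userspace model; not the kernel's struct ifqueue. */
struct model_ifq {
	int len;	/* packets currently queued */
	int maxlen;	/* queue capacity (50 by default for ipintrq) */
	int drops;	/* per-queue drop counter */
};

/*
 * Models the handoff step: enqueue if there is room, otherwise count a
 * drop. Returns 1 if the packet was queued and 0 if it was dropped.
 */
static int
model_handoff(struct model_ifq *q)
{
	if (q->len >= q->maxlen) {
		q->drops++;
		return (0);
	}
	q->len++;
	return (1);
}

/*
 * Hands one burst of `burst` packets to an initially empty queue with
 * no consumer running in between (an assumption of this sketch), and
 * returns how many packets were dropped.
 */
static int
model_burst_drops(int maxlen, int burst)
{
	struct model_ifq q = { 0, maxlen, 0 };
	int i;

	assert(maxlen >= 0 && burst >= 0);
	for (i = 0; i < burst; i++)
		(void)model_handoff(&q);
	return (q.drops);
}
```

Under these assumptions, model_burst_drops(50, 256) returns 206 (a full
256-entry rx ring emptied into the default-length queue), while
model_burst_drops(1024, 256) returns 0. Real behaviour depends on how
often the netisr runs between bursts, which the sketch ignores.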