From: Bruce Evans
Date: Sun, 20 Dec 2009 04:15:42 +1100 (EST)
To: Harti Brandt
Cc: Ulrich Spörlein, freebsd-arch@freebsd.org, Hans Petter Selasky
Subject: Re: network statistics in SMP

On Sat, 19 Dec 2009, Harti Brandt wrote:

> On Sun, 20 Dec 2009, Bruce Evans wrote:
>
> > [... complications]
>
> To be honest, I'm lost now.
> Couldn't we just use the largest atomic type
> for the given platform and atomic_inc/atomic_add/atomic_fetch and handle
> the 32->64 bit stuff (for IA32) as I do it in bsnmp, but as a kernel
> thread?

That's probably best (except without the atomic operations), like I said
originally.  I tried to spell out the complications to make it clear that
they would be too much except for incomplete ones.

> Are the 5-6 atomic operations really that costly given the many operations
> done on an IP packet? Are they more costly than a heavyweight sync for
> each ++ or +=?

rwatson found that even non-atomic operations are quite costly, since at
least on amd64 and i386, ones that write (or any access?) the same
address (or cache line?) apparently involve much the same hardware
activity (cache snoop?) as atomic ones implemented by locking the bus.
I think this is mostly historical -- it should be necessary to lock the
bus to get the slow version.  Per-CPU counters give separate addresses
and also don't require the bus lock.  I don't like the complexity of
per-CPU counters, but I don't use big SMP systems enough to know what the
locks cost in real applications.

> Or we could use the PCPU stuff, use just ++ and += for modifying the
> statistics (32bit) and do the 32->64 bit stuff for all platforms with a
> kernel thread per CPU (do we have this?). Between that thread and the
> sysctl we could use a heavy sync.

I don't like the squillions of threads in FreeBSD-post-4, but this seems
to need its own thread and there isn't one yet AFAIK.  I think a thread
is only needed for the 32-bit stuff (since aggregation has to use the
current values and it shouldn't have to ask a thread to sum them).  The
thread should maintain only the high 32 or 33 bits of the 64-bit
counters.  Maybe there should be a thread per CPU (ugh) with per-CPU
extra bits so that these bits can be accessed without locking.  The
synchronization is still interesting.
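A minimal sketch of the per-CPU scheme discussed above, assuming a
fixed CPU count and ignoring the synchronization between the fixup pass
and the counting CPU (which, as noted, is the interesting part).  All
names here (NCPU, struct pcpu_stat, stat_add, stat_fixup, stat_read)
are hypothetical and not existing FreeBSD interfaces.  The fast path is
a plain 32-bit add; a periodic pass detects wraparound of the low word
and maintains the high bits.

```c
#include <assert.h>
#include <stdint.h>

#define NCPU 4                  /* hypothetical fixed CPU count */

struct pcpu_stat {
	uint32_t low;           /* written only by the owning CPU */
	uint32_t last_low;      /* low word seen by the last fixup pass */
	uint32_t high;          /* maintained only by the fixup pass */
};

static struct pcpu_stat stats[NCPU];

/* Fast path: plain 32-bit add, no atomics and no locks. */
static void
stat_add(int cpu, uint32_t n)
{
	assert(cpu >= 0 && cpu < NCPU);
	stats[cpu].low += n;    /* unsigned arithmetic wraps modulo 2^32 */
}

/*
 * Fixup pass: must run at least once per wrap period of the low word.
 * If the low word went backwards since the last pass, it wrapped once.
 */
static void
stat_fixup(int cpu)
{
	struct pcpu_stat *s = &stats[cpu];
	uint32_t now = s->low;

	if (now < s->last_low)
		s->high++;
	s->last_low = now;
}

/* Aggregate all CPUs into one 64-bit value using the fixed-up snapshot. */
static uint64_t
stat_read(void)
{
	uint64_t sum = 0;

	for (int cpu = 0; cpu < NCPU; cpu++) {
		stat_fixup(cpu);
		sum += ((uint64_t)stats[cpu].high << 32) | stats[cpu].last_low;
	}
	return (sum);
}
```

In this sketch the reader runs the fixup itself, so aggregation does not
have to ask a thread for the sum; a real implementation would still need
to order the reads of the low word and the high bits against a concurrent
fixup pass.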
> Or we could use PCPU and atomic_inc/atomic_add/atomic_fetch with the
> largest atomic type for the platform, handle the aggregation and (on IA32)
> the 32->64 bit stuff in a kernel thread.

I don't see why using atomics or locks for just the 64-bit counters is
good.  We will probably end up with too many 64-bit counters, especially
if they don't cost much when not read.

I just thought of another implementation to reduce reads: trap on
overflow and handle all the complications in the trap handler, or just
set a flag to tell the fixup thread to run and normally don't run the
fixup thread.  This doesn't quite seem to work -- arranging for the trap
would be costly (needs the "into" instruction on i386?).  Similarly for
explicit tests for wraparound (PCPU_INC() could be a function call that
does the test and handles wraparound in a fully locked fashion.  We don't
care that this code executes slowly since it rarely executes, but we care
that the test pessimizes the usual case).

There is also "lock cmpxchg8b" on i386.  I think this can be used in a
loop to implement atomic 64-bit ops (?).  Simpler, but slower in
PCPU_INC().  I prefer a function call version of PCPU_INC() to this.
That should be faster in the usual case and only much larger if we have
too many 64-bit counters.

> Using 32 bit stats may fail if you put in several 10GBit/s adapters into a
> machine and do routing at link speed, though. This might overflow the IP
> input/output byte counter (which we don't have yet) too fast.

Not with a mere 10Gb/s.  That's ~1GB/s, so it takes about 4 seconds to
overflow a 32-bit byte counter.  A bit counter would take a while to
overflow too.  Are there any faster incrementors?  TSCs also take O(1)
seconds to overflow, and timecounter logic depends on no timecounter
overflowing much faster than that.

Bruce
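For reference, the compare-and-exchange loop mentioned above for atomic
64-bit operations can be sketched in portable C11 as follows.  This is
an illustration, not kernel code: counter_add64 is a hypothetical name,
and the assumption is that on i386 the compiler lowers the 64-bit
compare-exchange to "lock cmpxchg8b" (otherwise it may fall back to a
library routine).

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Atomically add n to *c using a compare-and-exchange loop and return
 * the new value.  Relaxed ordering is used because a statistics counter
 * does not order any other memory accesses.
 */
static uint64_t
counter_add64(_Atomic uint64_t *c, uint64_t n)
{
	uint64_t old = atomic_load_explicit(c, memory_order_relaxed);

	/*
	 * On failure the current value is written back into 'old', so
	 * each retry recomputes old + n from fresh data.
	 */
	while (!atomic_compare_exchange_weak_explicit(c, &old, old + n,
	    memory_order_relaxed, memory_order_relaxed))
		;
	return (old + n);
}
```

Every increment pays for a load, a locked instruction and a possible
retry, which is why this is simpler but slower than a plain 32-bit add
with a rarely taken wraparound path.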