Date: Mon, 21 Dec 2009 07:17:56 +1100 (EST)
From: Bruce Evans
To: Harti Brandt
Cc: freebsd-arch@freebsd.org
Subject: Re: network statistics in SMP
Message-ID: <20091221063312.R39703@delplex.bde.org>
In-Reply-To: <20091219204217.D1741@beagle.kn.op.dlr.de>
References: <20091215103759.P97203@beagle.kn.op.dlr.de>
 <200912151313.28326.jhb@freebsd.org> <20091219112711.GR55913@acme.spoerlein.net>
 <200912191244.17803.hselasky@c2i.net> <20091219232119.L1555@besplex.bde.org>
 <20091219164818.L1741@beagle.kn.op.dlr.de> <20091220032452.W2429@besplex.bde.org>
 <20091219204217.D1741@beagle.kn.op.dlr.de>

On Sat, 19 Dec 2009, Harti Brandt wrote:

> On Sun, 20 Dec 2009, Bruce Evans wrote:
> BE>...
> BE>I don't see why using atomic or locks for just the 64 bit counters is good.
> BE>We will probably end up with too many 64-bit counters, especially if they
> BE>don't cost much when not read.
>
> On a 32-bit arch, when reading a 32-bit value on one CPU while the other CPU
> is modifying it, the read will probably always be correct, given that the
> variable is correctly aligned.

We assume this.  Otherwise the per-CPU counter optimization wouldn't work
so well.

> On a 64-bit arch, when reading a 64-bit value on one CPU while the other one
> is adding to it, do I always get the correct value?  I'm not sure about this,
> which is why I put atomic_*() there, assuming that they will make this
> correct.

You have to use the PCPU_INC()/PCPU_GET() interface and not worry about
this.  It must supply any necessary atomicness on an arch-specific basis.

> The idea is (for 32-bit platforms):
>
> struct pcpu_stats {
> 	uint32_t	in_bytes;
> 	uint32_t	in_packets;
> };
>
> struct pcpu_hc_stats {
> 	uint64_t	hc_in_bytes;
> 	uint64_t	hc_in_packets;
> };
>
> /* driver; IP stack; ... */
> ...
> pcpu_stats->in_bytes += bytes;
> pcpu_stats->in_packets++;
> ...
>
> /* per CPU kernel thread for 32-bit arch */
> lock(pcpu_hc_stats);
> ...
> val = pcpu_stats->in_bytes;
> if ((uint32_t)pcpu_hc_stats->hc_in_bytes > val)
> 	pcpu_hc_stats->hc_in_bytes += 0x100000000ULL;
> pcpu_hc_stats->hc_in_bytes = (pcpu_hc_stats->hc_in_bytes &
>     0xffffffff00000000ULL) | val;

Why not just `pcpu_hc_stats->hc_in_bytes |= val' for the second statement?

> ...
> unlock(pcpu_hc_stats);
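
For concreteness, here is a stand-alone user-space sketch of the fixup
described above (hc_fixup() and the trimmed-down structs are made-up names
for illustration; the locking and per-CPU plumbing are omitted):

/*
 * Stand-alone sketch of the 32->64 bit counter fixup described above.
 * Names are illustrative only, not existing kernel interfaces.
 */
#include <stdint.h>
#include <stdio.h>

struct pcpu_stats {
	uint32_t in_bytes;		/* bumped lock-free by the owning CPU */
};

struct pcpu_hc_stats {
	uint64_t hc_in_bytes;		/* 64-bit shadow, updated under a lock */
};

static void
hc_fixup(struct pcpu_hc_stats *hc, const struct pcpu_stats *pc)
{
	uint32_t val = pc->in_bytes;

	/*
	 * If the low 32 bits of the shadow exceed the current counter,
	 * the 32-bit counter wrapped since the last run: carry into the
	 * high word, then install the new low word.  A plain `|=' would
	 * leave stale low-order bits set, hence the mask.
	 */
	if ((uint32_t)hc->hc_in_bytes > val)
		hc->hc_in_bytes += (uint64_t)1 << 32;
	hc->hc_in_bytes = (hc->hc_in_bytes & 0xffffffff00000000ULL) | val;
}

int
main(void)
{
	struct pcpu_stats pc = { .in_bytes = 0xfffffff0 };
	struct pcpu_hc_stats hc = { .hc_in_bytes = 0 };

	hc_fixup(&hc, &pc);
	pc.in_bytes += 0x20;		/* the 32-bit counter wraps to 0x10 */
	hc_fixup(&hc, &pc);
	printf("0x%jx\n", (uintmax_t)hc.hc_in_bytes);
	return (0);
}

The second hc_fixup() call detects the wrap, carries into the high word and
prints 0x100000010, i.e. 0xfffffff0 + 0x20 carried past the 32-bit boundary.
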
> BE>> Using 32 bit stats may fail if you put in several 10GBit/s adapters into a
> BE>> machine and do routing at link speed, though.  This might overflow the IP
> BE>> input/output byte counter (which we don't have yet) too fast.
> BE>
> BE>Not with a mere 10Gbit/s.  That's ~1GB/s, so it takes about 4 seconds to
> BE>overflow a 32-bit byte counter.  A bit counter would take a while to
> BE>overflow too.  Are there any faster incrementors?  TSCs also take O(1)
> BE>seconds to overflow, and timecounter logic depends on no timecounter
> BE>overflowing much faster than that.
>
> If you have 4 10GBit/s adapters, each operating full-duplex at link speed,
> you wrap in under 0.5 seconds, maybe even faster if you have some kind of
> tunnels where each packet is counted several times.  But I suppose this will
> not be so easy to implement on IA32 :-)

I was only thinking of per-interface counters, but we can handle the 64-bit
fixup for thousands of these if we can handle thousands of fixup threads
each running thousands of times per second.  Actually, thousands of adapters
would need thousands of CPUs to handle, and each per-CPU counter would be
limited by what 1 CPU could handle, so we're back nearer to the original
4 seconds than to 4/4000.

Bruce
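
As a quick sanity check of the wrap times quoted above, a throwaway
calculation (the rates are the ones discussed: one 10Gbit/s direction, and
four full-duplex 10Gbit/s ports):

/* Back-of-the-envelope check of the 32-bit byte-counter wrap times. */
#include <stdio.h>

int
main(void)
{
	const double limit = 4294967296.0;		/* 2^32 bytes */
	const double one_dir = 1.25e9;			/* one 10Gbit/s direction, bytes/s */
	const double four_fdx = 4 * 2 * 1.25e9;		/* four ports, both directions */

	printf("1 x 10Gbit/s, one direction: %.2f s\n", limit / one_dir);
	printf("4 x 10Gbit/s, full duplex:   %.2f s\n", limit / four_fdx);
	return (0);
}

This prints about 3.44 s and 0.43 s, matching the "about 4 seconds" and
"under 0.5 seconds" figures above.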