Date: Mon, 21 Dec 2009 07:17:56 +1100 (EST) From: Bruce Evans <brde@optusnet.com.au> To: Harti Brandt <harti@freebsd.org> Cc: freebsd-arch@freebsd.org Subject: Re: network statistics in SMP Message-ID: <20091221063312.R39703@delplex.bde.org> In-Reply-To: <20091219204217.D1741@beagle.kn.op.dlr.de> References: <20091215103759.P97203@beagle.kn.op.dlr.de> <200912151313.28326.jhb@freebsd.org> <20091219112711.GR55913@acme.spoerlein.net> <200912191244.17803.hselasky@c2i.net> <20091219232119.L1555@besplex.bde.org> <20091219164818.L1741@beagle.kn.op.dlr.de> <20091220032452.W2429@besplex.bde.org> <20091219204217.D1741@beagle.kn.op.dlr.de>
next in thread | previous in thread | raw e-mail | index | archive | help
On Sat, 19 Dec 2009, Harti Brandt wrote: > On Sun, 20 Dec 2009, Bruce Evans wrote: > BE>... > BE>I don't see why using atomic or locks for just the 64 bit counters is good. > BE>We will probably end up with too many 64-bit counters, especially if they > BE>don't cost much when not read. > > On a 32-bit arch when reading a 32-bit value on one CPU while the other CPU is > modifying it, the read will probably be always correct given the variable is > correctly aligned. We assume this. Otherwise the per-CPU counter optimization wouldn't work so well. > On a 64-bit arch when reading a 64-bit value on one CPU > while the other one is adding to, do I always get the correct value? I'm > not sure about this, why I put atomic_*() there assuming that they will make > this correct. You have to use the PCPU_INC()/PCPU_GET() interface and not worry about this. It must supply any necessary atomicness on an arch-specific basis. > The idea is (for 32-bit platforms): > > struct pcpu_stats { > uint32_t in_bytes; > uint32_t in_packets; > }; > > struct pcpu_hc_stats { > uint64_t hc_in_bytes; > uint64_t hc_in_packets; > }; > > /* driver; IP stack; ... */ > ... > pcpu_stats->in_bytes += bytes; > pcpu_stats->in_packets++; > ... > > /* per CPU kernel thread for 32-bit arch */ > lock(pcpu_hc_stats); > ... > val = pcpu_stats->in_bytes; > if ((uint32_t)pcpu_hc_stats->hc_in_bytes > val) > pcpu_hc_stats->in_bytes += 0x100000000; > pcpu_hc_stats->in_bytes = (pcpu_hc_stats->in_bytes & > 0xffffffff00000000ULL) | val; Why not just `pcpu_hc_stats->in_bytes |= val' for the second statement? > ... > unlock(pcpu_hc_stats); > BE>> Using 32 bit stats may fail if you put in several 10GBit/s adapters into a > BE>> machine and do routing at link speed, though. This might overflow the IP > BE>> input/output byte counter (which we don't have yet) too fast. > BE> > BE>Not with a mere 10GB/S. That's ~1GB/S so it takes 4 seconds to overflow > BE>a 32-bit byte counter. A bit counter would take a while to overflow too. > BE>Are there any faster incrementors? TSCs also take O(1) seconds to overflow, > BE>and timecounter logic depends on no timecounter overflowing much faster > BE>than that. > > If you have 4 10GBit/s adapters each operating full-duplex at link speed you > wrap in under 0.5 seconds, maybe even faster if you have some kind of tunnels > where each packet counts several times. But I suppose this will be not so easy > with IA32 to implement :-) I was only thinking of per-interface counters, but we can handle the 64-bit fixup for thousands of these if we can handle thousands of fixup threads each running thousands of times per second. Actually, thousands of adaptors would need thousands of CPUS to handle and each per-CPU counter would be limited by what 1 CPU could handle so we're back nearer to the original 4 seconds than to 4/4000. Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20091221063312.R39703>