Date: Sat, 19 Dec 2009 21:01:35 +0100 (CET)
From: Harti Brandt <hartmut.brandt@dlr.de>
To: Bruce Evans <brde@optusnet.com.au>
Cc: freebsd-arch@freebsd.org
Subject: Re: network statistics in SMP
Message-ID: <20091219204217.D1741@beagle.kn.op.dlr.de>
In-Reply-To: <20091220032452.W2429@besplex.bde.org>
References: <20091215103759.P97203@beagle.kn.op.dlr.de>
 <200912151313.28326.jhb@freebsd.org> <20091219112711.GR55913@acme.spoerlein.net>
 <200912191244.17803.hselasky@c2i.net> <20091219232119.L1555@besplex.bde.org>
 <20091219164818.L1741@beagle.kn.op.dlr.de> <20091220032452.W2429@besplex.bde.org>
On Sun, 20 Dec 2009, Bruce Evans wrote:

BE>On Sat, 19 Dec 2009, Harti Brandt wrote:
BE>
BE>> On Sun, 20 Dec 2009, Bruce Evans wrote:
BE>>
BE>> [... complications]
BE>>
BE>> To be honest, I'm lost now. Couldn't we just use the largest atomic
BE>> type for the given platform and atomic_inc/atomic_add/atomic_fetch
BE>> and handle the 32->64 bit stuff (for IA32) as I do it in bsnmp, but
BE>> as a kernel thread?
BE>
BE>That's probably best (except without the atomic operations) (like I said
BE>originally. I tried to spell out the complications to make it clear that
BE>they would be too much except for incomplete ones).
BE>
BE>> Are the 5-6 atomic operations really that costly given the many
BE>> operations done on an IP packet? Are they more costly than a
BE>> heavyweight sync for each ++ or +=?
BE>
BE>rwatson found that even non-atomic operations are quite costly, since
BE>at least on amd64 and i386, ones that write (or any access?) the same
BE>address (or cache line?) apparently involve much the same hardware
BE>activity (cache snoop?) as atomic ones implemented by locking the bus.
BE>I think this is mostly historical -- it should be necessary to lock the
BE>bus to get the slow version. Per-CPU counters give separate addresses
BE>and also don't require the bus lock. I don't like the complexity for
BE>per-CPU counters but don't use big SMP systems enough to know what the
BE>locks cost in real applications.
BE>
BE>> Or we could use the PCPU stuff, use just ++ and += for modifying the
BE>> statistics (32bit) and do the 32->64 bit stuff for all platforms with
BE>> a kernel thread per CPU (do we have this?). Between that thread and
BE>> the sysctl we could use a heavy sync.
BE>
BE>I don't like the squillions of threads in FreeBSD-post-4, but this seems
BE>to need its own one and there isn't one yet AFAIK. I think a thread is
BE>only needed for the 32-bit stuff (since aggregation has to use the
BE>current values and it shouldn't have to ask a thread to sum them). The
BE>thread should maintain only the high 32 or 33 bits of the 64-bit
BE>counters. Maybe there should be a thread per CPU (ugh) with per-CPU
BE>extra bits so that these bits can be accessed without locking. The
BE>synchronization is still interesting.
BE>
BE>> Or we could use PCPU and atomic_inc/atomic_add/atomic_fetch with the
BE>> largest atomic type for the platform, handle the aggregation and (on
BE>> IA32) the 32->64 bit stuff in a kernel thread.
BE>
BE>I don't see why using atomic or locks for just the 64 bit counters is
BE>good. We will probably end up with too many 64-bit counters, especially
BE>if they don't cost much when not read.

On a 32-bit arch, reading a 32-bit value on one CPU while another CPU is
modifying it will probably always give the correct value, provided the
variable is correctly aligned. On a 64-bit arch, reading a 64-bit value on
one CPU while another one is adding to it, do I always get the correct
value? I'm not sure about this, which is why I put atomic_*() there,
assuming that they will make this correct.

The idea is (for 32-bit platforms):

struct pcpu_stats {
	uint32_t	in_bytes;
	uint32_t	in_packets;
};

struct pcpu_hc_stats {
	uint64_t	hc_in_bytes;
	uint64_t	hc_in_packets;
};

/* driver; IP stack; ... */
	...
	pcpu_stats->in_bytes += bytes;
	pcpu_stats->in_packets++;
	...

/* per-CPU kernel thread for 32-bit archs */
	lock(pcpu_hc_stats);
	...
	val = pcpu_stats->in_bytes;
	if ((uint32_t)pcpu_hc_stats->hc_in_bytes > val)
		pcpu_hc_stats->hc_in_bytes += 0x100000000ULL;
	pcpu_hc_stats->hc_in_bytes =
	    (pcpu_hc_stats->hc_in_bytes & 0xffffffff00000000ULL) | val;
	...
	unlock(pcpu_hc_stats);

/* sysctl */
	memset(&stats, 0, sizeof(stats));
	foreach(cpu) {
		lock(pcpu_hc_stats(cpu));
		...
		stats.in_bytes += pcpu_hc_stats(cpu)->hc_in_bytes;
		...
		unlock(pcpu_hc_stats(cpu));
	}
	copyout(stats);

On 64-bit archs we can go without the locks and the thread, given that we
can reliably read the 64-bit per-CPU numbers (can we?).

BE>I just thought of another implementation to reduce reads: trap on
BE>overflow and handle all the complications in the trap handler, or
BE>just set a flag to tell the fixup thread to run and normally don't
BE>run the fixup thread. This seems to not quite work -- arranging
BE>for the trap would be costly (needs "into" instruction on i386?).
BE>Similarly for explicit tests for wraparound (PCPU_INC() could be a
BE>function call that does the test and handles wraparound in a fully
BE>locked fashion. We don't care that this code executes slowly since
BE>it rarely executes, but we care that the test pessimizes the usual
BE>case).
BE>
BE>There is also "lock cmpxchg8b" on i386. I think this can be used in a
BE>loop to implement atomic 64-bit ops (?). Simpler, but slower in
BE>PCPU_INC(). I prefer a function call version of PCPU_INC() to this.
BE>That should be faster in the usual case and only much larger if we
BE>have too many 64-bit counters.
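For what it's worth, such a loop could look roughly like the sketch below
(userland only; the GCC __sync builtin is just a stand-in for whatever
atomic_cmpset-style primitive the kernel would provide, and it compiles
to "lock cmpxchg8b" for 64-bit operands on i586 and later):

#include <stdint.h>

/*
 * Sketch of a 64-bit atomic add built from a compare-and-set loop.
 * The function name is made up; in the kernel this would sit behind
 * PCPU_INC()/PCPU_ADD() or a counter API.  The initial read need not
 * be atomic: a torn or stale value just makes the compare-and-set
 * fail, and the loop retries.
 */
static void
counter_add64(volatile uint64_t *p, uint64_t v)
{
	uint64_t old;

	do {
		old = *p;
	} while (!__sync_bool_compare_and_swap(p, old, old + v));
}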
BE>> Using 32 bit stats may fail if you put in several 10GBit/s adapters
BE>> into a machine and do routing at link speed, though. This might
BE>> overflow the IP input/output byte counter (which we don't have yet)
BE>> too fast.
BE>
BE>Not with a mere 10GBit/s. That's ~1GB/s, so it takes 4 seconds to
BE>overflow a 32-bit byte counter. A bit counter would take a while to
BE>overflow too. Are there any faster incrementors? TSCs also take O(1)
BE>seconds to overflow, and timecounter logic depends on no timecounter
BE>overflowing much faster than that.

If you have 4 10GBit/s adapters, each operating full-duplex at link
speed, you wrap a 32-bit byte counter in under 0.5 seconds, maybe even
faster if you have some kind of tunnel where each packet counts several
times. But I suppose this will not be so easy to implement on IA32 :-)

harti
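PS: in case someone wants to play with the carry logic from the sketch
above, here is a compilable userland restatement (all names are made up;
it assumes the fixup thread runs often enough that a 32-bit counter wraps
at most once between two runs):

#include <assert.h>
#include <stdint.h>

/* Fold the current 32-bit counter value into the 64-bit HC counter. */
static void
hc_update(uint64_t *hc, uint32_t cur)
{
	if ((uint32_t)*hc > cur)	/* 32-bit counter wrapped since last run */
		*hc += 0x100000000ULL;
	*hc = (*hc & 0xffffffff00000000ULL) | cur;
}

int
main(void)
{
	uint64_t hc = 0;
	uint32_t in_bytes = 0;

	in_bytes += 0xf0000000;		/* traffic since last run */
	hc_update(&hc, in_bytes);
	in_bytes += 0x20000000;		/* more traffic; 32-bit counter wraps */
	hc_update(&hc, in_bytes);
	assert(hc == 0x110000000ULL);	/* total bytes, carry propagated */
	return (0);
}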