Date: Thu, 17 Dec 2009 05:19:45 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: John Baldwin <jhb@FreeBSD.org>
Cc: Harti Brandt <harti@FreeBSD.org>, freebsd-arch@FreeBSD.org
Subject: Re: network statistics in SMP
Message-ID: <20091217021211.O35780@delplex.bde.org>
In-Reply-To: <200912151313.28326.jhb@freebsd.org>
References: <20091215103759.P97203@beagle.kn.op.dlr.de> <200912150812.35521.jhb@freebsd.org> <20091215183859.S53283@beagle.kn.op.dlr.de> <200912151313.28326.jhb@freebsd.org>

On Tue, 15 Dec 2009, John Baldwin wrote:

> On Tuesday 15 December 2009 12:45:13 pm Harti Brandt wrote:
>> On Tue, 15 Dec 2009, John Baldwin wrote:
>>
>> JB>On Tuesday 15 December 2009 4:38:04 am Harti Brandt wrote:
>> JB>> Hi all,
>> JB>>
>> JB>> I'm working on our network statistics (in the context of SNMP) and wonder,
>> JB>> to what extent we want them to be correct. I've re-read part of the past
>> JB>> discussions about 64-bit counters on 32-bit archs and got the impression,
>> JB>> that there are users that would like to have almost correct statistics
>> JB>> (for accounting, for example). If this is the case I wonder whether the
>> JB>> way we do the statistics today is correct.
>> JB>>
>> JB>> Basically all statistics are incremented or added to simply by a += b or
>> JB>> a++. As I understand, this worked fine in the old days, where you had
>> JB>> spl*() calls at the right places. Nowadays when everything is SMP
>> JB>> shouldn't we use at least atomic operations for this? Also I read that on
>> JB>> architectures where cache coherency is not implemented in hardware even
>> JB>> this does not help (I found a mail from jhb why for the mutex
>> JB>> implementation this is not a problem, but I don't understand what to do
>> JB>> for the += and ++ operations). I failed to find a way, though, to
>> JB>> influence the caching policy (is there a function one can call to
>> JB>> change the policy?).
>> JB>
>> JB>Atomic ops will always work for reliable statistics. However, I believe
>> JB>Robert is working on using per-CPU statistics for TCP, UDP, etc. similar to
>> JB>what we do now for many of the 'cnt' stats (context switches, etc.). For
>> JB>'cnt' each CPU has its own count of stats that are updated using non-atomic
>> JB>ops (since they are CPU local). sysctl handlers then sum up the various per-
>> JB>CPU counts to report global counts to userland.

I don't like the bloat from this, but don't see anything better.
Julian said in another reply that there are even more complications for
VIMAGE.

>> I see. I was also thinking along these lines, but was not sure whether it
>> is worth the trouble. I suppose this does not help to implement 64-bit
>> counters on 32-bit architectures, though, because you cannot read them
>> reliably without locking to sum them up, right?
>
> Either that or you just accept that you have a small race since it is only
> stats. :)

Actually, you can do better with a generation count.  The generation count
would at least tell you if you lost a race.  The generation count should
only be maintained while summing other counts, since it must be global and
incremented by atomic ops (to avoid the races without even more costly
locking, which would make the generation count irrelevant), so maintaining
it all the time would more than defeat the point of having per-CPU counters
(all CPUs would compete for it at the same address).  Probably not worth it
for statistics.  Except, if userland had control over it, then userland
could decide the policy.

Actually2, this solves your original problem!, provided the races are so
rarely lost that looping to recover from them works: once counters are
per-CPU, they can be 64 bits with no complications until they are summed.
Detection of lost races is essential for summing them on 32-bit systems,
unlike for 32-bit counters, since a lost race at the point where the low
32 bits wrap around may give an error of 2**32 in the sum, while a lost
race for a 32-bit counter only makes the sum a bit too small (unless the
32-bit counter wrapped).

Simple version:
- bloat PCPU_INC(var) to do something like the following:

      if (PCPU_GET(counter_summing_mode))
              atomic_add_int(&counter_gen, 1);
      OLD_PCPU_INC(var);

- set PCPU_GET(counter_summing_mode) while summing.  Needs heavyweight
  synchronization (IPIs?) to set and clear the flag on other CPUs.
  Must also make all other CPUs flush pending writes (so that a 64-bit
  counter cannot be half-written at the beginning of the summing), but this
  will happen automatically with any heavyweight synchronization.

Unsimple versions: to avoid bloating PCPU_INC(), write-protect all counters
while summing, and count generations in the trap handler ...

However, I prefer summing 32-bit counters (with heuristics to detect
wraparound) to a 64-bit sum, like I think you already do for SNMP.
Wraparound heuristics may still be useful with the generation count:
suppose the generation count increases faster than you can sum; then
looping to get a coherent sum doesn't work, and wraparound must be ruled
out or fixed up in another way; the 32-bit wraparound heuristic works
perfectly since we can guarantee to sum faster than a 32-bit counter can
wrap twice.

Bruce
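
The two approaches above can be sketched in userland C roughly as follows.
This is only an illustration of the control flow, not kernel code: the names
pcpu_inc(), pcpu_sum(), wrap_extend(), struct wrap_state and NCPU are
hypothetical, a single global flag stands in for the per-CPU
counter_summing_mode, and the sketch does not model the IPIs, write flushing
or memory ordering that a real implementation would have to get right. The
plain per-CPU array is only safe here because the example is meant to be run
single-threaded.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NCPU 4	/* hypothetical CPU count for this sketch */

static uint64_t pcpu_counter[NCPU];	/* one 64-bit counter per CPU */
static atomic_uint counter_gen;		/* global; bumped only while summing */
static atomic_bool counter_summing_mode; /* stand-in for the per-CPU flag */

/* Update path: the "bloated" PCPU_INC() from the simple version. */
void
pcpu_inc(int cpu)
{
	if (atomic_load(&counter_summing_mode))
		atomic_fetch_add(&counter_gen, 1);
	pcpu_counter[cpu]++;		/* OLD_PCPU_INC(var) */
}

/*
 * Read path: sum all CPUs and retry if the generation count moved.
 * Returns false if every attempt lost a race.
 */
bool
pcpu_sum(uint64_t *sump, int max_tries)
{
	bool ok = false;

	/* Real code would need IPIs or similar to set this on every CPU. */
	atomic_store(&counter_summing_mode, true);
	for (int i = 0; i < max_tries && !ok; i++) {
		unsigned int gen = atomic_load(&counter_gen);
		uint64_t sum = 0;

		for (int cpu = 0; cpu < NCPU; cpu++)
			sum += pcpu_counter[cpu];
		if (atomic_load(&counter_gen) == gen) {
			*sump = sum;	/* no update was seen during the pass */
			ok = true;
		}
	}
	atomic_store(&counter_summing_mode, false);
	return (ok);
}

/* 32-bit alternative: extend a wrapping 32-bit sample to 64 bits. */
struct wrap_state {
	uint32_t last;		/* previous 32-bit sample */
	uint64_t total;		/* accumulated 64-bit value */
};

uint64_t
wrap_extend(struct wrap_state *ws, uint32_t now)
{
	/*
	 * Subtraction modulo 2**32 gives the exact increase as long as the
	 * counter advanced by less than 2**32 since the previous sample.
	 */
	ws->total += (uint32_t)(now - ws->last);
	ws->last = now;
	return (ws->total);
}
```

A caller that gets false back from pcpu_sum() is in the "generation count
increases faster than you can sum" case, and could fall back to sampling
32-bit counters and feeding each sample through wrap_extend().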