Date: Sat, 19 Dec 2009 17:02:23 +0100 (CET) From: Harti Brandt <hartmut.brandt@dlr.de> To: Bruce Evans <brde@optusnet.com.au> Cc: Ulrich =?iso-8859-1?q?Sp=F6rlein?= <uqs@spoerlein.net>, freebsd-arch@freebsd.org, Hans Petter Selasky <hselasky@c2i.net> Subject: Re: network statistics in SMP Message-ID: <20091219164818.L1741@beagle.kn.op.dlr.de> In-Reply-To: <20091219232119.L1555@besplex.bde.org> References: <20091215103759.P97203@beagle.kn.op.dlr.de> <200912151313.28326.jhb@freebsd.org> <20091219112711.GR55913@acme.spoerlein.net> <200912191244.17803.hselasky@c2i.net> <20091219232119.L1555@besplex.bde.org>
next in thread | previous in thread | raw e-mail | index | archive | help
This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. --1964543108-214838482-1261238543=:1741 Content-Type: TEXT/PLAIN; charset=koi8-r Content-Transfer-Encoding: QUOTED-PRINTABLE On Sun, 20 Dec 2009, Bruce Evans wrote: BE>On Sat, 19 Dec 2009, Hans Petter Selasky wrote: BE> BE>> On Saturday 19 December 2009 12:27:12 Ulrich Sp=F6rlein wrote: BE>> > On Tue, 15.12.2009 at 13:13:28 -0500, John Baldwin wrote: BE>> > > On Tuesday 15 December 2009 12:45:13 pm Harti Brandt wrote: BE>> > > > I see. I was also thinking along these lines, but was not sure BE>> > > > whether BE>> > > > it is worth the trouble. I suppose this does not help to impleme= nt BE>> > > > 64-bit counters on 32-bit architectures, though, because you can= not BE>> > > > read them reliably without locking to sum them up, right? BE>> > >=20 BE>> > > Either that or you just accept that you have a small race since it= is BE>> > > only stats. :) BE>> >=20 BE>> > This might be stupid, but can we not easily *read* 64bit counters BE>> > on 32bit machines like this: BE>> >=20 BE>> > do { BE>> > h1 =3D read_upper_32bits; BE>> > l1 =3D read_lower_32bits; BE>> > h2 =3D read_upper_32bits; BE>> > l2 =3D read_lower_32bits; /* not needed */ BE>> > } while (h1 !=3D h2); BE> BE>No. See my previous^N reply, but don't see it about this since it was BE>wrong about this for all N :-). BE> BE>> Just a comment. You know you don't need a while loop to get a stable v= alue? BE>> Should be implemented like this, in my opinion: BE>>=20 BE>> h1 =3D read_upper_32bits; BE>> l1 =3D read_lower_32bits; BE>> h2 =3D read_upper_32bits; BE>>=20 BE>> if (h1 !=3D h2) BE>> =09l1 =3D 0xffffffffUL; BE>>=20 BE>> sum64 =3D (h1<<32) | l1; BE> BE>Also wrong :-). Apart from write ordering problems (1), the write of BE>the second half (presumably the top half) might not have completed when = the BE>above looks at it. Then both of the above will see: BE> h1 =3D old value BE> l1 =3D new value BE> h2 =3D old value (since the new one has not been written yet). BE>The race window for this can be arbitrarily long, since the second write BE>can be delayed for arbitrarily long (e.g., by interrupt handling). Even BE>if we ensure write ordering and no interrupts, the above has many proble= ms. BE>- we can't reasonably guarantee that the reads of l1 and h2 will execute BE> sufficiently faster than the writes of l1 and h1/h2 so that the above BE> will see h2 after l1. I think the writes will usually go slightly BE> faster since they will go through a write buffer, provided the 2 halve= s BE> are in a single cache line, but this is unclear. SMP with different BE> CPU frequencies is not really supported, but works now modulo timecoun= ter BE> problems, and we probably want to support completely independent CPU BE> frequencies, with some CPUs throttled. BE>- I don't understand why the above compares the high values. Comparing = the BE> low values seems to work better. BE>- I can't see how to fix up the second method to be useful. It is faste= r, BE> but some delays seem to be necessary and they might as well be partly BE> in a slower method. BE> BE>Another try: BE>- enforce write ordering in writers and read ordering in the reader BE>- make sure that the reader runs fast enough. This might require using BE> critical_enter() in the reader. BE>- then many cases work with no complications: BE> (a) if l1 is not small and not large, then h1 must be associated with = l1, BE> since then the low value can't roll over while we are looking BE> at the pair, so the high value can't change while we are looking. BE> So we can just use h1 and l1, without reading h2 or l2. BE> (b) similarly, if l1 is small, then h2 is associated with l1. So we BE> can just use h2 with l1, without reading l2. BE>- otherwise, l1 is large: BE> (c) if l1 =3D l2, then h1 must be associated with l1, since some write= s BE> of the high value associated with writing not-so-large low values BE> have had plenty of time to complete (in fact, the one for (l1-2) BE> or (l2-1) must have completed, and "large" only needs to be 1 BE> or 2 to ensure that these values don't wrap. E.g., if l1 is BE> 0xFFFFFFFF, then it is about to wrap, but it certainly hasn't BE> wrapped recently so h1 hasn't incremented recently. So we can BE> use h1 with l1, after reading l2 (still need to read h2, in case BE> we don't get here). BE> (d) if l1 !=3D l2, then ordering implies that the write of the high va= lue BE> associated with l1 has completed. We might have missed reading BE> this value, since we might have read h1 too early and might have BE> read h2 too late, but in the usual case h1 =3D=3D h2 and then both BE> h's are associated with l1, while if h1 !=3D h2 then we can loop BE> again and surely find h1 =3D=3D h2 (and l1 small, so case (c)), or BE> we can use the second method. We had to read all 4 values to BE> determine what to do here, and can usually use 2 of them directly. BE> BE>(1) Write ordering is guaranteed on amd64 but I think not on all arches. BE> You could pessimize PCPU_INC() with memory barriers on some arches t= o BE> get write ordering. BE> BE>(2) You could pessimize PCPU_INC() with critical_enter() to mostly preve= nt BE> this. You cannot prevent the second write from being delayed for BE> arbitrarily long by a trap and its handling, except by breaking BE> at least the debugger traps needed to debug this. BE> BE>In my previous^2 reply, I said heavyweight synchronization combined BE>with extra atomic generation counters would work. The heavyweight BE>synchronization would have to be heavier than I thought -- it would BE>have to wait for all other CPUs to complete the pairs of writes for BE>64-bit counters, if any, and for this it would have to do more than BE>IPI's -- it should change priorities and reschedule to ensure that the BE>half-writes (if any) have a chance of completing soon... Far too BE>complicated for this. Disabling interrupts for the non-atomic PCPU_INC(= )s BE>is probably best. Duh, this or worse (locking) is required on the BE>writer side anyway, else increments in won't be atomic. Locking would BE>actually automatically give the rescheduling stuff for the heavyweight BE>synchronizaton -- you would acquire the lock in the reader and of BE>course in the writers, and get priority propagation to complete the BE>one writer allowed to hold the lock iff any is holding it. Locking BE>might not be too bad for a few 64-bit counters. So I've changed my BE>mind yet again and prefer locking to critical_enter(). It's cleaner BE>and works for traps. I just remembered that rwatson went the opposite BE>way and changed some locking to critical_enter() in UMA. I prefer the BE>old way, and at least in old versions of FreeBSD I got panics trying BE>to debug near this (single-stepping malloc()?). BE> BE>In my previous^1 reply, I said lighter weight synchronizion combined BE>with no extra or atomic counters (use counters as their own generation BE>counter) would work. But the synchronization still needs to be heavy, BE>or interrupts disabled, as above. BE> BE>Everything must be read before and after to test for getting a coherent BE>set of values, so the loop in the first method has the minimal number BE>of reads (for a single 64-bit counter). With sync, the order for each BE>pair in it doesn't matter on either the reader or writer (there must be BE>a sync or 2 instead). To be honest, I'm lost now. Couldn't we just use the largest atomic type=20 for the given platform and atomic_inc/atomic_add/atomic_fetch and handle=20 the 32->64 bit stuff (for IA32) as I do it in bsnmp, but as a kernel=20 thread? Are the 5-6 atomic operations really that costly given the many operations= =20 done on an IP packet? Are they more costly than a heavyweight sync for=20 each ++ or +=3D? Or we could use the PCPU stuff, use just ++ and +=3D for modifying the=20 statistics (32bit) and do the 32->64 bit stuff for all platforms with a=20 kernel thread per CPU (do we have this?). Between that thread and the=20 sysctl we could use a heavy sync. Or we could use PCPU and atomic_inc/atomic_add/atomic_fetch with the=20 largest atomic type for the platform, handle the aggregation and (on IA32)= =20 the 32->64 bit stuff in a kernel thread. Using 32 bit stats may fail if you put in several 10GBit/s adapters into a= =20 machine and do routing at link speed, though. This might overflow the IP=20 input/output byte counter (which we don't have yet) too fast. harti --1964543108-214838482-1261238543=:1741--
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20091219164818.L1741>