Date: Sun, 20 Dec 2009 02:30:03 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Hans Petter Selasky
Cc: Ulrich Spörlein, Harti Brandt, freebsd-arch@freebsd.org
Subject: Re: network statistics in SMP
Message-ID: <20091219232119.L1555@besplex.bde.org>
In-Reply-To: <200912191244.17803.hselasky@c2i.net>
References: <20091215103759.P97203@beagle.kn.op.dlr.de>
 <200912151313.28326.jhb@freebsd.org>
 <20091219112711.GR55913@acme.spoerlein.net>
 <200912191244.17803.hselasky@c2i.net>

On Sat, 19 Dec 2009, Hans Petter Selasky wrote:

> On Saturday 19 December 2009 12:27:12 Ulrich Spörlein wrote:
>> On Tue, 15.12.2009 at 13:13:28 -0500, John Baldwin wrote:
>>> On Tuesday 15 December 2009 12:45:13 pm Harti Brandt wrote:
>>>> I see. I was also thinking along these lines, but was not sure whether
>>>> it is worth the trouble. I suppose this does not help to implement
>>>> 64-bit counters on 32-bit architectures, though, because you cannot
>>>> read them reliably without locking to sum them up, right?
>>>
>>> Either that or you just accept that you have a small race since it is
>>> only stats. :)
>>
>> This might be stupid, but can we not easily *read* 64bit counters
>> on 32bit machines like this:
>>
>> do {
>>         h1 = read_upper_32bits;
>>         l1 = read_lower_32bits;
>>         h2 = read_upper_32bits;
>>         l2 = read_lower_32bits; /* not needed */
>> } while (h1 != h2);

No.  See my previous^N reply, but don't see it about this since it was
wrong about this for all N :-).

> Just a comment. You know you don't need a while loop to get a stable value?
> Should be implemented like this, in my opinion:
>
> h1 = read_upper_32bits;
> l1 = read_lower_32bits;
> h2 = read_upper_32bits;
>
> if (h1 != h2)
>         l1 = 0xffffffffUL;
>
> sum64 = (h1<<32) | l1;

Also wrong :-).  Apart from write ordering problems (1), the write of
the second half (presumably the top half) might not have completed when
the above looks at it.  Then both of the above will see:

        h1 = old value
        l1 = new value
        h2 = old value (since the new one has not been written yet).
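
To make the window concrete, here is a minimal sketch of the update
sequence that defeats both readers above (the names are made up and are
not any existing kernel interface):

	#include <stdint.h>

	struct counter64 {
		volatile uint32_t	lo;
		volatile uint32_t	hi;
	};

	static void
	counter64_inc(struct counter64 *c)
	{
		uint32_t lo;

		lo = c->lo;
		c->lo = lo + 1;		/* first write */
		if (lo == 0xffffffff)
			c->hi++;	/* second write; can be delayed */
	}

A reader that runs between the two writes sees the new low half together
with the old high half, and rereading the high half doesn't help, since
the new high half simply hasn't been written yet.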
The race window for this can be arbitrarily long, since the second write
can be delayed for arbitrarily long (e.g., by interrupt handling).

Even if we ensure write ordering and no interrupts, the above has many
problems.
- we can't reasonably guarantee that the reads of l1 and h2 will execute
  sufficiently faster than the writes of l1 and h1/h2 so that the above
  will see h2 after l1.  I think the writes will usually go slightly
  faster, since they will go through a write buffer, provided the 2
  halves are in a single cache line, but this is unclear.  SMP with
  different CPU frequencies is not really supported, but works now
  modulo timecounter problems, and we probably want to support
  completely independent CPU frequencies, with some CPUs throttled.
- I don't understand why the above compares the high values.  Comparing
  the low values seems to work better.
- I can't see how to fix up the second method to be useful.  It is
  faster, but some delays seem to be necessary and they might as well be
  partly in a slower method.

Another try:
- enforce write ordering in writers and read ordering in the reader
- make sure that the reader runs fast enough.  This might require using
  critical_enter() in the reader.
- then many cases work with no complications:
  (a) if l1 is not small and not large, then h1 must be associated with
      l1, since then the low value can't roll over while we are looking
      at the pair, so the high value can't change while we are looking.
      So we can just use h1 and l1, without reading h2 or l2.
  (b) similarly, if l1 is small, then h2 is associated with l1.  So we
      can just use h2 with l1, without reading l2.
- otherwise, l1 is large:
  (c) if l1 == l2, then h1 must be associated with l1, since some writes
      of the high value associated with writing not-so-large low values
      have had plenty of time to complete (in fact, the one for (l1-2)
      or (l2-1) must have completed, and "large" only needs to be 1 or 2
      to ensure that these values don't wrap).  E.g., if l1 is
      0xFFFFFFFF, then it is about to wrap, but it certainly hasn't
      wrapped recently, so h1 hasn't incremented recently.  So we can
      use h1 with l1, after reading l2 (we still need to read h2, in
      case we don't get here).
  (d) if l1 != l2, then ordering implies that the write of the high
      value associated with l1 has completed.  We might have missed
      reading this value, since we might have read h1 too early and
      might have read h2 too late, but in the usual case h1 == h2, and
      then both h's are associated with l1, while if h1 != h2 then we
      can loop again and surely find h1 == h2 (and l1 small, so case
      (b)), or we can use the second method.  We had to read all 4
      values to determine what to do here, and can usually use 2 of
      them directly.
(A sketch of this reader follows notes (1) and (2) below.)

(1) Write ordering is guaranteed on amd64, but I think not on all
    arches.  You could pessimize PCPU_INC() with memory barriers on
    some arches to get write ordering.

(2) You could pessimize PCPU_INC() with critical_enter() to mostly
    prevent this.  You cannot prevent the second write from being
    delayed for arbitrarily long by a trap and its handling, except by
    breaking at least the debugger traps needed to debug this.
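
For concreteness, a sketch of the reader described by cases (a)-(d)
above, reusing the made-up struct counter64 from the earlier sketch.
The C64_SMALL/C64_LARGE thresholds are also made up; they only need to
cover how far the low half can advance while the reader is looking at
the pair:

	#define	C64_SMALL	0x10000U
	#define	C64_LARGE	(0U - 0x10000U)

	static uint64_t
	counter64_read(volatile struct counter64 *c)
	{
		uint32_t h1, l1, h2, l2;

		for (;;) {
			/* Read ordering is needed between all of these. */
			h1 = c->hi;
			l1 = c->lo;
			if (l1 >= C64_SMALL && l1 < C64_LARGE)
				return ((uint64_t)h1 << 32 | l1);	/* (a) */
			h2 = c->hi;
			if (l1 < C64_SMALL)
				return ((uint64_t)h2 << 32 | l1);	/* (b) */
			/* Otherwise l1 is large. */
			l2 = c->lo;
			if (l1 == l2)
				return ((uint64_t)h1 << 32 | l1);	/* (c) */
			if (h1 == h2)
				return ((uint64_t)h1 << 32 | l1);	/* (d) */
			/* h1 != h2: loop; next time l1 should be small. */
		}
	}

This assumes write ordering in the writer, read ordering between the 4
reads (acquire loads or read barriers), and a reader that is not
preempted for long between the reads (critical_enter(), as above);
otherwise "small" and "large" stop meaning anything.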
In my previous^2 reply, I said heavyweight synchronization combined with
extra atomic generation counters would work.  The heavyweight
synchronization would have to be heavier than I thought -- it would have
to wait for all other CPUs to complete the pairs of writes for 64-bit
counters, if any, and for this it would have to do more than IPIs -- it
should change priorities and reschedule to ensure that the half-writes
(if any) have a chance of completing soon...  Far too complicated for
this.  Disabling interrupts for the non-atomic PCPU_INC()s is probably
best.  Duh, this or worse (locking) is required on the writer side
anyway, else the increments won't be atomic.

Locking would actually automatically give the rescheduling stuff for the
heavyweight synchronization -- you would acquire the lock in the reader
and of course in the writers, and get priority propagation to complete
the one writer allowed to hold the lock iff any is holding it.  Locking
might not be too bad for a few 64-bit counters.  So I've changed my mind
yet again and prefer locking to critical_enter().  It's cleaner and
works for traps.  I just remembered that rwatson went the opposite way
and changed some locking to critical_enter() in UMA.  I prefer the old
way, and at least in old versions of FreeBSD I got panics trying to
debug near this (single-stepping malloc()?).

In my previous^1 reply, I said lighter-weight synchronization combined
with no extra or atomic counters (using the counters as their own
generation counters) would work.  But the synchronization still needs to
be heavy, or interrupts disabled, as above.  Everything must be read
before and after to test for getting a coherent set of values, so the
loop in the first method has the minimal number of reads (for a single
64-bit counter).  With sync, the order for each pair in it doesn't
matter on either the reader or writer (there must be a sync or 2
instead).

Bruce
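
P.S.  For completeness, a minimal sketch of the locking variant; the
names are made up, and a real version would presumably want a lock per
CPU or per stats block rather than one global lock:

	#include <sys/param.h>
	#include <sys/lock.h>
	#include <sys/mutex.h>

	static struct mtx stats_mtx;	/* mtx_init()ed at attach time */
	static uint64_t ipackets;	/* example 64-bit counter */

	static void
	stats_inc_ipackets(void)
	{

		mtx_lock(&stats_mtx);
		ipackets++;
		mtx_unlock(&stats_mtx);
	}

	static uint64_t
	stats_read_ipackets(void)
	{
		uint64_t v;

		mtx_lock(&stats_mtx);
		v = ipackets;
		mtx_unlock(&stats_mtx);
		return (v);
	}

If the reader contends with a writer, priority propagation on the mutex
lets the writer finish its half-done update first, which is the
rescheduling effect described above.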