Date: Thu, 26 Jul 2012 16:25:01 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: src-committers@FreeBSD.org, svn-src-all@FreeBSD.org, Andriy Gapon <avg@FreeBSD.org>, Bruce Evans <brde@optusnet.com.au>, svn-src-head@FreeBSD.org, Jim Harris <jim.harris@gmail.com>
Subject: Re: svn commit: r238755 - head/sys/x86/x86
Message-ID: <20120726145325.U2026@besplex.bde.org>
In-Reply-To: <20120725192351.GR2676@deviant.kiev.zoral.com.ua>
References: <201207242210.q6OMACqV079603@svn.freebsd.org> <500F9E22.4080608@FreeBSD.org> <20120725102130.GH2676@deviant.kiev.zoral.com.ua> <20120725233033.N5406@besplex.bde.org> <20120725173212.GN2676@deviant.kiev.zoral.com.ua> <CAJP=Hc9Ys6uou2wfcO=r9D%2BAaC34mVeHOK1bgFWtBz8kfvWo6A@mail.gmail.com> <20120725192351.GR2676@deviant.kiev.zoral.com.ua>
On Wed, 25 Jul 2012, Konstantin Belousov wrote:

> On Wed, Jul 25, 2012 at 11:00:41AM -0700, Jim Harris wrote:
>> On Wed, Jul 25, 2012 at 10:32 AM, Konstantin Belousov
>> <kostikbel@gmail.com> wrote:
>>> I also asked Jim to test whether the cause of the TSC sync test failure
>>> is the lack of synchronization between gathering data and testing it,
>>> but it appeared that the reason is a genuine timecounter value going
>>> backward.
>>
>> I wonder if instead of the timecounter going backward, the TSC test
>> fails because the CPU speculatively performs the rdtsc instruction in
>> relation to waiter checks in smp_rendezvous_action.  Or maybe we are
>> saying the same thing.
>
> Ok, the definition of 'the timecounter goes back', as I understand it:
>
> you have two events A and B in two threads, provably ordered; say, A is
> a lock release and B is the same lock's acquisition.  Assume that you
> take rdtsc values tA and tB under the scope of the lock, right before A
> and right after B.  Then it should be impossible to have tA > tB.

For the threaded case, there has to be something for the accesses to be
provably ordered.  It is hard to see how that something can be strong
enough unless it serializes all thread state in A and B.  The rdtsc state
is not part of the thread state as known to APIs, but it is hard to see
how threads can serialize themselves without also serializing the TSC.

For most uses, the scope of the serialization and locking also needs to
extend across multiple timer reads.  Otherwise you can have situations
like:

	read the time
				interrupt or context switch
				read later time in other intr handler/thread
				save later time
				back to previous context
	save earlier time

It is unclear how to even prevent such situations.  You (at least, I)
don't want heavyweight locking/synchronization to prevent the context
switches.  And the kernel rarely if ever does such synchronization.
binuptime() has none internally.  It just spins if necessary until the
read becomes stable.
Most callers of binuptime() just call it.

> I do not think that we can ever observe tA > tB if both threads are
> executing on the same CPU.

I thought that that was the problem, with a single thread and no context
switches seeing the TSC go backwards.  Even then, it would take non-useful
behaviour (except for calibration and benchmarks) like spinning executing
rdtsc to see it going backwards.  Normally there are many instructions
between rdtsc's and the non-serialization isn't as deep as that.  Using
syscalls, you just can't read the timecounter without about 1000 cycles
between reads.  When there is a context switch, there is usually
accidental serialization from locking.

I care about timestamps being ordered more than most people, and tried to
kill the get*time() APIs because they are weakly ordered relative to the
non-get variants (they return times in the past, and there is no way to
round down to get consistent times).  I tried to fix them by adding
locking and updating them to the latest time whenever a non-get variant
gives a later time (by being used).  This was too slow, and breaks the
design criterion that timecounter calls should not use any explicit
locking.

However, if you want slowness, then you can get it similarly by fixing
the monotonicity of rdtsc in software.  I think I just figured out how to
do this with the same slowness as serialization, if a locked instruction
serializes; maybe less otherwise:

spin:
	ptsc = prev_tsc;	/* memory -> local (intentionally !atomic) */
	tsc = rdtsc();		/* only 32 bits for timecounters */
	if (tsc <= ptsc) {	/* I forgot about wrap at first -- see below */
		/*
		 * It went backwards, or stopped.  Could handle more
		 * completely, starting with panic() to see if this
		 * happens at all.
		 */
		return (ptsc);	/* stopped is better than backwards */
	}
	/* Usual case; update (32 bits). */
	if (atomic_cmpset_int(&prev_tsc, ptsc, tsc))
		return (tsc);
	goto spin;

The 32-bitness of timecounters is important for the algorithm, and for
efficiency on i386.
We assume that the !atomic read gives coherent bits.  The value may be in
the past.  When tsc <= ptsc, the value is in the future, so the value
must be up to date, unless there is massive non-seriality with another
CPU having just written a value more up to date than this CPU read.  We
don't care about this, since losing this race is no different from being
preempted after we read.

When tsc > ptsc, we want to write it as 32 bits to avoid the cmpxchg8b
slowness/unportability.  Again, the value may be out of date when we try
to update it, because we were preempted.  We don't care about this
either, as above, but we detect some cases as a side effect of checking
that ptsc is up to date.  Normally ptsc was up to date when it was read,
but it could easily be out of date when it was checked by the cmpset, iff
another CPU updated it.  This would enforce monotonicity across multiple
CPUs even if their TSCs are very unsynchronized, but you wouldn't want it
much then.  But if they are synchronized to within the serialization time
or the 1<<shift time, then this synchronizes them quite well.

Oops, better add a wraparound check in the comparison.  With only 32
bits, tsc often wraps.  Use 64 bits for the test and depend on TSCs
starting at 0, or any number of bits and a heuristic.  This now gets
complicated.  Elsewhere we depend on timecounters being called often
enough to prevent them wrapping without us noticing, and use dummy calls
to warm them up.  The heuristic would depend on being called often
enough.  Warming up wouldn't be as automatic as reading the counter -- it
would need to reset the heuristic.

Timecounters can go backwards when the scaling of the count for them
overflows, or when the counter wraps.  This can be detected as a bug, by
saving the previous time as above (but now we need more bits).  This was
intentionally not done because the locking for it is slow.  If
backwardness is detected, not much can be done except return the most
forward time read, as above.  Maybe add epsilon each time.
Bruce