From owner-svn-src-all@FreeBSD.ORG Thu Jul 26 12:31:00 2012
Date: Thu, 26 Jul 2012 22:30:51 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Konstantin Belousov
Cc: Jim Harris, src-committers@FreeBSD.org, svn-src-all@FreeBSD.org,
    Andriy Gapon, Bruce Evans, svn-src-head@FreeBSD.org, Jung-uk Kim
Subject: Re: svn commit: r238755 - head/sys/x86/x86
Message-ID: <20120726213001.K3621@besplex.bde.org>
In-Reply-To: <20120726104918.GW2676@deviant.kiev.zoral.com.ua>
References: <201207242210.q6OMACqV079603@svn.freebsd.org> <500F9E22.4080608@FreeBSD.org> <20120725102130.GH2676@deviant.kiev.zoral.com.ua> <500FE6AE.8070706@FreeBSD.org> <20120726001659.M5406@besplex.bde.org> <50102C94.9030706@FreeBSD.org> <20120725180537.GO2676@deviant.kiev.zoral.com.ua> <50103C61.8040904@FreeBSD.org> <20120726170837.Q2536@besplex.bde.org> <20120726104918.GW2676@deviant.kiev.zoral.com.ua>
On Thu, 26 Jul 2012, Konstantin Belousov wrote:

> On Thu, Jul 26, 2012 at 05:35:23PM +1000, Bruce Evans wrote:
>> In fact, there is always a full documented serialization instruction
>> for syscalls, except maybe in FreeBSD-1 compat code on i386, at
>> least on Athlon64.  i386 syscalls use int 0x80 (except in FreeBSD-1
>> compat code, which uses lcalls), and the iret necessary to return from
>> these is serializing on at least Athlon64.  amd64 syscalls use
>> syscall/sysret.  sysret isn't serializing (like far returns), at least
>> on Athlon64, but at least in FreeBSD, the syscall implementation uses
>> at least 2 swapgs's (one on entry and one just before the sysret), and
>> swapgs is serializing, at least on Athlon64.
> Yes, SWAPGS is not documented as serializing on Intels.  I reviewed

Isn't that too incompatible?

> the whole syscall sequence for e.g. gettimeofday(2), and there is no
> serialization point on the fast path.  E.g., an ast would add locking
> and thus serialization, as would a return by IRET, but the fast path
> on amd64 has no such things.

>>> This function was moved around from time to time and now it sits here:
>>>
>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob_plain;f=arch/x86/vdso/vclock_gettime.c
>>>
>>> It still carries one barrier before rdtsc.  Please see the comments.
>>
>> For safety, you probably need to use the slowest (cpuid) method.  Linux
>> seems to be just using fences that are observed to work.
> No, there is explicit mention of the recommended barriers in the vendor
> documentation, which is LFENCE for Intels, and MFENCE for AMDs.  My patch
> just follows what is suggested in the documentation.

But you say later that CPUID is needed (instead of just a lock?).  The
original Athlon64 manual doesn't seem to mention MFENCE for RDTSC.
Maybe later manuals clarify that MFENCE works on old CPUs too.

> [Replying to other mail in-place, the thread goes wild]

Too much quoting :-).
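[For concreteness, the vendor-recommended barriers above look roughly like
the following in a userland sketch.  This is only an illustration of the
idea, not kib's actual patch; the kernel would select the fence at runtime
by CPU vendor rather than providing two functions.]

```c
#include <stdint.h>

/*
 * Read the TSC behind a fence so the read cannot drift earlier than
 * preceding memory accesses.  Intel documents LFENCE for this purpose;
 * AMD documents MFENCE.  Sketch only -- function names are mine.
 */
static inline uint64_t
rdtsc_lfence(void)
{
	uint32_t lo, hi;

	__asm__ __volatile__("lfence; rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32 | lo);
}

static inline uint64_t
rdtsc_mfence(void)
{
	uint32_t lo, hi;

	__asm__ __volatile__("mfence; rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32 | lo);
}
```

Neither fence is architecturally a full serializing instruction like
CPUID; they only order memory accesses (and, per the vendor text, the
TSC read) around themselves, which is why they are so much cheaper.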
> On Thu, Jul 26, 2012 at 04:25:01PM +1000, Bruce Evans wrote:
>> ...
>> For the threaded case, there has to be something for the accesses to
>> be provably ordered.  It is hard to see how that something can be
>> strong enough unless it serializes all thread state in A and B.  The
>> rdtsc state is not part of the thread state as known to APIs, but it
>> is hard to see how threads can serialize themselves without also
>> serializing the TSC.
> TSC timer read is not synchronized, and I found the Linux test for the
> thing I described above.  An adapted version is available at
> http://people.freebsd.org/~kib/misc/time-warp-test.c.
> It shall be compiled in 32bit mode only.

My point is that it will normally be synchronized by whatever the
threads do to provide synchronization for themselves.  Only the case
of a single thread doing sequential timer reads should expect the
reads to be monotonic without any explicit synchronization.  I hope
this case doesn't require stalling everything in low-level code.

> On my Nehalem workstation, I get an enormous amount of wraps reported
> for RDTSC without CPUID.  Adding CPUID back fixes the issue.  So at
> least on Nehalems (and probably Westmere, I will test later today)
> RDTSC can even pass LOCKed instructions.

Oh, you mean with the test program: it needs CPUID because it only has
locks and no fences, and its CPUID is commented out.

> Curiously enough, SandyBridge is sane and reports zero wraps, it seems
> Intel fixed the bug.

The original Athlon64 manual doesn't seem to mention locks being
sufficient any more than it mentions fences.

>> I care about timestamps being ordered more than most people, and tried
>> to kill the get*time() APIs because they are weakly ordered relative
>> to the non-get variants (they return times in the past, and there is
>> no way to round down to get consistent times).  I tried to fix them
>> by adding locking and updating them to the latest time whenever a
>> non-get variant gives a later time (by being used).
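[The shape of the check that time-warp-test.c performs is roughly the
following.  This is a condensed 64-bit sketch of mine, not the actual
program: the real test runs in 32-bit mode and orders its reads with a
locked section rather than a CAS, which is exactly why it needs the
CPUID to be uncommented on Nehalem.]

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t last_tsc;	/* last value any thread published */
static _Atomic long warps;		/* times rdtsc ran behind it */
static int loops_per_thread;

static inline uint64_t
tw_rdtsc(void)
{
	uint32_t lo, hi;

	/* Deliberately unfenced: this ordering is the question at issue. */
	__asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32 | lo);
}

static void *
tw_worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < loops_per_thread; i++) {
		uint64_t prev = atomic_load(&last_tsc);
		uint64_t now = tw_rdtsc();
		if (now < prev)
			atomic_fetch_add(&warps, 1);	/* rdtsc passed the load */
		/* Publish the newer value; losing this race is harmless. */
		atomic_compare_exchange_strong(&last_tsc, &prev,
		    now > prev ? now : prev);
	}
	return (NULL);
}

/* Hammer the TSC from nthreads threads; return warps observed. */
static long
run_timewarp(int nthreads, int iters)
{
	pthread_t tid[16];

	if (nthreads > 16)
		nthreads = 16;
	loops_per_thread = iters;
	atomic_store(&warps, 0);
	for (int i = 0; i < nthreads; i++)
		pthread_create(&tid[i], NULL, tw_worker, NULL);
	for (int i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);
	return (atomic_load(&warps));
}
```

A nonzero return on in-sync TSCs means a thread's rdtsc executed before
its (program-order earlier) load of last_tsc, which is the reordering
being discussed; on a SandyBridge-class machine the count stays zero.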
>> This was too slow,
>> and breaks the design criterion that timecounter calls should not use
>> any explicit locking.  However, if you want slowness, then you can get
>> it similarly by fixing the monotonicity of rdtsc in software.  I think
>> I just figured out how to do this with the same slowness as
>> serialization, if a locked instruction serializes; maybe less otherwise:
>>
>> spin:
>> 	ptsc = prev_tsc;	/* memory -> local (intentionally !atomic) */
>> 	tsc = rdtsc();		/* only 32 bits for timecounters */
>> 	if (tsc <= ptsc) {	/* I forgot about wrap at first -- see below */
>> 		/*
>> 		 * It went backwards, or stopped.  Could handle more
>> 		 * completely, starting with panic() to see if this
>> 		 * happens at all.
>> 		 */
>> 		return (ptsc);	/* stopped is better than backwards */
>> 	}
>> 	/* Usual case; update (32 bits). */
>> 	if (atomic_cmpset_int(&prev_tsc, ptsc, tsc))
>> 		return (tsc);
>> 	goto spin;

> I do not understand this.  The algorithm is clear, but what you propose
> is very heavy-weight compared with adding just LFENCE or MFENCE before
> rdtsc.  First, the cache line for prev_tsc becomes heavily contended.
> Second, CAS is expensive.  LFENCE is fully local to the core it
> executes on.

I expect the contention to be rare, but then optimization isn't
important either.  But if the problem is fully local, as it apparently
is for fences to fix it, then prev_tsc can be per-CPU with a non-atomic
cmpset to access it.  We don't care if rdtsc gives an old value due to
some delay in copying the result to EDX:EAX any more than we care about
an old value due to being interrupted.

The case where we are interrupted, and context-switched, and come back
on a different CPU, is especially interesting.  Then we may have an old
tsc value from another CPU.  Sometimes we detect that it is old for the
new CPU, sometimes not.  There is a problem in theory but I think none
in practice.
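[The loop above, rendered as compilable C11 for illustration.  This is
a userland sketch of mine: the kernel version would use the per-arch
atomic_cmpset_int and only the low 32 bits that timecounters consume,
and as noted it punts on 32-bit wrap by returning the stale value.]

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t prev_tsc;	/* last value successfully published */

static inline uint32_t
rdtsc32(void)
{
	uint32_t lo, hi;

	__asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
	return (lo);	/* timecounters only use the low 32 bits */
}

/*
 * Monotonic-ized TSC read: never return a value earlier than the last
 * value published, at the cost of a CAS on the shared cache line.
 */
static uint32_t
tsc_get_monotonic(void)
{
	uint32_t ptsc, tsc;

	for (;;) {
		/* Relaxed load mirrors the "intentionally !atomic" read. */
		ptsc = atomic_load_explicit(&prev_tsc, memory_order_relaxed);
		tsc = rdtsc32();
		if (tsc <= ptsc) {
			/*
			 * Went backwards, stopped, or wrapped.  Returning
			 * the old value trades accuracy for monotonicity.
			 */
			return (ptsc);
		}
		if (atomic_compare_exchange_weak(&prev_tsc, &ptsc, tsc))
			return (tsc);
		/* Lost a race with another reader; spin. */
	}
}
```

Two successive calls from one thread can never go backwards by
construction, which is the property the fences are otherwise needed
for; the objection in the quoted reply is that every reader now pays
for the shared cache line instead of a core-local LFENCE.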
The switch could set a flag to tell us to loop (set prev_tsc to a
sentinel value), and it accidentally already does, except with serious
out-of-orderness: switches happen to always read the TSC if the TSC is
usable, to maintain the pcpu switchtime variable.  The read for this
will give a tsc value for the new CPU that is in advance of the old
one, provided all the TSCs are in sync and there is sufficient
monotonicity across CPUs.  Then the loop will see that its tsc is old,
and repeat.

I am only half serious in proposing the above.  If you want to be slow
then you can do useful work like ensuring monotonicity across all CPUs
in much the same time that is wasted by stalling the hardware until
everything is serialized.

Bruce