Date: Sat, 14 May 2011 05:14:18 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Jung-uk Kim <jkim@freebsd.org>
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org, Andriy Gapon <avg@freebsd.org>
Subject: Re: svn commit: r221703 - in head/sys: amd64/include i386/include x86/isa x86/x86
Message-ID: <20110514041838.S2545@besplex.bde.org>
In-Reply-To: <201105131257.34009.jkim@FreeBSD.org>
References: <201105091734.p49HY0P3006180@svn.freebsd.org> <201105121239.31340.jkim@FreeBSD.org> <4DCD2875.9040808@FreeBSD.org> <201105131257.34009.jkim@FreeBSD.org>

On Fri, 13 May 2011, Jung-uk Kim wrote:

> On Friday 13 May 2011 08:47 am, Andriy Gapon wrote:
>> on 12/05/2011 19:39 Jung-uk Kim said the following:
>>> Actually, I am kinda reluctant to enable smp_tsc by default on
>>> recent CPUs.  Although they made all TSCs in sync, it is very
>>> very tricky to make it work in reality, e.g.,
>>>
>>> https://patchwork.kernel.org/patch/691712/
>>
>> I am not sure what is their concern there.
>> TSC is good to be used as timecounter.
>
> *Iff* they are all in sync. and atomically increasing...

Not even that.  rdtsc is non-serializing, so TSC causality may be
violated by different reordering of rdtsc on different CPUs.
Apparently this happens in practice.

I think it would not happen for "rdtsc; rdtsc" in a single thread,
since the context switch needed to execute the rdtsc's on different
CPUs would take a long time and would probably execute some
serializing instructions.  It might happen for clock_gettime() in
separate threads where the threads somehow know and depend on the
order of the calls.  Shared variables seem to be needed for knowing
this, and I don't know how the variables could be accessed atomically
enough without serializing the rdtsc's.  Apart from that, the code
might be:

    thread 1                    thread 2
    --------                    --------
    start1 = gen++;             start2 = gen++;
    clock_gettime(...);         clock_gettime(...);
    end1 = gen++;               end2 = gen++;

We can hope that the first clock_gettime() executed entirely before
the second one if end1 < start2.  But without any serializing
instructions, the rdtsc's aren't guaranteed to execute between the
stores to the variables, so knowing the order of the stores tells us
nothing about the order of the rdtsc's.

>> If they use TSC for performance measurements, then of course they
>> have to use some barriers - this is well known and documented.

Also for timecounters.

> If my understanding is correct, Linux has to make sure the new timer
> value read from a CPU must be written/read to/from memory in order
> and all other CPUs must be able see the updated value as their
> "vsyscall" and/or "vDSO" version of gettimeofday(2) and friends rely
> on it.  Also, the last value read from a CPU is kept in memory and
> compared with a new value (possibly read from another CPU) to make
> sure it is incremental.  I'd call it a "TSC-safe" timecounter. ;-)
> Some price to pay when you do timekeeping in user space to avoid
> syscalls...

Hmm, timecounter code intentionally doesn't store the last value read
to a shared (kernel) variable due to the cost of doing so, although
this causes bugs like times read by the "get*()" interfaces being
incoherent with times read by the non-"get*()" interfaces.

Perhaps the problem is only visible with the userland implementation,
since clock_gettime() is too slow to expose it via code like the above
if it is a syscall (the rdtsc will be in the middle somewhere and
there is plenty of time for it to complete before the stores in
userland).  The Linux discussion says that a change optimizes
clock_gettime() from 22 ns to 17 ns on Sandybridge.  20 ns is about
the time for a single rdtsc.  On AthlonXP, the best I've seen for the
FreeBSD syscall is about 250 ns (550 cycles), despite rdtsc only
taking 12 cycles on AthlonXP (rdtsc takes more like 40-80 cycles on
newer CPUs, for the hardware part of its synchronization :-().
Timecounter calls using the TSC timecounter in the FreeBSD kernel take
only about 50 cycles (when rdtsc takes only 12 cycles), but syscalls
add a lot.

>> BTW, newer CPUs provide RDTSCP instruction which could be more
>> convenient there.
>
> AFAIK, some people wanted to do that but Linus thought RDTSCP is too
> expensive as it is a serialized instruction.

The whole point of the Linux discussion is to reduce the
synchronization that they already have (a couple of fence
instructions).

Bruce
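
The "keep the last value in memory and compare" scheme that the quoted
text attributes to Linux can be sketched as below.  This is a minimal
C11 illustration, not the Linux or FreeBSD code: last_seen and
monotonic_clamp() are hypothetical names, and the compare-and-swap loop
is just one assumed way to do the update.  The shared store on every
read is the kind of cost that the timecounter code deliberately avoids.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Last timestamp handed out to any thread (hypothetical shared variable). */
static atomic_uint_fast64_t last_seen;

/*
 * Return a timestamp that never goes backwards across callers: if "now"
 * is older than the last value published, return that value instead.
 */
static uint64_t
monotonic_clamp(uint64_t now)
{
    uint_fast64_t last = atomic_load(&last_seen);

    while (last < now) {
        /* Try to publish "now"; on failure "last" is reloaded. */
        if (atomic_compare_exchange_weak(&last_seen, &last, now))
            return (now);
    }
    return (last);
}
```

A caller would pass each raw counter reading through
monotonic_clamp(), so a reading that lags behind one already seen on
another CPU is replaced by the newer value.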