From owner-freebsd-current@freebsd.org Fri Jun 3 05:06:43 2016 Return-Path: Delivered-To: freebsd-current@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id DCC35B68143 for ; Fri, 3 Jun 2016 05:06:43 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 4731F1E3A; Fri, 3 Jun 2016 05:06:43 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from tom.home (kib@localhost [127.0.0.1]) by kib.kiev.ua (8.15.2/8.15.2) with ESMTPS id u5356Thp069926 (version=TLSv1 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Fri, 3 Jun 2016 08:06:29 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.10.3 kib.kiev.ua u5356Thp069926 Received: (from kostik@localhost) by tom.home (8.15.2/8.15.2/Submit) id u5356Swv069925; Fri, 3 Jun 2016 08:06:28 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Fri, 3 Jun 2016 08:06:28 +0300 From: Konstantin Belousov To: Maxim Sobolev Cc: FreeBSD Current Subject: Re: CLOCK_MONOTONIC / CLOCK_UPTIME is not really monotonic between threads Message-ID: <20160603050628.GV38613@kib.kiev.ua> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.6.1 (2016-04-27) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.1 X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on tom.home X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 03 Jun 2016 05:06:44 -0000 On Thu, Jun 02, 2016 at 09:16:47PM -0700, Maxim Sobolev wrote: > Hi there, we have an application here which is trying to measure UDP > command/response round-trip-time. It runs two posix threads (more actually, > but that's probably irrelevant), one (let's call it A) that does high-level > logic and the second one (B) that does network packet I/O. > > The sending side is done by first thread A forming the request, then > calling the function clock_gettime(CLOCK_MONOTONIC) and passing the packet > into the thread B. Obtained timestamp is stored with some logical > transaction ID allowing us to pull that stored value later on when response > arrives. Then we have a separate process that receives those requests, > processing them and sending back some form of response. > > Upon receiving a response from the network, the network I/O thread (B) > timestamps it by running clock_gettime(CLOCK_MONOTONIC) and passes the > packet data along with that value via queue to the thread A for processing. > > So if we put things into timeline, what our app does would probably look > something like the following: > > 1. Thread A generates request. > 2. A calls clock_gettime(CLOCK_MONOTONIC), storing value as t1 internally > 3. A passes packet to thread B > 4. B sends out packet via sendto() to server process running on the same > box (fully separate, not a thread) > > [some microseconds later] > > 5. B receives response from server with recvfrom() > 6. B instantly calls clock_gettime(CLOCK_MONOTONIC), assigns returned value > to t2 > 7. B passes packet data along with t2 to the A via queue > 8. A picks up packet, parses it and retrieves corresponding t1 stored at > step 2. > 9. A calculates RTT by doing t2 - t1 assuming it's going to be positive... > > As you might have guessed if you are still reading, from time to time t2 - > t1 comes out slightly negative! Provided it's not some obscure bug in our > app, there is no way this could happen if clock_gettime(CLOCK_MONOTONIC) > would work as advertised. Event (2) could not possibly happen earlier than > (6), which is guaranteed by the fact that the request needs to be processed > by the external entity first in order for the response to be seen by our > app at all. I've added some logs and it seems to be confirming that the > server only sees a single request, there is no chance for the client to > receive some other packet and confuse it. I've also confirmed with tcpdump, > which shows reasonable time delay between request and reply of few hundreds > microseconds. > > I've checked all logic and I could not find any mistakes on my end here, so > I added some logging for such events. The distribution appears to be > centered around 0.00006s, but there are some events that go as far up as > 0.012s. > > I've also tried using CLOCK_UPTIME_PRECISE instead, but it makes no > difference whatsoever. > > My questions therefore are: > > 1. Is it intended/expected behavior of the said API? No. > 2. Has anyone else bumped into this? Not that I am aware of. > 3. I know we are doing some clever optimizations using TSC to speed those > APIs up, could be it related to that? No idea. > 4. If the answer for (3) is yes, then what is the method to disable using > TSC and use slower but possibly more reliable method? Set sysctl kern.timecounter.fast_gettime to 0. > 5. Is there any way/plan to fix this long-term? Fix what ? > 6. If the behavior is intended/expected, what is the maximum discrepancy we > can expect from this effect? > > In general some time ago we've spend quite lot of time switching our app > from using REALTIME into MONOTIME in the hope to eliminate any wizardry > needed to deal with realtime possibly jumping back and forth due to NTP, > leap seconds etc, it's shame that this is not working either. Apart from > measuring command processing delay, that app does lot of high-volume voip > call billing, so even such small discrepancy can easily build up into a > bigger problem, not to mention the fact that every time we deal with > duration we now need to have some check in place to make sure we don't have > some negative values popping our of nowhere. > > Any hints, suggestions, pointers are appreciated. I can also give more > debug information as needed. Thanks! In fact, your rather long mail does not contain any facts except a statement that you possibly see clock_gettime(CLOCK_MONOTONIC) going backward. First obvious question is whether you did tried to reduce the testcase to some usable size ? That said, RDTSC going backward, or perceive of RDTSC value by processors going provable backward, was one of the concerns when userspace gettimeofday(2) code was developed. We have at least three things which should prevent that: 1. kernel tests TSC for SMP coherency on boot 2. since RDTSC is not serialized instruction, we bracket its execution with neccessary CPU-model specific fences to ensure that the counter is not read too early. 3. still, despite item 2, at least 1 lowest bit of the read counter is dropped to accomodate for jitter. When testing the userspace gettimeofday(2) code, I found the following program by Ingo Molnar. It verifies SMP-coherency for the bare TSC and for the library routines. I specifically tested with it on multi-socket Nehalems at that time, AFAIR. I am not asking for details about your machine config, system version and other neccessary information, before either the Ingo' program demostrates the back-stepping, or you provide tiny test which shows the problem. /* * Copyright (C) 2005, Ingo Molnar * * time-warp-test.c: check TSC synchronity on x86 CPUs. Also detects * gettimeofday()-level time warps. * * Compile with: gcc -Wall -O2 -o time-warp-test time-warp-test.c -lrt * $Id: time-warp-test.c,v 1.1 2012/07/25 19:11:05 kostik Exp kostik $ */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #define TEST_TSC 1 #define TEST_TOD 1 #define TEST_CLOCK 1 #if !TEST_TSC && !TEST_TOD && !TEST_CLOCK # error this setting makes no sense ... #endif #if DEBUG # define Printf(x...) printf(x) #else # define Printf(x...) do { } while (0) #endif /* * Shared locks and variables between the test tasks: */ enum { SHARED_LOCK = 0, SHARED_TSC = 2, SHARED_TOD = 4, SHARED_CLOCK = 6, SHARED_WORST_TSC = 8, SHARED_WORST_TOD = 10, SHARED_WORST_CLOCK = 12, SHARED_NR_TSC_LOOPS = 14, SHARED_NR_TSC_WARPS = 16, SHARED_NR_TOD_LOOPS = 18, SHARED_NR_TOD_WARPS = 20, SHARED_NR_CLOCK_LOOPS = 22, SHARED_NR_CLOCK_WARPS = 24, SHARED_END = 26, }; #define SHARED(x) (*(shared + SHARED_##x)) #define SHARED_LL(x) (*(long long *)(shared + SHARED_##x)) #define BUG_ON(c) assert(!(c)) typedef unsigned long long cycles_t; typedef unsigned long long usecs_t; typedef unsigned long long u64; #ifdef __x86_64__ #define DECLARE_ARGS(val, low, high) unsigned low, high #define EAX_EDX_VAL(val, low, high) ((low) | ((u64)(high) << 32)) #define EAX_EDX_ARGS(val, low, high) "a" (low), "d" (high) #define EAX_EDX_RET(val, low, high) "=a" (low), "=d" (high) #else #define DECLARE_ARGS(val, low, high) unsigned long long val #define EAX_EDX_VAL(val, low, high) (val) #define EAX_EDX_ARGS(val, low, high) "A" (val) #define EAX_EDX_RET(val, low, high) "=A" (val) #endif static inline unsigned long long __rdtscll(void) { DECLARE_ARGS(val, low, high); #if 0 asm volatile(/*"cpuid;" "lfence;*/"rdtsc" : EAX_EDX_RET(val, low, high)); #else asm volatile("lfence;rdtsc;"/* shrd $8,%%edx,%%eax"*/ : EAX_EDX_RET(val, low, high)); #endif return EAX_EDX_VAL(val, low, high); } #define rdtscll(val) do { (val) = __rdtscll(); } while (0) #define rdtod(val) \ do { \ struct timeval tv; \ \ gettimeofday(&tv, NULL); \ (val) = tv.tv_sec * 1000000ULL + tv.tv_usec; \ } while (0) #define rdclock(val) \ do { \ struct timespec ts; \ \ clock_gettime(CLOCK_MONOTONIC, &ts); \ (val) = ts.tv_sec * 1000000000ULL + ts.tv_nsec; \ } while (0) static unsigned long *setup_shared_var(void) { char zerobuff [4096] = { 0, }; int ret, fd; unsigned long *buf; fd = creat(".tmp_mmap", 0700); BUG_ON(fd == -1); close(fd); fd = open(".tmp_mmap", O_RDWR|O_CREAT|O_TRUNC); BUG_ON(fd == -1); ret = write(fd, zerobuff, 4096); BUG_ON(ret != 4096); buf = (void *)mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); BUG_ON(buf == (void *)-1); close(fd); unlink(".tmp_mmap"); return buf; } static inline void lock(unsigned long *flag) { #if 0 __asm__ __volatile__( "1: lock; btsl $0,%0\n" "jc 1b\n" : "=g"(*flag) : : "memory"); #else __asm__ __volatile__( "1: lock; btsl $0,%0\n\t" "jnc 3f\n" "2: testl $1,%0\n\t" "je 1b\n\t" "rep ; nop\n\t" "jmp 2b\n" "3:" : "+m"(*flag) : : "memory"); #endif } static inline void unlock(unsigned long *flag) { #if 0 __asm__ __volatile__( "lock; btrl $0,%0\n" : "=g"(*flag) :: "memory"); __asm__ __volatile__("rep; nop"); #else __asm__ __volatile__("movl $0,%0; rep; nop" : "=g"(*flag) :: "memory"); #endif } static void print_status(unsigned long *shared) { const char progress[] = "\\|/-"; static unsigned long long sum_tsc_loops, sum_tod_loops, sum_clock_loops, sum_tod; static unsigned int count1, count2; static usecs_t prev_tod; usecs_t tod; if (!prev_tod) rdtod(prev_tod); count1++; if (count1 < 1000) return; count1 = 0; rdtod(tod); if (abs(tod - prev_tod) < 100000ULL) return; sum_tod += tod - prev_tod; sum_tsc_loops += SHARED_LL(NR_TSC_LOOPS); sum_tod_loops += SHARED_LL(NR_TOD_LOOPS); sum_clock_loops += SHARED_LL(NR_CLOCK_LOOPS); SHARED_LL(NR_TSC_LOOPS) = 0; SHARED_LL(NR_TOD_LOOPS) = 0; SHARED_LL(NR_CLOCK_LOOPS) = 0; if (TEST_TSC) printf(" | TSC: %.2fus, fail:%ld", (double)sum_tod/(double)sum_tsc_loops, SHARED(NR_TSC_WARPS)); if (TEST_TOD) printf(" | TOD: %.2fus, fail:%ld", (double)sum_tod/(double)sum_tod_loops, SHARED(NR_TOD_WARPS)); if (TEST_CLOCK) printf(" | CLK: %.2fus, fail:%ld", (double)sum_tod/(double)sum_clock_loops, SHARED(NR_CLOCK_WARPS)); prev_tod = tod; count2++; printf(" %c\r", progress[count2 & 3]); fflush(stdout); } static inline void test_TSC(unsigned long *shared) { #if TEST_TSC cycles_t t0, t1; long long delta; lock(&SHARED(LOCK)); rdtscll(t1); t0 = SHARED_LL(TSC); SHARED_LL(TSC) = t1; SHARED_LL(NR_TSC_LOOPS)++; unlock(&SHARED(LOCK)); delta = t1-t0; if (delta < 0) { lock(&SHARED(LOCK)); SHARED(NR_TSC_WARPS)++; if (delta < SHARED_LL(WORST_TSC)) { SHARED_LL(WORST_TSC) = delta; fprintf(stderr, "\rnew TSC-warp maximum: %9Ld cycles, %016Lx -> %016Lx\n", delta, t0, t1); } unlock(&SHARED(LOCK)); } if (!((unsigned long)t0 & 31)) asm volatile ("rep; nop"); #endif } static inline void test_TOD(unsigned long *shared) { #if TEST_TOD usecs_t T0, T1; long long delta; lock(&SHARED(LOCK)); rdtod(T1); T0 = SHARED_LL(TOD); SHARED_LL(TOD) = T1; SHARED_LL(NR_TOD_LOOPS)++; unlock(&SHARED(LOCK)); delta = T1-T0; if (delta < 0) { lock(&SHARED(LOCK)); SHARED(NR_TOD_WARPS)++; if (delta < SHARED_LL(WORST_TOD)) { SHARED_LL(WORST_TOD) = delta; fprintf(stderr, "\rnew TOD-warp maximum: %9Ld usecs, %016Lx -> %016Lx\n", delta, T0, T1); } unlock(&SHARED(LOCK)); } #endif } static inline void test_CLOCK(unsigned long *shared) { #if TEST_CLOCK usecs_t T0, T1; long long delta; lock(&SHARED(LOCK)); rdclock(T1); T0 = SHARED_LL(CLOCK); SHARED_LL(CLOCK) = T1; SHARED_LL(NR_CLOCK_LOOPS)++; unlock(&SHARED(LOCK)); delta = T1-T0; if (delta < 0) { lock(&SHARED(LOCK)); SHARED(NR_CLOCK_WARPS)++; if (delta < SHARED_LL(WORST_CLOCK)) { SHARED_LL(WORST_CLOCK) = delta; fprintf(stderr, "\rnew CLOCK-warp maximum: %9Ld nsecs, %016Lx -> %016Lx\n", delta, T0, T1); } unlock(&SHARED(LOCK)); } #endif } int main(int argc, char **argv) { int i, parent, me; unsigned long *shared; unsigned long cpus, tasks; cpus = sysconf(_SC_NPROCESSORS_CONF); if (argc > 2) { usage: fprintf(stderr, "usage: tsc-sync-test \n"); exit(-1); } if (argc == 2) { tasks = atol(argv[1]); if (!tasks) goto usage; } else tasks = cpus; printf("%ld CPUs, running %ld parallel test-tasks.\n", cpus, tasks); printf("checking for time-warps via:\n" #if TEST_TSC "- read time stamp counter (RDTSC) instruction (cycle resolution)\n" #endif #if TEST_TOD "- gettimeofday (TOD) syscall (usec resolution)\n" #endif #if TEST_CLOCK "- clock_gettime(CLOCK_MONOTONIC) syscall (nsec resolution)\n" #endif "\n" ); shared = setup_shared_var(); parent = getpid(); for (i = 1; i < tasks; i++) { if (!fork()) break; } me = getpid(); while (1) { int i; for (i = 0; i < 10; i++) test_TSC(shared); for (i = 0; i < 10; i++) test_TOD(shared); for (i = 0; i < 10; i++) test_CLOCK(shared); if (me == parent) print_status(shared); } return 0; }