Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 3 Jun 2016 08:04:29 -0700
From:      Maxim Sobolev <sobomax@freebsd.org>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        FreeBSD Current <freebsd-current@freebsd.org>
Subject:   Re: CLOCK_MONOTONIC / CLOCK_UPTIME is not really monotonic between threads
Message-ID:  <CAH7qZftZ3oHAhsfmoZncorBN7Nr_wXn3MK6Q7YfG_eh_bZJnqQ@mail.gmail.com>
In-Reply-To: <20160603050628.GV38613@kib.kiev.ua>
References:  <CAH7qZfsBLM_wqdTXiiyVvevwU%2Bu257dX9JvZFftfntduUaZDMw@mail.gmail.com> <20160603050628.GV38613@kib.kiev.ua>

next in thread | previous in thread | raw e-mail | index | archive | help
Konstantin,

Thanks for taking your time looking into it and sorry for somewhat messy
problem report. I've been trying to fix that problem all day yesterday
thinking it's just application logic that is broken and indeed has been
able to find some bigger issues that were obscuring this one. But it got
very frustrating when the bug popped out anew at a seemingly lower level
now. The issue that triggered this is in some high level python code. Which
makes it quite difficult to narrow and isolate. There is still slight
chance that it's something about threading within the python that screws
this up somehow, however I don't quite see how that could lead to a
consistent result that is just off by few hundred microseconds and not in
some random garbage.

So, I take from you message, that high level
clock_gettime(CLOCK_MONOTONIC*) is supposed to be monotonic with respect to
the wall time even when called in different threads? I always though it is,
but was not 100% sure about that and wanted to confirm it before I dive
deeper into this and spend more time writing a test case to expose this.
The test case you gave me is interesting, but somewhat low-level. What I
would do if it comes to it, is to make something that uses pthreads and
plain clock_gettime(2). Should not be too difficult to reproduce if it's
real issue.

P.S. I've also tried kern.timecounter.fast_gettime=0, made no difference.
Assuming it does not take a reboot to test it. Neither does
switching kern.timecounter.hardware, I've tested TSC-low(1000)
ACPI-fast(900) HPET(950) i8254(0), all are the same.

In any case thanks for your valuable input, I think I have enough
information for now to investigate it further.

-Max
On Jun 2, 2016 10:06 PM, "Konstantin Belousov" <kostikbel@gmail.com> wrote:

> On Thu, Jun 02, 2016 at 09:16:47PM -0700, Maxim Sobolev wrote:
> > Hi there, we have an application here which is trying to measure UDP
> > command/response round-trip-time. It runs two posix threads (more
> actually,
> > but that's probably irrelevant), one (let's call it A) that does
> high-level
> > logic and the second one (B) that does network packet I/O.
> >
> > The sending side is done by first thread A forming the request, then
> > calling the function clock_gettime(CLOCK_MONOTONIC) and passing the
> packet
> > into the thread B. Obtained timestamp is stored with some logical
> > transaction ID allowing us to pull that stored value later on when
> response
> > arrives. Then we have a separate process that receives those requests,
> > processing them and sending back some form of response.
> >
> > Upon receiving a response from the network, the network I/O thread (B)
> > timestamps it by running clock_gettime(CLOCK_MONOTONIC) and passes the
> > packet data along with that value via queue to the thread A for
> processing.
> >
> > So if we put things into timeline, what our app does would probably look
> > something like the following:
> >
> > 1. Thread A generates request.
> > 2. A calls clock_gettime(CLOCK_MONOTONIC), storing value as t1 internally
> > 3. A passes packet to thread B
> > 4. B sends out packet via sendto() to server process running on the same
> > box (fully separate, not a thread)
> >
> > [some microseconds later]
> >
> > 5. B receives response from server with recvfrom()
> > 6. B instantly calls clock_gettime(CLOCK_MONOTONIC), assigns returned
> value
> > to t2
> > 7. B passes packet data along with t2 to the A via queue
> > 8. A picks up packet, parses it and retrieves corresponding t1 stored at
> > step 2.
> > 9. A calculates RTT by doing t2 - t1 assuming it's going to be
> positive...
> >
> > As you might have guessed if you are still reading, from time to time t2
> -
> > t1 comes out slightly negative! Provided it's not some obscure bug in our
> > app, there is no way this could happen if clock_gettime(CLOCK_MONOTONIC)
> > would work as advertised. Event (2) could not possibly happen earlier
> than
> > (6), which is guaranteed by the fact that the request needs to be
> processed
> > by the external entity first in order for the response to be seen by our
> > app at all. I've added some logs and it seems to be confirming that the
> > server only sees a single request, there is no chance for the client to
> > receive some other packet and confuse it. I've also confirmed with
> tcpdump,
> > which shows reasonable time delay between request and reply of few
> hundreds
> > microseconds.
> >
> > I've checked all logic and I could not find any mistakes on my end here,
> so
> > I added some logging for such events. The distribution appears to be
> > centered around 0.00006s, but there are some events that go as far up as
> > 0.012s.
> >
> > I've also tried using CLOCK_UPTIME_PRECISE instead, but it makes no
> > difference whatsoever.
> >
> > My questions therefore are:
> >
> > 1. Is it intended/expected behavior of the said API?
> No.
>
> > 2. Has anyone else bumped into this?
> Not that I am aware of.
>
> > 3. I know we are doing some clever optimizations using TSC to speed those
> > APIs up, could be it related to that?
> No idea.
>
> > 4. If the answer for (3) is yes, then what is the method to disable using
> > TSC and use slower but possibly more reliable method?
> Set sysctl kern.timecounter.fast_gettime to 0.
>
> > 5. Is there any way/plan to fix this long-term?
> Fix what ?
>
> > 6. If the behavior is intended/expected, what is the maximum discrepancy
> we
> > can expect from this effect?
> >
> > In general some time ago we've spend quite lot of time switching our app
> > from using REALTIME into MONOTIME in the hope to eliminate any wizardry
> > needed to deal with realtime possibly jumping back and forth due to NTP,
> > leap seconds etc, it's shame that this is not working either. Apart from
> > measuring command processing delay, that app does lot of high-volume voip
> > call billing, so even such small discrepancy can easily build up into a
> > bigger problem, not to mention the fact that every time we deal with
> > duration we now need to have some check in place to make sure we don't
> have
> > some negative values popping our of nowhere.
> >
> > Any hints, suggestions, pointers are appreciated. I can also give more
> > debug information as needed. Thanks!
>
> In fact, your rather long mail does not contain any facts except a
> statement
> that you possibly see clock_gettime(CLOCK_MONOTONIC) going backward.  First
> obvious question is whether you did tried to reduce the testcase to some
> usable size ?
>
> That said, RDTSC going backward, or perceive of RDTSC value by
> processors going provable backward, was one of the concerns when
> userspace gettimeofday(2) code was developed.  We have at least three
> things which should prevent that:
> 1. kernel tests TSC for SMP coherency on boot
> 2. since RDTSC is not serialized instruction, we bracket its execution
>    with neccessary CPU-model specific fences to ensure that the counter
>    is not read too early.
> 3. still, despite item 2, at least 1 lowest bit of the read counter is
>    dropped to accomodate for jitter.
>
> When testing the userspace gettimeofday(2) code, I found the following
> program by Ingo Molnar.  It verifies SMP-coherency for the bare TSC and
> for the library routines.  I specifically tested with it on multi-socket
> Nehalems at that time, AFAIR.
>
> I am not asking for details about your machine config, system version and
> other neccessary information, before either the Ingo' program demostrates
> the back-stepping, or you provide tiny test which shows the problem.
>
> /*
>  * Copyright (C) 2005, Ingo Molnar
>  *
>  * time-warp-test.c: check TSC synchronity on x86 CPUs. Also detects
>  *                   gettimeofday()-level time warps.
>  *
>  * Compile with: gcc -Wall -O2 -o time-warp-test time-warp-test.c -lrt
>  * $Id: time-warp-test.c,v 1.1 2012/07/25 19:11:05 kostik Exp kostik $
>  */
> #include <sys/types.h>
> #include <sys/mman.h>
> #include <sys/socket.h>
> #include <sys/stat.h>
> #include <sys/time.h>
> #include <sys/wait.h>
> #include <assert.h>
> #include <ctype.h>
> #include <fcntl.h>
> #include <signal.h>
> #include <stdarg.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <time.h>
> #include <unistd.h>
>
> #define TEST_TSC 1
> #define TEST_TOD 1
> #define TEST_CLOCK 1
>
> #if !TEST_TSC && !TEST_TOD && !TEST_CLOCK
> # error this setting makes no sense ...
> #endif
>
> #if DEBUG
> # define Printf(x...) printf(x)
> #else
> # define Printf(x...) do { } while (0)
> #endif
>
> /*
>  * Shared locks and variables between the test tasks:
>  */
> enum {
>         SHARED_LOCK             = 0,
>         SHARED_TSC              = 2,
>         SHARED_TOD              = 4,
>         SHARED_CLOCK            = 6,
>         SHARED_WORST_TSC        = 8,
>         SHARED_WORST_TOD        = 10,
>         SHARED_WORST_CLOCK      = 12,
>         SHARED_NR_TSC_LOOPS     = 14,
>         SHARED_NR_TSC_WARPS     = 16,
>         SHARED_NR_TOD_LOOPS     = 18,
>         SHARED_NR_TOD_WARPS     = 20,
>         SHARED_NR_CLOCK_LOOPS   = 22,
>         SHARED_NR_CLOCK_WARPS   = 24,
>         SHARED_END              = 26,
> };
>
> #define SHARED(x)       (*(shared + SHARED_##x))
> #define SHARED_LL(x)    (*(long long *)(shared + SHARED_##x))
>
> #define BUG_ON(c) assert(!(c))
>
> typedef unsigned long long cycles_t;
> typedef unsigned long long usecs_t;
> typedef unsigned long long u64;
>
> #ifdef __x86_64__
> #define DECLARE_ARGS(val, low, high)    unsigned low, high
> #define EAX_EDX_VAL(val, low, high)     ((low) | ((u64)(high) << 32))
> #define EAX_EDX_ARGS(val, low, high)    "a" (low), "d" (high)
> #define EAX_EDX_RET(val, low, high)     "=a" (low), "=d" (high)
> #else
> #define DECLARE_ARGS(val, low, high)    unsigned long long val
> #define EAX_EDX_VAL(val, low, high)     (val)
> #define EAX_EDX_ARGS(val, low, high)    "A" (val)
> #define EAX_EDX_RET(val, low, high)     "=A" (val)
> #endif
>
> static inline unsigned long long __rdtscll(void)
> {
>         DECLARE_ARGS(val, low, high);
>
> #if 0
>         asm volatile(/*"cpuid;" "lfence;*/"rdtsc" : EAX_EDX_RET(val, low,
> high));
> #else
>         asm volatile("lfence;rdtsc;"/* shrd $8,%%edx,%%eax"*/ :
> EAX_EDX_RET(val, low, high));
> #endif
>
>         return EAX_EDX_VAL(val, low, high);
> }
>
> #define rdtscll(val) do { (val) = __rdtscll(); } while (0)
>
> #define rdtod(val)                                      \
> do {                                                    \
>         struct timeval tv;                              \
>                                                         \
>         gettimeofday(&tv, NULL);                        \
>         (val) = tv.tv_sec * 1000000ULL + tv.tv_usec;    \
> } while (0)
>
> #define rdclock(val)                                    \
> do {                                                    \
>         struct timespec ts;                             \
>                                                         \
>         clock_gettime(CLOCK_MONOTONIC, &ts);            \
>         (val) = ts.tv_sec * 1000000000ULL + ts.tv_nsec; \
> } while (0)
>
> static unsigned long *setup_shared_var(void)
> {
>         char zerobuff [4096] = { 0, };
>         int ret, fd;
>         unsigned long *buf;
>
>         fd = creat(".tmp_mmap", 0700);
>         BUG_ON(fd == -1);
>         close(fd);
>
>         fd = open(".tmp_mmap", O_RDWR|O_CREAT|O_TRUNC);
>         BUG_ON(fd == -1);
>         ret = write(fd, zerobuff, 4096);
>         BUG_ON(ret != 4096);
>
>         buf = (void *)mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd,
> 0);
>         BUG_ON(buf == (void *)-1);
>
>         close(fd);
>         unlink(".tmp_mmap");
>
>         return buf;
> }
>
> static inline void lock(unsigned long *flag)
> {
> #if 0
>         __asm__ __volatile__(
>                 "1: lock; btsl $0,%0\n"
>                 "jc 1b\n"
>                              : "=g"(*flag) : : "memory");
> #else
>         __asm__ __volatile__(
>                "1: lock; btsl $0,%0\n\t"
>                "jnc 3f\n"
>                "2: testl $1,%0\n\t"
>                "je 1b\n\t"
>                "rep ; nop\n\t"
>                "jmp 2b\n"
>                "3:"
>                : "+m"(*flag) : : "memory");
> #endif
> }
>
> static inline void unlock(unsigned long *flag)
> {
> #if 0
>         __asm__ __volatile__(
>                 "lock; btrl $0,%0\n"
>                              : "=g"(*flag) :: "memory");
>         __asm__ __volatile__("rep; nop");
> #else
>         __asm__ __volatile__("movl $0,%0; rep; nop" : "=g"(*flag) ::
> "memory");
> #endif
> }
>
> static void print_status(unsigned long *shared)
> {
>         const char progress[] = "\\|/-";
>
>         static unsigned long long sum_tsc_loops, sum_tod_loops,
> sum_clock_loops,
>                                   sum_tod;
>         static unsigned int count1, count2;
>         static usecs_t prev_tod;
>
>         usecs_t tod;
>
>         if (!prev_tod)
>                 rdtod(prev_tod);
>
>         count1++;
>         if (count1 < 1000)
>                 return;
>         count1 = 0;
>
>         rdtod(tod);
>         if (abs(tod - prev_tod) < 100000ULL)
>                 return;
>
>         sum_tod += tod - prev_tod;
>         sum_tsc_loops += SHARED_LL(NR_TSC_LOOPS);
>         sum_tod_loops += SHARED_LL(NR_TOD_LOOPS);
>         sum_clock_loops += SHARED_LL(NR_CLOCK_LOOPS);
>         SHARED_LL(NR_TSC_LOOPS) = 0;
>         SHARED_LL(NR_TOD_LOOPS) = 0;
>         SHARED_LL(NR_CLOCK_LOOPS) = 0;
>
>         if (TEST_TSC)
>                 printf(" | TSC: %.2fus, fail:%ld",
>                         (double)sum_tod/(double)sum_tsc_loops,
>                         SHARED(NR_TSC_WARPS));
>
>         if (TEST_TOD)
>                 printf(" | TOD: %.2fus, fail:%ld",
>                         (double)sum_tod/(double)sum_tod_loops,
>                         SHARED(NR_TOD_WARPS));
>
>         if (TEST_CLOCK)
>                 printf(" | CLK: %.2fus, fail:%ld",
>                         (double)sum_tod/(double)sum_clock_loops,
>                         SHARED(NR_CLOCK_WARPS));
>
>         prev_tod = tod;
>         count2++;
>         printf(" %c\r", progress[count2 & 3]);
>         fflush(stdout);
> }
>
> static inline void test_TSC(unsigned long *shared)
> {
> #if TEST_TSC
>         cycles_t t0, t1;
>         long long delta;
>
>         lock(&SHARED(LOCK));
>         rdtscll(t1);
>         t0 = SHARED_LL(TSC);
>         SHARED_LL(TSC) = t1;
>         SHARED_LL(NR_TSC_LOOPS)++;
>         unlock(&SHARED(LOCK));
>
>         delta = t1-t0;
>         if (delta < 0) {
>                 lock(&SHARED(LOCK));
>                 SHARED(NR_TSC_WARPS)++;
>                 if (delta < SHARED_LL(WORST_TSC)) {
>                         SHARED_LL(WORST_TSC) = delta;
>                         fprintf(stderr, "\rnew TSC-warp maximum: %9Ld
> cycles, %016Lx -> %016Lx\n",
>                                 delta, t0, t1);
>                 }
>                 unlock(&SHARED(LOCK));
>         }
>         if (!((unsigned long)t0 & 31))
>                 asm volatile ("rep; nop");
> #endif
> }
>
> static inline void test_TOD(unsigned long *shared)
> {
> #if TEST_TOD
>         usecs_t T0, T1;
>         long long delta;
>
>         lock(&SHARED(LOCK));
>         rdtod(T1);
>         T0 = SHARED_LL(TOD);
>         SHARED_LL(TOD) = T1;
>         SHARED_LL(NR_TOD_LOOPS)++;
>         unlock(&SHARED(LOCK));
>
>         delta = T1-T0;
>         if (delta < 0) {
>                 lock(&SHARED(LOCK));
>                 SHARED(NR_TOD_WARPS)++;
>                 if (delta < SHARED_LL(WORST_TOD)) {
>                         SHARED_LL(WORST_TOD) = delta;
>                         fprintf(stderr, "\rnew TOD-warp maximum: %9Ld
> usecs,  %016Lx -> %016Lx\n",
>                                 delta, T0, T1);
>                 }
>                 unlock(&SHARED(LOCK));
>         }
> #endif
> }
>
> static inline void test_CLOCK(unsigned long *shared)
> {
> #if TEST_CLOCK
>         usecs_t T0, T1;
>         long long delta;
>
>         lock(&SHARED(LOCK));
>         rdclock(T1);
>         T0 = SHARED_LL(CLOCK);
>         SHARED_LL(CLOCK) = T1;
>         SHARED_LL(NR_CLOCK_LOOPS)++;
>         unlock(&SHARED(LOCK));
>
>         delta = T1-T0;
>         if (delta < 0) {
>                 lock(&SHARED(LOCK));
>                 SHARED(NR_CLOCK_WARPS)++;
>                 if (delta < SHARED_LL(WORST_CLOCK)) {
>                         SHARED_LL(WORST_CLOCK) = delta;
>                         fprintf(stderr, "\rnew CLOCK-warp maximum: %9Ld
> nsecs,  %016Lx -> %016Lx\n",
>                                 delta, T0, T1);
>                 }
>                 unlock(&SHARED(LOCK));
>         }
> #endif
> }
>
> int main(int argc, char **argv)
> {
>         int i, parent, me;
>         unsigned long *shared;
>         unsigned long cpus, tasks;
>
>         cpus = sysconf(_SC_NPROCESSORS_CONF);
>
>         if (argc > 2) {
> usage:
>                 fprintf(stderr,
>                         "usage: tsc-sync-test <threads>\n");
>                 exit(-1);
>         }
>         if (argc == 2) {
>                 tasks = atol(argv[1]);
>                 if (!tasks)
>                         goto usage;
>         } else
>                 tasks = cpus;
>
>         printf("%ld CPUs, running %ld parallel test-tasks.\n", cpus,
> tasks);
>         printf("checking for time-warps via:\n"
> #if TEST_TSC
>         "- read time stamp counter (RDTSC) instruction (cycle
> resolution)\n"
> #endif
> #if TEST_TOD
>         "- gettimeofday (TOD) syscall (usec resolution)\n"
> #endif
> #if TEST_CLOCK
>         "- clock_gettime(CLOCK_MONOTONIC) syscall (nsec resolution)\n"
> #endif
>                 "\n"
>                 );
>         shared = setup_shared_var();
>
>         parent = getpid();
>
>         for (i = 1; i < tasks; i++) {
>                 if (!fork())
>                         break;
>         }
>         me = getpid();
>
>         while (1) {
>                 int i;
>
>                 for (i = 0; i < 10; i++)
>                         test_TSC(shared);
>                 for (i = 0; i < 10; i++)
>                         test_TOD(shared);
>                 for (i = 0; i < 10; i++)
>                         test_CLOCK(shared);
>
>                 if (me == parent)
>                         print_status(shared);
>         }
>
>         return 0;
> }
>
>



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAH7qZftZ3oHAhsfmoZncorBN7Nr_wXn3MK6Q7YfG_eh_bZJnqQ>