Date: Wed, 21 Oct 2015 11:42:36 +0300
From: Andriy Gapon <avg@FreeBSD.org>
To: freebsd-hackers <freebsd-hackers@FreeBSD.org>
Cc: Poul-Henning Kamp <phk@phk.freebsd.dk>, Jung-uk Kim <jkim@FreeBSD.org>
Subject: Re: instability of timekeeping
Message-ID: <56274FFC.2000608@FreeBSD.org>
In-Reply-To: <56261FE6.90302@FreeBSD.org>
References: <56261398.60102@FreeBSD.org> <56261FE6.90302@FreeBSD.org>
On 20/10/2015 14:05, Andriy Gapon wrote:
> I performed a small observation. With ntpd disabled I ran `ntpdate -d` at 10
> second intervals in a loop (done via `sleep 10`). It looks like for about 25
> minutes the time offset between a reference server and my machine was quite
> stable. But then it sort of jumped about 2.5 seconds between two consecutive
> ntpdate invocations.
[snip]
> On 20/10/2015 13:12, Andriy Gapon wrote:
[snip]
>> kern.timecounter.tc.TSC-low.frequency: 1607357461
>> kern.timecounter.tc.TSC-low.counter: 2457319922
>> kern.timecounter.tc.TSC-low.mask: 4294967295
[snip]

Another observation and a hypothesis. I tried timecounters other than TSC and
could not reproduce the issue with them. Another thing that occurred to me:
TSC-low.mask / TSC-low.frequency ≈ 2.7 seconds, which matches the size of the
observed jump.

From these observations and from reading some code comes my hypothesis.
First, I assume that the visible TSC values on different cores are not
perfectly synchronized. Second, I think there can be circumstances where
tc_ticktock() -> tc_windup() gets called on different cores sufficiently close
in time that the later call sees a TSC value which is "before" (numerically
smaller than) the TSC value read earlier on the other core. In that case the
unsigned delta between the readings would be close to TSC-low.mask.

Right now I have no proof that what the hypothesis says is what actually
happens. On the other hand, I do not see anything that would prevent the
hypothesized situation from occurring. To add more weight to the hypothesis:

cpu1:invltlb       674384    4
cpu1:invlrng       453020    3
cpu1:invlpg     108772578  652
cpu1:preempt     37435000  224
cpu1:ast            36757    0
cpu1:rendezvous      9473    0
cpu1:hardclock   22267434  133

As you can see, I am currently running workloads that produce a very
significant number of IPIs, especially the page invalidation IPIs.
Given that all (x86) IPIs have the same priority (based on their vector
numbers), I think it is plausible that the hardclock IPI could get arbitrarily
delayed. I guess I could add a tracepoint to record deltas that are close to
the current timecounter's mask.

Assuming the hypothesis is correct, I see three possible ways to work around
the problem:

1. Increase tsc_shift, so that the cross-CPU TSC differences become smaller
(at the cost of lower resolution). That should reduce the risk of seeing
"backwards" TSC values. Judging from the numbers that tools/tools/tscdrift
produces, I could set tsc_shift to 7. The resulting resolution should be no
worse than that of the HPET or ACPI-fast counters, with the benefit of the
TSC being much faster to read.

2. Change the code so that tc_windup() is always called on the same CPU.
E.g. it could be the BSP, or the CPU that receives the actual timer
interrupts (as opposed to the hardclock IPIs). This should help with the
timekeeping, but won't help with the "jitter" in binuptime() and friends.

3. In tc_delta(), somehow detect and filter out "slightly behind" timecounter
readings. I am not sure whether this is possible at all.

-- 
Andriy Gapon