Date: Mon, 12 Dec 2016 12:34:15 -0700
From: Ian Lepore <ian@freebsd.org>
To: Hrant Dadivanyan <hrant@dadivanyan.net>
Cc: freebsd-hackers@freebsd.org
Subject: Re: system time instability
Message-ID: <1481571255.1889.329.camel@freebsd.org>
In-Reply-To: <E1cGVtQ-000Acm-7c@pandora.amnic.net>
References: <E1cGVtQ-000Acm-7c@pandora.amnic.net>

On Mon, 2016-12-12 at 23:04 +0400, Hrant Dadivanyan wrote:
> >
> > On Mon, 2016-12-12 at 17:23 +0400, Hrant Dadivanyan wrote:
> > >
> > > Hello,
> > >
> > > After upgrading the stratum 1 ntp server hardware from a Via EPIA
> > > Mini-ITX to a Supermicro PDSBM-LN2, and the OS from 8.4/i386 to
> > > 10.3-RELEASE-p12/amd64, it started to behave unstably. Most of the
> > > time it keeps time pretty well, with an offset of less than 1-2 us,
> > > but once every few hours the pll frequency jumps and the clock then
> > > drifts. After the calibration interval (256 s) has passed, the
> > > frequency returns to normal and, after the appropriate time, the
> > > clock stabilizes again. Excerpt from loopstats:
> > > 57734 37624.525 -0.000000955 0.120 0.000000588 0.000211 4
> > > 57734 37640.526 0.000000319 0.120 0.000000506 0.000198 4
> > > 57734 37656.526 -0.000000789 0.120 0.001081214 0.000185 4
> > > 57734 37672.526 -0.000398921 100.120 0.000154630 35.355339 4
> > > 57734 37688.526 -0.001941140 100.120 0.000188374 33.071891 4
> > > 57734 37704.525 -0.003389196 100.120 0.000177488 30.935922 4
> > > 57734 37720.525 -0.004745689 100.120 0.000166147 28.937905 4
> > > 57734 37736.525 -0.006022007 100.120 0.000156269 27.068931 4
> > > 57734 37752.526 -0.007220430 100.120 0.000146663 25.320667 4
> > > 57734 37768.526 -0.008343331 100.120 0.000137805 23.685315 4
> > > 57734 37784.525 -0.009399651 100.120 0.000129406 22.155583 4
> > > 57734 37800.525 -0.010391390 100.120 0.000121937 20.724651 4
> > > 57734 37816.526 -0.011320293 100.120 0.000114053 19.386136 4
> > > 57734 37832.526 -0.012194902 100.120 0.000107191 18.134069 4
> > > 57734 37848.526 -0.013013037 100.120 0.000100035 16.962869 4
> > > 57734 37864.526 -0.013783932 100.120 0.000094497 15.867311 4
> > > 57734 37880.526 -0.014507271 100.120 0.000088691 14.842510 4
> > > 57734 37896.525 -0.015184384 100.120 0.000083266 13.883897 4
> > > 57734 37912.526 -0.015822296 100.120 0.000078249 12.987196 4
> > > 57734 37928.525 -0.016119704 0.122 0.000103405 37.383615 4
> > > 57734 37944.526 -0.015132723 0.122 0.000120509 34.969170 4
> > > 57734 37960.526 -0.014207941 0.122 0.000113355 32.710663 4
> > > 57734 37976.525 -0.013339661 0.122 0.000107051 30.598023 4
> > > [snip]
> > > 57734 40296.525 -0.000000337 0.122 0.000001621 0.002136 4
> > > 57734 40312.526 -0.000000980 0.122 0.000001635 0.001998 4
> > >
> > > The change in pll frequency is usually 100ppm, but not always.
> > > Today, for example, it was 29ppm once, 69.3ppm once and 100ppm
> > > three times.
> > >
> > > I have tried the three available timecounters: TSC-low, ACPI-fast,
> > > HPET. I have changed the eventtimer from HPET to LAPIC, and
> > > kern.eventtimer.periodic from 0 to 1. Each change was followed by
> > > a service ntpd restart. I also tried changing kern.hz from 1000
> > > to 100. I even tried 11.0 on another, but exactly identical, board.
> > > The original board has an OCXO instead of the quartz, but
> > > reconnecting the original quartz doesn't help.
> > >
> > > I haven't tried other hardware and/or another OS yet, since the
> > > server isn't easily reachable, but for lack of better ideas I
> > > definitely will.
> > >
> > >
> > > The kernel has all unused drivers and options stripped, with
> > > PPS_SYNC and then FFCLOCK added.
> > > All the additions:
> > > options IPSEC
> > > options GEOM_ELI
> > > options PPS_SYNC
> > > options FFCLOCK
> > >
> > > device crypto
> > > device enc
> > > device pf
> > > device pflog
> > > device smbus
> > > device ichsmb
> > > device smb
> > > device coretemp
> > > device cpuctl
> > > device nvram
> > > device smbios
> > > device ipmi
> > > device aesni
> > >
> > > The relevant part of ntp.conf:
> > > fudge 127.127.20.0 time2 0.6 flag1 1 flag2 0 flag3 1
> > > server 127.127.20.0 mode 2 minpoll 4 prefer
> > > server <external_server> minpoll 8 iburst
> > > restrict default limited kod nomodify notrap nopeer noquery
> > >
> > > rc.conf:
> > > ntpd_program="/usr/local/sbin/ntpd"
> > > ntpd_config="/etc/ntpd.conf"
> > > ntpd_flags="-N -p /var/run/ntpd.pid -f /var/db/ntpd.drift"
> > > ntpd_sync_on_start="YES"
> > >
> > > sysctl.conf (this change also seems irrelevant; rebooting without
> > > this frequency correction changes nothing in the behaviour):
> > > machdep.tsc_freq=2194498500 # pll freq offset change from 21.678 to 0.120ppm
> > >
> > > $ sysctl kern.hz kern.timecounter kern.eventtimer
> > > kern.hz: 1000
> > > kern.timecounter.tsc_shift: 1
> > > kern.timecounter.smp_tsc_adjust: 0
> > > kern.timecounter.smp_tsc: 1
> > > kern.timecounter.invariant_tsc: 1
> > > kern.timecounter.fast_gettime: 1
> > > kern.timecounter.tick: 1
> > > kern.timecounter.choice: TSC-low(1000) ACPI-fast(900) i8254(0) HPET(950) dummy(-1000000)
> > > kern.timecounter.hardware: TSC-low
> > > kern.timecounter.alloweddeviation: 5
> > > kern.timecounter.stepwarnings: 0
> > > kern.timecounter.tc.TSC-low.quality: 1000
> > > kern.timecounter.tc.TSC-low.frequency: 1097249250
> > > kern.timecounter.tc.TSC-low.counter: 2359573202
> > > kern.timecounter.tc.TSC-low.mask: 4294967295
> > > kern.timecounter.tc.ACPI-fast.quality: 900
> > > kern.timecounter.tc.ACPI-fast.frequency: 3579545
> > > kern.timecounter.tc.ACPI-fast.counter: 9238615
> > > kern.timecounter.tc.ACPI-fast.mask: 16777215
> > > kern.timecounter.tc.i8254.quality: 0
> > > kern.timecounter.tc.i8254.frequency: 1193182
> > > kern.timecounter.tc.i8254.counter: 9906
> > > kern.timecounter.tc.i8254.mask: 65535
> > > kern.timecounter.tc.HPET.quality: 950
> > > kern.timecounter.tc.HPET.frequency: 14318180
> > > kern.timecounter.tc.HPET.counter: 2305610093
> > > kern.timecounter.tc.HPET.mask: 4294967295
> > > kern.eventtimer.periodic: 0
> > > kern.eventtimer.timer: HPET
> > > kern.eventtimer.idletick: 0
> > > kern.eventtimer.singlemul: 2
> > > kern.eventtimer.choice: HPET(450) HPET1(440) HPET2(440) LAPIC(400) i8254(100) RTC(0)
> > > kern.eventtimer.et.i8254.quality: 100
> > > kern.eventtimer.et.i8254.frequency: 1193182
> > > kern.eventtimer.et.i8254.flags: 1
> > > kern.eventtimer.et.RTC.quality: 0
> > > kern.eventtimer.et.RTC.frequency: 32768
> > > kern.eventtimer.et.RTC.flags: 17
> > > kern.eventtimer.et.HPET2.quality: 440
> > > kern.eventtimer.et.HPET2.frequency: 14318180
> > > kern.eventtimer.et.HPET2.flags: 3
> > > kern.eventtimer.et.HPET1.quality: 440
> > > kern.eventtimer.et.HPET1.frequency: 14318180
> > > kern.eventtimer.et.HPET1.flags: 3
> > > kern.eventtimer.et.HPET.quality: 450
> > > kern.eventtimer.et.HPET.frequency: 14318180
> > > kern.eventtimer.et.HPET.flags: 3
> > > kern.eventtimer.et.LAPIC.quality: 400
> > > kern.eventtimer.et.LAPIC.frequency: 99749970
> > > kern.eventtimer.et.LAPIC.flags: 15
> > > $
> > >
> > > Any hints ?
> > >
> > > Thank you,
> > > Hrant
> > >
> > Very strange, I've never seen behavior like that. You're using ntpd
> > from ports, is it the latest version?
> >
> Yes, it's 4.2.8p9_1 from ports.
>
> >
> > How are you feeding the PPS signal to the system? Do you know how wide
> > the PPS pulse is? I'm wondering if the driver is occasionally missing
> > an edge of a narrow pulse, although an occasional bad measurement
> > should get weeded out by ntpd's refclock median filter.
> > If the pulse is wider than a few microseconds the whole theory falls
> > apart anyway.
> >
> Pulse width is 100 ms, receiver is Garmin GPS 18x LVC. Actually I've
> replaced the receiver as well. The cable is too long, 12-13 meters, and
> there were badformat (I guess CRC) errors when it was set up 4-5 years
> ago. I've used CAT5 cable with the PPS and Rx wires twisted to ground,
> and a 74LS245 bus driver close to the GPS receiver to amplify the
> signal. It's not a real amplifier, but it has worked fine there for
> years and those errors are gone.
> There are also a few
> kernel reports TIME_ERROR: 0x2307: PPS Time Sync wanted but PPS Jitter exceeded
> errors per hour in the logs, so it looks like the signal is not okay
> anyway.
>
> >
> > Anyway, I'm a bit focused on the PPS because there were changes to the
> > serial (uart) PPS capture between 8.4 and 10.x, and I'm responsible for
> > some of them. :)
> >
> Now that you ask, I'm starting to suspect PPS delivery to the uart
> again (cable and amplifier), but I can't understand how the 100ppm
> error fits into that.
>
> Thank you,
> Hrant

Hmm. But the one part of the system that didn't change (even if it was
a bit bad all along) was delivery of the PPS signal. Maybe the uart on
the new hardware is more sensitive to the bad signaling.

If you have a usb-serial adapter available, you could try using it for
the PPS instead of the uart port on the new motherboard. A USB serial
adapter performs surprisingly well for PPS input, with a fixed latency
that averages around 60 microseconds and fairly small jitter.

The uart(4) manpage now has some information on configuring PPS inputs
for traditional motherboard-style rs232, and ucom(4) has the info for
usb-serial adapters.

-- Ian
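
For anyone wanting to check their own logs for the same symptom, here is
a minimal sketch that scans loopstats lines like the ones quoted above for
sudden frequency steps. It assumes the usual ntpd loopstats column order
(MJD, seconds past midnight, offset in s, frequency in ppm, jitter in s,
wander in ppm, time constant). The names LoopSample, parse_loopstats,
find_freq_steps and drift_ppm, and the 10 ppm threshold, are made up for
illustration and are not part of ntpd. It only compares adjacent samples
and ignores MJD rollover.

```python
from typing import List, NamedTuple, Tuple


class LoopSample(NamedTuple):
    mjd: int            # Modified Julian Day
    seconds: float      # seconds past midnight
    offset_s: float     # clock offset, seconds
    freq_ppm: float     # frequency offset, ppm
    jitter_s: float     # RMS jitter, seconds
    wander_ppm: float   # frequency wander, ppm
    time_const: int     # clock discipline time constant


def parse_loopstats(text: str) -> List[LoopSample]:
    """Parse loopstats text, skipping lines that do not have 7 fields."""
    samples = []
    for line in text.splitlines():
        f = line.split()
        if len(f) != 7:
            continue  # e.g. "[snip]" or blank lines
        samples.append(LoopSample(int(f[0]), float(f[1]), float(f[2]),
                                  float(f[3]), float(f[4]), float(f[5]),
                                  int(f[6])))
    return samples


def find_freq_steps(samples: List[LoopSample],
                    threshold_ppm: float = 10.0) -> List[Tuple[float, float]]:
    """Return (seconds, delta_ppm) wherever frequency jumps between samples."""
    steps = []
    for prev, cur in zip(samples, samples[1:]):
        delta = cur.freq_ppm - prev.freq_ppm
        if abs(delta) >= threshold_ppm:
            steps.append((cur.seconds, delta))
    return steps


def drift_ppm(a: LoopSample, b: LoopSample) -> float:
    """Slope of the reported offset between two samples, in ppm."""
    return (b.offset_s - a.offset_s) / (b.seconds - a.seconds) * 1e6


# A few rows copied from the excerpt in this thread.
EXCERPT = """\
57734 37656.526 -0.000000789 0.120 0.001081214 0.000185 4
57734 37672.526 -0.000398921 100.120 0.000154630 35.355339 4
57734 37688.526 -0.001941140 100.120 0.000188374 33.071891 4
57734 37912.526 -0.015822296 100.120 0.000078249 12.987196 4
57734 37928.525 -0.016119704 0.122 0.000103405 37.383615 4
"""

samples = parse_loopstats(EXCERPT)
steps = find_freq_steps(samples)
drift = drift_ppm(samples[1], samples[2])

for seconds, delta in steps:
    print(f"{seconds:.3f}s: frequency stepped by {delta:+.3f} ppm")
print(f"offset slope during the excursion: {drift:.1f} ppm")
```

On these rows the offset slope during the excursion comes out a little
under 100 ppm in magnitude, which is at least consistent with the size of
the reported frequency step.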