Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 12 Dec 2016 23:04:08 +0400 (AMT)
From:      Hrant Dadivanyan <hrant@dadivanyan.net>
To:        Ian Lepore <ian@freebsd.org>
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: system time instability
Message-ID:  <E1cGVtQ-000Acm-7c@pandora.amnic.net>
In-Reply-To: <1481556581.1889.322.camel@freebsd.org>

next in thread | previous in thread | raw e-mail | index | archive | help
[ Charset ISO-8859-1 converted... ]
> On Mon, 2016-12-12 at 17:23 +0400, Hrant Dadivanyan wrote:
> > Hello,
> > 
> > After upgrade of stratum 1 ntp server hardware from a Via EPIA Mini-
> > ITX
> > to Supermicro PDSBM-LN2 and OS from 8.4/i386 to 10.3-RELEASE-
> > p12/amd64 it
> > starts to work unstable. Most of the time it keeps time pretty well
> > with
> > offset less than 1-2 us, but once a few hours pll frequency jumps,
> > then
> > clock drifts. After passing calibration interval time (256s)
> > frequency
> > returns back to normal, then, after appropriate time, clock
> > stabilizes
> > again. Excerpt from loopstats:
> > 57734 37624.525 -0.000000955 0.120 0.000000588 0.000211 4
> > 57734 37640.526 0.000000319 0.120 0.000000506 0.000198 4
> > 57734 37656.526 -0.000000789 0.120 0.001081214 0.000185 4
> > 57734 37672.526 -0.000398921 100.120 0.000154630 35.355339 4
> > 57734 37688.526 -0.001941140 100.120 0.000188374 33.071891 4
> > 57734 37704.525 -0.003389196 100.120 0.000177488 30.935922 4
> > 57734 37720.525 -0.004745689 100.120 0.000166147 28.937905 4
> > 57734 37736.525 -0.006022007 100.120 0.000156269 27.068931 4
> > 57734 37752.526 -0.007220430 100.120 0.000146663 25.320667 4
> > 57734 37768.526 -0.008343331 100.120 0.000137805 23.685315 4
> > 57734 37784.525 -0.009399651 100.120 0.000129406 22.155583 4
> > 57734 37800.525 -0.010391390 100.120 0.000121937 20.724651 4
> > 57734 37816.526 -0.011320293 100.120 0.000114053 19.386136 4
> > 57734 37832.526 -0.012194902 100.120 0.000107191 18.134069 4
> > 57734 37848.526 -0.013013037 100.120 0.000100035 16.962869 4
> > 57734 37864.526 -0.013783932 100.120 0.000094497 15.867311 4
> > 57734 37880.526 -0.014507271 100.120 0.000088691 14.842510 4
> > 57734 37896.525 -0.015184384 100.120 0.000083266 13.883897 4
> > 57734 37912.526 -0.015822296 100.120 0.000078249 12.987196 4
> > 57734 37928.525 -0.016119704 0.122 0.000103405 37.383615 4
> > 57734 37944.526 -0.015132723 0.122 0.000120509 34.969170 4
> > 57734 37960.526 -0.014207941 0.122 0.000113355 32.710663 4
> > 57734 37976.525 -0.013339661 0.122 0.000107051 30.598023 4
> >  [snip]
> > 57734 40296.525 -0.000000337 0.122 0.000001621 0.002136 4
> > 57734 40312.526 -0.000000980 0.122 0.000001635 0.001998 4
> > 
> > The change in pll frequency is usually 100ppm, but not always. For
> > today,
> > for example, it's 29ppm once, 69.3ppm once and 100ppm three times.
> > 
> > Had tried three available timecounters: TSC-low, ACPI-fast, HPET. Had
> > changed eventtimer from HPET to LAPIC, kern.eventtimer.periodic from
> > 0 to 1.
> > All the changes are followed by service ntpd restart.
> > Also tried to change kern.hz from 1000 to 100.
> > Had even tried 11.0 on other, but the exactly same board. The
> > original
> > board has OCXO instead of quartz, but reconnecting the original
> > quartz
> > doesn't help.
> > 
> > Didn't try another hardware and/or OS yet, the server isn't easy
> > reachable,
> > but, in lack of better ideas, will definitely try.
> > 
> > 
> > Kernel has stripped all unused drivers and options plus PPS_SYNC,
> > then
> > FFCLOCK added. All the additions:
> > options         IPSEC
> > options         GEOM_ELI
> > options         PPS_SYNC
> > options         FFCLOCK
> > 
> > device          crypto
> > device          enc
> > device          pf
> > device          pflog
> > device          smbus
> > device          ichsmb
> > device          smb
> > device          coretemp
> > device          cpuctl
> > device          nvram
> > device          smbios
> > device          ipmi
> > device          aesni
> > 
> > The relevant part of ntp.conf:
> > fudge  127.127.20.0 time2 0.6 flag1 1 flag2 0 flag3 1
> > server 127.127.20.0 mode 2 minpoll 4 prefer
> > server <external_server>   minpoll 8 iburst
> > restrict default limited kod nomodify notrap nopeer noquery
> > 
> > rc.conf:
> > ntpd_program="/usr/local/sbin/ntpd"
> > ntpd_config="/etc/ntpd.conf"
> > ntpd_flags="-N -p /var/run/ntpd.pid -f /var/db/ntpd.drift"
> > ntpd_sync_on_start="YES"
> > 
> > sysctl.conf (this change is also seems irrelevant, rebooting without
> > this
> > frequency correction changes nothing in the behaviour):
> > machdep.tsc_freq=2194498500     # pll freq offset change from 21.678
> > to 0.120ppm
> > 
> > $ sysctl kern.hz kern.timecounter kern.eventtimer
> > kern.hz: 1000
> > kern.timecounter.tsc_shift: 1
> > kern.timecounter.smp_tsc_adjust: 0
> > kern.timecounter.smp_tsc: 1
> > kern.timecounter.invariant_tsc: 1
> > kern.timecounter.fast_gettime: 1
> > kern.timecounter.tick: 1
> > kern.timecounter.choice: TSC-low(1000) ACPI-fast(900) i8254(0)
> > HPET(950) dummy(-1000000)
> > kern.timecounter.hardware: TSC-low
> > kern.timecounter.alloweddeviation: 5
> > kern.timecounter.stepwarnings: 0
> > kern.timecounter.tc.TSC-low.quality: 1000
> > kern.timecounter.tc.TSC-low.frequency: 1097249250
> > kern.timecounter.tc.TSC-low.counter: 2359573202
> > kern.timecounter.tc.TSC-low.mask: 4294967295
> > kern.timecounter.tc.ACPI-fast.quality: 900
> > kern.timecounter.tc.ACPI-fast.frequency: 3579545
> > kern.timecounter.tc.ACPI-fast.counter: 9238615
> > kern.timecounter.tc.ACPI-fast.mask: 16777215
> > kern.timecounter.tc.i8254.quality: 0
> > kern.timecounter.tc.i8254.frequency: 1193182
> > kern.timecounter.tc.i8254.counter: 9906
> > kern.timecounter.tc.i8254.mask: 65535
> > kern.timecounter.tc.HPET.quality: 950
> > kern.timecounter.tc.HPET.frequency: 14318180
> > kern.timecounter.tc.HPET.counter: 2305610093
> > kern.timecounter.tc.HPET.mask: 4294967295
> > kern.eventtimer.periodic: 0
> > kern.eventtimer.timer: HPET
> > kern.eventtimer.idletick: 0
> > kern.eventtimer.singlemul: 2
> > kern.eventtimer.choice: HPET(450) HPET1(440) HPET2(440) LAPIC(400)
> > i8254(100) RTC(0)
> > kern.eventtimer.et.i8254.quality: 100
> > kern.eventtimer.et.i8254.frequency: 1193182
> > kern.eventtimer.et.i8254.flags: 1
> > kern.eventtimer.et.RTC.quality: 0
> > kern.eventtimer.et.RTC.frequency: 32768
> > kern.eventtimer.et.RTC.flags: 17
> > kern.eventtimer.et.HPET2.quality: 440
> > kern.eventtimer.et.HPET2.frequency: 14318180
> > kern.eventtimer.et.HPET2.flags: 3
> > kern.eventtimer.et.HPET1.quality: 440
> > kern.eventtimer.et.HPET1.frequency: 14318180
> > kern.eventtimer.et.HPET1.flags: 3
> > kern.eventtimer.et.HPET.quality: 450
> > kern.eventtimer.et.HPET.frequency: 14318180
> > kern.eventtimer.et.HPET.flags: 3
> > kern.eventtimer.et.LAPIC.quality: 400
> > kern.eventtimer.et.LAPIC.frequency: 99749970
> > kern.eventtimer.et.LAPIC.flags: 15
> > $ 
> > 
> > Any hints ?
> > 
> > Thank you,
> > Hrant
> > 
> 
> Very strange, I've never seen behavior like that.  You're using ntpd
> from ports, is it the latest version?
> 

Yes, it's 4.2.8p9_1 from ports.

> How are you feeding the PPS signal to the system?  Do you know how wide
> the PPS pulse is?  I'm wondering if the driver is occasionally missing
> an edge of a narrow pulse, although an occasional bad measurement
> should get weeded out by ntpd's refclock median filter.  If the pulse
> is wider than a few microseconds the whole theory falls apart anyway.
> 

Pulse width is 100 ms, receiver is Garmin GPS 18x LVC. Actually I've
replaced reciver as well. The cable is too long 12-13 meters and there was
badformat (I guess CRC) errors, when setup 4-5 years ago. I've used CAT5
cable with PPS and Rx wires twised to ground and 74LS245 bus driver close
to GPS receiver to amplify signal. It's not a real amplifier, but works
fine there for years and these errors gone.
There are also a few per a hour:
kernel reports TIME_ERROR: 0x2307: PPS Time Sync wanted but PPS Jitter exceeded
errors in the logs, so it looks like the signal is not okay anyway.

> Anyway, I'm a bit focused on the PPS because there were changes to the
> serial (uart) PPS capture between 8.4 and 10.x, and I'm responsible for
> some of them. :)
> 

Now, when you ask, I start to suspect PPS delivery to uart again - cable
and amplifier, but can't understand how the 100ppm error fits into that.

Thank you,
Hrant

> -- Ian
> 

-- 
Hrant Dadivanyan (aka Ran d'Adi)		hrant(at)dadivanyan.net
/* "Feci quod potui, faciant meliora potentes." */       ran(at)psg.com



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?E1cGVtQ-000Acm-7c>