Date:      Mon, 12 Dec 2016 12:34:15 -0700
From:      Ian Lepore <ian@freebsd.org>
To:        Hrant Dadivanyan <hrant@dadivanyan.net>
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: system time instability
Message-ID:  <1481571255.1889.329.camel@freebsd.org>
In-Reply-To: <E1cGVtQ-000Acm-7c@pandora.amnic.net>
References:  <E1cGVtQ-000Acm-7c@pandora.amnic.net>

On Mon, 2016-12-12 at 23:04 +0400, Hrant Dadivanyan wrote:
> > 
> > On Mon, 2016-12-12 at 17:23 +0400, Hrant Dadivanyan wrote:
> > > 
> > > Hello,
> > > 
> > > After upgrading the stratum 1 ntp server hardware from a Via EPIA
> > > Mini-ITX to a Supermicro PDSBM-LN2, and the OS from 8.4/i386 to
> > > 10.3-RELEASE-p12/amd64, it has started to work unstably. Most of the
> > > time it keeps time pretty well, with an offset of less than 1-2 us,
> > > but once every few hours the pll frequency jumps and the clock
> > > drifts. After the calibration interval (256 s) passes, the frequency
> > > returns to normal and, after an appropriate time, the clock
> > > stabilizes again. Excerpt from loopstats:
> > > 57734 37624.525 -0.000000955 0.120 0.000000588 0.000211 4
> > > 57734 37640.526 0.000000319 0.120 0.000000506 0.000198 4
> > > 57734 37656.526 -0.000000789 0.120 0.001081214 0.000185 4
> > > 57734 37672.526 -0.000398921 100.120 0.000154630 35.355339 4
> > > 57734 37688.526 -0.001941140 100.120 0.000188374 33.071891 4
> > > 57734 37704.525 -0.003389196 100.120 0.000177488 30.935922 4
> > > 57734 37720.525 -0.004745689 100.120 0.000166147 28.937905 4
> > > 57734 37736.525 -0.006022007 100.120 0.000156269 27.068931 4
> > > 57734 37752.526 -0.007220430 100.120 0.000146663 25.320667 4
> > > 57734 37768.526 -0.008343331 100.120 0.000137805 23.685315 4
> > > 57734 37784.525 -0.009399651 100.120 0.000129406 22.155583 4
> > > 57734 37800.525 -0.010391390 100.120 0.000121937 20.724651 4
> > > 57734 37816.526 -0.011320293 100.120 0.000114053 19.386136 4
> > > 57734 37832.526 -0.012194902 100.120 0.000107191 18.134069 4
> > > 57734 37848.526 -0.013013037 100.120 0.000100035 16.962869 4
> > > 57734 37864.526 -0.013783932 100.120 0.000094497 15.867311 4
> > > 57734 37880.526 -0.014507271 100.120 0.000088691 14.842510 4
> > > 57734 37896.525 -0.015184384 100.120 0.000083266 13.883897 4
> > > 57734 37912.526 -0.015822296 100.120 0.000078249 12.987196 4
> > > 57734 37928.525 -0.016119704 0.122 0.000103405 37.383615 4
> > > 57734 37944.526 -0.015132723 0.122 0.000120509 34.969170 4
> > > 57734 37960.526 -0.014207941 0.122 0.000113355 32.710663 4
> > > 57734 37976.525 -0.013339661 0.122 0.000107051 30.598023 4
> > >  [snip]
> > > 57734 40296.525 -0.000000337 0.122 0.000001621 0.002136 4
> > > 57734 40312.526 -0.000000980 0.122 0.000001635 0.001998 4
> > > 
> > > The change in pll frequency is usually 100ppm, but not always. Today,
> > > for example, it was 29ppm once, 69.3ppm once and 100ppm three times.
> > > 
> > > I have tried the three available timecounters: TSC-low, ACPI-fast and
> > > HPET. I have changed the eventtimer from HPET to LAPIC, and
> > > kern.eventtimer.periodic from 0 to 1. Each change was followed by a
> > > service ntpd restart. I also tried changing kern.hz from 1000 to 100.
> > > I even tried 11.0 on another board of exactly the same model. The
> > > original board has an OCXO instead of the quartz crystal, but
> > > reconnecting the original quartz doesn't help.
> > >
> > > I haven't tried other hardware and/or another OS yet, since the server
> > > isn't easily reachable, but for lack of better ideas I definitely will.
> > > 
> > > 
> > > The kernel has all unused drivers and options stripped, with PPS_SYNC
> > > and then FFCLOCK added. All the additions:
> > > options         IPSEC
> > > options         GEOM_ELI
> > > options         PPS_SYNC
> > > options         FFCLOCK
> > > 
> > > device          crypto
> > > device          enc
> > > device          pf
> > > device          pflog
> > > device          smbus
> > > device          ichsmb
> > > device          smb
> > > device          coretemp
> > > device          cpuctl
> > > device          nvram
> > > device          smbios
> > > device          ipmi
> > > device          aesni
> > > 
> > > The relevant part of ntp.conf:
> > > fudge  127.127.20.0 time2 0.6 flag1 1 flag2 0 flag3 1
> > > server 127.127.20.0 mode 2 minpoll 4 prefer
> > > server <external_server>   minpoll 8 iburst
> > > restrict default limited kod nomodify notrap nopeer noquery
> > > 
> > > rc.conf:
> > > ntpd_program="/usr/local/sbin/ntpd"
> > > ntpd_config="/etc/ntpd.conf"
> > > ntpd_flags="-N -p /var/run/ntpd.pid -f /var/db/ntpd.drift"
> > > ntpd_sync_on_start="YES"
> > > 
> > > sysctl.conf (this change also seems irrelevant; rebooting without
> > > this frequency correction changes nothing in the behaviour):
> > > machdep.tsc_freq=2194498500     # pll freq offset change from 21.678 to 0.120ppm
> > > 
> > > $ sysctl kern.hz kern.timecounter kern.eventtimer
> > > kern.hz: 1000
> > > kern.timecounter.tsc_shift: 1
> > > kern.timecounter.smp_tsc_adjust: 0
> > > kern.timecounter.smp_tsc: 1
> > > kern.timecounter.invariant_tsc: 1
> > > kern.timecounter.fast_gettime: 1
> > > kern.timecounter.tick: 1
> > > kern.timecounter.choice: TSC-low(1000) ACPI-fast(900) i8254(0) HPET(950) dummy(-1000000)
> > > kern.timecounter.hardware: TSC-low
> > > kern.timecounter.alloweddeviation: 5
> > > kern.timecounter.stepwarnings: 0
> > > kern.timecounter.tc.TSC-low.quality: 1000
> > > kern.timecounter.tc.TSC-low.frequency: 1097249250
> > > kern.timecounter.tc.TSC-low.counter: 2359573202
> > > kern.timecounter.tc.TSC-low.mask: 4294967295
> > > kern.timecounter.tc.ACPI-fast.quality: 900
> > > kern.timecounter.tc.ACPI-fast.frequency: 3579545
> > > kern.timecounter.tc.ACPI-fast.counter: 9238615
> > > kern.timecounter.tc.ACPI-fast.mask: 16777215
> > > kern.timecounter.tc.i8254.quality: 0
> > > kern.timecounter.tc.i8254.frequency: 1193182
> > > kern.timecounter.tc.i8254.counter: 9906
> > > kern.timecounter.tc.i8254.mask: 65535
> > > kern.timecounter.tc.HPET.quality: 950
> > > kern.timecounter.tc.HPET.frequency: 14318180
> > > kern.timecounter.tc.HPET.counter: 2305610093
> > > kern.timecounter.tc.HPET.mask: 4294967295
> > > kern.eventtimer.periodic: 0
> > > kern.eventtimer.timer: HPET
> > > kern.eventtimer.idletick: 0
> > > kern.eventtimer.singlemul: 2
> > > kern.eventtimer.choice: HPET(450) HPET1(440) HPET2(440) LAPIC(400) i8254(100) RTC(0)
> > > kern.eventtimer.et.i8254.quality: 100
> > > kern.eventtimer.et.i8254.frequency: 1193182
> > > kern.eventtimer.et.i8254.flags: 1
> > > kern.eventtimer.et.RTC.quality: 0
> > > kern.eventtimer.et.RTC.frequency: 32768
> > > kern.eventtimer.et.RTC.flags: 17
> > > kern.eventtimer.et.HPET2.quality: 440
> > > kern.eventtimer.et.HPET2.frequency: 14318180
> > > kern.eventtimer.et.HPET2.flags: 3
> > > kern.eventtimer.et.HPET1.quality: 440
> > > kern.eventtimer.et.HPET1.frequency: 14318180
> > > kern.eventtimer.et.HPET1.flags: 3
> > > kern.eventtimer.et.HPET.quality: 450
> > > kern.eventtimer.et.HPET.frequency: 14318180
> > > kern.eventtimer.et.HPET.flags: 3
> > > kern.eventtimer.et.LAPIC.quality: 400
> > > kern.eventtimer.et.LAPIC.frequency: 99749970
> > > kern.eventtimer.et.LAPIC.flags: 15
> > > $ 
> > > 
> > > Any hints ?
> > > 
> > > Thank you,
> > > Hrant
> > > 
> > Very strange, I've never seen behavior like that.  You're using ntpd
> > from ports, is it the latest version?
> > 
> Yes, it's 4.2.8p9_1 from ports.
> 
> > 
> > How are you feeding the PPS signal to the system?  Do you know how wide
> > the PPS pulse is?  I'm wondering if the driver is occasionally missing
> > an edge of a narrow pulse, although an occasional bad measurement
> > should get weeded out by ntpd's refclock median filter.  If the pulse
> > is wider than a few microseconds the whole theory falls apart anyway.
> > 
> Pulse width is 100 ms, receiver is a Garmin GPS 18x LVC. Actually I've
> replaced the receiver as well. The cable is too long, 12-13 meters, and
> there were badformat (I guess CRC) errors when it was set up 4-5 years
> ago. I've used CAT5 cable with the PPS and Rx wires twisted to ground,
> and a 74LS245 bus driver close to the GPS receiver to amplify the
> signal. It's not a real amplifier, but it has worked fine there for
> years and these errors are gone.
> There are also a few per hour of:
> kernel reports TIME_ERROR: 0x2307: PPS Time Sync wanted but PPS Jitter exceeded
> errors in the logs, so it looks like the signal is not okay anyway.
> 
> > 
> > Anyway, I'm a bit focused on the PPS because there were changes to the
> > serial (uart) PPS capture between 8.4 and 10.x, and I'm responsible for
> > some of them. :)
> > 
> Now that you ask, I start to suspect PPS delivery to the uart again -
> cable and amplifier - but I can't understand how the 100ppm error fits
> into that.
> 
> Thank you,
> Hrant

Hmm.  But the one part of the system that didn't change (even if it was
a bit bad all along) was delivery of the PPS signal.  Maybe the uart on
the new hardware is more sensitive to the bad signaling.

If you have a usb-serial adapter available, you could try using it for
the PPS instead of the uart port on the new motherboard.  A USB serial
adapter performs surprisingly well for PPS input, with a fixed latency
that averages around 60 microseconds and fairly small jitter.

The uart(4) manpage now has some information on configuring PPS inputs
for traditional motherboard-style rs232, and ucom(4) has the info for
usb-serial adapters.
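
To compare the behavior before and after moving the PPS input, it may
help to pull the frequency steps out of loopstats automatically rather
than by eye.  What follows is only a minimal sketch in Python, assuming
the usual loopstats column order (MJD, seconds past midnight, offset in
seconds, frequency in ppm, jitter, wander, time constant).  The function
name find_freq_steps and the 10 ppm threshold are arbitrary choices for
illustration, not anything ntpd provides.

```python
def find_freq_steps(lines, threshold_ppm=10.0):
    """Return (mjd, seconds, old_ppm, new_ppm) for each frequency jump.

    A jump is any change of at least threshold_ppm between two
    consecutive loopstats records. Lines that do not parse as
    loopstats records (for example "[snip]") are skipped.
    """
    steps = []
    prev_freq = None
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue
        try:
            mjd = int(fields[0])
            secs = float(fields[1])
            freq = float(fields[3])  # 4th column: frequency offset in ppm
        except ValueError:
            continue
        if prev_freq is not None and abs(freq - prev_freq) >= threshold_ppm:
            steps.append((mjd, secs, prev_freq, freq))
        prev_freq = freq
    return steps


# A few records copied from the loopstats excerpt quoted above.
sample = """\
57734 37656.526 -0.000000789 0.120 0.001081214 0.000185 4
57734 37672.526 -0.000398921 100.120 0.000154630 35.355339 4
57734 37912.526 -0.015822296 100.120 0.000078249 12.987196 4
57734 37928.525 -0.016119704 0.122 0.000103405 37.383615 4
""".splitlines()

for mjd, secs, old, new in find_freq_steps(sample):
    print(f"{mjd} {secs:.3f}: {old:.3f} -> {new:.3f} ppm")
```

On the sample records this reports two steps: the jump up to 100.120 ppm
at 37672.526 and the return to 0.122 ppm at 37928.525, 256 seconds
later, which matches the calibration interval mentioned in the original
report.  Running the same thing over a full day of loopstats for each
configuration would give a simple count of jumps to compare.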

-- Ian



