From owner-freebsd-hackers@freebsd.org Mon Dec 12 19:04:10 2016 Return-Path: Delivered-To: freebsd-hackers@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 8FD7FC72E73 for ; Mon, 12 Dec 2016 19:04:10 +0000 (UTC) (envelope-from hrant@dadivanyan.net) Received: from pandora.amnic.net (pandora.amnic.net [IPv6:2001:67c:21c:a610::11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 4FE0615EF; Mon, 12 Dec 2016 19:04:10 +0000 (UTC) (envelope-from hrant@dadivanyan.net) Received: from ran by pandora.amnic.net with local (Exim 4.87 (FreeBSD)) (envelope-from ) id 1cGVtQ-000Acm-7c; Mon, 12 Dec 2016 23:04:08 +0400 Subject: Re: system time instability In-Reply-To: <1481556581.1889.322.camel@freebsd.org> To: Ian Lepore Date: Mon, 12 Dec 2016 23:04:08 +0400 (AMT) From: Hrant Dadivanyan CC: freebsd-hackers@freebsd.org X-PGP: https://amnic.net/pgpkeys/hrant.asc X-NCC-RegID: am.isoc X-Mailer: ELM [version 2.4ME+ PL127 (25)] MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset="UTF-8" Message-Id: X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 12 Dec 2016 19:04:10 -0000 [ Charset ISO-8859-1 converted... ] > On Mon, 2016-12-12 at 17:23 +0400, Hrant Dadivanyan wrote: > > Hello, > > > > After upgrade of stratum 1 ntp server hardware from a Via EPIA Mini- > > ITX > > to Supermicro PDSBM-LN2 and OS from 8.4/i386 to 10.3-RELEASE- > > p12/amd64 it > > starts to work unstable. Most of the time it keeps time pretty well > > with > > offset less than 1-2 us, but once a few hours pll frequency jumps, > > then > > clock drifts. After passing calibration interval time (256s) > > frequency > > returns back to normal, then, after appropriate time, clock > > stabilizes > > again. Excerpt from loopstats: > > 57734 37624.525 -0.000000955 0.120 0.000000588 0.000211 4 > > 57734 37640.526 0.000000319 0.120 0.000000506 0.000198 4 > > 57734 37656.526 -0.000000789 0.120 0.001081214 0.000185 4 > > 57734 37672.526 -0.000398921 100.120 0.000154630 35.355339 4 > > 57734 37688.526 -0.001941140 100.120 0.000188374 33.071891 4 > > 57734 37704.525 -0.003389196 100.120 0.000177488 30.935922 4 > > 57734 37720.525 -0.004745689 100.120 0.000166147 28.937905 4 > > 57734 37736.525 -0.006022007 100.120 0.000156269 27.068931 4 > > 57734 37752.526 -0.007220430 100.120 0.000146663 25.320667 4 > > 57734 37768.526 -0.008343331 100.120 0.000137805 23.685315 4 > > 57734 37784.525 -0.009399651 100.120 0.000129406 22.155583 4 > > 57734 37800.525 -0.010391390 100.120 0.000121937 20.724651 4 > > 57734 37816.526 -0.011320293 100.120 0.000114053 19.386136 4 > > 57734 37832.526 -0.012194902 100.120 0.000107191 18.134069 4 > > 57734 37848.526 -0.013013037 100.120 0.000100035 16.962869 4 > > 57734 37864.526 -0.013783932 100.120 0.000094497 15.867311 4 > > 57734 37880.526 -0.014507271 100.120 0.000088691 14.842510 4 > > 57734 37896.525 -0.015184384 100.120 0.000083266 13.883897 4 > > 57734 37912.526 -0.015822296 100.120 0.000078249 12.987196 4 > > 57734 37928.525 -0.016119704 0.122 0.000103405 37.383615 4 > > 57734 37944.526 -0.015132723 0.122 0.000120509 34.969170 4 > > 57734 37960.526 -0.014207941 0.122 0.000113355 32.710663 4 > > 57734 37976.525 -0.013339661 0.122 0.000107051 30.598023 4 > >  [snip] > > 57734 40296.525 -0.000000337 0.122 0.000001621 0.002136 4 > > 57734 40312.526 -0.000000980 0.122 0.000001635 0.001998 4 > > > > The change in pll frequency is usually 100ppm, but not always. For > > today, > > for example, it's 29ppm once, 69.3ppm once and 100ppm three times. > > > > Had tried three available timecounters: TSC-low, ACPI-fast, HPET. Had > > changed eventtimer from HPET to LAPIC, kern.eventtimer.periodic from > > 0 to 1. > > All the changes are followed by service ntpd restart. > > Also tried to change kern.hz from 1000 to 100. > > Had even tried 11.0 on other, but the exactly same board. The > > original > > board has OCXO instead of quartz, but reconnecting the original > > quartz > > doesn't help. > > > > Didn't try another hardware and/or OS yet, the server isn't easy > > reachable, > > but, in lack of better ideas, will definitely try. > > > > > > Kernel has stripped all unused drivers and options plus PPS_SYNC, > > then > > FFCLOCK added. All the additions: > > options         IPSEC > > options         GEOM_ELI > > options         PPS_SYNC > > options         FFCLOCK > > > > device          crypto > > device          enc > > device          pf > > device          pflog > > device          smbus > > device          ichsmb > > device          smb > > device          coretemp > > device          cpuctl > > device          nvram > > device          smbios > > device          ipmi > > device          aesni > > > > The relevant part of ntp.conf: > > fudge  127.127.20.0 time2 0.6 flag1 1 flag2 0 flag3 1 > > server 127.127.20.0 mode 2 minpoll 4 prefer > > server    minpoll 8 iburst > > restrict default limited kod nomodify notrap nopeer noquery > > > > rc.conf: > > ntpd_program="/usr/local/sbin/ntpd" > > ntpd_config="/etc/ntpd.conf" > > ntpd_flags="-N -p /var/run/ntpd.pid -f /var/db/ntpd.drift" > > ntpd_sync_on_start="YES" > > > > sysctl.conf (this change is also seems irrelevant, rebooting without > > this > > frequency correction changes nothing in the behaviour): > > machdep.tsc_freq=2194498500     # pll freq offset change from 21.678 > > to 0.120ppm > > > > $ sysctl kern.hz kern.timecounter kern.eventtimer > > kern.hz: 1000 > > kern.timecounter.tsc_shift: 1 > > kern.timecounter.smp_tsc_adjust: 0 > > kern.timecounter.smp_tsc: 1 > > kern.timecounter.invariant_tsc: 1 > > kern.timecounter.fast_gettime: 1 > > kern.timecounter.tick: 1 > > kern.timecounter.choice: TSC-low(1000) ACPI-fast(900) i8254(0) > > HPET(950) dummy(-1000000) > > kern.timecounter.hardware: TSC-low > > kern.timecounter.alloweddeviation: 5 > > kern.timecounter.stepwarnings: 0 > > kern.timecounter.tc.TSC-low.quality: 1000 > > kern.timecounter.tc.TSC-low.frequency: 1097249250 > > kern.timecounter.tc.TSC-low.counter: 2359573202 > > kern.timecounter.tc.TSC-low.mask: 4294967295 > > kern.timecounter.tc.ACPI-fast.quality: 900 > > kern.timecounter.tc.ACPI-fast.frequency: 3579545 > > kern.timecounter.tc.ACPI-fast.counter: 9238615 > > kern.timecounter.tc.ACPI-fast.mask: 16777215 > > kern.timecounter.tc.i8254.quality: 0 > > kern.timecounter.tc.i8254.frequency: 1193182 > > kern.timecounter.tc.i8254.counter: 9906 > > kern.timecounter.tc.i8254.mask: 65535 > > kern.timecounter.tc.HPET.quality: 950 > > kern.timecounter.tc.HPET.frequency: 14318180 > > kern.timecounter.tc.HPET.counter: 2305610093 > > kern.timecounter.tc.HPET.mask: 4294967295 > > kern.eventtimer.periodic: 0 > > kern.eventtimer.timer: HPET > > kern.eventtimer.idletick: 0 > > kern.eventtimer.singlemul: 2 > > kern.eventtimer.choice: HPET(450) HPET1(440) HPET2(440) LAPIC(400) > > i8254(100) RTC(0) > > kern.eventtimer.et.i8254.quality: 100 > > kern.eventtimer.et.i8254.frequency: 1193182 > > kern.eventtimer.et.i8254.flags: 1 > > kern.eventtimer.et.RTC.quality: 0 > > kern.eventtimer.et.RTC.frequency: 32768 > > kern.eventtimer.et.RTC.flags: 17 > > kern.eventtimer.et.HPET2.quality: 440 > > kern.eventtimer.et.HPET2.frequency: 14318180 > > kern.eventtimer.et.HPET2.flags: 3 > > kern.eventtimer.et.HPET1.quality: 440 > > kern.eventtimer.et.HPET1.frequency: 14318180 > > kern.eventtimer.et.HPET1.flags: 3 > > kern.eventtimer.et.HPET.quality: 450 > > kern.eventtimer.et.HPET.frequency: 14318180 > > kern.eventtimer.et.HPET.flags: 3 > > kern.eventtimer.et.LAPIC.quality: 400 > > kern.eventtimer.et.LAPIC.frequency: 99749970 > > kern.eventtimer.et.LAPIC.flags: 15 > > $  > > > > Any hints ? > > > > Thank you, > > Hrant > > > > Very strange, I've never seen behavior like that.  You're using ntpd > from ports, is it the latest version? > Yes, it's 4.2.8p9_1 from ports. > How are you feeding the PPS signal to the system?  Do you know how wide > the PPS pulse is?  I'm wondering if the driver is occasionally missing > an edge of a narrow pulse, although an occasional bad measurement > should get weeded out by ntpd's refclock median filter.  If the pulse > is wider than a few microseconds the whole theory falls apart anyway. > Pulse width is 100 ms, receiver is Garmin GPS 18x LVC. Actually I've replaced reciver as well. The cable is too long 12-13 meters and there was badformat (I guess CRC) errors, when setup 4-5 years ago. I've used CAT5 cable with PPS and Rx wires twised to ground and 74LS245 bus driver close to GPS receiver to amplify signal. It's not a real amplifier, but works fine there for years and these errors gone. There are also a few per a hour: kernel reports TIME_ERROR: 0x2307: PPS Time Sync wanted but PPS Jitter exceeded errors in the logs, so it looks like the signal is not okay anyway. > Anyway, I'm a bit focused on the PPS because there were changes to the > serial (uart) PPS capture between 8.4 and 10.x, and I'm responsible for > some of them. :) > Now, when you ask, I start to suspect PPS delivery to uart again - cable and amplifier, but can't understand how the 100ppm error fits into that. Thank you, Hrant > -- Ian > -- Hrant Dadivanyan (aka Ran d'Adi) hrant(at)dadivanyan.net /* "Feci quod potui, faciant meliora potentes." */ ran(at)psg.com