Date: Thu, 23 Mar 2017 15:38:33 +0100
From: "O. Hartmann"
To: Ian Lepore
Cc: "O. Hartmann", freebsd-current
Subject: Re: ntpd dies nightly on a server with jails
Message-ID: <20170323153833.75e1b013@freyja.zeit4.iv.bundesimmobilien.de>
In-Reply-To: <1489774815.40576.182.camel@freebsd.org>
References: <20170315071724.78bb0bdc@freyja.zeit4.iv.bundesimmobilien.de>
 <201703152012.v2FKCbvg078762@slippy.cwsent.com>
 <20170317180507.5c64fb26@thor.intern.walstatt.dynvpn.de>
 <1489774815.40576.182.camel@freebsd.org>
Organization: Walstatt

On Fri, 17 Mar 2017 12:20:15 -0600
Ian Lepore wrote:

> On Fri, 2017-03-17 at 18:05 +0100, O. Hartmann wrote:
> > On Wed, 15 Mar 2017 13:12:37 -0700
> > Cy Schubert wrote:
> >
> > > Hi O.Hartmann,
> > >
> > > I'll try to answer as much as I can in the noon hour I have left.
> > >
> > > In message <20170315071724.78bb0bdc@freyja.zeit4.iv.bundesimmobilien.de>,
> > > "O. Hartmann" writes:
> > > >
> > > > Running a host with several jails on recent CURRENT (12.0-CURRENT
> > > > #8 r315187: Sun Mar 12 11:22:38 CET 2017 amd64) gives me trouble
> > > > on a daily basis.
> > > >
> > > > The box is an older two-socket Fujitsu server equipped with two
> > > > four-core Intel(R) Xeon(R) CPU L5420 @ 2.50GHz.
> > > >
> > > > The box has several jails, each jail does NOT run service ntpd.
> > > > Each jail has its dedicated loopback, lo1 through lo5 (for the
> > > > moment), with dedicated IPs 127.0.1.1 - 127.0.5.1 (if this
> > > > matters; I believe not).
> > > >
> > > > The host itself has two main NICs, Broadcom based. bcm0 is
> > > > dedicated to the host, bcm1 is shared amongst the jails: each
> > > > jail has an IP bound to bcm1, via which the jails communicate
> > > > with the network.
> > > >
> > > > I try to capture log information via syslog, but FreeBSD's ntpd
> > > > seems to be very, very sparse with such information, converging
> > > > to null - I can't see anything suitable in the logs explaining
> > > > why ntpd dies almost every night, leaving the system with a wild
> > > > reset of time. Sometimes it is a gain of 6 hours, sometimes it
> > > > is only half an hour. I usually leave the box at 16:00 local
> > > > time and take care of it again at ~7 o'clock in the morning
> > > > local time.
> > >
> > > We will need to turn on debugging. Unfortunately debug code is not
> > > compiled into the binary. We have two options. You can either
> > > update src/usr.sbin/ntp/config.h to enable DEBUG or build the port
> > > (it's the exact same ntp) with the DEBUG option -- this is probably
> > > simpler. Then enable debug with -d and -D. -D increases verbosity.
> > > I just committed a debug option to both ntp ports to assist here.
> > >
> > > Next question: Do you see any indication of a core dump? I'd be
> > > interested in looking at it if possible.
> > >
> > > > When the clock is floating that wildly, in all cases ntpd isn't
> > > > running any more. I try to restart with options -g and -G to
> > > > adjust the time quickly at the beginning, which works fine.
> > >
> > > This is disconcerting.
> > > If your clock is floating wildly without ntpd running there are
> > > other issues that might be at play here. At most the clock might
> > > drift a little, maybe a minute or two a day but not by a lot.
> > > Does the drift cause your clocks to run fast or slow?
> > >
> > > > Apart from possible misconfigurations of the jails (I'm quite new
> > > > to jails and their pitfalls), I was wondering what causes ntpd
> > > > to die. I can't determine exactly the time of its death, so it
> > > > might be related to diurnal/periodic processes (I use only the
> > > > most vanilla configurations on periodic, except that the ZFS
> > > > scrubbing check is enabled).
> > >
> > > As I'm a little rushed for time, I didn't catch whether the jails
> > > themselves were also running ntpd... just thought I'd ask. I don't
> > > see how zfs scrubbing or any other periodic scripts could cause
> > > this.
> > >
> > > > I haven't had the chance to check whether the hardware is
> > > > completely all right, but from a superficial point of view there
> > > > is no issue with high gain of the internal clock or other
> > > > hardware issues.
> > >
> > > It's probably a good idea to check. I don't think that would cause
> > > ntpd any gas. I've seen RTC battery messages on my gear which
> > > haven't caused ntpd any problem. I have two machines which complain
> > > about the RTC battery being dead, where in fact I have replaced the
> > > batteries and the messages are still displayed at boot. I'm not
> > > sure if it's possible for a kernel to damage the RTC. In my case
> > > that doesn't cause ntpd any problems. It's probably good to check
> > > anyway.
> > >
> > > > If there are known issues with jails (the problem has occurred
> > > > since I started using them), advice is appreciated.
> > >
> > > Not that I know of.
> >
> > Just some strange news:
> >
> > I left the server the whole day with ntpd disabled and I didn't see
> > the RTC gain even one second, even when stressing the machine.
> >
> > But soon after restarting ntpd, I immediately noticed it was 30
> > minutes off! This morning, the discrepancy was almost 5 hours - it
> > looked more like a weird adjustment to a time base other than UTC.
> >
> > Over the weekend I'll leave the server with ntpd disabled and only
> > the RTC running. I have the strange feeling that something is
> > intentionally readjusting the ntpd time due to a misconfiguration or
> > a rogue ntp server in the X.CC.pool.ntp.org
>
> The rogue server theory is a bad one, unless you have configured just a
> single server in your ntp.conf and it is the rogue. Ntpd requires
> agreement among the set of configured servers, it will ignore outliers.

Past weekend, I had switched off ntpd and ran the server completely with the
onboard RTC. On Monday morning when I entered the office, the clock was in
synchronisation with the official time.

As usual, I update sources and buildworld. After a couple of builds over the
week and letting ntpd restart via rc.conf as usual after rebooting, I checked
over the past two days and I found the server always in a state of dissonant
clock.

The more curious part is that the clock is almost 6 hours behind UTC. I cannot
tell whether ntpd is still trying to adjust time to a foreign clock which
has another time reference. I checked the TZ and everything seems all right.

> It would help to have some actual data. What does ntpq -p show right
> after starting ntpd?
> Then a few minutes later, then again 10 minutes

[RESTART]

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 0.de.pool.ntp.o .POOL.          16 p    -   64    0    0.000    0.000   0.000
 1.de.pool.ntp.o .POOL.          16 p    -   64    0    0.000    0.000   0.000
 2.de.pool.ntp.o .POOL.          16 p    -   64    0    0.000    0.000   0.000
 3.de.pool.ntp.o .POOL.          16 p    -   64    0    0.000    0.000   0.000
 ptbtime1.ptb.de .INIT.          16 u    -   64    0    0.000    0.000   0.000
 ptbtime2.ptb.de .INIT.          16 u    -   64    0    0.000    0.000   0.000

[after 1 Minute]

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 0.de.pool.ntp.o .POOL.          16 p    -   64    0    0.000    0.000   0.000
 1.de.pool.ntp.o .POOL.          16 p    -   64    0    0.000    0.000   0.000
 2.de.pool.ntp.o .POOL.          16 p    -   64    0    0.000    0.000   0.000
 3.de.pool.ntp.o .POOL.          16 p    -   64    0    0.000    0.000   0.000
 ptbtime1.ptb.de .PTB.            1 u   34   64    1   16.931   -4.841   0.000
 ptbtime2.ptb.de .PTB.            1 u   34   64    1   18.273   -5.518   0.000
 fks.dan.net.uk  117.161.90.132   3 u   31   64    1   24.217   -3.904   0.000
 213.95.200.109  213.95.151.123   2 u   33   64    1   25.464   -2.449   0.000
 ns3.customer-re 192.53.103.108   2 u   35   64    1   23.905   -1.187   0.000
 ns1.blazing.de  213.172.96.14    2 u   36   64    1   17.045   -3.017   0.000
 ntp2.m-online.n 212.18.1.106     2 u   36   64    1   20.758   -2.693   0.000
 stratum2-3.NTP. 129.70.130.71    2 u   35   64    1   22.000   -3.800   0.000
 estoma.de       144.76.96.7      3 u   33   64    1    7.919   -3.182   0.000
 clint.blazing.d 213.172.96.14    2 u   34   64    1   17.642   -2.932   0.000
 news01.nierle.c 192.53.103.103   2 u   34   64    1   19.880   -3.750   0.000
 q.fu110.de      131.234.137.64   2 u   35   64    1   16.649   -6.037   0.000

[after ~10 Minutes]

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 0.de.pool.ntp.o .POOL.          16 p    -   64    0    0.000    0.000   0.000
 1.de.pool.ntp.o .POOL.          16 p    -   64    0    0.000    0.000   0.000
 2.de.pool.ntp.o .POOL.          16 p    -   64    0    0.000    0.000   0.000
 3.de.pool.ntp.o .POOL.          16 p    -   64    0    0.000    0.000   0.000
#ptbtime1.ptb.de .PTB.            1 u   45   64  177   15.740    0.289   1.147
#ptbtime2.ptb.de .PTB.            1 u   38   64  177   17.489   -0.651   1.632
#fks.dan.net.uk  117.161.90.132   3 u   46   64  177   21.736   -0.634   9.040
-213.95.200.109  213.95.151.123   2 u   41   64  177   23.400    1.216   1.353
+ns1.blazing.de  213.172.96.14    2 u   48   64  177   16.848    1.912   0.570
*ntp2.m-online.n 212.18.1.106     2 u   48   64  177   20.681    2.409   0.927
-stratum2-3.NTP. 129.70.130.71    2 u   44   64  177   20.868    1.482   0.719
+clint.blazing.d 213.172.96.14    2 u   42   64  177   16.612    2.374  12.795
-news01.nierle.c 192.53.103.103   2 u   40   64  177   20.127    1.504  12.851
#q.fu110.de      131.234.137.64   2 u  103   64  176   16.070   -0.769   0.663

> after that, etc. What is in the /var/db/ntpd.drift file? Are you
> using the standard freebsd ntp.conf file as delivered, or have you
> customized it? Any non-default settings in your rc.conf related to
> ntp?

The line in /etc/rc.conf is:

ntpd_flags="-4 -g -G -I 192.168.0.1 -p /var/run/ntpd.pid -f /var/db/ntpd.drift"

The IP at -I is the IP of the primary NIC of the machine, which has two NICs.

I use a customized /etc/ntp.conf and tried many variations while trying to
track down the problem.
I did the same on a host on the same network, but one of "modern date"
(regarding hardware, the server in question is a 2008 two-socket Core2Duo XEON
box with 2x 4 cores) and which does not host jails. The reference host does
not seem to show the weird clock gain.

The current /etc/ntp.conf now looks like this:

tos minclock 3 maxclock 6

server ptbtime1.ptb.de
server ptbtime2.ptb.de

pool 0.de.pool.ntp.org
pool 1.de.pool.ntp.org
pool 2.de.pool.ntp.org
pool 3.de.pool.ntp.org

restrict 192.168.0.0 mask 255.255.255.0 noquery kod nomodify notrap \
         nopeer
restrict default limited kod nomodify notrap noquery nopeer
restrict -6 default limited kod nomodify notrap noquery nopeer
restrict source limited kod nomodify notrap noquery

restrict 127.0.0.1
restrict 127.127.1.0
restrict -6 ::1

leapfile "/var/db/ntpd.leap-seconds.list"

> -- Ian
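
P.S. For comparing the ntpq -p samples above over time, here is a minimal
sketch of how one row could be read programmatically. It is only an
illustration: parse_peer and the TALLY table are hypothetical names, the role
labels are my short paraphrase of the tally codes described in the ntpq
documentation, and the parser assumes each row has exactly ten
whitespace-separated fields. The reach column is printed in octal, so 177
means the last seven polls were answered and 176 means the most recent one
was missed.

```python
# Tally characters that appear in the samples above. The labels are a short
# paraphrase of the ntpq documentation, not ntpd's exact wording.
TALLY = {
    "*": "system peer",
    "+": "candidate",
    "-": "outlier",
    "#": "backup",
}


def parse_peer(line):
    """Parse one data row of `ntpq -p` output (hypothetical helper)."""
    line = line.strip()
    # The first character is a tally code only if it is one we know;
    # otherwise the row starts directly with the remote name.
    tally = line[0] if line[0] in TALLY else " "
    if tally != " ":
        line = line[1:]
    remote, refid, st, t, when, poll, reach, delay, offset, jitter = line.split()
    reach_bits = int(reach, 8)  # reach is an 8-bit shift register shown in octal
    return {
        "remote": remote,
        "role": TALLY.get(tally, "not selected"),
        "stratum": int(st),
        "polls_answered": bin(reach_bits).count("1"),  # of the last eight
        "last_poll_answered": bool(reach_bits & 1),    # lowest bit is the newest poll
        "offset_ms": float(offset),                    # ntpq reports milliseconds
    }


if __name__ == "__main__":
    rows = [
        "*ntp2.m-online.n 212.18.1.106 2 u 48 64 177 20.681 2.409 0.927",
        "#q.fu110.de 131.234.137.64 2 u 103 64 176 16.070 -0.769 0.663",
    ]
    for row in rows:
        print(parse_peer(row))
```

Run against a sample taken while the clock is already hours off, the
offset_ms and role fields would show at a glance whether ntpd still has a
system peer or has lost all of its sources.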