Date:      Thu, 23 Mar 2017 15:38:33 +0100
From:      "O. Hartmann" <o.hartmann@walstatt.org>
To:        Ian Lepore <ian@freebsd.org>
Cc:        "O. Hartmann" <ohartmann@walstatt.org>, freebsd-current <freebsd-current@freebsd.org>
Subject:   Re: ntpd dies nightly on a server with jails
Message-ID:  <20170323153833.75e1b013@freyja.zeit4.iv.bundesimmobilien.de>
In-Reply-To: <1489774815.40576.182.camel@freebsd.org>
References:  <20170315071724.78bb0bdc@freyja.zeit4.iv.bundesimmobilien.de> <201703152012.v2FKCbvg078762@slippy.cwsent.com> <20170317180507.5c64fb26@thor.intern.walstatt.dynvpn.de> <1489774815.40576.182.camel@freebsd.org>

On Fri, 17 Mar 2017 12:20:15 -0600
Ian Lepore <ian@freebsd.org> wrote:

> On Fri, 2017-03-17 at 18:05 +0100, O. Hartmann wrote:
> > On Wed, 15 Mar 2017 13:12:37 -0700
> > Cy Schubert <Cy.Schubert@komquats.com> wrote:
> >
> > > Hi O.Hartmann,
> > >
> > > I'll try to answer as much as I can in the noon hour I have left.
> > >
> > > In message <20170315071724.78bb0bdc@freyja.zeit4.iv.bundesimmobilien.de>,
> > > "O. Hartmann" writes:
> > > >
> > > > Running a host with several jails on recent CURRENT (12.0-CURRENT
> > > > #8 r315187: Sun Mar 12 11:22:38 CET 2017 amd64) gives me trouble on
> > > > a daily basis.
> > > >
> > > > The box is an older two-socket Fujitsu server equipped with two
> > > > four-core Intel(R) Xeon(R) CPU L5420 @ 2.50GHz.
> > > >
> > > > The box has several jails; none of the jails runs ntpd. Each jail
> > > > has its own dedicated loopback, lo1 through lo5 (for the moment),
> > > > with a dedicated IP: 127.0.1.1 - 127.0.5.1 (if this matters; I
> > > > believe it does not).
> > > >
> > > > The host itself has two main NICs, Broadcom based. bcm0 is
> > > > dedicated to the host, bcm1 is shared amongst the jails: each jail
> > > > has an IP bound to bcm1 via which the jails communicate with the
> > > > network.
> > > >
> > > > I try to capture log information via syslog, but FreeBSD's ntpd
> > > > seems to be very, very sparse with such information, converging to
> > > > null. I can't see anything suitable in the logs explaining why ntpd
> > > > dies almost every night, leaving the system with a wild reset of
> > > > time. Sometimes it is a gain of 6 hours, sometimes it is only half
> > > > an hour. I usually leave the box at 16:00 local time and take care
> > > > of it again at ~7 o'clock in the morning local time.
> > > We will need to turn on debugging. Unfortunately debug code is not
> > > compiled into the binary. We have two options. You can either update
> > > src/usr.sbin/ntp/config.h to enable DEBUG or build the port (it's
> > > the exact same ntp) with the DEBUG option -- this is probably
> > > simpler. Then enable debug with -d and -D. -D increases verbosity. I
> > > just committed a debug option to both ntp ports to assist here.
> > >
> > > Next question: Do you see any indication of a core dump? I'd be
> > > interested in looking at it if possible.
> > >
> > > >
> > > > When the clock is floating that wildly, in all cases ntpd isn't
> > > > running any more. I try to restart with options -g and -G to
> > > > adjust the time quickly at the beginning, which works fine.
> > > This is disconcerting. If your clock is floating wildly without ntpd
> > > running there are other issues that might be at play here. At most
> > > the clock might drift a little, maybe a minute or two a day but not
> > > by a lot. Does the drift cause your clocks to run fast or slow?
> > >
> > > >
> > > > Apart from possible misconfigurations of the jails (I'm quite new
> > > > to jails and their pitfalls), I was wondering what causes ntpd to
> > > > die. I can't determine the exact time of its death, so it might
> > > > be related to diurnal/periodic processes (I use only the most
> > > > vanilla configurations on periodic, except for checking that ZFS
> > > > scrubbing is enabled).
> > > As I'm a little rushed for time, I didn't catch whether the jails
> > > themselves were also running ntpd... just thought I'd ask. I don't
> > > see how zfs scrubbing or any other periodic scripts could cause this.
> > >
> > > >
> > > > I haven't had the chance to check whether the hardware is
> > > > completely all right, but from a superficial point of view there
> > > > is no issue with high gain of the internal clock or other hardware
> > > > issues.
> > > It's probably a good idea to check. I don't think that would cause
> > > ntpd any gas. I've seen RTC battery messages on my gear which haven't
> > > caused ntpd any problem. I have two machines which complain about
> > > the RTC battery being dead, where in fact I have replaced the
> > > batteries and the messages are still displayed at boot. I'm not sure
> > > if it's possible for a kernel to damage the RTC. In my case that
> > > doesn't cause ntpd any problems. It's probably good to check anyway.
> > >
> > > >
> > > > If there are known issues with jails (the problem has occurred
> > > > since I started using those), advice is appreciated.
> > > Not that I know of.
> > >
> > Just some strange news:
> >
> > I left the server the whole day with ntpd disabled and I didn't see
> > the RTC gain even one second, even when stressing the machine.
> >
> > But soon after restarting ntpd, I immediately noticed that the clock
> > was 30 minutes off! This morning, the discrepancy was almost 5 hours -
> > it looked more like a weird adjustment to another time base than UTC.
> >
> > Over the weekend I'll leave the server with ntpd disabled and only
> > the RTC running. I have the strange feeling that something is
> > intentionally readjusting the ntpd time due to a misconfiguration or
> > a rogue ntp server in the X.CC.pool.ntp.org
> >
>
> The rogue server theory is a bad one, unless you have configured just a
> single server in your ntp.conf and it is the rogue.  Ntpd requires
> agreement among the set of configured servers, it will ignore outliers.

Last weekend I switched off ntpd and ran the server entirely on the
onboard RTC. On Monday morning when I entered the office, the clock was in
sync with the official time.

As usual, I updated the sources and ran buildworld. After a couple of builds
over the week, with ntpd restarted via rc.conf as usual after each reboot, I
checked over the past two days and always found the server's clock out of
sync.

The more curious part is that the clock is almost 6 hours behind UTC. I cannot
tell whether ntpd is still trying to adjust the time to a foreign clock which
has another time reference. I checked the TZ and everything seems all right.

>
> It would help to have some actual data.  What does ntpq -p show right
> after starting ntpd?  Then a few minutes later, then again 10 minutes

[RESTART]
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 0.de.pool.ntp.o .POOL.          16 p    -   64    0    0.000    0.000   0.000
 1.de.pool.ntp.o .POOL.          16 p    -   64    0    0.000    0.000   0.000
 2.de.pool.ntp.o .POOL.          16 p    -   64    0    0.000    0.000   0.000
 3.de.pool.ntp.o .POOL.          16 p    -   64    0    0.000    0.000   0.000
 ptbtime1.ptb.de .INIT.          16 u    -   64    0    0.000    0.000   0.000
 ptbtime2.ptb.de .INIT.          16 u    -   64    0    0.000    0.000   0.000

[after 1 Minute]
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 0.de.pool.ntp.o .POOL.          16 p    -   64    0    0.000    0.000   0.000
 1.de.pool.ntp.o .POOL.          16 p    -   64    0    0.000    0.000   0.000
 2.de.pool.ntp.o .POOL.          16 p    -   64    0    0.000    0.000   0.000
 3.de.pool.ntp.o .POOL.          16 p    -   64    0    0.000    0.000   0.000
 ptbtime1.ptb.de .PTB.            1 u   34   64    1   16.931   -4.841   0.000
 ptbtime2.ptb.de .PTB.            1 u   34   64    1   18.273   -5.518   0.000
 fks.dan.net.uk  117.161.90.132   3 u   31   64    1   24.217   -3.904   0.000
 213.95.200.109  213.95.151.123   2 u   33   64    1   25.464   -2.449   0.000
 ns3.customer-re 192.53.103.108   2 u   35   64    1   23.905   -1.187   0.000
 ns1.blazing.de  213.172.96.14    2 u   36   64    1   17.045   -3.017   0.000
 ntp2.m-online.n 212.18.1.106     2 u   36   64    1   20.758   -2.693   0.000
 stratum2-3.NTP. 129.70.130.71    2 u   35   64    1   22.000   -3.800   0.000
 estoma.de       144.76.96.7      3 u   33   64    1    7.919   -3.182   0.000
 clint.blazing.d 213.172.96.14    2 u   34   64    1   17.642   -2.932   0.000
 news01.nierle.c 192.53.103.103   2 u   34   64    1   19.880   -3.750   0.000
 q.fu110.de      131.234.137.64   2 u   35   64    1   16.649   -6.037   0.000

[after ~10 Minutes]
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 0.de.pool.ntp.o .POOL.          16 p    -   64    0    0.000    0.000   0.000
 1.de.pool.ntp.o .POOL.          16 p    -   64    0    0.000    0.000   0.000
 2.de.pool.ntp.o .POOL.          16 p    -   64    0    0.000    0.000   0.000
 3.de.pool.ntp.o .POOL.          16 p    -   64    0    0.000    0.000   0.000
#ptbtime1.ptb.de .PTB.            1 u   45   64  177   15.740    0.289   1.147
#ptbtime2.ptb.de .PTB.            1 u   38   64  177   17.489   -0.651   1.632
#fks.dan.net.uk  117.161.90.132   3 u   46   64  177   21.736   -0.634   9.040
-213.95.200.109  213.95.151.123   2 u   41   64  177   23.400    1.216   1.353
+ns1.blazing.de  213.172.96.14    2 u   48   64  177   16.848    1.912   0.570
*ntp2.m-online.n 212.18.1.106     2 u   48   64  177   20.681    2.409   0.927
-stratum2-3.NTP. 129.70.130.71    2 u   44   64  177   20.868    1.482   0.719
+clint.blazing.d 213.172.96.14    2 u   42   64  177   16.612    2.374  12.795
-news01.nierle.c 192.53.103.103   2 u   40   64  177   20.127    1.504  12.851
#q.fu110.de      131.234.137.64   2 u  103   64  176   16.070   -0.769   0.663
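
As a reading aid for the "reach" column in the ntpq -p listings: ntpq
prints it as the octal value of an 8-bit shift register in which the lowest
bit is the most recent poll and a set bit means a reply was received. The
following is only a minimal sketch in Python for decoding it; the function
names decode_reach and describe are made up for this illustration and are
not part of ntpq or ntpd.

```python
def decode_reach(reach_octal):
    """Return the last eight poll results as booleans, newest first."""
    value = int(reach_octal, 8)  # ntpq -p prints the register in octal
    return [bool((value >> bit) & 1) for bit in range(8)]


def describe(reach_octal):
    """Summarise a reach value in words."""
    results = decode_reach(reach_octal)
    newest = "answered" if results[0] else "missed"
    return "%d/8 polls answered, newest poll %s" % (sum(results), newest)


if __name__ == "__main__":
    # Values taken from the three listings in this mail.
    for reach in ("0", "1", "177", "176"):
        print(reach.rjust(3), describe(reach))
```

Read this way, 177 means the last seven polls since the restart were all
answered, and the 176 shown for q.fu110.de means its most recent poll went
unanswered, which would be consistent with its "when" value of 103 being
larger than the poll interval of 64.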

> after that, etc.  What is in the /var/db/ntpd.drift file?  Are you
> using the standard freebsd ntp.conf file as delivered, or have you
> customized it?  Any non-default settings in your rc.conf related to
> ntp?

The line in /etc/rc.conf is:

ntpd_flags="-4 -g -G -I 192.168.0.1 -p /var/run/ntpd.pid -f /var/db/ntpd.drift"

The IP at -I is the IP of the primary NIC of the machine, which has two NICs.


I use a customized /etc/ntp.conf and tried a lot of variations while trying
to figure out the problem. I did the same on another host on the same
network, which is of a more "modern date" (regarding hardware; the server in
question is a 2008 two-socket Core2Duo XEON box with 2x 4 cores) and which
does not host jails. The reference host does not seem to show the weird clock
gain.

The current /etc/ntp.conf looks like this:

tos minclock 3 maxclock 6
server          ptbtime1.ptb.de
server          ptbtime2.ptb.de
pool            0.de.pool.ntp.org
pool            1.de.pool.ntp.org
pool            2.de.pool.ntp.org
pool            3.de.pool.ntp.org
restrict        192.168.0.0 mask 255.255.255.0 noquery kod nomodify notrap \
nopeer
restrict    default limited kod nomodify notrap noquery nopeer
restrict -6 default limited kod nomodify notrap noquery nopeer
restrict    source  limited kod nomodify notrap noquery
restrict 127.0.0.1
restrict 127.127.1.0
restrict -6 ::1
leapfile "/var/db/ntpd.leap-seconds.list"

>
> -- Ian



