Date:      Thu, 05 Feb 2015 08:21:45 -0500
From:      John Baldwin <jhb@freebsd.org>
To:        freebsd-current@freebsd.org
Cc:        Konstantin Belousov <kostikbel@gmail.com>, Luigi Rizzo <rizzo@iet.unipi.it>, Peter Wemm <peter@wemm.org>
Subject:   Re: PSA: If you run -current, beware!
Message-ID:  <2613155.3ZBxDvY16q@ralph.baldwin.cx>
In-Reply-To: <CA%2BhQ2%2BiVE53PJs0noc_SPHpwDZVLX-tHpgYmzO9tGzJzDXwXWg@mail.gmail.com>
References:  <8089702.oYScRm8BTN@overcee.wemm.org> <2509923.ondFvsFdql@overcee.wemm.org> <CA%2BhQ2%2BiVE53PJs0noc_SPHpwDZVLX-tHpgYmzO9tGzJzDXwXWg@mail.gmail.com>

On Thursday, February 05, 2015 08:48:33 AM Luigi Rizzo wrote:
> On Thursday, February 5, 2015, Peter Wemm <peter@wemm.org> wrote:
> > On Wednesday, February 04, 2015 04:29:41 PM Konstantin Belousov wrote:
> > > On Tue, Feb 03, 2015 at 01:33:15PM -0800, Peter Wemm wrote:
> > > > Sometime in the Dec 10th through Jan 7th timeframe a timing bug has been
> > > > introduced to 11.x/head/-current.  With HZ=1000 (the default for bare
> > > > metal, not for a vm); the clocks stop just after 24 days of uptime.  This
> > > > means things like cron, sleep, timeouts etc stop working.  TCP/IP won't
> > > > time out or retransmit, etc etc.  It can get ugly.
> > > > 
> > > > The problem is NOT in 10.x/-stable.
> > > > 
> > > > We hit this in the freebsd.org cluster, the builds that we used are:
> > > > FreeBSD 11.0-CURRENT #0 r275684: Wed Dec 10 20:38:43 UTC 2014 - fine
> > > > FreeBSD 11.0-CURRENT #0 r276779: Wed Jan  7 18:47:09 UTC 2015 - broken
> > > > 
> > > > If you are running -current in a situation where it'll accumulate uptime,
> > > > you may want to take precautions.  A reboot prior to 24 days uptime (as
> > > > horrible a workaround as that is) will avoid it.
> > > > 
> > > > Yes, this is being worked on.
> > > 
> > > So the issue is reproducible in 3 minutes after boot with the following
> > > change in kern_clock.c:
> > > volatile int  ticks = INT_MAX - (/*hz*/1000 * 3 * 60);
> > > 
> > > It is fixed (in the proper meaning of the word, not like worked around,
> > > covered by paper) by the patch at the end of the mail.
> > > 
> > > We already have a story trying to enable much less ambitious option
> > > -fno-strict-overflow, see r259045 and the revert in r259422.  I do not
> > > see other way than try one more time.  Too many places in kernel
> > > depend on the correctly wrapping two's-complement arithmetic, among others
> > > are the callwheel and the scheduler.
> 
> Rather than depending on a compiler option, wouldn't it be better/more
> robust to change ticks to unsigned, which has specified wrapping behavior?

Yes, but non-trivial.  It's also not limited to ticks.  Since the compiler 
knows when it would apply these optimizations, it would be nice if it could 
warn instead (GCC apparently has a warning, but clang does not).  Having 
people do a manual audit of every signed integer expression in the tree will 
take a long time.
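
To make the two points above concrete, here is a rough sketch (not the patch
from this thread, and not kernel code).  The names uticks_t,
days_until_int_wrap() and ticks_reached() are made up for illustration.  The
first function shows where "just after 24 days" comes from when a signed int
counts ticks at HZ=1000; the second shows a deadline test written on an
unsigned counter, where wraparound is defined by the C standard instead of
depending on a compiler flag.  It assumes the usual case where
UINT_MAX == 2 * INT_MAX + 1.

```c
#include <limits.h>

/* Hypothetical unsigned stand-in for the tick counter (assumption). */
typedef unsigned int uticks_t;

/*
 * Days of uptime before a signed int tick counter that starts at 0
 * passes INT_MAX, given hz ticks per second.
 */
double
days_until_int_wrap(int hz)
{
	return (((double)INT_MAX + 1.0) / (double)hz / 86400.0);
}

/*
 * Nonzero if 'now' is at or past 'deadline'.  Unsigned subtraction wraps
 * modulo UINT_MAX + 1, so the result stays correct across the wrap as
 * long as the two values are less than half the counter range apart.
 */
int
ticks_reached(uticks_t now, uticks_t deadline)
{
	return ((uticks_t)(now - deadline) <= (uticks_t)INT_MAX);
}
```

With 32-bit int, days_until_int_wrap(1000) comes out a little under 24.9
days.  A deadline set 6 ticks before the counter wraps still tests as reached
2 ticks after the wrap, which is the case the signed comparison can get wrong
once the compiler assumes overflow never happens.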

-- 
John Baldwin


