Date: Thu, 5 Feb 2015 08:48:33 +0100
In-Reply-To: <2509923.ondFvsFdql@overcee.wemm.org>
References: <8089702.oYScRm8BTN@overcee.wemm.org> <20150204142941.GE42409@kib.kiev.ua> <2509923.ondFvsFdql@overcee.wemm.org>
Subject: Re: PSA: If you run -current, beware!
From: Luigi Rizzo
To: Peter Wemm
Cc: Konstantin Belousov, "freebsd-current@freebsd.org"

On Thursday, February 5, 2015, Peter Wemm wrote:

> On Wednesday, February 04, 2015 04:29:41 PM Konstantin Belousov wrote:
> > On Tue, Feb 03, 2015 at 01:33:15PM -0800, Peter Wemm wrote:
> > > Sometime in the Dec 10th through Jan 7th timeframe a timing bug was
> > > introduced to 11.x/head/-current. With HZ=1000 (the default for bare
> > > metal, not for a VM), the clocks stop just after 24 days of uptime. This
> > > means things like cron, sleep, timeouts etc stop working. TCP/IP won't
> > > time out or retransmit, etc etc. It can get ugly.
> > >
> > > The problem is NOT in 10.x/-stable.
> > >
> > > We hit this in the freebsd.org cluster; the builds that we used are:
> > > FreeBSD 11.0-CURRENT #0 r275684: Wed Dec 10 20:38:43 UTC 2014 - fine
> > > FreeBSD 11.0-CURRENT #0 r276779: Wed Jan 7 18:47:09 UTC 2015 - broken
> > >
> > > If you are running -current in a situation where it'll accumulate uptime,
> > > you may want to take precautions. A reboot prior to 24 days uptime (as
> > > horrible a workaround as that is) will avoid it.
> > >
> > > Yes, this is being worked on.
> >
> > So the issue is reproducible within 3 minutes after boot with the
> > following change in kern_clock.c:
> > volatile int ticks = INT_MAX - (/*hz*/1000 * 3 * 60);
> >
> > It is fixed (in the proper meaning of the word, not worked around or
> > papered over) by the patch at the end of the mail.
> >
> > We already have a history of trying to enable the much less ambitious
> > option -fno-strict-overflow, see r259045 and the revert in r259422.
> > I do not see any other way than to try one more time. Too many places in
> > the kernel depend on correctly wrapping two's-complement arithmetic, among
> > them the callwheel and the scheduler.
> >

Rather than depending on a compiler option, wouldn't it be better/more
robust to change ticks to unsigned, which has specified wrapping behavior?

Cheers
Luigi

Ugh.
>
> I believe I have a smoking gun that suggests that the clock-stop problem is
> caused by the clang-3.5 import on Dec 31st.
>
> Backstory:
> http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html
> http://www.airs.com/blog/archives/120
>
> I suspect that what has happened is that clang's optimizer got better at
> seeing the direct or indirect effects of integer overflow, and clang (and
> gcc) take advantage of that.
>
> I have used a slightly different change for about 10 years:
>
> --- kern/kern_clock.c 2014-12-01 15:42:21.707911656 -0800
> +++ kern/kern_clock.c 2014-12-01 15:42:21.707911656 -0800
> @@ -410,6 +415,11 @@
>  #ifdef SW_WATCHDOG
>  	EVENTHANDLER_REGISTER(watchdog_list, watchdog_config, NULL, 0);
>  #endif
> +	/*
> +	 * Arrange for ticks to go negative just 5 minutes after boot
> +	 * to help catch sign problems sooner.
> +	 */
> +	ticks = INT_MAX - (hz * 5 * 60);
>  }
>
>  /*
>
> This came about when we had problems with integer overflow arithmetic in
> the TCP stack.
>
> In any case, I'm in the process of adding -fwrapv and the early wraparound
> to the freebsd.org cluster builds to give it some wider exercise.
>
> --
> Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV
> UTF-8: for when a ' or ... just won’t do…

--
-----------------------------------------+-------------------------------
 Prof. Luigi RIZZO, rizzo@iet.unipi.it   . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/         . Universita` di Pisa
 TEL      +39-050-2211611                . via Diotisalvi 2
 Mobile   +39-338-6809875                . 56122 PISA (Italy)
-----------------------------------------+-------------------------------
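
For readers following the thread, here is a minimal sketch of the unsigned-counter idea raised in the reply above. It is not code from the FreeBSD tree: the helper names (days_until_signed_wrap, ticks_elapsed, deadline_passed) are hypothetical and exist only for illustration. It shows where the "just after 24 days" figure comes from (INT_MAX ticks at 1000 ticks per second is roughly 24.86 days) and how tick differences can stay well defined across a wrap when the arithmetic is done on unsigned values, since C defines unsigned arithmetic modulo UINT_MAX + 1.

```c
#include <limits.h>

/*
 * Hypothetical helper: days of uptime before a signed int tick counter
 * that starts at 0 reaches INT_MAX, for a given tick rate.
 */
static double
days_until_signed_wrap(unsigned int hz)
{
	return ((double)INT_MAX / (double)hz / 86400.0);
}

/*
 * Hypothetical helper: ticks elapsed between `start` and `now`.
 * Unsigned subtraction wraps modulo UINT_MAX + 1, so the result is
 * still correct after `now` has wrapped past zero.
 */
static unsigned int
ticks_elapsed(unsigned int now, unsigned int start)
{
	return (now - start);
}

/*
 * Hypothetical helper: nonzero if `now` is at or after `deadline`.
 * Assumption: the two values are less than half the counter range
 * apart, so a small wrapped difference means "already passed".
 */
static int
deadline_passed(unsigned int now, unsigned int deadline)
{
	return ((now - deadline) < (UINT_MAX / 2 + 1));
}
```

For example, ticks_elapsed(4, UINT_MAX - 5) yields 10 even though the counter wrapped in between, and no overflow-related compiler flag is involved in getting that result. Whether converting the kernel's counter this way is practical is exactly the open question in the thread; the sketch only illustrates the language rule being relied on.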