From: Peter Wemm
To: freebsd-current@freebsd.org
Subject: Re: PSA: If you run -current, beware!
Date: Wed, 04 Feb 2015 15:15:57 -0800
Cc: Konstantin Belousov

On Wednesday, February 04, 2015 04:29:41 PM Konstantin Belousov wrote:
> On Tue, Feb 03, 2015 at 01:33:15PM -0800, Peter Wemm wrote:
> > Sometime in the Dec 10th through Jan 7th timeframe a timing bug has been
> > introduced to 11.x/head/-current.  With HZ=1000 (the default for bare
> > metal, not for a vm); the clocks stop just after 24 days of uptime.  This
> > means things like cron, sleep, timeouts etc stop working.  TCP/IP won't
> > time out or retransmit, etc etc.  It can get ugly.
> >
> > The problem is NOT in 10.x/-stable.
> >
> > We hit this in the freebsd.org cluster, the builds that we used are:
> > FreeBSD 11.0-CURRENT #0 r275684: Wed Dec 10 20:38:43 UTC 2014 - fine
> > FreeBSD 11.0-CURRENT #0 r276779: Wed Jan  7 18:47:09 UTC 2015 - broken
> >
> > If you are running -current in a situation where it'll accumulate uptime,
> > you may want to take precautions.  A reboot prior to 24 days uptime (as
> > horrible a workaround as that is) will avoid it.
> >
> > Yes, this is being worked on.
>
> So the issue is reproducible in 3 minutes after boot with the following
> change in kern_clock.c:
> volatile int	ticks = INT_MAX - (/*hz*/1000 * 3 * 60);
>
> It is fixed (in the proper meaning of the word, not like worked around,
> covered by paper) by the patch at the end of the mail.
>
> We already have a story trying to enable much less ambitious option
> -fno-strict-overflow, see r259045 and the revert in r259422.  I do not
> see other way than try one more time.  Too many places in kernel
> depend on the correctly wrapping two's-complement arithmetic, among others
> are callwheel and scheduler.

Ugh.

I believe I have a smoking gun that suggests that the clock-stop problem is
caused by the clang-3.5 import on Dec 31st.

Backstory:
http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html
http://www.airs.com/blog/archives/120

I suspect that what has happened is that clang's optimizer got better at
seeing the direct or indirect effects of integer overflow and clang (and gcc)
take advantage of that.

I have used a slightly different change for about 10 years:

--- kern/kern_clock.c	2014-12-01 15:42:21.707911656 -0800
+++ kern/kern_clock.c	2014-12-01 15:42:21.707911656 -0800
@@ -410,6 +415,11 @@
 #ifdef SW_WATCHDOG
 	EVENTHANDLER_REGISTER(watchdog_list, watchdog_config, NULL, 0);
 #endif
+	/*
+	 * Arrange for ticks to go negative just 5 minutes after boot
+	 * to help catch sign problems sooner.
+	 */
+	ticks = INT_MAX - (hz * 5 * 60);
 }
 
 /*

This came about from when we had problems with integer overflow arithmetic in
the tcp stack.

In any case, I'm in the process of adding -fwrapv and the early wraparound to
the freebsd.org cluster builds to give it some wider exercise.

-- 
Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV
UTF-8: for when a ' or ...
just won’t do…
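
The wraparound concern in this message can be illustrated with a small C
sketch. It is not code from the FreeBSD tree and not part of the original
mail: the helper names ticks_next and ticks_before are hypothetical. It shows
one way to advance and compare a signed tick counter without relying on
signed overflow, which ISO C leaves undefined unless a flag such as -fwrapv
is in effect. The arithmetic is done in unsigned int, where wraparound is
defined. The conversion back to int is implementation-defined; gcc and clang
document it as wrapping modulo 2^N.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical helper: advance a signed tick counter by one.
 * The addition is done in unsigned int, so stepping past INT_MAX
 * wraps instead of invoking undefined behavior.  Converting the
 * result back to int is implementation-defined (assumed here to
 * wrap, as gcc and clang document).
 */
static int
ticks_next(int t)
{
	return ((int)((unsigned int)t + 1u));
}

/*
 * Hypothetical helper: true if tick value a is strictly earlier
 * than b, assuming the two are less than half the counter range
 * apart.  The subtraction is unsigned, so it stays well defined
 * even when the counter has wrapped between a and b.
 */
static bool
ticks_before(int a, int b)
{
	return ((int)((unsigned int)a - (unsigned int)b) < 0);
}
```

With these helpers, ticks_before(INT_MAX, ticks_next(INT_MAX)) holds even
though the second value is negative, whereas a plain "a < b" on the raw
values would report the opposite order once the counter goes negative.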