From owner-freebsd-hackers@FreeBSD.ORG Wed Jan 26 19:11:10 2005 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 6674716A517 for ; Wed, 26 Jan 2005 19:11:10 +0000 (GMT) Received: from mail.ambrisko.com (mail.ambrisko.com [64.174.51.43]) by mx1.FreeBSD.org (Postfix) with ESMTP id 078B543D2D for ; Wed, 26 Jan 2005 19:11:10 +0000 (GMT) (envelope-from ambrisko@ambrisko.com) Received: from server2.ambrisko.com (HELO www.ambrisko.com) (192.168.1.2) by mail.ambrisko.com with ESMTP; 26 Jan 2005 11:11:09 -0800 Received: from ambrisko.com (localhost [127.0.0.1]) by www.ambrisko.com (8.12.11/8.12.9) with ESMTP id j0QJB9BK088725; Wed, 26 Jan 2005 11:11:09 -0800 (PST) (envelope-from ambrisko@ambrisko.com) Received: (from ambrisko@localhost) by ambrisko.com (8.12.11/8.12.11/Submit) id j0QJB9jo088720; Wed, 26 Jan 2005 11:11:09 -0800 (PST) (envelope-from ambrisko) From: Doug Ambrisko Message-Id: <200501261911.j0QJB9jo088720@ambrisko.com> In-Reply-To: <200501251558.j0PFwFjn029638@calamari.aero.org> To: Chris Landauer Date: Wed, 26 Jan 2005 11:11:09 -0800 (PST) X-Mailer: ELM [version 2.4ME+ PL94b (25)] MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=US-ASCII cc: freebsd-hackers@freebsd.org Subject: Re: bug in calcru() in kernel: integer overflow computing user time X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 26 Jan 2005 19:11:10 -0000 Chris Landauer writes: | | hihi, all - | | well, i have an "almost" fix for the problem - read on, ... | (this is for discussin before send-pr submission) | | description | | in FreeBSD 5.3-RELEASE (and 5.2.1R, and 5.1R), | in file /usr/src/sys/kern/kern_resource.c, | lines 657-750 define a function calcru(), | and there is a bug in that function - | | line 713 in that file is | | uu = (tu * ut) / tt, | | which is trying to compute a proportion (since ut<=tt) - the problem | is that for my application, the values of tt and ut are very large and | the value of the intermediate product overflows, leading to incorrect | values (i reported the symptoms to freebsd-hackers and a very helpful | description and localization of the problem to calcru() was provided | by peter jeremy, who also wrote that it is a very old, but only partly | known, problem) | | repetition | | use time in csh or /usr/bin/time or getrusage() to time any program | that takes more than a week to run | | fix (almost) | | it turns out that this problem can be (partly) fixed by replacing that | one line above with the following lines: | | /* we know 0 <= ut <= tt and 1 <= tt */ | if (tu >= tt) | { | **whatever type they need to be** q, r; | q = tu / tt; | r = tu % tt; | uu = (q * ut) + (r * ut) / tt; | } | else uu = (tu * ut) / tt | | this change does not solve all the arithmetic problems (especially | when ut is very large), and a similar change is needed for system | time, computing su in line 714 of the file, but it suffices for me - | | analysis (yup, proof that it should work 8-)) - | | i expect that all of these counters are increasing throughout the life | of the process - | tu is total time in microseconds | ut is user 128hz ticks | tt is total 128hz ticks | i assume therefore that | tu ~ tt * 10^6/128 | strictly because of the clock rates | (machines with other kinds of clock rate ratios will need a different | analysis, but the same idea should work there, too) - | | in my case ut ~ tt, so we see overflow in the old computation when | tu * ut >= 2^64 | tt^2 * 10^6/128 >= 2^64 | tt * 10^3/8*sqrt(2) >= 2^32 ~ 4 * 10^9 | tt >= 32*sqrt(2)*10^6, | which is about | sqrt(2)*10^6 / 4 ~ 3.54*10^5 seconds, | or | ~ 98 hours | (this is the phenomenon i observed) | | in the new computation offered above, since we know that | ut <= tt, | we also know that | uu <= tu, | and | (q * ut) <= uu, | so there can be no overflow in the (q * ut) term - | therefore, this changed code will overflow only when r is large | r ~ tt, | and then only when | r * ut >= 2^64 | tt^2 >= 2^64 | tt >= 2^32, | which is about | ~ 2^25 seconds | ~ 9300 hours | ~ 388 days | (i can live with that for now) | | for everyone else, it adds the one test in the if statement to every | call to calcru() (or two, if system time is similarly fixed) - | instrumentation is costly, and correct instrumentation is | even more costly, but it's worth it, every time, to get the right | answer | | i'm about to try it out this week - i will report the results when i | get some data (which will be a few weeks) | | i'm thinking about how to solve the problem correctly for really | long-running programs, but it's all pretty special case ugly so far This is my fix: Index: kern_resource.c =================================================================== RCS file: /usr/local/cvsroot/freebsd/src/sys/kern/kern_resource.c,v retrieving revision 1.55.2.5 diff -u -p -r1.55.2.5 kern_resource.c --- kern_resource.c 3 Nov 2001 01:41:08 -0000 1.55.2.5 +++ kern_resource.c 26 Jan 2005 19:03:27 -0000 @@ -552,10 +552,10 @@ calcru(p, up, sp, ip) tu = ptu; } - /* Subdivide tu. */ - uu = (tu * ut) / tt; - su = (tu * st) / tt; - iu = tu - uu - su; + /* Subdivide tu. try to becareful of overflow */ + su = tu * (st * 1024 / tt) / 1024; + iu = tu * (it * 1024 / tt) / 1024; + uu = tu - (su + iu); /* Enforce monotonicity. */ if (uu < p->p_uu || su < p->p_su || iu < p->p_iu) { If people like it I can commit it. It works for us. We had problems with this as well. It's pretty simple fix. I used 1k since usage of this tends to be % so rounding should effect that much. Doug A.