From owner-freebsd-hackers@FreeBSD.ORG Sat Jan 22 21:49:12 2005
Date: Sun, 23 Jan 2005 08:49:07 +1100
From: Peter Jeremy
To: Chris Landauer
cc: freebsd-hackers@freebsd.org
Subject: Re: time and timing errors in c code on 5.x/i386 (longish)
Message-ID: <20050122214907.GA241@cirb503493.alcatel.com.au>
References: <200501212249.j0LMnfpJ091129@calamari.aero.org>
In-Reply-To: <200501212249.j0LMnfpJ091129@calamari.aero.org>

On Fri, 2005-Jan-21 14:49:41 -0800, Chris Landauer wrote:
>i'm running some combinatorial search programs that take weeks or months to
>complete, and no timer i've used is able to report correctly the user and
>system time (they all make the same mistake - eventually the user time stops
>incrementing) - i want precise times to do some predictive modeling

[evidence deleted]

The problem looks like an overflow error in calcru().  Have you seen
any kernel messages beginning 'calcru:'?

The offending code is:

	uu = (tu * ut) / tt;

where all variables are uint64
  uu is the user time in microseconds (that will be converted to a
     timeval and reported via getrusage())
  tu is the total usermode runtime allocated to your program (in usec)
  ut is the number of usermode statclock hits (128Hz)
  tt is the total (user+sys+int) statclock hits.

>user 378925.483628 syst 286.845375 elapse 381328.785295 pct 99.44%
>user 379089.748458 syst 286.962284 elapse 381493.700660 pct 99.45%
>user 379255.472355 syst 287.088004 elapse 381660.106387 pct 99.45%
>user 379417.184286 syst 287.190223 elapse 381822.457863 pct 99.45%
>user 379417.184286 syst 451.110470 elapse 381986.906692 pct 99.45%
>user 379417.184286 syst 615.737725 elapse 382152.058304 pct 99.45%

At this point tu is roughly 379417184286 and ut is roughly 48565399.
The product is about 1.8e19 - which is roughly 2^64.

That particular code goes all the way back to BSD4.4lite so it's a bug
that has always existed.  We can't use FP in the kernel and don't
support 128-bit integers (or arithmetic) anywhere so a correct fix is
quite ugly (and inefficient) in portable C.

I can suggest two options:
1) If exact timings aren't critical, just use the elapsed time.
2) It would be fairly easy to write some i386 assembler (or __asm())
   that correctly calculated (uint64 * uint32)/uint32 which would work
   for tt < 2^32.  Assuming that nothing is being profiled, this would
   be good for just over a year of process time.

-- 
Peter Jeremy
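
[Illustration, not part of the original message.] The overflow and the restricted (uint64 * uint32)/uint32 case from option 2 can be sketched in portable C. This is not the assembler suggested above and not the actual calcru() code: it is a swapped-in technique that splits tu into quotient and remainder by tt, so that no intermediate product exceeds 64 bits. The function names scale_naive and muldiv_64x32 are hypothetical, and the sketch assumes ut and tt fit in 32 bits, tt is nonzero, and ut <= tt (so the result cannot exceed tu).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Naive form quoted in the message. The 64-bit product tu * ut wraps
 * modulo 2^64 once it reaches 2^64, so the result silently goes wrong.
 */
uint64_t scale_naive(uint64_t tu, uint64_t ut, uint64_t tt)
{
    return (tu * ut) / tt;
}

/*
 * Hypothetical helper: floor(a * b / c) for 64-bit a and 32-bit b, c.
 *
 * Write a = q * c + r with r < c. Then
 *     a * b / c = q * b + (r * b) / c
 * and r * b < 2^32 * 2^32 = 2^64, so the inner product cannot wrap.
 * q * b stays within 64 bits whenever b <= c (assumed by the caller).
 */
uint64_t muldiv_64x32(uint64_t a, uint32_t b, uint32_t c)
{
    assert(c != 0);             /* division by zero is the caller's bug */

    uint64_t q = a / c;         /* whole multiples of c in a */
    uint64_t r = a % c;         /* remainder, strictly less than c */

    return q * b + (r * b) / c;
}
```

For example, with tu = 400000000000 and ut = tt = 50000000, scale_naive() returns a value far smaller than tu because the product has wrapped, while muldiv_64x32() returns tu unchanged. The cost is two 64-bit divisions, which are not cheap on i386, so this only shows that the restricted case is expressible in C, not that it is the right fix for the kernel.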