Date: Fri, 21 Jan 2005 14:49:41 -0800 (PST) From: Chris Landauer <cal@rush.aero.org> To: freebsd-hackers@freebsd.org Cc: cal@rush.aero.org Subject: time and timing errors in c code on 5.x/i386 (longish) Message-ID: <200501212249.j0LMnfpJ091129@calamari.aero.org>
next in thread | raw e-mail | index | archive | help
hihi, all - i hope this isn't too long, but i have evidence to go with my problem 8-) i'm running 5.3-RELEASE/i386, and i've been having a timing problem on very long programs (i asked the -questions list a year ago, and got many helpful but ultimately inadequate responses), and the same thing happens on 5.2.1 and 5.1 (where i first noticed it) i'm running some combinatorial search programs that take weeks or months to complete, and no timer i've used is able to report correctly the user and system time (they all make the same mistake - eventually the user time stops incrementing) - i want precise times to do some predictive modeling some summary output lines that illustrate the problem n46 r08 295.722u 0.417s 5:17.59 93.2% n46 r09 32031.689u 41.815s 9:34:08.17 93.1% n46 r10 5672.848u 2096.191s 635:23:04.54 0.3% (the last column is edited from the output from time in csh; the other two columns are problem parameters, and the mathematics of the problem is such that these values should increase rapidly with increasing r; in this case, the user time has wrapped or otherwise failed in the last line, and since the machine was only running the one program in addition to the usual kde or gnome or something in all 4 cases, it should have been about 93% usage or so like the previous two) i looked at the time structure and it should be big enough to prevent wrapping for some long period of time, but the same thing happens with /usr/bin/time (of course, since it takes a few weeks for each experiment, i didn't do very many of them) - the i/o counts also wrap around, but i don't care about that so i added some code that uses rusage() to the program, and the same failure occurs (my own code for this computation is included next) - a snippet from the time lines in the printed output: user 378925.483628 syst 286.845375 elapse 381328.785295 pct 99.44% user 379089.748458 syst 286.962284 elapse 381493.700660 pct 99.45% user 379255.472355 syst 287.088004 elapse 381660.106387 pct 99.45% user 379417.184286 syst 287.190223 elapse 381822.457863 pct 99.45% user 379417.184286 syst 451.110470 elapse 381986.906692 pct 99.45% user 379417.184286 syst 615.737725 elapse 382152.058304 pct 99.45% (as you can see, the user time stops changing and the system time starts incrementing rapidly - one hint is that the percentage stays the same, so it appears that the wrong variable is being incremented in the kernel - another hint is that i don't think that this fail time is the same for all runs, though it is similarly large) - the code snippet that produces these lines: ...start #include <sys/types.h> #include <sys/time.h> #include <sys/resource.h> ... /* time statistics parameters */ static struct timeval tv; static struct rusage ru; static double ts, te, tu, ty, elapse, pct; static double convert(struct timeval tv); ... /* get start time */ gettimeofday(& tv, NULL); ts = convert(tv); ... /* get current time */ gettimeofday(& tv, NULL); te = convert(tv); /* get resource usage */ getrusage(RUSAGE_SELF, & ru); tu = convert(ru.ru_utime); ty = convert(ru.ru_stime); /* compute elapsed time and percentage */ elapse = te - ts; pct = 100.0 * (tu + ty) / (elapse); /* times are in microseconds, so 6 digits is enough */ printf("user %.6f syst %.6f elapse %.6f pct %.2f%%\n", tu, ty, elapse, pct); ... static double convert(struct timeval tv) { /* convert a timeval to a double */ double sec; double usec; /* extract the parts of the structure */ usec = tv.tv_usec; sec = tv.tv_sec; /* combine the parts */ #define USEC_PER_SEC 1.0e6 sec += usec / USEC_PER_SEC; return sec; /* end of function convert() */ } ...end any suggestions? i know that both ru_utime and ru_stime are struct timevals, and that they are at the beginning of the struct rusage definition, but beyond that, i can't see what is happening (by the way, the search program is producing correct answers; it is only the timing that is wrong) i would be happy to perform experiments to solve this problem, and i know that i have to be patient (after all, each experiment takes over two weeks) more later, cal Dr. Christopher Landauer Aerospace Integration Science Center The Aerospace Corporation, Mail Stop M6/214 P.O.Box 92957 Los Angeles, California 90009-2957, USA e-mail: cal@aero.org, Phone: +1 (310) 336-1361
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200501212249.j0LMnfpJ091129>