Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 21 Jan 2005 14:49:41 -0800 (PST)
From:      Chris Landauer <cal@rush.aero.org>
To:        freebsd-hackers@freebsd.org
Cc:        cal@rush.aero.org
Subject:   time and timing errors in c code on 5.x/i386 (longish)
Message-ID:  <200501212249.j0LMnfpJ091129@calamari.aero.org>

next in thread | raw e-mail | index | archive | help

hihi, all -

i hope this isn't too long, but i have evidence to go with my problem 8-)

i'm running 5.3-RELEASE/i386, and i've been having a timing problem on very
long programs (i asked the -questions list a year ago, and got many helpful
but ultimately inadequate responses), and the same thing happens on 5.2.1 and
5.1 (where i first noticed it)

i'm running some combinatorial search programs that take weeks or months to
complete, and no timer i've used is able to report correctly the user and
system time (they all make the same mistake - eventually the user time stops
incrementing) - i want precise times to do some predictive modeling

some summary output lines that illustrate the problem

n46     r08     295.722u 0.417s 5:17.59 93.2%
n46     r09     32031.689u 41.815s 9:34:08.17 93.1%
n46     r10     5672.848u 2096.191s 635:23:04.54 0.3%

(the last column is edited from the output from time in csh; the other two
columns are problem parameters, and the mathematics of the problem is such
that these values should increase rapidly with increasing r; in this case, the
user time has wrapped or otherwise failed in the last line, and since the
machine was only running the one program in addition to the usual kde or gnome
or something in all 4 cases, it should have been about 93% usage or so like
the previous two)

i looked at the time structure and it should be big enough to prevent wrapping
for some long period of time, but the same thing happens with /usr/bin/time
(of course, since it takes a few weeks for each experiment, i didn't do very
many of them) - the i/o counts also wrap around, but i don't care about that

so i added some code that uses rusage() to the program, and the same failure
occurs (my own code for this computation is included next) - a snippet from
the time lines in the printed output:

user 378925.483628 syst 286.845375 elapse 381328.785295 pct 99.44%
user 379089.748458 syst 286.962284 elapse 381493.700660 pct 99.45%
user 379255.472355 syst 287.088004 elapse 381660.106387 pct 99.45%
user 379417.184286 syst 287.190223 elapse 381822.457863 pct 99.45%
user 379417.184286 syst 451.110470 elapse 381986.906692 pct 99.45%
user 379417.184286 syst 615.737725 elapse 382152.058304 pct 99.45%

(as you can see, the user time stops changing and the system time starts
incrementing rapidly - one hint is that the percentage stays the same, so it
appears that the wrong variable is being incremented in the kernel - another
hint is that i don't think that this fail time is the same for all runs,
though it is similarly large) - the code snippet that produces these lines:

...start

#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>

...

/* time statistics parameters */
static struct timeval tv;
static struct rusage ru;
static double ts, te, tu, ty, elapse, pct;
static double convert(struct timeval tv);

...


        /* get start time */
        gettimeofday(& tv, NULL);
        ts = convert(tv);

...

        /* get current time */
        gettimeofday(& tv, NULL);
        te = convert(tv);

        /* get resource usage */
        getrusage(RUSAGE_SELF, & ru);
        tu = convert(ru.ru_utime);
        ty = convert(ru.ru_stime);

        /* compute elapsed time and percentage */
        elapse = te - ts;
        pct = 100.0 * (tu + ty) / (elapse);

        /* times are in microseconds, so 6 digits is enough */
        printf("user %.6f syst %.6f elapse %.6f pct %.2f%%\n", tu, ty, elapse, pct);

...

static double
convert(struct timeval tv)
        {
        /* convert a timeval to a double */
        double sec;
        double usec;

        /* extract the parts of the structure */
        usec = tv.tv_usec;
        sec = tv.tv_sec;

        /* combine the parts */
#define USEC_PER_SEC 1.0e6
        sec += usec / USEC_PER_SEC;

        return sec;

        /* end of function convert() */
        }

...end

any suggestions? i know that both ru_utime and ru_stime are struct timevals,
and that they are at the beginning of the struct rusage definition, but beyond
that, i can't see what is happening (by the way, the search program is
producing correct answers; it is only the timing that is wrong)

i would be happy to perform experiments to solve this problem, and i know that
i have to be patient (after all, each experiment takes over two weeks)

more later,
cal

Dr. Christopher Landauer
Aerospace Integration Science Center
The Aerospace Corporation, Mail Stop M6/214
P.O.Box 92957
Los Angeles, California 90009-2957, USA
e-mail: cal@aero.org, Phone: +1 (310) 336-1361



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200501212249.j0LMnfpJ091129>