Date: Wed, 16 Sep 1998 17:55:15 -0700 From: Mike Smith <mike@smith.net.au> To: current@FreeBSD.ORG Cc: bde@FreeBSD.ORG Subject: Death by SIGXCPU (problems with our clock code) Message-ID: <199809170055.RAA01288@dingo.cdrom.com>
next in thread | raw e-mail | index | archive | help
This is a multipart MIME message. --==_Exmh_-21291286440 Content-Type: text/plain; charset=us-ascii Many of us have been seeing problems with processes being killed for absurd CPU time usage recently, and the blame has been waved everywhere from APM (it appears to be as powerful as the Good Times virus when it comes to getting into new systems) to the quality of pickled herring this time of year. Since nobody else has taken up my suggestion to instrument the code to find out what's going on, I've shouldered the cross. I forwarded some of this to Poul earlier today, but I realise he may not be able to do anything about it immediately, so I thought I should open the matter for general study. Attached are some small diffs to kern_clock.c and kern_synch.c which check various time values and report on oddities. The indication at the moment would seem to be that there is a problem in the arithmetic in kern_clock.c:tco_forward(), and I invite our arithmetic pedants to study it briefly to see if they can spot any problems, or to suggest further tests to bring the culprits to justice. Meanwhile, here's the output I've seen so far: mi_switch: switching history 1985.236473 > 1985.-695149134 mi_switch: switching history 1987.693624450 > 1985.241575 calcru: negative time of -1390647919 usec for pid 97 (syslogd) getmicrotime: 905988347.835090 > 905988347.825089 microtime: 905988936.547000 > 905988936.-694832253 microtime: 905989985.342863 > 905989985.-695036431 microtime: 905991238.525890 > 905991238.-694847527 mi_switch: switching history 6224.460675 > 6224.-694925183 mi_switch: switching history 6226.693847510 > 6224.462841 mi_switch: switching history 6228.691848697 > 6224.464297 mi_switch: switching history 6230.689849221 > 6224.464892 calcru: negative time of -290624527 usec for pid 278 (xterm) calcru: negative time of -685601360 usec for pid 272 (xterm) mi_switch: switching history 7182.665240 > 7182.-694720727 mi_switch: switching history 7184.694051835 > 7182.667612 mi_switch: switching history 7186.692054475 > 7182.670740 mi_switch: switching history 7188.690056433 > 7182.696035 mi_switch: switching history 7190.688059242 > 7182.716614 mi_switch: switching history 7192.686061114 > 7182.726335 mi_switch: switching history 7194.684063012 > 7182.735806 mi_switch: switching history 7196.682064876 > 7182.745597 mi_switch: switching history 7198.680066750 > 7182.754845 mi_switch: switching history 7200.678068675 > 7182.764626 mi_switch: switching history 7202.676069640 > 7182.766654 mi_switch: switching history 7204.674071377 > 7182.783861 mi_switch: switching history 7206.672073450 > 7182.793647 mi_switch: switching history 7208.670075397 > 7182.803403 mi_switch: switching history 7210.668077325 > 7182.813224 mi_switch: switching history 7212.666079274 > 7182.822987 mi_switch: switching history 7214.664081179 > 7182.832737 mi_switch: switching history 7216.662083083 > 7182.842469 mi_switch: switching history 7218.660084985 > 7182.852329 mi_switch: switching history 7220.658086912 > 7182.862197 mi_switch: switching history 7222.656088814 > 7182.865549 mi_switch: switching history 7224.654090663 > 7182.874002 mi_switch: switching history 7226.652092595 > 7182.891112 mi_switch: switching history 7228.650094568 > 7182.900933 mi_switch: switching history 7230.648096446 > 7182.910791 mi_switch: switching history 7232.646098330 > 7182.920238 mi_switch: switching history 7234.644100213 > 7182.930085 mi_switch: switching history 7236.642102122 > 7182.939641 mi_switch: switching history 7238.640104005 > 7182.949304 mi_switch: switching history 7240.638105889 > 7182.959102 mi_switch: switching history 7242.636107832 > 7182.965556 mi_switch: switching history 7244.634109633 > 7182.970635 mi_switch: switching history 7246.632111500 > 7182.987897 mi_switch: switching history 7248.630113451 > 7182.997693 mi_switch: switching history 7250.628115337 > 7183.5565 mi_switch: switching history 7252.626116142 > 7183.7926 mi_switch: switching history 7254.624117864 > 7183.11609 mi_switch: switching history 7256.622119701 > 7183.37161 mi_switch: switching history 7258.620121712 > 7183.46838 mi_switch: switching history 7260.618123816 > 7183.56696 mi_switch: switching history 7262.616125696 > 7183.65549 mi_switch: switching history 7264.614127486 > 7183.68279 mi_switch: switching history 7266.612129322 > 7183.85323 mi_switch: switching history 7268.610131323 > 7183.87777 mi_switch: switching history 7270.608135227 > 7183.93281 mi_switch: switching history 7272.606136952 > 7183.117345 mi_switch: switching history 7274.604138977 > 7183.125618 mi_switch: switching history 7276.602139248 > 7183.127485 mi_switch: switching history 7278.600141013 > 7183.129287 mi_switch: switching history 7280.598142789 > 7183.134399 mi_switch: switching history 7282.596144784 > 7183.136835 mi_switch: switching history 7284.594145490 > 7183.139110 mi_switch: switching history 7286.592147226 > 7183.165555 mi_switch: switching history 7288.590149056 > 7183.188125 mi_switch: switching history 7290.588151269 > 7183.205708 mi_switch: switching history 7292.586153236 > 7183.215177 mi_switch: switching history 7294.584155252 > 7183.217624 mi_switch: switching history 7296.582158959 > 7183.222991 mi_switch: switching history 7298.580160682 > 7183.247362 mi_switch: switching history 7300.578162685 > 7183.259122 mi_switch: switching history 7302.576164593 > 7183.265601 mi_switch: switching history 7304.574166441 > 7183.267790 calcru: negative time of -1224705799 usec for pid 239 (XF86_SVGA) As you can see, this accounted for two xterms and the X server in a period of about 30 minutes. Any input would be gratefully received - we need to fix this before 3.0 or it will be a total writeoff for laptop users. --==_Exmh_-21291286440 Content-Type: text/plain ; name="kern_synch.diff"; charset=us-ascii Content-Description: kern_synch.diff Content-Disposition: attachment; filename="kern_synch.diff" Index: kern_synch.c =================================================================== RCS file: /home/ncvs/src/sys/kern/kern_synch.c,v retrieving revision 1.61 diff -u -r1.61 kern_synch.c --- kern_synch.c 1998/07/15 02:32:10 1.61 +++ kern_synch.c 1998/09/16 22:20:27 @@ -604,6 +604,7 @@ register struct proc *p = curproc; /* XXX */ register struct rlimit *rlim; int x; + u_int64_t runtime; /* * XXX this spl is almost unnecessary. It is partly to allow for @@ -633,9 +634,21 @@ * process was running, and add that to its total so far. */ microuptime(&switchtime); + if (timevalcmp(&(p->p_switchtime), &switchtime, >)) + printf("mi_switch: switching history %d.%d > %d.%d\n", + p->p_switchtime.tv_sec, p->p_switchtime.tv_usec, + switchtime.tv_sec, switchtime.tv_usec); + runtime = p->p_runtime; p->p_runtime += (switchtime.tv_usec - p->p_switchtime.tv_usec) + (switchtime.tv_sec - p->p_switchtime.tv_sec) * (int64_t)1000000; - + timevalsub(&switchtime,&(p->p_switchtime)); + runtime += ((u_int64_t)switchtime.tv_sec * 1000000) + + switchtime.tv_usec; + + if (runtime != p->p_runtime) + printf("mi_switch: runtime botch, is %08x%08x should %08x%08x\n", + (int)(p->p_runtime >> 32), (int)(p->p_runtime & 0xffffffff), + (int)(runtime >> 32), (int)(runtime & 0xffffffff)); /* * Check if the process exceeds its cpu resource allocation. * If over max, kill it. --==_Exmh_-21291286440 Content-Type: text/plain ; name="kern_clock.diff"; charset=us-ascii Content-Description: kern_clock.diff Content-Disposition: attachment; filename="kern_clock.diff" Index: kern_clock.c =================================================================== RCS file: /home/ncvs/src/sys/kern/kern_clock.c,v retrieving revision 1.78 diff -u -r1.78 kern_clock.c --- kern_clock.c 1998/08/05 18:06:40 1.78 +++ kern_clock.c 1998/09/16 22:28:17 @@ -497,24 +497,37 @@ getmicrotime(struct timeval *tvp) { struct timecounter *tc; + static struct timeval otv; tc = timecounter; *tvp = tc->tc_microtime; + + if (timevalcmp(&otv, tvp, >)) + printf("getmicrotime: %d.%d > %d.%d", + otv.tv_sec, otv.tv_usec, tvp->tv_sec, tvp->tv_usec); + otv = *tvp; } void getnanotime(struct timespec *tsp) { struct timecounter *tc; + static struct timespec ots; tc = timecounter; *tsp = tc->tc_nanotime; + + if (timespeccmp(&ots, tsp, >)) + printf("getnanotime: %d.%d > %d.%d", + ots.tv_sec, ots.tv_nsec, tsp->tv_sec, tsp->tv_nsec); + ots = *tsp; } void microtime(struct timeval *tv) { struct timecounter *tc; + static struct timeval otv; tc = (struct timecounter *)timecounter; tv->tv_sec = tc->tc_offset_sec; @@ -526,6 +539,11 @@ tv->tv_usec -= 1000000; tv->tv_sec++; } + + if (timevalcmp(&otv, tv, >)) + printf("microtime: %d.%d > %d.%d", + otv.tv_sec, otv.tv_usec, tv->tv_sec, tv->tv_usec); + otv = *tv; } void @@ -534,6 +552,7 @@ unsigned count; u_int64_t delta; struct timecounter *tc; + static struct timespec ots; tc = (struct timecounter *)timecounter; ts->tv_sec = tc->tc_offset_sec; @@ -549,6 +568,11 @@ ts->tv_sec++; } ts->tv_nsec = delta; + + if (timespeccmp(&ots, ts, >)) + printf("nanotime: %d.%d > %d.%d", + ots.tv_sec, ots.tv_nsec, ts->tv_sec, ts->tv_nsec); + ots = *ts; } void --==_Exmh_-21291286440 Content-Type: text/plain; charset=us-ascii \\ Sometimes you're ahead, \\ Mike Smith \\ sometimes you're behind. \\ mike@smith.net.au \\ The race is long, and in the \\ msmith@freebsd.org \\ end it's only with yourself. \\ msmith@cdrom.com --==_Exmh_-21291286440-- To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-current" in the body of the message
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?199809170055.RAA01288>