From owner-freebsd-hackers@FreeBSD.ORG Mon Jun 11 15:33:28 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B7FB2106564A for ; Mon, 11 Jun 2012 15:33:28 +0000 (UTC) (envelope-from eric@vangyzen.net) Received: from aussmtpmrkpc120.us.dell.com (aussmtpmrkpc120.us.dell.com [143.166.82.159]) by mx1.freebsd.org (Postfix) with ESMTP id 889E68FC08 for ; Mon, 11 Jun 2012 15:33:28 +0000 (UTC) X-Loopcount0: from 64.238.244.148 X-IronPort-AV: E=Sophos;i="4.77,389,1336366800"; d="scan'208";a="504918485" Received: from mail.compellent.com ([64.238.244.148]) by aussmtpmrkpc120.us.dell.com with ESMTP; 11 Jun 2012 10:33:22 -0500 Message-ID: <4FD60FC2.3060206@vangyzen.net> Date: Mon, 11 Jun 2012 10:33:22 -0500 From: Eric van Gyzen User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:12.0) Gecko/20120531 Thunderbird/12.0.1 MIME-Version: 1.0 To: Andrey Zonov References: <4FC6748B.5030708@zonov.org> <4FC71F13.6040008@zonov.org> In-Reply-To: <4FC71F13.6040008@zonov.org> Content-Type: text/plain; charset="ISO-8859-1"; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-hackers@freebsd.org Subject: Re: usertime stale at about 371k seconds X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 11 Jun 2012 15:33:28 -0000 On 05/31/2012 02:34, Andrey Zonov wrote: > On 5/30/12 11:27 PM, Andrey Zonov wrote: >> Hi, >> >> I have long running process for which `ps -o usertime -p $pid' shows >> always the same time - 6190:07.65, `ps -o cputime -p $pid' for the same >> process continue to grow and now it's 21538:53.61. It looks like >> overflow in resource usage code or something. >> > > I reproduced that problem with attached program. I ran it with 23 > threads on machine with 24 CPUs and after night I see this: > > $ ps -o usertime,time -p 24134 && sleep 60 && ps -o usertime,time -p 24134 > USERTIME TIME > 6351:24.74 14977:35.19 > USERTIME TIME > 6351:24.74 15000:34.53 > > Per thread user-time counts correct: > > $ ps -H -o usertime,time -p 24134 > USERTIME TIME > 0:00.00 0:00.00 > 652:35.84 652:38.59 > 652:34.75 652:37.97 > 652:50.46 652:51.97 > 652:38.93 652:43.08 > 652:39.73 652:43.36 > 652:44.09 652:47.36 > 652:56.49 652:57.94 > 652:51.84 652:54.41 > 652:37.48 652:41.57 > 652:36.61 652:40.90 > 652:39.41 652:42.52 > 653:03.72 653:06.72 > 652:49.96 652:53.25 > 652:45.92 652:49.03 > 652:40.33 652:42.05 > 652:46.53 652:49.31 > 652:44.77 652:47.33 > 653:00.54 653:02.24 > 652:33.31 652:36.13 > 652:51.03 652:52.91 > 652:50.73 652:52.71 > 652:41.32 652:44.64 > 652:59.86 653:03.25 > > (kgdb) p $my->p_rux > $14 = {rux_runtime = 2171421985692826, rux_uticks = 114886093, > rux_sticks = 8353, rux_iticks = 0, rux_uu = 381084736784, rux_su = > 65773652, rux_tu = 904571706136} > (kgdb) p $my->p_rux > $15 = {rux_runtime = 2191831516209186, rux_uticks = 115966087, > rux_sticks = 8444, rux_iticks = 0, rux_uu = 381084736784, rux_su = > 66458587, rux_tu = 913099969825} > > As you can see rux_uu stale, but rux_uticks still ticks. I think the > problem is in calcru1(). This expression > > uu = (tu * ut) / tt > > overflows. > > I applied the following patch: > > Index: /usr/src/sys/kern/kern_resource.c > =================================================================== > --- /usr/src/sys/kern/kern_resource.c (revision 235394) > +++ /usr/src/sys/kern/kern_resource.c (working copy) > @@ -885,7 +885,7 @@ calcru1(struct proc *p, struct rusage_ext *ruxp, s > struct timeval *sp) > { > /* {user, system, interrupt, total} {ticks, usec}: */ > - uint64_t ut, uu, st, su, it, tt, tu; > + uint64_t ut, uu, st, su, it, tt, tu, tmp; > > ut = ruxp->rux_uticks; > st = ruxp->rux_sticks; > @@ -909,10 +909,20 @@ calcru1(struct proc *p, struct rusage_ext *ruxp, s > * The normal case, time increased. > * Enforce monotonicity of bucketed numbers. > */ > - uu = (tu * ut) / tt; > + if (ut == 0) > + uu = 0; > + else { > + tmp = tt / ut; > + uu = tmp ? tu / tmp : 0; > + } > if (uu < ruxp->rux_uu) > uu = ruxp->rux_uu; > > and now ran test again. This looks related to, and possibly identical to, PR kern/76972: http://www.freebsd.org/cgi/query-pr.cgi?pr=76972 If you filed a PR, please submit a follow-up to both PRs so they reference each other. Thanks, Eric