From owner-freebsd-hackers@FreeBSD.ORG Tue May 4 20:51:02 2010 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id C9B2C1065670 for ; Tue, 4 May 2010 20:51:02 +0000 (UTC) (envelope-from ak@natsys-lab.com) Received: from omr3.networksolutionsemail.com (omr3.networksolutionsemail.com [205.178.146.53]) by mx1.freebsd.org (Postfix) with ESMTP id 75C708FC1A for ; Tue, 4 May 2010 20:51:02 +0000 (UTC) Received: from cm-omr6 (mail.networksolutionsemail.com [205.178.146.50]) by omr3.networksolutionsemail.com (8.13.6/8.13.6) with ESMTP id o44Kp1o9011618 for ; Tue, 4 May 2010 16:51:01 -0400 Authentication-Results: cm-omr6 smtp.user=ak; auth=pass (CRAM-MD5) Received: from [81.200.119.167] ([81.200.119.167:60518] helo=[81.200.119.167]) by cm-omr6 (envelope-from ) (ecelerity 2.2.2.41 r(31179/31189)) with ESMTPA id A1/44-29263-3B880EB4; Tue, 04 May 2010 16:51:01 -0400 Message-ID: <4BE0C075.2040106@natsys-lab.com> Date: Wed, 05 May 2010 00:48:53 +0000 From: Alexander Krizhanovsky Organization: NatSys Lab. User-Agent: Thunderbird 2.0.0.23 (X11/20100322) MIME-Version: 1.0 To: Kostik Belousov References: <4BCBA7F8.3060109@natsys-lab.com> <20100425145929.GJ2422@deviant.kiev.zoral.com.ua> <4BDF2E4C.4090009@natsys-lab.com> <20100503163524.GE23646@deviant.kiev.zoral.com.ua> In-Reply-To: <20100503163524.GE23646@deviant.kiev.zoral.com.ua> Content-Type: text/plain; charset=KOI8-R; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-hackers@freebsd.org, svn-src-all@freebsd.org, ur4ina@gmail.com Subject: Re: [PATCH] RUSAGE_THREAD X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 04 May 2010 20:51:03 -0000 Konstantin, Concerning i/o counters we collect them in rucollect() in for loop and update in various places, for example in vfs_bio.c. Rusage of an exiting threads is handled in exit1() -> ruadd(). Your version of the patch certainly is more robust, however I'm still concerning about following things, which could be done better: 1) Zeroing of thread runtime statistic looks fine if we use it only for whole process statistic calculating, but now (when it's also used as is for the thread statistic) it should be handled independently, i.e. RUSAGE_{SELF,CHILDREN} calls should not affect the thread structures somehow. So I suppose we need to introduce some new counters to proc structure and counters update code (it was stopped me to go on this way). 2) With first in mind, getruasge(RUSAGE_THREAD) is called in current thread and doesn't affect or use information from other threads, so it definitely should use less number of locks, may be with atomic operations for read-update-write operations. In fact the same getting per-thread statistic in Linux is done _without_ locks at all (however Linux uses different process/thread model). If we're in time and it really looks like a problem (is getrusage() ever a hotspot?) I can try to prepare the patch with less locking on this weekend. Also I still don't understand the sanity check in calccru1() for negativeness of tu ( (int64_t)tu < 0 ) - is it reachable? It seems that it is possible only after about 300 thousand years of uptime... Could you please explain this? Should I write further about the patch to svn-src-all@ (sorry, but I'm new in FreeBSD mailing) ? Kostik Belousov wrote: > On Mon, May 03, 2010 at 08:13:00PM +0000, Alexander Krizhanovsky wrote: > >> Kostik, >> >> thank you very much for the review! >> >> Kostik Belousov wrote: >> >>> On Mon, Apr 19, 2010 at 12:46:48AM +0000, Alexander Krizhanovsky wrote: >>> >>> >>>> Hi all! >>>> >>>> This patch implements per-thread rusage statistic (like RUSAGE_THREAD in >>>> Linux and RUSAGE_LWP in OpenSolaris). >>>> >>>> Unfortunately, we have to acquire a number of locks to read/update >>>> system and user times for current thread rusage information because it's >>>> also used for whole process statistic and needs to be zeroed. >>>> >>>> Any comments are very appreciated. >>>> >>>> It's tested against 8.0-RELEASE. Appropriate PR is submitted. >>>> >>>> -- >>>> Alexander Krizhanovsky >>>> NatSys Lab. (http://natsys-lab.com) >>>> tel: +7 (916) 717-3899, +7 (499) 747-6304 >>>> e-mail: ak@natsys-lab.com >>>> >>>> >>>> >>> I think this should be somewhat modified before commit. >>> >>> First, please separate patch into two pieces. One would introduce >>> ruxagg_tlock() helper and use it in existing locations instead of >>> three-liners. Possibly, add K&R->ANSI C kern_getrusage definition >>> change. >>> >>> Second would actually add RUSAGE_THREAD. There, please follow >>> the style(9), in particular, do not initialize the local variables >>> at the declaration, instead, use explicit initialization in the code. >>> >>> >> Done. I have also introduced third patch for casting microseconds to >> timeval and replaced a few appropriate places in whole kernel code. >> patch-rusage-thread-03052010.txt is incremental for >> patch-ruxagg_tlock-03052010.txt, which is by turn incremental for >> patch-usec2timeval-03052010.txt. >> >> I have made following change for calcru1(): >> - if ((int64_t)tu < 0) { >> - /* XXX: this should be an assert /phk */ >> - printf("calcru: negative runtime of %jd usec for pid %d >> (%s)\n", >> - (intmax_t)tu, p->p_pid, p->p_comm); >> - tu = ruxp->rux_tu; >> - } >> + KASSERT((int64_t)tu < 0, ("calcru: negative runtime of %jd usec " >> + "for pid %d (%s)\n", >> + (intmax_t)tu, p->p_pid, p->p_comm)); >> >> tu can become negative at about 300 thousand years of uptime, so this >> condition seems quite unrealistic and indeed should be replaced by an >> assertion. However I didn't understand why it isn't done so far and the >> comment only was added. Did I miss something? >> >> >>> Should calctru() share the code with calcru1() ? I am esp. concerned >>> with sanity check like negative runtime etc. Possibly, this code >>> >> >from calcru1() should be separated into function used from both >> >>> calcru1() and calctru(). >>> >>> Man page for getrusage(2) should be updated. >>> >>> >> It's added to patch-rusage-thread-03052010.txt. >> >> In the end - shoud I raise a new PR (the original one is kern/145813)? >> >> >>> Thanks for the submission. >>> > > It was some time, I already committed ruxagg_tlock() part, that caused > some feedback, and I reworked the rest of the patch myself. See svn-src@ > for some discussion, and the latest version that I intend to commit > shortly is below. Your comments are welcome. > > Lets discuss third patch after this is landed. > > diff --git a/lib/libc/sys/getrusage.2 b/lib/libc/sys/getrusage.2 > index bdf5d45..423503f 100644 > --- a/lib/libc/sys/getrusage.2 > +++ b/lib/libc/sys/getrusage.2 > @@ -28,7 +28,7 @@ > .\" @(#)getrusage.2 8.1 (Berkeley) 6/4/93 > .\" $FreeBSD$ > .\" > -.Dd June 4, 1993 > +.Dd May 1, 2010 > .Dt GETRUSAGE 2 > .Os > .Sh NAME > @@ -42,6 +42,7 @@ > .In sys/resource.h > .Fd "#define RUSAGE_SELF 0" > .Fd "#define RUSAGE_CHILDREN -1" > +.Fd "#define RUSAGE_THREAD 1" > .Ft int > .Fn getrusage "int who" "struct rusage *rusage" > .Sh DESCRIPTION > @@ -49,11 +50,12 @@ The > .Fn getrusage > system call > returns information describing the resources utilized by the current > -process, or all its terminated child processes. > +thread, the current process, or all its terminated child processes. > The > .Fa who > argument is either > -.Dv RUSAGE_SELF > +.Dv RUSAGE_THREAD , > +.Dv RUSAGE_SELF , > or > .Dv RUSAGE_CHILDREN . > The buffer to which > @@ -175,6 +177,10 @@ The > .Fn getrusage > system call appeared in > .Bx 4.2 . > +The > +.Dv RUSAGE_THREAD > +facility first appeared in > +.Fx 8.1 . > .Sh BUGS > There is no way to obtain information about a child process > that has not yet terminated. > diff --git a/sys/kern/kern_proc.c b/sys/kern/kern_proc.c > index 49a3097..6c50f1f 100644 > --- a/sys/kern/kern_proc.c > +++ b/sys/kern/kern_proc.c > @@ -901,7 +901,7 @@ fill_kinfo_thread(struct thread *td, struct kinfo_proc *kp, int preferthread) > kp->ki_pri.pri_user = td->td_user_pri; > > if (preferthread) { > - kp->ki_runtime = cputick2usec(td->td_runtime); > + kp->ki_runtime = td->td_rux.rux_runtime; > kp->ki_pctcpu = sched_pctcpu(td); > kp->ki_estcpu = td->td_estcpu; > } > diff --git a/sys/kern/kern_resource.c b/sys/kern/kern_resource.c > index a3ed75d..0bc78d0 100644 > --- a/sys/kern/kern_resource.c > +++ b/sys/kern/kern_resource.c > @@ -76,7 +76,7 @@ static void calcru1(struct proc *p, struct rusage_ext *ruxp, > struct timeval *up, struct timeval *sp); > static int donice(struct thread *td, struct proc *chgp, int n); > static struct uidinfo *uilookup(uid_t uid); > -static void ruxagg_tlock(struct proc *p, struct thread *td); > +static void ruxagg(struct proc *p, struct thread *td); > > /* > * Resource controls and accounting. > @@ -630,7 +630,7 @@ lim_cb(void *arg) > return; > PROC_SLOCK(p); > FOREACH_THREAD_IN_PROC(p, td) { > - ruxagg_tlock(p, td); > + ruxagg(p, td); > } > PROC_SUNLOCK(p); > if (p->p_rux.rux_runtime > p->p_cpulimit * cpu_tickrate()) { > @@ -841,7 +841,7 @@ calcru(struct proc *p, struct timeval *up, struct timeval *sp) > FOREACH_THREAD_IN_PROC(p, td) { > if (td->td_incruntime == 0) > continue; > - ruxagg_tlock(p, td); > + ruxagg(p, td); > } > calcru1(p, &p->p_rux, up, sp); > } > @@ -961,6 +961,16 @@ kern_getrusage(struct thread *td, int who, struct rusage *rup) > calccru(p, &rup->ru_utime, &rup->ru_stime); > break; > > + case RUSAGE_THREAD: > + PROC_SLOCK(p); > + ruxagg(p, td); > + PROC_SUNLOCK(p); > + thread_lock(td); > + *rup = td->td_ru; > + calcru1(p, &td->td_rux, &rup->ru_utime, &rup->ru_stime); > + thread_unlock(td); > + break; > + > default: > error = EINVAL; > } > @@ -1001,7 +1011,7 @@ ruadd(struct rusage *ru, struct rusage_ext *rux, struct rusage *ru2, > * Aggregate tick counts into the proc's rusage_ext. > */ > void > -ruxagg(struct rusage_ext *rux, struct thread *td) > +ruxagg_locked(struct rusage_ext *rux, struct thread *td) > { > > THREAD_LOCK_ASSERT(td, MA_OWNED); > @@ -1010,18 +1020,19 @@ ruxagg(struct rusage_ext *rux, struct thread *td) > rux->rux_uticks += td->td_uticks; > rux->rux_sticks += td->td_sticks; > rux->rux_iticks += td->td_iticks; > - td->td_incruntime = 0; > - td->td_uticks = 0; > - td->td_iticks = 0; > - td->td_sticks = 0; > } > > static void > -ruxagg_tlock(struct proc *p, struct thread *td) > +ruxagg(struct proc *p, struct thread *td) > { > > thread_lock(td); > - ruxagg(&p->p_rux, td); > + ruxagg_locked(&p->p_rux, td); > + ruxagg_locked(&td->td_rux, td); > + td->td_incruntime = 0; > + td->td_uticks = 0; > + td->td_iticks = 0; > + td->td_sticks = 0; > thread_unlock(td); > } > > @@ -1039,7 +1050,7 @@ rufetch(struct proc *p, struct rusage *ru) > *ru = p->p_ru; > if (p->p_numthreads > 0) { > FOREACH_THREAD_IN_PROC(p, td) { > - ruxagg_tlock(p, td); > + ruxagg(p, td); > rucollect(ru, &td->td_ru); > } > } > diff --git a/sys/kern/kern_thread.c b/sys/kern/kern_thread.c > index 9be4c2f..b32c584 100644 > --- a/sys/kern/kern_thread.c > +++ b/sys/kern/kern_thread.c > @@ -432,7 +432,7 @@ thread_exit(void) > PROC_UNLOCK(p); > thread_lock(td); > /* Save our tick information with both the thread and proc locked */ > - ruxagg(&p->p_rux, td); > + ruxagg_locked(&p->p_rux, td); > PROC_SUNLOCK(p); > td->td_state = TDS_INACTIVE; > #ifdef WITNESS > diff --git a/sys/sys/proc.h b/sys/sys/proc.h > index fb31cfc..e32e494 100644 > --- a/sys/sys/proc.h > +++ b/sys/sys/proc.h > @@ -173,6 +173,27 @@ struct kdtrace_thread; > struct cpuset; > > /* > + * XXX: Does this belong in resource.h or resourcevar.h instead? > + * Resource usage extension. The times in rusage structs in the kernel are > + * never up to date. The actual times are kept as runtimes and tick counts > + * (with control info in the "previous" times), and are converted when > + * userland asks for rusage info. Backwards compatibility prevents putting > + * this directly in the user-visible rusage struct. > + * > + * Locking for p_rux: (cj) means (j) for p_rux and (c) for p_crux. > + * Locking for td_rux: (t) for all fields. > + */ > +struct rusage_ext { > + u_int64_t rux_runtime; /* (cj) Real time. */ > + u_int64_t rux_uticks; /* (cj) Statclock hits in user mode. */ > + u_int64_t rux_sticks; /* (cj) Statclock hits in sys mode. */ > + u_int64_t rux_iticks; /* (cj) Statclock hits in intr mode. */ > + u_int64_t rux_uu; /* (c) Previous user time in usec. */ > + u_int64_t rux_su; /* (c) Previous sys time in usec. */ > + u_int64_t rux_tu; /* (c) Previous total time in usec. */ > +}; > + > +/* > * Kernel runnable context (thread). > * This is what is put to sleep and reactivated. > * Thread context. Processes may have multiple threads. > @@ -219,7 +240,8 @@ struct thread { > u_int td_estcpu; /* (t) estimated cpu utilization */ > int td_slptick; /* (t) Time at sleep. */ > int td_blktick; /* (t) Time spent blocked. */ > - struct rusage td_ru; /* (t) rusage information */ > + struct rusage td_ru; /* (t) rusage information. */ > + struct rusage_ext td_rux; /* (t) Internal rusage information. */ > uint64_t td_incruntime; /* (t) Cpu ticks to transfer to proc. */ > uint64_t td_runtime; /* (t) How many cpu ticks we've run. */ > u_int td_pticks; /* (t) Statclock hits for profiling */ > @@ -426,26 +448,6 @@ do { \ > #define TD_SET_CAN_RUN(td) (td)->td_state = TDS_CAN_RUN > > /* > - * XXX: Does this belong in resource.h or resourcevar.h instead? > - * Resource usage extension. The times in rusage structs in the kernel are > - * never up to date. The actual times are kept as runtimes and tick counts > - * (with control info in the "previous" times), and are converted when > - * userland asks for rusage info. Backwards compatibility prevents putting > - * this directly in the user-visible rusage struct. > - * > - * Locking: (cj) means (j) for p_rux and (c) for p_crux. > - */ > -struct rusage_ext { > - u_int64_t rux_runtime; /* (cj) Real time. */ > - u_int64_t rux_uticks; /* (cj) Statclock hits in user mode. */ > - u_int64_t rux_sticks; /* (cj) Statclock hits in sys mode. */ > - u_int64_t rux_iticks; /* (cj) Statclock hits in intr mode. */ > - u_int64_t rux_uu; /* (c) Previous user time in usec. */ > - u_int64_t rux_su; /* (c) Previous sys time in usec. */ > - u_int64_t rux_tu; /* (c) Previous total time in usec. */ > -}; > - > -/* > * Process structure. > */ > struct proc { > diff --git a/sys/sys/resource.h b/sys/sys/resource.h > index 9af96af..e703744 100644 > --- a/sys/sys/resource.h > +++ b/sys/sys/resource.h > @@ -56,6 +56,7 @@ > > #define RUSAGE_SELF 0 > #define RUSAGE_CHILDREN -1 > +#define RUSAGE_THREAD 1 > > struct rusage { > struct timeval ru_utime; /* user time used */ > diff --git a/sys/sys/resourcevar.h b/sys/sys/resourcevar.h > index 21728aa..95a9b49 100644 > --- a/sys/sys/resourcevar.h > +++ b/sys/sys/resourcevar.h > @@ -131,7 +131,7 @@ void rucollect(struct rusage *ru, struct rusage *ru2); > void rufetch(struct proc *p, struct rusage *ru); > void rufetchcalc(struct proc *p, struct rusage *ru, struct timeval *up, > struct timeval *sp); > -void ruxagg(struct rusage_ext *rux, struct thread *td); > +void ruxagg_locked(struct rusage_ext *rux, struct thread *td); > int suswintr(void *base, int word); > struct uidinfo > *uifind(uid_t uid); > -- Alexander Krizhanovsky NatSys Lab. (http://natsys-lab.com) tel: +7 (916) 717-3899, +7 (499) 747-6304 e-mail: ak@natsys-lab.com