From: Bruce Evans
Date: Sun, 17 Jun 2007 16:37:08 +1000 (EST)
To: Jeff Roberson
Cc: Attilio Rao, freebsd-arch@freebsd.org
Subject: Re: Updated rusage patch
Message-ID: <20070617153238.K21498@besplex.bde.org>

On Wed, 6 Jun 2007, Jeff Roberson wrote:

> I'd like to make a list of the remaining problems with rusage and potential
> fixes. Then we can decide which ones myself and attilio will resolve
> immediately to clean up some of the effect of the sched lock changes.

I haven't verified which of these fixes is necessary and/or has been
done yet. The list is a bit incomplete. 3 more minor problems turned up
(one caused by applying one of these fixes?).

(1) Results of some makeworlds run after the thread lock changes, all
with fixes for pagezero (best previous result 827 seconds; results
without touching pagezero ~845 seconds without PREEMPTION; ~837 seconds
with PREEMPTION). Only the differences in the following results are
interesting.

% Sat Jun 9 03:28:33 UTC 2007:
%       831.61 real      1308.57 user       184.80 sys
%    1320199 voluntary context switches
%    1533639 involuntary context switches
% pgzero time 7 seconds

Base result.

% Wed Jun 13 14:52:15 UTC 2007:
%       833.97 real      1291.71 user       201.64 sys
%    1329247 voluntary context switches
%    1518959 involuntary context switches
% pgzero time 7 seconds

Some change between June 9 and June 13 made a big difference to the
user+sys decomposition. I think the June 9 result is more correct.

% Wed Jun 13 14:52:15 UTC 2007:
% Same kernel as previous with HZ = 1000 (HZ = 100 except as noted); stathz = 100
%       836.24 real      1310.22 user       191.04 sys
%    1323793 voluntary context switches
%    1559229 involuntary context switches
% pgzero time 7 seconds

The accuracy of the decomposition depends mainly on stathz (the
decomposition is based on statclock tick counts, and there is a
significant bias towards system time when the tick counts are all 0 --
see calcru1() -- which is reduced by increasing stathz). I forgot that
stathz != HZ and tried the HZ = 1000 pessimization to fix it.
This somehow gave the old decomposition.

(2) By reading the code, in sched_throw() (from sched_4bsd.c; the version
in sched_ule.c is identical; duplicating this is another bug):

% /*
%  * A CPU is entering for the first time or a thread is exiting.
%  */
% void
% sched_throw(struct thread *td)
% {
% 	/*
% 	 * Correct spinlock nesting. The idle thread context that we are
% 	 * borrowing was created so that it would start out with a single
% 	 * spin lock (sched_lock) held in fork_trampoline(). Since we've
% 	 * explicitly acquired locks in this function, the nesting count
% 	 * is now 2 rather than 1. Since we are nested, calling
% 	 * spinlock_exit() will simply adjust the counts without allowing
% 	 * spin lock using code to interrupt us.
% 	 */
% 	if (td == NULL) {
% 		mtx_lock_spin(&sched_lock);
% 		spinlock_exit();
% 	} else {
% 		MPASS(td->td_lock == &sched_lock);
% 	}

Comment doesn't match code (comment only applies to td == NULL case).

% 	mtx_assert(&sched_lock, MA_OWNED);
% 	KASSERT(curthread->td_md.md_spinlock_count == 1, ("invalid count"));
% 	PCPU_SET(switchtime, cpu_ticks());
% 	PCPU_SET(switchticks, ticks);
% 	cpu_throw(td, choosethread());	/* doesn't return */
% }

Setting switchtime, etc., here loses the delta between the current time
and switchtime. Old code only sets switchtime when a CPU is entering
for the first time. switchtime is normally not actually a switch time,
but is set by thread_exit() just before calling here. Not much time
should be lost from this, but lots seems to be in practice. According
to a benchmark that does 100000 fork/wait/exits:

        2.99 real         0.13 user         2.78 sys

About 3% of the time (0.08 of 2.99 seconds) is not accounted for.
Interrupt and kernel thread time can only account for < 1%. Old code
didn't get this nearly right either, despite my attempts to minimize
the unaccounted-for time. Fixing it should be easier now. Of course,
the part of the time for exiting cannot _all_ be accounted to the
exiting thread.
I want as much of it as possible to go there and the rest to the next
thread (which might be idlethread in general, so the time would be
almost invisible, but for the fork-wait-exit benchmark the fork-wait
thread should always be switched to next to complete its wait()).

(3) Bugs found while grepping near cpu_throw:
- kern_thread.c has cpu_throw() hard-coded in 4 comments and one string,
  but now only calls sched_throw().
- sched_throw() is not declared as non-returning in sys/sched.h.
- kern_thread.c has a bogus panic and NOTREACHED comment after
  sched_throw() doesn't return.

Bruce
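
The accounting loss described in (2) can be illustrated with a small
user-space model. This is only a sketch under stated assumptions, not
kernel code: struct model_cpu, struct model_thread, model_charge() and
model_throw() are hypothetical names invented for the illustration, and
time is a plain integer rather than cpu_ticks(). It contrasts resetting
switchtime unconditionally at the end of sched_throw() with setting it
only when a CPU is entering for the first time (td == NULL).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for per-CPU and per-thread accounting state. */
struct model_cpu {
	uint64_t switchtime;	/* time of the last accounting point */
};

struct model_thread {
	uint64_t runtime;	/* time charged to this thread so far */
};

/* Charge the time since the last accounting point to td and restart it. */
static void
model_charge(struct model_cpu *pc, struct model_thread *td, uint64_t now)
{
	td->runtime += now - pc->switchtime;
	pc->switchtime = now;
}

/*
 * Model of the tail of sched_throw().  With reset_always != 0, switchtime
 * is overwritten for exiting threads too (the behaviour quoted in (2)).
 * Otherwise it is only set when a CPU enters for the first time
 * (td == NULL), so the pending delta is inherited by the next thread.
 * Returns the amount of time that ends up charged to nobody.
 */
static uint64_t
model_throw(struct model_cpu *pc, struct model_thread *td, uint64_t now,
    int reset_always)
{
	uint64_t lost = 0;

	if (td == NULL || reset_always) {
		if (td != NULL)
			lost = now - pc->switchtime;
		pc->switchtime = now;
	}
	return (lost);
}
```

In this model, if the exiting thread is last charged at t = 100, the
throw happens at t = 130 and the next thread is charged at t = 200, the
unconditional reset accounts for 170 of the 200 units and loses 30. The
conditional variant accounts for all 200, with the 30 units of exit
overhead going to the next thread.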