From owner-freebsd-arch@FreeBSD.ORG  Sun Jun 17 06:37:24 2007
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: freebsd-arch@freebsd.org
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id BE1D216A46B;
	Sun, 17 Jun 2007 06:37:24 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail31.syd.optusnet.com.au (mail31.syd.optusnet.com.au
	[211.29.132.102])
	by mx1.freebsd.org (Postfix) with ESMTP id 5290C13C480;
	Sun, 17 Jun 2007 06:37:24 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from besplex.bde.org (c220-239-235-248.carlnfd3.nsw.optusnet.com.au
	[220.239.235.248])
	by mail31.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	l5H6b8Gf022524
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Sun, 17 Jun 2007 16:37:13 +1000
Date: Sun, 17 Jun 2007 16:37:08 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Jeff Roberson <jroberson@chesapeake.net>
In-Reply-To: <20070606152352.H606@10.0.0.1>
Message-ID: <20070617153238.K21498@besplex.bde.org>
References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org>
	<20070529121653.P661@10.0.0.1> <20070530065423.H93410@delplex.bde.org>
	<20070529141342.D661@10.0.0.1> <20070530125553.G12128@besplex.bde.org>
	<20070529201255.X661@10.0.0.1> <20070529220936.W661@10.0.0.1>
	<20070530201618.T13220@besplex.bde.org> <20070530115752.F661@10.0.0.1>
	<20070531091419.S826@besplex.bde.org> <20070531010631.N661@10.0.0.1>
	<20070601154833.O4207@besplex.bde.org> <20070601014601.I799@10.0.0.1>
	<20070601200348.G6201@delplex.bde.org> <20070601123530.B606@10.0.0.1>
	<20070604160036.N1084@besplex.bde.org> <46652D17.5090903@FreeBSD.org>
	<20070605214404.X47001@delplex.bde.org> <20070606152352.H606@10.0.0.1>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Attilio Rao <attilio@freebsd.org>, freebsd-arch@freebsd.org
Subject: Re: Updated rusage patch
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 17 Jun 2007 06:37:24 -0000

On Wed, 6 Jun 2007, Jeff Roberson wrote:

> I'd like to make a list of the remaining problems with rusage and potential 
> fixes.  Then we can decide which ones myself and attilio will resolve 
> immediately to clean up some of the effect of the sched lock changes.

I haven't verified which of these fixes is necessary and/or has been done
yet.  The list is a bit incomplete.

3 more minor problems turned up (one caused by applying one of these
fixes?).

(1)
Results of some makeworlds run after the thread lock changes, all with
fixes for pagezero (best previous result 827 seconds; results without
touching pagezero ~845 seconds without PREEMPTION; ~837 seconds with
PREEMPTION).  Only the differences in the following results are interesting.

% Sat Jun  9 03:28:33 UTC 2007:
% 831.61 real      1308.57 user       184.80 sys
%    1320199  voluntary context switches
%    1533639  involuntary context switches
% pgzero time 7 seconds

Base result.

% Wed Jun 13 14:52:15 UTC 2007:
% 833.97 real      1291.71 user       201.64 sys
%    1329247  voluntary context switches
%    1518959  involuntary context switches
% pgzero time 7 seconds

Some change between June 9 and June 13 made a big difference to the user+sys
decomposition.  I think the June 9 result is more correct.

% Wed Jun 13 14:52:15 UTC 2007:
% Same kernel as previous with HZ = 1000 (HZ = 100 except as noted); stathz = 100
% 836.24 real      1310.22 user       191.04 sys
%    1323793  voluntary context switches
%    1559229  involuntary context switches
% pgzero time 7 seconds

The accuracy of the decomposition depends mainly on stathz (the
decomposition is based on statclock tick counts, and there is a
significant bias towards system time when the tick counts are all 0
-- see calcru1() -- which is reduced by increasing stathz).  I forgot
that stathz != HZ and tried the HZ = 1000 pessimization to fix it.
This somehow gave the old decomposition.
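
To illustrate the bias, here is a minimal sketch of a tick-proportional
split.  It is not the real calcru1(): the names split_runtime and
struct split are hypothetical, interrupt ticks are left out, and the
all-zero fallback (charge everything to system time) is modelled on the
behaviour described above.

```c
#include <stdint.h>

/*
 * Hypothetical sketch, not the real calcru1(): split a total runtime
 * (in microseconds) into user and system parts in proportion to the
 * statclock tick counts.
 */
struct split {
	uint64_t user_us;
	uint64_t sys_us;
};

static struct split
split_runtime(uint64_t total_us, uint64_t uticks, uint64_t sticks)
{
	struct split s;
	uint64_t tt;

	tt = uticks + sticks;
	if (tt == 0) {
		/* No ticks were sampled: pretend there was 1 system tick. */
		sticks = 1;
		tt = 1;
	}
	s.user_us = total_us * uticks / tt;
	s.sys_us = total_us - s.user_us;
	return (s);
}
```

With this sketch, a thread that ran for 5000 usec without ever being hit
by a statclock tick is reported as pure system time.  A larger stathz
makes the all-zero case rarer, which is why it reduces the bias.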

(2)
Found by reading the code, in sched_throw() (from sched_4bsd.c; the version
in sched_ule.c is identical; duplicating this is another bug):

% /*
%  * A CPU is entering for the first time or a thread is exiting.
%  */
% void
% sched_throw(struct thread *td)
% {
% 	/*
% 	 * Correct spinlock nesting.  The idle thread context that we are
% 	 * borrowing was created so that it would start out with a single
% 	 * spin lock (sched_lock) held in fork_trampoline().  Since we've
% 	 * explicitly acquired locks in this function, the nesting count
% 	 * is now 2 rather than 1.  Since we are nested, calling
% 	 * spinlock_exit() will simply adjust the counts without allowing
% 	 * spin lock using code to interrupt us.
% 	 */
% 	if (td == NULL) {
% 		mtx_lock_spin(&sched_lock);
% 		spinlock_exit();
% 	} else {
% 		MPASS(td->td_lock == &sched_lock);
% 	}

Comment doesn't match code (comment only applies to td == NULL case).

% 	mtx_assert(&sched_lock, MA_OWNED);
% 	KASSERT(curthread->td_md.md_spinlock_count == 1, ("invalid count"));
% 	PCPU_SET(switchtime, cpu_ticks());
% 	PCPU_SET(switchticks, ticks);
% 	cpu_throw(td, choosethread());	/* doesn't return */
% }

Setting switchtime, etc., here loses the delta between the current
time and switchtime.  Old code only sets switchtime when a CPU is
entering for the first time.  switchtime is normally not actually
a switch time, but is set by thread_exit() just before calling here.
Not much time should be lost from this, but a lot seems to be lost in practice.
According to a benchmark that does 100000 fork/wait/exits:

         2.99 real         0.13 user         2.78 sys

About 3% of the time (2.99 - 0.13 - 2.78 = 0.08 seconds) is not accounted
for.  Interrupt and kernel thread time can only account for < 1%.
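
The benchmark source is not shown here.  A minimal sketch of a loop of
this kind, assuming one child at a time and an immediate _exit() in the
child, could look like the following (fork_wait_exit is a hypothetical
name):

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork n children one at a time; each child exits immediately and the
 * parent waits for it.  Returns the number of children that were
 * reaped with a zero exit status, or -1 if fork() or waitpid() fails.
 */
static int
fork_wait_exit(int n)
{
	int i, ok, status;
	pid_t pid;

	ok = 0;
	for (i = 0; i < n; i++) {
		pid = fork();
		if (pid == -1)
			return (-1);
		if (pid == 0)
			_exit(0);	/* child: exit at once */
		if (waitpid(pid, &status, 0) != pid)
			return (-1);
		if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
			ok++;
	}
	return (ok);
}
```

Calling fork_wait_exit(100000) from main() and running the program under
time(1) gives real, user and sys figures of the kind shown above; the gap
between real and user+sys is the unaccounted-for time.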

Old code didn't get this nearly right either, despite my attempts to
minimize the unaccounted-for time.  Fixing it should be easier now.
Of course, the part of the time for exiting cannot _all_ be accounted
to the exiting thread.  I want as much of it as possible to go there
and the rest to the next thread (which might be idlethread in general,
so the time would be almost invisible, but for the fork-wait-exit
benchmark the fork-wait thread should always be switched to next to
complete its wait()).
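
The loss can be shown with a small model.  This is only a sketch under
assumed names (struct thr, struct cpu, charge_and_reset and reset_only
are hypothetical stand-ins, not kernel interfaces): resetting the per-CPU
timestamp without first charging the elapsed interval to some thread
makes that interval disappear from everyone's totals.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the per-thread and per-CPU state. */
struct thr {
	uint64_t runtime;	/* accumulated cpu ticks */
};

struct cpu {
	uint64_t switchtime;	/* cpu ticks at the last accounting point */
};

/* Charge the time since the last accounting point to td, then restart. */
static void
charge_and_reset(struct cpu *pc, struct thr *td, uint64_t now)
{
	td->runtime += now - pc->switchtime;
	pc->switchtime = now;
}

/* Restart the interval without charging anyone: the delta is lost. */
static void
reset_only(struct cpu *pc, uint64_t now)
{
	pc->switchtime = now;
}
```

If the accounting point is at tick 130, reset_only() runs at tick 150 and
the next charge_and_reset() at tick 160, then only 10 ticks are charged
and the 20 ticks between 130 and 150 belong to no thread.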

(3)
Bugs found while grepping near cpu_throw:
- kern_thread.c has cpu_throw() hard-coded in 4 comments and one string,
   but now only calls sched_throw().
- sched_throw() is not declared as non-returning in sys/sched.h.
- kern_thread.c has a bogus panic and NOTREACHED comment after its call to
   sched_throw(), which doesn't return.
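
As a userland sketch of what the non-returning declaration buys (not the
kernel code: throw_to and run_and_catch are hypothetical, and the
GCC/Clang attribute stands in for the __dead2 macro from sys/cdefs.h),
the caller below needs neither a panic nor a NOTREACHED comment after
the call:

```c
#include <setjmp.h>

static jmp_buf resume_point;
static int thrown_code;

/* Declared as non-returning; in the kernel this would be __dead2. */
static void throw_to(int code) __attribute__((__noreturn__));

static void
throw_to(int code)
{
	thrown_code = code;
	longjmp(resume_point, 1);	/* never returns to the caller */
}

/* Returns the code passed through throw_to(). */
static int
run_and_catch(int code)
{
	if (setjmp(resume_point) != 0)
		return (thrown_code);
	throw_to(code);
	/* Nothing needed here: the compiler knows control cannot arrive. */
}
```

Because of the attribute, the compiler accepts run_and_catch() without a
trailing return statement and can diagnose code placed after the call as
unreachable.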

Bruce