From owner-freebsd-performance@FreeBSD.ORG Wed Jun 14 03:15:47 2006
Date: Tue, 13 Jun 2006 20:15:42 -0700
From: "Kip Macy" <kip.macy@gmail.com>
Reply-To: kmacy@fsmware.com
To: "Robert Watson"
Cc: Scott Long, kmacy@freebsd.org, Paul Saab, David Xu, Kris Kennaway,
    freebsd-performance@freebsd.org, danial_thom@yahoo.com
Subject: Re: Initial 6.1 questions
In-Reply-To: <20060613105930.N34121@fledge.watson.org>
References: <20060612195754.72452.qmail@web33306.mail.mud.yahoo.com>
    <20060612210723.K26068@fledge.watson.org>
    <20060612203248.GA72885@xor.obsecurity.org>
    <200606130715.52425.davidxu@freebsd.org>
    <20060613105930.N34121@fledge.watson.org>
List-Id: Performance/tuning

I apologize if this e-mail seems a bit disjoint; I'm quite tired from
hauling stuff around today.

I'm not entirely familiar with the system as a whole, but to give a
brief rundown of what I do know: context switches, thread
prioritization, process statistics keeping, and access to a handful of
other random variables are all serialized by sched_lock.  Process
creation, process exit, and process scheduling (schedcpu()'s access to
the allproc list) are all serialized through the allproc_lock.

I've discovered that schedcpu()'s serialization needs don't fit in well
with sched_lock removal in the presence of a global process list and a
global run queue (I'll skip the tedious details for now).  In other
words, I have missing prerequisites.  My current plan for this week,
once I get back from Tahoe, is to do the following in a separate
branch:

- replace the global process list with a per-cpu process list hung off
  of pcpu, protected by a non-interrupt-disabling spin lock,
  pcpu_proclist_lock

- replace the global run queue with a per-cpu run queue hung off of
  pcpu, protected by a non-interrupt-blocking pcpu_runq_lock

Once I have this stable, I will integrate it into my branch where I
have replaced sched_lock with per-thread locks, and redo the current
locking I have in choosethread(), which I believe causes performance
and stability problems.
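For concreteness, the data-structure side would look roughly like the
following.  This is an untested sketch, not the actual diff; only
pcpu_proclist_lock and pcpu_runq_lock are names from the plan above,
and the rest (nispinlock, pcpu_proclist, pcpu_runq) are placeholders:

#include <sys/param.h>
#include <sys/queue.h>
#include <sys/runq.h>

/*
 * Placeholder type for the proposed non-interrupt-blocking spin lock;
 * an acquire/release sketch is at the end of this message.
 */
struct nispinlock {
	volatile u_int	nis_lock;
};

/* New fields hung off of struct pcpu (sys/pcpu.h): */
struct nispinlock	pcpu_proclist_lock;	/* protects pcpu_proclist */
TAILQ_HEAD(, proc)	pcpu_proclist;		/* replaces the global allproc list */
struct nispinlock	pcpu_runq_lock;		/* protects pcpu_runq */
struct runq		pcpu_runq;		/* replaces the global run queue */

schedcpu() would then walk only the local list under pcpu_proclist_lock
instead of taking allproc_lock, and ps/top would aggregate across the
per-cpu lists.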
At some point it may be desirable to add support for rebalancing the
pcpu process lists, to avoid schedcpu()/ps/top having to hold a
pcpu_proclist_lock for too long.

Why do I say "non-interrupt blocking"?  Currently we have roughly a
half dozen locking primitives.  The two that I am familiar with are
blocking and spinning mutexes.  The general policy is to use blocking
locks except where a lock is used in interrupts or in the scheduler.
It seems to me that in the scheduler, interrupts only actually need to
be blocked across cpu_switch().  Spin locks obviously have to be used,
because a thread cannot very well context switch while it is in the
middle of a context switch; however, provided td_critnest > 0, there
is no reason that interrupts need to be blocked.  Currently sched_lock
is acquired in cpu_hardclock() and statclock(), so it does need to
block interrupts.  There is no reason these two functions couldn't be
run from ast().  In my tree I set td_flags atomically to avoid the
need to acquire locks when setting or clearing flags.  All the timer
interrupt really needs to do, for the purposes of statistics etc., is
set a flag in td_flags indicating to ast() that the current thread is
returning from a timer interrupt, so that cpu_hardclock() and
statclock() are called there.  (A rough sketch of both mechanisms
follows at the end of this message.)

I have more in mind, but I'd like to keep the discussion simple by
focusing on the next week or two.

-Kip

On 6/13/06, Robert Watson wrote:
>
> On Tue, 13 Jun 2006, David Xu wrote:
>
> > On Tuesday 13 June 2006 04:32, Kris Kennaway wrote:
> >> On Mon, Jun 12, 2006 at 09:08:12PM +0100, Robert Watson wrote:
> >>> On Mon, 12 Jun 2006, Scott Long wrote:
> >>>> I run a number of high-load production systems that do a lot of
> >>>> network and filesystem activity, all with HZ set to 100.  It has
> >>>> also been shown in the past that certain things in the network
> >>>> area were not fixed to deal with a high HZ value, so it's
> >>>> possible that it's even more stable/reliable with an HZ value of
> >>>> 100.
> >>>>
> >>>> My personal opinion is that HZ should go back down to 100 in
> >>>> 7-CURRENT immediately, and only be incremented back up when/if
> >>>> it's proven to be the right thing to do.  And I say that as
> >>>> someone who (errantly) pushed for the increase to 1000 several
> >>>> years ago.
> >>>
> >>> I think it's probably a good idea to do it sooner rather than
> >>> later.  It may slightly negatively impact some services that rely
> >>> on frequent timers to do things like retransmit timing and the
> >>> like.  But I haven't done any measurements.
> >>
> >> As you know, but for the benefit of the list, restoring HZ=100 is
> >> often an important performance tweak on SMP systems with many
> >> CPUs, because of all the sched_lock activity from
> >> statclock/hardclock, which scales with HZ and NCPUS.
> >
> > sched_lock is another big bottleneck: if you have 32 CPUs, in
> > theory you have 32X the context switch speed, but right now it
> > still has only 1X speed.  There is also code abusing sched_lock:
> > the M:N bits dynamically insert a thread into the thread list at
> > context switch time, which is a bug, and it forces the thread list
> > in a proc to be protected by the scheduler lock.  Delivering a
> > signal to a process therefore has to hold the scheduler lock to
> > find a thread; if the proc has many threads, this introduces long
> > scheduler latency.  That a proc lock is not enough to find a thread
> > is a bug.  There is other code abusing the scheduler lock that
> > could really use its own lock.
>
> I've added Kip Macy to the CC; he is working on a patch for Sun4v
> that eliminates sched_lock.  Maybe he can comment some more on this
> thread?
>
> Robert N M Watson
> Computer Laboratory
> University of Cambridge
>
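P.S. Here is the rough sketch I referred to above.  It is illustrative
only, not code from my tree: nispin_lock()/nispin_unlock() (operating
on the nispinlock type sketched earlier), TDF_HARDCLOCK, and the two
hook functions are made-up names, while critical_enter()/
critical_exit(), cpu_spinwait(), and the atomic_*_int operations are
the stock kernel primitives.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/proc.h>
#include <machine/atomic.h>
#include <machine/cpu.h>

/* Acquire: spin inside a critical section; interrupts stay enabled. */
static __inline void
nispin_lock(struct nispinlock *nis)
{
	critical_enter();		/* td_critnest > 0: no preemption */
	while (!atomic_cmpset_acq_int(&nis->nis_lock, 0, 1))
		cpu_spinwait();
}

static __inline void
nispin_unlock(struct nispinlock *nis)
{
	atomic_store_rel_int(&nis->nis_lock, 0);
	critical_exit();
}

#define	TDF_HARDCLOCK	0x10000000	/* placeholder flag bit */

/*
 * The timer interrupt no longer takes sched_lock; it just flags the
 * current thread atomically.  (td_flags is an int, hence the cast.)
 */
static void
timer_intr_flag(struct thread *td)
{
	atomic_set_int((volatile u_int *)&td->td_flags, TDF_HARDCLOCK);
}

/*
 * ast() notices the flag on the way back out and runs the
 * cpu_hardclock()/statclock() accounting at a point where blocking
 * interrupts is unnecessary.
 */
static void
ast_timer_work(struct thread *td)
{
	if (td->td_flags & TDF_HARDCLOCK) {
		atomic_clear_int((volatile u_int *)&td->td_flags,
		    TDF_HARDCLOCK);
		/* cpu_hardclock()/statclock() work goes here */
	}
}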