From owner-freebsd-hackers Sat Jun 12 3:40:31 1999
Delivered-To: freebsd-hackers@freebsd.org
Received: from quark.ChrisBowman.com (crbowman.erols.com [209.122.47.155]) by hub.freebsd.org (Postfix) with ESMTP id C224614CDF for ; Sat, 12 Jun 1999 03:40:24 -0700 (PDT) (envelope-from crb@ChrisBowman.com)
Received: from fermion (crb@fermion.ChrisBowman.com [10.0.1.2]) by quark.ChrisBowman.com (8.9.3/8.9.3) with SMTP id FAA04934; Sat, 12 Jun 1999 05:40:10 -0500 (EST) (envelope-from crb@ChrisBowman.com)
Message-Id: <199906121040.FAA04934@quark.ChrisBowman.com>
X-Sender: crb@quark
X-Mailer: QUALCOMM Windows Eudora Pro Version 4.0.1
Date: Sat, 12 Jun 1999 06:37:50 -0400
To: Arun Sharma
From: "Christopher R. Bowman"
Subject: Re: High syscall overhead?
Cc: "David E. Cross" , freebsd-hackers@FreeBSD.ORG
In-Reply-To:
References: <"David E. Cross"'s message of "Fri, 11 Jun 1999 10:40:37 -0400"> <199906111440.KAA70517@cs.rpi.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

At 11:15 PM 6/11/99 -0700, Arun Sharma wrote:
>"David E. Cross" writes:
>
>> Looking through the exception.s it appears that on entry to the
>> kernel an MP lock is obtained... I thought we had splX(); to
>> protect concurancy in the kernel.
>
>Can someone explain to me why is SYSCALL_LOCK necessary ? It certainly
>seems to hurt system call performance on a MP machine.
>
>Also, is there any data on lock contention in FreeBSD ? Is anyone
>working on decomposing some of the giant locks ?

I can't speak authoritatively since I don't know specifically what SYSCALL_LOCK is, but if it is what is often referred to on this list as the Giant Kernel Lock(tm), then the following should generally apply.

When dealing with data structures there are often areas of code that must run without any possibility of being interrupted or of executing at the same time as another specific piece of code. By way of example, consider the simple act of incrementing a shared variable. On most modern RISC processors this is essentially four instructions:

    LOAD  r2, #var     load the address of variable var into register r2
    LOAD  r1, r2       load the value stored in var into cpu register r1
    ADD   r1, r1, #1   add one to r1 and store the result in r1
    STORE r2, r1       store the incremented value of var back into memory

Suppose that this simple four-instruction sequence is interrupted after the second load and before the add, and some time later execution resumes at the add instruction. Let's first assume that we are on a uniprocessor machine, in which case there are only two ways this sequence of instructions can stop before the add and come back later:

1) If we are in user mode, a page boundary can fall between the load and the add, and this could lead to a page fault exception (a page fault on the instruction fetch for the add). This cannot happen in kernel mode, since if we were in kernel mode we would be inside the kernel, and the kernel wires down all of its code pages (this might change in the future, but probably only for pages that are guaranteed never to be executed again, like the startup splash screen code).

2) An interrupt could occur such that the load completes but the add does not start. This could happen for a variety of reasons: perhaps a disk request has finished, maybe the sound card needs new data, or a clock interrupt signaling the end of a process's quantum may arrive. Such an interrupt may occur whether the code is running in user mode or in kernel mode.
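Just to make the race concrete before going on, here is a small userland sketch of the same lost-update problem (my own illustration, not anything from the kernel source; the names are made up). Two threads increment the same variable with no locking, so each one runs the load/add/store sequence above and some increments are usually lost on an MP box:

    /*
     * Illustration only: the lost-update race in userland C with POSIX
     * threads.  Each thread increments the shared variable a million
     * times; without any locking the final value is usually less than
     * the expected 2000000 on a multiprocessor machine.
     */
    #include <pthread.h>
    #include <stdio.h>

    static volatile long var = 0;           /* the shared variable */

    static void *
    bump(void *arg)
    {
            int i;

            (void)arg;
            for (i = 0; i < 1000000; i++)
                    var++;                  /* compiles to load, add, store */
            return (NULL);
    }

    int
    main(void)
    {
            pthread_t t1, t2;

            pthread_create(&t1, NULL, bump, NULL);
            pthread_create(&t2, NULL, bump, NULL);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);

            /* Expect 2000000; without locking it usually comes up short. */
            printf("var = %ld\n", var);
            return (0);
    }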
In either of the above two cases the processor typically saves its current context/state, switches to an interrupt or exception context/state, and goes off to execute either a page fault handler or an interrupt handler, as appropriate. If we were in kernel mode at the time of the interrupt, the context is restored at the conclusion of the handler and execution resumes where it was interrupted. If we were in user mode, the processor may or may not resume executing the user process immediately after the interrupt is processed, but generally (barring a signal delivery, for instance) when the process does resume execution its context/state is restored and it resumes at the add.

Now, having examined how the example code could have its execution interrupted, let us further suppose that the code that executes between the load and the add also increments the same variable. If it happens to run a similar four-instruction sequence, then the value that gets stored in memory doesn't reflect the desired effect of the two pieces of code: it appears as if the variable was incremented only once when it was intended to be incremented twice. If this happens to kernel code it can have catastrophic effects.

To avoid this we must have some way of preventing the two pieces of code from intermingling their execution. In the FreeBSD (and any single-threaded) kernel this is handled by locking out interrupts around these so-called critical sections of code. So, for instance, if a variable were shared only with the network interrupt handler, the kernel would make an x = splnet() call before the second load and an splx(x) call after the store. Since the network interrupt is prevented from being serviced between the spl calls, the kernel is guaranteed that the load, add, and store will proceed without problems, and our problem is solved. We can see that this is so: the network interrupt which changes the variable has its execution delayed until after the kernel updates the variable, and by assumption the other interrupts, which may occur and won't be delayed, don't change the variable and thus don't present a problem.

Now consider the multiprocessor case. We have the same problem we had in the single-processor case, but now we also have a new variant. Suppose that both processors enter the kernel and both want to execute kernel code which updates the same variable. If the second processor performs its load before the first processor has performed its store, we again get improper results, and our spl mechanism for delaying interrupts doesn't help, for two reasons: a) locking out interrupts usually occurs on only one processor, and thus the other processor could still take an interrupt that would lead to a conflict; b) even if we assume that we could lock interrupts out of both processors, the two processors could both be executing the code as part of plain kernel code, a situation we could never have on a uniprocessor machine. Somehow we have got to prevent both processors from being in the kernel and executing these kinds of mutually exclusive code sections at the same time. If our kernel code starts out, as ours did, assuming it runs only on uniprocessor machines, then usually only the variables and data structures shared with an interrupt handler are protected. If we then hack on second-processor support, we have to somehow lock the data structures and global variables against simultaneous modification.
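Before moving on to the multiprocessor solutions, here is a rough sketch of what the uniprocessor spl-style protection described above looks like in C. splnet() and splx() are the real interrupt-priority-level interfaces; the function and variable names are made up for the example, and since this is kernel-side code it isn't something you can compile as a standalone program:

    /*
     * Rough sketch of spl-style protection around a shared counter.
     * "net_pkts_seen" and count_packet() are hypothetical names invented
     * for this example; the counter is assumed to also be incremented by
     * the network interrupt handler.
     */
    static long net_pkts_seen;

    void
    count_packet(void)
    {
            int s;

            s = splnet();           /* block network interrupts */
            net_pkts_seen++;        /* the load/add/store is now safe */
            splx(s);                /* restore the previous interrupt level */
    }

As described above, on an MP machine this alone isn't sufficient, since the other processor can still be running the same code at the same time.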
A simple first-order solution would be to lock out other processors from entering the kernel once one processor enters it. To do this we need only add code at the start of the syscall path to check for another processor already being in the kernel and to wait for it to exit before we continue. This is the so-called Giant Kernel Lock(tm) method, since one lock covers all of the kernel code. It is easy to implement and incurs little overhead. If the processors spend most of their time executing user code, they may only rarely have to wait to enter the kernel: two encryption processes may hardly ever enter the kernel and may execute very efficiently. On the other hand, two processes that read their pid in a loop might spend most of their time waiting, and thus seem to run almost as if there were only one processor.

The alternative to the Giant Kernel Lock(tm) is so-called fine-grained locking, wherein locking is pushed down closer to the data structures. With fine-grained locking two processors might be executing in the kernel at the same time, but only if they don't need the same resources: one might be doing a disk read while the other queues up a character for the serial port. Fine-grained locking has the potential for higher parallelism and thus better throughput, since processes may not have to wait as long, but the larger number of locks with their many required lock and unlock operations adds overhead. Furthermore, the design is more difficult and error prone, since the interaction of the numerous locks may result in deadlock or livelock situations every bit as problematic as the problem they try to solve.

Well, there you have it. I hope this answers some of the questions you had. I am sure I made a few mistakes here and there, but the overall problem is laid out for you, and I am sure others will step in and correct me where I am mistaken.

"Now you know, and knowing is half the battle" -- G.I. Joe, Sunday morning cartoons.
--------
Christopher R. Bowman
crb@ChrisBowman.com
http://www.ChrisBowman.com/

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message