Date: Sat, 24 Jun 2000 23:56:05 -0700 From: Jason Evans <jasone@canonware.com> To: smp@freebsd.org Subject: SMP meeting summary Message-ID: <20000624235605.D8965@blitz.canonware.com>
index | next in thread | raw e-mail
[The following meeting summary was originally written by Greg Lehey, and he
later revised it to include various points from the notes that I took
during the meeting. Finally, I edited (added some, changed some, removed
some) Greg's summary. Thanks go to Greg for doing the majority of the
writing work!]
On the 15th and 16th of June we had a seminar at Yahoo! in Sunnyvale about
the recent changes to the BSD/OS kernel designed to improve SMP
performance.
Participants were, in seating order:
Don Brady Apple Computer File systems
Ramesh ? Apple Computer
Ted Walker Apple Computer network drivers
Jeffrey Hsu FreeBSD project
Chuck Paterson BSDi Chief developer
Jonathan Lemon Cisco, FreeBSD project
Matt Dillon FreeBSD project VM, NFS
Paul Saab Yahoo!
Kirk McKusick
Peter Wemm Yahoo!
Jayanth ? Yahoo!
Doug Rabson FreeBSD project Alpha port
Jason Evans FreeBSD project kernel threads
David Greenman FreeBSD project chief architect
Justin Gibbs Adaptec, FreeBSD project SCSI, 0 copy TCP
Greg Lehey Linuxcare, FreeBSD project storage management
Mike Smith BSDi, FreeBSD project hardware, iA64 port
Alfred Perlstein Wintel, FreeBSD project
David O'Brien BSDi, FreeBSD project compilers, binutils
Ceren Ercen Linuxcare Daemon babe
We met for approximately 8 hours on Thursday and 4 hours on Friday.
Chuck Patterson spent Thursday presenting how BSDi implemented SMP in
BSD/OS 5.0 (as of yet unreleased). Chuck also briefly explained BSD/OS
4.x's SMP implementation.
The BSD/OS 4.x SMP implementation is mainly comprised of a giant lock, but
with a twist. Whenever a processor tries to acquire the giant lock it can
either succeed or fail. If it succeeds, then it's business as usual.
However, if the acquisition fails, the processor does not spin on the giant
lock. Instead, it acquires the schedlock (which protects scheduler-related
portions of the kernel) and schedules another runnable process, if any.
This technique works extremely well for heavy work loads that have less
than one CPU worth of system (kernel processing) load. It is very simple,
and it achieves optimal throughput.
The BSD/OS 5.0 SMP implementation is more complex, and is what most of the
meeting time was spent discussing. From here on, all discussion of BSD/OS
is with regard to 5.0.
1. Source code access.
BSD/OS is a proprietary operating system, for which binary-only and
source code licenses are available. BSD/OS is based on the same free
sources (4.4BSD) as the free BSD operating systems. It is similar to
FreeBSD, though the two have diverged significantly enough to cause
serious pains when moving kernel code between them.
A few weeks back, BSDi made the source code of BSD/OS available to all
FreeBSD committers. During the meeting we discussed what this really
means, and Kirk McKusick (amongst other things chairman of the board of
BSDi) said, "Well, we're quite happy for you to take generous chunks,
but if you ended up taking it all, people might get a little uneasy".
Basically, anything short of simple repackaging of BSD/OS is not an
issue.
2. The current problems.
UNIX was written for single processor machines, and many of the design
choices are not just suboptimal for SMP, they're just plain ugly. In
particular the synchronization mechanisms don't work well with more
than one processor. Briefly:
- The process context, including the upper half of device drivers,
doesn't need to protect itself. The kernel is non-preemptive, so as
long as a process is executing in the kernel, no other process can
execute in the kernel. If another process, even with higher
priority, becomes runnable while a process is executing kernel code,
it will have to wait until the active process leaves the kernel or
sleeps.
- Processes protect themselves against the interrupt context, primarily
the bottom half of device drivers, by masking interrupts. The
original PDP-11 UNIX used the hardware priority levels (numbered 4 to
7), and even today you'll find function calls like spl4() and spl7()
in System V code. BSD changed the names to more descriptive terms
like splbio(), splnet() and splhigh(), and also replaced the fixed
priorities by interrupt masks, but the principle remains the same.
It's not always easy to solve the question of which interrupts need
to be masked in which context, and one of the interesting
observations at this meeting was that as time goes on, the interrupt
masks are getting blacker. In other words each spl() is masking off
more and more bits in the interrupt mask register. This is not good
for performance.
- Processes synchronize with each other using the sleep() or tsleep()
calls. Traditional UNIX, including System V, uses sleep(), and BSD
prefers tsleep(), which provides nice strings which ps(1) displays to
show what the process is waiting for. FreeBSD no longer has a
sleep() call, while BSD/OS has both, but sleep() is deprecated.
tsleep() is used both for voluntary process synchronization
(e.g. send a request to another process and wait until it is
finished), and for involuntary synchronization (e.g. wait for a
shared resource to become available).
Processes sleep on a specific address. In many cases, the address in
itself has no meaning, and it's probably easier to think of it as a
number. When a process sleeps, it is put on a sleep queue. The
wakeup() function takes the sleep address, walks through the sleep
queue, and wakes *every* process which is sleeping on this address.
This can cause massive problems even on single processor machines;
UNIX was never really intended to have hundreds of processes waiting
on the same resource, and a number of Apache performance problems
center around this behavior. As a partial solution, FreeBSD also has
an additional function, wakeup_one(), which only wakes one process.
There are a number of reasons why this concept is not a good solution
for SMP. Firstly, the simplistic assumption "nothing else can be
executing in the kernel while I am" falls flat. We currently haven't
implemented a solution for this. Instead, we found a way of enforcing
this illogical state, the Big Giant Lock (BGL). Any process entering
the kernel must first obtain the BGL; if a process executing on another
processor has the lock, then the current processor spins; it can't even
schedule another process to run, because that requires entering the
kernel.
The other issue is with masking interrupts. This is also quite a
problem for SMP machines, since it requires masking the interrupts on
all processors, again requiring an expensive synchronization.
3. The BSD/OS solution.
- The BGL remains, but becomes increasingly meaningless. In particular,
it is not always necessary to obtain it in order to enter the kernel.
- Instead the system protects shared data structures with mutexes.
These mutexes replace calls to tsleep() when waiting on shared
resources (the involuntary process synchronization mentioned above).
In contrast to traditional UNIX, mutexes will be used much more
frequently in order to protect data structures which were previously
implicitly protected by the non-preemptive nature of the kernel. This
mechanism will replace calls to tsleep() for involuntary context
switches.
Compared with the use of tsleep(), mutexes have a number of
advantages:
- Each mutex has its own wait (sleep) queue. When a process releases
a mutex, it automatically schedules the next process waiting on the
queue. This is more efficient than searching a possibly very long,
linear sleep queue. It also avoids the flooding when multiple
processes get scheduled, and most of them have to go back to sleep
again.
- Mutexes can be a combination of spin and sleep mutexes: for a
resource which may be held only for a very short period of time,
even the overhead of sleeping and rescheduling may be higher than
waiting in a tight loop. A spin/sleep lock might first wait in a
tight loop for 2 microseconds and then sleep if the lock is still
not available at that time. This is an issue which Sun has
investigated in great detail with Solaris. BSDi has not pursued
this yet, though the BSD/OS threading primitives make this an easy
extention to add. It's possibly an area for us to investigate once
the system is up and limping again.
- Interrupt lockout (spl()s) go away completely. Instead, interrupt
functions use mutexes for synchronization. This means that an
interrupt function must be capable of blocking, which is currently
impossible. In order to block, the function must have a "process"
context (a stack and a process control block). In particular, this
could include kernel threads.
BSD/OS on Intel currently uses light-weight interrupt threads to
process interrupts, while on SPARC uses normal ("heavyweight")
processes. Chuck indicated that the decision to implement
light-weight threads initially was probably the wrong one, since it
gave rise to a large number of problems, and although the heavyweight
process model would give lousy performance, it would probably make it
easier to develop the kernel while the light-weight processes were
being debugged. There is also the possibility of building a kernel
with one or the other support, so that in case of problems during
development it would be possible to revert to the heavy-weight
processes while searching for the bug.
Other details we discussed included:
- BSD/OS has not implemented condition variables. We didn't go into
details. The opinion was expressed that they would be very useful for
synchronization, but that they require coding discipline somewhat
different than the tsleep() mechanism. Converting all use of tsleep()
is a lot of work, and of dubious value. However, condition variables
can live along with tsleep(), so a full changeover is not necessary.
- BSD/OS also does not implement read/write locks, a thing that Linux
has recently introduced. We didn't discuss this further, but the
concepts are well understood, and it should be relatively simple to
implement them if we find a need.
- Netgraph poses locking performance problems, since locks have to be
released at multiple potential transfer points, regardless of whether
Netgraph is in use. This problem also exists with System V STREAMS.
During the meeting we didn't come to a clear consensus on how much of
a problem this really is.
- Interrupts can have priority inversion problems on MP machines in
combination with lazy context switching (aka context stealing).
However, it's a temporary inversion that just causes latency. The
reason that deadlock never occurs is that as soon as a lock is missed,
the interrupt stack stealing is unwound, so there is never a situation
where a lock is held that can cause deadlock. When a high-priority
process waits for a lower priority process, the blocking process
temporarily lends its priority to the running process in order to
ensure that it finishes quickly. This technique is interchangeably
called priority inheritance, priority lending, deadlock avoidance, and
probably other names, just to make things confusing.
- NFS leasing causes big problems. Samba will have similar problems,
potentially.
- Message queues are probably worthwhile, but they're currently not a
high priority.
- There are a number of global variable updates that are not locked, and
can thus result in partially updated variables (i.e. reads can get
corrupt values). This requires either using a locked instruction, or
using a mutex. A mutex isn't much more expensive, and is probably
easier.
- We should split part of struct proc out into a fixed-size kproc for ps
use. This isn't really related to the SMP work, but it would be nice
to get rid of the dreaded "proc size mismatch" error message that
people get when their kernel is out of sync with userland.
- We spoke about naming conventions. Some people weren't too happy with
BSD/OS's macro names. Chuck agreed and said that he would adopt our
naming convention if we chose a better one.
- Per-CPU variables need GET_*() and SET_*() routines to lock.
4. Things we need to do.
There are a number of things we need to do. During the meeting we
didn't get beyond deciding the first couple of things:
- First remove the BGL (currently a spinlock) and replace it with two,
maybe three mutexes, one for the scheduler (schedlock), and a
blocking mutex for the kernel in place of the BGL. BSD/OS also has
an ipending lock for posting interrupts, which we should probably
implement in the short term, though it's possible that it might go
again.
- In addition, implement the heavy-weight interrupt processes. These
would remain in place while the light-weight threads were being
debugged.
5. Who does what?
A number of people will work on the SMP project. During the first stage:
- Matt Dillon will put in locking primitives and schedlock. This
includes resurrecting our long-dead idle process to scan the run queue
for interrupt threads. He won't have time for NFS.
- Doug Rabson will work on the alpha bits, so that it doesn't get left in
the dust.
- Greg Lehey will implement the heavyweight interrupt processes and
lightweight interrupt threads.
- Jason Evans will be the project manager.
6. Timing.
We have a general agreement that it's better to do it right than do it
quickly. Thus far, Matt has implemented much of his part and is now
waiting on Greg to do the interrupt processes. When they've done that,
they'll do their own tests, and others will do additional testing. All
commits will be dependent on approval from Jason, and the first can be
expected within two months (probably sooner).
The SMP changes will be maintained as patches against -current until the
following milestones have been met:
- Port the BSD/OS locking primitives to the i386 port (Matt) and the
alpha port (Doug Rabson).
- Convert the BGL to a blocking lock, add the schedlock, add per-CPU
idle processes (Matt).
- Implement heavy-weight interrupt threads (Greg). Light-weight
interrupt thread context switching may be working by the time the
first commit is made, but this is not a requirement.
- Stub out (basically disable) spl()s.
- Demonstrated successful compilation and reasonable stability
(self-hosted kernel build) on both i386 (UP and SMP) and alpha.
The maintenance of the patches is expected to be a bit of pain, but we
have decided not to branch due to technical issues with maintaining
branches in CVS. The patches are expected to exist only until the first
commit is made. At that point, all further development will be done
directly on HEAD in cvs.
On the light side, we had a rather amusing experience on Friday. We wanted
to order some sandwiches, but something went wrong with the order, so Paul
ordered pizza instead. A bit later, the pizza boy came in and deposited
the pizzas on the conference table and was about to leave when Paul
introduced him. His name is David Filo. Thanks for the pizza!
To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-smp" in the body of the message
help
Want to link to this message? Use this
URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20000624235605.D8965>
