Date:      Mon, 15 Jun 2009 10:16:09 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Sam Leffler <sam@FreeBSD.org>
Cc:        svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org, src-committers@FreeBSD.org, Ed Schouten <ed@FreeBSD.org>
Subject:   Re: svn commit: r194204 - in head/sys: amd64/conf i386/conf
Message-ID:  <20090615075134.K24645@delplex.bde.org>
In-Reply-To: <4A356A0F.3050800@freebsd.org>
References:  <200906141801.n5EI1Zti056239@svn.freebsd.org> <4A356A0F.3050800@freebsd.org>

On Sun, 14 Jun 2009, Sam Leffler wrote:

>> Log:
>>   Enable PRINTF_BUFR_SIZE on i386 and amd64 by default.
>>
>>   In the past there have been some reports of PRINTF_BUFR_SIZE not
>>   functioning correctly. Instead of having garbled console messages, we
>>   should just see whether the issues are still there and analyze them.
>
> When I suggested enabling this by default for SMP configurations I was
> reminded it wasn't enabled because of potential stack overflow.

Stack overflow shouldn't be a problem, since printf shouldn't be called
recursively.  However, recursive calls are possible from panic(), and
recursive calls are handled badly if they happen.  E.g., PRINTF_BUFR_SIZE
gives unlimited stack growth, and the locking in cnputs() may give
deadlock.  Deadlock should occur if cnputs() is preempted and the new
thread panics, since panicking should prevent the preempted thread from
running (and the system state may be so messed up that the preempted
thread cannot run anyway).
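
A minimal sketch of the failure mode (this is not FreeBSD's actual
subr_prf.c; the buffer handling, the lock, and all names are simplified
assumptions for illustration):

 #include <sys/types.h>
 #include <sys/systm.h>
 #include <machine/atomic.h>
 #include <machine/stdarg.h>

 #define	PRINTF_BUFR_SIZE	128	/* hypothetical size for the sketch */

 static volatile u_int cnputs_lock;	/* stand-in for cnputs()'s mutex */

 static void
 cnputs_sketch(const char *s)
 {
 	/* A reentrant call on the same CPU spins here forever: deadlock. */
 	while (!atomic_cmpset_acq_int(&cnputs_lock, 0, 1))
 		;
 	/* ... hand the buffered string s to the console driver ... */
 	atomic_store_rel_int(&cnputs_lock, 0);
 }

 static void
 printf_sketch(const char *fmt, ...)
 {
 	va_list ap;
 	char bufr[PRINTF_BUFR_SIZE];	/* another buffer per recursion level */

 	va_start(ap, fmt);
 	vsnprintf(bufr, sizeof(bufr), fmt, ap);
 	va_end(ap);
 	cnputs_sketch(bufr);
 }

If anything reached while cnputs_sketch() holds the lock panics, panic()
calls back into printf_sketch(): each level eats another PRINTF_BUFR_SIZE
bytes of stack, and the nested cnputs_sketch() spins on a lock that its
own call chain already holds.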

I wouldn't use an implementation anything like FreeBSD's.  printf()s
must be serialized, and once printf() is serialized there is no need
for buffering or cnputs().  Put the locking in kvprintf() only.  To
avoid deadlock, the locking must not be strict.  Either allow it to
be bypassed (needed for calls from db_printf() and panic()), or
automatically bypass it after a timeout, or automatically abort the
printf() after a timeout.  I use the latter to fix related bugs in
kernel profiling (for some arches, MCOUNT_ENTER() deadlocks on NMIs
and ddb traps if trap occurs while the lock is held, and for other
arches MCOUNT_ENTER() is missing locking).  For printf(), an automatic
bypass after a timeout seems best -- in the worst case, it just gives
interleaved messages like now, but that should rarely happen; the
timeout can be very long for a CPU yet short on a human scale, so
important messages are not blocked for long by unimportant messages
or locking bugs.

Here is my locking for kernel profiling (not showing simplifications in
MCOUNT_ENTER/EXIT()):

% Index: mcount.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/libkern/mcount.c,v
% retrieving revision 1.24
% diff -u -2 -r1.24 mcount.c
% --- mcount.c	13 Jun 2007 06:17:48 -0000	1.24
% +++ mcount.c	13 Jun 2007 06:18:33 -0000
% @@ -34,7 +34,5 @@
%  #include <sys/gmon.h>
%  #ifdef _KERNEL
% -#ifndef GUPROF
%  #include <sys/systm.h>
% -#endif
%  #include <vm/vm.h>
%  #include <vm/vm_param.h>
% @@ -43,4 +41,61 @@
% 
%  /*
% + * Locking strategy: use a simple spin mutex, but don't wait very long
% + * for it.  When we don't acquire it, just give up.  We avoid using a
% + * normal mutex since normal mutex code is profiled and might be
% + * inefficient.  Not waiting very long avoids deadlock.  Our callers
% + * should have used the usual anti-deadlock strategy of disabling
% + * interrupts on the current CPU, but that is especially insufficient
% + * for us since it doesn't affect debugger traps and NMIs, and at least
% + * the debugger code is profiled.
% + *
% + * Note: the atomic functions used here, like all functions used in this
% + * file, must be compiled without profiling.  This normally happens for
% + * the atomic functions because they are inline.
% + */
% +
% +static __inline int
% +mcount_trylock(struct gmonparam *p, int want_both)
% +{
% +#ifdef SMP
% +	int timeout;
% +
% +	timeout = 1000;		/* XXX ~= 10 times #statements in mcount(). */
% +	do {
% +		/* XXX having 2 "on" states complicates things. */
% +#ifdef GUPROF
% +		if (atomic_cmpset_acq_int(&p->state, GMON_PROF_HIRES,
% +		    GMON_PROF_BUSY))
% +			return (GMON_PROF_HIRES);
% +#endif
% +		if (want_both && atomic_cmpset_acq_int(&p->state, GMON_PROF_ON,
% +		    GMON_PROF_BUSY))
% +			return (GMON_PROF_ON);
% +	} while (--timeout != 0);
% +#else /* !SMP */
% +	/* Ugliness for efficiency. */
% +#ifdef GUPROF
% +	if (p->state == GMON_PROF_HIRES) {
% +		/* XXX want acq. */
% +		atomic_store_rel_int(&p->state, GMON_PROF_BUSY);
% +		return (GMON_PROF_HIRES);
% +	}
% +#endif
% +	if (want_both && p->state == GMON_PROF_ON) {
% +		atomic_store_rel_int(&p->state, GMON_PROF_BUSY);
% +		return (GMON_PROF_ON);
% +	}
% +#endif /* SMP */
% +	return (GMON_PROF_BUSY); /* Might actually be OFF or ERROR. */
% +}
% +
% +static __inline void
% +mcount_unlock(struct gmonparam *p, int ostate)
% +{
% +
% +	atomic_store_rel_int(&p->state, ostate);
% +}
% +
% +/*
%   * mcount is called on entry to each function compiled with the profiling
%   * switch set.  _mcount(), which is declared in a machine-dependent way
% @@ -68,4 +123,5 @@
%  	struct gmonparam *p;
%  	long toindex;
% +	int ostate;
%  #ifdef _KERNEL
%  	MCOUNT_DECL(s)
% @@ -73,17 +129,17 @@
% 
%  	p = &_gmonparam;
% -#ifndef GUPROF			/* XXX */
% +#ifndef GUPROF
%  	/*
% -	 * check that we are profiling
% -	 * and that we aren't recursively invoked.
% +	 * Quick check that we are profiling.  In caller for GUPROF, and
% +	 * should always be there.
%  	 */
% -	if (p->state != GMON_PROF_ON)
% +	if (p->state == GMON_PROF_OFF)
%  		return;
%  #endif
%  #ifdef _KERNEL
%  	MCOUNT_ENTER(s);
% -#else
% -	p->state = GMON_PROF_BUSY;
%  #endif
% +	if ((ostate = mcount_trylock(p, 1)) == GMON_PROF_BUSY)
% +		return;
% 
%  #ifdef _KERNEL
% @@ -101,5 +157,5 @@
% 
%  #ifdef GUPROF
% -	if (p->state == GMON_PROF_HIRES) {
% +	if (ostate == GMON_PROF_HIRES) {
%  		/*
%  		 * Count the time since cputime() was previously called
% @@ -224,12 +281,11 @@
%  	}
%  done:
% +	mcount_unlock(p, ostate);
%  #ifdef _KERNEL
%  	MCOUNT_EXIT(s);
% -#else
% -	p->state = GMON_PROF_ON;
%  #endif
%  	return;
%  overflow:
% -	p->state = GMON_PROF_ERROR;
% +	mcount_unlock(p, GMON_PROF_ERROR);
%  #ifdef _KERNEL
%  	MCOUNT_EXIT(s);
% @@ -251,6 +307,9 @@
%  	struct gmonparam *p;
%  	uintfptr_t selfpcdiff;
% +	int ostate;
% 
%  	p = &_gmonparam;
% +	if ((ostate = mcount_trylock(p, 0)) == GMON_PROF_BUSY)
% +		return;
%  	selfpcdiff = selfpc - (uintfptr_t)p->lowpc;
%  	if (selfpcdiff < p->textsize) {
% @@ -267,4 +326,5 @@
%  		*p->mexitcount_count += p->mexitcount_overhead;
%  	}
% +	mcount_unlock(p, ostate);
%  }
%

The above makes mcount()'s kernel locking strategy very similar to its
userland one -- try to obtain the lock, and if it cannot be obtained
then do nothing and just lose some statistics.  In userland, recursive
calls to mcount() are possible for signal handlers and for threads.
For signal handlers, waiting forever could deadlock and the old code
avoids that using a simple state flag; it assumes atomic accesses to
p->state.  The threaded case was mostly broken -- it doesn't have the
necessary atomic accesses in the SMP case, and contention on mcount()
and consequent lost statistics is much more common for threads.  The
above is supposed to fix the locking in userland in the same way as
in the kernel using the same trylock function, but hasn't been tested
in userland.  In the kernel, there is enormous contention on the lock
iff multiple threads are in the kernel.  mcount() needs to have per-CPU
context to run fast enough in this case.
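
One hypothetical way to get that per-CPU context (this is not part of
the patch above; the array name and the merging step are assumptions)
is a gmonparam per CPU, selected by the CPU id:

 #include <sys/param.h>
 #include <sys/gmon.h>
 #include <sys/pcpu.h>

 /* Hypothetical per-CPU profiling state, folded into _gmonparam on read. */
 static struct gmonparam _gmonparam_pcpu[MAXCPU];

 static __inline struct gmonparam *
 gmon_pcpu(void)
 {
 	/*
 	 * MCOUNT_ENTER() is expected to have disabled interrupts on this
 	 * CPU, which also prevents migration, so the CPU id stays valid.
 	 */
 	return (&_gmonparam_pcpu[PCPU_GET(cpuid)]);
 }

With that, the trylock would only ever contend with traps and NMIs on
the same CPU, and the counters would be merged into the global
gmonparam when the profile is read.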

For printf(), things are much simpler because results are only needed
on a human time scale, and printf()'s internals are reentrant enough.

Instead of:

 	if ((ostate = mcount_trylock(p, 0)) == GMON_PROF_BUSY)
 		return;
 	...
 	mcount_unlock(p, ostate);

kvprintf() can just do:

 	printf_trylock();
 	/*
 	 * Continue (after a timeout) even if we didn't acquire the lock
 	 * exclusively, so that printf() is guaranteed to complete
 	 * eventually.  Use a global in-use counter instead of p->state
 	 * and ostate above.
 	 */
 	...
 	printf_unlock();
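
A minimal sketch of those two functions, assuming the global in-use
counter described in the comment plus the kernel's atomic(9)
primitives (this is not committed code; the names are mine):

 #include <sys/types.h>
 #include <machine/atomic.h>

 static volatile u_int printf_users;	/* global in-use counter */

 static void
 printf_trylock(void)
 {
 	int timeout;

 	timeout = 1000;		/* XXX as crude as the bound in mcount(). */
 	do {
 		/* Exclusive use: no other printf() is in progress. */
 		if (atomic_cmpset_acq_int(&printf_users, 0, 1))
 			return;
 	} while (--timeout != 0);
 	/*
 	 * Timed out.  Barge in anyway so that this printf() is
 	 * guaranteed to complete; its output may be interleaved,
 	 * like now.
 	 */
 	atomic_add_acq_int(&printf_users, 1);
 }

 static void
 printf_unlock(void)
 {

 	atomic_subtract_rel_int(&printf_users, 1);
 }

db_printf() and panic() could then either skip printf_trylock()
entirely or run it with a zero timeout, as discussed below.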

Normal mutex locking should not be used here.  (For profiling, it
cannot be used, since the mutex code is itself profiled.)  mtx_lock_spin()
is more unusable than I remembered, both for the above and for where it is now
used in cnputs().  It does recursion inline even when MTX_RECURSE is
not configured for the mutex.  _mtx_lock_spin_flags() checks for
recursion on non-recursive mutexes but won't be called in the usual
case where the recursion is done inline.  This results in erroneous
recursive calls to cnputs() not deadlocking like they should; instead
the locking just doesn't work.
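
Roughly what that inline fast path does, paraphrased from memory (check
sys/mutex.h for the exact code; the real code is a macro, written here
as an inline function for readability, and it is an illustration of the
behaviour described above, not a verbatim copy):

 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/lock.h>
 #include <sys/mutex.h>
 #include <sys/proc.h>

 static __inline void
 mtx_lock_spin_sketch(struct mtx *mp)
 {
 	uintptr_t tid = (uintptr_t)curthread;

 	spinlock_enter();
 	if (!_mtx_obtain_lock(mp, tid)) {
 		if (mp->mtx_lock == tid)
 			/*
 			 * Inline recursion: mtx_recurse is bumped without
 			 * ever checking whether the mutex was created with
 			 * MTX_RECURSE, so the recursion check in
 			 * _mtx_lock_spin_flags() is never reached.
 			 */
 			mp->mtx_recurse++;
 		else
 			_mtx_lock_spin(mp, tid, 0, NULL, 0);
 	}
 }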

The effectiveness of the automatic bypass of printf()'s locking after
a timeout can be tested easily by debugging printf() and/or its locking
code using ddb without using a special bypass for ddb's printf()s.
Once the top-level printf() acquires the lock, ddb's printf()s should
become quite slow since they have to wait for the lock, but they should
still occur (interleaved with the top-level i/o, which should be messed
up further by ddb's printf()s interfering with device accesses for the
top-level i/o when you step through the lowest levels), and they might
even be usable if the timeout is short enough.

The timeout needs to be configured better than by `timeout = 1000; /* XXX */'.
Use something like DELAY(1000) tested 1000 times.  Use a timeout of 0
for ddb once the above tests are done.
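
For example, the wait loop in the printf_trylock() sketch above could
become something like this, with an assumed tunable that ddb can zero:

 #include <sys/types.h>
 #include <sys/systm.h>		/* DELAY() */
 #include <machine/atomic.h>

 static volatile u_int printf_users;	/* same counter as in the sketch above */
 static int printf_timeout_ms = 1000;	/* tunable; 0 for ddb */

 static void
 printf_trylock(void)
 {
 	int timeout;

 	for (timeout = printf_timeout_ms; timeout != 0; timeout--) {
 		if (atomic_cmpset_acq_int(&printf_users, 0, 1))
 			return;
 		DELAY(1000);	/* 1 ms per try: ~1 second by default */
 	}
 	atomic_add_acq_int(&printf_users, 1);	/* barge in, as above */
 }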

Bruce


