From owner-freebsd-stable  Sun Dec  9 14:56:29 2001
Date: Sun, 9 Dec 2001 15:55:43 -0700
From: William Bloom <william.bloom@pegs.com>
To: freebsd-stable@freebsd.org
Cc: john.beckner@pegs.com, chad@larsons.org
Subject: SMP and Process Priority: named-xfer problem
Message-ID: <20011209155543.G12454@wbloom.pegs.com>

I've just reached a checkpoint on a problem-solving effort with BIND
8.2.5 on a dual-processor Dell PowerEdge 4200 running an SMP FreeBSD
kernel built from STABLE sources CVSup'd on 11/7, and I'd like to
compare notes with folks on the list and also perhaps get some
enlightenment over a few unanswered questions about what happened.  As
background, this machine runs a secondary nameserver with several
hundred zones.  The nameserver is not chroot'd, but it is configured
to run in a sandbox (a non-superuser account and a non-wheel group).
As usual for a sandbox, the entire runtime directory for the
nameserver database is of course owned by the sandbox user/group, as
is the directory where the PID file is maintained.

The symptom, briefly, was that whenever the nameserver attempted to
transfer a zone from the master, the named-xfer process would
become suspended and would eventually time out.  The problem was 100%
reproducible.  None of the other non-SMP machines on which we run the
-same- FreeBSD installation, with the -same- BIND (built from the
ports collection in all cases), and the -same- BIND configuration had
this symptom.  The symptom only occurred on the one machine.

After some investigation, it was found that a trial transfer of a
particular zone could be done from the command line using named-xfer
successfully -only- if the caller was the superuser.  We were using
something like...

    /usr/local/libexec/named-xfer -z abc.com -f db.abc -s 0 <master>

...where '<master>' would be the IP address of a master nameserver.
Any attempt to use named-xfer interactively by any user other than the
superuser caused the transfer to suspend in precisely the same fashion
as seen during nameserver operation on the problem machine.  The
non-SMP nameserver hosts did not exhibit this behavior.  Even more
interesting, we discovered that once a named-xfer process had hung,
then it would resume and complete the zone transfer normally if it
were sent one of a set of certain signals (SIGALRM or SIGHUP) using
'kill'.

Using 'truss' to see how the named-xfer process was hanging, we found
that the hang always happened during a system call, but not always the
same one.  Sometimes it would suspend during a 'connect' call for a
socket, sometimes it would hang during a 'getpid' call.  No matter
where it suspended, it would shake loose and complete normally if it
were sent an ALRM signal.

The clue was noticing an EPERM error returned by a 'setpriority' call
near the top of the truss output.  This is a 'silent' error that does
not appear on stderr or in the nameserver debug log; it is only seen
in a truss session log.  This particular 'setpriority' error was
absent (meaning that the 'setpriority' completes without error) in the
truss output which we captured from a comparative superuser
named-xfer session.  Checking the named-xfer.c sources, the following
is present near the beginning of the named-xfer code path...

    #ifdef RENICE
    nice(-40);  /* this is the recommended procedure to        */
    nice(20);   /*   reset the priority of the current process */
    nice(0);    /*   to "normal" (== 0) - see nice(3)          */   
    #endif

This code looks quite suspicious in a program that is allowed to run
as non-superuser, since BSD -only- permits a negative 'nice' value as
an argument if the calling process is owned by the superuser.  The
point of the above code is that the current process's 'nice' value
will be first reduced to the lowest possible value permitted by
'nice()' (which should be -20), then immediately bumped back to an
absolute value of 0 (lowered to -20 and then raised by 20), and the
final 'nice(0)' call is for upwards compatibility with older versions
of 'nice()'.  This establishes a nominal baseline scheduling priority
for a process that was forked from another process whose priority is
unknown.  As a footnote, it seems that in the case of FreeBSD it would
be much simpler to just make one call to 'setpriority()'.
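
To illustrate that footnote, here is a minimal sketch of the
single-call approach.  It is not what BIND does and not the fix we
applied; the helper name 'reset_priority_to_normal' is made up for
illustration.  The point is that if the call is refused, the priority
simply stays where it was instead of being pushed up by a follow-on
'nice(20)'.

```c
#include <errno.h>
#include <sys/time.h>
#include <sys/resource.h>

/*
 * Hypothetical helper (not part of BIND): ask for an absolute nice
 * value of 0 for the calling process in a single call.
 *
 * Returns 0 on success, or -1 with errno set if the kernel refuses
 * (for example, an unprivileged process whose current nice value is
 * above 0).  On failure the priority is left unchanged.
 */
static int reset_priority_to_normal(void)
{
    /* With PRIO_PROCESS, a 'who' of 0 means the calling process. */
    return setpriority(PRIO_PROCESS, 0, 0);
}
```

A caller would check the return value and carry on at whatever
priority it inherited if the call fails.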

But executing such code as a non-superuser has an entirely different
effect for BSD processes.  The first call (nice(-40)) is ignored and
returns an EPERM error, since only the superuser can lower the 'nice'
value.  That means that the 2nd 'nice(20)' call now has the effect of
raising the process's 'nice' to 20 instead of to 0, hence greatly
lowering the process's scheduling priority.  That's why the problem
was only in evidence when we ran named-xfer as a non-superuser.
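
The arithmetic can be traced with a small model.  This is only a
sketch of the behavior described above, not kernel code: the function
names are made up, and it assumes the 'nice' range is -20 to 20 and
that a negative increment from a non-superuser is rejected with the
value left unchanged.

```c
#include <stdbool.h>

#define NICE_MIN (-20)  /* assumed lower bound (PRIO_MIN) */
#define NICE_MAX 20     /* assumed upper bound (PRIO_MAX) */

/*
 * Hypothetical model of one nice() call: returns the resulting nice
 * value.  A negative increment from a non-superuser is treated as
 * the EPERM case and leaves the value unchanged.
 */
static int model_nice(int current, int incr, bool is_superuser)
{
    int next;

    if (incr < 0 && !is_superuser)
        return current;             /* EPERM: nothing changes */

    next = current + incr;
    if (next < NICE_MIN)
        next = NICE_MIN;            /* clamp at the most favorable value */
    if (next > NICE_MAX)
        next = NICE_MAX;            /* clamp at the least favorable value */
    return next;
}

/* Replays the three calls from the RENICE block in named-xfer.c. */
static int model_renice_sequence(int start, bool is_superuser)
{
    int n = start;

    n = model_nice(n, -40, is_superuser);
    n = model_nice(n, 20, is_superuser);
    n = model_nice(n, 0, is_superuser);
    return n;
}
```

Starting from a nice value of 0, the model ends at 0 for the
superuser (0, then -20, then 0) and at 20 for anyone else (0, then 0,
then 20).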

Experimentally, we inserted an '#undef RENICE' in front of the above
code in named-xfer.c, rebuilt/reinstalled the one binary, and now all
seems well.

But there are unanswered questions that are relevant to STABLE.  The
only impact was on a dual-processor SMP machine, and one that wasn't
particularly busy at the time.  Non-SMP machines running the same
named-xfer binary on the same FreeBSD build never saw a named-xfer
suspend. I'm thinking that the above code is indeed incorrect for a
program that may be run by a non-superuser process (as would be the
case for a chroot'd or sandbox'd nameserver), and we certainly aren't
keen on the idea of making named-xfer root SUID (nor, I suppose,
would anyone who bothers to chroot or sandbox a nameserver).  However,
even though the named-xfer process ends up being deprioritized when
executed from a sandbox'd named as described above, we notice that the
zone transfer -still- easily completes within the 2 minute timeout
period on non-SMP FreeBSD nameservers.  Only on an SMP machine does
the 'endless suspend' seem to occur.

So the first question that puzzles me is: why does the deprioritized
named-xfer suspend forever on an SMP FreeBSD host?  As a second
question, why does such a suspended process then resume without any
further hangs after it gets a SIGALRM (does signal response include a
priority boost)?

It seems that I've got a set of circumstances that must not be very
common or else there would have been more people impacted: a
multiprocessor machine running a FreeBSD SMP kernel built from
post-4.4 sources on a Dell PowerEdge and executing a nameserver with a
lot of slave zones.  Are there really not many people doing this?  Is
there some nuance of kernel configuration that I've overlooked that
accounts for this odd SMP behavior?  I'm only using the SMP and APIC_IO
options on this PowerEdge so far, and we've seen no instability or
oddness in the machine in about 4 weeks of operation apart from this
one process priority issue.

I've perused the freebsd-stable and freebsd-smp lists and not seen
anything that quite sheds light on this.


Bill
-- 
William Bloom <william.bloom@pegs.com> (602) 906-7525
Pegasus Solutions, Inc.  7500 North Dreamy Draw Drive, Suite 120
Phoenix, Az 85020  http://www.pegs.com
