From owner-freebsd-smp  Thu Apr 25 14:15:27 2002
Delivered-To: freebsd-smp@freebsd.org
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by hub.freebsd.org (Postfix) with ESMTP
	id 7248E37B417; Thu, 25 Apr 2002 14:15:13 -0700 (PDT)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.11.6/8.9.1) id g3PLF6c07119;
	Thu, 25 Apr 2002 14:15:06 -0700 (PDT)
	(envelope-from dillon)
Date: Thu, 25 Apr 2002 14:15:06 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200204252115.g3PLF6c07119@apollo.backplane.com>
To: Joe Greco <jgreco@ns.sol.net>
Cc: freebsd-smp@FreeBSD.ORG, freebsd-stable@FreeBSD.ORG
Subject: Re: kernel trap 9 with interrupts disabled
References:  <200204251836.NAA41191@aurora.sol.net>


:>     Hmm.  Maybe adjust the code to panic the machine when this
:>     situation occurs, then see if you can get a kernel dump out
:>     of it.
:
:Looks like I'll be doing that next.  Any help available from anyone in
:looking at that?  I'm not big into reading kernel dumps :-)
:
:>     As to the load issue... that sounds like a classic priority
:>     inversion problem.  Check the 'nice' of all the processes in
:>     the system and see if some nice'd-down processes are hogging
:>     the cpu.  'ps axlww' in a big window.
:
:Hmmm.  I did just notice something.  I run setiathome everywhere using a
:little daemon that punts it down to idprio etc.  I just tried to kill them
:and they didn't, and I looked again and it's because they're running at 
:0.0%, so then I idprio -t -<pid>'d them, and when I did that to the first
:one, my login session froze for the better part of a minute.  It remained
:pingable but apparently unresponsive.  Then it recovered.  The second one
:went as expected.
:
:>     Also look at the user cpu versus system cpu percentage to see
:>     where the cpu is going.
:
:Here's top, any hints?  (note: the names have been changed to protect the
:innocent) 

    The classic priority inversion problem occurs when you have a low
    priority process blocked on I/O and a higher priority process
    monopolizing the cpu.  Even though the lower priority process is
    woken up by the kernel, it doesn't get cpu until there are no
    runnable higher priority processes, and so it is unable to release
    any locks it might have been holding for the I/O.

    The FreeBSD-stable scheduler will dynamically alter the priority of a
    running process, which prevents the priority inversion from locking up
    the machine when all the processes in question are on the normal 
    scheduler queue.  But it can't cross priority queues, so if you have
    a process on the idle priority queue it can get 'stuck' in
    a system call while holding a lock (like on a directory vnode or
    something) and then never get the cpu *at* *all* while other normal
    processes are monopolizing the cpu.  As other normal processes try
    to obtain the lock they block, locking the whole system up
    (except for the higher priority processes monopolizing the cpu,
    but for all intents and purposes the system is locked up).
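
    The lockup described above can be sketched with a toy model.  This
    is not the -stable scheduler: the two classes (NORMAL, IDLE), the
    Proc class, simulate() and make_procs() are all hypothetical names
    made up for illustration, and a "tick" is just one turn of a
    round-robin loop.  The only rule it models is the one that matters
    here: an idle-class process runs only when no normal-class process
    is runnable.

```python
NORMAL, IDLE = 0, 1   # hypothetical scheduling classes; lower value is served first


class Proc:
    """Toy process: needs `work_left` ticks of cpu, may hold or want one lock."""

    def __init__(self, name, queue, work_left, holds_lock=False, wants_lock=False):
        self.name = name
        self.queue = queue
        self.work_left = work_left
        self.holds_lock = holds_lock
        self.wants_lock = wants_lock
        self.ran = 0


def simulate(procs, ticks):
    """Strict class scheduling; returns True if the lock was ever released."""
    lock_free = not any(p.holds_lock for p in procs)
    turn = 0
    for _ in range(ticks):
        # A process that wants the lock is blocked for as long as it is held.
        runnable = [p for p in procs
                    if p.work_left > 0 and (lock_free or not p.wants_lock)]
        if not runnable:
            break
        # Only the best non-empty class gets cpu; round-robin inside it.
        best = min(p.queue for p in runnable)
        candidates = [p for p in runnable if p.queue == best]
        p = candidates[turn % len(candidates)]
        turn += 1
        p.ran += 1
        p.work_left -= 1
        # The holder drops the lock once its pending work is done.
        if p.holds_lock and p.work_left == 0:
            p.holds_lock = False
            lock_free = True
    return lock_free


def make_procs():
    """Three cpu hogs, one idle-class lock holder, one normal-class waiter."""
    hogs = [Proc("nit%d" % i, NORMAL, 10**9) for i in range(3)]
    seti = Proc("setiathome", IDLE, 1, holds_lock=True)
    shell = Proc("shell", NORMAL, 1, wants_lock=True)
    return hogs + [seti, shell]


if __name__ == "__main__":
    procs = make_procs()
    released = simulate(procs, ticks=1000)
    for p in procs:
        print("%-12s ran %d ticks" % (p.name, p.ran))
    print("lock released:", released)
```

    In this model the idle-class holder needs only a single tick to
    release the lock, but it never gets one, so the normal-class
    waiter stays blocked for the whole run while the hogs share the
    cpu.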

    I believe FreeBSD-current solves this problem by aggregating
    the three priority queues we had in -stable into a single queue
    for -current, and then allowing a higher priority process to
    'lend' its priority to a lower priority process that is holding
    a lock that the higher priority process wants.  I don't know if
    it's been 100% implemented yet.  You could ask John (JHB).
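
    The 'lending' idea can be sketched in the same kind of toy model.
    This is an illustration of priority lending in general, not of the
    -current implementation: Proc, effective_queue(), simulate() and
    make_procs() are hypothetical names, and the rule assumed here is
    simply that a lock holder is scheduled in the best class among
    itself and the processes blocked on its lock.

```python
NORMAL, IDLE = 0, 1   # hypothetical scheduling classes; lower value is served first


class Proc:
    """Toy process: needs `work_left` ticks of cpu, may hold or want one lock."""

    def __init__(self, name, queue, work_left, holds_lock=False, wants_lock=False):
        self.name = name
        self.queue = queue
        self.work_left = work_left
        self.holds_lock = holds_lock
        self.wants_lock = wants_lock
        self.ran = 0


def effective_queue(p, procs, lend):
    """Class used for scheduling; with lending, a holder borrows a waiter's class."""
    if lend and p.holds_lock:
        waiting = [w.queue for w in procs if w.wants_lock and w.work_left > 0]
        return min([p.queue] + waiting)
    return p.queue


def simulate(procs, ticks, lend):
    """Returns True if the lock was released within `ticks` ticks."""
    lock_free = not any(p.holds_lock for p in procs)
    turn = 0
    for _ in range(ticks):
        # A process that wants the lock is blocked for as long as it is held.
        # Re-acquisition by the waiter is not modeled; it just becomes runnable.
        runnable = [p for p in procs
                    if p.work_left > 0 and (lock_free or not p.wants_lock)]
        if not runnable:
            break
        best = min(effective_queue(p, procs, lend) for p in runnable)
        candidates = [p for p in runnable
                      if effective_queue(p, procs, lend) == best]
        p = candidates[turn % len(candidates)]
        turn += 1
        p.ran += 1
        p.work_left -= 1
        if p.holds_lock and p.work_left == 0:
            p.holds_lock = False
            lock_free = True
    return lock_free


def make_procs():
    """Three cpu hogs, one idle-class lock holder, one normal-class waiter."""
    hogs = [Proc("nit%d" % i, NORMAL, 10**9) for i in range(3)]
    seti = Proc("setiathome", IDLE, 1, holds_lock=True)
    shell = Proc("shell", NORMAL, 1, wants_lock=True)
    return hogs + [seti, shell]


if __name__ == "__main__":
    for lend in (False, True):
        procs = make_procs()
        released = simulate(procs, ticks=20, lend=lend)
        print("lend=%s: lock released=%s" % (lend, released))
```

    With lend=False the holder never runs and the lock stays held;
    with lend=True the holder competes with the hogs, gets its one
    tick, releases the lock, and the waiter then runs too.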

    In your case I'm sure the normal priority 'nit' and other
    cpu intensive processes combined with the idprio setiathome 
    processes are creating this problem.  I recommend either not
    running setiathome, or running it with a normal nice value (like nice +19).
    Alternatively you might consider running -current but I would not
    recommend it for a production environment yet.
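
    For a launcher daemon like the one mentioned in the quoted text,
    the nice +19 alternative could look roughly like the sketch below.
    run_niced() is a hypothetical helper, not part of any existing
    tool; it assumes a POSIX system and uses only the standard
    os.nice() and subprocess.Popen() calls.  The demo command just
    prints the child's own nice value and stands in for the real
    program.

```python
import os
import subprocess
import sys


def run_niced(argv, increment=19):
    """Start argv as a child whose nice value is raised by `increment`."""
    # preexec_fn runs in the child between fork() and exec() (POSIX only),
    # so only the child is reniced.  It stays on the normal scheduler
    # queue instead of being moved to the idle queue.
    return subprocess.Popen(
        argv,
        preexec_fn=lambda: os.nice(increment),
        stdout=subprocess.PIPE,
        text=True,
    )


if __name__ == "__main__":
    # Stand-in command: report the nice value the child ended up with.
    child = run_niced([sys.executable, "-c", "import os; print(os.nice(0))"])
    out, _ = child.communicate()
    print("child nice value:", out.strip())
```

    A process started this way still gets some cpu under load, so it
    can finish a system call and drop whatever lock it is holding.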

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>

:last pid:  3145;  load averages: 13.60, 13.97, 14.05   up 18+14:27:19  13:26:35
:63 processes:  15 running, 47 sleeping, 1 stopped
:CPU states:  4.5% user,  0.0% nice, 94.8% system,  0.6% interrupt,  0.0% idle
:Mem: 142M Active, 656M Inact, 145M Wired, 47M Cache, 112M Buf, 14M Free
:Swap: 2048M Total, 56K Used, 2048M Free
:
:  PID USERNAME PRI NICE  SIZE    RES STATE  C   TIME   WCPU    CPU COMMAND
:78128 useruser  63   0 34696K 33896K RUN    0  83:26 31.30% 31.30% nit
:78596 useruser  64   0 18716K 17896K RUN    0  79:59 31.10% 31.10% nit
:78959 useruser  64   0 15872K 14728K RUN    0  79:30 29.93% 29.93% nit
:57493 use       63   0  6412K  5804K RUN    1 601:36 13.43% 13.43% perl
:99887 useruser  63   0 14200K 10420K CPU1   1   3:26 13.09% 13.09% perl
:99918 use       64   0  1060K   656K RUN    1   2:26 11.33% 11.33% funny
: 2059 useruser  63   0  2220K  1424K RUN    1   0:59 11.18% 11.18% grep
:  507 use       63   0  1060K   656K RUN    1   1:47  9.52%  9.52% funny
: 1363 use       61   0  1060K   632K RUN    0   0:57  8.98%  8.98% funny
:...
: 3145 use        2   0  1060K   596K sbwait 0   0:00  9.00%  0.44% funny
:99230 nobody    37  52 16556K 16424K RUN    1 182.4H  0.00%  0.00% setiathome
:21867 nobody    37  52 16556K 16428K RUN    0 171.6H  0.00%  0.00% setiathome
:-- 
:Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net
