Date: Thu, 25 Apr 2002 14:15:06 -0700 (PDT)
From: Matthew Dillon
Message-Id: <200204252115.g3PLF6c07119@apollo.backplane.com>
To: Joe Greco
Cc: freebsd-smp@FreeBSD.ORG, freebsd-stable@FreeBSD.ORG
Subject: Re: kernel trap 9 with interrupts disabled
References: <200204251836.NAA41191@aurora.sol.net>

:>     Hmm.  Maybe adjust the code to panic the machine when this
:>     situation occurs, then see if you can get a kernel dump out
:>     of it.
:
:Looks like I'll be doing that next.  Any help available from anyone in
:looking at that?  I'm not big into reading kernel dumps  :-)
:
:>     As to the load issue... that sounds like a classic priority
:>     inversion problem.  Check the 'nice' of all the processes in
:>     the system and see if some nice'd-down processes are hogging
:>     the cpu.  'ps axlww' in a big window.
:
:Hmmm.  I did just notice something.  I run setiathome everywhere using a
:little daemon that punts it down to idprio etc.  I just tried to kill them
:and they didn't, and I looked again and it's because they're running at
:0.0%, so then I idprio -t -'d them, and when I did that to the first
:one, my login session froze for the better part of a minute.  It remained
:pingable but apparently unresponsive.  Then it recovered.  The second one
:went as expected.
:
:>     Also look at the user cpu verses system cpu percentage to see
:>     where the cpu is going.
:
:Here's top, any hints?
:(note: the names have been changed to protect the innocent)

    The classic priority inversion problem occurs when you have a low
    priority process blocked on I/O and a higher priority process
    monopolizing the cpu.  Even though the lower priority process is
    woken up by the kernel, it doesn't get cpu until there are no runnable
    higher priority processes and so it is unable to release any locks
    it might have been holding for the I/O.

    The FreeBSD-stable scheduler will dynamically alter the priority of
    a running process, which prevents the priority inversion from locking
    up the machine when all the processes in question are on the normal
    scheduler queue.  But it can't cross priority queues so if you have a
    process on the idle priority queue it can get 'stuck' in a system call
    while holding a lock (like on a directory vnode or something) and then
    never get the cpu *at* *all* while other normal processes are
    monopolizing the cpu.  As other normal processes try to obtain the
    lock they block, locking the whole system up (except for the higher
    priority processes monopolizing the cpu, but for all intents and
    purposes the system is locked up).

    I believe FreeBSD-current solves this problem by aggregating the three
    priority queues we had in -stable into a single queue for -current,
    and then allowing a higher priority process to 'lend' its priority to
    a lower priority process that is holding a lock that the higher
    priority process wants.  I don't know if it's been 100% implemented
    yet.  You could ask John (JHB).

    In your case I'm sure the normal priority 'nit' and other cpu
    intensive processes combined with the idprio setiathome processes are
    creating this problem.  I recommend either not running setiathome, or
    running it with a normal NICE (like nice +19).  Alternatively you
    might consider running -current but I would not recommend it for a
    production environment yet.
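
A rough way to see the inversion and the 'lend' fix described above is a
toy single-cpu model.  This is only a sketch under invented assumptions: it
is not FreeBSD scheduler code, and the names Proc and simulate, the priority
numbers (lower means more urgent) and the tick counts are all made up for
illustration.  The idle-class setiathome holds a lock and needs two ticks to
release it, nit is a cpu hog, and shell needs the lock before it can run.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Proc:
    name: str
    prio: int                   # toy convention: lower number = more urgent
    work: int                   # cpu ticks still needed
    wants_lock: bool = False    # must take the lock before doing any work
    holds_lock: bool = False
    lent: Optional[int] = None  # priority borrowed from a blocked waiter

    def effective(self) -> int:
        return self.prio if self.lent is None else min(self.prio, self.lent)


def simulate(lend: bool, ticks: int = 10) -> List[str]:
    """Return the name of the process that ran on each tick (one cpu)."""
    procs = [
        Proc("setiathome", prio=200, work=2, holds_lock=True),  # idle class
        Proc("nit", prio=100, work=1000),                       # cpu hog
        Proc("shell", prio=90, work=3, wants_lock=True),        # needs the lock
    ]
    trace = []
    for _ in range(ticks):
        holder = next((p for p in procs if p.holds_lock), None)
        waiters = [p for p in procs if p.wants_lock and p.work > 0]
        # A waiter takes the lock as soon as nobody holds it.
        if holder is None and waiters:
            holder = min(waiters, key=Proc.effective)
            holder.wants_lock, holder.holds_lock = False, True
            waiters.remove(holder)
        # Priority lending: the holder borrows its most urgent waiter's priority.
        if lend and holder is not None and waiters:
            holder.lent = min(w.effective() for w in waiters)
        # Blocked waiters are not runnable; run the most urgent of the rest.
        runnable = [p for p in procs if p.work > 0 and p not in waiters]
        if not runnable:
            break
        cur = min(runnable, key=Proc.effective)
        cur.work -= 1
        trace.append(cur.name)
        if cur.work == 0 and cur.holds_lock:
            cur.holds_lock, cur.lent = False, None  # done: drop lock and loan
    return trace
```

Calling simulate(lend=False) yields a trace containing only nit: the lock
holder never gets the cpu, so shell stays blocked for as long as the hog
runs.  With simulate(lend=True) the holder runs first on shell's borrowed
priority, releases the lock, and shell then finishes before nit gets the cpu
back.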

                                        -Matt
                                        Matthew Dillon

:last pid:  3145;  load averages: 13.60, 13.97, 14.05  up 18+14:27:19  13:26:35
:63 processes:  15 running, 47 sleeping, 1 stopped
:CPU states:  4.5% user,  0.0% nice, 94.8% system,  0.6% interrupt,  0.0% idle
:Mem: 142M Active, 656M Inact, 145M Wired, 47M Cache, 112M Buf, 14M Free
:Swap: 2048M Total, 56K Used, 2048M Free
:
:  PID USERNAME PRI NICE   SIZE    RES STATE  C   TIME   WCPU    CPU COMMAND
:78128 useruser  63    0 34696K 33896K RUN    0  83:26 31.30% 31.30% nit
:78596 useruser  64    0 18716K 17896K RUN    0  79:59 31.10% 31.10% nit
:78959 useruser  64    0 15872K 14728K RUN    0  79:30 29.93% 29.93% nit
:57493 use       63    0  6412K  5804K RUN    1 601:36 13.43% 13.43% perl
:99887 useruser  63    0 14200K 10420K CPU1   1   3:26 13.09% 13.09% perl
:99918 use       64    0  1060K   656K RUN    1   2:26 11.33% 11.33% funny
: 2059 useruser  63    0  2220K  1424K RUN    1   0:59 11.18% 11.18% grep
:  507 use       63    0  1060K   656K RUN    1   1:47  9.52%  9.52% funny
: 1363 use       61    0  1060K   632K RUN    0   0:57  8.98%  8.98% funny
:...
: 3145 use        2    0  1060K   596K sbwait 0   0:00  9.00%  0.44% funny
:99230 nobody    37   52 16556K 16424K RUN    1 182.4H  0.00%  0.00% setiathome
:21867 nobody    37   52 16556K 16428K RUN    0 171.6H  0.00%  0.00% setiathome
:--
:Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net