Date: Wed, 26 Jul 1995 00:57:03 -0700
From: Matt Dillon <dillon@blob.best.net>
To: davidg@Root.COM
Cc: Doug Rabson <dfr@render.com>, bugs@freebsd.org
Subject: Re: brelse() panic in nfs_read()/nfs_bioread()
Message-ID: <199507260757.AAA13857@blob.best.net>
Dima and I will bring BEST's system up to date tonight.  We have been having
some rather severe (about once a day) crashes on our second shell machine that
are completely different from the crashes we see on other machines.  This
second shell machine is distinguished from the others in that it mounts users'
home directories via NFS, so there is a great deal more NFS client activity.
Unfortunately, the crash locks things up... it can partially sync the disks
but it can't dump core.  The only message I get is the panic message on the
console:

    panic biodone: page busy < 0 off: 180224, foff: 180224, valid: 0xFF,
    dirty:0 mapped:0 resid: 4096, index: 0, iosize: 8192, lblkno: 22

I believe the failure is related to NFS.  The question is, is this a new bug,
or do any of the recent patches have a chance of fixing it?  A hard question,
considering the lack of information.  All in all, not counting the above
crashing bug on our shell2 machine, the machines are becoming quite a bit
more stable.

--

I have been noticing some pretty major cascade failures in the scheduling
algorithm.  Basically, it is impossible to use nice() values to give one
process a reasonable priority over another.  What occurs on a heavily loaded
system is that the niced processes (say, nice +10) wind up getting NO cpu
whatsoever in the face of heavy loading (load of 10) coupled with interactive
activity.  Simple short processes, such as /bin/ls runs from FTP, stay in a
Run state, get no CPU, and simply build up on the machine, causing the
machine's load to jump.  Since the load average is fed back into the
scheduling algorithms, this cascades until the process resource limit is
hit... I've seen our WWW server hit a load of 200 from this effect!

The solution is that I've pretty much redone the scheduling core... about six
source files and one assembly file (i386/i386/swtch.s).  In the course of
redoing it, I noticed that the critical paths in the tsleep(), wakeup(), and
task switching code had all sorts of junk in them that was slowing the task
switch down, so I shifted some stuff out of the critical path and into
hardclock() and schedcpu().

In any case, the new core uses a baseline time slice which it tries to divide
up according to:

                     ~p->p_priority
    ----------------------------------------------  *  40 mS
    sum(~p->p_priority) for all runnable processes

The algorithm works very well even with a system granularity of 10 mS, and
without any fancy calculations.  Fractional portions of the calculated time
slice have the side effect of causing a low priority process to skip one or
more round-robins.

I got rid of nearly all the need_resched() calls strewn all over the code in
favor of a priority-based insertion into the (now single) run queue whenever
a process is woken up.  Since the sum(~p->p_priority) is adjusted
instantaneously whenever a process goes to sleep or wakes up, there is no
need to preempt the current process from inside wakeup().  Instead,
hardclock() does it at the next clock tick.

I also completely rewrote the p_estcpu calculation which, along with
nice()ness, is the basis for p->p_priority's generation.  p_estcpu now
reflects the ratio of the amount of cpu used over the amount of cpu allocated
to the process, and thus has a roughly linear relationship to the load for
cpu-bound processes without compromising the interactive responsiveness of
I/O-bound processes.  As a system gets more loaded down, interactive
responsiveness stays about the same, and even the highest-niced process still
gets *some* cpu... a nice +20 process will not be totally locked out by a
nice -1 process or 20 running nice +1 processes.
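[Editor's note: the following is a minimal, self-contained sketch of the
time-slice split described above, not Matt's actual patch.  Only the
p_priority field, the 40 mS baseline, and the 10 mS granularity come from
the mail; the priority range, the weight() helper, and the example run
queue are assumptions for illustration.]

    /*
     * Sketch: split a 40 mS baseline quantum among runnable processes
     * in proportion to inverted priority (~p->p_priority).
     */
    #include <stdio.h>

    #define BASELINE_MS   40      /* baseline time slice from the mail   */
    #define TICK_MS       10      /* 10 mS system granularity            */
    #define PRIO_MAX      127     /* assumed priority range 0..127       */

    struct proc {
        int p_priority;           /* lower value = more favored          */
        int slice_ms;             /* computed share of the baseline      */
    };

    /* ~p->p_priority: invert so a better (lower) priority weighs more. */
    static int weight(const struct proc *p)
    {
        return PRIO_MAX - p->p_priority;
    }

    int main(void)
    {
        struct proc runq[] = {        /* pretend run queue               */
            {  50, 0 },               /* interactive-ish process         */
            {  90, 0 },               /* nice +10-ish process            */
            { 120, 0 },               /* heavily niced process           */
        };
        int n = sizeof(runq) / sizeof(runq[0]);
        int sum = 0, i;

        for (i = 0; i < n; i++)       /* sum(~p_priority) over runnable  */
            sum += weight(&runq[i]);

        for (i = 0; i < n; i++) {
            runq[i].slice_ms = weight(&runq[i]) * BASELINE_MS / sum;
            /*
             * A share smaller than one 10 mS tick cannot be delivered
             * this round, so the process effectively skips one or more
             * round-robin passes, as described in the mail.
             */
            printf("prio %3d -> %2d mS%s\n", runq[i].p_priority,
                runq[i].slice_ms,
                runq[i].slice_ms < TICK_MS ? "  (skips rounds)" : "");
        }
        return 0;
    }

With the example priorities above this prints roughly 25, 12, and 2 mS, so
the heavily niced process still gets a turn, just a smaller and less
frequent one.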
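[Editor's note: likewise, a rough sketch of the used/allocated ratio idea
behind the reworked p_estcpu.  The field names other than p_estcpu,
p_priority, and p_nice, the scale factor, and the priority weighting are
invented here; they are not the committed code.]

    #include <stdio.h>

    struct proc_sketch {
        unsigned ticks_used;       /* ticks the process actually ran      */
        unsigned ticks_allocated;  /* ticks it was offered this period    */
        unsigned p_estcpu;         /* scaled used/allocated ratio         */
        unsigned p_nice;           /* nice value (0 = most favored here)  */
        unsigned p_priority;       /* derived scheduling priority         */
    };

    #define ESTCPU_SCALE 100       /* assumed scale, not from the patch   */

    /* Would be called periodically, e.g. from schedcpu(). */
    static void update_estcpu(struct proc_sketch *p)
    {
        if (p->ticks_allocated != 0)
            p->p_estcpu = p->ticks_used * ESTCPU_SCALE / p->ticks_allocated;
        /*
         * A cpu-bound process keeps its ratio near ESTCPU_SCALE no matter
         * how loaded the machine is, so its priority degrades roughly
         * linearly with load; an I/O-bound process that sleeps before
         * using its allocation keeps a small ratio and stays responsive.
         */
        p->p_priority = p->p_estcpu / 4 + p->p_nice;  /* assumed weighting */
        p->ticks_used = p->ticks_allocated = 0;       /* new period        */
    }

    int main(void)
    {
        struct proc_sketch hog  = { 40, 40, 0, 10, 0 };  /* cpu-bound      */
        struct proc_sketch edit = {  2, 40, 0,  0, 0 };  /* interactive    */

        update_estcpu(&hog);
        update_estcpu(&edit);
        printf("hog:  estcpu %u priority %u\n", hog.p_estcpu, hog.p_priority);
        printf("edit: estcpu %u priority %u\n", edit.p_estcpu, edit.p_priority);
        return 0;
    }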
Time-wise, the actual context switch isn't much faster... maybe a 5%
improvement, but I am still disappointed that I can only get 20,000 context
switches a second with a pipe() write/echo/read between two processes, so I
will be researching it a bit more.  The main thrust was to get rid of the
load-based cascade failure.

We are going to install these scheduling changes tonight as well, and I will
tell you on Friday how well they worked.  If they work well, I'd like to
submit them for review.

                                                -Matt

:>> DG will probably test and commit this change soon...
:>
:>This is good news; I just saw the commit mail go past.  Is it possible
:>that this affects Karl Denninger's problem as well?
:
:    Yes.
:
:-DG
:
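[Editor's note: the 20,000 switches/sec figure above comes from a pipe()
write/echo/read loop between two processes.  Below is one plausible,
self-contained version of that kind of ping-pong microbenchmark; the
two-pipe layout, iteration count, and gettimeofday() timing are assumptions,
not Matt's actual harness.]

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/time.h>
    #include <sys/wait.h>

    #define ROUNDS 100000          /* arbitrary; large enough to time      */

    int main(void)
    {
        int ab[2], ba[2];          /* parent->child and child->parent pipes */
        char c = 'x';
        struct timeval t0, t1;
        double secs;
        int i;

        if (pipe(ab) == -1 || pipe(ba) == -1) {
            perror("pipe");
            return 1;
        }

        switch (fork()) {
        case -1:
            perror("fork");
            return 1;
        case 0:                    /* child: echo every byte straight back */
            for (i = 0; i < ROUNDS; i++)
                if (read(ab[0], &c, 1) != 1 || write(ba[1], &c, 1) != 1)
                    _exit(1);
            _exit(0);
        default:                   /* parent: write a byte, wait for echo  */
            gettimeofday(&t0, NULL);
            for (i = 0; i < ROUNDS; i++)
                if (write(ab[1], &c, 1) != 1 || read(ba[0], &c, 1) != 1) {
                    perror("ping-pong");
                    return 1;
                }
            gettimeofday(&t1, NULL);
            wait(NULL);
        }

        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        /* Each round trip forces at least two context switches on one CPU. */
        printf("%d round trips in %.2f s, ~%.0f switches/sec\n",
            ROUNDS, secs, 2.0 * ROUNDS / secs);
        return 0;
    }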