Date:      Wed, 26 Jul 1995 00:57:03 -0700
From:      Matt Dillon <dillon@blob.best.net>
To:        davidg@Root.COM
Cc:        Doug Rabson <dfr@render.com>, bugs@freebsd.org
Subject:   Re: brelse() panic in nfs_read()/nfs_bioread() 
Message-ID:  <199507260757.AAA13857@blob.best.net>

   Dima and I will bring BEST's system up to date tonight.

   We have been having some rather severe (about once a day)
   crashes on our second shell machine that are completely
   different from the crashes we see on other machines.

   This second shell machine is distinguished from the others
   in that it mounts users' home directories via NFS, so there
   is a great deal more NFS client activity.

   Unfortunately, the crash locks things up... it can partially
   synch the disks but it can't dump core.  The only message I
   get is the panic message on the console:

	panic biodone: page busy < 0
	off: 180224, foff: 180224, valid: 0xFF, dirty:0 mapped:0
	resid: 4096, index: 0, iosize: 8192, lblkno: 22

    I believe the failure is related to NFS.  The question is,
    is this a new bug or do any of the recent patches have a
    chance at fixing it?  Hard question considering the lack
    of information.

    All in all, not counting the above crashing bug we are having
    on our shell2 machine, the machines are becoming quite a bit more
    stable.

    --

    I have been noticing some pretty major cascade failures in the scheduling
    algorithm.  Basically it is impossible to use nice() values to give one
    process a reasonable priority over another.  What occurs in a heavily
    loaded system is that the niced processes (say, nice +10) wind up getting
    NO cpu whatsoever in the face of heavy loading (load of 10) coupled with
    interactive activity.  Simple, short processes such as /bin/ls run from
    FTP stay in a Run state, get no CPU, and simply build up on the machine,
    causing the machine's load to jump.   Since the load average is fed back
    into the scheduling algorithms, this cascades until the process resource
    limit is hit... I've seen our WWW server hit a load of 200 from this 
    effect!

    The solution is that I've pretty much redone the scheduling core... about
    6 source files and one assembly file (i386/i386/swtch.s).  In the course
    of redoing it, I noticed that the critical paths in the tsleep(), wakeup(),
    and task switching code had all sorts of junk in them that was slowing
    the task switch down, and so shifted some stuff out of the critical path
    and into hardclock() and schedcpu().  In any case, the new core uses a
    baseline time slice which it tries to divide up according to:

			    ~p->p_priority
	    -------------------------------------------- * 40 mS
	    sum(~p->p_priority) for all runnable processes

    The algorithm works very well even with a system granularity of 10mS,
    and without any fancy calculations.  Fractional portions of the
    calculated time slice have the side effect of causing a low priority
    process to skip one or more round-robins.  I got rid of nearly all
    the need_resched() calls strewn all over the code in favor of a
    priority-based-insertion into the (now single) run queue whenever
    a process is woken up. Since the sum(~p->p_priority) is adjusted
    instantaneously whenever a process goes to sleep or wakes up,
    there is no need to preempt the current process from inside wakeup().
    Instead, hardclock() does it at the next clock tick.

    I also completely rewrote the p_estcpu calculation which, along with
    nice()ness, is the basis for p->p_priority's generation.  p_estcpu now 
    reflects the ratio of the amount of cpu used over the amount of cpu 
    allocated to the process, and thus has a roughly linear relationship 
    to the load for cpu-bound processes without compromising the interactive 
    responsiveness for I/O-bound processes.  As a system gets more loaded 
    down, interactive responsiveness stays about the same, and even the 
    highest-niced process still gets *some* cpu... a nice +20 process will 
    not be totally locked out by a nice -1 process or 20 running nice +1 
    processes.

    Time-wise, the actual context switch isn't much faster... maybe a 5%
    improvement, but I am still disappointed that I can only get 20,000
    context switches a second with a pipe() write/echo/read between two
    processes so I will be researching it a bit more.  The main thrust
    was to get rid of the load-based cascade failure.
    
    We are going to install these scheduling changes tonight as well and I
    will tell you on Friday how well they worked.  If they work well, I'd
    like to submit them for review.

						-Matt

:>> DG will probably test and commit this change soon...
:>
:>This is good news; I just saw the commit mail go past.  Is it possible 
:>that this affects Karl Denninger's problem as well?
:
:   Yes.
:
:-DG
:



