Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 23 Sep 1997 19:08:53 -0700 (PDT)
From:      Archie Cobbs <archie@whistle.com>
To:        freebsd-hackers@freebsd.org
Subject:   is this a bug?
Message-ID:  <199709240208.TAA04263@bubba.whistle.com>

next in thread | raw e-mail | index | archive | help

We're trying to track down a very elusive bug that involves processes
ending up on two queues at once, or having runnable state SRUN and
being on a sleep queue, ... etc..

Question: is there a bug in this code from tsleep() ? The code does this:

  1. Puts process on a sleep queue
  2. Calls CURSIG(), which can result in a context switch
  3. Sets p->p_stat to SSLEEP

It seems steps #2 and #3 are out of order...  couldn't this result
in a process being on a sleep queue yet having state SRUN?

> #ifdef DIAGNOSTIC
> 	if( p == NULL ) 
> 		panic("tsleep1");
> 	if (ident == NULL || p->p_stat != SRUN)
> 		panic("tsleep");
> 	/* XXX This is not exhaustive, just the most common case */
> 	if ((p->p_procq.tqe_prev != NULL) &&  (*p->p_procq.tqe_prev == p))
> 		panic("sleeping process on run queue");
> #endif
> 	p->p_wchan = ident;
> 	p->p_wmesg = wmesg;
> 	p->p_slptime = 0;
> 	p->p_priority = priority & PRIMASK;
> 	TAILQ_INSERT_TAIL(&slpque[LOOKUP(ident)], p, p_procq);
> 	if (timo)
> 		timeout(endtsleep, (void *)p, timo);
> 	/*
> 	 * We put ourselves on the sleep queue and start our timeout
> 	 * before calling CURSIG, as we could stop there, and a wakeup
> 	 * or a SIGCONT (or both) could occur while we were stopped.
> 	 * A SIGCONT would cause us to be marked as SSLEEP
> 	 * without resuming us, thus we must be ready for sleep
> 	 * when CURSIG is called.  If the wakeup happens while we're
> 	 * stopped, p->p_wchan will be 0 upon return from CURSIG.
> 	 */
> 	if (catch) {
> 		p->p_flag |= P_SINTR;
> 		if ((sig = CURSIG(p))) {
> 			if (p->p_wchan)
> 				unsleep(p);
> 			p->p_stat = SRUN;
> 			goto resume;
> 		}
> 		if (p->p_wchan == 0) {
> 			catch = 0;
> 			goto resume;
> 		}
> 	} else
> 		sig = 0;
> 	p->p_stat = SSLEEP;
> 	p->p_stats->p_ru.ru_nvcsw++;
> 	mi_switch();
> resume:
> 	curpriority = p->p_usrpri;
> 	splx(s);

In fact, one panic() we got was the "wakeup_one" panic, where
this test failed:

> 	qp = &slpque[LOOKUP(ident)];
> 
> 	for (p = qp->tqh_first; p != NULL; p = p->p_procq.tqe_next) {
> #ifdef DIAGNOSTIC
> 		if (p->p_stat != SSLEEP && p->p_stat != SSTOP)
> 			panic("wakeup_one");
> #endif

Also: would there be any problems moving the "p->p_stat = SSLEEP"
statement to before the "if (catch)" statement?

We've tried it, and things seem to work, except for an occasional:

  calcru: negative time: -1318 usec

which may or may not be related...

Thanks,
-Archie

___________________________________________________________________________
Archie Cobbs   *   Whistle Communications, Inc.  *   http://www.whistle.com



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?199709240208.TAA04263>