Date: Tue, 23 Jan 2001 10:21:52 -0800 (PST)
From: John Baldwin <jhb@FreeBSD.ORG>
To: Poul-Henning Kamp <phk@FreeBSD.ORG>
Cc: current@FreeBSD.ORG
Subject: RE: make -j 128 world hang....
Message-ID: <XFMail.010123102152.jhb@FreeBSD.org>
In-Reply-To: <23620.980267723@critter>

On 23-Jan-01 Poul-Henning Kamp wrote:
> 
> I can still stall my 2xPII/350 machine with a make -j 128 world,
> but it is slightly different now I think: I can break into ddb.
> 
> This machine is a ahc/scsi machine, so I don't know if this is
> really SMP or Justins recent changes...
> 
> Poul-Henning

Ok, the problem children here are the 'allproc' waiters, which are
waiting on the allproc lock.  I have had this lockup occur once so far.
It appears to be a deadlock in the lockmgr (surprise, surprise).  When I
had this, the allproc lockmgr lock has 1 pending exclusive lock and 4
pending exclusive locks, and 1 existing shared lock.  However, the actual
lockmgr struct itself thought it had 1 existing shared lock and 5 pending
shared locks.  *sigh*

I tried to dig around and determined that my smp_hlt.patch had made it
worse because cpu0 had basically HLT'd and never been woken up, and cpu1
was spinning forever in lockmgr in the atkbd0 ithread.  I got no farther
than that, however:

> 55818 cdc1ba80 cdc6c000 0 55775 16280 004006 3 allproc c02e08a0 cc

Exclusive lock for wait4() or exit1().  I think exit1().

> 13 cbc73400 cc4c1000 0 0 0 00020c 3 allproc c02e08a0 swi6: clock
> 12 cbc73620 cc4bf000 0 0 0 000204 3 allproc c02e08a0 swi1: net

These are both shared lock waiters.  The fact that softclock() is blocked
is why the machine "locks up".  It is blocked in schedcpu(), and no
timeouts are being called.

I have a vmcore and kernel.debug and have futzed around in gdb with them
for a while but don't know why it is locked up.  IIRC, the atkbd thread
was stuck here:

/*
 * This is the waitloop optimization, and note for this to work
 * simple_lock and simple_unlock should be subroutines to avoid
 * optimization troubles.
 */
static int
apause(struct lock *lkp, int flags)
{
#ifdef SMP
	int i, lock_wait;
#endif

	if ((lkp->lk_flags & flags) == 0)
		return 0;
#ifdef SMP
	for (lock_wait = LOCK_WAIT_TIME; lock_wait > 0; lock_wait--) {
		mtx_exit(lkp->lk_interlock, MTX_DEF);
		for (i = LOCK_SAMPLE_WAIT; i > 0; i--)
			if ((lkp->lk_flags & flags) == 0)
				break;
		mtx_enter(lkp->lk_interlock, MTX_DEF);
		if ((lkp->lk_flags & flags) == 0)
			return 0;
	}
#endif
	return 1;
}

If you want some fun, stick KTR and KTR_EXTEND in your kernel.  Then,
before you start your world, do:

	sysctl -w debug.ktr.mask=0x1008

This logs process switches (0x1000) and mutex ops (8).  When you break
into ddb, you can use 'tbuf' to display the first entry in the log
buffer, and 'tnext' to display the next entry.  Then tnext again, etc.
If you can get a core dump (it worked for me on my dual 200 at least),
then I have gdb macros that allow you to dump the KTR logs in gdb easily.

As for a fix, Jason Evans has implemented and tested, and will hopefully
soon commit, some simpler and lighter-weight shared/exclusive locks that
allproc and proctree will switch to using.  However, lockmgr is used in
lots of places, so it is still in our best interest to get it fixed.
Also, for the preemptive kernel (which is very close to running stably
on UP and SMP x86 and UP alpha last I heard, just some problems with FPU
state), all these #ifdef SMP's will have to go away and we will use
mutexes in UP as well.

-- 

John Baldwin <jhb@FreeBSD.org> -- http://www.FreeBSD.org/~jhb/
PGP Key: http://www.Baldwin.cx/~john/pgpkey.asc
"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-current" in the body of the message
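
Editor's note on the debug.ktr.mask value in the message: 0x1008 is the
bitwise OR of the two event classes named there, process switches (0x1000)
and mutex ops (8).  The fragment below is a minimal sketch of that
arithmetic only.  The macro and function names are hypothetical and chosen
for illustration; only the two numeric values come from the message.

```c
/*
 * Hypothetical names for illustration.  Only the numeric values
 * (0x1000 and 8) are taken from the message.
 */
#define TRACE_MUTEX_OPS		0x0008u	/* mutex operations */
#define TRACE_PROC_SWITCHES	0x1000u	/* process switches */

/* Combine both event classes into one value for debug.ktr.mask. */
static unsigned int
ktr_mask_for_lockup(void)
{
	return (TRACE_PROC_SWITCHES | TRACE_MUTEX_OPS);
}

/* Return nonzero if the given event class bit is set in mask. */
static int
ktr_class_enabled(unsigned int mask, unsigned int event_class)
{
	return ((mask & event_class) != 0);
}
```

Adding another event class to the trace would presumably mean OR-ing in
its bit in the same way, and ktr_class_enabled() can be used to check that
a given mask covers the class of interest before starting the build.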