From owner-freebsd-amd64@FreeBSD.ORG Wed Jan 30 02:27:52 2008
Delivered-To: freebsd-amd64@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id A76F216A417
	for ; Wed, 30 Jan 2008 02:27:52 +0000 (UTC)
	(envelope-from freebsd@penx.com)
Received: from Elmer.dco.penx.com (elmer-sprint.dco.penx.com [65.173.215.114])
	by mx1.freebsd.org (Postfix) with ESMTP id 581CC13C45A
	for ; Wed, 30 Jan 2008 02:27:52 +0000 (UTC)
	(envelope-from freebsd@penx.com)
Received: from [172.19.10.240] (sylvester.dco.penx.com [172.19.10.240])
	by Elmer.dco.penx.com (8.14.2/8.14.2) with ESMTP id m0U2Rn0h098463;
	Tue, 29 Jan 2008 19:27:49 -0700 (MST)
	(envelope-from freebsd@penx.com)
From: Dennis Glatting
To: John Baldwin
In-Reply-To: <200801291900.42989.jhb@freebsd.org>
References: <1201388299.84900.12.camel@Sylvester.dco.penx.com>
	<20080129202643.6BF568DE@fep1.cogeco.net>
	<200801291900.42989.jhb@freebsd.org>
Content-Type: text/plain
Date: Tue, 29 Jan 2008 19:27:49 -0700
Message-Id: <1201660069.95413.9.camel@Sylvester.dco.penx.com>
Mime-Version: 1.0
X-Mailer: Evolution 2.12.3 FreeBSD GNOME Team Port
Content-Transfer-Encoding: 7bit
Cc: freebsd-amd64@freebsd.org
Subject: Re: Multi processor locking problem under 7.0
X-BeenThere: freebsd-amd64@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Porting FreeBSD to the AMD64 platform
X-List-Received-Date: Wed, 30 Jan 2008 02:27:52 -0000

On Tue, 2008-01-29 at 19:00 -0500, John Baldwin wrote:
> On Tuesday 29 January 2008 03:26:44 pm Paul wrote:
> >
> > >I have several systems of two different types running 7.0. One is an IBM
> > >3550 and the other a Dell 2950. The IBMs more than the Dells
> > >consistently seem to have a kernel locking problem during dump.
> > >Specifically, if I execute this command:
> > >
> > >  dump 0uaLCf 64 /dev/null /usr
> > >
> > >Dump consistently stops in Phase IV. However, if I set
> > >machdep.hlt_logical_cpus=1, dump does not stop. At the end of this
> > >message is my boot information.
> > >
> > >When logical_cpus=0, the following is typical of what is displayed by
> > >top when dump stops:
> > >
> > >  PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
> > >  926 root        1   4    0 75476K 71744K sbwait 0   0:04  0.00% dump
> > >  928 root        1  20    0 75348K 67740K pause  1   0:02  0.00% dump
> > >  929 root        1  20    0 75348K 67740K pause  1   0:02  0.00% dump
> > >  927 root        1  20    0 75348K 67740K pause  1   0:02  0.00% dump
> > >  919 root        1   8    0 75348K 67144K wait   0   0:00  0.00% dump
> > >
> > >Fooling around a bit I have found that if I truss dump, the dump
> > >continues. On the Dells, if I force disk activity during the dump, such
> > >as executing a ls -lR /usr > /dev/null, the dump finishes.
> > >
> > >I am unsure how to proceed in debugging this problem. It has been around
> > >for a while but I am now installing the IBMs and the dump problem is a
> > >no-starter. Please contact me directly on how to proceed.
> >
> > I have noticed something similar on my Intel test box.
> >
> > When compiling many ports in the tree that is updated on 7.0RC1 with
> > a S5000pal with 2 Quadcore Xeons the process just STOPS. I am using
> > the install disk and have not updated to the latest cvsup release yet
> > (I am trying to make the world now with fingers crossed :) ) I tried
> > it with just one quadcore and the same problem happens.
> >
> > There are no errors on the screen but it no longer proceeds with the
> > port build. When I suspend the process and restart the make in the
> > same session it has no problem getting past this impasse and with a
> > few suspends the make finishes without error. It does not happen
> > every time which is very odd.
> >
> > Based on your description above it seems like it may be the same problem.
> >
> > What do you think?
>
> If you have threads blocked on "vmo_de" then upgrade to the latest RELENG_7 or
> RELENG_7_0 (specifically the sys/kern/subr_sleepqueue.c file) and try again.
>

I got the right file and updated my systems. I ran dump on the IBM
system five times. Dump hung four times, three of those times when
99.99% complete. Below is the ps output. How do I tell what the threads
are blocked on?

Daffy> ps -axwHl | grep dump
    0   801     1   0  96  0 20952  4060 select Is    ??    0:00.00 /usr/sbin/sshd -f /etc/ssh/dumper/sshd_config
    0 14682   870   0   8  0 34388 26628 wait   I+    p0    0:00.20 dump 0uaLCf 24 /dev/null /usr (dump)
    0 14774 14682   0   4  0 34388 30680 sbwait I+    p0    0:01.01 dump: /dev/aacd0s1e: pass 4: 14.97% done, finished in 0:03 at T
    0 14775 14774   0  20  0 34388 26644 pause  I+    p0    0:00.69 dump 0uaLCf 24 /dev/null /usr (dump)
    0 14776 14774   0  20  0 34388 26644 pause  I+    p0    0:00.69 dump 0uaLCf 24 /dev/null /usr (dump)
    0 14777 14774   0  20  0 34388 26644 pause  I+    p0    0:00.69 dump 0uaLCf 24 /dev/null /usr (dump)
  600 14896 12552   0  96  0  5900  1184 -      R+    p2    0:00.00 grep dump
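
As far as I can tell, the ninth column of the ps output above is the wait channel (MWCHAN), which is one place to look for what a thread is blocked on. The following is a minimal Python sketch for pulling that column out, assuming the usual FreeBSD `ps -l` column order (UID PID PPID CPU PRI NI VSZ RSS MWCHAN STAT TT TIME COMMAND). The helper name `wait_channels` and the `SAMPLE` string are made up for illustration; the sample rows are copied from the output above.

```python
# Sketch: extract the wait channel (MWCHAN) from `ps -axwHl` output.
# Assumption: FreeBSD `ps -l` column order, i.e.
#   UID PID PPID CPU PRI NI VSZ RSS MWCHAN STAT TT TIME COMMAND
# wait_channels() is a hypothetical helper, not part of any FreeBSD tool.


def wait_channels(ps_output):
    """Return a list of (pid, mwchan, command) tuples, one per parsable row."""
    rows = []
    for line in ps_output.splitlines():
        # Split off the first 12 columns; whatever remains is COMMAND,
        # which may itself contain spaces.
        fields = line.split(None, 12)
        if len(fields) < 13 or not fields[1].isdigit():
            continue  # skip a header row or a malformed line
        rows.append((int(fields[1]), fields[8], fields[12].rstrip()))
    return rows


# Rows copied from the ps output in this message.
SAMPLE = """\
0 14682 870 0 8 0 34388 26628 wait I+ p0 0:00.20 dump 0uaLCf 24 /dev/null /usr (dump)
0 14774 14682 0 4 0 34388 30680 sbwait I+ p0 0:01.01 dump: /dev/aacd0s1e: pass 4: 14.97% done, finished in 0:03 at T
0 14775 14774 0 20 0 34388 26644 pause I+ p0 0:00.69 dump 0uaLCf 24 /dev/null /usr (dump)
"""

if __name__ == "__main__":
    for pid, mwchan, command in wait_channels(SAMPLE):
        print(pid, mwchan, command)
```

For the rows shown, the wait channels come out as wait, sbwait and pause; none of them is the "vmo_de" channel mentioned in the quoted advice.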