From: John Baldwin <jhb@FreeBSD.org>
To: current@FreeBSD.org
Cc: jeffr@FreeBSD.org, Alan Cox, Julian Elischer
Date: Fri, 11 Feb 2005 11:48:45 -0500
Message-Id: <200502111148.45976.jhb@FreeBSD.org>
Subject: HEADSUP: Turn off cpu_idle_hlt on SMP for now on x86

I figured out a deadlock I hit on a 4-CPU test box last night, and I think
I have tracked it down to a bug in the 4BSD scheduler that can be worked
around by turning off the idle halt via the machdep.cpu_idle_hlt sysctl
(i.e., sysctl machdep.cpu_idle_hlt=0).  Specifically, it involves waking up
halted CPUs when a thread that is pinned or bound to a specific CPU is made
runnable.  The details of what I saw today are below:

FreeBSD/i386 (deimos.baldwin.cx) (ttyd0)

login: KDB: enter: Line break on console
[thread pid 13 tid 100006 ]
Stopped at      kdb_enter+0x30: leave
db> show pcpu
cpuid        = 3
curthread    = 0xc1dfd5c0: pid 11 "idle: cpu3"
curpcb       = 0xdc608da0
fpcurthread  = none
idlethread   = 0xc1dfd5c0: pid 11 "idle: cpu3"
APIC ID      = 3
currentldt   = 0x28
db> ps
  pid   proc     uid  ppid  pgrp  flag    stat  wmesg   wchan   cmd
79079 c2ebe1fc  1001 79067   610 0000012  [LOCK vm page queue mutex c1f90c40] make
79078 c2f2ede4  1001 78894   610 0000002  [LOCK vm page queue mutex c1f90c40] sh
79077 c3498000  1001 79043   610 0004002  [LOCK vm page queue mutex c1f90c40] make
79075 c2f317f0  1001 78898   610 0004002  [LOCK vm page queue mutex c1f90c40] make
79074 c2184de4  1001 78902   610 0004002  [LOCK vm page queue mutex c1f90c40] make
79067 c23c71fc  1001 78955   610 0004002  [SLPQ ppwait 0xc23c71fc][SLP] make
79060 c23e35f4  1001 78915   610 0004002  [LOCK Giant c1e41800] make
79043 c2a50000  1001 79042   610 0004002  [SLPQ wait 0xc2a50000][SLP] sh
79042 c2f493f8  1001 78884   610 0004002  [SLPQ select 0xc075e024][SLP] make
78978 c28de7f0  1001 78975   610 0004002  [RUNQ] sh
78975 c29bf1fc  1001 78970   610 0004002  [SLPQ select 0xc075e024][SLP] make
...
   40 c1e67000     0     0     0 0000204  [IWAIT] swi1: net
   39 c1e671fc     0     0     0 0000204  [IWAIT] swi3: vm
   38 c1e4b5f4     0     0     0 000020c  [LOCK Giant c1e41800] swi4: clock sio
   37 c1e4b7f0     0     0     0 0000204  [IWAIT] irq13:
   36 c1e4b9ec     0     0     0 0000204  [IWAIT] irq0: clk
   35 c1e4bbe8     0     0     0 0000204  [IWAIT] irq23: ahc0 ahc1
   34 c1e4bde4     0     0     0 0000204  [IWAIT] irq22:
   33 c1e5b000     0     0     0 0000204  [IWAIT] irq21:
   32 c1e5b1fc     0     0     0 0000204  [IWAIT] irq20:
   31 c1e5b3f8     0     0     0 0000204  [LOCK Giant c1e41800] irq19: fxp0 uhci0
...
   14 c1dfc5f4     0     0     0 000020c  [Can run] idle: cpu0
   13 c1dfc7f0     0     0     0 000020c  [CPU 1] idle: cpu1
   12 c1dfc9ec     0     0     0 000020c  [CPU 2] idle: cpu2
   11 c1dfcbe8     0     0     0 000020c  [CPU 3] idle: cpu3

Thus, CPU 0 is running a make process while CPUs 1, 2, and 3 are all idle.
Note that softclock() and irq19 (me trying to ssh in this morning over fxp0)
are both blocked on Giant.  Also note that there are 5 threads blocked on
the vm page queue mutex.

First, let's see who owns Giant and what that thread is doing.  I was doing
a benchmark, so I didn't have witness compiled in or turned on, so
'show alllocks' wasn't an option.

db> x/x Giant,10
Giant:          c0730028        c06ffb2c        c06ffb2c        b0000   0
Giant+0x14:     0               0               c2ebd8a2        0
Giant+0x24:     0               c07300a0        0               0
time_monitor:   92c86           7               420c903b

The owner is 0xc2ebd8a2, minus the 0x2 since that is a flag:

db> show threads
...
  100286 (0xc2ebd8a0)  sched_switch(c2ebd8a0,0,1,77,fcb6cb7e) at sched_switch+0x170
...
  100189 (0xc28dfcf0)  sched_switch(c28dfcf0,0,2,dff58a7c,95a67b30) at sched_switch+0x170
...
db> tr 100286
Tracing pid 79075 tid 100286 td 0xc2ebd8a0
sched_switch(c2ebd8a0,0,1,77,fcb6cb7e) at sched_switch+0x170
mi_switch(1,0,c07026d2,254,c075d568) at mi_switch+0x1d9
turnstile_wait(c076bda0,c28dfcf0,c2ebd8a0,d17bc930,c21a79e8) at turnstile_wait+0x269
_mtx_lock_sleep(c076bda0,c2ebd8a0,0,c0704083,dc5) at _mtx_lock_sleep+0x101
_mtx_lock_flags(c076bda0,0,c0704083,dc5,c0704083) at _mtx_lock_flags+0x45
vfs_clean_pages(d17bc930,0,c0704083,423,c21a79b4) at vfs_clean_pages+0x70
bdwrite(d17bc930,c3bdfa00,80,2000,0) at bdwrite+0x2f0
ffs_update(c46bf564,0,1,50,e00efb2c) at ffs_update+0x396
ufs_inactive(e00efb44,e00efb5c,c05978ea,e00efb44,0) at ufs_inactive+0x225
VOP_INACTIVE_AP(e00efb44,0,c0705028,768,c074e7a0) at VOP_INACTIVE_AP+0x4e
vrele(c46bf564,c074e2a0,c46bf564,1,c255a700) at vrele+0x11a
vn_close(c46bf564,1,c255a700,c2ebd8a0,0) at vn_close+0x50
vn_closefile(c30e6318,c2ebd8a0,c06fddb9,846,c30e6318) at vn_closefile+0xf4
fdrop_locked(c30e6318,c2ebd8a0,c06fddb9,831) at fdrop_locked+0x94
fdrop(c30e6318,c2ebd8a0,c06fddb9,77c,c0515cfd,c0737ce0,1000,c2ebd8a0,e00efcac,c056009c,1,e00efc80,c255a700,0,e00efc90,e00efc90,c0513783,3f,c31b93c8,0,6,c31b9300,e00efcb0,c051389d,c31b9300,0,6) at fdrop+0x3c
closef(c30e6318,c2ebd8a0,c06fddb9,3e8,c2ebd8a0) at closef+0x3f2
close(c2ebd8a0,e00efd14,4,e00efd00,1) at close+0x1f2
syscall(2f,2f,2f,809c200,0) at syscall+0x2a0
Xint0x80_syscall() at Xint0x80_syscall+0x1f
--- syscall (6, FreeBSD ELF32, close), eip = 0x805c743, esp = 0xbfbfd18c, ebp = 0xbfbfd1a8 ---

Looking up the pid in the ps output above, I see that this thread is one of
the threads blocked on the vm page queue mutex.  Ok, so let's look at that
lock's owner:

db> x/x 0xc076bda0,10
vm_page_queue_mtx:      c0730028        c07100af        c07100af        b0000
vm_page_queue_mtx+0x10: 0               0               0               c28dfcf2
vm_page_queue_mtx+0x20: 0               0               c1e673f8        f540
vm_pages_needed:        0               0               0               1552

Lock owner is 0xc28dfcf2.
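
To spell out the masking I am doing by hand here and above for Giant: the
low bits of the lock word are flag bits, 0x2 being the contested flag if I
remember the mutex code right, and clearing them yields the owning thread
pointer.  A throwaway userland illustration follows; the macro name is made
up for this sketch rather than taken from sys/mutex.h:

/*
 * Illustration only: recover the owning thread pointer from a mutex's
 * lock word by masking off the low flag bits.  LOCK_FLAG_BITS is a name
 * invented for this sketch, not the kernel's macro; the 0x3 assumes the
 * two low bits (recursed and contested) are flags.
 */
#include <stdint.h>
#include <stdio.h>

#define LOCK_FLAG_BITS  0x3

static uintptr_t
lock_owner(uintptr_t lock_word)
{

	return (lock_word & ~(uintptr_t)LOCK_FLAG_BITS);
}

int
main(void)
{

	/* The two lock words from the ddb output above. */
	printf("Giant owner:             %#lx\n",
	    (unsigned long)lock_owner(0xc2ebd8a2));	/* 0xc2ebd8a0 */
	printf("vm_page_queue_mtx owner: %#lx\n",
	    (unsigned long)lock_owner(0xc28dfcf2));	/* 0xc28dfcf0 */
	return (0);
}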
I looked the owner up via show threads (tid 100189 above) and traced it:

db> tr 100189
Tracing pid 78978 tid 100189 td 0xc28dfcf0
sched_switch(c28dfcf0,0,2,dff58a7c,95a67b30) at sched_switch+0x170
mi_switch(2,0,c0701343,252,0) at mi_switch+0x1d9
critical_exit(c074bcc0,c052ddd5,c0755f18,0,c28dfcf0) at critical_exit+0x7e
intr_execute_handlers(c074bcc0,dff58b10,c076bc00,dff58b1c,c28dfcf0) at intr_execute_handlers+0xa4
atpic_handle_intr(0) at atpic_handle_intr+0xaf
Xatpic_intr0() at Xatpic_intr0+0x20
--- interrupt, eip = 0xc054dd3c, esp = 0xdff58b54, ebp = 0xdff58b68 ---
critical_exit(dff58b94,c06bbcbd,c075cae0,0,c0715535) at critical_exit+0xcc
_mtx_unlock_spin_flags(c075cae0,0,c0715535,252,e) at _mtx_unlock_spin_flags+0x23
pmap_invalidate_page(c29c869c,8064000,2,766,c19d6d10) at pmap_invalidate_page+0xdd
pmap_enter(c29c869c,8064000,c19d6d10,3,0) at pmap_enter+0x33d
vm_fault(c29c85dc,8064000,2,8,c28dfcf0) at vm_fault+0x13e7
trap_pfault(dff58d48,1,80643e8,bfbfe060,80643e8) at trap_pfault+0xf1
trap(2f,2f,2f,0,8069000) at trap+0x21a
calltrap() at calltrap+0x5
--- trap 0xc, eip = 0x80563cf, esp = 0xbfbfe0d0, ebp = 0xbfbfe0d8 ---

So this is the thread that is in state RUNQ even though I have three idle
CPUs.  Very curious!  I then wondered if my current CPU (CPU 3) thought
that there were any runnable threads:

db> call sched_runnable()
0

0 means no.  That's when I figured that the runnable thread must be pinned,
especially since it was interrupted while messing with the pmap, which
usually requires pinning.  I then tried an experiment to test my theory:
turn off the idle halt and manually IPI all the other CPUs so that they
would recheck their run queues, and whichever CPU needed to run the process
owning the vm page queue mutex would run it.  Unfortunately, I think I got
the vector number wrong, as all 3 CPUs panicked at once (which is why their
console output below is interleaved).  :-P

db> x cpu_idle_hlt
cpu_idle_hlt:   1
db> w cpu_idle_hlt 0
cpu_idle_hlt            0x1     =       0
db> call lapic_ipi_raw(0x000c00e7, 0)
0xd1622000
db> c
cccpppuuuiiiddd === 102;;; aaapppiiiccc iiiddd === 000102 iiinnnssstttrrruuuccctttiiiooonnn pppoooiiinnnttteerer r = = = 0 x008xx:80:8x0:cx00cx60ca50536bae055eb9e a5st as ctkas ctpkao cipkno tipenorti en rt e r = 0 x= 1 0 :0=0x x10d0xc:16000xed:cf0bex40dfc afb6r40abm ... 857 (make)
trap number             = 30
panic: unknown/reserved trap
cpuid = 0
Uptime: 1d16h58m16s
Cannot dump. No dump device defined.
Automatic reboot in 15 seconds - press a key on the console to abort
Rebooting...
cpu_reset: Stopping other CPUs

Thus, my theory is that when the pinned thread was preempted and put back
on the run queue, the scheduler didn't IPI the CPU it was pinned to in
order to wake it up in case it was idle.  The IPI is only needed if the
CPUs are halted, which is why I think turning the idle halt off might work
as a workaround.  I don't know if ULE has this same issue, but I've cc'd
Jeff and hopefully he can look into it.

(For anyone who would rather not poke the variable from ddb, a small
userland snippet for flipping the sysctl follows below my sig.)

-- 
John Baldwin  <><  http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve"  =  http://www.FreeBSD.org
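
The shell one-liner for the workaround is just "sysctl
machdep.cpu_idle_hlt=0".  If you would rather do the same thing from a
program, here is a trivial sketch using sysctlbyname(3); it assumes the
OID is a plain int, which is what the ddb output above suggests, and it
needs root to set the new value:

/*
 * Minimal sketch: turn off the idle halt from userland.  Equivalent to
 * "sysctl machdep.cpu_idle_hlt=0" from the shell.  Assumes the OID is a
 * plain int; must be run as root to change it.
 */
#include <sys/types.h>
#include <sys/sysctl.h>

#include <err.h>
#include <stdio.h>

int
main(void)
{
	int newval = 0, oldval;
	size_t oldlen = sizeof(oldval);

	if (sysctlbyname("machdep.cpu_idle_hlt", &oldval, &oldlen,
	    &newval, sizeof(newval)) == -1)
		err(1, "sysctlbyname(machdep.cpu_idle_hlt)");
	printf("machdep.cpu_idle_hlt: %d -> %d\n", oldval, newval);
	return (0);
}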