From: Julian Elischer <julian@elischer.org>
Date: Fri, 11 Feb 2005 13:22:11 -0800
To: John Baldwin
Cc: jeffr@FreeBSD.org, Alan Cox, current@FreeBSD.org
Subject: Re: HEADSUP: Turn off cpu_idle_hlt on SMP for now on x86
Message-ID: <420D2203.4090203@elischer.org>
In-Reply-To: <200502111148.45976.jhb@FreeBSD.org>

John Baldwin wrote:

>I figured out a deadlock I got on a 4 CPU testbox last night and I think I
>have tracked it down to a bug in 4BSD that can be worked around by turning
>off idle halt via the machdep.cpu_idle_hlt sysctl.  Specifically, it involves
>waking up halted CPUs when a thread that is pinned or bound to a specific CPU
>is made runnable.  The details of what I saw today are below:
>
>FreeBSD/i386 (deimos.baldwin.cx) (ttyd0)
>
>login: KDB: enter: Line break on console
>[thread pid 13 tid 100006 ]
>Stopped at      kdb_enter+0x30: leave
>db> show pcpu
>cpuid        = 3
>curthread    = 0xc1dfd5c0: pid 11 "idle: cpu3"
>curpcb       = 0xdc608da0
>fpcurthread  = none
>idlethread   = 0xc1dfd5c0: pid 11 "idle: cpu3"
>APIC ID      = 3
>currentldt   = 0x28
>db> ps
>  pid   proc     uid  ppid  pgrp   flag   stat  wmesg    wchan  cmd
>79079 c2ebe1fc 1001 79067   610 0000012 [LOCK vm page queue mutex c1f90c40] make
>79078 c2f2ede4 1001 78894   610 0000002 [LOCK vm page queue mutex c1f90c40] sh
>79077 c3498000 1001 79043   610 0004002 [LOCK vm page queue mutex c1f90c40] make
>79075 c2f317f0 1001 78898   610 0004002 [LOCK vm page queue mutex c1f90c40] make
>79074 c2184de4 1001 78902   610 0004002 [LOCK vm page queue mutex c1f90c40] make
>79067 c23c71fc 1001 78955   610 0004002 [SLPQ ppwait 0xc23c71fc][SLP] make
>79060 c23e35f4 1001 78915   610 0004002 [LOCK Giant c1e41800] make
>79043 c2a50000 1001 79042   610 0004002 [SLPQ wait 0xc2a50000][SLP] sh
>79042 c2f493f8 1001 78884   610 0004002 [SLPQ select 0xc075e024][SLP] make
>78978 c28de7f0 1001 78975   610 0004002 [RUNQ] sh
>78975 c29bf1fc 1001 78970   610 0004002 [SLPQ select 0xc075e024][SLP] make
>...
>  40 c1e67000    0     0     0 0000204 [IWAIT] swi1: net
>  39 c1e671fc    0     0     0 0000204 [IWAIT] swi3: vm
>  38 c1e4b5f4    0     0     0 000020c [LOCK Giant c1e41800] swi4: clock sio
>  37 c1e4b7f0    0     0     0 0000204 [IWAIT] irq13:
>  36 c1e4b9ec    0     0     0 0000204 [IWAIT] irq0: clk
>  35 c1e4bbe8    0     0     0 0000204 [IWAIT] irq23: ahc0 ahc1
>  34 c1e4bde4    0     0     0 0000204 [IWAIT] irq22:
>  33 c1e5b000    0     0     0 0000204 [IWAIT] irq21:
>  32 c1e5b1fc    0     0     0 0000204 [IWAIT] irq20:
>  31 c1e5b3f8    0     0     0 0000204 [LOCK Giant c1e41800] irq19: fxp0 uhci0
>...
>  14 c1dfc5f4    0     0     0 000020c [Can run] idle: cpu0
>  13 c1dfc7f0    0     0     0 000020c [CPU 1] idle: cpu1
>  12 c1dfc9ec    0     0     0 000020c [CPU 2] idle: cpu2
>  11 c1dfcbe8    0     0     0 000020c [CPU 3] idle: cpu3
>
>Thus, CPU 0 is running a make process while CPUs 1, 2, and 3 are all idle.
>Note that softclock() and irq19 (me trying to ssh in this morning over fxp0)
>are both blocked on Giant.  Also note that there are 5 threads blocked on the
>vm page queue mutex.  First, let's see who owns Giant and what they are doing.
>I was doing a benchmark, so I didn't have witness compiled in or turned on,
>which meant 'show alllocks' wasn't an option.
>
>db> x/x Giant,10
>Giant:                  c0730028        c06ffb2c        c06ffb2c        b0000   0
>Giant+0x14:             0               0               c2ebd8a2        0
>Giant+0x24:             0               c07300a0        0               0
>time_monitor:           92c86           7               420c903b
>
>The owner is 0xc2ebd8a2, minus the 0x2 since that is a flag:
>
>db> show threads
>...
>  100286 (0xc2ebd8a0)  sched_switch(c2ebd8a0,0,1,77,fcb6cb7e) at sched_switch+0x170
>...
>  100189 (0xc28dfcf0)  sched_switch(c28dfcf0,0,2,dff58a7c,95a67b30) at sched_switch+0x170
>...
>db> tr 100286
>Tracing pid 79075 tid 100286 td 0xc2ebd8a0
>sched_switch(c2ebd8a0,0,1,77,fcb6cb7e) at sched_switch+0x170
>mi_switch(1,0,c07026d2,254,c075d568) at mi_switch+0x1d9
>turnstile_wait(c076bda0,c28dfcf0,c2ebd8a0,d17bc930,c21a79e8) at turnstile_wait+0x269
>_mtx_lock_sleep(c076bda0,c2ebd8a0,0,c0704083,dc5) at _mtx_lock_sleep+0x101
>_mtx_lock_flags(c076bda0,0,c0704083,dc5,c0704083) at _mtx_lock_flags+0x45
>vfs_clean_pages(d17bc930,0,c0704083,423,c21a79b4) at vfs_clean_pages+0x70
>bdwrite(d17bc930,c3bdfa00,80,2000,0) at bdwrite+0x2f0
>ffs_update(c46bf564,0,1,50,e00efb2c) at ffs_update+0x396
>ufs_inactive(e00efb44,e00efb5c,c05978ea,e00efb44,0) at ufs_inactive+0x225
>VOP_INACTIVE_AP(e00efb44,0,c0705028,768,c074e7a0) at VOP_INACTIVE_AP+0x4e
>vrele(c46bf564,c074e2a0,c46bf564,1,c255a700) at vrele+0x11a
>vn_close(c46bf564,1,c255a700,c2ebd8a0,0) at vn_close+0x50
>vn_closefile(c30e6318,c2ebd8a0,c06fddb9,846,c30e6318) at vn_closefile+0xf4
>fdrop_locked(c30e6318,c2ebd8a0,c06fddb9,831) at fdrop_locked+0x94
>fdrop(c30e6318,c2ebd8a0,c06fddb9,77c,c0515cfd,c0737ce0,1000,c2ebd8a0,e00efcac,c0
>56009c,1,e00efc80,c255a700,0,e00efc90,e00efc90,c0513783,3f,c31b93c8,0,6,c31b9300,e00efcb0,c051389d,c31b9300,0,6) at fdrop+0x3c
>closef(c30e6318,c2ebd8a0,c06fddb9,3e8,c2ebd8a0) at closef+0x3f2
>close(c2ebd8a0,e00efd14,4,e00efd00,1) at close+0x1f2
>syscall(2f,2f,2f,809c200,0) at syscall+0x2a0
>Xint0x80_syscall() at Xint0x80_syscall+0x1f
>--- syscall (6, FreeBSD ELF32, close), eip = 0x805c743, esp = 0xbfbfd18c, ebp = 0xbfbfd1a8 ---
>
>Looking up the pid in the ps output above, I see that this thread is one of
>the threads blocked on the vm page queue mutex.  Ok, so let's look at that
>lock's owner:
>
>db> x/x 0xc076bda0,10
>vm_page_queue_mtx:      c0730028        c07100af        c07100af        b0000
>vm_page_queue_mtx+0x10: 0               0               0               c28dfcf2
>vm_page_queue_mtx+0x20: 0               0               c1e673f8        f540
>vm_pages_needed:        0               0               0               1552
>
>Lock owner is 0xc28dfcf2.
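
The lock word is the owning thread pointer with a couple of flag bits OR'd
into its low bits, so masking those bits off gives the struct thread pointer
that show threads lists (0xc2ebd8a2 -> 0xc2ebd8a0 for Giant, 0xc28dfcf2 ->
0xc28dfcf0 for the page queue mutex).  A minimal sketch of that arithmetic,
with an assumed flag mask rather than the exact definitions from sys/mutex.h:

/*
 * Sketch: recover a sleep mutex's owner from the raw lock word as
 * dumped by "x/x <mutex>" in ddb.  The low bits are flags (e.g.
 * "contested"); the rest is the owning struct thread pointer.
 */
#include <stdint.h>
#include <stdio.h>

#define MTX_FLAG_BITS	0x3	/* assumed: low two bits carry flags */

static uintptr_t
mtx_owner_from_word(uintptr_t lockword)
{
	/* Strip the flag bits to recover the thread pointer. */
	return (lockword & ~(uintptr_t)MTX_FLAG_BITS);
}

int
main(void)
{
	/* The two lock words dumped above. */
	printf("Giant owner:             0x%08lx\n",
	    (unsigned long)mtx_owner_from_word(0xc2ebd8a2UL));
	printf("vm_page_queue_mtx owner: 0x%08lx\n",
	    (unsigned long)mtx_owner_from_word(0xc28dfcf2UL));
	return (0);
}
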
>Looked that up via show threads and got:
>
>db> tr 100189
>Tracing pid 78978 tid 100189 td 0xc28dfcf0
>sched_switch(c28dfcf0,0,2,dff58a7c,95a67b30) at sched_switch+0x170
>mi_switch(2,0,c0701343,252,0) at mi_switch+0x1d9
>critical_exit(c074bcc0,c052ddd5,c0755f18,0,c28dfcf0) at critical_exit+0x7e
>intr_execute_handlers(c074bcc0,dff58b10,c076bc00,dff58b1c,c28dfcf0) at intr_execute_handlers+0xa4
>atpic_handle_intr(0) at atpic_handle_intr+0xaf
>Xatpic_intr0() at Xatpic_intr0+0x20
>--- interrupt, eip = 0xc054dd3c, esp = 0xdff58b54, ebp = 0xdff58b68 ---
>critical_exit(dff58b94,c06bbcbd,c075cae0,0,c0715535) at critical_exit+0xcc
>_mtx_unlock_spin_flags(c075cae0,0,c0715535,252,e) at _mtx_unlock_spin_flags+0x23
>pmap_invalidate_page(c29c869c,8064000,2,766,c19d6d10) at pmap_invalidate_page+0xdd
>pmap_enter(c29c869c,8064000,c19d6d10,3,0) at pmap_enter+0x33d
>vm_fault(c29c85dc,8064000,2,8,c28dfcf0) at vm_fault+0x13e7
>trap_pfault(dff58d48,1,80643e8,bfbfe060,80643e8) at trap_pfault+0xf1
>trap(2f,2f,2f,0,8069000) at trap+0x21a
>calltrap() at calltrap+0x5
>--- trap 0xc, eip = 0x80563cf, esp = 0xbfbfe0d0, ebp = 0xbfbfe0d8 ---
>
>So this is the thread that is in state RUNQ even though I have three idle
>CPUs.  Very curious!  I then wondered if my current CPU (CPU 3) thought
>that there were any runnable threads:
>
>db> call sched_runnable()
>0
>
>0 means no.  That's when I figured that the runnable thread must be pinned,
>especially since it was interrupted while messing with the pmap, which
>usually requires pinning.  I then tried an experiment to test my theory
>by turning off the idle halt and manually IPI'ing all the other CPUs so that
>they would recheck their run queues, and whichever CPU needed to run the
>process owning the vm page queue mutex would run it.  Unfortunately, I think
>I got the vector number wrong, as all 3 CPUs panicked. :-P
>
>db> x cpu_idle_hlt
>cpu_idle_hlt:   1
>db> w cpu_idle_hlt 0
>cpu_idle_hlt            0x1 = 0
>db> call lapic_ipi_raw(0x000c00e7, 0)
>0xd1622000
>db> c
>cccpppuuuiiiddd === 102;;; aaapppiiiccc iiiddd === 000102
>
>
>iiinnnssstttrrruuuccctttiiiooonnn pppoooiiinnnttteerer r =
>= = 0 x008xx:80:8x0:cx00cx60ca50536bae055eb9e
>a5st
>as
>ctkas ctpkao cipkno tipenorti en rt e r = 0 x= 1 0 :0=0x
>x10d0xc:16000xed:cf0bex40dfc
>afb6r40abm
>...
>  857 (make)
>trap number             = 30
>panic: unknown/reserved trap
>cpuid = 0
>Uptime: 1d16h58m16s
>Cannot dump. No dump device defined.
>Automatic reboot in 15 seconds - press a key on the console to abort
>Rebooting...
>cpu_reset: Stopping other CPUs
>
>Thus, my theory is that when the pinned thread was preempted and put back on
>the run queue, the scheduler didn't IPI the CPU it was pinned to to wake it
>up in case it was idle.  The IPI is only needed if the CPUs are halted, which
>is why I think turning the idle halt off might work as a workaround.  I don't
>know if ULE has this same issue, but I've cc'd Jeff and hopefully he can look
>into it.
>

There is an option (a sysctl, even):

kern.sched.ipiwakeup.htt2: 0
kern.sched.ipiwakeup.onecpu: 0    <--------------
kern.sched.ipiwakeup.useloop: 0
kern.sched.ipiwakeup.usemask: 1
kern.sched.ipiwakeup.delivered: 0
kern.sched.ipiwakeup.requested: 0
kern.sched.ipiwakeup.enabled: 1
kern.sched.quantum: 100000

With onecpu left at 0, as above, the wakeup should go to all idle CPUs, not
just one of them.
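
To spell out the invariant John is describing in code: when a thread that can
only run on one particular CPU is put back on a run queue, that CPU has to be
sent a wakeup IPI if it may be sitting in HLT, otherwise nothing ever pulls it
out of the idle loop.  The sketch below only illustrates that rule; the helper
names (thread_pinned_cpu, cpu_is_idle_halted, ipi_wakeup_cpu, and so on) are
placeholders, not the actual sched_4bsd.c or ULE interfaces.

/*
 * Illustration of the missing wakeup: a runnable thread that is pinned
 * or bound to one CPU must never be left sitting on a run queue while
 * that CPU is halted in the idle loop.  All helpers here are
 * hypothetical stand-ins for scheduler internals.
 */
struct thread;

extern int  cpu_idle_hlt;                        /* machdep.cpu_idle_hlt */
extern int  thread_pinned_cpu(struct thread *);  /* CPU id, or -1 if unbound */
extern int  cpu_is_idle_halted(int cpu);
extern void ipi_wakeup_cpu(int cpu);             /* kick one CPU out of HLT */
extern void ipi_wakeup_idle_cpus(void);          /* broadcast variant */
extern void runq_add_cpu(struct thread *, int cpu);
extern void runq_add_global(struct thread *);

void
sched_add_sketch(struct thread *td)
{
	int cpu = thread_pinned_cpu(td);

	if (cpu < 0) {
		/* Unbound thread: any idle CPU may pick it up. */
		runq_add_global(td);
		ipi_wakeup_idle_cpus();
		return;
	}

	/* Bound/pinned thread: only this CPU is allowed to run it ... */
	runq_add_cpu(td, cpu);

	/*
	 * ... so if that CPU is idle and halted it must be IPI'd now.
	 * Skipping this step is exactly the hole that leaves pid 78978
	 * stuck in RUNQ above while three CPUs sit in HLT.
	 */
	if (cpu_idle_hlt && cpu_is_idle_halted(cpu))
		ipi_wakeup_cpu(cpu);
}

Whether the real fix belongs in sched_add() itself or in the wakeup path that
the kern.sched.ipiwakeup.* knobs control is for the scheduler folks to sort
out; the sketch only states the rule that has to hold either way.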