Date: Wed, 11 Apr 2018 06:09:00 +1000 (EST) From: Bruce Evans <brde@optusnet.com.au> To: Dexuan Cui <decui@microsoft.com> Cc: Bruce Evans <brde@optusnet.com.au>, "jeff@freebsd.org" <jeff@freebsd.org>, "bugs@freebsd.org" <bugs@freebsd.org>, "Hongxiong Xian (Wicresoft)" <v-hoxian@microsoft.com> Subject: RE: [Bug 227404] UP FreeBSD VM always hangs on reboot since 20180329-r331740 Message-ID: <20180411055306.P5336@besplex.bde.org> In-Reply-To: <SG2P15301MB0015EA97582E940755361F64BFBE0@SG2P15301MB0015.APCP153.PROD.OUTLOOK.COM> References: <bug-227404-39@https.bugs.freebsd.org/bugzilla/> <bug-227404-39-TxuoE7IXT0@https.bugs.freebsd.org/bugzilla/> <20180410173347.D1459@besplex.bde.org> <SG2P15301MB0015EA97582E940755361F64BFBE0@SG2P15301MB0015.APCP153.PROD.OUTLOOK.COM>
next in thread | previous in thread | raw e-mail | index | archive | help
On Tue, 10 Apr 2018, Dexuan Cui wrote: >> From: Bruce Evans >> Sent: Tuesday, April 10, 2018 00:45 >> >> On Tue, 10 Apr 2018 a bug that doesn't want replies@freebsd.org wrote: >> >> (The bug didn't even Cc freebsd-bugs for this followup.) > > Thanks for the reminder! I Cc'd bugs@ just now. > >>> --- Comment #4 from Dexuan Cui --- >> ... >> >> I think I saw this a few months before that. >> >> My only history of this is that I built a UP kernel on 17 Dec 2017 to see >> if UP kernels had the bug. So SMP kernels probably had the bug then. >> >> Bruce > > Here the bug is that UP FreeBSD VM hangs on reboot or power-off, and > I'm sure this recent patch (which was commited by Jeff on Mar 26) caused this bug: > https://github.com/freebsd/freebsd/commit/63a483ed5f4eaadb8979992c7a5de24c7a471c61 > ("Fix a bug introduced in r329612 that slowly invalidates all clean bufs"). > > However, SMP VM with 2 or more CPUs doesn't hang on reboot/power-off > according to our tests. Actually, r329612 is what causes this bug. I already did the bisection to find almost this bug a couple of weeks ago. The hang occurs on amd64 with 4 CPUs but not on amd64 with 8 CPUs or i386 with 4 or 8 CPUS. I just checked that it occurs on i386 with 1 CPU. All on the same machine. But r329611 doesn't hang for any of these cases. XX From bde@optusnet.com.au Fri Mar 23 20:06:40 2018 +1100 XX Date: Fri, 23 Mar 2018 20:06:39 +1100 (EST) XX From: Bruce Evans <brde@optusnet.com.au> XX X-X-Sender: bde@besplex.bde.org XX To: jeff@freebsd.org XX Subject: r329612 breaks sync for shutdown XX Message-ID: <20180323192409.F1878@besplex.bde.org> XX MIME-Version: 1.0 XX Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed XX Status: O XX X-Status: XX X-Keywords: XX X-UID: 7935 XX XX r329612 (with or without later changes) sometimes (consistently in one XX hardware coniguration) hangs in clean shutdowns by init: XX XX i386 with 8 or 4 CPUs: it hasn't failed yet XX amd64 with 8 CPUs:: it hasn't failed yet XX and64 with 4 CPUs (by turning off HTT): it always or almost always hangs XX XX The hang is usually in "Syncing disks, vnodes remaining... 0 ". Less than XX 10% of the time it hangs earlier in "Waiting ... for .... syncer' ...". XX It is just waiting for a wakeup that never arrives: In the working case, XX it prints about 2 more 0's, with about half a second between each. If it XX reaches the second 0, it always completes. XX XX This is with SCHED_4BSD. SCHED_ULE seems to work. I recently looked for XX missing wakeups and found some for idle threads in mwait. This affects XX both schedulers but fixing it makes little difference. The bug is invariant XX under other large changes in options and code. XX XX XX KDB: enter: Break to debugger XX XX [ thread pid 9 tid 100069 ] XX XX Stopped at kdb_enter+0x3a: movq $0,0x700c97(%rip) XX XX db> ps XX XX pid ppid pgrp uid state wmesg wchan cmd XX XX 18 0 0 0 DL - 0xffffffff80be8462 [schedcpu] XX XX 17 0 0 0 DL kpsusp 0xfffff80003e1a6e0 [vnlru] XX XX 16 0 0 0 RL [syncer] XX XX 9 0 0 0 RL (threaded) [bufdaemon] XX XX 100059 RunQ [bufdaemon] XX XX 100064 Run CPU 1 [bufspacedaemon-0] XX XX 100065 Run CPU 3 [bufspacedaemon-1] XX XX 100066 RunQ [bufspacedaemon-2] XX XX 100067 RunQ [bufspacedaemon-3] XX XX 100068 Run CPU 0 [bufspacedaemon-4] XX XX 100069 Run CPU 2 [bufspacedaemon-5] XX XX 100070 CanRun [bufspacedaemon-6] XX XX 8 0 0 0 DL (threaded) [pagedaemon] XX XX 100058 D psleep 0xffffffff80c7b82d [pagedaemon] XX XX 100062 D launds 0xffffffff80c7b834 [laundry: dom0] XX XX 100063 D umarcl 0xffffffff80676967 [uma] XX XX 7 0 0 0 DL - 0xffffffff80c73dd4 [soaiod4] XX XX 6 0 0 0 DL - 0xffffffff80c73dd4 [soaiod3] XX XX 5 0 0 0 DL - 0xffffffff80c73dd4 [soaiod2] XX XX --More-- 4 0 0 0 DL - 0xffffffff80c73dd4 [soaiod1] XX XX 15 0 0 0 DL cooling 0xfffff8000186a758 [acpi_cooling1] XX XX 14 0 0 0 DL tzpoll 0xffffffff80aaa110 [acpi_thermal] XX XX 3 0 0 0 DL - 0xffffffff80aab218 [rand_harvestq] XX XX 13 0 0 0 DL (threaded) [usb] XX XX 100023 D - 0xfffffe00839d4460 [usbus0] XX XX 100024 D - 0xfffffe00839d44b8 [usbus0] XX XX 100025 D - 0xfffffe00839d4510 [usbus0] XX XX 100026 D - 0xfffffe00839d4568 [usbus0] XX XX 100027 D - 0xfffffe00839d45c0 [usbus0] XX XX 100029 D - 0xfffffe0083846d10 [usbus1] XX XX 100030 D - 0xfffffe0083846d68 [usbus1] XX XX 100031 D - 0xfffffe0083846dc0 [usbus1] XX XX 100032 D - 0xfffffe0083846e18 [usbus1] XX XX 100033 D - 0xfffffe0083846e70 [usbus1] XX XX 100035 D - 0xfffffe0083b60d10 [usbus2] XX XX 100036 D - 0xfffffe0083b60d68 [usbus2] XX XX 100037 D - 0xfffffe0083b60dc0 [usbus2] XX XX 100038 D - 0xfffffe0083b60e18 [usbus2] XX XX 100039 D - 0xfffffe0083b60e70 [usbus2] XX XX --More-- 2 0 0 0 DL (threaded) [cam] XX XX 100021 D - 0xffffffff80a9e000 [doneq0] XX XX 100053 D - 0xffffffff80a9dec0 [scanner] XX XX 12 0 0 0 DL (threaded) [geom] XX XX 100018 D - 0xffffffff80bd7a60 [g_event] XX XX 100019 D - 0xffffffff80bd7a70 [g_up] XX XX 100020 D - 0xffffffff80bd7a78 [g_down] XX XX 11 0 0 0 WL (threaded) [intr] XX XX 100006 I [swi6: task queue] XX XX 100008 I [swi5: fast taskq] XX XX 100011 I [swi6: Giant taskq] XX XX 100012 I [swi4: clock (0) sio] XX XX 100013 I [swi4: clock (1)] XX XX 100014 I [swi4: clock (2)] XX XX 100015 I [swi4: clock (3)] XX XX 100016 I [swi1: netisr 0] XX XX 100017 I [swi3: vm] XX XX 100022 I [irq264: xhci0] XX XX 100028 I [irq16: ehci0] XX XX 100034 I [irq23: ehci1] XX XX --More-- 100040 I [irq265: ahci0] XX XX 100041 I [swi0: sio] XX XX 100042 I [irq1: atkbd0] XX XX 100043 I [irq14: ata0] XX XX 100044 I [irq15: ata1] XX XX 10 0 0 0 RL (threaded) [idle] XX XX 100002 CanRun [idle: cpu0] XX XX 100003 CanRun [idle: cpu1] XX XX 100004 CanRun [idle: cpu2] XX XX 100005 CanRun [idle: cpu3] XX XX 1 0 1 0 RLs [init] XX XX 0 0 0 0 DLs (threaded) [kernel] XX XX 100000 D swapin 0xffffffff80bd8130 [swapper] XX XX 100007 D - 0xfffff800014bad00 [kqueue_ctx taskq] XX XX 100009 D - 0xfffff800014baa00 [thread taskq] XX XX 100010 D - 0xfffff800014ba900 [aiod_kick taskq] XX XX 100045 D - 0xfffff80001877e00 [mca taskq] XX XX 100046 D - 0xfffff80003dad900 [acpi_task_0] XX XX 100047 D - 0xfffff80003dad900 [acpi_task_1] XX XX 100048 D - 0xfffff80003dad900 [acpi_task_2] XX XX --More-- 100052 D - 0xfffff8000156ad00 [CAM taskq] XX XX 100071 D - 0xfffff80003e9fa00 [if_config_tqg_0] XX XX 100072 D - 0xfffff80003e9f900 [softirq_0] XX XX 100073 D - 0xfffff80003e9f800 [softirq_1] XX XX 100074 D - 0xfffff80003e9f700 [softirq_2] XX XX 100075 D - 0xfffff80003e9f600 [softirq_3] XX XX 100076 D - 0xfffff80003e9f500 [if_io_tqg_0] XX XX 100077 D - 0xfffff80003e9f400 [if_io_tqg_1] XX XX 100078 D - 0xfffff80003e9f300 [if_io_tqg_2] XX XX 100079 D - 0xfffff80003e9f200 [if_io_tqg_3] XX XX 724 1 0 0 ZL [newnfs 0] XX XX 725 1 0 0 ZL [newnfs 1] XX XX db> t 18 XX XX Tracing pid 18 tid 100080 td 0xfffff8000430f560 XX XX sched_switch() at sched_switch+0x1c3/frame 0xfffffe0000590920 XX XX mi_switch() at mi_switch+0xe1/frame 0xfffffe0000590960 XX XX sleepq_timedwait() at sleepq_timedwait+0x2d/frame 0xfffffe0000590990 XX XX _sleep() at _sleep+0x218/frame 0xfffffe0000590a00 XX XX pause_sbt() at pause_sbt+0xc9/frame 0xfffffe0000590a50 XX XX schedcpu_thread() at schedcpu_thread+0x2b/frame 0xfffffe0000590a60 XX XX fork_exit() at fork_exit+0x85/frame 0xfffffe0000590ab0 XX XX fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0000590ab0 XX XX --- trap 0, rip = 0, rsp = 0, rbp = 0 --- XX XX db> t 17 XX XX Tracing pid 17 tid 100061 td 0xfffff80003e1f000 XX XX sched_switch() at sched_switch+0x1c3/frame 0xfffffe0000531900 XX XX mi_switch() at mi_switch+0xe1/frame 0xfffffe0000531940 XX XX sleepq_wait() at sleepq_wait+0x2d/frame 0xfffffe0000531970 XX XX _sleep() at _sleep+0x23a/frame 0xfffffe00005319e0 XX XX kproc_suspend_check() at kproc_suspend_check+0x6a/frame 0xfffffe0000531a20 XX XX vnlru_proc() at vnlru_proc+0x47/frame 0xfffffe0000531a60 XX XX fork_exit() at fork_exit+0x85/frame 0xfffffe0000531ab0 XX XX fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0000531ab0 XX XX --- trap 0, rip = 0, rsp = 0, rbp = 0 --- XX XX db> t 16 XX XX Tracing pid 16 tid 100060 td 0xfffff80003e1f560 XX XX sched_switch() at sched_switch+0x1c3/frame 0xfffffe000052c810 XX XX mi_switch() at mi_switch+0xe1/frame 0xfffffe000052c850 XX XX kern_yield() at kern_yield+0xab/frame 0xfffffe000052c880 XX XX __mnt_vnode_next_active() at __mnt_vnode_next_active+0x20/frame 0xfffffe000052c8a0 XX XX vfs_msync() at vfs_msync+0x15e/frame 0xfffffe000052c910 XX XX sync_fsync() at sync_fsync+0xf8/frame 0xfffffe000052c950 XX XX VOP_FSYNC_APV() at VOP_FSYNC_APV+0x69/frame 0xfffffe000052c980 XX XX sync_vnode() at sync_vnode+0x177/frame 0xfffffe000052ca00 XX XX sched_sync() at sched_sync+0x21a/frame 0xfffffe000052ca60 XX XX fork_exit() at fork_exit+0x85/frame 0xfffffe000052cab0 XX XX fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe000052cab0 XX XX --- trap 0, rip = 0, rsp = 0, rbp = 0 --- XX XX db> t 9 XX XX Tracing pid 9 tid 100059 td 0xfffff80003e20000 XX XX sched_switch() at sched_switch+0x1c3/frame 0xfffffe0000527920 XX XX mi_switch() at mi_switch+0xe1/frame 0xfffffe0000527960 XX XX sleepq_wait() at sleepq_wait+0x2d/frame 0xfffffe0000527990 XX XX _sleep() at _sleep+0x23a/frame 0xfffffe0000527a00 XX XX kproc_suspend_check() at kproc_suspend_check+0x6a/frame 0xfffffe0000527a40 XX XX buf_daemon() at buf_daemon+0x10a/frame 0xfffffe0000527a60 XX XX fork_exit() at fork_exit+0x85/frame 0xfffffe0000527ab0 XX XX fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0000527ab0 XX XX --- trap 0, rip = 0, rsp = 0, rbp = 0 --- XX XX db> XX XX Also, recent SMP kernels on UP systems usually or always hang in nearly the XX same place, but I think this is an unrelated older bug. XX XX Bruce I still think there is an older bug, but now think it is related. I only tested with SCHED_4BSD. For SCHED_4BSD, I suspect that the bug is from pinning a thread to a CPU and then stopping that CPU. Pure UP has no problems since pinning is null for it. SCHED_4BSD has especially special handing for SMP (a separate runq for each CPU. I have been modifying SCHED_4BSD and the separate queues mostly get in the way). Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20180411055306.P5336>