Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 11 Apr 2018 06:09:00 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Dexuan Cui <decui@microsoft.com>
Cc:        Bruce Evans <brde@optusnet.com.au>, "jeff@freebsd.org" <jeff@freebsd.org>,  "bugs@freebsd.org" <bugs@freebsd.org>,  "Hongxiong Xian (Wicresoft)" <v-hoxian@microsoft.com>
Subject:   RE: [Bug 227404] UP FreeBSD VM always hangs on reboot since 20180329-r331740
Message-ID:  <20180411055306.P5336@besplex.bde.org>
In-Reply-To: <SG2P15301MB0015EA97582E940755361F64BFBE0@SG2P15301MB0015.APCP153.PROD.OUTLOOK.COM>
References:  <bug-227404-39@https.bugs.freebsd.org/bugzilla/> <bug-227404-39-TxuoE7IXT0@https.bugs.freebsd.org/bugzilla/> <20180410173347.D1459@besplex.bde.org> <SG2P15301MB0015EA97582E940755361F64BFBE0@SG2P15301MB0015.APCP153.PROD.OUTLOOK.COM>

next in thread | previous in thread | raw e-mail | index | archive | help
On Tue, 10 Apr 2018, Dexuan Cui wrote:

>> From: Bruce Evans
>> Sent: Tuesday, April 10, 2018 00:45
>>
>> On Tue, 10 Apr 2018 a bug that doesn't want replies@freebsd.org wrote:
>>
>> (The bug didn't even Cc freebsd-bugs for this followup.)
>
> Thanks for the reminder! I Cc'd bugs@ just now.
>
>>> --- Comment #4 from Dexuan Cui ---
>> ...
>>
>> I think I saw this a few months before that.
>>
>> My only history of this is that I built a UP kernel on 17 Dec 2017 to see
>> if UP kernels had the bug.  So SMP kernels probably had the bug then.
>>
>> Bruce
>
> Here the bug is that UP FreeBSD VM hangs on reboot or power-off, and
> I'm sure this recent patch (which was commited by Jeff on Mar 26) caused this bug:
> https://github.com/freebsd/freebsd/commit/63a483ed5f4eaadb8979992c7a5de24c7a471c61
> ("Fix a bug introduced in r329612 that slowly invalidates all clean bufs").
>
> However, SMP VM with 2 or more CPUs doesn't hang on reboot/power-off
> according to our tests.

Actually, r329612 is what causes this bug.  I already did the bisection
to find almost this bug a couple of weeks ago.  The hang occurs on amd64
with 4 CPUs but not on amd64 with 8 CPUs or i386 with 4 or 8 CPUS.  I
just checked that it occurs on i386 with 1 CPU.  All on the same machine.
But r329611 doesn't hang for any of these cases.

XX From bde@optusnet.com.au Fri Mar 23 20:06:40 2018 +1100
XX Date: Fri, 23 Mar 2018 20:06:39 +1100 (EST)
XX From: Bruce Evans <brde@optusnet.com.au>
XX X-X-Sender: bde@besplex.bde.org
XX To: jeff@freebsd.org
XX Subject: r329612 breaks sync for shutdown
XX Message-ID: <20180323192409.F1878@besplex.bde.org>
XX MIME-Version: 1.0
XX Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
XX Status: O
XX X-Status: 
XX X-Keywords: 
XX X-UID: 7935
XX 
XX r329612 (with or without later changes) sometimes (consistently in one
XX hardware coniguration) hangs in clean shutdowns by init:
XX 
XX i386 with 8 or 4 CPUs: it hasn't failed yet
XX amd64 with 8 CPUs:: it hasn't failed yet
XX and64 with 4 CPUs (by turning off HTT): it always or almost always hangs
XX 
XX The hang is usually in "Syncing disks, vnodes remaining... 0 ".  Less than
XX 10% of the time it hangs earlier in "Waiting ... for .... syncer' ...".
XX It is just waiting for a wakeup that never arrives:  In the working case,
XX it prints about 2 more 0's, with about half a second between each.  If it
XX reaches the second 0, it always completes.
XX 
XX This is with SCHED_4BSD.  SCHED_ULE seems to work.  I recently looked for
XX missing wakeups and found some for idle threads in mwait.  This affects
XX both schedulers but fixing it makes little difference.  The bug is invariant
XX under other large changes in options and code.
XX 
XX XX KDB: enter: Break to debugger
XX XX [ thread pid 9 tid 100069 ]
XX XX Stopped at      kdb_enter+0x3a: movq    $0,0x700c97(%rip)
XX XX db> ps
XX XX   pid  ppid  pgrp   uid  state   wmesg   wchan               cmd
XX XX    18     0     0     0  DL      -       0xffffffff80be8462  [schedcpu]
XX XX    17     0     0     0  DL      kpsusp  0xfffff80003e1a6e0  [vnlru]
XX XX    16     0     0     0  RL                                  [syncer]
XX XX     9     0     0     0  RL      (threaded)                  [bufdaemon]
XX XX 100059                   RunQ                                [bufdaemon]
XX XX 100064                   Run     CPU 1                       [bufspacedaemon-0]
XX XX 100065                   Run     CPU 3                       [bufspacedaemon-1]
XX XX 100066                   RunQ                                [bufspacedaemon-2]
XX XX 100067                   RunQ                                [bufspacedaemon-3]
XX XX 100068                   Run     CPU 0                       [bufspacedaemon-4]
XX XX 100069                   Run     CPU 2                       [bufspacedaemon-5]
XX XX 100070                   CanRun                              [bufspacedaemon-6]
XX XX     8     0     0     0  DL      (threaded)                  [pagedaemon]
XX XX 100058                   D       psleep  0xffffffff80c7b82d  [pagedaemon]
XX XX 100062                   D       launds  0xffffffff80c7b834  [laundry: dom0]
XX XX 100063                   D       umarcl  0xffffffff80676967  [uma]
XX XX     7     0     0     0  DL      -       0xffffffff80c73dd4  [soaiod4]
XX XX     6     0     0     0  DL      -       0xffffffff80c73dd4  [soaiod3]
XX XX     5     0     0     0  DL      -       0xffffffff80c73dd4  [soaiod2]
XX XX --More--            4     0     0     0  DL      -       0xffffffff80c73dd4  [soaiod1]
XX XX    15     0     0     0  DL      cooling 0xfffff8000186a758  [acpi_cooling1]
XX XX    14     0     0     0  DL      tzpoll  0xffffffff80aaa110  [acpi_thermal]
XX XX     3     0     0     0  DL      -       0xffffffff80aab218  [rand_harvestq]
XX XX    13     0     0     0  DL      (threaded)                  [usb]
XX XX 100023                   D       -       0xfffffe00839d4460  [usbus0]
XX XX 100024                   D       -       0xfffffe00839d44b8  [usbus0]
XX XX 100025                   D       -       0xfffffe00839d4510  [usbus0]
XX XX 100026                   D       -       0xfffffe00839d4568  [usbus0]
XX XX 100027                   D       -       0xfffffe00839d45c0  [usbus0]
XX XX 100029                   D       -       0xfffffe0083846d10  [usbus1]
XX XX 100030                   D       -       0xfffffe0083846d68  [usbus1]
XX XX 100031                   D       -       0xfffffe0083846dc0  [usbus1]
XX XX 100032                   D       -       0xfffffe0083846e18  [usbus1]
XX XX 100033                   D       -       0xfffffe0083846e70  [usbus1]
XX XX 100035                   D       -       0xfffffe0083b60d10  [usbus2]
XX XX 100036                   D       -       0xfffffe0083b60d68  [usbus2]
XX XX 100037                   D       -       0xfffffe0083b60dc0  [usbus2]
XX XX 100038                   D       -       0xfffffe0083b60e18  [usbus2]
XX XX 100039                   D       -       0xfffffe0083b60e70  [usbus2]
XX XX --More--            2     0     0     0  DL      (threaded)                  [cam]
XX XX 100021                   D       -       0xffffffff80a9e000  [doneq0]
XX XX 100053                   D       -       0xffffffff80a9dec0  [scanner]
XX XX    12     0     0     0  DL      (threaded)                  [geom]
XX XX 100018                   D       -       0xffffffff80bd7a60  [g_event]
XX XX 100019                   D       -       0xffffffff80bd7a70  [g_up]
XX XX 100020                   D       -       0xffffffff80bd7a78  [g_down]
XX XX    11     0     0     0  WL      (threaded)                  [intr]
XX XX 100006                   I                                   [swi6: task queue]
XX XX 100008                   I                                   [swi5: fast taskq]
XX XX 100011                   I                                   [swi6: Giant taskq]
XX XX 100012                   I                                   [swi4: clock (0) sio]
XX XX 100013                   I                                   [swi4: clock (1)]
XX XX 100014                   I                                   [swi4: clock (2)]
XX XX 100015                   I                                   [swi4: clock (3)]
XX XX 100016                   I                                   [swi1: netisr 0]
XX XX 100017                   I                                   [swi3: vm]
XX XX 100022                   I                                   [irq264: xhci0]
XX XX 100028                   I                                   [irq16: ehci0]
XX XX 100034                   I                                   [irq23: ehci1]
XX XX --More--        100040                   I                                   [irq265: ahci0]
XX XX 100041                   I                                   [swi0: sio]
XX XX 100042                   I                                   [irq1: atkbd0]
XX XX 100043                   I                                   [irq14: ata0]
XX XX 100044                   I                                   [irq15: ata1]
XX XX    10     0     0     0  RL      (threaded)                  [idle]
XX XX 100002                   CanRun                              [idle: cpu0]
XX XX 100003                   CanRun                              [idle: cpu1]
XX XX 100004                   CanRun                              [idle: cpu2]
XX XX 100005                   CanRun                              [idle: cpu3]
XX XX     1     0     1     0  RLs                                 [init]
XX XX     0     0     0     0  DLs     (threaded)                  [kernel]
XX XX 100000                   D       swapin  0xffffffff80bd8130  [swapper]
XX XX 100007                   D       -       0xfffff800014bad00  [kqueue_ctx taskq]
XX XX 100009                   D       -       0xfffff800014baa00  [thread taskq]
XX XX 100010                   D       -       0xfffff800014ba900  [aiod_kick taskq]
XX XX 100045                   D       -       0xfffff80001877e00  [mca taskq]
XX XX 100046                   D       -       0xfffff80003dad900  [acpi_task_0]
XX XX 100047                   D       -       0xfffff80003dad900  [acpi_task_1]
XX XX 100048                   D       -       0xfffff80003dad900  [acpi_task_2]
XX XX --More--        100052                   D       -       0xfffff8000156ad00  [CAM taskq]
XX XX 100071                   D       -       0xfffff80003e9fa00  [if_config_tqg_0]
XX XX 100072                   D       -       0xfffff80003e9f900  [softirq_0]
XX XX 100073                   D       -       0xfffff80003e9f800  [softirq_1]
XX XX 100074                   D       -       0xfffff80003e9f700  [softirq_2]
XX XX 100075                   D       -       0xfffff80003e9f600  [softirq_3]
XX XX 100076                   D       -       0xfffff80003e9f500  [if_io_tqg_0]
XX XX 100077                   D       -       0xfffff80003e9f400  [if_io_tqg_1]
XX XX 100078                   D       -       0xfffff80003e9f300  [if_io_tqg_2]
XX XX 100079                   D       -       0xfffff80003e9f200  [if_io_tqg_3]
XX XX   724     1     0     0  ZL                                  [newnfs 0]
XX XX   725     1     0     0  ZL                                  [newnfs 1]
XX XX db> t 18
XX XX Tracing pid 18 tid 100080 td 0xfffff8000430f560
XX XX sched_switch() at sched_switch+0x1c3/frame 0xfffffe0000590920
XX XX mi_switch() at mi_switch+0xe1/frame 0xfffffe0000590960
XX XX sleepq_timedwait() at sleepq_timedwait+0x2d/frame 0xfffffe0000590990
XX XX _sleep() at _sleep+0x218/frame 0xfffffe0000590a00
XX XX pause_sbt() at pause_sbt+0xc9/frame 0xfffffe0000590a50
XX XX schedcpu_thread() at schedcpu_thread+0x2b/frame 0xfffffe0000590a60
XX XX fork_exit() at fork_exit+0x85/frame 0xfffffe0000590ab0
XX XX fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0000590ab0
XX XX --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
XX XX db> t 17
XX XX Tracing pid 17 tid 100061 td 0xfffff80003e1f000
XX XX sched_switch() at sched_switch+0x1c3/frame 0xfffffe0000531900
XX XX mi_switch() at mi_switch+0xe1/frame 0xfffffe0000531940
XX XX sleepq_wait() at sleepq_wait+0x2d/frame 0xfffffe0000531970
XX XX _sleep() at _sleep+0x23a/frame 0xfffffe00005319e0
XX XX kproc_suspend_check() at kproc_suspend_check+0x6a/frame 0xfffffe0000531a20
XX XX vnlru_proc() at vnlru_proc+0x47/frame 0xfffffe0000531a60
XX XX fork_exit() at fork_exit+0x85/frame 0xfffffe0000531ab0
XX XX fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0000531ab0
XX XX --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
XX XX db> t 16
XX XX Tracing pid 16 tid 100060 td 0xfffff80003e1f560
XX XX sched_switch() at sched_switch+0x1c3/frame 0xfffffe000052c810
XX XX mi_switch() at mi_switch+0xe1/frame 0xfffffe000052c850
XX XX kern_yield() at kern_yield+0xab/frame 0xfffffe000052c880
XX XX __mnt_vnode_next_active() at __mnt_vnode_next_active+0x20/frame 0xfffffe000052c8a0
XX XX vfs_msync() at vfs_msync+0x15e/frame 0xfffffe000052c910
XX XX sync_fsync() at sync_fsync+0xf8/frame 0xfffffe000052c950
XX XX VOP_FSYNC_APV() at VOP_FSYNC_APV+0x69/frame 0xfffffe000052c980
XX XX sync_vnode() at sync_vnode+0x177/frame 0xfffffe000052ca00
XX XX sched_sync() at sched_sync+0x21a/frame 0xfffffe000052ca60
XX XX fork_exit() at fork_exit+0x85/frame 0xfffffe000052cab0
XX XX fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe000052cab0
XX XX --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
XX XX db> t 9
XX XX Tracing pid 9 tid 100059 td 0xfffff80003e20000
XX XX sched_switch() at sched_switch+0x1c3/frame 0xfffffe0000527920
XX XX mi_switch() at mi_switch+0xe1/frame 0xfffffe0000527960
XX XX sleepq_wait() at sleepq_wait+0x2d/frame 0xfffffe0000527990
XX XX _sleep() at _sleep+0x23a/frame 0xfffffe0000527a00
XX XX kproc_suspend_check() at kproc_suspend_check+0x6a/frame 0xfffffe0000527a40
XX XX buf_daemon() at buf_daemon+0x10a/frame 0xfffffe0000527a60
XX XX fork_exit() at fork_exit+0x85/frame 0xfffffe0000527ab0
XX XX fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0000527ab0
XX XX --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
XX XX db>
XX 
XX Also, recent SMP kernels on UP systems usually or always hang in nearly the
XX same place, but I think this is an unrelated older bug.
XX 
XX Bruce

I still think there is an older bug, but now think it is related.  I
only tested with SCHED_4BSD.  For SCHED_4BSD, I suspect that the bug
is from pinning a thread to a CPU and then stopping that CPU.  Pure
UP has no problems since pinning is null for it.  SCHED_4BSD has especially
special handing for SMP (a separate runq for each CPU.  I have been modifying
SCHED_4BSD and the separate queues mostly get in the way).

Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20180411055306.P5336>