Date: Sat, 26 Aug 2017 12:08:22 +0200
From: "Ronald Klop" <ronald-lists@klop.ws>
To: "FreeBSD-STABLE Mailing List" <freebsd-stable@freebsd.org>, "Mike Tancsa" <mike@sentex.net>
Subject: Re: file system deadlock in RELENG_11
Message-ID: <op.y5ko38r0kndu52@joepie>
In-Reply-To: <28c89f80-4797-7e95-a637-472ac7bc98a5@sentex.net>
References: <66b97b27-cbea-a3a8-374d-3f7c017b5418@sentex.net> <28c89f80-4797-7e95-a637-472ac7bc98a5@sentex.net>

Running procstat -kk <pid> prints the kernel stack of each thread in the
process, which shows where in the kernel (and in which syscall) it is
blocked. That might give people some valuable information.

Regards,
Ronald.


On Thu, 24 Aug 2017 22:01:25 +0200, Mike Tancsa <mike@sentex.net> wrote:

> OK, this is fairly easy to repeat. If I start a sync of a snapshot via
> zrep, it hangs the box. CTRL+T shows
>
>
> DEBUG: overiding stale lock on zroot/chyves from pid 19378
> sending zroot/chyves@zrep_000010 to 10.151.9.2:zroot/chyves
> cannot receive new filesystem stream: destination 'zroot/chyves/guests/resi/disk1' exists
> must specify -F to overwrite it
> ^C
> load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 358.94r 0.00u 0.00s 0% 3476k
> load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 360.42r 0.00u 0.00s 0% 3476k
> load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 360.79r 0.00u 0.00s 0% 3476k
> load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 360.99r 0.00u 0.00s 0% 3476k
> load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 361.19r 0.00u 0.00s 0% 3476k
> load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 361.37r 0.00u 0.00s 0% 3476k
> load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 361.55r 0.00u 0.00s 0% 3476k
> load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 361.74r 0.00u 0.00s 0% 3476k
> load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 361.92r 0.00u 0.00s 0% 3476k
> load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 362.11r 0.00u 0.00s 0% 3476k
> load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 362.31r 0.00u 0.00s 0% 3476k
> load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 362.50r 0.00u 0.00s 0% 3476k
> load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 362.69r 0.00u 0.00s 0% 3476k
> load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 362.90r 0.00u 0.00s 0% 3476k
> load: 0.52 cmd: zfs 29690 [tx->tx_sync_done_cv] 363.08r 0.00u 0.00s 0% 3476k
>
>
>
> On 8/24/2017 11:48 AM, Mike Tancsa wrote:
>> I upgraded a server yesterday from RELENG_11 from march 2017 to r322800
>> (Aug 22) and noticed that under heavy disk IO in a VM, the server is
>> locking up. In the vm, I was doing a large untar and I noticed that
>> prior to the lockup, the hypervisor would be struggling to keep up the
>> disk writes. The VM is on a zvol if that makes any difference. A few
>> times in the VM, IO would be clogged to the point that the disk would
>> timeout in the VM
>>
>> Aug 24 08:32:02 kernel: ahcich6: Timeout on slot 14 port 0
>> Aug 24 08:32:02 kernel: ahcich6: is 00000000 cs 00000000 ss ffffffff rs ffffffff tfd 50 serr 00000000 cmd 0001db17
>> Aug 24 08:32:02 kernel: (ada1:ahcich6:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 a8 47 d8 40 01 00 00 01 00 00
>> Aug 24 08:32:02 kernel: (ada1:ahcich6:0:0:0): CAM status: Command timeout
>> Aug 24 08:32:02 kernel: (ada1:ahcich6:0:0:0): Retrying command
>>
>> When the parent deadlocks, I cant run anything thats not already in RAM.
>> shutdown doesnt work and I have to reboot the box via IPMI.
>>
>> Any ideas how to debug this or try and better understand the problem so
>> I can at least work around it ?
>>
>> ---Mike
>>
>
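
For anyone who wants to script this: below is a rough Python sketch that pulls the command name, pid and wait channel out of one of the CTRL+T status lines quoted above and builds the matching procstat -kk invocation. The field layout in the regular expression is only inferred from those sample lines, and the names parse_status and procstat_command are made up for illustration. The sketch only prints the command, it does not run it, since a hung box may not be able to start new processes anyway.

```python
import re

# Assumption: layout inferred from the CTRL+T lines quoted in this thread:
#   load: <avg> cmd: <name> <pid> [<wait channel>] <real>r <user>u <sys>s <cpu>% <rss>k
STATUS_RE = re.compile(
    r"load:\s+(?P<load>[\d.]+)\s+cmd:\s+(?P<cmd>\S+)\s+(?P<pid>\d+)\s+"
    r"\[(?P<wchan>[^\]]+)\]\s+(?P<real>[\d.]+)r"
)


def parse_status(line):
    """Return (cmd, pid, wait channel, real seconds), or None if no match."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    return m["cmd"], int(m["pid"]), m["wchan"], float(m["real"])


def procstat_command(pid):
    """Build the argv for the procstat call suggested in this message."""
    return ["procstat", "-kk", str(pid)]


# One of the status lines from the quoted report.
sample = ("load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] "
          "358.94r 0.00u 0.00s 0% 3476k")

cmd, pid, wchan, real = parse_status(sample)
print(cmd, pid, wchan, real)
print(" ".join(procstat_command(pid)))
```

The wait channel in brackets (here tx->tx_sync_done_cv) only names what the thread is sleeping on; the procstat -kk output for the same pid adds the kernel call chain that led there.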