From owner-freebsd-stable@freebsd.org Thu Aug 24 20:01:29 2017 Return-Path: Delivered-To: freebsd-stable@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id DB2C3DE71C4 for ; Thu, 24 Aug 2017 20:01:29 +0000 (UTC) (envelope-from mike@sentex.net) Received: from smarthost2.sentex.ca (smarthost2.sentex.ca [IPv6:2607:f3e0:80:80::2]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client CN "smarthost.sentex.ca", Issuer "smarthost.sentex.ca" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 9F9D16443C for ; Thu, 24 Aug 2017 20:01:29 +0000 (UTC) (envelope-from mike@sentex.net) Received: from lava.sentex.ca (lava.sentex.ca [IPv6:2607:f3e0:0:5::11]) by smarthost2.sentex.ca (8.15.2/8.15.2) with ESMTPS id v7OK1RgQ062393 (version=TLSv1 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO) for ; Thu, 24 Aug 2017 16:01:28 -0400 (EDT) (envelope-from mike@sentex.net) Received: from [192.168.43.26] (saphire3.sentex.net [192.168.43.26]) by lava.sentex.ca (8.15.2/8.15.2) with ESMTP id v7OK1Plb055770 for ; Thu, 24 Aug 2017 16:01:25 -0400 (EDT) (envelope-from mike@sentex.net) Subject: Re: file system deadlock in RELENG_11 From: Mike Tancsa To: FreeBSD-STABLE Mailing List References: <66b97b27-cbea-a3a8-374d-3f7c017b5418@sentex.net> Organization: Sentex Communications Message-ID: <28c89f80-4797-7e95-a637-472ac7bc98a5@sentex.net> Date: Thu, 24 Aug 2017 16:01:25 -0400 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.3.0 MIME-Version: 1.0 In-Reply-To: <66b97b27-cbea-a3a8-374d-3f7c017b5418@sentex.net> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit X-Scanned-By: MIMEDefang 2.78 X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 24 Aug 2017 20:01:30 -0000 OK, this is fairly easy to repeat. If I start a sync of a snapshot via zrep, it hangs the box. CTRL+T shows DEBUG: overiding stale lock on zroot/chyves from pid 19378 sending zroot/chyves@zrep_000010 to 10.151.9.2:zroot/chyves cannot receive new filesystem stream: destination 'zroot/chyves/guests/resi/disk1' exists must specify -F to overwrite it ^C load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 358.94r 0.00u 0.00s 0% 3476k load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 360.42r 0.00u 0.00s 0% 3476k load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 360.79r 0.00u 0.00s 0% 3476k load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 360.99r 0.00u 0.00s 0% 3476k load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 361.19r 0.00u 0.00s 0% 3476k load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 361.37r 0.00u 0.00s 0% 3476k load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 361.55r 0.00u 0.00s 0% 3476k load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 361.74r 0.00u 0.00s 0% 3476k load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 361.92r 0.00u 0.00s 0% 3476k load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 362.11r 0.00u 0.00s 0% 3476k load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 362.31r 0.00u 0.00s 0% 3476k load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 362.50r 0.00u 0.00s 0% 3476k load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 362.69r 0.00u 0.00s 0% 3476k load: 0.48 cmd: zfs 29690 [tx->tx_sync_done_cv] 362.90r 0.00u 0.00s 0% 3476k load: 0.52 cmd: zfs 29690 [tx->tx_sync_done_cv] 363.08r 0.00u 0.00s 0% 3476k On 8/24/2017 11:48 AM, Mike Tancsa wrote: > I upgraded a server yesterday from RELENG_11 from march 2017 to r322800 > (Aug 22) and noticed that under heavy disk IO in a VM, the server is > locking up. In the vm, I was doing a large untar and I noticed that > prior to the lockup, the hypervisor would be struggling to keep up the > disk writes. The VM is on a zvol if that makes any difference. A few > times in the VM, IO would be clogged to the point that the disk would > timeout in the VM > > Aug 24 08:32:02 kernel: ahcich6: Timeout on slot 14 port 0 > Aug 24 08:32:02 kernel: ahcich6: is 00000000 cs 00000000 ss ffffffff rs > ffffffff tfd 50 serr 00000000 cmd 0001db17 > Aug 24 08:32:02 kernel: (ada1:ahcich6:0:0:0): WRITE_FPDMA_QUEUED. ACB: > 61 00 a8 47 d8 40 01 00 00 01 00 00 > Aug 24 08:32:02 kernel: (ada1:ahcich6:0:0:0): CAM status: Command timeout > Aug 24 08:32:02 kernel: (ada1:ahcich6:0:0:0): Retrying command > > When the parent deadlocks, I cant run anything thats not already in RAM. > shutdown doesnt work and I have to reboot the box via IPMI. > > Any ideas how to debug this or try and better understand the problem so > I can at least work around it ? > > ---Mike > -- ------------------- Mike Tancsa, tel +1 519 651 3400 Sentex Communications, mike@sentex.net Providing Internet services since 1994 www.sentex.net Cambridge, Ontario Canada http://www.tancsa.com/