From: "Teske, Devin"
To: freebsd-fs
Subject: Re: zfs upgrade hang
Date: Sat, 21 Sep 2013 00:40:13 +0000
On Sep 20, 2013, at 5:27 PM, Teske, Devin wrote:

> Hi,
>
> We seem to be having an issue with "zfs upgrade" hanging.
>
> Please note that "zpool upgrade" seems to be fine... it's "zfs upgrade" that hangs.
>
> The system is 8.4-STABLE @ r255470M amd64.
>
> The dataset versions prior to upgrade are 3 and after upgrade are 5.
>
> It doesn't always hang. But once it does, it's always in the following ^T state:
>
> tx->tx_sync_done_cv
>
> You can let it sit there for minutes or hours, but it never completes or enters a
> different state. Also, you can't Ctrl-C it, you can't Ctrl-Z it, you can't kill it, not even
> with `-9'. Further, anything like "zfs list" will also hang. Meanwhile, the filesystem
> is readable and fine.
>
> The sure-fire way to hit this for us is to attempt a "-a" or "-r" or "-ra" to do many datasets
> at once.
>
> However, doing one dataset at a time will work... until it too leads to the same state.
>
> Once we hit this state (hung upgrade) we have to reboot.
>
> We've been able to get through all the datasets on a box by doing them one at a time and
> rebooting whenever one hangs in this state, but it's frustrating because we can
> usually only do a handful before hitting the problem.
>
> Scripting it won't help.
>
> We've also tried unmounting the filesystems prior to the upgrade; that didn't help.
> Updating libraries/binaries to r255747 didn't seem to help either. I guess the next step is to
> update the kernel to the latest stable/8 (which is probably not far ahead of r255470).
>
> Advice?

Before you chime in, I think I might have more to add to the puzzle.

It would seem that the LSI-RAID1 pool is the problem.
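When working through the datasets one at a time, it can help to see which filesystems are still below the target version. A minimal sketch, assuming a POSIX shell with awk: `old_datasets` is a hypothetical helper name (not part of ZFS) that reads the tab-separated output of `zfs list -H -o name,version -t filesystem` on stdin and prints every filesystem whose version is below the target (5 by default, matching the post-upgrade version mentioned above).

```shell
# Hypothetical helper (not part of ZFS): print filesystems whose ZFS
# version is below a target. Expects "name<TAB>version" lines on stdin,
# e.g. from:  zfs list -H -o name,version -t filesystem
old_datasets() {
    target=${1:-5}    # default target: version 5
    # zfs list -H separates fields with a single tab, so split on tabs only.
    # Skip rows whose version is not a plain number, then compare numerically.
    awk -v t="$target" '
        BEGIN { FS = "\t" }
        $2 ~ /^[0-9]+$/ && ($2 + 0) < (t + 0) { print $1 }
    '
}
```

For example, `zfs list -H -o name,version -t filesystem | old_datasets 5` lists only the filesystems not yet upgraded; `-t filesystem` keeps snapshots out of the list. It only reads properties and never runs `zfs upgrade`. Per the report above, `zfs list` itself hangs once an upgrade is stuck, so this is only useful before the hang or after a reboot.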
We always hang when trying to execute "zfs upgrade LSI-RAID1", even though we've already done a "zfs upgrade" of everything underneath it. We've also done a "zfs upgrade" of other pools and their descendants on the same system without this hang.

So I was thinking... what makes the "LSI-RAID1" pool different from, say, the "NEC1_POOL_A" pool?

The answer is the mountpoint. LSI-RAID1 does not have a mountpoint, while everything else does.

Here's the layout we have:

hvm2b# zfs get version
NAME                                            PROPERTY  VALUE  SOURCE
LSI-RAID1                                       version   3      -
LSI-RAID1/vm                                    version   5      -
LSI-RAID1/vm/golden0                            version   5      -
LSI-RAID1/vm/golden0@pre-cfg0-snap1             version   3      -
LSI-RAID1/vm/golden0@non-cfg0-snap1             version   3      -
LSI-RAID1/vm/golden0@zxfer_4709_20130905025825  version   3      -
LSI-RAID1/vm/ipu0c                              version   5      -
LSI-RAID1/vm/ipu1c                              version   5      -
LSI-RAID1/vm/ipu2c                              version   5      -
LSI-RAID1/vm/oos0c                              version   5      -
LSI-RAID1/vmbak                                 version   5      -
LSI-RAID1/vmbak/vm                              version   5      -
NEC2_POOL_B                                     version   5      -
NEC2_POOL_B/oos0c                               version   5      -

As you can see, while trying to work around this hang, we've been able to upgrade all the datasets except the one culprit at the top.

So... did I discover a bug? Perhaps one related to "zfs upgrade" touching datasets that don't have a mountpoint set?

-- 
Devin