From owner-freebsd-fs@FreeBSD.ORG Wed Aug 15 08:24:42 2012
From: Karli Sjöberg <Karli.Sjoberg@slu.se>
To: Hugo Lombard
Cc: "freebsd-fs@freebsd.org"
Date: Wed, 15 Aug 2012 10:24:38 +0200
Subject: Re: Hang when importing pool

On 15 Aug 2012, at 09:31, Hugo Lombard wrote:

> On Wed, Aug 15, 2012 at 08:45:38AM +0200, Karli Sjöberg wrote:
>
>> I took your advice. I replaced my Core i5 with a Xeon X3470 and ramped
>> up the RAM to 32GB, maxing out the hardware. Sadly enough, it still
>> stalls in exactly the same manner :( This has to be the most
>> frustrating thing ever, since there's tons of data there that I really
>> need, and if it weren't for that stupid destroy operation, it would
>> still be accessible. I feel that FreeBSD is partly to blame, since the
>> originating Sun machine running Solaris, with only 16GB RAM, managed
>> to run the same destroy on the same dataset without any problem. Sure,
>> it took forever and then some (about two weeks), but it stayed afloat
>> the whole time.
>
> Sorry to hear about your pain. I've recently run into a similar problem
> where destroying a lot of snapshots on de-duped filesystems caused two
> boxes (one a replica of the other) to strangle themselves.
>
> After much struggling, I opted to redo the slave box, mount the master
> box's pool read-only, and rsync the datasets across. In retrospect, I
> shouldn't have deleted so many snapshots at once.
>
> Both boxes are quad-core Opterons with 16GB RAM each. On the newly
> re-done box, I've decided not to use de-dupe.
>
> In the process of searching for an answer I came across this thread:
>
>   http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg47526.html
>
> The person who originally reported the issue finally managed to recover
> their pool with a loan machine from Oracle that had 120GB RAM:
>
>   http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg47529.html
>
> Personally, I don't think the problem is purely FreeBSD's fault.

Neither do I, I said "partly".

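If it comes to that for me too, I gather the read-only import + rsync
route you describe would go roughly like this (assuming the pool version
supports read-only import; the pool name, paths and target hostname below
are just placeholders, not my actual setup):

# zpool import -o readonly=on -f tank
# zfs list -r tank
# rsync -aHAX --numeric-ids /tank/data/ newbox:/tank/data/

That way nothing gets written to the sick pool while the data is copied
off.
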
From the link you sent me, quoting a Mr Jim Klimov:

"According to my research (fleshed out on the Jive Forums, so I'd repeat
it here) it seems that (MY SPECULATION FOLLOWS):

1) some kernel module (probably related to ZFS) takes hold of more and
   more RAM;
2) since it is kernel memory, it can not be swapped out;
3) since all RAM is depleted but there are requests for RAM allocation,
   the kernel scans all allocated memory to find candidates for swapping
   out (hence the high scanrate);
4) since all RAM is now consumed by a BADLY DESIGNED kernel module which
   can not be swapped out, the system dies in a high-scanrate agony,
   because there is no RAM available to do anything. It can be "pinged"
   for a while, but not much more.

I stress that the module is BADLY DESIGNED as it is in my current running
version of the OS (I don't know yet if it was fixed in oi_151a), because
it is probably trying to build the full ZFS tree in its addressable
memory - regardless of whether it can fit there. IMHO the module should
try to process the pool in smaller chunks, or allow swapping out, if
hardware constraints like insufficient RAM force it to."

Wow, repeated twice...

"Symptoms are like what you've described, including the huge scanrate
just before the system dies (becomes unresponsive). Also, if you try
running with "vmstat 1" you can see that in the last few seconds of
uptime the system goes from several hundred free MBs (or even over a GB
of free RAM) down to under 32MB very quickly - consuming hundreds of MBs
per second."

These symptoms are exactly what I'm experiencing!

Further down:

"However, with ZDB analysis I managed to find some counter of free
blocks - those which belonged to a killed dataset. It seems that at first
they are quickly marked for deletion (i.e. they are not referenced by any
dataset, but are still in the ZFS block tree), and then during the pool's
current uptime or further import attempts, these blocks are actually
walked and excluded from the ZFS tree. In my case I saw that between
reboots and import attempts this counter went down by some 3 million
blocks every uptime, and after a couple of stressful weeks the destroyed
dataset was gone and the pool just worked on and on.

So if you still have this problem, try running ZDB to see if the
deferred-free count is decreasing between pool import attempts:

# time zdb -bsvL -e ...

  976K  114G  113G  172G  180K  1.01  1.56  deferred free

..."

So hopefully, if I just keep at it, maybe it solves itself. Right now I'm
trying:

# zpool import -f -F -X id

as Marcelo Araujo suggested. We'll see how long it takes before it stalls
this time...

/Karli
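
PS: For anyone else following along, this is roughly how I intend to
check whether that deferred-free counter is actually shrinking between
attempts (the pool name below is a placeholder; zdb -e works against an
exported pool and can take a very long time on a pool this size):

# time zdb -bsvL -e mypool | grep -i deferred

Run it once, attempt the import again, export (or reboot) and run the
same line again - if the block count in the first column keeps dropping,
the destroy is still making progress.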