From owner-freebsd-questions@FreeBSD.ORG Mon Sep  9 13:08:18 2013
Date: Mon, 09 Sep 2013 08:08:09 -0500
From: dweimer <dweimer@dweimer.net>
To: freebsd-questions@freebsd.org
Subject: Re: ZFS Snapshots Not able to be accessed under .zfs/snapshot/name
In-Reply-To: <23413f3a4b95328c0bc838e6ffad364d@dweimer.net>
References: <22a7343f4573d6faac5aec1d7c9a1135@dweimer.net>
 <520C405A.6000408@ShaneWare.Biz>
 <776e30b627bf30ece7545e28b2a2e064@dweimer.net>
 <23413f3a4b95328c0bc838e6ffad364d@dweimer.net>

On 08/16/2013 8:49 am, dweimer wrote:
> On 08/15/2013 10:00 am, dweimer wrote:
>> On 08/14/2013 9:43 pm, Shane Ambler wrote:
>>> On 14/08/2013 22:57, dweimer wrote:
>>>> I have a few systems running on ZFS with a backup script that
>>>> creates snapshots, then backs up the .zfs/snapshot/name directory
>>>> to make sure open files are not missed. This has been working
>>>> great, but all of a sudden one of my systems has stopped working.
>>>> It takes the snapshots fine, and zfs list -t snapshot shows them,
>>>> but if you do an ls on the .zfs/snapshot/ directory it returns
>>>> "not a directory".
>>>>
>>>> Part of the zfs list output:
>>>>
>>>> NAME                       USED  AVAIL  REFER  MOUNTPOINT
>>>> zroot                     4.48G  29.7G    31K  none
>>>> zroot/ROOT                2.92G  29.7G    31K  none
>>>> zroot/ROOT/91p5-20130812  2.92G  29.7G  2.92G  legacy
>>>> zroot/home                 144K  29.7G   122K  /home
>>>>
>>>> Part of the zfs list -t snapshot output:
>>>>
>>>> NAME                                           USED  AVAIL  REFER  MOUNTPOINT
>>>> zroot/ROOT/91p5-20130812@91p5-20130812--bsnap  340K      -  2.92G  -
>>>> zroot/home@home--bsnap                          22K      -   122K  -
>>>>
>>>> ls /.zfs/snapshot/91p5-20130812--bsnap/
>>>> does work right now, since the last reboot, but it wasn't always
>>>> working; this is my boot environment.
>>>>
>>>> If I do ls /home/.zfs/snapshot/, the result is:
>>>> ls: /home/.zfs/snapshot/: Not a directory
>>>>
>>>> If I do ls /home/.zfs, the result is:
>>>> ls: snapshot: Bad file descriptor
>>>> shares
>>>>
>>>> I have tried zpool scrub zroot, and no errors were found. If I
>>>> reboot the system I can get one good backup, then I start having
>>>> problems. Anyone else ever run into this? Any suggestions as to a
>>>> fix?
>>>>
>>>> System is running FreeBSD 9.1-RELEASE-p5 #1 r253764: Mon Jul 29
>>>> 15:07:35 CDT 2013; zpool is running version 28, zfs is running
>>>> version 5.
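To clarify what that pre-backup job does, it is roughly the sketch
below. The dataset list, snapshot name, and backup path here are
examples only, not the exact script.

  #!/bin/sh
  # Sketch of the snapshot-then-backup job (names are examples only).
  # Snapshot each filesystem, archive the frozen view under
  # .zfs/snapshot so open files can't be missed, then clean up.
  FSLIST="zroot/home"   # example; the real job covers 8 filesystems
  SNAP=bsnap

  for fs in $FSLIST; do
      zfs snapshot "${fs}@${SNAP}"
      mnt=$(zfs get -H -o value mountpoint "$fs")
      # back up the read-only snapshot view instead of the live tree
      tar -cf "/backup/$(echo "$fs" | tr / _).tar" \
          -C "${mnt}/.zfs/snapshot/${SNAP}" .
      zfs destroy "${fs}@${SNAP}"
  done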
>>>
>>> I can say I've had this problem. I'm not certain what fixed it. I
>>> do remember I decided to stop snapshotting while I couldn't access
>>> the snapshots, and I deleted the existing ones. I later restarted
>>> the machine before I went back for another look, and they were
>>> working.
>>>
>>> So my guess is a restart without existing snapshots may be the key.
>>>
>>> Now if only we could find out what started the issue so we can stop
>>> it happening again.
>>
>> I had actually rebooted it last night, prior to seeing this message;
>> I do know it didn't have any snapshots this time. As I am booting
>> from ZFS using boot environments, I may have had an older boot
>> environment still on the system the last time it was rebooted.
>> Backups ran great last night after the reboot, and I was able to
>> kick off my pre-backup job and access all the snapshots today.
>> Hopefully it doesn't come back, but if it does I will see if I can
>> find anything else wrong.
>>
>> FYI, it didn't shut down cleanly, so in case this helps anyone find
>> the issue, this is from my system logs:
>>
>> Aug 14 22:08:04 cblproxy1 kernel:
>> Aug 14 22:08:04 cblproxy1 kernel: Fatal trap 12: page fault while in kernel mode
>> Aug 14 22:08:04 cblproxy1 kernel: cpuid = 0; apic id = 00
>> Aug 14 22:08:04 cblproxy1 kernel: fault virtual address   = 0xa8
>> Aug 14 22:08:04 cblproxy1 kernel: fault code              = supervisor write data, page not present
>> Aug 14 22:08:04 cblproxy1 kernel: instruction pointer     = 0x20:0xffffffff808b0562
>> Aug 14 22:08:04 cblproxy1 kernel: stack pointer           = 0x28:0xffffff80002238f0
>> Aug 14 22:08:04 cblproxy1 kernel: frame pointer           = 0x28:0xffffff8000223910
>> Aug 14 22:08:04 cblproxy1 kernel: code segment            = base 0x0, limit 0xfffff, type 0x1b
>> Aug 14 22:08:04 cblproxy1 kernel:                         = DPL 0, pres 1, long 1, def32 0, gran 1
>> Aug 14 22:08:04 cblproxy1 kernel: processor eflags        = interrupt enabled, resume, IOPL = 0
>> Aug 14 22:08:04 cblproxy1 kernel: current process         = 1 (init)
>> Aug 14 22:08:04 cblproxy1 kernel: trap number             = 12
>> Aug 14 22:08:04 cblproxy1 kernel: panic: page fault
>> Aug 14 22:08:04 cblproxy1 kernel: cpuid = 0
>> Aug 14 22:08:04 cblproxy1 kernel: KDB: stack backtrace:
>> Aug 14 22:08:04 cblproxy1 kernel: #0 0xffffffff808ddaf0 at kdb_backtrace+0x60
>> Aug 14 22:08:04 cblproxy1 kernel: #1 0xffffffff808a951d at panic+0x1fd
>> Aug 14 22:08:04 cblproxy1 kernel: #2 0xffffffff80b81578 at trap_fatal+0x388
>> Aug 14 22:08:04 cblproxy1 kernel: #3 0xffffffff80b81836 at trap_pfault+0x2a6
>> Aug 14 22:08:04 cblproxy1 kernel: #4 0xffffffff80b80ea1 at trap+0x2a1
>> Aug 14 22:08:04 cblproxy1 kernel: #5 0xffffffff80b6c7b3 at calltrap+0x8
>> Aug 14 22:08:04 cblproxy1 kernel: #6 0xffffffff815276da at zfsctl_umount_snapshots+0x8a
>> Aug 14 22:08:04 cblproxy1 kernel: #7 0xffffffff81536766 at zfs_umount+0x76
>> Aug 14 22:08:04 cblproxy1 kernel: #8 0xffffffff809340bc at dounmount+0x3cc
>> Aug 14 22:08:04 cblproxy1 kernel: #9 0xffffffff8093c101 at vfs_unmountall+0x71
>> Aug 14 22:08:04 cblproxy1 kernel: #10 0xffffffff808a8eae at kern_reboot+0x4ee
>> Aug 14 22:08:04 cblproxy1 kernel: #11 0xffffffff808a89c0 at kern_reboot+0
>> Aug 14 22:08:04 cblproxy1 kernel: #12 0xffffffff80b81dab at amd64_syscall+0x29b
>> Aug 14 22:08:04 cblproxy1 kernel: #13 0xffffffff80b6ca9b at Xfast_syscall+0xfb
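A note for the archives on Shane's "restart without existing
snapshots" suggestion above: every snapshot under the pool can be
cleared before the reboot with a one-liner. This is just a sketch, and
it assumes all snapshots under zroot are disposable:

  # List every snapshot name under zroot (-H strips headers, -r
  # recurses into child datasets) and destroy each one.
  zfs list -H -t snapshot -o name -r zroot | xargs -n 1 zfs destroy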
> Well, it's back: 3 of the 8 file systems I am taking snapshots of
> failed in last night's backups.
>
> The only thing different on this system from the 4 others I have
> running is that it has a second disk volume with a UFS file system.
>
> Setup is 2 disks, both set up with gpart:
>
> =>       34  83886013  da0  GPT  (40G)
>          34       256    1  boot0   (128k)
>         290  10485760    2  swap0   (5.0G)
>    10486050  73399997    3  zroot0  (35G)
>
> =>       34  41942973  da1  GPT  (20G)
>          34  41942973    1  squid1  (20G)
>
> I didn't want the Squid cache directory on ZFS; the system is running
> on an ESX 4.1 server backed by an iSCSI SAN. I have 4 other servers
> running on the same group of ESX servers and SAN, booting from ZFS,
> without this problem. Two of the other 4 are also running Squid, but
> they forward to this one, so they run without a local disk cache.

A quick update on this, in case anyone else runs into it: on the 2nd of
this month I finally deleted my UFS volume and created a new ZFS volume
to replace it. I recreated the Squid cache directories and let Squid
start over building up its cache. So far there hasn't been a noticeable
impact on performance from the switch, and the snapshot problem has not
reoccurred since making the change. It's only been a week running this
way, but before, the problem started within 36-48 hours.

-- 
Thanks,
   Dean E. Weimer
   http://www.dweimer.net/