From: Rod Taylor
To: Steven Hartland
Cc: freebsd-fs@freebsd.org
Date: Fri, 13 Dec 2013 22:14:46 -0500
Subject: Re: ZFS related hang with FreeBSD 9.2
References: <52AA97B8.8060408@nexusalpha.com>

On Fri, Dec 13, 2013 at 7:55 PM, Steven Hartland wrote:

> Are you doing any snapshot sends as well as interacting with
> snapshots such as listing files in them via the .zfs?
>

I didn't take much time to debug it as the snapshots created by zfSnap
were for local backups only. Off-site backups are a simple pg_dump.

Snapshots were unmounted and untouched by me, nor do they show up in
zfs list by default. They did not get copied/imported to any other
machines. No clones were in use.

With zfSnap periodics enabled with the following configuration, the
machine spontaneously reboots about once a week.
9.0 was much worse; I could push it over with simple heavy I/O such as
a query performing a sequential table scan in PostgreSQL on a 40GB
table. 9.2 only seemed to go down during periodic runs. I've not been
able to push it over at any other time.

Anyway, /var/crash remains empty after a reboot (dumpdev="AUTO").

I *think* the problem is related to creating snapshots during high
load, though it is significantly reduced if I disable deletes.

I've been unable to manually trigger a crash on 9.2 using zfSnap
commands, but crashes still occur with regularity during periodics and
spontaneously during the day.

ZFS v28 under 9.0/9.1, and all feature flags enabled under 9.2.

Nothing is logged.

Relevant snippet from periodic.conf:

# Filesystem snapshots
daily_zfsnap_enable="YES"
daily_zfsnap_recursive_fs="tank0"
daily_zfsnap_flags="-s -S"
daily_zfsnap_ttl=2m

monthly_zfsnap_enable="YES"
monthly_zfsnap_recursive_fs="tank0"
monthly_zfsnap_flags="-s -S"
monthly_zfsnap_ttl=6m

reboot_zfsnap_enable="YES"
reboot_zfsnap_flags="-s -S"
reboot_zfsnap_recursive_fs="tank0"

weekly_zfsnap_delete_enable="YES"
weekly_zfsnap_delete_flags="-s -S"
weekly_zfsnap_recursive_fs="tank0"

> If so make sure you have the following patch applied as
> that can cause a deadlock between these two operations
> http://svnweb.freebsd.org/changeset/base/258595
>

I have not tried this patch but can over the holidays.

> ----- Original Message -----
> From: "Rod Taylor"
> To: "Ryan Baldwin"
> Cc:
> Sent: Friday, December 13, 2013 11:21 PM
> Subject: Re: ZFS related hang with FreeBSD 9.2
>
>> Are you using snapshots?
>>
>> I've found ZFS Snapshots on 9.0, 9.1, and 9.2 regularly crash the system.
>> Delete the snapshots and don't create any new ones and suddenly it's
>> stable for months.
>>
>> On Fri, Dec 13, 2013 at 12:14 AM, Ryan Baldwin wrote:
>>
>>> Hi,
>>>
>>> We have a server based on FreeBSD 9.2 which hangs at times on a daily
>>> basis.
>>> The longest uptime we have achieved is 5 days; conversely, it has
>>> stopped daily several days in a row.
>>>
>>> When this occurs it appears there are two processes stuck in 'tx->tx'
>>> state. In the top output shown these are snapshot-manager processes,
>>> which generally create and destroy snapshots and sometimes roll back
>>> filesystems to snapshots. When the lockup occurs, other processes that
>>> try to access the file system can end up stuck in state 'rrl->r'. The
>>> reboot command that was issued to try and reboot the server has ended
>>> up stuck in this state, as can be seen.
>>>
>>> The server is not under particularly heavy load.
>>>
>>> It has remained in this state for hours. The 'deadman handler'? does not
>>> appear to restart the system. Once this has occurred there is no further
>>> disk activity.
>>>
>>> We did not experience this problem at all previously using 9.1, although
>>> we had fewer snapshot-manager processes before. We have built this server
>>> against 9.1 again now but it has only been going one day so far.
>>>
>>> We can try and reproduce this problem again on 9.2 if by doing so we can
>>> gather any additional information that could help resolve this problem.
>>> Please let me know what other information would be helpful.
>>>
>>> The hardware is a Dell R420 with Perc H310 raid controller in JBOD mode
>>> with the pool mirrored on two SAS disks.
>>>
>>> Thanks
>>>
>>> top and procstat output follow: ...
>>>
>>
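
For reference, the daily_zfsnap_ttl=2m and monthly_zfsnap_ttl=6m values
in the periodic.conf snippet earlier in this message are zfSnap TTL
strings. Below is a minimal Python sketch of how such a string could map
to a retention period. It assumes the unit letters y, m, w, d, h, M and
s, with a month counted as 30 days and a year as 365 days; those lengths
are an assumption to check against the installed zfSnap version, and
ttl_to_seconds is a hypothetical helper, not part of zfSnap.

```python
import re

# Seconds per TTL unit letter.
# ASSUMPTION: fixed-length months (30 days) and years (365 days);
# verify against the zfSnap version actually installed.
UNIT_SECONDS = {
    "y": 365 * 86400,
    "m": 30 * 86400,
    "w": 7 * 86400,
    "d": 86400,
    "h": 3600,
    "M": 60,
    "s": 1,
}

_TOKEN = re.compile(r"(\d+)([ymwdhMs])")
_WHOLE = re.compile(r"(?:\d+[ymwdhMs])+")


def ttl_to_seconds(ttl):
    """Convert a TTL string such as '2m' or '1w2d' to seconds.

    Hypothetical helper for illustration only.
    """
    # Reject anything that is not a sequence of <number><unit> pairs.
    if not _WHOLE.fullmatch(ttl):
        raise ValueError(f"unrecognised TTL: {ttl!r}")
    return sum(int(n) * UNIT_SECONDS[u] for n, u in _TOKEN.findall(ttl))


if __name__ == "__main__":
    # TTL values taken from the periodic.conf snippet in this message.
    for name, ttl in (("daily", "2m"), ("monthly", "6m")):
        days = ttl_to_seconds(ttl) // 86400
        print(f"{name}: ttl={ttl} -> about {days} days")
```

Under those assumptions the daily snapshots would be kept for about 60
days and the monthly ones for about 180 days, which gives a rough idea
of how many snapshots the weekly delete run has to walk through.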