Date:      Fri, 13 Dec 2013 22:14:46 -0500
From:      Rod Taylor <rod.taylor@gmail.com>
To:        Steven Hartland <killing@multiplay.co.uk>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: ZFS related hang with FreeBSD 9.2
Message-ID:  <CAKddOFAEwmaWWF1PTp6P5FqU9X59S6nAzEUB-6RYpgSFOJswrQ@mail.gmail.com>
In-Reply-To: <E01DB08EA47D4897B73F8CCD55FFB103@multiplay.co.uk>
References:  <52AA97B8.8060408@nexusalpha.com> <CAKddOFATpE0U8Z0AYhsBPwfym3XJWeh95Q%2BS3Ug_YZJmkhUKCQ@mail.gmail.com> <E01DB08EA47D4897B73F8CCD55FFB103@multiplay.co.uk>

On Fri, Dec 13, 2013 at 7:55 PM, Steven Hartland <killing@multiplay.co.uk> wrote:

> Are you doing any snapshot sends as well as interacting with
> snapshots such as listing files in them via the .zfs?
>

I didn't take much time to debug it as the snapshots created by zfSnap were
for local backups only. Off-site backups are a simple pg_dump. Snapshots
were unmounted and untouched by me, nor do they show up in zfs list by
default. They did not get copied/imported to any other machines. No clones
were in use.

With zfSnap periodics enabled with the following configuration, the machine
spontaneously reboots about once a week. 9.0 was much worse: I could push
it over with simple heavy IO, such as a query performing a sequential table
scan in PostgreSQL on a 40GB table. 9.2 only seemed to go down during
periodic runs; I've not been able to push it over at any other time.

Anyway, /var/crash remains empty after a reboot (dumpdev="AUTO").

I *think* the problem is related to creating snapshots during high load,
though the problem is significantly reduced if I disable deletes. I've
been unable to manually trigger a crash on 9.2 using zfSnap commands, but
crashes still occur with regularity during periodics and spontaneously
during the day.
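
For anyone wanting to test the "snapshots under load" theory, a minimal
sketch of a create/destroy loop follows. It has not been run against this
crash and is only an assumption about how one might exercise that path; it
is not what zfSnap itself runs. The `stress-N` snapshot names, the `run`
helper, and the `DRY_RUN` and `ITERATIONS` variables are hypothetical. By
default it only prints the `zfs` commands instead of executing them.

```shell
#!/bin/sh
# Hypothetical reproduction sketch: repeatedly create and destroy recursive
# snapshots on a pool. Nothing here comes from zfSnap; names are made up.
POOL="${POOL:-tank0}"        # pool name taken from the periodic.conf snippet
DRY_RUN="${DRY_RUN:-1}"      # 1 = print commands only, 0 = really run them

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# $1 = number of create/destroy iterations
snapshot_cycle() {
    i=1
    while [ "$i" -le "$1" ]; do
        run zfs snapshot -r "${POOL}@stress-${i}"
        run zfs destroy -r "${POOL}@stress-${i}"
        i=$((i + 1))
    done
}

snapshot_cycle "${ITERATIONS:-3}"
```

Run it with `DRY_RUN=0` while a heavy sequential read (such as the table
scan mentioned above) is in progress to approximate the periodic-run
conditions.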

ZFS v28 under 9.0/9.1, and all feature flags enabled under 9.2.

Nothing is logged.


Relevant snippet from periodic.conf:

# Filesystem snapshots
daily_zfsnap_enable="YES"
daily_zfsnap_recursive_fs="tank0"
daily_zfsnap_flags="-s -S"
daily_zfsnap_ttl=2m

monthly_zfsnap_enable="YES"
monthly_zfsnap_recursive_fs="tank0"
monthly_zfsnap_flags="-s -S"
monthly_zfsnap_ttl=6m

reboot_zfsnap_enable="YES"
reboot_zfsnap_flags="-s -S"
reboot_zfsnap_recursive_fs="tank0"

weekly_zfsnap_delete_enable="YES"
weekly_zfsnap_delete_flags="-s -S"
weekly_zfsnap_recursive_fs="tank0"
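
For readers unfamiliar with the TTL values above (2m, 6m): in zfSnap's
notation a lowercase m means months and an uppercase M means minutes. The
sketch below converts such a TTL to seconds. The unit lengths (365-day
year, 30-day month) are an assumption based on my reading of zfSnap's TTL
handling, and `ttl_to_seconds` is a hypothetical helper, not part of
zfSnap.

```shell
#!/bin/sh
# Sketch: convert a zfSnap-style TTL such as "2m" or "1w2d" to seconds.
# ASSUMPTION: 365-day years and 30-day months; check the zfSnap man page
# before relying on these values. Numbers with leading zeros are not handled.
ttl_to_seconds() {
    ttl="$1"
    total=0
    num=""
    while [ -n "$ttl" ]; do
        rest="${ttl#?}"          # everything after the first character
        c="${ttl%"$rest"}"       # the first character
        ttl="$rest"
        case "$c" in
            [0-9]) num="${num}${c}"; continue ;;
            y) unit=31536000 ;;  # years
            m) unit=2592000 ;;   # months
            w) unit=604800 ;;    # weeks
            d) unit=86400 ;;     # days
            h) unit=3600 ;;      # hours
            M) unit=60 ;;        # minutes
            s) unit=1 ;;         # seconds
            *) echo "bad TTL unit: $c" >&2; return 1 ;;
        esac
        [ -n "$num" ] || { echo "missing number before $c" >&2; return 1; }
        total=$(($total + $num * $unit))
        num=""
    done
    [ -z "$num" ] || { echo "trailing number without unit" >&2; return 1; }
    echo "$total"
}

ttl_to_seconds 2m   # daily TTL:   5184000 seconds (60 days)
ttl_to_seconds 6m   # monthly TTL: 15552000 seconds (180 days)
```

Under these assumptions, daily snapshots become eligible for deletion
after roughly 60 days and monthly snapshots after roughly 180 days.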



> If so make sure you have the following patch applied as
> that can cause a deadlock between these two operations
> http://svnweb.freebsd.org/changeset/base/258595
>

I have not tried this patch but can over the holidays.



----- Original Message ----- From: "Rod Taylor" <rod.taylor@gmail.com>
> To: "Ryan Baldwin" <ryan.baldwin@nexusalpha.com>
> Cc: <freebsd-fs@freebsd.org>
> Sent: Friday, December 13, 2013 11:21 PM
> Subject: Re: ZFS related hang with FreeBSD 9.2
>
>
>
>> Are you using snapshots?
>>
>> I've found ZFS Snapshots on 9.0, 9.1, and 9.2 regularly crash the system.
>> Delete the snapshots and don't create any new ones and suddenly it's
>> stable
>> for months.
>>
>>
>>
>> On Fri, Dec 13, 2013 at 12:14 AM, Ryan Baldwin
>> <ryan.baldwin@nexusalpha.com> wrote:
>>
>>> Hi,
>>>
>>> We have a server based on FreeBSD 9.2 which hangs at times on a daily
>>> basis. The longest uptime we have achieved is 5 days; conversely, it has
>>> stopped daily several days in a row.
>>>
>>> When this occurs it appears there are two processes stuck in 'tx->tx'
>>> state. In the top output shown these are snapshot-manager processes which
>>> generally create and destroy snapshots and sometimes roll back filesystems
>>> to snapshots. When the lockup occurs, other processes which try to access
>>> the file system can end up stuck in state 'rrl->r'. The reboot command
>>> that was issued to try and reboot the server has ended up stuck in this
>>> state, as can be seen.
>>>
>>> The server is not under particularly heavy load.
>>>
>>> It has remained in this state for hours. The 'deadman handler'? does not
>>> appear to restart the system. Once this has occurred there is no further
>>> disk activity.
>>>
>>> We did not experience this problem at all previously using 9.1, although
>>> we had fewer snapshot-manager processes before. We have built this server
>>> against 9.1 again now but it has only been going one day so far.
>>>
>>> We can try and reproduce this problem again on 9.2 if by doing so we can
>>> gather any additional information that could help resolve this problem.
>>> Please let me know what other information would be helpful.
>>>
>>> The hardware is a Dell R420 with Perc H310 raid controller in JBOD mode
>>> with the pool mirrored on two SAS disks.
>>>
>>> Thanks
>>>
>>> top and procstat output follow: ...
>>>
>>


