Date:      Mon, 17 Dec 2012 13:37:22 +0100
From:      Martin Matuska <mm@FreeBSD.org>
To:        Andriy Gapon <avg@FreeBSD.org>
Cc:        freebsd-fs@FreeBSD.org, freebsd-stable@FreeBSD.org
Subject:   Re: NFS/ZFS hangs after upgrading from 9.0-RELEASE to -STABLE
Message-ID:  <50CF1202.9070805@FreeBSD.org>
In-Reply-To: <50CA1639.1010409@FreeBSD.org>
References:  <CALC5%2B1Ptc=c_hxfc_On9iDN4AC_Xmrfdbc1NgyJH2ZxP6fE0Aw@mail.gmail.com> <50C9AFC6.6080902@FreeBSD.org> <CALC5%2B1MRurpbznOYrnE%2BK%2B=BEuj80iqJUbYkLN7SKFwtKqbE1Q@mail.gmail.com> <50CA1639.1010409@FreeBSD.org>

On 13.12.2012 18:54, Andriy Gapon wrote:
> on 13/12/2012 19:46 olivier said the following:
>> Thanks. I'll be sure to follow your suggestions next time this happens.
>>
>> I have a naive question/suggestion though. I see from browsing past discussions on
>> ZFS problems that it has been suggested a number of times that problems that
>> appear to originate in ZFS in fact come from lower layers; in particular because
>> of driver bugs or disks in the process of failing. It seems that it can take a lot
>> of time to troubleshoot such problems. I accept that ZFS behavior correctly leaves
>> dealing with timeouts to lower layers, but it seems to me that the ZFS layer would
>> be a great place to warn the user about issues and provide some information to
>> troubleshoot them.
>>
>> For example, if some I/O requests get lost because of a buggy driver, the driver
>> itself might not be the best place to identify those lost requests. But perhaps we
>> could have a compile time option in ZFS code that spits out a warning if it gets
>> stuck waiting for a particular request to come back for more than say 10 seconds,
>> and identifies the problematic disk? I'm sure there would be cases where these
>> warnings would be unwarranted, and I imagine that changes in the code to provide
>> such warnings would impact performance; so one certainly would not want that code
>> active by default. But someone in my position could certainly recompile the kernel
>> with a ZFS debugging option turned on to figure out the problem.
>>
>> I understand that ZFS code comes from upstream, and that you guys probably want to
>> keep FreeBSD-specific changes minimal. If that's a big problem, even just a patch
>> provided "as such" that does not make it into the FreeBSD code base might be
>> extremely useful. I wish I could help write something like that, but I know very
>> little about the kernel or ZFS. I would certainly be willing to help with testing.
> Google for "zfs deadman".  This is already committed upstream and I think that it
> is imported into FreeBSD, but I am not sure...  Maybe it's imported just into the
> vendor area and is not merged yet.
> So, when enabled, this logic would panic the system as a way of letting you know
> that something is wrong.  You can read in the links why panic was selected for this job.
>
> And speaking FreeBSD-centric - I think that our CAM layer would be a perfect place
> to detect such issues in a non-ZFS-specific way.
>
I can try to merge the ZFS deadman stuff (r242732) to HEAD, but I guess
this will be something for a 1-month MFC period.
Afterwards, a 9-STABLE patch can be easily created.

https://www.illumos.org/issues/3246
https://hg.openindiana.org/upstream/illumos/illumos-gate/rev/921a99998bb4
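
For illustration only, the kind of check olivier describes (flag any
request outstanding for more than about 10 seconds and name the disk)
could be sketched in userland C roughly as below. Every name here is
hypothetical and the table is a toy; this is not the deadman code from
the illumos change above, which panics rather than warns.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical names throughout; a toy model, not the ZFS deadman code. */
#define MAX_INFLIGHT    64
#define STALL_LIMIT_SEC 10      /* the "say 10 seconds" from the suggestion */

struct inflight_io {
    int         in_use;         /* slot holds an outstanding request */
    uint64_t    id;             /* request identifier */
    const char *disk;           /* device the request was sent to */
    uint64_t    issued_sec;     /* time at which the request was issued */
};

static struct inflight_io io_table[MAX_INFLIGHT];

/* Record a newly issued request. Returns 0, or -1 if the table is full. */
int io_issue(uint64_t id, const char *disk, uint64_t now_sec)
{
    for (int i = 0; i < MAX_INFLIGHT; i++) {
        if (!io_table[i].in_use) {
            io_table[i].in_use = 1;
            io_table[i].id = id;
            io_table[i].disk = disk;
            io_table[i].issued_sec = now_sec;
            return 0;
        }
    }
    return -1;
}

/* Forget a request once it completes. Returns 0, or -1 if it is unknown. */
int io_complete(uint64_t id)
{
    for (int i = 0; i < MAX_INFLIGHT; i++) {
        if (io_table[i].in_use && io_table[i].id == id) {
            io_table[i].in_use = 0;
            return 0;
        }
    }
    return -1;
}

/*
 * Periodic check: print one warning per request that has been outstanding
 * for more than STALL_LIMIT_SEC, and return how many were found.
 */
int io_check_stalled(uint64_t now_sec, FILE *out)
{
    int stalled = 0;

    for (int i = 0; i < MAX_INFLIGHT; i++) {
        const struct inflight_io *io = &io_table[i];

        if (!io->in_use || now_sec < io->issued_sec)
            continue;
        if (now_sec - io->issued_sec > STALL_LIMIT_SEC) {
            fprintf(out, "warning: request %" PRIu64 " on %s outstanding "
                "for %" PRIu64 " s\n", io->id, io->disk,
                now_sec - io->issued_sec);
            stalled++;
        }
    }
    return stalled;
}
```

A caller would invoke io_issue() when a request is sent, io_complete()
when it returns, and io_check_stalled() from a periodic timer. A real
in-kernel version would need locking and a cheaper data structure,
which is part of the performance concern raised above.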

Cheers,
mm

-- 
Martin Matuska
FreeBSD committer
http://blog.vx.sk