Date:      Wed, 27 Jul 2011 13:41:34 -0700
From:      David P Discher <dpd@bitgravity.com>
To:        Steven Hartland <killing@multiplay.co.uk>
Cc:        freebsd-fs@FreeBSD.org, Andriy Gapon <avg@freebsd.org>
Subject:   Re: zfs process hang on pool access
Message-ID:  <6703F0BB-D4FC-4417-B519-CAFC62E5BC39@bitgravity.com>
In-Reply-To: <4E302204.2030009@FreeBSD.org>
References:  <A14F1C768A41483C876AD77502A864D6@multiplay.co.uk> <0D449EC916264947AB31AA17F870EA7A@multiplay.co.uk> <4E3013DF.10803@FreeBSD.org> <3D6CEB50BEDD4ACE96FD35C4D085618A@multiplay.co.uk> <4E301C55.7090105@FreeBSD.org> <5C84E7C8452E489C8CA738294F5EBB78@multiplay.co.uk> <4E301F10.6060708@FreeBSD.org> <63705B5AEEAD4BB88ADB9EF770AB6C76@multiplay.co.uk> <4E302204.2030009@FreeBSD.org>


The way I found this was by breaking into the debugger, doing some backtraces, continuing, breaking in again, and doing some more backtraces on the hung processes to see what was going on, then walking through the code.

Then, once I had specific loops and code locations, I asked the higher powers of the FreeBSD kernel world.

Of course, I had the high CPU and was peeking at the arc_reclaim_thread.

I've seen this nearly like clockwork in production at 106-107 days. If it goes on much longer than that, things deadlock.

But at 112 days, and on 8.2 ... you for sure have the LBOLT overflow.
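
For what it's worth, the 106-107 day figure lines up with simple arithmetic on the old LBOLT macro, (gethrtime() * hz) / NANOSEC: the product gethrtime() * hz has to fit in a signed 64-bit value. Below is a minimal user-space sketch of that calculation, assuming hz = 1000 (check kern.hz on your box) and a 64-bit hrtime_t. The function name is made up for illustration; this is not kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define NANOSEC 1000000000LL    /* nanoseconds per second */

/*
 * Hypothetical helper: uptime in days at which (ns * hz) stops fitting
 * in a signed 64-bit integer, i.e. where the multiply-first LBOLT
 * expression would overflow.
 */
static double
lbolt_overflow_days(int64_t hz)
{
	assert(hz > 0);
	/* Largest nanosecond uptime for which ns * hz <= INT64_MAX. */
	int64_t max_ns = INT64_MAX / hz;

	return ((double)max_ns / (double)NANOSEC / 86400.0);
}
```

With hz = 1000, lbolt_overflow_days(1000) comes out to roughly 106.75 days. A lower hz (100, say) pushes the overflow out by the same factor.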

Otherwise, reboot and patch.  However, I have not fully vetted the patch under heavy load, and I am currently seeing another deadlock issue with 8.1+ zfs v14 - but seemingly during writes after 6-40 hours.  Still investigating.

Note, my proposal of "time_uptime" doesn't work, as it causes a buildworld error in the zfs userland tools.

This is what I'm currently running to fix the 26 day issue with the l2arc feeder and arc_reclaim_thread with LBOLT in 8.1.


Index: sys/cddl/compat/opensolaris/sys/time.h
===================================================================
--- sys/cddl/compat/opensolaris/sys/time.h      (.../8.1-BGOS-20110105) (revision 3322)
+++ sys/cddl/compat/opensolaris/sys/time.h      (.../8.1-BGOS-20110613) (working copy)
@@ -38,7 +38,7 @@
 
 typedef longlong_t     hrtime_t;
 
-#define        LBOLT   ((gethrtime() * hz) / NANOSEC)
+#define        LBOLT   (gethrtime() * (NANOSEC/hz))
 
 #if defined(__i386__) || defined(__powerpc__)
 #define        TIMESPEC_OVERFLOW(ts)                                           \

Index: sys/cddl/compat/opensolaris/sys/types.h
===================================================================
--- sys/cddl/compat/opensolaris/sys/types.h     (.../8.1-BGOS-20110105) (revision 3322)
+++ sys/cddl/compat/opensolaris/sys/types.h     (.../8.1-BGOS-20110613) (working copy)
@@ -34,6 +34,12 @@
  */
 
 #include <sys/stdint.h>
+
+#ifdef _KERNEL
+typedef        int64_t         clock_t;
+#define        _CLOCK_T_DECLARED
+#endif
+
 #include_next <sys/types.h>
 
 #define        MAXNAMELEN      256
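
For anyone who wants to sanity-check the arithmetic outside the kernel, here is a small user-space sketch. It rests on two assumptions that are not spelled out above: that a tick count is nanoseconds divided by nanoseconds-per-tick (so the sketch divides, which is worth comparing against the LBOLT line in the diff before relying on either), and that the stock clock_t is a signed 32-bit type, which is why the types.h hunk widens it. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NANOSEC 1000000000LL    /* nanoseconds per second */

/*
 * Hypothetical divide-first conversion: nanoseconds of uptime to ticks.
 * NANOSEC / hz is computed first, so no large intermediate product is
 * formed. The division truncates if hz does not divide NANOSEC evenly.
 */
static int64_t
ticks_divide_first(int64_t ns, int64_t hz)
{
	assert(hz > 0 && hz <= NANOSEC);
	return (ns / (NANOSEC / hz));
}

/* Returns 1 if a tick count would not fit in a signed 32-bit clock_t. */
static int
exceeds_int32(int64_t ticks)
{
	return (ticks > INT32_MAX);
}
```

At an assumed hz of 1000, 107 days of uptime is 9244800000 ticks, which fits comfortably in an int64_t but not in 32 bits. A 32-bit tick counter at that rate runs out a little under 25 days in.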


---
David P. Discher
dpd@bitgravity.com * AIM: bgDavidDPD
BITGRAVITY * http://www.bitgravity.com

On Jul 27, 2011, at 7:34 AM, Andriy Gapon wrote:

>> Ahh, is there anyway to confirm that before I reboot, or any other
>> information we could glean that might be useful?
> 
> No quick ideas, unfortunately.



