From: David P Discher <dpd@bitgravity.com>
To: freebsd-fs@freebsd.org
Date: Thu, 26 May 2011 18:46:37 -0700
Subject: ZFS: arc_reclaim_thread running 100%, 8.1-RELEASE, LBOLT related

Hello FS list:

We've been using ZFS v3, storage pool v14, on FreeBSD 8.1-RELEASE with fairly
good results for over a year, and we have been moving more and more of our
storage to ZFS.  Last week I believe we hit another issue with LBOLT.  The
original issue was first reported by Artem Belevich for l2arc_feed_thread:

 - http://lists.freebsd.org/pipermail/freebsd-fs/2011-January/010558.html

but it also affects arc_reclaim_thread.  The guys over at iX Systems helped
out and pointed me to this patch:

 - http://people.freebsd.org/~delphij/misc/218180.diff

which typedefs clock_t to int64_t.

However, arc_reclaim_thread does not have a ~24 day rollover - it does not
use clock_t.  Instead, I think an integer overflow makes LBOLT itself go
negative after about 106-107 days.  We didn't notice anything until 112-115
days of uptime, and it also seems related to L1 ARC sizing and load: our
systems with ARC min/max set to 512M/2G haven't developed the issue - at
least not the CPU-hogging thread - but the systems with 12G+ of ARC, and lots
of rsync and du activity alongside random reads from the zpool, do develop it.
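For what it's worth, the 106-107 day figure falls straight out of the macro.
Here is a rough userland sketch (my own arithmetic, not code from the tree;
hz = 1000 is assumed) showing when the multiply inside LBOLT overflows a
signed 64-bit hrtime_t:

    #include <stdint.h>
    #include <stdio.h>

    /*
     * LBOLT is ((gethrtime() * hz) / NANOSEC).  gethrtime() returns
     * nanoseconds of uptime in a signed 64-bit hrtime_t, so the
     * intermediate product nsec * hz blows past INT64_MAX long before
     * the division is applied.  Compute the uptime at which that happens.
     */
    int
    main(void)
    {
            const int64_t hz = 1000;                /* assumed kern.hz */
            const int64_t nanosec = 1000000000LL;   /* NANOSEC */

            /* Largest nanosecond count whose product with hz still fits. */
            int64_t max_nsec = INT64_MAX / hz;
            double days = (double)max_nsec / nanosec / 86400.0;

            printf("LBOLT goes negative after %.2f days of uptime\n", days);
            return (0);
    }

With hz = 1000 that prints 106.75 days, which lines up with the 106-107 day
estimate above.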
The problem is slightly different, and possibly more harmful, than the l2arc
feeder issue seen with LBOLT.

In sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c, the arc_evict()
function, under "evict_start:", has this loop to walk the ARC page list:

    for (ab = list_tail(list); ab; ab = ab_prev) {
            ab_prev = list_prev(list, ab);
            bytes_remaining -= (ab->b_size * ab->b_datacnt);
            /* prefetch buffers have a minimum lifespan */
            if (HDR_IO_IN_PROGRESS(ab) ||
                (spa && ab->b_spa != spa) ||
                (ab->b_flags & (ARC_PREFETCH|ARC_INDIRECT) &&
                LBOLT - ab->b_arc_access < arc_min_prefetch_lifespan)) {
                    skipped++;
                    continue;
            }

Now, when LBOLT is negative, with some degree of jitter/randomness, this
check short-circuits the loop - LBOLT - ab->b_arc_access is hugely negative,
so the minimum-lifespan test passes and the buffer is skipped instead of
evicted - resulting in high CPU usage.  The ARC buffers may also not get
evicted on time, or possibly at all.  On one system all processes touching
the zpool were stuck waiting in D state, and arc_reclaim_thread was pegged at
100%.  du and rsync seem to aggravate the issue.

On an affected system:

    > top -SHb 500 | grep arc_reclaim_thr
      95 root    -8    -    0K    60K arc_re  3 102.9H 96.39% {arc_reclaim_thre}

Conveniently, "skipped++" is surfaced via a sysctl.  Here are two queries on
this system with the ARC reclaim thread running hot (a du was running at the
same time), 60 seconds apart:

    kstat.zfs.misc.arcstats.evict_skip: 4117714520450
    kstat.zfs.misc.arcstats.evict_skip: 4118188257434

    > uptime
     3:51PM  up 116 days, 23:48, 3 users, load averages: 1.30, 0.96, 0.64

After chatting with someone else well respected in the community, an
alternative fix was proposed.  I'm vetting it here to make sure there isn't
something deeper in the code that could bite us, and to ask for some
clarification.

In ./sys/cddl/compat/opensolaris/sys/time.h the relevant parts are:

    #define LBOLT   ((gethrtime() * hz) / NANOSEC)
    ...
    static __inline hrtime_t
    gethrtime(void) {

            struct timespec ts;
            hrtime_t nsec;

    #if 1
            getnanouptime(&ts);
    #else
            nanouptime(&ts);
    #endif
            nsec = (hrtime_t)ts.tv_sec * NANOSEC + ts.tv_nsec;
            return (nsec);
    }

QUESTION - what units is LBOLT supposed to be in?  If gethrtime() returns
nanoseconds, why are the nanoseconds multiplied by hz?  If LBOLT is supposed
to be clock ticks (which is what arc.c looks like it wants), then it really
should be:

    #define LBOLT   ((gethrtime() / NANOSEC) * hz)

But if that is the case, why make the call to getnanouptime() at all?  If
LBOLT is a number of clock ticks, can't this just be a query of the uptime in
seconds?  So how about something like this:

    #define LBOLT   (time_uptime * hz)

I've applied this change locally and did a basic stress test with our load
generator in the lab, thrashing the ARC (96 GB RAM, 48G min/max for ARC).  It
seems to have no ill effects - though we will have to wait about four months
before declaring the actual issue fixed.  I'm hoping to put this in
production next week.

All of the above is on 8.1-RELEASE.
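As a sanity check on the units question above, I ran roughly the following
userland harness (my own test code, not from the kernel; hz = 1000 is
assumed, and the overflowing multiply is done on unsigned integers so the
program itself stays well-defined while still wrapping the way the kernel's
signed arithmetic does in practice).  All three forms agree on the tick count
until the multiply-first form wraps negative, somewhere around day 107:

    #include <stdint.h>
    #include <stdio.h>

    #define NANOSEC 1000000000LL

    /* Simulated uptimes, in seconds: 1, 100, 110 and 120 days. */
    static const int64_t uptimes[] = {
            86400, 100 * 86400, 110 * 86400, 120 * 86400
    };

    int
    main(void)
    {
            const int64_t hz = 1000;        /* assumed kern.hz */
            unsigned int i;

            for (i = 0; i < sizeof(uptimes) / sizeof(uptimes[0]); i++) {
                    int64_t secs = uptimes[i];        /* time_uptime */
                    int64_t nsec = secs * NANOSEC;    /* gethrtime() */

                    /* Current macro: multiply first, then divide. */
                    int64_t lbolt_old =
                        (int64_t)((uint64_t)nsec * (uint64_t)hz) / NANOSEC;
                    /* Divide first, then multiply. */
                    int64_t lbolt_div = (nsec / NANOSEC) * hz;
                    /* Proposed form based on time_uptime. */
                    int64_t lbolt_new = secs * hz;

                    printf("day %3lld: old=%lld div-first=%lld uptime=%lld\n",
                        (long long)(secs / 86400), (long long)lbolt_old,
                        (long long)lbolt_div, (long long)lbolt_new);
            }
            return (0);
    }

At day 110 the first column has already gone negative while the other two are
still counting ticks, which is exactly the behavior arc_evict() trips over.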
ZFS v28 changed some of this, but still didn't improve lbolt:

 - http://svnweb.freebsd.org/base/head/sys/cddl/compat/opensolaris/sys/time.h?revision=221991&view=markup

    #define ddi_get_lbolt()         ((gethrtime() * hz) / NANOSEC)
    #define ddi_get_lbolt64()       (int64_t)((gethrtime() * hz) / NANOSEC)

It would seem the same optimization could be done here too:

    #define ddi_get_lbolt()         (time_uptime * hz)
    #define ddi_get_lbolt64()       (int64_t)(time_uptime * hz)

By saving the call to getnanouptime(), a multiply, and a divide, there should
be a couple-hundred-cycle performance improvement here.  I don't claim it
would be noticeable, but it seems like a simple, straightforward optimization.

clock_t will still need to be typedef'ed to 64 bits to address the l2arc
usage of LBOLT.

Thanks !

---
David P. Discher
dpd@bitgravity.com * AIM: bgDavidDPD
BITGRAVITY * http://www.bitgravity.com
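P.S. A back-of-the-envelope check of the rollover horizons (again my own
sketch, with hz = 1000 assumed, not code from the tree): a 32-bit clock_t
rolls over after roughly 24.9 days, which matches the original l2arc symptom,
while time_uptime * hz held in 64 bits is good for hundreds of millions of
years.

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
            const int64_t hz = 1000;        /* assumed kern.hz */

            /* 32-bit clock_t ticks: the ~24 day l2arc rollover. */
            double clock32_days = (double)INT32_MAX / hz / 86400.0;

            /* time_uptime * hz in 64 bits: effectively never rolls over. */
            double uptime_years =
                (double)(INT64_MAX / hz) / 86400.0 / 365.25;

            printf("32-bit clock_t rolls over after %.1f days\n",
                clock32_days);
            printf("(time_uptime * hz) is good for about %.0f million years\n",
                uptime_years / 1e6);
            return (0);
    }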