Date:      Thu, 26 May 2011 18:46:37 -0700
From:      David P Discher <dpd@bitgravity.com>
To:        freebsd-fs@freebsd.org
Subject:   ZFS: arc_reclaim_thread running 100%, 8.1-RELEASE, LBOLT related
Message-ID:  <0EFD28CD-F2E9-4AE2-B927-1D327EC99DB9@bitgravity.com>

Hello FS list:

We've been using ZFS v3, storage pool v14 with FreeBSD 8.1-RELEASE with fairly good results for over a year.  We have been moving more and more of our storage to ZFS.  Last week, I believe we hit another issue with LBOLT.

The original issue was first reported by Artem Belevich for l2arc_feed_thread:

 - http://lists.freebsd.org/pipermail/freebsd-fs/2011-January/010558.html

But it also affects the arc_reclaim_thread. The guys over at iX Systems helped out and pointed me to this patch:

 - http://people.freebsd.org/~delphij/misc/218180.diff

which typedefs clock_t to int64_t.

However, the arc_reclaim_thread does not have a ~24-day rollover, because it does not use clock_t.  Instead, I think an integer overflow results in LBOLT going negative after about 106-107 days.  We didn't actually notice it until 112-115 days of uptime; I think it is also related to L1 ARC sizing and load.  Our systems with the ARC set to a min/max of 512M/2G haven't developed the issue - at least not the CPU-hogging thread - but the systems with 12G+ of ARC, and lots of rsync and du activity alongside random reads from the zpool, do develop it.
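
For what it's worth, the 106-107 day figure matches a 64-bit overflow in the LBOLT macro itself: with hz at 1000, gethrtime() * hz overflows a signed 64-bit value once uptime passes INT64_MAX / hz nanoseconds, about 106.75 days.  A throwaway userland sketch of that arithmetic (my own code, not from the kernel; hz = 1000 is an assumption about kern.hz on these boxes):

	#include <stdint.h>
	#include <stdio.h>

	#define NANOSEC 1000000000LL

	int
	main(void)
	{
		int64_t hz = 1000;	/* assumed kern.hz */

		/*
		 * LBOLT is ((gethrtime() * hz) / NANOSEC); the multiply overflows
		 * the signed 64-bit intermediate once gethrtime() exceeds
		 * INT64_MAX / hz nanoseconds of uptime.
		 */
		double days = (double)(INT64_MAX / hz) / NANOSEC / 86400.0;
		printf("LBOLT wraps negative after %.2f days of uptime\n", days);
		return (0);
	}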

The problem is slightly different from, and possibly more harmful than, the l2arc feeder issue seen with LBOLT.

In sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c, the arc_evict() function (around line 1708) has this loop under "evict_start:" to walk the ARC cache list:

	for (ab = list_tail(list); ab; ab = ab_prev) {
		ab_prev = list_prev(list, ab);
		bytes_remaining -= (ab->b_size * ab->b_datacnt);
		/* prefetch buffers have a minimum lifespan */
		if (HDR_IO_IN_PROGRESS(ab) ||
		    (spa && ab->b_spa != spa) ||
		    (ab->b_flags & (ARC_PREFETCH|ARC_INDIRECT) &&
		    LBOLT - ab->b_arc_access < arc_min_prefetch_lifespan)) {
			skipped++;
			continue;
		}


Now, when LBOLT is negative, this loop short-circuits (with some degree of jitter/randomness), resulting in high CPU usage.  The ARC buffers also may not get evicted on time, or possibly at all.  On one system, all processes touching the zpool were waiting in D state and the arc_reclaim_thread was stuck at 100%.  du and rsync seem to aggravate the issue.  On an affected system:

> top -SHb 500 | grep arc_reclaim_thr
   95 root           -8    -     0K    60K arc_re  3 102.9H 96.39% {arc_reclaim_thre}

Conveniently, "skipped++" is surfaced via a sysctl.  Here are two queries to it on this system, 60 seconds apart, with the arc reclaim thread running hot (and a du going at the same time):

  kstat.zfs.misc.arcstats.evict_skip: 4117714520450
  kstat.zfs.misc.arcstats.evict_skip: 4118188257434

> uptime
 3:51PM  up 116 days, 23:48, 3 users, load averages: 1.30, 0.96, 0.64
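
To make the short-circuit concrete, here is a minimal userland sketch of the prefetch-lifespan test from the arc_evict() loop above, using made-up numbers (the 1 * hz default for arc_min_prefetch_lifespan and the tick values are my assumptions, not data from these systems):

	#include <stdint.h>
	#include <stdio.h>

	int
	main(void)
	{
		int64_t hz = 1000;
		int64_t arc_min_prefetch_lifespan = 1 * hz;	/* one second, in ticks */

		/* b_arc_access was stamped while LBOLT was still positive ... */
		int64_t b_arc_access = 9000000000LL;		/* ~104 days worth of ticks */
		/* ... but LBOLT has since wrapped negative */
		int64_t lbolt = -8900000000LL;

		/*
		 * Same comparison as in arc_evict(): the difference is hugely
		 * negative, so it is always less than the one-second lifespan
		 * and the prefetch buffer is skipped instead of being evicted.
		 */
		if (lbolt - b_arc_access < arc_min_prefetch_lifespan)
			printf("skipped++ -- buffer never ages out\n");
		return (0);
	}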

After chatting with someone else well respected in the community, he proposed an alternative fix.  I'm vetting it here to make sure there isn't something deeper in the code that could bite us, and to ask for some clarification.

In ./sys/cddl/compat/opensolaris/sys/time.h, the relevant parts are:

	#define LBOLT   ((gethrtime() * hz) / NANOSEC)
	...

	static __inline hrtime_t
	gethrtime(void) {

		struct timespec ts;
		hrtime_t nsec;

	#if 1
		getnanouptime(&ts);
	#else
		nanouptime(&ts);
	#endif
		nsec = (hrtime_t)ts.tv_sec * NANOSEC + ts.tv_nsec;
		return (nsec);
	}

QUESTION: what units is LBOLT supposed to be in?  If gethrtime() returns nanoseconds, why are the nanoseconds getting multiplied by hz?  If LBOLT is supposed to be clock ticks (which is what arc.c looks like it wants), then it really should be:

	#define LBOLT   ((gethrtime() / NANOSEC) * hz)

But if that is the case, then why make the call to getnanouptime() at all?  If LBOLT is a number of clock ticks, can't this just be a query of the uptime in seconds?  So how about something like this:

	#define LBOLT	(time_uptime * hz)
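
As a quick check of the arithmetic, here is a userland sketch comparing the existing macro, the reordered version, and the time_uptime-based version at a pretend 115 days of uptime (the clock values are stand-ins I made up, not the real kernel interfaces):

	#include <stdint.h>
	#include <stdio.h>

	#define NANOSEC 1000000000LL

	int
	main(void)
	{
		int64_t hz = 1000;
		/* stand-ins for gethrtime() and time_uptime at ~115 days of uptime */
		int64_t uptime_ns = 115LL * 86400 * NANOSEC;
		int64_t time_uptime = uptime_ns / NANOSEC;

		/* existing macro: the 64-bit intermediate has already wrapped */
		int64_t current = (int64_t)((uint64_t)uptime_ns * (uint64_t)hz) / NANOSEC;
		/* divide before multiplying: stays positive, whole-second granularity */
		int64_t reordered = (uptime_ns / NANOSEC) * hz;
		/* proposed: same result without reading the timecounter at all */
		int64_t proposed = time_uptime * hz;

		printf("current:   %jd\n", (intmax_t)current);
		printf("reordered: %jd\n", (intmax_t)reordered);
		printf("proposed:  %jd\n", (intmax_t)proposed);
		return (0);
	}

The current macro comes out as a large negative number once the multiply has wrapped, while the reordered and time_uptime versions agree.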

I've applied this change locally and did a basic stress test with our load generator in the lab, thrashing the ARC (96GB RAM, 48G min/max for the ARC).  It seems to have no ill effects - though we will have to wait about four months before declaring the actual issue here fixed.  I'm hoping to put this into production next week.

All of the above is on 8.1-RELEASE.  ZFS v28 changed some of this, but still didn't improve lbolt (lines 65-66):

 - http://svnweb.freebsd.org/base/head/sys/cddl/compat/opensolaris/sys/time.h?revision=221991&view=markup

	#define ddi_get_lbolt()         ((gethrtime() * hz) / NANOSEC)
	#define ddi_get_lbolt64()       (int64_t)((gethrtime() * hz) / NANOSEC)

It would seem the same optimization could be done here too:

		#define ddi_get_lbolt()         (time_uptime * hz)
		#define ddi_get_lbolt64()       (int64_t)(time_uptime * hz)

Saving the call to getnanouptime() plus a multiply and a divide should be worth a couple hundred cycles here.  I don't claim that would be noticeable, but it seems like a simple, straightforward optimization.

clock_t will still need to be typedef'ed to 64 bits to address the l2arc usage of LBOLT.

Thanks !

---
David P. Discher
dpd@bitgravity.com * AIM: bgDavidDPD
BITGRAVITY * http://www.bitgravity.com



