Date:      Thu, 26 May 2011 21:06:03 -0700
From:      Artem Belevich <art@freebsd.org>
To:        David P Discher <dpd@bitgravity.com>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: ZFS: arc_reclaim_thread running 100%, 8.1-RELEASE, LBOLT related
Message-ID:  <BANLkTikVq0-En7=4Dy_dTf=tM55Cqou_mw@mail.gmail.com>
In-Reply-To: <0EFD28CD-F2E9-4AE2-B927-1D327EC99DB9@bitgravity.com>
References:  <0EFD28CD-F2E9-4AE2-B927-1D327EC99DB9@bitgravity.com>

On Thu, May 26, 2011 at 6:46 PM, David P Discher <dpd@bitgravity.com> wrote:
> Hello FS list:
>
> We've been using ZFS v3, storage pool v14 with FreeBSD 8.1-RELEASE with
> fairly good results for over a year. We have been moving more and more
> of our storage to ZFS. Last week, I believe we hit another issue with
> LBOLT.
>
> The original issue was first reported by Artem Belevich for
> l2arc_feed_thread:
>
>  - http://lists.freebsd.org/pipermail/freebsd-fs/2011-January/010558.html
>
> But this also affects the arc_reclaim_thread. The guys over at
> iXsystems helped out and pointed me to this patch:
>
>  - http://people.freebsd.org/~delphij/misc/218180.diff
>
> which typedefs clock_t to int64_t.
>
> However, the arc_reclaim_thread does not have a ~24 day rollover - it
> does not use clock_t. I think this integer rollover results in LBOLT
> going negative after about 106-107 days. We didn't notice it until
> about 112-115 days of uptime. I think it is also related to L1 ARC
> sizing and load. Our systems with the ARC set to a min/max of 512M/2G
> haven't developed the issue - at least not the CPU-hogging thread - but
> the systems with 12G+ of ARC, and lots of rsync and du activity
> alongside random reads from the zpool, do develop it.
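
A quick userland check of where that ~24 day clock_t rollover comes from
(hz=1000 is an assumption here):

    #include <stdio.h>
    #include <stdint.h>

    int
    main(void)
    {
            const int64_t hz = 1000;        /* assumed kernel HZ */

            /* A 32-bit signed clock_t counting ticks wraps at INT32_MAX
             * ticks, which is what the 218180.diff typedef avoids. */
            printf("32-bit clock_t wraps after %.2f days\n",
                (double)INT32_MAX / hz / 86400);
            return (0);
    }
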
>
> The problem is slightly different, and possibly more harmful, than the
> l2arc feeder issue seen with LBOLT.
>
> In sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c, the
> arc_evict() function, under "evict_start:", has this loop to walk the
> ARC buffer list:
>
>        1708         for (ab = list_tail(list); ab; ab = ab_prev) {
>        1709                 ab_prev = list_prev(list, ab);
>        1710                 bytes_remaining -= (ab->b_size * ab->b_datacnt);
>        1711                 /* prefetch buffers have a minimum lifespan */
>        1712                 if (HDR_IO_IN_PROGRESS(ab) ||
>        1713                     (spa && ab->b_spa != spa) ||
>        1714                     (ab->b_flags & (ARC_PREFETCH|ARC_INDIRECT) &&
>        1715                     LBOLT - ab->b_arc_access < arc_min_prefetch_lifespan)) {
>        1716                         skipped++;
>        1717                         continue;
>        1718                 }
>
>
> Now, when LBOLT is negative, this loop short-circuits - with some
> degree of jitter/randomness - resulting in high CPU usage. Also, the
> ARC buffers may not get evicted on time, or possibly at all. On one
> system I had, all processes touching the zpool were waiting in D state
> and the arc_reclaim_thread was stuck at 100%. du and rsync seem to
> aggravate this issue. On an affected system:
>
>> top -SHb 500 | grep arc_reclaim_thr
>   95 root        -8    -     0K    60K arc_re  3 102.9H 96.39% {arc_reclaim_thre}
>
> Conveniently, "skipped++" is surfaced via a sysctl. Here are two
> queries of it on this system with the arc reclaim thread running hot
> (and a du running at the same time), 60 seconds apart:
>
>  kstat.zfs.misc.arcstats.evict_skip: 4117714520450
>  kstat.zfs.misc.arcstats.evict_skip: 4118188257434
>
>> uptime
>  3:51PM  up 116 days, 23:48, 3 users, load averages: 1.30, 0.96, 0.64
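
That evict_skip delta works out to roughly 7.9 million skipped buffers
per second over the 60-second interval. To make the short-circuit
concrete, here is a minimal userland sketch - the values are
illustrative, not taken from a live system; only the comparison mirrors
the test at arc.c line 1715 - of why a buffer stamped just before the
wrap is never evicted afterwards:

    #include <stdio.h>
    #include <stdint.h>

    int
    main(void)
    {
            const int64_t hz = 1000;                           /* assumed kernel HZ */
            const int64_t arc_min_prefetch_lifespan = 1 * hz;  /* 1 s in ticks */

            int64_t lbolt = -9223372036LL;       /* LBOLT just after the wrap */
            int64_t b_arc_access = 9223372036LL; /* stamped just before it    */

            /* The difference is hugely negative, so the buffer always looks
             * "too young", gets skipped++, and is never evicted. */
            if (lbolt - b_arc_access < arc_min_prefetch_lifespan)
                    printf("skipped++ (buffer never evicted)\n");
            return (0);
    }
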
>
> Someone well respected in the community, whom I chatted with, proposed
> an alternative fix. I'm vetting it here to make sure there isn't
> something deeper in the code that could be bitten by it, and to ask
> for some clarification:
>
> in:
> ./sys/cddl/compat/opensolaris/sys/time.h
>
> the relevant parts are:
>
>         41 #define LBOLT   ((gethrtime() * hz) / NANOSEC)
>         ...
>
>         54 static __inline hrtime_t
>         55 gethrtime(void) {
>         56
>         57         struct timespec ts;
>         58         hrtime_t nsec;
>         59
>         60 #if 1
>         61         getnanouptime(&ts);
>         62 #else
>         63         nanouptime(&ts);
>         64 #endif
>         65         nsec = (hrtime_t)ts.tv_sec * NANOSEC + ts.tv_nsec;

Yup. This would indeed overflow in ~106.75 days.

>         66         return (nsec);
>         67 }
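
A quick userland sanity check of that ~106.75 day figure, assuming
hz=1000 (with hz=100 the wrap would move out to roughly 1067 days):

    #include <stdio.h>
    #include <stdint.h>

    int
    main(void)
    {
            const int64_t NANOSEC = 1000000000;
            const int64_t hz = 1000;        /* assumed kernel HZ */

            /* gethrtime() * hz wraps negative once the uptime in
             * nanoseconds exceeds INT64_MAX / hz. */
            printf("LBOLT goes negative after %.2f days\n",
                (double)(INT64_MAX / hz) / NANOSEC / 86400);
            return (0);
    }
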
>
> QUESTION - what units is LBOLT supposed to be in? If gethrtime() is
> returning nanoseconds, why are the nanoseconds being multiplied by hz?
> If LBOLT is supposed to be in clock ticks (which is what arc.c looks
> like it wants), then it really should be:
>
>         #define LBOLT   ((gethrtime() / NANOSEC) * hz)
>
> But if that is the case, then why make the call to getnanouptime() at
> all? If LBOLT is a number of clock ticks, can't this just be a query
> of the uptime in seconds? So how about something like this:
>
>        #define LBOLT   (time_uptime * hz)

I believe lbolt used to hold the number of ticks on Solaris, though
they switched to a tickless kernel some time back and got rid of lbolt.
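
On the units question above: dimensionally, both forms come out in
ticks (ns * ticks/s divided by ns/s), so the existing macro is not
wrong about units - the difference is only where the intermediate
product can overflow and whether sub-second resolution survives. A
small userland check, assuming hz=1000:

    #include <stdio.h>
    #include <stdint.h>

    int
    main(void)
    {
            const int64_t NANOSEC = 1000000000;
            const int64_t hz = 1000;              /* assumed kernel HZ */
            int64_t ns = 5 * 86400LL * NANOSEC;   /* 5 days of uptime, in ns */

            /* Same tick count either way for whole seconds; only the
             * first form can overflow in the intermediate multiply. */
            printf("(ns * hz) / NANOSEC = %lld ticks\n",
                (long long)((ns * hz) / NANOSEC));
            printf("(ns / NANOSEC) * hz = %lld ticks\n",
                (long long)((ns / NANOSEC) * hz));
            return (0);
    }
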

>
> I've applied this change locally and did a basic stress test with our
> load generator in the lab, thrashing the ARC (96 GB RAM, 48 GB min/max
> for the ARC). It seems to have no ill effects - though we will have to
> wait four months before declaring the actual issue here fixed. I'm
> hoping to put this in production next week.
>
> All of the above is on 8.1-RELEASE. ZFS v28 changed some of this, but
> still didn't improve lbolt:
>
>  - http://svnweb.freebsd.org/base/head/sys/cddl/compat/opensolaris/sys/time.h?revision=221991&view=markup
>
>        65      #define ddi_get_lbolt()         ((gethrtime() * hz) / NANOSEC)
>        66      #define ddi_get_lbolt64()       (int64_t)((gethrtime() * hz) / NANOSEC)
>
> It would seem the same optimization could be done here too:
>
>                #define ddi_get_lbolt()         (time_uptime * hz)
>                #define ddi_get_lbolt64()       (int64_t)(time_uptime * hz)
>
> By saving the call to getnanouptime(), a multiply, and a divide, there
> should be a couple-hundred-cycle performance improvement here. I don't
> claim this would be noticeable, but it seems like a simple,
> straightforward optimization.

The side effect is that it limits lbolt resolution to whole seconds
(time_uptime only advances once per second, so lbolt would jump by hz
ticks at a time) instead of the single-tick resolution lbolt nominally
has - 10ms with HZ=100. Whether that's good enough or too coarse I have
no idea. Perhaps we can compromise and update lbolt in microseconds.
That should give us a few hundred years until the overflow.
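
If "update lbolt in microseconds" means deriving the tick count from a
microsecond-resolution uptime rather than a nanosecond one, a sketch
could look like the following. This is illustrative only, not a
committed patch: ddi_get_lbolt64_usec() is a made-up name, and whether
getmicrouptime()'s tick-grained precision is acceptable is part of the
same open question about resolution.

    #ifndef MICROSEC
    #define MICROSEC        1000000
    #endif

    /*
     * Ticks derived from a microsecond uptime.  The intermediate
     * multiply only overflows after INT64_MAX / (hz * MICROSEC) seconds
     * of uptime - roughly 292 years at hz=1000, ~2900 years at hz=100.
     */
    static __inline int64_t
    ddi_get_lbolt64_usec(void)
    {
            struct timeval tv;

            getmicrouptime(&tv);    /* cheap, tick-resolution uptime */
            return ((((int64_t)tv.tv_sec * MICROSEC + tv.tv_usec) * hz) /
                MICROSEC);
    }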

--Artem

>
> clock_t will still need to be typedef'ed to 64-bit to address the
> l2arc usage of LBOLT.
>
> Thanks !
>
> ---
> David P. Discher
> dpd@bitgravity.com * AIM: bgDavidDPD
> BITGRAVITY * http://www.bitgravity.com
>
> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
>


