Date:      Mon, 30 May 2011 22:15:37 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Artem Belevich <art@FreeBSD.org>
Cc:        freebsd-fs@FreeBSD.org
Subject:   Re: ZFS: arc_reclaim_thread running 100%, 8.1-RELEASE, LBOLT related
Message-ID:  <20110530210943.Y3265@besplex.bde.org>
In-Reply-To: <BANLkTikVq0-En7=4Dy_dTf=tM55Cqou_mw@mail.gmail.com>
References:  <0EFD28CD-F2E9-4AE2-B927-1D327EC99DB9@bitgravity.com> <BANLkTikVq0-En7=4Dy_dTf=tM55Cqou_mw@mail.gmail.com>


On Thu, 26 May 2011, Artem Belevich wrote:

> On Thu, May 26, 2011 at 6:46 PM, David P Discher <dpd@bitgravity.com> wrote:
>> Hello FS list:
>>
>> We've been using ZFS v3, storage pool v14 with FreeBSD 8.1-RELEASE
>> with fairly good results for over a year.  We have been moving more
>> and more of our storage to ZFS.  Last week, I believe we hit another
>> issue with LBOLT.
>>
>> The original issue was first reported by Artem Belevich for
>> l2arc_feed_thread:
>>
>>  - http://lists.freebsd.org/pipermail/freebsd-fs/2011-January/010558.html
>>
>> But this affects the arc_reclaim_thread as well. The guys over at
>> iX Systems helped out and pointed me to this patch:
>>
>>  - http://people.freebsd.org/~delphij/misc/218180.diff
>>
>> which typedefs clock_t to int64_t.

I think that patch is unusable and didn't get used.  clock_t is a
(bogusly) machine-dependent system type that cannot be declared in a
cddl header (except possibly to hide bugs by making it different from
the system type in some places only).

>> However, the arc_reclaim_thread does not have a ~24 day rollover - it
>> does not use clock_t.  I think this integer rollover results in LBOLT
>> going negative after about 106-107 days.  We didn't notice it until
>> 112-115 days of uptime.  I think it is also related to L1 ARC sizing
>> and load.  Our systems with the ARC set to a min-max of 512M/2G
>> haven't developed the issue - at least not the CPU-hogging thread -
>> but the systems with 12G+ of ARC, and lots of rsync and du activity
>> alongside random reads from the zpool, develop the issue.
>>
>> The problem is slightly different, and possibly more harmful, than
>> the l2arc feeder issue seen with LBOLT.
>>
>> In sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c, the
>> arc_evict() function, under "evict_start:", has this loop to walk the
>> ARC cache page list:

[unreadable non-text with hard \xa0's deleted]

>> Now, when LBOLT is negative, with some degree of jitter/randomness,
>> this loop short-circuits, resulting in high CPU usage.  Also, the ARC
>> buffers may not get evicted on time, or possibly at all.  On one
>> system I had, all processes touching the zpool were waiting in
>> D-state, and the arc_reclaim_thread was stuck at 100%.  du and rsync
>> seem to aggravate this issue.  On an affected system:

>> ...
>> Someone else well respected in the community proposed an alternative
>> fix.  I'm vetting it here to make sure there isn't something deeper
>> in the code that could bite us, and to get some clarification:
>>
>> in:
>> ./sys/cddl/compat/opensolaris/sys/time.h
>>
>> the relevant parts are:
>>
>>         41 #define LBOLT   ((gethrtime() * hz) / NANOSEC)

Might work if gethrtime() is a 64-bit type, but as you pointed out, it is
almost perfectly pessimized.

>>         ...
>>
>>         54 static __inline hrtime_t

Not clear what its type is.

>>         55 gethrtime(void) {
>>         56
>>         57         struct timespec ts;
>>         58         hrtime_t nsec;
>>         59
>>         60 #if 1
>>         61         getnanouptime(&ts);
>>         62 #else
>>         63         nanouptime(&ts);
>>         64 #endif

The ifdef prevents more perfect pessimization -- we will scale to hz ticks,
so we don't want the extra accuracy and overhead of using nanouptime().

>>         65         nsec = (hrtime_t)ts.tv_sec * NANOSEC + ts.tv_nsec;
>
> Yup. This would indeed overflow in ~106.75 days.

Apparently hrtime_t is 64 bits signed, and hz = 1000.  64 bits unsigned
would overflow after twice as long (after a further multiplication by hz
in LBOLT).  Use an hz much larger or much smaller than 1000 to change the
overflow point.

>> QUESTION - what units is LBOLT supposed to be?  If gethrtime() is
>> returning nanoseconds, why are nanoseconds getting multiplied by hz?
>> If LBOLT is supposed to be clock ticks (which is what arc.c looks
>> like it wants) then it really should be:
>>
>>         #define LBOLT   ((gethrtime() / NANOSEC) * hz)
>>
>> But if that is the case, then why make the call to getnanouptime() at
>> all?  If LBOLT is a number of clock ticks, then can't this just be a
>> query of the uptime in seconds?  So how about something like this:
>>
>>         #define LBOLT   (time_uptime * hz)

This is similar to FreeBSD's `ticks', and overflows at a similar point:
- `ticks' is int, and int is int32_t on all supported arches, so `ticks'
   overflows at 2**31 / hz on all supported arches.  This is 248 days
   if hz is correctly configured as 100.  The default misconfiguration
   of hz = 1000 gives overflow after 24.8 days.
- time_uptime is time_t, so the above overflows at TIME_T_MAX / hz.
   - on i386 and powerpc, time_t is int32_t, so the above overflows at
     the same point as does `ticks' on these arches.
   - on all other arches, time_t is int64_t, so the above overflows
     after 2**32 times as long.

> I believe lbolt used to hold number of ticks on solaris, though they
> switched to tickless kernel some time back and got rid of lbolt.

Yes, LBOLT in solaris seems to correspond to `ticks' in FreeBSD, except
it might not overflow like `ticks' does.  (`lbolt' in FreeBSD was a dummy
sleep address with value always 0.)

>> I've applied this change locally, and did a basic stress test with
>> our load generator in the lab, thrashing the ARC cache (96GB RAM,
>> 48G min/max for ARC).  It seems to have no ill effects - though we
>> will have to wait 4 months before declaring the actual issue fixed.
>> I'm hoping to put this in production next week.

Won't work on i386.

>> It would seem the same optimization could be done here too:
>>
>>                #define ddi_get_lbolt()         (time_uptime * hz)
>>                #define ddi_get_lbolt64()       (int64_t)(time_uptime * hz)

To fix the overflow it has to be:

#define ddi_get_lbolt64()       ((int64_t)time_uptime * hz)

>> By saving the call to getnanouptime(), a multiply, and a divide,
>> there should be a couple-hundred-cycle performance improvement here.
>> I don't claim this would be noticeable, but it seems like a simple,
>> straightforward optimization.

It should only be a performance improvement of a couple of tens of
cycles.  getnanouptime() is very fast.

> The side effect is that it limits lbolt resolution to hz units. With
> HZ=100, that will be 10ms. Whether it's good enough or too coarse I
> have no idea. Perhaps we can compromise and update lbolt in
> microseconds. That should give us a few hundred years until the
> overflow.

No, it actually limits the resolution to seconds units, and that seems
too coarse.  Using getnanouptime() already limited the resolution to
hz/tc_tick inverse-units (tc_tick = 1 for hz <= ~1000, but above that
getnanouptime() provides less than hz inverse-units).

`ticks' would be the right thing to use if it didn't overflow.

Some networking code (e.g., in tcp_output.c) still uses `ticks', and at
least used to have bugs from this use when `ticks' overflowed.  Blindly
increasing hz of course made the bugs more frequent.

Bruce