Date:      Tue, 31 May 2011 18:24:01 +0300
From:      Andriy Gapon <avg@FreeBSD.org>
To:        Artem Belevich <art@FreeBSD.org>, David P Discher <dpd@bitgravity.com>
Cc:        freebsd-fs@FreeBSD.org
Subject:   Re: ZFS: arc_reclaim_thread running 100%, 8.1-RELEASE, LBOLT related
Message-ID:  <4DE50811.5060606@FreeBSD.org>
In-Reply-To: <BANLkTikVq0-En7=4Dy_dTf=tM55Cqou_mw@mail.gmail.com>
References:  <0EFD28CD-F2E9-4AE2-B927-1D327EC99DB9@bitgravity.com> <BANLkTikVq0-En7=4Dy_dTf=tM55Cqou_mw@mail.gmail.com>


Artem and David,

I am replying to both of you, so the meaning of 'you' below depends on the quoting context.

on 27/05/2011 07:06 Artem Belevich said the following:
> On Thu, May 26, 2011 at 6:46 PM, David P Discher <dpd@bitgravity.com> wrote:
[snip]
>> However, the arc_reclaim_thread does not have a ~24 day rollover - it does not use clock_t.  I think this rollover in the integer results in LBOLT going negative after about 106-107 days.  We didn't notice this until about 112-115 days of uptime.  I think it is also related to L1 ARC sizing and load.  Our systems with the ARC set to a min/max of 512M/2G haven't developed the issue - at least not the CPU-hogging thread - but the systems with 12G+ of ARC, and lots of rsync and du activity alongside random reads from the zpool, do develop the issue.
>>
>> The problem is slightly different, and possibly more harmful, than the l2arc feeder issue seen with LBOLT.
>>
>> in sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c, the arc_evict() function, under "evict_start:", has this loop to walk the ARC cache page list:
>>
>>        1708         for (ab = list_tail(list); ab; ab = ab_prev) {
>>        1709                 ab_prev = list_prev(list, ab);
>>        1710                 bytes_remaining -= (ab->b_size * ab->b_datacnt);
>>        1711                 /* prefetch buffers have a minimum lifespan */
>>        1712                 if (HDR_IO_IN_PROGRESS(ab) ||
>>        1713                     (spa && ab->b_spa != spa) ||
>>        1714                     (ab->b_flags & (ARC_PREFETCH|ARC_INDIRECT) &&
>>        1715                     LBOLT - ab->b_arc_access < arc_min_prefetch_lifespan)) {
>>        1716                         skipped++;
>>        1717                         continue;
>>        1718                 }
>>
>>
>> Now, when LBOLT is negative, with some degree of jitter/randomness, this loop short-circuits, resulting in high CPU usage.  The ARC buffers may also not get evicted on time, or possibly at all.  On one system I had, all processes touching the zpool were waiting in D-state and the arc_reclaim_thread was stuck at 100%.  du and rsync seem to aggravate this issue.  On an affected system:
>>
>>> top -SHb 500 | grep arc_reclaim_thr
>>   95 root           -8    -     0K    60K arc_re  3 102.9H 96.39% {arc_reclaim_thre}
>>
>> Conveniently, "skipped++" is surfaced via a sysctl.  Here are two queries to it on this system with the arc reclaim thread running hot (a du was going on at the same time), 60 seconds apart:
>>
>>  kstat.zfs.misc.arcstats.evict_skip: 4117714520450
>>  kstat.zfs.misc.arcstats.evict_skip: 4118188257434
>>
>>> uptime
>>  3:51PM  up 116 days, 23:48, 3 users, load averages: 1.30, 0.96, 0.64
>>
>> After chatting with someone else well respected in the community, he proposed an alternative fix.  I'm vetting it here to make sure there isn't something deeper in the code that could bite us, and to ask for some clarification:
>>
>> in:
>> ./sys/cddl/compat/opensolaris/sys/time.h
>>
>> the relevant parts are:
>>
>>         41 #define LBOLT   ((gethrtime() * hz) / NANOSEC)
>>         ...
>>         54 static __inline hrtime_t
>>         55 gethrtime(void) {
>>         56
>>         57         struct timespec ts;
>>         58         hrtime_t nsec;
>>         59
>>         60 #if 1
>>         61         getnanouptime(&ts);
>>         62 #else
>>         63         nanouptime(&ts);
>>         64 #endif
>>         65         nsec = (hrtime_t)ts.tv_sec * NANOSEC + ts.tv_nsec;
> 
> Yup. This would indeed overflow in ~106.75 days.

Were you referring to the LBOLT definition above?
gethrtime() itself should take several hundred years to overflow.

>>         66         return (nsec);
>>         67 }
>>
>> QUESTION - what units is LBOLT supposed to be in?  If gethrtime() returns nanoseconds, why are the nanoseconds getting multiplied by hz?  If LBOLT is supposed to be clock ticks (which is what arc.c looks like it wants), then it really should be:

It should be ticks, no doubt.

>>         #define LBOLT   ( (gethrtime() / NANOSEC) * hz )

This definitely is a bad idea: division by NANOSEC would limit the resolution to 1
second.  The value defined above would jump straight from N * hz to (N + 1) * hz.
Please see below.
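
To illustrate with concrete numbers (mine, assuming hz = 1000): at an uptime of 1.5
seconds gethrtime() returns 1500000000, so

	(gethrtime() * hz) / NANOSEC  =  (1500000000 * 1000) / 1000000000  =  1500 ticks
	(gethrtime() / NANOSEC) * hz  =  (1500000000 / 1000000000) * 1000  =  1000 ticks

i.e. the proposed form stays at 1000 until the two-second mark and then jumps
straight to 2000.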

>> But if that is the case, then why call getnanouptime() at all?  If LBOLT is a number of clock ticks, can't this just be a query of the uptime in seconds?  So how about something like this:
>>
>>        #define LBOLT   (time_uptime * hz)

Exactly the same problem as above.

BTW, we already have a 'ticks' variable that should be more or less equivalent to
the old Solaris LBOLT, except that ticks is an int while LBOLT was a clock_t.
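
Just to spell out the overflow (my arithmetic): ticks is a signed 32-bit counter
incremented hz times per second, so it wraps after

	2^31 / hz  =  2147483648 / 1000  ~=  2147483 seconds  ~=  24.8 days  (at hz = 1000)

which is presumably the ~24 day rollover mentioned above for clock_t consumers.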

> I believe lbolt used to hold the number of ticks on Solaris, though they
> switched to a tickless kernel some time back and got rid of lbolt.

Yes and (kind of) no.
They still use ddi_get_lbolt and ddi_get_lbolt64 where lbolt was previously used,
since it was easier to keep thinking in 'ticks' than to rewrite the code more
radically:
http://fxr.watson.org/fxr/source/common/os/sunddi.c?v=OPENSOLARIS#L5634

Both ddi_get_lbolt and ddi_get_lbolt64 just call through the lbolt_hybrid function pointer.
There is some interesting reading about it here:
http://fxr.watson.org/fxr/source/common/os/clock.c?v=OPENSOLARIS#L244

Long story short, when in "event driven mode" they do the following:
	ts = gethrtime();
	ASSERT(ts > 0);
	ASSERT(nsec_per_tick > 0);
	lb = (ts/nsec_per_tick);


>> I've applied this change locally and did a basic stress test with our load generator in the lab, thrashing the ARC (96GB RAM, 48G min/max for the ARC).  It seems to have no ill effects - though we will have to wait 4 months before declaring the actual issue here fixed.  I'm hoping to put this in production next week.
>>
>> All of the above is on 8.1-RELEASE.  ZFS v28 changed some of this, but still didn't improve lbolt:
>>
>>  - http://svnweb.freebsd.org/base/head/sys/cddl/compat/opensolaris/sys/time.h?revision=221991&view=markup
>>
>>        65      #define ddi_get_lbolt()         ((gethrtime() * hz) / NANOSEC)

Please note that unlike OpenSolaris, where ddi_get_lbolt is a function with a return
type of clock_t, we have a macro, so the resulting type of the above expression will
be the same as the return type of gethrtime(), that is int64_t, because it is the
widest type in the expression.
To be entirely compatible with Solaris we would need a clock_t cast here (but our
clock_t is defined differently from theirs).
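
A tiny illustration (my own snippet, not code from the tree), assuming our 32-bit
clock_t:

	hrtime_t lb   = (gethrtime() * hz) / NANOSEC;	/* macro as-is: the result is int64_t */
	clock_t  lb32 = (clock_t)lb;			/* Solaris-style cast: truncated to
							   32 bits with our clock_t */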

>>        66      #define ddi_get_lbolt64()       (int64_t)((gethrtime() * hz) / NANOSEC)

Actually, since the expression is already int64_t as described above, the cast in
the second definition shouldn't affect the generated machine code.

>> It would seem the same optimization could be done here too:
>>
>>                #define ddi_get_lbolt()         (time_uptime * hz)
>>                #define ddi_get_lbolt64()       (int64_t)(time_uptime * hz)

Already discussed above.

>> By saving the call to getnanouptime(), a multiply and a divide, there should be a performance improvement of a couple hundred cycles here.  I don't claim this would be noticeable, but it seems like a simple, straightforward optimization.
> 
> The side effect is that it limits bolt resolution to hz units. With
> HZ=100, that will be 10ms. Whether it's good enough or too coarse I

Nope, I think you did your math wrong here.
As shown above, it limits the resolution to hz ticks, i.e. hz * (1/hz) seconds = 1 second :)

> have no idea. Perhaps we can compromise and update lbolt in
> microseconds. That should give us few hundred years until the
> overflow.

Well, we could use the ticks variable, since we are not switching to tickless
mode yet, but we would need to make it 64-bit to avoid early overflows.
Or, perhaps, to be somewhat future-friendly, we could do approximately what the
OpenSolaris [upstream :-)] code does:

gethrtime() / (NANOSEC / hz)

Or, given that NANOSEC is a constant and hz is invariant, we could apply the
extended invariant-division-by-multiplication approach to get a precise and fast
result without overflowing.  But that is likely overkill here.  We definitely
should pre-calculate, store and re-use the (NANOSEC / hz) value just like
OpenSolaris does, though.
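
Roughly something like this (just a sketch of the idea; nsec_per_tick is my name
for a variable that would be initialized to NANOSEC / hz once hz is known):

	extern hrtime_t nsec_per_tick;		/* = NANOSEC / hz, set up once at boot */

	static __inline int64_t
	ddi_get_lbolt64(void)
	{

		return (gethrtime() / nsec_per_tick);
	}

	static __inline clock_t
	ddi_get_lbolt(void)
	{

		return ((clock_t)ddi_get_lbolt64());
	}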

>> clock_t will still need to be typedef'ed to 64-bit to address the l2arc usage of LBOLT.

Hm, I see that OpenSolaris defines clock_t as long, while we use int32_t.
So I think that means two things:
- i386 OpenSolaris (if it exists) should be affected by the problem as well
- we should not use our native clock_t definition for ported OpenSolaris code

Maybe we should fix our clock_t to be something wider, at least on 64-bit
platforms.  But I am not prepared to discuss that here.
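
For reference (quoting from memory, so double-check the headers), the definitions
are approximately:

	typedef long		clock_t;	/* OpenSolaris: 32 bits on i386, 64 bits on amd64 */
	typedef int32_t		clock_t;	/* ours, via __clock_t: 32 bits everywhere */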

To summarize:
1. We must fix ddi_get_lbolt*() to avoid the unnecessary overflow in the multiplication.
2. We could fix the 31-bit clock_t overflow by using the Solaris-compatible definition of
it (long), but that still won't help on i386.  Maybe we should bring this issue to the
attention of the upstream ZFS developers.  Universally using ddi_get_lbolt64() seems
like a safer bet.

-- 
Andriy Gapon


