From owner-freebsd-fs@FreeBSD.ORG Fri May 27 04:06:04 2011
From: Artem Belevich
To: David P Discher
Cc: freebsd-fs@freebsd.org
Date: Thu, 26 May 2011 21:06:03 -0700
Subject: Re: ZFS: arc_reclaim_thread running 100%, 8.1-RELEASE, LBOLT related

On Thu, May 26, 2011 at 6:46 PM, David P Discher wrote:
> Hello FS list:
>
> We've been using ZFS v3, storage pool v14 with FreeBSD 8.1-RELEASE with
> fairly good results for over a year.  We have been moving more and more
> of our storage to ZFS.  Last week, I believe we hit another issue with
> LBOLT.
>
> The original issue was first reported by Artem Belevich for
> l2arc_feed_thread:
>
>  - http://lists.freebsd.org/pipermail/freebsd-fs/2011-January/010558.html
>
> But this affects the arc_reclaim_thread as well.  The guys over at
> iXsystems helped out and pointed me to this patch:
>
>  - http://people.freebsd.org/~delphij/misc/218180.diff
>
> which typedefs clock_t to int64_t.
>
> However, the arc_reclaim_thread does not have a ~24 day rollover - it
> does not use clock_t.  I think a rollover in the integer math results in
> LBOLT going negative after about 106-107 days.  We didn't actually
> notice it until 112-115 days of uptime.  I think it is also related to
> L1 ARC sizing and load.
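For reference, the ~24 day figure is what you get from ticks stored in a
32-bit clock_t (which is exactly what the 218180.diff typedef widens),
assuming the default hz=1000 - an assumption, but one that matches the day
counts quoted in this thread.  A quick back-of-the-envelope check in
ordinary userland C, just to make the arithmetic concrete:

        #include <stdint.h>
        #include <stdio.h>

        int
        main(void)
        {
                const int64_t hz = 1000;    /* assumed kern.hz (8.x default) */

                /*
                 * A tick count kept in a signed 32-bit clock_t goes
                 * negative once it passes INT32_MAX ticks.
                 */
                double wrap_days = (double)INT32_MAX / hz / 86400.0;
                printf("32-bit tick counter wraps after %.2f days\n",
                    wrap_days);             /* prints ~24.86 */
                return (0);
        }

Widening clock_t to int64_t pushes that wrap far beyond any plausible
uptime, but as noted above it does nothing for code paths that never go
through clock_t, which is where the 106-107 day number comes from.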
> Our systems with the ARC min/max set to 512M/2G haven't developed the
> issue - at least not the CPU-hogging thread - but the systems with 12G+
> of ARC and lots of rsync and du activity alongside random reads from the
> zpool do develop it.
>
> The problem is slightly different from, and possibly more harmful than,
> the l2arc feeder issue seen with LBOLT.
>
> In sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c, the arc_evict()
> function has this loop under "evict_start:" to walk the list of ARC
> buffers:
>
>       1708         for (ab = list_tail(list); ab; ab = ab_prev) {
>       1709                 ab_prev = list_prev(list, ab);
>       1710                 bytes_remaining -= (ab->b_size * ab->b_datacnt);
>       1711                 /* prefetch buffers have a minimum lifespan */
>       1712                 if (HDR_IO_IN_PROGRESS(ab) ||
>       1713                     (spa && ab->b_spa != spa) ||
>       1714                     (ab->b_flags & (ARC_PREFETCH|ARC_INDIRECT) &&
>       1715                     LBOLT - ab->b_arc_access < arc_min_prefetch_lifespan)) {
>       1716                         skipped++;
>       1717                         continue;
>       1718                 }
>
> Now, when LBOLT is negative, this loop short-circuits (with some degree
> of jitter/randomness), resulting in high CPU usage.  ARC buffers may
> also not get evicted on time, or possibly at all.  On one system, all
> processes touching the zpool were waiting in D state and the
> arc_reclaim_thread was stuck at 100%.  du and rsync seem to aggravate
> the issue.  On an affected system:
>
>> top -SHb 500 | grep arc_reclaim_thr
>   95 root           -8    -     0K    60K arc_re  3 102.9H 96.39% {arc_reclaim_thre}
>
> Conveniently, "skipped++" is surfaced via a sysctl; here are two queries
> on this system with the arc reclaim thread running hot (a du was running
> at the same time), 60 seconds apart:
>
>  kstat.zfs.misc.arcstats.evict_skip: 4117714520450
>  kstat.zfs.misc.arcstats.evict_skip: 4118188257434
>
>> uptime
>  3:51PM  up 116 days, 23:48, 3 users, load averages: 1.30, 0.96, 0.64
>
> After chatting with someone else well respected in the community, who
> proposed an alternative fix, I'm vetting it here to make sure there
> isn't something deeper in the code that could be bitten by it, and to
> get some clarification.
>
> In ./sys/cddl/compat/opensolaris/sys/time.h the relevant parts are:
>
>         41 #define LBOLT   ((gethrtime() * hz) / NANOSEC)
>         ...
>
>         54 static __inline hrtime_t
>         55 gethrtime(void) {
>         56
>         57         struct timespec ts;
>         58         hrtime_t nsec;
>         59
>         60 #if 1
>         61         getnanouptime(&ts);
>         62 #else
>         63         nanouptime(&ts);
>         64 #endif
>         65         nsec = (hrtime_t)ts.tv_sec * NANOSEC + ts.tv_nsec;
>         66         return (nsec);
>         67 }

Yup. This would indeed overflow in ~106.75 days.
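To make that ~106.75-day figure concrete: gethrtime() here returns
nanoseconds of uptime, and the LBOLT macro multiplies by hz before
dividing by NANOSEC, so the intermediate product blows through INT64_MAX
long before the nanosecond counter itself would.  A standalone check in
ordinary userland C (hz=1000 is again an assumption, but it matches the
106-107 days reported above):

        #include <stdint.h>
        #include <stdio.h>

        #define NANOSEC 1000000000LL

        int
        main(void)
        {
                const int64_t hz = 1000;    /* assumed kern.hz */

                /* Uptime at which gethrtime() * hz no longer fits in int64_t. */
                double limit_days = (double)INT64_MAX / hz / NANOSEC / 86400.0;
                printf("(gethrtime() * hz) overflows after %.2f days\n",
                    limit_days);            /* prints ~106.75 */

                /*
                 * One nanosecond past that limit the product wraps negative.
                 * Unsigned arithmetic is used here so the demo itself avoids
                 * C's signed-overflow undefined behaviour; the kernel's
                 * int64_t math wraps the same way in practice.
                 */
                uint64_t uptime_ns = (uint64_t)(INT64_MAX / hz) + 1;
                int64_t lbolt = (int64_t)(uptime_ns * (uint64_t)hz) / NANOSEC;
                printf("LBOLT just past the limit: %lld\n", (long long)lbolt);
                return (0);
        }

Once LBOLT is negative, LBOLT - ab->b_arc_access in the arc_evict() loop
quoted earlier is hugely negative too, so any buffer with the
prefetch/indirect flags set looks permanently younger than
arc_min_prefetch_lifespan, gets skipped, and evict_skip climbs the way
the kstat numbers above show.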
> QUESTION - what units is LBOLT supposed to be in?  If gethrtime() is
> returning nanoseconds, why are nanoseconds getting multiplied by hz?
> If LBOLT is supposed to be clock ticks (which is what arc.c looks like
> it wants), then it really should be:
>
>         #define LBOLT   ((gethrtime() / NANOSEC) * hz)
>
> But if that is the case, then why make the call to getnanouptime() at
> all?  If LBOLT is a number of clock ticks, can't this just be a query
> of the uptime in seconds?  So how about something like this:
>
>         #define LBOLT   (time_uptime * hz)

I believe lbolt used to hold the number of ticks on Solaris, though they
switched to a tickless kernel some time back and got rid of lbolt.

> I've applied this change locally and did a basic stress test with our
> load generator in the lab, thrashing the ARC (96GB RAM, 48G min/max for
> ARC).  It seems to have no ill effects - though we will have to wait
> four months before declaring the actual issue here fixed.  I'm hoping
> to put this in production next week.
>
> All of the above is on 8.1-RELEASE.  ZFS v28 changed some of this, but
> still didn't improve lbolt:
>
>  - http://svnweb.freebsd.org/base/head/sys/cddl/compat/opensolaris/sys/time.h?revision=221991&view=markup
>
>        65      #define ddi_get_lbolt()         ((gethrtime() * hz) / NANOSEC)
>        66      #define ddi_get_lbolt64()       (int64_t)((gethrtime() * hz) / NANOSEC)
>
> It would seem the same optimization could be done here too:
>
>               #define ddi_get_lbolt()         (time_uptime * hz)
>               #define ddi_get_lbolt64()       (int64_t)(time_uptime * hz)
>
> By saving the call to getnanouptime(), a multiply, and a divide, there
> should be a performance improvement of a couple hundred cycles here.  I
> don't claim this would be noticeable, but it seems like a simple,
> straightforward optimization.

The side effect is that it limits lbolt resolution to hz units. With
HZ=100, that will be 10ms. Whether it's good enough or too coarse I
have no idea.

Perhaps we can compromise and update lbolt in microseconds. That should
give us a few hundred years until the overflow.

--Artem

> clock_t will still need to be typedef'ed to 64-bit to address the l2arc
> usage of LBOLT.
>
> Thanks!
>
> ---
> David P. Discher
> dpd@bitgravity.com * AIM: bgDavidDPD
> BITGRAVITY * http://www.bitgravity.com
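For completeness, a minimal sketch of what the time_uptime-based proposal
above might look like as a compat-header fragment.  It is untested; the
kernel globals are declared here only so the fragment stands alone, and
the int64_t cast before the multiply is an extra precaution for platforms
with a 32-bit time_t, not something proposed in this thread:

        #include <sys/types.h>
        #include <stdint.h>

        extern time_t   time_uptime;    /* seconds of uptime, kernel global */
        extern int      hz;             /* ticks per second, kernel global */

        /* 8.1-era sys/cddl/compat/opensolaris/sys/time.h style: */
        #define LBOLT                   ((int64_t)time_uptime * hz)

        /* ...and the v28 equivalents: */
        #define ddi_get_lbolt()         ((int64_t)time_uptime * hz)
        #define ddi_get_lbolt64()       ((int64_t)time_uptime * hz)

        /*
         * clock_t itself still wants the int64_t typedef from 218180.diff,
         * or the l2arc code that stores LBOLT in clock_t keeps truncating.
         */

Note the resolution caveat raised above still applies to this form, since
time_uptime only advances once per second.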