From owner-freebsd-fs@FreeBSD.ORG Fri May 27 04:06:04 2011
From: Artem Belevich
To: David P Discher
Cc: freebsd-fs@freebsd.org
Date: Thu, 26 May 2011 21:06:03 -0700
Subject: Re: ZFS: arc_reclaim_thread running 100%, 8.1-RELEASE, LBOLT related

On Thu, May 26, 2011 at 6:46 PM, David P Discher wrote:
> Hello FS list:
>
> We've been using ZFS v3, storage pool v14 with FreeBSD 8.1-RELEASE with
> fairly good results for over a year.  We have been moving more and more
> of our storage to ZFS.  Last week, I believe we hit another issue with
> LBOLT.
>
> The original issue was first reported by Artem Belevich for
> l2arc_feed_thread:
>
>  - http://lists.freebsd.org/pipermail/freebsd-fs/2011-January/010558.html
>
> But this affects the arc_reclaim_thread as well.  The guys over at
> iXsystems helped out and pointed me to this patch:
>
>  - http://people.freebsd.org/~delphij/misc/218180.diff
>
> which typedefs clock_t to int64_t.
>
> However, the arc_reclaim_thread does not have a ~24 day rollover - it
> does not use clock_t.  I think a rollover in the integer math results in
> LBOLT going negative after about 106-107 days.  We didn't actually
> notice it until 112-115 days of uptime.  I think it is also related to
> L1 ARC sizing and load.
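For reference, the ~24 day figure is what you get from ticks stored in a
32-bit clock_t (which is exactly what the 218180.diff typedef widens),
assuming the default hz=1000 - an assumption, but one that matches the day
counts quoted in this thread.  A quick back-of-the-envelope check in
ordinary userland C, just to make the arithmetic concrete:

        #include <stdint.h>
        #include <stdio.h>

        int
        main(void)
        {
                const int64_t hz = 1000;    /* assumed kern.hz (8.x default) */

                /*
                 * A tick count kept in a signed 32-bit clock_t goes
                 * negative once it passes INT32_MAX ticks.
                 */
                double wrap_days = (double)INT32_MAX / hz / 86400.0;
                printf("32-bit tick counter wraps after %.2f days\n",
                    wrap_days);             /* prints ~24.86 */
                return (0);
        }

Widening clock_t to int64_t pushes that wrap far beyond any plausible
uptime, but as noted above it does nothing for code paths that never go
through clock_t, which is where the 106-107 day number comes from.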
> Our systems with the ARC min/max set to 512M/2G haven't developed the
> issue - at least not the CPU-hogging thread - but the systems with 12G+
> of ARC and lots of rsync and du activity alongside random reads from the
> zpool do develop it.
>
> The problem is slightly different from, and possibly more harmful than,
> the l2arc feeder issue seen with LBOLT.
>
> In sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c, the arc_evict()
> function has this loop under "evict_start:" to walk the list of ARC
> buffers:
>
>       1708         for (ab = list_tail(list); ab; ab = ab_prev) {
>       1709                 ab_prev = list_prev(list, ab);
>       1710                 bytes_remaining -= (ab->b_size * ab->b_datacnt);
>       1711                 /* prefetch buffers have a minimum lifespan */
>       1712                 if (HDR_IO_IN_PROGRESS(ab) ||
>       1713                     (spa && ab->b_spa != spa) ||
>       1714                     (ab->b_flags & (ARC_PREFETCH|ARC_INDIRECT) &&
>       1715                     LBOLT - ab->b_arc_access < arc_min_prefetch_lifespan)) {
>       1716                         skipped++;
>       1717                         continue;
>       1718                 }
>
> Now, when LBOLT is negative, this loop short-circuits (with some degree
> of jitter/randomness), resulting in high CPU usage.  ARC buffers may
> also not get evicted on time, or possibly at all.  On one system, all
> processes touching the zpool were waiting in D state and the
> arc_reclaim_thread was stuck at 100%.  du and rsync seem to aggravate
> the issue.  On an affected system:
>
>> top -SHb 500 | grep arc_reclaim_thr
>   95 root           -8    -     0K    60K arc_re  3 102.9H 96.39% {arc_reclaim_thre}
>
> Conveniently, "skipped++" is surfaced via a sysctl; here are two queries
> on this system with the arc reclaim thread running hot (a du was running
> at the same time), 60 seconds apart:
>
>  kstat.zfs.misc.arcstats.evict_skip: 4117714520450
>  kstat.zfs.misc.arcstats.evict_skip: 4118188257434
>
>> uptime
>  3:51PM  up 116 days, 23:48, 3 users, load averages: 1.30, 0.96, 0.64
>
> After chatting with someone else well respected in the community, who
> proposed an alternative fix, I'm vetting it here to make sure there
> isn't something deeper in the code that could be bitten by it, and to
> get some clarification.
>
> In ./sys/cddl/compat/opensolaris/sys/time.h the relevant parts are:
>
>         41 #define LBOLT   ((gethrtime() * hz) / NANOSEC)
>         ...
>
>         54 static __inline hrtime_t
>         55 gethrtime(void) {
>         56
>         57         struct timespec ts;
>         58         hrtime_t nsec;
>         59
>         60 #if 1
>         61         getnanouptime(&ts);
>         62 #else
>         63         nanouptime(&ts);
>         64 #endif
>         65         nsec = (hrtime_t)ts.tv_sec * NANOSEC + ts.tv_nsec;
>         66         return (nsec);
>         67 }

Yup. This would indeed overflow in ~106.75 days.
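To make that ~106.75-day figure concrete: gethrtime() here returns
nanoseconds of uptime, and the LBOLT macro multiplies by hz before
dividing by NANOSEC, so the intermediate product blows through INT64_MAX
long before the nanosecond counter itself would.  A standalone check in
ordinary userland C (hz=1000 is again an assumption, but it matches the
106-107 days reported above):

        #include <stdint.h>
        #include <stdio.h>

        #define NANOSEC 1000000000LL

        int
        main(void)
        {
                const int64_t hz = 1000;    /* assumed kern.hz */

                /* Uptime at which gethrtime() * hz no longer fits in int64_t. */
                double limit_days = (double)INT64_MAX / hz / NANOSEC / 86400.0;
                printf("(gethrtime() * hz) overflows after %.2f days\n",
                    limit_days);            /* prints ~106.75 */

                /*
                 * One nanosecond past that limit the product wraps negative.
                 * Unsigned arithmetic is used here so the demo itself avoids
                 * C's signed-overflow undefined behaviour; the kernel's
                 * int64_t math wraps the same way in practice.
                 */
                uint64_t uptime_ns = (uint64_t)(INT64_MAX / hz) + 1;
                int64_t lbolt = (int64_t)(uptime_ns * (uint64_t)hz) / NANOSEC;
                printf("LBOLT just past the limit: %lld\n", (long long)lbolt);
                return (0);
        }

Once LBOLT is negative, LBOLT - ab->b_arc_access in the arc_evict() loop
quoted earlier is hugely negative too, so any buffer with the
prefetch/indirect flags set looks permanently younger than
arc_min_prefetch_lifespan, gets skipped, and evict_skip climbs the way
the kstat numbers above show.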
> QUESTION - what units is LBOLT supposed to be in?  If gethrtime() is
> returning nanoseconds, why are nanoseconds getting multiplied by hz?
> If LBOLT is supposed to be clock ticks (which is what arc.c looks like
> it wants), then it really should be:
>
>         #define LBOLT   ((gethrtime() / NANOSEC) * hz)
>
> But if that is the case, then why make the call to getnanouptime() at
> all?  If LBOLT is a number of clock ticks, can't this just be a query
> of the uptime in seconds?  So how about something like this:
>
>         #define LBOLT   (time_uptime * hz)

I believe lbolt used to hold the number of ticks on Solaris, though they
switched to a tickless kernel some time back and got rid of lbolt.

> I've applied this change locally and did a basic stress test with our
> load generator in the lab, thrashing the ARC (96GB RAM, 48G min/max for
> ARC).  It seems to have no ill effects - though we will have to wait
> four months before declaring the actual issue here fixed.  I'm hoping
> to put this in production next week.
>
> All of the above is on 8.1-RELEASE.  ZFS v28 changed some of this, but
> still didn't improve lbolt:
>
>  - http://svnweb.freebsd.org/base/head/sys/cddl/compat/opensolaris/sys/time.h?revision=221991&view=markup
>
>        65      #define ddi_get_lbolt()         ((gethrtime() * hz) / NANOSEC)
>        66      #define ddi_get_lbolt64()       (int64_t)((gethrtime() * hz) / NANOSEC)
>
> It would seem the same optimization could be done here too:
>
>               #define ddi_get_lbolt()         (time_uptime * hz)
>               #define ddi_get_lbolt64()       (int64_t)(time_uptime * hz)
>
> By saving the call to getnanouptime(), a multiply, and a divide, there
> should be a performance improvement of a couple hundred cycles here.  I
> don't claim this would be noticeable, but it seems like a simple,
> straightforward optimization.

The side effect is that it limits lbolt resolution to hz units. With
HZ=100, that will be 10ms. Whether it's good enough or too coarse I
have no idea.

Perhaps we can compromise and update lbolt in microseconds. That should
give us a few hundred years until the overflow.

--Artem

> clock_t will still need to be typedef'ed to 64-bit to address the l2arc
> usage of LBOLT.
>
> Thanks!
>
> ---
> David P. Discher
> dpd@bitgravity.com * AIM: bgDavidDPD
> BITGRAVITY * http://www.bitgravity.com
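For completeness, a minimal sketch of what the time_uptime-based proposal
above might look like as a compat-header fragment.  It is untested; the
kernel globals are declared here only so the fragment stands alone, and
the int64_t cast before the multiply is an extra precaution for platforms
with a 32-bit time_t, not something proposed in this thread:

        #include <sys/types.h>
        #include <stdint.h>

        extern time_t   time_uptime;    /* seconds of uptime, kernel global */
        extern int      hz;             /* ticks per second, kernel global */

        /* 8.1-era sys/cddl/compat/opensolaris/sys/time.h style: */
        #define LBOLT                   ((int64_t)time_uptime * hz)

        /* ...and the v28 equivalents: */
        #define ddi_get_lbolt()         ((int64_t)time_uptime * hz)
        #define ddi_get_lbolt64()       ((int64_t)time_uptime * hz)

        /*
         * clock_t itself still wants the int64_t typedef from 218180.diff,
         * or the l2arc code that stores LBOLT in clock_t keeps truncating.
         */

Note the resolution caveat raised above still applies to this form, since
time_uptime only advances once per second.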