From: David P Discher <dpd@bitgravity.com>
To: freebsd-fs@freebsd.org
Date: Thu, 26 May 2011 18:46:37 -0700
Subject: ZFS: arc_reclaim_thread running 100%, 8.1-RELEASE, LBOLT related

Hello FS list:

We've been using ZFS v3, storage pool v14, on FreeBSD 8.1-RELEASE with fairly
good results for over a year, and we have been moving more and more of our
storage to ZFS.  Last week I believe we hit another issue with LBOLT.  The
original issue was first reported by Artem Belevich for l2arc_feed_thread:

 - http://lists.freebsd.org/pipermail/freebsd-fs/2011-January/010558.html

but it also affects arc_reclaim_thread.  The guys over at iX Systems helped
out and pointed me to this patch:

 - http://people.freebsd.org/~delphij/misc/218180.diff

which typedefs clock_t to int64_t.

However, arc_reclaim_thread does not have a ~24 day rollover - it does not
use clock_t.  Instead, I think an integer overflow makes LBOLT itself go
negative after about 106-107 days.  We didn't notice anything until 112-115
days of uptime, and it also seems related to L1 ARC sizing and load: our
systems with ARC min/max set to 512M/2G haven't developed the issue - at
least not the CPU-hogging thread - but the systems with 12G+ of ARC, and lots
of rsync and du activity alongside random reads from the zpool, do develop it.
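For what it's worth, the 106-107 day figure falls straight out of the macro.
Here is a rough userland sketch (my own arithmetic, not code from the tree;
hz = 1000 is assumed) showing when the multiply inside LBOLT overflows a
signed 64-bit hrtime_t:

    #include <stdint.h>
    #include <stdio.h>

    /*
     * LBOLT is ((gethrtime() * hz) / NANOSEC).  gethrtime() returns
     * nanoseconds of uptime in a signed 64-bit hrtime_t, so the
     * intermediate product nsec * hz blows past INT64_MAX long before
     * the division is applied.  Compute the uptime at which that happens.
     */
    int
    main(void)
    {
            const int64_t hz = 1000;                /* assumed kern.hz */
            const int64_t nanosec = 1000000000LL;   /* NANOSEC */

            /* Largest nanosecond count whose product with hz still fits. */
            int64_t max_nsec = INT64_MAX / hz;
            double days = (double)max_nsec / nanosec / 86400.0;

            printf("LBOLT goes negative after %.2f days of uptime\n", days);
            return (0);
    }

With hz = 1000 that prints 106.75 days, which lines up with the 106-107 day
estimate above.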
The problem is slightly different, and possibly more harmful, than the l2arc
feeder issue seen with LBOLT.

In sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c, the arc_evict()
function, under "evict_start:", has this loop to walk the ARC page list:

    for (ab = list_tail(list); ab; ab = ab_prev) {
            ab_prev = list_prev(list, ab);
            bytes_remaining -= (ab->b_size * ab->b_datacnt);
            /* prefetch buffers have a minimum lifespan */
            if (HDR_IO_IN_PROGRESS(ab) ||
                (spa && ab->b_spa != spa) ||
                (ab->b_flags & (ARC_PREFETCH|ARC_INDIRECT) &&
                LBOLT - ab->b_arc_access < arc_min_prefetch_lifespan)) {
                    skipped++;
                    continue;
            }

Now, when LBOLT is negative, with some degree of jitter/randomness, this
check short-circuits the loop - LBOLT - ab->b_arc_access is hugely negative,
so the minimum-lifespan test passes and the buffer is skipped instead of
evicted - resulting in high CPU usage.  The ARC buffers may also not get
evicted on time, or possibly at all.  On one system all processes touching
the zpool were stuck waiting in D state, and arc_reclaim_thread was pegged at
100%.  du and rsync seem to aggravate the issue.

On an affected system:

    > top -SHb 500 | grep arc_reclaim_thr
      95 root    -8    -    0K    60K arc_re  3 102.9H 96.39% {arc_reclaim_thre}

Conveniently, "skipped++" is surfaced via a sysctl.  Here are two queries on
this system with the ARC reclaim thread running hot (a du was running at the
same time), 60 seconds apart:

    kstat.zfs.misc.arcstats.evict_skip: 4117714520450
    kstat.zfs.misc.arcstats.evict_skip: 4118188257434

    > uptime
     3:51PM  up 116 days, 23:48, 3 users, load averages: 1.30, 0.96, 0.64

After chatting with someone else well respected in the community, an
alternative fix was proposed.  I'm vetting it here to make sure there isn't
something deeper in the code that could bite us, and to ask for some
clarification.

In ./sys/cddl/compat/opensolaris/sys/time.h the relevant parts are:

    #define LBOLT   ((gethrtime() * hz) / NANOSEC)
    ...
    static __inline hrtime_t
    gethrtime(void) {

            struct timespec ts;
            hrtime_t nsec;

    #if 1
            getnanouptime(&ts);
    #else
            nanouptime(&ts);
    #endif
            nsec = (hrtime_t)ts.tv_sec * NANOSEC + ts.tv_nsec;
            return (nsec);
    }

QUESTION - what units is LBOLT supposed to be in?  If gethrtime() returns
nanoseconds, why are the nanoseconds multiplied by hz?  If LBOLT is supposed
to be clock ticks (which is what arc.c looks like it wants), then it really
should be:

    #define LBOLT   ((gethrtime() / NANOSEC) * hz)

But if that is the case, why make the call to getnanouptime() at all?  If
LBOLT is a number of clock ticks, can't this just be a query of the uptime in
seconds?  So how about something like this:

    #define LBOLT   (time_uptime * hz)

I've applied this change locally and did a basic stress test with our load
generator in the lab, thrashing the ARC (96 GB RAM, 48G min/max for ARC).  It
seems to have no ill effects - though we will have to wait about four months
before declaring the actual issue fixed.  I'm hoping to put this in
production next week.

All of the above is on 8.1-RELEASE.
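As a sanity check on the units question above, I ran roughly the following
userland harness (my own test code, not from the kernel; hz = 1000 is
assumed, and the overflowing multiply is done on unsigned integers so the
program itself stays well-defined while still wrapping the way the kernel's
signed arithmetic does in practice).  All three forms agree on the tick count
until the multiply-first form wraps negative, somewhere around day 107:

    #include <stdint.h>
    #include <stdio.h>

    #define NANOSEC 1000000000LL

    /* Simulated uptimes, in seconds: 1, 100, 110 and 120 days. */
    static const int64_t uptimes[] = {
            86400, 100 * 86400, 110 * 86400, 120 * 86400
    };

    int
    main(void)
    {
            const int64_t hz = 1000;        /* assumed kern.hz */
            unsigned int i;

            for (i = 0; i < sizeof(uptimes) / sizeof(uptimes[0]); i++) {
                    int64_t secs = uptimes[i];        /* time_uptime */
                    int64_t nsec = secs * NANOSEC;    /* gethrtime() */

                    /* Current macro: multiply first, then divide. */
                    int64_t lbolt_old =
                        (int64_t)((uint64_t)nsec * (uint64_t)hz) / NANOSEC;
                    /* Divide first, then multiply. */
                    int64_t lbolt_div = (nsec / NANOSEC) * hz;
                    /* Proposed form based on time_uptime. */
                    int64_t lbolt_new = secs * hz;

                    printf("day %3lld: old=%lld div-first=%lld uptime=%lld\n",
                        (long long)(secs / 86400), (long long)lbolt_old,
                        (long long)lbolt_div, (long long)lbolt_new);
            }
            return (0);
    }

At day 110 the first column has already gone negative while the other two are
still counting ticks, which is exactly the behavior arc_evict() trips over.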
ZFS v28 changed some of this, but still didn't improve lbolt:

 - http://svnweb.freebsd.org/base/head/sys/cddl/compat/opensolaris/sys/time.h?revision=221991&view=markup

    #define ddi_get_lbolt()         ((gethrtime() * hz) / NANOSEC)
    #define ddi_get_lbolt64()       (int64_t)((gethrtime() * hz) / NANOSEC)

It would seem the same optimization could be done here too:

    #define ddi_get_lbolt()         (time_uptime * hz)
    #define ddi_get_lbolt64()       (int64_t)(time_uptime * hz)

By saving the call to getnanouptime(), a multiply, and a divide, there should
be a couple-hundred-cycle performance improvement here.  I don't claim it
would be noticeable, but it seems like a simple, straightforward optimization.

clock_t will still need to be typedef'ed to 64 bits to address the l2arc
usage of LBOLT.

Thanks !

---
David P. Discher
dpd@bitgravity.com * AIM: bgDavidDPD
BITGRAVITY * http://www.bitgravity.com
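P.S. A back-of-the-envelope check of the rollover horizons (again my own
sketch, with hz = 1000 assumed, not code from the tree): a 32-bit clock_t
rolls over after roughly 24.9 days, which matches the original l2arc symptom,
while time_uptime * hz held in 64 bits is good for hundreds of millions of
years.

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
            const int64_t hz = 1000;        /* assumed kern.hz */

            /* 32-bit clock_t ticks: the ~24 day l2arc rollover. */
            double clock32_days = (double)INT32_MAX / hz / 86400.0;

            /* time_uptime * hz in 64 bits: effectively never rolls over. */
            double uptime_years =
                (double)(INT64_MAX / hz) / 86400.0 / 365.25;

            printf("32-bit clock_t rolls over after %.1f days\n",
                clock32_days);
            printf("(time_uptime * hz) is good for about %.0f million years\n",
                uptime_years / 1e6);
            return (0);
    }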