Date: Mon, 30 May 2011 22:15:37 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Artem Belevich <art@FreeBSD.org>
Cc: freebsd-fs@FreeBSD.org
Subject: Re: ZFS: arc_reclaim_thread running 100%, 8.1-RELEASE, LBOLT related
Message-ID: <20110530210943.Y3265@besplex.bde.org>
In-Reply-To: <BANLkTikVq0-En7=4Dy_dTf=tM55Cqou_mw@mail.gmail.com>
References: <0EFD28CD-F2E9-4AE2-B927-1D327EC99DB9@bitgravity.com> <BANLkTikVq0-En7=4Dy_dTf=tM55Cqou_mw@mail.gmail.com>
On Thu, 26 May 2011, Artem Belevich wrote:

> On Thu, May 26, 2011 at 6:46 PM, David P Discher <dpd@bitgravity.com> wrote:
>> Hello FS list:
>>
>> We've been using ZFS v3, storage pool v14 with FreeBSD 8.1-RELEASE with
>> fairly good results for over a year.  We have been moving more and more
>> of our storage to ZFS.  Last week, I believe we hit another issue with
>> LBOLT.
>>
>> The original issue was first reported by Artem Belevich for
>> l2arc_feed_thread:
>>
>>  - http://lists.freebsd.org/pipermail/freebsd-fs/2011-January/010558.html
>>
>> But this also affects the arc_reclaim_thread.  The guys over at iX
>> Systems helped out and pointed me to this patch:
>>
>>  - http://people.freebsd.org/~delphij/misc/218180.diff
>>
>> which typedefs clock_t to int64_t.

I think that patch is unusable and didn't get used.  clock_t is a
(bogusly) machine-dependent system type that cannot be declared in a
cddl header (except possibly to hide bugs by making it different from
the system type in some places only).

>> However, the arc_reclaim_thread does not have a ~24 day rollover - it
>> does not use clock_t.  I think this rollover in the integer results in
>> LBOLT going negative after about 106-107 days.  We didn't notice this
>> until actually 112-115 days of uptime.  I think it is also related to
>> L1 ARC sizing and load.  Our systems with the arc set to a min/max of
>> 512M/2G haven't developed the issue - at least the CPU-hogging thread -
>> but the systems with 12G+ of ARC, and lots of rsync and du activity
>> alongside random reads from the zpool, develop the issue.
>>
>> The problem is slightly different, and possibly more harmful, than the
>> l2arc feeder issue seen with LBOLT.
>>
>> In sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c, the
>> arc_evict() function, under "evict_start:", has this loop to walk the
>> arc cache page list:

[unreadable non-text with hard \xa0's deleted]

>> Now, when LBOLT is negative, with some degree of jitter/randomness,
>> this loop short-circuits, resulting in high CPU usage.  Also the ARC
>> buffers may not get evicted on time, or possibly at all.  On one system
>> I had, all processes touching the zpool were waiting in D-state, and
>> the arc_reclaim_thread was stuck at 100%.  du and rsync seem to
>> aggravate this issue.  On an affected system:
>> ...
>> After chatting with someone else well respected in the community, he
>> proposed an alternative fix.  I'm vetting it here to make sure there
>> isn't something deeper in the code that we could get bitten by, as
>> well as asking for some clarification.
>>
>> In:
>> ./sys/cddl/compat/opensolaris/sys/time.h
>>
>> the relevant parts are:
>>
>>         41 #define LBOLT   ((gethrtime() * hz) / NANOSEC)

Might work if gethrtime() is a 64-bit type, but as you pointed out,
it is almost perfectly pessimized.
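To make the overflow arithmetic concrete, here is a minimal userland
sketch of my own (not kernel code), assuming hz = 1000 and a signed
64-bit hrtime_t: multiplying by hz before dividing by NANOSEC wraps the
intermediate product after about 106.75 days of uptime, matching the
106-107 day onset reported above, while dividing first pushes the wrap
out by a factor of NANOSEC.

#include <stdint.h>
#include <stdio.h>

#define	NANOSEC	1000000000LL

int
main(void)
{
	int64_t hz = 1000;	/* assumed; the default (mis)configuration */

	/*
	 * LBOLT = (gethrtime() * hz) / NANOSEC: the intermediate product
	 * gethrtime() * hz exceeds INT64_MAX once the nanosecond uptime
	 * passes 2^63 / hz, i.e. after 2^63 / (hz * NANOSEC) seconds.
	 */
	int64_t mul_first = INT64_MAX / (hz * NANOSEC);
	printf("multiply first: wraps after %jd s = %.2f days\n",
	    (intmax_t)mul_first, mul_first / 86400.0);

	/*
	 * (gethrtime() / NANOSEC) * hz: the intermediate value is the
	 * uptime in seconds, so the wrap moves out to 2^63 / hz seconds.
	 */
	int64_t div_first = INT64_MAX / hz;
	printf("divide first:   wraps after %jd s = %.0f years\n",
	    (intmax_t)div_first, div_first / 86400.0 / 365.25);
	return (0);
}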
>>         ...
>>
>>         54 static __inline hrtime_t

Not clear what its type is.

>>         55 gethrtime(void) {
>>         56
>>         57         struct timespec ts;
>>         58         hrtime_t nsec;
>>         59
>>         60 #if 1
>>         61         getnanouptime(&ts);
>>         62 #else
>>         63         nanouptime(&ts);
>>         64 #endif

The ifdef prevents more perfect pessimization -- we will scale to hz
ticks, so we don't want the extra accuracy and overheads from using
nanouptime().

>>         65         nsec = (hrtime_t)ts.tv_sec * NANOSEC + ts.tv_nsec;
>
> Yup. This would indeed overflow in ~106.75 days.

Apparently hrtime_t is 64 bits signed, and hz = 1000.  64 bits
unsigned would overflow after twice as long (after a further
multiplication by hz in LBOLT).  Use hz much larger or much smaller
than 1000 to change the overflow point.

>> QUESTION - what units is LBOLT supposed to be?  If gethrtime() is
>> returning nanoseconds, why are nanoseconds getting multiplied by hz?
>> If LBOLT is supposed to be clock-ticks (which is what arc.c looks like
>> it wants it in) then it really should be:
>>
>>         #define LBOLT   ((gethrtime() / NANOSEC) * hz)
>>
>> But if that is the case, then why make the call to getnanouptime() at
>> all?  If LBOLT is the number of clock ticks, then can't this just be a
>> query of the uptime in seconds?  So how about something like this:
>>
>>        #define LBOLT   (time_uptime * hz)

This is similar to FreeBSD's `ticks', and overflows at a similar point:
- `ticks' is int, and int is int32_t on all supported arches, so
  `ticks' overflows at 2**31 / hz on all supported arches.  This was
  248 days if hz is correctly configured as 100.  The default
  misconfiguration of hz = 1000 gives overflow after 24.8 days.
- time_uptime is time_t, so the above overflows at TIME_T_MAX / hz.
- on i386 and powerpc, time_t is int32_t, so the above overflows at
  the same point as does `ticks' on these arches.
- on all other arches, time_t is int64_t, so the above overflows
  after 2**32 times as long.

> I believe lbolt used to hold number of ticks on solaris, though they
> switched to tickless kernel some time back and got rid of lbolt.

Yes, LBOLT in solaris seems to correspond to `ticks' in FreeBSD,
except it might not overflow like `ticks' does.  (`lbolt' in FreeBSD
was a dummy sleep address with value always 0.)

>> I've applied this change locally, and did a basic stress test with our
>> load generator in the lab, thrashing the arc cache (96GB RAM, 48G
>> min/max for ARC).  It seems to have no ill effects - though we will
>> have to wait 4 months before declaring the actual issue here fixed.
>> I'm hoping to put this in production next week.

Won't work on i386.

>> It would seem the same optimization could be done here too:
>>
>>                #define ddi_get_lbolt()         (time_uptime * hz)
>>                #define ddi_get_lbolt64()       (int64_t)(time_uptime * hz)

To fix the overflow it has to be:

#define ddi_get_lbolt64()	((int64_t)time_uptime * hz)

>> With saving the call to getnanouptime(), a multiply and a divide,
>> there should be a couple hundred cycle performance improvement here.
>> I don't claim this would be noticeable, but it seems like a simple,
>> straightforward optimization.

Should only be a couple of ten cycle performance improvement.
getnanouptime() is very fast.
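The cast placement is the whole fix.  A hypothetical userland
demonstration, simulating an i386-like 32-bit time_t with int32_t (the
wrapped product relies on the usual two's-complement behaviour, since
signed overflow is formally undefined):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	int32_t time_uptime = 2150000;	/* ~24.9 days, just past 2^31 / hz */
	int32_t hz = 1000;

	/*
	 * Casting the already-computed product only widens a value that
	 * has wrapped in 32-bit arithmetic: 2150000 * 1000 > INT32_MAX.
	 */
	int64_t broken = (int64_t)(time_uptime * hz);

	/* Casting an operand first does the multiplication in 64 bits. */
	int64_t fixed = (int64_t)time_uptime * hz;

	printf("(int64_t)(time_uptime * hz) = %jd\n", (intmax_t)broken);
	printf("(int64_t)time_uptime * hz   = %jd\n", (intmax_t)fixed);
	return (0);
}

The first form prints a negative value (-2144967296); the second prints
the intended 2150000000.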
> The side effect is that it limits lbolt resolution to hz units.  With
> HZ=100, that will be 10ms.  Whether it's good enough or too coarse I
> have no idea.  Perhaps we can compromise and update lbolt in
> microseconds.  That should give us a few hundred years until the
> overflow.

No, it actually limits the resolution to seconds units, and that seems
too coarse.  Using getnanouptime() already limited the resolution to
hz/tc_tick inverse-units (tc_tick = 1 for hz <= ~1000, but above that
getnanouptime() provides less than hz inverse-units).  `ticks' would be
the right thing to use if it didn't overflow.

Some networking code (e.g., in tcp_output.c) still uses `ticks', and
at least used to have bugs from this use when `ticks' overflowed.
Blindly increasing hz of course made the bugs more frequent.
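For reference, the shape of the bugs being alluded to: a sketch of my
own (not code from tcp_output.c) of a timeout test that misfires once
`ticks' wraps past 2^31, next to the modular-arithmetic idiom that
survives the wrap:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	/* A deadline armed just before the wrap, checked just after it. */
	int32_t deadline = INT32_MAX - 5;
	int32_t now = INT32_MIN + 10;	/* 16 ticks later, across the wrap */

	/* Naive test: the wrapped `now' compares as far in the past. */
	printf("naive: expired = %d\n", now > deadline);	/* 0: wrong */

	/*
	 * Wrap-safe idiom: subtract modulo 2^32 (in unsigned arithmetic,
	 * to avoid undefined signed overflow) and test the sign of the
	 * difference.  Valid while the values are < 2^31 ticks apart.
	 */
	int32_t delta = (int32_t)((uint32_t)now - (uint32_t)deadline);
	printf("safe:  expired = %d\n", delta > 0);	/* 1: correct */
	return (0);
}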
Bruce