Subject: Re: ZFS "stalls" -- and maybe we should be talking about defaults?
From: Dennis Glatting <freebsd@pki2.com>
To: Karl Denninger <karl@denninger.net>
In-Reply-To: <513524B2.6020600@denninger.net>
References: <513524B2.6020600@denninger.net>
Content-Type: text/plain; charset="ISO-8859-1"
Date: Mon, 04 Mar 2013 18:07:46 -0800
Message-ID: <1362449266.92708.8.camel@btw.pki2.com>
Cc: freebsd-stable@freebsd.org

I get stalls with 256GB of RAM and arc_max=64G (my limit is usually 25%)
on a 64-core system with 20 new 3TB Seagate disks under LSI2008 chips,
without much load. Interestingly, pbzip2 consistently creates a problem
on a volume whereas gzip does not.
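
For anyone who wants to reproduce the 25% figure, here is a rough
sketch of the arithmetic in Python. The helper names and the default
fraction are illustrative assumptions; only the vfs.zfs.arc_max tunable
and the 256GB / 64G numbers come from this thread.

```python
GIB = 1024 ** 3  # bytes per GiB


def arc_max_bytes(ram_gib: int, fraction: float = 0.25) -> int:
    """Return an ARC ceiling in bytes as a fraction of installed RAM.

    The 25% default mirrors the rule of thumb mentioned above; it is
    an assumption, not a recommendation.
    """
    return int(ram_gib * GIB * fraction)


def loader_conf_line(ram_gib: int, fraction: float = 0.25) -> str:
    """Format the result as a name="value" line for /boot/loader.conf."""
    return f'vfs.zfs.arc_max="{arc_max_bytes(ram_gib, fraction)}"'


# 256 GiB of RAM at 25% gives a 64 GiB ceiling.
print(loader_conf_line(256))
```

The same helper applied to a 12GB machine gives a 3 GiB ceiling.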

Here, stalls happen across several systems; however, I have had fewer
problems under 8.3 than under 9.1. If I go to hardware RAID5 (LSI2008 --
same chips: IR vs IT firmware) I don't have a problem.


On Mon, 2013-03-04 at 16:48 -0600, Karl Denninger wrote:
> Well now this is interesting.
> 
> I have converted a significant number of filesystems to ZFS over the
> last week or so and have noted a few things.  A couple of them aren't so
> good.
> 
> The subject machine in question has 12GB of RAM and dual Xeon
> 5500-series processors.  It also has an ARECA 1680ix in it with 2GB of
> local cache and the BBU for it.  The ZFS spindles are all exported as
> JBOD drives.  I set up four disks under GPT, each with a single
> freebsd-zfs partition that is labeled; the providers are then
> geli-encrypted and added to the pool.  When the same disks were running
> on UFS filesystems they were set up as a 0+1 RAID array under the ARECA
> adapter, exported as a single unit, GPT labeled as a single pack and
> then gpart-sliced and newfs'd under UFS+SU.
> 
> Since I previously ran UFS filesystems on this config I know what
> performance level I achieved with that, and the entire system had been
> running flawlessly set up that way for the last couple of years.
> Presently the machine is running 9.1-Stable, r244942M.
> 
> Immediately after the conversion I set up a second pool to play with
> backup strategies to a single drive and ran into a problem.  The disk I
> used for that testing is one that previously was in the rotation and is
> also known good.  I began to get EXTENDED stalls with zero I/O going on,
> some lasting for 30 seconds or so.  The system was not frozen but
> anything that touched I/O would lock until it cleared.  Dedup is off,
> incidentally.
> 
> My first thought was that I had a bad drive, cable or other physical
> problem.  However, searching for that proved fruitless -- there was
> nothing being logged anywhere -- not in the SMART data, not by the
> adapter, not by the OS.  Nothing.  Sticking a digital storage scope on
> the +5V and +12V rails didn't disclose anything interesting with the
> power in the chassis; it's stable.  Further, swapping the only disk that
> had changed (the new backup volume) with a different one didn't change
> behavior either.
> 
> The last straw was when I was able to reproduce the stalls WITHIN the
> original pool against the same four disks that had been running
> flawlessly for two years under UFS, and still couldn't find any evidence
> of a hardware problem (not even ECC-corrected data returns.)  All the
> disks involved are completely clean -- zero sector reassignments, the
> drive-specific log is clean, etc.
> 
> Attempting to cut back the ARECA adapter's aggressiveness (buffering,
> etc) on the theory that I was tickling something in its cache management
> algorithm that was pissing it off proved fruitless as well, even when I
> shut off ALL caching and NCQ options.  I also set
> vfs.zfs.prefetch_disable=1 to no effect.  Hmmmm...
> 
> Last night after reading the ZFS Tuning wiki for FreeBSD I went on a
> lark and limited the ARC cache to 2GB (vfs.zfs.arc_max=2000000000), set
> vfs.zfs.write_limit_override to 1024000000 (1GB) and rebooted.
> 
> The problem instantly disappeared and I cannot provoke its return even
> with multiple full-bore snapshot and rsync filesystem copies running
> while a scrub is being done.
> 
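
As a small sketch of the two settings quoted above: the values are
decimal byte counts, so "2GB" and "1GB" come out slightly under 2 GiB
and 1 GiB. The snippet assumes both tunables go into /boot/loader.conf
(the reboot suggests that, but the message does not say so), and the
function names are made up for illustration.

```python
GIB = 1024 ** 3  # bytes per GiB

# Values exactly as quoted in the message above.
TUNABLES = {
    "vfs.zfs.arc_max": 2000000000,
    "vfs.zfs.write_limit_override": 1024000000,
}


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return n_bytes / GIB


def render(tunables: dict) -> list:
    """Format each tunable as a name="value" line (loader.conf style)."""
    return [f'{name}="{value}"' for name, value in tunables.items()]


# Print the candidate lines, then the sizes they correspond to.
for line in render(TUNABLES):
    print(line)
for name, value in TUNABLES.items():
    print(f"{name}: about {to_gib(value):.2f} GiB")
```
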
> I'm pinging between being I/O and processor (geli) limited now in normal
> operation and slamming the I/O channel during a scrub.  It appears that
> performance is roughly equivalent to, maybe a bit less than, what it was
> with UFS+SU -- but it's fairly close.
> 
> The operating theory I have at the moment is that the ARC cache was in
> some way getting into a near-deadlock situation with other memory
> demands on the system (there IS a Postgres server running on this
> hardware although it's a replication server and not taking queries --
> nonetheless it does grab a chunk of RAM) leading to the stalls. 
> Limiting its grab of RAM appears to have resolved the contention
> issue.  I was unable to catch it actually running out of free memory
> although it was consistently into the low five-digit free page count and
> the kernel never garfed on the console about resource exhaustion --
> other than a bitch about swap stalling (the infamous "more than 20
> seconds" message.)  Page space in use near the time in question (I could
> not get a display while locked as it went to I/O and froze) was not
> zero, but pretty close to it (a few thousand blocks.)  That the system
> was driven into light paging does appear to be significant and
> indicative of some sort of memory contention issue as under operation
> with UFS filesystems this machine has never been observed to allocate
> page space.
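
To put the "low five-digit free page count" above into more familiar
units, here is a quick conversion sketch. It assumes the usual 4096-byte
base page on amd64 and reads "low five-digit" as roughly 10,000 to
30,000 pages; both are assumptions, since the message gives no exact
numbers.

```python
PAGE_SIZE = 4096  # bytes; assumed amd64 base page size


def pages_to_mib(pages: int, page_size: int = PAGE_SIZE) -> float:
    """Convert a free-page count to MiB."""
    return pages * page_size / (1024 ** 2)


# Assumed sample counts in the "low five-digit" range.
for pages in (10_000, 20_000, 30_000):
    print(f"{pages} free pages is about {pages_to_mib(pages):.0f} MiB")
```

On a 12GB machine that is on the order of tens of MiB free, which fits
the description of the system being pushed into light paging.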
> 
> Anyone seen anything like this before and if so.... is this a case of
> bad defaults or some bad behavior between various kernel memory
> allocation contention sources?
> 
> This isn't exactly a resource-constrained machine running x64 code with
> 12GB of RAM and two quad-core processors in it!
>