From owner-freebsd-fs@FreeBSD.ORG Thu Jul 21 17:08:04 2011
From: Luiz Otavio O Souza <lists.br@gmail.com>
Date: Thu, 21 Jul 2011 13:38:50 -0300
To: Ivan Voras
Cc: freebsd-fs@freebsd.org
Subject: Re: ZFS and large directories - caveat report
Message-Id: <13577F3E-DE59-44F4-98F7-9587E26499B8@gmail.com>
List-Id: Filesystems

On Jul 21, 2011, at 12:45 PM, Ivan Voras wrote:

> I'm writing this mostly for future reference / archiving, and also in
> case someone has an idea on how to improve the situation.
>
> A web server I maintain was hit by a DoS attack which caused more than
> 4 million PHP session files to be created. The session files are
> sharded into 32 directories in a single level - which is normally more
> than enough for this web server, as the number of users is only a
> couple of thousand. With the DoS, the number of files per shard
> directory rose to about 130,000.
>
> The problem is: ZFS has proven horribly inefficient with such large
> directories. I have other, more loaded servers with similarly bad /
> large directories on UFS where the problem is not nearly as serious as
> here (probably thanks to the large dirhash). On this system, any
> operation which touches even just the parent of these 32 shards (e.g.
> "ls") takes seconds, and a simple "find | wc -l" on one of the shards
> takes more than 30 minutes (I stopped it after 30 minutes). Another
> symptom is that SIGINT-ing such a find process takes 10-15 seconds to
> take effect (sic! - this likely means the kernel operation cannot be
> interrupted for that long).
>
> This wouldn't be a problem by itself, but operations on such
> directories eat IOPS - clearly visible with the "find" test case -
> making the rest of the services on the server fall as collateral
> damage. Apparently there is a huge amount of seeking being done, even
> though I would think that for read operations all the data would be
> cached - and somehow the seeking from this operation takes priority
> over / livelocks other operations on the same ZFS pool.
>
> This is on a fresh 8-STABLE AMD64, pool version 28 and zfs version 5.
>
> Is there an equivalent of the UFS dirhash memory setting for ZFS?
> (i.e. the size of the metadata cache)

Hello Ivan,

I've had similar problems on a client's server that needs to store a
large number of files.

I have 4,194,303 (0x3fffff) files created on the FS (unused files are
pre-created with zero size - a precaution from the UFS days to avoid
the 'no more free inodes on FS' problem).

And I just break the files up like mybasedir/3f/ff/ff, so under no
circumstance do I have a 'big amount of files' in a single directory.
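In sh, the mapping is roughly the following (a minimal sketch;
'id_to_path' is just an illustrative name, not the actual production
script):

    #!/bin/sh
    # Map a numeric file id to its sharded path: two hex digits per
    # directory level, so no single directory ever holds more than
    # 256 entries.
    id_to_path() {
        hex=$(printf '%06x' "$1")
        echo "mybasedir/$(echo "$hex" | sed 's|\(..\)\(..\)\(..\)|\1/\2/\3|')"
    }

    id_to_path 4194303        # prints: mybasedir/3f/ff/ff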
The general usage on this server is fine, but the periodic (daily)
scripts take almost a day to complete, and the server is slow as hell
while the daily scripts are running. All I need to do is kill 'find'
to get the machine back to normal.

I didn't stop to look at it in detail, but from the little I checked,
it looks like stat() calls take a long time on ZFS files.

Previously we had this running on UFS with a database of 16,777,215
(0xffffff) files without any kind of trouble (I've reduced the
database size to keep the run time of the daily scripts under
control).

The periodic script is simply doing its job of verifying setuid files
(and comparing the list with the previous one).

So, yes, I can confirm that running 'find' on a ZFS filesystem with a
lot of files is very, very slow (and it doesn't look like it is
related to how the files are distributed on the FS).

But sorry, no idea about how to improve that situation (yet).

Regards,
Luiz
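P.S. Regarding your dirhash question: on the UFS side the memory is
tunable via sysctl. I don't know of a direct ZFS equivalent; the
closest thing I can think of is the ARC metadata limit, but I haven't
verified that raising it helps with this workload, so treat the names
below only as pointers of where to look (they may differ between
releases):

    # UFS: current and maximum dirhash memory
    sysctl vfs.ufs.dirhash_mem vfs.ufs.dirhash_maxmem

    # ZFS: how much of the ARC may be used for metadata
    sysctl vfs.zfs.arc_meta_used vfs.zfs.arc_meta_limit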